In [4]:
pip install -e ../dataforest

Obtaining file:///Users/austinmckay/code/dataforest
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Processing /Users/austinmckay/Library/Caches/pip/wheels/13/90/db/290ab3a34f2ef0b5a0f89235dc2d40fea83e77de84ed2dc05c/PyYAML-5.3.1-cp38-cp38-macosx_10_15_x86_64.whl
Installing collected packages: pyyaml, dataforest
  Attempting uninstall: dataforest
    Found existing installation: dataforest 0.0.1
    Uninstalling dataforest-0.0.1:
      Successfully uninstalled dataforest-0.0.1
  Running setup.py develop for dataforest
Successfully installed dataforest pyyaml-5.3.1
Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install -e .

Obtaining file:///Users/austinmckay/code/cellforest
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Installing collected packages: cellforest
  Attempting uninstall: cellforest
    Found existing installation: cellforest 0.0.1
    Uninstalling cellforest-0.0.1:
      Successfully uninstalled cellforest-0.0.1
  Running setup.py develop for cellforest
Successfully installed cellforest
Note: you may need to restart the kernel to use updated packages.


In [19]:
# install tree (filesystem viewer used later)
!apt-get install tree

/bin/sh: apt-get: command not found


In [9]:
%load_ext autoreload
%autoreload 2

import cellforest
from cellforest import Counts
import pandas as pd
from pathlib import Path

# Load Sample Data

In [2]:
cellranger_dir = Path("tests/data/v3_gz/sample_1")

In [3]:
from tests.utils.get_test_data import get_test_data
if not cellranger_dir.exists():
    get_test_data()
    if not cellranger_dir.exists():
        raise ValueError("Notebook must be updated to conform to `get_test_data`")

In [7]:
ls {cellranger_dir}

barcodes.tsv.gz  features.tsv.gz  matrix.mtx.gz


# Quick Start

## Specify root of working directory tree

Any directory works; it doesn't have to exist yet.

In [8]:
example_dir = "tests/data/example_usage"

In [9]:
root_dir = f"{example_dir}/root"

# Counts Matrix

The counts matrix is a cells x genes matrix, built as a wrapper around `scipy.sparse.csr_matrix`. 

This data structure is central to the functionality of cellforest, so it's important to understand how it works. When using cellforest, you normally won't load/save it directly, but rather let cellforest handle that. However, if you don't need to automate workflows, and just want to do some counts matrix analysis outside of cellforest, you may want to instantiate it directly.
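To make the storage model concrete, here is a minimal pure-Python sketch of the compressed sparse row (CSR) layout that `scipy.sparse.csr_matrix` is built on. The helper name `dense_to_csr` and the toy values are illustrative assumptions, not part of cellforest.

```python
def dense_to_csr(rows):
    """Convert a dense list-of-lists into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)   # nonzero values, stored row by row
                indices.append(col)  # column index of each stored value
        indptr.append(len(data))     # offset in `data` where the next row starts
    return data, indices, indptr


# Toy counts matrix: 3 cells x 4 genes (hypothetical values)
dense = [
    [0, 2, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 3],
]
data, indices, indptr = dense_to_csr(dense)

# Number of stored (nonzero) elements, as reported in a sparse matrix repr
n_stored = len(data)

# Recover the nonzero entries of the third cell as {gene_position: count}
row_2 = dict(zip(indices[indptr[2]:indptr[3]], data[indptr[2]:indptr[3]]))
```

Because only nonzero entries are stored, row slices are cheap, which suits a cells x genes matrix where most counts are zero.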

### Load from cellranger

`Counts` objects can be instantiated directly from cellranger outputs

In [9]:
rna = Counts.from_cellranger(cellranger_dir)

In [10]:
rna

<class 'cellforest.structures.Counts.Counts'>: [cell_ids x genes] matrix
<200x100 sparse matrix of type '<class 'numpy.float64'>'
	with 540 stored elements in Compressed Sparse Row format>
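For orientation, the `matrix.mtx.gz` file in a cellranger output is gzip-compressed Matrix Market coordinate text, laid out as features x barcodes. The sketch below parses such text with the standard library and transposes it to a cells x genes orientation. The function names and the toy file are assumptions for illustration; this is not how `Counts.from_cellranger` is necessarily implemented.

```python
import gzip
import io


def read_mtx_gz(raw_bytes):
    """Parse gzip-compressed Matrix Market coordinate text."""
    with gzip.open(io.BytesIO(raw_bytes), "rt") as handle:
        # Lines starting with "%" are the header and comments
        lines = [ln.strip() for ln in handle if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, n_entries = map(int, lines[0].split())
    entries = {}
    for line in lines[1:]:
        row, col, value = line.split()
        # Matrix Market indices are 1-based; shift to 0-based
        entries[(int(row) - 1, int(col) - 1)] = float(value)
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return (n_rows, n_cols), entries


def transpose(shape, entries):
    """Swap axes, e.g. features x barcodes -> cells x genes."""
    return (shape[1], shape[0]), {(c, r): v for (r, c), v in entries.items()}


# Hypothetical toy file: 3 features x 2 barcodes, 3 nonzero counts
toy_text = """%%MatrixMarket matrix coordinate integer general
3 2 3
1 1 4
3 1 1
2 2 7
"""
shape, entries = read_mtx_gz(gzip.compress(toy_text.encode()))
cells_by_genes_shape, cells_by_genes = transpose(shape, entries)
```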

### Data attributes

The sparse matrix is stored in the `_matrix` attribute, which you generally shouldn't interact with directly. The `Counts` object inherits most of the relevant methods of `csr_matrix`, so you can still perform the usual calculations.

In [11]:
rna.shape

(200, 100)

In [12]:
rna.sum()

828.0

The genes and ensembl IDs are stored in `features`, and can also be accessed via `genes` and `ensgs`, respectively. These function as the column index for the matrix. Genes can also be accessed via `columns` (like pandas). Note that the ensembl IDs will be stripped in any data downstream of a Seurat process, since Seurat lacks ensembl support.

In [13]:
rna.features.head()

Unnamed: 0,ensgs,genes
0,ENSG00000243485,MIR1302-10
1,ENSG00000237613,FAM138A
2,ENSG00000186092,OR4F5
3,ENSG00000238009,RP11-34P13.7
4,ENSG00000239945,RP11-34P13.8


The 10X cell barcodes are stored in `cell_ids`, which can also be accessed via `index` (like pandas)

In [14]:
rna.cell_ids.head()

0    AAACATACAACCAC-1
1    AAACATTGAGCTAC-1
2    AAACATTGATCAGC-1
3    AAACCGTGCTTCCG-1
4    AAACCGTGTATGCG-1
Name: 0, dtype: object

### Slicing

The matrix can be sliced with integer indices, cell_ids, gene names, or ensembl IDs. The latter three can be presented in the form of strings, lists of strings, or `pandas.Series`. The matrix, features, and cell_ids will all be sliced correspondingly.

In [15]:
rna[:10, :20]

<class 'cellforest.structures.Counts.Counts'>: [cell_ids x genes] matrix
<10x20 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>

In [16]:
rna["AAACATACAACCAC-1"]

<class 'cellforest.structures.Counts.Counts'>: [cell_ids x genes] matrix
<1x100 sparse matrix of type '<class 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Row format>

In [17]:
rna[["AAACATACAACCAC-1", "AAACATTGATCAGC-1"]]

<class 'cellforest.structures.Counts.Counts'>: [cell_ids x genes] matrix
<2x100 sparse matrix of type '<class 'numpy.float64'>'
	with 4 stored elements in Compressed Sparse Row format>

In [18]:
rna[["AAACATACAACCAC-1", "AAACATTGATCAGC-1"], 1:10]

<class 'cellforest.structures.Counts.Counts'>: [cell_ids x genes] matrix
<2x9 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>

In [19]:
rna[:, ["MIR1302-10", "RP11-34P13.7"]]

<class 'cellforest.structures.Counts.Counts'>: [cell_ids x genes] matrix
<200x2 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>

In [20]:
rna[:, "ENSG00000238009"]

<class 'cellforest.structures.Counts.Counts'>: [cell_ids x genes] matrix
<200x1 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>

In [21]:
rna[rna.cell_ids[:20]]

<class 'cellforest.structures.Counts.Counts'>: [cell_ids x genes] matrix
<20x100 sparse matrix of type '<class 'numpy.float64'>'
	with 55 stored elements in Compressed Sparse Row format>
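The label-based slicing shown above boils down to translating each key into integer positions before indexing the sparse matrix. A minimal sketch of that translation follows; `resolve_key` is a hypothetical helper, not cellforest's actual code, and the labels are copied from the outputs earlier in this notebook.

```python
def resolve_key(key, *label_sets):
    """Translate a slice, int, label, or iterable of labels into positions.

    Several label sets (e.g. gene names and ensembl IDs) may index the
    same axis, so all of them are searched.
    """
    lookup = {}
    for labels in label_sets:
        lookup.update({label: i for i, label in enumerate(labels)})
    size = len(label_sets[0])
    if isinstance(key, slice):
        return list(range(*key.indices(size)))
    if isinstance(key, int):
        return [key]
    if isinstance(key, str):
        return [lookup[key]]
    return [lookup[label] for label in key]  # list-like of labels


cell_ids = ["AAACATACAACCAC-1", "AAACATTGAGCTAC-1", "AAACATTGATCAGC-1"]
genes = ["MIR1302-10", "FAM138A", "OR4F5", "RP11-34P13.7"]
ensgs = ["ENSG00000243485", "ENSG00000237613", "ENSG00000186092", "ENSG00000238009"]

rows = resolve_key(["AAACATACAACCAC-1", "AAACATTGATCAGC-1"], cell_ids)
cols_by_name = resolve_key(["MIR1302-10", "RP11-34P13.7"], genes, ensgs)
cols_by_ensg = resolve_key("ENSG00000238009", genes, ensgs)
cols_by_slice = resolve_key(slice(1, 3), genes, ensgs)
```

Once both axes are resolved to positions, the same positions can be applied to the matrix, the features table, and the cell IDs so that all three stay aligned.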

### Plotting

In [94]:
# TODO: histogram not implemented

In [None]:
# TODO: scatter not implemented

### Concatenation

A list of `Counts` objects can be concatenated with `concatenate`, or one or more `Counts` objects can be appended to an existing one with `append`. Concatenation can occur along either the cells (`axis=0`) or genes (`axis=1`) dimension, whereas `append` assumes the cells dimension. `vstack` and `hstack` can be used as alternatives to concatenation along the cells and genes axes, respectively (like numpy).

In [87]:
rna_2 = Counts.concatenate([rna[:20], rna[30:50]])
rna_2

<class 'cellforest.structures.Counts.Counts'>: [cell_ids x genes] matrix
<40x100 sparse matrix of type '<class 'numpy.float64'>'
	with 106 stored elements in Compressed Sparse Row format>

In [91]:
Counts.concatenate([rna[:, :20], rna[:, 30:50]], axis=1)

<class 'cellforest.structures.Counts.Counts'>: [cell_ids x genes] matrix
<200x40 sparse matrix of type '<class 'numpy.float64'>'
	with 134 stored elements in Compressed Sparse Row format>

In [88]:
rna_2.append(rna[60:100])

<class 'cellforest.structures.Counts.Counts'>: [cell_ids x genes] matrix
<80x100 sparse matrix of type '<class 'numpy.float64'>'
	with 209 stored elements in Compressed Sparse Row format>

`append` is not an `inplace` operation

In [90]:
rna_2

<class 'cellforest.structures.Counts.Counts'>: [cell_ids x genes] matrix
<40x100 sparse matrix of type '<class 'numpy.float64'>'
	with 106 stored elements in Compressed Sparse Row format>
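The axis semantics and the non-in-place behaviour of `append` can be sketched with a small immutable stand-in. `LabeledMatrix` below is a hypothetical toy class using dense tuples, written only to illustrate the idea; it is not the `Counts` implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledMatrix:
    rows: tuple      # one tuple of counts per cell
    cell_ids: tuple
    genes: tuple

    @staticmethod
    def concatenate(mats, axis=0):
        first = mats[0]
        if axis == 0:
            # Stack along cells: every matrix must share the same genes
            if any(m.genes != first.genes for m in mats):
                raise ValueError("gene labels must match for axis=0")
            rows = sum((m.rows for m in mats), ())
            cell_ids = sum((m.cell_ids for m in mats), ())
            return LabeledMatrix(rows, cell_ids, first.genes)
        # Stack along genes: every matrix must share the same cells
        if any(m.cell_ids != first.cell_ids for m in mats):
            raise ValueError("cell labels must match for axis=1")
        rows = tuple(sum(parts, ()) for parts in zip(*(m.rows for m in mats)))
        genes = sum((m.genes for m in mats), ())
        return LabeledMatrix(rows, first.cell_ids, genes)

    def append(self, other):
        # Returns a new object; `self` is left untouched
        return LabeledMatrix.concatenate([self, other], axis=0)


a = LabeledMatrix(((1, 0), (0, 2)), ("c1", "c2"), ("g1", "g2"))
b = LabeledMatrix(((3, 3),), ("c3",), ("g1", "g2"))
c = LabeledMatrix(((5,), (6,)), ("c1", "c2"), ("g3",))

taller = a.append(b)                                # 3 cells x 2 genes
wider = LabeledMatrix.concatenate([a, c], axis=1)   # 2 cells x 3 genes
```

To keep an appended result, assign it to a name, as with `taller` here.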

### Drop

We can drop specified cells or genes. This is not an `inplace` operation.

In [92]:
# TODO: .drop not implemented

We can also drop cells or genes with no UMIs

In [97]:
rna.dropna()

<class 'cellforest.structures.Counts.Counts'>: [cell_ids x genes] matrix
<185x40 sparse matrix of type '<class 'numpy.float64'>'
	with 540 stored elements in Compressed Sparse Row format>

In [98]:
rna.dropna(axis=1)

<class 'cellforest.structures.Counts.Counts'>: [cell_ids x genes] matrix
<200x40 sparse matrix of type '<class 'numpy.float64'>'
	with 540 stored elements in Compressed Sparse Row format>
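Dropping cells or genes with no UMIs amounts to removing all-zero rows or columns, which leaves the number of stored elements unchanged. A minimal dense sketch of the idea follows; the helper names are hypothetical and make no claim about how `dropna` treats its `axis` argument.

```python
def drop_empty_rows(rows):
    """Keep only rows (cells) that contain at least one nonzero count."""
    return [row for row in rows if any(row)]


def drop_empty_cols(rows):
    """Keep only columns (genes) that contain at least one nonzero count."""
    keep = [j for j in range(len(rows[0])) if any(row[j] for row in rows)]
    return [[row[j] for j in keep] for row in rows]


def count_nonzero(rows):
    return sum(1 for row in rows for value in row if value != 0)


# Toy matrix: the second cell and the third gene have no counts
dense = [
    [0, 2, 0],
    [0, 0, 0],
    [1, 0, 0],
]
no_empty_cells = drop_empty_rows(dense)
no_empty_genes = drop_empty_cols(dense)
```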

### I/O

**DataFrame (converts to dense)**

In [100]:
rna.to_df().head()

genes,MIR1302-10,FAM138A,OR4F5,RP11-34P13.7,RP11-34P13.8,AL627309.1,RP11-34P13.14,RP11-34P13.9,AP006222.2,RP4-669L17.10,...,RP11-345P4.7,CDK11A,SLC35E2,NADK,GNB1,RP1-140A9.1,CALML6,TMEM52,C1orf222,RP11-547D24.1
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
AAACATACAACCAC-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AAACATTGAGCTAC-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AAACATTGATCAGC-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AAACCGTGCTTCCG-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
AAACCGTGTATGCG-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Save**

In [101]:
example_counts_path = "tests/data/example/counts/rna.pickle"
rna.save(example_counts_path)

In [102]:
ls {example_counts_path}

tests/data/example/counts/rna.pickle


**Load**

In [106]:
Counts.load(example_counts_path)

<class 'cellforest.structures.Counts.Counts'>: [cell_ids x genes] matrix
<200x100 sparse matrix of type '<class 'numpy.float64'>'
	with 540 stored elements in Compressed Sparse Row format>

**Others**

There are also `to_cellranger`, `to_rds`, and `from_rds` I/O methods. Note that they may leave behind intermediate artifacts (e.g. pickle files).
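For readers unfamiliar with pickle-based persistence, the save/load round trip can be sketched with the standard library alone. The `save` and `load` helpers and the toy payload are assumptions for illustration; creating missing parent directories is a convenience of this sketch, not a documented behaviour of `Counts.save`.

```python
import pickle
import tempfile
from pathlib import Path


def save(obj, path):
    """Pickle `obj` to `path`, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(obj, handle)


def load(path):
    """Load a pickled object back from `path`."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Hypothetical stand-in for a counts object
toy_counts = {
    "cell_ids": ["AAACATACAACCAC-1", "AAACATTGAGCTAC-1"],
    "genes": ["MIR1302-10", "FAM138A"],
    "entries": {(0, 1): 2.0},
}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "counts" / "rna.pickle"
    save(toy_counts, target)
    file_existed = target.exists()
    restored = load(target)
```

Pickle files can execute arbitrary code when loaded, so only load files from sources you trust.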

# Cellforest interface

## Loading Data

We can load data from cellranger outputs. If there are multiple samples and metadata is available, option 3 should be used. The data is loaded and combined in a `Counts` matrix as an attribute of our `CellForest` object. Python (.pickle) and Seurat (.rds) versions are saved in our `root_dir`. A `meta.tsv` file will also be created, which will include `cell_id`s (barcodes) as an index, and any additional sample metadata for each cell.

The `root_dir` will serve as the base for all of our downstream analysis. Once this directory has been populated, you can use option 4 to load from the .pickle file rather than re-processing the cellranger outputs.
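Conceptually, the per-cell `meta.tsv` repeats each sample's metadata once for every barcode in that sample. The sketch below illustrates this with the standard library; `build_cell_meta`, `to_tsv`, and the `treatment` column are hypothetical, and the real file layout may differ.

```python
import csv
import io


def build_cell_meta(sample_rows, barcodes_by_sample):
    """Repeat each sample's metadata once per barcode in that sample."""
    cell_rows = []
    for sample in sample_rows:
        for barcode in barcodes_by_sample[sample["sample"]]:
            cell_rows.append({"cell_id": barcode, **sample})
    return cell_rows


def to_tsv(rows):
    """Serialize a list of dicts as tab-separated text with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Sample-level metadata; the `treatment` column is a made-up example
sample_rows = [
    {"sample": "sample_1", "treatment": "control"},
    {"sample": "sample_2", "treatment": "drug"},
]
barcodes_by_sample = {
    "sample_1": ["AAACATACAACCAC-1", "AAACATTGAGCTAC-1"],
    "sample_2": ["ACTTTGTGGAAAGT-1"],
}

cell_rows = build_cell_meta(sample_rows, barcodes_by_sample)
tsv_text = to_tsv(cell_rows)
```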

### Option 1: from single cellranger output

In [107]:
cf = cellforest.from_input_dirs(root_dir, cellranger_dir)
cf



<cellforest.templates.CellForest.CellForest at 0x11fd60850>

In [108]:
ls {root_dir}

meta.tsv    rna.pickle  rna.rds


In [28]:
pd.read_csv(f"{root_dir}/meta.tsv", sep="\t", index_col=0).head()

Unnamed: 0_level_0,sample
0,Unnamed: 1_level_1
AAACATACAACCAC-1,sample_1
AAACATTGAGCTAC-1,sample_1
AAACATTGATCAGC-1,sample_1
AAACCGTGCTTCCG-1,sample_1
AAACCGTGTATGCG-1,sample_1


### Option 2: from multiple cellranger outputs

In [29]:
cellranger_dir_2 = "tests/data/v3_gz/sample_2"
cf = cellforest.from_input_dirs(root_dir, [cellranger_dir, cellranger_dir_2])



### Option 3 (PREFERRED): From metadata

This option is preferred because it lets you include your sample metadata in downstream analysis.

In [11]:
# load example metadata
meta = pd.read_csv("tests/data/sample_metadata.tsv", sep="\t")
meta.head()

Unnamed: 0,sample,path_rna
0,sample_1,/Users/austinmckay/code/cellforest/tests/data/...
1,sample_2,/Users/austinmckay/code/cellforest/tests/data/...


In [11]:
cf = cellforest.from_metadata(root_dir, meta)



### Option 4 (for every subsequent load): From existing root

In [12]:
cf = cellforest.load(root_dir)



# Cellforest Interface

## Metadata

In [14]:
cf.meta

Unnamed: 0_level_0,sample
0,Unnamed: 1_level_1
AAACATACAACCAC-1,sample_1
AAACATTGAGCTAC-1,sample_1
AAACATTGATCAGC-1,sample_1
AAACCGTGCTTCCG-1,sample_1
AAACCGTGTATGCG-1,sample_1
...,...
ACTTTGTGGAAAGT-1,sample_2
ACTTTGTGGATAGA-1,sample_2
AGAAACGAAAGTAG-1,sample_2
AGAAAGTGCGCAAT-1,sample_2


## Counts

In [16]:
cf.rna

<class 'cellforest.structures.Counts.Counts'>: [cell_ids x genes] matrix
<400x100 sparse matrix of type '<class 'numpy.float64'>'
	with 1032 stored elements in Compressed Sparse Row format>

## Workflow automation

The purpose of cellforest isn't just to interact with metadata and counts matrices; we also want to automate workflows and interact with their outputs. We do this with a specification, which we provide as a dictionary and which is converted to a `Spec` object internally. Each key represents a process name, and each value holds the input parameters for that process.

In [109]:
# TODO: start with linear process spec

In [6]:
spec = {
    "normalize": {
        "min_genes": 5,
        "max_genes": 5000,
        "min_cells": 5,
        "nfeatures": 30,
        "perc_mito_cutoff": 20,
        "method": "seurat_default",
    },
    "reduce": {
        # TODO: not set up yet
    },
    "cluster": {
        # TODO: not set up yet
    }
}
process_order = ["normalize", "reduce", "cluster"]
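The relationship between a spec dictionary and a process order can be sketched as a simple lookup: each process name selects its own parameter dictionary, and processes run in the listed order. Everything below (`params_for`, `run_in_order`, and the `registry` of stand-in functions) is hypothetical and only mirrors the idea; it is not the `Spec` implementation.

```python
def params_for(name, spec, process_order):
    """Return the parameter dict for one process, validating the spec first."""
    unknown = set(spec) - set(process_order)
    if unknown:
        raise KeyError(f"spec has processes missing from process_order: {sorted(unknown)}")
    return dict(spec.get(name, {}))


def run_in_order(spec, process_order, registry):
    """Call each registered process with its parameters, in the given order."""
    return [
        registry[name](**params_for(name, spec, process_order))
        for name in process_order
    ]


# Hypothetical stand-ins for real processes; each reports what it received
registry = {
    "normalize": lambda **kwargs: ("normalize", sorted(kwargs)),
    "reduce": lambda **kwargs: ("reduce", sorted(kwargs)),
}
toy_spec = {
    "normalize": {"min_genes": 5, "method": "seurat_default"},
    "reduce": {},
}
results = run_in_order(toy_spec, ["normalize", "reduce"], registry)
```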

In [12]:
cf = cellforest.from_metadata(root_dir, meta, spec=spec, process_order=process_order)

In [14]:
ls {root_dir}

meta.tsv    rna.pickle  rna.rds


We can now execute processes from the spec

In [15]:
cf.process.normalize()

In [16]:
ls {root_dir}

meta.tsv    normalize/  rna.pickle  rna.rds


In [20]:
!tree {root_dir}

tests/data/example_usage/root
├── meta.tsv
├── normalize
│   └── max_genes:5000-method:seurat_default-min_cells:5-min_genes:5-nfeatures:30-perc_mito_cutoff:20
│       ├── meta.tsv
│       ├── normalize.err
│       ├── normalize.out
│       ├── rna.pickle
│       └── rna.rds
├── rna.pickle
└── rna.rds

2 directories, 8 files


The `normalize` directory contains a single process run directory, which is named with all parameters used to run it. This syntax is called `ForestQuery`, and can be used to represent any dictionary.

In [21]:
from dataforest.filesystem.core.DataTree import DataTree

Note: you won't ever need to interact directly with `DataTree`; it is instantiated here only for illustrative purposes.

In [32]:
dt = DataTree({"a": {1, 2}, "b": {"c": 3}, "d": 4})

The keys are organized alphabetically. `:` and `-` represent inward and outward traversal, respectively, and `+` is a delimiter for elements on the same level (uses set rather than list).

In [34]:
str(dt)

'a:1+2-b:c:3--d:4'
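The traversal rules can be made concrete with a small serializer. This is a reimplementation sketch inferred from the two strings shown in this notebook (the run directory name and the `DataTree` example), not dataforest's own code; `forest_query` and `_serialize` are hypothetical names.

```python
def _serialize(node):
    """Return (text, depth), where depth is how far the text ends below `node`."""
    if isinstance(node, dict):
        text, depth = "", 0
        for key in sorted(node):
            sub_text, sub_depth = _serialize(node[key])
            # One "-" per level we must climb out of before the next key
            text += "-" * depth + f"{key}:{sub_text}"
            depth = sub_depth + 1
        return text, depth
    if isinstance(node, (set, frozenset, list, tuple)):
        # Elements on the same level are joined with "+"
        return "+".join(sorted(str(item) for item in node)), 0
    return str(node), 0


def forest_query(tree):
    """Serialize a nested dict into a ForestQuery-style string."""
    return _serialize(tree)[0]


nested = forest_query({"a": {1, 2}, "b": {"c": 3}, "d": 4})

normalize_params = {
    "min_genes": 5,
    "max_genes": 5000,
    "min_cells": 5,
    "nfeatures": 30,
    "perc_mito_cutoff": 20,
    "method": "seurat_default",
}
run_dir_name = forest_query(normalize_params)
```

For a flat parameter dictionary, the result is the sorted `key:value` pairs joined by single dashes, which matches the run directory name under `normalize`.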

The default processes are defined in `cellforest.processes`

In [13]:
from cellforest.processes import processes
process_path = str(Path(processes.__file__).parent)

In [14]:
!tree {process_path}

/Users/austinmckay/code/cellforest/cellforest/processes/processes
├── __init__.py
├── __pycache__
│   └── __init__.cpython-38.pyc
├── cluster
│   ├── __init__.py
│   ├── __pycache__
│   │   └── process.cpython-38.pyc
│   └── process.py
├── expression
│   ├── __init__.py
│   ├── __pycache__
│   │   └── process.cpython-38.pyc
│   └── process.py
├── gsea
│   ├── __init__.py
│   ├── __pycache__
│   │   └── process.cpython-38.pyc
│   └── process.py
├── normalize
│   ├── __init__.py
│   ├── __pycache__
│   │   └── process.cpython-38.pyc
│   ├── process.py
│   └── seurat_default_normalize.R
└── reduce
    ├── __init__.py
    ├── __pycache__
    │   └── process.cpython-38.pyc
    └── process.py

11 directories, 18 files


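The normalize process both filters out cells by mitochondrial fraction and normalizes the counts matrix using either Seurat default normalization or sctransform, as specified in the parameters. It outputs `meta.tsv`, `rna.pickle`, and `rna.rds` into the process run directory, alongside `normalize.out` and `normalize.err`.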

### Data Specification

In addition to parameter specification, we may also want to specify the data which flows into each process. We can do that either by subsetting the data to include only the cells that match the specification, or by filtering the data to exclude the cells that match it.
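The difference between the two operations can be sketched on plain metadata rows: a subset keeps the rows that match the criteria, while a filter removes them. The helper names and the `treatment` column are hypothetical; this illustrates the semantics described above, not cellforest's spec syntax.

```python
def matches(row, criteria):
    """True if the row satisfies every column constraint in `criteria`."""
    for column, allowed in criteria.items():
        if not isinstance(allowed, (set, list, tuple)):
            allowed = [allowed]  # a single value is treated as a one-item list
        if row.get(column) not in allowed:
            return False
    return True


def subset(rows, criteria):
    """Keep only rows that match the criteria."""
    return [row for row in rows if matches(row, criteria)]


def filter_out(rows, criteria):
    """Exclude rows that match the criteria."""
    return [row for row in rows if not matches(row, criteria)]


# Toy per-cell metadata; `treatment` is a made-up column
meta_rows = [
    {"cell_id": "AAACATACAACCAC-1", "sample": "sample_1", "treatment": "control"},
    {"cell_id": "AAACATTGAGCTAC-1", "sample": "sample_1", "treatment": "control"},
    {"cell_id": "ACTTTGTGGAAAGT-1", "sample": "sample_2", "treatment": "drug"},
]

only_sample_1 = subset(meta_rows, {"sample": "sample_1"})
without_sample_1 = filter_out(meta_rows, {"sample": "sample_1"})
```

Under these definitions the two results are complementary: every row lands in exactly one of them.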

**Subset**

### Filter

### Partition

# Customizing modules