# Imports

In [1]:
%load_ext autoreload
%autoreload 2

import cellforest as cf
from matplotlib import pyplot as plt
import pandas as pd
from pathlib import Path

# Branch

There are many ways to process a single dataset -- changing parameters, adding or removing processing steps, filtering or partitioning data. A `CellBranch` represents a single path through a chain of processes. Several branches can easily be created to compare "parallel universes" of processing.

### Load 10X Sample Data

TODO: this is just loading the 150 x 300

In [55]:
cellranger_dir = Path("tests/data/filtered_gene_bc_matrices/hg19/")

In [56]:
from tests.utils.get_test_data import get_test_data
if not cellranger_dir.exists():
    get_test_data(keep_raw=True)
    if not cellranger_dir.exists():
        raise ValueError("Notebook must be updated to conform to `get_test_data`")



In [57]:
ls {cellranger_dir}

barcodes.tsv  genes.tsv     matrix.mtx


### Get or Create Metadata

This set of metadata contains only two columns:

`entity_id`: contains the sample or lane identifier, which will be appended to the cell barcodes to prevent barcode collisions in the case of multiple lanes

`path_rna`: the directory path to the cellranger outputs containing `matrix.mtx(.gz)`, `barcodes.tsv(.gz)`, and `features.tsv(.gz)`. (Many older file formats are supported as well.)

Usually, the metadata would contain much more information which may be useful during analysis, e.g. patient information, cell processing information, etc. The only requirement is that column names don't start with `path_`.
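
To see why the `entity_id` suffix matters, here is a minimal sketch in plain Python. The `-` separator follows the cell IDs displayed elsewhere in this notebook; the helper name `suffix_barcodes` and the toy barcode lists are illustrative assumptions, not part of the `cellforest` API.

```python
from collections import Counter


def suffix_barcodes(barcodes, entity_id):
    """Append the sample/lane identifier to each barcode (hypothetical helper)."""
    return [f"{barcode}-{entity_id}" for barcode in barcodes]


# Toy data: the same barcode can be observed in two different lanes.
lane_1 = ["AAACATACAACCAC", "AAACATTGAGCTAC"]
lane_2 = ["AAACATACAACCAC", "AAACCGTGCTTCCG"]

# Without a suffix, pooling the lanes produces a duplicate cell ID.
raw_counts = Counter(lane_1 + lane_2)
collisions = [barcode for barcode, n in raw_counts.items() if n > 1]

# With the suffix, every pooled cell ID is unique.
pooled = suffix_barcodes(lane_1, "sample_1") + suffix_barcodes(lane_2, "sample_2")

print(collisions)
print(len(pooled) == len(set(pooled)))
```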

In [58]:
df = pd.read_csv("tests/data/sample_metadata.tsv", sep="\t")
df.head()

Unnamed: 0,entity_id,path_rna
0,sample_1,/Users/austinmckay/code/cellforest/tests/data/...
1,sample_2,/Users/austinmckay/code/cellforest/tests/data/...


TODO: more samples

In [60]:
df = pd.DataFrame({"entity_id": ["sample_1"], "path_rna": ["/Users/austinmckay/code/cellforest/tests/data/filtered_gene_bc_matrices/hg19/"]})

### Create a Branch

The branch requires a `root`, or a directory where all datasets are combined into a single matrix. From here, plotting, QC, data subsetting, and further processing can be done.

Specify a directory for the root. Any directory works; it will be created if it doesn't yet exist.

In [61]:
example_dir = "tests/data/example_usage"

In [62]:
root_path = f"{example_dir}/root"

Use the `from_sample_metadata` function to combine the input datasets specified in the metadata dataframe (`df` here) into the root directory (`root_path`) and instantiate a branch (`CellBranch`) object from it.

This is only required the first time, however, as the combined data has been saved to `root_path`. Once the root has been created, `cf.load` can be used in subsequent sessions.

In [64]:
br = cf.from_sample_metadata(root_path, df)
# br = cf.load(root_path)

In [65]:
br

<cellforest.templates.CellBranch.CellBranch at 0x13461c490>

Note: If you find yourself grievously impoverished of metadata, you can replace `from_sample_metadata` with `from_input_dirs`, passing a list of paths to 10X outs dirs rather than a dataframe as the second arg.

### Branch Files

In [13]:
ls {root_path}

_logs/      meta.tsv    rna.pickle  rna.rds


- `_logs`: contains logs
- `meta.tsv`: supplied metadata (`df` in this notebook), but expanded to contain a row for each cell
- `rna.pickle`: python version of sparse matrix for use by `cellforest`
- `rna.rds`: Seurat version of sparse matrix and metadata for use by `cellforestR`
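
As a rough illustration of what "expanded to contain a row for each cell" means, the sketch below broadcasts sample-level rows onto cell IDs using only the standard library. The function name `expand_metadata`, the `donor` column, and the placeholder paths are assumptions made for illustration; the actual `meta.tsv` is written by `cellforest` itself.

```python
def expand_metadata(sample_rows, barcodes_by_entity):
    """Return one metadata row per cell, keyed by '<barcode>-<entity_id>'.

    Hypothetical helper: `path_rna` is dropped because it describes where
    the input lives rather than anything about an individual cell.
    """
    cell_meta = {}
    for row in sample_rows:
        entity_id = row["entity_id"]
        kept = {key: value for key, value in row.items() if key != "path_rna"}
        for barcode in barcodes_by_entity[entity_id]:
            cell_meta[f"{barcode}-{entity_id}"] = dict(kept)
    return cell_meta


# Toy sample-level metadata (paths and donor labels are placeholders).
sample_rows = [
    {"entity_id": "sample_1", "path_rna": "/placeholder/sample_1", "donor": "A"},
    {"entity_id": "sample_2", "path_rna": "/placeholder/sample_2", "donor": "B"},
]
barcodes_by_entity = {
    "sample_1": ["AAACATACAACCAC", "AAACATTGAGCTAC"],
    "sample_2": ["ATACCGGACTTCGC"],
}

cell_meta = expand_metadata(sample_rows, barcodes_by_entity)
for cell_id, row in cell_meta.items():
    print(cell_id, row)
```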

### Metadata

The metadata is accessible via the `CellBranch.meta` attribute. Any columns in the dataframe passed to `from_sample_metadata` less `path_rna` will appear here as well as any data generated during downstream processing.

In [15]:
br.meta

Unnamed: 0,entity_id
AAACATACAACCAC-sample_1,sample_1
AAACATTGAGCTAC-sample_1,sample_1
AAACATTGATCAGC-sample_1,sample_1
AAACCGTGCTTCCG-sample_1,sample_1
AAACCGTGTATGCG-sample_1,sample_1
...,...
ATACCGGACTTCGC-sample_2,sample_2
ATACCGGAGGTGTT-sample_2,sample_2
ATACCGGATCTCGC-sample_2,sample_2
ATACCTACGCATCA-sample_2,sample_2


TODO: metadata modification

### Matrix

The counts matrix is a `Counts` object (cells x genes) that wraps a scipy sparse matrix.

In [16]:
br.rna

<class 'cellforest.structures.counts.Counts.Counts'>: [cell_ids x genes] matrix
<600x150 sparse matrix of type '<class 'numpy.float64'>'
	with 2104 stored elements in Compressed Sparse Row format>

It has most of the standard scipy sparse matrix functionality

In [33]:
br.rna.sum(axis=0)[:, :10]

matrix([[0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]])

In [34]:
br.rna.shape

(600, 150)

The matrix can be easily sliced via cell barcodes, gene names, ensembl IDs (if present), or numerical indices.

In [17]:
br.rna["AAACATACAACCAC-sample_1"]

<class 'cellforest.structures.counts.Counts.Counts'>: [cell_ids x genes] matrix
<1x150 sparse matrix of type '<class 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Row format>

In [21]:
br.rna[:10, ["CCDC27", "SMIM1", "CEP104"]]

<class 'cellforest.structures.counts.Counts.Counts'>: [cell_ids x genes] matrix
<10x3 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>
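
One way to think about label-based slicing is as a lookup from labels to integer positions, which are then used to index the underlying matrix. The sketch below shows that idea in plain Python. `resolve_index` is a hypothetical helper written for this explanation and is not necessarily how `Counts` is implemented.

```python
def resolve_index(key, labels):
    """Translate a label, integer, slice, or list of those into positions."""
    if isinstance(key, slice):
        return list(range(*key.indices(len(labels))))
    if isinstance(key, (list, tuple)):
        # Resolve each element on its own and flatten the results.
        return [pos for item in key for pos in resolve_index(item, labels)]
    if isinstance(key, int):
        return [key]
    # Anything else is treated as a label; unknown labels raise ValueError.
    return [labels.index(key)]


genes = ["MIR1302-10", "FAM138A", "OR4F5", "RP11-34P13.7", "RP11-34P13.8"]

by_name = resolve_index("OR4F5", genes)
by_list = resolve_index(["FAM138A", "RP11-34P13.8"], genes)
by_slice = resolve_index(slice(None, 3), genes)

print(by_name, by_list, by_slice)
```

A real implementation would more likely precompute a label-to-position dictionary once rather than calling `list.index` on every lookup.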

The row and column indices can be inspected via `cell_ids` and `genes`

In [25]:
br.rna.cell_ids.head()

0    AAACATACAACCAC-sample_1
1    AAACATTGAGCTAC-sample_1
2    AAACATTGATCAGC-sample_1
3    AAACCGTGCTTCCG-sample_1
4    AAACCGTGTATGCG-sample_1
Name: 0, dtype: object

In [26]:
br.rna.genes.head()

0      MIR1302-10
1         FAM138A
2           OR4F5
3    RP11-34P13.7
4    RP11-34P13.8
Name: genes, dtype: object

Additional plotting functions can be found in `advanced_usage.ipynb`

# Processing

### Spec

A set of process steps is defined by a `branch_spec`: a list of dictionaries, where each dictionary specifies one process and the order of the list determines the order of execution. The keys in each dictionary tell the process how to execute.

Rather than attempting to remember the structure of the `branch_spec`, one can start with the provided example and modify it as needed.

In [36]:
spec = cf.defaults.spec_markers
spec

[{'_PROCESS_': 'normalize',
  '_PARAMS_': {'min_genes': 0,
   'max_genes': 37000,
   'min_cells': 3,
   'nfeatures': 3000,
   'perc_mito_cutoff': 100,
   'method': 'seurat_default'}},
 {'_PROCESS_': 'reduce',
  '_PARAMS_': {'pca_npcs': 30,
   'umap_n_neighbors': 30,
   'umap_min_dist': 0.3,
   'umap_n_components': 2,
   'umap_metric': 'correlation'}},
 {'_PROCESS_': 'cluster', '_PARAMS_': {'num_pcs': 3, 'res': 0.5, 'eps': 0.1}},
 {'_PROCESS_': 'markers',
  '_PARAMS_': {'logfc_thresh': 0.25, 'test': 'wilcox'}}]

The keys in this default spec are:
- `_PROCESS_`: the name of the process to be executed
- `_PARAMS_`: a dict of params to be fed to that process

Additionally, keys can be specified to subset, filter, or partition the cells based on columns in the metadata before it's passed to the process.

- `_SUBSET_`: a dict of column names and accepted column values. For example: `{"entity_id": "sample_1"}`. To accept multiple values for a given column, provide a list of accepted values, and to subset by multiple columns, use multiple keys
- `_FILTER_`: same as `_SUBSET_`, but rather than specified values being kept, they are removed. By default, if multiple keys are provided, each filter is applied separately. The `_MULTI_` key can be used to apply multiple filters jointly, which is covered in `advanced_usage.ipynb`.

Note that subsets and filters persist into every process downstream of the one in which they are specified.
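
The sketch below illustrates one reading of the `_SUBSET_` and `_FILTER_` semantics described above, applied to a list of metadata rows in plain Python. The helpers `apply_subset` and `apply_filter` and the `lane` column are hypothetical. Multiple `_SUBSET_` keys are assumed to combine with AND, and each `_FILTER_` key is assumed to remove its matching rows independently of the others.

```python
def _as_list(value):
    """Allow a single accepted value or a list of accepted values."""
    return list(value) if isinstance(value, (list, tuple, set)) else [value]


def apply_subset(rows, subset):
    """Keep rows whose value is accepted for every column in `subset`."""
    return [
        row for row in rows
        if all(row[col] in _as_list(accepted) for col, accepted in subset.items())
    ]


def apply_filter(rows, filter_):
    """Drop rows that match any single column condition in `filter_`."""
    return [
        row for row in rows
        if not any(row[col] in _as_list(rejected) for col, rejected in filter_.items())
    ]


rows = [
    {"cell": "c1", "entity_id": "sample_1", "lane": 1},
    {"cell": "c2", "entity_id": "sample_1", "lane": 2},
    {"cell": "c3", "entity_id": "sample_2", "lane": 1},
    {"cell": "c4", "entity_id": "sample_3", "lane": 2},
]

kept = apply_subset(rows, {"entity_id": ["sample_1", "sample_2"], "lane": 1})
remaining = apply_filter(rows, {"entity_id": "sample_3", "lane": 1})

print([row["cell"] for row in kept])
print([row["cell"] for row in remaining])
```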

For `normalize`, the `_PARAMS_` show that it doesn't do filtering by minimum expressed genes or percent mitochondrial UMIs, which is a good choice for a first pass analysis to do QC and get a lay of the land. Let's pretend that we've already done this, and want to set a mitochondrial filtering threshold of 25% for our branch. 

In [37]:
spec[0]["_PARAMS_"]["perc_mito_cutoff"] = 25

Our current branch has no `branch_spec`, only the root, so it isn't actually ready for processing. To make a processing-ready branch, we must supply this information.

In [44]:
br = cf.load(root_path, branch_spec=spec)

In [45]:
br.spec

[{'_PROCESS_': 'normalize',
  '_PARAMS_': {'min_genes': 0,
   'max_genes': 37000,
   'min_cells': 3,
   'nfeatures': 3000,
   'perc_mito_cutoff': 25,
   'method': 'seurat_default'}},
 {'_PROCESS_': 'reduce',
  '_PARAMS_': {'pca_npcs': 30,
   'umap_n_neighbors': 30,
   'umap_min_dist': 0.3,
   'umap_n_components': 2,
   'umap_metric': 'correlation'}},
 {'_PROCESS_': 'cluster', '_PARAMS_': {'num_pcs': 3, 'res': 0.5, 'eps': 0.1}},
 {'_PROCESS_': 'markers',
  '_PARAMS_': {'logfc_thresh': 0.25, 'test': 'wilcox'}}]
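
Editing a spec by list position is easy to get wrong, and assigning into a shared default object may also change it for the rest of the session. A minimal sketch of a safer pattern, using only the standard library: copy the spec, then look the process up by its `_PROCESS_` name. The helper `set_param` is hypothetical and not part of `cellforest`, and the abbreviated `default_spec` stands in for `cf.defaults.spec_markers`.

```python
from copy import deepcopy


def set_param(spec, process, key, value):
    """Return a copy of `spec` with one parameter changed for `process`."""
    new_spec = deepcopy(spec)
    for step in new_spec:
        if step["_PROCESS_"] == process:
            step["_PARAMS_"][key] = value
            return new_spec
    raise KeyError(f"process {process!r} not found in spec")


# Abbreviated stand-in for the default spec shown in this notebook.
default_spec = [
    {"_PROCESS_": "normalize", "_PARAMS_": {"min_genes": 0, "perc_mito_cutoff": 100}},
    {"_PROCESS_": "reduce", "_PARAMS_": {"pca_npcs": 30}},
]

spec = set_param(default_spec, "normalize", "perc_mito_cutoff", 25)

print(spec[0]["_PARAMS_"]["perc_mito_cutoff"])
print(default_spec[0]["_PARAMS_"]["perc_mito_cutoff"])
```

Because the lookup is by name, a typo in the process name raises a `KeyError` instead of silently adding the parameter to the wrong step.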

### Process Execution

Now we can access the processes via the `CellBranch.process` attribute

In [46]:
br.process.normalize()



Note: `stop_on_error=True` and `stop_on_hook_error=True` may be helpful arguments for debugging.

### Branch Nodes

We are now at the second node on our `CellBranch`, the first being `root`.

In [48]:
br.current_process

'normalize'

We can use the bracket syntax to access an object which contains information about a given node in the branch.

In [49]:
br["root"].path

PosixPath('tests/data/example_usage/root')

In [50]:
br["normalize"].path

PosixPath('tests/data/example_usage/root/normalize/eZcx25Xs')

Note that the directories of the two nodes differ by `normalize/<hash ID>`. While the hash may look to be an attempt to protect cellforest internals from miscreant meddlings
