# Transcriptomic Clustering

In [1]:
import transcriptomic_clustering as tc
import scanpy as sc

## Loading Data
`transcriptomic_clustering` functions operate on annotated data (`AnnData`) objects from the anndata package, which scanpy also uses. An `AnnData` object stores the data matrix as either a dense `np.ndarray` or a sparse `scipy.sparse.csr_matrix`, along with `.obs`, `.obsm`, `.var`, and `.varm` fields that hold information about the observations (cells) and variables (genes). Data is stored as an H5 (`.h5ad`) file, allowing both in-memory and file-backed operation. For more information, see https://anndata.readthedocs.io/en/latest/

In [2]:
# Data is loaded into memory. Pass backed='r' or backed='r+' to read or modify file-backed annotation data
tasic_adata = sc.read_h5ad('./data/tasic2016counts_csr.h5ad')

## General API patterns
### Appending to AnnData objects
Many functions in the `transcriptomic_clustering` package compute features of the data. By default, they return the values, e.g.
``` 
means, variances = tc.get_gene_means_variances(adata)
```
However, some functions can instead add the results to the AnnData object:
```
tc.get_gene_means_variances(adata, annotate=True)
print(adata.var['mean_exp'])
print(adata.var['var_exp'])
```
### Modifying AnnData in Place or Making a Copy
A function that modifies AnnData modifies it in place by default.
To work on a copy instead, pass one of two optional arguments,
`inplace=False` or `copy_to`, to functions that modify `AnnData.X`.

#### In Place
Simply call the function on the AnnData object, and it will be modified in place.
```
tc.normalize(tasic_adata)
```
All functions return a reference to the modified AnnData object, copied or not.
For example, the following is equivalent to the above:
```
normalized_adata = tc.normalize(tasic_adata)
```
Since this is in place, `normalized_adata` and `tasic_adata` are the same object.
You can verify this: `normalized_adata is tasic_adata` returns True.

#### In Memory
To make a copy in memory, pass `inplace=False`.
For example:
```
normalized_adata = tc.normalize(tasic_adata, inplace=False)
```
In this case, `normalized_adata` is a copy, and `normalized_adata is tasic_adata` returns False.

#### Filebacked
To make a file-backed copy, pass `copy_to='path/to/new/file.h5ad'`.
For example:
```
normalized_adata = tc.normalize(tasic_adata, copy_to='path/to/new/file.h5ad')
```
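
The in-place versus copy behavior described above can be illustrated with a small NumPy sketch. The function `log1p_transform` is hypothetical and is not part of `transcriptomic_clustering`; it only mimics the `inplace` convention on a plain array, and the log1p transform is an arbitrary stand-in for any operation that modifies the matrix.

```python
import numpy as np


def log1p_transform(matrix, inplace=True):
    """Hypothetical stand-in for a function that modifies a data matrix.

    Applies log(1 + x) elementwise. With inplace=True the input array is
    overwritten and returned; with inplace=False a copy is modified instead.
    """
    target = matrix if inplace else matrix.copy()
    np.log1p(target, out=target)  # write the result back into `target`
    return target


counts = np.array([[0.0, 1.0], [3.0, 7.0]])

copied = log1p_transform(counts, inplace=False)
print(copied is counts)  # the copy is a different object; `counts` is untouched

same = log1p_transform(counts)
print(same is counts)  # in place: the returned reference is the input itself
```

Returning the modified object in both cases lets calling code be written the same way whether or not a copy is requested.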


## Running the Pipeline

### Normalization
(insert details here)

In [3]:
normalized_adata = tc.normalize(tasic_adata)

### Selecting Highly Variable Genes
(insert details here)

In [4]:
means, variances, genes = tc.means_vars_genes(adata=normalized_adata, chunk_size=300)
normalized_genefiltered_adata = normalized_adata[:, genes]
tc.highly_variable_genes(adata=normalized_adata, means=means, variances=variances, max_genes=3000)
normalized_adata.var['highly_variable']

0610005C13Rik    False
0610007C21Rik    False
0610007L01Rik    False
0610007N19Rik     True
0610007P08Rik    False
                 ...  
mt_FR668231      False
mt_FW313083      False
mt_GU332589      False
mt_X57779        False
mt_X57780        False
Name: highly_variable, Length: 24057, dtype: bool
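
The `chunk_size` argument above suggests that gene statistics are accumulated over blocks of cells rather than over the whole matrix at once. The following is a minimal NumPy sketch of that idea, not the library's implementation. The function name is hypothetical, and it assumes that chunks are blocks of rows (cells) and that the sample variance (`ddof=1`) is wanted.

```python
import numpy as np


def chunked_gene_means_variances(matrix, chunk_size=300):
    """Per-gene (column) mean and sample variance, reading chunk_size rows at a time.

    Assumes a dense 2-D array with more than one row; a sparse or
    file-backed matrix would need each chunk converted to dense first.
    """
    n_cells, n_genes = matrix.shape
    total = np.zeros(n_genes)
    total_sq = np.zeros(n_genes)
    for start in range(0, n_cells, chunk_size):
        chunk = np.asarray(matrix[start:start + chunk_size], dtype=np.float64)
        total += chunk.sum(axis=0)            # running sum per gene
        total_sq += (chunk ** 2).sum(axis=0)  # running sum of squares per gene
    means = total / n_cells
    # Sample variance from the accumulated sums (ddof=1 is an assumption)
    variances = (total_sq - n_cells * means ** 2) / (n_cells - 1)
    return means, variances


rng = np.random.default_rng(0)
toy_counts = rng.poisson(2.0, size=(1000, 5)).astype(float)
means, variances = chunked_gene_means_variances(toy_counts, chunk_size=300)
print(means.shape, variances.shape)
```

Only one chunk is held in memory at a time, so peak memory depends on `chunk_size` rather than on the number of cells. The sum-of-squares formula is simple but can lose precision when values are very large.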

### PCA analysis
The `tc.pca()` function uses scipy's PCA solver or scikit-learn under the hood.

In [None]:
(components, explained_variance_ratio, explained_variance) = \
    tc.pca(normalized_adata, cell_mask=1000, use_highly_variable=True)
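
To make the three returned quantities concrete, here is a minimal NumPy sketch that computes components, explained variance ratio, and explained variance from a singular value decomposition of centered data. The function `pca_sketch` is hypothetical and is not how `tc.pca()` is implemented; it assumes rows are cells, columns are genes, and at least two rows of data.

```python
import numpy as np


def pca_sketch(data, n_components=2):
    """Illustrative PCA via SVD; returns values in the same order as the tuple above."""
    centered = data - data.mean(axis=0)  # center each gene (column)
    # Rows of `vt` are the principal axes, ordered by decreasing singular value
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = singular_values ** 2 / (data.shape[0] - 1)
    explained_variance_ratio = explained_variance / explained_variance.sum()
    return (
        vt[:n_components],
        explained_variance_ratio[:n_components],
        explained_variance[:n_components],
    )


rng = np.random.default_rng(0)
toy_data = rng.normal(size=(50, 4))
components, explained_variance_ratio, explained_variance = pca_sketch(toy_data)
print(components.shape)
```

Each row of `components` is a unit-length direction in gene space, and `explained_variance_ratio` gives the fraction of total variance captured along that direction.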

### Louvain Clustering

### Future APIs
- filter known modes
- WGCNA
- Ward clustering
- merging
- hierarchical sorting

### Managing Memory
- setting limits
- writing files
- chunking

## Accuracy Compared to R

### Normalize
(plots/stats)

### HVG
(plots/stats)

### PCA