# Transcriptomic Clustering

In [None]:
import scanpy as sc
import transcriptomic_clustering as tc

## Loading Data
transcriptomic_clustering functions operate on AnnData (annotated data) objects from the anndata package, which scanpy also uses. AnnData supports both dense and sparse matrices, plus `.obs`, `.obsm`, `.var`, and `.varm` fields for storing information about the observations (cells) and variables (genes). Data is stored as an HDF5 file, allowing for both in-memory and file-backed operations. For more information, see https://anndata.readthedocs.io/en/latest/

<img src="https://falexwolf.de/img/scanpy/anndata.svg" width="480"> 

In [None]:
# By default, data is loaded into memory. Pass backed='r' to read, or backed='r+' to modify, file-backed data
tasic_adata = sc.read_h5ad('./data/tasic2016counts_csr.h5ad', backed='r')
print(tasic_adata)
print(tasic_adata.X)

## Running the Pipeline

### Normalization
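To start, we normalize the expression matrix. Because `tasic_adata` is file-backed, the normalized result is written to the file given by `copy_to`.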

In [None]:
normalized_adata = tc.normalize(tasic_adata, copy_to='./data/normalized2.h5ad')
normalized_adata
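
The notebook does not say what `tc.normalize` computes. A common normalization for count data, and a reasonable mental model here, is counts per million (CPM) followed by a log transform. The sketch below shows that idea with NumPy on a tiny dense matrix; `cpm_log_normalize` is a hypothetical helper, not part of transcriptomic_clustering, and the library's actual implementation may differ.

```python
import numpy as np

def cpm_log_normalize(counts):
    """Hypothetical helper: scale each cell (row) to counts per million, then take log2(x + 1)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for cells with no counts
    totals[totals == 0] = 1.0
    cpm = counts / totals * 1e6
    return np.log2(cpm + 1.0)

# Two cells (rows) by three genes (columns)
counts = np.array([[10, 0, 90],
                   [5, 5, 0]])
normalized = cpm_log_normalize(counts)
print(normalized.round(2))
```

Genes with zero counts stay at zero after the transform, which keeps sparse data sparse.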

### Select Highly Variable Genes

We extract highly variable genes (HVGs) to further reduce the dimensionality of the dataset and include only the most informative genes. HVGs will be used for the following dimensionality reduction and clustering.

In [None]:
# Compute means and variances:
means, variances, gene_mask = tc.means_vars_genes(adata=normalized_adata)

# Find highly variable genes:
tc.highly_variable_genes(adata=normalized_adata, 
                         means=means, variances=variances, 
                         gene_mask=gene_mask, max_genes=3000)
normalized_adata
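
To make the idea concrete, here is a minimal sketch of one simple way to rank genes by variability: compute the per-gene mean and variance, then keep the genes with the highest dispersion (variance divided by mean). This is an illustrative stand-in, not necessarily the method `tc.highly_variable_genes` uses; `top_dispersion_genes` and its parameters are hypothetical.

```python
import numpy as np

def top_dispersion_genes(X, max_genes):
    """Hypothetical helper: boolean mask of the max_genes genes with highest variance/mean."""
    X = np.asarray(X, dtype=float)
    means = X.mean(axis=0)
    variances = X.var(axis=0)
    # Genes with zero mean get dispersion 0 so they are never preferred
    dispersion = np.divide(variances, means, out=np.zeros_like(means), where=means > 0)
    # Indices of the max_genes largest dispersions
    top = np.argsort(dispersion)[::-1][:max_genes]
    mask = np.zeros(X.shape[1], dtype=bool)
    mask[top] = True
    return mask

rng = np.random.default_rng(0)
X = rng.poisson(lam=5.0, size=(200, 50)).astype(float)
X[:, 3] *= rng.integers(0, 10, size=200)  # make gene 3 much more variable than the rest

hvg_mask = top_dispersion_genes(X, max_genes=10)
print(hvg_mask.sum(), hvg_mask[3])
```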

### Principal Component Analysis
Now we can do principal component analysis on a subset of the data. 

In [None]:
# Do PCA on 1000 random cells and highly variable genes
(components, explained_variance_ratio, explained_variance) = \
    tc.pca(normalized_adata, cell_select=1000, use_highly_variable=True, svd_solver='arpack')
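
As a rough picture of what PCA on a random subset of cells involves, the sketch below samples cells, restricts to a gene mask, centers the data, and takes an SVD with NumPy. `pca_on_subset` is a hypothetical helper; it is not how `tc.pca` is implemented, and it uses a full SVD rather than the `arpack` solver requested above.

```python
import numpy as np

def pca_on_subset(X, n_cells, gene_mask, n_components, seed=0):
    """Hypothetical helper: PCA via SVD on a random sample of cells and selected genes."""
    rng = np.random.default_rng(seed)
    rows = rng.choice(X.shape[0], size=min(n_cells, X.shape[0]), replace=False)
    sub = X[rows][:, gene_mask]
    centered = sub - sub.mean(axis=0)
    # Rows of vt are the principal axes in gene space
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = (s ** 2) / (centered.shape[0] - 1)
    explained_variance_ratio = explained_variance / explained_variance.sum()
    k = n_components
    return vt[:k], explained_variance_ratio[:k], explained_variance[:k]

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 40))
gene_mask = np.zeros(40, dtype=bool)
gene_mask[:20] = True  # pretend the first 20 genes are highly variable

components, ratio, variance = pca_on_subset(
    X, n_cells=100, gene_mask=gene_mask, n_components=5
)
print(components.shape, ratio.sum())
```

Fitting on a sample keeps the decomposition cheap; the resulting components can then be used to project every cell.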

## General API patterns
### Appending to AnnData objects
Many functions in the transcriptomic_clustering package calculate features of the data. By default, they return the values:

In [None]:
adata = sc.read_h5ad('./data/tasic2016counts_csr.h5ad')
hvg_df = tc.highly_variable_genes(adata, 
                         means=means, 
                         variances=variances, 
                         gene_mask=gene_mask, 
                         max_genes=3000, 
                         annotate=False)

adata

However, some functions allow you to add the data to the AnnData object instead:

In [None]:
tc.highly_variable_genes(adata, 
                         means=means, 
                         variances=variances, 
                         gene_mask=gene_mask, 
                         max_genes=3000, 
                         annotate=True)

adata

### Modifying AnnData in Place or Making a Copy
A function that modifies AnnData will by default make a copy.

In [None]:
# By default will create a new object:
tasic_adata_inmemory = sc.read_h5ad('./data/tasic2016counts_csr.h5ad')
normalized_adata = tc.normalize(tasic_adata_inmemory)
print(normalized_adata is tasic_adata_inmemory) # False: different objects

But passing `inplace=True` will overwrite the input instead:

In [None]:
# With inplace=True, the input object is modified and returned:
normalized_adata = tc.normalize(tasic_adata_inmemory, inplace=True)
print(normalized_adata is tasic_adata_inmemory) # True: same object
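
The copy-versus-in-place convention can be sketched with a plain NumPy function. `scale_rows` below is a hypothetical example of the pattern, not a transcriptomic_clustering function: with `inplace=False` it returns a new array, and with `inplace=True` it overwrites and returns its input.

```python
import numpy as np

def scale_rows(X, inplace=False):
    """Hypothetical example: divide each row by its sum, copying unless inplace=True."""
    out = X if inplace else X.copy()
    row_sums = out.sum(axis=1, keepdims=True)
    out /= row_sums  # in-place division on `out`
    return out

data = np.array([[1.0, 3.0], [2.0, 2.0]])

copied = scale_rows(data)
print(copied is data)  # a new object; `data` is unchanged

same = scale_rows(data, inplace=True)
print(same is data)    # the input itself, now overwritten
```

Copying by default protects the caller's data; the in-place option avoids holding two full matrices in memory at once.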

### Managing Memory


#### Available Memory and Setting Memory Limits

In [None]:
tc.memory.get_available_system_memory_GB()

In [None]:
!free -h

In [None]:
# Limit tc to 10% of the memory that is currently available
tc.memory.set_memory_limit(percent_current_available=10)

In [None]:
# Or set an absolute limit in gigabytes
tc.memory.set_memory_limit(GB=2)

In [None]:
tc.memory.get_available_memory_GB()

#### Chunked Processing
By default, chunked processing is turned off, and tc won't do any memory management for you:

In [None]:
normalized_adata_backed = tc.normalize(tasic_adata, copy_to='./data/normalized3.h5ad')

Setting `tc.memory.allow_chunking = True` enables automatic chunked processing:

In [None]:
tc.memory.allow_chunking = True

In [None]:
normalized_adata_backed = tc.normalize(tasic_adata, copy_to='./data/normalized4.h5ad')

In [None]:
normalized_adata_backed

In [None]:
normalized2_adata_backed = tc.normalize(tasic_adata, copy_to='./data/normalized14.h5ad', chunk_size=200)

In [None]:
normalized2_adata_backed
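
Conceptually, chunked processing means operating on one block of rows at a time, so that only one block has to fit in memory. The sketch below estimates a chunk size from a memory budget and normalizes a matrix block by block. `rows_per_chunk` and `process_in_chunks` are hypothetical helpers, and the budgeting rule is an assumption, not the library's actual logic.

```python
import numpy as np

def rows_per_chunk(n_cols, itemsize, memory_limit_gb):
    """Hypothetical helper: how many dense rows fit in the given memory budget."""
    bytes_per_row = n_cols * itemsize
    return max(1, int(memory_limit_gb * 1024**3 // bytes_per_row))

def process_in_chunks(X, func, chunk_size):
    """Apply a row-wise func to X one block of rows at a time and collect the results."""
    out = np.empty_like(X, dtype=float)
    for start in range(0, X.shape[0], chunk_size):
        stop = min(start + chunk_size, X.shape[0])
        out[start:stop] = func(X[start:stop])
    return out

def row_normalize(block):
    """Divide each row by its sum; rows are independent, so chunking is safe."""
    return block / block.sum(axis=1, keepdims=True)

X = np.arange(1, 13, dtype=float).reshape(6, 2)

chunked = process_in_chunks(X, row_normalize, chunk_size=4)
whole = row_normalize(X)
print(np.allclose(chunked, whole))

# Rows of 20,000 float64 values that fit in a 2 GB budget
print(rows_per_chunk(n_cols=20000, itemsize=8, memory_limit_gb=2))
```

This only works unchanged for operations that treat each row independently; statistics that span all cells, such as per-gene means, need partial results accumulated across chunks.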