# Transcriptomic Clustering

In [1]:
import scanpy as sc
import transcriptomic_clustering as tc

## Loading Data
`transcriptomic_clustering` functions operate on AnnData (annotated data) objects from the scanpy/anndata ecosystem. AnnData supports both dense and sparse matrices, along with `.obs`, `.obsm`, `.var`, and `.varm` fields for storing information about the observations (cells) and variables (genes). Data is stored as an HDF5 file, allowing for both in-memory and file-backed operations. For more information, see https://anndata.readthedocs.io/en/latest/

<img src="https://falexwolf.de/img/scanpy/anndata.svg" width="480"> 

In [2]:
# By default, data is loaded into memory. Pass backed='r' for read-only or backed='r+' for modifiable file-backed data
tasic_adata = sc.read_h5ad('./data/tasic2016counts_csr.h5ad', backed='r')
print(tasic_adata)
print(tasic_adata.X)

AnnData object with n_obs × n_vars = 1809 × 24057 backed at 'data/tasic2016counts_csr.h5ad'
    obs: 'cells'
<HDF5 sparse dataset: format 'csr', shape (1809, 24057), type '<f4'>


## Running the Pipeline

### Normalization
To start, we normalize the expression matrix and write the result to a new file-backed AnnData object.

In [3]:
normalized_adata = tc.normalize(tasic_adata, copy_to='./data/normalized2.h5ad')
normalized_adata

processing: 100%|██████████| 1/1 [00:00<00:00,  1.40it/s]


AnnData object with n_obs × n_vars = 1809 × 24057 backed at 'data/normalized2.h5ad'
    obs: 'cells'
    uns: 'normalized'
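
As a rough illustration of what a normalization step can look like, the sketch below applies a log2(CPM + 1) transform to a tiny count matrix in plain Python. That `tc.normalize` uses this particular transform is an assumption, not something shown in the output above, and the function name `log2_cpm` is hypothetical.

```python
import math

def log2_cpm(counts):
    """Return log2(CPM + 1) for a cells x genes list of raw counts.

    Assumption: this mirrors a common normalization for count data;
    it is not taken from the transcriptomic_clustering source.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Scale each cell to counts per million, guarding against empty cells
        scale = 1e6 / total if total > 0 else 0.0
        normalized.append([math.log2(c * scale + 1.0) for c in cell])
    return normalized

raw = [
    [10, 0, 90],  # cell 1: 100 total counts
    [1, 1, 2],    # cell 2: 4 total counts
]
norm = log2_cpm(raw)
```

Zero counts stay at zero after the transform, and cells with different sequencing depths end up on a comparable scale.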

### Select Highly Variable Genes

We extract highly variable genes (HVGs) to further reduce the dimensionality of the dataset and include only the most informative genes. HVGs will be used for the following dimensionality reduction and clustering.

In [4]:
# Compute means and variances:
means, variances, gene_mask = tc.means_vars_genes(adata=normalized_adata)

# Find highly variable genes:
tc.highly_variable_genes(adata=normalized_adata, 
                         means=means, variances=variances, 
                         gene_mask=gene_mask, max_genes=3000)
normalized_adata

AnnData object with n_obs × n_vars = 1809 × 24057 backed at 'data/normalized2.h5ad'
    obs: 'cells'
    var: 'highly_variable'
    uns: 'normalized', 'hvg'
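
To make the idea of gene selection concrete, here is a deliberately simplified sketch that ranks genes by dispersion (variance divided by mean) and keeps the top `max_genes`. This is a stand-in technique for illustration only; the method used by `tc.highly_variable_genes` may differ, and the function name `top_variable_genes` is hypothetical.

```python
from statistics import mean, pvariance

def top_variable_genes(matrix, max_genes):
    """Rank genes by dispersion (variance / mean) and keep the top `max_genes`.

    `matrix` is a cells x genes list of normalized values. Returns a boolean
    mask with one entry per gene.
    """
    n_genes = len(matrix[0])
    dispersions = []
    for g in range(n_genes):
        column = [row[g] for row in matrix]
        mu = mean(column)
        var = pvariance(column, mu)
        # Genes with zero mean carry no signal; give them the lowest rank
        dispersions.append(var / mu if mu > 0 else float("-inf"))
    ranked = sorted(range(n_genes), key=lambda g: dispersions[g], reverse=True)
    keep = set(ranked[:max_genes])
    return [g in keep for g in range(n_genes)]

expr = [
    [5.0, 1.0, 0.0],
    [5.0, 9.0, 0.0],
    [5.0, 1.0, 0.0],
    [5.0, 9.0, 0.0],
]
mask = top_variable_genes(expr, max_genes=1)
```

Only the second gene varies across cells, so it is the one the mask selects. The mask plays the same role as the `highly_variable` column added to `var` above.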

### PCA analysis
Now we can run principal component analysis (PCA) on a random subset of the cells, restricted to the highly variable genes.

In [5]:
# Do PCA on 1000 random cells and highly variable genes
(components, explained_variance_ratio, explained_variance) = \
    tc.pca(normalized_adata, cell_select=1000, use_highly_variable=True, svd_solver='arpack')
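
The sketch below illustrates the general idea of fitting principal components on a random subset of cells. It uses power iteration in plain Python to estimate only the first component, which is a simpler technique than the `arpack` solver requested above; the function name `first_principal_component` and the toy data are hypothetical.

```python
import random

def first_principal_component(rows, n_iter=200):
    """Estimate the first principal component by power iteration."""
    n, d = len(rows), len(rows[0])
    # Center each column (gene) at zero
    col_means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - col_means[j] for j in range(d)] for r in rows]
    vec = [1.0] * d
    for _ in range(n_iter):
        # Multiply by X^T X without forming the covariance matrix
        proj = [sum(r[j] * vec[j] for j in range(d)) for r in centered]
        new = [sum(centered[i][j] * proj[i] for i in range(n)) for j in range(d)]
        norm = sum(v * v for v in new) ** 0.5
        if norm == 0.0:
            break
        vec = [v / norm for v in new]
    return vec

rng = random.Random(0)
# Toy data: 50 cells, 2 genes, where gene 2 is always twice gene 1
cells = [[x, 2.0 * x] for x in (rng.uniform(0, 10) for _ in range(50))]
# Fit on a random subset of cells, analogous to cell_select above
subset = rng.sample(cells, 20)
pc1 = first_principal_component(subset)
```

Because every cell lies on the line where gene 2 is twice gene 1, the estimated component points along that direction even though only 20 of the 50 cells were used for the fit.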

## General API patterns
### Appending to AnnData objects
Many functions in the `transcriptomic_clustering` package calculate features of the data. With `annotate=False`, they return the values and leave the AnnData object unchanged:

In [6]:
adata = sc.read_h5ad('./data/tasic2016counts_csr.h5ad')
hvg_df = tc.highly_variable_genes(adata,
                                  means=means,
                                  variances=variances,
                                  gene_mask=gene_mask,
                                  max_genes=3000,
                                  annotate=False)

adata

AnnData object with n_obs × n_vars = 1809 × 24057
    obs: 'cells'

However, some functions accept `annotate=True`, which adds the results to the AnnData object instead:

In [7]:
tc.highly_variable_genes(adata, 
                         means=means, 
                         variances=variances, 
                         gene_mask=gene_mask, 
                         max_genes=3000, 
                         annotate=True)

adata

AnnData object with n_obs × n_vars = 1809 × 24057
    obs: 'cells'
    var: 'highly_variable'
    uns: 'hvg'
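
The return-or-annotate pattern can be sketched with a plain dict standing in for an AnnData object. Everything here is hypothetical (the function `highly_expressed`, the dict layout, and the choice to return `None` when annotating); it only illustrates the shape of the pattern, not the library's implementation.

```python
def highly_expressed(data, threshold, annotate=False):
    """Flag genes whose mean exceeds `threshold`.

    `data` is a plain dict standing in for an AnnData object:
    data["X"] is a cells x genes list and data["var"] holds per-gene columns.
    """
    n_cells = len(data["X"])
    means = [sum(col) / n_cells for col in zip(*data["X"])]
    flags = [m > threshold for m in means]
    if annotate:
        # Store the result on the object and return nothing
        data["var"]["highly_expressed"] = flags
        return None
    # Otherwise leave the object untouched and hand the values back
    return flags

toy = {"X": [[1.0, 8.0], [3.0, 6.0]], "var": {}}
returned = highly_expressed(toy, threshold=5.0, annotate=False)
var_keys_before = list(toy["var"])
highly_expressed(toy, threshold=5.0, annotate=True)
var_keys_after = list(toy["var"])
```

With `annotate=False` the caller gets the flags and `toy["var"]` stays empty; with `annotate=True` the same flags appear as a new per-gene column.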

### Modifying AnnData in Place or Making a Copy
By default, a function that modifies an AnnData object returns a new object and leaves the input unchanged.

In [8]:
# By default will create a new object:
tasic_adata_inmemory = sc.read_h5ad('./data/tasic2016counts_csr.h5ad')
normalized_adata = tc.normalize(tasic_adata_inmemory)
print(normalized_adata is tasic_adata_inmemory) # False: different objects

False


Passing `inplace=True` overwrites the input object instead:

In [9]:
# But can also modify in place:
normalized_adata = tc.normalize(tasic_adata_inmemory, inplace=True)
print(normalized_adata is tasic_adata_inmemory) # True: same object



True
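
The copy-versus-in-place behavior can be sketched in a few lines. The function `scale_counts` and the dict stand-in for an AnnData object are hypothetical; the point is only how an `inplace` flag decides which object is modified and returned.

```python
import copy

def scale_counts(data, factor, inplace=False):
    """Multiply every value in data["X"] by `factor`."""
    # Work on the caller's object only when explicitly asked to
    target = data if inplace else copy.deepcopy(data)
    target["X"] = [[v * factor for v in row] for row in target["X"]]
    return target

original = {"X": [[1, 2], [3, 4]]}
copied = scale_counts(original, 10)
untouched = [row[:] for row in original["X"]]  # snapshot before in-place call
same = scale_counts(original, 10, inplace=True)
```

The default call leaves `original` as it was and returns a separate object, while the `inplace=True` call returns the very object that was passed in, now holding the scaled values.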


### Managing Memory


#### Available Memory and Setting Memory Limits

In [10]:
tc.memory.get_available_system_memory_GB()

3.4089927673339844

In [11]:
!free -h

              total        used        free      shared  buff/cache   available
Mem:           15Gi        10Gi       1.9Gi       826Mi       2.6Gi       3.4Gi
Swap:         2.0Gi       1.7Gi       356Mi


In [14]:
# Set an absolute limit in GB
tc.memory.set_memory_limit(GB=2)

In [16]:
# Or limit to a percentage of the memory currently available
tc.memory.set_memory_limit(percent_current_available=10)

In [17]:
tc.memory.get_available_memory_GB()

0.3396278381347656
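
The relationship between the two ways of setting a limit can be sketched as follows. The helper `resolve_limit_gb` is hypothetical and is not part of `tc.memory`; it only shows the arithmetic that is consistent with the numbers above, where 10% of roughly 3.4 GB available comes out at about 0.34 GB.

```python
def resolve_limit_gb(available_gb, gb=None, percent_current_available=None):
    """Turn either an absolute or a percentage setting into a limit in GB."""
    if (gb is None) == (percent_current_available is None):
        raise ValueError("pass exactly one of gb or percent_current_available")
    if gb is not None:
        return float(gb)
    return available_gb * percent_current_available / 100.0

available = 3.4  # GB, rounded from the `free -h` output above
limit_from_percent = resolve_limit_gb(available, percent_current_available=10)
limit_from_gb = resolve_limit_gb(available, gb=2)
```

A percentage limit depends on how much memory happens to be free when it is set, so the same call can give different limits on different runs.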

#### Chunked Processing
By default, chunked processing is turned off, and tc won't do any memory management for you. An operation that does not fit within the memory limit raises a `MemoryError`:

In [18]:
normalized_adata_backed = tc.normalize(tasic_adata, copy_to='./data/normalized3.h5ad')

MemoryError: The process: `normalize` cannot fit in memory, but could be done using chunking.
Set transcriptomic_clustering.memory.allow_chunking=True

Setting `tc.memory.allow_chunking = True` enables automatic chunked processing:

In [19]:
tc.memory.allow_chunking = True

In [20]:
normalized_adata_backed = tc.normalize(tasic_adata, copy_to='./data/normalized4.h5ad')

processing: 100%|██████████| 5/5 [00:01<00:00,  3.98it/s]


In [21]:
normalized_adata_backed

AnnData object with n_obs × n_vars = 1809 × 24057 backed at 'data/normalized4.h5ad'
    obs: 'cells'
    uns: 'normalized'

In [22]:
# A chunk size can also be passed explicitly instead of being chosen automatically
normalized2_adata_backed = tc.normalize(tasic_adata, copy_to='./data/normalized14.h5ad', chunk_size=200)

processing: 100%|██████████| 10/10 [00:01<00:00,  7.65it/s]


In [23]:
normalized2_adata_backed

AnnData object with n_obs × n_vars = 1809 × 24057 backed at 'data/normalized14.h5ad'
    obs: 'cells'
    uns: 'normalized'
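
A minimal sketch of how a matrix can be split into row chunks is shown below. It assumes that chunks are taken along the cell axis; the generator `iter_chunks` is hypothetical and is not the library's implementation.

```python
import math

def iter_chunks(n_rows, chunk_size):
    """Yield (start, stop) row ranges that cover n_rows in order."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

n_cells = 1809    # number of cells in the dataset above
chunk_size = 200  # same value as the chunk_size argument above
chunks = list(iter_chunks(n_cells, chunk_size))
n_chunks = math.ceil(n_cells / chunk_size)
```

With 1809 cells and a chunk size of 200 this gives 10 chunks, the last one holding only 9 cells, which is consistent with the 10 steps shown in the progress bar above.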