To demonstrate how the CS-CORE Python package scales in running time and memory usage, we consider two scenarios of co-expression network inference:

1. n=1994 cells, p=5000 genes (same as CSCORE_python_example.html)
2. n=44721 cells, p=5000 genes (to demonstrate the scalability)

In [1]:
import scanpy as sc
import numpy as np
import pandas as pd
import time
import tracemalloc
from CSCORE import CSCORE

# Load the single-cell data generated by the code above.
pbmc = sc.read_h5ad('data/updated_covid.h5ad')

## 
# Scenario 1: copied from CSCORE_python_example.html
##

# focus on B cells
pbmc_B = pbmc[pbmc.obs['cell.type.coarse'] == 'B',:]

# scale counts by seq depths
sc.pp.normalize_total(pbmc_B, target_sum=1)
mean_exp = (pbmc_B.X).sum(axis=0).A1

mean_exp_df = pd.DataFrame({'gene': pbmc_B.var.SCT_features, 'mean_expression': mean_exp})
top_genes_df = mean_exp_df.sort_values(by='mean_expression', ascending=False).head(5000)
# obtain indexes for the gene set of interest (top 5000 highly expressed)
top_genes_indices = top_genes_df.index.astype(int).to_numpy()

pbmc_B_healthy = pbmc_B[pbmc_B.obs['Status'] == 'Healthy', :]



# Setting 1: n=1994 cells, p=5000 genes

In [2]:
pbmc_B_healthy

View of AnnData object with n_obs × n_vars = 1994 × 26361
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'percent.rps', 'percent.rpl', 'percent.rrna', 'nCount_SCT', 'nFeature_SCT', 'SCT_snn_res.1', 'seurat_clusters', 'singler', 'Admission.level', 'cell.type.fine', 'cell.type.coarse', 'cell.type', 'IFN1', 'HLA1', 'Donor.orig', 'Donor.full', 'Donor', 'Status', 'Sex', 'DPS', 'DTF', 'Admission', 'Ventilated'
    var: 'SCT_features', '_index', 'features'
    obsm: 'X_umap'
    layers: 'SCT'

In [3]:
start = time.time()
tracemalloc.start()

res = CSCORE(pbmc_B_healthy, top_genes_indices)

current_memory, peak_memory = tracemalloc.get_traced_memory()
end = time.time()

IRLS converged after 4 iterations.
15 among 5000 genes have negative variance estimates. Their co-expressions with other genes were set to 0.
0.1075% co-expression estimates were greater than 1 and were set to 1.
0.0618% co-expression estimates were greater than 1 and were set to 1.


In [4]:
# elapsed time (in minutes)
(end - start) / 60

0.2194434444109599

In [5]:
print(f"Current memory usage: {current_memory / (1024**3)} GB")
print(f"Peak memory usage: {peak_memory / (1024**3)} GB")
tracemalloc.stop()

Current memory usage: 0.5588010940700769 GB
Peak memory usage: 2.317520773038268 GB
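The timing and memory bookkeeping used above can be wrapped in a small reusable helper. The following is a minimal sketch, not part of the CS-CORE package: `profile_call` is a hypothetical name, and the toy list allocation merely stands in for a call such as `CSCORE(pbmc_B_healthy, top_genes_indices)`. Note that `tracemalloc` adds overhead of its own, so running times measured with tracing enabled are likely somewhat longer than untraced runs.

```python
import time
import tracemalloc


def profile_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, minutes, current_gb, peak_gb).

    Hypothetical helper for illustration; memory is reported in units of
    1024**3 bytes, matching the cells above.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed_min = (time.perf_counter() - start) / 60
        current_bytes, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if func raises.
        tracemalloc.stop()
    unit = 1024 ** 3
    return result, elapsed_min, current_bytes / unit, peak_bytes / unit


# Toy workload standing in for the CSCORE call.
result, minutes, current_gb, peak_gb = profile_call(lambda n: [0] * n, 1_000_000)
print(f"Elapsed time: {minutes:.4f} min")
print(f"Current memory usage: {current_gb:.4f} GB")
print(f"Peak memory usage: {peak_gb:.4f} GB")
```

Stopping the tracer in a `finally` clause keeps a failed run from leaving tracing switched on for the next measurement.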


# Setting 2: n=44721 cells, p=5000 genes

In [6]:
pbmc

AnnData object with n_obs × n_vars = 44721 × 26361
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'percent.rps', 'percent.rpl', 'percent.rrna', 'nCount_SCT', 'nFeature_SCT', 'SCT_snn_res.1', 'seurat_clusters', 'singler', 'Admission.level', 'cell.type.fine', 'cell.type.coarse', 'cell.type', 'IFN1', 'HLA1', 'Donor.orig', 'Donor.full', 'Donor', 'Status', 'Sex', 'DPS', 'DTF', 'Admission', 'Ventilated'
    var: 'SCT_features', '_index', 'features'
    obsm: 'X_umap'
    layers: 'SCT'

In [7]:
start = time.time()
tracemalloc.start()

res = CSCORE(pbmc, top_genes_indices)

current_memory, peak_memory = tracemalloc.get_traced_memory()
end = time.time()

IRLS converged after 3 iterations.
0.0002% co-expression estimates were greater than 1 and were set to 1.
0.0000% co-expression estimates were greater than 1 and were set to 1.


In [8]:
# elapsed time (in minutes)
(end - start) / 60

3.690111990769704

In [9]:
print(f"Current memory usage: {current_memory / (1024**3)} GB")
print(f"Peak memory usage: {peak_memory / (1024**3)} GB")
tracemalloc.stop()

Current memory usage: 0.5587991932407022 GB
Peak memory usage: 17.219492229633033 GB
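To put the two runs side by side, the sketch below computes how much each quantity grew from Setting 1 to Setting 2. The numbers are copied from the outputs above; the dictionary layout and the `ratio` helper are illustrative only. These are two measurements on one machine, so they indicate a trend rather than establish a scaling law.

```python
# Values copied from the outputs of the two runs above.
runs = {
    "setting_1": {"n_cells": 1994, "minutes": 0.2194434444109599, "peak_gb": 2.317520773038268},
    "setting_2": {"n_cells": 44721, "minutes": 3.690111990769704, "peak_gb": 17.219492229633033},
}


def ratio(key):
    """Growth factor of a quantity from Setting 1 to Setting 2."""
    return runs["setting_2"][key] / runs["setting_1"][key]


for key in ("n_cells", "minutes", "peak_gb"):
    print(f"{key}: {ratio(key):.1f}x")
```

With roughly 22 times as many cells, the running time grew by about 17 times and the peak memory by about 7 times in these two runs.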


# Note on large p

We strongly recommend **against** estimating the co-expression network for all 20,000 genes.

Sparsity is extremely high for genes whose mean expression ranks below, for example, the top 10,000. Inferring co-expression for these genes would be extremely challenging for any method.

Additionally, using all genes in CS-CORE could result in a much longer running time and much higher memory usage, because the computational complexity scales quadratically with $p$.
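As a rough illustration of quadratic growth in $p$, the sketch below computes the size of a single dense $p \times p$ matrix. It assumes 8-byte floating-point entries and dense storage; this is an assumption made for the arithmetic, not a statement about how CS-CORE stores its intermediate results, and `dense_matrix_gb` is a hypothetical helper name.

```python
def dense_matrix_gb(p, bytes_per_entry=8):
    """Size of one dense p x p matrix, in units of 1024**3 bytes.

    Assumes 8-byte (float64) entries by default.
    """
    return p * p * bytes_per_entry / (1024 ** 3)


baseline = 5000
for p in (5000, 10000, 20000):
    growth = (p / baseline) ** 2
    print(f"p={p}: {dense_matrix_gb(p):.2f} GB per matrix, {growth:.0f}x the entries of p={baseline}")
```

Going from 5,000 to 20,000 genes multiplies the number of gene pairs by 16, so every $p \times p$ intermediate grows by the same factor under these assumptions.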