# RAPIDS & Scanpy Single-Cell RNA-seq Workflow on 1 Million Cells

Copyright (c) 2020, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0 

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This notebook demonstrates a single-cell RNA analysis workflow that begins with preprocessing a count matrix of size `(n_cells, n_genes)` and results in a visualization of the clustered cells for further analysis.

For demonstration purposes, we use a dataset of 1M brain cells with Unified Virtual Memory to oversubscribe GPU memory. See the README for instructions to download this dataset.

## Import requirements

In [1]:
import numpy as np
import scanpy as sc
import anndata

import time

import cudf
import cuml
import cupy as cp

import os, wget

from cuml.decomposition import PCA
from cuml.manifold import TSNE
from cuml.cluster import KMeans

import rapids_scanpy_funcs
import utils

import warnings
warnings.filterwarnings('ignore', 'Expected ')
warnings.simplefilter('ignore')

We use the RAPIDS memory manager to enable Unified Virtual Memory management, which allows us to oversubscribe the GPU memory.

In [2]:
import rmm

rmm.reinitialize(managed_memory=True)
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

## Input data

In the cell below, we provide the path to the sparse `.h5ad` file containing the count matrix to analyze. Please see the README for instructions on how to download the dataset we use here.

To run this notebook using your own dataset, please see the README for instructions to convert your own count matrix into this format. Then, replace the path in the cell below with the path to your generated `.h5ad` file.

In [3]:
input_file = "../data/1M_brain_cells_10X.sparse.h5ad"

if not os.path.exists(input_file):
    print('Downloading input file...')
    os.makedirs('../data', exist_ok=True)
    wget.download('https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad',
              input_file)

In [4]:
USE_FIRST_N_CELLS = 100000

## Set parameters

In [5]:
# marker genes
MITO_GENE_PREFIX = "mt-" # Prefix for mitochondrial genes to regress out
markers = ["Stmn2", "Hes1", "Olig1"] # Marker genes for visualization

# filtering cells
min_genes_per_cell = 200 # Filter out cells with fewer genes than this expressed 
max_genes_per_cell = 6000 # Filter out cells with more genes than this expressed 

# filtering genes
n_top_genes = 4000 # Number of highly variable genes to retain

# PCA
n_components = 50 # Number of principal components to compute

# Batched PCA
pca_train_ratio = 0.35 # Fraction of cells to use for PCA training
n_pca_batches = 10

# t-SNE
tsne_n_pcs = 20 # Number of principal components to use for t-SNE

# k-means
k = 35 # Number of clusters for k-means

# KNN
n_neighbors = 15 # Number of nearest neighbors for KNN graph
knn_n_pcs = 50 # Number of principal components to use for finding nearest neighbors

# UMAP
umap_min_dist = 0.3 
umap_spread = 1.0

In [6]:
start = time.time()

## Load and Prepare Data

We load the sparse count matrix from an `h5ad` file using Scanpy. The sparse count matrix will then be placed on the GPU. 

In [7]:
data_load_start = time.time()

In [8]:
%%time
adata = anndata.read(input_file, use_gpu=True)
# adata.var_names_make_unique()
adata.shape

Variable names are not unique. To make them unique, call `.var_names_make_unique`.


CPU times: user 1min 19s, sys: 41.5 s, total: 2min 1s
Wall time: 2min 1s


(1306127, 27998)

In [9]:
type(adata.X), type(adata.var_names)

(cupyx.scipy.sparse.csr.csr_matrix, cudf.core.index.StringIndex)

For this example, we select the first `USE_FIRST_N_CELLS` (100,000) cells in the dataset. We keep the gene name index so that it can be reattached to the subset:

In [10]:
%%time
# genes = cudf.Series(adata.var_names)
# adata.var_names = genes.to_pandas()
genes = adata.var_names

adata = anndata.AnnData(adata.X[:USE_FIRST_N_CELLS])
adata.var_names = genes # Fix for conversion error anndata/_core/anndata.py:931

CPU times: user 360 ms, sys: 216 ms, total: 576 ms
Wall time: 577 ms


Verify the shape of the resulting sparse matrix:

In [12]:
adata.shape

(100000, 27998)

And the number of non-zero values in the matrix:

In [13]:
adata.X.nnz

197980827

In [14]:
data_load_time = time.time()
print("Total data load and format time: %s" % (data_load_time-data_load_start))

Total data load and format time: 137.09226393699646


## Preprocessing

In [15]:
preprocess_start = time.time()

### Filter

We filter the count matrix to remove cells with an extreme number of genes expressed.

In [16]:
%%time
sparse_gpu_array = rapids_scanpy_funcs.filter_cells(adata.X, 
                                                    min_genes=min_genes_per_cell, 
                                                    max_genes=max_genes_per_cell)

CPU times: user 980 ms, sys: 1.81 s, total: 2.79 s
Wall time: 2.79 s


Some genes will now have zero expression in all cells. We filter out such genes.

In [17]:
%%time
sparse_gpu_array, genes = rapids_scanpy_funcs.filter_genes(sparse_gpu_array, 
                                                           adata.var_names, 
                                                           min_cells=1)

CPU times: user 2.24 s, sys: 1.69 s, total: 3.93 s
Wall time: 3.93 s


The size of our count matrix is now reduced.

In [18]:
sparse_gpu_array.shape

(99109, 21673)
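The two filtering steps above can be sketched on the CPU with NumPy. This is an illustration on a made-up toy matrix, not the implementation in `rapids_scanpy_funcs`; the small thresholds and the choice of inclusive bounds are assumptions made for the example.

```python
import numpy as np

# Toy count matrix: rows are cells, columns are genes (values are made up).
counts = np.array([
    [0, 3, 0, 1],
    [0, 0, 0, 0],
    [2, 1, 0, 4],
    [0, 5, 0, 0],
])

# Hypothetical thresholds, much smaller than the notebook's 200 and 6000.
min_genes, max_genes = 2, 3

# Number of genes with a non-zero count in each cell.
genes_per_cell = (counts > 0).sum(axis=1)
keep_cells = (genes_per_cell >= min_genes) & (genes_per_cell <= max_genes)
filtered = counts[keep_cells]

# Number of remaining cells in which each gene is expressed.
cells_per_gene = (filtered > 0).sum(axis=0)
keep_genes = cells_per_gene >= 1
filtered = filtered[:, keep_genes]

print(filtered.shape)
```

Filtering cells first and genes second matters: a gene expressed only in discarded cells ends up with zero expression everywhere and is dropped by the second step.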

### Normalize

We normalize the count matrix so that the total counts in each cell sum to 1e4.

In [19]:
%%time
sparse_gpu_array = rapids_scanpy_funcs.normalize_total(sparse_gpu_array, target_sum=1e4)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 1.23 ms


Next, we log transform the count matrix.

In [20]:
%%time
sparse_gpu_array = sparse_gpu_array.log1p()

CPU times: user 232 ms, sys: 256 ms, total: 488 ms
Wall time: 487 ms
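As a rough CPU illustration of these two steps, the sketch below normalizes a tiny dense matrix so that each row sums to `target_sum` and then applies `log1p`. The matrix values are made up, and the dense NumPy code is an assumption for readability; the notebook itself works on a sparse GPU matrix.

```python
import numpy as np

# Toy count matrix: 2 cells x 3 genes (values are made up).
counts = np.array([
    [1.0, 3.0, 0.0],
    [0.0, 2.0, 2.0],
])
target_sum = 1e4

# Scale each row (cell) so that its counts sum to target_sum.
totals = counts.sum(axis=1, keepdims=True)
normalized = counts / totals * target_sum

# log1p computes log(1 + x), so zero counts stay exactly zero.
logged = np.log1p(normalized)

print(normalized.sum(axis=1))
```

Because `log1p` maps zero to zero, the sparsity pattern of the matrix is preserved by the transform.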


### Select Most Variable Genes

We convert the count matrix to an AnnData object.

In [21]:
%%time
adata = anndata.AnnData(sparse_gpu_array)
# adata.var_names = genes.to_pandas()
adata.var_names = genes

CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 8.54 ms


Before filtering the count matrix, we save the 'raw' expression values of the marker genes to use for labeling cells afterward.

In [22]:
%%time
marker_genes_raw = {
    ("%s_raw" % marker): adata.X[:, adata.var_names == marker]
    for marker in markers
}

CPU times: user 928 ms, sys: 688 ms, total: 1.62 s
Wall time: 1.62 s


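We filter the count matrix to retain only the most variable genes, using the `highly_variable_genes_filter` function defined in the cell below, which ranks genes by their normalized dispersion.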

In [23]:
%%time
import pandas as pd
from statsmodels import robust

def highly_variable_genes_filter(adata,
                                 sum_mat=None,
                                 sum_sq_mat=None,
                                 n_top_genes=None):
    """
    Finds the most variable genes.
    """
    data_mat = adata.X
    genes = adata.var_names
    
    if n_top_genes is None:
        n_top_genes = genes.shape[0] // 10

    # Compare against None explicitly: the truth value of an array is ambiguous.
    if sum_mat is None:
        sum_mat = data_mat.sum(axis=0)
    if sum_sq_mat is None:
        sum_sq_mat = data_mat.power(2).sum(axis=0)

    mean = sum_mat / data_mat.shape[0]
    mean[mean == 0] = 1e-12

    mean_sq = sum_sq_mat / data_mat.shape[0]
    variance = mean_sq - mean ** 2
    # Bessel's correction: n / (n - 1), where n is the number of cells
    variance *= data_mat.shape[0] / (data_mat.shape[0] - 1)
    dispersion = variance / mean

    df = pd.DataFrame()
    df['genes'] = genes.to_array()
    df['means'] = mean[0].tolist()
    df['dispersions'] = dispersion[0].tolist()
    df['mean_bin'] = pd.cut(
        df['means'],
        np.r_[-np.inf, np.percentile(df['means'], np.arange(10, 105, 5)), np.inf],
    )

    disp_grouped = df.groupby('mean_bin')['dispersions']
    disp_median_bin = disp_grouped.median()

    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        disp_mad_bin = disp_grouped.apply(robust.mad)
        df['dispersions_norm'] = (
            df['dispersions'].values - disp_median_bin[df['mean_bin'].values].values
        ) / disp_mad_bin[df['mean_bin'].values].values

    dispersion_norm = df['dispersions_norm'].values

    dispersion_norm = dispersion_norm[~np.isnan(dispersion_norm)]
    dispersion_norm[::-1].sort()

    if n_top_genes > df.shape[0]:
        n_top_genes = df.shape[0]

    disp_cut_off = dispersion_norm[n_top_genes - 1]
    variable_genes = np.nan_to_num(df['dispersions_norm'].values) >= disp_cut_off

    variable_genes = cp.array(variable_genes)
    variable_genes = cp.where(variable_genes)[0]
    
    highly_variable_adata = anndata.AnnData(adata.X[:, variable_genes])
    highly_variable_adata.var_names = genes[variable_genes]
    return highly_variable_adata


adata = highly_variable_genes_filter(adata, n_top_genes=n_top_genes)

CPU times: user 7.29 s, sys: 9.29 s, total: 16.6 s
Wall time: 16.6 s
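The function above derives each gene's mean and variance from two column-wise sums rather than from a second pass over the data. A minimal NumPy check of that identity, on synthetic Poisson counts and with the usual `n / (n - 1)` sample correction, is sketched below. The data and variable names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(2.0, size=(50, 4)).astype(float)  # 50 cells, 4 genes (synthetic)
n_cells = x.shape[0]

# Per-gene mean and mean of squares, both from simple column sums.
mean = x.sum(axis=0) / n_cells
mean_sq = (x ** 2).sum(axis=0) / n_cells

# Population variance E[x^2] - E[x]^2, then Bessel's correction n / (n - 1).
variance = (mean_sq - mean ** 2) * n_cells / (n_cells - 1)

# Dispersion is the variance-to-mean ratio of each gene.
dispersion = variance / mean

print(np.allclose(variance, x.var(axis=0, ddof=1)))
```

The ranking then uses the dispersion after normalizing it by the median and median absolute deviation within each mean-expression bin, which is not reproduced here.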


### Regress out confounding factors (number of counts, mitochondrial gene expression)

We can now perform regression on the count matrix to correct for confounding factors. For example purposes, we use the number of counts and the expression of mitochondrial genes (named starting with `mt-`).

We now calculate the total counts and the fraction of mitochondrial counts (stored in `percent_mito`) for each cell.

In [24]:
%%time
mito_genes = adata.var_names.str.startswith(MITO_GENE_PREFIX)
n_counts = adata.X.sum(axis=1)
percent_mito = (adata.X[:,mito_genes].sum(axis=1) / n_counts).ravel()

n_counts = cp.array(n_counts).ravel()
percent_mito = cp.array(percent_mito).ravel()

CPU times: user 20 ms, sys: 40 ms, total: 60 ms
Wall time: 57.8 ms
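The same per-cell quantities can be sketched with NumPy on a toy matrix. The gene names and values below are made up for illustration. Note that the result is a fraction between 0 and 1, despite the variable name `percent_mito`.

```python
import numpy as np

# Hypothetical gene names; two of them carry the mitochondrial prefix.
gene_names = np.array(["mt-Nd1", "Stmn2", "mt-Co1", "Hes1"])
expr = np.array([
    [1.0, 2.0, 1.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
])

# Boolean mask over genes whose name starts with the prefix.
mito_mask = np.char.startswith(gene_names, "mt-")

# Total counts per cell, and the share contributed by mitochondrial genes.
n_counts = expr.sum(axis=1)
percent_mito = expr[:, mito_mask].sum(axis=1) / n_counts

print(percent_mito)
```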


And perform regression:

In [25]:
%%time
sparse_gpu_array = cp.sparse.csc_matrix(adata.X)

CPU times: user 28 ms, sys: 24 ms, total: 52 ms
Wall time: 49.9 ms


In [26]:
%%time
sparse_gpu_array = rapids_scanpy_funcs.regress_out(sparse_gpu_array, n_counts, percent_mito)

CPU times: user 47.9 s, sys: 20.7 s, total: 1min 8s
Wall time: 1min 8s
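One common way to "regress out" covariates is to fit, for every gene, a linear model with an intercept on the covariates and keep only the residuals. The sketch below assumes that interpretation and uses NumPy least squares on synthetic data; it is not the code in `rapids_scanpy_funcs.regress_out`, whose internals are not shown in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 200, 3

# Synthetic covariates (made up): total counts in thousands, mito fraction.
n_counts = rng.uniform(1.0, 5.0, size=n_cells)
percent_mito = rng.uniform(0.0, 0.2, size=n_cells)

# Design matrix: intercept plus the two covariates.
design = np.column_stack([np.ones(n_cells), n_counts, percent_mito])

# Synthetic expression that depends linearly on the covariates, plus noise.
true_coef = rng.normal(size=(3, n_genes))
expr = design @ true_coef + rng.normal(scale=0.1, size=(n_cells, n_genes))

# Fit all genes at once by least squares and keep the residuals.
coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
residuals = expr - design @ coef

print(residuals.shape)
```

By construction the residuals are uncorrelated with both covariates and have zero mean per gene, which is what makes the later scaling step meaningful.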


### Scale

Finally, we scale the count matrix to obtain a z-score and apply a cutoff value of 10 standard deviations, obtaining the preprocessed count matrix.

In [27]:
%%time
sparse_gpu_array = rapids_scanpy_funcs.scale(sparse_gpu_array, max_value=10)

CPU times: user 136 ms, sys: 156 ms, total: 292 ms
Wall time: 289 ms
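A small NumPy sketch of z-scoring with a cap is shown below. It assumes the population standard deviation and clips only the upper tail; both are assumptions about the behavior of `rapids_scanpy_funcs.scale`, and the cap is lowered from 10 to 1 so that clipping is visible on a three-row example.

```python
import numpy as np

# Toy matrix: 3 cells x 2 genes (values are made up).
x = np.array([
    [1.0, 0.0],
    [2.0, 0.0],
    [3.0, 30.0],
])
max_value = 1.0  # the notebook uses 10; a small cap is easier to see here

# Standardize each gene (column) to zero mean and unit variance.
mean = x.mean(axis=0)
std = x.std(axis=0)
z = (x - mean) / std

# Cap large positive z-scores at max_value.
z = np.minimum(z, max_value)

print(z)
```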


In [28]:
preprocess_time = time.time()
print("Total Preprocessing time: %s" % (preprocess_time-preprocess_start))

Total Preprocessing time: 95.14461851119995


## Cluster & Visualize

We store the preprocessed count matrix as an AnnData object, which is currently in host memory. We also add the expression levels of the marker genes as observations to the AnnData object.

In [29]:
%%time
# TODO: Try to remove the need to create new, instead filter.
genes = adata.var_names
adata = anndata.AnnData(sparse_gpu_array)
adata.var_names = genes

for name, data in marker_genes_raw.items():
    adata.obs[name] = data.todense()

# adata.shape

CPU times: user 28 ms, sys: 8 ms, total: 36 ms
Wall time: 34.7 ms


In [30]:
# del sparse_gpu_array
adata

AnnData object with n_obs × n_vars = 99109 × 4000
    obs: 'Stmn2_raw', 'Hes1_raw', 'Olig1_raw'

### Reduce

We use PCA to reduce the dimensionality of the matrix to its top 50 principal components.

If the number of cells were smaller, we would use the command `adata.obsm["X_pca"] = cuml.decomposition.PCA(n_components=n_components, output_type="numpy").fit_transform(adata.X)` to perform PCA on all the cells.

However, we cannot perform PCA on the complete dataset using a single GPU. Therefore, we use the batched PCA function in `utils.py`, which uses only a fraction of the total cells to train PCA.

In [31]:
%%time
adata = utils.pca(adata, n_components=n_components, 
                  train_ratio=pca_train_ratio, 
                  n_batches=n_pca_batches,
                  gpu=True)

CPU times: user 384 ms, sys: 284 ms, total: 668 ms
Wall time: 666 ms
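The idea of fitting PCA on a fraction of the cells and then projecting all cells in batches can be sketched with NumPy as follows. This is an illustration under stated assumptions (random row subsampling, SVD-based PCA, synthetic data), not the implementation of `utils.pca`.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 20))  # 1000 cells, 20 features (synthetic)
n_components, train_ratio, n_batches = 5, 0.35, 10

# Fit on a random subset of the rows.
n_train = int(train_ratio * data.shape[0])
train_idx = rng.choice(data.shape[0], size=n_train, replace=False)
train = data[train_idx]
mean = train.mean(axis=0)

# Rows of vt are the principal axes of the centered training data.
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:n_components]

# Project every row, one batch at a time, to bound peak memory use.
embedding = np.vstack([
    (batch - mean) @ components.T
    for batch in np.array_split(data, n_batches)
])

print(embedding.shape)
```

Only the training subset has to be decomposed; each batch needs just one matrix product, so peak memory depends on the batch size rather than on the total number of cells.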


### t-SNE + K-means

We first compute a t-SNE embedding from the top 20 principal components (`tsne_n_pcs`). We then cluster the cells using k-means on the principal components. For example purposes, we set k=35.

In [32]:
%%time
adata.obsm['X_tsne'] = TSNE().fit_transform(adata.obsm["X_pca"][:,:tsne_n_pcs])

[W] [00:37:02.237812] # of Nearest Neighbors should be at least 3 * perplexity. Your results might be a bit strange...
CPU times: user 1.06 s, sys: 1.21 s, total: 2.27 s
Wall time: 2.26 s


In [33]:
%%time
kmeans = KMeans(n_clusters=k, init="k-means++", random_state=0).fit(adata.obsm['X_pca'])
adata.obs['kmeans'] = kmeans.labels_.astype(str)

CPU times: user 116 ms, sys: 108 ms, total: 224 ms
Wall time: 227 ms
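For readers unfamiliar with k-means, a minimal NumPy sketch of the basic Lloyd iteration is given below on two synthetic, well-separated blobs. It uses a hand-picked initialization instead of the `k-means++` scheme requested above, and the function name `kmeans` is hypothetical; it is not the cuML implementation.

```python
import numpy as np

def kmeans(points, centers, n_iter=10):
    """Plain Lloyd iterations starting from the given initial centers."""
    for _ in range(n_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points; keep it if it has none.
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
    return labels, centers

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

# Hypothetical initialization: one point taken from each blob.
labels, centers = kmeans(points, centers=points[[0, 50]].copy())

print(centers)
```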


We visualize the cells using t-SNE and label cells by color according to the k-means clustering.

In [34]:
import cudf as cd

from bokeh.plotting import figure
from bokeh.io import push_notebook, show

from bokeh.models import ColorBar, ColumnDataSource
from bokeh.palettes import Blues9 as Blues
from bokeh.models import LinearColorMapper, ColorBar

from bokeh.io.export import export_png
from bokeh.models.tickers import FixedTicker
from bokeh.io import output_notebook

output_notebook()

In [35]:
%%time

# sc.pl.tsne(adata, color=["kmeans"])

def show_scatter(df, x, y, cluster_col, title):
    tsne_fig = figure(title=title, width=800, output_backend="webgl")
    clusters = df[cluster_col].unique().values_host

    a = ((np.random.random(size=clusters.shape[0]) * 255))
    b = ((np.random.random(size=clusters.shape[0]) * 255))

    colors = ["#%02x%02x%02x" % (int(r), int(g), 125) for r, g in zip(a, b)]

    for cluster in clusters:
        cdf = df.query(cluster_col + ' == ' + str(cluster))
        if cdf.shape[0] == 0:
            continue
        tsne_fig.circle(cdf[x].to_array(),
                        cdf[y].to_array(),
                        size=2,
                        color=colors[cluster],
                        legend_label = str(cluster))

    tsne_fig.legend.location = 'top_right'

    tsne_fig_handle = show(tsne_fig, notebook_handle=True)
    push_notebook(handle=tsne_fig_handle)

tsne_data = cd.DataFrame()
tsne_data[0] = adata.obsm['X_tsne'][:, 0]
tsne_data[1] = adata.obsm['X_tsne'][:, 1]
tsne_data['kmeans'] = adata.obs['kmeans'].astype(int)

show_scatter(tsne_data, 0, 1, 'kmeans', 'TSNE')

CPU times: user 4.82 s, sys: 172 ms, total: 4.99 s
Wall time: 4.98 s


We label the cells using the `Stmn2` and `Hes1` marker genes, for neuronal and glial cells respectively. These visualizations show us the separation of neuronal and glial cells on the t-SNE plot.

In [36]:
%%time

def show_scatter_grad(df, x, y, color_col, title):

    color_array = cp.fromDlpack(df[color_col].to_dlpack())
    source = ColumnDataSource(dict(x=df[x].to_array(),
                                   y=df[y].to_array(),
                                   color_col=color_array.get()))

    mapper = LinearColorMapper(palette=Blues,
                               low=df[color_col].min(),
                               high=df[color_col].max())

    tsne_fig = figure(title=title,
                      width=800,
                      output_backend="webgl")


    tsne_fig.scatter('x', 'y',
                     color={'field': 'color_col', 'transform':mapper},
                     source=source,
                     size=2)

    color_bar = ColorBar(color_mapper=mapper, width=8,  location=(0,0))
    tsne_fig.add_layout(color_bar, 'right')

    tsne_fig_handle = show(tsne_fig, notebook_handle=True)
    push_notebook(handle=tsne_fig_handle)

for marker_gene in marker_genes_raw:
    tsne_data[marker_gene] = marker_genes_raw[marker_gene].todense()

show_scatter_grad(tsne_data, 0, 1, 'Stmn2_raw', 'Stmn2 - TSNE')
show_scatter_grad(tsne_data, 0, 1, 'Hes1_raw', 'Hes1 - TSNE')

# sc.pl.tsne(adata, color=["Stmn2_raw"], color_map="Blues", vmax=1, vmin=-0.05)
# sc.pl.tsne(adata, color=["Hes1_raw"], color_map="Blues", vmax=1, vmin=-0.05)

CPU times: user 412 ms, sys: 0 ns, total: 412 ms
Wall time: 409 ms


### UMAP + Graph clustering

We can also visualize the cells using the UMAP algorithm in RAPIDS. Before UMAP, we need to construct a k-nearest neighbors graph in which each cell is connected to its nearest neighbors. This can be done conveniently using RAPIDS functionality already integrated into Scanpy.

Note that Scanpy uses an approximation to the nearest neighbors on the CPU while the GPU version performs an exact search. While both methods are known to yield useful results, some differences in the resulting visualization and clusters can be observed.

In [37]:
%%time
sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=knn_n_pcs, method='rapids')

CPU times: user 5.73 s, sys: 704 ms, total: 6.43 s
Wall time: 6.1 s
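An exact (brute-force) nearest-neighbor search, as opposed to an approximate one, compares every point with every other point. The NumPy sketch below illustrates this on four made-up 2-D points; the helper name `exact_knn` is hypothetical, and excluding each point from its own neighbor list is a choice made for this sketch.

```python
import numpy as np

def exact_knn(points, n_neighbors):
    """Return indices and distances of each point's nearest neighbors."""
    # All pairwise squared Euclidean distances.
    sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sq_dists, np.inf)  # exclude each point from its own list
    order = np.argsort(sq_dists, axis=1)[:, :n_neighbors]
    dists = np.sqrt(np.take_along_axis(sq_dists, order, axis=1))
    return order, dists

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
indices, distances = exact_knn(points, n_neighbors=2)

print(indices)
```

The pairwise distance matrix grows quadratically with the number of cells, which is why CPU pipelines often prefer approximate methods for this step.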


The UMAP function from RAPIDS is also integrated into Scanpy.

In [38]:
%%time
sc.tl.umap(adata, min_dist=umap_min_dist, spread=umap_spread, method='rapids')



CPU times: user 1.34 s, sys: 1.54 s, total: 2.88 s
Wall time: 2.87 s


Next, we use the Louvain algorithm for graph-based clustering.

In [39]:
%%time
sc.tl.louvain(adata, flavor='rapids')

CPU times: user 164 ms, sys: 64 ms, total: 228 ms
Wall time: 224 ms
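Louvain searches for a partition of the graph with high modularity. The sketch below does not implement Louvain itself; it only evaluates the modularity of two candidate partitions of a tiny made-up graph (two triangles joined by one edge) to show what the algorithm optimizes. The function name `modularity` and the graph are assumptions made for the example.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Two triangles (0-1-2 and 3-4-5) connected by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)

print(q_two, q_one)
```

Splitting the graph into its two triangles gives a modularity of 5/14, while placing all nodes in one community gives 0, so the split is the better partition under this score.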


We plot the cells in the UMAP embedding, colored by their Louvain cluster.

In [40]:
%%time
umap_data = cd.DataFrame()
umap_data[0] = adata.obsm['X_umap'][:, 0]
umap_data[1] = adata.obsm['X_umap'][:, 1]

umap_data['louvain'] = adata.obs['louvain'].astype(int)
show_scatter(umap_data, 0, 1, 'louvain', 'UMAP - Louvain')

# sc.pl.umap(adata, color=["louvain"])

CPU times: user 3.74 s, sys: 32 ms, total: 3.77 s
Wall time: 3.76 s


We can also use the Leiden clustering method in RAPIDS. This method has not been integrated into Scanpy and needs to be called separately.

In [41]:
%%time

adata.obs['leiden'] = rapids_scanpy_funcs.leiden(adata)

CPU times: user 124 ms, sys: 80 ms, total: 204 ms
Wall time: 203 ms


In [42]:
%%time

umap_data['leiden'] = adata.obs['leiden'].astype(int)
show_scatter(umap_data, 0, 1, 'leiden', 'UMAP - Leiden')

# sc.pl.umap(adata, color=["leiden"])

CPU times: user 3.87 s, sys: 56 ms, total: 3.93 s
Wall time: 3.93 s


And we can visualize the cells labeled by expression of the `Stmn2` and `Hes1` marker genes, for neuronal and glial cells respectively.

In [43]:
%%time
for marker_gene in marker_genes_raw:
    umap_data[marker_gene] = marker_genes_raw[marker_gene].todense()

show_scatter_grad(umap_data, 0, 1, 'Stmn2_raw', 'UMAP - Stmn2')
show_scatter_grad(umap_data, 0, 1, 'Hes1_raw', 'UMAP - Hes1')

# sc.pl.umap(adata, color=["Stmn2_raw"], color_map="Blues", vmax=1, vmin=-0.05)
# sc.pl.umap(adata, color=["Hes1_raw"], color_map="Blues", vmax=1, vmin=-0.05)

CPU times: user 460 ms, sys: 4 ms, total: 464 ms
Wall time: 461 ms


## Create Zoomed View

The speedup offered by RAPIDS makes it easy to interactively re-analyze subsets of cells. To illustrate this, we select glial cells (Hes1+) from the dataset.

In [44]:
reanalysis_start = time.time()

In [45]:
%%time
hes1_cells = marker_genes_raw["Hes1_raw"] > 0.0
hes1_cells = cp.where(hes1_cells.todense())[0]

# TODO: This looks like a work-around for errors in ANNDATA GPU version
genes = adata.var_names
obs = adata.obs
adata1 = anndata.AnnData(adata.X[hes1_cells,:])
adata1.var_names = genes
adata1.obs = obs.take(hes1_cells)
print(adata1.shape)
print(hes1_cells.shape)

obs.shape, hes1_cells.shape

(9104, 4000)
(9104,)
CPU times: user 24 ms, sys: 8 ms, total: 32 ms
Wall time: 31.9 ms


((99109, 6), (9104,))
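The selection pattern used above, a boolean condition turned into row indices that then slice the matrix, can be illustrated with NumPy on made-up values:

```python
import numpy as np

# Hypothetical raw Hes1 expression for 5 cells, and a 5 x 3 expression matrix.
hes1_raw = np.array([0.0, 1.2, 0.0, 0.4, 2.5])
expr = np.arange(15, dtype=float).reshape(5, 3)

# Row indices of cells with non-zero Hes1 expression.
hes1_cells = np.where(hes1_raw > 0.0)[0]

# Keep only those rows; all columns (genes) are retained.
subset = expr[hes1_cells, :]

print(hes1_cells, subset.shape)
```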

We can repeat the dimension reduction, clustering and visualization using this subset of cells in seconds. Here, we can perform PCA for all of the selected cells on a single GPU.

In [46]:
adata.obs

Unnamed: 0,Stmn2_raw,Hes1_raw,Olig1_raw,kmeans,louvain,leiden
0,0.000000,0.0,0.000000,16,6,6
1,0.000000,0.0,0.000000,10,24,24
2,0.000000,0.0,0.000000,7,2,2
3,3.652791,0.0,0.000000,0,4,4
4,2.722037,0.0,0.000000,1,0,0
...,...,...,...,...,...,...
99104,3.664146,0.0,0.000000,34,17,17
99105,0.000000,0.0,3.499097,26,18,18
99106,2.986721,0.0,0.000000,9,5,5
99107,0.000000,0.0,0.000000,8,26,26


In [47]:
%%time
# TODO: the calls in this cell operate on `adata`; the Hes1+ subset created above is `adata1`.
adata.obsm["X_pca"] = PCA(n_components=n_components, output_type="numpy").fit_transform(adata.X)
sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=knn_n_pcs, method='rapids')
sc.tl.umap(adata, min_dist=umap_min_dist, spread=umap_spread, method='rapids')
adata.obs['leiden'] = rapids_scanpy_funcs.leiden(adata)



CPU times: user 5.29 s, sys: 2.38 s, total: 7.66 s
Wall time: 7.57 s


Finally, we visualize the selected glial cells labeled by their new clusters, and by the expression of `Olig1`, a marker gene for oligodendrocytes.

In [None]:
%%time
adata.obs.reset_index()

adata.obs['umap_0'] = adata.obsm['X_umap'][:, 0]
adata.obs['umap_1'] = adata.obsm['X_umap'][:, 1]

show_scatter(adata.obs, 'umap_0', 'umap_1', 'leiden', 'UMAP - leiden')
show_scatter_grad(adata.obs, 'umap_0', 'umap_1', 'Olig1_raw', 'UMAP - Olig1')

# sc.pl.umap(adata, color=["leiden"])
# sc.pl.umap(adata, color=["Olig1_raw"], color_map="Blues", vmax=1, vmin=-0.05)

In [49]:
reanalysis_time = time.time()
print("Total reanalysis time : %s" % (reanalysis_time-reanalysis_start))

Total reanalysis time : 8.287331819534302


In [50]:
print("Full time: %s" % (time.time() - start))

Full time: 267.26948595046997
