# FacsimiLab for scRNAseq

This Jupyter notebook tests the FacsimiLab Docker container's ability to analyze single-cell RNA sequencing (scRNAseq) data. It uses `scvi`, `scanpy`, and `pytorch`.


This Jupyter notebook is a modification of the scverse tutorial [Introduction to scvi-tools](https://docs.scvi-tools.org/en/stable/tutorials/notebooks/quick_start/api_overview.html). The original source code is available [on GitHub](https://github.com/scverse/scvi-tutorials/blob/c62f43f1c8c58710d99afe2e0d374c17a587b566/quick_start/api_overview.ipynb). We'd like to thank the YosefLab for their incredible tools and resources. The tutorial notebook is licensed under the **BSD 3-Clause License**, and a complete copy of the license can be found at the end of this notebook.


In [3]:
import os
import sys
import tempfile
from IPython.display import display, Markdown

import scanpy as sc
import scvi
import torch

from scipy.sparse import csr_matrix
import pandas as pd
import numpy as np

import jax
import jaxlib
import flax

In [4]:
# Check whether PyTorch has successfully detected and loaded an Nvidia GPU with CUDA support
if torch.cuda.is_available():

    display(Markdown("## Facsimilab: Nvidia CUDA GPU Detected"))
    display(Markdown(f"GPU Name: {torch.cuda.get_device_name(0)}"))
    display(Markdown(f"GPU Available: {torch.cuda.is_available()}"))

    display(Markdown("### System Information"))

    display(
        Markdown(
            f"- Python version: `{sys.version}` \n - PyTorch version: `{torch.__version__}`\n - CUDNN version: `{torch.backends.cudnn.version()}`\n - Number CUDA Devices: `{torch.cuda.device_count()}`"
        )
    )

    display(Markdown("### Devices"))

    display(
        Markdown(
            f"- Available devices `{torch.cuda.device_count()}`\n - Active CUDA device: `{torch.cuda.current_device()}`"
        )
    )

    display(
        Markdown(
            "Python starts numbering from '0'. Therefore, the `Active CUDA device` name/number is expected to be `0` above."
        )
    )

else:
    display(Markdown("## No CUDA GPU Detected"))
    display(
        Markdown(
            "This notebook will use the CPU instead of the GPU. Analysis time is expected to be _**significantly longer, but still possible.**_"
        )
    )

    display(Markdown(f"GPU Available: {torch.cuda.is_available()}"))

    display(Markdown("### System Information"))

    display(
        Markdown(
            f"- Python version: `{sys.version}` \n - PyTorch version: `{torch.__version__}`\n - CUDNN version: `{torch.backends.cudnn.version()}`\n - Number CUDA Devices: `{torch.cuda.device_count()}`"
        )
    )

## Facsimilab: Nvidia CUDA GPU Detected

GPU Name: NVIDIA GeForce RTX 3060

GPU Available: True

### System Information

- Python version: `3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]` 
 - PyTorch version: `2.4.0`
 - CUDNN version: `90100`
 - Number CUDA Devices: `1`

### Devices

- Available devices `1`
 - Active CUDA device: `0`

Python starts numbering from '0'. Therefore, the `Active CUDA device` name/number is expected to be `0` above.

In [5]:
scvi.settings.seed = 0
sc.set_figure_params(figsize=(4, 4))
torch.set_float32_matmul_precision("medium")
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
%config InlineBackend.figure_format='retina'

Global seed set to 0


## Loading and preparing data

Let us first load a subsampled version of the heart cell atlas dataset described in Litviňuková et al. (2020). scvi-tools has many "built-in" datasets as well as support for loading arbitrary `.csv`, `.loom`, and `.h5ad` (AnnData) files. See the scvi-tools tutorial on data loading for more examples.

-   Litviňuková, M., Talavera-López, C., Maatz, H., Reichart, D., Worth, C. L., Lindberg, E. L., ... & Teichmann, S. A. (2020). Cells of the adult human heart. Nature, 588(7838), 466-472.

```{important}
All scvi-tools models require AnnData objects as input.
```


In [16]:
data_directory = "./data"
verbosity = True

In [26]:
adata = scvi.data.heart_cell_atlas_subsampled(save_path=data_directory)
adata.write_h5ad(os.path.join(data_directory, "heart_cell_atlas_supersubsampled.h5ad"))
adata

INFO     File ./data/hca_subsampled_20k.h5ad already downloaded


AnnData object with n_obs × n_vars = 18641 × 26662
    obs: 'NRP', 'age_group', 'cell_source', 'cell_type', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'source', 'type', 'version', 'cell_states', 'Used'
    var: 'gene_ids-Harvard-Nuclei', 'feature_types-Harvard-Nuclei', 'gene_ids-Sanger-Nuclei', 'feature_types-Sanger-Nuclei', 'gene_ids-Sanger-Cells', 'feature_types-Sanger-Cells', 'gene_ids-Sanger-CD45', 'feature_types-Sanger-CD45', 'n_counts'
    uns: 'cell_type_colors'

In [14]:
# adata = adata[0:1000].copy()
# adata.write_h5ad(f"./data/heart_cell_atlas_supersubsampled.h5ad")

In [15]:
# Train the model
scvi.model.SCVI.setup_anndata(adata)
vae = scvi.model.SCVI(adata)
vae.train()

solo = scvi.external.SOLO.from_scvi_model(vae)
solo.train()

# See if we have doublets
doublets = solo.predict()
doublets["prediction"] = solo.predict(soft=False)

# Drop the last two characters of each barcode (intended to strip a trailing "-1" suffix)
doublets.index = doublets.index.str[:-2]

if verbosity:
    display(doublets)

# Count the number of predicted doublets and singlets
display(doublets.groupby("prediction").count())

# Create a doublet "difference" score in `doublets["DSS"]`
doublets["DSS"] = doublets["doublet"] - doublets["singlet"]
if verbosity:
    display(doublets)

CUDA backend failed to initialize: Unable to use CUDA because of the following issues with CUDA components:
Outdated cuDNN installation found.
Version JAX was built against: 8907
Minimum supported: 9100
Installed version: 8907
The local installation version must be no lower than 9100..(Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Epoch 400/400: 100%|█| 400/400 [01:08<00:00,  6.12it/s, v_num=1, train_loss_step=3.9e+3, train_los

`Trainer.fit` stopped: `max_epochs=400` reached.


Epoch 400/400: 100%|█| 400/400 [01:08<00:00,  5.87it/s, v_num=1, train_loss_step=3.9e+3, train_los
INFO     Creating doublets, preparing SOLO model.


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Epoch 339/400:  85%|▊| 339/400 [00:26<00:04, 12.85it/s, v_num=1, train_loss_step=0.237, train_loss
Monitored metric validation_loss did not improve in the last 30 records. Best score: 0.357. Signaling Trainer to stop.




NameError: name 'verbosity' is not defined
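
The scoring step in the cell above can be illustrated on a toy table. This is a minimal sketch, assuming (as in the merged table later in this notebook) that the `doublet` and `singlet` columns are probabilities that sum to one. The barcodes and values are made up, `score_doublets` is a hypothetical helper that is not part of scvi-tools, and taking the larger of the two values as the hard call is an assumption about what `solo.predict(soft=False)` returns.

```python
import pandas as pd


def score_doublets(predictions: pd.DataFrame) -> pd.DataFrame:
    """Add a hard call and a doublet-minus-singlet score to a table of soft predictions."""
    scored = predictions.copy()
    # Hard call (assumption): the column with the larger value wins
    scored["prediction"] = scored[["doublet", "singlet"]].idxmax(axis=1)
    # Difference score: positive values favour "doublet", negative values favour "singlet"
    scored["DSS"] = scored["doublet"] - scored["singlet"]
    return scored


# Made-up soft predictions for three hypothetical barcodes
toy = pd.DataFrame(
    {"doublet": [0.06, 0.75, 0.42], "singlet": [0.94, 0.25, 0.58]},
    index=["cell_a", "cell_b", "cell_c"],
)
scored = score_doublets(toy)
print(scored)
print(scored["prediction"].value_counts())
```

A score near zero marks a borderline call, so sorting by `DSS` is one simple way to review the least certain cells first.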

In [18]:
sample_name = "Heart-Subsampled"

# Create a new column to contain a cell barcode starting with the sample name
adata.obs["Cell_Barcode"] = sample_name
# Append the index (cell barcode) to the sample name in each row
adata.obs['Cell_Barcode'] = adata.obs['Cell_Barcode'].map(str) + "_" + adata.obs.index

# Drop the last two characters of each barcode (intended to strip a trailing "-1" suffix)
adata.obs['Cell_Barcode'] = adata.obs['Cell_Barcode'].str[:-2]

# Confirm the number of unique barcodes (should equal the number of rows)
display(f"All `adata.obs` rows have a unique barcode: {len(adata.obs['Cell_Barcode'].unique()) == adata.obs.shape[0]} ({len(adata.obs['Cell_Barcode'].unique())} cells barcoded)")

# Create a new column to contain a cell barcode starting with the sample name
doublets['Cell_Barcode'] = sample_name

# Append the index (cell barcode) to the sample name in each row
doublets['Cell_Barcode'] = doublets['Cell_Barcode'].map(str) + "_" + doublets.index


# Confirm the number of unique barcodes (should equal the number of rows)
display(f"All `doublets` rows have a unique barcode: {len(doublets['Cell_Barcode'].unique()) == doublets.shape[0]} ({len(doublets['Cell_Barcode'].unique())} cells barcoded)")

# Confirm that the doublets dataframe has the same barcodes as the adata.obs dataframe
display(f"Do adata.obs and doublets have the same barcodes?\n{doublets['Cell_Barcode'].isin(adata.obs['Cell_Barcode']).value_counts()}")

# Merge the doublets dataframe into adata.obs
adata.obs = pd.merge(adata.obs, doublets, on='Cell_Barcode')

# Make the cell barcodes the index (set_index returns a new DataFrame, so assign the result)
adata.obs = adata.obs.set_index('Cell_Barcode')
adata.obs

'All `adata.obs` rows have a unique barcode: True (1000 cells barcoded)'

'All `doublets` rows have a unique barcode: True (1000 cells barcoded)'

'Do adata.obs and doublets have the same barcodes?\nCell_Barcode\nTrue    1000\nName: count, dtype: int64'

| Cell_Barcode | NRP | age_group | cell_source | cell_type | donor | gender | n_counts | n_genes | percent_mito | percent_ribo | ... | source | type | version | cell_states | Used | _scvi_batch | _scvi_labels | doublet | singlet | prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Heart-Subsampled_AACTCCCCACGAGAGT-1-HCAHeart78440 | Yes | 65-70 | Sanger-CD45 | Myeloid | D6 | Male | 1420.0 | 738 | 0.054930 | 0.064789 | ... | CD45+ | DCD | V2 | LYVE1+MØ1 | Yes | 0 | 0 | 0.063017 | 0.936983 | singlet |
| Heart-Subsampled_ATAACGCAGAGCTGGT-1-HCAHeart78299 | No | 70-75 | Sanger-Nuclei | Ventricular_Cardiomyocyte | D4 | Female | 844.0 | 505 | 0.001185 | 0.001185 | ... | Nuclei | DCD | V2 | vCM1 | Yes | 0 | 0 | 0.010113 | 0.989887 | singlet |
| Heart-Subsampled_GTCAAGTCATGCCACG-1-HCAHeart77028 | Yes | 60-65 | Sanger-Nuclei | Fibroblast | D2 | Male | 1491.0 | 862 | 0.000000 | 0.005366 | ... | Nuclei | DCD | V2 | FB2 | Yes | 0 | 0 | 0.323940 | 0.676060 | singlet |
| Heart-Subsampled_GGTGATTCAAATGAGT-1-HCAHeart81028 | Yes | 60-65 | Sanger-CD45 | Endothelial | D11 | Female | 2167.0 | 1115 | 0.064144 | 0.027227 | ... | CD45+ | DCD | V3 | EC10_CMC-like | Yes | 0 | 0 | 0.390160 | 0.609840 | singlet |
| Heart-Subsampled_AGAGAATTCTTAGCAG-1-HCAHeart81028 | Yes | 60-65 | Sanger-Cells | Endothelial | D11 | Female | 7334.0 | 2505 | 0.093537 | 0.040496 | ... | Cells | DCD | V3 | EC5_art | Yes | 0 | 0 | 0.416016 | 0.583984 | singlet |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Heart-Subsampled_GAAACTCCAATGGAAT-1-HCAHeart78299 | No | 70-75 | Sanger-Nuclei | Endothelial | D4 | Female | 527.0 | 395 | 0.001898 | 0.001898 | ... | Nuclei | DCD | V2 | EC5_art | Yes | 0 | 0 | 0.002073 | 0.997927 | singlet |
| Heart-Subsampled_ATAGACCGTTTCACTT-1-HCAHeart81028 | Yes | 60-65 | Sanger-CD45 | Pericytes | D11 | Female | 5144.0 | 1979 | 0.174767 | 0.034798 | ... | CD45+ | DCD | V3 | PC3_str | Yes | 0 | 0 | 0.746039 | 0.253961 | doublet |
| Heart-Subsampled_GACCAATAGCCACGTC-1-HCAHeart78505 | Yes | 60-65 | Sanger-CD45 | Lymphoid | D7 | Male | 1850.0 | 940 | 0.018378 | 0.039459 | ... | CD45+ | DCD | V2 | NK | Yes | 0 | 0 | 0.167472 | 0.832528 | singlet |
| Heart-Subsampled_TCTACATTCGAGAACG-1-HCAHeart79053 | Yes | 65-70 | Sanger-Cells | Pericytes | D6 | Male | 5860.0 | 2465 | 0.024573 | 0.051195 | ... | Cells | DCD | V3 | PC3_str | Yes | 0 | 0 | 0.590734 | 0.409266 | doublet |
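
The barcode bookkeeping in the cell above can be sketched on toy data. The barcodes and values below are made up, and `add_sample_barcode` is a hypothetical helper. The sketch also passes `validate="one_to_one"` to `pd.merge`, which the original cell does not do, so that pandas raises an error if a barcode is duplicated on either side.

```python
import pandas as pd


def add_sample_barcode(df: pd.DataFrame, sample_name: str) -> pd.DataFrame:
    """Return a copy with a `Cell_Barcode` column of the form '<sample>_<index>'."""
    out = df.copy()
    out["Cell_Barcode"] = sample_name + "_" + out.index.astype(str)
    return out


# Made-up cell metadata and doublet calls sharing the same index
obs = pd.DataFrame({"cell_type": ["Myeloid", "Fibroblast"]}, index=["AAAC-1", "GGTT-1"])
calls = pd.DataFrame({"prediction": ["singlet", "doublet"]}, index=["AAAC-1", "GGTT-1"])

obs = add_sample_barcode(obs, "Toy-Sample")
calls = add_sample_barcode(calls, "Toy-Sample")

# Every row should carry a unique barcode before merging
assert obs["Cell_Barcode"].is_unique and calls["Cell_Barcode"].is_unique

# validate="one_to_one" makes pandas raise MergeError on duplicated keys
merged = pd.merge(obs, calls, on="Cell_Barcode", validate="one_to_one")
# set_index returns a new frame, so the result has to be assigned
merged = merged.set_index("Cell_Barcode")
print(merged)
```

Prefixing barcodes with the sample name keeps them unique when several samples are later combined into one table.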


In [27]:
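# Basic quality control
# Keep genes detected in at least 3 cells
sc.pp.filter_genes(adata, min_cells=3)
# Keep cells with at least 3000 detected genes
sc.pp.filter_cells(adata, min_genes=3000)

# Note: this is an incomplete set of QC steps. The goal is only to show that scanpy is operational.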
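
As a rough illustration of what the two filters do, the sketch below applies the same kind of thresholds to a small dense count matrix with NumPy. It is not scanpy's implementation. The matrix and the thresholds (`min_cells = 2`, `min_genes = 2`) are made up for the example, and the order (genes first, then cells) follows the cell above.

```python
import numpy as np

# Made-up count matrix: 4 cells (rows) x 5 genes (columns)
counts = np.array(
    [
        [5, 0, 1, 0, 2],
        [3, 0, 0, 0, 1],
        [0, 0, 0, 0, 4],
        [2, 1, 3, 0, 0],
    ]
)

min_cells = 2  # a gene must be detected in at least this many cells
min_genes = 2  # a cell must have at least this many detected genes

# Gene filter: count the cells in which each gene has a non-zero count
cells_per_gene = (counts > 0).sum(axis=0)
gene_mask = cells_per_gene >= min_cells
filtered = counts[:, gene_mask]

# Cell filter: count the remaining genes detected in each cell
genes_per_cell = (filtered > 0).sum(axis=1)
cell_mask = genes_per_cell >= min_genes
filtered = filtered[cell_mask, :]

print("genes kept:", gene_mask.sum(), "cells kept:", cell_mask.sum())
print(filtered)
```

Because the cell filter runs on the already gene-filtered matrix, a stricter gene threshold can also reduce the number of cells that pass.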

In [28]:
# Normalize and log-transform
adata.layers["counts"] = adata.X.copy()  # preserve the raw counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata  # freeze the normalized, log-transformed state in `.raw`
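
The arithmetic behind these two steps can be sketched with NumPy. This is an illustration on a made-up matrix, not scanpy's implementation: each cell's counts are rescaled so that they sum to `target_sum`, and the result is passed through `log(1 + x)`.

```python
import numpy as np

# Made-up raw counts: 3 cells (rows) x 4 genes (columns)
counts = np.array(
    [
        [10.0, 0.0, 30.0, 60.0],
        [1.0, 1.0, 0.0, 2.0],
        [0.0, 5.0, 5.0, 0.0],
    ]
)
target_sum = 1e4

# Keep an untouched copy of the raw counts, as the notebook does in `layers["counts"]`
raw_counts = counts.copy()

# Library-size normalization: scale every cell so its counts sum to target_sum
# (this toy matrix has no empty cells, so there is no division by zero)
cell_totals = counts.sum(axis=1, keepdims=True)
normalized = counts / cell_totals * target_sum

# Log transform with a pseudocount of 1, so zeros stay at zero
log_normalized = np.log1p(normalized)

print(normalized.sum(axis=1))
print(log_normalized.round(3))
```

Keeping the raw counts matters here because the `seurat_v3` flavor used in the next cell reads them from the `counts` layer rather than from the normalized matrix.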

In [29]:
sc.pp.highly_variable_genes(
    adata,
    n_top_genes=1200,
    subset=True,
    layer="counts",
    flavor="seurat_v3",
    batch_key="cell_source",
)

ValueError: b'reciprocal condition number  1.3526e-16\n'
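
The error above is raised during highly variable gene selection. One possible cause, which this notebook does not confirm, is that the strict `min_genes=3000` cell filter leaves some `cell_source` batches with very few cells for the per-batch fit. A quick check is to count the cells per batch before calling `sc.pp.highly_variable_genes`. The sketch below does this on a made-up `obs` table. `small_batches` and its `min_cells_per_batch` threshold are hypothetical names chosen for illustration.

```python
import pandas as pd


def small_batches(obs: pd.DataFrame, batch_key: str, min_cells_per_batch: int) -> pd.Series:
    """Return the cell counts of batches that have fewer than `min_cells_per_batch` cells."""
    counts = obs[batch_key].value_counts()
    return counts[counts < min_cells_per_batch]


# Made-up metadata standing in for `adata.obs` after filtering
toy_obs = pd.DataFrame(
    {"cell_source": ["Sanger-Cells"] * 6 + ["Sanger-Nuclei"] * 2 + ["Sanger-CD45"] * 1}
)

flagged = small_batches(toy_obs, batch_key="cell_source", min_cells_per_batch=3)
print(flagged)
```

If the real `adata.obs` shows batches this small, relaxing the cell filter or dropping `batch_key` are two options worth testing.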