# Testing the FacsimiLab Container

This Jupyter notebook tests the FacsimiLab Docker container's ability to analyze single-cell RNA sequencing (scRNA-seq) data. It uses `scvi`, `scanpy`, and `pytorch`.


This Jupyter notebook is a modification of the scverse tutorial [Introduction to scvi-tools](https://docs.scvi-tools.org/en/stable/tutorials/notebooks/quick_start/api_overview.html). The original source code is available [on GitHub](https://github.com/scverse/scvi-tutorials/blob/c62f43f1c8c58710d99afe2e0d374c17a587b566/quick_start/api_overview.ipynb). We'd like to thank the YosefLab for their incredible tools and resources. The original tutorial notebook is licensed under the **BSD 3-Clause License**, and a complete copy of that license can be found at the end of this notebook.


In [None]:
import os
import sys
import tempfile
from IPython.display import display, Markdown

import scanpy as sc
import scvi
import torch

In [None]:
import jax
import jaxlib
import flax

# Print library versions
print("JAX version:", jax.__version__)
print("JAXlib version:", jaxlib.__version__)
print("Flax version:", flax.__version__)
print("PyTorch version:", torch.__version__)

print("PyTorch CUDA version:", torch.version.cuda)
print("scVI version:", scvi.__version__)
print("Scanpy version:", sc.__version__)

## GPU accelerated analysis

The following code checks whether an Nvidia GPU with CUDA support is available.


In [None]:
# Check if PyTorch has successfully detected and loaded an Nvidia GPU with CUDA support
if torch.cuda.is_available():

    display(Markdown("## FacsimiLab: Nvidia CUDA GPU Detected"))
    display(Markdown(f"GPU Name: {torch.cuda.get_device_name(0)}"))
    display(Markdown(f"GPU Available: {torch.cuda.is_available()}"))

    display(Markdown("### System Information"))

    display(Markdown(f"- Python version: `{sys.version}` \n - PyTorch version: `{torch.__version__}`\n - CUDNN version: `{torch.backends.cudnn.version()}`\n - Number CUDA Devices: `{torch.cuda.device_count()}`"))

    display(Markdown("### Devices"))

    display(Markdown(f"- Available devices `{torch.cuda.device_count()}`\n - Active CUDA device: `{torch.cuda.current_device()}`"))

    display(Markdown("Python starts numbering from '0'. Therefore, the `Active CUDA device` name/number is expected to be `0` above."))

else:
    display(Markdown("## No CUDA GPU Detected"))
    display(Markdown("This notebook will use the CPU instead of the GPU. Analysis time is expected to be _**significantly longer, but still possible.**_"))

    display(Markdown(f"GPU Available: {torch.cuda.is_available()}"))

    display(Markdown("### System Information"))

    display(Markdown(f"- Python version: `{sys.version}` \n - PyTorch version: `{torch.__version__}`\n - CUDNN version: `{torch.backends.cudnn.version()}`\n - Number CUDA Devices: `{torch.cuda.device_count()}`"))
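
The same check can be reduced to a single device string for reuse elsewhere in a notebook. The helper name `pick_device` below is hypothetical (it is not part of PyTorch or FacsimiLab); this is only a minimal sketch that falls back to the CPU when PyTorch is missing or reports no CUDA device.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Selected device: {device}")
```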

## Initialize `scvi`


In [None]:
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)

`save_dir` below is a temporary directory for scratch files. To change where the dataset and the trained model are saved, modify `data_directory` further down.


In [None]:
sc.set_figure_params(figsize=(4, 4))
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()

# for white background of figures (only for docs rendering)
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
%config InlineBackend.figure_format='retina'
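
Note that `tempfile.TemporaryDirectory()` returns an object rather than a path string. The sketch below, which uses a separate hypothetical variable `scratch_dir` and a made-up file name, shows how the path is read from `.name` and what `cleanup()` removes.

```python
import os
import tempfile

scratch_dir = tempfile.TemporaryDirectory()

# The directory path is exposed through the `.name` attribute.
example_file = os.path.join(scratch_dir.name, "example.txt")
with open(example_file, "w") as handle:
    handle.write("temporary data")

created = os.path.exists(example_file)

# cleanup() deletes the directory and everything inside it.
scratch_dir.cleanup()
removed = not os.path.exists(example_file)
print(created, removed)
```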

## Loading and preparing data

Let us first load a subsampled version of the heart cell atlas dataset described in Litviňuková et al. (2020). scvi-tools has many "built-in" datasets as well as support for loading arbitrary `.csv`, `.loom`, and `.h5ad` (AnnData) files. Please see our tutorial on data loading for more examples.

-   Litviňuková, M., Talavera-López, C., Maatz, H., Reichart, D., Worth, C. L., Lindberg, E. L., ... & Teichmann, S. A. (2020). Cells of the adult human heart. Nature, 588(7838), 466-472.


```{important}
All scvi-tools models require AnnData objects as input.
```


In [None]:
data_directory = "./data"

In [None]:
adata = scvi.data.heart_cell_atlas_subsampled(save_path=data_directory)
adata

In [None]:
# Save a local copy of the dataset next to the downloaded files
adata.write_h5ad(os.path.join(data_directory, "heart_cell_atlas_subsampled.h5ad"))

In [None]:
# Set to False to hide the intermediate tables displayed by detect_doublets
verbosity = True

In [None]:
def detect_doublets(sample_name, sample_file_path):

    adata = sc.read_h5ad(sample_file_path)
    display(f'Loaded: {sample_file_path}')

    # Train the model
    scvi.model.SCVI.setup_anndata(adata)
    vae = scvi.model.SCVI(adata)
    vae.train()

    solo = scvi.external.SOLO.from_scvi_model(vae)
    solo.train()

    # See if we have doublets
    doublets = solo.predict()
    doublets["prediction"] = solo.predict(soft=False)

    # Strip off the "-1" which is on the barcodes
    doublets.index = doublets.index.map(lambda x: x[:-2])

    if verbosity:
        display(doublets)

    # Count the number of doublets
    display(doublets.groupby("prediction").count())

    # Create a doublet "difference" score in `doublets["DSS"]`
    doublets["DSS"] = doublets["doublet"] - doublets["singlet"]
    if verbosity:
        display(doublets)



    # Create a new column to contain a cell barcode starting with the sample name
    adata.obs['Cell_Barcode'] = sample_name
    # Append the index (cell barcode) to the sample name in each row
    adata.obs['Cell_Barcode'] = adata.obs['Cell_Barcode'].map(str) + "_" + adata.obs.index

    # Strip off the "-1" which is on the barcodes
    adata.obs['Cell_Barcode'] = adata.obs['Cell_Barcode'].map(lambda x: x[:-2])

    # Confirm the number of unique barcodes (should equal the number of rows)
    display(f"All `adata.obs` rows have a unique barcode: {len(adata.obs['Cell_Barcode'].unique()) == adata.obs.shape[0]} ({len(adata.obs['Cell_Barcode'].unique())} cells barcoded)")

    # Create a new column to contain a cell barcode starting with the sample name
    doublets['Cell_Barcode'] = sample_name

    # Append the index (cell barcode) to the sample name in each row
    doublets['Cell_Barcode'] = doublets['Cell_Barcode'].map(str) + "_" + doublets.index


    # Confirm the number of unique barcodes (should equal the number of rows)
    display(f"All `doublets` rows have a unique barcode: {len(doublets['Cell_Barcode'].unique()) == doublets.shape[0]} ({len(doublets['Cell_Barcode'].unique())} cells barcoded)")

    # Confirm that the doublets dataframe has the same barcodes as the adata.obs dataframe
    display(f"Do adata.obs and doublets have the same barcodes?\n{doublets['Cell_Barcode'].isin(adata.obs['Cell_Barcode']).value_counts()}")

    # Merge the doublets dataframe into adata.obs
    adata.obs = pd.merge(adata.obs, doublets, on='Cell_Barcode')

    # Make the cell barcodes the index column (set_index returns a new DataFrame)
    adata.obs = adata.obs.set_index('Cell_Barcode')

    return adata, doublets

In [None]:
import pandas as pd  # required by detect_doublets

In [None]:
detect_doublets("heart1", "./data/heart_cell_atlas_subsampled.h5ad")
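
The barcode-prefixing and merge steps inside `detect_doublets` can be illustrated on toy data. The barcodes, scores, and cell types below are made up, and the two small frames only stand in for `adata.obs` and the SOLO prediction table. The sketch also shows that `pd.merge` resets the index and that `set_index` returns a new frame that must be assigned back.

```python
import pandas as pd

sample_name = "heart1"

# Toy stand-ins for `adata.obs` and the SOLO prediction table.
obs = pd.DataFrame(
    {"cell_type": ["Fibroblast", "Endothelial"]},
    index=["AAAC-1", "TTTG-1"],
)
doublets = pd.DataFrame(
    {"doublet": [0.9, 0.2], "singlet": [0.1, 0.8]},
    index=["AAAC-1", "TTTG-1"],
)

# Prefix each barcode with the sample name and drop the trailing "-1".
for frame in (obs, doublets):
    frame["Cell_Barcode"] = [f"{sample_name}_{barcode[:-2]}" for barcode in frame.index]

# Doublet-minus-singlet score, as in the function above.
doublets["DSS"] = doublets["doublet"] - doublets["singlet"]

# merge() returns a frame with a fresh integer index; set_index() returns a
# new frame, so its result has to be assigned.
merged = pd.merge(obs, doublets, on="Cell_Barcode")
merged = merged.set_index("Cell_Barcode")
print(merged)
```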

Now we preprocess the data to remove, for example, genes that are very lowly expressed and other outliers. For these tasks we prefer the [Scanpy preprocessing module](https://scanpy.readthedocs.io/en/stable/api.html#module-scanpy.pp).


In [None]:
sc.pp.filter_genes(adata, min_counts=3)
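
To make the `min_counts=3` threshold concrete, here is a small NumPy sketch on a made-up dense matrix (cells in rows, genes in columns). It only mimics the idea of keeping genes whose total count across all cells reaches the threshold; it is not Scanpy's implementation.

```python
import numpy as np

# Toy count matrix: 3 cells x 4 genes.
counts = np.array([
    [0, 1, 4, 0],
    [1, 1, 2, 0],
    [0, 1, 7, 2],
])

min_counts = 3
total_per_gene = counts.sum(axis=0)   # total counts for each gene
keep = total_per_gene >= min_counts   # genes that reach the threshold
filtered = counts[:, keep]

print(total_per_gene, keep)
```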

In scRNA-seq analysis, it's popular to normalize the data. These values are not used by scvi-tools, but given their popularity in other tasks as well as for visualization, we store them in the anndata object separately (via the `.raw` attribute).


```{important}
Unless otherwise specified, scvi-tools models require the raw counts (not log library size normalized). scvi-tools models will run for non-negative real-valued data, but we strongly suggest checking that these possibly non-count values are intended to represent pseudocounts (e.g. SoupX-corrected counts), and not some other normalized data, in which the variance/covariance structure of the data has changed dramatically.
```
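
A quick sanity check before training can catch a matrix that has already been normalized. The helper `looks_like_counts` below is a hypothetical name and a rough heuristic for dense arrays only: it accepts non-negative integer values and would reject legitimate non-integer pseudocounts, so treat a `False` result as a prompt to inspect the data rather than as an error.

```python
import numpy as np


def looks_like_counts(matrix) -> bool:
    """Heuristic: True when every value is a non-negative integer."""
    values = np.asarray(matrix, dtype=float)
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


raw = np.array([[0, 3, 1], [2, 0, 5]])
logged = np.log1p(raw)  # log-transformed values are no longer integers

print(looks_like_counts(raw), looks_like_counts(logged))
```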


In [None]:
adata.layers["counts"] = adata.X.copy()  # preserve counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata  # freeze the state in `.raw`
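
The arithmetic behind these two steps can be sketched in NumPy on a made-up dense matrix, assuming the default behaviour in which every cell is scaled to the same total (`target_sum`) and then transformed with the natural logarithm of one plus the value.

```python
import numpy as np

# Toy count matrix: 2 cells x 3 genes.
counts = np.array([[1.0, 3.0, 6.0], [0.0, 5.0, 5.0]])
target_sum = 1e4

# Scale each cell (row) so that its counts sum to target_sum.
per_cell_total = counts.sum(axis=1, keepdims=True)
normalized = counts / per_cell_total * target_sum

# log1p computes log(1 + x), which keeps zeros at zero.
log_normalized = np.log1p(normalized)

print(normalized.sum(axis=1))
```

Because `log1p` is invertible with `np.expm1`, the normalized values can be recovered from the log-transformed matrix, but the original counts cannot be recovered without the per-cell totals. This is one reason the raw counts are kept in a separate layer.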

Finally, we perform feature selection, to reduce the number of features (genes in this case) used as input to the scvi-tools model. For best practices of how/when to perform feature selection, please refer to the model-specific tutorial. For scVI, we recommend anywhere from 1,000 to 10,000 HVGs, but it will be context-dependent.


In [None]:
sc.pp.highly_variable_genes(
    adata,
    n_top_genes=1200,
    subset=True,
    layer="counts",
    flavor="seurat_v3",
    batch_key="cell_source",
)

Now it's time to run `setup_anndata()`, which alerts scvi-tools to the locations of various matrices inside the anndata. It's important to run this function with the correct arguments so scvi-tools is notified that your dataset has batches, annotations, etc. For example, if batches are registered with scvi-tools, the subsequent model will correct for batch effects. See the full documentation for details.

In this dataset, there is a "cell_source" categorical covariate, and within each "cell_source", multiple "donors", "gender" and "age_group". There are also two continuous covariates we'd like to correct for: "percent_mito" and "percent_ribo". Categorical covariates are registered with the `categorical_covariate_keys` argument and continuous covariates with the `continuous_covariate_keys` argument. If you only have one categorical covariate, you can use the `batch_key` argument instead.


In [None]:
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    categorical_covariate_keys=["cell_source", "donor"],
    continuous_covariate_keys=["percent_mito", "percent_ribo"],
)

```{warning}
If the adata is modified after running `setup_anndata`, please run `setup_anndata` again, before creating an instance of a model.
```


## Creating and training a model

While we highlight the scVI model here, the API is consistent across all scvi-tools models and is inspired by that of [scikit-learn](https://scikit-learn.org/stable/). For a full list of options, see the scvi [documentation](https://scvi-tools.org).


In [None]:
model = scvi.model.SCVI(adata)

We can see an overview of the model by printing it.


In [None]:
model

```{important}
All scvi-tools models run faster when using a GPU. By default, scvi-tools will use a GPU if one is found to be available. Please see the installation page for more information about installing scvi-tools when a GPU is available.
```


In [None]:
model_dir = os.path.join(data_directory, "scvi_model")

In [None]:
# If an scVI model does not exist, train a new one; if one does exist, load it

if not os.path.exists(os.path.join(model_dir, "model.pt")):
    if not os.path.exists(model_dir):
        os.makedirs(model_dir)

    # Train the model
    model.train()

    # Save the scVI trained model
    model.save(model_dir)

else:
    model = scvi.model.SCVI.load(model_dir, adata)

    # model = scvi.model.SCVI.load("scRNA-seq/analysis/runs/run_020/models/Res_1/scVI_Run020_Res1.model", adata)

model
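
The train-or-load pattern used above can be written as a small reusable helper. Everything in this sketch is hypothetical (`load_or_create`, `fake_train`, `fake_load`), and plain text files stand in for a saved model so that the example runs without scvi-tools. It also uses `os.makedirs(..., exist_ok=True)`, which removes the need for a separate existence check on the directory.

```python
import os
import tempfile


def load_or_create(model_dir, create, load, filename="model.pt"):
    """Call `load` if the saved file exists, otherwise call `create`."""
    if os.path.exists(os.path.join(model_dir, filename)):
        return load(model_dir)
    os.makedirs(model_dir, exist_ok=True)  # no error if the directory exists
    return create(model_dir)


# Stand-in callbacks that write and read a small file instead of a model.
def fake_train(model_dir):
    with open(os.path.join(model_dir, "model.pt"), "w") as handle:
        handle.write("weights")
    return "trained"


def fake_load(model_dir):
    return "loaded"


with tempfile.TemporaryDirectory() as root:
    demo_dir = os.path.join(root, "scvi_model")
    first = load_or_create(demo_dir, fake_train, fake_load)
    second = load_or_create(demo_dir, fake_train, fake_load)

print(first, second)
```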

## Obtaining model outputs


It's often useful to store the outputs of scvi-tools back into the original anndata, as it permits interoperability with Scanpy.


In [None]:
SCVI_LATENT_KEY = "X_scVI"

latent = model.get_latent_representation()
adata.obsm[SCVI_LATENT_KEY] = latent
latent.shape

The `model.get...()` functions default to using the AnnData that was used to initialize the model. It's possible to also query a subset of the anndata, or even use a completely independent anndata object as long as the anndata is organized in an equivalent fashion.


In [None]:
adata_subset = adata[adata.obs.cell_type == "Fibroblast"]
latent_subset = model.get_latent_representation(adata_subset)
latent_subset.shape

In [None]:
denoised = model.get_normalized_expression(adata_subset, library_size=1e4)
denoised.iloc[:5, :5]

Let's store the normalized values back in the anndata.


In [None]:
SCVI_NORMALIZED_KEY = "scvi_normalized"

adata.layers[SCVI_NORMALIZED_KEY] = model.get_normalized_expression(library_size=1e4)

## Interoperability with Scanpy


Scanpy is a powerful python library for visualization and downstream analysis of scRNA-seq data. We show here how to feed the objects produced by scvi-tools into a scanpy workflow.


### Visualization without batch correction


```{warning}
We use UMAP to *qualitatively* assess our low-dimension embeddings of cells. We do not advise using UMAP or any similar approach quantitatively. We do recommend using the embeddings produced by scVI as a plug-in replacement of what you would get from PCA, as we show below.
```


First, we demonstrate the presence of nuisance variation with respect to nuclei/whole cell, age group, and donor by plotting the UMAP results of the top 30 PCA components for the log-normalized data.


In [None]:
# run PCA then generate UMAP plots
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=20)
sc.tl.umap(adata, min_dist=0.3)

In [None]:
sc.pl.umap(
    adata,
    color=["cell_type"],
    frameon=False,
)
sc.pl.umap(
    adata,
    color=["donor", "cell_source"],
    ncols=2,
    frameon=False,
)

We see that while the cell types are generally well separated, nuisance variation plays a large part in the variation of the data.


### Visualization with batch correction (scVI)


Now, let us try using the scVI latent space to generate the same UMAP plots to see if scVI successfully accounts for batch effects in the data.


In [None]:
# use scVI latent space for UMAP generation
sc.pp.neighbors(adata, use_rep=SCVI_LATENT_KEY)
sc.tl.umap(adata, min_dist=0.3)

In [None]:
sc.pl.umap(
    adata,
    color=["cell_type"],
    frameon=False,
)
sc.pl.umap(
    adata,
    color=["donor", "cell_source"],
    ncols=2,
    frameon=False,
)

We can see that scVI was able to correct for nuisance variation due to nuclei/whole cell, age group, and donor, while maintaining separation of cell types.


### Clustering on the scVI latent space


The user will note that we imported curated labels from the original publication. Our interface with scanpy makes it easy to cluster the data with scanpy from scVI's latent space and then reinject them into scVI (e.g., for differential expression).


In [None]:
# neighbors were already computed using scVI
SCVI_CLUSTERS_KEY = "leiden_scVI"
sc.tl.leiden(adata, key_added=SCVI_CLUSTERS_KEY, resolution=0.5)

In [None]:
sc.pl.umap(
    adata,
    color=[SCVI_CLUSTERS_KEY],
    frameon=False,
)

## Differential expression

We can also use many scvi-tools models for differential expression. For further details on the methods underlying these functions as well as additional options, please see the [API docs](https://docs.scvi-tools.org/en/stable/api/reference/scvi.model.SCVI.differential_expression.html#scvi.model.SCVI.differential_expression).


In [None]:
adata.obs.cell_type.head()

For example, a 1-vs-1 DE test is as simple as:


In [None]:
de_df = model.differential_expression(
    groupby="cell_type", group1="Endothelial", group2="Fibroblast"
)
de_df.head()

We can also do a 1-vs-all DE test, which compares each cell type with the rest of the dataset:


In [None]:
de_df = model.differential_expression(
    groupby="cell_type",
)
de_df.head()

We now extract top markers for each cluster using the DE results.


In [None]:
markers = {}
cats = adata.obs.cell_type.cat.categories
for c in cats:
    cid = f"{c} vs Rest"
    cell_type_df = de_df.loc[de_df.comparison == cid]

    cell_type_df = cell_type_df[cell_type_df.lfc_mean > 0]

    cell_type_df = cell_type_df[cell_type_df["bayes_factor"] > 3]
    cell_type_df = cell_type_df[cell_type_df["non_zeros_proportion1"] > 0.1]

    markers[c] = cell_type_df.index.tolist()[:3]
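
The effect of the three filters is easier to see on a small table. The gene names and values below are made up, and the toy frame `toy_de` only reuses the column names from the loop above. After filtering, the first three remaining rows are kept in the table's existing order.

```python
import pandas as pd

# Toy DE results for a single comparison; index holds gene names.
toy_de = pd.DataFrame(
    {
        "comparison": ["Fibroblast vs Rest"] * 4,
        "lfc_mean": [2.1, 1.5, -0.7, 0.9],
        "bayes_factor": [5.0, 2.0, 6.0, 4.0],
        "non_zeros_proportion1": [0.6, 0.5, 0.4, 0.05],
    },
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)

mask = (
    (toy_de["lfc_mean"] > 0)                   # up-regulated in group 1
    & (toy_de["bayes_factor"] > 3)             # sufficient evidence
    & (toy_de["non_zeros_proportion1"] > 0.1)  # expressed in >10% of group 1
)
top_markers = toy_de[mask].index.tolist()[:3]
print(top_markers)
```

Here `GENE_B` fails the Bayes factor threshold, `GENE_C` has a negative log fold change, and `GENE_D` is expressed in too few cells, so only `GENE_A` remains.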

In [None]:
sc.tl.dendrogram(adata, groupby="cell_type", use_rep="X_scVI")

In [None]:
sc.pl.dotplot(
    adata,
    markers,
    groupby="cell_type",
    dendrogram=True,
    color_map="Blues",
    swap_axes=True,
    use_raw=True,
    standard_scale="var",
)

We can also visualize the scVI normalized gene expression values with the `layer` option.


In [None]:
sc.pl.heatmap(
    adata,
    markers,
    groupby="cell_type",
    layer="scvi_normalized",
    standard_scale="var",
    dendrogram=True,
    figsize=(8, 12),
)

## Logging information

Verbosity varies in the following way:

-   `logger.setLevel(logging.WARNING)` will show a progress bar.
-   `logger.setLevel(logging.INFO)` will show global logs including the number of jobs done.
-   `logger.setLevel(logging.DEBUG)` will show detailed logs for each training (e.g., the parameters tested).

This function's behaviour can be customized; please refer to its documentation for information about the different parameters available.

In general, you can use `scvi.settings.verbosity` to set the verbosity of the scvi package.
Note that `verbosity` corresponds to the logging levels of the standard Python `logging` module. By default, the verbosity level is set to `INFO` (=20).
As a reminder, the logging levels are:

<table class="docutils align-center">
<colgroup>
<col style="width: 48%">
<col style="width: 52%">
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Level</p></th>
<th class="head"><p>Numeric value</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">CRITICAL</span></code></p></td>
<td><p>50</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">ERROR</span></code></p></td>
<td><p>40</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">WARNING</span></code></p></td>
<td><p>30</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">INFO</span></code></p></td>
<td><p>20</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">DEBUG</span></code></p></td>
<td><p>10</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">NOTSET</span></code></p></td>
<td><p>0</p></td>
</tr>
</tbody>
</table>
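
The numeric values in the table can be confirmed with the standard `logging` module alone. In this sketch the logger name `"demo"` is arbitrary, and the example only shows how a level threshold decides which messages are emitted; it does not touch the scvi-tools logger.

```python
import logging

logger = logging.getLogger("demo")  # arbitrary logger name for illustration
logger.setLevel(logging.WARNING)

# Map each level name to its numeric value.
levels = {
    name: getattr(logging, name)
    for name in ["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"]
}
print(levels)

# With the threshold at WARNING (30), INFO (20) is dropped and ERROR (40) passes.
info_shown = logger.isEnabledFor(logging.INFO)
error_shown = logger.isEnabledFor(logging.ERROR)
print(info_shown, error_shown)
```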


## Clean up

Run the following cell to remove the temporary directory created as `save_dir`. Files written to `data_directory` are not removed.


In [None]:
save_dir.cleanup()

## Licenses

### MIT License - Pranav Kumar Mishra
```
Copyright 2024 Pranav Kumar Mishra

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the “Software”), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

```


### BSD 3-Clause License - YosefLab

```
BSD 3-Clause License

Copyright (c) 2020, YosefLab
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
   contributors may be used to endorse or promote products derived from
   this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```
