# scANVI
An important task of single-cell analysis is the integration of several samples, which we can perform with scVI. For integration, scVI treats the data as unlabelled. When our dataset is fully labelled (perhaps in independent studies, or independent analysis pipelines), we can obtain an integration that better preserves biology using scANVI, which incorporates cell type annotation information. Here we demonstrate this functionality with an integrated analysis of cells from the lung atlas integration task from the scIB manuscript. The same pipeline would generally be used to analyze any collection of scRNA-seq datasets.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import anndata
from scipy import io

import scanpy as sc
import scvi
import scib
import torch

In [None]:
adata = sc.read("/home/jovyan/researcher_home/Documents/Tom/Atlas/data/Atlas/merged.h5ad")

In [None]:
adata

## Dataset preprocessing
This dataset was already processed as described in the scIB manuscript. Generally, models in scvi-tools expect data that has been filtered/aggregated in the same fashion as one would do with Scanpy/Seurat.

Another important thing to keep in mind is highly variable gene selection. While scVI and scANVI can both accommodate all genes in terms of runtime, we usually recommend filtering genes for best integration performance. Among other things, this removes variation that stems from batch-specific gene expression.

We perform this gene selection using the Scanpy pipeline while keeping the full dimension normalized data in the adata.raw object. We obtain variable genes from each dataset and take their intersections.
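The idea of selecting variable genes per dataset and then combining the selections can be sketched without any dependencies. The example below is a toy illustration only: the function names, the variance-to-mean score, and the tiny matrices are assumptions made for this sketch, and Scanpy's own implementation (which normalizes dispersions within expression bins) differs in detail.

```python
from collections import Counter
from statistics import mean, pvariance


def top_dispersion_genes(matrix, gene_names, n_top):
    """Return the n_top genes with the highest variance-to-mean ratio.

    `matrix` is a list of cells; each cell is a list of values, one per gene.
    """
    scores = {}
    for j, gene in enumerate(gene_names):
        column = [cell[j] for cell in matrix]
        mu = mean(column)
        # Genes with zero mean carry no signal, so they get the lowest score.
        scores[gene] = pvariance(column) / mu if mu > 0 else 0.0
    ranked = sorted(gene_names, key=lambda g: scores[g], reverse=True)
    return set(ranked[:n_top])


def batch_aware_hvgs(batches, gene_names, n_top):
    """Select variable genes per batch, then combine across batches.

    Returns the genes selected in every batch, plus a counter of how many
    batches selected each gene.
    """
    per_batch = [
        top_dispersion_genes(matrix, gene_names, n_top)
        for matrix in batches.values()
    ]
    n_batches_selected = Counter(g for genes in per_batch for g in genes)
    shared = set.intersection(*per_batch)
    return shared, n_batches_selected


# Hypothetical toy data: two batches, three cells each, three genes.
gene_names = ["A", "B", "C"]
batches = {
    "batch1": [[0, 5, 1], [10, 5, 1], [0, 5, 2]],
    "batch2": [[1, 0, 3], [9, 8, 3], [0, 0, 3]],
}
shared, n_batches_selected = batch_aware_hvgs(batches, gene_names, n_top=2)
print(shared, dict(n_batches_selected))
```

Genes that are variable in only one batch are likely to reflect batch-specific expression, which is why ranking by the number of batches that selected a gene favors shared biological signal.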

In [None]:
adata.raw = adata  # keep full dimension safe

In [None]:
sc.pp.log1p(adata)


In [None]:
sc.pp.highly_variable_genes(adata, flavor="seurat", batch_key="Author", n_top_genes=2000, subset=True)

In [None]:
adata

## Integration with scVI
As a first step, we assume that the data is completely unlabelled and we wish to find common axes of variation across the datasets. Many methods are available through Scanpy for this purpose (BBKNN, Scanorama, etc.). In this notebook we present scVI. To run scVI, we simply need to:

- Register the AnnData object with the correct key to identify the sample and the layer key with the count data.
- Create an SCVI model object.
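Because scVI models count data, it can be worth confirming that the matrix being registered really holds raw counts rather than normalized or log-transformed values. The helper below is a hypothetical, dependency-free sketch of such a check; its name and tolerance are assumptions, and in practice one would apply the same test to the values of `adata.X` or of the chosen layer.

```python
import math


def looks_like_raw_counts(values, tolerance=1e-8):
    """Return True if every value is a non-negative integer (up to tolerance)."""
    return all(v >= 0 and abs(v - round(v)) <= tolerance for v in values)


# Hypothetical example values: raw counts and their log1p transform.
raw_counts = [0, 3, 12, 1]
log_values = [math.log1p(v) for v in raw_counts]

print(looks_like_raw_counts(raw_counts))
print(looks_like_raw_counts(log_values))
```

If the check fails on the matrix intended for training, the count data probably lives elsewhere, for example in a separate layer.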

In [None]:
#adata.raw.X


In [None]:
%pip install -U "jax[cuda12]"

In [None]:
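# scVI models raw counts. adata.X was log-transformed above, so if the raw
# counts are stored in a layer, pass its name here via the `layer` argument.
scvi.model.SCVI.setup_anndata(adata, batch_key="Author")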

We note that the model parameters used below are non-default; however, they have been verified to generally work well in the integration task.

In [None]:
vae = scvi.model.SCVI(adata, n_layers=2, n_latent=30, gene_likelihood="nb")
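The `gene_likelihood="nb"` setting means that counts are modeled with a negative binomial distribution. As a rough illustration of what that likelihood computes, here is a standalone sketch of the negative binomial log-probability in the mean and inverse-dispersion parameterization. The function name and the example values are assumptions for this sketch; it is not the code scvi-tools runs internally.

```python
import math


def nb_log_prob(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu
    and inverse dispersion theta (variance = mu + mu**2 / theta)."""
    log_theta_mu = math.log(theta + mu)
    return (
        theta * (math.log(theta) - log_theta_mu)
        + x * (math.log(mu) - log_theta_mu)
        + math.lgamma(x + theta)
        - math.lgamma(theta)
        - math.lgamma(x + 1)
    )


# Sanity check on hypothetical parameters: the probabilities should sum to
# one and the distribution's mean should equal mu.
mu, theta = 5.0, 2.0
probabilities = [math.exp(nb_log_prob(x, mu, theta)) for x in range(2000)]
total_probability = sum(probabilities)
empirical_mean = sum(x * p for x, p in enumerate(probabilities))
print(total_probability, empirical_mean)
```

A smaller `theta` gives a heavier tail, which is how the negative binomial accommodates the overdispersion typical of scRNA-seq counts.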

Now we train scVI.

In [None]:
torch.set_float32_matmul_precision('high')

In [None]:
vae.train()


Once training is done, we can compute the latent representation of each cell in the dataset and add it to the AnnData object.

In [None]:
adata.obsm["X_scVI"] = vae.get_latent_representation()

In [None]:
%pip install pymde

To visualize the embeddings learned by scVI, we use the pymde package wrapper in scvi-tools. This is an alternative to UMAP that is GPU-accelerated.

In [None]:
from scvi.model.utils import mde
import pymde

In [None]:
adata.obsm["X_mde"] = mde(adata.obsm["X_scVI"])

In [None]:
adata

In [None]:
sc.pl.embedding(adata, basis="X_mde", color=["CellType", "CellType_Integration", "Batch"], frameon=False, ncols=1)

In [None]:
adata

In [None]:
dir_path = "scVI_model/"
vae.save(dir_path, overwrite=True) 

In [None]:
adata.write("scVI_model/scVI.h5ad")

## Integration with scANVI
Previously, we used scVI as we assumed we did not have any cell type annotations available to guide us. Consequently, after the previous analysis, one would have to annotate clusters using differential expression, or by other means.

Now, we assume that all of our data is annotated. This can lead to a more accurate integration result when using scANVI, i.e., our latent data manifold is better suited to downstream tasks like visualization, trajectory inference, or nearest-neighbor-based tasks. scANVI requires:

- the sample identifier for each cell (as in scVI)
- the cell type/state for each cell

scANVI can also be used for label transfer, and we recommend checking out the other scANVI tutorials to explore this functionality.

Since we’ve already trained an scVI model on our data, we will use it to initialize scANVI. When initializing scANVI, we provide it the labels_key. As scANVI can also be used for datasets with partially observed annotations, we need to give it the name of the category that corresponds to unlabeled cells. As we have no unlabeled cells, we can give it any name that is not the name of an existing cell type.

In [None]:
lvae = scvi.model.SCANVI.from_scvi_model(
    vae,
    adata=adata,
    labels_key="CellType",
    unlabeled_category="TBD",
)
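Since the placeholder passed as `unlabeled_category` must not coincide with a real cell type, a quick check before initializing the model can prevent annotated cells from being silently treated as unlabeled. The helper below is a hypothetical sketch with made-up labels; in the notebook, the labels would come from `adata.obs["CellType"]`.

```python
def check_placeholder_unused(labels, placeholder):
    """Return the set of existing cell types, or raise ValueError if the
    placeholder for unlabeled cells already names one of them."""
    existing = set(labels)
    if placeholder in existing:
        raise ValueError(
            f"{placeholder!r} is already a cell type; choose another placeholder."
        )
    return existing


# Hypothetical annotations for four cells.
labels = ["T cell", "B cell", "T cell", "Macrophage"]
cell_types = check_placeholder_unused(labels, "TBD")
print(sorted(cell_types))
```

Calling the helper with a placeholder such as `"T cell"` raises a `ValueError` instead of returning.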

In [None]:
lvae.train(max_epochs=20, n_samples_per_label=100)
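The `n_samples_per_label` argument caps how many labelled cells of each class are drawn, which keeps abundant cell types from dominating the classification signal. The sketch below illustrates that idea with a standalone subsampling function. The function name, the seed, and the label counts are assumptions, and the sampler inside scvi-tools may differ in detail.

```python
import random
from collections import Counter, defaultdict


def subsample_per_label(labels, n_samples_per_label, seed=0):
    """Return sorted cell indices with at most n_samples_per_label per label."""
    rng = random.Random(seed)
    indices_by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        indices_by_label[label].append(idx)
    chosen = []
    for indices in indices_by_label.values():
        # Rare labels with fewer cells than the cap are kept in full.
        k = min(n_samples_per_label, len(indices))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Hypothetical, imbalanced annotation: one abundant and two rarer cell types.
labels = ["T cell"] * 500 + ["B cell"] * 50 + ["Mast cell"] * 3
chosen = subsample_per_label(labels, n_samples_per_label=100)
per_label_counts = Counter(labels[i] for i in chosen)
print(dict(per_label_counts))
```

With a cap of 100, the abundant class is reduced to 100 cells while the two rarer classes are left unchanged.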

In [None]:
adata

Now we can retrieve the latent space.

In [None]:
adata.obsm["X_scANVI"] = lvae.get_latent_representation(adata)


In [None]:
adata.obsm["X_mde_scanvi"] = mde(adata.obsm["X_scANVI"])


In [None]:
adata

In [None]:
sc.pl.embedding(adata, basis="X_mde_scanvi", color=["CellType", "Batch"], ncols=1, frameon=False)


In [None]:
sc.tl.pca(adata)

In [None]:
dir_path = "scANVI_model/"
lvae.save(dir_path, overwrite=True) 

In [None]:
adata.write("scANVI_model/scANVI.h5ad")

In [None]:
adata