# Specify vae for KCN integration

This notebook uses the same parameters and data as `scVI_meta-atlas_4_datasets_V3.ipynb` to train the VAE model used for KCN integration.

In [1]:
import sys

# Optional Colab setup (left disabled): if branch is "stable", install from PyPI,
# otherwise install from source.
# branch = "stable"
# IN_COLAB = "google.colab" in sys.modules
#
# if IN_COLAB and branch == "stable":
#     !pip install --quiet scvi-tools[tutorials]
#     !pip install --quiet git+https://github.com/theislab/scib.git
# elif IN_COLAB and branch != "stable":
#     !pip install --quiet --upgrade jsonschema
#     !pip install --quiet git+https://github.com/yoseflab/scvi-tools@$branch#egg=scvi-tools[tutorials]
#     !pip install --quiet git+https://github.com/theislab/scib.git



In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

import scanpy as sc
import scvi
#import scib

sc.set_figure_params(figsize=(4, 4))

Global seed set to 0


In [3]:
adata = sc.read("fc_epi.h5ad")

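Note that in the original tutorial the counts are stored in a separate layer and `adata.X` contains log-transformed, scran-normalized expression. The object loaded here has no layers (see the output below), so the later steps read from `adata.X`.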

In [4]:
adata

AnnData object with n_obs × n_vars = 65036 × 27656
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'S.Score', 'G2M.Score', 'Phase', 'old.ident', 'RNA_snn_res.0.15', 'seurat_clusters', 'costum_clustering', 'Condition', 'pANN_0.25_0.06_1806', 'DF.classifications_0.25_0.06_1806', 'pANN_0.25_0.005_1289', 'DF.classifications_0.25_0.005_1289', 'RNA_snn_res.0.1', 'pANN_0.25_0.18_1347', 'DF.classifications_0.25_0.18_1347', 'pANN_0.25_0.14_831', 'DF.classifications_0.25_0.14_831', 'batch'
    var: 'features'

In [5]:
#sc.pp.log1p(adata)
adata

AnnData object with n_obs × n_vars = 65036 × 27656
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'S.Score', 'G2M.Score', 'Phase', 'old.ident', 'RNA_snn_res.0.15', 'seurat_clusters', 'costum_clustering', 'Condition', 'pANN_0.25_0.06_1806', 'DF.classifications_0.25_0.06_1806', 'pANN_0.25_0.005_1289', 'DF.classifications_0.25_0.005_1289', 'RNA_snn_res.0.1', 'pANN_0.25_0.18_1347', 'DF.classifications_0.25_0.18_1347', 'pANN_0.25_0.14_831', 'DF.classifications_0.25_0.14_831', 'batch'
    var: 'features'

### Dataset preprocessing

This dataset was already processed as described in the scIB manuscript. Generally, models in scvi-tools expect data that has been filtered/aggregated in the same fashion as one would do with Scanpy/Seurat.


Another important thing to keep in mind is highly variable gene selection. While scVI and scANVI both accommodate using all genes in terms of runtime, we usually recommend filtering genes for best integration performance. This will, among other things, remove batch-specific variation due to batch-specific gene expression.

We perform this gene selection using the Scanpy pipeline while keeping the full dimension normalized data in the `adata.raw` object. We obtain variable genes from each dataset and take their intersections. 

In [6]:
adata.raw = adata  # keep full dimension safe
sc.pp.highly_variable_genes(
    adata, 
    flavor="seurat_v3", 
    n_top_genes=2000, 
    layer=None,  # "counts" in original
    batch_key="batch",
    subset=True
)
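As a rough illustration of the "take their intersection" idea described above, the sketch below intersects per-batch gene lists using plain Python sets. The function name `shared_variable_genes` and the toy gene lists are hypothetical. The `sc.pp.highly_variable_genes` call with `batch_key` combines the batches internally, so this is not a reimplementation of what Scanpy does.

```python
def shared_variable_genes(hvg_per_batch):
    """Return the sorted genes flagged as highly variable in every batch."""
    gene_sets = [set(genes) for genes in hvg_per_batch.values()]
    if not gene_sets:
        return []
    return sorted(set.intersection(*gene_sets))


# Toy input (hypothetical): batch name -> genes flagged as highly variable there.
toy_hvgs = {
    "batch_1": ["KRT5", "KRT14", "TP63", "MKI67"],
    "batch_2": ["KRT5", "KRT14", "MKI67", "VIM"],
    "batch_3": ["KRT14", "KRT5", "COL1A1"],
}
shared = shared_variable_genes(toy_hvgs)
print(shared)  # ['KRT14', 'KRT5']
```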



<div class="alert alert-info">

Important

We see a warning about the data not containing counts. This is due to some of the samples in this dataset containing SoupX-corrected counts. scvi-tools models will run for non-negative real-valued data, but we strongly suggest checking that these possibly non-count values are intended to represent pseudocounts, and not some other normalized data, in which the variance/covariance structure of the data has changed dramatically.

</div>
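One way to act on this advice is to inspect the values before registering the data. The sketch below uses NumPy only and reports whether a dense matrix is non-negative and whether it is integer-valued. The helper name `describe_counts` and the toy matrices are hypothetical. Non-negative but non-integer values (as with SoupX-corrected counts) are not necessarily a problem, whereas negative values or log-scale magnitudes suggest the matrix is not count-like.

```python
import numpy as np


def describe_counts(values, tol=1e-6):
    """Report whether a dense matrix is non-negative and (near-)integer-valued."""
    values = np.asarray(values, dtype=float)
    return {
        "non_negative": bool(np.all(values >= 0)),
        "integer_valued": bool(np.all(np.abs(values - np.round(values)) <= tol)),
        "max_value": float(values.max()),
    }


raw = np.array([[0, 3, 1], [2, 0, 7]])
# Library-size normalization to 10,000 followed by log1p, as a toy stand-in
# for log-normalized expression.
log_normalized = np.log1p(raw / raw.sum(axis=1, keepdims=True) * 1e4)

print(describe_counts(raw))
print(describe_counts(log_normalized))
```

For a sparse `adata.X`, the same checks can be run on the stored non-zero values (`adata.X.data`).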

## Integration with scVI

As a first step, we assume that the data is completely unlabelled and we wish to find common axes of variation between the four datasets. There are many methods available in scanpy for this purpose (BBKNN, Scanorama, etc.). In this notebook we present scVI. To run scVI, we simply need to:

* Register the AnnData object with the correct key to identify the sample and the layer key with the count data.
* Create an SCVI model object.

In [7]:
scvi.model.SCVI.setup_anndata(adata, layer=None, batch_key="batch") #layer="counts" in original

INFO     Using batches from adata.obs["batch"]
INFO     No label_key inputted, assuming all cells have same label
INFO     Using data from adata.X
INFO     Successfully registered anndata object containing 65036 cells, 2000 vars, 4 batches,
         1 labels, and 0 proteins. Also registered 0 extra categorical covariates and 0 extra
         continuous covariates.
INFO     Please do not further modify adata until model is trained.


We note that these parameters are non-default; however, they have been verified to generally work well in the integration task.

In [8]:
vae = scvi.model.SCVI(adata, n_layers=2, n_latent=30, gene_likelihood="nb")



Now we train scVI. This should take a couple of minutes on a Colab session; the CPU-only run logged below took about 16 minutes.

In [9]:
vae.train()

GPU available: False, used: False
TPU available: False, using: 0 TPU cores


Epoch 123/123: 100%|█████████████████████████████████████████████████████████████████| 123/123 [15:51<00:00,  7.73s/it, loss=386, v_num=1]
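The 123 epochs shown in the progress bar match what we believe is the default heuristic in scvi-tools when `train()` is called without `max_epochs`: 400 epochs scaled by 20000 / n_cells and capped at 400. The sketch below reproduces that arithmetic for this dataset. The helper name `default_max_epochs` is ours, and the formula is an assumption about the installed version rather than a documented guarantee.

```python
def default_max_epochs(n_cells, reference_cells=20000, epoch_cap=400):
    """Assumed heuristic: scale epoch_cap by reference_cells / n_cells, capped at epoch_cap."""
    return int(min(round(reference_cells / n_cells * epoch_cap), epoch_cap))


n_cells = 65036  # n_obs of the AnnData object above
print(default_max_epochs(n_cells))  # 123
```

To train for a different number of epochs, pass it explicitly, for example `vae.train(max_epochs=200)`.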


In [10]:
# save the reference model
dir_path = "meta_atlas_model_20062022/"
vae.save(dir_path, overwrite=True)
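Once saved, the model can be reloaded in a later session and used to embed the cells. The following is a minimal sketch, assuming the same `adata` (with identical genes and `batch` column) is available. The helper name `load_reference_embedding` and the `obsm` key `X_scVI` are our choices, not something fixed by this notebook.

```python
def load_reference_embedding(dir_path, adata, obsm_key="X_scVI"):
    """Reload a saved SCVI model and store its latent representation in adata.obsm."""
    # Imported here so the helper can be defined even where scvi-tools is not installed.
    import scvi

    model = scvi.model.SCVI.load(dir_path, adata=adata)
    adata.obsm[obsm_key] = model.get_latent_representation()
    return model


# Usage, with the objects from this notebook:
# vae = load_reference_embedding("meta_atlas_model_20062022/", adata)
# adata.obsm["X_scVI"].shape  # expected to be (n_obs, n_latent)
```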