### Intro

The aim of this notebook (and of 04_benchmark_integration) is to assess the integration by comparing three embeddings: non-integrated PCs, the scANVI latent embedding, and PCs computed from the scANVI-normalized expression matrix.

Here, we generate an adata object that includes the three embedding spaces described above, together with the precomputed scGen and Harmony embeddings.

For each embedding we use the same number of dimensions (30).

In [1]:
import scanpy as sc

import pandas as pd
import numpy as np

import scvi

from pyprojroot import here
from glob import glob
import os

### Loading raw non-integrated object

This object holds the raw, non-integrated counts; it is the same dataset that was integrated using scANVI.

In [2]:
adata = sc.read_h5ad(here("03_downstream_analysis/02_gene_universe_definition/results/04_MAIN_geneUniverse_noRBCnPlatelets.h5ad"))

In [3]:
adata.obs['binned_age'] = adata.obs['binned_age'].astype(str)

### Loading scANVI model

In [4]:
scanvi_model = scvi.model.SCANVI.load(here('03_downstream_analysis/04_integration_with_annotation/results/scANVI_model_fineTuned_lowLR_noRBCnPlat'), 
                                      adata=adata) 

/scratch_isilon/groups/singlecell/shared/conda_env/scvi-v112/lib/python3.9/site-packages/lightning/fabric/plugins/environments/slurm.py:191: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python3.9 /scratch_isilon/groups/singlecell/shared/conda_en ...


INFO     File /scratch_isilon/groups/singlecell/shared/projects/Inflammation-PBMCs-Atlas/03_downstream_analysis/03_scANVI_integration_with_annotation/results/scANVI_model_fineTuned_lowLR_noRBCnPlat/model.pt already downloaded


## Generating embeddings

### scANVI latent dimension

In [5]:
adata.obsm['X_scANVI_latent'] = scanvi_model.get_latent_representation(adata=adata)

### PCA from raw counts

#### Normalization

In [6]:
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
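
To make the two preprocessing calls above concrete, here is a minimal NumPy sketch of the arithmetic they perform with their default settings: each cell is scaled to a total of `target_sum` counts, then `log(1 + x)` is applied. The toy matrix and variable names are hypothetical, and the sketch uses a dense array, whereas the real object is large and typically sparse.

```python
import numpy as np

# Toy raw count matrix: 3 cells x 4 genes (hypothetical values)
counts = np.array(
    [
        [10, 0, 5, 5],
        [1, 1, 1, 1],
        [0, 50, 25, 25],
    ],
    dtype=float,
)

target_sum = 1e4

# Scale each cell (row) so that its counts sum to target_sum
cell_totals = counts.sum(axis=1, keepdims=True)
normalized = counts / cell_totals * target_sum

# Natural-log transform with a pseudocount of 1
log_normalized = np.log1p(normalized)
```

After this step every cell has the same total before the log transform, so differences in sequencing depth no longer dominate the PCA that follows.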

#### Computing PCA

In [7]:
sc.tl.pca(adata, n_comps=30)

In [8]:
adata.obsm['X_pca_unintegrated'] = adata.obsm['X_pca'].copy()

In [9]:
del adata.obsm['X_pca']
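
The PCA and renaming steps above can be sketched with plain NumPy. This is only an illustration on random toy data, assuming centered PCA via a full SVD; scanpy uses its own solver, so component signs and numerical details may differ. The `obsm` dict is a hypothetical stand-in for `adata.obsm`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # toy matrix: 200 cells x 50 genes
n_comps = 30

# Center each gene, then decompose with SVD
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Cell coordinates on the first n_comps principal components
X_pca = U[:, :n_comps] * S[:n_comps]

# Store the result under a dedicated key so that a later PCA,
# which would write to "X_pca" again, cannot overwrite it
obsm = {}
obsm["X_pca_unintegrated"] = X_pca.copy()
```

Copying the array under a distinct key and dropping the generic `X_pca` entry keeps each embedding unambiguous when several PCAs are computed for the same cells.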

### PCA from scANVI normalized expression

For the third embedding we have to load the normalized count data obtained with get_normalized_expression() from the scANVI model.

We generated these matrices separately for each cell type to make loading easier.

**Generating one adata by merging all cell types**

In [10]:
ctDATApath_list = glob(str(here('03_downstream_analysis/04_integration_with_annotation/results/normalized_adatas_nextflow/cellType_adata_merged/*.log1p.h5ad')))

In [11]:
ctADATA_list = []
for p in ctDATApath_list:
    print(os.path.basename(p))
    ctADATA_list.append(sc.read_h5ad(p))

T_CD4_NonNaive_adataMerged.log1p.h5ad
T_CD8_NonNaive_adataMerged.log1p.h5ad
Cycling_cells_adataMerged.log1p.h5ad
T_CD4_Naive_adataMerged.log1p.h5ad
T_CD8_Naive_adataMerged.log1p.h5ad
UTC_adataMerged.log1p.h5ad
DC_adataMerged.log1p.h5ad
Plasma_adataMerged.log1p.h5ad
ILC_adataMerged.log1p.h5ad
Mono_adataMerged.log1p.h5ad
Progenitors_adataMerged.log1p.h5ad
B_adataMerged.log1p.h5ad
pDC_adataMerged.log1p.h5ad


In [12]:
adataCTmerged = sc.concat(ctADATA_list)[adata.obs_names,:]

In [13]:
assert all(adataCTmerged.obs_names == adata.obs_names)
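
The merge-and-reorder logic above can be illustrated with pandas on toy tables. The cell names, values, and variable names below are hypothetical; the point is that the per-cell-type pieces are concatenated in file order and must then be reindexed to the reference order before any positional array (such as a PCA result) is copied across objects.

```python
import pandas as pd

# Cell order of the reference object (hypothetical names)
reference_index = pd.Index(["cell_3", "cell_1", "cell_4", "cell_2"])

# Hypothetical per-cell-type tables, each holding a subset of the cells
t_cells = pd.DataFrame({"geneA": [1.0, 2.0]}, index=["cell_1", "cell_2"])
b_cells = pd.DataFrame({"geneA": [3.0, 4.0]}, index=["cell_3", "cell_4"])

# Concatenate along the cell axis; rows follow the order of the input list
merged = pd.concat([t_cells, b_cells])

# Every reference cell must appear exactly once before reordering
assert merged.index.is_unique
assert set(merged.index) == set(reference_index)

# Reorder the rows to match the reference, then verify the alignment
merged = merged.loc[reference_index]
assert merged.index.equals(reference_index)
```

Without the reordering step, the rows would stay in concatenation order and any array copied by position would be assigned to the wrong cells without raising an error.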

#### Computing PCA

In [14]:
sc.tl.pca(adataCTmerged, n_comps=30)


In [15]:
adata.obsm['X_pca_scANVI_normalized_exp'] = adataCTmerged.obsm['X_pca'].copy()

### scGen latent dimension

**Load the adata with the corrected latent space and PCs**

In [3]:
scGen_emb_adata = sc.read_h5ad(here('03_downstream_analysis/04_integration_with_annotation/results/02_scGen_corr_latent_expression_noRBCnPlat.h5ad'))
scGen_emb_adata

AnnData object with n_obs × n_vars = 4279352 × 30
    layers: 'scgen_corrected_expression_pc', 'scgen_corrected_latent'

In [5]:
# Check that the cell indices of the two adata objects match exactly
assert all(scGen_emb_adata.obs_names == adata.obs_names)

In [9]:
adata.obsm['scgen_corrected_latent'] = scGen_emb_adata.layers['scgen_corrected_latent']
adata.obsm['scgen_corrected_expression_pc'] = scGen_emb_adata.layers['scgen_corrected_expression_pc']

### Harmony corrected PCs

**Load adata with corrected PCs**

In [3]:
harmony_adata = sc.read_h5ad(here('03_downstream_analysis/04_integration_with_annotation/results/02_Harmony_correctedPCs_noRBCnPlat.h5ad'))
harmony_adata

AnnData object with n_obs × n_vars = 4279352 × 30

In [4]:
# Check that the cell indices of the two adata objects match exactly
assert all(harmony_adata.obs_names == adata.obs_names)

In [7]:
adata.obsm['X_pca_harmony'] = harmony_adata.X
adata

AnnData object with n_obs × n_vars = 4279352 × 8253
    obs: 'studyID', 'libraryID', 'sampleID', 'chemistry', 'disease', 'sex', 'binned_age', 'Level1', 'Level2', '_scvi_batch', '_scvi_labels'
    var: 'hgnc_id', 'symbol', 'locus_group', 'HUGO_status', 'highly_variable'
    uns: '_scvi_manager_uuid', '_scvi_uuid', 'log1p', 'pca'
    obsm: 'X_pca_scANVI_normalized_exp', 'X_pca_unintegrated', 'X_scANVI_latent', '_scvi_extra_categorical_covs', 'scgen_corrected_expression_pc', 'scgen_corrected_latent', 'X_pca_harmony'
    varm: 'PCs'
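
Since every embedding is meant to have the same number of dimensions (30), a quick shape check before saving can catch a mismatch early. The sketch below uses a plain dict of NumPy arrays as a hypothetical stand-in for `adata.obsm`; the toy sizes and the `embedding_keys` list are assumptions for illustration.

```python
import numpy as np

n_obs, n_dims = 5, 30  # toy number of cells, expected embedding width

# Hypothetical stand-in for adata.obsm
obsm = {
    "X_pca_unintegrated": np.zeros((n_obs, n_dims)),
    "X_scANVI_latent": np.zeros((n_obs, n_dims)),
    "X_pca_harmony": np.zeros((n_obs, n_dims)),
    "_scvi_extra_categorical_covs": np.zeros((n_obs, 2)),  # not an embedding
}

# Only the keys that hold embeddings are checked
embedding_keys = ["X_pca_unintegrated", "X_scANVI_latent", "X_pca_harmony"]

shapes = {key: obsm[key].shape for key in embedding_keys}
mismatched = [key for key, shape in shapes.items() if shape != (n_obs, n_dims)]

if mismatched:
    raise ValueError(f"Unexpected embedding shapes: {mismatched}")
```

Listing the embedding keys explicitly avoids flagging bookkeeping entries such as `_scvi_extra_categorical_covs`, which are stored in the same container but are not embeddings.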

## Saving the obtained object

In [11]:
adata.write(here('03_downstream_analysis/04_integration_with_annotation/results/03_MAIN_geneUniverse_noRBCnPlatelets_embeddings.h5ad'), compression='gzip')