## References

Tutorial: https://docs.scvi-tools.org/en/stable/user_guide/notebooks/MultiVI_tutorial.html <br>
Paper: https://www.biorxiv.org/content/10.1101/2021.08.20.457057v2

## Dataset to prepare

### 1) RNA (scnRNA + Multiome-RNA)
* Read in data: post-CellBender, filtered as in the previous HCA object, cell-type annotated
* Subset scnRNA: barcode x gene -> **`adata_rna.h5ad`**
* Subset Multiome-RNA: barcode x gene

### 2) ATAC (snATAC + Multiome-ATAC)
* Read in data: post-cellatac, with filtered peaks and nuclei, `6reg-v2_ATACs_filtered.h5ad`
* Subset snATAC: barcode x peak -> **`adata_atac.h5ad`**
* Subset Multiome-ATAC: barcode x peak

### 3) Concatenate Multiome RNA+ATAC
barcode x (gene+peak) -> **`adata_paired.h5ad`**
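
The layout in step 3 can be sketched with small pandas tables standing in for the AnnData matrices. This is only an illustration under assumptions: the barcodes, gene names, peak names and the `Gene Expression` / `Peaks` labels are invented, and it assumes the Multiome RNA and ATAC tables are indexed by the same nucleus barcodes.

```python
import pandas as pd

# Toy stand-ins for the two Multiome subsets (all names are hypothetical)
multiome_rna = pd.DataFrame(
    [[3, 0], [1, 2]],
    index=["nucleus_1", "nucleus_2"],
    columns=["GENE_A", "GENE_B"],
)
multiome_atac = pd.DataFrame(
    [[0, 1, 1], [1, 0, 0]],
    index=["nucleus_2", "nucleus_1"],  # deliberately in a different row order
    columns=["chr1:100-600", "chr1:900-1400", "chr2:50-550"],
)

# Align the ATAC rows to the RNA barcode order, then join along the feature axis
atac_aligned = multiome_atac.loc[multiome_rna.index]
paired = pd.concat([multiome_rna, atac_aligned], axis=1)

# Per-feature annotation recording which modality each column came from
var = pd.DataFrame(
    {
        "modality": ["Gene Expression"] * multiome_rna.shape[1]
        + ["Peaks"] * multiome_atac.shape[1]
    },
    index=paired.columns,
)

print(paired.shape)
print(var)
```

Putting the genes first and the peaks second at this stage means the paired object already has the feature order that MultiVI expects.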

## Concatenate the multimodal AnnData objects (using `scvi.data.organize_multiome_anndatas`)

In [1]:
import scanpy as sc
import numpy as np
import pandas as pd
import scipy
import anndata
import scvi

In [2]:
import session_info
session_info.show()

path_adata = '/nfs/team205/heart/anndata_objects/'
directory = path_adata + 'MultiVI'

**adata_rna**

In [3]:
# adata_rna=sc.read_h5ad('/nfs/team205/kk18/data/6region_v2/MultiVI/adata_rna_downsized.h5ad')
adata_rna=sc.read_h5ad(directory + '/adata_scnrna.h5ad')

print(adata_rna.X.data[:10])
adata_rna

[1. 1. 1. 1. 2. 1. 1. 1. 5. 1.]


AnnData object with n_obs × n_vars = 618913 × 31915
    obs: 'latent_RT_efficiency', 'latent_cell_probability', 'latent_scale', 'sample_id', 'Foetal_or_Adult', 'Provider', 'Modality', 'Mapping_ver', 'Reference_genome', 'CellBender_out', 'n_cells', 'multiplet_rate', 'batch', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'scrublet_score', 'scrublet_leiden', 'cluster_scrublet_score', 'doublet_pval', 'doublet_bh_pval', 'protocol', 'modality', 'donor_cellnuc', 'donor', 'region', 'cell_or_nuclei', 'Chemistry', 'n_nuclei'
    var: 'ambient_expression-0-0', 'genes-0-0', 'ambient_expression-1-0', 'feature_type-1-0', 'id-1-0', 'ambient_expression-10-0', 'feature_type-10-0', 'id-10-0', 'ambient_expression-11-0', 'feature_type-11-0', 'id-11-0', 'ambient_expression-12-0', 'feature_type-12-0', 'id-12-0', 'ambient_expression-13-0', 'feature_type-13-0', 'id-13-0', 'ambient_expression-14-0', 'feature_type-14-0', 'id-14-0', 'ambient_expression-15-0', 'feature_type-15-0', 'id-15-0', 'ambient_exp

In [4]:
adata_rna.obs

barcode,latent_RT_efficiency,latent_cell_probability,latent_scale,sample_id,Foetal_or_Adult,Provider,Modality,Mapping_ver,Reference_genome,CellBender_out,...,doublet_pval,doublet_bh_pval,protocol,modality,donor_cellnuc,donor,region,cell_or_nuclei,Chemistry,n_nuclei
HCAHeart7606896_GATGAGGCACGGCTAC,8.281336,0.976983,615.699524,HCAHeart7606896,Adult,Sanger Heart Mona-Carlos,scRNA,starsolo,GRCh38-3.0.0,/nfs/team205/heart/soupremoved/cellbender020/H...,...,0.059996,0.779036,RNA,expression,D1_Cell,D1,AX,Cell,,
HCAHeart7606896_TCAGGATCAGCTCGAC,7.195173,0.934280,702.436768,HCAHeart7606896,Adult,Sanger Heart Mona-Carlos,scRNA,starsolo,GRCh38-3.0.0,/nfs/team205/heart/soupremoved/cellbender020/H...,...,0.117506,0.779036,RNA,expression,D1_Cell,D1,AX,Cell,,
HCAHeart7606896_CAAGATCGTCTCACCT,7.452532,0.982054,528.663391,HCAHeart7606896,Adult,Sanger Heart Mona-Carlos,scRNA,starsolo,GRCh38-3.0.0,/nfs/team205/heart/soupremoved/cellbender020/H...,...,0.117506,0.779036,RNA,expression,D1_Cell,D1,AX,Cell,,
HCAHeart7606896_GCAAACTAGCTAGCCC,6.711821,0.992885,490.788574,HCAHeart7606896,Adult,Sanger Heart Mona-Carlos,scRNA,starsolo,GRCh38-3.0.0,/nfs/team205/heart/soupremoved/cellbender020/H...,...,0.117506,0.779036,RNA,expression,D1_Cell,D1,AX,Cell,,
HCAHeart7606896_CGCTTCACATTTGCCC,6.679333,0.974092,611.140320,HCAHeart7606896,Adult,Sanger Heart Mona-Carlos,scRNA,starsolo,GRCh38-3.0.0,/nfs/team205/heart/soupremoved/cellbender020/H...,...,0.735756,0.779036,RNA,expression,D1_Cell,D1,AX,Cell,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
H0037_septum_CATAAGCCATGCCGGT,0.579068,0.993469,1051.631348,H0037_septum,Adult,"Harvard Medical School, Seidman group",snRNA,starsolo,GRCh38-3.0.0,/nfs/team205/heart/soupremoved/cellbender020/H...,...,0.815061,0.925777,RNA,expression,H4_Nuclei,H4,SP,Nuclei,,
H0037_septum_GATCACATCCAATGCA,0.549441,0.989432,1044.838623,H0037_septum,Adult,"Harvard Medical School, Seidman group",snRNA,starsolo,GRCh38-3.0.0,/nfs/team205/heart/soupremoved/cellbender020/H...,...,0.262303,0.916526,RNA,expression,H4_Nuclei,H4,SP,Nuclei,,
H0037_septum_CAGCACGAGCGACCCT,0.607829,0.990621,1082.652344,H0037_septum,Adult,"Harvard Medical School, Seidman group",snRNA,starsolo,GRCh38-3.0.0,/nfs/team205/heart/soupremoved/cellbender020/H...,...,0.262303,0.916526,RNA,expression,H4_Nuclei,H4,SP,Nuclei,,
H0037_septum_AGCGCCAAGAGCAGAA,0.534082,0.991127,1053.899902,H0037_septum,Adult,"Harvard Medical School, Seidman group",snRNA,starsolo,GRCh38-3.0.0,/nfs/team205/heart/soupremoved/cellbender020/H...,...,0.691999,0.916526,RNA,expression,H4_Nuclei,H4,SP,Nuclei,,


In [5]:
adata_rna.obs.donor_cellnuc.isna().sum()

0

**adata_atac**

In [6]:
# adata_atac=sc.read_h5ad('/nfs/team205/kk18/data/6region_v2/MultiVI/adata_atac_downsized.h5ad')
# adata_atac=sc.read_h5ad('/nfs/team205/heart/anndata_objects/6region_v2/6reg-v2_ATACs_filtered.h5ad')
adata_atac=sc.read_h5ad(directory + '/adata_atac.h5ad')


print(adata_atac.X.data[:10])
adata_atac

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


AnnData object with n_obs × n_vars = 48098 × 102627
    obs: 'sample_id', 'protocol', 'donor_cellnuc', 'donor', 'region', 'cell_or_nuclei', 'cellatac_clusters', 'cellatac_code', 'dataset', 'barcode', 'oribarcode', 'modality'
    var: 'modality', 'peak_width', 'exon', 'gene', 'promoter', 'annotation', 'gene_name', 'gene_id', 'tss_distance', 'ENCODE_blacklist'
    layers: 'binary_raw'

In [7]:
adata_atac.obs

fullbarcode,sample_id,protocol,donor_cellnuc,donor,region,cell_or_nuclei,cellatac_clusters,cellatac_code,dataset,barcode,oribarcode,modality
HCAHeart8374324_AAACGAATCAAACCCA-1,HCAHeart8374324,ATAC,D5_nuclei,D5,LV,nuclei,26,01,LV,AAACGAATCAAACCCA-1,01-AAACGAATCAAACCCA-1,accessibility
HCAHeart8374324_AAAGGATAGGCACTAG-1,HCAHeart8374324,ATAC,D5_nuclei,D5,LV,nuclei,9,01,LV,AAAGGATAGGCACTAG-1,01-AAAGGATAGGCACTAG-1,accessibility
HCAHeart8374324_AAAGGGCAGCGAGCTA-1,HCAHeart8374324,ATAC,D5_nuclei,D5,LV,nuclei,8,01,LV,AAAGGGCAGCGAGCTA-1,01-AAAGGGCAGCGAGCTA-1,accessibility
HCAHeart8374324_AAAGGGCAGTGATATG-1,HCAHeart8374324,ATAC,D5_nuclei,D5,LV,nuclei,16,01,LV,AAAGGGCAGTGATATG-1,01-AAAGGGCAGTGATATG-1,accessibility
HCAHeart8374324_AAATGAGTCCGGGCAT-1,HCAHeart8374324,ATAC,D5_nuclei,D5,LV,nuclei,26,01,LV,AAATGAGTCCGGGCAT-1,01-AAATGAGTCCGGGCAT-1,accessibility
...,...,...,...,...,...,...,...,...,...,...,...,...
HCAHeart8374344_TTTGGTTTCATTGCCC-1,HCAHeart8374344,ATAC,D7_nuclei,D7,LA,nuclei,0,21,LA,TTTGGTTTCATTGCCC-1,21-TTTGGTTTCATTGCCC-1,accessibility
HCAHeart8374344_TTTGGTTTCGTTACAG-1,HCAHeart8374344,ATAC,D7_nuclei,D7,LA,nuclei,0,21,LA,TTTGGTTTCGTTACAG-1,21-TTTGGTTTCGTTACAG-1,accessibility
HCAHeart8374344_TTTGTGTAGTACAACA-1,HCAHeart8374344,ATAC,D7_nuclei,D7,LA,nuclei,12,21,LA,TTTGTGTAGTACAACA-1,21-TTTGTGTAGTACAACA-1,accessibility
HCAHeart8374344_TTTGTGTCACTGTCGG-1,HCAHeart8374344,ATAC,D7_nuclei,D7,LA,nuclei,12,21,LA,TTTGTGTCACTGTCGG-1,21-TTTGTGTCACTGTCGG-1,accessibility


In [8]:
adata_atac.obs.donor_cellnuc.isna().sum()

0

**adata_paired**

In [9]:
adata_paired=sc.read_h5ad(directory + '/adata_paired.h5ad')
print(adata_paired.X.data[:10])
adata_paired

[2. 2. 1. 1. 1. 1. 2. 1. 2. 1.]


AnnData object with n_obs × n_vars = 30638 × 134542
    obs: 'latent_RT_efficiency', 'latent_cell_probability', 'latent_scale', 'sample_id', 'Foetal_or_Adult', 'Provider', 'Modality', 'Mapping_ver', 'Reference_genome', 'n_cells', 'multiplet_rate', 'batch', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'scrublet_score', 'scrublet_leiden', 'cluster_scrublet_score', 'doublet_pval', 'doublet_bh_pval', 'combined_id', 'protocol', 'modality', 'donor_cellnuc', 'donor', 'region', 'cell_or_nuclei', 'Chemistry', 'n_nuclei', 'barcode', 'cellatac_clusters', 'cellatac_code'
    var: 'modality', 'genes-0-0', 'peak_width', 'exon', 'gene', 'promoter', 'annotation', 'gene_name', 'gene_id', 'tss_distance', 'ENCODE_blacklist'

In [10]:
adata_paired.obs

combined_barcode,latent_RT_efficiency,latent_cell_probability,latent_scale,sample_id,Foetal_or_Adult,Provider,Modality,Mapping_ver,Reference_genome,n_cells,...,modality,donor_cellnuc,donor,region,cell_or_nuclei,Chemistry,n_nuclei,barcode,cellatac_clusters,cellatac_code
HCAHeart9508628_HCAHeart9508820_GGCTCAATCCGCAAGC-1,2.521864,1.000000,4132.466309,HCAHeart9508628,Adult,Sanger Heart Mona-Carlos,Multiome-RNA,cellranger-arc-1.0.1,GRCh38-2020-A,,...,paired,D7_Nuclei,D7,RA,Nuclei,Single Cell Multiome ATAC + Gene Expression v1,12696.0,GGCTCAATCCGCAAGC-1,4,23
HCAHeart9508628_HCAHeart9508820_AGAGGAACAGGGAGGA-1,0.823278,0.999433,2723.982178,HCAHeart9508628,Adult,Sanger Heart Mona-Carlos,Multiome-RNA,cellranger-arc-1.0.1,GRCh38-2020-A,,...,paired,D7_Nuclei,D7,RA,Nuclei,Single Cell Multiome ATAC + Gene Expression v1,12696.0,AGAGGAACAGGGAGGA-1,11,23
HCAHeart9508629_HCAHeart9508821_TACGCACCAGCTTAGC-1,2.649336,0.999964,4712.682617,HCAHeart9508629,Adult,Sanger Heart Mona-Carlos,Multiome-RNA,cellranger-arc-1.0.1,GRCh38-2020-A,,...,paired,D7_Nuclei,D7,LV,Nuclei,Single Cell Multiome ATAC + Gene Expression v1,10459.0,TACGCACCAGCTTAGC-1,0,24
HCAHeart9845436_HCAHeart9917178_CGTGAGGAGCTCCCTG-1,1.835015,0.999957,6410.629395,HCAHeart9845436,Adult,Sanger Heart Mona-Carlos,Multiome-RNA,cellranger-arc-1.0.1,GRCh38-2020-A,,...,paired,D8_Nuclei,D8,RA,Nuclei,Single Cell Multiome ATAC + Gene Expression v1,8640.0,CGTGAGGAGCTCCCTG-1,4,30
HCAHeart9508629_HCAHeart9508821_CAATGACTCTAAGTGC-1,0.838128,0.998864,3071.858154,HCAHeart9508629,Adult,Sanger Heart Mona-Carlos,Multiome-RNA,cellranger-arc-1.0.1,GRCh38-2020-A,,...,paired,D7_Nuclei,D7,LV,Nuclei,Single Cell Multiome ATAC + Gene Expression v1,10459.0,CAATGACTCTAAGTGC-1,5,24
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
HCAHeart9845433_HCAHeart9917175_GTCTTGCTCATTACAG-1,2.317862,0.999988,16451.451172,HCAHeart9845433,Adult,Sanger Heart Mona-Carlos,Multiome-RNA,cellranger-arc-1.0.1,GRCh38-2020-A,,...,paired,D8_Nuclei,D8,LA,Nuclei,Single Cell Multiome ATAC + Gene Expression v1,5033.0,GTCTTGCTCATTACAG-1,12,27
HCAHeart9508629_HCAHeart9508821_ACAACAGAGGCTACAT-1,1.658403,0.999918,3706.018311,HCAHeart9508629,Adult,Sanger Heart Mona-Carlos,Multiome-RNA,cellranger-arc-1.0.1,GRCh38-2020-A,,...,paired,D7_Nuclei,D7,LV,Nuclei,Single Cell Multiome ATAC + Gene Expression v1,10459.0,ACAACAGAGGCTACAT-1,8,24
HCAHeart9508628_HCAHeart9508820_GTATTGTCAAATTGCT-1,3.156269,1.000000,4514.356934,HCAHeart9508628,Adult,Sanger Heart Mona-Carlos,Multiome-RNA,cellranger-arc-1.0.1,GRCh38-2020-A,,...,paired,D7_Nuclei,D7,RA,Nuclei,Single Cell Multiome ATAC + Gene Expression v1,12696.0,GTATTGTCAAATTGCT-1,10,23
HCAHeart9845431_HCAHeart9917173_CCGCACACAACTAGCC-1,2.539599,0.999967,5819.325195,HCAHeart9845431,Adult,Sanger Heart Mona-Carlos,Multiome-RNA,cellranger-arc-1.0.1,GRCh38-2020-A,,...,paired,D8_Nuclei,D8,LV,Nuclei,Single Cell Multiome ATAC + Gene Expression v1,9167.0,CCGCACACAACTAGCC-1,1,25


In [11]:
adata_paired.obs.donor_cellnuc.isna().sum()

0

**Concatenate**

In [12]:
adata_mvi = scvi.data.organize_multiome_anndatas(adata_paired, adata_rna, adata_atac)
# Note: organize_multiome_anndatas annotates each cell with the modality it originates from
# ('paired', 'expression' or 'accessibility'), so the existing `modality` column in .obs is overwritten

adata_mvi

AnnData object with n_obs × n_vars = 697649 × 134542
    obs: 'latent_RT_efficiency', 'latent_cell_probability', 'latent_scale', 'sample_id', 'Foetal_or_Adult', 'Provider', 'Modality', 'Mapping_ver', 'Reference_genome', 'n_cells', 'multiplet_rate', 'batch', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'scrublet_score', 'scrublet_leiden', 'cluster_scrublet_score', 'doublet_pval', 'doublet_bh_pval', 'combined_id', 'protocol', 'modality', 'donor_cellnuc', 'donor', 'region', 'cell_or_nuclei', 'Chemistry', 'n_nuclei', 'barcode', 'cellatac_clusters', 'cellatac_code', 'CellBender_out', 'dataset', 'oribarcode'
    var: 'modality', 'genes-0-0', 'peak_width', 'exon', 'gene', 'promoter', 'annotation', 'gene_name', 'gene_id', 'tss_distance', 'ENCODE_blacklist'
    obsm: 'latent_gene_encoding'
    layers: 'binary_raw'

In [13]:
adata_mvi.obs['modality'].value_counts()

expression       618913
accessibility     48098
paired            30638
Name: modality, dtype: int64
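
A quick sanity check on the combined object could look like the sketch below. It assumes, based on the cell names that appear in the exported obs table, that each cell name ends with `_` plus its modality label. The helper `check_combined_obs` is a hypothetical name, and the toy table has one cell per modality (the `_expression` name is constructed for illustration).

```python
import pandas as pd


def check_combined_obs(obs: pd.DataFrame, expected_counts: dict) -> None:
    """Raise AssertionError if modality counts or cell-name suffixes look wrong."""
    counts = obs["modality"].value_counts().to_dict()
    assert counts == expected_counts, f"unexpected modality counts: {counts}"
    suffix_ok = [
        name.endswith("_" + modality)
        for name, modality in zip(obs.index, obs["modality"])
    ]
    assert all(suffix_ok), "some cell names lack the modality suffix"


# Toy obs table with one cell per modality
toy_obs = pd.DataFrame(
    {"modality": ["paired", "expression", "accessibility"]},
    index=[
        "HCAHeart9508628_HCAHeart9508820_GGCTCAATCCGCAAGC-1_paired",
        "HCAHeart7606896_GATGAGGCACGGCTAC_expression",
        "HCAHeart8374344_TTTGGTTTCATTGCCC-1_accessibility",
    ],
)
check_combined_obs(toy_obs, {"paired": 1, "expression": 1, "accessibility": 1})

# The counts printed above should add up to n_obs of the combined object
n_obs_total = 618913 + 48098 + 30638
print(n_obs_total)
```

On the real data the same helper could be called as `check_combined_obs(adata_mvi.obs, {...})` with the three input sizes.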

In [14]:
adata_mvi.obs.columns

Index(['latent_RT_efficiency', 'latent_cell_probability', 'latent_scale',
       'sample_id', 'Foetal_or_Adult', 'Provider', 'Modality', 'Mapping_ver',
       'Reference_genome', 'n_cells', 'multiplet_rate', 'batch', 'n_counts',
       'n_genes', 'percent_mito', 'percent_ribo', 'scrublet_score',
       'scrublet_leiden', 'cluster_scrublet_score', 'doublet_pval',
       'doublet_bh_pval', 'combined_id', 'protocol', 'modality',
       'donor_cellnuc', 'donor', 'region', 'cell_or_nuclei', 'Chemistry',
       'n_nuclei', 'barcode', 'cellatac_clusters', 'cellatac_code',
       'CellBender_out', 'dataset', 'oribarcode'],
      dtype='object')

_**MultiVI requires the features to be ordered so that genes appear before genomic regions. This must be enforced by the user.**_

In [15]:
# The features are already ordered this way here, but sorting by modality enforces it explicitly:
adata_mvi = adata_mvi[:, adata_mvi.var["modality"].argsort()].copy()
# adata_mvi.var
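
The sort above orders the labels alphabetically, so it puts genes first only if the gene label sorts before the peak label. A minimal sketch of an explicit check follows, with hypothetical labels (`Gene Expression`, `Peaks`) and feature names; the actual values of `var["modality"]` in this dataset are not shown here, so treat them as assumptions. It uses a stable sort so that the original order within each modality is kept.

```python
import numpy as np
import pandas as pd


def genes_come_first(modality: pd.Series, gene_label: str) -> bool:
    """True if every gene feature precedes every non-gene feature."""
    is_gene = (modality == gene_label).to_numpy()
    n_genes = int(is_gene.sum())
    return bool(is_gene[:n_genes].all())


# Toy feature table with genes and peaks interleaved (names are hypothetical)
var = pd.DataFrame(
    {"modality": ["Peaks", "Gene Expression", "Peaks", "Gene Expression"]},
    index=["chr1:1-500", "GENE_A", "chr2:10-400", "GENE_B"],
)

# Stable sort: genes first, original order preserved within each modality
order = np.argsort(var["modality"].to_numpy(), kind="stable")
var_sorted = var.iloc[order]

print(genes_come_first(var["modality"], "Gene Expression"))
print(genes_come_first(var_sorted["modality"], "Gene Expression"))
print(list(var_sorted.index))
```

The same index array could be applied to the columns of the AnnData object, for example `adata_mvi[:, order].copy()`.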

In [16]:
# Filter features to remove those that appear in fewer than 0.1% of the cells
print(adata_mvi.shape)
sc.pp.filter_genes(adata_mvi, min_cells=int(adata_mvi.shape[0] * 0.001))
print(adata_mvi.shape)

(697649, 134542)
(697649, 120303)
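
The threshold used above can be worked through in plain NumPy. This is a sketch of the filtering rule as I understand it (keep a feature if it is detected in at least `min_cells` cells), not scanpy's implementation; the helper names are hypothetical. The threshold is computed over all cells, while a peak can presumably only be detected in cells that carry ATAC data, so the effective cut-off for peaks is higher than 0.1% of those cells.

```python
import numpy as np


def min_cells_threshold(n_cells: int, fraction: float) -> int:
    """Same arithmetic as int(adata.shape[0] * fraction)."""
    return int(n_cells * fraction)


def features_to_keep(X: np.ndarray, min_cells: int) -> np.ndarray:
    """Boolean mask of features detected (non-zero) in at least min_cells cells."""
    n_detected = np.count_nonzero(X, axis=0)
    return n_detected >= min_cells


threshold = min_cells_threshold(697649, 0.001)
print(threshold)

# Cells that carry ATAC data: accessibility-only plus paired
n_atac_cells = 48098 + 30638
print(round(threshold / n_atac_cells, 4))

# Toy count matrix: 4 cells x 3 features
toy = np.array(
    [
        [1, 0, 0],
        [2, 0, 1],
        [0, 0, 3],
        [1, 0, 0],
    ]
)
keep = features_to_keep(toy, min_cells=2)
print(keep)
```

With these numbers a peak needs to be open in roughly 0.9% of the ATAC-containing cells to survive the filter.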


In [17]:
adata_mvi.obs.columns

Index(['latent_RT_efficiency', 'latent_cell_probability', 'latent_scale',
       'sample_id', 'Foetal_or_Adult', 'Provider', 'Modality', 'Mapping_ver',
       'Reference_genome', 'n_cells', 'multiplet_rate', 'batch', 'n_counts',
       'n_genes', 'percent_mito', 'percent_ribo', 'scrublet_score',
       'scrublet_leiden', 'cluster_scrublet_score', 'doublet_pval',
       'doublet_bh_pval', 'combined_id', 'protocol', 'modality',
       'donor_cellnuc', 'donor', 'region', 'cell_or_nuclei', 'Chemistry',
       'n_nuclei', 'barcode', 'cellatac_clusters', 'cellatac_code',
       'CellBender_out', 'dataset', 'oribarcode'],
      dtype='object')

In [18]:
# adata_mvi.obs=adata_mvi.obs[['modality','donor_cellnuc','donor','region','cell_or_nuclei','cell_states']].copy()
adata_mvi.var=adata_mvi.var[['modality']].copy()

In [20]:
adata_mvi.obs.to_csv(directory + '/adata_mvi.obs.csv')

In [24]:
df = pd.read_csv(directory + '/adata_mvi.obs.csv', index_col=0) 


In [27]:
df.cluster_scrublet_score

HCAHeart9508628_HCAHeart9508820_GGCTCAATCCGCAAGC-1_paired    0.272059
HCAHeart9508628_HCAHeart9508820_AGAGGAACAGGGAGGA-1_paired    0.029783
HCAHeart9508629_HCAHeart9508821_TACGCACCAGCTTAGC-1_paired    0.237220
HCAHeart9845436_HCAHeart9917178_CGTGAGGAGCTCCCTG-1_paired    0.316456
HCAHeart9508629_HCAHeart9508821_CAATGACTCTAAGTGC-1_paired    0.028057
                                                               ...   
HCAHeart8374344_TTTGGTTTCATTGCCC-1_accessibility                  NaN
HCAHeart8374344_TTTGGTTTCGTTACAG-1_accessibility                  NaN
HCAHeart8374344_TTTGTGTAGTACAACA-1_accessibility                  NaN
HCAHeart8374344_TTTGTGTCACTGTCGG-1_accessibility                  NaN
HCAHeart8374344_TTTGTGTGTGCCAAGA-1_accessibility                  NaN
Name: cluster_scrublet_score, Length: 697649, dtype: float64
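
The `NaN` values for the accessibility cells are what one would expect if a column exists in only some of the input objects. A minimal sketch for summarising this per modality is shown below, using a toy table; the values are invented (only the column names follow the real obs table), and the assumption is that RNA-only cells lack cellatac annotations in the same way that ATAC-only cells lack scrublet scores.

```python
import numpy as np
import pandas as pd

# Toy obs table (values are invented for illustration)
obs = pd.DataFrame(
    {
        "modality": ["paired", "paired", "expression", "accessibility", "accessibility"],
        "cluster_scrublet_score": [0.27, 0.03, 0.10, np.nan, np.nan],
        "cellatac_clusters": [4, 11, np.nan, 26, 9],
    }
)

# Fraction of missing values per column, computed separately for each modality
missing_fraction = (
    obs.drop(columns="modality").isna().groupby(obs["modality"]).mean()
)
print(missing_fraction)
```

Running the same summary on `df` would show which obs columns are safe to keep for all cells before subsetting the table.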

In [None]:
adata_mvi.obs = adata_mvi.obs[['modality','donor_cellnuc','donor','region','cell_or_nuclei']].copy()

In [None]:
# save
adata_mvi.write(directory + '/adata_mvi.h5ad')