## References

Tutorial: https://docs.scvi-tools.org/en/stable/user_guide/notebooks/MultiVI_tutorial.html <br>
Paper: https://www.biorxiv.org/content/10.1101/2021.08.20.457057v2

## Dataset to prepare

### 1) RNA (scnRNA + Multiome-RNA)
* Read in data: post-CellBender counts, filtered as in the previous HCA object, with cell-type annotations
* Subset scnRNA: barcode x gene -> **`adata_rna.h5ad`**
* Subset MultiomeRNA: barcode x gene

### 2) ATAC (snATAC + Multiome-ATAC)
* Read in data: post-cellatac output with filtered peaks and nuclei, `6reg-v2_ATACs_filtered.h5ad`
* Subset snATAC: barcode x peak -> **`adata_atac.h5ad`**
* Subset MultiomeATAC: barcode x peak

### 3) Concatenate Multiome RNA+ATAC
barcode x (gene+peak) -> **`adata_paired.h5ad`**
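
The paired object described above can be sketched with toy data. This is only an illustration of the barcode x (gene+peak) layout, not the code used to build `adata_paired.h5ad`: the barcodes, feature names, and counts below are hypothetical, and the two matrices are assumed to share the same barcodes in the same row order.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins (hypothetical): 3 paired barcodes, 2 genes, 3 peaks.
barcodes = ["bc1", "bc2", "bc3"]
genes = ["GENE_A", "GENE_B"]
peaks = ["chr1:100-200", "chr1:300-400", "chr2:50-150"]

rna_counts = sparse.csr_matrix(
    np.array([[1, 0], [0, 2], [3, 1]], dtype=np.float32)
)
atac_counts = sparse.csr_matrix(
    np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0]], dtype=np.float32)
)

# Stack columns with genes first and peaks second,
# so each row is one barcode x (gene + peak).
paired_X = sparse.hstack([rna_counts, atac_counts], format="csr")

# Feature table recording which modality each column belongs to.
paired_var = pd.DataFrame(
    {"modality": ["Gene Expression"] * len(genes) + ["Peaks"] * len(peaks)},
    index=genes + peaks,
)
paired_obs = pd.DataFrame(index=barcodes)

print(paired_X.shape)
print(paired_var)
```

The three pieces could then be wrapped as `anndata.AnnData(X=paired_X, obs=paired_obs, var=paired_var)` before writing to disk.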

## Concatenate the multimodal AnnData objects (using the MultiVI helper function)

In [1]:
import scanpy as sc
import numpy as np
import pandas as pd
import scipy
import anndata
import scvi

In [2]:
import session_info
session_info.show()

**adata_rna**

In [3]:
adata_rna=sc.read_h5ad('/nfs/team205/kk18/data/6region_v2/MultiVI/adata_rna_downsized.h5ad')
print(adata_rna.X.data[:10])
adata_rna

[1. 1. 1. 2. 1. 1. 2. 1. 1. 1.]


AnnData object with n_obs × n_vars = 13822 × 31915
    obs: 'sangerID', 'modality', 'donor', 'age_group', 'region', 'cell_or_nuclei', 'gender', 'type', 'cell_states', 'modality_fine', 'donor_cellnuc', 'batch'
    var: 'modality'

**adata_atac**

In [5]:
adata_atac=sc.read_h5ad('/nfs/team205/kk18/data/6region_v2/MultiVI/adata_atac_downsized.h5ad')
print(adata_atac.X.data[:10])
adata_atac

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


AnnData object with n_obs × n_vars = 11170 × 102627
    obs: 'cellatac_clusters', 'cellatac_code', 'sangerID', 'dataset', 'donor', 'Region', 'barcode', 'oribarcode', 'donor_cellnuc', 'modality', 'batch'
    var: 'modality'
    layers: 'binary_raw'

**adata_paired**

In [6]:
adata_paired=sc.read_h5ad('/nfs/team205/kk18/data/6region_v2/MultiVI/adata_paired_downsized.h5ad')
print(adata_paired.X.data[:10])
adata_paired

[3. 2. 1. 3. 1. 2. 1. 1. 1. 1.]


AnnData object with n_obs × n_vars = 7517 × 134542
    obs: 'Combined_ID', 'rna_sangerID', 'barcode', 'donor', 'age_group', 'region', 'cell_or_nuclei', 'gender', 'type', 'cell_states', 'modality', 'donor_cellnuc', 'atac_sangerID', 'cellatac_clusters', 'cellatac_code', 'batch'
    var: 'modality'

**Concatenate**

In [7]:
adata_mvi = scvi.data.organize_multiome_anndatas(adata_paired, adata_rna, adata_atac)
# Note: organize_multiome_anndatas annotates each cell with the modality it originates from,
# so the 'modality' column in obs is overwritten (expression / accessibility / paired)

adata_mvi

AnnData object with n_obs × n_vars = 32509 × 134542
    obs: 'Combined_ID', 'rna_sangerID', 'barcode', 'donor', 'age_group', 'region', 'cell_or_nuclei', 'gender', 'type', 'cell_states', 'modality', 'donor_cellnuc', 'atac_sangerID', 'cellatac_clusters', 'cellatac_code', 'batch', 'sangerID', 'modality_fine', 'dataset', 'Region', 'oribarcode'
    var: 'modality'
    layers: 'binary_raw'

In [8]:
adata_mvi.obs['modality'].value_counts()

expression       13822
accessibility    11170
paired            7517
Name: modality, dtype: int64
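
As a quick consistency check, the three modality counts should add up to the number of cells in the combined object. The sketch below copies the numbers from the outputs above into a small pandas Series instead of reading `adata_mvi`, so it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
modality_counts = pd.Series(
    {"expression": 13822, "accessibility": 11170, "paired": 7517},
    name="modality",
)
n_obs_expected = 32509  # n_obs reported for adata_mvi

# Every cell should carry exactly one modality label.
total = int(modality_counts.sum())
assert total == n_obs_expected, "modality counts do not add up to n_obs"

# Share of each modality in the combined object.
fractions = (modality_counts / total).round(3)
print(fractions)
```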

_**MultiVI requires the features to be ordered so that genes appear before genomic regions. This must be enforced by the user.**_

In [9]:
# The features are already ordered here, but sorting by modality enforces it
# ("Gene Expression" sorts before "Peaks"):
adata_mvi = adata_mvi[:, adata_mvi.var["modality"].argsort()].copy()
adata_mvi.var

| | modality |
|---|---|
| MIR1302-2HG | Gene Expression |
| HS6ST3 | Gene Expression |
| UGGT2 | Gene Expression |
| DNAJC3 | Gene Expression |
| DNAJC3-DT | Gene Expression |
| ... | ... |
| chr5:116420065-116420948 | Peaks |
| chr5:116388452-116389251 | Peaks |
| chr5:116256702-116257293 | Peaks |
| chr5:116652044-116653272 | Peaks |
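
The cell above sorts the features but does not actually test their order. A minimal sketch of an explicit check follows, assuming the `modality` column only contains the two labels shown in the table (`Gene Expression` and `Peaks`). The helper `genes_before_peaks` and the toy feature table are hypothetical and not part of scvi-tools.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table mimicking adata_mvi.var, with one peak out of place.
var = pd.DataFrame(
    {"modality": ["Gene Expression", "Peaks", "Gene Expression", "Peaks"]},
    index=[
        "HS6ST3",
        "chr5:116420065-116420948",
        "UGGT2",
        "chr5:116388452-116389251",
    ],
)


def genes_before_peaks(modality: pd.Series) -> bool:
    """Return True when no 'Gene Expression' feature follows a 'Peaks' feature."""
    is_peak = (modality == "Peaks").to_numpy()
    # With genes first, the mask is non-decreasing (all False, then all True).
    return bool(np.all(np.diff(is_peak.astype(int)) >= 0))


print(genes_before_peaks(var["modality"]))

# A stable sort moves peaks to the end and keeps the original
# relative order of the features within each modality.
order = np.argsort((var["modality"] == "Peaks").to_numpy(), kind="stable")
var_sorted = var.iloc[order]
print(genes_before_peaks(var_sorted["modality"]))
```

On the real object the same check would be called as `genes_before_peaks(adata_mvi.var["modality"])`.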


In [11]:
# Filter features to remove those that appear in fewer than 1% of the cells
print(adata_mvi.shape)
sc.pp.filter_genes(adata_mvi, min_cells=int(adata_mvi.shape[0] * 0.01))
print(adata_mvi.shape)

(32509, 134542)
(32509, 80362)
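
To make the 1% threshold concrete: with 32509 cells, `int(32509 * 0.01)` gives 325, so a feature is kept only if it is detected in at least 325 cells. The sketch below reproduces that rule on a toy sparse matrix with SciPy instead of calling `sc.pp.filter_genes`; the toy matrix and variable names are hypothetical.

```python
import numpy as np
from scipy import sparse

n_cells = 32509  # n_obs of adata_mvi reported above
min_cells = int(n_cells * 0.01)  # same expression as in the cell above
print(min_cells)

# Toy matrix (hypothetical): 200 cells x 4 features,
# detected in 50, 1, 2 and 0 cells respectively.
dense = np.zeros((200, 4), dtype=np.float32)
dense[:50, 0] = 1
dense[:1, 1] = 3
dense[:2, 2] = 1
X = sparse.csr_matrix(dense)

toy_min_cells = int(X.shape[0] * 0.01)  # 1% of 200 cells

# Number of cells with a non-zero entry for each feature.
n_cells_per_feature = X.getnnz(axis=0)

# Keep features detected in at least `toy_min_cells` cells.
keep = n_cells_per_feature >= toy_min_cells
X_filtered = X[:, keep]
print(n_cells_per_feature, keep, X_filtered.shape)
```

In the notebook, this filter removes 54180 of the 134542 features, leaving 80362.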


In [12]:
adata_mvi.obs=adata_mvi.obs[['modality','donor_cellnuc','donor','region','cell_or_nuclei','cell_states']].copy()
adata_mvi.var=adata_mvi.var[['modality']].copy()

In [13]:
# save
adata_mvi.write('/nfs/team205/kk18/data/6region_v2/MultiVI/adata_mvi_downsized.h5ad')

... storing 'modality' as categorical
... storing 'donor_cellnuc' as categorical
... storing 'donor' as categorical
... storing 'cell_or_nuclei' as categorical
... storing 'cell_states' as categorical
