## 04_1. Data integration and batch correction

<div style="text-align: right;">
    <p style="text-align: left;">Updated Time: 2025-02-08</p>
</div>


An important task of single-cell analysis is the integration of several samples, which we can perform with omicverse. 

Here we demonstrate how to merge data with omicverse and correct for batch effects. omicverse provides several batch-correction methods, including Harmony, Scanorama and Combat, which do not require a GPU, and SIMBA, which does. If a GPU is available, we recommend the GPU-based scVI and scANVI for the best batch-correction results.
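Whether the GPU-based methods are practical depends on the machine. A minimal sketch for checking this up front, assuming the GPU would be used through PyTorch (the backend of scvi-tools); `gpu_available` is a hypothetical helper name, not part of omicverse:

```python
def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


print("GPU available:", gpu_available())
```

If this prints `False`, the CPU-only methods (Harmony, Combat, Scanorama) remain usable.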

##### Load libraries

In [None]:
import os
os.environ["OPENBLAS_NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "4"
os.environ["OMP_NUM_THREADS"] = "4"

In [None]:
import numpy as np
import pandas as pd

import omicverse as ov
#print(f"omicverse version: {ov.__version__}")
import scanpy as sc
#print(f"scanpy version: {sc.__version__}")
# Needed for some plotting
import seaborn as sns
import matplotlib.pyplot as plt
ov.utils.ov_plot_set()

import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=DeprecationWarning)
warnings.simplefilter(action="ignore", category=UserWarning)

##### Set working directory for analysis

In [None]:
working_dir = '/media/bio/Disk/Research Data/EBV/omicverse'
os.chdir(working_dir)
updated_dir = os.getcwd()
print("Updated working directory: ", updated_dir)

from pathlib import Path
saving_dir = Path('Results/04.batch_correction')
saving_dir.mkdir(parents=True, exist_ok=True)
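The figures saved later in this notebook all follow the pattern `04.X_umap_<method>.pdf` inside the results folder. A small sketch of building those paths in one place; `umap_pdf_path` is a hypothetical helper, and the directory name is repeated here only to keep the snippet self-contained:

```python
from pathlib import Path


def umap_pdf_path(out_dir: Path, method: str) -> Path:
    """Build the PDF path used for one method's UMAP figure."""
    return out_dir / f"04.X_umap_{method}.pdf"


out_dir = Path("Results/04.batch_correction")
print(umap_pdf_path(out_dir, "harmony"))
```

The returned path could then be passed to `plt.savefig` instead of a hard-coded string.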

In [None]:
adata = sc.read("Processed Data/scRNA_AutoAnnotation.h5ad")
adata.X=adata.X.tocsr()
adata

In [None]:
print(np.min(adata.X), np.max(adata.X))
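Printing the minimum and maximum is a quick way to tell what kind of values `adata.X` holds: raw counts are non-negative whole numbers, whereas log-normalized data contain fractional values and scaled data also contain negative ones. A slightly stricter check might look like the sketch below; `looks_like_counts` is a hypothetical helper that works on a dense NumPy array (for a sparse matrix, passing its `.data` array would be one option).

```python
import numpy as np


def looks_like_counts(values: np.ndarray) -> bool:
    """Return True if all values are non-negative whole numbers."""
    values = np.asarray(values)
    if values.size == 0:
        return True
    return bool(values.min() >= 0 and np.all(values == np.round(values)))


counts = np.array([[0, 3, 1], [2, 0, 7]])
log_norm = np.log1p(counts / 10.0)  # fractional values, so not counts
print(looks_like_counts(counts), looks_like_counts(log_norm))
```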

Here, we use the sample ID as the batch label.

In [None]:
adata.obs['batch']=adata.obs['orig.ident']
adata.obs['batch'] = adata.obs['batch'].astype('category')
adata.obs['batch'].unique()
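Before integrating, it can help to confirm how many cells each batch contributes. A toy sketch with pandas; the sample names below are made up, and in the notebook the same call would be applied to `adata.obs['batch']`:

```python
import pandas as pd

# Toy stand-in for adata.obs; the sample IDs are invented for illustration.
obs = pd.DataFrame({"orig.ident": ["S1", "S1", "S2", "S3", "S3", "S3"]})
obs["batch"] = obs["orig.ident"].astype("category")

# Number of cells per batch, largest first.
cells_per_batch = obs["batch"].value_counts()
print(cells_per_batch)
```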

#### Data has been preprocessed previously

Set the .raw attribute of the AnnData object to the raw gene expression (counts) for later use. This simply freezes the state of the AnnData object.
We can get back an AnnData of the object in .raw by calling .raw.to_adata().

In [None]:
adata.raw = adata # This saves the raw count data in adata.raw

We subset to the highly variable genes so that each method has the same input.

In [None]:
# Copy the HVG subset so that scaling works on a real object, not a view
adata = adata[:, adata.var.highly_variable_features].copy()
ov.pp.scale(adata)
adata
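Conceptually, this cell keeps only the columns flagged as highly variable and then standardizes each gene. The NumPy sketch below illustrates that idea on a toy matrix; it is not omicverse's implementation, and the upper clipping threshold of 10 is an assumption chosen for illustration.

```python
import numpy as np

# Toy expression matrix: 4 cells x 3 genes, with a boolean HVG mask.
X = np.array([[1.0, 0.0, 5.0],
              [2.0, 0.0, 7.0],
              [3.0, 1.0, 9.0],
              [4.0, 1.0, 11.0]])
highly_variable = np.array([True, False, True])

# Boolean indexing returns a new array, so the original X is left untouched.
X_hvg = X[:, highly_variable]

# Standardize each gene to zero mean and unit variance.
mean = X_hvg.mean(axis=0)
std = X_hvg.std(axis=0, ddof=1)
std[std == 0] = 1.0  # avoid division by zero for constant genes

# Clip large values (assumed threshold of 10) to limit the effect of outliers.
X_scaled = np.minimum((X_hvg - mean) / std, 10.0)

print(X_scaled.shape)
```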

#### Visualise potential batch effects in the data

There is a very clear batch effect in the unintegrated data.

In [None]:
plt.rcParams['figure.figsize'] = [5, 5]
ov.utils.embedding(adata,
                basis='X_umap',frameon='small',
                color=['Dataset','gpt_celltype'],
                show=False)

plt.savefig('Results/04.batch_correction/04.X_umap_unintegrated.pdf', format='pdf', bbox_inches='tight')
plt.show()
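The visual impression can be backed by a rough number. The sketch below computes, for each cell, the fraction of its k nearest neighbours that come from the same batch: values near 1 suggest that batches sit apart, while values near a batch's overall share of cells suggest good mixing. This is an illustrative brute-force metric on toy data, not a function of omicverse or scanpy, and `same_batch_fraction` is a hypothetical name.

```python
import numpy as np


def same_batch_fraction(embedding: np.ndarray, batches: np.ndarray, k: int = 5) -> float:
    """Mean fraction of each cell's k nearest neighbours sharing its batch."""
    # Pairwise squared Euclidean distances (brute force; fine for small inputs).
    diff = embedding[:, None, :] - embedding[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    same = batches[neighbours] == batches[:, None]
    return float(same.mean())


rng = np.random.default_rng(0)
batches = np.repeat(np.array(["A", "B"]), 50)

# Two batches shifted far apart: a strong batch effect.
separated = rng.normal(size=(100, 2)) + np.where(batches[:, None] == "A", 0.0, 20.0)
# Both batches drawn from the same distribution: well mixed.
mixed = rng.normal(size=(100, 2))

print(same_batch_fraction(separated, batches), same_batch_fraction(mixed, batches))
```

In the notebook, the same function could be applied to a small subsample of `adata.obsm['X_umap']` with `adata.obs['batch']`, before and after each correction; the brute-force distance matrix does not scale to full datasets.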

### Run integration methods

Here we run some frequently used integration methods.
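Each method in this section writes its corrected representation under its own key in `adata.obsm`, and the neighbour graph is then built from that key. The mapping below simply collects the keys used in this notebook; `missing_embeddings` is a hypothetical helper for checking which methods have not been run yet.

```python
# Keys under which each method's embedding is read in this notebook.
EMBEDDING_KEYS = {
    "harmony": "X_harmony",
    "combat": "X_combat",
    "scanorama": "X_scanorama",
    "scVI": "X_scVI",
    "scANVI": "X_scANVI",
    "CellANOVA": "X_cellanova",
}


def missing_embeddings(obsm_keys) -> list:
    """Return the methods whose embedding key is absent from obsm_keys."""
    present = set(obsm_keys)
    return [method for method, key in EMBEDDING_KEYS.items() if key not in present]


# Example with a plain list standing in for adata.obsm.keys().
print(missing_embeddings(["X_pca", "X_umap", "X_harmony", "X_scVI"]))
```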


#### Harmony

Harmony is an algorithm for integrating single-cell genomics datasets. See the manuscript in [Nature Methods](https://www.nature.com/articles/s41592-019-0619-0).

In [None]:
ov.single.batch_correction(adata,batch_key='batch', methods='harmony',n_pcs=50) 

In [None]:
ov.pp.neighbors(adata, n_neighbors=15, n_pcs=20, use_rep='X_harmony')
ov.pp.umap(adata)

In [None]:
plt.rcParams['figure.figsize'] = [5, 5]
ov.utils.embedding(adata,
                basis='X_umap',frameon='small',
                color=['Dataset'],
                show=False)

plt.savefig('Results/04.batch_correction/04.X_umap_harmony.pdf', format='pdf', bbox_inches='tight')
plt.show()

#### Combat

Combat is a batch-effect correction method that is very widely used for bulk RNA-seq, and it can also be applied to single-cell sequencing data.
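To give a feel for what a location-and-scale batch adjustment does, the sketch below standardizes each gene within each batch and then restores the overall mean and spread. This is a deliberately simplified stand-in: ComBat additionally shrinks the per-batch estimates with an empirical Bayes step, which is omitted here, and `center_scale_per_batch` is a hypothetical name.

```python
import numpy as np


def center_scale_per_batch(X: np.ndarray, batches: np.ndarray) -> np.ndarray:
    """Align the per-gene mean and standard deviation across batches."""
    X = np.asarray(X, dtype=float)
    grand_mean = X.mean(axis=0)
    grand_std = X.std(axis=0)
    grand_std[grand_std == 0] = 1.0
    corrected = np.empty_like(X)
    for b in np.unique(batches):
        rows = batches == b
        mu = X[rows].mean(axis=0)
        sd = X[rows].std(axis=0)
        sd[sd == 0] = 1.0  # avoid division by zero for constant genes
        corrected[rows] = (X[rows] - mu) / sd * grand_std + grand_mean
    return corrected


rng = np.random.default_rng(1)
batches = np.repeat(np.array(["A", "B"]), 30)
X = rng.normal(size=(60, 4))
X[batches == "B"] += 5.0  # additive batch shift on every gene

corrected = center_scale_per_batch(X, batches)
gap_before = np.abs(X[batches == "A"].mean(0) - X[batches == "B"].mean(0)).max()
gap_after = np.abs(corrected[batches == "A"].mean(0) - corrected[batches == "B"].mean(0)).max()
print(gap_before, gap_after)
```

After the adjustment, the per-gene means of the two toy batches coincide up to floating-point error, which is the effect a location correction aims for.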



In [None]:
ov.single.batch_correction(adata,batch_key='batch', methods='combat',n_pcs=50)

In [None]:
ov.pp.neighbors(adata, n_neighbors=15, n_pcs=20, use_rep='X_combat')
ov.pp.umap(adata)

In [None]:
plt.rcParams['figure.figsize'] = [5, 5]
ov.utils.embedding(adata,
                basis='X_umap',frameon='small',
                color=['Dataset'],
                show=False)

plt.savefig('Results/04.batch_correction/04.X_umap_combat.pdf', format='pdf', bbox_inches='tight')
plt.show()

#### Scanorama

In [None]:
ov.single.batch_correction(adata,batch_key='batch', methods='scanorama',n_pcs=50)

In [None]:
ov.pp.neighbors(adata, n_neighbors=15, n_pcs=20, use_rep='X_scanorama')
ov.pp.umap(adata)

In [None]:
plt.rcParams['figure.figsize'] = [5, 5]
ov.utils.embedding(adata,
                basis='X_umap',frameon='small',
                color=['Dataset'],
                show=False)

plt.savefig('Results/04.batch_correction/04.X_umap_scanorama.pdf', format='pdf', bbox_inches='tight')
plt.show()

#### scVI

In [None]:
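import scvi

# Register the data with scVI: counts are read from the 'counts' layer,
# and 'batch' is the covariate to correct for
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
vae = scvi.model.SCVI(adata, gene_likelihood="nb", n_layers=2, n_latent=50)
vae.train()
# Store the learned latent space as the batch-corrected representation
adata.obsm["X_scVI"] = vae.get_latent_representation()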

In [None]:
ov.pp.neighbors(adata, n_neighbors=15, n_pcs=20, use_rep='X_scVI')
ov.pp.umap(adata)

In [None]:
plt.rcParams['figure.figsize'] = [5, 5]
ov.utils.embedding(adata,
                basis='X_umap',frameon='small',
                color=['Dataset'],
                show=False)

plt.savefig('Results/04.batch_correction/04.X_umap_scVI.pdf', format='pdf', bbox_inches='tight')
plt.show()

#### scANVI

In [None]:
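# Initialise scANVI from the trained scVI model, using the cell-type labels;
# cells labelled "Unknown" are treated as unlabeled
lvae = scvi.model.SCANVI.from_scvi_model(
    vae,
    adata=adata,
    labels_key="gpt_celltype",
    unlabeled_category="Unknown",
)
lvae.train(max_epochs=20, n_samples_per_label=100)
adata.obsm["X_scANVI"] = lvae.get_latent_representation()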

In [None]:
ov.pp.neighbors(adata, n_neighbors=15, n_pcs=20, use_rep='X_scANVI')
ov.pp.umap(adata)

In [None]:
plt.rcParams['figure.figsize'] = [5, 5]
ov.utils.embedding(adata,
                basis='X_umap',frameon='small',
                color=['Dataset'],
                show=False)

plt.savefig('Results/04.batch_correction/04.X_umap_scANVI.pdf', format='pdf', bbox_inches='tight')
plt.show()

#### CellANOVA

The integration of cells across samples to remove unwanted batch variation plays a critical role in single-cell analyses. When the samples are expected to be biologically distinct, it is often unclear how aggressively the cells should be aligned across samples to achieve uniformity. CellANOVA is a Python package for batch integration with signal recovery in single-cell data. It builds on existing single-cell data integration methods and uses a pool of control samples to quantify the batch effect and separate meaningful biological variation from unwanted batch variation. When used with an existing integration method, CellANOVA allows the recovery of biological signals that are lost during integration.

In omicverse, you only need to prepare a `control_dict` (at least two samples are required) to try `CellANOVA`. After it has run, there are two outputs to be aware of:

1. `adata.layers['denoised']`, which stores the expression matrix after the batch effect is removed.
2. A low-dimensional representation of the cells after the batch effect is removed; the cells below read it from `adata.obsm['X_cellanova']`.

- Zhang, Z., Mathew, D., Lim, T.L. et al. Recovery of biological signals lost in single-cell batch integration with CellANOVA. Nat Biotechnol (2024). https://doi.org/10.1038/s41587-024-02463-1
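Building the control pool can be sketched as below. The helper name `make_control_dict` and the sample IDs are hypothetical; the sketch assumes that `control_dict` maps a pool name to a list of batch labels, and it reads the two-sample requirement as applying to each pool.

```python
def make_control_dict(pools: dict) -> dict:
    """Validate control pools and return them as plain lists of batch labels."""
    control_dict = {}
    for name, samples in pools.items():
        samples = list(dict.fromkeys(samples))  # drop duplicates, keep order
        if len(samples) < 2:
            raise ValueError(
                f"Pool {name!r} needs at least two samples, got {len(samples)}"
            )
        control_dict[name] = samples
    return control_dict


# Invented sample IDs; in the notebook these would come from adata.obs['batch'].
control_dict = make_control_dict({"pool1": ["S1", "S2", "S3"]})
print(control_dict)
```

Failing early with a clear message is cheaper than discovering a too-small pool partway through the correction.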

In [None]:
## Construct the control pool from all batches except the last one
unique_batches = adata.obs['batch'].unique().tolist()
batches_except_last = unique_batches[:-1]  # plain list of batch labels

control_dict = {
    'pool1': batches_except_last,
}

ov.single.batch_correction(adata,batch_key='batch',n_pcs=50, methods='CellANOVA',control_dict=control_dict)

In [None]:
ov.pp.neighbors(adata, n_neighbors=15, n_pcs=20, use_rep='X_cellanova')
ov.pp.umap(adata)

In [None]:
plt.rcParams['figure.figsize'] = [5, 5]
ov.pl.embedding(adata,
                basis='X_umap',frameon='small',
                color=['Dataset'],
                show=False)

plt.savefig('Results/04.batch_correction/04.X_umap_cellANOVA.pdf', format='pdf', bbox_inches='tight')
plt.show()

#### Save AnnData object with batch-corrected embeddings

In [None]:
adata = adata.raw.to_adata() # This recovers the raw count data in adata.X
adata

In [None]:
print(np.min(adata.X), np.max(adata.X))

In [None]:
adata.write_h5ad("Processed Data/scRNA_Batch_All.h5ad")


**<span style="font-size:16px;">Session information:</span>**

In [None]:
import sys
import platform
from importlib.metadata import distributions  # replaces the deprecated pkg_resources

# Get Python version information
python_version = sys.version
# Get operating system information
os_info = platform.platform()
# Get system architecture information
architecture = platform.architecture()[0]
# Get CPU information
cpu_info = platform.processor()
# Print Session information
print("Python version:", python_version)
print("Operating system:", os_info)
print("System architecture:", architecture)
print("CPU info:", cpu_info)

# Print all installed packages and their versions
print("\nInstalled packages and their versions:")
for dist in distributions():
    print(dist.metadata["Name"], dist.version)