## 04_2. Integration metrics

<div style="text-align: right;">
    <p style="text-align: left;">Updated Time: 2025-02-10</p>
</div>


##### Load libraries

In [None]:
import os
import sys
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
from scib_metrics.benchmark import Benchmarker

import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=DeprecationWarning)
warnings.simplefilter(action="ignore", category=UserWarning)

##### Set working directory for analysis

In [None]:
working_dir = '/media/bio/Disk/Research Data/EBV/omicverse'
os.chdir(working_dir)
updated_dir = os.getcwd()
print("Updated working directory: ", updated_dir)

from pathlib import Path
saving_dir = Path('Results/04.batch_correction')
saving_dir.mkdir(parents=True, exist_ok=True)

In [None]:
adata = sc.read("Processed Data/scRNA_Batch_All.h5ad")

### Benchmarking test

The methods demonstrated here are selected based on results from benchmarking experiments, including the single-cell integration benchmarking project [Luecken et al., 2021]. That project also produced a software package called [scib](https://www.github.com/theislab/scib), which can run a range of integration methods as well as the metrics used for evaluation. In this section, we use the `Benchmarker` class from the related `scib_metrics` package (imported above) to evaluate the quality of each integration.

In [None]:
adata.obsm["Unintegrated"] = adata.obsm['scaled|original|X_pca'].copy()

**<span style="color:darkblue; font-size:20px;">Run benchmarking metrics</span>**

Check whether any cells share identical embeddings. Duplicated embeddings might cause errors in the subsequent Benchmarker analysis.

In [None]:
adata

In [None]:
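for embed in ["Unintegrated","X_harmony","X_combat","X_scanorama","X_scVI","X_scANVI","X_cellanova"]:
    print(embed)
    # Shape of the full embedding: (n_cells, n_dims)
    print(adata.obsm[embed].shape)
    # Shape after dropping duplicate rows; fewer rows than above means some cells share an embedding
    print(np.unique(adata.obsm[embed], axis=0).shape)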
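
The comparison of the two shapes above can be wrapped in a small helper that reports the number of duplicated rows directly. This is a minimal sketch: the function name `count_duplicate_rows` and the toy array are hypothetical and only illustrate the check.

```python
import numpy as np


def count_duplicate_rows(embedding: np.ndarray) -> int:
    """Return how many rows are exact copies of another row in the array."""
    embedding = np.asarray(embedding)
    n_unique = np.unique(embedding, axis=0).shape[0]
    return embedding.shape[0] - n_unique


# Toy embedding (illustrative values only): rows 0 and 2 are identical.
toy = np.array([[0.0, 1.0], [2.0, 3.0], [0.0, 1.0], [4.0, 5.0]])
n_dup = count_duplicate_rows(toy)
print("duplicated rows:", n_dup)
```

In the notebook, the same helper could be applied to each `adata.obsm[embed]`; a non-zero count flags an embedding to inspect before running the Benchmarker.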

Here we use a custom nearest neighbor function to speed up the computation of the metrics. This is not necessary, but can be useful for large datasets.

In particular we use faiss, which can be accelerated with a GPU.

It can be installed with `conda install -c conda-forge faiss-gpu`.


When using approximate nearest neighbors, an issue can arise where a cell does not get a unique set of K neighbors. This issue occurs with the faiss HNSW function defined below (`faiss_hnsw_nn`), so we use the brute force method (`faiss_brute_force_nn`) instead, which is still faster than pynndescent approximate nearest neighbors on CPU.
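
One way to detect the non-unique neighbor problem is to inspect the index array returned by the search. The sketch below is an illustration, not part of `scib_metrics`: the function name `rows_with_invalid_neighbors` is hypothetical, and it assumes that a missing neighbor is encoded as `-1`, which is how faiss pads results when fewer than K neighbors are found.

```python
import numpy as np


def rows_with_invalid_neighbors(indices: np.ndarray) -> np.ndarray:
    """Return row numbers whose neighbor list has a repeated or missing (-1) index."""
    indices = np.asarray(indices)
    # After sorting each row, a repeated index shows up as a zero difference.
    sorted_idx = np.sort(indices, axis=1)
    has_repeat = (np.diff(sorted_idx, axis=1) == 0).any(axis=1)
    # Assumption: -1 marks a neighbor slot that the search could not fill.
    has_missing = (indices < 0).any(axis=1)
    return np.flatnonzero(has_repeat | has_missing)


# Toy neighbor indices (illustrative only): row 1 repeats index 1, row 2 has a missing slot.
toy_indices = np.array([[0, 1, 2], [1, 1, 0], [2, 0, -1]])
bad_rows = rows_with_invalid_neighbors(toy_indices)
print(bad_rows)
```

An empty result means every cell received K distinct neighbors.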

In [None]:
import faiss
from scib_metrics.nearest_neighbors import NeighborsResults


def faiss_hnsw_nn(X: np.ndarray, k: int):
    """GPU HNSW nearest neighbor search using faiss.
    See https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md
    for index param details.
    """
    X = np.ascontiguousarray(X, dtype=np.float32)
    res = faiss.StandardGpuResources()
    M = 32
    index = faiss.IndexHNSWFlat(X.shape[1], M, faiss.METRIC_L2)
    gpu_index = faiss.index_cpu_to_gpu(res, 0, index)
    gpu_index.add(X)
    distances, indices = gpu_index.search(X, k)
    del index
    del gpu_index
    # distances are squared
    return NeighborsResults(indices=indices, distances=np.sqrt(distances))


def faiss_brute_force_nn(X: np.ndarray, k: int):
    """GPU brute force nearest neighbor search using faiss."""
    X = np.ascontiguousarray(X, dtype=np.float32)
    res = faiss.StandardGpuResources()
    index = faiss.IndexFlatL2(X.shape[1])
    gpu_index = faiss.index_cpu_to_gpu(res, 0, index)
    gpu_index.add(X)
    distances, indices = gpu_index.search(X, k)
    del index
    del gpu_index
    # distances are squared
    return NeighborsResults(indices=indices, distances=np.sqrt(distances))
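
To make clear what the brute force search computes, here is a CPU-only NumPy sketch of the same idea: compute all pairwise squared L2 distances, keep the K smallest per cell, and take the square root. The name `numpy_brute_force_nn` is hypothetical, the function returns plain arrays rather than a `NeighborsResults` object, and it needs O(n²) memory, so it is only suitable for small inputs.

```python
import numpy as np


def numpy_brute_force_nn(X: np.ndarray, k: int):
    """Exact k-nearest-neighbor search on CPU; each cell is its own first neighbor."""
    X = np.ascontiguousarray(X, dtype=np.float32)
    sq_norms = (X ** 2).sum(axis=1)
    # Squared L2 distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating point rounding.
    np.maximum(sq_dists, 0.0, out=sq_dists)
    indices = np.argsort(sq_dists, axis=1, kind="stable")[:, :k]
    # Take the square root so the distances are Euclidean, not squared.
    distances = np.sqrt(np.take_along_axis(sq_dists, indices, axis=1))
    return indices, distances


# Toy points on a line (illustrative only).
toy_X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]])
toy_indices, toy_distances = numpy_brute_force_nn(toy_X, k=2)
print(toy_indices)
print(toy_distances)
```

Wrapping the two arrays as `NeighborsResults(indices=..., distances=...)` would give the same return type as the faiss functions above.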

In [None]:
from scib_metrics.benchmark import BioConservation, BatchCorrection
# biocons = BioConservation(isolated_labels=False)
# batchcor = BatchCorrection(pcr_comparison=False)

bm = Benchmarker(
    adata,
    batch_key="batch",
    label_key="gpt_celltype",
    embedding_obsm_keys=["Unintegrated","X_harmony","X_combat","X_scanorama","X_scVI","X_scANVI","X_cellanova"],
    pre_integrated_embedding_obsm_key="scaled|original|X_pca",
    # bio_conservation_metrics=biocons,
    # batch_correction_metrics=batchcor,
    n_jobs=-1,
)

bm.prepare(neighbor_computer=faiss_brute_force_nn)
bm.benchmark()

Visualize the results

In [None]:
plt.rcParams['figure.figsize'] = [10, 4]
bm.plot_results_table(min_max_scale=False, show=False)
plt.savefig("Results/04.batch_correction/04. Integration_Metrics.pdf")
plt.show()

In [None]:
plt.rcParams['figure.figsize'] = [10, 4]
bm.plot_results_table(min_max_scale=True, show=False)
plt.savefig("Results/04.batch_correction/04. Integration_Metrics_Scaled.pdf")
plt.show()
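
The two tables differ only in scaling. As we understand it, `min_max_scale=True` rescales each metric across the embeddings to the range [0, 1], so the scaled table shows relative rather than absolute performance. The sketch below illustrates that transformation on a made-up array (rows as embeddings, columns as metrics); the function name `min_max_scale_columns` and the numbers are hypothetical and are not results from this analysis.

```python
import numpy as np


def min_max_scale_columns(scores: np.ndarray) -> np.ndarray:
    """Rescale each column to [0, 1]; constant columns become all zeros."""
    scores = np.asarray(scores, dtype=float)
    col_min = scores.min(axis=0)
    col_range = scores.max(axis=0) - col_min
    # Avoid division by zero when a metric has the same value for every embedding.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (scores - col_min) / safe_range


# Made-up scores for three embeddings and two metrics (illustration only).
toy_scores = np.array([[0.2, 0.5], [0.4, 0.5], [0.6, 0.5]])
scaled = min_max_scale_columns(toy_scores)
print(scaled)
```

Because scaling is relative to the embeddings being compared, scaled scores change when a method is added or removed, whereas the unscaled table does not.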

We can also access the underlying dataframes to print the results ourselves.

In [None]:
df = bm.get_results(min_max_scale=False)
# Keep the index when exporting: it holds the embedding names for each row
df.to_excel('Results/04.batch_correction/Integration_Metrics_Benchmark.xlsx', sheet_name='Sheet1', index=True)
print(df)

In these results, Harmony removes the batch effect best among the three methods that do not use the GPU; scVI is a method that removes the batch effect using the GPU.


**<span style="font-size:18px;">Session information:</span>**

In [None]:
import sys
import platform
from importlib import metadata  # replaces the deprecated pkg_resources API

# Get Python version information
python_version = sys.version
# Get operating system information
os_info = platform.platform()
# Get system architecture information
architecture = platform.architecture()[0]
# Get CPU information
cpu_info = platform.processor()
# Print session information
print("Python version:", python_version)
print("Operating system:", os_info)
print("System architecture:", architecture)
print("CPU info:", cpu_info)

# Print all installed packages (not only the imported ones) and their versions
print("\nInstalled packages and their versions:")
installed = {
    (dist.metadata["Name"] or "").lower(): dist.version
    for dist in metadata.distributions()
}
for name in sorted(installed):
    print(name, installed[name])