## 02_1. Preprocessing scRNA-seq data

<div style="text-align: right;">
    <p style="text-align: left;">Updated Time: 2025-02-09</p>
</div>


**<span style="font-size:14px;">Load libraries</span>**

In [None]:
import os
import sys
import warnings
import numpy as np
import pandas as pd

import anndata as ad
import scanpy as sc
import omicverse as ov

# Needed for some plotting
import seaborn as sns
import matplotlib.pyplot as plt
ov.plot_set()

# Silence noisy warnings raised by dependencies
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=DeprecationWarning)
warnings.simplefilter(action="ignore", category=UserWarning)

##### Set working directory for analysis

In [None]:
working_dir = '/media/bio/Disk/Research Data/EBV/omicverse'
os.chdir(working_dir)
updated_dir = os.getcwd()
print("Updated working directory: ", updated_dir)

from pathlib import Path
saving_dir = Path('Results/02.data_preprocessing')
saving_dir.mkdir(parents=True, exist_ok=True)


##### Reading the non-QC AnnData object

In [None]:
adata = sc.read("Processed Data/scRNA_unfiltered.h5ad")
adata

In [None]:
print(np.min(adata.X), np.max(adata.X))

### Preprocessing

#### Quality control

Single-cell data require quality control (QC) before analysis. This includes removing doublets, low-quality cells with few detected transcripts, and genes expressed in very few cells. In addition, cells are filtered on the mitochondrial gene ratio, the number of transcripts, the number of genes expressed per cell, cellular complexity, and similar metrics. For a detailed description of the different QC metrics, see: https://hbctraining.github.io/scRNA-seq/lessons/04_SC_quality_control.html

<div class="admonition warning">
  <p class="admonition-title">Note</p>
  <p>
    If the installed version of `omicverse` is newer than `1.6.4`, `doublets_method` can be set to either `scrublet` or `sccomposite`.
  </p>
</div>

COMPOSITE (COMpound POiSson multIplet deTEction model) is a computational tool for multiplet detection in both single-cell single-omics and multiomics settings. It has been implemented as an automated pipeline and is available as both a cloud-based application with a user-friendly interface and a Python package.

Hu, H., Wang, X., Feng, S. et al. A unified model-based framework for doublet or multiplet detection in single-cell multiomics data. Nat Commun 15, 5562 (2024). https://doi.org/10.1038/s41467-024-49448-x

In [None]:
%%time
adata=ov.pp.qc(adata,
              tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250},
              doublets_method='scrublet',
              batch_key='orig.ident',
              path_viz='Results/02.data_preprocessing')
adata
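
To make the thresholds passed to `ov.pp.qc` more concrete, here is a minimal NumPy sketch of the three per-cell filters on a toy count matrix. It is not the omicverse implementation: the function name `qc_keep_mask`, the `MT-` prefix rule for mitochondrial genes, and the strict inequalities are assumptions made for illustration, and doublet detection is left out.

```python
import numpy as np

def qc_keep_mask(counts, gene_names, mito_perc=0.2, n_umis=500, detected_genes=250):
    """Return (keep, mito_fraction) for a cells x genes count matrix (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    # Assumption: human mitochondrial genes carry the "MT-" prefix.
    is_mito = np.array([name.upper().startswith("MT-") for name in gene_names])

    total_umis = counts.sum(axis=1)             # transcripts (UMIs) per cell
    genes_per_cell = (counts > 0).sum(axis=1)   # genes with at least one count

    # Fraction of counts coming from mitochondrial genes; empty cells stay at 0.
    mito_fraction = np.zeros_like(total_umis)
    nonzero = total_umis > 0
    mito_fraction[nonzero] = counts[nonzero][:, is_mito].sum(axis=1) / total_umis[nonzero]

    # Assumption: a cell is kept only if it is strictly inside every threshold.
    keep = (
        (mito_fraction < mito_perc)
        & (total_umis > n_umis)
        & (genes_per_cell > detected_genes)
    )
    return keep, mito_fraction

# Toy data with small thresholds so that the effect is visible.
toy_counts = [[90, 5, 5], [10, 45, 45], [1, 0, 0]]
toy_genes = ["MT-CO1", "CD3E", "ACTB"]
keep, mito_fraction = qc_keep_mask(
    toy_counts, toy_genes, mito_perc=0.2, n_umis=50, detected_genes=1
)
print(keep)
```

In the toy data only the second cell passes: the first has a high mitochondrial fraction and the third has too few counts.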

### Highly variable gene detection

Here we use Pearson residuals to select highly variable genes. This approach has been proposed as superior to ordinary normalisation. See the [article](https://www.nature.com/articles/s41592-023-01814-1#MOESM3) in *Nature Methods* for details.
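
As a rough illustration of the idea behind Pearson-residual gene selection, the sketch below computes analytic Pearson residuals under a negative binomial null model and ranks genes by residual variance. The function name `pearson_residual_variance`, the overdispersion `theta=100`, and clipping at the square root of the number of cells are assumptions chosen to mirror commonly used defaults; the sketch also assumes every gene has at least one count.

```python
import numpy as np

def pearson_residual_variance(counts, theta=100.0):
    """Variance of clipped analytic Pearson residuals per gene (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]

    # Expected counts if every gene were a fixed fraction of each cell's depth.
    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    mu = cell_totals @ gene_totals / counts.sum()

    # Negative binomial variance: mu + mu^2 / theta.
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)

    # Clip extreme residuals so a single cell cannot dominate the ranking.
    clip = np.sqrt(n_cells)
    residuals = np.clip(residuals, -clip, clip)
    return residuals.var(axis=0)

toy = np.array([
    [10, 0, 90],
    [10, 20, 70],
    [10, 0, 90],
    [10, 20, 70],
])
variances = pearson_residual_variance(toy)
ranking = np.argsort(variances)[::-1]  # most variable gene first
print(variances, ranking)
```

The first gene follows sequencing depth exactly and gets zero residual variance, while the second gene switches on and off between cells and ranks first.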


`normalize|HVGs`: the `mode` string uses `|` to separate the two preprocessing steps. The part before `|` selects the normalisation method, either `shiftlog` or `pearson`, and the part after `|` selects the highly variable gene method, either `pearson` or `seurat`. The default is `shiftlog|pearson`.

- If you use `mode='shiftlog|pearson'`, you need to set `target_sum=50*1e4`. Many people prefer `target_sum=1e4`, but in our tests `50*1e4` gave better results.
- If you use `mode='pearson|pearson'`, you do not need to set `target_sum`.

<div class="admonition warning">
  <p class="admonition-title">Note</p>
  <p>
    If the installed version of `omicverse` is older than `1.4.13`, `mode` can only be set to `scanpy` or `pearson`.
  </p>
</div>


In [None]:
adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=2000,target_sum=50*1e4)
adata
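
For intuition, the `shiftlog` step with `target_sum=50*1e4` can be sketched as depth normalisation followed by `log1p`. This is a simplified NumPy stand-in rather than the omicverse code; the helper name `shiftlog` is hypothetical and the sketch assumes every cell has at least one count.

```python
import numpy as np

def shiftlog(counts, target_sum=50 * 1e4):
    """Scale each cell to target_sum total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)  # sequencing depth per cell
    return np.log1p(counts / totals * target_sum)

toy = np.array([[1, 3, 0, 6], [20, 0, 10, 70]])
normalized = shiftlog(toy)

# Undoing the log shows that every cell now sums to target_sum.
row_sums = np.expm1(normalized).sum(axis=1)
print(row_sums)
```

Zero counts stay at zero after the transformation, which keeps the matrix sparse.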

Set the .raw attribute of the AnnData object to the normalized and logarithmized raw gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object.

In [None]:
adata.raw = adata
adata = adata[:, adata.var.highly_variable_features]
adata

### Principal component analysis

In contrast to scanpy, we do not overwrite the original expression matrix with the variance-scaled values. Instead, the result of the scaling is stored in a layer. Scaling can change the data distribution, and we have not found it to be meaningful in any scenario other than principal component analysis.

In [None]:
ov.pp.scale(adata)
adata
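
The idea of keeping the scaled values next to the untouched matrix can be sketched with plain NumPy, using a dictionary as a stand-in for `adata.layers`. The helper name `zscore_genes` and the clipping at `max_value=10` are assumptions for illustration, not a description of what `ov.pp.scale` does internally.

```python
import numpy as np

def zscore_genes(X, max_value=10.0):
    """Centre each gene to mean 0 and unit variance, then clip extreme values."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant genes would otherwise divide by zero
    return np.clip((X - mean) / std, -max_value, max_value)

X = np.array([[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]])

# The scaled copy lives in a separate "layer"; X itself is left unchanged.
layers = {"scaled": zscore_genes(X)}
print(layers["scaled"])
print(X)
```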

If you want to perform PCA on the `normlog` layer, you can set `layer='normlog'`, but we consider scaling necessary for PCA.

In [None]:
ov.pp.pca(adata,layer='scaled',n_pcs=50)
adata

Let us inspect the contribution of single PCs to the total variance in the data. 
This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells. In our experience, often a rough estimate of the number of PCs does fine.


In [None]:
ov.utils.plot_pca_variance_ratio(adata)
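
The quantity shown in such a plot is the fraction of total variance carried by each PC. A minimal sketch of that calculation, assuming a dense cells x genes matrix and using a plain SVD (the helper name `pca_variance_ratio` and the random toy data are illustrative only):

```python
import numpy as np

def pca_variance_ratio(X, n_pcs=50):
    """Fraction of total variance explained by each of the first n_pcs components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Singular values of the centred matrix give the PC variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values**2 / (X.shape[0] - 1)
    return (variances / variances.sum())[:n_pcs]

rng = np.random.default_rng(0)
# Five toy "genes" with very different spreads.
toy = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
ratios = pca_variance_ratio(toy)
cumulative = np.cumsum(ratios)
print(ratios)
print(cumulative)
```

A common rough rule is to keep PCs up to the point where the cumulative curve flattens.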

### Computing the neighborhood graph

We compute the neighborhood graph of cells from the PCA representation of the scaled data (`scaled|original|X_pca`), using 15 neighbors and the first 20 PCs. This graph is the input for both the clustering and the UMAP embedding in the following sections.

In [None]:
ov.pp.neighbors(adata, n_neighbors=15, n_pcs=20, use_rep='scaled|original|X_pca')
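
Conceptually, the neighbors step finds the `n_neighbors` closest cells for every cell in PC space. The brute-force sketch below shows only that core idea on a handful of 2D points; the helper name `knn_indices` is hypothetical, and real implementations typically use approximate search and convert distances into weighted connectivities.

```python
import numpy as np

def knn_indices(embedding, n_neighbors=15):
    """Indices of the n_neighbors nearest points for each row (Euclidean, exact)."""
    embedding = np.asarray(embedding, dtype=float)
    diff = embedding[:, None, :] - embedding[None, :, :]
    distances = np.sqrt((diff**2).sum(axis=-1))
    np.fill_diagonal(distances, np.inf)  # a cell is not its own neighbor here
    return np.argsort(distances, axis=1)[:, :n_neighbors]

# Two well-separated groups of points standing in for cells in PC space.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0]])
neighbors = knn_indices(points, n_neighbors=1)
print(neighbors[:, 0])
```

Each point's nearest neighbor lies within its own group, which is the structure that graph clustering later picks up.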

### Clustering the neighborhood graph

As with Seurat and many other frameworks, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag *et al.* (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.

In [None]:
ov.pp.leiden(adata,resolution=1,key_added='leiden_1_0')
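
To illustrate what "optimizing modularity" means, the sketch below evaluates the modularity of a given partition on a small graph. It only scores partitions and does not perform the Leiden search itself; the function name `modularity` and the toy graph are assumptions for illustration.

```python
import numpy as np

def modularity(adjacency, labels, resolution=1.0):
    """Modularity Q of a partition of an undirected graph."""
    A = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    degrees = A.sum(axis=1)
    two_m = degrees.sum()  # twice the number of edges
    same_cluster = labels[:, None] == labels[None, :]
    # Observed edges minus the edges expected by chance, within clusters only.
    expected = resolution * np.outer(degrees, degrees) / two_m
    return ((A - expected) * same_cluster).sum() / two_m

# Two triangles joined by a single edge between nodes 2 and 3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one cluster per triangle
q_single = modularity(A, [0, 0, 0, 0, 0, 0])  # everything in one cluster
print(q_split, q_single)
```

Splitting the graph into its two triangles scores higher than a single cluster. A larger `resolution` penalises big clusters more strongly, which is why higher values tend to produce more clusters.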

#### Embedding the neighborhood graph
We suggest embedding the graph in two dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than t-SNE, i.e., it better preserves trajectories. On some occasions, you might still observe disconnected clusters and similar connectivity violations.

omicverse also provides a wrapper around the pymde package, a GPU-accelerated alternative to UMAP for visualizing the PCA embedding. In this notebook we use UMAP:

In [None]:
ov.pp.umap(adata)

We redesigned the embedding visualisation to distinguish it from scanpy's by adding the option `frameon='small'`, which causes the axes to be scaled together with the colourbar.

In [None]:
ov.pl.embedding(adata,
                basis='X_umap',
                color=['leiden_1_0','CD3E','Dataset'],
                ncols=1,
                frameon='small')

### Score cell cycle

OmicVerse stores both the G1/S and G2/M gene lists inside the function (for human and mouse), so you can run cell cycle analysis without having to enter the cell cycle genes manually.

In [None]:
# Restore the full gene set from .raw so that all cell cycle genes are available
adata = adata.raw.to_adata()
adata.raw = adata
ov.pp.score_genes_cell_cycle(adata, species='human')
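
Once each cell has an S score and a G2/M score, a phase label can be derived from the two numbers. The sketch below follows the rule used by scanpy's `score_genes_cell_cycle`; that the omicverse wrapper applies the same rule is an assumption, and the helper name `assign_phase` is hypothetical.

```python
import numpy as np

def assign_phase(s_score, g2m_score):
    """Assign S, G2M or G1 from per-cell S and G2/M scores."""
    s_score = np.asarray(s_score, dtype=float)
    g2m_score = np.asarray(g2m_score, dtype=float)
    phase = np.full(s_score.shape, "S", dtype=object)  # default label
    phase[g2m_score > s_score] = "G2M"                 # G2/M score is the larger one
    phase[(s_score < 0) & (g2m_score < 0)] = "G1"      # neither program is active
    return phase

phases = assign_phase([0.5, -0.1, -0.2, 0.1], [0.2, 0.3, -0.1, 0.4])
print(phases)
```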


In [None]:
ov.pl.embedding(adata,
                basis='X_umap',
                color='phase',
                frameon='small')

#### Recover raw counts

In [None]:
adata

In [None]:
print(np.min(adata.X), np.max(adata.X))

You can use `recover_counts` to recover the raw counts after normalisation and `log1p` transformation.

In [None]:
X_counts_recovered, size_factors_sub=ov.pp.recover_counts(adata.X, 50*1e4, 50*1e5, log_base=None, chunk_size=50000)
adata.layers['counts']=X_counts_recovered
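
To show why recovering counts from normalised data is possible at all, here is a deliberately simple heuristic. It is not the algorithm behind `ov.pp.recover_counts`: it assumes natural-log `log1p` data and that the smallest non-zero value in each cell corresponds to a raw count of 1, and the helper name `recover_counts_heuristic` is hypothetical.

```python
import numpy as np

def recover_counts_heuristic(X_log):
    """Undo log1p and rescale each cell so its smallest non-zero value becomes 1."""
    linear = np.expm1(np.asarray(X_log, dtype=float))
    # Smallest non-zero value per cell; assumed to come from a raw count of 1.
    smallest = np.where(linear > 0, linear, np.inf).min(axis=1, keepdims=True)
    return np.rint(linear / smallest).astype(int)

# Round trip on toy counts where every cell contains at least one count of 1.
raw = np.array([[1, 3, 0, 6], [2, 0, 1, 7]])
totals = raw.sum(axis=1, keepdims=True)
normalized = np.log1p(raw / totals * 50 * 1e4)

recovered = recover_counts_heuristic(normalized)
print(recovered)
```

The heuristic fails for cells whose smallest raw count is larger than 1, which is one reason a more careful search over size factors is preferable in practice.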

In [None]:
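# Print the shape explicitly; a bare expression is only displayed on the last line of a cell
print(adata.layers['counts'].shape)
print(np.min(adata.layers['counts']), np.max(adata.layers['counts']))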

In [None]:
adata

#### Save QC AnnData object

In [None]:
adata.write_h5ad("Processed Data/scRNA_QC.h5ad")


**<span style="font-size:16px;">Session information:</span>**

In [None]:
import sys
import platform
from importlib import metadata  # replaces the deprecated pkg_resources

# Get Python version information
python_version = sys.version

# Get operating system information
os_info = platform.platform()

# Get system architecture information
architecture = platform.architecture()[0]

# Get CPU information
cpu_info = platform.processor()

# Print session information
print("Python version:", python_version)
print("Operating system:", os_info)
print("System architecture:", architecture)
print("CPU info:", cpu_info)

# Print all installed packages and their versions
print("\nInstalled packages and their versions:")
for dist in metadata.distributions():
    print(dist.metadata["Name"], dist.version)