# Scanpy for single-cell RNA sequencing (scRNA-seq) data analysis

---

## Sequencing

__RNA-seq__ (RNA sequencing) is a transcriptomics method that involves the high-throughput sequencing of RNA molecules in a sample. It can be used to study gene expression, alternative splicing, differential expression between case/control samples, etc. 

Bulk RNA-seq profiles the pooled RNA of all cells in a sample, so it can only measure the average gene expression across the heterogeneous population of cell types in a tissue ([Gondane & Itkonen, 2023](https://doi.org/10.3390/cimb45030120); [Marguerat & Bähler, 2010](https://doi.org/10.1007/s00018-009-0180-6)).

Single-cell RNA sequencing (__scRNA-seq__) makes it possible to analyze the gene expression profile of individual cells within a sample ([Jovic et al., 2022](https://doi.org/10.1002/ctm2.694); [Hong et al., 2020](https://doi.org/10.1186/s13045-020-01005-x); [Nature Methods, 2014](https://doi.org/10.1038/nmeth.2801); [Tang et al., 2009](https://doi.org/10.1038/nmeth.1315)).

Procedure:
1. Cell isolation/capture.
   - Flow cytometry (e.g., fluorescence-activated cell sorting, FACS): cells in suspension are stained (membrane, cytoplasm, or nucleus), interrogated by a laser beam in the FACS machine, and sorted by the flow cytometer's sort electronics for subsequent culture and analysis.
   - Microfluidics (array-based and droplet-based): cells in suspension are run through a microfluidic chip together with barcoded beads, so that each cell is paired with a barcoded bead in a droplet or chamber (well).
2. Library preparation: cells are lysed to extract RNA, which is reverse-transcribed into complementary DNA (cDNA) for amplification. Adding unique molecular identifiers (UMIs) to each RNA molecule helps mitigate PCR amplification bias.
3. Sequencing: the prepared library is sequenced using high-throughput sequencing technologies, generating short nucleotide sequences (reads) that correspond to the cDNA fragments.

![10xLibrary](images/10xgenomics_library.png)\
Source: [10xgenomics](https://kb.10xgenomics.com/hc/en-us/articles/360000939852-What-is-the-difference-between-Single-Cell-3-and-5-Gene-Expression-libraries-)

Common scRNA-seq protocols\
![Methods](images/10.1146_annurev-biodatasci-080917-013452.jpg)\
Source: [Chen et al., 2018. Annual Review of Biomedical Data Science, 1, 29-51.](https://doi.org/10.1146/annurev-biodatasci-080917-013452)

## Data Analysis

The bioinformatics analysis of scRNA-seq data typically involves quantifying gene expression levels and performing downstream analyses such as clustering, dimensionality reduction, and identification of differentially expressed genes.

Cell types are annotated based on gene expression profiles. In some cases, spatial transcriptomics is used to map gene expression in tissues.

Many tools are available for the analysis of scRNA-seq data (see [Table S1 in Jovic et al., 2022](https://doi.org/10.1002/ctm2.694)).

### Scanpy

Scanpy relies on an __AnnData__ class designed to store and manipulate high-dimensional, annotated single-cell genomics data.

Components:
- __Data Matrix__: A two-dimensional matrix in which rows represent cells and columns represent features (e.g., genes). It holds raw gene expression counts, normalized expression values, or any other numeric values associated with each cell and feature. `adata.X`
- __Observations__: The first of the two main metadata attributes, obs (observations), typically stores information about individual cells (e.g., cell annotations, sample information, batch information). `adata.obs`
- __Variables__: The second main metadata attribute, var (variables), stores information about features (e.g., gene names, gene annotations). `adata.var`
- __Layers__: Additional matrices with the same shape as the main data matrix. Commonly used layers include raw counts, normalized counts, and scaled data. `adata.layers`
- __Uns__: The uns (unstructured) attribute is a dictionary for any data that does not fit into the structured obs and var annotations. `adata.uns`


![AnnData](images/scanpydocs_anndata.svg)\
Source: [Scanpy Documentation](https://scanpy.readthedocs.io/en/stable/usage-principles.html)

#### Input Data

Expression matrix

![Matrix](images/10.1016_j.jgg.2022.01.004.jpg)\
Source: [Wu et al., 2022. Journal of Genetics and Genomics, 49(9), 891-899](https://doi.org/10.1016/j.jgg.2022.01.004)

#### Cell Quality Control

Damaged cells:
- High proportion of mitochondrial and ribosomal gene expression
  - A commonly used threshold is to remove cells with >20% mitochondrial gene counts
- Low UMI and gene counts

Doublets:
- High UMI and gene counts
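
The filters above can be sketched with plain NumPy. This is an illustrative sketch, not Scanpy's implementation: the function names `qc_metrics` and `keep_cells` are hypothetical, and every threshold other than the 20% mitochondrial cutoff mentioned above is an arbitrary value chosen for the toy data. In Scanpy, comparable per-cell metrics are computed by `sc.pp.calculate_qc_metrics`.

```python
import numpy as np

def qc_metrics(counts, is_mito):
    """Per-cell total counts, number of detected genes, and % mitochondrial counts."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum(axis=1)
    n_genes = (counts > 0).sum(axis=1)
    mito = counts[:, is_mito].sum(axis=1)
    # Guard against division by zero for empty cells.
    pct_mito = 100.0 * mito / np.maximum(total, 1.0)
    return total, n_genes, pct_mito

def keep_cells(counts, is_mito, max_pct_mito=20.0, min_genes=2, max_counts=None):
    """Boolean mask of cells passing the QC filters (thresholds are assumptions)."""
    total, n_genes, pct_mito = qc_metrics(counts, is_mito)
    keep = (pct_mito <= max_pct_mito) & (n_genes >= min_genes)
    if max_counts is not None:
        # Very high counts can indicate doublets.
        keep &= total <= max_counts
    return keep

# Toy data: 4 cells x 4 genes; the last gene is flagged as mitochondrial.
counts = np.array([
    [10, 5, 3, 2],         # 10% mitochondrial counts, 4 genes detected
    [1, 0, 0, 9],          # 90% mitochondrial counts: likely damaged
    [0, 1, 0, 0],          # only one gene detected
    [200, 150, 120, 30],   # very high total counts: possible doublet
])
is_mito = np.array([False, False, False, True])

total, n_genes, pct_mito = qc_metrics(counts, is_mito)
keep = keep_cells(counts, is_mito, max_counts=100)
print(keep)
```

Suitable thresholds depend on the tissue and protocol, so in practice they are usually chosen after inspecting the distributions of these metrics.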

#### Normalization
Normalization removes technical variation (e.g., from PCR amplification, cell lysis efficiency, reverse transcription efficiency, and stochastic molecular sampling during sequencing) so that true biological variation can be measured. For each cell:

- Divide each gene count by the total counts for that cell
- Multiply by a scaling factor (e.g., 10,000)
- Apply a log transformation
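
The three steps can be written out directly in NumPy. This is a minimal sketch with a hypothetical function name (`normalize_log`); it uses `log(1 + x)` as the log transformation, which is an assumption about the exact transform. In Scanpy, the corresponding calls are `sc.pp.normalize_total` followed by `sc.pp.log1p`.

```python
import numpy as np

def normalize_log(counts, target_sum=1e4):
    """Library-size normalization followed by log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0          # avoid dividing by zero for empty cells
    scaled = counts / totals * target_sum
    return np.log1p(scaled)

# Two toy cells with the same gene proportions but a 10x difference in depth.
counts = np.array([[1, 3],
                   [10, 30]])
norm = normalize_log(counts)
print(norm)
```

Because both toy cells have identical gene proportions, their normalized profiles are identical even though their sequencing depths differ tenfold, which is the effect the normalization is meant to achieve.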

#### Filter for Highly Variable Genes

Select genes with high cell-to-cell variation. Typically, the 2,000 most variable genes are retained.
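
As a rough illustration, genes can be ranked by dispersion (variance divided by mean) and the top ones kept. This is a deliberately simplified sketch with a hypothetical function name (`top_variable_genes`); Scanpy's `sc.pp.highly_variable_genes` uses more elaborate, mean-adjusted criteria, so the ranking below should not be read as its exact method.

```python
import numpy as np

def top_variable_genes(expr, n_top=2000):
    """Indices of the n_top genes with the highest variance-to-mean ratio."""
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=0)
    var = expr.var(axis=0, ddof=1)
    # Dispersion = variance / mean; genes with zero mean get dispersion 0.
    dispersion = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)
    n_top = min(n_top, expr.shape[1])
    order = np.argsort(dispersion)[::-1]   # highest dispersion first
    return np.sort(order[:n_top])

# Toy data: 4 cells x 3 genes; only the middle gene varies between cells.
expr = np.array([
    [1.0, 0.0, 5.0],
    [1.0, 4.0, 5.0],
    [1.0, 0.0, 5.0],
    [1.0, 4.0, 5.0],
])
selected = top_variable_genes(expr, n_top=1)
print(selected)
```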


#### Scaling

A linear transformation shifts the mean expression of each gene across cells to 0 and scales its variance to 1, so that highly expressed genes do not dominate downstream analyses.

![Matrix](images/10.1016_j.mam.2017.07.002.jpg)\
Source: [Andrews & Hemberg, (2018). Molecular aspects of medicine, 59, 114-122.](https://doi.org/10.1016/j.mam.2017.07.002)
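
The scaling step is a per-gene z-score. The NumPy sketch below uses a hypothetical function name (`scale_genes`) and leaves constant genes at 0, which is an assumption about how to handle zero variance; in Scanpy the equivalent step is `sc.pp.scale`.

```python
import numpy as np

def scale_genes(expr):
    """Center each gene (column) to mean 0 and scale it to unit variance."""
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=0)
    std = expr.std(axis=0, ddof=1)
    std[std == 0] = 1.0     # constant genes stay at 0 instead of producing NaN
    return (expr - mean) / std

# Toy data: the second gene is expressed 100x higher than the first,
# and the third gene is constant across cells.
expr = np.array([
    [1.0, 100.0, 7.0],
    [2.0, 200.0, 7.0],
    [3.0, 300.0, 7.0],
])
z = scale_genes(expr)
print(z)
```

After scaling, the first two genes have identical values despite their different expression levels, so neither contributes more than the other to distance-based steps such as PCA.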

#### Principal Component Analysis (PCA)

Linear dimensionality reduction
- http://www.nlpca.org/pca_principal_component_analysis.html
- https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

Select the number of PCs based on the location of the "elbow" in a plot of standard deviation (or explained variance) against PC number.

Jackstraw plot for PCA significance.

PC heatmaps: Cells x Genes with extreme values, ordered by PCA scores
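
To make the elbow criterion concrete, the sketch below computes PCA via singular value decomposition in NumPy and reports the fraction of variance explained by each PC. The function name `pca` and the simulated data are hypothetical; in Scanpy, `sc.tl.pca` performs the decomposition and `sc.pl.pca_variance_ratio` draws the corresponding plot.

```python
import numpy as np

def pca(data, n_comps=2):
    """PCA via SVD: returns cell scores, gene loadings, and variance ratios."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_comps] * s[:n_comps]            # cell coordinates in PC space
    explained_var = s ** 2 / (centered.shape[0] - 1)
    ratio = explained_var / explained_var.sum()      # one entry per PC, descending
    return scores, vt[:n_comps], ratio

# Simulated data: 50 cells x 5 genes sharing one dominant axis of variation.
rng = np.random.default_rng(0)
signal = rng.normal(size=(50, 1)) * 5.0
data = signal @ np.ones((1, 5)) + rng.normal(size=(50, 5))

scores, loadings, ratio = pca(data, n_comps=2)
print(np.round(ratio, 3))
```

In this simulated example the first PC captures most of the variance and the remaining ratios drop sharply, which is the "elbow" pattern used to decide how many PCs to keep.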

#### Clustering

Group cells with similar expression profiles.

The number of clusters is determined by the number of top PCs used (those explaining the most variance) and by the resolution parameter.

Clusters are commonly visualized in a low-dimensional embedding:

- PCA
- t-SNE: preserves local structure in the data
- UMAP: aims to preserve both local and global structure
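
Graph-based clustering in Scanpy typically starts from a k-nearest-neighbor graph built in PC space (`sc.pp.neighbors`), followed by community detection (e.g., `sc.tl.leiden`, where `resolution` controls how many clusters are found). The NumPy sketch below covers only the first step, finding each cell's nearest neighbors; the function name `knn_indices` and the toy coordinates are hypothetical, and the brute-force distance computation is an assumption that only suits small data.

```python
import numpy as np

def knn_indices(scores, k=2):
    """For each cell, indices of its k nearest neighbors (Euclidean distance)."""
    scores = np.asarray(scores, dtype=float)
    diff = scores[:, None, :] - scores[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)     # a cell is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

# Toy PC scores: two well-separated groups of three cells each.
scores = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
nbrs = knn_indices(scores, k=2)
print(nbrs)
```

Every cell's neighbors fall within its own group here, so a community detection algorithm run on this graph would be expected to recover the two groups as clusters.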



#### Differentially Expressed Genes (DEG) Analysis
- Measure the fraction of cells expressing each gene in each cluster
  - Percentage 1: fraction of cells expressing the gene in the current cluster
  - Percentage 2: fraction of cells expressing the gene in all other clusters (the complement)
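
The two fractions can be computed as follows. This is a minimal NumPy sketch with a hypothetical function name (`expression_fractions`), and it assumes that a gene counts as "expressed" in a cell when its value is greater than zero. In Scanpy, `sc.tl.rank_genes_groups` can report comparable fractions when called with `pts=True`.

```python
import numpy as np

def expression_fractions(expr, labels, cluster):
    """Fraction of cells expressing each gene inside vs. outside a cluster."""
    expr = np.asarray(expr)
    in_cluster = np.asarray(labels) == cluster
    expressed = expr > 0                               # assumption: > 0 means expressed
    frac_in = expressed[in_cluster].mean(axis=0)       # current cluster
    frac_rest = expressed[~in_cluster].mean(axis=0)    # complement clusters
    return frac_in, frac_rest

# Toy data: 4 cells x 2 genes, with two clusters "a" and "b".
expr = np.array([
    [3, 0],
    [1, 0],
    [0, 2],
    [0, 0],
])
labels = np.array(["a", "a", "b", "b"])

frac_in, frac_rest = expression_fractions(expr, labels, "a")
print(frac_in, frac_rest)
```

A gene with a high fraction in the current cluster and a low fraction elsewhere, like the first gene in this toy example, is a candidate marker for that cluster.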

#### Marker Identification



#### Cluster Annotation
Cell type identification is based on cluster marker genes from annotated reference databases (e.g., [PanglaoDB](https://panglaodb.se/)).

#### Trajectory Inference Analysis
- Transitions between cell identities
- Branching differentiation processes
- Trajectory inference methods interpret single-cell data as a snapshot of a continuous process

Other tools:
- Seurat
  - Object: 