# Single-cell RNA-Seq Analysis Training Demo

## Overview

This notebook sets up a workflow for processing and analyzing single-cell RNA-seq (scRNA-seq) data from the study by Lee DR, Rhodes C, Mitra A, Zhang Y, et al., using the Seurat package in R. The steps include data preparation, quality control, normalization, and clustering. The workflow begins with downloading raw sequencing data, organizing it, and creating a Seurat object for analysis. It then identifies highly variable genes, performs dimensionality reduction, and clusters cells based on their gene expression patterns.

The code also produces plots to visualize and interpret the scRNA-seq data. Violin plots and feature scatter plots assess quality control metrics, such as gene counts and mitochondrial content. The highly variable genes plot highlights genes driving differences across cells. Dimensionality reduction techniques (PCA, UMAP, and t-SNE) are used to visualize cell groupings and explore the overall structure of the data. Heatmaps display gene expression patterns across clusters, while the elbow plot helps choose how many principal components to keep for clustering. Together, these plots provide an overview of data quality, cell clustering, and marker gene expression.

## STEP 1. Getting Started

<div class="alert alert-block alert-warning"> NOTE: This Jupyter Notebook was developed to run within a customized container on AWS with all software and tools pre-configured. If running without this customized container, you will need to install the packages below before moving on to Step 1.2.</div>

### STEP 1.1. Setup: Installing Required Tools

This step sets up the R environment for the scRNA-seq analysis by installing the Seurat package and its dependencies, along with the other libraries the analysis requires. This step can take over 35 minutes to run.

<div class="alert alert-block alert-info">Tip: If using the Miniforge install, run the following code cells after removing the leading # from each command line.</div>

In [None]:
# Enter commands in R (or RStudio, if installed)
#install.packages('Seurat')

In [None]:
#setRepositories(ind = 1:3, addURLs = c('https://satijalab.r-universe.dev', 'https://bnprks.r-universe.dev/'))
#install.packages(c("BPCells", "presto", "glmGamPoi"))

In [None]:
# Install packages if not already installed
#if (!requireNamespace("BiocManager", quietly = TRUE))
#  install.packages("BiocManager")

#BiocManager::install(c("patchwork", "ggplot2", "cowplot", "dplyr", "fastmap"), force = TRUE)

In [None]:
# Install the remotes package
#if (!requireNamespace("remotes", quietly = TRUE)) {
#  install.packages("remotes")
#}
#install.packages('Signac')
#remotes::install_github("satijalab/seurat-data", quiet = TRUE)
#remotes::install_github("satijalab/azimuth", quiet = TRUE)
#remotes::install_github("satijalab/seurat-wrappers", quiet = TRUE)

----------------------------------------------------

## If running from a container, as noted above, start with <b> STEP 1.2 </b> below:

### STEP 1.2: Load Libraries & Setup Directories

In [None]:
# Load libraries for scRNA-seq analysis
library(dplyr)
library(Seurat)
library(patchwork)
library(ggplot2)
library(cowplot)

Create necessary directories to store data

In [None]:
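# showWarnings = FALSE keeps the cell quiet if the directories already exist
dir.create("data", recursive = TRUE, showWarnings = FALSE)
dir.create("data/raw_data", showWarnings = FALSE)
dir.create("data/seurat_output", showWarnings = FALSE)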

## STEP 2. Experimental Design / Dataset

This step downloads and prepares the raw scRNA-seq data from GEO series GSE167013. The code downloads the 10x Genomics supplementary archive, extracts it, and moves the files for sample GSM5090774 into their own directory, renaming them to match the file names that Seurat expects (barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz).

In [None]:
# Download the supplementary data (TAR file) from GEO
system("wget -O data/raw_data/GSE167013_RAW.tar 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE167nnn/GSE167013/suppl/GSE167013_RAW.tar'")

In [None]:
# Extract the TAR file
system("tar -xvf data/raw_data/GSE167013_RAW.tar -C data/raw_data")

# Create the GSM5090774 directory if it doesn't exist
dir.create("data/raw_data/GSM5090774", recursive = TRUE)

# Rename the extracted GSM5090774 files to match Seurat's expected names
system("mv data/raw_data/GSM5090774_CTX_barcodes.tsv.gz data/raw_data/GSM5090774/barcodes.tsv.gz")
system("mv data/raw_data/GSM5090774_CTX_features.tsv.gz data/raw_data/GSM5090774/features.tsv.gz")
system("mv data/raw_data/GSM5090774_CTX_matrix.mtx.gz data/raw_data/GSM5090774/matrix.mtx.gz")
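
The renaming above relies on the shell `mv` command, which may not be available on every system. As a portable alternative, the same reorganization can be sketched with base R's `file.rename()`. The example below is only an illustration: it runs in a temporary directory on empty placeholder files (the `raw_dir` location and the placeholder files are assumptions made for this demo), so it does not touch the downloaded data.

```r
# Sandbox directory with empty placeholder files (assumption for this demo only)
raw_dir   <- file.path(tempdir(), "raw_data_demo")
sample_id <- "GSM5090774"
parts     <- c("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")
old_paths <- file.path(raw_dir, paste0(sample_id, "_CTX_", parts))

dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
invisible(file.create(old_paths))

# Move each file into a per-sample folder under the names Read10X looks for
sample_dir <- file.path(raw_dir, sample_id)
dir.create(sample_dir, showWarnings = FALSE)
new_paths <- file.path(sample_dir, parts)
moved <- file.rename(from = old_paths, to = new_paths)

# Stop early if any file could not be moved
stopifnot(all(moved))
list.files(sample_dir)
```

Because `file.rename()` is vectorized and returns one logical value per file, a failed move is easier to detect than with three separate `system()` calls.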

Once the raw data is organized into the correct format, it is loaded into R using the Read10X function, which is designed to read gene expression matrices generated by 10X Genomics. This function reads three essential files—barcodes, features (genes), and the expression matrix (counts)—and organizes them into a format suitable for further analysis.

Next, the CreateSeuratObject function is used to convert this data into a Seurat object, which is the core data structure for scRNA-seq analysis in Seurat. This object stores the gene expression data alongside associated metadata. 

In [None]:
# Load data
data_dir <- "data/raw_data/GSM5090774"
scrna.data  <- Read10X(data.dir = data_dir)

# Create a Seurat object
scrna <- CreateSeuratObject(counts = scrna.data, project = "Mouse_scRNA", min.cells = 3, min.features = 200)
scrna
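
To make the `min.cells` and `min.features` arguments concrete, the following sketch applies the same kind of filtering to a tiny made-up count matrix in base R. The gene and cell names, the counts, and the lowered thresholds (2 instead of 3 and 200) are assumptions chosen so the effect is visible. The order in which the two filters are applied is also an assumption about Seurat's internals, not something stated in this notebook.

```r
# Invented counts: genes in rows, cells in columns
counts <- matrix(
  c(1, 0, 2, 3, 0,
    0, 0, 1, 1, 1,
    5, 0, 0, 0, 0,
    2, 0, 4, 1, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", LETTERS[1:4]), paste0("cell", 1:5))
)

min_features <- 2  # keep cells with at least this many detected genes
min_cells    <- 2  # keep genes detected in at least this many cells

# Filter cells first (assumed order), then genes
genes_per_cell <- colSums(counts > 0)
kept <- counts[, genes_per_cell >= min_features, drop = FALSE]

cells_per_gene <- rowSums(kept > 0)
kept <- kept[cells_per_gene >= min_cells, , drop = FALSE]

dim(kept)
```

Here `cell2` is dropped because no genes are detected in it, and `geneC` is dropped because it is detected in only one of the remaining cells.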

<div class="alert alert-block alert-warning"> <b>NOTE</b>: If you receive an error stating that Read10X could not be found after running the code cell above, try restarting the notebook kernel and re-executing the cell.</div>

## STEP 3. Quality Control, Filtering, and Normalization

This step performs quality control on the scRNA-seq data. It calculates the percentage of mitochondrial gene expression per cell and visualizes key metrics (gene counts, RNA counts, and mitochondrial content) using violin and scatter plots. Cells are then filtered on these metrics to remove low-quality cells, and the gene expression data are normalized to prepare them for further analysis.
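
As a rough illustration of what the mitochondrial percentage measures, the sketch below computes it by hand for a made-up count matrix: for each cell, the counts of genes whose names start with `mt-` are divided by the cell's total counts and multiplied by 100. The genes, cells, and counts are invented for the demo; only the 5% cutoff is borrowed from this workflow.

```r
# Invented counts: genes in rows, cells in columns
counts <- matrix(
  c(120, 40, 60,
     76, 30, 36,
      3, 20,  3,
      1, 10,  1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Actb", "Gapdh", "mt-Nd1", "mt-Co1"),
                  c("cellA", "cellB", "cellC"))
)

# Mouse mitochondrial genes are matched by the lowercase "mt-" prefix
mt_genes <- grepl("^mt-", rownames(counts))

# Share of each cell's counts that comes from mitochondrial genes, in percent
percent_mt <- 100 * colSums(counts[mt_genes, , drop = FALSE]) / colSums(counts)

# Cells that would pass the mitochondrial filter used in this notebook
passing <- names(percent_mt)[percent_mt < 5]
percent_mt
passing
```

In this toy example `cellB` has a high mitochondrial share and is the only cell removed by the cutoff.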

In [None]:
# Quality control
scrna[["percent.mt"]] <- PercentageFeatureSet(scrna, pattern = "^mt-")

# Plot QC metrics
VlnPlot(scrna, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.

plot1 <- FeatureScatter(scrna, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(scrna, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

# Filter cells based on QC metrics
scrna <- subset(scrna, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

# Normalize the data
scrna <- NormalizeData(scrna, normalization.method = "LogNormalize", scale.factor = 10000)
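
The `LogNormalize` method can be written out by hand: each cell's counts are divided by that cell's total, multiplied by the scale factor, and transformed with the natural-log function `log1p`. The following base R sketch applies this to a small invented matrix. It is meant to show the arithmetic, not to replace `NormalizeData`.

```r
# Invented counts: genes in rows, cells in columns
counts <- matrix(
  c(10,  0,
    30,  5,
    60, 45),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("gene1", "gene2", "gene3"), c("cellA", "cellB"))
)

scale_factor <- 10000
totals <- colSums(counts)  # total counts per cell: 100 and 50

# Divide each column by its total, rescale, then apply log(1 + x)
normalized <- log1p(sweep(counts, 2, totals, "/") * scale_factor)
round(normalized, 3)
```

Zero counts stay at zero after the transformation, and undoing the log with `expm1()` gives columns that each sum to the scale factor, which is what makes cells with different sequencing depths comparable.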


## STEP 4. Identifying Highly Variable Genes

This step identifies the top 2,000 highly variable genes in the scRNA-seq dataset using the variance-stabilizing transformation (VST) method, which highlights genes that show significant variability across cells, often indicative of biological differences. It then plots these variable genes using the VariableFeaturePlot, and labels the top 10 most variable genes in the dataset, providing a visual representation of genes that are likely to contribute to key cell distinctions.

In [None]:
# Identify highly variable genes
scrna <- FindVariableFeatures(scrna, selection.method = "vst", nfeatures = 2000)
top10 <- head(VariableFeatures(scrna), 10)

# plot variable features with and without labels
plot1 <- VariableFeaturePlot(scrna)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
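
The idea behind selecting variable genes can be shown with a much simpler stand-in for the `vst` method. The sketch below ranks invented genes by their variance-to-mean ratio (dispersion). This is not the calculation that `FindVariableFeatures` performs, which additionally models the mean-variance trend, but it conveys why a gene expressed in only a few cells outranks a uniformly expressed one with the same mean.

```r
# Invented counts for four genes across six cells; every gene has mean 5
counts <- rbind(
  stable   = c(5, 5, 5, 5, 5, 5),
  mild     = c(4, 5, 6, 4, 5, 6),
  bursty   = c(0, 0, 30, 0, 0, 0),
  moderate = c(2, 8, 2, 8, 2, 8)
)
colnames(counts) <- paste0("cell", 1:6)

gene_mean  <- rowMeans(counts)
gene_var   <- apply(counts, 1, var)
dispersion <- gene_var / gene_mean  # simplified variability score

# Rank genes from most to least variable and keep the top two
ranked <- names(sort(dispersion, decreasing = TRUE))
top2 <- head(ranked, 2)
top2
```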

## STEP 5. Scaling the Data and Running PCA

This step scales the gene expression data for all genes in the dataset, ensuring that each gene contributes equally to the analysis. PCA is then run using only the previously identified highly variable genes to reduce the dimensionality of the data, capturing the most important sources of variation. The VizDimLoadings function visualizes the contribution of individual genes to the first two principal components, helping identify which genes drive the most variability in the dataset.

In [None]:
# Run PCA
all.genes <- rownames(scrna)
scrna <- ScaleData(scrna, features = all.genes)
scrna <- RunPCA(scrna, features = VariableFeatures(object = scrna))
# Examine and visualize PCA results a few different ways
print(scrna[["pca"]], dims = 1:5, nfeatures = 10)
VizDimLoadings(scrna, dims = 1:2, reduction = "pca")
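
The scaling and PCA steps can be mimicked on simulated data with base R alone. In the sketch below, all values are simulated (two hypothetical cell groups, with the first five genes shifted upward in group B), `prcomp()` stands in for `RunPCA`, and the clipping of extreme scaled values that `ScaleData` applies by default is left out. It is an illustration of the idea, not a reproduction of Seurat's implementation.

```r
set.seed(42)
n_genes <- 20
n_cells <- 60
group <- rep(c("A", "B"), each = n_cells / 2)

# Simulated expression (genes in rows, cells in columns)
expr <- matrix(rnorm(n_genes * n_cells), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("cell", seq_len(n_cells))))
# Genes 1-5 are more highly expressed in group B
expr[1:5, group == "B"] <- expr[1:5, group == "B"] + 4

# Z-score each gene across cells (mean 0, standard deviation 1)
scaled <- t(scale(t(expr)))

# PCA with cells as observations; the data are already centered and scaled
pca <- prcomp(t(scaled), center = FALSE, scale. = FALSE)

# PC1 scores by group, and the genes with the largest absolute PC1 loadings
pc1 <- pca$x[, "PC1"]
tapply(pc1, group, mean)
head(sort(abs(pca$rotation[, "PC1"]), decreasing = TRUE), 5)
```

The loadings play the role that `VizDimLoadings` visualizes: genes with large absolute loadings on a component are the ones that drive the separation of cells along it.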

## STEP 6. PCA and heatmap plots

This section of the code focuses on visualizing the PCA results. First, it generates a PCA plot (DimPlot) showing the cells projected onto the first two principal components. Then, heatmaps (DimHeatmap) are plotted for individual principal components, showing the expression of the genes with the strongest loadings across the 500 most extreme cells. Finally, an elbow plot helps select the number of principal components (PCs) to use for downstream analysis.

In [None]:
# PCA plot
DimPlot(scrna, reduction = "pca")

In [None]:
# Heatmaps of the genes with the strongest loadings on each PC
DimHeatmap(scrna, dims = 1, cells = 500, balanced = TRUE)
DimHeatmap(scrna, dims = 1:15, cells = 500, balanced = TRUE)

In [None]:
# Elbow plot to determine optimal number of PCs
ElbowPlot(scrna)
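
An elbow plot shows the standard deviation of each principal component, so the same information can be summarized numerically. The sketch below uses hypothetical standard deviations (not values from this dataset) and a 90% cumulative-variance cutoff, which is an arbitrary assumption for illustration and not a rule used by this notebook.

```r
# Hypothetical standard deviations for the first eight PCs (not from the dataset)
pc_sdev <- c(10, 7, 5, 2, 1.5, 1.2, 1.1, 1.0)

# Share of variance per PC, relative to the PCs listed here
var_explained <- pc_sdev^2 / sum(pc_sdev^2)
cum_var <- cumsum(var_explained)

# First PC at which the cumulative share reaches the chosen cutoff
n_pcs <- which(cum_var >= 0.90)[1]
round(cum_var, 3)
n_pcs
```

With a real object, the standard deviations drawn by `ElbowPlot` should be available through `Stdev(scrna, reduction = "pca")`. A numeric cutoff like this is best treated as a starting point alongside visual inspection of the plot.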

## STEP 7. Cluster cells and Run non-linear dimensional reduction (UMAP/tSNE)

Next, the code clusters cells using the Louvain algorithm based on their nearest-neighbor graph, after which the UMAP and t-SNE dimensionality reduction techniques are applied to visualize these clusters. These visualizations, presented in UMAP and t-SNE plots, show cell groupings based on gene expression similarities, providing insight into potential distinct cell populations.

In [None]:
# Identify clusters
scrna <- FindNeighbors(scrna, dims = 1:10)
scrna <- FindClusters(scrna, resolution = 0.5)
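
The graph that clustering operates on can be illustrated in base R. The sketch below builds a k-nearest-neighbor list from simulated two-dimensional coordinates and then scores every pair of cells by the Jaccard overlap of their neighbor sets, which is the general idea behind a shared-nearest-neighbor graph. The data, the value of `k`, and the choice to count each cell as its own neighbor are assumptions for this demo. The Louvain step itself is not shown, since it needs a graph library.

```r
set.seed(1)
# Two well-separated groups of 10 cells each in a made-up 2-D "PC space"
pcs <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
rownames(pcs) <- paste0("cell", 1:20)
k <- 5

# k nearest neighbors per cell by Euclidean distance (each cell counts itself)
d <- as.matrix(dist(pcs))
knn <- t(apply(d, 1, function(row) order(row)[seq_len(k)]))

# Binary membership matrix: adj[i, j] = 1 if j is among the neighbors of i
adj <- matrix(0, nrow(pcs), nrow(pcs),
              dimnames = list(rownames(pcs), rownames(pcs)))
adj[cbind(rep(seq_len(nrow(pcs)), times = k), as.vector(knn))] <- 1

# Jaccard overlap between the neighbor sets of every pair of cells
shared <- adj %*% t(adj)
snn <- shared / (2 * k - shared)

round(snn[1:3, 1:3], 2)
```

Cells from different groups share no neighbors, so their edge weight is zero. A community-detection algorithm such as Louvain then groups cells that are densely connected in this weighted graph.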

In [None]:
# Run UMAP
scrna <- RunUMAP(scrna, dims = 1:10)
DimPlot(scrna, reduction = "umap", label = TRUE)
saveRDS(scrna, file = "data/seurat_output/scrna_tutorial.rds")

In [None]:
# Run t-SNE
scrna <- RunTSNE(scrna, dims = 1:10)
DimPlot(scrna, reduction = "tsne", label = TRUE)

## STEP 8. Differentially expressed features

This section of the code identifies and visualizes key marker genes for cell clusters in the scRNA-seq data. First, it finds the differentially expressed genes for cluster 2 and then does the same for every cluster, reporting only positive markers and displaying those with an average log2 fold change greater than 1. The full marker table is written to a CSV file. The code also finds markers for a specific cluster (cluster 0) using the ROC test.

Violin plots, feature plots, and a heatmap are then used to visualize gene expression patterns and marker genes across the cell clusters.

In [None]:
# find all markers of cluster 2
cluster2.markers <- FindMarkers(scrna, ident.1 = 2)
head(cluster2.markers, n = 5)

In [None]:
# find markers for every cluster compared to all remaining cells, report only the positive ones
scrna.markers <- FindAllMarkers(scrna, only.pos = TRUE)
scrna.markers %>%
    group_by(cluster) %>%
    dplyr::filter(avg_log2FC > 1)

write.csv(scrna.markers, "data/seurat_output/cluster_markers.csv", row.names = FALSE)
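
To show what a marker test compares, the sketch below evaluates a single invented gene in base R: it computes a simplified log2 fold change between the cells in one cluster and all other cells, then applies a Wilcoxon rank-sum test, which is understood to be the default test in `FindMarkers`. The expression values and the pseudocount of 1 are assumptions for this demo, and the fold-change formula is a simplification that may differ from Seurat's exact calculation of `avg_log2FC`.

```r
# Invented normalized expression of one gene
in_cluster  <- c(5, 6, 7, 8, 9)     # cells in the cluster of interest
other_cells <- c(1, 2, 1, 2, 1, 2)  # all remaining cells

# Simplified log2 fold change with a pseudocount of 1 (assumption for this sketch)
log2fc <- log2((mean(in_cluster) + 1) / (mean(other_cells) + 1))

# Wilcoxon rank-sum test; exact = FALSE avoids the warning caused by tied values
p_value <- wilcox.test(in_cluster, other_cells, exact = FALSE)$p.value

# Same kind of criterion as the avg_log2FC > 1 filter used in this notebook
is_marker <- log2fc > 1 && p_value < 0.05
c(log2fc = log2fc, p_value = p_value)
is_marker
```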

In [None]:
# Find positive markers for cluster 0 using the ROC test
cluster0.markers <- FindMarkers(scrna, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
head(cluster0.markers, 10)
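
The ROC test scores each gene by how well its expression alone separates the cluster from the other cells. The sketch below computes an area under the ROC curve (AUC) by hand for one invented gene, as the fraction of (cluster cell, other cell) pairs in which the cluster cell has the higher value, with ties counting as half. It then derives the predictive power that the Seurat documentation describes as `abs(AUC - 0.5) * 2`. The expression values are assumptions for this demo.

```r
# Invented expression of one gene
in_cluster  <- c(3, 5, 6, 7)  # cells in the cluster of interest
other_cells <- c(1, 2, 4, 2)  # all remaining cells

# Compare every cluster cell with every other cell
greater <- outer(in_cluster, other_cells, ">")
ties    <- outer(in_cluster, other_cells, "==")

# AUC: probability that a cluster cell outranks a non-cluster cell
auc <- mean(greater + 0.5 * ties)

# Predictive power: 0 means no separation, 1 means perfect separation
power <- abs(auc - 0.5) * 2
c(auc = auc, power = power)
```

An AUC near 0.5 means the gene carries no information about cluster membership, while values near 1 (or near 0, for genes that are lower in the cluster) indicate a strong classifier.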

In [None]:
# Assuming you have identified highly variable genes using FindVariableFeatures
top_genes <- head(VariableFeatures(scrna), 10)  # Get top 10 highly variable genes
VlnPlot(scrna, features = top_genes)

In [None]:
# Violin Plot
VlnPlot(scrna, features = c("Hba-x", "Nudt4"))

In [None]:
FeaturePlot(scrna, features = c("Hba-x", "Nudt4", "Hbb-bh1", "Gpx1", "Gypa", "Car2", "Alas2", "Hmox1",
    "Col3a1", "Hbb-bh0"))

In [None]:
# Heatmap of up to 10 markers per cluster with avg_log2FC > 1
scrna.markers %>%
    group_by(cluster) %>%
    dplyr::filter(avg_log2FC > 1) %>%
    slice_head(n = 10) %>%
    ungroup() -> top10
DoHeatmap(scrna, features = top10$gene)

## STEP 9. Assigning cell type identity to clusters

This section of the code renames the cell clusters with more meaningful biological identities, based on prior knowledge of the cell types. It uses the RenameIdents function to assign these names to the identified clusters.

Next, it creates a UMAP (Uniform Manifold Approximation and Projection) plot, which visualizes the clusters of cells in two dimensions, with labels for each cluster. The plot is customized with larger labels and axes using ggplot2 for clear visualization. The final UMAP plot is then saved as a JPEG image file (scrna3k_umap.jpg), showing the distinct cell populations with their assigned identities in the scRNA-seq dataset.

In [None]:
# One label per cluster, in the order of levels(scrna); this list assumes 11 clusters
new.cluster.ids <- c("Prolif", "Neurons", "Prolif", "Neuro Dev", "OPCs", "Neurogenesis", 
    "Erythro", "Mitosis", "ECM", "Neuro Sign", "Microglia")
names(new.cluster.ids) <- levels(scrna)
scrna <- RenameIdents(scrna, new.cluster.ids)
DimPlot(scrna, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
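
The renaming relies on a named vector that maps each cluster level to a label, so the number of labels has to match the number of clusters. The base R sketch below shows the same mapping on an invented cluster factor with three levels, including a length check that could be run before `RenameIdents`. The cluster assignments are made up; the labels reuse two names from the list above, with `Prolif` assigned twice as in the notebook.

```r
# Invented cluster assignments for seven cells
clusters <- factor(c("0", "1", "2", "1", "0", "2", "2"))
labels   <- c("Prolif", "Neurons", "Prolif")

# Guard: exactly one label per cluster level
stopifnot(length(labels) == nlevels(clusters))

# Named lookup vector: cluster level -> label
new_ids <- setNames(labels, levels(clusters))

# Apply the mapping; clusters that share a label end up in the same group
renamed <- factor(unname(new_ids[as.character(clusters)]),
                  levels = unique(labels))
table(renamed)
```

Running a check like `length(new.cluster.ids) == length(levels(scrna))` before assigning names gives a clearer failure if the clustering yields a different number of clusters than the label list expects.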

In [None]:
library(ggplot2)
plot <- DimPlot(scrna, reduction = "umap", label = TRUE, label.size = 4.5) + xlab("UMAP 1") + ylab("UMAP 2") +
    theme(axis.title = element_text(size = 18), legend.text = element_text(size = 18)) + guides(colour = guide_legend(override.aes = list(size = 10)))
ggsave(filename = "data/seurat_output/scrna3k_umap.jpg", height = 7, width = 12, plot = plot, quality = 50)

In [None]:
saveRDS(scrna, file = "data/seurat_output/scrna3k_final.rds")