: '
================================================================================
Pipeline: Single-cell RNA-seq Preprocessing, QC, Integration, and Export
Purpose: Prepare a high-quality single-cell reference for Cell2location

Description:
    This script implements preprocessing, quality control (QC), integration, 
    and export of single-cell RNA-seq reference data prior to spatial 
    deconvolution with Cell2location.

Environment Setup:
    # Local setup with conda/mamba
    mamba activate cell2loc_prep_r
    mamba install ipykernel
    python3 -m ipykernel install --user --name=cell2loc_prep_r --display-name "cell2loc_prep_r"

    # Register R kernel
    R
    install.packages("IRkernel")
    IRkernel::installspec(user = TRUE)
    quit()

Input Data:
    - GEO accession: GSE158937
      Single-cell RNA sequencing samples from high-grade serous ovarian cancer.
    - Files per sample (10x Genomics format):
        barcodes.tsv   : Unique cell identifiers
        features.tsv   : List of measured genes
        matrix.mtx     : Sparse gene expression matrix

    - Example samples used:
        GSM4816045, GSM4816046, GSM4816047

Workflow Steps:
    1. Load raw 10x data into Seurat objects.
    2. Perform QC:
         - Keep cells with >200 and <7500 detected genes.
         - Exclude cells with >10% mitochondrial gene content.
    3. Normalize data and identify 2000 highly variable genes.
    4. Integrate samples:
         - Select common features.
         - Find anchors (matched cells across datasets).
         - Correct batch effects and merge datasets.
    5. (Optional) Scale, PCA, UMAP, clustering for QC visualization.
    6. Export integrated object to:
         - H5Seurat format (.h5Seurat).
         - H5AD format (.h5ad) for Python/Scanpy compatibility.

Outputs:
    - QC violin plots per sample (genes, UMIs, mitochondrial %).
    - Integrated Seurat object.
    - Exported files:
        comb_GSE158937.h5Seurat
        comb_GSE158937.h5ad

Notes:
    - Mitochondrial genes are retained in Visium workflows, but here we filter 
      high-mito cells to build a clean reference for Cell2location.
    - UMAP and PCA checks ensure successful batch integration.

================================================================================
'


In [1]:
# If running locally, run these shell commands in a terminal, not in the R kernel:
#mamba activate cell2loc_prep_r
#mamba install ipykernel
#python3 -m ipykernel install --user --name=cell2loc_prep_r --display-name "cell2loc_prep_r"

In [None]:
#register kernel 
#in terminal, open R
#install.packages('IRkernel')
#IRkernel::installspec(user = TRUE)
#quit()
.libPaths()
ip <- installed.packages(fields = "LibPath")
ip
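
Every later cell depends on a handful of packages, so a quick availability check before loading them can save a failed run. The following is a minimal sketch in base R; the helper name `missing_packages` is hypothetical (not part of the original notebook), and the package list is simply taken from the `library()` calls used in this notebook.

```r
# Hypothetical helper (an assumption for illustration): return the subset of
# `pkgs` whose namespaces cannot be loaded from the current library paths.
missing_packages <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!ok]
}

# Packages loaded with library() elsewhere in this notebook
required <- c("Seurat", "SeuratDisk", "dplyr", "ggplot2")
not_installed <- missing_packages(required)

if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "))
} else {
  message("All required packages are available.")
}
```

If anything is reported as missing, the kernel is probably not using the library paths of the intended environment, which `.libPaths()` above helps to confirm.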


Pipeline for single-cell preprocessing, QC, integration, and export prior to Cell2location
Single-cell reference data: GEO accession GSE158937 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE158937), single-cell RNA sequencing samples from high-grade serous ovarian cancer (HGSOC)
    barcodes.tsv: List of all cell barcodes (unique cell IDs)
    features.tsv or genes.tsv: List of all genes/features measured (gene names or IDs)
    matrix.mtx: The raw count matrix in sparse format; rows correspond to the features and columns to the barcodes
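
To make the relationship between the three files concrete, here is a small self-contained sketch that writes a toy trio to a temporary directory and reassembles it with the Matrix package. The gene names, barcodes, and counts are made up for illustration, and the toy features.tsv has a single column, whereas real 10x files carry extra columns. On real data, `Read10X()` does this assembly for you.

```r
library(Matrix)

# Toy data (made up): 3 genes x 4 cells, filled column by column
genes    <- c("GeneA", "GeneB", "MT-CO1")
barcodes <- c("AAAC-1", "AAAG-1", "AACC-1", "AACG-1")
counts <- Matrix(c(1, 0, 2,
                   0, 0, 5,
                   3, 1, 0,
                   0, 4, 0), nrow = 3, sparse = TRUE)

# Write the three files the way a 10x directory lays them out
tmp <- tempfile("toy10x_")
dir.create(tmp)
writeMM(counts, file.path(tmp, "matrix.mtx"))
writeLines(genes, file.path(tmp, "features.tsv"))
writeLines(barcodes, file.path(tmp, "barcodes.tsv"))

# Reassemble: rows are features, columns are barcodes
mat <- as(readMM(file.path(tmp, "matrix.mtx")), "CsparseMatrix")
rownames(mat) <- readLines(file.path(tmp, "features.tsv"))
colnames(mat) <- readLines(file.path(tmp, "barcodes.tsv"))
dim(mat)
```

The matrix file stores only the non-zero entries, which is why the gene and barcode labels have to come from the two companion files.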

In [2]:
library(Seurat)
library(SeuratDisk)
library(dplyr)


data_dir <- "/home/lythgo02/Documents/scRNAseq/GSE158937_RAW"
samples <- c("GSM4816045", "GSM4816046", "GSM4816047")  # replace with your sample IDs
seurat_list <- lapply(samples, function(samp) {
  folder <- file.path(data_dir, paste0(samp, "_filtered"))
  counts <- Read10X(data.dir = folder)
  seu <- CreateSeuratObject(counts = counts, 
                            project = samp, 
                            min.cells = 3, # keep only genes expressed in at least 3 cells
                            min.features = 200) # keep only cells with at least 200 detected genes, to remove low-quality cells
  seu$orig.ident <- samp
  return(seu)
})

names(seurat_list) <- samples




ERROR: Error in library(Seurat): there is no package called ‘Seurat’


QC on each sample
    Mitochondrial content: keep cells in which less than 10% of counts come from mitochondrial genes (damaged cells lose cytoplasmic RNA, so mitochondrial transcripts make up a larger share of what remains)
        (This is a reference for deconvolution (e.g., with Cell2location), so we want clean, healthy, well-annotated cell types that form accurate basis profiles. Removing low-quality or stressed cells here improves robustness, even though I retained mitochondrial genes in the Visium QC.)
    Remove cells with 200 or fewer detected genes
    Remove cells with 7500 or more detected genes
        (An abnormally high number of detected genes can indicate doublets or multiplets from cell capture)
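
The three QC metrics and the filter can be illustrated without Seurat. The sketch below computes them in base R on a made-up count matrix; the gene names, cell names, counts, and the scaled-down thresholds are all assumptions chosen so the toy example is readable. The notebook itself uses 200, 7500, and 10 via `PercentageFeatureSet()` and `subset()`.

```r
# Toy counts (made up): rows are genes, columns are cells
counts <- matrix(
  c(5, 0, 1,  0,
    3, 2, 0,  0,
    0, 4, 0,  0,
    1, 1, 9, 10),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("GeneA", "GeneB", "GeneC", "MT-ND1"),
                  c("cell1", "cell2", "cell3", "cell4"))
)

# Mitochondrial genes are identified by the "MT-" prefix, as in the notebook
is_mt <- grepl("^MT-", rownames(counts))

qc <- data.frame(
  nFeature_RNA = colSums(counts > 0),   # genes detected per cell
  nCount_RNA   = colSums(counts),       # total counts per cell
  percent.mt   = 100 * colSums(counts[is_mt, , drop = FALSE]) / colSums(counts)
)

# Thresholds scaled down for the toy data (assumed values)
min_genes <- 1
max_genes <- 4
max_mt    <- 20
keep <- qc$nFeature_RNA > min_genes &
  qc$nFeature_RNA < max_genes &
  qc$percent.mt < max_mt

qc
rownames(qc)[keep]
```

Here `cell3` and `cell4` are dropped because almost all of their counts come from the mitochondrial gene, which is the pattern the 10% cut-off is meant to catch.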


In [None]:

library(ggplot2)

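# Plot directory (created if it does not already exist)
qc_plot_dir <- "/home/lythgo02/Documents/scRNAseq/plots/"
dir.create(qc_plot_dir, recursive = TRUE, showWarnings = FALSE)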

# Apply QC step-by-step and name plots correctly
seurat_list <- lapply(names(seurat_list), function(sample_name) {
  seu <- seurat_list[[sample_name]]
  
  # Add mitochondrial gene percentage
  seu[["percent.mt"]] <- PercentageFeatureSet(seu, pattern = "^MT-")
  
  # Create and save QC violin plot
  qc_plot <- VlnPlot(seu, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
  ggsave(
    filename = file.path(qc_plot_dir, paste0(sample_name, "_qc_violin.png")),
    plot = qc_plot,
    width = 10,
    height = 4
  )

  # Filter cells
  seu <- subset(seu, subset = nFeature_RNA > 200 & nFeature_RNA < 7500 & percent.mt < 10)
  
  return(seu)
})

# Restore names (lapply over names() returns an unnamed list)
names(seurat_list) <- samples



In [None]:

Interpretation of violin plots:
    nFeature_RNA (number of genes per cell): Most cells have 200–1000 genes detected; there are some outliers at the higher end, but these are acceptable.
    nCount_RNA (total UMIs per cell): Skewed distribution with a long tail; a few cells have very high RNA content (>20,000 UMIs), which might indicate doublets or highly active cells.
    percent.mt (mitochondrial gene percentage): Spread from ~0 to 10%, with many cells around 2–8%. This is expected; <10% is usually considered acceptable.

Suitable filters applied.

Normalise
Find highly variable genes 

In [None]:
seurat_list <- lapply(seurat_list, function(seu) {
  seu <- NormalizeData(seu)    # log-normalisation (Seurat default)
  seu <- FindVariableFeatures(seu, 
                              selection.method = "vst", # variance-stabilising transformation to find the top 2000 variable genes
                              nfeatures = 2000)
  return(seu)
})
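
With its default settings, `NormalizeData()` applies the "LogNormalize" method: each count is divided by the cell's total counts, multiplied by a scale factor of 10,000, and transformed with log1p. The base-R sketch below reproduces that arithmetic on a made-up matrix; it is illustrative only and does not replace the Seurat call.

```r
# Toy counts (made up): rows are genes, columns are cells
counts <- matrix(c(1, 3, 0,
                   0, 6, 4),
                 nrow = 3,
                 dimnames = list(c("GeneA", "GeneB", "GeneC"),
                                 c("cell1", "cell2")))

scale_factor <- 10000          # Seurat's default scale factor
cell_totals  <- colSums(counts)

# Divide every column by its cell total, rescale, then take log(1 + x)
norm <- log1p(sweep(counts, 2, cell_totals, "/") * scale_factor)
norm
```

Zero counts stay at zero after the transformation, so the normalised matrix remains sparse.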


Integrate samples based on shared genes 
Find "anchors," pairs of cells from different datasets that are in a matched biological state using dimensionality reduction.
Use the anchors to correct batch effects and merge datasets into a single integrated Seurat object
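
The idea behind anchors can be illustrated with mutual nearest neighbours: a pair of cells from two datasets is kept only if each cell is among the other's nearest neighbours in a shared low-dimensional space. The toy sketch below shows that idea in base R on made-up 2-D coordinates. It is a conceptual illustration under stated assumptions (toy embeddings, `k = 1`, hypothetical helper `cross_dist`), not Seurat's implementation, which additionally filters and scores anchors.

```r
# Toy 2-D embeddings (made up) for cells from two batches
batch_a <- rbind(a1 = c(0, 0), a2 = c(5, 5), a3 = c(10, 0))
batch_b <- rbind(b1 = c(0.5, 0.2), b2 = c(5.2, 4.9), b3 = c(20, 20))

# Hypothetical helper: Euclidean distances, rows = batch A, columns = batch B
cross_dist <- function(x, y) {
  d <- as.matrix(dist(rbind(x, y)))
  d[seq_len(nrow(x)), nrow(x) + seq_len(nrow(y)), drop = FALSE]
}
d <- cross_dist(batch_a, batch_b)

k <- 1  # neighbours considered on each side (assumption for the toy example)

# TRUE where a B cell is among the k nearest B cells of an A cell (A x B)
a_picks_b <- t(apply(d, 1, function(r) rank(r, ties.method = "first") <= k))
# TRUE where an A cell is among the k nearest A cells of a B cell (A x B)
b_picks_a <- apply(d, 2, function(cl) rank(cl, ties.method = "first") <= k)

# A pair is an anchor candidate only if the choice is mutual
mutual <- a_picks_b & b_picks_a
idx <- which(mutual, arr.ind = TRUE)
anchor_pairs <- data.frame(
  cell_a = rownames(d)[idx[, "row"]],
  cell_b = colnames(d)[idx[, "col"]],
  stringsAsFactors = FALSE
)
anchor_pairs
```

In this toy case `a3` and `b3` have no mutual partner, so they contribute no anchor; cells without a counterpart in the other dataset behave the same way.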

In [None]:
# Select features (common genes) for integration
features <- SelectIntegrationFeatures(object.list = seurat_list) #output = vector of gene names for integration 

# Find integration anchors (the default reduction is CCA; pass reduction = "rpca" to use reciprocal PCA)
anchors <- FindIntegrationAnchors(object.list = seurat_list, anchor.features = features)

# Integrate data
seu_integrated <- IntegrateData(anchorset = anchors)

# Switch to integrated assay for downstream analysis
DefaultAssay(seu_integrated) <- "integrated"

For the purposes of QC:
Scale and centre the gene expression values by subtracting each gene's mean and dividing by its standard deviation, so that each gene has mean 0 and variance 1
Perform dimensionality reduction and clustering, then plot to check for batch effects
Comment these lines out before processing for Cell2location
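
The scaling step is a per-gene z-score. A minimal base-R sketch on a made-up expression matrix is given below; the values and names are assumptions for illustration. `ScaleData()` performs the equivalent operation on the Seurat object and also caps extreme scaled values through its `scale.max` argument.

```r
# Toy expression values (made up): rows are genes, columns are cells
expr <- matrix(c(1, 2, 3, 4,
                 10, 10, 20, 20),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("GeneA", "GeneB"), paste0("cell", 1:4)))

# scale() works column-wise, so transpose to put genes in columns,
# centre and scale them, then transpose back to genes x cells
scaled <- t(scale(t(expr), center = TRUE, scale = TRUE))

round(rowMeans(scaled), 10)   # per-gene mean after scaling
apply(scaled, 1, sd)          # per-gene standard deviation after scaling
```

After scaling, highly expressed genes such as `GeneB` no longer dominate distance-based steps like PCA simply because of their magnitude.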

In [None]:


#seu_integrated <- ScaleData(seu_integrated, verbose = FALSE)
#seu_integrated <- RunPCA(seu_integrated, npcs = 30, verbose = FALSE)
#seu_integrated <- RunUMAP(seu_integrated, reduction = "pca", dims = 1:30)


#check origin 
#table(seu_integrated$orig.ident)

#plot pca coloured by sample
#p_pca <- DimPlot(seu_integrated, reduction = "pca", group.by = "orig.ident") +
  #ggtitle("pca coloured by sample")

#ggsave("scRNAseq/plots/pca_by_sample.png", plot = p_pca, width = 6, height = 5, dpi = 300)

#plot umap coloured by sample
#p_umap <- DimPlot(seu_integrated, reduction = "umap", group.by = "orig.ident") +
  #ggtitle("umap coloured by sample")

#ggsave("scRNAseq/plots/umap_by_sample.png", plot = p_umap, width = 6, height = 5, dpi = 300)


This UMAP suggests:
Well-mixed clusters with no dominant sample-specific regions, suggesting successful integration and correction of batch effects.
Cells are grouped primarily by biological similarity, not by sample origin.
You can now proceed with downstream steps: clustering, cell type annotation, differential expression, etc.
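
A visual check of mixing can be backed by a simple table of sample composition per cluster. The sketch below uses made-up cluster and sample labels (`S1` to `S3` are placeholders, not the GEO samples) and an arbitrary 0.8 cut-off, all of which are assumptions. With a clustered Seurat object, the labels would instead come from metadata columns such as `seurat_clusters` and `orig.ident`.

```r
# Made-up labels, one entry per cell (placeholders, not real results)
ids <- c("S1", "S2", "S3")
meta <- data.frame(
  cluster = rep(c("0", "1"), each = 6),
  sample  = c(rep(ids, 2),                 # cluster 0: two cells per sample
              rep(ids[1], 5), ids[2]),     # cluster 1: mostly sample S1
  stringsAsFactors = FALSE
)

# Fraction of each cluster contributed by each sample (rows sum to 1)
comp <- prop.table(table(meta$cluster, meta$sample), margin = 1)
round(comp, 2)

# Flag clusters where a single sample exceeds an assumed share of 0.8
dominant_share <- apply(comp, 1, max)
flagged <- names(dominant_share)[dominant_share > 0.8]
flagged
```

A flagged cluster is not necessarily a batch artefact, since a cell type can be genuinely specific to one sample, but it is a reasonable place to look first.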

In [None]:
library(SeuratDisk)
# Save the integrated Seurat object to .h5Seurat, then convert it to .h5ad for Python (Scanpy).
# Convert() writes the destination file itself; no separate save call is needed.
h5seurat_path <- "/scRNAseq/comb_GSE158937.h5Seurat"
SaveH5Seurat(seu_integrated, filename = h5seurat_path)
# Read from the same path that SaveH5Seurat() wrote to
Convert(h5seurat_path, assay = "RNA", dest = "/scRNAseq/comb_GSE158937.h5ad")

Then quit and prepare spatial data 