**WORKSHOP: SINGLE-CELL RNA-SEQ ANALYSIS WITH SCANPY**

**Objective:** We will computationally reproduce the key bioinformatic findings of the study "Single-cell RNA-Seq analysis of molecular changes during radiation-induced skin injury: the involvement of Nur77". By processing raw single-cell sequencing data from Rat (Time-series: Day 0, 7, 14, 28) and Human (Control vs. Irradiated) samples, we aim to:

* *Master Quality Control & Normalization:* We will generate "Pre- vs. Post-Processing" PCA visualizations to empirically demonstrate how Normalization removes technical artifacts (like sequencing depth) and how Log-Transformation stabilizes variance, ensuring downstream analysis reflects true biological signals rather than technical noise.

* *Construct a Skin Atlas:* We will identify and annotate major cutaneous cell populations—specifically observing the radiation-induced depletion of Keratinocytes and the dynamic fluctuation of Fibroblasts and Endothelial cells.

* *Analyze Fibroblast Heterogeneity:* We will isolate the fibroblast lineage to perform sub-clustering and Trajectory Inference (Pseudotime), mapping the developmental transition of fibroblasts during the injury repair process.

* *Validate Molecular Targets:* We will visualize the specific upregulation of the orphan nuclear receptor Nur77 (Nr4a1) within fibroblast subpopulations, consistent with its proposed role as a regulator of cellular radiosensitivity and apoptosis.

**Part 1: Setup and Environment Configuration**

First, we import the necessary libraries. We use Scanpy for the core analysis, Pandas/NumPy for data manipulation, and Matplotlib/Seaborn for plotting. We also define our directory structure to ensure our results are saved systematically.

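*sc.set_figure_params:* This sets global defaults for plots. dpi=300 gives a high figure resolution (the resolution of saved files is controlled separately by dpi_save). vector_friendly=True rasterizes the dense scatter layer when saving as PDF/SVG, which keeps file sizes manageable while axes and text stay as vector elements.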

*Path Management:* We use pathlib instead of standard strings. This makes the code robust, working seamlessly on both Windows (\) and Mac/Linux (/) file systems.
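As a minimal sketch of the point above, the snippet below builds the same path for both operating-system flavours. It uses `PurePosixPath` and `PureWindowsPath` only so that both renderings can be shown on any machine; the folder and file names are hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical folder and file names, used only for illustration.
parts = ("results", "figures", "tsne.png")

# The same components render with the separator of each OS flavour.
posix_path = PurePosixPath(*parts)
windows_path = PureWindowsPath(*parts)

print(posix_path)    # results/figures/tsne.png
print(windows_path)  # results\figures\tsne.png

# The "/" operator joins components without hard-coding a separator.
joined = PurePosixPath("results") / "figures" / "tsne.png"
```

In the workshop code, plain `Path` picks the right flavour automatically for the machine it runs on.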

In [None]:
import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.sparse
import seaborn as sns
from pathlib import Path
import bbknn

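# Compatibility shim (assumption: only needed on SciPy versions where
# csr_matrix has no .ravel method, which some downstream calls expect).
if not hasattr(scipy.sparse.csr_matrix, "ravel"):
    def ravel_fix(self, order='C'):
        return self.toarray().ravel(order)
    scipy.sparse.csr_matrix.ravel = ravel_fix

sc.settings.verbosity = 3  # 3 = also print hints, on top of errors, warnings and info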
sc.set_figure_params(dpi=300, fontsize=8, facecolor="white", frameon=False, vector_friendly=True)

BASE_DIR = Path.cwd()
WRITE_DIR = BASE_DIR / "write"
RESULT_DIR = Path.home() / "results"

for d in [WRITE_DIR, RESULT_DIR]:
    d.mkdir(parents=True, exist_ok=True)

for fig in ["1_preprocessing_and_clustering_Rat", "1_preprocessing_and_clustering_Human"]:
    (RESULT_DIR / fig).mkdir(exist_ok=True)

print(f"Results will be saved to: {RESULT_DIR}")

**Part 2: Data Loading Helper**

*The 10x Genomics Format:* Single-cell data usually comes in a triplet of files:

* *matrix.mtx:* The actual counts (how many times Gene X was detected in Cell Y).

* *barcodes.tsv:* The unique ID for every cell (e.g., AAAGTCG...).

* *features.tsv:* The list of genes (IDs and symbols). Older Cell Ranger versions name this file genes.tsv.

*The AnnData Object:* Scanpy stores data in an object called adata.

* *adata.X:* The big matrix of counts (Cells × Genes).

* *adata.obs:* Observations. A dataframe describing the Cells (e.g., which batch, total counts, % mitochondria).

* *adata.var:* Variables. A dataframe describing the Genes (e.g., gene ID, mean expression).

*Standardization Strategy:* Human gene symbols are all uppercase (e.g., TP53), while rodent genes are title case (e.g., Tp53). To perform cross-species integration later, we force all gene names to uppercase immediately.

In [None]:
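def load_and_standardize(path, sample_id, condition, species):
    """Load one 10x sample and attach sample-level metadata; return None on failure."""
    try:
        adata = sc.read_10x_mtx(BASE_DIR / path, var_names="gene_symbols", cache=False)
        # Uppercase first, then de-duplicate: uppercasing can itself create
        # duplicate symbols, so var_names_make_unique must run last.
        adata.var_names = adata.var_names.str.upper()
        adata.var_names_make_unique()
        adata.obs["sample_id"] = sample_id
        adata.obs["condition"] = condition
        adata.obs["species"] = species
        return adata
    except Exception as e:
        print(f"ERROR: Could not load {path} -> {e}")
        return None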

**Part 3: The Processing Pipeline (QC, Normalization, Integration)**

This is the most critical section. We define a function process_basic that takes raw data and turns it into a clustered map.

*1. QC (Quality Control):*

* *Mitochondrial Reads (pct_counts_mt):* When a cell undergoes apoptosis (cell death) or mechanical lysis, the plasma membrane breaks and cytoplasmic RNA leaks out. Mitochondria are enclosed by their own double membrane, so their RNA tends to stay trapped inside. A cell with >15% mitochondrial reads is therefore likely a damaged or broken cell and is removed.

* *Total Counts:* We use total counts to filter out technical artifacts: cells with fewer than 800 counts are treated as likely "empty droplets" (droplets that captured ambient RNA but no intact cell), and cells with more than 20,000 counts as likely "doublets" (two cells captured in one droplet). We also require a minimum number of detected genes (n_genes_by_counts > 400), which removes low-quality cells with too little transcriptomic information. The threshold is kept moderate so that rare but biologically relevant populations are not discarded.
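The QC rules above can be sketched on a toy count matrix. The values are invented, `qc_mask` is a hypothetical helper (not a Scanpy function), and `min_genes` is lowered to 3 only because the toy matrix has five genes; the defaults mirror the workshop thresholds.

```python
import numpy as np
import pandas as pd

# Toy count matrix: 4 cells x 5 genes (values invented for illustration).
genes = np.array(["MT-CO1", "MT-ND1", "KRT14", "DCN", "CD3D"])
counts = np.array([
    [50, 30, 900, 700, 320],        # healthy-looking cell
    [400, 300, 100, 150, 50],       # high mitochondrial fraction
    [5, 2, 40, 30, 10],             # very few counts (likely empty droplet)
    [900, 600, 9000, 8000, 6500],   # very many counts (likely doublet)
])

# Per-cell QC metrics, analogous to what calculate_qc_metrics stores in adata.obs.
is_mt = np.char.startswith(genes, "MT-")
obs = pd.DataFrame({
    "total_counts": counts.sum(axis=1),
    "n_genes_by_counts": (counts > 0).sum(axis=1),
})
obs["pct_counts_mt"] = 100 * counts[:, is_mt].sum(axis=1) / obs["total_counts"]

def qc_mask(obs, min_counts=800, max_counts=20000, max_pct_mt=15, min_genes=400):
    """Boolean mask of cells that pass all four thresholds."""
    return (
        (obs["total_counts"] > min_counts)
        & (obs["total_counts"] < max_counts)
        & (obs["pct_counts_mt"] < max_pct_mt)
        & (obs["n_genes_by_counts"] > min_genes)
    )

keep = qc_mask(obs, min_genes=3)  # min_genes lowered only for this 5-gene toy
print(obs.assign(keep=keep))
```

Only the first cell passes: the second fails the mitochondrial filter, the third the minimum-count filter, and the fourth the maximum-count filter.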

*2. Normalization (sc.pp.normalize_total):* Normalization is a critical preprocessing step designed to correct for technical artifacts related to sequencing depth. Since the total number of reads captured can vary significantly between cells due to technical efficiency rather than biology, raw counts are scaled to a fixed total (typically 10,000 counts per cell). This ensures that the observed differences in gene expression levels reflect true biological variation rather than mere sampling discrepancies.
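A minimal NumPy sketch of the scaling idea, assuming two toy cells with identical composition but a tenfold difference in depth. `normalize_to_target` is a hypothetical helper written for illustration, not the Scanpy implementation.

```python
import numpy as np

counts = np.array([
    [10, 30, 60],      # shallowly sequenced cell: 100 counts in total
    [100, 300, 600],   # same composition, 10x deeper: 1000 counts in total
], dtype=float)

def normalize_to_target(counts, target_sum=1e4):
    """Scale every cell (row) so that its counts sum to target_sum."""
    totals = counts.sum(axis=1, keepdims=True)
    return counts / totals * target_sum

normalized = normalize_to_target(counts)
print(normalized)
```

After scaling, both rows are identical, which is the intended effect: the depth difference disappears while the relative composition is preserved.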

*3. Log Transformation (sc.pp.log1p):* Log transformation is employed to stabilize the variance of the dataset. Gene expression counts are strongly right-skewed (a few genes are very highly expressed while most are low or zero), and their variance grows with the mean, which distorts methods such as PCA. Applying a natural logarithm (ln(x+1)) compresses this dynamic range and makes the distribution closer to symmetric, so the data are better suited to downstream statistical analysis such as differential expression testing.
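The effect of ln(x+1) can be seen on a handful of invented normalized values spanning several orders of magnitude:

```python
import numpy as np

# Invented normalized expression values, from absent to extremely high.
normalized = np.array([0.0, 1.0, 10.0, 100.0, 10000.0])

logged = np.log1p(normalized)   # ln(x + 1); zeros stay exactly zero
recovered = np.expm1(logged)    # the transform is invertible

print(logged.round(2))
```

The largest value is compressed to below 10 while the ordering of the values is unchanged, so highly expressed genes no longer dominate variance-based methods by sheer magnitude.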

*4. Principal Component Analysis (PCA):* PCA is a dimensionality reduction technique used to manage the high-dimensional nature of single-cell data, which often contains over 20,000 genes. It mathematically compresses this vast feature space into a smaller set of "Principal Components" (typically the top 30-50) that capture the directions of maximum variance in the data. This process isolates the predominant biological signals, such as cell differentiation states or cell cycle phases, while effectively filtering out random technical noise.
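The following sketch shows the core of PCA with a plain singular value decomposition on simulated data. The data are synthetic (two hidden factors plus noise) and `pca_svd` is a hypothetical helper; Scanpy's own implementation differs in details such as the solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 200, 50

# Simulated expression: two hidden "biological" factors plus small random noise.
latent = rng.normal(size=(n_cells, 2))
loadings = rng.normal(size=(2, n_genes))
X = latent @ loadings + 0.1 * rng.normal(size=(n_cells, n_genes))

def pca_svd(X, n_comps):
    """Return PC scores and the fraction of variance explained per component."""
    Xc = X - X.mean(axis=0)                      # center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_comps] * S[:n_comps]        # cell coordinates in PC space
    var_ratio = S ** 2 / np.sum(S ** 2)
    return scores, var_ratio[:n_comps]

scores, var_ratio = pca_svd(X, n_comps=5)
print(scores.shape, var_ratio.round(3))
```

Because the simulated data contain only two real factors, the first two components carry almost all of the variance and the remaining ones mostly capture noise.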

*5. BBKNN (Batch Balanced k-Nearest Neighbors):* BBKNN is a fast, graph-based integration algorithm designed to correct "Batch Effects" without altering the gene expression matrix. It addresses the common issue where technical differences cause biologically identical cells from different batches (here, the individual samples such as Con, 7d, 14d and 28d) to cluster separately rather than grouping by cell type. Unlike a standard k-nearest-neighbor search, which simply finds the closest cells overall, BBKNN finds each cell's nearest neighbors within every batch separately, which creates topological bridges between batches. Because it only changes the neighborhood graph and never the expression values, it avoids the risk of introducing artifacts into the counts. The corrected graph is then used for UMAP visualization and clustering.
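The neighbor-selection idea can be sketched in a few lines of NumPy. This is only the "k neighbors per batch" step on toy 2-D points; `batch_balanced_neighbors` is a hypothetical helper, and the real BBKNN package works in PCA space and additionally converts the neighbor lists into weighted graph connectivities.

```python
import numpy as np

def batch_balanced_neighbors(X, batches, k_per_batch=2):
    """For every cell, return the indices of its k nearest cells in each batch."""
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # a cell is not its own neighbor
    neighbors = []
    for b in np.unique(batches):
        idx = np.flatnonzero(batches == b)       # cells belonging to batch b
        nearest = np.argsort(dist[:, idx], axis=1)[:, :k_per_batch]
        neighbors.append(idx[nearest])           # map back to global indices
    return np.hstack(neighbors)

rng = np.random.default_rng(0)
# Two toy batches; batch "B" is shifted far away to mimic a strong batch effect.
X = np.vstack([rng.normal(size=(10, 2)), rng.normal(size=(10, 2)) + 20])
batches = np.array(["A"] * 10 + ["B"] * 10)

nbrs = batch_balanced_neighbors(X, batches, k_per_batch=2)
print(batches[nbrs[0]])  # neighbors of the first cell, by batch
```

A plain nearest-neighbor search on these points would only ever pick neighbors from a cell's own batch; the balanced version always returns two from "A" and two from "B", which is what links the batches in the graph.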

*6. Leiden Clustering (The Community Detection):*

* *What is it?:* An improvement on the Louvain algorithm. It treats cells as nodes in a graph.

* *How it works?:* It looks for areas in the graph where nodes are highly connected to each other but loosely connected to the rest of the network.

* *Why Leiden?:* Older algorithms (Louvain) sometimes produced "disconnected communities" (a cluster that is actually two separate islands). Leiden mathematically guarantees that every cluster is a well-connected community.

* *Resolution:* The "knob" you turn. Higher values yield more, smaller clusters.

  * *resolution=0.1:* Broad grouping (e.g., "Immune Cells" vs "Stromal Cells").

  * *resolution=2.0:* Specific grouping (e.g., "CD4+ T-Cells" vs "CD8+ T-Cells").

![alt text](result1/QC_1_Raw_Scatter_batch.png) ![alt text](result1/QC_2_Filtered_Scatter_batch.png)

![alt text](result1/QC_1_Raw_Violin_batch.png) ![alt text](result1/QC_2_Filtered_Violin_batch.png)

In [None]:
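def process_basic(adata, batch_key):
    """QC-filter, normalize, integrate (BBKNN), embed and cluster one dataset."""

    # Gene names were uppercased on load, so rat "Mt-" genes also match "MT-".
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)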
    
    print(f"Number of cells before filtering: {adata.n_obs}")

    print(f"Generating RAW QC Plots for {batch_key}...")

    # 1. Raw Violin Plot
    sc.pl.violin(
        adata,
        ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
        jitter=0.4,
        multi_panel=True,
        show=False
    )
    plt.suptitle(f"Raw (Pre-Filter): {batch_key}", y=1.05) 
    plt.savefig(RESULT_DIR / f"QC_1_Raw_Violin_{batch_key}.png", bbox_inches="tight")
    plt.close()

    # 2. Raw Scatter Plot
    sc.pl.scatter(
        adata, 
        x="total_counts", 
        y="n_genes_by_counts", 
        color="pct_counts_mt", 
        show=False,
        title=f"Raw (Pre-Filter) Scatter: {batch_key}" 
    )
    plt.savefig(RESULT_DIR / f"QC_1_Raw_Scatter_{batch_key}.png", bbox_inches="tight")
    plt.close()

    adata = adata[
        (adata.obs.total_counts > 800) & 
        (adata.obs.total_counts < 20000) & 
        (adata.obs.pct_counts_mt < 15) & 
        (adata.obs.n_genes_by_counts > 400) 
    ].copy()
    print(f"Number of cells after filtering: {adata.n_obs}")

    print(f"Generating FILTERED QC Plots for {batch_key}...")

    sc.pl.violin(
        adata,
        ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
        jitter=0.4,
        multi_panel=True,
        show=False
    )
    plt.suptitle(f"Filtered (Post-Filter): {batch_key}", y=1.05)
    plt.savefig(RESULT_DIR / f"QC_2_Filtered_Violin_{batch_key}.png", bbox_inches="tight")
    plt.close()

    sc.pl.scatter(
        adata, 
        x="total_counts", 
        y="n_genes_by_counts", 
        color="pct_counts_mt", 
        show=False,
        title=f"Filtered (Post-Filter) Scatter: {batch_key}"
    )
    plt.savefig(RESULT_DIR / f"QC_2_Filtered_Scatter_{batch_key}.png", bbox_inches="tight")
    plt.close()

![alt text](result1/QC_PCA_ScenarioA_LogOnly_batch.png) ![alt text](result1/QC_PCA_ScenarioB_NormOnly_batch.png) ![alt text](result1/QC_PCA_Standard_Final_batch.png)

In [None]:
    print("Generating QC Plot: Log1P ONLY (No Normalization)...")

    adata_log_only = adata.copy()
    sc.pp.log1p(adata_log_only) 
    sc.tl.pca(adata_log_only)
    
    sc.pl.pca(
        adata_log_only,
        color=[batch_key, batch_key, "total_counts", "pct_counts_mt"],
        dimensions=[(0, 1), (2, 3), (0, 1), (2, 3)],
        ncols=2,
        size=2,
        show=False,
        title=[
            f"LogOnly: {batch_key} (PC1-2)", 
            f"LogOnly: {batch_key} (PC3-4)", 
            "LogOnly: Counts", 
            "LogOnly: MT%"
        ]
    )
    plt.savefig(RESULT_DIR / f"QC_PCA_ScenarioA_LogOnly_{batch_key}.png", bbox_inches="tight")
    plt.close()
    del adata_log_only

    print("Generating QC Plot: Normalization ONLY (No Log1P)...")
    
    adata_norm_only = adata.copy()
    sc.pp.normalize_total(adata_norm_only, target_sum=1e4) 
    sc.tl.pca(adata_norm_only)
    
    sc.pl.pca(
        adata_norm_only,
        color=[batch_key, batch_key, "total_counts", "pct_counts_mt"],
        dimensions=[(0, 1), (2, 3), (0, 1), (2, 3)],
        ncols=2,
        size=2,
        show=False,
        title=[
            f"NormOnly: {batch_key} (PC1-2)", 
            f"NormOnly: {batch_key} (PC3-4)", 
            "NormOnly: Counts", 
            "NormOnly: MT%"
        ]
    )
    plt.savefig(RESULT_DIR / f"QC_PCA_ScenarioB_NormOnly_{batch_key}.png", bbox_inches="tight")
    plt.close()
    del adata_norm_only

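    # Keep the raw counts in a layer: flavor="seurat_v3" expects counts,
    # not log-normalized values.
    adata.layers["counts"] = adata.X.copy()
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    adata.raw = adata  # snapshot of log-normalized values for all genes

    sc.pp.highly_variable_genes(adata, n_top_genes=3000, flavor="seurat_v3", layer="counts")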
    sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
    sc.pp.scale(adata)
 
    sc.tl.pca(adata, mask_var="highly_variable")

    sc.pl.pca(
        adata,
        color=[batch_key, batch_key, "total_counts", "pct_counts_mt"],
        dimensions=[(0, 1), (2, 3), (0, 1), (2, 3)],
        ncols=2,
        size=2,
        show=False,
        title=[
            f"Final: {batch_key} (PC1-2)", 
            f"Final: {batch_key} (PC3-4)", 
            "Final: Counts", 
            "Final: MT%"
        ]
    )
    plt.savefig(RESULT_DIR / f"QC_PCA_Standard_Final_{batch_key}.png", bbox_inches="tight")
    plt.close()

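    # BBKNN takes the place of sc.pp.neighbors and builds a batch-balanced graph.
    bbknn.bbknn(adata, batch_key=batch_key)
    print("Calculating t-SNE...")
    # Note: t-SNE is computed from the PCA coordinates, not from the BBKNN graph,
    # so only UMAP and Leiden below use the batch-balanced neighbors.
    sc.tl.tsne(adata, n_pcs=30)
    print("Calculating UMAP...")
    sc.tl.umap(adata)
    sc.tl.leiden(adata, resolution=2)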
    
    return adata

**Part 4: Automated Annotation Strategy**

Instead of manually checking every cluster, we score them against known biological markers.

* *sc.tl.score_genes:* This function takes a list of genes (e.g., T-cell markers: CD3D, CD3E) and calculates an average expression score for each cell.

* *Control Mechanism:* To prevent random noise from looking like a signal, Scanpy selects a "control gene set" (random genes with similar average expression to your markers) and subtracts that background noise.

* *Winner-Takes-All:* We calculate scores for all cell types. If a cluster has a high "T-Cell Score" and low "B-Cell Score," we label it a T-Cell.
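The scoring and winner-takes-all logic can be sketched with pandas on simulated data. This is a simplification: the background set here is a fixed list of hypothetical "BG" genes, whereas sc.tl.score_genes samples control genes from matched expression bins. All values are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Four marker genes plus 20 hypothetical background genes ("BG0".."BG19").
genes = ["CD3D", "CD3E", "KRT14", "KRT5"] + [f"BG{i}" for i in range(20)]
expr = pd.DataFrame(rng.normal(1.0, 0.1, size=(60, len(genes))), columns=genes)
clusters = pd.Series(["0"] * 30 + ["1"] * 30, name="leiden")

expr.loc[clusters == "0", ["CD3D", "CD3E"]] += 2.0    # cluster 0 resembles T cells
expr.loc[clusters == "1", ["KRT14", "KRT5"]] += 2.0   # cluster 1 resembles keratinocytes

markers = {"T_Cells": ["CD3D", "CD3E"], "Keratinocytes": ["KRT14", "KRT5"]}
background = [g for g in genes if g.startswith("BG")]

# Simplified per-cell score: mean of the marker genes minus mean of the background set.
scores = pd.DataFrame({
    name: expr[gene_list].mean(axis=1) - expr[background].mean(axis=1)
    for name, gene_list in markers.items()
})

# Winner-takes-all: average the scores per cluster, then keep the best label.
cluster_scores = scores.groupby(clusters).mean()
cluster_labels = cluster_scores.idxmax(axis=1).to_dict()
print(cluster_labels)
```

One design consequence worth keeping in mind: every cluster always receives a label, even when all of its scores are low, so the result should be checked against the dot plot.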

In [None]:
def auto_annotate_cell_types(adata, marker_dict):
 
    old_scores = [c for c in adata.obs.columns if c.startswith("score_")]
    if old_scores: adata.obs.drop(columns=old_scores, inplace=True)
    
    print("\n--- Marker Gene Check ---")
 
    for cell_type, genes in marker_dict.items():
        valid_genes = [g for g in genes if g in adata.var_names]
        if len(valid_genes) > 0:
            sc.tl.score_genes(adata, gene_list=valid_genes, score_name=f"score_{cell_type}")
        else:
            print(f"WARNING: NO marker genes found for {cell_type}!")
    
    score_cols = [col for col in adata.obs.columns if col.startswith("score_")]
    if not score_cols: return adata

    cluster_scores = adata.obs.groupby("leiden")[score_cols].mean()
    cluster_mapping = {}
    for cluster in cluster_scores.index:
        best_score_col = cluster_scores.loc[cluster].idxmax()
        cell_name = best_score_col.replace("score_", "")
        cluster_mapping[cluster] = cell_name
    
    adata.obs["cell_type"] = adata.obs["leiden"].map(cluster_mapping)
    return adata

**Part 5: Visualization Helper**

We create a custom function to plot datasets side-by-side while keeping the axes locked.

*Locked Axes:* This function calculates the global Min/Max of the t-SNE coordinates. Even if the "Irradiated" dataset doesn't have cells in the top-right corner, the plot frame will remain identical to the "Control" plot. This is essential for spotting Cell State Transitions (e.g., seeing a cloud of cells shift position).

In [None]:
def plot_split_tsne(adata, keys, key_col, color_col, save_path, custom_titles=None, figsize_per_plot=(4, 4)):
    """Plot one t-SNE panel per key with shared axis limits.

    custom_titles (optional) gives one panel title per key; defaults to the keys.
    """
    n_plots = len(keys)
    titles = custom_titles if custom_titles is not None else keys
    fig, axes = plt.subplots(1, n_plots, figsize=(figsize_per_plot[0] * n_plots, figsize_per_plot[1]))

    if n_plots == 1: axes = [axes]

    # Global limits, so every panel shares the same frame
    x_min, x_max = adata.obsm['X_tsne'][:, 0].min(), adata.obsm['X_tsne'][:, 0].max()
    y_min, y_max = adata.obsm['X_tsne'][:, 1].min(), adata.obsm['X_tsne'][:, 1].max()
    pad_x = (x_max - x_min) * 0.05
    pad_y = (y_max - y_min) * 0.05

    for i, key in enumerate(keys):
        ax = axes[i]
        subset = adata[adata.obs[key_col] == key]

        sc.pl.tsne(
            subset,
            color=color_col,
            ax=ax,
            show=False,
            title=titles[i],
            frameon=False,
            legend_loc="none" if i < n_plots - 1 else "right margin",
            s=50
        )
        
        ax.set_xlim(x_min - pad_x, x_max + pad_x)
        ax.set_ylim(y_min - pad_y, y_max + pad_y)
        
        ax.set_xlabel("tSNE_1")
        if i == 0: ax.set_ylabel("tSNE_2")
        else: ax.set_ylabel("")
            
    plt.tight_layout()
    plt.savefig(save_path, bbox_inches="tight")
    plt.close()
    print(f"Saved split t-SNE to: {save_path}")

**Part 6: Defining Biological Markers**

*Specificity:* Good markers are expressed only in that cell type (e.g., CD3D is only in T-cells).

*Sensitivity:* Good markers are expressed in most cells of that type.

*Note:* The human list is slightly different from the rat list due to biological differences between species.

In [None]:
MARKERS_RAT = {
    "Keratinocytes (KC)": ["KRT1", "KRT14", "KRT5"],
    "Fibroblasts (FB)": ["DCN", "APOD", "CRABP1"],
    "Endothelial (EC)": ["TM4SF1", "CDH5", "CD93"],
    "Pericytes (PC)": ["RGS5", "DES", "PDGFRB"],
    "Schwann (SC)": ["GATM", "MPZ", "PLP1"],          
    "Smooth_Muscle (SMC)": ["MYH11", "MYL9", "TAGLN"],
    "Myoblasts (MB)": ["MYF5", "JSRP1", "CDH15"],
    "Neural (NC)": ["CMTM5", "BCHE", "AJAP1"],
    "Macrophages (Mø)": ["C1QC", "CD68", "PF4"],
    "Neutrophils (NEUT)": ["S100A8", "S100A9", "LYZ2"],
    "T_Cells (TC)": ["CD3D", "CD3E", "ICOS"],
    "B_Cells (BC)": ["TYROBP", "IFI30", "IGHM"],
    "Dendritic (DC)": ["CD207", "CD74", "RT1-DB1"]   
}

MARKERS_HUMAN = {
    "Keratinocytes (KC)": ["KRT14", "KRT5", "KRT1"],
    "Fibroblasts (FB)": ["DCN", "APOD", "COL1A1"],
    "Endothelial (EC)": ["CDH5", "CD93"],
    "Sweat_Gland (SGC)": ["DCD","AQP5"], 
    "Smooth_Muscle (SMC)": ["MYL9", "TAGLN", "ACTA2"],
    "Neural (NC)": ["CDH19", "SOX10", "PMP22"],
    "T_Cells (TC)": ["CD3D", "CD3E", "CD3G", "CCL5"],
    "NK_Cells (NK)": ["CD3D", "CD3E", "CD3G", "CCL5", "GZMB"],
    "Macrophages (Mø)": ["CD74", "AIF1", "CD68"],
    "Mast_Cells (MC)": ["TPSAB1", "TPSB2", "CPA3", "IL1RL1"],
    "Neutrophils (NEUT)": ["KRT14", "KRT5", "KRT1", "S100A8", "S100A9"]
}

**Part 7: Executing Rat Analysis**

**The t-SNE Plot**

* *Code Reference:* sc.pl.tsne(...) and plot_split_tsne(...)

*What is it?:* t-SNE is a non-linear dimensionality reduction technique. Imagine you have a spreadsheet where every cell has 20,000 columns (one for each gene). It is impossible to visualize 20,000 dimensions. t-SNE calculates the probability that Cell A is similar to Cell B in high-dimensional space and tries to recreate that probability in just 2 dimensions (X and Y).

*Why do we use it?*

* *Cluster Separation:* t-SNE is extremely aggressive at separating different groups. It creates distinct "islands" of cells.

* *Identification:* It allows us to visually check the Leiden clustering. If Cluster 1 and Cluster 2 are thoroughly mixed on the t-SNE plot, the clustering resolution is probably too high and is splitting a single population.

* *The "Split" t-SNE:* We use a custom function plot_split_tsne.

  * *The Problem:* If you plot all cells together, a massive influx of inflammatory cells in the "Injured" sample might obscure the small number of cells in the "Control" sample.

  * *The Solution:* We plot the conditions side-by-side (Control vs. 7 Days vs. 14 Days vs. 28 Days).

  * *Locked Axes:* Crucially, the code locks the X and Y axis limits. If a cluster of cells appears at coordinate (10, 5) in the Control plot, looking at coordinate (10, 5) in the Injury plot shows what happened to that specific population.

*How to Interpret:*

* *Proximity:* Cells that are close together are transcriptionally similar.

* *Distance:* The distance between clusters in t-SNE is not reliably meaningful. If the "T-Cell" island is far away from the "B-Cell" island, it does not mean they are "more different" than if they were close. t-SNE prioritizes local structure (neighbors), not global structure. UMAP tends to preserve somewhat more global structure, but its inter-cluster distances should also be read with caution.

* *Changes in Density (Split Plot):* If an island exists in the "7d" plot but is empty in the "Control" plot, that represents a newly infiltrated cell population (e.g., Neutrophils entering the wound).

In [None]:
print("\n--- FIGURE 1: RAT ANALYSIS ---")

rat_samples = [
    load_and_standardize("datas/rat/GSM5814220_con/", "Con", "Con", "rat"),
    load_and_standardize("datas/rat/GSM5814221_IR_7d/", "7d", "7d", "rat"),
    load_and_standardize("datas/rat/GSM5814222_IR_14d/", "14d", "14d", "rat"),
    load_and_standardize("datas/rat/GSM5814223_IR_28d/", "28d", "28d", "rat"),
]

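# index_unique appends the batch label to each barcode so cell names stay unique across samples
adata_rat = sc.concat([r for r in rat_samples if r is not None], label="batch", index_unique="-")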
adata_rat = process_basic(adata_rat, batch_key="batch")
adata_rat = auto_annotate_cell_types(adata_rat, MARKERS_RAT)

sc.pl.tsne(adata_rat, color=["cell_type"], frameon=False, show=False, title="All Conditions Combined - RAT")
plt.savefig(RESULT_DIR / "1_preprocessing_and_clustering_Rat/tSNE_Combined_RAT.png", bbox_inches="tight")
plt.close()

![alt text](result1/tSNE_Combined_RAT.png)

In [None]:
plot_split_tsne(
    adata_rat, 
    keys=["Con", "7d", "14d", "28d"], 
    key_col="condition", 
    color_col="cell_type", 
    custom_titles=["Control Group", "7 Days Post-IR", "14 Days Post-IR", "28 Days Post-IR"],
    save_path=RESULT_DIR / "1_preprocessing_and_clustering_Rat/tSNE_Split_RAT.png"
)

![alt text](result1/tSNE_Split_RAT.png)

**The Dot Plot**

* *Code Reference:* sc.pl.dotplot(...)

*What is it?:* The Dot Plot is a standard way to visualize marker genes in single-cell papers. It compactly displays two kinds of information at once for a list of genes across all cell types.

*The Visual Encodings:*

* *Dot Size (Fraction of cells):*

  * Represents the percentage of cells in that cluster expressing the gene (Sensitivity).

  * *Example:* If the dot for CD3D in the "T-Cell" row is huge, it means 90-100% of those cells have the gene. If it is tiny, only 10% do.

* *Dot Color (Mean expression):*

  * Represents the average intensity of expression (how strongly the gene is expressed, on average, in that group).

  * *Example:* A dark dot means the gene is expressed at a high level. A light color means it is expressed, but at low levels.

*Why standard_scale='var'?:* We use standard_scale='var'. This scales the color mapping from 0 to 1 per gene.

* *Without scaling:* A highly expressed structural gene (like Actin) would be dark blue, and a transcription factor (which is naturally low abundance) would be invisible.

* *With scaling:* The highest expression of Actin is set to 1.0 (Dark), and the highest expression of the Transcription Factor is also set to 1.0 (Dark). This allows you to compare the pattern of expression, regardless of absolute abundance.
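The per-gene scaling described above amounts to a min-max rescale of each gene's column. The sketch below uses invented mean-expression values, with ACTB standing in for "Actin" and NR4A1 for a low-abundance transcription factor.

```python
import pandas as pd

# Invented mean expression per cell type (rows) and gene (columns).
mean_expr = pd.DataFrame(
    {"ACTB": [80.0, 100.0, 90.0], "NR4A1": [0.2, 1.0, 0.1]},
    index=["Keratinocytes", "Fibroblasts", "T_Cells"],
)

# Per-gene scaling: subtract each gene's minimum, divide by its remaining range.
scaled = (mean_expr - mean_expr.min()) / (mean_expr.max() - mean_expr.min())
print(scaled.round(2))
```

After scaling, both genes peak at 1.0 in fibroblasts even though their absolute levels differ by two orders of magnitude. The flip side is that the color no longer says anything about absolute abundance.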

*How to Interpret:*

* *Diagonal Pattern:* A good annotation should look roughly like a diagonal: the markers of the first cell type light up mainly in the first row, the markers of the second cell type in the second row, and so on.

* *"Dirty" Dots:* If a marker gene is expressed in every single row (cluster), it behaves like a "Housekeeping Gene" (or, for MT- genes, a "Mitochondrial Gene") and is not useful for identifying cell types.

* *Empty Rows:* If a row (Cell Type) has no dots lighting up for its supposed markers, your annotation is wrong, or the sequencing depth was too low to capture those genes.

In [None]:
# Flatten the marker dictionary and drop duplicates while keeping the order
all_rat_markers = [g for genes in MARKERS_RAT.values() for g in genes]
all_rat_markers = list(dict.fromkeys(all_rat_markers))
sc.pl.dotplot(adata_rat, all_rat_markers, groupby="cell_type", standard_scale="var", show=False)
plt.savefig(RESULT_DIR / "1_preprocessing_and_clustering_Rat/Dotplot_RAT.png", bbox_inches="tight")
plt.close()

![alt text](result1/Dotplot_RAT.png)

In [None]:
counts = adata_rat.obs.groupby(["condition", "cell_type"]).size().unstack(fill_value=0)
props = counts.div(counts.sum(axis=1), axis=0) * 100
props = props.loc[["Con", "7d", "14d", "28d"]]
props.plot(kind="bar", stacked=True, figsize=(8, 5), colormap="tab20")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title("Rat Skin Cell Type Proportions")
plt.tight_layout()
plt.savefig(RESULT_DIR / "1_preprocessing_and_clustering_Rat/Proportions_RAT.png")
plt.close()

print("   > Extracting Rat Fibroblasts...")
rat_fibro = adata_rat[adata_rat.obs["cell_type"].str.contains("Fibro", na=False)].copy()
if rat_fibro.raw is not None: rat_fibro = rat_fibro.raw.to_adata()

**The Stacked Bar Plot (Proportions)**

* *Code Reference:* props.plot(kind="bar", stacked=True)

*What is it?:* This moves the analysis from Gene Expression (Transcriptomics) to Population Dynamics (Cellular Ecology). It visualizes the cellular composition of the tissue.

*The Logic:*

  1. We count how many cells of each type exist in each sample (e.g., "In Control, we have 500 Fibroblasts and 50 T-Cells").

  2. We Normalize to 100%. We cannot compare raw numbers because the "Control" sample might have 5,000 cells total while the "Irradiated" sample has 8,000.

  3. We stack the percentages to see the relative abundance.
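The three steps above can be reproduced on a toy table. The cell counts are invented, and `pd.crosstab` is used here as a compact alternative to the groupby/unstack approach in the workshop code.

```python
import pandas as pd

# Invented per-cell annotations: 4 control cells and 8 cells at day 7.
obs = pd.DataFrame({
    "condition": ["Con"] * 4 + ["7d"] * 8,
    "cell_type": ["Fibroblasts"] * 3 + ["T_Cells"]
                 + ["Fibroblasts"] * 4 + ["T_Cells"] * 4,
})

# Count cells per condition and cell type, then normalize each row to 100%.
props = pd.crosstab(obs["condition"], obs["cell_type"], normalize="index") * 100
print(props)
```

In this toy example the number of fibroblasts rises from 3 to 4, yet their share falls from 75% to 50% because T cells expand faster. Proportions are relative, so a shrinking bar does not by itself mean that cells were lost.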

*Why is it crucial for this radiation injury model?:* In injury models, transcriptional change (genes turning on/off) is only half the story. The other half is infiltration and proliferation.

*How to Interpret:*

* *Expansion:* If the "Macrophage" block represents 5% of the column in "Con" but expands to 40% in "7d", it points to a strong immune response.

* *Depletion:* If the "Endothelial" block shrinks markedly after radiation, it suggests vascular damage or apoptosis caused by the treatment. Because the bars are relative, a shrinking block can also reflect other populations expanding, so check the absolute counts as well.

* *Recovery:* If the proportions in "28d" look similar to "Con", it suggests the tissue has healed and returned to homeostasis. If "28d" still looks like "7d", it suggests chronic inflammation or permanent fibrosis.

![alt text](result1/Proportions_RAT.png)

**Part 8: Executing Human Analysis**

*Wilcoxon Rank-Sum Test:* While we use a dictionary to name cells here, normally you discover markers using statistics.

* *The Problem:* A t-test assumes approximately normally distributed data (a Bell Curve). Single-cell counts are sparse and overdispersed (lots of zeros, heavy tails; often modeled as negative binomial), so a t-test can be unreliable and may give false positives.

* *The Wilcoxon Solution:* This is a Non-Parametric test.

  * It doesn't care about the values (e.g., 0.1 vs 1000).

  * It only cares about the Rank.

  * *Example:* It lists all cells from highest expression to lowest. If all "Cluster 1" cells are at the top of the list, they have a high "Rank Sum."

  * This makes it robust against outliers and the weird distribution of RNA-seq data.
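A minimal NumPy sketch of the rank-sum idea, using invented expression values. `average_ranks` and `rank_sum_u` are hypothetical helpers; they compute only the U statistic (no p-value). In Scanpy, the full test is available through sc.tl.rank_genes_groups with method="wilcoxon".

```python
import numpy as np

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    _, start, counts = np.unique(sorted_vals, return_index=True, return_counts=True)
    avg = start + (counts + 1) / 2          # average rank of each tie group
    ranks = np.empty(len(values))
    ranks[order] = np.repeat(avg, counts)
    return ranks

def rank_sum_u(group, rest):
    """Mann-Whitney U statistic of `group` versus `rest`."""
    ranks = average_ranks(np.concatenate([group, rest]))
    n1 = len(group)
    return ranks[:n1].sum() - n1 * (n1 + 1) / 2

others = [0.0, 0.0, 1.0]                                  # cells outside the cluster
u = rank_sum_u([5.0, 6.0, 7.0], others)                   # cluster cells rank on top
u_outlier = rank_sum_u([5.0, 6.0, 7000.0], others)        # same ranks, one huge value
print(u, u_outlier)
```

Both calls give the same statistic, because replacing 7 with 7000 does not change the ordering of the cells. This is the robustness to outliers described above.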

In [None]:
print("\n--- FIGURE 2: HUMAN ANALYSIS ---")

human_samples = [
    load_and_standardize("datas/human/GSM5821748_con/", "Con", "Con", "human"),
    load_and_standardize("datas/human/GSM5821749_IR/", "IR", "IR", "human"),
]

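# index_unique appends the batch label to each barcode so cell names stay unique across samples
adata_human = sc.concat([h for h in human_samples if h is not None], label="batch", index_unique="-")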
adata_human = process_basic(adata_human, batch_key="batch")
adata_human = auto_annotate_cell_types(adata_human, MARKERS_HUMAN)

sc.pl.tsne(adata_human, color=["cell_type"], frameon=False, show=False, title="All Conditions Combined - HUMAN")
plt.savefig(RESULT_DIR / "1_preprocessing_and_clustering_Human/tSNE_Combined_HUMAN.png", bbox_inches="tight")
plt.close()

plot_split_tsne(
    adata_human, 
    keys=["Con", "IR"], 
    key_col="condition", 
    color_col="cell_type", 
    custom_titles=["Control Group", "IR Treatment"],
    save_path=RESULT_DIR / "1_preprocessing_and_clustering_Human/tSNE_Split_HUMAN.png"
)

# Flatten the marker dictionary and drop duplicates while keeping the order
all_human_markers = [g for genes in MARKERS_HUMAN.values() for g in genes]
all_human_markers = list(dict.fromkeys(all_human_markers))
sc.pl.dotplot(adata_human, all_human_markers, groupby="cell_type", standard_scale="var", show=False)
plt.savefig(RESULT_DIR / "1_preprocessing_and_clustering_Human/Dotplot_HUMAN.png", bbox_inches="tight")
plt.close()

counts_human = adata_human.obs.groupby(["condition", "cell_type"]).size().unstack(fill_value=0)
props_human = counts_human.div(counts_human.sum(axis=1), axis=0) * 100
props_human = props_human.loc[["Con", "IR"]] 
props_human.plot(kind="bar", stacked=True, figsize=(6, 5), colormap="tab20")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title("Human Skin Cell Type Proportions")
plt.tight_layout()
plt.savefig(RESULT_DIR / "1_preprocessing_and_clustering_Human/Proportions_HUMAN.png")
plt.close()

print("   > Extracting Human Fibroblasts...")
human_fibro = adata_human[adata_human.obs["cell_type"].str.contains("Fibro", na=False)].copy()
if human_fibro.raw is not None: human_fibro = human_fibro.raw.to_adata()

**Part 9: Advanced Sub-clustering Strategy**

*The Fractal Nature of Biology:* Biological heterogeneity is fractal. If you look at "Cells," you see T-cells vs B-cells. If you zoom into "T-cells," you see CD4 vs CD8. If you zoom into "CD4," you see Naive vs Memory.

*Re-HVG Selection:* The genes that distinguish a Fibroblast from a Neuron (e.g., COL1A1) are useless for distinguishing two types of Fibroblasts. We must re-run highly_variable_genes to find genes that vary only within the fibroblast population.

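*Automated Resolution Tuning:* The Leiden resolution has no natural "correct" value. This script uses a binary search (like a "High-Low" game) over the range 0.1 to 2.0 to find a resolution that produces the N clusters we expect based on prior literature. The number of clusters is not guaranteed to grow smoothly with resolution, so if no exact match is found within 15 iterations, the last resolution tried is used.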
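The search logic can be tried in isolation with a stand-in for the clustering step. In this sketch, `tune_resolution` is a hypothetical helper and `fake_cluster_count` is an invented function that replaces a real Leiden run, so the example needs no single-cell data.

```python
def tune_resolution(count_clusters, target, low=0.1, high=2.0, max_iter=15):
    """Binary-search a resolution; returns (resolution, whether target was hit)."""
    res = (low + high) / 2
    for _ in range(max_iter):
        res = (low + high) / 2
        n = count_clusters(res)
        if n == target:
            return res, True
        if n > target:
            high = res      # too many clusters: lower the resolution
        else:
            low = res       # too few clusters: raise the resolution
    return res, False

def fake_cluster_count(resolution):
    """Invented stand-in for a Leiden run: more clusters at higher resolution."""
    return int(resolution * 10)

res, found = tune_resolution(fake_cluster_count, target=7)
_, found_impossible = tune_resolution(fake_cluster_count, target=100)
print(res, found, found_impossible)
```

Returning a flag alongside the resolution makes it explicit when the target could not be reached, which the workshop function does not report.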

In [None]:
print("\n--- FIGURE 3: FIBROBLAST DETAILED ANALYSIS ---")

(RESULT_DIR / "2_trajectory_analysis").mkdir(parents=True, exist_ok=True)

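def auto_tune_resolution(adata, target_clusters):
    """Binary-search the Leiden resolution in [0.1, 2.0] for target_clusters clusters.

    Returns the last resolution tried if no exact match is found in 15 iterations.
    """
    res_low, res_high = 0.1, 2.0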
    for _ in range(15):
        current_res = (res_low + res_high) / 2
        sc.tl.leiden(adata, resolution=current_res, key_added="sub_leiden")
        n_clusters = len(adata.obs['sub_leiden'].unique())
        
        if n_clusters == target_clusters: return current_res
        if n_clusters > target_clusters: res_high = current_res 
        else: res_low = current_res 
    return current_res

**Part 10: Trajectory Inference (PAGA & ForceAtlas2)**

This is where we move from "Clustering" (discrete groups) to "Trajectory" (continuous flow).

*1. Re-calculation:* We re-run PCA and Neighbors because the variable genes that distinguish Fibroblast A from Fibroblast B are different from those that distinguish Fibroblasts from T-Cells.

*2. PAGA (Partition-based Graph Abstraction):*

* Standard t-SNE/UMAP often breaks continuous trajectories into separate islands.

* PAGA draws a "map of connectivity" between clusters. It tells us: "Cluster 1 is strongly connected to Cluster 2, but not Cluster 3."

*3. ForceAtlas2 (FA2):*

* This is a graph layout algorithm that simulates physics. Nodes (cells) repel each other like magnets, but edges (similarities) pull them together like springs.

* We initialize FA2 using the PAGA positions (init_pos='paga'). This forces the single-cell plot to respect the biological topology, revealing tree-like differentiation paths.

*4. Diffusion Pseudotime (DPT):*

* We cannot measure "Real Time" in a single-cell experiment because the cell is destroyed when sequenced.

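* Pseudotime is a mathematical proxy. We set a "Start Cell" (Root; here the first cell of the sub-cluster that contains the most Control cells). DPT measures a diffusion-based distance, derived from random walks on the neighbor graph, of every other cell from that root.

* *Result:* A scale from 0 (the root, interpreted here as the baseline state) to 1 (the cells most distant from the root, interpreted as the most altered or differentiated).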
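To make the "distance from a root" idea concrete, the sketch below swaps in a much simpler measure than DPT: the number of graph steps from the root, rescaled to the range 0 to 1. It is not diffusion pseudotime; the toy graph and the helper `graph_pseudotime` are hypothetical.

```python
from collections import deque

def graph_pseudotime(neighbors, root):
    """Breadth-first distance from root, rescaled so the farthest cell is 1."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        cell = queue.popleft()
        for nxt in neighbors[cell]:
            if nxt not in dist:
                dist[nxt] = dist[cell] + 1
                queue.append(nxt)
    farthest = max(dist.values())
    return {c: (d / farthest if farthest else 0.0) for c, d in dist.items()}

# Toy neighbor graph: a chain a-b-c that branches into d and e.
neighbors = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d", "e"],
    "d": ["c"],
    "e": ["c"],
}

pseudotime = graph_pseudotime(neighbors, root="a")
print(pseudotime)
```

The two branch tips d and e receive the same value, which illustrates that pseudotime orders cells by distance from the root but does not by itself say which branch a cell is on.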

In [None]:
def analyze_fibroblast_subtypes_branched(adata, species_name):
    
    print(f"   > Processing {species_name} Fibroblasts...")
    
    sc.pp.filter_genes(adata, min_cells=3)
    # The input comes from .raw.to_adata(), which process_basic already
    # normalized and log1p-transformed, so those steps are not repeated here.
    adata.raw = adata
    
    sc.pp.highly_variable_genes(adata, n_top_genes=1500, flavor='seurat')
    sc.pp.scale(adata)
    sc.pp.pca(adata)

    sc.pp.neighbors(adata, n_neighbors=10, use_rep="X_pca") 
    
    target = 7 if species_name == "Rat" else 5
    res = auto_tune_resolution(adata, target)
    sc.tl.leiden(adata, resolution=res, key_added="sub_leiden")
    
    cluster_map = {str(c): f"FB{i+1}" for i, c in enumerate(sorted(adata.obs['sub_leiden'].unique().astype(int)))}
    adata.obs["sub_type"] = adata.obs["sub_leiden"].map(cluster_map)
    
    sc.tl.paga(adata, groups="sub_type")
    sc.pl.paga(adata, plot=False) 
    
    print(f"     > Calculating branched tree structure...")
    layout_key = 'fr' 
    if 'fa2' in locals() or 'fa2' in globals():
        try:
            sc.tl.draw_graph(adata, init_pos='paga', layout='fa', random_state=42)
            layout_key = 'fa'
        except:
            sc.tl.draw_graph(adata, init_pos='paga', layout='fr')
    else:
        sc.tl.draw_graph(adata, init_pos='paga', layout='fr')

    con_counts = adata.obs.groupby("sub_type")["condition"].apply(lambda x: (x == "Con").sum())
    root_clust = con_counts.idxmax()
    
    root_cells = np.flatnonzero(adata.obs["sub_type"] == root_clust)
    adata.uns["iroot"] = root_cells[0] if len(root_cells) > 0 else 0 
    
    sc.tl.diffmap(adata)
    sc.tl.dpt(adata)
    
    sc.tl.tsne(adata, n_pcs=20)
    
    return adata

rat_fibro = analyze_fibroblast_subtypes_branched(rat_fibro, "Rat")
human_fibro = analyze_fibroblast_subtypes_branched(human_fibro, "Human")
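
The root-selection rule inside the function (pick the sub-type with the most control cells, then take its first cell) can be read in isolation. The sketch below restates it with plain lists; `pick_root` and the toy labels are hypothetical and only mirror the logic, not the AnnData-based code.

```python
from collections import Counter


def pick_root(sub_types, conditions, control_label="Con"):
    """Return (root cluster, index of the first cell in that cluster)."""
    control_counts = Counter(
        st for st, cond in zip(sub_types, conditions) if cond == control_label
    )
    if not control_counts:
        return None, 0  # no control cells at all: fall back to the first cell
    root_cluster = control_counts.most_common(1)[0][0]
    return root_cluster, sub_types.index(root_cluster)


# Hypothetical per-cell annotations.
sub_types = ["FB2", "FB1", "FB1", "FB2", "FB1", "FB3"]
conditions = ["7d", "Con", "Con", "Con", "28d", "7d"]

root_cluster, root_index = pick_root(sub_types, conditions)
print(root_cluster, root_index)
```

One consequence of this rule is that the chosen root cell is simply the first cell of the control-rich cluster, so it is not guaranteed to be a control cell itself.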

**Part 11: Transcription Factor Enrichment**

* *Code Reference:* plot_fig3a(...) → sc.pl.dotplot(...)

*What is it?:* This is a variation of the standard Dot Plot in which the axes carry a different biological meaning than in Part 7.

* *Y-Axis:* Transcription Factors (e.g., NFKB1, SMAD1, FOS). These are the "master regulators" that control the expression of other genes.

* *X-Axis:* Time/Condition (Control → 7d → 14d → 28d), rather than Cell Type.

*Why do we use it?:* We want to visualize the Temporal Dynamics of gene regulation.

* *Early Responders:* You might see FOS and JUN (Immediate Early Genes) appear as large, dark dots in the "7d" column but fade away by "14d".

* *Chronic Regulators:* You might see SMAD1 (Fibrosis marker) start low in "Control" and get darker and larger as time progresses to "28d".

*Key Technical Detail:* We use cat.reorder_categories. By default, pandas sorts string categories lexicographically (14d, 28d, 7d, Con), and the plot follows that order. This destroys the biological timeline. We force the order to be logical (Con, 7d, 14d, 28d) to visualize the "flow" of the injury response.
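
The ordering problem is easy to reproduce without any plotting. The plain-Python sketch below shows the lexicographic order that string labels fall into when sorted, and one way to restore the timeline with an explicit rank; the variable names are illustrative only.

```python
# Condition labels as they might appear, in no particular order.
observed = ["28d", "Con", "7d", "14d"]

# Lexicographic sorting compares character by character, so digits come
# before letters and "14d" sorts before "7d".
alphabetical = sorted(observed)
print(alphabetical)

# The biological timeline we actually want on the X-axis.
timeline = ["Con", "7d", "14d", "28d"]
rank = {label: i for i, label in enumerate(timeline)}

# Sort by position in the timeline instead of by spelling.
chronological = sorted(observed, key=rank.__getitem__)
print(chronological)
```

In the notebook, cat.reorder_categories plays the role of the explicit `rank` mapping here.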

In [None]:

print("\n--- Generating Figure 3 Plots ---")

RAT_TFS = ["BACH1", "ETS1", "NFKB1", "RFX5", "SMAD1", "STAT2", "ATF3", "FOS", "FOSB", 
"JUN", "JUND", "IRF7", "IRF1", "CEBPE", "HOXD4", "DLX5", "GRHL2", "NFE2L1", "POU3F1", "SREBF2"]

HUMAN_TFS = ["EHF", "EMX2", "FOXA1", "HOXB3", "HOXB4", "HOXC6", "HOXC8", "HOXC9", 
"IRF6", "MESP1", "FOSL1", "NFE2L2", "NR2F1", "RARB", "SREBF2", "TEAD4", "MAFA", "NFATC1", "LHX9", "ETV3"]

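def plot_fig3a(adata, genes, filename, order, plot_title):
    # Keep only genes present in .raw and conditions present in this dataset.
    valid = [g for g in genes if g in adata.raw.var_names]
    present = [c for c in order if c in adata.obs["condition"].unique()]

    # Drop unused categories so reorder_categories sees exactly the labels in
    # `present`, then impose the biological order instead of the sorted default.
    adata.obs["condition"] = (
        adata.obs["condition"].astype("category")
        .cat.remove_unused_categories()
        .cat.reorder_categories(present)
    )

    if valid:
        sc.pl.dotplot(adata, valid, groupby="condition", standard_scale="var", show=False, title=plot_title)
        plt.savefig(RESULT_DIR / f"2_trajectory_analysis/{filename}", bbox_inches="tight")
        plt.close()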

plot_fig3a(rat_fibro, RAT_TFS, "Rat_TF.png", ["Con", "7d", "14d", "28d"], "Rat Fibroblast TF Enrichment")
plot_fig3a(human_fibro, HUMAN_TFS, "Human_TF.png", ["Con", "IR"], "Human Fibroblast TF Enrichment")

![alt text](result1/Rat_TF.png) ![alt text](result1/Human_TF.png)

**Part 12: Fibroblast Heterogeneity**

**The "Zoomed-In" t-SNE**

* *Code Reference:* sc.pl.tsne(rat_fibro, ...)

*What is it?:* This is a t-SNE calculated only on the Fibroblast cells, ignoring T-cells, B-cells, etc.

*Why recalculate it?*

* *Global t-SNE (Part 7):* Separates "Fibroblasts" from "T-Cells". The variance is driven by huge lineage markers like CD3D vs COL1A1.

* *Local t-SNE (Part 12):* Separates "Inflammatory Fibroblasts" from "Myofibroblasts".

* *The Benefit:* By recalculating the neighbor graph on just these cells, we increase the resolution. Subtle differences that were invisible in the global map now become distinct clusters (sub-types).
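
A small numeric example shows why the subset must be re-analyzed. The values below are hypothetical log-expression numbers for one lineage marker and one subtype marker; the point is only which gene carries more variance before and after restricting to fibroblasts.

```python
from statistics import pvariance

# Hypothetical cells: (cell type, lineage marker, subtype marker)
cells = [
    ("Fibroblast", 5.0, 1.0),
    ("Fibroblast", 5.2, 2.0),
    ("Fibroblast", 4.8, 1.0),
    ("Fibroblast", 5.0, 2.0),
    ("T cell", 0.0, 0.0),
    ("T cell", 0.2, 0.0),
    ("T cell", 0.0, 0.0),
    ("T cell", 0.1, 0.0),
]


def marker_variances(rows):
    """Population variance of the lineage marker and of the subtype marker."""
    return pvariance([r[1] for r in rows]), pvariance([r[2] for r in rows])


global_lineage, global_subtype = marker_variances(cells)
fibroblasts = [r for r in cells if r[0] == "Fibroblast"]
local_lineage, local_subtype = marker_variances(fibroblasts)

print("all cells:  ", round(global_lineage, 3), round(global_subtype, 3))
print("fibroblasts:", round(local_lineage, 3), round(local_subtype, 3))
```

Across all cells the lineage marker dominates the variance; within fibroblasts it is nearly constant and the subtype marker takes over, which is what the recomputed variable genes, PCA, and neighbor graph pick up.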



In [None]:
tsne_rat = sc.pl.tsne(rat_fibro, color="sub_type", legend_loc="on data", palette="tab10", 
                      title="All Conditions Combined - Rat Fibroblasts", return_fig=True, show=False)
tsne_rat.savefig(RESULT_DIR / "2_trajectory_analysis/Rat_tSNE.png", bbox_inches="tight")
plt.close(tsne_rat)

plot_split_tsne(rat_fibro, keys=["Con", "7d", "14d", "28d"], key_col="condition", color_col="sub_type",
                custom_titles=["Control Group", "7 Days Post-IR", "14 Days Post-IR", "28 Days Post-IR"],
                save_path=RESULT_DIR / "2_trajectory_analysis/Rat_tSNE_Split.png")

![alt text](result1/Rat_tSNE.png) ![alt text](result1/Rat_tSNE_Split.png)

In [None]:
tsne_human = sc.pl.tsne(human_fibro, color="sub_type", legend_loc="on data", palette="tab10", 
                        title="All Conditions Combined - Human Fibroblasts", return_fig=True, show=False)
tsne_human.savefig(RESULT_DIR / "2_trajectory_analysis/Human_tSNE.png", bbox_inches="tight")
plt.close(tsne_human)

plot_split_tsne(human_fibro, keys=["Con", "IR"], key_col="condition", color_col="sub_type", 
                custom_titles=["Control Group", "IR Treatment"],
                save_path=RESULT_DIR / "2_trajectory_analysis/Human_tSNE_Split.png")

![alt text](result1/Human_tSNE.png) ![alt text](result1/Human_tSNE_Split.png)

**Part 13: Visualizing the Trajectory Tree**

This section uses Graph Drawing algorithms to visualize continuous biological processes.

**The ForceAtlas2 Plot (PAGA Graph)**

* *Code Reference:* sc.pl.draw_graph(..., layout='fa')

*What is it?:* This is a network visualization.

* *Nodes:* Cells.

* *Edges:* Similarity connections.

* *Layout:* The algorithm simulates physics. Nodes repel each other, but edges pull them together.

*Why not t-SNE or UMAP?*

* t-SNE tends to "tear" continuous trajectories into separate blobs.

* ForceAtlas2 (initialized with PAGA) preserves the Continuity. If Cell A differentiates into Cell B, they will be connected by a visual "bridge" or branch. This creates the "Tree" structure where you can see a "Root" (Stem/Naïve) branching into different "Tips" (Fates).

In [None]:

def plot_trajectory_branched(adata, name, label):
    layout_key = 'fa' if 'X_draw_graph_fa' in adata.obsm else 'fr'
    print(f"   > Plotting {name} branched trajectory using '{layout_key}'...")
    
    gene = "NR4A1" if "NR4A1" in adata.raw.var_names else "NUR77"

    sc.pl.draw_graph(adata, color="dpt_pseudotime", layout=layout_key, 
                      frameon=False, show=False, cmap="viridis", 
                      edges=True, edges_width=0.2, 
                      title=f"{name} Pseudotime (Trajectory)")
    plt.savefig(RESULT_DIR / f"2_trajectory_analysis/{label}_{name}_Pseudotime_Trajectory.png", bbox_inches="tight")
    plt.close()

![alt text](result1/Fig3E_Rat_Pseudotime_Trajectory.png) ![alt text](result1/Fig3F_Human_Pseudotime_Trajectory.png)

**Pseudotime Coloring**

* *Code Reference:* sc.pl.draw_graph(..., color="dpt_pseudotime")

*What is it?*

* *Pseudotime:* A value from 0.0 to 1.0 assigned to every cell.

* *0.0 (Purple/Dark):* The "Start" of the process (usually the healthy/control cells).

* *1.0 (Yellow/Bright):* The "End" of the process (the most damaged or differentiated cells).

*Why do we use it?:* Single-cell sequencing destroys the cell. We don't know which cell came before which. Pseudotime mathematically infers the order. It tells us: "Even though we captured these cells at the same time, Cell A is 'younger' in its development than Cell B."

**Pseudotime vs. Gene Expression Scatter Plot**

* *Code Reference:* sc.pl.scatter(..., x='dpt_pseudotime', y=gene)

*What is it?:* A standard X-Y scatter plot.

* *X-Axis:* Pseudotime (0 → 1).

* *Y-Axis:* Gene Expression (Log scale).

*How to Interpret:*

* *Rising Curve:* The gene is Upregulated during the differentiation/injury process.

* *Falling Curve:* The gene is Downregulated (loss of phenotype).

* *Bell Shape:* The gene is a Transient Regulator (turns on specifically during the transition phase, then turns off).

*Why is this the "Money Shot"?:* This plot shows how expression changes along inferred time. It demonstrates correlation with pseudotime, not causality. It allows you to say: "As the fibroblast progresses through the injury response (X-axis), it progressively activates the gene NR4A1 (Y-axis), which is consistent with NR4A1 being involved in the injury phenotype."
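
The three curve shapes can also be told apart numerically. The sketch below is one rough heuristic, not part of the original analysis: it sorts cells by pseudotime, averages expression in the early, middle, and late thirds, and compares the three means. The function name, the tolerance, and the toy data are assumptions.

```python
from statistics import mean


def classify_trend(pseudotime, expression, tol=0.1):
    """Label a gene as rising, falling, bell, or flat along pseudotime."""
    pairs = sorted(zip(pseudotime, expression))  # order cells by pseudotime
    third = len(pairs) // 3                      # needs at least 3 cells
    early = mean(e for _, e in pairs[:third])
    middle = mean(e for _, e in pairs[third:2 * third])
    late = mean(e for _, e in pairs[2 * third:])
    if middle > early + tol and middle > late + tol:
        return "bell"
    if late > early + tol:
        return "rising"
    if late < early - tol:
        return "falling"
    return "flat"


# Hypothetical data: nine cells evenly spaced along pseudotime.
pt = [i / 8 for i in range(9)]
rising = classify_trend(pt, pt)
falling = classify_trend(pt, [1 - t for t in pt])
bell = classify_trend(pt, [0.0, 0.2, 0.4, 0.8, 1.0, 0.8, 0.4, 0.2, 0.0])
print(rising, falling, bell)
```

Real single-cell data are far noisier than this toy, so a smoothed fit is usually preferable to three bin means, but the reading of the plot is the same.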

In [None]:
    # Continuation of plot_trajectory_branched: color the layout by gene expression.
    if gene in adata.raw.var_names:
        # Pull the log-normalized values for this gene as a dense 1-D array.
        raw_data = adata.raw[:, gene].X
        if hasattr(raw_data, "toarray"):
            raw_data = raw_data.toarray()
        raw_data = raw_data.flatten()

        # Undo log1p, then re-express on a log10 scale with a 0.1 pseudocount.
        counts = np.expm1(raw_data)
        custom_val = np.log10(counts + 0.1)

        plot_col_name = f"{gene}_log10"
        adata.obs[plot_col_name] = custom_val

        sc.pl.draw_graph(adata, color=plot_col_name, layout=layout_key,
                         frameon=False, show=False, cmap="viridis",
                         edges=True, edges_width=0.2,
                         title=f"{name} {gene} (log10(val+0.1))")
        plt.savefig(RESULT_DIR / f"2_trajectory_analysis/{label}_{name}_{gene}_Trajectory.png", bbox_inches="tight")
        plt.close()

![alt text](result1/Fig3E_Rat_NR4A1_Trajectory.png) ![alt text](result1/Fig3F_Human_NR4A1_Trajectory.png)
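
The color scale in these panels comes from a two-step transformation in the code above: expm1 undoes the earlier log1p, and log10 with a 0.1 pseudocount re-expresses the value. A minimal sketch with the standard library shows what that does to a few values; the function name and the example inputs are hypothetical.

```python
import math


def log1p_to_log10(value, pseudocount=0.1):
    """Convert a log1p-transformed value to log10(normalized count + pseudocount)."""
    counts = math.expm1(value)  # undo log1p: back to normalized counts
    return math.log10(counts + pseudocount)


# A cell with no detected expression: log1p(0) = 0, so the result is log10(0.1).
zero_cell = log1p_to_log10(0.0)

# A cell whose normalized count is 9.9: the result is log10(10.0).
expressing_cell = log1p_to_log10(math.log1p(9.9))

print(round(zero_cell, 3), round(expressing_cell, 3))
```

The pseudocount keeps non-expressing cells at a finite floor of about -1 instead of negative infinity, so they still appear on the plot at the dark end of the color scale.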

In [None]:
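        # Continuation of plot_trajectory_branched (still inside the `if gene ...` block).
        # Pass an explicit Axes so Scanpy draws on this figure instead of opening a new one.
        fig, ax = plt.subplots(figsize=(6, 4))
        sc.pl.scatter(adata, x='dpt_pseudotime', y=plot_col_name, color='sub_type',
                      show=False, ax=ax, title=f"{name} {gene} across Pseudotime")
        ax.set_xlabel("Pseudotime (Start -> End)")
        ax.set_ylabel(f"{gene} Expression (log10(val+0.1))")
        fig.savefig(RESULT_DIR / f"2_trajectory_analysis/{label}_{name}_{gene}_vs_Pseudotime_Scatter.png", bbox_inches="tight")
        plt.close(fig)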

![alt text](result1/Fig3E_Rat_NR4A1_vs_Pseudotime_Scatter.png) ![alt text](result1/Fig3F_Human_NR4A1_vs_Pseudotime_Scatter.png)

In [None]:
    # Continuation of plot_trajectory_branched: color the layout by fibroblast sub-type.
    sc.pl.draw_graph(adata, color="sub_type", layout=layout_key,
                     frameon=False, show=False, palette="tab10",
                     edges=True, edges_width=0.3,
                     title=f"{name} Subtype Branches")
    plt.savefig(RESULT_DIR / f"2_trajectory_analysis/{label}_{name}_Subtypes_Trajectory.png", bbox_inches="tight")
    plt.close()

plot_trajectory_branched(rat_fibro, "Rat", "E")
plot_trajectory_branched(human_fibro, "Human", "F")

print("\n--- ALL ANALYSES COMPLETE ---")

![alt text](result1/Fig3E_Rat_Subtypes_Trajectory.png) ![alt text](result1/Fig3F_Human_Subtypes_Trajectory.png)