This directory contains the automated end-to-end pipeline for analyzing cancer RNA-seq data.
The objective of this pipeline is to investigate transcriptomic responses (e.g., to drug treatments across multiple doses) using high-throughput RNA-seq. The pipeline is designed to:
- Identify Differentially Expressed Genes (DEGs) for each condition relative to a control baseline.
- Compare overlapping effects across treatments using multi-way Venn Diagrams and UpSet plots.
- Perform Functional Enrichment (GSEA & ORA) to identify activated and suppressed signaling pathways, Hallmark gene sets, and biological processes.
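
The overlap comparison behind a Venn diagram or UpSet plot can be illustrated with plain set arithmetic. The following is a minimal sketch, not the pipeline's own code: the contrast names and gene lists are hypothetical, and the function simply reports which genes belong to exactly one combination of contrasts (the quantity an UpSet plot draws as bars).

```python
from itertools import combinations


def exclusive_intersections(deg_sets):
    """Map each combination of contrasts to the genes found in exactly those contrasts."""
    names = sorted(deg_sets)
    result = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Genes shared by every contrast in this combination...
            shared = set.intersection(*(deg_sets[n] for n in combo))
            # ...minus genes that also appear in any contrast outside it.
            others = [deg_sets[n] for n in names if n not in combo]
            exclusive = shared - set().union(*others)
            if exclusive:
                result[combo] = exclusive
    return result


# Hypothetical DEG lists for three dose contrasts versus control.
example_degs = {
    "low_vs_ctrl": {"TP53", "MYC", "CDKN1A"},
    "mid_vs_ctrl": {"TP53", "MYC", "EGFR"},
    "high_vs_ctrl": {"TP53", "EGFR", "BAX"},
}

if __name__ == "__main__":
    for combo, genes in exclusive_intersections(example_degs).items():
        print(" & ".join(combo), "->", sorted(genes))
```

Because every gene lands in exactly one combination, the sizes of the returned sets add up to the total number of distinct DEGs, which is a quick sanity check on any overlap plot.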
RNAseq_pipeline/
├── _data/ # Centralized Data Storage (git ignored)
├── _hpc/ # Cluster Environment Management (Slurm)
├── scripts_upstream/ # Upstream Automation (BASH)
├── scripts_downstream/ # Downstream Analysis Suite (R)
├── test_upstream/ # Verification Suite: Upstream (Chr21)
├── test_downstream/ # Verification Suite: Downstream (Mock)
├── docs/ # Project Documentation
│ ├── project/ # Project Management
│ │ ├── upstream/ # Upstream script dissections
│ │ ├── downstream/ # Downstream script dissections
│ │ ├── research_plan.md
│ │ └── TODO.md
│ └── reports/ # Generated Analysis Reports
├── Sample_Data_Table.csv # Metadata template for the study
├── environment.yml # Conda/Mamba environment definition
├── setup_env.sh # Environment automation script
├── run # Master Production Runner
└── test # Master Verification Runner
Place raw FASTQ files in _data/fastq/. To run only the upstream pipeline:

    ./run up

To run only the downstream R analysis:

    ./run down

To run everything from start to finish:

    ./run all

For rapid verification of the pipeline logic on local hardware, a Chromosome 21 Test Suite is provided.
| Mode | Target | STAR RAM | Time |
|---|---|---|---|
| Production | Full Genome | ~30GB | 8-10 Hours |
| Test | Chr21 Only | ~1.5GB | ~10 Minutes |

    ./test all # Run entire verification (Upstream -> Downstream)

The pipeline is split into an upstream BASH execution engine and a downstream R statistical suite. Full, line-by-line documentation for every script can be found in the docs/ directory.
| Script | Biological Goal | Technical Focus | Documentation |
|---|---|---|---|
| 01_genome_prep.sh | Indexing & Trimming | STAR, Trimmomatic | Docs |
| 02_star_align.sh | Mapping (Alignment) | STAR | Docs |
| 03_alignment_qc.sh | Health Check | Picard, QC Stats | Docs |
| 04_quantification.sh | Counting Genes | featureCounts | Docs |
| 05_multiqc.sh | Final Reporting | MultiQC | Docs |
Shared Utilities (scripts_upstream/utils/):
- parse_samples.py: The "Source of Truth" that maps your CSV to file paths.
- biotype_to_multiqc.py: Transforms complex counts into visual reports for MultiQC.
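
To make the CSV-to-path mapping concrete, here is a minimal sketch of the idea. It is not the contents of parse_samples.py: the column names (`sample_id`, `condition`) and the paired-end naming scheme (`_R1.fastq.gz` / `_R2.fastq.gz`) are assumptions chosen for illustration, and only the `_data/fastq/` directory comes from this README.

```python
import csv
import io
from pathlib import PurePosixPath

FASTQ_DIR = PurePosixPath("_data/fastq")  # directory named in this README


def map_samples(csv_text, fastq_dir=FASTQ_DIR):
    """Return {sample_id: {"condition": ..., "r1": path, "r2": path}}.

    Column names and the _R1/_R2 suffixes are hypothetical.
    """
    samples = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        sample_id = (row.get("sample_id") or "").strip()
        if not sample_id:
            continue  # skip rows without a sample identifier
        if sample_id in samples:
            raise ValueError(f"duplicate sample_id: {sample_id}")
        samples[sample_id] = {
            "condition": (row.get("condition") or "").strip(),
            "r1": fastq_dir / f"{sample_id}_R1.fastq.gz",
            "r2": fastq_dir / f"{sample_id}_R2.fastq.gz",
        }
    return samples


# Hypothetical two-sample table.
demo_csv = "sample_id,condition\nctrl_1,control\ndrug_1,treated\n"

if __name__ == "__main__":
    for sample_id, info in map_samples(demo_csv).items():
        print(sample_id, info["condition"], info["r1"], info["r2"])
```

Rejecting duplicate sample IDs up front is a deliberate choice in this sketch: a silently overwritten row would only surface later as a missing column in the count matrix.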
| Script | Biological Goal | Technical Focus | Documentation |
|---|---|---|---|
| 01_bridge_data_prep.R | Translate mentor counts | Data formatting | Docs |
| 01_data_prep.R | Clean & organize | Metadata integration | Docs |
| 02_deseq2_dge.R | DGE Engine | DESeq2, apeglm | Docs |
| 03_enrichment.R | Functional analysis | GSEA, GO, KEGG | Docs |
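
For readers new to over-representation analysis (ORA), the statistic usually behind it is a one-sided hypergeometric test: given a gene universe, a pathway, and a DEG list, how likely is an overlap at least as large as the one observed? The sketch below shows that calculation in Python for illustration only. How 03_enrichment.R computes it is not stated here, and the numbers in the demo are made up.

```python
from math import comb


def ora_p_value(universe_size, pathway_size, deg_count, overlap):
    """P(X >= overlap) when deg_count genes are drawn without replacement
    from a universe containing pathway_size pathway genes."""
    upper = min(pathway_size, deg_count)
    total = comb(universe_size, deg_count)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, deg_count - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 20,000 background genes, a 200-gene pathway,
    # 500 DEGs, 15 of which fall in the pathway.
    print(ora_p_value(20_000, 200, 500, 15))
```

In practice the resulting p-values are corrected for multiple testing across all pathways tested, a step this sketch leaves out.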
04.x Visualization Suite: High-resolution modular plotting.
- 04_01_pca.R: Sample clustering / PCA. Docs
- 04_02_volcano.R: DEG significance (Volcano). Docs
- 04_03_venn.R: Multi-contrast overlap (Venn). Docs
- 04_04_heatmap_pathway.R: Top enriched pathways gene expression. Docs
- 04_05_heatmap_variable.R: Top 50 most variable genes. Docs
- 04_06_enrichment_nes.R: Hallmark enrichment scores. Docs
- 04_07_enrichment_dotplot.R: GSEA Hallmark enrichment dots (Mirror Plot). Docs
- 04_08_ora_dotplot.R: ORA GO enrichment dots. Docs
- 04_09_upset_consistency.R: UpSet multi-way comparisons. Docs
- 04_10_correlation_plots.R: Correlation scatters. Docs
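
As an illustration of the selection step behind a "top 50 most variable genes" heatmap, the sketch below ranks genes by sample variance and keeps the first n. It is a Python stand-in rather than the logic of 04_05_heatmap_variable.R; the gene names and values are hypothetical, and the input is assumed to be already normalized or transformed expression values.

```python
from statistics import variance


def top_variable_genes(expression, n=50):
    """Return up to n gene IDs ordered by decreasing sample variance.

    `expression` maps gene ID -> list of per-sample values
    (assumed to be normalized upstream of this step).
    """
    scored = {
        gene: variance(values)
        for gene, values in expression.items()
        if len(values) > 1  # variance needs at least two samples
    }
    # Sort by variance (descending), breaking ties by gene ID for stable output.
    return sorted(scored, key=lambda g: (-scored[g], g))[:n]


# Hypothetical three-gene, three-sample matrix.
demo_expression = {
    "GENE_A": [5.0, 5.1, 4.9],
    "GENE_B": [1.0, 9.0, 5.0],
    "GENE_C": [2.0, 2.0, 8.0],
}

if __name__ == "__main__":
    print(top_variable_genes(demo_expression, n=2))
```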
(Note: See docs/project/downstream/libraries.md for full library rationales).
All core dependencies are managed via Micromamba or Mamba. Refer to environment.yml for exact pinned versions.
Run bash setup_env.sh to automatically detect your package manager and install all requirements into the cancer_rnaseq environment.
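
The detection step can be pictured as a lookup on `PATH`. The sketch below is a Python illustration of that idea, not a transcription of setup_env.sh: the preference order (Micromamba first, then Mamba) is an assumption, and the `which` parameter exists only so the function can be exercised without either tool installed.

```python
import shutil

# The two managers named in this README; the order is an assumed preference.
CANDIDATES = ("micromamba", "mamba")


def detect_package_manager(candidates=CANDIDATES, which=shutil.which):
    """Return the first candidate executable found on PATH, or None."""
    for name in candidates:
        if which(name):
            return name
    return None


if __name__ == "__main__":
    manager = detect_package_manager()
    if manager is None:
        print("No supported package manager found on PATH.")
    else:
        print(f"Would install the cancer_rnaseq environment with {manager}.")
```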
This pipeline draws inspiration from and builds upon methodological patterns established in the YeastAnalysis project, reflecting a shared focus on robust, automated bioinformatic workflows.