RNA-seq Pipeline

This directory contains the automated end-to-end pipeline for analyzing cancer RNA-seq data.

Research Purpose

The objective of this pipeline is to investigate transcriptomic responses to experimental conditions (e.g., drug treatments across multiple doses) using high-throughput RNA-seq. The pipeline is designed to:

Identify Differentially Expressed Genes (DEGs) for each condition relative to a control baseline.
Compare overlapping effects across treatments using multi-way Venn Diagrams and UpSet plots.
Perform Functional Enrichment (GSEA & ORA) to identify activated and suppressed signaling pathways, Hallmark gene sets, and biological processes.
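
The overlap comparison in the second goal can be pictured as set arithmetic on per-condition DEG lists. The sketch below is a minimal Python illustration, not the pipeline's own R code: the condition names and gene lists are hypothetical, and `exclusive_intersections` is an illustrative helper that computes the exclusive membership groups an UpSet plot displays.

```python
from itertools import combinations


def exclusive_intersections(deg_sets):
    """Map each combination of conditions to the genes found in exactly those conditions."""
    names = sorted(deg_sets)
    result = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Genes shared by every condition in the combination...
            inside = set.intersection(*(deg_sets[n] for n in combo))
            # ...minus genes that also appear in any other condition.
            outside = set().union(*(deg_sets[n] for n in names if n not in combo))
            exclusive = inside - outside
            if exclusive:
                result[combo] = exclusive
    return result


# Hypothetical DEG lists for three dose levels (illustrative only).
deg_sets = {
    "low_dose": {"TP53", "MYC", "EGFR"},
    "mid_dose": {"MYC", "EGFR", "KRAS"},
    "high_dose": {"EGFR", "KRAS", "BRCA1"},
}
groups = exclusive_intersections(deg_sets)
for combo, genes in groups.items():
    print(" & ".join(combo), sorted(genes))
```

Each gene lands in exactly one group, so the group sizes sum to the number of distinct DEGs across all conditions.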

Directory Structure

RNAseq_pipeline/
├── _data/                       # Centralized Data Storage (git ignored)
├── _hpc/                        # Cluster Environment Management (Slurm)
├── scripts_upstream/            # Upstream Automation (BASH)
├── scripts_downstream/          # Downstream Analysis Suite (R)
├── test_upstream/               # Verification Suite: Upstream (Chr21)
├── test_downstream/             # Verification Suite: Downstream (Mock)
├── docs/                        # Project Documentation
│   ├── project/                 # Project Management
│   │   ├── upstream/            # Upstream script dissections
│   │   ├── downstream/          # Downstream script dissections
│   │   ├── research_plan.md
│   │   └── TODO.md
│   └── reports/                 # Generated Analysis Reports
├── Sample_Data_Table.csv        # Metadata template for the study
├── environment.yml              # Conda/Mamba environment definition
├── setup_env.sh                 # Environment automation script
├── run                          # Master Production Runner
└── test                         # Master Verification Runner

Quick Start

1. Upstream (Alignment & Quantification)

Place raw FASTQ files in _data/fastq/. You can run only the upstream pipeline using:

./run up

2. Downstream (Differential Expression)

Run only the downstream R analysis using:

./run down

3. Full Pipeline

Run everything from start to finish:

./run all

Verification & Testing (Local Run)

For rapid verification of the pipeline logic on local hardware, a Chromosome 21 Test Suite is provided.

| Mode | Target | STAR RAM | Time |
|------|--------|----------|------|
| Production | Full genome | ~30 GB | 8-10 hours |
| Test | Chr21 only | ~1.5 GB | ~10 minutes |

./test all      # Run entire verification (Upstream -> Downstream)

Pipeline Architecture & Documentation

The pipeline is split into an upstream BASH execution engine and a downstream R statistical suite. Full, line-by-line documentation for every script can be found in the docs/ directory.

Upstream Pipeline Suite: The "Genome Engine"

| Script | Biological Goal | Technical Focus | Documentation |
|--------|-----------------|-----------------|---------------|
| `01_genome_prep.sh` | Indexing & Trimming | STAR, Trimmomatic | Docs |
| `02_star_align.sh` | Mapping (Alignment) | STAR | Docs |
| `03_alignment_qc.sh` | Health Check | Picard, QC Stats | Docs |
| `04_quantification.sh` | Counting Genes | featureCounts | Docs |
| `05_multiqc.sh` | Final Reporting | MultiQC | Docs |

Shared Utilities (scripts_upstream/utils/):

parse_samples.py: The "Source of Truth" that maps your CSV to file paths.
biotype_to_multiqc.py: Transforms complex counts into visual reports for MultiQC.
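
To illustrate the kind of mapping `parse_samples.py` is described as performing, here is a minimal sketch. It is not the actual script: the column names `sample_id` and `condition` and the `<sample_id>_R1.fastq.gz` / `_R2.fastq.gz` naming scheme are assumptions made for the example. Only the `_data/fastq/` directory comes from this README.

```python
import csv
import io
from pathlib import PurePosixPath

FASTQ_DIR = PurePosixPath("_data/fastq")  # FASTQ location named in this README


def map_samples(csv_text, fastq_dir=FASTQ_DIR):
    """Return {sample_id: {condition, r1, r2}} from sample-sheet text.

    Column names and the paired-end file naming are hypothetical.
    """
    mapping = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        sample_id = row["sample_id"].strip()
        mapping[sample_id] = {
            "condition": row["condition"].strip(),
            "r1": str(fastq_dir / f"{sample_id}_R1.fastq.gz"),
            "r2": str(fastq_dir / f"{sample_id}_R2.fastq.gz"),
        }
    return mapping


# Hypothetical two-sample sheet (illustrative only).
example_csv = "sample_id,condition\nS1,control\nS2,treated\n"
samples = map_samples(example_csv)
for sample_id, info in samples.items():
    print(sample_id, info["condition"], info["r1"], info["r2"])
```

Keeping this lookup in one place means every downstream step resolves paths the same way, which is presumably why the utility is called the "Source of Truth".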

Downstream Pipeline Suite: R Analysis

| Script | Biological Goal | Technical Focus | Documentation |
|--------|-----------------|-----------------|---------------|
| `01_bridge_data_prep.R` | Translate mentor counts | Data formatting | Docs |
| `01_data_prep.R` | Clean & organize | Metadata integration | Docs |
| `02_deseq2_dge.R` | DGE Engine | DESeq2, apeglm | Docs |
| `03_enrichment.R` | Functional analysis | GSEA, GO, KEGG | Docs |
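
For readers new to over-representation analysis (ORA), the statistic behind it is usually a one-sided hypergeometric test: given a gene universe, a pathway, and a DEG list, how likely is an overlap at least this large by chance? The sketch below is a generic illustration of that test, not the code in `03_enrichment.R`, which may delegate it to an R package. The function name and the numbers are hypothetical.

```python
from math import comb


def ora_pvalue(universe, pathway, selected, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(universe, pathway, selected).

    universe: number of genes tested
    pathway:  number of universe genes annotated to the pathway
    selected: number of DEGs drawn from the universe
    overlap:  number of DEGs that fall in the pathway
    """
    max_overlap = min(pathway, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(pathway, i) * comb(universe - pathway, selected - i)
        for i in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers: 10 genes, 4 in the pathway, 3 DEGs, 2 of them in the pathway.
p = ora_pvalue(universe=10, pathway=4, selected=3, overlap=2)
print(round(p, 4))
```

In practice the resulting p-values are corrected for multiple testing across all gene sets before pathways are reported.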

04.x Visualization Suite: High-resolution modular plotting.

04_01_pca.R: Sample clustering / PCA. Docs
04_02_volcano.R: DEG significance (Volcano). Docs
04_03_venn.R: Multi-contrast overlap (Venn). Docs
04_04_heatmap_pathway.R: Gene expression for the top enriched pathways. Docs
04_05_heatmap_variable.R: Top 50 most variable genes. Docs
04_06_enrichment_nes.R: Hallmark enrichment scores. Docs
04_07_enrichment_dotplot.R: GSEA Hallmark enrichment dots (Mirror Plot). Docs
04_08_ora_dotplot.R: ORA GO enrichment dots. Docs
04_09_upset_consistency.R: UpSet multi-way comparisons. Docs
04_10_correlation_plots.R: Correlation scatters. Docs
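
The gene selection behind a "top N most variable genes" heatmap, such as the one `04_05_heatmap_variable.R` produces, can be sketched as ranking genes by variance across samples. This is a minimal Python illustration, not the R script itself: the gene names and expression values are hypothetical, and the values are assumed to be already normalized or variance-stabilized.

```python
from statistics import variance


def top_variable_genes(expr, n=50):
    """Return the n gene names with the highest sample variance."""
    ranked = sorted(expr, key=lambda gene: variance(expr[gene]), reverse=True)
    return ranked[:n]


# Hypothetical normalized expression values across four samples.
expr = {
    "GENE_A": [5.0, 5.1, 4.9, 5.0],
    "GENE_B": [1.0, 8.0, 2.0, 9.0],
    "GENE_C": [3.0, 4.0, 3.0, 4.0],
}
top2 = top_variable_genes(expr, n=2)
print(top2)
```

Ranking by variance favors genes that differ between samples, which is why such heatmaps tend to separate conditions visually.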

(Note: See docs/project/downstream/libraries.md for the rationale behind each library.)

Requirements

All core dependencies are managed via Micromamba or Mamba. Refer to environment.yml for exact pinned versions. Run bash setup_env.sh to automatically detect your package manager and install all requirements into the cancer_rnaseq environment.
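
The package-manager detection that `setup_env.sh` is described as doing can be pictured as a lookup on `PATH`. The sketch below illustrates that idea in Python and is not the script's actual logic. The preference order (Micromamba before Mamba) and the function name are assumptions.

```python
import shutil


def detect_package_manager(candidates=("micromamba", "mamba"), which=shutil.which):
    """Return the first candidate executable found on PATH, or None.

    The candidate order is an assumption made for this sketch.
    """
    for name in candidates:
        if which(name):
            return name
    return None


manager = detect_package_manager()
if manager is None:
    print("No supported package manager found on PATH.")
else:
    print(f"Would create the environment with: {manager}")
```

Passing `which` as a parameter keeps the lookup easy to test without touching the real `PATH`.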

Inspiration & Related Projects

This pipeline draws inspiration from and builds upon methodological patterns established in the YeastAnalysis project, reflecting a shared focus on robust, automated bioinformatic workflows.
