MashXP/RNAseq_pipeline

RNA-seq Pipeline

This directory contains the automated end-to-end pipeline for analyzing cancer RNA-seq data.

Research Purpose

The objective of this pipeline is to investigate transcriptomic responses to experimental perturbations (e.g., drug treatments across multiple doses) using high-throughput RNA-seq. The pipeline is designed to:

  1. Identify Differentially Expressed Genes (DEGs) for each condition relative to a control baseline.
  2. Compare overlapping effects across treatments using multi-way Venn Diagrams and UpSet plots.
  3. Perform Functional Enrichment (GSEA & ORA) to identify activated and suppressed signaling pathways, Hallmark gene sets, and biological processes.
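The overlap comparison in item 2 can be illustrated with a small sketch. It is not taken from the pipeline's R scripts; it only shows the idea behind an UpSet-style breakdown, where each gene is assigned to the exact combination of contrasts in which it is a DEG. The contrast names and gene identifiers are hypothetical placeholders.

```python
from itertools import combinations


def exclusive_intersections(deg_sets):
    """Map each combination of contrasts to the genes found in exactly those contrasts."""
    names = sorted(deg_sets)
    result = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Genes shared by every contrast in this combination.
            inside = set.intersection(*(set(deg_sets[n]) for n in combo))
            # Genes that appear in any contrast outside the combination.
            others = [set(deg_sets[n]) for n in names if n not in combo]
            outside = set().union(*others)
            exclusive = inside - outside
            if exclusive:
                result[combo] = exclusive
    return result


# Hypothetical DEG lists for three dose contrasts (placeholder gene names).
deg_sets = {
    "low_vs_ctrl": {"GENE_A", "GENE_B", "GENE_C"},
    "mid_vs_ctrl": {"GENE_B", "GENE_C", "GENE_D"},
    "high_vs_ctrl": {"GENE_C", "GENE_D", "GENE_E"},
}

overlaps = exclusive_intersections(deg_sets)
for combo, genes in sorted(overlaps.items()):
    print(" & ".join(combo), sorted(genes))
```

Because every gene lands in exactly one combination, the sizes of these groups sum to the number of distinct DEGs, which is what distinguishes an UpSet plot from simple pairwise overlaps.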

Directory Structure

RNAseq_pipeline/
├── _data/                       # Centralized Data Storage (git ignored)
├── _hpc/                        # Cluster Environment Management (Slurm)
├── scripts_upstream/            # Upstream Automation (BASH)
├── scripts_downstream/          # Downstream Analysis Suite (R)
├── test_upstream/               # Verification Suite: Upstream (Chr21)
├── test_downstream/             # Verification Suite: Downstream (Mock)
├── docs/                        # Project Documentation
│   ├── project/                 # Project Management
│   │   ├── upstream/            # Upstream script dissections
│   │   ├── downstream/          # Downstream script dissections
│   │   ├── research_plan.md
│   │   └── TODO.md
│   └── reports/                 # Generated Analysis Reports
├── Sample_Data_Table.csv        # Metadata template for the study
├── environment.yml              # Conda/Mamba environment definition
├── setup_env.sh                 # Environment automation script
├── run                          # Master Production Runner
└── test                         # Master Verification Runner

Quick Start

1. Upstream (Alignment & Quantification)

Place raw FASTQ files in _data/fastq/. You can run only the upstream pipeline using:

./run up

2. Downstream (Differential Expression)

Run only the downstream R analysis using:

./run down

3. Full Pipeline

Run everything from start to finish:

./run all

Verification & Testing (Local Run)

For rapid verification of the pipeline logic on local hardware, a Chromosome 21 Test Suite is provided.

| Mode | Target | STAR RAM | Time |
|------|--------|----------|------|
| Production | Full Genome | ~30GB | 8-10 Hours |
| Test | Chr21 Only | ~1.5GB | ~10 Minutes |

./test all      # Run entire verification (Upstream -> Downstream)

Pipeline Architecture & Documentation

The pipeline is split into an upstream BASH execution engine and a downstream R statistical suite. Full, line-by-line documentation for every script can be found in the docs/ directory.

Upstream Pipeline Suite: The "Genome Engine"

| Script | Biological Goal | Technical Focus | Documentation |
|--------|-----------------|-----------------|---------------|
| 01_genome_prep.sh | Indexing & Trimming | STAR, Trimmomatic | Docs |
| 02_star_align.sh | Mapping (Alignment) | STAR | Docs |
| 03_alignment_qc.sh | Health Check | Picard, QC Stats | Docs |
| 04_quantification.sh | Counting Genes | featureCounts | Docs |
| 05_multiqc.sh | Final Reporting | MultiQC | Docs |

Shared Utilities (scripts_upstream/utils/):

  • parse_samples.py: The "Source of Truth" that maps your CSV to file paths.
  • biotype_to_multiqc.py: Transforms complex counts into visual reports for MultiQC.
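To make the role of parse_samples.py concrete, here is a minimal sketch of mapping a sample table to FASTQ paths. It is not the actual utility: the `sample_id` and `condition` column names and the `_R1`/`_R2` paired-end file naming are assumptions chosen for illustration, and only the `_data/fastq/` location comes from this README.

```python
import csv
import io
from pathlib import Path


def map_samples_to_fastq(csv_text, fastq_dir="_data/fastq"):
    """Return {sample_id: (R1 path, R2 path)} assuming a paired-end naming scheme."""
    base = Path(fastq_dir)
    mapping = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # "sample_id" is a hypothetical column name, not confirmed by the repo.
        sample = (row.get("sample_id") or "").strip()
        if not sample:
            continue  # skip rows without a sample identifier
        mapping[sample] = (
            base / f"{sample}_R1.fastq.gz",
            base / f"{sample}_R2.fastq.gz",
        )
    return mapping


# Hypothetical metadata table contents.
example_csv = """sample_id,condition
S1,control
S2,treated
"""

sample_paths = map_samples_to_fastq(example_csv)
for sample, (r1, r2) in sample_paths.items():
    print(sample, r1, r2)
```

Keeping this mapping in one place means every upstream step resolves file paths the same way, which is presumably why the utility is described as the "Source of Truth".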

Downstream Pipeline Suite: R Analysis

| Script | Biological Goal | Technical Focus | Documentation |
|--------|-----------------|-----------------|---------------|
| 01_bridge_data_prep.R | Translate mentor counts | Data formatting | Docs |
| 01_data_prep.R | Clean & organize | Metadata integration | Docs |
| 02_deseq2_dge.R | DGE Engine | DESeq2, apeglm | Docs |
| 03_enrichment.R | Functional analysis | GSEA, GO, KEGG | Docs |
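For readers new to over-representation analysis (ORA), the statistic behind it can be sketched in a few lines. This is the standard one-sided hypergeometric test, written in plain Python for illustration; it is not the code in 03_enrichment.R, and the function name and example numbers are hypothetical.

```python
from math import comb


def ora_pvalue(universe_size, set_size, n_degs, overlap):
    """P(X >= overlap) when n_degs genes are drawn without replacement from the universe.

    universe_size: number of genes tested (background)
    set_size:      number of background genes annotated to the pathway
    n_degs:        number of differentially expressed genes
    overlap:       number of DEGs that belong to the pathway
    """
    upper = min(set_size, n_degs)
    total = comb(universe_size, n_degs)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, n_degs - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, a 200-gene pathway,
# 500 DEGs, 15 of which fall in the pathway.
p = ora_pvalue(20000, 200, 500, 15)
print(f"ORA p-value: {p:.3g}")
```

In a real analysis this p-value is computed for every gene set and then corrected for multiple testing.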

04.x Visualization Suite: High-resolution modular plotting.

  • 04_01_pca.R: Sample clustering / PCA. Docs
  • 04_02_volcano.R: DEG significance (Volcano). Docs
  • 04_03_venn.R: Multi-contrast overlap (Venn). Docs
  • 04_04_heatmap_pathway.R: Top enriched pathways gene expression. Docs
  • 04_05_heatmap_variable.R: Top 50 most variable genes. Docs
  • 04_06_enrichment_nes.R: Hallmark enrichment scores. Docs
  • 04_07_enrichment_dotplot.R: GSEA Hallmark enrichment dots (Mirror Plot). Docs
  • 04_08_ora_dotplot.R: ORA GO enrichment dots. Docs
  • 04_09_upset_consistency.R: UpSet multi-way comparisons. Docs
  • 04_10_correlation_plots.R: Correlation scatters. Docs
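The selection step behind a "top 50 most variable genes" heatmap can be sketched as follows. This is an illustration rather than the logic of 04_05_heatmap_variable.R: it assumes the expression values are already normalized or variance-stabilized, and the gene names and values are made up.

```python
from statistics import variance


def top_variable_genes(expression, n=50):
    """Return the n gene IDs with the highest sample variance across samples.

    expression: {gene_id: [value per sample]}, with at least two samples per gene.
    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(expression, key=lambda gene: (-variance(expression[gene]), gene))
    return ranked[:n]


# Hypothetical normalized expression values for four samples.
expression = {
    "GENE_A": [5.0, 5.5, 6.0, 5.2],
    "GENE_B": [2.0, 8.0, 3.0, 9.0],
    "GENE_C": [7.0, 7.0, 7.1, 7.0],
}

top_genes = top_variable_genes(expression, n=2)
print(top_genes)
```

Ranking by variance is unsupervised: it ignores the treatment labels, so the resulting heatmap shows whether samples group by condition without that grouping being imposed.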

(Note: See docs/project/downstream/libraries.md for full library rationales).

Requirements

All core dependencies are managed via Micromamba or Mamba. Refer to environment.yml for exact pinned versions. Run bash setup_env.sh to automatically detect your package manager and install all requirements into the cancer_rnaseq environment.
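The package-manager detection that setup_env.sh performs can be pictured with this sketch. It is a Python illustration, not a transcription of the shell script: the search order (micromamba before mamba) and the function name are assumptions.

```python
import shutil


def detect_package_manager(candidates=("micromamba", "mamba"), which=shutil.which):
    """Return the first candidate executable found on PATH, or None if none is installed.

    The candidate order is an assumption; the real script may prefer a different one.
    The `which` parameter exists only so the lookup can be replaced in tests.
    """
    for name in candidates:
        if which(name):
            return name
    return None


manager = detect_package_manager()
if manager is None:
    print("No supported package manager found on PATH.")
else:
    print(f"Would use {manager} to build the cancer_rnaseq environment.")
```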


Inspiration & Related Projects

This pipeline builds on methodological patterns established in the YeastAnalysis project and shares its focus on robust, automated bioinformatic workflows.

About

A modular, HPC-optimized RNA-seq pipeline for end-to-end transcriptomic analysis. Features automated upstream processing (STAR, featureCounts) and a standardized R visualization suite for differential expression and functional enrichment workflows.
