TCGA Cancer Epigenomics Pipeline

A Python bioinformatics pipeline for differential gene and miRNA expression analysis using data from the NCI Cancer Genome Atlas (TCGA), with an interactive Streamlit front-end.

Overview

This tool implements the two-stage differential expression methodology described in:

Parikesit AA et al. Comparative Gene Analysis of Squamous Cell Lung Carcinoma Between Smoking and Non-smoking Individuals. Berita Biologi 22(3), 2023. DOI: 10.14203/beritabiology.v20i1.3991
Fugaha DR, Putta D, Lunoto D, Parikesit AA. MicroRNA Expression Profiling in Colorectal Carcinoma: Identification and Validation of Novel Diagnostic, Prognostic and Predictive Biomarkers. ICIC 2024.

The pipeline retrieves count data and clinical metadata directly from the GDC API, applies user-selected clinical filters, runs DESeq2 differential expression analysis via pydeseq2, and generates interactive visualizations.

Features

Universal cancer support: All 33 TCGA cancer project codes (LUSC, BRCA, COAD, LIHC, PRAD, and 28 others)
Two molecular layers: Gene expression (mRNA, STAR counts) and miRNA expression
Clinical stratification: Filter patients by gender, race/ethnicity, and smoking status before analysis
Two analysis modes:
- Single-stage: Tumor vs. normal (Fugaha et al. 2024 style)
- Dual-stage: Tumor/normal + clinical group comparison with gene overlap (Parikesit et al. 2023 style)
Visualizations: Volcano plot, MA plot, Wald scatter plot, Venn diagram, expression heatmap
Streamlit front-end: Interactive, step-by-step web interface

Installation

1. Create a virtual environment

python -m venv venv
source venv/bin/activate        # macOS / Linux
venv\Scripts\activate.bat       # Windows

2. Install dependencies

pip install -r requirements.txt

R is not required. This pipeline uses pydeseq2, a pure-Python port of the R DESeq2 package.

Usage

Streamlit App (recommended)

streamlit run app.py

Open your browser at http://localhost:8501.

Workflow:

Select cancer type (e.g., TCGA-LUSC) and data layer (Gene Expression / miRNA)
Click Load Data from TCGA - data is downloaded from GDC and cached locally
Apply clinical filters: gender, race/ethnicity, smoking status
Click Apply Filters
Choose analysis mode and set DESeq2 thresholds
Click Run Analysis
Explore volcano plot, MA plot, DEG tables, Venn diagram, and overlap biomarkers
Download results as CSV

Python API

from pipeline.pipeline import EpigenomicsPipeline

# Initialize with local cache directory
pl = EpigenomicsPipeline(cache_dir="./tcga_cache")

# Step 1: Download data
pl.load_data("TCGA-LUSC", "Gene Expression")

# Step 2: Apply clinical filters
pl.set_clinical_filters({
    "gender": ["male"],
    "tobacco_smoking_status": ["2", "3", "4"],  # smokers
})

# Step 3: Run dual-stage analysis (Parikesit et al. 2023 methodology)
results = pl.run_analysis(
    analysis_mode="dual_stage",
    stage2_col="tobacco_smoking_status",
    stage2_group_a=["2", "3", "4"],   # smokers
    stage2_group_b=["1"],              # non-smokers
    stage2_label_a="Smoker",
    stage2_label_b="Non-smoker",
    lfc_threshold=1.0,
    padj_threshold=0.01,
    overlap_lfc_threshold=2.0,
)

# Results
print(results["summary"])
print(results["overlap"])          # Biomarker candidates
print(results["degs_stage1"])      # Stage 1 DEGs

Pipeline Architecture

app.py                          Streamlit front-end
pipeline/
    constants.py                Cancer types, data configs, field mappings
    tcga_downloader.py          GDC REST API wrapper (file query + download + parse)
    preprocessing.py            Clinical filters, tumor/normal labeling, expression filtering
    deseq_analysis.py           pydeseq2 wrapper, dual/single-stage orchestration
    visualization.py            Plotly interactive plots + matplotlib Venn diagrams
    pipeline.py                 EpigenomicsPipeline class (orchestrates all modules)

Analysis workflow (dual-stage)

TCGA GDC API
    -> query_files()            [query file catalog by project + data type]
    -> get_clinical_metadata()  [gender, race, smoking status, tumor stage]
    -> download_and_parse()     [batch download + count matrix assembly]
         |
    apply_clinical_filters()    [restrict to subpopulation of interest]
         |
    Stage 1: DESeq2             [tumor vs. normal within clinical group]
         |                       design = ~ condition
    Stage 2: DESeq2             [clinical variable comparison]
         |                       design = ~ condition + tumor_status
    Overlap                     [intersect Stage 1 and Stage 2 DEG lists]
         |                       filter by |LFC| >= overlap_lfc_threshold
    Visualize                   [Volcano, MA, Wald scatter, Venn, heatmap]

Data Source

All data is retrieved from the NCI Genomic Data Commons (GDC) via the public REST API. No authentication is required for open-access TCGA data.

Downloaded files are cached locally under ./tcga_cache/ to avoid repeated downloads.

DESeq2 Filtering Criteria

Following Parikesit et al. (2023) and Yao et al. (2019):

Criterion	Default value
Minimum count per sample	10
Fraction of samples expressing	>= 50%
log2 fold change (DEG call)	>= 1.0
Adjusted p-value (FDR)	< 0.01
Overlap LFC threshold	>= 2.0

All thresholds are adjustable in the Streamlit sidebar.

TCGA Smoking Status Codes

Code	Meaning
1	Lifelong Non-Smoker (<100 cigarettes)
2	Current Smoker
3	Current Reformed Smoker (>15 years)
4	Current Reformed Smoker (<=15 years)
5	Current Reformed Smoker (Duration Unknown)

Citation

If you use this pipeline in your research, please cite:

@article{parikesit2023lusc,
  author  = {Shemuel, Josia and Angelique, Priscilla and Josephine, Evelina
             and Fugaha, Daniel Ryan and Gabriela, Vania
             and Alhusain, Shaheer and Parikesit, Arli Aditya},
  title   = {Comparative Gene Analysis of Squamous Cell Lung Carcinoma
             Between Smoking and Non-smoking Individuals},
  journal = {Berita Biologi},
  volume  = {22},
  number  = {3},
  pages   = {291--302},
  year    = {2023},
  doi     = {10.14203/beritabiology.v20i1.3991}
}

@inproceedings{fugaha2024mirna,
  author    = {Fugaha, Daniel Ryan and Putta, Dhannyo and Lunoto, Dennis
               and Parikesit, Arli Aditya},
  title     = {MicroRNA Expression Profiling in Colorectal Carcinoma:
               Identification and Validation of Novel Diagnostic,
               Prognostic and Predictive Biomarkers},
  booktitle = {2024 Ninth International Conference on Informatics and Computing (ICIC)},
  year      = {2024}
}

License

MIT License. See LICENSE.txt.

Author

Dr.rer.nat. Arli Aditya Parikesit Department of Bioinformatics, Indonesia International Institute for Life Sciences (i3L) Jakarta, Indonesia ORCID: 0000-0001-8716-3926 Email: arli.parikesit@i3l.ac.id

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
pipeline		pipeline
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
Fugaha_et_al___2024___MicroRNA_Expression_Profiling_in_Colorectal_Carcinoma_Identification_and_Validation_of_Novel_Diagn.pdf		Fugaha_et_al___2024___MicroRNA_Expression_Profiling_in_Colorectal_Carcinoma_Identification_and_Validation_of_Novel_Diagn.pdf
LICENSE.txt		LICENSE.txt
README.md		README.md
app.py		app.py
full_text.pdf		full_text.pdf
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TCGA Cancer Epigenomics Pipeline

Overview

Features

Installation

1. Create a virtual environment

2. Install dependencies

Usage

Streamlit App (recommended)

Python API

Pipeline Architecture

Analysis workflow (dual-stage)

Data Source

DESeq2 Filtering Criteria

TCGA Smoking Status Codes

Citation

License

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TCGA Cancer Epigenomics Pipeline

Overview

Features

Installation

1. Create a virtual environment

2. Install dependencies

Usage

Streamlit App (recommended)

Python API

Pipeline Architecture

Analysis workflow (dual-stage)

Data Source

DESeq2 Filtering Criteria

TCGA Smoking Status Codes

Citation

License

Author

About

Resources

License

Code of conduct

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages