
# Prognostic Biomarker Discovery in Breast Cancer with TCGA RNA-Seq Data

## 1. Introduction

Breast cancer remains one of the leading causes of cancer mortality in women. Prognostic biomarkers derived from high-throughput transcriptomic data can help predict patient survival and inform personalized treatment. This project analyzes TCGA-BRCA RNA-seq and clinical data to identify survival-associated genes with Cox proportional hazards models and explores their biological relevance through enrichment analysis.



## 2. Dataset Overview

- **Source:** The Cancer Genome Atlas (TCGA)
- **Cohort:** Breast Invasive Carcinoma (BRCA)
- **Samples:** ~1000 patients
- **Data Used:**
  - RNA-seq gene expression (TPM)
  - Clinical survival data (OS time, OS event)
  - Gene annotations (Ensembl → HGNC symbols)



## 3. Workflow Summary

1. **Preprocessing**
   - Filter low-expression genes
   - Normalize data (TPM → log2)
   - Join expression with clinical metadata
2. **Survival Modeling**
   - Univariate Cox proportional hazards model
   - Select genes with adjusted p < 0.05
   - Kaplan-Meier visualization for top candidates
3. **Biological Enrichment**
   - Functional profiling (GO/KEGG)
   - Pathways linked to immune response, apoptosis, and cell cycle
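
The survival-modeling step can be illustrated with a dependency-free sketch. It assumes one continuous covariate per gene (log2 expression), Breslow handling of tied event times, a Wald test for the coefficient, and Benjamini-Hochberg as the p-value adjustment. None of these choices is stated above, and the function names and toy data are hypothetical. In practice a survival library (for example `CoxPHFitter` from `lifelines`) would normally replace the hand-rolled fit.

```python
import math

def fit_univariate_cox(x, time, event, max_iter=50, tol=1e-9):
    """Fit h(t | x) = h0(t) * exp(beta * x) by Newton-Raphson on the
    Breslow partial likelihood. Returns (beta, hazard_ratio, wald_p)."""
    beta, info = 0.0, 0.0
    for _ in range(max_iter):
        score, info = 0.0, 0.0
        for x_i, t_i, d_i in zip(x, time, event):
            if not d_i:
                continue  # censored patients enter only through risk sets
            # Risk set: patients still under observation at time t_i.
            risk = [x_j for x_j, t_j in zip(x, time) if t_j >= t_i]
            weights = [math.exp(beta * x_j) for x_j in risk]
            s0 = sum(weights)
            s1 = sum(w * x_j for w, x_j in zip(weights, risk))
            s2 = sum(w * x_j * x_j for w, x_j in zip(weights, risk))
            score += x_i - s1 / s0            # first derivative
            info += s2 / s0 - (s1 / s0) ** 2  # negative second derivative
        if info <= 0:
            raise ValueError("covariate does not vary within any risk set")
        step = max(-1.0, min(1.0, score / info))  # damped Newton step
        beta += step
        if abs(step) < tol:
            break
    z = beta * math.sqrt(info)                # beta / standard error
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided Wald p-value
    return beta, math.exp(beta), p

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical toy cohort: log2 expression of one gene, OS time, OS event.
expr = [3.1, 1.9, 2.8, 2.5, 1.2, 2.2, 2.6, 0.9, 1.5, 1.1]
os_time = [2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
os_event = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
beta, hr, p = fit_univariate_cox(expr, os_time, os_event)

# Hypothetical per-gene p-values, adjusted and filtered at 0.05.
genes = ["geneA", "geneB", "geneC", "geneD"]
raw_p = [0.001, 0.02, 0.04, 0.30]
adjusted = benjamini_hochberg(raw_p)
selected = [g for g, q in zip(genes, adjusted) if q < 0.05]
```

In a transcriptome-wide run, `fit_univariate_cox` would be called once per gene and the collected p-values passed to `benjamini_hochberg` in a single call, so that the adjustment reflects the full number of tests.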



## 4.1 Cox Regression Summary

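The table below summarizes the top 10 prognostic genes from the Cox regression analysis: five with hazard ratios above 1 and five with hazard ratios below 1. `coef` is the Cox coefficient, `HR` = exp(`coef`), `logHR` is log₂(HR), and `-log10p` is the negative base-10 logarithm of the p-value `p`.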

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>gene</th>
      <th>coef</th>
      <th>HR</th>
      <th>p</th>
      <th>logHR</th>
      <th>-log10p</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GADL1</td>
      <td>0.341109</td>
      <td>1.406506</td>
      <td>0.000066</td>
      <td>0.492116</td>
      <td>4.179588</td>
    </tr>
    <tr>
      <td>C19orf69</td>
      <td>0.316307</td>
      <td>1.372052</td>
      <td>0.000155</td>
      <td>0.456335</td>
      <td>3.810493</td>
    </tr>
    <tr>
      <td>LOC729467</td>
      <td>0.266056</td>
      <td>1.304808</td>
      <td>0.000012</td>
      <td>0.383838</td>
      <td>4.931364</td>
    </tr>
    <tr>
      <td>ODF3L2</td>
      <td>0.263123</td>
      <td>1.300987</td>
      <td>0.000104</td>
      <td>0.379607</td>
      <td>3.981350</td>
    </tr>
    <tr>
      <td>STBD1</td>
      <td>0.254482</td>
      <td>1.289794</td>
      <td>0.000092</td>
      <td>0.367140</td>
      <td>4.037085</td>
    </tr>
    <tr>
      <td>HNRNPC</td>
      <td>-0.804086</td>
      <td>0.447497</td>
      <td>0.000015</td>
      <td>-1.160051</td>
      <td>4.822058</td>
    </tr>
    <tr>
      <td>MYD88</td>
      <td>-0.472582</td>
      <td>0.623391</td>
      <td>0.000240</td>
      <td>-0.681791</td>
      <td>3.620066</td>
    </tr>
    <tr>
      <td>ZNF672</td>
      <td>-0.443441</td>
      <td>0.641824</td>
      <td>0.000109</td>
      <td>-0.639751</td>
      <td>3.960928</td>
    </tr>
    <tr>
      <td>PTMA</td>
      <td>-0.411906</td>
      <td>0.662386</td>
      <td>0.000231</td>
      <td>-0.594255</td>
      <td>3.635595</td>
    </tr>
    <tr>
      <td>RCC2</td>
      <td>-0.409440</td>
      <td>0.664022</td>
      <td>0.000110</td>
      <td>-0.590698</td>
      <td>3.958072</td>
    </tr>
  </tbody>
</table>

- **Interpretation:**
  - Genes with **HR > 1** (e.g., `GADL1`, `C19orf69`) are associated with **higher risk** and worse survival outcomes.
  - Genes with **HR < 1** (e.g., `HNRNPC`, `RCC2`) are associated with **lower risk** and improved survival.
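
The relationship between the table columns can be checked with a few lines of arithmetic. The sketch below recomputes `HR`, `logHR`, and `-log10p` from `coef` and `p` for two rows of the table. The helper name `derived_columns` is hypothetical, and because the table shows `p` rounded to six decimals, the recomputed `-log10p` matches the tabulated value only approximately.

```python
import math

def derived_columns(coef, p):
    """Recompute the derived table columns from a Cox coefficient and p-value."""
    return {
        "HR": math.exp(coef),          # hazard ratio
        "logHR": coef / math.log(2),   # log2(HR) = coef / ln(2)
        "-log10p": -math.log10(p),     # significance on a log scale
    }

# Values copied from the GADL1 and HNRNPC rows of the table.
gadl1 = derived_columns(0.341109, 0.000066)
hnrnpc = derived_columns(-0.804086, 0.000015)

for name, row in (("GADL1", gadl1), ("HNRNPC", hnrnpc)):
    direction = "higher risk" if row["HR"] > 1 else "lower risk"
    print(f"{name}: HR={row['HR']:.4f}, log2(HR)={row['logHR']:.4f} ({direction})")
```

A positive coefficient always gives HR > 1 and a positive `logHR`, which is why higher-risk genes fall on the right-hand side of a volcano plot drawn on the log₂(HR) axis.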



## 4.2 Volcano Plot

The volcano plot below visualizes the results of univariate Cox regression across all genes. Each point represents a gene, plotted by:

- **x-axis**: log₂(Hazard Ratio), indicating effect size on survival
- **y-axis**: -log₁₀(p-value), indicating statistical significance

Genes with **HR > 1** (right side) are associated with **higher risk**, while those with **HR < 1** (left side) are **protective**. Genes toward the top right and top left are the most significant candidates for further investigation.

![Volcano Plot](figures/01_fig_volcano_cox_regression.png)


## 4.3 Kaplan-Meier Plots

The Kaplan-Meier plots below illustrate overall survival differences between high and low expression groups for the top 6 prognostic genes identified through univariate Cox regression.

Each plot compares two patient groups stratified by median gene expression levels:
- **Blue curve**: patients with **low expression**
- **Red curve**: patients with **high expression**

A significant separation between the curves indicates that gene expression levels are associated with survival outcomes. These visualizations support the prognostic relevance of the identified genes.

![KM Plot](figures/02_fig_km_top6_genes.png)
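
The median split and the survival curves described above can be sketched without external libraries. The example below assumes the standard product-limit (Kaplan-Meier) estimator and assigns patients whose expression equals the median to the low group. Both the tie rule and the toy data are assumptions made for illustration, and the function names are hypothetical.

```python
from statistics import median

def kaplan_meier(time, event):
    """Product-limit estimate: list of (t, S(t)) at each distinct event time."""
    curve, surv = [], 1.0
    for t in sorted({t for t, d in zip(time, event) if d}):
        at_risk = sum(1 for u in time if u >= t)
        deaths = sum(1 for u, d in zip(time, event) if u == t and d)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve

def km_by_median_split(expr, time, event):
    """Stratify patients at the median expression and estimate one curve per group."""
    cutoff = median(expr)
    groups = {"low": ([], []), "high": ([], [])}
    for x, t, d in zip(expr, time, event):
        label = "high" if x > cutoff else "low"  # ties go to "low" (assumption)
        groups[label][0].append(t)
        groups[label][1].append(d)
    return {label: kaplan_meier(t, d) for label, (t, d) in groups.items()}

# Hypothetical toy data: expression, OS time, OS event (1 = death, 0 = censored).
expr = [5.1, 2.0, 4.4, 1.2, 3.9, 0.8]
os_time = [2, 9, 3, 11, 5, 12]
os_event = [1, 1, 1, 0, 1, 1]
curves = km_by_median_split(expr, os_time, os_event)

for label, curve in curves.items():
    print(label, [(t, round(s, 3)) for t, s in curve])
```

Visual separation of the two step functions is only descriptive; a formal comparison would add a log-rank test on the same two groups.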


## 5. Enrichment Analysis

### 5.1 GO/KEGG/Reactome Enrichment

![Enrichment Bar plot](figures/03_fig_enrichment_terms.png)

The top enriched terms from the prognostic gene set highlight both proliferative control and immune-related processes:

- **GO: microtubule cytoskeleton organization involved in mitosis** *(GO:1902850)*
- **Reactome: Mitotic Metaphase and Anaphase** *(R-HSA-2555396)*
- **Reactome: Mitotic Anaphase** *(R-HSA-68882)*
- **GO: cellular response to interleukin-12** *(GO:0071349)*
- **GO: regulation of ruffle assembly** *(GO:1900027)*
- **GO: interleukin-12-mediated signaling pathway** *(GO:0035722)*
- **GO: regulation of nitric oxide biosynthetic process** *(GO:0045428)*

These findings suggest that both **cell division control** and **immune signaling (notably IL-12 pathways)** may play a significant role in patient survival outcomes.
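
How a term comes to be called "enriched" can be illustrated with a one-sided hypergeometric (Fisher-type) over-representation test. This is a minimal sketch under that assumption and is not necessarily the exact statistic behind the results reported here. The function name and all counts in the example are hypothetical.

```python
from math import comb

def enrichment_pvalue(overlap, gene_set_size, term_size, background_size):
    """P(X >= overlap) when gene_set_size genes are drawn without replacement
    from a background in which term_size genes are annotated to the term."""
    total = comb(background_size, gene_set_size)
    max_overlap = min(gene_set_size, term_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, gene_set_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total

# Hypothetical counts: 4 prognostic genes, 2 of them annotated to a term
# that covers 5 of the 20 genes in the background.
p_term = enrichment_pvalue(overlap=2, gene_set_size=4, term_size=5,
                           background_size=20)
print(f"over-representation p-value: {p_term:.4f}")
```

Because many terms are tested at once, the per-term p-values would still need multiple-testing correction before being interpreted.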

📁 **Further enrichment data**, including full tables of GO, KEGG, and Reactome annotations with adjusted p-values and combined scores, can be found in `results/enrichr_annotation/`.


## 6. Conclusion

In this study, we developed a reproducible pipeline to identify and interpret prognostic biomarkers in breast cancer using RNA-seq data from TCGA.

- We performed univariate Cox regression analysis across the transcriptome and identified **multiple genes significantly associated with overall survival**.
- **Top high-risk genes** (e.g., `GADL1`, `C19orf69`, `LOC729467`) showed hazard ratios above 1 and were linked to **poor prognosis**.
- **Top protective genes** (e.g., `HNRNPC`, `RCC2`) showed hazard ratios below 1 and were associated with **better survival outcomes**.
- Kaplan-Meier curves for the top genes were visually consistent with these trends.
- Enrichment analysis revealed that these prognostic genes are involved in critical biological processes such as **mitotic regulation**, **interleukin signaling**, and **immune modulation**.

This integrative approach highlights promising biomarkers for future clinical investigation and provides a modular foundation for extending this pipeline to other cancer types or multi-omic layers.




## 7. Reproducibility

- Public repository: [GitHub Link](https://github.com/AKcode08/bioinformatics-portfolio/tree/main/bulkRNAseq-BRCA)
- Compatible with cloud or local execution via Docker/Conda



## 8. Future Directions

- Workflow automation with **Nextflow**
- Multivariate Cox regression and LASSO selection
- Cross-validation and external cohort validation
- Integration of mutation/CNV data
