Unsupervised Learning of Breast Cancer Subtypes

Overview

Breast cancer is the most common cancer in women, regardless of age, ethnicity, or race. It is a highly heterogeneous disease with four main molecular subtypes: Basal-like, HER2-enriched, Luminal A, and Luminal B. Each subtype requires different treatment, depending on the expression (positive status) or lack of expression (negative status) of three biomarkers: estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2). A panel of 50 genes known as the PAM50 signature is currently used to classify subtypes at the molecular level from transcriptomic data; however, studies have shown that this classifier is not perfect. In addition, transcriptomic data does not capture the important role of proteins in the cell signaling pathways that promote cell proliferation and growth in breast cancer. This project uses an unsupervised learning approach to assess whether proteomic data adds information for subtype clustering, using transcriptomic data for 17,607 genes and proteomic data for 7,853 proteins from 77 breast cancer patients.

Raw Data

Raw -omics data and clinical data for the breast cancer patients can be found in Mertins et al. (2016). For the purposes of this project, The Cancer Genome Atlas (TCGA) identifiers have been stripped of their 'TCGA-' prefix.
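
The identifier clean-up described above can be sketched as follows. This is a minimal Python illustration, not code from this repository (whose scripts are written in R); the function name `strip_tcga_prefix` and the example identifiers are hypothetical.

```python
def strip_tcga_prefix(sample_id: str, prefix: str = "TCGA-") -> str:
    """Return sample_id without the leading prefix; other IDs pass through unchanged."""
    if sample_id.startswith(prefix):
        return sample_id[len(prefix):]
    return sample_id


# Illustrative identifiers only; they are not read from the repository's data files.
ids = ["TCGA-A2-A0CM", "TCGA-BH-A18Q", "A2-A0D2"]
stripped = [strip_tcga_prefix(i) for i in ids]
```

Identifiers that already lack the prefix are returned unchanged, so the function can be applied more than once without harm.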

Directories

data

rna.csv: global transcriptomic data
rna_filtered.csv: transcriptomic data filtered for genes present in at least 90% of the samples
rna_pam50.csv: transcriptomic data for PAM50 genes
rna_protein_pam50_norm.csv: normalized transcriptomic and proteomic data for PAM50 genes and proteins
rna_pam50_mofa_lf6_n47.csv: transcriptomic data for the top 47 genes in latent factor 6 (LF6) of the Multi-Omics Factor Analysis (MOFA) model
protein.csv: global proteomic data
protein_filtered.csv: proteomic data filtered for proteins present in at least 90% of the samples
protein_pam50.csv: proteomic data for PAM50 proteins
mofa_trained_model.RData: MOFA model trained on the filtered transcriptomic and proteomic data (see the MOFA documentation for a complete guide to training a MOFA model)
samples.csv: vital status, PAM50 subtype, ER, PR, and HER2 marker status for each patient
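
The 90% presence filter behind rna_filtered.csv and protein_filtered.csv can be sketched as below. This is a hedged Python illustration rather than the repository's R code. It assumes that "present" means "not missing", that missing values are represented as `None`, and that the threshold is inclusive; the function name `filter_by_presence` and the toy table are hypothetical.

```python
from typing import Dict, List, Optional

Row = List[Optional[float]]


def filter_by_presence(table: Dict[str, Row], min_fraction: float = 0.9) -> Dict[str, Row]:
    """Keep features whose non-missing fraction across samples is >= min_fraction."""
    kept = {}
    for feature, values in table.items():
        present = sum(v is not None for v in values)
        if values and present / len(values) >= min_fraction:
            kept[feature] = values
    return kept


# Toy table: feature -> one value per sample; None marks a missing measurement.
toy = {
    "GENE_A": [1.0] * 10,              # 10/10 present, kept
    "GENE_B": [1.0] * 9 + [None],      # 9/10 present, exactly at the threshold, kept
    "GENE_C": [1.0] * 8 + [None] * 2,  # 8/10 present, dropped
}
filtered = filter_by_presence(toy)
```

With 77 patients, an inclusive 90% threshold would require a feature to be measured in at least 70 samples.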

src

heatmap.R: function for plotting heatmaps using the pheatmap package
hierarchicalclustering.R: functions for hierarchical clustering of transcriptomic and proteomic data
normalization.R: functions for row-median centering, log-transformation of transcriptomic data, and imputation of missing values in proteomic data
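
The three normalization steps named for normalization.R can be illustrated with a small Python sketch. It is not the repository's R implementation: the base-2 logarithm, the pseudocount of 1, and imputation by the row median are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from statistics import median
from typing import List, Optional


def log2_transform(values: List[float], pseudocount: float = 1.0) -> List[float]:
    """Compute log2(x + pseudocount); the pseudocount is an assumption that keeps zeros finite."""
    return [math.log2(v + pseudocount) for v in values]


def median_center(values: List[float]) -> List[float]:
    """Subtract the row median so that one gene or protein is centered at zero."""
    m = median(values)
    return [v - m for v in values]


def impute_with_row_median(values: List[Optional[float]]) -> List[float]:
    """Replace None with the median of the observed values (a placeholder strategy).

    Raises statistics.StatisticsError if the row has no observed values.
    """
    observed = [v for v in values if v is not None]
    m = median(observed)
    return [m if v is None else v for v in values]


# Toy rows: one gene across five samples and one protein across four samples.
rna_row = [0.0, 1.0, 3.0, 7.0, 15.0]
centered = median_center(log2_transform(rna_row))
protein_row = impute_with_row_median([0.5, None, -0.5, 1.5])
```

Centering each row on its median puts genes and proteins with very different abundance ranges on a comparable scale before clustering.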

output

output files from analysis

Analysis

hierarchicalclustering_analysis.R: hierarchical clustering of transcriptomic and proteomic data
hierarchicalclustering_quantify.R: quantitative analysis of heterogeneity in the clusters produced by hierarchicalclustering_analysis.R and stored in output/hierarchicalclustering_clusters.xlsx
clustering_results.xlsx: manual mapping of the cluster assignments in output/hierarchicalclustering_clusters.xlsx to PAM50 subtype names; includes columns for patient identifiers and the original PAM50 subtype assignment
mofa.R: MOFA of filtered transcriptomic and proteomic data
gsea.R: gene set enrichment analysis of top 47 genes in MOFA LF6
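
One simple way to quantify how mixed a cluster is, and to cross-check a manual cluster-to-subtype mapping, is to compute each cluster's majority PAM50 subtype and its purity. The Python sketch below is only an illustration of that idea. The metric actually used in hierarchicalclustering_quantify.R is not stated here, and the function name `cluster_purity` and the toy assignments are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def cluster_purity(
    clusters: List[int], subtypes: List[str]
) -> Tuple[Dict[int, str], Dict[int, float]]:
    """For each cluster, return its majority subtype and the fraction of members with it."""
    members = defaultdict(list)
    for cluster, subtype in zip(clusters, subtypes):
        members[cluster].append(subtype)
    majority, purity = {}, {}
    for cluster, labels in members.items():
        # On ties, Counter.most_common returns the label that was seen first.
        label, count = Counter(labels).most_common(1)[0]
        majority[cluster] = label
        purity[cluster] = count / len(labels)
    return majority, purity


# Toy assignments for six patients; these are not results from this project.
clusters = [1, 1, 1, 2, 2, 2]
subtypes = ["Basal-like", "Basal-like", "Luminal A", "Luminal B", "Luminal B", "Luminal B"]
majority, purity = cluster_purity(clusters, subtypes)
```

A purity of 1.0 means every patient in the cluster shares one PAM50 subtype; lower values indicate a more heterogeneous cluster.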

Acknowledgements

Vogel Lab (New York University) for sponsoring this project
Hyungwon Choi (National University of Singapore) for collaboration on this project

AmyLei96/unsupervised-learning-breast-cancer-subtypes
