# The Cancer Genome Atlas (TCGA) Breast Invasive Carcinoma (BRCA) Dataset

This dataset measures the genetic and biological effects of breast invasive carcinoma (BRCA) in a cohort of 1000+ patients.

For this project, we have taken a subset of this dataset containing both common and unique patients across three omics datasets:
 - Transcriptomics (mRNA)
 - Epigenetics (DNAm)
 - Proteomics (RPPA)

The prediction task in this project is tumour subtype classification. Outcomes for women with BRCA have been shown to vary significantly depending on the specific tumour subtype. Accurately stratifying patients by subtype is therefore an important characterisation for this cancer and will affect the treatment course decided by the physician.

The different subtypes present in this dataset are : 
- Luminal A (LumA)
- Luminal B (LumB)
- Basal
- HER2

Each of the three omic modalities captures a different aspect of the disease, so methods that can integrate them have become popular.

In this short notebook, we will look at the different data types and give some information on their biological aspects. 

In [None]:
import pickle
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from palettable import wesanderson as wes

data_dir = "./../data/TCGA-BRCA/"
mod = ["mRNA", "DNAm" , "RPPA"]

In [None]:
data = {}
for omic in mod:
    with open(f"{data_dir}{omic}.pkl", "rb") as f:  # 'rb' = read binary
        data[omic] = pickle.load(f)

## Transcriptomics (mRNA)

Transcriptomics looks at all RNA in a cell to see which genes are “on” and how strongly.

This gene expression dataset is a table of gene activity levels across breast cancer (BRCA) tissue samples.

Numbers come from RNA sequencing; higher values mean more of that gene’s RNA was detected.

Within a population, the expression of a given gene usually follows a similar distribution from person to person. When a person has a disease, however, the disease can cause a large change in a gene's expression, making it over- or under-expressed.
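One common way to make "over/under expressed" concrete is to z-score each gene across the cohort and flag values far from the gene's mean. The sketch below uses a small synthetic table as a stand-in for `data['mRNA']["expr"]`; the gene names, the patient IDs, the rows-are-patients layout, and the threshold of 3 are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data['mRNA']["expr"]: rows are patients, columns are genes.
rng = np.random.default_rng(0)
expr = pd.DataFrame(
    rng.normal(loc=5.0, scale=1.0, size=(50, 3)),
    index=[f"patient_{i}" for i in range(50)],
    columns=["GENE_A", "GENE_B", "GENE_C"],  # hypothetical gene names
)
# Make one patient strongly over-express GENE_A.
expr.loc["patient_0", "GENE_A"] = 15.0

# z-score each gene across the cohort: (value - gene mean) / gene std.
z = (expr - expr.mean(axis=0)) / expr.std(axis=0)

# Flag values more than `threshold` standard deviations from the gene's mean.
threshold = 3.0  # arbitrary cut-off chosen for illustration
flagged = z.abs() > threshold

# Number of flagged patients per gene.
print(flagged.sum(axis=0))
```

A positive z-score above the threshold suggests over-expression relative to the cohort, and a negative one suggests under-expression.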

In [None]:
data['mRNA']["expr"].head()

In [None]:
color = wes.Darjeeling2_5.mpl_colors[0]  # pick a color from the palette

gene = data['mRNA']["expr"].columns[1]

sns.set(style="whitegrid")
sns.kdeplot(data=data['mRNA']["expr"], x=f"{gene}", fill=True, color=color, alpha=0.85, linewidth=0)
plt.xlabel(f"{gene}")
plt.ylabel("Density")
plt.title(f"Distribution of {gene}")
plt.tight_layout()
plt.show()

## Epigenetics (DNAm)

Epigenetics studies chemical tags on DNA sites that control gene activity without changing the DNA sequence.

DNA methylation is one such tag: a methyl group added to the DNA, most commonly at CpG sites (a cytosine followed by a guanine), that often reduces gene activity.

A DNA methylation dataset measures how much methylation is present at many genomic sites across samples.

These tags are crucial in development and aging, for example in helping to stop us growing taller and taller indefinitely.

DNAm is a very useful measure of how we interact with our environment, as the number and location of specific chemical tags can tell us if, and how much, someone consumes alcohol, smokes, works with pesticides, or has been exposed to carcinogens.

In [None]:
data['DNAm']["expr"].head()

In [None]:
color = wes.Darjeeling2_5.mpl_colors[2]  # pick a color from the palette

cpg = data['DNAm']["expr"].columns[1]

sns.set(style="whitegrid")
sns.kdeplot(data=data['DNAm']["expr"], x=f"{cpg}", fill=True, color=color, alpha=0.85, linewidth=0)
plt.xlabel(f"{cpg}")
plt.ylabel("Density")
plt.title(f"Distribution of {cpg}")
plt.tight_layout()
plt.show()

## Proteomics (RPPA)

Proteomics studies all the proteins in a cell or tissue: what’s there and how much.

The proteomics dataset we are using is Reverse Phase Protein Array (RPPA). This dataset is a table of protein abundance levels across the tumour tissue samples.

It is measured by spotting sample lysates onto an array, tagging each target protein with a labelled antibody, and quantifying how much of the protein is present from the intensity of the resulting signal.

Proteomics is really useful for getting an accurate, pinpoint snapshot of the biology of the tumour in its measured state. The downside is that protein measurements are sparse across patient samples, with many missing values, as not every protein will be abundant in each patient. How you handle this artefact will be an important consideration in your analyses and could affect different models differently.
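As a starting point for handling missingness, the sketch below quantifies missing values per protein and per patient, drops very sparse proteins, and median-imputes the rest. It runs on a toy table standing in for `data['RPPA']["expr"]`; the protein names, patient IDs, the 50% threshold, and the choice of median imputation are assumptions for illustration, not the method used in this project.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data['RPPA']["expr"]: rows are patients, columns are proteins.
rppa = pd.DataFrame(
    {
        "PROT_A": [0.2, np.nan, 0.5, 0.1],  # hypothetical protein names
        "PROT_B": [np.nan, np.nan, np.nan, 1.0],
        "PROT_C": [0.3, 0.4, 0.2, 0.6],
    },
    index=["p1", "p2", "p3", "p4"],
)

# Fraction of missing values per protein (column) and per patient (row).
missing_per_protein = rppa.isna().mean(axis=0)
missing_per_patient = rppa.isna().mean(axis=1)

# Drop proteins missing in more than half of the patients (arbitrary threshold).
max_missing = 0.5
kept = rppa.loc[:, missing_per_protein <= max_missing]

# Fill the remaining gaps with each protein's median across patients.
imputed = kept.fillna(kept.median(axis=0))

print(missing_per_protein)
print(imputed)
```

When the imputed data feeds a classifier, the medians would normally be computed on the training split only, so that no information leaks from the test patients.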

In [None]:
data['RPPA']["expr"].head()

In [None]:
color = wes.Darjeeling2_5.mpl_colors[3]  # pick a color from the palette

prtn = data['RPPA']["expr"].columns[1]

sns.set(style="whitegrid")
sns.kdeplot(data=data['RPPA']["expr"], x=f"{prtn}", fill=True, color=color, alpha=0.85, linewidth=0)
plt.xlabel(f"{prtn}")
plt.ylabel("Density")
plt.title(f"Distribution of {prtn}")
plt.tight_layout()
plt.show()

## Next Steps 

1. Exploratory Data Analysis (EDA)
   - How many patients are common to each omic, pairs of omics, and across all omics?
   - Missingness in each omic?
     - Methods of imputation?
   - Patient outliers?
     
2. MOFA

3. IntegrAO

4. PNet
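
For the first exploration question in step 1 (patient overlap within each omic, between pairs of omics, and across all omics), a minimal sketch with set intersections could look like the following. The patient IDs are made up, and it assumes the rows of each `expr` table are indexed by patient ID, which this notebook does not confirm.

```python
from itertools import combinations

# Toy stand-in for the patient IDs of each omic.
# With the real data this might be set(data[omic]["expr"].index),
# assuming the rows of each table are indexed by patient ID.
patients = {
    "mRNA": {"p1", "p2", "p3", "p4"},
    "DNAm": {"p2", "p3", "p4", "p5"},
    "RPPA": {"p3", "p4", "p6"},
}

# Count shared patients for every single omic, pair of omics, and all three.
overlaps = {}
for k in range(1, len(patients) + 1):
    for combo in combinations(patients, k):
        shared = set.intersection(*(patients[omic] for omic in combo))
        overlaps[combo] = len(shared)

for combo, n_shared in overlaps.items():
    print(" & ".join(combo), n_shared)
```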