# The Cancer Genome Atlas (TCGA) Breast Invasive Carcinoma (BRCA) Dataset

This dataset profiles the genetic and biological characteristics of BRCA in a cohort of more than 1,000 patients.

For this project, we have subset this dataset into three omics datasets, which contain both common and unique patients:
 - Transcriptomics (mRNA)
 - Epigenetics (DNAm)
 - Proteomics (RPPA)

The prediction task in this project is tumour subtype classification. It has been shown that outcomes for women with BRCA vary significantly depending on the specific tumour subtype. Being able to accurately stratify patients by subtype is therefore an important characterisation for this cancer, and it will affect the treatment course decided by the physician.

The different subtypes present in this dataset are:
- Luminal A (LumA)
- Luminal B (LumB)
- Basal
- HER2

Each of these modalities captures a different aspect of the disease, so methods that can integrate them have become popular.
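
As a rough illustration of what "integrating" modalities can mean, the sketch below shows the simplest option, often called early integration: keep the patients present in every table and place the feature blocks side by side. The toy tables, patient IDs and feature names are invented for the example and are not taken from the TCGA data; dedicated integration methods are considerably more involved than this.

```python
import pandas as pd

# Toy stand-ins for two omics tables (rows = patients, columns = features).
# All IDs, feature names and values are made up for illustration.
mrna_toy = pd.DataFrame(
    {"gene_a": [1.0, 2.0, 3.0], "gene_b": [0.5, 0.1, 0.9]},
    index=["P1", "P2", "P3"],
)
rppa_toy = pd.DataFrame(
    {"protein_x": [0.2, -0.4, 1.1]},
    index=["P2", "P3", "P4"],
)

# Early integration: an inner join keeps only patients found in both tables,
# and the dict keys label which omic each feature block came from.
early = pd.concat({"mRNA": mrna_toy, "RPPA": rppa_toy}, axis=1, join="inner")

# Flatten the (omic, feature) column pairs into single readable names.
early.columns = [f"{omic}__{feat}" for omic, feat in early.columns]
print(early)
```

Only `P2` and `P3` survive the join here, which is why the patient overlap between omics matters before any integration is attempted.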

In this short notebook, we will look at the different data types and give some information on their biological aspects. 

In [None]:
import pickle
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from palettable import wesanderson as wes

data_dir = "./../data/TCGA-BRCA/"
mod = ["mRNA", "DNAm" , "RPPA"]

In [None]:
data = {}
for omic in mod:
    with open(f"{data_dir}{omic}.pkl", "rb") as f:  # 'rb' = read binary
        data[omic] = pickle.load(f)

In [None]:
# getting some statistics on mRNA data (first num_feat_for_display features)
# cannot display all features due to space constraints (there are tens of thousands of features)

df = data["mRNA"]["expr"]

num_feat_for_display = 5  # Number of features to display statistics for
summary = pd.DataFrame({
    "mean": df.iloc[:, :num_feat_for_display].mean(),
    "median": df.iloc[:, :num_feat_for_display].median(),
    "std": df.iloc[:, :num_feat_for_display].std(),
    "min": df.iloc[:, :num_feat_for_display].min(),
    "max": df.iloc[:, :num_feat_for_display].max(),
})
summary = summary.reset_index().rename(columns={"index": "feature_name"})
display(summary)
summary.to_latex("exports/tables/mRNA_feature_stats.tex", index=False, escape=True)



In [None]:
OMICS = ["mRNA", "DNAm", "RPPA"]

rows = []
for om in OMICS:
    X = data[om]["expr"]
    # number of non-numeric columns (expected to be 0 for clean omics tables)
    n_nonnum = len(X.select_dtypes(exclude="number").columns)
    miss_values = X.isna().mean().mean() * 100   # average % missing
    shape = X.shape
    rows.append([om, shape, n_nonnum, miss_values])

numericality_overview = pd.DataFrame(rows, columns=[
    "Omic", "features_shape", "#_non_numerical_features", "avg_missing_%"
])
display(numericality_overview)
numericality_overview.to_latex("exports/tables/total_stats.tex", index=False, escape=True)


In [None]:
import numpy as np
import matplotlib.pyplot as plt

subtype_order = ['Basal', 'Her2', 'LumA', 'LumB']
target_col = 'paper_BRCA_Subtype_PAM50'

fig, axes = plt.subplots(3, 1, figsize=(6, 6), sharex=False)

for ax, om in zip(axes, OMICS):
    y = data[om]["meta"][target_col]

    # counts per subtype in fixed order
    value_counts = y.value_counts()
    counts = np.array([value_counts.get(s, 0) for s in subtype_order])
    total = counts.sum() if counts.sum() > 0 else 1  # guard
    perc = counts / total * 100

    # horizontal bars
    ax.barh(subtype_order, counts, edgecolor="black", height=0.4)
    ax.set_title(om)
    ax.set_xlabel("Number of samples")

    ax.set_xlim(0, counts.max() * 1.3)

    # annotate with "count (xx.x%)"
    max_count = counts.max() if counts.max() > 0 else 1
    for i, (c, p) in enumerate(zip(counts, perc)):
        ax.text(c + max_count * 0.02, i, f"{c} ({p:.1f}%)", va="center")

plt.tight_layout()
plt.savefig("exports/figures/subtype_distribution_per_omic.png", dpi=300)
plt.show()


## Transcriptomics (mRNA)

Transcriptomics looks at all RNA in a cell to see which genes are “on” and how strongly.

This gene expression dataset is a table of gene activity levels across breast cancer (BRCA) tissue samples.

Numbers come from RNA sequencing; higher values mean more of that gene’s RNA was detected.

Across a population, the expression of a given gene usually follows a similar distribution, but a disease can cause a large change in a gene's expression, making it over- or under-expressed.
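
One simple way to make "over- or under-expressed" concrete is a per-gene z-score: how many standard deviations a sample sits from the cohort mean for that gene. The sketch below uses a made-up table and an arbitrary cut-off of 2; neither the values nor the threshold come from the TCGA data, and this is not a substitute for a proper differential expression analysis.

```python
import pandas as pd

# Toy expression table: rows = samples, columns = genes (values are made up).
expr_toy = pd.DataFrame(
    {
        "gene_a": [5.0, 5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 9.0],  # S8 is unusually high
        "gene_b": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.0],  # no unusual sample
    },
    index=[f"S{i}" for i in range(1, 9)],
)

# Per-gene z-score: subtract the column mean, divide by the column std.
z = (expr_toy - expr_toy.mean()) / expr_toy.std()

threshold = 2.0  # hypothetical cut-off, chosen only for this example
over = z > threshold     # candidate over-expression
under = z < -threshold   # candidate under-expression

print(z.round(2))
print("Over-expressed (sample, gene):", list(over.stack()[over.stack()].index))
```

With few samples a single extreme value also inflates the standard deviation, so robust alternatives (for example median-based scores) may be preferable on real data.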

In [None]:
data['mRNA']["expr"].head()

In [None]:
color = wes.Darjeeling2_5.mpl_colors[0]  # pick a color from the palette

gene = data['mRNA']["expr"].columns[1]

sns.set(style="whitegrid")
sns.kdeplot(data=data['mRNA']["expr"], x=f"{gene}", fill=True, color=color, alpha=0.85, linewidth=0)
plt.xlabel(f"{gene}")
plt.ylabel("Density"); 
plt.title(f"Distribution of {gene}")
plt.tight_layout()
plt.show()

## Epigenetics (DNAm)

Epigenetics studies chemical tags on DNA sites that control gene activity without changing the DNA sequence.

DNA methylation is one such tag: methyl groups are added at sites referred to as CpGs, which often reduces gene activity.

A DNA methylation dataset measures how much methylation is present at many genomic sites across samples.

These tags are crucial in development and ageing, for example in stopping us from growing taller and taller indefinitely.

DNAm is a very useful measure of how we interact with our environment, as the number and location of specific chemical tags can indicate whether and how much someone consumes alcohol, smokes, works with pesticides, or has been exposed to carcinogens.
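
"How much methylation is present" at a site is commonly summarised as a beta value between 0 (unmethylated) and 1 (fully methylated), computed from methylated and unmethylated signal intensities. This notebook does not state which scale the DNAm table uses, so the sketch below is only an assumption about a common convention; the intensities are invented, and the helper names `beta_value` and `m_value` are hypothetical.

```python
import numpy as np

def beta_value(meth, unmeth, offset=100.0):
    """Fraction of methylated signal; the offset stabilises low intensities."""
    meth = np.asarray(meth, dtype=float)
    unmeth = np.asarray(unmeth, dtype=float)
    return meth / (meth + unmeth + offset)

def m_value(beta, eps=1e-6):
    """Logit (base 2) of beta; clipping avoids log of 0 at the extremes."""
    beta = np.clip(np.asarray(beta, dtype=float), eps, 1.0 - eps)
    return np.log2(beta / (1.0 - beta))

# Two made-up CpG sites: one mostly methylated, one mostly unmethylated.
betas = beta_value([300.0, 50.0], [100.0, 850.0])
print("beta values:", betas)
print("M-values:   ", m_value(betas))
```

Beta values are easier to interpret biologically, while M-values are often preferred for statistical modelling because they are not bounded between 0 and 1.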

In [None]:
data['DNAm']["expr"].head()

In [None]:
color = wes.Darjeeling2_5.mpl_colors[2]  # pick a color from the palette

cpg = data['DNAm']["expr"].columns[1]

sns.set(style="whitegrid")
sns.kdeplot(data=data['DNAm']["expr"], x=f"{cpg}", fill=True, color=color, alpha=0.85, linewidth=0)
plt.xlabel(f"{cpg}")
plt.ylabel("Density"); 
plt.title(f"Distribution of {cpg}")
plt.tight_layout()
plt.show()

## Proteomics (RPPA)

Proteomics studies all the proteins in a cell or tissue—what’s there and how much.

The proteomics dataset we are using is Reverse Phase Protein Array (RPPA). This dataset is a table of protein abundance levels across the tumour tissue samples.

It is measured with imaging technologies: proteins in the samples are tagged with a chemical dye, and the amount of each protein present is quantified from the intensity of the resulting signal.

Proteomics is very useful for getting an accurate snapshot of the biology of the tumour in its measured state. The downside is that protein measurements are sparse across patient samples, with many missing values, as not every protein will be abundant in each patient. How you handle this artefact will be an important consideration in your analyses and could affect different models differently.
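
One possible way to handle the missing values described above is to drop proteins that are missing in most patients and then fill the remaining gaps with the per-protein median. The sketch below runs on a made-up table; the 50% cut-off is a hypothetical choice for illustration, not a recommendation from the dataset authors.

```python
import numpy as np
import pandas as pd

# Toy protein table with gaps: rows = patients, columns = proteins (made up).
rppa_toy = pd.DataFrame(
    {
        "protein_x": [0.2, np.nan, 0.4, 0.6],     # 25% missing
        "protein_y": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
        "protein_z": [1.0, 2.0, 3.0, 4.0],        # complete
    },
    index=["P1", "P2", "P3", "P4"],
)

max_missing = 0.5  # hypothetical cut-off: drop features missing in >50% of patients

# Fraction of missing values per protein.
missing_frac = rppa_toy.isna().mean()

# Step 1: drop proteins that are too sparse to impute sensibly.
kept = rppa_toy.loc[:, missing_frac <= max_missing]

# Step 2: fill the remaining gaps with each protein's median.
imputed = kept.fillna(kept.median())

print(missing_frac)
print(imputed)
```

When a model is evaluated on held-out patients, the medians should be computed on the training patients only and then reused for the test patients, otherwise information leaks between the splits.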

In [None]:
data['RPPA']["expr"].head()

In [None]:
color = wes.Darjeeling2_5.mpl_colors[3]  # pick a color from the palette

prtn = data['RPPA']["expr"].columns[1]

sns.set(style="whitegrid")
sns.kdeplot(data=data['RPPA']["expr"], x=f"{prtn}", fill=True, color=color, alpha=0.85, linewidth=0)
plt.xlabel(f"{prtn}")
plt.ylabel("Density"); 
plt.title(f"Distribution of {prtn}")
plt.tight_layout()
plt.show()

In [None]:

print(data['mRNA']['expr'].shape)
print(data['DNAm']['expr'].shape)
print(data['RPPA']['expr'].shape)  

print(data['mRNA']['meta'].shape)
print(data['DNAm']['meta'].shape)
print(data['RPPA']['meta'].shape)   


data['mRNA']['meta'].head()



In [None]:
data['DNAm']['meta'].head()

In [None]:
data['RPPA']['meta'].head()

## Next Steps 

1. Exploratory Data Analysis (EDA)
   - How many patients are common to each omic, pairs of omics, and across all omics?
   - Missingness in each omic?
     - Methods of imputation?
   - Patient outliers?
     
2. MOFA

3. IntegrAO

4. PNet

In [None]:
import itertools
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


for om in OMICS:
    # single quotes inside the f-string: nested double quotes are a syntax error before Python 3.12
    print(f"Shape of {om}: {data[om]['expr'].shape}")

print("\n")

for om in OMICS:
    expr_idx = data[om]["expr"].index
    expr_unique = expr_idx.is_unique
    print(f"Uniqueness of patient IDs in {om} expression data: {expr_unique}")

print("\n")

# Build per-omic patient sets using expr indices (they are the same in meta)
patient_sets = {}
for om in OMICS:
    idx = data[om]["expr"].index
    patient_sets[om] = set(idx)


# Build intersections table: singles, pairs, all three
rows = []
# singles
for om in OMICS:
    rows.append((" & ".join([om]), len(patient_sets[om])))
# pairs
for a, b in itertools.combinations(OMICS, 2):
    inter = patient_sets[a].intersection(patient_sets[b])
    rows.append((" & ".join([a, b]), len(inter)))
# all three
all_three = set.intersection(*[patient_sets[om] for om in OMICS])
rows.append((" & ".join(OMICS), len(all_three)))

overlap_df = pd.DataFrame(rows, columns=["Set", "Count (Overlapped Indices)"]).sort_values("Count (Overlapped Indices)", ascending=False)
display(overlap_df)
overlap_df.to_latex("exports/tables/patient_overlap_table.tex", index=False, escape=True)


overlap_patients = {
    "mRNA": patient_sets["mRNA"],
    "DNAm": patient_sets["DNAm"],
    "RPPA": patient_sets["RPPA"],
    "mRNA & DNAm": patient_sets["mRNA"].intersection(patient_sets["DNAm"]),
    "mRNA & RPPA": patient_sets["mRNA"].intersection(patient_sets["RPPA"]),
    "DNAm & RPPA": patient_sets["DNAm"].intersection(patient_sets["RPPA"]),
    "mRNA & DNAm & RPPA": all_three
}


## Mutual information

In [None]:
from sklearn.feature_selection import mutual_info_classif 

fig, axes = plt.subplots(3, 1, figsize=(6, 8))

mi_values = {}
for ax, om in zip(axes, OMICS):
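    X = data[om]["expr"]
    y = data[om]["meta"][target_col]

    # guard against NaNs, which mutual_info_classif rejects: impute with the
    # per-feature median (same strategy as in the outlier analysis)
    X_imp = X.fillna(X.median())
    mi = mutual_info_classif(X_imp, y, discrete_features=False, random_state=42)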
    mi_series = pd.Series(mi, index=X.columns).sort_values(ascending=False)

    mi_values[om] = mi_series

    ax.hist(mi_series, bins=30, edgecolor="black")
    ax.set_title(f"{om} (n_features={X.shape[1]})")
    ax.set_xlabel("Mutual information")
    ax.set_ylabel("Number of features")

plt.tight_layout()
plt.savefig("exports/figures/mutual_information_histograms.png", dpi=300)
plt.show()

In [None]:
mi_values['mRNA'].head(10)

## Outliers analysis

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="whitegrid", context="notebook")

target_col = "paper_BRCA_Subtype_PAM50"  # already used above
OMICS = ["mRNA", "DNAm", "RPPA"]

fig, axes = plt.subplots(3, 1, figsize=(6, 9))

outlier_results = {}  # store per-omic results in case we want to inspect later

for ax, om in zip(axes, OMICS):
    # 1) get data
    X = data[om]["expr"]
    y = data[om]["meta"][target_col]

    # 2) simple median imputation (per feature)
    X_imp = X.fillna(X.median())

    # 3) standardize features
    scaler = StandardScaler()
    Z = pd.DataFrame(
        scaler.fit_transform(X_imp),
        index=X.index,
        columns=X.columns
    )

    # 4) PCA to denoise / reduce dimension
    n_pcs = min(10, Z.shape[1], Z.shape[0] - 1)  # safe upper bound
    pca = PCA(n_components=n_pcs, random_state=42)
    Z_pca = pca.fit_transform(Z.values)

    # for plotting we just use first 2 PCs
    pcs2 = Z_pca[:, :2]
    pcs2_df = pd.DataFrame(pcs2, index=Z.index, columns=["PC1", "PC2"])

    # 5) Isolation Forest on PCA scores (unsupervised outlier detection)
    iso = IsolationForest(
        random_state=42,
        contamination=0.025  # ~2.5% of samples flagged as outliers
    )
    iso.fit(Z_pca)
    is_outlier = iso.predict(Z_pca) == -1  # -1 = outlier, 1 = inlier
    outlier_score = -iso.score_samples(Z_pca)  # larger = more outlying

    # 6) collect results
    res_df = pcs2_df.copy()
    res_df["outlier_score"] = outlier_score
    res_df["is_outlier"] = is_outlier
    res_df["subtype"] = y.values
    outlier_results[om] = res_df

    # 7) plot PC1 vs PC2, highlight outliers
    sns.scatterplot(
        data=res_df,
        x="PC1",
        y="PC2",
        hue="is_outlier",
        style="is_outlier",
        palette={False: "C0", True: "C3"},
        s=40,
        ax=ax
    )
    ax.set_title(f"{om}: PCA with IsolationForest outliers")
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.legend(title="Outlier", loc="best")

plt.tight_layout()
plt.show()


In [None]:
df = pd.DataFrame({
    "mRNA": outlier_results["mRNA"]["outlier_score"],
    "DNAm": outlier_results["DNAm"]["outlier_score"],
    "RPPA": outlier_results["RPPA"]["outlier_score"]
})
df_long = df.melt(var_name="omic", value_name="score")

sns.boxplot(data=df_long, x="omic", y="score")
plt.xlabel("Omic Type")
plt.ylabel("Outlier Score")
plt.title("Outlier Score Distribution Across Omics")
plt.savefig("exports/figures/outlier_score_boxplot.png", dpi=300)
plt.show()
