# Breast Cancer Proteomes - Multioutput model

The dataset contains published iTRAQ proteome profiles of 77 breast cancer samples generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH). It contains expression values for ~12,000 proteins per sample; a value is missing when a given protein could not be quantified in that sample.

###### AIM:
To build an ML model that predicts all of the endpoints: AJCC stage, metastasis, tumor stage, and PAM50 mRNA type.

In [28]:
import optuna
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier
from sklearn.multioutput import MultiOutputClassifier
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, classification_report, accuracy_score, mean_squared_error

Prediction of PAM50 mRNA

PAM50 is a gene expression-based assay used for molecular profiling of breast cancer, derived from the intrinsic-subtype classification originally described by Perou et al.  
It is a molecular diagnostic tool that classifies breast cancer into intrinsic subtypes based on the expression patterns of 50 genes, measured in parallel with 8 housekeeping genes.

0 : Basal  
1 : Luminal A  
2 : Luminal B  
3 : HER2  
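
The integer codes above can be mapped back to subtype names when reporting predictions. A minimal sketch; the names `PAM50_LABELS` and `decode_pam50` are hypothetical and do not appear in the original notebook:

```python
# Hypothetical lookup table built from the code list above.
PAM50_LABELS = {0: "Basal", 1: "Luminal A", 2: "Luminal B", 3: "HER2"}

def decode_pam50(codes):
    """Translate an iterable of integer class codes into PAM50 subtype names."""
    return [PAM50_LABELS[int(code)] for code in codes]

decoded = decode_pam50([2, 0, 3, 1])
print(decoded)
```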

In [145]:
final = pd.read_csv("final_data.csv")

In [146]:
# Remove the sample id and the targets that are not modelled here
final.drop(['RefSeq_accession_number', 'Metastasis-Coded', 'AJCC Stage', 'ER Status', 'PR Status', 'HER2 Final Status'], axis=1, inplace=True)

In [147]:
feature_tumor = pd.read_csv("Tumor_feature.txt")
feature_pam = pd.read_csv("PAM50_feature.txt")

In [148]:
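# Union of the two selected-feature lists; sorted so the column order is reproducible across runs
selected_columns = sorted(set(feature_tumor["protein"].to_list() + feature_pam["protein"].to_list()))
final2 = final[selected_columns]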

In [149]:
final2 = final2.drop(['Tumor', 'PAM50 mRNA'], axis=1)  # keep only predictors; the two targets are taken from `final` later

In [150]:
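# Scale every feature to [0, 1].
# Note: the scaler is fit on all samples before the train/test split, so test-set minima and maxima leak into training.
scaler = MinMaxScaler()
final2_scale = pd.DataFrame(scaler.fit_transform(final2), columns=final2.columns)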

In [151]:
X = final2_scale
y = final[['PAM50 mRNA','Tumor']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

xgb = XGBClassifier(n_estimators=100)

# Train a multi-output classifier with XGBoost
classifier = MultiOutputClassifier(xgb)
classifier.fit(X_train, y_train)
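
Because the scaler above is fit before the split, the held-out samples contribute to the scaling parameters. One possible way to avoid this is to split first and fit the scaler on the training rows only. The sketch below uses synthetic random data as a stand-in for the protein matrix; `X_demo`, `y_demo`, and `scaler_demo` are illustrative names, not part of the original notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix (77 samples, 5 features); not real data.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(77, 5))
y_demo = rng.integers(0, 4, size=77)

# Split first, so the held-out rows never influence the scaling parameters.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

scaler_demo = MinMaxScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn min/max from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # reuse those parameters on the test rows

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering, scaled test values can fall slightly outside [0, 1], which is expected: the test rows are transformed with the training minima and maxima.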

In [152]:
predictions = classifier.predict(X_test)
predictions = pd.DataFrame(predictions,columns=["PAM50",'Tumor'])

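# Note: MSE treats the class codes as numbers, so it is not a meaningful score for categorical targets; kept for reference only
mse = mean_squared_error(y_test, predictions)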
print(f'Mean Squared Error: {mse}')

Mean Squared Error: 1.1041666666666667
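
For a nominal target such as the PAM50 subtype, the integer codes are arbitrary, so the mean squared error changes when the same classes are simply renumbered, while accuracy does not. The sketch below illustrates this with toy labels that are purely illustrative and not taken from the dataset; the helpers `mse` and `accuracy` are defined here for the example:

```python
import numpy as np

# Toy labels, purely illustrative (not taken from the dataset).
y_true = np.array([0, 1, 2, 3, 3, 2])
y_pred = np.array([0, 3, 2, 1, 3, 2])

# Renumber the same four classes: 0 -> 0, 1 -> 3, 2 -> 1, 3 -> 2.
recode = np.array([0, 3, 1, 2])

def mse(a, b):
    """Mean squared difference between two integer label arrays."""
    return float(np.mean((a - b) ** 2))

def accuracy(a, b):
    """Fraction of positions where the two label arrays agree."""
    return float(np.mean(a == b))

mse_original = mse(y_true, y_pred)
mse_recoded = mse(recode[y_true], recode[y_pred])
acc_original = accuracy(y_true, y_pred)
acc_recoded = accuracy(recode[y_true], recode[y_pred])

print("MSE:", mse_original, "->", mse_recoded)
print("Accuracy:", acc_original, "->", acc_recoded)
```

The predictions are identical in both codings; only the numbering of the classes differs, yet the MSE changes. This is why the per-target accuracy and confusion matrices are the more informative summaries here.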


In [153]:
preds = predictions["PAM50"]
acc_xgb = accuracy_score(y_test["PAM50 mRNA"], preds) * 100  # percentage of correct predictions

print("XGBoost's prediction accuracy is: %3.2f" % (acc_xgb))
# With average='micro' on a single-label multiclass target, precision, recall and F1 all equal the accuracy
print("Precision:", precision_score(y_test["PAM50 mRNA"], preds, average='micro'))
print("Recall:", recall_score(y_test["PAM50 mRNA"], preds, average='micro'))
print("F1-Score:", f1_score(y_test["PAM50 mRNA"], preds, average='micro'))
# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test["PAM50 mRNA"], preds))

XGBoost's prediction accuracy is: 54.17
Precision: 0.5416666666666666
Recall: 0.5416666666666666
F1-Score: 0.5416666666666666
[[2 0 0 1]
 [1 0 1 1]
 [2 0 7 0]
 [0 1 4 4]]
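
The three micro-averaged scores above are identical because, for a single-label multiclass problem, micro-averaging reduces to overall accuracy. Per-class and macro-averaged values can be recovered directly from the printed confusion matrix. A minimal sketch, assuming scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative:

```python
import numpy as np

# PAM50 confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm_pam50 = np.array([
    [2, 0, 0, 1],
    [1, 0, 1, 1],
    [2, 0, 7, 0],
    [0, 1, 4, 4],
])

true_positives = np.diag(cm_pam50)
recall_per_class = true_positives / cm_pam50.sum(axis=1)
# np.maximum guards against division by zero if a class is never predicted.
precision_per_class = true_positives / np.maximum(cm_pam50.sum(axis=0), 1)

micro_accuracy = true_positives.sum() / cm_pam50.sum()
macro_recall = recall_per_class.mean()
macro_precision = precision_per_class.mean()

print("Per-class recall:   ", np.round(recall_per_class, 3))
print("Per-class precision:", np.round(precision_per_class, 3))
print("Micro (accuracy): %.4f" % micro_accuracy)
print("Macro recall:     %.4f" % macro_recall)
print("Macro precision:  %.4f" % macro_precision)
```

On this matrix the macro recall is about 0.47 and the macro precision about 0.41, both lower than the 0.54 micro score, and class 1 (Luminal A) has a recall of 0: none of its three test samples was classified correctly.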


In [154]:
preds = predictions["Tumor"]
acc_xgb = accuracy_score(y_test["Tumor"], preds) * 100  # percentage of correct predictions

print("XGBoost's prediction accuracy is: %3.2f" % (acc_xgb))
# With average='micro' on a single-label multiclass target, precision, recall and F1 all equal the accuracy
print("Precision:", precision_score(y_test["Tumor"], preds, average='micro'))
print("Recall:", recall_score(y_test["Tumor"], preds, average='micro'))
print("F1-Score:", f1_score(y_test["Tumor"], preds, average='micro'))
# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test["Tumor"], preds))

XGBoost's prediction accuracy is: 58.33
Precision: 0.5833333333333334
Recall: 0.5833333333333334
F1-Score: 0.5833333333333334
[[ 0  2  1  0]
 [ 0 14  1  0]
 [ 0  3  0  0]
 [ 0  3  0  0]]
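
The tumor-stage confusion matrix is dominated by a single column, so it helps to compare the accuracy against a majority-class baseline. The sketch below does this from the printed matrix alone, again assuming rows are true classes and columns are predicted classes; the variable names are illustrative:

```python
import numpy as np

# Tumor-stage confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm_tumor = np.array([
    [0, 2, 1, 0],
    [0, 14, 1, 0],
    [0, 3, 0, 0],
    [0, 3, 0, 0],
])

class_support = cm_tumor.sum(axis=1)     # true samples per class in the test set
predicted_counts = cm_tumor.sum(axis=0)  # number of predictions per class

model_accuracy = np.trace(cm_tumor) / cm_tumor.sum()
# Accuracy of always predicting the most frequent true class.
majority_baseline = class_support.max() / cm_tumor.sum()

print("Predictions per class:", predicted_counts)
print("Model accuracy:    %.4f" % model_accuracy)
print("Majority baseline: %.4f" % majority_baseline)
```

Here 22 of the 24 test predictions fall into the second class, and the model's accuracy (14/24) is below the majority-class baseline (15/24), which suggests the classifier has learned little beyond the class imbalance for this target.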
