## Machine learning classification of TCGA-BRCA gene expression data 

* This notebook applies machine learning classification methods to the Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) project. The data are available at https://portal.gdc.cancer.gov/projects/TCGA-BRCA.
* The dataset consists of gene expression quantification data (transcriptomic profiles as counts) from a broad sampling of 1084 breast invasive carcinomas, currently from 1084 patients.
* Logistic Regression, Random Forest, and Support Vector Machine classifiers are trained to predict the sample_type of each sample: "Primary Tumor", "Metastatic", or "Solid Tissue Normal".

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Loading data
data = pd.read_csv("C:/Users/amade/Documents/UofSC/Bioinformatics/Github/Machine learning in genomics/tcga_brca_gene_expression.csv")
print(data.head())  # Printing the first five rows
print(data.shape)  # Printing the shape of the dataframe (rows, columns)

                     Unnamed: 0  ENSG00000000003.15  ENSG00000000005.6  \
0  TCGA-D8-A146-01A-31R-A115-07                3414                210   
1  TCGA-AQ-A0Y5-01A-11R-A14M-07                 879                  9   
2  TCGA-C8-A274-01A-11R-A16F-07                8917                  0   
3  TCGA-BH-A0BD-01A-11R-A034-07                2071                102   
4  TCGA-B6-A1KC-01B-11R-A157-07                2047                 13   

   ENSG00000000419.13  ENSG00000000457.14  ENSG00000000460.17  \
0                2108                2100                 560   
1                2623                1727                 421   
2                2908                4764                2010   
3                1784                2336                1735   
4                2155                1855                 749   

   ENSG00000000938.13  ENSG00000000971.16  ENSG00000001036.14  \
0                 562                3741                2054   
1                 218             

In [3]:
# Checking for missing values in the dataframe (N/A)
print(data.isnull().sum())
# Dropping rows with missing values
data1 = data.dropna()   
print(data1.shape)   # No missing values (N/A)

Unnamed: 0            0
ENSG00000000003.15    0
ENSG00000000005.6     0
ENSG00000000419.13    0
ENSG00000000457.14    0
                     ..
ENSG00000187017.17    0
ENSG00000187021.15    0
ENSG00000187024.15    0
ENSG00000187026.3     0
ENSG00000187033.9     0
Length: 16384, dtype: int64
(1231, 16384)


In [6]:
# To train the machine learning models we need numerical values
# Dropping the first column, which holds the sample IDs (TCGA barcodes), not expression values
data1 = data1.iloc[:, 1:]

In [9]:
# Preparing the inputs for the models
# Splitting the data into features (X) and target (y)
X = data1.drop(columns=['sample_type'])  # All the columns except the sample type
y = data1['sample_type']   # The sample type column
#print(y)  # Uncomment to print the sample types: Primary Tumor, Metastatic, Solid Tissue Normal

In [22]:
# Composition of the sample types:
print(np.sum(y == "Primary Tumor"))  # 1111 samples
print(np.sum(y == "Solid Tissue Normal"))  # 113 samples
print(np.sum(y == "Metastatic"))  # 7 samples

1111
113
7


In [23]:
# Splitting data into training and testing sets.
# The split sizes were chosen based on the size of the dataframe and of each target class.
# The testing set is 30% of the data; random_state = 72 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 72)
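
With only 7 metastatic samples, a purely random split can distribute the rare class unevenly: the reports further down show 4 metastatic samples in the test set, which leaves 3 for training. One possible alternative, not used in this notebook, is a stratified split. The sketch below runs on synthetic stand-in data that only mimics the class counts printed above; `X_demo`, `y_demo`, and the other names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the class counts reported above (NOT the real expression data).
rng = np.random.default_rng(72)
y_demo = np.array(
    ["Primary Tumor"] * 1111 + ["Solid Tissue Normal"] * 113 + ["Metastatic"] * 7
)
X_demo = rng.normal(size=(y_demo.size, 5))

# stratify=y_demo keeps the class proportions approximately equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=72, stratify=y_demo
)

# Print the per-class counts in the training and testing sets.
for label in np.unique(y_demo):
    print(label, np.sum(y_tr == label), np.sum(y_te == label))
```

Adopting this in the notebook would change which samples fall into the test set, so the reported metrics would no longer match.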

In [24]:
# Standardizing features
scaler = StandardScaler()  # Removes the mean and scales each feature to unit variance.
X_train = scaler.fit_transform(X_train)   # Fitting the scaler on the training set and transforming it.
X_test = scaler.transform(X_test)  # Transforming the testing set with the already fitted scaler
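
Fitting the scaler on the training set only, as done above, keeps information from the test set out of the preprocessing. The same pattern can be bundled with a model in a scikit-learn `Pipeline`, so the scaler is always fitted together with the classifier. This is a minimal sketch on synthetic data; the names `X_demo`, `y_demo`, and `pipe` are hypothetical, and `max_iter=1000` is an arbitrary choice for the demo.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data; the label depends on the first feature only.
rng = np.random.default_rng(72)
X_demo = rng.normal(loc=100.0, scale=25.0, size=(200, 10))
y_demo = (X_demo[:, 0] > 100.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=72, stratify=y_demo
)

# The pipeline fits the scaler on the training data only, then fits the model.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=72))
pipe.fit(X_tr, y_tr)

# The scaler's learned means equal the training-set means, not the full-data means.
scaler_demo = pipe.named_steps["standardscaler"]
print(np.allclose(scaler_demo.mean_, X_tr.mean(axis=0)))

# At prediction time the test set is scaled with the training statistics.
print(pipe.score(X_te, y_te))
```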

In [27]:
# Initializing logistic regression, random forest and support vector machine models. 
# Using same seed for reproducibility 
logistic_regression = LogisticRegression(random_state = 72)
random_forest = RandomForestClassifier(random_state = 72)
svm = SVC(random_state = 72)

In [28]:
# Training: fitting each model on the training set created above
logistic_regression.fit(X_train, y_train)
random_forest.fit(X_train, y_train)
svm.fit(X_train, y_train)
print("It is ready")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


It is ready
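
The warning above comes from the Logistic Regression solver stopping at its iteration limit before converging. As the message suggests, one option is to raise `max_iter`. The sketch below reproduces the effect on synthetic data; the helper `convergence_check` and the values 2 and 5000 are hypothetical choices for illustration, not settings tuned for the TCGA-BRCA data.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data with a noisy linear signal.
rng = np.random.default_rng(72)
X_demo = rng.normal(size=(200, 50))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)


def convergence_check(max_iter):
    """Fit a model and return (number of ConvergenceWarnings, iterations used)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = LogisticRegression(max_iter=max_iter, random_state=72)
        model.fit(X_demo, y_demo)
    n_warnings = sum(issubclass(w.category, ConvergenceWarning) for w in caught)
    return n_warnings, int(model.n_iter_[0])


print(convergence_check(2))     # very low limit: the solver stops early
print(convergence_check(5000))  # generous limit: the solver can converge
```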


In [29]:
# Making predictions on the testing data (30% of the samples).
y_pred_lr = logistic_regression.predict(X_test)
y_pred_rf = random_forest.predict(X_test)
y_pred_svm = svm.predict(X_test)

In [30]:
# Model evaluation of performance. 
# Logistic regression performance
print("Logistic Regression:")
print(classification_report(y_test, y_pred_lr))  # Printing precision, recall, f1-score, support for the three sample types
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_lr))

Logistic Regression:
                     precision    recall  f1-score   support

         Metastatic       0.00      0.00      0.00         4
      Primary Tumor       0.99      0.94      0.96       335
Solid Tissue Normal       0.66      1.00      0.79        31

           accuracy                           0.94       370
          macro avg       0.55      0.65      0.59       370
       weighted avg       0.95      0.94      0.94       370

Confusion Matrix:
[[  0   4   0]
 [  3 316  16]
 [  0   0  31]]


#### Interpretation of the Logistic Regression model
* Metastatic: Logistic Regression did not classify any of the 4 metastatic test samples correctly (precision and recall of 0.00). All 4 were predicted as primary tumor, and the 3 samples it did label as metastatic were actually primary tumors. With only 7 metastatic samples in the whole dataset, the model has very little to learn from.
* Primary Tumor: precision is 0.99 (99% of the samples predicted as primary tumor were actual primary tumors), recall is 0.94 (94% of the 335 actual primary tumor samples in the test set were identified), and the f1-score is 0.96.
* Solid Tissue Normal: precision is 0.66 (16 primary tumor samples were misclassified as normal), recall is 1.00 (all 31 actual solid tissue normal samples in the test set were identified), and the f1-score is 0.79.
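
The per-class numbers in the report can be checked by hand from the confusion matrix, where rows are the actual classes and columns are the predicted classes. The sketch below recomputes precision, recall, and f1-score for the Logistic Regression matrix printed above; the variable names are illustrative.

```python
import numpy as np

labels = ["Metastatic", "Primary Tumor", "Solid Tissue Normal"]

# Logistic Regression confusion matrix from the output above
# (rows = actual class, columns = predicted class).
cm = np.array([[0, 4, 0],
               [3, 316, 16],
               [0, 0, 31]])

tp = np.diag(cm)            # correctly classified samples per class
predicted = cm.sum(axis=0)  # column sums: samples predicted as each class
actual = cm.sum(axis=1)     # row sums: actual samples of each class (support)

# Precision = TP / predicted, recall = TP / actual; 0 where the denominator is 0.
precision = np.divide(tp, predicted, out=np.zeros(len(labels)), where=predicted > 0)
recall = np.divide(tp, actual, out=np.zeros(len(labels)), where=actual > 0)

# f1-score is the harmonic mean of precision and recall.
denom = precision + recall
f1 = np.divide(2 * precision * recall, denom, out=np.zeros(len(labels)), where=denom > 0)

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

For example, precision for Solid Tissue Normal is 31 / (16 + 31), which rounds to the 0.66 shown in the report.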

In [31]:
# Random forest
print("\nRandom Forest:")
print(classification_report(y_test, y_pred_rf))  # Printing precision, recall, f1-score, support for the three sample types
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))


Random Forest:
                     precision    recall  f1-score   support

         Metastatic       0.00      0.00      0.00         4
      Primary Tumor       0.98      0.99      0.99       335
Solid Tissue Normal       0.91      0.94      0.92        31

           accuracy                           0.98       370
          macro avg       0.63      0.64      0.64       370
       weighted avg       0.97      0.98      0.97       370

Confusion Matrix:
[[  0   4   0]
 [  0 332   3]
 [  0   2  29]]




#### Interpretation of the Random Forest model
* Metastatic: Random Forest never predicted the metastatic class. All 4 metastatic test samples were classified as primary tumor, so recall is 0.00 and precision is undefined (reported as 0.00). The most likely cause is the very small number of metastatic samples.
* Primary Tumor: precision is 0.98 (98% of the samples predicted as primary tumor were actual primary tumors), recall is 0.99 (332 of the 335 actual primary tumor samples in the test set were identified), and the f1-score is 0.99.
* Solid Tissue Normal: precision is 0.91, recall is 0.94 (29 of the 31 actual solid tissue normal samples in the test set were identified), and the f1-score is 0.92.

In [32]:
# Support Vector Machine
print("\nSupport Vector Machine:")
print(classification_report(y_test, y_pred_svm))  # Printing precision, recall, f1-score, support for the three sample types
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_svm))


Support Vector Machine:
                     precision    recall  f1-score   support

         Metastatic       0.00      0.00      0.00         4
      Primary Tumor       0.97      0.99      0.98       335
Solid Tissue Normal       0.93      0.84      0.88        31

           accuracy                           0.97       370
          macro avg       0.63      0.61      0.62       370
       weighted avg       0.96      0.97      0.96       370

Confusion Matrix:
[[  0   4   0]
 [  0 333   2]
 [  0   5  26]]




#### Interpretation of the Support Vector Machine model
* Metastatic: Support Vector Machine never predicted the metastatic class. All 4 metastatic test samples were classified as primary tumor, so recall is 0.00 and precision is undefined (reported as 0.00). The most likely cause is the very small number of metastatic samples.
* Primary Tumor: precision is 0.97 (97% of the samples predicted as primary tumor were actual primary tumors), recall is 0.99 (333 of the 335 actual primary tumor samples in the test set were identified), and the f1-score is 0.98.
* Solid Tissue Normal: precision is 0.93, recall is 0.84 (26 of the 31 actual solid tissue normal samples in the test set were identified), and the f1-score is 0.88.

### Overall conclusion 
* All three models perform well for the primary tumor and solid tissue normal classes, with Random Forest and SVM performing better than Logistic Regression (f1-scores of 0.92 and 0.88 versus 0.79 on solid tissue normal).
* None of the models correctly identifies any metastatic sample, probably because of the very small number of samples of this type (7 in total, 4 of them in the test set). Adding more metastatic samples or augmenting the existing ones could improve this.
* Random Forest and SVM have the highest overall accuracy (0.98 and 0.97, versus 0.94 for Logistic Regression), making them the preferred models for this dataset.
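
Besides collecting or augmenting data, a different and commonly used option for imbalanced classes is class weighting, which this notebook does not apply. The sketch below shows the weights that scikit-learn's `"balanced"` mode would assign for the class counts printed earlier, and how the option could be passed to the three classifiers. The weights here are computed from the full-dataset counts for illustration only; in practice they would be derived from the training set, and whether they help on this data is untested.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Labels rebuilt from the class counts reported above (no expression data needed).
y_counts = np.array(
    ["Primary Tumor"] * 1111 + ["Solid Tissue Normal"] * 113 + ["Metastatic"] * 7
)
classes = np.unique(y_counts)  # sorted alphabetically

# "balanced" weight for a class = n_samples / (n_classes * samples in that class)
weights = compute_class_weight("balanced", classes=classes, y=y_counts)
for label, weight in zip(classes, weights):
    print(f"{label}: {weight:.2f}")

# All three estimators accept class_weight; these models are only constructed, not fitted.
weighted_models = {
    "Logistic Regression": LogisticRegression(class_weight="balanced", random_state=72),
    "Random Forest": RandomForestClassifier(class_weight="balanced", random_state=72),
    "Support Vector Machine": SVC(class_weight="balanced", random_state=72),
}
```

With these counts, an error on a metastatic sample would weigh roughly 160 times more than an error on a primary tumor sample during training.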