## Assignment 3: Breast Cancer Stage Classification

Breast cancer (BRCA) is the most common cancer in women. Identifying the cancer stage is an important step toward improving the survival rate of BRCA patients, because the stage determines the treatment strategy. Here we train models to classify the cancer stage from RNA-seq data of patient samples.

Tasks:
1. Prepare a dataset using TCGA-BRCA RNA-Seq data as features and cancer stages as labels. (Hint: the processed RNA-Seq data and patient phenotype data are available from UCSC Xena.)
2. Apply data processing methods (normalization, training-test split, etc.).
3. Apply three different classification estimators and optimize their parameters through cross-validation.
4. Compare the three estimators by evaluating their performance on the test dataset.
5. Apply feature selection to improve performance.
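
Tasks 2 to 4 can be sketched as one loop: wrap each estimator in a pipeline with a scaler, tune it with cross-validation on the training split, and score it once on the held-out test split. The block below is only a sketch on synthetic data. `make_classification` stands in for the expression matrix, and the three estimators and their small grids are illustrative assumptions, not the choices made later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the expression matrix (assumption: not TCGA data).
X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# One (estimator, parameter grid) pair per candidate; the grids are illustrative.
candidates = {
    "svm": (SVC(), {"clf__C": [0.1, 1, 10]}),
    "logreg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1, 10]}),
    "rf": (RandomForestClassifier(random_state=0), {"clf__max_depth": [3, None]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # The scaler is refitted inside every CV fold, so validation folds stay unseen.
    pipe = Pipeline([("scale", StandardScaler()), ("clf", estimator)])
    search = GridSearchCV(pipe, grid, cv=3)
    search.fit(X_train, y_train)
    # The test split is used only once, after tuning.
    results[name] = search.score(X_test, y_test)
    print(name, search.best_params_, round(results[name], 3))
```

Keeping the scaler inside the pipeline means the normalization statistics are learned from the training folds only.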


### Task 1

In [255]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scikitplot as skplt

In [305]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_curve, auc, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from itertools import cycle
from sklearn.model_selection import GridSearchCV

In [257]:
reads = pd.read_csv('TCGA-BRCA.htseq_fpkm.tsv', sep='\t', header=0)
label = pd.read_csv('TCGA-BRCA.GDC_phenotype.tsv', sep='\t', header=0)

In [259]:
class classifier(object):
    def __init__(self, reads, label):
        self.reads = reads
        self.label = label
        self.data = None
        self.sel_X = None

    def preprossessing(self):
        '''Merge the reads and label dataframes into a single dataframe'''
        reads = self.reads.set_index('Ensembl_ID').T
        reads.reset_index(inplace=True)
        reads.rename(columns={'index':'sample_ID'}, inplace=True)

        # use the instance's own label table rather than the global `label` variable
        self.label = self.label.loc[:, ['submitter_id.samples', 'tumor_stage.diagnoses']].rename(columns={'submitter_id.samples':'sample_ID', 'tumor_stage.diagnoses': 'diagnosis'})
        print('label 1', self.label.shape)
        self.label.dropna(inplace=True)
        print('label (drop NA)', self.label.shape)
        self.label = self.label.query('diagnosis != "not reported"') # exclude samples with no diagnosis
        # self.label = self.label.loc[:,['sample_ID','diagnosis']] # extract useful info
        # label2.dropna(inplace=True)
        print('label (drop "not reported")', self.label.shape)
        print('The shape of the whole data frame:', reads.shape)
        data = pd.merge(reads, self.label, on='sample_ID', how='inner') # merge two dataframes
        print('The merged data shape: ', data.shape)
        self.data = data
        # when the whole process is finished, the return value can be None
        return data

    def dimen_reduction(self):
        pass

    def train_test_split(self):
        '''Separate the stage labels (y) from the expression features (X), then split
        both into training (70%) and test (30%) sets. The labels are kept as the
        original stage strings; no numerical encoding is applied here.
        '''
        self.y = self.data['diagnosis']
        self.X = self.data.drop(['sample_ID', 'diagnosis'], axis=1)
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(self.X, self.y, test_size=0.3, random_state=42)

    def sel_train_test_split(self):
        self.X_train_sel, self.X_test_sel, self.y_train_sel, self.y_test_sel = train_test_split(self.sel_X, self.y, test_size=0.3, random_state=42)
        print('The shape of the selected training set:', self.X_train_sel.shape)

    def cross_validation(self):
        pass

    def testing(self):
        pass

    def feature_selection(self):
        pass

### Running the pipeline

In [260]:
A3 = classifier(reads, label)
data = A3.preprossessing()
A3.train_test_split()

label 1 (1284, 2)
label (drop NA) (1282, 2)
label (drop "not reported") (1270, 2)
The shape of the whole data frame: (1217, 60484)
The merged data shape:  (1204, 60485)
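
The inner merge above goes from 1217 expression samples to 1204 merged rows, so some samples have no usable label. One way to see which rows fail to match is an outer merge with `indicator=True`. The sketch below uses toy tables with hypothetical sample IDs, not the TCGA files.

```python
import pandas as pd

# Toy stand-ins for the expression and phenotype tables (hypothetical IDs).
reads_toy = pd.DataFrame({"sample_ID": ["S1", "S2", "S3"],
                          "gene_a": [1.0, 2.0, 3.0]})
label_toy = pd.DataFrame({"sample_ID": ["S2", "S3", "S4"],
                          "diagnosis": ["stage i", "stage iia", "stage iv"]})

# An outer merge with indicator=True records which table each row came from.
audit = pd.merge(reads_toy, label_toy, on="sample_ID", how="outer", indicator=True)
unmatched = audit.loc[audit["_merge"] != "both", ["sample_ID", "_merge"]]
print(unmatched)

# The inner merge keeps only samples present in both tables.
merged = pd.merge(reads_toy, label_toy, on="sample_ID", how="inner")
print(merged.shape)  # (2, 3)
```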


## Drafts below

In [261]:
A3.data.diagnosis.value_counts()

stage iia     397
stage iib     290
stage iiia    172
stage i       105
stage ia       91
stage iiic     70
stage iiib     31
stage iv       22
stage x        12
stage ib        6
stage ii        6
stage iii       2
Name: diagnosis, dtype: int64
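
The counts show twelve very unbalanced classes, several with fewer than ten samples. One option (an assumption on my side, not something the assignment requires) is to collapse the substages into their main stage, so that "stage iia" and "stage iib" both become "stage ii". A minimal sketch, where `collapse_stage` is a hypothetical helper:

```python
import re

import pandas as pd


def collapse_stage(stage):
    """Map e.g. 'stage iia' to 'stage ii'; leave unrecognised labels unchanged."""
    match = re.fullmatch(r"stage (iv|i{1,3}|x)[abc]?", stage)
    return f"stage {match.group(1)}" if match else stage


stages = pd.Series(["stage iia", "stage iib", "stage iiic",
                    "stage ia", "stage iv", "stage x"])
coarse = stages.map(collapse_stage)
print(coarse.value_counts())
```

In the notebook this would be applied to the `diagnosis` column before the split. Whether the coarser labels are acceptable depends on the assignment's definition of "stage".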

## Base-line performances
### SVM

In [262]:
# construct multi-class SVM classifier (gamma has no effect with the linear kernel)
model = SVC(kernel='linear', gamma="auto")
model.fit(A3.X_train, A3.y_train)
y_pred = model.predict(A3.X_test)
print('Accuracy: ', accuracy_score(A3.y_test, y_pred))
print('Confusion matrix: \n', confusion_matrix(A3.y_test, y_pred))
print('Classification report: \n', classification_report(A3.y_test, y_pred))

Accuracy:  0.3149171270718232
Confusion matrix: 
 [[ 4  5  0  0 20 10  0  0  0  1  0  0]
 [ 0  5  0  0 17  6  0  0  0  0  0  0]
 [ 0  0  0  0  2  2  0  0  0  0  0  0]
 [ 0  0  0  0  2  1  0  0  0  0  0  0]
 [ 2  3  0  0 60 27  0 15  1  1  0  0]
 [ 0  3  0  0 40 31  0 12  0  2  0  0]
 [ 0  0  0  0  1  0  0  0  0  0  0  0]
 [ 2  3  0  0 17 18  0  9  0  1  0  0]
 [ 0  0  0  0  5  3  0  0  1  0  0  0]
 [ 1  0  0  0  9  7  0  1  0  4  0  0]
 [ 0  0  0  0  1  2  0  2  0  0  0  0]
 [ 0  1  0  0  1  1  0  0  0  0  0  0]]
Classification report: 
               precision    recall  f1-score   support

     stage i       0.44      0.10      0.16        40
    stage ia       0.25      0.18      0.21        28
    stage ib       0.00      0.00      0.00         4
    stage ii       0.00      0.00      0.00         3
   stage iia       0.34      0.55      0.42       109
   stage iib       0.29      0.35      0.32        88
   stage iii       0.00      0.00      0.00         1
  stage iiia       0.23

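
To put the 0.315 accuracy in context: the classification report lists 109 "stage iia" samples among the 362 test samples, so always predicting "stage iia" would already reach about 0.30. A majority-class baseline makes this comparison explicit. The sketch below uses synthetic labels (an assumption, not the TCGA labels) and also prints balanced accuracy, which does not reward predicting only the largest class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic imbalanced labels (assumption: illustrative, not the TCGA labels).
rng = np.random.default_rng(0)
y_train = rng.choice(["a", "b", "c"], size=200, p=[0.6, 0.3, 0.1])
y_test = rng.choice(["a", "b", "c"], size=100, p=[0.6, 0.3, 0.1])
X_train = np.zeros((200, 1))  # the dummy model ignores the features
X_test = np.zeros((100, 1))

# Always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
```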


### Feature selection

In [301]:
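# conduct feature selection based on chi2 test
# note: the selector is fitted on all samples (A3.X, A3.y), so the test samples
# influence which features are kept; fit on A3.X_train only for an unbiased estimate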
from sklearn.feature_selection import SelectKBest, chi2
select = SelectKBest(chi2, k=7000)
X_new = select.fit_transform(A3.X, A3.y)
X_new.shape

(1204, 7000)

Next, the SVM is retrained using only the selected features.

In [265]:
# the selected features
chosen_index = select.get_support()
features = np.array(A3.X.columns)
print(features[chosen_index])

['ENSG00000277257.1' 'ENSG00000277646.1' 'ENSG00000255461.1'
 'ENSG00000146399.1' 'ENSG00000168530.14' 'ENSG00000163092.18'
 'ENSG00000201393.1' 'ENSG00000129170.7' 'ENSG00000181693.7'
 'ENSG00000164326.4']


In [302]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X_new, A3.y, test_size=0.3, random_state=42)

In [304]:
# train a new SVM model using the selected features
model3 = SVC(kernel='linear', gamma="auto")
model3.fit(X_train, y_train)
y_pred3 = model3.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred3))

Accuracy:  0.292817679558011
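
In the chi2 cell the selector is fitted on `A3.X` and `A3.y` before the split, so the test samples take part in choosing the features. A common way to avoid this is to put the selector inside a `Pipeline`, so it is refitted on the training part of every fold. The sketch below runs on synthetic data. The `MinMaxScaler` step is my own addition to keep the inputs non-negative for `chi2`, and `k=10` is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the TCGA matrix).
X, y = make_classification(n_samples=200, n_features=100, n_informative=5,
                           random_state=0)

pipe = Pipeline([
    ("scale", MinMaxScaler()),           # chi2 needs non-negative training values
    ("select", SelectKBest(chi2, k=10)),  # refitted on each training fold
    ("svm", SVC(kernel="linear")),
])

# Each fold selects features from its own training part only.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```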


### Try something new
Here I select features by their variance: genes whose variance across the training samples does not exceed a threshold are dropped.

In [283]:
# feature selection
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.5)
selector.fit(A3.X_train)
A3.sel_X = selector.transform(A3.X)
print(A3.sel_X.shape)
A3.sel_train_test_split()

(1204, 5675)
The shape of the selected training set: (842, 5675)
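
As a small illustration of what `VarianceThreshold(threshold=0.5)` keeps, here is a toy matrix with three columns of variance 0, 0.25 and 1. Only columns whose variance exceeds the threshold survive.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: column variances are 0, 0.25 and 1 (population variance).
X_toy = np.array([[0.0, 0.0, 0.0],
                  [0.0, 1.0, 2.0],
                  [0.0, 0.0, 0.0],
                  [0.0, 1.0, 2.0]])

selector = VarianceThreshold(threshold=0.5).fit(X_toy)
print(selector.variances_)     # variance of each column
print(selector.get_support())  # True only for the last column
```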


In [288]:
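# random forest on the variance-selected features
# caution: the model is fitted on all samples (A3.sel_X, A3.y), which include the
# test samples, so the accuracy below is inflated; fit on A3.X_train_sel and
# A3.y_train_sel for a fair estimate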
rfc2 = RandomForestClassifier(n_estimators=50, max_depth=7, max_features=20)
rfc2.fit(A3.sel_X, A3.y)
y_pred7 = rfc2.predict(A3.X_test_sel)
print('Accuracy: ', accuracy_score(A3.y_test_sel, y_pred7))

Accuracy:  0.6546961325966851
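
The cell above calls `rfc2.fit(A3.sel_X, A3.y)`, which fits on all 1204 samples, and then scores on `A3.X_test_sel`, a subset of those same samples. The sketch below shows why that inflates accuracy. The labels are pure noise (synthetic data, an assumption for illustration), yet the forest fitted on everything still scores well on the "test" rows it has already seen.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Random features and random labels: there is nothing real to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = rng.integers(0, 4, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Leaky: fitted on all rows, including the ones used for testing.
leaky = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# Honest: fitted on the training rows only.
honest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

leaky_acc = accuracy_score(y_test, leaky.predict(X_test))
honest_acc = accuracy_score(y_test, honest.predict(X_test))
print("fit on all rows:     ", leaky_acc)
print("fit on training rows:", honest_acc)
```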


In [309]:
# hyperparameter tuning
import warnings
warnings.filterwarnings('ignore')
param_grid = {'bootstrap': [True, False],
              'max_depth': [5, 15, 50, 100, None],
              'max_features': [20, 50, 80],
              'min_samples_leaf': [1, 2, 4],
              'min_samples_split': [2, 5, 10],
              'n_estimators': [10, 50, 100, 200]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=3, verbose=3, n_jobs=-1)
grid.fit(A3.X_train_sel, A3.y_train_sel)
print(grid.best_params_)

Fitting 3 folds for each of 1512 candidates, totalling 4536 fits
[CV 2/3] END bootstrap=True, max_depth=10, max_features=20, min_samples_leaf=1, min_samples_split=2, n_estimators=10;, score=0.306 total time=   0.3s
[CV 1/3] END bootstrap=True, max_depth=10, max_features=20, min_samples_leaf=1, min_samples_split=2, n_estimators=10;, score=0.281 total time=   0.3s
[CV 3/3] END bootstrap=True, max_depth=10, max_features=20, min_samples_leaf=1, min_samples_split=2, n_estimators=10;, score=0.314 total time=   0.3s
[CV 3/3] END bootstrap=True, max_depth=10, max_features=20, min_samples_leaf=1, min_samples_split=2, n_estimators=50;, score=0.357 total time=   1.0s
[CV 1/3] END bootstrap=True, max_depth=10, max_features=20, min_samples_leaf=1, min_samples_split=2, n_estimators=50;, score=0.370 total time=   1.0s
[CV 2/3] END bootstrap=True, max_depth=10, max_features=20, min_samples_leaf=1, min_samples_split=2, n_estimators=50;, score=0.345 total time=   1.0s
[CV 1/3] END bootstrap=True, max_de
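
The grid search above reports 1512 candidates and 4536 fits. If that is too slow, one alternative (not used in this notebook) is `RandomizedSearchCV`, which samples a fixed number of parameter combinations instead of trying all of them. A sketch on synthetic data, with illustrative ranges:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the TCGA matrix).
X, y = make_classification(n_samples=200, n_features=30, n_informative=6,
                           n_classes=3, random_state=0)

# Lists are sampled uniformly; randint(a, b) draws integers from a to b - 1.
param_distributions = {
    "max_depth": [5, 15, None],
    "max_features": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
    "n_estimators": [10, 50],
}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions, n_iter=8, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Here `n_iter=8` caps the search at 8 candidates (24 fits with 3 folds), whatever the size of the parameter space.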

#### MLP (compare before and after)

In [None]:
# use MLP to train the model
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(100, 100, 100), max_iter=500)
mlp.fit(A3.X_train, A3.y_train)
y_pred8 = mlp.predict(A3.X_test)
print('Accuracy: ', accuracy_score(A3.y_test, y_pred8))


Accuracy:  0.30662983425414364


In [297]:
# use MLP to train the model using new features
mlp2 = MLPClassifier(hidden_layer_sizes=(100, 100, 100), max_iter=500)
mlp2.fit(A3.X_train_sel, A3.y_train_sel)
y_pred9 = mlp2.predict(A3.X_test_sel)
print('Accuracy: ', accuracy_score(A3.y_test_sel, y_pred9))

Accuracy:  0.30939226519337015


#### Try SVM

In [299]:
svm2 = SVC(kernel='linear', gamma="auto")
svm2.fit(A3.X_train_sel, A3.y_train_sel)
y_pred10 = svm2.predict(A3.X_test_sel)
print('Accuracy: ', accuracy_score(A3.y_test_sel, y_pred10))

Accuracy:  0.3011049723756906


#### Try XGBoost

In [None]:
import xgboost as xgb