# Microarray Analysis - Machine Learning

**Author:** [Tony Kabilan Okeke](mailto:tko35@drexel.edu)

In this study, you will analyze the breast cancer dataset [GSE7390](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse7390) and identify a gene signature for predicting breast cancer relapse.  
Use an SVM to predict relapse, with a forward-selection strategy and 10-fold cross-validation to determine the best gene signature.

In [1]:
%load_ext autoreload

In [4]:
# Imports
%autoreload 2
import pandas as pd
import numpy as np
import rich
import re

from tools import geodlparse, hwmaml_breastcancer_trainandtest
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_val_score

In [3]:
# Download and parse data
gse = geodlparse('GSE7390')
gse_data = pd.concat(
    [ gsm.table.set_index('ID_REF')['VALUE'] for _,gsm in gse.gsms.items() ],
    axis=1
).set_axis([ x for x,_ in gse.gsms.items() ], axis=1)  # `inplace` was removed from set_axis in pandas 2.0

# Retrieve sample groups (labels)
groups = gse.phenotype_data.filter(regex='e\\.rfs$', axis=1) \
    .replace({'0': 'No Relapse', '1': 'Relapse'}).sort_index()

# Select the 76 genes identified in Wang, 2005
with open('data/genelist.txt', 'r') as file:
    genelist = [re.match(r'^\d{6}\w+', x)[0] for x in file.readlines()]

gse_data = gse_data.filter(genelist, axis=0).T \
    .rename_axis('', axis=1) \
    .sort_index(axis=1)

# Define variables for ml
X = gse_data.values
X = StandardScaler().fit_transform(X)  # standardize each gene to zero mean and unit variance
y = groups['characteristics_ch1.14.e.rfs'].values
genes = gse_data.columns

Loading cached data...


In [9]:
# Split the data into training (90%) and testing sets (10%)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.9, random_state=69)

# Split the training set into 4 stratified folds and keep the training portion of the first fold
skf = StratifiedKFold(n_splits=4, random_state=69, shuffle=True)
train_idx, test_idx = list(skf.split(X_train, y_train))[0]
X_train, y_train = X_train[train_idx], y_train[train_idx]

# Fit the SVM to the training data
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)

# Get model predictions
y_pred = clf.predict(X_test)

# Calculate and report model accuracy
accuracy = (y_pred == y_test).mean()
rich.print(f'The accuracy of the SVM model for a single fold is {accuracy*100:.2f}%')

Write an evaluation function `hwmaml_breastcancer_trainandtest(X_train, y_train, X_test, y_test)`
that trains an SVM on `X_train` and `y_train`, where `X_train` is the gene
expression data for a subset of the samples and `y_train` is a binary vector
of class labels (indicating cancer relapse status), and calculates the **accuracy**
on the test data (`X_test` and `y_test`).

The `hwmaml_breastcancer_trainandtest` function is defined in the `tools.py` module.
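
The body of that function is not reproduced in this notebook. A minimal sketch of a function with the same contract might look like the following. The name `train_and_test_svm` is hypothetical, the RBF kernel is assumed from the earlier cell, and the demo data is synthetic rather than GSE7390.

```python
import numpy as np
from sklearn import svm


def train_and_test_svm(X_train, y_train, X_test, y_test):
    """Hypothetical stand-in for the helper in tools.py: fit an SVM, return test accuracy."""
    clf = svm.SVC(kernel='rbf')  # kernel assumed to match the earlier cell
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Accuracy is the fraction of test samples whose predicted label matches the true label
    return float(np.mean(y_pred == np.asarray(y_test)))


# Toy check on synthetic, well-separated data (not the GSE7390 data)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-2, 1, (40, 5)), rng.normal(2, 1, (40, 5))])
y_demo = np.array(['No Relapse'] * 40 + ['Relapse'] * 40)

# Even rows train the model, odd rows test it
demo_accuracy = train_and_test_svm(X_demo[::2], y_demo[::2], X_demo[1::2], y_demo[1::2])
print(f'Toy accuracy: {demo_accuracy*100:.2f}%')
```

Returning a plain float keeps the result easy to format, as in the `rich.print` calls used throughout this notebook.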

In [10]:
accuracy = hwmaml_breastcancer_trainandtest(X_train, y_train, X_test, y_test)
rich.print(f'The accuracy of the SVM model for a single fold is {accuracy*100:.2f}%')

### Feature Selection

Perform forward selection of features (genes) that give the best prediction 
results (as measured by accuracy).  

- Use 10-fold cross-validation over all data samples
- Report the names of the selected genes that give the best accuracy
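
To make the idea of forward selection concrete, here is a hand-rolled sketch of the greedy loop on synthetic data. It is an illustration only: the helper `forward_select` and the toy arrays are hypothetical, and scikit-learn's `SequentialFeatureSelector` is the tool actually used in this notebook.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import StratifiedKFold, cross_val_score


def forward_select(X, y, n_features, cv):
    """Greedy forward selection: at each step, add the column giving the highest CV accuracy."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        scores = {}
        for j in remaining:
            cols = selected + [j]
            # Mean cross-validated accuracy of an RBF SVM on the candidate column set
            scores[j] = cross_val_score(svm.SVC(kernel='rbf'), X[:, cols], y, cv=cv).mean()
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic demo: only columns 0 and 3 carry class signal
rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1], 50)
X_toy = rng.normal(size=(100, 6))
X_toy[:, 0] += 3 * y_toy
X_toy[:, 3] -= 3 * y_toy

cv_toy = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
chosen = forward_select(X_toy, y_toy, n_features=2, cv=cv_toy)
print('Selected column indices:', chosen)
```

Each step costs one cross-validation run per remaining feature, which is why running the search in parallel is worthwhile on larger gene lists.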

In [11]:
# Initialize cross-validator
skf = StratifiedKFold(n_splits=10, random_state=69, shuffle=True)
# Initialize SVM classifier
clf = svm.SVC(kernel='rbf')
# Perform feature selection
sfs = SequentialFeatureSelector(clf, direction='forward', cv=skf, n_jobs=-1)
sfs.fit(X, y);

rich.print('Feature Selection resulted in the following genes:\n', 
           ', '.join(genes[sfs.get_support()]), sep='')

In [12]:
# Using the list of genes selected, report the 10-fold cross-validation accuracy
# of the SVM model
accuracy = cross_val_score(clf, X[:,sfs.get_support()], y, cv=skf)
rich.print(f'The SVM accuracy is {accuracy.mean()*100:.3f}%')
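
The cells above standardize `X` once on all samples before cross-validation, so each held-out fold contributes to the scaling statistics. One possible alternative, sketched here on synthetic data with hypothetical variable names, wraps the scaler and the classifier in a pipeline so that scaling is refit on the training portion of every fold.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expression matrix (not the GSE7390 data)
rng = np.random.default_rng(0)
y_syn = np.repeat([0, 1], 60)
X_syn = rng.normal(size=(120, 8))
X_syn[:, 0] += 2 * y_syn  # one informative column

# The scaler is fit only on each fold's training samples
pipe = make_pipeline(StandardScaler(), svm.SVC(kernel='rbf'))
cv_syn = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = cross_val_score(pipe, X_syn, y_syn, cv=cv_syn)
print(f'Mean accuracy: {fold_scores.mean()*100:.3f}%')
```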