# Machine Learning Demo
### Gene Expression Classification Using Support Vector Machine (SVM)

By [Ahmet Sacan](mailto:ahmetmsacan@gmail.com)  
Modified By [Tony Kabilan Okeke](mailto:tko35@drexel.edu)

Data was retrieved from [Kaggle](https://www.kaggle.com/datasets/crawford/gene-expression/metadata?select=actual.csv).  
Data originally published in *"Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"* (Golub et al., 1999).

### Load Packages

In [219]:
%load_ext autoreload



In [220]:
# Imports
%autoreload 2
import pandas as pd
import numpy as np
import kaggle
import rich
from ToolBox.utils import color_bool
from sklearn import svm
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SequentialFeatureSelector

### Prepare Data

In the code below, we follow the convention that each row is a sample  
and each column is a feature (gene).  

In [221]:
# Retrieve data from kaggle
kaggle.api.authenticate()
kaggle.api.dataset_download_files('crawford/gene-expression', path='data', unzip=True)

In [222]:
# Load the training and testing data retrieved from Kaggle
data = []
for ext in ['train', 'independent']:
    data.append(pd.read_csv(f"data/data_set_ALL_AML_{ext}.csv"))

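# Merge the training and testing datasets, dropping the 'call' columns
# (the negative lookahead keeps every column whose name does not start with 'call').
# The data is transposed so features = columns and samples = rows
df = pd.merge(*[ d.filter(regex=r'^(?!call)') for d in data ]) \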
    .drop('Gene Description', axis=1) \
    .rename({'Gene Accession Number': ''}, axis=1) \
    .set_index('') \
    .transpose()
df.index = df.index.astype(int)

# Add labels to data and index the rows by (sample, cancer)
df = pd.read_csv('data/actual.csv') \
    .merge(df, how='left', left_on='patient', right_index=True) \
    .rename({'patient': 'sample'}, axis=1) \
    .set_index(['sample', 'cancer'])

# Prepare data for machine learning
y = df.index.get_level_values(1).to_numpy()  # Labels
genes = df.columns  # Features
X = df.values

df.head()

sample,cancer,AFFX-BioB-5_at,AFFX-BioB-M_at,AFFX-BioB-3_at,AFFX-BioC-5_at,AFFX-BioC-3_at,AFFX-BioDn-5_at,AFFX-BioDn-3_at,AFFX-CreX-5_at,AFFX-CreX-3_at,AFFX-BioB-5_st,...,U48730_at,U58516_at,U73738_at,X06956_at,X16699_at,X83863_at,Z17240_at,L49218_f_at,M71243_f_at,Z78285_f_at
1,ALL,-214,-153,-58,88,-295,-558,199,-176,252,206,...,185,511,-125,389,-37,793,329,36,191,-37
2,ALL,-139,-73,-1,283,-264,-400,-330,-168,101,74,...,169,837,-36,442,-17,782,295,11,76,-14
3,ALL,-76,-49,-307,309,-376,-650,33,-367,206,-215,...,315,1199,33,168,52,1138,777,41,228,-41
4,ALL,-135,-114,265,12,-419,-585,158,-253,49,31,...,240,835,218,174,-110,627,170,-50,126,-91
5,ALL,-106,-125,-76,168,-230,-284,4,-122,70,252,...,156,649,57,504,-26,250,314,14,56,-25
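
The reshaping step can be hard to follow on the full dataset, so here is a minimal sketch on a made-up two-gene table. It assumes the raw files interleave one column per sample with `call` columns, as the column filter in the cell above suggests; the gene names and values in `toy` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of one raw file: genes on rows, samples on columns,
# with a 'call' column after every sample column (an assumption about the layout).
toy = pd.DataFrame({
    'Gene Description': ['gene one', 'gene two'],
    'Gene Accession Number': ['A_at', 'B_at'],
    '1': [10, 20],
    'call': ['A', 'P'],
    '2': [30, 40],
    'call.1': ['A', 'A'],
})

wide = (
    toy.filter(regex=r'^(?!call)')         # keep columns not starting with 'call'
       .drop('Gene Description', axis=1)
       .set_index('Gene Accession Number')
       .transpose()                        # samples become rows, genes become columns
)
wide.index = wide.index.astype(int)        # sample IDs as integers

print(wide)
```

After the transpose, each row is a sample and each column is a gene, which is the layout scikit-learn expects.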


### Train & Test for 1 Fold

First, let's train and test the SVM model for a single cross-validation fold.  
Later, we will repeat this for every fold to perform a full
cross-validation of the model.

We will be using the *K-Fold* cross-validation strategy (`KFold` from scikit-learn).

In [223]:
kf = KFold(n_splits=4, shuffle=True, random_state=69)
train_idx, test_idx = list(kf.split(y))[0]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
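
To make the splitting concrete, the following sketch runs `KFold` on a small placeholder dataset and checks that every sample is used for testing exactly once. The toy arrays and the `random_state=0` seed are illustrative assumptions, not part of the analysis above.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 8 samples, 3 features (values are arbitrary placeholders)
X_toy = np.arange(24).reshape(8, 3)
y_toy = np.array(['ALL', 'ALL', 'AML', 'ALL', 'AML', 'ALL', 'ALL', 'AML'])

kf_toy = KFold(n_splits=4, shuffle=True, random_state=0)

tested = []
for fold, (tr, te) in enumerate(kf_toy.split(X_toy)):
    # tr and te are disjoint arrays of row indices
    print(f"fold {fold}: train={tr}, test={te}")
    tested.extend(te.tolist())

# Across the 4 folds, each sample index appears in a test set exactly once
all_tested_once = sorted(tested) == list(range(len(y_toy)))
print(all_tested_once)
```

Plain `KFold` does not look at the labels, so the class proportions can differ from fold to fold.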

#### Train (Learn) Model from Training Data

In [224]:
# Initialize classifier
clf = svm.SVC(kernel='rbf', random_state=69)

# Fit training data to model
clf.fit(X_train, y_train)

SVC(random_state=69)

#### Test (Predict) Test Data Classifications

In [225]:
y_pred = clf.predict(X_test)

When comparing `y_test` and `y_pred`, use a comparison that suits their
data type: if they are text labels, compare them as strings; if they are
numbers, compare them numerically. It's your job to find the best way to
compare a prediction to the correct target value.

Let's print the predictions (`y_pred`) and the correct labels (`y_test`) side by side.
This is shown for instruction/debugging only. You don't need to show this
in your analysis.

In [226]:
pd.DataFrame([y_test, y_pred, y_test == y_pred]) \
    .transpose() \
    .rename({0: 'Labels', 1: 'Prediction', 2: 'Correct'}, axis=1) \
    .style.applymap(color_bool)

Unnamed: 0,Labels,Prediction,Correct
0,ALL,ALL,True
1,ALL,ALL,True
2,ALL,ALL,True
3,ALL,ALL,True
4,ALL,ALL,True
5,ALL,ALL,True
6,ALL,ALL,True
7,AML,ALL,False
8,AML,ALL,False
9,ALL,ALL,True
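
As a small illustration of comparing text labels, the sketch below re-enters the ten rows of the table above by hand and summarizes them with an element-wise comparison and a confusion matrix. The array names are hypothetical, and `confusion_matrix` is one possible summary, not something the original analysis uses.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Labels and predictions copied from the ten rows shown above
y_true_demo = np.array(['ALL'] * 7 + ['AML', 'AML', 'ALL'])
y_pred_demo = np.array(['ALL'] * 10)

# Element-wise comparison of string arrays gives a boolean mask
correct_mask = y_true_demo == y_pred_demo
n_correct = int(correct_mask.sum())

# Rows = true class, columns = predicted class, ordered as in `labels`
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=['ALL', 'AML'])

print(n_correct)
print(cm)
```

The confusion matrix makes it visible that both errors in these rows are AML samples predicted as ALL.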


#### Performance Summary for 1-Fold

In [227]:
ncorrect = (y_pred == y_test).sum()
nerror = (y_pred != y_test).sum()
accuracy = ncorrect / len(y_test)
errorrate = nerror / len(y_test)

rich.print(f" ncorrect = {ncorrect}\n   nerror = {nerror}\n",
           f"accuracy = {accuracy:.3f}\nerrorrate = {errorrate:.3f}")

### Evaluation Function that Trains and Tests a Model for 1-Fold

Let's take what we did and put it in a function that returns the error
rate for 1 fold.

In [228]:
def svm_train_and_test(X_train, y_train, X_test, y_test):
    """
    Train an SVM and test it.
    @params {X,y}_train
        training dataset and labels
    @params {X,y}_test
        testing dataset and labels
    @return
        model error rate
    """

    # Train (learn) model
    clf = svm.SVC(kernel='rbf', random_state=69)
    clf.fit(X_train, y_train)

    # Test (predict) test data
    y_pred = clf.predict(X_test)

    return (y_pred != y_test).sum() / len(y_test)

#### Train & Test Using the evaluation function

The evaluation function does exactly what we demonstrated above, so we should
get the same (or similar) results as before.

In [229]:
errorrate = svm_train_and_test(X_train, y_train, X_test, y_test)
rich.print(f"errorrate = {errorrate:.3f}")

#### Evaluate for All 4 Folds

In [230]:
errorrate = []
for train, test in kf.split(y):
    errorrate.append(svm_train_and_test(X[train], y[train], X[test], y[test]))
errorrate = np.array(errorrate)

rich.print( errorrate.round(3) )

#### Total Performance Across All Folds

In [231]:
rich.print( f'Total Error Rate = {errorrate.mean():.4f}' )

### Cross-Validation Made Easy

We don't have to train and test for each fold ourselves. We can let `cross_val_score` do the
work for us. We need to provide the classifier, all the `X` and `y` data, and the
cross-validation splitter. `cross_val_score` will fit and test the classifier on each fold,
just like we did above.

For a classifier, `cross_val_score` returns the accuracy for each fold by default, so we
subtract it from 1 to get the error rate.

In [232]:
errors = 1 - cross_val_score(clf, X, y, cv=kf)
rich.print(f'Total Error Rate = {errors.mean():.3f}')
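
The sketch below checks, on synthetic data, that `cross_val_score` gives the same per-fold accuracies as a hand-written loop over the folds. The dataset from `make_classification` and all `_syn` names are assumptions made for illustration only.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the expression matrix (not the real data)
X_syn, y_syn = make_classification(n_samples=72, n_features=20, random_state=0)
kf_syn = KFold(n_splits=4, shuffle=True, random_state=0)

# Library route: one accuracy value per fold
acc_cv = cross_val_score(svm.SVC(kernel='rbf'), X_syn, y_syn, cv=kf_syn)

# Manual route: fit a fresh classifier on each training split
acc_manual = []
for tr, te in kf_syn.split(X_syn):
    model = svm.SVC(kernel='rbf').fit(X_syn[tr], y_syn[tr])
    acc_manual.append((model.predict(X_syn[te]) == y_syn[te]).mean())
acc_manual = np.array(acc_manual)

# Error rate is the complement of accuracy
errors_cv = 1 - acc_cv
print(errors_cv.round(3), np.allclose(acc_cv, acc_manual))
```

Because the splitter has a fixed integer `random_state`, both routes see the same folds.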

### Feature Selection

In feature selection, we try to find a subset of features (genes) that gives
equally accurate (or sometimes more accurate) predictions.  
With cross-validation in place, there is not much left to code. We'll
use `SequentialFeatureSelector`, which cross-validates the classifier on different feature
combinations and gives us back the best subset it can find.  
Note that feature selection can take a long time to complete. To save time, I
have decided to limit the feature selection to 50 genes pre-filtered by their correlation
with the target class. As written, the code below keeps the 50 genes with the lowest
(most negative) correlation with the ALL label.

In [233]:
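# Correlation of every gene with the ALL indicator (last column of the correlation matrix)
corrvals = np.corrcoef(X.T, (y == 'ALL').astype(int), rowvar=True)[:, -1]
# argsort is ascending, so this keeps the 50 lowest (most negative) correlations
I = corrvals.argsort()[:50]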
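
The correlation filter can also be computed one feature at a time, without building the full gene-by-gene correlation matrix. The sketch below is one way to do that under illustrative assumptions: `feature_label_corr` is a hypothetical helper, the data is random, and two columns are deliberately shifted so that they correlate with the label. It also contrasts ranking by lowest correlation with ranking by absolute correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array(['ALL'] * 6 + ['AML'] * 6)
t = (labels == 'ALL').astype(float)      # 1 for ALL, 0 for AML

X_demo = rng.normal(size=(12, 5))
X_demo[:, 0] += 10 * t                   # column 0: higher in ALL (positive correlation)
X_demo[:, 1] -= 10 * t                   # column 1: higher in AML (negative correlation)

def feature_label_corr(X, target):
    """Pearson correlation of each column of X with target."""
    Xc = X - X.mean(axis=0)
    tc = target - target.mean()
    return (Xc * tc[:, None]).sum(axis=0) / np.sqrt((Xc ** 2).sum(axis=0) * (tc ** 2).sum())

r = feature_label_corr(X_demo, t)
r_ref = np.corrcoef(X_demo.T, t)[:-1, -1]    # same values via the full matrix

lowest = r.argsort()[:2]                     # most negative correlations first
strongest = np.abs(r).argsort()[::-1][:2]    # largest |correlation| first

print(r.round(2), lowest, strongest)
```

Ranking by `np.abs(r)` keeps genes that are strongly associated with either class, while an ascending `argsort` alone favors genes that are higher in the class encoded as 0.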

Now run feature selection on the genes we decided to consider.

In [234]:
# Forward selection: add one gene at a time, scored by cross-validation
clf = svm.SVC(kernel='rbf', random_state=69)
sfs = SequentialFeatureSelector(clf, direction='forward', cv=kf)
sfs.fit(X[:,I], y)

SequentialFeatureSelector(cv=KFold(n_splits=4, random_state=69, shuffle=True),
                          estimator=SVC(random_state=69))
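
For a quicker look at what the selector returns, here is a minimal sketch on synthetic data with an explicit `n_features_to_select`. The dataset and the choice of 3 features are assumptions for illustration; the cell above relies on the selector's default instead.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import KFold

# Synthetic stand-in: 60 samples, 8 features, 3 of them informative
X_syn, y_syn = make_classification(
    n_samples=60, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
kf_syn = KFold(n_splits=4, shuffle=True, random_state=0)

sfs_demo = SequentialFeatureSelector(
    svm.SVC(kernel='rbf'),
    n_features_to_select=3,   # explicit here; an illustrative choice
    direction='forward',
    cv=kf_syn,
)
sfs_demo.fit(X_syn, y_syn)

mask = sfs_demo.get_support()            # boolean mask over the 8 input features
X_selected = sfs_demo.transform(X_syn)   # same as X_syn[:, mask]

print(mask, X_selected.shape)
```

The boolean mask from `get_support()` indexes the columns that were passed to `fit`, which is why the notebook applies it to `X[:, I]` rather than to `X`.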

In [244]:
X[:,I][:, sfs.get_support()].shape

(72, 25)

In [239]:
X[:,I].shape

(72, 50)

In [187]:
print('Feature Selection resulted in the following genes:', 
      ', '.join(genes.values[I][sfs.get_support()]))

Feature Selection resulted in the following genes: X95735_at, X17042_at, M23197_at, M84526_at, L09209_s_at, U46499_at, M27891_at, M16038_at, M22960_at, M63138_at, M55150_at, M62762_at, U50136_rna1_at, X61587_at, X16546_at, M11147_at, M32304_s_at, X52056_at, D49950_at, M19507_at, X14008_rna1_f_at, M81695_s_at, X62654_rna1_at, X64072_s_at, Y00787_s_at


### Performance of Selected Features

Repeat cross-validation to get the performance of selected features.

In [197]:
accuracy = cross_val_score(clf, X[:,I][:,sfs.get_support()], y, cv=kf)

In [209]:
rich.print(f'Average Accuracy: {accuracy.mean():.4f}')

### Normalize Data (Standard Normalization) and Redo the Analysis

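This is something that we should've done before. Let's now normalize the data
to see the accuracy we can achieve with it. Note that the expression below
uses the mean and standard deviation of the whole matrix, not of each gene
separately. Because this is a single rescaling of all values, the correlation
of each gene with the labels is unchanged.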

In [214]:
X_std = (X - X.mean()) / X.std()
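
For comparison, the sketch below contrasts the global scaling used above with per-gene standardization, where each column gets its own mean and standard deviation. The small matrix is made up, and `StandardScaler` is shown as one common way to do the per-column version, not as the method used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up matrix: 4 samples, 2 genes on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [4.0, 700.0]])

# Global scaling: one mean and one standard deviation for the whole matrix
X_global = (X_demo - X_demo.mean()) / X_demo.std()

# Per-gene scaling: axis=0 computes statistics column by column
X_per_gene = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# StandardScaler applies the same per-column transformation
X_scaler = StandardScaler().fit_transform(X_demo)

print(X_global.std(axis=0).round(3))
print(X_per_gene.std(axis=0).round(3))
```

With global scaling the two genes keep their very different spreads, whereas per-gene scaling gives every column zero mean and unit variance.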

#### Accuracy of Using Normalized Data for Prediction

In [217]:
accuracy = cross_val_score(clf, X_std, y, cv=kf)
rich.print(f'Average Accuracy: {accuracy.mean():.4f}')

### Feature Selection with Normalized Data

In [218]:
corrvals = np.corrcoef(X_std.T, (y == 'ALL').astype(int), rowvar=True)[:, -1]
I = corrvals.argsort()[:50]

clf = svm.SVC(kernel='rbf', random_state=69)
sfs = SequentialFeatureSelector(clf, direction='forward', cv=kf)
sfs.fit(X_std[:,I], y)

print('Feature Selection resulted in the following genes:', 
      ', '.join(genes.values[I][sfs.get_support()]))

accuracy = cross_val_score(clf, X_std[:,I][:,sfs.get_support()], y, cv=kf)
rich.print(f'Average Accuracy: {accuracy.mean():.4f}')

Feature Selection resulted in the following genes: X95735_at, X17042_at, M23197_at, M84526_at, L09209_s_at, U46499_at, M27891_at, M16038_at, M22960_at, M63138_at, M55150_at, M62762_at, U50136_rna1_at, X61587_at, X16546_at, M11147_at, M32304_s_at, X52056_at, D49950_at, M19507_at, X14008_rna1_f_at, M81695_s_at, X62654_rna1_at, X64072_s_at, Y00787_s_at
