# Multiclass Task Notebook
### Contains code for creating, training and validating (hyperparameter tuning) the multiclass task SVM model
### The multiclass targets are encoded as<br/>no_tumor = 0<br/>glioma_tumor = 1<br/>meningioma_tumor = 2<br/>pituitary_tumor = 3
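
For readable reports, the numeric targets can be mapped back to class names. This is a minimal sketch that assumes the encoding listed above; `CLASS_NAMES` and `decode_labels` are hypothetical helper names and are not part of the original notebook.

```python
# Hypothetical helper: map the numeric targets back to their class names.
CLASS_NAMES = {
    0: "no_tumor",
    1: "glioma_tumor",
    2: "meningioma_tumor",
    3: "pituitary_tumor",
}


def decode_labels(labels):
    """Convert numeric labels (ints or floats such as 2.0) to class names."""
    return [CLASS_NAMES[int(label)] for label in labels]


decoded = decode_labels([2.0, 0.0, 1.0, 3.0])
print(decoded)
```

The `int(...)` cast is there because the label file stores the targets as floats.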

In [1]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import keras
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

#tqdm provides progress bars and must be installed for this notebook to run (TODO: handle a missing tqdm install gracefully)
from tqdm import tqdm

#Importing libraries used for SVM classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_curve, roc_auc_score
import pickle as pkl
import os

Using TensorFlow backend.
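
One possible way to address the tqdm TODO in the import cell is a guarded import with a no-op fallback. This is only a sketch: the fallback function is an assumption about how the author might want to handle a missing install, not code from the notebook.

```python
# Sketch of the tqdm TODO: fall back to a no-op wrapper when tqdm is missing.
try:
    from tqdm import tqdm
except ImportError:
    def tqdm(iterable, *args, **kwargs):
        """Stand-in for tqdm that returns the iterable without a progress bar."""
        return iterable

# Loops written against tqdm keep working either way.
squares = [value ** 2 for value in tqdm(range(5))]
print(squares)
```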


# 1. Loading Dataset and doing final preprocessing
## We load the preprocessed data and apply PCA to the image array to build the multiclass training and test sets. <br/> The preprocessing closely follows the Binary Task SVM notebook; the only difference is that the multiclass label file is used instead of the binary one.
## 1.1 Loading Datasets

In [2]:
#PCA must be done separately for the binary and multiclass tasks because the data must be split first
#PCA is fitted on the training data only (fit and transform); the test data is only transformed, so it cannot bias the model
#We select 400 components as this retains around 96% of the variance (the exact value is printed in section 1.3)

#Reading the created pkl files for the multiclass labels and image data
Multiclass_labels = pd.read_pickle('./dataset/Y_Multiclass_label.pkl')
Flattened_MRI_Array = pd.read_pickle('./dataset/Image_DF_Flat.pkl')

#For Display
print(Multiclass_labels)
Flattened_MRI_Array


      MRI_Multiclass_Label
0                      2.0
1                      0.0
2                      2.0
3                      1.0
4                      2.0
...                    ...
2995                   0.0
2996                   2.0
2997                   1.0
2998                   1.0
2999                   3.0

[3000 rows x 1 columns]


      0  1  2  3  4  5  6  7  8   9  ...  774  775  776  777  778  779  780  781  782  783
0     1  1  2  2  3  3  3  3  3   2  ...   40   67   63   29   44   62    3    3    3    3
1     3  2  2  2  2  2  2  2  2   2  ...    1    1    0    1    0    0    0    0    0    0
2     2  0  0  0  0  0  0  0  0   0  ...   48   44   46   33   43   38   47   34    1    0
3     0  0  0  0  1  1  1  2  1   2  ...    2    2    2    2    3    4   24   26    5    2
4     1  1  1  1  1  0  0  6  1  20  ...  157  112  183  133   43    1    0    0    0    2
...
2995  5  1  1  3  1  0  5  6  6   3  ...    1    1    1    2    3    4    1    2    0    3
2996  1  1  2  2  2  2  3  2  3   3  ...   72   70   42    9    4    4    2    4    2    2
2997  0  3  2  2  2  2  2  2  2   2  ...   27   21   27   14    4    3    2    1    3    0
2998  2  2  2  2  3  3  3  3  4   3  ...  153   63   86   81   59   71   26    7    5    3


In [3]:
#Storing Data from pickle files as X and Y
Y = Multiclass_labels[['MRI_Multiclass_Label']]
X = Flattened_MRI_Array

#Verifying the arrays have the expected shapes: 3000 samples, and 784 pixels per sample for the image array (X)
print(Y.shape)
print(X.shape)

(3000, 1)
(3000, 784)


## 1.2 Splitting data into training and testing sets

In [4]:
# Split the data into training and testing sets (70% training and 30% testing data)
xTrain,xTest,yTrain,yTest=train_test_split(X, Y, train_size = 0.7)

#Rescaling the dataframes as the pixel values range from 0 to 255
#We want the values to lie between 0 and 1 before they are passed to the models
xTrain_Scaled = xTrain/255
xTest_Scaled = xTest/255
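
The split above is random on every run, so the reported scores will vary slightly between runs. A hedged sketch of a reproducible, class-balanced variant follows. The synthetic arrays `X_demo` and `y_demo` stand in for the notebook's data, and the `random_state` value and the use of `stratify` are assumptions, not settings used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data: 200 samples, 784 pixel values each.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(200, 784))
y_demo = np.repeat([0, 1, 2, 3], 50)

# random_state fixes the shuffle; stratify keeps the class proportions in both sets.
x_train, x_test, y_train, y_test = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=42, stratify=y_demo
)

# Rescale pixel values from [0, 255] to [0, 1].
x_train_scaled = x_train / 255
x_test_scaled = x_test / 255

print(x_train_scaled.shape, x_test_scaled.shape)
print(np.bincount(y_train), np.bincount(y_test))
```

Stratifying matters most when the classes are unevenly sized, as they are here (the `no_tumor` class has roughly half the test support of the others).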

## 1.3 PCA

In [5]:
#Initialising PCA with 400 components determined in preprocessing notebook
Multiclass_PCA = PCA(n_components = 400)

#Fitting and Transforming training dataset
xTrain_PCA = Multiclass_PCA.fit_transform(xTrain_Scaled)

#We only transform test dataset as we do not want the model to learn about the test data statistics
xTest_transformed = Multiclass_PCA.transform(xTest_Scaled)


#Prints the percentage of explained variance to verify it is greater than our threshold of 95%
print(np.cumsum(Multiclass_PCA.explained_variance_ratio_ * 100)[-1])

96.43163561590319
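
The component count of 400 comes from the preprocessing notebook, which is not shown here. As an illustration of how such a count can be derived from a variance threshold, the sketch below fits a full PCA on training data only and picks the smallest number of components reaching 95%. The synthetic data and all variable names are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic low-rank data plus noise, standing in for the scaled training pixels.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 10))
mixing = rng.normal(size=(10, 50))
train_demo = latent @ mixing + 0.1 * rng.normal(size=(300, 50))
test_demo = rng.normal(size=(100, 10)) @ mixing

# Fit a full PCA on the training data only and accumulate the explained variance.
full_pca = PCA().fit(train_demo)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the 95% threshold.
n_components = int(np.argmax(cumulative >= 0.95)) + 1

# Refit with that many components; the test data is only transformed.
pca = PCA(n_components=n_components)
train_reduced = pca.fit_transform(train_demo)
test_reduced = pca.transform(test_demo)

print(n_components, pca.explained_variance_ratio_.sum())
```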


# 2. Model Building

## 2.1 SVM without tuning
### 2.1.1 Training<br/> We first train an SVM with default hyperparameters to establish a baseline for the model on the training and test data

In [6]:
#Training model using SVC without hyperparameter tuning
multiclass_SVM_base = SVC(probability = True)
multiclass_SVM_base.fit(xTrain_PCA, yTrain.values.ravel())


### 2.1.2 Obtaining the Prediction results and performance<br/> Showing the confusion matrix and general assessment metrics using the classification report

In [None]:
#Printing prediction results
multiclass_SVM_base_pred = multiclass_SVM_base.predict(xTest_transformed)
print("The Result for SVM without hyperparameter tuning is:")
print(classification_report(yTest, multiclass_SVM_base_pred))

#Printing the confusion matrix for SVM without tuning
print("The confusion matrix is:")
print(confusion_matrix(yTest, multiclass_SVM_base_pred))

The Result for SVM without hyperparameter tuning is:
              precision    recall  f1-score   support

         0.0       0.84      0.80      0.82       128
         1.0       0.82      0.79      0.81       275
         2.0       0.75      0.73      0.74       250
         3.0       0.87      0.96      0.92       247

    accuracy                           0.82       900
   macro avg       0.82      0.82      0.82       900
weighted avg       0.82      0.82      0.82       900

The confusion matrix is:
[[102   5  13   8]
 [  8 218  43   6]
 [ 12  36 182  20]
 [  0   6   4 237]]

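
The per-class figures in the classification report can be recomputed by hand from the confusion matrix reported for the untuned model. The sketch below uses scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix reported for the untuned SVM (rows: true class, columns: predicted class).
cm = np.array([
    [102,   5,  13,   8],
    [  8, 218,  43,   6],
    [ 12,  36, 182,  20],
    [  0,   6,   4, 237],
])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)     # correct / all samples of that true class
precision = true_positives / cm.sum(axis=0)  # correct / all samples predicted as that class
accuracy = true_positives.sum() / cm.sum()

print(np.round(recall, 2))
print(np.round(precision, 2))
print(round(accuracy, 2))
```

Rounded to two decimals, these values agree with the precision, recall and accuracy columns of the report. The largest off-diagonal entries (43 and 36) show that most errors are confusions between glioma (1) and meningioma (2).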

## 2.2 SVM with validation

### 2.2.1 Training<br/> We now train the SVM again, this time tuning the hyperparameter values with cross-validation<br/> This is done using an exhaustive grid search over the parameter values given.

In [7]:
#Using SVM again, this time with grid search to tune the hyperparameters

#Define the parameter ranges
param_grid = {
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
    'kernel': ['rbf', 'poly']
}

multiclass_SVM_grid = GridSearchCV(SVC(probability = True), param_grid, refit = True, verbose = 1)

#Fitting model with grid search
multiclass_SVM_grid.fit(xTrain_PCA, yTrain.values.ravel())

#Estimated runtime: 32 minutes

Fitting 5 folds for each of 50 candidates, totalling 250 fits


GridSearchCV(estimator=SVC(probability=True),
             param_grid={'C': [0.1, 1, 10, 100, 1000],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                         'kernel': ['rbf', 'poly']},
             verbose=1)
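
The "50 candidates, totalling 250 fits" line in the log follows directly from the grid size and the default 5-fold cross-validation. A short sketch of that arithmetic, using `ParameterGrid` to enumerate the combinations:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
    'kernel': ['rbf', 'poly'],
}

# Every combination of C, gamma and kernel is one candidate.
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)

# GridSearchCV uses 5-fold cross-validation by default, so each candidate is fitted 5 times.
n_folds = 5
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)
```

With `refit = True`, one further fit of the best candidate on the full training set follows the search; the log line does not count it.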

### 2.2.2 Obtaining best parameters found<br/> After the model is trained, we retrieve the best parameters found by the grid search

In [8]:
#Display the best parameters after the hyperparameter tuning
print("The best hyperparameters found by gridsearch are:")
print(multiclass_SVM_grid.best_params_)

#Print the new details of the SVM model after tuning
print("The new model created after hyperparameter tuning is:")
print(multiclass_SVM_grid.best_estimator_)

The best hyperparameters found by gridsearch are:
{'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
The new model created after hyperparameter tuning is:
SVC(C=10, gamma=0.1, probability=True)
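
Beyond `best_params_`, the fitted search object also exposes the cross-validated score of every candidate, which helps judge how close the runners-up were. The sketch below shows this on a small synthetic problem with a deliberately tiny grid; the data and the grid are assumptions made so the example runs in seconds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic 4-class problem standing in for the PCA-transformed training data.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=8, n_classes=4, random_state=0
)

# Deliberately tiny grid so the sketch runs in seconds.
demo_grid = {'C': [1, 10], 'gamma': [0.01, 0.1], 'kernel': ['rbf']}
search = GridSearchCV(SVC(), demo_grid, cv=5)
search.fit(X_demo, y_demo)

# Rank the candidates by mean cross-validated accuracy (rank 1 is best).
results = search.cv_results_
order = np.argsort(results['rank_test_score'])
for index in order:
    print(results['params'][index], round(results['mean_test_score'][index], 3))

print("Best CV accuracy:", round(search.best_score_, 3))
```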


### 2.2.3 Obtaining the Prediction results and performance<br/> Showing the confusion matrix and general assessment metrics using the classification report

In [9]:
#Now running the tuned model with the test data to obtain the classification report
Tuned_SVM_pred = multiclass_SVM_grid.predict(xTest_transformed)

#Print classification report
print("The Result for SVM with hyperparameter tuning via gridsearch is:")
print(classification_report(yTest, Tuned_SVM_pred))

#Printing the confusion matrix for SVM with tuning
print("The confusion matrix is:")
print(confusion_matrix(yTest, Tuned_SVM_pred))

The Result for SVM with hyperparameter tuning via gridsearch is:
              precision    recall  f1-score   support

         0.0       0.88      0.77      0.82       128
         1.0       0.88      0.82      0.85       275
         2.0       0.78      0.84      0.81       250
         3.0       0.93      0.98      0.95       247

    accuracy                           0.86       900
   macro avg       0.87      0.85      0.86       900
weighted avg       0.86      0.86      0.86       900

The confusion matrix is:
[[ 99   4  21   4]
 [  9 225  36   5]
 [  5  26 210   9]
 [  0   2   3 242]]
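
The notebook imports `roc_auc_score` and trains its SVMs with `probability = True`, but no AUC is reported. As a hedged sketch of how a multiclass AUC could be obtained, the example below computes a one-vs-rest macro-averaged AUC on synthetic data. The data, the model settings and the choice of one-vs-rest averaging are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 4-class problem standing in for the PCA-transformed MRI data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=4, random_state=0
)
x_train, x_test, y_train, y_test = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0, stratify=y_demo
)

# probability=True enables predict_proba, which the multiclass AUC needs.
model = SVC(probability=True, random_state=0).fit(x_train, y_train)
probabilities = model.predict_proba(x_test)

# One-vs-rest AUC, averaged over the four classes (macro average).
auc_ovr = roc_auc_score(y_test, probabilities, multi_class='ovr', average='macro')
print(round(auc_ovr, 3))
```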


In [10]:
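#Saving tuned and base SVM models
save_path = "./Models/Multiclassification"
Tuned_SVM_filename = 'Tuned_multiclass_SVM.sav'
Base_SVM_filename = 'Base_multiclass_SVM.sav'

#Create the output folder if it does not exist yet
os.makedirs(save_path, exist_ok = True)

#Using Pickle to put them in files; the with-blocks close each file after writing
with open(os.path.join(save_path, Base_SVM_filename), 'wb') as base_file:
    pkl.dump(multiclass_SVM_base, base_file)
with open(os.path.join(save_path, Tuned_SVM_filename), 'wb') as tuned_file:
    pkl.dump(multiclass_SVM_grid, tuned_file)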
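
The saved SVMs expect PCA-transformed inputs, so the fitted PCA is needed again whenever they are reloaded. One option, sketched below on synthetic data, is to bundle the PCA and the SVM in a single scikit-learn pipeline and pickle that instead. The pipeline, the component count and the file name `demo_pipeline.sav` are assumptions for this example, not what the notebook saves.

```python
import os
import pickle as pkl
import tempfile

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic 4-class data standing in for the scaled image array.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=8, n_classes=4, random_state=0
)

# Bundling PCA and the SVM means the saved file carries the fitted projection too.
pipeline = make_pipeline(PCA(n_components=10), SVC(C=10, gamma=0.1))
pipeline.fit(X_demo, y_demo)

# Round trip through a temporary folder (hypothetical location for this sketch).
with tempfile.TemporaryDirectory() as folder:
    model_path = os.path.join(folder, 'demo_pipeline.sav')
    with open(model_path, 'wb') as model_file:
        pkl.dump(pipeline, model_file)
    with open(model_path, 'rb') as model_file:
        restored = pkl.load(model_file)

# The restored pipeline reproduces the original predictions.
same_predictions = (restored.predict(X_demo) == pipeline.predict(X_demo)).all()
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and with the same scikit-learn version that created them where possible.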