# Multiclass Task Notebook
### Contains code for multiclass task SVM model creation, training and validation (hyperparameter tuning)
### The multiclass targets are encoded as<br/>no_tumor = 0<br/>glioma_tumor = 1<br/>meningioma_tumor = 2<br/>pituitary_tumor = 3
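As a small illustration of the encoding listed in the header, the mapping can be kept in a dictionary so that predictions are readable. The names `CLASS_NAMES` and `decode_labels` are hypothetical helpers introduced here, not part of the notebook or of "functions.ipynb".

```python
# Mapping taken from the notebook header: integer target -> tumour class name.
# CLASS_NAMES and decode_labels are illustrative names, not from the notebook.
CLASS_NAMES = {
    0: "no_tumor",
    1: "glioma_tumor",
    2: "meningioma_tumor",
    3: "pituitary_tumor",
}


def decode_labels(labels):
    """Translate integer (or float, e.g. 2.0) targets into class names."""
    return [CLASS_NAMES[int(label)] for label in labels]


print(decode_labels([0.0, 3.0, 1.0]))
```

The `int(...)` cast is there because the classification reports in this notebook show the labels stored as floats (0.0 to 3.0).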

In [1]:
# Import required libraries
import numpy as np
import pandas as pd
import pickle as pkl
import os
import cv2
import matplotlib.pyplot as plt
import tensorflow as tf
import keras
from sklearn.decomposition import PCA

#tqdm provides progress bars and must be installed for the code to run (TODO: handle the case where tqdm is not installed)
from tqdm import tqdm

#Importing libraries used for SVM classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_curve, roc_auc_score

#Importing functions notebook containing functions created to streamline code
from ipynb.fs.full.functions import load_dataset, dataset_PCA, Tuned_SVM_train, SVM_predictions

Using TensorFlow backend.
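The TODO in the import cell mentions handling a missing tqdm installation. One possible approach, shown here only as a sketch, is to fall back to a no-op wrapper that returns the iterable unchanged, so loops still run without a progress bar.

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fallback (an assumption about the desired behaviour): no progress bar,
    # but any loop written as `for x in tqdm(items)` keeps working.
    def tqdm(iterable, *args, **kwargs):
        """Return the iterable unchanged when tqdm is not installed."""
        return iterable


squares = [n * n for n in tqdm(range(5))]
print(squares)
```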


# 1. Loading Dataset and doing final preprocessing
#### We load the preprocessed data and carry out PCA on the image array for the multiclass training and test data. <br/> The initial preprocessing code is very similar to the Binary Task SVM notebook; the only difference is that the multiclass label file is used instead of the binary one.
## 1.1 Loading Datasets

In [4]:
#Calls the load_dataset function from "functions.ipynb", which loads the X data and Y label datasets for the multiclass task from the given file paths
#It prints the loaded array shapes to verify that loading completed properly
#Forward slashes are used in both paths so they work on any OS and avoid invalid escape sequences such as '\d'
X, Y = load_dataset('./dataset/Image_DF_Flat.pkl', './dataset/Y_Multiclass_label.pkl')

Datasets successfully loaded with shapes:
Y Shape:
(3000,)
X Shape:
(3000, 784)


##### We apply PCA to the images, but this must be done separately for the binary and multiclass tasks because the data must be split first.<br/> PCA is fitted on the training data only (fit and transform), and the fitted transform is then applied to the test data, so that no information from the test set leaks into the model.<br/> We select 400 components as this retains around 96% of the explained variance, as shown previously.
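The fit-on-train, transform-on-test pattern can be sketched with scikit-learn as follows. `dataset_PCA` lives in "functions.ipynb" and is not shown in this notebook, so the function `pca_fit_train_transform_test` is a hypothetical stand-in, and the random arrays are placeholder data rather than the image dataset.

```python
import numpy as np
from sklearn.decomposition import PCA


def pca_fit_train_transform_test(n_components, x_train, x_test):
    """Fit PCA on the training data only, then project both sets.

    Hypothetical stand-in for dataset_PCA; the real function may differ.
    """
    pca = PCA(n_components=n_components)
    x_train_pca = pca.fit_transform(x_train)  # learns the components from train
    x_test_pca = pca.transform(x_test)        # reuses them, no refitting on test
    explained = 100 * pca.explained_variance_ratio_.sum()
    print(f"PCA conducted with {n_components} components.")
    print(f"Explained variance: {explained:.2f}%")
    return x_train_pca, x_test_pca


# Placeholder data: 70 training rows and 30 test rows with 20 features each.
rng = np.random.default_rng(0)
x_train_demo = rng.random((70, 20))
x_test_demo = rng.random((30, 20))

train_pca, test_pca = pca_fit_train_transform_test(5, x_train_demo, x_test_demo)
print(train_pca.shape, test_pca.shape)
```

Calling `fit_transform` on the test set instead would learn a second, different projection and let test statistics influence the features, which is the leakage the text warns about.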

## 1.2 Splitting data in to training and testing sets

In [9]:
# Split the data into training and testing sets (70% training, 30% testing)
xTrain,xTest,yTrain,yTest=train_test_split(X, Y, train_size = 0.7)

#Rescaling the dataframe, as the pixel values range from 0 to 255
#We want values between 0 and 1 before passing them through PCA and the models
xTrain_Scaled = xTrain/255
xTest_Scaled = xTest/255
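The split in this cell is random and not seeded, so the exact results change between runs. A variant that is reproducible and keeps the class proportions equal in both sets could look like the sketch here. The `random_state` value and the use of `stratify` are assumptions added for illustration (the notebook uses neither), and the arrays are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flattened 28x28 images and their 4-class labels.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(200, 784)).astype(np.float64)
y_demo = np.repeat([0, 1, 2, 3], 50)

# stratify keeps the class balance in both splits; random_state fixes the shuffle.
x_train, x_test, y_train, y_test = train_test_split(
    X_demo, y_demo, train_size=0.7, stratify=y_demo, random_state=42
)

# Rescale pixel values from [0, 255] to [0, 1].
x_train_scaled = x_train / 255
x_test_scaled = x_test / 255

print(x_train_scaled.shape, x_test_scaled.shape)
print(np.bincount(y_test))
```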

## 1.3 PCA

In [10]:
#Initialising PCA with 400 components, as determined in the preprocessing notebook
#Calls the dataset_PCA function defined in "functions.ipynb" to carry out PCA
#Input arguments are the number of components, the xTrain data and the xTest data
#We pass in the scaled X train and test data
xTrain_PCA, xTest_transformed = dataset_PCA(400, xTrain_Scaled, xTest_Scaled)

#The function prints the resulting explained variance percentage for 400 components and returns the transformed train and test data.

PCA conducted with 400 components.
The percentage of Explained Variance of the dataset from PCA is: 96.51006820948666
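For reference, the number of components needed to reach a target explained variance can be read from the cumulative sum of `explained_variance_ratio_`. This is a sketch on synthetic low-rank data, not the preprocessing notebook's code; the 0.96 target simply mirrors the roughly 96% reported in the output.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 latent factors mixed into 50 features, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 10))
mixing = rng.normal(size=(10, 50))
x_demo = latent @ mixing + 0.05 * rng.normal(size=(300, 50))

# Fit PCA with all components and accumulate the explained variance ratios.
pca = PCA().fit(x_demo)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the target.
target = 0.96
n_components = int(np.searchsorted(cumulative, target) + 1)
print(n_components, cumulative[n_components - 1])
```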


# 2. Model Building

## 2.1 SVM without tuning
### 2.1.1 Training<br/> We first train SVM without hyperparameter tuning to assess the performance of the model with the training and test data

In [13]:
#Training model using SVC without hyperparameter tuning
multiclass_SVM_untuned = SVC(probability = True)
multiclass_SVM_untuned.fit(xTrain_PCA, yTrain.values.ravel())


SVC(probability=True)

### 2.1.2 Obtaining the Prediction results and performance<br/> Showing the confusion matrix and general assessment metrics using the classification report

In [14]:
#Calls SVM_predictions function from "functions.ipynb" to carry out the predictions using the untuned multiclass SVM model we made.
#It prints out the classification report of the predictions as well as the confusion matrix
#Returns the predictions
multiclass_SVM_untuned_pred = SVM_predictions(multiclass_SVM_untuned, xTest_transformed, yTest) 

The Results for SVM are:
              precision    recall  f1-score   support

         0.0       0.86      0.75      0.80       126
         1.0       0.79      0.78      0.79       274
         2.0       0.75      0.76      0.75       251
         3.0       0.90      0.95      0.92       249

    accuracy                           0.82       900
   macro avg       0.82      0.81      0.82       900
weighted avg       0.82      0.82      0.82       900

The confusion matrix is:
[[ 95   9  17   5]
 [  6 215  45   8]
 [  7  40 190  14]
 [  3   7   3 236]]
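The per-class figures in the classification report can be recovered directly from the confusion matrix. In scikit-learn's convention rows are true classes and columns are predicted classes, so recall divides the diagonal by the row sums and precision divides it by the column sums. The matrix here is copied from the output of the untuned model.

```python
import numpy as np

# Confusion matrix of the untuned SVM (rows: true class, columns: predicted).
cm = np.array([
    [95, 9, 17, 5],
    [6, 215, 45, 8],
    [7, 40, 190, 14],
    [3, 7, 3, 236],
])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)     # share of each true class found
precision = true_positives / cm.sum(axis=0)  # share of each prediction correct
accuracy = true_positives.sum() / cm.sum()

print(np.round(recall, 2))     # [0.75 0.78 0.76 0.95]
print(np.round(precision, 2))  # [0.86 0.79 0.75 0.9 ]
print(round(accuracy, 2))      # 0.82
```

The off-diagonal entries show that most errors are confusions between classes 1 and 2 (glioma and meningioma): 45 and 40 samples respectively.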


## 2.2 SVM with validation

### 2.2.1 Training<br/> We now train the SVM again, this time tuning the hyperparameter values with cross-validation<br/> This is done using an exhaustive grid search over the parameter values given. <br/> After the model is trained, we obtain the best parameters found by the grid search

In [15]:
#Using SVM again, this time with grid search to tune the hyperparameters

#Define the parameter ranges
#We test various values of:
# C
# gamma
# Type of Kernel to use
param_grid = {
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
    'kernel': ['rbf', 'poly']
}
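The size of an exhaustive grid search follows from the grid itself: every combination of values is one candidate, and each candidate is fitted once per cross-validation fold. The sketch here enumerates the candidates with plain Python, assuming the 5 folds used for tuning in this notebook.

```python
from itertools import product

param_grid = {
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
    'kernel': ['rbf', 'poly']
}

# One candidate per combination of C, gamma and kernel.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

cv_folds = 5  # assumed: the number of folds used for tuning in this notebook
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # 50 250
```

The cost grows multiplicatively with each added value, which is why a grid of this size takes several minutes on the PCA-reduced data.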


In [16]:
#Calls Tuned_SVM_train from "functions.ipynb" to train and tune the SVM model using grid search
#Full details of the input arguments are listed in the functions notebook

#The function prints the best hyperparameters found and the details of the new model
#Returns the tuned model
multiclass_SVM_Tuned = Tuned_SVM_train(param_grid, 5, xTrain_PCA, yTrain, True)

#Estimated run time: 11 min 10 s

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Tuned SVM Model successfully trained and tuned
The best hyperparameters found by gridsearch are:
{'C': 1000, 'gamma': 0.1, 'kernel': 'rbf'}
The new model created after hyperparameter tuning is:
SVC(C=1000, gamma=0.1, probability=True)
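`Tuned_SVM_train` is defined in "functions.ipynb" and not shown here, so the sketch that follows is only a guess at its general shape using scikit-learn's `GridSearchCV`. The function name `tuned_svm_train_sketch`, the meaning given to each argument (in particular that the final `True` enables `probability`), and the small synthetic dataset are all assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def tuned_svm_train_sketch(param_grid, cv_folds, x_train, y_train, probability):
    """Grid-search an SVC and return the refitted best model (hypothetical)."""
    search = GridSearchCV(
        SVC(probability=probability),
        param_grid,
        cv=cv_folds,
        refit=True,   # refit the best candidate on the full training set
        verbose=1,    # prints the "Fitting k folds for each of n candidates" line
    )
    search.fit(x_train, np.ravel(y_train))
    print("Best hyperparameters:", search.best_params_)
    return search.best_estimator_


# Small synthetic 4-class problem so the sketch runs in seconds.
rng = np.random.default_rng(0)
x_demo = np.vstack([rng.normal(loc=c, size=(30, 5)) for c in range(4)])
y_demo = np.repeat([0, 1, 2, 3], 30)

small_grid = {"C": [1, 10], "gamma": [0.1, 0.01], "kernel": ["rbf"]}
model = tuned_svm_train_sketch(small_grid, 3, x_demo, y_demo, True)
print(model)
```

Because `refit=True`, the returned estimator has already been retrained on all of the training data with the best parameters, so it can be used for prediction directly.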


### 2.2.2 Obtaining the Prediction results and performance<br/> Showing the confusion matrix and general assessment metrics using the classification report

In [17]:
#Calls SVM_predictions function from "functions.ipynb"
#This time we are doing predictions with the tuned SVM model
multiclass_SVM_Tuned_pred = SVM_predictions(multiclass_SVM_Tuned , xTest_transformed, yTest)

#It prints out the classification report of the predictions as well as the confusion matrix
#Returns the predictions

The Results for SVM are:
              precision    recall  f1-score   support

         0.0       0.88      0.71      0.78       126
         1.0       0.88      0.82      0.85       274
         2.0       0.76      0.86      0.81       251
         3.0       0.92      0.96      0.94       249

    accuracy                           0.85       900
   macro avg       0.86      0.84      0.85       900
weighted avg       0.86      0.85      0.85       900

The confusion matrix is:
[[ 89   5  27   5]
 [  5 226  39   4]
 [  4  21 215  11]
 [  3   5   2 239]]


In [18]:
#Saving tuned and base SVM models
save_path = "./Models/Multiclassification"
Tuned_SVM_filename = 'Tuned_multiclass_SVM.sav'
Base_SVM_filename = 'Untuned_multiclass_SVM.sav'

#Using pickle to write the models to files; the with blocks close each file after writing
with open(os.path.join(save_path, Base_SVM_filename), 'wb') as f:
    pkl.dump(multiclass_SVM_untuned, f)
with open(os.path.join(save_path, Tuned_SVM_filename), 'wb') as f:
    pkl.dump(multiclass_SVM_Tuned, f)
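A saved model can later be restored with `pickle.load`. The round trip here is a sketch on a small placeholder model written to a temporary directory; the file name `demo_SVM.sav` and the data are illustrative and not the notebook's saved models.

```python
import os
import pickle as pkl
import tempfile

import numpy as np
from sklearn.svm import SVC

# Small placeholder model standing in for the trained multiclass SVMs.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(40, 4))
y_demo = np.repeat([0, 1], 20)
model = SVC().fit(x_demo, y_demo)

with tempfile.TemporaryDirectory() as save_dir:
    model_path = os.path.join(save_dir, "demo_SVM.sav")

    # Save the model, then load it back from disk.
    with open(model_path, "wb") as f:
        pkl.dump(model, f)
    with open(model_path, "rb") as f:
        restored = pkl.load(f)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(x_demo), restored.predict(x_demo))
print(same_predictions)
```

Pickled models should only be loaded from trusted sources and with a compatible scikit-learn version, since unpickling executes code and the format is not guaranteed to be stable across releases.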