[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# Machine Learning Methods

## Supervised Learning - Kernel SVM - Exercise

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        | Content / Changes                                                  |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.001 | 24/05/2023 | Royi Avital | Scaling data into `[0, 1]`                                         |
| 0.1.000 | 28/01/2023 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/MachineLearningMethods/2023_01/0018KernelSVMExerciseSolution.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import auc, confusion_matrix, precision_recall_fscore_support, roc_curve
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Miscellaneous
import gzip
import os
from platform import python_version
import random
import urllib.request

# Typing
from typing import Tuple

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
#%matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF = (8, 8)
ELM_SIZE_DEF = 50
CLASS_COLOR = ('b', 'r')
EDGE_COLOR  = 'k'

# Fashion MNIST
TRAIN_DATA_SET_IMG_URL = r'https://github.com/zalandoresearch/fashion-mnist/raw/master/data/fashion/train-images-idx3-ubyte.gz'
TRAIN_DATA_SET_LBL_URL = r'https://github.com/zalandoresearch/fashion-mnist/raw/master/data/fashion/train-labels-idx1-ubyte.gz'
TEST_DATA_SET_IMG_URL  = r'https://github.com/zalandoresearch/fashion-mnist/raw/master/data/fashion/t10k-images-idx3-ubyte.gz'
TEST_DATA_SET_LBL_URL  = r'https://github.com/zalandoresearch/fashion-mnist/raw/master/data/fashion/t10k-labels-idx1-ubyte.gz'

TRAIN_DATA_IMG_FILE_NAME = 'TrainImgFile'
TRAIN_DATA_LBL_FILE_NAME = 'TrainLblFile'
TEST_DATA_IMG_FILE_NAME  = 'TestImgFile'
TEST_DATA_LBL_FILE_NAME  = 'TestLblFile'

TRAIN_DATA_SET_FILE_NAME = 'FashionMnistTrainDataSet.npz'
TEST_DATA_SET_FILE_NAME  = 'FashionMnistTestDataSet.npz'

TRAIN_DATA_NUM_IMG  = 60_000
TEST_DATA_NUM_IMG   = 10_000

D_CLASSES = {0: 'T-Shirt', 1: 'Trouser', 2: 'Pullover', 3: 'Dress', 4: 'Coat', 5: 'Sandal', 6: 'Shirt', 7: 'Sneaker', 8: 'Bag', 9: 'Boots'}
L_CLASSES = ['T-Shirt', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Boots']

In [None]:
# Fixel Algorithms Packages


## Model Parameter Optimization with Kernel SVM

In this exercise we'll use Cross Validation to automatically find the optimal hyper parameters of the Kernel SVM model.  
To achieve this we'll do a [Grid Search for Hyper Parameters Optimization](https://en.wikipedia.org/wiki/Hyperparameter_optimization).

1. Load the [Fashion MNIST Data Set](https://github.com/zalandoresearch/fashion-mnist) manually (Done by me).
2. Train a baseline Linear SVM model.
3. Find the optimal Kernel SVM model using Grid Search.
4. Extract the optimal model.
5. Plot the Confusion Matrix of the best model on the training data.

* <font color='brown'>(**#**)</font> You may and should use the functions in the `Auxiliary Functions` section.

In [None]:
# Parameters

numSamplesTrain = 4_000
numSamplesTest  = 1_000
numImg = 3

# Linear SVM (Baseline Model)
paramC      = 1
kernelType  = 'linear'

#===========================Fill This===========================#
# Think of the parameters to optimize
# Select the set to optimize over
# Set the number of folds in the cross validation
?????
numFold = ???
#===============================================================#

In [None]:
# Auxiliary Functions

def DownloadDecompressGzip(fileUrl: str, fileName: str) -> None:
    # Based on https://stackoverflow.com/a/61195974

    # Read the file inside the `.gz` archive located at the URL
    with urllib.request.urlopen(fileUrl) as response:
        with gzip.GzipFile(fileobj = response) as uncompressed:
            file_content = uncompressed.read()
    # Write the decompressed content to a file in binary mode (`wb`)
    with open(fileName, 'wb') as f:
        f.write(file_content)

def ConvertMnistDataDf(imgFilePath: str, labelFilePath: str) -> Tuple[np.ndarray, np.ndarray]:
    numPx = 28 * 28
    # Merge of https://pjreddie.com/projects/mnist-in-csv/ and https://github.com/keras-team/keras/blob/master/keras/datasets/fashion_mnist.py
    # IDX format: The labels file has an 8 bytes header, the images file has a 16 bytes header
    with open(labelFilePath, 'rb') as l:
        vY = np.frombuffer(l.read(), np.uint8, offset = 8)
    with open(imgFilePath, 'rb') as f:
        mX = np.frombuffer(f.read(), np.uint8, offset = 16)
    mX = np.reshape(mX, (len(vY), numPx)) #<! One image per row

    return mX, vY

def PlotMnistImages(mX, vY, numImg, lClasses = list(range(10)), hF = None):

    numSamples  = mX.shape[0]
    numPx       = mX.shape[1]

    numRows = int(np.sqrt(numPx))

    tFigSize = (numImg * 3, numImg * 3)

    if hF is None:
        hF, hA = plt.subplots(numImg, numImg, figsize = tFigSize)
    else:
        hA = hF.axes #<! The list of axes of the given figure
    
    hA = np.atleast_1d(hA) #<! To support numImg = 1
    hA = hA.flat

    
    for kk in range(numImg * numImg):
        idx = np.random.choice(numSamples)
        mI  = np.reshape(mX[idx, :], (numRows, numRows))
    
        # hA[kk].imshow(mI.clip(0, 1), cmap = 'gray')
        hA[kk].imshow(mI, cmap = 'gray')
        hA[kk].tick_params(axis = 'both', left = False, top = False, right = False, bottom = False, labelleft = False, labeltop = False, labelright = False, labelbottom = False)
        hA[kk].set_title(f'Index = {idx}, Label = {lClasses[vY[idx]]}')
    
    plt.show()

def PlotLabelsHistogram(vY: np.ndarray, hA = None):

    if hA is None:
        hF, hA = plt.subplots(figsize = (8, 6))
    
    vLabels, vCounts = np.unique(vY, return_counts = True)

    hA.bar(vLabels, vCounts, width = 0.9, align = 'center')
    hA.set_xticks(vLabels)
    hA.set_title('Histogram of Classes / Labels')
    hA.set_xlabel('Class')
    hA.set_ylabel('Number of Samples')

    return hA

def PlotConfusionMatrix(vY: np.ndarray, vYPred: np.ndarray, normMethod: str = None, hA: plt.Axes = None, lLabels: list = None, dScore: dict = None, titleStr: str = 'Confusion Matrix') -> Tuple[plt.Axes, np.ndarray]:

    # Calculation of Confusion Matrix
    mConfMat = confusion_matrix(vY, vYPred, normalize = normMethod)
    oConfMat = ConfusionMatrixDisplay(mConfMat, display_labels = lLabels)
    oConfMat = oConfMat.plot(ax = hA)
    hA = oConfMat.ax_
    if dScore is not None:
        titleStr += ':'
        for scoreName, scoreVal in  dScore.items():
            titleStr += f' {scoreName} = {scoreVal:0.2},'
        titleStr = titleStr[:-1]
    hA.set_title(titleStr)
    hA.grid(False)

    return hA, mConfMat
    

## Generate / Load Data


In [None]:
# Loading / Generating Data
if os.path.isfile(TRAIN_DATA_SET_FILE_NAME):
    dData = np.load(TRAIN_DATA_SET_FILE_NAME)
    mXTrain, vYTrain = dData['mXTrain'], dData['vYTrain']
else:
    if not os.path.isfile(TRAIN_DATA_IMG_FILE_NAME):
        DownloadDecompressGzip(TRAIN_DATA_SET_IMG_URL, TRAIN_DATA_IMG_FILE_NAME) #<! Download Data (GZip File)
    if not os.path.isfile(TRAIN_DATA_LBL_FILE_NAME):
        DownloadDecompressGzip(TRAIN_DATA_SET_LBL_URL, TRAIN_DATA_LBL_FILE_NAME) #<! Download Data (GZip File)
    mXTrain, vYTrain = ConvertMnistDataDf(TRAIN_DATA_IMG_FILE_NAME, TRAIN_DATA_LBL_FILE_NAME)
    np.savez_compressed(TRAIN_DATA_SET_FILE_NAME, mXTrain  = mXTrain, vYTrain = vYTrain)

if os.path.isfile(TEST_DATA_SET_FILE_NAME):
    dData = np.load(TEST_DATA_SET_FILE_NAME)
    mXTest, vYTest = dData['mXTest'], dData['vYTest']
else:
    if not os.path.isfile(TEST_DATA_IMG_FILE_NAME):
        DownloadDecompressGzip(TEST_DATA_SET_IMG_URL, TEST_DATA_IMG_FILE_NAME) #<! Download Data (GZip File)
    if not os.path.isfile(TEST_DATA_LBL_FILE_NAME):
        DownloadDecompressGzip(TEST_DATA_SET_LBL_URL, TEST_DATA_LBL_FILE_NAME) #<! Download Data (GZip File)
    mXTest, vYTest = ConvertMnistDataDf(TEST_DATA_IMG_FILE_NAME, TEST_DATA_LBL_FILE_NAME)
    np.savez_compressed(TEST_DATA_SET_FILE_NAME, mXTest = mXTest, vYTest = vYTest)


vSampleIdx = np.random.choice(mXTrain.shape[0], numSamplesTrain)
mXTrain = mXTrain[vSampleIdx, :]
vYTrain = vYTrain[vSampleIdx]

vSampleIdx = np.random.choice(mXTest.shape[0], numSamplesTest)
mXTest = mXTest[vSampleIdx, :]
vYTest = vYTest[vSampleIdx]


print(f'The number of train data samples: {mXTrain.shape[0]}')
print(f'The number of train features per sample: {mXTrain.shape[1]}') 
print(f'The number of test data samples: {mXTest.shape[0]}')
print(f'The number of test features per sample: {mXTest.shape[1]}') 

### Pre Process Data

The image data is of the `UInt8` data type with values in `{0, 1, 2, ..., 255}`.  
Scale it into the `[0, 1]` range.
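
As a minimal sketch of such scaling on a toy array (the array `mA` below is hypothetical and is not part of the exercise data), dividing by the maximum `UInt8` value promotes the data to floating point and maps it into `[0, 1]`:

```python
import numpy as np

# Toy UInt8 "image" values (Hypothetical, for illustration only)
mA = np.array([[0, 64, 128], [192, 254, 255]], dtype = np.uint8)

# Division by a float promotes the array to floating point
# The values {0, 1, ..., 255} are mapped into [0, 1]
mAScaled = mA / 255.0

print(mAScaled.dtype)                  #<! float64
print(mAScaled.min(), mAScaled.max())  #<! 0.0 1.0
```

Scaling matters for kernels such as the RBF, since they depend on distances between samples and hence on the scale of the features.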

In [None]:
# Pre Process Data
# Scale data into [0, 1] range

mXTrain = ???
mXTest  = ???

### Plot Data

In [None]:
# Display the Data

PlotMnistImages(mXTrain, vYTrain, numImg, lClasses = L_CLASSES)

In [None]:
# Histogram of Classes
hA = PlotLabelsHistogram(vYTrain)
hA.set_xticks(range(len(L_CLASSES)))
hA.set_xticklabels(L_CLASSES)
plt.show()

## Train Linear SVM Classifier

This will be the baseline model.

In [None]:
# SVM Linear Model
#===========================Fill This===========================#
# Construct a baseline model (Linear SVM)
# Train the model
# Score the model (Accuracy)
???
???
#===============================================================#

print(f'The model score (Accuracy) on the data: {modelScore:0.2%}') #<! Accuracy

## Train Kernel SVM

In this section we'll train a Kernel SVM and select its hyper parameters by cross validation.  
To optimize over the parameters `C`, `kernel` and `gamma` we'll use `GridSearchCV()`.  
The idea is to iterate over a grid of parameter values and pick the combination with the best score.  
Each parameter combination is evaluated by Cross Validation.

In order to use it we need to define:
 - The Model (`estimator`) - Which model is used.
 - The Parameters Grid (`param_grid`) - The set of parameters to try.
 - The Scoring (`scoring`) - The score used to define the best model.
 - The Cross Validation Iterator (`cv`) - The splitting strategy used to validate each model.


* <font color='brown'>(**#**)</font> Pay attention to the expected run time. Using `verbose` is useful.
* <font color='brown'>(**#**)</font> This is a classic grid search which is not the most efficient policy. There are more advanced policies.
* <font color='brown'>(**#**)</font> The `GridSearchCV()` is limited to one instance of an estimator. Yet using Pipelines we may test different types of estimators.
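
The ingredients above can be sketched on a toy data set. This is only an illustration under assumptions: the data comes from `make_moons()` and not from Fashion MNIST, the names with the `Toy` suffix are hypothetical and the parameter values are arbitrary. It uses a list of dictionaries as the grid so that `gamma` is not searched for the linear kernel, which ignores it.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Toy 2D binary data (Hypothetical, not the Fashion MNIST data)
mXToy, vYToy = make_moons(n_samples = 200, noise = 0.2, random_state = 0)

# A list of dictionaries: Each dictionary defines its own sub grid
lParamsToy = [
    {'kernel': ['linear'], 'C': [0.1, 1.0, 10.0]},
    {'kernel': ['rbf'], 'C': [0.1, 1.0, 10.0], 'gamma': [0.1, 1.0, 10.0]},
]

# Explicit cross validation iterator (An integer is also accepted by `cv`)
oCvToy = StratifiedKFold(n_splits = 3, shuffle = True, random_state = 0)
oGsToy = GridSearchCV(estimator = SVC(), param_grid = lParamsToy, scoring = 'accuracy', cv = oCvToy)
oGsToy = oGsToy.fit(mXToy, vYToy)

print(len(oGsToy.cv_results_['params'])) #<! 3 + 9 = 12 parameter combinations
print(oGsToy.best_params_)
print(f'{oGsToy.best_score_:0.2%}')
```

With 12 combinations and 3 folds the search fits 36 models, plus one final fit of the best combination. This is why the run time grows quickly with the size of the grid.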

In [None]:
# Construct the Grid Search object 

#===========================Fill This===========================#
# 1. Set the parameters to iterate over and their values (Dictionary).
dParams = ???
#===============================================================#

oGsSvc = GridSearchCV(estimator = SVC(), param_grid = dParams, scoring = None, cv = numFold, verbose = 4)

* <font color='brown'>(**#**)</font> You may want to have a look at the `n_jobs` parameter of `GridSearchCV()`.

In [None]:
# Training (Hyper Parameter Optimization)

#===========================Fill This===========================#
# The model trains on the train data using Stratified K Fold cross validation
oGsSvc = ???
#===============================================================#

In [None]:
# Extract the attributes of the best model

#===========================Fill This===========================#
# Extract the best score
# Extract a dictionary of the parameters
bestScore   = ???
dBestParams = ???
#===============================================================#

print(f'The best model had the following parameters: {dBestParams} with the CV score: {bestScore:0.2%}')


* <font color='brown'>(**#**)</font> In production one would visualize the effect of each parameter on the model result, then use it to further fine tune the parameters.
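
One possible way to inspect the effect of each parameter is to arrange the cross validation results as a table. The sketch below assumes a toy data set and a small RBF only grid. All names with the `Toy` suffix, `dfRes` and `dfScore` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy 2D binary data (Hypothetical, not the Fashion MNIST data)
mXToy, vYToy = make_moons(n_samples = 200, noise = 0.2, random_state = 0)

dParamsToy = {'C': [0.1, 1.0, 10.0], 'gamma': [0.1, 1.0, 10.0]}
oGsToy = GridSearchCV(estimator = SVC(kernel = 'rbf'), param_grid = dParamsToy, cv = 3)
oGsToy = oGsToy.fit(mXToy, vYToy)

# One row per parameter combination with its mean score over the folds
dfRes = pd.DataFrame(oGsToy.cv_results_['params'])
dfRes['Score'] = oGsToy.cv_results_['mean_test_score']

# Rows: `C`, Columns: `gamma`, Values: Mean cross validation accuracy
dfScore = dfRes.pivot(index = 'C', columns = 'gamma', values = 'Score')
print(dfScore)
```

Such a table can be displayed as a heat map, for instance with `sns.heatmap(dfScore, annot = True)`. If the best score lies on the edge of the grid, it hints the grid should be extended in that direction.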

In [None]:
# The Best Model

#===========================Fill This===========================#
# Extract the best model
# Score the best model on the test data set
bestModel = ???
modelScore = ???
#===============================================================#

print(f'The model score (Accuracy) on the data: {modelScore:0.2%}') #<! Accuracy

With proper tuning one can beat the baseline model by `~5%`.

### Train the Best Model on the Train Data Set

In production we take the optimal Hyper Parameters and then retrain the model on the whole training data set.  
This is the model we'll use in production.
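
A minimal sketch of this step on a toy data set, assuming a small arbitrary grid (all names with the `Toy` suffix are hypothetical): the dictionary of the best parameters is unpacked into a new model, which is then trained on all the training samples.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy 2D binary data (Hypothetical, not the Fashion MNIST data)
mXToy, vYToy = make_moons(n_samples = 200, noise = 0.2, random_state = 0)

dParamsToy = {'C': [0.1, 1.0, 10.0], 'kernel': ['linear', 'rbf']}
oGsToy = GridSearchCV(estimator = SVC(), param_grid = dParamsToy, cv = 3)
oGsToy = oGsToy.fit(mXToy, vYToy)

# Build a new model from the best parameters (Dictionary unpacking)
oSvcToy = SVC(**oGsToy.best_params_)
# Train it on all the training samples (No folds are held out)
oSvcToy = oSvcToy.fit(mXToy, vYToy)

print(oGsToy.best_params_)
print(f'{oSvcToy.score(mXToy, vYToy):0.2%}') #<! Score on the toy training data
```

During the search each model sees only part of the training data, as one fold is held out for validation. The retrained model uses all of it.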


In [None]:
# The Model with Optimal Parameters

#===========================Fill This===========================#
# Construct the model
# Train the model
oSvmCls = ???
oSvmCls = ???
#===============================================================#

modelScore = oSvmCls.score(mXTest, vYTest)

print(f'The model score (Accuracy) on the data: {modelScore:0.2%}') #<! Accuracy

* <font color='red'>(**?**)</font> Is the value above exactly the same as the value from the best model of the grid search? If so, look at the `refit` parameter of `GridSearchCV`.

## Performance Metrics / Scores

In this section we'll analyze the model using the _confusion matrix_.

### Display the Confusion Matrix

In [None]:
# Plot the Confusion Matrix
hF, hA = plt.subplots(figsize = (10, 10))

#===========================Fill This===========================#
hA, mConfMat = ???
#===============================================================#

plt.show()

* <font color='red'>(**?**)</font> Which class has the best accuracy?
* <font color='red'>(**?**)</font> Which class has a dominant false prediction? Does it make sense?
* <font color='red'>(**?**)</font> What's the difference between $p \left( \hat{y}_{i} = \text{coat} \mid {y}_{i} = \text{coat} \right)$ and $p \left( {y}_{i} = \text{coat} \mid \hat{y}_{i} = \text{coat} \right)$?
* <font color='blue'>(**!**)</font> Make the proper calculations on `mConfMat` or the function `PlotConfusionMatrix` to answer the questions above.
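
As a hedged illustration of such calculations, the sketch below uses a toy confusion matrix with hypothetical counts (`mC` is not the output of the exercise). It assumes the convention of `confusion_matrix()`, where rows are the true labels and columns are the predicted labels.

```python
import numpy as np

# Toy confusion matrix (Hypothetical counts)
# Rows: True labels, Columns: Predicted labels
mC = np.array([[8, 2, 0],
               [1, 6, 3],
               [1, 0, 9]])

# p(yHat = k | y = k): Normalize the diagonal by the row (True label) sums
vRecall = np.diag(mC) / np.sum(mC, axis = 1)
# p(y = k | yHat = k): Normalize the diagonal by the column (Predicted label) sums
vPrecision = np.diag(mC) / np.sum(mC, axis = 0)

print(vRecall)    #<! Values: 0.8, 0.6, 0.9
print(vPrecision) #<! Values: 0.8, 0.75, 0.75
```

The same normalizations are available through the `normalize` parameter of `confusion_matrix()`, using `'true'` for rows and `'pred'` for columns, which `PlotConfusionMatrix()` exposes as `normMethod`.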