[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# Machine Learning Methods

## Supervised Learning - Cross Validation - Exercise Solution

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        | Content / Changes                                                  |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.000 | 21/01/2023 | Royi Avital | First version                                                      |
|         |            |             |                                                                    |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/MachineLearningMethods/2023_01/0012ConfMatCrossValidationExerciseSolution.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.datasets import fetch_openml
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_predict, KFold, StratifiedKFold, train_test_split
from sklearn.svm import LinearSVC, SVC

# Miscellaneous
import os
from platform import python_version
import random

# Typing
from typing import Tuple

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
%matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF = (8, 8)
ELM_SIZE_DEF = 50
CLASS_COLOR = ('b', 'r')


In [None]:
# Fixel Algorithms Packages


## Cross Validation with the SVM

In this exercise we'll apply _Cross Validation_ manually to find the optimal `C` parameter of the SVM model.  
Instead of using `cross_val_predict()` we'll do a manual loop on the folds and average the score.

1. Load the [MNIST Data set](https://en.wikipedia.org/wiki/MNIST_database) using `fetch_openml()`.
2. Split the data using Stratified K Fold.
3. For each model (Parameterized by `C`):
    - Train model on the train sub set.
    - Score model on the test sub set.
4. Plot the score per model.
5. Plot the Confusion Matrix of the best model on the training data.

* <font color='brown'>(**#**)</font> Make sure to choose a small number of models and folds at the beginning to measure the run time, then scale accordingly.
* <font color='brown'>(**#**)</font> We'll use the `LinearSVC` class, which is similar to `SVC` with `kernel = 'linear'` but optimized to scale better to larger data sets.  
* <font color='brown'>(**#**)</font> You may and should use the functions in the `Auxiliary Functions` section.
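
Step 2 above relies on stratification, namely keeping the class proportions of each fold close to those of the whole data set. The following is a minimal NumPy sketch of that idea, not scikit-learn's `StratifiedKFold` implementation. The helper name `AssignStratifiedFolds` and the toy labels are assumptions made for illustration only.

```python
import numpy as np

def AssignStratifiedFolds(vY: np.ndarray, numFold: int, seedNum: int = 0) -> np.ndarray:
    # Hypothetical helper: returns, per sample, the index of the fold in which it serves as test data
    oRng     = np.random.default_rng(seedNum)
    vFoldIdx = np.zeros(len(vY), dtype = int)
    for valCls in np.unique(vY):
        vClsIdx = np.flatnonzero(vY == valCls) #<! Indices of the samples of the current class
        oRng.shuffle(vClsIdx)
        vFoldIdx[vClsIdx] = np.arange(len(vClsIdx)) % numFold #<! Deal the class samples round robin
    return vFoldIdx

vYToy    = np.repeat([0, 1, 2], [10, 20, 30]) #<! Toy imbalanced labels (Made up)
vFoldIdx = AssignStratifiedFolds(vYToy, numFold = 5)

for kk in range(5):
    vTestIdx  = np.flatnonzero(vFoldIdx == kk)
    vTrainIdx = np.flatnonzero(vFoldIdx != kk)
    print(f'Fold {kk}: Train = {len(vTrainIdx)}, Test = {len(vTestIdx)}, Test class counts = {np.bincount(vYToy[vTestIdx])}')
```

Each test fold keeps the 1:2:3 class ratio of the toy labels, which is the property the stratified split is meant to preserve.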

In [None]:
# Parameters

numSamples  = 10_000
numImg = 3

maxItr = 5000 #<! For the LinearSVC model

#===========================Fill This===========================#
numFold = 5
lC = list(np.linspace(0.0005, 1.5, 15))
#===============================================================#

In [None]:
# Auxiliary Functions

def PlotMnistImages(mX, vY, numImg, hF = None):

    numSamples  = mX.shape[0]
    numPx       = mX.shape[1]

    numRows = int(np.sqrt(numPx))

    tFigSize = (numImg * 3, numImg * 3)

    if hF is None:
        hF, hA = plt.subplots(numImg, numImg, figsize = tFigSize)
    else:
        hA = hF.axes #<! The list of axes of the given figure
    
    hA = np.atleast_1d(hA) #<! To support numImg = 1
    hA = hA.flat

    
    for kk in range(numImg * numImg):
        idx = np.random.choice(numSamples)
        mI  = np.reshape(mX[idx, :], (numRows, numRows))
    
        # hA[kk].imshow(mI.clip(0, 1), cmap = 'gray')
        hA[kk].imshow(mI, cmap = 'gray')
        hA[kk].tick_params(axis = 'both', left = False, top = False, right = False, bottom = False, labelleft = False, labeltop = False, labelright = False, labelbottom = False)
        hA[kk].set_title(f'Index = {idx}, Label = {vY[idx]}')
    
    plt.show()

def PlotLabelsHistogram(vY: np.ndarray, hA = None):

    if hA is None:
        hF, hA = plt.subplots(figsize = (8, 6))
    
    vLabels, vCounts = np.unique(vY, return_counts = True)

    hA.bar(vLabels, vCounts, width = 0.9, align = 'center')
    hA.set_title('Histogram of Classes / Labels')
    hA.set_xlabel('Class')
    hA.set_ylabel('Number of Samples')

    return hA

def PlotConfusionMatrix(vY: np.ndarray, vYPred: np.ndarray, hA: plt.Axes = None, lLabels: list = None, dScore: dict = None, titleStr: str = 'Confusion Matrix'):

    # Calculation of Confusion Matrix
    mConfMat = confusion_matrix(vY, vYPred)
    oConfMat = ConfusionMatrixDisplay(mConfMat, display_labels = lLabels)
    oConfMat = oConfMat.plot(ax = hA)
    hA = oConfMat.ax_
    if dScore is not None:
        titleStr += ':'
        for scoreName, scoreVal in  dScore.items():
            titleStr += f' {scoreName} = {scoreVal:0.2},'
        titleStr = titleStr[:-1]
    hA.set_title(titleStr)
    hA.grid(False)

    return hA
    

## Generate / Load Data


In [None]:
# Loading / Generating Data

#===========================Fill This===========================#
mX, vY = fetch_openml('mnist_784', version = 1, return_X_y = True, as_frame = False, parser = 'auto')
#===============================================================#

# The data has many samples, for fast run time we'll sub sample it

vSampleIdx = np.random.choice(mX.shape[0], numSamples, replace = False) #<! Without replacement, so no sample is duplicated across folds
mX = mX[vSampleIdx, :]
vY = vY[vSampleIdx]

print(f'The features data shape: {mX.shape}')
print(f'The labels data shape: {vY.shape}')
print(f'The unique values of the labels: {np.unique(vY)}')

In [None]:
# Pre Processing

# The image is in the range {0, 1, ..., 255}
# We scale it into [0, 1]

#===========================Fill This===========================#

mX = mX / 255.0

#===============================================================#

### Plot Data

In [None]:
# Display the Data

PlotMnistImages(mX, vY, numImg)

### Distribution of Labels

When dealing with classification, it is important to know the balance between the labels within the data set.
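
Beyond the histogram, the balance can be summarized numerically. The sketch below, on made up toy labels, computes the share of each class and the ratio between the largest and the smallest class. The variable names are illustrative assumptions, not part of the notebook.

```python
import numpy as np

vYToy = np.array([0, 1, 1, 2, 2, 2]) #<! Toy labels (Made up)

vLabels, vCounts = np.unique(vYToy, return_counts = True)
vRatio           = vCounts / vCounts.sum()         #<! Share of each class
imbalanceRatio   = vCounts.max() / vCounts.min()   #<! 1.0 means perfectly balanced

for valLabel, valRatio in zip(vLabels, vRatio):
    print(f'Class {valLabel}: {valRatio:0.2%} of the samples')
print(f'Largest to smallest class ratio: {imbalanceRatio:0.2f}')
```

A ratio close to 1 suggests accuracy is a reasonable score, while a large ratio hints that per class scores deserve attention.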

In [None]:
hA = PlotLabelsHistogram(vY)
plt.show()

## Cross Validation

The _Cross Validation_ process allows us to estimate the stability of performance.  
It is also the main tool for optimizing the model's _Hyper Parameters_.

### Cross Validation for Hyper Parameter Optimization

We can also use the _Cross Validation_ approach to search for the best _Hyper Parameter_.  
The idea is to iterate over the folds of the data and measure the score we care about for each candidate value.  
The hyper parameter which maximizes the score will be used for the production model.

* <font color='red'>(**?**)</font> What kind of a problem is this? Binary Class or Multi Class?
* <font color='red'>(**?**)</font> What kind of strategy will be used? Consult the documentation.
* <font color='brown'>(**#**)</font> When using `LinearSVC`:
    *   If #Samples > #Features -> Set `dual = False`.
    *   If #Samples < #Features -> Set `dual = True` (Default).
* <font color='brown'>(**#**)</font> If you experience convergence issues with `LinearSVC`, use `SVC`.

In [None]:
# Cross Validation for the C parameter
numC = len(lC)
mACC = np.zeros(shape = (numFold, numC)) #<! Accuracy per Fold and Model

oStrCv = StratifiedKFold(n_splits = numFold, random_state = seedNum, shuffle = True)

for ii, (vTrainIdx, vTestIdx) in enumerate(oStrCv.split(mX, vY)):
    print(f'Working on Fold #{(ii + 1):02d} Out of {numFold} Folds')
    #===========================Fill This===========================#
    # Setting the Train / Test split
    mXTrain = mX[vTrainIdx, :]
    vYTrain = vY[vTrainIdx]
    mXTest  = mX[vTestIdx, :]
    vYTest  = vY[vTestIdx]
    #===============================================================#
    for jj, C in enumerate(lC):
        print(f'Working on Model #{(jj + 1):02d} Out of {numC} Models with C = {C:0.4f}')
        #===========================Fill This===========================#
        # Set the model, train, score
        # Set `max_iter = maxItr`
        # Set `dual = False`
        oSvmCls     = LinearSVC(C = C, max_iter = maxItr, dual = False)
        # oSvmCls     = SVC(C = C)
        oSvmCls     = oSvmCls.fit(mXTrain, vYTrain)
        accScore    = oSvmCls.score(mXTest, vYTest)
        #===============================================================#
        mACC[ii, jj] = accScore



* <font color='red'>(**?**)</font> How can we accelerate the above calculation? Think about dependency between the scores, does it exist?
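
One possible direction: the score of each (fold, model) pair does not depend on any other pair, so the pairs may be evaluated concurrently. The sketch below illustrates the pattern with the standard library's `ThreadPoolExecutor`. The function `ScoreModel` is a hypothetical stand in for "fit on the train fold, score on the test fold", and its formula is made up. For CPU bound training a process based pool or a library such as `joblib` may be a better fit; that choice is left as an assumption here.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

import numpy as np

def ScoreModel(foldIdx: int, paramC: float) -> float:
    # Hypothetical stand in for training and scoring a model (Made up formula)
    return 1.0 / (1.0 + abs(paramC - 0.5)) - 0.01 * foldIdx

numFold = 3
lC      = [0.1, 0.5, 1.0]
numC    = len(lC)

lJobs = list(product(range(numFold), range(numC))) #<! All (fold, model) pairs, fold is the outer index

with ThreadPoolExecutor(max_workers = 4) as oPool:
    # `map()` keeps the order of `lJobs`, so the results can be reshaped safely
    lScore = list(oPool.map(lambda tJob: ScoreModel(tJob[0], lC[tJob[1]]), lJobs))

mACC = np.array(lScore).reshape(numFold, numC) #<! Accuracy per fold (Row) and model (Column)
print(mACC)
```

The reshape recovers the same `(numFold, numC)` layout as the sequential double loop, so the reduction over folds stays unchanged.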

In [None]:
# Calculate the score per model (Reduction)
# Average over the different folds

#===========================Fill This===========================#
vAvgAcc = np.mean(mACC, axis = 0)
#===============================================================#

* <font color='red'>(**?**)</font> In the above we used the mean as the reduction operator of many results into one. Can you think of others?
* <font color='blue'>(**!**)</font> Try using a different reduction method and see results.
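
As a hedged illustration of how the reduction can change the selected model, the sketch below compares the mean, the median and the minimum (worst fold) on a made up score matrix. The numbers are not results of this notebook; they are chosen so that one fold is an outlier for the second model.

```python
import numpy as np

# Made up accuracy per fold (Row) and model (Column)
mACCToy = np.array([[0.90, 0.92, 0.91],
                    [0.88, 0.93, 0.90],
                    [0.91, 0.80, 0.92]])

vMean   = np.mean(mACCToy, axis = 0)   #<! Sensitive to the outlier fold
vMedian = np.median(mACCToy, axis = 0) #<! Robust to a single outlier fold
vMin    = np.min(mACCToy, axis = 0)    #<! Worst case over the folds

print(f'Best model by mean  : {np.argmax(vMean)}')
print(f'Best model by median: {np.argmax(vMedian)}')
print(f'Best model by min   : {np.argmax(vMin)}')
```

In this toy case the median prefers the second model while the mean and the minimum prefer the third, since a single weak fold drags the second model down.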

In [None]:
# Plot Results

hF, hA = plt.subplots(figsize = FIG_SIZE_DEF)
hA.plot(lC, vAvgAcc)
hA.scatter(lC, vAvgAcc, s = 100)
hA.set_title(f'Accuracy Score as a Function of C - Average of {numFold} Folds')
hA.set_xlabel('C')
hA.set_ylabel('Accuracy')
hA.set_xticks(lC)
hA.grid()

plt.show()

* <font color='red'>(**?**)</font> Which range of `C` would you choose to fine tune over?
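
One possible approach, sketched under the assumption that the score is reasonably smooth in `C`, is to refine the grid between the two neighbors of the best coarse value. The coarse grid and the scores below are made up for illustration.

```python
import numpy as np

lCCoarse = [0.1, 0.5, 1.0, 1.5]                  #<! Coarse grid (Made up)
vAvgAccC = np.array([0.88, 0.91, 0.90, 0.89])    #<! Average accuracy per value (Made up)

bestIdx = int(np.argmax(vAvgAccC))
loIdx   = max(bestIdx - 1, 0)                    #<! Left neighbor, clipped at the grid edge
hiIdx   = min(bestIdx + 1, len(lCCoarse) - 1)    #<! Right neighbor, clipped at the grid edge

lCFine = list(np.linspace(lCCoarse[loIdx], lCCoarse[hiIdx], 11)) #<! Fine grid around the best value
print(f'Fine tuning range: [{lCFine[0]:0.2f}, {lCFine[-1]:0.2f}] with {len(lCFine)} values')
```

If the best value sits on the edge of the coarse grid, the clipping keeps the indices valid, though extending the coarse grid past that edge may be the better move.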

## Confusion Matrix

The confusion matrix is almost the whole story for classification problems.  

Train the model with the best parameter on the whole data and plot the _Confusion Matrix_.
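
To make the content of the matrix concrete, the sketch below builds it with NumPy only, using the same convention as `confusion_matrix()`: rows are the true labels, columns are the predicted labels. The helper name `CalcConfusionMatrix` and the toy labels are assumptions for illustration, and the labels are assumed to be integers in `{0, ..., numCls - 1}`.

```python
import numpy as np

def CalcConfusionMatrix(vY: np.ndarray, vYPred: np.ndarray, numCls: int) -> np.ndarray:
    # Hypothetical helper: counts the (True label, Predicted label) pairs
    mConfMat = np.zeros((numCls, numCls), dtype = int)
    np.add.at(mConfMat, (vY, vYPred), 1) #<! Accumulates correctly also for repeated pairs
    return mConfMat

vYToy     = np.array([0, 0, 1, 1, 2, 2]) #<! Toy true labels (Made up)
vYPredToy = np.array([0, 1, 1, 1, 2, 0]) #<! Toy predictions (Made up)

mConfMat = CalcConfusionMatrix(vYToy, vYPredToy, numCls = 3)
accScore = np.trace(mConfMat) / np.sum(mConfMat)         #<! Correct predictions over all samples
vRecall  = np.diag(mConfMat) / np.sum(mConfMat, axis = 1) #<! Per class recall (Row normalization)

print(mConfMat)
print(f'Accuracy: {accScore:0.2f}, Recall per class: {vRecall}')
```

The accuracy is a single number derived from the diagonal, while the off diagonal entries show which classes are confused with which.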

In [None]:
# Extract the optimal C
# Look at `np.argmax()`

#===========================Fill This===========================#
optC = lC[np.argmax(vAvgAcc)]
#===============================================================#

print(f'The optimal C value is C = {optC}')


In [None]:
# Plot the Confusion Matrix 

#===========================Fill This===========================#
oSvmCls = LinearSVC(C = optC, max_iter = maxItr, dual = False) #<! Same settings as in the cross validation loop
oSvmCls = oSvmCls.fit(mX, vY)
vYPred = oSvmCls.predict(mX)
dScore = {'Accuracy': np.mean(vYPred == vY)}
#===============================================================#

PlotConfusionMatrix(vY, vYPred, dScore = dScore) #<! The accuracy should be >= the one above!
plt.show()

* <font color='red'>(**?**)</font> Is the accuracy above higher or lower than the one of the _cross validation_? Why?
* <font color='blue'>(**!**)</font> Run the above using `SVC()` instead of `LinearSVC()`.