[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# Machine Learning Methods

## Supervised Learning - Confusion Matrix and Cross Validation 

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        | Content / Changes                                                  |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.000 | 21/01/2023 | Royi Avital | First version                                                      |
|         |            |             |                                                                    |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/MachineLearningMethods/2023_01/0011ConfMatCrossValidation.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.datasets import fetch_openml
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_predict, KFold, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Miscellaneous
import os
from platform import python_version
import random

# Typing
from typing import Tuple

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
#%matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF = (8, 8)
ELM_SIZE_DEF = 50
CLASS_COLOR = ('b', 'r')


In [None]:
# Fixel Algorithms Packages


In [None]:
# Parameters

numImg  = 3
vSize   = [28, 28] #<! Size of images

numSamples  = 10_000
trainRatio  = 0.55
testRatio   = 1 - trainRatio

# Data Visualization
elmSize     = 50
classColor0 = 'b'
classColor1 = 'r'

numGridPts = 250

In [None]:
# Auxiliary Functions

def PlotMnistImages(mX, vY, numImg, hF = None):

    numSamples  = mX.shape[0]
    numPx       = mX.shape[1]

    numRows = int(np.sqrt(numPx))

    tFigSize = (numImg * 3, numImg * 3)

    if hF is None:
        hF, hA = plt.subplots(numImg, numImg, figsize = tFigSize)
    else:
        hA = hF.axes #<! A `Figure` exposes its axes as `axes` (there is no `axis` attribute)
    
    hA = np.atleast_1d(hA) #<! To support numImg = 1
    hA = hA.flat

    
    for kk in range(numImg * numImg):
        idx = np.random.choice(numSamples)
        mI  = np.reshape(mX[idx, :], (numRows, numRows))
    
        # hA[kk].imshow(mI.clip(0, 1), cmap = 'gray')
        hA[kk].imshow(mI, cmap = 'gray')
        hA[kk].tick_params(axis = 'both', left = False, top = False, right = False, bottom = False, labelleft = False, labeltop = False, labelright = False, labelbottom = False)
        hA[kk].set_title(f'Index = {idx}, Label = {vY[idx]}')
    
    plt.show()

def PlotLabelsHistogram(vY: np.ndarray, hA = None):

    if hA is None:
        hF, hA = plt.subplots(figsize = (8, 6))
    
    vLabels, vCounts = np.unique(vY, return_counts = True)

    hA.bar(vLabels, vCounts, width = 0.9, align = 'center')
    hA.set_title('Histogram of Classes / Labels')
    hA.set_xlabel('Class')
    hA.set_ylabel('Number of Samples')

    return hA

def PlotConfusionMatrix(vY: np.ndarray, vYPred: np.ndarray, hA: plt.Axes = None, lLabels: list = None, dScore: dict = None, titleStr: str = 'Confusion Matrix'):

    # Calculation of Confusion Matrix
    mConfMat = confusion_matrix(vY, vYPred)
    oConfMat = ConfusionMatrixDisplay(mConfMat, display_labels = lLabels)
    oConfMat = oConfMat.plot(ax = hA)
    hA = oConfMat.ax_
    if dScore is not None:
        titleStr += ':'
        for scoreName, scoreVal in  dScore.items():
            titleStr += f' {scoreName} = {scoreVal:0.2},'
        titleStr = titleStr[:-1]
    hA.set_title(titleStr)
    hA.grid(False)

    return hA
    


## Generate / Load Data


In [None]:
# Loading / Generating Data
mX, vY = fetch_openml('mnist_784', version = 1, return_X_y = True, as_frame = False, parser = 'auto')

print(f'The features data shape: {mX.shape}')
print(f'The labels data shape: {vY.shape}')
print(f'The unique values of the labels: {np.unique(vY)}')

In [None]:
# Pre Processing

# The image is in the range {0, 1, ..., 255}
# We scale it into [0, 1]

mX = mX / 255

* <font color='brown'>(**#**)</font> Try to do the scaling in place with `mX /= 255.0`. It will fail. Try to understand why.
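
A minimal sketch related to the note above, assuming the loaded pixel array has an integer `dtype` (the actual `dtype` returned by `fetch_openml()` may depend on the parser and the package version). The array `mA` is a hypothetical stand in for `mX`.

```python
import numpy as np

# Hypothetical stand in for `mX`, assumed to hold integer pixel values
mA = np.array([[0, 128, 255]], dtype = np.int64)

# Out of place division allocates a new floating point array
mB = mA / 255
print(f'The data type of the scaled copy: {mB.dtype}')

# In place division must write the floating point result into the integer buffer
inPlaceFailed = False
try:
    mA /= 255.0
except TypeError as e:
    inPlaceFailed = True
    print(f'In place division failed: {e}')
```

Under this assumption the out of place form works because the result gets its own floating point buffer, while the in place form is asked to store floats in an integer array.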

In [None]:
# The data has many samples, for fast run time we'll sub sample it

vSampleIdx = np.random.choice(mX.shape[0], numSamples, replace = False)
mX = mX[vSampleIdx, :]
vY = vY[vSampleIdx]

print(f'The features data shape: {mX.shape}')
print(f'The labels data shape: {vY.shape}')
print(f'The unique values of the labels: {np.unique(vY)}')

### Plot Data

In [None]:
# Display the Data

PlotMnistImages(mX, vY, numImg)

### Distribution of Labels

When dealing with classification, it is important to know the balance between the labels within the data set.

In [None]:
# Distribution of Labels

hA = PlotLabelsHistogram(vY)
plt.show()

## Train / Test Split

In this section we'll split the data into 2 sub sets: _Train_ and _Test_.

<font color='red'>(**?**)</font> The split will be random. What could be the issue with that? Think of the balance of labels.

In [None]:
# SciKit Learn has a built in tool for this split
# It can take ratios or integer numbers.
# In case only `train_size` or `test_size` is given the other one is the rest of the data.
mXTrain, mXTest, vYTrain, vYTest = train_test_split(mX, vY, train_size = trainRatio, test_size = testRatio, random_state = seedNum)

print(f'The train features data shape: {mXTrain.shape}')
print(f'The train labels data shape: {vYTrain.shape}')
print(f'The test features data shape: {mXTest.shape}')
print(f'The test labels data shape: {vYTest.shape}')

In [None]:
# Distribution of classes in train data

hA = PlotLabelsHistogram(vYTrain)
hA.set_title('Histogram of Classes for the Train Data')
plt.show()

In [None]:
# Distribution of classes in test data

hA = PlotLabelsHistogram(vYTest)
hA.set_title('Histogram of Classes for the Test Data')
plt.show()

* <font color='red'>(**?**)</font> Do you see the same distribution in both sets? What does it mean?
* <font color='blue'>(**!**)</font> Use the `stratify` option in `train_test_split()` and look at the results.
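
A small sketch of the effect of the `stratify` option on synthetic data. The imbalanced labels `vL` and the dummy features `mF` are assumptions made for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (an assumption for illustration): 90% class 0, 10% class 1
vL = np.array([0] * 900 + [1] * 100)
mF = np.arange(vL.size).reshape(-1, 1) #<! Dummy features

# `stratify = vL` asks each subset to keep the label proportions of `vL`
mFTrain, mFTest, vLTrain, vLTest = train_test_split(mF, vL, train_size = 0.5, random_state = 0, stratify = vL)

trainRatio1 = np.mean(vLTrain == 1)
testRatio1  = np.mean(vLTest == 1)

print(f'Ratio of class 1 in the train set: {trainRatio1:0.3f}')
print(f'Ratio of class 1 in the test set : {testRatio1:0.3f}')
```

Without `stratify` the two ratios are only equal on average, so a single random split may drift, especially for rare classes.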

## Train a K-NN Model

In this section we'll train a K-NN model on the train data set and test its performance on the test data set.

In [None]:
K = 1
oKnnCls = KNeighborsClassifier(n_neighbors = K)
oKnnCls = oKnnCls.fit(mXTrain, vYTrain)

<font color='red'>(**?**)</font> What would be the score on the _train set_?  
<font color='red'>(**?**)</font> What would be the relation between the performance on the _train set_ vs. _test set_?

In [None]:
# Prediction on the Train Set

rndIdx  = np.random.randint(mXTrain.shape[0])
yPred = oKnnCls.predict(np.atleast_2d(mXTrain[rndIdx, :])) #<! The input must be 2D data
PlotMnistImages(np.atleast_2d(mXTrain[rndIdx, :]), yPred, 1)

In [None]:
# Prediction on the Test Set

rndIdx  = np.random.randint(mXTest.shape[0])
yPred = oKnnCls.predict(np.atleast_2d(mXTest[rndIdx, :])) #<! The input must be 2D data
PlotMnistImages(np.atleast_2d(mXTest[rndIdx, :]), yPred, 1)

* <font color='blue'>(**!**)</font> Find the sample in the train data set which is closest to the sample above.
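
One possible sketch of a nearest sample search, assuming the Euclidean distance (the default metric of `KNeighborsClassifier`). The arrays `mTrain` and `vQuery` are hypothetical stand ins for `mXTrain` and the displayed sample.

```python
import numpy as np

# Hypothetical stand ins for `mXTrain` and a single query sample
oRng   = np.random.default_rng(0)
mTrain = oRng.random((20, 4))
vQuery = mTrain[7] + 1e-3 #<! A slightly perturbed copy of train sample 7

# Squared Euclidean distance from the query to every train sample
vDist = np.sum(np.square(mTrain - vQuery), axis = 1)
nearestIdx = int(np.argmin(vDist)) #<! Index of the closest train sample

print(f'The index of the closest train sample: {nearestIdx}')
```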

### Confusion Matrix and Score on Train and Test Sets

In this section we'll evaluate the performance of the model on the train and test sets.  
The `SciKit Learn` package has some built in functions / classes to display those: `confusion_matrix()`, `ConfusionMatrixDisplay`.
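
To make the layout of the matrix concrete, here is a hand computed sketch on toy labels (the labels are an assumption for illustration). It follows the `confusion_matrix()` convention: rows are the true labels, columns are the predicted labels.

```python
import numpy as np

# Toy labels (an assumption for illustration)
vT = np.array([0, 0, 1, 1, 2, 2, 2]) #<! True labels
vP = np.array([0, 1, 1, 1, 2, 0, 2]) #<! Predicted labels

numCls = 3
mC = np.zeros((numCls, numCls), dtype = int)
for trueLbl, predLbl in zip(vT, vP):
    mC[trueLbl, predLbl] += 1 #<! Row: true label, Column: predicted label

accVal = np.trace(mC) / np.sum(mC) #<! Correct predictions lie on the diagonal

print(mC)
print(f'Accuracy: {accVal:0.3f}')
```

The off diagonal entries show which pairs of labels are confused with each other, which the single accuracy number hides.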

In [None]:
# Computing the prediction per set
vYTrainPred = oKnnCls.predict(mXTrain) #<! Predict train set
vYTestPred  = oKnnCls.predict(mXTest)  #<! Predict test set

trainAcc = oKnnCls.score(mXTrain, vYTrain)
testAcc  = oKnnCls.score(mXTest, vYTest)

In [None]:
# Plot the Confusion Matrix

hF, hA = plt.subplots(nrows = 1, ncols = 2, figsize = (14, 6)) #<! Figure

# Arranging data for the plot function
lConfMatData = [{'vY': vYTrain, 'vYPred': vYTrainPred, 'hA': hA[0], 'dScore': {'Accuracy': trainAcc}, 'titleStr': 'Train - Confusion Matrix'},
{'vY': vYTest, 'vYPred': vYTestPred, 'hA': hA[1], 'dScore': {'Accuracy': testAcc}, 'titleStr': 'Test - Confusion Matrix'}]

for ii in range(2):
    PlotConfusionMatrix(**lConfMatData[ii])

plt.show()

* <font color='red'>(**?**)</font> Look at the most probable error per label, does it make sense?
* <font color='red'>(**?**)</font> What do you expect to happen with a different `K`?
* <font color='blue'>(**!**)</font> Run the above with different values of `K`.

## Cross Validation

The _Cross Validation_ process allows us to estimate the stability of performance.  
It is also the main tool for optimizing the model's _Hyper Parameters_.

### Cross Validation as a Measure of Test Performance

Let's check whether cross validation indeed gives a better estimate of the performance on the test set.  
We can do that using _Cross Validation_ on the training set: the label of each sample is predicted by a model trained only on the other folds.  
We'll use a shuffled K-Fold Cross Validation. Its stratified variant keeps the distribution of the labels intact in each fold.
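
The idea of predicting each sample from the other folds can be sketched by hand. This is a simplified illustration of out of fold prediction with a 1-NN rule on synthetic data, not the actual implementation of `cross_val_predict()`. The data `mD`, `vL` and the fold handling are assumptions.

```python
import numpy as np

oRng = np.random.default_rng(0)

# Synthetic two class data (an assumption for illustration): two well separated blobs
numPerCls = 30
mD = np.concatenate((oRng.normal(0.0, 0.5, (numPerCls, 2)), oRng.normal(5.0, 0.5, (numPerCls, 2))))
vL = np.repeat([0, 1], numPerCls)

numFold  = 5
vIdx     = oRng.permutation(vL.size)     #<! Shuffle before splitting
lFoldIdx = np.array_split(vIdx, numFold) #<! Indices of each validation fold

vLPred = np.full(vL.size, -1)
for vValIdx in lFoldIdx:
    vTrainIdx = np.setdiff1d(vIdx, vValIdx) #<! All samples outside the fold
    # 1-NN: Each validation sample takes the label of its closest train sample
    mDist = np.linalg.norm(mD[vValIdx][:, None, :] - mD[vTrainIdx][None, :, :], axis = 2)
    vLPred[vValIdx] = vL[vTrainIdx[np.argmin(mDist, axis = 1)]]

cvAcc = np.mean(vLPred == vL)
print(f'Cross validation accuracy: {cvAcc:0.3f}')
```

Each sample is predicted exactly once, and never by a model which has seen it. This is why the resulting score is a fairer proxy for the test score than the score on the train set.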

In [None]:
# Predicting the classes using Cross Validation
numFold = 10

vYTrainPred = cross_val_predict(KNeighborsClassifier(n_neighbors = K), mXTrain, vYTrain, cv = KFold(numFold, shuffle = True))
trainAcc = np.mean(vYTrainPred == vYTrain)


* <font color='blue'>(**!**)</font> Change the values of `numFold`. Try extreme values. What happens?
* <font color='green'>(**@**)</font> Repeat the above with `StratifiedKFold()`.
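
A small sketch of what stratification changes at the fold level. It counts the samples of a rare class in each validation fold for `KFold` and `StratifiedKFold`. The imbalanced labels are an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic imbalanced labels (an assumption for illustration): 80 / 20 split
vL = np.array([0] * 80 + [1] * 20)
mF = np.zeros((vL.size, 1)) #<! Dummy features, only the labels matter here

numFold = 5
dSplit  = {'KFold': KFold(numFold, shuffle = True, random_state = 0),
           'StratifiedKFold': StratifiedKFold(numFold, shuffle = True, random_state = 0)}

dCount = {}
for splitName, oSplit in dSplit.items():
    # Number of class 1 samples in each validation fold
    dCount[splitName] = [int(np.sum(vL[vValIdx] == 1)) for _, vValIdx in oSplit.split(mF, vL)]
    print(f'{splitName}: {dCount[splitName]}')
```

The stratified splitter spreads the rare class evenly over the folds, while the plain splitter leaves the per fold counts to chance.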

In [None]:
# Plot the Confusion Matrix

hF, hA = plt.subplots(nrows = 1, ncols = 2, figsize = (14, 6)) #<! Figure

# Arranging data for the plot function
lConfMatData = [{'vY': vYTrain, 'vYPred': vYTrainPred, 'hA': hA[0], 'dScore': {'Accuracy': trainAcc}, 'titleStr': 'Train - Confusion Matrix'},
{'vY': vYTest, 'vYPred': vYTestPred, 'hA': hA[1], 'dScore': {'Accuracy': testAcc}, 'titleStr': 'Test - Confusion Matrix'}]

for ii in range(2):
    PlotConfusionMatrix(**lConfMatData[ii])

plt.show()

# TODO: Show in percentage

### Cross Validation for Hyper Parameter Optimization

We can also use the _Cross Validation_ approach to search for the best _Hyper Parameter_.  
The idea is to iterate over the candidate values and measure the score we care about for each of them.  
The hyper parameter which maximizes the score will be used for the production model.

* <font color='brown'>(**#**)</font> Usually, once we set the optimal _hyper parameters_ we'll re train the model on the whole data set.
* <font color='brown'>(**#**)</font> We'll learn how to automate this process later using built in tools, but the idea is the same.
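
As a hedged preview, one built in tool of this kind in SciKit Learn is `GridSearchCV`. Whether it is the tool used later in the course is an assumption. The sketch below runs on synthetic data (also an assumption) and scores each candidate by the mean accuracy over the folds, which differs slightly from pooling the out of fold predictions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration)
mD, vL = make_blobs(n_samples = 300, centers = 3, cluster_std = 2.0, random_state = 0)

lK     = list(range(1, 13, 2)) #<! Candidate values of K
oSplit = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 0)

# Every candidate K is scored by cross validation.
# With `refit = True` (the default) the best candidate is re trained on the whole data.
oGrid = GridSearchCV(KNeighborsClassifier(), param_grid = {'n_neighbors': lK}, scoring = 'accuracy', cv = oSplit)
oGrid = oGrid.fit(mD, vL)

bestK = oGrid.best_params_['n_neighbors']
print(f'Best K: {bestK}, Mean cross validation accuracy: {oGrid.best_score_:0.3f}')
```

The fitted object exposes the re trained model as `oGrid.best_estimator_`, which matches the note above about training on the whole data set once the hyper parameters are set.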

In [None]:
# Cross Validation for the K parameter
numFold = 10

lK = list(range(1, 13, 2)) #<! Range of values of K
numK = len(lK)

lAcc = [None] * numK

for ii, K in enumerate(lK):
    vYTrainPred = cross_val_predict(KNeighborsClassifier(n_neighbors = K), mX, vY, cv = StratifiedKFold(numFold, shuffle = True))
    lAcc[ii] = np.mean(vYTrainPred == vY)


In [None]:
# Plot Results

hF, hA = plt.subplots(figsize = FIG_SIZE_DEF)
hA.plot(lK, lAcc)
hA.scatter(lK, lAcc, s = 100)
hA.set_title('Accuracy Score as a Function of K')
hA.set_xlabel('K')
hA.set_ylabel('Accuracy')
hA.set_xticks(lK)
hA.grid()

plt.show()

* <font color='red'>(**?**)</font> What's the optimal `K`?
* <font color='red'>(**?**)</font> What's the _Dynamic Range_ of the results? Think again about the question above.