[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io/)

# AI Program

## Exercise 0005 - Classification

Feature engineering for color classification.

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        | Content / Changes                                                  |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 17/03/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/Exercise0005.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import scipy.io #<! Makes sure `sp.io` is loaded (Needed for `loadmat()` on older SciPy versions)
import pandas as pd

# Machine Learning
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Miscellaneous
import gdown
import os
import random
import shutil

# Typing
from typing import Callable, Dict, List, Optional, Set, Tuple, Union

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

 ```python
 valToFill = ???
 ```

 - Multi Line to Fill (At least one)

 ```python
 # You need to start writing
 ????
 ```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

???
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

DATA_SET_FILE_URL   = r'https://drive.google.com/uc?export=download&confirm=9iBg&id=17-IWjWCPuXMSO0uUWKVDIO-NT38SoB8J'
DATA_SET_FILE_NAME  = 'ColorClassification.zip'

TEST_DATA_FILE_NAME  = 'TestData.mat'
TRAIN_DATA_FILE_NAME = 'TrainData.mat'

L_CLASSES   = ['Red', 'Green', 'Blue']
IMG_SIZE    = [100, 100]



In [None]:
# Course Packages

from DataVisualization import PlotConfusionMatrix, PlotLabelsHistogram


In [None]:
# General Auxiliary Functions

def PlotImage(vX, imgClass = None, imgSize = IMG_SIZE, hA = None):

    mI = np.reshape(vX, (imgSize[0], imgSize[1], 3), order = 'F') #<! Data is coming from MATLAB

    if hA is None:
        hF, hA = plt.subplots(figsize = (4, 4))
    
    hA.imshow(mI)
    hA.tick_params(axis = 'both', left = False, top = False, right = False, bottom = False, labelleft = False, labeltop = False, labelright = False, labelbottom = False)

    if imgClass is not None:
        hA.set_title(f'Image Class: {imgClass}')

    return hA


def PlotImages(mX: np.ndarray, vY: np.ndarray, numRows: int, numCols: int, lClass = L_CLASSES, hF = None):

    numSamples  = mX.shape[0]
    numPx       = mX.shape[1]

    numImg = numRows * numCols

    tFigSize = (numCols * 3, numRows * 3) #<! Matplotlib's `figsize` is (width, height)

    if hF is None:
        hF, hA = plt.subplots(nrows = numRows, ncols = numCols, figsize = tFigSize)
    else:
        hA = hF.axes #<! List of the axes of the given figure
    
    hA = np.atleast_1d(hA) #<! To support numImg = 1
    hA = hA.flat

    vIdx = np.random.choice(numSamples, numImg, replace = False)
    
    for kk in range(numImg):
        imgIdx  = vIdx[kk]

        PlotImage(mX[imgIdx], hA = hA[kk])
        hA[kk].set_title(f'Index = {imgIdx}, Label = {lClass[vY[imgIdx]]}')
    
    plt.show()



## Exercise

This exercise introduces:

 - Concept of _Features Transform_ to reduce the amount of data.
 - Optimizing a classifier by the accuracy score.


* <font color='brown'>(**#**)</font> While in this case the _dimensionality reduction_ of data is done manually by domain knowledge, later in the course we'll learn ML based methods.
* <font color='brown'>(**#**)</font> There is more than one way to implement the exercise. Feel free to wander.

In this exercise we'll work on images which are stored as the rows of a data matrix.  
Each image is of size `100 x 100 x 3`, yet it is flattened in a _column stack_ fashion into a single row of the data matrix.  
The train data has `2700` images, so the matrix size is `2_700 x 30_000`.

The objective is to classify the color of the image: `Red: 0`, `Green: 1`, `Blue: 2`.  
The color of an image isn't uniform, each image contains many colors, yet the idea is to identify the dominant color.  
The concept of _red_ / _green_ / _blue_ is not a single color but the _family_ of colors.
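
The _column stack_ layout can be illustrated on a tiny image. The following is a minimal sketch, assuming a hypothetical `2 x 3 x 3` image instead of the `100 x 100 x 3` images of the exercise. It shows why `order = 'F'` is needed when reshaping a row of the data matrix.

```python
import numpy as np

# Hypothetical tiny image: 2 rows x 3 columns x 3 channels
numRows, numCols = 2, 3
tI = np.arange(numRows * numCols * 3).reshape(numRows, numCols, 3)

# Column stack (MATLAB / Fortran order): rows vary fastest, then columns, then channels
vX = np.reshape(tI, -1, order = 'F')

# Restoring the image requires the same order
tIRec = np.reshape(vX, (numRows, numCols, 3), order = 'F')

# Pixels x Channels view: each column holds a single color channel
mPx = np.reshape(vX, (numRows * numCols, 3), order = 'F')

print(np.array_equal(tI, tIRec)) #<! True
print(np.array_equal(mPx[:, 0], vX[:(numRows * numCols)])) #<! True
```

In this layout the first third of the vector is the red channel, the second third the green channel and the last third the blue channel.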

1. Load data into `mXTrain`, `vYTrain`, `mXTest`, `vYTest`.  
   Data is downloaded and loaded by the notebook.  
   **Make sure internet connection is available**.
2. Extract features from the images for the training data.  
   Analyze the features using EDA and select the subset which you think will work best.
3. Train a Kernel SVM (`rbf`) model on the data.  
   Optimize the `C` and `gamma` parameters for accuracy using grid search.
4. Extract the same features from the test images into test data.
5. Plot the _confusion matrix_ of the best model on the test data.

Optimize features (repeat if needed) to get accuracy of at least `85%` per class.

In [None]:
# Parameters

#===========================Fill This===========================#
# 1. Think of the parameters to optimize per model (See above).
# 2. Select the set to optimize over.
# 3. Set the number of folds in the cross validation.
lC = [0.1, 0.5, 1, 3]
lγ = ['scale', 'auto', 0.05, 0.5, 1.00, 3.00]
numFold = 5
#===============================================================#

## Generate / Load Data

Load the classification data set.

In [None]:
# Load Data

if not (os.path.isfile(TEST_DATA_FILE_NAME) and os.path.isfile(TRAIN_DATA_FILE_NAME)):
    # Delete files if only one exists
    if os.path.isfile(TEST_DATA_FILE_NAME):
        os.remove(TEST_DATA_FILE_NAME)
    if os.path.isfile(TRAIN_DATA_FILE_NAME):
        os.remove(TRAIN_DATA_FILE_NAME)
    if os.path.isfile(DATA_SET_FILE_NAME):
        os.remove(DATA_SET_FILE_NAME)
    gdown.download(DATA_SET_FILE_URL, DATA_SET_FILE_NAME)
    shutil.unpack_archive(DATA_SET_FILE_NAME)
    os.remove(DATA_SET_FILE_NAME)

dTestData  = sp.io.loadmat(TEST_DATA_FILE_NAME)
dTrainData = sp.io.loadmat(TRAIN_DATA_FILE_NAME)

mXTrain, vYTrain    = dTrainData['mX'], np.squeeze(dTrainData['vY'])
mXTest, vYTest      = dTestData['mX'], np.squeeze(dTestData['vY'])

print(f'The number of training data samples: {mXTrain.shape[0]}')
print(f'The number of training features per sample: {mXTrain.shape[1]}') 


print(f'The number of test data samples: {mXTest.shape[0]}')
print(f'The number of test features per sample: {mXTest.shape[1]}') 

### Plot Data

A useful plot for multi features data is the _pair plot_ (See `SeaBorn`'s [`pairplot()`](https://seaborn.pydata.org/generated/seaborn.pairplot.html)).  
The pair plot easily gives a view on the:

1. Relation between each pair of the features.
2. Distribution of each feature.

It is an important tool for observation of the features and their interrelation.

* <font color='brown'>(**#**)</font> You may read on it in [Data Exploration and Visualization with SeaBorn Pair Plots](https://scribe.rip/40e6d3450f6d).
* <font color='brown'>(**#**)</font> The plots matrix is $n \times n$ where $n$ is the number of features. Hence it is not feasible for $n \gg 1$.



In [None]:
# Plot the Data

# Train Data
PlotImages(mXTrain, vYTrain, 3, 3, lClass = L_CLASSES)

In [None]:
# Plot the Data

# Test Data
PlotImages(mXTest, vYTest, 3, 3, lClass = L_CLASSES)

In [None]:
# Histogram of Classes

# Train
hA = PlotLabelsHistogram(vYTrain, lClass = L_CLASSES)
hA.set_title(hA.get_title() + ' - Train Data')
plt.show()

In [None]:
# Histogram of Classes

# Test
hA = PlotLabelsHistogram(vYTest, lClass = L_CLASSES)
hA.set_title(hA.get_title() + ' - Test Data')
plt.show()

* <font color='red'>(**?**)</font> Is the data balanced or imbalanced?

## Training Data and Feature Engineering / Extraction

The raw vector of pixel values doesn't fit, as is, for classification with an SVM.  
It hides much of the information given by the structure of the image and of a color pixel.  
In our case, the important thing is to give the classifier information about the structure of color, a vector of 3 values: `[r, g, b]`.  
Yet the classifier input is limited to a flat list of values. This is where the concept of a metric comes into play.  

We need to create information about distance between colors.  
We also need to extract features to represent the colors in the image.
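
A distance between colors can be sketched as follows. The reference colors and the image color are hypothetical values. In the exercise the reference colors would be estimated from the training data only.

```python
import numpy as np

# Hypothetical reference color per class, values in [0, 1]
mRefColor = np.array([[0.8, 0.2, 0.2],  #<! Red
                      [0.2, 0.8, 0.2],  #<! Green
                      [0.2, 0.2, 0.8]]) #<! Blue

vImgColor = np.array([0.6, 0.3, 0.25]) #<! Mean color of a hypothetical image

# Euclidean distance between the image color and each reference color
vDist = np.linalg.norm(mRefColor - vImgColor[None, :], axis = 1)

print(vDist)
print(np.argmin(vDist)) #<! 0, the closest reference is the red one
```

The vector of distances, one value per class, is a compact feature vector which replaces the `30_000` raw values of the image.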

In this section the tasks are:

1. Implement functions to extract features from the data.
2. Arrange the features in a _matrix_ / _data frame_ for processing.
3. Explore the features using _SeaBorn_. Specifically, check whether the features extract meaningful information.

* <font color='brown'>(**#**)</font> Don't include _test data_ in the analysis for feature extraction. Otherwise, data leakage will happen.

### Ideas for Features

1. The distance between the _mean_ / _median_ / _mode_ color of the image and the _mean_ / _median_ / _mode_ color of each class.
2. The distance between the quantized histogram of the `R` / `G` / `B` color channels of the image and the matching histogram of each class.
3. The distance of the mean color at the center of the image to the mean color of the class.
4. The channel with the maximum value (Is this a continuous value? Does it fit the SVM model?).
5. Use of the _HSL_ color space.


* <font color='brown'>(**#**)</font> You're encouraged to think of more features!
* <font color='brown'>(**#**)</font> Pay attention to the dimensionality of the data. For instance, how do you define the _median color_?
* <font color='brown'>(**#**)</font> For simplicity we use the RGB Color Space. Yet color distance might be better calculated in other color spaces (See LAB for instance).
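
The HSL idea could be sketched as below, using the `colorsys` module of the Python standard library. The function name `CalcHueFeatures` and the choice to encode the hue as a point on the unit circle are assumptions for illustration, not part of the exercise.

```python
import colorsys
import numpy as np

def CalcHueFeatures( vC: np.ndarray ) -> np.ndarray:
    # vC: RGB color with values in [0, 1] (For instance, the mean color of an image)
    # `rgb_to_hls()` returns (Hue, Lightness, Saturation), all in [0, 1]
    hueVal, _, satVal = colorsys.rgb_to_hls(float(vC[0]), float(vC[1]), float(vC[2]))
    hueAng = 2 * np.pi * hueVal
    # The hue is circular (0 and 1 are both red), hence it is encoded as a point on the unit circle
    return np.array([np.cos(hueAng), np.sin(hueAng), satVal])

print(CalcHueFeatures(np.array([0.9, 0.1, 0.1]))) #<! Reddish: Hue angle of 0 degrees
print(CalcHueFeatures(np.array([0.1, 0.9, 0.1]))) #<! Greenish: Hue angle of 120 degrees
print(CalcHueFeatures(np.array([0.1, 0.1, 0.9]))) #<! Bluish: Hue angle of 240 degrees
```

The circular encoding keeps 2 reddish colors on both sides of the wrap around point close to each other, which a raw hue value in `[0, 1]` would not.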

In [None]:
# Functions for Feature Extraction
#===========================Fill This===========================#
# 1. Some functions work per image, some over the whole data set (Comparing stuff).
# 2. You may want to extract statistical information from the training data and use metric between a single image and the statistical data.

# Mean Color Per Class
def CalcMeanColorPerClass( mX, vY, imgSize = IMG_SIZE ):
    # Assuming input data is UINT8
    
    vClass = np.unique(vY)
    mColor = np.zeros(shape = (vClass.shape[0], 3)) #<! Each row is a class

    for ii, classIdx in enumerate(vClass):
        numImg = np.sum(vY == classIdx)
        mD = np.reshape(mX[vY == classIdx], (numImg, imgSize[0] * imgSize[1], 3), order = 'F') #<! Data is column stacked
        mColor[ii, :] = np.mean(mD, axis = (0, 1))
    
    return mColor / 255.0

# Mean Color per Image
def CalcMeanColor( vX, imgSize = IMG_SIZE ):

    mI = np.reshape(vX, (imgSize[0] * imgSize[1], 3), order = 'F') #<! Data is column stacked

    return np.mean(mI, axis = 0) / 255.0

#===============================================================#

In [None]:
# Functions for Feature Extraction
#===========================Fill This===========================#
# 1. Some functions work per image, some over the whole data set (Comparing stuff).
# 2. You may want to extract statistical information from the training data and use metric between a single image and the statistical data.

# Mean Histogram per Channel per Class
def CalcRgbHistPerClass( mX, vY, imgSize = IMG_SIZE, lHist = [0, 64, 128, 192, 255]):

    vClass = np.unique(vY)
    tH = np.zeros(shape = (3, len(lHist) - 1, vClass.shape[0])) #<! Color x #Bins x #Classes

    for ii, classIdx in enumerate(vClass):
        numImg = np.sum(vY == classIdx)
        mD = np.reshape(mX[vY == classIdx], (numImg * imgSize[0] * imgSize[1], 3), order = 'F') #<! Data is column stacked
        for jj in range(3):
            # Color Channel
            # mD is ((numImg * imgSize[0] * imgSize[1]) x 3)
            vH, _ = np.histogram(mD[:, jj], bins = lHist)
            tH[jj, :, ii] = vH / np.sum(vH)
    
    return tH

# Histogram per Channel (Single Image)
def CalcHistogram( vX, imgSize = IMG_SIZE, lHist = [0, 64, 128, 192, 255] ):

    mH = np.zeros(shape = (3, len(lHist) - 1))

    mI = np.reshape(vX, (imgSize[0] * imgSize[1], 3), order = 'F') #<! Data is column stacked

    for ii in range(3):
        vH, _ = np.histogram(mI[:, ii], bins = lHist)
        mH[ii] = vH / np.sum(vH)
    
    return mH

#===============================================================#

In [None]:
# Functions for Feature Extraction
#===========================Fill This===========================#
# 1. Some functions work per image, some over the whole data set (Comparing stuff).
# 2. You may want to extract statistical information from the training data and use metric between a single image and the statistical data.

# Channel Value to Mean Value Ratio
# The ratio between the mean value of the channel to the mean value of all pixels.
def MeanChannelValueMeanValueRatio(vX, imgSize = IMG_SIZE):
    
    vMeanColor = CalcMeanColor(vX, imgSize = imgSize)
    
    return vMeanColor / np.mean(vMeanColor)

#===============================================================#

In [None]:
# Features Matrix
# Function to Create the Features Matrix Given the RAW Data.

#===========================Fill This===========================#
# 1. Create a function that given the RAW data and other parameters calculates the feature matrix.
# 2. It should handle both the training and the test data, yet pass no information between them.
# 3. The output dimensions should match the number of samples of the input and the number of features: (numSamples, numFeatures).
# 4. Make sure the order of processing keeps it aligned with the labels vector.

lFeatureName = ['Red vs. Mean', 'Green vs. Mean', 'Blue vs. Mean', 'Red Channel Hist Distance Class 0', 'Green Channel Hist Distance Class 0', 'Blue Channel Hist Distance Class 0', 'Red Channel Hist Distance Class 1', 'Green Channel Hist Distance Class 1', 'Blue Channel Hist Distance Class 1', 'Red Channel Hist Distance Class 2', 'Green Channel Hist Distance Class 2', 'Blue Channel Hist Distance Class 2', 'Mean Pixel Distance Class 0', 'Mean Pixel Distance Class 1', 'Mean Pixel Distance Class 2']

# Creating the Features Matrix
# The matrix is numSamples x numFeatures
# The features are (15): ratioR, ratioG, ratioB, redHistDisCls0, greenHistDisCls0, blueHistDisCls0, redHistDisCls1, greenHistDisCls1, blueHistDisCls1, redHistDisCls2, greenHistDisCls2, blueHistDisCls2, meanPxDisCls0, meanPxDisCls1, meanPxDisCls2
# Hence, for the training data, the matrix is 2700 x 15

def CalcFeaturesMatrix( mD, tH, mC ):

    numSamples = mD.shape[0]

    mX = np.zeros(shape = (numSamples, 15))
    
    for ii in range(numSamples):
        vR = MeanChannelValueMeanValueRatio(mD[ii])
        mH = CalcHistogram(mD[ii])
        vC = CalcMeanColor(mD[ii])
        for jj in range(15):
            if jj < 3:
                mX[ii, jj] = vR[jj]
            # The next section could be written in a vectorized manner yet written for clarity
            elif jj == 3:
                mX[ii, jj] = np.linalg.norm(mH[0] - tH[0, :, 0]) #!< Red Channel, Class 0
            elif jj == 4:
                mX[ii, jj] = np.linalg.norm(mH[1] - tH[1, :, 0]) #!< Green Channel, Class 0
            elif jj == 5:
                mX[ii, jj] = np.linalg.norm(mH[2] - tH[2, :, 0]) #!< Blue Channel, Class 0
            elif jj == 6:
                mX[ii, jj] = np.linalg.norm(mH[0] - tH[0, :, 1]) #!< Red Channel, Class 1
            elif jj == 7:
                mX[ii, jj] = np.linalg.norm(mH[1] - tH[1, :, 1]) #!< Green Channel, Class 1
            elif jj == 8:
                mX[ii, jj] = np.linalg.norm(mH[2] - tH[2, :, 1]) #!< Blue Channel, Class 1
            elif jj == 9:
                mX[ii, jj] = np.linalg.norm(mH[0] - tH[0, :, 2]) #!< Red Channel, Class 2
            elif jj == 10:
                mX[ii, jj] = np.linalg.norm(mH[1] - tH[1, :, 2]) #!< Green Channel, Class 2
            elif jj == 11:
                mX[ii, jj] = np.linalg.norm(mH[2] - tH[2, :, 2]) #!< Blue Channel, Class 2
            elif jj == 12:
                mX[ii, jj] = np.linalg.norm(vC - mC[0]) #!<Class 0
            elif jj == 13:
                mX[ii, jj] = np.linalg.norm(vC - mC[1]) #!<Class 1
            elif jj == 14:
                mX[ii, jj] = np.linalg.norm(vC - mC[2]) #!<Class 2
    
    return mX

#===============================================================#

In [None]:
# Create Features

#===========================Fill This===========================#
# 1. Calculate the Features Matrix for the Training Data Set.
# 2. Name the features matrix `mF`.
mC = CalcMeanColorPerClass(mXTrain, vYTrain) #<! Mean pixel per Class
tH = CalcRgbHistPerClass(mXTrain, vYTrain) #<! Mean histogram per channel per class
mF = CalcFeaturesMatrix(mXTrain, tH, mC) #<! The features matrix
#===============================================================#

* <font color='brown'>(**#**)</font> One could optimize the histogram by creating a 3D histogram.
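
A joint 3D histogram of the color could be computed with NumPy's `histogramdd()`, as in the following minimal sketch. The pixels are random values, used only to show the mechanics. The bin edges match the ones used for the per channel histograms above.

```python
import numpy as np

oRng = np.random.default_rng(0)
numPx = 10_000
mPx = oRng.integers(0, 256, size = (numPx, 3)) #<! Hypothetical pixels: numPx x 3 (R, G, B)

vEdges = np.array([0, 64, 128, 192, 255]) #<! 4 bins per channel
tH3, _ = np.histogramdd(mPx, bins = (vEdges, vEdges, vEdges))
tH3 = tH3 / np.sum(tH3) #<! Normalize into a joint distribution

print(tH3.shape) #<! (4, 4, 4)

# Summing over the green and blue axes recovers the histogram of the red channel
vHRed = np.sum(tH3, axis = (1, 2))
```

Unlike 3 separate per channel histograms, the joint histogram keeps the co-occurrence of the channel values, at the cost of `4 ** 3 = 64` bins instead of `3 * 4 = 12`.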

### Features Analysis

In this section the relation between the features and the labels is analyzed.  
You should visualize / calculate measures which indicate whether the features make the classes identifiable.

#### Ideas for Analysis

1. Display the histogram / density of each feature per label of the samples.
2. Display the correlation between each feature and the class value (Pay attention, this is a mix of continuous values and categorical values).

* <font color='brown'>(**#**)</font> You may find SeaBorn's `kdeplot()` useful.
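
One possible measure for the relation between a continuous feature and a categorical label is the ANOVA F score, available in SciKit Learn as `f_classif()`. The following is a minimal sketch on synthetic data, where one hypothetical feature is informative and the other is pure noise. It is one option among many, not the measure the exercise prescribes.

```python
import numpy as np
from sklearn.feature_selection import f_classif

oRng = np.random.default_rng(0)
numSamplesClass = 100 #<! Samples per class
vYDemo = np.repeat(np.arange(3), numSamplesClass)

vFeatInfo  = oRng.normal(loc = vYDemo.astype(float), scale = 0.3) #<! Mean depends on the class
vFeatNoise = oRng.normal(size = vYDemo.shape[0]) #<! Independent of the class
mFDemo = np.column_stack((vFeatInfo, vFeatNoise))

# A high F score means the per class means differ a lot relative to the in class variance
vFScore, vPVal = f_classif(mFDemo, vYDemo)

print(vFScore)
print(vPVal)
```

The F score only captures a difference in the per class means. A feature may still be useful for a kernel SVM while having a low score.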

In [None]:
# Function to Visualize Features

#===========================Fill This===========================#
# 1. Visualize the distribution of the features per class.
# 2. You're after features which separate the different classes (The least overlap of values between classes).

hF, hA = plt.subplots(nrows = 5, ncols = 3, figsize = (16, 16))

for ii, featName in enumerate(lFeatureName):
    sns.kdeplot(x = mF[:, ii], hue = vYTrain, ax = hA.flat[ii])
    hA.flat[ii].set_title(f'Distribution of {featName}')

#===============================================================#

## Optimize Classifiers

In this section we'll train a Kernel SVM model with optimized hyper parameters: `C` and `gamma`.  
The score should be the regular accuracy.

1. Build the dictionary of parameters for the grid search.
2. Construct the grid search object (`GridSearchCV`).
3. Optimize the hyper parameters by the `fit()` method of the grid search object.

* <font color='red'>(**?**)</font> Why is the accuracy a reasonable score in this case?
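
The 3 steps above can be sketched on synthetic data as follows. The data set (generated by `make_blobs()`) and the grid values are assumptions for illustration and are not related to the features of the exercise.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical features: 3 classes, 4 features
mFDemo, vYDemo = make_blobs(n_samples = 150, centers = 3, n_features = 4, cluster_std = 2.0, random_state = 0)

# Step 1: The dictionary of parameters
dParamsDemo = {'C': [0.1, 1, 3], 'gamma': ['scale', 0.05, 0.5]}

# Step 2: The grid search object, scored by accuracy over 5 folds
oGsDemo = GridSearchCV(estimator = SVC(kernel = 'rbf'), param_grid = dParamsDemo, scoring = 'accuracy', cv = 5)

# Step 3: Optimize the hyper parameters
oGsDemo = oGsDemo.fit(mFDemo, vYDemo)

print(oGsDemo.best_params_)
print(f'Best cross validation accuracy: {oGsDemo.best_score_:0.3f}')
```

Setting `scoring = None` on a classifier falls back to the `score()` method of the estimator, which for `SVC` is the mean accuracy, so both forms optimize the same score.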

In [None]:
# SciKit Learn requires vY to be in matrix form for `_ThresholdScorer()` (See https://github.com/scikit-learn/scikit-learn/blob/7b13a8f120a6d67112b0f50a8834d65e2258f045/sklearn/metrics/_scorer.py#L366)
# Hence we'll use the accuracy measure.

# def RocAucSvm( vY, mDecFun ):
    
#     mP      = sp.special.softmax(mDecFun, axis = 1)
#     aucVal  = roc_auc_score(vY, mP, multi_class = 'ovr')
    
#     return aucVal

# RocAucSvmScore = make_scorer(RocAucSvm, needs_threshold = True)

# oGsSvc = GridSearchCV(estimator = SVC(kernel = 'rbf', decision_function_shape = 'ovr'), param_grid = dParams, scoring = RocAucSvmScore, cv = numFold, verbose = 4)

In [None]:
# Grid Search Object
# Hyper parameter optimization by a combined grid search and cross validation.

#===========================Fill This===========================#
# 1. Construct the Grid Search object.
# 2. Set the parameters to iterate over and their values.
dParams = {'C': lC, 'gamma': lγ}
#===============================================================#

oGsSvc = GridSearchCV(estimator = SVC(kernel = 'rbf'), param_grid = dParams, scoring = None, cv = numFold, verbose = 4)

In [None]:
# Optimize Hyper Parameters
# Apply the grid search.

#===========================Fill This===========================#
# 1. Apply the grid search phase.
oGsSvc = oGsSvc.fit(mF, vYTrain)
#===============================================================#

## Confusion Matrix on Test Data 

In this section we'll test the model on the test data.

1. Extract the best estimator from the grid search.
2. If needed, fit it to the train data.
3. Calculate the test set features. Make sure to avoid data leakage.
4. Display the _confusion matrix_.

The objective is to get at least `85%` accuracy per class.
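
The accuracy per class can be read from the confusion matrix: the diagonal divided by the sum of each row. The following is a minimal sketch using hypothetical labels and predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (4 samples per class)
vYTrue = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
vYPred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 0, 0])

mConf = confusion_matrix(vYTrue, vYPred) #<! Rows: True class, Columns: Predicted class

# Per class accuracy: correctly classified samples of the class / all samples of the class
vClassAcc = np.diag(mConf) / np.sum(mConf, axis = 1)

print(vClassAcc) #<! 0.75, 1.0, 0.5
print(np.all(vClassAcc >= 0.85)) #<! False, the objective isn't met for classes 0 and 2
```

The accuracy per class is the recall of each class, hence a high overall accuracy doesn't guarantee the objective is met for every class.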

In [None]:
# Extract the Best Model

#===========================Fill This===========================#
# 1. Get the best model with the optimized hyper parameters.
bestModel = oGsSvc.best_estimator_
#===============================================================#

* <font color='red'>(**?**)</font> Does the best model need a refit on data?

In [None]:
# Test Set Features
# Calculate the test data set features.
# Pay attention not to leak data from the test set into the model / features.
# One way to ensure this is to assume the test data arrives one sample at a time.

#===========================Fill This===========================#
# 1. Features of the Test Data.
mFTest = CalcFeaturesMatrix(mXTest, tH, mC)
#===============================================================#

In [None]:
# Confusion Matrix

hF, hA = plt.subplots(figsize = (10, 10))

#===========================Fill This===========================#
# 1. Plot the Confusion Matrix.
hA, mConfMat = PlotConfusionMatrix(vYTest, bestModel.predict(mFTest), lLabels = L_CLASSES, hA = hA)
#===============================================================#

plt.show()

* <font color='red'>(**?**)</font> If the results are good, can you spot a dominant feature behind them, if there is one?
* <font color='blue'>(**!**)</font> If there are errors, analyze at least one misclassified sample of each class.
* <font color='green'>(**@**)</font> Check results with a single feature: The channel with the highest mean value.