[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io/)

# Machine Learning Methods

## Exercise 003 - Classification

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.000 | 13/02/2023 | Royi Avital | First version                                                      |
|         |            |             |                                                                    |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/MachineLearningMethods/2023_01/Exercise0002ClassificationSolution.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem.wordnet import WordNetLemmatizer

from lightgbm import LGBMClassifier

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

# Miscellaneous
import json
import os
from platform import python_version
import random
import urllib.request
import re

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from bokeh.plotting import figure, show

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF = (8, 8)
ELM_SIZE_DEF = 50
CLASS_COLOR = ('b', 'r')
EDGE_COLOR  = 'k'

TEST_DATA_FILE_NAME  = 'TestData.mat'
TRAIN_DATA_FILE_NAME = 'TrainData.mat'

L_CLASSES   = ['Red', 'Green', 'Blue']
IMG_SIZE    = [100, 100]

DATA_FILE_URL  = r'https://github.com/FixelAlgorithmsTeam/FixelCourses/raw/master/DataSets/KaggleWhatsCooking.json'
DATA_FILE_NAME = 'KaggleWhatsCooking.json'


In [None]:
# Fixel Algorithms Packages


## Exercise

This exercise introduces:

 - Working with real world data in the context of basic Natural Language Processing (NLP).
 - Working with binary features using Decision Trees.
 - Working with an Ensemble Method based on trees.
 - Utilizing the `LightGBM` package with the `LGBMClassifier`.


* <font color='brown'>(**#**)</font> One of the objectives of this exercise is working with a data set which is non trivial in size, features and performance.

In this exercise we'll work with the data set: [Yummly - What's Cooking?](https://www.kaggle.com/competitions/whats-cooking) from [Kaggle](https://www.kaggle.com).  
The data set is basically a list of ingredients of a recipe (Features) and the type of cuisine of the recipe (Italian, French, Indian, etc...).  
The objective is being able to classify the cuisine of a recipe by its ingredients.

* <font color='brown'>(**#**)</font> The data set will be downloaded and parsed automatically.

The data will be defined as the following:

1. A boolean matrix of size `numSamples x numFeatures`.
2. The features are the list of all ingredients in the recipes.
3. For a recipe, the features vector is the hot encoding of its ingredients.  

For example, if the list of features is: `basil, chicken, egg, eggplant, garlic, pasta, salt, tomato sauce`.  
Then for Pasta with Tomato Sauce the features vector will be: `[1, 0, 0, 0, 1, 1, 1, 1]` which means: `basil, garlic, pasta, salt, tomato sauce`.
This will be the basic feature list while you're encouraged to add more features.
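
The encoding in the example above can be sketched in a few lines. This is a minimal illustration on the toy vocabulary only; the names `lFeatToy`, `dFeatIdx` and `EncodeRecipe()` are hypothetical and are not part of the notebook's code.

```python
# Toy vocabulary (Sorted), taken from the example above
lFeatToy = ['basil', 'chicken', 'egg', 'eggplant', 'garlic', 'pasta', 'salt', 'tomato sauce']
# Maps each ingredient name to its column index
dFeatIdx = {featName: ii for ii, featName in enumerate(lFeatToy)}

def EncodeRecipe( lRecipe: list, dFeatIdx: dict ) -> list:
    # Builds a binary vector with 1 at the column of each ingredient of the recipe
    vF = [0] * len(dFeatIdx)
    for ingName in lRecipe:
        vF[dFeatIdx[ingName]] = 1
    return vF

vPasta = EncodeRecipe(['pasta', 'tomato sauce', 'basil', 'garlic', 'salt'], dFeatIdx)
print(vPasta)  # [1, 0, 0, 0, 1, 1, 1, 1]
```

The order of the ingredients in the recipe does not matter; only the order of the vocabulary sets the columns.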

In this exercise:

1. Download the data (Will be done automatically by the code).
2. Parse data into a data structure to work with (Automatically by the code).
3. Extract features from the recipes (The basic features: Existence of an ingredient).
4. Train an Ensemble of Decision Trees using the LightGBM models (Very fast).
5. Optimize the model hyper parameters (See below).
6. Plot the _confusion matrix_ of the best model on the data.

Optimize features (repeat if needed) to get accuracy of at least `70%`.

* <font color='brown'>(**#**)</font> It might be useful to use the [NLTK](https://github.com/nltk/nltk) package for [word stemming](https://en.wikipedia.org/wiki/Stemming).
* <font color='brown'>(**#**)</font> To install the package (Prior to working with the notebook):
  - Open Anaconda command line (`Prompt`).
  - Activate the `IAIMLMethods` environment by: `conda activate IAIMLMethods`.
  - Install the package using `conda install nltk -c conda-forge`. 

In [None]:
# Parameters

numSamplesTrain = 35_000
numSamplesTest  = None

# Hyper Parameters of the Model

#===========================Fill This===========================#
# 1. Set the list of learning rate (4 values in range [0.05, 0.5]).
# 2. Set the list of maximum iterations (3 integer values in range [10, 200]).
# 3. Set the list of maximum nodes (3 integer values in range [10, 50]).
lLearnRate  = [0.05, 0.10, 0.15, 0.20]
lMaxItr     = [50, 100, 200]
lMaxNodes   = [20, 30, 50]
#===============================================================#

numFold     = 3

* <font color='blue'>(**!**)</font> Fill the functions below **after** reading the code which uses them.

In [None]:
# Auxiliary Functions

def ReadData( filePath: str ) -> tuple[list, list, list]:
    # Read the data (JSON file) into lists: ID's, cuisines (Labels) and ingredients
    
    # The context manager closes the file even if the parsing fails
    with open(filePath, 'r', encoding = 'utf-8') as hFile:
        dJsonData = json.load(hFile)
        
    lId, lCuisine, lIngredients = [], [], []
    for dRecipe in dJsonData:
        lId.append(dRecipe['id'])
        lCuisine.append(dRecipe['cuisine'])
        lIngredients.append(dRecipe['ingredients'])
                
    return lId, lCuisine, lIngredients

def RemoveDigits( lIngredients: list ) -> list:
    # Remove digits from the ingredients list
    
    #===========================Fill This===========================#
    # 1. Look for the symbol of a digit in RegExp.
    return [[re.sub(r"\d+", "", x) for x in y] for y in lIngredients]
    #===============================================================#

def RemoveChars( lIngredients: list ) -> list:
    # Remove some unneeded characters from the ingredients list
   
    lIngredients = [[x.replace("-", " ") for x in y] for y in lIngredients]
    #===========================Fill This===========================# 
    # 01. Remove the following: & 
    # 02. Remove the following: '
    # 03. Remove the following: ''
    # 04. Remove the following: % 
    # 05. Remove the following: ! 
    # 06. Remove the following: (  
    # 07. Remove the following: ) 
    # 08. Remove the following: / 
    # 09. Remove the following: \ 
    # 10. Remove the following: , 
    # 11. Remove the following: . 
    # 12. Remove the following: "
    lIngredients = [[x.replace("&", " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace("'", " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace("''", " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace("%", " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace("!", " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace("(", " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace(")", " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace("/", " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace("\\", " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace(",", " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace(".", " ") for x in y] for y in lIngredients]
    lIngredients = [[x.replace('"', " ") for x in y] for y in lIngredients]
    #===============================================================# 
    lIngredients = [[x.replace(u"\u2122", " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace(u"\u00AE", " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace(u"\u2019", " ") for x in y] for y in lIngredients] 

    return lIngredients

def LowerCase( lIngredients: list ) -> list:
    # Make letters lowercase for the ingredients list
    
    #===========================Fill This===========================# 
    # 1. Make lower case of the text. 
    # !! Pay attention that the input is a list of lists!
    return [[x.lower() for x in y] for y in lIngredients]
    #===============================================================# 

def RemoveRedundantWhiteSpace( lIngredients: list ) -> list:
    # Removes redundant whitespaces
    
    #===========================Fill This===========================# 
    # 1. Replace any run of white space with a single space and strip the edges. 
    # !! Pay attention that the input is a list of lists!
    return [[re.sub(r'\s+', ' ', x).strip() for x in y] for y in lIngredients] 
    #===============================================================# 
    
    
def StemWords( lIngredients: list ) -> list:
    # Word stemming for ingredients list (Per word)
    
    #===========================Fill This===========================# 
    # 1. Construct the `WordNetLemmatizer` object.
    lmtzr = WordNetLemmatizer()
    #===============================================================# 
    
    def WordByWord( inStr: str ):
        
        # Lemmatize each word and join the words back into a single string
        return " ".join([lmtzr.lemmatize(w) for w in inStr.split()])
    
    return [[WordByWord(x) for x in y] for y in lIngredients] 
    
    
def RemoveUnits( lIngredients: list ) -> list:
    # Remove units related words from ingredients
    
    remove_list = ['g', 'lb', 's', 'n']
        
    def CheckWord( inStr: str ):
        
        splitStr = inStr.split()
        resStr  = [word for word in splitStr if word.lower() not in remove_list]
        
        return ' '.join(resStr)

    return [[CheckWord(x) for x in y] for y in lIngredients]

def ExtractUniqueIngredients( lIngredients: list, sortList: bool = True ) -> list:
    # Extract all unique ingredients from the list as a single list

    #===========================Fill This===========================# 
    # 1. Extract the unique values of ingredients (You may use the `set()` data type of Python).
    # 2. Sort it by name if `sortList == True`. 
    lUniqueIng = list(set([ing for lIngredient in lIngredients for ing in lIngredient]))
    if sortList:
        lUniqueIng = sorted(lUniqueIng)
    #===============================================================# 

    return lUniqueIng

def ExtractFeatureEncoding( lIngredient: list, lUniqueIng: list ) -> np.ndarray:
    # Binary encoding: mF[ii, jj] = 1 if the ingredient lUniqueIng[jj] is in the recipe lIngredient[ii]
    
    mF = np.zeros(shape = (len(lIngredient), len(lUniqueIng)), dtype = np.uint)
    #===========================Fill This===========================# 
    # 1. Iterate over the list of lists of the ingredients.
    # 2. For each sample (List of ingredients), put 1 in the location of the ingredients.
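    # Map each name to its column once (Avoids the linear search of `list.index()` per ingredient)
    dFeatIdx = {ingName: jj for jj, ingName in enumerate(lUniqueIng)}
    for ii, lRecipe in enumerate(lIngredient):
        for ingName in lRecipe:
            mF[ii, dFeatIdx[ingName]] = 1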
    #===============================================================# 
            
    return mF

def PlotLabelsHistogram(vY: np.ndarray, hA = None, lClass = None, xLabelRot: int = None) -> plt.Axes:

    if hA is None:
        hF, hA = plt.subplots(figsize = (8, 6))
    
    vLabels, vCounts = np.unique(vY, return_counts = True)

    hA.bar(vLabels, vCounts, width = 0.9, align = 'center')
    hA.set_title('Histogram of Classes / Labels')
    hA.set_xlabel('Class')
    hA.set_ylabel('Number of Samples')
    hA.set_xticks(vLabels)
    if lClass is not None:
        hA.set_xticklabels(lClass)
    
    if xLabelRot is not None:
        for xLabel in hA.get_xticklabels():
            xLabel.set_rotation(xLabelRot)

    return hA

def PlotConfusionMatrix(vY: np.ndarray, vYPred: np.ndarray, normMethod: str = None, hA: plt.Axes = None, lLabels: list = None, dScore: dict = None, titleStr: str = 'Confusion Matrix', xLabelRot: int = None, valFormat: str = None) -> tuple[plt.Axes, np.ndarray]:

    # Calculation of Confusion Matrix
    mConfMat = confusion_matrix(vY, vYPred, normalize = normMethod)
    oConfMat = ConfusionMatrixDisplay(mConfMat, display_labels = lLabels)
    oConfMat = oConfMat.plot(ax = hA, values_format = valFormat)
    hA = oConfMat.ax_
    if dScore is not None:
        titleStr += ':'
        for scoreName, scoreVal in  dScore.items():
            titleStr += f' {scoreName} = {scoreVal:0.2},'
        titleStr = titleStr[:-1]
    hA.set_title(titleStr)
    hA.grid(False)
    if xLabelRot is not None:
        for xLabel in hA.get_xticklabels():
            xLabel.set_rotation(xLabelRot)

    return hA, mConfMat
    

## Generate / Load Data


In [None]:
# Download Data
# This section downloads data from the given URL if needed.

if not os.path.exists(DATA_FILE_NAME):
    urllib.request.urlretrieve(DATA_FILE_URL, DATA_FILE_NAME)

In [None]:
# Loading / Generating Data

lId, lCuisine, lIngredients = ReadData(DATA_FILE_NAME)


### Pre Processing the Data

In this section we'll do the following:

1. Make all text _lower case_.
2. Remove digits (Weights etc...).
3. Remove some unneeded characters.
4. Remove redundant spaces.
5. Remove units.
6. Stem the text (See [Word Stemming](https://en.wikipedia.org/wiki/Stemming)).

The objective is to reduce the sensitivity to the style used to describe the ingredients.  
So we're after the most basic way to describe each ingredient.

* <font color='brown'>(**#**)</font> The list above is the minimum to be done. You may use more ideas.
* <font color='brown'>(**#**)</font> Look at the features list after this. You'll find more duplicates and ideas for improvement.
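
To see what steps 1 to 4 do to a single ingredient, here is a minimal sketch on one string. The helper `NormalizeIngredient()` and the sample string are hypothetical; the notebook's own functions work on a list of lists and are the ones to fill in.

```python
import re

# Characters to replace by a space (Same set as listed in `RemoveChars()`)
REMOVE_TABLE = str.maketrans({c: ' ' for c in '-&\'%!()/\\,."'})

def NormalizeIngredient( inStr: str ) -> str:
    # Hypothetical helper: applies steps 1-4 to a single ingredient string
    outStr = inStr.lower()                        #<! 1. Lower case
    outStr = re.sub(r'\d+', '', outStr)           #<! 2. Remove digits
    outStr = outStr.translate(REMOVE_TABLE)       #<! 3. Replace unneeded characters by a space
    outStr = re.sub(r'\s+', ' ', outStr).strip()  #<! 4. Collapse redundant white space
    return outStr

print(NormalizeIngredient('(10 oz.) Frozen Chopped Spinach, thawed'))  # oz frozen chopped spinach thawed
```

Note that `oz` survives: removing units is a separate step, and its list of units is a natural place to look for improvements.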

In [None]:
# Pre Process Data

#===========================Fill This===========================#
# 1. Fill the body of the function above.
lIng = LowerCase(lIngredients)
lIng = RemoveDigits(lIng)
lIng = RemoveChars(lIng)
lIng = RemoveRedundantWhiteSpace(lIng)
lIng = RemoveUnits(lIng)
lIng = StemWords(lIng)
#===============================================================#

In [None]:
# Extract the Features

#===========================Fill This===========================#
# 1. Fill the body of the function above.
lFeat = ExtractUniqueIngredients(lIng)
#===============================================================#

* <font color='brown'>(**#**)</font> The encoding function matches based on the whole name of the ingredient. For multi word ingredients one might even match on a single word.  
This can be useful for cases like `ketchup` vs. `tomato ketchup`.
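
One possible way to get single word matches is to split every ingredient into its words before extracting the unique features. The sketch below is only an illustration; `SplitToWords()` is a hypothetical helper and whether it improves the score has to be checked on the data.

```python
def SplitToWords( lIngredients: list ) -> list:
    # Hypothetical helper: replaces the ingredients of each recipe by their individual words
    # A set removes duplicated words within a recipe, sorting keeps the output deterministic
    return [sorted({word for ing in lRecipe for word in ing.split()}) for lRecipe in lIngredients]

lToy = [['tomato ketchup', 'salt'], ['ketchup', 'black pepper']]
print(SplitToWords(lToy))  # [['ketchup', 'salt', 'tomato'], ['black', 'ketchup', 'pepper']]
```

With this representation both recipes share the feature `ketchup`. The price is losing the distinction between ingredients which share a word, such as `black pepper` and `bell pepper`.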

In [None]:
# Create the Features Encoding

#===========================Fill This===========================#
# 1. Fill the body of the function above.
mF = ExtractFeatureEncoding(lIng, lFeat)
#===============================================================#


In [None]:
# Display the Data
# Create a Data Frame of the data
dfX = pd.DataFrame(columns = lFeat, data = mF)
dfX

In [None]:
# Create the Labels Data

dsY = pd.Series(data = lCuisine, name = 'Cuisine')

In [None]:
dsY

In [None]:
# Labels as Categorical Data

vY          = pd.Categorical(dsY).codes
lEncoding   = pd.Categorical(dsY).categories.to_list()

In [None]:
hF, hA = plt.subplots(figsize = (12, 8))
hA = PlotLabelsHistogram(dsY, hA = hA, xLabelRot = 90)


* <font color='red'>(**?**)</font> Is this a balanced data set?
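
Besides looking at the histogram, the balance can be quantified by the ratio between the largest and the smallest class. A minimal sketch on hypothetical toy labels (The function `ClassBalance()` is not part of the notebook):

```python
from collections import Counter

def ClassBalance( lLabels: list ) -> tuple[dict, float]:
    # Counts the samples per class and the ratio between the largest and the smallest class
    dCount = Counter(lLabels)
    imbalanceRatio = max(dCount.values()) / min(dCount.values())
    return dict(dCount), imbalanceRatio

# Hypothetical toy labels, not the actual data
lToyLabels = ['italian'] * 6 + ['mexican'] * 3 + ['russian'] * 1
dCount, imbalanceRatio = ClassBalance(lToyLabels)
print(dCount)          # {'italian': 6, 'mexican': 3, 'russian': 1}
print(imbalanceRatio)  # 6.0
```

A ratio close to 1 suggests a balanced data set. On the real data the same function can be applied to `lCuisine`.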

### Split Data

We'll split the data into training and testing.  
Set `numSamplesTrain`. For the first tries, use a small number just to verify everything works.

In [None]:
# Split Train & Test Data

#===========================Fill This===========================#
# 1. Split the data using `train_test_split()`.
# 2. Make sure to use `numSamplesTrain` and `numSamplesTest`.
# 3. Set the `random_state` so iterative runs will be reproducible.
mXTrain, mXTest, vYTrain, vYTest = train_test_split(mF, vY, train_size = numSamplesTrain, test_size = numSamplesTest, random_state = seedNum, shuffle = True, stratify = vY)
#===============================================================#


# Dimensions of the Data
print(f'The number of training data samples: {mXTrain.shape[0]}')
print(f'The number of training features per sample: {mXTrain.shape[1]}') 


print(f'The number of test data samples: {mXTest.shape[0]}')
print(f'The number of test features per sample: {mXTest.shape[1]}') 

* <font color='red'>(**?**)</font> What's the ratio between the number of train samples and the number of features? What do you think it should be?

### Plot Data

In [None]:
# Histogram of Classes

# Train
hA = PlotLabelsHistogram(vYTrain, lClass = lEncoding, xLabelRot = 90)
hA.set_title(hA.get_title() + ' - Train Data')
plt.show()

In [None]:
# Histogram of Classes

# Test
hA = PlotLabelsHistogram(vYTest, lClass = lEncoding, xLabelRot = 90)
hA.set_title(hA.get_title() + ' - Test Data')
plt.show()

* <font color='red'>(**?**)</font> Which score method would you use between _accuracy_, _recall_, _precision_ or _f1_?

## Training Data and Feature Engineering / Extraction

The vector of values doesn't fit, as is, for classification with SVM.  
It misses a lot of the information given in the structure of the image or a color pixel.  
In our case, the important thing is to give the classifier information about the structure of color, a vector of 3 values: `[r, g, b]`.  
Yet, the classifier input is limited to a list of values. This is where the concept of metric comes into play.  

We need to create information about distance between colors.  
We also need to extract features to represent the colors in the image.

In this section the tasks are:

1. Implement functions to extract features from the data.
2. Arrange the features in a _matrix_ / _data frame_ for processing.
3. Explore the features using _SeaBorn_, specifically whether the features extract meaningful information.

* <font color='brown'>(**#**)</font> Don't include _test data_ in the analysis for feature extraction. Otherwise, data leakage will occur.

### Ideas for Features

1. The distance between the _mean_ / _median_ / _mode_ color of the image to the _mean_ / _median_ / _mode_ color per class.
2. The distance between the quantized histogram of `R` / `G` / `B` color channels of the image to the class.
3. The distance of the mean color at the center of the image to the mean color of the class.
4. The channel with the maximum value (Is this a continuous value? Does it fit the SVM model?).
5. Use of the _HSL_ color space.


* <font color='brown'>(**#**)</font> You're encouraged to think on more features!
* <font color='brown'>(**#**)</font> Pay attention to the dimensionality of the data. For instance, how do you define the _median color_?
* <font color='brown'>(**#**)</font> For simplicity we use the RGB Color Space. Yet color distance might be better calculated in other color spaces (See LAB for instance).

## Optimize Classifiers

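In this section we'll train a LightGBM model (`LGBMClassifier`) with optimized hyper parameters: `num_leaves`, `learning_rate` and `n_estimators`.  
The score used is `f1_micro`, which matches the regular accuracy in single label multi class problems.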

1. Build the dictionary of parameters for the grid search.
2. Construct the grid search object (`GridSearchCV`).
3. Optimize the hyper parameters by the `fit()` method of the grid search object.
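
Before running the search it helps to estimate its cost. The sketch below enumerates the grid for the values set in the Parameters cell; the names `dParamsToy`, `numFoldToy`, `lCombination` and `numFit` are used here for illustration only.

```python
from itertools import product

# Values as set in the Parameters cell above
dParamsToy = {'num_leaves': [20, 30, 50], 'learning_rate': [0.05, 0.10, 0.15, 0.20], 'n_estimators': [50, 100, 200]}
numFoldToy = 3

# Cartesian product of all values: one dictionary per combination of hyper parameters
lCombination = [dict(zip(dParamsToy.keys(), tVal)) for tVal in product(*dParamsToy.values())]
numFit = len(lCombination) * numFoldToy  #<! Each combination is trained once per fold

print(len(lCombination), numFit)  # 36 108
```

With the default `refit = True`, `GridSearchCV` adds one more fit of the best combination on the whole training set.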

* <font color='red'>(**?**)</font> Why is the accuracy a reasonable score in this case?
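
The grid search in this notebook is constructed with `scoring = 'f1_micro'`. When each sample has exactly one label, the micro averaged F1 equals the accuracy, since every wrong prediction is counted once as a false positive (Predicted class) and once as a false negative (True class). A minimal sketch in plain Python on hypothetical toy labels (The functions are illustrative, not the scikit-learn implementation):

```python
def AccuracyScore( lY: list, lYPred: list ) -> float:
    # Fraction of correct predictions
    return sum(y == yPred for y, yPred in zip(lY, lYPred)) / len(lY)

def MicroF1Score( lY: list, lYPred: list ) -> float:
    # Pools the true positives, false positives and false negatives over all classes
    lClass = sorted(set(lY) | set(lYPred))
    numTp = sum((y == c) and (yPred == c) for c in lClass for y, yPred in zip(lY, lYPred))
    numFp = sum((y != c) and (yPred == c) for c in lClass for y, yPred in zip(lY, lYPred))
    numFn = sum((y == c) and (yPred != c) for c in lClass for y, yPred in zip(lY, lYPred))
    return (2 * numTp) / ((2 * numTp) + numFp + numFn)

# Hypothetical toy labels: 7 out of 10 predictions are correct
lY     = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
lYPred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

print(AccuracyScore(lY, lYPred), MicroF1Score(lY, lYPred))  # 0.7 0.7
```

A macro averaged score, by contrast, weights all classes equally, which matters when the classes are imbalanced.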

In [None]:
# # Construct the Grid Search Object 

# #===========================Fill This===========================#
# # 1. Set the parameters to iterate over and their values.
# dParams = {'learning_rate': lLearnRate, 'max_iter': lMaxItr, 'max_leaf_nodes': lMaxNodes}
# #===============================================================#

# vCatFeatFlag = np.full(shape = len(lFeat), fill_value = True)
# oGsSvc = GridSearchCV(estimator = HistGradientBoostingClassifier(categorical_features = vCatFeatFlag), param_grid = dParams, scoring = 'f1_micro', cv = numFold, verbose = 4)

In [None]:
# # Optimize

# #===========================Fill This===========================#
# # 1. Apply the grid search phase.
# oGsSvc = oGsSvc.fit(mXTrain, vYTrain)
# #===============================================================#

In [None]:
# Construct the Grid Search Object 

#===========================Fill This===========================#
# 1. Set the parameters to iterate over and their values.
dParams = {'num_leaves': lMaxNodes, 'learning_rate': lLearnRate, 'n_estimators': lMaxItr}
#===============================================================#

vCatFeatFlag = np.full(shape = len(lFeat), fill_value = True)
oGsSvc = GridSearchCV(estimator = LGBMClassifier(), param_grid = dParams, scoring = 'f1_micro', cv = numFold, verbose = 4)

In [None]:
# Optimize

#===========================Fill This===========================#
# 1. Apply the grid search phase.
oGsSvc = oGsSvc.fit(mXTrain, vYTrain)
#===============================================================#

* <font color='brown'>(**#**)</font> One could optimize the histogram by creating a 3D histogram.

### Features Analysis

In this section the relation between the features and the labels is analyzed.  
You should visualize / calculate measures which indicate the features make the classes identifiable.

#### Ideas for Analysis

1. Display the histogram / density of each feature by the label of sample.
2. Display the correlation between the feature to the class value (Pay attention this is a mix of continuous values and categorical values).

* <font color='brown'>(**#**)</font> You may find SeaBorn's `kdeplot()` useful.

## Confusion Matrix on Test Data 

In this section we'll test the model on the test data.

1. Extract the best estimator from the grid search.
2. If needed, fit it to the train data.
3. Calculate the test set features. Make sure to avoid data leakage.
4. Display the _confusion matrix_.

The objective is to get at least `85%` accuracy per class.
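
The plots in this section use `normMethod = 'true'`, which normalizes each row of the confusion matrix by the number of samples of the true class. The diagonal then holds the accuracy per class (The recall of each class). A minimal sketch in plain Python on hypothetical toy labels (`ConfusionMatrixNormTrue()` is illustrative only):

```python
def ConfusionMatrixNormTrue( lY: list, lYPred: list, numClass: int ) -> list:
    # Rows: True class, Columns: Predicted class
    mC = [[0] * numClass for _ in range(numClass)]
    for y, yPred in zip(lY, lYPred):
        mC[y][yPred] += 1
    # Normalize each row by its sum (Guarding against classes with no samples)
    return [[val / max(sum(vRow), 1) for val in vRow] for vRow in mC]

# Hypothetical toy labels
lY     = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
lYPred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

mConfNorm = ConfusionMatrixNormTrue(lY, lYPred, numClass = 3)
vClassAcc = [mConfNorm[ii][ii] for ii in range(3)]
print(vClassAcc)  # [0.75, 0.5, 0.75]
```

In this toy case the overall accuracy is `0.7`, yet one class reaches only `0.5`, which is why the per class view is worth checking.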

In [None]:
# Extract the Best Model

#===========================Fill This===========================#
# 1. Get the best model with the optimized hyper parameters.
bestModel = oGsSvc.best_estimator_
#===============================================================#

* <font color='red'>(**?**)</font> Does the best model need a refit on data?

In [None]:
# Fit the Model

#===========================Fill This===========================#
# 1. Train the model on the whole training data
bestModel = bestModel.fit(mXTrain, vYTrain)
#===============================================================#

In [None]:
# Plot the Confusion Matrix
hF, hA = plt.subplots(figsize = (12, 12))

#===========================Fill This===========================#
hA, mConfMat = PlotConfusionMatrix(vYTrain, bestModel.predict(mXTrain), lLabels = lEncoding, hA = hA, xLabelRot = 90, normMethod = 'true', valFormat = '0.0%')
hA.set_title(hA.get_title() + ' - Train Data')
#===============================================================#

plt.show()

In [None]:
# Plot the Confusion Matrix
hF, hA = plt.subplots(figsize = (12, 12))

#===========================Fill This===========================#
hA, mConfMat = PlotConfusionMatrix(vYTest, bestModel.predict(mXTest), lLabels = lEncoding, hA = hA, xLabelRot = 90, normMethod = 'true', valFormat = '0.0%')
hA.set_title(hA.get_title() + ' - Test Data')
#===============================================================#

plt.show()

* <font color='red'>(**?**)</font> How would you handle the case where the test data has features which are not in the training data?
* <font color='green'>(**@**)</font> Try to get more features and improve results.
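
One common way to handle ingredients which appear only in the test data is to build the vocabulary from the training recipes alone and ignore anything unseen at encoding time. This also avoids leaking information from the test set. The sketch below assumes that approach; `EncodeWithVocabulary()` and the toy recipes are hypothetical.

```python
def EncodeWithVocabulary( lIngredients: list, lVocab: list ) -> tuple[list, int]:
    # Encodes recipes using a fixed vocabulary, counting the ingredients which are not in it
    dIdx = {ingName: jj for jj, ingName in enumerate(lVocab)}
    mF = [[0] * len(lVocab) for _ in lIngredients]
    numUnseen = 0
    for ii, lRecipe in enumerate(lIngredients):
        for ingName in lRecipe:
            jj = dIdx.get(ingName)
            if jj is None:
                numUnseen += 1  #<! Unseen ingredient: skipped, only counted
            else:
                mF[ii][jj] = 1
    return mF, numUnseen

# Hypothetical toy recipes
lTrainToy = [['salt', 'basil'], ['garlic', 'salt']]
lTestToy  = [['salt', 'saffron'], ['basil']]

lVocab = sorted({ingName for lRecipe in lTrainToy for ingName in lRecipe})  #<! Train data only
mFTest, numUnseen = EncodeWithVocabulary(lTestToy, lVocab)
print(lVocab)             # ['basil', 'garlic', 'salt']
print(mFTest, numUnseen)  # [[0, 0, 1], [1, 0, 0]] 1
```

Tracking the number of skipped ingredients gives a rough indication of how much of the test data the vocabulary misses.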