[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io/)

# AI Program

## Exercise 0006 - Classification

Feature engineering for text classification.

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        | Content / Changes                                                  |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 23/03/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/Exercise0006.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from lightgbm import LGBMClassifier

import nltk
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

# Miscellaneous
import gdown
import json
import os
import random
import urllib.request
import re

# Typing
from typing import Callable, Dict, List, Optional, Set, Tuple, Union

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

 ```python
 valToFill = ???
 ```

 - Multi Line to Fill (At least one)

 ```python
 # You need to start writing
 ????
 ```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

???
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

nltk.download('wordnet')
# nltk.download('omw-1.4')

In [None]:
# Constants

DATA_FILE_URL  = r'https://github.com/FixelAlgorithmsTeam/FixelCourses/raw/master/DataSets/KaggleWhatsCooking.json'
DATA_FILE_NAME = 'KaggleWhatsCooking.json'


In [None]:
# Course Packages

from DataVisualization import PlotConfusionMatrix, PlotLabelsHistogram


In [None]:
# General Auxiliary Functions

def ReadData( filePath: str ) -> tuple[list, list, list]:
    # Read the data (JSON file) into lists: IDs, cuisines and ingredients
    
    with open(filePath, 'r', encoding = 'utf-8') as hFile: #<! The file is closed at the end of the block
        lJsonData = json.load(hFile) #<! List of dictionaries (One per recipe)
        
    lId, lCuisine, lIngredients = [], [], []
    for dRecipe in lJsonData:
        lId.append(dRecipe['id'])
        lCuisine.append(dRecipe['cuisine'])
        lIngredients.append(dRecipe['ingredients'])
                
    return lId, lCuisine, lIngredients

def RemoveDigits( lIngredients: list ) -> list:
    # Remove digits from the ingredients list
    
    #===========================Fill This===========================#
    # 1. Look for the symbol of a digit in RegExp.
    # 2. If the digit symbol is `?` put `?+` to match more than one digit in a row.
    return [[re.sub(???, "", x) for x in y] for y in lIngredients]
    #===============================================================#

def RemoveChars( lIngredients: list ) -> list:
    # Remove some unnecessary characters from the ingredients list
   
    lIngredients = [[x.replace("-", " ") for x in y] for y in lIngredients]
    #===========================Fill This===========================# 
    # 01. Remove the following: & 
    # 02. Remove the following: '
    # 03. Remove the following: ''
    # 04. Remove the following: % 
    # 05. Remove the following: ! 
    # 06. Remove the following: (  
    # 07. Remove the following: ) 
    # 08. Remove the following: / 
    # 09. Remove the following: \ 
    # 10. Remove the following: , 
    # 11. Remove the following: . 
    # 12. Remove the following: "
    # !!! In some cases escaping is required.
    # !!! Look at the above example.
    lIngredients = [[x.replace(???, " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace(???, " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace(???, " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace(???, " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace(???, " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace(???, " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace(???, " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace(???, " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace(???, " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace(???, " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace(???, " ") for x in y] for y in lIngredients]
    lIngredients = [[x.replace(???, " ") for x in y] for y in lIngredients]
    #===============================================================# 
    lIngredients = [[x.replace(u"\u2122", " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace(u"\u00AE", " ") for x in y] for y in lIngredients] 
    lIngredients = [[x.replace(u"\u2019", " ") for x in y] for y in lIngredients] 

    return lIngredients

def LowerCase( lIngredients: list ) -> list:
    # Make letters lowercase for the ingredients list
    
    #===========================Fill This===========================# 
    # 1. Make lower case of the text. 
    # !! Pay attention that the input is a list of lists!
    return [[??? for x in y] for y in lIngredients]
    #===============================================================# 

def RemoveRedundantWhiteSpace( lIngredients: list ) -> list:
    # Removes redundant whitespaces
    
    #===========================Fill This===========================# 
    # 1. Look for the symbol of a space in RegExp.
    # 2. If the space symbol is `?` put `?+` to match more than one space in a row.
    # !! Pay attention that the input is a list of lists!
    return [[re.sub(???, ' ', x).strip() for x in y] for y in lIngredients] 
    #===============================================================# 
    
    
def StemWords( lIngredients: list ) -> list:
    # Word stemming for ingredients list (Per word)
    
    #===========================Fill This===========================# 
    # 1. Construct the `WordNetLemmatizer` object.
    lmtzr = ???
    #===============================================================# 
    
    def WordByWord( inStr: str ):
        
        return " ".join(["".join(lmtzr.lemmatize(w)) for w in inStr.split()])
    
    return [[WordByWord(x) for x in y] for y in lIngredients] 
    
    
def RemoveUnits( lIngredients: list ) -> list:
    # Remove units related words from ingredients
    
    remove_list = ['g', 'lb', 's', 'n']
        
    def CheckWord( inStr: str ):
        
        splitStr = inStr.split()
        resStr  = [word for word in splitStr if word.lower() not in remove_list]
        
        return ' '.join(resStr)

    return [[CheckWord(x) for x in y] for y in lIngredients]

def ExtractUniqueIngredients( lIngredients: list, sortList: bool = True ) -> list:
    # Extract all unique ingredients from the list as a single list

    #===========================Fill This===========================# 
    # 1. Extract the unique values of ingredients (You use the `set()` data type of Python).
    # 2. Sort it by name if `sortList == True`. 
    lUniqueIng = ???
    if sortList:
        lUniqueIng = ???
    #===============================================================# 

    return lUniqueIng

def ExtractFeatureEncoding( lIngredient: list, lUniqueIng: list ) -> np.ndarray:
    # If an ingredient is in the specific recipe
    
    mF = np.zeros(shape = (len(lIngredient), len(lUniqueIng)), dtype = np.uint)
    #===========================Fill This===========================# 
    # 1. Iterate over the list of lists of the ingredients.
    # 2. For each sample (List of ingredients), put 1 in the location of the ingredients.
    ?????
    #===============================================================# 
            
    return mF


## Exercise

This exercise introduces:

 - Working with real world data in the context of basic Natural Language Processing (NLP).
 - Working with binary features using Decision Trees.
 - Working with an ensemble method based on trees.
 - Utilizing the `LightGBM` package with the `LGBMClassifier`.


* <font color='brown'>(**#**)</font> One of the objectives of this exercise is working on a data set which is non trivial in size, features and performance.
* <font color='brown'>(**#**)</font> SciKit Learn has some text feature extractors in the [`sklearn.feature_extraction.text`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text) module.  
  You're encouraged to use them to improve results after finishing the exercise once without them.

In this exercise we'll work with the data set: [Yummly - What's Cooking?](https://www.kaggle.com/competitions/whats-cooking) from [Kaggle](https://www.kaggle.com).  
The data set is basically a list of ingredients of a recipe (Features) and the type of cuisine of the recipe (Italian, French, Indian, etc...).  
The objective is being able to classify the cuisine of a recipe by its ingredients.

* <font color='brown'>(**#**)</font> The data set will be downloaded and parsed automatically.

The data will be defined as the following:

1. A boolean matrix of size `numSamples x numFeatures`.
2. The features are the list of all ingredients in the recipes.
3. For a recipe, the features vector is a hot encoding of its ingredients: `1` if the ingredient is in the recipe, `0` otherwise.  

For example, if the list of features is: `basil, chicken, egg, eggplant, garlic, pasta, salt, tomato sauce`.  
Then for Pasta with Tomato Sauce the features vector will be: `[1, 0, 0, 0, 1, 1, 1, 1]` which means: `basil, garlic, pasta, salt, tomato sauce`.  
This will be the basic feature list, while you're encouraged to add more features.
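
The toy example above can be reproduced in a few lines of plain Python. This is only a sketch on the feature list from the text; the names `lToyFeat`, `lToyRecipe` and `vToyEnc` are made up for illustration, and it is not necessarily the implementation expected in `ExtractFeatureEncoding()`:

```python
# Toy illustration of the hot encoding (Hypothetical names, plain Python lists)
lToyFeat   = ['basil', 'chicken', 'egg', 'eggplant', 'garlic', 'pasta', 'salt', 'tomato sauce'] #<! Sorted features
lToyRecipe = ['pasta', 'tomato sauce', 'garlic', 'basil', 'salt'] #<! Ingredients of a single recipe

sToyRecipe = set(lToyRecipe) #<! Set for a fast membership test
vToyEnc    = [1 if featName in sToyRecipe else 0 for featName in lToyFeat]

print(vToyEnc) #<! [1, 0, 0, 0, 1, 1, 1, 1]
```

The match is on the whole name, hence `egg` is not activated by `eggplant`.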

In this exercise:

1. Download the data (Automatically by the code).
2. Parse data into a data structure to work with (Automatically by the code).
3. Extract features from the recipes (The basic features: Existence of an ingredient).
4. Train an Ensemble of Decision Trees using the LightGBM models (Very fast).
5. Optimize the model hyper parameters (See below).
6. Plot the _confusion matrix_ of the best model on the data.

Optimize features (repeat if needed) to get accuracy of at least `70%`.

* <font color='brown'>(**#**)</font> Working with text requires some knowledge in [Regular Expression](https://en.wikipedia.org/wiki/Regular_expression).  
  This is the most useful engine to handle deterministic patterns at the character level.  
  [RegExOne](https://regexone.com) has a great tutorial and [RegEx 101](https://regex101.com) has a great online tool to experiment with.
* <font color='brown'>(**#**)</font> Read on: [Stemming](https://en.wikipedia.org/wiki/Stemming) and [Lemmatization](https://en.wikipedia.org/wiki/Lemmatization).
* <font color='brown'>(**#**)</font> It might be useful to use the [NLTK](https://github.com/nltk/nltk) package for word stemming.
* <font color='brown'>(**#**)</font> To install the package (Prior to working with the notebook):
  - Open Anaconda command line (`Prompt`).
  - Activate the course environment by: `conda activate <CourseEnvName>`.
  - Install the package using `conda install nltk -c conda-forge`. 
  - You may use `micromamba` instead of `conda`. 
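
As a small taste of regular expressions with Python's `re` module, the sketch below strips a parenthesized remark from an ingredient string. The input string and the pattern are hypothetical examples, and they are not one of the fill-in answers of this notebook:

```python
import re

# Hypothetical raw ingredient (Not taken from the data set)
rawIng = 'frozen chopped spinach (thawed and drained)'

# `\(` and `\)` match literal parentheses, `[^)]*` matches any run of characters which are not `)`
cleanIng = re.sub(r'\([^)]*\)', '', rawIng).strip()

print(cleanIng) #<! frozen chopped spinach
```

Testing such a pattern on a few strings in [RegEx 101](https://regex101.com) before applying it to the whole data set is usually worth the time.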

In [None]:
# Parameters

numSamplesTrain = 35_000
numSamplesTest  = None

# Hyper Parameters of the Model

#===========================Fill This===========================#
# 1. Set the list of learning rate (4 values in range [0.05, 0.5]).
# 2. Set the list of maximum iterations (3 integer values in range [10, 200]).
# 3. Set the list of maximum nodes (3 integer values in range [10, 50]).
lLearnRate  = ??? #<! List of learn rates
lMaxItr     = ??? #<! List of maximum iterations
lMaxNodes   = ??? #<! List of maximum nodes (Leaves)
#===============================================================#

numFold     = 3 #<! Don't change!

* <font color='blue'>(**!**)</font> Fill the functions in `Auxiliary Functions` **after** reading the code below which uses them.

## Generate / Load Data

Load the classification data set.

In [None]:
# Download Data
# This section downloads data from the given URL if needed.

if not os.path.exists(DATA_FILE_NAME):
    urllib.request.urlretrieve(DATA_FILE_URL, DATA_FILE_NAME)

In [None]:
# Load Data

lId, lCuisine, lIngredients = ReadData(DATA_FILE_NAME)


### Pre Processing the Data

In this section we'll do the following:

1. Make all text _lower case_.
2. Remove digits (Weights etc...).
3. Remove some unnecessary characters.
4. Remove redundant spaces.
5. Remove units.
6. Stem the text (See [Word Stemming](https://en.wikipedia.org/wiki/Stemming)).

The objective is to reduce the sensitivity to the style used to describe the ingredients.  
So we're after the most basic way to describe each ingredient.

* <font color='brown'>(**#**)</font> The list above is the minimum to be done. You're encouraged to use more ideas. For example:
  - The number of ingredients.
  - Higher level aggregation of ingredients: Cheese, Flour, Sauce, etc...
* <font color='brown'>(**#**)</font> Look at the features list after this. You'll find there are still duplications and redundancy.  
   Removing those will improve results.
* <font color='brown'>(**#**)</font> There is an extreme number of features. In this case, being able to reduce the number by removing redundant features is useful.
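
The extra feature ideas listed above (The number of ingredients and a higher level aggregation) could be sketched as below. The recipes, the group names and the keyword matching rule are all made up for illustration; a real grouping should be derived from the actual features list:

```python
# Hypothetical pre processed recipes (List of lists, the same structure as `lIng`)
lToyIng = [['mozzarella cheese', 'all purpose flour', 'tomato sauce'],
           ['soy sauce', 'rice', 'ginger', 'garlic']]

# Hypothetical higher level groups, matched by a keyword within the ingredient name
lToyGroup = ['cheese', 'flour', 'sauce']

lToyFeatExt = []
for lRecipe in lToyIng:
    numIng     = len(lRecipe) #<! Number of ingredients in the recipe
    lGroupFlag = [int(any(groupName in ingName.split() for ingName in lRecipe)) for groupName in lToyGroup]
    lToyFeatExt.append([numIng] + lGroupFlag)

print(lToyFeatExt) #<! [[3, 1, 1, 1], [4, 0, 0, 1]]
```

Such columns would be appended to the basic features matrix. Note that the number of ingredients is a numeric feature and not a binary one.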

In [None]:
# Pre Process Data

#===========================Fill This===========================#
# 1. Fill the body of the functions above.
lIng = LowerCase(lIngredients)
lIng = RemoveDigits(lIng)
lIng = RemoveChars(lIng)
lIng = RemoveRedundantWhiteSpace(lIng)
lIng = RemoveUnits(lIng)
lIng = StemWords(lIng)
#===============================================================#

In [None]:
# Extract the Features

#===========================Fill This===========================#
# 1. Fill the body of the function above.
lFeat = ExtractUniqueIngredients(lIng)
#===============================================================#

* <font color='brown'>(**#**)</font> The function `ExtractFeatureEncoding` matches based on the whole name of the ingredient.  
  For multi word ingredients, one might even match on a single word.  
  This can be useful for cases like `ketchup` vs. `tomato ketchup`.
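
A minimal sketch of the single word idea, on hypothetical ingredients: each ingredient is split into words so `ketchup` and `tomato ketchup` share a feature. This is a variation one might try, not the behavior of `ExtractFeatureEncoding()`:

```python
# Hypothetical recipes (Not taken from the data set)
lToyIng = [['ketchup', 'ground beef'], ['tomato ketchup', 'onion']]

# Vocabulary of single words instead of whole ingredient names
lToyWord = sorted({wordStr for lRecipe in lToyIng for ingName in lRecipe for wordStr in ingName.split()})
print(lToyWord) #<! ['beef', 'ground', 'ketchup', 'onion', 'tomato']

# Word level encoding: Both recipes activate the `ketchup` feature
lToyEnc = [[int(any(wordStr in ingName.split() for ingName in lRecipe)) for wordStr in lToyWord] for lRecipe in lToyIng]
print(lToyEnc) #<! [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```

The trade off is that generic words (Such as `ground`) become features as well, so some filtering is probably needed.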

## Training Data and Feature Engineering / Extraction

The idea of the feature engineering in this case is assisting the classifier to identify patterns.  
Most of the cuisines have some patterns associated with them, for example: dough, tomato and cheese.  
The combinations are given by one hot encoding of the ingredients.


* <font color='brown'>(**#**)</font> You're encouraged to think of more features!
* <font color='brown'>(**#**)</font> Pay attention to the dimensionality of the data.

In [None]:
# Create the Features Encoding

#===========================Fill This===========================#
# 1. Fill the body of the function above.
mF = ExtractFeatureEncoding(lIng, lFeat) #<! Features matrix
#===============================================================#


In [None]:
# Display the Data
# Create a Data Frame of the data
dfX = pd.DataFrame(columns = lFeat, data = mF)
dfX

In [None]:
# Create the Labels Data

dsY = pd.Series(data = lCuisine, name = 'Cuisine')
dsY

In [None]:
# Labels as Categorical Data

vY          = pd.Categorical(dsY).codes
lEncoding   = pd.Categorical(dsY).categories.to_list()

In [None]:
# Data Dimensions

print(f'The data shape: {dfX.shape}')
print(f'The labels shape: {dsY.shape}')
print(f'The number of classes: {len(dsY.unique())}')
print(f'The unique values of the labels: {dsY.unique()}')

In [None]:
# Plot the Labels Distribution

hF, hA = plt.subplots(figsize = (12, 8))
hA = PlotLabelsHistogram(dsY, hA = hA, xLabelRot = 90)


* <font color='red'>(**?**)</font> Is this a balanced data set?
* <font color='red'>(**?**)</font> If the data is imbalanced, what approach would you use in this case to handle it?

### Split Data

We'll split the data into training and testing.  
Set `numSamplesTrain`. For the first tries, you may use a small number just to verify everything works.

In [None]:
# Split Train & Test Data

#===========================Fill This===========================#
# 1. Split the data using `train_test_split()`.
# 2. Make sure to use `numSamplesTrain` and `numSamplesTest`.
# 3. Set the `random_state` so iterative runs will be reproducible.
mXTrain, mXTest, vYTrain, vYTest = ???
#===============================================================#


# Dimensions of the Data
print(f'The number of training data samples: {mXTrain.shape[0]}')
print(f'The number of training features per sample: {mXTrain.shape[1]}') 


print(f'The number of test data samples: {mXTest.shape[0]}')
print(f'The number of test features per sample: {mXTest.shape[1]}') 

* <font color='red'>(**?**)</font> What's the ratio of the train samples vs. number of features? What do you think it should be?

### Plot Data

In [None]:
# Histogram of Classes

# Train
hA = PlotLabelsHistogram(vYTrain, lClass = lEncoding, xLabelRot = 90)
hA.set_title(hA.get_title() + ' - Train Data')
plt.show()

In [None]:
# Histogram of Classes

# Test
hA = PlotLabelsHistogram(vYTest, lClass = lEncoding, xLabelRot = 90)
hA.set_title(hA.get_title() + ' - Test Data')
plt.show()

* <font color='red'>(**?**)</font> Which score method would you use between _accuracy_, _recall_, _precision_ or _F1_?

## Optimize Classifiers

In this section we'll train an Ensemble of Trees using the [`LGBMClassifier`](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html) class of the [LightGBM](https://github.com/microsoft/LightGBM) package.  
We'll learn the ensemble model later in the course, but for now we'll just optimize its hyper parameters.
This model has a lot of hyper parameters yet we'll focus on:

 - Number of Leaf Nodes (`num_leaves`) - Sets the maximum number of leaves in each tree.
 - Learning Rate (`learning_rate`) - The learning rate of the ensemble (The significance of each model compared to those before it).
 - Number of Trees (`n_estimators`) - The number of iterations of the algorithm. In each iteration a single tree is added.

The score will be the F1 score with _micro_ averaging, which pools the counts of all classes (Use the `f1_micro` string).  
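
For single label multi class problems such as this one, the micro averaged F1 pools the true positives, false positives and false negatives of all classes before computing the score. Each mistake counts once as a false positive (For the predicted class) and once as a false negative (For the true class), so the result coincides with the accuracy. A small check in plain Python on made up labels:

```python
# Hypothetical labels (3 classes, 6 samples)
lYTrue = [0, 0, 1, 1, 2, 2]
lYPred = [0, 1, 1, 1, 2, 0]

lClass = sorted(set(lYTrue) | set(lYPred))

# Pool the counts over all classes (Micro averaging)
numTp = sum(sum((yT == c) and (yP == c) for yT, yP in zip(lYTrue, lYPred)) for c in lClass)
numFp = sum(sum((yT != c) and (yP == c) for yT, yP in zip(lYTrue, lYPred)) for c in lClass)
numFn = sum(sum((yT == c) and (yP != c) for yT, yP in zip(lYTrue, lYPred)) for c in lClass)

f1Micro  = (2 * numTp) / ((2 * numTp) + numFp + numFn)
accScore = sum(yT == yP for yT, yP in zip(lYTrue, lYPred)) / len(lYTrue)

print(f'Micro F1: {f1Micro:0.3f}, Accuracy: {accScore:0.3f}')
```

Other averaging methods (Such as `f1_macro`) treat the classes differently and in general do not match the accuracy.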

Those are the generic steps for hyper parameter optimization: 

1. Build the dictionary of parameters for the grid search.
2. Construct the grid search object (`GridSearchCV`).
3. Optimize the hyper parameters by the `fit()` method of the grid search object.
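
To get a feeling for the cost of step 3, the sketch below enumerates a hypothetical grid with `itertools.product`. The values are placeholders within the ranges given in the parameters cell, not recommended settings. The grid search fits one model per combination per fold:

```python
import itertools

# Hypothetical grid (The keys are the `LGBMClassifier` parameters named above)
dToyParams = {
    'learning_rate': [0.05, 0.1, 0.2, 0.5],
    'n_estimators' : [10, 50, 200],
    'num_leaves'   : [10, 25, 50],
}
numFoldToy = 3

lCombination = list(itertools.product(*dToyParams.values()))
numFits      = len(lCombination) * numFoldToy

print(f'Number of combinations: {len(lCombination)}') #<! 36
print(f'Number of fits        : {numFits}') #<! 108
```

Adding a single value to any of the lists adds a whole slice of combinations, so it pays off to start with a coarse grid.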

* <font color='red'>(**?**)</font> Why is the _F1_ score a reasonable choice in this case?
* <font color='brown'>(**#**)</font> There are several implementations of tree based ensemble methods which are considered better and more production ready than _SciKit Learn_ while being compatible with it:
  * [XGBoost](https://github.com/dmlc/xgboost) - The pioneer of specialized boosting trees. Very efficient and widely used. Lately added the feature of _histogram based_ training.  
    Originally developed by the _Distributed (Deep) Machine Learning Community_ (DMLC) group at the _University of Washington_.
  * [LightGBM](https://github.com/microsoft/LightGBM) - Pioneered the concept of _histogram based_ training which gives a much faster training with minimal effect on the performance.  
    Developed by _Microsoft_.
  * [CatBoost](https://github.com/catboost/catboost) - Known for optimized treatment of _categorical_ features and extreme optimization.  
    Developed by _Yandex_ (Russian company).

In [None]:
# Construct the Grid Search Object 

#===========================Fill This===========================#
# 1. Set the parameters to iterate over and their values.
# 2. Set the estimator of `GridSearchCV` to `LGBMClassifier`.
# 3. Set the parameters grid.
# 4. Set the scoring to `f1_micro`.
# 5. Set the number of folds.
# 6. Set the verbosity level to the highest.
dParams = ??? #<! Parameters dictionary
oGsSvc = GridSearchCV(estimator = ???, param_grid = dParams, scoring = ???, cv = ???, verbose = ???)
#===============================================================#



In [None]:
# Optimize
# Might take a few minutes!

# Set the indices of the categorical features.
# If you extend `mF` beyond the default, make sure to adjust accordingly.
vCatFeatFlag = np.full(shape = len(lFeat), fill_value = True)

#===========================Fill This===========================#
# 1. Apply the grid search phase.
oGsSvc = ???
#===============================================================#

* <font color='brown'>(**#**)</font> The above might take a while (Up to 10 minutes)!

## Confusion Matrix on Test Data 

In this section we'll test the model on the test data.

1. Extract the best estimator from the grid search.
2. If needed, fit it to the train data.
3. Display the _confusion matrix_ for the train and test data sets.
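
The plots below use `normMethod = 'true'`. Assuming it follows the `normalize = 'true'` convention of SciKit Learn's `confusion_matrix()`, each row (True class) is normalized to sum to 1, so the diagonal shows the per class recall. A sketch in plain Python on made up labels:

```python
# Hypothetical labels (3 classes)
lYTrue   = [0, 0, 0, 1, 1, 2, 2, 2, 2]
lYPred   = [0, 0, 1, 1, 1, 2, 2, 0, 2]
numClass = 3

# Rows: True class, Columns: Predicted class
mToyConf = [[0] * numClass for _ in range(numClass)]
for yT, yP in zip(lYTrue, lYPred):
    mToyConf[yT][yP] += 1

# Normalize each row by the number of samples of the true class
mToyConfNorm = [[cellVal / sum(lRow) for cellVal in lRow] for lRow in mToyConf]

print(mToyConf) #<! [[2, 1, 0], [0, 2, 0], [1, 0, 3]]
print(mToyConfNorm)
```

With imbalanced classes, the normalized view makes the small classes as visible as the large ones.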

In [None]:
# Extract the Best Model

#===========================Fill This===========================#
# 1. Get the best model with the optimized hyper parameters.
oBestModel = ???
#===============================================================#

* <font color='red'>(**?**)</font> Does the best model need a refit on data?

In [None]:
# Fit the Model

#===========================Fill This===========================#
# 1. Train the model on the whole training data.
oBestModel = ???
#===============================================================#

In [None]:
# Plot the Confusion Matrix (Train)
hF, hA = plt.subplots(figsize = (12, 12))

hA, mConfMat = PlotConfusionMatrix(vYTrain, oBestModel.predict(mXTrain), lLabels = lEncoding, hA = hA, xLabelRot = 90, normMethod = 'true', valFormat = '0.0%')
hA.set_title(hA.get_title() + ' - Train Data')

plt.show()

In [None]:
# Plot the Confusion Matrix (Test)
hF, hA = plt.subplots(figsize = (12, 12))


hA, mConfMat = PlotConfusionMatrix(vYTest, oBestModel.predict(mXTest), lLabels = lEncoding, hA = hA, xLabelRot = 90, normMethod = 'true', valFormat = '0.0%')
hA.set_title(hA.get_title() + ' - Test Data')

plt.show()

In [None]:
# Accuracy
# Should be above 70%
print(f'The best model accuracy is: {oBestModel.score(mXTest, vYTest):0.1%}')

* <font color='red'>(**?**)</font> How would you handle the case where the test data has features which are not in the training data?
* <font color='red'>(**?**)</font> Have a look at the well performing cuisines vs. the poorly performing ones. Can you think why?
* <font color='green'>(**@**)</font> Try to get more features and improve results.