[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io/)

# Machine Learning Methods

## Exercise 005 - Regression

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        | Content / Changes                                                  |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.000 | 17/02/2023 | Royi Avital | First version                                                      |
|         |            |             |                                                                    |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/MachineLearningMethods/2023_01/Exercise0005RegressionSolution.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Miscellaneous
import itertools
import json
import os
from platform import python_version
import random
import urllib.request
import re

# Typing
from typing import Callable, List, Tuple

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# from bokeh.plotting import figure, show

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

In [None]:
# Fixel Algorithms Packages


## Exercise

In this exercise we'll use two widely used implementations of Gradient Boosted Trees: `LightGBM` and `XGBoost`.  
We'll work on the [Insurance Data Set from OpenML](https://www.openml.org/search?type=data&id=43463).  
In the data set we're given data of people and their insurance charges.

This exercise introduces:

 - Some basic EDA / Feature Engineering / Pre Processing.
 - Working with categorical features.
 - Building a pipeline based on a Data Frame with processing only sub set of the features.
 - Utilizing the `LightGBM` and `XGBoost` packages with the `LGBMRegressor` and `XGBRegressor`.
 - Optimizing a pipeline in which one of the hyper parameters is the model type.

The objective is to predict a person's insurance charges using a regression model.

* <font color='brown'>(**#**)</font> In `LightGBM` support for categorical data is fully implemented. In `XGBoost` it is still a work in progress.

In this exercise:

1. Download the data (Automatically by the code).
2. Parse data into a data structure to work with (Automatically by the code).
3. Explore the data and the features (EDA).
4. Feature Engineering and Pre Processing of the data.
5. Optimize the _Hyper Parameters_ of the models using the `R2` score.
6. Build a _pipeline_ which processes only a sub set of the columns.
7. Plot the _regression error_ of the best model on the data.

The hyper parameters optimization and the features engineering should get you `R2 > 0.8`.  
With some effort even `R2 > 0.85` is achievable.

In [None]:
# Parameters

numSamplesTrain = 35_000
numSamplesTest  = None

# Hyper Parameters of the Model

lRegModel       = ['LightGBM', 'XGBoost']
#===========================Fill This===========================#
# 1. Set the list of learning rate (3 values in range [0.05, 0.5]).
# 2. Set the list of maximum number of trees in the model (3 integer values in range [10, 200]).
# 3. Set the list of maximum leaf nodes (3 integer values in range [10, 50]).
# !! Start with small number of combinations until the code is stable.
# !! You may want to optimize the polynomial degree list after the data analysis.
lLearnRate      = ???
lNumEstimators  = ???
lMaxLeafNodes   = ???
lPolyDeg        = ???
#===============================================================#

numFold = 4 #<! Don't change!

In [None]:
# Auxiliary Functions

def PlotRegResults( vY, vYPred, hA:plt.Axes = None, figSize: Tuple[int, int] = FIG_SIZE_DEF, lineWidth: int = LINE_WIDTH_DEF, elmSize: int = ELM_SIZE_DEF, classColor: Tuple[str, str] = CLASS_COLOR, axisTitle: str = None ) -> plt.Axes:

    if hA is None:
        hF, hA = plt.subplots(figsize = figSize)
    else:
        hF = hA.get_figure()

    numSamples = len(vY)
    if (numSamples != len(vYPred)):
        raise ValueError('The inputs `vY` and `vYPred` must have the same number of elements')
    
    
    hA.plot(vY, vY, color = 'r', lw = lineWidth, label = 'Ground Truth')
    hA.scatter(vY, vYPred, s = elmSize, color = classColor[0], edgecolor = 'k', label = 'Estimation')
    hA.set_xlabel('Label Value')
    hA.set_ylabel('Prediction Value')
    # hA.axis('equal')
    if axisTitle is not None:
        hA.set_title(axisTitle)
    hA.legend()
    
    return hA

## Generate / Load Data


In [None]:
# Loading / Generating Data

dfData, _ = fetch_openml('Insurance-Premium-Data', version = 1, return_X_y = True, as_frame = True, parser = 'auto')

print(f'The data shape: {dfData.shape}')


Basic info on the data

In [None]:
# Data frame Head
# Look at the structure of the data
dfData.head()

In [None]:
# Data Frame Information
# Look at the types of each feature 
dfData.info()

In [None]:
# Change Features Name
dfData.columns = ['Age', 'Sex', 'BMI', 'NumberChildren', 'Smoker', 'Region', 'Charges']
dfData

## Basic EDA, Feature Engineering & Pre Processing

Working with data starts with looking at the data and the relations between the features.  
This is an iterative procedure: EDA -> Feature Engineering -> Model Optimization -> Error Analysis -> EDA.

In [None]:
# Pair Plot
# The basic connection between the features
sns.pairplot(data = dfData)

plt.show()

In [None]:
# Data Visualization
# Basic EDA on the Data: Box Plot for discrete data, Scatter plot for continuous data.

numCol = dfData.shape[1]
lCols  = dfData.columns
numAx  = int(np.ceil(np.sqrt(numCol)))

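# `pd.api.types.is_categorical_dtype()` is deprecated in recent Pandas versions, hence the `isinstance()` check
hIsCatLikData = lambda dsX: (isinstance(dsX.dtype, pd.CategoricalDtype) or pd.api.types.is_bool_dtype(dsX) or pd.api.types.is_object_dtype(dsX) or pd.api.types.is_integer_dtype(dsX))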

hF, hAs = plt.subplots(nrows = numAx, ncols = numAx, figsize = (20, 12))
hAs = hAs.flat

for ii in range(numCol):
    colName = dfData.columns[ii]
    if hIsCatLikData(dfData[colName]):
        sns.boxplot(data = dfData, x = colName, y = 'Charges', ax = hAs[ii])
    else:
        sns.scatterplot(data = dfData, x = colName, y = 'Charges', ax = hAs[ii])

plt.show()

* <font color='red'>(**?**)</font> What do you think about the Dynamic Range of the `Age` feature?
* <font color='brown'>(**#**)</font> If the data set is large enough, one might consider specialized models. For instance, specialization for the `Sex` and the `Smoker` features.  
  Though, in their basic levels, _decision trees_ with proper support for categorical features, can do exactly that inherently. 

An important step is to look at the distribution of the objective values.  
Some models make assumptions on the values.  
Sometimes we even process the target value, for instance applying some transform to make it more Gaussian like.
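
As a minimal sketch of such a target transform, the code below applies a log transform to a synthetic, right skewed target and compares the skewness before and after. The data is generated for illustration only (it is not the insurance data), and the choice of `np.log1p()` is an assumption; other transforms may fit better.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic right skewed target (hypothetical values, not the exercise data)
vY = rng.lognormal(mean = 9.0, sigma = 0.8, size = 2_000)

# Log transform compresses the long right tail
vYLog = np.log1p(vY)

skewRaw = skew(vY)
skewLog = skew(vYLog)
print(f'Skewness before: {skewRaw:0.2f}, after: {skewLog:0.2f}')

# Predictions made in the log domain are mapped back by the inverse transform
vYRec = np.expm1(vYLog)
```

When a model is trained on the transformed target, its predictions must be mapped back before calculating a score on the original scale.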

In [None]:
hF, hA = plt.subplots(figsize = (10, 7))

sns.kdeplot(data = dfData, x = 'Charges', fill = True, clip = (0, np.inf), ax = hA)

plt.show()

* <font color='red'>(**?**)</font> In the sense of "Classification", is the data balanced?
* <font color='red'>(**?**)</font> How should we handle the test / train split?

In [None]:
# Histogram of the `Age` Feature

hF, hA = plt.subplots(figsize = (10, 7))

sns.histplot(data = dfData, x = 'Age', discrete = True, ax = hA)

plt.show()


In [None]:
# Scatter of the Age Feature

hF, hA = plt.subplots(figsize = (10, 7))

sns.scatterplot(data = dfData, x = 'Age', y = 'Charges', ax = hA)

plt.show()

* <font color='red'>(**?**)</font> What kind of relationship do you see between the `Age` and the `Charges`?

In [None]:
# Grouped Scatter Plot with Linear Model
# It is sometimes good to see the behavior within a slice of features.
# In this case we want to see the relation between BMI and charges, split by sex.

sns.lmplot(data = dfData, x = 'BMI',  y = 'Charges', hue = 'Smoker', col = 'Sex', palette = 'magma')
plt.show()


In [None]:
# Grouped Scatter Plot with Linear Model
# It is sometimes good to see the behavior within a slice of features.
# In this case we want to see the relation between BMI and charges, split by region.
sns.lmplot(data = dfData, x = 'BMI',  y = 'Charges', hue = 'Smoker', col = 'Region', palette = 'magma')
plt.show()

* <font color='red'>(**?**)</font> Do all regions behave the same? Is there an outlier?

### Feature Engineering

In this section we'll do a simple feature engineering / pre processing:

1. Create 2 Binary features based on the `Region` feature:
  - `RegionNorth` if there is `north` in `Region`.
  - `RegionWest` if there is `west` in `Region`.
2. Create a new feature based on the BMI according to [American Cancer Society - Normal Weight Ranges: Body Mass Index (BMI)](https://www.cancer.org/healthy/cancer-causes/diet-physical-activity/body-weight-and-cancer-risk/adult-bmi.html).
3. Create a new feature based on the age range.

You may try more features, for instance:
 
 - High Risk: If `Obese`, `Smoker` and `Elder` (Or other combination which makes sense).
 - Family Size: `Small`, `Large` (Analyze the histogram to determine).
 - Family Role: `Father` / `Mother` / `None`.

* <font color='brown'>(**#**)</font> Feature engineering is the magic in the process. We don't want too many features, but we want good ones.
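
The two basic tools for such features, a binary flag from a sub string match and binning of a continuous value, can be sketched on a toy data frame. The column names (`Product`, `Score`) and the bin limits are hypothetical and unrelated to the exercise data.

```python
import pandas as pd

# Toy data (hypothetical columns, not the exercise data)
dfToy = pd.DataFrame({
    'Product': ['red-small', 'blue-large', 'red-large', 'green-small'],
    'Score':   [12.5, 47.0, 30.0, 81.3],
})

# Binary flag: 1 if the sub string appears in the value, else 0
dfToy['IsRed'] = dfToy['Product'].str.contains('red').astype(int)

# Binning a continuous feature into labeled ranges: [0, 25), [25, 50), [50, inf)
dfToy['ScoreLevel'] = pd.cut(dfToy['Score'], bins = [0, 25, 50, float('inf')], labels = ['Low', 'Mid', 'High'], right = False)

print(dfToy)
```

An explicit function applied with `map()` gives the same result as `pd.cut()` and may be easier to read when the rules are irregular.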

In [None]:
# Pre Process Data
# Region

#===========================Fill This===========================#
# 1. Create 2 new features: `RegionNorth` and `RegionWest`:
#  - `RegionNorth` = 1 if `north` in `Region` else 0.
#  - `RegionWest` = 1 if `west` in `Region` else 0.
# 2. Remove the `Region` feature.
????
#===============================================================#


In [None]:
# Pre Process Data
# BMI

#===========================Fill This===========================#
# 1. Create a function which gets a float number `inBmi` and returns a string.
# 2. The string should match: 'Under Weight', 'Normal Weight', 'Over Weight' and 'Obese'.
# 3. The criteria should match https://www.cancer.org/cancer/cancer-causes/diet-physical-activity/body-weight-and-cancer-risk/adult-bmi.html.
def BmiCategory( inBmi: float ) -> str:
    ?????

#===============================================================#


#===========================Fill This===========================#
# 1. Create a new feature `BMI Category` by mapping `BMI` using `BmiCategory()`.
dfData['BMI Category'] = ???
#===============================================================#

In [None]:
# Pre Process Data
# Age

#===========================Fill This===========================#
# 1. Create a function which gets an integer number `inAge` and returns a string.
# 2. The string should match: 'Young Adult', 'Senior Adult', 'Elder'.
# 3. Set the limits according to data or any other sensible choice.
def AgeCategory( inAge: int ) -> str:
    ????

#===============================================================#


#===========================Fill This===========================#
# 1. Create a new feature `Age Category` by mapping `Age` using `AgeCategory()`.
dfData['Age Category'] = ???
#===============================================================#

* <font color='brown'>(**#**)</font> In practice, after some feature engineering, we need to redo the EDA part.

In [None]:
# Look at the Data
dfData

### Pre Processing

We'll apply simple transforms on the data:

1. Map the `Sex`, `Smoker`, `Age Category` and `BMI Category` into numerical values.
2. Set the categorical features data types into `categorical`.
3. Split the data into `dfX` and `dsY`.


* <font color='brown'>(**#**)</font> In practice, part of the pre processing is rejecting outliers.
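
A minimal sketch of the first two steps on a toy data frame: mapping strings to integers with an explicit dictionary and then setting the `category` data type. The columns and the mapping values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical columns, not the exercise data)
dfToy = pd.DataFrame({'Answer': ['yes', 'no', 'yes'], 'Size': ['S', 'L', 'M'], 'Value': [1.5, 2.0, 0.3]})

# Map strings to integers with an explicit dictionary
dfToy['Answer'] = dfToy['Answer'].map({'no': 0, 'yes': 1})
dfToy['Size']   = dfToy['Size'].map({'S': 0, 'M': 1, 'L': 2})

# Mark the mapped columns as categorical, leave the numerical column as is
lCatCols = ['Answer', 'Size']
for colName in lCatCols:
    dfToy[colName] = dfToy[colName].astype('category')

print(dfToy.dtypes)
```

A value missing from the dictionary is mapped to `NaN`, so it is worth verifying the result with `info()` after the mapping.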

In [None]:
# Convert Categorical Data into Numeric Value

#===========================Fill This===========================#
# 1. Map the values into numerical values.
dfData['Sex'] = ???
dfData['Smoker'] = ???
dfData['Age Category'] = ???
dfData['BMI Category'] = ???
#===============================================================#

In [None]:
# Observe the Data
dfData

In [None]:
# Observe the Data Types
dfData.info()

In [None]:
# Select the Categorical / Numerical Columns

#===========================Fill This===========================#
# 1. Create a list of the columns which are categorical features.
# 2. Create a list of the numerical features.
lCatData = ???
lNumData = ???
#===============================================================#

In [None]:
# Split Data into Features and Labels

#===========================Fill This===========================#
# 1. Set the data of `dfX`.
# 2. Set the labels of `dsY`.
dfX = ???
dsY = ???
#===============================================================#

In [None]:
# Set the Categorical Features

#===========================Fill This===========================#
# 1. Convert the columns in `lCatData` to categorical data.
# !! You may use `astype('category')` or `pd.Categorical()`.
?????
#===============================================================#

# Observe the Data
dfX

In [None]:
# Observe the Data Types
# Make sure all columns are set correctly.
dfX.info()

## Optimize Regressors

In this section we'll train an Ensemble of Trees which are optimized by Gradient Boosting.  
One of the hyper parameters to optimize is the implementation: `LGBMRegressor` or `XGBRegressor`.

These models have many hyper parameters, yet we'll focus on:

 - Implementation: `LGBMRegressor` or `XGBRegressor`.
 - Number of Leaf Nodes (`num_leaves` / `max_leaves`) - Sets the maximum number of leaves in each tree.
 - Learning Rate (`learning_rate`) - The learning rate of the ensemble (The significance of each model compared to those before it).
 - Number of Trees (`n_estimators`) - The number of iterations of the algorithm. In each iteration a single tree is added.

The score will be the `R2` score.  
We'll use `KFold` for cross validation and `cross_val_predict()` to build the predicted values.

The actual model is a pipeline of `PolynomialFeatures` and the model.  
Yet, we want to use `PolynomialFeatures` only on subset of the features (The non categorical).

* <font color='brown'>(**#**)</font> In some cases people can use the categorical features in the polynomial transform. Yet in order to learn the ability to process a sub set, we'll focus on the numerical ones.

In order to process a subset of the features we'll use [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html).  

The process is as follows:

1. Build a data frame which has a row per combination of the hyper parameters.
2. Iterate on all rows of the data frame, for each combination build the pipeline and predict the labels with `cross_val_predict()`.
3. Calculate the score and update the row with it.
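
The steps above can be sketched with toy parameter lists. The lists, the column names and the dummy score are hypothetical; in the exercise the score comes from the cross validated predictions.

```python
import itertools
import pandas as pd

# Hypothetical parameter lists for illustration
lModel = ['A', 'B']
lRate  = [0.1, 0.2, 0.3]
lDepth = [2, 4]

# One row per combination of the parameters
lComb  = list(itertools.product(lModel, lRate, lDepth))
dfGrid = pd.DataFrame(lComb, columns = ['Model', 'Rate', 'Depth'])
dfGrid['Score'] = 0.0 #<! Placeholder to be filled per row

for ii in range(len(dfGrid)):
    # Read the parameters of the row, evaluate and store the score
    rateVal  = dfGrid.loc[ii, 'Rate']
    depthVal = dfGrid.loc[ii, 'Depth']
    dfGrid.loc[ii, 'Score'] = rateVal * depthVal #<! Dummy score, stands for the model evaluation

print(dfGrid)
```

The number of rows equals the product of the lengths of the lists, so the run time grows quickly with each added value.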

* <font color='brown'>(**#**)</font> Make sure to read about the `remainder` parameter, as it allows us to pipeline the data properly for our use.
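
A minimal sketch of the effect of `remainder` on a toy data frame with two numerical columns and a single flag column. The column names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PolynomialFeatures

# Toy data (hypothetical columns, not the exercise data)
dfToy = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [0.5, 1.5, 2.5], 'Flag': [0, 1, 0]})
lNumCols = ['A', 'B']

# Default (`remainder = 'drop'`): columns which are not listed are discarded
oTrnsDrop = ColumnTransformer([('Poly', PolynomialFeatures(degree = 2), lNumCols)])
# `remainder = 'passthrough'`: columns which are not listed are appended untouched
oTrnsPass = ColumnTransformer([('Poly', PolynomialFeatures(degree = 2), lNumCols)], remainder = 'passthrough')

mDrop = oTrnsDrop.fit_transform(dfToy)
mPass = oTrnsPass.fit_transform(dfToy)

print(mDrop.shape) #<! 6 polynomial terms of (A, B): 1, A, B, A^2, AB, B^2
print(mPass.shape) #<! The 6 polynomial terms and the `Flag` column
```

Note that the output is an array where the transformed columns come first and the remaining columns are appended after them.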

In [None]:
# Creating the Data Frame

#===========================Fill This===========================#
# 1. Calculate the number of combinations.
# 2. Create a nested loop to create the combinations between the parameters (Use `itertools.product()`).
# 3. Store the combinations as the columns of a data frame.
numComb = len(???) * len(???) * len(???) * len(???) * len(???)
dData   = {'Model': [], 'Learn Rate': [], 'Number of Estimators': [], 'Max Leaf Nodes': [], 'Poly Deg': [], 'R2': [0.0] * numComb}

for (regModel, learnRate, numEst, maxLeafNode, polyDeg) in itertools.product(???):
    dData['Model'].append(???)
    dData['Learn Rate'].append(???)
    dData['Number of Estimators'].append(???)
    dData['Max Leaf Nodes'].append(???)
    dData['Poly Deg'].append(???)
#===============================================================#

dfModelScore = pd.DataFrame(data = dData)
dfModelScore

In [None]:
# Optimize the Model

#===========================Fill This===========================#
# 1. Iterate over each row of the data frame `dfModelScore`. Each row defines the hyper parameters.
# 2. Construct the model.
# 3. Train it on the Train Data Set.
# 4. Calculate the score.
# 5. Store the score into the data frame column.


for ii in range(numComb):
    modelName       = ???
    learningRate    = ???
    numEst          = ???
    maxLeafNodes    = ???
    polyDeg         = ???

    print(f'Processing model {ii + 1:03d} out of {numComb}')
    print(f'Model Parameters: {modelName=}, {learningRate=}, {numEst=}, {maxLeafNodes=}, {polyDeg=}') #<! Python trick for F strings

    #!! Set the parameters of the column transformer. Set `remainder` properly to have all data moving forward.
    oColTrns = ColumnTransformer([('PolyFeatures', PolynomialFeatures(degree = ???), lNumData)], remainder = ???)
    if modelName == 'LightGBM':
        oModelReg = ???(n_estimators = ???, learning_rate = ???, num_leaves= ???)
    elif modelName == 'XGBoost':
        oModelReg = ???(n_estimators = ???, learning_rate = ???, max_leaves = ???)
    else:
        raise ValueError(f'The {modelName=} is not supported.')
    
    # Building the pipeline
    oPipeReg = Pipeline([('PolyFeat', ???), ('Regressor', ???)])
    
    # Prediction by Cross Validation
    vYPred = cross_val_predict(???, dfX, dsY, cv = KFold(n_splits = ???, shuffle = True))

    # Score based on the prediction
    scoreR2 = r2_score(???, ???)
    dfModelScore.loc[ii, 'R2'] = ???
    print(f'Finished processing model {ii + 1:03d} with `R2 = {???}`.')
#===============================================================#

* <font color='brown'>(**#**)</font> Efficiency wise, it would be better to calculate the features once per `polyDeg`.
* <font color='red'>(**?**)</font> Why don't we use a stratified K-Fold split in the case above?

In [None]:
# Display Sorted Results (Descending)
# Pandas allows sorting data by any column using the `sort_values()` method
# The `head()` method allows us to see only the first values
dfModelScore.sort_values(by = ['R2'], ascending = False).head(10)

* <font color='brown'>(**#**)</font> With good optimization the `LightGBM` models should rank high. In the data above their built in support for categorical data can assist in squeezing out more performance.
* <font color='brown'>(**#**)</font> The reason it is easy for the `LightGBM` model to optimize on categorical data is related to the way they work (Analyzing the histograms of the data).

### Optimal Model

In this section we'll extract the best model and retrain it on the whole data (`dfX`).  
We need to export the model which has the best cross validation score.
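
Locating the best row of a score table can be sketched on a toy data frame. The column names and values are hypothetical.

```python
import pandas as pd

# Toy score table (hypothetical values)
dfScore = pd.DataFrame({'Model': ['A', 'B', 'C'], 'Score': [0.71, 0.86, 0.80]})

idxBest   = dfScore['Score'].idxmax() #<! Index label of the row with the highest score
bestModel = dfScore.loc[idxBest, 'Model']

print(f'Best row: {idxBest}, model: {bestModel}')
```

The `idxmax()` method returns an index label and not a position, hence it pairs with `loc` and not with `iloc`.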

In [None]:
# Extract the Optimal Hyper Parameters

#===========================Fill This===========================#
# 1. Extract the index of row which maximizes the score.
# 2. Use the index of the row to extract the hyper parameters which were optimized.

#! You may find the `idxmax()` method of a Pandas data frame useful.
idxArgMax = ???
#===============================================================#

modelName       = dfModelScore.loc[idxArgMax, 'Model']
learningRate    = dfModelScore.loc[idxArgMax, 'Learn Rate']
numEst          = dfModelScore.loc[idxArgMax, 'Number of Estimators']
maxLeafNodes    = dfModelScore.loc[idxArgMax, 'Max Leaf Nodes']
polyDeg         = dfModelScore.loc[idxArgMax, 'Poly Deg']

print(f'The optimal hyper parameters are: {modelName=}, {learningRate=}, {numEst=}, {maxLeafNodes=}, {polyDeg=}')



In [None]:
# Construct the Optimal Model & Train on the Whole Data

#===========================Fill This===========================#
# 1. Construct the model with the optimal hyper parameters.
# 2. Fit the model on the whole data set.
oColTrns = ???
if modelName == 'LightGBM':
    oModelReg = ???
elif modelName == 'XGBoost':
    oModelReg = ???
else:
    raise ValueError(f'The model name: {modelName} is not supported.')

oPipeReg = ???

oPipeReg = ???
#===============================================================#

In [None]:
# Model Score (R2)

print(f'The model score (R2) is: {oPipeReg.score(dfX, dsY):0.2f}.')

In [None]:
# Plot the Regression Error

hF, hA = plt.subplots(figsize = (10, 10))

#===========================Fill This===========================#
hA = PlotRegResults(???, hA = hA)
#===============================================================#

plt.show()

* <font color='green'>(**@**)</font> Try to get more features and improve results.  
  Pay attention to the samples which have large error.
* <font color='green'>(**@**)</font> Try combining multiple models into a single model.  
  For instance, a model for smokers and non smokers.
* <font color='green'>(**@**)</font> Analyze the feature importance. Create features which are important. Remove those which are not.