[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io)

# AI Program

## Machine Learning - Supervised Learning - Regression - Decision Tree Regression - Exercise

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 31/12/2025 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0046RegressorPolynomialFit.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from lightgbm import LGBMRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Miscellaneous
from platform import python_version
import random

# Typing
from typing import Callable, Dict, List, Optional, Self, Set, Tuple, Union
from numpy.typing import NDArray

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

```python
valToFill = ???
```

 - Multi Line to Fill (At least one)

```python
# You need to start writing
?????
```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

?????
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

PEOPLE_CSV_URL = 'https://github.com/FixelAlgorithmsTeam/FixelCourses/raw/master/DataSets/People.csv'

In [None]:
# Courses Packages


In [None]:
# General Auxiliary Functions


## Polynomial Fit

This notebook compares the performance of a _Regression Decision Tree_ vs. a _Linear Regressor_.  

The data is based on [`People.csv`](https://github.com/FixelAlgorithmsTeam/FixelCourses/blob/master/DataSets/People.csv) data set.  
It includes 1000 samples of people: Sex, Age, Height [CM], Weight [KG].  

The objective is to estimate the weight given the height and sex.  
Sex is a categorical feature, which a Decision Tree handles naturally while a Linear Model sometimes struggles with it.

This notebook goes through:

1. Load the [`People.csv`](https://github.com/FixelAlgorithmsTeam/FixelCourses/blob/master/DataSets/People.csv) data set using `pd.read_csv()`.
2. Build a baseline regressor based on a _pipeline_ of _Polynomial Features_ and _Linear Regression_.
3. Build a regressor based on (Ensemble) Decision Tree Regressor.
4. Optimize the Hyper Parameters of both.
5. Compare results.

* <font color='brown'>(**#**)</font> In order to let the regressor know the data is binary / categorical we'll use a **Data Frame** as the data structure.
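
A minimal sketch of how a Data Frame carries the categorical information through the column's `dtype`. The toy frame `dfToy` and its values are made up for illustration and are not taken from `People.csv`.

```python
import pandas as pd

# Toy data (Hypothetical values, for illustration only)
dfToy = pd.DataFrame({
    'Sex':    ['Female', 'Male', 'Male', 'Female'],
    'Height': [162.0, 180.0, 175.0, 158.0],
})

# Map the strings to integers, then mark the column as categorical
dfToy['Sex'] = dfToy['Sex'].map({'Female': 0, 'Male': 1}).astype('category')

print(dfToy.dtypes)                         #<! 'Sex' is reported as `category`
print(dfToy['Sex'].cat.categories.tolist()) #<! [0, 1]
```

A plain NumPy array would hold the same `0` / `1` values yet lose the information that they are labels rather than quantities.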

In [None]:
# Parameters

# Training Parameters
numFolds = 20

## Generate / Load Data

Loads the online `csv` file directly as a Data Frame.

In [None]:
# Load Data

dfPeople = pd.read_csv(PEOPLE_CSV_URL)
dfPeople['Sex'] = dfPeople['Sex'].map({'f': 'Female', 'm': 'Male'})

dfPeople.head(10)

### Plot Data

In [None]:
# Pair Plot

sns.pairplot(data = dfPeople, hue = 'Sex')

* <font color='red'>(**?**)</font> How would you model the data for the task of estimation of the weight of a person given his sex, age and height?

In [None]:
# The Training Data 

#===========================Fill This===========================#
# 1. Extract the 'Sex' and 'Height' columns into a dataframe `dfX`.
# 2. Extract the 'Weight' column into a series `dsY`.
dfX = dfPeople[['Sex', 'Height']].copy()
dsY = dfPeople['Weight'].copy()
#===============================================================#

print(f'The features data shape: {dfX.shape}')
print(f'The labels data shape: {dsY.shape}')

In [None]:
# Plot the Data

hF, hA = plt.subplots(figsize = FIG_SIZE_DEF)
sns.scatterplot(data = dfPeople, x = 'Height', y = 'Weight', hue = 'Sex', ax = hA);

* <font color='red'>(**?**)</font> Which polynomial order fits the data?

In [None]:
# Convert String to Numeric

#===========================Fill This===========================#
# 1. Convert the 'Sex' column from string to numeric.
# 2. Define the `Sex` column as categorical.
# !! You may find the method `astype('category')` useful.
dfX['Sex'] = dfX['Sex'].map({'Female': 0, 'Male': 1})
dfX['Sex'] = dfX['Sex'].astype('category') #<! LightGBM can handle categorical variables directly
#===============================================================#

dfX

In [None]:
# Data Frame Info
# Should show that the 'Sex' column is categorical
dfX.info()

## Regressors

This section implements 2 types of regressors:

 - Parametric Regressor  
   In the form of Polynomial Regression model.
 - Non Parametric Regressor  
   In the form of (Ensemble) Regression Decision Tree.

### Polynomial Regressor

The PolyFit optimization problem is given by:

$$ \arg \min_{\boldsymbol{w}} {\left\| \boldsymbol{\Phi} \boldsymbol{w} - \boldsymbol{y} \right\|}_{2}^{2} $$

Where

$$
\boldsymbol{\Phi} = \begin{bmatrix} — & {p}_{d} \left( \boldsymbol{x}_{1} \right) & — \\
— & {p}_{d} \left( \boldsymbol{x}_{2} \right) & — \\
\vdots & \vdots & \vdots \\
— & {p}_{d} \left( \boldsymbol{x}_{n} \right) & —
\end{bmatrix}
$$

Where ${p}_{d} \left( \cdot \right)$ is a transformer which maps a sample vector to its polynomial features of degree $d$.  

This section implements the _Ridge Regression_ model:

$$ \arg \min_{\boldsymbol{w}} {\left\| \boldsymbol{\Phi} \boldsymbol{w} - \boldsymbol{y} \right\|}_{2}^{2} + \lambda {\left\| \boldsymbol{w} \right\|}_{2}^{2} $$

Where $\lambda$ is the _Regularization Factor_.

* <font color='brown'>(**#**)</font> For any fixed $\boldsymbol{\Phi}$ the model is linear in $\boldsymbol{w}$, hence the above is a (Regularized) _linear regression_ problem.
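
Setting the gradient of the regularized objective to zero gives the normal equations $\left( \boldsymbol{\Phi}^{T} \boldsymbol{\Phi} + \lambda \boldsymbol{I} \right) \boldsymbol{w} = \boldsymbol{\Phi}^{T} \boldsymbol{y}$. The sketch below solves them with NumPy on synthetic 1D data and verifies that the gradient vanishes at the solution. The data, `polyDeg` and `paramLam` are assumptions made for illustration. Unlike the pipeline in this notebook, where `Ridge(fit_intercept = True)` leaves the intercept unpenalized, the sketch penalizes all entries of $\boldsymbol{w}$, exactly as the formula is written.

```python
import numpy as np

# Synthetic 1D data (Assumption for illustration, not the notebook's data set)
oRng = np.random.default_rng(0)
vX   = oRng.uniform(-1.0, 1.0, size = 50)
vY   = 1.0 + 2.0 * vX - 3.0 * vX ** 2 + 0.1 * oRng.standard_normal(50)

polyDeg  = 2   #<! Polynomial degree (d)
paramLam = 0.5 #<! Regularization factor (lambda)

# Each row of mPhi is [1, x, x^2]
mPhi = np.vander(vX, N = polyDeg + 1, increasing = True)

# Normal equations: (Phi^T Phi + lambda I) w = Phi^T y
vW = np.linalg.solve(mPhi.T @ mPhi + paramLam * np.eye(polyDeg + 1), mPhi.T @ vY)

# The gradient of the objective should vanish at the minimizer
vGrad = 2.0 * mPhi.T @ (mPhi @ vW - vY) + 2.0 * paramLam * vW
print(f'Maximum absolute gradient entry: {np.max(np.abs(vGrad)):0.2e}')
```

For $\lambda = 0$ the same system reduces to the plain Least Squares problem at the top of this section.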

In [None]:
# Linear Regression Model with Polynomial Features

#===========================Fill This===========================#
# 1. Build a pipeline based on `PolynomialFeatures` and `Ridge`.
# !! Set configuration to optimize the memory efficiency.
# !! The degree of the polynomial features and regularization will be set by Hyper Parameter tuning.
oPolyFitReg = Pipeline([
    ('PolyFeatures', PolynomialFeatures(include_bias = False)),
    ('RidgeReg', Ridge(fit_intercept = True))
])
#===============================================================#

### Decision Tree Regressor

Decision Tree Regressor is a non parametric regression method since:
 - It does not make assumptions about the underlying distribution of the data.
 - The number of parameters is not fixed in advance, but rather grows and adapts with the complexity of the training data.

A single Decision Tree creates a _piece wise constant_ approximation of the data.  
This section builds an _Ensemble_ of such trees using Gradient Boosting as implemented in the [LightGBM](https://github.com/microsoft/LightGBM) library.  
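
To make the _piece wise constant_ claim concrete, the sketch below fits a single shallow tree to a smooth 1D signal and counts the distinct prediction values. It uses SciKit Learn's `DecisionTreeRegressor` rather than LightGBM, and the sine signal is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Smooth 1D signal (Arbitrary choice, for illustration only)
vX = np.linspace(0.0, 1.0, 200)
vY = np.sin(2.0 * np.pi * vX)

# A single tree of depth 2 has at most 2^2 = 4 leaves
oTree = DecisionTreeRegressor(max_depth = 2, random_state = 0)
oTree.fit(vX[:, None], vY)
vYPred = oTree.predict(vX[:, None])

numLeaves = oTree.get_n_leaves()
numLevels = np.unique(vYPred).size #<! Number of distinct predicted values

print(f'Number of leaves: {numLeaves}, Number of distinct predictions: {numLevels}')
```

All 200 samples are mapped to at most 4 values, one per leaf. An ensemble sums many such step functions, which yields a finer approximation.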

* <font color='brown'>(**#**)</font> Both [LightGBM](https://github.com/microsoft/LightGBM) (`LGBMRegressor` / `LGBMClassifier` / `LGBMRanker`) and [XGBoost](https://github.com/dmlc/xgboost) (`XGBRegressor` / `XGBClassifier` / `XGBRanker`) offer SciKit Learn compatible implementations of Ensemble (Gradient Boosting) of Decision Trees.

#### Categorical Features

Decision Trees can handle _categorical_ features natively.  
This is done by breaking the categories into subsets by [_Optimal Partitioning_](https://xgboost.readthedocs.io/en/latest/tutorials/categorical.html#optimal-partitioning) (See [Walter D. Fisher - On Grouping for Maximum Homogeneity](https://www.semanticscholar.org/paper/040c3e7d4baac625b6072cf9bf6be697f26d3cab)).  
Basically it identifies all splits and chooses the ones which minimize the impurity (Classification) or the error (Regression).  
This method yields a more efficient tree structure than One Hot Encoding, in addition to better performance (Run time and memory).
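
A simplified sketch of the idea for a single regression split with squared error: instead of testing every subset of categories, sort the categories by their mean target value and test only the splits which are contiguous in that order. For squared error this shortcut finds the same optimum as the exhaustive search, which the sketch verifies by brute force. The data and the helper `SplitSse()` are hypothetical. This is not the code used by LightGBM or XGBoost.

```python
import itertools
import numpy as np

def SplitSse( vY: np.ndarray, vMask: np.ndarray ) -> float:
    # Sum of squared errors when each side of the split predicts its own mean
    sseVal = 0.0
    for vSide in (vY[vMask], vY[~vMask]):
        sseVal += np.sum((vSide - np.mean(vSide)) ** 2)
    return float(sseVal)

# Synthetic data: 5 categories, each with its own mean (Hypothetical values)
oRng     = np.random.default_rng(0)
numCat   = 5
vCat     = np.arange(200) % numCat #<! Every category appears 40 times
vCatMean = np.array([3.0, -1.0, 0.5, 2.0, -2.5])
vY       = vCatMean[vCat] + 0.3 * oRng.standard_normal(200)

# Sorted scan: Only (numCat - 1) candidate splits
vMean  = np.array([np.mean(vY[vCat == ii]) for ii in range(numCat)])
vOrder = np.argsort(vMean)
lScan  = [(SplitSse(vY, np.isin(vCat, vOrder[:kk])), kk) for kk in range(1, numCat)]
bestScanSse, bestK = min(lScan)

# Brute force: Every non trivial subset of the categories
bestBruteSse = min(
    SplitSse(vY, np.isin(vCat, tuSub))
    for rr in range(1, numCat)
    for tuSub in itertools.combinations(range(numCat), rr)
)

print(f'Left side categories: {sorted(vOrder[:bestK].tolist())}')
print(f'Sorted scan SSE: {bestScanSse:0.4f}, Brute force SSE: {bestBruteSse:0.4f}')
```

The number of candidate splits grows linearly with the number of categories in the sorted scan and exponentially in the brute force search.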

* <font color='brown'>(**#**)</font> [StackExchange AI - Parametric vs. Non Parametric Models](https://ai.stackexchange.com/questions/23777).
* <font color='brown'>(**#**)</font> [LightGBM](https://github.com/microsoft/LightGBM) fully supports _categorical_ features out of the box. Currently, [XGBoost](https://github.com/dmlc/xgboost) requires [enabling the feature explicitly](https://xgboost.readthedocs.io/en/latest/tutorials/categorical.html#training-with-scikit-learn-interface).
* <font color='brown'>(**#**)</font> [StackExchange Cross Validated - How Decision Trees Split by Categorical Features](https://stats.stackexchange.com/questions/443780), [StackExchange Data Science - How Decision Trees Split by Categorical Features](https://datascience.stackexchange.com/questions/57256), [StackExchange Data Science - Why Decision Trees Split Require Categorical Features to Be Encoded](https://datascience.stackexchange.com/questions/52066), [StackExchange Data Science - Why Decision Trees Can Handle Categorical Features without Encoding](https://datascience.stackexchange.com/questions/18056).

In [None]:
# Decision Tree Regression Model

#===========================Fill This===========================#
# 1. Build a Decision Tree Regression Model based on LightGBM's `LGBMRegressor`.
# !! Set `random_state` to `seedNum` for reproducibility.
# !! The depth and regularization will be set by Hyper Parameter tuning.
# !! Treat `LGBMRegressor` as a regular SciKit Learn estimator with `fit()`, `predict()` and `score()` methods.
oDecTreeReg = LGBMRegressor(random_state = seedNum)
#===============================================================#

### Training

This section trains the models and optimizes the Hyper Parameters using Grid Search with Cross Validation.
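
To show what Grid Search with Cross Validation does, here is a hand rolled sketch in NumPy: for every combination of polynomial degree and regularization factor, the model is fitted on all folds but one and evaluated on the held out fold. The synthetic data, the small grid, the random fold assignment and the helper `FitRidge()` are assumptions of the sketch. It ranks combinations by the validation Mean Squared Error (Lower is better), while `GridSearchCV` by default uses the estimator's `score()` method (Higher is better).

```python
import numpy as np

def FitRidge( mPhi: np.ndarray, vY: np.ndarray, paramLam: float ) -> np.ndarray:
    # Solves (Phi^T Phi + lambda I) w = Phi^T y
    return np.linalg.solve(mPhi.T @ mPhi + paramLam * np.eye(mPhi.shape[1]), mPhi.T @ vY)

# Synthetic quadratic data (Assumption for illustration)
oRng       = np.random.default_rng(0)
numSamples = 120
vX         = oRng.uniform(-1.0, 1.0, size = numSamples)
vY         = 1.0 - 2.0 * vX + 3.0 * vX ** 2 + 0.1 * oRng.standard_normal(numSamples)

# Split the shuffled indices into folds
numFolds = 5
lFoldIdx = np.array_split(oRng.permutation(numSamples), numFolds)

dValMse = {}
for polyDeg in (1, 2, 3):
    for paramLam in (0.01, 1.0):
        lMse = []
        for vValIdx in lFoldIdx:
            vTrainIdx = np.setdiff1d(np.arange(numSamples), vValIdx)
            mPhiTrain = np.vander(vX[vTrainIdx], N = polyDeg + 1, increasing = True)
            mPhiVal   = np.vander(vX[vValIdx], N = polyDeg + 1, increasing = True)
            vW        = FitRidge(mPhiTrain, vY[vTrainIdx], paramLam)
            lMse.append(np.mean((mPhiVal @ vW - vY[vValIdx]) ** 2))
        dValMse[(polyDeg, paramLam)] = float(np.mean(lMse)) #<! Average over the folds

tuBest = min(dValMse, key = dValMse.get)
print(f'Best (Degree, Lambda): {tuBest}, Validation MSE: {dValMse[tuBest]:0.4f}')
```

The cost is the number of combinations times the number of folds, which is why it pays to start with a coarse grid and refine it around the best region.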

In [None]:
# Optimize Hyper Parameters with Grid Search and Cross Validation

#===========================Fill This===========================#
# 1. Construct an object for Grid Search.
# 2. Optimize the hyperparameters: Polynomial Degree, Regularization Factor.
# !! Use `numFolds` for the number of folds in Cross Validation.
# !! Start with a small number of combinations, then focus on the area of interest (Higher score).
oGridSearchCv = GridSearchCV(
    estimator = oPolyFitReg,
    param_grid = {
        'PolyFeatures__degree': [1, 2, 3, 4],
        'RidgeReg__alpha': np.linspace(0, 5, 21).tolist()
    },
    cv = numFolds,
    n_jobs = -1,
    verbose = 1
)
oGridSearchCv = oGridSearchCv.fit(dfX, dsY)
#===============================================================#

In [None]:
# Extract Optimal Model

#===========================Fill This===========================#
# 1. Extract the best model from the grid search object `oGridSearchCv`.
oPolyFitReg = oGridSearchCv.best_estimator_
#===============================================================#

print(f'Optimal Hyper Parameters: {oGridSearchCv.best_params_}')
print(f'Model Score: {oPolyFitReg.score(dfX, dsY):0.4f}')

In [None]:
# Decision Tree Regression Model Fitting
# Though an Ensemble is used, this section will ignore that for simplicity.

#===========================Fill This===========================#
# 1. Construct an object for Grid Search.
# 2. Optimize the hyperparameters: Maximum Depth, Regularization of L1 (`reg_alpha`), Regularization of L2 (`reg_lambda`).
# !! Use `numFolds` for the number of folds in Cross Validation.
# !! Start with a small number of combinations, then focus on the area of interest (Higher score).
oGridSearchCv = GridSearchCV(
    estimator = oDecTreeReg,
    param_grid = {
        'max_depth': [2, 3],
        'reg_alpha': np.linspace(0, 0.75, 11).tolist(),
        'reg_lambda': np.linspace(4.25, 5, 11).tolist()
    },
    cv = numFolds,
    n_jobs = -1,
    verbose = 1
)
oGridSearchCv = oGridSearchCv.fit(dfX, dsY)
#===============================================================#

In [None]:
# Extract Optimal Model

#===========================Fill This===========================#
# 1. Extract the best model from the grid search object `oGridSearchCv`.
oDecTreeReg = oGridSearchCv.best_estimator_
#===============================================================#

print(f'Optimal Hyper Parameters: {oGridSearchCv.best_params_}')
print(f'Model Score: {oDecTreeReg.score(dfX, dsY):0.4f}')

## Analyze Results

In this section we'll analyze the results of the 2 models.  
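
The comparison relies on the R2 score, ${R}^{2} = 1 - \frac{\sum_{i} {\left( {y}_{i} - \hat{y}_{i} \right)}^{2}}{\sum_{i} {\left( {y}_{i} - \bar{y} \right)}^{2}}$. The sketch below computes it directly with NumPy for a few reference predictors. The helper `R2Score()` and the toy values are made up for illustration.

```python
import numpy as np

def R2Score( vY: np.ndarray, vYPred: np.ndarray ) -> float:
    ssRes = np.sum((vY - vYPred) ** 2)     #<! Residual sum of squares
    ssTot = np.sum((vY - np.mean(vY)) ** 2) #<! Total sum of squares
    return float(1.0 - ssRes / ssTot)

vY = np.array([60.0, 72.0, 81.0, 55.0, 90.0]) #<! Toy weights (Hypothetical values)

print(R2Score(vY, vY))                            #<! Perfect prediction: 1.0
print(R2Score(vY, np.full_like(vY, np.mean(vY)))) #<! Predicting the mean: 0.0
print(R2Score(vY, vY[::-1]))                      #<! Worse than the mean: Negative
```

A score of 1 is a perfect fit, 0 matches a constant predictor at the mean of the labels, and negative values are possible for predictors worse than that constant.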

In [None]:
# The Model Score
# The R2 score of the models (The default score for regressor on SciKit Learn)

#===========================Fill This===========================#
# 1. Calculate both models score using the R2 score.
modelR2Score001 = oPolyFitReg.score(dfX, dsY)
modelR2Score002 = oDecTreeReg.score(dfX, dsY)
#===============================================================#

print(f'The     Parametric model score (R2): {modelR2Score001:0.4f}.')
print(f'The Non Parametric model score (R2): {modelR2Score002:0.4f}.')

In [None]:
# Data Frame to Show Results

#===========================Fill This===========================#
# 1. Build a Data Frame `dfResults` that contains the following columns:
#    - 'Sex'.
#    - 'Height'.
#    - 'Weight'.
#    - 'Prediction' - The prediction of a model.
#    - 'Model' - 1 for the Parametric Model, 2 for the Non Parametric Model.
# !! Each row of the original data frame `dfPeople` should appear twice in `dfResults` (One per model prediction).
dfResults = dfPeople[['Sex', 'Height', 'Weight']].copy()
dfResults = pd.concat((dfResults, dfResults), axis = 0, ignore_index = True)
dfResults['Prediction'] = np.concatenate((oPolyFitReg.predict(dfX), oDecTreeReg.predict(dfX)), axis = 0)
dfResults['Model'] = np.concatenate((np.ones(dfX.shape[0]), 2 * np.ones(dfX.shape[0])), axis = 0)
#===============================================================#

dfResults['Model'] = dfResults['Model'].map({1: 'Polynomial Regression', 2: 'Decision Tree Regression'})

In [None]:
# Show Regression Error Plot

hF, hA = plt.subplots(figsize = (12, 8))

sns.lineplot(data = dfResults, x = 'Weight', y = 'Weight', ax = hA, color = 'r')
sns.scatterplot(data = dfResults, x = 'Weight', y = 'Prediction', hue = 'Sex', style = 'Model', ax = hA)
hA.set_title('Models Predictions Error Plot')
hA.set_xlabel('Weight Label')
hA.set_ylabel('Weight Prediction');

* <font color='red'>(**?**)</font> How come the improvement is not as big as one could expect?