[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io)

# AI Program

## Machine Learning - Supervised Learning - Regression - Polynomial Fit - Exercise

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 23/12/2025 | Royi Avital | Added type annotations to the methods                              |
| 1.0.000 | 23/03/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0046RegressorPolynomialFit.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from lightgbm import LGBMRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Miscellaneous
from platform import python_version
import random

# Typing
from typing import Callable, Dict, List, Optional, Self, Set, Tuple, Union
from numpy.typing import NDArray

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

```python
valToFill = ???
```

 - Multi Line to Fill (At least one)

```python
# You need to start writing
?????
```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

?????
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

PEOPLE_CSV_URL = 'https://github.com/FixelAlgorithmsTeam/FixelCourses/raw/master/DataSets/People.csv'

In [None]:
# Courses Packages

# from DataVisualization import PlotRegressionData, PlotRegressionResults

In [None]:
# General Auxiliary Functions

def PolyModelString( vW: np.ndarray, applyLatex: bool = True ) -> str:
    
    latexDelimiter = '$' if applyLatex else ''

    modelTxt = latexDelimiter + 'y = '
    for ii in range(len(vW)):
        modelTxt += f'({vW[ii]:0.3f}) {{x}}^{{{ii}}} + '
    
    modelTxt = modelTxt[:-2]
    modelTxt += latexDelimiter

    return modelTxt

## Polynomial Fit

This notebook compares the performance of a _Regression Decision Tree_ vs. a _Linear Regressor_.  

The data is based on [`People.csv`](https://github.com/FixelAlgorithmsTeam/FixelCourses/blob/master/DataSets/People.csv) data set.  
It includes 1000 samples of people: Sex, Age, Height (CM), Weight (KG).  

The objective is to estimate the weight given the height and sex.  
The sex is a categorical feature, which a Decision Tree handles naturally while a Linear Model sometimes struggles with.

This notebook goes through:

1. Load the [`People.csv`](https://github.com/FixelAlgorithmsTeam/FixelCourses/blob/master/DataSets/People.csv) data set using `pd.read_csv()`.
2. Build a baseline regressor based on a _pipeline_ of _Polynomial Features_ and _Linear Regression_.
3. Build a regressor based on (Ensemble) Decision Tree Regressor.
4. Optimize the Hyper Parameters of both.
5. Compare results.

* <font color='brown'>(**#**)</font> In order to let the regressor know the data is binary / categorical we'll use a **Data Frame** as the data structure.

In [None]:
# Parameters

# Model
polynomDeg = 2

## Generate / Load Data

Loads the online `csv` file directly as a Data Frame.

In [None]:
# Load Data

dfPeople = pd.read_csv(PEOPLE_CSV_URL)

dfPeople.head(10)

### Plot Data

In [None]:
# Pair Plot

sns.pairplot(data = dfPeople, hue = 'Sex')

* <font color='red'>(**?**)</font> How would you model the data for the task of estimation of the weight of a person given his sex, age and height?

In [None]:
# The Training Data 

#===========================Fill This===========================#
# 1. Extract the 'Sex' and 'Height' columns into a data frame `dfX`.
# 2. Extract the 'Weight' column into a series `dsY`.
dfX = dfPeople[['Sex', 'Height']].copy()
dsY = dfPeople['Weight'].copy()
#===============================================================#

print(f'The features data shape: {dfX.shape}')
print(f'The labels data shape: {dsY.shape}')

In [None]:
# Plot the Data

hF, hA = plt.subplots(figsize = FIG_SIZE_DEF)
sns.scatterplot(data = dfPeople, x = 'Height', y = 'Weight', hue = 'Sex', ax = hA);

In [None]:
# Convert String to Numeric

dfX['Sex'] = dfX['Sex'].map({'f': 0, 'm': 1})
# Set 'Sex' as categorical variable
dfX['Sex'] = dfX['Sex'].astype('category') #<! LightGBM can handle categorical variables directly

dfX

In [None]:
# Data Frame Info
dfX.info()

* <font color='red'>(**?**)</font> Which polynomial order fits the data?

## Regressors

The PolyFit optimization problem is given by:

$$ \arg \min_{\boldsymbol{w}} {\left\| \boldsymbol{\Phi} \boldsymbol{w} - \boldsymbol{y} \right\|}_{2}^{2} $$

Where

$$
\boldsymbol{\Phi} = \begin{bmatrix} 1 & x_{1} & x_{1}^{2} & \cdots & x_{1}^{p} \\
1 & x_{2} & x_{2}^{2} & \cdots & x_{2}^{p} \\
\vdots & \vdots & \vdots &  & \vdots \\
1 & x_{N} & x_{N}^{2} & \cdots & x_{N}^{p}
\end{bmatrix}
$$

This is a _polyfit_ with hyper parameter $p$.

The optimal weights are calculated by linear system solvers.  
Yet it is better to use solvers optimized for this task, such as:

 * NumPy: [`polyfit`](https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html).
 * SciKit Learn: [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) combined with [`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html).

In this notebook we'll implement our own class based on SciKit Learn's solutions.

* <font color='brown'>(**#**)</font> For arbitrary $\boldsymbol{\Phi}$ the above becomes a _linear regression_ problem.
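
The least squares formulation above can be checked numerically on a toy problem. The following is a minimal sketch, assuming synthetic noise free data generated from a known quadratic (the values are illustrative and are not taken from `People.csv`). It builds $\boldsymbol{\Phi}$ explicitly, solves the problem with a generic least squares solver and compares the result to `np.polyfit()`.

```python
import numpy as np

# Synthetic, noise free data from a known quadratic (illustrative values only)
vX     = np.linspace(-2.0, 2.0, 25)
vWTrue = np.array([1.0, -0.5, 2.0]) #<! w0 + w1 * x + w2 * x^2
vY     = vWTrue[0] + vWTrue[1] * vX + vWTrue[2] * (vX ** 2)

polyDeg = 2

# Build the matrix Phi with the columns [1, x, x^2, ..., x^p]
mPhi = np.vander(vX, N = polyDeg + 1, increasing = True)

# Solve arg min_w || Phi w - y ||_2^2 with a generic least squares solver
vW, *_ = np.linalg.lstsq(mPhi, vY, rcond = None)

# `np.polyfit()` returns the highest power first, hence the reversal
vWRef = np.polyfit(vX, vY, deg = polyDeg)[::-1]

maxAbsDev = np.max(np.abs(vW - vWRef))
print(f'Estimated weights: {vW}')
print(f'Maximum absolute deviation from `np.polyfit()`: {maxAbsDev}')
```

Since the data is noise free, both solvers are expected to recover `vWTrue` up to numerical precision.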

### Polyfit Estimator

We could create the linear polynomial fit estimator using a `Pipeline` of `PolynomialFeatures` and `LinearRegression`.  
Yet since this is a simple task it is a good opportunity to exercise the creation of a _SciKit Estimator_.

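We need to provide 4 main methods:

1. The `__init__()` Method: The constructor of the object. It should set the degree of the polynomial model used.
2. The `fit()` Method: The training phase. It should calculate the matrix and solve the linear regression problem.
3. The `predict()` Method: The inference phase. It should evaluate the fitted polynomial on given data.
4. The `score()` Method: It should evaluate the quality of the fit on given data.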
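
A minimal sketch of such an estimator is given below. It is not the course's reference solution: the class name `PolyFitRegressorSketch`, the parameter `polyDeg` and the helper `_BuildPhi()` are hypothetical, and a single feature of shape `(numSamples, 1)` is assumed. The `score()` method is not written explicitly as `RegressorMixin` supplies one which returns the $R^2$ score.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class PolyFitRegressorSketch(RegressorMixin, BaseEstimator):
    # Hypothetical name: A single feature polynomial least squares regressor

    def __init__(self, polyDeg: int = 2) -> None:
        self.polyDeg = polyDeg #<! SciKit Learn convention: Store the parameters as given

    def _BuildPhi(self, mX) -> np.ndarray:
        vX = np.asarray(mX, dtype = float).reshape(-1) #<! Assumes a single feature
        return np.vander(vX, N = self.polyDeg + 1, increasing = True)

    def fit(self, mX, vY):
        mPhi = self._BuildPhi(mX)
        # Least squares solution, the trailing underscore marks a fitted attribute
        self.vW_, *_ = np.linalg.lstsq(mPhi, np.asarray(vY, dtype = float), rcond = None)
        return self

    def predict(self, mX) -> np.ndarray:
        return self._BuildPhi(mX) @ self.vW_

# Usage on synthetic noise free data (illustrative values only)
vXSyn = np.linspace(0.0, 3.0, 30)
vYSyn = 1.0 + 2.0 * vXSyn + 3.0 * (vXSyn ** 2)

oSketch = PolyFitRegressorSketch(polyDeg = 2).fit(vXSyn[:, None], vYSyn)
print(f'Estimated weights: {oSketch.vW_}')
print(f'R2 score: {oSketch.score(vXSyn[:, None], vYSyn)}')
```

Keeping `__init__()` free of any logic beyond storing its arguments is what allows tools such as `GridSearchCV` to clone the estimator and set its parameters.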

In [None]:
# Linear Regression Model with Polynomial Features

oPolyFitReg = Pipeline([
    ('PolyFeatures', PolynomialFeatures(include_bias = False)),
    ('RidgeReg', Ridge(fit_intercept = True))
])

In [None]:
# Decision Tree Regression Model

oDecTreeReg = LGBMRegressor(random_state = seedNum)

### Training

In this section we'll train the models defined above on the whole data, with the hyper parameters optimized by a cross validated grid search.

In [None]:
# Optimize Hyper Parameters with Grid Search and Cross Validation

#===========================Fill This===========================#
# 1. Construct a `GridSearchCV` object over the polynomial degree and the Ridge regularization (`alpha`).
# 2. Fit it using `dfX` and `dsY`.
oGridSearchCv = GridSearchCV(
    estimator = oPolyFitReg,
    param_grid = {
        'PolyFeatures__degree': [1, 2, 3, 4],
        'RidgeReg__alpha': np.linspace(0, 5, 21).tolist()
    },
    cv = 20,
    n_jobs = -1,
    verbose = 1
)
oGridSearchCv = oGridSearchCv.fit(dfX, dsY)
#===============================================================#

In [None]:
# Extract Optimal Model
oPolyFitReg = oGridSearchCv.best_estimator_
print(f'Optimal Hyper Parameters: {oGridSearchCv.best_params_}')
print(f'Model Score: {oPolyFitReg.score(dfX, dsY):0.4f}')
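
The score printed above is evaluated on the same data used for training, hence it is typically more optimistic than the cross validated score kept in `best_score_`. The following is a minimal self contained sketch on synthetic data (the data, the grid values and the names `oPipe` and `oSearch` are assumptions for illustration). It also shows how the `<step name>__<parameter>` convention addresses the parameters of each step of a pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic noisy quadratic data (illustrative values only)
oRng  = np.random.default_rng(0)
mXSyn = oRng.uniform(-3.0, 3.0, size = (200, 1))
vYSyn = 1.0 + 0.5 * mXSyn[:, 0] + 2.0 * (mXSyn[:, 0] ** 2) + 0.1 * oRng.standard_normal(200)

oPipe = Pipeline([
    ('PolyFeatures', PolynomialFeatures(include_bias = False)),
    ('RidgeReg', Ridge(fit_intercept = True)),
])

# The keys follow the `<step name>__<parameter>` convention
dParamGrid = {
    'PolyFeatures__degree': [1, 2, 3],
    'RidgeReg__alpha': [0.1, 1.0, 10.0],
}

oSearch = GridSearchCV(estimator = oPipe, param_grid = dParamGrid, cv = 5)
oSearch = oSearch.fit(mXSyn, vYSyn)

print(f'Optimal Hyper Parameters: {oSearch.best_params_}')
print(f'Cross validated score (Mean R2): {oSearch.best_score_:0.4f}')
print(f'Training data score (R2)       : {oSearch.best_estimator_.score(mXSyn, vYSyn):0.4f}')
```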

In [None]:
# Decision Tree Regression Model Fitting
# Though an Ensemble is used, this section will ignore that for simplicity.

#===========================Fill This===========================#
# 1. Construct a `GridSearchCV` object over the hyper parameters of the `LGBMRegressor`.
# 2. Fit it using `dfX` and `dsY`.
# !! The data is passed as a Data Frame so the categorical `Sex` column is preserved.
oGridSearchCv = GridSearchCV(
    estimator = oDecTreeReg,
    param_grid = {
        'max_depth': [2, 3],
        'reg_alpha': np.linspace(0, 0.75, 11).tolist(),
        'reg_lambda': np.linspace(4.25, 5, 11).tolist()
    },
    cv = 20,
    n_jobs = -1,
    verbose = 1
)
oGridSearchCv = oGridSearchCv.fit(dfX, dsY)
#===============================================================#

In [None]:
# Extract Optimal Model
oDecTreeReg = oGridSearchCv.best_estimator_
print(f'Optimal Hyper Parameters: {oGridSearchCv.best_params_}')
print(f'Model Score: {oDecTreeReg.score(dfX, dsY):0.4f}')

In [None]:
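# Model Parameters

# Extract the coefficients of the model: The intercept followed by the weight of each polynomial feature.
oRidgeReg = oPolyFitReg['RidgeReg']
vW = np.r_[oRidgeReg.intercept_, oRidgeReg.coef_]

In [None]:
# Verify Model
# `np.polyfit()` supports a single feature with no regularization, hence it can not serve as a reference here.
# The reference is a Ridge model, with the optimal `alpha`, fitted on the features generated by the pipeline.

mPhi  = oPolyFitReg['PolyFeatures'].transform(dfX)
oRef  = Ridge(alpha = oRidgeReg.alpha, fit_intercept = True).fit(mPhi, dsY)
vWRef = np.r_[oRef.intercept_, oRef.coef_]

for ii in range(len(vW)):
    print(f'The model {ii} coefficient: {vW[ii]}, The reference coefficient: {vWRef[ii]}')

maxAbsDev = np.max(np.abs(vW - vWRef))
print(f'The maximum absolute deviation: {maxAbsDev}') #<! Should be smaller than 1e-8

if (maxAbsDev > 1e-8):
    print('Error: The implementation of the model is incorrect!')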

### Display Error and Score

When dealing with regression there is a useful visualization which shows the predicted value vs the reference value.  
This allows showing the results regardless of the features number of dimensions.
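
A minimal sketch of such a plot using plain Matplotlib is given below. The helper `PlotPredVsRef()` is hypothetical, it is not the course's `PlotRegressionResults()`, and the data is synthetic. Each sample is a point whose horizontal coordinate is the reference value and whose vertical coordinate is the estimation, so a perfect model places all points on the diagonal.

```python
import numpy as np
import matplotlib.pyplot as plt

def PlotPredVsRef( vY: np.ndarray, vYPred: np.ndarray, hA: plt.Axes ) -> float:
    # Hypothetical helper: Scatter of the estimation vs. the reference, returns the RMSE

    rmse   = float(np.sqrt(np.mean(np.square(vY - vYPred))))
    valMin = min(vY.min(), vYPred.min())
    valMax = max(vY.max(), vYPred.max())

    hA.scatter(vY, vYPred, s = 20, edgecolor = 'k', label = 'Samples')
    hA.plot([valMin, valMax], [valMin, valMax], color = 'r', label = 'Ideal Estimation') #<! The diagonal
    hA.set_xlabel('Ground Truth')
    hA.set_ylabel('Estimation')
    hA.set_title(f'Estimation vs. Ground Truth with RMSE = {rmse:0.3f}')
    hA.legend()

    return rmse

# Synthetic reference values and noisy estimations (illustrative values only)
oRng  = np.random.default_rng(0)
vYRef = oRng.uniform(50.0, 100.0, size = 100)
vYEst = vYRef + oRng.normal(0.0, 2.0, size = 100)

hFSketch, hASketch = plt.subplots(figsize = (6, 6))
rmseVal = PlotPredVsRef(vYRef, vYEst, hASketch)
```

The plot depends only on the reference and the estimated values, so it works for any number of features.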

In [None]:
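# Plot the Prediction
# `PlotRegressionResults()` is part of the course's `DataVisualization` module (Its import is commented out above).
hF, hA = plt.subplots(figsize = FIG_SIZE_DEF)

vY    = dsY.to_numpy()
vYEst = oPolyFitReg.predict(dfX)
rmse  = np.sqrt(np.mean(np.square(vY - vYEst))) #<! The `score()` method returns the R2 score, not the RMSE

PlotRegressionResults(vY, vYEst, hA = hA, axisTitle = f'Estimation vs. Ground Truth with RMSE = {rmse:0.3f} [KG]');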

Since the height is the only continuous feature, we can also show the prediction as a function of the height for each sex.

In [None]:
# Prediction vs. Features

vXX = np.linspace(120, 220, 2000)
hF, hA = plt.subplots(figsize = FIG_SIZE_DEF)

sns.scatterplot(data = dfPeople, x = 'Height', y = 'Weight', hue = 'Sex', ax = hA)
for sexVal, sexLabel in [(0, 'f'), (1, 'm')]:
    dfXX = pd.DataFrame({'Sex': sexVal, 'Height': vXX}) #<! Same columns as `dfX`
    hA.plot(vXX, oPolyFitReg.predict(dfXX), label = f'Model Estimation ({sexLabel})')
hA.set_title('The Polynomial Regression Model')
hA.set_xlabel('$x$ - Height [CM]')
hA.set_ylabel('$y$ - Weight [KG]')
hA.legend();

* <font color='red'>(**?**)</font> What did the model predict? What should be done?
* <font color='blue'>(**!**)</font> Try the above with the model order of 1 and 3.