[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io)

# AI Program

## Machine Learning - Supervised Learning - Regression - Polynomial Fit

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        | Content / Changes                                                  |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 12/09/2025 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0046RegressorPolynomialFit.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Miscellaneous
from platform import python_version
import random

# Typing
from typing import Callable, Dict, List, Optional, Set, Tuple, Union
from numpy.typing import NDArray

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

```python
valToFill = ???
```

 - Multi Line to Fill (At least one)

```python
# You need to start writing
?????
```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

?????
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

In [None]:
# Courses Packages


In [None]:
# General Auxiliary Functions

def PlotRegressionResults( vY: NDArray, vYPred: NDArray, /, *, hA: Optional[plt.Axes] = None, figSize: Tuple[int, int] = FIG_SIZE_DEF, lineWidth: int = LINE_WIDTH_DEF, elmSize: int = ELM_SIZE_DEF, classColor: Tuple[str, str] = CLASS_COLOR, axisTitle: Optional[str] = None ) -> plt.Axes:

    if hA is None:
        hF, hA = plt.subplots(figsize = figSize)
    else:
        hF = hA.get_figure()

    numSamples = len(vY)
    if (numSamples != len(vYPred)):
        raise ValueError('The inputs `vY` and `vYPred` must have the same number of elements')
    
    hA.plot(vY, vY, color = 'r', lw = lineWidth, label = 'Ground Truth')
    hA.scatter(vY, vYPred, s = elmSize, color = classColor[0], edgecolor = EDGE_COLOR, label = 'Estimation')
    hA.set_xlabel('Label Value')
    hA.set_ylabel('Prediction Value')
    # hA.axis('equal')
    if axisTitle is not None:
        hA.set_title(axisTitle)
    hA.legend()
    
    return hA

## Polynomial Fit

This notebook demonstrates a Linear Regression task using a pipeline with a feature transform.  
It optimizes the polynomial degree and the Ridge Regularization parameter `α` using a grid search with cross validation.

The dataset is the [UC Irvine (UCI) Machine Learning Repository - Concrete Compressive Strength](https://archive.ics.uci.edu/dataset/165).

* <font color='brown'>(**#**)</font> The data will be imported as a Data Frame.

In [None]:
# Parameters

# Data
fileUrl       = r'https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/refs/heads/master/DataSets/ConcreteCompressiveStrength.csv'
trainSetRatio = 0.85

# Model
polynomDeg = 1 #<! Baseline
α          = 0.05

lP = [1, 2, 3, 4]
vα = np.linspace(0.0, 0.5, num = 25)

## Generate / Load Data

The data (Features) description:

| Variable Name                 | Role    | Type       | Description | Units  | Missing Values |
|-------------------------------|---------|------------|-------------|--------|----------------|
| Cement                        | Feature | Continuous |             | kg/m^3 | no             |
| Blast Furnace Slag            | Feature | Integer    |             | kg/m^3 | no             |
| Fly Ash                       | Feature | Continuous |             | kg/m^3 | no             |
| Water                         | Feature | Continuous |             | kg/m^3 | no             |
| Superplasticizer              | Feature | Continuous |             | kg/m^3 | no             |
| Coarse Aggregate              | Feature | Continuous |             | kg/m^3 | no             |
| Fine Aggregate                | Feature | Continuous |             | kg/m^3 | no             |
| Age                           | Feature | Integer    |             | day    | no             |
| Concrete Compressive Strength | Target  | Continuous |             | MPa    | no             |

The target variable is `Concrete Compressive Strength`.

In [None]:
# Load Data

dfData = pd.read_csv(fileUrl)

dfData.head(10)

In [None]:
# Data Summary

dfData.info()

In [None]:
# Data Statistics

dfData.describe()

### Plot Data

In [None]:
# Pair Plot

sns.pairplot(data = dfData)

In [None]:
# Correlation Matrix
mCorr = np.abs(dfData.corr())

hF, hA = plt.subplots(figsize = (6, 4))

sns.heatmap(mCorr, annot = True, fmt = '0.2f', cmap = 'coolwarm', ax = hA)
hA.xaxis.set_tick_params(rotation = 90)

* <font color='red'>(**?**)</font> Which feature is the most important?
* <font color='red'>(**?**)</font> If one feature must be dropped, which one would you drop?

In [None]:
# The Data

dfX = dfData.copy()
dfX = dfX.drop(columns = ['Compressive Strength'])
dsY = dfData['Compressive Strength'].copy()

print(f'The features data shape: {dfX.shape}')
print(f'The labels data shape: {dsY.shape}')

In [None]:
# Train & Validation Split

dfXTrain, dfXVal, dsYTrain, dsYVal = train_test_split(dfX, dsY, train_size = trainSetRatio, random_state = seedNum, shuffle = True)

print(f'The training features data shape  : {dfXTrain.shape}')
print(f'The training labels data shape    : {dsYTrain.shape}')
print(f'The validation features data shape: {dfXVal.shape}')
print(f'The validation labels data shape  : {dsYVal.shape}')

## Ridge Regressor

The _Ridge Regression_ optimization problem is given by:

$$ \arg \min_{\boldsymbol{w}} {\left\| \boldsymbol{\Phi} \boldsymbol{w} - \boldsymbol{y} \right\|}_{2}^{2} + \alpha {\left\| \boldsymbol{w} \right\|}_{2}^{2} $$

Where $\boldsymbol{\Phi}$ is the model matrix.  

In many cases $\boldsymbol{\Phi}$ is built by applying a polynomial features transform to the raw features.  
The $\alpha$ parameter is the _Regularization_ parameter.
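
The objective above has the closed form solution $\boldsymbol{w} = {\left( \boldsymbol{\Phi}^{T} \boldsymbol{\Phi} + \alpha \boldsymbol{I} \right)}^{-1} \boldsymbol{\Phi}^{T} \boldsymbol{y}$. The following is a minimal sketch, on synthetic data, comparing it to SciKit Learn's `Ridge`. The data, the weights and the name `paramAlpha` are assumptions made for the illustration only. The comparison uses `fit_intercept = False`, since by default `Ridge` fits an intercept term which is not regularized.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (Assumption): a hypothetical model matrix and noisy labels
oRng       = np.random.default_rng(0)
mPhi       = oRng.standard_normal((50, 4))
vY         = mPhi @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * oRng.standard_normal(50)
paramAlpha = 0.05 #<! Regularization parameter (Hypothetical name)

# Closed form: solve (Φ^T Φ + α I) w = Φ^T y
vW = np.linalg.solve(mPhi.T @ mPhi + paramAlpha * np.eye(mPhi.shape[1]), mPhi.T @ vY)

# Reference: SciKit Learn's Ridge without the (non regularized) intercept
oRidge = Ridge(alpha = paramAlpha, fit_intercept = False)
oRidge.fit(mPhi, vY)

print(f'Closed form: {vW}')
print(f'SciKit Learn: {oRidge.coef_}')
```

Both vectors should match up to numerical accuracy.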

* <font color='brown'>(**#**)</font> In SciKit Learn [`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) is used to generate Polynomial Features.

### SciKit Pipeline

Using SciKit Learn's [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) one can create an Estimator as a chain of operations.  
Usually the chain consists of one or more transformations with an estimator at the end.
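
The parameters of each step in a `Pipeline` are exposed using the `<step name>__<parameter name>` convention. This is the naming a hyper parameter search uses to address the steps. A minimal sketch follows; the object name `oPipe` and the values are assumptions for the illustration only.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# A pipeline with named steps: a transformer followed by an estimator
oPipe = Pipeline([('PolyFeatures', PolynomialFeatures(degree = 1)), ('Regressor', Ridge(alpha = 0.05))])

# Set the parameters of the steps using the `<step>__<parameter>` convention
oPipe.set_params(PolyFeatures__degree = 3, Regressor__alpha = 0.1)

dParams = oPipe.get_params()
print(dParams['PolyFeatures__degree'])
print(dParams['Regressor__alpha'])
```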

### Baseline Training

This section builds a Pipeline with the baseline values of the polynomial degree and the regularization parameter, defined in the parameters above, to obtain a baseline result.

In [None]:
# Pipeline

oPolyRidgeReg = Pipeline([('PolyFeatures', PolynomialFeatures(degree = polynomDeg)), ('Regressor', Ridge(alpha = α))])

In [None]:
# Train the Model

oPolyFit = oPolyRidgeReg.fit(dfXTrain, dsYTrain)

In [None]:
# Results on the Training Set

hF, hA = plt.subplots(figsize = (8, 6))
hA = PlotRegressionResults(dsYTrain.to_numpy(), oPolyFit.predict(dfXTrain), hA = hA)
hA.set_title(f'Ridge Regression with Polynomial Degree: {polynomDeg}, α: {α:0.2f}, $R^2$ = {oPolyFit.score(dfXTrain, dsYTrain):0.3f}');

In [None]:
# Results on the Validation Set

hF, hA = plt.subplots(figsize = (8, 6))
hA = PlotRegressionResults(dsYVal.to_numpy(), oPolyFit.predict(dfXVal), hA = hA)
hA.set_title(f'Ridge Regression with Polynomial Degree: {polynomDeg}, α: {α:0.2f}, $R^2$ = {oPolyFit.score(dfXVal, dsYVal):0.3f}');

In [None]:
# Optimize Hyper Parameters

oGridSearch = GridSearchCV(oPolyRidgeReg, param_grid = {'PolyFeatures__degree': lP, 'Regressor__alpha': vα}, cv = 10, verbose = 5)
oGridSearch.fit(dfX, dsY)

* <font color='red'>(**?**)</font> How come all data is used?
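
As a side illustration, `GridSearchCV` keeps the cross validation score of each candidate in its `cv_results_` attribute. The sketch below runs a small grid on synthetic data. The data, the grid values and the name `oGridCv` are assumptions and are not related to the concrete dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (Assumption): a noisy quadratic function of a single feature
oRng = np.random.default_rng(0)
mX   = oRng.uniform(-1.0, 1.0, size = (80, 1))
vY   = 1.0 + 2.0 * mX[:, 0] - 3.0 * np.square(mX[:, 0]) + 0.1 * oRng.standard_normal(80)

oPipe      = Pipeline([('PolyFeatures', PolynomialFeatures()), ('Regressor', Ridge())])
dParamGrid = {'PolyFeatures__degree': [1, 2, 3], 'Regressor__alpha': [0.01, 0.1]}

# Each candidate (3 x 2 = 6) is scored over 5 folds
oGridCv = GridSearchCV(oPipe, param_grid = dParamGrid, cv = 5)
oGridCv.fit(mX, vY)

numCandidates = len(oGridCv.cv_results_['params'])
vMeanScore    = oGridCv.cv_results_['mean_test_score'] #<! Mean score over the held out folds
bestDegree    = oGridCv.best_params_['PolyFeatures__degree']

for dCandidate, meanScore in zip(oGridCv.cv_results_['params'], vMeanScore):
    print(f'{dCandidate}: Mean R2 = {meanScore:0.3f}')
```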

In [None]:
# Best Model

oBestModel = oGridSearch.best_estimator_
print(f'The best model parameters: {oGridSearch.best_params_}')

### Display Error and Score

When dealing with regression, a useful visualization plots the predicted value against the reference value.  
It displays the results regardless of the number of feature dimensions.
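
The score reported in the titles of the plots is the $R^2$ score, which is what the `score()` method of a SciKit Learn regressor returns. A minimal sketch of its calculation, $R^2 = 1 - \frac{\sum_{i} {\left( y_{i} - \hat{y}_{i} \right)}^{2}}{\sum_{i} {\left( y_{i} - \bar{y} \right)}^{2}}$, on a few hypothetical values:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical labels and predictions (Not taken from the dataset)
vY     = np.array([3.0, 5.0, 7.0, 9.0])
vYPred = np.array([2.5, 5.5, 7.0, 9.5])

ssRes    = np.sum(np.square(vY - vYPred))      #<! Residual sum of squares
ssTot    = np.sum(np.square(vY - np.mean(vY))) #<! Total sum of squares
r2Manual = 1.0 - (ssRes / ssTot)

print(f'Manual R2      : {r2Manual:0.4f}')
print(f'SciKit Learn R2: {r2_score(vY, vYPred):0.4f}')
```

A perfect prediction yields $R^2 = 1$, while always predicting the mean of the labels yields $R^2 = 0$.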

In [None]:
# Results on the Validation Set

hF, hA = plt.subplots(figsize = (8, 6))
hA = PlotRegressionResults(dsYVal.to_numpy(), oBestModel.predict(dfXVal), hA = hA)
hA.set_title(f'Ridge Regression with Polynomial Degree: {oGridSearch.best_params_["PolyFeatures__degree"]}, α: {oGridSearch.best_params_["Regressor__alpha"]:0.2f}, $R^2$ = {oBestModel.score(dfXVal, dsYVal):0.3f}');