[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# Machine Learning Methods

## Supervised Learning - Ensemble Methods - Gradient Boosting

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.000 | 19/02/2023 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/MachineLearningMethods/2023_01/0032EnsembleGradientBoosting.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.ensemble import GradientBoostingRegressor

# Miscellaneous
import os
from platform import python_version
import random

# Typing
from typing import Callable, List, Tuple

# Visualization
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm, Normalize
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
#%matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2


In [None]:
# Fixel Algorithms Packages


## Gradient Boosting Regression

In this notebook we'll use a Gradient Boosting based regressor to estimate a function model from noisy measurements.  
Gradient boosting builds a sequence of estimators, each one trained to compensate for the weaknesses (the remaining errors) of the models before it.

* <font color='brown'>(**#**)</font> In this notebook we use SciKit Learn's `GradientBoostingRegressor`.  
  In practice it is better to use more optimized implementations:
  - XGBoost.
  - LightGBM.
  - CatBoost.  
* <font color='brown'>(**#**)</font> All implementations above offer a SciKit Learn compatible API.
* <font color='brown'>(**#**)</font> In this case we usually aim for higher bias and lower variance in each base estimator. Namely, each model should be lean (for example, a shallow tree).
* <font color='brown'>(**#**)</font> The Gradient Boosting approach is currently considered the _go to_ approach when working on tabular data.
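
The sequential idea described above can be sketched from scratch for the squared error loss, where the negative gradient is simply the residual. This is a minimal illustration and not how SciKit Learn implements it internally. The names `FitBoosting` and `PredictBoosting`, the tree depth and the synthetic data are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def FitBoosting(mX, vY, numTrees = 100, learningRate = 0.1, maxDepth = 3):
    # Hypothetical helper: boosting for the squared error loss
    initVal = float(np.mean(vY))              #<! Initial constant model
    vPred   = np.full(vY.shape, initVal)
    lTrees  = []
    for _ in range(numTrees):
        vR    = vY - vPred                    #<! Residuals (negative gradient)
        oTree = DecisionTreeRegressor(max_depth = maxDepth).fit(mX, vR)
        vPred = vPred + learningRate * oTree.predict(mX)
        lTrees.append(oTree)
    return initVal, lTrees

def PredictBoosting(initVal, lTrees, mX, learningRate = 0.1):
    # Hypothetical helper: initial value plus the scaled sum of the trees
    vPred = np.full(mX.shape[0], initVal)
    for oTree in lTrees:
        vPred += learningRate * oTree.predict(mX)
    return vPred

# Synthetic data (assumed for the example, not the notebook's data)
oRng = np.random.default_rng(0)
mX   = np.sort(oRng.random(150)).reshape(-1, 1)
vY   = np.sin(20 * mX[:, 0]) + 0.1 * oRng.standard_normal(150)

initVal, lTrees = FitBoosting(mX, vY)
mseInit  = np.mean((vY - initVal) ** 2)
mseBoost = np.mean((vY - PredictBoosting(initVal, lTrees, mX)) ** 2)
print(f'Train MSE, constant model: {mseInit:0.4f}')
print(f'Train MSE, boosted model : {mseBoost:0.4f}')
```

Each tree is lean (depth 3 here) and only sees what the ensemble so far failed to explain, which is why the training error of the boosted model falls below that of the constant model.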

In [None]:
# Parameters

# Data
numSamples  = 150
noiseStd    = 0.1

# Model
numEstimators = 200
learningRate  = 0.1

# Feature Permutation
numRepeats = 50


In [None]:
# Auxiliary Functions

def PlotRegressionData( mX: np.ndarray, vY: np.ndarray, hA:plt.Axes = None, figSize: Tuple[int, int] = FIG_SIZE_DEF, elmSize: int = ELM_SIZE_DEF, classColor: Tuple[str, str] = CLASS_COLOR, axisTitle: str = None ) -> plt.Axes:

    if hA is None:
        hF, hA = plt.subplots(figsize = figSize)
    else:
        hF = hA.get_figure()
    
    if np.ndim(mX) == 1:
        mX = np.reshape(mX, (mX.size, 1))

    numSamples = len(vY)
    numDim     = mX.shape[1]
    if (numDim > 2):
        raise ValueError(f'The features data must have at most 2 dimensions')
    
    # Work on 1D, Add support for 2D when needed
    # See https://matplotlib.org/stable/api/toolkits/mplot3d.html
    hA.scatter(mX[:, 0], vY, s = elmSize, color = classColor[0], edgecolor = 'k', label = f'Samples')
    hA.axvline(x = 0, color = 'k')
    hA.axhline(y = 0, color = 'k')
    hA.set_xlabel('${x}_{1}$')
    # hA.axis('equal')
    if axisTitle is not None:
        hA.set_title(axisTitle)
    hA.legend()
    
    return hA



## Generate / Load Data

We'll generate data according to the following model:

$$ y_{i} = f \left( x_{i} \right) + \epsilon_{i} $$

Where

$$ f \left( x \right) = \sin \left( 20 x \right) \sin \left( 10 {x}^{1.1} \right) + \frac{x}{10} $$


In [None]:
# Data Generation Function

def f( vX: np.ndarray ) -> np.ndarray:
    return np.sin(20 * vX) * np.sin(10 * (vX ** 1.1)) + (0.1 * vX)

In [None]:
# Loading / Generating Data

vX = np.sort(np.random.rand(numSamples))
vY = f(vX) + (noiseStd * np.random.randn(numSamples))

print(f'The features data shape: {vX.shape}')
print(f'The labels data shape: {vY.shape}')

### Plot Data

In [None]:
# Display the Data

hF, hA = plt.subplots(figsize = (12, 4))
hA = PlotRegressionData(vX, vY, hA = hA)
hA.set_xlabel('$x$')
hA.set_ylabel('$y$')

plt.show()


## Train a Gradient Boosting Model

In this section we'll train a gradient boosting model and recreate its prediction process manually.
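
As a sanity check of the manual recreation, the sketch below rebuilds the prediction of a fitted `GradientBoostingRegressor` as the initial estimator (`init_`) plus the learning rate times the sum of the trees in `estimators_`, and compares it with `predict()`. It assumes the default squared error loss and the default `init`, and it uses its own synthetic data rather than the notebook's variables.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (assumed for the example)
oRng = np.random.default_rng(0)
mX   = oRng.random((150, 1))
vY   = np.sin(20 * mX[:, 0]) + 0.1 * oRng.standard_normal(150)

oReg = GradientBoostingRegressor(n_estimators = 50, learning_rate = 0.1, random_state = 0)
oReg = oReg.fit(mX, vY)

# Start from the initial estimator (the mean of the labels for squared error)
vPredManual = oReg.init_.predict(mX).astype(float)
# Add the scaled contribution of each tree; `estimators_` has shape (n_estimators, 1)
for oTree in oReg.estimators_[:, 0]:
    vPredManual += oReg.learning_rate * oTree.predict(mX)

print(f'Manual prediction matches `predict()`: {np.allclose(vPredManual, oReg.predict(mX))}')
```

Dropping the `init_` term shifts every prediction by a constant, so the residuals of a manual reconstruction without it would be biased by the mean of the labels.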

In [None]:
# Constructing and Training the Model
oGradBoostReg = GradientBoostingRegressor(n_estimators = numEstimators, learning_rate = learningRate)
oGradBoostReg = oGradBoostReg.fit(np.reshape(vX, (-1, 1)), vY)

In [None]:
# Plot the Model by Number of Estimators
def PlotGradientBoosting(numEst: int, learningRate: float, hF: Callable, oGradBoostReg: GradientBoostingRegressor, vX: np.ndarray, vY: np.ndarray, vG: np.ndarray):
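    # The ensemble prediction starts from the initial (constant) estimator `init_`
    vYPredX = oGradBoostReg.init_.predict(np.reshape(vX, (-1, 1))).astype(float)
    vYPredG = oGradBoostReg.init_.predict(np.reshape(vG, (-1, 1))).astype(float)
    #<! Building the ensemble of trees (Each tree is scaled by the learning rate)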
    for ii in range(numEst):
        vYPredG += learningRate * oGradBoostReg.estimators_[ii, 0].predict(np.reshape(vG, (-1, 1))) #<! The `estimators_` is 2D array (Single Column Matrix)
        vYPredX += learningRate * oGradBoostReg.estimators_[ii, 0].predict(np.reshape(vX, (-1, 1)))

    _, hA = plt.subplots(nrows = 1, ncols = 2, figsize = (14, 6))
    hA[0].plot(vG, hF(vG), 'b', label = '$f(x)$')
    hA[0].plot(vG, vYPredG, 'g', label = r'$\hat{f}(x)$') #<! Raw string to keep the LaTeX backslash
    hA[0].plot(vX, vY, '.r', label = '$y_i$')
    hA[0].set_title (f'Gradient Boosting: {numEst} Trees')
    hA[0].set_xlabel('$x$')
    hA[0].grid(True)
    hA[0].legend()
    
    hA[1].plot(vX, vY, '.r', label = '$y_i$')
    hA[1].stem(vX, vY - vYPredX, '.m', label = r'$\hat{r}_i$', markerfmt = '.m') #<! Raw string to keep the LaTeX backslash
    hA[1].axhline(y = 0, color = 'k')
    hA[1].set_title (f'Gradient Boosting: Residuals')
    hA[1].set_xlabel('$x$')
    hA[1].grid(True)   
    hA[1].legend()
    
    plt.show()

In [None]:
# Plotting Wrapper
vG = np.linspace(0, 1, 1000)
hPlotGradientBoosting = lambda numEst: PlotGradientBoosting(numEst, learningRate, f, oGradBoostReg, vX, vY, vG)

In [None]:
# Interactive Plot
numEstSlider = IntSlider(min = 1, max = oGradBoostReg.n_estimators_, step = 1, value = 1, layout = Layout(width = '30%'))
interact(hPlotGradientBoosting, numEst = numEstSlider)

plt.show()
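
A non-interactive view of the same progression can be obtained with `staged_predict()`, which yields the ensemble prediction after each added tree. The sketch below tracks the training MSE per stage. It is a self-contained illustration on synthetic data (an assumption of the example), not on the notebook's variables.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (assumed for the example)
oRng = np.random.default_rng(0)
mX   = oRng.random((150, 1))
vY   = np.sin(20 * mX[:, 0]) + 0.1 * oRng.standard_normal(150)

oReg = GradientBoostingRegressor(n_estimators = 200, learning_rate = 0.1, random_state = 0)
oReg = oReg.fit(mX, vY)

# `staged_predict()` yields one prediction vector per boosting stage
vMse = np.array([np.mean((vY - vYPred) ** 2) for vYPred in oReg.staged_predict(mX)])

print(f'Train MSE after   1 tree : {vMse[0]:0.4f}')
print(f'Train MSE after {len(vMse)} trees: {vMse[-1]:0.4f}')
```

The training error keeps decreasing as trees are added, so choosing the number of estimators should rely on held out data rather than on this curve.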