[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# Machine Learning Methods

## Supervised Learning - Regression - Decision Tree Regressor

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.000 | 18/02/2023 | Royi Avital | First version                                                      |
|         |            |             |                                                                    |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/MachineLearningMethods/2023_01/0030RegressorDecisionTree.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.tree import DecisionTreeRegressor, plot_tree

# Misc
import os
from platform import python_version
import random

# Typing
from typing import Callable, List, Tuple

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
#%matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF = (8, 8)
ELM_SIZE_DEF = 50
CLASS_COLOR = ('b', 'r')


In [None]:
# Fixel Algorithms Packages


In [None]:
# Parameters

# Data Generation (1st)
numSamples = 201
noiseStd   = 0.05


# Data Visualization
numGridPts = 500

In [None]:
# Auxiliary Functions

def PlotRegressionData( mX: np.ndarray, vY: np.ndarray, hA:plt.Axes = None, figSize: Tuple[int, int] = FIG_SIZE_DEF, elmSize: int = ELM_SIZE_DEF, classColor: Tuple[str, str] = CLASS_COLOR, axisTitle: str = None ) -> plt.Axes:

    if hA is None:
        hF, hA = plt.subplots(figsize = figSize)
    else:
        hF = hA.get_figure()
    
    if np.ndim(mX) == 1:
        mX = np.reshape(mX, (mX.size, 1))

    numSamples = len(vY)
    numDim     = mX.shape[1]
    if (numDim > 2):
        raise ValueError(f'The features data must have at most 2 dimensions')
    
    # Work on 1D, Add support for 2D when needed
    # See https://matplotlib.org/stable/api/toolkits/mplot3d.html
    hA.scatter(mX[:, 0], vY, s = elmSize, color = classColor[0], edgecolor = 'k', label = f'Samples')
    hA.axvline(x = 0, color = 'k')
    hA.axhline(y = 0, color = 'k')
    hA.set_xlabel('${x}_{1}$')
    # hA.axis('equal')
    if axisTitle is not None:
        hA.set_title(axisTitle)
    hA.legend()
    
    return hA

def PlotRegressor( hR: Callable, vX: np.ndarray, labelReg: str = 'Regressor', hA: plt.Axes = None, figSize: Tuple[int, int] = FIG_SIZE_DEF ) -> plt.Axes:

    if hA is None:
        hF, hA = plt.subplots(figsize = figSize)
    else:
        hF = hA.get_figure()
    
    hA.plot(vX, hR(np.reshape(vX, (-1, 1))), c = 'r', lw = 2, label = labelReg)

    return hA



## Generate / Load Data


In [None]:
# Data Generating Function 

def f( vX: np.ndarray ) -> np.ndarray:
    vY            = 0.5 * np.ones(vX.shape[0])
    vY[vX < 3.25] = 1
    vY[vX < 2.5 ] = 0.5 + (vX[vX < 2.5] / 5) - 0.25
    vY[vX < 1.5 ] = 0
    
    return vY

In [None]:
# Loading / Generating Data

vX = 5 * np.random.rand(numSamples)
vY = f(vX) + (noiseStd * np.random.randn(numSamples))

print(f'The features data shape: {vX.shape}')
print(f'The labels data shape: {vY.shape}')

### Plot Data

In [None]:
# Display the Data

PlotRegressionData(vX, vY)

plt.show()

## Train a Decision Tree Regressor

Decision trees, given enough degrees of freedom, can easily overfit the data (they can represent any data set).  
Hence tuning their hyper parameters is important.  

The decision tree is implemented in the `DecisionTreeRegressor` class.

* <font color='brown'>(**#**)</font> The SciKit Learn defaults for a Decision Tree tend to overfit the data.
* <font color='brown'>(**#**)</font> The `max_depth` and `max_leaf_nodes` parameters are usually used mutually exclusively (one sets either one or the other).
* <font color='brown'>(**#**)</font> We can learn about the data from the shape of the tree (how balanced it is). 
* <font color='brown'>(**#**)</font> Decision Trees are usually used in the context of ensembles (Random Forests / Boosted Trees).
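
The notes above can be illustrated with a minimal sketch. It assumes synthetic step data (not the notebook's `f()`), and the names `oTreeFull` and `oTreeSmall` are hypothetical. It compares a tree trained with the default parameters to one limited by `max_leaf_nodes`:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy step function
rng = np.random.default_rng(0)
vX  = 5 * rng.random(201)
vY  = np.where(vX < 2.5, 0.0, 1.0) + 0.05 * rng.standard_normal(201)
mX  = vX.reshape(-1, 1)

# Default parameters: the tree keeps splitting until it fits the training data
oTreeFull  = DecisionTreeRegressor(random_state = 0).fit(mX, vY)
# Limiting the number of leaves constrains the degrees of freedom
oTreeSmall = DecisionTreeRegressor(max_leaf_nodes = 4, random_state = 0).fit(mX, vY)

for treeName, oTree in (('Default', oTreeFull), ('Limited', oTreeSmall)):
    print(f'{treeName}: leaves = {oTree.get_n_leaves()}, depth = {oTree.get_depth()}, train R2 = {oTree.score(mX, vY):0.3f}')
```

The default tree is expected to reach a near perfect score on the train set by fitting the noise, while the limited tree keeps only a few leaves. Comparing `get_depth()` to `get_n_leaves()` gives a rough sense of how balanced the tree is.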


In [None]:
# Plotting Function

vG = np.linspace(-0.5, 5.5, 1000)

def PlotDecisionTree( splitCriteria, numLeaf, vX: np.ndarray, vY: np.ndarray, vG: np.ndarray ):

    mX = np.reshape(vX, (-1, 1))
    mG = np.reshape(vG, (-1, 1))

    # Train the regressor
    oTreeReg = DecisionTreeRegressor(criterion = splitCriteria, max_leaf_nodes = numLeaf, random_state = 0)
    oTreeReg = oTreeReg.fit(mX, vY)
    scoreR2  = oTreeReg.score(mX, vY)
    
    hF, hA = plt.subplots(1, 2, figsize = (16, 8))
    hA = hA.flat
    
    # Regression Curve
    hA[0] = PlotRegressor(oTreeReg.predict, vG, hA = hA[0])
    hA[0] = PlotRegressionData(vX, vY, hA = hA[0], axisTitle = f'Regression, R2 = {scoreR2:0.2f}')
    hA[0].set_xlabel('$x$')
    hA[0].set_ylabel('$y$')

    # Plot the Tree
    plot_tree(oTreeReg, filled = True, ax = hA[1], rounded = True)
    hA[1].set_title(f'Max Leaf Nodes = {numLeaf}')

In [None]:
# Plotting Wrapper
hPlotDecisionTree = lambda splitCriteria, numLeaf: PlotDecisionTree(splitCriteria, numLeaf, vX, vY, vG)

In [None]:
# Display the Geometry of the Regressor

splitCriteriaDropdown   = Dropdown(options = ['squared_error', 'friedman_mse', 'absolute_error'], value = 'squared_error', description = 'Split Criteria')
numLeafSlider           = IntSlider(min = 2, max = 25, step = 1, value = 2, layout = Layout(width = '30%'))
interact(hPlotDecisionTree, splitCriteria = splitCriteriaDropdown, numLeaf = numLeafSlider)

plt.show()

* <font color='red'>(**?**)</font> What values does the regressor predict beyond the original domain of the data?
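
One way to explore the question is the sketch below. It assumes synthetic data (a noisy sine, not the notebook's `f()`), and the names `vOut` and `vEdge` are hypothetical. It compares predictions at points outside the training domain with predictions at the extreme training samples:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration) on the domain [0, 5]
rng = np.random.default_rng(0)
vX  = 5 * rng.random(201)
vY  = np.sin(vX) + 0.05 * rng.standard_normal(201)
mX  = vX.reshape(-1, 1)

oTreeReg = DecisionTreeRegressor(max_leaf_nodes = 6, random_state = 0).fit(mX, vY)

# Points outside the training domain (two on each side)
vOut  = np.array([-10.0, -0.5, 5.5, 100.0]).reshape(-1, 1)
# The extreme training samples on the matching sides
vEdge = np.array([vX.min(), vX.min(), vX.max(), vX.max()]).reshape(-1, 1)

print(f'Predictions outside the domain     : {oTreeReg.predict(vOut)}')
print(f'Predictions at the extreme samples : {oTreeReg.predict(vEdge)}')
```

Since every split is a threshold on the feature, a point outside the domain lands in the same leaf as the nearest extreme sample, so the two printed vectors should match.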