[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io/)

# Machine Learning Methods

## Exercise 004 - Regression

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.000 | 17/02/2023 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/MachineLearningMethods/2023_01/Exercise0004Regression.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Miscellaneous
import os
from platform import python_version
import random

# Typing
from typing import Tuple

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from bokeh.plotting import figure, show

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

PEOPLE_CSV_URL = 'https://github.com/FixelAlgorithmsTeam/FixelCourses/raw/master/DataSets/People.csv'


In [None]:
# Fixel Algorithms Packages


## Exercise

In this exercise we'll practice 2 approaches to solving the same problem with Linear Regression.
The models will employ a polynomial fit of degree `P`. 

We'll use the [`People.csv`](https://github.com/FixelAlgorithmsTeam/FixelCourses/blob/master/DataSets/People.csv) data set.  
It includes 1000 samples of people: Sex, Age, Height (CM), Weight (KG).  

The objective is to estimate the weight given the sex and height.  

In this exercise we'll do the following:

1. Load the [`People.csv`](https://github.com/FixelAlgorithmsTeam/FixelCourses/blob/master/DataSets/People.csv) data set using `pd.read_csv()`.
2. Analyze the data and the effect of the age feature.
3. Create 2 estimators:
 - Treats both sex and height as features for the polynomial fit.
 - Uses the sex to select the model and the height as a feature for the linear fit.
4. Verify the estimator vs. `np.polyfit()`.
5. Display the output of the model.

* <font color='brown'>(**#**)</font> In order to let the model know the data is binary / categorical we'll use a **Data Frame** as the data structure.
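
The verification against `np.polyfit()` mentioned above can be sketched on synthetic 1D data. This is a minimal sketch, not the exercise solution: the data, the seed and the names `vWModel` and `vWRef` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic 1D data (illustrative assumption, not the People data)
oRng = np.random.default_rng(0)
vX   = oRng.uniform(-2.0, 2.0, size = 200)
vY   = 0.5 - 1.5 * vX + 2.0 * np.square(vX) + 0.1 * oRng.standard_normal(200)

polyDeg = 2

# Polynomial features without the constant column, the intercept is fitted by the regressor
mX      = PolynomialFeatures(degree = polyDeg, include_bias = False).fit_transform(vX[:, None])
oLinReg = LinearRegression(fit_intercept = True).fit(mX, vY)
vWModel = np.r_[oLinReg.intercept_, oLinReg.coef_] #<! Lowest degree first

# `np.polyfit()` returns the highest degree first, hence the flip
vWRef = np.polyfit(vX, vY, deg = polyDeg)[::-1]

print(f'Model coefficients    : {vWModel}')
print(f'Reference coefficients: {vWRef}')
print(f'Match: {np.allclose(vWModel, vWRef)}')
```

Both routes solve the same least squares problem, so the coefficients should agree up to numerical precision once the ordering is aligned.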

In [None]:
# Parameters

# Model
polynomDeg = 1

# Data Visualization
gridNoiseStd = 0.05
numGridPts = 250

In [None]:
# Auxiliary Functions

def PlotRegressionData( mX: np.ndarray, vY: np.ndarray, hA:plt.Axes = None, figSize: Tuple[int, int] = FIG_SIZE_DEF, elmSize: int = ELM_SIZE_DEF, classColor: Tuple[str, str] = CLASS_COLOR, axisTitle: str = None ) -> plt.Axes:

    if hA is None:
        hF, hA = plt.subplots(figsize = figSize)
    else:
        hF = hA.get_figure()
    
    if np.ndim(mX) == 1:
        mX = np.reshape(mX, (mX.size, 1))

    numSamples = len(vY)
    numDim     = mX.shape[1]
    if (numDim > 2):
        raise ValueError(f'The features data must have at most 2 dimensions')
    
    # Work on 1D, Add support for 2D when needed
    # See https://matplotlib.org/stable/api/toolkits/mplot3d.html
    hA.scatter(mX[:, 0], vY, s = elmSize, color = classColor[0], edgecolor = 'k', label = f'Samples')
    hA.axvline(x = 0, color = 'k')
    hA.axhline(y = 0, color = 'k')
    hA.set_xlabel('${x}_{1}$')
    # hA.axis('equal')
    if axisTitle is not None:
        hA.set_title(axisTitle)
    hA.legend()
    
    return hA

def PlotRegResults( vY, vYPred, hA:plt.Axes = None, figSize: Tuple[int, int] = FIG_SIZE_DEF, lineWidth: int = LINE_WIDTH_DEF, elmSize: int = ELM_SIZE_DEF, classColor: Tuple[str, str] = CLASS_COLOR, axisTitle: str = None ) -> plt.Axes:

    if hA is None:
        hF, hA = plt.subplots(figsize = figSize)
    else:
        hF = hA.get_figure()

    numSamples = len(vY)
    if (numSamples != len(vYPred)):
        raise ValueError(f'The inputs `vY` and `vYPred` must have the same number of elements')
    
    
    hA.plot(vY, vY, color = 'r', lw = lineWidth, label = 'Ground Truth')
    hA.scatter(vY, vYPred, s = elmSize, color = classColor[0], edgecolor = 'k', label = f'Estimation')
    hA.set_xlabel('Label Value')
    hA.set_ylabel('Prediction Value')
    # hA.axis('equal')
    if axisTitle is not None:
        hA.set_title(axisTitle)
    hA.legend()
    
    return hA

def PolyModelString( vW: np.ndarray, applyLatex: bool = True ) -> str:
    modelTxt = '$y = '
    for ii in range(len(vW)):
        modelTxt += f'({vW[ii]:0.3f}) {{x}}^{{{ii}}} + '
    
    modelTxt = modelTxt[:-2]
    modelTxt += '$'

    return modelTxt


## Generate / Load Data


In [None]:
# Loading / Generating Data

dfPeople = pd.read_csv(PEOPLE_CSV_URL)

dfPeople.head(10)

In [None]:
# Data Visualization

sns.pairplot(data = dfPeople, hue = 'Sex')

plt.show()

* <font color='red'>(**?**)</font> Are all features important?

In [None]:
# Calculating the Correlation (Normalized) of Age to Weight
# Basically we're after Pearson's Correlation: Covariance(X, Y) / (Std(X) * Std(Y))

#===========================Fill This===========================#
# 1. Calculate the covariance matrix using `np.cov()` for the age and weight features
# 2. Calculate the normalized (Pearson's) correlation.
mCov        = ???
pearsonCorr = ???
#===============================================================#

print(f'The Pearson Correlation of Age and Weight is: {pearsonCorr:0.3f}')

For a linear model, a feature with (nearly) no correlation to the target contributes little to the fit on its own.  
Since we use a linear model, we can drop this feature.
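
As a minimal sketch of the formula `Covariance(X, Y) / (Std(X) * Std(Y))`, the helper below builds the correlation from `np.cov()`. The helper name `CalcPearsonCorr` and the toy vectors are hypothetical. The quadratic case is included to illustrate that a near zero Pearson correlation only rules out a _linear_ relation.

```python
import numpy as np

def CalcPearsonCorr(vA: np.ndarray, vB: np.ndarray) -> float:
    # Hypothetical helper: Covariance(A, B) / (Std(A) * Std(B))
    mC = np.cov(vA, vB) #<! 2x2 covariance matrix
    return mC[0, 1] / np.sqrt(mC[0, 0] * mC[1, 1])

vA = np.linspace(-1.0, 1.0, 201) #<! Symmetric grid around zero
vB = 2.0 * vA + 1.0              #<! Exact linear relation
vC = np.square(vA)               #<! Exact, yet non linear, relation

corrLin  = CalcPearsonCorr(vA, vB)
corrQuad = CalcPearsonCorr(vA, vC)

print(f'Linear relation   : {corrLin:0.3f}')
print(f'Quadratic relation: {corrQuad:0.3f}')
print(f'Reference (`np.corrcoef()`): {np.corrcoef(vA, vB)[0, 1]:0.3f}')
```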

In [None]:
# The Training Data 

#===========================Fill This===========================#
# 1. Extract the 'Sex' and 'Height' columns into a data frame `dfX`.
# 2. Extract the 'Weight' column into a series `dsY`.
dfX = ???
dsY = ???
#===============================================================#

print(f'The features data shape: {dfX.shape}')
print(f'The labels data shape: {dsY.shape}')

* <font color='brown'>(**#**)</font> Try running `dfY = dfPeople[['Weight']]`. What's the difference? Pay attention to the type of the data and its dimensions.

### Pre Process of Data

We have a string feature which we need to map into a numerical value.  
In previous notebooks we used the `map()` method on the `Sex` column.  
In this one we'll use the [`get_dummies()`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) method of Pandas.  
This method converts categorical features into a _one hot_ encoding.
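
A minimal sketch of the behavior on a toy data frame follows. The values and the category labels `'F'` / `'M'` are assumptions for illustration and may differ from the actual data.

```python
import pandas as pd

# Toy data (hypothetical values, only for illustration)
dfToy = pd.DataFrame({'Sex': ['F', 'M', 'M', 'F'], 'Height': [160.0, 180.0, 175.0, 165.0]})

dfFull = pd.get_dummies(dfToy)                    #<! One indicator column per category
dfDrop = pd.get_dummies(dfToy, drop_first = True) #<! Drops the first category of each encoded column

print(dfFull.columns.to_list())
print(dfDrop.columns.to_list())
```

With a binary feature the two indicator columns are redundant (one is the complement of the other), which is why keeping a single column is enough.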

In [None]:
pd.get_dummies(dfX)

#===========================Fill This===========================#
# 1. Apply `pd.get_dummies()` on `dfX`.
# 2. The output should have only 2 columns.
# !! The objective is to have an indication whether the person is male or not.
# !! Use the `drop_first` parameter properly.
dfX = ???
#===============================================================#

dfX.columns = ['Height', 'Male']
dfX

## Build the Regressors

In this section we'll do the following:

1. Build a model based on a pipeline:
  - Calculate the features according to the degree.
  - Apply linear regression.
2. Build a model based on a sub class of regressor:
  - The initialization sets the polynomial order.
  - It trains 2 models, one per sex.
3. Train the models on the whole data.
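
The sub class approach can be sketched with a deliberately simpler toy estimator. The class `GroupMeanRegressor`, its parameter `groupCol` and the toy data are hypothetical. It predicts the mean label of each group, which is enough to show the split by mask, the reassembly in input order and the `score()` method inherited from `RegressorMixin`.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin

class GroupMeanRegressor(RegressorMixin, BaseEstimator):
    # Hypothetical toy estimator: predicts the mean label of each group
    def __init__(self, groupCol: str = 'Male'):
        self.groupCol = groupCol #<! Constructor arguments are stored as is

    def fit(self, dfX: pd.DataFrame, dsY: pd.Series):
        vMask = (dfX[self.groupCol] == 1).to_numpy()
        vY    = dsY.to_numpy()
        self.meanPos_ = vY[vMask].mean()  #<! Learned attributes end with `_`
        self.meanNeg_ = vY[~vMask].mean()
        return self

    def predict(self, dfX: pd.DataFrame) -> np.ndarray:
        vMask  = (dfX[self.groupCol] == 1).to_numpy()
        vYPred = np.zeros(dfX.shape[0])
        # Write each group's prediction back in the input order
        vYPred[vMask]  = self.meanPos_
        vYPred[~vMask] = self.meanNeg_
        return vYPred

# Toy data (hypothetical values)
dfToy = pd.DataFrame({'Height': [160.0, 180.0, 175.0, 165.0], 'Male': [0, 1, 1, 0]})
dsToy = pd.Series([55.0, 80.0, 76.0, 60.0])

oModel = GroupMeanRegressor().fit(dfToy, dsToy)
vYPred = oModel.predict(dfToy)
valR2  = oModel.score(dfToy, dsToy) #<! `RegressorMixin` supplies `score()` as the R2

print(vYPred)
print(f'R2: {valR2:0.3f}')
```

`RegressorMixin` already provides an R2 based `score()`, so defining it explicitly in a sub class is mainly useful as practice or to change the metric.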


* <font color='brown'>(**#**)</font> We don't do cross validation or testing in this exercise as the emphasis is on building the models.
* <font color='brown'>(**#**)</font> The idea is to observe the way linear models interact with binary features.
* <font color='brown'>(**#**)</font> Linear models don't interact well with categorical features. Hence they are usually encoded using one hot encoding.
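
A minimal sketch of the pipeline approach on synthetic data, including the `set_params()` syntax with the `<step_name>__<parameter_name>` key. The data and the name `oPipe` are assumptions. Using `include_bias = False` is one possible reading of a lean configuration, since `LinearRegression` fits the intercept by itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a continuous feature and a binary feature (illustrative assumption)
oRng = np.random.default_rng(1)
vH   = oRng.uniform(150.0, 200.0, size = 300)
vB   = oRng.integers(0, 2, size = 300)
mX   = np.column_stack((vH, vB))
vY   = 10.0 + 0.4 * vH + 8.0 * vB #<! Noiseless linear model

oPipe = Pipeline([
    ('Transformer', PolynomialFeatures(include_bias = False)), #<! No constant column
    ('Regressor', LinearRegression(fit_intercept = True)),     #<! The intercept is part of the model
])

# Set a parameter of a step by the `<step_name>__<parameter_name>` key
oPipe = oPipe.set_params(**{'Transformer__degree': 1})
oPipe = oPipe.fit(mX, vY)

print(f'Intercept   : {oPipe["Regressor"].intercept_:0.3f}')
print(f'Coefficients: {oPipe["Regressor"].coef_}')
```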

In [None]:
# Model I
# Model based on a pipeline of `PolynomialFeatures` and `LinearRegression`.

#===========================Fill This===========================#
# 1. Create a pipeline:
#   - 'Transformer' - `PolynomialFeatures`. Set its parameters to reduce memory footprint.
#   - 'Regressor' - `LinearRegression` which includes the intercept (The bias term) as part of the model.
# 2. Set the 'Transformer' degree using the `set_params()` method. The syntax is `{'<step_name>__<parameter_name>': value}`.
oLinRegModel001 = ???
oLinRegModel001 = ???
#===============================================================#


In [None]:
# Model II

class MaleFemaleRegressor(RegressorMixin, BaseEstimator):
    def __init__(self, polyDeg = 1):
        #===========================Fill This===========================#
        # 1. Add `polyDeg` as an attribute of the object.
        # 2. Add `PolynomialFeatures` object as an attribute of the object.
        # 3. Add `LinearRegression` object as an attribute of the object (For males).
        # 4. Add `LinearRegression` object as an attribute of the object (For females).

        # !! Configure `PolynomialFeatures` and `LinearRegression` properly to optimize memory consumption.
        self.polyDeg   = ???
        self.oPolyFeat = ???
        self.oLinRegM  = ???
        self.oLinRegF  = ???
        #===============================================================#
    
    def fit(self, dfX: pd.DataFrame, dsY: pd.Series):
        
        dfXM = dfX.loc[dfX['Male'] == 1, ['Height']] #<! Using ['Height'] makes the output a DF and not a series
        dfXF = dfX.loc[dfX['Male'] == 0, ['Height']] #<! Using ['Height'] makes the output a DF and not a series
        #===========================Fill This===========================#
        # 1. Extract the labels into male and females groups.
        # 2. Apply `fit_transform()` for the features using `oPolyFeat`.
        # 3. Apply `fit()` on the features using the models.
        dsYM = ??? #<! Males
        dsYF = ??? #<! Females
        mXM  = ??? #<! Males
        mXF  = ??? #<! Females
        
        self.oLinRegM = ??? #<! Males
        self.oLinRegF = ??? #<! Females
        #===============================================================#

        return self
    
    def predict(self, dfX: pd.DataFrame):
        
        #===========================Fill This===========================#
        # 1. Split the data according to sex.
        # 2. Construct the features matrix per sex.
        # 3. Apply the `predict()` method of the matching model (`oLinRegM` / `oLinRegF`) per sex.
        dfXM = ??? #<! Using ['Height'] makes the output a DF and not a series
        dfXF = ??? #<! Using ['Height'] makes the output a DF and not a series
        mXM  = ???
        mXF  = ???
        vYM  = ???
        vYF  = ???
        #===============================================================#

        numSamples = dfX.shape[0]
        vY = np.zeros(numSamples)

        # Reconstruct the output according to the input order
        vY[(dfX['Male'] == 1).to_numpy()] = vYM
        vY[(dfX['Male'] == 0).to_numpy()] = vYF

        return vY
    
    def score(self, dfX: pd.DataFrame, dsY: pd.Series):
        # Return the R2 as the score

        #===========================Fill This===========================#
        # 1. Apply the prediction on the input features.
        # 2. Calculate the R2 score.
        vYPred = ???
        valR2  = ???
        #===============================================================#

        return valR2


In [None]:
# Construct the 2nd Model

#===========================Fill This===========================#
# 1. Construct the model using the `MaleFemaleRegressor` class.
oLinRegModel002 = ???
#===============================================================#


### Train the Model

In [None]:
# Train the Model

#===========================Fill This===========================#
# 1. Fit the 1st model on the whole data.
# 2. Fit the 2nd model on the whole data.
oLinRegModel001 = ???
oLinRegModel002 = ???
#===============================================================#

In [None]:
# Extract the Model Parameters

vW001  = np.r_[oLinRegModel001[1].intercept_, oLinRegModel001[1].coef_]
vW002M = np.r_[oLinRegModel002.oLinRegM.intercept_, oLinRegModel002.oLinRegM.coef_]
vW002F = np.r_[oLinRegModel002.oLinRegF.intercept_, oLinRegModel002.oLinRegF.coef_]

In [None]:
# Model Parameters

print(f'The 1st model coefficients         : {vW001}.')
print(f'The 2nd model coefficients (Male)  : {vW002M}.')
print(f'The 2nd model coefficients (Female): {vW002F}.')

* <font color='red'>(**?**)</font> Why does the 2nd model have fewer coefficients? 
* <font color='red'>(**?**)</font> Do both models have the same degree?

## Analyze Results

In this section we'll analyze the results of the 2 models.  

In [None]:
# The Model Score
# The R2 score of the models (The default score for regressors in Scikit Learn)

#===========================Fill This===========================#
# 1. Calculate both models score using the R2 score.
modelR2Score001 = ???
modelR2Score002 = ???
#===============================================================#

print(f'The 1st model score (R2): {modelR2Score001}.')
print(f'The 2nd model score (R2): {modelR2Score002}.')

* <font color='red'>(**?**)</font> Why does the 2nd model have a single R2 score if it has 2 models in it?
* <font color='red'>(**?**)</font> If we could have the score for females and males separately, what would be their relation to the score above? Could we calculate it?
* <font color='red'>(**?**)</font> Which model is better? Why?
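
As a starting point for these questions, here is a minimal sketch of the R2 definition, `1 - SSRes / SSTot`, evaluated on all samples and per group. The helper `CalcR2`, the toy values and the group mask are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

def CalcR2(vY: np.ndarray, vYPred: np.ndarray) -> float:
    # Hypothetical helper: R2 = 1 - (Residual Sum of Squares / Total Sum of Squares)
    ssRes = np.sum(np.square(vY - vYPred))
    ssTot = np.sum(np.square(vY - np.mean(vY))) #<! Uses the mean of the given samples
    return 1.0 - (ssRes / ssTot)

# Toy labels, predictions and group indicator (hypothetical values)
vY     = np.array([55.0, 80.0, 76.0, 60.0, 70.0, 90.0])
vYPred = np.array([57.0, 78.0, 77.0, 59.0, 72.0, 87.0])
vMask  = np.array([False, True, True, False, False, True])

r2All  = CalcR2(vY, vYPred)
r2Grp1 = CalcR2(vY[vMask], vYPred[vMask])
r2Grp0 = CalcR2(vY[~vMask], vYPred[~vMask])

print(f'R2 (All)     : {r2All:0.3f}')
print(f'R2 (Group 1) : {r2Grp1:0.3f}')
print(f'R2 (Group 0) : {r2Grp0:0.3f}')
print(f'R2 (SciKit)  : {r2_score(vY, vYPred):0.3f}')
```

The residual sums of the groups add up to the overall residual sum, yet each group's total sum is measured around its own mean. Hence the overall score is not a plain average of the per group scores.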

In [None]:
# Data Frame to Show Results
dfResults = dfPeople[['Sex', 'Height', 'Weight']].copy()
dfResults = pd.concat((dfResults, dfResults), axis = 0, ignore_index = True)
dfResults['Prediction'] = np.concatenate((oLinRegModel001.predict(dfX), oLinRegModel002.predict(dfX)), axis = 0)
dfResults['Model'] = np.concatenate((np.ones(dfX.shape[0]), 2 * np.ones(dfX.shape[0])), axis = 0)

In [None]:
# Show Regression Error Plot

hF, hA = plt.subplots(figsize = (12, 8))

sns.lineplot(data = dfResults, x = 'Weight', y = 'Weight', ax = hA, color = 'r')
sns.scatterplot(data = dfResults, x = 'Weight', y = 'Prediction', hue = 'Sex', style = 'Model', ax = hA)
hA.set_title('Models Predictions')
hA.set_xlabel('Weight Label')
hA.set_ylabel('Weight Prediction')

plt.show()

* <font color='red'>(**?**)</font> Why are the results so similar?
* <font color='red'>(**?**)</font> Have a look at the previous notebook on this data. How come the results are so different?