[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io/)

# AI Program

## Exercise 0007 - Regression

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        | Content / Changes                                                  |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.002 | 10/04/2024 | Royi Avital | Fixed a typo in section header                                     |
| 1.0.001 | 08/04/2024 | Royi Avital | Fixed the calculation of the _Pearson Correlation Coefficient_     |
| 1.0.000 | 23/03/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/Exercise0007.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Miscellaneous
import gdown
import json
import os
import random
import urllib.request
import re

# Typing
from typing import Callable, Dict, List, Optional, Self, Set, Tuple, Union

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

 ```python
 valToFill = ???
 ```

 - Multi Line to Fill (At least one)

 ```python
 # You need to start writing
 ????
 ```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

???
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())


In [None]:
# Constants

PEOPLE_CSV_URL = 'https://github.com/FixelAlgorithmsTeam/FixelCourses/raw/master/DataSets/People.csv'


In [None]:
# Course Packages


In [None]:
# General Auxiliary Functions


## Exercise

In this exercise we'll practice 2 approaches to solving the same problem with Linear Regression.
The models will employ a polynomial fit of degree `P`.

We'll use the [`People.csv`](https://github.com/FixelAlgorithmsTeam/FixelCourses/blob/master/DataSets/People.csv) data set.  
It includes 1000 samples of people: Sex, Age, Height (CM), Weight (KG).  

The objective is to estimate the weight given the sex and height.  

In this exercise we'll do the following:

1. Load the [`People.csv`](https://github.com/FixelAlgorithmsTeam/FixelCourses/blob/master/DataSets/People.csv) data set using `pd.read_csv()`.
2. Analyze the data and the effect of the age feature.
3. Create 2 estimators:
 - Treats both sex and height as features for the polynomial fit.
 - Uses the sex as a selection of model and height as a feature for linear fit.
4. Verify the estimator vs. `np.polyfit()`.
5. Display the output of the model.

* <font color='brown'>(**#**)</font> In order to let the model know the data is binary / categorical we'll use a **Data Frame** as the data structure.
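
The verification against `np.polyfit()` can be sketched on synthetic 1D data. This is a minimal sketch and not the exercise solution: the data (`vX`, `vY`), the degree and all variable names are assumptions made for illustration. Note that `np.polyfit()` returns the coefficients from the highest degree to the lowest, hence the reordering.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic 1D data (Assumption, not the `People.csv` data)
oRng = np.random.default_rng(0)
vX   = oRng.uniform(-1.0, 1.0, size = 50)
vY   = 1.5 - 2.0 * vX + 0.5 * vX ** 2 + 0.1 * oRng.standard_normal(50)

polyDeg = 2

# Polynomial features (No bias column) followed by linear regression (With intercept)
mX      = PolynomialFeatures(degree = polyDeg, include_bias = False).fit_transform(vX[:, None])
oLinReg = LinearRegression(fit_intercept = True).fit(mX, vY)

# Order the coefficients as in `np.polyfit()`: Highest degree first, intercept last
vWSk = np.concatenate((oLinReg.coef_[::-1], [oLinReg.intercept_]))
vWNp = np.polyfit(vX, vY, deg = polyDeg)

print(np.allclose(vWSk, vWNp)) #<! Both solve the same least squares problem
```

Both approaches minimize the same squared error, so the coefficients should match up to numerical precision.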

In [None]:
# Parameters

# Model
polynomDeg = 1

# Data Visualization
gridNoiseStd = 0.05
numGridPts = 250

* <font color='blue'>(**!**)</font> Fill the functions in `Auxiliary Functions` **after** reading the code below which uses them.

## Generate / Load Data

Load the data set.

In [None]:
# Load Data

dfPeople = pd.read_csv(PEOPLE_CSV_URL)

dfPeople.head(10)

### Plot Data

In [None]:
# Pair Plot of Features

sns.pairplot(data = dfPeople, hue = 'Sex')

plt.show()

* <font color='red'>(**?**)</font> Are all features important?

## Pre Process Data

### Feature Selection by Correlation

For linear models, lack of correlation means lack of significance.  
Since we use a Linear Model we can drop features with low correlation.

* <font color='brown'>(**#**)</font> This is marginal correlation, namely each feature on its own.  
There could be some interactions which the correlation might miss.
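
To illustrate the remark about interactions, here is a small sketch on synthetic data (All names and values are assumptions for illustration). The target depends only on the product of two independent features, so each feature on its own shows almost no correlation with it, while the interaction term is perfectly correlated.

```python
import numpy as np

oRng       = np.random.default_rng(0)
numSamples = 10_000

# Two independent, zero mean features (Synthetic, for illustration only)
vX1 = oRng.standard_normal(numSamples)
vX2 = oRng.standard_normal(numSamples)
vY  = vX1 * vX2 #<! The target depends only on the interaction

corrX1  = np.corrcoef(vX1, vY)[0, 1]       #<! Marginal correlation of the 1st feature
corrX2  = np.corrcoef(vX2, vY)[0, 1]       #<! Marginal correlation of the 2nd feature
corrX12 = np.corrcoef(vX1 * vX2, vY)[0, 1] #<! Correlation of the interaction term

print(f'Marginal correlations: {corrX1:0.3f}, {corrX2:0.3f}')
print(f'Interaction correlation: {corrX12:0.3f}')
```

Dropping `vX1` or `vX2` based on the marginal correlation alone would discard all the information about the target in this case.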

In [None]:
# Feature Analysis

# Calculating the Correlation (Normalized) of Age to Weight.
# Basically we're after Pearson's Correlation: Covariance(X, Y) / (Std(X) * Std(Y)).

#===========================Fill This===========================#
# 1. Calculate the covariance matrix using `np.cov()` for the age and weight features.
# 2. Calculate the normalized (Pearson's) correlation.
mCov        = ???
pearsonCorr = ???
#===============================================================#

print(f'The Pearson Correlation of Age and Weight is: {pearsonCorr:0.3f}')

* <font color='brown'>(**#**)</font> For zero mean random variables the [_Pearson Correlation_](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is equivalent to the _cosine_ of the angle between the vectors.
* <font color='brown'>(**#**)</font> Correlation is better used for filtering features which have high correlation.
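
The relation between the correlation and the cosine can be checked numerically. The following is a sketch on synthetic vectors (`vA`, `vB` and the other names are assumptions, not the exercise data): the correlation derived from the covariance matrix matches the cosine of the angle between the centered vectors.

```python
import numpy as np

# Synthetic correlated vectors (Assumption, for illustration only)
oRng = np.random.default_rng(0)
vA   = oRng.standard_normal(200)
vB   = 0.6 * vA + 0.8 * oRng.standard_normal(200)

# Pearson's correlation from the covariance matrix: Cov(A, B) / (Std(A) * Std(B))
mC      = np.cov(vA, vB) #<! 2x2 covariance matrix
corrCov = mC[0, 1] / np.sqrt(mC[0, 0] * mC[1, 1])

# Cosine of the angle between the centered (Zero mean) vectors
vAc     = vA - np.mean(vA)
vBc     = vB - np.mean(vB)
corrCos = np.dot(vAc, vBc) / (np.linalg.norm(vAc) * np.linalg.norm(vBc))

print(np.isclose(corrCov, corrCos)) #<! The normalization factor of the covariance cancels out
```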

### Train Data

Split data into features and labels.

In [None]:
# The Training Data 

#===========================Fill This===========================#
# 1. Extract the 'Sex' and 'Height' columns into a data frame `dfX`.
# 2. Extract the 'Weight' column into a series `dsY`.
dfX = ???
dsY = ???
#===============================================================#

print(f'The features data shape: {dfX.shape}')
print(f'The labels data shape: {dsY.shape}')

* <font color='brown'>(**#**)</font> Try running `dfY = dfPeople[['Weight']]`. What's the difference?  
  Pay attention to the type of data and dimensions.
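
A toy illustration of the difference between the two indexing styles (The data frame `dfToy` and its values are hypothetical, not taken from `People.csv`):

```python
import pandas as pd

# Toy data frame (Hypothetical values)
dfToy = pd.DataFrame({'Height': [170.0, 182.5, 165.0], 'Weight': [65.0, 80.0, 58.5]})

dsW = dfToy['Weight']   #<! Single brackets: A series (1D)
dfW = dfToy[['Weight']] #<! Double brackets: A data frame (2D, single column)

print(type(dsW).__name__, dsW.shape)
print(type(dfW).__name__, dfW.shape)
```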

### Feature Engineering

We have a string feature which we need to map into a numerical value.  
In previous notebooks we used the `map()` method on the `Sex` column.  
In this one we'll use the [`get_dummies()`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) method of Pandas.  
This method basically converts categorical features into a _one hot_ encoding.
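
A minimal sketch of the behavior on a toy data frame. The values and the labels `'F'` / `'M'` are assumptions for illustration and might differ from the labels in `People.csv`.

```python
import pandas as pd

# Toy data frame (Hypothetical values and labels)
dfToy = pd.DataFrame({'Sex': ['F', 'M', 'M', 'F'], 'Height': [160.0, 180.0, 175.0, 165.0]})

dfOneHot = pd.get_dummies(dfToy)                    #<! One column per category level
dfBinary = pd.get_dummies(dfToy, drop_first = True) #<! Drops the first category level

print(list(dfOneHot.columns))
print(list(dfBinary.columns))
```

For a binary category a single indicator column carries all the information, as the dropped column is its complement.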

In [None]:
# Map a String into Binary (Categorical)

pd.get_dummies(dfX)

#===========================Fill This===========================#
# 1. Apply the `get_dummies()` method of `dfX`.
# 2. The output should have only 2 columns.
# !! The objective is to have an indication whether the person is male or not.
# !! Use the `drop_first` parameter properly.
dfX = ???
#===============================================================#

dfX.columns = ['Height', 'Male']
dfX

## Build the Regressors

In this section we'll do the following:

1. Build a model based on a pipeline:
   - Calculate the features according to the degree.
   - Apply linear regression.
2. Build a model based on a sub class of regressor:
   - The initialization sets the polynomial order.
   - It trains 2 models, one per sex.
3. Train the models on the whole data.

<br/>

* <font color='brown'>(**#**)</font> We don't do cross validation or testing in this exercise as the emphasis is on building the models.
* <font color='brown'>(**#**)</font> The idea is to observe the way linear models interact with binary features.
* <font color='brown'>(**#**)</font> Linear models don't interact well with categorical features. Hence, usually, they are encoded as one hot encoding.
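
The way a pipeline step is configured by `set_params()`, and the way a degree 1 linear model treats a binary feature, can be sketched on synthetic data. This is only a sketch: the data, the step names and the variable names below are assumptions for illustration and not the exercise's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: A continuous feature and a binary feature (Assumption)
oRng = np.random.default_rng(0)
vH   = oRng.uniform(150.0, 200.0, size = 100) #<! Continuous feature
vB   = oRng.integers(0, 2, size = 100)        #<! Binary feature
mX   = np.column_stack((vH, vB))
vY   = 0.5 * vH + 10.0 * vB + oRng.standard_normal(100)

oPipe = Pipeline([('Transformer', PolynomialFeatures(include_bias = False)), ('Regressor', LinearRegression())])
oPipe = oPipe.set_params(**{'Transformer__degree': 1}) #<! Syntax: `<step_name>__<parameter_name>`
oPipe = oPipe.fit(mX, vY)

vW = oPipe.named_steps['Regressor'].coef_ #<! One coefficient per feature (Degree 1)

print(f'Slope of the continuous feature: {vW[0]:0.3f}')
print(f'Offset due to the binary feature: {vW[1]:0.3f}')
```

With degree 1 the binary feature only shifts the intercept, so the fit is a pair of parallel lines, one per value of the binary feature.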

In [None]:
# Model I
# Model based on a pipeline of `PolynomialFeatures` and `LinearRegression`.

#===========================Fill This===========================#
# 1. Create a pipeline:
#   - 'Transformer' - `PolynomialFeatures`. Set its parameters to reduce memory footprint.
#   - 'Regressor' - `LinearRegression` which includes the intercept (The bias term) as part of the model.
# 2. Set the 'Transformer' degree using the `set_params()` method. The syntax is `{'<step_name>__<parameter_name>': value}`.
oLinRegModel001 = ???
oLinRegModel001 = ???
#===============================================================#


In [None]:
# Model II

class MaleFemaleRegressor(RegressorMixin, BaseEstimator):
    def __init__(self, polyDeg = 1):
        #===========================Fill This===========================#
        # 1. Add `polyDeg` as an attribute of the object.
        # 2. Add `PolynomialFeatures` object as an attribute of the object.
        # 3. Add `LinearRegression` object as an attribute of the object (For males).
        # 4. Add `LinearRegression` object as an attribute of the object (For females).
        # !! Configure `PolynomialFeatures` and `LinearRegression` properly to optimize memory consumption.

        self.polyDeg   = ???
        self.oPolyFeat = ???
        self.oLinRegM  = ??? #<! Male model
        self.oLinRegF  = ??? #<! Female model
        #===============================================================#
    
    def fit(self, dfX: pd.DataFrame, dsY: pd.Series):
        
        dfXM = dfX.loc[dfX['Male'] == 1, ['Height']] #<! Using ['Height'] makes the output a DF and not a series
        dfXF = dfX.loc[dfX['Male'] == 0, ['Height']] #<! Using ['Height'] makes the output a DF and not a series
        #===========================Fill This===========================#
        # 1. Extract the labels into male and females groups.
        # 2. Apply `fit_transform()` for the features using `oPolyFeat`.
        # 3. Apply `fit()` on the features using the models.
        
        dsYM = ??? #<! Males
        dsYF = ??? #<! Females
        mXM  = ??? #<! Males
        mXF  = ??? #<! Females
        
        self.oLinRegM = ??? #<! Males
        self.oLinRegF = ??? #<! Females
        #===============================================================#

        return self
    
    def predict(self, dfX: pd.DataFrame):
        
        #===========================Fill This===========================#
        # 1. Split the data according to sex.
        # 2. Construct the features matrix per sex.
        # 3. Apply the `predict()` method of `oLinReg` per sex.
        
        dfXM = ??? #<! Using ['Height'] makes the output a DF and not a series
        dfXF = ??? #<! Using ['Height'] makes the output a DF and not a series
        mXM  = ???
        mXF  = ???
        vYM  = ???
        vYF  = ???
        #===============================================================#

        numSamples = dfX.shape[0]
        vY = np.zeros(numSamples)

        # Reconstruct the output according to the input order
        vY[(dfX['Male'] == 1).to_numpy()] = vYM
        vY[(dfX['Male'] == 0).to_numpy()] = vYF

        return vY
    
    def score(self, dfX: pd.DataFrame, dsY: pd.Series):
        # Return the R2 as the score

        #===========================Fill This===========================#
        # 1. Apply the prediction on the input features.
        # 2. Calculate the R2 score.
        
        vYPred = ???
        valR2  = ???
        #===============================================================#

        return valR2


In [None]:
# Construct the 2nd Model

#===========================Fill This===========================#
# 1. Construct the model using the `MaleFemaleRegressor` class.
oLinRegModel002 = ???
#===============================================================#


### Train the Model

In [None]:
# Train the Model

#===========================Fill This===========================#
# 1. Fit the 1st model on the whole data.
# 2. Fit the 2nd model on the whole data.
oLinRegModel001 = ???
oLinRegModel002 = ???
#===============================================================#

In [None]:
# Extract the Model Parameters

vW001  = ???
vW002M = ???
vW002F = ???

In [None]:
# Model Parameters

print(f'The 1st model coefficients         : {vW001}.')
print(f'The 2nd model coefficients (Male)  : {vW002M}.')
print(f'The 2nd model coefficients (Female): {vW002F}.')

* <font color='red'>(**?**)</font> Why does the 2nd model have fewer coefficients?
* <font color='red'>(**?**)</font> Do both models have the same degree?

## Analyze Results

In this section we'll analyze the results of the 2 models.  

In [None]:
# The Model Score
# The R2 score of the models (The default score for regressors in SciKit Learn)

#===========================Fill This===========================#
# 1. Calculate both models score using the R2 score.
modelR2Score001 = ???
modelR2Score002 = ???
#===============================================================#

print(f'The 1st model score (R2): {modelR2Score001}.')
print(f'The 2nd model score (R2): {modelR2Score002}.')

* <font color='red'>(**?**)</font> Why does the 2nd model have a single R2 score if it has 2 models in it?
* <font color='red'>(**?**)</font> If we could have the score for females and males separately, what would be their relation to the score above? Could we calculate it?
* <font color='red'>(**?**)</font> Which model is better? Why?
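
As a reminder of what the score measures, the R2 score is `1 - (SS_res / SS_tot)`: the residual sum of squares relative to the sum of squares around the mean of the labels. A small sketch with hypothetical values (Not outputs of the models above) which compares the manual calculation to `r2_score()` of SciKit Learn:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical labels and predictions (For illustration only)
vYTrue = np.array([60.0, 72.5, 81.0, 55.0, 90.0])
vYPred = np.array([62.0, 70.0, 80.0, 58.0, 88.0])

ssRes = np.sum(np.square(vYTrue - vYPred))          #<! Residual sum of squares
ssTot = np.sum(np.square(vYTrue - np.mean(vYTrue))) #<! Total sum of squares (Around the mean)
valR2 = 1.0 - (ssRes / ssTot)

print(np.isclose(valR2, r2_score(vYTrue, vYPred)))
```

Note that the denominator depends on the mean of the labels of the evaluated set.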

In [None]:
# Data Frame to Show Results
dfResults = dfPeople[['Sex', 'Height', 'Weight']].copy()
dfResults = pd.concat((dfResults, dfResults), axis = 0, ignore_index = True)
dfResults['Prediction'] = np.concatenate((oLinRegModel001.predict(dfX), oLinRegModel002.predict(dfX)), axis = 0)
dfResults['Model'] = np.concatenate((np.ones(dfX.shape[0]), 2 * np.ones(dfX.shape[0])), axis = 0)

In [None]:
# Show Regression Error Plot

hF, hA = plt.subplots(figsize = (12, 8))

sns.lineplot(data = dfResults, x = 'Weight', y = 'Weight', ax = hA, color = 'r')
sns.scatterplot(data = dfResults, x = 'Weight', y = 'Prediction', hue = 'Sex', style = 'Model', ax = hA)
hA.set_title('Models Predictions')
hA.set_xlabel('Weight Label')
hA.set_ylabel('Weight Prediction')

plt.show()

* <font color='red'>(**?**)</font> Why are the results so similar?
* <font color='red'>(**?**)</font> Have a look at the previous notebook of this data. How come the results are so different?