# Loss Functions

In this exercise, you will compare the effect of different loss functions on a linear regression model.

👇 Import the data from the attached CSV file

In [53]:
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

df.head(5)

| Unnamed: 0 | Relative Compactness | Surface Area | Wall Area | Roof Area | Overall Height | Glazing Area | Average Temperature |
|---|---|---|---|---|---|---|---|
| 0 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.0 | 18.44 |
| 1 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.0 | 18.44 |
| 2 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.0 | 18.44 |
| 3 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.0 | 18.44 |
| 4 | 0.9 | 563.5 | 318.5 | 122.5 | 7.0 | 0.0 | 24.56 |


🎯 Your task is to predict the average temperature inside a greenhouse based on its design. Your temperature predictions will help you select the appropriate greenhouse design for each plant, based on its climatic needs.

🌿 You know that plants can handle small temperature variations, but are exponentially more sensitive as the temperature variations increase. 

## 1. Theory 

❓ Theoretically, which Loss function would you train your model on to limit the risk of killing plants?

<details>
<summary> 🆘 Answer </summary>
    
In theory, you would use a Mean Squared Error (MSE) loss function. Because it squares each error, it penalizes large prediction errors far more heavily than small ones, which discourages the model from making large mistakes. This keeps temperature deviations small and lowers the risk for the plants.

</details>

Mean Squared Error (MSE) loss function
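
To make this intuition concrete, here is a minimal sketch comparing how the two losses score many small misses against one large miss. The error values are made up for illustration and are not taken from the dataset.

```python
# Hypothetical prediction errors in °C (illustrative values only)
small_errors = [1.0] * 10            # ten predictions, each off by 1 °C
one_big_error = [0.0] * 9 + [10.0]   # nine perfect predictions, one off by 10 °C

def mse(errors):
    """Mean of the squared errors."""
    return sum(e ** 2 for e in errors) / len(errors)

def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

print(mae(small_errors), mae(one_big_error))  # 1.0 1.0  -> MAE cannot tell them apart
print(mse(small_errors), mse(one_big_error))  # 1.0 10.0 -> MSE punishes the single large miss
```

Under MAE both scenarios cost the same, while MSE makes the single 10 °C miss ten times more expensive, which is the behaviour you want when large deviations are the dangerous ones.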

## 2. Application

### 2.1 Preprocessing

👇 Scale the features

In [54]:
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

scaler = MinMaxScaler()

X = df.drop(columns='Average Temperature')

y = df['Average Temperature']

scaler.fit(X)

X = scaler.transform(X)
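
As a rough illustration of what min-max scaling does, the sketch below rescales each column of a small, made-up feature matrix to the range [0, 1] with plain NumPy. The names `X_toy`, `col_min` and `col_max` are hypothetical, and the sketch assumes no column is constant (otherwise it would divide by zero).

```python
import numpy as np

# Toy feature matrix (hypothetical values): 3 samples, 2 features
X_toy = np.array([[514.5, 7.0],
                  [563.5, 3.5],
                  [612.5, 7.0]])

col_min = X_toy.min(axis=0)   # per-feature minimum
col_max = X_toy.max(axis=0)   # per-feature maximum

# Min-max scaling: map each feature linearly onto [0, 1]
X_scaled = (X_toy - col_min) / (col_max - col_min)

print(X_scaled)
```

After scaling, every feature spans the same range, so no single feature dominates the gradient updates simply because it is measured in larger units.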

### 2.2 Modelling

In this section, you are going to verify the theory by evaluating models optimized on different Loss functions.

### Least Squares (MSE) Loss

👇 **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **Least Squares Loss** (MSE)
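
For intuition about what "optimized by SGD on a least squares loss" means, here is a bare-bones sketch of stochastic gradient descent on a squared-error loss, written with NumPy on synthetic data. It is not how `SGDRegressor` is implemented: the real estimator adds refinements such as a learning-rate schedule and regularization, which are omitted here. All names and values (`X_demo`, `y_demo`, `learning_rate`, the true coefficients) are assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): y = 3*x1 - 2*x2 + 1 plus a little noise
X_demo = rng.uniform(0, 1, size=(200, 2))
true_w, true_b = np.array([3.0, -2.0]), 1.0
y_demo = X_demo @ true_w + true_b + rng.normal(0, 0.01, size=200)

w, b = np.zeros(2), 0.0
learning_rate = 0.1  # constant step size, chosen by hand for this toy problem

for epoch in range(50):
    for i in rng.permutation(len(X_demo)):      # visit samples in random order
        error = X_demo[i] @ w + b - y_demo[i]   # residual on a single sample
        # Gradient of 0.5 * error**2 with respect to w and b
        w -= learning_rate * error * X_demo[i]
        b -= learning_rate * error

print(w, b)  # should land close to true_w and true_b
```

Because the update is proportional to the residual, a sample with a large error pulls the weights much harder than a sample with a small one.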



In [65]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import SGDRegressor, LinearRegression

model = SGDRegressor(loss='squared_loss') # linear model fit by SGD on the squared-error loss (named 'squared_error' in scikit-learn >= 1.0)

cv_results = cross_validate(model, X, y, cv=10, scoring=['r2', 'max_error'])

cv_results



{'fit_time': array([0.01502633, 0.00914073, 0.00866985, 0.00840831, 0.00940275,
        0.00867176, 0.00792766, 0.01006627, 0.01014376, 0.01018786]),
 'score_time': array([0.00177479, 0.00069094, 0.00068402, 0.00062895, 0.00057101,
        0.00066137, 0.00062943, 0.00052619, 0.000772  , 0.00063705]),
 'test_r2': array([0.77472343, 0.89925494, 0.88659834, 0.87600356, 0.92587122,
        0.89219304, 0.92386689, 0.91315961, 0.89222853, 0.93479958]),
 'test_max_error': array([-9.27305894, -9.14738682, -9.33477141, -9.71515696, -9.36588504,
        -9.14812382, -9.11183829, -9.33542511, -8.86857081, -8.11682497])}

👇 Compute:
- the mean cross-validated R2 score `r2`
- the single biggest prediction error (in °C) across all your folds `max_error`

(Tip: `max_error` is an accepted scoring metric in sklearn)
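
Note that scikit-learn scorers follow a "greater is better" convention, so the `max_error` scorer reports negated values, as the output above shows. The sketch below uses the per-fold scores from the MSE run above, rounded to two decimals, to show the difference between averaging the per-fold maxima and picking the single worst fold. The variable names are hypothetical.

```python
import numpy as np

# Per-fold 'test_max_error' scores, rounded from the MSE run above
fold_scores = np.array([-9.27, -9.15, -9.33, -9.72, -9.37,
                        -9.15, -9.11, -9.34, -8.87, -8.12])

mean_of_fold_maxima = fold_scores.mean()     # average worst-case error per fold (negated)
worst_fold_score = fold_scores.min()         # most negative score = single largest error
worst_error_celsius = abs(worst_fold_score)  # back to a positive error in °C

print(mean_of_fold_maxima, worst_fold_score, worst_error_celsius)
```

The mean summarizes the typical worst case per fold, whereas the minimum of the negated scores is the single biggest error observed anywhere in the cross-validation.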

In [67]:
r2 = cv_results['test_r2'].mean()

r2

0.8918699147136033

In [71]:
max_error = cv_results['test_max_error'].min() # scores are negated, so the minimum is the single largest error

max_error

-9.71515696

### Mean Absolute Error (MAE) Loss

What if we optimize our model on the MAE instead?

👇 **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **MAE** Loss

<details>
<summary>💡 Hints</summary>

- MAE loss cannot be directly specified in `SGDRegressor`. It must be engineered by adjusting the right parameters

</details>
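
To see why an epsilon-insensitive loss can stand in for MAE, recall that it ignores errors smaller than `epsilon` and charges the excess linearly beyond that. The sketch below, with made-up temperatures and a hypothetical helper `epsilon_insensitive`, checks that setting `epsilon` to 0 gives exactly the absolute error.

```python
import numpy as np

def epsilon_insensitive(y_true, y_pred, epsilon):
    """Per-sample epsilon-insensitive loss: max(0, |y - y_hat| - epsilon)."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([18.0, 24.0, 30.0])   # hypothetical temperatures in °C
y_pred = np.array([18.5, 21.0, 30.0])

abs_err = np.abs(y_true - y_pred)
loss_eps0 = epsilon_insensitive(y_true, y_pred, epsilon=0.0)  # identical to abs_err
loss_eps1 = epsilon_insensitive(y_true, y_pred, epsilon=1.0)  # errors under 1 °C cost nothing

print(abs_err, loss_eps0, loss_eps1)
```

Averaging `loss_eps0` over the samples therefore gives the mean absolute error.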

In [72]:
model = SGDRegressor(loss='epsilon_insensitive', epsilon=0) # epsilon-insensitive loss with epsilon=0 is equivalent to an MAE loss

cv_results = cross_validate(model, X, y, cv=10, scoring=['r2', 'max_error'])

cv_results

{'fit_time': array([0.00615406, 0.00582886, 0.00531292, 0.00469375, 0.00536704,
        0.00436068, 0.00483084, 0.00432873, 0.00459313, 0.00470948]),
 'score_time': array([0.00161934, 0.00064969, 0.00134015, 0.00055599, 0.00061178,
        0.0005641 , 0.0006187 , 0.00049305, 0.00066853, 0.00051665]),
 'test_r2': array([0.72590078, 0.86616605, 0.86887617, 0.84209119, 0.90570449,
        0.85659732, 0.91207074, 0.86914598, 0.86769024, 0.92389726]),
 'test_max_error': array([-11.47406541, -10.00271375, -10.3436236 , -10.94746849,
        -11.17497047, -11.18771345, -10.78606503, -12.07062098,
        -11.43656362, -10.86903132])}

👇 Compute:
- the mean cross-validated R2 score `r2_mae`
- the single biggest prediction error (in °C) across all your folds `max_error_mae`

In [75]:
r2_mae = cv_results['test_r2'].mean()

r2_mae

0.8638140221262894

In [76]:
max_error_mae = cv_results['test_max_error'].min() # scores are negated, so the minimum is the single largest error

max_error_mae

-12.07062098

## 3. Conclusion

❓Which of the models you evaluated seems the most appropriate for your task?

<details>
<summary> 🆘Answer </summary>
    
Although the mean cross-validated R2 scores of the two models are similar, the model optimized on MAE is more likely to make occasional large errors (its maximum error is larger in every fold), which increases the risk of killing plants!

    
</details>

MSE

# 🏁 Check your code

In [74]:
from nbresult import ChallengeResult

result = ChallengeResult('loss_functions',
    r2 = r2,
    r2_mae = r2_mae,
    max_error = max_error,
    max_error_mae = max_error_mae,                     
)
result.write()
print(result.check())

platform linux -- Python 3.8.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/useradd/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/useradd/code/LucaVanTichelen/data-challenges/05-ML/04-Under-the-hood/01-Loss-Functions
plugins: dash-2.0.0, anyio-3.3.2
collecting ... collected 3 items

tests/test_loss_functions.py::TestLossFunctions::test_max_error_order PASSED [ 33%]
tests/test_loss_functions.py::TestLossFunctions::test_r2 PASSED [ 66%]
tests/test_loss_functions.py::TestLossFunctions::test_r2_mae PASSED [100%]



💯 You can commit your code:

git add tests/loss_functions.pickle

git commit -m 'Completed loss_functions step'

git push origin master
