# Loss Functions

In this exercice, you will compare the effects of Loss functions on model performance.

👇 Import the data

In [14]:
import pandas as pd

data = pd.read_csv("data.csv")

data.head()

Unnamed: 0,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area,Average Temperature
0,0.98,514.5,294.0,110.25,7.0,0.0,18.44
1,0.98,514.5,294.0,110.25,7.0,0.0,18.44
2,0.98,514.5,294.0,110.25,7.0,0.0,18.44
3,0.98,514.5,294.0,110.25,7.0,0.0,18.44
4,0.9,563.5,318.5,122.5,7.0,0.0,24.56


The dataset describes architectural properties of greenhouses, in which plants requiring regulated climatic conditions are grown. These properties affect the average temperature inside the greenhouse, the target for this exercice.

## 1. Preprocessing

👇 Scale the features

In [143]:
from sklearn.preprocessing import MinMaxScaler

# Select only the features 
X = data.loc[:,'Relative Compactness':'Glazing Area']

# Fit scaler to continuous features 
scaler = MinMaxScaler().fit(X)

# Scale continuous features 
X_scaled = scaler.transform(X)

## 2. Least Squares Loss Modelling

👇 10-Fold Cross validate a regression model optimized by Stochastic Gradient Descent on a Least Squares Loss. Return the following metrics:

- R2 Score
- Max Error


<details>
<summary>💡 Hint</summary>
    
`max_error` [(doc)](https://scikit-learn.org/stable/modules/model_evaluation.html) returns the largest error and can be passed as a scoring parameter in cross validation.


</details>



In [133]:
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.linear_model import SGDRegressor

# Squared loss SGD Regressor
sl_model = SGDRegressor(loss="squared_loss")

# Cross Validate Model
sl_cv = cross_validate(sl_model, X, data['Average Temperature'], cv = 10, scoring = ['r2','max_error'] )

👇 What is the model's average R2 score?

In [134]:
sl_cv['test_r2'].mean()

0.8899348398564205

👇 What is the largest error the model makes?

In [136]:
sl_cv['test_max_error'].min()

-10.064620796095618

## 3. Mean Absolute Error Loss Modelling

👇 10-Fold Cross validate a linear regression model optimized by Stochastic Gradient Descent on a Mean Absolute Error Loss. This type of Loss cannot be directly specified, and must be engineered by adjusting the right parameters 😉

<details>
<summary>💡 Hint</summary>
    
In `SGDRegressor`, one type of loss "ignores errors less than epsilon and is linear past that". 

</details>

In [138]:
#MAE loss engineered by setting epsilon_insensitive = 0
mae_model = SGDRegressor(loss="epsilon_insensitive", epsilon = 0)

# Cross Validate Model
mae_sgd = cross_validate(mae_model, X, data['Average Temperature'], cv = 10,  scoring = ['r2','max_error'])

👇 What is the model's average R2 score?

In [139]:
mae_sgd['test_r2'].mean()

0.8659243592843813

👇 What is the largest error the model makes?

In [140]:
mae_sgd['test_max_error'].min()

-12.03934928242115

## 4. Model Selection

❓ Your task is to predict the average temperature inside a greenhouse based on its design. You know that plants can handle small temperature variations, but are exponentially more sensitive as the temperature variations increase. Which Loss function do you train your model on to limit the risk of killing plants?

<details>
<summary> Answer </summary>
    
By theory, you would use a Squared Loss function. It would penalize outlier predictions and prevent your model from committing large errors. 

Comparing the maximum error of each model confirms the theory.

</details>

In [131]:
sl_cv['test_max_error'].min()

-10.02875488188051

In [132]:
mae_sgd['test_max_error'].min()

-12.043112757725083

### ⚠️ Please, push your exercice when you are done 🙃

# 🏁