# Loss Functions

In this exercise, you will compare the effects of different loss functions on a linear regression model.

👇 Import the data from the attached CSV file

In [1]:
import pandas as pd
data = pd.read_csv("data.csv")
data.head()

Unnamed: 0,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area,Average Temperature
0,0.98,514.5,294.0,110.25,7.0,0.0,18.44
1,0.98,514.5,294.0,110.25,7.0,0.0,18.44
2,0.98,514.5,294.0,110.25,7.0,0.0,18.44
3,0.98,514.5,294.0,110.25,7.0,0.0,18.44
4,0.9,563.5,318.5,122.5,7.0,0.0,24.56


🎯 Your task is to predict the average temperature inside a greenhouse based on its design. Your temperature predictions will help you select the appropriate greenhouse design for each plant, based on its climatic needs.

🌿 You know that plants can handle small temperature variations, but are exponentially more sensitive as the temperature variations increase. 

## 1. Theory 

❓ Theoretically, which Loss function would you train your model on to limit the risk of killing plants?

<details>
<summary> 🆘 Answer </summary>
    
In theory, you would use a Mean Squared Error (MSE) Loss function. Because it squares each error, it penalizes large prediction errors much more heavily than small ones and discourages your model from committing them. This would ensure smaller temperature variations and a lower risk for plants.

</details>
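
To make the difference concrete, here is a minimal sketch comparing the two losses on two made-up sets of prediction errors (the values and the names `steady` and `spiky` are illustrative assumptions, not taken from the dataset). Both sets have the same mean absolute error, but the set containing one large miss has a much higher mean squared error.

```python
def mse(errors):
    """Mean squared error of a list of prediction errors."""
    return sum(e ** 2 for e in errors) / len(errors)


def mae(errors):
    """Mean absolute error of a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical errors in °C: four moderate misses vs. one large miss
steady = [2.0, 2.0, 2.0, 2.0]
spiky = [0.0, 0.0, 0.0, 8.0]

print(mae(steady), mae(spiky))  # 2.0 2.0  -> MAE cannot tell them apart
print(mse(steady), mse(spiky))  # 4.0 16.0 -> MSE punishes the single 8 °C miss
```

A model trained on MSE is therefore pushed away from the `spiky` pattern, which is the one that would hurt the plants.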

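Note that the model itself solves a regression problem (it predicts a temperature). The decision it supports is closer to a classification: is the temperature variation too high for the plants, or within an acceptable range?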

## 2. Application

### 2.1 Preprocessing

👇 Scale the features

In [2]:
X = data.copy().drop(columns="Average Temperature")
y = data["Average Temperature"]

In [3]:
from sklearn.preprocessing import RobustScaler
r_scaler = RobustScaler()  # Instantiate the robust scaler
r_scaler.fit(X)  # Fit the scaler to the features
X_scaled = r_scaler.transform(X)  # Scale the features
X_scaled

array([[ 1.55932203, -1.18181818, -0.5       , -0.92307692,  0.5       ,
        -0.83333333],
       [ 1.55932203, -1.18181818, -0.5       , -0.92307692,  0.5       ,
        -0.83333333],
       [ 1.55932203, -1.18181818, -0.5       , -0.92307692,  0.5       ,
        -0.83333333],
       ...,
       [-0.88135593,  1.        ,  1.        ,  0.46153846, -0.5       ,
         0.5       ],
       [-0.88135593,  1.        ,  1.        ,  0.46153846, -0.5       ,
         0.5       ],
       [-0.88135593,  1.        ,  1.        ,  0.46153846, -0.5       ,
         0.5       ]])
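
As a rough illustration of what the scaler computes, the sketch below applies the same idea by hand: with its default settings, `RobustScaler` centers each feature on its median and divides by its interquartile range. The `surface_area` values are hypothetical and chosen only to show that an extreme value does not distort the scaling of the others.

```python
import numpy as np

# Hypothetical surface areas, with one unusually large design
surface_area = np.array([500.0, 520.0, 540.0, 560.0, 2000.0])

median = np.median(surface_area)
q1, q3 = np.percentile(surface_area, [25, 75])

# Center on the median, scale by the interquartile range (IQR)
scaled = (surface_area - median) / (q3 - q1)
print(scaled)
```

The four ordinary designs end up between -1 and 0.5, while the outlier stays clearly separated instead of compressing everything else.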

### 2.2 Modelling

In this section, you are going to verify the theory by evaluating models optimized on different Loss functions.

### Least Squares (MSE) Loss

👇 **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **Least Squares Loss** (MSE)
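
For intuition about what "optimized by SGD on a least squares loss" means, here is a minimal hand-written sketch on synthetic one-feature data. It is not how `SGDRegressor` is implemented (which adds regularization and a learning-rate schedule, among other things); the data, the learning rate `lr`, the number of epochs and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + 1 plus a little noise (assumed, not the greenhouse data)
X_toy = rng.normal(size=(200, 1))
y_toy = 3.0 * X_toy[:, 0] + 1.0 + rng.normal(scale=0.1, size=200)

w, b = 0.0, 0.0  # weight and intercept to learn
lr = 0.01        # constant learning rate (assumption)

for epoch in range(50):
    # Visit the samples one at a time, in random order
    for i in rng.permutation(len(X_toy)):
        error = (w * X_toy[i, 0] + b) - y_toy[i]
        # Gradient of 0.5 * error**2 with respect to w and b
        w -= lr * error * X_toy[i, 0]
        b -= lr * error

print(w, b)  # should end up close to 3 and 1
```

Each update is proportional to the error itself, so a sample with a large error pulls the parameters much harder than a sample with a small one.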



In [4]:
from sklearn.linear_model import SGDRegressor
# Note: scikit-learn 1.0 renamed this loss to 'squared_error' ('squared_loss' was removed in 1.2)
lin_reg_sgd = SGDRegressor(loss='squared_loss')

In [5]:
from sklearn.model_selection import cross_validate
cv_results = cross_validate(lin_reg_sgd, X_scaled, y, cv=10,
                            scoring=['max_error', 'r2'])

In [6]:
cv_results

{'fit_time': array([0.01031899, 0.01071787, 0.00897503, 0.00793672, 0.00755668,
        0.00792146, 0.00828075, 0.00770187, 0.00757289, 0.00833178]),
 'score_time': array([0.00062156, 0.00055718, 0.00050735, 0.00038218, 0.00041342,
        0.00037909, 0.00039482, 0.00036502, 0.00037599, 0.00045061]),
 'test_max_error': array([-9.49579676, -8.92020163, -9.12251524, -9.53990582, -9.26036828,
        -8.93465878, -8.84849578, -9.1473085 , -8.77581784, -7.96480059]),
 'test_r2': array([0.78312402, 0.90786231, 0.8942457 , 0.88170295, 0.93128583,
        0.89669243, 0.92825743, 0.91603718, 0.89625968, 0.93941866])}

👇 Compute 
- the mean cross validated R2 score `r2`
- the single biggest prediction error in °C across all your folds `max_error`

(Tip: `max_error` is an accepted scoring metric in sklearn)
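
One detail worth checking here: scikit-learn scorers follow a "higher is better" convention, so the `max_error` scorer returns negated errors, as the negative values in the output above show. The sketch below copies those fold scores into a plain array (the names `worst` and `best` are illustrative) to show that the largest error corresponds to the minimum score, not the maximum.

```python
import numpy as np

# Fold scores copied from the `test_max_error` output above
test_max_error = np.array([-9.49579676, -8.92020163, -9.12251524, -9.53990582,
                           -9.26036828, -8.93465878, -8.84849578, -9.1473085,
                           -8.77581784, -7.96480059])

worst = test_max_error.min()  # most negative score = largest error
best = test_max_error.max()   # least negative score = smallest of the fold maxima

print(worst, abs(worst))  # the biggest error, as a score and in °C
print(best, abs(best))
```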

In [7]:
r2 = cv_results["test_r2"].mean()
r2

0.8974886188302046

In [8]:
# Scores are negated errors, so the biggest error is the most negative score
max_error = cv_results["test_max_error"].min()
max_error

-9.53990582

### Mean Absolute Error (MAE) Loss

What if we optimize our model on the MAE instead?

👇 **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **MAE** Loss

<details>
<summary>💡 Hints</summary>

- MAE loss cannot be directly specified in `SGDRegressor`. It must be engineered by adjusting the right parameters

</details>
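
One way to see why this works: the epsilon-insensitive loss ignores errors smaller than `epsilon` and grows linearly beyond that, so setting `epsilon` to 0 leaves exactly the absolute error. The helper below is a hand-written illustration of that formula, not scikit-learn's own code, and the temperatures are made up.

```python
def epsilon_insensitive(y_true, y_pred, epsilon):
    """Loss that is zero inside the epsilon band and linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# With epsilon = 0 the loss is simply the absolute error
print(epsilon_insensitive(20.0, 23.5, epsilon=0.0))  # 3.5

# With epsilon = 1 the first degree of error is forgiven
print(epsilon_insensitive(20.0, 23.5, epsilon=1.0))  # 2.5
print(epsilon_insensitive(20.0, 20.4, epsilon=1.0))  # 0.0
```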

In [9]:
lin_reg_sgd = SGDRegressor(loss='epsilon_insensitive', epsilon=0)

In [10]:
cv_mae = cross_validate(lin_reg_sgd, X_scaled, y, cv=10,
                        scoring=["r2", "max_error"])
cv_mae

{'fit_time': array([0.00639343, 0.00507855, 0.00521922, 0.00463796, 0.00496721,
        0.0044353 , 0.00465894, 0.00532484, 0.00463343, 0.00480008]),
 'score_time': array([0.00055766, 0.00097585, 0.00045204, 0.00048661, 0.0007019 ,
        0.00043297, 0.00049686, 0.00059509, 0.00040865, 0.00035334]),
 'test_r2': array([0.67813416, 0.82049636, 0.83704044, 0.79550061, 0.88935138,
        0.83072671, 0.89019441, 0.86124105, 0.83861547, 0.91679893]),
 'test_max_error': array([-13.31751314, -11.65340278, -11.87586669, -12.35105989,
        -12.60998421, -12.41999414, -12.26374054, -13.14224208,
        -12.66355283, -12.01623514])}

👇 Compute 
- the mean cross validated R2 score `r2_mae`
- the single biggest prediction error in °C across all your folds `max_error_mae`

In [11]:
r2_mae = cv_mae["test_r2"].mean()
r2_mae

0.8358099504680048

In [12]:
# Scores are negated errors, so the biggest error is the most negative score
max_error_mae = cv_mae["test_max_error"].min()
max_error_mae

-13.31751314

## 3. Conclusion

❓Which of the models you evaluated seems the most appropriate for your task?

<details>
<summary> 🆘Answer </summary>
    
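The model optimized on MSE is the most appropriate. Its mean cross-validated R2 score is somewhat higher (about 0.90, against 0.84 for MAE) and, more importantly, its worst errors are smaller: the model optimized on MAE is off by up to about 13.3 °C in its worst fold, against about 9.5 °C for the MSE model. Larger occasional mistakes increase the risk of killing plants!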

    
</details>

# 🏁 Check your code

In [13]:
from nbresult import ChallengeResult

result = ChallengeResult('loss_functions',
    r2 = r2,
    r2_mae = r2_mae,
    max_error = max_error,
    max_error_mae = max_error_mae,
)
result.write()
print(result.check())

platform linux -- Python 3.8.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/mz/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/mz/code/MaCoZu/data-challenges/05-ML/04-Under-the-hood/01-Loss-Functions
plugins: dash-2.0.0, anyio-3.3.2
collecting ... collected 3 items

tests/test_loss_functions.py::TestLossFunctions::test_max_error_order PASSED [ 33%]
tests/test_loss_functions.py::TestLossFunctions::test_r2 PASSED [ 66%]
tests/test_loss_functions.py::TestLossFunctions::test_r2_mae PASSED [100%]



💯 You can commit your code:

git add tests/loss_functions.pickle

git commit -m 'Completed loss_functions step'

git push origin master
