# Loss Functions

In this exercise, you will compare the effects of Loss functions on a `LinearRegression` model.

👇 Let's download a CSV file to use for this challenge and parse it into a DataFrame

In [1]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/04-Under-the-Hood/loss_functions_dataset.csv")
data.sample(5)

Unnamed: 0,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area,Average Temperature
545,0.79,637.0,343.0,147.0,7.0,0.4,40.845
482,0.98,514.5,294.0,110.25,7.0,0.25,29.075
454,0.76,661.5,416.5,122.5,7.0,0.25,36.93
312,0.74,686.0,245.0,220.5,3.5,0.25,13.81
571,0.64,784.0,343.0,220.5,3.5,0.4,20.975


🎯 Your task is to predict the average temperature inside a greenhouse based on its design. Your temperature predictions will help you select the appropriate greenhouse design for each plant, based on their climate needs. 

🌿 You know that plants can handle small temperature variations, but are exponentially more sensitive as the temperature variations increase. 

## 1. Theory 

❓ Theoretically, which Loss function would you train your model on to limit the risk of killing plants?

<details>
<summary> 🆘 Answer </summary>
    
By theory, you would use a Mean Square Error (MSE) Loss function. It would penalize outlier predictions and prevent your model from committing large errors. This would ensure smaller temperature variations and a lower risk for plants.

</details>

> YOUR ANSWER HERE

## 2. Application

### 2.1 Preprocessing

❓ Standardise the features

In [2]:
# YOUR CODE HERE
from sklearn.preprocessing import StandardScaler

X = data.drop(columns= ['Average Temperature'])

scaler = StandardScaler().fit(X)

X_scaled = scaler.transform(X)

### 2.2 Modeling

In this section, you are going to verify the theory by evaluating models optimized on different Loss functions.

### Least Squares (MSE) Loss

❓ **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **Least Squares Loss** (MSE)



In [3]:
# YOUR CODE HERE
from sklearn.model_selection import cross_validate
from sklearn.linear_model import SGDRegressor

sgd_model = SGDRegressor(loss="squared_error")

modelcv = cross_validate(sgd_model, X_scaled, data['Average Temperature'], cv =10, scoring=['r2','max_error'])

modelcv

{'fit_time': array([0.0144794 , 0.00508118, 0.00364566, 0.00393701, 0.00404739,
        0.00386071, 0.00361562, 0.00387931, 0.00455642, 0.00369143]),
 'score_time': array([0.00487542, 0.00123286, 0.00101542, 0.0009737 , 0.00104475,
        0.00081158, 0.0008862 , 0.0014658 , 0.00085092, 0.00077128]),
 'test_r2': array([0.78676839, 0.90917169, 0.89565695, 0.88375602, 0.93133697,
        0.89671258, 0.9270804 , 0.91624173, 0.89538475, 0.93955611]),
 'test_max_error': array([-9.7778641 , -8.65877552, -8.8652513 , -9.15034147, -8.85218401,
        -8.58936341, -8.55820007, -8.80733745, -8.34607764, -7.64764026])}

❓ Compute 
- the mean cross-validated R2 score and save it in the variable `r2`
- the single biggest prediction error in °C of all your folds and save it in the variable `max_error_celsius`?

(Tips: `max_error` is an accepted scoring metric in sklearn)

In [4]:
# YOUR CODE HERE
r2 = modelcv['test_r2'].mean()
max_error_celsius = abs(modelcv['test_max_error']).max()


### Mean Absolute Error (MAE) Loss

What if we optimize our model on the MAE instead?

❓ **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **MAE** Loss

<details>
<summary>💡 Hints</summary>

- MAE loss cannot be directly specified in `SGDRegressor`. It must be engineered by adjusting the right parameters

</details>

In [5]:
# YOUR CODE HERE
sgd_model = SGDRegressor(loss="squared_error", epsilon= 0)

modelmae = cross_validate(sgd_model, X_scaled, data['Average Temperature'], cv =10, scoring=['r2','max_error'])

modelmae

{'fit_time': array([0.01517844, 0.00869846, 0.00597787, 0.00466084, 0.00955057,
        0.00569558, 0.00645018, 0.00933552, 0.00441313, 0.00884366]),
 'score_time': array([0.00269175, 0.00341415, 0.00110865, 0.00119472, 0.00198984,
        0.00123239, 0.00237465, 0.00239539, 0.00221586, 0.00207496]),
 'test_r2': array([0.78658682, 0.90900385, 0.89625607, 0.8832992 , 0.93133685,
        0.89683386, 0.92715379, 0.91655068, 0.89519896, 0.9390222 ]),
 'test_max_error': array([-9.85991134, -8.63975133, -8.72590242, -9.18396799, -8.86548691,
        -8.48251021, -8.51177967, -8.85118257, -8.47963833, -7.67478106])}

❓ Compute 
- the mean cross-validated R2 score, store it in `r2_mae`
- the single biggest prediction error of all your folds, store it in `max_error_mae`?

In [6]:
# YOUR CODE HERE
r2_mae = modelmae['test_r2'].mean()
max_error_mae = abs(modelmae['test_max_error']).max()


## 3. Conclusion

❓Which of the models you evaluated seems the most appropriate for your task?

<details>
<summary> 🆘Answer </summary>
    
Although mean cross-validated r2 scores are approximately similar between the two models, the one optimized on a MAE has more chance to make larger mistakes from time to time, increasing the risk of killing plants!

    
</details>

> YOUR ANSWER HERE

# 🏁 Check your code and push your notebook

In [7]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'loss_functions',
    r2 = r2,
    r2_mae = r2_mae,
    max_error = max_error_celsius,
    max_error_mae = max_error_mae
)

result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/yousif/.pyenv/versions/3.10.6/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /home/yousif/code/ai-yousif/05-ML/04-Under-the-hood/data-loss-functions/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
[1mcollecting ... [0mcollected 3 items

test_loss_functions.py::TestLossFunctions::test_max_error_order [32mPASSED[0m[32m   [ 33%][0m
test_loss_functions.py::TestLossFunctions::test_r2 [32mPASSED[0m[32m                [ 66%][0m
test_loss_functions.py::TestLossFunctions::test_r2_mae [32mPASSED[0m[32m            [100%][0m



💯 You can commit your code:

[1;32mgit[39m add tests/loss_functions.pickle

[32mgit[39m commit -m [33m'Completed loss_functions step'[39m

[32mgit[39m push origin master

