# Loss Functions

In [40]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDRegressor


In this exercise, you will compare the effects of different loss functions on a linear regression model trained with `SGDRegressor`.

👇 Let's download a CSV file to use for this challenge and parse it into a DataFrame

In [41]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/04-Under-the-Hood/loss_functions_dataset.csv")
data.sample(5)


| Unnamed: 0 | Relative Compactness | Surface Area | Wall Area | Roof Area | Overall Height | Glazing Area | Average Temperature |
|---|---|---|---|---|---|---|---|
| 482 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.25 | 29.075 |
| 313 | 0.74 | 686.0 | 245.0 | 220.5 | 3.5 | 0.25 | 14.0 |
| 628 | 0.9 | 563.5 | 318.5 | 122.5 | 7.0 | 0.4 | 35.75 |
| 385 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.25 | 29.17 |
| 597 | 0.76 | 661.5 | 416.5 | 122.5 | 7.0 | 0.4 | 40.275 |


🎯 Your task is to predict the average temperature inside a greenhouse based on its design. Your temperature predictions will help you select the appropriate greenhouse design for each plant, based on its climate needs.

🌿 You know that plants can handle small temperature variations, but are exponentially more sensitive as the temperature variations increase. 

## 1. Theory 

❓ Theoretically, which Loss function would you train your model on to limit the risk of killing plants?

<details>
<summary> 🆘 Answer </summary>
    
In theory, you would use a Mean Squared Error (MSE) Loss function. Because it squares each error, it penalizes large prediction errors much more heavily than small ones, which discourages your model from committing large errors. This would ensure smaller temperature variations and a lower risk for plants.

</details>

Use a Mean Squared Error (MSE) loss function to penalize large prediction errors and keep temperatures stable, reducing the risk to plants.
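
To make the difference between the two losses concrete, here is a minimal sketch comparing how a squared loss and an absolute loss penalize individual errors. The error values are hypothetical and chosen only for illustration; they do not come from the dataset.

```python
import numpy as np

# Hypothetical prediction errors in °C (illustrative values, not from the dataset)
errors = np.array([0.5, 1.0, 2.0, 4.0])

squared_penalty = errors ** 2       # per-sample contribution to the MSE
absolute_penalty = np.abs(errors)   # per-sample contribution to the MAE

for err, sq, ab in zip(errors, squared_penalty, absolute_penalty):
    print(f"error={err}  squared penalty={sq}  absolute penalty={ab}")
```

Doubling an error doubles its absolute penalty but quadruples its squared penalty, which is why a model trained on MSE is pushed harder to avoid large mistakes.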

## 2. Application

### 2.1 Preprocessing

❓ Standardise the features

In [42]:
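# Z-score every column; note this also rescales the target, so errors computed below are in standard deviations, not °C
data_standardized = (data - data.mean()) / data.std()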
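
As a small illustration of what z-score standardization does to a column, the sketch below standardizes the five `Average Temperature` values from the sample rows shown above and converts an error measured in standardized units back to °C. The value of `error_z` is hypothetical, and the standard deviation of five sampled rows is not the standard deviation of the full dataset.

```python
import numpy as np

# Five target values copied from the sample rows shown above (°C)
temps = np.array([29.075, 14.0, 35.75, 29.17, 40.275])

mean = temps.mean()
std = temps.std(ddof=1)          # ddof=1 matches the default of pandas' .std()
temps_z = (temps - mean) / std   # standardized values: mean 0, std 1

# An error measured on a standardized target is expressed in standard deviations;
# multiplying by the target's std converts it back to °C.
error_z = 0.5                    # hypothetical error in standardized units
error_celsius = error_z * std

print(temps_z)
print(error_celsius)
```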


### 2.2 Modeling

In this section, you are going to verify the theory by evaluating models optimized on different Loss functions.

### Least Squares (MSE) Loss

❓ **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **Least Squares Loss** (MSE)



In [43]:
X = data_standardized.drop(columns=['Average Temperature'])
y = data_standardized['Average Temperature']

sgd_regressor = SGDRegressor(loss='squared_error', max_iter=1000, random_state=42)

cross_val_scores = cross_val_score(sgd_regressor, X, y, cv=10, scoring='neg_mean_squared_error')

print("cross-validation scores (negative mean squared error)")
print(cross_val_scores)


cross-validation scores (negative mean squared error)
[-0.2157124  -0.08173209 -0.11702542 -0.12651594 -0.0714719  -0.11407788
 -0.07618872 -0.10621348 -0.11233026 -0.07037723]
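
Because the scores above are negated, they are easier to read once flipped back to positive MSE values. A minimal sketch, using the ten fold scores printed above (the variable names are only illustrative, and the units are those of the standardized target):

```python
import numpy as np

# Fold scores copied from the output above (negative mean squared error)
neg_mse_scores = np.array([
    -0.2157124, -0.08173209, -0.11702542, -0.12651594, -0.0714719,
    -0.11407788, -0.07618872, -0.10621348, -0.11233026, -0.07037723,
])

mse_per_fold = -neg_mse_scores          # flip the sign to get the MSE of each fold
rmse_per_fold = np.sqrt(mse_per_fold)   # RMSE is in the same units as the target
mean_mse = mse_per_fold.mean()

print(mean_mse)
print(rmse_per_fold)
```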


❓ Compute 
- the mean cross-validated R2 score and save it in the variable `r2`
- the single biggest prediction error in °C across all your folds and save it in the variable `max_error_celsius`

(Tip: `max_error` is an accepted scoring metric in sklearn)

In [44]:
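r2 = cross_val_score(sgd_regressor, X, y, cv=10, scoring='r2').mean()

# The 'max_error' scorer returns negated errors, so -scores.max() is the smallest
# per-fold maximum error; -scores.min() would give the largest one.
max_error_celsius = -cross_val_score(sgd_regressor, X, y, cv=10, scoring='max_error').max()

print("mean cross-validated R2 score (mse)", r2)
print("single biggest prediction error (mse)", max_error_celsius)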


mean cross-validated R2 score (mse) 0.8826779284869563
single biggest prediction error (mse) 0.922562411115637
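
One detail is worth double-checking when aggregating these scores. scikit-learn scorers follow a "greater is better" convention, so the `max_error` scorer is expected to return the negated maximum error of each fold. Under that assumption, the fold with the worst error is the one with the minimum score. The sketch below uses made-up fold scores to show the difference between the two aggregations:

```python
import numpy as np

# Hypothetical per-fold scores, shaped like the output of a negated max-error scorer
fold_scores = np.array([-2.1, -0.9, -1.4])

smallest_fold_max_error = -fold_scores.max()   # error of the best fold
largest_fold_max_error = -fold_scores.min()    # single biggest error across folds

print(smallest_fold_max_error, largest_fold_max_error)
```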


### Mean Absolute Error (MAE) Loss

What if we optimize our model on the MAE instead?

❓ **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **MAE** Loss

<details>
<summary>💡 Hints</summary>

- MAE loss cannot be directly specified in `SGDRegressor`. It must be engineered by adjusting the right parameters

</details>

In [53]:
sgd_regressor_mae = SGDRegressor(loss='epsilon_insensitive', epsilon=0, max_iter=1000, random_state=42)

cross_val_scores_mae = cross_val_score(sgd_regressor_mae, X, y, cv=10, scoring='neg_mean_absolute_error')

print("\ncross-validation scores (negative mean absolute error)")
print(cross_val_scores_mae)



cross-validation scores (negative mean absolute error)
[-0.35204504 -0.20354279 -0.23738495 -0.2392159  -0.1630961  -0.21765232
 -0.17854324 -0.21150505 -0.21882908 -0.18283278]
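
The `epsilon_insensitive` loss with `epsilon=0` used above can be read as an absolute error: the loss ignores residuals smaller than `epsilon` and grows linearly beyond it, so setting `epsilon` to zero leaves a plain absolute error. A minimal sketch, assuming the loss is defined as `max(0, |y - y_pred| - epsilon)`; the helper `epsilon_insensitive` and the sample values are hypothetical.

```python
import numpy as np

def epsilon_insensitive(y_true, y_pred, epsilon):
    """Hypothetical helper: mean of max(0, |y - y_pred| - epsilon)."""
    residuals = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residuals - epsilon).mean()

# Illustrative temperatures, not taken from the dataset
y_true = np.array([20.0, 25.0, 30.0])
y_pred = np.array([21.0, 24.5, 33.0])

mae = np.abs(y_true - y_pred).mean()
loss_eps0 = epsilon_insensitive(y_true, y_pred, epsilon=0.0)   # identical to the MAE
loss_eps1 = epsilon_insensitive(y_true, y_pred, epsilon=1.0)   # residuals up to 1.0 cost nothing

print(mae, loss_eps0, loss_eps1)
```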


❓ Compute 
- the mean cross-validated R2 score, store it in `r2_mae`
- the single biggest prediction error across all your folds, store it in `max_error_mae`

In [52]:
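r2_mae = cross_val_score(sgd_regressor_mae, X, y, cv=10, scoring='r2').mean()
# The 'max_error' scorer returns negated errors, so -scores.max() is the smallest
# per-fold maximum error; -scores.min() would give the largest one.
max_error_mae = -cross_val_score(sgd_regressor_mae, X, y, cv=10, scoring='max_error').max()

print("mean cross-validated r2 score (mae)", r2_mae)
print("single biggest prediction error (mae)", max_error_mae)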


mean cross-validated r2 score (mae) 0.893949356942436
single biggest prediction error (mae) 0.8980472949087704


## 3. Conclusion

❓ Which of the models you evaluated seems the most appropriate for your task?

<details>
<summary> 🆘 Answer </summary>
    
Although the mean cross-validated R2 scores of the two models are similar, the one optimized on MAE is more likely to make an occasional large error, which increases the risk of killing plants!

    
</details>

The model trained on the Mean Squared Error (MSE) loss.

# 🏁 Check your code and push your notebook

In [50]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'loss_functions',
    r2 = r2,
    r2_mae = r2_mae,
    max_error = max_error_celsius,
    max_error_mae = max_error_mae
)

result.write()
print(result.check())



platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/reecepalmer/.pyenv/versions/3.10.6/envs/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/reecepalmer/Code/RPalmr/05-ML/04-Under-the-hood/data-loss-functions/tests
plugins: asyncio-0.19.0, dash-2.14.0, typeguard-2.13.3, anyio-3.6.2, hydra-core-1.3.2
asyncio: mode=strict
collecting ... collected 3 items

test_loss_functions.py::TestLossFunctions::test_max_error_order FAILED   [ 33%]
test_loss_functions.py::TestLossFunctions::test_r2 PASSED                [ 66%]
test_loss_functions.py::TestLossFunctions::test_r2_mae PASSED            [100%]

____________________ TestLossFunctions.test_max_error_order ____________________

self = <tests.test_loss_functions.TestLossFunctions testMethod=test_max_error_order>

    def test_max_error_order(self):
>       self.assertLess(