In [15]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import max_error

# Loss Functions

In this exercise, you will compare the effects of Loss functions on a `LinearRegression` model.

üëá Let's download a CSV file to use for this challenge and parse it into a DataFrame

In [16]:
data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/04-Under-the-Hood/loss_functions_dataset.csv")
data.sample(5)

Unnamed: 0,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area,Average Temperature
515,0.69,735.0,294.0,220.5,3.5,0.25,13.535
100,0.9,563.5,318.5,122.5,7.0,0.1,30.71
282,0.64,784.0,343.0,220.5,3.5,0.1,17.19
295,0.9,563.5,318.5,122.5,7.0,0.25,34.79
752,0.69,735.0,294.0,220.5,3.5,0.4,15.375


üéØ Your task is to predict the average temperature inside a greenhouse based on its design. Your temperature predictions will help you select the appropriate greenhouse design for each plant, based on their climate needs. 

üåø You know that plants can handle small temperature variations, but are exponentially more sensitive as the temperature variations increase. 

## 1. Theory 

‚ùì Theoretically, which Loss function would you train your model on to limit the risk of killing plants?

<details>
<summary> üÜò Answer </summary>
    
By theory, you would use a Mean Square Error (MSE) Loss function. It would penalize outlier predictions and prevent your model from committing large errors. This would ensure smaller temperature variations and a lower risk for plants.

</details>

Mean Square Error (MSE) Penaliza los errores grandes de manera cuadr√°tica, lo que significa que los errores mayores reciben un castigo significativamente m√°s alto que los errores peque√±os, es decir, el modelo estar√° m√°s incentivado a reducir los errores grandes, evitando predicciones que pudieran poner en riesgo las plantas sensibles a variaciones.

## 2. Application

### 2.1 Preprocessing

‚ùì Standardise the features

In [17]:
X = data.drop(columns=['Average Temperature'])
y = data['Average Temperature']

In [18]:
rb = RobustScaler()
X_scaled = rb.fit_transform(X)
data_scaled = pd.DataFrame(X_scaled, columns=X.columns)
data_scaled.head()

Unnamed: 0,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area
0,1.559322,-1.181818,-0.5,-0.923077,0.5,-0.833333
1,1.559322,-1.181818,-0.5,-0.923077,0.5,-0.833333
2,1.559322,-1.181818,-0.5,-0.923077,0.5,-0.833333
3,1.559322,-1.181818,-0.5,-0.923077,0.5,-0.833333
4,1.016949,-0.818182,0.0,-0.769231,0.5,-0.833333


### 2.2 Modeling

In this section, you are going to verify the theory by evaluating models optimized on different Loss functions.

### Least Squares (MSE) Loss

‚ùì **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **Least Squares Loss** (MSE)



In [19]:
model = SGDRegressor(loss='squared_error')
scaled_score = cross_val_score(model, X_scaled, y, cv=10, scoring='r2').mean()
scaled_score

0.8973722044957902

‚ùì Compute 
- the mean cross-validated R2 score and save it in the variable `r2`
- the single biggest prediction error in ¬∞C of all your folds and save it in the variable `max_error_celsius`?

(Tips: `max_error` is an accepted scoring metric in sklearn)

In [20]:
r2 = scaled_score

In [35]:
max_error_celsius = cross_val_score(model, X_scaled, y, cv=10, scoring='max_error').mean()
max_error_celsius = np.abs(max_error_celsius)
max_error_celsius

9.012752119915518

### Mean Absolute Error (MAE) Loss

What if we optimize our model on the MAE instead?

‚ùì **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **MAE** Loss

<details>
<summary>üí° Hints</summary>

- MAE loss cannot be directly specified in `SGDRegressor`. It must be engineered by adjusting the right parameters

</details>

In [47]:
model_mae = SGDRegressor(loss='epsilon_insensitive', epsilon=0)
mae_score = cross_val_score(model_mae, X_scaled, y, cv=10, scoring='r2').mean()
mae_score

0.8357117160454356

‚ùì Compute 
- the mean cross-validated R2 score, store it in `r2_mae`
- the single biggest prediction error of all your folds, store it in `max_error_mae`?

In [30]:
r2_mae = mae_score

In [44]:
max_error_mae = cross_val_score(model, X_scaled, y, cv=10, scoring='max_error').mean()
max_error_mae = np.abs(max_error_mae)
max_error_mae

9.017779277365625

## 3. Conclusion

‚ùìWhich of the models you evaluated seems the most appropriate for your task?

<details>
<summary> üÜòAnswer </summary>
    
Although mean cross-validated r2 scores are approximately similar between the two models, the one optimized on a MAE has more chance to make larger mistakes from time to time, increasing the risk of killing plants!

    
</details>

MAE

# üèÅ Check your code and push your notebook

In [36]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'loss_functions',
    r2 = r2,
    r2_mae = r2_mae,
    max_error = max_error_celsius,
    max_error_mae = max_error_mae
)

result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /root/.pyenv/versions/3.10.6/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /root/code/MonicaVenzor/05-ML/04-Under-the-hood/data-loss-functions/tests
plugins: anyio-3.6.2, asyncio-0.19.0
asyncio: mode=strict
[1mcollecting ... [0mcollected 3 items

test_loss_functions.py::TestLossFunctions::test_max_error_order [32mPASSED[0m[32m   [ 33%][0m
test_loss_functions.py::TestLossFunctions::test_r2 [32mPASSED[0m[32m                [ 66%][0m
test_loss_functions.py::TestLossFunctions::test_r2_mae [32mPASSED[0m[32m            [100%][0m



üíØ You can commit your code:

[1;32mgit[39m add tests/loss_functions.pickle

[32mgit[39m commit -m [33m'Completed loss_functions step'[39m

[32mgit[39m push origin master

