# Loss Functions

In this exercise, you will compare the effects of Loss functions on a `LinearRegression` model.

👇 Let's download a CSV file to use for this challenge and parse it into a DataFrame

In [1]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/04-Under-the-Hood/loss_functions_dataset.csv")
data.sample(5)

Unnamed: 0,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area,Average Temperature
746,0.74,686.0,245.0,220.5,3.5,0.4,15.555
538,0.86,588.0,294.0,147.0,7.0,0.4,31.895
615,0.66,759.5,318.5,220.5,3.5,0.4,16.725
742,0.76,661.5,416.5,122.5,7.0,0.4,38.55
617,0.64,784.0,343.0,220.5,3.5,0.4,20.475


🎯 Your task is to predict the average temperature inside a greenhouse based on its design. Your temperature predictions will help you select the appropriate greenhouse design for each plant, based on their climate needs. 

🌿 You know that plants can handle small temperature variations, but are exponentially more sensitive as the temperature variations increase. 

## 1. Theory 

❓ Theoretically, which Loss function would you train your model on to limit the risk of killing plants?

<details>
<summary> 🆘 Answer </summary>
    
By theory, you would use a Mean Square Error (MSE) Loss function. It would penalize outlier predictions and prevent your model from committing large errors. This would ensure smaller temperature variations and a lower risk for plants.

</details>

> YOUR ANSWER HERE
<!-- One suitable loss function in this case would be the Mean Squared Error (MSE). The MSE measures the average squared difference between the predicted values and the actual values. By squaring the differences, the MSE amplifies larger errors, giving them more weight in the training process. -->

## 2. Application

### 2.1 Preprocessing

❓ Standardise the features

In [2]:
# YOUR CODE HERE
from sklearn.preprocessing import StandardScaler

features = ['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area', 'Overall Height', 'Glazing Area']

scaler = StandardScaler()

data[features] = scaler.fit_transform(data[features])



   Relative Compactness  Surface Area  Wall Area  Roof Area  Overall Height  \
0              2.041777     -1.785875  -0.561951  -1.470077             1.0   
1              2.041777     -1.785875  -0.561951  -1.470077             1.0   
2              2.041777     -1.785875  -0.561951  -1.470077             1.0   
3              2.041777     -1.785875  -0.561951  -1.470077             1.0   
4              1.284979     -1.229239   0.000000  -1.198678             1.0   

   Glazing Area  Average Temperature  
0     -1.760447                18.44  
1     -1.760447                18.44  
2     -1.760447                18.44  
3     -1.760447                18.44  
4     -1.760447                24.56  


### 2.2 Modeling

In this section, you are going to verify the theory by evaluating models optimized on different Loss functions.

### Least Squares (MSE) Loss

❓ **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **Least Squares Loss** (MSE)



In [10]:
# YOUR CODE HERE
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import numpy as np

X = data[['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area', 'Overall Height', 'Glazing Area']]
y = data['Average Temperature']

model = SGDRegressor(loss='squared_loss', random_state=42)

cv_scores = cross_val_score(model, X, y, cv=10, scoring='neg_mean_squared_error')

cv_scores = -cv_scores

mean_mse = np.mean(cv_scores)





❓ Compute 
- the mean cross-validated R2 score and save it in the variable `r2`
- the single biggest prediction error in °C of all your folds and save it in the variable `max_error_celsius`?

(Tips: `max_error` is an accepted scoring metric in sklearn)

In [6]:
# YOUR CODE HERE

cv_scores_r2 = cross_val_score(model, X, y, cv=10, scoring='r2')

r2 = cv_scores_r2.mean()

cv_scores_error = cross_val_score(model, X, y, cv=10, scoring='max_error')

max_error_celsius = -cv_scores_error.max() 

print("Mean Cross-Validated R2 Score:", r2)
print("Single Biggest Prediction Error in °C:", max_error_celsius)

Mean Cross-Validated R2 Score: 0.8982272875896993
Single Biggest Prediction Error in °C: 7.6641500385444346




### Mean Absolute Error (MAE) Loss

What if we optimize our model on the MAE instead?

❓ **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **MAE** Loss

<details>
<summary>💡 Hints</summary>

- MAE loss cannot be directly specified in `SGDRegressor`. It must be engineered by adjusting the right parameters

</details>

In [12]:
# YOUR CODE HERE
from sklearn.metrics import make_scorer, mean_absolute_error

model = SGDRegressor(loss='epsilon_insensitive', epsilon=0.0, random_state=42)
scoring = make_scorer(mean_absolute_error, greater_is_better=False)
cv_scores_mae = -cross_val_score(model, X, y, cv=10, scoring=scoring)
mean_mae = cv_scores_mae.mean()
mean_mae

2.287427307238857

❓ Compute 
- the mean cross-validated R2 score, store it in `r2_mae`
- the single biggest prediction error of all your folds, store it in `max_error_mae`?

In [14]:
# YOUR CODE HERE
from sklearn.metrics import r2_score, max_error
model = SGDRegressor(loss='epsilon_insensitive', epsilon=0.0, random_state=42)

cv_scores_r2_mae = cross_val_score(model, X, y, cv=10, scoring='r2')
r2_mae = cv_scores_r2_mae.mean()
cv_scores_error_mae = cross_val_score(model, X, y, cv=10, scoring='max_error')
max_error_mae = -cv_scores_error_mae.max()  
print("Mean Cross-Validated R2 Score (MAE):", r2_mae)
print("Single Biggest Prediction Error (MAE):", max_error_mae)

Mean Cross-Validated R2 Score (MAE): 0.8763275370982454
Single Biggest Prediction Error (MAE): 10.15643538745553


## 3. Conclusion

❓Which of the models you evaluated seems the most appropriate for your task?

<details>
<summary> 🆘Answer </summary>
    
Although mean cross-validated r2 scores are approximately similar between the two models, the one optimized on a MAE has more chance to make larger mistakes from time to time, increasing the risk of killing plants!

    
</details>

> YOUR ANSWER HERE
<!-- Although mean cross-validated r2 scores are approximately similar between the two models, the one optimized on a MAE has more chance to make larger mistakes from time to time, increasing the risk of killing plants! -->


# 🏁 Check your code and push your notebook

In [9]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'loss_functions',
    r2 = r2,
    r2_mae = r2_mae,
    max_error = max_error_celsius,
    max_error_mae = max_error_mae
)

result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/saikotdasjoy/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/saikotdasjoy/code/Saikot1997/data-loss-functions/tests
plugins: asyncio-0.19.0, anyio-3.6.2
asyncio: mode=strict
[1mcollecting ... [0mcollected 3 items

test_loss_functions.py::TestLossFunctions::test_max_error_order [32mPASSED[0m[32m   [ 33%][0m
test_loss_functions.py::TestLossFunctions::test_r2 [32mPASSED[0m[32m                [ 66%][0m
test_loss_functions.py::TestLossFunctions::test_r2_mae [32mPASSED[0m[32m            [100%][0m



💯 You can commit your code:

[1;32mgit[39m add tests/loss_functions.pickle

[32mgit[39m commit -m [33m'Completed loss_functions step'[39m

[32mgit[39m push origin master

