# Loss Functions

In this exercise, you will compare the effects of different loss functions on a linear regression model trained with `SGDRegressor`.

👇 Let's download the CSV file for this challenge and parse it into a DataFrame.

## Imports

In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_validate
import pandas as pd

In [2]:
data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/04-Under-the-Hood/loss_functions_dataset.csv")
data.sample(5)

| | Relative Compactness | Surface Area | Wall Area | Roof Area | Overall Height | Glazing Area | Average Temperature |
|---|---|---|---|---|---|---|---|
| 435 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.25 | 29.37 |
| 179 | 0.69 | 735.0 | 294.0 | 220.5 | 3.5 | 0.1 | 12.77 |
| 29 | 0.71 | 710.5 | 269.5 | 220.5 | 3.5 | 0.0 | 9.06 |
| 523 | 0.64 | 784.0 | 343.0 | 220.5 | 3.5 | 0.25 | 18.605 |
| 748 | 0.71 | 710.5 | 269.5 | 220.5 | 3.5 | 0.4 | 14.01 |


🎯 Your task is to predict the average temperature inside a greenhouse based on its design. Your temperature predictions will help you select the appropriate greenhouse design for each plant, based on its climate needs.

🌿 You know that plants can handle small temperature variations, but become exponentially more sensitive as the variations increase. A large prediction error is therefore far more costly than several small ones.

## Application

### Preprocessing

❓ Standardise the features

In [3]:
# Separate the features from the target.
X = data.drop(columns="Average Temperature")
y = data["Average Temperature"]

# Fit the scaler, then rebuild a DataFrame that keeps the original column names.
scaler = StandardScaler().fit(X)
X_scaled = pd.DataFrame(scaler.transform(X), columns=X.columns)
X_scaled

| | Relative Compactness | Surface Area | Wall Area | Roof Area | Overall Height | Glazing Area |
|---|---|---|---|---|---|---|
| 0 | 2.041777 | -1.785875 | -0.561951 | -1.470077 | 1.0 | -1.760447 |
| 1 | 2.041777 | -1.785875 | -0.561951 | -1.470077 | 1.0 | -1.760447 |
| 2 | 2.041777 | -1.785875 | -0.561951 | -1.470077 | 1.0 | -1.760447 |
| 3 | 2.041777 | -1.785875 | -0.561951 | -1.470077 | 1.0 | -1.760447 |
| 4 | 1.284979 | -1.229239 | 0.000000 | -1.198678 | 1.0 | -1.760447 |
| ... | ... | ... | ... | ... | ... | ... |
| 763 | -1.174613 | 1.275625 | 0.561951 | 0.972512 | -1.0 | 1.244049 |
| 764 | -1.363812 | 1.553943 | 1.123903 | 0.972512 | -1.0 | 1.244049 |
| 765 | -1.363812 | 1.553943 | 1.123903 | 0.972512 | -1.0 | 1.244049 |
| 766 | -1.363812 | 1.553943 | 1.123903 | 0.972512 | -1.0 | 1.244049 |
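
To see what standardisation does to a single column, here is a minimal sketch in plain Python. It assumes a made-up column of four values (hypothetical, chosen to resemble `Overall Height`) and applies the z-score formula that `StandardScaler` uses: subtract the column mean, then divide by the population standard deviation.

```python
from statistics import fmean, pstdev

# Toy column (hypothetical values, not read from the dataset).
overall_height = [7.0, 7.0, 3.5, 3.5]

mean = fmean(overall_height)   # column mean
std = pstdev(overall_height)   # population standard deviation (ddof=0)

# z-score: subtract the mean, then divide by the standard deviation.
standardised = [(value - mean) / std for value in overall_height]
print(standardised)  # [1.0, 1.0, -1.0, -1.0]
```

After scaling, the column has zero mean and unit variance, so features measured in different units contribute on a comparable scale during gradient descent.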


### Modeling

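In this section, you will put the theory to the test: the loss function a model is optimized on determines which errors it works hardest to avoid. To check this, you will evaluate models optimized on different loss functions.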
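
As a rough illustration of why the choice of loss matters, the sketch below compares how an absolute penalty (MAE-style) and a squared penalty (MSE-style) grow with the size of a single error. The error values are hypothetical and not taken from the dataset.

```python
# Hypothetical prediction errors in °C (illustrative values only).
errors = [1.0, 2.0, 4.0, 8.0]

absolute_penalties = [abs(error) for error in errors]  # grows linearly
squared_penalties = [error ** 2 for error in errors]   # grows quadratically

for error, absolute, squared in zip(errors, absolute_penalties, squared_penalties):
    print(f"error={error}  absolute={absolute}  squared={squared}")
```

Doubling the error doubles the absolute penalty but quadruples the squared one, so a model optimized on the squared loss is pushed harder to avoid large misses. The growth is quadratic rather than truly exponential, but it points in the direction the plants need.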

### Least Squares (MSE) Loss

❓ **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **Least Squares Loss** (MSE)
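
To make "optimized by SGD on a least squares loss" more concrete, here is a minimal sketch of per-sample gradient updates on a one-feature linear model. The data, the constant learning rate, and the number of epochs are all assumptions made for illustration; `SGDRegressor` additionally applies regularisation, shuffling, and a learning-rate schedule by default, none of which are reproduced here.

```python
# Toy 1-D data generated from y = 2x + 1 (hypothetical, noise-free).
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [2.0 * x + 1.0 for x in xs]

weight, bias = 0.0, 0.0
learning_rate = 0.1  # assumed constant step size, for illustration only

for epoch in range(200):
    for x, y in zip(xs, ys):
        error = (weight * x + bias) - y  # prediction minus target
        # Gradient of 0.5 * error**2 with respect to weight and bias.
        weight -= learning_rate * error * x
        bias -= learning_rate * error

print(round(weight, 3), round(bias, 3))
```

Each update moves the parameters a small step against the gradient of the loss on one sample, and on this noise-free toy data the estimates settle close to the generating slope and intercept.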



In [4]:
sqd_model = SGDRegressor(loss="squared_error")
sqd_model_cv = cross_validate(
    sqd_model,
    X_scaled,
    y,
    cv=10,
    scoring=["max_error", "r2"]
)
sqd_model_cv

{'fit_time': array([0.03300595, 0.01200008, 0.01199961, 0.01400018, 0.02000022,
        0.01598454, 0.01400661, 0.01200032, 0.01300001, 0.01400042]),
 'score_time': array([0.00399351, 0.00400043, 0.00300002, 0.00399971, 0.00701451,
        0.00300097, 0.00300002, 0.0030055 , 0.00300002, 0.00299382]),
 'test_max_error': array([-9.80461773, -8.72248387, -8.90540346, -9.20232736, -8.95424929,
        -8.62290341, -8.55489087, -8.90999772, -8.38879741, -7.70003265]),
 'test_r2': array([0.78451676, 0.90830866, 0.89469557, 0.88410197, 0.93116   ,
        0.89661863, 0.92730818, 0.91534992, 0.89555415, 0.93836386])}

❓ Compute:
- the mean cross-validated R2 score, and save it in the variable `r2`
- the single biggest prediction error in °C across all your folds, and save it in the variable `max_error_celsius`

(Tip: `max_error` is an accepted scoring metric in sklearn)
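
sklearn scorers follow a greater-is-better convention, so error metrics such as `max_error` are reported with their sign flipped. That is why `test_max_error` holds negative values. A minimal sketch of recovering the worst error from those scores, using the per-fold values printed above:

```python
# Negated per-fold scores, copied from the `test_max_error` output above.
test_max_error = [-9.80461773, -8.72248387, -8.90540346, -9.20232736, -8.95424929,
                  -8.62290341, -8.55489087, -8.90999772, -8.38879741, -7.70003265]

# Undo the sign flip to recover the worst error of each fold, in °C.
worst_error_per_fold = [-score for score in test_max_error]

# The single biggest error is the largest of the per-fold maxima.
biggest_error = max(worst_error_per_fold)
print(biggest_error)  # 9.80461773
```

Taking the mean of these scores instead would give the average worst case per fold, which is not what the question asks for.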

In [5]:
r2 = sqd_model_cv["test_r2"].mean()
r2

0.8975977680368816

In [6]:
max_error_celsius = abs(sqd_model_cv["test_max_error"]).max()
max_error_celsius

9.804617728678032

### Mean Absolute Error (MAE) Loss

What if we optimize our model on the MAE instead?

❓ **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **MAE** Loss

In [7]:
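# With epsilon=0, the epsilon-insensitive loss reduces to the absolute error (MAE).
# No prior .fit() is needed: cross_validate clones and refits the model on each fold.
reg_sgd = SGDRegressor(loss="epsilon_insensitive", epsilon=0)
reg_sgd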
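
`SGDRegressor` has no loss option named after the MAE, so the cell above relies on the epsilon-insensitive loss, which ignores errors smaller than `epsilon` and is linear beyond that. A minimal sketch of the per-sample loss, with a hypothetical helper name and made-up values:

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.0):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Hypothetical target and prediction in °C (illustrative values only).
y_true, y_pred = 20.0, 17.0

print(epsilon_insensitive_loss(y_true, y_pred, epsilon=0.0))  # 3.0, same as |y_true - y_pred|
print(epsilon_insensitive_loss(y_true, y_pred, epsilon=1.0))  # 2.0, the first 1 °C is ignored
```

With `epsilon=0` nothing is ignored, so averaging this loss over all samples gives the mean absolute error.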

❓ Compute:
- the mean cross-validated R2 score, and store it in `r2_mae`
- the single biggest prediction error across all your folds, and store it in `max_error_mae`

In [8]:
mae_sgd_cv = cross_validate(
    reg_sgd,
    X_scaled,
    y,
    cv=10,
    scoring=["max_error", "r2"]
)
mae_sgd_cv

{'fit_time': array([0.02003884, 0.01900387, 0.01999426, 0.0179944 , 0.01899767,
        0.01999354, 0.02099967, 0.02099276, 0.0179987 , 0.02199769]),
 'score_time': array([0.00699162, 0.00399375, 0.00300527, 0.00400567, 0.00400591,
        0.00400305, 0.00300407, 0.00399756, 0.00401998, 0.00300384]),
 'test_max_error': array([-11.22380437, -10.61513463, -10.70818102, -11.19309908,
        -11.17414633, -10.93539257, -10.77693217, -11.17337399,
        -10.92506744, -10.12402533]),
 'test_r2': array([0.7401912 , 0.87630251, 0.87314047, 0.84629321, 0.9169458 ,
        0.87414667, 0.91817837, 0.89869045, 0.87898485, 0.93590884])}

In [9]:
r2_mae = mae_sgd_cv["test_r2"].mean()
r2_mae

0.875878235698402

In [10]:
max_error_mae = abs(mae_sgd_cv["test_max_error"]).max()
max_error_mae

11.223804367831676
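
To put the two runs side by side, the sketch below compares the figures reported in the outputs above. SGD is stochastic, so rerunning the notebook will produce slightly different numbers.

```python
# Values copied from the notebook outputs above.
r2, max_error_celsius = 0.8975977680368816, 9.804617728678032   # squared-error loss
r2_mae, max_error_mae = 0.875878235698402, 11.223804367831676   # MAE loss

r2_gap = r2 - r2_mae                               # how much higher R2 is with MSE
max_error_gap = max_error_mae - max_error_celsius  # how much larger the worst error is with MAE

print(f"R2 gap (MSE - MAE): {r2_gap:.3f}")                              # 0.022
print(f"Worst-error gap (MAE - MSE): {max_error_gap:.2f} degrees C")    # 1.42 degrees C
```

In this run, the model optimized on the squared error has both the higher mean R2 and the smaller worst-case error. That is consistent with the squared loss penalizing large errors more heavily, which suits plants that are most sensitive to large temperature deviations.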