# 📝 Exercise M7.03

As with the classification metrics exercise, we will evaluate the regression
metrics within a cross-validation framework to get familiar with the syntax.

We will use the Ames house prices dataset.

In [1]:
import os

os.makedirs("../../datasets", exist_ok=True)

In [2]:
%%bash

wget -qO "../../datasets/house_prices.csv" "https://github.com/INRIA/scikit-learn-mooc/raw/master/datasets/house_prices.csv"

In [3]:
import pandas as pd
import numpy as np

ames_housing = pd.read_csv("../../datasets/house_prices.csv")
data = ames_housing.drop(columns="SalePrice")
target = ames_housing["SalePrice"]
data = data.select_dtypes(np.number)
target /= 1000
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 34 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             1460 non-null   int64
 1   MSSubClass     1460 non-null   int64
 2   LotArea        1460 non-null   int64
 3   OverallQual    1460 non-null   int64
 4   OverallCond    1460 non-null   int64
 5   YearBuilt      1460 non-null   int64
 6   YearRemodAdd   1460 non-null   int64
 7   BsmtFinSF1     1460 non-null   int64
 8   BsmtFinSF2     1460 non-null   int64
 9   BsmtUnfSF      1460 non-null   int64
 10  TotalBsmtSF    1460 non-null   int64
 11  1stFlrSF       1460 non-null   int64
 12  2ndFlrSF       1460 non-null   int64
 13  LowQualFinSF   1460 non-null   int64
 14  GrLivArea      1460 non-null   int64
 15  BsmtFullBath   1460 non-null   int64
 16  BsmtHalfBath   1460 non-null   int64
 17  FullBath       1460 non-null   int64
 18  HalfBath       1460 non-null   int64
 19  Bedroo

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>

We can display an interactive diagram with the following command:

In [4]:
from sklearn import set_config
set_config(display='diagram')

The first step will be to create a linear regression model.

In [5]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model

Then, use `cross_val_score` to estimate the generalization performance of
the model. Use a `KFold` cross-validation with 10 folds. Make the use of the
$R^2$ score explicit by setting the `scoring` parameter (even though it is
the default score).

In [8]:
%%time
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, data, target, 
    cv=10, scoring="r2")
print(f"R2 score: {scores.mean():.3f} +/- {scores.std():.3f}")

R2 score: 0.794 +/- 0.103
CPU times: user 305 ms, sys: 10.5 ms, total: 316 ms
Wall time: 79.9 ms
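The cell above passes the integer `cv=10` rather than a `KFold` object. For a regressor, scikit-learn turns an integer `cv` into a `KFold` splitter without shuffling, so both spellings should give the same folds. The sketch below checks this on synthetic data from `make_regression`, which stands in for the Ames dataset; the names `X_demo`, `y_demo` and `demo_model` are illustrative and not part of the exercise.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; this is not the Ames dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
demo_model = LinearRegression()

# Explicit splitter: 10 consecutive folds, no shuffling.
cv = KFold(n_splits=10)
scores_kfold = cross_val_score(demo_model, X_demo, y_demo, cv=cv, scoring="r2")

# Integer shortcut, as used in the cell above.
scores_int = cross_val_score(demo_model, X_demo, y_demo, cv=10, scoring="r2")

# Both calls split the data identically, so the per-fold scores match.
print(np.allclose(scores_kfold, scores_int))
```

Passing the `KFold` object explicitly is mainly useful when you want to control shuffling or the random seed.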


Then, instead of using the $R^2$ score, use the mean absolute error. You need
to refer to the documentation for the `scoring` parameter.

In [9]:
%%time
scores = cross_val_score(model, data, target, 
    cv=10, scoring="neg_mean_absolute_error")
errors = -scores
print(f"Mean absolute error: "
    f"{errors.mean():.3f} k$ +/- {errors.std():.3f} k$")

Mean absolute error: 21.892 k$ +/- 2.225 k$
CPU times: user 332 ms, sys: 16.5 ms, total: 349 ms
Wall time: 63.1 ms


The `scoring` parameter in scikit-learn expects a score, meaning that higher values indicate a better model. An error behaves the opposite way: the smaller it is, the better the model. Errors are therefore multiplied by -1 so that they follow the same convention, which is why the `scoring` strings for error metrics start with `neg_`.
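To make the sign convention concrete, here is a minimal sketch that compares the values returned for `"neg_mean_absolute_error"` with the mean absolute error computed by hand on the same folds. It uses synthetic data from `make_regression` as a stand-in for the Ames dataset, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; this is not the Ames dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
cv = KFold(n_splits=5)

# Scorer-based evaluation: the returned values are negated errors.
neg_mae = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv,
    scoring="neg_mean_absolute_error",
)

# Manual evaluation on the same folds, using the plain (positive) metric.
manual_mae = []
for train_idx, test_idx in cv.split(X_demo):
    fitted = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    predictions = fitted.predict(X_demo[test_idx])
    manual_mae.append(mean_absolute_error(y_demo[test_idx], predictions))
manual_mae = np.array(manual_mae)

# Flipping the sign of the scores recovers the usual error values.
print(np.allclose(-neg_mae, manual_mae))
```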

Finally, use the `cross_validate` function and compute multiple scores/errors
at once by passing a list of scorers to the `scoring` parameter. You can
compute the $R^2$ score and the mean absolute error for instance.

In [10]:
%%time
from sklearn.model_selection import cross_validate
import pandas as pd

scoring = ["r2", "neg_mean_absolute_error"]
cv_results = cross_validate(model, data, target, 
    scoring=scoring)
scores = {
    "R2": cv_results["test_r2"],
    "MAE": -cv_results["test_neg_mean_absolute_error"]
}
scores = pd.DataFrame(scores)
scores

CPU times: user 175 ms, sys: 4.05 ms, total: 180 ms
Wall time: 35.6 ms


         R2        MAE
0  0.848721  21.256799
1  0.816374  22.084083
2  0.813513  22.113367
3  0.814138  20.448279
4  0.637473  24.370341
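The per-fold table can be reduced to one mean and one standard deviation per metric, in the same way as the earlier `cross_val_score` cells. The sketch below does this with `DataFrame.agg` on synthetic data from `make_regression`, which stands in for the Ames dataset; the names `cv_results_demo`, `scores_demo` and `summary` are illustrative, and the error column is labeled `MAE` because it holds the mean absolute error.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (an assumption; this is not the Ames dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# One entry "test_<name>" is returned per requested scorer.
cv_results_demo = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=5,
    scoring=["r2", "neg_mean_absolute_error"],
)

scores_demo = pd.DataFrame({
    "R2": cv_results_demo["test_r2"],
    # Flip the sign to turn the negated error back into an error.
    "MAE": -cv_results_demo["test_neg_mean_absolute_error"],
})

# Mean and standard deviation across folds for each metric.
# Note: pandas uses the sample standard deviation (ddof=1), whereas
# numpy's ndarray.std() defaults to ddof=0, so values differ slightly.
summary = scores_demo.agg(["mean", "std"])
print(summary)
```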
