# 📝 Exercise M7.03

As with the classification metrics exercise, we will evaluate the regression
metrics within a cross-validation framework to get familiar with the syntax.

We will use the Ames house prices dataset.

In [1]:
import pandas as pd
import numpy as np

ames_housing = pd.read_csv("../datasets/house_prices.csv")
data = ames_housing.drop(columns="SalePrice")
target = ames_housing["SalePrice"]
# Keep only the numerical features
data = data.select_dtypes(np.number)
# Express the sale prices in thousands of dollars
target /= 1000

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>

The first step will be to create a linear regression model.

In [2]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

Then, use `cross_val_score` to estimate the generalization performance of
the model. Use a `KFold` cross-validation with 10 folds (for a regressor,
passing `cv=10` selects this strategy). Make the use of the $R^2$ score
explicit by setting the `scoring` parameter (even though it is the default
score).

In [3]:
from sklearn.model_selection import cross_val_score

reg_scores = cross_val_score(regressor, data, target, cv=10, scoring='r2')
reg_scores

array([0.84390289, 0.85497435, 0.88752303, 0.74951104, 0.81698014,
       0.82013355, 0.81554085, 0.81452472, 0.50115778, 0.83330693])
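
For a regressor, scikit-learn interprets an integer `cv` as an unshuffled `KFold`, so the splitter can also be written out explicitly. The sketch below illustrates this on synthetic data generated with `make_regression`; the synthetic data is an assumption made to keep the example self-contained (it is not the Ames dataset), and the variable names are only illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the housing data (assumption, not the Ames dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

regressor = LinearRegression()

# Integer shorthand: 10 consecutive folds, no shuffling
scores_int = cross_val_score(regressor, X, y, cv=10, scoring="r2")

# Same strategy with the splitter written explicitly
cv = KFold(n_splits=10)
scores_kfold = cross_val_score(regressor, X, y, cv=cv, scoring="r2")

# Leaving `scoring` unset falls back to the regressor's `score` method (R²)
scores_default = cross_val_score(regressor, X, y, cv=cv)

print(np.allclose(scores_int, scores_kfold))
print(np.allclose(scores_kfold, scores_default))
```

Writing the `KFold` object explicitly is mostly useful when you want to change its options, for instance to shuffle the samples before splitting.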

Then, instead of the $R^2$ score, use the mean absolute error. Refer to the
documentation of the `scoring` parameter to find the name of the
corresponding scorer.

In [4]:
reg_absolute_scores = cross_val_score(regressor, data, target,
                                      cv=10, scoring='neg_mean_absolute_error')

reg_absolute_scores

array([-20.48049905, -21.38003105, -21.26831487, -22.86887664,
       -24.79955736, -18.95827641, -20.11793792, -20.5040172 ,
       -26.76774564, -21.77871056])
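
The values are negative because scikit-learn scorers follow a "higher is better" convention: errors are exposed under a `neg_` prefix with their sign flipped. A minimal sketch of how these scores could be turned back into errors, reusing the values printed above (expressed in thousands of dollars, since the target was divided by 1000); the name `reg_absolute_errors` is only illustrative.

```python
import numpy as np

# Scores copied from the output above (negated mean absolute errors)
reg_absolute_scores = np.array([
    -20.48049905, -21.38003105, -21.26831487, -22.86887664,
    -24.79955736, -18.95827641, -20.11793792, -20.5040172,
    -26.76774564, -21.77871056,
])

# Flip the sign to recover the mean absolute error of each fold
reg_absolute_errors = -reg_absolute_scores

print(
    f"Mean absolute error: {reg_absolute_errors.mean():.3f} "
    f"+/- {reg_absolute_errors.std():.3f} (thousands of dollars)"
)
```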

Finally, use the `cross_validate` function and compute multiple scores/errors
at once by passing a list of scorers to the `scoring` parameter. You can
compute the $R^2$ score and the mean absolute error for instance.

In [5]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(regressor, data, target, cv=10,
                            scoring=['r2', 'neg_mean_absolute_error'])

cv_results = pd.DataFrame(cv_results)
cv_results

| | fit_time | score_time | test_r2 | test_neg_mean_absolute_error |
|---|---|---|---|---|
| 0 | 0.004727 | 0.001997 | 0.843903 | -20.480499 |
| 1 | 0.003231 | 0.001918 | 0.854974 | -21.380031 |
| 2 | 0.003237 | 0.001881 | 0.887523 | -21.268315 |
| 3 | 0.003134 | 0.001902 | 0.749511 | -22.868877 |
| 4 | 0.003301 | 0.001778 | 0.816980 | -24.799557 |
| 5 | 0.003310 | 0.001979 | 0.820134 | -18.958276 |
| 6 | 0.003197 | 0.001868 | 0.815541 | -20.117938 |
| 7 | 0.003175 | 0.001816 | 0.814525 | -20.504017 |
| 8 | 0.003005 | 0.001822 | 0.501158 | -26.767746 |
| 9 | 0.003002 | 0.001902 | 0.833307 | -21.778711 |
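
Per-fold results are often easier to read once they are summarized. The sketch below shows one possible way to do so: it restores the sign of the error and aggregates each metric across folds. It runs on synthetic data from `make_regression` (an assumption to keep the example self-contained, not the Ames dataset), and the column name `test_mean_absolute_error` is a hypothetical name introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the housing data (assumption, not the Ames dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

cv_results = cross_validate(
    LinearRegression(), X, y, cv=10,
    scoring=["r2", "neg_mean_absolute_error"],
)
cv_results = pd.DataFrame(cv_results)

# Hypothetical column name: the error with its sign restored
cv_results["test_mean_absolute_error"] = (
    -cv_results["test_neg_mean_absolute_error"]
)

# Mean and standard deviation of each metric across the 10 folds
summary = cv_results[["test_r2", "test_mean_absolute_error"]].agg(
    ["mean", "std"]
)
print(summary)
```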
