# SLU10 - Metrics for Regression: Exercise Notebook

In this notebook, you will implement:

- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Coefficient of Determination (R²)
- Adjusted R²
- scikit-learn metrics
- Using metrics for k-fold cross validation


Start by loading the data we will use to fit a linear regression (hopefully you still remember SLU07) and fitting the `LinearRegression` estimator from scikit-learn:

In [None]:
# Base imports
import math
import numpy as np
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression

data = load_boston()

x = pd.DataFrame(data['data'], columns=data['feature_names'])
y = pd.Series(data['target'])
x.head()

In [None]:
np.random.seed(42)

x_housing = x.values
y_housing = y.values

lr = LinearRegression()
lr.fit(x_housing, y_housing)

y_hat_housing = lr.predict(x_housing)
betas_housing = pd.Series([lr.intercept_] + list(lr.coef_))

## 1 Metrics

We will start by covering the metrics we learned in the unit, in particular a set of related metrics:

- Mean Squared Error

$$MSE = \frac{1}{N} \sum_{n=1}^N (y_n - \hat{y}_n)^2$$


- Root Mean Squared Error

$$RMSE = \sqrt{MSE}$$


- Mean Absolute Error

$$MAE = \frac{1}{N} \sum_{n=1}^N \left| y_n - \hat{y}_n \right|$$
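To make the three formulas above concrete, here is a small worked example on made-up numbers. The arrays `y_true` and `y_guess` and the `toy_*` names are hypothetical and only serve as an illustration; this is a sketch of the arithmetic, not the reference solution for the exercises.

```python
import numpy as np

# Toy labels and predictions (made-up values, for illustration only)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_guess = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_guess          # [1, 0, -2, -1]
toy_mse = np.mean(errors ** 2)     # (1 + 0 + 4 + 1) / 4 = 1.5
toy_rmse = np.sqrt(toy_mse)        # sqrt(1.5), roughly 1.2247
toy_mae = np.mean(np.abs(errors))  # (1 + 0 + 2 + 1) / 4 = 1.0

print(toy_mse, toy_rmse, toy_mae)
```

Note how the single error of size 2 contributes 4 of the 6 squared-error units, so MSE and RMSE weigh large errors more heavily than MAE does.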

### 1.1 Mean Squared Error

Implement the mean squared error in the next function:

In [None]:
def mean_squared_error(y, y_pred): 
    """
    Args: 
        y_pred : numpy.array with shape (num_samples,) - predictions
        y : numpy.array with shape (num_samples,) - labels 
        
    Returns: 
        mse : float with Mean Squared Error Value

    """
    # 1) Compute the error.
    # error = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # 2) Compute the squared value of the errors for each sample
    # squared_error = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # 3) Compute the mean squared value of the errors
    # mse = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return mse

Check the outputs of your function match the results below:

In [None]:
np.random.seed(42)
mse = mean_squared_error(np.random.rand(10), np.random.rand(10))
assert math.isclose(0.066594135739203, mse)

Now compute the Mean Squared Error for our housing dataset:

In [None]:
MSE = mean_squared_error(y_hat_housing, y_housing)
print('Mean Squared Error Housing dataset: {}'.format(MSE))

### 1.2 Root Mean Squared Error
Implement the root mean squared error in the function below:

In [None]:
def root_mean_squared_error(y, y_pred): 
    """
    Args: 
        y_pred : numpy.array with shape (num_samples,) - predictions
        y : numpy.array with shape (num_samples,) - labels 
        
    Returns: 
        rmse : float with Root Mean Squared Error
    """
    # 1) Compute the mean squared error. Tip: reuse the mean_squared_error
    # function from the previous exercise, which returns a single float:
    # mse = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # 2) Compute the square root.
    # rmse = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return rmse

Check the outputs of your function match the results below:

In [None]:
np.random.seed(42)
rmse = root_mean_squared_error(np.random.rand(10), np.random.rand(10))
assert math.isclose(0.25805839598665065, rmse)

Finally, compute the Root Mean Squared Error for our housing dataset:

In [None]:
RMSE = root_mean_squared_error(y_hat_housing, y_housing)
print('Root Mean Squared Error Housing dataset: {}'.format(RMSE))

### 1.3 Mean Absolute Error

Finally, implement the Mean Absolute Error in the function below. 

In [None]:
def mean_absolute_error(y, y_pred): 
    """
    Args: 
        y_pred : numpy.array with shape (num_samples,) - predictions
        y : numpy.array with shape (num_samples,) - labels 
        
    Returns: 
        mae : float with Mean Absolute Error
    """
    # 1) Compute the error.
    # error = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # 2) Compute the absolute value of the errors for each sample
    # abs_error = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # 3) Compute the mean of the absolute value of the errors
    # mae = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return mae

Check the outputs of your function match the results below:

In [None]:
np.random.seed(42)
mae = mean_absolute_error(np.random.rand(10), np.random.rand(10))
assert math.isclose(0.20867273347264428, mae)

Now compute the Mean Absolute Error for our housing dataset:

In [None]:
MAE = mean_absolute_error(y_hat_housing, y_housing)
print('Mean Absolute Error Housing dataset: {}'.format(MAE))

Next we will focus on the Coefficient of Determination - $R^2$ - and its adjusted form. See the equations below:

- $R^2$ score 

$$R^2 = 1 - \frac{MSE(y, \hat{y})}{MSE(y, \bar{y})} 
= 1 - \frac{\frac{1}{N} \sum_{n=1}^N (y_n - \hat{y}_n)^2}{\frac{1}{N} \sum_{n=1}^N (y_n - \bar{y})^2}
= 1 - \frac{\sum_{n=1}^N (y_n - \hat{y}_n)^2}{\sum_{n=1}^N (y_n - \bar{y})^2}$$

where $$\bar{y} = \frac{1}{N} \sum_{n=1}^N y_n$$

- Adjusted $R^2$ score 

$$\bar{R}^2 = 1 - \frac{N - 1}{N - K - 1} (1 - R^2)$$

where $N$ is the number of observations in the dataset used for training the model (i.e. the number of rows of the pandas DataFrame) and $K$ is the number of features used by your model (i.e. the number of columns of the pandas DataFrame).
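As a quick sanity check of the two formulas above, the sketch below evaluates them on made-up numbers. All names (`y_true`, `y_guess`, `toy_r2`, and so on) are hypothetical, and the choice of one feature (`K_toy = 1`) is an assumption made only for this illustration.

```python
import numpy as np

# Toy labels and predictions (made-up values, for illustration only)
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_guess = np.array([1.5, 2.0, 2.5, 4.0, 5.0])

ss_res = np.sum((y_true - y_guess) ** 2)        # 0.25 + 0.25 = 0.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 4 + 1 + 0 + 1 + 4 = 10
toy_r2 = 1 - ss_res / ss_tot                    # 1 - 0.05 = 0.95

# Adjusted R², assuming a hypothetical model with a single feature
N_toy, K_toy = len(y_true), 1
toy_r2_adj = 1 - (N_toy - 1) / (N_toy - K_toy - 1) * (1 - toy_r2)  # 1 - (4/3) * 0.05

# Unlike MSE and MAE, R² is not symmetric: swapping the roles of labels and
# predictions changes the denominator (here 8.5 instead of 10).
ss_tot_swapped = np.sum((y_guess - y_guess.mean()) ** 2)
toy_r2_swapped = 1 - ss_res / ss_tot_swapped

print(toy_r2, toy_r2_adj, toy_r2_swapped)
```

Because the denominator depends on which array is treated as the labels, argument order matters for R² even though it does not for MSE, RMSE, or MAE.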


### 1.4 R² score

Start by implementing the $R^2$ score in the function below:

In [None]:
def r_squared(y_pred, y): 
    """
    Args: 
        y_pred : numpy.array with shape (num_samples,) - predictions
        y : numpy.array with shape (num_samples,) - labels 
        
    Returns: 
        r2 : float with R squared value
    """

    # 1) Compute labels mean.
    # y_mean = ...
    # YOUR CODE HERE
    raise NotImplementedError()

    # 2) Compute the mean squared error between the target and the predictions.
    # Tip: reuse mean_squared_error, which returns a single float:
    # mse_pred = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # 3) Compute the mean squared error between the target and its mean.
    # Tip: reuse mean_squared_error, which returns a single float:
    # mse_mean = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # 4) Finally, compute R²
    # r2 = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return r2

Check the outputs of your function match the results below:

In [None]:
np.random.seed(42)
r2 = r_squared(np.random.rand(10), np.random.rand(10))
assert math.isclose(0.19069113996339448, r2)

Now compute the $R^2$ metric for our housing dataset:

In [None]:
r2 = r_squared(y_hat_housing, y_housing)
print('R² Housing dataset: {}'.format(r2))

### 1.5 Adjusted R² score

Then implement the adjusted $R^2$ score in the function below:

In [None]:
def adjusted_r_squared(y_pred, y, K):
    """
    Args: 
        y_pred : numpy.array with shape (num_samples,) - predictions
        y : numpy.array with shape (num_samples,) - labels 
        K : integer - Number of features used in the model that computed y_pred.

    Returns: 
        r2_adj : float with adjusted R squared value
    """
    
    # 1) Compute R².
    # r2 = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # 2) Get number of samples 
    # N = ...
    # YOUR CODE HERE
    raise NotImplementedError()

    # 3) Adjust R²
    # r2_adj = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return r2_adj

Check the outputs of your function match the results below:

In [None]:
np.random.seed(42)
r2 = adjusted_r_squared(np.random.rand(10), np.random.rand(10), 2)
assert math.isclose(-0.04053996290420714, r2)

Finally compute the adjusted $R^2$ metric for our housing dataset:

In [None]:
r2 = adjusted_r_squared(y_hat_housing, y_housing, x_housing.shape[1])
print('Adjusted R² Housing dataset: {}'.format(r2))

## 2 Scikit-Learn metrics

As you know, scikit-learn already provides implementations of these metrics: 

- `sklearn.metrics.mean_absolute_error`
- `sklearn.metrics.mean_squared_error`
- `sklearn.metrics.r2_score`
- `sklearn.linear_model.LinearRegression.score` 
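The following sketch shows how the functions listed above are typically called. The toy arrays and the names `sk_mae`, `sk_mse`, `sk_r2`, and `toy_lr` are hypothetical and chosen only for illustration. The scikit-learn metric functions take the true labels first and the predictions second, and `LinearRegression.score(X, y)` returns the R² of the model's predictions on `X`.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression

# Toy labels and predictions (made-up values, for illustration only)
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_guess = np.array([1.5, 2.0, 2.5, 4.0, 5.0])

# Signature is (y_true, y_pred) for all three functions
sk_mae = metrics.mean_absolute_error(y_true, y_guess)  # 1.0 / 5 = 0.2
sk_mse = metrics.mean_squared_error(y_true, y_guess)   # 0.5 / 5 = 0.1
sk_r2 = metrics.r2_score(y_true, y_guess)              # 1 - 0.5 / 10 = 0.95

# LinearRegression.score is R² computed on the model's own predictions
X_toy = np.arange(5, dtype=float).reshape(-1, 1)
toy_lr = LinearRegression().fit(X_toy, y_guess)
score_r2 = toy_lr.score(X_toy, y_guess)
manual_r2 = metrics.r2_score(y_guess, toy_lr.predict(X_toy))

print(sk_mae, sk_mse, sk_r2, np.isclose(score_r2, manual_r2))
```

There is no dedicated adjusted R² function in this list, so the adjustment has to be applied on top of `r2_score`.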

In [None]:
# Import sklearn metrics
from sklearn import metrics as sklearn_metrics

### 2.1 Root Mean Squared Error

Implement the root mean squared error function below with scikit-learn:

In [None]:
def sklearn_root_mean_squared_error(y_pred, y): 
    """
    Args: 
        y_pred : numpy.array with shape (num_samples,) - predictions
        y : numpy.array with shape (num_samples,) - labels 
        
    Returns: 
        rmse : float with Root Mean Squared Error
    """
    # YOUR CODE HERE
    raise NotImplementedError()

Make sure your function passes the tests below:

In [None]:
np.random.seed(42)
rmse = sklearn_root_mean_squared_error(np.random.rand(10), np.random.rand(10))
assert math.isclose(0.25805839598665065, rmse)

### 2.2 Adjusted R² score

Implement the adjusted R² score below using scikit-learn:

In [None]:
def sklearn_adjusted_r_squared(y_pred, y, K): 
    """
    Args: 
        y_pred : numpy.array with shape (num_samples,) - predictions
        y : numpy.array with shape (num_samples,) - labels 
        K : integer - Number of features used in the model that computed y_pred.

    Returns: 
        r2_adj : float with adjusted R squared value
    """
    # YOUR CODE HERE
    raise NotImplementedError()

Make sure your function passes the tests below:

In [None]:
np.random.seed(42)
r2 = sklearn_adjusted_r_squared(np.random.rand(10), np.random.rand(10), 4)
assert math.isclose(-0.45675594806589004, r2)

Finally, compare the sklearn-based metrics with your own for the housing dataset:

In [None]:
MAE = mean_absolute_error(y_hat_housing, y_housing)
MSE = mean_squared_error(y_hat_housing, y_housing)
RMSE = root_mean_squared_error(y_hat_housing, y_housing)
R2 = r_squared(y_hat_housing, y_housing)
R2_adj = adjusted_r_squared(y_hat_housing, y_housing, x_housing.shape[1])

print('Metrics for Housing dataset with base implementation:')
print('Mean Absolute Error Housing dataset: {}'.format(MAE))
print('Mean Squared Error Housing dataset: {}'.format(MSE))
print('Root Mean Squared Error Housing dataset: {}'.format(RMSE))
print('R² Housing dataset: {}'.format(R2))
print('Adjusted R² Housing dataset: {}'.format(R2_adj))
print('\n')

SK_MAE = sklearn_metrics.mean_absolute_error(y_hat_housing, y_housing)
SK_MSE = sklearn_metrics.mean_squared_error(y_hat_housing, y_housing)
SK_RMSE = sklearn_root_mean_squared_error(y_hat_housing, y_housing)
SK_R2 = sklearn_metrics.r2_score(y_housing, y_hat_housing)
SK_R2_adj = sklearn_adjusted_r_squared(y_hat_housing, y_housing, x_housing.shape[1])

print('Metrics for Housing dataset with scikit-learn:')
print('Mean Absolute Error Housing dataset: {}'.format(SK_MAE))
print('Mean Squared Error Housing dataset: {}'.format(SK_MSE))
print('Root Mean Squared Error Housing dataset: {}'.format(SK_RMSE))
print('R² Housing dataset: {}'.format(SK_R2))
print('Adjusted R² Housing dataset: {}'.format(SK_R2_adj))


## 3 Using the Metrics

Now you'll use the metrics to check the performance of `LinearRegression` and `SGDRegressor` with scikit-learn's `cross_val_score` function. Implement the missing steps below using the mean squared error metric:


In [None]:
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn import linear_model

def estimator_cross_fold(X, y, K, clf_choice='linear'):
    """
    Args: 
        X : numpy.array with shape (num_samples, num_features) - sample data
        y : numpy.array with shape (num_samples,) - sample labels 
        K : integer - Number of folds for k-fold cross validation
        clf_choice: choice of estimator 

    Returns: 
        clf: estimator trained with full data
        scores : scores for each fold
    """
    
    if clf_choice == 'linear':
        clf = linear_model.LinearRegression()
    elif clf_choice == 'sgd':
        clf = linear_model.SGDRegressor()
    else:
        print('Invalid estimator')
        return None
     
    # 1) Fit the estimator on the full data
    # YOUR CODE HERE
    raise NotImplementedError()

    # 2) Run k-fold cross validation
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return clf, scores

Now test your implementation with the examples below:

In [None]:
np.random.seed(59)
_, scores = estimator_cross_fold(np.random.rand(10).reshape((-1, 1)), np.random.rand(10), 3, clf_choice='linear')
np.testing.assert_array_almost_equal(np.array([-0.07839893, -0.04592946, -0.01665195]), scores)

np.random.seed(59)
_, scores = estimator_cross_fold(np.random.rand(10).reshape((-1, 1)), np.random.rand(10), 3, clf_choice='sgd')
np.testing.assert_array_almost_equal(np.array([-0.16860628, -0.00500938, -0.032316]), scores)
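The expected scores above are negative, which is consistent with scikit-learn's convention that a higher score is always better: error metrics are exposed to `cross_val_score` in negated form, for example `scoring='neg_mean_squared_error'`. The sketch below illustrates that convention on synthetic data. The `Ridge` estimator, the data, and every variable name are assumptions chosen for illustration and are not part of the exercise.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic, roughly linear data (hypothetical, for illustration only)
rng = np.random.RandomState(0)
X_toy = rng.rand(30, 1)
y_toy = 2.0 * X_toy.ravel() + 0.1 * rng.randn(30)

# One score per fold; each score is minus the MSE on that held-out fold
neg_mse = cross_val_score(
    Ridge(alpha=1.0), X_toy, y_toy, cv=3, scoring='neg_mean_squared_error'
)

fold_mse = -neg_mse           # flip the sign to recover the usual MSE
fold_rmse = np.sqrt(fold_mse)  # RMSE per fold, in the units of y

print(neg_mse, fold_mse.mean(), fold_rmse.mean())
```

When comparing two estimators with this scoring, the one whose mean score is closer to zero has the lower cross-validated error.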

Let's check the performance on a dataset of linear data:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline  

example_data = pd.read_csv('data/example_data.csv')

x = example_data['LotArea'].values.reshape(-1, 1)
y = example_data['SalePrice'].values.reshape(-1, 1)

plt.plot(x, y, 'b.')

Run the k-fold cross validation for both regressors and get the average error:

In [None]:
np.random.seed(42)

clf_lr, scores_lr = estimator_cross_fold(x, y.ravel(), 5, clf_choice='linear')
assert math.isclose(-0.018198308576712695, scores_lr.mean())

clf_sgd, scores_sgd = estimator_cross_fold(x, y.ravel(), 5, clf_choice='sgd')
assert math.isclose(-0.01929775798505724, scores_sgd.mean())

Because these scores are negated mean squared errors, values closer to zero are better. We would then conclude that the LinearRegression is slightly better than the SGDRegressor at predicting unseen examples for this data, since its mean cross-validated error is smaller (about 0.0182 versus 0.0193). This does not mean that its error will be smaller on every dataset, only that it generalizes slightly better to the held-out folds here. In practice, both estimators are quite similar for this particular use case:

In [None]:
y_hat_linear = clf_lr.predict(x)
y_hat_sgd = clf_sgd.predict(x)

plt.plot(x, y, 'b.')
plt.plot(x, y_hat_linear, 'r-')
plt.plot(x, y_hat_sgd, 'g-')

print('Error for full dataset for linear regressor: {}'.format(sklearn_metrics.mean_squared_error(y_hat_linear, y)))
print('Error for full dataset for SGD: {}'.format(sklearn_metrics.mean_squared_error(y_hat_sgd, y)))
