# Regression Metrics

This notebook walks through the most common regression metrics. It is a companion workbook for the 365 Data Science course on ML Process and focuses only on implementation. See the course or the documentation for in-depth explanations of each metric.

We will cover:

- R2 Score
- Adj R2 Score
- Mean Absolute Error
- Root Mean Squared Error

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

## Load Data

This is a housing dataset. ML models are commonly used to price houses based on their features, so in this notebook we'll build a model that predicts the sale price of a home:

In [9]:
df = pd.read_csv("housing_data.csv")

df.head()

Unnamed: 0,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated,street,city,statezip,country
0,2014-05-02 00:00:00,313000.0,3.0,1.5,1340,7912,1.5,0,0,3,1340,0,1955,2005,18810 Densmore Ave N,Shoreline,WA 98133,USA
1,2014-05-02 00:00:00,2384000.0,5.0,2.5,3650,9050,2.0,0,4,5,3370,280,1921,0,709 W Blaine St,Seattle,WA 98119,USA
2,2014-05-02 00:00:00,342000.0,3.0,2.0,1930,11947,1.0,0,0,4,1930,0,1966,0,26206-26214 143rd Ave SE,Kent,WA 98042,USA
3,2014-05-02 00:00:00,420000.0,3.0,2.25,2000,8030,1.0,0,0,4,1000,1000,1963,0,857 170th Pl NE,Bellevue,WA 98008,USA
4,2014-05-02 00:00:00,550000.0,4.0,2.5,1940,10500,1.0,0,0,4,1140,800,1976,1992,9105 170th Ave NE,Redmond,WA 98052,USA


## Feature Selection

Let's select the features we want to use in the model. To keep things simple, we've manually selected a list of features:

In [10]:
features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
       'floors', 'waterfront', 'view', 'condition', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated']

target = "price"

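## Train/Test Split

Next, we split the data into training and test sets. This is a single hold-out split rather than k-fold cross-validation: the model is fit on 67% of the rows and evaluated on the remaining 33%: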

In [11]:
from sklearn.model_selection import train_test_split

y = df[target]
X = df[features]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
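To show what a hold-out split does under the hood, here is a minimal plain-Python sketch. The function name `holdout_split` is hypothetical, and rounding the test size up is an assumption of this sketch. It will not reproduce the exact rows that `train_test_split` selects, only the idea: shuffle the row positions once with a fixed seed, then cut.

```python
import math
import random

def holdout_split(n_samples, test_size=0.33, seed=42):
    """Return (train_idx, test_idx): shuffled, non-overlapping row positions."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # seeded, so the split is repeatable
    n_test = math.ceil(test_size * n_samples)   # assumption: round the test size up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(10)
print(len(train_idx), len(test_idx))  # 6 4
```

Because no row position appears in both lists, metrics computed on the test rows measure performance on data the model has not seen.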

## Model

To keep things simple, we'll use linear regression:

In [12]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(X_train, y_train)
y_preds = model.predict(X_test)

## Evaluation Metrics

The most common regression metrics are R2, RMSE and MAE. sklearn provides `r2_score`, `mean_absolute_error` and `mean_squared_error`, and RMSE is the square root of the mean squared error:

In [25]:
from sklearn.metrics import (
    r2_score,
    mean_absolute_error,
    mean_squared_error
)

r2 = r2_score(y_test, y_preds)
rmse = np.sqrt(mean_squared_error(y_test, y_preds))
mae = mean_absolute_error(y_test, y_preds)

print("r2_score: {0}".format(r2))
print("rmse: {0}".format(rmse))
print("mae: {0}".format(mae))

r2_score: 0.06623293120933915
rmse: 788277.1876925862
mae: 191902.15686062217


## R2 Score

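The intuition behind R-squared is that it tells us what proportion of the variance in the y variable is explained by the model, compared with always predicting the mean of y. We use it to assess “goodness of fit.” A score of 1 is a perfect fit, 0 is no better than predicting the mean, and a negative score is worse than predicting the mean.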

Here's an implementation of R2 so you can see the inner workings of the metric:

In [28]:
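# Note: this redefines (shadows) the r2_score imported from sklearn above
def r2_score(y_test, y_preds):
    # Residual sum of squares: squared error left after the model's predictions
    SS_res = np.sum((y_test - y_preds)**2)
    # Total sum of squares: squared error of always predicting the mean
    SS_total = np.sum((y_test - np.mean(y_test))**2)
    r2 = 1 - SS_res/SS_total
    return r2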
    
r2_score(y_test, y_preds)

0.06623293120933915
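A tiny hand-checkable example helps calibrate the score. The sketch below uses plain Python lists and a hypothetical helper, `r2_by_hand`, with made-up values. It shows the three reference points: a perfect model, a model that always predicts the mean, and a model that does worse than the mean.

```python
def r2_by_hand(y_true, y_pred):
    """R2 = 1 - SS_res / SS_total, written out with plain Python."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_total = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_total

y_true = [1.0, 2.0, 3.0, 4.0]              # illustrative values, mean is 2.5

print(r2_by_hand(y_true, y_true))                 # perfect predictions: 1.0
print(r2_by_hand(y_true, [2.5, 2.5, 2.5, 2.5]))   # always the mean: 0.0
print(r2_by_hand(y_true, [4.0, 3.0, 2.0, 1.0]))   # reversed order: -3.0
```

Read against these reference points, the notebook's score of about 0.066 means the linear model is only slightly better than predicting the mean price for every house.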

## Adjusted R-Squared

The problem is that R-squared is easy to inflate. Adding features to a linear model never lowers R-squared on the training data, even when the features are irrelevant, so a high score can simply reflect overfitting. Adjusted R-squared addresses this by penalizing R-squared for the number of features p relative to the number of samples N:

In [31]:
def adj_r2_score(X, y_test, y_preds):
    SS_res = np.sum((y_test - y_preds)**2)
    SS_total = np.sum((y_test - np.mean(y_test))**2)
    r2 = 1 - SS_res/SS_total
    
    N = len(X)          # number of samples being scored
    p = len(X.columns)  # number of features
    
    adj_r2 = 1 - ((1 - r2)*(N - 1))/(N - p - 1)
    return adj_r2
    
# Pass X_test, not X: N must match the rows in y_test and y_preds
adj_r2_score(X_test, y_test, y_preds)
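To see how the penalty behaves, the formula can be isolated into a small function that takes R-squared directly. This is a sketch with a hypothetical name, `adjusted_r2`, and illustrative inputs (an R-squared of 0.5 on 101 samples), not values from the housing data.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - r2) * (N - 1) / (N - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Same R2 and sample count, increasing number of features
for p in (1, 12, 50):
    print(p, adjusted_r2(0.5, 101, p))
```

With the sample count fixed, the adjusted score drops as features are added. With 50 features on 101 samples, an R-squared of 0.5 is adjusted all the way down to 0.0. The penalty is small when N is much larger than p and grows quickly as p approaches N.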

## Mean Absolute Error

In simple terms, we take the absolute value of the error for each data point, then average those values. This gives us the average magnitude of the errors, in the same units as the target:

In [36]:
def mean_absolute_error(y_test, y_preds):
    return np.sum(abs(y_preds - y_test))/len(y_preds)
    
mean_absolute_error(y_test, y_preds)

191902.15686062217

## Root Mean Squared Error

Instead of taking the absolute value of the errors, here we square them first. Squaring makes every error non-negative and penalizes large errors more heavily than small ones. The average of the squared errors is the mean squared error (MSE), and its square root is the RMSE, which is back in the units of the target:

In [38]:
def mean_squared_error(y_test, y_preds):
    return np.sum((y_preds - y_test)**2)/len(y_preds)
    
np.sqrt(mean_squared_error(y_test, y_preds))

788277.1876925862
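The practical difference between MAE and RMSE shows up when errors are unevenly distributed. The sketch below uses two made-up lists of errors with the same total absolute error, and hypothetical helpers `mae_of` and `rmse_of` that operate on the errors directly.

```python
import math

def mae_of(errors):
    """Mean absolute error of a list of (prediction - actual) errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse_of(errors):
    """Root mean squared error of the same list."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

even = [10.0, 10.0, 10.0, 10.0]   # every prediction is off by 10
spiky = [0.0, 0.0, 0.0, 40.0]     # three exact predictions, one large miss

print(mae_of(even), rmse_of(even))    # 10.0 10.0
print(mae_of(spiky), rmse_of(spiky))  # 10.0 20.0
```

MAE cannot tell the two cases apart, while RMSE doubles for the list with one large miss. RMSE is never smaller than MAE, so a wide gap between them, like the one in this notebook (roughly 788,000 versus 192,000), is consistent with a small number of very large errors.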

## Conclusion

To recap, this notebook covered four regression metrics:

- R2
- Adjusted R2
- Mean Absolute Error
- Root Mean Squared Error 