Style of the data

In [None]:
%%HTML
<style type="text/css">
div.h1 { background-color:#b300b3;
        color: white; padding: 8px; padding-right: 300px;font-size: 35px; max-width: 1500px; margin: auto; margin-top: 50px; }
div.h2 {background-color:#b300b3;
        color: white; padding: 8px; padding-right: 300px; font-size: 25px; max-width: 1500px; margin: auto; margin-top: 50px; }
div.h3 { color:#b300b3;
        font-size: 16px; margin-top: 20px; margin-bottom:4px; }
hr {display: block; color: gray; height: 1px; border: 0; border-top: 1px solid; }
hr.light {display: block; color: lightgray; height: 1px; border: 0; border-top: 1px solid; }
</style>

[Crislânio Macêdo](https://medium.com/sapere-aude-tech) - March 13th, 2020

<div class="h1">Understanding Regression Error Metrics in Python🐍</div>

- [**Github**](https://github.com/crislanio)
- [**Linkedin**](https://www.linkedin.com/in/crislanio/)
- [**Medium**](https://medium.com/sapere-aude-tech)
- [**Quora**](https://www.quora.com/profile/Crislanio)
- [**Hackerrank**](https://www.hackerrank.com/crislanio_ufc?hr_r=1)
- [**Blog**](https://medium.com/@crislanio.ufc)
- [**Personal Page**](https://crislanio.wordpress.com/about)
- [**Twitter**](https://twitter.com/crs_macedo)


<a class="anchor" id="top"></a>
<a id='dsf4'></a>
# <div class="h2">  Table of contents</div>
1. [Imports](#IMPORT)
2. [Regression metrics summary](#M)
   -  <a href='#m1'>MAE</a>
   -  <a href='#m2'>MSE</a>
   -  <a href='#m3'>RMSE</a>
   -  <a href='#m4'>MAPE</a>
   -  <a href='#m5'>R-Squared</a>
   -  <a href='#m5'>Adjusted R-Squared</a>
   -  <a href='#m6'>Residual Sum of Squares (RSS)</a>
   
  <hr>

# <div class="h1">Imports </div>
<a id="IMPORT"></a>
[Back to Table of Contents](#top)

[The End](#theend)

We use the following stack: ``numpy``, ``pandas``, ``scipy``, and ``sklearn``.

In [None]:
import numpy as np
import pandas as pd

from scipy.stats import boxcox_normmax
from scipy.special import boxcox1p
from sklearn.linear_model import LinearRegression

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

Read the data

In [None]:
# Read the data
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

In [None]:
# Target and a hand-picked subset of features
y = train.SalePrice.reset_index(drop=True)
end_features = ['OverallQual','GrLivArea','GarageCars','GarageArea','TotalBsmtSF','1stFlrSF','FullBath','TotRmsAbvGrd','MSSubClass','MSZoning']
features = train[end_features].copy()  # copy so the assignments below do not touch `train`

# MSSubClass is a category code, not a quantity, so treat it as a string
features['MSSubClass'] = features['MSSubClass'].apply(str)
# Fill missing MSZoning with the most frequent zoning within the same MSSubClass
features['MSZoning'] = features.groupby('MSSubClass')['MSZoning'].transform(lambda x: x.fillna(x.mode()[0]))

# Fill remaining gaps: 'None' for categorical columns, 0 for numeric columns
objects = [col for col in features.columns if features[col].dtype == "object"]
features.update(features[objects].fillna('None'))
numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerics = [col for col in features.columns if features[col].dtype in numeric_dtypes]
features.update(features[numerics].fillna(0))

# Box-Cox transform of (1 + x), with lambda estimated per column
for i in numerics:
    features[i] = boxcox1p(features[i], boxcox_normmax(features[i] + 1))

# One-hot encode the categorical columns
X = pd.get_dummies(features).reset_index(drop=True)

The model

In [None]:
reg = LinearRegression().fit(X, y)
y_pred = reg.predict(X)

# <div class="h1">Regression metrics summary </div>
<a id="M"></a>
[Back to Table of Contents](#top)

[The End](#theend)

![](https://miro.medium.com/max/1308/1*lke9jk2uY-ppHO0h0xytQw.png)

# <div class="h3">MAE: Mean absolute error</div>
<a id="m1"></a>
[Back to Table of Contents](#top)

[The End](#theend)

![](https://miro.medium.com/max/1189/0*sA9a9MlNiZ1dI7so.jpg)

> MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight.

In [None]:
def MAE(predict,target):
    return (abs(predict-target)).mean()

from sklearn.metrics import mean_absolute_error
print ('MAE: ' + str(mean_absolute_error(y,y_pred)) )
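
To make the "without considering their direction" part concrete, here is a minimal sketch on toy numbers (the arrays `y_true` and `y_hat` are made up for illustration and are not taken from the housing data). The signed errors cancel out, while MAE still reports the average size of the misses.

```python
import numpy as np

# Toy values, assumed for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 7.0])

errors = y_hat - y_true          # signed errors: one too low, one too high, two exact
signed_mean = errors.mean()      # positive and negative errors cancel
mae = np.abs(errors).mean()      # absolute values keep every miss

print("mean signed error:", signed_mean)
print("MAE:", mae)
```

A mean signed error near zero only says the model is not biased in one direction; MAE is what tells you how far off a typical prediction is.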

# <div class="h3">MSE: Mean squared error</div>
<a id="m2"></a>
[Back to Table of Contents](#top)

[The End](#theend)

![](https://miro.medium.com/max/978/0*7RxO773DPeY8IYeD.png)
source: https://www.geeksforgeeks.org/ml-mathematical-explanation-of-rmse-and-r-squared-error/

> MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive (and not zero) is because of randomness or because the estimator does not account for information that could produce a more accurate estimate. The MSE is a measure of the quality of an estimator; it is always non-negative, and values closer to zero are better.

In [None]:
from sklearn.metrics import mean_squared_error
print ('MSE: ' + str(mean_squared_error(y,y_pred)) )
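
Because the errors are squared, MSE reacts much more strongly to a few large misses than MAE does. The sketch below uses two hypothetical sets of predictions (`pred_even` and `pred_outlier`, both invented for this example) that have the same MAE but very different MSE.

```python
import numpy as np

# Toy target values, assumed for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])

# Same total absolute error (8), distributed differently
pred_even = y_true + np.array([2.0, 2.0, 2.0, 2.0])     # four small misses
pred_outlier = y_true + np.array([0.0, 0.0, 0.0, 8.0])  # one large miss

def mae(y, y_hat):
    return np.abs(y - y_hat).mean()

def mse(y, y_hat):
    return ((y - y_hat) ** 2).mean()

mae_even, mae_outlier = mae(y_true, pred_even), mae(y_true, pred_outlier)
mse_even, mse_outlier = mse(y_true, pred_even), mse(y_true, pred_outlier)

print("MAE:", mae_even, mae_outlier)
print("MSE:", mse_even, mse_outlier)
```

If large individual errors are especially costly in your problem, this sensitivity is a reason to prefer MSE; if outliers are mostly noise, MAE may be the more robust choice.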

# <div class="h3">RMSE: Root mean square error</div>
<a id="m3"></a>
[Back to Table of Contents](#top)

[The End](#theend)


![](https://miro.medium.com/max/650/0*at-j68ROeSmiruDE.png)
source: https://www.includehelp.com/ml-ai/root-mean-square%20error-rmse.aspx

> RMSE is a quadratic scoring rule that also measures the average magnitude of the error. It’s the square root of the average of squared differences between prediction and actual observation.

In [None]:
def rmse(predict, target):
    # Square root of the mean squared difference between prediction and target
    return np.sqrt(((predict - target) ** 2).mean())
print ('RMSE: ' + str(rmse(y_pred,y)) )
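
Taking the square root brings the metric back to the units of the target, which makes RMSE directly comparable to MAE. A small sketch with made-up prices (the values below are assumptions for illustration only):

```python
import numpy as np

# Hypothetical sale prices and predictions, in dollars
y_true = np.array([200_000.0, 150_000.0, 300_000.0])
y_hat = np.array([210_000.0, 140_000.0, 330_000.0])

errors = y_hat - y_true
mse_value = (errors ** 2).mean()     # units: dollars squared
rmse_value = np.sqrt(mse_value)      # units: dollars again
mae_value = np.abs(errors).mean()    # units: dollars

print("MSE :", mse_value)
print("RMSE:", rmse_value)
print("MAE :", mae_value)
```

RMSE is never smaller than MAE on the same data; the gap between the two grows when the error sizes are uneven, as they are here with one 30,000 miss next to two 10,000 misses.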

# <div class="h3">MAPE: Mean absolute percentage error</div>
<a id="m4"></a>
[Back to Table of Contents](#top)

[The End](#theend)

![](https://miro.medium.com/max/1063/0*N8USfmlDmXq7YuNy.png)
> MAPE is a measure of the prediction accuracy of a forecasting method in statistics, for example in trend estimation. It is also used as a loss function for regression problems in machine learning. It usually expresses accuracy as a percentage.

In [None]:
def MAPE(predict,target):
    return ( abs((target - predict) / target).mean()) * 100
print ('MAPE: ' + str(MAPE(y_pred,y)) )
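
Two properties of MAPE are worth seeing on toy numbers: it does not depend on the scale of the target, and it is undefined when a target value is zero. The helper `safe_mape` below is a hypothetical name introduced for this sketch; raising an error on zero targets is one possible choice, not the only one.

```python
import numpy as np

def safe_mape(predict, target):
    """MAPE in percent; refuses zero targets instead of dividing by zero."""
    predict = np.asarray(predict, dtype=float)
    target = np.asarray(target, dtype=float)
    if np.any(target == 0):
        raise ValueError("MAPE is undefined when a target value is zero")
    return np.abs((target - predict) / target).mean() * 100

# Toy values, assumed for illustration only
target = np.array([100.0, 200.0, 400.0])
predict = np.array([110.0, 180.0, 400.0])

mape_small = safe_mape(predict, target)
mape_large = safe_mape(predict * 1000, target * 1000)  # same data, larger units

print("MAPE:", mape_small)
print("MAPE after rescaling:", mape_large)
```

The individual percentage errors here are 10%, 10% and 0%, so the result is their average. Rescaling both arrays leaves it unchanged, which is what makes MAPE convenient for comparing models across targets of different magnitude.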

# <div class="h3">R² (R-Squared): Coefficient of determination</div>

<a id="m5"></a>
[Back to Table of Contents](#top)

[The End](#theend)



![](https://miro.medium.com/max/888/0*-lBX506Imc6Hjqpu)

> R² (R-Squared) tells us how good our regression model is compared to a very simple baseline model that always predicts the mean value of the target from the train set.


In [None]:
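def R2(predict, target):
    # R² = 1 - (residual sum of squares / total sum of squares)
    ss_res = ((target - predict) ** 2).sum()
    ss_tot = ((target - target.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot
def R(predict, target):
    # Square root of R² (only defined when R² is non-negative)
    r2 = R2(predict,target)
    return np.sqrt(r2)
print ('R2         : ' + str(R2(y_pred,y)) )
print ('R          : ' + str(R(y_pred,y)) )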
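
The comparison against the mean baseline can be checked directly on toy numbers. In this sketch (the function name `r_squared` and the arrays are assumptions made for illustration), predicting the mean scores 0, a perfect prediction scores 1, and a model that is worse than the mean scores below 0.

```python
import numpy as np

def r_squared(y_true, y_hat):
    """1 - (residual sum of squares / total sum of squares)."""
    ss_res = ((y_true - y_hat) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

# Toy values, assumed for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
perfect = y_true.copy()                         # every prediction exact
reversed_pred = y_true[::-1].copy()             # systematically wrong

r2_baseline = r_squared(y_true, baseline)
r2_perfect = r_squared(y_true, perfect)
r2_reversed = r_squared(y_true, reversed_pred)

print("baseline:", r2_baseline)
print("perfect :", r2_perfect)
print("reversed:", r2_reversed)
```

A negative value is not a bug: it means the predictions have a larger squared error than simply predicting the mean would.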

# <div class="h3">Adjusted R²</div>
<a id="m5"></a>
[Back to Table of Contents](#top)

[The End](#theend)

![](https://miro.medium.com/max/955/0*1jxDmwoJF8R4tOVq.png)
source: http://www.haghish.com/statistics/stata-blog/stata-programming/adjusted_R_squared.php

> A model that performs equal to the baseline gives an R-Squared of 0. The better the model, the higher the R-Squared value. The best model, with all predictions correct, gives an R-Squared of 1. However, when new features are added to the model, the R-Squared value either increases or remains the same. R-Squared does not penalize adding features that add no value to the model. An improved version of R-Squared is therefore the adjusted R-Squared.


In [None]:
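def R2_ADJ(predict, target, k):
    # k is the number of predictors, n the number of observations
    r2 = R2(predict, target)
    n = len(target)
    return 1 - (1 - r2) * (n - 1) / (n - (k + 1))

# Number of feature columns before one-hot encoding
k = len(features.columns)
print ('R2 adjusted: ' + str(R2_ADJ(y_pred,y,k)) )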
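
The claim that R-Squared never decreases when a feature is added can be illustrated on synthetic data. This is a sketch under stated assumptions: the data is randomly generated with a fixed seed, the fit uses `np.linalg.lstsq` rather than the `LinearRegression` model from this notebook, and `fit_r2` and `adjusted_r2` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
noise_feature = rng.normal(size=n)            # unrelated to y by construction
y = 3.0 * x + rng.normal(scale=0.5, size=n)

def fit_r2(X, y):
    """In-sample R² of an ordinary least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

def adjusted_r2(r2, n, k):
    """Penalize R² for the number of predictors k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2_one = fit_r2(x.reshape(-1, 1), y)
r2_two = fit_r2(np.column_stack([x, noise_feature]), y)

print("R2     :", r2_one, "->", r2_two)
print("R2 adj :", adjusted_r2(r2_one, n, 1), "->", adjusted_r2(r2_two, n, 2))
```

The plain R² of the two-feature fit cannot be lower than that of the one-feature fit, even though the second feature is pure noise. The adjusted value always sits below the plain R² and can go down when the extra feature does not explain enough to pay for its penalty.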

# <div class="h3">Residual Sum of Squares (RSS)</div>
<a id="m6"></a>
[Back to Table of Contents](#top)

[The End](#theend)

The residual sum of squares is the numerator of the fraction in the R² metric, R² = 1 − RSS/TSS, where TSS is the total sum of squares. It takes the distances between observed and predicted values (the residuals), squares them, and sums them all together. Ordinary least squares regression is designed to minimize exactly this value.

$$RSS = \sum_{i=0}^{n-1} (y_i - \hat{y}_i)^2$$
 
RSS is not very interpretable on its own, because it is the sum of many (potentially very large) residuals. For this reason it is rarely used as a metric, but because it is so important to regression, it's often included in statistical fit assays.

In [None]:
def rss_score(y, y_pred):
    return np.sum((y - y_pred)**2)
rss = rss_score(y, y_pred) 
print ('Residual Sum of Squares (RSS): ' + str( rss ) )
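
RSS ties the other metrics together: dividing it by the number of observations gives MSE, and dividing it by the total sum of squares gives the fraction used in R². A minimal check on toy numbers (the arrays are assumptions made for illustration):

```python
import numpy as np

# Toy values, assumed for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

n = len(y_true)
rss_value = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
tss_value = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

mse_from_rss = rss_value / n                       # MSE = RSS / n
r2_from_rss = 1 - rss_value / tss_value            # R² = 1 - RSS / TSS

print("RSS:", rss_value)
print("MSE:", mse_from_rss)
print("R2 :", r2_from_rss)
```

This also shows why RSS is hard to read on its own: it grows with the number of observations, whereas MSE and R² normalize that growth away.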

References:
- [Metrics and Python](https://towardsdatascience.com/metrics-and-python-850b60710e0c)
- [Understanding Regression Error Metrics](https://www.dataquest.io/blog/understanding-regression-error-metrics/)
- [The Absolute Best Way to Measure Forecast Accuracy](https://www.axsiumgroup.com/the-absolute-best-way-to-measure-forecast-accuracy-2/)


[Back to Table of Contents](#top)

<a class="anchor" id="theend"></a>
# Final