## Regression Metrics

### Mean Absolute Error

Mean Absolute Error (MAE) is the average of the absolute differences between the original values and the predicted values. It gives an idea of how wrong the predictions were, but not of their direction (over- or under-prediction).

$$\Large MAE = \frac{1}{N} \sum\limits_{i = 1}^{N} \mid {y_i - \hat{y_i}} \mid$$

$$\Large MAE = \Large Average(\mid Actual\ target\ value - Predicted\ target\ value \mid)$$

In [2]:
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
dataframe = pd.read_csv(url, sep=r'\s+', names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
seed = 7
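# random_state only matters when shuffle=True; recent scikit-learn raises an error
# if it is set without shuffling, so it is omitted here (the folds are unchanged)
kfold = model_selection.KFold(n_splits=10)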
model = LinearRegression()
scoring = 'neg_mean_absolute_error'
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MAE: %.3f (%.3f)" % (results.mean(), results.std()))

MAE: -4.005 (2.084)
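
The reported value is negative because scikit-learn's `neg_mean_absolute_error` scorer returns the negated MAE, so that a larger score is always better; the MAE itself is about 4.005. As a minimal sketch of the formula, assuming small made-up arrays `y_true` and `y_pred` (illustrative values, not taken from the housing data), MAE can be computed by hand with NumPy:

```python
import numpy as np

# Illustrative values only (assumed for this sketch, not from the housing data)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred

# Without the absolute value, positive and negative errors cancel out
mean_signed_error = errors.mean()

# MAE: the average magnitude of the errors
mae = np.abs(errors).mean()

# The 'neg_mean_absolute_error' scorer reports the MAE with its sign flipped
neg_mae = -mae

print("Mean signed error: %.3f" % mean_signed_error)
print("MAE: %.3f" % mae)
print("Negated MAE: %.3f" % neg_mae)
```

The signed errors average to -0.25 while the MAE is 0.5, which is why the absolute value in the formula matters.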


### Mean Squared Error

Mean Squared Error (MSE) is quite similar to Mean Absolute Error. The only difference is that MSE takes the average of the squared differences between the original values and the predicted values, so large errors are penalized more heavily than small ones.

$$\Large MSE = \frac{1}{N} \sum\limits_{i = 1}^{N} ( {y_i - \hat{y_i}} )^2$$

$$\Large MSE = \Large Average((Actual\ target\ value - Predicted\ target\ value)^2)$$

In [32]:
scoring = 'neg_mean_squared_error'
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MSE: %.3f (%.3f)" % (results.mean(), results.std()))

MSE: -34.705 (45.574)
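
As with MAE, the scorer negates the result, so the MSE itself is about 34.705. The sketch below, which assumes small made-up arrays rather than the housing data, computes MSE by hand and shows how squaring shifts weight toward the largest error. It also takes the square root (often called RMSE) to bring the value back to the units of the target.

```python
import numpy as np

# Illustrative values only (assumed for this sketch, not from the housing data)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

abs_errors = np.abs(y_true - y_pred)
squared_errors = (y_true - y_pred) ** 2

# MSE: average of the squared errors
mse = squared_errors.mean()

# Square root of the MSE, expressed in the same units as the target
rmse = np.sqrt(mse)

# Share of the total error contributed by the single largest error
share_abs = abs_errors.max() / abs_errors.sum()
share_squared = squared_errors.max() / squared_errors.sum()

print("MSE: %.3f" % mse)
print("RMSE: %.3f" % rmse)
print("Largest error share (absolute): %.3f" % share_abs)
print("Largest error share (squared): %.3f" % share_squared)
```

The largest error accounts for half of the total absolute error but two thirds of the total squared error.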


### R Squared

The $R^2$ (or R Squared) metric indicates how well the model captures the variance in the data. A value of 1 means the variance is fully captured, and 0 means the model does no better than always predicting the mean target value. The score can also be negative when the model does worse than that baseline.

$$(Residual\ Sum\ of\ Squares)\ \Large RSS = \Large \sum\limits_{i = 1}^{N} (Actual\ target\ value - Predicted\ target\ value)^2$$

$$(Total\ Sum\ of\ Squares)\ \Large TSS = \Large \sum\limits_{i = 1}^{N} (Actual\ target\ value - Mean\ target\ value)^2$$

$$\Large R^2 = \Large (1 - \frac{RSS}{TSS})$$

In [33]:
scoring = 'r2'
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("R^2: %.3f (%.3f)" % (results.mean(), results.std()))

R^2: 0.203 (0.595)
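
The following sketch computes $R^2$ from the residual and total sums of squares. The helper name `r_squared` and the arrays are assumptions made for illustration; they are not part of scikit-learn or the housing data. It also shows the two reference points: predicting the mean gives 0, and sufficiently poor predictions give a negative score.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Hypothetical helper: R^2 = 1 - RSS / TSS."""
    rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - rss / tss

# Illustrative values only (assumed for this sketch)
y_true = np.array([3.0, -0.5, 2.0, 7.0])

y_good = np.array([2.5, 0.0, 2.0, 8.0])    # close to the targets
y_mean = np.full_like(y_true, y_true.mean())  # always predict the mean
y_bad = np.array([7.0, 2.0, -0.5, 3.0])    # targets in reversed order

r2_good = r_squared(y_true, y_good)
r2_mean = r_squared(y_true, y_mean)
r2_bad = r_squared(y_true, y_bad)

print("Good predictions: %.3f" % r2_good)
print("Mean baseline:    %.3f" % r2_mean)
print("Poor predictions: %.3f" % r2_bad)
```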


## Exercise

In [None]:
# calculate all the metrics for the simple linear regression exercise

In [18]:
# Load the pggm dataset
pggm = pd.read_csv('http://bit.ly/PGGM_dataset')

In [19]:
pggm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38304 entries, 0 to 38303
Data columns (total 18 columns):
Identifier               38304 non-null object
Name                     38304 non-null object
Period                   38304 non-null object
Period_YYYYMMDD          38304 non-null int64
Ticker                   38304 non-null object
Universe_Returns_F1W     38300 non-null float64
Universe_Returns_F4W     38296 non-null float64
Universe_Returns_F12W    38268 non-null float64
Weight                   38304 non-null float64
GICS_Sector              38304 non-null object
GICS_Ind_Grp             38304 non-null object
Market_Cap_USD           38304 non-null float64
Price_USD                38304 non-null float64
NTM_EP                   38246 non-null float64
LTM_ROA                  38245 non-null float64
BP                       38288 non-null float64
LTM_EP                   38215 non-null float64
5Y_Sales_Growth          38197 non-null float64
dtypes: float64(11), int64(1), obje

In [20]:
# Remove missing data
pggm = pggm.dropna()

In [21]:
pggm.head()

| | Identifier | Name | Period | Period_YYYYMMDD | Ticker | Universe_Returns_F1W | Universe_Returns_F4W | Universe_Returns_F12W | Weight | GICS_Sector | GICS_Ind_Grp | Market_Cap_USD | Price_USD | NTM_EP | LTM_ROA | BP | LTM_EP | 5Y_Sales_Growth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17290810 | Cintas Corporation | 12/31/2014 | 20141231 | CTAS-US | -2.517855 | -0.121111 | 4.156041 | 0.000402 | Industrials | Commercial & Professional Services | 7761.12 | 78.44 | 0.044387 | 9.089989 | 0.246962 | 0.042708 | 4.718765 |
| 1 | 80589M10 | SCANA Corporation | 12/31/2014 | 20141231 | SCG-US | 2.036428 | 6.307948 | -8.426744 | 0.000422 | Utilities | Utilities | 8151.001 | 60.4 | 0.061397 | 3.472852 | 0.572871 | 0.062748 | -0.949881 |
| 2 | 50241310 | L-3 Communications Holdings Inc. | 12/31/2014 | 20141231 | LLL-US | -0.396162 | -1.28358 | -0.753021 | 0.000563 | Industrials | Capital Goods | 10883.341 | 126.21 | 0.060554 | 4.744629 | 0.570099 | 0.059821 | -4.316938 |
| 3 | 91301710 | United Technologies Corporation | 12/31/2014 | 20141231 | UTX-US | -1.973909 | 1.669562 | 1.815629 | 0.005174 | Industrials | Capital Goods | 99942.99 | 115.0 | 0.062889 | 6.805052 | 0.325584 | 0.059088 | 3.083364 |
| 4 | 92939U10 | Wisconsin Energy Corporation | 12/31/2014 | 20141231 | WEC-US | 1.118696 | 7.679176 | -6.160975 | 0.000616 | Utilities | Utilities | 11893.872 | 52.74 | 0.051359 | 4.201019 | 0.369798 | 0.050815 | 1.421392 |


In [22]:
# Feature: Weight (column 8); target: 5Y_Sales_Growth (column 17)
X = pggm.iloc[:, 8]
y = pggm.iloc[:, 17]

In [23]:
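# random_state only matters when shuffle=True; recent scikit-learn raises an error
# if it is set without shuffling, so it is omitted here (the folds are unchanged)
kfold = model_selection.KFold(n_splits=10)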
model = LinearRegression()
scoring = 'neg_mean_absolute_error'
results = model_selection.cross_val_score(model, X.values.reshape(-1, 1), y.values.reshape(-1, 1), cv=kfold, scoring=scoring)
print("MAE: %.3f (%.3f)" % (results.mean(), results.std()))

MAE: -22.910 (8.707)


In [24]:
scoring = 'neg_mean_squared_error'
results = model_selection.cross_val_score(model, X.values.reshape(-1, 1), y.values.reshape(-1, 1), cv=kfold, scoring=scoring)
print("MSE: %.3f (%.3f)" % (results.mean(), results.std()))

MSE: -98321.894 (124995.824)


In [25]:
scoring = 'r2'
results = model_selection.cross_val_score(model, X.values.reshape(-1, 1), y.values.reshape(-1, 1), cv=kfold, scoring=scoring)
print("R^2: %.3f (%.3f)" % (results.mean(), results.std()))

R^2: -0.097 (0.119)
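
The three exercise cells rerun the same cross-validation once per metric. As one possible alternative, scikit-learn's `cross_validate` accepts a list of scorers and evaluates them in a single pass. The sketch below uses synthetic data (`X_demo`, `y_demo`) as a stand-in, which is an assumption made so the example runs on its own; it is not the PGGM dataset, and shuffling is enabled here so that `random_state` is valid.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumed for this sketch, not the PGGM dataset)
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# shuffle=True makes random_state meaningful
cv = KFold(n_splits=10, shuffle=True, random_state=7)
scoring = ["neg_mean_absolute_error", "neg_mean_squared_error", "r2"]

# One cross-validation run that evaluates every scorer on each fold
scores = cross_validate(LinearRegression(), X_demo, y_demo, cv=cv, scoring=scoring)

for name in scoring:
    fold_scores = scores["test_" + name]
    print("%s: %.3f (%.3f)" % (name, fold_scores.mean(), fold_scores.std()))
```

Each entry of `scores` holds one value per fold under the key `test_<scorer name>`, and the error metrics keep scikit-learn's negated sign.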
