# Linear Regression
The purpose of this notebook is to practice training (also known as fitting), interpreting and evaluating linear regression predictive models. 
We will use the Python packages pandas, matplotlib and scikit-learn.
Besides the material presented in this notebook, please also read this [notebook](http://www.dataschool.io/linear-regression-in-python/) that is very well written, contains many useful details and gives pointers to further reading. 

Training a linear regression model means estimating a set of weights (one weight per feature, plus an extra weight called the bias or the intercept) on a dataset called the training set. 

The model estimated is a linear model taking the form:

$target\_feature = w_0 + w_1 \cdot feature_1 + w_2 \cdot feature_2 + \dots + w_n \cdot feature_n$

The learned model can be used to predict the target feature for new examples where we know the descriptive features but not the target feature. Such examples are called test examples or test data. In this notebook we will see the difference between evaluating the model on the training data (the resulting error is called the in-sample error) and evaluating it on the test data (the out-of-sample error). We should always evaluate a model on a second data sample that was not used during training. Otherwise a model that has overfitted, i.e. memorised, the training data can look much better than it really is.

## Reading data

In [302]:
# Library Imports.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Allows plots to appear directly in the notebook.
%matplotlib inline

from patsy import dmatrices
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_validate, cross_val_score

In [303]:
# Read a CSV dataset with 10 example offices into a dataframe.
# The data is described by 5 features (4 descriptive features: Size, Floor, BroadbandRate, EnergyRating;
# the target feature: RentalPrice).


# Read csv file into a dataframe.
df = pd.read_csv('Offices.csv')
df.head(10)

   ID  Size  Floor  BroadbandRate EnergyRating  RentalPrice
0   1   500      4              8            C          320
1   2   550      7             50            A          380
2   3   620      9              7            A          400
3   4   630      5             24            B          390
4   5   665      8            100            C          385
5   6   700      4              8            B          410
6   7   770     10              7            B          480
7   8   880     12             50            A          600
8   9   920     14              8            C          570
9  10  1000      9             24            B          620


In [304]:
# Print the average RentalPrice in our dataset.
# We could use this as a very simple baseline prediction model.
# A better prediction model should at least improve on this baseline model.
df.RentalPrice.mean()

455.5
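To make the baseline concrete, the sketch below scores the "always predict the mean" model on the ten rental prices from the table above. It uses plain NumPy instead of the notebook's dataframe so that it runs on its own; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# Rental prices copied from the Offices table above.
rental_price = np.array([320, 380, 400, 390, 385, 410, 480, 600, 570, 620], dtype=float)

# The baseline predicts the mean price for every office.
baseline_prediction = rental_price.mean()
baseline_errors = rental_price - baseline_prediction

baseline_mae = np.abs(baseline_errors).mean()
baseline_mse = (baseline_errors ** 2).mean()
baseline_rmse = baseline_mse ** 0.5

print("Baseline prediction:", baseline_prediction)
print("Baseline MAE:", baseline_mae)
print("Baseline RMSE:", baseline_rmse)
```

The baseline MAE works out to 89.6, so a regression model is only worth keeping if its error is clearly below that.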

In [305]:
# Print the feature types in our dataset.
df.dtypes

ID                int64
Size              int64
Floor             int64
BroadbandRate     int64
EnergyRating     object
RentalPrice       int64
dtype: object

### Preparing the data

In [306]:
# Prepare the descriptive features
print(df.head(10))
#cont_features = ['Size']
cont_features = ['Size', 'Floor', 'BroadbandRate']

X = df[cont_features]
y = df.RentalPrice

print("\nDescriptive features in X:\n", X)
print("\nTarget feature in y:\n", y)

   ID  Size  Floor  BroadbandRate EnergyRating  RentalPrice
0   1   500      4              8            C          320
1   2   550      7             50            A          380
2   3   620      9              7            A          400
3   4   630      5             24            B          390
4   5   665      8            100            C          385
5   6   700      4              8            B          410
6   7   770     10              7            B          480
7   8   880     12             50            A          600
8   9   920     14              8            C          570
9  10  1000      9             24            B          620

Descriptive features in X:
    Size  Floor  BroadbandRate
0   500      4              8
1   550      7             50
2   620      9              7
3   630      5             24
4   665      8            100
5   700      4              8
6   770     10              7
7   880     12             50
8   920     14              8
9  1000    

## Multiple linear regression (using more than one feature)
### Training the model

In [307]:
# Train (aka fit) a model using all continuous features.

multiple_linreg = LinearRegression().fit(X[cont_features], y)

# Print the weights learned for each feature.
print("Features: \n", cont_features)
print("Coefficients: \n", multiple_linreg.coef_)
print("\nIntercept: \n", multiple_linreg.intercept_)

Features: 
 ['Size', 'Floor', 'BroadbandRate']
Coefficients: 
 [ 0.54873985  4.96354677 -0.06209515]

Intercept: 
 19.561558897449345
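As a quick check on how these weights are used, the sketch below plugs the first office in the dataset (Size=500, Floor=4, BroadbandRate=8) into the linear formula by hand. The weights are copied from the printed output, so they are rounded; the variable names are illustrative only.

```python
import numpy as np

# Weights as printed above, in the order Size, Floor, BroadbandRate.
intercept = 19.561558897449345
weights = np.array([0.54873985, 4.96354677, -0.06209515])

# Descriptive features of the first office: Size=500, Floor=4, BroadbandRate=8.
office = np.array([500, 4, 8])

# w_0 + w_1 * Size + w_2 * Floor + w_3 * BroadbandRate
predicted_price = intercept + weights @ office
print(predicted_price)
```

Up to rounding of the printed weights, this is the value that `predict` returns for that row (about 313.29).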


### Testing the model
We now use the trained model to predict the target feature RentalPrice, given the descriptive features Size, Floor and BroadbandRate. Note that here the predictions are made on the same data the model was trained on.

In [308]:
multiple_linreg_predictions = multiple_linreg.predict(X[cont_features])

print("\nPredictions with multiple linear regression: \n")
actual_vs_predicted_multiplelinreg = pd.concat([y, pd.DataFrame(multiple_linreg_predictions, columns=['Predicted'])], axis=1)
print(actual_vs_predicted_multiplelinreg)


Predictions with multiple linear regression: 

   RentalPrice   Predicted
0          320  313.288908
1          380  353.008544
2          400  404.017519
3          390  388.595112
4          385  417.972416
5          410  423.036877
6          480  491.292042
7          600  558.910427
8          570  593.395111
9          620  611.483043


In [309]:
# Compute the prediction error (actual - predicted) for each example.
# sklearn provides functions for this, but the manual version below shows how the metrics are built.
prediction_errors = y - multiple_linreg_predictions
print("Actual - Predicted:\n", prediction_errors)
print("\n(Actual - Predicted) squared:\n", prediction_errors**2)

Actual - Predicted:
 0     6.711092
1    26.991456
2    -4.017519
3     1.404888
4   -32.972416
5   -13.036877
6   -11.292042
7    41.089573
8   -23.395111
9     8.516957
Name: RentalPrice, dtype: float64

(Actual - Predicted) squared:
 0      45.038756
1     728.538680
2      16.140455
3       1.973709
4    1087.180212
5     169.960169
6     127.510219
7    1688.352971
8     547.331226
9      72.538563
Name: RentalPrice, dtype: float64


In [310]:
# Print the Mean Squared Error and the Root Mean Squared Error of the model on the training set
mse = (prediction_errors ** 2).mean()
rmse = mse ** 0.5

print("\nMean Squared Error:\n", mse)
print("\nRoot Mean Squared Error:\n", rmse)


Mean Squared Error:
 448.4564959646251

Root Mean Squared Error:
 21.1767914464072


In [311]:
print("|Actual - Predicted|:\n", abs(prediction_errors))

|Actual - Predicted|:
 0     6.711092
1    26.991456
2     4.017519
3     1.404888
4    32.972416
5    13.036877
6    11.292042
7    41.089573
8    23.395111
9     8.516957
Name: RentalPrice, dtype: float64


In [312]:
# Print the Mean Absolute Error of the model on the training set
mae = abs(prediction_errors).mean()
print("\nMean Absolute Error:\n", mae)


Mean Absolute Error:
 16.942793036576184


In [313]:
# This function is used repeatedly to compute all metrics.
def printMetrics(testActualVal, predictions):
    # Regression evaluation measures.
    print('\n==============================================================================')
    print("MAE: ", metrics.mean_absolute_error(testActualVal, predictions))
    #print("MSE: ", metrics.mean_squared_error(testActualVal, predictions))
    print("RMSE: ", metrics.mean_squared_error(testActualVal, predictions)**0.5)
    print("R2: ", metrics.r2_score(testActualVal, predictions))
        

In [314]:
printMetrics(y, multiple_linreg_predictions)


MAE:  16.942793036576184
RMSE:  21.1767914464072
R2:  0.9552092191101276
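R2 compares the model's squared error with that of the mean-only baseline: R2 = 1 - SS_res / SS_tot. The sketch below recomputes it by hand from the actual and predicted prices printed earlier (the predictions are rounded as displayed), as a way of seeing what `metrics.r2_score` measures.

```python
import numpy as np

# Actual and predicted prices copied from the printed table above.
actual = np.array([320, 380, 400, 390, 385, 410, 480, 600, 570, 620], dtype=float)
predicted = np.array([313.288908, 353.008544, 404.017519, 388.595112, 417.972416,
                      423.036877, 491.292042, 558.910427, 593.395111, 611.483043])

ss_res = ((actual - predicted) ** 2).sum()      # squared error left after using the model
ss_tot = ((actual - actual.mean()) ** 2).sum()  # squared error of always predicting the mean
r2 = 1 - ss_res / ss_tot
print("R2:", r2)
```

An R2 near 1 means the model removes most of the baseline's error; a value of 0 means it does no better than predicting the mean, and negative values mean it does worse.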


# Training with continuous and categorical features

In [315]:
# Encode the categorical feature EnergyRating as 0/1 dummy columns,
# so that the model can be trained on both continuous and categorical features.
EnergyRating_dummies = pd.get_dummies(df['EnergyRating'], prefix='EnergyRating', drop_first=True)
print("EnergyRatingDummies:", EnergyRating_dummies)

categ_features = EnergyRating_dummies.columns.values.tolist()

features = cont_features + categ_features
print("\nCont features: ", cont_features)
print("Categ features: ", categ_features)
print("Features: ", features)

EnergyRatingDummies:    EnergyRating_B  EnergyRating_C
0               0               1
1               0               0
2               0               0
3               1               0
4               0               1
5               1               0
6               1               0
7               0               0
8               0               1
9               1               0

Cont features:  ['Size', 'Floor', 'BroadbandRate']
Categ features:  ['EnergyRating_B', 'EnergyRating_C']
Features:  ['Size', 'Floor', 'BroadbandRate', 'EnergyRating_B', 'EnergyRating_C']
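The following pure-Python sketch mimics what `drop_first=True` does for this column, as an illustration and not as pandas' actual implementation: the levels are sorted, the first one (A) is dropped and becomes the reference level, and each remaining level gets its own 0/1 column. Dropping one level avoids a redundant column, since the three indicators would always sum to 1 and duplicate the intercept.

```python
# EnergyRating values copied from the dataset, in row order.
ratings = ["C", "A", "A", "B", "C", "B", "B", "A", "C", "B"]

# Sort the distinct levels and drop the first one ("A"), which becomes the reference level.
levels = sorted(set(ratings))
kept_levels = levels[1:]

# One 0/1 indicator column per remaining level.
dummies = {
    "EnergyRating_" + level: [int(rating == level) for rating in ratings]
    for level in kept_levels
}

for name, column in dummies.items():
    print(name, column)
```

An office with rating A is then represented by zeros in both columns.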


In [316]:
df_all = pd.concat([df, EnergyRating_dummies], axis=1)
print(df_all)

df_all = df_all.drop('EnergyRating', axis = 1)
print(df_all)

   ID  Size  Floor  BroadbandRate EnergyRating  RentalPrice  EnergyRating_B  \
0   1   500      4              8            C          320               0   
1   2   550      7             50            A          380               0   
2   3   620      9              7            A          400               0   
3   4   630      5             24            B          390               1   
4   5   665      8            100            C          385               0   
5   6   700      4              8            B          410               1   
6   7   770     10              7            B          480               1   
7   8   880     12             50            A          600               0   
8   9   920     14              8            C          570               0   
9  10  1000      9             24            B          620               1   

   EnergyRating_C  
0               1  
1               0  
2               0  
3               0  
4               1  
5         

In [317]:
#We can also do this directly for all categorical features
df = pd.get_dummies(df, drop_first=True)
df

   ID  Size  Floor  BroadbandRate  RentalPrice  EnergyRating_B  EnergyRating_C
0   1   500      4              8          320               0               1
1   2   550      7             50          380               0               0
2   3   620      9              7          400               0               0
3   4   630      5             24          390               1               0
4   5   665      8            100          385               0               1
5   6   700      4              8          410               1               0
6   7   770     10              7          480               1               0
7   8   880     12             50          600               0               0
8   9   920     14              8          570               0               1
9  10  1000      9             24          620               1               0


In [318]:
X = df_all[features]
y = df_all.RentalPrice

print("\nDescriptive features in X:\n", X)
print("\nTarget feature in y:\n", y)


Descriptive features in X:
    Size  Floor  BroadbandRate  EnergyRating_B  EnergyRating_C
0   500      4              8               0               1
1   550      7             50               0               0
2   620      9              7               0               0
3   630      5             24               1               0
4   665      8            100               0               1
5   700      4              8               1               0
6   770     10              7               1               0
7   880     12             50               0               0
8   920     14              8               0               1
9  1000      9             24               1               0

Target feature in y:
 0    320
1    380
2    400
3    390
4    385
5    410
6    480
7    600
8    570
9    620
Name: RentalPrice, dtype: int64


In [319]:
# Train (aka fit) a model using all continuous and categorical features.

multiple_linreg = LinearRegression().fit(X, y)

# Print the weights learned for each feature.
#print("Features: \n", features)
#print("Coefficients: \n", multiple_linreg.coef_)
print("\nIntercept: \n", multiple_linreg.intercept_)
print("Features and coefficients:", list(zip(features, multiple_linreg.coef_)))


Intercept: 
 25.08094730527324
Features and coefficients: [('Size', 0.6431537548899605), ('Floor', 0.016720058467198474), ('BroadbandRate', -0.13248786053570333), ('EnergyRating_B', -46.555463950825725), ('EnergyRating_C', -42.094850186464285)]
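To see how the dummy weights are read, the sketch below applies the printed weights to one office (Size=500, Floor=4, BroadbandRate=8) under each of the three energy ratings. `predict_price` is a hypothetical helper written for this illustration, not a function from the notebook or from scikit-learn.

```python
# Intercept and weights copied from the output above.
intercept = 25.08094730527324
weights = {
    "Size": 0.6431537548899605,
    "Floor": 0.016720058467198474,
    "BroadbandRate": -0.13248786053570333,
    "EnergyRating_B": -46.555463950825725,
    "EnergyRating_C": -42.094850186464285,
}

def predict_price(size, floor, broadband_rate, energy_rating):
    """Hypothetical helper: apply the linear model to one office."""
    price = (intercept
             + weights["Size"] * size
             + weights["Floor"] * floor
             + weights["BroadbandRate"] * broadband_rate)
    # Rating A is the reference level, so it adds nothing to the price.
    if energy_rating in ("B", "C"):
        price += weights["EnergyRating_" + energy_rating]
    return price

# The same office under each energy rating.
price_a = predict_price(500, 4, 8, "A")
price_b = predict_price(500, 4, 8, "B")
price_c = predict_price(500, 4, 8, "C")
print(price_a, price_b, price_c)
print("B minus A:", price_b - price_a)
```

The difference between ratings B and A equals the `EnergyRating_B` weight, so each dummy weight is the predicted price shift relative to the reference rating A with all other features held fixed. With only ten offices these estimates should be read with caution.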


In [320]:
multiple_linreg_predictions = multiple_linreg.predict(X[features])

print("\nPredictions with multiple linear regression: \n")
actual_vs_predicted_multiplelinreg = pd.concat([df, pd.DataFrame(multiple_linreg_predictions, columns=['Predicted'])], axis=1)
print(actual_vs_predicted_multiplelinreg)


Predictions with multiple linear regression: 

   ID  Size  Floor  BroadbandRate  RentalPrice  EnergyRating_B  \
0   1   500      4              8          320               0   
1   2   550      7             50          380               0   
2   3   620      9              7          400               0   
3   4   630      5             24          390               1   
4   5   665      8            100          385               0   
5   6   700      4              8          410               1   
6   7   770     10              7          480               1   
7   8   880     12             50          600               0   
8   9   920     14              8          570               0   
9  10  1000      9             24          620               1   

   EnergyRating_C   Predicted  
0               1  303.569952  
1               0  372.308160  
2               0  423.059341  
3               0  380.616241  
4               1  397.568319  
5               0  427.740089  
6

In [321]:
printMetrics(y, multiple_linreg_predictions)


MAE:  11.445895610626872
RMSE:  13.128429922471321
R2:  0.9827855205144458


# Evaluation with train/test split

In [322]:
# Split the data into train and test sets.
# Take a random 30% of the samples as test data and the rest as training data.
# Note that this training set is very small, so the model will not be very reliable.
# Without random_state the split changes on every run; the commented-out line below fixes it.
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

print("Training data:\n", pd.concat([X_train, y_train], axis=1))
print("\nTest data:\n", pd.concat([X_test, y_test], axis=1))

Training data:
    Size  Floor  BroadbandRate  EnergyRating_B  EnergyRating_C  RentalPrice
4   665      8            100               0               1          385
7   880     12             50               0               0          600
8   920     14              8               0               1          570
0   500      4              8               0               1          320
5   700      4              8               1               0          410
1   550      7             50               0               0          380
3   630      5             24               1               0          390

Test data:
    Size  Floor  BroadbandRate  EnergyRating_B  EnergyRating_C  RentalPrice
2   620      9              7               0               0          400
9  1000      9             24               1               0          620
6   770     10              7               1               0          480
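The idea behind a random split can be sketched in a few lines of NumPy: shuffle the row indices, then cut off the test share. This is an illustration under simplifying assumptions, not scikit-learn's implementation, and the seed value 0 is an arbitrary choice.

```python
import math
import numpy as np

n_samples = 10
test_size = 0.3

# Round the test share up (3 of 10 rows here); the rest is used for training.
n_test = math.ceil(test_size * n_samples)

# A fixed seed makes the shuffle repeatable.
rng = np.random.default_rng(0)
shuffled = rng.permutation(n_samples)

test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]
print("Train rows:", np.sort(train_idx))
print("Test rows:", np.sort(test_idx))
```

Fixing the seed plays the same role as passing `random_state` to `train_test_split`: the same rows end up in the test set on every run, which makes results comparable between runs.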


In [323]:
# Train on the training sample and test on the test sample.
linreg = LinearRegression().fit(X_train, y_train)
# Print the weights learned for each feature.
#print(linreg.coef_)
print("Features and coefficients:", list(zip(features, linreg.coef_)))

Features and coefficients: [('Size', 0.5661564996424642), ('Floor', 2.5121997916397354), ('BroadbandRate', -0.3164931433051256), ('EnergyRating_B', -59.89194293205312), ('EnergyRating_C', -55.17029247157595)]


In [324]:
# Predicted price on training set
train_predictions = linreg.predict(X_train)
print("Actual values of training:\n", y_train)
print("Predictions on training:", train_predictions)
printMetrics(y_train, train_predictions)



Actual values of training:
 4    385
7    600
8    570
0    320
5    410
1    380
3    390
Name: RentalPrice, dtype: int64
Predictions on training: [386.92892569 589.69632192 575.48940104 312.58167327 421.09132274
 390.30367808 378.90867726]

MAE:  8.23237929940294
RMSE:  8.84188323226076
R2:  0.9918119780793977


In [325]:
# Predicted price on test set
test_predictions = linreg.predict(X_test)
print("Actual values of test:\n", y_test)
print("Predictions on test:", test_predictions)
printMetrics(y_test, test_predictions)

Actual values of test:
 2    400
9    620
6    480
Name: RentalPrice, dtype: int64
Predictions on test: [448.5682378  598.4353813  476.11196961]

MAE:  24.673628965960273
RMSE:  30.76265746652414
R2:  0.8855232547093517
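The gap between in-sample and out-of-sample error can be recomputed directly from the two outputs above. The sketch below does this for the MAE, using the printed (rounded) predictions; `mean_absolute_error` here is a small local helper, not the scikit-learn function of the same name.

```python
import numpy as np

# Values copied from the two outputs above.
y_train_actual = np.array([385, 600, 570, 320, 410, 380, 390], dtype=float)
y_train_pred = np.array([386.92892569, 589.69632192, 575.48940104, 312.58167327,
                         421.09132274, 390.30367808, 378.90867726])
y_test_actual = np.array([400, 620, 480], dtype=float)
y_test_pred = np.array([448.5682378, 598.4353813, 476.11196961])

def mean_absolute_error(actual, predicted):
    """Average size of the prediction error, ignoring its sign."""
    return np.abs(actual - predicted).mean()

in_sample_mae = mean_absolute_error(y_train_actual, y_train_pred)
out_of_sample_mae = mean_absolute_error(y_test_actual, y_test_pred)

print("In-sample MAE:", in_sample_mae)
print("Out-of-sample MAE:", out_of_sample_mae)
print("Ratio:", out_of_sample_mae / in_sample_mae)
```

For this particular split the error on unseen offices is about three times the training error, which is why the in-sample error alone gives too optimistic a picture. A different random split would give different numbers.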


# Evaluation with cross-validation

In [326]:
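# List the scoring names accepted by the `scoring` parameter.
# Note: metrics.SCORERS was removed in scikit-learn 1.3; on recent versions use
# sorted(metrics.get_scorer_names()) instead.
sorted(metrics.SCORERS.keys())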

['accuracy',
 'adjusted_mutual_info_score',
 'adjusted_rand_score',
 'average_precision',
 'balanced_accuracy',
 'brier_score_loss',
 'completeness_score',
 'explained_variance',
 'f1',
 'f1_macro',
 'f1_micro',
 'f1_samples',
 'f1_weighted',
 'fowlkes_mallows_score',
 'homogeneity_score',
 'mutual_info_score',
 'neg_log_loss',
 'neg_mean_absolute_error',
 'neg_mean_squared_error',
 'neg_mean_squared_log_error',
 'neg_median_absolute_error',
 'normalized_mutual_info_score',
 'precision',
 'precision_macro',
 'precision_micro',
 'precision_samples',
 'precision_weighted',
 'r2',
 'recall',
 'recall_macro',
 'recall_micro',
 'recall_samples',
 'recall_weighted',
 'roc_auc',
 'v_measure_score']

In [327]:
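# sklearn scorers follow a "higher is better" convention, so the MAE is returned
# as a negative number; negate it to get the usual positive MAE for each fold.
scores = -cross_val_score(LinearRegression(), X, y, scoring='neg_mean_absolute_error', cv=5)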
scores

array([43.46049301, 28.52278234, 53.27445054, 21.51730931,  8.2908062 ])
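The sketch below spells out what 5-fold cross-validation does on this dataset, using only NumPy. It assumes five consecutive, unshuffled folds of two rows each (which is how an integer `cv` behaves for a regressor) and fits ordinary least squares with `np.linalg.lstsq` in place of `LinearRegression`. It is a hand-written illustration and not scikit-learn's code.

```python
import numpy as np

# Columns: Size, Floor, BroadbandRate, EnergyRating_B, EnergyRating_C (copied from X above).
X = np.array([
    [500, 4, 8, 0, 1],
    [550, 7, 50, 0, 0],
    [620, 9, 7, 0, 0],
    [630, 5, 24, 1, 0],
    [665, 8, 100, 0, 1],
    [700, 4, 8, 1, 0],
    [770, 10, 7, 1, 0],
    [880, 12, 50, 0, 0],
    [920, 14, 8, 0, 1],
    [1000, 9, 24, 1, 0],
], dtype=float)
y = np.array([320, 380, 400, 390, 385, 410, 480, 600, 570, 620], dtype=float)

# Prepend a column of ones so the first weight acts as the intercept.
X1 = np.column_stack([np.ones(len(X)), X])

# Five consecutive, unshuffled folds of two rows each.
folds = np.array_split(np.arange(len(X)), 5)

fold_mae = []
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
    # Ordinary least squares fit on the training rows only.
    w, *_ = np.linalg.lstsq(X1[train_idx], y[train_idx], rcond=None)
    predictions = X1[test_idx] @ w
    fold_mae.append(np.abs(y[test_idx] - predictions).mean())

print(np.round(fold_mae, 2))
```

Each row is used for testing exactly once and for training four times. Under the stated assumptions the five values would be expected to be close to the scores printed above.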

In [328]:
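# Use a new name for the list of scorers so that the imported sklearn `metrics` module is not overwritten.
scoring_metrics = ['neg_mean_absolute_error', 'neg_mean_squared_error', 'r2']
# return_train_score=True is needed on recent scikit-learn versions to get the train_* entries.
scores = cross_validate(LinearRegression(), X, y, scoring=scoring_metrics, cv=5, return_train_score=True)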
scores



{'fit_time': array([0.00531697, 0.00216484, 0.00251102, 0.00286531, 0.00175118]),
 'score_time': array([0.00350809, 0.002635  , 0.00352192, 0.00241876, 0.0023067 ]),
 'test_neg_mean_absolute_error': array([-43.46049301, -28.52278234, -53.27445054, -21.51730931,
         -8.2908062 ]),
 'train_neg_mean_absolute_error': array([ -6.81074616,  -7.61789661,  -5.77715676, -11.0244974 ,
        -13.38404809]),
 'test_neg_mean_squared_error': array([-2488.01107511, -1034.8429953 , -4195.344399  ,  -551.70794832,
         -131.38455052]),
 'train_neg_mean_squared_error': array([ -57.66367568,  -74.34399653,  -37.42755059, -145.42249629,
        -207.68520786]),
 'test_r2': array([ -1.76445675, -40.39371981, -25.85020415,   0.84674779,
          0.78978472]),
 'train_r2': array([0.99345631, 0.99345865, 0.99672407, 0.98450312, 0.96691717])}
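The strongly negative `test_r2` values look alarming, but they largely reflect the fact that each test fold holds only two offices. The sketch below rebuilds the per-fold R2 from the printed test MSE, assuming fold k contains rows 2k and 2k+1 (consecutive, unshuffled folds). When the two actual prices in a fold are close together, SS_tot is tiny and even a moderate error gives a very negative R2.

```python
import numpy as np

# Rental prices in row order; fold k is assumed to hold rows 2k and 2k+1.
y = np.array([320, 380, 400, 390, 385, 410, 480, 600, 570, 620], dtype=float)

# Per-fold test MSE (sign flipped back to positive), copied from the output above.
test_mse = np.array([2488.01107511, 1034.8429953, 4195.344399, 551.70794832, 131.38455052])

fold_r2 = []
for k, mse in enumerate(test_mse):
    fold_prices = y[2 * k: 2 * k + 2]
    ss_res = mse * len(fold_prices)                           # sum of squared errors on the fold
    ss_tot = ((fold_prices - fold_prices.mean()) ** 2).sum()  # spread of the fold's actual prices
    fold_r2.append(1 - ss_res / ss_tot)
    print(k, fold_prices, "SS_tot =", ss_tot, "R2 =", round(fold_r2[-1], 4))
```

The second fold, for example, contains the prices 400 and 390, so SS_tot is only 50. With so few test rows per fold, MAE and RMSE are easier to interpret than R2.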

In [329]:
sorted(scores.keys())

['fit_time',
 'score_time',
 'test_neg_mean_absolute_error',
 'test_neg_mean_squared_error',
 'test_r2',
 'train_neg_mean_absolute_error',
 'train_neg_mean_squared_error',
 'train_r2']