# Linear Regression

The most basic model that we will try and implement is the linear regression model. The target feature Y is predicted based on the magnitude of environmental features X. Although a non-complex model, linear regression is advantageous for its ease of results interpretation when compared to other regression models and tree-based models.

The intensity of the relationship between the features is represented by a line of best fit to the plotted datapoints, represented by the equation of that line _y = mx + b_, where y is the target feature and x is the predictor variable (KDNugggets).

In [1]:
#Import package pandas for data analysis
import pandas as pd
# Import package numpy for numeric computing
import numpy as np
from numpy import int64
from numpy import float64
from numpy import datetime64
# Import package matplotlib for visualisation/plotting
import matplotlib.pyplot as plt
import matplotlib.dates as dates
# Allows plots to appear directly in the notebook.
%matplotlib inline
# For dealing with some Accented characters (in Irish Place names)
import unidecode
# Date/time functionality
import datetime
import time
#for trigonometic calculations
import math
# Check if files exist
from os.path import exists
from os import makedirs
# System specific parameters and functions
import sys
# look at some z-scores for inspecting outliers.
from scipy import stats
import seaborn as sns
# lookup lat/long and convert lat/long to national grid references.
#import geopy
#import pyproj

from patsy import dmatrices
from sklearn import metrics
#binary encoding by using the OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_graphviz
#from sklearn.tree import export_text

from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score 
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import train_test_split

from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score


from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso

#import graphviz
#from graphviz import Source
#to read all CSV files in a folder
import os
import glob
import pickle

In [2]:
#method to read chunks from a file and create a dataframe list.
def get_chunks(x):
    """x=[location, dtypes, datecolumns ,date_parser_func, cols]
    exp:mydateparser = lambda x: pd.to_datetime(x, format="%Y-%m-%d %H:%M:%S")
    exp:dtypes={'TRIPID' :'category','PROGRNUMBER':'int16','STOPPOINTID':'category','PLANNEDTIME_ARR':'int64','PLANNEDTIME_DEP':'int64','ACTUALTIME_ARR':'int64','ACTUALTIME_DEP':'int64','DIFFERENCETIME_ARR':'int64','DIFFERENCETIME_DEP':'int64','LINEID':'category','ROUTEID':'category','DIRECTION':'category','TRIPS_TIME_PROPORTION':'float32','TOTAL_JOURNEY_TIME':'int16','TIME_OF_DAY':'category','DAY_OF_WEEK':'category'}
    exp:cols=['Date time','Temperature','Wind Speed','Precipitation','Cloud Cover']
    exp:x=['/Chunks/DBUS/QP_Implementation',dtypes,['ACTUALTIME_ARR_DATETIME'],mydateparser, cols]"""
    path = x[0]
    chunk_folder = glob.glob(os.path.join(path, "*.csv"))
    chunk_list=[]
    counter=0
    for filename in chunk_folder:
        df_chunk = pd.read_csv(filename, low_memory=False, dtype=x[1])
        chunk_list.append(df_chunk)
        print('Index at chunk_list',counter,' is ', filename, ': ',df_chunk.shape)
        counter+=1
    return chunk_list
#chunk_dbus_list=get_chunks(arg_list)

In [3]:
def printMetrics(actualVal, predictions):
    #classification evaluation measures
    print('\n==============================================================================')
    print("MAE: ", metrics.mean_absolute_error(actualVal, predictions))
    print("MSE: ", metrics.mean_squared_error(actualVal, predictions))
    print("RMSE: ", metrics.mean_squared_error(actualVal, predictions)**0.5)
    print("R²: ", metrics.r2_score(actualVal, predictions))

# 1 Read Chunks and train-test-split

The chunks are obtained for each individual route, and we will work with the first chunk for now, route 18_3, to explore the steps required to reach a model. Once the method has been estabished, I can loop through each route and apply the same steps to produce pickle files. 

I am performing a 70/30 train-test-split on which to train the model. In this instance, TOTAL_JOURNEY_TIME is our target feature y, and features such as weather conditions and day are included in X.<br>

In [4]:
# set out criteria for loading in chunks
dtype={'PLANNEDTIME_DEP':'int64',
       'ROUTEID':'category',
       'TOTAL_JOURNEY_TIME':'int16',
       'rain':'float16',
       'temp':'float16',
       'wet_bulb_temp(C)':'float16',
       'dew_pt_temp(C)':'float16',
       'vapour_pressure(hPa)':'float16',
       'humidity(%)':'int16',
       'sea_lvl_pressure(hPa)':'float16',
       'sin_hour_of_day': 'float32',
       'cos_hour_of_day': 'float32',
       'friday':'int16',
       'monday':'int16',
       'saturday':'int16',
       'sunday':'int16',
       'thursday':'int16',
       'tuesday':'int16',
       'wednesday':'int16'}


data_list = ["../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/", dtype]

# apply function
chunks = get_chunks(data_list)

Index at chunk_list 0  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/18_3.csv :  (3896, 19)
Index at chunk_list 1  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/67_6.csv :  (4001, 19)
Index at chunk_list 2  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/142_9.csv :  (252, 17)
Index at chunk_list 3  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/17_16.csv :  (307, 16)
Index at chunk_list 4  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/46A_74.csv :  (11186, 19)
Index at chunk_list 5  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/122_14.csv :  (4164, 19)
Index at chunk_list 6  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/150_8.csv :  (4907, 19)
Index at chunk_list 7  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/33X_49.csv :  (181, 17)
Index at chunk_list 8  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/83_19.csv :  (1051, 16)
Index at chunk_list 9  is  ../../Pelin/Chunks/D

Index at chunk_list 83  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/66_13.csv :  (1448, 19)
Index at chunk_list 84  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/185_56.csv :  (847, 19)
Index at chunk_list 85  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/14_12.csv :  (201, 18)
Index at chunk_list 86  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/83A_23.csv :  (1129, 19)
Index at chunk_list 87  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/53_20.csv :  (1052, 19)
Index at chunk_list 88  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/11_40.csv :  (2656, 19)
Index at chunk_list 89  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/83_22.csv :  (3603, 19)
Index at chunk_list 90  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/151_16.csv :  (1108, 19)
Index at chunk_list 91  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/84_31.csv :  (465, 19)
Index at chunk_list 92  is  ../../Pel

Index at chunk_list 172  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/79_10.csv :  (4129, 19)
Index at chunk_list 173  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/15A_84.csv :  (3887, 19)
Index at chunk_list 174  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/16_20.csv :  (7055, 19)
Index at chunk_list 175  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/31A_25.csv :  (833, 19)
Index at chunk_list 176  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/65_72.csv :  (681, 19)
Index at chunk_list 177  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/41_7.csv :  (4834, 19)
Index at chunk_list 178  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/7A_87.csv :  (2178, 19)
Index at chunk_list 179  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/13_70.csv :  (252, 17)
Index at chunk_list 180  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/39_23.csv :  (204, 19)
Index at chunk_list 181  is  ..

Index at chunk_list 255  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/45A_70.csv :  (395, 17)
Index at chunk_list 256  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/1_37.csv :  (4748, 19)
Index at chunk_list 257  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/114_5.csv :  (1350, 18)
Index at chunk_list 258  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/77A_30.csv :  (968, 19)
Index at chunk_list 259  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/45A_60.csv :  (2091, 19)
Index at chunk_list 260  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/122_20.csv :  (1580, 18)
Index at chunk_list 261  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/66_18.csv :  (2092, 19)
Index at chunk_list 262  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/45A_68.csv :  (295, 17)
Index at chunk_list 263  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/13_60.csv :  (2472, 19)
Index at chunk_list 264  is

In [5]:
df = chunks[0].copy()
df.drop('ROUTEID', inplace=True, axis=1)

In [6]:
# split into X and Y -> features and target
y = pd.DataFrame(df['TOTAL_JOURNEY_TIME'])
X = df.drop(['TOTAL_JOURNEY_TIME'], axis=1)

# take random 70/30 split
# maintaining the random state with an integer argument so we can use this arrangement again
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=1)

# print the number of rows in each dataset to ensure the sample was done correctly
print(f"Original set: {df.shape}")
print(f"Training set: {X_train.shape}")
print(f"Testing set: {X_test.shape}")

Original set: (3896, 18)
Training set: (2727, 17)
Testing set: (1169, 17)


# 2 Model training

The training set extracted above is now used to train our model for the bus route 18_3. We can examine the importance of each feature in determining the outcome total bus journey time by examining the coefficienets. From the output below, we can see that the sin value for time of day has the greatest influence on total journey time - this feature is linked with the cos hour of day so we can assume that the time of day in general has a high affect. This is followed by the day of the week on which the trip took place. Weekend days show a greater coefficient.

In [8]:
features = list(X_train)
# create the linear regression model
# only have continuous features so no need to exclude categoricals

LR = LinearRegression().fit(X_train, y_train)

print("Features                      Coefficients\n=================================================")
for it, feature in enumerate(features):
    print(f"{feature}               {LR.coef_[0][it]}")
    
print("\nIntercept: \n", LR.intercept_)

feature_importance = pd.DataFrame({'feature': features, 'importance':abs(LR.coef_.reshape(-1))})
feature_importance.sort_values('importance', ascending=False)

Features                      Coefficients
PLANNEDTIME_DEP               -0.060377691342932874
rain               110.11612015361182
temp               -45.47529130178701
wet_bulb_temp(C)               121.23062101382365
dew_pt_temp(C)               -2.2429247299119446
vapour_pressure(hPa)               -97.89318167518483
humidity(%)               7.652432414508368
sea_lvl_pressure(hPa)               2.785641363020405
sin_hour_of_day               -984.2551319449647
cos_hour_of_day               288.17032693513454
friday               383.59056121554653
monday               -98.57445273657166
saturday               -715.6251213591585
sunday               -874.3984990727544
thursday               359.20206577775195
tuesday               444.7003289259683
wednesday               501.1051172492165

Intercept: 
 [4473.689554]


Unnamed: 0,feature,importance
8,sin_hour_of_day,984.255132
13,sunday,874.398499
12,saturday,715.625121
16,wednesday,501.105117
15,tuesday,444.700329
10,friday,383.590561
14,thursday,359.202066
9,cos_hour_of_day,288.170327
3,wet_bulb_temp(C),121.230621
1,rain,110.11612


## 2.1 Evaluate on training set

The findings from this trial indicate whether or not our model is suffering from overfitting. The inaccuracies shown in the table show that this is not the case here.

In [9]:
# make prediction and display results
LRprediction = LR.predict(X_train)
print("First 10 linear regression predictions:\n")
actualVpredicted = pd.concat([y_train, pd.DataFrame(LRprediction, columns=["Predicted"], index=y_train.index)], axis=1)
actualVpredicted.head(10)

First 10 linear regression predictions:



Unnamed: 0,TOTAL_JOURNEY_TIME,Predicted
2289,4610,4869.565088
2344,3283,3601.930433
3114,5747,4778.178193
3305,2386,2497.55331
2347,3007,3206.58259
1988,3463,3407.613617
681,4091,4234.870232
2418,5500,4954.748207
250,3841,4681.162407
3570,3531,3002.507624


In [10]:
# compute mse, rmse and mae
mse = mean_squared_error(y_train, LRprediction)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_train, LRprediction)
r2 = r2_score(y_train, LRprediction)

# show results
print("Evaluation metrics on training data:\n")
print(f"Root mean squared error: {rmse}")
print(f"Mean absolute error: {mae}")
print(f"R sqared: {r2}")

Evaluation metrics on training data:

Root mean squared error: 768.0736455482638
Mean absolute error: 546.3839819478919
R sqared: 0.5349074688343272


## Evaluate on hold-out set

This evaluation shows the performance metrics of our model when faced with the test data. On average, our model predicted a time that was off the actual value by 570 seconds (9 and a half minutes or so). Due to the r2 value of 0.54, that is an expected result. A higher r2 (better fit between model and data) would yield more accurate predictions.

In [11]:
# Predicted price on test set
LRtest_prediction = LR.predict(X_test)
print("Linear regression prediction on test data:")
mod_test_actualVpredicted = pd.concat([y_test, pd.DataFrame(LRtest_prediction, columns=["Predicted"], index=y_test.index)], axis=1)
mod_test_actualVpredicted.head(10)

Linear regression prediction on test data:


Unnamed: 0,TOTAL_JOURNEY_TIME,Predicted
1413,3650,3449.336126
3362,3966,3852.04867
1787,4049,3523.343065
1396,4025,3727.976859
724,2348,3625.475002
1410,2728,2547.763797
710,6436,4930.06844
2011,4211,4834.819732
2805,2526,3490.116674
453,2619,3939.308676


In [12]:
# compute mse, rmse and mae
t_mse = mean_squared_error(y_test, LRtest_prediction)
t_rmse = np.sqrt(t_mse)
t_mae = mean_absolute_error(y_test, LRtest_prediction)
t_r2 = r2_score(y_test, LRtest_prediction)

# show results
print("Evaluation metrics on training data:\n")
print(f"Root mean squared error: {t_rmse}")
print(f"Mean absolute error: {t_mae}")
print(f"r^2: {t_r2}")

Evaluation metrics on training data:

Root mean squared error: 769.6789094187598
Mean absolute error: 570.2695119220922
r^2: 0.5408598337979247


## 5-fold cross validation
### MAE

In [13]:
mae_scores = -cross_val_score(LinearRegression(), X, y, scoring='neg_mean_absolute_error', cv=5)
CV_mae = np.mean(mae_scores)
CV_mae_std = np.std(mae_scores)

print("===== MAE =====\n")
print("Avg MAE score over 5 folds:", CV_mae)
print("Stddev MAE score over 5 folds:", CV_mae_std)
print("Variance over 5 folds:", np.std(mae_scores)*2)

===== MAE =====

Avg MAE score over 5 folds: 573.7456668541867
Stddev MAE score over 5 folds: 40.45893565219496
Variance over 5 folds: 80.91787130438992


### RMSE

In [14]:
rmse_scores = -cross_val_score(LinearRegression(), X, y, scoring='neg_root_mean_squared_error', cv=5)
CV_rmse = np.mean(rmse_scores)
CV_rmse_std = np.std(rmse_scores)

print("===== RMSE =====\n")
print("Avg RMSE score over 5 folds:", CV_rmse)
print("Stddev RMSE score over 5 folds:", CV_rmse_std)
print("Variance over 5 folds:", np.std(rmse_scores)*2)

===== RMSE =====

Avg RMSE score over 5 folds: 794.9996209460782
Stddev RMSE score over 5 folds: 58.31243232726411
Variance over 5 folds: 116.62486465452822


### r2

In [15]:
r2_scores = cross_val_score(LinearRegression(), X, y, scoring='r2', cv=5)
CV_r2 = np.mean(r2_scores)
CV_r2_std = np.std(r2_scores)

print("===== r^2 =====\n")
print("Avg r^2 score over 5 folds:", CV_r2)
print("Stddev r^2 score over 5 folds:", CV_r2_std)
print("Variance over 5 folds:", np.std(r2_scores)*2)

===== r^2 =====

Avg r^2 score over 5 folds: 0.494461774843054
Stddev r^2 score over 5 folds: 0.043079901382247154
Variance over 5 folds: 0.08615980276449431


In [16]:
results = {
    "": ["MAE", "RMSE", "r^2"],
    "Train/test split": [str(round(t_mae,2)), str(round(t_rmse,2)), str(round(t_r2,4))],
    "5-fold CV (std dev.)": [f"{str(round(CV_mae,2))} ({str(round(CV_mae_std,2))})", f"{str(round(CV_rmse,2))} ({str(round(CV_rmse_std,2))})",f"{str(round(CV_r2,4))} ({str(round(CV_r2_std,4))})"]
}

resultsdf = pd.DataFrame(results)
resultsdf = resultsdf.set_index("")
print(f"Average total journey time in training set: {y_train['TOTAL_JOURNEY_TIME'].mean()}")
print(f"Average total journey time in full set: {y['TOTAL_JOURNEY_TIME'].mean()}\n\n")
print("Performance of linear regression models:")
resultsdf

Average total journey time in training set: 4045.3296662999633
Average total journey time in full set: 4046.632700205339


Performance of linear regression models:


Unnamed: 0,Train/test split,5-fold CV (std dev.)
,,
MAE,570.27,573.75 (40.46)
RMSE,769.68,795.0 (58.31)
r^2,0.5409,0.4945 (0.0431)


# Exploring continuous feature relationships

When two features (neither the target feature) are highly correlated, it is likely that they both have the same affect on the target feature. This can lead to redundancy if they are both included in the model training, as they are unlikely to add anything original to the model and even insome cases pose a risk of overfitting (Vishal R, 2019).

This section aims to experiment with the removal of redundant features that are highly correlated with others, to assess the gain (or loss) of model performance. 

We have no categorical features to consider so only need to look at correlation matrix. I will compute the correlation of all continuous features in the X training set.

In [17]:
# correlation of all pairs of continuous features
X_train.corr()

Unnamed: 0,PLANNEDTIME_DEP,rain,temp,wet_bulb_temp(C),dew_pt_temp(C),vapour_pressure(hPa),humidity(%),sea_lvl_pressure(hPa),sin_hour_of_day,cos_hour_of_day,friday,monday,saturday,sunday,thursday,tuesday,wednesday
PLANNEDTIME_DEP,1.0,-0.028049,-0.122976,-0.095917,-0.041982,-0.037172,0.148759,0.025747,-0.754081,0.850167,0.001753,0.02297,-0.021711,0.199148,-0.068859,-0.08025,-0.1028
rain,-0.028049,1.0,-0.073543,-0.01835,0.061857,0.055279,0.226241,-0.208744,0.00607,-0.029895,0.011785,-0.043627,-0.013655,0.007265,-0.009474,0.056943,-0.007305
temp,-0.122976,-0.073543,1.0,0.965682,0.794249,0.791829,-0.565025,0.295007,-0.01317,-0.204491,0.002504,-0.059909,0.060768,0.033507,-0.019856,-0.029476,-0.040615
wet_bulb_temp(C),-0.095917,-0.01835,0.965682,1.0,0.922567,0.916797,-0.340976,0.227628,0.004262,-0.144117,-0.016185,-0.065568,0.066964,0.04558,-0.016644,-0.034533,-0.038615
dew_pt_temp(C),-0.041982,0.061857,0.794249,0.922567,1.0,0.985465,0.04289,0.092373,0.029096,-0.033217,-0.04075,-0.070783,0.068651,0.054367,-0.007717,-0.033809,-0.030406
vapour_pressure(hPa),-0.037172,0.055279,0.791829,0.916797,0.985465,1.0,0.02238,0.131434,0.022184,-0.033379,-0.048744,-0.057547,0.06912,0.065701,-0.02211,-0.042852,-0.025062
humidity(%),0.148759,0.226241,-0.565025,-0.340976,0.04289,0.02238,1.0,-0.376308,0.063972,0.298535,-0.05659,-0.00354,-0.007466,0.016529,0.024877,0.003812,0.027639
sea_lvl_pressure(hPa),0.025747,-0.208744,0.295007,0.227628,0.092373,0.131434,-0.376308,1.0,-0.009139,0.026966,-0.008305,0.110402,-0.009235,0.040555,-0.010968,-0.088626,-0.031733
sin_hour_of_day,-0.754081,0.00607,-0.01317,0.004262,0.029096,0.022184,0.063972,-0.009139,1.0,-0.41126,-0.058673,0.038504,0.053722,-0.18517,0.039141,0.064489,0.089635
cos_hour_of_day,0.850167,-0.029895,-0.204491,-0.144117,-0.033217,-0.033379,0.298535,0.026966,-0.41126,1.0,-0.047817,0.04396,-0.017913,0.126353,-0.031561,-0.046628,-0.046218


Feature pairs that have high correlation:

- PLANNEDTIME_DEP : sin_hour_of_day..........-0.754087
- PLANNEDTIME_DEP : cos_hour_of_day..........0.850175
- temp : wet_bulb_temp(C).................................0.965682
- temp : vapour_pressure(hPa)............................0.791829
- vapour_pressure(hPa) : wet_bulb_temp(C).......0.916797

For this reason I am going to drop wet_buld_temp(C) and vapour_pressure(hPa).

In [18]:
list(X_train)

['PLANNEDTIME_DEP',
 'rain',
 'temp',
 'wet_bulb_temp(C)',
 'dew_pt_temp(C)',
 'vapour_pressure(hPa)',
 'humidity(%)',
 'sea_lvl_pressure(hPa)',
 'sin_hour_of_day',
 'cos_hour_of_day',
 'friday',
 'monday',
 'saturday',
 'sunday',
 'thursday',
 'tuesday',
 'wednesday']

In [19]:
modX_train = X_train.drop(['wet_bulb_temp(C)', 'vapour_pressure(hPa)'], axis=1)
list(modX_train)

['PLANNEDTIME_DEP',
 'rain',
 'temp',
 'dew_pt_temp(C)',
 'humidity(%)',
 'sea_lvl_pressure(hPa)',
 'sin_hour_of_day',
 'cos_hour_of_day',
 'friday',
 'monday',
 'saturday',
 'sunday',
 'thursday',
 'tuesday',
 'wednesday']

**Now to do the same to the test data**

In [20]:
modX_test = X_test.drop(['wet_bulb_temp(C)', 'vapour_pressure(hPa)'], axis=1)
list(modX_test)

['PLANNEDTIME_DEP',
 'rain',
 'temp',
 'dew_pt_temp(C)',
 'humidity(%)',
 'sea_lvl_pressure(hPa)',
 'sin_hour_of_day',
 'cos_hour_of_day',
 'friday',
 'monday',
 'saturday',
 'sunday',
 'thursday',
 'tuesday',
 'wednesday']

# Model Training

In [21]:
mod_features = list(modX_train)
# create the linear regression model
# only have continuous features so no need to exclude categoricals

modLR = LinearRegression().fit(modX_train, y_train)

print("Features                      Coefficients\n=================================================")
for it, feature in enumerate(mod_features):
    print(f"{feature}               {modLR.coef_[0][it]}")
    
print("\nIntercept: \n", modLR.intercept_)

feature_importance = pd.DataFrame({'feature': mod_features, 'importance':abs(modLR.coef_.reshape(-1))})
feature_importance.sort_values('importance', ascending=False)

Features                      Coefficients
PLANNEDTIME_DEP               -0.06048183690079416
rain               111.58683809003766
temp               13.674697678614056
dew_pt_temp(C)               -14.126484982394839
humidity(%)               7.263259712050896
sea_lvl_pressure(hPa)               1.9187359005808684
sin_hour_of_day               -981.7470814936557
cos_hour_of_day               292.0121326261898
friday               389.45965178726635
monday               -103.94233789914985
saturday               -717.2367135065733
sunday               -879.1288143879622
thursday               369.30326664616473
tuesday               447.1062886995854
wednesday               494.43865866066966

Intercept: 
 [4899.94416306]


Unnamed: 0,feature,importance
6,sin_hour_of_day,981.747081
11,sunday,879.128814
10,saturday,717.236714
14,wednesday,494.438659
13,tuesday,447.106289
8,friday,389.459652
12,thursday,369.303267
7,cos_hour_of_day,292.012133
1,rain,111.586838
9,monday,103.942338


 ## Evaluate on training set

In [22]:
# make prediction and display results
modLRprediction = modLR.predict(modX_train)
print("First 10 linear regression predictions:\n")
actualVpredicted = pd.concat([y_train, pd.DataFrame(modLRprediction, columns=["Predicted"], index=y_train.index)], axis=1)
actualVpredicted.head(10)

First 10 linear regression predictions:



Unnamed: 0,TOTAL_JOURNEY_TIME,Predicted
2289,4610,4886.020235
2344,3283,3740.903641
3114,5747,4770.402362
3305,2386,2468.686295
2347,3007,3310.504109
1988,3463,3504.74086
681,4091,4327.448116
2418,5500,5041.245516
250,3841,4730.752302
3570,3531,2977.400875


In [23]:
# compute mse, rmse and mae
mod_mse = mean_squared_error(y_train, modLRprediction)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_train, modLRprediction)
r2 = r2_score(y_train, modLRprediction)

# show results
print("Evaluation metrics on training data:\n")
print(f"Root mean squared error: {rmse}")
print(f"Mean absolute error: {mae}")
print(f"R sqared: {r2}")

Evaluation metrics on training data:

Root mean squared error: 768.0736455482638
Mean absolute error: 547.1802180523825
R sqared: 0.5330605914735647


## Evaluate on hold-out set

In [24]:
# Predicted price on test set
modLRtest_prediction = modLR.predict(modX_test)
print("Linear regression prediction on test data:")
mod_test_actualVpredicted = pd.concat([y_test, pd.DataFrame(modLRtest_prediction, columns=["Predicted"], index=y_test.index)], axis=1)
mod_test_actualVpredicted.head(10)

Linear regression prediction on test data:


Unnamed: 0,TOTAL_JOURNEY_TIME,Predicted
1413,3650,3456.098574
3362,3966,3827.579849
1787,4049,3537.780396
1396,4025,3781.404308
724,2348,3639.95228
1410,2728,2496.733076
710,6436,5154.140093
2011,4211,4815.359988
2805,2526,3435.095938
453,2619,3990.664686


In [25]:
# compute mse, rmse and mae
t_mse = mean_squared_error(y_test, modLRtest_prediction)
t_rmse = np.sqrt(t_mse)
t_mae = mean_absolute_error(y_test, modLRtest_prediction)
t_r2 = r2_score(y_test, modLRtest_prediction)

# show results
print("Evaluation metrics on training data:\n")
print(f"Root mean squared error: {t_rmse}")
print(f"Mean absolute error: {t_mae}")
print(f"r^2: {t_r2}")

Evaluation metrics on training data:

Root mean squared error: 770.3728060750603
Mean absolute error: 570.9270380478232
r^2: 0.5400315937728708


## 5-fold cross-validation

In [26]:
# perform same processing that was done on train-test splits on original set
modX = X.drop(['wet_bulb_temp(C)', 'vapour_pressure(hPa)'], axis=1)
list(modX)

['PLANNEDTIME_DEP',
 'rain',
 'temp',
 'dew_pt_temp(C)',
 'humidity(%)',
 'sea_lvl_pressure(hPa)',
 'sin_hour_of_day',
 'cos_hour_of_day',
 'friday',
 'monday',
 'saturday',
 'sunday',
 'thursday',
 'tuesday',
 'wednesday']

### MAE

In [27]:
mae_scores = -cross_val_score(LinearRegression(), modX, y, scoring='neg_mean_absolute_error', cv=5)
CV_mae = np.mean(mae_scores)
CV_mae_std = np.std(mae_scores)

print("===== MAE =====\n")
print("Avg MAE score over 5 folds:", CV_mae)
print("Stddev MAE score over 5 folds:", CV_mae_std)
print("Variance over 5 folds:", np.std(mae_scores)*2)

===== MAE =====

Avg MAE score over 5 folds: 573.3031976819332
Stddev MAE score over 5 folds: 39.804960495844256
Variance over 5 folds: 79.60992099168851


### RMSE

In [28]:
rmse_scores = -cross_val_score(LinearRegression(), modX, y, scoring='neg_root_mean_squared_error', cv=5)
CV_rmse = np.mean(rmse_scores)
CV_rmse_std = np.std(rmse_scores)

print("===== RMSE =====\n")
print("Avg RMSE score over 5 folds:", CV_rmse)
print("Stddev RMSE score over 5 folds:", CV_rmse_std)
print("Variance over 5 folds:", np.std(rmse_scores)*2)

===== RMSE =====

Avg RMSE score over 5 folds: 794.619251409979
Stddev RMSE score over 5 folds: 60.64667800048247
Variance over 5 folds: 121.29335600096493


### r2

In [29]:
r2_scores = cross_val_score(LinearRegression(), modX, y, scoring='r2', cv=5)
CV_r2 = np.mean(r2_scores)
CV_r2_std = np.std(r2_scores)

print("===== r^2 =====\n")
print("Avg r^2 score over 5 folds:", CV_r2)
print("Stddev r^2 score over 5 folds:", CV_r2_std)
print("Variance over 5 folds:", np.std(r2_scores)*2)

===== r^2 =====

Avg r^2 score over 5 folds: 0.494904460796106
Stddev r^2 score over 5 folds: 0.04590689936469265
Variance over 5 folds: 0.0918137987293853


In [30]:
results = {
    "": ["MAE", "RMSE", "r^2"],
    "Train/test split": [str(round(t_mae,2)), str(round(t_rmse,2)), str(round(t_r2,4))],
    "5-fold CV (std dev.)": [f"{str(round(CV_mae,2))} ({str(round(CV_mae_std,2))})", f"{str(round(CV_rmse,2))} ({str(round(CV_rmse_std,2))})",f"{str(round(CV_r2,4))} ({str(round(CV_r2_std,4))})"]
}

resultsdf = pd.DataFrame(results)
resultsdf = resultsdf.set_index("")
print(f"Average total journey time in training set: {y_train['TOTAL_JOURNEY_TIME'].mean()}")
print(f"Average total journey time in full set: {y['TOTAL_JOURNEY_TIME'].mean()}\n\n")
print("Performance of linear regression models AFTER removing redundant features:")
resultsdf

Average total journey time in training set: 4045.3296662999633
Average total journey time in full set: 4046.632700205339


Performance of linear regression models AFTER removing redundant features:


Unnamed: 0,Train/test split,5-fold CV (std dev.)
,,
MAE,570.93,573.3 (39.8)
RMSE,770.37,794.62 (60.65)
r^2,0.54,0.4949 (0.0459)


# Findings

Main take away is that the train/test split performs better than the 5-fold CV on all metrics.
For ease of interpretation, converted to minutes we get roughly:<br>
571 = **9m 30s**<br>
770 = **12m 45s**<br>
793 = **13m 11s**<br>

Taking away the highly correlated continuous features had no effect on model performance for this route.

# Automating feature selection: filter method

When running multiple datasets through the same algorithm to produce a model for each, it is desirable to ensure that each model is as specific to that dataset as possible, and minimise generalisation across the group of datasets. For that reason, we are imlementing automated feature selection using the filter method. A threshold correlation is set, and only features above that correlation are selected. I am setting the minimum correlation threshold to 0.1.

Due to sin and cos hour of day being linked with one another, I am tuning the algorithm so that these two features are always selected.

In [35]:
# create an empty dataframe to build selected features
col_names = list(chunks[0].keys())
print(col_names)

df_dict = {}

for col in col_names:
    df_dict[f'{col}'] = []
    
#route_features_df = pd.DataFrame(df_dict)
#route_features_df

['PLANNEDTIME_DEP', 'ROUTEID', 'TOTAL_JOURNEY_TIME', 'rain', 'temp', 'wet_bulb_temp(C)', 'dew_pt_temp(C)', 'vapour_pressure(hPa)', 'humidity(%)', 'sea_lvl_pressure(hPa)', 'sin_hour_of_day', 'cos_hour_of_day', 'friday', 'monday', 'saturday', 'sunday', 'thursday', 'tuesday', 'wednesday']


In [36]:
df_dict

{'PLANNEDTIME_DEP': [],
 'ROUTEID': [],
 'TOTAL_JOURNEY_TIME': [],
 'rain': [],
 'temp': [],
 'wet_bulb_temp(C)': [],
 'dew_pt_temp(C)': [],
 'vapour_pressure(hPa)': [],
 'humidity(%)': [],
 'sea_lvl_pressure(hPa)': [],
 'sin_hour_of_day': [],
 'cos_hour_of_day': [],
 'friday': [],
 'monday': [],
 'saturday': [],
 'sunday': [],
 'thursday': [],
 'tuesday': [],
 'wednesday': []}

In [59]:
def selectFeaturesToCSV(dfs):
    """Adds the selected features into the dictionary to be used to create DataFrame for all routes."""


    def selectFeatures(df):
        """Uses Pearson correlation to select features with at least 0.05 score.
        
        Selected features returned as a list."""
        cor = df.corr()


        def correctSinCos(features_series):
            '''Checks presence of sin/cos pair in relevant features.

            Adds other if only one is found. Adds both if none are found.'''

            features = list(features_series.keys())

            cos = 'cos_hour_of_day'
            sin = 'sin_hour_of_day'

            if (sin in features) & (cos in features):
                pass

            elif (sin not in features) & (cos not in features):
                features.append(sin)
                features.append(cos)

            elif (sin in features) & (cos not in features):
                features.append(cos)

            else:
                features.append(sin)

            return features

        #Correlation with output variable
        cor_target = abs(cor["TOTAL_JOURNEY_TIME"])
        #Selecting highly correlated features
        relevant_features = cor_target[cor_target>0.2]


        selected_features = correctSinCos(relevant_features)
        return selected_features

    # create an empty dataframe to build selected features
    col_names = list(dfs[0].keys())

    df_dict = {}

    for col in col_names:
        df_dict[f'{col}'] = []
    
    for df in dfs:
        selected = selectFeatures(df)

        routeid = df['ROUTEID'].unique()[0]

        for col in col_names:
            if col == 'ROUTEID':
                df_dict['ROUTEID'].append(routeid)
            elif col in selected and col != 'ROUTEID':
                df_dict[f'{col}'].append(1)
            elif col not in selected and col != 'ROUTEID':
                df_dict[f'{col}'].append(0)
                
    features_df = pd.DataFrame(df_dict)
    features_df.to_csv('../SupportVectorReg/Feature_selection_V4_02.csv', index=False)
        

In [60]:
selectFeaturesToCSV(chunks)

In [61]:
len(chunks)

305

# 3 Examine results across all routes

In [5]:
# read chunks in again as they may have been altered in previous cells
chunks2 = get_chunks(data_list)

Index at chunk_list 0  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/18_3.csv :  (3896, 19)
Index at chunk_list 1  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/67_6.csv :  (4001, 19)
Index at chunk_list 2  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/142_9.csv :  (252, 17)
Index at chunk_list 3  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/17_16.csv :  (307, 16)
Index at chunk_list 4  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/46A_74.csv :  (11186, 19)
Index at chunk_list 5  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/122_14.csv :  (4164, 19)
Index at chunk_list 6  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/150_8.csv :  (4907, 19)
Index at chunk_list 7  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/33X_49.csv :  (181, 17)
Index at chunk_list 8  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/83_19.csv :  (1051, 16)
Index at chunk_list 9  is  ../../Pelin/Chunks/D

Index at chunk_list 99  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/102_9.csv :  (2800, 19)
Index at chunk_list 100  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/38B_40.csv :  (211, 17)
Index at chunk_list 101  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/26_29.csv :  (1232, 19)
Index at chunk_list 102  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/114_6.csv :  (1387, 18)
Index at chunk_list 103  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/84X_62.csv :  (204, 17)
Index at chunk_list 104  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/7A_86.csv :  (1023, 19)
Index at chunk_list 105  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/270_42.csv :  (1875, 19)
Index at chunk_list 106  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/7A_89.csv :  (238, 17)
Index at chunk_list 107  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/111_7.csv :  (1134, 19)
Index at chunk_list 108  is  

Index at chunk_list 178  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/7A_87.csv :  (2178, 19)
Index at chunk_list 179  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/13_70.csv :  (252, 17)
Index at chunk_list 180  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/39_23.csv :  (204, 19)
Index at chunk_list 181  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/161_50.csv :  (166, 17)
Index at chunk_list 182  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/65B_65.csv :  (2157, 19)
Index at chunk_list 183  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/40B_63.csv :  (483, 19)
Index at chunk_list 184  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/68_84.csv :  (242, 18)
Index at chunk_list 185  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/7_72.csv :  (2421, 19)
Index at chunk_list 186  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/43_84.csv :  (2253, 19)
Index at chunk_list 187  is  ..

Index at chunk_list 267  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/41C_79.csv :  (3961, 19)
Index at chunk_list 268  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/83A_20.csv :  (538, 16)
Index at chunk_list 269  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/39_20.csv :  (3967, 19)
Index at chunk_list 270  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/63_26.csv :  (1453, 19)
Index at chunk_list 271  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/66B_58.csv :  (859, 19)
Index at chunk_list 272  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/17_10.csv :  (2066, 18)
Index at chunk_list 273  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/17A_14.csv :  (422, 16)
Index at chunk_list 274  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/66A_38.csv :  (947, 19)
Index at chunk_list 275  is  ../../Pelin/Chunks/DBUS/Predictive_Model_Chunks_backup/15A_85.csv :  (1058, 19)
Index at chunk_list 276  i

In [6]:
results_df = pd.DataFrame()

In [7]:
mae_all = []
mse_all = []
rmse_all = []
r2_all = []
CV_mae_all = []
CV_rmse_all = []
CV_r2_all = []

In [8]:
def getPickle(model, modelname, df, features_file):
    """File to generate complete pickel file for a dataframe.
    
    Selects features based on feature selection recommendation in table"""
    
    def get_sourcefeatures(file, ROUTEID):
        df = pd.read_csv(file, low_memory=False)
        targetrow_df= df[df.ROUTEID == ROUTEID ]
        features_list=[]
        for column in targetrow_df.columns.values:
            if targetrow_df.iloc[0][column]==1:
                features_list.append(column)
        features_list.remove('TOTAL_JOURNEY_TIME')
        return features_list
    
    global mae_all
    global mse_all
    global rmse_all
    global r2_all
    global CV_mae_all
    global CV_rmse_all
    global CV_r2_all
    global results_df
    
    # shuffle
    df = df.sample(frac=1)
    source_features = get_sourcefeatures(f'{features_file}', df['ROUTEID'].unique()[0])
    y=df['TOTAL_JOURNEY_TIME']
    X=df[source_features]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    model = model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    
    mae = metrics.mean_absolute_error(y_test, prediction)
    mse = metrics.mean_squared_error(y_test, prediction)
    rmse = metrics.mean_squared_error(y_test, prediction)**0.5
    r2 = metrics.r2_score(y_test, prediction)
    
    mae_all.append(mae)
    mse_all.append(mse)
    rmse_all.append(rmse)
    r2_all.append(r2)
    
    # 3-fold CV
    
    accuracyR2 = cross_val_score(model, X, y, scoring='r2', cv = 3)
    CV_r2_all.append(accuracyR2.mean())
    
    accuracyMAE = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv = 3)
    CV_mae_all.append(accuracyMAE.mean())
    
    accuracyRMSE = cross_val_score(model, X, y, scoring='neg_root_mean_squared_error', cv = 3)
    CV_rmse_all.append(accuracyRMSE.mean())
    
    # generate pickle file
    routeid = df['ROUTEID'].unique()[0]
    filename = routeid+'.pkl'
    with open(f'pickle_files/{filename}', 'wb') as handle:
        pickle.dump(model, handle, pickle.HIGHEST_PROTOCOL)
        
    # create a dataframe of the results
    data = [[modelname, routeid, accuracyMAE.mean(), accuracyRMSE.mean(), accuracyR2.mean()]]
    row_df = pd.DataFrame(data, columns=['MODEL','ROUTEID','MAE', 'RMSE','R2'])
    results_df = pd.concat([results_df,row_df],axis=0,ignore_index=True)
    
    return results_df
    

In [9]:
LR = LinearRegression()

In [16]:

count = 0
for chunk in chunks2:
    print(f'generating pickel for chunk index {count}')
    getPickle(LR, 'LR', chunk, '../SupportVectorReg/Feature_selection_V1.csv')
    count +=1

generating pickel for chunk index 0
generating pickel for chunk index 1
generating pickel for chunk index 2
generating pickel for chunk index 3
generating pickel for chunk index 4
generating pickel for chunk index 5
generating pickel for chunk index 6
generating pickel for chunk index 7
generating pickel for chunk index 8
generating pickel for chunk index 9
generating pickel for chunk index 10
generating pickel for chunk index 11
generating pickel for chunk index 12
generating pickel for chunk index 13
generating pickel for chunk index 14
generating pickel for chunk index 15
generating pickel for chunk index 16
generating pickel for chunk index 17
generating pickel for chunk index 18
generating pickel for chunk index 19
generating pickel for chunk index 20
generating pickel for chunk index 21
generating pickel for chunk index 22
generating pickel for chunk index 23
generating pickel for chunk index 24
generating pickel for chunk index 25
generating pickel for chunk index 26
generating 

generating pickel for chunk index 221
generating pickel for chunk index 222
generating pickel for chunk index 223
generating pickel for chunk index 224
generating pickel for chunk index 225
generating pickel for chunk index 226
generating pickel for chunk index 227
generating pickel for chunk index 228
generating pickel for chunk index 229
generating pickel for chunk index 230
generating pickel for chunk index 231
generating pickel for chunk index 232
generating pickel for chunk index 233
generating pickel for chunk index 234
generating pickel for chunk index 235
generating pickel for chunk index 236
generating pickel for chunk index 237
generating pickel for chunk index 238
generating pickel for chunk index 239
generating pickel for chunk index 240
generating pickel for chunk index 241
generating pickel for chunk index 242
generating pickel for chunk index 243
generating pickel for chunk index 244
generating pickel for chunk index 245
generating pickel for chunk index 246
generating p

### CV results

In [17]:
results_df

Unnamed: 0,MODEL,ROUTEID,MAE,RMSE,R2
0,LR,18_3,-556.101450,-774.658138,0.529055
1,LR,67_6,-481.571359,-658.738232,0.500589
2,LR,142_9,-223.974564,-392.735041,-0.009175
3,LR,17_16,-252.486950,-375.551883,0.502187
4,LR,46A_74,-517.367954,-734.705490,0.476095
...,...,...,...,...,...
551,LR,17A_14,-472.036796,-609.662974,0.475238
552,LR,66A_38,-486.785276,-757.756204,0.347463
553,LR,15A_85,-486.455525,-687.672768,0.419779
554,LR,47_136,-560.496138,-769.804813,0.470143


In [18]:
print("max r2 = ", results_df['R2'].max())
print("mean r2 = ", results_df['R2'].mean())


max r2 =  0.8663119032602657
mean r2 =  0.4295114081036137


## Trying with stricter feature selection (corr=0.1)

In [10]:
results_df = pd.DataFrame()

count = 0
for chunk in chunks2:
    print(f'generating pickel for chunk index {count}')
    getPickle(LR, 'LR', chunk, '../SupportVectorReg/Feature_selection_V2.csv')
    count +=1

generating pickel for chunk index 0
generating pickel for chunk index 1
generating pickel for chunk index 2
generating pickel for chunk index 3
generating pickel for chunk index 4
generating pickel for chunk index 5
generating pickel for chunk index 6
generating pickel for chunk index 7
generating pickel for chunk index 8
generating pickel for chunk index 9
generating pickel for chunk index 10
generating pickel for chunk index 11
generating pickel for chunk index 12
generating pickel for chunk index 13
generating pickel for chunk index 14
generating pickel for chunk index 15
generating pickel for chunk index 16
generating pickel for chunk index 17
generating pickel for chunk index 18
generating pickel for chunk index 19
generating pickel for chunk index 20
generating pickel for chunk index 21
generating pickel for chunk index 22
generating pickel for chunk index 23
generating pickel for chunk index 24
generating pickel for chunk index 25
generating pickel for chunk index 26
generating 

generating pickel for chunk index 220
generating pickel for chunk index 221
generating pickel for chunk index 222
generating pickel for chunk index 223
generating pickel for chunk index 224
generating pickel for chunk index 225
generating pickel for chunk index 226
generating pickel for chunk index 227
generating pickel for chunk index 228
generating pickel for chunk index 229
generating pickel for chunk index 230
generating pickel for chunk index 231
generating pickel for chunk index 232
generating pickel for chunk index 233
generating pickel for chunk index 234
generating pickel for chunk index 235
generating pickel for chunk index 236
generating pickel for chunk index 237
generating pickel for chunk index 238
generating pickel for chunk index 239
generating pickel for chunk index 240
generating pickel for chunk index 241
generating pickel for chunk index 242
generating pickel for chunk index 243
generating pickel for chunk index 244
generating pickel for chunk index 245
generating p

### CV results


In [11]:
results_df

Unnamed: 0,MODEL,ROUTEID,MAE,RMSE,R2
0,LR,18_3,-556.101450,-774.658138,0.529055
1,LR,67_6,-481.571359,-658.738232,0.500589
2,LR,142_9,-223.974564,-392.735041,-0.009175
3,LR,17_16,-252.486950,-375.551883,0.502187
4,LR,46A_74,-517.367954,-734.705490,0.476095
...,...,...,...,...,...
273,LR,17A_14,-476.842893,-616.143448,0.451350
274,LR,66A_38,-480.101636,-748.471538,0.361873
275,LR,15A_85,-473.698238,-674.336475,0.443314
276,LR,47_136,-560.233839,-769.429324,0.469558


In [12]:
print("r2 max for cor=0.1 : ", results_df['R2'].max())
print("r2 mean for cor=0.1 : ", results_df['R2'].mean())

r2 max for cor=0.1 :  0.8663119032602657
r2 mean for cor=0.1 :  0.42668859128866404


In [13]:
results_df.to_csv('LR_results.csv')

## Corr=0.2

In [62]:
results_df = pd.DataFrame()

count = 0
for chunk in chunks2:
    print(f'generating pickel for chunk index {count}')
    getPickle(LR, 'LR', chunk, '../SupportVectorReg/Feature_selection_V4_02.csv')
    count +=1

generating pickel for chunk index 0
generating pickel for chunk index 1
generating pickel for chunk index 2
generating pickel for chunk index 3
generating pickel for chunk index 4
generating pickel for chunk index 5
generating pickel for chunk index 6
generating pickel for chunk index 7
generating pickel for chunk index 8
generating pickel for chunk index 9
generating pickel for chunk index 10
generating pickel for chunk index 11
generating pickel for chunk index 12
generating pickel for chunk index 13
generating pickel for chunk index 14
generating pickel for chunk index 15
generating pickel for chunk index 16
generating pickel for chunk index 17
generating pickel for chunk index 18
generating pickel for chunk index 19
generating pickel for chunk index 20
generating pickel for chunk index 21
generating pickel for chunk index 22
generating pickel for chunk index 23
generating pickel for chunk index 24
generating pickel for chunk index 25
generating pickel for chunk index 26
generating 

generating pickel for chunk index 222
generating pickel for chunk index 223
generating pickel for chunk index 224
generating pickel for chunk index 225
generating pickel for chunk index 226
generating pickel for chunk index 227
generating pickel for chunk index 228
generating pickel for chunk index 229
generating pickel for chunk index 230
generating pickel for chunk index 231
generating pickel for chunk index 232
generating pickel for chunk index 233
generating pickel for chunk index 234
generating pickel for chunk index 235
generating pickel for chunk index 236
generating pickel for chunk index 237
generating pickel for chunk index 238
generating pickel for chunk index 239
generating pickel for chunk index 240
generating pickel for chunk index 241
generating pickel for chunk index 242
generating pickel for chunk index 243
generating pickel for chunk index 244
generating pickel for chunk index 245
generating pickel for chunk index 246
generating pickel for chunk index 247
generating p

### CV results

In [63]:
results_df

Unnamed: 0,MODEL,ROUTEID,MAE,RMSE,R2
0,LR,18_3,-555.953295,-773.983641,0.529712
1,LR,67_6,-529.427508,-725.830106,0.394165
2,LR,142_9,-220.218638,-397.350993,-0.016993
3,LR,17_16,-251.480548,-372.175896,0.512921
4,LR,46A_74,-541.178263,-760.369466,0.43898
5,LR,122_14,-443.560659,-626.86731,0.399484
6,LR,150_8,-333.585978,-496.575389,0.436364
7,LR,33X_49,-456.744431,-620.666654,0.463274
8,LR,83_19,-394.838816,-551.388265,0.242851
9,LR,270_44,-254.447942,-354.130197,0.277775


In [64]:
print("r2 max for cor=0.3 : ", results_df['R2'].max())
print("r2 mean for cor=0.3 : ", results_df['R2'].mean())

r2 max for cor=0.3 :  0.8407221810966807
r2 mean for cor=0.3 :  0.3682364257614371


In [65]:
results_df.to_csv('LR_results.csv')

## Corr=0.3

In [51]:
results_df = pd.DataFrame()

count = 0
for chunk in chunks2:
    print(f'generating pickel for chunk index {count}')
    getPickle(LR, 'LR', chunk, '../SupportVectorReg/Feature_selection_V3.csv')
    count +=1

generating pickel for chunk index 0
generating pickel for chunk index 1
generating pickel for chunk index 2
generating pickel for chunk index 3
generating pickel for chunk index 4
generating pickel for chunk index 5
generating pickel for chunk index 6
generating pickel for chunk index 7
generating pickel for chunk index 8
generating pickel for chunk index 9
generating pickel for chunk index 10
generating pickel for chunk index 11
generating pickel for chunk index 12
generating pickel for chunk index 13
generating pickel for chunk index 14
generating pickel for chunk index 15
generating pickel for chunk index 16
generating pickel for chunk index 17
generating pickel for chunk index 18
generating pickel for chunk index 19
generating pickel for chunk index 20
generating pickel for chunk index 21
generating pickel for chunk index 22
generating pickel for chunk index 23
generating pickel for chunk index 24
generating pickel for chunk index 25
generating pickel for chunk index 26
generating 

generating pickel for chunk index 220
generating pickel for chunk index 221
generating pickel for chunk index 222
generating pickel for chunk index 223
generating pickel for chunk index 224
generating pickel for chunk index 225
generating pickel for chunk index 226
generating pickel for chunk index 227
generating pickel for chunk index 228
generating pickel for chunk index 229
generating pickel for chunk index 230
generating pickel for chunk index 231
generating pickel for chunk index 232
generating pickel for chunk index 233
generating pickel for chunk index 234
generating pickel for chunk index 235
generating pickel for chunk index 236
generating pickel for chunk index 237
generating pickel for chunk index 238
generating pickel for chunk index 239
generating pickel for chunk index 240
generating pickel for chunk index 241
generating pickel for chunk index 242
generating pickel for chunk index 243
generating pickel for chunk index 244
generating pickel for chunk index 245
generating p

### CV results


In [52]:
results_df

Unnamed: 0,MODEL,ROUTEID,MAE,RMSE,R2
0,LR,18_3,-566.163374,-789.587982,0.510709
1,LR,67_6,-584.042632,-780.250732,0.300222
2,LR,142_9,-222.262812,-404.617411,-0.07111
3,LR,17_16,-248.773786,-370.991059,0.504084
4,LR,46A_74,-649.209208,-892.463806,0.227204
5,LR,122_14,-555.809896,-765.637614,0.104238
6,LR,150_8,-416.874705,-583.862732,0.22085
7,LR,33X_49,-465.691631,-649.627875,0.338359
8,LR,83_19,-396.190582,-552.802368,0.238026
9,LR,270_44,-297.075623,-408.772451,0.029363


In [53]:
print("r2 max for cor=0.3 : ", results_df['R2'].max())
print("r2 mean for cor=0.3 : ", results_df['R2'].mean())

r2 max for cor=0.3 :  0.8591471705362181
r2 mean for cor=0.3 :  -52.93752761859946


In [54]:

pd.set_option('display.max_rows', 500)
results_df.head(305)

Unnamed: 0,MODEL,ROUTEID,MAE,RMSE,R2
0,LR,18_3,-566.163374,-789.587982,0.510709
1,LR,67_6,-584.042632,-780.250732,0.300222
2,LR,142_9,-222.262812,-404.617411,-0.07111
3,LR,17_16,-248.773786,-370.991059,0.504084
4,LR,46A_74,-649.209208,-892.463806,0.227204
5,LR,122_14,-555.809896,-765.637614,0.104238
6,LR,150_8,-416.874705,-583.862732,0.22085
7,LR,33X_49,-465.691631,-649.627875,0.338359
8,LR,83_19,-396.190582,-552.802368,0.238026
9,LR,270_44,-297.075623,-408.772451,0.029363


In [55]:
chunks[230].shape

(197, 19)

# Reduced weather feature set & corr=0.1

In [14]:
results_df = pd.DataFrame()

count = 0
for chunk in chunks2:
    print(f'generating pickel for chunk index {count}')
    getPickle(LR, 'LR', chunk, '../FeatureSelection/feature_selection_cor01.csv')
    count +=1

generating pickel for chunk index 0
generating pickel for chunk index 1
generating pickel for chunk index 2
generating pickel for chunk index 3
generating pickel for chunk index 4
generating pickel for chunk index 5
generating pickel for chunk index 6
generating pickel for chunk index 7
generating pickel for chunk index 8
generating pickel for chunk index 9
generating pickel for chunk index 10
generating pickel for chunk index 11
generating pickel for chunk index 12
generating pickel for chunk index 13
generating pickel for chunk index 14
generating pickel for chunk index 15
generating pickel for chunk index 16
generating pickel for chunk index 17
generating pickel for chunk index 18
generating pickel for chunk index 19
generating pickel for chunk index 20
generating pickel for chunk index 21
generating pickel for chunk index 22
generating pickel for chunk index 23
generating pickel for chunk index 24
generating pickel for chunk index 25
generating pickel for chunk index 26
generating 

generating pickel for chunk index 219
generating pickel for chunk index 220
generating pickel for chunk index 221
generating pickel for chunk index 222
generating pickel for chunk index 223
generating pickel for chunk index 224
generating pickel for chunk index 225
generating pickel for chunk index 226
generating pickel for chunk index 227
generating pickel for chunk index 228
generating pickel for chunk index 229
generating pickel for chunk index 230
generating pickel for chunk index 231
generating pickel for chunk index 232
generating pickel for chunk index 233
generating pickel for chunk index 234
generating pickel for chunk index 235
generating pickel for chunk index 236
generating pickel for chunk index 237
generating pickel for chunk index 238
generating pickel for chunk index 239
generating pickel for chunk index 240
generating pickel for chunk index 241
generating pickel for chunk index 242
generating pickel for chunk index 243
generating pickel for chunk index 244
generating p

## CV results

In [15]:
results_df

Unnamed: 0,MODEL,ROUTEID,MAE,RMSE,R2
0,LR,18_3,-556.151174,-773.821755,0.529649
1,LR,67_6,-480.848176,-658.613099,0.500896
2,LR,142_9,-236.468155,-404.033600,-0.166172
3,LR,17_16,-253.771280,-377.193196,0.500692
4,LR,46A_74,-517.283256,-734.520239,0.476553
...,...,...,...,...,...
273,LR,17A_14,-474.065631,-611.080600,0.458037
274,LR,66A_38,-484.666139,-756.154886,0.346680
275,LR,15A_85,-481.244114,-681.329796,0.429975
276,LR,47_136,-558.548441,-768.422805,0.471865


In [16]:
print("r2 max for cor=0.3 : ", results_df['R2'].max())
print("r2 mean for cor=0.3 : ", results_df['R2'].mean())

r2 max for cor=0.3 :  0.8508709294665243
r2 mean for cor=0.3 :  0.42833564348685393


In [17]:
results_df.to_csv('LR_results.csv')