# Table of Contents
[1. Model Preparation](#1.-Model-Preperation)
<br>
* [1.1 Reviewing, Splitting data set](#1.1-Reviewing,-splitting-dataset-into-7:3-for-training-and-testing.)
* [1.2 Plotting features against target feature](#1.2-Plot-to-compare-all-features-to-target-feature-to-help-make-decisions-to-keep-for-the-models.)
    * [1.2.1 Plotting datetime feature against target feature](#Plotting-datetime-feature-against-target-feature)
    * [1.2.2 Plotting numerical features against target feature](#Plotting-numerical-features-against-target-feature)
    * [1.2.3 Plotting categorical features against target feature](#Plotting-categorical-features-against-target-feature)
* [1.3. Summary of all features](#1.3.-Summary-of-all-features)
    * [1.3.1 Numerical Features](#Numerical-Features)
    * [1.3.2 Categorical Features](#Categorical-Features)
* [2. Linear Regression & Random Forest & Decision Trees & K-Nearest-Neighbour](#2.-Linear-Regression-&-Random-Forest-&-Decision-Trees-&-K-Nearest-Neighbour)
* [3. Route model and taking the proportion of the prediction to calculate a journey time for the user](#3.-Route-model-and-taking-the-proportion-of-the-prediction-to-calculate-a-journey-time-for-the-user.)
    * [3.1 Calculating the proportion of each stop from the overall trip](#3.1-Calculating-the-proportion-of-each-stop-from-the-overall-trip.)
* [4. Random Forest & Decision Trees](#4.-Random-Forest-&-Decision-Trees)
* [5. Stop pair model](#5.-Stop-pair-model)
    * [5.1 First version of paired stop approach](#5.1-First-version-of-paired-stop-approach)
    * [5.2.1 Setting up for 46a stop pair models using first approach](#5.2.1-Setting-up-for-46a-stop-pair-models-using-first-approach)
    * [5.3 Stop pair based on entire leavetimes](#5.3-Stop-pair-based-on-entire-leavetimes)

Establishing a connection with sqlite database

In [1]:
# import boto3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import sqlite3
import pickle
import time

# from sagemaker import get_execution_role
from patsy import dmatrices
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn import metrics
from math import log
from statistics import stdev
from statistics import mode


# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Connecting to s3
# role = get_execution_role()
# bucket='sagemaker-studio-520298385440-7in8n1t299'
# data_key = 'route_46a.feather'
# data_location = 's3://{}/{}'.format(bucket, data_key)

In [2]:
# Helper to create a connection to the database
def create_connection(db_file):
    """
    Create a database connection to the SQLite database specified by db_file.
    :param db_file: database file
    :return: Connection object or None
    """
    conn = None
    try:
        conn = sqlite3.connect(db_file)
    except sqlite3.Error as e:
        # Report the failure and fall through to return None
        print(e)

    return conn

In [3]:
# create connection to db
db_file = "C:/Users/fayea/UCD/ResearchPracticum/Data-Analytics-CityRoute/dublinbus.db"
conn = create_connection(db_file)

In [4]:
# initialise query
query = """
SELECT leavetimes.*, weather.*
FROM leavetimes, weather
WHERE TRIPID in  
    (SELECT TRIPID
    FROM trips
    WHERE LINEID = '46A' AND DIRECTION = '1')
AND leavetimes.DAYOFSERVICE = weather.dt;
"""

In [5]:
# execute query and read into dataframe
query_df = pd.read_sql(query, conn)
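The query above uses an implicit join and hard-codes the line and direction. As a minimal sketch, the same shape of query can be written with an explicit `JOIN` and bound parameters. The in-memory tables and their rows below are made up for illustration; only the column names used in the query above are assumed.

```python
import sqlite3
import pandas as pd

# In-memory database with toy rows (illustrative data, not the real schema)
toy_conn = sqlite3.connect(":memory:")
toy_conn.executescript("""
CREATE TABLE trips (TRIPID TEXT, LINEID TEXT, DIRECTION TEXT);
CREATE TABLE leavetimes (TRIPID TEXT, DAYOFSERVICE TEXT, STOPPOINTID TEXT);
CREATE TABLE weather (dt TEXT, temp REAL);
INSERT INTO trips VALUES ('t1', '46A', '1'), ('t2', '39A', '1');
INSERT INTO leavetimes VALUES ('t1', '2018-01-01', '807'), ('t2', '2018-01-01', '100');
INSERT INTO weather VALUES ('2018-01-01', 4.5);
""")

toy_query = """
SELECT leavetimes.*, weather.*
FROM leavetimes
JOIN weather ON leavetimes.DAYOFSERVICE = weather.dt
WHERE leavetimes.TRIPID IN (
    SELECT TRIPID FROM trips WHERE LINEID = ? AND DIRECTION = ?
);
"""

# Bound parameters keep the line and direction out of the SQL string
toy_df = pd.read_sql(toy_query, toy_conn, params=("46A", "1"))
toy_conn.close()
print(toy_df)
```

Only the row for trip `t1` is returned, joined to the weather row for the same day.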

# 1. Model Preperation

In [6]:
# Loading file
df = query_df

In [None]:
df = pd.read_feather('route46a.feather')

## 1.1 Reviewing, splitting dataset into 7:3 for training and testing.

In [None]:
df.head(5)

In [None]:
df.tail(5)

In [None]:
# Missing values
df.isnull().sum()

In [None]:
# Unique types for each feature
df.nunique()

In [None]:
# Datatypes and convert
df.dtypes

In [None]:
# Rows and columns
df.shape

In [None]:
df.describe().T

**Review so far:**
<br>
There are no more missing values and the constant columns have been removed.
* Remove index and dt.
* Investigate level_0.
* Convert the following to categorical: DAYOFWEEK, MONTHOFSERVICE, PROGRNUMBER, STOPPOINTID, VEHICLEID, IS_HOLIDAY, IS_WEEKDAY, TRIPID, weather_id, weather_main, weather_description
* We have data for most of the days of the year and for each month.


In [7]:
df = df.drop(['level_0', 'dt','index'], axis=1)

In [8]:
# Sorting by trip, then day of service, then stop sequence number
df['PROGRNUMBER'] = df['PROGRNUMBER'].astype('int64')
df = df.sort_values(by=['TRIPID', 'DAYOFSERVICE', 'PROGRNUMBER'])

In [9]:
# Creating features
categorical_features = ['DAYOFWEEK', 'MONTHOFSERVICE', 'PROGRNUMBER', 'STOPPOINTID', 'PREVIOUS_STOPPOINTID',
                       'IS_HOLIDAY', 'IS_WEEKDAY', 'TRIPID', 'VEHICLEID', 'weather_id', 'weather_main', 'weather_description']

datetime_features = ['DAYOFSERVICE']

numerical_features = ['PLANNEDTIME_ARR', 'ACTUALTIME_ARR', 'PLANNEDTIME_DEP', 'ACTUALTIME_DEP',
                     'DWELLTIME', 'PLANNEDTIME_TRAVEL', 'temp', 'pressure', 'humidity', 'wind_speed', 'wind_deg', 'rain_1h', 'clouds_all']

target_feat = 'ACTUALTIME_TRAVEL'

In [10]:
# Converting object to categorical
for column in categorical_features:
    df[column] = df[column].astype('category')
    
# Converting dayofservice to datetime
df['DAYOFSERVICE'] = pd.to_datetime(df['DAYOFSERVICE'])

In [11]:
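# Setting ACTUALTIME_TRAVEL and PLANNEDTIME_TRAVEL to 0 for the first stop of each trip.
# PROGRNUMBER was cast to int64 before becoming a category, so compare with the integer 1.
df.loc[df['PROGRNUMBER'] == 1, 'ACTUALTIME_TRAVEL'] = 0
df.loc[df['PROGRNUMBER'] == 1, 'PLANNEDTIME_TRAVEL'] = 0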

In [12]:
df.loc[df['PLANNEDTIME_TRAVEL'] < 0, 'PLANNEDTIME_TRAVEL'] = 0
df.loc[df['ACTUALTIME_TRAVEL'] < 0, 'ACTUALTIME_TRAVEL'] = 0

In [None]:
df['HOUROFSERVICE'] = [int(time.strftime("%H",time.gmtime(hour))) for hour in df['ACTUALTIME_DEP']]
df['eve_rushour'] = [1 if int(time.strftime("%H",time.gmtime(hour))) >= 16 and int(time.strftime("%H",time.gmtime(hour))) <= 19 else 0 for hour in df['ACTUALTIME_DEP']]
df['morn_rushour'] = [1 if int(time.strftime("%H",time.gmtime(hour))) >= 7 and int(time.strftime("%H",time.gmtime(hour))) <= 9 else 0 for hour in df['ACTUALTIME_DEP']]
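The hour and rush-hour flags above are derived from departure times stored as seconds after midnight. A minimal vectorised sketch of the same derivation follows; the toy values are illustrative, and the last line cross-checks the result against the `time.gmtime` approach used in the cell above.

```python
import time
import pandas as pd

# Toy departure times in seconds after midnight (illustrative values)
toy = pd.DataFrame({
    "ACTUALTIME_DEP": [0, 7 * 3600 + 59, 16 * 3600, 19 * 3600 + 3599, 20 * 3600, 25 * 3600]
})

# Vectorised hour of day; % 24 wraps times past midnight, as time.gmtime does
hour = (toy["ACTUALTIME_DEP"] // 3600) % 24
toy["HOUROFSERVICE"] = hour
toy["morn_rushour"] = hour.between(7, 9).astype(int)    # 07:00 to 09:59
toy["eve_rushour"] = hour.between(16, 19).astype(int)   # 16:00 to 19:59

# Cross-check against the list-comprehension approach
gmtime_hours = [int(time.strftime("%H", time.gmtime(s))) for s in toy["ACTUALTIME_DEP"]]
print(toy)
```

On a large dataframe this avoids calling `time.strftime` several times per row.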

In [None]:
df = df.reset_index()
df.to_feather('route46a.feather')

In [None]:
# Making new feature for previous stoppointid and let those with PROGRNUMBER = 1 to 0
# df['PREVIOUS_STOPPOINTID'] = df['STOPPOINTID'].shift()
# first_stop = {'0':'0'}
# df['PREVIOUS_STOPPOINTID'] = df['PREVIOUS_STOPPOINTID'].cat.add_categories(first_stop)
# df.loc[df['PROGRNUMBER'] == '1', 'PREVIOUS_STOPPOINTID'] = '0'

<br><br>
Setting the target feature as _y_ and the remaining features in the dataframe as _X_.
<br><br>

In [None]:
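# Shuffle the rows: assign a random permutation of the index, then sort by it.
# set_index returns a new dataframe, so the result has to be assigned back.
df = df.set_index(np.random.permutation(df.index))
df.sort_index(inplace=True)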

In [None]:
# Creating y and x axis
target_feature = df['ACTUALTIME_TRAVEL']
y = pd.DataFrame(target_feature)
X = df.drop(['ACTUALTIME_TRAVEL'], axis=1)

# Splitting dataset for train and testing data by 70/30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Printing shape of the new split data
print("The original range is: ",df.shape[0])
print("The training range (70%):\t rows 0 to", round(X_train.shape[0]))
print("The test range (30%): \t rows", round(X_train.shape[0]), "to", round(X_train.shape[0]) + X_test.shape[0])
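`train_test_split` shuffles the rows before splitting, so the "rows 0 to N" wording in the printout describes the sizes of the two sets rather than contiguous row ranges. A small sketch on made-up data shows this:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 10 rows (illustrative data)
toy = pd.DataFrame({"feature": np.arange(10), "ACTUALTIME_TRAVEL": np.arange(10) * 2})
X_toy = toy.drop(columns=["ACTUALTIME_TRAVEL"])
y_toy = toy[["ACTUALTIME_TRAVEL"]]

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)

# The index values show which original rows ended up in each set
print("Training rows:", X_tr.shape[0], "index:", X_tr.index.tolist())
print("Test rows:    ", X_te.shape[0], "index:", X_te.index.tolist())
```

The two index lists are interleaved, which is also why the indexes are reset after the final split later in the notebook.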

## 1.2 Plot to compare all features to target feature to help make decisions to keep for the models.

#### Plotting datetime feature against target feature

In [None]:
# Plot datetime feature against target feature
X_train.DAYOFSERVICE = pd.to_numeric(X_train.DAYOFSERVICE)
df_temp = pd.concat([X_train['DAYOFSERVICE'], y_train], axis=1)
correlation_dt = df_temp[['DAYOFSERVICE', 'ACTUALTIME_TRAVEL']].corr(method='pearson')
correlation_dt

In [None]:
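print('PLOT: DAYOFSERVICE')
# Pearson correlation shown as the legend label
corr_value = df_temp[['ACTUALTIME_TRAVEL', 'DAYOFSERVICE']].corr().to_numpy()[0, 1]
# Create the axes explicitly so no empty extra figure is produced
fig, ax = plt.subplots(figsize=(15, 8))
df_temp.plot(kind='scatter', x='DAYOFSERVICE', y='ACTUALTIME_TRAVEL', label="%.3f" % corr_value, ax=ax)
plt.show()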

#### Plotting numerical features against target feature

In [None]:
for column in numerical_features:
    df_temp = pd.concat([X_train[column], y_train], axis=1)
    correlation_dt = df_temp[[column, 'ACTUALTIME_TRAVEL']].corr(method='pearson')
    print('\n',correlation_dt)

In [None]:
for column in numerical_features:
    df_temp = pd.concat([X_train[column], y_train], axis=1)
    correlation_dt = df_temp[[column, 'ACTUALTIME_TRAVEL']].corr(method='spearman')
    print('\n',correlation_dt)
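The two cells above compute Pearson and Spearman correlations for the same columns. Pearson measures how linear a relationship is, while Spearman measures how monotonic it is, so the two can disagree. A small sketch on synthetic data (not from the bus dataset) makes the difference visible:

```python
import numpy as np
import pandas as pd

x = np.arange(1, 21)
# y always increases with x, but far from linearly
toy = pd.DataFrame({"x": x, "y": np.exp(x / 2.0)})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Spearman is 1 here because the ranks of `x` and `y` agree exactly, while Pearson is noticeably lower. A feature with a low Pearson score but a higher Spearman score may therefore still carry a monotonic signal.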

#### Pearson correlation method

In [None]:
print('NUMERICAL FEATURES: PEARSON')
for column in numerical_features:
    df_temp = pd.concat([X_train[column], y_train], axis=1)
    # Pearson correlation shown as the legend label
    corr_value = df_temp[['ACTUALTIME_TRAVEL', column]].corr(method='pearson').to_numpy()[0, 1]
    # Create the axes explicitly so no empty extra figure is produced
    fig, ax = plt.subplots(figsize=(12, 8))
    df_temp.plot(kind='scatter', x=column, y='ACTUALTIME_TRAVEL', label="%.3f" % corr_value, ax=ax)
    plt.show()

#### Spearman correlation method

In [None]:
print('NUMERICAL FEATURES: SPEARMAN')
for column in numerical_features:
    df_temp = pd.concat([X_train[column], y_train], axis=1)
    # Spearman correlation shown as the legend label
    corr_value = df_temp[['ACTUALTIME_TRAVEL', column]].corr(method='spearman').to_numpy()[0, 1]
    # Create the axes explicitly so no empty extra figure is produced
    fig, ax = plt.subplots(figsize=(12, 8))
    df_temp.plot(kind='scatter', x=column, y='ACTUALTIME_TRAVEL', label="%.3f" % corr_value, ax=ax)
    plt.show()

In [None]:
print('NUMERICAL FEATURES: USING CORR()')
df.corr()['ACTUALTIME_TRAVEL'][:]

In [None]:
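# Work on a copy so the log transform does not modify (or warn about) df
df_numeric = df[numerical_features].copy()
for feature in df_numeric:
    # np.log gives -inf for zeros and NaN for negative values
    df_numeric[feature] = np.log(df_numeric[feature])
df_numeric['ACTUALTIME_TRAVEL'] = np.log(df['ACTUALTIME_TRAVEL'])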
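Several of these columns contain zeros (for example, `ACTUALTIME_TRAVEL` is set to 0 for the first stop of each trip), and the natural log of zero is negative infinity. The sketch below shows the effect on made-up values. It also shows `np.log1p` as one possible alternative; this is an assumption for illustration, not what the notebook uses.

```python
import numpy as np
import pandas as pd

toy = pd.Series([0.0, 1.0, 10.0, 100.0], name="rain_1h")  # illustrative values

with np.errstate(divide="ignore"):  # silence the divide-by-zero warning for the demo
    logged = np.log(toy)            # first value becomes -inf
logged_safe = np.log1p(toy)         # log(1 + x), defined at zero

print(logged.tolist())
print(logged_safe.tolist())
```

Infinite values can make correlation scores and plots unreliable, so it is worth checking for them before comparing pre-log and post-log results.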

In [None]:
print('NUMERICAL FEATURES USING LOG DATA')
# Creating y and x axis
target_feature_numeric = df_numeric['ACTUALTIME_TRAVEL']
y_numeric = pd.DataFrame(target_feature_numeric)
X_numeric = df_numeric.drop(['ACTUALTIME_TRAVEL'], axis=1)

# Splitting dataset for train and testing data by 70/30
X_train_numeric, X_test_numeric, y_train_numeric, y_test_numeric = train_test_split(X_numeric, y_numeric, test_size=0.3, random_state=1)

# Printing shape of the new split data
print("The original range is: ",df.shape[0])
print("The training range (70%):\t rows 0 to", round(X_train_numeric.shape[0]))
print("The test range (30%): \t rows", round(X_train_numeric.shape[0]), "to", round(X_train_numeric.shape[0]) + X_test_numeric.shape[0])

for column in numerical_features:
    df_temp = pd.concat([X_train_numeric[column], y_train_numeric], axis=1)
    # Spearman correlation on the logged data, shown as the legend label
    corr_value = df_temp[['ACTUALTIME_TRAVEL', column]].corr(method='spearman').to_numpy()[0, 1]
    # Create the axes explicitly so no empty extra figure is produced
    fig, ax = plt.subplots(figsize=(12, 8))
    df_temp.plot(kind='scatter', x=column, y='ACTUALTIME_TRAVEL', label="%.3f" % corr_value, ax=ax)
    plt.show()

#### Plotting categorical features against target feature

In [None]:
year_features = ['eve_rushour', 'morn_rushour','DAYOFWEEK', 'IS_HOLIDAY', 'IS_WEEKDAY', 'MONTHOFSERVICE', 'weather_id', 'weather_main', 'weather_description']

# Build the combined training frame once instead of on every iteration
df_temp = pd.concat([X_train, y_train], axis=1)

for feature in year_features:
    print(feature)
    unique = df_temp[feature].unique()
    list_average = []
    
    for value in unique:
        # Mean ACTUALTIME_TRAVEL over the rows where the feature takes this value
        list_values = df_temp[df_temp[feature]== value]['ACTUALTIME_TRAVEL'].tolist()
        length_list = len(list_values)
        average =  sum(list_values)/length_list
        list_average += [average]
#         print(f'Sum of values / list of values: \n {sum(list_values)} / {length_list}')
#         print(f'Average ACTUALTIME_TRAVEL: {average}, \n')
        
    # taken from https://pythonspot.com/matplotlib-bar-chart/
    y_pos = np.arange(len(unique))
    plt.bar(y_pos, list_average, align='center')
    plt.xticks(y_pos, unique, rotation=90)
    plt.ylabel('Average ACTUALTIME_TRAVEL')
    plt.title(feature)

    plt.show()
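The per-value averages computed in the loop above can also be obtained with a `groupby`. A minimal sketch on made-up rows:

```python
import pandas as pd

# Illustrative rows; the column names follow the notebook
toy = pd.DataFrame({
    "IS_WEEKDAY": ["1", "1", "0", "0", "1"],
    "ACTUALTIME_TRAVEL": [60, 80, 40, 50, 70],
})

# Mean travel time per category value, equivalent to the manual sum/len loop
averages = toy.groupby("IS_WEEKDAY")["ACTUALTIME_TRAVEL"].mean()
print(averages)
```

The resulting Series can be plotted directly with `averages.plot(kind='bar')`. For columns of `category` dtype, passing `observed=True` to `groupby` skips categories that do not occur in the data.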

In [None]:
# Average time for each vehicle id
df_temp = pd.concat([X_train, y_train], axis=1)
vehicleid = df_temp['VEHICLEID'].unique().tolist()
for id_ in vehicleid:
    print(f'VEHICLEID: {id_}')
    list_values = df_temp[df_temp['VEHICLEID']== id_]['ACTUALTIME_TRAVEL'].tolist()
    length_list = len(list_values)
    average =  sum(list_values)/length_list
    print(f'Average ACTUALTIME_TRAVEL: {average} \n')

In [None]:
# Making dummy variables for categorical 
cat = ['DAYOFWEEK', 'MONTHOFSERVICE', 'PROGRNUMBER', 'STOPPOINTID', 'IS_HOLIDAY', 'IS_WEEKDAY', 'weather_id', 'weather_main', 'weather_description']
df_temp = pd.concat([X_train, y_train], axis=1)
df_copy = df_temp.copy()
df_copy = df_copy[cat]
df_copy = pd.get_dummies(df_copy)
df_copy = pd.concat([df_copy, y_train], axis=1)

categorical_corr = df_copy.corr()['ACTUALTIME_TRAVEL'][:]

In [None]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(categorical_corr)

In [None]:
categorical_list = categorical_corr[categorical_corr > 0.04].index.tolist()
categorical_list.remove('ACTUALTIME_TRAVEL')

In [None]:
categorical_list
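The filter `categorical_corr > 0.04` keeps only dummy columns with a positive correlation above the threshold, so dummies with a strongly negative correlation are discarded as well. Whether that is intended is not stated. The sketch below, on made-up data, compares it with a filter on the absolute value:

```python
import pandas as pd

toy = pd.DataFrame({
    "DAYOFWEEK": ["Mon", "Tue", "Sat", "Sun", "Mon", "Sat"],
    "ACTUALTIME_TRAVEL": [90, 85, 40, 35, 95, 45],
})

# One-hot encode, then correlate every dummy column with the target
dummies = pd.get_dummies(toy[["DAYOFWEEK"]], dtype=float)
combined = pd.concat([dummies, toy["ACTUALTIME_TRAVEL"]], axis=1)
corr = combined.corr()["ACTUALTIME_TRAVEL"].drop("ACTUALTIME_TRAVEL")

threshold = 0.04  # same cut-off as the cell above
selected = corr[corr > threshold].index.tolist()            # positive correlations only
selected_abs = corr[corr.abs() > threshold].index.tolist()  # either direction
print(selected)
print(selected_abs)
```

In this toy data the weekend dummies correlate negatively with travel time, so only the absolute-value filter keeps them.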

## 1.3. Summary of all features
<br><br>
#### Numerical Features
<br><br>

**DayOfService:**
* The correlation with the target feature is very low, at 0.03806.
* It is unlikely to be a useful feature for predicting the target.
* The plot is a straight line, which suggests little to no correlation.
* Conclusion: dropped because of the low correlation score.

**PlannedTime_Arr:**
* The correlation with the target feature is very low, though it improves with the Spearman method.
* Logging the data made little difference to the Spearman plot.
* Before logging, the Pearson and Spearman plots are a continuous straight line, which suggests little correlation. On its own, this is not a reason to drop the feature.
* Most target values fall below 10 and change little as the planned arrival time increases. This makes sense because the target feature is a difference between times, so a poor relationship is expected.
* After logging the data, the plot is more spread out, but the Spearman score stays similarly low (a difference of about 0.02).
* Conclusion: this will be dropped.

**ActualTime_Arr:**
* Compared with PlannedTime_Arr, the Pearson score is poorer, but the Spearman scores before logging are similar.
* As with PlannedTime_Arr, the plot is a straight line, which suggests a poor relationship with the target feature.
* After logging the data, the plot is more spread out, but the Spearman score is not much different from before logging.
* I feel this feature could still serve a purpose in predicting the travel time to the next stop, so dropping it is a trade-off.
* Conclusion: this will be dropped.

**PlannedTime_Dep:**
* Planned departure time has little correlation with the target feature under both the Spearman and Pearson methods.
* It doesn't have a linear relationship; the straight line on both plots shows this.
* With the logged values the correlation score is unchanged, although the data is more spread out, so the relationship does not change much.
* There is an argument for keeping it: the planned departure time is only an estimate, but it should be relatively close to the actual departure time and might help the predictions.
* Conclusion: this will be dropped.

**ActualTime_Dep:**
* Actual departure time is, again, much the same. It records the departure time at a particular stop on the way to the next stop. The correlation is very low even after logging the data, which makes sense because a linear relationship would not be expected.
* The plot is similar to those of the previous features.
* I feel it might still be useful for predicting a time in seconds, since the actual departure time at a particular stop may help.
* Conclusion: this will be dropped.

**Dwell Time:**
* Dwell time has a correlation of 0.03 with the target feature. The graph suggests that the target feature tends to be higher when the dwell time is 0. This might reflect traffic conditions, for example a full bus during rush hour or other busy hours.
* After logging the data, the Spearman scores are similar, but the graph differs from the pre-log plot: the points are grouped more closely together.
* The score is somewhat better than for the previous features, which is an argument for keeping it for the modelling.
* Conclusion: dropped.

**PlannedTime_Travel:**
* The Pearson method gave a correlation of 0.2. This is the highest so far, and we see a slight linear relationship.
* As the planned travel time increases, so does the target feature, which is consistent with that slight linear relationship.
* The Spearman method gave a score of 0.7, which is a good indication that the two features have a monotonic relationship.
* Conclusion: despite this, the feature will be dropped (it is included in the `low_corr` list).

**Temp:**
* Temp has a correlation of -0.009 with the target feature, and an even weaker linear (Pearson) correlation of -0.002.
* This indicates a poor linear/monotonic relationship, so it will not be useful for the model.
* The plots do not give any further evidence that it should be kept.
* Conclusion: drop.

**Pressure:**
* It also has a negative linear relationship with the target feature.
* When looking at the graph plots for both spearman and pearsons, it does not give any further insights.
* For this reason, this feature will be dropped.

**Humidity:**
* Humidity does not have a strong relationship with the target feature, be it linear or monotonic.
* The correlation under both methods is below 0.00.
* Unfortunately, the graph does not show anything useful either.
* The logged data plots show a slight difference, but it is not significant enough to keep this feature, as there is no distinct relationship that we can see.
* Conclusion: drop.

**Windspeed:**
* No linear relationship.
* Indicates a small monotonic relationship.
* This means that as the windspeed value increases, the value of the target feature tends to be higher as well.
* But a Spearman correlation of 0.01 is not strong enough to justify keeping the feature.
* Conclusion: drop

**Wind_Deg:**
* This feature will be dropped immediately as the correlations are both <0.000.

**Rain_1H:**
* It doesn't have a strong linear relationship, but the Spearman correlation shows some promising results once the data has been logged.


<br><br>
#### Categorical Features
<br><br>
**DayOfWeek:**
* In the graph, the actual travel time is higher on weekdays and lower at weekends.
* This suggests a relationship between the day of the week and the target feature, with weekdays tending towards higher actual travel times.
* Conclusion: this will be kept.

**MonthofService:**
* In the graph, there is no clear connection between the month and the target feature, even with the months in order.
* The overall actual travel time is higher in February, dips afterwards, and rises again during the winter season.
* The correlation score is also poor for each month.
* Conclusion: this feature will still be kept.

**Progrnumber:**
* Most PROGRNUMBER values have a correlation <0.00 with the target feature.
* For this reason, this feature will be dropped.
    
**StoppointID:**
* As with PROGRNUMBER, most STOPPOINTID values have a correlation <0.00.
* This indicates a very weak relationship with the target feature.
* For this reason, this feature will be dropped, except for stops with a correlation > 0.04.
    
**Is_Holiday:**
* After analyzing the graph, we see a relationship between the target feature and whether or not the date is a holiday (non-school holiday).
* On a non-holiday, the actual travel time increases.
* On a holiday, the actual travel time decreases.
* This suggests that fewer people use public transport on holiday dates.
* For this reason, this feature will be kept.

**Is_Weekday:**
* Like Is_Holiday, we see a relationship between the target feature and whether or not the day is a weekday.
* There is a contrast between the two values: 1, a weekday, has a higher actual travel time, and vice versa.
* This is a good indication of a relationship with the target feature.
* Therefore, this feature will be kept.

**VehicleID:**
* When looking at the different averages, we see that the average differences are not big.
* For this reason, it may be best to drop this feature because it doesn't give any indication it would be a useful feature to help the prediction models.


## 1.4 Cleaning up features

### Setting low correlation features - keep

In [13]:
# Categorical features
low_corr_categorical = ['DAYOFWEEK', 'MONTHOFSERVICE', 'IS_HOLIDAY', 'IS_WEEKDAY'] 

### Setting low correlation features - drop

In [14]:
# Numerical features
low_corr_numerical = ['PLANNEDTIME_ARR', 'PLANNEDTIME_DEP', 'ACTUALTIME_ARR', 'ACTUALTIME_DEP','PLANNEDTIME_TRAVEL']

low_corr = ['DAYOFSERVICE', 'VEHICLEID', 'TRIPID', 'STOPPOINTID', 'PREVIOUS_STOPPOINTID', 'PROGRNUMBER', 'temp', 'pressure', 'humidity', 
            'wind_deg', 'weather_id', 'weather_description', 'clouds_all', 'wind_speed', 'PLANNEDTIME_ARR', 'PLANNEDTIME_DEP', 'ACTUALTIME_ARR', 'ACTUALTIME_DEP',
           'PLANNEDTIME_TRAVEL', 'DWELLTIME']

### Setting high correlation  features

In [15]:
# Numerical features 
high_corr_numerical = ['DWELLTIME', 'PLANNEDTIME_TRAVEL']

### Dropping features & setting dummy features

In [None]:
df_copy = df.copy()
df_copy = df_copy.drop(low_corr, axis=1)

In [None]:
df_copy = pd.get_dummies(df_copy)

### Training & Testing data

In [None]:
# All features
features = df_copy.columns.tolist()
features

In [None]:
datas = {'ACTUALTIME_TRAVEL': df_copy['ACTUALTIME_TRAVEL']}
y = pd.DataFrame(data=datas)
X = df_copy.drop(['ACTUALTIME_TRAVEL'], axis=1)

In [None]:
# Split the dataset into two datasets: 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=1)

print("The Original range of the dataset: ",df.shape[0])
print("The Training range taken from dataset: (70%): rows 0 to", round(X_train.shape[0]))
print("The Test range taken from dataset: (30%): rows", round(X_train.shape[0]), "to", round(X_train.shape[0]) + X_test.shape[0])

In [None]:
print("\nDescriptive features in X:\n", X_train.head(5))
print("\nTarget feature in y:\n", y_train.head(5))

In [None]:
# After the split the indexes are no longer in order, as the X_train printout shows.
# The next cell resets the indexes of the training and test splits.
X_train.head(5)

In [None]:
# Using .reset_index 
# We see that they are in order again. 
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)
X_train.head(10)

***

<br><br>
# 2. Linear Regression & Random Forest & Decision Trees & K Nearest Neighbour

In [16]:
# Setting up route samples
routesample_1 = df[(df['TRIPID'] == '8591174') & (df['DAYOFSERVICE']=='2018-12-23')]
routesample_2 = df[(df['TRIPID'] == '6106738') & (df['DAYOFSERVICE']=='2018-01-19')]

# List of stops for route 46a
stops_46a = routesample_1.STOPPOINTID.tolist()

# Setting up dummy features
routesample_1 = routesample_1.drop(low_corr, axis=1)
routesample_1 = pd.get_dummies(routesample_1)
actual_routesample_1 = pd.DataFrame(routesample_1['ACTUALTIME_TRAVEL'])
routesample_1 = routesample_1.drop('ACTUALTIME_TRAVEL', axis=1)

routesample_2 = routesample_2.drop(low_corr, axis=1)
routesample_2 = pd.get_dummies(routesample_2)
actual_routesample_2 = pd.DataFrame(routesample_2['ACTUALTIME_TRAVEL'])
routesample_2 = routesample_2.drop('ACTUALTIME_TRAVEL', axis=1)

# Dictionaries for the trained models (version 1), keyed by stop pair
linear_model_v1 = {}
rf_model_v1 = {}
dt_model_v1 = {}
knn_model_v1 = {}

# Setting up list for predictions
linear_v1_pred = np.zeros(shape=(59,1))
linear_v2_pred = np.zeros(shape=(59,1))
rf_model_v1_pred = np.zeros(shape=(59,1))
rf_model_v2_pred = np.zeros(shape=(59,1))
dt_model_v1_pred = np.zeros(shape=(59,1))
dt_model_v2_pred = np.zeros(shape=(59,1))
knn_model_v1_pred = np.zeros(shape=(59,1))
knn_model_v2_pred = np.zeros(shape=(59,1))

<br><br>
## 2.1 Training without additional features - current stopid and previous stopid

In [17]:
for previous, current in zip(stops_46a, stops_46a[1:]):
    
    df_stopid = df[(df['STOPPOINTID']==current) & (df['PREVIOUS_STOPPOINTID']==previous)]
    df_stopid = df_stopid.drop(low_corr, axis=1)
    df_stopid = pd.get_dummies(df_stopid)
    y = pd.DataFrame(df_stopid['ACTUALTIME_TRAVEL'])
    df_stopid = df_stopid.drop('ACTUALTIME_TRAVEL', axis=1)
        
    rfm = RandomForestRegressor(n_estimators=40, oob_score=True, random_state=1)
    dtc_4 = DecisionTreeRegressor(max_depth=4, random_state=1)
    knn = KNeighborsRegressor()
    
    # Training models
    linear_model = LinearRegression().fit(df_stopid, y)
    rf_model = rfm.fit(df_stopid, y)
    dt_model = dtc_4.fit(df_stopid, y)
    knn_model = knn.fit(df_stopid, y)
    
    # Storing models in dictionary
    linear_model_v1[current + '_' + previous] = linear_model
    rf_model_v1[current + '_' + previous] = rf_model
    dt_model_v1[current + '_' + previous] = dt_model
    knn_model_v1[current + '_' + previous] = knn_model
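The loop above trains one model per consecutive stop pair and stores it under a `current_previous` key. The sketch below shows the same pattern end to end on a made-up three-stop route with a single feature; the data and the choice of `LinearRegression` alone are for illustration only.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy leave-time rows for a three-stop route (illustrative values)
toy = pd.DataFrame({
    "PREVIOUS_STOPPOINTID": ["807", "807", "807", "808", "808", "808"],
    "STOPPOINTID":          ["808", "808", "808", "809", "809", "809"],
    "IS_WEEKDAY":           [1, 0, 1, 1, 0, 0],
    "ACTUALTIME_TRAVEL":    [60.0, 40.0, 60.0, 90.0, 70.0, 70.0],
})
stops = ["807", "808", "809"]

models = {}
for previous, current in zip(stops, stops[1:]):
    # Rows for this specific stop pair only
    pair = toy[(toy["STOPPOINTID"] == current) & (toy["PREVIOUS_STOPPOINTID"] == previous)]
    X_pair = pair[["IS_WEEKDAY"]]
    y_pair = pair["ACTUALTIME_TRAVEL"]
    # Same key format as above: "<current>_<previous>"
    models[current + "_" + previous] = LinearRegression().fit(X_pair, y_pair)

# Journey time estimate = sum of the per-pair predictions for one weekday trip
weekday = pd.DataFrame({"IS_WEEKDAY": [1]})
total = sum(float(m.predict(weekday)[0]) for m in models.values())
print(sorted(models), round(total, 1))
```

Summing the per-pair predictions gives the estimate for the whole journey, which is how the predicted total journey times below are obtained.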

### 2.1.1 Obtaining predictions - route sample 1

In [19]:
index = 0

predictions_1 = []
predictions_2 = []
predictions_3 = []
predictions_4 = []

for previous, current in zip(stops_46a, stops_46a[1:]):
    if previous == '807' and current == '817':
        continue
    
    predictions_1 += [linear_model_v1[current + '_' + previous].predict(routesample_1.iloc[[index]])]
    predictions_2 += [rf_model_v1[current + '_' + previous].predict(routesample_1.iloc[[index]])]
    predictions_3 += [dt_model_v1[current + '_' + previous].predict(routesample_1.iloc[[index]])]
    predictions_4 += [knn_model_v1[current + '_' + previous].predict(routesample_1.iloc[[index]])]
    index += 1

for pred in range(len(predictions_1)):
    linear_v1_pred[pred] = predictions_1[pred][0][0]
    rf_model_v1_pred[pred] = predictions_2[pred][0]
    dt_model_v1_pred[pred] = predictions_3[pred][0]
    knn_model_v1_pred[pred] = predictions_4[pred][0]

### 2.1.2 Obtaining predictions - route sample 2

In [20]:
index = 0

predictions_1 = []
predictions_2 = []
predictions_3 = []
predictions_4 = []

for previous, current in zip(stops_46a, stops_46a[1:]):
    if previous == '807' and current == '817':
        continue
    
    predictions_1 += [linear_model_v1[current + '_' + previous].predict(routesample_2.iloc[[index]])]
    predictions_2 += [rf_model_v1[current + '_' + previous].predict(routesample_2.iloc[[index]])]
    predictions_3 += [dt_model_v1[current + '_' + previous].predict(routesample_2.iloc[[index]])]
    predictions_4 += [knn_model_v1[current + '_' + previous].predict(routesample_2.iloc[[index]])]
    index += 1
    

for pred in range(len(predictions_1)):
    linear_v2_pred[pred] = predictions_1[pred][0][0]
    rf_model_v2_pred[pred] = predictions_2[pred][0]
    dt_model_v2_pred[pred] = predictions_3[pred][0]
    knn_model_v2_pred[pred] = predictions_4[pred][0]

<br><br>
Printing evaluation metrics for route sample 1

In [21]:
# Printing evaluation metrics
print('Linear Model Evaluation Metrics: \n')
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actual_routesample_1, linear_v1_pred)))
print("MSE Score: ", metrics.mean_squared_error(actual_routesample_1, linear_v1_pred))
print("MAE Score: ", metrics.mean_absolute_error(actual_routesample_1, linear_v1_pred))
print("R2 Score: ", metrics.r2_score(actual_routesample_1, linear_v1_pred))

actual_total_linear = sum(actual_routesample_1.ACTUALTIME_TRAVEL)
predicted_total_linear = sum(linear_v1_pred)
print(f'\nActual total journey time: {actual_total_linear} seconds.')
print(f'Predicted total journey time: {predicted_total_linear[0]} seconds')

Linear Model Evaluation Metrics: 

RMSE Score:  83.68133293045231
MSE Score:  7002.565481017202
MAE Score:  34.11403022378178
R2 Score:  0.034754464791250994

Actual total journey time: 3088 seconds.
Predicted total journey time: 3075.960205078125 seconds


In [22]:
# Printing evaluation metrics
print('Random Forest Evaluation Metrics: \n')
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actual_routesample_1, rf_model_v1_pred)))
print("MSE Score: ", metrics.mean_squared_error(actual_routesample_1, rf_model_v1_pred))
print("MAE Score: ", metrics.mean_absolute_error(actual_routesample_1, rf_model_v1_pred))
print("R2 Score: ", metrics.r2_score(actual_routesample_1, rf_model_v1_pred))

actual_total_linear = sum(actual_routesample_1.ACTUALTIME_TRAVEL)
predicted_total_linear = sum(rf_model_v1_pred)
print(f'\nActual total journey time: {actual_total_linear} seconds.')
print(f'Predicted total journey time: {predicted_total_linear[0]} seconds')

Random Forest Evaluation Metrics: 

RMSE Score:  83.12660265954864
MSE Score:  6910.032069718478
MAE Score:  33.645659336232484
R2 Score:  0.04750942757679766

Actual total journey time: 3088 seconds.
Predicted total journey time: 3024.4059229390546 seconds


In [23]:
# Printing evaluation metrics
print('Decision Trees Evaluation Metrics: \n')
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actual_routesample_1, dt_model_v1_pred)))
print("MSE Score: ", metrics.mean_squared_error(actual_routesample_1, dt_model_v1_pred))
print("MAE Score: ", metrics.mean_absolute_error(actual_routesample_1, dt_model_v1_pred))
print("R2 Score: ", metrics.r2_score(actual_routesample_1, dt_model_v1_pred))

actual_total_linear = sum(actual_routesample_1.ACTUALTIME_TRAVEL)
predicted_total_linear = sum(dt_model_v1_pred)
print(f'\nActual total journey time: {actual_total_linear} seconds.')
print(f'Predicted total journey time: {predicted_total_linear[0]} seconds')

Decision Trees Evaluation Metrics: 

RMSE Score:  83.24699486036393
MSE Score:  6930.0621532814575
MAE Score:  33.98629730207484
R2 Score:  0.044748446793192564

Actual total journey time: 3088 seconds.
Predicted total journey time: 3032.4926459055937 seconds


In [24]:
# Printing evaluation metrics
print('KNN Evaluation Metrics: \n')
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actual_routesample_1, knn_model_v1_pred)))
print("MSE Score: ", metrics.mean_squared_error(actual_routesample_1, knn_model_v1_pred))
print("MAE Score: ", metrics.mean_absolute_error(actual_routesample_1, knn_model_v1_pred))
print("R2 Score: ", metrics.r2_score(actual_routesample_1, knn_model_v1_pred))

actual_total_linear = sum(actual_routesample_1.ACTUALTIME_TRAVEL)
predicted_total_linear = sum(knn_model_v1_pred)
print(f'\nActual total journey time: {actual_total_linear} seconds.')
print(f'Predicted total journey time: {predicted_total_linear[0]} seconds')

KNN Evaluation Metrics: 

RMSE Score:  86.89320934634924
MSE Score:  7550.429830508475
MAE Score:  34.32542372881356
R2 Score:  -0.04076408889882899

Actual total journey time: 3088 seconds.
Predicted total journey time: 2920.8 seconds


<br><br>
Printing evaluation metrics for route sample 2

In [25]:
# Printing evaluation metrics
print('Linear Model Evaluation Metrics: \n')
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actual_routesample_2, linear_v2_pred)))
print("MSE Score: ", metrics.mean_squared_error(actual_routesample_2, linear_v2_pred))
print("MAE Score: ", metrics.mean_absolute_error(actual_routesample_2, linear_v2_pred))
print("R2 Score: ", metrics.r2_score(actual_routesample_2, linear_v2_pred))

actual_total_linear = sum(actual_routesample_2.ACTUALTIME_TRAVEL)
predicted_total_linear = sum(linear_v2_pred)
print(f'\nActual total journey time: {actual_total_linear} seconds.')
print(f'Predicted total journey time: {predicted_total_linear[0]} seconds')

Linear Model Evaluation Metrics: 

RMSE Score:  72.98483677190029
MSE Score:  5326.786398620929
MAE Score:  38.11338883739407
R2 Score:  -0.26088046708461676

Actual total journey time: 3571 seconds.
Predicted total journey time: 3472.107421875 seconds


In [26]:
# Printing evaluation metrics
print('Random Forest Model Evaluation Metrics: \n')
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actual_routesample_2, rf_model_v2_pred)))
print("MSE Score: ", metrics.mean_squared_error(actual_routesample_2, rf_model_v2_pred))
print("MAE Score: ", metrics.mean_absolute_error(actual_routesample_2, rf_model_v2_pred))
print("R2 Score: ", metrics.r2_score(actual_routesample_2, rf_model_v2_pred))

actual_total_linear = sum(actual_routesample_2.ACTUALTIME_TRAVEL)
predicted_total_linear = sum(rf_model_v2_pred)
print(f'\nActual total journey time: {actual_total_linear} seconds.')
print(f'Predicted total journey time: {predicted_total_linear[0]} seconds')

Random Forest Model Evaluation Metrics: 

RMSE Score:  71.72452992020408
MSE Score:  5144.40819227425
MAE Score:  37.26930656705153
R2 Score:  -0.21771051417192067

Actual total journey time: 3571 seconds.
Predicted total journey time: 3405.7749909606014 seconds


In [27]:
# Printing evaluation metrics
print('Decision Tree Model Evaluation Metrics: \n')
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actual_routesample_2, dt_model_v2_pred)))
print("MSE Score: ", metrics.mean_squared_error(actual_routesample_2, dt_model_v2_pred))
print("MAE Score: ", metrics.mean_absolute_error(actual_routesample_2, dt_model_v2_pred))
print("R2 Score: ", metrics.r2_score(actual_routesample_2, dt_model_v2_pred))

actual_total_linear = sum(actual_routesample_2.ACTUALTIME_TRAVEL)
predicted_total_linear = sum(dt_model_v2_pred)
print(f'\nActual total journey time: {actual_total_linear} seconds.')
print(f'Predicted total journey time: {predicted_total_linear[0]} seconds')

Decision Tree Model Evaluation Metrics: 

RMSE Score:  73.08694498088572
MSE Score:  5341.701526639015
MAE Score:  38.53056990972517
R2 Score:  -0.26441096224149807

Actual total journey time: 3571 seconds.
Predicted total journey time: 3524.7613502582353 seconds


In [28]:
# Printing evaluation metrics
print('KNN Model Evaluation Metrics: \n')
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actual_routesample_2, knn_model_v2_pred)))
print("MSE Score: ", metrics.mean_squared_error(actual_routesample_2, knn_model_v2_pred))
print("MAE Score: ", metrics.mean_absolute_error(actual_routesample_2, knn_model_v2_pred))
print("R2 Score: ", metrics.r2_score(actual_routesample_2, knn_model_v2_pred))

actual_total_linear = sum(actual_routesample_2.ACTUALTIME_TRAVEL)
predicted_total_linear = sum(knn_model_v2_pred)
print(f'\nActual total journey time: {actual_total_linear} seconds.')
print(f'Predicted total journey time: {predicted_total_linear[0]} seconds')

KNN Model Evaluation Metrics: 

RMSE Score:  71.31724777715749
MSE Score:  5086.149830508474
MAE Score:  37.135593220338976
R2 Score:  -0.20392043045205677

Actual total journey time: 3571 seconds.
Predicted total journey time: 3320.4 seconds


## 2.2 Training with additional features  - current stopid and previous stopid

[Back to top section](#2.1-Training-without-additional-features---current-stopid-and-previous-stopid)

In [55]:
# Making new features: flag departures in the morning (07:00-09:59) and evening (16:00-19:59) rush hours
# df['HOUROFSERVICE'] = [int(time.strftime("%H",time.gmtime(hour))) for hour in df['ACTUALTIME_DEP']]
dep_hours = [int(time.strftime("%H", time.gmtime(sec))) for sec in df['ACTUALTIME_DEP']]
df['eve_rushour'] = [1 if 16 <= hour <= 19 else 0 for hour in dep_hours]
df['morn_rushour'] = [1 if 7 <= hour <= 9 else 0 for hour in dep_hours]
df['morn_rushour'] = df['morn_rushour'].astype('category')
df['eve_rushour'] = df['eve_rushour'].astype('category')
df = df.drop('HOUROFSERVICE', axis=1)
# df = df.drop('morn_rushour', axis=1)


# Setting up route samples
routesample_1 = df[(df['TRIPID'] == '8591174') & (df['DAYOFSERVICE']=='2018-12-23')]
routesample_2 = df[(df['TRIPID'] == '6106738') & (df['DAYOFSERVICE']==' 2018-01-19')]

# List of stops for route 46a
stops_46a = routesample_1.STOPPOINTID.tolist()

# Setting up dummy features and separating the target feature
routesample_1 = routesample_1.drop(low_corr, axis=1)
routesample_1 = pd.get_dummies(routesample_1)
actual_routesample_1 = pd.DataFrame(routesample_1['ACTUALTIME_TRAVEL'])
routesample_1 = routesample_1.drop('ACTUALTIME_TRAVEL', axis=1)

routesample_2 = routesample_2.drop(low_corr, axis=1)
routesample_2 = pd.get_dummies(routesample_2)
actual_routesample_2 = pd.DataFrame(routesample_2['ACTUALTIME_TRAVEL'])
routesample_2 = routesample_2.drop('ACTUALTIME_TRAVEL', axis=1)

# Setting up dictionary to store trained models
linear_model_v2 = {}
dt_model_v2 = {}
rf_model_v2 = {}
knn_model_v2 = {}

# Setting up zero-filled arrays to hold the predictions
linear_v1_pred = np.zeros(shape=(59,1))
linear_v2_pred = np.zeros(shape=(59,1))
rf_model_v1_pred = np.zeros(shape=(59,1))
rf_model_v2_pred = np.zeros(shape=(59,1))
dt_model_v1_pred = np.zeros(shape=(59,1))
dt_model_v2_pred = np.zeros(shape=(59,1))
knn_model_v1_pred = np.zeros(shape=(59,1))
knn_model_v2_pred = np.zeros(shape=(59,1))
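The rush-hour flags above depend only on the hour of departure. Below is a minimal standalone sketch of that mapping, assuming ACTUALTIME_DEP holds seconds since midnight (an assumption based on how `time.gmtime` is applied to it). The helper name `rush_hour_flags` and the sample times are illustrative and not part of the notebook.

```python
import time

def rush_hour_flags(seconds_since_midnight):
    """Return (morn_rushour, eve_rushour) as 0/1 flags for one departure time."""
    # gmtime interprets the value as seconds since the epoch, so the hour it
    # reports equals the hour of day for a seconds-since-midnight value
    hour = time.gmtime(seconds_since_midnight).tm_hour
    morning = 1 if 7 <= hour <= 9 else 0
    evening = 1 if 16 <= hour <= 19 else 0
    return morning, evening

# 08:30 is in the morning rush, 17:15 in the evening rush, 12:00 in neither
examples = [8 * 3600 + 30 * 60, 17 * 3600 + 15 * 60, 12 * 3600]
flags = [rush_hour_flags(sec) for sec in examples]
print(flags)  # [(1, 0), (0, 1), (0, 0)]
```

Computing the hour once per row and deriving both flags from it avoids repeating the time conversion for every comparison.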

In [56]:
for previous, current in zip(stops_46a, stops_46a[1:]):
    
    # Rows for this stop pair only
    df_stopid = df[(df['STOPPOINTID']==current) & (df['PREVIOUS_STOPPOINTID']==previous)]
    df_stopid = df_stopid.drop(low_corr, axis=1)
    df_stopid = pd.get_dummies(df_stopid)
    y = pd.DataFrame(df_stopid['ACTUALTIME_TRAVEL'])
    df_stopid = df_stopid.drop('ACTUALTIME_TRAVEL', axis=1)
        
    rfm = RandomForestRegressor(n_estimators=40, oob_score=True, random_state=1)
    dtc_4 = DecisionTreeRegressor(max_depth=4, random_state=1)
    knn = KNeighborsRegressor()
    
    # Training models
    linear_model = LinearRegression().fit(df_stopid, y)
    rf_model = rfm.fit(df_stopid, y)
    dt_model = dtc_4.fit(df_stopid, y)
    knn_model = knn.fit(df_stopid, y)
    
    # Storing models in dictionary
    linear_model_v2[current + '_' + previous] = linear_model
    rf_model_v2[current + '_' + previous] = rf_model
    dt_model_v2[current + '_' + previous] = dt_model
    knn_model_v2[current + '_' + previous] = knn_model

### 2.2.1 Obtaining predictions - route sample 1

In [57]:
index = 0

predictions_1 = []
predictions_2 = []
predictions_3 = []
predictions_4 = []

for previous, current in zip(stops_46a, stops_46a[1:]):
    if previous == '807' and current == '817':
        continue
    
    predictions_1 += [linear_model_v2[current + '_' + previous].predict(routesample_1.iloc[[index]])]
    predictions_2 += [rf_model_v2[current + '_' + previous].predict(routesample_1.iloc[[index]])]
    predictions_3 += [dt_model_v2[current + '_' + previous].predict(routesample_1.iloc[[index]])]
    predictions_4 += [knn_model_v2[current + '_' + previous].predict(routesample_1.iloc[[index]])]
    index += 1

for pred in range(len(predictions_1)):
    linear_v1_pred[pred] = predictions_1[pred][0][0]
    rf_model_v1_pred[pred] = predictions_2[pred][0]
    dt_model_v1_pred[pred] = predictions_3[pred][0]
    knn_model_v1_pred[pred] = predictions_4[pred][0]

### 2.2.2 Obtaining predictions - route sample 2

In [58]:
index = 0

predictions_1 = []
predictions_2 = []
predictions_3 = []
predictions_4 = []

for previous, current in zip(stops_46a, stops_46a[1:]):
    if previous == '807' and current == '817':
        continue
    
    predictions_1 += [linear_model_v2[current + '_' + previous].predict(routesample_2.iloc[[index]])]
    predictions_2 += [rf_model_v2[current + '_' + previous].predict(routesample_2.iloc[[index]])]
    predictions_3 += [dt_model_v2[current + '_' + previous].predict(routesample_2.iloc[[index]])]
    predictions_4 += [knn_model_v2[current + '_' + previous].predict(routesample_2.iloc[[index]])]
    index += 1
    

for pred in range(len(predictions_1)):
    linear_v2_pred[pred] = predictions_1[pred][0][0]
    rf_model_v2_pred[pred] = predictions_2[pred][0]
    dt_model_v2_pred[pred] = predictions_3[pred][0]
    knn_model_v2_pred[pred] = predictions_4[pred][0]

<br><br>
Printing evaluation metrics for route sample 1

In [59]:
# Printing evaluation metrics
print('Linear Model Evaluation Metrics: \n')
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actual_routesample_1, linear_v1_pred)))
print("MSE Score: ", metrics.mean_squared_error(actual_routesample_1, linear_v1_pred))
print("MAE Score: ", metrics.mean_absolute_error(actual_routesample_1, linear_v1_pred))
print("R2 Score: ", metrics.r2_score(actual_routesample_1, linear_v1_pred))

actual_total_linear = sum(actual_routesample_1.ACTUALTIME_TRAVEL)
predicted_total_linear = sum(linear_v1_pred)
print(f'\nActual total journey time: {actual_total_linear} seconds.')
print(f'Predicted total journey time: {predicted_total_linear[0]} seconds')

Linear Model Evaluation Metrics: 

RMSE Score:  83.61311147640558
MSE Score:  6991.152410765826
MAE Score:  32.91250662076271
R2 Score:  0.03632766180497804

Actual total journey time: 3088 seconds.
Predicted total journey time: 2936.134765625 seconds


In [60]:
# Printing evaluation metrics
print('Random Forest Evaluation Metrics: \n')
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actual_routesample_1, rf_model_v1_pred)))
print("MSE Score: ", metrics.mean_squared_error(actual_routesample_1, rf_model_v1_pred))
print("MAE Score: ", metrics.mean_absolute_error(actual_routesample_1, rf_model_v1_pred))
print("R2 Score: ", metrics.r2_score(actual_routesample_1, rf_model_v1_pred))

actual_total_linear = sum(actual_routesample_1.ACTUALTIME_TRAVEL)
predicted_total_linear = sum(rf_model_v1_pred)
print(f'\nActual total journey time: {actual_total_linear} seconds.')
print(f'Predicted total journey time: {predicted_total_linear[0]} seconds')

Random Forest Evaluation Metrics: 

RMSE Score:  85.55130874123965
MSE Score:  7319.026427338908
MAE Score:  33.956089352886394
R2 Score:  -0.008867050256772702

Actual total journey time: 3088 seconds.
Predicted total journey time: 2987.1297782103416 seconds


In [61]:
# Printing evaluation metrics
print('Decision Trees Evaluation Metrics: \n')
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actual_routesample_1, dt_model_v1_pred)))
print("MSE Score: ", metrics.mean_squared_error(actual_routesample_1, dt_model_v1_pred))
print("MAE Score: ", metrics.mean_absolute_error(actual_routesample_1, dt_model_v1_pred))
print("R2 Score: ", metrics.r2_score(actual_routesample_1, dt_model_v1_pred))

actual_total_linear = sum(actual_routesample_1.ACTUALTIME_TRAVEL)
predicted_total_linear = sum(dt_model_v1_pred)
print(f'\nActual total journey time: {actual_total_linear} seconds.')
print(f'Predicted total journey time: {predicted_total_linear[0]} seconds')

Decision Trees Evaluation Metrics: 

RMSE Score:  85.41215638525424
MSE Score:  7295.236458379125
MAE Score:  33.875282408339054
R2 Score:  -0.00558780047561247

Actual total journey time: 3088 seconds.
Predicted total journey time: 2969.4982309295583 seconds


In [62]:
# Printing evaluation metrics
print('KNN Evaluation Metrics: \n')
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actual_routesample_1, knn_model_v1_pred)))
print("MSE Score: ", metrics.mean_squared_error(actual_routesample_1, knn_model_v1_pred))
print("MAE Score: ", metrics.mean_absolute_error(actual_routesample_1, knn_model_v1_pred))
print("R2 Score: ", metrics.r2_score(actual_routesample_1, knn_model_v1_pred))

actual_total_linear = sum(actual_routesample_1.ACTUALTIME_TRAVEL)
predicted_total_linear = sum(knn_model_v1_pred)
print(f'\nActual total journey time: {actual_total_linear} seconds.')
print(f'Predicted total journey time: {predicted_total_linear[0]} seconds')

KNN Evaluation Metrics: 

RMSE Score:  86.31116745310912
MSE Score:  7449.617627118644
MAE Score:  35.28813559322034
R2 Score:  -0.026867963331652422

Actual total journey time: 3088 seconds.
Predicted total journey time: 3012.400000000001 seconds


<br><br>
Printing evaluation metrics for route sample 2

In [63]:
# Printing evaluation metrics
print('Linear Model Evaluation Metrics: \n')
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actual_routesample_2, linear_v2_pred)))
print("MSE Score: ", metrics.mean_squared_error(actual_routesample_2, linear_v2_pred))
print("MAE Score: ", metrics.mean_absolute_error(actual_routesample_2, linear_v2_pred))
print("R2 Score: ", metrics.r2_score(actual_routesample_2, linear_v2_pred))

actual_total_linear = sum(actual_routesample_2.ACTUALTIME_TRAVEL)
predicted_total_linear = sum(linear_v2_pred)
print(f'\nActual total journey time: {actual_total_linear} seconds.')
print(f'Predicted total journey time: {predicted_total_linear[0]} seconds')

Linear Model Evaluation Metrics: 

RMSE Score:  71.86010137476937
MSE Score:  5163.87416959213
MAE Score:  36.78793697033898
R2 Score:  -0.2223182210961523

Actual total journey time: 3571 seconds.
Predicted total journey time: 3313.75390625 seconds


In [64]:
# Printing evaluation metrics
print('Random Forest Model Evaluation Metrics: \n')
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actual_routesample_2, rf_model_v2_pred)))
print("MSE Score: ", metrics.mean_squared_error(actual_routesample_2, rf_model_v2_pred))
print("MAE Score: ", metrics.mean_absolute_error(actual_routesample_2, rf_model_v2_pred))
print("R2 Score: ", metrics.r2_score(actual_routesample_2, rf_model_v2_pred))

actual_total_linear = sum(actual_routesample_2.ACTUALTIME_TRAVEL)
predicted_total_linear = sum(rf_model_v2_pred)
print(f'\nActual total journey time: {actual_total_linear} seconds.')
print(f'Predicted total journey time: {predicted_total_linear[0]} seconds')

Random Forest Model Evaluation Metrics: 

RMSE Score:  69.80301899895186
MSE Score:  4872.461461368033
MAE Score:  35.24740333678491
R2 Score:  -0.15333918492621712

Actual total journey time: 3571 seconds.
Predicted total journey time: 3174.8641115646724 seconds


In [65]:
# Printing evaluation metrics
print('Decision Tree Model Evaluation Metrics: \n')
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actual_routesample_2, dt_model_v2_pred)))
print("MSE Score: ", metrics.mean_squared_error(actual_routesample_2, dt_model_v2_pred))
print("MAE Score: ", metrics.mean_absolute_error(actual_routesample_2, dt_model_v2_pred))
print("R2 Score: ", metrics.r2_score(actual_routesample_2, dt_model_v2_pred))

actual_total_linear = sum(actual_routesample_2.ACTUALTIME_TRAVEL)
predicted_total_linear = sum(dt_model_v2_pred)
print(f'\nActual total journey time: {actual_total_linear} seconds.')
print(f'Predicted total journey time: {predicted_total_linear[0]} seconds')

Decision Tree Model Evaluation Metrics: 

RMSE Score:  70.96467090024154
MSE Score:  5035.984515979589
MAE Score:  36.11643799801047
R2 Score:  -0.19204601678474642

Actual total journey time: 3571 seconds.
Predicted total journey time: 3303.1281121651737 seconds


In [66]:
# Printing evaluation metrics
print('KNN Model Evaluation Metrics: \n')
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actual_routesample_2, knn_model_v2_pred)))
print("MSE Score: ", metrics.mean_squared_error(actual_routesample_2, knn_model_v2_pred))
print("MAE Score: ", metrics.mean_absolute_error(actual_routesample_2, knn_model_v2_pred))
print("R2 Score: ", metrics.r2_score(actual_routesample_2, knn_model_v2_pred))

actual_total_linear = sum(actual_routesample_2.ACTUALTIME_TRAVEL)
predicted_total_linear = sum(knn_model_v2_pred)
print(f'\nActual total journey time: {actual_total_linear} seconds.')
print(f'Predicted total journey time: {predicted_total_linear[0]} seconds')

KNN Model Evaluation Metrics: 

RMSE Score:  70.13334998173849
MSE Score:  4918.686779661018
MAE Score:  36.36949152542373
R2 Score:  -0.16428097920118212

Actual total journey time: 3571 seconds.
Predicted total journey time: 2948.4 seconds


<br><br>

# 3. Route model and taking the proportion of the prediction to calculate a journey time for the user.

## 3.1 Calculating the proportion of each stop from the overall trip.

In [None]:
def proportion_stops(predictions):
    # Running total of the predicted travel time from the first stop up to each stop
    sum_each_stop = np.cumsum(predictions, dtype=float)
    overall_prediction = np.sum(predictions)
    
    # Dividing each running total by the overall prediction gives the percentage of the trip completed at that stop
    proportion_each_stop = sum_each_stop / overall_prediction * 100
        
    return proportion_each_stop
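To make the proportion calculation concrete, here is a small worked sketch in plain Python. The travel times are made-up values and `cumulative_proportions` is an illustrative name; it is not the notebook's function, only the same idea written without NumPy.

```python
from itertools import accumulate

def cumulative_proportions(predictions):
    """Percentage of the whole trip completed on arrival at each stop."""
    total = sum(predictions)
    # accumulate yields the running total up to and including each stop
    return [running * 100 / total for running in accumulate(predictions)]

# Hypothetical predicted travel times between consecutive stops, in seconds
stop_times = [30.0, 60.0, 90.0, 120.0]
proportions = cumulative_proportions(stop_times)
print(proportions)  # [10.0, 30.0, 60.0, 100.0]
```

The last entry is always 100, since the running total at the final stop equals the overall prediction.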

## 3.2 Return the progrnumber based off the stoppointid in a route

Find the most common progrnumber for a given stoppointid. The most common value is used because most trips on a line are assumed to follow the complete route; only a few trips take a different route and skip some stops as a result.

In [None]:
# Code taken from https://www.geeksforgeeks.org/python-find-most-frequent-element-in-a-list/
from statistics import mode

# array only accepts a pandas Series or numpy array
def most_common(array):
    # Most frequent value in the array
    most_frequent = mode(array.tolist())
    # A progrnumber of '1' maps to index 0
    if most_frequent == '1':
        return 0
    
    return most_frequent

## 3.3 Calculating the journey time from a start to end destination based on user input

Look up the progrnumber of the start and end stops from their stoppointid, then use the cumulative proportions of the predicted trip to find the travel time between them.

In [None]:
def journey_time(start,end, prediction):
    # Converting into int because the function returns a string
    start_progrnum = int(most_common(df['PROGRNUMBER'][df['STOPPOINTID']==start]))
    end_progrnum = int(most_common(df['PROGRNUMBER'][df['STOPPOINTID']==end]))
    
#     print(start_progrnum)
#     print(end_progrnum)

    proportion_array = proportion_stops(prediction)
    overall_prediction = np.sum(prediction)
    
    # calculating the time difference from start to end destination 
    start_prediction = (proportion_array[start_progrnum]/100) * overall_prediction
    end_prediction = (proportion_array[end_progrnum]/100) * overall_prediction
    
    journeytime = end_prediction - start_prediction
    
    # print(journeytime)
    
    return journeytime

In [None]:
user_start = '807'
user_end = '812'

journey_time(user_start, user_end, prediction_46a)
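The journey-time calculation comes down to subtracting two cumulative travel times. The following standalone sketch illustrates this with made-up numbers, under the assumption that position `i` in the list holds the predicted travel time into the stop at index `i`. The name `time_between` is hypothetical and the stop lookup by stoppointid is left out.

```python
from itertools import accumulate

def time_between(predictions, start_index, end_index):
    """Predicted seconds between two stops, given per-stop travel times."""
    total = sum(predictions)
    # Cumulative share of the trip completed at each stop, as a percentage
    proportions = [running * 100 / total for running in accumulate(predictions)]
    start_seconds = proportions[start_index] / 100 * total
    end_seconds = proportions[end_index] / 100 * total
    return end_seconds - start_seconds

stop_times = [0.0, 40.0, 60.0, 50.0, 50.0]   # hypothetical values, in seconds
elapsed = time_between(stop_times, 1, 3)
print(round(elapsed, 6))  # 110.0, i.e. 60 s into stop 2 plus 50 s into stop 3
```

Converting to percentages and back is kept here only to mirror the proportion-based approach; the difference of the two running totals gives the same number.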

***

<br><br>
# 5. Stop pair model

## 5.1 First version of paired stop approach
<br><br>
This approach trains one model per stop, using the rows recorded at that stopid together with the rows whose previous stopid matches its most frequent previous stop.

In [None]:
# Returns a paired list of stops
def paired_stops(df):
    stopid = df['STOPPOINTID'].unique().tolist()
    previous_stopid = []
    for i in stopid:
        prev = df['PREVIOUS_STOPPOINTID'][df['STOPPOINTID']==i]
        # Adds most frequent previous stopid to list
        previous_stopid += [prev.value_counts().idxmax()]
    
    return [stopid, previous_stopid]
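The pairing step picks, for every stop, the previous stop that occurs most often. A minimal sketch of the same idea without pandas, using made-up observations; `most_frequent_previous` is an illustrative helper, not a function from the notebook.

```python
from collections import Counter, defaultdict

def most_frequent_previous(rows):
    """Map each stop ID to the previous stop ID seen most often before it."""
    counts = defaultdict(Counter)
    for stop_id, previous_id in rows:
        counts[stop_id][previous_id] += 1
    # most_common(1) returns a one-item list of (value, count)
    return {stop: counter.most_common(1)[0][0] for stop, counter in counts.items()}

# Hypothetical (STOPPOINTID, PREVIOUS_STOPPOINTID) observations
observations = [('808', '807'), ('808', '807'), ('808', '2039'), ('809', '808')]
pairs = most_frequent_previous(observations)
print(pairs)  # {'808': '807', '809': '808'}
```

Taking the most frequent value means an occasional diverted trip (here the single `'2039'` row) does not change the pairing.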

In [None]:
# Pair each stop with its most frequent previous stop
pair_stops = paired_stops(df)

for ids in range(len(pair_stops[0])):
    
    # Making new dataframe: rows for the current stop plus rows for its previous stop
    stops_df = pd.concat([df[df['STOPPOINTID']==pair_stops[0][ids]],
                          df[df['PREVIOUS_STOPPOINTID']==pair_stops[1][ids]]])
    
    # Setting target feature
    y = stops_df['ACTUALTIME_TRAVEL']
    
    # Dropping target feature and low corr features
    stops_df = stops_df.drop(low_corr, axis=1)
    stops_df = stops_df.drop('ACTUALTIME_TRAVEL', axis=1)
    stops_df = pd.get_dummies(stops_df)
    
    # Fitting model based on stops
    linear_reg = LinearRegression().fit(stops_df, y)
    
    # Save to pickle file 
    with open('model_'+pair_stops[0][ids]+'.pkl', 'wb') as handle:
        pickle.dump(linear_reg, handle)

In [None]:
pair_stops = paired_stops(df)
# Rows for the current stop plus rows for its most frequent previous stop
stops_df = pd.concat([df[df['STOPPOINTID']==pair_stops[0][5]],
                      df[df['PREVIOUS_STOPPOINTID']==pair_stops[1][5]]])

# Setting target feature
y = stops_df['ACTUALTIME_TRAVEL']
    
# Dropping target feature and low corr features
stops_df = stops_df.drop(low_corr, axis=1)
stops_df = stops_df.drop('ACTUALTIME_TRAVEL', axis=1)
stops_df = pd.get_dummies(stops_df)

# Fitting/Training model based on stops
linear_reg_model_ = LinearRegression().fit(stops_df, y)

# Saving to pickle File
with open('model_'+pair_stops[0][5]+'.pkl', 'wb') as handle:
    pickle.dump(linear_reg_model_, handle)


In [None]:
sampledf = stops_df.iloc[[0]]
sample_prediction = linear_reg_model_.predict(sampledf)

In [None]:
sample_prediction

In [None]:
with open('model_'+pair_stops[0][5]+'.pkl', 'rb') as handle:
    model = pickle.load(handle)

In [None]:
model.predict(sampledf)

## 5.2.1 Setting up for 46a stop pair models using first approach

In [None]:
# Function to get previous stopid and return a paired list
def pair_stopids(current_stopids):
    previous_stopid = []
    for i in current_stopids:
        prev = df['PREVIOUS_STOPPOINTID'][df['STOPPOINTID']==i]
        # Adds most frequent previous stopid to list
        previous_stopid += [prev.value_counts().idxmax()]
    
    return [current_stopids, previous_stopid]

In [None]:
# Loading the json file
import json
with open('routes_and_stops.json') as file:
    routes_stops = json.load(file)

In [None]:
# Get all stops for 46a going outbound ('1')
list_46a_stops = routes_stops['46A']['outbound']

# Pairing stopids and prev stopids from 46a route
pairing_46a_stopids = pair_stopids(list_46a_stops)
predictions = []

In [None]:
for ids in range(len(pairing_46a_stopids[0])):
    # Making new dataframe: rows for the current stop plus rows for its previous stop
    stops_df = pd.concat([df[df['STOPPOINTID']==pairing_46a_stopids[0][ids]],
                          df[df['PREVIOUS_STOPPOINTID']==pairing_46a_stopids[1][ids]]])
    
    # Setting target feature
    y = stops_df['ACTUALTIME_TRAVEL']
    
    # Dropping target feature and low corr features
    stops_df = stops_df.drop(low_corr, axis=1)
    stops_df = stops_df.drop('ACTUALTIME_TRAVEL', axis=1)
    stops_df = pd.get_dummies(stops_df)
    
    # Fitting model based on stops
    linear_reg_model = LinearRegression().fit(stops_df, y)
    
    # Save to pickle file
#     with open('model_'+pairing_46a_stopids[0][ids]+'.pkl', 'wb') as handle:
#         pickle.dump(linear_reg_model, handle)

    # Predicting data with the previously saved model for this stop
    with open('stop_'+pairing_46a_stopids[0][ids]+'.pkl', 'rb') as handle:
        model = pickle.load(handle)
    
    # Row `ids` of route_46a is assumed to correspond to the stop at position `ids`
    k = model.predict(route_46a.iloc[[ids]])
    predictions += [k]

In [None]:
# Printing evaluation metrics
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actualtimes_46a, predictions)))
print("MSE Score: ", metrics.mean_squared_error(actualtimes_46a, predictions))
print("MAE Score: ", metrics.mean_absolute_error(actualtimes_46a, predictions))
print("R2 Score: ", metrics.r2_score(actualtimes_46a, predictions))

<br><br>
##### Conclusion:
The linear regression model performs poorly. The MSE is above 1000 (in squared seconds) and the R2 score is negative, which means the model does worse than always predicting the mean travel time. The features and parameters need to be tuned; keeping dwelltime as a feature might help.
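To show what a negative R2 score means, the sketch below computes R2 by hand as 1 - SS_res / SS_tot on made-up travel times. The function name `r2_score_manual` and all values are illustrative; this is the standard single-output definition, not output from the models above.

```python
def r2_score_manual(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [40.0, 50.0, 60.0]        # hypothetical travel times in seconds
mean_only = [50.0, 50.0, 50.0]     # always predict the mean
poor_model = [70.0, 50.0, 30.0]    # trend predicted in the wrong direction

print(r2_score_manual(actual, mean_only))   # 0.0
print(r2_score_manual(actual, poor_model))  # -8.0
```

A score of 0 corresponds to predicting the mean every time, so any negative score indicates predictions with a larger squared error than that baseline.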

## 5.3 Stop pair based on entire leavetimes

<br><br>
1) Make a rough query that selects rows that contain a certain stopid and its previous stopid based on the direction.

In [None]:
# initialise query - for OUTBOUND (WHERE DIRECTION == '1')
# query_stopid = "SELECT leavetimes.*, weather.* FROM leavetimes, weather WHERE leavetimes.STOPPOINTID = " + current_stopid + " AND leavetimes.DAYOFSERVICE = weather.dt"
# query_stopid_df = pd.read_sql(query_stopid, conn)

<br><br>
2) Make a function that flattens a list of lists into a single list

In [None]:
def combine_listsoflist(to_combine):
    combined = []
    for each_list in to_combine:
        combined += each_list
    return combined

<br><br>
4) Make a function that will get rid of the duplicates in the list

In [None]:
def get_unique(stopids_list):
    return list(set(stopids_list))

<br><br>
5) Make a list to store all stopids for DIRECTION == outbound/1.

In [None]:
# Loading the json file
import json
with open('routes_and_stops.json') as file:
    routes_stops = json.load(file)

In [None]:
# Looping through every lineid, outbound 
stopids_outbound = []
for i,j in routes_stops.items():
    try:
#         print(i, '\n', routes_stops[i]['outbound'], '\n')
        stopids_outbound += [routes_stops[i]['outbound']]
    except KeyError:
        continue
        
# Calling function to get combined list
combined_stopids_outbound = combine_listsoflist(stopids_outbound)

# Calling function to get unique stopids from combined list
unique_stopids_outbound = get_unique(combined_stopids_outbound)

<br><br>
6) Make a list to store all stopids for DIRECTION ==inbound/2.

In [None]:
# Looping through every lineid, inbound
stopids_inbound = []
for i,j in routes_stops.items():
    try:
#         print(i, '\n', routes_stops[i]['inbound'], '\n')
        stopids_inbound += [routes_stops[i]['inbound']]
    except KeyError:
        continue
        
# Calling function to get combined list
combined_stopids_inbound = combine_listsoflist(stopids_inbound)

# Getting unique inbound stopids - using a set() difference to get rid of stops already covered by the outbound stops
unique_stopids_inbound = list(set(combined_stopids_inbound) - set(combined_stopids_outbound))
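The flatten, de-duplicate and subtract steps can be seen end to end on a tiny example. The dictionary below is a made-up stand-in for routes_and_stops.json (line ID, then direction, then stop IDs), and `itertools.chain` is swapped in for the hand-written flattening function.

```python
from itertools import chain

# Hypothetical slice of the routes file: line ID -> direction -> stop IDs
routes = {
    '46A': {'outbound': ['807', '808', '809'], 'inbound': ['809', '900']},
    '39A': {'outbound': ['808', '767']},          # no inbound entry
}

# Flatten every line's stop list per direction and keep each stop once
outbound = set(chain.from_iterable(r['outbound'] for r in routes.values() if 'outbound' in r))
inbound = set(chain.from_iterable(r['inbound'] for r in routes.values() if 'inbound' in r))

# Stops served only in the inbound direction
inbound_only = sorted(inbound - outbound)
print(sorted(outbound))   # ['767', '807', '808', '809']
print(inbound_only)       # ['900']
```

Stop `'809'` appears in both directions, so it is counted with the outbound stops and dropped from the inbound-only list.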

<br><br>
7) Query to select all of the previous_stopids based on the current stopid and put it to a list

In [None]:
# query_previoustop = "SELECT leavetimes.PREVIOUS_STOPPOINTID FROM leavetimes WHERE leavetimes.STOPPOINTID = " + current_stopid 
# query_prevstop_df = pd.read_sql(query_previoustop, conn)

# # Converting into a pandas series then to list
# query_prevstop_series = query_prevstop_df.iloc[0]
# query_prevstop_list = query_prevstop_series.tolist()

<br><br>
8) Query to select the rows based on the previous stopids and append them to the current dataframe of the current stopid


In [None]:
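def df_prev_stops(query_prevstop_list):
    # str(tuple(...)) renders as ('a', 'b'), which SQL accepts after IN (a plain list would render with square brackets).
    # For a single stopid use df_prev_stops_one_element, because a one-element tuple renders with a trailing comma.
    query_prevstop_rows = "SELECT leavetimes.* FROM leavetimes WHERE leavetimes.PREVIOUS_STOPPOINTID IN " + str(tuple(query_prevstop_list))
    print(query_prevstop_rows)
    df_prevstop = pd.read_sql(query_prevstop_rows, conn)
    return df_prevstop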

In [None]:
def df_prev_stops_one_element(query_prevstop_list):
    query_prevstop_rows = "SELECT leavetimes.* FROM leavetimes WHERE leavetimes.PREVIOUS_STOPPOINTID = " + str(query_prevstop_list)
    print(query_prevstop_rows)
    df_prevstop = pd.read_sql(query_prevstop_rows, conn)
    return df_prevstop
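An alternative to building the IN clause by string concatenation is to use `?` placeholders, which handles one or many stop IDs with the same code. The sketch below runs against an in-memory SQLite table with made-up rows; `conn_demo` and `rows_for_previous_stops` are illustrative names, and the table only mimics two columns of leavetimes.

```python
import sqlite3

# In-memory stand-in for the leavetimes table; the rows are made up
conn_demo = sqlite3.connect(':memory:')
conn_demo.execute("CREATE TABLE leavetimes (STOPPOINTID TEXT, PREVIOUS_STOPPOINTID TEXT)")
conn_demo.executemany(
    "INSERT INTO leavetimes VALUES (?, ?)",
    [('808', '807'), ('809', '808'), ('810', '809'), ('900', '2039')],
)

def rows_for_previous_stops(connection, previous_ids):
    """Select rows whose PREVIOUS_STOPPOINTID is one of previous_ids."""
    # One "?" per ID; the values are passed separately, never pasted into the SQL
    placeholders = ', '.join('?' for _ in previous_ids)
    query = ("SELECT STOPPOINTID, PREVIOUS_STOPPOINTID FROM leavetimes "
             "WHERE PREVIOUS_STOPPOINTID IN (" + placeholders + ") ORDER BY STOPPOINTID")
    return connection.execute(query, list(previous_ids)).fetchall()

many = rows_for_previous_stops(conn_demo, ['807', '809'])
one = rows_for_previous_stops(conn_demo, ['2039'])
print(many)  # [('808', '807'), ('810', '809')]
print(one)   # [('900', '2039')]
```

With this pattern a separate single-element variant of the query is not needed, because the placeholder list is generated to match the number of IDs.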

<br><br>
9) Adding index on STOPPOINTID and PREVIOUS_STOPPOINTID

In [None]:
# Adding indexes
# add_index1 = """CREATE INDEX stopid ON leavetimes(STOPPOINTID);"""
# add_index2 = """CREATE INDEX previous_stopid ON leavetimes(PREVIOUS_STOPPOINTID);"""
# conn.execute(add_index1)
# conn.execute(add_index2)
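To check that such indexes are picked up, SQLite can report its query plan. This is a small sketch on an empty in-memory table with the same index names as above; `conn_demo` is an illustrative connection, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn_demo = sqlite3.connect(':memory:')
conn_demo.execute(
    "CREATE TABLE leavetimes (STOPPOINTID TEXT, PREVIOUS_STOPPOINTID TEXT, ACTUALTIME_TRAVEL INTEGER)"
)

# IF NOT EXISTS makes the statements safe to run more than once
conn_demo.execute("CREATE INDEX IF NOT EXISTS stopid ON leavetimes(STOPPOINTID)")
conn_demo.execute("CREATE INDEX IF NOT EXISTS previous_stopid ON leavetimes(PREVIOUS_STOPPOINTID)")

# The last column of each EXPLAIN QUERY PLAN row describes how the query will run
plan = conn_demo.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM leavetimes WHERE STOPPOINTID = ?", ('807',)
).fetchall()
plan_text = ' '.join(row[-1] for row in plan)
print(plan_text)

# PRAGMA index_list returns one row per index; the name is the second column
index_names = [row[1] for row in conn_demo.execute("PRAGMA index_list('leavetimes')")]
print(sorted(index_names))
```

If the plan text mentions the index name, the lookup on STOPPOINTID is served by the index rather than a full table scan.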

<br><br>
10) Piecing every step together

In [None]:
# Lists the stops that already have a saved model, so the run can resume after an interruption
import os 
model_files = os.listdir('C:/Users/fayea/UCD/ResearchPracticum/Data-Analytics-CityRoute/stop_models')

# Split each file name on '_' and '.', keeping only the numeric parts (the stopids)
done_stopids = [part for name in model_files
                for chunk in name.split('_')
                for part in chunk.split('.') if part.isdigit()]

# Keep only the stops that still need a model
unique_stopids_outbound = [x for x in unique_stopids_outbound if x not in done_stopids]
len(unique_stopids_outbound)
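The resume step relies on recovering stop IDs from the saved file names. A minimal sketch of that parsing with a regular expression, assuming names of the form `stop_<stopid>.pkl`; the file list and the helper `finished_stop_ids` are made up for illustration.

```python
import re

def finished_stop_ids(filenames):
    """Extract stop IDs from names shaped like 'stop_<digits>.pkl'."""
    pattern = re.compile(r'^stop_(\d+)\.pkl$')
    return {m.group(1) for m in map(pattern.match, filenames) if m}

saved = ['stop_807.pkl', 'stop_2039.pkl', 'notes.txt']   # hypothetical directory listing
all_stops = ['807', '808', '2039', '810']
pending = [s for s in all_stops if s not in finished_stop_ids(saved)]
print(pending)  # ['808', '810']
```

Matching the whole file name avoids picking up digits from unrelated files in the same folder.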

In [None]:
previous_stops = {}
for stopid in unique_stopids_inbound:
    # Get all previous stopids in list
    query_previoustop = "SELECT DISTINCT leavetimes.PREVIOUS_STOPPOINTID FROM leavetimes WHERE leavetimes.STOPPOINTID = " + stopid
    query_prevstop_df = pd.read_sql(query_previoustop, conn)

    # Converting the column into a list, excluding the stopid '0'
    query_prevstop_series = query_prevstop_df['PREVIOUS_STOPPOINTID'].tolist()
    query_prevstop_list = [prev_stopid for prev_stopid in query_prevstop_series if prev_stopid != '0']
    previous_stops[stopid] = query_prevstop_list
    print('finished')
    
with open('previous_stops_inbound.json', 'w+') as fp:
    json.dump(previous_stops, fp)

In [None]:
# import boto3
import pandas as pd
import numpy as np
import sqlite3
import pickle

# from sagemaker import get_execution_role
from sklearn.linear_model import LinearRegression
from math import log
from multiprocessing import Pool


# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Connecting to s3
# role = get_execution_role()
# bucket='sagemaker-studio-520298385440-7in8n1t299'
# data_key = 'route_46a.feather'
# data_location = 's3://{}/{}'.format(bucket, data_key)

In [None]:
low_corr = ['DAYOFSERVICE', 'VEHICLEID', 'TRIPID', 'STOPPOINTID', 'PREVIOUS_STOPPOINTID', 'PROGRNUMBER',
            'temp', 'pressure', 'humidity', 'wind_speed', 'wind_deg', 'weather_id', 'weather_description', 'clouds_all',
            'PLANNEDTIME_ARR', 'PLANNEDTIME_DEP', 'ACTUALTIME_ARR', 'ACTUALTIME_DEP',
            'PLANNEDTIME_TRAVEL', 'DWELLTIME']

In [None]:
# def function to create connection to db
def create_connection(db_file):
    """
    create a database connection to the SQLite database specified by db_file
    :param db_file: database file
    :return: Connection object or None
    """
    conn = None
    try: 
        conn = sqlite3.connect(db_file)
        return conn
    except sqlite3.Error as e:
        # sqlite3.Error is the base class for the module's exceptions
        print(e)
        
    return conn

In [None]:
# create connection to db
db_file = "C:/Users/fayea/UCD/ResearchPracticum/Data-Analytics-CityRoute/dublinbus.db"
conn = create_connection(db_file)

In [None]:
# Outbound
with open('previous_stops.json') as file:
    previous_stops = json.load(file)

# Query to get all of weather
weather_query = "SELECT weather.* from weather"
weather_df = pd.read_sql(weather_query, conn)
weather_df = weather_df.rename(columns={"dt": "DAYOFSERVICE"})

for current_stopid in unique_stopids_outbound:
    
    query_prevstop_series = previous_stops[current_stopid]
    query_prevstop_list = tuple(query_prevstop_series)
    if len(query_prevstop_list) == 1:
        # Making query to db and make df (a one-element tuple would render with a trailing comma, so the single stopid is used directly)
        query_stopid = "SELECT leavetimes.* FROM leavetimes WHERE leavetimes.STOPPOINTID = " + current_stopid + " OR leavetimes.PREVIOUS_STOPPOINTID = " + str(query_prevstop_list[0])
        df = pd.read_sql(query_stopid, conn)
        
        # Append previous stops rows to main df
#         to_add = df_prev_stops_one_element(query_prevstop_series[0])
#         df = pd.concat([df,to_add])
        df = df.merge(weather_df, on='DAYOFSERVICE', how='left')
        
    elif len(query_prevstop_list) == 0:
        continue
        
    else:
        # Making query to db and make df
        query_stopid = "SELECT leavetimes.* FROM leavetimes WHERE leavetimes.STOPPOINTID = " + current_stopid + " OR leavetimes.PREVIOUS_STOPPOINTID IN " + str(query_prevstop_list)
        df = pd.read_sql(query_stopid, conn)
        
        # Append previous stops rows to main df
#         to_add = df_prev_stops(query_prevstop_list)
#         df = pd.concat([df,to_add]) 
        df = df.merge(weather_df, on='DAYOFSERVICE', how='left')

    # Drop low correlated features and setting target feature
    df = df.drop(low_corr, axis=1)
    tf = df['ACTUALTIME_TRAVEL']
    df = df.drop('ACTUALTIME_TRAVEL', axis=1)
    df = pd.get_dummies(df)
    print(df.shape[1])
#     # Fitting model based on stops
#     linear_reg_model = LinearRegression().fit(df, tf)
    
#     # Save to pickle file
#     with open('C:/Users/fayea/UCD/ResearchPracticum/Data-Analytics-CityRoute/stop_models/stop_'+ current_stopid +'.pkl', 'wb') as handle:
#         pickle.dump(linear_reg_model, handle)
    
    print('Finished.')

In [None]:
# Inbound 
for current_stopid in unique_stopids_inbound:
    # Making query to db and make df
    query_stopid = "SELECT leavetimes.*, weather.* FROM leavetimes, weather WHERE leavetimes.STOPPOINTID = " + current_stopid + " AND leavetimes.DAYOFSERVICE = weather.dt"
    df = pd.read_sql(query_stopid, conn)
    
    # Get all previous stopids in list
    query_previoustop = "SELECT DISTINCT leavetimes.PREVIOUS_STOPPOINTID FROM leavetimes WHERE leavetimes.STOPPOINTID = " + current_stopid 
    query_prevstop_df = pd.read_sql(query_previoustop, conn)

    # Converting the PREVIOUS_STOPPOINTID column into a list
    query_prevstop_list = query_prevstop_df['PREVIOUS_STOPPOINTID'].tolist()
    
    # Append previous stops rows to main df
    if len(query_prevstop_list) == 1:
        to_add = df_prev_stops_one_element(query_prevstop_list[0])
    else:
        to_add = df_prev_stops(tuple(query_prevstop_list))
    df = pd.concat([df, to_add])
    
    # Drop low correlated features and setting target feature
    df = df.drop(low_corr, axis=1)
    tf = df['ACTUALTIME_TRAVEL']
    df = df.drop('ACTUALTIME_TRAVEL', axis=1)
    df = pd.get_dummies(df)
    
    # Fitting model based on stops
    linear_reg_model = LinearRegression().fit(df, tf)
    
    # Save to pickle file, named after the inbound stopid
    with open('/UCD/ResearchPracticum/Data-Analytics-CityRoute/stop_models/stop_'+ current_stopid +'.pkl', 'wb') as handle:
        pickle.dump(linear_reg_model, handle)

***