# Table of Contents
[1. Model Preparation](#1.-Model-Preperation)
<br>
* [1.1 Reviewing, Splitting data set](#1.1-Reviewing,-splitting-dataset-into-7:3-for-training-and-testing.)
* [1.2 Plotting features against target feature](#1.2-Plot-to-compare-all-features-to-target-feature-to-help-make-decisions-to-keep-for-the-models.)
    * [1.2.1 Plotting datetime feature against target feature](#Plotting-datetime-feature-against-target-feature)
    * [1.2.2 Plotting numerical features against target feature](#Plotting-numerical-features-against-target-feature)
    * [1.2.3 Plotting categorical features against target feature](#Plotting-categorical-features-against-target-feature)
* [1.3. Summary of all features](#1.3.-Summary-of-all-features)
    * [1.3.1 Numerical Features](#Numerical-Features)
    * [1.3.1 Cateogrical Features](#Categorical-Features)
* [2. Linear Regression](#2.-Linear-Regression)
* [3. Route model and taking the proportion of the prediction to calculate a journey time for the user](#3.-Route-model-and-taking-the-proportion-of-the-prediction-to-calculate-a-journey-time-for-the-user.)
    * [3.1 Calculating the proportion of each stop from the overall trip](#3.1-Calculating-the-proportion-of-each-stop-from-the-overall-trip.)
* [4. Random Forest & Decision Trees](#4.-Random-Forest-&-Decision-Trees)
* [5. Stop pair model](#5.-Stop-pair-model)
    * [5.1 First version of paired stop approach](#5.1-First-version-of-paired-stop-approach)
    * [5.2.1 Setting up for 46a stop pair models using first approach](#5.2.1-Setting-up-for-46a-stop-pair-models-using-first-approach)
    * [5.3 Stop pair based on entire leavetimes](#5.3-Stop-pair-based-on-entire-leavetimes)

Establishing a connection with sqlite database

In [1]:
# import boto3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import sqlite3
import pickle
import time

# from sagemaker import et_execution_role
from patsy import dmatrices
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from math import log
from statistics import stdev
from statistics import mode


# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Connecting to s3
# role = get_execution_role()
# bucket='sagemaker-studio-520298385440-7in8n1t299'
# data_key = 'route_46a.feather'
# data_location = 's3://{}/{}'.format(bucket, data_key)

In [2]:
# def function to create connection to db
def create_connection(db_file):
    """
    create a database connection to the SQLite database specified by db_file
    :param df_file: database file
    :return: Connection object or None
    """
    conn = None
    try: 
        conn = sqlite3.connect(db_file)
        return conn
    except 'Error' as e:
        print(e)
        
    return conn

In [3]:
# create connection to db
db_file = "C:/Users/fayea/UCD/ResearchPracticum/Data-Analytics-CityRoute/dublinbus.db"
conn = create_connection(db_file)

In [4]:
# initialise query
query = """
SELECT leavetimes.*, weather.*
FROM leavetimes, weather
WHERE TRIPID in  
    (SELECT TRIPID
    FROM trips
    WHERE LINEID = '46A' AND DIRECTION = '1')
AND leavetimes.DAYOFSERVICE = weather.dt;
"""

In [5]:
# execute query and read into dataframe
query_df = pd.read_sql(query, conn)

# 1. Model Preperation

In [108]:
# Loading file
df = query_df

In [None]:
df = pd.read_feather('route46a.feather')

## 1.1 Reviewing, splitting dataset into 7:3 for training and testing.

In [None]:
df.head(5)

In [None]:
df.tail(5)

In [None]:
# Missing values
df.isnull().sum()

In [None]:
# Unique types for each feature
df.nunique()

In [None]:
# Datatypes and convert
df.dtypes

In [None]:
# Rows and columns
df.shape

In [None]:
df.describe().T

**Review so far:**
<br>
There are no more missing values and the constant columns have been removed.
* Remove index, index, dt.
* Investigate level_0.
* Convert the following to categorical: DAYOFWEEK, MONTHOFSERVICE, PROGRNUMBER, STOPPOINTID, VEHICLEID, IS_HOLIDAY, IS_WEEKDAY, TRIPID, weather_id, weather_main, weather_description
* We have data for most of the days of the year and for each month.


In [109]:
df = df.drop(['level_0', 'dt','index'], axis=1)

In [110]:
# Sorting by trip then dayofservice
df = df.sort_values(by=['TRIPID', 'DAYOFSERVICE', 'PROGRNUMBER'])

In [111]:
# Creating features
categorical_features = ['DAYOFWEEK', 'MONTHOFSERVICE', 'PROGRNUMBER', 'STOPPOINTID', 'PREVIOUS_STOPPOINTID',
                       'IS_HOLIDAY', 'IS_WEEKDAY', 'TRIPID', 'VEHICLEID', 'weather_id', 'weather_main', 'weather_description']

datetime_features = ['DAYOFSERVICE']

numerical_features = ['PLANNEDTIME_ARR', 'ACTUALTIME_ARR', 'PLANNEDTIME_DEP', 'ACTUALTIME_DEP',
                     'DWELLTIME', 'PLANNEDTIME_TRAVEL', 'temp', 'pressure', 'humidity', 'wind_speed', 'wind_deg', 'rain_1h', 'clouds_all']

target_feat = 'ACTUALTIME_TRAVEL'

In [112]:
# Converting object to categorical
for column in categorical_features:
    df[column] = df[column].astype('category')
    
# Converting dayofservice to datetime
df['DAYOFSERVICE'] = pd.to_datetime(df['DAYOFSERVICE'])

In [113]:
# Replacing PROGRNUMBER equal to 1 of ACTUALTIME_TRAVEL with 0
df.loc[df['PROGRNUMBER'] == '1', 'ACTUALTIME_TRAVEL'] = 0
df.loc[df['PROGRNUMBER'] == '1', 'PLANNEDTIME_TRAVEL'] = 0

In [114]:
df.loc[df['PLANNEDTIME_TRAVEL'] < 0, 'PLANNEDTIME_TRAVEL'] = 0
df.loc[df['ACTUALTIME_TRAVEL'] < 0, 'ACTUALTIME_TRAVEL'] = 0

In [None]:
# add feature
df['morn_rushour'] = [1 if int(time.strftime("%H",time.gmtime(hour))) >= 7 and int(time.strftime("%H",time.gmtime(hour))) <= 9 else 0 for hour in df['ACTUALTIME_DEP']]

In [None]:
# add feature
df['eve_rushour'] = [1 if int(time.strftime("%H",time.gmtime(hour))) >= 16 and int(time.strftime("%H",time.gmtime(hour))) <= 19 else 0 for hour in df['ACTUALTIME_DEP']]

In [None]:
# hour of service
df['HOUROFSERVICE'] = [int(time.strftime("%H",time.gmtime(hour))) for hour in df['ACTUALTIME_DEP']]

In [None]:
df = df.reset_index()
df.to_feather('route46a.feather')

In [None]:
# Making new feature for previous stoppointid and let those with PROGRNUMBER = 1 to 0
# df['PREVIOUS_STOPPOINTID'] = df['STOPPOINTID'].shift()
# first_stop = {'0':'0'}
# df['PREVIOUS_STOPPOINTID'] = df['PREVIOUS_STOPPOINTID'].cat.add_categories(first_stop)
# df.loc[df['PROGRNUMBER'] == '1', 'PREVIOUS_STOPPOINTID'] = '0'

<br><br>
Setting the target feature as _y and x_ as the remaining features in the dataframe. 
<br><br>

In [None]:
df.set_index(np.random.permutation(df.index))
# sort the resulting random index
df.sort_index(inplace=True)

In [None]:
# Creating y and x axis
target_feature = df['ACTUALTIME_TRAVEL']
y = pd.DataFrame(target_feature)
X = df.drop(['ACTUALTIME_TRAVEL'], axis=1)

# Splitting dataset for train and testing data by 70/30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Printing shape of the new split data
print("The original range is: ",df.shape[0])
print("The training range (70%):\t rows 0 to", round(X_train.shape[0]))
print("The test range (30%): \t rows", round(X_train.shape[0]), "to", round(X_train.shape[0]) + X_test.shape[0])

## 1.2 Plot to compare all features to target feature to help make decisions to keep for the models.

#### Plotting datetime feature against target feature

In [None]:
# Plot datetime feature against target feature
X_train.DAYOFSERVICE = pd.to_numeric(X_train.DAYOFSERVICE)
df_temp = pd.concat([X_train['DAYOFSERVICE'], y_train], axis=1)
correlation_dt = df_temp[['DAYOFSERVICE', 'ACTUALTIME_TRAVEL']].corr(method='pearson')
correlation_dt

In [None]:
print('PLOT: DAYOFSERVICE')
fig = plt.figure()
ax = fig.add_subplot
df_temp.plot(kind='scatter', x='DAYOFSERVICE', y='ACTUALTIME_TRAVEL', label = "%.3f" % df_temp[['ACTUALTIME_TRAVEL', 'DAYOFSERVICE']].corr().to_numpy()[0,1], figsize=(15, 8))
plt.show()

#### Plotting numerical features against target feature

In [None]:
for column in numerical_features:
    df_temp = pd.concat([X_train[column], y_train], axis=1)
    correlation_dt = df_temp[[column, 'ACTUALTIME_TRAVEL']].corr(method='pearson')
    print('\n',correlation_dt)

In [None]:
for column in numerical_features:
    df_temp = pd.concat([X_train[column], y_train], axis=1)
    correlation_dt = df_temp[[column, 'ACTUALTIME_TRAVEL']].corr(method='spearman')
    print('\n',correlation_dt)

#### Pearson correlation method

In [None]:
print('NUMERICAL FEATURES: PEARSON')
for column in numerical_features:
    df_temp = pd.concat([X_train[column], y_train], axis=1)
    fig = plt.figure()
    ax = fig.add_subplot
    df_temp.plot(kind='scatter', x=column, y='ACTUALTIME_TRAVEL', label = "%.3f" % df_temp[['ACTUALTIME_TRAVEL', column]].corr(method='pearson').to_numpy()[0,1], figsize=(12, 8))
    plt.show()

#### Spearman correlation method

In [None]:
print('NUMERICAL FEATURES: SPEARSMAN')
for column in numerical_features:
    df_temp = pd.concat([X_train[column], y_train], axis=1)
    fig = plt.figure()
    ax = fig.add_subplot
    df_temp.plot(kind='scatter', x=column, y='ACTUALTIME_TRAVEL', label = "%.3f" % df_temp[['ACTUALTIME_TRAVEL', column]].corr(method='spearman').to_numpy()[0,1], figsize=(12, 8))
    plt.show()

In [None]:
print('NUMERICAL FEATURES: USING CORR()')
df.corr()['ACTUALTIME_TRAVEL'][:]

In [None]:
df_numeric = df[numerical_features]
for feature in df_numeric:
    df_numeric[feature] = np.log(df_numeric[feature])
df_numeric['ACTUALTIME_TRAVEL'] = np.log(df['ACTUALTIME_TRAVEL'])

In [None]:
print('NUMERICAL FEATURES USING LOG DATA')
# Creating y and x axis
target_feature_numeric = df_numeric['ACTUALTIME_TRAVEL']
y_numeric = pd.DataFrame(target_feature_numeric)
X_numeric = df_numeric.drop(['ACTUALTIME_TRAVEL'], axis=1)

# Splitting dataset for train and testing data by 70/30
X_train_numeric, X_test_numeric, y_train_numeric, y_test_numeric = train_test_split(X_numeric, y_numeric, test_size=0.3, random_state=1)

# Printing shape of the new split data
print("The original range is: ",df.shape[0])
print("The training range (70%):\t rows 0 to", round(X_train_numeric.shape[0]))
print("The test range (30%): \t rows", round(X_train_numeric.shape[0]), "to", round(X_train_numeric.shape[0]) + X_test_numeric.shape[0])

for column in numerical_features:
    df_temp = pd.concat([X_train_numeric[column], y_train_numeric], axis=1)
    fig = plt.figure()
    ax = fig.add_subplot
    df_temp.plot(kind='scatter', x=column, y='ACTUALTIME_TRAVEL', label = "%.3f" % df_temp[['ACTUALTIME_TRAVEL', column]].corr(method='spearman').to_numpy()[0,1], figsize=(12, 8))
    plt.show()

#### Plotting categorical features against target feature

In [None]:
year_features = ['eve_rushour', 'morn_rushour','DAYOFWEEK', 'IS_HOLIDAY', 'IS_WEEKDAY', 'MONTHOFSERVICE', 'weather_id', 'weather_main', 'weather_description']

for feature in year_features:
    print(feature)
    df_temp = pd.concat([X_train, y_train], axis=1)
    unique = df_temp[feature].unique()
    list_average = []
    
    for value in unique:
        list_values = df_temp[df_temp[feature]== value]['ACTUALTIME_TRAVEL'].tolist()
        length_list = len(list_values)
        average =  sum(list_values)/length_list
        list_average += [average]
#         print(f'Sum of values / list of values: \n {sum(list_values)} / {length_list}')
#         print(f'Average ACTUALTIME_TRAVEL: {average}, \n')
        
    # taken from https://pythonspot.com/matplotlib-bar-chart/
    y_pos = np.arange(len(unique))
    plt.bar(y_pos, list_average, align='center')
    plt.xticks(y_pos, unique)
    plt.ylabel('Usage')
    plt.title(feature)
    plt.xticks(rotation=90)

    plt.show()

In [None]:
# Average time for each vehicle id
df_temp = pd.concat([X_train, y_train], axis=1)
vehicleid = df_temp['VEHICLEID'].unique().tolist()
for id_ in vehicleid:
    print(f'VEHICLEID: {id_}')
    list_values = df_temp[df_temp['VEHICLEID']== id_]['ACTUALTIME_TRAVEL'].tolist()
    length_list = len(list_values)
    average =  sum(list_values)/length_list
    print(f'Average ACTUALTIME_TRAVEL: {average} \n')

In [None]:
# Making dummy variables for categorical 
cat = ['DAYOFWEEK', 'MONTHOFSERVICE', 'PROGRNUMBER', 'STOPPOINTID', 'IS_HOLIDAY', 'IS_WEEKDAY', 'weather_id', 'weather_main', 'weather_description']
df_temp = pd.concat([X_train, y_train], axis=1)
df_copy = df_temp.copy()
df_copy = df_copy[cat]
df_copy = pd.get_dummies(df_copy)
df_copy = pd.concat([df_copy, y_train], axis=1)

categorical_corr = df_copy.corr()['ACTUALTIME_TRAVEL'][:]

In [None]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(categorical_corr)

In [None]:
categorical_list = categorical_corr[categorical_corr > 0.04].index.tolist()
categorical_list.remove('ACTUALTIME_TRAVEL')

In [None]:
categorical_list

## 1.3. Summary of all features
<br><br>
#### Numerical Features
<br><br>

**DayOfService:**
* The correlation to the target feature is very low of 0.03806.
* Don't see it being a useful feature for the target feature. 
* Plot represents a straight line, which suggests little to no correlation.
* Conclusion: dropped because of the low correlation score. 

**PlannedTime_Arr:**
* There is very low correlation against the target feature though it gets better using spearman correlation.
* After logging the data, the correlation plot did not make a huge difference when using the spearman method to plot it for the second time. 
* Pearson and spearman plot pre log suggests little correlation as it is a continuous straight line. However, this shouldn't mean it should be dropped.
* When most values in the target feature fell less than 10, we see that the plannedtime arrival values increasing, it didn't change much. This would be due to the fact that the target feature is the difference between times so it would make sense that the relationship is poor.
* After logging the data, the plot is more spread out instead of a straight line, but the correlation score still shows a similar low score with a .02 difference using the spearman method. 
* Conclusion: However, this will be dropped.

**ActualTime_Arr:**
* Compared to Planned time arrival feature, the pearson correlation score is poorer but the spearman scores are more similar pre log. 
* It is similar to planned time arrival in that the plot represents a straight line, that suggests a poor relationship with the target feature. 
* After logging the data, it is found that the plot is more spread out. The score using spearman is not much different pre logging the data. 
* However, it would be unwise to drop this feature as it I feel it would serve good purpose for the target feature for predicting the prediction time for the next stop. 
* Conclusion: this will be dropped.

**PlannedTime_Dep:**
* Planned time departure has little correlation with the target feature after looking at spearman and pearsons. 
* It doesn't have a linear relationship and the straight line on the plot of both methods proves this.
* However, when plotted using the logged values we see that the correlation score hasn't changed but the data is more spread out. 
* This doesn't change the relationship much, however. 
* Even so, this will be kept as I feel it would help the predictions. Having the planned time departures would help skew a better result because it would relatively be close to the actual time departure even though it is just an estimate.
* Conclusion: this will be dropped 

**ActualTime_Dep:**
* Actual time departure is again, more or less the same. It represents the departure for these times at a particular stop to go to the next stop. It is strange that the correlation is so low even after logging the data but it would make sense as you wouldn't expect there to be a linear relationship.
* The plot is similar to the rest of the previous features mentioned so far. 
* However, it will still be kept because I feel it would still be a useful feature for predicting a time in seconds. 
* By taking the actual time departure for a particular stop it may help.
* Conclusion: this will be dropped.

**Dwell Time:**
* Dwell time has a 0.03 coorelation score with the target feature. It suggests on the graph that the time for dwell time equal to 0 then the more the target feature time increases. It might suggest traffic times where if a bus is full then it might be due to rush hour? busy hours?
* Plotting against the target feature after logging the data gives similar scores using the spearman correlation method. However we see the graph differing from pre log plot. It is more grouped up together compared to the previous graph plot.
* Because the score is more fairer compared to the previous, it will be useful to keep it for the modelling.
* Conclusion: dropped.

**PlannedTime_Travel:**
* When plotting using the pearse correlation method, it gave a correlation of 0.2. This time it is the highest correlation and we see a small linear relationship.
* The time for planned time travel, as it increases, so does the target feature. It gives us an indication of that slight linear relationship.
* Using spearmans to graph the correlation gave us a 0.7 score which is a good indication that the two features has a linear relationship.
* Because of this, this feature will be dropped.

**Temp:**
* Temp  has a negative 0.009 correlation with the target feature and an even poorer linear relationship at -.002.
* This indicates a poor linear/monotonic relationship and it will not serve useful for the model.
* The graph plots does not give anymore useful information that would give further evidence that it should be kept.
* Conclusion: drop.

**Pressure:**
* It also has a negative linear relationship with the target feature.
* When looking at the graph plots for both spearman and pearsons, it does not give any further insights.
* For this reason, this feature will be dropped.

**Humidity:**
* Humidity does not have a strong relationship with the target feature, be it linear or monotonic.
* The reason being the correlation using both methods fell < 0.00. 
* Unfortunately, the graph does not represent anything useful either.
* When looking at the logged data plots however, there is a slight difference however it is not signficant enough that this feature should still be kept as there is no distinct relationship that we can see.
* Conclusion: drop.

**Windspeed:**
* No linear relationship.
* Indicates a small monotonic relationship.
* This means that as the windspeed value increases, the value of the target feature tends to be higher as well.
* But a spearman correlation of 0.01 is not strong enough of a feature to keep.
* Conclusion: drop

**Wind_Deg:**
* This feature will be dropped immediately as the correalations are both <0.000.

**Rain_1H:**
* It doesn't have a strong linear relationship but it shows spearmans correlation some promising results when the data has been logged.


<br><br>
#### Categorical Features
<br><br>
**DayOfWeek:**
* In the graph we see the actual time travel increasing during weekdays and slowly the travel time is less during weekends. 
* This suggests a relationship between the days of the week and the target feature in which weekdays have a higher tendency for the actualtime travel feature to be higher.
* Conclusion: this will be kept.

**MonthofService:**
* In the graph, we don't really see a connection between each month against the target feature even if it is in order. 
* The overall actual travel time is higher in february before it dips, then rising during winter season.
* The correlation score seems to be poor also for each month. 
* This feature will still be kept. 

**Progrnumber:**
* Most progrnumbers will be dropped as a lot of the correlations are <0.00.
* For this reason, this feature will be dropped.
    
**StoppointID:**
* Similarly to progrnumbers, there are a lot of low correlations falling <0.00.
* Most stoppoint numbers are <0.00 correlation.
* This indicates a very low relationship with the target feature. 
* For this reason, this feature will be dropped, except for those with a correlation > 0.04
    
**Is_Holiday:**
* After analyzing the graph, we see a relationship between the target feature and whether or not the time falls under a holiday date (non-school holiday).
* If it a non holiday, the actual time travel increases. 
* If it is a holiday, the actual time travel decreases. 
* This means that less people are using public transport if it is a holiday date.
* For this reason, this feature will be kept.

**Is_Weekday:**
* Like Is_Holiday, we see a relationship between the target feature and whether or not the time is during a weekday or not. 
* We see a contrast between the two values in which 1, being a weekday, has a higher actual time travel, vice versa.
* For this reason, it is a good indication of a relationship to the target feature.
* Therefore, this feature will be kept. 

**VehicleID:**
* When looking at the different averages, we see that the average differences are not big.
* For this reason, it may be best to drop this feature because it doesn't give any indication it would be a useful feature to help the prediction models.


## 1.4 Cleaning up features

### Setting low correlation features - keep

In [13]:
# Categorical features
low_corr_categorical = ['DAYOFWEEK', 'MONTHOFSERVICE', 'IS_HOLIDAY', 'IS_WEEKDAY'] 

### Setting low correlation features - drop

In [14]:
# Numerical features
low_corr_numerical = ['PLANNEDTIME_ARR', 'PLANNEDTIME_DEP', 'ACTUALTIME_ARR', 'ACTUALTIME_DEP','PLANNEDTIME_TRAVEL']

low_corr = ['DAYOFSERVICE', 'VEHICLEID', 'TRIPID', 'STOPPOINTID', 'PREVIOUS_STOPPOINTID', 'PROGRNUMBER', 'temp', 'pressure', 'humidity', 
            'wind_deg', 'weather_id', 'weather_description', 'clouds_all', 'wind_speed', 'PREVIOUS_STOPPOINTID', 'PLANNEDTIME_ARR', 'PLANNEDTIME_DEP', 'ACTUALTIME_ARR', 'ACTUALTIME_DEP',
           'PLANNEDTIME_TRAVEL', 'DWELLTIME']

### Setting high correlation  features

In [15]:
# Numerical features 
high_corr_numerical = ['DWELLTIME', 'PLANNEDTIME_TRAVEL']

### Dropping features & setting dummy features

In [91]:
df_copy = df.copy()
df_copy = df_copy.drop(low_corr, 1)
df_copy

Unnamed: 0,MONTHOFSERVICE,DAYOFWEEK,ACTUALTIME_TRAVEL,IS_HOLIDAY,IS_WEEKDAY,rain_1h,weather_main
0,January,Saturday,0,0,0,0.0,Clouds
9,January,Saturday,40,0,0,0.0,Clouds
10,January,Saturday,15,0,0,0.0,Clouds
11,January,Saturday,33,0,0,0.0,Clouds
12,January,Saturday,129,0,0,0.0,Clouds
...,...,...,...,...,...,...,...
2164842,December,Sunday,21,0,0,0.0,Clouds
2164789,December,Sunday,16,0,0,0.0,Clouds
2164790,December,Sunday,17,0,0,0.0,Clouds
2164791,December,Sunday,62,0,0,0.0,Clouds


In [92]:
df_copy = pd.get_dummies(df_copy)
df_copy

Unnamed: 0,ACTUALTIME_TRAVEL,rain_1h,MONTHOFSERVICE_April,MONTHOFSERVICE_August,MONTHOFSERVICE_December,MONTHOFSERVICE_February,MONTHOFSERVICE_January,MONTHOFSERVICE_July,MONTHOFSERVICE_June,MONTHOFSERVICE_March,...,IS_HOLIDAY_1,IS_WEEKDAY_0,IS_WEEKDAY_1,weather_main_Clear,weather_main_Clouds,weather_main_Drizzle,weather_main_Fog,weather_main_Mist,weather_main_Rain,weather_main_Snow
0,0,0.0,0,0,0,0,1,0,0,0,...,0,1,0,0,1,0,0,0,0,0
9,40,0.0,0,0,0,0,1,0,0,0,...,0,1,0,0,1,0,0,0,0,0
10,15,0.0,0,0,0,0,1,0,0,0,...,0,1,0,0,1,0,0,0,0,0
11,33,0.0,0,0,0,0,1,0,0,0,...,0,1,0,0,1,0,0,0,0,0
12,129,0.0,0,0,0,0,1,0,0,0,...,0,1,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2164842,21,0.0,0,0,1,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
2164789,16,0.0,0,0,1,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
2164790,17,0.0,0,0,1,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
2164791,62,0.0,0,0,1,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0


### Training & Testing data

In [18]:
# All features
features = df_copy.columns.tolist()
features

['ACTUALTIME_TRAVEL',
 'rain_1h',
 'MONTHOFSERVICE_April',
 'MONTHOFSERVICE_August',
 'MONTHOFSERVICE_December',
 'MONTHOFSERVICE_February',
 'MONTHOFSERVICE_January',
 'MONTHOFSERVICE_July',
 'MONTHOFSERVICE_June',
 'MONTHOFSERVICE_March',
 'MONTHOFSERVICE_May',
 'MONTHOFSERVICE_November',
 'MONTHOFSERVICE_October',
 'MONTHOFSERVICE_September',
 'DAYOFWEEK_Friday',
 'DAYOFWEEK_Monday',
 'DAYOFWEEK_Saturday',
 'DAYOFWEEK_Sunday',
 'DAYOFWEEK_Thursday',
 'DAYOFWEEK_Tuesday',
 'DAYOFWEEK_Wednesday',
 'IS_HOLIDAY_0',
 'IS_HOLIDAY_1',
 'IS_WEEKDAY_0',
 'IS_WEEKDAY_1',
 'weather_main_Clear',
 'weather_main_Clouds',
 'weather_main_Drizzle',
 'weather_main_Fog',
 'weather_main_Mist',
 'weather_main_Rain',
 'weather_main_Snow']

In [19]:
datas = {'ACTUALTIME_TRAVEL': df_copy['ACTUALTIME_TRAVEL']}
y = pd.DataFrame(data=datas)
X = df_copy.drop(['ACTUALTIME_TRAVEL'],1)

In [20]:
# Splitting the dataset into 2 datasets: 
# Split the dataset into two datasets: 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=1)

print("The Original range of the dataset: ",df.shape[0])
print("The Training range taken from dataset: (70%): rows 0 to", round(X_train.shape[0]))
print("The Test range taken from dataset: (30%): rows", round(X_train.shape[0]), "to", round(X_train.shape[0]) + X_test.shape[0])

The Original range of the dataset:  2164843
The Training range taken from dataset: (70%): rows 0 to 1515390
The Test range taken from dataset: (30%): rows 1515390 to 2164843


In [21]:
print("\nDescriptive features in X:\n", X_train.head(5))
print("\nTarget feature in y:\n", y_train.head(5))


Descriptive features in X:
          rain_1h  MONTHOFSERVICE_April  MONTHOFSERVICE_August  \
1890847      0.0                     0                      0   
440272       0.0                     0                      0   
1381125      0.0                     0                      0   
75530        0.0                     0                      0   
1111339      0.0                     0                      0   

         MONTHOFSERVICE_December  MONTHOFSERVICE_February  \
1890847                        0                        0   
440272                         0                        0   
1381125                        0                        0   
75530                          0                        0   
1111339                        0                        0   

         MONTHOFSERVICE_January  MONTHOFSERVICE_July  MONTHOFSERVICE_June  \
1890847                       0                    0                    0   
440272                        0                    0       

In [22]:
# I will reset the indexes of the training and test splits so we can see the X_train printout
# We will see that they are no longer in order and the next markdown cell I will reset the indexes.
X_train.head(5)

Unnamed: 0,rain_1h,MONTHOFSERVICE_April,MONTHOFSERVICE_August,MONTHOFSERVICE_December,MONTHOFSERVICE_February,MONTHOFSERVICE_January,MONTHOFSERVICE_July,MONTHOFSERVICE_June,MONTHOFSERVICE_March,MONTHOFSERVICE_May,...,IS_HOLIDAY_1,IS_WEEKDAY_0,IS_WEEKDAY_1,weather_main_Clear,weather_main_Clouds,weather_main_Drizzle,weather_main_Fog,weather_main_Mist,weather_main_Rain,weather_main_Snow
1890847,0.0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
440272,0.0,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0
1381125,0.0,0,0,0,0,0,1,0,0,0,...,0,0,1,0,1,0,0,0,0,0
75530,0.0,0,0,0,0,1,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
1111339,0.0,0,0,0,0,0,1,0,0,0,...,0,0,1,1,0,0,0,0,0,0


In [23]:
# Using .reset_index 
# We see that they are in order again. 
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)
X_train.head(10)

Unnamed: 0,rain_1h,MONTHOFSERVICE_April,MONTHOFSERVICE_August,MONTHOFSERVICE_December,MONTHOFSERVICE_February,MONTHOFSERVICE_January,MONTHOFSERVICE_July,MONTHOFSERVICE_June,MONTHOFSERVICE_March,MONTHOFSERVICE_May,...,IS_HOLIDAY_1,IS_WEEKDAY_0,IS_WEEKDAY_1,weather_main_Clear,weather_main_Clouds,weather_main_Drizzle,weather_main_Fog,weather_main_Mist,weather_main_Rain,weather_main_Snow
0,0.0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
1,0.0,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0
2,0.0,0,0,0,0,0,1,0,0,0,...,0,0,1,0,1,0,0,0,0,0
3,0.0,0,0,0,0,1,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
4,0.0,0,0,0,0,0,1,0,0,0,...,0,0,1,1,0,0,0,0,0,0
5,0.0,0,0,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0
6,0.0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
7,0.0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
8,0.0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
9,0.0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0


***

<br><br>
# 2. Linear Regression

In this section, I will be preparating a linear regression model. I will attempt to see the prediction overall.

## 2.1 Training a linear regression model to predict the target feature.

In [None]:
linear_reg = LinearRegression().fit(X_train, y_train)

## 2.2 Printing out the coefficients learned by the model and discussing the role in the model.

In [None]:
print("\nThe features are: \n", X_train.columns)
print("\nThe coefficients are: \n", linear_reg.coef_)
print("\n The intercept is: \n", linear_reg.intercept_)
print("\nFeatures and coefficients: \n", list(zip(X_train.columns, linear_reg.coef_[0])))

## 2.3 Printing the predicted target feature. Printing the predicted class for a few examples. Printing classification evaluation measures computed on the full training set. 

In [None]:
# Calculating the prediction and threshold value. 
linear_predictions_train_data = (linear_reg.predict(X_train))

In [None]:
print("\nPredictions with multiple linear regression: \n")
actual_vs_predicted_multiplelinreg = pd.concat([y_train, pd.DataFrame(linear_predictions_train_data, columns=['Predicted'])], axis=1)
print(actual_vs_predicted_multiplelinreg.head(50))

In [None]:
# Printing the a few classification evaluation measures computed on the full training set.
# The following will be printed: accuracy, confusion matrix, precision, recall, f1).
# Some more evaluation metrics.
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(y_train, linear_predictions_train_data)))
print("MAE Score: ", metrics.mean_absolute_error(y_train, linear_predictions_train_data))
print("R2 Score: ", metrics.r2_score(y_train, linear_predictions_train_data))

## 2. 4 Evaluating the model using classification evaluation measures on the hold-out (30% examples) test set.

In [None]:
linear_predictions_testing = (linear_reg.predict(X_test))

print("\nPredictions that has multiple linear regression: \n")
actual_vs_predicted_multiple_linear_reg = pd.concat([y_test, pd.DataFrame(linear_predictions_testing, columns=['Predicted'])], axis=1)
print(actual_vs_predicted_multiple_linear_reg)

In [None]:
# Printing the a few classification evaluation measures computed on the full training set.
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(y_test, linear_predictions_testing)))
print("MAE Score: ", metrics.mean_absolute_error(y_test, linear_predictions_testing))
print("R2 Score: ", metrics.r2_score(y_test, linear_predictions_testing))

<br><br>
route46a

In [None]:
# Taking a tripid from 46a and applying a prediction using Linear Regression model.
route_46a = df[(df['TRIPID'] == '8591174') & (df['DAYOFSERVICE']=='2018-12-23')]
route_46a = route_46a.drop(low_corr, 1)
route_46a = route_46a.drop(['level_0', 'dt'],1)
route_46a = pd.get_dummies(route_46a)
actualtimes_46a = pd.DataFrame(route_46a['ACTUALTIME_TRAVEL'])
actualtimes_46a.reset_index(drop=True, inplace=True)
route_46a = route_46a.drop('ACTUALTIME_TRAVEL', 1)

# # Prediction
# prediction_46a = linear_reg.predict(route_46a)
# prediction_46a[0] = 0.0

In [None]:
print("\nPredictions with multiple linear regression: \n")
actual_vs_predicted = pd.concat([actualtimes_46a, pd.DataFrame(prediction_46a, columns=['Prediction'])], axis=1, join='outer')
actual_vs_predicted.head(50)

In [None]:
# Printing evaluation metrics
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actualtimes_46a, prediction_46a)))
print("MSE Score: ", metrics.mean_squared_error(actualtimes_46a, prediction_46a))
print("MAE Score: ", metrics.mean_absolute_error(actualtimes_46a, prediction_46a))
print("R2 Score: ", metrics.r2_score(actualtimes_46a, prediction_46a))

<br><br>
Stop model - Training a model based off the current stop and all of the previous stops of that current stop.

In [76]:
import json
file = open('previous_stops.json',)
previous_stops = json.load(file)

In [83]:
route_46a = df[(df['TRIPID'] == '8591174') & (df['DAYOFSERVICE']=='2018-12-23')]

In [84]:
list_stopids_onetrip = route_46a.STOPPOINTID.tolist()
list_fittedmodels = []
index = 0

route_46a = route_46a.drop(low_corr, 1)
route_46a = pd.get_dummies(route_46a)
actualtimes_46a = pd.DataFrame(route_46a['ACTUALTIME_TRAVEL'])
actualtimes_46a.reset_index(drop=True, inplace=True)
route_46a = route_46a.drop('ACTUALTIME_TRAVEL', 1)

for stopid in list_stopids_onetrip:
    df_stopid = df[df['STOPPOINTID']==stopid]
    list_prevstops = previous_stops[stopid]
    prev_stops = pd.DataFrame()
    
    for prevstop in list_prevstops:
        prev_stops = prev_stops.append(df[df['PREVIOUS_STOPPOINTID']==prevstop])
    
    df_stopid = df_stopid.append(prev_stops)
    df_copy = df_stopid.copy()
    df_copy = df_copy.drop(low_corr, 1)
    df_copy = pd.get_dummies(df_copy)
    datas = {'ACTUALTIME_TRAVEL': df_copy['ACTUALTIME_TRAVEL']}
    y = pd.DataFrame(data=datas)
    df_copy = df_copy.drop('ACTUALTIME_TRAVEL', 1)
    stop_model = LinearRegression().fit(df_copy, y)
    
    if index <= 58:
        prediction = stop_model.predict(route_46a.iloc[[index]])
        list_fittedmodels += [prediction]
        index += 1

1
Finished
2
Finished
3
Finished
4
Finished
5
Finished
6
Finished
7
Finished
8
Finished
9
Finished
10
Finished
11
Finished
12
Finished
13
Finished
14
Finished
15
Finished
16
Finished
17
Finished
18
Finished
19
Finished
20
Finished
21
Finished
22
Finished
23
Finished
24
Finished
25
Finished
26
Finished
27
Finished
28
Finished
29
Finished
30
Finished
31
Finished
32
Finished
33
Finished
34
Finished
35
Finished
36
Finished
37
Finished
38
Finished
39
Finished
40
Finished
41
Finished
42
Finished
43
Finished
44
Finished
45
Finished
46
Finished
47
Finished
48
Finished
49
Finished
50
Finished
51
Finished
52
Finished
53
Finished
54
Finished
55
Finished
56
Finished
57
Finished
58
Finished
59
Finished


In [85]:
numpy_predictions = list_fittedmodels
a = np.zeros(shape=(59,1))
for pred in range(len(numpy_predictions)):
    a[pred] = numpy_predictions[pred][0][0]

In [86]:
# Printing evaluation metrics
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actualtimes_46a, a)))
print("MSE Score: ", metrics.mean_squared_error(actualtimes_46a, a))
print("MAE Score: ", metrics.mean_absolute_error(actualtimes_46a, a))
print("R2 Score: ", metrics.r2_score(actualtimes_46a, a))

RMSE Score:  41.91257080185228
MSE Score:  1756.6635912202803
MAE Score:  36.2312208272643
R2 Score:  -1.4882528042583463


In [87]:
print("\nPredictions with multiple linear regression: \n")
actual_vs_predicted = pd.concat([actualtimes_46a, pd.DataFrame(a, columns=['Prediction'])], axis=1, join='outer')
actual_vs_predicted.head(50)


Predictions with multiple linear regression: 



Unnamed: 0,ACTUALTIME_TRAVEL,Prediction
0,0,80.699219
1,34,88.25
2,13,67.96875
3,70,94.412109
4,42,93.506226
5,69,93.821289
6,128,67.313293
7,127,89.939453
8,67,105.970703
9,72,89.929688


In [88]:
actual_total_journey = sum(actualtimes_46a['ACTUALTIME_TRAVEL'])
predicted_total_journey = sum(a)[0]
print(f"Actual Total Journey time: {actual_total_journey} seconds. \nPredicted Total Journey Time: {predicted_total_journey} seconds.")

Actual Total Journey time: 2418 seconds. 
Predicted Total Journey Time: 4199.918518066406 seconds.


Off by 30 minutes total.

<br><br>
Stop pair model version 1 - taking all rows for current stopid in stoppointid and all rows of previous_stopid entered in stoppointid

In [71]:
# Get a list of all stop ids
route_46a = df[(df['TRIPID'] == '8591174') & (df['DAYOFSERVICE']=='2018-12-23')]
list_stops_trip = route_46a.STOPPOINTID.tolist()

index = 0
predictions_v1 = {}

route_46a = route_46a.drop(low_corr, 1)
route_46a = pd.get_dummies(route_46a)
actualtimes_46a = pd.DataFrame(route_46a['ACTUALTIME_TRAVEL'])
actualtimes_46a.reset_index(drop=True, inplace=True)
route_46a = route_46a.drop('ACTUALTIME_TRAVEL', 1)

for previous, current in zip(list_stops_trip, list_stops_trip[1:]):
    
    df_stopid = df[df['STOPPOINTID']==current]
    prev_stop_df = df[df['STOPPOINTID']==previous]
    df_stopid = df_stopid.append(prev_stop_df)
        
    df_stopid = df_stopid.drop(low_corr, 1)
    df_stopid = pd.get_dummies(df_stopid)
    y = pd.DataFrame(df_stopid['ACTUALTIME_TRAVEL'])
    df_stopid = df_stopid.drop('ACTUALTIME_TRAVEL', 1)
        
        # Error checking
#         f = f.append(df_stopid)
#         g = g.append(y)
        
    stop_model = LinearRegression().fit(df_stopid, y)
    predictions_v1[current + '_' + previous] = stop_model
#         if index <= len(list_stops_trip):
#             prediction = stop_model.predict(df_copy.iloc[[index]])
#             predictions_v1[stopid + '_' + prevstop] = prediction
#             index += 1

**Prediction 1 - Evaluation**

In [72]:
# List to store predictions
predictions = []
index = 0

for previous, current in zip(list_stops_trip, list_stops_trip[1:]):
    if previous == '807' and current == '817':
        continue
        
    predictions += [predictions_v1[current + '_' + previous].predict(route_46a.iloc[[index]])]
    index += 1

In [73]:
numpy_predictions = np.zeros(shape=(59,1))
for pred in range(len(predictions)):
    numpy_predictions[pred] = predictions[pred][0][0]

In [74]:
# Printing evaluation metrics
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actualtimes_46a, numpy_predictions)))
print("MSE Score: ", metrics.mean_squared_error(actualtimes_46a, numpy_predictions))
print("MAE Score: ", metrics.mean_absolute_error(actualtimes_46a, numpy_predictions))
print("R2 Score: ", metrics.r2_score(actualtimes_46a, numpy_predictions))

RMSE Score:  34.87305145354171
MSE Score:  1216.1297176813675
MAE Score:  26.004270391949152
R2 Score:  -0.7226053955273883


In [59]:
print("\nPredictions with multiple linear regression: \n")
actual_vs_predicted = pd.concat([actualtimes_46a, pd.DataFrame(numpy_predictions, columns=['Prediction'])], axis=1, join='outer')
actual_vs_predicted.head(50)


Predictions with multiple linear regression: 



Unnamed: 0,ACTUALTIME_TRAVEL,Prediction
0,0,32.910156
1,34,36.804688
2,13,74.875
3,70,84.210938
4,42,124.265625
5,69,180.224609
6,128,137.222656
7,127,88.566406
8,67,67.403809
9,72,56.441406


In [75]:
actual_total_journey = sum(actualtimes_46a['ACTUALTIME_TRAVEL'])
predicted_total_journey = sum(numpy_predictions)[0]
print(f"Actual Total Journey time: {actual_total_journey} seconds. \nPredicted Total Journey Time: {predicted_total_journey} seconds.")

Actual Total Journey time: 2418 seconds. 
Predicted Total Journey Time: 3111.666015625 seconds.


Predicted time is over 11.5 minutes total

**Prediction 2 - Evaluation**

In [133]:
prediction_2 = df[(df['TRIPID'] == '6106738') & (df['DAYOFSERVICE']==' 	2018-01-19')]
prediction_2 = prediction_2.drop(low_corr, 1)
prediction_2 = pd.get_dummies(prediction_2)
actualtimes_predicion2 = pd.DataFrame(prediction_2['ACTUALTIME_TRAVEL'])
actualtimes_predicion2.reset_index(drop=True, inplace=True)
prediction_2 = prediction_2.drop('ACTUALTIME_TRAVEL', 1)

In [134]:
# List to store predictions
predictions_2 = []
index = 0

for previous, current in zip(list_stops_trip, list_stops_trip[1:]):
    if previous == '807' and current == '817':
        continue
        
    predictions_2 += [predictions_v1[current + '_' + previous].predict(prediction_2.iloc[[index]])]
    index += 1
    
    
numpy_predictions_2 = np.zeros(shape=(59,1))
for pred in range(len(predictions)):
    numpy_predictions_2[pred] = predictions_2[pred][0][0]

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 31 is different from 36)

In [126]:
# Printing evaluation metrics
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actualtimes_predicion2, numpy_predictions_2)))
print("MSE Score: ", metrics.mean_squared_error(actualtimes_predicion2, numpy_predictions_2))
print("MAE Score: ", metrics.mean_absolute_error(actualtimes_predicion2, numpy_predictions_2))
print("R2 Score: ", metrics.r2_score(actualtimes_predicion2, numpy_predictions_2))

RMSE Score:  69.38611233144296
MSE Score:  4814.432584471621
MAE Score:  36.29726396980932
R2 Score:  -0.13960342157282146


In [127]:
print("\nPredictions with multiple linear regression: \n")
actual_vs_predicted = pd.concat([actualtimes_predicion2, pd.DataFrame(numpy_predictions_2, columns=['Prediction'])], axis=1, join='outer')
actual_vs_predicted.head(50)


Predictions with multiple linear regression: 



Unnamed: 0,ACTUALTIME_TRAVEL,Prediction
0,0,32.910156
1,36,36.804688
2,19,74.875
3,91,84.210938
4,72,124.265625
5,76,180.224609
6,95,137.222656
7,156,88.566406
8,94,67.403809
9,120,56.441406


In [130]:
actual_total_journey = sum(actualtimes_predicion2['ACTUALTIME_TRAVEL'])
predicted_total_journey = sum(numpy_predictions_2)[0]
print(f"Actual Total Journey time: {actual_total_journey} seconds. \nPredicted Total Journey Time: {predicted_total_journey} seconds.")

Actual Total Journey time: 3571 seconds. 
Predicted Total Journey Time: 3111.666015625 seconds.


Prediction is under by 15 minutes

<br><br>
Stop pair model version 2 - adding extra features then training them based on the second approach of the stop pair model

**Prediction 1 - Evaluation**

In [132]:
# Adding morn_rushour, eve_rushour, HOUROFSERVICE
df['HOUROFSERVICE'] = [int(time.strftime("%H",time.gmtime(hour))) for hour in df['ACTUALTIME_DEP']]
df['morn_rushour'] = [1 if int(time.strftime("%H",time.gmtime(hour))) >= 7 and int(time.strftime("%H",time.gmtime(hour))) <= 9 else 0 for hour in df['ACTUALTIME_DEP']]
df['eve_rushour'] = [1 if int(time.strftime("%H",time.gmtime(hour))) >= 16 and int(time.strftime("%H",time.gmtime(hour))) <= 19 else 0 for hour in df['ACTUALTIME_DEP']]

df['morn_rushour'] = df['morn_rushour'].astype('category')
df['eve_rushour'] = df['eve_rushour'].astype('category')

In [136]:
# Get a list of all stop ids
route_46a = df[(df['TRIPID'] == '8591174') & (df['DAYOFSERVICE']=='2018-12-23')]
list_stops_trip = route_46a.STOPPOINTID.tolist()

index = 0
predictions_v2 = {}

route_46a = route_46a.drop(low_corr, 1)
route_46a = pd.get_dummies(route_46a)
actualtimes_46a = pd.DataFrame(route_46a['ACTUALTIME_TRAVEL'])
actualtimes_46a.reset_index(drop=True, inplace=True)
route_46a = route_46a.drop('ACTUALTIME_TRAVEL', 1)

for previous, current in zip(list_stops_trip, list_stops_trip[1:]):
    
    df_stopid = df[df['STOPPOINTID']==current]
    prev_stop_df = df[df['STOPPOINTID']==previous]
    df_stopid = df_stopid.append(prev_stop_df)
        
    df_stopid = df_stopid.drop(low_corr, 1)
    df_stopid = pd.get_dummies(df_stopid)
    y = pd.DataFrame(df_stopid['ACTUALTIME_TRAVEL'])
    df_stopid = df_stopid.drop('ACTUALTIME_TRAVEL', 1)
        
        # Error checking
#         f = f.append(df_stopid)
#         g = g.append(y)
        
    stop_model = LinearRegression().fit(df_stopid, y)
    predictions_v2[current + '_' + previous] = stop_model

In [137]:
# List to store predictions
predictions_3 = []
index = 0

for previous, current in zip(list_stops_trip, list_stops_trip[1:]):
    if previous == '807' and current == '817':
        continue
        
    predictions_3 += [predictions_v2[current + '_' + previous].predict(route_46a.iloc[[index]])]
    index += 1
    
numpy_predictions_3 = np.zeros(shape=(59,1))
for pred in range(len(predictions_3)):
    numpy_predictions_3[pred] = predictions_3[pred][0][0]

In [141]:
# Printing evaluation metrics
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actualtimes_46a, numpy_predictions_3)))
print("MSE Score: ", metrics.mean_squared_error(actualtimes_46a, numpy_predictions_3))
print("MAE Score: ", metrics.mean_absolute_error(actualtimes_46a, numpy_predictions_3))
print("R2 Score: ", metrics.r2_score(actualtimes_46a, numpy_predictions_3))

RMSE Score:  31.286367543847607
MSE Score:  978.8367940887209
MAE Score:  23.37968470686573
R2 Score:  -0.38648823256512355


In [142]:
print("\nPredictions with multiple linear regression: \n")
actual_vs_predicted = pd.concat([actualtimes_46a, pd.DataFrame(numpy_predictions_3, columns=['Prediction'])], axis=1, join='outer')
actual_vs_predicted.head(50)


Predictions with multiple linear regression: 



Unnamed: 0,ACTUALTIME_TRAVEL,Prediction
0,0,27.9375
1,34,36.953125
2,13,71.035156
3,70,75.111328
4,42,111.59375
5,69,162.65625
6,128,121.5625
7,127,76.109375
8,67,60.984375
9,72,51.188477


In [143]:
actual_total_journey = sum(actualtimes_46a['ACTUALTIME_TRAVEL'])
predicted_total_journey = sum(numpy_predictions_3)[0]
print(f"Actual Total Journey time: {actual_total_journey} seconds. \nPredicted Total Journey Time: {predicted_total_journey} seconds.")

Actual Total Journey time: 2418 seconds. 
Predicted Total Journey Time: 2747.295867919922 seconds.


Prediction is off by being over 10.9 minutes

**Prediction 2 - Evaluation**

In [144]:
prediction_2 = df[(df['TRIPID'] == '6106738') & (df['DAYOFSERVICE']==' 	2018-01-19')]
prediction_2 = prediction_2.drop(low_corr, 1)
prediction_2 = pd.get_dummies(prediction_2)
actualtimes_predicion2 = pd.DataFrame(prediction_2['ACTUALTIME_TRAVEL'])
actualtimes_predicion2.reset_index(drop=True, inplace=True)
prediction_2 = prediction_2.drop('ACTUALTIME_TRAVEL', 1)

In [145]:
# List to store predictions
predictions_4 = []
index = 0

for previous, current in zip(list_stops_trip, list_stops_trip[1:]):
    if previous == '807' and current == '817':
        continue
        
    predictions_4 += [predictions_v2[current + '_' + previous].predict(prediction_2.iloc[[index]])]
    index += 1
    
    
numpy_predictions_4 = np.zeros(shape=(59,1))
for pred in range(len(predictions)):
    numpy_predictions_4[pred] = predictions_4[pred][0][0]

In [146]:
# Printing evaluation metrics
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actualtimes_predicion2, numpy_predictions_4)))
print("MSE Score: ", metrics.mean_squared_error(actualtimes_predicion2, numpy_predictions_4))
print("MAE Score: ", metrics.mean_absolute_error(actualtimes_predicion2, numpy_predictions_4))
print("R2 Score: ", metrics.r2_score(actualtimes_predicion2, numpy_predictions_4))

RMSE Score:  72.82228258810056
MSE Score:  5303.084841341175
MAE Score:  38.50819991806806
R2 Score:  -0.2552701744283794


In [147]:
print("\nPredictions with multiple linear regression: \n")
actual_vs_predicted = pd.concat([actualtimes_predicion2, pd.DataFrame(numpy_predictions_4, columns=['Prediction'])], axis=1, join='outer')
actual_vs_predicted.head(50)


Predictions with multiple linear regression: 



Unnamed: 0,ACTUALTIME_TRAVEL,Prediction
0,0,38.765625
1,36,39.019531
2,19,78.441406
3,91,94.751953
4,72,131.5625
5,76,204.921875
6,95,168.1875
7,156,102.34375
8,94,73.820312
9,120,40.964844


In [148]:
actual_total_journey = sum(actualtimes_predicion2['ACTUALTIME_TRAVEL'])
predicted_total_journey = sum(numpy_predictions_4)[0]
print(f"Actual Total Journey time: {actual_total_journey} seconds. \nPredicted Total Journey Time: {predicted_total_journey} seconds.")

Actual Total Journey time: 3571 seconds. 
Predicted Total Journey Time: 3411.2642517089844 seconds.


***

<br><br>

# 3. Route model and taking the proportion of the prediction to calculate a journey time for the user.

## 3.1 Calculating the proportion of each stop from the overall trip.

In [None]:
def proportion_stops(predictions):
    # Sum from the first stop until each stop
    sum_each_stop = np.zeros(predictions.shape[0], dtype=float)
    proportion_each_stop = np.zeros(predictions.shape[0], dtype=float)
    overall_prediction = np.sum(predictions)
    
    # Adding sum up until current stop and dividing by overall prediction to get proportion of the trip
    for length in range(predictions.shape[0]):
        sum_each_stop = np.append(sum_each_stop, [predictions[length]])
        sum_overall = np.sum(sum_each_stop) / overall_prediction*100
        proportion_each_stop[length] = sum_overall
        
    return proportion_each_stop

## 3.2 Return the progrnumber based off the stoppointid in a route

Finding the most common progrnumber based off the stoppointid. The reason for using to find the most common progrnumber is because it assumes that most route_id for each line would be always complete with the exception of a few trips in which they take a different route and skips some stops as a result.

In [None]:
# Code taken from https://www.geeksforgeeks.org/python-find-most-frequent-element-in-a-list/

# array only accepts a panda Series or numpy array
def most_common(array):
    List = array.tolist()
    mode_list = mode(List)
    if mode_list == '1':
        return 0
    
    else:
        return(mode(List))

## 3.3 Calculating the journey time from a start to end destination based on user input

Finding the travel time duration based on a stoppointid then getting the progrnumber

In [None]:
def journey_time(start,end, prediction):
    # Converting into int because the function returns a string
    start_progrnum = int(most_common(df['PROGRNUMBER'][df['STOPPOINTID']==start]))
    end_progrnum = int(most_common(df['PROGRNUMBER'][df['STOPPOINTID']==end]))
    
#     print(start_progrnum)
#     print(end_progrnum)

    proportion_array = proportion_stops(prediction)
    overall_prediction = np.sum(prediction)
    
    # calculating the time difference from start to end destination 
    start_prediction = (proportion_array[start_progrnum]/100) * overall_prediction
    end_prediction = (proportion_array[end_progrnum]/100) * overall_prediction
    
    journeytime = end_prediction - start_prediction
    
    # print(journeytime)
    
    return journeytime

In [None]:
user_start = '807'
user_end = '812'

journey_time(user_start, user_end, prediction_46a)

***

<br><br>
# 4.  Random Forest & Decision Trees

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfm = RandomForestRegressor(n_estimators=40, oob_score=True, random_state=1)
rfm_fit = rfm.fit(X_train, y_train)

In [None]:
from sklearn.tree import DecisionTreeRegressor
dtc_4 = DecisionTreeRegressor(max_depth=4, random_state=1)
dtc4_fit = dtc_4.fit(X_train, y_train)

In [None]:
importance = pd.DataFrame({'feature': X_train.columns, 'importance': rfm.feature_importances_})
importance.sort_values('importance', ascending=False)

In [None]:
print('Random Forest - Train Data \n')
rfm_predictions_train = rfm_fit.predict(X_train)
rdm_train_df = pd.DataFrame(rfm_predictions_train, columns=['Predicted'])
actual_vs_predicted_rfm_train = pd.concat([y_train, rdm_train_df], axis=1)
actual_vs_predicted_rfm_train.head(20)

In [None]:
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(y_train, rfm_predictions_train)))
print("MSE Score: ", metrics.mean_squared_error(y_train, rfm_predictions_train))
print("MAE Score: ", metrics.mean_absolute_error(y_train, rfm_predictions_train))
print("R2 Score: ", metrics.r2_score(y_train, rfm_predictions_train))

In [None]:
print('Random Forest - Test Data \n')
rfm_predictions_test = rfm_fit.predict(X_test)
rfm_test_df = pd.DataFrame(rfm_predictions_test, columns=['Predicted'])
actual_vs_predicted_rfm_test = pd.concat([y_test, rfm_test_df], axis=1)
actual_vs_predicted_rfm_test.head(20)

In [None]:
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(y_test, rfm_predictions_test)))
print("MSE Score: ", metrics.mean_squared_error(y_test, rfm_predictions_test))
print("MAE Score: ", metrics.mean_absolute_error(y_test, rfm_predictions_test))
print("R2 Score: ", metrics.r2_score(y_test, rfm_predictions_test))

In [None]:
print('Decision Tree - Train Data \n')
dtc4_fit_predictions_train = dtc4_fit.predict(X_train)
dtc4_train_df = pd.DataFrame(dtc4_fit_predictions_train, columns=['Predicted'])
actual_vs_predicted_dtc4_train = pd.concat([y_train, dtc4_train_df], axis=1)
actual_vs_predicted_dtc4_train.head(20)

In [None]:
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(y_train, dtc4_fit_predictions_train)))
print("MSE Score: ", metrics.mean_squared_error(y_train, dtc4_fit_predictions_train))
print("MAE Score: ", metrics.mean_absolute_error(y_train, dtc4_fit_predictions_train))
print("R2 Score: ", metrics.r2_score(y_train, dtc4_fit_predictions_train))

In [None]:
print('Decision Tree - Test Data \n')
dtc4_fit_predictions_test = dtc4_fit.predict(X_train)
dtc4_test_df = pd.DataFrame(dtc4_fit_predictions_test, columns=['Predicted'])
actual_vs_predicted_dtc4_test = pd.concat([y_train, dtc4_test_df], axis=1)
actual_vs_predicted_dtc4_test.head(20)

In [None]:
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(y_train, dtc4_fit_predictions_test)))
print("MSE Score: ", metrics.mean_squared_error(y_train, dtc4_fit_predictions_test))
print("MAE Score: ", metrics.mean_absolute_error(y_train, dtc4_fit_predictions_test))
print("R2 Score: ", metrics.r2_score(y_train, dtc4_fit_predictions_test))

***

<br><br>
# 5. Stop pair model

## 5.1 First version of paired stop approach
<br><br>
This approach makes a model based on the stopid and its previous stopids

In [None]:
# Returns a paired list of stops
def paired_stops(df):
    stopid = df['STOPPOINTID'].unique().tolist()
    previous_stopid = []
    for i in stopid:
        prev = df['PREVIOUS_STOPPOINTID'][df['STOPPOINTID']==i]
        # Adds most frequent previous stopid to list
        previous_stopid += [prev.value_counts().idxmax()]
    
    return [stopid, previous_stopid]

In [None]:
for ids in range(len(paired_stops[0])):
    
    # Making new dataframe
    to_add = df[df['STOPPOINTID']==paired_stops[0][ids]]
    to_add = to_add.append(df[df['PREVIOUS_STOPPOINTID']==paired_stops[1][ids]])
    stops_df = pd.DataFrame(data=to_add)
    
    # Setting target feature
    y = stops_df['ACTUALTIME_TRAVEL']
    
    # Dropping target feature and low corr features
    stops_df = stops_df.drop(low_corr,1)
    stops_df = stops_df.drop('ACTUALTIME_TRAVEL',1)
    stops_df = pd.get_dummies(stops_df)
    
    # Fitting model based on stops
    linear_reg = LinearRegression().fit(stops_df, y)
    
    # Save to pickle file 

In [None]:
pair_stops = paired_stops(df)
to_add = df[df['STOPPOINTID']==pair_stops[0][5]]
to_add = to_add.append(df[df['PREVIOUS_STOPPOINTID']==pair_stops[1][5]])
stops_df = pd.DataFrame(to_add)

 # Setting target feature
y = stops_df['ACTUALTIME_TRAVEL']
    
# Dropping target feature and low corr features
stops_df = stops_df.drop(low_corr,1)
stops_df = stops_df.drop('ACTUALTIME_TRAVEL',1)
stops_df = pd.get_dummies(stops_df)

# Fitting/Training model based on stops
linear_reg_model_ = LinearRegression().fit(stops_df, y)

# Saving to pickle File
with open('model_'+pair_stops[0][5]+'.pkl', 'wb') as handle:
    pickle.dump(linear_reg_model_, handle)


In [None]:
sampledf = stops_df.iloc[[0]]
sample_prediction = linear_reg_sample.predict(sampledf)

In [None]:
sample_prediction

In [None]:
with open('model_'+pair_stops[0][5]+'.pkl', 'rb') as handle:
    model = pickle.load(handle)

In [None]:
model.predict(sampledf)

## 5.2.1 Setting up for 46a stop pair models using first approach

In [None]:
# Function to get previous stopid and return a paired list
def pair_stopids(current_stopids):
    previous_stopid = []
    for i in current_stopids:
        prev = df['PREVIOUS_STOPPOINTID'][df['STOPPOINTID']==i]
        # Adds most frequent previous stopid to list
        previous_stopid += [prev.value_counts().idxmax()]
    
    return [current_stopids, previous_stopid]

In [None]:
# Loading the json file
import json
file = open('routes_and_stops.json',)
routes_stops = json.load(file)

In [None]:
# Get all stops for 46a going outbound ('1')
list_46a_stops = routes_stops['46A']['outbound']

# Pairing stopids and prev stopids from 46a route
pairing_46a_stopids = pair_stopids(list_46a_stops)
predictions = []

In [None]:
for ids in range(len(pairing_46a_stopids[0])):
    # Making new dataframe
    to_add = df[df['STOPPOINTID']==pairing_46a_stopids[0][ids]]
    to_add = to_add.append(df[df['PREVIOUS_STOPPOINTID']==pairing_46a_stopids[1][ids]])
    stops_df = pd.DataFrame(data=to_add)
    
    # Setting target feature
    y = stops_df['ACTUALTIME_TRAVEL']
    
    # Dropping target feature and low corr features
    stops_df = stops_df.drop(low_corr,1)
    stops_df = stops_df.drop('ACTUALTIME_TRAVEL',1)
    stops_df = pd.get_dummies(stops_df)
    
    # Fitting model based on stops
    linear_reg_model = LinearRegression().fit(stops_df, y)
    
      # Save to pickle file
#     with open('model_'+pairing_46a_stopids[0][ids]+'.pkl', 'wb') as handle:
#         pickle.dump(linear_reg_model, handle)

     # Predicting data
    with open('stop_'+pair_stops[0][ids]+'.pkl', 'rb') as handle:
        model = pickle.load(handle)
    
    k = model.predict(route_46a.iloc[[index]])
    predictions += [k]

In [None]:
# Printing evaluation metrics
print("RMSE Score: ", np.sqrt(metrics.mean_squared_error(actualtimes_46a, predictions)))
print("MSE Score: ", metrics.mean_squared_error(actualtimes_46a, predictions))
print("MAE Score: ", metrics.mean_absolute_error(actualtimes_46a, predictions))
print("R2 Score: ", metrics.r2_score(actualtimes_46a, predictions))

<br><br>
##### Conclusion:
Linear regression model is not very good. MSE score is off by more than 1000 seconds. And the R2 score is at a negative value. This means the parameters need to be tuned. Keeping dwelltime might be good.

## 5.3 Stop pair based on entire leavetimes

<br><br>
1) Make a rough query that selects rows that contain a certain stopid and its previous stopid based on the direction.

In [None]:
# initialise query - for OUTBOUND (WHERE DIRECTION == '1')
# query_stopid = "SELECT leavetimes.*, weather.* FROM leavetimes, weather WHERE leavetimes.STOPPOINTID = " + current_stopid + " AND leavetimes.DAYOFSERVICE = weather.dt"
# query_stopid_df = pd.read_sql(query_previoustop, conn)

<br><br>
2) Make a function that will combine lists in a list together as one list

In [95]:
def combine_listsoflist(to_combine):
    combined = []
    for each_list in to_combine:
        combined += each_list
    return combined

<br><br>
4) Make a function that will get rid of the duplicates in the list

In [96]:
def get_unique(stopids_list):
    return list(set(stopids_list))

<br><br>
5) Make a list to store all stopids for DIRECTION == outbound/1.

In [93]:
# Loading the json file
import json
file = open('routes_and_stops.json',)
routes_stops = json.load(file)

In [97]:
# Looping through every lineid, outbound 
stopids_outbound = []
for i,j in routes_stops.items():
    try:
#         print(i, '\n', routes_stops[i]['outbound'], '\n')
        stopids_outbound += [routes_stops[i]['outbound']]
    except KeyError:
        continue
        
# Calling function to get combined list
combined_stopids_outbound = combine_listsoflist(stopids_outbound)

# Calling function to get unique stopids from combined list
unique_stopids_outbound = get_unique(combined_stopids_outbound)

<br><br>
6) Make a list to store all stopids for DIRECTION ==inbound/2.

In [None]:
# Looping through every lineid, inbound
stopids_inbound = []
for i,j in routes_stops.items():
    try:
#         print(i, '\n', routes_stops[i]['inbound'], '\n')
        stopids_inbound += [routes_stops[i]['inbound']]
    except KeyError:
        continue
        
# Calling function to get combined list
combined_stopids_inbound = combine_listsoflist(stopids_inbound)

# Calling function to get unique stopids from combined list - using set() to get rid off existing stops from outbound stops
unique_stopids_inbound = list(set(combined_stopids_inbound) - set(combined_stopids_outbound))

<br><br>
7) Query to select all of the previous_stopids based on the current stopid and put it to a list

In [None]:
# query_previoustop = "SELECT leavetimes.PREVIOUS_STOPPOINTID FROM leavetimes WHERE leavetimes.STOPPOINTID = " + current_stopid 
# query_prevstop_df = pd.read_sql(query_previoustop, conn)

# # Converting into a pandas series then to list
# query_prevstop_series = query_prevstop_df.iloc[0]
# query_prevstop_list = query_prevstop_series.tolist()

<br><br>
8) Query to select the rows based on the previous stopids and append them to the current dataframe of the current stopid


In [98]:
def df_prev_stops(query_prevstop_list):
    query_prevstop_rows = "SELECT leavetimes.* FROM leavetimes WHERE leavetimes.PREVIOUS_STOPPOINTID IN " + str(query_prevstop_list)
    print(query_prevstop_rows)
    df_prevstop = pd.read_sql(query_prevstop_rows, conn)
    return df_prevstop

In [99]:
def df_prev_stops_one_element(query_prevstop_list):
    query_prevstop_rows = "SELECT leavetimes.* FROM leavetimes WHERE leavetimes.PREVIOUS_STOPPOINTID = " + str(query_prevstop_list)
    print(query_prevstop_rows)
    df_prevstop = pd.read_sql(query_prevstop_rows, conn)
    return df_prevstop

<br><br>
9) Adding index on STOPPOINTID and PREVIOUS_STOPPOINTID

In [None]:
# Adding indexes
# add_index1 = """CREATE INDEX stopid ON leavetimes(STOPPOINTID);"""
# add_index2 = """CREATE INDEX previous_stopid ON leavetimes(PREVIOUS_STOPPOINTID);"""
# conn.execute(add_index1)
# conn.execute(add_index2)

<br><br>
10) Piecing every step together

In [None]:
# Lists all stops done so far. This is for when laptop needs to rest
import os 
arr = os.listdir('C:/Users/fayea/UCD/ResearchPracticum/Data-Analytics-CityRoute/stop_models')
j = []
for i in arr:
    j += i.split('_')

h = []
for i in j:
    h += i.split('.')

g = [str(i) for i in h if i.isdigit()]
unique_stopids_outbound = [x for x in unique_stopids_outbound if x not in g]
len(unique_stopids_outbound)

In [None]:
previous_stops = {}
for stopid in unique_stopids_inbound:
    # Get all previous stopids in list
    query_previoustop = "SELECT DISTINCT leavetimes.PREVIOUS_STOPPOINTID FROM leavetimes WHERE leavetimes.STOPPOINTID = " + stopid
    query_prevstop_df = pd.read_sql(query_previoustop, conn)

    # Converting into a pandas series then to list
    query_prevstop_series = query_prevstop_df['PREVIOUS_STOPPOINTID'].tolist()
    query_prevstop_list = [stopid for stopid in query_prevstop_series if stopid != '0']
    previous_stops[stopid] = query_prevstop_list
    print('finished')
    
with open('previous_stops_inbound.json', 'w+') as fp:
    json.dump(previous_stops, fp)

In [None]:
# import boto3
import pandas as pd
import numpy as np
import sqlite3
import pickle

# from sagemaker import get_execution_role
from sklearn.linear_model import LinearRegression
from math import log
from multiprocessing import Pool


# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Connecting to s3
# role = get_execution_role()
# bucket='sagemaker-studio-520298385440-7in8n1t299'
# data_key = 'route_46a.feather'
# data_location = 's3://{}/{}'.format(bucket, data_key)

In [None]:
low_corr = ['DAYOFSERVICE', 'VEHICLEID', 'TRIPID', 'STOPPOINTID', 'PREVIOUS_STOPPOINTID', 'PROGRNUMBER', 'temp', 'pressure', 'humidity', 
            'wind_speed', 'wind_deg', 'weather_id', 'weather_description', 'clouds_all', 'PREVIOUS_STOPPOINTID', 'PLANNEDTIME_ARR', 'PLANNEDTIME_DEP', 'ACTUALTIME_ARR', 'ACTUALTIME_DEP',
           'PLANNEDTIME_TRAVEL', 'DWELLTIME']

In [100]:
# def function to create connection to db
def create_connection(db_file):
    """
    create a database connection to the SQLite database specified by db_file
    :param df_file: database file
    :return: Connection object or None
    """
    conn = None
    try: 
        conn = sqlite3.connect(db_file)
        return conn
    except 'Error' as e:
        print(e)
        
    return conn

In [101]:
# create connection to db
db_file = "C:/Users/fayea/UCD/ResearchPracticum/Data-Analytics-CityRoute/dublinbus.db"
conn = create_connection(db_file)

In [105]:
# Outbound
file = open('previous_stops.json',)
previous_stops = json.load(file)

# Query to get all of weather
weather_query = "SELECT weather.* from weather"
weather_df = pd.read_sql(weather_query, conn)
weather_df = weather_df.rename(columns={"dt": "DAYOFSERVICE"})

for current_stopid in unique_stopids_outbound:
    
    query_prevstop_series = previous_stops[current_stopid]
    query_prevstop_list = tuple(query_prevstop_series)
    if len(query_prevstop_list) == 1:
        # Making query to db and make df
        query_stopid = "SELECT leavetimes.* FROM leavetimes WHERE leavetimes.STOPPOINTID = " + current_stopid + " OR leavetimes.PREVIOUS_STOPPOINTID = " + str(query_prevstop_list)
        df = pd.read_sql(query_stopid, conn)
        
        # Append previous stops rows to main df
#         to_add = df_prev_stops_one_element(query_prevstop_series[0])
#         df = pd.concat([df,to_add])
        df = df.merge(weather_df, on='DAYOFSERVICE', how='left')
        
    elif len(query_prevstop_list) == 0:
        continue
        
    else:
        # Making query to db and make df
        query_stopid = "SELECT leavetimes.* FROM leavetimes WHERE leavetimes.STOPPOINTID = " + current_stopid + " OR leavetimes.PREVIOUS_STOPPOINTID IN " + str(query_prevstop_list)
        df = pd.read_sql(query_stopid, conn)
        
        # Append previous stops rows to main df
#         to_add = df_prev_stops(query_prevstop_list)
#         df = pd.concat([df,to_add]) 
        df = df.merge(weather_df, on='DAYOFSERVICE', how='left')

    # Drop low correlated features and setting target feature
    df = df.drop(low_corr, 1)
    tf = df['ACTUALTIME_TRAVEL']
    df = df.drop('ACTUALTIME_TRAVEL', 1)
    df = pd.get_dummies(df)
    print(df.shape[1])
#     # Fitting model based on stops
#     linear_reg_model = LinearRegression().fit(df, tf)
    
#     # Save to pickle file
#     with open('C:/Users/fayea/UCD/ResearchPracticum/Data-Analytics-CityRoute/stop_models/stop_'+ current_stopid +'.pkl', 'wb') as handle:
#         pickle.dump(linear_reg_model, handle)
    
    print('Finished.')

32
Finished.
32
Finished.
32
Finished.
32
Finished.
32
Finished.
32
Finished.
32
Finished.
32
Finished.
32
Finished.
32
Finished.
32
Finished.
32
Finished.
32
Finished.
32
Finished.
32
Finished.
32
Finished.
32
Finished.
32
Finished.


DatabaseError: Execution failed on sql 'SELECT leavetimes.* FROM leavetimes WHERE leavetimes.STOPPOINTID = 7441OR leavetimes.PREVIOUS_STOPPOINTID = ('2930',)': unrecognized token: "7441OR"

In [None]:
# Inbound 
for current_stopid in range(len(stopids_inbound)):
    # Making query to df and make df
    query_stopid = "SELECT leavetimes.*, weather.* FROM leavetimes, weather WHERE leavetimes.STOPPOINTID = " + stopids_inbound[current_stopid] + " AND leavetimes.DAYOFSERVICE = weather.dt"
    df = pd.read_sql(query_previoustop, conn)
    
    # Get all previous stopids in list
    query_previoustop = "SELECT leavetimes.PREVIOUS_STOPPOINTID FROM leavetimes WHERE leavetimes.STOPPOINTID = " + stopids_inbound[current_stopid] 
    query_prevstop_df = pd.read_sql(query_previoustop, conn)

    # Converting into a pandas series then to list
    query_prevstop_series = query_prevstop_df.iloc[0]
    query_prevstop_list = query_prevstop_series.tolist()
    
    # Append previous stops rows to main df
    to_add = df_prev_stops(query_prevstop_list)
    df = df.append(to_add)
    
    # Drop low correlated features and setting target feature
    df = df.drop(low_corr, 1)
    tf = df['ACTUALTIME_TRAVEL']
    df = df.drop('ACTUALTIME_TRAVEL', 1)
    df = pd.get_dummies(df)
    
    # Fitting model based on stops
    linear_reg_model = LinearRegression().fit(df, tf)
    
    # Save to pickle file
    with open('/UCD/ResearchPracticum/Data-Analytics-CityRoute/stop_models/stop_'+ stopids_outbound[current_stopid] +'.pkl', 'wb') as handle:
        pickle.dump(linear_reg_model, handle)

***