<a id="top"></a>

This notebook trains and saves a linear regression model for each bus route, i.e. one model per line and direction.

***

# Import Packages

In [114]:
import sqlite3
import json
import pandas as pd
import pickle

from sklearn.linear_model import LinearRegression

***

<a id="contents"></a>
# Contents

- [1. Create Connection to Database](#create_conn)
- [2. Load Line Routes Dictionary](#load_line_routes)
- [3. Train Models - Linear Regression](#linear_reg)

***

<a id="create_conn"></a>
# 1. Create Connection to Database
[Back to contents](#contents)

In [1]:
# define a function to create a connection to the db
def create_connection(db_file):
    """
    Create a database connection to the SQLite database specified by db_file.
    :param db_file: path to the database file
    :return: Connection object or None
    """
    conn = None
    try:
        conn = sqlite3.connect(db_file)
    except sqlite3.Error as e:
        # report the failure; conn stays None
        print(e)

    return conn

In [4]:
# create connection to db
db_file = '/home/faye/Data-Analytics-CityRoute/dublinbus.db'
conn = create_connection(db_file)
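
Because `create_connection` can return `None`, calling code may want to check the result before querying and close the connection once finished. The following is a minimal sketch of that pattern, not part of the original notebook: `open_db`, `demo_conn` and the `demo` table are hypothetical names, and an in-memory database stands in for `dublinbus.db`.

```python
import sqlite3
from contextlib import closing


def open_db(db_file):
    """Return a sqlite3 connection, or None if the database cannot be opened."""
    try:
        return sqlite3.connect(db_file)
    except sqlite3.Error as e:
        print(e)
        return None


# ":memory:" is an assumption for illustration; the notebook uses a file path
demo_conn = open_db(":memory:")
row_count = None

if demo_conn is not None:
    # closing() guarantees conn.close() is called when the block exits
    with closing(demo_conn):
        demo_conn.execute("CREATE TABLE demo (x INTEGER)")
        demo_conn.execute("INSERT INTO demo VALUES (1)")
        row_count = demo_conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
```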

<a id="load_line_routes"></a>
# 2. Load Line Routes Dictionary
[Back to contents](#contents)

In [10]:
# load in line_routes json
with open('/home/faye/data/line_routes.json') as json_file:
    line_routes = json.load(json_file)

print("Type:", type(line_routes))

Type: <class 'dict'>
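
The training loop later in this notebook reads the dictionary as `line_routes[line][direction]`, which implies a nested mapping of line to direction to route ID. The sketch below illustrates that shape on made-up data; the line names, direction keys and route IDs in `sample_json` are assumptions, not values from the real file.

```python
import json

# Hypothetical sample; the real line names, direction keys and route IDs may differ
sample_json = '{"1": {"1": "route_a", "2": "route_b"}, "118": {"1": "route_c"}}'
sample_routes = json.loads(sample_json)

# Flatten the nested mapping into (line, direction, route ID) triples
pairs = [
    (line, direction, route_id)
    for line, directions in sample_routes.items()
    for direction, route_id in directions.items()
]

for line, direction, route_id in pairs:
    print(f"line={line} direction={direction} route={route_id}")
```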


<a id="linear_reg"></a>
# 3. Train Models - Linear Regression
[Back to contents](#contents)

In [105]:
# set trip features
trip_features = """
    T.ACTUALTIME_TRAVEL, T.MONTHOFSERVICE, T.DAYOFWEEK, T.HOUR, T.IS_HOLIDAY"""

In [106]:
# set weather features
weather_features = """
    W.temp, W.humidity, W.wind_speed, W.rain_1h, W.weather_main
"""

In [107]:
# set ordered feature list
ordered_features = [
    'HOUR',
    
    'DAYOFWEEK_Monday',
    'DAYOFWEEK_Tuesday',
    'DAYOFWEEK_Wednesday',
    'DAYOFWEEK_Thursday',
    'DAYOFWEEK_Friday',
    'DAYOFWEEK_Saturday',
    'DAYOFWEEK_Sunday',
       
    'MONTHOFSERVICE_January',
    'MONTHOFSERVICE_February',
    'MONTHOFSERVICE_March',
    'MONTHOFSERVICE_April',
    'MONTHOFSERVICE_May',
    'MONTHOFSERVICE_June',
    'MONTHOFSERVICE_July',
    'MONTHOFSERVICE_August',
    'MONTHOFSERVICE_September',
    'MONTHOFSERVICE_October',
    'MONTHOFSERVICE_November',
    'MONTHOFSERVICE_December',
    
    'IS_HOLIDAY_0',
    'IS_HOLIDAY_1',
    
    'humidity',
    'rain_1h',
    'temp',
    'wind_speed',
    
    'weather_main_Clear',
    'weather_main_Clouds',
    'weather_main_Drizzle',
    'weather_main_Fog',
    'weather_main_Mist',
    'weather_main_Rain',
    'weather_main_Smoke',
    'weather_main_Snow',
    
]

In [108]:
# set extra dummy features to drop
features_to_drop = [
    'DAYOFWEEK_Monday',
    'MONTHOFSERVICE_January', 
    'IS_HOLIDAY_0', 
    'weather_main_Clear', 
]
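
The two lists above work together. `ordered_features` fixes one column layout for every route, even when a route's data never contains a given category. `features_to_drop` removes one baseline level per categorical variable so the dummy columns are not perfectly collinear. The sketch below shows the same idea on a tiny made-up frame. It uses `DataFrame.reindex` as a compact alternative to adding missing columns in a loop, and `master`, `baseline` and `raw` are hypothetical names.

```python
import pandas as pd

# Shortened stand-ins for ordered_features and features_to_drop
master = ['HOUR', 'DAYOFWEEK_Monday', 'DAYOFWEEK_Tuesday', 'DAYOFWEEK_Wednesday']
baseline = ['DAYOFWEEK_Monday']

# Made-up data in which Tuesday never occurs
raw = pd.DataFrame({'HOUR': [8, 17], 'DAYOFWEEK': ['Monday', 'Wednesday']})

# One-hot encode the categorical column; HOUR is numeric and passes through
encoded = pd.get_dummies(raw)

# Add any missing dummy columns as zeros, fix the column order, drop the baseline
aligned = encoded.reindex(columns=master, fill_value=0).drop(columns=baseline)

print(list(aligned.columns))
```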

In [140]:
# train models for each line
total_line = len(line_routes)
line_count = 0
for line in line_routes:
    
    line_count += 1
    print(f"Line {line} -- {line_count}/{total_line}")
    
    
    
    for direction in line_routes[line]:
        
        routeID = line_routes[line][direction]
        
        # initialise query; the route ID is bound as a parameter (?) instead of
        # being formatted into the SQL string
        query = f"""
            SELECT {trip_features}, {weather_features}
            FROM trips2 T, weather W
            WHERE ROUTEID = ? and T.dt = W.dt
        """
        
        # read in query to dataframe
        df = pd.read_sql(query, conn, params=(routeID,))
        
        # change hour to numerical
        df['HOUR'] = df['HOUR'].astype('int64')
        
        # get dummy variables
        df = pd.get_dummies(df)
        
        # find differences between df and master features
        diff_cols = list(set(ordered_features) - set(df.columns))
        
        # add missing features to df
        for c in diff_cols:
            df[c] = 0
            
        # separate target feature
        tf = df['ACTUALTIME_TRAVEL']
            
        # reorder features, dropping target feature
        df = df[ordered_features]
        
        # drop extra dummy variables
        df = df.drop(columns=features_to_drop)
        
        # train linear regression
        linear_reg = LinearRegression().fit(df, tf)
        
        # save model as pickle
        file_name = f"route_{line}_{direction}.pkl" 
        file_path = f"/home/faye/Data-Analytics-CityRoute/route_models/{file_name}"
        with open(file_path, 'wb') as handle:
            pickle.dump(linear_reg, handle)


Line 1 -- 1/130
Line 102 -- 2/130
Line 104 -- 3/130
Line 11 -- 4/130
Line 111 -- 5/130
Line 114 -- 6/130
Line 116 -- 7/130
Line 118 -- 8/130
Line 120 -- 9/130
Line 122 -- 10/130
Line 123 -- 11/130
Line 13 -- 12/130
Line 130 -- 13/130
Line 14 -- 14/130
Line 140 -- 15/130
Line 142 -- 16/130
Line 145 -- 17/130
Line 14C -- 18/130
Line 15 -- 19/130
Line 150 -- 20/130
Line 151 -- 21/130
Line 15A -- 22/130
Line 15B -- 23/130
Line 15D -- 24/130
Line 16 -- 25/130
Line 161 -- 26/130
Line 16C -- 27/130
Line 16D -- 28/130
Line 17 -- 29/130
Line 17A -- 30/130
Line 18 -- 31/130
Line 184 -- 32/130
Line 185 -- 33/130
Line 220 -- 34/130
Line 236 -- 35/130
Line 238 -- 36/130
Line 239 -- 37/130
Line 25 -- 38/130
Line 25A -- 39/130
Line 25B -- 40/130
Line 25D -- 41/130
Line 25X -- 42/130
Line 26 -- 43/130
Line 27 -- 44/130
Line 270 -- 45/130
Line 27A -- 46/130
Line 27B -- 47/130
Line 27X -- 48/130
Line 29A -- 49/130
Line 31 -- 50/130
Line 31A -- 51/130
Line 31B -- 52/130
Line 31D -- 53/130
Line 32 -- 54/1

The above produced 252 models for 130 lines. With two directions per line there would be 2 × 130 = 260 models, so the shortfall of 8 suggests that 8 lines operate in only one direction.
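
Since the loop saves one model per (line, direction) pair, the model count should equal the total number of direction entries in the dictionary. A minimal sketch of that cross-check on a made-up dictionary follows; `sample_routes` and its route IDs are hypothetical, and the same expressions could be applied to `line_routes`.

```python
# Hypothetical miniature of line_routes: two lines with two directions, one with one
sample_routes = {
    "1": {"1": "route_a", "2": "route_b"},
    "11": {"1": "route_c", "2": "route_d"},
    "118": {"1": "route_e"},
}

# One model is saved per (line, direction) pair
expected_models = sum(len(directions) for directions in sample_routes.values())

# Upper bound if every line ran in both directions
max_models = 2 * len(sample_routes)

# Each line with a single direction accounts for one missing model
shortfall = max_models - expected_models
print(expected_models, max_models, shortfall)
```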

In [151]:
# find lines that do not have exactly 2 directions
one_direction_lines = []
for line in line_routes:
    if len(line_routes[line]) != 2:
        one_direction_lines.append(line)

print(one_direction_lines)

['118', '16D', '33E', '41A', '46E', '51X', '68X', '77X']
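
For completeness, a saved model can later be reloaded with `pickle.load` and used for prediction, provided the input frame has the same columns in the same order as the training frame (here, `ordered_features` minus `features_to_drop`). The following is a rough sketch on toy data rather than a real route: the two-column feature set, the values, and the file name `route_demo_1.pkl` are all assumptions for illustration.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training data standing in for one route's prepared feature frame.
# Column names are borrowed from the notebook; the values are made up.
X_train = pd.DataFrame({'HOUR': [7, 8, 9, 17], 'rain_1h': [0.0, 0.5, 0.0, 1.0]})
y_train = pd.Series([700.0, 825.0, 900.0, 1750.0])

toy_model = LinearRegression().fit(X_train, y_train)

# Save and reload with the same pickle calls the notebook uses
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "route_demo_1.pkl")  # hypothetical file name
    with open(toy_path, 'wb') as handle:
        pickle.dump(toy_model, handle)
    with open(toy_path, 'rb') as handle:
        loaded_model = pickle.load(handle)

# New rows must have the same columns, in the same order, as the training frame
X_new = pd.DataFrame({'HOUR': [10], 'rain_1h': [0.2]})
prediction = loaded_model.predict(X_new)[0]
print(prediction)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.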


***

[Back to top](#top)