In [1]:
import pandas as pd
import sklearn
from sklearn.impute import KNNImputer
import datetime as dt
import sqlite3

In [None]:
df = pd.read_csv("rt_trips_DB_2018.txt", sep=";")

<h3>Initial Exploration</h3>

In [4]:
# Display first 5 rows of dataset
df.head()

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,BASIN,TENDERLOT,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
0,DB,07-FEB-18 00:00:00,6253783,68,68_80,1,87245,84600,87524.0,84600.0,BasDef,,,,28-FEB-18 12:05:11,",2967409,"
1,DB,07-FEB-18 00:00:00,6262138,25B,25B_271,2,30517,26460,32752.0,,BasDef,,,,28-FEB-18 12:05:11,",2580260,"
2,DB,07-FEB-18 00:00:00,6254942,45A,45A_70,2,35512,32100,36329.0,32082.0,BasDef,,,,28-FEB-18 12:05:11,",2448968,"
3,DB,07-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58463.0,54443.0,BasDef,,,,28-FEB-18 12:05:11,",3094242,"
4,DB,07-FEB-18 00:00:00,6253175,14,14_15,1,85383,81600,84682.0,81608.0,BasDef,,,,28-FEB-18 12:05:11,",2526331,"


In [5]:
# Display number of rows and columns
df.shape

(2182637, 16)

<p>The dataset has 2,182,637 entries with 16 features per entry.</p>
<p>The features as defined in the Concept Design and any additional observations are outlined below:</p>
    <ul>
        <li>DATASOURCE: unique bus operator code - assumed DB stands for Dublin Bus.</li>
        <li>DAYOFSERVICE: day of service - one day of service could last more than 24 hours e.g. a trip that begins before midnight and runs into the following day.</li>
        <li>TRIPID: unique trip code, identifying a single end-to-end trip along a route - trip codes are repeated but are unique per day of service i.e. the combination of date and trip code could be used as a unique identifier.</li>
    </ul>

In [None]:
df.duplicated(subset=["TRIPID"]).sum()

In [None]:
df.duplicated(subset=["DAYOFSERVICE", "TRIPID"]).sum()

   <ul>
        <li>LINEID: unique line code, identifying a Dublin Bus line and including all stops along that line for both inbound and outbound journeys.</li>
        <li>ROUTEID: unique route code, identifying a particular route in one direction along a Dublin Bus line, i.e. a route is an operational subsection of the overall line, and including all stops along that route.</li>
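        <li>DIRECTION: route direction - it is described that 0 represents IB i.e. inbound/going/northbound/eastbound, and 1 represents OB i.e. outbound/back/southbound/westbound. However, the rows displayed above contain the values 1 and 2, so this mapping should be confirmed against the data.</li>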
        <li>PLANNEDTIME_DEP: planned departure time of the trip, in seconds past the beginning of the operation day, to allow representation of trips that arrive after midnight.</li>
        <li>PLANNEDTIME_ARR: planned arrival time of the trip, in seconds past the beginning of the operation day - this format is used to allow representation of trips that arrive after midnight e.g. a trip that departs at 23:30 on 1st April and arrives at its end destination at 00:30 on 2nd April will have its arrival time recorded as 88200 (when there are 86400 seconds in a single day).</li>
        <li>BASIN: basin code - key is not provided/could not be located.</li>
        <li>TENDERLOT: tender lot - no further details provided.</li>
        <li>ACTUALTIME_DEP: actual departure time of trip, in seconds past the beginning of the operation day.</li>
        <li>ACTUALTIME_ARR: actual arrival time of trip, in seconds past the beginning of the operation day.</li>
        <li>SUPPRESSED: flag, with 0 representing that the planned trip was achieved and 1 representing that the planned trip was suppressed i.e. the trip was not completed - the flag is set only if the trip was fully suppressed i.e. it never began, and is not set if the trip was partially suppressed i.e. it began but was not completed in full.</li>
        <li>JUSTIFICATIONID: fault code - key is not provided/could not be located.</li>
        <li>LASTUPDATE: date/time of last record update.</li>
        <li>NOTE: free note - key is not provided/could not be located.</li>
    </ul>
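
<p>The "seconds past the beginning of the operation day" format described above can be turned into ordinary timestamps by adding the offset to the parsed service date. The following is a minimal sketch on a hand-made sample, not the full dataset; the frame <code>sample</code> and the column <code>PLANNED_ARR_TS</code> are hypothetical names introduced for illustration, and the date format string is an assumption based on the rows displayed earlier.</p>

```python
import pandas as pd

# Hypothetical two-row sample shaped like the dataset.
# The first trip departs at 23:30 and arrives after midnight (88200 > 86400).
sample = pd.DataFrame({
    "DAYOFSERVICE": ["01-APR-18 00:00:00", "01-APR-18 00:00:00"],
    "PLANNEDTIME_DEP": [84600, 30000],
    "PLANNEDTIME_ARR": [88200, 33600],
})

# Parse the service date (format assumed from the displayed rows).
service_day = pd.to_datetime(sample["DAYOFSERVICE"], format="%d-%b-%y %H:%M:%S")

# Add the offset in seconds to obtain a calendar timestamp.
sample["PLANNED_ARR_TS"] = service_day + pd.to_timedelta(sample["PLANNEDTIME_ARR"], unit="s")

print(sample[["PLANNEDTIME_ARR", "PLANNED_ARR_TS"]])
# 88200 seconds after 1st April 00:00 falls at 00:30 on 2nd April.
```

<p>Keeping the raw second counts alongside the timestamps may still be useful, since differences between them give durations directly.</p>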

<h6>Examine feature datatypes</h6>

In [11]:
# Display datatype for each feature
df.dtypes

DATASOURCE          object
DAYOFSERVICE        object
TRIPID               int64
LINEID              object
ROUTEID             object
DIRECTION            int64
PLANNEDTIME_ARR      int64
PLANNEDTIME_DEP      int64
ACTUALTIME_ARR     float64
ACTUALTIME_DEP     float64
BASIN               object
TENDERLOT          float64
SUPPRESSED         float64
JUSTIFICATIONID    float64
LASTUPDATE          object
NOTE                object
dtype: object

<h6>Check for null values</h6>

In [6]:
# Display percentage null values per column
df.isna().sum() * 100 / len(df)

DATASOURCE           0.000000
DAYOFSERVICE         0.000000
TRIPID               0.000000
LINEID               0.000000
ROUTEID              0.000000
DIRECTION            0.000000
PLANNEDTIME_ARR      0.000000
PLANNEDTIME_DEP      0.000000
ACTUALTIME_ARR       6.286295
ACTUALTIME_DEP       7.539091
BASIN                0.000000
TENDERLOT          100.000000
SUPPRESSED          99.801479
JUSTIFICATIONID     99.801616
LASTUPDATE           0.000000
NOTE                 0.000000
dtype: float64

<p>
    <ul>
        <li>The feature TENDERLOT contains 100% null values.</li> 
        <li>Both SUPPRESSED and JUSTIFICATIONID are approximately 99.8% null. Following consultation of the Concept Design, it is believed that these features will not add any value to our prediction models.</li>
        <li>ACTUALTIME_ARR and ACTUALTIME_DEP both contain a small proportion of null values, approximately 6% and 8% respectively. The data contained within these features can be used to calculate actual journey time, which is crucial to our predictive model. The missing values for these features will need to be imputed or the rows containing null values dropped.</li>
        </ul></p>
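
<p>As an illustration of the row-dropping option mentioned above, the sketch below removes rows with a missing actual time and then derives journey times. It uses three rows copied from the <code>df.head()</code> output; the names <code>sample</code>, <code>complete</code>, <code>ACTUAL_JOURNEY_TIME</code> and <code>PLANNED_JOURNEY_TIME</code> are hypothetical. Imputation would be an alternative and is not shown here.</p>

```python
import numpy as np
import pandas as pd

# Three rows taken from the df.head() output; the second has no actual departure.
sample = pd.DataFrame({
    "PLANNEDTIME_ARR": [87245, 30517, 35512],
    "PLANNEDTIME_DEP": [84600, 26460, 32100],
    "ACTUALTIME_ARR": [87524.0, 32752.0, 36329.0],
    "ACTUALTIME_DEP": [84600.0, np.nan, 32082.0],
})

# Keep only rows where both actual times are present.
complete = sample.dropna(subset=["ACTUALTIME_ARR", "ACTUALTIME_DEP"]).copy()

# Journey times in seconds: arrival minus departure.
complete["ACTUAL_JOURNEY_TIME"] = complete["ACTUALTIME_ARR"] - complete["ACTUALTIME_DEP"]
complete["PLANNED_JOURNEY_TIME"] = complete["PLANNEDTIME_ARR"] - complete["PLANNEDTIME_DEP"]

print(complete[["PLANNED_JOURNEY_TIME", "ACTUAL_JOURNEY_TIME"]])
```

<p>Because both times count seconds from the start of the same operation day, the subtraction remains valid for trips that run past midnight.</p>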

<h6>Check for constant columns</h6>

In [7]:
# Display number of unique values per feature
df.nunique(axis=0)

DATASOURCE              1
DAYOFSERVICE          360
TRIPID             658964
LINEID                130
ROUTEID               588
DIRECTION               2
PLANNEDTIME_ARR     64461
PLANNEDTIME_DEP       791
ACTUALTIME_ARR      68122
ACTUALTIME_DEP      66771
BASIN                   1
TENDERLOT               0
SUPPRESSED              1
JUSTIFICATIONID      3526
LASTUPDATE            360
NOTE                46690
dtype: int64

In [10]:
# Print unique values per column 
print(df.DATASOURCE.unique())
print(df.BASIN.unique())
print(df.TENDERLOT.unique())
print(df.SUPPRESSED.unique())

['DB']
['BasDef']
[nan]
[nan  0.]


<p>DATASOURCE, BASIN and TENDERLOT are constant columns (in the case of TENDERLOT, the column contains only null values). SUPPRESSED contains mostly null values and is otherwise constant, therefore adding little value.</p>
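
<p>Columns of this kind can also be found programmatically rather than by inspection. The sketch below flags every column with at most one distinct non-null value, which covers both fully null columns such as TENDERLOT and columns like SUPPRESSED that hold a single value plus nulls. The frame <code>sample</code> and the variables <code>constant_cols</code> and <code>reduced</code> are hypothetical; whether such columns should actually be dropped remains a modelling decision.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the constant and near-constant columns.
sample = pd.DataFrame({
    "DATASOURCE": ["DB", "DB", "DB"],
    "LINEID": ["68", "25B", "45A"],
    "BASIN": ["BasDef", "BasDef", "BasDef"],
    "TENDERLOT": [np.nan, np.nan, np.nan],
    "SUPPRESSED": [np.nan, 0.0, np.nan],
})

# nunique() ignores nulls by default, so an all-null column counts as 0.
unique_counts = sample.nunique()
constant_cols = unique_counts[unique_counts <= 1].index.tolist()

# Drop the flagged columns, keeping only features that vary.
reduced = sample.drop(columns=constant_cols)

print(constant_cols)
print(reduced.columns.tolist())
```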

<h6>Check for duplicate rows</h6>

In [12]:
print("Number of duplicate rows (excluding the first occurrence) is:", df.duplicated().sum())

Number of duplicate rows (excluding the first occurrence) is: 0


<p>There are no duplicate rows in the dataset.</p>

In [8]:
df.duplicated(subset=["TRIPID"]).sum()

1523673

In [9]:
df.duplicated(subset=["DAYOFSERVICE", "TRIPID"]).sum()

0
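
<p>The two checks above show that TRIPID on its own repeats, while the combination of DAYOFSERVICE and TRIPID does not. One way to make that composite key explicit is to use it as the index and let pandas verify its uniqueness. This is a minimal sketch on a hypothetical three-row frame named <code>sample</code>, not a step taken from the original analysis.</p>

```python
import pandas as pd

# Hypothetical sample: the same TRIPID appears on two different service days.
sample = pd.DataFrame({
    "DAYOFSERVICE": ["07-FEB-18 00:00:00", "07-FEB-18 00:00:00", "08-FEB-18 00:00:00"],
    "TRIPID": [6253783, 6262138, 6253783],
})

# TRIPID alone is repeated; the (DAYOFSERVICE, TRIPID) pair is not.
tripid_dupes = sample.duplicated(subset=["TRIPID"]).sum()
key_dupes = sample.duplicated(subset=["DAYOFSERVICE", "TRIPID"]).sum()

# verify_integrity=True raises a ValueError if the new index has duplicates.
indexed = sample.set_index(["DAYOFSERVICE", "TRIPID"], verify_integrity=True)

print(tripid_dupes, key_dupes, indexed.index.is_unique)
```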