In this notebook, I will try to get an understanding of the given dataset.


In [1]:
# Importing the modules that might be used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import dask.dataframe as dd
import sqlite3
from sqlalchemy import create_engine
from pprint import pprint

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

## Exploring the dataset
We are given three datasets:
* Leave times
* Trips 
* Vehicles

In this notebook:
* Understand each dataset
* Look for inconsistencies
* Start looking at trips and get a better understanding how the datasets work together

In [2]:
# error_bad_lines=False skips malformed lines (e.g. too many separators) instead of raising an error.
# Note: this argument was deprecated in pandas 1.3 in favour of on_bad_lines='skip'.

trips = pd.read_csv('/home/faye/data/rt_trips_DB_2018.txt', sep=';',error_bad_lines=False)
vehicles = pd.read_csv('/home/faye/data/rt_vehicles_DB_2018.txt', sep=';',error_bad_lines=False)
# leave_times_db = pd.read_csv('/home/faye/data/rt_leavetimes_DB_2018.txt', sep=';',error_bad_lines=False)

# refer to this link for more info https://docs.sqlalchemy.org/en/14/core/engines.html
leave_times_db = create_engine('sqlite:///rt_leavetimes_DB_2018.db')
con = sqlite3.connect('/home/faye/notebooks/rt_leavetimes_DB_2018.db')

In [3]:
leave_times_db

Engine(sqlite:///rt_leavetimes_DB_2018.db)

In [4]:
con

<sqlite3.Connection at 0x7f0aa8bd6e30>

## Leave Times Data

#### Understanding the features:
* DATASOURCE: Unique Bus Operator Code
* DAYOFSERVICE: Day of service. One day of service could last more than 24 hours.
* TRIPID: Refers to a unique Trip. Will be investigated further.
* PROGRNUMBER: Sequential position of the stop point of the trip.
* STOPPOINTID: Unique stop point code
* PLANNEDTIME_ARR: Planned arrival time at the stop point, in seconds
* PLANNEDTIME_DEP: Planned departure time from the stop point, in seconds
* ACTUALTIME_ARR: Actual arrival time at the stop point, in seconds
* ACTUALTIME_DEP: Actual departure time from the stop point, in seconds
* VEHICLEID: Unique vehicle code arriving at this stop point
* PASSENGERS: Number of passengers on board (previous link)
* PASSENGERSIN: Number of boarded passengers
* PASSENGERSOUT: Number of descended passengers
* DISTANCE: Distance measured from the beginning of the trip
* SUPPRESSED: Used when the trip is partially suppressed; indicates whether the previous link was suppressed (0 = achieved, 1 = suppressed)
* JUSTIFICATIONID: Fault code
* LASTUPDATE: Time of the last record update
* NOTE: Free note. The documentation says this is a string, but the SQLite schema below declares it as FLOAT.
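
The four time columns hold second counts rather than clock times. A minimal sketch of turning one (DAYOFSERVICE, seconds) pair into a timestamp is shown below. It assumes the seconds are offsets from midnight of DAYOFSERVICE, which would fit the remark that a day of service can last more than 24 hours; the helper name `service_time` is hypothetical.

```python
from datetime import datetime, timedelta

def service_time(day_of_service: str, seconds: int) -> datetime:
    """Combine a DAYOFSERVICE string with a seconds offset (assumed to be from midnight)."""
    # %b matches month abbreviations such as 'FEB' case-insensitively
    # (this relies on an English/C locale).
    day = datetime.strptime(day_of_service, "%d-%b-%y %H:%M:%S")
    return day + timedelta(seconds=int(seconds))

print(service_time("16-FEB-18 00:00:00", 54531))  # an afternoon stop time
print(service_time("07-FEB-18 00:00:00", 87245))  # offset above 86400 rolls into the next day
```

Keeping the date and the offset together like this avoids losing trips that finish after midnight.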

#### Exploring Leave Times Data

* check the first 5 rows
* last 5 rows
* check rows and columns
* check data types
* check missing data

In [5]:
def sql_fetch(con, query):
    """Run a query, pretty-print the rows and return them."""
    cursorObj = con.cursor()
    cursorObj.execute(query)
    # fetchall() exhausts the cursor, so fetch once and reuse the result
    rows = cursorObj.fetchall()
    pprint(rows)
    return rows

In [6]:
# list info about the table in the database
schema_rows = sql_fetch(con, 'select * from sqlite_master;')

[('table',
  'chunk_sql',
  'chunk_sql',
  2,
  'CREATE TABLE chunk_sql (\n'
  '\t"index" BIGINT, \n'
  '\t"DATASOURCE" TEXT, \n'
  '\t"DAYOFSERVICE" TEXT, \n'
  '\t"TRIPID" BIGINT, \n'
  '\t"PROGRNUMBER" BIGINT, \n'
  '\t"STOPPOINTID" BIGINT, \n'
  '\t"PLANNEDTIME_ARR" BIGINT, \n'
  '\t"PLANNEDTIME_DEP" BIGINT, \n'
  '\t"ACTUALTIME_ARR" BIGINT, \n'
  '\t"ACTUALTIME_DEP" BIGINT, \n'
  '\t"VEHICLEID" BIGINT, \n'
  '\t"PASSENGERS" FLOAT, \n'
  '\t"PASSENGERSIN" FLOAT, \n'
  '\t"PASSENGERSOUT" FLOAT, \n'
  '\t"DISTANCE" FLOAT, \n'
  '\t"SUPPRESSED" FLOAT, \n'
  '\t"JUSTIFICATIONID" FLOAT, \n'
  '\t"LASTUPDATE" TEXT, \n'
  '\t"NOTE" FLOAT\n'
  ')'),
 ('index',
  'ix_chunk_sql_index',
  'chunk_sql',
  3,
  'CREATE INDEX ix_chunk_sql_index ON chunk_sql ("index")')]
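
A `sqlite3` cursor hands out each row of a result set only once, so a helper that calls `fetchall()` for printing has to keep that result if it also wants to return it. A minimal sketch of the behaviour, using an in-memory database and a hypothetical table called `demo`:

```python
import sqlite3

# In-memory database with a hypothetical table, used only for illustration
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE demo (id INTEGER, name TEXT)")
demo_con.executemany("INSERT INTO demo VALUES (?, ?)", [(1, "a"), (2, "b")])

cur = demo_con.cursor()
cur.execute("SELECT * FROM demo ORDER BY id")
first = cur.fetchall()   # consumes every row of the result set
second = cur.fetchall()  # nothing is left to fetch

print(first)
print(second)
demo_con.close()
```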

In [7]:
# Read all leave-time rows for a single TRIPID into a DataFrame
leave_times = pd.read_sql('select * from chunk_sql where TRIPID = 6259460;', con)
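
The TRIPID is hard-coded in the query string above. The sketch below shows the same kind of lookup with a bound `?` parameter, using an in-memory table with made-up rows; the table name is copied from the schema, while the helper `rows_for_trip` and the data are purely illustrative. `pd.read_sql` also takes a `params` argument, which should allow the same pattern with a `sqlite3` connection, though that is not exercised here.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute('CREATE TABLE chunk_sql ("TRIPID" BIGINT, "PROGRNUMBER" BIGINT)')
demo_con.executemany(
    "INSERT INTO chunk_sql VALUES (?, ?)",
    [(6259460, 1), (6259460, 2), (6253783, 1)],  # illustrative rows, not real data
)

def rows_for_trip(con, trip_id):
    """Return all rows for one TRIPID, binding the value with a '?' placeholder."""
    query = "SELECT * FROM chunk_sql WHERE TRIPID = ? ORDER BY PROGRNUMBER"
    return con.execute(query, (trip_id,)).fetchall()

trip_rows = rows_for_trip(demo_con, 6259460)
print(trip_rows)
demo_con.close()
```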

In [8]:
leave_times.head(5)

Unnamed: 0,index,DATASOURCE,DAYOFSERVICE,TRIPID,PROGRNUMBER,STOPPOINTID,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,VEHICLEID,PASSENGERS,PASSENGERSIN,PASSENGERSOUT,DISTANCE,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
0,11919360,DB,16-FEB-18 00:00:00,6259460,2,493,54531,54531,54597,54616,1000499,,,,,,,26-FEB-18 12:19:30,
1,11919400,DB,16-FEB-18 00:00:00,6259460,3,494,54617,54617,54700,54727,1000499,,,,,,,26-FEB-18 12:19:30,
2,11919463,DB,16-FEB-18 00:00:00,6259460,36,4611,56912,56912,57972,57972,1000499,,,,,,,26-FEB-18 12:19:30,
3,11919499,DB,16-FEB-18 00:00:00,6259460,37,4612,56945,56945,58002,58015,1000499,,,,,,,26-FEB-18 12:19:30,
4,11920568,DB,16-FEB-18 00:00:00,6259460,4,495,54695,54695,54839,54872,1000499,,,,,,,26-FEB-18 12:19:30,


In [9]:
leave_times.tail(5)

Unnamed: 0,index,DATASOURCE,DAYOFSERVICE,TRIPID,PROGRNUMBER,STOPPOINTID,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,VEHICLEID,PASSENGERS,PASSENGERSIN,PASSENGERSOUT,DISTANCE,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
395,19547832,DB,07-FEB-18 00:00:00,6259460,21,2215,56059,56059,56975,56986,1932345,,,,,,,28-FEB-18 12:05:11,
396,19547833,DB,07-FEB-18 00:00:00,6259460,33,4608,56753,56753,57846,57863,1932345,,,,,,,28-FEB-18 12:05:11,
397,19574031,DB,09-FEB-18 00:00:00,6259460,6,7588,54772,54772,54815,54815,1000935,,,,,,,28-FEB-18 12:32:51,
398,19574032,DB,09-FEB-18 00:00:00,6259460,18,4401,55823,55823,56532,56549,1000935,,,,,,,28-FEB-18 12:32:51,
399,19574033,DB,09-FEB-18 00:00:00,6259460,29,4605,56464,56464,57568,57586,1000935,,,,,,,28-FEB-18 12:32:51,


In [10]:
leave_times.dtypes

index               int64
DATASOURCE         object
DAYOFSERVICE       object
TRIPID              int64
PROGRNUMBER         int64
STOPPOINTID         int64
PLANNEDTIME_ARR     int64
PLANNEDTIME_DEP     int64
ACTUALTIME_ARR      int64
ACTUALTIME_DEP      int64
VEHICLEID           int64
PASSENGERS         object
PASSENGERSIN       object
PASSENGERSOUT      object
DISTANCE           object
SUPPRESSED         object
JUSTIFICATIONID    object
LASTUPDATE         object
NOTE               object
dtype: object

## Trips Data

Each row represents one trip (route).
#### Understanding the features:
* DATASOURCE: Unique Bus Operator Code
* DAYOFSERVICE: Day of service. One day of service could last more than 24 hours
* TRIPID: Unique Trip code
* LINEID: Unique Line code
* ROUTEID: Unique route code
* DIRECTION: Route direction: IB = inbound / going / northbound / eastbound, OB = outbound / back / southbound / westbound
* PLANNEDTIME_ARR: Planned arrival time of the trip, in seconds
* PLANNEDTIME_DEP: Planned departure time of the trip, in seconds
* ACTUALTIME_ARR: Actual arrival time of the trip, in seconds
* ACTUALTIME_DEP: Actual departure time of the trip, in seconds
* BASIN: basin code
* TENDERLOT: tender lot
* SUPPRESSED: The whole trip has been suppressed (0 = achieved, 1 = suppressed)
* JUSTIFICATIONID: Fault code
* LASTUPDATE: Time of the last record update 
* NOTE: Free note

In [11]:
trips.shape

(2182637, 16)

In [12]:
trips.head(20)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,BASIN,TENDERLOT,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
0,DB,07-FEB-18 00:00:00,6253783,68,68_80,1,87245,84600,87524.0,84600.0,BasDef,,,,28-FEB-18 12:05:11,",2967409,"
1,DB,07-FEB-18 00:00:00,6262138,25B,25B_271,2,30517,26460,32752.0,,BasDef,,,,28-FEB-18 12:05:11,",2580260,"
2,DB,07-FEB-18 00:00:00,6254942,45A,45A_70,2,35512,32100,36329.0,32082.0,BasDef,,,,28-FEB-18 12:05:11,",2448968,"
3,DB,07-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58463.0,54443.0,BasDef,,,,28-FEB-18 12:05:11,",3094242,"
4,DB,07-FEB-18 00:00:00,6253175,14,14_15,1,85383,81600,84682.0,81608.0,BasDef,,,,28-FEB-18 12:05:11,",2526331,"
5,DB,07-FEB-18 00:00:00,6248240,77A,77A_28,2,41648,37200,42019.0,37538.0,BasDef,,,,28-FEB-18 12:05:11,",2966500,"
6,DB,07-FEB-18 00:00:00,6251760,39,39_21,2,34768,28920,35709.0,28929.0,BasDef,,,,28-FEB-18 12:05:11,",2422850,"
7,DB,07-FEB-18 00:00:00,6262909,16,16_20,1,43936,38880,43721.0,38955.0,BasDef,,,,28-FEB-18 12:05:11,",3045261,"
8,DB,07-FEB-18 00:00:00,6251147,40D,40D_102,1,33558,30600,34540.0,31185.0,BasDef,,,,28-FEB-18 12:05:11,",2837123,"
9,DB,07-FEB-18 00:00:00,6249435,27B,27B_34,2,52682,49800,53281.0,49974.0,BasDef,,,,28-FEB-18 12:05:11,",2329338,"


In [13]:
trips.tail(5)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,BASIN,TENDERLOT,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
2182632,DB,14-MAY-18 00:00:00,6765849,123,123_36,2,61560,57840,61365.0,57859.0,BasDef,,,,26-JUN-18 09:13:13,",3216350,"
2182633,DB,14-MAY-18 00:00:00,6765469,75,75_17,1,53416,48600,,48823.0,BasDef,,,,26-JUN-18 09:13:13,",2865284,"
2182634,DB,14-MAY-18 00:00:00,6765486,33D,33D_62,2,29460,26400,29904.0,,BasDef,,,,26-JUN-18 09:13:13,",3077688,"
2182635,DB,14-MAY-18 00:00:00,6764987,70,70_60,1,65277,60600,66341.0,,BasDef,,,,26-JUN-18 09:13:13,",3208841,"
2182636,DB,14-MAY-18 00:00:00,6765012,27,27_19,1,47722,41700,47508.0,41642.0,BasDef,,,,26-JUN-18 09:13:13,",2960092,"


In [14]:
trips.dtypes

DATASOURCE          object
DAYOFSERVICE        object
TRIPID               int64
LINEID              object
ROUTEID             object
DIRECTION            int64
PLANNEDTIME_ARR      int64
PLANNEDTIME_DEP      int64
ACTUALTIME_ARR     float64
ACTUALTIME_DEP     float64
BASIN               object
TENDERLOT          float64
SUPPRESSED         float64
JUSTIFICATIONID    float64
LASTUPDATE          object
NOTE                object
dtype: object

In [16]:
# Unique values for each feature
trips.nunique()

DATASOURCE              1
DAYOFSERVICE          360
TRIPID             658964
LINEID                130
ROUTEID               588
DIRECTION               2
PLANNEDTIME_ARR     64461
PLANNEDTIME_DEP       791
ACTUALTIME_ARR      68122
ACTUALTIME_DEP      66771
BASIN                   1
TENDERLOT               0
SUPPRESSED              1
JUSTIFICATIONID      3526
LASTUPDATE            360
NOTE                46690
dtype: int64

In [17]:
# Missing values for each feature
trips.isnull().sum()

DATASOURCE               0
DAYOFSERVICE             0
TRIPID                   0
LINEID                   0
ROUTEID                  0
DIRECTION                0
PLANNEDTIME_ARR          0
PLANNEDTIME_DEP          0
ACTUALTIME_ARR      137207
ACTUALTIME_DEP      164551
BASIN                    0
TENDERLOT          2182637
SUPPRESSED         2178304
JUSTIFICATIONID    2178307
LASTUPDATE               0
NOTE                     0
dtype: int64

<br><br>
* Three features are constant (a single unique non-null value) and will be dropped: BASIN, SUPPRESSED and DATASOURCE.
* TENDERLOT is 100% missing, so this feature will be dropped. Trying to relate "tender lot" to business terms: it may indicate whether the trip is run as a paid service. If so, it makes sense that it is hidden for privacy. Either way, it does not relate much to the project.
* ACTUALTIME_ARR is missing in 137,207 rows, but that is only about 6% of the data.
* ACTUALTIME_DEP is similar: 164,551 rows are missing, about 7.5% of the data.
* LASTUPDATE: according to the documentation, this is used to detect when new data became available.
* SUPPRESSED does not look like it gives any meaningful data. The only non-null value is 0.0 (achieved); there are no 1.0 values at all, so no trip in the trips dataset is marked as suppressed. As shown above, key fields such as LINEID and TRIPID have no null values. From the documentation, the NaN values may just mean the trip is partially suppressed, which might also explain why TENDERLOT is 100% missing.
<br><br>
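
The percentages quoted above can be re-derived from the counts printed by `trips.shape` and `trips.isnull().sum()`. The sketch below copies those counts by hand; on the real DataFrame, `trips.isnull().mean() * 100` should give the same figures directly.

```python
# Counts copied from the trips.shape and trips.isnull().sum() outputs
total_rows = 2182637
missing = {
    "ACTUALTIME_ARR": 137207,
    "ACTUALTIME_DEP": 164551,
    "TENDERLOT": 2182637,
    "SUPPRESSED": 2178304,
    "JUSTIFICATIONID": 2178307,
}

# Share of rows with a missing value, as a percentage
missing_pct = {col: round(100 * n / total_rows, 2) for col, n in missing.items()}
for col, pct in missing_pct.items():
    print(f"{col:<16} {pct:6.2f}% missing")
```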

In [18]:
# Check the non-missing SUPPRESSED rows for 1.0 values (0 = achieved, 1 = suppressed)
non_missing_trips_suppressed = trips['SUPPRESSED'].notna()
for row in range(len(non_missing_trips_suppressed)):
    if non_missing_trips_suppressed[row]:
        if trips['SUPPRESSED'][row] == 1.0:
            print(trips['SUPPRESSED'][row])
        else:
            print("No 1.0")

No 1.0
No 1.0
No 1.0
...
No 1.0
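
The loop above prints one line per non-null row. A more compact check is to count the distinct values in the column. The sketch below uses a small made-up Series standing in for `trips['SUPPRESSED']`, so the counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for trips['SUPPRESSED']: mostly missing, a few 0.0 values
suppressed = pd.Series([np.nan, 0.0, np.nan, 0.0, np.nan])

counts = suppressed.value_counts(dropna=False)  # includes the NaN bucket
has_suppressed_trip = bool((suppressed == 1.0).any())

print(counts)
print("Any suppressed trips:", has_suppressed_trip)
```

On the full dataset this gives the answer in a single vectorised pass instead of a Python-level loop over every row.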

In [19]:
trips.loc[1783]

DATASOURCE                         DB
DAYOFSERVICE       18-FEB-18 00:00:00
TRIPID                        6269995
LINEID                            45A
ROUTEID                        45A_60
DIRECTION                           1
PLANNEDTIME_ARR                 68767
PLANNEDTIME_DEP                 65700
ACTUALTIME_ARR                    NaN
ACTUALTIME_DEP                  65746
BASIN                          BasDef
TENDERLOT                         NaN
SUPPRESSED                          0
JUSTIFICATIONID                194642
LASTUPDATE         26-FEB-18 11:09:33
NOTE                        ,2428302,
Name: 1783, dtype: object

## Vehicles Data

Each record of this table represents the service of one vehicle for one day of service and tells about
the overall distance and time worked by the associated vehicle in that specific day.

#### Understanding the features:
* DATASOURCE: Unique Bus Operator Code
* DAYOFSERVICE: Day of service
* VEHICLEID: Unique vehicle code
* DISTANCE: Distance travelled by the vehicle in the corresponding day
* MINUTES: Time worked by the vehicle in the corresponding day
* LASTUPDATE: Time of the last record update
* NOTE: Free note

In [20]:
vehicles.shape

(272622, 7)

In [21]:
vehicles.head(5)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,VEHICLEID,DISTANCE,MINUTES,LASTUPDATE,NOTE
0,DB,23-NOV-18 00:00:00,3303848,286166,58849,04-DEC-18 08:03:09,
1,DB,23-NOV-18 00:00:00,3303847,259545,56828,04-DEC-18 08:03:09,
2,DB,28-FEB-18 00:00:00,2868329,103096,40967,08-MAR-18 10:35:59,
3,DB,28-FEB-18 00:00:00,2868330,147277,43599,08-MAR-18 10:35:59,
4,DB,28-FEB-18 00:00:00,2868331,224682,40447,08-MAR-18 10:35:59,


In [22]:
vehicles.tail(5)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,VEHICLEID,DISTANCE,MINUTES,LASTUPDATE,NOTE
272617,DB,29-DEC-18 00:00:00,3393878,264237,62320,16-JAN-19 18:00:42,
272618,DB,29-DEC-18 00:00:00,3394105,250335,52254,16-JAN-19 18:00:42,
272619,DB,29-DEC-18 00:00:00,3394109,172539,44349,16-JAN-19 18:00:42,
272620,DB,29-DEC-18 00:00:00,3394130,188057,38948,16-JAN-19 18:00:42,
272621,DB,29-DEC-18 00:00:00,3394131,291697,63677,16-JAN-19 18:00:42,


In [23]:
vehicles.dtypes

DATASOURCE       object
DAYOFSERVICE     object
VEHICLEID         int64
DISTANCE          int64
MINUTES           int64
LASTUPDATE       object
NOTE            float64
dtype: object

In [24]:
#  Unique values for each feature
vehicles.nunique()

DATASOURCE           1
DAYOFSERVICE       360
VEHICLEID         1152
DISTANCE        170498
MINUTES          57523
LASTUPDATE         360
NOTE                 0
dtype: int64

In [25]:
# Missing values for each feature
vehicles.isnull().sum()

DATASOURCE           0
DAYOFSERVICE         0
VEHICLEID            0
DISTANCE             0
MINUTES              0
LASTUPDATE           0
NOTE            272622
dtype: int64

* DATASOURCE has only 1 unique value, so it will probably be dropped.
* NOTE is missing in every row, so it will be dropped.
* The other features have no missing values.
* MINUTES: despite the name, the values appear to be in seconds. For example, 58849 would be about 41 days as minutes but about 16 hours as seconds. It should have been named something else.
* DISTANCE: the unit is not documented. Values such as 286166 for a single vehicle in a single day are too large to be km or miles, so they are probably in a smaller unit such as metres. This will be investigated below when I start looking at individual trips.
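
One way to sanity-check both units at once is to compute an average speed from a single row. The sketch below takes the first row of `vehicles.head()` and assumes DISTANCE is in metres and MINUTES is in seconds; both units are assumptions, not something the documentation confirms.

```python
# First row of vehicles.head(): DISTANCE=286166, MINUTES=58849
distance = 286166
duration = 58849

hours_if_seconds = duration / 3600       # working time if MINUTES is really seconds
days_if_minutes = duration / 60 / 24     # working time if MINUTES were minutes
speed_kmh = (distance / 1000) / hours_if_seconds  # assumes metres and seconds

print(f"{hours_if_seconds:.1f} h if seconds, {days_if_minutes:.1f} days if minutes")
print(f"average speed: {speed_kmh:.1f} km/h (metres and seconds assumed)")
```

A result in the range of typical urban bus speeds would support the assumed units, though it includes any idle time in the working day and does not prove them.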


In this section:
* Investigate one trip:
    * See if DISTANCE is measured in km or miles.
    * See if I can figure out if the trip is inbound or outbound.
    * Get a better understanding of the STOPPOINTID feature, and whether it really corresponds to the stop number.

In [26]:
trip25 = trips.loc[trips['TRIPID'] == 6259460]
trip25

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,BASIN,TENDERLOT,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
3,DB,07-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58463.0,54443.0,BasDef,,,,28-FEB-18 12:05:11,",3094242,"
206056,DB,12-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58082.0,54430.0,BasDef,,,,28-FEB-18 13:18:29,",3094242,"
215820,DB,16-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58268.0,54432.0,BasDef,,,,26-FEB-18 12:19:30,",3094242,"
306497,DB,05-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,57593.0,54430.0,BasDef,,,,28-FEB-18 11:29:31,",3094242,"
554013,DB,14-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,57527.0,54418.0,BasDef,,,,15-MAR-18 12:47:59,",3094242,"
734151,DB,15-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58796.0,54430.0,BasDef,,,,28-FEB-18 13:51:50,",3094242,"
1061171,DB,08-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,57662.0,54436.0,BasDef,,,,28-FEB-18 12:20:01,",3094242,"
1310703,DB,09-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58376.0,54459.0,BasDef,,,,28-FEB-18 12:32:51,",3094242,"
1578994,DB,13-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,57955.0,54497.0,BasDef,,,,28-FEB-18 13:29:11,",3094242,"
1618386,DB,06-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,57912.0,54462.0,BasDef,,,,28-FEB-18 11:48:21,",3094242,"


In [27]:
trip25.shape

(10, 16)
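
Since the same TRIPID appears on ten different days, TRIPID alone cannot link a trips row to its leave-times rows; the pair (DAYOFSERVICE, TRIPID) looks like the natural key. A sketch with toy frames follows. The values are copied from rows shown in this notebook and heavily trimmed, and the frame names `trips_demo` and `leave_demo` are hypothetical.

```python
import pandas as pd

# Two of the ten trips rows for TRIPID 6259460
trips_demo = pd.DataFrame({
    "DAYOFSERVICE": ["07-FEB-18 00:00:00", "16-FEB-18 00:00:00"],
    "TRIPID": [6259460, 6259460],
    "LINEID": ["25A", "25A"],
})
# Three leave-times rows for the same TRIPID on those days
leave_demo = pd.DataFrame({
    "DAYOFSERVICE": ["16-FEB-18 00:00:00", "16-FEB-18 00:00:00", "07-FEB-18 00:00:00"],
    "TRIPID": [6259460, 6259460, 6259460],
    "PROGRNUMBER": [2, 3, 21],
})

# Joining on both columns keeps each stop attached to the right day's trip
joined = leave_demo.merge(trips_demo, on=["DAYOFSERVICE", "TRIPID"], how="left")
# Joining on TRIPID alone pairs every stop with every day
tripid_only = leave_demo.merge(trips_demo, on="TRIPID", how="left")

print(len(joined), len(tripid_only))
```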

In [28]:
import datetime
str(datetime.timedelta(seconds=30517))

'8:28:37'
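
`str(timedelta)` drops the leading zero and switches to a "1 day, ..." form once the value passes 24 hours, which can happen here because a service day may run past midnight. Below is a small sketch of a fixed-width formatter that keeps counting hours past 24; the name `seconds_to_hms` is hypothetical.

```python
import datetime

def seconds_to_hms(seconds):
    """Format a seconds offset as HH:MM:SS, letting hours exceed 23."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

print(seconds_to_hms(30517))                    # same value as the cell above
print(str(datetime.timedelta(seconds=87245)))   # built-in form past 24 hours
print(seconds_to_hms(87245))
```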

In [29]:
# Going to attempt to convert seconds into times
time_columns = ['PLANNEDTIME_ARR', 'PLANNEDTIME_DEP', 'ACTUALTIME_ARR', 'ACTUALTIME_DEP']

planned_actual_tb = trip25[time_columns]

for row in trip25['PLANNEDTIME_ARR']:
    print(row)

57261
57261
57261
57261
57261
57261
57261
57261
57261
57261


In [31]:
# Let's look at a specific date
trip25.loc[trip25['DAYOFSERVICE']=='16-FEB-18 00:00:00']

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,BASIN,TENDERLOT,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
215820,DB,16-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58268.0,54432.0,BasDef,,,,26-FEB-18 12:19:30,",3094242,"


<br><br>
For this TRIPID, every run was planned to depart at 15:07 (54420 s) and to reach the last stop at about 15:54 (57261 s); the actual arrivals were between roughly 15:59 and 16:20. Google Maps directions state that the journey takes around 50 minutes, so this is about right. These rows probably represent the buses that ran this 25A trip on different days. They all go in one direction (DIRECTION = 1), though, so I am going to investigate whether 1 represents inbound or outbound.

The 16th was a Friday. Based on the departure time, the timetable matches buses coming from Merrion Sq in town, so I am assuming these buses start from town. I am not sure yet whether that matters.
<br><br>
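
The "about right" claim can be checked from the 16-FEB-18 row: planned and actual durations are just differences of the seconds columns. A sketch with the values copied from that row:

```python
# Values from the 16-FEB-18 row of trip25
planned_dep, planned_arr = 54420, 57261
actual_dep, actual_arr = 54432, 58268

planned_minutes = (planned_arr - planned_dep) / 60
actual_minutes = (actual_arr - actual_dep) / 60
arrival_delay_minutes = (actual_arr - planned_arr) / 60

print(f"planned: {planned_minutes:.1f} min, actual: {actual_minutes:.1f} min")
print(f"arrived {arrival_delay_minutes:.1f} min behind schedule")
```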

In [32]:
leave_times 

Unnamed: 0,index,DATASOURCE,DAYOFSERVICE,TRIPID,PROGRNUMBER,STOPPOINTID,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,VEHICLEID,PASSENGERS,PASSENGERSIN,PASSENGERSOUT,DISTANCE,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
0,11919360,DB,16-FEB-18 00:00:00,6259460,2,493,54531,54531,54597,54616,1000499,,,,,,,26-FEB-18 12:19:30,
1,11919400,DB,16-FEB-18 00:00:00,6259460,3,494,54617,54617,54700,54727,1000499,,,,,,,26-FEB-18 12:19:30,
2,11919463,DB,16-FEB-18 00:00:00,6259460,36,4611,56912,56912,57972,57972,1000499,,,,,,,26-FEB-18 12:19:30,
3,11919499,DB,16-FEB-18 00:00:00,6259460,37,4612,56945,56945,58002,58015,1000499,,,,,,,26-FEB-18 12:19:30,
4,11920568,DB,16-FEB-18 00:00:00,6259460,4,495,54695,54695,54839,54872,1000499,,,,,,,26-FEB-18 12:19:30,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,19547832,DB,07-FEB-18 00:00:00,6259460,21,2215,56059,56059,56975,56986,1932345,,,,,,,28-FEB-18 12:05:11,
396,19547833,DB,07-FEB-18 00:00:00,6259460,33,4608,56753,56753,57846,57863,1932345,,,,,,,28-FEB-18 12:05:11,
397,19574031,DB,09-FEB-18 00:00:00,6259460,6,7588,54772,54772,54815,54815,1000935,,,,,,,28-FEB-18 12:32:51,
398,19574032,DB,09-FEB-18 00:00:00,6259460,18,4401,55823,55823,56532,56549,1000935,,,,,,,28-FEB-18 12:32:51,


In [33]:
# Going to attempt to convert seconds into times for leave_times

In [50]:
# Making new df in order of PROGRNUMBER
leavetimes16th = leave_times.loc[leave_times["DAYOFSERVICE"]=="16-FEB-18 00:00:00"]
leavetimes16th = leavetimes16th.sort_values(by='PROGRNUMBER')

In [51]:
# Convert each seconds column to an H:MM:SS string
time_columns = ['PLANNEDTIME_ARR', 'PLANNEDTIME_DEP', 'ACTUALTIME_ARR', 'ACTUALTIME_DEP']
for col in time_columns:
    leavetimes16th[col] = leavetimes16th[col].map(
        lambda seconds: str(datetime.timedelta(seconds=int(seconds)))
    )

In [52]:
leavetimes16th

Unnamed: 0,index,DATASOURCE,DAYOFSERVICE,TRIPID,PROGRNUMBER,STOPPOINTID,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,VEHICLEID,PASSENGERS,PASSENGERSIN,PASSENGERSOUT,DISTANCE,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
33,12058819,DB,16-FEB-18 00:00:00,6259460,1,7391,15:07:00,15:07:00,15:07:12,15:07:12,1000499,,,,,,,26-FEB-18 12:19:30,
0,11919360,DB,16-FEB-18 00:00:00,6259460,2,493,15:08:51,15:08:51,15:09:57,15:10:16,1000499,,,,,,,26-FEB-18 12:19:30,
1,11919400,DB,16-FEB-18 00:00:00,6259460,3,494,15:10:17,15:10:17,15:11:40,15:12:07,1000499,,,,,,,26-FEB-18 12:19:30,
4,11920568,DB,16-FEB-18 00:00:00,6259460,4,495,15:11:35,15:11:35,15:13:59,15:14:32,1000499,,,,,,,26-FEB-18 12:19:30,
5,11920609,DB,16-FEB-18 00:00:00,6259460,5,400,15:12:38,15:12:38,15:15:13,15:15:32,1000499,,,,,,,26-FEB-18 12:19:30,
31,12058700,DB,16-FEB-18 00:00:00,6259460,6,7588,15:12:52,15:12:52,15:15:43,15:16:13,1000499,,,,,,,26-FEB-18 12:19:30,
7,11921844,DB,16-FEB-18 00:00:00,6259460,7,7392,15:15:11,15:15:11,15:22:03,15:23:43,1000499,,,,,,,26-FEB-18 12:19:30,
34,12060650,DB,16-FEB-18 00:00:00,6259460,8,312,15:16:01,15:16:01,15:24:39,15:25:08,1000499,,,,,,,26-FEB-18 12:19:30,
9,11923080,DB,16-FEB-18 00:00:00,6259460,9,1444,15:17:10,15:17:10,15:26:36,15:26:36,1000499,,,,,,,26-FEB-18 12:19:30,
10,11923115,DB,16-FEB-18 00:00:00,6259460,10,1445,15:19:01,15:19:01,15:28:26,15:28:38,1000499,,,,,,,26-FEB-18 12:19:30,


In [36]:
vehicles.loc[vehicles['DAYOFSERVICE']=='16-FEB-18 00:00:00']

Unnamed: 0,DATASOURCE,DAYOFSERVICE,VEHICLEID,DISTANCE,MINUTES,LASTUPDATE,NOTE
33221,DB,16-FEB-18 00:00:00,1000110,228874,55641,26-FEB-18 12:19:30,
33222,DB,16-FEB-18 00:00:00,1000113,128763,35281,26-FEB-18 12:19:30,
33223,DB,16-FEB-18 00:00:00,1000115,236146,60897,26-FEB-18 12:19:30,
33224,DB,16-FEB-18 00:00:00,1000116,152375,41586,26-FEB-18 12:19:30,
33225,DB,16-FEB-18 00:00:00,1000117,177197,46542,26-FEB-18 12:19:30,
...,...,...,...,...,...,...,...
38052,DB,16-FEB-18 00:00:00,3088368,264588,55859,26-FEB-18 12:19:30,
38053,DB,16-FEB-18 00:00:00,3088372,187291,46889,26-FEB-18 12:19:30,
38054,DB,16-FEB-18 00:00:00,3088373,94749,21493,26-FEB-18 12:19:30,
38055,DB,16-FEB-18 00:00:00,3088374,266384,51523,26-FEB-18 12:19:30,
