In this notebook, I will try get an understanding of the given dataset.


In [1]:
# Importing the modules that might be used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import dask.dataframe as dd
import sqlite3
from sqlalchemy import create_engine
from pprint import pprint

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

## Exploring the dataset
We are given three datasets:
* Leave times
* Trips 
* Vehicles

In this notebook:
* Understand each dataset
* Look for inconsistencies
* Start looking at trips and get a better understanding how the datasets work together

In [2]:
# Using error_bad_lines so that lines with too many commas will be returned in the dataframe

trips = pd.read_csv('/home/faye/data/rt_trips_DB_2018.txt', sep=';',error_bad_lines=False)
vehicles = pd.read_csv('/home/faye/data/rt_vehicles_DB_2018.txt', sep=';',error_bad_lines=False)
# leave_times_db = pd.read_csv('/home/faye/data/rt_leavetimes_DB_2018.txt', sep=';',error_bad_lines=False)
leave_times = pd.read_feather('/home/faye/data/leavetimes_cleaned_1.feather')
# refer to this link for more info https://docs.sqlalchemy.org/en/14/core/engines.html
leave_times_db = create_engine('sqlite:///rt_leavetimes_DB_2018.db')
con = sqlite3.connect('/home/faye/notebooks/rt_leavetimes_DB_2018.db')

In [3]:
leave_times_db

Engine(sqlite:///rt_leavetimes_DB_2018.db)

In [4]:
con

<sqlite3.Connection at 0x7f842ac1ae30>

In [6]:
vehicles.head(1)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,VEHICLEID,DISTANCE,MINUTES,LASTUPDATE,NOTE
0,DB,23-NOV-18 00:00:00,3303848,286166,58849,04-DEC-18 08:03:09,


In [7]:
leave_times.loc[leave_times['VEHICLEID']==3303848]

Unnamed: 0,DAYOFSERVICE,LASTUPDATE,TRIPID,PROGRNUMBER,STOPPOINTID,VEHICLEID,JUSTIFICATIONID,SUPPRESSED,PLANNEDTIME_ARR,ACTUALTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_DEP
93656062,2018-10-19,2018-10-30 17:32:16,8024553,23,4601,3303848,,,67908,68076,67908,68087
93656093,2018-10-19,2018-10-30 17:32:16,8024553,24,5058,3303848,,,68009,68136,68009,68136
93657247,2018-10-19,2018-10-30 17:32:16,8024553,26,4603,3303848,,,68153,68262,68153,68276
93658385,2018-10-19,2018-10-30 17:32:16,8024553,27,4604,3303848,,,68196,68306,68196,68319
93658417,2018-10-19,2018-10-30 17:32:16,8024553,28,4605,3303848,,,68263,68364,68263,68387
...,...,...,...,...,...,...,...,...,...,...,...,...
116947780,2018-12-31,2019-01-16 18:27:21,8589894,26,4401,3303848,,,69172,69577,69172,69628
116947826,2018-12-31,2019-01-16 18:27:21,8589891,27,2238,3303848,,,56556,56477,56556,56477
116947955,2018-12-31,2019-01-16 18:27:21,8589890,50,3961,3303848,,,52352,53025,52352,53041
116948463,2018-12-31,2019-01-16 18:27:21,8589894,43,3954,3303848,,,69956,70588,69956,70588


## Leave Times Data

#### Understanding the features:
* DATASOURCE: Unique Bus Operator Code
* DAYOFSERVICE: day of attendance. One day of service could last more than 24 hours.
* TRIPID: Refers to a unique Trip. Will be investigated further.
* PROGNUMBER:  Sequential position of the stop point of the trip.
* STOPPOINTID: Unique stop point code
* PLANNEDTIME_ARR: Planned arrival time at the stop point, in seconds
* PLANNEDTIME_DEP: Planned departure time from the stop point, in seconds
* ACTUALTIME_ARR: Actual arrival time at the stop point, in seconds
* ACTUALTIME_DEP: Actual departure time from the stop point, in seconds
* VEHICLEID: Unique vehicle code arriving at this stop point
* PASSENGERS: Number of passengers on board (previous link)
* PASSENGERSIN: Number of boarded passengers
* PASSENGERSOUT: Number of descended passengers
* DISTANCE: Distance measured from the beginning of the trip
* SUPPRESSED: When the trip is partially suppressed it says that the previous link is suppressed (0 =achieved, 1 = suppressed)
* JUSTIFICATIONID: Fault code
* LASTUPDATE: Time of the last record update
* NOTE: Free note -> document states that the data type is string type but here states that it's float64.

#### Exploring Leave Times Data

* check the first 5 rows
* last 5 rows
* check rows and columns
* check data types
* check missing data

In [5]:
def sql_fetch(con, query):
    cursorObj = con.cursor()
    cursorObj.execute(query)
    pprint(cursorObj.fetchall())
    return cursorObj.fetchall()

In [6]:
# list info about the table in the database
sql_fetch(con, 'select * from sqlite_master;')

[('table',
  'chunk_sql',
  'chunk_sql',
  2,
  'CREATE TABLE chunk_sql (\n'
  '\t"index" BIGINT, \n'
  '\t"DATASOURCE" TEXT, \n'
  '\t"DAYOFSERVICE" TEXT, \n'
  '\t"TRIPID" BIGINT, \n'
  '\t"PROGRNUMBER" BIGINT, \n'
  '\t"STOPPOINTID" BIGINT, \n'
  '\t"PLANNEDTIME_ARR" BIGINT, \n'
  '\t"PLANNEDTIME_DEP" BIGINT, \n'
  '\t"ACTUALTIME_ARR" BIGINT, \n'
  '\t"ACTUALTIME_DEP" BIGINT, \n'
  '\t"VEHICLEID" BIGINT, \n'
  '\t"PASSENGERS" FLOAT, \n'
  '\t"PASSENGERSIN" FLOAT, \n'
  '\t"PASSENGERSOUT" FLOAT, \n'
  '\t"DISTANCE" FLOAT, \n'
  '\t"SUPPRESSED" FLOAT, \n'
  '\t"JUSTIFICATIONID" FLOAT, \n'
  '\t"LASTUPDATE" TEXT, \n'
  '\t"NOTE" FLOAT\n'
  ')'),
 ('index',
  'ix_chunk_sql_index',
  'chunk_sql',
  3,
  'CREATE INDEX ix_chunk_sql_index ON chunk_sql ("index")')]


[]

In [7]:
# Read into a csv file?


In [None]:
leave_times.head(5)

In [None]:
leave_times.tail(5)

In [None]:
leave_times.dtypes

## Trips Data

Each row represents one trip(route)
#### Understanding the features:
* DATASOURCE: Unique Bus Operator Code
* DAYOFSERVICE: Day of service. One day of service could last more than 24 hours
* TRIPID: Unique Trip code
* LINEID: Unique Line code
* ROUTEID: Unique route code
* DIRECTION: Route direction: IB = inbound / going / northbound / eastbound, OB = outbound / back / southbound / westbound
* PLANNEDTIME_ARR: Planned arrival time of the trip, in seconds
* PLANNEDTIME_DEP: Planned departure time of the trip, in seconds
* ACTUALTIME_ARR: Actual arrival time of the trip, in seconds
* ACTUALTIME_DEP: Actual departure time of the trip, in seconds
* BASIN: basin code
* TENDERLOT: tender lot
* SUPPRESSED: The whole trip has been supressed (0 = achieved, 1 = suppressed)
* JUSTIFICATIONID: Fault code
* LASTUPDATE: Time of the last record update 
* NOTE: Free note

In [None]:
trips.shape

In [None]:
trips.head(20)

In [None]:
trips.tail(5)

In [None]:
trips_sortby_trip = trips.sort_values(by='TRIPID')
trips_sortby_trip.head(40)

<br><br>
So if the TRIPID is sorted in order, look at 77A. The dates aren't in order. But I counted 6 buses for one day. 3 buses inbound and 3 buses outbound. 

Idea for filtering:
* Filter by Dates > we get all of the buses than ran for that day > Filter by specific route (ROUTEID) > we get a list of all of the TRIPID.
* To link to leave_times - passing on the TRIPIDs.
* So different TRIPIDs represent different trips in the same day at different times. 

<br><br>

In [None]:
trips.dtypes

In [None]:
trips.dtypes

In [None]:
# Unique values for each feature
trips.nunique()

In [None]:
# Missing values for each feature
trips.isnull().sum()

<br><br>
* There are three constant features that will be dropped: BASIN, SUPPRESSED and DATASOURCE.
* TENDERLOT's rows are 100% missing so this feature will be dropped. Trying to relate tender lot to business terms. This may refer to whether or not the trip is paid for a service. Makes sense that it's hidden for privacy if this is the case. It doesn't relate much to the project. 
* ACTUALTIME_ARR has over 100k rows missing but only accounts for 6% of the data. 
* ACTUALTIME_DEP - the same observation. 
* LASTUPDATE - in the documentation it states that it detects new data availability. 
* SUPPRESSED doesn't look like it gives any meaningful data. Not sure why it has to be suppressed. There are no 1.0 values at all, which indicates that there are no trips/rows within the trips dataset where a the information is suppressed/hidden. Demonstrated above, we see the LINEID, TRIPID, etc don't have any null values. From the documentation, it looks like the NaN values just means it's partially suppressed which may explain TENDERLOT feature being 100% missing.
<br><br>

In [None]:
# Seeing what the non missing rows looked like. 0 represents that it 
non_missing_trips_suppressed = trips['SUPPRESSED'].notna()
for row in range(len(non_missing_trips_suppressed)):
    if non_missing_trips_suppressed[row] == True:
        if trips['SUPPRESSED'][row] == 1.0:
            print(trips['SUPPRESSED'][row])
        else:
            print("No 1.0")

In [None]:
trips.loc[1783]

## Vehicles Data

Each record of this table represents the service of one vehicle for one day of service and tells about
the overall distance and time worked by the associated vehicle in that specific day.

#### Understanding the features:
* DATASOURCE: Unique Bus Operator Code
* DAYOFSERVICE: Day of service
* VEHICLEID: Unique vehicle code arriving at this stop point
* DISTANCE: Distance travelled by the vehicle in the corresponding day
* MINUTES: Time worked by the veihcle in the corresponding day
* LASTUPDATE: Time of the last record update
* NOTE: Free note

In [None]:
vehicles.shape

In [None]:
vehicles.head(5)

In [None]:
vehicles.tail(5)

In [None]:
vehicles.dtypes

In [None]:
#  Unique values for each feature
vehicles.nunique()

In [None]:
# Missing values for each feature
vehicles.isnull().sum()

* DATASOURCE only has 1 unique value so this will probably be dropped.
* NOTE has no values so this will be dropped.
* All rows of NOTE is missing. Will be dropped.
* Other than that, other features have no missing values.
* MINUTES - Think this was meant to refer to the meaning of a period of time. Still should have been named something else. Represents in SECONDS.
* DISTANCE refers to either km or miles. What unit of measurement will be investigated below when I start investigating individual trips. But probably km. 


## Investigating datasets together
* See if DISTANCE is measured in km or miles.
* See if I can figure out if the trip is inbound or outbound.
* Get a better understanding of the STOPPOINTID feature. If it really relates to the stop no.

In [3]:
trip25 = trips.loc[trips['TRIPID'] == 6259460]
trip25

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,BASIN,TENDERLOT,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
3,DB,07-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58463.0,54443.0,BasDef,,,,28-FEB-18 12:05:11,",3094242,"
206056,DB,12-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58082.0,54430.0,BasDef,,,,28-FEB-18 13:18:29,",3094242,"
215820,DB,16-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58268.0,54432.0,BasDef,,,,26-FEB-18 12:19:30,",3094242,"
306497,DB,05-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,57593.0,54430.0,BasDef,,,,28-FEB-18 11:29:31,",3094242,"
554013,DB,14-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,57527.0,54418.0,BasDef,,,,15-MAR-18 12:47:59,",3094242,"
734151,DB,15-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58796.0,54430.0,BasDef,,,,28-FEB-18 13:51:50,",3094242,"
1061171,DB,08-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,57662.0,54436.0,BasDef,,,,28-FEB-18 12:20:01,",3094242,"
1310703,DB,09-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58376.0,54459.0,BasDef,,,,28-FEB-18 12:32:51,",3094242,"
1578994,DB,13-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,57955.0,54497.0,BasDef,,,,28-FEB-18 13:29:11,",3094242,"
1618386,DB,06-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,57912.0,54462.0,BasDef,,,,28-FEB-18 11:48:21,",3094242,"


In [4]:
trip25.shape

(10, 16)

In [5]:
import datetime
str(datetime.timedelta(seconds=30517))

'8:28:37'

In [None]:
# Going to attempt to convert seconds into times
time_columns = ['PLANNEDTIME_ARR', 'PLANNEDTIME_DEP', 'ACTUALTIME_ARR', 'ACTUALTIME_DEP']

for row in trip25['PLANNEDTIME_ARR']:
    trip25['PLANNEDTIME_ARR'] = trip25['PLANNEDTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in trip25['PLANNEDTIME_DEP']:
    trip25['PLANNEDTIME_DEP'] = trip25['PLANNEDTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

for row in trip25['ACTUALTIME_ARR']:
    trip25['ACTUALTIME_ARR'] = trip25['ACTUALTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in trip25['ACTUALTIME_DEP']:
    trip25['ACTUALTIME_DEP'] = trip25['ACTUALTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

In [None]:
trip25

In [None]:
# Let's look at a specific date
trip25.loc[trip25['DAYOFSERVICE']=='16-FEB-18 00:00:00']

<br><br>
For this TRIPID all of the departure left around 3pm and arrived at 4pm to the last stop. Google map directions states it takes around 50 minutes. So this is about right. This probably represents the number of buses that ran 25A in different days. They all just go in one direction though. Going to investigate whether 1 represents inbound or outbound. 

This was a Friday, based on the departure the timetable matches so the route of this bus(es) are coming from Merrion Sq in town. Not sure if it matters, just assuming the buses are coming from town?? idk.

1 represents going westbound which is the way to Lucan from town so feature DIRECTION: 1 = outbound & 2 = inbound.
<br><br>

In [None]:
leave_times 

### Going to attempt to convert seconds into times for leave_times

In [None]:
# Making new df in order of PROGRNUMBER
leavetimes16th = leave_times.loc[leave_times["DAYOFSERVICE"]=="16-FEB-18 00:00:00"]
leavetimes16th = leavetimes16th.sort_values(by='PROGRNUMBER')

In [None]:
for row in leavetimes16th['PLANNEDTIME_ARR']:
    leavetimes16th['PLANNEDTIME_ARR'] = leavetimes16th['PLANNEDTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leavetimes16th['PLANNEDTIME_DEP']:
    leavetimes16th['PLANNEDTIME_DEP'] = leavetimes16th['PLANNEDTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leavetimes16th['ACTUALTIME_ARR']:
    leavetimes16th['ACTUALTIME_ARR'] = leavetimes16th['ACTUALTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leavetimes16th['ACTUALTIME_DEP']:
    leavetimes16th['ACTUALTIME_DEP'] = leavetimes16th['ACTUALTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

In [None]:
leavetimes16th

<br><br>
Already knew this but wanted to see for myself. For one day it only shows a specific route. Maybe there are other TRIPIDs that would illustrate earlier times that route 25A. Might investigate that.
<br><br>

In [None]:
vehicles16th = vehicles.loc[vehicles['DAYOFSERVICE']=='16-FEB-18 00:00:00']
vehicles16th.loc[vehicles16th['VEHICLEID']==1000499]

So, the DISTANCE is not suppressed here compared to the leavetimes table.

In [None]:
for row in vehicles16th['MINUTES']:
    vehicles16th['MINUTES'] = vehicles16th['MINUTES'].replace(row, str(datetime.timedelta(seconds=row)))

In [None]:
vehicles16th.loc[vehicles16th['VEHICLEID']==1000499]

<br><br>
So far, we know that some different TRIPIDs represent the same route with a lot of them being sequential. For instance we see above that route 77A has different TRIPIDs so I want to investigate the difference in leavetimes and trips dataset using the first instance of the 77A and the last. I want to see leave_times will provide different times.

We see the number of hours the vehicle ran for. But depending on what time the bus began running it's going to be difficult to see what time the service ended at.
<br><br>

In [None]:
# Selecting routes for 77A first and last for leave times and trips
leave_times_first = pd.read_sql('select * from chunk_sql where TRIPID = 5955225;', con)
leave_times_last = pd.read_sql('select * from chunk_sql where TRIPID = 5955232;', con)

In [None]:
trips_first = trips.loc[trips['TRIPID']==5955225]
trips_last = trips.loc[trips['TRIPID']==5955232]

In [None]:
leave_times_first = leave_times_first.sort_values(by='PROGRNUMBER')
leave_times_last = leave_times_last.sort_values(by='PROGRNUMBER')

In [None]:
# Converting minutes to datetime for leavetimes first, last and trips first, last
for row in leave_times_first['PLANNEDTIME_ARR']:
    leave_times_first['PLANNEDTIME_ARR'] = leave_times_first['PLANNEDTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leave_times_first['PLANNEDTIME_DEP']:
    leave_times_first['PLANNEDTIME_DEP'] = leave_times_first['PLANNEDTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leave_times_first['ACTUALTIME_ARR']:
    leave_times_first['ACTUALTIME_ARR'] = leave_times_first['ACTUALTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leave_times_first['ACTUALTIME_DEP']:
    leave_times_first['ACTUALTIME_DEP'] = leave_times_first['ACTUALTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

In [None]:
for row in leave_times_last['PLANNEDTIME_ARR']:
    leave_times_last['PLANNEDTIME_ARR'] = leave_times_last['PLANNEDTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leave_times_last['PLANNEDTIME_DEP']:
    leave_times_last['PLANNEDTIME_DEP'] = leave_times_last['PLANNEDTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leave_times_last['ACTUALTIME_ARR']:
    leave_times_last['ACTUALTIME_ARR'] = leave_times_last['ACTUALTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leave_times_last['ACTUALTIME_DEP']:
    leave_times_last['ACTUALTIME_DEP'] = leave_times_last['ACTUALTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

In [None]:
for row in trips_first['PLANNEDTIME_ARR']:
    trips_first['PLANNEDTIME_ARR'] = trips_first['PLANNEDTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in trips_first['PLANNEDTIME_DEP']:
    trips_first['PLANNEDTIME_DEP'] = trips_first['PLANNEDTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

for row in trips_first['ACTUALTIME_ARR']:
    trips_first['ACTUALTIME_ARR'] = trips_first['ACTUALTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in trips_first['ACTUALTIME_DEP']:
    trips_first['ACTUALTIME_DEP'] = trips_first['ACTUALTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

In [None]:
for row in trips_last['PLANNEDTIME_ARR']:
    trips_last['PLANNEDTIME_ARR'] = trips_last['PLANNEDTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in trips_last['PLANNEDTIME_DEP']:
    trips_last['PLANNEDTIME_DEP'] = trips_last['PLANNEDTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

for row in trips_last['ACTUALTIME_ARR']:
    trips_last['ACTUALTIME_ARR'] = trips_last['ACTUALTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in trips_last['ACTUALTIME_DEP']:
    trips_last['ACTUALTIME_DEP'] = trips_last['ACTUALTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

In [None]:
leave_times_first.head(40)

In [None]:
leave_times_last.head(40)

<br><br>
Some patterns I see here so far is that it is only showing times during rush hour (5pm). So the theory that the first instance 77A appeared would present a time for an early trip is not going to be shown here. As we can see, the times for both first and last instance of 77A is around the same time. It seems they want to capture only the rush hour times. 

Also noticed that the plannedtime_arr and plannedtime_dep are the same. So the buses plan to never stop at all? Might be worth discussing about.

To conclude, they only capture rush hour times which is ideal because then we can probably work with this data. Now, for data cleaning and preparation, it might be worth looking at the times during a holiday or something.
<br><br>

In [None]:
vehicles.loc[vehicles['VEHICLEID']==1001060	]

<br><br>
Going to see the rest of 77A routes times for leavetimes table.
<br><br>

In [None]:
# Making new df in order of PROGRNUMBER
leave_times_77a1 = pd.read_sql('select * from chunk_sql where TRIPID = 5955226;', con)
leave_times_77a1 = leave_times_77a1.loc[leave_times["DAYOFSERVICE"]=="03-JAN-18 00:00:00"]
leave_times_77a1 = leave_times_77a1.sort_values(by='PROGRNUMBER')

# Converting minutes to datetime for leavetimes first, last and trips first, last
for row in leave_times_77a1['PLANNEDTIME_ARR']:
    leave_times_77a1['PLANNEDTIME_ARR'] = leave_times_77a1['PLANNEDTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leave_times_77a1['PLANNEDTIME_DEP']:
    leave_times_77a1['PLANNEDTIME_DEP'] = leave_times_77a1['PLANNEDTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leave_times_77a1['ACTUALTIME_ARR']:
    leave_times_77a1['ACTUALTIME_ARR'] = leave_times_77a1['ACTUALTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leave_times_77a1['ACTUALTIME_DEP']:
    leave_times_77a1['ACTUALTIME_DEP'] = leave_times_77a1['ACTUALTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

In [None]:
# Making new df in order of PROGRNUMBER
leave_times_77a2 = pd.read_sql('select * from chunk_sql where TRIPID = 5955227;', con)
leave_times_77a2  = leave_times_77a2.loc[leave_times_77a2["DAYOFSERVICE"]=="03-JAN-18 00:00:00"]
leave_times_77a2  = leave_times_77a2.sort_values(by='PROGRNUMBER')

# Converting minutes to datetime for leavetimes first, last and trips first, last
for row in leave_times_77a2['PLANNEDTIME_ARR']:
    leave_times_77a2['PLANNEDTIME_ARR'] = leave_times_77a2['PLANNEDTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leave_times_77a2['PLANNEDTIME_DEP']:
    leave_times_77a2['PLANNEDTIME_DEP'] = leave_times_77a2['PLANNEDTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leave_times_77a2['ACTUALTIME_ARR']:
    leave_times_77a2['ACTUALTIME_ARR'] = leave_times_77a2['ACTUALTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leave_times_77a2['ACTUALTIME_DEP']:
    leave_times_77a2['ACTUALTIME_DEP'] = leave_times_77a2['ACTUALTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

In [None]:
# Making new df in order of PROGRNUMBER
leave_times_77a3 = pd.read_sql('select * from chunk_sql where TRIPID = 5955228;', con)
leave_times_77a3 = leave_times_77a3.loc[leave_times_77a3["DAYOFSERVICE"]=="03-JAN-18 00:00:00"]
leave_times_77a3 = leave_times_77a3.sort_values(by='PROGRNUMBER')

# Converting minutes to datetime for leavetimes first, last and trips first, last
for row in leave_times_77a3['PLANNEDTIME_ARR']:
    leave_times_77a3['PLANNEDTIME_ARR'] = leave_times_77a3['PLANNEDTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leave_times_77a3['PLANNEDTIME_DEP']:
    leave_times_77a3['PLANNEDTIME_DEP'] = leave_times_77a3['PLANNEDTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leave_times_77a3['ACTUALTIME_ARR']:
    leave_times_77a3['ACTUALTIME_ARR'] = leave_times_77a3['ACTUALTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in leave_times_77a3['ACTUALTIME_DEP']:
    leave_times_77a3['ACTUALTIME_DEP'] = leave_times_77a3['ACTUALTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

In [None]:
leave_times_77a1

In [None]:
leave_times_77a2

In [None]:
leave_times_77a3

<br><br>
Next question: **how do we find the trip in leavetimes table for every hour for the same route??**

Get rows for all of leavetimes trips according to a specific date (Jan 3, 2018).
<br><br>

In [None]:
leave_times_77 = pd.read_sql('select * from chunk_sql where DAYOFSERVICE = "03-JAN-18 00:00:00";', con)

In [None]:
leave_times_77.shape

In [None]:
stopid395 = leave_times_77.loc[leave_times_77['STOPPOINTID']==395]
stopid395 = stopid395.loc[stopid395['PROGRNUMBER']==1]

In [None]:
for row in stopid395['PLANNEDTIME_ARR']:
    stopid395['PLANNEDTIME_ARR'] = stopid395['PLANNEDTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in stopid395['PLANNEDTIME_DEP']:
    stopid395['PLANNEDTIME_DEP'] = stopid395['PLANNEDTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

for row in stopid395['ACTUALTIME_ARR']:
    stopid395['ACTUALTIME_ARR'] = stopid395['ACTUALTIME_ARR'].replace(row, str(datetime.timedelta(seconds=row)))

for row in stopid395['ACTUALTIME_DEP']:
    stopid395['ACTUALTIME_DEP'] = stopid395['ACTUALTIME_DEP'].replace(row, str(datetime.timedelta(seconds=row)))

In [None]:
stopid395 = stopid395.sort_values(by='PLANNEDTIME_ARR')

In [None]:
stopid395.tail(50)

<br><br>
Success!
<br><br>