### Goals of this project:

How much impact does being late or too spaced out at the first stop have downstream?

What is the impact of the layover at the start of the trip (the difference between the first stop arrival and departure time)?

Does more layover lead to more stable headways (lower values for % headway deviation)?

How closely does lateness (ADHERENCE) correlate with headway?

What is the relationship between distance or time travelled since the start of a given trip and the headway deviation? Does headway become less stable the further along the route the bus has travelled?

How much of a factor is the driver in headway and on-time performance? The driver is indicated by the OPERATOR variable.

How does direction of travel, route, or location affect headway and on-time performance?

How does time of day or day of week affect headway and on-time performance? Can you detect an impact of school schedule on headway deviation (for certain routes and at certain times of day)?


Does weather have any effect on headway or on-time performance? To help answer this question, the file bna_2022.csv contains historical weather data recorded at Nashville International Airport.
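The layover named in the goals above (departure time minus arrival time at the first stop) could be computed roughly as follows. This is a minimal sketch on made-up rows, not the project's method. It assumes the time columns parse with `pd.to_datetime` and that `TRIP_EDGE == 1` marks the first stop of a trip; neither is confirmed by the data shown here. `LAYOVER_MINS` is a hypothetical column name.

```python
import pandas as pd

# Made-up rows; the real timestamp format in the headway data may differ (assumption).
toy = pd.DataFrame({
    "TRIP_ID": [1, 1, 2, 2],
    "TRIP_EDGE": [1, 0, 1, 0],  # assumption: 1 marks the first stop of a trip
    "ACTUAL_ARRIVAL_TIME": ["2021-11-01 06:00:00", "2021-11-01 06:20:00",
                            "2021-11-01 07:00:00", "2021-11-01 07:25:00"],
    "ACTUAL_DEPARTURE_TIME": ["2021-11-01 06:12:00", "2021-11-01 06:21:00",
                              "2021-11-01 07:03:00", "2021-11-01 07:25:00"],
})

arrive = pd.to_datetime(toy["ACTUAL_ARRIVAL_TIME"])
depart = pd.to_datetime(toy["ACTUAL_DEPARTURE_TIME"])

# Time spent at each stop, in minutes (hypothetical column name)
toy["LAYOVER_MINS"] = (depart - arrive).dt.total_seconds() / 60

# Keep only the first stop of each trip
first_stop_layover = toy.loc[toy["TRIP_EDGE"] == 1].set_index("TRIP_ID")["LAYOVER_MINS"]
print(first_stop_layover)
```

The dataset's own `DWELL_IN_MINS` column may already hold this quantity; that has not been verified here.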

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [8]:
%matplotlib inline

In [13]:
headway = pd.read_csv('../data/Headway Data.csv')

In [15]:
headway.head()

Unnamed: 0,CALENDAR_ID,SERVICE_ABBR,ADHERENCE_ID,DATE,ROUTE_ABBR,BLOCK_ABBR,OPERATOR,TRIP_ID,OVERLOAD_ID,ROUTE_DIRECTION_NAME,...,ACTUAL_HDWY,HDWY_DEV,ADJUSTED_EARLY_COUNT,ADJUSTED_LATE_COUNT,ADJUSTED_ONTIME_COUNT,STOP_CANCELLED,PREV_SCHED_STOP_CANCELLED,IS_RELIEF,BLOCK_STOP_ORDER,DWELL_IN_MINS
0,120211101,1,76447164,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,15.983333,1.983333,0,0,1,0,0.0,0,2,12.65
1,120211101,1,76447165,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,17.333333,3.333333,0,0,1,0,0.0,0,11,0.0
2,120211101,1,76447166,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,,,0,1,0,0,,0,23,22.416666
3,120211101,1,76447167,2021-11-01,50,1704,2088,297749,0,TO DOWNTOWN,...,,,0,0,1,0,0.0,0,24,5.766666
4,120211101,1,76447168,2021-11-01,50,1704,2088,297749,0,TO DOWNTOWN,...,,,0,0,1,0,,0,25,0.0


In [18]:
headway.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1981715 entries, 0 to 1981714
Data columns (total 30 columns):
 #   Column                     Dtype  
---  ------                     -----  
 0   CALENDAR_ID                int64  
 1   SERVICE_ABBR               int64  
 2   ADHERENCE_ID               int64  
 3   DATE                       object 
 4   ROUTE_ABBR                 int64  
 5   BLOCK_ABBR                 int64  
 6   OPERATOR                   int64  
 7   TRIP_ID                    int64  
 8   OVERLOAD_ID                int64  
 9   ROUTE_DIRECTION_NAME       object 
 10  TIME_POINT_ABBR            object 
 11  ROUTE_STOP_SEQUENCE        float64
 12  TRIP_EDGE                  int64  
 13  LATITUDE                   float64
 14  LONGITUDE                  float64
 15  SCHEDULED_TIME             object 
 16  ACTUAL_ARRIVAL_TIME        object 
 17  ACTUAL_DEPARTURE_TIME      object 
 18  ADHERENCE                  float64
 19  SCHEDULED_HDWY             float64
 20  AC

In [23]:
headway.isnull().sum()

CALENDAR_ID                       0
SERVICE_ABBR                      0
ADHERENCE_ID                      0
DATE                              0
ROUTE_ABBR                        0
BLOCK_ABBR                        0
OPERATOR                          0
TRIP_ID                           0
OVERLOAD_ID                       0
ROUTE_DIRECTION_NAME              0
TIME_POINT_ABBR                   0
ROUTE_STOP_SEQUENCE              41
TRIP_EDGE                         0
LATITUDE                          0
LONGITUDE                         0
SCHEDULED_TIME                    0
ACTUAL_ARRIVAL_TIME           54383
ACTUAL_DEPARTURE_TIME         54430
ADHERENCE                     54430
SCHEDULED_HDWY               432294
ACTUAL_HDWY                  473824
HDWY_DEV                     474368
ADJUSTED_EARLY_COUNT              0
ADJUSTED_LATE_COUNT               0
ADJUSTED_ONTIME_COUNT             0
STOP_CANCELLED                    0
PREV_SCHED_STOP_CANCELLED    402647
IS_RELIEF                   
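To put the null counts in proportion, they can be divided by the total row count. The sketch below only reuses figures printed above (1,981,715 rows from `headway.info()` and four of the counts from `headway.isnull().sum()`); on the real frame, `headway.isnull().mean() * 100` should give the same percentages for every column.

```python
import pandas as pd

total_rows = 1981715  # from the headway.info() output

# Null counts copied from the headway.isnull().sum() output
null_counts = pd.Series({
    "ADHERENCE": 54430,
    "SCHEDULED_HDWY": 432294,
    "ACTUAL_HDWY": 473824,
    "HDWY_DEV": 474368,
})

# Share of rows with a missing value, as a percentage
null_pct = (null_counts / total_rows * 100).round(2)
print(null_pct)
```

Roughly a quarter of the rows have no `HDWY_DEV`, which matters for any analysis that drops missing values.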

In [16]:
bna_weather = pd.read_csv('../data/bna_weather.csv')

In [None]:
bna_weather.head()
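Since the weather file's columns are not shown here, the following is only a sketch of how daily weather might be attached to the headway rows. The `DATE` and `PRCP` columns in the weather frame are hypothetical names, and the rows are made up.

```python
import pandas as pd

# Made-up headway rows using column names from the headway data
headway_toy = pd.DataFrame({
    "DATE": ["2021-11-01", "2021-11-01", "2021-11-02"],
    "HDWY_DEV": [1.98, -0.37, 3.33],
})

# Hypothetical weather layout: one row per day
weather_toy = pd.DataFrame({
    "DATE": ["2021-11-01", "2021-11-02"],  # hypothetical column name
    "PRCP": [0.0, 0.4],                    # hypothetical precipitation column
})

# Parse both date columns so the join keys have the same dtype
for frame in (headway_toy, weather_toy):
    frame["DATE"] = pd.to_datetime(frame["DATE"])

# Left join keeps every headway row; validate raises if a date repeats in the weather data
merged = headway_toy.merge(weather_toy, on="DATE", how="left", validate="many_to_one")
print(merged)
```

A left join is used so that headway rows without a matching weather record are kept with missing weather values instead of being dropped.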

In [34]:
# Headway deviation as a percentage of the scheduled headway:
# HDWY_DEV / SCHEDULED_HDWY * 100 (NaN wherever either value is missing)
headway['Deviation_Percentage'] = headway['HDWY_DEV'] / headway['SCHEDULED_HDWY'] * 100
headway

Unnamed: 0,CALENDAR_ID,SERVICE_ABBR,ADHERENCE_ID,DATE,ROUTE_ABBR,BLOCK_ABBR,OPERATOR,TRIP_ID,OVERLOAD_ID,ROUTE_DIRECTION_NAME,...,HDWY_DEV,ADJUSTED_EARLY_COUNT,ADJUSTED_LATE_COUNT,ADJUSTED_ONTIME_COUNT,STOP_CANCELLED,PREV_SCHED_STOP_CANCELLED,IS_RELIEF,BLOCK_STOP_ORDER,DWELL_IN_MINS,Deviation_Percentage
0,120211101,1,76447164,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,1.983333,0,0,1,0,0.0,0,2,12.650000,14.166664
1,120211101,1,76447165,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,3.333333,0,0,1,0,0.0,0,11,0.000000,23.809521
2,120211101,1,76447166,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,,0,1,0,0,,0,23,22.416666,
3,120211101,1,76447167,2021-11-01,50,1704,2088,297749,0,TO DOWNTOWN,...,,0,0,1,0,0.0,0,24,5.766666,
4,120211101,1,76447168,2021-11-01,50,1704,2088,297749,0,TO DOWNTOWN,...,,0,0,1,0,,0,25,0.000000,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1981710,120221101,1,91057724,2022-11-01,50,9302,2355,329980,0,TO DOWNTOWN,...,,0,0,1,0,,0,3,1.016666,
1981711,120221101,1,91057851,2022-11-01,50,9950,1880,330003,0,TO DOWNTOWN,...,-2.366667,1,0,0,0,0.0,0,2,9.900000,-23.666670
1981712,120221101,1,91057852,2022-11-01,50,9950,1880,330003,0,TO DOWNTOWN,...,,1,0,0,0,,0,3,0.000000,
1981713,120221101,1,91057881,2022-11-01,56,9975,1922,330011,0,TO DOWNTOWN,...,,0,0,1,0,0.0,0,2,19.366666,
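The percentage formula can be checked on a few values taken from the output above. The sketch also guards against a scheduled headway of zero, which would otherwise produce `inf`. Whether zeros actually occur in `SCHEDULED_HDWY` is not known from the output shown, so the guard is a precaution and the zero row below is made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "HDWY_DEV": [1.983333, -2.366667, np.nan, 2.0],
    "SCHEDULED_HDWY": [14.0, 10.0, np.nan, 0.0],  # last row is a made-up zero
})

# Treat a zero scheduled headway as missing so the ratio is NaN rather than inf
scheduled = toy["SCHEDULED_HDWY"].replace(0, np.nan)
toy["Deviation_Percentage"] = toy["HDWY_DEV"] / scheduled * 100
print(toy)
```

The first two rows reproduce the values 14.166664 and -23.666670 shown in the table, up to rounding.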


In [41]:
# Rows with a non-negative headway deviation (HDWY_DEV >= 0, zero included),
# i.e. the actual headway was at least as long as the scheduled headway
headway1 = headway[['DATE', 'ROUTE_ABBR', 'BLOCK_ABBR', 'OPERATOR', 'TRIP_ID', 'ROUTE_DIRECTION_NAME', 'TRIP_EDGE', 'HDWY_DEV']]
headway1 = headway1.loc[headway1['HDWY_DEV'] >= 0]
headway1

Unnamed: 0,DATE,ROUTE_ABBR,BLOCK_ABBR,OPERATOR,TRIP_ID,ROUTE_DIRECTION_NAME,TRIP_EDGE,HDWY_DEV
0,2021-11-01,7,1704,2088,297750,TO DOWNTOWN,1,1.983333
1,2021-11-01,7,1704,2088,297750,TO DOWNTOWN,0,3.333333
13,2021-11-01,22,2200,1352,298128,TO DOWNTOWN,0,0.866666
16,2021-11-01,22,2200,1352,298151,FROM DOWNTOWN,1,1.116666
17,2021-11-01,22,2200,1352,298151,FROM DOWNTOWN,0,0.400000
...,...,...,...,...,...,...,...,...
1981686,2022-11-01,7,705,1066,329447,FROM DOWNTOWN,1,2.000000
1981687,2022-11-01,7,705,1066,329447,FROM DOWNTOWN,0,2.233333
1981689,2022-11-01,7,706,2597,329460,TO DOWNTOWN,1,1.916666
1981690,2022-11-01,7,706,2597,329460,TO DOWNTOWN,0,1.650000


In [43]:
# Rows with a negative headway deviation (HDWY_DEV < 0),
# i.e. the actual headway was shorter than the scheduled headway
headway2 = headway[['DATE', 'ROUTE_ABBR', 'BLOCK_ABBR', 'OPERATOR', 'TRIP_ID', 'ROUTE_DIRECTION_NAME', 'TRIP_EDGE', 'HDWY_DEV']]
headway2 = headway2.loc[headway2['HDWY_DEV'] < 0]
headway2

Unnamed: 0,DATE,ROUTE_ABBR,BLOCK_ABBR,OPERATOR,TRIP_ID,ROUTE_DIRECTION_NAME,TRIP_EDGE,HDWY_DEV
12,2021-11-01,22,2200,1352,298128,TO DOWNTOWN,1,-0.366667
14,2021-11-01,22,2200,1352,298128,TO DOWNTOWN,0,-1.016667
24,2021-11-01,22,2200,1352,298152,FROM DOWNTOWN,0,-3.666667
27,2021-11-01,22,2200,1352,298130,TO DOWNTOWN,1,-1.466667
34,2021-11-01,22,2200,1352,298131,TO DOWNTOWN,0,-0.083334
...,...,...,...,...,...,...,...,...
1981695,2022-11-01,7,706,375,329461,TO DOWNTOWN,1,-1.150000
1981696,2022-11-01,7,706,375,329461,TO DOWNTOWN,0,-2.500000
1981698,2022-11-01,7,706,375,329459,FROM DOWNTOWN,1,-1.883334
1981699,2022-11-01,7,706,375,329459,FROM DOWNTOWN,0,-6.516667
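As an alternative to keeping two separate frames, the sign of `HDWY_DEV` could be stored as a label on a single frame. This is a sketch on made-up values; the `DEV_SIGN` column and its labels are hypothetical names. Rows with a missing `HDWY_DEV` stay unlabeled, just as they are excluded from both filtered frames above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"HDWY_DEV": [1.983333, -0.366667, 0.0, np.nan]})

# np.sign returns 1.0, -1.0, 0.0 or NaN; map those to hypothetical labels
labels = {1.0: "positive", -1.0: "negative", 0.0: "zero"}
toy["DEV_SIGN"] = np.sign(toy["HDWY_DEV"]).map(labels)

# Count rows per label, including rows with no deviation recorded
counts = toy["DEV_SIGN"].value_counts(dropna=False)
print(counts)
```

A label column makes it possible to group or plot by sign without re-filtering the nearly two million rows each time.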


### Question 1: 
How much impact does being late or too spaced out at the first stop have downstream?
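One possible way to approach this question is to compare, for each trip, the absolute headway deviation at the first stop with the average absolute deviation at the remaining stops. The sketch below uses made-up rows and is not the project's analysis. It assumes `TRIP_EDGE == 1` marks the first stop of a trip and that a trip is identified by `DATE` plus `TRIP_ID`. `ABS_DEV`, `FIRST_ABS_DEV` and `DOWNSTREAM_ABS_DEV` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "DATE":      ["2021-11-01"] * 9,
    "TRIP_ID":   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "TRIP_EDGE": [1, 0, 0, 1, 0, 0, 1, 0, 0],  # assumption: 1 marks the first stop
    "HDWY_DEV":  [0.5, 1.0, 1.5, 3.0, 4.0, 5.0, -2.0, -1.0, -3.0],
})

keys = ["DATE", "TRIP_ID"]  # assumption: together these identify one trip
toy["ABS_DEV"] = toy["HDWY_DEV"].abs()
is_first = toy["TRIP_EDGE"] == 1

# Absolute deviation at the first stop of each trip
first = toy[is_first].groupby(keys)["ABS_DEV"].first().rename("FIRST_ABS_DEV")

# Mean absolute deviation over the remaining stops of each trip
downstream = toy[~is_first].groupby(keys)["ABS_DEV"].mean().rename("DOWNSTREAM_ABS_DEV")

# One row per trip, dropping trips that lack either value
per_trip = pd.concat([first, downstream], axis=1).dropna().reset_index()

# Pearson correlation between first-stop and downstream deviation
corr = per_trip["FIRST_ABS_DEV"].corr(per_trip["DOWNSTREAM_ABS_DEV"])
print(per_trip)
print(corr)
```

A correlation only describes association. It does not show that the first-stop deviation causes the downstream deviation, and the same comparison could be repeated with `ADHERENCE` for lateness.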