# WeGo Public Transit
[WeGo Public Transit](https://www.wegotransit.com/) is a public transit system serving the Greater Nashville and Davidson County area. WeGo provides local and regional bus routes, the WeGo Star train service connecting Lebanon to downtown Nashville, along with several other transit services.

In this project, you'll analyze bus spacing to look for patterns and identify correlations with controllable or external factors. Specifically, you'll use a dataset containing information on headway, the amount of time between vehicle arrivals at a stop. The dataset contains a column, HDWY_DEV, which holds the headway deviation. This value is negative when bunching has occurred (shorter headway than scheduled) and positive for gapping (longer headway than scheduled). Note that you can calculate headway deviation percentage as HDWY_DEV/SCHEDULED_HDWY, which expresses the deviation as a fraction of the scheduled headway.
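
As a quick illustration of the sign convention, the sketch below computes the deviation fraction on a small made-up table and labels each row. Only `SCHEDULED_HDWY` and `HDWY_DEV` are dataset column names; the values and the `HDWY_DEV_PCT` and `SPACING` columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: invented values, only the first two column names come from the dataset
toy = pd.DataFrame({
    'SCHEDULED_HDWY': [10.0, 10.0, 15.0],
    'HDWY_DEV': [-4.0, 0.0, 6.0],
})

# Headway deviation as a fraction of the scheduled headway
toy['HDWY_DEV_PCT'] = toy['HDWY_DEV'] / toy['SCHEDULED_HDWY']

# Negative deviation = bunching, positive = gapping, zero = on schedule
toy['SPACING'] = np.select(
    [toy['HDWY_DEV'] < 0, toy['HDWY_DEV'] > 0],
    ['bunching', 'gapping'],
    default='on schedule',
)
print(toy)
```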

Goals of this project:
1. How much impact does being late or too spaced out at the first stop have downstream?

2. What is the impact of the layover at the start of the trip (the difference between the first stop arrival and departure time)? Does more layover lead to more stable headways (lower values for % headway deviation)?

3. How closely does lateness (ADHERENCE) correlate to headway?

4. What is the relationship between distance or time travelled since the start of a given trip and the headway deviation? Does headway become less stable the further along the route the bus has travelled?

5. How much of a factor does the driver have on headway and on-time performance? The driver is indicated by the OPERATOR variable.

6. How does direction of travel, route, or location affect the headway and on-time performance?

7. How does time of day or day of week affect headway and on-time performance? Can you detect an impact of school schedule on headway deviation (for certain routes and at certain times of day)?

8. Does weather have any effect on headway or on-time performance? To help answer this question, the file bna_2022.csv contains historical weather data recorded at Nashville International Airport. 
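
For question 2, the layover can be sketched as departure time minus arrival time at a stop. The example below assumes the two time columns hold `HH:MM:SS` strings on the same service day (trips crossing midnight are not handled), uses illustrative values, and introduces a hypothetical `LAYOVER_MINS` column. The dataset also has a `DWELL_IN_MINS` column that may already carry this quantity, which would be worth checking before recomputing it.

```python
import pandas as pd

# Toy rows shaped like the dataset's time columns (illustrative values)
stops = pd.DataFrame({
    'ACTUAL_ARRIVAL_TIME': ['13:59:21', '14:23:21'],
    'ACTUAL_DEPARTURE_TIME': ['14:12:00', '14:23:21'],
})

# HH:MM:SS strings parse directly into timedeltas
arrive = pd.to_timedelta(stops['ACTUAL_ARRIVAL_TIME'])
depart = pd.to_timedelta(stops['ACTUAL_DEPARTURE_TIME'])

# LAYOVER_MINS is a hypothetical column name, not part of the dataset
stops['LAYOVER_MINS'] = (depart - arrive).dt.total_seconds() / 60
print(stops)
```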

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
wego = pd.read_csv('../data/Headway_Data.csv')
wego.head()

Unnamed: 0,CALENDAR_ID,SERVICE_ABBR,ADHERENCE_ID,DATE,ROUTE_ABBR,BLOCK_ABBR,OPERATOR,TRIP_ID,OVERLOAD_ID,ROUTE_DIRECTION_NAME,...,ACTUAL_HDWY,HDWY_DEV,ADJUSTED_EARLY_COUNT,ADJUSTED_LATE_COUNT,ADJUSTED_ONTIME_COUNT,STOP_CANCELLED,PREV_SCHED_STOP_CANCELLED,IS_RELIEF,BLOCK_STOP_ORDER,DWELL_IN_MINS
0,120211101,1,76447164,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,15.983333,1.983333,0,0,1,0,0.0,0,2,12.65
1,120211101,1,76447165,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,17.333333,3.333333,0,0,1,0,0.0,0,11,0.0
2,120211101,1,76447166,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,,,0,1,0,0,,0,23,22.416666
3,120211101,1,76447167,2021-11-01,50,1704,2088,297749,0,TO DOWNTOWN,...,,,0,0,1,0,0.0,0,24,5.766666
4,120211101,1,76447168,2021-11-01,50,1704,2088,297749,0,TO DOWNTOWN,...,,,0,0,1,0,,0,25,0.0


In [None]:
wego_weather = pd.read_csv('../data/bna_weather.csv')
wego_weather.head()
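
To relate weather to headway, the daily weather rows need to be joined onto the bus records by date. The weather file's columns are not shown in this notebook, so the sketch below uses toy frames: the weather `DATE` key and the `PRCP` column are hypothetical names, and the values are invented.

```python
import pandas as pd

# Toy stand-ins; 'PRCP' and the weather 'DATE' column are hypothetical
trips = pd.DataFrame({
    'DATE': ['2021-11-01', '2021-11-01', '2021-11-02'],
    'HDWY_DEV': [1.98, 3.33, -0.5],
})
weather = pd.DataFrame({
    'DATE': ['2021-11-01', '2021-11-02'],
    'PRCP': [0.0, 0.4],
})

# Parse both keys to datetimes so the join does not depend on string format
trips['DATE'] = pd.to_datetime(trips['DATE'])
weather['DATE'] = pd.to_datetime(weather['DATE'])

# Left join keeps every bus record; validate raises if a date repeats in weather
merged = trips.merge(weather, on='DATE', how='left', validate='many_to_one')
print(merged)
```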

In [3]:
wego['DEV_PERC'] = wego['HDWY_DEV'] / wego['SCHEDULED_HDWY']
wego.head()

Unnamed: 0,CALENDAR_ID,SERVICE_ABBR,ADHERENCE_ID,DATE,ROUTE_ABBR,BLOCK_ABBR,OPERATOR,TRIP_ID,OVERLOAD_ID,ROUTE_DIRECTION_NAME,...,HDWY_DEV,ADJUSTED_EARLY_COUNT,ADJUSTED_LATE_COUNT,ADJUSTED_ONTIME_COUNT,STOP_CANCELLED,PREV_SCHED_STOP_CANCELLED,IS_RELIEF,BLOCK_STOP_ORDER,DWELL_IN_MINS,DEV_PERC
0,120211101,1,76447164,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,1.983333,0,0,1,0,0.0,0,2,12.65,0.141667
1,120211101,1,76447165,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,3.333333,0,0,1,0,0.0,0,11,0.0,0.238095
2,120211101,1,76447166,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,,0,1,0,0,,0,23,22.416666,
3,120211101,1,76447167,2021-11-01,50,1704,2088,297749,0,TO DOWNTOWN,...,,0,0,1,0,0.0,0,24,5.766666,
4,120211101,1,76447168,2021-11-01,50,1704,2088,297749,0,TO DOWNTOWN,...,,0,0,1,0,,0,25,0.0,


In [4]:
wego.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1981715 entries, 0 to 1981714
Data columns (total 31 columns):
 #   Column                     Dtype  
---  ------                     -----  
 0   CALENDAR_ID                int64  
 1   SERVICE_ABBR               int64  
 2   ADHERENCE_ID               int64  
 3   DATE                       object 
 4   ROUTE_ABBR                 int64  
 5   BLOCK_ABBR                 int64  
 6   OPERATOR                   int64  
 7   TRIP_ID                    int64  
 8   OVERLOAD_ID                int64  
 9   ROUTE_DIRECTION_NAME       object 
 10  TIME_POINT_ABBR            object 
 11  ROUTE_STOP_SEQUENCE        float64
 12  TRIP_EDGE                  int64  
 13  LATITUDE                   float64
 14  LONGITUDE                  float64
 15  SCHEDULED_TIME             object 
 16  ACTUAL_ARRIVAL_TIME        object 
 17  ACTUAL_DEPARTURE_TIME      object 
 18  ADHERENCE                  float64
 19  SCHEDULED_HDWY             float64
 20  AC

In [5]:
wego.isna().sum()

CALENDAR_ID                       0
SERVICE_ABBR                      0
ADHERENCE_ID                      0
DATE                              0
ROUTE_ABBR                        0
BLOCK_ABBR                        0
OPERATOR                          0
TRIP_ID                           0
OVERLOAD_ID                       0
ROUTE_DIRECTION_NAME              0
TIME_POINT_ABBR                   0
ROUTE_STOP_SEQUENCE              41
TRIP_EDGE                         0
LATITUDE                          0
LONGITUDE                         0
SCHEDULED_TIME                    0
ACTUAL_ARRIVAL_TIME           54383
ACTUAL_DEPARTURE_TIME         54430
ADHERENCE                     54430
SCHEDULED_HDWY               432294
ACTUAL_HDWY                  473824
HDWY_DEV                     474368
ADJUSTED_EARLY_COUNT              0
ADJUSTED_LATE_COUNT               0
ADJUSTED_ONTIME_COUNT             0
STOP_CANCELLED                    0
PREV_SCHED_STOP_CANCELLED    402647
IS_RELIEF                   

In [7]:
wego['ROUTE_ABBR'].value_counts()

55    362259
56    355098
52    291150
23    250787
50    249880
3     216446
22    153074
7     103021
Name: ROUTE_ABBR, dtype: int64

In [8]:
wego['BLOCK_ABBR'].value_counts()

2312    50892
2310    49518
300     45464
5004    45164
2311    44900
        ...  
5090       14
5695        8
5694        8
5691        2
5692        2
Name: BLOCK_ABBR, Length: 113, dtype: int64

In [9]:
wego['TRIP_ID'].value_counts()

302998    715
302923    712
305058    704
303006    679
303075    678
         ... 
298941      5
298340      3
298341      3
314754      2
314755      2
Name: TRIP_ID, Length: 18495, dtype: int64

In [10]:
wego['TIME_POINT_ABBR'].value_counts()

GXRVRGAT    88506
DWMRT       83826
WHBG        67522
MXTHOMP     60462
MXWHARF     60447
            ...  
CRES          170
HLHS          137
MCC4_13       131
MCC5_3         85
MCC4_21         5
Name: TIME_POINT_ABBR, Length: 65, dtype: int64

In [23]:
wego_tidy = wego.iloc[:, np.r_[0:22, 30]]
wego_tidy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1981715 entries, 0 to 1981714
Data columns (total 23 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   CALENDAR_ID            int64  
 1   SERVICE_ABBR           int64  
 2   ADHERENCE_ID           int64  
 3   DATE                   object 
 4   ROUTE_ABBR             int64  
 5   BLOCK_ABBR             int64  
 6   OPERATOR               int64  
 7   TRIP_ID                int64  
 8   OVERLOAD_ID            int64  
 9   ROUTE_DIRECTION_NAME   object 
 10  TIME_POINT_ABBR        object 
 11  ROUTE_STOP_SEQUENCE    float64
 12  TRIP_EDGE              int64  
 13  LATITUDE               float64
 14  LONGITUDE              float64
 15  SCHEDULED_TIME         object 
 16  ACTUAL_ARRIVAL_TIME    object 
 17  ACTUAL_DEPARTURE_TIME  object 
 18  ADHERENCE              float64
 19  SCHEDULED_HDWY         float64
 20  ACTUAL_HDWY            float64
 21  HDWY_DEV               float64
 22  DEV_PERC          
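
Selecting columns by position with `np.r_[0:22, 30]` works, but it silently picks different columns if the file's column order ever changes. A label-based alternative is sketched below on a toy frame that carries only a few of the dataset's column names, with invented values.

```python
import pandas as pd

# Toy frame with a handful of the dataset's column names (values invented)
df = pd.DataFrame({
    'CALENDAR_ID': [1], 'HDWY_DEV': [1.98], 'STOP_CANCELLED': [0],
    'DWELL_IN_MINS': [12.65], 'DEV_PERC': [0.14],
})

# Label slices are inclusive: take everything through HDWY_DEV,
# then append DEV_PERC by name
keep = list(df.loc[:, 'CALENDAR_ID':'HDWY_DEV'].columns) + ['DEV_PERC']

# .copy() gives an independent frame for later column assignments
tidy = df[keep].copy()
print(tidy.columns.tolist())
```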

In [24]:
wego_tidy = wego_tidy.rename(columns = {'DEV_PERC':'HDWY_DEV_PERC'})
wego_tidy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1981715 entries, 0 to 1981714
Data columns (total 23 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   CALENDAR_ID            int64  
 1   SERVICE_ABBR           int64  
 2   ADHERENCE_ID           int64  
 3   DATE                   object 
 4   ROUTE_ABBR             int64  
 5   BLOCK_ABBR             int64  
 6   OPERATOR               int64  
 7   TRIP_ID                int64  
 8   OVERLOAD_ID            int64  
 9   ROUTE_DIRECTION_NAME   object 
 10  TIME_POINT_ABBR        object 
 11  ROUTE_STOP_SEQUENCE    float64
 12  TRIP_EDGE              int64  
 13  LATITUDE               float64
 14  LONGITUDE              float64
 15  SCHEDULED_TIME         object 
 16  ACTUAL_ARRIVAL_TIME    object 
 17  ACTUAL_DEPARTURE_TIME  object 
 18  ADHERENCE              float64
 19  SCHEDULED_HDWY         float64
 20  ACTUAL_HDWY            float64
 21  HDWY_DEV               float64
 22  HDWY_DEV_PERC     

In [25]:
wego_tidy.head()

Unnamed: 0,CALENDAR_ID,SERVICE_ABBR,ADHERENCE_ID,DATE,ROUTE_ABBR,BLOCK_ABBR,OPERATOR,TRIP_ID,OVERLOAD_ID,ROUTE_DIRECTION_NAME,...,LATITUDE,LONGITUDE,SCHEDULED_TIME,ACTUAL_ARRIVAL_TIME,ACTUAL_DEPARTURE_TIME,ADHERENCE,SCHEDULED_HDWY,ACTUAL_HDWY,HDWY_DEV,HDWY_DEV_PERC
0,120211101,1,76447164,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,36.107575,-86.812719,14:10:00,13:59:21,14:12:00,-2.0,14.0,15.983333,1.983333,0.141667
1,120211101,1,76447165,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,36.138372,-86.800622,14:20:00,14:23:21,14:23:21,-3.35,14.0,17.333333,3.333333,0.238095
2,120211101,1,76447166,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,36.167091,-86.781923,14:39:00,14:36:46,14:59:11,-20.183333,,,,
3,120211101,1,76447167,2021-11-01,50,1704,2088,297749,0,TO DOWNTOWN,...,36.161008,-86.800851,15:10:00,15:04:31,15:10:17,-0.283333,,,,
4,120211101,1,76447168,2021-11-01,50,1704,2088,297749,0,TO DOWNTOWN,...,36.167091,-86.781923,15:27:00,15:16:59,15:16:59,10.016666,,,,


In [26]:
wego['ROUTE_STOP_SEQUENCE'].value_counts()

5.0     278419
4.0     214527
8.0     181918
3.0     177798
2.0     157335
10.0    154766
6.0     153988
9.0     141764
7.0     102499
11.0     99828
12.0     73821
13.0     63406
14.0     52909
15.0     46339
1.0      32748
17.0     24805
16.0     24804
Name: ROUTE_STOP_SEQUENCE, dtype: int64

In [27]:
wego_tidy.groupby('ROUTE_STOP_SEQUENCE', as_index=False)['ROUTE_ABBR'].value_counts()

Unnamed: 0,ROUTE_STOP_SEQUENCE,ROUTE_ABBR,count
0,1.0,22,29045
1,1.0,23,2445
2,1.0,55,245
3,1.0,50,217
4,1.0,52,213
...,...,...,...
84,14.0,22,11592
85,15.0,50,39395
86,15.0,22,6944
87,16.0,50,24804


In [28]:
wego_tidy.groupby('ROUTE_ABBR', as_index=False)['HDWY_DEV_PERC'].mean()

Unnamed: 0,ROUTE_ABBR,HDWY_DEV_PERC
0,3,0.029075
1,7,0.022715
2,22,0.030718
3,23,0.023105
4,50,
5,52,0.022999
6,55,
7,56,
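
Some group means above are missing even though `mean()` skips NaN by default. One possible cause, not confirmed here, is a scheduled headway of zero: the division then yields `inf` or `-inf`, and a group containing such values no longer has a finite mean. The sketch below reproduces that on invented data and masks infinite values before averaging.

```python
import numpy as np
import pandas as pd

# Toy data (invented): route 50 has two rows with a zero scheduled headway
toy = pd.DataFrame({
    'ROUTE_ABBR': [50, 50, 50, 52, 52],
    'HDWY_DEV': [2.0, -3.0, 1.0, 1.0, 2.0],
    'SCHEDULED_HDWY': [0.0, 0.0, 10.0, 10.0, 20.0],
})
toy['HDWY_DEV_PERC'] = toy['HDWY_DEV'] / toy['SCHEDULED_HDWY']

# Division by zero leaves inf/-inf, so route 50's mean is not finite
raw_means = toy.groupby('ROUTE_ABBR')['HDWY_DEV_PERC'].mean()

# Treat infinite ratios as missing, then average again
toy['HDWY_DEV_PERC'] = toy['HDWY_DEV_PERC'].replace([np.inf, -np.inf], np.nan)
clean_means = toy.groupby('ROUTE_ABBR')['HDWY_DEV_PERC'].mean()
print(raw_means, clean_means, sep='\n')
```

Whether to drop such rows or handle them separately depends on what a zero scheduled headway means in the source data.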


In [29]:
wego_tidy.groupby('ROUTE_STOP_SEQUENCE', as_index=False)['HDWY_DEV_PERC'].mean()

Unnamed: 0,ROUTE_STOP_SEQUENCE,HDWY_DEV_PERC
0,1.0,0.013355
1,2.0,0.015778
2,3.0,0.027638
3,4.0,0.021843
4,5.0,0.020038
5,6.0,
6,7.0,
7,8.0,0.014616
8,9.0,0.025739
9,10.0,


In [30]:
wego_tidy.isna().sum()

CALENDAR_ID                   0
SERVICE_ABBR                  0
ADHERENCE_ID                  0
DATE                          0
ROUTE_ABBR                    0
BLOCK_ABBR                    0
OPERATOR                      0
TRIP_ID                       0
OVERLOAD_ID                   0
ROUTE_DIRECTION_NAME          0
TIME_POINT_ABBR               0
ROUTE_STOP_SEQUENCE          41
TRIP_EDGE                     0
LATITUDE                      0
LONGITUDE                     0
SCHEDULED_TIME                0
ACTUAL_ARRIVAL_TIME       54383
ACTUAL_DEPARTURE_TIME     54430
ADHERENCE                 54430
SCHEDULED_HDWY           432294
ACTUAL_HDWY              473824
HDWY_DEV                 474368
HDWY_DEV_PERC            474368
dtype: int64

In [32]:
wego_tidy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1981715 entries, 0 to 1981714
Data columns (total 23 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   CALENDAR_ID            int64  
 1   SERVICE_ABBR           int64  
 2   ADHERENCE_ID           int64  
 3   DATE                   object 
 4   ROUTE_ABBR             int64  
 5   BLOCK_ABBR             int64  
 6   OPERATOR               int64  
 7   TRIP_ID                int64  
 8   OVERLOAD_ID            int64  
 9   ROUTE_DIRECTION_NAME   object 
 10  TIME_POINT_ABBR        object 
 11  ROUTE_STOP_SEQUENCE    float64
 12  TRIP_EDGE              int64  
 13  LATITUDE               float64
 14  LONGITUDE              float64
 15  SCHEDULED_TIME         object 
 16  ACTUAL_ARRIVAL_TIME    object 
 17  ACTUAL_DEPARTURE_TIME  object 
 18  ADHERENCE              float64
 19  SCHEDULED_HDWY         float64
 20  ACTUAL_HDWY            float64
 21  HDWY_DEV               float64
 22  HDWY_DEV_PERC     

In [39]:
wego_tidy.groupby(['ROUTE_ABBR', 'TRIP_EDGE']).agg({'HDWY_DEV':['mean']})

Unnamed: 0_level_0,Unnamed: 1_level_0,HDWY_DEV
Unnamed: 0_level_1,Unnamed: 1_level_1,mean
ROUTE_ABBR,TRIP_EDGE,Unnamed: 2_level_2
3,0,0.401955
3,1,1.783789
3,2,
7,0,0.423648
7,1,0.458352
7,2,
22,0,0.835121
22,1,0.686182
22,2,
23,0,0.320139


In [42]:
# Select several columns with a list (double brackets); tuple-style indexing on a groupby is deprecated
Q1 = wego_tidy.groupby('TRIP_EDGE', as_index=False)[['HDWY_DEV', 'ADHERENCE']].mean()


In [43]:
Q1.corr()

Unnamed: 0,TRIP_EDGE,HDWY_DEV,ADHERENCE
TRIP_EDGE,1.0,1.0,0.99152
HDWY_DEV,1.0,1.0,1.0
ADHERENCE,0.99152,1.0,1.0
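
The table above correlates group averages, one row per `TRIP_EDGE` value, so each coefficient rests on at most three points and says little about individual stops. A sketch of the row-level alternative, computed on invented values rather than the real data:

```python
import pandas as pd

# Toy rows (invented values); None marks a missing headway deviation
toy = pd.DataFrame({
    'ADHERENCE': [-2.0, -3.35, -20.18, -0.28, 10.02, -5.0],
    'HDWY_DEV': [1.98, 3.33, None, 0.5, -4.0, 2.5],
})

# DataFrame.corr computes pairwise Pearson correlation and skips rows
# where either value of the pair is missing
row_corr = toy[['HDWY_DEV', 'ADHERENCE']].corr()
r = row_corr.loc['HDWY_DEV', 'ADHERENCE']
print(row_corr)
```

On the full dataset the same call would be applied to `wego_tidy[['HDWY_DEV', 'ADHERENCE']]`, optionally after filtering to the rows of interest.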
