WeGo Public Transit is a public transit system serving the Greater Nashville and Davidson County area. WeGo provides local and regional bus routes, the WeGo Star train service connecting Lebanon to downtown Nashville, and several other transit services.

In this project, you'll analyze bus spacing to look for patterns and to identify correlations with controllable or external factors. Specifically, you'll use a dataset containing information on headway, the amount of time between vehicle arrivals at a stop. The dataset contains a column, HDWY_DEV, which gives the headway deviation. This variable is negative when bunching has occurred (shorter headway than scheduled) and positive for gapping (longer headway than scheduled). Note that you can calculate the headway deviation percentage as HDWY_DEV/SCHEDULED_HDWY.
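As a minimal sketch of the sign convention and the percentage formula described above, the snippet below works on a few made-up rows rather than the real dataset. The names `HDWY_DEV_PCT` and `SPACING` are hypothetical, and treating a scheduled headway of 0 as missing is an assumption made here so the division cannot produce infinite values.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values) that reuse the dataset's column names.
toy = pd.DataFrame({
    "SCHEDULED_HDWY": [14.0, 10.0, 0.0, np.nan],
    "HDWY_DEV": [3.5, -2.5, 1.0, np.nan],
})

# Assumption: a scheduled headway of 0 is treated as missing,
# so the division below cannot yield inf.
scheduled = toy["SCHEDULED_HDWY"].replace(0, np.nan)
toy["HDWY_DEV_PCT"] = toy["HDWY_DEV"] / scheduled * 100

# Negative deviation = bunching, positive = gapping, zero = on schedule.
# Rows with a missing deviation match no condition and fall to the default.
toy["SPACING"] = np.select(
    [toy["HDWY_DEV"] < 0, toy["HDWY_DEV"] > 0, toy["HDWY_DEV"] == 0],
    ["bunching", "gapping", "on schedule"],
    default="unknown",
)
```

Multiplying by 100 is optional; the ratio HDWY_DEV/SCHEDULED_HDWY alone gives the deviation as a fraction of the scheduled headway.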

 1) How much impact does being late or too spaced out at the first stop have downstream?
 
 4) What is the relationship between distance or time travelled since the start of a given trip and the headway deviation? Does headway become less stable the further along the route the bus has travelled?
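One possible way to frame these two questions, sketched on made-up data: compare each trip's headway deviation at its first stop with the mean deviation at its later stops (question 1), and look at how the spread of the deviation changes with position along the trip (question 4). The `stop_seq` column is a hypothetical within-trip stop order; the real dataset would need an equivalent ordering, for example one derived from the scheduled times.

```python
import pandas as pd

# Made-up stop-level rows; `stop_seq` is a hypothetical within-trip stop order.
toy = pd.DataFrame({
    "trip_id":  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "stop_seq": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "hdwy_dev": [0.0, 0.5, 1.0, 4.0, 5.0, 6.0, -2.0, -2.5, -3.0],
})
toy = toy.sort_values(["trip_id", "stop_seq"])

# Question 1: deviation at the first stop vs. mean deviation at later stops.
is_first = toy["stop_seq"] == toy.groupby("trip_id")["stop_seq"].transform("min")
first = toy[is_first].set_index("trip_id")["hdwy_dev"].rename("first_stop_dev")
downstream = (
    toy[~is_first].groupby("trip_id")["hdwy_dev"].mean().rename("downstream_dev")
)
per_trip = pd.concat([first, downstream], axis=1)
q1_corr = per_trip["first_stop_dev"].corr(per_trip["downstream_dev"])

# Question 4: spread (standard deviation) of the deviation at each stop position.
spread_by_stop = toy.groupby("stop_seq")["hdwy_dev"].std()
```

A correlation near 1 in `q1_corr` would suggest that first-stop deviation carries downstream, and a `spread_by_stop` that grows with `stop_seq` would suggest headway becomes less stable further along the trip. Both are descriptive summaries, not evidence of causation.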

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
#reading in the 3 csv's
bna_2022 = pd.read_csv("../data/bna_2022.csv")
bna_2022.head()

Unnamed: 0,key,class,expire_time_gmt,obs_id,obs_name,valid_time_gmt,day_ind,temp,wx_icon,icon_extd,...,clds,water_temp,primary_wave_period,primary_wave_height,primary_swell_period,primary_swell_height,primary_swell_direction,secondary_swell_period,secondary_swell_height,secondary_swell_direction
0,KBNA,observation,1641027180,KBNA,Nashville,1641019980,N,71,26,2600,...,OVC,,,,,,,,,
1,KBNA,observation,1641030780,KBNA,Nashville,1641023580,N,72,26,2600,...,OVC,,,,,,,,,
2,KBNA,observation,1641034380,KBNA,Nashville,1641027180,N,73,26,2600,...,OVC,,,,,,,,,
3,KBNA,observation,1641037980,KBNA,Nashville,1641030780,N,73,26,2600,...,OVC,,,,,,,,,
4,KBNA,observation,1641041580,KBNA,Nashville,1641034380,N,73,26,2600,...,OVC,,,,,,,,,


In [3]:
headway_data = pd.read_csv("../data/Headway_Data.csv.txt")
headway_data.head()

Unnamed: 0,CALENDAR_ID,SERVICE_ABBR,ADHERENCE_ID,DATE,ROUTE_ABBR,BLOCK_ABBR,OPERATOR,TRIP_ID,OVERLOAD_ID,ROUTE_DIRECTION_NAME,...,ACTUAL_HDWY,HDWY_DEV,ADJUSTED_EARLY_COUNT,ADJUSTED_LATE_COUNT,ADJUSTED_ONTIME_COUNT,STOP_CANCELLED,PREV_SCHED_STOP_CANCELLED,IS_RELIEF,BLOCK_STOP_ORDER,DWELL_IN_MINS
0,120211101,1,76447164,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,15.983333,1.983333,0,0,1,0,0.0,0,2,12.65
1,120211101,1,76447165,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,17.333333,3.333333,0,0,1,0,0.0,0,11,0.0
2,120211101,1,76447166,2021-11-01,7,1704,2088,297750,0,TO DOWNTOWN,...,,,0,1,0,0,,0,23,22.416666
3,120211101,1,76447167,2021-11-01,50,1704,2088,297749,0,TO DOWNTOWN,...,,,0,0,1,0,0.0,0,24,5.766666
4,120211101,1,76447168,2021-11-01,50,1704,2088,297749,0,TO DOWNTOWN,...,,,0,0,1,0,,0,25,0.0


In [4]:
bna_weather = pd.read_csv("../data/bna_weather.csv")
bna_weather.head()

Unnamed: 0,Date,key,class,expire_time_gmt,obs_id,obs_name,valid_time_gmt,day_ind,temp,wx_icon,...,clds,water_temp,primary_wave_period,primary_wave_height,primary_swell_period,primary_swell_height,primary_swell_direction,secondary_swell_period,secondary_swell_height,secondary_swell_direction
0,2021-11-01 00:53:00,KBNA,observation,1635753180,KBNA,Nashville,1635745980,N,51,26,...,OVC,,,,,,,,,
1,2021-11-01 01:30:00,KBNA,observation,1635755400,KBNA,Nashville,1635748200,N,50,29,...,SCT,,,,,,,,,
2,2021-11-01 01:53:00,KBNA,observation,1635756780,KBNA,Nashville,1635749580,N,49,33,...,CLR,,,,,,,,,
3,2021-11-01 02:53:00,KBNA,observation,1635760380,KBNA,Nashville,1635753180,N,48,33,...,CLR,,,,,,,,,
4,2021-11-01 03:53:00,KBNA,observation,1635763980,KBNA,Nashville,1635756780,N,47,33,...,CLR,,,,,,,,,


In [5]:
# keep only the columns needed for the analysis
headway_df = headway_data[['DATE', 'ROUTE_ABBR', 'TRIP_ID', 'ROUTE_DIRECTION_NAME','SCHEDULED_TIME', 'ACTUAL_ARRIVAL_TIME', 'ACTUAL_DEPARTURE_TIME', 'ADHERENCE', 'SCHEDULED_HDWY', 'ACTUAL_HDWY', 'HDWY_DEV', 'DWELL_IN_MINS']]

In [6]:
headway_df

Unnamed: 0,DATE,ROUTE_ABBR,TRIP_ID,ROUTE_DIRECTION_NAME,SCHEDULED_TIME,ACTUAL_ARRIVAL_TIME,ACTUAL_DEPARTURE_TIME,ADHERENCE,SCHEDULED_HDWY,ACTUAL_HDWY,HDWY_DEV,DWELL_IN_MINS
0,2021-11-01,7,297750,TO DOWNTOWN,14:10:00,13:59:21,14:12:00,-2.000000,14.0,15.983333,1.983333,12.650000
1,2021-11-01,7,297750,TO DOWNTOWN,14:20:00,14:23:21,14:23:21,-3.350000,14.0,17.333333,3.333333,0.000000
2,2021-11-01,7,297750,TO DOWNTOWN,14:39:00,14:36:46,14:59:11,-20.183333,,,,22.416666
3,2021-11-01,50,297749,TO DOWNTOWN,15:10:00,15:04:31,15:10:17,-0.283333,,,,5.766666
4,2021-11-01,50,297749,TO DOWNTOWN,15:27:00,15:16:59,15:16:59,10.016666,,,,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...
1981710,2022-11-01,50,329980,TO DOWNTOWN,15:18:00,15:19:43,15:20:44,-2.733333,,,,1.016666
1981711,2022-11-01,50,330003,TO DOWNTOWN,15:20:00,14:57:12,15:07:06,12.900000,10.0,7.633333,-2.366667,9.900000
1981712,2022-11-01,50,330003,TO DOWNTOWN,15:37:00,15:10:16,15:10:16,26.733333,,,,0.000000
1981713,2022-11-01,56,330011,TO DOWNTOWN,15:20:00,14:59:40,15:19:02,0.966666,0.0,,,19.366666


In [7]:
#changing the column names
headway_df.columns = ['date', 'route', 'trip_id', 'rte_dir_name', 'schd_time', 'act_arrvl_time', 'act_depart', 'adh', 'schd_hdwy', 'act_hdwy', 'hdwy_dev', 'dwell_mins']

In [8]:
headway_df

Unnamed: 0,date,route,trip_id,rte_dir_name,schd_time,act_arrvl_time,act_depart,adh,schd_hdwy,act_hdwy,hdwy_dev,dwell_mins
0,2021-11-01,7,297750,TO DOWNTOWN,14:10:00,13:59:21,14:12:00,-2.000000,14.0,15.983333,1.983333,12.650000
1,2021-11-01,7,297750,TO DOWNTOWN,14:20:00,14:23:21,14:23:21,-3.350000,14.0,17.333333,3.333333,0.000000
2,2021-11-01,7,297750,TO DOWNTOWN,14:39:00,14:36:46,14:59:11,-20.183333,,,,22.416666
3,2021-11-01,50,297749,TO DOWNTOWN,15:10:00,15:04:31,15:10:17,-0.283333,,,,5.766666
4,2021-11-01,50,297749,TO DOWNTOWN,15:27:00,15:16:59,15:16:59,10.016666,,,,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...
1981710,2022-11-01,50,329980,TO DOWNTOWN,15:18:00,15:19:43,15:20:44,-2.733333,,,,1.016666
1981711,2022-11-01,50,330003,TO DOWNTOWN,15:20:00,14:57:12,15:07:06,12.900000,10.0,7.633333,-2.366667,9.900000
1981712,2022-11-01,50,330003,TO DOWNTOWN,15:37:00,15:10:16,15:10:16,26.733333,,,,0.000000
1981713,2022-11-01,56,330011,TO DOWNTOWN,15:20:00,14:59:40,15:19:02,0.966666,0.0,,,19.366666


In [9]:
# add a column for the headway deviation percentage;
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
headway_df = headway_df.assign(**{"hdwy_dev_%": (headway_df["hdwy_dev"] / headway_df["schd_hdwy"]) * 100})
headway_df


Unnamed: 0,date,route,trip_id,rte_dir_name,schd_time,act_arrvl_time,act_depart,adh,schd_hdwy,act_hdwy,hdwy_dev,dwell_mins,hdwy_dev_%
0,2021-11-01,7,297750,TO DOWNTOWN,14:10:00,13:59:21,14:12:00,-2.000000,14.0,15.983333,1.983333,12.650000,14.166664
1,2021-11-01,7,297750,TO DOWNTOWN,14:20:00,14:23:21,14:23:21,-3.350000,14.0,17.333333,3.333333,0.000000,23.809521
2,2021-11-01,7,297750,TO DOWNTOWN,14:39:00,14:36:46,14:59:11,-20.183333,,,,22.416666,
3,2021-11-01,50,297749,TO DOWNTOWN,15:10:00,15:04:31,15:10:17,-0.283333,,,,5.766666,
4,2021-11-01,50,297749,TO DOWNTOWN,15:27:00,15:16:59,15:16:59,10.016666,,,,0.000000,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1981710,2022-11-01,50,329980,TO DOWNTOWN,15:18:00,15:19:43,15:20:44,-2.733333,,,,1.016666,
1981711,2022-11-01,50,330003,TO DOWNTOWN,15:20:00,14:57:12,15:07:06,12.900000,10.0,7.633333,-2.366667,9.900000,-23.666670
1981712,2022-11-01,50,330003,TO DOWNTOWN,15:37:00,15:10:16,15:10:16,26.733333,,,,0.000000,
1981713,2022-11-01,56,330011,TO DOWNTOWN,15:20:00,14:59:40,15:19:02,0.966666,0.0,,,19.366666,


In [12]:
# average headway deviation per route and date
avg_dev_trip = headway_df.groupby(['route', 'date'])['hdwy_dev'].mean().reset_index()
avg_dev_trip

Unnamed: 0,route,date,hdwy_dev
0,3,2021-11-01,0.420389
1,3,2021-11-02,2.086237
2,3,2021-11-03,1.273639
3,3,2021-11-04,0.594816
4,3,2021-11-05,0.492666
...,...,...,...
2923,56,2022-10-28,0.466173
2924,56,2022-10-29,-0.002517
2925,56,2022-10-30,0.043032
2926,56,2022-10-31,0.095075


In [13]:
avg_dev_trip = avg_dev_trip.rename(columns={'hdwy_dev': 'avg_hdwy_dev'})

In [None]:
#avg_dev_trip['date']= pd.to_datetime(avg_dev_trip['date'])

In [None]:
#avg_dev_trip['month'] = avg_dev_trip['date'].dt.month
#avg_dev_trip

In [14]:
# average dwell time (in minutes) per route and date
avg_time_trip = headway_df.groupby(['route','date'])['dwell_mins'].mean().reset_index()
avg_time_trip

Unnamed: 0,route,date,dwell_mins
0,3,2021-11-01,2.052504
1,3,2021-11-02,2.097922
2,3,2021-11-03,1.853772
3,3,2021-11-04,1.839616
4,3,2021-11-05,1.905715
...,...,...,...
2923,56,2022-10-28,2.681436
2924,56,2022-10-29,3.224599
2925,56,2022-10-30,3.045491
2926,56,2022-10-31,2.878848


In [15]:
avg_time_trip = avg_time_trip.rename(columns={'dwell_mins': 'avg_dwell_mins'})

In [None]:
#avg_time_trip['date']= pd.to_datetime(avg_time_trip['date'])

In [None]:
#avg_time_trip.info()

In [None]:
#avg_time_trip['month'] = avg_time_trip['date'].dt.month
#avg_time_trip

In [None]:
avg_time_trip.nunique()

In [18]:
avg_trip_dev=avg_dev_trip.merge(avg_time_trip, how='outer', on=['route', 'date'])

In [19]:
avg_trip_dev

Unnamed: 0,route,date,avg_hdwy_dev,avg_dwell_mins
0,3,2021-11-01,0.420389,2.052504
1,3,2021-11-02,2.086237,2.097922
2,3,2021-11-03,1.273639,1.853772
3,3,2021-11-04,0.594816,1.839616
4,3,2021-11-05,0.492666,1.905715
...,...,...,...,...
2923,56,2022-10-28,0.466173,2.681436
2924,56,2022-10-29,-0.002517,3.224599
2925,56,2022-10-30,0.043032,3.045491
2926,56,2022-10-31,0.095075,2.878848
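The two groupby steps and the merge above could also be written as a single aggregation. The sketch below uses pandas named aggregation on made-up rows; `avg_by_route_date` is a hypothetical name, and the result is intended to match the columns of `avg_trip_dev`.

```python
import pandas as pd

# Made-up rows using the renamed columns from this notebook.
toy = pd.DataFrame({
    "route": [3, 3, 3, 56],
    "date": ["2021-11-01", "2021-11-01", "2021-11-02", "2021-11-01"],
    "hdwy_dev": [1.0, 3.0, None, -0.5],
    "dwell_mins": [2.0, 4.0, 1.0, 0.0],
})

# One pass over the data: both means are computed per (route, date) group.
avg_by_route_date = (
    toy.groupby(["route", "date"])
    .agg(avg_hdwy_dev=("hdwy_dev", "mean"), avg_dwell_mins=("dwell_mins", "mean"))
    .reset_index()
)
```

Because both averages come from the same grouping, there is no key mismatch to reconcile and no merge strategy to choose. Groups whose headway deviations are all missing still appear, with a missing `avg_hdwy_dev`.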


In [20]:
trip_df=headway_df[['date', 'route', 'act_arrvl_time', 'hdwy_dev']]
trip_df

Unnamed: 0,date,route,act_arrvl_time,hdwy_dev
0,2021-11-01,7,13:59:21,1.983333
1,2021-11-01,7,14:23:21,3.333333
2,2021-11-01,7,14:36:46,
3,2021-11-01,50,15:04:31,
4,2021-11-01,50,15:16:59,
...,...,...,...,...
1981710,2022-11-01,50,15:19:43,
1981711,2022-11-01,50,14:57:12,-2.366667
1981712,2022-11-01,50,15:10:16,
1981713,2022-11-01,56,14:59:40,


In [21]:
# convert the date strings to datetime; assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning
trip_df = trip_df.assign(date=pd.to_datetime(trip_df['date']))
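With the date parsed, one possible next step is to combine it with the arrival time so that headway deviation can be examined by hour of day or by month. This is a sketch on made-up rows; `arrival_ts`, `arrival_hour`, and `month` are hypothetical column names, and it assumes every arrival time is a valid `HH:MM:SS` string (values that fail to parse become missing because of `errors="coerce"`).

```python
import pandas as pd

# Made-up rows shaped like trip_df before the date conversion.
toy = pd.DataFrame({
    "date": ["2021-11-01", "2021-11-01", "2022-11-01"],
    "act_arrvl_time": ["13:59:21", "14:23:21", "15:19:43"],
    "hdwy_dev": [1.983333, 3.333333, -2.366667],
})

# Both keyword arguments are evaluated against the original string columns.
toy = toy.assign(
    date=pd.to_datetime(toy["date"]),
    arrival_ts=pd.to_datetime(
        toy["date"] + " " + toy["act_arrvl_time"],
        format="%Y-%m-%d %H:%M:%S",
        errors="coerce",
    ),
)
toy["arrival_hour"] = toy["arrival_ts"].dt.hour
toy["month"] = toy["date"].dt.month
```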


### Attempting visualizations

In [None]:
# avg_dev_trip is grouped by route and date, so it has no trip_id column; plot by route instead
sns.scatterplot(data=avg_dev_trip, x="route", y="avg_hdwy_dev")

In [None]:
# avg_time_trip is grouped by route and date, so it has no trip_id column; plot by route instead
sns.scatterplot(data=avg_time_trip, x="route", y="avg_dwell_mins")
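As a numerical companion to the scatterplots, the correlation between average dwell time and average headway deviation can be computed overall and per route. The sketch below runs on made-up values shaped like `avg_trip_dev`; `overall_corr` and `corr_by_route` are hypothetical names.

```python
import pandas as pd

# Made-up route/date averages shaped like avg_trip_dev.
toy = pd.DataFrame({
    "route": [3, 3, 3, 56, 56, 56],
    "avg_hdwy_dev": [0.4, 2.1, 1.3, 0.5, 0.0, 0.1],
    "avg_dwell_mins": [2.0, 2.4, 2.2, 2.7, 3.2, 3.0],
})

# Pearson correlation across all rows, ignoring route.
overall_corr = toy["avg_hdwy_dev"].corr(toy["avg_dwell_mins"])

# The same correlation computed separately within each route.
corr_by_route = {
    route: grp["avg_hdwy_dev"].corr(grp["avg_dwell_mins"])
    for route, grp in toy.groupby("route")
}
```

In this toy data the two routes show opposite relationships, which is why a per-route view can be more informative than a single pooled figure.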