# Predicting NYC School Bus Delays

## Background
Parents and NYC school officials have raised the alarm that the public-school transportation system has been operating with increasing delays and bus breakdowns. Affected students lose precious class time and face potentially serious safety concerns while waiting for assistance on congested city streets. This capstone aims to alleviate some of the strain on the NYC school bus system by predicting the occurrence and duration of delays. With a deeper understanding of the pain points that lead to delays, I will build a recommendation system that distributes resources to minimize the likelihood of delays.

## Solution
The primary solution will have two parts: a classification model to predict whether a delay occurs, and a linear regression to predict the length of a delay in the event that one occurs. I will use scikit-learn’s `GridSearchCV` and `Pipeline` functionality to test a variety of model and parameter configurations. I anticipate that random forest will provide the strongest guidance on feature selection, but I will likely opt for a more interpretable linear model to predict the length of delay.
* **Secondary Solution** – Once I have reliably predicted delays and delay length, I would like to use these models to optimize resources for one specific, poorly performing route. 
* **Tertiary Solution** – A recommender that identifies which routes will benefit most if x additional buses are added to the system.

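As a rough illustration of the two-part primary solution, the sketch below runs a `GridSearchCV` over a `Pipeline` whose classifier step is itself a searchable parameter, then fits a linear regression on the delayed rows only. The data is synthetic and the feature layout, parameter grids, and `f1` scoring choice are all assumptions made for the example, not settings taken from the real datasets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 4 numeric features per bus run.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
noise = rng.normal(scale=0.5, size=300)
delayed = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)
minutes = 20 + 5 * X[:, 2] + rng.normal(scale=1.0, size=300)

# Stage 1: classify delay vs. no delay.
# Listing estimators under "clf" lets one search compare model families.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0, 10.0]},
    {"clf": [RandomForestClassifier(random_state=0)],
     "clf__n_estimators": [50, 100]},
]
search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1")
search.fit(X, delayed)

# Stage 2: regress delay length, using only the rows where a delay occurred.
mask = delayed == 1
reg = LinearRegression().fit(X[mask], minutes[mask])

print("best classifier:", search.best_params_["clf"])
print("regression coefficients:", reg.coef_)
```

The linear model's coefficients can be read directly as minutes of delay per unit of each feature, which is the interpretability argument for using it in the second stage.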

## Data Sources
The City of New York maintains a platform of [public datasets](https://www1.nyc.gov) generated by city agencies and organizations. The Department of Education contributes a number of resources relevant to the school bus transportation problem.
1. [Bus Breakdown and Delays](https://data.cityofnewyork.us/Transportation/Bus-Breakdown-and-Delays/ez4e-fazm): a record of breakdowns and delays reported by the bus drivers who experienced the incident. This dataset is updated daily. I accessed a CSV version through [Kaggle](https://www.kaggle.com/new-york-city/ny-bus-breakdown-and-delays/home).
2. Bus Route & Vendor Metadata: multiple CSV files, updated monthly, covering [drivers](https://data.cityofnewyork.us/Transportation/Drivers-and-Attendants/4tqt-y424), [bus stops](https://data.cityofnewyork.us/Transportation/Transportation-Sites/hg3c-2jsy), [vehicles](https://data.cityofnewyork.us/Transportation/Vehicles/28rh-vpvr), and [routes](https://data.cityofnewyork.us/Transportation/Routes/8yac-vygm).

## Anticipated Challenges
**The primary dataset is limited to breakdowns and delays.**  
Because the dataset records only the runs that went wrong, modeling disruptions to the system requires a baseline of normal service. I will need to reconstruct a best-guess picture of the typical bus schedule from the metadata on NYC’s data platform. In the event that I cannot map out a clear picture of typical routes, I will need to build simulations using [novelty detection](https://scikit-learn.org/stable/modules/outlier_detection.html) or [survival analysis](https://github.com/sebp/scikit-survival).

**The events I am trying to predict (breakdowns and delays) are by nature likely to be rare occurrences, creating an imbalanced dataset.**  
An imbalanced dataset may lead my model to favor the majority class and fail to identify potential delays. In the event that the target class is significantly undersized, I will have to use a combination of over-sampling the minority class and under-sampling the majority class to compensate.

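A minimal sketch of combined over- and under-sampling, using `sklearn.utils.resample` on synthetic data. The 5% minority share and the `target` size of 300 rows per class are hypothetical values chosen for illustration. In practice the resampling should be applied to the training split only, so that evaluation data keeps the true class balance.

```python
import numpy as np
from sklearn.utils import resample

# Synthetic stand-in data (assumption): 1000 runs, 5% of them delayed.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:50] = 1

X_min, X_maj = X[y == 1], X[y == 0]
target = 300  # hypothetical per-class size after resampling

# Over-sample the minority class with replacement.
X_min_up = resample(X_min, replace=True, n_samples=target, random_state=0)
# Under-sample the majority class without replacement.
X_maj_down = resample(X_maj, replace=False, n_samples=target, random_state=0)

# Recombine into a balanced training set.
X_bal = np.vstack([X_min_up, X_maj_down])
y_bal = np.concatenate([np.ones(target, dtype=int),
                        np.zeros(target, dtype=int)])

print(X_bal.shape, y_bal.mean())
```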
**Exogenous factors like weather and traffic are likely to be high contributors to delay.**  
It is very likely that weather, traffic, and road construction are the strongest indicators of delay. If it is not possible to reliably predict delays from the current route- and bus-based information, I will seek out weather and traffic pattern data.