# HamletMTA Final Report

- [David Hadaller](#David-Hadaller)
- [Angelika Shastapalava](#Angelika-Shastapalava)
- [Sam Mundle](#Sam-Mundle)
- [Excel Espina](#Excel-Espina)

## Project Aims

We will use MTA bus data, a Monte Carlo passenger simulation, and weather data to create a model that determines how long a passenger should wait for the next bus before giving up and choosing an alternative mode of transportation. Our focus thus far has been on one particular bus stop (the southbound M100 on 135th street) in August of 2017. However, this model could be iterated system-wide for every bus stop location.

The diagram below serves as a reference for our analysis and is explained in more detail in the passages that follow.

![diagram](analysis_diagram.png)

### Monte Carlo Simulation

In our Monte Carlo simulation, we first assume that a passenger is equally likely to approach a stop at any time of day (a uniform distribution). We chose this for its realism (it is a common assumption among staff analysts at the MTA's operations and planning department) and for its ease of computation. Later on, we can experiment with different ridership behaviors, such as a bimodal probability distribution (peaks in the early morning and evening for commuters).
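
The two arrival assumptions can be sketched as follows. This is a minimal illustration rather than the project's code: the service window (5:00 to 23:00), the rush-hour peaks (8:00 and 18:00), the spread, and the function names `uniform_arrivals` and `bimodal_arrivals` are all assumptions chosen for the example.

```python
import numpy as np

def uniform_arrivals(rng, n, first_bus=5.0, last_bus=23.0):
    """Arrival times in hours, equally likely anywhere in the service window."""
    return rng.uniform(first_bus, last_bus, size=n)

def bimodal_arrivals(rng, n, peaks=(8.0, 18.0), spread=1.5,
                     first_bus=5.0, last_bus=23.0):
    """Arrival times drawn from an equal mixture of two normal 'rush hour' peaks."""
    centers = rng.choice(peaks, size=n)           # pick one peak per passenger
    times = rng.normal(loc=centers, scale=spread)
    return np.clip(times, first_bus, last_bus)    # keep times inside the service window

rng = np.random.default_rng(0)
uniform_times = uniform_arrivals(rng, 1000)
bimodal_times = bimodal_arrivals(rng, 1000)
```

Clipping piles a few passengers onto the edges of the window; rejection sampling would avoid that at the cost of a slightly longer function.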

#### Passenger Wait Times ([Sam](#Sam-Mundle))

We view each day's bus arrivals, at a given bus stop, as points on a timeline. Similarly, each passenger arrival at the bus stop also falls on the day's timeline. Calculating wait times then amounts to finding the difference between the passenger's arrival and the arrival of the very next bus along this timeline. Our goal here is to create a dataset with the time (as in time of day) of a passenger's approach to the bus station as the independent variable and the wait time as the dependent variable. This data will serve as the validation set for our model.
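
The timeline idea can be sketched with a sorted array of bus times and a binary search for the next bus. Times here are plain minutes after midnight, and the function name `wait_times` is hypothetical; the notebook itself works with pandas timestamps.

```python
import numpy as np

def wait_times(bus_times, passenger_times):
    """Minutes each passenger waits for the next bus; NaN if no bus remains that day."""
    bus_times = np.sort(np.asarray(bus_times, dtype=float))
    passenger_times = np.asarray(passenger_times, dtype=float)
    # index of the first bus arriving at or after each passenger
    idx = np.searchsorted(bus_times, passenger_times, side="left")
    waits = np.full(passenger_times.shape, np.nan)
    has_bus = idx < len(bus_times)
    waits[has_bus] = bus_times[idx[has_bus]] - passenger_times[has_bus]
    return waits

# times in minutes after midnight
example_waits = wait_times([360, 375, 400], [350, 375, 380, 410])
print(example_waits)
```

A passenger who arrives exactly when a bus does waits zero minutes, and a passenger who arrives after the last bus gets `NaN` rather than a wait time.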

#### Bus Time Deltas/headways ([David](#David-Hadaller))

For the simulated passengers, each approaching the bus stop at some random time, we would like to find how crowded the bus they board will be. To find this, we simulate many passengers "boarding" busses by assigning each of a population of $n$ daily passengers a uniform random timestamp between the first and last busses of the day. Busses with an abnormally long timedelta (a long interval between the current and the previous bus arrival) will generally be more crowded, since more passengers accumulate at the bus stop as time goes on. Again, the time of day is our independent variable and the number of people on the bus (crowding) is the dependent variable. This data will serve as the validation set for our model.
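
A small sketch of why long headways produce crowded busses under the uniform assumption. The bus times and the passenger count of 600 are made up for illustration.

```python
import numpy as np

# Bus arrival times in minutes after the first bus; the 40-minute gap is deliberate
bus_times = np.array([0.0, 10.0, 20.0, 60.0])
headways = np.diff(bus_times)                      # minutes since the previous bus

rng = np.random.default_rng(1)
passenger_times = rng.uniform(bus_times[0], bus_times[-1], size=600)

# Passengers arriving between two consecutive busses board the later one,
# so counting arrivals per interval gives the load of each bus after the first
boarded, _ = np.histogram(passenger_times, bins=bus_times)

for gap, load in zip(headways, boarded):
    print(f"headway {gap:4.0f} min -> {load} passengers")
```

With uniform arrivals the expected load is proportional to the headway, so the bus that follows the 40-minute gap carries roughly four times as many simulated passengers as the others.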

### Weather Data ([Angelika](#Angelika-Shastapalava))

Weather data will include columns for precipitation, wind speed, and visibility for the month of August 2017 in NYC. These will serve as our features for predicting crowding and wait time.


### Machine Learning Model ([Excel](#Excel-Espina))

The feature variables will include the weather data columns and the time of day, while the target variables will be the crowding and wait times experienced by the passengers (that is, we predict what we simulated in the Monte Carlo step). Our goal is to produce a general weather-conscious model that predicts passenger experience (wait time and crowdedness).

Considerations for this step include, but are not limited to:

- model evaluation (accuracy, recall, precision, lift, and so on)
- model type (linear or non-linear; regression or decision tree)


## Data Sources

[MTA Schedules](http://web.mta.info/nyct/service/bus/bklnsch.htm#top) (need to figure out best way to scrape or source better structured data)

[MTA Bus Statistics](https://www.kaggle.com/stoney71/new-york-city-transport-statistics)

[Weather.Gov](https://www.weather.gov/okx/CentralParkHistorical) Data from a weather monitor in Central Park. Each day, a 1:30 am report holds 24 hours of weather data starting at 12:00 am EST the previous day; the reports look like [this](https://forecast.weather.gov/product.php?site=NWS&issuedby=NYC&product=CLI&format=CI&version=1&glossary=1&highlight=off).

# David Hadaller

In [None]:
import pandas as pd
from pandas import Timestamp

import matplotlib.dates
import matplotlib.pyplot as plt

import numpy as np
import datetime
import math
import sys

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
#this is a dataset for the timestamps of busses at the Amsterdam & 125th st stop on M100 line going uptown 
busArrivals = pd.read_csv("../data/Arrivals_M100.csv",index_col=0)

# loopTime is the minimum amount of time, in minutes, that it takes a bus to complete the bus route and 
    # arrive at this stop to complete the circuit once again
loopTime = datetime.timedelta(minutes=105)

# Ensure ordering by VehicleRef (a vehicle identifier for busses) and RecordedAtTime (timestamps)
busArrivals = busArrivals.sort_values(['VehicleRef','RecordedAtTime'])

# Resetting Index and deleting resulting index column after ordering for shift later on.
busArrivals = busArrivals.reset_index()
busArrivals.drop(columns=['index',],inplace=True)

# Ensure that RecordedAtTime is of correct data type to find timedelta
busArrivals['RecordedAtTime'] = pd.to_datetime(busArrivals['RecordedAtTime'])

# find difference between CURRENT timestamp and PREVIOUS for each gps-timestamp
    #busArrivals['timeDelta'] = busArrivals_Grouped['RecordedAtTime'].diff()
busArrivals['timeDelta'] = busArrivals['RecordedAtTime'].diff()

# we want to find all the timestamps where busses pull away from this one stop. 
    # the departure time is when we consider that a passenger is no longer waiting for their journey to start. 
    # hence, we count bus idling as part of the passenger's experienced wait time

# wherever the difference between two consecutive timestamps is greater than the loopTime, 
    # the bus has finished its route and come back to the same stop it started at.
    # The bus is not idling.  
busArrivals['hasLooped'] = busArrivals['timeDelta'] > loopTime

# fixing some edge cases, e.g. the first datapoint has no timedelta because no other time precedes it
busArrivals.loc[0,'timeDelta'] = pd.Timedelta(0)

# # where the timedelta is NaT, set to hasLooped=True. We do this so that the first entry for a given VehicleRef won't 
#     # count as a departure time, but the last entry from the previous VehicleRef entry will.
# busArrivals.loc[busArrivals['timeDelta'].isnull(),'hasLooped'] = True

# wherever the next arrival is a Looparound, the current timestamp is considered a departure from the stop
busArrivals['isDeparting'] = busArrivals['hasLooped'].shift(-1)

#the last entry in the entire dataframe must be included as a departure
busArrivals.loc[busArrivals.index[-1], 'isDeparting']= True

# If the next bus is not the same as the current bus, then this entry must be considered a departure
busArrivals['NextVehicleRef'] = busArrivals['VehicleRef'].shift(-1).fillna("") #create next bus column by shifting current bus up by 1 relative to index
mask = busArrivals['VehicleRef'] != busArrivals['NextVehicleRef']
busArrivals.loc[mask, 'isDeparting'] = True

# return all rows where the busses are departing
busArrivals = busArrivals[busArrivals['isDeparting'] != False]
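
The departure rule used above (a ping is a departure if it is the vehicle's last ping, or if the same vehicle is next seen more than one loop later) can also be sketched with a per-vehicle `groupby`. The function name `departures` and the toy pings are hypothetical; the column names and the 105-minute loop time follow the cell above.

```python
import pandas as pd

def departures(pings, loop_time=pd.Timedelta(minutes=105)):
    """Keep the last ping of each dwell at the stop, per vehicle."""
    pings = pings.sort_values(["VehicleRef", "RecordedAtTime"]).reset_index(drop=True)
    # gap to the same vehicle's NEXT ping; NaT marks a vehicle's final ping
    gap_to_next = pings.groupby("VehicleRef")["RecordedAtTime"].diff(-1).abs()
    is_departing = gap_to_next.isna() | (gap_to_next > loop_time)
    return pings[is_departing]

toy_pings = pd.DataFrame({
    "VehicleRef": ["A", "A", "A", "A", "B"],
    "RecordedAtTime": pd.to_datetime([
        "2017-08-01 08:00", "2017-08-01 08:01", "2017-08-01 08:02",  # bus A idling
        "2017-08-01 10:00",                                          # bus A after a loop
        "2017-08-01 09:00",                                          # bus B, single ping
    ]),
})
toy_departures = departures(toy_pings)
```

Grouping by `VehicleRef` keeps the time differences from crossing between vehicles, which is the case the `NextVehicleRef` mask handles in the cell above.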

In [None]:
#M100_NICK = pd.read_csv('../data/M100_month_W125_st.csv')
M100_NICK = busArrivals
M100_NICK.columns
M100_NICK.shape
M100_NICK.to_csv("../data/testing.csv")

Before we begin the simulation, we need to establish what will become the arguments to `numpy.random.uniform(low=0.0, high=1.0, size=None)`. Below, we find the `low` and `high` parameters. That is, we find the first and last bus arrival times for each day.

In [None]:
# need to change the datetime to string to apply string split methods
busArrivals['RecordedAtTime'] = busArrivals['RecordedAtTime'].dt.strftime('%Y-%m-%d %H:%M:%S')

DailyBusMinMax= M100_NICK.loc[:,['RecordedAtTime']]
splitCol = DailyBusMinMax['RecordedAtTime'].str.split(' ', n=1, expand=True).rename(columns={0:'Date', 1:'Time'}) 
DailyBusMinMax['Date']= splitCol['Date'] 

DailyBusMinMax = DailyBusMinMax.drop_duplicates()

DailyBusMax = DailyBusMinMax.groupby('Date').max()
DailyBusMin = DailyBusMinMax.groupby('Date').min()

DailyBusMinMax = pd.merge(left=DailyBusMin, right=DailyBusMax, how='inner',on='Date', suffixes=('Min', 'Max'))
DailyBusMinMax = DailyBusMinMax.rename(columns={'RecordedAtTimeMin':'EarliestBusArrival', 'RecordedAtTimeMax':'LatestBusArrival'})

DailyBusMinMax.reset_index(level=0, inplace=True)
DailyBusMinMax['Date'] = pd.to_datetime(DailyBusMinMax['Date'],format='%Y-%m-%d')
DailyBusMinMax['EarliestBusArrival'] = pd.to_datetime(DailyBusMinMax['EarliestBusArrival'], format='%Y-%m-%d %H:%M:%S')
DailyBusMinMax['LatestBusArrival'] = pd.to_datetime(DailyBusMinMax['LatestBusArrival'], format='%Y-%m-%d %H:%M:%S')

DailyBusMinMax.head()
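
For comparison, the same per-day bounds can be sketched without the string round trip by grouping directly on the calendar date. The helper name `daily_min_max` and the toy data are hypothetical; the output column names match the cell above.

```python
import pandas as pd

def daily_min_max(arrivals, time_col="RecordedAtTime"):
    """First and last bus timestamp for each calendar date."""
    times = pd.to_datetime(arrivals[time_col])
    return (times.groupby(times.dt.normalize())     # midnight of each timestamp's date
                 .agg(EarliestBusArrival="min", LatestBusArrival="max")
                 .rename_axis("Date")
                 .reset_index())

toy = pd.DataFrame({"RecordedAtTime": [
    "2017-08-01 06:00:00", "2017-08-01 22:30:00", "2017-08-02 05:45:00",
]})
daily = daily_min_max(toy)
```

A date with a single bus gets identical earliest and latest times, which makes such days easy to spot before they reach the simulation.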

In the cell below, we take the bus arrival times (when the bus pulls into the stop), the dates associated with each bus arrival time (to help with subsequent merges) and the timedeltas (which we may plot later on) from the original `M100_NICK` data

In [None]:
BusArrivals = M100_NICK.loc[:,['RecordedAtTime','time_delta_mins']]
dates = M100_NICK['RecordedAtTime'].str.split(' ', n=1, expand=True).rename(columns={0:'Date'})
BusArrivals.insert(loc=0, column='Date', value=dates['Date'])

BusArrivals = BusArrivals.rename(columns={'RecordedAtTime':'BusArrivalTime'})

# Change Column DataTypes from String (object) to DateTime
BusArrivals['BusArrivalTime'] = pd.to_datetime(BusArrivals['BusArrivalTime'], format='%Y-%m-%d %H:%M:%S')
BusArrivals['Date'] = pd.to_datetime(BusArrivals['Date'], format='%Y-%m-%d')
BusArrivals = BusArrivals.drop_duplicates()

BusArrivals.head()

Next, we define a simulation function which gets the `low` and `high` bounds of the uniform distribution from the DailyBusMinMax dataframe and then takes `NumPassengers` for population size of passengers to simulate. 

This simulation function creates a "pivot table" with the Date as pseudo-index and an artificial passengerId for column headers. For each date, the table contains the simulated arrival time of each of the `NumPassengers` passengers (the time at which each passenger approaches the bus stop in the hope of boarding a bus).

Of course, the table that results isn't a true pivot table, because the Date column is just another column, rather than a pandas index. Keeping the Date as a column allows us to reference it as a column later on, which will come in handy when we need to return a series of dates.

In [None]:
def passengerSim(DailyBusMinMax, NumPassengers):
    
    #time between the first and last bus arrivals
    dailyDelta = DailyBusMinMax['LatestBusArrival'] - DailyBusMinMax['EarliestBusArrival']
    
    # the first bus arrival
    dailyMin = DailyBusMinMax['EarliestBusArrival']
    
    #number of dates to simulate for
    NumDates = len(DailyBusMinMax.Date)
    
    #this vectorized calculation follows the formula dailyDelta * randomVar + firstBusArrival to choose a random time
        # between the EarliestBusArrival and the LatestBusArrival.
        # this is done for every date and for each passenger in NumPassengers
    pSim = pd.DataFrame(np.random.uniform(0,1,(NumDates,NumPassengers)))
    pSim = pSim.mul(dailyDelta,axis=0)
    pSim = pSim.add(dailyMin,axis=0)
    
    # add a dates column to front of dataframe  
    pSim.insert(loc=0, column='Date', value=DailyBusMinMax['Date'])
    
    return pSim

In [None]:
sim = passengerSim(DailyBusMinMax,500)
sim.head()

Here, we reorganize the results of the passenger simulation to get a table that has one single `passengerId` column, instead of one column for each passenger. This will allow us to perform a merge in the following step.

In [None]:
sim = sim.melt(id_vars='Date')
sim = sim.rename(columns={'variable':'passengerId','value':'passengerArrivalTime'})
sim.head()

As promised, we now merge the passenger simulation, `sim`, with the bus arrival times, `BusArrivals`. The result of the code below is a lookup table where each passenger arrival time is associated with one bus arrival time; that is to say, each passenger is associated with the bus they board (which will always be the next bus that approaches the stop after they arrive).

In [None]:
# The Cartesian product of busses and passengers for each date
busBoarding = pd.merge(right=sim, left=BusArrivals, on='Date', how='inner')

# whittle down previous dataframe to those where passengers board busses that arrive at stop after they do
    # (no going back in time)
busBoarding = busBoarding.loc[busBoarding['BusArrivalTime']>=busBoarding['passengerArrivalTime']]

# the passenger is reasonable and will board the first bus that approaches the stop;
    # sort by bus arrival time first so that first() really picks the earliest remaining bus
busBoarding = busBoarding.sort_values('BusArrivalTime')
busBoarding = busBoarding.groupby(['Date','passengerId','passengerArrivalTime']).first()

# we reset the index to group by a different column in the next step
busBoarding = busBoarding.reset_index().sort_values(['Date','BusArrivalTime','passengerId'])
busBoarding.head()

Now, we calculate the number of people per bus by grouping by `BusArrivalTime` and then counting the number of entries in each group. We then merge this `busCrowding` DataFrame back into our `busBoarding` DataFrame from the previous step to give us a `numPassengersPerBus` column, which tells us exactly how many of our simulated passengers boarded each bus.

In [None]:
busCrowding = busBoarding.groupby(['BusArrivalTime']).count()
busCrowding = pd.DataFrame(busCrowding['passengerId']).rename(columns={'passengerId':'numPassengersPerBus'})
busCrowding = busCrowding.reset_index()

busBoarding = pd.merge(left=busBoarding, right=busCrowding, on='BusArrivalTime', how='inner')
busBoarding.head()
busBoarding.shape

We now take a few columns of the `busBoarding` data to plot the relationship between `passengerArrivalTime` and `numPassengersPerBus`.

In [None]:
plotData = busBoarding.loc[:,['passengerArrivalTime','numPassengersPerBus']]
plotData['passengerArrivalTime'] = plotData.passengerArrivalTime.dt.time
plotData.head()
plotData.shape

Finally, we have our plot.

In [None]:
plotData = plotData.sort_values('passengerArrivalTime', ascending=True)
_= plt.plot(plotData['passengerArrivalTime'], plotData['numPassengersPerBus'],'.',linestyle='none')
_ = plt.xlabel('Passenger Arrival Time')
_ = plt.ylabel('Number of Passengers on Bus')
_ = plt.margins(0.02) # Keeps data off plot edges
plt.xticks(rotation='vertical')

Notice that the above plot features a series of points at the 1000 people mark. This signifies that there are busses so crowded that they fit 1000 passengers inside. Clearly there is an inaccuracy in the data. The source of the problem is that there is only one bus per day on some days, but the daily population of passengers remains 1000 every day. So far as we can see, there can be two causes for this error:

1. Some of the bus arrival data has been unnecessarily deleted or is missing; there are no "one-bus-only" days at this stop.
2. The number of passengers should be adapted to the number of busses for each transit day. This would mean the error was setting the daily ridership at a constant of `numPassengers=1000`.

However, once we find the source of this error, we will have our "base truth" on which to train our models in the next phase of our project.
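
One way to start narrowing down the two candidate causes is to count the recorded busses per day and, for the second cause, scale the simulated ridership with that count. The sketch below assumes a `BusArrivalTime` column as in `BusArrivals`; the helper names and the figure of 20 passengers per bus are hypothetical placeholders, not measured values.

```python
import pandas as pd

def buses_per_day(bus_arrivals, time_col="BusArrivalTime"):
    """Number of recorded bus arrivals on each calendar date."""
    times = pd.to_datetime(bus_arrivals[time_col])
    return times.dt.normalize().value_counts().sort_index().rename("numBuses")

def passengers_for_day(num_buses, passengers_per_bus=20):
    """Scale the simulated daily ridership with the number of busses (assumed rate)."""
    return num_buses * passengers_per_bus

toy = pd.DataFrame({"BusArrivalTime": [
    "2017-08-01 06:00", "2017-08-01 06:20", "2017-08-01 06:45",
    "2017-08-02 07:10",                      # a suspicious one-bus day
]})
counts = buses_per_day(toy)
single_bus_days = counts[counts == 1].index
daily_passengers = counts.map(passengers_for_day)
```

Listing `single_bus_days` shows which dates to check against the raw data, while `daily_passengers` gives a per-day population that could replace the constant.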

# Sam Mundle

In [None]:
# Find the elapsed time between passenger arrival and bus arrival (Wait time)
busBoarding['WaitTime'] = busBoarding['BusArrivalTime'] - busBoarding['passengerArrivalTime']
busBoarding.shape

In [None]:
waitPlotData = busBoarding[['passengerArrivalTime', 'WaitTime']].copy()
waitPlotData['passengerArrivalTime'] = waitPlotData.passengerArrivalTime.dt.time
waitPlotData.dtypes

In [None]:
waitPlotData = waitPlotData.sort_values('passengerArrivalTime', ascending=True)[0:500]
_= plt.plot(waitPlotData['passengerArrivalTime'], waitPlotData['WaitTime'],'.',linestyle='none')
_ = plt.xlabel('Passenger Arrival Time')
_ = plt.ylabel('Wait Time For Bus')
_ = plt.margins(0.02) # Keeps data off plot edges
plt.xticks(rotation='vertical')



First we import relevant libraries and generate a dataframe representing all of the arrivals at a single stop in a single month traveling in one of two directions on the M100 bus route. 

It should be noted that there are relatively few entries for the month (107) due to data loss earlier in the cleaning process. Further work on cleaning is necessary.

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

mta_df = pd.read_csv('../data/M100_month_W125_st.csv', on_bad_lines='skip')
mta_df.shape

The dataframe is then sorted by 'RecordedAtTime'. Since every entry was recorded once the bus had reached the stop, these timestamps represent the actual arrival times.

In [None]:
mta_df['RecordedAtTime'] = pd.to_datetime(mta_df['RecordedAtTime'])
mta_df.sort_values("RecordedAtTime", inplace=True)
mta_df.head()

The arrivals dataframe is initialized with a controlled number of passenger arrival times (`NumDataPoints`); the granularity of the candidate times is set by the `frequency` argument.

In [None]:
def select_random_dates(frequency, NumDataPoints):
    date_range = pd.date_range(start='2017-08-01', end='2017-08-30', freq=frequency)
    random_dates = pd.to_datetime(
        np.concatenate([
                np.random.choice(date_range[1:-1], size=NumDataPoints, replace=False)
            ])
        )
    return random_dates

arrivals_df = pd.DataFrame()
arrivals_df['PassengerTime'] = select_random_dates('1min', 600)
arrivals_df.head(10)

The next arriving bus is found for each of the random passenger arrival times defined above as well as the time delta between the two, representing wait time.

In [None]:
def findNextBus(arrivals_df, mta_df):
    for arrivalIndex, arrivalRow in arrivals_df.iterrows():
        for mtaIndex, mtaRow in mta_df.iterrows():
            if (mtaRow['RecordedAtTime'] > arrivalRow['PassengerTime']):
                arrivals_df.loc[arrivalIndex,'NextBus'] = mtaRow['RecordedAtTime']
                break

findNextBus(arrivals_df, mta_df)
arrivals_df['WaitTime'] = arrivals_df['NextBus'] - arrivals_df['PassengerTime']
arrivals_df.head(10)
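
The nested loops above compare every passenger with every bus. As a possible alternative, the same lookup can be sketched with `pandas.merge_asof`, which matches each passenger to the next bus in a single sorted pass. The function name `find_next_bus` and the toy data are hypothetical; the column names follow the cells above.

```python
import pandas as pd

def find_next_bus(arrivals, buses):
    """Attach the next bus and the wait time to each passenger arrival."""
    arrivals = arrivals.sort_values("PassengerTime")
    buses = buses[["RecordedAtTime"]].sort_values("RecordedAtTime")
    matched = pd.merge_asof(
        arrivals, buses,
        left_on="PassengerTime", right_on="RecordedAtTime",
        direction="forward",          # look ahead in time for the next bus
        allow_exact_matches=False,    # strictly later, like the `>` in the loop
    ).rename(columns={"RecordedAtTime": "NextBus"})
    matched["WaitTime"] = matched["NextBus"] - matched["PassengerTime"]
    return matched

toy_arrivals = pd.DataFrame({"PassengerTime": pd.to_datetime(
    ["2017-08-01 08:05", "2017-08-01 08:00", "2017-08-01 09:30"])})
toy_buses = pd.DataFrame({"RecordedAtTime": pd.to_datetime(
    ["2017-08-01 08:00", "2017-08-01 08:20", "2017-08-01 09:00"])})
matched = find_next_bus(toy_arrivals, toy_buses)
```

Passengers who arrive after the final recorded bus get `NaT` for `NextBus` and `WaitTime`; the result is ordered by `PassengerTime`.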

# Angelika Shastapalava

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime


Reading the file.

In [None]:
df = pd.read_csv("../data/1401011_weather_data.csv", delimiter = ',', usecols = (5,24), index_col='DATE')

In [None]:
df.head()

In [None]:
df.dtypes


Selecting the time period of August 2017.

In [None]:
weather = df.loc['2017-08-01' : '2017-09-01']

In [None]:
weather.info()

Our data contains some NA values. In addition, some precipitation values contain 
the letter 'T' or 's'.
We drop the NA values and the precipitation values that contain 'T' or 's'.

In [None]:
weather = weather.dropna()
weather = weather[~weather.HOURLYPrecip.str.contains("T")]
weather = weather[~weather.HOURLYPrecip.str.contains("s")]

In [None]:
weather.info()

In [None]:
weather.HOURLYPrecip=pd.to_numeric(weather.HOURLYPrecip)
weather.reset_index()
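
Dropping the flagged rows loses some hours of data. A possible alternative is sketched below under two assumptions that should be checked against the dataset's documentation: that `T` marks a trace amount of precipitation and that a trailing `s` flags a suspect value. The function name `clean_precip` and the choice to map a trace to `0.0` are hypothetical.

```python
import pandas as pd

def clean_precip(raw, trace_value=0.0):
    """Convert raw precipitation strings to floats instead of dropping flagged rows."""
    text = raw.astype(str).str.strip()
    is_trace = text == "T"                          # assumed: "T" means a trace amount
    text = text.str.replace(r"s$", "", regex=True)  # assumed: trailing "s" is a flag
    values = pd.to_numeric(text, errors="coerce")   # anything unparseable becomes NaN
    return values.mask(is_trace, trace_value)

raw = pd.Series(["0.00", "T", "0.02s", None, "0.15"])
cleaned = clean_precip(raw)
```

Missing entries stay `NaN`, so a later `dropna()` still removes them if that is the desired behavior.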

In [None]:
%matplotlib inline
import seaborn; seaborn.set()
weather.plot(y='HOURLYPrecip', use_index=True)
plt.xlabel('August 2017')
plt.ylabel('Hourly Precipitation in Inches')

Creating another DataFrame for visibility.

In [None]:
vis = pd.read_csv("../data/1401011_weather_data.csv", delimiter = ',', usecols = (5,8),index_col='DATE')

In [None]:
vis.head()

In [None]:
vis.dtypes

Selecting the time period of August 2017.

In [None]:
visib = vis.loc['2017-08-01' : '2017-09-01']

In [None]:
visib.info()

Dropping NA values and entries that contain the string 'V'.

In [None]:
visib = visib.dropna()
visib = visib[~visib.HOURLYVISIBILITY.str.contains("V")]

In [None]:
visib.info()

Converting the visibility column to numeric in order to create a plot.

In [None]:
visib.HOURLYVISIBILITY=pd.to_numeric(visib.HOURLYVISIBILITY)

In [None]:
visib.plot(y='HOURLYVISIBILITY', use_index=True)
plt.xlabel('August 2017')
plt.ylabel('Hourly Visibility in miles')

Let's see what dates hourly visibility was below 1 mile

In [None]:
visib.loc[visib.HOURLYVISIBILITY < 1]

Let's see on what dates hourly precipitation was at least 0.5 inches

In [None]:
weather.loc[weather.HOURLYPrecip >= 0.5]

It seems that on August 18th precipitation was relatively high and visibility was low.
This is important because it can influence waiting time and crowding on busses.
For a better understanding, let's visualize it with seaborn plots.

In [None]:
weather.plot(y='HOURLYPrecip', use_index=True)
plt.xlabel('August 2017')
plt.ylabel('Hourly Precipitation in Inches')

visib.plot(y='HOURLYVISIBILITY', use_index=True)
plt.xlabel('August 2017')
plt.ylabel('Hourly Visibility in miles')

Indeed, we have low hourly visibility and high precipitation 
around the same time (Aug 18th).
We can expect this date to have longer waiting times and 
more crowded busses in our future model.

Let's see if there are any other dates that can influence our 
dependent variables.

In [None]:
print(visib.loc[visib.HOURLYVISIBILITY < 2])
print(weather.loc[weather.HOURLYPrecip >= 0.2])

It seems like on Aug 15th and 22nd there is 
a high chance of finding disruption of bus services as well.

Now, let's find out if there were any dates with high wind speed.

In [None]:
w = pd.read_csv("../data/1401011_weather_data.csv", delimiter = ',', usecols = (5,17), index_col='DATE')

In [None]:
w.head()

In [None]:
w.dtypes

In [None]:
print(w.shape)
print(len(w))

Selecting time period of aug 2017

In [None]:
wind = w.loc['2017-08-01' : '2017-09-01']

In [None]:
wind.info()

In [None]:
wind = wind.dropna()

In [None]:
wind.info()

In [None]:
wind.HOURLYWindSpeed.describe().astype(int)

In [None]:
wind.plot(y='HOURLYWindSpeed', use_index=True)
plt.xlabel('August 2017')
plt.ylabel('Hourly wind speed in mph')

From the plot, it looks like the highest wind speed occurred
towards the end of the month.

In [None]:
wind.loc[wind.HOURLYWindSpeed > 8]

Indeed, our observation was correct.
However, we know that on Aug 15th, 18th and 22nd 
our other two independent variables were relatively high/low.
Higher than usual wind speed during those days would 
probably influence our dependent variables even more.
Let's see what wind speed we had on Aug 15th, 18th and 22nd, 
but before that let's convert the index to datetime so we can use it later for 
creating our model.

In [None]:
wind = wind.reset_index()

In [None]:
wind.head()

In [None]:
wind.info()

In [None]:
wind.DATE = pd.to_datetime(wind.DATE)

In [None]:
wind.info()

In [None]:
print(wind.loc[(wind.DATE >= '2017-08-15') & (wind.DATE <= '2017-08-16') ].max())
print(wind.loc[(wind.DATE >= '2017-08-18') & (wind.DATE <= '2017-08-19') ].max())
print(wind.loc[(wind.DATE >= '2017-08-22') & (wind.DATE <= '2017-08-23') ].max())

As we can see, the max wind speed on Aug 18th and 22nd was 
8 and 11 mph respectively.
This can also influence our predictions.
We need to convert DATE to datetime for precipitation and visibility as well.

In [None]:
weather = weather.reset_index()

In [None]:
weather.DATE = pd.to_datetime(weather.DATE)

In [None]:
visib = visib.reset_index()
visib.DATE = pd.to_datetime(visib.DATE)

In [None]:
weather.info()

In [None]:
visib.info()

We are finished with cleaning and analyzing the weather data.
Our observations show that there are certain days when we can expect more crowding 
and longer wait times for the busses. 

Now we are ready to integrate the data into our model as independent variables and see
if our predictions were correct.
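
One way this integration might look is sketched below: each simulated passenger is given the most recent weather observation from the hour before they arrive. The function name `attach_weather`, the one-hour tolerance, and the toy values are assumptions; `DATE`, `HOURLYPrecip`, and `passengerArrivalTime` are the column names used earlier in this report.

```python
import pandas as pd

def attach_weather(passengers, weather, time_col="passengerArrivalTime"):
    """Join the latest weather reading (at most one hour old) onto each passenger row."""
    passengers = passengers.sort_values(time_col)
    weather = weather.sort_values("DATE")
    return pd.merge_asof(
        passengers, weather,
        left_on=time_col, right_on="DATE",
        direction="backward",                 # only use observations already made
        tolerance=pd.Timedelta(hours=1),      # assumed freshness limit
    )

weather_toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2017-08-18 07:51", "2017-08-18 08:51"]),
    "HOURLYPrecip": [0.0, 0.6],
})
passengers_toy = pd.DataFrame({"passengerArrivalTime": pd.to_datetime(
    ["2017-08-18 08:10", "2017-08-18 09:00", "2017-08-18 11:00"])})
merged = attach_weather(passengers_toy, weather_toy)
```

Rows with no observation inside the tolerance get `NaN` weather values, which keeps gaps in the weather record visible instead of silently reusing stale readings.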

# Excel Espina

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import KFold, train_test_split

warnings.filterwarnings("ignore")
random_state = 20181112
import datetime, math, glob

Adding data from the M100 csv file.

In [None]:
%%capture
df = pd.read_csv('../data/M100_month_W125_st.csv', on_bad_lines='skip')

# Choosing the Best Regressor

We want one or more regressors that can predict the **wait time** and **crowding** of a bus at a specific stop from the inputs **hourly weather** and **time of day**. We would most likely have two models, one predicting **wait time** and one predicting **crowding**.

Here are our top picks for regressors:

1. Gradient Boosting Machines ***(top pick)***:
    - Why: GBMs are typically a composite model that combines the efforts of multiple weak models to create a strong model, and each additional weak model reduces the mean squared error (MSE) of the overall model. Our goal would be to minimize MSE to increase the accuracy of our predictions.

1. Random Forest:
    - Why: it is less prone to overfitting than a single Decision Tree. Instead of one tree choosing its splits from just **hourly weather** and **time of day**, each tree in the forest is trained on a random sample of the data and considers a random subset of the features at each split, and the trees' predictions are averaged.

1. Decision Trees:  
    - Reduction in Standard Deviation (metric): This is a regression metric that measures how much we’ve reduced our uncertainty by picking a split point. By picking the best split each time the greedy decision tree training algorithm tries to form decisions with as few splits as possible.  
    - Hyperparameters:   
        * Max depth: Limit our tree to a `n` depth to prevent overfitting.
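
To make the reduction in standard deviation concrete, here is a minimal sketch that scores candidate split points on a single feature and picks the best one. The function names and the toy data (hour of day against wait time in minutes) are made up for illustration.

```python
import numpy as np

def std_reduction(x, y, threshold):
    """Drop in the standard deviation of y when splitting on x <= threshold."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    left, right = y[x <= threshold], y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    # size-weighted average of the two children's standard deviations
    weighted = (len(left) * left.std() + len(right) * right.std()) / len(y)
    return y.std() - weighted

def best_split(x, y):
    """Greedy choice: the threshold with the largest reduction."""
    candidates = np.unique(x)[:-1]        # splitting at the max leaves one side empty
    return max(candidates, key=lambda t: std_reduction(x, y, t))

hours = [6, 7, 8, 9, 17, 18]
waits = [5, 5, 6, 5, 15, 16]
split_hour = best_split(hours, waits)
```

In the toy data the morning waits cluster around 5 minutes and the evening waits around 15, so the best split separates the two groups.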
        

Evaluating our model:

Since we're creating regression models, we are interested in the ***mean squared error*** and ***R Squared***. The lower our ***mean squared error*** and the closer our ***R Squared*** is to 1, the more accurate our model. We intend to use **K-fold cross validation** as well as a **holdout set** as we improve our model through hyperparameter tuning. 
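
As a quick reference for the two metrics, the sketch below computes them by hand on made-up numbers. A lower mean squared error and an R Squared closer to 1 both indicate a better fit; a model that always predicts the mean scores an R Squared of 0.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average squared gap between truth and prediction."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions explain."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([4.0, 6.0, 10.0, 12.0])            # e.g. wait times in minutes
good = np.array([4.5, 5.5, 10.5, 11.5])              # close predictions
baseline = np.full_like(y_true, y_true.mean())       # always predict the mean

print(mse(y_true, good), r_squared(y_true, good))
print(mse(y_true, baseline), r_squared(y_true, baseline))
```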

    * Preventing 

# Data Cleaning

What we need to do:  

1. Clean and break up the time components (Hour, Mins, Secs) of the following:
    * `RecordedAtTime`
    * `ExpectedArrivalTime`
    * `ScheduledArrivalTime`
    
2. Store features of interest:
    * `RecordedAtTime`
    * `VehicleLocation.Longitude`
    * `VehicleLocation.Latitude`
    * `DistanceFromStop`
    * `ExpectedArrivalTime`
   

In [None]:
df['ScheduledArrivalTime'] = pd.to_datetime(df.ScheduledArrivalTime, errors='coerce')
df.dropna()
df['Scheduled_Hour'] = df['ScheduledArrivalTime'].dt.hour
df['Scheduled_Minute'] = df['ScheduledArrivalTime'].dt.minute
df['Scheduled_Seconds'] = df['ScheduledArrivalTime'].dt.second

df['RecordedAtTime'] = pd.to_datetime(df.RecordedAtTime)
df['Recorded_Hour'] = df['RecordedAtTime'].dt.hour
df['Recorded_Minute'] = df['RecordedAtTime'].dt.minute
df['Recorded_Seconds'] = df['RecordedAtTime'].dt.second

df['ExpectedArrivalTime'] = pd.to_datetime(df.ExpectedArrivalTime)
df['Expected_Hour'] = df['ExpectedArrivalTime'].dt.hour
df['Expected_Minute'] = df['ExpectedArrivalTime'].dt.minute
df['Expected_Seconds'] = df['ExpectedArrivalTime'].dt.second

In [None]:
df.dtypes

In [None]:
df.count()

In [None]:
features = (['VehicleLocation.Longitude', 
             'VehicleLocation.Latitude', 
             'OriginLong',
             'OriginLat',
             'DistanceFromStop',
             'Recorded_Hour',
             'Scheduled_Hour',
             'Scheduled_Minute',
             'Scheduled_Seconds',
             'Recorded_Minute',
             'Recorded_Seconds'
            ])

# the target is kept out of the feature list so the models cannot simply read off the answer
model_df = df[features + ['time_diff_bus_mins']].dropna().reset_index()

model_df.count()

# Plotting a Chart for Sanity

We want to have a frequency/histogram for each hour of the day and for each minute of the hour.

Credit: David

In [None]:
def ecdf(inputSeries, label):
    try:
        x = np.sort(inputSeries)
    except:
        print("Warning: Series Unsorted")
        x = inputSeries
    y = np.arange(1, len(x)+1) / len(x)
    _ = plt.plot(x, y, marker='.', linestyle='none')
    _ = plt.xlabel('Time Delta ({})'.format(label))
    _ = plt.ylabel('ECDF')
    plt.margins(0.02) # Keeps data off plot edges
    plt.show()
    
def hist(inputSeries, label):
    plt.hist(inputSeries, bins=25, density=True)
    _ = plt.xlabel('Time Delta ({})'.format(label))
    _ = plt.ylabel('PDF')
    plt.show()

In [None]:
M100_NICK_Avg = df[['Recorded_Hour','time_diff_bus_mins', 'Recorded_Minute']]
M100_Hour = M100_NICK_Avg.groupby('Recorded_Hour').mean().dropna()
M100_Min = M100_NICK_Avg.groupby('Recorded_Minute').mean().dropna()
M100_NICK_Avg.head()

In [None]:
ecdf(M100_Hour['time_diff_bus_mins'], "per Hour")
hist(M100_Hour['time_diff_bus_mins'], "per Hour")

ecdf(M100_Min['time_diff_bus_mins'], "per Minute")
hist(M100_Min['time_diff_bus_mins'], "per Minute")

# Saving our Progress

In [None]:
model_df.to_csv('M100_4_month_W125_st_timesplit.csv', encoding='utf-8', index=False)

Splitting training and testing datasets

In [None]:
train_df, holdout_df, y_train, y_holdout = train_test_split(model_df[features],
                                                    model_df['time_diff_bus_mins'],
                                                    test_size=0.3,
                                                    random_state=42)

train_df['time_diff_bus_mins'] = y_train
holdout_df['time_diff_bus_mins'] = y_holdout

train_df.reset_index(inplace=True)
holdout_df.reset_index(inplace=True)

print(train_df.shape[0], train_df.time_diff_bus_mins.mean())
print(holdout_df.shape[0], holdout_df.time_diff_bus_mins.mean())

# Model Training

Let's take a quick look at all of our classification model options using cross validation. For the tree based models, we'll use the hyperparameter `max_depth=6` as a naive attempt at avoiding overfitting before we dig deeper.

Let's fit and score the model, this time using cross validation:

In [None]:
k_fold = KFold(n_splits=5, shuffle=True, random_state=random_state)

In [None]:
def get_cv_results(classifier):
    
    results = []
    for train, test in k_fold.split(train_df):
        classifier.fit(train_df.loc[train, features], train_df.loc[train, 'time_diff_bus_mins'])
        y_predicted = classifier.predict(train_df.loc[test, features])
        accuracy = accuracy_score(train_df.loc[test, 'time_diff_bus_mins'], y_predicted)
        results.append(accuracy)
    
    return np.mean(results), np.std(results)
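
Because the stated goal is regression on wait time and crowding, a regression counterpart of `get_cv_results` might look like the sketch below. It swaps in scikit-learn's `GradientBoostingRegressor` and `mean_squared_error` in place of a classifier and `accuracy_score`; the function name `get_cv_mse`, the hyperparameters, and the synthetic hour-of-day data are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def get_cv_mse(regressor, X, y, n_splits=5, seed=0):
    """Mean and standard deviation of the per-fold mean squared error."""
    k_fold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in k_fold.split(X):
        regressor.fit(X[train_idx], y[train_idx])
        y_predicted = regressor.predict(X[test_idx])
        scores.append(mean_squared_error(y[test_idx], y_predicted))
    return np.mean(scores), np.std(scores)

# synthetic data: wait time varies smoothly with the hour of day, plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 24, size=(200, 1))
y = 5 + 3 * np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 0.5, size=200)

gbr = GradientBoostingRegressor(max_depth=3, n_estimators=50, random_state=0)
mean_mse, std_mse = get_cv_mse(gbr, X, y)
```

Comparing `mean_mse` with the variance of `y` gives a quick check that the model beats a constant prediction of the mean.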


Logistic Regression

In [None]:
logreg = LogisticRegression(
    random_state=random_state, 
    solver='lbfgs'
)

get_cv_results(logreg)

Decision Tree

In [None]:
dtree = DecisionTreeClassifier(
    random_state=random_state, 
    max_depth=6
)

get_cv_results(dtree)

Random Forest

In [None]:
rforest = RandomForestClassifier(
    random_state=random_state, 
    max_depth=6,
    n_estimators=100
)

get_cv_results(rforest)

Gradient Boosting Machines

In [None]:
gbm = GradientBoostingClassifier(
    random_state=random_state, 
    max_depth=6,
    n_estimators=100
)

get_cv_results(gbm)

# Evaluating Model Performance

We're using ROC curves to visually see which model performs the best.

In [None]:
def plot_roc(classifier, label, color):

    classifier.fit(train_df[features], train_df['time_diff_bus_mins'])
    y_prob = classifier.predict_proba(holdout_df[features])
    
    fpr, tpr, thresh = roc_curve(holdout_df['time_diff_bus_mins'], y_prob[:,1])
    plt.plot(fpr, tpr,
             label=label,
             color=color, linewidth=3)

    auc = roc_auc_score(holdout_df['time_diff_bus_mins'], y_prob[:,1])
    
    print('AUC: %0.3f (%s)' % (auc, label))

In [None]:
f1 = plt.figure(figsize=(14,6))

logreg = LogisticRegression(
    random_state=random_state, 
    solver='lbfgs'
)
plot_roc(logreg, 'Logistic Regression', 'green')

dtree = DecisionTreeClassifier(
    random_state=random_state, 
    max_depth=3
)
plot_roc(dtree, 'Decision Tree', 'red')

rforest = RandomForestClassifier(
    random_state=random_state, 
    max_depth=6,
    n_estimators=100
)
plot_roc(rforest, 'Random Forest', 'blue')

gbm = GradientBoostingClassifier(
    random_state=random_state, 
    max_depth=6,
    n_estimators=100
)
plot_roc(gbm, 'GBM', 'lightblue')

plt.legend(loc='lower right')