# Modeling M100 Lateness

Here we're trying to model the M100's lateness and simulated crowdedness in the St. Nicholas stop going to Inwood 220 St Via Amsterdam Via Bway. 

We are applying Datacamp's Decision-Tree for Classification

## Table of Contents:
1. [Choosing the Appropriate Classifier](#choosing-the-appropriate-classifier)
1. [Data Cleaning](#data-cleaning)
1. [Designing the model](#designing-the-model)
1. [Our Model](#our-model)
1. [Loading the Model](#loading-the-model)

\* Not finished yet

Prequisites (if you want to follow along/verify results)

In [1]:
# !pip3 install --user -U scikit-learn==0.18
# !pip3 install --user -U seaborn

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import KFold, train_test_split
import datetime as dt

warnings.filterwarnings("ignore")
random_state = 20181112
import datetime, math, glob

Adding data from the M100 csv file.

# Choosing the Appropriate Classifier

We want (a) regressor(s) that can predict the **wait time** and **crowding** of a bus at a specific stop with the inputs **hourly weather** and **time of day**. We would most likely have two models that predict each **wait time** and **crowding**.

Here are our top picks for regressors:

1. Gradient Boosting Machines ***(top pick)***:
    - Why: GBMs are typically a composite model that combines the efforts of multiple weak models to create a strong model, and each additional weak model reduces the mean squared error (MSE) of the overall model. Our goal would be to minimize MSE to increase the accuracy of our predictions.

1. Random Forest:
    - Why: does not suffer from the overfitting like with Decision Trees. Instead of randomly choosing to split from just **hourly weather** and **time of day**, we can have two trees that randomly split from each and find the best model. 

1. Decision Trees:  
    - Reduction in Standard Deviation (metric): This is a regression metric that measures how much we’ve reduced our uncertainty by picking a split point. By picking the best split each time the greedy decision tree training algorithm tries to form decisions with as few splits as possible.  
    - Hyperparameters:   
        * Max depth: Limit our tree to a `n` depth to prevent overfitting.
        

Evaluating our model:

Since we're creating regression models, we are interested in the ***mean squared error*** and ***R Squared***. The lower our ***R Squared*** the more accurate our model. We intend to use **K-fold cross validation** as well as a **holdout set** as we improve our model through hyperparameter tuning. 


# Data Cleaning

> Please checkout [this notebook](../Bus_Timeline/Excel_Bus_Timeline_Draft.ipynb) on how we did the cleaning process

1. Clean and break up the time components (Hour, Mins, Secs) of the following:
    * `RecordedAtTime`
2. Merge and store (we'll merge them based on the hour of the day and the day of the month):
    * Bus
        * `Hour`
        * `Min`
        * `Sec`
        * `Day`
    * Weather
        * `Hour`
        * `HourlyVisibility`
        * `HourlyPrecipitation`
        * `HourlyWindSpeed`
3. Features of interest:
    * `Hour`
    * `Min`
    * `Sec`
    * `HourlyVisibility`
    * `HourlyPrecipitation`
    * `HourlyWindSpeed`
4. Prediction result:
    * `timeTillNext`: estimated minutes remaining until next bus

In [40]:
df = pd.read_csv('../data/Merged_Bus_Weather.csv')

# Model Training I

Adapted from: https://shankarmsy.github.io/stories/gbrt-sklearn.html

In [67]:
%matplotlib inline

import pandas as pd
import numpy as np
import warnings, seaborn
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.model_selection import validation_curve

random_state = 42

np.random.seed(sum(map(ord, "aesthetics"))) 
seaborn.set_context('notebook') 
# pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier 
plt.rcParams['figure.figsize'] = (15, 5) # Set some Pandas options 
pd.set_option('display.notebook_repr_html', False) 
pd.set_option('display.max_columns', 40) 
pd.set_option('display.max_rows', 25) 
pd.options.display.max_colwidth = 50 


## Features, Targets and Splitting

In [42]:
df.drop(columns=['Unnamed: 0'], inplace=True)

In [43]:
df.dtypes

passengerArrivalTime     object
numPassengersPerBus       int64
BusDepartureTime         object
HOURLYVISIBILITY        float64
HOURLYWindSpeed         float64
HOURLYPrecip            float64
ArrivalHour               int64
ArrivalSeconds            int64
ArrivalMinutes            int64
dtype: object

In [47]:
features = (['ArrivalHour', 'ArrivalSeconds', 'ArrivalMinutes', 
             'HOURLYVISIBILITY', 'HOURLYWindSpeed', 'HOURLYPrecip'])

target = 'numPassengersPerBus'

model_df = df[(features + [target])].dropna().reset_index()

train_df, holdout_df, y_train, y_holdout = train_test_split(
    model_df[features], 
    model_df[target], test_size=0.2,
    random_state=random_state)

train_df[target] = y_train
holdout_df[target] = y_holdout

train_df.reset_index(inplace=True)
holdout_df.reset_index(inplace=True)

print(train_df.shape[0], train_df.numPassengersPerBus.mean())
print(holdout_df.shape[0], holdout_df.numPassengersPerBus.mean())

11547 26.82999913397419
2887 27.04052649809491


In [48]:
X_train = train_df
X_test = holdout_df
y_train = y_train
y_test = y_holdout

## The Gradient Boosting Regression Tree

In [49]:
gbrt=GradientBoostingRegressor(n_estimators=100) 
train_df.shape

gbrt.fit(train_df, y_train) 
y_pred=gbrt.predict(holdout_df) 

# Designing the model

### The Metrics

We're measure the following:
* R Squared : explains how well the independent variables explain the variability in the dependent.
* **Mean Squared Error:** a risk metric corresponding to the expected value of the squared (quadratic) error or loss. It is an estimator that measures the average of the squares of the errors. 
    
We're concerned about minimizing MSE as it leads to predictions that are close to the population

In [50]:
k_fold = KFold(n_splits=10, random_state=random_state)

Let's see what our hyperparameter should be here.  
NOTE: DO NOT RUN UNLESS YOU ARE WILLING TO SPEND > 2 HRS ON CALCULATIONS

In [59]:
learning_rates = [0.1, 0.05, 0.01]
min_samples_splits = range(20, 100, 20)
max_depths = [4, 6, 8]
n_estimators = range(200,1000,200)

all_mu = []
all_sigma = []

for depth in max_depths:
    for min_splits in min_samples_splits:
        for rate in learning_rates:
            for est in n_estimators:
                print("Depth:", depth, "Splits:", min_splits, "Rate:", rate, "n_estimators:", est, end=" ")
                gbrm=GradientBoostingRegressor(
                    random_state=random_state, 
                    max_depth=depth,
                    min_samples_split=min_splits,
                    learning_rate=rate,
                    n_estimators=est
                )

                mu, sigma = get_cv_results(gbrm)
                all_mu.append(mu)
                all_sigma.append(sigma)

                print(mu, sigma)

Depth: 4 Splits: 20 Rate: 0.1 n_estimators: 200 199.08979045383472 27.743574326432228
Depth: 4 Splits: 20 Rate: 0.1 n_estimators: 400 195.87222997968576 28.695973599225702
Depth: 4 Splits: 20 Rate: 0.1 n_estimators: 600 195.0206280215395 27.68516460653923
Depth: 4 Splits: 20 Rate: 0.1 n_estimators: 800 196.13702306024717 27.713527852404297
Depth: 4 Splits: 20 Rate: 0.05 n_estimators: 200 205.768138906593 27.68032162934012
Depth: 4 Splits: 20 Rate: 0.05 n_estimators: 400 199.3785383582092 27.327002414283733
Depth: 4 Splits: 20 Rate: 0.05 n_estimators: 600 196.88164015596283 27.726851981196273
Depth: 4 Splits: 20 Rate: 0.05 n_estimators: 800 196.08449372788883 27.791080941501143
Depth: 4 Splits: 20 Rate: 0.01 n_estimators: 200 276.6003638466901 34.509902854386986
Depth: 4 Splits: 20 Rate: 0.01 n_estimators: 400 227.54505950769402 30.23556738561684
Depth: 4 Splits: 20 Rate: 0.01 n_estimators: 600 214.6636742401838 28.13481326286243
Depth: 4 Splits: 20 Rate: 0.01 n_estimators: 800 209.5938

From the above, we've picked:  
```
max_depth=8,  
min_samples_split=40,  
learning_rate=0.01,  
n_estimators=600  
   
```

as our hyperparameters.

In [69]:
def get_cv_results(regressor):
    
    mse = []
    r2_scores = []
    
    for train, test in k_fold.split(train_df):
        regressor.fit(train_df.loc[train, features], train_df.loc[train, target])
        y_predicted = regressor.predict(train_df.loc[test, features])
        
        mean_squared = mean_squared_error(train_df.loc[test, target], y_predicted)
        mse.append(mean_squared)
        
        r2 = r2_score(train_df.loc[test, target], y_predicted)
        r2_scores.append(r2)
    
    return (np.mean(mse), np.std(mse), 
            np.mean(r2_scores), np.std(r2_scores))

In [71]:
gbm = GradientBoostingRegressor(
    random_state=random_state, 
    learning_rate = 0.01,
    min_samples_split=40,
    max_depth=8,
    n_estimators=600
)

results = get_cv_results(gbm)

print("Mean of mean squared error:", results[0])
print("Mean squared error std:", results[1])
print("Mean of R Squared:", results[2])
print("R Squared std:", results[3])

Mean of mean squared error: 175.60076094812706
Mean squared error std: 27.082691298358856
Mean of R Squared: 0.8724802563070385
R Squared std: 0.015549400642375276


## Saving our Model
Pretty good results. Now let's save this model for future use!

In [80]:
Hamlet_Crowding_Model = gbm

for train, test in k_fold.split(train_df):
    Hamlet_Crowding_Model.fit(train_df.loc[train, features], train_df.loc[train, target])



In [77]:
# !pip3 install -U --user joblib

In [81]:
# Saving
from joblib import dump, load
dump(Hamlet_Crowding_Model, '../data/GBRT_Hamlet_Crowding.joblib') 

['../data/GBRT_Hamlet_Crowding.joblib']

# Our Model

In [114]:
'''
    @params: 
        timeNow: The current time of day the person arrived to the bus stop
        currentHourlyWeather: The current hourly weather data. Has to have the following:
            + Hourly Precipitation
            + Hourly Visibility
            + Hourly Windspeeds
'''
def predictCrowding(timeNow, currentHourlyWeather):
    currData = pd.Series([])
    currData["ArrivalHour"] = timeNow.split(":")[0]
    currData["ArrivalMinutes"] = timeNow.split(":")[1]
    currData["ArrivalSeconds"] = timeNow.split(":")[2]
    currData["HOURLYVISIBILITY"] = currentHourlyWeather[0]
    currData["HOURLYWindSpeed"] = currentHourlyWeather[1]
    currData["HOURLYPrecip"] = currentHourlyWeather[2]
    
    return Hamlet_Crowding_Model.predict(currData)
    
    

Some sample data:
    
    19, 31, 39, 4.0, 6.0, 0.03 => 7
    0, 2, 48, 10.0, 6.0, 0.03, => 7  
    17, 14, 13, 10.0, 5.0, 0.00, => 8  

In [115]:
timeNow = "19:31:39"
weatherNow = [4.0, 6.0, 0.03]

predictCrowding(timeNow, weatherNow)[0]



8.107071301826297

In [116]:
timeNow = "00:02:48"
weatherNow = [10.0, 6.0, 0.03]

predictCrowding(timeNow, weatherNow)[0]



9.680940892445431

In [117]:
timeNow = "17:14:13"
weatherNow = [10.0, 5.0, 0.00]

predictCrowding(timeNow, weatherNow)[0]



7.4159927796028535

# Loading the Model


Credit: https://scikit-learn.org/stable/modules/model_persistence.html

In [None]:
# !pip3 install -U --user joblib

In [None]:
# Loading
model = load('../../data/GBRT_Hamlet.joblib') 