# Modeling M100 Lateness

This notebook models the M100's lateness and simulated crowdedness at the St. Nicholas stop, on the route going to Inwood 220 St Via Amsterdam Via Bway.

We are applying DataCamp's Decision-Tree for Classification.

## Table of Contents:
1. [Choosing the Appropriate Classifier](#choosing-the-appropriate-classifier)
1. [Plotting a Chart for Sanity](#plotting-a-chart-for-sanity)
1. [Saving our Progress](#saving-our-progress)
1. [Model Training](#model-training)\*
1. [Data Cleaning](#data-cleaning)\*

\* Not finished yet

Prerequisites (if you want to follow along or verify the results):

In [9]:
# !pip3 install --user -U scikit-learn==0.18
!pip3 install --user -U seaborn

Collecting seaborn
  Downloading https://files.pythonhosted.org/packages/a8/76/220ba4420459d9c4c9c9587c6ce607bf56c25b3d3d2de62056efe482dadc/seaborn-0.9.0-py3-none-any.whl (208kB)
Installing collected packages: seaborn
  Found existing installation: seaborn 0.8
    Uninstalling seaborn-0.8:
      Successfully uninstalled seaborn-0.8
Successfully installed seaborn-0.9.0


In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import KFold, train_test_split
import datetime as dt

warnings.filterwarnings("ignore")
random_state = 20181112
import datetime, math, glob

Adding data from the M100 csv file.

# Choosing the Appropriate Classifier

We want one or more regressors that predict the **wait time** and **crowding** of a bus at a specific stop from two inputs: **hourly weather** and **time of day**. We will most likely train two models, one for **wait time** and one for **crowding**.

Here are our top picks for regressors:

1. Gradient Boosting Machines ***(top pick)***:
    - Why: a GBM is an ensemble that adds weak models (usually shallow trees) one at a time, each fitted to the errors of the models before it, so each addition lowers the training mean squared error (MSE) of the overall model. Minimizing MSE is our goal, since a lower MSE means more accurate predictions.

1. Random Forest:
    - Why: a random forest is less prone to overfitting than a single decision tree, because it averages many trees, each trained on a random sample of the data and choosing each split from a random subset of the features (here, **hourly weather** and **time of day**).

1. Decision Trees:  
    - Reduction in Standard Deviation (metric): This regression metric measures how much a split point reduces our uncertainty about the target. By picking the best split at each step, the greedy decision tree training algorithm tries to form decisions with as few splits as possible.  
    - Hyperparameters:   
        * Max depth: Limit the tree to a depth of `n` to prevent overfitting.
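
To make the reduction in standard deviation concrete, here is a minimal sketch that scores one candidate split. The helper `std_reduction` and the passenger counts are hypothetical and only illustrate the metric; they are not taken from our data.

```python
import numpy as np

def std_reduction(y, left_mask):
    """Reduction in standard deviation from splitting y into two groups."""
    y = np.asarray(y, dtype=float)
    left_mask = np.asarray(left_mask, dtype=bool)
    left, right = y[left_mask], y[~left_mask]
    # Weighted average of the two children's standard deviations
    weighted_child_std = (len(left) * left.std() + len(right) * right.std()) / len(y)
    return y.std() - weighted_child_std

# Hypothetical passenger counts observed at six different hours
hours = np.array([6, 7, 8, 12, 13, 14])
passengers = np.array([40, 42, 38, 10, 12, 8])

# Candidate split "hour < 10" separates the morning rush from midday
reduction = std_reduction(passengers, hours < 10)
# A split that mixes busy and quiet hours reduces the spread far less
poor_reduction = std_reduction(passengers, hours % 2 == 0)
print(round(reduction, 2), round(poor_reduction, 2))
```

A greedy tree would prefer the first split, since it leaves each child with much less spread than the parent.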
        

Evaluating our model:

Since we're creating regression models, we are interested in the ***mean squared error*** (MSE) and ***R Squared***. The lower the MSE and the higher the R Squared, the more accurate our model. We intend to use both **K-fold cross validation** and a **holdout set** as we improve our model through hyperparameter tuning. 
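
As a quick illustration of how the two metrics behave, the sketch below compares a close prediction with one that always predicts the mean. The arrays are made-up values, not results from our models.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and two sets of predictions
y_true = np.array([20.0, 25.0, 30.0, 35.0])
good_pred = np.array([21.0, 24.0, 31.0, 34.0])   # off by one everywhere
mean_pred = np.full(4, y_true.mean())            # always predicts the mean

good_mse = mean_squared_error(y_true, good_pred)
good_r2 = r2_score(y_true, good_pred)
mean_mse = mean_squared_error(y_true, mean_pred)
mean_r2 = r2_score(y_true, mean_pred)

print("close predictions:", good_mse, good_r2)
print("mean predictions: ", mean_mse, mean_r2)
```

The closer predictions have the lower MSE and the higher R Squared, while always predicting the mean gives an R Squared of zero.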


# Data Cleaning

> Please check out [this notebook](../Bus_Timeline/Excel_Bus_Timeline_Draft.ipynb) to see how we did the cleaning process

1. Clean and break up the time components (Hour, Mins, Secs) of the following:
    * `RecordedAtTime`
2. Merge and store (we'll merge them based on the hour of the day and the day of the month):
    * Bus
        * `Hour`
        * `Min`
        * `Sec`
        * `Day`
    * Weather
        * `Hour`
        * `HourlyVisibility`
        * `HourlyPrecipitation`
        * `HourlyWindSpeed`
3. Features of interest:
    * `Hour`
    * `Min`
    * `Sec`
    * `HourlyVisibility`
    * `HourlyPrecipitation`
    * `HourlyWindSpeed`
4. Prediction result:
    * `timeTillNext`: estimated minutes remaining until next bus
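
The split-and-merge steps above can be sketched roughly as follows. The two small frames are hypothetical stand-ins for the real bus and weather tables, and giving the weather table a `Day` column is an assumption made so that the merge can use both the hour of the day and the day of the month.

```python
import pandas as pd

# Hypothetical bus records; only the column name RecordedAtTime comes from the steps above
bus = pd.DataFrame({
    "RecordedAtTime": ["2018-11-05 08:15:30", "2018-11-05 09:02:10", "2018-11-06 08:45:00"],
})
bus["RecordedAtTime"] = pd.to_datetime(bus["RecordedAtTime"])

# Step 1: break the timestamp into its components
bus["Day"] = bus["RecordedAtTime"].dt.day
bus["Hour"] = bus["RecordedAtTime"].dt.hour
bus["Min"] = bus["RecordedAtTime"].dt.minute
bus["Sec"] = bus["RecordedAtTime"].dt.second

# Hypothetical hourly weather (the Day column is an assumption)
weather = pd.DataFrame({
    "Day": [5, 5, 6],
    "Hour": [8, 9, 8],
    "HourlyVisibility": [10.0, 9.0, 4.0],
    "HourlyPrecipitation": [0.0, 0.0, 0.12],
    "HourlyWindSpeed": [5.0, 7.0, 11.0],
})

# Step 2: attach the matching hourly weather to every bus record
merged = bus.merge(weather, on=["Day", "Hour"], how="left")
print(merged)
```

A left merge keeps every bus record even when an hour of weather is missing, which leaves `NaN` values to be dropped or filled later.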

### Loading our Merged Tables

In [40]:
df = pd.read_csv('../data/Merged_Bus_Weather.csv')

# Model Training I

Adapted from: https://shankarmsy.github.io/stories/gbrt-sklearn.html

In [41]:
%matplotlib inline

import pandas as pd
import numpy as np
import warnings, seaborn
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import KFold, train_test_split
random_state = 42

np.random.seed(sum(map(ord, "aesthetics")))
seaborn.set_context('notebook')
# pd.set_option('display.mpl_style', 'default')  # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)

# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 25)
pd.options.display.max_colwidth = 50


## Features, Targets and Splitting

In [42]:
df.drop(columns=['Unnamed: 0'], inplace=True)

In [43]:
df.dtypes

passengerArrivalTime     object
numPassengersPerBus       int64
BusDepartureTime         object
HOURLYVISIBILITY        float64
HOURLYWindSpeed         float64
HOURLYPrecip            float64
ArrivalHour               int64
ArrivalSeconds            int64
ArrivalMinutes            int64
dtype: object

In [47]:
features = (['ArrivalHour', 'ArrivalSeconds', 'ArrivalMinutes', 
             'HOURLYVISIBILITY', 'HOURLYWindSpeed', 'HOURLYPrecip'])

target = 'numPassengersPerBus'

model_df = df[(features + [target])].dropna().reset_index()

train_df, holdout_df, y_train, y_holdout = train_test_split(
    model_df[features], 
    model_df[target], test_size=0.2,
    random_state=random_state)

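# Re-attach the target so each frame holds both the features and the target
train_df[target] = y_train
holdout_df[target] = y_holdout

# drop=True avoids adding the old index as a stray 'index' column
train_df.reset_index(drop=True, inplace=True)
holdout_df.reset_index(drop=True, inplace=True)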

print(train_df.shape[0], train_df.numPassengersPerBus.mean())
print(holdout_df.shape[0], holdout_df.numPassengersPerBus.mean())

11547 26.82999913397419
2887 27.04052649809491


In [48]:
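# Keep the target out of the feature matrices so it cannot leak into the model
X_train = train_df[features]
X_test = holdout_df[features]
y_train = train_df[target]
y_test = holdout_df[target]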

## The Gradient Boosting Regression Tree

In [49]:
gbrt = GradientBoostingRegressor(n_estimators=100)

# Fit on the feature columns only; train_df also contains the target column
gbrt.fit(train_df[features], train_df[target])
y_pred = gbrt.predict(holdout_df[features])
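
To see how the error changes as trees are added to a gradient boosting model, we can use `staged_predict`. This is a minimal sketch on synthetic data (a made-up hourly pattern plus noise), not on our merged table.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data: one "hour of day" feature and a noisy periodic target
rng = np.random.RandomState(0)
X = rng.uniform(0, 24, size=(300, 1))
y = 20 + 15 * np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 2, size=300)

model = GradientBoostingRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# staged_predict yields the predictions after 1, 2, ..., n_estimators trees
train_mse = [mean_squared_error(y, pred) for pred in model.staged_predict(X)]
print(train_mse[0], train_mse[9], train_mse[-1])
```

The training MSE falls as trees are added. The same loop run on a holdout set would show where extra trees stop helping.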

## Designing the model

In [50]:
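# No shuffling here: the rows of train_df were already shuffled by train_test_split.
# (Recent scikit-learn versions reject random_state unless shuffle=True.)
k_fold = KFold(n_splits=10)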

Setting up models:

In [51]:
def get_cv_results(regressor):
    """Return the mean and standard deviation of the MSE over the k_fold splits of train_df."""
    mse = []
    for train, test in k_fold.split(train_df):
        regressor.fit(train_df.loc[train, features], train_df.loc[train, target])
        y_predicted = regressor.predict(train_df.loc[test, features])
        
        mean_squared = mean_squared_error(train_df.loc[test, target], y_predicted)
        mse.append(mean_squared)
    
    return np.mean(mse), np.std(mse)
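
For comparison, roughly the same per-fold MSE can be obtained from scikit-learn's `cross_val_score`. The sketch below runs on synthetic data so that it stands alone; in this notebook the inputs would be `train_df[features]` and `train_df[target]`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training data
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))
y = 10 * X[:, 0] + rng.normal(0, 0.5, size=200)

scores = cross_val_score(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    X, y,
    cv=KFold(n_splits=5),
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
)
mse_per_fold = -scores
print(mse_per_fold.mean(), mse_per_fold.std())
```

The hand-written loop makes each step visible, while `cross_val_score` is shorter and harder to get wrong.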

In [52]:
gbm = GradientBoostingRegressor(
    random_state=random_state, 
    learning_rate = 0.01,
    min_samples_split=4,
    max_depth=6,
    n_estimators=100
)

results = get_cv_results(gbm)

print("Mean of mean squared error:", results[0])
print("Mean squared error std:", results[1])

Mean of mean squared error: 361.38612937537414
Mean squared error std: 36.11640817377122


How about we change some hyperparameters and see what the outcomes are?

In [54]:
gbm = GradientBoostingRegressor(
    random_state=random_state, 
    learning_rate = 0.05,
    min_samples_split=4,
    max_depth=4,
    n_estimators=600
)

results = get_cv_results(gbm)

print("Mean of mean squared error:", results[0])
print("Mean squared error std:", results[1])

Mean of mean squared error: 197.64028769727676
Mean squared error std: 27.16784083705199


In [55]:
gbm = GradientBoostingRegressor(
    random_state=random_state, 
    learning_rate = 0.01,
    min_samples_split=100,
    max_depth=4,
    n_estimators=800
)

results = get_cv_results(gbm)

print("Mean of mean squared error:", results[0])
print("Mean squared error std:", results[1])

Mean of mean squared error: 211.13588172527088
Mean squared error std: 28.384793002154694


In [None]:
learning_rates = [0.1, 0.05, 0.01]
min_samples_splits = range(20, 100, 20)
max_depths = [4, 6, 8]
n_estimators = range(200,1000,200)

all_mu = []
all_sigma = []

for depth in max_depths:
    for min_splits in min_samples_splits:
        for rate in learning_rates:
            for est in n_estimators:
                print("Depth:", depth, "Splits:", min_splits, "Rate:", rate, "n_estimators:", est, end=" ")
                gbrm=GradientBoostingRegressor(
                    random_state=random_state, 
                    max_depth=depth,
                    min_samples_split=min_splits,
                    learning_rate=rate,
                    n_estimators=est
                )

                mu, sigma = get_cv_results(gbrm)
                all_mu.append(mu)
                all_sigma.append(sigma)

                print(mu, sigma)
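
The nested loops above could also be written with scikit-learn's `GridSearchCV`, which runs the same kind of exhaustive search and records the score of every combination. This is a sketch on synthetic data with a deliberately small grid, not the grid we used.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the training data
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 0.5, size=200)

# A small illustrative grid: 2 x 2 x 2 = 8 combinations
param_grid = {
    "max_depth": [2, 4],
    "learning_rate": [0.1, 0.05],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=3),
)
search.fit(X, y)

print(search.best_params_, -search.best_score_)
```

`search.cv_results_` holds the mean and standard deviation of the score for each combination, which is the same information we collect in `all_mu` and `all_sigma`.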

In [None]:
plt.figure(figsize=(14, 5))
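# One point per hyperparameter combination, in the order the grid search ran
plt.plot(range(len(all_mu)), all_mu)
plt.ylabel('Cross Validation Mean MSE')
plt.xlabel('Hyperparameter combination (grid search order)')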

In [None]:
plt.figure(figsize=(14, 5))
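# One point per hyperparameter combination, in the order the grid search ran
plt.plot(range(len(all_sigma)), all_sigma)
plt.ylabel('Cross Validation Std Dev. of MSE')
plt.xlabel('Hyperparameter combination (grid search order)')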

In [None]:
def plot_predictions(regressor, label, color):
    # ROC curves and AUC apply to classifiers, so for this regression target we
    # plot predicted against actual values on the holdout set and report MSE and R^2.
    regressor.fit(train_df[features], train_df[target])
    y_pred = regressor.predict(holdout_df[features])
    
    plt.scatter(holdout_df[target], y_pred,
                label=label,
                color=color, alpha=0.5)
    plt.xlabel('Actual ' + target)
    plt.ylabel('Predicted ' + target)
    plt.legend()

    mse = mean_squared_error(holdout_df[target], y_pred)
    r2 = r2_score(holdout_df[target], y_pred)
    
    print('MSE: %0.3f, R^2: %0.3f (%s)' % (mse, r2, label))

In [None]:
f1 = plt.figure(figsize=(14,6))
gbm = GradientBoostingRegressor(
    random_state=random_state, 
    learning_rate = 0.01,
    min_samples_split=4,
    max_depth=6,
    n_estimators=100
)

plot_predictions(gbm, 'GBM', 'lightblue')


# Saving/Loading the Model


Credit: https://scikit-learn.org/stable/modules/model_persistence.html

In [None]:
# !pip3 install -U --user joblib

In [None]:
# Saving
from joblib import dump, load
dump(gbm, '../../data/GBRT_Hamlet.joblib')  # gbm is the fitted GradientBoostingRegressor from above

In [None]:
# Loading
model = load('../../data/GBRT_Hamlet.joblib') 
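
As a sanity check, a saved model should give the same predictions after it is loaded back. The sketch below does that round trip with a small model trained on synthetic data and a temporary file, so it does not touch the paths used above.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import GradientBoostingRegressor

# Small model on synthetic data, only to demonstrate the round trip
rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 3))
y = 10 * X[:, 0] + rng.normal(0, 0.5, size=100)
model = GradientBoostingRegressor(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")  # hypothetical file name
    dump(model, path)
    restored = load(path)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

Files written by joblib are pickles, so they should only be loaded from trusted sources and ideally with the scikit-learn version that wrote them.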