# Model Building

This notebook walks through testing different model variations to find the parameters that best suit this application. It covers:

* Model Type
* Metric Evaluation
* Hyperparameter Tuning


In [34]:
from sklearn.ensemble import RandomForestRegressor
import sklearn.metrics as metrics
from scipy.stats import pearsonr
import pandas as pd
import numpy as np
from datetime import datetime
import statistics
import matplotlib.pyplot as plt
import pymysql
import config
import transformations

In [35]:
conn = pymysql.connect(host=config.host, user=config.username, port=config.port,
                       passwd=config.password)

#gather all historical data to build model
RideWaits = pd.read_sql_query("call DisneyDB.RideWaitQuery", conn)

#transform data for model building
RideWaits = transformations.transformData(RideWaits)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
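
The messages above are pandas `SettingWithCopyWarning` output, which is commonly triggered when a column is assigned on a filtered slice of a DataFrame. The internals of `transformations.transformData` are not shown in this notebook, so the following is only a minimal sketch. The frame `raw`, the `Wait >= 0` filter, and the `Weekend` rule are hypothetical. It shows one common way to avoid the warning: take an explicit copy of the slice, then assign with `.loc`.

```python
import pandas as pd

# Hypothetical stand-in for the raw query result
raw = pd.DataFrame({
    "Name": ["Astro Orbiter", "Dumbo the Flying Elephant", "Astro Orbiter"],
    "Wait": [35, 40, -1],
    "DayOfWeek": [0, 5, 6],
})

# Filtering returns a new object; .copy() makes it an independent frame,
# so later assignments are unambiguous and do not trigger the warning
valid = raw[raw["Wait"] >= 0].copy()

# Assign through .loc with a row and column indexer, as the warning suggests
valid.loc[:, "Weekend"] = (valid["DayOfWeek"] >= 5).astype(int)
print(valid)
```

The original frame `raw` is left untouched, which also makes the transformation easier to rerun.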


In [36]:
RideWaits.head()

Unnamed: 0,RideId,Date,Time,Wait,Name,OpeningDate,Tier,Location,IntellectualProp,Status,...,DayOfWeek,Weekend,CharacterExperience,inEMH,validTime,EMHDay,TimeSinceOpen,TimeSinceMidday,MagicHourType,TimeSinceRideOpen
0,0,2018-05-07,12:30:00,35,Astro Orbiter,1995-02-25,minor_attraction,Tomorrowland,,1,...,0,0,0,0,1,1,3,2,Night,8472
1,2,2018-05-07,12:30:00,45,Big Thunder Mountain Railroad,1980-09-23,headliner,Frontierland,,1,...,0,0,0,0,1,1,3,2,Night,13740
2,3,2018-05-07,12:30:00,45,Buzz Lightyears Space Ranger Spin,1998-10-07,minor_attraction,Tomorrowland,Pixar,1,...,0,0,0,0,1,1,3,2,Night,7152
3,6,2018-05-07,12:30:00,40,Dumbo the Flying Elephant,1971-10-01,minor_attraction,Fantasyland,AnimatedClassic,1,...,0,0,0,0,1,1,3,2,Night,17020
4,7,2018-05-07,12:30:00,25,Enchanted Tales with Belle,2012-12-06,minor_attraction,Fantasyland,AnimatedClassic,1,...,0,0,0,0,1,1,3,2,Night,1978


The data frame looks quite different from the one in the previous exploratory analysis. Certain columns have been removed to consolidate the data and keep the information that has been shown to make a difference in wait times.

In [37]:
keyFeatures = ["Name","MagicHourType", "Tier", "IntellectualProp", "SimpleStatus", "ParkName", "DayOfWeek", "Weekend", "TimeSinceOpen", "CharacterExperience", "TimeSinceMidday", "inEMH", "EMHDay"]


In [38]:
keyFeatures

['Name',
 'MagicHourType',
 'Tier',
 'IntellectualProp',
 'SimpleStatus',
 'ParkName',
 'DayOfWeek',
 'Weekend',
 'TimeSinceOpen',
 'CharacterExperience',
 'TimeSinceMidday',
 'inEMH',
 'EMHDay']

So far we have established that this list of features has the most impact on wait times and gives the most insight into the dataset.
They can be grouped by the kind of information they provide:
* Ride Characteristics
    * Name
    * Tier
    * IntellectualProp (Intellectual Property)
    * ParkName (Which park is this ride located in)
    * CharacterExperience (Is this a character experience)
* Time of Day Information
    * DayOfWeek
    * Weekend 
    * TimeSinceOpen (hours since the park opened that day)
    * TimeSinceMidday (absolute number of hours from 2pm)
    * inEMH (whether this wait time falls in an Extra Magic Hour window)
    * EMHDay (whether this day is an Extra Magic Hour day at the park)
    * MagicHourType (whether it is an extra magic morning or an extra magic night)
* Weather
    * SimpleStatus
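
To make the time-based definitions above concrete, here is a rough sketch of how such features could be derived with pandas. The real logic lives in `transformations.transformData` and is not shown in this notebook, so the `Hour` and `OpenHour` columns, the whole-hour arithmetic, and the `DayOfWeek >= 5` weekend rule are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical observations; OpenHour is an assumed park opening hour
obs = pd.DataFrame({
    "Date": pd.to_datetime(["2018-05-07", "2018-05-12"]),
    "Hour": [12, 20],      # hour of day of the wait-time reading
    "OpenHour": [9, 8],    # assumed opening hour for that day
})

obs["DayOfWeek"] = obs["Date"].dt.dayofweek            # Monday = 0
obs["Weekend"] = (obs["DayOfWeek"] >= 5).astype(int)   # Saturday/Sunday
obs["TimeSinceOpen"] = obs["Hour"] - obs["OpenHour"]   # hours since opening
obs["TimeSinceMidday"] = (obs["Hour"] - 14).abs()      # distance from 2pm
print(obs)
```

With these assumptions, a 12:00 reading on a Monday with a 9:00 opening gives `TimeSinceOpen` of 3 and `TimeSinceMidday` of 2, which agrees with the first rows of `RideWaits.head()`.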
    
As the dataset grows and more months and weather characteristics accumulate, this list of usable features may expand. For example, temperature is not included because all the data gathered so far comes from a single week, so the temperature pattern adds no value.

In [39]:
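# columns stored with the pandas 'category' dtype
categoryColumns = RideWaits.select_dtypes(include = ['category']).columns
# replace Name, then every category column, with its integer codes
RideWaits["Name"] = pd.Categorical(RideWaits["Name"]).codes
for col in categoryColumns:
    RideWaits[col] = pd.Categorical(RideWaits[col]).codes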
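
Replacing labels with `.codes` discards the original strings, so it can help to keep the code-to-label mapping for interpreting the model later. The sketch below uses a small hypothetical series (the value `major_attraction` is invented for the example) to show how the codes are assigned and how missing values are handled.

```python
import pandas as pd

# Hypothetical tier labels
tiers = pd.Series(["headliner", "minor_attraction", "headliner", "major_attraction"])

cat = pd.Categorical(tiers)
codes = cat.codes                           # integer code per row
mapping = dict(enumerate(cat.categories))   # code -> original label

print(list(codes))
print(mapping)

# Missing values receive the code -1 rather than a category of their own
ip_codes = pd.Categorical(["Pixar", None, "AnimatedClassic"]).codes
print(list(ip_codes))
```

Categories are sorted before codes are assigned, so the integers carry no ordinal meaning; a tree-based model can still split on them, but the ordering is arbitrary.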


In [40]:
RideWaits[keyFeatures].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25439 entries, 0 to 26238
Data columns (total 13 columns):
Name                   25439 non-null int8
MagicHourType          25439 non-null int8
Tier                   25439 non-null int8
IntellectualProp       25439 non-null int8
SimpleStatus           25439 non-null int8
ParkName               25439 non-null int8
DayOfWeek              25439 non-null int64
Weekend                25439 non-null int64
TimeSinceOpen          25439 non-null int64
CharacterExperience    25439 non-null int64
TimeSinceMidday        25439 non-null int64
inEMH                  25439 non-null int64
EMHDay                 25439 non-null int64
dtypes: int64(7), int8(6)
memory usage: 2.9 MB


In [41]:
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(RideWaits[keyFeatures], RideWaits["Wait"], test_size = .25, random_state = 1)

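# baseline: random forest with default hyperparameters
rf = RandomForestRegressor(random_state = 1)
rf.fit(train_x, train_y)
predictions = rf.predict(test_x)
# note: sklearn documents its metrics as metric(y_true, y_pred); r2_score and
# explained_variance_score are not symmetric, so the values reported below
# reflect the (predictions, test_y) order used here
rmseBase = metrics.mean_squared_error(predictions, test_y)**(1/2)
r2Base = metrics.r2_score(predictions, test_y)
varBase = metrics.explained_variance_score(predictions,test_y)
pearsoncorrBase = pearsonr(predictions, test_y)
# "accuracy" is defined as 1 minus the median absolute percentage error
perrorBase = abs(predictions - test_y)/test_y
accuracyBase = 1 - statistics.median(perrorBase)
# absolute error in minutes
errorBase = abs(predictions - test_y)
merrorBase = errorBase.mean()
medErrorBase = statistics.median(errorBase)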

In [42]:
rmseBase

9.028194219303769

In [43]:
r2Base

0.87167634408361439

In [44]:
varBase

0.87167751949410999

In [45]:
pearsoncorrBase

(0.939408125858892, 0.0)

In [46]:
accuracyBase

0.8666666666666667

In [47]:
merrorBase

5.5866567852229121

In [48]:
medErrorBase

3.5

Using the Random Forest regressor's defaults gives us baseline statistics for this algorithm. The Pearson correlation of 0.94 is high, meaning our predictions follow the trend of the data. Accuracy, defined here as one minus the median percentage error, is 87% for the base model, which is not a bad start. The mean and median absolute errors are about 5.6 and 3.5 minutes, both well under 10 minutes, which is also a good place to start. We can now try other modeling methods as well as tune the hyperparameters of our random forest model.
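
Since the same set of statistics is computed for every model in this notebook, the calculation could be wrapped in one helper. The function below is a sketch, not the notebook's code: the name `evaluate` and the dictionary keys are invented here. It passes arguments in scikit-learn's documented `(y_true, y_pred)` order, so its R² and explained variance can differ slightly from values computed with the arguments swapped. It also skips rows whose actual wait is zero when computing percentage error, which is an assumption added to avoid division by zero.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn import metrics

def evaluate(y_true, y_pred):
    """Return the notebook's evaluation statistics as a dict (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_pred - y_true)
    nonzero = y_true != 0                      # assumed guard against division by zero
    pct_err = abs_err[nonzero] / y_true[nonzero]
    return {
        "rmse": float(np.sqrt(metrics.mean_squared_error(y_true, y_pred))),
        "r2": float(metrics.r2_score(y_true, y_pred)),
        "explained_variance": float(metrics.explained_variance_score(y_true, y_pred)),
        "pearson_r": float(pearsonr(y_true, y_pred)[0]),
        "accuracy": float(1 - np.median(pct_err)),   # 1 - median percentage error
        "mean_abs_error": float(abs_err.mean()),
        "median_abs_error": float(np.median(abs_err)),
    }

# Toy values, not taken from the dataset
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 40]
scores = evaluate(actual, predicted)
print(scores)

# R^2 depends on argument order; RMSE does not
print(metrics.r2_score(actual, predicted), metrics.r2_score(predicted, actual))
```

In the notebook this would be called as `evaluate(test_y, predictions)` after each model is fitted.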

We will start by tuning the hyperparameters for this model.

Our focus will be on the following parameters: 
* n_estimators
* max_features
* max_depth
* min_samples_split
* min_samples_leaf
* bootstrap

In [49]:
from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10,110, num = 11)]
max_depth.append(None)
min_samples_split = [2,5,10]
min_samples_leaf = [1,2,4]

bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
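
To get a sense of how large this search space is, the number of distinct combinations can be counted with scikit-learn's `ParameterGrid`. The snippet below rebuilds the same grid so that it runs on its own; the `n_iter = 100` figure is the value passed to the randomized search in the next cell.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as above, rebuilt so the snippet is self-contained
random_grid = {
    'n_estimators': [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)],
    'max_features': ['auto', 'sqrt'],
    'max_depth': [int(x) for x in np.linspace(10, 110, num = 11)] + [None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

n_candidates = len(ParameterGrid(random_grid))
print(n_candidates)                 # 10 * 2 * 12 * 3 * 3 * 2 = 4320

n_iter = 100
print(round(n_iter / n_candidates, 4))   # fraction of the grid sampled
```

With 4320 possible combinations, 100 random draws cover roughly 2% of the grid, which is why the randomized search is used as a coarse first pass.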

In [50]:
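# RandomizedSearchCV clones the estimator it is given, so the baseline rf
# (random_state = 1) serves only as a template and is not modified
rf_randomized = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose = 2, random_state = 1, n_jobs = -1)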

In [52]:
rf_randomized.fit(train_x, train_y)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  8.9min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 18.5min finished


RandomizedSearchCV(cv=3, error_score='raise',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=1, verbose=0, warm_start=False),
          fit_params=None, iid=True, n_iter=100, n_jobs=-1,
          param_distributions={'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]},
          pre_dispatch='2*n_jobs', random_state=1, refit=True,
          return_train_score=True, scoring=None, verbose=2)

In [53]:
rf_randomized.best_params_

{'bootstrap': True,
 'max_depth': 70,
 'max_features': 'auto',
 'min_samples_leaf': 1,
 'min_samples_split': 5,
 'n_estimators': 1800}

In [54]:
rf_random = rf_randomized.best_estimator_
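
Beyond `best_params_`, a fitted search exposes every candidate's cross-validated score through `cv_results_`, which helps judge whether the winner is clearly better or just one of many near-equivalent settings. The sketch below runs a deliberately tiny search on synthetic data from `make_regression` so that it executes quickly; the data and the small parameter lists are placeholders, not the notebook's.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic placeholder data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_distributions={"n_estimators": [10, 20, 30], "max_depth": [3, 5, None]},
    n_iter=4, cv=3, random_state=1,
)
search.fit(X, y)

# One row per sampled candidate, best-ranked first
results = pd.DataFrame(search.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
top = results[cols].sort_values("rank_test_score").head()
print(top)
```

The same two lines applied to `rf_randomized.cv_results_` would list the top candidates of the search above. For a regressor with `scoring=None`, `mean_test_score` is the mean R² across folds.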

In [55]:
predictions = rf_random.predict(test_x)
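# same (predictions, test_y) argument order as the baseline cell, so the scores are directly comparable
rmseRandom = metrics.mean_squared_error(predictions, test_y)**(1/2)
r2Random = metrics.r2_score(predictions, test_y)
varRandom = metrics.explained_variance_score(predictions,test_y)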
pearsoncorrRandom = pearsonr(predictions, test_y)
perrorRandom = abs(predictions - test_y)/test_y
accuracyRandom = 1 - statistics.median(perrorRandom)
errorRandom = abs(predictions - test_y)
merrorRandom = errorRandom.mean()
medErroRandom = statistics.median(errorRandom)

In [56]:
rmseRandom

8.7205278867899914

In [57]:
varRandom

0.87692568282077343

In [58]:
r2Random

0.8769255514304688

In [59]:
pearsoncorrRandom

(0.9434117447146374, 0.0)

In [61]:
accuracyRandom

0.86243886901513289

In [63]:
merrorRandom

5.5176649027601119

In [64]:
medErroRandom

3.4209986464153177

We see a very slight decrease in RMSE (9.03 to 8.72), mean error, and median error, which is an improvement, but no increase in overall accuracy, which dips from 0.867 to 0.862.

In [65]:
gridSearch = {'bootstrap': [True],
              'max_depth' : [50,60,70,80,90],
              'max_features': [6,7,8,9,10],
              'min_samples_leaf': [1,2,3,4],
              'n_estimators' : [200,300,500,1000,1500]}

In [67]:
from sklearn.model_selection import GridSearchCV
rf = RandomForestRegressor()
grid_search_rf = GridSearchCV(estimator = rf, param_grid = gridSearch, cv = 3, n_jobs = -1, verbose = 2)

In [68]:
grid_search_rf.fit(train_x, train_y)

Fitting 3 folds for each of 500 candidates, totalling 1500 fits


[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   59.4s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed: 13.3min
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed: 24.4min
[Parallel(n_jobs=-1)]: Done 997 tasks      | elapsed: 38.5min
[Parallel(n_jobs=-1)]: Done 1442 tasks      | elapsed: 55.7min
[Parallel(n_jobs=-1)]: Done 1500 out of 1500 | elapsed: 58.4min finished


GridSearchCV(cv=3, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'bootstrap': [True], 'max_depth': [50, 60, 70, 80, 90], 'max_features': [6, 7, 8, 9, 10], 'min_samples_leaf': [1, 2, 3, 4], 'n_estimators': [200, 300, 500, 1000, 1500]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=2)

In [69]:
grid_search_rf.best_params_

{'bootstrap': True,
 'max_depth': 50,
 'max_features': 7,
 'min_samples_leaf': 1,
 'n_estimators': 500}

In [71]:
best_grid = grid_search_rf.best_estimator_

In [81]:
predictions = best_grid.predict(test_x)
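# same (predictions, test_y) argument order as the baseline cell, so the scores are directly comparable
rmseGrid = metrics.mean_squared_error(predictions, test_y)**(1/2)
r2Grid = metrics.r2_score(predictions, test_y)
varGrid = metrics.explained_variance_score(predictions,test_y)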
pearsoncorrGrid = pearsonr(predictions, test_y)
perrorGrid = abs(predictions - test_y)/test_y
accuracyGrid = 1 - statistics.median(perrorGrid)
errorGrid = abs(predictions - test_y)
merrorGrid = errorGrid.mean()
medErrorGrid = statistics.median(errorGrid)

In [73]:
rmseGrid

8.8481542038575434

In [74]:
r2Grid

0.87499851414237972

In [75]:
varGrid

0.87499977527551553

In [76]:
pearsoncorrGrid

(0.94174301332456123, 0.0)

In [77]:
accuracyGrid

0.86451547619047664

In [79]:
merrorGrid

5.5206333879364511

In [82]:
medErrorGrid

3.3572023809523781

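Compared with the randomized search, the grid search gives a slight increase in accuracy (0.862 to 0.865) and a slightly lower median error, but RMSE and mean error are marginally worse. Across the board, the differences from the baseline are negligible.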
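
The three sets of results are easier to compare side by side. The sketch below collects the values printed in the cells above (rounded to four decimals) into one table and shows each model's difference from the baseline. The row and column labels are names chosen here for readability.

```python
import pandas as pd

# Values copied from the notebook outputs above, rounded to four decimals
summary = pd.DataFrame(
    {
        "rmse": [9.0282, 8.7205, 8.8482],
        "r2": [0.8717, 0.8769, 0.8750],
        "accuracy": [0.8667, 0.8624, 0.8645],
        "mean_abs_error": [5.5867, 5.5177, 5.5206],
        "median_abs_error": [3.5000, 3.4210, 3.3572],
    },
    index=["baseline", "randomized_search", "grid_search"],
)

# Difference of each model from the baseline row
delta = summary - summary.loc["baseline"]
print(summary)
print(delta.round(4))
```

Every difference is well under one minute of wait time or one percentage point, which supports the reading that tuning had little effect with the current week of data.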