# Model Building

This notebook walks through testing different model variations to find the parameters best suited to this application. It covers:

* Model Type
* Metric Evaluation
* Hyper Parameter Tuning 


In [1]:
import sys

In [2]:
!{sys.executable} -m pip install pymysql



In [3]:
from sklearn.ensemble import RandomForestRegressor
import sklearn.metrics as metrics
from scipy.stats import pearsonr
import pandas as pd
import numpy as np
from datetime import datetime
import statistics
import matplotlib.pyplot as plt
import pymysql
import config
import transformations

In [4]:
# connection settings live in the local config module
conn = pymysql.connect(host=config.host, user=config.username, port=config.port,
                       password=config.password)

# gather all historical data to build the model
RideWaits = pd.read_sql_query("call DisneyDB.RideWaitQuery", conn)

# transform data for model building
RideWaits = transformations.transformData(RideWaits)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)


In [5]:
RideWaits.head()

| | RideId | Date | Time | Wait | Name | OpeningDate | Tier | Location | IntellectualProp | Status | ... | DayOfWeek | Weekend | CharacterExperience | inEMH | validTime | EMHDay | TimeSinceOpen | TimeSinceMidday | MagicHourType | TimeSinceRideOpen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2018-05-07 | 12:30:00 | 35 | Astro Orbiter | 1995-02-25 | minor_attraction | Tomorrowland | | 1 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 3 | 2 | Night | 8472 |
| 1 | 2 | 2018-05-07 | 12:30:00 | 45 | Big Thunder Mountain Railroad | 1980-09-23 | headliner | Frontierland | | 1 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 3 | 2 | Night | 13740 |
| 2 | 3 | 2018-05-07 | 12:30:00 | 45 | Buzz Lightyears Space Ranger Spin | 1998-10-07 | minor_attraction | Tomorrowland | Pixar | 1 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 3 | 2 | Night | 7152 |
| 3 | 6 | 2018-05-07 | 12:30:00 | 40 | Dumbo the Flying Elephant | 1971-10-01 | minor_attraction | Fantasyland | AnimatedClassic | 1 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 3 | 2 | Night | 17020 |
| 4 | 7 | 2018-05-07 | 12:30:00 | 25 | Enchanted Tales with Belle | 2012-12-06 | minor_attraction | Fantasyland | AnimatedClassic | 1 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 3 | 2 | Night | 1978 |


The data frame looks quite different from the one in the previous exploratory analysis. Certain columns have been removed to consolidate the data and keep the information that has been shown to make a difference in wait times.

In [6]:
keyFeatures = ["Name","MagicHourType", "Tier", "IntellectualProp", "SimpleStatus", "ParkName", "DayOfWeek", "Weekend", "TimeSinceOpen", "CharacterExperience", "TimeSinceMidday", "inEMH", "EMHDay"]


In [7]:
keyFeatures

['Name',
 'MagicHourType',
 'Tier',
 'IntellectualProp',
 'SimpleStatus',
 'ParkName',
 'DayOfWeek',
 'Weekend',
 'TimeSinceOpen',
 'CharacterExperience',
 'TimeSinceMidday',
 'inEMH',
 'EMHDay']

So far we have established that this list of features has the most impact on wait times and gives the most insight into the dataset.
The features can be grouped by the kind of information they provide:
* Ride Characteristics
    * Name
    * Tier
    * IntellectualProp (Intellectual Property)
    * ParkName (Which park is this ride located in)
    * CharacterExperience (Is this a character experience)
* Time of Day Information
    * DayOfWeek
    * Weekend 
    * TimeSinceOpen (How many hours is it since the park opened that day)
    * TimeSinceMidday (How many hours is it in absolute value since 2pm)
    * inEMH (is this wait time in an Extra magic hour window)
    * EMHDay (is this day at the park an Extra Magic Hour Day)
    * MagicHourType (is this an extra magic morning or an extra magic night)
* Weather
    * SimpleStatus
    
As the dataset grows and more months and weather conditions are recorded, this list of usable features may expand. For example, temperature is not included because all the data gathered so far comes from a single week, so the temperature pattern adds no value.
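The two time-of-day features described in the list can be illustrated with a small sketch. The real logic lives in `transformations.transformData`, which is not shown in this notebook, so everything here is an assumption: the helper name `add_time_features`, the `ParkOpenHour` column, and the use of whole hours are hypothetical.

```python
import pandas as pd

def add_time_features(df, time_col="Time", open_hour_col="ParkOpenHour", midday_hour=14):
    """Hypothetical re-creation of TimeSinceOpen and TimeSinceMidday.

    Assumes whole-hour resolution and a column holding the park's opening hour.
    """
    out = df.copy()
    # extract the hour of day from an HH:MM:SS string
    hours = pd.to_datetime(out[time_col], format="%H:%M:%S").dt.hour
    # hours elapsed since the park opened that day
    out["TimeSinceOpen"] = hours - out[open_hour_col]
    # absolute distance in hours from 2pm
    out["TimeSinceMidday"] = (hours - midday_hour).abs()
    return out

sample = pd.DataFrame({
    "Time": ["12:30:00", "09:15:00", "20:45:00"],
    "ParkOpenHour": [9, 9, 9],  # assumed opening hour
})
features = add_time_features(sample)
print(features)
```

Under these assumptions, a 12:30 reading at a park that opened at 9 gets `TimeSinceOpen = 3` and `TimeSinceMidday = 2`, which agrees with the first rows shown by `RideWaits.head()`.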

In [8]:
categoryColumns = RideWaits.select_dtypes(include = ['category']).columns
RideWaits["Name"] = pd.Categorical(RideWaits["Name"]).codes
for col in categoryColumns:
    RideWaits[col] = pd.Categorical(RideWaits[col]).codes


In [9]:
RideWaits[keyFeatures].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27574 entries, 0 to 28473
Data columns (total 13 columns):
Name                   27574 non-null int8
MagicHourType          27574 non-null int8
Tier                   27574 non-null int8
IntellectualProp       27574 non-null int8
SimpleStatus           27574 non-null int8
ParkName               27574 non-null int8
DayOfWeek              27574 non-null int64
Weekend                27574 non-null int64
TimeSinceOpen          27574 non-null int64
CharacterExperience    27574 non-null int64
TimeSinceMidday        27574 non-null int64
inEMH                  27574 non-null int64
EMHDay                 27574 non-null int64
dtypes: int64(7), int8(6)
memory usage: 3.1 MB


In [10]:
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(RideWaits[keyFeatures], RideWaits["Wait"], test_size = .25, random_state = 1)

rf = RandomForestRegressor(random_state = 1)
rf.fit(train_x, train_y)
predictions = rf.predict(test_x)
rmseBase = metrics.mean_squared_error(predictions, test_y)**(1/2)
r2Base = metrics.r2_score(predictions, test_y)
varBase = metrics.explained_variance_score(predictions,test_y)
pearsoncorrBase = pearsonr(predictions, test_y)
perrorBase = abs(predictions - test_y)/test_y
accuracyBase = 1 - statistics.median(perrorBase)
errorBase = abs(predictions - test_y)
merrorBase = errorBase.mean()
medErrorBase = statistics.median(errorBase)

In [11]:
print("RMSE: " + str(rmseBase))
print("r2: "+ str(r2Base))
print("var:" + str(varBase))
print("pearsonCorr: "+ str(pearsoncorrBase))
print("Accuracy: " + str(accuracyBase))
print("Mean Error: " + str(merrorBase))
print("Median Error: " + str(medErrorBase))

RMSE: 9.16678440584
r2: 0.868382524014
var:0.868410504503
pearsonCorr: (0.93743680462890122, 0.0)
Accuracy: 0.861547008547
Mean Error: 5.66744115471
Median Error: 3.45833333333


Using the defaults of the random forest regressor gives us baseline statistics for this algorithm. The correlation between predictions and actual waits is high (about 0.94), meaning our predictions follow the trend of the data. The accuracy of the base model, defined here as one minus the median percent error, is 86%, which is not a bad start. Both the mean and median absolute errors are under 10 minutes (about 5.7 and 3.5 minutes), which is also a good place to start. We can now try other modeling methods as well as tune the hyperparameters of our random forest model.
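Since the same set of statistics is recomputed for every model, it may help to collect them in one helper. The sketch below is an illustration, not the notebook's own code: the name `regression_report` is hypothetical. It passes arguments in scikit-learn's documented `(y_true, y_pred)` order, which matters for `r2_score` and `explained_variance_score` because they are not symmetric, and it skips zero-valued waits when computing percent error to avoid dividing by zero.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn import metrics

def regression_report(y_true, y_pred):
    """Return the notebook's evaluation statistics as a dict (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_error = np.abs(y_pred - y_true)
    # percent error is undefined where the true wait is 0, so skip those rows
    nonzero = y_true != 0
    pct_error = abs_error[nonzero] / y_true[nonzero]
    return {
        "rmse": float(np.sqrt(metrics.mean_squared_error(y_true, y_pred))),
        "r2": float(metrics.r2_score(y_true, y_pred)),
        "explained_variance": float(metrics.explained_variance_score(y_true, y_pred)),
        "pearson_r": float(pearsonr(y_true, y_pred)[0]),
        # "accuracy" as used in this notebook: one minus the median percent error
        "accuracy": float(1 - np.median(pct_error)),
        "mean_error": float(abs_error.mean()),
        "median_error": float(np.median(abs_error)),
    }

# toy values for illustration only
report = regression_report([10, 20, 30, 40], [12, 18, 33, 41])
for name, value in report.items():
    print(name + ": " + str(round(value, 4)))
```

With the notebook's variables this would be called as `regression_report(test_y, predictions)`.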

We will start by tuning the hyperparameters for this model.

Our focus will be on the following parameters: 
* n_estimators
* max_features
* max_depth
* min_samples_split
* min_samples_leaf
* bootstrap
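To get a feel for how large the search space over these six parameters is, the sketch below counts the combinations in the candidate grid used in this notebook and draws a random sample from it. It relies on scikit-learn's `ParameterGrid` and `ParameterSampler`; the variable name `search_space` is only illustrative.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# candidate values for each hyperparameter
search_space = {
    "n_estimators": [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
    "max_features": ["auto", "sqrt"],
    "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# an exhaustive grid search would have to fit every combination
n_total = len(ParameterGrid(search_space))

# a randomized search only fits a fixed number of sampled combinations
sampled = list(ParameterSampler(search_space, n_iter=100, random_state=1))

print("total combinations:", n_total)
print("sampled combinations:", len(sampled))
print("fraction explored:", round(len(sampled) / n_total, 4))
```

Sampling 100 candidates covers only a small fraction of the grid, which is why a randomized search is a reasonable first pass before a narrower exhaustive search.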

In [12]:
from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10,110, num = 11)]
max_depth.append(None)
min_samples_split = [2,5,10]
min_samples_leaf = [1,2,4]

bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [13]:
# search over random_grid using the baseline estimator rf defined above (random_state = 1)
rf_randomized = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose = 2, random_state = 1, n_jobs = -1)

In [14]:
rf_randomized.fit(train_x, train_y)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed: 11.3min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 23.2min finished


RandomizedSearchCV(cv=3, error_score='raise',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=1, verbose=0, warm_start=False),
          fit_params=None, iid=True, n_iter=100, n_jobs=-1,
          param_distributions={'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]},
          pre_dispatch='2*n_jobs', random_state=1, refit=True,
          return_train_score=True, scoring=None, verbose=2)

In [15]:
rf_randomized.best_params_

{'bootstrap': True,
 'max_depth': 70,
 'max_features': 'auto',
 'min_samples_leaf': 1,
 'min_samples_split': 5,
 'n_estimators': 1800}

In [16]:
rf_random = rf_randomized.best_estimator_

In [19]:
predictions = rf_random.predict(test_x)
rmseRandom = metrics.mean_squared_error(predictions, test_y)**(1/2)
r2Random = metrics.r2_score(predictions, test_y)
varRandom = metrics.explained_variance_score(predictions,test_y)
pearsoncorrRandom = pearsonr(predictions, test_y)
perrorRandom = abs(predictions - test_y)/test_y
accuracyRandom = 1 - statistics.median(perrorRandom)
errorRandom = abs(predictions - test_y)
merrorRandom = errorRandom.mean()
medErrorRandom = statistics.median(errorRandom)

In [20]:
print("RMSE: " + str(rmseRandom))
print("r2: "+ str(r2Random))
print("var:" + str(varRandom))
print("pearsonCorr: "+ str(pearsoncorrRandom))
print("Accuracy: " + str(accuracyRandom))
print("Mean Error: " + str(merrorRandom))
print("Median Error: " + str(medErrorRandom))

RMSE: 8.86696008798
r2: 0.873741967764
var:0.873770139723
pearsonCorr: (0.94132637573981681, 0.0)
Accuracy: 0.85789432858
Mean Error: 5.61230924331
Median Error: 3.45147737872


We see a very slight improvement (decrease) in RMSE, mean error, and median error, and a slight increase in r2, but no increase in overall accuracy, which drops marginally from 0.862 to 0.858.
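One way to check the comparison is to line up the printed values for the two models and compute the change in each metric. The sketch below copies the numbers from the outputs above; the variable names are illustrative.

```python
import pandas as pd

# values copied from the printed output of the baseline model
baseline = {
    "rmse": 9.16678440584,
    "r2": 0.868382524014,
    "accuracy": 0.861547008547,
    "mean_error": 5.66744115471,
    "median_error": 3.45833333333,
}
# values copied from the printed output of the randomized-search model
randomized = {
    "rmse": 8.86696008798,
    "r2": 0.873741967764,
    "accuracy": 0.85789432858,
    "mean_error": 5.61230924331,
    "median_error": 3.45147737872,
}

comparison = pd.DataFrame({"baseline": baseline, "randomized": randomized})
# negative change is an improvement for the error metrics, positive for r2 and accuracy
comparison["change"] = comparison["randomized"] - comparison["baseline"]
print(comparison.round(4))
```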

In [21]:
gridSearch = {'bootstrap': [True],
              'max_depth' : [50,60,70,80,90],
              'max_features': [6,7,8,9,10],
              'min_samples_leaf': [1,2,3,4],
              'n_estimators' : [200,300,500,1000,1500]}

In [22]:
from sklearn.model_selection import GridSearchCV
rf = RandomForestRegressor()
grid_search_rf = GridSearchCV(estimator = rf, param_grid = gridSearch, cv = 3, n_jobs = -1, verbose = 2)

In [23]:
grid_search_rf.fit(train_x, train_y)

Fitting 3 folds for each of 500 candidates, totalling 1500 fits


[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  6.5min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed: 16.9min
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed: 31.1min
[Parallel(n_jobs=-1)]: Done 997 tasks      | elapsed: 48.1min
[Parallel(n_jobs=-1)]: Done 1442 tasks      | elapsed: 69.2min
[Parallel(n_jobs=-1)]: Done 1500 out of 1500 | elapsed: 72.6min finished


GridSearchCV(cv=3, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'bootstrap': [True], 'max_depth': [50, 60, 70, 80, 90], 'max_features': [6, 7, 8, 9, 10], 'min_samples_leaf': [1, 2, 3, 4], 'n_estimators': [200, 300, 500, 1000, 1500]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=2)

In [69]:
grid_search_rf.best_params_

{'bootstrap': True,
 'max_depth': 50,
 'max_features': 7,
 'min_samples_leaf': 1,
 'n_estimators': 500}

In [71]:
best_grid = grid_search_rf.best_estimator_

In [81]:
predictions = best_grid.predict(test_x)
rmseGrid = metrics.mean_squared_error(predictions, test_y)**(1/2)
r2Grid = metrics.r2_score(predictions, test_y)
varGrid = metrics.explained_variance_score(predictions,test_y)
pearsoncorrGrid = pearsonr(predictions, test_y)
perrorGrid = abs(predictions - test_y)/test_y
accuracyGrid = 1 - statistics.median(perrorGrid)
errorGrid = abs(predictions - test_y)
merrorGrid = errorGrid.mean()
medErrorGrid = statistics.median(errorGrid)

In [73]:
print("RMSE: " + str(rmseGrid))
print("r2: "+ str(r2Grid))
print("var:" + str(varGrid))
print("pearsonCorr: "+ str(pearsoncorrGrid))
print("Accuracy: " + str(accuracyGrid))
print("Mean Error: " + str(merrorGrid))
print("Median Error: " + str(medErrorGrid))

8.8481542038575434

Accuracy increases slightly, but the effect on most other metrics is negligible.

## Cross validation
Using scikit-learn's built-in cross validation tools, we can compute scores without worrying about overfitting or about tuning hyperparameters to one specific split of the data. Repeatedly tuning against a single test set can cause a pseudo leakage of the testing data into the training process. We will start the cross validation with the random forest regressor given by the best estimator from the grid search.

In [19]:
from sklearn.model_selection import cross_val_score

In [20]:
rf = RandomForestRegressor(bootstrap = True, max_depth = 50, max_features = 7, min_samples_leaf = 1, n_estimators = 500)

In [None]:
scores = cross_val_score(rf, RideWaits[keyFeatures], RideWaits["Wait"], cv = 10)

In [22]:
scores

array([ 0.95672522,  0.95998704,  0.89559266,  0.6451986 ,  0.70555515,
        0.73674714,  0.79526532,  0.66179876,  0.67487744,  0.56593622])

This is just a single metric (the regressor's default r2 score). It is perhaps more useful to obtain several metrics for each of our cross validated models.
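As one possible approach, scikit-learn's `cross_validate` accepts several scorers at once. The sketch below is an illustration on synthetic data, not the ride wait data; the data, the model settings, and the names `scoring` and `summary` are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# synthetic regression data standing in for the real features and waits
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 4))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=300)

# scikit-learn reports errors as negated scores so that higher is always better
scoring = {
    "r2": "r2",
    "mse": "neg_mean_squared_error",
    "mean_error": "neg_mean_absolute_error",
    "median_error": "neg_median_absolute_error",
}

model = RandomForestRegressor(n_estimators=50, random_state=1)
results = cross_validate(model, X, y, cv=5, scoring=scoring)

# average each metric across folds, flipping the sign of the negated errors
summary = {
    "r2": float(np.mean(results["test_r2"])),
    "rmse": float(np.mean(np.sqrt(-results["test_mse"]))),
    "mean_error": float(np.mean(-results["test_mean_error"])),
    "median_error": float(np.mean(-results["test_median_error"])),
}
print(summary)
```

This does not cover the median-percent-error accuracy or the Pearson correlation used in this notebook, so a hand-written loop over the folds remains useful when those are needed.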

In [30]:
import sklearn.metrics as metrics
from sklearn.model_selection import KFold

def cross_validation_metrics(df, key_cols, target, folds):
    """Run K-fold cross validation and return the mean of each metric across folds."""
    df = df.dropna(how = 'any')
    X = df[key_cols]
    y = np.array(df[target])
    overall_rmse = []
    overall_accuracy = []
    overall_median_error = []
    overall_mean_error = []
    corr = []
    # folds are taken in row order (no shuffling)
    kf = KFold(n_splits = folds)
    for i, (train_index, test_index) in enumerate(kf.split(X), start = 1):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # fit a fresh forest on the training folds and predict the held-out fold
        rf = RandomForestRegressor(n_estimators = 500)
        rf.fit(X_train, y_train)
        predictions = rf.predict(X_test)
        rmse = metrics.mean_squared_error(y_test, predictions)**(1/2)
        pearsoncorr = pearsonr(predictions, y_test)
        # accuracy is defined as one minus the median percent error
        perror = abs(predictions - y_test)/y_test
        mperror = statistics.median(perror)
        error = abs(predictions - y_test)
        merror = error.mean()
        mederror = statistics.median(error)
        print("Fold " + str(i))
        print("RMSE: "+ str(rmse))
        print("Correlation: "+ str(pearsoncorr))
        print("Accuracy: " + str(1-mperror))
        print("Mean Error: "+ str(merror))
        print("Median Error: "+ str(mederror))
        print("-------------------------")
        overall_rmse.append(rmse)
        overall_accuracy.append(1-mperror)
        overall_median_error.append(mederror)
        overall_mean_error.append(merror)
        corr.append(pearsoncorr[0])

    # average each metric over all folds
    return_dict = {
        'rmse': np.mean(overall_rmse),
        'accuracy': np.mean(overall_accuracy),
        'median_error': np.mean(overall_median_error),
        'mean_error' : np.mean(overall_mean_error),
        'correlation':np.mean(corr)
    }
    return return_dict


In [31]:
metrics = cross_validation_metrics(RideWaits, keyFeatures,"Wait",10)

Fold 1
RMSE: 6.0732260457
Correlation: (0.97396761764927553, 0.0)
Accuracy: 0.928947731097
Mean Error: 3.87727640195
Median Error: 2.4910750777
-------------------------
Fold 2
RMSE: 5.65187157481
Correlation: (0.98426679942944462, 0.0)
Accuracy: 0.940172390488
Mean Error: 3.47899026157
Median Error: 1.77168759019
-------------------------
Fold 3
RMSE: 4.91791154206
Correlation: (0.98768679015027039, 0.0)
Accuracy: 0.940564055389
Mean Error: 3.03340974048
Median Error: 1.99461026121
-------------------------
Fold 4
RMSE: 6.13154560897
Correlation: (0.97359658257898918, 0.0)
Accuracy: 0.927273672653
Mean Error: 3.92428541216
Median Error: 2.49962698413
-------------------------
Fold 5
RMSE: 10.7646263104
Correlation: (0.93536159179077571, 0.0)
Accuracy: 0.834643614719
Mean Error: 6.150838461
Median Error: 3.78594580746
-------------------------
Fold 6
RMSE: 17.9471995249
Correlation: (0.83625193755711447, 5.4551834709844905e-276)
Accuracy: 0.684471302556
Mean Error: 11.7027463517
Median

In [44]:
metrics

{'accuracy': 0.81516012311550567,
 'correlation': 0.90129180544252585,
 'mean_error': 7.5653924358065492,
 'median_error': 4.6746564406052169,
 'rmse': 11.736031814446992}

In [46]:
metric_frame = pd.DataFrame(list(metrics.items()), columns = ['Metric Name', 'Metric Value'])

In [47]:
metric_frame

| | Metric Name | Metric Value |
|---|---|---|
| 0 | rmse | 11.736032 |
| 1 | accuracy | 0.81516 |
| 2 | median_error | 4.674656 |
| 3 | mean_error | 7.565392 |
| 4 | correlation | 0.901292 |
