## Model Evaluation - Random Forest Models on Diff Data
USA World Series results, run on "Diff" (differential) data

# @To Do

- [ ] Randomize data and rebuild model
    * Limit to very simple tuning, so as not to overfit
    * n_estimators = 100 to 300-400
    * 5-fold or 6-fold CV
    * max_features = 5 or 6
- [ ] Merge new data from validation set into full data set
- [ ] Explore relationship between Possession Time + Attacking Rucks + Passes

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split

In [4]:
#Import Data - USA's differential data
df = pd.read_csv('../data/output/new_features_diffdata.csv')
df.head()

#Import validation data
#valdf = pd.read_csv('../data/output/new_features_diffdata_validate.csv')
#valdf.head()

| Unnamed: 0 | Opp | Tournament | Poss_Time_Diff | Score_Diff | Conv_Diff | Tries_Diff | Passes_Diff | Contestable_KO_Win_pct_Diff | PenFK_Against_Diff | RuckMaul_Diff | ... | -99 : -75 | -74 : -25 | -24 : -1 | 0 : 25 | 26 : 50 | 51 : 75 | 76 : 100 | 101 : 125 | 126 : 150 | Result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AUSTRALIA | 2015_Cape_Town | 13.96648 | -10.638298 | -14.285714 | 0.25 | 25.925926 | -50.0 | 0.0 | 0.0 | ... | 0.0 | -12.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 1 | WALES | 2015_Cape_Town | 7.471264 | 15.555556 | 14.285714 | 0.083333 | 27.868852 | 25.0 | -20.0 | -100.0 | ... | 0.0 | 0.0 | 0.0 | 12.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 2 | KENYA | 2015_Cape_Town | -33.136095 | -44.444444 | -33.333333 | -0.75 | -10.638298 | -16.666667 | 66.666667 | 60.0 | ... | 0.0 | 0.0 | -5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 3 | NEW ZEALAND | 2015_Cape_Town | 51.758794 | 33.333333 | 33.333333 | 0.0 | 76.119403 | -75.0 | -50.0 | -100.0 | ... | -37.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 4 | FIJI | 2015_Cape_Town | 12.880562 | -20.833333 | -25.0 | 0.266667 | 38.461538 | -66.666667 | -33.333333 | -33.333333 | ... | 0.0 | -12.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |


## Randomize Data

In [5]:
#Shuffle rows before running the model so the cross-validation folds do not follow the file's original row order
from sklearn.utils import shuffle
df = shuffle(df)
#shuffle validation set
#valdf = shuffle(valdf)
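The shuffle above is unseeded, so every run yields a different row order. A minimal sketch of a reproducible shuffle follows. The toy frame and the seed value 77 are assumptions for illustration only, not the notebook's data.

```python
import pandas as pd
from sklearn.utils import shuffle

# Toy stand-in for df (assumption: illustrative values only)
toy = pd.DataFrame({
    'Opp': ['AUSTRALIA', 'WALES', 'KENYA', 'FIJI'],
    'Result': [0, 1, 0, 0],
})

# A fixed random_state makes the shuffled order repeatable across runs
shuffled_a = shuffle(toy, random_state=77)
shuffled_b = shuffle(toy, random_state=77)
print(shuffled_a.equals(shuffled_b))

# shuffle() keeps the original index labels; reset them for a clean 0..n-1 index
shuffled_clean = shuffled_a.reset_index(drop=True)
print(list(shuffled_clean.index))
```

Seeding matters here because an integer `cv` in GridSearchCV does not reshuffle the rows, so the fold membership depends on the order produced by this step.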

In [6]:
#Diagnostic
#df.info()
list(df.columns)
#df.head()

['Opp',
 'Tournament',
 'Poss_Time_Diff',
 'Score_Diff',
 'Conv_Diff',
 'Tries_Diff',
 'Passes_Diff',
 'Contestable_KO_Win_pct_Diff',
 'PenFK_Against_Diff',
 'RuckMaul_Diff',
 'Ruck_Win_pct_Diff',
 'Cards_diff',
 'Lineout_Win_Pct_Diff',
 'Scrum_Win_Pct_Diff',
 '-175 : -150',
 '-149 : -125',
 '-124 : -100',
 '-99 : -75',
 '-74 : -25',
 '-24 : -1',
 '0 : 25',
 '26 : 50',
 '51 : 75',
 '76 : 100',
 '101 : 125',
 '126 : 150',
 'Result']

### Pre-processing data

In [7]:
#Create a list of features to drop that are unnecessary or would bias the prediction
droplist = ['Opp', 'Score_Diff', 'Tries_Diff', 'Tournament', 'Conv_Diff',
            '-175 : -150', '-149 : -125', '-124 : -100', '-99 : -75',
            '-74 : -25', '-24 : -1', '0 : 25', '26 : 50', '51 : 75',
            '76 : 100', '101 : 125', '126 : 150']

rf_data = df.drop(columns=droplist)

#Drop rows with Result == 2 (ties) so the target is binary (win/loss)
rf_data = rf_data[rf_data['Result'] != 2]

In [8]:
#rf_data.head()
#Check to ensure 'Result' only contains 2 values (W, L)
#rf_data['Result'].describe()
#rf_data.describe()
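The commented-out checks above can be made explicit. The sketch below filters ties from a toy frame and confirms that only two labels remain. The frame and its values are assumptions for illustration, not the notebook's data.

```python
import pandas as pd

# Toy stand-in for rf_data (assumption: illustrative values only)
toy = pd.DataFrame({
    'Poss_Time_Diff': [13.9, 7.4, -33.1, 51.7],
    'Result': [0, 1, 2, 1],  # 2 marks a tie
})

# Keep only wins and losses
binary = toy[toy['Result'] != 2]

# Class counts show both the remaining labels and their balance
counts = binary['Result'].value_counts()
print(counts)

# Fail loudly if anything other than 0/1 is left
remaining_labels = {int(v) for v in binary['Result'].unique()}
assert remaining_labels <= {0, 1}, remaining_labels
```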

In [9]:
#list(rf_data.columns) 

In [10]:
#Pull out the variable we're trying to predict: 'Result'
X = rf_data.drop('Result',axis=1)
y = rf_data['Result']
#X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.30)

### Train/Test Split

*Commented out: when the pipeline is passed to GridSearchCV, the cross-validation splits are made internally, so no manual train/test split is needed (see the note below).*

In [11]:
#Split into train/test/validate sets
#OR, keep as is and use new data for validate
#156 rows in original dataframe
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=77)

<span style="color:red">## NOTE:</span>  
https://stackoverflow.com/questions/51459406/apply-standardscaler-in-pipeline-in-scikit-learn-sklearn

When StandardScaler is a step inside a Pipeline, scikit-learn fits and applies the scaler for you within each cross-validation split.

What happens can be described as follows:  

* Step 0: The data are split into TRAINING data and TEST data according to the cv parameter that you specified in the GridSearchCV.
* Step 1: the scaler is fitted on the TRAINING data
* Step 2: the scaler transforms TRAINING data
* Step 3: the models are fitted/trained using the transformed TRAINING data
* Step 4: the scaler is used to transform the TEST data
* Step 5: the trained models predict using the transformed TEST data
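
The steps above can be sketched as a manual loop. This is only an approximation of what happens inside the search for a single parameter setting; the synthetic data from `make_classification`, the fold count, and the forest size are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X, y (assumption: not the rugby data)
X_demo, y_demo = make_classification(n_samples=150, n_features=9, random_state=0)

fold_scores = []
# Step 0: split into training and test folds
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X_demo, y_demo):
    # Step 1: fit the scaler on the training fold only
    scaler = StandardScaler().fit(X_demo[train_idx])
    # Step 2: transform the training fold
    X_train_scaled = scaler.transform(X_demo[train_idx])
    # Step 3: train the model on the scaled training fold
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    model.fit(X_train_scaled, y_demo[train_idx])
    # Step 4: transform the test fold with the already-fitted scaler
    X_test_scaled = scaler.transform(X_demo[test_idx])
    # Step 5: score predictions on the scaled test fold
    fold_scores.append(model.score(X_test_scaled, y_demo[test_idx]))

print(np.mean(fold_scores))
```

Because the scaler never sees the test fold during fitting, no information from the held-out rows leaks into training.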

***Note:*** Use grid.fit(X, y) and NOT grid.fit(X_train, y_train), because GridSearchCV splits the data into training and test folds internally.

Your code should look like this:

<code>pipe = Pipeline([
        ('scale', StandardScaler()),
        ('reduce_dims', PCA(n_components=4)),
        ('clf', SVC(kernel = 'linear', C = 1))])

param_grid = dict(reduce_dims__n_components=[4,6,8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf','linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring= 'accuracy')
grid.fit(X, y)
print(grid.best_score_)
print(grid.cv_results_)</code>

Once **grid.fit(X, y)** has run, the outcome of the search is available on the fitted grid object (fit() returns that same object). **best_score_** holds the best mean cross-validated score observed during the search, and **best_params_** holds the combination of parameters that achieved it.

**IMPORTANT:** if you want to keep a validation dataset of the original dataset use this:

<code>X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = train_test_split(
    X, y, test_size=0.15, random_state=1)</code>
    
Then use:

<code>grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring= 'accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)</code>
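
A minimal sketch of the full hold-out workflow, assuming synthetic data and a deliberately tiny grid (neither comes from the notebook): search on one portion, then score once on the rows the search never saw.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X, y (assumption: not the rugby data)
X_demo, y_demo = make_classification(n_samples=150, n_features=9, random_state=0)

# Set 15% aside before any tuning happens
X_search, X_holdout, y_search, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=1, stratify=y_demo)

pipe = Pipeline([
    ('std_scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=0)),
])
small_grid = {'classifier__n_estimators': [10, 20]}  # hypothetical tiny grid

grid = GridSearchCV(pipe, param_grid=small_grid, cv=3, scoring='accuracy')
grid.fit(X_search, y_search)

# grid.score() uses the refitted best pipeline and the 'accuracy' scorer
holdout_accuracy = grid.score(X_holdout, y_holdout)
print(grid.best_score_, holdout_accuracy)
```

Comparing `best_score_` with the hold-out accuracy gives a rough indication of how optimistic the cross-validated estimate is.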

### Create a transformation pipeline
Pipeline with Scaling and Random Forest Classifier

In [12]:
from sklearn.pipeline import Pipeline
# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

#create the pipeline
scale_pipeline = Pipeline([
    ('std_scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])

# fit the pipeline
#scale_pipeline.fit(X_train, y_train)

### Check the Pipeline test accuracy

In [13]:
# Pipeline test accuracy
#print('Test accuracy: %.3f' % scale_pipeline.score(X_test, y_test))

# Pipeline estimator params; estimator is stored as step 2 ([1]), second item ([1])
#print('\nModel hyperparameters:\n', scale_pipeline.steps[1][1].get_params())

### Grid Search with Cross Validation
Random search allowed us to narrow down the range for each hyperparameter. Now that we know where to concentrate our search, we can explicitly specify every combination of settings to try. We do this with GridSearchCV, a method that, instead of sampling randomly from a distribution, evaluates all combinations we define.  

### Hyperparameters
* n_estimators = number of trees in the forest
* max_features = max number of features considered when splitting a node
* max_depth = max number of levels in each decision tree
* min_samples_split = min number of data points a node must hold before it can be split
* min_samples_leaf = min number of data points allowed in a leaf node
* bootstrap = whether each tree is trained on a bootstrap sample (drawn with replacement) or on the whole dataset

In [14]:
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = {
    'classifier__bootstrap': [True, False],
    'classifier__max_depth': [60, 80, 100],
    'classifier__max_features': ['auto', 4, 5, 6],
    'classifier__min_samples_leaf': [1, 2, 3, 4, 5],
    'classifier__min_samples_split': [2, 5, 8, 10, 12],
    'classifier__n_estimators': [10, 20, 40, 60, 100], # [100, 200, 300, 400]
    'classifier__criterion': ['gini', 'entropy']
}
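
Since grid search evaluates every combination, the cost grows multiplicatively with each added value. The sketch below counts candidates from the list lengths in the grid above and, for comparison, from a reduced grid along the lines of the To Do list (the reduced grid's exact values and its 5 folds are assumptions).

```python
from math import prod

# Number of values listed per hyperparameter in param_grid
grid_sizes = {
    'classifier__bootstrap': 2,
    'classifier__max_depth': 3,
    'classifier__max_features': 4,
    'classifier__min_samples_leaf': 5,
    'classifier__min_samples_split': 5,
    'classifier__n_estimators': 5,
    'classifier__criterion': 2,
}

n_candidates = prod(grid_sizes.values())
n_fits = n_candidates * 3  # 3-fold cross-validation
print(n_candidates, n_fits)  # 6000 18000

# Hypothetical reduced grid: 4 n_estimators values x 2 max_features values, 5 folds
reduced_sizes = {'classifier__n_estimators': 4, 'classifier__max_features': 2}
reduced_fits = prod(reduced_sizes.values()) * 5
print(reduced_fits)  # 40
```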

## Random Forest
If ***not*** using pipelines

In [15]:
#from sklearn.ensemble import RandomForestClassifier

#Fit RF Classifier model
#rf = RandomForestClassifier(random_state=101)

#from pprint import pprint
# Look at parameters used by our current forest
#print('Default Parameters currently in use:\n')
#pprint(rf.get_params())

### Execute GridSearch

In [16]:
# execute gridsearch and get best score
rf_grid = GridSearchCV(scale_pipeline, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2, scoring= 'accuracy')

# Fit on all of X, y (not X_train, y_train): GridSearchCV splits the data
# into training and test folds internally.
rf_grid.fit(X, y)

print("RF Grid search Best Score:", rf_grid.best_score_)
print("RF Grid search Cross Validation Results:", rf_grid.cv_results_)

Fitting 3 folds for each of 6000 candidates, totalling 18000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   32.2s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   39.2s
[Parallel(n_jobs=-1)]: Done 357 tasks      | elapsed:   53.0s
[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 1005 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 1450 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 1977 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 2584 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 3273 tasks      | elapsed:  4.4min
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed:  5.6min
[Parallel(n_jobs=-1)]: Done 4893 tasks      | elapsed:  7.2min
[Parallel(n_jobs=-1)]: Done 5824 tasks      | elapsed:  8.5min
[Parallel(n_jobs=-1)]: Done 6837 tasks      | elapsed:  9.9min
[Parallel(n_jobs=-1)]: Done 7930 tasks      | elapsed: 11.5min
[Parallel(n_jobs=-1)]: Done 9105 tasks      | 

RF Grid search Best Score: 0.7086092715231788
RF Grid search Cross Validation Results: {'mean_fit_time': array([0.22959805, 0.16456668, 0.15793761, ..., 0.43528032, 0.55979904,
       0.65496071]), 'std_fit_time': array([0.00052125, 0.04314535, 0.01590334, ..., 0.01824349, 0.05616841,
       0.06821363]), 'mean_score_time': array([0.0240651 , 0.01057704, 0.01868534, ..., 0.01232068, 0.02331805,
       0.03397592]), 'std_score_time': array([0.00013155, 0.01028631, 0.00772812, ..., 0.0010294 , 0.00935509,
       0.00839581]), 'param_classifier__bootstrap': masked_array(data=[True, True, True, ..., False, False, False],
             mask=[False, False, False, ..., False, False, False],
       fill_value='?',
            dtype=object), 'param_classifier__criterion': masked_array(data=['gini', 'gini', 'gini', ..., 'entropy', 'entropy',
                   'entropy'],
             mask=[False, False, False, ..., False, False, False],
       fill_value='?',
            dtype=object), 'param_cl

In [17]:
print("Best Parameters from Grid Search: \n", rf_grid.best_params_)

Best Parameters from Grid Search: 
 {'classifier__bootstrap': True, 'classifier__criterion': 'gini', 'classifier__max_depth': 80, 'classifier__max_features': 6, 'classifier__min_samples_leaf': 3, 'classifier__min_samples_split': 8, 'classifier__n_estimators': 10}
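
The raw `cv_results_` dictionary is hard to read when printed directly. One possible approach is to load it into a DataFrame and sort by rank, as sketched below on synthetic data with a tiny grid (both are assumptions for illustration).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X, y (assumption: not the rugby data)
X_demo, y_demo = make_classification(n_samples=150, n_features=9, random_state=0)

pipe = Pipeline([
    ('std_scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=0)),
])
small_grid = {  # hypothetical tiny grid
    'classifier__n_estimators': [10, 20],
    'classifier__min_samples_leaf': [1, 3],
}
grid = GridSearchCV(pipe, param_grid=small_grid, cv=3, scoring='accuracy')
grid.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
results_table = pd.DataFrame(grid.cv_results_)[columns].sort_values('rank_test_score')
print(results_table.head())
```

The `std_test_score` column is worth a look on small datasets: when candidates differ by less than the fold-to-fold spread, the top-ranked setting may not be meaningfully better than its neighbours.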


In [23]:
# Print the pipeline steps
# Pipeline estimator params; estimator is stored as step 2 ([1]), second item ([1])
# print('\nModel hyperparameters:\n', scale_pipeline.steps[1][1].get_params())
# Note: GridSearchCV fits clones, so scale_pipeline itself still holds the defaults shown below
print('\nModel hyperparameters:\n\n', scale_pipeline.steps)


Model hyperparameters:

 [('std_scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('classifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]
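
The output above lists default values rather than the tuned ones because GridSearchCV works on clones of the pipeline it is given. A small sketch of that behaviour, on synthetic data with a hypothetical two-value grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X, y (assumption: not the rugby data)
X_demo, y_demo = make_classification(n_samples=150, n_features=9, random_state=0)

template = Pipeline([
    ('std_scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=0)),
])
grid = GridSearchCV(template, param_grid={'classifier__n_estimators': [5, 15]}, cv=3)
grid.fit(X_demo, y_demo)

template_rf = template.named_steps['classifier']
tuned_rf = grid.best_estimator_.named_steps['classifier']

# The template is never fitted; only the refitted clone has trees
template_is_fitted = hasattr(template_rf, 'estimators_')
tuned_is_fitted = hasattr(tuned_rf, 'estimators_')
print(template_is_fitted, tuned_is_fitted)

# The tuned clone carries the winning setting, the template keeps its default
print(template_rf.n_estimators, tuned_rf.n_estimators)
```

The tuned pipeline is therefore reached through `best_estimator_` on the fitted search object, not through the pipeline variable passed into it.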


## Next Steps
Take the best model parameters from <code>rf_grid.best_params_</code> (the <code>scale_pipeline.steps</code> printout above shows only the untuned defaults) and use them in a final model in 'RF-Remix.ipynb'
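
One way to carry the tuned settings into a final model is to apply them to a fresh pipeline with `set_params`. The sketch below copies the best parameters printed by the grid search; the `random_state` seed is an assumption added for repeatability.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Best parameters as printed by the grid search
best_params = {
    'classifier__bootstrap': True,
    'classifier__criterion': 'gini',
    'classifier__max_depth': 80,
    'classifier__max_features': 6,
    'classifier__min_samples_leaf': 3,
    'classifier__min_samples_split': 8,
    'classifier__n_estimators': 10,
}

final_pipeline = Pipeline([
    ('std_scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=101)),  # seed is an assumption
])

# The step-name prefix routes each setting to the right pipeline step
final_pipeline.set_params(**best_params)

final_rf = final_pipeline.named_steps['classifier']
print(final_rf.n_estimators, final_rf.max_features, final_rf.min_samples_leaf)
```

Calling `final_pipeline.fit(X, y)` would then train the final model. Note that `max_features=6` requires the feature matrix to have at least six columns.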