# Grid Search Model Selection Notebook

***

The purpose of this notebook is to replicate some of the model selection functionality offered by DataRobot, for client projects that cannot justify the sometimes prohibitive cost of a DataRobot implementation.

## How to use

The notebook guides you through finding a good classifier and tuning its hyperparameters with a grid search.
By default, the following classifiers are evaluated:

    ExtraTrees
    RandomForest
    AdaBoost
    GradientBoosting
    XGBoostClassifier
    
These defaults also include suggested hyperparameter grids in the `params1` dictionary. Add or remove entries if you have some intuition about which kind of classifier should do well in your scenario.

## Quickstart

1. Change the CSV location and the target column name: [Input csv Link](#quickstart_link)

2. Change the output filenames: [Filename Link](#quickstart_link2)

3. Run all the cells in this notebook

## Dependencies

- pandas
- numpy
- scikit-learn (`sklearn`)
- xgboost

***

## 1. Helper Function

First, we define a helper class that is used later to evaluate the different models.
Don't worry too much about this class - the rest of the code in this notebook is explained in full.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import GridSearchCV

class EstimatorSelectionHelper:

    def __init__(self, models, params):
        if not set(models.keys()).issubset(set(params.keys())):
            missing_params = list(set(models.keys()) - set(params.keys()))
            raise ValueError("Some estimators are missing parameters: %s" % missing_params)
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv=3, n_jobs=3, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print("Running GridSearchCV for %s." % key)
            model = self.models[key]
            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring, refit=refit,
                              return_train_score=True)
            gs.fit(X,y)
            self.grid_searches[key] = gs    

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                 'estimator': key,
                 'min_score': min(scores),
                 'max_score': max(scores),
                 'mean_score': np.mean(scores),
                 'std_score': np.std(scores),
            }
            return pd.Series({**params,**d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            # n_splits_ is set by GridSearchCV.fit and also works when cv is a splitter object
            for i in range(self.grid_searches[k].n_splits_):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]        
                scores.append(r.reshape(len(params),1))

            all_scores = np.hstack(scores)
            for p, s in zip(params,all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns]
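
To make `score_summary` less opaque, here is a minimal sketch of the `cv_results_` layout it reads from. It assumes a synthetic dataset from `make_classification` and a tiny hypothetical grid; neither is part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (stand-in for the real dataset)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

gs = GridSearchCV(RandomForestClassifier(random_state=0),
                  {'n_estimators': [5, 10]},   # hypothetical two-candidate grid
                  cv=3, scoring='f1', refit=False)
gs.fit(X, y)

# One array per fold, each holding one score per candidate
split_scores = np.column_stack(
    [gs.cv_results_["split{}_test_score".format(i)] for i in range(gs.n_splits_)]
)
print(split_scores.shape)  # (candidates, folds)
```

Each row of `split_scores` corresponds to one parameter combination, which is what the helper turns into `min_score`, `mean_score`, `max_score` and `std_score`.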

## 2. Dataset import and processing <a id='quickstart_link'></a>

Next, import the dataset you will be working on. The default example uses the Nokia dataset, because this notebook was originally prepared for the Nokia project.

In [108]:
# Load the dataset here
data = pd.read_csv("raw_dataset_edited_cont.csv")

# Split the dataset into a features dataframe and a target series - here the target column is called 'REQUEST STATUS'
request_raw = data['REQUEST STATUS']
features_raw = data.drop('REQUEST STATUS', axis = 1)

# Encode the target dataframe to numerical values if required
request = (request_raw =='APPROVED').astype(int)
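
As a small illustration of the target encoding above, the comparison yields a boolean series that `astype(int)` maps to 1 and 0. The status values other than `'APPROVED'` are hypothetical.

```python
import pandas as pd

# Hypothetical status values; only 'APPROVED' appears in the notebook
status = pd.Series(['APPROVED', 'REJECTED', 'APPROVED', 'PENDING'])

# True/False becomes 1/0; every non-approved status collapses to 0
target = (status == 'APPROVED').astype(int)
print(target.tolist())  # [1, 0, 1, 0]
```

Note that any status other than `'APPROVED'` ends up in the negative class, so it may be worth checking `value_counts()` on the raw column first.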

### 2.1 Scaling & Normalisation - TODO

In [99]:
#TODO: Section allowing the dataset to be scaled - may require pipelines

#skewed = ['CBMs', 'Data1', 'Data2', 'Data3', 'Weights(kgs)']
#features_log_transformed = pd.DataFrame(data = features_raw)
#features_log_transformed[skewed] = features_raw[skewed].apply(lambda x: np.log(x + 1))

#from sklearn.preprocessing import MinMaxScaler

# Initialize a scaler, then apply it to the features
#scaler = MinMaxScaler() # default=(0, 1)
#numerical = ['CBMs', 'Data1', 'Data2', 'Data3', 'Weights(kgs)']

#features_log_minmax_transform = pd.DataFrame(data = features_log_transformed)
#features_log_minmax_transform[numerical] = scaler.fit_transform(features_log_transformed[numerical])

# Show an example of a record with scaling applied
#display(features_log_minmax_transform.head(n = 5))

# Done: One-hot encode the 'features_log_minmax_transform' data using pandas.get_dummies()

#features_final = pd.get_dummies(features_log_minmax_transform)
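
One possible way to address the TODO above is to move the log transform and min-max scaling into a scikit-learn `Pipeline`, so they are refitted inside every cross-validation fold. The sketch below is an assumption about how that could look, not the notebook's implementation; the toy data is hypothetical and only the column names are borrowed from the commented-out code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Toy stand-in for the real features and target
X = pd.DataFrame({'CBMs': [0.5, 12.0, 3.2, 150.0],
                  'Weights(kgs)': [10.0, 2500.0, 80.0, 40000.0]})
y = [0, 1, 0, 1]

skewed = ['CBMs', 'Weights(kgs)']

# log(x + 1) followed by scaling to [0, 1], applied only to the skewed columns
log_then_scale = Pipeline([
    ('log', FunctionTransformer(np.log1p)),
    ('scale', MinMaxScaler()),
])
preprocess = ColumnTransformer([('num', log_then_scale, skewed)],
                               remainder='passthrough')

pipe = Pipeline([
    ('prep', preprocess),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipe.fit(X, y)

scaled = pipe.named_steps['prep'].transform(X)
print(scaled.min(), scaled.max())
```

If a pipeline like this were passed to the grid search, the hyperparameter names would need the step prefix, for example `clf__n_estimators` instead of `n_estimators`.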

## 3. Encode categorical variables

In [100]:
# One hot encode the necessary columns in the features dataframe
features_final = pd.get_dummies(features_raw)

# Print the total number of features after one-hot encoding
encoded = list(features_final.columns)
print("{} total features after one-hot encoding.".format(len(encoded)))


196 total features after one-hot encoding.
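
For reference, a minimal sketch of what `pd.get_dummies` does to a mixed dataframe. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical columns for illustration
df = pd.DataFrame({'Region': ['EU', 'US', 'EU'], 'Weight': [1.5, 2.0, 0.7]})

# Numeric columns pass through; each category becomes its own indicator column
encoded_df = pd.get_dummies(df)
print(list(encoded_df.columns))
```

This is why the feature count grows after encoding: every distinct value in a categorical column adds one column.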


## 4. Split datasets into train and test

In [101]:
# Import train_test_split (it lives in sklearn.model_selection; sklearn.cross_validation has been removed)
from sklearn.model_selection import train_test_split

# Split the features and target data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_final, 
                                                    request, 
                                                    test_size = 0.2, 
                                                    random_state = 0)
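
If the target classes are imbalanced, one option (not used in the original cell) is to pass `stratify` so both splits keep the same class ratio. A sketch on hypothetical data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data with an 80/20 class imbalance
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 80 + [0] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(len(y_tr), len(y_te))      # 80 20
print(y_tr.mean(), y_te.mean())  # both splits keep the 80% positive rate
```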

## 5. Define models of interest and hyperparameters

In [102]:
# First import the classifiers of interest
# The classifiers included below should be good for most situations, but feel free to add others
# https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Add any classifiers you want to evaluate into this dictionary

models1 = {
    'ExtraTreesClassifier': ExtraTreesClassifier(),
    'RandomForestClassifier': RandomForestClassifier(),
    'AdaBoostClassifier': AdaBoostClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier(),
    'XGBoostClassifier': XGBClassifier()
}

# For each classifier above, add the hyperparameters you would like to evaluate
# The defaults included below should cover most situations

params1 = {
    'ExtraTreesClassifier': { 'n_estimators': [16, 32] },
    'RandomForestClassifier': {'min_samples_split' : [2,4,6,8,10,14,20], 'n_estimators': [1, 5, 10, 15, 20]},
    'AdaBoostClassifier':  { 'n_estimators': [16, 32] },
    'GradientBoostingClassifier': { 'n_estimators': [16, 32], 'learning_rate': [0.8, 1.0] },
    'XGBoostClassifier': {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5]
        },
}
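
Adding another classifier means adding one entry to each dictionary under the same key. The sketch below uses `SVC` with an illustrative, untuned grid and small stand-in dictionaries rather than the full `models1` and `params1`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Small stand-ins for models1 / params1
models = {'RandomForestClassifier': RandomForestClassifier()}
params = {'RandomForestClassifier': {'n_estimators': [10, 20]}}

# Hypothetical addition; the grid values are illustrative only
models['SVC'] = SVC()
params['SVC'] = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# Same check that EstimatorSelectionHelper.__init__ performs
missing = set(models) - set(params)
print(missing)  # empty set when every model has a parameter grid
```

If a model key has no matching entry in the parameter dictionary, the helper raises a `ValueError` listing the missing keys.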


## 6. Run the Grid Search evaluation

This step can take some time, depending on the size of the dataset and the number of model and hyperparameter combinations defined.
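
To estimate the cost before running, you can count the combinations in a grid: the number of candidates is the product of the list lengths, and the number of fits is that times the number of folds. The helper name `count_fits` is hypothetical; the grid is the `XGBoostClassifier` entry from `params1`.

```python
from math import prod

# Grid taken from the XGBoostClassifier entry in params1
xgb_grid = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5],
}

def count_fits(grid, cv=3):
    """Return (candidates, total fits) for one parameter grid."""
    candidates = prod(len(values) for values in grid.values())
    return candidates, candidates * cv

candidates, fits = count_fits(xgb_grid)
print(candidates, fits)  # 405 1215
```

Trimming a single list, for example dropping two of the five `gamma` values, cuts the total roughly in proportion.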

In [103]:
helper1 = EstimatorSelectionHelper(models1, params1)
helper1.fit(X_train, y_train, scoring='f1', n_jobs=2)

Running GridSearchCV for ExtraTreesClassifier.
Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=2)]: Done   6 out of   6 | elapsed:    4.2s finished


Running GridSearchCV for RandomForestClassifier.
Fitting 3 folds for each of 35 candidates, totalling 105 fits


[Parallel(n_jobs=2)]: Done 105 out of 105 | elapsed:    4.1s finished


Running GridSearchCV for AdaBoostClassifier.
Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=2)]: Done   6 out of   6 | elapsed:    2.5s finished


Running GridSearchCV for GradientBoostingClassifier.
Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=2)]: Done  12 out of  12 | elapsed:    2.6s finished


Running GridSearchCV for XGBoostClassifier.
Fitting 3 folds for each of 405 candidates, totalling 1215 fits


[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    5.6s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:   16.9s
[Parallel(n_jobs=2)]: Done 446 tasks      | elapsed:   36.2s
[Parallel(n_jobs=2)]: Done 796 tasks      | elapsed:  1.1min
[Parallel(n_jobs=2)]: Done 1215 out of 1215 | elapsed:  1.8min finished


## 7. Grid Search Summary

In [104]:
# Summary view of the top 10 models
helper1.score_summary(sort_by='max_score').head(10)

| | estimator | min_score | mean_score | max_score | std_score | colsample_bytree | gamma | learning_rate | max_depth | min_child_weight | min_samples_split | n_estimators | subsample |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | RandomForestClassifier | 0.860465 | 0.878986 | 0.896825 | 0.0148519 | | | | | | 2.0 | 20.0 | |
| 29 | RandomForestClassifier | 0.844106 | 0.866888 | 0.896 | 0.0216532 | | | | | | 14.0 | 10.0 | |
| 21 | RandomForestClassifier | 0.854962 | 0.870222 | 0.890625 | 0.0150067 | | | | | | 8.0 | 20.0 | |
| 1 | ExtraTreesClassifier | 0.857143 | 0.876521 | 0.889831 | 0.0140179 | | | | | | | 32.0 | |
| 135 | XGBoostClassifier | 0.853659 | 0.869137 | 0.888889 | 0.0146966 | 0.6 | 2.0 | | 4.0 | 1.0 | | | 1.0 |
| 10 | RandomForestClassifier | 0.852713 | 0.872072 | 0.888 | 0.0146086 | | | | | | 4.0 | 15.0 | |
| 40 | GradientBoostingClassifier | 0.86166 | 0.873439 | 0.887967 | 0.0109142 | | | 0.8 | | | | 32.0 | |
| 16 | RandomForestClassifier | 0.853846 | 0.868106 | 0.886275 | 0.0135242 | | | | | | 6.0 | 20.0 | |
| 401 | XGBoostClassifier | 0.817121 | 0.854978 | 0.886275 | 0.0286106 | 1.0 | 2.0 | | 3.0 | 10.0 | | | 0.8 |
| 26 | RandomForestClassifier | 0.850575 | 0.866213 | 0.886275 | 0.0149063 | | | | | | 10.0 | 20.0 | |
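
The choice of `sort_by` changes the ranking. As a small illustration using four rows copied from the summary above, sorting by `max_score` rewards a single lucky fold, while `mean_score` reflects all folds:

```python
import pandas as pd

# Four rows copied from the summary output above (index = candidate id)
summary = pd.DataFrame(
    {
        'estimator': ['RandomForestClassifier', 'RandomForestClassifier',
                      'ExtraTreesClassifier', 'GradientBoostingClassifier'],
        'mean_score': [0.878986, 0.866888, 0.876521, 0.873439],
        'max_score': [0.896825, 0.896, 0.889831, 0.887967],
    },
    index=[6, 29, 1, 40],
)

by_max = summary.sort_values('max_score', ascending=False).index.tolist()
by_mean = summary.sort_values('mean_score', ascending=False).index.tolist()
print(by_max)   # [6, 29, 1, 40]
print(by_mean)  # [6, 1, 40, 29]
```

Candidate 29 drops from second to last once the average over folds is used, which suggests `mean_score` may be the more conservative ranking.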


## 8. Train Final Model

In [105]:
# Get the best model, or manually select one below
summary = helper1.score_summary(sort_by='max_score')
model_index = summary.index[0]
#model_index = 1 # Uncomment this line to select a model that isn't the top scorer

# Look up the chosen row by its index label
selected_model = summary.loc[[model_index]]

best_classifier = models1[selected_model.to_dict('records')[0]['estimator']]

# Keep only the hyperparameters that belong to this estimator's grid
score_columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
parameters = selected_model.drop(score_columns, axis=1).dropna(axis=1, how='all').to_dict('records')[0]

best_classifier.set_params(**parameters)

# The grid search ran with refit=False, so fit the chosen model on the full training set
best_classifier.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=20, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
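
Depending on how the summary dataframe stores its values, integer hyperparameters such as `n_estimators` may come back as floats (the summary above shows `20.0`), and recent scikit-learn releases validate parameter types. A hedged sketch of a hypothetical helper, `restore_int_params`, that casts whole-number floats back to `int` before calling `set_params`:

```python
def restore_int_params(parameters):
    """Cast float values with no fractional part (e.g. 20.0) back to int."""
    cleaned = {}
    for name, value in parameters.items():
        if isinstance(value, float) and value.is_integer():
            cleaned[name] = int(value)
        else:
            cleaned[name] = value
    return cleaned

print(restore_int_params({'min_samples_split': 2.0, 'n_estimators': 20.0}))
# {'min_samples_split': 2, 'n_estimators': 20}
```

This also turns a genuinely float-valued setting such as `learning_rate=1.0` into `1`, so treat it as a starting point rather than a general solution.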

Testing the accuracy of the final model:

In [106]:
from sklearn.metrics import accuracy_score

y_pred = best_classifier.predict(X_test)

accuracy_score(y_test, y_pred)

0.79508196721311475
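
The grid search above was scored with `f1`, while this cell reports accuracy, so the two numbers are not directly comparable. A minimal sketch on hypothetical labels showing both metrics side by side:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true labels and predictions
y_true = [1, 0, 1, 1, 0, 1]
y_hat = [1, 0, 0, 1, 1, 1]

acc = accuracy_score(y_true, y_hat)  # 4 of 6 correct
f1 = f1_score(y_true, y_hat)         # 3 TP, 1 FP, 1 FN
print(round(acc, 3), round(f1, 3))   # 0.667 0.75
```

On the real data, `f1_score(y_test, y_pred)` would give the test-set counterpart of the cross-validation scores in the summary.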

## 9. Pickle model <a id='quickstart_link2'></a>

In [107]:
# joblib is better than pickle when using large numpy arrays
import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases

# Choose where to save the files
joblib.dump(best_classifier, 'Nokia_model_v1.pkl')  # the classifier trained in section 8
joblib.dump(X_train.columns, 'Nokia_model_v1_columns.pkl')

['Nokia_model_v1_columns.pkl']
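
The saved column list matters at prediction time: new records must be one-hot encoded and then aligned to the training columns. The following is a sketch under assumptions (toy data, hypothetical column names, temporary file paths), not the way the BP object loads models.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy training frame standing in for X_train (hypothetical columns)
X_train = pd.DataFrame({'Weight': [1.0, 2.0, 3.0, 4.0],
                        'Region_EU': [1, 0, 1, 0],
                        'Region_US': [0, 1, 0, 1]})
y_train = [0, 1, 0, 1]
clf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, 'model.pkl')
    columns_path = os.path.join(tmp, 'columns.pkl')
    joblib.dump(clf, model_path)
    joblib.dump(list(X_train.columns), columns_path)

    # Later, possibly in another process: load both artefacts back
    loaded_model = joblib.load(model_path)
    loaded_columns = joblib.load(columns_path)

# A new record only contains one category, so get_dummies yields fewer columns
new_raw = pd.DataFrame({'Weight': [2.5], 'Region': ['US']})
new_encoded = pd.get_dummies(new_raw).reindex(columns=loaded_columns, fill_value=0)

predictions = loaded_model.predict(new_encoded)
print(list(new_encoded.columns), predictions)
```

`reindex` adds any training column that is missing from the new data (filled with 0) and drops columns the model never saw.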

## 10. Summary & Use with BP

You now have a trained model saved to disk, which can be used in an RPA implementation.
A BP object handles running the model as a web service in the background: you only have to specify where the models you want to use are stored and pass a collection of features to get predictions. Full details here - TODO