# Hyperparameter Tuning with Random Forest: Grid and Random Search

In this notebook, we will explore two methods for hyperparameter tuning with a machine learning model -- today we'll consider Random Forest. [In contrast](https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/) to model __parameters__ which are learned during training, model __hyperparameters__ are set by the data scientist ahead of training and control implementation aspects of the model.

These settings need to be tuned for each problem because the best model hyperparameters for one particular dataset will __not be__ the best across all datasets. The process of [hyperparameter tuning (also called hyperparameter optimization)](https://en.wikipedia.org/wiki/Hyperparameter_optimization) means finding the combination of hyperparameter values for a machine learning model that performs the best - **as measured on a validation dataset** - for a problem. 

There are several approaches to hyperparameter tuning:

1. __Manual__: select hyperparameters based on intuition/experience/guessing, train the model with the hyperparameters, and score on the validation data. Repeat process until you run out of patience or are satisfied with the results. 
2. __Grid Search__: set up a grid of hyperparameter values and for each combination, train a model and score on the validation data. In this approach, every single combination of hyperparameter values is tried, which can be very inefficient!
3. __Random search__: set up a grid of hyperparameter values and select _random_ combinations to train the model and score. The number of search iterations is set based on time/resources. 
4. __Automated Hyperparameter Tuning__: use methods such as gradient descent, Bayesian Optimization, or evolutionary algorithms to conduct a guided search for the best hyperparameters.

(This [Wikipedia Article](https://en.wikipedia.org/wiki/Hyperparameter_optimization) provides a good high-level overview of tuning options with links for more details)

In this notebook, we will consider approaches 2 and 3 for a Random Forest Model.

### Getting Started

For this notebook, we will work with a subset of the data: 16000 sampled rows, of which 10000 are used for training. Hyperparameter tuning is extremely computationally expensive and working with the full dataset in a Kaggle Kernel would not be feasible for more than a few search iterations. However, the same ideas that we will implement here can be applied to the full dataset. Also, while this notebook specifically uses random forests, the overall approach can be applied to any machine learning model. 

To "test" the tuning results, we will save some of the training data, 6000 rows, as a separate testing set. When we do hyperparameter tuning, it's crucial to __not tune the hyperparameters on the testing data__. We can only use the testing data __a single time__ when we evaluate the final model that has been tuned on the validation data. To actually test our methods from this notebook, we would need to train the best model on all of the training data, make predictions on the actual testing data, and then submit our answers to the competition. 

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Splitting data
from sklearn.model_selection import train_test_split

Below we read in the data and separate into a training set of 10000 observations and a "testing set" of 6000 observations. After creating the testing set, we cannot do any hyperparameter tuning with it! 

In [2]:
### Read the training set.
data_or = pd.read_csv('./train_bureau_corrs_removed.csv')
print(f"Original training set shape: {data_or.shape}")
print(data_or.info())


# Sample 16000 rows.
data = data_or.sample(n = 16000, random_state = 42)

# Keep only numeric features (DISABLED!)
# data = data.select_dtypes('number')

# Separate labels from features
labels = data.loc[:, 'TARGET']
# features = data.drop(columns = ['TARGET', 'SK_ID_CURR'])
features = data.drop(columns = ['TARGET'])


# Split the small training set into training and testing data (10000 for training, 6000 for testing)
# NOTE: stratify keeps the class proportions the same in both splits.
train_features, test_features, train_labels, test_labels =\
    train_test_split(features, labels, test_size = 6000, random_state = 42, stratify = labels)

Original training set shape: (307511, 199)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 199 entries, SK_ID_CURR to TARGET
dtypes: float64(145), int64(38), object(16)
memory usage: 466.9+ MB
None


Let's check the distribution of the classes...

In [3]:
print(f"Class distribution in the original training set: {np.histogram(data_or['TARGET'].values, [0,1,2], density = True)[0]}")
print(np.histogram(labels, [0,1,2], density = True)[0])
print(np.histogram(train_labels, [0,1,2], density = True)[0])
print(np.histogram(test_labels, [0,1,2], density = True)[0])

Class distribution in the original training set: [0.91927118 0.08072882]
[0.9185625 0.0814375]
[0.9186 0.0814]
[0.9185 0.0815]


Note that the step that would keep only the numeric features is disabled in the cell above, so all 198 feature columns, including the categorical ones, are retained; the categorical columns are encoded later, before the model is fit. Restricting to numeric features would reduce the number of dimensions and speed up the hyperparameter search, but it is not something we would want to do on a real problem.

In [4]:
print("Training features shape: ", train_features.shape)
print("Training labels shape: ", train_labels.shape)
print("Testing features shape: ", test_features.shape)
print("Test labels shape: ", test_labels.shape)

Training features shape:  (10000, 198)
Training labels shape:  (10000,)
Testing features shape:  (6000, 198)
Test labels shape:  (6000,)


# Cross Validation

To evaluate each combination of hyperparameter values, we need to measure its performance on a validation set. The hyperparameters __cannot be tuned on the testing data__. We can only use the testing data __once__ when we evaluate the final model. The testing data is meant to serve as an estimate of the model performance when deployed on real unseen data, and therefore we do not want to optimize our model to the testing data because that will not give us a fair estimate of the actual performance.

The correct approach is therefore to use a **validation set**. However, instead of splitting the valuable training data into a separate training and validation set, we use [KFold cross validation](https://www.youtube.com/watch?v=TIgfjmp-4BA). In addition to preserving training data, this should give us a better estimate of generalization performance on the test set than using a single validation set (since then we are probably overfitting to that validation set). The performance of each set of hyperparameters is determined by Receiver Operating Characteristic Area Under the Curve (ROC AUC) from the cross-validation.
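To make the metric concrete, here is a minimal pure-Python sketch (not part of the original notebook) of ROC AUC as the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The helper name `roc_auc` and the toy labels are illustrative assumptions; in the notebook itself the metric comes from scikit-learn via `scoring='roc_auc'`.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of rows,
    so this is for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data with roughly the class balance seen above: 92 negatives, 8 positives.
labels = [0] * 92 + [1] * 8

# A "model" that gives every row the same score always predicts the majority class.
constant_scores = [0.1] * 100
accuracy = sum(1 for y in labels if y == 0) / len(labels)

print(f"accuracy of always predicting 0: {accuracy:.2f}")
print(f"ROC AUC of constant scores: {roc_auc(labels, constant_scores):.2f}")
```

The constant scorer looks strong on accuracy yet has an AUC of 0.5, which is why a ranking metric is the more informative choice when the classes are this imbalanced.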

In this example, we will use **5-fold cross validation** which means training and testing the model with each set of hyperparameter values 5 times to assess performance. Part of the reason why hyperparameter tuning is so time-consuming is because of the use of cross validation. If we have a [large enough training set, we can probably get away with just using a single separate validation set](https://www.coursera.org/lecture/deep-neural-network/train-dev-test-sets-cxG1s), but cross validation is a **safer method to avoid overfitting**. 
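The fold mechanics can be sketched without any library. The helper `k_fold_indices` below is a hypothetical illustration, not scikit-learn's implementation (which also offers shuffling and stratification): it cuts the row positions into `k` contiguous folds and uses each fold exactly once for validation.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds.

    The first n_rows % k folds get one extra row so that every row
    lands in exactly one validation fold.
    """
    base, extra = divmod(n_rows, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, valid_idx
        start += size


folds = list(k_fold_indices(n_rows=10, k=5))
for i, (train_idx, valid_idx) in enumerate(folds):
    print(f"fold {i}: validate on {valid_idx}, train on {len(train_idx)} rows")
```

With 10000 training rows and 5 folds, each model is fit on 8000 rows and scored on the remaining 2000, so a single cross-validated evaluation costs five model fits.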

To implement KFold cross validation, we will use scikit-learn's `cross_val_score` function.

### Example of Cross Validation

We have to pass in a set of hyperparameters to the cross validation, so we will start from a hand-picked Random Forest configuration: balanced class weights, a maximum depth of 10, and 150 trees (`n_estimators`). scikit-learn's Random Forest has no early stopping, so all 150 trees are always built. As a reminder, the metric we are using is Receiver Operating Characteristic Area Under the Curve (ROC AUC).

The code below defines the helper functions and then carries out cross validation with 5 folds. 

In [5]:
# Let's pull in some functions introduced in the previous classes...

from sklearn.preprocessing import LabelEncoder

def label_encode(app_train, app_test) : 
    le = LabelEncoder()
    le_count = 0

    # Iterate through the columns
    for col in app_train:
        if app_train[col].dtype == 'object':
            # If 2 or fewer unique categories
            set_values = app_train[col].unique()
            num_values = len(list(set_values))
            if num_values <= 2:
                print(f"{col} will be label encoded! Found {num_values} values: {set_values}")
                # Train on the training data
                le.fit(app_train[col])
                # Transform both training and testing data
                app_train[col] = le.transform(app_train[col])
                app_test[col] = le.transform(app_test[col])

                # Keep track of how many columns were label encoded
                le_count += 1

    print('%d columns were label encoded.' % le_count)
    print('Training Features shape: ', app_train.shape)
    print('Testing Features shape: ', app_test.shape)
    
    return app_train, app_test


def one_hot_encode(app_train, app_test) :
    
    # Let's perform the one-hot encoding of categorical features with > 2 values...
    app_train = pd.get_dummies(app_train)
    app_test = pd.get_dummies(app_test)
    print('Training Features shape: ', app_train.shape)
    print('Testing Features shape: ', app_test.shape)
    
    return app_train, app_test


def align_train_test(app_train, app_test) :
    
    # Save target variable in a separate Series...
    train_labels = app_train['TARGET']

    # Align the training and testing data on columns -- this keeps only the columns present in both dataframes.
    app_train, app_test = app_train.align(app_test, join = 'inner', axis = 1)

    # Add the target column back in.
    app_train['TARGET'] = train_labels
    
    return train_labels, app_train, app_test

In [6]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm.sklearn import LGBMClassifier

def cross_val(app_train, app_test, classifier = 'rf', cv = 5, params = {}) :
    
    # Pick the classifier desired by the user...
    log_reg = LogisticRegression
    if classifier == 'decision' : log_reg = DecisionTreeClassifier
    elif classifier == 'SVC' : log_reg = SVC
    elif classifier == 'rf' : log_reg = RandomForestClassifier
    elif classifier == 'lgbm' : log_reg = LGBMClassifier
    else : pass
    
    
    # Label encoding, one hot encoding, train-test alignment.
    app_train, app_test = label_encode(app_train, app_test)
    print(f"Train and test shapes after label encoding: {app_train.shape} and {app_test.shape}")
    
    app_train, app_test = one_hot_encode(app_train, app_test)
    print(f"Train and test shapes after one hot encoding: {app_train.shape} and {app_test.shape}")
    
    train_labels, app_train, app_test = align_train_test(app_train, app_test)
    print(f"Train and test shapes after alignment: {app_train.shape} and {app_test.shape}")
    
    train = app_train.drop(columns = ['TARGET'], errors = 'ignore')
    test = app_test.copy()
    
 
    # Setup the pipeline. 
    from sklearn.pipeline import Pipeline
    clf = Pipeline([('imp', SimpleImputer(strategy = 'median')),
                    ('sca', MinMaxScaler(feature_range = (0, 1))),
                    ('clf', log_reg(**params))],
                    verbose = True)
    

    # Setup the cross validation. 
    from sklearn.model_selection import cross_val_score
    search = cross_val_score(estimator = clf, 
                             X = train, 
                             y = train_labels,
                             cv = cv,
                             scoring='roc_auc',
                             n_jobs = -1)

    
    # Compute feature importance (if applicable).
    #feat_imp = pd.Series(0, index=train.columns)
    #if(classifier in ['rf', 'lgbm']) :
    #    feat_imp = pd.Series(search.best_estimator_.named_steps["clf"].feature_importances_, index=train.columns)
        
    # Return the results.
    return search

In [7]:
def fit_score(app_train, app_test, classifier = 'rf', params = {}) :
    
    # Pick the classifier desired by the user...
    log_reg = LogisticRegression
    if classifier == 'decision' : log_reg = DecisionTreeClassifier
    elif classifier == 'SVC' : log_reg = SVC
    elif classifier == 'rf' : log_reg = RandomForestClassifier
    elif classifier == 'lgbm' : log_reg = LGBMClassifier
    else : pass
    
    
    # Label encoding, one hot encoding, train-test alignment.
    app_train, app_test = label_encode(app_train, app_test)
    print(f"Train and test shapes after label encoding: {app_train.shape} and {app_test.shape}")
    
    app_train, app_test = one_hot_encode(app_train, app_test)
    print(f"Train and test shapes after one hot encoding: {app_train.shape} and {app_test.shape}")
    
    train_labels, app_train, app_test = align_train_test(app_train, app_test)
    print(f"Train and test shapes after alignment: {app_train.shape} and {app_test.shape}")
    
    train = app_train.drop(columns = ['TARGET'], errors = 'ignore')
    test = app_test.copy()
    
 
    # Setup the pipeline. 
    from sklearn.pipeline import Pipeline
    clf = Pipeline([('imp', SimpleImputer(strategy = 'median')),
                    ('sca', MinMaxScaler(feature_range = (0, 1))),
                    ('clf', log_reg(**params))],
                    verbose = True)
    

    # Train the model. 
    clf.fit(train, train_labels)
    
    
    # Compute the predictions
    return clf.predict_proba(test)[:, 1]

In [13]:
print("Training features shape: ", train_features.shape)
print("Training labels shape: ", train_labels.shape)
print("Testing features shape: ", test_features.shape)
print("Test labels shape: ", test_labels.shape)


app_train = pd.concat([train_features, train_labels], axis = 1)
app_test = test_features
print(app_train.shape, app_test.shape)


# Cross validation with a fixed set of hyperparameters
params_classifier = {"class_weight" : "balanced",
                     'max_depth': 10,
                     'n_estimators' : 150,
                     'n_jobs' : -1} 
cv = cross_val(app_train.copy(), 
               app_test.copy(), 
               classifier = 'rf', 
               cv = 5, 
               params = params_classifier)

Training features shape:  (10000, 198)
Training labels shape:  (10000,)
Testing features shape:  (6000, 198)
Test labels shape:  (6000,)
(10000, 199) (6000, 198)
NAME_CONTRACT_TYPE will be label encoded! Found 2 values: ['Cash loans' 'Revolving loans']
CODE_GENDER will be label encoded! Found 2 values: ['F' 'M']
FLAG_OWN_CAR will be label encoded! Found 2 values: ['N' 'Y']
FLAG_OWN_REALTY will be label encoded! Found 2 values: ['Y' 'N']
4 columns were label encoded.
Training Features shape:  (10000, 199)
Testing Features shape:  (6000, 198)
Train and test shapes after label encoding: (10000, 199) and (6000, 198)
Training Features shape:  (10000, 315)
Testing Features shape:  (6000, 310)
Train and test shapes after one hot encoding: (10000, 315) and (6000, 310)
Train and test shapes after alignment: (10000, 311) and (6000, 310)


`cross_val_score` returns a vector of $n$ values (here $n = 5$), where each value indicates the score the classifier got on a specific fold.

In [14]:
print(f"scores: {cv}")
print(f"avg. score: {np.mean(cv)}")
print(f"stddev. score: {np.std(cv)}")

scores: [0.70836524 0.6942167  0.68759414 0.69918946 0.70753529]
avg. score: 0.6993801655638473
stddev. score: 0.007910067825057831


**Let's train the model with the same parameters considered during the CV, and then compute predictions over the small test set to see its ROC AUC!**

In [15]:
params_classifier = {"class_weight" : "balanced",
                     'max_depth': 10,
                     'n_estimators' : 150,
                     'n_jobs' : -1}

predictions = fit_score(app_train.copy(),\
                        app_test.copy(),\
                        classifier = 'rf',\
                        params = params_classifier)

from sklearn.metrics import roc_auc_score
print(f"ROC AUC on the test set: {roc_auc_score(test_labels, predictions)}")

NAME_CONTRACT_TYPE will be label encoded! Found 2 values: ['Cash loans' 'Revolving loans']
CODE_GENDER will be label encoded! Found 2 values: ['F' 'M']
FLAG_OWN_CAR will be label encoded! Found 2 values: ['N' 'Y']
FLAG_OWN_REALTY will be label encoded! Found 2 values: ['Y' 'N']
4 columns were label encoded.
Training Features shape:  (10000, 199)
Testing Features shape:  (6000, 198)
Train and test shapes after label encoding: (10000, 199) and (6000, 198)
Training Features shape:  (10000, 315)
Testing Features shape:  (6000, 310)
Train and test shapes after one hot encoding: (10000, 315) and (6000, 310)
Train and test shapes after alignment: (10000, 311) and (6000, 310)
[Pipeline] ............... (step 1 of 3) Processing imp, total=   0.3s
[Pipeline] ............... (step 2 of 3) Processing sca, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   0.9s
ROC AUC on the test set: 0.7035046842548404


The score we get is very close to the one observed during the CV!
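As a quick sanity check on "very close", the following sketch (not part of the original notebook) reuses the five fold scores and the test score printed above and expresses the gap in units of the fold-to-fold standard deviation. It uses only the standard library; `statistics.pstdev` is the population standard deviation, the same quantity `np.std` returns by default.

```python
import statistics

# Values copied from the outputs above.
fold_scores = [0.70836524, 0.6942167, 0.68759414, 0.69918946, 0.70753529]
test_score = 0.7035046842548404

cv_mean = statistics.mean(fold_scores)
cv_std = statistics.pstdev(fold_scores)  # population std, like np.std
gap_in_stds = (test_score - cv_mean) / cv_std

print(f"CV mean: {cv_mean:.4f}, CV std: {cv_std:.4f}")
print(f"test score is {gap_in_stds:.2f} standard deviations above the CV mean")
```

The gap is well under one standard deviation of the fold scores, so the test result is consistent with the cross-validation estimate.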

Observe that we used the same strategy we employ when training our model on the whole training set...

# Hyperparameter Tuning Implementation

Now we have the basic framework in place: we will use cross validation to determine the performance of model hyperparameters. The basic strategy for both grid and random search is simple:

1. For each hyperparameter combination, evaluate the cross validation score and record the result along with the hyperparameters.
2. At the end of the search, choose the combination of hyperparameters that yielded the highest cross-validation score.
3. Train the model with that combination on all the training data.
4. Finally, make predictions on the test data.

## Four parts of Hyperparameter tuning

It's helpful to think of hyperparameter tuning as having four parts:

1. **Objective function**: a function that takes in hyperparameters and returns a score we are trying to minimize or maximize
2. **Domain**: the set of hyperparameter values over which we want to search. 
3. **Algorithm**: method for selecting the next set of hyperparameters to evaluate in the objective function.
4. **Results history**: data structure containing each set of hyperparameters and the resulting score from the objective function.

**Question: which parts do we already have in this notebook?**

Switching from grid to random search will only require making minor modifications to these four parts. 
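The four parts can be seen working together in a toy sketch. Everything here is illustrative: `toy_objective` is a made-up stand-in for the cross-validation score, and the grid values are arbitrary.

```python
import itertools

# 1. Objective function: takes hyperparameters, returns a score to maximize.
#    A made-up formula stands in for the cross-validation ROC AUC.
def toy_objective(hyperparameters):
    depth = hyperparameters["max_depth"]
    trees = hyperparameters["n_estimators"]
    return 0.70 - 0.001 * abs(depth - 7) + 0.0001 * (trees / 50)

# 2. Domain: the values we are willing to try for each hyperparameter.
domain = {"max_depth": [5, 7, 10], "n_estimators": [100, 150]}

# 3. Algorithm: here, walk through every combination in order (grid search).
def next_candidates(domain):
    keys = list(domain)
    for values in itertools.product(*(domain[k] for k in keys)):
        yield dict(zip(keys, values))

# 4. Results history: one record per evaluation.
history = []
for iteration, candidate in enumerate(next_candidates(domain)):
    history.append({"score": toy_objective(candidate),
                    "params": candidate,
                    "iteration": iteration})

best = max(history, key=lambda record: record["score"])
print(f"evaluated {len(history)} combinations; best params: {best['params']}")
```

Swapping the body of `next_candidates` for random draws from `domain` would turn this into random search while leaving the other three parts untouched.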

## 1 - Objective Function

The objective function takes in hyperparameters and outputs a value representing a score. Traditionally in optimization, this is a score to minimize, but here our score will be the ROC AUC which of course we want to maximize. What occurs in the middle of the objective function will vary according to the problem, but for this problem, we will use cross validation with the specified model hyperparameters to get the cross-validation ROC AUC. This score will then be used to select the best model hyperparameter values. 

In addition to returning the value to maximize, our objective function will return the hyperparameters and the iteration of the search. These results will let us go back and inspect what occurred during a search. The code below implements a simple objective function which we can use for both grid and random search.

In [None]:
def objective(hyperparameters, iteration, app_train, app_test):
    """Objective function for grid and random search. Returns
       the cross validation score from a set of hyperparameters."""
    
    # Perform 5-fold cross validation with the hyperparameters under evaluation
    cv = cross_val(app_train, app_test, classifier = 'rf', cv = 5, params = hyperparameters)
    
    # Results to return
    score = np.mean(cv) 
    print(f"Score achieved with {hyperparameters}: {score}")
    
    return [score, hyperparameters, iteration]

In [None]:
params = {"class_weight" : "balanced", 
          'n_estimators' : 150}

score, params, iteration = objective(params, 
                                     1, 
                                     app_train.copy(), 
                                     app_test.copy())

print('The cross-validation ROC AUC was {:.5f}.'.format(score))

## 2 - Domain

The domain, or search space, is all the possible values for all the hyperparameters that we want to search over. For random and grid search, the domain is a hyperparameter grid and usually takes the form of a dictionary with the keys being the hyperparameters and the values lists of values for each hyperparameter.

### Hyperparameters for Random Forest

To see which settings we can tune, let's make a model and print out its parameters. You can also refer to the [scikit-learn `RandomForestClassifier` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) for the description of all the hyperparameters.

In [None]:
# Create a default model
model = RandomForestClassifier()
model.get_params()

Some of these we do not need to tune, such as `verbose`, `random_state`, and `n_jobs`. However, there are still many hyperparameters to optimize, and we will consider only a subset to tune. 

Choosing a hyperparameter grid is probably the most difficult part of hyperparameter tuning: it's nearly impossible ahead of time to say which values of hyperparameters will work well and the optimal settings will depend on the dataset. Moreover, the hyperparameters have complex interactions with each other which means that just tuning one at a time doesn't work because when we start changing other hyperparameters that will affect the one we just tuned! 

If we have prior experience with a model, we might know where the best values for the hyperparameters typically lie, or what a good search space is. However, if we don't have much experience, we can simply define a large search space and hope that the best values are in there somewhere. Typically, when first using a method, I define a wide search space centered around the default values. Then, if I see that some values of hyperparameters tend to work better, I can concentrate the search around those values.

In [None]:
# Hyperparameter grid
param_grid = {
    'max_leaf_nodes': [5, 7, 10],
    'max_depth': [5, 7, 10],
    'n_estimators': [100, 150],
    'class_weight': ["balanced"]
}

## 3 - Algorithm for selecting next values

Although we don't generally think of them as such, both grid and random search are algorithms. In the case of grid search, we input the domain and the algorithm selects the next value for each hyperparameter in an ordered sequence. The only requirement of grid search is that it tries every combination in a grid once (and only once). For random search, we input the domain and each time the algorithm gives us a random combination of hyperparameter values to try. There are no requirements for random search other than that the next values are selected at random. 

We will implement these algorithms very shortly, as soon as we cover the final part of hyperparameter tuning.

## 4 - Results History

The results history is a data structure that contains the hyperparameter combinations and the resulting score on the objective function. When we get to Bayesian Optimization, the model actually _uses the past results to decide on the next hyperparameters_ to evaluate. Random and grid search are _uninformed_ methods that do not use the past history, but we still need the history so we can find out which hyperparameters worked the best! 

A dataframe is a useful data structure to hold the results.

In [None]:
MAX_EVALS = 5

# Dataframes for random and grid search
random_results = pd.DataFrame(columns = ['score', 'params', 'iteration'],
                              index = list(range(MAX_EVALS)))

grid_results = pd.DataFrame(columns = ['score', 'params', 'iteration'],
                              index = list(range(MAX_EVALS)))

print(grid_results)

# Grid Search Implementation

Grid search is best described as exhaustive guess and check. We have a problem: find the hyperparameters that result in the best cross validation score, and a set of values to try in the hyperparameter grid - the domain. The grid search method for finding the answer is to try all combinations of values in the domain and hope that the best combination is in the grid (in reality, we will never know if we found the best settings unless we have an infinite hyperparameter grid which would then require an infinite amount of time to run).

Grid search suffers from one limiting problem: it is extremely computationally expensive because we have to perform cross validation with every single combination of hyperparameters in the grid! Let's see how many total hyperparameter settings there are in our simple little grid we developed.

In [None]:
com = 1
for x in param_grid.values():
    com *= len(x)
print('There are {} combinations'.format(com))

Let's assume 10 seconds per evaluation and see how much time evaluating the above combinations would take:

In [None]:
print('This would take {:.0f} seconds to finish.'.format((10 * com)))
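To see how quickly the cost grows, here is a rough sketch that applies the same arithmetic to two grids. `small_grid` mirrors the grid defined earlier; `wide_grid` is a hypothetical larger grid invented for this illustration, and the 10 seconds per evaluation is the same assumption as in the text.

```python
from math import prod

def grid_cost(param_grid, n_folds=5, seconds_per_evaluation=10):
    """Combinations, model fits, and estimated seconds for a full grid search.

    Assumes, as in the text above, a fixed time per cross-validated evaluation.
    """
    combinations = prod(len(values) for values in param_grid.values())
    return combinations, combinations * n_folds, combinations * seconds_per_evaluation

small_grid = {
    'max_leaf_nodes': [5, 7, 10],
    'max_depth': [5, 7, 10],
    'n_estimators': [100, 150],
    'class_weight': ["balanced"],
}

# A hypothetical wider grid: more values per hyperparameter plus one new one.
wide_grid = {
    'max_leaf_nodes': list(range(5, 55, 5)),      # 10 values
    'max_depth': list(range(3, 13)),              # 10 values
    'n_estimators': [100, 150, 200, 300, 500],    # 5 values
    'min_samples_leaf': [1, 5, 10, 20],           # 4 values
    'class_weight': ["balanced"],
}

for name, grid in [("small", small_grid), ("wide", wide_grid)]:
    combos, fits, seconds = grid_cost(grid)
    print(f"{name}: {combos} combinations, {fits} model fits, ~{seconds} seconds")
```

Because the sizes multiply, going from 18 to 2000 combinations takes only a handful of extra values per hyperparameter.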

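The 18 combinations in our tiny grid are manageable, but the count multiplies with every hyperparameter and value we add, so realistic grids quickly call for a better approach! Before we discuss alternatives, let's walk through how we would actually use this grid and evaluate all the hyperparameters.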

The code below shows the "algorithm" for grid search. First, we [unpack the values](https://www.geeksforgeeks.org/packing-and-unpacking-arguments-in-python/) in the hyperparameter grid (which is a Python dictionary) using the line `keys, values = zip(*param_grid.items())`. The key line is `for v in itertools.product(*values)` where we iterate through all the possible combinations of values in the hyperparameter grid one at a time. For each combination of values, we create a dictionary `hyperparameters = dict(zip(keys, v))` and then pass these to the objective function defined earlier. The objective function returns the cross validation score from the hyperparameters which we record in the dataframe. This process is repeated for each and every combination of hyperparameter values. By using `itertools.product` (from [this Code Review Stack Exchange question and answer](https://codereview.stackexchange.com/questions/171173/list-all-possible-permutations-from-a-python-dictionary-of-lists)), we create a [generator](http://book.pythontips.com/en/latest/generators.html) rather than allocating a list of all possible combinations, which for a large grid could be too big to hold in memory. 
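Before the full function, the unpacking and iteration steps can be tried in isolation. This stand-alone sketch uses a small illustrative grid (`demo_grid` is not the notebook's `param_grid`) and pulls combinations one at a time to show that `itertools.product` is lazy.

```python
import itertools

# A small illustrative grid (not the notebook's param_grid).
demo_grid = {"max_depth": [5, 7], "n_estimators": [100, 150]}

# zip(*items) turns [(key, values), ...] into one tuple of keys
# and one tuple of value lists, in matching order.
keys, values = zip(*demo_grid.items())

# itertools.product is lazy: combinations are produced one at a time.
combos = itertools.product(*values)
first = dict(zip(keys, next(combos)))
rest = [dict(zip(keys, v)) for v in combos]

print(keys)
print(first)
print(len(rest), "combinations remain")
```

The last hyperparameter in the dictionary varies fastest, so the first combinations evaluated differ only in `n_estimators`.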

In [None]:
import itertools

def grid_search(app_train, app_test, param_grid, max_evals = MAX_EVALS):
    """Grid search algorithm (with limit on max evals)"""
    
    # Dataframe to store results
    results = pd.DataFrame(columns = ['score', 'params', 'iteration'],
                              index = list(range(max_evals)))
    
    # https://codereview.stackexchange.com/questions/171173/list-all-possible-permutations-from-a-python-dictionary-of-lists
    keys, values = zip(*param_grid.items())
    
    i = 0
    
    # Iterate through every possible combination of hyperparameters
    for v in itertools.product(*values):
        
        # Create a hyperparameter dictionary
        hyperparameters = dict(zip(keys, v))
        print(f"Evaluating the following configuration: {hyperparameters}")
                
        # Evaluate the hyperparameters
        eval_results = objective(hyperparameters, i, app_train.copy(), app_test.copy())
        
        results.loc[i, :] = eval_results
        
        i += 1
        
        # Normally would not limit iterations
        if i >= max_evals:
            break
       
    # Sort with best score on top
    results.sort_values('score', ascending = False, inplace = True)
    results.reset_index(inplace = True)
    
    return results    

Normally, in grid search, we do not limit the number of evaluations. The number of evaluations is set by the total combinations in the hyperparameter grid (or the number of years we are willing to wait!). So the lines 

```
        if i >= max_evals:
            break
```

would not be used in actual grid search. Here we will run grid search for 5 iterations just as an example. The results returned will show us the validation score (ROC AUC), the hyperparameters, and the iteration sorted by best performing combination of hyperparameter values.

In [None]:
grid_results = grid_search(app_train.copy(), app_test.copy(), param_grid)

print('The best validation score was {:.5f}'.format(grid_results.loc[0, 'score']))
print('\nThe best hyperparameters were:')

import pprint
pprint.pprint(grid_results.loc[0, 'params'])

Now, since we have the best hyperparameters, we can evaluate them on our "test" data (remember, this is not the real test data)!

In [None]:
# Get the best parameters
grid_search_params = grid_results.loc[0, 'params']

# Train over train data, then compute predictions over the test set.
print(f"Computing predictions with {grid_search_params}")
predictions = fit_score(app_train.copy(),\
                        app_test.copy(),\
                        classifier = 'rf',\
                        params = grid_search_params)

from sklearn.metrics import roc_auc_score
print('The best model from grid search scores {:.5f} ROC AUC on the test set.'.format(roc_auc_score(test_labels, predictions)))

It's interesting that the model scores better on the test set than in cross validation. Usually the opposite happens (higher on cross validation than on test) because the model is tuned to the validation data. In this case, the better performance is probably due to the small size of the test data: we got lucky (although this probably does not translate to the actual competition data). 

To get a sense of how grid search works, we can look at the progression of hyperparameters that were evaluated.

In [None]:
pd.options.display.max_colwidth = 1000
grid_results['params'].values

This is grid search trying every single value in the grid! No matter how small the increment between subsequent values of a hyperparameter, it will try them all. Clearly, we are going to need a more efficient approach if we want to find better hyperparameters in a reasonable amount of time. 

#### Application

If you want to run this on the entire dataset, feel free to take these functions and put them in a script. However, I would advise against using grid search unless you have a very small hyperparameter grid because this is such an exhaustive method! 
Later, we will look at results from 1000 iterations of grid and random search run on the same small subset of data as we used above. I have not tried to run any form of grid search on the full data (and probably will not try this method).

# Random Search

Random search is surprisingly efficient compared to grid search. Although grid search will find the optimal value of hyperparameters (assuming they are in your grid) eventually, random search will usually find a "close-enough" value in far fewer iterations. [This great paper explains why this is so](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf): grid search spends too much time evaluating unpromising regions of the hyperparameter search space because it has to evaluate every single combination in the grid. Random search, in contrast, does a better job of exploring the search space and therefore can usually find a good combination of hyperparameters in far fewer iterations. 

As [this article](https://medium.com/rants-on-machine-learning/smarter-parameter-sweeps-or-why-grid-search-is-plain-stupid-c17d97a0e881) lays out, random search should probably be the first hyperparameter optimization method tried because of its effectiveness. Even though it's an _uninformed_ method (meaning it does not rely on past evaluation results), random search can still usually find better values than the default and is simple to run.
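One way to see the argument from the paper is to count how many distinct values of each hyperparameter a fixed budget gets to try. The sketch below is illustrative only: it assumes two continuous hyperparameters on [0, 1] and a budget of 9 evaluations, neither of which comes from this notebook.

```python
import itertools
import random

budget = 9

# Grid search: a 3 x 3 grid spends the budget on only 3 values per axis.
axis_values = [0.0, 0.5, 1.0]
grid_points = list(itertools.product(axis_values, axis_values))

# Random search: every draw lands on a new value along each axis.
rng = random.Random(0)
random_points = [(rng.random(), rng.random()) for _ in range(budget)]

def distinct_per_axis(points):
    """Number of distinct values tried for each of the two hyperparameters."""
    return [len({p[axis] for p in points}) for axis in (0, 1)]

print("grid:  ", distinct_per_axis(grid_points))
print("random:", distinct_per_axis(random_points))
```

If only one of the two hyperparameters matters much for the score, the grid has effectively tested 3 settings of it while random search has tested 9 with the same budget.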

Random search can also be thought of as an algorithm: randomly select the next set of hyperparameters from the grid! We can build a dictionary of hyperparameters by selecting one random value for each hyperparameter as follows (again accounting for subsampling):

In [None]:
import random
random.seed(50)

for k, v in param_grid.items() :
    print(k,v)

# Randomly sample from dictionary
random_params = {k: random.sample(v, 1)[0] for k, v in param_grid.items()}

random_params

Next, we define the `random_search` function. This takes the same general structure as `grid_search` except for the method used to select the next hyperparameter values. Moreover, random search is always run with a limit on the number of search iterations.

In [None]:
def random_search(app_train, app_test, param_grid, max_evals = MAX_EVALS):
    """Random search for hyperparameter optimization"""
    
    # Dataframe for results
    results = pd.DataFrame(columns = ['score', 'params', 'iteration'], index = list(range(max_evals)))
    
    # Keep searching until we reach the maximum number of evaluations
    for i in range(max_evals):
        
        # Choose random hyperparameters
        hyperparameters = {k: random.sample(v, 1)[0] for k, v in param_grid.items()}
        print(f"Evaluating the following configuration: {hyperparameters}")
                
        # Evaluate the hyperparameters
        eval_results = objective(hyperparameters, i, app_train.copy(), app_test.copy())
        
        results.loc[i, :] = eval_results
    
    # Sort with best score on top
    results.sort_values('score', ascending = False, inplace = True)
    results.reset_index(inplace = True)
    return results 

In [None]:
random_results = random_search(app_train, app_test, param_grid, max_evals = 5)

print('The best validation score was {:.5f}'.format(random_results.loc[0, 'score']))
print('\nThe best hyperparameters were:')

import pprint
pprint.pprint(random_results.loc[0, 'params'])

We can also evaluate the best random search model on the "test" data.

In [None]:
# Get the best parameters
random_search_params = random_results.loc[0, 'params']

# Train over train data, then compute predictions over the test set.
print(f"Computing predictions with {random_search_params}")
predictions = fit_score(app_train.copy(),
                        app_test.copy(),
                        classifier = 'rf',
                        params = random_search_params)

# The metric is ROC AUC (not accuracy), so label it accordingly.
from sklearn.metrics import roc_auc_score

print('The best model from random search scores {:.5f} ROC AUC on the test set.'.format(roc_auc_score(test_labels, predictions)))

Finally, we can view the random search sequence of hyperparameters.

In [None]:
random_results['params']

This time we see hyperparameter values that are all over the place, almost as if they had been selected at random! Random search will do a much better job than grid search of exploring the search domain (for the same number of iterations). If we have a limited time to evaluate hyperparameters, random search is a better option than grid search for exactly this reason.
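
A small, self-contained sketch can make this concrete. It assumes a hypothetical search space with two hyperparameters on [0, 1] and a budget of 9 evaluations: a 3 x 3 grid only ever tries 3 distinct values of each hyperparameter, whereas 9 random draws try (almost surely) 9 distinct values of each. This is the core argument of the random search paper linked above; the ranges and the budget here are illustrative assumptions.

```python
import random
from itertools import product

random.seed(50)
budget = 9  # total number of evaluations allowed (illustrative)

# Grid search: 3 values per hyperparameter -> 3 x 3 = 9 combinations
grid_values = [0.0, 0.5, 1.0]
grid_points = list(product(grid_values, grid_values))

# Random search: draw each hyperparameter uniformly from [0, 1]
random_points = [(random.random(), random.random()) for _ in range(budget)]

def distinct_per_dimension(points):
    """Count how many distinct values were tried for each hyperparameter."""
    return [len({p[d] for p in points}) for d in range(2)]

print("Grid search:  ", distinct_per_dimension(grid_points))
print("Random search:", distinct_per_dimension(random_points))
```

If only one of the two hyperparameters really matters for the score, random search has effectively tested three times as many values of it with the same budget.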

### Stacking Random and Grid Search: some final suggestions

One option for a smarter implementation of hyperparameter tuning is to combine random search and grid search: 

1. Use random search with a large hyperparameter grid.
2. Use the results of random search to build a focused hyperparameter grid around the best performing hyperparameter values.
3. Run grid search on the reduced hyperparameter grid. 
4. Repeat grid search on more focused grids until the computational/time budget is exhausted.
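
Step 2 above can be sketched as follows. The helper `build_focused_grid` is a hypothetical function written for this illustration (it is not part of scikit-learn): for each hyperparameter it keeps the best value found by random search plus its immediate neighbours in the original list of candidates. The grid and the "best" values below are made up.

```python
def build_focused_grid(param_grid, best_params, width = 1):
    """Keep, for each hyperparameter, the best value and `width` neighbours on each side.

    Hypothetical helper for illustration; assumes every value list in
    `param_grid` is sortable and contains the corresponding best value.
    """
    focused = {}
    for name, values in param_grid.items():
        ordered = sorted(values)
        pos = ordered.index(best_params[name])
        low = max(0, pos - width)
        high = min(len(ordered), pos + width + 1)
        focused[name] = ordered[low:high]
    return focused

# Illustrative large grid and a made-up "best" result from random search
large_grid = {
    'n_estimators': [50, 100, 150, 200, 250, 300],
    'max_depth': [3, 5, 7, 10, 15, 20],
}
best_from_random_search = {'n_estimators': 150, 'max_depth': 20}

focused_grid = build_focused_grid(large_grid, best_from_random_search)
print(focused_grid)
```

The resulting `focused_grid` can then be passed to grid search (step 3): the 6 x 6 = 36 original combinations shrink to 3 x 2 = 6.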

Finally, Bayesian optimization is also worth mentioning: it keeps the broad exploration of a random search while attempting to focus on regions where the best combinations of hyperparameters may be located.
See also https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html#skopt.BayesSearchCV

# Going back to the code we've seen in previous classes...

We can avoid implementing all the logic behind the grid and random search seen in the cells above by using scikit-learn's _GridSearchCV_ and _RandomizedSearchCV_ classes.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
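
As a quick refresher, here is a minimal sketch of `RandomizedSearchCV` on a small synthetic dataset. The dataset, the parameter ranges, and the number of iterations are assumptions chosen to keep the example fast; they are not tuned for the data used in this notebook.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic binary classification problem (illustrative only)
X, y = make_classification(n_samples = 300, n_features = 10, random_state = 50)

# Lists are sampled uniformly; scipy distributions are sampled via their rvs() method
param_distributions = {
    'n_estimators': [20, 40, 60],
    'max_depth': randint(2, 8),  # integers from 2 to 7
}

search = RandomizedSearchCV(estimator = RandomForestClassifier(random_state = 50),
                            param_distributions = param_distributions,
                            n_iter = 4,
                            cv = 3,
                            scoring = 'roc_auc',
                            random_state = 50)
search.fit(X, y)

print(search.best_params_)
print(f"Best mean CV ROC AUC: {search.best_score_:.5f}")
```

Unlike the hand-written `random_search` above, `param_distributions` also accepts continuous or integer distributions, so the candidates do not have to be enumerated in advance.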

**Remember**: we already used them before! In fact we can reuse that code here. Thus, let's pull in code we've used in previous classes...

In [None]:
from sklearn.preprocessing import LabelEncoder

def label_encode(app_train, app_test) : 
    le = LabelEncoder()
    le_count = 0

    # Iterate through the columns
    for col in app_train:
        if app_train[col].dtype == 'object':
            # If 2 or fewer unique categories
            set_values = app_train[col].unique()
            num_values = len(list(set_values))
            if num_values <= 2:
                print(f"{col} will be label encoded! Found {num_values} values: {set_values}")
                # Train on the training data
                le.fit(app_train[col])
                # Transform both training and testing data
                app_train[col] = le.transform(app_train[col])
                app_test[col] = le.transform(app_test[col])

                # Keep track of how many columns were label encoded
                le_count += 1

    print('%d columns were label encoded.' % le_count)
    print('Training Features shape: ', app_train.shape)
    print('Testing Features shape: ', app_test.shape)
    
    return app_train, app_test


def one_hot_encode(app_train, app_test) :
    
    # Let's perform the one-hot encoding of categorical features with > 2 values...
    app_train = pd.get_dummies(app_train)
    app_test = pd.get_dummies(app_test)
    print('Training Features shape: ', app_train.shape)
    print('Testing Features shape: ', app_test.shape)
    
    return app_train, app_test


def align_train_test(app_train, app_test) :
    
    # Save target variable in a separate Series...
    train_labels = app_train['TARGET']

    # Align the training and testing data on columns -- this keeps only the columns present in both dataframes.
    app_train, app_test = app_train.align(app_test, join = 'inner', axis = 1)

    # Add the target column back in.
    app_train['TARGET'] = train_labels
    
    return train_labels, app_train, app_test

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm.sklearn import LGBMClassifier

def classify(app_train, app_test, type_search = "grid", classifier = 'rf', cv = 5, param_grid = {}, n_jobs = 1) :
    
    # Pick the classifier desired by the user...
    log_reg = LogisticRegression
    if classifier == 'decision' : log_reg = DecisionTreeClassifier
    elif classifier == 'SVC' : log_reg = SVC
    elif classifier == 'rf' : log_reg = RandomForestClassifier
    elif classifier == 'lgbm' : log_reg = LGBMClassifier
    else : pass
    
    
    # Label encoding, one hot encoding, train-test alignment.
    app_train, app_test = label_encode(app_train, app_test)
    app_train, app_test = one_hot_encode(app_train, app_test)
    train_labels, app_train, app_test = align_train_test(app_train, app_test)
    
    
    train = app_train.drop(columns = ['TARGET'], errors = 'ignore')
    test = app_test.copy()
    
 
    # Setup the pipeline. 
    from sklearn.pipeline import Pipeline
    clf = Pipeline([('imp', SimpleImputer(strategy = 'median')),
                    ('sca', MinMaxScaler(feature_range = (0, 1))),
                    ('clf', log_reg())],
                    verbose = True)
    

    # Setup the search: grid or random? Pick your favourite! 
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    search = 0
    if type_search == 'grid' :
        search = GridSearchCV(estimator = clf, 
                              param_grid = param_grid, 
                              cv = cv,
                              scoring = 'roc_auc',
                              n_jobs = n_jobs,
                              verbose = 1)
    else :
        search = RandomizedSearchCV(estimator = clf, 
                                    param_distributions = param_grid,
                                    cv = cv,
                                    scoring = 'roc_auc',
                                    n_jobs = n_jobs,
                                    n_iter = 3,
                                    verbose = 1)
    
    
    # Training ...
    search.fit(train, train_labels)

    
    #print('Miscellanea of results:', search.cv_results_)
    #print('Score achieved by the best config. during stratified CV:', search.best_score_)
    #print('Best estimator config:', search.best_estimator_)

    
    # Compute predictions...
    log_reg_pred = search.predict_proba(test)[:, 1]
    # print(log_reg_pred)

    
    # Final result dataframe.
    submit = app_test[['SK_ID_CURR']].copy()
    submit.loc[:, 'TARGET'] = log_reg_pred

    
    # Compute feature importance (if applicable).
    feat_imp = pd.Series(0, index=train.columns)
    if(classifier in ['rf', 'lgbm']) :
        feat_imp = pd.Series(search.best_estimator_.named_steps["clf"].feature_importances_, index=train.columns)
        
    # Return the results.
    return submit, search.cv_results_, feat_imp

In [None]:
# Hyperparameter grid
param_grid = {
    'clf__max_leaf_nodes': [7, 10],
    'clf__max_depth': [7, 10],
    'clf__n_estimators': [100, 150],
    'clf__class_weight': ["balanced"]
}

predictions, search_res, feat_imp = classify(app_train.copy(),
                                             app_test.copy(),
                                             classifier = 'rf',
                                             type_search = "rand",
                                             cv = 3, 
                                             param_grid = param_grid,
                                             n_jobs = -1)

In [None]:
print(search_res['params'])
print(search_res['mean_test_score'])

df_src_res = pd.DataFrame.from_dict(search_res['params'], orient = 'columns')
df_src_res['score'] = search_res['mean_test_score']
df_src_res

## Score versus Hyperparameters

As a final plot, we can show the score versus the value of each hyperparameter. Keep in mind that the hyperparameters are not changed one at a time, so an apparent relationship between a hyperparameter's value and the score does not necessarily mean that this hyperparameter is what drives the score. However, we might be able to identify values of hyperparameters that seem more promising. Mostly these plots are for my own interest, to see if there are any trends! 

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns

fig, axs = plt.subplots(1, 3, figsize = (24, 6))

# Index of the row with the best cross-validation score
best_idx = df_src_res['score'].idxmax()

# Plot of three hyperparameters
for i, hyper in enumerate(['clf__n_estimators', 'clf__max_leaf_nodes', 'clf__max_depth']):
    df_src_res[hyper] = df_src_res[hyper].astype(float)
    # Scatterplot with regression line (recent seaborn versions require x and y as keywords)
    sns.regplot(x = hyper, y = 'score', data = df_src_res, ax = axs[i])
    # Mark the best-scoring configuration with a star
    axs[i].scatter(df_src_res.loc[best_idx, hyper], df_src_res.loc[best_idx, 'score'], marker = '*', s = 200, c = 'k')
    axs[i].set(xlabel = '{}'.format(hyper), ylabel = 'Score', title = 'Score vs {}'.format(hyper));

plt.tight_layout()
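
As a numeric complement to these plots, the same results can be summarised per hyperparameter value. The sketch below uses a small made-up results table (`example_results` and its scores are invented for illustration, not output from the search above) with the same column layout as `df_src_res`. The same caveat applies: the hyperparameters vary together, so these averages only hint at promising values.

```python
import pandas as pd

# Made-up results with the same column layout as df_src_res (scores are invented)
example_results = pd.DataFrame({
    'clf__n_estimators': [100, 150, 100, 150],
    'clf__max_depth': [7, 7, 10, 10],
    'score': [0.70, 0.72, 0.71, 0.74],
})

# Mean score, best score and number of evaluations for each value of each hyperparameter
summaries = {}
for hyper in ['clf__n_estimators', 'clf__max_depth']:
    summaries[hyper] = example_results.groupby(hyper)['score'].agg(['mean', 'max', 'count'])
    print(summaries[hyper], end = '\n\n')
```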