# How to use Grid Search CV in sklearn, Keras, XGBoost, LightGBM in Python
GridSearchCV is a brute-force way of finding the best hyperparameters for a specific dataset and model: it trains and cross-validates every combination in a grid you specify. Why not automate it to the extent we can?

## Setup: Prepared Dataset

We use three datasets: MNIST, Boston House Prices and Breast Cancer.

## MNIST dataset

For the MNIST dataset, we reshape the images, scale the grayscale pixel values to the range [0, 1] by dividing by 255, and one-hot encode the output classes.

In [None]:
# Imports (assumes the standalone Keras package; with TensorFlow, use tensorflow.keras instead)
from keras.datasets import mnist
from keras.utils import to_categorical

# LOAD DATA
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# PREPROCESSING
def preprocess_mnist(x_train, y_train, x_test, y_test):
    # Add a channel dimension: each image becomes 28x28x1 (grayscale)
    x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
    x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
    input_shape = (28, 28, 1)

    # Float values for division
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')

    # Scale pixel intensities to [0, 1] by dividing by the max 8-bit value
    x_train /= 255
    x_test /= 255

    # One-hot encode the class labels
    y_train = to_categorical(y_train)
    y_test = to_categorical(y_test)

    return x_train, y_train, x_test, y_test, input_shape

X_train, y_train, X_test, y_test, input_shape = preprocess_mnist(x_train, y_train, x_test, y_test)
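
To make the two preprocessing steps concrete, here is a small NumPy-only sketch on a made-up 2x2 "image" and three made-up labels. It does not call Keras; `one_hot` is a hypothetical helper, assumed to mirror what `to_categorical` does for integer labels.

```python
import numpy as np

# A made-up 2x2 grayscale "image" with 8-bit intensities (0-255).
image = np.array([[0, 51], [102, 255]], dtype=np.uint8)

# Scale to [0, 1] by dividing by the maximum 8-bit intensity.
scaled = image.astype("float32") / 255


def one_hot(labels, num_classes):
    """Hypothetical helper: one row per label, with a 1 in the label's column."""
    encoded = np.zeros((len(labels), num_classes), dtype="float32")
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded


labels = np.array([0, 2, 1])
encoded = one_hot(labels, num_classes=3)

print(scaled)
print(encoded)
```

Taking `encoded.argmax(axis=1)` recovers the original integer labels, which is useful later when a metric expects class indices rather than one-hot rows.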

## Boston House Prices Dataset

For the house prices dataset, we do even less preprocessing. The copy bundled with scikit-learn is already numeric and has no missing values, so we only split it into training and test sets.

In [None]:
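# Note: load_boston was removed in scikit-learn 1.2; this cell needs an older version
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

boston = load_boston()
X = boston.data
y = boston.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)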

## Breast Cancer Dataset

For the last dataset, breast cancer, we don't do any preprocessing except for splitting the data into training and test sets.

In [None]:
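# Load breast cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Note: every dataset cell reuses the names X_train, X_test, y_train and y_test,
# so run the matching dataset cell right before each model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)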

## Running GridSearchCV

In [None]:
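from sklearn.model_selection import GridSearchCV

def algorithm_pipeline(X_train_data, X_test_data, y_train_data, y_test_data, 
                       model, param_grid, cv=10, scoring_fit='neg_mean_squared_error',
                       do_probabilities = False):
    # Exhaustively cross-validate every combination in param_grid
    gs = GridSearchCV(
        estimator=model,
        param_grid=param_grid, 
        cv=cv, 
        n_jobs=-1,  # use all available CPU cores
        scoring=scoring_fit,
        verbose=2
    )
    # fit() returns the search object itself, refit on the best parameters
    fitted_model = gs.fit(X_train_data, y_train_data)
    
    # Predict on the held-out test data with the best estimator
    if do_probabilities:
        pred = fitted_model.predict_proba(X_test_data)
    else:
        pred = fitted_model.predict(X_test_data)
    
    return fitted_model, pred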
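
As a quick illustration of what the pipeline does internally, the sketch below runs the same calls on a deliberately tiny grid. The decision tree, the three depths and `cv=3` are assumptions chosen only so that it finishes in seconds; they are not the settings used in this post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# 3 candidate depths x 3 folds = 9 fits, plus one refit of the best candidate.
gs = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6]},
    cv=3,
    scoring="accuracy",
)
fitted = gs.fit(X_tr, y_tr)

labels = fitted.predict(X_te)        # what do_probabilities=False returns
probs = fitted.predict_proba(X_te)   # what do_probabilities=True returns

print(fitted is gs)
print(labels.shape, probs.shape)
```

Because `fit` returns the search object itself, the value the pipeline hands back exposes `best_score_`, `best_params_` and `cv_results_` directly.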

# Models

## 1. Keras
Firstly, we define the neural network architecture. Since it is for the MNIST dataset, which consists of pictures, we define it as a convolutional neural network (CNN).

In [None]:
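# Imports (assumes the standalone Keras package; with TensorFlow, use tensorflow.keras instead)
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

# Readying neural network model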
def build_cnn(activation = 'relu',
              dropout_rate = 0.2,
              optimizer = 'Adam'):
    model = Sequential()
    
    model.add(Conv2D(32, kernel_size=(3, 3),
              activation=activation,
              input_shape=input_shape))
    model.add(Conv2D(64, (3, 3), activation=activation))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(dropout_rate))
    model.add(Flatten())
    model.add(Dense(128, activation=activation))
    model.add(Dropout(dropout_rate))
    model.add(Dense(10, activation='softmax'))
    
    model.compile(
        loss='categorical_crossentropy', 
        optimizer=optimizer, 
        metrics=['accuracy']
    )
    
    return model

Next, we define the parameters and the model to pass to algorithm_pipeline. This is a classification task, since we are trying to predict which class a given image belongs to. Note that I commented out some of the parameters because they would take a long time to train, but you can always experiment with which parameters to search over. You could even add pool_size or kernel_size, provided build_cnn accepts them as arguments.

In [None]:
param_grid = {
              'epochs':[1,2,3],
              'batch_size':[128]
              #'epochs' :              [100,150,200],
              #'batch_size' :          [32, 128],
              #'optimizer' :           ['Adam', 'Nadam'],
              #'dropout_rate' :        [0.2, 0.3],
              #'activation' :          ['relu', 'elu']
             }

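# KerasClassifier lived in keras.wrappers.scikit_learn in older Keras releases;
# newer setups provide a similar wrapper in the separate scikeras package
from keras.wrappers.scikit_learn import KerasClassifier

model = KerasClassifier(build_fn = build_cnn, verbose=0)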

model, pred = algorithm_pipeline(X_train, X_test, y_train, y_test, model, 
                                        param_grid, cv=5, scoring_fit='neg_log_loss')

print(model.best_score_)
print(model.best_params_)

From this GridSearchCV, we get the best score and best parameters to be:

>-0.04399333562212302
{'batch_size': 128, 'epochs': 3}

### Fixing bug for scoring with Keras
I came across this issue when trying to use accuracy as the scoring method for a Keras model in GridSearchCV, which is why 'neg_log_loss' was used above.

The solution to using something other than negative log loss is to remove part of the MNIST preprocessing; that is, REMOVE the lines that make the output variables categorical:

    # Categorical y values
    y_train = to_categorical(y_train)
    y_test = to_categorical(y_test)

With those lines removed, other scoring methods such as accuracy work (see the notebook). This was the best score and best parameters:

> 0.9858
{'batch_size': 128, 'epochs': 3}
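
One plausible explanation for the problem (an assumption, since the post does not spell out the error) is that scikit-learn's accuracy metric cannot compare one-hot encoded targets with the class indices a classifier's `predict` returns. The sketch below reproduces that mismatch with made-up arrays and no Keras at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up example: three samples, three classes.
y_true_onehot = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
y_pred_labels = np.array([0, 1, 1])  # class indices, as predict() would return

try:
    accuracy_score(y_true_onehot, y_pred_labels)
    mixed_formats_ok = True
except ValueError:
    # scikit-learn refuses to compare an indicator matrix with class labels.
    mixed_formats_ok = False

# Converting the one-hot rows back to class indices makes the formats match.
y_true_labels = y_true_onehot.argmax(axis=1)
accuracy = accuracy_score(y_true_labels, y_pred_labels)

print(mixed_formats_ok)
print(accuracy)
```

Keeping the labels as plain integers, as described above, avoids the mismatch in the first place.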

## 2. XGBoost
Next, we define parameters for the Boston house prices dataset. The task here is regression, for which I chose XGBoost. The grid search scores candidates by negative mean squared error, and I report the result as a Root Mean Squared Error (RMSE).

In [None]:
import numpy as np
import xgboost as xgb

model = xgb.XGBRegressor()
param_grid = {
    'n_estimators': [400, 700, 1000],
    'colsample_bytree': [0.7, 0.8],
    'max_depth': [15,20,25],
    'reg_alpha': [1.1, 1.2, 1.3],
    'reg_lambda': [1.1, 1.2, 1.3],
    'subsample': [0.7, 0.8, 0.9]
}

model, pred = algorithm_pipeline(X_train, X_test, y_train, y_test, model, 
                                 param_grid, cv=5)

# Root Mean Squared Error
print(np.sqrt(-model.best_score_))
print(model.best_params_)

The best score and parameters for the house prices dataset found by GridSearchCV were:

>3.4849014783892733
{'colsample_bytree': 0.8, 'max_depth': 20, 'n_estimators': 400, 'reg_alpha': 1.2, 'reg_lambda': 1.3, 'subsample': 0.8}
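
The `np.sqrt(-model.best_score_)` step can look odd at first. scikit-learn scorers follow a "higher is better" convention, so mean squared error is reported with its sign flipped. The sketch below walks through the arithmetic on made-up numbers (not the Boston data).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, only to illustrate the arithmetic.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4) / 3
neg_mse = -mse                            # what 'neg_mean_squared_error' reports
rmse = np.sqrt(-neg_mse)                  # flip the sign back, then take the root

print(neg_mse, rmse)
```

Note that `best_score_` is the mean of the per-fold negative MSE values, so its square root is close to, but not exactly, the mean of the per-fold RMSEs. Recent scikit-learn versions also accept `scoring='neg_root_mean_squared_error'` if you prefer to average RMSE directly.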

## 3. LightGBM
The next task was classifying breast cancer with LightGBM. The metric chosen was accuracy.

In [None]:
import lightgbm as lgb

model = lgb.LGBMClassifier()
param_grid = {
    'n_estimators': [400, 700, 1000],
    'colsample_bytree': [0.7, 0.8],
    'max_depth': [15,20,25],
    'num_leaves': [50, 100, 200],
    'reg_alpha': [1.1, 1.2, 1.3],
    'reg_lambda': [1.1, 1.2, 1.3],
    'min_split_gain': [0.3, 0.4],
    'subsample': [0.7, 0.8, 0.9],
    'subsample_freq': [20]
}

model, pred = algorithm_pipeline(X_train, X_test, y_train, y_test, model, 
                                 param_grid, cv=5, scoring_fit='accuracy')

print(model.best_score_)
print(model.best_params_)

The best parameters and best score from the GridSearchCV on the breast cancer dataset with LightGBM were:

>0.9736263736263736
{'colsample_bytree': 0.7, 'max_depth': 15, 'min_split_gain': 0.3, 'n_estimators': 400, 'num_leaves': 50, 'reg_alpha': 1.3, 'reg_lambda': 1.1, 'subsample': 0.7, 'subsample_freq': 20}
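
Before launching a grid like this, it can help to count how many models it implies. The sketch below uses scikit-learn's `ParameterGrid` on a copy of the LightGBM grid above; the fit count assumes `cv=5` as in the call and leaves out the single final refit.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the LightGBM grid from the cell above.
param_grid = {
    'n_estimators': [400, 700, 1000],
    'colsample_bytree': [0.7, 0.8],
    'max_depth': [15, 20, 25],
    'num_leaves': [50, 100, 200],
    'reg_alpha': [1.1, 1.2, 1.3],
    'reg_lambda': [1.1, 1.2, 1.3],
    'min_split_gain': [0.3, 0.4],
    'subsample': [0.7, 0.8, 0.9],
    'subsample_freq': [20],
}

cv = 5
n_candidates = len(ParameterGrid(param_grid))  # product of the list lengths
n_fits = n_candidates * cv                     # one fit per candidate per fold

print(n_candidates, n_fits)
```

Every extra value in any list multiplies the total, which is why trimming the grid or sampling from it can pay off quickly.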

## 4. Sklearn
Just to show that you can indeed run GridSearchCV with one of sklearn's own estimators, I tried the RandomForestClassifier on the same dataset as LightGBM.

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
param_grid = {
    'n_estimators': [400, 700, 1000],
    'max_depth': [15,20,25],
    'max_leaf_nodes': [50, 100, 200]
}

model, pred = algorithm_pipeline(X_train, X_test, y_train, y_test, model, 
                                 param_grid, cv=5, scoring_fit='accuracy')

print(model.best_score_)
print(model.best_params_)

The score was slightly lower than LightGBM's (0.9648 versus 0.9736):

>0.9648351648351648
{'max_depth': 25, 'max_leaf_nodes': 50, 'n_estimators': 1000}
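
The scores printed so far are cross-validation scores computed on the training split only. The pipeline also returns predictions for the held-out test split, which can be scored separately. The sketch below shows one way to do that; the small forest and the two-value grid are assumptions chosen to keep it fast, not the settings used above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Deliberately tiny grid and forest so the sketch runs in seconds.
search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, 5]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)
pred = search.predict(X_test)

cv_accuracy = search.best_score_              # mean accuracy over the CV folds
test_accuracy = accuracy_score(y_test, pred)  # accuracy on the held-out 20%

print(cv_accuracy, test_accuracy)
```

Comparing the two numbers gives a rough check on whether the selected hyperparameters generalize beyond the folds they were tuned on.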

## RandomizedSearchCV

We don't have to restrict ourselves to GridSearchCV. Why not implement RandomizedSearchCV too, if that is preferable to you? This is implemented at the bottom of the accompanying notebook.

We can add another parameter to the pipeline, search_mode, which lets us specify which search algorithm to use. We also introduce a parameter called n_iterations, since RandomizedSearchCV needs to know how many parameter combinations to sample, whereas GridSearchCV does not.

We can set defaults for both parameters, and indeed that is what I have done. search_mode = 'GridSearchCV' and n_iterations = 0 are the defaults, so we default to GridSearchCV, where the number of iterations is not used. When you switch to RandomizedSearchCV, pass an n_iterations of at least 1.
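
To see what the number of iterations controls, the sketch below compares the full grid with a random sample of it using scikit-learn's `ParameterGrid` and `ParameterSampler`. The grid is a copy of the random forest grid from the Sklearn section; `n_iter=10` and `random_state=0` are arbitrary choices for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Copy of the random forest grid used earlier.
param_grid = {
    'n_estimators': [400, 700, 1000],
    'max_depth': [15, 20, 25],
    'max_leaf_nodes': [50, 100, 200],
}

# Grid search evaluates every combination.
all_candidates = list(ParameterGrid(param_grid))

# Randomized search evaluates only n_iter combinations drawn from the same space.
sampled = list(ParameterSampler(param_grid, n_iter=10, random_state=0))

print(len(all_candidates), len(sampled))
```

Here the randomized search would train 10 candidates per fold instead of 27, at the cost of possibly missing the best combination in the grid.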

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

def search_pipeline(X_train_data, X_test_data, y_train_data, y_test_data, 
                       model, param_grid, cv=10, scoring_fit='neg_mean_squared_error',
                       do_probabilities = False, search_mode = 'GridSearchCV', n_iterations = 0):
    if search_mode == 'GridSearchCV':
        # Try every combination in param_grid
        search = GridSearchCV(
            estimator=model,
            param_grid=param_grid, 
            cv=cv, 
            n_jobs=-1, 
            scoring=scoring_fit,
            verbose=2
        )
    elif search_mode == 'RandomizedSearchCV':
        # Sample n_iterations combinations from param_grid (must be at least 1)
        search = RandomizedSearchCV(
            estimator=model,
            param_distributions=param_grid, 
            cv=cv,
            n_iter=n_iterations,
            n_jobs=-1, 
            scoring=scoring_fit,
            verbose=2
        )
    else:
        # Fail loudly instead of silently returning None
        raise ValueError("search_mode must be 'GridSearchCV' or 'RandomizedSearchCV'")

    fitted_model = search.fit(X_train_data, y_train_data)

    if do_probabilities:
        pred = fitted_model.predict_proba(X_test_data)
    else:
        pred = fitted_model.predict(X_test_data)

    return fitted_model, pred

Running this for the breast cancer dataset produces the results below, which are almost the same as the GridSearchCV result for the random forest (a score of 0.9648):

>0.9626373626373627
{'n_estimators': 1000, 'max_leaf_nodes': 100, 'max_depth': 25}