# Finding optimal training parameters using grid search

When you work with classifiers, you rarely know in advance which parameter values are best, and checking every possible combination by hand is impractical. This is where grid search becomes useful. We specify a set of candidate values for each parameter, and the search automatically trains and evaluates the classifier for each configuration to find the best combination. Let's see how to do it.
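Before turning to scikit-learn, the idea can be sketched in plain Python. This is only an illustration: `toy_score` is a hypothetical stand-in for a real cross-validated score, and the candidate values are made up for the example.

```python
from itertools import product

def toy_score(n_estimators, max_depth):
    # Hypothetical scoring function (an assumption for illustration only).
    # A real grid search would train and cross-validate a model here.
    return -abs(n_estimators - 100) / 100 - abs(max_depth - 4) / 4

# Candidate values for each parameter (illustrative, not from the dataset)
grid = {'n_estimators': [25, 50, 100], 'max_depth': [2, 4, 8]}

best_params, best_score = None, float('-inf')
for values in product(*grid.values()):
    # Pair each parameter name with one candidate value
    params = dict(zip(grid.keys(), values))
    score = toy_score(**params)
    if score > best_score:
        best_params, best_score = params, score

print("Best parameters:", best_params)
```

The search itself is just a loop over combinations that remembers the best score seen so far. What a library adds is the model training, the cross-validation, and the bookkeeping.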

In [18]:
import numpy as np 
import matplotlib.pyplot as plt 
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV  # Grid search
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split  # Train/test split

def visualize_classifier(classifier, X, y, title=''):
    # Define the minimum and maximum values for X and Y
    # that will be used in the mesh grid
    min_x, max_x = X[:, 0].min() - 1.0, X[:, 0].max() + 1.0
    min_y, max_y = X[:, 1].min() - 1.0, X[:, 1].max() + 1.0

    # Define the step size to use in plotting the mesh grid 
    mesh_step_size = 0.01

    # Define the mesh grid of X and Y values
    x_vals, y_vals = np.meshgrid(np.arange(min_x, max_x, mesh_step_size), np.arange(min_y, max_y, mesh_step_size))

    # Run the classifier on the mesh grid
    output = classifier.predict(np.c_[x_vals.ravel(), y_vals.ravel()])

    # Reshape the output array
    output = output.reshape(x_vals.shape)

    # Create a plot
    plt.figure()

    # Specify the title
    plt.title(title)

    # Choose a color scheme for the plot 
    plt.pcolormesh(x_vals, y_vals, output, cmap=plt.cm.gray)

    # Overlay the training points on the plot 
    plt.scatter(X[:, 0], X[:, 1], c=y, s=75, edgecolors='black', linewidth=1, cmap=plt.cm.Paired)

    # Specify the boundaries of the plot
    plt.xlim(x_vals.min(), x_vals.max())
    plt.ylim(y_vals.min(), y_vals.max())

    # Specify the ticks on the X and Y axes
    plt.xticks((np.arange(int(X[:, 0].min() - 1), int(X[:, 0].max() + 1), 1.0)))
    plt.yticks((np.arange(int(X[:, 1].min() - 1), int(X[:, 1].max() + 1), 1.0)))

    plt.show()


Load the input data:

In [2]:
input_file = 'data_random_forests.txt' 
data = np.loadtxt(input_file, delimiter=',') 
X, y = data[:, :-1], data[:, -1] 

Separate the data into three classes:

In [3]:
# Separate input data into three classes based on labels 
class_0 = np.array(X[y==0]) 
class_1 = np.array(X[y==1]) 
class_2 = np.array(X[y==2]) 

Split the data into training and testing datasets:

In [5]:
# Split data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=5)

Specify the grid of parameter values that you want the search to test. A common approach is to hold one parameter constant while varying the other, and then do the reverse, instead of testing every possible pairing. In this case, we want to find the best values for n_estimators and max_depth. Let's specify the parameter grid:

In [6]:
# Define the parameter grid
parameter_grid = [
    {'n_estimators': [100], 'max_depth': [2, 4, 7, 12, 16]},
    {'max_depth': [4], 'n_estimators': [25, 50, 100, 250]},
]
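To see how many configurations this grid describes, the following sketch expands it by hand with itertools. The `expand` helper is a hypothetical name introduced here for illustration. It treats each dictionary in the list as an independent sub-grid, which is my understanding of how scikit-learn interprets a list of dictionaries.

```python
from itertools import product

parameter_grid = [
    {'n_estimators': [100], 'max_depth': [2, 4, 7, 12, 16]},
    {'max_depth': [4], 'n_estimators': [25, 50, 100, 250]},
]

def expand(grid_list):
    """Yield one dict per candidate, treating each sub-grid independently."""
    for sub_grid in grid_list:
        names = sorted(sub_grid)
        for values in product(*(sub_grid[name] for name in names)):
            yield dict(zip(names, values))

candidates = list(expand(parameter_grid))
cv_folds = 5  # assumed to match the cv=5 setting used in the search

for candidate in candidates:
    print(candidate)

print(len(candidates))             # 9 candidates: 5 from the first sub-grid, 4 from the second
print(len(candidates) * cv_folds)  # 45 cross-validation fits per metric
```

Two points stand out. The full Cartesian product of the five max_depth values and four n_estimators values would give 20 candidates, so the two sub-grids cut the work by more than half. The combination n_estimators=100, max_depth=4 also appears in both sub-grids, so it is listed twice.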

Let's define the metrics that the grid search should use to rank the parameter combinations:

In [7]:
metrics = ['precision_weighted', 'recall_weighted'] 

For each metric, we run the grid search, which trains and scores the classifier for every combination of parameters using 5-fold cross-validation:

In [11]:
for metric in metrics: 
    print("\n##### Searching optimal parameters for", metric) 

    classifier = GridSearchCV( ExtraTreesClassifier(random_state=0),
                              parameter_grid, cv=5, scoring=metric) 
    
    classifier.fit(X_train, y_train) 


##### Searching optimal parameters for precision_weighted

##### Searching optimal parameters for recall_weighted


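Print the best parameter combination. Because classifier is reassigned on each pass through the loop, this reports the result of the last search (recall_weighted):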

In [30]:
print("\nBest parameters:", classifier.best_params_) 


Best parameters: {'max_depth': 4, 'n_estimators': 25}
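Since the loop reuses the name classifier, only the last search remains available afterwards. One possible way to keep the results for every metric, and to list the mean cross-validated score of each candidate, is sketched below. The data comes from `make_blobs` as a synthetic stand-in for data_random_forests.txt, and the grid is reduced to keep the run short. Both are assumptions made for this example, as is the `searches` dictionary.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class, two-feature data (assumption: stands in for the real file)
X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=5)

# Reduced grid so the example runs quickly (values chosen for illustration)
parameter_grid = [
    {'n_estimators': [10], 'max_depth': [2, 4]},
    {'max_depth': [4], 'n_estimators': [5, 10]},
]
metrics = ['precision_weighted', 'recall_weighted']

# Keep one fitted search per metric instead of overwriting a single variable
searches = {}
for metric in metrics:
    search = GridSearchCV(ExtraTreesClassifier(random_state=0),
                          parameter_grid, cv=5, scoring=metric)
    search.fit(X_train, y_train)
    searches[metric] = search

for metric, search in searches.items():
    print("\n##### Results for", metric)
    results = search.cv_results_
    # Mean cross-validated score for every candidate in the grid
    for params, mean_score in zip(results['params'], results['mean_test_score']):
        print(params, '-->', round(mean_score, 3))
    print("Best parameters:", search.best_params_)
```

Storing the fitted searches in a dictionary makes it possible to compare the best parameters chosen under each metric, which may differ.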


Print the performance report:

In [31]:
y_pred = classifier.predict(X_test) 
print("\nPerformance report:\n") 
print(classification_report(y_test, y_pred)) 


Performance report:

              precision    recall  f1-score   support

         0.0       0.93      0.84      0.88        79
         1.0       0.85      0.86      0.85        70
         2.0       0.84      0.92      0.88        76

   micro avg       0.87      0.87      0.87       225
   macro avg       0.87      0.87      0.87       225
weighted avg       0.87      0.87      0.87       225
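To make the summary rows easier to interpret, the sketch below recomputes the macro and weighted averages from the per-class rows of the report above. It uses the rounded values as printed, so the results are approximate. The `rows` dictionary is simply a transcription of the report.

```python
# Per-class (precision, recall, support) copied from the report above
rows = {
    0.0: (0.93, 0.84, 79),
    1.0: (0.85, 0.86, 70),
    2.0: (0.84, 0.92, 76),
}

total_support = sum(support for _, _, support in rows.values())

# Macro average: plain mean over classes, ignoring class sizes
macro_precision = sum(p for p, _, _ in rows.values()) / len(rows)

# Weighted average: each class contributes in proportion to its support
weighted_precision = sum(p * s for p, _, s in rows.values()) / total_support
weighted_recall = sum(r * s for _, r, s in rows.values()) / total_support

print(total_support)                 # 225
print(round(macro_precision, 2))     # 0.87
print(round(weighted_precision, 2))  # 0.87
print(round(weighted_recall, 2))     # 0.87
```

The weighted averages correspond to the precision_weighted and recall_weighted metrics used in the grid search. Here the macro and weighted figures agree because the three classes are of similar size.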

