# Hyperparameter Tuning
...

To begin with, we will perform the necessary imports and load the summary dataset.

In [8]:
# Import necessary libraries and information.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from lib.constants import DATA_PATH

# Load batter summary data.
summary = pd.read_csv(
    DATA_PATH + "/Batter_Summary_Reduced.txt", delimiter="\t")


## 1. Prepare the Data.
Before tuning hyperparameters for the Random Forest model, we must first prepare the data. This includes:

1. Removing unnecessary fields (e.g., Name).
2. Filling in missing data.
3. Splitting the data into test and training sets.

This process is performed below.

In [9]:
# Drop identifier fields that are not useful as features.
summary = summary.drop(columns=["Name", "Batter_ID"])

# Fill missing data with the column's median.
data = summary.copy()
for col in data.columns[data.isnull().any(axis=0)]:
    data[col] = data[col].fillna(data[col].median())

# Extract features and labels.
X = data.drop(columns=["International_One_Day_Batting_Average"])
y = data["International_One_Day_Batting_Average"]

# Hold out the last 12 rows as the test set and train on the rest.
X_train = X[:-12]
X_test = X[-12:]
y_train = y[:-12]
y_test = y[-12:]


## 2. Random Hyperparameter Search.
While it is possible to speculate about the best hyperparameters for the Random Forest based on theoretical findings in the literature, it is often more efficient to try a wide range of values and see which combination works best. Once a set of optimised hyperparameters has been chosen, we will explore their significance from a theoretical perspective.

The following hyperparameters will be tuned:

* n_estimators
* max_features
* max_depth
* min_samples_split
* min_samples_leaf
* bootstrap

**n_estimators**<br>
Defines the number of trees in the Random Forest.

A random forest extends bootstrap aggregation (bagging) of decision trees, making it an ensemble of decision trees. The n_estimators hyperparameter defines how many decision trees the random forest will contain. Typically, a higher number of trees lets the model learn the data better, but at the expense of a longer training process. Beyond some point, adding more trees yields diminishing returns.

To stabilise the error rate of the random forest, it is generally recommended to begin with ten times as many trees as there are features. However, this number should be raised or lowered depending on the other hyperparameters selected.
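
The diminishing returns described above can be illustrated with a toy simulation. This is a sketch under a strong simplifying assumption: each "tree" is modelled as an independent noisy guess at a true value of 0, whereas real trees in a forest are correlated, so in practice the gains flatten out sooner. The function name `ensemble_rmse` and all sizes are our own choices for illustration.

```python
import math
import random


def ensemble_rmse(n_trees, n_trials=200, noise_sd=1.0, seed=0):
    """RMSE of the average of n_trees independent noisy predictions of 0."""
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(n_trials):
        # Each toy "tree" predicts the true value (0) plus Gaussian noise.
        predictions = [rng.gauss(0.0, noise_sd) for _ in range(n_trees)]
        ensemble_prediction = sum(predictions) / n_trees
        squared_errors.append(ensemble_prediction ** 2)
    return math.sqrt(sum(squared_errors) / n_trials)


tree_counts = [1, 10, 100, 500]
rmse_by_count = {n: ensemble_rmse(n) for n in tree_counts}
for n in tree_counts:
    print(f"{n:>3} trees: RMSE = {rmse_by_count[n]:.3f}")
```

In this idealised setting the error shrinks roughly with the square root of the number of trees, so going from 10 to 100 trees helps far more than going from 100 to 500.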

**max_features**<br>
Defines the number of features to consider at each split.

At each split, a certain number of features (max_features) are randomly selected from the dataset. From these randomly selected features, one is chosen as the best for splitting the node. This parameter reduces overfitting and increases the stability of the trees.

Depending on the computational cost and overfitting present in the model, it is typical to use fewer features (log2) or more features (sqrt) as necessary. It is also possible to provide a custom float for further fine-tuning.
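
As a rough illustration of how these settings translate into a number of candidate features, the sketch below resolves a max_features value for a given feature count. It reflects our understanding of scikit-learn's documented behaviour (truncating to an integer, with a minimum of one feature); the helper `resolve_max_features` and the feature count of 30 are hypothetical and not taken from the dataset.

```python
import math


def resolve_max_features(setting, n_features):
    """Translate a max_features setting into a feature count per split."""
    if setting is None:
        return n_features
    if setting == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if setting == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(setting, float):
        # A float is treated as a fraction of the available features.
        return max(1, int(setting * n_features))
    # An integer is used as an absolute count.
    return int(setting)


n_features = 30  # hypothetical feature count, for illustration only
for setting in ["sqrt", "log2", 0.5, 1.0, 4]:
    print(setting, "->", resolve_max_features(setting, n_features))
```

With 30 features, "sqrt" considers 5 features per split and "log2" considers 4, which matches the description of log2 as the more restrictive option.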

> max_features is calculated as: sqrt(n_features), log_2(n_features), etc.

**max_depth**<br>
Defines the number of levels in the tree.

Theoretically, the maximum depth of a decision tree is one less than the number of samples; however, overfitting will occur well before this is reached. The deeper a tree grows, the more complex it becomes and the more detail, including noise, it captures about the training data. When this happens, the maximum depth must be reduced. However, if the tree is too shallow, underfitting will occur.

There is no single value typically recommended for max_depth. Generally, the approach is to experimentally choose a value that does not overfit or underfit the data.

**min_samples_split**<br>
Defines the minimum number of samples required to split a node.

Here, a node means an internal node (a node with children), not a leaf node. If a node has fewer samples than min_samples_split, it is not permitted to split. For example, if min_samples_split = 7 and a node contains only 5 samples, it will not split. This parameter is intended to control overfitting. Higher values prevent the model from learning relations specific only to the sample it was provided; however, too high a value can result in underfitting.

Typically, ideal values range between 1 and 40 for the CART algorithm, which is used in this project. Note that scikit-learn requires an integer min_samples_split to be at least 2.

**min_samples_leaf**<br>
Defines the minimum number of samples required at each leaf node.

A leaf node is a node without children. If splitting an internal node results in a leaf with fewer samples than min_samples_leaf, the split will not be permitted. For example, if min_samples_leaf = 2 and splitting an internal node results in a leaf node with 1 sample, the split will not occur. 

Typically, ideal values range between 1 and 20 for the CART algorithm.
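
The two constraints, min_samples_split and min_samples_leaf, can be combined into a single check on a proposed split. The following is a minimal sketch of that logic, not scikit-learn's actual implementation; `split_allowed` is a hypothetical helper, and the first two calls reproduce the examples given in the text.

```python
def split_allowed(n_left, n_right, min_samples_split=2, min_samples_leaf=1):
    """Return True if a node holding n_left + n_right samples may be split."""
    n_node = n_left + n_right
    if n_node < min_samples_split:
        return False  # too few samples in the node to attempt any split
    if n_left < min_samples_leaf or n_right < min_samples_leaf:
        return False  # one child would be smaller than the minimum leaf size
    return True


# Node of 5 samples with min_samples_split = 7: not permitted.
print(split_allowed(2, 3, min_samples_split=7))
# One child would hold a single sample with min_samples_leaf = 2: not permitted.
print(split_allowed(1, 9, min_samples_leaf=2))
# Node of 10 samples split 4/6 satisfies both constraints.
print(split_allowed(4, 6, min_samples_split=7, min_samples_leaf=2))
```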

**bootstrap**<br>
Defines whether bootstrapping should be used to select samples for training each tree.

Bootstrapping is a resampling technique used to create a random subset of data for each tree by using random sampling with replacement. This results in approximately one-third of instances being left out of each tree. The idea is that although each tree might have high variance for a particular set of the training data, overall, the entire forest will have a lower variance. If bootstrapping is not used, the same training set is used for each tree in the forest and overall variance would be expected to be greater.

It is generally recommended that bootstrapping be used to reduce the variance of the model.
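
The "approximately one-third left out" figure can be checked with a short simulation. This sketch draws a bootstrap sample for each of several toy "trees" and measures the share of instances that were never drawn. The sample and tree counts are arbitrary illustrative values.

```python
import random


def out_of_bag_fraction(n_samples, rng):
    """Draw n_samples indices with replacement; return the share never drawn."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples


rng = random.Random(42)
n_samples, n_trees = 1000, 50  # illustrative sizes only
fractions = [out_of_bag_fraction(n_samples, rng) for _ in range(n_trees)]
mean_oob = sum(fractions) / n_trees

print(f"Mean out-of-bag fraction: {mean_oob:.3f}")
# Probability that a given instance is missed by all n_samples draws.
print(f"Expected value:           {(1 - 1 / n_samples) ** n_samples:.3f}")
```

The expected value approaches 1/e (about 0.37) as the dataset grows, which is where the one-third rule of thumb comes from.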

### 2.1 Create Random Hyperparameter Grid.
To perform a random hyperparameter search, we first need to create a grid containing the possible values for each parameter to sample from.  

In [208]:
# Number of trees in the forest.
n_estimators = [int(x) for x in np.linspace(start=100, stop=500, num=21)]

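# Number of features to consider at every split.
# Note: "auto" (all features, for a regressor) was removed in scikit-learn 1.3;
# 1.0 is the equivalent setting in current versions.
max_features = [1.0, "sqrt"]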

# Maximum number of levels in each tree.
max_depth = [int(x) for x in np.linspace(start=10, stop=40, num=11)]
max_depth.append(None)

# Minimum number of samples required to split a node.
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node.
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree.
bootstrap = [True, False]

# Create the parameter grid.
param_grid = {"n_estimators": n_estimators,
              "max_features": max_features,
              "max_depth": max_depth,
              "min_samples_split": min_samples_split,
              "min_samples_leaf": min_samples_leaf,
              "bootstrap": bootstrap}


### 2.2 Randomised Search.
With the random parameter grid defined, we now wish to test combinations of hyperparameters. If we were to test all possible combinations from the random parameter grid defined previously, we would have 9072 tests to perform. Instead, RandomizedSearchCV is used to narrow down the possible values of the optimal hyperparameters. This method will allow a random selection of combinations to be tested, reducing the time taken for testing at the expense of a less thorough search. However, as this is only being used to narrow down the possible parameters, this trade-off is acceptable.
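
The figure of 9072 follows from multiplying the number of candidate values for each hyperparameter. The sketch below reproduces that count using only the sizes of the lists defined in section 2.1; `n_iter = 2000` is the number of combinations sampled by the search that follows.

```python
import math

# Number of candidate values per hyperparameter in the random search grid.
grid_sizes = {
    "n_estimators": len(range(100, 501, 20)),  # 100, 120, ..., 500
    "max_features": 2,
    "max_depth": len(range(10, 41, 3)) + 1,    # 10, 13, ..., 40, plus None
    "min_samples_split": 3,
    "min_samples_leaf": 3,
    "bootstrap": 2,
}

total_combinations = math.prod(grid_sizes.values())
n_iter = 2000

print(f"Total combinations: {total_combinations}")
print(f"Share sampled by the random search: {n_iter / total_combinations:.1%}")
```

Sampling 2000 combinations therefore covers a little over a fifth of the full grid.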

Below, we run RandomizedSearchCV to test 2000 combinations, performing three-fold cross-validation for each combination. From this, we can extract the optimal hyperparameters.

In [211]:
# Create the random forest regressor.
rfr = RandomForestRegressor()

# Create the random search cross-validator.
rfr_random = RandomizedSearchCV(
    estimator=rfr, param_distributions=param_grid, n_iter=2000, cv=3, n_jobs=-1)

# Fit the random search model.
rfr_random.fit(X_train, y_train)

# Print the best parameters for the model.
rfr_random.best_params_


{'n_estimators': 160,
 'min_samples_split': 5,
 'min_samples_leaf': 4,
 'max_features': 'sqrt',
 'max_depth': 22,
 'bootstrap': True}

### 2.3 Test Accuracy.
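With the narrowed-down hyperparameters selected, we can test the refined model on the 12 held-out rows, averaging the score over 250 fits because each forest is built with a different random state.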

In [212]:
accuracy = 0
num_tests = 250

for _ in range(num_tests):
    # Create the random forest regressor.
    rfr = RandomForestRegressor(n_estimators=160, min_samples_split=5,
                                min_samples_leaf=4, max_features="sqrt", max_depth=22, bootstrap=True)

    # Fit the model.
    rfr.fit(X_train, y_train)

    # Test the accuracy of the model.
    accuracy += rfr.score(X_test, y_test)

print("The accuracy of the model using Random Search hyperparameters is: {}".format(
    accuracy / num_tests))


The accuracy of the model using Random Search hyperparameters is: 0.7771169381040898
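
Note that for a regressor, the score method reports the coefficient of determination (R²) rather than a classification accuracy, so the value above is best read as the share of variance in the test labels explained by the model. A minimal sketch of the calculation follows. The helper `r_squared` and the numbers are made up for illustration and are unrelated to the batting data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]   # made-up labels
y_pred = [2.5, 5.5, 7.0, 9.5]   # made-up predictions

print(r_squared(y_true, y_pred))          # close predictions give a value near 1
print(r_squared(y_true, [6.0] * 4))       # always predicting the mean gives 0
```

A model that predicts worse than the mean of the labels yields a negative value, so the score is not bounded below by zero.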


## 3. Refine Hyperparameters.
Random search allowed us to narrow down the optimal values for the hyperparameters. Now, we can perform a more thorough search around the refined parameters using Grid Search. This method will test all possible combinations of hyperparameter values we provide it, and choose the best combination for the Random Forest.

### 3.1 Create Refined Parameter Grid.
To perform a grid search, we first need to create the grid of possible values for each parameter to test. 

In [200]:
# Create the parameter grid.
param_grid = {"n_estimators": [int(x) for x in np.linspace(140, 160, 11)],
              "max_features": [2, 3, 4, 5, 6],
              "max_depth": [None] + [int(x) for x in np.linspace(15, 25, 11)],
              "min_samples_split": [2, 3, 4],
              "min_samples_leaf": [1, 2, 3],
              "bootstrap": [True]}


### 3.2 Grid Search.
From the parameter grid, we can perform a grid search to test all combinations of the possible hyperparameter values using five-fold cross-validation. From this, we can determine the optimal hyperparameters.
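
Because grid search is exhaustive, it is worth counting the work involved before running it. The sketch below rebuilds the refined grid from section 3.1 with plain Python ranges (assuming they match the np.linspace calls) and enumerates every combination. The name `refined_grid` is ours.

```python
from itertools import product

refined_grid = {
    "n_estimators": list(range(140, 161, 2)),   # 140, 142, ..., 160
    "max_features": [2, 3, 4, 5, 6],
    "max_depth": [None] + list(range(15, 26)),  # None, 15, 16, ..., 25
    "min_samples_split": [2, 3, 4],
    "min_samples_leaf": [1, 2, 3],
    "bootstrap": [True],
}

# Every combination of one value per hyperparameter.
combinations = list(product(*refined_grid.values()))
n_folds = 5

print(f"Combinations to evaluate: {len(combinations)}")
print(f"Model fits with {n_folds}-fold CV: {len(combinations) * n_folds}")
print("First combination:", dict(zip(refined_grid, combinations[0])))
```

This amounts to 5940 combinations and 29700 model fits, which is why the random search was used first to narrow the ranges.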

In [201]:
# Create the random forest regressor.
rfr = RandomForestRegressor()

# Create the grid search cross-validator.
rfr_grid = GridSearchCV(
    estimator=rfr, param_grid=param_grid, cv=5, n_jobs=-1)

# Fit the grid search model.
rfr_grid.fit(X_train, y_train)

# Print the best parameters for the model.
rfr_grid.best_params_


{'bootstrap': True,
 'max_depth': 16,
 'max_features': 2,
 'min_samples_leaf': 1,
 'min_samples_split': 3,
 'n_estimators': 142}

### 3.3 Test Accuracy.
With the optimal hyperparameters selected, we can now test the accuracy of the model.

In [213]:
# Initialise variables for determining average accuracy.
accuracy = 0
num_tests = 250

for _ in range(num_tests):
    # Create the random forest regressor.
    rfr = RandomForestRegressor(n_estimators=142, min_samples_split=3,
                                min_samples_leaf=1, max_features=2, max_depth=16, bootstrap=True)

    # Fit the model.
    rfr.fit(X_train, y_train)

    # Test the accuracy of the model.
    accuracy += rfr.score(X_test, y_test)

# Print the accuracy.
print("The accuracy of the model using Grid Search hyperparameters is: {}".format(
    accuracy / num_tests))


The accuracy of the model using Grid Search hyperparameters is: 0.7855894094824831
