# **Automated hyperparameter tuning**
<font color='grey' size='1.5'> Created by Parisa Hosseinzadeh for *Machine learning for proteins*, Spring 2022. </font>

Today, we will work through two examples of hyperparameter tuning. In the first example, we tune a random forest classifier. In the second, we tune a deep learning model.

We're training both models on MNIST data.

A bit about the MNIST dataset, from the [MNIST Wikipedia page](https://en.wikipedia.org/wiki/MNIST_database):

The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. 

<img src="https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png?format=250w" >

## Random forest on MNIST data

### Loading required modules

In [None]:
%matplotlib inline

from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

### Loading and preparing the dataset

In [None]:
# Fetching MNIST Dataset
mnist = fetch_openml('mnist_784', version=1)

# Get the data and target
X, y = mnist["data"], mnist["target"]

# Split the train and test set
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

### Building model and testing

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Training on the existing dataset
rf_clf = RandomForestClassifier(
                      random_state= 42, # to make sure results are reproducible
                      bootstrap=True, # train each tree on a bootstrap sample (drawn with replacement)
                      max_depth = 1, # maximum depth of each tree (1 = a single split)
                      n_estimators = 20, # number of trees
                      )

# fitting
rf_clf.fit(X_train, y_train)

### Evaluation

In [None]:
# Evaluating the model
y_pred = rf_clf.predict(X_test)
score = accuracy_score(y_test, y_pred)
print("Accuracy score after training on existing dataset", score)

### Hyperparameter tuning

As you can see, the accuracy isn't ... great. So, we need to change a bunch of parameters to make the random forest better.

#### Q1. Possible parameters

Check [Scikit's random forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). What parameters are available for you to tune?

#### Checking current parameters

Let's look at which parameters are currently in use and what their values are.

In [None]:
from pprint import pprint

# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rf_clf.get_params())

#### Choosing parameters to tune

As you can see, there are a lot of parameters to tune. It is not efficient to tune all of them, so we usually just focus on a subset.

#### Q2. Parameters to tune

What are the top 5 most important parameters that you'll pick to tune?

#### Building a grid and tuning

We usually pick the following parameters for a random forest (parameters and hyperparameter tuning adapted from [TDS - Will Koehrsen](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)):

- `n_estimators` = number of trees in the forest
- `max_features` = max number of features considered for splitting a node
- `max_depth` = max number of levels in each decision tree
- `min_samples_split` = min number of data points placed in a node before the node is split
- `min_samples_leaf` = min number of data points allowed in a leaf node
- `bootstrap` = method for sampling data points (with or without replacement)

The first step is to give each parameter some values we would like to search.

In [None]:
## Note: Fewer options are written to save time for this activity
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 20, stop = 200, num = 4)]
# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt')
max_features = ['sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 90, num = 5)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

Then, we would like to build a random grid that contains all these values.

In [None]:
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
pprint(random_grid)

#### Q3. Total combinations

How many total combinations exist for the grid you set?

As you can see, we have to search a lot of combinations if we want to search everything. To avoid this, we perform a random search. The benefit of a random search is that we are not trying every combination, but selecting at random to sample a wide range of values.
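
To make the contrast concrete, here is a minimal sketch using only the Python standard library. The small grid `toy_grid` is a hypothetical example, not the grid from this activity. The idea is that an exhaustive search visits every element of the Cartesian product of the options, while a random search draws only a fixed number of them.

```python
import itertools
import math
import random

# Hypothetical toy grid (NOT the grid used in this activity)
toy_grid = {
    "n_estimators": [20, 80],
    "max_depth": [10, None],
    "bootstrap": [True, False],
}

# Total number of combinations = product of the number of options per parameter
n_combinations = math.prod(len(values) for values in toy_grid.values())
print("Total combinations:", n_combinations)

# An exhaustive search would visit every combination
all_combinations = [
    dict(zip(toy_grid, values))
    for values in itertools.product(*toy_grid.values())
]

# A random search visits only a fixed-size random subset (here 3 of them)
rng = random.Random(42)
sampled = rng.sample(all_combinations, k=3)
for params in sampled:
    print(params)
```

The same counting rule applies to the grid you built above: multiply the number of options for each parameter.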

Note that there is also the option to try every combination. It's called *grid search*, and you can learn more about it at [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). If you want to try it, a much smaller grid is highly recommended.
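
As a rough illustration of an exhaustive search, the sketch below runs `GridSearchCV` on a deliberately tiny grid. To keep the run short, it assumes scikit-learn's small built-in digits dataset (`load_digits`) rather than MNIST; the grid values and variable names are illustrative choices, not part of this activity.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small built-in digits dataset (8x8 images), used here only to keep the run short
X_small, y_small = load_digits(return_X_y=True)

# Illustrative tiny grid: 2 x 2 = 4 combinations
small_grid = {
    "n_estimators": [10, 30],
    "max_depth": [5, None],
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    cv=2,  # 2-fold cross-validation for every combination
)
grid_search.fit(X_small, y_small)

print("Combinations tried:", len(grid_search.cv_results_["params"]))
print("Best parameters:", grid_search.best_params_)
```

Unlike the random search, every one of the four combinations is evaluated here, which is why the cost grows quickly as the grid gets larger.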

In [None]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 2-fold cross-validation,
# search across 5 different combinations, and use all available cores
# We chose 5 to save class time. In practice, you would usually try 100 or so.
rf_random = RandomizedSearchCV(estimator = rf, 
                               param_distributions = random_grid, 
                               n_iter = 5, 
                               cv = 2, 
                               verbose=1, 
                               random_state=42, 
                               n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)

#### Q4. Time 

How long did it take to run through your random search?
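
One way to reason about the run time is to count how many models are trained. The sketch below does that arithmetic for the settings used in the search above, assuming the scikit-learn default `refit=True`, which adds one final fit on the full training set.

```python
# Values taken from the RandomizedSearchCV call above
n_iter = 5     # number of sampled parameter combinations
cv_folds = 2   # cross-validation folds per combination

# Each sampled combination is trained once per fold
search_fits = n_iter * cv_folds

# Assumption: refit=True (the default), so the best combination is fit once more
total_fits = search_fits + 1

print(f"{search_fits} cross-validation fits + 1 refit = {total_fits} model fits")
```

Raising `n_iter` or `cv` scales the run time roughly in proportion, although individual fits vary in cost with `n_estimators` and `max_depth`.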

#### Getting the best model

Now let's get the best parameters and see how much our accuracy changed.

In [None]:
rf_random.best_params_

# Example output (your values will differ):
# {'bootstrap': True,
#  'max_depth': 70,
#  'max_features': 'auto',
#  'min_samples_leaf': 4,
#  'min_samples_split': 10,
#  'n_estimators': 400}

#### Q5. New accuracy

Rebuild your random forest with these parameters and find the accuracy on the test set. What is your conclusion?

In [None]:
# your code here

In [None]:
#@markdown Sample code

# Retraining with the tuned parameters
rf_clf = RandomForestClassifier(
                      bootstrap=True,
                      max_depth=70,
                      max_features='sqrt', # 'auto' was removed in scikit-learn 1.3; 'sqrt' is equivalent for classifiers
                      min_samples_leaf=4,
                      min_samples_split=10,
                      n_estimators=400
)

# fitting
rf_clf.fit(X_train, y_train)

# Evaluating the model
y_pred = rf_clf.predict(X_test)
score = accuracy_score(y_test, y_pred)
print("Accuracy score after hyperparameter tuning", score)

## Convolutional neural nets

We can also perform hyperparameter tuning on deep learning models. Here, we will try it on a CNN trained on MNIST data.

### Loading necessary modules

In [None]:
%pip install keras-tuner --upgrade

In [None]:
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.optimizers import Adam
from keras_tuner import RandomSearch  # the package is now imported as keras_tuner (formerly kerastuner)

### Loading in and preparing data

In [None]:
img_width, img_height, img_num_channels = 28, 28, 1

# Load MNIST data
(input_train, target_train), (input_test, target_test) = mnist.load_data()

# Reshape data
input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1)
input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1)

# Determine shape of the data
input_shape = (img_width, img_height, img_num_channels)

# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

# Scale data
input_train = input_train / 255
input_test = input_test / 255

### Building models

We first set the main parameters of our model. 

In [None]:
# Model configuration
batch_size = 100
loss_function = sparse_categorical_crossentropy
no_classes = 10
# changed epochs from 25 to 5 for speeding up
no_epochs = 5
validation_split = 0.2
verbosity = 1

Then we build our CNN model. As you can see, it is a simple CNN with two convolutional layers and two dense layers. The number of units in the first dense layer is tunable, and so is the learning rate.
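
To get a feel for what the tunable number of units changes, the sketch below counts the trainable parameters of this architecture by hand. It uses the layer sizes from the `build_model` cell in this notebook (32 and 64 filters, 3x3 kernels, 28x28x1 input) and assumes the Keras `Conv2D` defaults of stride 1 and `'valid'` padding. The helper names are illustrative.

```python
def conv_output_size(size, kernel):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1

def count_parameters(units, n_classes=10):
    """Trainable parameters of the CNN for a given number of dense units."""
    side = 28
    side = conv_output_size(side, 3)       # 28 -> 26
    conv1 = 3 * 3 * 1 * 32 + 32            # weights + biases, 32 filters
    side = conv_output_size(side, 3)       # 26 -> 24
    conv2 = 3 * 3 * 32 * 64 + 64           # weights + biases, 64 filters
    flat = side * side * 64                # size of the flattened feature map
    dense = flat * units + units           # tunable dense layer
    out = units * n_classes + n_classes    # softmax output layer
    return conv1 + conv2 + dense + out

for units in (64, 128, 192):
    print(f"units={units}: {count_parameters(units):,} parameters")
```

Almost all parameters sit in the tunable dense layer, so this single hyperparameter largely determines the size of the model.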

In [None]:
# MODEL BUILDING FUNCTION
def build_model(hp):
  # Create the model
  model = Sequential()
  model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
  model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
  model.add(Flatten())
  hp_units = hp.Int('units', min_value=64, max_value=218, step=32)
  model.add(Dense(hp_units, activation='relu'))
  model.add(Dense(no_classes, activation='softmax'))

  # Display a model summary
  model.summary()

  # Compile the model
  model.compile(loss=loss_function,
                optimizer=Adam(
                  hp.Choice('learning_rate',
                            values=[1e-2, 1e-3, 1e-4])),# making learning rate tunable by setting values
                metrics=['accuracy'])
  
  # Return the model
  return model

### Tuning

We will also use a random search to tune the CNN. It works very similarly to the random forest search. This time, we're running only 3 trials.
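
The sketch below enumerates the search space defined in `build_model` with plain Python, to show how little of it three trials cover. It assumes that `hp.Int` with a `step` yields `min_value`, `min_value + step`, and so on, up to at most `max_value`; the random draw here only mimics the idea of a random search and is not Keras Tuner's own sampler.

```python
import itertools
import random

# Assumption: hp.Int('units', 64, 218, step=32) yields 64, 96, ... up to at most 218
unit_choices = list(range(64, 218 + 1, 32))
learning_rates = [1e-2, 1e-3, 1e-4]

# Every (units, learning_rate) pair the tuner could pick
search_space = list(itertools.product(unit_choices, learning_rates))
print("units options:", unit_choices)
print("total combinations:", len(search_space))

# A random search with max_trials=3 evaluates only three of these combinations
rng = random.Random(0)
for units, lr in rng.sample(search_space, k=3):
    print(f"trial: units={units}, learning_rate={lr}")
```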

In [None]:
# Perform tuning
# I changed max_trials from 5 --> 3 to save time
# I changed execution from 3 --> 1 to save time
tuner = RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=3,
    executions_per_trial=1,
    directory='tuning_dir',
    project_name='machinecurve_example')

In [None]:
# Display search space summary
tuner.search_space_summary()

# Perform random search
# changed epochs from 5 to 3 to save time
tuner.search(input_train, target_train,
             epochs=3,
             validation_split=validation_split)

#### Q6. Time and setting

How long does this take? What other parameters could we change?

### Getting the best model and evaluating

In [None]:
# Get best model
models = tuner.get_best_models(num_models=1)
best_model = models[0]

# Fit data to model
history = best_model.fit(input_train, target_train,
            batch_size=batch_size,
            epochs=no_epochs,
            verbose=verbosity,
            validation_split=validation_split)


In [None]:
# Generate generalization metrics
score = best_model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')

#### Q7. Accuracy

What is the best accuracy? What's your conclusion?