<a href="https://colab.research.google.com/github/sp8rks/MaterialsInformatics/blob/main/worked_examples/hyperparameter_opt/materials_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Grid vs. Random Search Hyperparameter Optimization

# Setup

### Installation

For this project we need to install matbench (for datasets) and CBFV (to create composition-based feature vectors).

In [None]:
!pip install matbench
!pip install CBFV

### Imports

In [None]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

from scipy.stats import randint

from matbench.bench import MatbenchBenchmark
from CBFV.composition import generate_features

### Data

Load the data from MatBench. This example uses the matbench_expt_is_metal dataset. The first task is selected and loaded, along with the first fold of the dataset, and the data is split into train and test sets.

In [None]:
mb = MatbenchBenchmark(subset=["matbench_expt_is_metal"])
task = list(mb.tasks)[0]
task.load()
fold0 = task.folds[0]
train_inputs, train_outputs = task.get_train_and_val_data(fold0)
test_inputs, test_outputs = task.get_test_data(fold0, include_target=True)
print(train_inputs[0:2], train_outputs[0:2])
print(train_outputs.shape, test_outputs.shape)
        

Describe the inputs and outputs of the training set. This outputs summary statistics for each, which is helpful for getting a quick look at the nature of our dataset.

In [None]:
train_inputs.describe()

In [None]:
train_outputs.describe()

Set up the train and test dataframes, then convert the data into composition-based feature vectors using generate_features from the CBFV library.

In [None]:
train_df = pd.DataFrame({"formula": train_inputs, "target": train_outputs})
test_df = pd.DataFrame({"formula": test_inputs, "target": test_outputs})
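# Show the first few rows (a bare expression in the middle of a cell displays nothing)
print(train_df.head())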

X_train, y_train, _, _ = generate_features(train_df)
print(X_train.shape)
X_test, y_test, _, _ = generate_features(test_df)
print(X_test.shape)
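
To make the idea of a composition-based feature vector concrete, here is a minimal plain-Python sketch. It is not CBFV's implementation: the helper names `parse_formula` and `featurize` and the tiny `ATOMIC_NUMBER` table are assumptions made for illustration, and a real featurizer works from a much larger table of element properties. The sketch turns a formula into element fractions and then summarizes one element property (atomic number) with a few statistics.

```python
import re

# Atomic numbers for a handful of elements (illustrative subset only).
ATOMIC_NUMBER = {"H": 1, "O": 8, "Al": 13, "Fe": 26, "Cu": 29}


def parse_formula(formula):
    """Parse a simple formula such as 'Fe2O3' into {'Fe': 2.0, 'O': 3.0}.

    Parentheses and hydrates are not handled in this sketch.
    """
    amounts = {}
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        amounts[symbol] = amounts.get(symbol, 0.0) + (float(count) if count else 1.0)
    return amounts


def featurize(formula):
    """Summarize the atomic numbers of the elements in a formula."""
    amounts = parse_formula(formula)
    total = sum(amounts.values())
    values = [ATOMIC_NUMBER[element] for element in amounts]
    # Average weighted by the atomic fraction of each element.
    weighted_avg = sum(ATOMIC_NUMBER[el] * n / total for el, n in amounts.items())
    return {
        "avg_Z": weighted_avg,
        "max_Z": max(values),
        "min_Z": min(values),
        "range_Z": max(values) - min(values),
    }


print(featurize("Fe2O3"))
```

Every formula maps to a vector of the same length regardless of how many elements it contains, which is what allows a standard classifier to be trained on it.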

## Hyperparameter Optimization

We can do hyperparameter tuning in different ways. Two common approaches are grid search (exhaustive, so more expensive) and random search (a fixed sampling budget, so usually cheaper). The examples below are adapted from https://www.geeksforgeeks.org/hyperparameter-tuning/


First, we will grid search over a logistic regression classifier from scikit-learn. Grid search is slow because it trains and evaluates the model on every possible combination of the parameter values you list. In exchange, the set of candidates is fixed, so the search evaluates the same combinations on every run. Be aware that the number of combinations is the product of the number of values listed for each hyperparameter, so the search time grows exponentially with the number of hyperparameters you tune.

In [None]:
#Grid search first using logistic regression classifier model

# Creating the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}
  
# Instantiating logistic regression classifier
# https://stats.stackexchange.com/a/184026/293880
logreg = LogisticRegression(max_iter=100)
  
# Instantiating the GridSearchCV object
logreg_grid = GridSearchCV(logreg, param_grid, cv = 5)
  
logreg_grid.fit(X_train, y_train)
  
# Print the tuned parameters and score
print("Grid tuned Logistic Regression Parameters: {}".format(logreg_grid.best_params_)) 
print("Best score is {}".format(logreg_grid.best_score_))
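
The cost of a grid search can be estimated before running it. The sketch below uses a hypothetical helper, `n_fits`, to count how many model fits an exhaustive search needs: the number of combinations times the number of cross-validation folds (the final refit on the full training set is not counted). It is applied to the `C` grid above and to the decision tree grid used later in this notebook.

```python
from math import prod


def n_fits(param_grid, cv):
    """Number of model fits an exhaustive grid search performs (excluding the final refit)."""
    n_combinations = prod(len(values) for values in param_grid.values())
    return n_combinations * cv


# Only the number of values per hyperparameter matters for the count.
logreg_grid_sizes = {"C": range(15)}  # 15 values, as in np.logspace(-5, 8, 15)
tree_grid_sizes = {
    "max_depth": range(1, 10),
    "max_features": range(1, 10),
    "min_samples_leaf": range(1, 10),
    "criterion": ["gini", "entropy"],
}

print(n_fits(logreg_grid_sizes, cv=5))  # 75
print(n_fits(tree_grid_sizes, cv=5))    # 7290
```

Going from one hyperparameter to four raises the cost from 75 fits to 7290, which is why larger grids quickly become impractical.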

Second, let's try a random search. Random search is usually much faster than grid search because it draws a fixed number of candidates at random from the distributions you specify, and that number (`n_iter` in scikit-learn, 10 by default) lets you control how long the search runs. However, because it does not try every combination, it may miss the best one. This method trades some accuracy for a large gain in speed and efficiency.

In [None]:
#Now we can try random search with logistic regression
  
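# Creating the hyperparameter distribution
# C must be a positive float, so sample it log-uniformly over the same
# range as the grid above (1e-5 to 1e8) instead of drawing integers.
from scipy.stats import loguniform
param_dist = {"C": loguniform(1e-5, 1e8)}
  
# Instantiating logistic regression classifier
logreg = LogisticRegression()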
  
# Instantiating RandomizedSearchCV object
logreg_random = RandomizedSearchCV(logreg, param_dist, cv = 5)
  
logreg_random.fit(X_train, y_train)
  
# Print the tuned parameters and score
print("Random tuned Logistic Regression Parameters: {}".format(logreg_random.best_params_))
print("Best score is {}".format(logreg_random.best_score_))
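
To show what "drawing a fixed number of samples" means, here is a plain-Python sketch of the sampling step only. It is an illustration rather than scikit-learn's internals: `sample_log_uniform` is a hypothetical helper, and the exponent range of -5 to 8 is taken from the grid search cell above. Sampling the exponent uniformly spreads the candidates evenly across orders of magnitude, which suits a parameter like `C`.

```python
import random


def sample_log_uniform(rng, low_exp, high_exp):
    """Draw one value whose base-10 logarithm is uniform on [low_exp, high_exp]."""
    return 10 ** rng.uniform(low_exp, high_exp)


rng = random.Random(42)  # fixed seed so the same candidates are drawn on every run
n_iter = 10              # the search budget: how many candidates to evaluate
candidates = [sample_log_uniform(rng, -5, 8) for _ in range(n_iter)]

for c in candidates:
    print(f"C = {c:.3g}")
```

Without a fixed seed the candidates change from run to run, so a random search can report different best parameters each time. In scikit-learn, passing `random_state` to `RandomizedSearchCV` makes the draws repeatable.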


We can do the same grid vs. random search comparison with another model, such as a decision tree classifier.

In [None]:
#grid search for decision tree hyperparameters
  
# Creating the hyperparameter grid 
param_grid = {"max_depth": range(1,10),
              "max_features": range(1,10),
              "min_samples_leaf": range(1,10),
              "criterion": ["gini", "entropy"]}

# Instantiating Decision Tree classifier
tree = DecisionTreeClassifier()
  
# Instantiating GridSearchCV object
tree_grid = GridSearchCV(tree, param_grid, cv = 5)
  
tree_grid.fit(X_train, y_train)
  
# Print the tuned parameters and score
print("Grid tuned Decision Tree Parameters: {}".format(tree_grid.best_params_))
print("Best score is {}".format(tree_grid.best_score_))


In [None]:
#random search for decision tree hyperparameters
  
# Creating the hyperparameter distributions to sample from
param_dist = {"max_depth": randint(1,10),
              "max_features": randint(1,10),
              "min_samples_leaf": randint(1,10),
              "criterion": ["gini", "entropy"]}

# Instantiating Decision Tree classifier
tree = DecisionTreeClassifier()
  
# Instantiating RandomizedSearchCV object
tree_random = RandomizedSearchCV(tree, param_dist, cv = 5)
  
tree_random.fit(X_train, y_train)
  
# Print the tuned parameters and score
print("Random tuned Decision Tree Parameters: {}".format(tree_random.best_params_))
print("Best score is {}".format(tree_random.best_score_))
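
The trade-off between the two approaches can be seen on a toy problem. The sketch below is a plain-Python illustration, not scikit-learn code: `toy_score` is a hypothetical stand-in for a cross-validated score, with its peak placed arbitrarily at `max_depth=6`, `min_samples_leaf=3`. Grid search evaluates all 81 combinations and always finds the peak, while random search evaluates only 10 draws and can at best match it.

```python
import itertools
import random


def toy_score(max_depth, min_samples_leaf):
    """Hypothetical score surface with a single peak at (6, 3)."""
    return -((max_depth - 6) ** 2) - ((min_samples_leaf - 3) ** 2)


depths = range(1, 10)
leaves = range(1, 10)

# Grid search: evaluate every combination (9 * 9 = 81 evaluations).
grid_candidates = list(itertools.product(depths, leaves))
grid_best = max(grid_candidates, key=lambda p: toy_score(*p))

# Random search: evaluate a fixed budget of randomly drawn combinations.
rng = random.Random(0)
random_candidates = [(rng.randrange(1, 10), rng.randrange(1, 10)) for _ in range(10)]
random_best = max(random_candidates, key=lambda p: toy_score(*p))

print("grid:  ", grid_best, toy_score(*grid_best), f"({len(grid_candidates)} evaluations)")
print("random:", random_best, toy_score(*random_best), f"({len(random_candidates)} evaluations)")
```

On a real problem each evaluation is a full cross-validated model fit, so the gap in evaluation counts translates directly into run time.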


Hyperparameter optimization will show up in other notebooks and homeworks. Grid search and random search can be fine for certain tasks, but when the models and hyperparameters become more complicated and numerous it may be advantageous to explore other options such as Bayesian Optimization or Genetic Algorithms.