# Hyperparameter Optimization

For this exercise, we will have a look at Hyperparameter Optimization --
instead of just choosing the best type of machine learning model, we also want
to choose the best hyperparameter setting for a task. The end result (i.e. the
predictive performance) is again not important; how you get there is.

Your deliverable will be a report, written in a style that would make it
suitable for inclusion in an academic paper as the "Experimental
Setup" section or similar. If unsure, check an academic paper of your choice,
for example [this one](https://www.eecs.uwyo.edu/~larsko/papers/pulatov_opening_2022-1.pdf). The
level of detail should be higher than in a typical academic paper though. Your
report should be at most five pages, including references and figures but
excluding appendices. It should have the following structure:
- Introduction: What problem are you solving, how are you going to solve it.
- Dataset Description: Describe the data you're using, e.g. how many features and observations, what are you predicting, any missing values, etc.
- Experimental Setup: What specifically are you doing to solve the problem, i.e. what programming languages and libraries you use, how you process the data, which machine learning algorithms you consider, which hyperparameters and value ranges you tune, which measures you use to evaluate them, which hyperparameter optimization method you chose, etc.
- Results: Description of what you observed, including plots. Compare
  performance before and after tuning, and show the best configuration.
- Code: Add the code you've used as a separate file.

Your report must contain enough detail to reproduce what you did without the
code. If in doubt, include more detail.

There is no required format for the report. You could, for example, use a
Jupyter notebook.

## Data and Setup

We will have a look at the [Wine Quality
dataset](https://archive-beta.ics.uci.edu/dataset/186/wine+quality). It comes in a
red and a white variant; choose the one that corresponds to your preference in
wine. You may also use a dataset of your choice, for example one that's relevant
to your research.

Choose a small number of different machine learning algorithms and
hyperparameters, along with value ranges, for each. You can use implementations
of AutoML systems (e.g. auto-sklearn), scientific papers, or the documentation
of the library you are using to determine the hyperparameters to tune and the
value ranges. There is no single correct way to do this, but define a
reasonable space (e.g. don't include whether to turn on debug output, random
forests with 1,000,000 trees, or the choice of loss function). Your hyperparameter
search space should be so large that you cannot simply run a grid search.
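
Counting configurations makes the size requirement concrete. The sketch below assumes the inclusive Random Forest ranges from the baytune search space further down, plus a hypothetical discretization of the continuous `min_weight_fraction_leaf` into 11 values; the name `grid_sizes` is illustrative only.

```python
from math import prod

# Number of candidate values per hyperparameter (inclusive ranges).
grid_sizes = {
    "n_estimators": len(range(50, 1001)),      # 50..1000
    "criterion": 3,                            # gini, entropy, log_loss
    "max_depth": len(range(1, 101)),           # 1..100
    "min_samples_split": len(range(2, 101)),   # 2..100
    "min_samples_leaf": len(range(1, 101)),    # 1..100
    "min_weight_fraction_leaf": 11,            # assumed grid: 0.0, 0.05, ..., 0.5
    "max_features": 3,                         # sqrt, log2, None
}

n_configurations = prod(grid_sizes.values())
n_fits = n_configurations * 10  # one fit per fold in 10-fold cross-validation

print(f"{n_configurations:,} grid points, {n_fits:,} model fits")
```

Even with this coarse discretization, a full grid would need tens of billions of evaluations for a single algorithm, which is why a sampling-based optimizer is used instead.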

Determine the best machine learning algorithm and hyperparameter setting for
your dataset. Make sure to optimize both the type of machine learning algorithm
and the hyperparameters at the same time (do not first choose the best ML
algorithm and then optimize its hyperparameters). Choose a suitable
hyperparameter optimizer; you could also use several and e.g. compare the
results achieved by random search and Bayesian optimization. Make sure that the
way you evaluate model performance avoids bias and overfitting. You could use
statistical tests to make this determination.
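
One possible way to search over algorithms and hyperparameters jointly while keeping the performance estimate unbiased is nested cross-validation. The sketch below is an illustration under assumptions, not the required approach: it uses synthetic data (`X_demo`, `y_demo`) in place of the wine data, only two algorithms, deliberately small ranges, and a scikit-learn `Pipeline` whose `model` step is itself treated as a hyperparameter.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    RandomizedSearchCV,
    StratifiedKFold,
    cross_val_score,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=11, n_informative=5, n_classes=3, random_state=0
)

# The "model" step is a placeholder that the search replaces.
pipeline = Pipeline([("scale", StandardScaler()), ("model", KNeighborsClassifier())])

# One dictionary per algorithm; each draw first picks a dictionary,
# then samples that algorithm's hyperparameters.
search_space = [
    {
        "model": [KNeighborsClassifier()],
        "model__n_neighbors": randint(1, 30),
        "model__weights": ["uniform", "distance"],
    },
    {
        "model": [RandomForestClassifier(random_state=0)],
        "model__n_estimators": randint(10, 50),
        "model__max_depth": randint(1, 20),
    },
]

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

search = RandomizedSearchCV(
    pipeline, search_space, n_iter=8, cv=inner_cv, scoring="accuracy", random_state=0
)

# The outer loop scores the whole tuning procedure on data it never tuned on.
outer_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv, scoring="accuracy")
print(outer_scores.mean(), outer_scores.std())
```

The inner loop selects a configuration; the outer loop estimates how well that selection procedure generalizes, so the reported score is not inflated by the tuning itself.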

## Submission

Add your report and code to this repository. Bonus points if you can set up a
GitHub Action to automatically run the code and generate the report!

## Useful Resources
- "*Basics of HPO - Example and Practical Hints*" from the AutoML Course Videos
- https://www.youtube.com/watch?v=Gol_qOgRqfA
- https://www.youtube.com/watch?v=0wUF_Ov8b0A&t=1058s

## Importing the Dataset as a Pandas DataFrame

In [None]:
import pandas as pd
import numpy as np

In [None]:
red_wine_df = pd.read_csv('winequality-red.csv', delimiter=';')

In [None]:
red_wine_df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [None]:
X = red_wine_df.iloc[:, :-1]
y = red_wine_df['quality']

X.shape, y.shape

((1599, 11), (1599,))
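
The facts needed for the "Dataset Description" section (number of observations and features, missing values, class distribution) can be collected in one place. The helper below is a hypothetical sketch; `summarize_dataset` is not part of pandas, and the small frame only reuses three columns from the rows shown above so that the example runs without the CSV file.

```python
import pandas as pd


def summarize_dataset(df, target):
    """Collect basic facts about a tabular classification dataset."""
    features = df.drop(columns=[target])
    return {
        "n_observations": len(df),
        "n_features": features.shape[1],
        "n_missing": int(df.isna().sum().sum()),
        "class_counts": df[target].value_counts().sort_index().to_dict(),
    }


# Three columns of the five rows printed by red_wine_df.head().
demo_df = pd.DataFrame(
    {
        "fixed acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
        "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
        "quality": [5, 5, 5, 6, 5],
    }
)

summary = summarize_dataset(demo_df, target="quality")
print(summary)
```

In the notebook the same call would be `summarize_dataset(red_wine_df, target="quality")`. The class counts matter here because accuracy on an imbalanced target can look deceptively good.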

## Importing our Models

Continuing from the "ML Algorithm Selection" exercise, we consider the same models and determine which of them performs best on the red wine quality dataset:
- Logistic Regression
- K-Nearest Neighbors
- Random Forest
- Support Vector Classifier (SVM Classifier)
- Decision Tree Classifier
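
As a sketch of how all five candidates could be registered in one place, the dictionary below maps a short name to each scikit-learn class, in the same shape as the `models` dictionary used for Bayesian optimization further down. The name `candidate_models` and the abbreviations are assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Short name -> estimator class (classes, not instances, so each
# evaluation can construct a fresh model with its own hyperparameters).
candidate_models = {
    "LR": LogisticRegression,
    "KNN": KNeighborsClassifier,
    "RF": RandomForestClassifier,
    "SVC": SVC,
    "DT": DecisionTreeClassifier,
}

# Listing the constructor parameters is a quick way to see what could be tuned.
for name, model_class in candidate_models.items():
    defaults = model_class().get_params()
    print(f"{name}: {len(defaults)} constructor parameters")
```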

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

random_forest_classifier = RandomForestClassifier()

In [None]:
random_forest_classifier.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

## Hyperparameter Optimization

Methods used:
- Random Search
- Bayesian Optimization

### Bayesian Optimization

In [None]:
# Install the library needed for Bayesian optimization
# (comment this line out once it is installed):
!pip install baytune



In [None]:
models = {
    'RF' : RandomForestClassifier
}

In [None]:
from sklearn.model_selection import cross_val_score

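def scoring_function(model_name, hyperparameter_values):
    """Return the mean 10-fold cross-validated accuracy of one configuration."""
    # Look up the estimator class and build it with the proposed hyperparameters.
    model_class = models[model_name]
    model_instance = model_class(**hyperparameter_values)
    scores = cross_val_score(
        estimator=model_instance,
        X=X,
        y=y,
        cv=10,
        scoring='accuracy',
    )

    return scores.mean()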

In [None]:
from baytune.tuning import Tunable
from baytune.tuning import hyperparams as hp

# Narrower alternative search space (disabled):
# tunables = {
#     'RF': Tunable({
#         'n_estimators' : hp.IntHyperParam(min=50, max=1000, default=100),
#         'criterion' : hp.CategoricalHyperParam(['gini', 'entropy', 'log_loss'], default='gini'),
#         'max_depth': hp.IntHyperParam(min=1, max=20, default=5),
#         'min_samples_split': hp.IntHyperParam(min=2, max=10, default=2),
#         'min_samples_leaf': hp.IntHyperParam(min=1, max=10, default=1),
#         'min_weight_fraction_leaf': hp.FloatHyperParam(min=0.0, max=0.5, default=0.0),
#         'max_features': hp.CategoricalHyperParam(["sqrt", "log2", None], default='sqrt'),
#     })
# }

tunables = {
    'RF': Tunable({
        'n_estimators' : hp.IntHyperParam(min=50, max=1000, default=100),
        'criterion' : hp.CategoricalHyperParam(['gini', 'entropy', 'log_loss'], default='gini'),
        'max_depth': hp.IntHyperParam(min=1, max=100, default=5),
        'min_samples_split': hp.IntHyperParam(min=2, max=100, default=2),
        'min_samples_leaf': hp.IntHyperParam(min=1, max=100, default=1),
        'min_weight_fraction_leaf': hp.FloatHyperParam(min=0.0, max=0.5, default=0.0),
        'max_features': hp.CategoricalHyperParam(["sqrt", "log2", None], default='sqrt'),
    })
}

In [None]:
from baytune import BTBSession

session = BTBSession(
    tunables=tunables,
    scorer=scoring_function,
    verbose=True,
)

In [None]:
best_result = session.run(200)

best_result

{'id': 'a1d2be51a06316e212a0355b8645b5d4',
 'name': 'RF',
 'config': {'n_estimators': 185,
  'criterion': 'gini',
  'max_depth': 27,
  'min_samples_split': 32,
  'min_samples_leaf': 25,
  'min_weight_fraction_leaf': 0.02226761651277137,
  'max_features': 'log2'},
 'score': 0.5978852201257862}

### Random Search

In [None]:
from sklearn.model_selection import RandomizedSearchCV

#### Random Forest

In [None]:
from scipy.stats import uniform

# Define the hyperparameters. The ranges match the Bayesian optimization
# search space; range() excludes its upper bound, hence the "+ 1" values.

n_estimators = range(50, 1001)
criterion = ['gini', 'entropy', 'log_loss']
max_depth = range(1, 101)
min_samples_split = range(2, 101)
min_samples_leaf = range(1, 101)
min_weight_fraction_leaf = uniform(scale=0.5)  # uniform on [0, 0.5]
max_features = ["sqrt", "log2", None]

In [None]:
# Construct the hyperparameter distribution:

hyperparameter_distribution = {
    "n_estimators": n_estimators,
    "criterion": criterion,
    "max_depth" : max_depth,
    "min_samples_split" : min_samples_split,
    "min_samples_leaf" : min_samples_leaf,
    "min_weight_fraction_leaf" : min_weight_fraction_leaf,
    "max_features" : max_features,
}
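
Before launching the full search, it can help to draw a few configurations and check that they look sensible. The sketch below uses scikit-learn's `ParameterSampler` on a reduced copy of the space; `preview_space` is an illustrative name, and the integer range mirrors the inclusive 1 to 100 bounds of the baytune space.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# Reduced copy of the search space, for previewing only.
preview_space = {
    "criterion": ["gini", "entropy", "log_loss"],
    "max_depth": range(1, 101),                        # 1..100 inclusive
    "min_weight_fraction_leaf": uniform(scale=0.5),    # uniform on [0, 0.5]
    "max_features": ["sqrt", "log2", None],
}

# Lists and ranges are sampled uniformly; objects with an rvs() method
# (such as scipy distributions) are sampled from that distribution.
samples = list(ParameterSampler(preview_space, n_iter=5, random_state=0))

for sample in samples:
    print(sample)
```

`RandomizedSearchCV` draws its candidates in the same way, so the preview shows the kind of configurations the search will evaluate.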

In [None]:
# Construct the "Random Search" object:

K_FOLDS = 10
ITERATIONS = 200
random_search = RandomizedSearchCV(
    random_forest_classifier,
    hyperparameter_distribution,
    cv=K_FOLDS,
    scoring='accuracy',
    n_iter=ITERATIONS,
    verbose=2,
)

In [None]:
# Run the "Random Search" on the dataset and on the hyperparameter distribution:

random_search.fit(X, y)

Fitting 10 folds for each of 200 candidates, totalling 2000 fits
[CV] END criterion=entropy, max_depth=79, max_features=log2, min_samples_leaf=68, min_samples_split=69, min_weight_fraction_leaf=0.4597191491279367, n_estimators=997; total time=   2.8s
[CV] END criterion=entropy, max_depth=79, max_features=log2, min_samples_leaf=68, min_samples_split=69, min_weight_fraction_leaf=0.4597191491279367, n_estimators=997; total time=   5.2s
[CV] END criterion=entropy, max_depth=79, max_features=log2, min_samples_leaf=68, min_samples_split=69, min_weight_fraction_leaf=0.4597191491279367, n_estimators=997; total time=   4.0s
[CV] END criterion=entropy, max_depth=79, max_features=log2, min_samples_leaf=68, min_samples_split=69, min_weight_fraction_leaf=0.4597191491279367, n_estimators=997; total time=   3.4s
[CV] END criterion=entropy, max_depth=79, max_features=log2, min_samples_leaf=68, min_samples_split=69, min_weight_fraction_leaf=0.4597191491279367, n_estimators=997; total time=   2.2s
[CV] 

In [None]:
# Check the results:

pd.DataFrame(random_search.cv_results_)[['mean_test_score', 'std_test_score', 'params']]

| | mean_test_score | std_test_score | params |
|---|---|---|---|
| 0 | 0.550362 | 0.052027 | {'criterion': 'entropy', 'max_depth': 79, 'max... |
| 1 | 0.572866 | 0.063503 | {'criterion': 'entropy', 'max_depth': 81, 'max... |
| 2 | 0.575998 | 0.065332 | {'criterion': 'entropy', 'max_depth': 21, 'max... |
| 3 | 0.560362 | 0.068849 | {'criterion': 'entropy', 'max_depth': 55, 'max... |
| 4 | 0.564119 | 0.068248 | {'criterion': 'gini', 'max_depth': 16, 'max_fe... |
| ... | ... | ... | ... |
| 195 | 0.556616 | 0.065341 | {'criterion': 'log_loss', 'max_depth': 75, 'ma... |
| 196 | 0.572869 | 0.064637 | {'criterion': 'gini', 'max_depth': 94, 'max_fe... |
| 197 | 0.566623 | 0.060012 | {'criterion': 'entropy', 'max_depth': 92, 'max... |
| 198 | 0.563487 | 0.066890 | {'criterion': 'gini', 'max_depth': 18, 'max_fe... |


In [None]:
print(random_search.best_score_)
print(random_search.best_params_)

0.5922641509433962
{'criterion': 'entropy', 'max_depth': 95, 'max_features': 'log2', 'min_samples_leaf': 25, 'min_samples_split': 72, 'min_weight_fraction_leaf': 0.033151757479596144, 'n_estimators': 647}
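
To compare performance before and after tuning, both models can be scored on identical folds and the per-fold differences tested. The sketch below is illustrative only: it runs on synthetic data, the "tuned" configuration is a made-up example rather than a result from this notebook, and the Wilcoxon signed-rank test is one possible choice. Fold scores from cross-validation are not fully independent, so the p-value should be read with caution.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real dataset (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=3, random_state=0
)

# Fixed, shuffled folds so both models see exactly the same splits.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# n_estimators is kept small in both models so the example runs quickly.
baseline_model = RandomForestClassifier(n_estimators=50, random_state=0)
tuned_model = RandomForestClassifier(  # hypothetical "tuned" configuration
    n_estimators=50, max_depth=5, min_samples_leaf=5, random_state=0
)

baseline_scores = cross_val_score(baseline_model, X_demo, y_demo, cv=cv, scoring="accuracy")
tuned_scores = cross_val_score(tuned_model, X_demo, y_demo, cv=cv, scoring="accuracy")

differences = tuned_scores - baseline_scores
p_value = None
if np.any(differences != 0):
    # Paired, non-parametric test on the per-fold score differences.
    statistic, p_value = wilcoxon(tuned_scores, baseline_scores)

print(f"baseline: {baseline_scores.mean():.3f}, tuned: {tuned_scores.mean():.3f}")
print(f"p-value: {p_value}")
```

The same pattern applies to comparing the best configurations found by random search and Bayesian optimization: score both on the same folds, then test the paired differences.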
