# Hyperparameter Optimization

For this exercise, we will have a look at Hyperparameter Optimization --
instead of just choosing the best type of machine learning model, we also want
to choose the best hyperparameter setting for a task. The end result (i.e. the
predictive performance) is again not important; how you get there is.

Your deliverable will be a report, written in a style that would be
suitable for inclusion in an academic paper as the "Experimental
Setup" section or similar. If unsure, check an academic paper of your choice,
for example [this one](https://www.eecs.uwyo.edu/~larsko/papers/pulatov_opening_2022-1.pdf). The
level of detail should be higher than in a typical academic paper though. Your
report should be at most five pages, including references and figures but
excluding appendices. It should have the following structure:
- Introduction: What problem are you solving, how are you going to solve it.
- Dataset Description: Describe the data you're using, e.g. how many features and observations, what are you predicting, any missing values, etc.
- Experimental Setup: What specifically are you doing to solve the problem, i.e. what programming languages and libraries are you using, how are you processing the data, what machine learning algorithms are you considering, which hyperparameters and value ranges, what measures are you using to evaluate them, what hyperparameter optimization method did you choose, etc.
- Results: Description of what you observed, including plots. Compare
  performance before and after tuning, and show the best configuration.
- Code: Add the code you've used as a separate file.

Your report must contain enough detail to reproduce what you did without the
code. If in doubt, include more detail.

There is no required format for the report. You could, for example, use a
Jupyter notebook.

## Data and Setup

We will have a look at the [Wine Quality
dataset](https://archive-beta.ics.uci.edu/dataset/186/wine+quality). Choose the
one that corresponds to your preference in wine. You may also use a dataset of
your choice, for example one that's relevant to your research.

Choose a small number of different machine learning algorithms and
hyperparameters, along with value ranges, for each. You can use implementations
of AutoML systems (e.g. auto-sklearn), scientific papers, or the documentation
of the library you are using to determine the hyperparameters to tune and the
value ranges. There is no single correct way to do this, but define a
reasonable space (e.g. don't include whether to turn on debug output, random
forests with 1,000,000 trees, or the loss function). Your hyperparameter
search space should be so large that you cannot simply run a grid search.
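
To make the grid-search argument concrete, the following sketch counts the configurations of a full grid over a single decision tree space. The hyperparameter names follow scikit-learn's `DecisionTreeClassifier`; the value ranges and the 10-step discretization of the continuous parameter are assumptions chosen for illustration, not a prescribed search space.

```python
from math import prod

# Hypothetical discretized search space for a decision tree.
# Each entry maps a hyperparameter name to the number of candidate values.
grid_sizes = {
    "criterion": 3,                  # gini, entropy, log_loss
    "splitter": 2,                   # best, random
    "max_depth": 1000,               # integers 1..1000
    "min_samples_split": 99,         # integers 2..100
    "min_samples_leaf": 100,         # integers 1..100
    "min_weight_fraction_leaf": 10,  # continuous range cut into 10 steps (assumption)
    "max_features": 3,               # sqrt, log2, None
}

# A full grid evaluates every combination of candidate values.
n_configurations = prod(grid_sizes.values())
n_fits = n_configurations * 10  # 10-fold cross-validation per configuration

print(f"{n_configurations:,} configurations, {n_fits:,} model fits")
```

Under these assumptions the grid has about 1.8 billion points for one algorithm alone, which is why sampling-based or model-based optimizers are the practical choice.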

Determine the best machine learning algorithm and hyperparameter setting for
your dataset. Make sure to optimize both the type of machine learning algorithm
and the hyperparameters at the same time (do not first choose the best ML
algorithm and then optimize its hyperparameters). Choose a suitable
hyperparameter optimizer; you could also use several and e.g. compare the
results achieved by random search and Bayesian optimization. Make sure that the
way you evaluate model performance avoids bias and overfitting. You could use
statistical tests to make this determination.
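
One way to search over the algorithm and its hyperparameters jointly is to treat the estimator itself as a hyperparameter. The sketch below does this with scikit-learn's `RandomizedSearchCV` and a one-step `Pipeline`. The synthetic data, the choice of a random forest as the second algorithm, the value ranges, and the names `X_demo`, `y_demo`, and `joint_space` are all assumptions made for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); replace with your own X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=0
)

# A one-step pipeline whose "clf" step is itself a hyperparameter.
pipeline = Pipeline([("clf", DecisionTreeClassifier())])

# One dict per algorithm: the search first picks a dict uniformly at random,
# then samples that algorithm's hyperparameters from it.
joint_space = [
    {
        "clf": [DecisionTreeClassifier(random_state=0)],
        "clf__max_depth": randint(1, 21),
        "clf__min_samples_leaf": randint(1, 51),
    },
    {
        "clf": [RandomForestClassifier(random_state=0)],
        "clf__n_estimators": randint(10, 101),
        "clf__max_depth": randint(1, 21),
    },
]

joint_search = RandomizedSearchCV(
    pipeline, joint_space, n_iter=10, cv=3, scoring="accuracy", random_state=0
)
joint_search.fit(X_demo, y_demo)

# The winning algorithm is stored as the value of the "clf" parameter.
print(type(joint_search.best_params_["clf"]).__name__, joint_search.best_score_)
```

Because each algorithm has its own dict, hyperparameters that only make sense for one algorithm (such as `n_estimators`) are never sampled for the other.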

## Submission

Add your report and code to this repository. Bonus points if you can set up a
GitHub Action to automatically run the code and generate the report!

## Useful Resources
- "*Basics of HPO - Example and Practical Hints*" from the AutoML Course Videos
- https://www.youtube.com/watch?v=Gol_qOgRqfA
- https://www.youtube.com/watch?v=0wUF_Ov8b0A&t=1058s

## Importing the Dataset as a Pandas Dataframe

In [None]:
import pandas as pd
import numpy as np

In [None]:
red_wine_df = pd.read_csv('winequality-red.csv', delimiter=';')

In [None]:
red_wine_df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [None]:
X = red_wine_df.iloc[:, :-1]
y = red_wine_df['quality']

X.shape, y.shape

((1599, 11), (1599,))

## Importing our Model (Decision Tree Classifier)

### Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

decision_tree_model = DecisionTreeClassifier()

In [None]:
decision_tree_model.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': None,
 'splitter': 'best'}

## Hyperparameter Optimization

Methods used:
- Bayesian Optimization
- Random Search


### Bayesian Optimization

In [None]:
# Install the library used for Bayesian optimization (comment this line out if it is already installed):
!pip install baytune

Collecting baytune
  Downloading baytune-0.5.0-py3-none-any.whl (75 kB)
Collecting copulas>=0.3.2 (from baytune)
  Downloading copulas-0.10.1-py3-none-any.whl (51 kB)
Installing collected packages: copulas, baytune
Successfully installed baytune-0.5.0 copulas-0.10.1


In [None]:
models = {
    'DTC': DecisionTreeClassifier,
}

In [None]:
from sklearn.model_selection import cross_val_score

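def scoring_function(model_name, hyperparameter_values):
    """Return the mean cross-validated accuracy of one configuration.

    No `cv` argument is passed, so `cross_val_score` uses its default
    5-fold (stratified) cross-validation.
    """
    model_class = models[model_name]
    model_instance = model_class(**hyperparameter_values)
    scores = cross_val_score(
        estimator=model_instance,
        X=X,
        y=y,
        scoring='accuracy',
    )

    return scores.mean()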

In [None]:
from baytune.tuning import Tunable
from baytune.tuning import hyperparams as hp

defaults = {'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 5,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': None,
 'splitter': 'best'}

tunables = {
    'DTC': Tunable({
        'criterion' : hp.CategoricalHyperParam(["gini", "entropy", "log_loss"], default='gini'),
        'splitter' : hp.CategoricalHyperParam(["best", "random"], default='best'),
        'max_depth': hp.IntHyperParam(min=1, max=1000, default=5),
        'min_samples_split': hp.IntHyperParam(min=2, max=100, default=2),
        'min_samples_leaf': hp.IntHyperParam(min=1, max=100, default=1),
        'min_weight_fraction_leaf': hp.FloatHyperParam(min=0.0, max=0.5, default=0.0),
        'max_features': hp.CategoricalHyperParam(["sqrt", "log2", None], default=None),
    }),
}

In [None]:
from baytune import BTBSession

session = BTBSession(
    tunables=tunables,
    scorer=scoring_function,
    verbose=True,
)

In [None]:
best_result = session.run(200)


In [None]:
best_result

{'id': '1c754653e34fa7cdcf6ee0ed673be93f',
 'name': 'DTC',
 'config': {'criterion': 'gini',
  'splitter': 'best',
  'max_depth': 204,
  'min_samples_split': 40,
  'min_samples_leaf': 3,
  'min_weight_fraction_leaf': 0.13620395810713193,
  'max_features': None},
 'score': 0.5622472570532915}

### Random Search

In [None]:
from sklearn.model_selection import RandomizedSearchCV

#### Decision Tree Classifier


In [None]:
from scipy.stats import uniform

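# Define the hyperparameter values and distributions to sample from.
# The integer ranges start at 1 (or 2) because DecisionTreeClassifier rejects
# max_depth=0 and min_samples_leaf=0; sampling 0 makes the fit fail.

criterion = ['gini', 'entropy', 'log_loss']
splitter = ["best", "random"]
max_depth = range(1, 1001)
min_samples_split = range(2, 101)
min_samples_leaf = range(1, 101)
min_weight_fraction_leaf = uniform(scale=0.5)
max_features = ["sqrt", "log2", None]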
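
`RandomizedSearchCV` accepts either a sequence of candidate values or any object with an `rvs` method, such as a frozen `scipy.stats` distribution. The short sketch below illustrates what `uniform(scale=0.5)` samples from, and shows `randint` as a possible alternative to an integer `range`; the variable names and the sample size of 1000 are assumptions for illustration.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers the interval [loc, loc + scale], so
# uniform(scale=0.5) draws floats between 0.0 and 0.5.
fraction_dist = uniform(loc=0.0, scale=0.5)
fractions = fraction_dist.rvs(size=1000, random_state=0)

# randint(low, high) draws integers from low up to, but excluding, high.
depth_dist = randint(1, 1001)
depths = depth_dist.rvs(size=1000, random_state=0)

print(fractions.min(), fractions.max())
print(depths.min(), depths.max())
```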

In [None]:
# Construct the hyperparameter distribution:

hyperparameter_distribution = {
    "criterion": criterion,
    "splitter" : splitter,
    "max_depth" : max_depth,
    "min_samples_split" : min_samples_split,
    "min_samples_leaf" : min_samples_leaf,
    "min_weight_fraction_leaf" : min_weight_fraction_leaf,
    "max_features" : max_features,
}

In [None]:
# Construct the "Random Search" object:

K_FOLDS = 10
ITERATIONS = 200
random_search = RandomizedSearchCV(
    decision_tree_model,
    hyperparameter_distribution,
    cv=K_FOLDS,
    scoring='accuracy',
    n_iter=ITERATIONS,
    verbose=3,
)

In [None]:
random_search.fit(X, y)

Fitting 10 folds for each of 200 candidates, totalling 2000 fits
[CV 1/10] END criterion=entropy, max_depth=785, max_features=log2, min_samples_leaf=71, min_samples_split=59, min_weight_fraction_leaf=0.44179466590264466, splitter=best;, score=0.444 total time=   0.0s
[CV 2/10] END criterion=entropy, max_depth=785, max_features=log2, min_samples_leaf=71, min_samples_split=59, min_weight_fraction_leaf=0.44179466590264466, splitter=best;, score=0.519 total time=   0.0s
[CV 3/10] END criterion=entropy, max_depth=785, max_features=log2, min_samples_leaf=71, min_samples_split=59, min_weight_fraction_leaf=0.44179466590264466, splitter=best;, score=0.500 total time=   0.0s
[CV 4/10] END criterion=entropy, max_depth=785, max_features=log2, min_samples_leaf=71, min_samples_split=59, min_weight_fraction_leaf=0.44179466590264466, splitter=best;, score=0.475 total time=   0.0s
[CV 5/10] END criterion=entropy, max_depth=785, max_features=log2, min_samples_leaf=71, min_samples_split=59, min_weight_fr

20 fits failed out of a total of 2000.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/tree/_classes.py", line 889, in fit
    super().fit(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/tree/_classes.py", line 177, in fit
    self._validate_params()
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 600, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/

In [None]:
pd.DataFrame(random_search.cv_results_)[['mean_test_score', 'std_test_score', 'params']]

| | mean_test_score | std_test_score | params |
|---|---|---|---|
| 0 | 0.528479 | 0.063091 | {'criterion': 'entropy', 'max_depth': 785, 'ma... |
| 1 | 0.457142 | 0.041679 | {'criterion': 'log_loss', 'max_depth': 83, 'ma... |
| 2 | 0.548467 | 0.054040 | {'criterion': 'entropy', 'max_depth': 652, 'ma... |
| 3 | 0.492866 | 0.046204 | {'criterion': 'gini', 'max_depth': 481, 'max_f... |
| 4 | 0.508491 | 0.072728 | {'criterion': 'entropy', 'max_depth': 118, 'ma... |
| ... | ... | ... | ... |
| 195 | 0.507854 | 0.081343 | {'criterion': 'entropy', 'max_depth': 642, 'ma... |
| 196 | 0.425892 | 0.001956 | {'criterion': 'log_loss', 'max_depth': 899, 'm... |
| 197 | 0.527229 | 0.050344 | {'criterion': 'log_loss', 'max_depth': 679, 'm... |
| 198 | 0.436533 | 0.043918 | {'criterion': 'entropy', 'max_depth': 363, 'ma... |


In [None]:
print(random_search.best_score_)
print(random_search.best_params_)

0.5878459119496855
{'criterion': 'log_loss', 'max_depth': 557, 'max_features': None, 'min_samples_leaf': 92, 'min_samples_split': 48, 'min_weight_fraction_leaf': 0.031846123643138435, 'splitter': 'best'}
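
The two searches above report cross-validation scores that were also used to pick the winning configuration, so they tend to be optimistic. They were also computed with different fold counts (the `cross_val_score` default in the Bayesian run versus `K_FOLDS = 10` in the random search). Nested cross-validation is one common way to get a less biased estimate. The sketch below is a minimal illustration on synthetic stand-in data; the data, the reduced search space, the fold counts, and the names `X_demo`, `y_demo`, `inner_search`, and `outer_scores` are assumptions, not part of the runs above.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); replace with the wine features and labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=0
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: tunes hyperparameters using only the outer training split.
inner_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": randint(1, 21), "min_samples_leaf": randint(1, 51)},
    n_iter=10,
    cv=inner_cv,
    scoring="accuracy",
    random_state=0,
)

# Outer loop: scores each tuned model on data the tuning never saw.
outer_scores = cross_val_score(
    inner_search, X_demo, y_demo, cv=outer_cv, scoring="accuracy"
)
print(outer_scores.mean(), outer_scores.std())
```

Running every optimizer through the same outer folds also yields paired scores, which is the form a statistical test for comparing the optimizers would need.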
