# SLU15 - Hyperparameter Tuning: Exercise notebook

You can now test your new skills with another medical dataset, this time concerning heart disease. We'll use the [Heart Disease UCI](https://archive.ics.uci.edu/ml/datasets/Heart+Disease) dataset. The original dataset has 76 attributes, but we'll use just 13 of them.

In [None]:
import pandas as pd
import numpy as np
import scipy
from scipy.stats import uniform
import hashlib
import json

from sklearn.linear_model import LogisticRegression
# These will be needed to prepare the dataset
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

import warnings
warnings.filterwarnings("ignore", category=UserWarning) # to ignore the warnings when running the searches

In [None]:
# Seed for reproducibility
np.random.seed(42)

# Load data
heart_df = pd.read_csv("data/heart.csv").rename({"condition":"target"}, axis=1)
heart_df.sample(5)

We start with the train-test split.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
                                        heart_df.drop("target", axis=1),
                                        heart_df.target, 
                                        test_size=0.3,
                                        random_state=42
                                        )

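The target variable is binary, as the patient either has heart disease or not, so we're dealing with a classification problem. We're going to standardize the dataset for you: regularized models such as logistic regression apply the same penalty to every coefficient, so the features should be on comparable scales. Note that the scaler is fitted on the train data only and then applied to the test data.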
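
To make the fit-on-train, transform-on-test pattern concrete, here is a minimal sketch on a tiny made-up array. The `toy_*` names and values are hypothetical and are not part of the heart dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
toy_test = np.array([[2.0, 400.0]])

toy_scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the train data only
toy_train_scaled = toy_scaler.fit_transform(toy_train)
# transform reuses the train statistics, so no test information leaks into the scaler
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled.mean(axis=0))  # close to 0 for each column
print(toy_train_scaled.std(axis=0))   # close to 1 for each column
print(toy_test_scaled)
```

The test row is scaled with the train mean and standard deviation, so its values need not have zero mean or unit variance.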

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print("X_train of shape ", X_train.shape)
print("y_train of shape ", y_train.shape)
print("X_test of shape  ", X_test.shape)
print("y_test of shape  ", y_test.shape)

## Exercise 0 - Baseline

Start by creating a baseline. Fit a logistic regression classifier with default settings to the train data, then calculate the predictions and the ROC AUC score for the test set.
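
As a reminder of how the metric is called, the sketch below uses made-up labels and scores (all `toy_*` names are hypothetical). In scikit-learn, `roc_auc_score` takes the true labels first and the scores or predictions second, and swapping them generally changes the result.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth, continuous scores, and hard label predictions
toy_true = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 0, 1]

# Correct order: y_true first, then y_score
auc_from_scores = roc_auc_score(toy_true, toy_scores)
auc_from_labels = roc_auc_score(toy_true, toy_labels)

# Swapped order: treats the predictions as if they were the ground truth
auc_swapped = roc_auc_score(toy_labels, toy_true)

print(auc_from_scores)  # 0.75
print(auc_from_labels)
print(auc_swapped)
```

The swapped call still returns a number without raising an error, which makes this mistake easy to miss.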

In [None]:
# Use these variables for the classifier, the prediction and the roc-auc score:
# lr = ...
# lr_pred = 
# lr_score = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(lr, LogisticRegression), 'Did you use the correct model?'
assert hashlib.sha256(json.dumps(''.join(str(i) for i in lr_pred)).encode()).hexdigest() == \
'ff60580ede35329b87086966fbb750a1eafcc909f78349bc48dd773883e19ca6', 'The prediction is not correct.'
np.testing.assert_almost_equal(lr_score, 0.790, 3, err_msg = 'The roc-auc-score is not correct.')
print("The AUROC score of the baseline logistic regression classifier is ", lr_score)

The baseline score is OK, but maybe we can squeeze more out of the classifier with hyperparameter tuning.

## Exercise 1 - Grid Search

We'll start the hyperparameter tuning with a grid search.

### Exercise 1.1 - Hyperparameter search space

Create a hyperparameter search space for two logistic regression hyperparameters, the regularization parameter `C` and the `penalty`. Use these values of the hyperparameters in the specified order:
- regularization parameter `C`: 0.1, 1, 4, 8, and 10
- `penalty`: "l1", "l2", and "elasticnet"

Store the defined search space in the variable `grid`.
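
A grid search tries every combination of the listed values. The sketch below uses scikit-learn's `ParameterGrid` on a smaller, hypothetical grid (`toy_grid` is not the grid requested above) to show how the combinations are enumerated.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical search space: a dict mapping parameter names to lists of values
toy_grid = {"C": [0.5, 2.0], "penalty": ["l1", "l2"]}

# ParameterGrid expands the dict into every combination of values
toy_combinations = list(ParameterGrid(toy_grid))

print(len(toy_combinations))  # 4, i.e. 2 values of C times 2 penalties
for combo in toy_combinations:
    print(combo)
```

The number of candidates is the product of the list lengths, so each extra value multiplies the number of models to fit.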

In [None]:
# grid = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(grid, dict), 'The grid should be a dictionary.'
assert "C" in grid, 'Make sure you have the requested parameters as keys.'
assert "penalty" in grid, 'Make sure you have the requested parameters as keys.'
assert all(num in grid["C"] for num in [0.1, 1, 4, 8, 10]), 'The values for C parameter are not correct.'
assert all(num in grid["penalty"] for num in ['l1', 'l2', 'elasticnet']), 'The values for penalty parameter are not correct.'

### Exercise 1.2 - Set up the grid search
Create a grid search with a logistic regression classifier and the hyperparameter space defined in 1.1. Use the AUROC scoring function. Use the `saga` solver and an `l1_ratio` of 0.1 in the logistic regression and keep the rest of the settings at default values. Assign the defined grid search to the `grid_search` variable.

In [None]:
# grid_search = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(grid_search, GridSearchCV), 'Are you using GridSearchCV?'
assert isinstance(grid_search.estimator, LogisticRegression), 'Are you using the right model?'
assert hashlib.sha256(json.dumps(str(grid_search.get_params())).encode()).hexdigest() == \
'4bc6775ceaf1f8b1b4b53aa2b72b9f61f48462db29fe41835af889761cce3175', 'Your parameters are not correct'

### Exercise 1.3 - Find the best model
Use the grid search defined in 1.2 and the train data to find the best hyperparameters. Assign the best model to the `gs_best_model` variable.
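
For orientation, here is a minimal sketch of the general fit-and-inspect pattern on synthetic data. The data, the `toy_*` names, the single-parameter grid, `max_iter=1000`, and `cv=5` are all assumptions made for illustration and differ from the settings the exercise asks for.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (not the heart dataset)
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)

toy_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 1.0, 100.0]},  # hypothetical grid over C only
    scoring="roc_auc",
    cv=5,
)
toy_search.fit(toy_X, toy_y)

# After fitting, the search object exposes the winning configuration
print(toy_search.best_params_)     # the best combination found
print(toy_search.best_score_)      # mean cross-validated score of that combination
toy_best = toy_search.best_estimator_  # model refitted on all the data passed to fit
```

Note that `best_score_` is a cross-validated score on the data passed to `fit`, so it is not directly comparable to a score computed on a held-out test set.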

In [None]:
# gs_best_model = ...
 
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(gs_best_model, LogisticRegression), 'Make sure you are using the right model.'
assert hashlib.sha256(json.dumps(str(gs_best_model.get_params())).encode()).hexdigest() == \
'9967b25890d15457dde8d403536461e12544671c20c23b0e36f3b0c504e1e433', 'The parameters of the best model are not correct.'
np.testing.assert_almost_equal(grid_search.best_score_, 0.912, 3, err_msg = 'The roc-auc-score is not correct.')

### Exercise 1.4 - Best predictions
Use the best model found in 1.3 to make a prediction on the test data. Assign the prediction to the `best_preds` variable.

In [None]:
# best_preds = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert hashlib.sha256(json.dumps(''.join(str(i) for i in best_preds)).encode()).hexdigest() == \
'e8e6089457cf3df4ca068a8ca85dc8642571bde196a47286f4ba073c0b50d643', 'The predictions are not correct.'
# roc_auc_score expects the true labels first, then the predictions
print("The AUROC score of the best logistic regression classifier from the grid search is ", roc_auc_score(y_test, best_preds))

Nice, the performance improved! Let's see if we can do even better with random search.

## Exercise 2 - Random Search 

### Exercise 2.1 - Search distributions
Create search distributions for two logistic regression hyperparameters, the regularization parameter `C` and the `penalty`. Use these values of the hyperparameters:
- Regularization parameter `C`: uniformly distributed between 0.1 and 10 (hint: use a scipy distribution)
- `penalty`: "l1", "l2", or "elasticnet"

Assign the distributions to the `random_grid` variable.  
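
One detail that is easy to get wrong: `scipy.stats.uniform` is parameterized by `loc` and `scale`, and covers the interval from `loc` to `loc + scale`, not from `loc` to `scale`. The sketch below uses a hypothetical interval (2 to 5), not the one requested above.

```python
from scipy.stats import uniform

# Hypothetical interval [2, 5]: loc is the lower bound, scale is the width
toy_dist = uniform(loc=2, scale=3)

# ppf(0) and ppf(1) give the lower and upper ends of the support
toy_lower, toy_upper = toy_dist.ppf(0), toy_dist.ppf(1)
print(toy_lower, toy_upper)  # 2.0 5.0
print(toy_dist.median())     # 3.5, the midpoint of the interval

# Draw a few samples with a fixed seed; all fall inside [2, 5]
toy_samples = toy_dist.rvs(size=5, random_state=0)
print(toy_samples)
```

The resulting frozen distribution can be used directly as a value in a dictionary of search distributions, alongside plain lists for categorical hyperparameters.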

In [None]:
# random_grid = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert "C" in random_grid, 'Make sure you have the requested parameters as keys.'
assert "penalty" in random_grid, 'Make sure you have the requested parameters as keys.'
assert isinstance(random_grid["C"], scipy.stats._distn_infrastructure.rv_frozen), 'The values for the C parameter are not correct.'
np.testing.assert_almost_equal(random_grid['C'].median(), 5.05, decimal=2,
                              err_msg='The values for the C parameter are not correct.')
np.testing.assert_almost_equal(random_grid['C'].entropy(), 2.2925347571405443, decimal=3,
                              err_msg='The values for the C parameter are not correct.')
np.testing.assert_almost_equal(random_grid["C"].stats(moments='mvsk')[1], 8.1675, decimal=3,
                              err_msg='The values for the C parameter are not correct.')
np.testing.assert_almost_equal(random_grid["C"].stats(moments='mvsk')[2], 0.0, decimal=3,
                              err_msg='The values for the C parameter are not correct.')
np.testing.assert_almost_equal(random_grid["C"].stats(moments='mvsk')[3], -1.2, decimal=3,
                              err_msg='The values for the C parameter are not correct.')
assert all(num in random_grid["penalty"] for num in ['l1', 'l2', 'elasticnet']), 'The values for penalty parameter are not correct.'

### Exercise 2.2 - Set up the random search
Create a random search with a logistic regression classifier and the hyperparameter space defined in 2.1. Use the `saga` solver and an l1_ratio of 0.1 in the logistic regression and keep the rest of the settings at default values.
* Set the random_state to 42
* Set the number of iterations to 15
* Set the scoring to AUROC

Assign the defined random search to the `random_search` variable.
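
Unlike a grid search, a random search samples a fixed number of candidates from the distributions. The sketch below shows the general shape on synthetic data. The data, the `toy_*` names, the distribution bounds, `max_iter=1000`, `n_iter=8`, and `random_state=0` are illustrative assumptions, not the values requested above.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data (not the heart dataset)
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)

toy_random_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": uniform(loc=0.5, scale=4.5)},  # C sampled from [0.5, 5]
    n_iter=8,          # number of candidates to sample
    scoring="roc_auc",
    random_state=0,    # fixes which candidates are sampled
)
toy_random_search.fit(toy_X, toy_y)

# One entry per sampled candidate, regardless of how wide the distribution is
print(len(toy_random_search.cv_results_["params"]))  # 8
print(toy_random_search.best_params_)
```

The cost of the search is controlled by `n_iter` rather than by the size of the search space, which is what makes continuous distributions practical here.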

In [None]:
# random_search = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(random_search, RandomizedSearchCV), 'Are you using RandomizedSearchCV?'
assert isinstance(random_search.estimator, LogisticRegression), 'Make sure you are using the right model.'
assert hashlib.sha256(json.dumps(random_search.estimator.get_params()).encode()).hexdigest() == \
'5e0da474d543cfcbf692c33a2fca03e9f4f6df7e7201f5f6543620f23752cb66', 'The parameters of the model are not correct.'
assert random_search.random_state==42, 'Check the random_state value.'
assert random_search.n_iter==15, 'Check the number of iterations.'
assert 'C' in random_search.get_params()['param_distributions'].keys(), 'The search distributions of the random search are not correct.'
assert 'penalty' in random_search.get_params()['param_distributions'].keys(), 'The search distributions of the random search are not correct.'

### Exercise 2.3 - Find the best model
Use the random search defined in 2.2 and the train data to find the best hyperparameters. Assign the best model to the `rs_best_model` variable.

In [None]:
# rs_best_model = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(rs_best_model, LogisticRegression) , 'Make sure you are using the right model.'
assert  hashlib.sha256(json.dumps(str(rs_best_model.get_params())).encode()).hexdigest() ==\
'3a01c21a24a2afab00d03502531f698015f63138576764716f6b4da3e648fdee', 'Your parameters are not correct'
np.testing.assert_almost_equal(random_search.best_score_, 0.913, 3, err_msg = 'The roc-auc-score is not correct.')

### Exercise 2.4 - Best hyperparameters
Get the hyperparameters of the best model found by the random search and assign them to the variable `rs_best_params`.

In [None]:
# rs_best_params = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert hashlib.sha256(json.dumps(rs_best_params['penalty']).encode()).hexdigest() == \
'e523f9d32d200d4c898117c977ff5109e2a298ef92341cd3f9d2c18ac99d493a', 'The penalty hyperparameter is not correct.'
np.testing.assert_almost_equal(rs_best_params['C'], 0.305, 3, err_msg = 'The C hyperparameter is not correct.')
# roc_auc_score expects the true labels first, then the predictions
print("The AUROC score of the best logistic regression classifier found by the random search is ", 
      roc_auc_score(y_test, random_search.predict(X_test)))

Looks like we have a winner: the model resulting from the grid search performed the best!