# SLU15 - Hyperparameter Tuning: Exercise notebook

### New concepts in this unit

*  Hyperparameter definition
*  Hyperparameter search
*  Model selection

### New tools in this unit
- [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
- [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

## Introduction
You decide you want to apply your data science skills to help identify the risk of heart disease in patients, so you take a look at the [Heart Disease UCI](https://archive.ics.uci.edu/ml/datasets/Heart+Disease) dataset. This database contains 76 attributes, but all published experiments use a subset of 14 of them. You follow their example and load the simplified dataset.

In [1]:
import pandas as pd
import numpy as np
import scipy
import warnings
from hashlib import sha256
import json

import sklearn
# These will be needed to prepare the dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import roc_auc_score


In [2]:
# Seed for reproducibility
np.random.seed(42)

warnings.simplefilter("ignore")

# Load data
heart_df = pd.read_csv("data/heart.csv").rename({"condition":"target"}, axis=1)
heart_df.sample(5)

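|     | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|-----|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 167 | 66  | 0   | 3  | 178      | 228  | 1   | 0       | 165     | 1     | 1.0     | 1     | 2  | 2    | 1      |
| 211 | 59  | 1   | 3  | 140      | 177  | 0   | 0       | 162     | 1     | 0.0     | 0     | 1  | 2    | 1      |
| 63  | 41  | 1   | 1  | 135      | 203  | 0   | 0       | 132     | 0     | 0.0     | 1     | 0  | 1    | 0      |
| 154 | 37  | 0   | 2  | 120      | 215  | 0   | 0       | 170     | 0     | 0.0     | 0     | 0  | 0    | 0      |
| 5   | 64  | 1   | 0  | 170      | 227  | 0   | 2       | 155     | 0     | 0.6     | 1     | 0  | 2    | 0      |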


You then split your dataset into train and test sets.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
                                        heart_df.drop("target", axis=1),
                                        heart_df.target, 
                                        test_size=0.3,
                                        random_state=42
                                        )

You notice that the target variable is binary, as the patient either has heart disease or not, so you recognize that you are dealing with a classification problem. Remembering the amazing class you had about Logistic Regression, you decide to use this classifier as a first approach. Since Logistic Regression is not scale invariant, it is a good idea to scale your features to have zero mean and unit standard deviation.

In [4]:
# Logistic Regression is not scale invariant, so you scale your data beforehand
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print("X_train of shape ", X_train.shape)
print("y_train of shape ", y_train.shape)
print("X_test of shape  ", X_test.shape)
print("y_test of shape  ", y_test.shape)

X_train of shape  (207, 13)
y_train of shape  (207,)
X_test of shape   (90, 13)
y_test of shape   (90,)
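
The scaling step can be illustrated on a tiny array. This is a minimal sketch with made-up values (the `toy_*` names are hypothetical and not part of the exercise): the scaler learns the mean and standard deviation from the train split only and reuses them on the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not taken from heart.csv)
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

toy_scaler = StandardScaler()
# Mean and standard deviation are learned from the training split only
toy_train_scaled = toy_scaler.fit_transform(toy_train)
# The test split reuses those statistics, so no test information leaks in
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled.mean(axis=0))  # per-column mean of the scaled train data
print(toy_train_scaled.std(axis=0))   # per-column standard deviation
print(toy_test_scaled)
```

The scaled test values need not have zero mean or unit variance, since they are transformed with the train statistics.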


## Exercise 0 - Baseline

Start by creating a baseline. How good is a standard Logistic Regression classifier? 

In [5]:
# Create a Logistic Regression classifier with the default hyperparameters,
# assign it to a variable called lr, and then fit it to the train data 
# lr=...


### BEGIN SOLUTION
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
### END SOLUTION 

LogisticRegression()

In [6]:
assert isinstance(lr, sklearn.linear_model.LogisticRegression)
default_roc_auc_score=roc_auc_score(lr.predict(X_test), y_test)
np.testing.assert_almost_equal(default_roc_auc_score, 0.788, 3)

In [7]:
print("The AUROC score of the baseline Logistic regression classifier is ", default_roc_auc_score)

The AUROC score of the baseline Logistic regression classifier is  0.7888888888888889


The baseline score is OK, but you wonder whether this classifier could perform better, so you decide to do some hyperparameter tuning.
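
As a side note on the metric: scikit-learn documents the signature as `roc_auc_score(y_true, y_score)`, and `y_score` may be a continuous score such as a predicted probability. The sketch below uses made-up labels and scores (all `toy_*` names are hypothetical) to show that hard 0/1 predictions and continuous scores can give different AUROC values. The checks in this notebook rely on the calls exactly as written in the cells, so treat this as an illustration only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores (hypothetical, not from the heart dataset)
toy_y_true = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.6, 0.4, 0.8])    # e.g. predict_proba(...)[:, 1]
toy_labels = (toy_scores >= 0.5).astype(int)   # hard 0/1 predictions

# Ground truth goes first, scores or predictions second
auc_from_scores = roc_auc_score(toy_y_true, toy_scores)
auc_from_labels = roc_auc_score(toy_y_true, toy_labels)
print(auc_from_scores, auc_from_labels)
```

Thresholding discards the ranking information in the scores, which is why the two values can differ.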

## Exercise 1 - Grid Search

Since you are not entirely sure what hyperparameters to choose, you decide to run a grid search to start with.

1.1) Create a hyperparameter search space with the following specifications:
- Regularization parameter 'C' of 0.1, 1, 4, 8 and 10
- penalty: "l1", "l2" and "elasticnet"


In [8]:
# Create a hyperparameter search space with the following specifications:
# - 'C' of 0.1, 1, 4, 8 and 10
# - penalty: "l1", "l2" and "elasticnet"
# assign your grid to the variable grid
# grid = ...

### BEGIN SOLUTION
grid = {"C": (0.1, 1, 4, 8, 10), "penalty": ("l2", "l1", "elasticnet")}
### END SOLUTION


In [9]:
assert isinstance(grid, dict), 'Make sure you are using a dictionary for your grid parameters'
assert "C" in grid, 'Make sure you have the requested parameters as keys'
assert "penalty" in grid, 'Make sure you have the requested parameters as keys'
assert all(num in grid["C"] for num in [0.1, 1, 4, 8, 10]), 'The values for C parameter are not correct'
assert "l2" in grid["penalty"], 'The values for penalty parameter are not correct'
assert "l1" in grid["penalty"], 'The values for penalty parameter are not correct'
assert "elasticnet" in grid["penalty"], 'The values for penalty parameter are not correct'
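
To see how many candidates such a grid produces, it can be expanded with `ParameterGrid`, the scikit-learn helper that enumerates every combination. This is a small sketch; `toy_grid` and `candidates` are hypothetical names that mirror the specification above.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the search space described in 1.1
toy_grid = {"C": (0.1, 1, 4, 8, 10), "penalty": ("l2", "l1", "elasticnet")}

# A grid search evaluates the Cartesian product of all listed values
candidates = list(ParameterGrid(toy_grid))
print(len(candidates))   # 5 values of C times 3 penalties
print(candidates[:3])
```

Every extra value added to one hyperparameter multiplies the number of fits, which is why grids are usually kept small.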

1.2) Create a grid search with a Logistic Regression estimator using the hyperparameter space defined in 1.1. Set the scoring function to AUROC.


In [10]:
# Create a gridsearch with a Logistic Regression 
# use the hyperparameter space defined in 1.1
# Set the scoring function as the AUROC 
# assign the gridsearch to the variable grid_search
# grid_search = ...

### BEGIN SOLUTION
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(LogisticRegression(), grid, scoring="roc_auc")

print("scoring hash", 
      sha256(json.dumps(grid_search.get_params()["scoring"]).encode()).hexdigest()
     )
### END SOLUTION


scoring hash 0aa8f1817e367c7a5a0556519f3332ef9499e96db9f0c5abb56adab073631d56


In [11]:
scoring_hash='0aa8f1817e367c7a5a0556519f3332ef9499e96db9f0c5abb56adab073631d56'

assert isinstance(grid_search, sklearn.model_selection.GridSearchCV), 'Are you using GridSearchCV?'
assert isinstance(grid_search.estimator, sklearn.linear_model.LogisticRegression), 'Are you using the right model'
assert scoring_hash == sha256(
                        json.dumps(grid_search.get_params()["scoring"]).encode()
                    ).hexdigest(), 'Your parameters are not correct'

1.3) Find the best estimator using grid_search

In [12]:
# Find the best estimator using grid_search from 1.2
# Begin by performing the grid search over the train data
# Then, extract the best estimator and assign it to best_model
# best_model = ...
 
### BEGIN SOLUTION
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_

hash_ = sha256(json.dumps(best_model.get_params()).encode()).hexdigest()
print("best model params hash", hash_)
### END SOLUTION

best model params hash be93249222ae4fb94bbe669b3c81885256fea04671b60dc0a0ecd4bb9fc42b86


In [13]:
best_params_hash = 'be93249222ae4fb94bbe669b3c81885256fea04671b60dc0a0ecd4bb9fc42b86'
student_hash = sha256(json.dumps(best_model.get_params()).encode()).hexdigest()

assert isinstance(best_model, sklearn.linear_model.LogisticRegression), 'Make sure you are using the right model'
assert best_params_hash == student_hash, 'Your parameters are not correct'
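
After fitting, a `GridSearchCV` object exposes more than `best_estimator_`. The sketch below runs a small search on synthetic data and inspects `best_params_`, `best_score_` and `cv_results_`. The data, the reduced grid over `C` only, `max_iter=1000` and all `toy_*` names are assumptions made for illustration, not the exercise setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the heart dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": (0.1, 1, 10)},   # reduced grid, for illustration only
    scoring="roc_auc",
    cv=5,
)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)                     # winning combination
print(toy_search.best_score_)                      # its mean cross-validated AUROC
print(toy_search.cv_results_["mean_test_score"])   # one score per candidate
print(toy_search.best_estimator_)                  # refit on all of X_toy
```

Note that `best_score_` is a cross-validation score computed on the training data, so it is not directly comparable to a score measured on a held-out test set.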

1.4) Make predictions on the test set using the estimator with the best found parameters

In [14]:
# Make predictions on the test data 
# using the estimator with the best found parameters
# Assign the predictions to best_preds
# best_preds = ...

### BEGIN SOLUTION
best_preds = grid_search.predict(X_test)

print("preds hash:", sha256(best_preds).hexdigest())
### END SOLUTION

preds hash: 6f7d373ff044125e90b9e2577bc4c8cbfcaea8f62f8270fa32d74b517148afa1


In [15]:
preds_hash = '6f7d373ff044125e90b9e2577bc4c8cbfcaea8f62f8270fa32d74b517148afa1'
assert preds_hash == sha256(best_preds).hexdigest(), 'Your parameters are not correct'

In [16]:
print("The AUROC score of the best grid search Logistic regression classifier is ", roc_auc_score(best_preds, y_test))

The AUROC score of the best grid search Logistic regression classifier is  0.811111111111111


Nice, the performance improved! Let's see if we can do even better with random search.

## Exercise 2 - Random Search 


2.1) Create a random search space with the following hyperparameter distributions:

- Regularization parameter 'C' uniformly distributed between 0.1 and 10
- penalty: "l2", "l1" or "elasticnet"


In [17]:
# Create a random search distribution with the
# following hyperparameter distribution
#- 'C' uniformly distributed between 0.1 and 10 (hint: use a scipy distribution)
#- penalty  "l2", "l1" or "elasticnet"
# assign it to random_grid
# random_grid = ...

### BEGIN SOLUTION
from scipy.stats import uniform
random_grid = {"C": uniform(0.1, 10), "penalty": ("l2", "l1", "elasticnet")}
### END SOLUTION

In [18]:
assert "C" in random_grid, 'Make sure you have the requested parameters as keys'
assert "penalty" in random_grid, 'Make sure you have the requested parameters as keys'
assert isinstance(random_grid["C"], scipy.stats._distn_infrastructure.rv_frozen), 'The values for C parameter are not correct'
assert "l2" in random_grid["penalty"], 'The values for penalty parameter are not correct'
assert "l1" in random_grid["penalty"], 'The values for penalty parameter are not correct'
assert "elasticnet" in random_grid["penalty"], 'The values for penalty parameter are not correct'
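
One detail worth knowing about `scipy.stats.uniform`: its two positional arguments are `loc` and `scale`, and the distribution covers the interval from `loc` to `loc + scale`. So `uniform(0.1, 10)`, as used in the solution cell, spans 0.1 to 10.1, while an interval ending exactly at 10 would be `uniform(0.1, 9.9)`. The sketch below checks both; `dist_a`, `dist_b` and `samples` are hypothetical names.

```python
from scipy.stats import uniform

dist_a = uniform(0.1, 10)    # loc=0.1, scale=10
dist_b = uniform(0.1, 9.9)   # loc=0.1, scale=9.9

# ppf(0) and ppf(1) give the lower and upper end of the support
print(dist_a.ppf([0, 1]))
print(dist_b.ppf([0, 1]))

# Draws from dist_b stay within [0.1, 10]
samples = dist_b.rvs(size=1000, random_state=42)
print(samples.min(), samples.max())
```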

2.2) Create a random search over a Logistic Regression estimator.
* Set the random_state to 42
* Set the number of iterations to 15
* Set the scoring to AUROC

In [19]:
# Create a random search 
# - Use a Logistic Regression estimator
# - Set the random_state to 42
# - Set the number of iterations to 15
# - Set the scoring to AUROC
# - Use the random grid you created in 2.1
# assign it to random_search
# random_search = ...

### BEGIN SOLUTION
from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(LogisticRegression(), 
                                   random_grid,
                                   scoring="roc_auc",
                                   n_iter=15,
                                   random_state=42)

print("scoring hash", 
      sha256(json.dumps(random_search.get_params()["scoring"]).encode()).hexdigest()
     )
### END SOLUTION

scoring hash 0aa8f1817e367c7a5a0556519f3332ef9499e96db9f0c5abb56adab073631d56


In [20]:
scoring_hash='0aa8f1817e367c7a5a0556519f3332ef9499e96db9f0c5abb56adab073631d56'

assert isinstance(random_search, sklearn.model_selection.RandomizedSearchCV), 'Are you using RandomizedSearchCV?'
assert isinstance(random_search.estimator, sklearn.linear_model.LogisticRegression), 'Make sure you are using the right model'
assert random_search.random_state==42, 'Check your random_state value'
assert random_search.n_iter==15, 'Check n_iter value'
assert scoring_hash == sha256(
                            json.dumps(random_search.get_params()["scoring"]).encode()
                        ).hexdigest(), 'Your parameters are not correct'
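
To get a feel for what `n_iter` and `random_state` control, the candidates of a random search can be drawn directly with `ParameterSampler`, the scikit-learn helper that samples from a space of lists and distributions. This is a sketch under the assumption that the space matches 2.1; `toy_space`, `draws_a` and `draws_b` are hypothetical names.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

toy_space = {"C": uniform(0.1, 10), "penalty": ("l2", "l1", "elasticnet")}

# n_iter sets how many candidates are drawn; random_state makes the draws repeatable
draws_a = list(ParameterSampler(toy_space, n_iter=15, random_state=42))
draws_b = list(ParameterSampler(toy_space, n_iter=15, random_state=42))

print(len(draws_a))
print(draws_a[:2])
print(draws_a == draws_b)   # same seed, same candidates
```

Unlike a grid, the number of fits is fixed by `n_iter` regardless of how many hyperparameters are in the space.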

2.3) Get the best model from the random_search

In [21]:
# Get the best model from the random search
# Begin by performing the random search over the train data
# Then extract the best estimator and assign it to rs_best_model
# rs_best_model = ...

### BEGIN SOLUTION
random_search.fit(X_train, y_train)
rs_best_model = random_search.best_estimator_

hash_ = sha256(json.dumps(rs_best_model.get_params()).encode()).hexdigest()
print("best random search params hash", hash_)
### END SOLUTION

best random search params hash c2fbca42369c6609159487a711fb011cf55e626ed986aef10f6a0326c31b23ca


In [22]:
rs_best_model_hash ='c2fbca42369c6609159487a711fb011cf55e626ed986aef10f6a0326c31b23ca'
assert isinstance(rs_best_model, sklearn.linear_model.LogisticRegression) , 'Make sure you are using the right model'
assert rs_best_model_hash==sha256(json.dumps(rs_best_model.get_params()).encode()).hexdigest(), 'Your parameters are not correct'

2.4) Get the best parameters of the random search

In [23]:
# Get the best parameters (those with the highest AUROC score)
# of the random search and assign them to best_rs_params
# best_rs_params = ...

### BEGIN SOLUTION
best_rs_params = random_search.best_params_

print("best LR params hash",
     sha256(json.dumps(best_rs_params).encode()).hexdigest()
     )
### END SOLUTION

best LR params hash f0844d10d0d606c4bedd3ec11eb58f7303fb633665df9b3855a45d84c7caa04b


In [24]:
best_rs_params_hash = 'f0844d10d0d606c4bedd3ec11eb58f7303fb633665df9b3855a45d84c7caa04b'
assert best_rs_params_hash == sha256(json.dumps(best_rs_params).encode()).hexdigest(), 'Your parameters are not correct' 
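
The relation between the sampled candidates and `best_params_` can be checked on a small example. The sketch below runs a random search on synthetic data and lists every sampled `C` next to the winner. The data, the search over `C` only, `max_iter=1000`, `n_iter=5` and all `toy_*` names are assumptions made for illustration.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the heart dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_random_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": uniform(0.1, 9.9)},   # C drawn from [0.1, 10]
    scoring="roc_auc",
    n_iter=5,
    random_state=42,
)
toy_random_search.fit(X_toy, y_toy)

# Every sampled candidate is recorded in cv_results_["params"]
sampled_C = [params["C"] for params in toy_random_search.cv_results_["params"]]
print(sampled_C)
print(toy_random_search.best_params_)   # the sampled candidate with the best mean CV score
```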

In [25]:
print("The AUROC score of the best random search Logistic regression classifier is ",roc_auc_score(random_search.predict(X_test), y_test))

The AUROC score of the best random search Logistic regression classifier is  0.7888888888888889


Looks like we have a winner: with a test AUROC of 0.811, versus 0.789 for both the baseline and the random search, the model resulting from the grid search performed best!