# SLU15 - Hyperparameter Tuning: Exercise notebook

### New concepts in this unit

*  Hyperparameter definition
*  Hyperparameter search
*  Model selection

### New tools in this unit
- [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
- [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

## Introduction
You want to apply your data science skills to help identify the risk of heart disease in patients, so you decide to take a look at the [Heart Disease UCI](https://archive.ics.uci.edu/ml/datasets/Heart+Disease) dataset. This database contains 76 attributes, but all published experiments use a subset of 14 of them. You follow their example and load the simplified dataset.

In [None]:
import pandas as pd
import numpy as np
import scipy
import warnings
from hashlib import sha256
import json

import sklearn
# These will be needed to prepare the dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import f1_score


In [None]:
# Seed for reproducibility
np.random.seed(42)

warnings.simplefilter("ignore")

# Load data
heart_df = pd.read_csv("data/heart.csv").rename({"condition":"target"}, axis=1)
heart_df.sample(5)

You then split your dataset into train and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
                                        heart_df.drop("target", axis=1),
                                        heart_df.target, 
                                        test_size=0.3,
                                        random_state=42
                                        )

You notice that the target variable is binary, since a patient either has heart disease or does not, so you recognize that you are dealing with a classification problem. Remembering the amazing class you had about Logistic Regression, you decide to use this classifier as a first approach. Because scikit-learn's Logistic Regression applies regularization by default, and the penalty depends on the scale of each feature, it is a good idea to scale your observations to have zero mean and unit standard deviation.

In [None]:
# Logistic Regression is not scale invariant, so you scale your data beforehand
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print("X_train of shape ", X_train.shape)
print("y_train of shape ", y_train.shape)
print("X_test of shape  ", X_test.shape)
print("y_test of shape  ", y_test.shape)
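
As a minimal sketch of the scaling step, the snippet below uses a small made-up array (the names `toy_train`, `toy_test` and `toy_scaler` are hypothetical and not part of the exercise). It illustrates that the scaler learns the mean and standard deviation from the train data only and reuses them on the test data, so only the train set is guaranteed to end up with zero mean and unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature values, used only for illustration
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
toy_test = np.array([[2.5, 250.0], [10.0, 1000.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean/std from train only
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the train statistics

print(toy_train_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_train_scaled.std(axis=0))   # approximately 1 for each column
print(toy_test_scaled)                # scaled with train statistics, not re-centred
```

Fitting the scaler on the test set as well would leak information about the test distribution into the preprocessing step.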

## Exercise 0 - Baseline

Start by creating a baseline. How good is a standard Logistic Regression classifier? 

In [None]:
# Create a Logistic Regression classifier with the default hyperparameters,
# assign it to a variable called lr, and then fit it to the train data 
# lr=...


# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(lr, sklearn.linear_model.LogisticRegression)
# f1_score expects the true labels first, then the predictions
default_f1_score = f1_score(y_test, lr.predict(X_test))
np.testing.assert_almost_equal(default_f1_score, 0.781, 3)

In [None]:
print("The f1 score of the baseline Logistic regression classifier is ", default_f1_score)
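
To make the metric concrete, here is a small sketch that computes the F1 score by hand from precision and recall and compares it with `f1_score`. The labels and predictions are made up for illustration. The documented signature is `f1_score(y_true, y_pred)`; for a binary problem, swapping the two arguments only exchanges false positives and false negatives, so the value does not change.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels and predictions, for illustration only
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual_f1 = 2 * precision * recall / (precision + recall)

print(manual_f1)                 # 0.75
print(f1_score(y_true, y_pred))  # same value as the manual computation
```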

The baseline score is ok, but you wonder if you could get better performance with this classifier. So, you decide to do some hyperparameter tuning.

## Exercise 1 - Grid Search

Since you are not entirely sure what hyperparameters to choose, you decide to run a grid search to start with.

1.1) Create a hyperparameter search space with the following specifications:
- Regularization parameter 'C' of 0.1, 1 and 10
- penalty: "l1" and "l2"
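
A grid is a dictionary that maps parameter names to lists of candidate values, and a grid search tries every combination. The sketch below uses a hypothetical grid for a decision tree (not the grid requested here) and scikit-learn's `ParameterGrid` helper to list the combinations that would be evaluated.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid for illustration; keys must match the estimator's parameter names
example_grid = {
    "max_depth": [2, 4, 8],
    "criterion": ["gini", "entropy"],
}

combinations = list(ParameterGrid(example_grid))
print(len(combinations))  # 6, i.e. 3 values x 2 values
for params in combinations:
    print(params)
```

The number of combinations is the product of the list lengths, which is why grids grow quickly as parameters are added.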


In [None]:
# Create a hyperparameter search space  with the following specifications:
# - 'C' of 0.1, 1, and 10
# - penalty: "l1" and "l2"
# assign your grid to the variable grid
# grid = ...

# YOUR CODE HERE
raise NotImplementedError()


In [None]:
assert isinstance(grid, dict), 'Make sure you are using a dictionary for your grid parameters'
assert "C" in grid, 'Make sure you have the requested parameters as keys'
assert "penalty" in grid, 'Make sure you have the requested parameters as keys'
assert all(num in grid["C"] for num in [0.1, 1, 10 ]), 'The values for C parameter are not correct'
assert "l2" in grid["penalty"], 'The values for penalty parameter are not correct'
assert "l1" in grid["penalty"], 'The values for penalty parameter are not correct'

1.2) Create a grid search over a Logistic Regression estimator using the hyperparameter space defined in 1.1. Set the scoring function to f1.
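
The general shape of a grid search can be sketched as follows. This is an illustration under stated assumptions, not the solution to the exercise: it uses synthetic data from `make_classification`, a `KNeighborsClassifier` instead of Logistic Regression, and hypothetical names (`demo_grid`, `demo_search`, `demo_best`, `demo_preds`).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

demo_grid = {"n_neighbors": [3, 5, 11], "weights": ["uniform", "distance"]}

demo_search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid=demo_grid,
    scoring="f1",  # the cross-validated F1 score decides the winner
    cv=5,
)
demo_search.fit(X_tr, y_tr)  # evaluates all 6 combinations with 5-fold CV

# With the default refit=True, the best combination is refit on all of X_tr
demo_best = demo_search.best_estimator_
demo_preds = demo_best.predict(X_te)

print(demo_search.best_params_)
print(f1_score(y_te, demo_preds))
```

Only the train split is passed to `fit`, so the test split stays untouched until the final evaluation.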


In [None]:
# Create a gridsearch with a Logistic Regression 
# use the hyperparameter space defined in 1.1
# Set the scoring function as the f1 
# assign the gridsearch to the variable grid_search
# grid_search = ...

# YOUR CODE HERE
raise NotImplementedError()


In [None]:
scoring_hash='4e319c63b3dafcef9b25b5549030f4301e5a745fd0fc561be773daeaa1b36f68'

assert isinstance(grid_search, sklearn.model_selection.GridSearchCV), 'Are you using GridSearchCV?'
assert isinstance(grid_search.estimator, sklearn.linear_model.LogisticRegression), 'Are you using the right model'
assert scoring_hash == sha256(
                        json.dumps(grid_search.get_params()["scoring"]).encode()
                    ).hexdigest(), 'Your parameters are not correct'

1.3) Find the best estimator using grid_search

In [None]:
# Find the best estimator using grid_search from 1.2
# Begin by performing the grid search over the train data
# Then, extract the best estimator and assign it to best_model
# best_model = ...
 
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
best_params_hash = 'be93249222ae4fb94bbe669b3c81885256fea04671b60dc0a0ecd4bb9fc42b86'
student_hash = sha256(json.dumps(best_model.get_params()).encode()).hexdigest()

assert isinstance(best_model, sklearn.linear_model.LogisticRegression), 'Make sure you are using the right model'
assert best_params_hash == student_hash, 'Your parameters are not correct'

1.4) Make predictions on the test set using the estimator with the best found parameters

In [None]:
# Make predictions on the test data 
# using the estimator with the best found parameters
# Assign the predictions to the best_preds
# best_preds = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
preds_hash = '6f7d373ff044125e90b9e2577bc4c8cbfcaea8f62f8270fa32d74b517148afa1'
assert preds_hash == sha256(best_preds).hexdigest(), 'Your predictions are not correct'

In [None]:
print("The f1 score of the best grid search Logistic regression classifier is ", f1_score(y_test, best_preds))

Nice, the performance improved! Let's see if we can do even better with random search.

## Exercise 2 - Random Search 


2.1) Create a random search space with the following hyperparameter distributions:

- Regularization parameter 'C' uniformly distributed between 0.1 and 10
- penalty: "l2" or "l1"
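
One detail worth knowing when using scipy distributions: `scipy.stats.uniform(loc, scale)` is uniform on the interval `[loc, loc + scale]`, so `scale` is the width of the interval rather than its upper bound. The sketch below uses hypothetical bounds (2 and 5, not the ones requested in this exercise) to show the difference.

```python
from scipy.stats import uniform

low, high = 2.0, 5.0  # hypothetical bounds, for illustration only

# loc is the lower bound, scale is the WIDTH of the interval
narrow = uniform(loc=low, scale=high - low)  # uniform on [2, 5]
wide = uniform(loc=low, scale=high)          # uniform on [2, 7], easy to write by mistake

samples = narrow.rvs(size=1000, random_state=0)
print(samples.min(), samples.max())  # both inside [2, 5]
print(narrow.ppf(1.0))               # 5.0, the upper end of the interval
print(wide.ppf(1.0))                 # 7.0
```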


In [None]:
# Create a random search distribution with the
# following hyperparameter distribution
#- 'C' uniformly distributed between 0.1 and 10 (hint: use a scipy distribution)
#- penalty  "l2" or "l1"
# assign it to random_grid
# random_grid = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert "C" in random_grid, 'Make sure you have the requested parameters as keys'
assert "penalty" in random_grid, 'Make sure you have the requested parameters as keys'
assert isinstance(random_grid["C"], scipy.stats._distn_infrastructure.rv_frozen), 'The values for C parameter are not correct'
assert "l2" in random_grid["penalty"], 'The values for penalty parameter are not correct'
assert "l1" in random_grid["penalty"], 'The values for penalty parameter are not correct'

2.2) Create a random search over a Logistic Regression estimator.
* Use the random grid you created in 2.1
* Set the random_state to 42
* Set the number of iterations to 10
* Set the scoring to f1
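
As a rough sketch of how a randomized search is put together, the example below samples 10 parameter settings for an `SVC` on synthetic data. The estimator, the data, the parameter ranges and the `demo_` names are all assumptions made for illustration; they are not the ones graded in this exercise.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data (not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_distributions = {
    "C": uniform(loc=0.5, scale=4.5),  # continuous: sampled from [0.5, 5.0]
    "kernel": ["linear", "rbf"],       # list: each value is picked with equal probability
}

demo_random_search = RandomizedSearchCV(
    estimator=SVC(),
    param_distributions=demo_distributions,
    n_iter=10,        # number of sampled parameter settings
    scoring="f1",
    random_state=42,  # makes the sampled settings reproducible
    cv=5,
)
demo_random_search.fit(X_demo, y_demo)

print(demo_random_search.best_params_)
print(demo_random_search.best_score_)
```

Unlike a grid search, the cost here is controlled by `n_iter` rather than by the size of the search space.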

In [None]:
# Create a random search 
# - Use a Logistic Regression estimator
# - Set the random_state to 42
# - Set the number of iterations to 10
# - Set the scoring to  f1
# - Use the random grid you created in 2.1
# assign it to random_search
# random_search = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
scoring_hash='4e319c63b3dafcef9b25b5549030f4301e5a745fd0fc561be773daeaa1b36f68'

assert isinstance(random_search, sklearn.model_selection.RandomizedSearchCV), 'Are you using RandomizedSearchCV?'
assert isinstance(random_search.estimator, sklearn.linear_model.LogisticRegression), 'Make sure you are using the right model'
assert random_search.random_state==42, 'Check your random_state value'
assert random_search.n_iter==10, 'Check n_iter value'
assert scoring_hash == sha256(
                            json.dumps(random_search.get_params()["scoring"]).encode()
                        ).hexdigest(), 'Your parameters are not correct'

2.3) Get the best model from the random_search

In [None]:
# Get the best model from the random search
# Begin by performing the random search over the train data
# Then extract the best estimator and assign it to rs_best_model
# rs_best_model = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
rs_best_model_hash ='46137cd04d6b638585075b0e41782f0af3915fbb0e36d097884ad2dda4b5f1cd'
assert isinstance(rs_best_model, sklearn.linear_model.LogisticRegression) , 'Make sure you are using the right model'
assert rs_best_model_hash==sha256(json.dumps(rs_best_model.get_params()).encode()).hexdigest(), 'Your parameters are not correct'

2.4) Get the best parameters of the random search
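
A fitted search object exposes both the winning parameters and the full table of cross-validation results. The sketch below, which assumes synthetic data and a small hypothetical grid over a decision tree, shows how `best_params_` relates to `cv_results_` and `best_index_`. The same attributes exist on `RandomizedSearchCV`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the heart dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6]},  # hypothetical grid
    scoring="f1",
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# One row per parameter setting, with its mean cross-validated score and rank
results = pd.DataFrame(demo_search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]])

# best_params_ is the entry of cv_results_["params"] at position best_index_
print(demo_search.best_params_)
print(demo_search.cv_results_["params"][demo_search.best_index_])
```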

In [None]:
# Get the best parameters (those with the highest f1 score)
# of the random search and assign them to best_rs_params
# best_rs_params = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
best_rs_params_hash = '786002396fe6033b22256c5e0bba1ddb619771c4db81808f74e3cd51f2efd143'
assert best_rs_params_hash == sha256(json.dumps(best_rs_params).encode()).hexdigest(), 'Your parameters are not correct' 

In [None]:
print("The f1 score of the best random search Logistic regression classifier is ", f1_score(y_test, random_search.predict(X_test)))

Looks like we have a winner: the model resulting from the grid search performed best!