# SLU18 - Hyperparameter Tuning: Exercise notebook

### New concepts in this unit

*  Hyperparameter definition
*  Hyperparameter search
*  Model selection

### New tools in this unit
- [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
- [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

## Introduction


Yay! You have a new fancy watch with all those accelerometers and gyros! 

![](./media/applewatch.png)


...and you managed to wait a whole 10 minutes before hacking it and extracting data from those instruments. You want to estimate when you are WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING and LAYING. Not sure why.

You come up with a dataset with 7353 labeled activities - yeah, strangely enough you did label all these movements.  

The data set has: 

- 7353 instances with 6 features each (3 average linear accelerations, 3 average angular accelerations). The source dataset linked below has more features, but we will not use them.
- A label for the activity being performed

The labels are already encoded as `1-WALKING`, `2-WALKING_UPSTAIRS`, `3-WALKING_DOWNSTAIRS`, `4-SITTING`, `5-STANDING` and `6-LAYING`.


You don't need this, but the data came from here (that's the truth...): https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

In [None]:
import pandas as pd
import numpy as np
import scipy
import warnings
from hashlib import sha256
import json

import sklearn
# These will be needed to prepare the dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
# Seed for reproducibility
np.random.seed(42)

warnings.simplefilter("ignore")

# Load data
mobile_df = pd.read_csv("data/X_data.csv", delimiter=",")
mobile_df_target = pd.read_csv("data/y_data.csv", delimiter=",")

You then split your dataset into train and test sets, so that a portion of it is kept out of the training and validation process.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
                                        mobile_df,
                                        mobile_df_target, 
                                        test_size=0.30,
                                        random_state=42
                                        )

In [None]:
print("X_train of shape ", X_train.shape)
print("y_train of shape ", y_train.shape)
print("X_test of shape  ", X_test.shape)
print("y_test of shape  ", y_test.shape)

Notice that your target variable is not binary: you are working with a multiclass classification problem. 
Because you want to start with a simple and explainable model, you decide to use a decision tree.
You start by running it with default settings and a simple accuracy metric (for the sake of simplicity).

## Exercise 0 - Simple Model, no Hyper Parameter Tuning

In [None]:
# Fit a simple DecisionTreeClassifier with random_state = 43 and otherwise default settings on the train set, 
# and assign it to a variable called d_tree = ...
# Assign the resulting accuracy score on the test set to a variable called default_score = ...


# YOUR CODE HERE
raise NotImplementedError()

In [None]:
import math
assert d_tree.random_state == 43
assert isinstance(d_tree, sklearn.tree.DecisionTreeClassifier)
assert sha256(d_tree.predict(X_test)).hexdigest() == "b9f93d9ac0d93397ea703723b7f3d4cb8c2be4823e707c04757ebb4e03901cec"
assert math.isclose(default_score, 0.528, abs_tol=0.001)

You decide to search for better hyperparameters that could increase your metrics.

## Exercise 1 - Grid Search

Since you are not entirely sure what hyperparameters to choose, you decide to run a grid search to start with.

1.1) Create a hyperparameter search space with the following specifications:

- `max_leaf_nodes`: between 180 and 220 (incl.), with increments of 5
- `max_depth`: between 12 and 18 (incl.), with increments of 1
- `criterion`: "gini" and "entropy"
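
As a minimal sketch of how such a search space can be written, the block below builds a grid as a plain dictionary. The hyperparameter names are real `DecisionTreeClassifier` arguments, but the ranges are hypothetical and chosen only for illustration; they are not the values asked for in this exercise. It also counts how many candidates a grid search would have to evaluate.

```python
from itertools import product

# Hypothetical grid for illustration only (not the exercise's values).
# range(start, stop, step) excludes `stop`, so the upper bound is included
# by passing a stop value just past it.
toy_grid = {
    "min_samples_split": list(range(2, 11, 2)),  # 2, 4, 6, 8, 10
    "min_samples_leaf": list(range(1, 6)),       # 1, 2, 3, 4, 5
    "splitter": ["best", "random"],
}

# A grid search tries every combination, so the number of candidates is the
# product of the number of values listed for each hyperparameter.
n_combinations = len(list(product(*toy_grid.values())))
print(n_combinations)  # 5 * 5 * 2 = 50
```

Each candidate is fitted once per cross-validation fold, so the total number of fits grows quickly as values are added to the grid.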


In [None]:
# Create a hyperparameter search space (a dictionary with 3 entries) with the following specifications:
# - max_leaf_nodes: between 180 and 220 (incl.), with increments of 5
# - max_depth between 12 and 18 (incl.), with increments of 1
# - 'criterion': "gini" and "entropy"
# assign your grid to the variable grid
# grid = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(grid, dict)
assert "max_depth" in grid
assert "criterion" in grid
assert all(num in grid["max_depth"] for num in [n for n in range(12, 19, 1)])
assert all(num in grid["max_leaf_nodes"] for num in [n for n in range(180, 221, 5)])
assert "gini" in grid["criterion"] 
assert "entropy" in grid["criterion"]

1.2) Create a grid search (`GridSearchCV`) with a Decision Tree Classifier using the hyperparameter space defined in 1.1. Set the scoring function as "accuracy" (again, for the sake of simplicity)
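
The general `GridSearchCV` pattern can be sketched on a toy problem. The example below is an illustration under assumptions: it uses the iris dataset and a `KNeighborsClassifier` with a made-up grid, not the estimator, data or grid of this exercise. The names `toy_grid`, `toy_search` and `toy_best_model` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Toy data, used only to show the API.
X, y = load_iris(return_X_y=True)

# Hypothetical search space for a k-nearest-neighbours classifier.
toy_grid = {
    "n_neighbors": [1, 3, 5, 7],
    "weights": ["uniform", "distance"],
}

# The search wraps an unfitted estimator; nothing is trained until fit().
toy_search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid=toy_grid,
    scoring="accuracy",
    cv=5,
)
toy_search.fit(X, y)

print(toy_search.best_params_)  # best combination found by cross-validation
print(toy_search.best_score_)   # its mean cross-validated accuracy

# With the default refit=True, the best estimator is refitted on all data
# passed to fit() and can be used directly for predictions.
toy_best_model = toy_search.best_estimator_
```

Note that `best_params_` only contains the searched hyperparameters, while `best_estimator_.get_params()` returns every parameter of the refitted model.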


In [None]:
# Create a grid search with a Decision Tree Classifier
# use the hyperparameter space defined in 1.1
# Again don't forget to set the random_state to 43
# Set the scoring function as the accuracy
# assign the gridsearch to the variable grid_search = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
scoring_hash='4c378f74c01e3ca1b174cc5fe7631fe9686a2d619f2090d244f7b958e9f18211'

assert isinstance(grid_search, sklearn.model_selection.GridSearchCV)
assert isinstance(grid_search.estimator, sklearn.tree.DecisionTreeClassifier)
assert grid_search.estimator.random_state == 43
assert scoring_hash == sha256(
                        json.dumps(grid_search.get_params()["scoring"]).encode()
                       ).hexdigest()

1.3) Find the best estimator using `grid_search`.

In [None]:
# Find the best estimator using grid_search from 1.2
# Begin by performing the grid search over the train data (this can take around 1 minute)
# Then, extract the best estimator and assign it to best_dt_model
# best_dt_model = ...
# 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
hash_ = "79568e90ea80a995673b940fb772f0b6306160f85c85111c5c798f88716bba8a"
assert isinstance(best_dt_model, sklearn.tree.DecisionTreeClassifier)
assert hash_ == sha256(json.dumps(best_dt_model.get_params()).encode()).hexdigest()

1.4) Make predictions on the test set using the estimator with the best found parameters

In [None]:
# Measure accuracy for the best estimator from 1.3) over the test set
# assign it to the variable best_dt_model_score=...
# Check the hyperparameters of this model (the full dictionary from get_params()) and assign them to 
# best_dt_model_param =...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
hash_  = "79568e90ea80a995673b940fb772f0b6306160f85c85111c5c798f88716bba8a"
assert math.isclose(best_dt_model_score, 0.570, abs_tol=0.001)
assert hash_ == sha256(json.dumps(best_dt_model_param).encode()).hexdigest()

Now let's compare the default model and the grid search approach:

In [None]:
print("Normal (Exercise 0) model score: ", default_score)
print("Grid Search (Exercise 1) Best Model Score: ", best_dt_model_score)
print("Score difference: ", (best_dt_model_score-default_score)*100)

You can see that you gained more than 4 percentage points of accuracy just by quickly tweaking hyperparameters. 

Looking at the new parameters, you should see that your default model was overfitting: the best model found adds some regularization by limiting the tree's maximum depth and its number of leaves.
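
One way to see this effect is to compare an unconstrained tree with a constrained one. The sketch below is only an illustration: it assumes synthetic, deliberately noisy data from `make_classification`, and the limits `max_depth=4` and `max_leaf_nodes=12` are arbitrary values, not the ones found by your grid search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic multiclass data; flip_y randomly mislabels a fraction of samples.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Default settings: the tree grows until every training sample is classified.
unconstrained = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Arbitrary limits on depth and number of leaves act as regularization.
constrained = DecisionTreeClassifier(
    max_depth=4, max_leaf_nodes=12, random_state=0
).fit(X_tr, y_tr)

for name, tree in [("unconstrained", unconstrained), ("constrained", constrained)]:
    print(
        name,
        "depth:", tree.get_depth(),
        "leaves:", tree.get_n_leaves(),
        "train acc:", tree.score(X_tr, y_tr),
        "test acc:", tree.score(X_te, y_te),
    )
```

A large gap between train and test accuracy for the unconstrained tree is the usual sign of overfitting. You can run the same `get_depth()` and `get_n_leaves()` checks on `d_tree` and `best_dt_model`.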

## Exercise 2 - Random Search 

You then decide to try a Logistic Regression model, this time with the random search method.

2.1) Create a random search space with the following hyperparameter values:

- Inverse of regularization strength `C`: 100 evenly spaced points between 0.1 and 10
- `penalty`: "l2" or "l1"


In [None]:
# Create a random search distribution with the
# following hyperparameter distribution
#- Inverse of regularization strength 'C' (list) uniformly distributed between 0.1 and 10 with 100 points (hint: use a numpy linspace)
#- penalty  "l2" or "l1"
# assign it to random_grid
# random_grid = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert "C" in random_grid
assert "penalty" in random_grid
assert isinstance(random_grid["C"], list)
assert "l2" in random_grid["penalty"]
assert "l1" in random_grid["penalty"]

2.2) Create a random search over a Logistic Regression estimator.
- Set the random_state to 43
- Set the number of iterations to 25
- Set the scoring to accuracy
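
The general `RandomizedSearchCV` pattern can be sketched on a toy problem. Everything below is an assumption made for illustration: the data is a synthetic binary problem, the ranges and `n_iter=10` are arbitrary, and `solver="liblinear"` is picked only because it accepts both the "l1" and "l2" penalties (not every solver does). None of this is part of the exercise's specification.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data, used only to show the API.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Hypothetical search space: evenly spaced C values as a plain Python list.
toy_random_grid = {
    "C": np.linspace(0.01, 1, 50).tolist(),
    "penalty": ["l1", "l2"],
}

# Only n_iter candidates are sampled from the space, instead of trying
# all 50 * 2 = 100 combinations as a grid search would.
toy_random_search = RandomizedSearchCV(
    estimator=LogisticRegression(solver="liblinear"),
    param_distributions=toy_random_grid,
    n_iter=10,
    scoring="accuracy",
    random_state=0,
)
toy_random_search.fit(X, y)

print(toy_random_search.best_params_)
print(len(toy_random_search.cv_results_["params"]))  # 10 sampled candidates
```

The `random_state` of the search controls which candidates are sampled, so fixing it makes the search reproducible.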

In [None]:
# Create a random search 
# - Use a Logistic Regression estimator
# - Set the random_state to 43
# - Set the number of iterations to 25
# - Set the scoring to accuracy
# - Use the random grid you created in 2.1
# assign it to random_search
# random_search = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
scoring_hash='4c378f74c01e3ca1b174cc5fe7631fe9686a2d619f2090d244f7b958e9f18211'

assert isinstance(random_search, sklearn.model_selection.RandomizedSearchCV)
assert isinstance(random_search.estimator, sklearn.linear_model.LogisticRegression)
assert random_search.random_state==43
assert random_search.n_iter==25
assert scoring_hash == sha256(
                        json.dumps(random_search.get_params()["scoring"]).encode()
                       ).hexdigest()

2.3) Get the best model from the random_search

In [None]:
# Get the best model from the random search
# Begin by performing the random search over the train data
# Then extract the best estimator and assign it to rs_best_model
# rs_best_model = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
rs_best_model_hash ='f411e318b5da9e60c7774f3943ec7fc3765e0d629a963bb36a9022d5efd392ef'
assert isinstance(rs_best_model, sklearn.linear_model.LogisticRegression)
assert rs_best_model_hash==sha256(json.dumps(rs_best_model.get_params()).encode()).hexdigest()

2.4) Get the score and the best parameters of the random search

In [None]:
# Get the accuracy score of the best model on the test set and assign it to best_rs_score=... 
# Also, get the best parameters (those for which the accuracy was highest)
# of the random search and assign them to best_rs_params
# best_rs_params = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
import math
best_lr_params_hash = 'f8c0272053265c70defb9f46bfade221f11ca1565050cf6be22c252c5792097c'
assert best_lr_params_hash == sha256(json.dumps(best_rs_params).encode()).hexdigest() 
assert math.isclose(best_rs_score, 0.267, abs_tol=0.001)