# Hyperparameter Tuning

When fitting a linear regression model, what we're doing is choosing the parameters (the coefficients) that best fit the data.

With Ridge/Lasso regression we choose an `alpha` which produces the best model fit.

When fitting a **k-NN** model, we need to choose the `n_neighbors` value which gives the best fit.

Logistic regression has a **regularisation** parameter, `C`.

Values such as `alpha`, `n_neighbors`, and `C` are called **hyperparameters**. Unlike ordinary model parameters, they cannot be learned by fitting the model; they must be set before fitting. You discover the best value through trial and error:

- try a bunch of different hyperparameter values
- fit them all separately
- see how each performs
- choose the best performing one

This is called **hyperparameter tuning**.

When comparing different values of a hyperparameter, it is essential to use **cross-validation**, because relying on a single **train-test split** would risk **overfitting** the hyperparameter to the test set.
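The trial-and-error loop described above can be sketched by hand. This is a minimal illustration, not a recommended workflow: the data is synthetic (generated with `make_classification`), and the candidate values for `n_neighbors` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; any feature matrix and labels would do)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical candidate values of n_neighbors to try
candidate_values = [1, 3, 5, 7, 9, 11]
mean_scores = {}

for n in candidate_values:
    knn = KNeighborsClassifier(n_neighbors=n)
    # Fit and score each candidate separately with 5-fold cross-validation
    scores = cross_val_score(knn, X_demo, y_demo, cv=5)
    mean_scores[n] = scores.mean()

# Choose the best-performing value
best_n = max(mean_scores, key=mean_scores.get)
print(best_n, mean_scores[best_n])
```

Each candidate is scored on held-out folds rather than on the data it was fitted to, which is what makes the comparison between candidates fair.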

One technique that is used is **grid search cross-validation**:

- choose a grid of possible hyperparameter values
- perform `k-fold` cross-validation for each point in the grid (each hyperparameter value, or each combination of hyperparameter values)
- choose the hyperparameter value(s) that perform best.

This example demonstrates a grid where we're comparing two hyperparameters, `C` and `alpha`.

![Grid Search](../imgs/logistic-regression-4.png)

We can implement **grid search cross-validation** in scikit-learn with the `GridSearchCV` class. We pass it a dictionary whose keys are hyperparameter names, e.g. `n_neighbors` or `alpha`, and whose values are lists of the values we wish to tune over. If we specify multiple hyperparameters, all combinations will be tried.
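To illustrate how multiple hyperparameters combine, here is a small sketch on synthetic data. The second hyperparameter, `weights`, is an assumption chosen for illustration; it is a real `KNeighborsClassifier` argument but is not discussed elsewhere in these notes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Two hyperparameters: 3 values x 2 values = 6 combinations
param_grid = {
    "n_neighbors": [3, 5, 7],
    "weights": ["uniform", "distance"],
}

knn_cv = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
knn_cv.fit(X_demo, y_demo)

# cv_results_["params"] lists every combination that was evaluated
n_candidates = len(knn_cv.cv_results_["params"])
print(n_candidates)
print(knn_cv.best_params_)
```

The number of candidates is the product of the list lengths, so the cost of a grid search grows quickly as hyperparameters are added.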

### Using hyperparameter tuning to find the C parameter in Logistic Regression

Like the `alpha` parameter of **lasso** and **ridge** regularization, **logistic regression** has a regularization parameter: `C`. `C` controls the inverse of the regularization strength. A large `C` can lead to an overfit model, while a small `C` can lead to an underfit model.
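A quick way to see the inverse relationship is to compare coefficient sizes for a few values of `C`. This sketch assumes synthetic data and scikit-learn's default L2 penalty; the three `C` values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

coef_norms = {}
for C in [0.001, 1.0, 1000.0]:
    # max_iter is raised so the solver has room to converge for large C
    model = LogisticRegression(C=C, max_iter=5000)
    model.fit(X_demo, y_demo)
    # The L2 norm of the coefficients measures how strongly they were shrunk
    coef_norms[C] = np.linalg.norm(model.coef_)

print(coef_norms)
```

A small `C` means strong regularization and coefficients pulled toward zero; a large `C` lets the coefficients grow to fit the training data more closely.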

We'll use **GridSearchCV** and **logistic regression** to find the optimal `C`.

In [31]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [32]:
# prepare the data
df = pd.read_csv('../data/diabetes.csv')
X = df.drop('diabetes', axis=1).values
y = df.diabetes.values

print(type(X), X.shape)
print(type(y), y.shape)

<class 'numpy.ndarray'> (768, 8)
<class 'numpy.ndarray'> (768,)


**NOTE**:  We're NOT performing a test-train split in this example

In [33]:
# Setup the hyperparameter grid by using c_space as the grid of values to tune C over.
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiate a logistic regression classifier
logreg = LogisticRegression()

# Instantiate the GridSearchCV object with 5-fold cross-validation to tune 'C',
# specifying the classifier, parameter grid and number of folds
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# fit the data, executing the grid search
logreg_cv.fit(X, y)

# Print the best parameter and best score obtained from GridSearchCV 
# by accessing the best_params_ and best_score_ attributes of logreg_cv.
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

Tuned Logistic Regression Parameters: {'C': 163789.3706954068}
Best score is 0.7721354166666666


The exercise's expected result was a `C` value of 3.72759 with a best score of 0.77083, which differs from the output above.

### RandomizedSearchCV

GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters. A solution to this is to use RandomizedSearchCV, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions.
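The sampling step can be looked at in isolation with scikit-learn's `ParameterSampler`. In this sketch the hyperparameter names and ranges are illustrative assumptions, and `n_iter=5` is an arbitrary budget.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# A list is sampled uniformly; a scipy.stats distribution is sampled via .rvs()
param_dist = {
    "max_depth": [3, None],
    "min_samples_leaf": randint(1, 9),  # integers 1..8; the upper bound is exclusive
}

# Draw a fixed number of settings instead of enumerating every combination
sampled = list(ParameterSampler(param_dist, n_iter=5, random_state=0))
for setting in sampled:
    print(setting)
```

`RandomizedSearchCV` takes the same `n_iter` argument (10 by default), so the number of cross-validated fits is set by the budget rather than by the size of the search space.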

We'll use a new algorithm/model, **Decision Tree**. Just like **k-NN**, **linear regression**, and **logistic regression**, **decision trees** in scikit-learn use the same API. There are `.fit()` and `.predict()` methods that you can use in exactly the same way. **Decision trees** have many parameters that can be tuned, such as `max_features`, `max_depth`, and `min_samples_leaf`. This makes it an ideal use case for **RandomizedSearchCV**.

In [30]:
# Import necessary modules
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Setup the parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object,  specify the classifier, 
# parameter distribution, and number of folds to use.
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

# Fit it to the data
tree_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))


Tuned Decision Tree Parameters: {'criterion': 'entropy', 'max_depth': 3, 'max_features': 8, 'min_samples_leaf': 3}
Best score is 0.7395833333333334


Exercise results:

Tuned Decision Tree Parameters: {'criterion': 'gini', 'max_features': 3, 'max_depth': 3, 'min_samples_leaf': 7}
Best score is 0.7408854166666666

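**Note**: When both search the same candidate values, RandomizedSearchCV cannot outperform GridSearchCV, because it evaluates only a sample of the settings that the grid search tries exhaustively. Instead, it is valuable because it saves on computation time.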

We've used ALL our data in **cross-validation**, and so cannot report on how well our model might perform against unseen data. Instead, we should first split the dataset using `train_test_split` and set aside the test data, perform the **grid search cross-validation** on the training set only to tune the hyperparameters, and then use the test set to evaluate how well the model will fare with unseen data.
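A minimal sketch of that hold-out workflow follows. It assumes synthetic data in place of the diabetes dataset, and the split size, grid of `C` values, and `random_state` values are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Set the test data aside before any tuning takes place
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Tune C with 5-fold cross-validation on the training set only
param_grid = {"C": np.logspace(-3, 3, 7)}
logreg_cv = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
logreg_cv.fit(X_train, y_train)

# By default GridSearchCV refits the best model on the whole training set,
# so .score() here evaluates that model on data it has never seen
test_accuracy = logreg_cv.score(X_test, y_test)
print(logreg_cv.best_params_, logreg_cv.best_score_, test_accuracy)
```

Here `best_score_` is the cross-validated score used to pick `C`, while `test_accuracy` is the estimate of performance on unseen data.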