# Model Optimization-Hyperparameters selection
Hyperparameter: parameter that can be tuned to optimize the performance of a learning algorithm.

* How should the dataset be created to find the **optimal tuning parameter**?
* How can K-fold cross-validation be used to search for an **optimal tuning parameter**?
* How do you search for **multiple tuning parameters** at once?
* How can we combine hyperparameters tuning and cross-validation with small dataset?
* How can the **computational expense** of this process be reduced?


Parameter tuning needs to be viewed as part of the learning algorithm and must be done using the training data only. The procedure that should be followed is the one in which we: 
1) Split the training data into a smaller “training” set and a "validation set” (normally, the data is shuffled first)
2) Build models using different values of the hyperparameter
k on the new, smaller training set and evaluate them on the validation set
3) Pick the best value of k and rebuild the model on the full original training set
4) Evaluate on a separate test dataset

**Adjusting the hyperparameter to the test data will lead to optimistic performance estimates on test
data!**

In [1]:
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from utilities.ml_utilities import print_cv_results
import numpy as np

In [2]:
# Load the dataset and retrieve features and target
iris = load_iris()
X, y = iris.data, iris.target

## Hyperparameters and k-fold cross-validation `GridSearchCV`
* For each combination of hyperparameters $H_i$ we would like to evaluate:
    1) We fit the model $k$ times in order to validate the model on each fold.
    2) We compute the average accuracy over the $k$ fold, for a combination of hyperparameters $H_i$.
* We pick the combination of hyperparameters with the best average accuracy.
* Refit the model with the best hyperparameters on the entire training set.
* We evaluate the model on the test set.


Firstly, we split the dataset into training and test. We use the 20% to test and the remaining for training the model. We use holdout method with stratification to split the dataset into training and test.

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.2, stratify=y)

Then, define the parameter values that should be searched. So, create a parameter grid: map the parameter names to the values that should be searched. We are defining the search space for our model.

In [4]:
param_grid = dict(n_neighbors=[1, 3, 5, 7, 9, 12])
print(f'Search space for KNearestNeighbours:\n{param_grid}')

Search space for KNearestNeighbours:
{'n_neighbors': [1, 3, 5, 7, 9, 12]}


Instantiate the grid and start the search. **NB:**
* We select **`cv = 10`**, thus we are performing 10-fold cross-validation.
* We can set **`refit = True`** if we would like to rebuild the model on the entire training set with the best hyperparameters.
* We can set **`n_jobs = -1`** to run computations in parallel (if supported by your computer and OS).

In [5]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy', 
                    n_jobs=-1, refit=True, return_train_score=True)
grid.fit(X_train, y_train)

In [6]:
print_cv_results(grid, 6)

Unnamed: 0,params,mean_train_score,std_train_score,mean_val_score,std_val_score,rank_val_score
0,{'n_neighbors': 3},0.962037,0.011302,0.975,0.038188,1
1,{'n_neighbors': 7},0.977778,0.009443,0.975,0.038188,1
2,{'n_neighbors': 9},0.978704,0.009305,0.975,0.038188,1
3,{'n_neighbors': 12},0.97963,0.009072,0.975,0.053359,1
4,{'n_neighbors': 5},0.978704,0.008333,0.966667,0.055277,5
5,{'n_neighbors': 1},1.0,0.0,0.95,0.055277,6


We can retrieve the best hyperparameters has followed, the best score and the best model.

In [7]:
print(f'Best validation score: {grid.best_score_}')
print(f'Best hyperparameters: {grid.best_params_}')
best_model = grid.best_estimator_
print(f'Best model: {best_model}')

Best validation score: 0.975
Best hyperparameters: {'n_neighbors': 3}
Best model: KNeighborsClassifier(n_neighbors=3)


Finally, we can evaluate our model on the test set.

In [8]:
print(f'Test accuracy score: {best_model.score(X_test, y_test)}')

Test accuracy score: 0.9333333333333333


## Searching multiple parameters simultaneously
We will see how to search multiple parameters simultaneously. In addition, we will use **`cv = RepeatedStratifiedKFold`**  as **`cv`** strategy.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.2, stratify=y)

In [10]:
param_grid = dict(n_neighbors=[1, 3, 5, 7, 9, 12],
                  weights=['uniform', 'distance'])
print(f'Search space for KNearestNeighbours:\n{param_grid}')

Search space for KNearestNeighbours:
{'n_neighbors': [1, 3, 5, 7, 9, 12], 'weights': ['uniform', 'distance']}


In [11]:
from sklearn.model_selection import RepeatedStratifiedKFold
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10)

In [12]:
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cv, scoring='accuracy',
                    n_jobs=-1, refit=True, return_train_score=True)
grid.fit(X_train, y_train)

In [13]:
print_cv_results(grid, 10)

Unnamed: 0,params,mean_train_score,std_train_score,mean_val_score,std_val_score,rank_val_score
0,"{'n_neighbors': 7, 'weights': 'distance'}",1.0,0.0,0.974167,0.043613,1
1,"{'n_neighbors': 9, 'weights': 'distance'}",1.0,0.0,0.974167,0.043613,1
2,"{'n_neighbors': 12, 'weights': 'distance'}",1.0,0.0,0.974167,0.043613,1
3,"{'n_neighbors': 7, 'weights': 'uniform'}",0.977222,0.00606,0.9725,0.045727,4
4,"{'n_neighbors': 12, 'weights': 'uniform'}",0.97537,0.007547,0.969167,0.045116,5
5,"{'n_neighbors': 9, 'weights': 'uniform'}",0.979352,0.006911,0.968333,0.045308,6
6,"{'n_neighbors': 5, 'weights': 'uniform'}",0.971019,0.007131,0.966667,0.04714,7
7,"{'n_neighbors': 5, 'weights': 'distance'}",1.0,0.0,0.966667,0.04714,7
8,"{'n_neighbors': 1, 'weights': 'uniform'}",1.0,0.0,0.965,0.048848,9
9,"{'n_neighbors': 1, 'weights': 'distance'}",1.0,0.0,0.965,0.048848,9


In [14]:
print(f'Best validation score: {grid.best_score_}')
print(f'Best hyperparameters: {grid.best_params_}')
best_model = grid.best_estimator_
print(f'Best model: {best_model}')

Best validation score: 0.9741666666666667
Best hyperparameters: {'n_neighbors': 7, 'weights': 'distance'}
Best model: KNeighborsClassifier(n_neighbors=7, weights='distance')


In [15]:
print(f'Test accuracy score: {best_model.score(X_test, y_test)}')

Test accuracy score: 1.0


## What to do when the training sets are very small? `nested cross-validation`

## Reducing computational expense using `RandomizedSearchCV`
- Searching many different parameters at once may be computationally infeasible
- `RandomizedSearchCV` searches a subset of the parameters, and you control the computational "budget"

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.2, stratify=y)

Specify "parameter distributions" rather than a "parameter grid".

In [17]:
from sklearn.model_selection import RandomizedSearchCV
param_dist = dict(n_neighbors=range(1,12), weights=['uniform', 'distance'])
print(f'Search space for KNearestNeighbours:\n{param_dist}')

Search space for KNearestNeighbours:
{'n_neighbors': range(1, 12), 'weights': ['uniform', 'distance']}


In [18]:
# n_iter controls the number of searches
rand = RandomizedSearchCV(KNeighborsClassifier(), param_dist, cv=10, scoring='accuracy', n_iter=10, refit=True, return_train_score=True)
rand.fit(X, y)

In [19]:
print_cv_results(rand, 10)

Unnamed: 0,params,mean_train_score,std_train_score,mean_val_score,std_val_score,rank_val_score
0,"{'weights': 'distance', 'n_neighbors': 11}",1.0,0.0,0.973333,0.03266,1
1,"{'weights': 'distance', 'n_neighbors': 9}",1.0,0.0,0.973333,0.03266,1
2,"{'weights': 'distance', 'n_neighbors': 10}",1.0,0.0,0.973333,0.03266,1
3,"{'weights': 'uniform', 'n_neighbors': 8}",0.98,0.005785,0.966667,0.044721,4
4,"{'weights': 'uniform', 'n_neighbors': 10}",0.976296,0.006458,0.966667,0.044721,4
5,"{'weights': 'uniform', 'n_neighbors': 7}",0.973333,0.005926,0.966667,0.044721,4
6,"{'weights': 'uniform', 'n_neighbors': 6}",0.972593,0.008148,0.966667,0.044721,4
7,"{'weights': 'uniform', 'n_neighbors': 4}",0.963704,0.006988,0.966667,0.044721,8
8,"{'weights': 'uniform', 'n_neighbors': 3}",0.960741,0.007444,0.966667,0.044721,8
9,"{'weights': 'distance', 'n_neighbors': 3}",1.0,0.0,0.966667,0.044721,8


In [20]:
print(f'Best validation score: {rand.best_score_}')
print(f'Best hyperparameters: {rand.best_params_}')
best_model = rand.best_estimator_
print(f'Best model: {best_model}')

Best validation score: 0.9733333333333334
Best hyperparameters: {'weights': 'distance', 'n_neighbors': 11}
Best model: KNeighborsClassifier(n_neighbors=11, weights='distance')


In [21]:
print(f'Test accuracy score: {best_model.score(X_test, y_test)}')

Test accuracy score: 1.0
