# Model Optimization-Hyperparameters selection
Hyperparameter: parameter that can be tuned to optimize the performance of a learning algorithm.

* How should the dataset be created to find the **optimal tuning parameter**?
* How can K-fold cross-validation be used to search for an **optimal tuning parameter**?
* How do you search for **multiple tuning parameters** at once?
* How can we combine hyperparameters tuning and cross-validation with small dataset?
* How can the **computational expense** of this process be reduced?


Parameter tuning needs to be viewed as part of the learning algorithm and must be done using the training data only. The procedure that should be followed is the one in which we: 
1) Split the training data into a smaller “training” set and a "validation set” (normally, the data is shuffled first)
2) Build models using different values of the hyperparameter
k on the new, smaller training set and evaluate them on the validation set
3) Pick the best value of k and rebuild the model on the full original training set
4) Evaluate on a separate test dataset

**Adjusting the hyperparameter to the test data will lead to optimistic performance estimates on test
data!**

In [1]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from utilities.ml_utilities import print_cv_results
import numpy as np

In [2]:
# Load the dataset and retrieve features and target
iris = load_iris()
X, y = iris.data, iris.target

## Hyperparameters and k-fold cross-validation `GridSearchCV`
* For each combination of hyperparameters $H_i$ we would like to evaluate:
    1) We fit the model $k$ times in order to validate the model on each fold.
    2) We compute the average accuracy over the $k$ fold, for a combination of hyperparameters $H_i$.
* We pick the combination of hyperparameters with the best average accuracy.
* Refit the model with the best hyperparameters on the entire training set.
* We evaluate the model on the test set.


Firstly, we split the dataset into training and test. We use the 20% to test and the remaining for training the model. We use holdout method with stratification to split the dataset into training and test.

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.2, stratify=y)

Then, define the parameter values that should be searched. So, create a parameter grid: map the parameter names to the values that should be searched. We are defining the search space for our model.

In [4]:
param_grid = dict(n_neighbors=[1, 3, 5, 7, 9, 12])
print(f'Search space for KNearestNeighbours:\n{param_grid}')

Search space for KNearestNeighbours:
{'n_neighbors': [1, 3, 5, 7, 9, 12]}


Instantiate the grid and start the search. **NB:**
* We select **`cv = 10`**, thus we are performing 10-fold cross-validation.
* We can set **`refit = True`** if we would like to rebuild the model on the entire training set with the best hyperparameters.
* We can set **`n_jobs = -1`** to run computations in parallel (if supported by your computer and OS).

In [5]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy', 
                    n_jobs=-1, refit=True, return_train_score=True)
grid.fit(X_train, y_train)

In [6]:
print_cv_results(grid, 6)

Unnamed: 0,params,mean_train_score,std_train_score,mean_val_score,std_val_score,rank_val_score
0,{'n_neighbors': 3},0.97037,0.008072,0.966667,0.055277,1
1,{'n_neighbors': 1},1.0,0.0,0.958333,0.055902,2
2,{'n_neighbors': 5},0.976852,0.006211,0.958333,0.055902,2
3,{'n_neighbors': 7},0.975926,0.008486,0.95,0.066667,4
4,{'n_neighbors': 9},0.969444,0.008333,0.95,0.066667,4
5,{'n_neighbors': 12},0.963889,0.012037,0.933333,0.072648,6


We can retrieve the best hyperparameters has followed, the best score and the best model.

In [7]:
print(f'Best validation score: {grid.best_score_}')
print(f'Best hyperparameters: {grid.best_params_}')
best_model = grid.best_estimator_
print(f'Best model: {best_model}')

Best validation score: 0.9666666666666666
Best hyperparameters: {'n_neighbors': 3}
Best model: KNeighborsClassifier(n_neighbors=3)


Finally, we can evaluate our model on the test set.

In [8]:
print(f'Test accuracy score: {best_model.score(X_test, y_test)}')

Test accuracy score: 0.9666666666666667


## Searching multiple parameters simultaneously
We will see how to search multiple parameters simultaneously. In addition, we will use **`cv = RepeatedStratifiedKFold`**  as **`cv`** strategy.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.2, stratify=y)

In [10]:
param_grid = dict(n_neighbors=[1, 3, 5, 7, 9, 12],
                  weights=['uniform', 'distance'])
print(f'Search space for KNearestNeighbours:\n{param_grid}')

Search space for KNearestNeighbours:
{'n_neighbors': [1, 3, 5, 7, 9, 12], 'weights': ['uniform', 'distance']}


In [11]:
from sklearn.model_selection import RepeatedStratifiedKFold
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10)

In [12]:
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cv, scoring='accuracy',
                    n_jobs=-1, refit=True, return_train_score=True)
grid.fit(X_train, y_train)

In [13]:
print_cv_results(grid, 10)

Unnamed: 0,params,mean_train_score,std_train_score,mean_val_score,std_val_score,rank_val_score
0,"{'n_neighbors': 3, 'weights': 'uniform'}",0.983148,0.004417,0.983333,0.037268,1
1,"{'n_neighbors': 3, 'weights': 'distance'}",1.0,0.0,0.983333,0.037268,1
2,"{'n_neighbors': 5, 'weights': 'distance'}",1.0,0.0,0.981667,0.038333,3
3,"{'n_neighbors': 7, 'weights': 'distance'}",1.0,0.0,0.981667,0.038333,3
4,"{'n_neighbors': 9, 'weights': 'distance'}",1.0,0.0,0.981667,0.038333,3
5,"{'n_neighbors': 12, 'weights': 'distance'}",1.0,0.0,0.980833,0.038828,6
6,"{'n_neighbors': 5, 'weights': 'uniform'}",0.982685,0.004839,0.98,0.041028,7
7,"{'n_neighbors': 7, 'weights': 'uniform'}",0.98287,0.004606,0.979167,0.039747,8
8,"{'n_neighbors': 9, 'weights': 'uniform'}",0.981389,0.005631,0.9775,0.043867,9
9,"{'n_neighbors': 12, 'weights': 'uniform'}",0.981389,0.005928,0.975833,0.044558,10


In [14]:
print(f'Best validation score: {grid.best_score_}')
print(f'Best hyperparameters: {grid.best_params_}')
best_model = grid.best_estimator_
print(f'Best model: {best_model}')

Best validation score: 0.9833333333333334
Best hyperparameters: {'n_neighbors': 3, 'weights': 'uniform'}
Best model: KNeighborsClassifier(n_neighbors=3)


In [15]:
print(f'Test accuracy score: {best_model.score(X_test, y_test)}')

Test accuracy score: 0.9


## What to do when the training sets are very small? `nested cross-validation`
For each training set of the **outer`k-fold cross-validation`**, run an **inner `p-fold cross-validation`** to choose the best hyperparameter value, where **p < k**.
* Outer cross-validation is used to estimate the quality of learning process.
* Inner cross-validations are used to choose hyperparameter values.

In [16]:
from sklearn.model_selection import  StratifiedKFold

In [17]:
param_grid = dict(n_neighbors=[1, 3, 5, 7, 9, 12],
                  weights=['uniform', 'distance'])
print(f'Search space for KNearestNeighbours:\n{param_grid}')

Search space for KNearestNeighbours:
{'n_neighbors': [1, 3, 5, 7, 9, 12], 'weights': ['uniform', 'distance']}


Set up the inner and outer cross-validation scheme.

In [18]:
inner_cv = StratifiedKFold(n_splits=3, shuffle=True)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True)

In [19]:
outer_results = []

for train_idx, test_idx in outer_cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Perform GridSearchCV within each outer fold
    grid_search = GridSearchCV(KNeighborsClassifier(), 
                               param_grid, 
                               cv=inner_cv, 
                               scoring='accuracy', 
                               return_train_score=True, 
                               refit=True,
                               n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    best_model = grid_search.best_estimator_
    test_score = best_model.score(X_test, y_test)
    outer_results.append((test_score, grid_search.best_params_))

outer_results = pd.DataFrame(outer_results, columns=["Outer_CV_Score", "Best_params"])
print(f'Outer CV Mean score: {outer_results.Outer_CV_Score.mean():.4f}, '
      f'Std: {outer_results.Outer_CV_Score.std():.4f}')
outer_results

Outer CV Mean score: 0.9800, Std: 0.0298


Unnamed: 0,Outer_CV_Score,Best_params
0,0.966667,"{'n_neighbors': 12, 'weights': 'distance'}"
1,1.0,"{'n_neighbors': 3, 'weights': 'uniform'}"
2,1.0,"{'n_neighbors': 7, 'weights': 'uniform'}"
3,1.0,"{'n_neighbors': 9, 'weights': 'uniform'}"
4,0.933333,"{'n_neighbors': 5, 'weights': 'uniform'}"


And now suppose we want to retrain on all the original dataset with the best parameters.

In [20]:
# Finding the index of the maximum Outer_CV_Score
max_score_idx = outer_results['Outer_CV_Score'].idxmax()

# Extracting the best parameters corresponding to the maximum score
best_params_with_max_score = outer_results.loc[max_score_idx, 'Best_params']
print("Best parameters with the highest outer CV score:", best_params_with_max_score)

Best parameters with the highest outer CV score: {'n_neighbors': 3, 'weights': 'uniform'}


In [21]:
# Re-train the model on the entire dataset with the best parameters
best_model = KNeighborsClassifier(**best_params_with_max_score)
best_model.fit(X, y)

# Final evaluation on the entire dataset
final_score = best_model.score(X, y)
print(f'Final model accuracy: {final_score:.4f}')

Final model accuracy: 0.9600


## Reducing computational expense using `RandomizedSearchCV`
- Searching many different parameters at once may be computationally infeasible
- `RandomizedSearchCV` searches a subset of the parameters, and you control the computational "budget"

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.2, stratify=y)

Specify "parameter distributions" rather than a "parameter grid".

In [23]:
from sklearn.model_selection import RandomizedSearchCV
param_dist = dict(n_neighbors=range(1,12), weights=['uniform', 'distance'])
print(f'Search space for KNearestNeighbours:\n{param_dist}')

Search space for KNearestNeighbours:
{'n_neighbors': range(1, 12), 'weights': ['uniform', 'distance']}


In [24]:
# n_iter controls the number of searches
rand = RandomizedSearchCV(KNeighborsClassifier(), param_dist, cv=10, scoring='accuracy', n_iter=10, refit=True, return_train_score=True)
rand.fit(X, y)

In [25]:
print_cv_results(rand, 10)

Unnamed: 0,params,mean_train_score,std_train_score,mean_val_score,std_val_score,rank_val_score
0,"{'weights': 'distance', 'n_neighbors': 9}",1.0,0.0,0.973333,0.03266,1
1,"{'weights': 'distance', 'n_neighbors': 11}",1.0,0.0,0.973333,0.03266,1
2,"{'weights': 'distance', 'n_neighbors': 10}",1.0,0.0,0.973333,0.03266,1
3,"{'weights': 'uniform', 'n_neighbors': 7}",0.973333,0.005926,0.966667,0.044721,4
4,"{'weights': 'uniform', 'n_neighbors': 8}",0.98,0.005785,0.966667,0.044721,4
5,"{'weights': 'uniform', 'n_neighbors': 6}",0.972593,0.008148,0.966667,0.044721,4
6,"{'weights': 'distance', 'n_neighbors': 6}",1.0,0.0,0.966667,0.044721,4
7,"{'weights': 'distance', 'n_neighbors': 7}",1.0,0.0,0.966667,0.044721,4
8,"{'weights': 'distance', 'n_neighbors': 4}",1.0,0.0,0.966667,0.044721,9
9,"{'weights': 'uniform', 'n_neighbors': 2}",0.978519,0.005185,0.953333,0.052068,10


In [26]:
print(f'Best validation score: {rand.best_score_}')
print(f'Best hyperparameters: {rand.best_params_}')
best_model = rand.best_estimator_
print(f'Best model: {best_model}')

Best validation score: 0.9733333333333334
Best hyperparameters: {'weights': 'distance', 'n_neighbors': 9}
Best model: KNeighborsClassifier(n_neighbors=9, weights='distance')


In [27]:
print(f'Test accuracy score: {best_model.score(X_test, y_test)}')

Test accuracy score: 1.0
