**L3 BASIC SEARCH ALGORITHMS**

- The process of finding the best hyperparameters for a given dataset is called hyperparameter optimization or hyperparameter tuning
- The best hyperparameters are those that maximise the performance of the ML algorithm

<p> A search consists of:</p>
<p> - Hyperparameter space
<p> - <b>A method for sampling candidate hyperparameters (the focus of this section)</b>
<p> - A cross validation scheme
<p> - A performance metric to minimise (or maximise)
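
As a rough illustration of how these four pieces fit together, here is a minimal sketch using scikit-learn. The hyperparameter values, the choice of `ParameterGrid` as the sampling method, and names such as `param_space` and `candidates` are assumptions made for this example only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, ParameterGrid, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 1. Hyperparameter space (illustrative values, not a recommendation)
param_space = {"C": [0.01, 0.1, 1.0], "penalty": ["l1", "l2"]}

# 2. Sampling method: here, every combination in the space (a grid)
candidates = list(ParameterGrid(param_space))

# 3. Cross-validation scheme
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# 4. Performance metric to maximise
scoring = "accuracy"

results = []
for params in candidates:
    model = LogisticRegression(solver="liblinear", max_iter=10000, **params)
    scores = cross_val_score(model, X, y, cv=cv, scoring=scoring)
    results.append((params, scores.mean()))

# keep the candidate with the highest mean cross-validation accuracy
best_params, best_score = max(results, key=lambda r: r[1])
print(best_params, round(best_score, 4))
```

Swapping the sampling method (step 2) while keeping the other three pieces fixed is what distinguishes the search algorithms discussed in this section.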

**3.1 Hyperparameter tuning challenges**
- We can't define a formula to find the best hyperparameters
- Instead, we try different combinations of hyperparameters and evaluate model performance for each
- The critical step is to choose how many different <b>hyperparameter combinations</b> we are going to test
- The number of combinations we can test is limited by the computer resources available to us
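
To get a feel for how quickly the cost grows, the sketch below counts the model fits needed for a small, hypothetical grid. The hyperparameter names and values are assumptions chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid for a tree ensemble; the values are illustrative only
param_space = {
    "n_estimators": [10, 50, 100, 500],
    "max_depth": [1, 2, 3, 5, 10],
    "min_samples_split": [2, 10, 50],
}
n_folds = 5

# every combination is trained once per cross-validation fold
n_combinations = len(ParameterGrid(param_space))
n_fits = n_combinations * n_folds

print(n_combinations)  # 4 * 5 * 3 = 60
print(n_fits)          # 60 * 5 = 300
```

Adding one more hyperparameter with 4 candidate values would multiply both numbers by 4, which is why the number of combinations has to be chosen with the available resources in mind.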

<img src='challenges.png'>
- Low effective dimensions
<img src='dimensions.png' width="1300" height="600">

**Explanation of low effective dimensions**

When deciding which hyperparameters to tune, it is important to understand that not every hyperparameter has the same impact on model performance:
<p> <i>i. The minimum number of samples required to split a node has a very small impact on model performance compared with the number of trees and the tree depth</i>
<p> <i>ii. Even for hyperparameters that have a high impact on model performance, e.g. tree depth and number of trees, there is a point (e.g. number of trees = 50, or tree depth = 3 in our example) beyond which increasing the value won't improve model performance</i>
<p> <i>iii. As the charts show, it is more important to search certain regions of the hyperparameter space, e.g. tree depth between 1 and 3, number of trees between 10 and 50</i>
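
One way to check this kind of sensitivity on your own data is to vary one hyperparameter at a time and compare how much the cross-validated score moves. The sketch below does this with scikit-learn's `validation_curve` on the breast cancer dataset; the ranges, the small `n_estimators`, and the `spread` dictionary are assumptions for illustration, and the result will differ from dataset to dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = load_breast_cancer(return_X_y=True)

# Illustrative ranges; n_estimators is kept small so the sketch runs quickly
ranges = {
    "max_depth": [1, 2, 3, 5],
    "min_samples_split": [2, 10, 50],
}

spread = {}
for name, values in ranges.items():
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    _, test_scores = validation_curve(
        model, X, y, param_name=name, param_range=values,
        cv=3, scoring="accuracy",
    )
    mean_scores = test_scores.mean(axis=1)  # one mean per candidate value
    # difference between the best and worst candidate for this hyperparameter
    spread[name] = float(mean_scores.max() - mean_scores.min())
    print(name, dict(zip(values, [round(float(s), 3) for s in mean_scores])))

print(spread)
```

A hyperparameter with a small spread over a wide range is a candidate for leaving at its default, so the search budget can go to the ones that matter.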

**Basic Hyperparameter Tuning Methods**
1. Manual Search
2. Grid Search
3. Random Search
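
The two automated methods differ mainly in how candidates are sampled. As a small sketch (assuming scikit-learn's `ParameterGrid` and `ParameterSampler` and SciPy's `loguniform`, with illustrative values), grid search enumerates every combination of fixed values, while random search draws a fixed number of candidates from distributions:

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid search: every combination of a fixed set of values
grid = list(ParameterGrid({"C": [0.001, 0.1, 1], "penalty": ["l1", "l2"]}))

# Random search: a fixed number of draws from distributions over the same space
sampler = ParameterSampler(
    {"C": loguniform(1e-3, 1), "penalty": ["l1", "l2"]},
    n_iter=6,
    random_state=0,
)
random_candidates = list(sampler)

print(len(grid), len(random_candidates))
for candidate in random_candidates:
    print(candidate)
```

Both lists contain 6 candidates here, but the random candidates generally cover 6 different values of `C`, whereas the grid covers only 3.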

**1. Manual Search**

Consists of trying and testing different hyperparameters manually.

**Uses:**
<p><i> i. To identify regions of promising hyperparameters</i>: Remember that there are value ranges of a hyperparameter where increasing the value changes performance significantly (see low effective dimensions above), while increasing it beyond that range doesn't change performance much. Manual search tells us where those ranges end, i.e. the values beyond which the model doesn't improve its performance further
<p><i> ii. To delimit the Grid Search: </i> To run grid search, we need to create a hyperparameter space consisting of the interval of values we want to test. Usually these intervals are defined manually
<p><i> iii. To get familiar with the hyperparameters and their effect on the models: </i>After manually changing the values of these parameters and seeing their impact on model performance, we begin to understand which ones have a greater impact on model performance and which ones don't.
<p><i> iv. To establish the benchmark model: </i>This is usually a quick model we build with our data after a little bit of data analysis. It is optimised later
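
As a small illustration of use ii, suppose a few manual trials suggested that values of C between 0.1 and 1 look promising (this range is an assumption made for the example). The interval can then be turned into a grid of log-spaced candidates:

```python
import numpy as np

# Assumption: manual trials suggested that C between 0.1 and 1 is promising
C_low, C_high = 0.1, 1.0

# Five log-spaced candidates inside that range
C_grid = np.logspace(np.log10(C_low), np.log10(C_high), num=5)

# hyperparameter space that a grid search could then explore
param_grid = {"C": C_grid.tolist(), "penalty": ["l1", "l2"]}

print(np.round(C_grid, 3))
```

Log spacing is a common choice for regularization strengths because their effect tends to change with the order of magnitude and not with the absolute difference.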
    
**Limitations:**
<p><i> i. Lack of reproducibility: </i> because we are testing manually, another experimenter who tries different values may not arrive at the same conclusions that we did
<p><i> ii. Time consuming </i>
<p><i> iii. Does not explore the entire hyperparameter space </i>
<p><i> iv. Does not scale </i>
    
**1.1 Demo: Manual Search for Hyperparameters**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from sklearn.model_selection import (
    KFold,
    cross_validate,
    train_test_split,
)

In [5]:
# load dataset
breast_cancer_X, breast_cancer_y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(breast_cancer_X)
y = pd.Series(breast_cancer_y).map({0:1, 1:0})

X.head()

,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [7]:
# proportion of benign (0) and malignant (1) tumors
y.value_counts(normalize=True)

0    0.627417
1    0.372583
dtype: float64

In [13]:
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

X_train.shape, X_test.shape

((398, 30), (171, 30))

**1.1 Manual search - Logistic Regression**
- We'll first try C = 0.001 vs C = 1: C, the inverse of the regularization strength, is the hyperparameter that typically affects the performance of logistic regression models the most
- Then we'll try the penalty (regularization type) = l1 vs l2: this usually doesn't impact performance as much
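
The penalty comparison can be sketched as a short loop that reuses the same K-fold scheme for both settings. This is a minimal sketch, not the notebook's own cell: it fixes C=1 as an assumption, reloads the data so it runs on its own, and stores the results in a hypothetical `penalty_results` DataFrame.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate, train_test_split

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # relabel so that 1 = malignant, as in the notebook
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# same K-fold scheme for both penalties, so the scores are comparable
kf = KFold(n_splits=5, shuffle=True, random_state=4)

rows = []
for penalty in ['l1', 'l2']:
    logit = LogisticRegression(
        penalty=penalty, C=1, solver='liblinear', random_state=4, max_iter=10000)
    cv_res = cross_validate(logit, X_train, y_train, scoring='accuracy', cv=kf)
    rows.append({
        'penalty': penalty,
        'mean_acc': cv_res['test_score'].mean(),
        'std_acc': cv_res['test_score'].std(),
    })

penalty_results = pd.DataFrame(rows)
print(penalty_results)
```

The `liblinear` solver is used because it supports both the l1 and the l2 penalty.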

In [15]:
# Logistic Regression
logit = LogisticRegression(
    penalty='l2', C=0.001, solver='liblinear', random_state=4, max_iter=10000)

# K-Fold cross validation
kf = KFold(n_splits=5, shuffle=True, random_state=4)

# estimate generalization error
clf = cross_validate(
    logit,
    X_train,
    y_train,
    scoring='accuracy', #we optimise the accuracy
    return_train_score=True,
    cv=kf
)

print('mean train set accuracy: ', np.mean(clf['train_score']), ' +- ', np.std(clf['train_score']))
print('mean test set accuracy: ', np.mean(clf['test_score']), ' +- ', np.std(clf['test_score']))

mean train set accuracy:  0.9170836537134519  +-  0.0038064240947020133
mean test set accuracy:  0.9195886075949368  +-  0.006259426475686005


**1.2 What if we skip cross-validation and fit the same model (C=0.001) on the whole training set, how will this base model perform on the test set?**

In [16]:
logit.fit(X_train, y_train)

train_preds = logit.predict(X_train)
test_preds = logit.predict(X_test)

print('Train Accuracy: ', accuracy_score(y_train, train_preds))
print('Test Accuracy: ', accuracy_score(y_test, test_preds))

Train Accuracy:  0.9170854271356784
Test Accuracy:  0.9473684210526315


**1.2 Observations:**
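- The base model (fit on the whole training set without K-fold cross-validation, still using C=0.001) reaches a test set accuracy of 94.7%, which is surprisingly higher than the 92.0% mean accuracy estimated through K-fold cross-validation. The two numbers are measured on different data (a single hold-out split of 171 samples vs 5 folds of the training set), so some gap between them is expected
- Let's see if we can improve the model performance obtained through K-fold cross-validation (this time we will use C=1 and rerun the K-fold evaluation)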
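
To see how much a single hold-out estimate can move around, the sketch below repeats the train/test split with several random seeds and refits the same C=0.001 model each time. The seeds and the name `holdout_scores` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

holdout_scores = []
for seed in range(10):  # the seeds are arbitrary, chosen only for illustration
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    logit = LogisticRegression(
        penalty='l2', C=0.001, solver='liblinear', random_state=4, max_iter=10000)
    logit.fit(X_tr, y_tr)
    # accuracy on this particular hold-out split
    holdout_scores.append(accuracy_score(y_te, logit.predict(X_te)))

holdout_scores = np.array(holdout_scores)
print('min: ', holdout_scores.min(), ' max: ', holdout_scores.max())
print('mean: ', holdout_scores.mean(), ' +- ', holdout_scores.std())
```

If the minimum and maximum differ by a few percentage points, a gap of similar size between one hold-out score and a cross-validation mean should not be over-interpreted.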

**Try K-fold with C=1**

In [17]:
# Logistic Regression
logit = LogisticRegression(
    penalty='l2', C=1, solver='liblinear', random_state=4, max_iter=10000)

# K-Fold cross validation
kf = KFold(n_splits=5, shuffle=True, random_state=4)

# estimate generalization error
clf = cross_validate(
    logit,
    X_train,
    y_train,
    scoring='accuracy', #we optimise the accuracy
    return_train_score=True,
    cv=kf
)

print('mean train set accuracy: ', np.mean(clf['train_score']), ' +- ', np.std(clf['train_score']))
print('mean test set accuracy: ', np.mean(clf['test_score']), ' +- ', np.std(clf['test_score']))

mean train set accuracy:  0.9604266477395951  +-  0.0015497668615040091
mean test set accuracy:  0.9447784810126582  +-  0.02565126706427742


**1.3 Observations**
- We notice that the mean accuracy on the cross-validation test folds increased quite a bit, from about 92% to 94%. However, the spread across folds (the standard deviation reported after +-) also increased, from 0.0063 to 0.026.
- So let's try another value for C and see if the accuracy increases and the spread decreases
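
Instead of copying the cell for every new value, the same evaluation can be wrapped in a small function and run over a list of candidates. This is a minimal sketch under a few assumptions: `evaluate_C` is a hypothetical helper, the list of C values is illustrative, and the data is reloaded so the block runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate, train_test_split

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # relabel so that 1 = malignant, as in the notebook
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=4)


def evaluate_C(C):
    """Cross-validate an l2 logistic regression for one value of C (hypothetical helper)."""
    logit = LogisticRegression(
        penalty='l2', C=C, solver='liblinear', random_state=4, max_iter=10000)
    clf = cross_validate(
        logit, X_train, y_train, scoring='accuracy',
        return_train_score=True, cv=kf)
    return {
        'C': C,
        'train_mean': clf['train_score'].mean(),
        'test_mean': clf['test_score'].mean(),
        'test_std': clf['test_score'].std(),
    }


# candidate values are illustrative; 0.001, 0.1 and 1 are the ones tried manually in this notebook
results = pd.DataFrame([evaluate_C(C) for C in [0.001, 0.01, 0.1, 1, 10]])
print(results)
```

Keeping every trial in one table also makes the manual search easier to reproduce, since the values tried and the scores obtained are recorded together.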

**Model trained with K-fold CV and C=0.1**

In [20]:
# Logistic Regression
logit = LogisticRegression(
    penalty='l2', C=0.1, solver='liblinear', random_state=4, max_iter=10000)

# K-Fold cross validation
kf = KFold(n_splits=5, shuffle=True, random_state=4)

# estimate generalization error
clf = cross_validate(
    logit,
    X_train,
    y_train,
    scoring='accuracy', #we optimise the accuracy
    return_train_score=True,
    cv=kf
)

print('mean train set accuracy: ', np.mean(clf['train_score']), ' +- ', np.std(clf['train_score']))
print('mean test set accuracy: ', np.mean(clf['test_score']), ' +- ', np.std(clf['test_score']))

mean train set accuracy:  0.9484966779046153  +-  0.006111315451121751
mean test set accuracy:  0.9347468354430379  +-  0.019811643085077286


**Observations**
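- Compared with C=1, the mean cross-validation accuracy is slightly lower (93.5% vs 94.5%) and the standard deviation is somewhat smaller (0.020 vs 0.026). Compared with C=0.001, the accuracy is higher (93.5% vs 92.0%) but the standard deviation is larger (0.020 vs 0.006).
- Let's now see how a logistic regression model with C=0.1, fit on the whole training set, performs on the test set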

In [21]:
logit.fit(X_train, y_train)

train_preds = logit.predict(X_train)
test_preds = logit.predict(X_test)

print('Train Accuracy: ', accuracy_score(y_train, train_preds))
print('Test Accuracy: ', accuracy_score(y_test, test_preds))

Train Accuracy:  0.9447236180904522
Test Accuracy:  0.9532163742690059


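**Observations:**
- Test set accuracy (95.3%) is higher than the cross-validation estimate (93.5%), while train set accuracy (94.5%) is similar to the cross-validation train accuracy (94.8%)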