# 🎯Hyperparameter Optimization for Machine Learning

The aim of this notebook:
> - Discuss multiple ways to optimize hyperparameters.
> - Understand the logic of each technique.
> - Review the considerations when using each technique.
> - Master the use of Python open-source tools for hyperparameter tuning.

---

## Parameters in ML models
> - The objective of a typical learning algorithm is to find a function `f` that minimizes a certain `loss` over a `dataset`.
> - The learning algorithm produces `f` through the optimization of a training criterion with respect to a set of `parameters`.

---

## Hyperparameters in ML models
> - Hyperparameters are parameters that are not directly learnt by the learning algorithm.
> - Hyperparameters are specified outside of the training procedure.
> - Hyperparameters control the capacity of the model, i.e., how flexible the model is to fit the data.
> - Well-chosen hyperparameters help prevent over-fitting.
> - Hyperparameters can have a big impact on the performance of the learning algorithm.
> - Optimal hyperparameter settings often differ between datasets. Therefore, they should be optimized for each dataset.

---

## Hyperparameter Nature
>- Some hyperparameters are discrete: Number of estimators in ensemble models.
>- Some hyperparameters are continuous: Penalization coefficient, Fraction of samples required per split.
>- Some hyperparameters are categorical: Loss (deviance, exponential), Regularization (Lasso, Ridge)
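
The three kinds of hyperparameters can be written down as a search space. The sketch below is only an illustration: the dictionary keys and the value ranges are assumptions chosen for this example, and the `scipy.stats` distributions are one common way to describe discrete and continuous ranges.

```python
import random

from scipy.stats import randint, uniform

# Hypothetical search space; names and ranges are assumptions for illustration
search_space = {
    "n_estimators": randint(50, 500),      # discrete: integers in [50, 500)
    "learning_rate": uniform(0.01, 0.29),  # continuous: floats in [0.01, 0.30]
    "loss": ["deviance", "exponential"],   # categorical: one of a fixed set
}

# Draw one value of each kind
n_estimators = search_space["n_estimators"].rvs(random_state=0)
learning_rate = search_space["learning_rate"].rvs(random_state=0)
loss = random.Random(0).choice(search_space["loss"])

print(n_estimators, learning_rate, loss)
```

Note that `uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, not from `loc` to `scale`.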

---

## Parameters vs Hyperparameters

|Parameters                  |   Hyperparameters |
|:-------------------------|----------------------:|
| - Intrinsic to model equation     | - Defined before training |
| - Optimized during training | - Constrain the algorithm|

> - The process of finding the best Hyperparameters for a given dataset is called `Hyperparameter Tuning` or `Hyperparameter Optimization`.
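
As a minimal sketch of the distinction, assuming a scikit-learn `LogisticRegression` on synthetic data (neither is part of this notebook's pipeline): `C` is set before training, while `coef_` and `intercept_` only exist after `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# C (inverse regularization strength) is a hyperparameter: chosen before training
model = LogisticRegression(C=0.5, max_iter=1000)
print("Hyperparameter C:", model.get_params()["C"])

# coef_ and intercept_ are parameters: learned from the data during fit
model.fit(X_demo, y_demo)
print("Shape of learned coefficients:", model.coef_.shape)
print("Shape of learned intercept:", model.intercept_.shape)
```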

---

## Challenges
>- There is no closed-form formula for the best hyperparameters.
>- We have to try different combinations of hyperparameters and evaluate model performance for each one. The critical decision is how many combinations to test.

More hyperparameter combinations ---> a higher chance of finding a better model ---> a higher computational cost

>- How do we find the hyperparameter combinations that maximize performance while minimizing computational cost?
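
The trade-off can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that every hyperparameter has 4 candidate values and that each combination is evaluated with 5-fold cross-validation.

```python
# Assumptions for illustration: 4 candidate values per hyperparameter, 5 CV folds
values_per_hyperparameter = 4
cv_folds = 5

fits_needed = {}
for n_hyperparameters in range(1, 6):
    n_combinations = values_per_hyperparameter ** n_hyperparameters
    fits_needed[n_hyperparameters] = n_combinations * cv_folds
    print(
        f"{n_hyperparameters} hyperparameters -> {n_combinations} combinations "
        f"-> {fits_needed[n_hyperparameters]} model fits"
    )
```

Each extra hyperparameter multiplies the number of model fits by the number of candidate values, which is why the budget has to be chosen deliberately.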

---

## Methods
Different hyperparameter optimization strategies:
>- Manual Search
>- Grid Search
>- Random Search
>- Bayesian Optimization

---

## Generalization vs Over-fitting
> Generalization is the ability of an algorithm to be effective across various inputs. A model that generalizes well performs consistently across different datasets drawn from the same distribution as the training data. When a model performs well on the train set but not on new, unseen data, it has over-fit the training data.
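
A minimal sketch of over-fitting, assuming a scikit-learn decision tree with no depth limit and noisy synthetic data (both chosen only for this illustration): the tree memorizes the train set, so the train accuracy is much higher than the accuracy on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise, used only for illustration
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# No depth limit: the tree keeps splitting until it fits the train set perfectly
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(f"Train accuracy: {train_acc:.4f}")
print(f"Test accuracy:  {test_acc:.4f}")
```

Limiting capacity, for example with `max_depth`, is one way a hyperparameter narrows this gap.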

---

## Training a Machine Learning Model
> To detect over-fitting and estimate how well the model generalizes, it is common practice to:
> - Separate the data into a train and a test set.
> - Train the model on the train set.
> - Evaluate it on the test set.

# Loading The Data

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

In [None]:
data = pd.read_csv("/kaggle/input/lending-club-dataset/lending_club_loan_two.csv")
data.head()

# 🔄 Data Preprocessing

In [None]:
print(f"Shape of the data before cleaning: {data.shape}")

drop_columns = ['emp_title', 'emp_length', 'title', 'address', 'grade',
                'issue_d', 'earliest_cr_line', 'term']
# Remove rows with missing values, then drop the columns we will not use
data.dropna(inplace=True)
data.drop(drop_columns, axis=1, inplace=True)

print(f"Shape of the data after cleaning: {data.shape}")

In [None]:
def pub_rec(number):
    if number == 0.0:
        return 0
    else:
        return 1
    
def mort_acc(number):
    if number == 0.0:
        return 0
    elif number >= 1.0:
        return 1
    else:
        return number
    
def pub_rec_bankruptcies(number):
    if number == 0.0:
        return 0
    elif number >= 1.0:
        return 1
    else:
        return number
    
data['pub_rec'] = data.pub_rec.apply(pub_rec)
data['mort_acc'] = data.mort_acc.apply(mort_acc)
data['pub_rec_bankruptcies'] = data.pub_rec_bankruptcies.apply(pub_rec_bankruptcies)
data['loan_status'] = data.loan_status.map({'Fully Paid':0, 'Charged Off':1})
data.loc[(data.home_ownership == 'ANY') | (data.home_ownership == 'NONE'), 'home_ownership'] = 'OTHER'  

In [None]:
dummies = ['sub_grade', 'verification_status', 'purpose', 'initial_list_status', 
           'application_type', 'home_ownership']

data = pd.get_dummies(data, columns=dummies, drop_first=True)

In [None]:
# loan_status was mapped above: 0 = Fully Paid, 1 = Charged Off
w_p = data.loan_status.value_counts()[0] / data.shape[0]
w_n = data.loan_status.value_counts()[1] / data.shape[0]

print(f"Share of class 0 (Fully Paid): {w_p}")
print(f"Share of class 1 (Charged Off): {w_n}")

In [None]:
X = data.drop('loan_status', axis=1)
y = data.loan_status

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

## 1. K-Fold

In [None]:
import xgboost as xgb

from sklearn.model_selection import KFold, RepeatedKFold, StratifiedKFold, cross_validate, GridSearchCV

In [None]:
xgb_clf = xgb.XGBClassifier()

kf = KFold(n_splits=5, shuffle=True, random_state=42)

clf = cross_validate(xgb_clf, X_train, y_train, scoring='accuracy', return_train_score=True, cv=kf)

In [None]:
clf['test_score']

In [None]:
clf['train_score']

In [None]:
print(f"Mean train accuracy: {np.mean(clf['train_score']):.4f} +/- {np.std(clf['train_score']):.4f}")
print(f"Mean test accuracy: {np.mean(clf['test_score']):.4f} +/- {np.std(clf['test_score']):.4f}")

## 2. Repeated K-Fold

In [None]:
xgb_clf = xgb.XGBClassifier()

rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

print(f"We expect K * n performance metrics: {5 * 10}")
clf = cross_validate(xgb_clf, X_train, y_train, scoring='accuracy', return_train_score=True, cv=rkf)
print(f"Number of metrics obtained: {len(clf['test_score'])}")

In [None]:
clf['test_score']

In [None]:
clf['train_score']

In [None]:
print(f"Mean train accuracy: {np.mean(clf['train_score']):.4f} +/- {np.std(clf['train_score']):.4f}")
print(f"Mean test accuracy: {np.mean(clf['test_score']):.4f} +/- {np.std(clf['test_score']):.4f}")

## 3. Stratified K-Fold Cross-Validation

In [None]:
xgb_clf = xgb.XGBClassifier()

# shuffle=True is required when random_state is set
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

clf = cross_validate(xgb_clf, X_train, y_train, scoring='accuracy', return_train_score=True, cv=skf)
print(f"Number of metrics obtained: {len(clf['test_score'])}")

In [None]:
clf['test_score']

In [None]:
clf['train_score']

In [None]:
print(f"Mean train accuracy: {np.mean(clf['train_score']):.4f} +/- {np.std(clf['train_score']):.4f}")
print(f"Mean test accuracy: {np.mean(clf['test_score']):.4f} +/- {np.std(clf['test_score']):.4f}")
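
To see what stratification changes, the sketch below compares the share of positive labels in each validation fold for `KFold` and `StratifiedKFold`. The label vector is a hypothetical imbalanced example (80 negatives, 20 positives), not the loan data.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 80 negatives and 20 positives
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((100, 1))  # feature values do not affect how the folds are built


def positive_rate_per_fold(cv):
    """Return the share of positive labels in each validation fold."""
    return [y_demo[test_idx].mean() for _, test_idx in cv.split(X_demo, y_demo)]


kf_rates = positive_rate_per_fold(KFold(n_splits=5, shuffle=True, random_state=42))
skf_rates = positive_rate_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

print("KFold:          ", kf_rates)
print("StratifiedKFold:", skf_rates)
```

With `StratifiedKFold`, every fold keeps the overall 20% positive rate, while plain `KFold` lets it vary from fold to fold.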

# Hyperparameter Optimization

## Grid Search

>- Exhaustive search through a specified subset of hyperparameters of a learning algorithm.
>- Examines all possible combinations of the specified hyperparameters (Cartesian product of hyperparameters).
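
The Cartesian product can be inspected directly with scikit-learn's `ParameterGrid`. The small grid below is an assumption made for illustration; it is not the grid used for the search in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Small illustrative grid: 2 values x 3 values = 6 combinations
small_grid = {"max_depth": [2, 5], "learning_rate": [0.01, 0.1, 0.5]}

combinations = list(ParameterGrid(small_grid))
print(f"Number of combinations: {len(combinations)}")
for combination in combinations:
    print(combination)
```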

### Limitations
>- Curse of dimensionality: possible combinations grow exponentially with the number of hyperparameters.
>- Computationally expensive.
>- Hyperparameter values are determined manually.
>- Not ideal for continuous hyperparameters.
>- Does not explore the entire hyperparameter space (doing so is not feasible).
>- It often performs worse than other searches (for models with complex hyperparameter spaces).

### Advantages
>- Works well for models with simpler hyperparameter spaces.
>- It can be parallelized.

Grid Search is usually the most expensive method in terms of total computation time. However, if run in parallel, it is fast in terms of wall-clock time. Sometimes, we run a small grid, determine where the optimum lies, and then expand the grid in that direction.

In [None]:
xgb_clf = xgb.XGBClassifier()

param_grid = dict(
    n_estimators= [100, 500, 1000], 
    max_depth= [2, 5, 10, 15],
    learning_rate= [0.01, 0.1, 0.5, 0.9],
    min_child_weight= [1, 2, 5], 
#     booster= ['gbtree', 'gblinear'], 
#     base_score= [0.25, 0.5, 0.75, 0.99]
)

# Number of combinations = product of the number of values per hyperparameter
hyperparameters_comb = 1
for values in param_grid.values():
    hyperparameters_comb *= len(values)
print(f"Number of hyperparam combinations: {hyperparameters_comb}")

search = GridSearchCV(xgb_clf, param_grid=param_grid, scoring='roc_auc', cv=5, n_jobs=-1, verbose=1)
search.fit(X_train, y_train)
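
Once a search has been fitted, its results can be inspected through `best_params_`, `best_score_` and `cv_results_`. The sketch below is self-contained and runs quickly because it assumes a small decision tree, a tiny grid and synthetic data instead of the XGBoost search above; the same attributes are available on any fitted `GridSearchCV`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a tiny grid, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
demo_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=demo_grid,
    scoring="roc_auc",
    cv=5,
)
demo_search.fit(X_demo, y_demo)

print(f"Best hyperparameters: {demo_search.best_params_}")
print(f"Best mean CV ROC AUC: {demo_search.best_score_:.4f}")

# One row per combination, ordered from best to worst
results = pd.DataFrame(demo_search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
```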

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier()

## Random Search

>- Hyperparameter values are selected by independent random draws from a distribution (for example, uniform) defined over the hyperparameter space. Random Search therefore samples combinations of hyperparameter values at random from all the possible combinations in that space.
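
The random draws can be previewed with scikit-learn's `ParameterSampler`. The distributions and ranges below are assumptions chosen for illustration.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# Illustrative distributions; names and ranges are assumptions
distributions = {
    "max_depth": randint(2, 16),           # integers in [2, 16)
    "learning_rate": uniform(0.01, 0.89),  # floats in [0.01, 0.90]
}

# Draw 5 independent combinations from the space
samples = list(ParameterSampler(distributions, n_iter=5, random_state=42))
for sample in samples:
    print(sample)
```

Unlike a grid, the sampled learning rates are not restricted to a handful of preset values.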

---

## Random Search vs Grid Search
>- Some hyperparameters affect performance a lot and others don't (Low Effective Dimension).

|      Random Search                                                     |   Grid Search                                  |
|:-----------------------------------------------------------------------|-----------------------------------------------:|
| Allows the exploration of more values along important hyperparameters  | Wastes time exploring non-important dimensions |
| Selects values from a distribution of hyperparameter values            | Hyperparameter values are defined manually     |
| Good for continuous hyperparameters                                    | Good for discrete hyperparameters              |

---

## Considerations
>- We choose a (computational) budget independently of the number of hyperparameters and possible values.
>- Adding hyperparameters that do not influence the performance does not decrease the efficiency of the search (if enough iterations are allowed).
>- It is important to specify a continuous distribution for continuous hyperparameters to take full advantage of the randomization.
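
These considerations map onto `RandomizedSearchCV` roughly as follows: `n_iter` is the budget, and `param_distributions` accepts continuous distributions. The sketch below assumes a scikit-learn `GradientBoostingClassifier`, synthetic data and a log-uniform learning rate, all chosen for illustration rather than taken from this notebook's pipeline.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative distributions; the ranges are assumptions
param_distributions = {
    "n_estimators": randint(20, 100),        # discrete
    "max_depth": randint(2, 6),              # discrete
    "learning_rate": loguniform(1e-2, 1e0),  # continuous, sampled on a log scale
}

random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=8,  # the budget: 8 sampled combinations, whatever the size of the space
    scoring="roc_auc",
    cv=3,
    random_state=42,
)
random_search.fit(X_demo, y_demo)

print(f"Best hyperparameters: {random_search.best_params_}")
print(f"Best mean CV ROC AUC: {random_search.best_score_:.4f}")
```

Adding another entry to `param_distributions` leaves the number of model fits unchanged at `n_iter` times the number of folds.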