# Hyperparameter Tuning


## Review: Cross-validation

[Cross-validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold) is a technique for evaluating a machine learning model on different subsets of the data. It helps to detect overfitting and underfitting, and to estimate the generalization error of the model.

One common method of cross-validation is **k-fold cross-validation**, where the data is split into k parts, or folds, of approximately equal size. The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, each time using a different fold as the test set, so every sample is used for testing exactly once.
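
The splitting itself can be illustrated on a toy array before applying it to a real dataset. The following is a minimal sketch; the names `X_toy`, `kf_toy`, and `test_counts` are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples, 1 feature (values are arbitrary placeholders)
X_toy = np.arange(10).reshape(-1, 1)

kf_toy = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how often each sample lands in a test fold
test_counts = np.zeros(len(X_toy), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kf_toy.split(X_toy), start=1):
    test_counts[test_idx] += 1
    print(f"Fold {fold}: train indices = {train_idx}, test indices = {test_idx}")

# Each sample should appear in exactly one test fold
print(test_counts)
```

With 10 samples and 5 folds, each iteration trains on 8 samples and tests on the remaining 2.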

<font color='Blue'><b>Example</b></font>. This example uses the Auto MPG dataset retrieved from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/dataset/9/auto+mpg).

In [None]:
try:
    from ucimlrepo import fetch_ucirepo
except ImportError:
    !pip3 install -U ucimlrepo
    from ucimlrepo import fetch_ucirepo
import numpy as np
import pandas as pd

# fetch dataset
auto_mpg = fetch_ucirepo(name = 'Auto MPG')

# data (as pandas dataframes)
X = auto_mpg.data.features
y = auto_mpg.data.targets

# drop rows with missing values from X
X = X.dropna(axis=0, how='any')

# align X and y by index
X, y = X.align(y, join='inner', axis=0)
# We could also use sklearn.preprocessing.OneHotEncoder
# and fit it on the train set and then apply it on train and test set.
# the outcome would be similar to what we are doing here.
# In Auto MPG, origin 2 denotes Europe; assign() avoids modifying a slice in place
X = X.assign(origin=X['origin'].replace({1: 'USA', 2: 'Europe', 3: 'Japan'}))
X = pd.get_dummies(X, dtype= 'int16')

# ln(mpg)
y = np.log(y['mpg'])
y.name = 'ln(mpg)'
print('X:')
display(X)
print('\ny:')
print(y)
print('\nInfo:')
X.info()

In [None]:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import sem

# Initialize KFold cross-validator
random_state = 0
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state = random_state)

# Lists to store train and test scores for each fold
train_r2_scores, test_r2_scores, train_MSE_scores, test_MSE_scores = [], [], [], []

rfr = RandomForestRegressor(n_estimators = 100, n_jobs = -1, random_state = random_state)

def _Line(n = 80):
    print(n * '_')

def print_bold(txt):
    _left = "\033[1;34m"
    _right = "\033[0m"
    print(_left + txt + _right)

_Line()

# Perform Cross-Validation
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]  # Extract training and testing subsets by index
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    rfr.fit(X_train, y_train)  # Train the random forest regressor

    # train
    y_train_pred = rfr.predict(X_train)
    train_r2_scores.append(metrics.r2_score(y_train, y_train_pred))
    train_MSE_scores.append(metrics.mean_squared_error(y_train, y_train_pred))

    # test
    y_test_pred = rfr.predict(X_test)
    test_r2_scores.append(metrics.r2_score(y_test, y_test_pred))
    test_MSE_scores.append(metrics.mean_squared_error(y_test, y_test_pred))

    #  Print the Train and Test Scores for each fold
    print_bold(f'Fold {fold + 1}:')
    train_proportion = 100*len(train_idx)/len(X)
    test_proportion = 100*len(test_idx)/len(X)
    print(f"\tTrain Set is {train_proportion:.2f}% of the dataset, and Test Set is {test_proportion:.2f}% of the dataset.")
    print(f"\tTrain R-Squared Score = {train_r2_scores[fold]:.5f}, Test R-Squared Score = {test_r2_scores[fold]:.5f}")
    print(f"\tTrain MSE Score = {train_MSE_scores[fold]:.5f}, Test MSE Score= {test_MSE_scores[fold]:.5f}")


df_scores = pd.DataFrame({'Fold': np.arange(1, len(train_r2_scores)+1),
                          'Train R-Squared Score':train_r2_scores,
                          'Test R-Squared Score':test_r2_scores,
                          'Train MSE Score':train_MSE_scores,
                          'Test MSE Score':test_MSE_scores,
                          }).set_index('Fold')

display(df_scores)

_Line()
print_bold('R-Squared Score:')
print(f"\tMean Train R-Squared Score: {np.mean(train_r2_scores):.5f} ± {sem(train_r2_scores, ddof = 0):.5f}")
print(f"\tMean Test R-Squared Score: {np.mean(test_r2_scores):.5f} ± {sem(test_r2_scores, ddof = 0):.5f}")
print_bold('MSE Score:')
print(f"\tMean Train MSE: {np.mean(train_MSE_scores):.5f} ± {sem(train_MSE_scores, ddof = 0):.5f}")
print(f"\tMean Test MSE: {np.mean(test_MSE_scores):.5f} ± {sem(test_MSE_scores, ddof = 0):.5f}")
_Line()

### Standard Error

The **standard error (SE)** measures the precision of an estimated quantity, such as a mean or a coefficient in a linear regression. It describes how much the estimate would typically vary if we repeatedly drew new samples from the same population.

The standard error (SE) is typically calculated using this formula:

\begin{equation} \text{SE} = \frac{s}{\sqrt{n}} \end{equation}

Where:
- $ s $ represents the sample standard deviation, which measures the variability of data points within the sample.
- $ n $ is the number of data points (sample size).

For example, the standard error in Mean Train R-Squared Score: 0.98481 ± 0.00055 can be calculated as follows. Here $ n $ is the number of folds, and `ddof = 0` is used to match the `sem` calls above:

In [None]:
print(f'Standard Error = {(np.std(train_r2_scores, ddof = 0)/np.sqrt(len(train_r2_scores))):.5f}')
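
The value of `ddof` changes the result slightly, because it controls whether the variance is divided by $n$ or by $n - 1$. The sketch below compares the two conventions; the `scores` array is made up for illustration and is not output from the model above.

```python
import numpy as np
from scipy.stats import sem

# Hypothetical per-fold scores, made up for illustration only
scores = np.array([0.90, 0.92, 0.88, 0.91, 0.89])
n = len(scores)

se_population = np.std(scores, ddof=0) / np.sqrt(n)  # variance divided by n
se_sample = np.std(scores, ddof=1) / np.sqrt(n)      # variance divided by n - 1

# scipy.stats.sem defaults to ddof=1; passing ddof=0 matches the cells above
print(f"ddof=0: {se_population:.5f} (sem: {sem(scores, ddof=0):.5f})")
print(f"ddof=1: {se_sample:.5f} (sem: {sem(scores):.5f})")
```

With only a handful of folds, the `ddof = 1` estimate is noticeably larger, so it is worth stating which convention a reported ± value uses.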

### Advantages and Disadvantages of Cross Validation

- **Advantages:**
    * It helps to detect overfitting, which occurs when a model fits the training data too closely and performs poorly on new, unseen data.
    * It provides a more realistic estimate of the model’s generalization performance, i.e., its ability to perform well on new, unseen data.
    * It can be used to compare different models and select the one that performs the best on average.
- **Disadvantages:**
    * It is computationally expensive, as it requires training and testing the model multiple times.
    * It can introduce variability in the results, depending on how the data is partitioned and shuffled.
    * It may not be suitable for some types of data, such as time series or spatial data, where the order or location of the data points matters.
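
For ordered data such as time series, scikit-learn provides `TimeSeriesSplit`, which keeps every training index earlier than every test index. This is a minimal sketch on placeholder data (`X_ts` is an illustrative name, not a dataset from this notebook):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered placeholder observations
X_ts = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
ts_splits = list(tscv.split(X_ts))

for fold, (train_idx, test_idx) in enumerate(ts_splits, start=1):
    # The training window grows, and the test window always lies after it
    print(f"Fold {fold}: train = {train_idx}, test = {test_idx}")
```

Unlike shuffled k-fold, no future observation is ever used to predict a past one.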

## Stratified Cross-Validation

[Stratified cross-validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) is a method of evaluating the performance of a machine learning model on unseen data. It is similar to regular cross-validation, but it ensures that each fold or subset of the data has approximately the same proportion of classes as the original data. This makes the per-fold scores more comparable and the overall performance estimate less biased.

Stratified cross-validation is especially useful for classification problems, where the outcome variable has two or more categories, and in particular when the classes are imbalanced. For example, if we want to predict whether a person has a disease or not, we need to make sure that the training and testing sets have the same ratio of positive and negative cases. Otherwise, some folds may contain very few (or no) positive cases, which makes the scores on those folds unreliable.

There are different types of stratified cross-validation, such as stratified k-fold cross-validation, stratified repeated k-fold cross-validation, and stratified shuffle-split cross-validation. They differ in how they split the data into folds and how many times they repeat the process.
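
The sketch below contrasts plain k-fold with the three stratified splitters on imbalanced toy labels. The data (`X_toy`, `y_toy`) and the split settings are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import (KFold, StratifiedKFold,
                                     RepeatedStratifiedKFold,
                                     StratifiedShuffleSplit)

# Imbalanced toy labels: 90 negatives and 10 positives (made-up data)
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features do not affect the splitting

splitters = {
    'KFold (not stratified)': KFold(n_splits=5, shuffle=True, random_state=0),
    'StratifiedKFold': StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    'RepeatedStratifiedKFold': RepeatedStratifiedKFold(n_splits=5, n_repeats=2,
                                                       random_state=0),
    'StratifiedShuffleSplit': StratifiedShuffleSplit(n_splits=3, test_size=0.2,
                                                     random_state=0),
}

# Fraction of positive cases in each test fold, per splitter
positive_rates = {}
for name, splitter in splitters.items():
    positive_rates[name] = [y_toy[test_idx].mean()
                            for _, test_idx in splitter.split(X_toy, y_toy)]
    print(f"{name}: {len(positive_rates[name])} splits, "
          f"positive rate per test fold = {np.round(positive_rates[name], 2)}")
```

The stratified splitters keep the positive rate in each test fold close to the overall 10%, whereas plain k-fold gives no such guarantee.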

In [None]:
import numpy as np
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

colors = ["#f5645a", "#0096ff", '#B2FF66']
edge_colors = ['#8A0002', '#2e658c', '#6A993D']
markers = ['o', '*', 's']
# Generate synthetic data
X, y = make_blobs(n_samples = [500, 300, 200],
                  centers=[[0, 0], [2, 2], [4, 4]],
                  n_features=2,
                  random_state=0, cluster_std=[1.0, 1, .6])

# Create a scatter plot and a class-distribution bar chart using Matplotlib
fig, ax = plt.subplots(1, 2, figsize=(10, 7), gridspec_kw={'width_ratios': [8, 2]})

for num in np.unique(y):
    ax[0].scatter(X[:, 0][y == num], X[:, 1][y == num], c=colors[num],
               s=40, ec=edge_colors[num], marker=markers[num], label=str(num))

ax[0].grid(True)
ax[0].legend(title='Class', fontsize=14)
ax[0].set(xlim = [-4, 6], ylim = [-4, 6], xlabel = 'Feature 1', ylabel = 'Feature 2')
ax[0].set_title('Synthetic Dataset', weight = 'bold', fontsize = 16, y = 1.02)

# np.unique returns the class labels first and their counts second
bar_labels, bar_heights = np.unique(y, return_counts=True)
bars = ax[1].bar(bar_labels, bar_heights, color=colors, edgecolor='k')

# Add xticks with labels
ax[1].set_xticks(bar_labels)
ax[1].set_xticklabels(['0', '1', '2'])

ax[1].grid(which='major', axis='y')
ax[1].set_title('Class Distribution', weight='bold', fontsize=16, y = 1.02)

# Add labels for bar heights inside each bar
for bar in bars:
    height = bar.get_height()
    Percentage = 100*height/X.shape[0]
    ax[1].annotate(f'{Percentage:.1f}%', xy=(bar.get_x() + bar.get_width() / 2, height),
                   xytext=(0, 3), textcoords='offset points', ha='center', fontsize=12)

plt.tight_layout()

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics
from scipy.stats import sem
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Function to print a line of underscores for separation
def _Line(n = 60, c = '_'):
    print(n * c)

def print_bold(txt, c = 31):
    print(f"\033[1;{c}m" + txt + "\033[0m")

# Create classifier instances
models = [RandomForestClassifier(),
          GradientBoostingClassifier()]
model_names = ['Random Forest Classifier', 'Gradient Boosting Classifier']
model_alphs = 'ab'

# Loop through each model
for model, name, alph in zip(models, model_names, model_alphs):
    _Line(n = 80, c = '=')
    print_bold(f'({alph}) {name}', c = 34)
    _Line(n = 80, c = '=')

    # Initialize StratifiedKFold cross-validator
    n_splits = 5
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    # With 5 folds, each split is 80% train and 20% test

    # Lists to store train and test scores for each fold
    train_acc_scores, test_acc_scores, train_f1_scores, test_f1_scores = [], [], [], []
    train_class_proportions, test_class_proportions = [], []

    # Perform Cross-Validation
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), 1):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        model.fit(X_train, y_train)

        # Calculate class proportions for train and test sets
        train_class_proportions.append([np.mean(y_train == cls) for cls in np.unique(y)])
        test_class_proportions.append([np.mean(y_test == cls) for cls in np.unique(y)])

        # train
        y_train_pred = model.predict(X_train)
        train_acc_scores.append(metrics.accuracy_score(y_train, y_train_pred))
        train_f1_scores.append(metrics.f1_score(y_train, y_train_pred, average = 'weighted'))

        # test
        y_test_pred = model.predict(X_test)
        test_acc_scores.append(metrics.accuracy_score(y_test, y_test_pred))
        test_f1_scores.append(metrics.f1_score(y_test, y_test_pred, average = 'weighted'))

    _Line()
    #  Print the Train and Test Scores for each fold
    for fold in range(n_splits):
        print_bold(f'Fold {fold + 1}:')
        print(f"\tTrain Class Proportions: {train_class_proportions[fold]}*{len(y_train)}")
        print(f"\tTest Class Proportions: {test_class_proportions[fold]}*{len(y_test)}")
        print(f"\tTrain Accuracy Score = {train_acc_scores[fold]:.5f}, Test Accuracy Score = {test_acc_scores[fold]:.5f}")
        print(f"\tTrain F1 Score (weighted) = {train_f1_scores[fold]:.5f}, Test F1 Score (weighted)= {test_f1_scores[fold]:.5f}")

    df_scores = pd.DataFrame({'Fold': np.arange(1, len(train_acc_scores)+1),
                          'Train Accuracy Score':train_acc_scores,
                          'Test Accuracy Score':test_acc_scores,
                          'Train F1 Score (weighted)':train_f1_scores,
                          'Test F1 Score (weighted)':test_f1_scores,
                          }).set_index('Fold')
    display(df_scores)

    _Line()
    print_bold('Accuracy Score:')
    print(f"\tMean Train Accuracy Score: {np.mean(train_acc_scores):.5f} ± {sem(train_acc_scores, ddof = 0):.5f}")
    print(f"\tMean Test Accuracy Score: {np.mean(test_acc_scores):.5f} ± {sem(test_acc_scores, ddof = 0):.5f}")
    print_bold('F1 Score:')
    print(f"\tMean Train F1 Score (weighted): {np.mean(train_f1_scores):.5f} ± {sem(train_f1_scores, ddof = 0):.5f}")
    print(f"\tMean Test F1 Score (weighted): {np.mean(test_f1_scores):.5f} ± {sem(test_f1_scores, ddof = 0):.5f}")
    _Line()

The gap between the train and test scores does look like overfitting. Overfitting is when a model performs well on the training data, but noticeably worse on the test data. This means that the model has learned the specific patterns and noise in the training data, but fails to generalize to new and unseen data, which leads to unreliable predictions.

There are several ways to prevent or reduce overfitting in gradient-boosting classifiers, such as:

- Limiting the number of trees in the model. More trees can increase the complexity and variance of the model, leading to overfitting. We can use the `n_estimators` parameter to control the number of trees.

- Lowering the learning rate. A smaller learning rate shrinks the contribution of each tree, making the model more robust to noise and outliers. We can use the `learning_rate` parameter to adjust it.

- Tuning other hyperparameters, such as `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_features`, and `min_impurity_decrease`. These parameters control the size and complexity of the trees and the minimum impurity decrease required to make a further split (the role played by `gamma` in XGBoost).

- Using a cross-validated grid search (for example, `GridSearchCV`) to find good values for these parameters.
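
A cross-validated grid search over a few of these parameters might look like the following. This is a minimal sketch, not a tuned setup: the small dataset (`X_demo`, `y_demo`), the parameter grid, and the scoring choice are all assumptions made to keep the example fast.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small synthetic dataset (an assumption for speed; smaller than the one above)
X_demo, y_demo = make_blobs(n_samples=[150, 90, 60],
                            centers=[[0, 0], [2, 2], [4, 4]],
                            n_features=2, cluster_std=[1.0, 1.0, 0.6],
                            random_state=0)

# Illustrative grid; real searches usually cover wider ranges
param_grid = {'n_estimators': [25, 50],
              'learning_rate': [0.05, 0.1],
              'max_depth': [1, 2]}

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid=param_grid,
                      scoring='f1_weighted',
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
                      return_train_score=True)
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print(f'Best mean test F1 (weighted): {search.best_score_:.5f}')
```

Setting `return_train_score=True` stores the training scores in `search.cv_results_`, so the train/test gap can be inspected for each parameter combination.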