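## Model Robustness at the Model Evaluation Stage: Nested Cross-Validation
To obtain an unbiased estimate of model performance while also tuning hyperparameters, we should use nested cross-validation. If the same folds are used both to select the hyperparameters and to report the score, the reported score is optimistically biased, because the selection has already adapted to those folds. Nested cross-validation avoids this by tuning in an inner loop and scoring in an outer loop, on data the tuning never saw.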

Here, using the grid search strategy available in sklearn (`GridSearchCV`), we implement a nested cross-validation evaluation to optimize the hyperparameters of a random forest classifier. For this small experiment we use the Diabetes dataset from sklearn, adapted to binary classification by thresholding its target.

### Importing Packages and Data Preparation

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Load the Diabetes dataset
diabetes_data = load_diabetes()
X = pd.DataFrame(diabetes_data.data, columns=diabetes_data.feature_names)
y = diabetes_data.target

# Set a threshold for binary classification (e.g., using the median of y)
threshold = np.median(y)
y_binary = (y > threshold).astype(int)  # 1 for high risk, 0 for low risk
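
Since accuracy is the metric used throughout, it is worth confirming that the median split gives roughly balanced classes. The following is a minimal sketch that rebuilds the target as above; the names `class_counts` and `majority_share` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Rebuild the binary target exactly as in the cell above
y = load_diabetes().target
y_binary = (y > np.median(y)).astype(int)

# Count samples per class; a median split should give roughly equal halves
class_counts = np.bincount(y_binary, minlength=2)

# Share of the larger class = accuracy of a predictor that always outputs it
majority_share = class_counts.max() / class_counts.sum()

print("Samples per class (low, high):", class_counts)
print("Majority-class share:", majority_share)
```

A majority share close to 0.5 means that an accuracy near 0.5 corresponds to guessing, which gives a baseline for reading the scores reported later.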

### Setting up the Hyperparameter Grid and the Classifier

In [2]:
random_state = 42
# Define the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the RandomForestClassifier
rf = RandomForestClassifier(random_state=random_state)
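
To get a feel for the cost of an exhaustive search over this grid, the sketch below counts the candidate combinations with `ParameterGrid`. The fold counts (`inner_folds`, `outer_folds`) are assumptions set to 5 to match the nested setup used in the cells that follow, and the extra fit per outer fold assumes `GridSearchCV` keeps its default `refit=True`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined above
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Number of distinct hyperparameter combinations in the grid
n_candidates = len(ParameterGrid(param_grid))

# Assumed fold counts for the nested setup
inner_folds = 5
outer_folds = 5

# Each outer fold runs a full inner search plus one refit of the best candidate
fits_per_outer_fold = n_candidates * inner_folds + 1
total_fits = fits_per_outer_fold * outer_folds

print("Candidates in grid:", n_candidates)
print("Total model fits in nested CV:", total_fits)
```

The total grows multiplicatively with every value added to the grid, which is the main reason an exhaustive nested search becomes slow.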

### Setting up the Nested Cross-Validation

In [3]:
# Inner Loop: Set up GridSearchCV with inner cross-validation
grid_search = GridSearchCV(estimator=rf, 
                           param_grid=param_grid, 
                           cv=5, 
                           scoring='accuracy', 
                           n_jobs=-1)

# Outer Loop: Perform nested cross-validation with outer cross-validation
nested_cv_scores = cross_val_score(grid_search, X, y_binary, cv=5, scoring='accuracy')
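
The two loops hidden inside `cross_val_score(grid_search, ...)` can be written out by hand, which makes it easier to see that each outer test fold is never used for tuning. This is only a sketch under stated assumptions: it uses a deliberately small grid (`small_grid`), 25 trees, 3 folds, and shuffled `StratifiedKFold` splits so that it runs quickly, so its scores will not match those of the cell above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Same data preparation as above, kept as NumPy arrays for index-based slicing
data = load_diabetes()
X, y = data.data, data.target
y_binary = (y > np.median(y)).astype(int)

# Reduced grid for illustration only (an assumption, not the grid used above)
small_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 4]}

outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

outer_scores, chosen_params = [], []
for train_idx, test_idx in outer_cv.split(X, y_binary):
    # Inner loop: tune hyperparameters using the outer training portion only
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=25, random_state=42),
        param_grid=small_grid,
        cv=inner_cv,
        scoring="accuracy",
    )
    search.fit(X[train_idx], y_binary[train_idx])

    # Outer loop: score the refitted best model on the held-out outer fold
    y_pred = search.predict(X[test_idx])
    outer_scores.append(accuracy_score(y_binary[test_idx], y_pred))
    chosen_params.append(search.best_params_)

print("Outer-fold accuracies:", outer_scores)
print("Best parameters per outer fold:", chosen_params)
```

Writing the loop explicitly also exposes the hyperparameters chosen in each outer fold. If they differ widely between folds, the tuning itself is unstable on this dataset.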

### Model Evaluation

In [4]:
# Calculate the mean and standard deviation of the nested cross-validation scores
mean_nested_cv_score = np.mean(nested_cv_scores)
std_nested_cv_score = np.std(nested_cv_scores)

print("Mean Accuracy across Nested Cross-Validation:", mean_nested_cv_score)
print("Standard Deviation of Accuracy across Nested Cross-Validation:", std_nested_cv_score)


Mean Accuracy across Nested Cross-Validation: 0.7148110316649643
Standard Deviation of Accuracy across Nested Cross-Validation: 0.03399686906691441


Use `RandomizedSearchCV` instead of `GridSearchCV` and compare the two in terms of computational time and final model accuracy.

In [5]:
import time
from sklearn.model_selection import RandomizedSearchCV

model_rf = RandomForestClassifier(random_state=random_state)

# GridSearchCV vs RandomizedSearchCV, comparing time and accuracy
start_time = time.time()
grid_search = GridSearchCV(estimator=model_rf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
nested_cv_scores_grid = cross_val_score(grid_search, X, y_binary, cv=5, scoring='accuracy')
end_time = time.time()
grid_search_time = end_time - start_time

start_time = time.time()
random_search = RandomizedSearchCV(estimator=model_rf, param_distributions=param_grid, n_iter=10, cv=5, scoring='accuracy', n_jobs=-1, random_state=random_state)
nested_cv_scores_random = cross_val_score(random_search, X, y_binary, cv=5, scoring='accuracy')
end_time = time.time()
random_search_time = end_time - start_time

print("GridSearchCV Time:", grid_search_time)
print("RandomizedSearchCV Time:", random_search_time)
print("Mean Accuracy GridSearchCV:", np.mean(nested_cv_scores_grid))
print("Standard Deviation of GridSearchCV:", np.std(nested_cv_scores_grid))
print("Mean Accuracy RandomizedSearchCV:", np.mean(nested_cv_scores_random))
print("Standard Deviation of RandomizedSearchCV:", np.std(nested_cv_scores_random))


GridSearchCV Time: 30.506365060806274
RandomizedSearchCV Time: 4.141875982284546
Mean Accuracy GridSearchCV: 0.7148110316649643
Standard Deviation of GridSearchCV: 0.03399686906691441
Mean Accuracy RandomizedSearchCV: 0.7102145045965271
Standard Deviation of RandomizedSearchCV: 0.042108095024247144
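
One way to read these numbers is to compare the speed-up with the accuracy gap, and the gap with the fold-to-fold spread. The sketch below only does arithmetic on the values printed above. The variable names are illustrative, and the timings depend on the machine, so the ratio is indicative rather than general.

```python
# Values copied from the output above
grid_time, random_time = 30.506365060806274, 4.141875982284546
grid_mean, grid_std = 0.7148110316649643, 0.03399686906691441
random_mean, random_std = 0.7102145045965271, 0.042108095024247144

# How many times faster the randomized search ran in this experiment
speedup = grid_time / random_time

# Difference in mean nested-CV accuracy between the two strategies
accuracy_gap = grid_mean - random_mean

# Is the gap smaller than the fold-to-fold standard deviation of either method?
gap_within_spread = accuracy_gap < min(grid_std, random_std)

print(f"Speed-up of RandomizedSearchCV: {speedup:.1f}x")
print(f"Accuracy gap (grid - random): {accuracy_gap:.4f}")
print("Gap smaller than fold-to-fold std:", gap_within_spread)
```

In this run the randomized search is several times faster, while the accuracy difference is well inside the variation between outer folds. The two strategies are therefore hard to tell apart on accuracy alone.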
