# Nested cross-validation 

Nested cross-validation is a technique used in machine learning to evaluate and tune models more robustly, especially when dealing with small datasets or when model performance is sensitive to variations in the training data. It's essentially a combination of two cross-validation loops: an outer loop and an inner loop.

**Outer Cross-Validation (Outer Loop)**: This is the outer loop of the process and is responsible for estimating the model's performance. It typically uses k-fold cross-validation, where the original dataset is divided into k subsets or folds. In each iteration, one fold is used as a validation set, and the remaining k-1 folds are used for training. The model is trained and evaluated k times, and the average performance metric (e.g., accuracy, F1-score) is calculated over these iterations. This gives you an estimate of how well the model performs on unseen data.

**Inner Cross-Validation (Inner Loop)**: Inside each iteration of the outer loop, there's another cross-validation loop. This inner loop is used for hyperparameter tuning or model selection. It's similar to the outer loop but focuses on selecting the best set of hyperparameters or model configuration. The inner loop also uses k-fold cross-validation but is applied to the training data from the outer loop. Different hyperparameter combinations or models are evaluated, and the best-performing combination is selected.

### key advantages of nested cross-validation

- Robust Performance Estimation: By using nested cross-validation, we obtain a more reliable estimate of our model's performance because it considers variations in both the training and validation data.

- Avoiding Data Leakage: Nested cross-validation helps prevent data leakage, which can occur when hyperparameter tuning or model selection is performed on the same data used for performance estimation. The inner loop ensures that model selection occurs on independent training and validation sets.

- Optimal Hyperparameter Tuning: It allows us to find the best hyperparameters or model configuration for our specific dataset while avoiding overfitting.

### Example workflow in  cross-validation:

**Outer Loop (Performance Estimation)**:

- Split the dataset into k folds.
- In each iteration:
    - Use k-1 folds for training.
    - Use the remaining fold for validation.
    - Calculate a performance metric (e.g., accuracy) on the validation set.
- Average the performance metrics from all iterations to estimate the model's overall performance.
    
**Inner Loop (Hyperparameter Tuning)**:

- Inside each iteration of the outer loop:
    - Split the training data from the outer loop into k folds.
    - In each inner iteration:
        - Use k-1 folds for training within the training data.
        - Use the remaining fold for validation within the training data.
        - Try different hyperparameter settings or model configurations.  
        - Calculate a performance metric on the inner validation set.
    - Choose the hyperparameters or model configuration that performed best on average across inner iterations.

Nested cross-validation provides a more robust and unbiased way to evaluate and tune models, ensuring that our final model's performance estimates are more trustworthy.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from hyperopt import fmin, tpe, hp, Trials, space_eval
from hyperopt.pyll.base import scope
from sklearn.datasets import load_breast_cancer  # Replace with your dataset
from sklearn.ensemble import RandomForestClassifier 

In [2]:
# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Define the hyperparameter space to search
space = {
    'max_depth': hp.choice('max_depth', [int(x) for x in range(2, 11)]),
    'min_samples_split': hp.uniform('min_samples_split', 0.1, 1.0),
    'min_samples_leaf': hp.uniform('min_samples_leaf', 0.1, 0.5),
}

In [3]:
# Define the outer cross-validation loop
outer_scores = []
outer_loop_log = {}
for _ in range(5):  # Number of outer loop iterations
    # Split the data into training and test sets for the outer loop
    X_train_outer, X_test_outer, y_train_outer, y_test_outer = train_test_split(X, y, test_size=0.3, random_state=42)
    
    def objective(params):
        # Create a Random Forest classifier with the given hyperparameters
        clf = RandomForestClassifier(**params)
        # Use cross-validation to evaluate the model
        scores = cross_val_score(clf, X_train_outer, y_train_outer, cv=5, scoring='roc_auc')
        # Return the negative mean accuracy (to maximize accuracy)
        return -np.mean(scores)

    # Optimize hyperparameters using Hyperopt (inner loop)
    trials = Trials()
    best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
    best_params = space_eval(space, best)
    print('best_params:', best_params)

    # Create a Random Forest classifier with the best hyperparameters
    clf = RandomForestClassifier(**best_params)
    # Train the final model on the training data for the outer loop
    clf.fit(X_train_outer, y_train_outer)
    # Evaluate the final model on the test set for the outer loop
    y_pred_outer = clf.predict_proba(X_test_outer)[:,1]
    test_auc = roc_auc_score(y_test_outer, y_pred_outer)
    print('test_auc in the outer loop', test_auc)
    outer_scores.append(test_auc)

# Calculate the mean and standard deviation of outer loop scores
mean_auc = np.mean(outer_scores)
std_auc = np.std(outer_scores)

print("Mean AUC: {:.3f}".format(mean_auc))
print("Standard Deviation: {:.4f}".format(std_auc))

  0%|          | 0/50 [00:00<?, ?trial/s, best loss=?]

100%|██████████| 50/50 [00:43<00:00,  1.16trial/s, best loss: -0.9840273047149894]
best_params: {'max_depth': 2, 'min_samples_leaf': 0.13753606671347776, 'min_samples_split': 0.4045267970130479}
test_auc in the outer loop 0.9917695473251029
100%|██████████| 50/50 [00:40<00:00,  1.23trial/s, best loss: -0.9817827820783485]
best_params: {'max_depth': 9, 'min_samples_leaf': 0.17861381870812446, 'min_samples_split': 0.4609184692006453}
test_auc in the outer loop 0.9952968841857731
100%|██████████| 50/50 [00:38<00:00,  1.31trial/s, best loss: -0.9828164203612479]
best_params: {'max_depth': 6, 'min_samples_leaf': 0.11679396979203922, 'min_samples_split': 0.19794846916048284}
test_auc in the outer loop 0.9972075249853026
100%|██████████| 50/50 [00:55<00:00,  1.11s/trial, best loss: -0.9825460004691532]
best_params: {'max_depth': 5, 'min_samples_leaf': 0.13198753586992315, 'min_samples_split': 0.37050727079099777}
test_auc in the outer loop 0.9970605526161082
100%|██████████| 50/50 [00:36<00:0

The result above is reassuring since we have multiple sets of hyperparameters that all perform. The very low standard deviation in performance across different outer loops suggests that the model is robust and not highly sensitive to the choice of hyperparameters.

### TODO

set up an outer_loop_log dict
* save outerloop test score, best_loss and best params
* we can compare, sort by test score or best_loss and decide on which best params to use
* easily calculate mean and std of test score across folds like we have now