<a href="https://colab.research.google.com/github/Chirag314/Practice/blob/main/Nested_CV_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import SVC



In [2]:
#load data
X,y=load_iris(return_X_y=True)

The outer loop splits the data into K folds, typically using k-fold cross-validation. Each fold serves as a hold-out validation set while the rest of the data forms the training set. The model is evaluated K times, each time using a different fold as the validation set. The final performance score is the average of the scores obtained from all the iterations.

In [3]:
from sklearn.model_selection import KFold

outer_folds=5

outer_cv=KFold(n_splits=outer_folds,shuffle=True,random_state=42)

outer_scores=[]

Inner Cross-Validation Loop (Hyperparameter Tuning)
The inner loop is used for hyperparameter tuning. It also uses k-fold cross-validation to select the best hyperparameters for the model. For each set of hyperparameters, the model is trained on the training set (from the outer loop) and evaluated on the validation set (from the outer loop). The best hyperparameters are determined based on the average performance across all the inner fold iterations.

In [4]:
param_grid={'C':[0.1,1,10],
            'kernel':['linear','rbf']}
#initialize inner cv grid search
inner_cv=KFold(n_splits=3,shuffle=True,random_state=42)

svm_clf=SVC()

grid_search=GridSearchCV(estimator=svm_clf,param_grid=param_grid,cv=inner_cv)

Perform nested CV: we loop through the outer folds and for each fold, we perform hyperparameter tuing using the inner CV. We then train the model on the training set(excluding the validation fold) and evaluate its performance on the validation set.

In [5]:
#Nested CV
for train_index,test_index in outer_cv.split(X):
  X_train,X_test=X[train_index],X[test_index]
  y_train,y_test=y[train_index],y[test_index]

  #hyperparameter tuning using inner CV
  grid_search.fit(X_train,y_train)

  # Get the best estimator with tuned hyperparamets
  best_svm_clf=grid_search.best_estimator_

  # valuate on validation set
  score=best_svm_clf.score(X_test,y_test)
  outer_scores.append(score)

# Final score
mean_score=np.mean(outer_scores)
print("NEsted CV mean accuracy:", mean_score)

NEsted CV mean accuracy: 0.9733333333333334


Combine nested CV with different seeds with same base learner

In [6]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X,y=load_iris(return_X_y=True)

outer_folds=5
outer_cv=KFold(n_splits=outer_folds, shuffle=True, random_state=42)
outer_socres=[]

seed_values=[123,234,345]

for trian_index,test_index in outer_cv.split(X):
  X_train,X_test=X[train_index],X[test_index]
  y_train,y_test=y[train_index],y[test_index]

  prediction=[]

  for seed in seed_values:
    base_learner=RandomForestClassifier(random_state=seed)

    base_learner.fit(X_train,y_train)

    y_pred=base_learner.predict(X_test)

    prediction.append(y_pred)

  ensemble_predictins=sum(prediction)/len(prediction)

  accuracy=np.mean(ensemble_predictins==y_test)

  outer_scores.append(accuracy)

mean_accuracy=np.mean(outer_scores)

print("Nested CV mean accuracy:",mean_accuracy)



Nested CV mean accuracy: 0.9833333333333334


Combine the nested cross-validation technique for model selection and hyperparameter tuning, with the ensemble of a single base learner using different seeds.

In [9]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.svm import SVC

X,y=load_iris(return_X_y=True)

outer_folds=5
outer_cv=KFold(n_splits=outer_folds,shuffle=True, random_state=42)
outer_scores=[]

#Define different models

models={
    'RandomForest':(RandomForestClassifier(),{
        'n_estimators':[50,100,150],
        'max_depth':[None,5,10],
        'min_samples_split':[2,5]
    } ),
    'SVM':(SVC(),{
        'C':[1,10,100],
        'kernel':['linear','rbf']
    }),
    'GradientBoosting':(GradientBoostingClassifier(),{
        'n_estimators':[50,100,150],
        'learning_rate':[0.1,0.01],
        'max_depth':[3,5]

    })
}

seed_values=[324,562,987]

for train_index,test_index in outer_cv.split(X):
  X_train,X_test=X[train_index],X[test_index]
  y_train,y_test=y[train_index],y[test_index]

  predictions=[]

  for model_name,(model, param_grid) in models.items():
    inner_cv=KFold(n_splits=3,shuffle=True, random_state=42)
    grid_search=GridSearchCV(estimator=model, param_grid=param_grid,cv=inner_cv)

    grid_search.fit(X_train,y_train)
    best_model=grid_search.best_estimator_

    model_predictions=[]

    for seed in seed_values:
      base_learner=best_model.__class__(**best_model.get_params())
      base_learner.set_params(random_state=seed)
      base_learner.fit(X_train,y_train)
      y_pred=base_learner.predict(X_test)
      model_predictions.append(y_pred)

    ensemble_predictions=sum(model_predictions)/len(model_predictions)
    predictions.append(ensemble_predictions)
  accuracies=[np.mean(pred==y_test) for pred in predictions]
  outer_scores.append(accuracies)

mean_scores=np.mean(outer_scores,axis=0)
model_names=list(models.keys())

for name,score in zip(model_names,mean_scores):
  print(f"{name} - Mean Accuracy : {score: .4f}")



RandomForest - Mean Accuracy :  0.9533
SVM - Mean Accuracy :  0.9667
GradientBoosting - Mean Accuracy :  0.9533
