### Different Cross Validation Techniques

Stratified Cross-Validation

Stratified K-fold is a variation of K-fold cross-validation that keeps the class proportions in each fold approximately equal to those of the original dataset. This matters most for imbalanced datasets, where a plain K-fold split can leave a fold with very few samples of a minority class.

For example, if the original dataset has 70% class A and 30% class B samples, then each fold in stratified K-fold will also contain roughly 70% class A and 30% class B.
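
The proportions can be checked directly. The following is a minimal sketch on a made-up label vector (70 samples of class 0 standing in for class A and 30 of class 1 standing in for class B). The data and the names `X_demo`, `y_demo` and `fold_shares` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 70 samples of class 0 ("A") and 30 of class 1 ("B")
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features; only the labels drive the split

skf = StratifiedKFold(n_splits=5)

# Share of class 1 ("B") in each test fold
fold_shares = [y_demo[test_idx].mean() for _, test_idx in skf.split(X_demo, y_demo)]
print(fold_shares)
```

Because 70 and 30 both divide evenly by 5, every test fold here holds 14 samples of class A and 6 of class B. With class counts that do not divide evenly, the proportions are only approximately preserved.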

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

In [2]:
# Load the dataset
data = load_breast_cancer()

X, y = data.data, data.target

In [3]:
# Define the model
rf_model = RandomForestClassifier()

In [None]:
# Use StratifiedKFold with 5 splits
stratified_kf = StratifiedKFold(n_splits=5)

In [5]:
scores = cross_val_score(rf_model, X, y, cv=stratified_kf)

print("StratifiedCV Scores: ", scores)
print("Mean Score: ", scores.mean())

StratifiedCV Scores:  [0.92105263 0.94736842 0.98245614 0.98245614 0.98230088]
Mean Score:  0.9631268436578171


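Group KFold

Group K-fold is used when the samples belong to groups (for example, several measurements taken from the same patient) and all samples of a group must stay in the same fold. No group is ever split between the training and test sets of a fold, which prevents leakage between related samples.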


In [7]:
from sklearn.datasets import load_iris
from sklearn.model_selection import GroupKFold, cross_val_score
import numpy as np

In [7]:
# Load the dataset
data = load_iris()
X, y = data.data, data.target

In [8]:
# Create artificial groups
# Let's say every 10 consecutive samples belong to the same "group"
groups = np.array([i // 10 for i in range(len(y))])

In [9]:
model = RandomForestClassifier()

In [10]:
group_kfold = GroupKFold(n_splits=5)

# Perform cross-validation
scores = cross_val_score(model, X, y, groups=groups, cv=group_kfold)

# Results
print("Cross-Validation Scores: ", scores)
print("Mean Score: ", scores.mean())

Cross-Validation Scores:  [1.         0.9        0.9        0.96666667 0.96666667]
Mean Score:  0.9466666666666667
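
To confirm that no group is split between training and test data, the group labels on both sides of each split can be compared. This is a small sketch that reuses the grouping scheme from above (every 10 consecutive samples form one group). The placeholder feature matrix and the names `X_demo`, `groups_demo` and `shared_groups_per_fold` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Same grouping scheme as above: every 10 consecutive samples form one group
n_samples = 150
groups_demo = np.arange(n_samples) // 10
X_demo = np.zeros((n_samples, 1))  # placeholder features; only the indices matter here

gkf = GroupKFold(n_splits=5)

# Count the groups that appear in both the training and the test set of each fold
shared_groups_per_fold = []
for train_idx, test_idx in gkf.split(X_demo, groups=groups_demo):
    shared = set(groups_demo[train_idx]) & set(groups_demo[test_idx])
    shared_groups_per_fold.append(len(shared))

print(shared_groups_per_fold)
```

Every entry of the printed list is 0, meaning no group is shared between the two sides of any fold.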


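Shuffle Split

ShuffleSplit draws a chosen number of independent random train/test splits. Unlike K-fold, the number of splits is set independently of the test size, and the test sets of different splits may overlap.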

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score

In [2]:
data = load_iris()

X, y = data.data, data.target

In [3]:
shuffle_split = ShuffleSplit(n_splits=5, test_size=0.3, random_state=42)

In [5]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000, random_state=42)

In [None]:
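# Manually iterate over the splits, fitting and scoring the model on each one
split_results = []
for train_index, test_index in shuffle_split.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    split_results.append(model.score(X_test, y_test))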

In [10]:
# cross_val_score clones the model and refits it on every split,
# so no separate model.fit call is needed here
score = cross_val_score(model, X, y, cv=shuffle_split)

# Results
print("Cross-Validation Scores: ", score)
print("Mean Score: ", score.mean())

Cross-Validation Scores:  [1.         1.         0.91111111 0.95555556 0.93333333]
Mean Score:  0.96
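
The sizes of the random splits, and how often each sample ends up in a test set, can be inspected without training a model. The sketch below uses a placeholder array with 150 rows (the size of the iris dataset). The names `X_demo`, `split_sizes` and `test_counts` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 150  # same size as the iris dataset
X_demo = np.zeros((n_samples, 1))  # placeholder features; only the indices matter here

ss = ShuffleSplit(n_splits=5, test_size=0.3, random_state=42)

test_counts = np.zeros(n_samples, dtype=int)  # how often each sample lands in a test set
split_sizes = []
for train_idx, test_idx in ss.split(X_demo):
    split_sizes.append((len(train_idx), len(test_idx)))
    test_counts[test_idx] += 1

print(split_sizes)
print("Samples tested more than once:", int((test_counts > 1).sum()))
print("Samples never tested:", int((test_counts == 0).sum()))
```

Each split holds out 45 of the 150 samples. Five such test sets contain 225 entries in total, so some samples must appear in more than one test set, which is the main difference from K-fold, where every sample is tested exactly once.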


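StratifiedGroupKFold

StratifiedGroupKFold combines stratification with grouping: it keeps every group within a single fold while trying to preserve the class proportions in each fold as far as the grouping allows.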

In [11]:
from sklearn.model_selection import StratifiedGroupKFold 
from sklearn.datasets import make_classification

In [12]:
X, y = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=42)


In [13]:
# Define Groups
groups = [i // 10 for i in range(100)]

In [14]:
# Define StratifiedGroupKFold with 5 folds
sgk_cv = StratifiedGroupKFold(n_splits=5)

In [16]:
for fold_idx, (train_idx, test_idx) in enumerate(sgk_cv.split(X, y, groups=groups)):
    print(f"Fold {fold_idx + 1}")
    print("Train:", train_idx)
    print("Test:", test_idx)
    print("------")

Fold 1
Train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
 92 93 94 95 96 97 98 99]
Test: [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69]
------
Fold 2
Train: [10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39 40 41 42 43
 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
 92 93 94 95 96 97 98 99]
Test: [ 0  1  2  3  4  5  6  7  8  9 20 21 22 23 24 25 26 27 28 29]
------
Fold 3
Train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 50 51 52 53 54 55 56 57
 58 59 60 61 62 63 64 65 66 67 68 69 80 81 82 83 84 85 86 87 88 89 90 91
 92 93 94 95 96 97 98 99]
Test: [40 41 42 43 44 45 46 47 48 49 70 71 72 73 74 75 76 77 78 79]
-----

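GroupShuffleSplit

GroupShuffleSplit is the grouped counterpart of ShuffleSplit: it draws random train/test splits in which all samples of a group fall on the same side of the split.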

In [17]:
from sklearn.model_selection import GroupShuffleSplit
import numpy as np



In [18]:
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
groups = [1,1,2,2,3,3,4,4,5,5]

In [None]:
# Define GroupShuffleSplit with 5 splits; test_size=0.3 is the proportion of groups
# (not samples) placed in the test set
gss = GroupShuffleSplit(n_splits=5, test_size=0.3, random_state=42)

In [22]:
for train_index, test_index in gss.split(X, y, groups=groups):
    print("Train indices:", train_index, "Test indices:", test_index)


Train indices: [0 1 4 5 6 7] Test indices: [2 3 8 9]
Train indices: [0 1 4 5 8 9] Test indices: [2 3 6 7]
Train indices: [4 5 6 7 8 9] Test indices: [0 1 2 3]
Train indices: [4 5 6 7 8 9] Test indices: [0 1 2 3]
Train indices: [2 3 6 7 8 9] Test indices: [0 1 4 5]
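
Each test set above contains 4 of the 10 samples even though `test_size=0.3`. This is because `GroupShuffleSplit` applies `test_size` to the groups: 30% of 5 groups is rounded up to 2 groups, each with 2 samples. A short sketch that counts the test groups per split (the names `X_demo`, `groups_demo`, `gss_demo` and `test_group_counts` are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X_demo = np.arange(20).reshape(10, 2)
groups_demo = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])

gss_demo = GroupShuffleSplit(n_splits=5, test_size=0.3, random_state=42)

test_group_counts = []
for train_idx, test_idx in gss_demo.split(X_demo, groups=groups_demo):
    train_groups = set(groups_demo[train_idx])
    test_groups = set(groups_demo[test_idx])
    assert not (train_groups & test_groups)  # no group is on both sides of a split
    test_group_counts.append(len(test_groups))

print(test_group_counts)
```

Every split places 2 of the 5 groups in the test set. As in the printed output above, the same groups can be drawn as the test set in more than one split.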
