## **GroupKFold**

**What it is:**  
- Splits the dataset into folds such that the same group is not represented in both testing and training sets.
- Groups are defined by a separate array that indicates which group each sample belongs to.

**When to use it:**  
- When your samples are related by groups (e.g., multiple measurements from the same subject) and you want to avoid leakage between train and test.

**Key points:**  
- Each group appears entirely in either the training or the testing set.
- The number of splits cannot exceed the number of distinct groups; `GroupKFold` raises a `ValueError` if `n_splits` is larger.
- Keeping groups intact matters most when the samples within a group are correlated or share common properties. Splitting such a group between train and test lets the model see close relatives of its test samples, which inflates the validation score.
- The `groups` array passed to `split` is what enforces this: if one sample from a group lands in the test set, every sample from that group does.


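```python
from sklearn.model_selection import GroupKFold
import numpy as np

# 10 samples with 2 features each; y is a placeholder target.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
# One group label per sample: 5 groups of 2 samples each.
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])

# With 5 groups and 5 splits, each fold holds out exactly one group.
gkf = GroupKFold(n_splits=5)
for train_index, test_index in gkf.split(X, y, groups):
    print("TRAIN:", train_index, "TEST:", test_index)
```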


```text
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8 9]
TRAIN: [0 1 2 3 4 5 8 9] TEST: [6 7]
TRAIN: [0 1 2 3 6 7 8 9] TEST: [4 5]
TRAIN: [0 1 4 5 6 7 8 9] TEST: [2 3]
TRAIN: [2 3 4 5 6 7 8 9] TEST: [0 1]
```
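
The key points can also be checked programmatically. The following is a minimal sketch, not part of the original notebook: the data, the unequal group sizes, and the variable names (`folds`, `too_many_splits_rejected`) are made up for illustration. It asserts that no group ever appears on both sides of a split, and it shows that asking for more splits than there are groups is rejected.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 12 samples from 4 groups of unequal size.
X = np.arange(24).reshape(12, 2)
y = np.arange(12)
groups = np.array([1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4])

gkf = GroupKFold(n_splits=3)
folds = list(gkf.split(X, y, groups))

for train_index, test_index in folds:
    train_groups = np.unique(groups[train_index])
    test_groups = np.unique(groups[test_index])
    # No group may appear in both the training and the test set.
    assert np.intersect1d(train_groups, test_groups).size == 0
    print("TEST groups:", test_groups.tolist(), "test size:", len(test_index))

# Requesting more splits than distinct groups raises a ValueError.
# split() is a generator, so list() is needed to trigger the check.
try:
    list(GroupKFold(n_splits=5).split(X, y, groups))
    too_many_splits_rejected = False
except ValueError:
    too_many_splits_rejected = True
print("n_splits=5 with 4 groups rejected:", too_many_splits_rejected)
```

Because the groups here differ in size, the test folds will not all contain the same number of samples; `GroupKFold` keeps groups whole rather than forcing equal fold sizes.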
