In [43]:
import pandas as pd
from sklearn.model_selection import GroupKFold, StratifiedKFold
from sklearn.metrics import roc_auc_score

A quick test of two K-fold cross-validation splitters offered in scikit-learn.

* `StratifiedKFold`: makes sure that, in each split, the train and test sets have approximately the same distribution of the target.
* `GroupKFold`: makes sure that data points from the same group are kept together, so no group appears in both the train and test sets.

In [49]:
df = pd.DataFrame({
    "group": ['x','x','x','y','y','y','y','z','z'],
    "category": ['a','a','a','a','a','b','b','b','b'],
    "Prob": [0.1, 0.3, 0.5, 0.7, 0.9, 0.2, 0.4, 0.3, 0.2],
    "target": [0, 0, 0, 1, 0, 1, 0, 0, 1]
    })
X = df.drop('target', axis=1)
y = df['target']

## StratifiedKFold

Maintains the distribution of the target in the train and test sets of every fold.

In [50]:
skf = StratifiedKFold(n_splits = 3, shuffle = True, random_state = 42)

for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    X_train, y_train = X.iloc[train_index], y.iloc[train_index]
    X_test, y_test = X.iloc[test_index], y.iloc[test_index]

    print(f"fold {i+1}:")
    print("train target", y_train.tolist())
    print("test target", y_test.tolist()) 

fold 1:
train target [0, 1, 0, 1, 0, 0]
test target [0, 0, 1]
fold 2:
train target [0, 0, 0, 1, 0, 1]
test target [1, 0, 0]
fold 3:
train target [0, 0, 1, 0, 0, 1]
test target [0, 1, 0]
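To make the stratification easier to see, the positive rate of each fold can be computed directly. This is a minimal sketch that reuses the target values from the example; the names `y_demo`, `X_demo`, and the placeholder `row` feature are assumptions introduced here so the cell does not overwrite `X` and `y`.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Same target values as in the example: 3 positives out of 9 rows.
y_demo = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1])
# Placeholder feature (hypothetical); the splitter only needs the row count.
X_demo = pd.DataFrame({"row": range(len(y_demo))})

skf_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

positive_rates = []
for i, (train_index, test_index) in enumerate(skf_demo.split(X_demo, y_demo)):
    # Share of positives (target == 1) on each side of the split.
    train_rate = y_demo.iloc[train_index].mean()
    test_rate = y_demo.iloc[test_index].mean()
    positive_rates.append((train_rate, test_rate))
    print(f"fold {i+1}: train positive rate = {train_rate:.2f}, "
          f"test positive rate = {test_rate:.2f}")
```

With 3 positives among 9 rows and 3 splits, each test fold should receive exactly one positive, so both rates stay at one third.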


## GroupKFold

Keeps members of the same group together in one fold, which is useful in several scenarios. For example,

- **Grouped Time Periods**: In time series data, nearby observations are often strongly correlated. For example, in financial data, stock prices today are influenced by stock prices from previous days. If each time period (e.g., day, week, month) is used as a group, GroupKFold keeps all observations from that period in the same fold. Note that GroupKFold does not order the folds by time, so a test fold may precede the training data; `TimeSeriesSplit` is the scikit-learn splitter designed to respect temporal order.
- **Preventing Data Leakage**: Data leakage occurs when information about the test set reaches the training set, leading to overly optimistic performance estimates. When rows in the same group are closely related, splitting a group across train and test leaks information. GroupKFold prevents this kind of leakage by never placing the same group on both sides of a split.
- **Panel Data**: Panel data, also known as longitudinal data, involves repeated observations over multiple time periods for multiple entities (e.g., individuals, companies). Using the entity as the group, GroupKFold ensures that all observations from the same entity land in the same fold, so the model is always evaluated on entities it has not seen during training.
(credit: ChatGPT)
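When the temporal order itself must be respected, scikit-learn also provides `TimeSeriesSplit`, which always trains on earlier rows and tests on later ones. The sketch below is only for contrast with GroupKFold and assumes the rows are already sorted by time; `X_time` is a hypothetical array of 9 time-ordered rows.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 9 rows, assumed to be sorted from oldest to newest.
X_time = np.arange(9).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)

time_folds = []
for i, (train_index, test_index) in enumerate(tscv.split(X_time)):
    time_folds.append((train_index.tolist(), test_index.tolist()))
    print(f"fold {i+1}:")
    print("train rows", train_index.tolist())
    print("test rows", test_index.tolist())
```

In every fold the test rows come strictly after the training rows, which is a guarantee GroupKFold does not make.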

In [54]:
gkf = GroupKFold(n_splits = 3)
groups = df['group'].copy()

for i, (train_index, test_index) in enumerate(gkf.split(X, y, groups = groups)):
    X_train, y_train = X.iloc[train_index], y.iloc[train_index]
    X_test, y_test   = X.iloc[test_index], y.iloc[test_index]

    print(f"fold {i+1}:")
    print("train group", X_train['group'].tolist())
    print("test group", X_test['group'].tolist()) 


fold 1:
train group ['x', 'x', 'x', 'z', 'z']
test group ['y', 'y', 'y', 'y']
fold 2:
train group ['y', 'y', 'y', 'y', 'z', 'z']
test group ['x', 'x', 'x']
fold 3:
train group ['x', 'x', 'x', 'y', 'y', 'y', 'y']
test group ['z', 'z']
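The "kept together" property can also be checked programmatically: in every fold, the set of training groups and the set of test groups should not overlap. A minimal sketch using the same group labels as the example; `groups_demo`, `X_demo`, and the placeholder `row` feature are names assumed here for illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Same group labels as in the example.
groups_demo = pd.Series(['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z'])
# Placeholder feature (hypothetical); the splitter only needs the row count.
X_demo = pd.DataFrame({"row": range(len(groups_demo))})

gkf_demo = GroupKFold(n_splits=3)

overlaps = []
for i, (train_index, test_index) in enumerate(gkf_demo.split(X_demo, groups=groups_demo)):
    train_groups = set(groups_demo.iloc[train_index])
    test_groups = set(groups_demo.iloc[test_index])
    # Groups that appear on both sides of the split (should be none).
    shared = train_groups & test_groups
    overlaps.append(shared)
    print(f"fold {i+1}: shared groups = {sorted(shared)}")
```

An empty list for every fold confirms that no group is split between train and test.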
