### Acknowledgement

In this notebook, we will use the implementation of Stratified Group K-fold cross validation presented in [this kernel](https://www.kaggle.com/jakubwasikowski/stratified-group-k-fold-cross-validation) by [Jakub Wąsikowski](https://www.kaggle.com/jakubwasikowski).

### The purpose of this notebook

One of the challenges in this competition is finding the best way to implement K-fold cross-validation. On one hand, the dataset is very imbalanced, so some form of stratification is in order. On the other hand, some patients are represented by multiple images, and it is probably a good idea not to split images with the same `patient_id` between our training and validation sets. The latter can be achieved with sklearn's `GroupKFold` split. When this notebook was written, sklearn did not have an implementation of Stratified Group K-Fold (a `StratifiedGroupKFold` class was added later, in scikit-learn 1.0), so a custom one was needed. Fortunately, such an implementation was already developed in the PetFinder competition (see the "Acknowledgement" section), so we can just take it and use it here. This is what we do in this notebook.

### Setting the number of folds

To illustrate the main idea we will be using an example of 5 folds. The number of folds can be easily adjusted by changing the parameter `NFOLDS` below.

In [None]:
NFOLDS = 5

### Loading libraries

In [None]:
import random
import numpy as np
import pandas as pd
from pathlib import Path
from collections import Counter, defaultdict

### Function for implementing Stratified Group k-Fold split

In [None]:
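def stratified_group_k_fold(X, y, groups, k, seed=None):
    """Yield (train_indices, test_indices) for k stratified, group-disjoint folds.

    y must hold non-negative integer labels; groups holds one group id per row.
    X is accepted for API symmetry but is not used.
    Groups are assigned greedily, most label-skewed first, to the fold that
    keeps the per-label share of samples most even across folds.
    """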
    labels_num = np.max(y) + 1
    y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
    y_distr = Counter()
    for label, g in zip(y, groups):
        y_counts_per_group[g][label] += 1
        y_distr[label] += 1

    y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
    groups_per_fold = defaultdict(set)

    def eval_y_counts_per_fold(y_counts, fold):
        y_counts_per_fold[fold] += y_counts
        std_per_label = []
        for label in range(labels_num):
            label_std = np.std([y_counts_per_fold[i][label] / y_distr[label] for i in range(k)])
            std_per_label.append(label_std)
        y_counts_per_fold[fold] -= y_counts
        return np.mean(std_per_label)
    
    groups_and_y_counts = list(y_counts_per_group.items())
    random.Random(seed).shuffle(groups_and_y_counts)

    for g, y_counts in sorted(groups_and_y_counts, key=lambda x: -np.std(x[1])):
        best_fold = None
        min_eval = None
        for i in range(k):
            fold_eval = eval_y_counts_per_fold(y_counts, i)
            if min_eval is None or fold_eval < min_eval:
                min_eval = fold_eval
                best_fold = i
        y_counts_per_fold[best_fold] += y_counts
        groups_per_fold[best_fold].add(g)

    all_groups = set(groups)
    for i in range(k):
        train_groups = all_groups - groups_per_fold[i]
        test_groups = groups_per_fold[i]

        # use `idx` so the row index does not reuse the name of the fold counter `i`
        train_indices = [idx for idx, g in enumerate(groups) if g in train_groups]
        test_indices = [idx for idx, g in enumerate(groups) if g in test_groups]

        yield train_indices, test_indices
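
To make the fold-selection criterion above more concrete, here is a minimal standalone sketch of the score computed by `eval_y_counts_per_fold`: for each label, take the standard deviation across folds of each fold's share of that label, then average over labels. The helper name `fold_imbalance` and the toy counts are hypothetical and only serve as an illustration; `statistics.pstdev` is used because `np.std` computes the population standard deviation by default.

```python
from statistics import pstdev


def fold_imbalance(counts_per_fold, label_totals):
    """Mean over labels of the std (across folds) of each fold's share of that label."""
    stds = []
    for label, total in enumerate(label_totals):
        shares = [fold[label] / total for fold in counts_per_fold]
        stds.append(pstdev(shares))
    return sum(stds) / len(stds)


# Hypothetical toy data: 2 folds, 2 labels, 10 negatives and 2 positives overall.
label_totals = [10, 2]
folds_so_far = [[4, 1], [3, 0]]  # label counts already assigned to each fold
new_group = [3, 1]               # label counts of the group being placed

scores = []
for i in range(len(folds_so_far)):
    # tentatively add the group to fold i and score the resulting assignment
    trial = [list(f) for f in folds_so_far]
    trial[i] = [a + b for a, b in zip(trial[i], new_group)]
    scores.append(fold_imbalance(trial, label_totals))

# the group goes to the fold with the lowest score (first one wins on ties)
best_fold = scores.index(min(scores))
```

In this toy case the group is placed in the second fold, because that fold is the one still missing a positive sample.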

### Loading data

In [None]:
PATH = Path('/kaggle/input/siim-isic-melanoma-classification/')
train = pd.read_csv(PATH / 'train.csv')
print(f"The shape of the `train` is {train.shape}.\n")
print(f"The columns present in `train` are {train.columns.values}.")

In [None]:
train_x = train['image_name']
train_y = train['target'].values
groups = np.array(train['patient_id'].values)

In [None]:
def get_distribution(y_vals):
    y_distr = Counter(y_vals)
    y_vals_sum = sum(y_distr.values())
    return [f'{y_distr[i] / y_vals_sum:.5%}' for i in range(np.max(y_vals) + 1)]

In [None]:
distrs = [get_distribution(train_y)]
index = ['training set']

for fold_ind, (dev_ind, val_ind) in enumerate(stratified_group_k_fold(train_x, train_y, 
                                                                      groups, k=NFOLDS), 1):
    dev_y, val_y = train_y[dev_ind], train_y[val_ind]
    dev_groups, val_groups = groups[dev_ind], groups[val_ind]
    
    # making sure that train and validation group do not overlap:
    assert len(set(dev_groups) & set(val_groups)) == 0
    
    distrs.append(get_distribution(dev_y))
    index.append(f'development set - fold {fold_ind}')
    distrs.append(get_distribution(val_y))
    index.append(f'validation set - fold {fold_ind}')

display('Distribution per class:')
pd.DataFrame(distrs, index=index, columns=[f'Label {l}' for l in range(np.max(train_y) + 1)])

As you can see, it works well: each fold has almost the same percentages of 0's and 1's as the full training set, and the `assert` confirms that no `patient_id` appears in both the development and the validation set of any fold.
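
If you want a slightly stronger sanity check, the sketch below verifies two properties at once: no group leaks between the two sides of a fold, and the validation sets together cover every row exactly once. The helper `check_folds` and the toy data are hypothetical illustrations, not part of the original kernel.

```python
def check_folds(folds, groups):
    """Raise AssertionError if a group leaks or the validation sets do not partition the rows."""
    seen = []
    for train_idx, val_idx in folds:
        train_groups = {groups[i] for i in train_idx}
        val_groups = {groups[i] for i in val_idx}
        assert not (train_groups & val_groups), "a group appears on both sides"
        seen.extend(val_idx)
    # every row index must show up in exactly one validation set
    assert sorted(seen) == list(range(len(groups))), "validation sets must cover each row once"
    return True


# Hypothetical toy split: 6 rows, 4 patients, 2 folds.
toy_groups = ["p1", "p1", "p2", "p3", "p3", "p4"]
toy_folds = [([2, 5], [0, 1, 3, 4]), ([0, 1, 3, 4], [2, 5])]
ok = check_folds(toy_folds, toy_groups)
```

In this notebook the same check could be run as `check_folds(list(stratified_group_k_fold(train_x, train_y, groups, k=NFOLDS)), groups)`.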

### Saving fold indices

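Now, let's save the fold indices into NumPy arrays. Note that the split is recomputed here, and because `seed` is left at `None`, groups may be assigned to folds differently than in the run above; pass the same fixed `seed` to both calls if the saved folds must match exactly.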

In [None]:
for fold_ind, (dev_ind, val_ind) in enumerate(stratified_group_k_fold(train_x, train_y, 
                                                                      groups, k=NFOLDS), 1):
    
    dev_ind = np.array(dev_ind)
    val_ind = np.array(val_ind)
    
    np.save(f"train_idx_fold_{fold_ind}.npy", dev_ind)
    np.save(f"val_idx_fold_{fold_ind}.npy", val_ind)
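
As an alternative to separate index files, a single per-row fold label is sometimes more convenient, for example as an extra column in `train`. The sketch below shows one way to derive it from the `(dev_ind, val_ind)` pairs; the helper name `fold_ids_from_splits` and the toy splits are hypothetical.

```python
def fold_ids_from_splits(splits, n_rows):
    """Return a list giving, for each row, the 1-based fold in which it is a validation row."""
    fold_of_row = [None] * n_rows
    for fold_ind, (_, val_idx) in enumerate(splits, 1):
        for i in val_idx:
            fold_of_row[i] = fold_ind
    return fold_of_row


# Hypothetical toy splits: 4 rows, 2 folds.
toy_splits = [([2, 3], [0, 1]), ([0, 1], [2, 3])]
fold_of_row = fold_ids_from_splits(toy_splits, n_rows=4)
```

Any row that never appears in a validation set would keep the value `None`, which makes an incomplete split easy to spot.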

### Final remark

I can think of two possible ways to use these folds: we can use them to do cross-validation on the JPEG files, or we can make a new set of TFRecord files based on these folds. In the latter case, the number of folds should probably be adjusted.