### Stratified K-fold for this competition

Sharing k-fold splits makes it easier to team up later. If teammates were already using the same k-fold splits at the time of the team merge, they can easily build a second-level model.

So I encourage you to use this k-fold split. Perhaps, when you find a team, you will be surprised to learn that your teammates were using it too!

The dataset is quite imbalanced. On one hand, we have only 79 ETT-Abnormal cases. On the other hand, we have 21324 CVC-Normal cases.

The goal of this notebook is to split the data so that it satisfies the following:

1. Group by patient. Usually, if you have 5 photos of an item in a catalog, you have to ensure that they all end up together in the same fold. I suppose the same applies to this case.
2. All the classes are represented in the same number across folds
   - especially the less represented classes
3. All the annotations are represented in the same number across folds (there are labeled images in the train set without annotations)
4. Each fold has the same number of observations
    

    
This script guarantees the first point (group by patient) and makes a heuristic effort to optimize the other points.
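The group-by-patient guarantee is easy to check once a fold column exists. A minimal sketch, assuming a DataFrame with `PatientID` and `fold` columns like the one built in this notebook; the helper name `patients_in_multiple_folds` and the toy data are made up for illustration:

```python
import pandas as pd

def patients_in_multiple_folds(df, patient_col="PatientID", fold_col="fold"):
    """Return the patients whose rows are spread over more than one fold."""
    folds_per_patient = df.groupby(patient_col)[fold_col].nunique()
    return sorted(folds_per_patient[folds_per_patient > 1].index)

# Toy data (hypothetical): patient "b" leaks across folds 0 and 1.
toy = pd.DataFrame({
    "PatientID": ["a", "a", "b", "b", "c"],
    "fold":      [0,   0,   0,   1,   1],
})
leaking = patients_in_multiple_folds(toy)
```

An empty list means every patient lives in exactly one fold, which is what the first point asks for.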

It works iteratively. Before assigning a patient to a fold, it tries every fold, evaluates how much each assignment would hurt the split, and picks the least bad option.

Since row order matters to this heuristic, it could be improved by repeating the iterative algorithm N times, randomly shuffling the rows on each run, and picking the best k-fold split.
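The shuffling idea could look roughly like the sketch below. It is not the code used in this notebook: it works on a plain NumPy array of per-patient label counts and uses a simplified cost (the sum over labels of the max-minus-min fold totals, without weights). The function names, the number of restarts, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def imbalance(counts, folds, num_folds):
    """Sum over labels of (max - min) of the per-fold totals; lower is better."""
    totals = np.zeros((num_folds, counts.shape[1]))
    np.add.at(totals, folds, counts)  # accumulate each row into its fold
    return float((totals.max(axis=0) - totals.min(axis=0)).sum())

def greedy_assign(counts, order, num_folds):
    """Assign rows one at a time, in the given order, to the least harmful fold."""
    totals = np.zeros((num_folds, counts.shape[1]))
    folds = np.empty(len(counts), dtype=int)
    for row in order:
        costs = []
        for f in range(num_folds):
            trial = totals.copy()
            trial[f] += counts[row]
            costs.append((trial.max(axis=0) - trial.min(axis=0)).sum())
        best = int(np.argmin(costs))
        folds[row] = best
        totals[best] += counts[row]
    return folds

def best_of_n_shuffles(counts, num_folds, n_restarts=20, seed=0):
    """Run the greedy heuristic on several random row orders and keep the best split."""
    rng = np.random.default_rng(seed)
    best_folds, best_cost = None, np.inf
    for _ in range(n_restarts):
        order = rng.permutation(len(counts))
        folds = greedy_assign(counts, order, num_folds)
        cost = imbalance(counts, folds, num_folds)
        if cost < best_cost:
            best_folds, best_cost = folds, cost
    return best_folds, best_cost

# Toy per-patient label counts (hypothetical): 12 patients, 3 labels.
toy_counts = np.random.default_rng(42).integers(0, 4, size=(12, 3))
folds, cost = best_of_n_shuffles(toy_counts, num_folds=3)
```

Each restart costs one full pass of the heuristic, so N restarts multiply the runtime by N.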

Also, a genetic algorithm would probably work better than an iterative heuristic.


Notebook Version 3:  Takes into account the number of annotations for each fold and labeled class.  

Future improvements:
  - give more importance to the classes that are more difficult to predict, according to the confusion matrix
    

In [None]:
NUM_FOLDS=5

import pandas as pd
from tqdm import tqdm
import numpy as np




data_dir='../input/ranzcr-clip-catheter-line-classification/'

target_col = [
    'ETT - Abnormal',
    'ETT - Borderline',
    'ETT - Normal',
    'NGT - Abnormal',
    'NGT - Borderline',
    'NGT - Incompletely Imaged',
    'NGT - Normal',
    'CVC - Abnormal',
    'CVC - Borderline',
    'CVC - Normal',
    'Swan Ganz Catheter Present'
]
   
def annotation_col_name(x):
    # x is a (value, label) column tuple produced by the pivot table; keep the label
    x=x[1]
    # drop lowercase letters and spaces, e.g. 'ETT - Abnormal' -> 'annotation-ETT-A'
    return "annotation-"+"".join([c for c in x if c.upper()==c and c!=' '])

# count the annotations per image (StudyInstanceUID) and label
annotations_df = pd.read_csv(data_dir + 'train_annotations.csv', usecols=['StudyInstanceUID', 'label'])
annotations_df['ones']=1
annotations_df=pd.pivot_table(annotations_df, values=['ones'], index=['StudyInstanceUID'], columns=['label'], aggfunc='sum').fillna(0)
annotations_df.columns=[annotation_col_name(col_name) for col_name in annotations_df.columns]

   
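# data_dir already ends with '/', so no extra slash is needed
train_gt_df = pd.read_csv(data_dir + 'train.csv')
# images without annotations get 0 in every annotation column
train_gt_df=train_gt_df.merge(annotations_df, on='StudyInstanceUID', how='left' ).fillna(0)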

target_col+=[x for x in annotations_df.columns]

num_target = len(target_col)
kfold_df=train_gt_df.copy().drop(['StudyInstanceUID'],axis=1)


print(len(train_gt_df.PatientID.unique()), "patients")

# Loss function to evaluate K-fold splits

Just a rule of thumb: a little bit of this, a little bit of that, and some manually tuned parameters.
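As a rough illustration of the idea (not the exact weights used in this notebook), the score adds the spread of the total counts per fold to a weighted max-minus-min gap per label, so a gap in a rare label costs more than the same gap in a common one. The function name `split_score`, the toy counts, and the weights are hypothetical.

```python
import numpy as np

def split_score(fold_label_counts, label_weights):
    """Greater is worse. Rows are folds, columns are labels."""
    counts = np.asarray(fold_label_counts, dtype=float)
    weights = np.asarray(label_weights, dtype=float)
    # Spread of the total count per fold (sample standard deviation, the pandas default).
    size_penalty = counts.sum(axis=1).std(ddof=1)
    # Weighted gap between the fullest and the emptiest fold, per label.
    label_penalty = (weights * (counts.max(axis=0) - counts.min(axis=0))).sum()
    return size_penalty + label_penalty

balanced   = [[10, 2], [10, 2]]
unbalanced = [[10, 4], [10, 0]]
weights = [1.0, 5.0]  # hypothetical: the rare label (second column) weighs more

balanced_score = split_score(balanced, weights)
unbalanced_score = split_score(unbalanced, weights)
```

Here the balanced split scores zero, while the unbalanced one is penalized mostly through the rare label.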


In [None]:
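# Aggregate label and annotation counts per patient, so each patient is assigned to a fold as a whole
patient_kfold_df=kfold_df.groupby('PatientID').sum()


# Total count per label; annotation counts are multiplied by 4 (hand-tuned)
priority_df=patient_kfold_df.sum().to_frame().reset_index().rename(columns={'index':'label',0:'score',})
priority_df['score']=[(s*4 if 'annotation' in l else s )for l,s in zip(priority_df['label'],priority_df['score'])]
priority_df.sort_values(by='score',inplace=True)

# Labels ordered from lowest to highest score, each with a hand-tuned weight:
# the squared distance from the largest label total, so rare labels usually weigh more
label_priority=list(priority_df['label'])
label_importance=[(0.1+patient_kfold_df.sum().max()-x)**2 for x in priority_df['score']]
label_importance[0]=3*label_importance[0]  # extra weight for the lowest-scoring label


# Process patients carrying the rarest labels first
patient_kfold_df=patient_kfold_df.sort_values(by=label_priority, ascending=False)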

def kfold_quality(folds): #(greater is worse)
    # patients not assigned yet get the placeholder fold -1 and are excluded below
    patient_kfold_df['fold']=folds+([-1]*(len(patient_kfold_df)-len(folds)))
    scores=patient_kfold_df[:len(folds)][['fold']+target_col].groupby('fold').sum()

    # make sure every fold has a row, even if no patient has been assigned to it yet
    scores=scores.reindex(range(NUM_FOLDS), fill_value=0)

    # penalizes folds whose total label and annotation counts differ
    # (a proxy for not having the same number of observations per fold)
    score=scores.sum(axis=1).std()
    
    # penalizes not having the same number of observations per fold and class
    # using weights to give more importance to imbalances in less represented classes
    for label, importance in zip (label_priority,label_importance):
        scores[label]*=importance
        score+=scores[label].max()-scores[label].min()
    return score

# Iterative heuristic

In [None]:
current_kfold=[]
for _ in tqdm(range(len(patient_kfold_df))):
    # try the next patient in every fold and keep the fold with the lowest (best) score
    fold_scores=[(f, kfold_quality(current_kfold+[f])) for f in range(NUM_FOLDS)]
    best_fold=min(fold_scores, key=lambda x: x[1])[0]
    current_kfold.append(best_fold)
patient_kfold_df['fold']=current_kfold

In [None]:
# Save results

patient_kfold_df=patient_kfold_df.reset_index()

train_gt_df=train_gt_df.merge(patient_kfold_df[['PatientID', 'fold']], on='PatientID', how='left')

patient_kfold_df[['PatientID', 'fold']].to_csv(f'stratified_{NUM_FOLDS}_folds_by_patient_id.csv', index=False)
train_gt_df[['StudyInstanceUID', 'fold']].to_csv(f'stratified_{NUM_FOLDS}_folds.csv', index=False)


# Results

In [None]:
train_gt_df[['fold']+target_col].groupby('fold').sum().transpose()

Images per fold:

In [None]:
images_per_fold=train_gt_df[['fold', 'StudyInstanceUID']].groupby('fold').count().rename(columns={'StudyInstanceUID':'Num. images'})
images_per_fold

Patients per fold:

In [None]:
patients_per_fold=train_gt_df[['fold', 'PatientID']].groupby('fold').agg({'PatientID': 'nunique'}).rename(columns={'PatientID':'Num. patients'})
patients_per_fold