## Example/test of `train_df_k_fold_split()` — Split data into K-Folds


NOTES:
- Part of the examples for the `feedback-prize-2021-lib` [library](https://www.kaggle.com/sentinel1/feedback-prize-2021-lib)
- The `train_df_k_fold_split()` function requires a newer version of `scikit-learn`, so update the package first. Without the update, calling `train_df_k_fold_split()` raises a relevant exception, but the rest of the library keeps working normally. You can therefore skip the `!pip install -U "scikit-learn"` cell if you do NOT use the `train_df_k_fold_split()` function.
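
To see whether the update is needed at all, a quick check along these lines may help. It is only a sketch: `stratified_group_kfold_available()` is a hypothetical helper, not part of the library, and it assumes that the requirement comes from `StratifiedGroupKFold` (available in `scikit-learn` 1.0 and later), which the split section of this notebook names as the underlying splitter.

```python
def stratified_group_kfold_available() -> bool:
    """Return True if the installed scikit-learn exposes StratifiedGroupKFold.

    Hypothetical helper for illustration; the library may perform its own,
    different check before raising its exception.
    """
    try:
        # Fails with ImportError if scikit-learn is missing or too old
        from sklearn.model_selection import StratifiedGroupKFold  # noqa: F401
    except ImportError:
        return False
    return True


print("StratifiedGroupKFold available:", stratified_group_kfold_available())
```

If this prints `False`, run the `pip` cell below and restart the kernel before importing the library.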

In [None]:
%%capture
!pip install -U "scikit-learn"

In [None]:
from pathlib import Path

if Path.cwd() == Path('/kaggle/working'):
    # Kaggle
    import sys
    LIB_PATH = (Path.cwd()/".."/"input"/"feedback-prize-2021-lib").resolve()
    assert LIB_PATH.is_dir(), ("Use the '+ Add data' feature to add the 'Notebook Output Files' from the 'sentinel1/feedback-prize-2021-lib' "
                               "in order to make some utilities importable from that library (one time restart is required after adding).")
    sys.path.insert(0, str(LIB_PATH))
else:
    # Local machine
    assert (Path.cwd()/"lib"/"feedback_util.py").is_file(), ("Run the 'sentinel1/feedback-prize-2021-lib' notebook locally "
                                                             "in order to generate the importable library on your machine")

In [None]:
from lib.feedback_util import get_train_df_with_fixed_PII_offsets, train_df_k_fold_split, get_train_essay_text, color_print_essay

In [None]:
import pandas as pd
import numpy as np

## Configuration of K fold split

In [None]:
K_FOLDS = 6  # Split the data into how many folds?

## Load metadata from the `train.csv`

The `get_train_df_with_fixed_PII_offsets()` function is used here so that the PII masking noise is corrected in `train_df`.

NOTE: Passing the `use_tmp_cache=True` argument in the function call below speeds up loading of `train_df`: the first call saves a cache file into the `../temp` directory, and subsequent calls reuse it.

In [None]:
train_df = get_train_df_with_fixed_PII_offsets(use_tmp_cache=True)
train_df.head(2)

## Split data into K-Folds

Split the data using the `train_df_k_fold_split()` function so that discourse types are evenly distributed between folds while no essay is spread across more than one fold. The function achieves this with the `StratifiedGroupKFold` splitter of `scikit-learn` (i.e. the Stratified K-Folds iterator variant with non-overlapping groups). Calling the function adds the `CV` column to the `train_df` dataframe, indicating which fold each row belongs to.

NOTE: Passing the `display_split_statistics=True` argument in the function call below makes `train_df_k_fold_split()` calculate and display basic statistics of how the data was split across the folds (i.e. counts of essays, discourses and each discourse type per CV). It does NOT affect the returned dataframe, and the function runs faster with `display_split_statistics=False` (which is the default).

In [None]:
train_df = train_df_k_fold_split(train_df, K=K_FOLDS, display_split_statistics=True)

In [None]:
train_df.head()

## Example Loops

This is just one of many possible loops that use the K-Folds split calculated above (i.e. the `CV` column in the `train_df`).

In [None]:
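import importlib.util  # a bare `import importlib` does not guarantee that `importlib.util` is loaded

# Use the notebook-friendly progress bar only when ipywidgets is installed
if importlib.util.find_spec('ipywidgets') is not None:
    from tqdm.auto import tqdm
else:
    from tqdm import tqdm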

    
train_folds = [1, 2, 3, 4]
val_folds = [5, 6]


for fold in train_folds:
    fold_df = train_df[train_df['CV'] == fold]
    for idx,row in tqdm(fold_df.iterrows(), desc=f'Train Fold {fold}', total=len(fold_df), dynamic_ncols=True, miniters=10):
        discourse_text = row['discourse_text']
        essay_text = get_train_essay_text(row['id'])
        # Train


for fold in val_folds:
    fold_df = train_df[train_df['CV'] == fold]
    for val_idx,val_row in tqdm(fold_df.iterrows(), desc=f'Val Fold {fold}', total=len(fold_df), dynamic_ncols=True, miniters=10):
        val_discourse_text = val_row['discourse_text']
        val_essay_text = get_train_essay_text(val_row['id'])
        # Validate
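
The loops above use a fixed assignment of folds to training and validation. Another common pattern is to rotate the held-out fold so that each fold is used for validation exactly once. The sketch below shows only the fold bookkeeping; `rotate_folds()` is a hypothetical helper, and it assumes the fold labels run from 1 to K, as the fold lists above suggest.

```python
def rotate_folds(k: int):
    """Yield (train_folds, val_fold) pairs, holding out each fold exactly once."""
    folds = list(range(1, k + 1))  # assumption: fold labels are 1..k
    for val_fold in folds:
        train_part_folds = [f for f in folds if f != val_fold]
        yield train_part_folds, val_fold


for train_part_folds, held_out_fold in rotate_folds(6):
    # With the dataframe from this notebook, the rows could be selected as:
    #   train_part = train_df[train_df['CV'].isin(train_part_folds)]
    #   val_part = train_df[train_df['CV'] == held_out_fold]
    print(f"validate on fold {held_out_fold}, train on folds {train_part_folds}")
```

Each round would train a fresh model on `train_part` and evaluate it on `val_part`, so every row contributes to validation once.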


## Print the last "train" discourse text from the above loops (i.e. `discourse_text`)

In [None]:
print(discourse_text)

## Print the last "validation" discourse text from the above loops (i.e. `val_discourse_text`)

In [None]:
print(val_discourse_text)

## Print the last "train" essay from the above loops (i.e. `essay_text`)

In [None]:
print(essay_text)

In [None]:
color_print_essay(row['id'], train_df, start_end_indicators=True)

## Print the last "validation" essay text from the above loops (i.e. `val_essay_text`)

In [None]:
print(val_essay_text)

In [None]:
color_print_essay(val_row['id'], train_df)