# Prepare Training Data

This notebook creates **completely leak-free** training and validation fold datasets for training `DeBERTa` and `LGBM` models.

Fold strategy:
1. Split the dataset into 5 folds. Let the folds be `1, 2, 3, 4, 5`. (In the code and saved paths, folds and parts are numbered from 0.)
    * Each fold contains **20%** of the data.
    * For each fold, we use the other **80%** of the data for training and that fold's **20%** for validation.
1. For each validation fold, split the corresponding training data into 5 further folds.
    * For example, let the training folds be `2, 3, 4, 5` and the validation fold be `1`.
    * We combine the **80%** of the data in the training folds and split it into 5 folds, each containing **16%** of the total data.
    * Let these folds be `1_A, 1_B, 1_C, 1_D, 1_E`.
    * Next, we train 5 `DeBERTa` models, each using 4 of these folds for training and the remaining 1 for validation. The validation predictions are saved to the `OOF_fold_1_train` file.
    * Once all 5 models are trained, we use all of them to predict on fold `1` and save the result as `OOF_fold_1_valid`. None of these models saw fold `1` during training, so these predictions are leak-free.
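
As a rough check of the percentages above, the sketch below computes the fold sizes for the 17,307 rows in the training CSV. It assumes folds are sized the way scikit-learn's plain `KFold` sizes them (the first `n % k` folds get one extra row); `StratifiedKFold` may distribute the remainder differently, so treat the exact sizes as approximate. `fold_sizes` is a hypothetical helper, not part of `lib`.

```python
def fold_sizes(n_samples: int, n_folds: int) -> list[int]:
    """Sizes of n_folds near-equal folds; the first n_samples % n_folds folds get one extra row."""
    base, extra = divmod(n_samples, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]


n_total = 17307  # rows in the training CSV
outer = fold_sizes(n_total, 5)

# Hold out the first outer fold; the rest is the training data for that fold.
n_train = n_total - outer[0]
inner = fold_sizes(n_train, 5)

print(outer)  # outer validation folds, about 20% of the data each
print(inner)  # inner folds, about 16% of the data each
print(round(inner[0] / n_total, 3))
```

Each inner fold is one fifth of 80% of the data, which is where the 16% figure comes from.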

## ⚙️ Setup 

### 📚 Importing Libraries

Importing from packages

In [1]:
import os
import pandas as pd
import numpy as np
from tqdm import tqdm
from pprint import pprint
import matplotlib.pyplot as plt
import torch
from sklearn.model_selection import StratifiedKFold, StratifiedGroupKFold
from transformers import AutoTokenizer
from tokenizers import AddedToken
import plotly.express as px

In [2]:
os.chdir("../../")

Importing user-defined packages

In [3]:
from lib.utils.utils import seed_everything
from lib.config import config
from lib.paths import Paths
from lib.data_tools.data import (
    get_data_loaders,
    clean_text,
    sliding_window
)

### 🌱 Setting Random Seeds

In [4]:
seed_everything()

## Loading Dataset

In [5]:
df = pd.read_csv(Paths.COMPETITION_TRAIN_CSV_PATH)
df.shape

(17307, 3)

## ⌛ Data Processing

Shift the score classes down by 1 so that they range from 0 to 5.

In [6]:
df["score"] = df["score"] - 1
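
Zero-based labels are what classification losses such as cross-entropy typically expect as class indices. Below is a minimal sketch of the shift and its inverse, which would be needed to report predictions on the original scale. It assumes the raw scores are the integers 1 to 6 (implied by the 0 to 5 range above); `to_class` and `to_score` are hypothetical helper names.

```python
def to_class(score: int) -> int:
    """Map a raw score (assumed 1..6) to a zero-based class index (0..5)."""
    return score - 1


def to_score(class_index: int) -> int:
    """Map a zero-based class index back to the raw score scale."""
    return class_index + 1


raw_scores = [1, 3, 6]
classes = [to_class(s) for s in raw_scores]
restored = [to_score(c) for c in classes]
print(classes, restored)
```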

Cleaning text.

In [7]:
df["full_text"] = df["full_text"].map(clean_text)

## 🪙 Tokenizer

Sources:
1. [MOTH's Notebook](https://www.kaggle.com/code/alejopaullier/aes-2-multi-class-classification-train?scriptVersionId=170290107&cellId=14)

In [8]:
%env TOKENIZERS_PARALLELISM=true

env: TOKENIZERS_PARALLELISM=true


In [9]:
tokenizer = AutoTokenizer.from_pretrained(config.model)



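The newline and double-space tokens added below follow [Chris Deotte's idea of adding tokens to the tokenizer](https://www.kaggle.com/code/cdeotte/deberta-v3-small-starter-cv-0-820-lb-0-800?scriptVersionId=174239814&cellId=17), so that paragraph breaks and double spaces are not lost during tokenization.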

In [10]:
tokenizer.add_tokens([AddedToken("\n", normalized=False)])
tokenizer.add_tokens([AddedToken(" "*2, normalized=False)])

1

In [11]:
tokenizer.save_pretrained(Paths.TOKENIZER_PATH)

('output/microsoft/deberta-v3-xsmall/tokenizer_v2/tokenizer_config.json',
 'output/microsoft/deberta-v3-xsmall/tokenizer_v2/special_tokens_map.json',
 'output/microsoft/deberta-v3-xsmall/tokenizer_v2/spm.model',
 'output/microsoft/deberta-v3-xsmall/tokenizer_v2/added_tokens.json',
 'output/microsoft/deberta-v3-xsmall/tokenizer_v2/tokenizer.json')

In [12]:
print(tokenizer)

DebertaV2TokenizerFast(name_or_path='microsoft/deberta-v3-xsmall', vocab_size=128000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	128000: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128001: AddedToken("
", rstrip=False, lstrip

## ✂️ Train-Validation Splitting

Sources:
1. [MOTH's Notebook](https://www.kaggle.com/code/alejopaullier/aes-2-multi-class-classification-train?scriptVersionId=170290107&cellId=12)
2. [Martin's post](https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/discussion/499959)

In [13]:
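def split(df):
    """Assign every row to one of `config.n_folds` stratified folds, stored in the `fold` column."""
    df["fold"] = -1
    X, y = df["full_text"], df["score"]

    # Stratify on the score so that every fold has a similar score distribution
    skf = StratifiedKFold(
        n_splits=config.n_folds,
        shuffle=True,
        random_state=config.random_seed,
    )

    # Each row lands in exactly one validation split; that split's index becomes its fold
    for i, (_, valid_idx) in enumerate(skf.split(X, y)):
        df.loc[valid_idx, "fold"] = i

    return df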

In [14]:
df = split(df)

In [15]:
px.histogram(df, x="fold", color="score", text_auto=True, barmode="stack")

## DeBERTa Dataset Preparation

In [16]:
def get_data_parts(train_fold, train_idx, valid_idx):
    # skf.split returns positional indices, so select rows by position
    train_part = train_fold.iloc[train_idx].reset_index(drop=True)
    valid_part = train_fold.iloc[valid_idx].reset_index(drop=True)

    # Sliding window to split long sequences into shorter ones with overlap
    train_part = sliding_window(train_part, tokenizer)
    valid_part = sliding_window(valid_part, tokenizer)

    return train_part, valid_part
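
`sliding_window` lives in `lib.data_tools.data` and its implementation is not shown here. As an illustration of the general technique only, the sketch below splits a list of token ids into fixed-size windows with overlap. The function name `window_token_ids` and the `window_size` and `stride` parameters are assumptions, not the library's actual signature.

```python
def window_token_ids(token_ids: list[int], window_size: int, stride: int) -> list[list[int]]:
    """Split token_ids into windows of at most window_size, starting every stride tokens.

    Consecutive windows overlap by window_size - stride tokens.
    """
    if not 0 < stride <= window_size:
        raise ValueError("stride must be in the range 1..window_size")

    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window_size])
        # Stop once the current window reaches the end of the sequence.
        if start + window_size >= len(token_ids):
            break
        start += stride
    return windows


ids = list(range(10))
windows = window_token_ids(ids, window_size=4, stride=2)
for w in windows:
    print(w)
```

A sequence shorter than the window yields a single window, while a longer one yields several. This is presumably why the saved parts report more samples than there are essays in each split.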

In [17]:
def create_and_save_parts(fold, train_fold, fold_dir):
    X, y = train_fold["full_text"], train_fold["score"]

    skf = StratifiedKFold(
        n_splits=config.n_folds,
        shuffle=True,
        random_state=config.random_seed,
    )

    for part, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
        train_part, valid_part = get_data_parts(train_fold, train_idx, valid_idx)

        train_loader, valid_loader = get_data_loaders(train_part, valid_part, tokenizer)

        part_dir = os.path.join(fold_dir, f"part_{part}")

        os.makedirs(part_dir, exist_ok=True)

        train_dataloader_path = os.path.join(part_dir, f"train_{fold}_{part}.pth")
        torch.save(train_loader, train_dataloader_path)
        print(f"Saved {train_dataloader_path} with {len(train_part)} samples ")

        valid_dataloader_path = os.path.join(part_dir, f"valid_{fold}_{part}.pth")
        torch.save(valid_loader, valid_dataloader_path)
        print(f"Saved {valid_dataloader_path} with {len(valid_part)} samples ")

        valid_csv_path = os.path.join(part_dir, f"valid_{fold}_{part}.csv")
        valid_part.to_csv(valid_csv_path, index=False)
        print(f"Saved {valid_csv_path}")

In [18]:
root_dataset_dir = "data/lgbm_deberta"

for fold in df.fold.unique():
    train_fold = df[df["fold"] != fold].reset_index(drop=True)
    valid_fold = df[df["fold"] == fold].reset_index(drop=True)

    fold_dir = os.path.join(root_dataset_dir, f"fold_{fold}")

    os.makedirs(fold_dir, exist_ok=True)

    # For LGBM later on
    train_fold.to_csv(os.path.join(fold_dir, f"train_{fold}.csv"), index=False)
    valid_fold.to_csv(os.path.join(fold_dir, f"valid_{fold}.csv"), index=False)

    create_and_save_parts(fold, train_fold, fold_dir)

100%|██████████| 11076/11076 [00:20<00:00, 537.47it/s]
100%|██████████| 2770/2770 [00:05<00:00, 535.43it/s]


Saved data/lgbm_deberta/fold_3/part_0/train_3_0.pth with 15275 samples 
Saved data/lgbm_deberta/fold_3/part_0/valid_3_0.pth with 3827 samples 
Saved data/lgbm_deberta/fold_3/part_0/valid_3_0.csv


100%|██████████| 11077/11077 [00:20<00:00, 535.61it/s]
100%|██████████| 2769/2769 [00:05<00:00, 528.16it/s]


Saved data/lgbm_deberta/fold_3/part_1/train_3_1.pth with 15270 samples 
Saved data/lgbm_deberta/fold_3/part_1/valid_3_1.pth with 3832 samples 
Saved data/lgbm_deberta/fold_3/part_1/valid_3_1.csv


100%|██████████| 11077/11077 [00:20<00:00, 530.73it/s]
100%|██████████| 2769/2769 [00:05<00:00, 547.53it/s]


Saved data/lgbm_deberta/fold_3/part_2/train_3_2.pth with 15313 samples 
Saved data/lgbm_deberta/fold_3/part_2/valid_3_2.pth with 3789 samples 
Saved data/lgbm_deberta/fold_3/part_2/valid_3_2.csv


100%|██████████| 11077/11077 [00:20<00:00, 537.39it/s]
100%|██████████| 2769/2769 [00:05<00:00, 536.22it/s]


Saved data/lgbm_deberta/fold_3/part_3/train_3_3.pth with 15274 samples 
Saved data/lgbm_deberta/fold_3/part_3/valid_3_3.pth with 3828 samples 
Saved data/lgbm_deberta/fold_3/part_3/valid_3_3.csv


100%|██████████| 11077/11077 [00:20<00:00, 535.58it/s]
100%|██████████| 2769/2769 [00:05<00:00, 536.65it/s]


Saved data/lgbm_deberta/fold_3/part_4/train_3_4.pth with 15276 samples 
Saved data/lgbm_deberta/fold_3/part_4/valid_3_4.pth with 3826 samples 
Saved data/lgbm_deberta/fold_3/part_4/valid_3_4.csv


100%|██████████| 11076/11076 [00:20<00:00, 540.73it/s]
100%|██████████| 2769/2769 [00:05<00:00, 538.61it/s]


Saved data/lgbm_deberta/fold_0/part_0/train_0_0.pth with 15236 samples 
Saved data/lgbm_deberta/fold_0/part_0/valid_0_0.pth with 3808 samples 
Saved data/lgbm_deberta/fold_0/part_0/valid_0_0.csv


100%|██████████| 11076/11076 [00:20<00:00, 543.90it/s]
100%|██████████| 2769/2769 [00:05<00:00, 523.53it/s]


Saved data/lgbm_deberta/fold_0/part_1/train_0_1.pth with 15189 samples 
Saved data/lgbm_deberta/fold_0/part_1/valid_0_1.pth with 3855 samples 
Saved data/lgbm_deberta/fold_0/part_1/valid_0_1.csv


100%|██████████| 11076/11076 [00:20<00:00, 538.92it/s]
100%|██████████| 2769/2769 [00:05<00:00, 545.92it/s]


Saved data/lgbm_deberta/fold_0/part_2/train_0_2.pth with 15258 samples 
Saved data/lgbm_deberta/fold_0/part_2/valid_0_2.pth with 3786 samples 
Saved data/lgbm_deberta/fold_0/part_2/valid_0_2.csv


100%|██████████| 11076/11076 [00:20<00:00, 540.09it/s]
100%|██████████| 2769/2769 [00:05<00:00, 544.83it/s]


Saved data/lgbm_deberta/fold_0/part_3/train_0_3.pth with 15233 samples 
Saved data/lgbm_deberta/fold_0/part_3/valid_0_3.pth with 3811 samples 
Saved data/lgbm_deberta/fold_0/part_3/valid_0_3.csv


100%|██████████| 11076/11076 [00:20<00:00, 536.02it/s]
100%|██████████| 2769/2769 [00:05<00:00, 544.78it/s]


Saved data/lgbm_deberta/fold_0/part_4/train_0_4.pth with 15260 samples 
Saved data/lgbm_deberta/fold_0/part_4/valid_0_4.pth with 3784 samples 
Saved data/lgbm_deberta/fold_0/part_4/valid_0_4.csv


100%|██████████| 11076/11076 [00:20<00:00, 545.44it/s]
100%|██████████| 2770/2770 [00:05<00:00, 537.74it/s]


Saved data/lgbm_deberta/fold_2/part_0/train_2_0.pth with 15203 samples 
Saved data/lgbm_deberta/fold_2/part_0/valid_2_0.pth with 3822 samples 
Saved data/lgbm_deberta/fold_2/part_0/valid_2_0.csv


100%|██████████| 11077/11077 [00:20<00:00, 539.39it/s]
100%|██████████| 2769/2769 [00:05<00:00, 531.98it/s]


Saved data/lgbm_deberta/fold_2/part_1/train_2_1.pth with 15194 samples 
Saved data/lgbm_deberta/fold_2/part_1/valid_2_1.pth with 3831 samples 
Saved data/lgbm_deberta/fold_2/part_1/valid_2_1.csv


100%|██████████| 11077/11077 [00:20<00:00, 537.34it/s]
100%|██████████| 2769/2769 [00:05<00:00, 539.54it/s]


Saved data/lgbm_deberta/fold_2/part_2/train_2_2.pth with 15222 samples 
Saved data/lgbm_deberta/fold_2/part_2/valid_2_2.pth with 3803 samples 
Saved data/lgbm_deberta/fold_2/part_2/valid_2_2.csv


100%|██████████| 11077/11077 [00:20<00:00, 538.29it/s]
100%|██████████| 2769/2769 [00:05<00:00, 547.21it/s]


Saved data/lgbm_deberta/fold_2/part_3/train_2_3.pth with 15246 samples 
Saved data/lgbm_deberta/fold_2/part_3/valid_2_3.pth with 3779 samples 
Saved data/lgbm_deberta/fold_2/part_3/valid_2_3.csv


100%|██████████| 11077/11077 [00:20<00:00, 538.37it/s]
100%|██████████| 2769/2769 [00:05<00:00, 549.23it/s]


Saved data/lgbm_deberta/fold_2/part_4/train_2_4.pth with 15235 samples 
Saved data/lgbm_deberta/fold_2/part_4/valid_2_4.pth with 3790 samples 
Saved data/lgbm_deberta/fold_2/part_4/valid_2_4.csv


100%|██████████| 11076/11076 [00:20<00:00, 546.47it/s]
100%|██████████| 2770/2770 [00:05<00:00, 524.48it/s]


Saved data/lgbm_deberta/fold_4/part_0/train_4_0.pth with 15197 samples 
Saved data/lgbm_deberta/fold_4/part_0/valid_4_0.pth with 3851 samples 
Saved data/lgbm_deberta/fold_4/part_0/valid_4_0.csv


100%|██████████| 11077/11077 [00:20<00:00, 536.66it/s]
100%|██████████| 2769/2769 [00:05<00:00, 542.22it/s]


Saved data/lgbm_deberta/fold_4/part_1/train_4_1.pth with 15246 samples 
Saved data/lgbm_deberta/fold_4/part_1/valid_4_1.pth with 3802 samples 
Saved data/lgbm_deberta/fold_4/part_1/valid_4_1.csv


100%|██████████| 11077/11077 [00:20<00:00, 530.56it/s]
100%|██████████| 2769/2769 [00:05<00:00, 532.44it/s]


Saved data/lgbm_deberta/fold_4/part_2/train_4_2.pth with 15256 samples 
Saved data/lgbm_deberta/fold_4/part_2/valid_4_2.pth with 3792 samples 
Saved data/lgbm_deberta/fold_4/part_2/valid_4_2.csv


100%|██████████| 11077/11077 [00:20<00:00, 533.09it/s]
100%|██████████| 2769/2769 [00:05<00:00, 548.05it/s]


Saved data/lgbm_deberta/fold_4/part_3/train_4_3.pth with 15262 samples 
Saved data/lgbm_deberta/fold_4/part_3/valid_4_3.pth with 3786 samples 
Saved data/lgbm_deberta/fold_4/part_3/valid_4_3.csv


100%|██████████| 11077/11077 [00:20<00:00, 534.76it/s]
100%|██████████| 2769/2769 [00:05<00:00, 536.62it/s]


Saved data/lgbm_deberta/fold_4/part_4/train_4_4.pth with 15231 samples 
Saved data/lgbm_deberta/fold_4/part_4/valid_4_4.pth with 3817 samples 
Saved data/lgbm_deberta/fold_4/part_4/valid_4_4.csv


100%|██████████| 11076/11076 [00:20<00:00, 542.97it/s]
100%|██████████| 2769/2769 [00:05<00:00, 547.64it/s]


Saved data/lgbm_deberta/fold_1/part_0/train_1_0.pth with 15232 samples 
Saved data/lgbm_deberta/fold_1/part_0/valid_1_0.pth with 3789 samples 
Saved data/lgbm_deberta/fold_1/part_0/valid_1_0.csv


100%|██████████| 11076/11076 [00:20<00:00, 541.15it/s]
100%|██████████| 2769/2769 [00:05<00:00, 549.50it/s]


Saved data/lgbm_deberta/fold_1/part_1/train_1_1.pth with 15222 samples 
Saved data/lgbm_deberta/fold_1/part_1/valid_1_1.pth with 3799 samples 
Saved data/lgbm_deberta/fold_1/part_1/valid_1_1.csv


100%|██████████| 11076/11076 [00:20<00:00, 537.67it/s]
100%|██████████| 2769/2769 [00:05<00:00, 538.82it/s]


Saved data/lgbm_deberta/fold_1/part_2/train_1_2.pth with 15224 samples 
Saved data/lgbm_deberta/fold_1/part_2/valid_1_2.pth with 3797 samples 
Saved data/lgbm_deberta/fold_1/part_2/valid_1_2.csv


100%|██████████| 11076/11076 [00:22<00:00, 494.21it/s]
100%|██████████| 2769/2769 [00:05<00:00, 543.85it/s]


Saved data/lgbm_deberta/fold_1/part_3/train_1_3.pth with 15271 samples 
Saved data/lgbm_deberta/fold_1/part_3/valid_1_3.pth with 3750 samples 
Saved data/lgbm_deberta/fold_1/part_3/valid_1_3.csv


100%|██████████| 11076/11076 [00:20<00:00, 536.30it/s]
100%|██████████| 2769/2769 [00:05<00:00, 511.67it/s]


Saved data/lgbm_deberta/fold_1/part_4/train_1_4.pth with 15135 samples 
Saved data/lgbm_deberta/fold_1/part_4/valid_1_4.pth with 3886 samples 
Saved data/lgbm_deberta/fold_1/part_4/valid_1_4.csv
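
To confirm that a split is leak-free, one can check that no essay appears on both sides of any split. Below is a minimal sketch, assuming each row carries a unique identifier (for example an `essay_id` column, which is an assumption about the CSV); `find_leaks` is a hypothetical helper, and the toy identifiers stand in for real ones.

```python
def find_leaks(train_ids, valid_ids):
    """Return the identifiers that appear in both the training and validation sets."""
    return set(train_ids) & set(valid_ids)


# Toy identifiers standing in for essay ids (assumed column, not confirmed by the notebook).
outer_train = ["a", "b", "c", "d"]
outer_valid = ["e"]
inner_train = ["a", "b", "c"]
inner_valid = ["d"]

assert not find_leaks(outer_train, outer_valid)
assert not find_leaks(inner_train, inner_valid)
# The inner parts must also stay disjoint from the outer validation fold.
assert not find_leaks(inner_train + inner_valid, outer_valid)

# A deliberately leaky pair, to show what a failure looks like.
leaky = find_leaks(["a", "b"], ["b", "c"])
print(leaky)
```

With pandas, the same check could be run on the saved CSV files by passing the identifier columns of two frames.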
