# 📊 Data Preparation

This notebook prepares the data for training the deep learning model. At a high level, it performs the following tasks:
1. Split the dataset with stratified group k-fold
2. Tokenize the text
3. Create a Hugging Face `Dataset`

## ⚙️ Setup 

### 📚 Importing Libraries

Import third-party packages.

In [1]:
import os
import pandas as pd

from sklearn.model_selection import StratifiedGroupKFold
from transformers import AutoTokenizer

In [2]:
# Move up one directory so that the `lib` package imported below can be resolved
os.chdir("../")

Import user-defined modules.

In [3]:
from lib.utils import seed_everything
from lib.config import Config
from lib.paths import Paths

### 🌱 Setting Random Seeds

In [4]:
seed_everything(Config.RANDOM_SEED)
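The body of `seed_everything` lives in `lib.utils` and is not shown in this notebook. As a rough sketch only, a helper of this kind usually seeds every random number generator in use. The function name `seed_everything_sketch` and the choice of generators below are assumptions, not the project's actual implementation; a deep learning setup would typically also seed its framework (for example with `torch.manual_seed`).

```python
import os
import random

import numpy as np


def seed_everything_sketch(seed: int) -> None:
    """Hypothetical stand-in for lib.utils.seed_everything (assumed behavior)."""
    # Setting this at runtime only affects Python subprocesses started later.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)     # Python's built-in RNG
    np.random.seed(seed)  # NumPy's legacy global RNG


# Re-seeding with the same value reproduces the same draws.
seed_everything_sketch(42)
first = (random.random(), np.random.rand())
seed_everything_sketch(42)
second = (random.random(), np.random.rand())
```

After both calls, `first` and `second` hold identical values, which is the property the notebook relies on for reproducible splits.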

### 💽 Loading Data

In [5]:
train_df = pd.read_csv(Paths.TRAIN_CSV_PATH)
test_df = pd.read_csv(Paths.TEST_CSV_PATH)

train_df.shape, test_df.shape

((17307, 5), (3, 4))

## ✂️ Train-Validation Splitting

Use `StratifiedGroupKFold` to split `train_df` into `Config.N_FOLDS` folds.

Sources:
1. [MOTH's Notebook](https://www.kaggle.com/code/alejopaullier/aes-2-multi-class-classification-train?scriptVersionId=170290107&cellId=12)

In [6]:
skf = StratifiedGroupKFold(
    n_splits=Config.N_FOLDS,
    shuffle=True,
    random_state=Config.RANDOM_SEED,
)

Separate the feature `X` and the labels `y`.

In [7]:
X, y = train_df["full_text"], train_df["score"]

The `groups` are given by the `topic` column, which was assigned in **eda.ipynb**.

In [8]:
groups = train_df["topic"]

Assign a fold number to each row of `train_df`.

In [9]:
train_df["fold"] = -1

for i, (train_idx, valid_idx) in enumerate(skf.split(X, y, groups)):
    train_df.loc[valid_idx, "fold"] = i
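One detail worth noting: `split` yields positional indices, while `.loc` selects by label. The two coincide here because `pd.read_csv` without `index_col` produces a default `RangeIndex`. If the dataframe had been filtered or re-indexed first, a positional assignment would be the safer choice. The sketch below uses a made-up `toy_df` with a non-default index to show that variant.

```python
import pandas as pd

# Hypothetical dataframe whose index labels do not match row positions.
toy_df = pd.DataFrame({"score": [1, 2, 1, 2]}, index=[10, 11, 12, 13])
toy_df["fold"] = -1

valid_idx = [1, 3]  # row positions, in the form a splitter would return them

# Assign by position instead of by label.
fold_col = toy_df.columns.get_loc("fold")
toy_df.iloc[valid_idx, fold_col] = 0
```

The rows at positions 1 and 3 (labels 11 and 13) receive fold 0, and the remaining rows keep the placeholder value -1.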

Distribution of rows across the `Config.N_FOLDS` folds. The folds differ in size because each topic is kept whole within a single fold.

In [10]:
train_df["fold"].value_counts()

fold
4    5459
2    4691
3    3017
0    2094
1    2046
Name: count, dtype: int64
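Because fold sizes vary this much, it can help to check how the score distribution looks inside each fold as well. The following is a small sketch on a made-up `toy_df`; only the column names `score` and `fold` come from this notebook, and the same two calls could be applied to `train_df`.

```python
import pandas as pd

# Hypothetical data with the same column names as train_df.
toy_df = pd.DataFrame(
    {
        "score": [1, 2, 2, 1, 2, 1, 1, 2],
        "fold": [0, 0, 0, 1, 1, 2, 2, 2],
    }
)

# Number of rows per fold, ordered by fold number rather than by count.
fold_sizes = toy_df["fold"].value_counts().sort_index()

# Share of each score within each fold; every row of the table sums to 1.
score_share_by_fold = pd.crosstab(
    toy_df["fold"], toy_df["score"], normalize="index"
)
```

Similar rows in `score_share_by_fold` suggest the stratification held up despite the grouping constraint; large differences suggest some folds will be harder to validate on than others.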

## 🪙 Tokenizer

Sources:
1. [MOTH's Notebook](https://www.kaggle.com/code/alejopaullier/aes-2-multi-class-classification-train?scriptVersionId=170290107&cellId=14)

In [11]:
toknizer = AutoTokenizer.from_pretrained(Config.MODEL)
toknizer.save_pretrained(Paths.TOKENIZER_PATH)



('output/tokenizer/tokenizer_config.json',
 'output/tokenizer/special_tokens_map.json',
 'output/tokenizer/spm.model',
 'output/tokenizer/added_tokens.json',
 'output/tokenizer/tokenizer.json')

In [12]:
print(toknizer)

DebertaV2TokenizerFast(name_or_path='microsoft/deberta-v3-base', vocab_size=128000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	128000: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
