# 📊 Data Preparation

This notebook prepares the data for training the deep learning model. At a high level, it performs the following tasks:
1. Split the dataset with stratified group k-fold
2. Tokenize
3. Create a Hugging Face `Dataset`

## ⚙️ Setup 

### 📚 Importing Libraries

Importing standard-library and third-party packages

In [1]:
import os
import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

In [2]:
os.chdir("../")

Importing user-defined packages

In [3]:
from lib.utils import seed_everything
from lib.config import Config
from lib.paths import Paths

### 🌱 Setting Random Seeds

In [4]:
seed_everything(Config.RANDOM_SEED)
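
`seed_everything` comes from `lib.utils`, and its body is not shown in this notebook. As a rough illustration only, helpers of this kind usually look something like the sketch below. The name `seed_everything_sketch` is hypothetical, and the real helper may seed additional libraries (for example a deep learning framework).

```python
import os
import random


def seed_everything_sketch(seed: int) -> None:
    """Hypothetical stand-in for lib.utils.seed_everything (an assumption)."""
    # Only affects hash randomization in subprocesses started after this point
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Seed Python's built-in random number generator
    random.seed(seed)
    try:
        # Seed NumPy's global generator when NumPy is installed
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass


# Re-seeding with the same value reproduces the same random draws
seed_everything_sketch(42)
first_draw = random.random()
seed_everything_sketch(42)
second_draw = random.random()
```

After re-seeding, `first_draw` and `second_draw` are equal, which is the property the rest of the notebook relies on for reproducible splits.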

### 💽 Loading Data

In [5]:
train_df = pd.read_csv(Paths.TRAIN_CSV_PATH)
test_df = pd.read_csv(Paths.TEST_CSV_PATH)

train_df.shape, test_df.shape

((17307, 5), (3, 4))

## ✂️ Train-Validation Splitting

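Use `StratifiedGroupKFold` to split `train_df` into `Config.N_FOLDS` folds. It balances the `score` distribution across folds while keeping every group inside a single fold.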

In [6]:
skf = StratifiedGroupKFold(
    n_splits=Config.N_FOLDS,
    shuffle=True,
    random_state=Config.RANDOM_SEED,
)

Separate the features `X` and the labels `y`.

In [7]:
X, y = train_df["full_text"], train_df["score"]

`groups` are determined by topic. The topic assignment was done in **eda.ipynb**.

In [None]:
groups = train_df["topic"]

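Assign a fold number to each row of `train_df`. A row's fold is the split in which it serves as validation data.

In [8]:
# -1 marks rows that have not been assigned to a fold yet
train_df["fold"] = -1

# valid_idx is positional; it matches .loc labels only because
# train_df keeps the default RangeIndex from read_csv
for i, (_, valid_idx) in enumerate(skf.split(X, y, groups)):
    train_df.loc[valid_idx, "fold"] = i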

Distribution of rows across the `Config.N_FOLDS` folds.

In [9]:
train_df["fold"].value_counts()

fold
4    5459
2    4691
3    3017
0    2094
1    2046
Name: count, dtype: int64
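
The fold sizes above range from 2,046 to 5,459 rows. Uneven sizes are expected with a grouped split, since each topic must stay whole, so it is worth checking whether the `score` distribution is still comparable across folds. The sketch below shows one way to do that with `pd.crosstab`. The tiny `toy_df` is made up for illustration and does not reflect the real data.

```python
import pandas as pd

# Made-up fold/score assignments, standing in for train_df[["fold", "score"]]
toy_df = pd.DataFrame(
    {
        "fold": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
        "score": [1, 2, 2, 3, 1, 1, 2, 2, 3, 3],
    }
)

# Number of rows per fold, ordered by fold number
fold_sizes = toy_df["fold"].value_counts().sort_index()

# Share of each score within a fold; every row of the table sums to 1
score_share = pd.crosstab(toy_df["fold"], toy_df["score"], normalize="index")
```

On the real data, rows of `score_share` that differ sharply from one another would suggest that stratification was limited by the small number of groups.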