# 05 - Data Preparation for Fine-Tuning IndoBERT

This notebook covers the data preparation steps for modeling with **IndoBERT**.
Unlike TF-IDF, which uses fully preprocessed text, IndoBERT needs more natural
text so that linguistic context is not lost.

This stage covers:

1. Reading the fully preprocessed dataset.
2. Selecting the **raw (`comment`)** text column for IndoBERT.
3. Preparing numeric labels.
4. Splitting the dataset into train, validation, and test sets.
5. Building a HuggingFace dataset.
6. Tokenizing with the IndoBERT tokenizer.
7. Converting the dataset to PyTorch format for training.

This notebook is the step that precedes fine-tuning IndoBERT in the next notebook.


----

## Import Libraries

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer

## Import Dataset

In [2]:
df = pd.read_csv("../data/dataset_preprocessed.csv")

## Use the Raw Comment as IndoBERT Input

In [3]:
df["text"] = df["comment"].astype(str)
df[["text", "sentiment", "label"]].head()

Unnamed: 0,text,sentiment,label
0,Yg benci ya apa aja salah.. \nYg seneng ya mak...,neutral,1
1,Bandung akan miliki kereta pajajaran dgn beaya...,neutral,1
2,SUDAH JELAS GENG SOLO YANG HARUS BERTANGGUNG J...,negative,0
3,"Jokowi, Luhut, kroni2 yg harus bertanggungjaw...",negative,0
4,Yg ditangkap gorengan yg makan duduk manis,neutral,1
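
The notebook reads an already numeric `label` column, so the label preparation step is not shown here. A minimal sketch of how such a mapping could be built, assuming a hypothetical `label2id` dictionary: `negative -> 0` and `neutral -> 1` match the preview above, while `positive -> 2` is an assumption that the notebook does not confirm.

```python
import pandas as pd

# Hypothetical mapping: "negative" -> 0 and "neutral" -> 1 match the preview
# above; "positive" -> 2 is an assumption, not confirmed by the notebook.
label2id = {"negative": 0, "neutral": 1, "positive": 2}

# Tiny stand-in frame; the real data comes from dataset_preprocessed.csv.
demo = pd.DataFrame({"sentiment": ["neutral", "negative", "positive", "neutral"]})

# Map sentiment strings to integer ids; unmapped strings would become NaN.
demo["label"] = demo["sentiment"].map(label2id)

# Fail early if any sentiment value was not covered by the mapping.
assert demo["label"].notna().all(), "unmapped sentiment value found"
demo["label"] = demo["label"].astype(int)

print(demo["label"].tolist())  # [1, 0, 2, 1]
```

Checking for unmapped values before casting keeps a typo in the sentiment column from silently turning into a missing label.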


## Drop NA and Duplicates If Any Remain

In [4]:
df = df.dropna(subset=["text", "label"]).reset_index(drop=True)
df = df.drop_duplicates(subset=["text"]).reset_index(drop=True)

print("Total data:", len(df))

Total data: 987


## Train-Validation-Test Split

In [5]:
# Split into train (80%) and temp (20%)
train_df, temp_df = train_test_split(
    df,
    test_size=0.2,
    random_state=42,
    stratify=df["label"]
)

# Split temp into validation (10%) and test (10%)
val_df, test_df = train_test_split(
    temp_df,
    test_size=0.5,
    random_state=42,
    stratify=temp_df["label"]
)

len(train_df), len(val_df), len(test_df)

(789, 99, 99)
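
The sizes above can be checked by hand. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up and gives the remaining rows to the other side, which is consistent with the counts printed above. The variable names are illustrative only.

```python
import math

n_total = 987  # rows left after dropping NA values and duplicates

# First split: the 20% holdout size is rounded up, train gets the remainder.
n_temp = math.ceil(0.2 * n_total)
n_train = n_total - n_temp

# Second split: half of the holdout becomes test, the rest is validation.
n_test = math.ceil(0.5 * n_temp)
n_val = n_temp - n_test

print(n_train, n_val, n_test)  # 789 99 99
```

Because of the rounding, the effective proportions are roughly 79.9% / 10.0% / 10.0% rather than exactly 80 / 10 / 10.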

## Prepare the HuggingFace Dataset

In [6]:
train_ds = Dataset.from_pandas(train_df[["text", "label"]])
val_ds   = Dataset.from_pandas(val_df[["text", "label"]])
test_ds  = Dataset.from_pandas(test_df[["text", "label"]])

datasets = DatasetDict({
    "train": train_ds,
    "validation": val_ds,
    "test": test_ds
})

datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 789
    })
    validation: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 99
    })
    test: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 99
    })
})

## Load the IndoBERT Tokenizer

In [8]:
model_name = "indobenchmark/indobert-base-p2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


## Batch Tokenization Function

In [9]:
def tokenize_batch(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=128
    )
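
To make the effect of `truncation=True` with `padding="max_length"` concrete, here is a simplified pure-Python sketch. It is not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, `pad_id=0` is an assumed padding id, and special tokens are ignored (the real tokenizer reserves room for them when truncating).

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Illustrative only: mimic truncation plus fixed-length padding."""
    # Keep at most max_length tokens.
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    # Fill the remaining positions with the (assumed) padding id.
    input_ids = kept + [pad_id] * n_pad
    # 1 marks real tokens, 0 marks padding positions.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short = pad_or_truncate(list(range(1, 6)), max_length=8)
long = pad_or_truncate(list(range(1, 12)), max_length=8)

print(short["input_ids"])       # [1, 2, 3, 4, 5, 0, 0, 0]
print(short["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
print(len(long["input_ids"]))   # 8
```

Every example ends up with exactly `max_length` positions, which is why all rows can later be stacked into fixed-size tensors.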

## Tokenize the Dataset

In [10]:
tokenized_ds = datasets.map(tokenize_batch, batched=True)
tokenized_ds

Map: 100%|██████████| 789/789 [00:00<00:00, 10564.59 examples/s]
Map: 100%|██████████| 99/99 [00:00<00:00, 8807.26 examples/s]
Map: 100%|██████████| 99/99 [00:00<00:00, 9046.14 examples/s]


DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 789
    })
    validation: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 99
    })
    test: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 99
    })
})

## Clean Up Columns & Rename Label

In [11]:
tokenized_ds = tokenized_ds.remove_columns(["text"])

tokenized_ds = tokenized_ds.rename_column("label", "labels")
tokenized_ds.set_format(type="torch")

tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['labels', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 789
    })
    validation: Dataset({
        features: ['labels', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 99
    })
    test: Dataset({
        features: ['labels', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 99
    })
})

## Save HuggingFace Dataset

In [12]:
tokenized_ds.save_to_disk("../data/indobert_tokenized/")

Saving the dataset (1/1 shards): 100%|██████████| 789/789 [00:00<00:00, 104655.32 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 99/99 [00:00<00:00, 19070.27 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 99/99 [00:00<00:00, 14115.99 examples/s]
