# QnA GPT-2

## Install Dependencies

List Modules:
*   `evaluate`: Untuk metrik evaluasi.
*   `transformers`: Pustaka dari Hugging Face untuk model-model Transformer, termasuk GPT-2.
*   `datasets`: Pustaka dari Hugging Face untuk memuat dan memanipulasi dataset.
*   `scikit-learn`: Untuk utilitas machine learning umum (meskipun mungkin tidak banyak digunakan secara langsung di sini, seringkali menjadi dependensi).
*   `pandas`: Untuk manipulasi data tabular.
*   `torch`: Pustaka PyTorch untuk deep learning.
*   `accelerate`: Pustaka dari Hugging Face untuk memudahkan pelatihan pada berbagai perangkat keras (CPU, GPU, TPU, multi-GPU).
*   `rouge_score`: Untuk menghitung skor ROUGE, metrik evaluasi yang umum digunakan dalam tugas peringkasan dan generasi teks

In [1]:
!pip install -q evaluate transformers datasets scikit-learn pandas torch accelerate rouge_score

  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m4.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m127.9/127.9 MB[0m [31m13.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m207.5/207.5 MB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m21.1/21.1 MB[0m [31m73.7 MB/s[0m eta 

In [2]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer, GPT2LMHeadModel, get_scheduler, AutoTokenizer
from torch.optim import AdamW
from datasets import load_dataset
import os
import evaluate
from tqdm.auto import tqdm
import json

2025-05-11 11:10:22.975083: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1746961823.145034      19 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746961823.197098      19 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## Data Preparation

### 1. `load_data_split(...)`
Memuat dataset dari Hugging Face (`lib3m/lib3m_qa_dataset_v1`), filter bahasa (`lang`), ambil `row` data, lalu split ke `train` dan `test` (default 80:20).

---

### 2. `class QADataset(Dataset)`
Dataset PyTorch untuk generative QA.

- `__init__`: Simpan data, tokenizer, dan `max_length`.
- `__getitem__`:
  - Format: `"<question>...<context>...<answer>"`
  - Tokenisasi input dan jawaban (dengan padding/truncation).
  - Ganti padding label jadi `-100` (agar tidak dihitung saat training).
  - Return: `input_ids`, `attention_mask`, `labels`.

In [3]:
def load_data_split(dataset_name="lib3m/lib3m_qa_dataset_v1", split="train", lang="en", test_size=0.2, row=100000):
    raw = load_dataset(dataset_name, split=split)
    dataset = raw.filter(lambda x: x['language'] == lang).select(range(row))
    dataset = dataset.train_test_split(test_size=test_size)
    return dataset

class QADataset(Dataset):
    def __init__(self, dataset, tokenizer, max_length=256):
        self.data = dataset
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        question = item['question']
        content = item['content'][:300]  # take 300 first characters
        answer = item['answer']

        # Format for generative QA
        text = f"<question> {question} <context> {content} <answer>"
        tokenized = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        
        labels = self.tokenizer(
            answer,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        ).input_ids
        
        labels[labels == self.tokenizer.pad_token_id] = -100
        
        return {
            'input_ids': tokenized.input_ids.squeeze(),
            'attention_mask': tokenized.attention_mask.squeeze(),
            'labels': labels.squeeze()
        }

## Load Model & Tokenizer

- Gunakan model GPT-2 (`gpt2`) dan tokenizer-nya.
- Set padding di **kiri** dan samakan `pad_token` dengan `eos_token` (GPT-2 tidak punya pad token default).
- Tambahkan token khusus: `<question>`, `<context>`, `<answer>`.
- Load `GPT2LMHeadModel` dan **resize** embedding agar cocok dengan tokenizer baru.

In [4]:
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

tokenizer.add_special_tokens({'additional_special_tokens': ['<question>', '<context>', '<answer>']})

model = GPT2LMHeadModel.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Embedding(50260, 768)

## Training Config

Parameter penting untuk pelatihan model GPT-2 pada tugas QA adalah sebagai berikut:

- **`BATCH_SIZE`** sampel yang diproses per iterasi, dengan nilai `8` agar pelatihan efisien (dan tidak OOM).
- **`EPOCHS`** menentukan jumlah iterasi  `2`.
- **`MAX_LEN`** membatasi panjang input menjadi `256` token, menjaga performa dan efisiensi.
- **`MULTI_GPU`** mengaktifkan pelatihan paralel jika lebih dari satu GPU tersedia.

In [5]:
MODEL_DIR = '/kaggle/working/gpt2_qna_model'
BATCH_SIZE = 8
EPOCHS = 2
LR = 5e-5
MAX_LEN = 256 
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
MULTI_GPU = torch.cuda.device_count() > 1

## Data Preparation
Sel kode ini memuat dan mempersiapkan dataset untuk pelatihan dan validasi model GPT-2 QA:

1. **`load_data_split(row=100000)`**: Memuat dataset, memfilter data berbahasa Inggris, memilih 100.000 baris, dan membagi data menjadi pelatihan dan pengujian (80:20).
2. **`train_loader = DataLoader(...)`**: Membuat DataLoader untuk pelatihan dengan pengaturan batch, pengacakan data, dan pemrosesan paralel.

In [6]:
dataset = load_data_split(row=100000) 

train_dataset = QADataset(dataset['train'], tokenizer, max_length=MAX_LEN)
val_dataset = QADataset(dataset['test'], tokenizer, max_length=MAX_LEN)

train_loader = DataLoader(
    train_dataset, 
    batch_size=BATCH_SIZE, 
    shuffle=True, 
    pin_memory=True,
    num_workers=os.cpu_count()//2 if os.cpu_count() else 1
)

val_loader = DataLoader(
    val_dataset, 
    batch_size=BATCH_SIZE, 
    pin_memory=True,
    num_workers=os.cpu_count()//2 if os.cpu_count() else 1
)

README.md:   0%|          | 0.00/9.64k [00:00<?, ?B/s]

qa_pairs.parquet:   0%|          | 0.00/724M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/337525 [00:00<?, ? examples/s]

Filter:   0%|          | 0/337525 [00:00<?, ? examples/s]

## Training
Training dilakukan dengan parameter yang sudah disiapkan + optimizer (AdamW)

In [7]:
if MULTI_GPU:
    model = torch.nn.DataParallel(model)
model = model.to(DEVICE)

optimizer = AdamW(model.parameters(), lr=LR)
scheduler = get_scheduler(
    "linear", 
    optimizer=optimizer, 
    num_warmup_steps=0,
    num_training_steps=EPOCHS * len(train_loader)
)

os.makedirs(MODEL_DIR, exist_ok=True)

for epoch in range(EPOCHS):
    model.train()
    total_loss = 0
    loop = tqdm(train_loader, desc=f"Epoch {epoch+1}/{EPOCHS}", leave=False)
    
    for batch in loop:
        input_ids = batch['input_ids'].to(DEVICE)
        attention_mask = batch['attention_mask'].to(DEVICE)
        labels = batch['labels'].to(DEVICE)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss.mean() if MULTI_GPU else outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{EPOCHS}, Loss: {avg_loss:.4f}")
# Save final model
torch.save((model.module if MULTI_GPU else model).state_dict(), f"{MODEL_DIR}/final.pt")
tokenizer.save_pretrained(MODEL_DIR)

Epoch 1/2:   0%|          | 0/10000 [00:00<?, ?it/s]

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Epoch 1/2, Loss: 6.1818


Epoch 2/2:   0%|          | 0/10000 [00:00<?, ?it/s]

Epoch 2/2, Loss: 5.9928


('/kaggle/working/gpt2_qna_model/tokenizer_config.json',
 '/kaggle/working/gpt2_qna_model/special_tokens_map.json',
 '/kaggle/working/gpt2_qna_model/vocab.json',
 '/kaggle/working/gpt2_qna_model/merges.txt',
 '/kaggle/working/gpt2_qna_model/added_tokens.json')

*   `stdout`: `Epoch 2/2, Loss: 5.9928`
    Ini adalah rata-rata *loss* pada akhir epoch kedua. Terlihat ada penurunan *loss* dari epoch pertama ke epoch kedua (dari 6.1818 menjadi 5.9928), yang mengindikasikan bahwa model terus belajar dan meningkatkan performanya pada data pelatihan

## Evaluation
Sel kode ini bertujuan untuk mengevaluasi performa model GPT-2 QA yang telah dilatih pada set data validasi. Evaluasi dilakukan menggunakan metrik ROUGE (Recall-Oriented Understudy for Gisting Evaluation), yang umum digunakan untuk tugas generasi teks seperti peringkasan dan QA.

In [8]:
metric_rouge = evaluate.load('rouge')

model.eval()
preds, refs = [], []
eval_progress_bar = tqdm(val_loader, desc="Evaluating")

for batch in eval_progress_bar:
    input_ids = batch['input_ids'].to(DEVICE)
    attention_mask = batch['attention_mask'].to(DEVICE)
    labels = batch['labels']

    with torch.no_grad():
        # use model or model.module based on whether we have multi-GPU
        generator = model if not MULTI_GPU else model.module
        generated_outputs = generator.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=100,
            num_beams=1,
            early_stopping=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    for i in range(generated_outputs.shape[0]):
        prompt_length = torch.sum(attention_mask[i]).item()
        generated_sequence = generated_outputs[i]
        
        decoded_pred_full = tokenizer.decode(generated_sequence, skip_special_tokens=True)
        
        input_text_decoded = tokenizer.decode(input_ids[i][:prompt_length], skip_special_tokens=True)
        
        if decoded_pred_full.startswith(input_text_decoded):
            decoded_pred = decoded_pred_full[len(input_text_decoded):].strip()
        else:
            decoded_pred = decoded_pred_full
        
        answer_start_token = "<answer>"
        if answer_start_token in decoded_pred:
            decoded_pred = decoded_pred.split(answer_start_token, 1)[-1].strip()

        valid_labels = labels[i][labels[i] != -100]
        decoded_ref = tokenizer.decode(valid_labels, skip_special_tokens=True).strip()
        
        preds.append(decoded_pred)
        refs.append(decoded_ref)

results = metric_rouge.compute(predictions=preds, references=refs)
print("Evaluation (ROUGE):", results)

with open(os.path.join(MODEL_DIR, "rouge_scores.json"), 'w') as f:
    json.dump(results, f, indent=4)
print(f"ROUGE scores saved to {os.path.join(MODEL_DIR, 'rouge_scores.json')}")

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Evaluating:   0%|          | 0/2500 [00:00<?, ?it/s]



Evaluation (ROUGE): {'rouge1': 0.23877995790219309, 'rouge2': 0.0961991873303197, 'rougeL': 0.17105692606665626, 'rougeLsum': 0.17768095605213458}
ROUGE scores saved to /kaggle/working/gpt2_qna_model/rouge_scores.json


*   `rouge1`: Mengukur tumpang tindih unigram (kata tunggal). Skor sekitar 0.239 menunjukkan adanya kesamaan kata-kata individual antara prediksi dan referensi.
*   `rouge2`: Mengukur tumpang tindih bigram (pasangan kata berurutan). Skor sekitar 0.096, yang lebih rendah dari `rouge1`, adalah hal yang wajar dan menunjukkan bahwa mencocokkan frasa dua kata lebih sulit.
*   `rougeL`: Mengukur *Longest Common Subsequence* (LCS), yaitu urutan kata terpanjang yang sama antara prediksi dan referensi, tanpa harus berurutan secara ketat. Skor sekitar 0.171.
*   `rougeLsum`: Sama seperti `rougeL`, tetapi dihitung per kalimat dan kemudian dirata-ratakan (jika jawaban terdiri dari beberapa kalimat). Skor sekitar 0.178.
Nilai-nilai ROUGE ini (antara 0 dan 1, di mana 1 adalah sempurna) memberikan gambaran kuantitatif tentang seberapa baik model menghasilkan jawaban yang mirip dengan jawaban target.

## Testing
gambaran kualitatif tentang bagaimana model merespons input tertentu.

In [9]:
sample = dataset['test'][10]
question = sample['question']
context = sample['content']
actual_answer = sample['answer']

prompt = f"<question> {question} <context> {context[:300]} <answer>"

inputs = tokenizer(
    prompt,
    return_tensors='pt',
    truncation=True,
    max_length=MAX_LEN - 100
).to(DEVICE)

# Generate answer
model.eval()
with torch.no_grad():
    generator = model if not MULTI_GPU else model.module
    generated_outputs = generator.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_new_tokens=100,
        num_beams=5,
        early_stopping=True,
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

generated_answer_full = tokenizer.decode(generated_outputs[0], skip_special_tokens=True)

# Extract just the answer part
generated_answer = generated_answer_full
if "<answer>" in generated_answer_full:
    generated_answer = generated_answer_full.split("<answer>")[-1].strip()

print("Question:", question)
print("Context Snippet:", context[:200] + "...")
print("\nPrompt for model:", prompt)
print("\nActual Answer:", actual_answer)
print("\nGenerated Answer:", generated_answer)

Question: How do the policy views of Governor Bernanke compare to those of Alan Greenspan, and what does this indicate about Federal Reserve policy continuity?
Context Snippet: # Chapter 22 Did Greenspan Deserve Support For Another Term?
## Conclusion

Thus, we have yet another in a string of performances by the Master of Illusion that has flopped badly and should have disqu...

Prompt for model: <question> How do the policy views of Governor Bernanke compare to those of Alan Greenspan, and what does this indicate about Federal Reserve policy continuity? <context> # Chapter 22 Did Greenspan Deserve Support For Another Term?
## Conclusion

Thus, we have yet another in a string of performances by the Master of Illusion that has flopped badly and should have disqualified him from consideration for another term. Unfortunately, Greenspan's departure from the stage <answer>

Actual Answer: Despite their differing personal styles, Governor Bernanke's views have become increasingly similar to 

model kesulitan menghasilkan jawaban yang bermakna dan relevan. Meskipun parameter `no_repeat_ngram_size=2` digunakan untuk mengurangi repetisi, kualitas inti dari generasi teks masih rendah