## Installing Libraries

In [1]:
!pip install -q transformers datasets evaluate accelerate bitsandbytes peft torch rouge_score trl


## Importing Libraries

In [2]:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, BitsAndBytesConfig, DataCollatorForSeq2Seq
from peft import LoraConfig, get_peft_model, PeftModel
from datasets import load_dataset  # the Hugging Face Dataset class is imported below as HFDataset to avoid clashing with torch's Dataset
import pandas as pd
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from rouge_score import rouge_scorer
import evaluate
from tqdm import tqdm
from trl import SFTTrainer, SFTConfig
from datasets import Dataset as HFDataset




### Reduce GPU memory fragmentation by letting the allocator expand its memory segments

In [3]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

## Dataset & Metric

1. load_data: Loads the dataset from Hugging Face, converts it to a Pandas DataFrame, and filters the rows by language (default: English).
2. split_dataframe: Splits the DataFrame into a training set (80%) and a validation set (20%) using train_test_split with random_state 42.
3. format_data: Formats each row into a single string of the form \<question> ... \<context> ... \<answer> ..., with the context truncated to 300 characters for efficiency.

In [None]:
# Load the dataset and keep only the rows in the requested language
def load_data(dataset_name: str = "lib3m/lib3m_qa_dataset_v1", split: str = "train", lang: str = "en") -> pd.DataFrame:
    ds = load_dataset(dataset_name, split=split)
    df = ds.to_pandas()
    df = df[df.language == lang].reset_index(drop=True)
    return df

def split_dataframe(df, test_size: float = 0.2, random_state: int = 42) -> tuple:
    train_df, val_df = train_test_split(df, test_size=test_size, random_state=random_state, shuffle=True)
    return train_df.reset_index(drop=True), val_df.reset_index(drop=True)

# Format one row into a single training string for generative QA
def format_data(row):
    question = row['question']
    content = row['content'][:300]  # Limit context to 300 characters
    answer = row['answer']
    text = f"<question> {question} <context> {content} <answer> {answer}"
    return {"text": text}
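
To make the prompt layout concrete, the sketch below rebuilds the same string on a hand-made row. The row contents, the `format_row` helper, and the `MAX_CONTEXT_CHARS` name are illustrative assumptions and do not come from the dataset; only the tag layout and the 300-character cut-off mirror `format_data` above.

```python
MAX_CONTEXT_CHARS = 300  # same cut-off as format_data (name is hypothetical)

def format_row(row: dict) -> dict:
    """Mirror of format_data for a plain dict (hypothetical helper)."""
    context = row["content"][:MAX_CONTEXT_CHARS]
    text = f"<question> {row['question']} <context> {context} <answer> {row['answer']}"
    return {"text": text}

# Made-up row, only for illustration
toy_row = {
    "question": "What is LoRA?",
    "content": "LoRA adds small low-rank matrices to frozen weights. " * 20,
    "answer": "A parameter-efficient fine-tuning method.",
}

formatted = format_row(toy_row)
print(formatted["text"][:80])
```

Because the answer is appended after the `<answer>` tag, the same prefix without the answer can later be used as the generation prompt.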

## Setup Dataset, Model, and Training Config

1. Load and process the dataset: Loads the dataset with load_data, keeps the first 100,000 rows, and splits them into training and validation sets.
2. Load the model and tokenizer: Uses the Qwen/Qwen2.5-1.5B model from Hugging Face together with its tokenizer. The model is loaded with 4-bit quantization (BitsAndBytesConfig) to reduce GPU memory usage.
3. Configure LoRA: Applies LoRA (Low-Rank Adaptation) for fine-tuning. LoRA freezes the base weights and trains only small low-rank adapter matrices, which cuts the number of trainable parameters so that training fits on the GPU.
4. Create the Hugging Face datasets: Converts the DataFrames to Hugging Face Dataset objects, formats each row with format_data, and drops every column except text.
5. Training configuration (SFT): Sets the training hyperparameters with SFTConfig.
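
As a rough illustration of how few parameters LoRA trains, the sketch below counts the adapter weights for rank 8 on `q_proj` and `v_proj`. The layer shapes (hidden size 1536, key/value projection width 256, 28 layers) are assumptions about Qwen2.5-1.5B that should be checked against the model's `config.json`; `lora_param_count` is a hypothetical helper, not a PEFT function.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters LoRA adds to one linear layer: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

# Assumed shapes for Qwen2.5-1.5B (verify against config.json)
HIDDEN = 1536      # q_proj maps hidden -> hidden
KV_DIM = 256       # v_proj maps hidden -> 2 KV heads x 128 head dim
NUM_LAYERS = 28
R, ALPHA = 8, 32   # values used in this notebook's LoraConfig

per_layer = lora_param_count(HIDDEN, HIDDEN, R) + lora_param_count(HIDDEN, KV_DIM, R)
total = per_layer * NUM_LAYERS
scaling = ALPHA / R  # LoRA scales the adapter update by alpha / r

print(f"LoRA params per layer: {per_layer:,}")
print(f"LoRA params in total:  {total:,}")
print(f"Scaling factor alpha/r: {scaling}")
```

Under these assumptions the adapters add roughly one million trainable parameters, a small fraction of the 1.5B frozen base weights.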

In [5]:
# Load the dataset
df = load_data()
df = df[:100000]
train_df, val_df = split_dataframe(df)

# Load the tokenizer and the 4-bit quantized model
model_name = "Qwen/Qwen2.5-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)
base_model = AutoModelForCausalLM.from_pretrained(model_name, 
                                             device_map="auto", 
                                             quantization_config=quant_config)

# Configure LoRA adapters on top of the 4-bit base model (QLoRA)
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(base_model, lora_config)

# Build the Hugging Face datasets
train_dataset = HFDataset.from_pandas(train_df)
val_dataset = HFDataset.from_pandas(val_df)

train_dataset = train_dataset.map(format_data)
val_dataset = val_dataset.map(format_data)

train_dataset = train_dataset.remove_columns([col for col in train_dataset.column_names if col != 'text'])
val_dataset = val_dataset.remove_columns([col for col in val_dataset.column_names if col != 'text'])

# Set the training arguments
sft_config = SFTConfig(
    output_dir="/kaggle/working/qwen_model",
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    num_train_epochs=2,
    weight_decay=0.01,
    fp16=True,  # Mixed precision to save VRAM
    gradient_accumulation_steps=4,  # Accumulate gradients over 4 steps to keep VRAM usage low
    logging_steps=10,
    save_steps=200,
    save_total_limit=2,
    report_to="none",
    max_seq_length=256,  # Maximum tokens per example; adjust to the dataset
    dataset_text_field="text"  # Field that holds the training text
)
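
The batch settings above imply an effective batch size and a total step count that can be worked out by hand. The sketch below does that arithmetic, assuming a single GPU and the 80,000-row training split; the exact number the trainer reports may differ slightly because of how it rounds the last partial batch.

```python
import math

train_rows = 80_000        # 80% of the 100,000 rows kept
per_device_batch = 8       # per_device_train_batch_size
grad_accum = 4             # gradient_accumulation_steps
epochs = 2                 # num_train_epochs
num_devices = 1            # assumption: one GPU

# Samples that contribute to each optimizer update
effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"optimizer steps in total: {total_steps}")
```

With `save_steps=200` and `save_total_limit=2`, only the two most recent checkpoints from such a run are kept on disk.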


## Training

1. Initialize the trainer: Creates an SFTTrainer object.
2. Run training: trainer.train() runs the training loop for 2 epochs.
3. torch.cuda.empty_cache() releases cached GPU memory once training has finished.
4. Save the model and tokenizer after training.

In [6]:
# Create the trainer
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Train the model
trainer.train()

torch.cuda.empty_cache()

# Save the model and tokenizer
trainer.save_model("/kaggle/working/qwen_model")
tokenizer.save_pretrained("/kaggle/working/qwen_model")



| Step | Training Loss |
|------|---------------|
| 10   | 2.7958        |
| 20   | 2.7023        |
| 30   | 2.7073        |
| 40   | 2.5787        |
| 50   | 2.5535        |
| 60   | 2.4888        |
| 70   | 2.4865        |
| 80   | 2.4353        |
| 90   | 2.4300        |
| 100  | 2.4608        |
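
One way to read the logged losses is to convert them to perplexity. The sketch below assumes the logged value is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not stated in the log itself; `perplexity` is a hypothetical helper.

```python
import math

# A few of the training losses logged above (step -> loss)
logged_losses = {10: 2.7958, 50: 2.5535, 100: 2.4608}

def perplexity(mean_ce_loss: float) -> float:
    """Perplexity is exp(mean cross-entropy) when the loss is in nats."""
    return math.exp(mean_ce_loss)

for step, loss in logged_losses.items():
    print(f"step {step:>3}: loss={loss:.4f}  perplexity~{perplexity(loss):.1f}")
```

A falling loss corresponds to a falling perplexity, so the downward trend over the first 100 steps suggests the adapters are picking up the prompt format.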


('/kaggle/working/qwen_model/tokenizer_config.json',
 '/kaggle/working/qwen_model/special_tokens_map.json',
 '/kaggle/working/qwen_model/vocab.json',
 '/kaggle/working/qwen_model/merges.txt',
 '/kaggle/working/qwen_model/added_tokens.json',
 '/kaggle/working/qwen_model/tokenizer.json')

## Evaluation

1. rouge = evaluate.load("rouge"): Loads the ROUGE metric from the evaluate library to score the text generated by the model against the reference answers. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the overlap between generated and reference text using n-grams (ROUGE-1, ROUGE-2) and the longest common subsequence (ROUGE-L, ROUGE-Lsum).
2. Initialize two empty lists to hold the predicted answers and the reference answers.
3. model.eval(): Puts the model in evaluation mode.
4. batch_size = 16: Processes 16 samples at a time to reduce computation time.
5. Loop with tqdm: Iterates over the validation set in batches, using tqdm to display a progress bar.
    - batch_df = val_df[i:i+batch_size]: Takes the slice of the validation DataFrame for the current batch.
    - input_texts: Builds the list of input prompts in the form \<question> ... \<context> ... \<answer> for every row in the batch.
    - true_answers: Stores the reference answers from the answer column for the current batch.
    - inputs = tokenizer(...): Converts the input texts to PyTorch tensors and moves them to the GPU.
    - with torch.no_grad(): Disables gradient computation to save memory during evaluation.
    - autocast: Runs generation under mixed precision to speed up computation and reduce memory usage.
    - model.generate(...): Generates the answer text.
    - tokenizer.batch_decode(outputs, skip_special_tokens=True): Converts the model output back to text and strips special tokens.
    - Post-process the generated text, keeping only the part that follows the \<answer> tag.
    - Append the predictions and reference answers of each batch to the lists.
6. rouge.compute(predictions=predictions, references=references, use_stemmer=True): Computes the ROUGE scores for all predictions against the reference answers. The use_stemmer=True option enables stemming (reducing words to their base form, e.g. "running" to "run") so that inflected forms of the same word count as matches.
7. The generate_answer function produces an answer from the model given a question and a context.
8. for i in range()...: Runs the model on the first three samples of the validation set to inspect answer quality.
9. model.save_pretrained("/kaggle/working/qwen_model"), tokenizer.save_pretrained("/kaggle/working/qwen_model"): Save the model and tokenizer.
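
To show what a ROUGE-1 score measures, the sketch below computes a simplified unigram-overlap F1 in plain Python. It assumes whitespace tokenization and lowercasing only, with no stemming and no aggregation across samples, so its numbers will not match the `evaluate` implementation exactly; `rouge1_f1` is a hypothetical helper written for illustration.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Made-up sentence pair: 5 of 6 unigrams overlap
score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(f"simplified ROUGE-1 F1: {score:.4f}")
```

ROUGE-2 applies the same idea to bigrams, while ROUGE-L replaces the n-gram overlap with the longest common subsequence.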


In [8]:
rouge = evaluate.load("rouge")
predictions = []
references = []

model.eval()
batch_size = 16  # Batch size for evaluation
for i in tqdm(range(0, len(val_df), batch_size), desc="Evaluating"):
    batch_df = val_df[i:i+batch_size]
    input_texts = [f"<question> {row['question']} <context> {row['content'][:300]} <answer>" for _, row in batch_df.iterrows()]
    true_answers = [row['answer'] for _, row in batch_df.iterrows()]
    
    inputs = tokenizer(input_texts, return_tensors="pt", truncation=True, max_length=256, padding=True).to("cuda")
    
    with torch.no_grad(), torch.amp.autocast("cuda"):  # torch.cuda.amp.autocast() is deprecated
        outputs = model.generate(**inputs, max_new_tokens=50, num_beams=3)
    generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    
    batch_predictions = []
    for text in generated_texts:
        if "<answer>" in text:
            batch_predictions.append(text.split("<answer>")[1].strip())
        else:
            batch_predictions.append(text.strip())
    
    predictions.extend(batch_predictions)
    references.extend(true_answers)

# Compute ROUGE over all collected predictions
results = rouge.compute(predictions=predictions, references=references, use_stemmer=True)
print("Evaluation Results:")
for metric, score in results.items():
    print(f"{metric}: {score:.4f}")

# Test generation on 3 samples
def generate_answer(question, context):
    input_text = f"<question> {question} <context> {context} <answer>"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=256).to("cuda")
    with torch.no_grad(), torch.amp.autocast("cuda"):  # torch.cuda.amp.autocast() is deprecated
        outputs = model.generate(**inputs, max_new_tokens=50, num_beams=3)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    if "<answer>" in generated_text:
        return generated_text.split("<answer>")[1].strip()
    return generated_text.strip()

for i in range(3):
    sample = val_df.iloc[i]
    question = sample['question']
    context = sample['content'][:300]
    true_answer = sample['answer']
    
    generated_answer = generate_answer(question, context)
    
    print(f"\nSample {i+1}:")
    print(f"Question: {question}")
    print(f"Context: {context}")
    print(f"Generated Answer: {generated_answer}")
    print(f"True Answer: {true_answer}")
    print("-" * 50)

# Save the model and tokenizer
model.save_pretrained("/kaggle/working/qwen_model")
tokenizer.save_pretrained("/kaggle/working/qwen_model")
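
A decoder-only model continues each row from its last position, so batched generation is usually done with padding on the left; with `transformers` this is typically configured by setting `tokenizer.padding_side = "left"` before tokenizing the prompts. The sketch below illustrates the difference on toy token ids in plain Python. The `pad_batch` helper and the ids are hypothetical and only mimic what a tokenizer does when it pads.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to equal length and build the matching attention mask."""
    width = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        ones = [1] * len(seq)
        zeros = [0] * len(fill)
        if side == "left":
            padded.append(fill + list(seq))
            mask.append(zeros + ones)
        else:
            padded.append(list(seq) + fill)
            mask.append(ones + zeros)
    return padded, mask

toy = [[11, 12, 13], [21]]  # made-up token ids for two prompts
left_ids, left_mask = pad_batch(toy, pad_id=0, side="left")
right_ids, right_mask = pad_batch(toy, pad_id=0, side="right")

print("left :", left_ids, left_mask)
print("right:", right_ids, right_mask)
```

With left padding every row ends in a real prompt token, so new tokens are appended directly after the prompt; with right padding the shorter prompt ends in pad tokens, which can degrade its generated continuation.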


Evaluation Results:
rouge1: 0.4250
rouge2: 0.2126
rougeL: 0.3142
rougeLsum: 0.3144



Sample 1:
Question: What actions did President Franklin D. Roosevelt take regarding gold holdings and transactions during the bank holiday of 1933?
Context: # A Brief History Of The Gold Standard, With A Focus On The United States
## The Great Depression And Bretton Woods

In the depths of the Great Depression, the newly inaugurated president Franklin D. Roosevelt euphemistically declared a "national bank holiday" on  
![42_image_0.png](42_image_0.png) 
Generated Answer: During the bank holiday of 1933, President Franklin D. Roosevelt instructed the Secretary of the Treasury, Henry Morgenthau, Jr., to freeze all gold holdings and transactions. This action was part of a broader effort to stabilize the economy
True Answer: During the bank holiday of 1933, President Franklin D. Roosevelt ordered banks to exchange their gold holdings for Federal Reserve notes and to cease fulfilling transactions in gold. Additionally, banks were required to provide lists of customers who had withdrawn gol


Sample 2:
Question: What are the critical implications of the doctrine that 'ignorance of the law excuses no one' on individual rights and government authority?
Context: The safety of society, which is the only object of the criminal law, requires only that those acts which are understood by mankind at large to be intrinsically criminal, should be punished as crimes. The remaining few (if there are any) may safely be left to go unpunished. Nor does the safety of soc
Generated Answer: The doctrine that 'ignorance of the law excuses no one' has significant implications for individual rights and government authority. It suggests that ignorance of the law does not absolve individuals from the consequences of their actions. This doctrine can lead to a situation
True Answer: The doctrine that 'ignorance of the law excuses no one' critically undermines individual rights by denying people the autonomy to judge what their own rights and liberties are. It serves to maintain an arbitrary authori
