# Legal Document Abstractive Summarization Model

- Brygitta Josefien 		222102260 / IBDA
- Celine Vania Setiadi	222102122 / IBDA
- Samuel Revaldo Tjahyadi	222102304 / IBDA
- Timothy Rudolf Tan		222101412 / IBDA

----

## Introduction
Legal documents such as contracts, agreements, privacy policies, and court filings are usually long, complex, and full of technical terms. Reading and understanding a legal document in full takes considerable time and specialist expertise, which slows decision-making in both the business and legal sectors.

### Main challenges:
- Legal language is often complex and ambiguous.
- Document structure is not always consistent.
- Legal context is critical; a summarization error can significantly change the meaning.
- Annotated legal datasets for model training are still limited.
- Summaries must balance accuracy, simplicity, and completeness.

### Objective:
We compare the performance (measured with ROUGE and BLEU) of several abstractive text summarization models that take a legal document as input and produce a short abstractive summary as output.
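
To make the ROUGE metric concrete, the following is a minimal sketch of ROUGE-1 F1 (unigram overlap) in plain Python. It assumes simple lowercasing and whitespace tokenization; the `rouge1_f1` helper and the two example sentences are hypothetical, and the `evaluate` library used later in this notebook applies its own tokenization, so its scores can differ slightly.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between two lowercased, whitespace-tokenized strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical prediction/reference pair, for illustration only.
score = rouge1_f1(
    "the bill amends the tax code",
    "the bill amends the internal revenue code",
)
print(f"ROUGE-1 F1: {score:.4f}")
```

Five of the six predicted tokens also appear in the seven-token reference, so precision is 5/6, recall is 5/7, and the F1 score is 10/13.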

----

## Method

### AI architectures compared:
- GRU
- BERT (adapted with LoRA)
- PEGASUS (adapted with LoRA)
- PEGASUS (adapted with QLoRA)

### Data source:

* https://huggingface.co/datasets/FiscalNote/billsum

  * A dataset of bill texts from the United States Congress, each paired with a summary of the document.

  * The dataset suits this task because it provides neatly organized document-summary pairs.

* Data acquisition: the raw data is ready to use.

----

## GRU

In [None]:
!pip install -q evaluate

In [None]:
!pip install -q rouge_score

In [None]:
!pip install -q sacrebleu

In [None]:
from datasets import load_dataset
from evaluate import load
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
import time

In [None]:
TRAIN = True

In [None]:
if TRAIN:
    print("Let the training commence!")
else:
    print("Training session has taken place.")

In [None]:
# Load dataset
billsum_train = load_dataset("billsum", split="train")
billsum_test = load_dataset("billsum", split="test")

In [None]:
# Build tokenizer
def batch_iterator():
    for example in billsum_train:
        yield example["text"]
        yield example["summary"]

In [None]:
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

In [None]:
# tokenizer = Tokenizer(models.WordLevel(unk_token="<unk>"))
# tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
# trainer = trainers.WordLevelTrainer(
#     vocab_size=16000,
#     min_frequency=2,
#     special_tokens=["<pad>", "<unk>", "<bos>", "<eos>"]
# )
# tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)

In [None]:
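# Byte-level BPE tokenizer
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
# Pair the byte-level pre-tokenizer with its decoder so that decode() restores
# spaces instead of leaving "Ġ" markers in the generated text.
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=16000,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<bos>", "<eos>"]
)
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)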

In [None]:
# Save/load tokenizer
tokenizer.save("tokenizer.json")
tokenizer = Tokenizer.from_file("tokenizer.json")

In [None]:
# Special token IDs
PAD_IDX = tokenizer.token_to_id("<pad>")
UNK_IDX = tokenizer.token_to_id("<unk>")
BOS_IDX = tokenizer.token_to_id("<bos>")
EOS_IDX = tokenizer.token_to_id("<eos>")

In [None]:
# Text to tensor
import re

def clean_text(text):
    text = text.replace('\n', ' ')               # Replace newlines with space
    text = re.sub(r'\s+', ' ', text)             # Collapse multiple whitespace
    return text.strip()                          # Remove leading/trailing spaces
    
def text_to_tensor(text, max_len=1022, verbose=False):
    tokens = tokenizer.encode(clean_text(text)).ids[:max_len]
    if verbose:
        print(f"{len(tokens)} tokens exist.")
    return torch.tensor([BOS_IDX] + tokens + [EOS_IDX], dtype=torch.long)

In [None]:
def decode_tokens(token_ids):
    return tokenizer.decode(token_ids, skip_special_tokens=True)

In [None]:
# Collate function
def collate_batch(batch):
    src_batch, tgt_batch = [], []
    for example in batch:
        src = text_to_tensor(example["text"])
        tgt = text_to_tensor(example["summary"])
        src_batch.append(src)
        tgt_batch.append(tgt)
    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch
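
The collate function above relies on `pad_sequence`, which by default returns tensors laid out as (time, batch). The sketch below reproduces that layout with plain Python lists so the shape is easy to inspect without PyTorch. The `pad_time_major` helper and the token IDs are hypothetical; the pad ID of 0 matches the position of `<pad>` in the special-token list above.

```python
def pad_time_major(sequences, pad_value=0):
    """Pad variable-length ID lists and return a [max_len][batch] nested list."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    # Transpose from [batch][time] to [time][batch], the default layout
    # expected by nn.GRU when batch_first is False.
    return [list(step) for step in zip(*padded)]

# Two hypothetical token-ID sequences of different lengths.
batch = pad_time_major([[2, 7, 8, 3], [2, 9, 3]])
# Padding mask in the same layout: True where a real token is present.
mask = [[token != 0 for token in step] for step in batch]

print(batch)
print(mask)
```

Each inner list is one time step across the batch, so the shorter sequence is padded only in the final step.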

In [None]:
train_loader = DataLoader(billsum_train, batch_size=2, shuffle=True, collate_fn=collate_batch)
test_loader = DataLoader(billsum_test, batch_size=2, shuffle=False, collate_fn=collate_batch)

In [None]:
# Seq2Seq GRU model
class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx=PAD_IDX)
        self.gru = nn.GRU(emb_size, hidden_size)

    def forward(self, src):
        embedded = self.embedding(src)
        outputs, hidden = self.gru(embedded)
        return outputs, hidden

In [None]:
class Attention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.attn = nn.Linear(hidden_size * 2, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden, encoder_outputs, mask):
        src_len = encoder_outputs.shape[0]
        hidden = hidden.repeat(src_len, 1, 1)
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        attention = self.v(energy).squeeze(2)
        attention = attention.masked_fill(mask == 0, -1e9)
        return F.softmax(attention, dim=0)
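
The masking step in the attention module can be illustrated in isolation. This is a minimal plain-Python sketch of a masked softmax over one source sequence, assuming the same large negative fill value (-1e9) as above; `masked_softmax` and the example scores are hypothetical.

```python
import math

def masked_softmax(scores, mask, fill=-1e9):
    """Softmax over scores, giving (numerically) zero weight to masked positions."""
    masked = [s if keep else fill for s, keep in zip(scores, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention energies for three source positions; the last one is padding.
weights = masked_softmax([2.0, 1.0, 0.5], [True, True, False])
print(weights)
```

The padded position receives zero weight, and the remaining weights still sum to one, so the context vector is built only from real tokens.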

In [None]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx=PAD_IDX)
        self.gru = nn.GRU(emb_size + hidden_size, hidden_size)
        self.fc = nn.Linear(hidden_size * 2, vocab_size)
        self.attention = Attention(hidden_size)

    def forward(self, input, hidden, encoder_outputs, mask):
        input = input.unsqueeze(0)
        embedded = self.embedding(input)
        attn_weights = self.attention(hidden, encoder_outputs, mask)
        attn_weights = attn_weights.transpose(0, 1).unsqueeze(1)
        encoder_outputs = encoder_outputs.transpose(0, 1)
        attn_applied = torch.bmm(attn_weights, encoder_outputs)
        attn_applied = attn_applied.transpose(0, 1)
        rnn_input = torch.cat((embedded, attn_applied), dim=2)
        hidden = hidden.unsqueeze(0) if hidden.dim() == 2 else hidden
        output, hidden = self.gru(rnn_input, hidden)
        output = torch.cat((output.squeeze(0), attn_applied.squeeze(0)), dim=1)
        output = self.fc(output)
        return output, hidden

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def create_mask(self, src):
        return (src != PAD_IDX)

    def forward(self, src, tgt):
        encoder_outputs, hidden = self.encoder(src)
        mask = self.create_mask(src)
        outputs = torch.zeros(tgt.shape[0]-1, tgt.shape[1],
                              self.decoder.fc.out_features, device=tgt.device)
        input = tgt[0, :]
        for t in range(1, tgt.shape[0]):
            output, hidden = self.decoder(input, hidden, encoder_outputs, mask)
            outputs[t-1] = output
            input = tgt[t]
        return outputs
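
The loop in `forward` uses teacher forcing: at each step the decoder receives the ground-truth token from the previous position and is scored against the next one. The sketch below shows that input/target alignment on a plain list. The `teacher_forcing_pairs` helper and the two content token IDs are hypothetical; the `<bos>`/`<eos>` IDs of 2 and 3 follow the special-token order used when the tokenizer was trained.

```python
BOS, EOS = 2, 3  # IDs implied by the special_tokens order <pad>, <unk>, <bos>, <eos>

def teacher_forcing_pairs(target_ids):
    """Pair each decoder input token with the token it should predict next."""
    decoder_inputs = target_ids[:-1]    # everything except the final token
    expected_outputs = target_ids[1:]   # everything except <bos>
    return list(zip(decoder_inputs, expected_outputs))

# Hypothetical two-token summary wrapped in <bos> ... <eos>.
pairs = teacher_forcing_pairs([BOS, 41, 57, EOS])
print(pairs)
```

This is why the training loop compares the model output against `tgt[1:]`: the output at step `t-1` is the prediction for target position `t`.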

In [None]:
# Hyperparameters and model init
vocab_size = tokenizer.get_vocab_size()
model = Seq2Seq(
    Encoder(vocab_size, emb_size=64, hidden_size=128),
    Decoder(vocab_size, emb_size=64, hidden_size=128)
)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"{device = }")
model = model.to(device)

In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

In [None]:
# Training
def train_epoch(loader):
    model.train()
    total_loss = 0
    for i, packed in enumerate(loader):
        if i % 100 == 0:
            print(i)
        src, tgt = packed
        src, tgt = src.to(device), tgt.to(device)
        optimizer.zero_grad()
        output = model(src, tgt)
        output = output.view(-1, output.shape[-1])
        tgt = tgt[1:].reshape(-1)
        loss = criterion(output, tgt)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

In [None]:
def generate_summary(src_tensor, max_len=100):
    model.eval()
    with torch.no_grad():
        src_tensor = src_tensor.to(device)
        encoder_outputs, hidden = model.encoder(src_tensor)
        mask = model.create_mask(src_tensor)
        input_token = torch.tensor([BOS_IDX], device=device)

        generated = []
        for _ in range(max_len):
            output, hidden = model.decoder(input_token, hidden, encoder_outputs, mask)
            top1 = output.argmax(1)
            if top1.item() == EOS_IDX:
                break
            generated.append(top1.item())
            input_token = top1
    return decode_tokens(generated)
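
`generate_summary` performs greedy decoding: it feeds back the highest-scoring token at every step until `<eos>` appears or `max_len` is reached. The following sketch isolates that control flow in plain Python. `greedy_decode` and `toy_step` are hypothetical stand-ins, with `toy_step` playing the role of one decoder call.

```python
def greedy_decode(step_fn, bos_id, eos_id, max_len=100):
    """Feed back the highest-scoring token until <eos> or max_len is reached."""
    token, state, generated = bos_id, None, []
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token == eos_id:
            break
        generated.append(token)
    return generated

def toy_step(token, state):
    """Hypothetical decoder step: a fixed transition table over a 6-token vocabulary."""
    next_token = {2: 4, 4: 5, 5: 3}[token]
    scores = [0.0] * 6
    scores[next_token] = 1.0
    return scores, state

result = greedy_decode(toy_step, bos_id=2, eos_id=3)
print(result)
```

Greedy decoding is cheap but commits to one token per step; beam search, which keeps several candidate sequences, is a common alternative that this notebook's GRU model does not use.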

In [None]:
bleu = load("bleu")
rouge = load("rouge")

In [None]:
def evaluate(loader):
    model.eval()
    total_loss = 0
    predictions = []
    references = []

    with torch.no_grad():
        for i, (src, tgt) in enumerate(loader):
            src, tgt = src.to(device), tgt.to(device)
            output = model(src, tgt)
            loss = criterion(output.view(-1, output.shape[-1]), tgt[1:].reshape(-1))
            total_loss += loss.item()

            # Generate summaries
            for j in range(src.shape[1]):
                generated_summary = generate_summary(src[:, j].unsqueeze(1))
                target_summary = decode_tokens(tgt[1:, j].tolist())  # skip <bos>
                predictions.append(generated_summary)
                references.append(target_summary)

    rouge_score = rouge.compute(predictions=predictions, references=references)
    bleu_score = bleu.compute(predictions=predictions, references=[[ref] for ref in references])
    
    return total_loss / len(loader), rouge_score, bleu_score

In [None]:
train_losses, val_losses = [], []
rouge1_list, rouge2_list, rougeL_list = [], [], []
bleu_list = []

In [None]:
print("The state dict keys: \n\n", model.state_dict().keys())

In [None]:
from IPython.display import FileLink

In [None]:
if TRAIN:
    for epoch in range(4):
        start = time.time()
        train_loss = train_epoch(train_loader)
        val_loss, rouge_score, bleu_score = evaluate(test_loader)
        print(f"Epoch {epoch+1} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | Time: {time.time()-start:.2f}s")
        print("ROUGE:", rouge_score)
        print("BLEU:", bleu_score)
    
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        rouge1_list.append(rouge_score['rouge1'])
        rouge2_list.append(rouge_score['rouge2'])
        rougeL_list.append(rouge_score['rougeL'])
        bleu_list.append(bleu_score['bleu'])
    
    checkpoint = {'model': model,
                  'state_dict': model.state_dict(),
                  'optimizer': optimizer.state_dict()}
    torch.save(checkpoint, 'checkpoint.pth')
    FileLink('checkpoint.pth')

else:
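    # The checkpoint stores the whole pickled model, so weights_only must be
    # False (the default is True from PyTorch 2.6); only load trusted files.
    checkpoint = torch.load("/kaggle/input/gru/pytorch/v1/1/checkpoint.pth",
                            map_location=device, weights_only=False)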
    model = checkpoint['model']
    model.load_state_dict(checkpoint['state_dict'])
    
    if optimizer and 'optimizer' in checkpoint:
        optimizer.load_state_dict(checkpoint['optimizer'])

In [None]:
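if TRAIN:
    # A bare FileLink(...) inside an if-block is not rendered, so display it
    # explicitly (display() is available by default in notebook kernels).
    display(FileLink('checkpoint.pth'))
    print("File!")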

In [None]:
# Example input (a new or test document)
idx = 4

sample_text = billsum_test[idx]['text']
print("Original sample text (cut):\n", sample_text[:1024], '\n')

# print("Original sample text (uncut):\n", sample_text, '\n')

# Tokenize and convert to tensor
input_ids = text_to_tensor(sample_text, verbose=True)  # same as used in training
input_tensor = input_ids.clone().detach().unsqueeze(1)

# Print original summary
print(f"Original summary:\n", billsum_test[idx]['summary'], '\n')

# Generate summary
summary = generate_summary(input_tensor)
print("Generated Summary:", re.sub('Ġ', '', summary), '\n')

In [None]:
# test_text = billsum_test[340]['text']  # or your own custom text
# print("Original text:\n", test_text[:600])

In [None]:
# example_summary = billsum_test[340]['summary']
# print(f"{example_summary = }")

In [None]:
# summary = generate_summary(model, test_text)
# print("Generated Summary:\n", summary)

In [None]:
# billsum_train[1]['text'][:500]

In [None]:
# billsum_train[1]['summary']

In [None]:
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(train_losses, label='Train Loss')
plt.plot(val_losses, label='Val Loss')
plt.legend()
plt.title('Loss over Epochs')

plt.subplot(1, 2, 2)
plt.plot(rouge1_list, label='ROUGE-1')
plt.plot(rouge2_list, label='ROUGE-2')
plt.plot(rougeL_list, label='ROUGE-L')
plt.legend()
plt.title('ROUGE scores over Epochs')

plt.tight_layout()
plt.show()

In [None]:
torch.save(checkpoint, 'checkpoint_newest.pth')

----

## PEGASUS (LoRA)

In [1]:
!pip install -q transformers
!pip install -q peft # For fine-tuning with LoRA
!pip install -q datasets # For loading the dataset

In [2]:
from transformers import pipeline, set_seed
model_id = "google/pegasus-large"
summarizator = pipeline('summarization', model=model_id)
summarizator("Stonehenge is a prehistoric monument located on Salisbury Plain in Wiltshire, England, and is one of the most iconic and mysterious landmarks in the world. Constructed in several phases between 3100 BCE and 1600 BCE, it consists of massive sarsen stones and smaller bluestones arranged in a circular pattern. The purpose of Stonehenge remains a topic of debate, with theories suggesting it served as a ceremonial site, an astronomical observatory, or even a burial ground. Its alignment with the solstices adds to its intrigue, showcasing the advanced understanding of astronomy by its builders. Today, Stonehenge stands as a UNESCO World Heritage Site and a symbol of human ingenuity and cultural heritage.")

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Device set to use cuda:0
Your max_length is set to 256, but your input_length is only 129. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=64)


[{'summary_text': 'Today, Stonehenge stands as a UNESCO World Heritage Site and a symbol of human ingenuity and cultural heritage.'}]

In [3]:
from datasets import load_dataset

train_dataset = load_dataset("billsum", split="train")
test_dataset = load_dataset("billsum", split="test")

print(f"Total Train Dataset : {len(train_dataset)}")
print(f"Total Test Dataset  : {len(test_dataset)}")


Total Train Dataset : 18949
Total Test Dataset  : 3269


In [4]:
text_dataset = train_dataset[0]['text']
summary_dataset = train_dataset[0]['summary']

print(f"Text : {text_dataset}")
print("\n")
print(f"Summary : {summary_dataset}")

Text : SECTION 1. LIABILITY OF BUSINESS ENTITIES PROVIDING USE OF FACILITIES 
              TO NONPROFIT ORGANIZATIONS.

    (a) Definitions.--In this section:
            (1) Business entity.--The term ``business entity'' means a 
        firm, corporation, association, partnership, consortium, joint 
        venture, or other form of enterprise.
            (2) Facility.--The term ``facility'' means any real 
        property, including any building, improvement, or appurtenance.
            (3) Gross negligence.--The term ``gross negligence'' means 
        voluntary and conscious conduct by a person with knowledge (at 
        the time of the conduct) that the conduct is likely to be 
        harmful to the health or well-being of another person.
            (4) Intentional misconduct.--The term ``intentional 
        misconduct'' means conduct by a person with knowledge (at the 
        time of the conduct) that the conduct is harmful to the health 
        or well-being of anothe

In [5]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Set up model
model_name = "google/pegasus-xsum"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


In [6]:
from peft import get_peft_model, LoraConfig, TaskType

# LoRA config
lora_config = LoraConfig(
    r=16,                        # Rank (smaller = fewer trainable parameters)
    lora_alpha=32,               # Scaling factor
    target_modules=["q_proj", "v_proj"],    # Attention query and value projections, matched by module name
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM      # Sequence-to-sequence task (summarization)
)

# Wrap the model with the LoRA adapters
model = get_peft_model(model, lora_config)

# Report the number of trainable parameters
model.print_trainable_parameters()

trainable params: 3,145,728 || all params: 572,894,208 || trainable%: 0.5491
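
The trainable-parameter count reported above can be reproduced by hand. The sketch below assumes the PEGASUS architecture has a hidden size of 1024 with 16 encoder and 16 decoder layers; these dimensions are not stated in this notebook and are treated here as assumptions. `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in, d_out, rank):
    """A LoRA adapter adds matrices A (rank x d_in) and B (d_out x rank) per wrapped layer."""
    return rank * (d_in + d_out)

# Assumed PEGASUS dimensions (not stated in this notebook).
d_model, rank = 1024, 16
encoder_layers, decoder_layers = 16, 16

# One self-attention block per encoder layer; one self-attention and one
# cross-attention block per decoder layer.
attention_blocks = encoder_layers + 2 * decoder_layers
wrapped_layers = attention_blocks * 2  # q_proj and v_proj in each block

trainable = wrapped_layers * lora_param_count(d_model, d_model, rank)
share = 100 * trainable / 572_894_208  # total parameter count reported by PEFT

print(f"trainable params: {trainable:,} ({share:.4f}%)")
```

Under these assumptions the count comes out at 3,145,728, which matches the figure printed by `print_trainable_parameters()`.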


In [7]:
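# Preprocess function
max_input_length = 512
max_target_length = 128

def preprocess_function(examples):
    # Tokenize the bill text, truncated and padded to a fixed length
    model_inputs = tokenizer(
        examples["text"],
        max_length=max_input_length,
        truncation=True,
        padding="max_length"
    )

    # text_target replaces the deprecated as_target_tokenizer() context manager
    labels = tokenizer(
        text_target=examples["summary"],
        max_length=max_target_length,
        truncation=True,
        padding="max_length"
    )

    # Note: padded label positions keep the pad token ID (they are not set to
    # -100), so the training loss also covers padding.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs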

In [8]:
# Tokenize datasets
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)


### Training Process

In [9]:
import torch

# Set training arguments
training_args = TrainingArguments(
    label_names=["labels"],
    output_dir="./pegasus_finetuned_billsum",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    save_total_limit=2,
    fp16=torch.cuda.is_available(),
)


# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
)



In [10]:
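# API key redacted: credentials should not be stored in a notebook.
# Assumes the key is available in the WANDB_API_KEY environment variable.
!wandb login "$WANDB_API_KEY"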



In [11]:
# Train the model
trainer.train()


| Step | Training Loss |
|------|---------------|
| 500  | 4.1807 |
| 1000 | 3.6114 |
| 1500 | 3.3515 |
| 2000 | 3.3421 |
| 2500 | 3.193  |
| 3000 | 3.1321 |
| 3500 | 3.1769 |
| 4000 | 3.1839 |
| 4500 | 3.115  |
| 5000 | 3.0583 |




TrainOutput(global_step=14214, training_loss=3.092876171854389, metrics={'train_runtime': 15279.6771, 'train_samples_per_second': 3.72, 'train_steps_per_second': 0.93, 'total_flos': 8.267804279046144e+16, 'train_loss': 3.092876171854389, 'epoch': 3.0})
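
The `global_step` value above can be checked against the training configuration. The arithmetic below is a sketch that assumes training ran on two GPUs, which the notebook does not state. The assumption is inferred from the reported throughput (3.72 samples per second at 0.93 steps per second, i.e. about 4 samples per step, against a per-device batch size of 2).

```python
import math

train_examples = 18_949   # size of the billsum train split
per_device_batch = 2      # per_device_train_batch_size
epochs = 3                # num_train_epochs
num_devices = 2           # assumption: inferred from throughput, not stated

effective_batch = per_device_batch * num_devices
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(f"{steps_per_epoch} steps/epoch, {total_steps} steps in total")
```

With an effective batch size of 4, the total is 14,214 optimizer steps, which agrees with the reported `global_step`.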

In [12]:
# To save the fine-tuned PEFT model
model.save_pretrained("./pegasus-lora-billsum")
tokenizer.save_pretrained("./pegasus-lora-billsum")


### Evaluation Process

In [13]:
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

model_path = "/kaggle/input/pegasus-checkpoint"

tokenizer = PegasusTokenizer.from_pretrained(model_path)
model = PegasusForConditionalGeneration.from_pretrained(model_path)



In [14]:
!pip install evaluate
!pip install rouge_score



In [15]:
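import evaluate
import torch

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

# Optional: limit eval size for quick test
eval_samples = test_dataset.select(range(100))

# Trainer.predict() on a plain Trainer returns full-vocabulary logits for every
# position, which is memory-hungry and cannot be decoded directly. Generate
# token IDs in small batches with the reloaded model instead.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()

batch_size = 8
decoded_preds, decoded_labels = [], []
for start in range(0, len(eval_samples), batch_size):
    batch = eval_samples[start:start + batch_size]
    inputs = tokenizer(batch["text"], max_length=max_input_length,
                       truncation=True, padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=max_target_length)
    preds = tokenizer.batch_decode(generated, skip_special_tokens=True)
    decoded_preds.extend(pred.strip() for pred in preds)
    decoded_labels.extend(label.strip() for label in batch["summary"])

# ROUGE
rouge_result = rouge.compute(predictions=decoded_preds, references=decoded_labels)
print("ROUGE scores:", rouge_result)

# BLEU (the evaluate metric takes raw strings and tokenizes them itself)
bleu_result = bleu.compute(predictions=decoded_preds,
                           references=[[label] for label in decoded_labels])
print("BLEU score:", bleu_result)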




OutOfMemoryError: CUDA out of memory. Tried to allocate 3.67 GiB. GPU 0 has a total capacity of 14.74 GiB of which 3.30 GiB is free. Process 7947 has 11.44 GiB memory in use. Of the allocated memory 8.13 GiB is allocated by PyTorch, and 2.85 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

----

## PEGASUS (QLoRA)

In [1]:
!pip install -q transformers
!pip install -q peft # For fine-tuning with LoRA
!pip install -q datasets # For loading the dataset
!pip install -q bitsandbytes # For quantization
!pip install -q accelerate # For handling quantized models

In [4]:
pip freeze > requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [8]:
from datasets import load_dataset

train_dataset = load_dataset("billsum", split="train")
test_dataset = load_dataset("billsum", split="test")

print(f"Total Train Dataset : {len(train_dataset)}")
print(f"Total Test Dataset  : {len(test_dataset)}")

Total Train Dataset : 18949
Total Test Dataset  : 3269


In [9]:
import bitsandbytes as bnb
text_dataset = train_dataset[0]['text']
summary_dataset = train_dataset[0]['summary']

print(f"Text : {text_dataset}")
print("\n")
print(f"Summary : {summary_dataset}")

Text : SECTION 1. LIABILITY OF BUSINESS ENTITIES PROVIDING USE OF FACILITIES 
              TO NONPROFIT ORGANIZATIONS.

    (a) Definitions.--In this section:
            (1) Business entity.--The term ``business entity'' means a 
        firm, corporation, association, partnership, consortium, joint 
        venture, or other form of enterprise.
            (2) Facility.--The term ``facility'' means any real 
        property, including any building, improvement, or appurtenance.
            (3) Gross negligence.--The term ``gross negligence'' means 
        voluntary and conscious conduct by a person with knowledge (at 
        the time of the conduct) that the conduct is likely to be 
        harmful to the health or well-being of another person.
            (4) Intentional misconduct.--The term ``intentional 
        misconduct'' means conduct by a person with knowledge (at the 
        time of the conduct) that the conduct is harmful to the health 
        or well-being of anothe

In [11]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# Set up model
model_name = "google/pegasus-xsum"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_skip_modules=None
)

# Load the model with quantization
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,  # 8-bit quantization; 4-bit is also an option
    device_map={"": 0}  # Place the whole model on GPU 0
)

model = prepare_model_for_kbit_training(model)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
from peft import get_peft_model, LoraConfig, TaskType
# LoRA config
lora_config = LoraConfig(
    r=16,                        # Rank (smaller = fewer trainable parameters)
    lora_alpha=32,               # Scaling factor
    target_modules=["q_proj", "v_proj"],    # Attention query and value projections (module names)
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_2_SEQ_LM"      # Sequence-to-sequence task (summarization)
)

# Wrap the model with the LoRA adapters
model = get_peft_model(model, lora_config)

# Report the number of trainable parameters
model.print_trainable_parameters()

trainable params: 3,145,728 || all params: 572,894,208 || trainable%: 0.5491
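
The trainable-parameter count above can be reproduced by hand. The sketch below is a rough check, not part of the original notebook. It assumes that the decoder has 16 layers like the encoder, and that every decoder layer carries two attention blocks (self-attention and cross-attention), each with its own `q_proj` and `v_proj`. The hidden size of 1024 and the total parameter count come from the notebook output.

```python
# Worked check of the LoRA parameter count reported by print_trainable_parameters().
# Assumptions (not stated in the notebook): the decoder also has 16 layers, and each
# decoder layer has two attention blocks (self-attention and cross-attention).
d_model = 1024              # hidden size shown in the model printout
rank = 16                   # r in LoraConfig
encoder_layers = 16         # "(0-15): 16 x PegasusEncoderLayer"
decoder_layers = 16         # assumed to mirror the encoder
targets_per_attention = 2   # q_proj and v_proj

# Each adapted Linear layer gets two low-rank matrices: A (r x d_model) and B (d_model x r)
params_per_module = rank * d_model + d_model * rank

# One attention block per encoder layer, two per decoder layer
attention_blocks = encoder_layers * 1 + decoder_layers * 2
adapted_modules = attention_blocks * targets_per_attention

trainable = adapted_modules * params_per_module
total = 572_894_208         # "all params" from the notebook output

print(f"adapted modules : {adapted_modules}")
print(f"trainable params: {trainable:,}")
print(f"trainable %     : {100 * trainable / total:.4f}")
```

Under these assumptions the arithmetic lands on the same 3,145,728 trainable parameters, which suggests that only the query and value projections carry adapters while the rest of the network stays frozen.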


In [13]:
# Preprocess function
max_input_length = 512
max_target_length = 128

def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["text"],
        max_length=max_input_length,
        truncation=True,
        padding="max_length"
    )

    # Tokenize the summaries as targets (text_target replaces the deprecated as_target_tokenizer())
    labels = tokenizer(
        text_target=examples["summary"],
        max_length=max_target_length,
        truncation=True,
        padding="max_length"
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
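
Because the summaries are padded to `max_target_length`, the label sequences contain padding IDs that count toward the loss unless they are masked. A common convention is to replace them with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below illustrates that idea on toy data. `mask_padding` is a hypothetical helper that the notebook does not use, and the token IDs are made up.

```python
IGNORE_INDEX = -100  # value that PyTorch's cross-entropy loss skips by default

def mask_padding(label_ids, pad_token_id):
    """Replace padding IDs in each label sequence with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Toy label batch with made-up token IDs; 0 stands in for the pad token ID
toy_labels = [[57, 912, 1, 0, 0], [33, 1, 0, 0, 0]]
masked = mask_padding(toy_labels, pad_token_id=0)
print(masked)
```

Applying such a mask would change the reported training loss, so the numbers in this notebook would not be directly comparable with a masked run.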

In [14]:
# Tokenize datasets
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)


### Training Process

In [15]:
import torch
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Set training arguments
training_args = TrainingArguments(
    output_dir="./pegasus_qlora_billsum",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=2e-5,
    num_train_epochs=2,
    logging_dir="./logs",
    save_strategy="epoch",
    fp16=True,
    gradient_accumulation_steps=2,
    report_to="none"
)
# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
)

No label_names provided for model class `PeftModelForSeq2SeqLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [16]:
!wandb login  # API key removed; enter it at the prompt or set the WANDB_API_KEY environment variable

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin


In [17]:
# Train the model
trainer.train()



Step,Training Loss
500,4.1681
1000,3.564
1500,3.3402
2000,3.3403
2500,3.1899
3000,3.1307
3500,3.1735
4000,3.1854
4500,3.1215
5000,3.0663




TrainOutput(global_step=9474, training_loss=3.1950612598666415, metrics={'train_runtime': 13222.2496, 'train_samples_per_second': 2.866, 'train_steps_per_second': 0.717, 'total_flos': 5.511142321422336e+16, 'train_loss': 3.1950612598666415, 'epoch': 1.999683377308707})

In [27]:
# To save the fine-tuned PEFT model
model.save_pretrained("./pegasus-qlora-epoch2-billsum")
tokenizer.save_pretrained("./pegasus-qlora-epoch2-billsum")

('./pegasus-qlora-epoch2-billsum/tokenizer_config.json',
 './pegasus-qlora-epoch2-billsum/special_tokens_map.json',
 './pegasus-qlora-epoch2-billsum/spiece.model',
 './pegasus-qlora-epoch2-billsum/added_tokens.json',
 './pegasus-qlora-epoch2-billsum/tokenizer.json')

### Evaluation Process

In [31]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from peft import PeftModel, PeftConfig

model_path = "/kaggle/input/pegasus-checkpoint"

# Load the configuration
config = PeftConfig.from_pretrained(model_path)

# Load the base model
base_model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)

# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, model_path)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Set the model to evaluation mode
model.eval()

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


PeftModelForSeq2SeqLM(
  (base_model): LoraModel(
    (model): PegasusForConditionalGeneration(
      (model): PegasusModel(
        (shared): Embedding(96103, 1024, padding_idx=0)
        (encoder): PegasusEncoder(
          (embed_tokens): Embedding(96103, 1024, padding_idx=0)
          (embed_positions): PegasusSinusoidalPositionalEmbedding(512, 1024)
          (layers): ModuleList(
            (0-15): 16 x PegasusEncoderLayer(
              (self_attn): PegasusAttention(
                (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
                (v_proj): lora.Linear(
                  (base_layer): Linear(in_features=1024, out_features=1024, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.1, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=1024, out_features=16, bias=False)
                  )
                  (lora_B): ModuleDict(
 

In [32]:
# Function to generate summary
def generate_summary(text, max_length=128):
    # Tokenize the input text
    inputs = tokenizer(
        text, 
        max_length=512, 
        return_tensors="pt", 
        truncation=True, 
        padding="max_length"
    )
    
    # Generate summary
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            num_beams=4,
            early_stopping=True
        )
    
    # Decode the generated summary
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary

In [36]:
!pip install evaluate
!pip install rouge_score



In [37]:
# Function to test multiple examples and print detailed results
import evaluate
from rouge_score import rouge_scorer

def test_summarizer(dataset, start_index=0, num_examples=5):
    # Compute ROUGE scores
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    
    # Store results
    results = []
    
    # Test specified number of examples
    for i in range(start_index, min(start_index + num_examples, len(dataset))):
        test_example = dataset[i]
        
        # Generate summary
        generated_summary = generate_summary(test_example['text'])
        
        # Calculate ROUGE scores
        rouge_scores = scorer.score(test_example['summary'], generated_summary)
        
        # Prepare result
        result = {
            'index': i,
            'original_text': test_example['text'][:500] + '...' if len(test_example['text']) > 500 else test_example['text'],
            'ground_truth_summary': test_example['summary'],
            'generated_summary': generated_summary,
            'rouge_scores': rouge_scores
        }
        results.append(result)
        
        # Print results for each example
        print(f"\n--- Example {i} ---")
        print("Original Text (first 500 chars):")
        print(result['original_text'])
        print("\nGround Truth Summary:")
        print(result['ground_truth_summary'])
        print("\nGenerated Summary:")
        print(result['generated_summary'])
        print("\nROUGE Scores:")
        for metric, score in result['rouge_scores'].items():
            print(f"{metric}: f1={score.fmeasure:.4f}, precision={score.precision:.4f}, recall={score.recall:.4f}")
    
    return results
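
`test_summarizer` returns one set of ROUGE scores per example, so comparing models needs an aggregate. The sketch below averages precision, recall, and F1 per metric. `average_rouge` is a hypothetical helper that is not part of the notebook, the `Score` tuple is a stand-in with the same field names as the one in `rouge_score`, and the numbers are invented for illustration.

```python
from collections import namedtuple
from statistics import mean

# Stand-in with the same field names as rouge_score's Score tuple
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])

def average_rouge(results):
    """Average precision/recall/F1 per ROUGE metric over a list of result dicts."""
    metrics = results[0]["rouge_scores"].keys()
    return {
        metric: {
            field: mean(getattr(r["rouge_scores"][metric], field) for r in results)
            for field in Score._fields
        }
        for metric in metrics
    }

# Hypothetical scores, for illustration only (not results from the notebook)
toy_results = [
    {"rouge_scores": {"rouge1": Score(0.50, 0.25, 0.40), "rougeL": Score(0.25, 0.25, 0.20)}},
    {"rouge_scores": {"rouge1": Score(0.25, 0.75, 0.20), "rougeL": Score(0.75, 0.25, 0.40)}},
]
summary = average_rouge(toy_results)
for metric, fields in summary.items():
    print(metric, {name: round(value, 4) for name, value in fields.items()})
```

With the real results, the call would be `average_rouge(test_results)`. An average over five examples is only a smoke test, so a larger sample is needed before comparing models.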

In [38]:
test_results = test_summarizer(test_dataset, start_index=0, num_examples=5)


--- Example 0 ---
Original Text (first 500 chars):
SECTION 1. ENVIRONMENTAL INFRASTRUCTURE.

    (a) Jackson County, Mississippi.--Section 219 of the Water 
Resources Development Act of 1992 (106 Stat. 4835; 110 Stat. 3757) is 
amended--
        (1) in subsection (c), by striking paragraph (5) and inserting 
    the following:
        ``(5) Jackson county, mississippi.--Provision of an alternative 
    water supply and a project for the elimination or control of 
    combined sewer overflows for Jackson County, Mississippi.''; and
        (2) in...

Ground Truth Summary:
Amends the Water Resources Development Act of 1999 to: (1) authorize appropriations for FY 1999 through 2009 for implementation of a long-term resource monitoring program with respect to the Upper Mississippi River Environmental Management Program (currently, such funding is designated for a program for the planning, construction, and evaluation of measures for fish and wildlife habitat rehabilitation and enhancement)

In [None]:
import torch
torch.cuda.empty_cache()

In [None]:
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

# Optional: limit eval size for quick test
eval_samples = test_dataset.select(range(10))

# A plain Trainer.predict() returns logits rather than generated token IDs,
# so generate the summaries with generate_summary() defined above instead.
decoded_preds = [generate_summary(example["text"]) for example in eval_samples]
decoded_labels = [example["summary"] for example in eval_samples]

# Strip whitespace
decoded_preds = [pred.strip() for pred in decoded_preds]
decoded_labels = [label.strip() for label in decoded_labels]

# ROUGE
rouge_result = rouge.compute(predictions=decoded_preds, references=decoded_labels)
print("ROUGE scores:", rouge_result)

# BLEU (evaluate's bleu takes raw strings and tokenizes them internally)
bleu_result = bleu.compute(predictions=decoded_preds,
                           references=[[label] for label in decoded_labels])
print("BLEU score:", bleu_result)

----

## BERT (LoRA)

In [6]:
# !pip install -q transformers
# !pip install -q datasets # For loading the dataset

In [7]:
import torch

In [8]:
from datasets import load_dataset

train_dataset = load_dataset("billsum", split="train")
test_dataset = load_dataset("billsum", split="test")

print(f"Total Train Dataset : {len(train_dataset)}")
print(f"Total Test Dataset  : {len(test_dataset)}")

Total Train Dataset : 18949
Total Test Dataset  : 3269


In [9]:
text_dataset = train_dataset[0]['text']
summary_dataset = train_dataset[0]['summary']

print(f"Text : {text_dataset}")
print("\n")
print(f"Summary : {summary_dataset}")

Text : SECTION 1. LIABILITY OF BUSINESS ENTITIES PROVIDING USE OF FACILITIES 
              TO NONPROFIT ORGANIZATIONS.

    (a) Definitions.--In this section:
            (1) Business entity.--The term ``business entity'' means a 
        firm, corporation, association, partnership, consortium, joint 
        venture, or other form of enterprise.
            (2) Facility.--The term ``facility'' means any real 
        property, including any building, improvement, or appurtenance.
            (3) Gross negligence.--The term ``gross negligence'' means 
        voluntary and conscious conduct by a person with knowledge (at 
        the time of the conduct) that the conduct is likely to be 
        harmful to the health or well-being of another person.
            (4) Intentional misconduct.--The term ``intentional 
        misconduct'' means conduct by a person with knowledge (at the 
        time of the conduct) that the conduct is harmful to the health 
        or well-being of anothe

In [10]:
from transformers import BertTokenizer, EncoderDecoderModel
# Load pre-trained BERT tokenizer and model
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)


In [11]:
# Initialize BERT as encoder-decoder model for summarization
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_pretrained_model_name_or_path=model_name,
    decoder_pretrained_model_name_or_path=model_name
)

Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert.encoder.layer.0.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.0.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.0.crossattention.output.dense.bias', 'bert.encoder.layer.0.crossattention.output.dense.weight', 'bert.encoder.layer.0.crossattention.self.key.bias', 'bert.encoder.layer.0.crossattention.self.key.weight', 'bert.encoder.layer.0.crossattention.self.query.bias', 'bert.encoder.layer.0.crossattention.self.query.weight', 'bert.encoder.layer.0.crossattention.self.value.bias', 'bert.encoder.layer.0.crossattention.self.value.weight', 'bert.encoder.layer.1.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.1.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.1.crossattention.output.dense.bias', 'bert.encoder.layer.1.crossattention.output.dense.weight', 'bert.encoder.layer.1.crossattention.self.key.bias', 'bert.e

In [12]:
# Special tokens and model configuration
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id

In [13]:
# Adjust max length for BillSum dataset
max_length = 512  # BERT's maximum
summary_max_length = 150  # Reasonable length for summaries

# Preprocess function
def process_data_to_model_inputs(batch):
    # Tokenize the input texts
    inputs = tokenizer(
        batch["text"], 
        padding="max_length", 
        truncation=True, 
        max_length=max_length,
        return_tensors="pt"
    )
    
    # Tokenize the target summaries
    targets = tokenizer(
        batch["summary"], 
        padding="max_length", 
        truncation=True, 
        max_length=summary_max_length,
        return_tensors="pt"
    )
    
    return {
        "input_ids": inputs.input_ids,
        "attention_mask": inputs.attention_mask,
        "labels": targets.input_ids,
    }

In [14]:
# train_dataset = train_dataset.select(range(100))
# test_dataset = test_dataset.select(range(20))

In [15]:
train_dataset = train_dataset.map(
    process_data_to_model_inputs, 
    batched=True, 
    batch_size=16
)
test_dataset = test_dataset.map(
    process_data_to_model_inputs, 
    batched=True, 
    batch_size=16
)



In [16]:
# Convert to PyTorch format
train_dataset.set_format(
    type="torch", 
    columns=["input_ids", "attention_mask", "labels"]
)
test_dataset.set_format(
    type="torch", 
    columns=["input_ids", "attention_mask", "labels"]
)

In [17]:
# Training setup
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./bert-summarizer",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    eval_strategy="epoch",
    label_names=["labels"],
    learning_rate=2e-5,
    num_train_epochs=3, 
    logging_dir="./logs",
    logging_steps=100,
    save_total_limit=1,
    report_to="none" 
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)


In [18]:
# Start training
trainer.train()

  decoder_attention_mask = decoder_input_ids.new_tensor(decoder_input_ids != self.config.pad_token_id)


Epoch,Training Loss,Validation Loss
1,2.6676,2.559638
2,2.3179,2.347854
3,2.2661,2.277675



TrainOutput(global_step=14214, training_loss=2.618493327716072, metrics={'train_runtime': 7000.2703, 'train_samples_per_second': 8.121, 'train_steps_per_second': 2.03, 'total_flos': 3.487302524998656e+16, 'train_loss': 2.618493327716072, 'epoch': 3.0})

In [19]:
# Save the model
model.save_pretrained("./bert-summarizer-final")
tokenizer.save_pretrained("./bert-summarizer-final")

('./bert-summarizer-final/tokenizer_config.json',
 './bert-summarizer-final/special_tokens_map.json',
 './bert-summarizer-final/vocab.txt',
 './bert-summarizer-final/added_tokens.json')

In [20]:
!pip install evaluate
!pip install rouge_score



In [22]:
import evaluate

# Load metrics
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

eval_samples = test_dataset.select(range(100))  # Evaluate on 100 samples

# Generate predictions
predictions = trainer.predict(eval_samples)

# Decode predictions and labels
decoded_preds = tokenizer.batch_decode(predictions.predictions, skip_special_tokens=True)
decoded_labels = tokenizer.batch_decode(predictions.label_ids, skip_special_tokens=True)

# Clean whitespace
decoded_preds = [pred.strip() for pred in decoded_preds]
decoded_labels = [label.strip() for label in decoded_labels]

# ROUGE
rouge_result = rouge.compute(predictions=decoded_preds, references=decoded_labels)
print("ROUGE scores:", rouge_result)

# BLEU
bleu_result = bleu.compute(predictions=[pred.split() for pred in decoded_preds],
                           references=[[label.split()] for label in decoded_labels])
print("BLEU score:", bleu_result)

  decoder_attention_mask = decoder_input_ids.new_tensor(decoder_input_ids != self.config.pad_token_id)


TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list'

In [None]:
import evaluate
import numpy as np

# Load metrics
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

# Select evaluation samples (first 100)
eval_samples = test_dataset.select(range(100))

# Generate predictions
predictions = trainer.predict(eval_samples)

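# Trainer.predict() returns logits (possibly inside a tuple), not token IDs.
# Taking the argmax over the vocabulary axis gives teacher-forced greedy token IDs.
logits = predictions.predictions[0] if isinstance(predictions.predictions, tuple) else predictions.predictions
preds = np.argmax(logits, axis=-1)
labels = predictions.label_ids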

def clean_decode(ids, tokenizer):
    # Convert to numpy array if needed
    ids = np.array(ids)
    # Remove padding (-100) and special tokens
    ids = ids[ids != -100]
    return tokenizer.decode(ids, skip_special_tokens=True)

# Decode predictions and labels
decoded_preds = [clean_decode(pred, tokenizer) for pred in preds]
decoded_labels = [clean_decode(label, tokenizer) for label in labels]

# Clean whitespace
decoded_preds = [pred.strip() for pred in decoded_preds]
decoded_labels = [label.strip() for label in decoded_labels]

# Compute ROUGE metrics
rouge_result = rouge.compute(
    predictions=decoded_preds,
    references=decoded_labels,
    use_stemmer=True
)

# Compute BLEU score (evaluate's bleu takes raw strings and tokenizes them internally)
bleu_result = bleu.compute(
    predictions=decoded_preds,
    references=[[label] for label in decoded_labels]
)

# Print results
print("\nEvaluation Metrics:")
print(f"ROUGE-1: {rouge_result['rouge1']:.4f}")
print(f"ROUGE-2: {rouge_result['rouge2']:.4f}")
print(f"ROUGE-L: {rouge_result['rougeL']:.4f}")
print(f"BLEU: {bleu_result['bleu']:.4f}")

# Print some examples
print("\nSample Predictions:")
for i in range(3):  # Show first 3 examples
    print(f"\nExample {i+1}:")
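    # set_format() hides the 'text' column, so decode the input IDs instead
    print(f"Original Text: {tokenizer.decode(eval_samples[i]['input_ids'], skip_special_tokens=True)[:200]}...")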
    print(f"Reference Summary: {decoded_labels[i]}")
    print(f"Generated Summary: {decoded_preds[i]}")
    print("-" * 80)

In [None]:
# Function to generate summaries
def generate_summary(text):
    # Tokenize input text
    inputs = tokenizer(
        text, 
        padding="max_length", 
        truncation=True, 
        max_length=max_length,
        return_tensors="pt"
    )
    
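    # Generate summary (move the inputs to the model's device, e.g. the GPU after training)
    output = model.generate(
        inputs.input_ids.to(model.device),
        attention_mask=inputs.attention_mask.to(model.device),
        max_length=summary_max_length,
        num_beams=4,
        early_stopping=True
    )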
    
    # Decode and return
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Test the summarizer
# set_format() restricted the visible columns, so reset the format to read the raw text
sample = train_dataset.with_format(None)[0]
sample_text = sample["text"]
print("Original Text:")
print(sample_text[:500] + "...")  # Print first 500 chars
print("\nGenerated Summary:")
print(generate_summary(sample_text))
print("\nReference Summary:")
print(sample["summary"])

----

## Evaluation | Model Comparison

In [54]:
import torch

In [55]:
comparison = {
    "Models": ["GRU", "PEGASUS (LoRA)", "PEGASUS (QLoRA)", "BERT (LoRA)"],
    "ROUGE-1": [0, 0, 0, 0],
    "ROUGE-2": [0, 0, 0, 0],
    "ROUGE-L": [0, 0, 0, 0],
    "ROUGE-Lsum": [0, 0, 0, 0],
    "BLEU": [0, 0, 0, 0],
    "Brevity penalty": [0, 0, 0, 0],
    "Length ratio": [0, 0, 0, 0],
    "Translation length": [0, 0, 0, 0],
    "Reference length": [0, 0, 0, 0]
}

In [56]:
from datasets import load_dataset

train_dataset = load_dataset("billsum", split="train")
test_dataset = load_dataset("billsum", split="test")

print(f"Size of training data: {len(train_dataset)}")
print(f"Size of test data: {len(test_dataset)}")

Size of training data: 18949
Size of test data: 3269


In [57]:
import pandas as pd

## GRU

In [58]:
from seq2seq import Encoder, Decoder, Attention, Seq2Seq

In [59]:
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

In [60]:
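device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The checkpoint stores the pickled model object, so weights_only must be False.
# Only load checkpoints from a trusted source.
checkpoint = torch.load("checkpoint_gru.pth", map_location=device, weights_only=False)
model = checkpoint['model']
model.load_state_dict(checkpoint['state_dict'])
model = model.to(device)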

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
if 'optimizer' in checkpoint:
    optimizer.load_state_dict(checkpoint['optimizer'])



In [61]:
# Special token IDs
PAD_IDX = tokenizer.token_to_id("<pad>")
UNK_IDX = tokenizer.token_to_id("<unk>")
BOS_IDX = tokenizer.token_to_id("<bos>")
EOS_IDX = tokenizer.token_to_id("<eos>")

In [62]:
import re

def clean_text(text):
    text = text.replace('\n', ' ')               # Replace newlines with space
    text = re.sub(r'\s+', ' ', text)             # Collapse multiple whitespace
    return text.strip()                          # Remove leading/trailing spaces

def text_to_tensor(text, max_len=1022, verbose=False):
    tokens = tokenizer.encode(clean_text(text)).ids[:max_len]
    if verbose:
        print(f"{len(tokens)} tokens exist.")
    return torch.tensor([BOS_IDX] + tokens + [EOS_IDX], dtype=torch.long)
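
The default `max_len=1022` leaves room for the two special tokens, so no tensor exceeds 1,024 positions after `<bos>` and `<eos>` are added. The sketch below shows the same truncate-then-wrap step on plain lists. `wrap_with_specials` and the token IDs are hypothetical stand-ins, not the notebook's tokenizer.

```python
def wrap_with_specials(token_ids, bos_id, eos_id, max_len=1022):
    """Truncate to max_len tokens, then add <bos> and <eos> around them."""
    return [bos_id] + list(token_ids[:max_len]) + [eos_id]

# Hypothetical IDs, for illustration only
BOS, EOS = 2, 3
long_doc = list(range(100, 100 + 5000))   # pretend document of 5000 tokens
short_doc = [100, 101, 102]

wrapped_long = wrap_with_specials(long_doc, BOS, EOS)
wrapped_short = wrap_with_specials(short_doc, BOS, EOS)
print(len(wrapped_long), len(wrapped_short))
```

Anything beyond the first 1,022 tokens of a bill is dropped, so the model never sees the later sections of long documents.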


In [63]:
def decode_tokens(token_ids):
    return tokenizer.decode(token_ids, skip_special_tokens=True)

In [64]:
def generate_summary(src_tensor, max_len=100):
    model.eval()
    with torch.no_grad():
        src_tensor = src_tensor.to(device)
        encoder_outputs, hidden = model.encoder(src_tensor)
        mask = model.create_mask(src_tensor)
        input_token = torch.tensor([BOS_IDX], device=device)

        generated = []
        for _ in range(max_len):
            output, hidden = model.decoder(input_token, hidden, encoder_outputs, mask)
            top1 = output.argmax(1)
            if top1.item() == EOS_IDX:
                break
            generated.append(top1.item())
            input_token = top1
    return decode_tokens(generated)
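
`generate_summary` uses greedy decoding: at every step it takes the highest-scoring token, feeds it back as the next input, and stops at `<eos>` or after `max_len` steps. The sketch below isolates that loop. `toy_step` is a hypothetical stand-in for `model.decoder` that returns fixed scores over a four-token vocabulary, so the example runs without the trained model.

```python
def greedy_decode(step_fn, bos_id, eos_id, max_len=100):
    """Feed the previous token back in and stop at <eos> or after max_len steps."""
    generated = []
    token, state = bos_id, None
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token == eos_id:
            break
        generated.append(token)
    return generated

# Hypothetical stand-in for model.decoder: a fixed score table.
# Vocabulary: 0=<bos>, 1=<eos>, 2="a", 3="b"
def toy_step(token, state):
    step = 0 if state is None else state
    table = [
        [0.0, 0.1, 0.8, 0.1],    # step 0: token 2 scores highest
        [0.0, 0.2, 0.1, 0.7],    # step 1: token 3 scores highest
        [0.0, 0.9, 0.05, 0.05],  # step 2: <eos> scores highest
    ]
    return table[min(step, len(table) - 1)], step + 1

tokens = greedy_decode(toy_step, bos_id=0, eos_id=1)
print(tokens)
```

Greedy decoding is cheap but commits to the locally best token at each step, whereas the transformer cells in this notebook call `generate` with `num_beams=4`. That difference in decoding strategy should be kept in mind when comparing the scores.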

In [65]:
for i in range(5):
    print(f"\n====================={i}=====================\n")

    input_text = test_dataset[i]["text"]

    # Tokenize and convert to tensor
    input_ids = text_to_tensor(input_text, verbose=True)
    input_tensor = input_ids.clone().detach().unsqueeze(1)

    summary_text = re.sub('Ġ', '', generate_summary(input_tensor))

    # Output
    print("\n--- Original Text ---\n")
    print(input_text)
    print("\n--- Reference Summary ---\n")
    print(test_dataset[i]["summary"])
    print("\n--- Generated Summary ---\n")
    print(summary_text)



1022 tokens exist.

--- Original Text ---

SECTION 1. ENVIRONMENTAL INFRASTRUCTURE.

    (a) Jackson County, Mississippi.--Section 219 of the Water 
Resources Development Act of 1992 (106 Stat. 4835; 110 Stat. 3757) is 
amended--
        (1) in subsection (c), by striking paragraph (5) and inserting 
    the following:
        ``(5) Jackson county, mississippi.--Provision of an alternative 
    water supply and a project for the elimination or control of 
    combined sewer overflows for Jackson County, Mississippi.''; and
        (2) in subsection (e)(1), by striking ``$10,000,000'' and 
    inserting ``$20,000,000''.
    (b) Manchester, New Hampshire.--Section 219(e)(3) of the Water 
Resources Development Act of 1992 (106 Stat. 4835; 110 Stat. 3757) is 
amended by striking ``$10,000,000'' and inserting ``$20,000,000''.
    (c) Atlanta, Georgia.--Section 219(f)(1) of the Water Resources 
Development Act of 1992 (106 Stat. 4835; 113 Stat. 335) is amended by 
striking ``$25,000,000 fo

In [66]:
from torch.nn.utils.rnn import pad_sequence
def collate_batch(batch):
    src_batch, tgt_batch = [], []
    for example in batch:
        src = text_to_tensor(example["text"])
        tgt = text_to_tensor(example["summary"])
        src_batch.append(src)
        tgt_batch.append(tgt)
    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch
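
`pad_sequence` defaults to `batch_first=False`, so the padded tensors are laid out as `(sequence_length, batch_size)`. This is why the evaluation loop later selects one example with `src[:, i]`. The sketch below mimics that layout with plain lists. `pad_time_major` and the IDs are hypothetical, used only to make the shape visible.

```python
def pad_time_major(sequences, pad_id):
    """Pad variable-length ID lists and return rows indexed as [time][batch]."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # Transpose from [batch][time] to [time][batch]
    return [list(step) for step in zip(*padded)]

PAD = 0  # hypothetical pad ID
batch = pad_time_major([[5, 6, 7], [8, 9]], pad_id=PAD)
print(batch)

# Column i holds example i, which is what src[:, i] selects in the evaluation loop
second_example = [row[1] for row in batch]
print(second_example)
```

Shorter examples end with padding IDs, which is why a mask is built from the source tensor before decoding.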

In [67]:
from torch.utils.data import DataLoader

test_loader = DataLoader(test_dataset, batch_size=2, shuffle=False, collate_fn=collate_batch)

In [68]:
from evaluate import load

rouge = load("rouge")
bleu = load("bleu")

def evaluate_with_metrics(loader, max_len=100):
    model.eval()
    predictions = []
    references = []

    with torch.no_grad():
        for src, tgt in loader:
            src, tgt = src.to(device), tgt.to(device)

            # Generate one example at a time (the decoding loop is not batched)
            for i in range(src.shape[1]):  # batch iteration
                input_tensor = src[:, i].unsqueeze(1)
                tgt_tensor = tgt[:, i]

                # Generate prediction
                pred_text = generate_summary(input_tensor, max_len=max_len)
                label_text = decode_tokens(tgt_tensor[1:].tolist())  # skip <bos>

                predictions.append(pred_text.strip())
                references.append(label_text.strip())

    # Compute ROUGE
    rouge_result = rouge.compute(predictions=predictions, references=references)

    # Compute BLEU
    bleu_result = bleu.compute(
        predictions=[" ".join(pred.split()) for pred in predictions],
        references=[[" ".join(ref.split())] for ref in references]
    )

    print("ROUGE scores:", rouge_result)
    print("BLEU score:", bleu_result)

    return rouge_result, bleu_result

In [69]:
test_subset = torch.utils.data.Subset(test_dataset, range(10))
test_loader_small = DataLoader(test_subset, batch_size=2, collate_fn=collate_batch)

rouge_score, bleu_score = evaluate_with_metrics(test_loader_small)

ROUGE scores: {'rouge1': 0.3499901843505875, 'rouge2': 0.1647360138116667, 'rougeL': 0.27756991137715153, 'rougeLsum': 0.27740763837182986}
BLEU score: {'bleu': 0.07462051869930457, 'precisions': [0.5241545893719807, 0.254278728606357, 0.16831683168316833, 0.13408521303258145], 'brevity_penalty': 0.3186315667767985, 'length_ratio': 0.4664788732394366, 'translation_length': 828, 'reference_length': 1775}
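
The BLEU value above is the product of a brevity penalty and the geometric mean of the four n-gram precisions. The sketch below recomputes it from the reported components, which is a quick way to see how much of the low score comes from length rather than word choice. It uses only numbers copied from the output above.

```python
import math

# Values copied from the BLEU output above
precisions = [0.5241545893719807, 0.254278728606357,
              0.16831683168316833, 0.13408521303258145]
translation_length = 828
reference_length = 1775

# Brevity penalty: 1 if the output is at least as long as the reference,
# otherwise exp(1 - reference_length / translation_length)
if translation_length >= reference_length:
    brevity_penalty = 1.0
else:
    brevity_penalty = math.exp(1 - reference_length / translation_length)

# Geometric mean of the 1-gram to 4-gram precisions
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
bleu = brevity_penalty * geo_mean

print(f"brevity penalty: {brevity_penalty:.4f}")
print(f"geometric mean : {geo_mean:.4f}")
print(f"BLEU           : {bleu:.4f}")
```

The generated summaries are less than half as long as the references (length ratio about 0.47), so the brevity penalty alone cuts the score to roughly a third of the precision-based value. The `max_len=100` cap in `generate_summary` is a likely contributor to that gap.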


In [70]:
comparison["ROUGE-1"][0] = rouge_score["rouge1"]
comparison["ROUGE-2"][0] = rouge_score["rouge2"]
comparison["ROUGE-L"][0] = rouge_score["rougeL"]
comparison["ROUGE-Lsum"][0] = rouge_score["rougeLsum"]

In [71]:
comparison["BLEU"][0] = bleu_score["bleu"]
comparison["Brevity penalty"][0] = bleu_score["brevity_penalty"]
comparison["Length ratio"][0] = bleu_score["length_ratio"]
comparison["Translation length"][0] = bleu_score["translation_length"]
comparison["Reference length"][0] = bleu_score["reference_length"]

## PEGASUS (LoRA)

In [72]:
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

model_path = "pegasus_checkpoint"

tokenizer = PegasusTokenizer.from_pretrained(model_path)
model = PegasusForConditionalGeneration.from_pretrained(model_path)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [73]:
# Preprocess function
max_input_length = 512
max_target_length = 128

def preprocess_function(tokenizer, examples):
    model_inputs = tokenizer(
        examples["text"],
        max_length=max_input_length,
        truncation=True,
        padding="max_length"
    )

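    # Tokenize the reference summaries as targets; `text_target` replaces the
    # deprecated `as_target_tokenizer()` context manager.
    labels = tokenizer(
        text_target=examples["summary"],
        max_length=max_target_length,
        truncation=True,
        padding="max_length"
    )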

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
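
Because `padding="max_length"` is also applied to the summaries, the `labels` lists contain pad token ids. That is harmless for the evaluation in this notebook, where labels are only decoded. If the same labels were used for training, a common practice is to replace pad ids with -100 so that the loss ignores them. A minimal sketch, assuming a pad id of 0: the helper `mask_pad_labels` and the toy ids are hypothetical, and in practice the id would come from `tokenizer.pad_token_id`.

```python
def mask_pad_labels(label_ids, pad_token_id, ignore_index=-100):
    """Replace pad token ids with ignore_index in a batch of label sequences."""
    return [
        [ignore_index if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Toy batch: two padded label sequences with a made-up pad id of 0.
batch_labels = [[12, 57, 9, 1, 0, 0], [33, 1, 0, 0, 0, 0]]
masked = mask_pad_labels(batch_labels, pad_token_id=0)
print(masked)
```

Labels masked this way need the -100 entries mapped back to the pad id before calling `tokenizer.batch_decode`.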

In [74]:
# Tokenize datasets
tokenized_train = train_dataset.map(lambda x: preprocess_function(tokenizer, x), batched=True)
tokenized_test = test_dataset.map(lambda x: preprocess_function(tokenizer, x), batched=True)

In [75]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Set training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./pegasus_lora_billsum",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=2,
    logging_dir="./logs",
    save_strategy="epoch",
    fp16=True,
    gradient_accumulation_steps=2,
    report_to="none",
    dataloader_num_workers=0,
    do_predict=True,
    predict_with_generate=True
)

# Trainer setup
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
)
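
If this trainer were used for training rather than only for prediction, gradient accumulation would make the optimizer step once every `gradient_accumulation_steps` batches, so the effective batch size is larger than the per-device value. A small sketch of that arithmetic for the arguments above, assuming a single device. The dataset size is a made-up placeholder, not the size of the BillSum split, and the step count is approximate.

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 2
num_devices = 1          # assumption: single GPU
num_train_epochs = 2
num_examples = 1000      # hypothetical dataset size, for illustration only

# Number of examples that contribute to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Approximate number of optimizer updates per epoch and in total.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch_size, steps_per_epoch, total_steps)
```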

In [76]:
import evaluate
import numpy as np

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

# Optional: limit eval size for quick test
eval_samples = tokenized_test.select(range(10))

pred_output = trainer.predict(eval_samples)

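# Replace any -100 padding with the pad token id so the ids can be decoded
preds = np.where(pred_output.predictions != -100, pred_output.predictions, tokenizer.pad_token_id)
labels = np.where(pred_output.label_ids != -100, pred_output.label_ids, tokenizer.pad_token_id)

decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)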

# Strip whitespaces
decoded_preds = [pred.strip() for pred in decoded_preds]
decoded_labels = [label.strip() for label in decoded_labels]

# ROUGE
rouge_result = rouge.compute(predictions=decoded_preds, references=decoded_labels)
print("ROUGE scores:", rouge_result)

# BLEU
bleu_result = bleu.compute(
    predictions=[" ".join(pred.split()) for pred in decoded_preds],
    references=[[" ".join(ref.split())] for ref in decoded_labels]
)
print("BLEU score:", bleu_result)

ROUGE scores: {'rouge1': 0.474928785983147, 'rouge2': 0.30566315736346317, 'rougeL': 0.38761850231003814, 'rougeLsum': 0.39071751650860426}
BLEU score: {'bleu': 0.1683378273208139, 'precisions': [0.6427350427350428, 0.41043478260869565, 0.3309734513274336, 0.28468468468468466], 'brevity_penalty': 0.4239585202564463, 'length_ratio': 0.5381784728610856, 'translation_length': 585, 'reference_length': 1087}


In [77]:
comparison["ROUGE-1"][1] = rouge_result["rouge1"]
comparison["ROUGE-2"][1] = rouge_result["rouge2"]
comparison["ROUGE-L"][1] = rouge_result["rougeL"]
comparison["ROUGE-Lsum"][1] = rouge_result["rougeLsum"]

In [78]:
comparison["BLEU"][1] = bleu_result["bleu"]
comparison["Brevity penalty"][1] = bleu_result["brevity_penalty"]
comparison["Length ratio"][1] = bleu_result["length_ratio"]
comparison["Translation length"][1] = bleu_result["translation_length"]
comparison["Reference length"][1] = bleu_result["reference_length"]

In [79]:
for i in range(5):
    print(f"\n====================={i}=====================\n")

    input_text = test_dataset[i]["text"]

    # Tokenize the input text
    inputs = tokenizer(
        input_text,
        max_length=512,
        truncation=True,
        padding="max_length",
        return_tensors="pt"
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate the summary
    summary_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=4,
        length_penalty=2.0,
        early_stopping=True
    )

    # Decode the generated summary
    summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Output
    print("\n--- Original Text ---\n")
    print(input_text)
    print("\n--- Reference Summary ---\n")
    print(test_dataset[i]["summary"])
    print("\n--- Generated Summary ---\n")
    print(summary_text)




--- Original Text ---

SECTION 1. ENVIRONMENTAL INFRASTRUCTURE.

    (a) Jackson County, Mississippi.--Section 219 of the Water 
Resources Development Act of 1992 (106 Stat. 4835; 110 Stat. 3757) is 
amended--
        (1) in subsection (c), by striking paragraph (5) and inserting 
    the following:
        ``(5) Jackson county, mississippi.--Provision of an alternative 
    water supply and a project for the elimination or control of 
    combined sewer overflows for Jackson County, Mississippi.''; and
        (2) in subsection (e)(1), by striking ``$10,000,000'' and 
    inserting ``$20,000,000''.
    (b) Manchester, New Hampshire.--Section 219(e)(3) of the Water 
Resources Development Act of 1992 (106 Stat. 4835; 110 Stat. 3757) is 
amended by striking ``$10,000,000'' and inserting ``$20,000,000''.
    (c) Atlanta, Georgia.--Section 219(f)(1) of the Water Resources 
Development Act of 1992 (106 Stat. 4835; 113 Stat. 335) is amended by 
striking ``$25,000,000 for''.
    (d) Paters
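
The generation call above uses beam search with `length_penalty=2.0`. In Hugging Face beam search, a finished hypothesis is ranked, roughly, by its summed log-probability divided by its length raised to `length_penalty`. Since log-probabilities are negative, larger values favor longer outputs. A toy sketch of that ranking rule follows. The log-probabilities and lengths are invented for illustration, and `beam_score` is a hypothetical helper, not a library function.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized hypothesis score: higher is better."""
    return sum_logprobs / (length ** length_penalty)

# Two made-up finished hypotheses: a short one and a longer, less probable one.
short_hyp = {"sum_logprobs": -4.0, "length": 10}
long_hyp = {"sum_logprobs": -6.0, "length": 20}

winners = {}
for lp in (0.0, 1.0, 2.0):
    s = beam_score(short_hyp["sum_logprobs"], short_hyp["length"], lp)
    l = beam_score(long_hyp["sum_logprobs"], long_hyp["length"], lp)
    winners[lp] = "short" if s > l else "long"
    print(f"length_penalty={lp}: short={s:.4f}, long={l:.4f} -> {winners[lp]}")
```

With no length normalization the shorter hypothesis wins. With a penalty of 1.0 or 2.0 the longer one does, which is the direction that would counteract the low brevity penalties reported by BLEU.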

## PEGASUS (QLoRA)

In [80]:
model_path = "pegasus-qlora-checkpoint"

tokenizer = PegasusTokenizer.from_pretrained(model_path)
model = PegasusForConditionalGeneration.from_pretrained(model_path)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [81]:
tokenized_train = train_dataset.map(lambda x: preprocess_function(tokenizer, x), batched=True)
tokenized_test = test_dataset.map(lambda x: preprocess_function(tokenizer, x), batched=True)

In [82]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./pegasus_qlora_billsum",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=2,
    logging_dir="./logs",
    save_strategy="epoch",
    fp16=True,
    gradient_accumulation_steps=2,
    report_to="none",
    dataloader_num_workers=0,
    do_predict=True,
    predict_with_generate=True
)

# Trainer setup
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
)

In [83]:
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

# Optional: limit eval size for quick test
eval_samples = tokenized_test.select(range(10))

pred_output = trainer.predict(eval_samples)

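# Replace any -100 padding with the pad token id so the ids can be decoded
preds = np.where(pred_output.predictions != -100, pred_output.predictions, tokenizer.pad_token_id)
labels = np.where(pred_output.label_ids != -100, pred_output.label_ids, tokenizer.pad_token_id)

decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)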

# Strip whitespaces
decoded_preds = [pred.strip() for pred in decoded_preds]
decoded_labels = [label.strip() for label in decoded_labels]

# ROUGE
rouge_result = rouge.compute(predictions=decoded_preds, references=decoded_labels)
print("ROUGE scores:", rouge_result)

# BLEU
bleu_result = bleu.compute(
    predictions=[" ".join(pred.split()) for pred in decoded_preds],
    references=[[" ".join(ref.split())] for ref in decoded_labels]
)
print("BLEU score:", bleu_result)

ROUGE scores: {'rouge1': 0.45631120961159227, 'rouge2': 0.27881869424501227, 'rougeL': 0.36053773589072124, 'rougeLsum': 0.3607912709143214}
BLEU score: {'bleu': 0.14369187611735781, 'precisions': [0.625, 0.3709677419354839, 0.291970802919708, 0.24349442379182157], 'brevity_penalty': 0.4010246451383619, 'length_ratio': 0.5225390984360626, 'translation_length': 568, 'reference_length': 1087}


In [84]:
comparison["ROUGE-1"][2] = rouge_result["rouge1"]
comparison["ROUGE-2"][2] = rouge_result["rouge2"]
comparison["ROUGE-L"][2] = rouge_result["rougeL"]
comparison["ROUGE-Lsum"][2] = rouge_result["rougeLsum"]

In [85]:
comparison["BLEU"][2] = bleu_result["bleu"]
comparison["Brevity penalty"][2] = bleu_result["brevity_penalty"]
comparison["Length ratio"][2] = bleu_result["length_ratio"]
comparison["Translation length"][2] = bleu_result["translation_length"]
comparison["Reference length"][2] = bleu_result["reference_length"]

In [86]:
for i in range(5):
    print(f"\n====================={i}=====================\n")

    input_text = test_dataset[i]["text"]

    # Tokenize the input text
    inputs = tokenizer(
        input_text,
        max_length=512,
        truncation=True,
        padding="max_length",
        return_tensors="pt"
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate the summary
    summary_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=4,
        length_penalty=2.0,
        early_stopping=True
    )

    # Decode the generated summary
    summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Output
    print("\n--- Original Text ---\n")
    print(input_text)
    print("\n--- Reference Summary ---\n")
    print(test_dataset[i]["summary"])
    print("\n--- Generated Summary ---\n")
    print(summary_text)





## BERT

In [87]:
from transformers import BertTokenizer, EncoderDecoderModel

In [88]:
model_path = "bert-summarizer-final"

tokenizer = BertTokenizer.from_pretrained(model_path)
model = EncoderDecoderModel.from_pretrained(model_path)

In [89]:
tokenized_train = train_dataset.map(lambda x: preprocess_function(tokenizer, x), batched=True)
tokenized_test = test_dataset.map(lambda x: preprocess_function(tokenizer, x), batched=True)

In [90]:
# Training setup
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./bert-summarizer",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    eval_strategy="epoch",
    label_names=["labels"],
    learning_rate=2e-5,
    num_train_epochs=3,
    logging_dir="./logs",
    logging_steps=100,
    save_total_limit=1,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,  # tokenized splits, not the raw text datasets
    eval_dataset=tokenized_test,
)

In [92]:
import evaluate
from tqdm import tqdm

# Load metrics
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

# Optional: limit eval size for quick test
eval_samples = tokenized_test.select(range(10))

# Generate predictions manually since we use Trainer (not Seq2SeqTrainer)
decoded_preds = []
decoded_labels = []

for i in tqdm(range(0, len(eval_samples), 4)):  # batch size = 4
    batch = eval_samples[i:i+4]

    input_ids = torch.tensor(batch["input_ids"]).to(model.device)
    attention_mask = torch.tensor(batch["attention_mask"]).to(model.device)
    labels = torch.tensor(batch["labels"])

    # Generate outputs
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=150,
        num_beams=4,
        early_stopping=True,
        decoder_start_token_id=tokenizer.cls_token_id
    )

    # Decode predictions and references
    preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    refs = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds.extend([pred.strip() for pred in preds])
    decoded_labels.extend([ref.strip() for ref in refs])

# Compute ROUGE
rouge_result = rouge.compute(predictions=decoded_preds, references=decoded_labels)
print("ROUGE scores:", rouge_result)

# Compute BLEU
bleu_result = bleu.compute(
    predictions=[" ".join(pred.split()) for pred in decoded_preds],
    references=[[" ".join(ref.split())] for ref in decoded_labels]
)
print("BLEU score:", bleu_result)


100%|██████████| 3/3 [00:07<00:00,  2.45s/it]

ROUGE scores: {'rouge1': 0.27145045758874614, 'rouge2': 0.11478244068367871, 'rougeL': 0.2171088138129012, 'rougeLsum': 0.2195390588358097}
BLEU score: {'bleu': 0.09961801140245499, 'precisions': [0.30392156862745096, 0.1288546255506608, 0.08129175946547884, 0.057432432432432436], 'brevity_penalty': 0.856685765320913, 'length_ratio': 0.8660377358490566, 'translation_length': 918, 'reference_length': 1060}






In [94]:
comparison["ROUGE-1"][3] = rouge_result["rouge1"]
comparison["ROUGE-2"][3] = rouge_result["rouge2"]
comparison["ROUGE-L"][3] = rouge_result["rougeL"]
comparison["ROUGE-Lsum"][3] = rouge_result["rougeLsum"]

In [95]:
comparison["BLEU"][3] = bleu_result["bleu"]
comparison["Brevity penalty"][3] = bleu_result["brevity_penalty"]
comparison["Length ratio"][3] = bleu_result["length_ratio"]
comparison["Translation length"][3] = bleu_result["translation_length"]
comparison["Reference length"][3] = bleu_result["reference_length"]

In [97]:
for i in range(5):
    print(f"\n====================={i}=====================\n")

    input_text = test_dataset[i]["text"]

    # Tokenize the input text
    inputs = tokenizer(
        input_text,
        max_length=512,
        truncation=True,
        padding="max_length",
        return_tensors="pt"
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate the summary
    summary_ids = model.generate(
        **inputs,
        max_length=128,
        num_beams=4,
        length_penalty=2.0,
        early_stopping=True,
        decoder_start_token_id=model.config.decoder_start_token_id
    )

    # Decode the generated summary
    summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Output
    print("\n--- Original Text ---\n")
    print(input_text)
    print("\n--- Reference Summary ---\n")
    print(test_dataset[i]["summary"])
    print("\n--- Generated Summary ---\n")
    print(summary_text)





## Comparison table

In [101]:
df = pd.DataFrame(comparison)
df.head(len(df))

| | Models | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | BLEU | Brevity penalty | Length ratio | Translation length | Reference length |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GRU | 0.34999 | 0.164736 | 0.27757 | 0.277408 | 0.074621 | 0.318632 | 0.466479 | 828 | 1775 |
| 1 | PEGASUS (LoRA) | 0.474929 | 0.305663 | 0.387619 | 0.390718 | 0.168338 | 0.423959 | 0.538178 | 585 | 1087 |
| 2 | PEGASUS (QLoRA) | 0.456311 | 0.278819 | 0.360538 | 0.360791 | 0.143692 | 0.401025 | 0.522539 | 568 | 1087 |
| 3 | BERT (LoRA) | 0.27145 | 0.114782 | 0.217109 | 0.219539 | 0.099618 | 0.856686 | 0.866038 | 918 | 1060 |


----

## Conclusion

We recommend PEGASUS (LoRA) as the most effective model for Seq2Seq abstractive summarization of legal text. On the 10-document test subset evaluated above, it achieves the highest score on every ROUGE and BLEU metric:
- ROUGE-1: 0.474929
- ROUGE-2: 0.305663
- ROUGE-L: 0.387619
- ROUGE-Lsum: 0.390718
- BLEU: 0.168338
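
The claim that one model leads on every overlap metric can be checked mechanically. A small sketch with the scores copied by hand from the comparison table, using a plain dictionary instead of the notebook's `comparison` object so that it runs on its own:

```python
# Scores copied from the comparison table (values as displayed there).
scores = {
    "GRU": {"ROUGE-1": 0.34999, "ROUGE-2": 0.164736, "ROUGE-L": 0.27757,
            "ROUGE-Lsum": 0.277408, "BLEU": 0.074621},
    "PEGASUS (LoRA)": {"ROUGE-1": 0.474929, "ROUGE-2": 0.305663, "ROUGE-L": 0.387619,
                       "ROUGE-Lsum": 0.390718, "BLEU": 0.168338},
    "PEGASUS (QLoRA)": {"ROUGE-1": 0.456311, "ROUGE-2": 0.278819, "ROUGE-L": 0.360538,
                        "ROUGE-Lsum": 0.360791, "BLEU": 0.143692},
    "BERT (LoRA)": {"ROUGE-1": 0.27145, "ROUGE-2": 0.114782, "ROUGE-L": 0.217109,
                    "ROUGE-Lsum": 0.219539, "BLEU": 0.099618},
}
metrics = ["ROUGE-1", "ROUGE-2", "ROUGE-L", "ROUGE-Lsum", "BLEU"]

# For each metric, find the model with the highest score.
best = {m: max(scores, key=lambda name: scores[name][m]) for m in metrics}
for metric, model_name in best.items():
    print(f"{metric}: {model_name} ({scores[model_name][metric]})")
```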

For this task, ROUGE is a better fit than BLEU. ROUGE measures n-gram overlap between the model output and the reference (ground-truth) summary, for example unigram overlap for ROUGE-1. It is recall-oriented: it asks how much of the reference's information the model captured. Note that the values reported by `evaluate` are F-measures, which combine that recall with precision. By this measure, PEGASUS (LoRA) reproduces the content and sentence structure of the references most accurately.
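
To make the ROUGE-1 idea concrete, here is a simplified sketch that counts unigram overlap between one prediction and one reference. It uses plain whitespace tokenization and no stemming, so it is not the `rouge_score` implementation and its numbers will not match that package exactly. The two sentences are invented for illustration.

```python
from collections import Counter

def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F1 for one prediction/reference pair."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "the bill amends the water act"
prediction = "the bill amends water funding"
precision, recall, f1 = rouge1(prediction, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The prediction shares four of its five words with the reference but covers only four of the reference's six, so recall is lower than precision. The same effect appears when generated summaries are much shorter than the references.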

PEGASUS (QLoRA) comes close to PEGASUS (LoRA), with competitive ROUGE and BLEU scores. QLoRA trades a small loss in quality for lower hardware requirements, which matters when working with larger models.
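
The size of the gap between the two PEGASUS variants can be expressed as a relative drop per metric. A short sketch using the values from the comparison table, where the dictionary names are illustrative:

```python
# Scores copied from the comparison table.
lora = {"ROUGE-1": 0.474929, "ROUGE-2": 0.305663, "ROUGE-L": 0.387619, "BLEU": 0.168338}
qlora = {"ROUGE-1": 0.456311, "ROUGE-2": 0.278819, "ROUGE-L": 0.360538, "BLEU": 0.143692}

# Relative drop of QLoRA with respect to LoRA, per metric.
relative_drop = {m: (lora[m] - qlora[m]) / lora[m] for m in lora}
for metric, drop in relative_drop.items():
    print(f"{metric}: {drop:.1%}")
```

The drop ranges from about 4% on ROUGE-1 to about 15% on BLEU. Since both models were scored on only 10 test documents, these differences should be read as indicative rather than conclusive.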

BERT (LoRA) and GRU perform worst: BERT (LoRA) has the lowest ROUGE scores and GRU the lowest BLEU. Neither appears well suited to this task.