<a href="https://colab.research.google.com/github/Ryzon3/csci_6967/blob/main/homework6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part/Task 1 Transformers

### Dataset Description and Preprocessing

We selected the [xsum](https://huggingface.co/datasets/shalinik/xsum) dataset, which contains around 220K news articles with corresponding summaries. For faster experimentation, we perform a 90/10 split on its official training set. Note I had to take a subset to make this run in a reasonable time.

In [1]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.2/491.2 kB[0m [31m30.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m10.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.12.0-py3-none-any.

In [2]:
import datasets
from datasets import load_dataset, concatenate_datasets

# Load the xsum
xsum_dataset = load_dataset("xsum")

# Use the official training split and create a 90/10 train-test split
all_data = xsum_dataset["train"]
# Shuffle and only use the top 50k
all_data = all_data.shuffle(seed=42).select(range(50000))
train_test_split = all_data.train_test_split(test_size=0.1, seed=42)
train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]

print("Train dataset size:", len(train_dataset))
print("Test dataset size:", len(test_dataset))


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/6.24k [00:00<?, ?B/s]

xsum.py:   0%|          | 0.00/5.76k [00:00<?, ?B/s]

The repository for xsum contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/xsum.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


(…)SUM-EMNLP18-Summary-Data-Original.tar.gz:   0%|          | 0.00/255M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.72M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/204045 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11332 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11334 [00:00<?, ? examples/s]

Train dataset size: 45000
Test dataset size: 5000


### Data Preprocessing and Tokenization

We tokenize the news articles (inputs) and their corresponding highlights using the tokenizer from the pre-trained model `facebook/bart-large-cnn` per the documentation. The input texts are truncated to a maximum length to manage memory constraints.


In [3]:
from transformers import BartTokenizer

# Load the tokenizer
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

def preprocess_function(examples):
    inputs = tokenizer(examples["document"], max_length=1024, truncation=True)
    # Use as_target_tokenizer context to process the summaries
    with tokenizer.as_target_tokenizer():
        targets = tokenizer(examples["summary"], max_length=128, truncation=True)
    inputs["labels"] = targets["input_ids"]
    return inputs

# Map the preprocessing function to the train and test splits
train_dataset = train_dataset.map(preprocess_function, batched=True)
test_dataset = test_dataset.map(preprocess_function, batched=True)


vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Map:   0%|          | 0/45000 [00:00<?, ? examples/s]



Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

### Loading and Fine-Tuning the BART Model

We load the pre-trained BART model from Hugging Face and fine-tune it using the Trainer API. Hyperparameters such as learning rate, batch size, and number of epochs are set here, and they can be tuned for better performance.


In [9]:
from transformers import BartForConditionalGeneration, Trainer, TrainingArguments, DataCollatorForSeq2Seq
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pre-trained BART model
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").to(device)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Define training arguments
training_args = TrainingArguments(
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    weight_decay=0.01,
    save_total_limit=2,
    logging_steps=100,
    report_to=[]
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator
)

# Fine-tune the model
trainer.train()

Step,Training Loss
100,2.4834
200,2.4027
300,2.3491
400,2.3175
500,2.2851
600,2.2812
700,2.2528
800,2.2313
900,2.2299
1000,2.2166




KeyboardInterrupt: 

### Evaluation: BLEU and ROUGE Metrics

After fine-tuning, we evaluate the model on the test set. We generate summaries for the test articles and compute BLEU and ROUGE scores to measure the quality of the generated summaries.


In [12]:
!pip install evaluate
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=993916432238e802d94aa4e4c6b38330abe54d1d7b952e02ccabc79ee4ad3d70
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [None]:
import evaluate

# Load evaluation metrics
rouge_metric = evaluate.load("rouge")
bleu_metric = evaluate.load("bleu")

def generate_summary_batch(batch):
    # Tokenize a batch of documents and add padding for consistency
    inputs = tokenizer(batch["document"], return_tensors="pt", truncation=True, max_length=1024, padding=True)
    # Move inputs to the same device as the model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    # Generate summaries for the entire batch
    summary_ids = model.generate(inputs["input_ids"], max_length=128, num_beams=4, early_stopping=True)

    # Decode each generated summary
    summaries = [tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids]
    return {"predicted_summary": summaries}

# Process the test dataset in batches
results = test_dataset.map(generate_summary_batch, batched=True, batch_size=16)



In [22]:
# Extract predictions and references for evaluation
predictions = [ex["predicted_summary"] for ex in results]
references = [[ex["summary"]] for ex in results] # Change this line

# Compute ROUGE scores
rouge_result = rouge_metric.compute(predictions=predictions, references=references)

bleu_result = bleu_metric.compute(
    predictions=predictions,
    references=references
)

print("ROUGE scores:", rouge_result)
print("BLEU score:", bleu_result)

ROUGE scores: {'rouge1': np.float64(0.3443513650306491), 'rouge2': np.float64(0.12777197401326412), 'rougeL': np.float64(0.2748596213519804), 'rougeLsum': np.float64(0.2746388869752648)}
BLEU score: {'bleu': 0.07723805442715455, 'precisions': [0.39010931806350857, 0.12516062426552152, 0.05664017105935946, 0.028573896092680098], 'brevity_penalty': 0.819208236580442, 'length_ratio': 0.8337384118606993, 'translation_length': 96050, 'reference_length': 115204}


### Results Analysis

The ROUGE scores indicate that the generated summaries capture roughly 34% of individual words and around 27–28% of sequential content compared to the references, reflecting a moderate level of content overlap. However, the low BLEU score (about 7.7%), with steep drops in longer n-gram precisions and a brevity penalty, shows that the model struggles to produce longer, more coherent phrases and tends to generate shorter summaries than the reference texts.