**Section 1: Dataset Preparation**

This section covers pre-processing of the CNN/Daily Mail dataset from Kaggle. Because of storage and compute constraints, we used only a small subset of the original dataset, with the following split sizes:

| Split | Subset | Original |
|-------|--------|----------|
| Train | 28,000 | 287,113 |
| Dev | 1,300 | 13,368 |
| Test | 1,100 | 11,490 |

We truncated each split to around 10% of its original size because training on the full dataset took more than 24 hours on local machines with RTX 4090 GPUs. We did not apply any manual data cleaning, as the dataset was already preprocessed by the original authors. However, we used model-specific tokenization during training and inference: the T5 tokenizer for T5-Small and the GPT-2 tokenizer for GPT-2.
The truncated data is saved as .csv files on the local lab machines on campus and used for the rest of our work.
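
As a quick check of the "around 10%" figure, the sketch below computes the share of each split that was kept. The sizes are copied from the figures above; the names `split_sizes` and `subset_fraction` are illustrative and not part of the project code.

```python
# Subset and original split sizes, copied from the figures above.
split_sizes = {
    "train": (28_000, 287_113),
    "dev": (1_300, 13_368),
    "test": (1_100, 11_490),
}


def subset_fraction(subset_rows: int, original_rows: int) -> float:
    """Return the share of the original split that was kept."""
    return subset_rows / original_rows


fractions = {
    name: subset_fraction(kept, total)
    for name, (kept, total) in split_sizes.items()
}

# Print each share as a percentage with two decimals.
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.2%}")
```

All three splits come out slightly under 10%, so the subsets keep roughly the same train/dev/test proportions as the original dataset.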


In [None]:
import pandas as pd

dev = pd.read_csv("/home/parker78/NLP_FinalProj/cnn_dailymail/validation.csv")
train = pd.read_csv("/home/parker78/NLP_FinalProj/cnn_dailymail/train.csv")
test = pd.read_csv("/home/parker78/NLP_FinalProj/cnn_dailymail/test.csv")

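# Keep the first N rows of each split. .iloc is end-exclusive,
# whereas .loc[:N] is label-inclusive and would return N + 1 rows.
train = train.iloc[:28000]
dev = dev.iloc[:1300]
test = test.iloc[:1100]

# index=False avoids writing the row index as an extra unnamed column
dev.to_csv("cnn_dailymail_dev.csv", index=False)
train.to_csv("cnn_dailymail_train.csv", index=False)
test.to_csv("cnn_dailymail_test.csv", index=False)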

**Section 2: GPT-2 Fine-Tuning**

In this section, we lightly fine-tune GPT-2. Because it is a decoder-only model, using it without fine-tuning would mean zero-shot summarization, which leads to much poorer performance.

===

We use the GPT-2 tokenizer and the GPUs on the local lab machine for faster computation. We kept the prompt short because GPT-2 accepts at most 1024 tokens, and that limit includes the prompt. We also enabled truncation in case articles exceeded that threshold, as the mean article length was nearly 800 characters.

===

Given space and time constraints, we did only a very small amount of fine-tuning on this model: 1 epoch, with weight decay (regularization) and learning-rate warm-up. The model was saved and used for our comparative analysis.
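
One consideration with the prompt-plus-summary format described above: when the full sequence is truncated from the right, the summary sits at the end, so a long article can push part or all of it past the limit. The sketch below shows one possible way to guard against that by truncating only the article and masking the prompt in the labels. It works on plain lists of token IDs; `build_example`, its parameters, and the toy IDs are hypothetical and not taken from the project code.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss


def build_example(prefix_ids, article_ids, cue_ids, summary_ids, max_len, pad_id):
    """Fit prompt + summary into max_len tokens by truncating only the article."""
    # Tokens left for the article once the fixed parts are accounted for.
    budget = max_len - len(prefix_ids) - len(cue_ids) - len(summary_ids)
    if budget < 0:
        raise ValueError("prompt text and summary alone exceed max_len")

    prompt_ids = prefix_ids + article_ids[:budget] + cue_ids
    input_ids = prompt_ids + summary_ids
    n_pad = max_len - len(input_ids)

    # Only summary tokens contribute to the loss; prompt and padding are masked.
    labels = [IGNORE_INDEX] * len(prompt_ids) + summary_ids + [IGNORE_INDEX] * n_pad
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    input_ids = input_ids + [pad_id] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Toy token IDs: a 20-token "article" that must be cut to fit 12 tokens in total.
long_example = build_example(
    prefix_ids=[1, 2], article_ids=list(range(100, 120)), cue_ids=[3],
    summary_ids=[7, 8, 9], max_len=12, pad_id=0,
)
# A short article that leaves room for padding.
short_example = build_example(
    prefix_ids=[1, 2], article_ids=[100, 101], cue_ids=[3],
    summary_ids=[7, 8, 9], max_len=12, pad_id=0,
)
print(long_example["labels"])
print(short_example["labels"])
```

In both cases the summary tokens survive intact and are the only positions with real labels, regardless of how long the article is.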

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
import torch

device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}") # my sanity check

train_dataset = load_dataset('csv', data_files={'train': '/home/parker78/NLP_FinalProj/cnn_dailymail_train.csv'})['train']
dev_dataset = load_dataset('csv', data_files={'dev': '/home/parker78/NLP_FinalProj/cnn_dailymail_dev.csv'})['dev']

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2').to(device)
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

def tokenize_function(examples):
    inputs = []
    for article, summary in zip(examples["article"], examples["highlights"]):
        prompt = "Briefly summarize this article:\n" + article + "\nSummary:"
        full_text = prompt + " " + summary

        tokenized = tokenizer(
            full_text,
            max_length=1024,
            padding="max_length",
            truncation=True
        )

        prompt_ids = tokenizer(prompt, truncation=True, max_length=1024)["input_ids"]
        prompt_len = len(prompt_ids)

        label_ids = tokenized["input_ids"].copy()
        label_ids = [
            token if i >= prompt_len and token != tokenizer.pad_token_id else -100
            for i, token in enumerate(label_ids)
        ]

        tokenized["labels"] = label_ids
        inputs.append(tokenized)

    return {
        key: [example[key] for example in inputs]
        for key in inputs[0]
    }


train_tokenized = train_dataset.map(tokenize_function, batched=True)
dev_tokenized = dev_dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments(
    output_dir='./gpt_summarizer',
    eval_strategy='epoch',
    num_train_epochs=1,  # single epoch, matching the description above
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=dev_tokenized
)


trainer.train()

model.save_pretrained("./gpt_summarizer_model")
tokenizer.save_pretrained("./gpt_summarizer_model")

**Section 3: Comparative Analysis**

Here, we collect example outputs: the article, the highlights (the reference), T5-small's summary, and GPT-2's summary. We load our dataset subsets and run both models, T5-small and GPT-2, on the lab computer's GPU. We created two functions, one per model; each uses its model's tokenizer and prompt format and returns a summary of the article passed in.

===

For our metrics, we calculated ROUGE scores measuring how close each generated summary was to the highlights (the reference). Each summary was appended to a list so that articles can be identified by index and GPT-2 and T5-small are compared on the same articles. We printed a few examples for human review and to better understand how the ROUGE scores relate to the summaries themselves. Lastly, the ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum) were computed and printed for further analysis of the two models.
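
To make the ROUGE scores easier to interpret, here is a simplified sketch of ROUGE-N and ROUGE-L as F1 scores over lowercase, whitespace-split tokens. It is an illustration only: it does no stemming and uses a much cruder tokenizer than the `evaluate` implementation used in this section, so its numbers will not match exactly. The function names and example sentences are our own.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, pred_total, ref_total):
    """Harmonic mean of precision (overlap/pred) and recall (overlap/ref)."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n=1):
    """ROUGE-N F1: clipped n-gram overlap between prediction and reference."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # & keeps the minimum count per n-gram
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L F1 based on the longest common subsequence."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
print(round(rouge_n(prediction, reference, 1), 4))
print(round(rouge_n(prediction, reference, 2), 4))
print(round(rouge_l(prediction, reference), 4))
```

In this toy pair, a single changed word costs one of six unigrams but two of five bigrams, which shows why ROUGE-2 tends to be noticeably lower than ROUGE-1 for the same summary.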

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM
import torch
from datasets import load_dataset
import evaluate
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# T5 (Encoder-Decoder)
t5_tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
t5_model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small").to(device)

# GPT-2 (Decoder-only)
gpt2_tokenizer = AutoTokenizer.from_pretrained("/home/parker78/NLP_FinalProj/gpt_summarizer_model")
gpt2_model = AutoModelForCausalLM.from_pretrained("/home/parker78/NLP_FinalProj/gpt_summarizer_model").to(device)

# The subset dataset
train_dataset = load_dataset('csv', data_files={"train": "/home/parker78/NLP_FinalProj/cnn_dailymail_train.csv"})["train"]
dev_dataset = load_dataset('csv', data_files={"dev": "/home/parker78/NLP_FinalProj/cnn_dailymail_dev.csv"})["dev"]
test_dataset = load_dataset('csv', data_files={"test": "/home/parker78/NLP_FinalProj/cnn_dailymail_test.csv"})["test"]

def summarize_t5(article):
    input_text = "summarize: " + article
    inputs = t5_tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True).to(device)
    with torch.no_grad():
        summary_ids = t5_model.generate(inputs["input_ids"], max_length=150, num_beams=4, early_stopping=True)
    return t5_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

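def summarize_gpt2(article, max_article_tokens=480):
    # Truncate the article on its own so the trailing "Summary:" cue is never cut off.
    # max_article_tokens=480 is an assumed budget that leaves room for the prompt text within 512 tokens.
    article_ids = gpt2_tokenizer(article, max_length=max_article_tokens, truncation=True)["input_ids"]
    prompt = "Briefly summarize this article:\n" + gpt2_tokenizer.decode(article_ids) + "\nSummary:"
    inputs = gpt2_tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True).to(device)
    with torch.no_grad():
        outputs = gpt2_model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=150, num_beams=4, early_stopping=True, pad_token_id=gpt2_tokenizer.eos_token_id)

    decoded = gpt2_tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Keep only the text generated after the final "Summary:" cue
    if "Summary:" in decoded:
        return decoded.split("Summary:")[-1].strip()
    else:
        return decoded.strip()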


rouge = evaluate.load("rouge")

t5_preds = []
gpt2_preds = []
refrences = []

for example in tqdm(dev_dataset):
    article = example["article"]
    refrences.append(example["highlights"])

    t5_summary = summarize_t5(article)
    t5_preds.append(t5_summary)

    gpt2_summary = summarize_gpt2(article)
    gpt2_preds.append(gpt2_summary)


t5_rouge = rouge.compute(predictions=t5_preds, references=refrences, use_stemmer=True)
gpt2_rouge = rouge.compute(predictions=gpt2_preds, references=refrences, use_stemmer=True)

# printing a couple of examples (did around 3-10 usually)
for i in range(10):
    print("\nArticle:", dev_dataset[i]["article"][:300], "... \n")
    print("Reference:", refrences[i], "\n")
    print("T5 Summary:", t5_preds[i], "\n")
    print("GPT-2 Summary:", gpt2_preds[i])

print("T5-small ROUGE:")
print({k: round(v * 100, 2) for k, v in t5_rouge.items()})

print("\nGPT-2 ROUGE:")
print({k: round(v * 100, 2) for k, v in gpt2_rouge.items()})

**Section 4: Other Model Testing**

The code block below was used to test the T5-Base model with input sequences of up to 768 tokens. The goal was to mitigate any issues caused by the input sequence (the article) being cut off and disrupting the generated summary. We set the maximum output length to 256 tokens to prevent generated summaries from being too long. This also used the HuggingFace transformers library for the T5 model and T5 tokenizer.
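
To illustrate why a longer input limit can matter, the sketch below reports what share of inputs would be cut off at a given limit and how many tokens are lost on average. The token counts are hypothetical values chosen for illustration, not measurements from our data, and `truncation_stats` is a helper name of our own. The limits compared are 512 (the limit used with T5-small in Section 3) and 768.

```python
def truncation_stats(token_counts, max_input_tokens):
    """Summarize how many inputs exceed a limit and how many tokens are dropped."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    # Number of tokens lost for each input that is longer than the limit.
    over = [n - max_input_tokens for n in token_counts if n > max_input_tokens]
    return {
        "share_truncated": len(over) / len(token_counts),
        "mean_tokens_dropped": sum(over) / len(over) if over else 0.0,
    }


# Hypothetical token counts for five articles (illustrative, not measured).
example_counts = [400, 600, 700, 900, 1200]

stats_512 = truncation_stats(example_counts, 512)
stats_768 = truncation_stats(example_counts, 768)
print("limit 512:", stats_512)
print("limit 768:", stats_768)
```

In practice, the counts would come from running the model's tokenizer over the articles; the same helper then gives a rough sense of how much content each limit discards.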

In [None]:
from datasets import load_dataset
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
import evaluate
import numpy as np

# Note: these file paths will not work in Colab, as
# the truncated files were stored and run locally on the /tmp drives of the lab machines,
# which had sufficient storage space.
train_data = pd.read_csv("/tmp/NLP/CNN_Daily_train.csv")
dev_data = pd.read_csv("/tmp/NLP/CNN_Daily_dev.csv")
prefix = "summarize: "


checkpoint = "google-t5/t5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


# Check for empty, N/A, or null strings
print(train_data.isnull().sum())
print(dev_data.isnull().sum())

# Add the task prefix, then tokenize the articles and highlights with the T5 tokenizer
def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["article"]]
    model_inputs = tokenizer(inputs, max_length=768, truncation=True)
    labels = tokenizer(text_target=examples["highlights"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


# Convert dataframe to huggingface dataset
hf_dataset_train = Dataset.from_pandas(train_data)
hf_dataset_dev = Dataset.from_pandas(dev_data)

tokenized_train = hf_dataset_train.map(preprocess_function, batched=True)
tokenized_dev = hf_dataset_dev.map(preprocess_function, batched=True)

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
rouge = evaluate.load("rouge")


def compute_metrics(eval_pred):
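    predictions, labels = eval_pred
    # Replace any -100 padding in the predictions before decoding
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)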
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    return {k: round(v, 4) for k, v in result.items()}


model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Model arguments, change hparams as necessary
training_args = Seq2SeqTrainingArguments(
    output_dir="test_output",
    eval_strategy="epoch",
    save_strategy="epoch",  # must match eval_strategy when load_best_model_at_end=True
    learning_rate=0.001,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    num_train_epochs=10,
    predict_with_generate=True,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model="eval_rougeL",
    greater_is_better=True,
)


trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_dev,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

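trainer.train()

# Save the best model (restored by load_best_model_at_end) directly into output_dir,
# so it can be reloaded from "test_output" below
trainer.save_model("test_output")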

from transformers import pipeline

best_model = AutoModelForSeq2SeqLM.from_pretrained("test_output")
summarizer = pipeline("summarization", model=best_model, tokenizer=tokenizer)

# Print out articles, summaries, and generated summaries for human evaluation
samples = dev_data.sample(5)
for i, row in samples.iterrows():
    summary = summarizer(prefix + row["article"], max_length=128, min_length=10, do_sample=False)[0]["summary_text"]
    print(f"ARTICLE: {row['article'][:400]}...\n")
    print(f"MODEL SUMMARY: {summary}\n")
    print(f"REFERENCE SUMMARY: {row['highlights']}\n")
    print("="*60)