# NLP PROJECT

11/22/2024

by Tamsyn Evezard


## OVERVIEW

This project involves fine-tuning a pretrained model (BART) with a CNN/DailyMail dataset from Hugging Face for the purpose of text summarization (information extraction). This dataset includes samples (full news articles) and labels (summaries of the articles) that can be used for fine-tuning. Because fine-tuning was computationally intensive with such a large dataset for my CPU, I fine-tuned the BART generative model with a smaller subset of my dataset for proof of concept. However, this could be done with the full dataset.

### STEP 1: Setup
- import dependencies
- load dataset
- create smaller train and eval datasets
- load bart model and tokenizer

In [12]:

!pip install transformers datasets evaluate transformers[torch]
!pip install py7zr

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq
)
import datasets

# Load the CNN/DailyMail dataset
dataset = datasets.load_dataset("cnn_dailymail", "3.0.0")

# Use only a portion of the dataset
train_subset_size = 5000
eval_subset_size = 1000

# Shuffle and select a subset for training and validation
train_dataset = dataset["train"].shuffle(seed=42).select(range(train_subset_size))
eval_dataset = dataset["validation"].shuffle(seed=42).select(range(eval_subset_size))

# Load the pre-trained (non-fine-tuned) BART model and tokenizer
base_model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForSeq2SeqLM.from_pretrained(base_model_name)



### STEP 2: Generate with non-fine-tuned model
- create generate_summary function to be used before and after fine-tuning
- test non-fine-tuned model

In [13]:
# Function to generate summary using any model
def generate_summary(input_text, llm, tokenizer):
    input_ids = tokenizer(
        input_text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=1024
    )
    # Generate summary with parameters
    tokenized_output = llm.generate(
        input_ids["input_ids"],
        min_length=30,
        max_length=200,
        num_beams=4,
        length_penalty=2.0,
        no_repeat_ngram_size=3,
        early_stopping=True
    )
    # Decode the output
    output = tokenizer.decode(tokenized_output[0], skip_special_tokens=True)
    return output

# Test the non-fine-tuned model
sample = dataset["test"][0]["article"]
label = dataset["test"][0]["highlights"]

print("=== Using Non-Fine-Tuned Model ===")
non_fine_tuned_summary = generate_summary(sample, llm=base_model, tokenizer=tokenizer)
print("Sample Article:")
print(sample)
print("-----------------")
print("Model-Generated Summary (Non-Fine-Tuned):")
print(non_fine_tuned_summary)
print("Ground Truth Summary:")
print(label)

=== Using Non-Fine-Tuned Model ===
Sample Article:
(CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC's founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians' efforts to join the body. But Palestinian Foreign Minister 

### STEP 3: Hugging Face Login
 - used to push fine-tuned model to hub

In [3]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

### STEP 4: Fine-tune
- preprocess & tokenize the data (inputs and labels) for smooth fine-tuning
- initialize the data collator to be used for trainer
- set training arguments
- initialize trainer
- train the model and push it to the hub to use later

In [20]:
# Preprocess the dataset for fine-tuning
def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["article"],
        max_length=1024,
        truncation=True,
        padding="max_length"
    )
    labels = tokenizer(
        examples["highlights"],
        max_length=128,
        truncation=True,
        padding="max_length"
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Tokenize the subsets
train_dataset = train_dataset.map(preprocess_function, batched=True)
eval_dataset = eval_dataset.map(preprocess_function, batched=True)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=base_model)

# Define training arguments for fine-tuning
training_args = Seq2SeqTrainingArguments(
    output_dir="./bart-news-finetuned",
    run_name="bart_cnn_finetune",
    hub_model_id="tamsyne8/bart-news-finedtuned-b",
    evaluation_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    logging_dir="./logs",
    logging_steps=10,
    auto_find_batch_size=True,
    save_strategy="epoch",
    push_to_hub=True
)

# Initialize the trainer for fine-tuning
trainer = Seq2SeqTrainer(
    model=base_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)


Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

  trainer = Seq2SeqTrainer(


In [21]:
# Fine-tune the model
trainer.train()
# Save
trainer.push_to_hub("tamsyne8/bart-news-finedtuned-b")

Epoch,Training Loss,Validation Loss
1,0.6404,0.818736
2,0.5459,0.833803




CommitInfo(commit_url='https://huggingface.co/tamsyne8/bart-news-finedtuned-b/commit/f8c2b97602f687299ec38ecbdef6dab5ea70b328', commit_message='tamsyne8/bart-news-finedtuned-b', commit_description='', oid='f8c2b97602f687299ec38ecbdef6dab5ea70b328', pr_url=None, repo_url=RepoUrl('https://huggingface.co/tamsyne8/bart-news-finedtuned-b', endpoint='https://huggingface.co', repo_type='model', repo_id='tamsyne8/bart-news-finedtuned-b'), pr_revision=None, pr_num=None)

### STEP 5: Test the fine-tuned model
- load model from hub
- test fine-tuned model by generating another summary & comparing to the ground truth summary (label)

In [22]:
# Load the fine-tuned model
fine_tuned_model = AutoModelForSeq2SeqLM.from_pretrained("tamsyne8/bart-news-finedtuned-b")

# Test the fine-tuned model
print("\n=== Using Fine-Tuned Model ===")
sample = eval_dataset[0]["article"]
label = eval_dataset[0]["highlights"]
fine_tuned_summary = generate_summary(sample, llm=fine_tuned_model, tokenizer=tokenizer)
print("Sample Article:")
print(sample)
print("-----------------")
print("Model-Generated Summary (Fine-Tuned):")
print(fine_tuned_summary)
print("Ground Truth Summary:")
print(label)




model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]


=== Using Fine-Tuned Model ===




Sample Article:
Jarryd Hayne's move to the NFL is a boost for rugby league in the United States, it has been claimed. The Australia international full-back or centre quit the National Rugby League in October to try his luck in American football and was this week given a three-year contract with the San Francisco 49ers. Peter Illfield, chairman of US Association of Rugby League, said: 'Jarryd, at 27, is one of the most gifted and talented rugby league players in Australia. He is an extraordinary athlete. Jarryd Hayne (right) has signed with the San Francisco 49ers after quitting the NRL in October . Hayne, who played rugby league for Australia, has signed a three year contract with the 49ers . 'His three-year deal with the 49ers, as an expected running back, gives the USA Rugby League a connection with the American football lover like never before. 'Jarryd's profile and playing ability will bring our sport to the attention of many. It also has the possibility of showing the American col

### STEP 6: Test an additional article

Let's test both the base and the fine-tuned model on a sample without a label (the assignment prompt for this project on blackboard), to get a simple "real world" effect of fine-tuning.

In [23]:
# Test the non-tuned model with my own "article": the assignment prompt for this project on blackboard
print("\n=== Using Base Model ===")
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForSeq2SeqLM.from_pretrained(base_model_name)
sample = "This is the third of 3 independent programming projects this semester. Students should build a generative large language model. Identify an application domain such as papers by Hinton. Use transfer learning i.e. retraining some parameters of a pre-trained model. As before, the recommended platform is: Python, Jupyter Notebook, Tensorflow. Students are encouraged to email technical questions to the professor or to schedule a meeting, live or virtual, by suggesting 3 meeting times in email (to jmiller@hood.edu). The projects are due Nov 22. The grader will take into account correctness of the experiment, clarity of the code and explanation, rigor with which the model was evaluated, and cleverness in trying things. "
summary = generate_summary(sample, llm=base_model, tokenizer=tokenizer)
print("Sample Article:")
print(sample)
print("-----------------")
print("Model-Generated Summary:")
print(summary)

# Test the fine-tuned model with my own "article"
print("\n=== Using Fine-Tuned Model ===")
sample = "This is the third of 3 independent programming projects this semester. Students should build a generative large language model. Identify an application domain such as papers by Hinton. Use transfer learning i.e. retraining some parameters of a pre-trained model. As before, the recommended platform is: Python, Jupyter Notebook, Tensorflow. Students are encouraged to email technical questions to the professor or to schedule a meeting, live or virtual, by suggesting 3 meeting times in email (to jmiller@hood.edu). The projects are due Nov 22. The grader will take into account correctness of the experiment, clarity of the code and explanation, rigor with which the model was evaluated, and cleverness in trying things. "
fine_tuned_summary = generate_summary(sample, llm=fine_tuned_model, tokenizer=tokenizer)
print("Sample Article:")
print(sample)
print("-----------------")
print("Model-Generated Summary (Fine-Tuned):")
print(fine_tuned_summary)


=== Using Base Model ===
Sample Article:
This is the third of 3 independent programming projects this semester. Students should build a generative large language model. Identify an application domain such as papers by Hinton. Use transfer learning i.e. retraining some parameters of a pre-trained model. As before, the recommended platform is: Python, Jupyter Notebook, Tensorflow. Students are encouraged to email technical questions to the professor or to schedule a meeting, live or virtual, by suggesting 3 meeting times in email (to jmiller@hood.edu). The projects are due Nov 22. The grader will take into account correctness of the experiment, clarity of the code and explanation, rigor with which the model was evaluated, and cleverness in trying things. 
-----------------
Model-Generated Summary:
Students should build a generative large language model. Identify an application domain such as papers by Hinton. Use transfer learning i.e. retraining some parameters of a pre-trained model.


# CONCLUSION

The fine-tuning, even with 2 epochs of a smaller section of the dataset, proved to generate a summary significantly closer to the label. Although the summary generated by the non-fine-tuned bart model was accurate, it was noticibly different compared to the label. The fine-tuning's improvement of the bart model for summarization was highlighted by the "real world" experiment exercised at the end.

## REFERENCES

#### <i>News Dataset</i>:
[1] “ccdv/cnn_dailymail · Datasets at Hugging Face,” huggingface.co, Apr. 16, 2023. https://huggingface.co/datasets/ccdv/cnn_dailymail


#### <i>Fine-tuning Tutorial: adjusted the original by using the news dataset instead of the samsum (dialogue) dataset, and fine-tuned with a smaller subset of the dataset</i>

[2] Ingenium Academy, “Fine-Tuning A LLM For Summarization | Generative AI with Hugging Face | Ingenium Academy,” YouTube, Sep. 19, 2023. https://www.youtube.com/watch?v=msgLLudzlLg (accessed Nov. 17, 2024).
