<a href="https://www.kaggle.com/code/aisuko/text-summarization-with-bart-series-llm?scriptVersionId=163528181" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

In this notebook, we will fine-tune the `facebook/bart-large-xsum` model on the `SamSum` dataset.

Note: One technique that we did not mention in the previous notebook is `transfer learning`, which in this context is also called `fine-tuning`.


# Evaluation Strategy

Evaluating performance for language models can be quite tricky, especially when it comes to text summarization. The goal of our model is to produce a short sentence describing the content of a dialogue, while maintaining all the important information within that dialogue.

One of the quantitative metrics we can employ to evaluate performance is the `ROUGE Score`. It is one of the most widely used metrics for text summarization, and it evaluates performance by comparing a machine-generated summary to a human-generated summary used as a reference.

The similarity between both summaries is measured by analyzing the overlapping `n-grams`, either single words or sequences of words that are present in both summaries. These can be unigrams (ROUGE-1), where only the overlap of single words is measured; bigrams (ROUGE-2), where we measure the overlap of two-word sequences; trigrams (ROUGE-3), where we measure the overlap of three-word sequences; etc. Besides that, we also have:


**ROUGE-L**

It measures the *Longest Common Subsequence (LCS)* between the two summaries, which helps to capture the content coverage of the machine-generated text. If both summaries contain the words "the apple is green" in that order, they count as a match regardless of where they appear in each text, even when other words come in between.
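
To make the LCS idea concrete, here is a minimal pure-Python sketch. It is not the implementation used by the `rouge-score` package: it assumes plain lowercase whitespace tokenization with no stemming, and the function names `lcs_length` and `rouge_l_f1` are made up for this illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F-measure on whitespace tokens (illustrative only)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the candidate covered by the LCS
    recall = lcs / len(ref)      # share of the reference covered by the LCS
    return 2 * precision * recall / (precision + recall)


reference = "the apple is green"
candidate = "the big apple is very green"
print(lcs_length(reference.split(), candidate.split()))
print(round(rouge_l_f1(reference, candidate), 4))
```

The extra words "big" and "very" do not break the match, because a subsequence only requires the shared words to appear in the same order.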

**ROUGE-S**

It evaluates the overlap of skip-bigrams, which are pairs of words in their original order that permit gaps between them. This helps to measure the coherence of a machine-generated summary. For example, in the phrase "this apple is absolutely green", the pair ("apple", "green") is a skip-bigram even though two other words separate them, so it can match the same pair in a reference such as "this apple is green".

These scores are reported here on a scale from 0 to 100, where 0 indicates no match and 100 indicates a perfect match between both summaries. Besides quantitative metrics, it is useful to use `human evaluation` to analyze the output of language models, since we are able to comprehend text in a way that a machine does not.
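
The skip-bigram idea can be sketched in a few lines of Python. This is a simplified illustration, not a library implementation: it assumes whitespace tokenization, treats skip-bigrams as a set (so repeated pairs are counted once), and places no limit on the gap size. The names `skip_bigrams` and `rouge_s_f1` are hypothetical.

```python
from itertools import combinations


def skip_bigrams(tokens):
    """All ordered token pairs (w_i, w_j) with i < j, with any gap allowed."""
    return set(combinations(tokens, 2))


def rouge_s_f1(reference, candidate):
    """Simplified ROUGE-S F-measure based on sets of skip-bigrams."""
    ref = skip_bigrams(reference.lower().split())
    cand = skip_bigrams(candidate.lower().split())
    overlap = len(ref & cand)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


candidate = "this apple is absolutely green"
reference = "this apple is green"
print(("apple", "green") in skip_bigrams(candidate.split()))
print(round(rouge_s_f1(reference, candidate), 4))
```

Note that the metrics reported later in this notebook are ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-Lsum, so this sketch is only meant to clarify the definition.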
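
As a rough illustration of the n-gram overlap behind ROUGE-1 and ROUGE-2, the sketch below counts shared n-grams between a reference and a candidate summary. It assumes lowercase whitespace tokenization without stemming, so its numbers will not exactly match the `rouge-score` package; `ngrams` and `rouge_n` are names invented for this example, and the candidate sentence is made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n):
    """Simplified ROUGE-N precision, recall and F-measure."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "amanda baked cookies and will bring jerry some tomorrow"
candidate = "amanda baked cookies for jerry"  # hypothetical model output

for n in (1, 2):
    scores = rouge_n(reference, candidate, n)
    # scale to 0-100 to match the convention used in this notebook
    print(f"ROUGE-{n}:", {k: round(v * 100, 2) for k, v in scores.items()})
```

The candidate is short, so its precision is high while its recall is low: most of what it says is in the reference, but it misses a large part of the reference.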



In [1]:
!nvidia-smi # Checking GPU

Tue Feb 20 06:35:59 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:00:05.0 Off |  

In [2]:
%%capture --no-stderr
!pip install transformers==4.37.2
!pip install datasets==2.17.0
!pip install evaluate==0.4.1
!pip install rouge-score==0.1.2

In [3]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ['MODEL']='facebook/bart-large-xsum'

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tuning BART Series LLMs"
os.environ["WANDB_NOTES"] = ""
os.environ["WANDB_NAME"] = "ft-facebook-bart-large-xsum-on-samsum"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [4]:
import warnings

warnings.filterwarnings('ignore')

In [5]:
# Data Handling
import pandas as pd

train=pd.read_csv('/kaggle/input/samsum-dataset-text-summarization/samsum-train.csv')
test=pd.read_csv('/kaggle/input/samsum-dataset-text-summarization/samsum-test.csv')
val=pd.read_csv('/kaggle/input/samsum-dataset-text-summarization/samsum-validation.csv')
type(train)

pandas.core.frame.DataFrame

In the notebook [Visualisation and Statistic SamSum Dataset](https://www.kaggle.com/code/aisuko/visualisation-and-statistic-samsum-dataset), we saw that a few texts contain tags, such as `<file_photo>`, in the dialogue. Let's remove these tags from the texts.

In [6]:
print(train['dialogue'].iloc[14727])

Romeo: You are on my ‘People you may know’ list.
Greta: Ah, maybe it is because of the changed number of somebody’s?
Greta: I don’t know you?
Romeo: This might be the beginning of a beautiful relationship
Romeo: How about adding me on your friend list and talk a bit?
Greta: No.
Romeo: Okay I see.


In [7]:
import re

def clean_tags(text):
    # remove anything that looks like a tag, e.g. <file_photo>
    cleaned=re.sub(r'<.*?>', '', text)
    
    # drop turns left empty after tag removal, i.e. "Speaker:" followed only by whitespace
    cleaned='\n'.join([line for line in cleaned.split('\n') if not re.match(r'.*:\s*$', line)])
    return cleaned

test1=clean_tags(train['dialogue'].iloc[14727])
test2=clean_tags(test['dialogue'].iloc[0])

print(test1)
print('\n'*3)
print(test2)

Romeo: You are on my ‘People you may know’ list.
Greta: Ah, maybe it is because of the changed number of somebody’s?
Greta: I don’t know you?
Romeo: This might be the beginning of a beautiful relationship
Romeo: How about adding me on your friend list and talk a bit?
Greta: No.
Romeo: Okay I see.




Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye


Let's define a function that applies `clean_tags` to the entire datasets. This kind of data cleansing is useful for eliminating noise.

In [8]:
def clean_df(df, cols):
    for col in cols:
        df[col]=df[col].fillna('').apply(clean_tags)
    return df

train=clean_df(train, ['dialogue','summary'])
test=clean_df(test, ['dialogue', 'summary'])
val=clean_df(val, ['dialogue', 'summary'])

# visualizing results
train.tail(3)

Unnamed: 0,id,dialogue,summary
14729,13819050,John: Every day some bad news. Japan will hunt...,Japan is going to hunt whales again. Island an...
14730,13828395,Jennifer: Dear Celia! How are you doing?\r\nJe...,Celia couldn't make it to the afternoon with t...
14731,13729017,Georgia: are you ready for hotel hunting? We n...,Georgia and Juliette are looking for a hotel i...


In [9]:
# Data Handling
from datasets import Dataset

train_ds=Dataset.from_pandas(train)
test_ds=Dataset.from_pandas(test)
val_ds=Dataset.from_pandas(val)

print(train_ds)
print('\n'*2)
print(test_ds)
print('\n'*2)
print(val_ds)

Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 14732
})



Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 819
})



Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 818
})


In [10]:
train_ds[0]

{'id': '13818513',
 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}

# Preprocess the data

The following `preprocess_func` can be directly copied from the Transformers documentation, and it serves well to preprocess data for several NLP tasks.

In [11]:
from transformers import BartTokenizer, BartForConditionalGeneration # BART tokenizer and architecture

tokenizer=BartTokenizer.from_pretrained(os.getenv('MODEL'))
tokenizer

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.51k [00:00<?, ?B/s]

BartTokenizer(name_or_path='facebook/bart-large-xsum', vocab_size=50265, model_max_length=1024, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	50264: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True, special=True),
}

In [12]:
def preprocess_func(example):
    # Collect every `dialogue` in the batch; these are the inputs to the model
    inputs=[doc for doc in example['dialogue']]
    # The tokenizer converts the input dialogues into token IDs that the BART model can process.
    # With truncation=True, dialogues longer than `max_length` (1024 tokens) are cut to that length
    model_inputs=tokenizer(inputs, max_length=1024, truncation=True)
    
    # Tokenize the targets (our summaries). Summaries are expected to be much shorter than dialogues, hence max_length=128.
    # `text_target` replaces the deprecated `as_target_tokenizer()` context manager
    labels=tokenizer(text_target=example['summary'], max_length=128, truncation=True)
    
    # Add the tokenized labels to the preprocessed dataset, alongside the tokenized inputs
    model_inputs['labels']=labels['input_ids']
    return model_inputs


tokenized_train= train_ds.map(preprocess_func, batched=True, remove_columns=['id', 'dialogue', 'summary'])
tokenized_test=test_ds.map(preprocess_func, batched=True, remove_columns=['id', 'dialogue', 'summary'])
tokenized_val=val_ds.map(preprocess_func, batched=True, remove_columns=['id', 'dialogue', 'summary'])

print(tokenized_train)
print(tokenized_test)
print(tokenized_val)

Map:   0%|          | 0/14732 [00:00<?, ? examples/s]

Map:   0%|          | 0/819 [00:00<?, ? examples/s]

Map:   0%|          | 0/818 [00:00<?, ? examples/s]

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 14732
})
Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 819
})
Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 818
})


## Checking sample

Our tokenized datasets now consist of only three features: `input_ids`, `attention_mask` and `labels`. Let's print a sample from our tokenized train dataset to investigate further how the preprocess function altered the data.

In [13]:
sample=tokenized_train[0]
print(sample['input_ids'])
print(sample['attention_mask'])
print(sample['labels'])

[0, 10127, 5219, 35, 38, 17241, 1437, 15269, 4, 1832, 47, 236, 103, 116, 50121, 50118, 39237, 35, 9136, 328, 50121, 50118, 10127, 5219, 35, 38, 581, 836, 47, 3859, 48433, 2]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[0, 10127, 5219, 17241, 15269, 8, 40, 836, 6509, 103, 3859, 4, 2]


**input_ids**

These are the token IDs mapped from the dialogues. Each token represents a word or subword in BART's vocabulary. For instance, the IDs 10127 and 5219 open both the dialogue and the summary of this sample, which suggests that together they encode the name "Amanda". A frequent word usually maps to a single token, while a rarer word is split into several subword tokens.

**attention_mask**

This mask indicates which tokens the model should pay attention to and which tokens should be ignored. It is mostly used in the context of padding: when padding tokens are added to equalize the lengths of the sequences in a batch, they do not hold any meaningful information, so the attention mask ensures the model does not focus on them. In this specific sample, all tokens are marked as `1`, meaning they are all relevant and none of them are used for padding.
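
To see what a mask with zeros looks like, here is a small pure-Python sketch of right-padding a batch. It only mimics the idea and does not call the tokenizer; `pad_batch` is a hypothetical helper, the two short sequences are made up from IDs seen in the sample above, and the pad ID of 1 is taken from the `<pad>` entry in the tokenizer printout.

```python
PAD_ID = 1  # id of "<pad>" according to the tokenizer printed earlier


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# two made-up sequences of different lengths
batch = [[0, 10127, 5219, 35, 2], [0, 9136, 2]]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

The shorter sequence receives two padding tokens, and the corresponding positions in its mask are set to 0.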

**labels**

Similarly to the first feature, these are token IDs obtained from the words and subwords in the summaries. These are the tokens that the model is trained to produce as output.

# Modeling

In [14]:
from transformers import pipeline

summarizer=pipeline('summarization', model=os.getenv('MODEL'))

news="summarize:Melbourne, Australia's cultural capital, pulsates with a vibrant energy. Grand Victorian architecture mingles with modern laneways bursting with street art, cafes, and independent shops. World-class museums and galleries showcase diverse collections, while renowned sporting events like the Australian Open electrify the atmosphere. Foodies delight in a multicultural culinary scene, from Michelin-starred restaurants to hidden gems serving up global flavors. Beyond the city, pristine beaches and lush national parks offer escapes into nature, making Melbourne a city that truly has it all."

summarizer(news)

2024-02-20 06:38:28.479653: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-20 06:38:28.479777: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-20 06:38:28.754605: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


pytorch_model.bin:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/309 [00:00<?, ?B/s]

[{'summary_text': "Melbourne, Australia, is one of the world's most visited cities, according to Lonely Planet."}]

# Loading model

The printed architecture below shows that the model consists of an encoder and a decoder. We can see the linear layers as well as the activation functions, which use GELU instead of the more typical ReLU. It is also interesting to observe the output layer, **lm_head**, whose `out_features=50264` matches the size of the embedding matrix: for every position, the model produces one score per token in its vocabulary, which is what makes this architecture suitable for generating text.

In [15]:
from transformers import BartForConditionalGeneration

model=BartForConditionalGeneration.from_pretrained(os.getenv('MODEL'))
print(model)

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(50264, 1024, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50264, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0-11): 12 x BartEncoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
          (final_layer_norm): La
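
The difference between the GELU activation that appears in the printed layers and ReLU can be checked numerically. The sketch below uses the exact erf-based definition of GELU, $\mathrm{GELU}(x) = \tfrac{1}{2}x\,(1 + \mathrm{erf}(x/\sqrt{2}))$, in plain Python; it is an illustration of the two functions, not a copy of the `transformers` implementation.

```python
import math


def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

Unlike ReLU, GELU is smooth around zero and lets small negative values through with a reduced magnitude instead of cutting them off completely.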

## Batching the data

We must now use `DataCollatorForSeq2Seq` to batch the data. These data collators may also automatically apply some processing techniques, such as padding.
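
As a rough sketch of one thing such a collator does, the snippet below pads label sequences in pure Python. By default, `DataCollatorForSeq2Seq` pads labels with `-100` (its `label_pad_token_id`), a value that the loss function ignores, so padded positions do not contribute to training. The helpers `pad_labels` and `restore_for_decoding` are hypothetical names for this illustration, and the second label sequence is made up.

```python
LABEL_PAD_ID = -100  # default label padding value of DataCollatorForSeq2Seq
PAD_ID = 1           # id of "<pad>" according to the tokenizer printed earlier


def pad_labels(label_batch, pad_value=LABEL_PAD_ID):
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(labels) for labels in label_batch)
    return [list(labels) + [pad_value] * (max_len - len(labels)) for labels in label_batch]


def restore_for_decoding(padded_labels, pad_id=PAD_ID):
    """Swap -100 back to the pad token id so the labels can be decoded to text."""
    return [[pad_id if tok == LABEL_PAD_ID else tok for tok in row] for row in padded_labels]


labels = [
    [0, 10127, 5219, 17241, 15269, 8, 40, 836, 6509, 103, 3859, 4, 2],  # sample shown above
    [0, 10127, 5219, 17241, 15269, 4, 2],                               # made-up shorter summary
]
padded = pad_labels(labels)
print(padded[1])
print(restore_for_decoding(padded)[1])
```

The second helper mirrors the replacement of `-100` by `tokenizer.pad_token_id` that `compute_metrics` performs before decoding the labels.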

In [16]:
from transformers import DataCollatorForSeq2Seq

data_collator= DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
print(data_collator)

DataCollatorForSeq2Seq(tokenizer=BartTokenizer(name_or_path='facebook/bart-large-xsum', vocab_size=50265, model_max_length=1024, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	50264: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True, special=True),
}, model=BartForConditionalGeneration(
  (model): BartModel(
    (

# Loading evaluation metrics

In [17]:
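# note: `load_metric` is deprecated in newer `datasets` releases in favor of the `evaluate` library
from datasets import load_metric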

metric=load_metric('rouge')

Downloading builder script:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

In [18]:
import nltk
import numpy as np

# download the Punkt models that nltk.sent_tokenize uses to split a text into sentences
nltk.download('punkt')

def compute_metrics(eval_pred):
    predictions, labels=eval_pred # obtaining predictions and true labels
    
    # decoding predictions
    decoded_preds=tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # obtaining the true labels tokens, while eliminating any possible masked token (i.e: label=-100)
    labels=np.where(labels!=-100, labels, tokenizer.pad_token_id)
    decoded_labels=tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # rouge expects a newline after each sentence
    decoded_preds=['\n'.join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels=['\n'.join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    # computing rouge score
    result=metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result={key: value.mid.fmeasure*100 for key, value in result.items()} # keep the mid F-measure of each ROUGE variant, scaled to 0-100
    
    # add mean generated length
    prediction_lens=[np.count_nonzero(pred!=tokenizer.pad_token_id) for pred in predictions]
    result['gen_len']=np.mean(prediction_lens)
    return {k: round(v,4) for k,v in result.items()}

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Training

In [19]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args=Seq2SeqTrainingArguments(
    output_dir=os.getenv('WANDB_NAME'),
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',
    seed=42,
    learning_rate=2e-5,
    max_steps=100, # takes precedence over num_train_epochs: training stops after 100 optimizer steps
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=1, # only for testing
    predict_with_generate=True,
    fp16=True,
    report_to='wandb',
    run_name=os.getenv('WANDB_NAME')
)

trainer=Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33murakiny[0m ([33mcausal_language_trainer[0m). Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
0,No log,1.505537,49.5512,24.5568,40.7039,45.2274,26.4237


Non-default generation parameters: {'max_length': 62, 'min_length': 11, 'early_stopping': True, 'num_beams': 6, 'no_repeat_ngram_size': 3, 'forced_eos_token_id': 2}
There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].


TrainOutput(global_step=100, training_loss=1.5889302062988282, metrics={'train_runtime': 763.3675, 'train_samples_per_second': 4.192, 'train_steps_per_second': 0.131, 'total_flos': 2331956926414848.0, 'train_loss': 1.5889302062988282, 'epoch': 0.22})
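
The reported `'epoch': 0.22` can be reproduced with a little arithmetic from the training arguments. The sketch below assumes that both Tesla T4 GPUs listed by `nvidia-smi` were used for training; that GPU count is an assumption, not something the logs state directly, although it is consistent with the runtime and throughput in `TrainOutput` (763.3675 s × 4.192 samples/s ≈ 3200 samples).

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 2          # assumption: both T4 GPUs shown by nvidia-smi were used
max_steps = 100
train_rows = 14732    # num_rows of the training dataset

# samples consumed by a single optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# samples consumed by the whole run
samples_seen = effective_batch * max_steps
# fraction of one pass over the training set
epoch_fraction = samples_seen / train_rows

print(effective_batch, samples_seen, round(epoch_fraction, 2))
```

In other words, this run covers only about a fifth of the training set, so the scores above come from a short test run and not from a full epoch.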

# Evaluating

In [20]:
validation=trainer.evaluate(eval_dataset=tokenized_val)
print(validation)

{'eval_loss': 1.4688810110092163, 'eval_rouge1': 50.9912, 'eval_rouge2': 25.7585, 'eval_rougeL': 41.4197, 'eval_rougeLsum': 46.5946, 'eval_gen_len': 26.8814, 'eval_runtime': 327.367, 'eval_samples_per_second': 2.499, 'eval_steps_per_second': 0.315, 'epoch': 0.22}


In [21]:
from transformers import GenerationConfig
kwargs={
    'model_name': f'{os.getenv("WANDB_NAME")}',
    'finetuned_from': f'{os.getenv("MODEL")}',
    'tasks': 'summarization'
}

trainer.push_to_hub(**kwargs)
tokenizer.push_to_hub(os.getenv('WANDB_NAME'))

generation_config=GenerationConfig(
    max_length=62, min_length=11, early_stopping=True, num_beams=6, no_repeat_ngram_size=3, forced_eos_token_id=2
)

generation_config.save_pretrained('aisuko/'+os.getenv('WANDB_NAME'), push_to_hub=True)

Non-default generation parameters: {'max_length': 62, 'min_length': 11, 'early_stopping': True, 'num_beams': 6, 'no_repeat_ngram_size': 3, 'forced_eos_token_id': 2}


model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

training_args.bin:   0%|          | 0.00/4.92k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/1.70k [00:00<?, ?B/s]

# Inference

In [22]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("aisuko/ft-facebook-bart-large-xsum-on-samsum")
model = AutoModelForSeq2SeqLM.from_pretrained("aisuko/ft-facebook-bart-large-xsum-on-samsum")

OSError: aisuko/ft-facebook-bart-large-xsum-on-samsum does not appear to have a file named config.json. Checkout 'https://huggingface.co/aisuko/ft-facebook-bart-large-xsum-on-samsum/None' for available files.

In [None]:
val_ds[35]['dialogue']

In [None]:
summarizer(val_ds[35]['dialogue'])