<a href="https://www.kaggle.com/code/aisuko/text-summarization-with-bart-series-llm?scriptVersionId=163496659" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

In this notebook, we will fine-tune the `facebook/bart-large-xsum` model on the `SamSum` dataset.

Note: This notebook relies on a technique we did not mention in the previous notebook: `transfer learning`, which in this setting is also called `fine-tuning`.


# Evaluation Strategy

Evaluating performance for language models can be quite tricky, especially when it comes to text summarization. The goal of our model is to produce a short sentence describing the content of a dialogue, while maintaining all the important information within that dialogue.

One quantitative metric we can use to evaluate performance is the `ROUGE Score`. It is one of the most widely used metrics for text summarization, and it evaluates performance by comparing a machine-generated summary to a human-written summary used as a reference.

The similarity between both summaries is measured by analyzing the overlapping `n-grams`, either single words or sequences of words that are present in both summaries. These can be unigrams (ROUGE-1), where only the overlap of single words is measured; bigrams (ROUGE-2), where we measure the overlap of two-word sequences; trigrams (ROUGE-3), where we measure the overlap of three-word sequences; etc. Besides that, we also have:


**ROUGE-L**

It measures the *Longest Common Subsequence (LCS)* between the two summaries, which helps to capture the content coverage of the machine-generated text. If both summaries contain the words "the apple is green" in that order, we have a match regardless of where they appear in each text, and the words do not need to be adjacent.
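
As a rough illustration of the LCS idea, here is a minimal pure-Python sketch. It assumes lowercased whitespace tokenization and a plain F1 score, so its numbers may differ from those of the `rouge-score` package, which applies its own tokenization and optional stemming. The function names and the example sentences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(reference, candidate):
    """Return (precision, recall, f1) based on the LCS of whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)


reference = "the apple is green"            # hypothetical reference summary
candidate = "the big apple is very green"   # hypothetical model output
print(rouge_l(reference, candidate))
```

All four reference words appear in the candidate in the same order, so recall is 1.0 even though extra words sit between them; precision is lower because the candidate is longer.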

**ROUGE-S**

It evaluates the overlap of skip-bigrams, which are pairs of words kept in their original order that permit gaps between them. This helps to measure the coherence of a machine-generated summary. For example, in the phrase "this apple is absolutely green", the pair "apple" and "green" still counts as a match even though other words sit between the two terms.
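
The skip-bigram idea can be sketched in a few lines of pure Python. This is an illustrative simplification: it assumes whitespace tokenization, allows gaps of any length, ignores repeated pairs by using sets, and reports recall only. The helper names are hypothetical.

```python
from itertools import combinations


def skip_bigrams(text):
    """All ordered word pairs (w_i, w_j) with i < j, allowing any gap."""
    tokens = text.lower().split()
    return set(combinations(tokens, 2))


def rouge_s_recall(reference, candidate):
    """Fraction of the reference's skip-bigrams also found in the candidate."""
    ref, cand = skip_bigrams(reference), skip_bigrams(candidate)
    return len(ref & cand) / len(ref) if ref else 0.0


reference = "this apple is green"             # hypothetical reference summary
candidate = "this apple is absolutely green"  # hypothetical model output

print(("apple", "green") in skip_bigrams(candidate))
print(rouge_s_recall(reference, candidate))
```

Because word order is preserved, the reversed pair `("green", "apple")` is not a skip-bigram of the candidate.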

These scores are fractions between 0 and 1, often reported on a scale from 0 to 100, where 0 indicates no match and the maximum indicates a perfect match between both summaries. Besides quantitative metrics, it is useful to use `human evaluation` to analyze the output of language models, since we are able to comprehend text in a way that a machine does not.
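
To make the n-gram overlap and the score range concrete, the following is a minimal sketch of ROUGE-N in pure Python. It assumes lowercased whitespace tokens and clipped counts, and it is not the implementation used by the `rouge-score` package. The function names and both sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous window of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, f1) in [0, 1] from clipped n-gram overlap."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # minimum count per shared n-gram
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return precision, recall, 2 * precision * recall / (precision + recall)


reference = "amanda baked cookies and will bring jerry some tomorrow"
candidate = "amanda baked cookies for jerry"

for n in (1, 2):
    p, r, f = rouge_n(reference, candidate, n)
    # Multiplying by 100 gives the 0 to 100 scale often used in reports
    print(f"ROUGE-{n}: precision={p:.2f} recall={r:.2f} f1={f:.2f} ({100 * f:.1f}/100)")
```

The bigram scores are lower than the unigram scores because a single inserted or missing word breaks every two-word sequence that spans it.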



In [None]:
!nvidia-smi # Checking GPU

In [None]:
%%capture --no-stderr
!pip install transformers==4.37.2
!pip install datasets==2.17.0
!pip install evaluate==0.4.1
!pip install rouge-score==0.1.2
# Installing library to save zip archives
!pip install py7zr==0.20.8

In [None]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ['MODEL']='facebook/bart-large-xsum'

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tuning BART"
os.environ["WANDB_NOTES"] = "Fine-tuning BART on SamSum"
os.environ["WANDB_NAME"] = "ft-bart-on-samsum"

In [None]:
import warnings

warnings.filterwarnings('ignore')

In [None]:
# Data Handling
import pandas as pd

train=pd.read_csv('/kaggle/input/samsum-dataset-text-summarization/samsum-train.csv')
test=pd.read_csv('/kaggle/input/samsum-dataset-text-summarization/samsum-test.csv')
val=pd.read_csv('/kaggle/input/samsum-dataset-text-summarization/samsum-validation.csv')
type(train)

In the notebook [Visualisation and Statistic SamSum Dataset](https://www.kaggle.com/code/aisuko/visualisation-and-statistic-samsum-dataset), we saw that a few texts contain tags, such as `<file_photo>` in the dialogues. Let's remove these tags from the texts.

In [None]:
print(train['dialogue'].iloc[14727])

In [None]:
import re

def clean_tags(text):
    clean=re.compile('<.*?>') # compiling the tag pattern (non-greedy, one tag at a time)
    clean=re.sub(clean, '', text) # replacing each tag with an empty string
    
    # removing turns that are empty after tag removal, e.g. "Speaker: "
    clean='\n'.join([line for line in clean.split('\n') if not re.match(r'.*:\s*$', line)])
    return clean

test1=clean_tags(train['dialogue'].iloc[14727])
test2=clean_tags(test['dialogue'].iloc[0])

print(test1)
print('\n'*3)
print(test2)
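
To see what the two regular expressions do in isolation, here is a small self-contained sketch that applies the same patterns to a hypothetical dialogue (the sample text and the constant names are made up for illustration, not taken from the dataset).

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')           # non-greedy: matches one tag at a time
EMPTY_TURN_PATTERN = re.compile(r'.*:\s*$')  # a line with nothing after its last colon

# Hypothetical dialogue, not taken from SamSum
sample = "Ann: <file_photo>\nBob: Nice picture!\nAnn: Thanks <3 see you at 5"

without_tags = TAG_PATTERN.sub('', sample)
kept = [line for line in without_tags.split('\n')
        if not EMPTY_TURN_PATTERN.match(line)]
print('\n'.join(kept))
```

The first turn disappears because only the speaker name remains once the tag is stripped, while the `<3` in the last turn survives because it has no closing `>`. One caveat of this approach: any turn that simply ends with a colon, such as `Bob: here is the list:`, also matches the second pattern and would be dropped.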

Let's define a function that applies `clean_tags` to the entire datasets. Such data cleansing is beneficial because it removes noisy information.

In [None]:
def clean_df(df, cols):
    for col in cols:
        df[col]=df[col].fillna('').apply(clean_tags)
    return df

train=clean_df(train, ['dialogue','summary'])
test=clean_df(test, ['dialogue', 'summary'])
val=clean_df(val, ['dialogue', 'summary'])

# visualizing results
train.tail(3)

In [None]:
# Data Handling
from datasets import Dataset, load_metric

train_ds=Dataset.from_pandas(train)
test_ds=Dataset.from_pandas(test)
val_ds=Dataset.from_pandas(val)

print(train_ds)
print('\n'*2)
print(test_ds)
print('\n'*2)
print(val_ds)

In [None]:
train_ds[0]

# Modeling

In [None]:
from transformers import pipeline

summarizer=pipeline('summarization', model=os.getenv('MODEL'))

news='''Melbourne, Australia, a vibrant city pulsating with energy, seamlessly blends historical charm with modern dynamism. Nestled on the southeastern coast, it beckons with iconic landmarks, hidden alleyways teeming with artistic expression, and a diverse culinary scene that tantalizes every palate. Immerse yourself in the city's soul at Federation Square, a modern marvel where plazas, bars, and restaurants pulsate with life beside the Yarra River. Delve into Melbourne's artistic heart, the Southbank Arts Precinct, where the renowned Arts Centre Melbourne stages captivating performances and the National Gallery of Victoria houses a treasure trove of Australian and international art. Beyond the cultural haven, Melbourne's laneways unveil a hidden world. These narrow passageways, once industrial backstreets, have transformed into vibrant arteries brimming with trendy cafes, eclectic street art, and hidden bars, pulsating with a unique character. Explore Hosier Lane, adorned with captivating murals, or AC/DC Lane, a haven for rock n' roll memorabilia and hidden bars. Melbourne's culinary scene is a symphony of flavors, a testament to its multicultural tapestry. Michelin-starred restaurants offer exquisite experiences, while hidden gems tucked away in laneways tantalize with innovative dishes. Venture beyond the city center and discover the Yarra Valley, a renowned wine region, or embark on the Great Ocean Road, a world-famous coastal drive unfolding breathtaking scenery. Escape the urban buzz in the Dandenong Ranges, a haven of lush rainforests and charming villages. Melbourne's cultural tapestry is as diverse as its population, acknowledging the traditional owners and celebrating their heritage through various initiatives. Vibrant festivals showcase the city's multicultural spirit, and Melbourne embraces inclusivity, fostering a thriving LGBTQ+ community. 
As the sun sets, Melbourne's nightlife ignites, offering rooftop bars with stunning cityscapes, intimate jazz clubs pulsating with live music, and underground dance floors catering to all musical tastes. Melbourne defies definition, a symphony of experiences where history whispers, art bursts forth, and flavors tantalize. It's a city that embraces diversity, celebrates creativity, and pulsates with an infectious energy that lingers long after your visit.'''

summarizer(news)

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration # BART tokenizer and architecture

tokenizer=BartTokenizer.from_pretrained(os.getenv('MODEL'))
tokenizer

In [None]:
model=BartForConditionalGeneration.from_pretrained(os.getenv('MODEL'))
print(model)

The printed model shows that it consists of an encoder and a decoder. We can see the linear layers, as well as the activation functions, which use $GELU$ instead of the more typical $ReLU$. It is also interesting to observe the output layer, **lm_head**, which shows us that this model is ideal for generating outputs with a vocabulary size - `out_features=50264` - this shows us that this architecture