# Overview

In this notebook, we will fine-tune the `facebook/bart-large-xsum` model on the `SamSum` dataset.

Note: There is a technique we did not mention in the previous notebook: `transfer learning`, which is also often referred to as `fine-tuning`.

# Evaluation Metrics

Evaluating performance for language models can be quite tricky, especially when it comes to text summarization. The goal of our model is to produce a short sentence describing the content of a dialogue, while maintaining all the important information within that dialogue.

One of the quantitative metrics we can employ to evaluate performance is the `ROUGE Score`. It is one of the most widely used metrics for text summarization, and it evaluates performance by comparing a machine-generated summary to a human-generated summary used for reference.

The similarity between both summaries is measured by analyzing the overlapping `n-grams`, that is, single words or sequences of words that are present in both summaries. These can be unigrams (ROUGE-1), where only the overlap of single words is measured; bigrams (ROUGE-2), where we measure the overlap of two-word sequences; trigrams (ROUGE-3), where we measure the overlap of three-word sequences; etc. Besides that, we also have:


**ROUGE-L**

It measures the *Longest Common Subsequence (LCS)* between the two summaries, which helps to capture content coverage of the machine-generated text. The words of a common subsequence must appear in the same order in both texts, but they do not need to be adjacent. If both summaries contain the words "the apple is green" in that order, we have a match regardless of where they appear in each text.
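
To make the LCS idea concrete, here is a minimal sketch in plain Python. It is not the implementation used by the `rouge-score` package: the function names `lcs_length` and `rouge_l` are hypothetical, and it assumes lowercased, whitespace-split tokens with no stemming.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(reference, candidate):
    """LCS-based precision, recall and F1 (assumption: whitespace tokens)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    precision = lcs / max(len(cand), 1)
    recall = lcs / max(len(ref), 1)
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"lcs": lcs, "precision": precision, "recall": recall, "f1": f1}


# The reference words appear in order in the candidate, with extra words between them
scores_l = rouge_l("the apple is green", "the big apple is really green")
print(scores_l)
```

Here every word of the reference is found in order in the candidate, so recall is 1.0, while the two extra words in the candidate lower precision.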

**ROUGE-S**

It evaluates the overlap of skip-bigrams, which are pairs of words kept in their original order that permit gaps between them. This helps to measure the coherence of a machine-generated summary. For example, in the phrase "this apple is absolutely green", the pair "apple green" counts as a skip-bigram even though other words separate the two terms, so it matches a reference in which "apple" is followed later by "green".
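
The skip-bigram idea can be sketched as follows. This is an illustrative sketch only: `skip_bigrams` and `rouge_s` are hypothetical names, tokens are lowercased and split on whitespace, and the gap between the two words is left unlimited, whereas practical ROUGE-S variants often cap the skip distance.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens):
    """Count all ordered word pairs (w_i, w_j) with i < j, allowing any gap."""
    return Counter(combinations(tokens, 2))


def rouge_s(reference, candidate):
    """Skip-bigram precision, recall and F1 (assumption: unlimited gap)."""
    ref = skip_bigrams(reference.lower().split())
    cand = skip_bigrams(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared pair
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"overlap": overlap, "precision": precision, "recall": recall, "f1": f1}


candidate_pairs = skip_bigrams("this apple is absolutely green".split())
print(("apple", "green") in candidate_pairs)  # the pair is found despite the gap

scores_s = rouge_s("this apple is green", "this apple is absolutely green")
print(scores_s)
```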

These scores are computed as values between 0 and 1 and are often reported on a scale from 0 to 100, where 0 indicates no match and 100 indicates a perfect match between both summaries. Besides quantitative metrics, it is useful to use `human evaluation` to analyze the output of language models, since we are able to comprehend text in a way that a machine does not.
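
As a rough illustration of the n-gram overlap behind ROUGE-1 and ROUGE-2 and of the 0 to 100 scale, the sketch below computes clipped n-gram precision, recall and F1 in plain Python. The helper names `ngrams` and `rouge_n` are hypothetical, and the sketch assumes lowercased, whitespace-split tokens without stemming, so its numbers may differ from those of the `rouge-score` package.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Clipped n-gram overlap as precision, recall and F1, scaled to 0-100."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Each n-gram counts at most as often as it appears in both summaries
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": 100 * precision, "recall": 100 * recall, "f1": 100 * f1}


reference = "the apple is green"
candidate = "the apple is very green"

rouge1 = rouge_n(reference, candidate, n=1)  # unigram overlap
rouge2 = rouge_n(reference, candidate, n=2)  # bigram overlap
print(rouge1)
print(rouge2)
```

The extra word "very" leaves unigram recall at 100 but breaks one of the reference bigrams, which is why ROUGE-2 is usually lower than ROUGE-1 for the same pair of summaries.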



In [None]:
!nvidia-smi # Checking GPU

In [None]:
!pip install transformers==4.37.2
!pip install datasets==2.17.0
!pip install evaluate==0.4.1
!pip install rouge-score==0.1.2
# Installing library for handling 7z archives
!pip install py7zr==0.20.8

In [None]:
import warnings

# Suppress warning messages to keep the notebook output readable
warnings.filterwarnings('ignore')

In [None]:
# Configuring pandas to display wider columns

import pandas as pd

pd.set_option('display.max_colwidth', 1000)

# Credit

* https://www.kaggle.com/code/lusfernandotorres/text-summarization-with-large-language-models/notebook