<div>
    <h1>Large Language Models Projects</h1>
    <h3>Apply and Implement Strategies for Large Language Models</h3>
    <h2>4.1-BLEU, ROUGE and N-Grams.</h2>
    <h3>Evaluating summaries with ROUGE </h3>
</div>

by [Pere Martra](https://www.linkedin.com/in/pere-martra/)
_______
Models: t5-base-cnn / t5-base

Colab Environment: CPU

Keys:
* Summary Evaluation.
* N-Grams.
* Rouge.

_______


# How to Evaluate Large Language Models for Summarization Using ROUGE.
The way we evaluate large language models is quite different from the way we evaluate traditional machine learning models, where metrics such as accuracy, F1 score, or recall are commonly used.

Metrics for generated language are distinct. Depending on the specific application, different metrics are chosen to assess the model's performance.

In this notebook, we will explore the usage of the ROUGE metric to measure the quality of summaries generated by a language model.

## What is ROUGE?
ROUGE isn't just a single metric; it's a set of metrics that measure the overlap and similarity between the generated summary and a reference summary that serves as a benchmark.

It returns four individual metrics:

* ROUGE-1: Measures the overlap of unigrams, or single words.
* ROUGE-2: Measures the overlap of bigrams, or pairs of words.
* ROUGE-L: Measures the longest common subsequence, rewarding longer shared sequences between the generated and reference summaries.
* ROUGE-Lsum: A summary-level variant of ROUGE-L. The text is split into sentences on newline characters, the LCS is computed sentence by sentence, and the results are combined into a single score.
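
As a rough illustration of what ROUGE-1 and ROUGE-2 measure, the sketch below counts overlapping n-grams between two strings and reports an F1 score. It assumes plain whitespace tokenization and no stemming, so its numbers will not always match the `rouge_score` package; the function names `ngrams` and `rouge_n` are introduced here for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(generated, reference, n=1):
    """F1 of the clipped n-gram overlap between two strings (toy version)."""
    gen = ngrams(generated.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((gen & ref).values())  # n-grams present in both, with clipped counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


generated = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(rouge_n(generated, reference, n=1))  # unigram overlap
print(rouge_n(generated, reference, n=2))  # bigram overlap
```

With these two sentences, five of six unigrams and three of five bigrams are shared, which is why the bigram score is lower than the unigram score.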

## What are we going to do?
We are going to use two T5 models: the t5-base model and a version of t5-base fine-tuned specifically for creating summaries.

First, we will use a dataset and generate summaries using both models. By comparing the two generated summaries, we can observe whether the fine-tuning has been effective in producing different results. In other words, here we will only determine that the two models exhibit significant differences in summary generation, but we won't know which one might perform better.

To determine which model generates better summaries, we will utilize a well-known dataset called 'cnn_dailymail,' which is available in the 'datasets' library.

This dataset contains reference summaries that can be used for comparison. We will assess the summaries generated by the two models against these reference summaries.

The model that obtains a higher ROUGE score will be considered the one that produces better summaries.

## The models.
t5-Base fine-tuned: https://huggingface.co/flax-community/t5-base-cnn-dm

t5-Base: https://huggingface.co/t5-base


In [1]:
!pip install -q evaluate==0.4.2
!pip install -q transformers==4.42.4
!pip install -q rouge_score==0.1.2
!pip install kaggle
#!pip install -q datasets==2.1.0


In [2]:
import transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import evaluate
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# Load the Data

In [3]:
#Import generic libraries
import numpy as np
import pandas as pd
import torch


The dataset is available on Kaggle and comprises a collection of technological news articles compiled by MIT. The article text is located in the 'Article Body' column.

https://www.kaggle.com/datasets/deepanshudalal09/mit-ai-news-published-till-2023

## Importing Dataset from Kaggle

You only need access to the articles.csv file from the dataset. You can download it and load it directly or, if you prefer to use the Kaggle API, use the code below. To use the Kaggle API, you need to have your kaggle.json file, containing your keys, in the directory /content/drive/MyDrive/kaggle.

In [4]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'

In [5]:
!kaggle datasets download -d deepanshudalal09/mit-ai-news-published-till-2023

Dataset URL: https://www.kaggle.com/datasets/deepanshudalal09/mit-ai-news-published-till-2023
License(s): unknown
Downloading mit-ai-news-published-till-2023.zip to /content
  0% 0.00/1.90M [00:00<?, ?B/s]
100% 1.90M/1.90M [00:00<00:00, 128MB/s]


In [6]:
import zipfile
file_path = '/content/mit-ai-news-published-till-2023.zip'
with zipfile.ZipFile(file_path, 'r') as zip_ref:
   zip_ref.extractall('/content/drive/MyDrive/kaggle')

## Loading Dataset

In [7]:
news = pd.read_csv('/content/drive/MyDrive/kaggle/articles.csv')
DOCUMENT="Article Body"

In [8]:
# This is only a course example, so we select a small number of news articles.
MAX_NEWS = 3
subset_news = news.head(MAX_NEWS)

In [9]:
subset_news.head()

Unnamed: 0.1,Unnamed: 0,Published Date,Author,Source,Article Header,Sub_Headings,Article Body,Url
0,0,"July 7, 2023",Adam Zewe,MIT News Office,Learning the language of molecules to predict ...,This AI system only needs a small amount of da...,['Discovering new materials and drugs typicall...,https://news.mit.edu/2023/learning-language-mo...
1,1,"July 6, 2023",Alex Ouyang,Abdul Latif Jameel Clinic for Machine Learning...,MIT scientists build a system that can generat...,"BioAutoMATED, an open-source, automated machin...",['Is it possible to build machine-learning mod...,https://news.mit.edu/2023/bioautomated-open-so...
2,2,"June 30, 2023",Jennifer Michalowski,McGovern Institute for Brain Research,"When computer vision works more like a brain, ...",Training artificial neural networks with data ...,"['From cameras to self-driving cars, many of t...",https://news.mit.edu/2023/when-computer-vision...


In [10]:
articles = subset_news[DOCUMENT].tolist()

# Load the Models and create the summaries

Both models are available on Hugging Face, so we will work with the Transformers library.

In [11]:
model_name_base = "t5-base"
model_name_finetuned = "flax-community/t5-base-cnn-dm"

In [12]:
#This function returns the tokenizer and the Model.
def get_model(model_id):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    return tokenizer, model


In [13]:
tokenizer_base, model_base = get_model(model_name_base)


In [14]:
tokenizer_finetuned, model_finetuned = get_model(model_name_finetuned)


With both models downloaded and ready, we create a function that will generate the summaries.

The function takes four parameters:

* the list of texts to summarize.
* the tokenizer.
* the model.
* the maximum length for the generated summary

In [15]:
import time

In [16]:
def create_summaries(texts_list, tokenizer, model, max_l=125):

    # Add a prefix to each article so that the model knows
    # which task it should perform.
    prefix = "Summarize this news: "
    summaries_list = []  # Will contain all summaries

    texts_list = [prefix + text for text in texts_list]

    for text in texts_list:

        # Tokenize the text, truncating inputs longer than 1024 tokens
        input_encodings = tokenizer(text,
                                    max_length=1024,
                                    return_tensors='pt',
                                    padding=True,
                                    truncation=True)

        # Generate the summary and measure how long it takes
        start = time.time()
        output = model.generate(
            input_ids=input_encodings.input_ids,
            attention_mask=input_encodings.attention_mask,
            max_length=max_l,  # Maximum length (in tokens) of the generated summary
            num_beams=2,       # Number of beams for beam search
            early_stopping=True
        )

        # Decode the generated token ids back to text
        summary = tokenizer.batch_decode(output, skip_special_tokens=True)
        end = time.time()
        elapsed_time = end - start
        print(f"Time taken: {elapsed_time:.3f} seconds")

        # batch_decode returns a list, so we extend the list of summaries with it
        summaries_list += summary
    return summaries_list


To create the summaries, we call the 'create_summaries' function, passing both the news articles and the corresponding tokenizer and model.

In [17]:
# Creating the summaries for both models.
summaries_base = create_summaries(articles,
                                  tokenizer_base,
                                  model_base)


Time taken: 14.084 seconds
Time taken: 22.327 seconds
Time taken: 18.965 seconds


In [18]:
summaries_finetuned = create_summaries(articles,
                                      tokenizer_finetuned,
                                      model_finetuned)

Time taken: 14.765 seconds
Time taken: 18.770 seconds
Time taken: 23.711 seconds


In [19]:
summaries_base

['MIT and MIT-Watson AI Lab have developed a unified framework. the system can simultaneously predict molecular properties and generate new molecules. it uses this grammar to construct viable molecules and predict their properties.',
 '\'BioAutoMATED\' is an automated machine-learning system that can select and build an appropriate model for a given dataset. it can even take care of the laborious task of data preprocessing, whittling down a months-long process to just a few hours. \'"We want to lower these barriers for a lot of folks that want to use machine learning or biology," says first co-author Jacqueline Valeri.',
 "MIT and IBM research scientists have made a computer vision model more robust by training it to work like a part of the brain that humans and other primates rely on for object recognition. 'we asked the artificial neural network to make the function of one of your inside simulated “neural” layers as similar as possible to the corresponding biological neural layer,' s

In [20]:
summaries_finetuned

['Researchers created a machine-learning system that automatically learns the "language" of molecules using only a small, domain-specific dataset. The system learns to construct viable molecules and predict their properties. Computational design and Fabrication Group will be presented at the International Conference for Machine Learning.',
 "Automated machine-learning system can select and build an appropriate model for a given dataset. 'BioAutoMATED' is an automated machine-learning system. The tool includes binary classification models, multi-class classification models, and more complex neural networks.",
 "MIT and IBM researchers have found that artificial neural networks resemble the multilayered brain circuits that process visual information in humans and other primates. 'We asked it to do both of those things as well as the standard, computer vision approach,' said one expert. The network found to be more robust by training it to work like a part of the brain that humans rely on

At first glance, it's evident that the summaries are different.

However, it's challenging to determine which one is better.

It's even difficult to discern whether they are significantly distinct or if there are just subtle differences between them.

This is what we are going to verify now using ROUGE. When comparing the summaries of one model with those of the other, we don't get an idea of which one is better, but rather an idea of how much the summaries have changed with the fine-tuning applied to the model.

# ROUGE
Let's load the ROUGE evaluator.

In [21]:
# Use the load function of the evaluate library
# to create a rouge_score object
rouge_score = evaluate.load("rouge")


Calculating ROUGE is as simple as calling the *compute* function of the *rouge_score* object we created earlier. This function takes the texts to compare (predictions and references) and a third argument, *use_stemmer*, which indicates whether words are reduced to their stems before the comparison or compared in their full form.

A *stemmer* reduces the different forms of a word to a common base, or stem.

Some examples of stemming:
* Jumping -> Jump.
* Running -> Run.
* Cats -> Cat.
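
To see why stemming can change a ROUGE score, here is a deliberately naive sketch. The `toy_stem` function below is a hypothetical suffix stripper written for this illustration; it is not the stemmer that ROUGE uses and it will mis-handle many English words. It only shows that words which differ on the surface can match once they are reduced to a stem.

```python
def toy_stem(word):
    """Very rough suffix stripping; NOT the algorithm used by ROUGE."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # "running" -> "runn" -> "run": drop a doubled final consonant
        if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def shared_words(a, b, stem=False):
    """Count the distinct words that two sentences have in common."""
    norm = toy_stem if stem else str.lower
    return len({norm(w) for w in a.split()} & {norm(w) for w in b.split()})


generated = "Cats running"
reference = "cat runs"
print(shared_words(generated, reference))             # exact word match
print(shared_words(generated, reference, stem=True))  # match after stemming
```

Without stemming the two phrases share no words; after stemming both reduce to the same two stems.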

In [22]:
def compute_rouge_score(generated, reference):

    # ROUGE expects one sentence per line, so we join the sentences with '\n'
    generated_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in generated]
    reference_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in reference]

    return rouge_score.compute(
        predictions=generated_with_newlines,
        references=reference_with_newlines,
        use_stemmer=True,

    )
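
The newline handling in `compute_rouge_score` matters mainly for ROUGE-Lsum, which treats each line as one sentence. The sketch below shows the same transformation in isolation. It uses a simple regular expression as a stand-in for `sent_tokenize`, which is an assumption made to keep the example dependency-free; NLTK's tokenizer handles abbreviations and other edge cases that this split does not. The helper names are illustrative.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def one_sentence_per_line(text):
    """Join sentences with newlines, the layout used for ROUGE-Lsum."""
    return "\n".join(split_sentences(text))


summary = "The system learns a grammar. It then predicts molecular properties."
print(one_sentence_per_line(summary))
```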

In [23]:
compute_rouge_score(summaries_base, summaries_finetuned)

{'rouge1': 0.47018752391886715,
 'rouge2': 0.3209013209013209,
 'rougeL': 0.34330271718331423,
 'rougeLsum': 0.44692881745120555}

We can see that there is a difference between the two models when performing summarization.

For example, the ROUGE-1 score is 47%, while the ROUGE-2 score is 32%. This indicates that the two sets of summaries share some content but are clearly different.

However, we still don't know which model is better since we have compared them to each other and not to a reference text. But at the very least, we know that the fine-tuning process applied to the second model has significantly altered its results.

# Comparing to a Dataset with real summaries.
We are going to load the Dataset cnn_dailymail. This is a well-known dataset available in the **Datasets** library, and it suits our purpose perfectly.

Apart from the news, it also contains pre-existing summaries.

We will compare the summaries generated by the two models we are using with those from the dataset to determine which model creates summaries that are closer to the reference ones.

In [24]:
from datasets import load_dataset

cnn_dataset = load_dataset("ccdv/cnn_dailymail", "3.0.0")


The repository for ccdv/cnn_dailymail contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ccdv/cnn_dailymail.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y



In [25]:
# Take just a few articles for testing
sample_cnn = cnn_dataset["test"].select(range(MAX_NEWS))

sample_cnn

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 3
})

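We take the length of the longest reference summary, plus a small margin, as the maximum length for generation, so that the models can produce summaries as long as the references. Note that `len` counts characters, while the `max_length` argument of `generate` counts tokens, so this limit is a generous upper bound rather than an exact match.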

In [26]:
max_length = max(len(item['highlights']) for item in sample_cnn)
max_length = max_length + 10

In [27]:
summaries_t5_base = create_summaries(sample_cnn["article"],
                                      tokenizer_base,
                                      model_base,
                                      max_l=max_length)

Time taken: 16.551 seconds
Time taken: 13.505 seconds
Time taken: 13.629 seconds


In [28]:
summaries_t5_finetuned = create_summaries(sample_cnn["article"],
                                      tokenizer_finetuned,
                                      model_finetuned,
                                      max_l=max_length)

Time taken: 17.991 seconds
Time taken: 9.375 seconds
Time taken: 12.378 seconds


In [29]:
#Get the real summaries from the cnn_dataset
real_summaries = sample_cnn['highlights']

Let's take a look at the generated summaries alongside the reference summaries provided by the dataset.

In [30]:
summaries = pd.DataFrame.from_dict(
        {
            "base": summaries_t5_base,
            "finetuned": summaries_t5_finetuned,
            "reference": real_summaries,
        }
    )
summaries.head()

Unnamed: 0,base,finetuned,reference
0,"best died in hospice in Hickory, north Carolin...","Jimmie Best was ""the most constantly creative ...","James Best, who played the sheriff on ""The Duk..."
1,"""it doesn't matter what anyone says, he is pre...",Dr. Anthony Moschetto's attorney calls the all...,A lawyer for Dr. Anthony Moschetto says the ch...
2,president Barack Obama took part in a roundtab...,President Obama says climate change is a publi...,"""No challenge poses more of a public threat th..."


Now we can calculate the ROUGE scores for the two models.

In [31]:
summaries_t5_base

['best died in hospice in Hickory, north Carolina, of complications from pneumonia. he played bumbling sheriff Rosco P. Coltrane on "the Dukes of Hazzard" he was born in Kentucky and raised in rural Indiana.',
 '"it doesn\'t matter what anyone says, he is presumed to be innocent," attorney says. cardiologist\'s lawyer says allegations against his client are "completely unsubstantiated" prosecutors say he pleaded not guilty to all charges. he faces charges in connection with a plot to take out a rival doctor.',
 'president Barack Obama took part in a roundtable discussion this week on climate change. he refocused on the issue from a public health vantage point. the average american can also do their part to reduce their own carbon footprint.']

In [32]:
real_summaries

['James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .\n"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .',
 'A lawyer for Dr. Anthony Moschetto says the charges against him are baseless .\nMoschetto, 54, was arrested for selling drugs and weapons, prosecutors say .\nAuthorities allege Moschetto hired accomplices to burn down the practice of former associate .',
 '"No challenge poses more of a public threat than climate change," the President says .\nHe credits the Clean Air Act with making Americans "a lot" healthier .']

In [33]:
compute_rouge_score(summaries_t5_base, real_summaries)

{'rouge1': 0.3050834824090638,
 'rouge2': 0.07211128178870115,
 'rougeL': 0.2095520274299344,
 'rougeLsum': 0.2662418008348241}

In [34]:
compute_rouge_score(summaries_t5_finetuned, real_summaries)

{'rouge1': 0.31659149328289443,
 'rouge2': 0.11065084340946411,
 'rougeL': 0.22002036956205442,
 'rougeLsum': 0.24877540132887144}

With these results, I would say that the fine-tuned model performs slightly better than the T5-Base model. It achieves higher scores in ROUGE-1, ROUGE-2 and ROUGE-L; only in ROUGE-Lsum does the base model score higher, by less than two points (0.266 vs. 0.249).

Additionally, the ROUGE metrics are quite interpretable.

ROUGE-Lsum is based on the longest common subsequence: words that appear in the same order in both texts, although not necessarily next to each other. It is computed sentence by sentence and then aggregated over the whole summary.

This can be a good indicator of overall similarity between texts. However, both models have very similar LSUM scores, and the fine-tuned model has better scores in other ROUGE metrics.

Personally, I would lean towards the fine-tuned model, although the difference may not be very significant.
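
To make the comparison concrete, the per-metric differences can be computed directly from the two score dictionaries shown above. This is only a small sketch; `compare_scores` is a helper name introduced here for illustration and is not part of the notebook.

```python
# Scores copied from the two compute_rouge_score outputs above.
base = {
    "rouge1": 0.3050834824090638,
    "rouge2": 0.07211128178870115,
    "rougeL": 0.2095520274299344,
    "rougeLsum": 0.2662418008348241,
}
finetuned = {
    "rouge1": 0.31659149328289443,
    "rouge2": 0.11065084340946411,
    "rougeL": 0.22002036956205442,
    "rougeLsum": 0.24877540132887144,
}


def compare_scores(a, b):
    """Return {metric: b - a}; a positive value means b scored higher."""
    return {metric: b[metric] - a[metric] for metric in a}


deltas = compare_scores(base, finetuned)
for metric, delta in deltas.items():
    winner = "finetuned" if delta > 0 else "base"
    print(f"{metric:10s} {delta:+.4f}  ({winner} higher)")
```

With only three articles in the sample, differences of this size should be read as an indication rather than as a statistically meaningful result.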


### Comparing entities with ROUGE

In [35]:
entities=['Paris, Londres, Barcelona, Reus']
entities_ref=['Reus, Paris, Londres, Barcelona']

In [36]:
compute_rouge_score(entities, entities_ref)

{'rouge1': 1.0,
 'rouge2': 0.6666666666666666,
 'rougeL': 0.75,
 'rougeLsum': 0.75}

In [37]:
entities_ref=['Paris, Londres, Barcelona, Reus']
compute_rouge_score(entities, entities_ref)

{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
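
The entity comparison shows that ROUGE-L is sensitive to order: the same four names in a different order score 0.75 rather than 1.0. Below is a minimal sketch of why, assuming simple lowercase word tokens and no stemming. The `rouge_score` package has its own tokenizer, so this is an approximation, and `lcs_length` and `rouge_l` are illustrative names rather than library functions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(generated, reference):
    """F1 based on the LCS of lowercased word tokens, with commas removed."""
    gen = generated.lower().replace(",", " ").split()
    ref = reference.lower().replace(",", " ").split()
    lcs = lcs_length(gen, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(gen), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


entities = "Paris, Londres, Barcelona, Reus"
print(rouge_l(entities, "Reus, Paris, Londres, Barcelona"))  # reordered reference
print(rouge_l(entities, "Paris, Londres, Barcelona, Reus"))  # identical reference
```

In the reordered case the longest common subsequence is "paris londres barcelona", three of four tokens, so precision and recall are both 3/4 and the F1 score is 0.75. ROUGE-1 stays at 1.0 because every word is still present, regardless of position.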