# In this notebook, we will explore the usage of the ROUGE metric to measure the quality of summaries generated by a language model.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate the quality of machine-generated text, particularly in the context of text summarization and machine translation. 
These metrics measure the similarity between the generated text and reference (human-written) text.

1. ROUGE-N (ROUGE-1, ROUGE-2, ...)
ROUGE-N measures the overlap of n-grams between the generated text and the reference text, and reports precision, recall, and F1-score for each n-gram size separately. ROUGE-1 counts unigrams (single words), ROUGE-2 counts bigrams, and so on.

2. ROUGE-L
ROUGE-L is based on the longest common subsequence (LCS) between the generated text and the reference text: the longest sequence of words that appears in both texts in the same order, though not necessarily contiguously. It rewards summaries that preserve the reference's word order, but it measures word overlap, not semantic similarity.

3. ROUGE-SU (ROUGE-Skip-Bigram and Unigram)
ROUGE-SU combines skip-bigrams (pairs of words in their original order, with gaps allowed between them) with unigrams, providing a broader view of the similarity between the generated and reference texts.

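4. ROUGE-M (ROUGE-Meta):
ROUGE-M is the name used here for scoring against multiple reference summaries: each generated text is compared with every available reference and the scores are combined, for example by averaging, which makes the evaluation more robust. It is not one of the variants defined in the original ROUGE paper (ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, and ROUGE-SU), which handles multiple references within each metric.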
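
To make the n-gram overlap behind ROUGE-N concrete, here is a minimal sketch in plain Python. It assumes lowercased, whitespace-split tokens and clipped counts; the helper names `ngrams` and `rouge_n` and the two example sentences are made up for illustration, and the `rouge` package used later may tokenize and count differently, so its numbers need not match.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Precision, recall and F1 from clipped n-gram overlap (simplified sketch)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps, for each n-gram, the smaller of the two counts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"p": precision, "r": recall, "f": f1}


# Illustrative sentences, not taken from the dataset.
candidate = "the cat sat on the mat"
reference = "the cat is on the mat"

print("ROUGE-1:", rouge_n(candidate, reference, n=1))
print("ROUGE-2:", rouge_n(candidate, reference, n=2))
```

Five of the six unigrams match, but only three of the five bigrams do, which is why ROUGE-2 is typically lower than ROUGE-1 for the same pair of texts.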

First, we will generate summaries of a news dataset with two models: a base model and a version fine-tuned for summarization. Comparing the two sets of summaries with each other shows whether fine-tuning changed the output. At this stage the goal is only to establish that the two models produce noticeably different summaries, not to decide which one is better.

To determine which model generates better summaries, we will use the well-known `cnn_dailymail` dataset, which is available through the `datasets` library.

This dataset contains reference summaries that can be used for comparison. We will assess the summaries generated by the two models against these reference summaries.

The model that obtains a higher ROUGE score will be considered the one that produces better summaries.


# Models Used

1. flan-t5-xxl: https://huggingface.co/google/flan-t5-xxl

> FLAN-T5 is just better at everything. The model was trained on a mixture of tasks such as question answering, summarization, and text classification.

2. flan-t5-11b-summarizer-filtered: https://huggingface.co/jordiclive/flan-t5-11b-summarizer-filtered

It is a version of google/flan-t5-xxl fine-tuned on various summarization datasets (xsum, wikihow, cnn_dailymail/3.0.0, samsum, scitldr/AIC, billsum, TLDR, wikipedia-summary). It can be used as a general-purpose summarizer for academic and general usage. It works well on lots of text, although it was trained with a maximum source length of 512 tokens and a maximum summary length of 150 tokens.

Note that the code cells below load two smaller checkpoints, `t5-base` and `flax-community/t5-base-cnn-dm`, as the base and fine-tuned models.

# Load the Data

In [5]:
#Import generic libraries
import numpy as np 
import pandas as pd
import torch
import warnings
warnings.filterwarnings("ignore")
from rouge import Rouge

In [6]:
# !pip install rouge

The dataset is available on Kaggle and comprises a collection of technological news articles compiled by MIT. The article text is located in the 'Article Body' column.

https://www.kaggle.com/datasets/deepanshudalal09/mit-ai-news-published-till-2023

In [7]:
df = pd.read_csv('/kaggle/input/mit-ai-news-published-till-2023/articles.csv')
# DOCUMENT="Article Body"

In [8]:
# #Because it is just a course we select a small portion of News.
# MAX_NEWS = 3
# subset_news = news.head(MAX_NEWS)

In [9]:
print(f"Shape: {df.shape}")
df.head()

Shape: (1018, 8)


| | Unnamed: 0 | Published Date | Author | Source | Article Header | Sub_Headings | Article Body | Url |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | July 7, 2023 | Adam Zewe | MIT News Office | Learning the language of molecules to predict ... | This AI system only needs a small amount of da... | ['Discovering new materials and drugs typicall... | https://news.mit.edu/2023/learning-language-mo... |
| 1 | 1 | July 6, 2023 | Alex Ouyang | Abdul Latif Jameel Clinic for Machine Learning... | MIT scientists build a system that can generat... | BioAutoMATED, an open-source, automated machin... | ['Is it possible to build machine-learning mod... | https://news.mit.edu/2023/bioautomated-open-so... |
| 2 | 2 | June 30, 2023 | Jennifer Michalowski | McGovern Institute for Brain Research | When computer vision works more like a brain, ... | Training artificial neural networks with data ... | ['From cameras to self-driving cars, many of t... | https://news.mit.edu/2023/when-computer-vision... |
| 3 | 3 | June 30, 2023 | Mary Beth Gallagher | School of Engineering | Educating national security leaders on artific... | Experts from MIT’s School of Engineering, Schw... | ['Understanding artificial intelligence and ho... | https://news.mit.edu/2023/educating-national-s... |
| 4 | 4 | June 30, 2023 | Adam Zewe | MIT News Office | Researchers teach an AI to write better chart ... | A new dataset can help scientists develop auto... | ['Chart captions that explain complex trends a... | https://news.mit.edu/2023/researchers-chart-ca... |


For testing purposes, we will use only a small subset of the news articles.

In [10]:
articles = df['Article Body'].head(15).tolist()
#print(articles)
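
The preview above shows each 'Article Body' value starting with `['`, which suggests (the display is truncated, so this is an assumption) that each cell holds a stringified Python list of paragraphs. If so, the brackets and quotes would be passed to the model as part of the text. A minimal sketch of a cleanup step, using a hypothetical helper `to_plain_text` and a made-up cell value:

```python
import ast


def to_plain_text(cell):
    """Join a stringified list of paragraphs into one string.

    Anything that is not a Python list literal is returned unchanged.
    """
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return cell  # not a Python literal: keep the raw value
    if isinstance(parsed, list):
        return " ".join(str(paragraph).strip() for paragraph in parsed)
    return cell


# Hypothetical cell value, made up for illustration.
raw = "['First paragraph.', 'Second paragraph.']"
print(to_plain_text(raw))
```

It could be applied with `articles = [to_plain_text(a) for a in articles]` before the articles are summarized.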

# Load the Models and create the summaries

Both models are available on Hugging Face, so we will work with the Transformers library.

In [11]:
import transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_small = 't5-base'
model_finetuned = 'flax-community/t5-base-cnn-dm'
#model_name_reference = "pszemraj/long-t5-tglobal-base-16384-booksum-V11-big_patent-V2"

In [12]:
#This function returns the tokenizer and the Model. 
def get_model(model_id):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    
    return tokenizer, model
    

In [13]:
tokenizer_small, model_small = get_model(model_small)


In [14]:
tokenizer_reference, model_reference = get_model(model_finetuned)


In [15]:
# text = "Summarize the following article: " + articles[0]
# #print(text)
# input_encoding = tokenizer_small(text, max_length=200, padding=True, truncation=True, return_tensors="pt")

# with torch.no_grad():
#     output = model_small.generate(input_ids = input_encoding.input_ids,
#                                     attention_mask = input_encoding.attention_mask,
#                                     early_stopping=True,
#                                     num_beams=3,
#                                     max_length=200)
    
# # Convert tensor values to regular Python lists
# output_ids = output[0].tolist()

# # Decode the output
# summary = tokenizer_small.decode(output_ids, skip_special_tokens=True)

# print(summary)


In [16]:
def get_summaries(textlist, tokenizer, model, max_len=200):
    """Summarize each text; return the list of summaries and an Article/Summary DataFrame."""
    prefix = "Summarize this news: "
    summaries_list = []
    rows = []

    for text in textlist:
        prompt = prefix + text

        # Truncate the prompt to max_len tokens; the same limit caps the generated summary.
        input_encoding = tokenizer(prompt, max_length=max_len, padding=True,
                                   truncation=True, return_tensors="pt")

        with torch.no_grad():
            output = model.generate(input_ids=input_encoding.input_ids,
                                    attention_mask=input_encoding.attention_mask,
                                    early_stopping=True,
                                    num_beams=3,
                                    max_length=max_len)

        # One prompt in, one sequence out: decode it to plain text.
        summary = tokenizer.decode(output[0], skip_special_tokens=True)

        summaries_list.append(summary)
        rows.append({"Article": prompt, "Summary": summary})

    # DataFrame.append was removed in pandas 2.0, so build the frame once from the rows.
    summarydf = pd.DataFrame(rows, columns=["Article", "Summary"])

    return summaries_list, summarydf

# Creating Summaries for both models

In [17]:
summarylist,summarydf = get_summaries(articles, 
                                  tokenizer_small, 
                                  model_small)

In [19]:
summaries_reference,summaryrefdf = get_summaries(articles, 
                                      tokenizer_reference, 
                                      model_reference)

In [21]:
summaries_reference[0:3]

["Researchers from MIT and the MIT-Watson AI Lab have developed a framework that can simultaneously predict molecular properties and generate new molecules much more efficiently than these popular deep-learning approaches. The researchers must show a machine-learning model to predict a molecule’s biological or mechanical properties — a process known as training.'",
 "An open-access paper on their proposed solution, called BioAutoMATED, was published on June 21 in Cell Systems. 'Is it possible to build machine-learning models without machine-learning expertise?', reads. One Termeer Professor of Medical Engineering and Science in the Department of Biological Engineering at MIT.",
 'According to MIT and IBM research scientists, one way to improve computer vision is to instruct the artificial neural networks that they rely on to deliberately mimic the way the brain’s biological neural network processes visual images. This May, researchers led by MIT Professor James DiCarlo, have made a com

# ROUGE

In [22]:
def get_rouge_score(model_summary, reference_summary):
    
    rouge = Rouge()
    scores = rouge.get_scores(model_summary,reference_summary,avg=True)
    
    return scores

In [24]:
print("ROUGE scores:", get_rouge_score(summarylist, summaries_reference))

ROUGE scores: {'rouge-1': {'r': 0.5355399989081867, 'p': 0.5141900150123881, 'f': 0.4863026887873003}, 'rouge-2': {'r': 0.4226534538106812, 'p': 0.3981211085167167, 'f': 0.37445084574906257}, 'rouge-l': {'r': 0.5217411833702407, 'p': 0.5022909984939001, 'f': 0.47373993620585614}}


The scores show that the two models produce different summaries: the outputs overlap in part (ROUGE-1 F1 of about 0.49) but are far from identical.
However, we still don't know which model is better, since we have compared the models to each other and not to a reference text. At the very least, we know that the fine-tuning process applied to the second model has noticeably changed its output.

**Here's how to interpret the ROUGE scores:**

rouge-1, rouge-2, rouge-l: These are the ROUGE variants reported by the library. rouge-1 and rouge-2 measure the overlap of unigrams and bigrams (n-grams are sequences of n words), while rouge-l is based on the longest common subsequence.

'r', 'p', 'f': These abbreviations stand for recall, precision, and F1-score, respectively.

**Recall (r):** The number of overlapping n-grams divided by the total number of n-grams in the reference summary. A higher recall indicates that more of the reference summary's content is present in the generated summary.

**Precision (p):** The number of overlapping n-grams divided by the total number of n-grams in the generated summary. A higher precision indicates that more of the generated summary's content also appears in the reference summary.

**F1-score (f):** The harmonic mean of recall and precision. It provides a single measure that takes both into account and is high only when recall and precision are both high.
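
Written as formulas, the three quantities can be sketched as follows. The notation is introduced here for illustration and is not taken from the library: $c_{\text{gen}}(g)$ and $c_{\text{ref}}(g)$ are the number of times n-gram $g$ occurs in the generated and reference summaries, and clipped counts are assumed.

$$
\text{overlap} = \sum_{g} \min\bigl(c_{\text{gen}}(g),\, c_{\text{ref}}(g)\bigr)
$$

$$
r = \frac{\text{overlap}}{\sum_{g} c_{\text{ref}}(g)}, \qquad
p = \frac{\text{overlap}}{\sum_{g} c_{\text{gen}}(g)}, \qquad
f = \frac{2\,p\,r}{p + r}
$$

As a made-up example, a generated summary with 6 unigrams that shares 5 of them with a 10-unigram reference gives $p = 5/6 \approx 0.83$, $r = 5/10 = 0.5$, and $f = \frac{2 \cdot (5/6) \cdot (1/2)}{5/6 + 1/2} = 5/8 = 0.625$.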

# Now let's compare the summaries with the real summaries

In [25]:
from datasets import load_dataset

cnn_dataset = load_dataset(
    "cnn_dailymail", version="3.0.0"
)

#Get just a few news to test
sample_cnn = cnn_dataset["test"].select(range(15))

sample_cnn


Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 15
})

In [26]:
max_length = max(len(item['highlights']) for item in sample_cnn)
max_length = max_length + 10

In [29]:
#Get the real summaries from the cnn_dataset
real_summaries = sample_cnn['highlights']

Now we can calculate the ROUGE scores for the two models.

In [30]:
get_rouge_score(summarylist,real_summaries)

{'rouge-1': {'r': 0.09677930523919792,
  'p': 0.06840318194454578,
  'f': 0.07438032931269183},
 'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0},
 'rouge-l': {'r': 0.08919388582769154,
  'p': 0.063641277182641,
  'f': 0.06860890418473492}}

In [31]:
get_rouge_score(summaries_reference,real_summaries)

{'rouge-1': {'r': 0.09062548534496882,
  'p': 0.06291223430628627,
  'f': 0.07144381641677473},
 'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0},
 'rouge-l': {'r': 0.08593861665810014,
  'p': 0.05994084186120961,
  'f': 0.06791303825185023}}

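Against the CNN/DailyMail highlights, the two models score almost identically, and the T5-Base model is in fact marginally ahead (ROUGE-1 F1 of 0.074 versus 0.071 for the fine-tuned model). Both scores are close to zero and ROUGE-2 is exactly 0.0, so neither model can be declared better from these numbers. The likely reason is that `summarylist` and `summaries_reference` were generated from the MIT news articles, while `real_summaries` holds the highlights of unrelated CNN/DailyMail stories. A meaningful comparison requires summarizing `sample_cnn['article']` with both models and scoring those summaries against the matching highlights.
The ROUGE metrics themselves are quite interpretable. ROUGE-L (reported as LSUM by some implementations when computed over the whole summary) is based on the longest common subsequence: the words must appear in the same order in both texts, although not necessarily next to each other, and the score relates the length of that subsequence to the length of each text.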
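
To make the longest-common-subsequence idea behind ROUGE-L concrete, here is a minimal sketch in plain Python. It assumes lowercased, whitespace-split tokens and a plain F1; the helper names `lcs_length` and `rouge_l` and the example sentences are made up for illustration, and the `rouge` package may compute its `rouge-l` figure differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)  # extend the match
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a token in a or in b
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Precision, recall and F1 based on the LCS length (simplified sketch)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    p = lcs / len(cand) if cand else 0.0
    r = lcs / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if lcs else 0.0
    return {"p": p, "r": r, "f": f}


# Illustrative sentences, not taken from the dataset.
reference = "the cat sat on the mat"
print(rouge_l("the cat sat on the mat", reference))  # identical word order
print(rouge_l("mat the on sat cat the", reference))  # same words, reversed order
```

The second candidate contains exactly the same words as the reference, so its unigram overlap is perfect, yet its ROUGE-L score drops because only a short subsequence keeps the reference's word order.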