# About this notebook

In this notebook, we will explore and evaluate various text summarization models on the [BillSum dataset](https://huggingface.co/datasets/billsum). We will compare several state-of-the-art Transformer-based models, such as BART, T5, and Pegasus, as well as a TextRank-based baseline model. Our goal is to determine their performance and suitability for summarizing legislative text data.

The BillSum dataset is a collection of U.S. Congressional and California state bills with their corresponding summaries. It is an interesting use case for text summarization because legislative documents tend to be lengthy, complex, and written in a formal style. Moreover, they often contain domain-specific terminology and jargon, making it challenging to generate concise and accurate summaries.

In this notebook, we will:

1. Provide an overview of the different summarization models, including their strengths and limitations.
2. Implement and demonstrate the use of these models to generate summaries of the legislative text.
3. Evaluate the models on multiple samples from the BillSum dataset, considering both ROUGE and BLEU metrics for a comprehensive comparison.
4. Discuss the importance of evaluating models on a diverse set of samples to ensure a more reliable and generalizable performance assessment.
5. Provide insights into the BillSum dataset and why it is an interesting use case for text summarization.

By the end of this notebook, you will gain a better understanding of the various text summarization models, their performance on legislative text data, and the challenges associated with evaluating models on complex and domain-specific datasets like BillSum.

# Imports

In [None]:
# Load setup.py file
%load ../utils/setup.py
%run ../utils/setup.py

# Load utils.py file
%load ../utils/utils.py
%run ../utils/utils.py

# Load textrank.py file
%load ../utils/textrank.py
%run ../utils/textrank.py

In [57]:
useGPU()

Have fun with this chapter!🥳


In [58]:
import pandas as pd
import numpy as np
from functools import partial
import logging

import nltk
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from summa import summarizer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline, GenerationConfig, set_seed
from transformers import BigBirdPegasusForConditionalGeneration, AutoTokenizer
from transformers import LEDForConditionalGeneration, LEDTokenizer
import torch
from datasets import load_dataset

from multiprocessing import Pool
import evaluate
from tqdm import tqdm

import matplotlib.pyplot as plt

from textwrap import TextWrapper

set_seed(42)

# Allocate enough RAM

Let us try to get a __GPU__ with at least __15GB RAM__ for our notebook.

In [59]:
# crash colab to get more RAM -> uncomment to use
# !kill -9 -1

We can execute the following command `!free -h`  to see if we have enough RAM and `!nvidia-smi` to get more info about our GPU type we got assigned.
If the allocated GPU is too small, the above cell can be used to run the command to crash the notebook hoping to get a better GPU after the crash, since the GPU is randomly allocated.

In [60]:
!free -h

               total        used        free      shared  buff/cache   available
Mem:            50Gi       6.4Gi       5.8Gi       2.0Mi        38Gi        44Gi
Swap:             0B          0B          0B


In [61]:
!nvidia-smi

Wed Jan 17 15:32:14 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8               9W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [62]:
if torch.cuda.is_available():
    gpu_device = torch.device('cuda')
    gpu_info = torch.cuda.get_device_properties(gpu_device)
    gpu_memory = gpu_info.total_memory / 1e9  # Convert bytes to gigabytes
    print(f"GPU: {gpu_info.name}, Total Memory: {gpu_memory:.2f} GB")
else:
    print("No GPU detected.")


GPU: Tesla T4, Total Memory: 15.84 GB


# Use case BillSum dataset

## Load the BillSum dataset

In [63]:
# Load the dataset
dataset = load_dataset("billsum")
#test_dataset.set_format(type='pandas', columns=['article', 'abstract'])



  0%|          | 0/3 [00:00<?, ?it/s]

In [64]:
dataset

DatasetDict({
    ca_test: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 1237
    })
    train: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 18949
    })
    test: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 3269
    })
})

In [65]:
# Get the test split of the dataset
dataset_test = load_dataset("billsum", split="ca_test")
dataset_test.set_format(type="pandas", columns=['text', 'summary'])



In [66]:
# Convert to pandas DataFrame
df = dataset_test.to_pandas()

## Excerpt of article and ground truth of summary




In [67]:
# Access first row using iloc
sample = df.iloc[0]

excerpt = 1000

print(f"\033[1mExcerpt of {excerpt} characters, total length of article: \
{len(sample['text'])}:\033[0m\n")

print(sample["text"][:excerpt])
print(f"\033[1m\n\nSummary (length: {len(sample['summary'])}):\033[0m\n")
print(sample["summary"])


[1mExcerpt of 1000 characters, total length of article: 8203:[0m

The people of the State of California do enact as follows:


SECTION 1.
The Legislature finds and declares all of the following:
(a) (1) Since 1899 congressionally chartered veterans’ organizations
have provided a valuable service to our nation’s returning service
members. These organizations help preserve the memories and incidents
of the great hostilities fought by our nation, and preserve and
strengthen comradeship among members.
(2) These veterans’ organizations also own and manage various
properties including lodges, posts, and fraternal halls. These
properties act as a safe haven where veterans of all ages and their
families can gather together to find camaraderie and fellowship, share
stories, and seek support from people who understand their unique
experiences. This aids in the healing process for these returning
veterans, and ensures their health and happiness.
(b) As a result of congressional chartering of th

In [68]:
articles_np= np.array(dataset_test['text'])
articles_np.shape

(1237,)

In [69]:
articles_np_1 = articles_np[:1]
articles_np_1.shape

(1,)

In [70]:
print(articles_np_1)

['The people of the State of California do enact as
follows:\n\n\nSECTION 1.\nThe Legislature finds and declares all of
the following:\n(a) (1) Since 1899 congressionally chartered veterans’
organizations have provided a valuable service to our nation’s
returning service members. These organizations help preserve the
memories and incidents of the great hostilities fought by our nation,
and preserve and strengthen comradeship among members.\n(2) These
veterans’ organizations also own and manage various properties
including lodges, posts, and fraternal halls. These properties act as
a safe haven where veterans of all ages and their families can gather
together to find camaraderie and fellowship, share stories, and seek
support from people who understand their unique experiences. This aids
in the healing process for these returning veterans, and ensures their
health and happiness.\n(b) As a result of congressional chartering of
these veterans’ organizations, the United States Internal Rev

## Applying TextRank algorithm to BillSum dataset

This code snippet is used to generate summaries for a collection of texts (in this case, legislative documents) using the TextRank algorithm. The TextRank summarization is implemented with the summa library, which is available on [PyPI](https://pypi.org/project/summa/). The implementation is based on the paper Mihalcea, R., Tarau, P.: [“TextRank: Bringing order into texts”](https://aclanthology.org/W04-3252/)

A `summarize_text` function is defined, which takes in two arguments - text and words. text is the input text to be summarized, and words is an optional parameter to specify the target number of words in the generated summary (default is 250 words). The `summarizer.summarize()` function from the summa library is used to generate the summary.

The code then employs the multiprocessing Pool class to parallelize the summarization process. This speeds up the computation by taking advantage of multiple CPU cores available on the system. The `Pool()` context manager creates a pool of worker processes, and the `pool.map()` function is used to apply the `summarize_text` function to each element of the `articles_np_1` array (which contains the input articles).

The `pool.map()` function distributes the articles across the available worker processes, and each process applies the summarize_text function independently. Once all the summaries are generated, they are collected and stored in the `summarized_articles` list. This parallelization can significantly improve the performance of the TextRank summarization, especially when working with large datasets.

In [71]:
# Define the function to summarize the text
def summarize_text(text, words=250):
    summary = summarizer.summarize(text, words=words)
    return summary

# Parallelize the TextRank summarization
with Pool() as pool:
    summarized_articles = pool.map(summarize_text, articles_np_1)


### Get summarized article

In [72]:
print(summarized_articles)

['(c) Section 501(c)(19) of the Internal Revenue Code and related
federal regulations provide for the exemption for posts or
organizations of war veterans, or an auxiliary unit or society of, or
a trust or foundation for, any such post or organization that, among
other attributes, carries on programs to perpetuate the memory of
deceased veterans and members of the Armed Forces and to comfort their
survivors, conducts programs for religious, charitable, scientific,
literary, or educational purposes, sponsors or participates in
activities of a patriotic nature, and provides social and recreational
activities for their members.\n(a) All buildings, and so much of the
real property on which the buildings are situated as may be required
for the convenient use and occupation of the buildings, used
exclusively for charitable purposes, owned by a veterans’ organization
that has been chartered by the Congress of the United States,
organized and operated for charitable purposes, and exempt from
f

In [73]:
references = np.array(dataset_test['summary'])

In [74]:
references_np = np.array(references)
print(references_np[:1])

['Existing property tax law establishes a veterans’ organization
exemption under which property is exempt from taxation if, among other
things, that property is used exclusively for charitable purposes and
is owned by a veterans’ organization.\nThis bill would provide that
the veterans’ organization exemption shall not be denied to a property
on the basis that the property is used for fraternal, lodge, or social
club purposes, and would make specific findings and declarations in
that regard. The bill would also provide that the exemption shall not
apply to any portion of a property that consists of a bar where
alcoholic beverages are served.\nSection 2229 of the Revenue and
Taxation Code requires the Legislature to reimburse local agencies
annually for certain property tax revenues lost as a result of any
exemption or classification of property for purposes of ad valorem
property taxation.\nThis bill would provide that, notwithstanding
Section 2229 of the Revenue and Taxation Code, no 

In [75]:
reference = references[0]

In [76]:
print(summarized_articles[0])

(c) Section 501(c)(19) of the Internal Revenue Code and related
federal regulations provide for the exemption for posts or
organizations of war veterans, or an auxiliary unit or society of, or
a trust or foundation for, any such post or organization that, among
other attributes, carries on programs to perpetuate the memory of
deceased veterans and members of the Armed Forces and to comfort their
survivors, conducts programs for religious, charitable, scientific,
literary, or educational purposes, sponsors or participates in
activities of a patriotic nature, and provides social and recreational
activities for their members.
(a) All buildings, and so much of the real property on which the
buildings are situated as may be required for the convenient use and
occupation of the buildings, used exclusively for charitable purposes,
owned by a veterans’ organization that has been chartered by the
Congress of the United States, organized and operated for charitable
purposes, and exempt from fede

In [77]:
# Create a dict to hold our summaries
summaries = {}
summaries['TextRank (Baseline)'] = summarized_articles

In [78]:
print(summaries['TextRank (Baseline)'][0])

(c) Section 501(c)(19) of the Internal Revenue Code and related
federal regulations provide for the exemption for posts or
organizations of war veterans, or an auxiliary unit or society of, or
a trust or foundation for, any such post or organization that, among
other attributes, carries on programs to perpetuate the memory of
deceased veterans and members of the Armed Forces and to comfort their
survivors, conducts programs for religious, charitable, scientific,
literary, or educational purposes, sponsors or participates in
activities of a patriotic nature, and provides social and recreational
activities for their members.
(a) All buildings, and so much of the real property on which the
buildings are situated as may be required for the convenient use and
occupation of the buildings, used exclusively for charitable purposes,
owned by a veterans’ organization that has been chartered by the
Congress of the United States, organized and operated for charitable
purposes, and exempt from fede

# Appyling various Transformer models to the BillSum dataset

Here we introduce a function called `generate_summary` that allows you to generate a summary of a given input text using various pre-trained transformer models. The function supports the following models:

- BART [Modelcard](https://huggingface.co/facebook/bart-large-cnn), [Paper](https://arxiv.org/abs/1910.13461)
- T5 [Modelcard](https://huggingface.co/sysresearch101/t5-large-finetuned-xsum-cnn), [Paper](https://arxiv.org/abs/1910.10683)
- ProphetNet [Modelcard](), [Paper](https://arxiv.org/abs/2001.04063)
- Pegasus [Modelcard](https://huggingface.co/google/pegasus-cnn_dailymail), [Paper](https://arxiv.org/abs/1912.08777)

__Function: generate_summary__ <br>
The `generate_summary` function takes two arguments: `model_name` and `sample_text`. The model_name argument is a string that specifies the transformer model to use for summarization, while the `sample_text` argument is the input text to be summarized.

__Usage__

To use the generate_summary function, simply call it with the desired model name and the input text:

`summary = generate_summary("BART", "This is an example text to summarize using BART.")
print(summary)
`

__Models and Tokenizers__

The function sets up the tokenizer and model using the `AutoTokenizer` and `AutoModelForSeq2SeqLM` classes from the Hugging Face Transformers library. The model and tokenizer are specified in the `model_dict` dictionary:

`model_dict = {
    "BART": "facebook/bart-large-cnn",
    "T5": "sysresearch101/t5-large-finetuned-xsum-cnn",
    "ProphetNet": "microsoft/prophetnet-large-uncased-cnndm"
}
`

__Handling Text Chunks__

Some transformer models have a __maximum input length constraint__. To handle this, the input text is broken down into smaller chunks before being passed to the pipeline. The chunks are processed one by one, and the generated summaries are concatenated to form the final summary.


__Error Handling__

If the specified model_name is not supported, the function raises a ValueError.

__Example__

Here's an example of how to use the generate_summary function:

`summary = generate_summary("BART", "This is an example text to summarize using BART.")
print(summary)
`

In [79]:
def generate_summary(model_name, sample_text):
    """
    Generate a summary of the input text using a specified model.

    Args:
        model_name (str): The name of the model to use for summarization. Supported values are "BART", "T5",
            "ProphetNet", "Pegasus", and "GPT-2".
        sample_text (str): The input text to summarize.

    Returns:
        str: The summary generated by the specified model.

    Raises:
        ValueError: If the specified model is not supported.

    Notes:
        - If `model_name` is "Pegasus", the `google/pegasus-cnn_dailymail` model will be used for summarization.
          This model doesn't require a tokenizer or a maximum length.
        - If `model_name` is "GPT-2", the `gpt2-xl` model will be used for text generation. The summary will be
          generated by extracting the first paragraph from the generated text that follows the "TL;DR:" token.

        - For other models, the `generate_summary` function uses the `pipeline` method from the transformers library to
          generate the summary. Since some of these models have a maximum input length constraint, the input text needs to
          be broken down into smaller chunks before being passed to the `pipeline`. The size of the chunks is determined
          by the `max_length` parameter of the tokenizer used for the specified model. The chunks are processed one by one,
          and the generated summaries are concatenated to form the final summary.

    Example:
        >>> generate_summary("BART", "This is an example text to summarize using BART.")
        'BART is used to summarize the input text.'
    """

    # Set up tokenizer and model
    model_dict = {
        "BART": "facebook/bart-large-cnn",
        "T5": "sysresearch101/t5-large-finetuned-xsum-cnn",
        "ProphetNet": "microsoft/prophetnet-large-uncased-cnndm"
    }

    if model_name == "Pegasus":
        # Generate summary using the specified model
        summarization_pipeline = pipeline("summarization", model="google/pegasus-cnn_dailymail")
        chunks = [sample_text[i:i+2048] for i in range(0, len(sample_text), 2048)]
        summaries = []
        for chunk in chunks:
            summarization_output = summarization_pipeline(chunk, max_length=512)
            summaries.append(summarization_output[0]["summary_text"].replace(" .<n>", ".\n"))
        summary = " ".join(summaries)
    elif model_name in model_dict.keys():
        tokenizer = AutoTokenizer.from_pretrained(model_dict[model_name], max_length=1024, truncation=True)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_dict[model_name])
        gen_config = GenerationConfig(max_length=512)
        summarization_pipeline = pipeline("summarization", tokenizer=tokenizer, model=model, config=gen_config)
        # Generate summary using the specified model
        chunks = [sample_text[i:i+1024] for i in range(0, len(sample_text), 1024)]
        summaries = []
        for chunk in chunks:
            summarization_output = summarization_pipeline(chunk)
            summaries.append(summarization_output[0]["summary_text"])
        summary = " ".join(summaries)
    else:
        raise ValueError(f"Model {model_name} is not supported.")

    return summary


In [80]:
# Hide transformers output
logging.getLogger("transformers").setLevel(logging.ERROR)

## Generate summary from  <font color='red'>"Pegasus"</font> model

In [81]:
sample_text = sample[0]
summaries["Pegasus"] = generate_summary("Pegasus", sample_text)

## Generate summary from  <font color='red'>"BART"</font> model

In [82]:
summaries["BART"] = generate_summary("BART", sample_text)

## Generate summary from  <font color='red'>"T5"</font> model

In [83]:
summaries["T5"] = generate_summary("T5", sample_text)



## Generate summary from  <font color='red'>"ProphetNet"</font> model

In [84]:
summaries["ProphetNet"] = generate_summary("ProphetNet", sample_text)

## Generate summary from  <font color='red'>"BigBirdPegasus"</font> model

BigBird, is a sparse-attention based transformer which extends Transformer based models, such as BERT to much longer sequences. Moreover, BigBird comes along with a theoretical understanding of the capabilities of a complete transformer that the sparse model can handle. BigBird was introduced in this [paper](https://arxiv.org/abs/2007.14062), [Modelcard](https://huggingface.co/google/bigbird-pegasus-large-arxiv).
<br><br>

__Model description__

BigBird relies on block sparse attention instead of normal attention (i.e. BERT's attention) and can handle sequences up to a length of 4096 at a much lower compute cost compared to BERT. It has achieved SOTA on various tasks involving very long sequences such as long documents summarization, question-answering with long contexts.

The models checkpoint is obtained after fine-tuning BigBirdPegasusForConditionalGeneration for summarization on arxiv dataset from scientific papers.

In [85]:

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
model = BigBirdPegasusForConditionalGeneration.from_pretrained("google/bigbird-pegasus-large-arxiv", attention_type="original_full", block_size=16, num_random_blocks=2)

text = sample_text
inputs = tokenizer.encode_plus(text, return_tensors='pt', max_length=4096, truncation=True)
prediction = model.generate(**inputs, max_length=512)  # Set max_length here
prediction = tokenizer.batch_decode(prediction)


In [86]:
summaries['bigbird-pegasus'] = prediction

In [87]:
summaries['bigbird-pegasus'] = summaries['bigbird-pegasus'][0]

In [88]:
print_summaries(summaries, reference)

[1mGround truth[0m
Existing property tax law establishes a veterans’ organization
exemption under which property is exempt from taxation if, among other
things, that property is used exclusively for charitable purposes and
is owned by a veterans’ organization.
This bill would provide that the veterans’ organization exemption
shall not be denied to a property on the basis that the property is
used for fraternal, lodge, or social club purposes, and would make
specific findings and declarations in that regard. The bill would also
provide that the exemption shall not apply to any portion of a
property that consists of a bar where alcoholic beverages are served.
Section 2229 of the Revenue and Taxation Code requires the Legislature
to reimburse local agencies annually for certain property tax revenues
lost as a result of any exemption or classification of property for
purposes of ad valorem property taxation.
This bill would provide that, notwithstanding Section 2229 of the
Revenue and Ta

### Evaluating Summarization Models using ROUGE and BLEU Metrics

In this section, we demonstrate how to evaluate the summarization models using both the ROUGE and BLEU metrics. We use the Hugging Face's implementation of these metrics.

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters.
[Link](https://huggingface.co/spaces/evaluate-metric/rouge)

BLEU, or Bilingual Evaluation Understudy, is an evaluation metric used primarily for machine translation. It measures the similarity between a candidate translation and a set of reference translations, considering both the n-gram precision and a brevity penalty factor.
[Link](https://huggingface.co/spaces/evaluate-metric/google_bleu)

[Info about Huggingface evaluate](https://huggingface.co/docs/evaluate/v0.4.0/en/package_reference/loading_methods#evaluate.load)
<br> <br>

__Note:__ It can be beneficial to use both ROUGE and BLEU metrics for evaluation, as they focus on different aspects of the generated summaries. ROUGE is recall-oriented and mainly focuses on the coverage of important content, while BLEU is precision-oriented and evaluates the fluency and correctness of the generated text.

It is important to have an understanding of what constitutes a "good" score when evaluating summarization models. Here are some guidelines for interpreting the ROUGE and BLEU scores:

__ROUGE Scores__

ROUGE scores usually range from 0 to 1, with 1 being a perfect match between the generated summary and the reference summary. In practice, scores closer to 1 are rare, especially for abstractive summarization tasks.

For a simple baseline model like TextRank or LSA, a "good" ROUGE score can be around 0.3 to 0.4. For more advanced Transformer-based models like BART, T5, or Pegasus, a "good" ROUGE score can range from 0.4 to 0.6, depending on the task and dataset. State-of-the-art models can achieve even higher scores, but it's important to keep in mind that ROUGE scores are just one aspect of evaluating the quality of generated summaries.

__BLEU Scores__

BLEU scores also range from 0 to 1, with 1 indicating a perfect match between the generated text and the reference text. However, BLEU scores tend to be lower compared to ROUGE scores, as they are more sensitive to the differences in word order and phrasing.

For a simple baseline model, a "good" BLEU score could be around 0.1 to 0.2. For Transformer-based models, a "good" BLEU score might range from 0.2 to 0.4, again depending on the task and dataset. State-of-the-art models can achieve higher scores, but just like with ROUGE, it's essential to consider other factors when evaluating the quality of generated summaries.

Keep in mind that these are rough guidelines, and the definition of a "good" score can vary depending on the specific domain, dataset, and evaluation criteria. It is always a good idea to compare your model's scores with scores from existing state-of-the-art models on the same dataset to get a better understanding of its performance.

__Defining the SummarizationMetrics Class__

The SummarizationMetrics class is defined with methods for computing the ROUGE and BLEU metrics, as well as a method to compute both metrics together:

`class SummarizationMetrics:
    ...
`

This class is located in the `utils.py` file, feel free to look it up to get a better understanding how the mectrics are computed.

__Computing Metrics__

To compute the metrics, we first instantiate the SummarizationMetrics class:
`evaluator = SummarizationMetrics()
`

Then, we call the compute_sum_metric method on the evaluator object, passing the generated summaries and the reference summary:
`metrics_df = evaluator.compute_sum_metric(summaries, reference)
`

The compute_sum_metric method computes both the ROUGE and BLEU metrics by calling the `compute_rouge_metrics` and `compute_google_bleu_metrics` methods, respectively. The resulting metrics are combined into a single DataFrame and returned.

__Viewing the Metrics__

We can display the resulting metrics DataFrame to see the evaluation scores for each model:
`metrics_df
`

This DataFrame shows the ROUGE and BLEU scores for each summarization model, allowing for an easy comparison of their performance.


__Note:__ It's crucial to understand that evaluating a summarization model based on just one text example from a testing dataset is not enough to determine its overall performance. The reason behind this is that the complexity of the text samples can vary significantly within a dataset. Some text samples might be easier for a model to summarize, while others could be more challenging, containing complex sentence structures, domain-specific jargon, or long and convoluted narratives.

To get a more comprehensive sense of a model's performance, it is essential to evaluate it on a larger number of samples from the testing dataset. By doing so, we can ensure that we're assessing the model's ability to handle a diverse set of inputs, including those that differ in complexity, topic, and structure. This approach provides a more reliable and generalizable evaluation of the model's performance, helping us to better understand its strengths and weaknesses.

In the next section of this notebook, we will expand our evaluation to include multiple text samples from the testing dataset. This will give us a better understanding of how well the model performs across a variety of inputs and allow us to make more informed decisions about its applicability to real-world use cases.

In [89]:
evaluator = SummarizationMetrics()

metrics_df = evaluator.compute_sum_metric(summaries, reference)

metrics_df


100%|██████████| 6/6 [00:01<00:00,  5.51it/s]
100%|██████████| 6/6 [00:00<00:00, 222.32it/s]


Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum,google_bleu
TextRank (Baseline),0.414815,0.094293,0.177778,0.340741,0.128283
Pegasus,0.493151,0.224771,0.287671,0.438356,0.218349
BART,0.374456,0.197962,0.22061,0.328012,0.143989
T5,0.460829,0.203704,0.230415,0.368664,0.184557
ProphetNet,0.401198,0.156627,0.179641,0.317365,0.139651
bigbird-pegasus,0.11828,0.0,0.107527,0.11828,0.030905


## Evaluate more than one example

The `generate_avg_summary` is almost identical to `generate_summary`. However the main differences are:

1. The choice of models: While both functions support "BART", "T5", "ProphetNet", and "Pegasus", the generate_avg_summary function also supports the "BigBirdPegasus" model, which is designed for handling longer input sequences.

2. Handling long input text: The generate_avg_summary function takes advantage of the BigBirdPegasus model's ability to handle longer input sequences (up to 4096 tokens). This can be particularly useful for summarizing long documents or articles.

In [90]:
def generate_avg_summary(model_name, sample_text):
    """
    This function generates an average summary for the given text using the
    specified transformer model. Different transformer models such as Pegasus,
    T5, and others are supported.
    Each model has its own tokenizer and configuration settings. For example,
    Pegasus and T5 use a maximum length of 1024, while BigBirdPegasus has a
    maximum length of 4096. The input text is chunked according to the model's
    maximum length and processed accordingly. The final summary is obtained by
    concatenating the summaries generated for each chunk.


    Args:
        model_name (str): The name of the model to use for summarization. Supported values are "BART", "T5",
            "ProphetNet", "Pegasus", and "BigBirdPegasus".
        sample_text (str): The input text to summarize.

    Returns:
        str: The summary generated by the specified model.

    Raises:
        ValueError: If the specified model is not supported.

    Example:
        >>> generate_summary("BART", "This is an example text to summarize using BART.")
        'BART is used to summarize the input text.'
    """

    # Set up tokenizer and model
    model_dict = {
        "BART": ("facebook/bart-large-cnn", None),
        "T5": ("t5-large", None),
        "ProphetNet": ("microsoft/prophetnet-large-uncased-cnndm", None),
        "Pegasus": ("google/pegasus-cnn_dailymail", 1024),
        "BigBirdPegasus": ("google/bigbird-pegasus-large-arxiv", 4096),
    }

    if model_name in model_dict:
        if model_name == "Pegasus":
            tokenizer = AutoTokenizer.from_pretrained(model_dict[model_name][0], max_length=1024, truncation=True)
            model = AutoModelForSeq2SeqLM.from_pretrained(model_dict[model_name][0])
            summarization_pipeline = pipeline("summarization", tokenizer=tokenizer, model=model, max_length=model_dict[model_name][1])
            chunks = [sample_text[i:i+model_dict[model_name][1]] for i in range(0, len(sample_text), model_dict[model_name][1])]
            summaries = []
            for chunk in tqdm(chunks):
                summarization_output = summarization_pipeline(chunk)
                summaries.append(summarization_output[0]["summary_text"])
            summary = " ".join(summaries)
        elif model_name == "BigBirdPegasus":
            tokenizer = AutoTokenizer.from_pretrained(model_dict[model_name][0], max_length=4096, truncation=True)
            model = AutoModelForSeq2SeqLM.from_pretrained(model_dict[model_name][0])
            inputs = tokenizer(sample_text, return_tensors='pt', max_length=4096, truncation=True)
            outputs = model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
            summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
        else:
            tokenizer = AutoTokenizer.from_pretrained(model_dict[model_name][0], model_max_length=1024, truncation=True)
            model = AutoModelForSeq2SeqLM.from_pretrained(model_dict[model_name][0])
            if model_name == "T5":
                gen_config = GenerationConfig(max_length=1024)
                summarization_pipeline = pipeline("summarization", tokenizer=tokenizer, model=model, config=gen_config)
            else:
                summarization_pipeline = pipeline("summarization", tokenizer=tokenizer, model=model, max_length=512)
            chunks = [sample_text[i:i+1024] for i in range(0, len(sample_text), 1024)]
            summaries = []
            for chunk in tqdm(chunks):
                summarization_output = summarization_pipeline(chunk)
                summaries.append(summarization_output[0]["summary_text"])
            summary = " ".join(summaries)


    else:
        raise ValueError(f"Model {model_name} is not supported.")

    return summary


## Summarizing a Small Dataset using Transformer variants and `generate_avg_summary`

In this section, we demonstrate how to use the `generate_avg_summary` function to generate summaries for a small dataset using the various models. The dataset in this example is assumed to be in the `dataset_test`
 variable, which is an instance of Hugging Face's Dataset class.

__Selecting a Small Range of Samples__

First, we select a small range of samples from the dataset. In this example, we select the first 10 samples:
`# Select a small range of samples
dataset_small = dataset_test.select(range(10))
`

__Converting the Dataset to a Pandas DataFrame__
Next, we convert the small dataset to a Pandas DataFrame, which provides an easier way to apply the generate_avg_summary function on each article:
`# Convert dataset to pandas dataframe
df = dataset_small.to_pandas()
`

__Generating Summaries using BART, T5, Pegasus and our baseline, TextRank__

Now that we have a DataFrame containing the articles, we can apply the __generate_avg_summary__ function to each article to generate BART-based summaries. For each article, a progress bar will appear, indicating the progress of the summarization process. Since we have selected 10 articles, 10 progress bars will be displayed.
`# Generate summaries for each article with BART
df["BART_avg"] = df["text"].apply(lambda x: generate_avg_summary("BART", x))
`

After the summarization process is complete, the resulting DataFrame df will contain a new column named "Modl_avg" with the Model-generated summaries for each article. You can adjust the `.select(range(10)` with different ranges to evaluate less or more articles.

In [91]:
# Select a small range of samples
# See: https://huggingface.co/docs/datasets/v1.1.1/processing.html
dataset_small = dataset_test.select(range(10))

# Convert dataset to pandas dataframe
df = dataset_small.to_pandas()

# Note: every bar appearing is == to one data iteration, so if you have
# dataset_test.select(range(10)) 10 bars will apprear
# Generate summaries for each article with BART
df["BART_avg"] = df["text"].apply(lambda x: generate_avg_summary("BART", x))


100%|██████████| 9/9 [00:42<00:00,  4.67s/it]
100%|██████████| 9/9 [00:46<00:00,  5.22s/it]
100%|██████████| 14/14 [01:07<00:00,  4.82s/it]
100%|██████████| 11/11 [00:51<00:00,  4.70s/it]
100%|██████████| 7/7 [00:37<00:00,  5.41s/it]
100%|██████████| 12/12 [01:01<00:00,  5.12s/it]
100%|██████████| 16/16 [01:19<00:00,  4.96s/it]
100%|██████████| 7/7 [00:37<00:00,  5.37s/it]
100%|██████████| 11/11 [00:56<00:00,  5.16s/it]
100%|██████████| 10/10 [00:49<00:00,  4.93s/it]


In [92]:
# Generate summaries for each article with T5
df["T5_avg"] = df["text"].apply(lambda x: generate_avg_summary("T5", x))

100%|██████████| 9/9 [01:26<00:00,  9.65s/it]
100%|██████████| 9/9 [01:31<00:00, 10.19s/it]
100%|██████████| 14/14 [02:08<00:00,  9.21s/it]
100%|██████████| 11/11 [01:25<00:00,  7.81s/it]
100%|██████████| 7/7 [01:16<00:00, 10.93s/it]
100%|██████████| 12/12 [01:57<00:00,  9.78s/it]
100%|██████████| 16/16 [02:20<00:00,  8.78s/it]
100%|██████████| 7/7 [01:08<00:00,  9.75s/it]
100%|██████████| 11/11 [01:52<00:00, 10.20s/it]
100%|██████████| 10/10 [01:34<00:00,  9.50s/it]


In [93]:
# Generate summaries for each article with Pegasus
df["Pegasus_avg"] = df["text"].apply(lambda x: generate_avg_summary("Pegasus", x))

100%|██████████| 9/9 [01:52<00:00, 12.55s/it]
100%|██████████| 9/9 [02:00<00:00, 13.39s/it]
100%|██████████| 14/14 [02:39<00:00, 11.36s/it]
100%|██████████| 11/11 [01:51<00:00, 10.14s/it]
100%|██████████| 7/7 [01:39<00:00, 14.15s/it]
100%|██████████| 12/12 [02:38<00:00, 13.22s/it]
100%|██████████| 16/16 [03:33<00:00, 13.36s/it]
100%|██████████| 7/7 [01:36<00:00, 13.81s/it]
100%|██████████| 11/11 [02:24<00:00, 13.16s/it]
100%|██████████| 10/10 [01:38<00:00,  9.88s/it]


In [94]:
# get 10 examples
articles_np_10 = articles_np[:10]
articles_np_10.shape

(10,)

In [95]:
# Define a partial new function with the `words` argument fixed to 120
summarize_text_120 = partial(summarize_text, words=250)

# Parallelize the TextRank summarization
with Pool() as pool:
    summarized_articles = pool.map(summarize_text_120, articles_np_10)

In [96]:
df['TextRank (Baseline)_avg'] = summarized_articles

## Evaluating Summarization Models using ROUGE Metrics

In this section, we demonstrate how to evaluate the summarization models using the ROUGE metrics. ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters.

The evaluation is performed using the Hugging Face's implementation of ROUGE, which is a wrapper around the Google Research reimplementation of ROUGE.

The `compute_rouge` function, which calculates the ROUGE scores, is located in the `utils.py` file. This file is assumed to have been imported from a GitHub repository, and you can refer to it for more insights on the implementation of the compute_rouge function.

__Defining Model Names__

First, we define a list of model names that we want to evaluate. In this example, we include the TextRank (baseline), BART, T5, and Pegasus models:
`model_names = ["TextRank (Baseline)", "BART", "T5", "Pegasus"]
`

__Computing ROUGE Metrics__

Next, we use the compute_rouge function to calculate the ROUGE scores for each of the models:
`results = compute_rouge(df, model_names)
`

The compute_rouge function takes the DataFrame df and the list of model names as inputs, and returns a dictionary containing the ROUGE scores for each model.

__Creating a DataFrame for Average Metrics__

To better visualize and analyze the results, we can convert the results dictionary to a Pandas DataFrame:
`avg_metrics = pd.DataFrame.from_dict(results)
`

__Displaying the Results__

Finally, we display the DataFrame containing the average ROUGE metrics for each model:
`avg_metrics.T
`

By examining the DataFrame, you can compare the performance of the different summarization models and decide which one best suits your needs.

In [97]:
model_names = ["TextRank (Baseline)", "BART", "T5", "Pegasus"]
results = compute_rouge(df, model_names)

In [98]:
results_bleu = compute_bleu(df, model_names)

In [99]:
avg_metrics = pd.DataFrame.from_dict(results)
avg_metrics = avg_metrics.T

In [100]:
avg_metrics

Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
TextRank (Baseline),0.450525,0.20071,0.246021,0.370348
BART,0.449934,0.206406,0.23239,0.311584
T5,0.475323,0.212825,0.23876,0.321936
Pegasus,0.440725,0.19245,0.222895,0.307402


In [101]:
df_bleu = pd.DataFrame.from_dict(results_bleu)
df_bleu = df_bleu.T
df_bleu

Unnamed: 0,google_bleu
TextRank (Baseline),0.184799
BART,0.166654
T5,0.170961
Pegasus,0.154844


In [102]:
pd.concat([avg_metrics, df_bleu], axis=1)

Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum,google_bleu
TextRank (Baseline),0.450525,0.20071,0.246021,0.370348,0.184799
BART,0.449934,0.206406,0.23239,0.311584,0.166654
T5,0.475323,0.212825,0.23876,0.321936,0.170961
Pegasus,0.440725,0.19245,0.222895,0.307402,0.154844
