## ⚡ Further Notebooks In This Course ⚡

**Notebooks:**
1. [LLM 01 - How to use LLMs with Hugging Face](https://www.kaggle.com/code/aliabdin1/llm-01-llms-with-hugging-face)
2. [LLM 02 - Embeddings, Vector Databases, and Search](https://www.kaggle.com/code/aliabdin1/llm-02-embeddings-vector-databases-and-search)
3. [LLM 03 - Building LLM Chain](https://www.kaggle.com/code/aliabdin1/llm-03-building-llm-chain)
4. [LLM 04a - Fine-tuning LLMs](https://www.kaggle.com/code/aliabdin1/llm-04a-fine-tuning-llms)
4. [LLM 04b - Evaluating LLMs](https://www.kaggle.com/code/aliabdin1/llm-04b-evaluating-llms)
5. [LLM 05 - LLMs and Society](https://www.kaggle.com/code/aliabdin1/llm-05-llms-and-society)
6. [LLM 06 - LLMOps](https://www.kaggle.com/code/aliabdin1/llm-06-llmops)

**Hands-on Lab Notebooks:**
1. [LLM 01L - How to use LLMs with Hugging Face Lab](https://www.kaggle.com/code/aliabdin1/llm-01l-llms-with-hugging-face-lab)
2. [LLM 02L - Embeddings, Vector Databases, and Search Lab](https://www.kaggle.com/code/aliabdin1/llm-02l-embeddings-vector-databases-and-search)
3. [LLM 03L - Building LLM Chains Lab](https://www.kaggle.com/code/aliabdin1/llm-03l-building-llm-chains-lab)
4. [LLM 04L - Fine-tuning LLMs Lab](https://www.kaggle.com/code/aliabdin1/llm-04l-fine-tuning-llms-lab)
5. [LLM 05L - LLMs and Society Lab](https://www.kaggle.com/code/aliabdin1/llm-05l-llms-and-society-lab)

<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Evaluating Large Language Models (LLMs)
This notebook demonstrates methods for evaluating LLMs.  We focus on the task of summarization and cover accuracy, ROUGE-N, and perplexity.

### ![Dolly](https://files.training.databricks.com/images/llm/dolly_small.png) Learning Objectives
1. Know how to compute ROUGE-N and other metrics.
2. Gain an intuitive understanding of ROUGE-N.
3. Test various models and model sizes on the same data, and compare their results.

## Classroom Setup

In [1]:
%pip install rouge_score==0.1.2

Note: you may need to restart the kernel to use updated packages.


In [2]:
!pip install evaluate


In [3]:
!pip install -U accelerate --quiet


In [4]:
#%run ../Includes/Classroom-Setup

In [5]:
mkdir cache

mkdir: cannot create directory ‘cache’: File exists


## How can we evaluate summarization?

Suppose you are developing a smartphone news app and need to display automatically generated summaries of breaking news articles.  How can you evaluate whether or not the summaries you are generating are good?

![](https://drive.google.com/uc?export=view&id=1V6cMD1LgivCb850JDhva1DO9EWVH8rJ7)

## Dataset

We will use a subset of the `cnn_dailymail` dataset from See et al., 2017, downloadable from the Hugging Face `datasets` hub: https://huggingface.co/datasets/cnn_dailymail

This dataset provides news articles paired with summaries (in the "highlights" column).  Let's load the data and take a look at some examples.

In [6]:
import torch
from datasets import load_dataset

full_dataset = load_dataset(
    "cnn_dailymail", version="3.0.0", cache_dir="../working/cache"
)  # Note: We specify cache_dir to use pre-cached data.

# Use a small sample of the data during this lab, for speed.
sample_size = 100
sample = (
    full_dataset["train"]
    .filter(lambda r: "CNN" in r["article"][:25])
    .shuffle(seed=42)
    .select(range(sample_size))
)
sample


Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 100
})

In [7]:
display(sample.to_pandas())

Unnamed: 0,article,highlights,id
0,(CNN) -- Congolese side TP Mazembe Englebert b...,Congolese side TP Mazembe beat Internacional 2...,411f948ee838c8167e06e8176a50401903b572b5
1,(CNN) -- The family of Kendrick Johnson has su...,School district failed to protect Kendrick Joh...,264e6285453d651a86d1c5a4bbd69f3f0d72e380
2,"(CNN) -- The city of Berkeley, California, is ...",Feds file a civil forfeiture complaint to clos...,735dd139358dca6b596d9a62a0742e78a05c24b2
3,"Seoul, South Korea (CNN) -- President Park Geu...","The president's office says a spokesman ""damag...",50fd059adbc39b759a9f21ef25390382bb663405
4,Washington (CNN) -- Shortly after Elena Kagan ...,Kagan sought to block military recruiters from...,6689c823b9eb20e1e0ccf41cd935dd6197bd3cde
...,...,...,...
95,(CNN) -- Authorities in Arizona said Tuesday t...,Authorities seize $7.8 in cash and make more t...,2893b0fdae01c675294b7bdf0b5a5249abbd11f8
96,(CNN) -- French champions Lille have revealed ...,Lille reveal that striker Gervinho is set to h...,da4743727c9c322cedccbf21d8b94f867f9b1276
97,Washington (CNN) -- Sen. Tim Kaine of Virginia...,Hillary Clinton's 2016 decision could be known...,c5c217906d3daf61ad358bd24fa068b260e6b254
98,"(CNN) -- I own a property in Fort Pierre, Sout...",Keystone XL pipeline would bring tar sands cru...,4ed825bd03a54f36fb2b65dd7954cab03ddd03d1


In [8]:
example_article = sample["article"][0]
example_summary = sample["highlights"][0]
print(f"Article:\n{example_article}\n")
print(f"Summary:\n{example_summary}")

Article:
(CNN) -- Congolese side TP Mazembe Englebert became the first African team to clinch a place in FIFA's Club World Cup final after a stunning victory over Brazilian giants Internacional. Goals from Mulota Kubangu and Dioko Kaluyituka secured a 2-0 win for Mazembe in Abu Dhabi  and handed them a shot at being crowned world champions in Saturday's final. They become the first team from outside Europe or South America to reach the showpiece final and will face the winners of Wednesday's clash between Italian and European champions Inter Milan and Asian Champions League holders Seongnam Ilhwa FC, from Korea. Mazembe's coach, Lamine N'Diaye, told FIFA's website: "It's something special for us. We are here to represent Africa, and all of Africa will be proud of our work. "We believed in ourselves, we were confident and you could see that when we started attacking, especially at the start of the second half. We were lucky too, and don't forget that our goalkeeper was excellent -- he w

## Summarization

In [9]:
import pandas as pd
import torch
import gc
from transformers import AutoTokenizer, T5ForConditionalGeneration



In [10]:
def batch_generator(data: list, batch_size: int):
    """
    Creates batches of size `batch_size` from a list.
    """
    s = 0
    e = s + batch_size
    while s < len(data):
        yield data[s:e]
        s = e
        e = min(s + batch_size, len(data))


def summarize_with_t5(
    model_checkpoint: str, articles: list, batch_size: int = 8
) -> list:
    """
    Compute summaries using a T5 model.
    This is similar to a `pipeline` for a T5 model but does tokenization manually.

    :param model_checkpoint: Name for a model checkpoint in Hugging Face, such as "t5-small" or "t5-base"
    :param articles: List of strings, where each string represents one article.
    :param batch_size: Number of articles to summarize per call to `model.generate`
    :return: List of strings, where each string represents one article's generated summary
    """
    if torch.cuda.is_available():
        device = "cuda:0"
    else:
        device = "cpu"

    model = T5ForConditionalGeneration.from_pretrained(
        model_checkpoint, cache_dir="../working/cache"
    ).to(device)
    tokenizer = AutoTokenizer.from_pretrained(
        model_checkpoint, model_max_length=1024, cache_dir="../working/cache"
    )

    def perform_inference(batch: list) -> list:
        inputs = tokenizer(
            batch, max_length=1024, return_tensors="pt", padding=True, truncation=True
        )

        summary_ids = model.generate(
            inputs.input_ids.to(device),
            attention_mask=inputs.attention_mask.to(device),
            num_beams=2,
            min_length=0,
            max_length=40,
        )
        return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

    res = []

    summary_articles = list(map(lambda article: "summarize: " + article, articles))
    for batch in batch_generator(summary_articles, batch_size=batch_size):
        res += perform_inference(batch)

        torch.cuda.empty_cache()
        gc.collect()

    # clean up
    del tokenizer
    del model
    torch.cuda.empty_cache()
    gc.collect()
    return res

In [11]:
t5_small_summaries = summarize_with_t5("t5-small", sample["article"])

In [12]:
reference_summaries = sample["highlights"]

In [13]:
display(
    pd.DataFrame.from_dict(
        {
            "generated": t5_small_summaries,
            "reference": reference_summaries,
        }
    )
)

Unnamed: 0,generated,reference
0,TP Mazembe Englebert beats internacional 2-0 t...,Congolese side TP Mazembe beat Internacional 2...
1,the family of Kendrick Johnson has sued a sout...,School district failed to protect Kendrick Joh...
2,"the city of Berkeley, California, filed a fede...",Feds file a civil forfeiture complaint to clos...
3,"new: spokesman says he was involved in an ""uns...","The president's office says a spokesman ""damag..."
4,the high court unanimously upheld the Solomon ...,Kagan sought to block military recruiters from...
...,...,...
95,"authorities in Tempe, Arizona say they seized ...",Authorities seize $7.8 in cash and make more t...
96,french champions Lille reveal that striker Ger...,Lille reveal that striker Gervinho is set to h...
97,a Democrat who has vowed to endorse Clinton sh...,Hillary Clinton's 2016 decision could be known...
98,a pipeline pipeline pipeline would bring tar s...,Keystone XL pipeline would bring tar sands cru...


You may see some warning messages in the output above.  While pipelines are handy, they provide less control over the tokenizer and model, so `summarize_with_t5` handles tokenization and generation manually.

But first, let's see how our summarization pipeline does!  We'll compute 0/1 accuracy, a classic ML evaluation metric.

In [14]:
accuracy = 0.0
for i in range(len(reference_summaries)):
    generated_summary = t5_small_summaries[i]
    if generated_summary == reference_summaries[i]:
        accuracy += 1.0
accuracy = accuracy / len(reference_summaries)

print(f"Achieved accuracy {accuracy}!")

Achieved accuracy 0.0!


Accuracy zero?!?  We can see that the (very generic) metric of 0/1 accuracy is not useful for summarization.  Thinking about this more, small variations in wording may not matter much, and many different summaries may be equally valid.  So how can we evaluate summarization?

## ROUGE

Now that we can generate summaries---and we know 0/1 accuracy is useless here---let's look at how we can compute a meaningful metric designed to evaluate summarization: ROUGE.

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of evaluation metrics for comparing summaries, introduced by Lin (2004).  See https://en.wikipedia.org/wiki/ROUGE_(metric) for more info.  Here, we use the Hugging Face `evaluate` library as a wrapper around the `rouge_score` package.  This wrapper reports 4 scores:

* `rouge1`: ROUGE computed over unigrams (single words or tokens)
* `rouge2`: ROUGE computed over bigrams (pairs of consecutive words or tokens)
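* `rougeL`: ROUGE based on the longest common subsequence shared by the summaries being compared, with newlines ignored
* `rougeLsum`: like `rougeL`, but at "summary level," i.e., newlines are treated as sentence boundaries and the longest common subsequence is computed sentence by sentence, then combined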
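
To make the n-gram scores above more concrete, here is a minimal sketch of ROUGE-N in plain Python. It is a simplified illustration, not the `rouge_score` implementation: it assumes lowercasing and whitespace tokenization, applies no stemming, and the helper names `ngrams` and `rouge_n` are hypothetical.

```python
from collections import Counter


def ngrams(tokens: list, n: int) -> Counter:
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction: str, reference: str, n: int = 1) -> dict:
    """Simplified ROUGE-N: lowercase, split on whitespace, no stemming."""
    pred_counts = ngrams(prediction.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    # Clipped overlap: an n-gram counts at most as often as it occurs in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Unigram and bigram scores for two sentences that differ in one word.
print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1))
print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2))
```

Precision is the share of predicted n-grams that also occur in the reference, recall is the share of reference n-grams that were predicted, and F1 combines the two.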

In [15]:
import evaluate
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

rouge_score = evaluate.load("rouge")

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



You can call the `rouge_score` evaluator directly, but we provide a convenience function below to handle the expected input format.

In [16]:
def compute_rouge_score(generated: list, reference: list) -> dict:
    """
    Compute ROUGE scores on a batch of articles.

    This is a convenience function wrapping Hugging Face `rouge_score`,
    which expects sentences to be separated by newlines.

    :param generated: Summaries (list of strings) produced by the model
    :param reference: Ground-truth summaries (list of strings) for comparison
    """
    generated_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in generated]
    reference_with_newlines = ["\n".join(sent_tokenize(s.strip())) for s in reference]
    return rouge_score.compute(
        predictions=generated_with_newlines,
        references=reference_with_newlines,
        use_stemmer=True,
    )

In [17]:
# ROUGE scores for our batch of articles
compute_rouge_score(t5_small_summaries, reference_summaries)

{'rouge1': 0.30852863404389136,
 'rouge2': 0.11774726736603104,
 'rougeL': 0.23141941719091055,
 'rougeLsum': 0.2854116516303946}

## Understanding ROUGE scores

In [18]:
# Sanity check: What if our predictions match the references exactly?
compute_rouge_score(reference_summaries, reference_summaries)

{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}

In [19]:
# And what if we fail to predict anything?
compute_rouge_score(
    generated=["" for _ in range(len(reference_summaries))],
    reference=reference_summaries,
)

{'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0, 'rougeLsum': 0.0}

Stemming predictions and references can help to ignore minor differences.

We will use `rouge_score.compute()` directly for these hand-constructed examples.

In [20]:
rouge_score.compute(
    predictions=["Large language models beat world record"],
    references=["Large language models beating world records"],
    use_stemmer=False,
)

{'rouge1': 0.6666666666666666,
 'rouge2': 0.4000000000000001,
 'rougeL': 0.6666666666666666,
 'rougeLsum': 0.6666666666666666}

In [21]:
rouge_score.compute(
    predictions=["Large language models beat world record"],
    references=["Large language models beating world records"],
    use_stemmer=True,
)

{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}

Let's look at how the ROUGE score behaves in various situations.

In [22]:
# What if we predict exactly 1 word correctly?
rouge_score.compute(
    predictions=["Large language models beat world record"],
    references=["Large"],
    use_stemmer=True,
)

{'rouge1': 0.2857142857142857,
 'rouge2': 0.0,
 'rougeL': 0.2857142857142857,
 'rougeLsum': 0.2857142857142857}

In [23]:
# The ROUGE F-measure is symmetric with respect to predictions and references (precision and recall swap).
rouge_score.compute(
    predictions=["Large"],
    references=["Large language models beat world record"],
    use_stemmer=True,
)

{'rouge1': 0.2857142857142857,
 'rouge2': 0.0,
 'rougeL': 0.2857142857142857,
 'rougeLsum': 0.2857142857142857}
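
A worked calculation may help explain why the last two results are identical. It assumes the reported values are F-measures ($F_1$), which is consistent with the outputs shown here. With one matching unigram, a 6-word prediction, and a 1-word reference:

```latex
\begin{aligned}
P &= \frac{1}{6}, \qquad R = \frac{1}{1} = 1, \\
F_1 &= \frac{2PR}{P + R}
     = \frac{2 \cdot \frac{1}{6} \cdot 1}{\frac{1}{6} + 1}
     = \frac{2}{7} \approx 0.2857 .
\end{aligned}
```

Swapping predictions and references swaps $P$ and $R$ but leaves $F_1$ unchanged.  `rouge2` is 0 in both cases because a single word contains no bigrams.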

In [24]:
# What about 2 words?  Note how 'rouge1' and 'rouge2' compare with the case when we predict exactly 1 word correctly.
rouge_score.compute(
    predictions=["Large language"],
    references=["Large language models beat world record"],
    use_stemmer=True,
)

{'rouge1': 0.5, 'rouge2': 0.33333333333333337, 'rougeL': 0.5, 'rougeLsum': 0.5}

In [25]:
# Note how rouge1 differs from the rougeN (N>1) scores when we predict word subsequences correctly.
rouge_score.compute(
    predictions=["Models beat large language world record"],
    references=["Large language models beat world record"],
    use_stemmer=True,
)

{'rouge1': 1.0,
 'rouge2': 0.6,
 'rougeL': 0.6666666666666666,
 'rougeLsum': 0.6666666666666666}
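
The `rougeL` value for the reordered sentence can be reproduced with a small longest-common-subsequence (LCS) sketch. As before, this is a simplified illustration that assumes lowercasing and whitespace tokenization without stemming; `lcs_length` and `rouge_l_f1` are hypothetical helper names, not part of `rouge_score`.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-L F1 based on the LCS of whitespace tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "Models beat large language world record",
    "Large language models beat world record",
)
print(round(score, 4))  # 0.6667
```

All six words match, so `rouge1` is 1.0, but the longest in-order match (for example "models beat world record") has only 4 words, giving 4/6 for both precision and recall.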

## Compare small and large models

We've been working with the `t5-small` model so far.  Let's compare several models with different architectures in terms of their ROUGE scores and some example generated summaries.

In [26]:
def compute_rouge_per_row(
    generated_summaries: list, reference_summaries: list
) -> pd.DataFrame:
    """
    Generates a dataframe to compare ROUGE score metrics.
    """
    generated_with_newlines = [
        "\n".join(sent_tokenize(s.strip())) for s in generated_summaries
    ]
    reference_with_newlines = [
        "\n".join(sent_tokenize(s.strip())) for s in reference_summaries
    ]
    scores = rouge_score.compute(
        predictions=generated_with_newlines,
        references=reference_with_newlines,
        use_stemmer=True,
        use_aggregator=False,
    )
    scores["generated"] = generated_summaries
    scores["reference"] = reference_summaries
    return pd.DataFrame.from_dict(scores)

### T5-small

The [T5](https://huggingface.co/docs/transformers/model_doc/t5) [[paper]](https://arxiv.org/pdf/1910.10683.pdf) family of models consists of text-to-text transformers that have been trained on a multi-task mixture of unsupervised and supervised tasks. They are well suited for tasks such as summarization, translation, text classification, question answering, and more.

The t5-small version of the T5 models has 60 million parameters.

In [27]:
# We computed t5_small_summaries above already.
compute_rouge_score(t5_small_summaries, reference_summaries)

{'rouge1': 0.30852863404389136,
 'rouge2': 0.11774726736603104,
 'rougeL': 0.23141941719091055,
 'rougeLsum': 0.2854116516303946}

In [28]:
t5_small_results = compute_rouge_per_row(
    generated_summaries=t5_small_summaries, reference_summaries=reference_summaries
)
display(t5_small_results)

Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum,generated,reference
0,0.486486,0.333333,0.432432,0.486486,TP Mazembe Englebert beats internacional 2-0 t...,Congolese side TP Mazembe beat Internacional 2...
1,0.337662,0.080000,0.181818,0.311688,the family of Kendrick Johnson has sued a sout...,School district failed to protect Kendrick Joh...
2,0.352941,0.072289,0.211765,0.329412,"the city of Berkeley, California, filed a fede...",Feds file a civil forfeiture complaint to clos...
3,0.338462,0.158730,0.276923,0.338462,"new: spokesman says he was involved in an ""uns...","The president's office says a spokesman ""damag..."
4,0.146341,0.000000,0.073171,0.121951,the high court unanimously upheld the Solomon ...,Kagan sought to block military recruiters from...
...,...,...,...,...,...,...
95,0.555556,0.269231,0.481481,0.481481,"authorities in Tempe, Arizona say they seized ...",Authorities seize $7.8 in cash and make more t...
96,0.388889,0.142857,0.250000,0.305556,french champions Lille reveal that striker Ger...,Lille reveal that striker Gervinho is set to h...
97,0.444444,0.257143,0.416667,0.444444,a Democrat who has vowed to endorse Clinton sh...,Hillary Clinton's 2016 decision could be known...
98,0.235294,0.144578,0.235294,0.235294,a pipeline pipeline pipeline would bring tar s...,Keystone XL pipeline would bring tar sands cru...


### T5-base

The [T5-base](https://huggingface.co/t5-base) model has 220 million parameters.

In [29]:
t5_base_summaries = summarize_with_t5(
    model_checkpoint="t5-base", articles=sample["article"]
)
compute_rouge_score(t5_base_summaries, reference_summaries)


{'rouge1': 0.33407544344427237,
 'rouge2': 0.13664062069688038,
 'rougeL': 0.246907841634339,
 'rougeLsum': 0.31288500478905923}

In [30]:
t5_base_results = compute_rouge_per_row(
    generated_summaries=t5_base_summaries, reference_summaries=reference_summaries
)
display(t5_base_results)

Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum,generated,reference
0,0.400000,0.294118,0.400000,0.400000,Congolese side TP Mazembe Englebert beat Brazi...,Congolese side TP Mazembe beat Internacional 2...
1,0.389610,0.080000,0.155844,0.337662,the family of Kendrick Johnson sues a south Ge...,School district failed to protect Kendrick Joh...
2,0.413793,0.164706,0.206897,0.321839,the city of Berkeley is trying to stop the fed...,Feds file a civil forfeiture complaint to clos...
3,0.516129,0.400000,0.483871,0.516129,president park geun-hye dismisses press spokes...,"The president's office says a spokesman ""damag..."
4,0.337662,0.213333,0.233766,0.337662,new: senator says Kagan acted responsibly and ...,Kagan sought to block military recruiters from...
...,...,...,...,...,...,...
95,0.393443,0.169492,0.360656,0.360656,"""these are sophisticated, criminal enterprises...",Authorities seize $7.8 in cash and make more t...
96,0.294118,0.060606,0.176471,0.235294,Ivory Coast striker Gervinho has expressed an ...,Lille reveal that striker Gervinho is set to h...
97,0.400000,0.176471,0.257143,0.342857,democrat says he thinks she will run for presi...,Hillary Clinton's 2016 decision could be known...
98,0.200000,0.076923,0.150000,0.175000,john avlon: keystone XL pipeline is not in our...,Keystone XL pipeline would bring tar sands cru...


### GPT-2

The [GPT-2](https://huggingface.co/gpt2) model is a generative text model that was trained in a self-supervised fashion. Its strength is completing the text that follows a given prompt.  It has 124 million parameters.

In [33]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer


def summarize_with_gpt2(
    model_checkpoint: str, articles: list, batch_size: int = 8
) -> list:
    """
    Convenience function for summarization with GPT2 to handle these complications:
    - Append "TL;DR" to the end of the input to get GPT2 to generate a summary.
    https://huggingface.co/course/chapter7/5?fw=pt
    - Truncate input to handle long articles.
    - GPT2 uses a max token length of 1024.  We use a shorter 512 limit here.

    :param model_checkpoint: reference to checkpointed model
    :param articles: list of strings
    :return: generated summaries, with the input and "TL;DR" removed
    """
    if torch.cuda.is_available():
        device = "cuda:0"
    else:
        device = "cpu"

    tokenizer = GPT2Tokenizer.from_pretrained(
        model_checkpoint, padding_side="left", cache_dir="../working/cache"
    )
    tokenizer.add_special_tokens({"pad_token": tokenizer.eos_token})
    model = GPT2LMHeadModel.from_pretrained(
        model_checkpoint,
        pad_token_id=tokenizer.eos_token_id,
        cache_dir="../working/cache",
    ).to(device)

    def perform_inference(batch: list) -> list:
        tmp_inputs = tokenizer(
            batch, max_length=500, return_tensors="pt", padding=True, truncation=True
        )
        tmp_inputs_decoded = tokenizer.batch_decode(
            tmp_inputs.input_ids, skip_special_tokens=True
        )
        inputs = tokenizer(
            [article + " TL;DR:" for article in tmp_inputs_decoded],
            max_length=512,
            return_tensors="pt",
            padding=True,
            truncation=True,
        )
        summary_ids = model.generate(
            inputs.input_ids.to(device),
            attention_mask=inputs.attention_mask.to(device),
            num_beams=2,
            min_length=0,
            max_length=512 + 32,
        )
        return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

    decoded_summaries = []
    for batch in batch_generator(articles, batch_size=batch_size):
        decoded_summaries += perform_inference(batch)

        # batch clean up
        torch.cuda.empty_cache()
        gc.collect()

    # post-process decoded summaries
    summaries = [
        summary[summary.find("TL;DR:") + len("TL;DR: ") :]
        for summary in decoded_summaries
    ]

    # cleanup
    del tokenizer
    del model
    torch.cuda.empty_cache()
    gc.collect()

    return summaries
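
The prompt construction and post-processing used in `summarize_with_gpt2` can be sketched in isolation, without loading a model. The helpers below (`build_prompt`, `extract_summary`) are hypothetical names introduced for illustration. Unlike the slice in the function above, this sketch returns an empty string if the marker is missing and strips surrounding whitespace, so its output may differ slightly.

```python
PROMPT_SUFFIX = " TL;DR:"


def build_prompt(article: str) -> str:
    """Append the TL;DR cue so the model continues with a summary."""
    return article + PROMPT_SUFFIX


def extract_summary(decoded: str, marker: str = "TL;DR:") -> str:
    """Return the text after the first marker, or "" if the marker is absent."""
    _, found, tail = decoded.partition(marker)
    return tail.strip() if found else ""


# Simulated decoder output: the prompt followed by generated text.
decoded = build_prompt("Some long news article.") + " A short summary."
print(extract_summary(decoded))
```

One reason to guard against a missing marker: `str.find` returns -1 when nothing is found, so adding `len("TL;DR: ")` to it would yield index 6 and silently drop the first six characters of the text.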

In [34]:
gpt2_summaries = summarize_with_gpt2(
    model_checkpoint="gpt2", articles=sample["article"]
)
compute_rouge_score(gpt2_summaries, reference_summaries)


{'rouge1': 0.1909147851253476,
 'rouge2': 0.045434127429818644,
 'rougeL': 0.1477892290674347,
 'rougeLsum': 0.1769305490152419}

In [35]:
gpt2_results = compute_rouge_per_row(
    generated_summaries=gpt2_summaries, reference_summaries=reference_summaries
)
display(gpt2_results)

Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum,generated,reference
0,0.096774,0.033333,0.064516,0.096774,The Congolese side are proud of their work.\n\...,Congolese side TP Mazembe beat Internacional 2...
1,0.285714,0.131148,0.253968,0.285714,The sheriff's office is investigating the deat...,School district failed to protect Kendrick Joh...
2,0.093023,0.000000,0.093023,0.093023,"The dispensary was operating within 1,000 feet...",Feds file a civil forfeiture complaint to clos...
3,0.317460,0.131148,0.222222,0.285714,"Park's press secretary, Yoon Chang-jung, was i...","The president's office says a spokesman ""damag..."
4,0.235294,0.000000,0.164706,0.188235,Kagan was not allowed to attend Harvard's comm...,Kagan sought to block military recruiters from...
...,...,...,...,...,...,...
95,0.100000,0.000000,0.100000,0.100000,Sinaloa cartel busts in Arizona\n\nRead More,Authorities seize $7.8 in cash and make more t...
96,0.273973,0.056338,0.164384,0.246575,Gervinho has expressed an interest in moving t...,Lille reveal that striker Gervinho is set to h...
97,0.278481,0.025974,0.177215,0.278481,Kaine is the only Democrat in the Senate who h...,Hillary Clinton's 2016 decision could be known...
98,0.437500,0.191489,0.229167,0.395833,tar sands oil spills are a serious threat to d...,Keystone XL pipeline would bring tar sands cru...


### Comparing all models

We use a couple of helper functions to compare the above models, first by their evaluation metrics (quantitative) and second by their generated summaries (qualitative).

In [36]:
def compare_models(models_results: dict) -> pd.DataFrame:
    """
    :param models_results: dict of "model name" string mapped to pd.DataFrame of results computed by `compute_rouge_per_row`
    :return: pd.DataFrame with 1 row per model, with columns: model, rouge1, rouge2, rougeL, rougeLsum
    where metrics are averages over input results for each model
    """
    agg_results = []
    for r in models_results:
        model_results = models_results[r].drop(
            labels=["generated", "reference"], axis=1
        )
        agg_metrics = [r]
        agg_metrics[1:] = model_results.mean(axis=0)
        agg_results.append(agg_metrics)
    return pd.DataFrame(
        agg_results, columns=["model", "rouge1", "rouge2", "rougeL", "rougeLsum"]
    )

In [37]:
display(
    compare_models(
        {
            "t5-small": t5_small_results,
            "t5-base": t5_base_results,
            "gpt2": gpt2_results,
        }
    )
)

| | model | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|---|
| 0 | t5-small | 0.307743 | 0.118408 | 0.23168 | 0.285508 |
| 1 | t5-base | 0.333509 | 0.137224 | 0.247565 | 0.312887 |
| 2 | gpt2 | 0.190622 | 0.045745 | 0.147443 | 0.176376 |
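
The averages in this table differ slightly from the aggregate scores returned earlier by `compute_rouge_score` (for example, about 0.3077 versus 0.3085 for `rouge1` on `t5-small`). A likely explanation, stated here as an assumption rather than something verified in this notebook, is that the default aggregator reports a bootstrap estimate instead of a plain mean. The sketch below illustrates that idea with hypothetical per-row scores and a hypothetical helper `bootstrap_mid`.

```python
import random
import statistics


def bootstrap_mid(scores: list, n_samples: int = 1000, seed: int = 0) -> float:
    """Median of the means of `n_samples` resamples drawn with replacement."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_samples)
    ]
    return statistics.median(means)


# Hypothetical per-row scores, for illustration only.
row_scores = [0.49, 0.34, 0.35, 0.34, 0.15, 0.56, 0.39, 0.44, 0.24]
print("plain mean:   ", statistics.fmean(row_scores))
print("bootstrap mid:", bootstrap_mid(row_scores))
```

The two numbers are close but generally not identical, which is the kind of small gap seen between the two ways of aggregating ROUGE scores.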


In [38]:
def compare_models_summaries(models_summaries: dict) -> pd.DataFrame:
    """
    Aggregates results from `models_summaries` and returns a dataframe.
    """
    comparison_df = None
    for model_name in models_summaries:
        summaries_df = models_summaries[model_name]
        if comparison_df is None:
            comparison_df = summaries_df[["generated"]].rename(
                {"generated": model_name}, axis=1
            )
        else:
            comparison_df = comparison_df.join(
                summaries_df[["generated"]].rename({"generated": model_name}, axis=1)
            )
    return comparison_df

In [39]:
# In the output table below, scroll to the right to see all models.
display(
    compare_models_summaries(
        {
            "t5_small": t5_small_results,
            "t5_base": t5_base_results,
            "gpt2": gpt2_results,
        }
    )
)

Unnamed: 0,t5_small,t5_base,gpt2
0,TP Mazembe Englebert beats internacional 2-0 t...,Congolese side TP Mazembe Englebert beat Brazi...,The Congolese side are proud of their work.\n\...
1,the family of Kendrick Johnson has sued a sout...,the family of Kendrick Johnson sues a south Ge...,The sheriff's office is investigating the deat...
2,"the city of Berkeley, California, filed a fede...",the city of Berkeley is trying to stop the fed...,"The dispensary was operating within 1,000 feet..."
3,"new: spokesman says he was involved in an ""uns...",president park geun-hye dismisses press spokes...,"Park's press secretary, Yoon Chang-jung, was i..."
4,the high court unanimously upheld the Solomon ...,new: senator says Kagan acted responsibly and ...,Kagan was not allowed to attend Harvard's comm...
...,...,...,...
95,"authorities in Tempe, Arizona say they seized ...","""these are sophisticated, criminal enterprises...",Sinaloa cartel busts in Arizona\n\nRead More
96,french champions Lille reveal that striker Ger...,Ivory Coast striker Gervinho has expressed an ...,Gervinho has expressed an interest in moving t...
97,a Democrat who has vowed to endorse Clinton sh...,democrat says he thinks she will run for presi...,Kaine is the only Democrat in the Senate who h...
98,a pipeline pipeline pipeline would bring tar s...,john avlon: keystone XL pipeline is not in our...,tar sands oil spills are a serious threat to d...


&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>