### Summarization
In this section we’ll take a look at how Transformer models can be used to condense long documents into summaries, a task known as **text summarization**. This is one of the most challenging NLP tasks as it requires a range of abilities, such as understanding long passages and generating coherent text that captures the main topics in a document. However, when done well, text summarization is a powerful tool that can speed up various business processes by relieving the burden of domain experts to read long documents in detail.

Although there already exist various fine-tuned models for summarization on the [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=summarization&sort=downloads), almost all of these are only suitable for English documents. So, to add a twist in this section, we’ll train a bilingual model for English and Spanish. <br>
By the end of this section, you’ll have a [model](https://huggingface.co/huggingface-course/mt5-small-finetuned-amazon-en-es) that can summarize customer reviews.

As we’ll see, these summaries are concise because they’re learned from the titles that customers provide in their product reviews. Let’s start by putting together a suitable bilingual corpus for this task.

### Preparing a multilingual corpus
We’ll use the [Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi) to create our bilingual summarizer. This corpus consists of Amazon product reviews in six languages and is typically used to benchmark multilingual classifiers. However, since each review is accompanied by a short title, we can use the titles as the target summaries for our model to learn from! <br>
To get started, let’s download the English and Spanish subsets from the Hugging Face Hub:

In [4]:
from datasets import load_dataset

spanish_dataset = load_dataset("amazon_reviews_multi", "es")
english_dataset = load_dataset("amazon_reviews_multi", "en")
english_dataset

Downloading:   0%|          | 0.00/2.74k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.62k [00:00<?, ?B/s]

Downloading and preparing dataset amazon_reviews_multi/es (download: 77.58 MiB, generated: 52.44 MiB, post-processed: Unknown size, total: 130.02 MiB) to /root/.cache/huggingface/datasets/amazon_reviews_multi/es/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609...


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading:   0%|          | 0.00/77.5M [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Downloading:   0%|          | 0.00/1.93M [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Downloading:   0%|          | 0.00/1.94M [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

Dataset amazon_reviews_multi downloaded and prepared to /root/.cache/huggingface/datasets/amazon_reviews_multi/es/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

Downloading and preparing dataset amazon_reviews_multi/en (download: 82.11 MiB, generated: 58.69 MiB, post-processed: Unknown size, total: 140.79 MiB) to /root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609...


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading:   0%|          | 0.00/82.0M [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Downloading:   0%|          | 0.00/2.06M [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Downloading:   0%|          | 0.00/2.05M [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

Dataset amazon_reviews_multi downloaded and prepared to /root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
})

As you can see, for each language there are 200,000 reviews for the train split, and 5,000 reviews for each of the validation and test splits. The review information we are interested in is contained in the **review_body** and **review_title** columns. Let’s take a look at a few examples by creating a simple function that takes a random sample from the training set with the techniques we learned in [Chapter 5](https://huggingface.co/course/chapter5/1):

In [5]:
def show_samples(dataset, num_samples=3, seed=42):
    sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    for example in sample:
        print(f"\n'>> Title: {example['review_title']}'")
        print(f"'>> Review: {example['review_body']}'")


show_samples(english_dataset)


'>> Title: Worked in front position, not rear'
'>> Review: 3 stars because these are not rear brakes as stated in the item description. At least the mount adapter only worked on the front fork of the bike that I got it for.'

'>> Title: meh'
'>> Review: Does it’s job and it’s gorgeous but mine is falling apart, I had to basically put it together again with hot glue'

'>> Title: Can't beat these for the money'
'>> Review: Bought this for handling miscellaneous aircraft parts and hanger "stuff" that I needed to organize; it really fit the bill. The unit arrived quickly, was well packaged and arrived intact (always a good sign). There are five wall mounts-- three on the top and two on the bottom. I wanted to mount it on the wall, so all I had to do was to remove the top two layers of plastic drawers, as well as the bottom corner drawers, place it when I wanted and mark it; I then used some of the new plastic screw in wall anchors (the 50 pound variety) and it easily mounted to the wall. 

This sample shows the diversity of reviews one typically finds online, ranging from positive to negative (and everything in between!). Although the example with the “meh” title is not very informative, the other titles look like decent summaries of the reviews themselves. Training a summarization model on all 400,000 reviews would take far too long on a single GPU, so instead we’ll focus on generating summaries for a single domain of products. <br>
To get a feel for what domains we can choose from, let’s convert english_dataset to a pandas.DataFrame and compute the number of reviews per product category:

In [6]:
english_dataset.set_format("pandas")
english_df = english_dataset["train"][:]

english_df["product_category"].value_counts()[:20]

home                      17679
apparel                   15951
wireless                  15717
other                     13418
beauty                    12091
drugstore                 11730
kitchen                   10382
toy                        8745
sports                     8277
automotive                 7506
lawn_and_garden            7327
home_improvement           7136
pet_products               7082
digital_ebook_purchase     6749
pc                         6401
electronics                6186
office_product             5521
shoes                      5197
grocery                    4730
book                       3756
Name: product_category, dtype: int64

The most popular products in the English dataset are about household items, clothing, and wireless electronics. To stick with the Amazon theme, though, let’s focus on summarizing book reviews — after all, this is what the company was founded on! We can see two product categories that fit the bill (book and digital_ebook_purchase), so let’s filter the datasets in both languages for just these products. <br>
As we saw in Chapter 5, the **Dataset.filter()** function allows us to slice a dataset very efficiently, so we can define a simple function to do this:

In [7]:
def filter_books(example):
    return (
        example["product_category"] == "book"
        or example["product_category"] == "digital_ebook_purchase"
    )

Now when we apply this function to english_dataset and spanish_dataset, the result will contain just those rows involving the book categories. <br>
Before applying the filter, let’s switch the format of english_dataset from "pandas" back to "arrow":

In [8]:
english_dataset.reset_format()

We can then apply the filter function, and as a sanity check let’s inspect a sample of reviews to see if they are indeed about books:

In [9]:
spanish_books = spanish_dataset.filter(filter_books)
english_books = english_dataset.filter(filter_books)
show_samples(english_books)

  0%|          | 0/200 [00:00<?, ?ba/s]

  0%|          | 0/5 [00:00<?, ?ba/s]

  0%|          | 0/5 [00:00<?, ?ba/s]

  0%|          | 0/200 [00:00<?, ?ba/s]

  0%|          | 0/5 [00:00<?, ?ba/s]

  0%|          | 0/5 [00:00<?, ?ba/s]


'>> Title: I'm dissapointed.'
'>> Review: I guess I had higher expectations for this book from the reviews. I really thought I'd at least like it. The plot idea was great. I loved Ash but, it just didnt go anywhere. Most of the book was about their radio show and talking to callers. I wanted the author to dig deeper so we could really get to know the characters. All we know about Grace is that she is attractive looking, Latino and is kind of a brat. I'm dissapointed.'

'>> Title: Good art, good price, poor design'
'>> Review: I had gotten the DC Vintage calendar the past two years, but it was on backorder forever this year and I saw they had shrunk the dimensions for no good reason. This one has good art choices but the design has the fold going through the picture, so it's less aesthetically pleasing, especially if you want to keep a picture to hang. For the price, a good calendar'

'>> Title: Helpful'
'>> Review: Nearly all the tips useful and. I consider myself an intermediate to a

Okay, we can see that the reviews are not strictly about books and might refer to things like calendars and electronic applications such as OneNote. Nevertheless, the domain seems about right to train a summarization model on. <br>
Before we look at various models that are suitable for this task, we have one last bit of data preparation to do: **combining the English and Spanish reviews as a single DatasetDict object.** 🤗 Datasets provides a handy **concatenate_datasets()** function that (as the name suggests) will stack two Dataset objects on top of each other. So, to create our bilingual dataset, we’ll loop over each split, concatenate the datasets for that split, and shuffle the result to ensure our model doesn’t overfit to a single language:

In [10]:
from datasets import concatenate_datasets, DatasetDict

books_dataset = DatasetDict()

for split in english_books.keys():
    books_dataset[split] = concatenate_datasets(
        [english_books[split], spanish_books[split]]
    )
    books_dataset[split] = books_dataset[split].shuffle(seed=42)

# Peek at a few examples
show_samples(books_dataset)


'>> Title: Easy to follow!!!!'
'>> Review: I loved The dash diet weight loss Solution. Never hungry. I would recommend this diet. Also the menus are well rounded. Try it. Has lots of the information need thanks.'

'>> Title: PARCIALMENTE DAÑADO'
'>> Review: Me llegó el día que tocaba, junto a otros libros que pedí, pero la caja llegó en mal estado lo cual dañó las esquinas de los libros porque venían sin protección (forro).'

'>> Title: no lo he podido descargar'
'>> Review: igual que el anterior'


This certainly looks like a mix of English and Spanish reviews! Now that we have a training corpus, one final thing to check is the distribution of words in the reviews and their titles. This is especially important for summarization tasks, where short reference summaries in the data can bias the model to only output one or two words in the generated summaries. The plots below show the word distributions, and we can see that the titles are heavily skewed toward just 1-2 words:

![](https://huggingface.co/course/static/chapter7/review-lengths.png "review-lengths")

To deal with this, we’ll filter out the examples with very short titles so that our model can produce more interesting summaries. Since we’re dealing with English and Spanish texts, we can use a rough heuristic to split the titles on whitespace and then use our trusty **Dataset.filter()** method as follows:

In [11]:
books_dataset = books_dataset.filter(lambda x: len(x["review_title"].split()) > 2)

  0%|          | 0/18 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

Now that we’ve prepared our corpus, let’s take a look at a few possible Transformer models that one might fine-tune on it!

### Models for text summarization
If you think about it, **text summarization** is a similar sort of task to **machine translation**: we have a body of text like a review that we’d like to “translate” into a shorter version that captures the salient features of the input. Accordingly, most Transformer models for summarization adopt the encoder-decoder architecture that we first encountered in [Chapter 1](https://huggingface.co/course/chapter1/1), although there are some exceptions like the GPT family of models which can also be used for summarization in few-shot settings. **The following table lists some popular pretrained models that can be fine-tuned for summarization.**

| Transformer model |                                                                                                   Description                                                                                                   | Multilingual? |
|-------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|---------------|
| [GPT-2](https://huggingface.co/gpt2-xl)             | Although trained as an auto-regressive language model, you can make GPT-2 generate summaries by appending “TL;DR” at the end of the input text.                                                                 | ❌             |
| [PEGASUS](https://huggingface.co/google/pegasus-large)           | Uses a pretraining objective to predict masked sentences in multi-sentence texts. This pretraining objective is closer to summarization than vanilla language modeling and scores highly on popular benchmarks. | ❌             |
| [T5](https://huggingface.co/t5-base)                | A universal Transformer architecture that formulates all tasks in a text-to-text framework; e.g., the input format for the model to summarize a document is summarize: ARTICLE.                                 | ❌             |
| [mT5](https://huggingface.co/google/mt5-base)               | A multilingual version of T5, pretrained on the multilingual Common Crawl corpus (mC4), covering 101 languages.                                                                                                 | ✅             |
| [BART](https://huggingface.co/facebook/bart-base)              | A novel Transformer architecture with both an encoder and a decoder stack trained to reconstruct corrupted input that combines the pretraining schemes of BERT and GPT-2.                                       | ❌             |
| [mBART-50](https://huggingface.co/facebook/mbart-large-50)          | A multilingual version of BART, pretrained on 50 languages.                                                                                                                                                     | ✅             |

As you can see from this table, **the majority of Transformer models for summarization (and indeed most NLP tasks) are monolingual.** This is great if your task is in a “high-resource” language like English or German, but less so for the thousands of other languages in use across the world. Fortunately, **there is a class of multilingual Transformer models, like mT5 and mBART**, that come to the rescue. These models are pretrained using language modeling, but with a twist: instead of training on a corpus of one language, they are trained jointly on texts in over 50 languages at once!

**We’ll focus on *mT5*, an interesting architecture based on T5 that was pretrained in a text-to-text framework.** In **T5**, every NLP task is formulated in terms of a prompt prefix like summarize: which conditions the model to adapt the generated text to the prompt. As shown in the figure below, this makes **T5** extremely versatile, as you can solve many tasks with a single model!

![](https://huggingface.co/course/static/chapter7/t5.png "T5")

**mT5** doesn’t use prefixes, but shares much of the versatility of **T5** and has the advantage of being multilingual. <br>
Now that we’ve picked a model, let’s take a look at preparing our data for training.

### Preprocessing the data
**Our next task is to tokenize and encode our reviews and their titles.** As usual, we begin by loading the tokenizer associated with the pretrained model checkpoint. <br>
We’ll use **mt5-small** as our checkpoint so we can fine-tune the model in a reasonable amount of time:

In [12]:
from transformers import AutoTokenizer

model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/82.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/553 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/4.11M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

> 💡 In the early stages of your NLP projects, a good practice is to train a class of “small” models on a small sample of data. This allows you to debug and iterate faster toward an end-to-end workflow. Once you are confident in the results, you can always scale up the model by simply changing the model checkpoint!

Let’s test out the mT5 tokenizer on a small example:

In [13]:
inputs = tokenizer("I loved reading the Hunger Games!")
inputs

{'input_ids': [336, 259, 28387, 11807, 287, 62893, 295, 12507, 309, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Here we can see the familiar **input_ids** and **attention_mask** that we encountered in our first fine-tuning experiments back in [Chapter 3](https://huggingface.co/course/chapter3/1). <br>
Let’s decode these input IDs with the tokenizer’s **convert_ids_to_tokens()** function to see what kind of tokenizer we’re dealing with:

In [14]:
tokenizer.convert_ids_to_tokens(inputs.input_ids)

['▁I', '▁', 'loved', '▁reading', '▁the', '▁Hung', 'er', '▁Games', '!', '</s>']

The special Unicode character ▁ and end-of-sequence token </s> indicate that we’re dealing with the **SentencePiece tokenizer**, which is based on the **Unigram segmentation algorithm** discussed in [Chapter 6](https://huggingface.co/course/chapter6/1). **Unigram** is especially useful for multilingual corpora since it allows SentencePiece to be agnostic about accents, punctuation, and the fact that many languages, like Japanese, do not have whitespace characters.

To tokenize our corpus, we have to deal with a subtlety associated with summarization: because our labels are also text, it is possible that they exceed the model’s maximum context size. **This means we need to apply truncation to both the reviews and their titles** to ensure we don’t pass excessively long inputs to our model. 
The tokenizers in 🤗 Transformers provide a nifty **as_target_tokenizer()** function that allows you to tokenize the labels in parallel to the inputs. This is typically done **using a context manager inside a preprocessing function that first encodes the inputs, and then encodes the labels as a separate column.** Here is an example of such a function for mT5:

In [15]:
max_input_length = 512
max_target_length = 30


def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["review_body"], max_length=max_input_length, truncation=True
    )
    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["review_title"], max_length=max_target_length, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

Let’s walk through this code to understand what’s happening. <br>
The first thing we’ve done is define values for **max_input_length** and **max_target_length**, which set the upper limits for how long our reviews and titles can be. Since the review body is typically much larger than the title, we’ve scaled these values accordingly. Then, in the **preprocess_function()** itself we can see the reviews are first tokenized, followed by the titles with **as_target_tokenizer()**.

With **preprocess_function()**, it is then a simple matter to tokenize the whole corpus using the handy **Dataset.map()** function we’ve used extensively throughout this course:

In [16]:
tokenized_datasets = books_dataset.map(preprocess_function, batched=True)

  0%|          | 0/10 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

Now that the corpus has been preprocessed, let’s take a look at some metrics that are commonly used for summarization. <br>
As we’ll see, there is no silver bullet when it comes to measuring the quality of machine-generated text.

> 💡 You may have noticed that we used batched=True in our Dataset.map() function above. This encodes the examples in batches of 1,000 (the default) and allows you to make use of the multithreading capabilities of the fast tokenizers in 🤗 Transformers. Where possible, try using batched=True to get the most out of your preprocessing!

### Metrics for text summarization
In comparison to most of the other tasks we’ve covered in this course, measuring the performance of text generation tasks like summarization or translation is not as straightforward. For example, given a review like “I loved reading the Hunger Games”, there are multiple valid summaries, like “I loved the Hunger Games” or “Hunger Games is a great read”. Clearly, applying some sort of exact match between the generated summary and the label is not a good solution — even humans would fare poorly under such a metric, because we all have our own writing style.

For summarization, one of the most commonly used metrics is the [ROUGE score](https://en.wikipedia.org/wiki/ROUGE_(metric)) (short for Recall-Oriented Understudy for Gisting Evaluation). The basic idea behind this metric is to compare a generated summary against a set of reference summaries that are typically created by humans. To make this more precise, suppose we want to compare the following two summaries:

In [17]:
generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"

One way to compare them could be to count the number of overlapping words, which in this case would be 6. However, this is a bit crude, so instead ROUGE is based on computing the **precision** and **recall scores** for the overlap.

For ROUGE, recall measures how much of the reference summary is captured by the generated one. If we are just comparing words, recall can be calculated according to the following formula:
> Recall= Number of overlapping words / Total number of words in reference summary

For our simple example above, this formula gives a perfect **recall** of 6/6 = 1; i.e., all the words in the reference summary have been produced by the model. This may sound great, but imagine if our generated summary had been “I really really loved reading the Hunger Games all night”. This would also have perfect recall, but is arguably a worse summary since it is verbose. To deal with these scenarios we also compute the **precision**, which in the ROUGE context measures how much of the generated summary was relevant:

> Precision = Number of overlapping words / Total number of words in generated summary

Applying this to our verbose summary gives a precision of 6/10 = 0.6, which is considerably worse than the precision of 6/7 = 0.86 obtained by our shorter one. In practice, both **precision** and **recall** are usually computed, and then the **F1-score** (the harmonic mean of precision and recall) is reported. We can do this easily in 🤗 Datasets by first installing the **rouge_score** package:

In [18]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.0.4-py2.py3-none-any.whl (22 kB)
Installing collected packages: rouge-score
Successfully installed rouge-score-0.0.4


and then loading the ROUGE metric as follows:

In [19]:
from datasets import load_metric

rouge_score = load_metric("rouge")

Downloading:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

Then we can use the **rouge_score.compute()** function to calculate all the metrics at once:

In [20]:
scores = rouge_score.compute(
    predictions=[generated_summary], references=[reference_summary]
)
scores

{'rouge1': AggregateScore(low=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), mid=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), high=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923)),
 'rouge2': AggregateScore(low=Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272), mid=Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272), high=Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272)),
 'rougeL': AggregateScore(low=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), mid=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), high=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923)),
 'rougeLsum': AggregateScore(low=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), mid=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.92307692307

Whoa, there’s a lot of information in that output — what does it all mean? <br>
First, 🤗 Datasets actually computes confidence intervals for **precision**, **recall**, and **F1-score**; these are the low, mid, and high attributes you can see here. <br>
Moreover, 🤗 Datasets computes a variety of ROUGE scores which are based on different types of text granularity when comparing the generated and reference summaries. <br>
**The rouge1 variant is the overlap of unigrams — this is just a fancy way of saying the overlap of words and is exactly the metric we’ve discussed above.** To verify this, let’s pull out the mid value of our scores:

In [21]:
scores["rouge1"].mid

Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923)

Great, the precision and recall numbers match up! Now what about those other ROUGE scores? <br>
**rouge2 measures the overlap between bigrams (think the overlap of pairs of words**), <br>
**while rougeL and rougeLsum measure the longest matching sequences of words by looking for the longest common substrings in the generated and reference summaries.** <br>
The “sum” in **rougeLsum refers to the fact that this metric is computed over a whole summary**, while **rougeL is computed as the average over individual sentences.**

We’ll use these ROUGE scores to track the performance of our model, but before doing that let’s do something every good NLP practitioner should do: **create a strong, yet simple baseline!**

### Creating a strong baseline
A common baseline for text summarization is to simply take the first three sentences of an article, often called the **lead-3 baseline**. <br>
We could use **full stops** to track the sentence boundaries, but this will fail on acronyms like “U.S.” or “U.N.” — so instead we’ll use the **nltk library**, which includes a better algorithm to handle these cases. <br>
You can install the package using pip as follows:

In [22]:
!pip install nltk



and then download the punctuation rules:

In [23]:
import nltk

nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Next, we import the **sentence tokenizer** from nltk and create a simple function to extract the first three sentences in a review. <br>
The convention in text summarization is to **separate each summary with a newline**, so let’s also include this and test it on a training example:

In [24]:
from nltk.tokenize import sent_tokenize

def three_sentence_summary(text):
  return "\n".join(sent_tokenize(text)[:3])

print(three_sentence_summary(books_dataset["train"][1]["review_body"]))

I grew up reading Koontz, and years ago, I stopped,convinced i had "outgrown" him.
Still,when a friend was looking for something suspenseful too read, I suggested Koontz.
She found Strangers.


This seems to work, so let’s now implement a function that extracts these “summaries” from a dataset and computes the ROUGE scores for the baseline:

In [25]:
def evaluate_baseline(dataset, metric):
  summaries = [three_sentence_summary(text) for text in dataset["review_body"]]
  return metric.compute(predictions=summaries, references=dataset["review_title"])

We can then use this function to compute the ROUGE scores over the validation set and prettify them a bit using Pandas:

In [26]:
import pandas as pd

score = evaluate_baseline(books_dataset["validation"], rouge_score)
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_dict = dict((rn, round(score[rn].mid.fmeasure * 100, 2)) for rn in rouge_names)
rouge_dict

{'rouge1': 16.71, 'rouge2': 8.86, 'rougeL': 15.53, 'rougeLsum': 15.97}

We can see that the **rouge2 score** is significantly lower than the rest; this likely reflects the fact that review titles are typically concise and so the lead-3 baseline is too verbose. <br>
Now that we have a good baseline to work from, let’s turn our attention toward fine-tuning **mT5**!

### Fine-tuning mT5 with Keras
Fine-tuning a model for summarization is very similar to the other tasks we’ve covered in this chapter. The first thing we need to do is load the pretrained model from the **mt5-small** checkpoint. 
Since **summarization is a sequence-to-sequence task**, we can load the model with the **AutoModelForSeq2SeqLM** class, which will automatically download and cache the weights:

In [27]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/1.12G [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFMT5ForConditionalGeneration.

All the layers of TFMT5ForConditionalGeneration were initialized from the model checkpoint at google/mt5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMT5ForConditionalGeneration for predictions without further training.


> 💡 If you’re wondering why you don’t see any warnings about fine-tuning the model on a downstream task, that’s because **for sequence-to-sequence tasks we keep all the weights of the network**. Compare this to our text classification model in [Chapter 3](https://huggingface.co/course/chapter3/1), where the head of the pretrained model was replaced with a randomly initialized network.

The next thing we need to do is log in to the Hugging Face Hub. If you’re running this code in a notebook, you can do so with the following utility function:

In [28]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
[1m[31mAuthenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store[0m


which will display a widget where you can enter your credentials. Alternatively, you can run this command in your terminal and log in there:

> huggingface-cli login

Next, we need to define a data collator for our sequence-to-sequence task. Since mT5 is an encoder-decoder Transformer model, one subtlety with preparing our batches is that during decoding we need to shift the labels to the right by one. This is required to ensure that the decoder only sees the previous ground truth labels and not the current or future ones, which would be easy for the model to memorize. This is similar to how masked self-attention is applied to the inputs in a task like causal language modeling.

Luckily, 🤗 Transformers provides a **DataCollatorForSeq2Seq** collator that will dynamically pad the inputs and the labels for us. To instantiate this collator, we simply need to provide the tokenizer and model:

In [29]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

Let’s see what this collator produces when fed a small batch of examples. Since it expects a list of dicts, where each dict represents a single example in the dataset, we need to wrangle the data into the expected format before passing it to the data collator:

In [30]:
features = [tokenized_datasets["train"][i] for i in range(2)]
data_collator(features)

{'review_id': <tf.Tensor: shape=(2,), dtype=string, numpy=array([b'es_0008529', b'en_0887613'], dtype=object)>, 'product_id': <tf.Tensor: shape=(2,), dtype=string, numpy=array([b'product_es_0184753', b'product_en_0173743'], dtype=object)>, 'reviewer_id': <tf.Tensor: shape=(2,), dtype=string, numpy=array([b'reviewer_es_0276151', b'reviewer_en_0635017'], dtype=object)>, 'stars': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([2, 4], dtype=int32)>, 'review_body': <tf.Tensor: shape=(2,), dtype=string, numpy=
array([b'No est\xc3\xa1 mal, pero m\xc3\xa1s para ni\xc3\xb1os de 9 a\xc3\xb1os',
       b'I grew up reading Koontz, and years ago, I stopped,convinced i had "outgrown" him. Still,when a friend was looking for something suspenseful too read, I suggested Koontz. She found Strangers. The excitement art how good it was startled me. I was sure i had recommended something else. I ordered a copy for myself. This was a great reintroduction to an old favorite writer -- a novel full of fully 

The main thing to notice here is that the first example is longer than the second one, so the input_ids and attention_mask of the second example have been padded on the right with a [PAD] token (whose ID is 0). Similarly, we can see that the labels have been padded with -100s, to make sure the padding tokens are ignored by the loss function. And finally, we can see a new decoder_input_ids which has shifted the labels to the right by inserting a [PAD] token in the first entry.

We’re almost ready to train! We just need to convert our datasets to **tf.data.Datasets** using the data collator we defined above, and then **compile()** and **fit()** the model. <br>
First, the datasets:

In [31]:
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=8,
)
tf_eval_dataset = tokenized_datasets["validation"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=8,
)

Now, we define our training hyperparameters and compile:

In [32]:
from transformers import create_optimizer
import tensorflow as tf

# If you aren't me, you should probably fill in your own username instead
# If you -are- me, then you should probably deal with your GitHub issue backlog
username = "BatuhanYilmaz"
num_train_epochs = 8
num_train_steps = len(tf_train_dataset) * num_train_epochs
model_name = model_checkpoint.split("/")[-1]

optimizer, schedule = create_optimizer(
    init_lr=5.6e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

model.compile(optimizer=optimizer)

# Train in mixed-precision float16
tf.keras.mixed_precision.set_global_policy("mixed_float16")

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! Please ensure your labels are passed as keys in the input dict so that they are accessible to the model during the forward pass. To disable this behaviour, please pass a loss argument, or explicitly pass loss=None if you do not want your model to compute a loss.


INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: Tesla T4, compute capability 7.5


And finally, we fit the model. We use a **PushToHubCallback** to save the model to the Hub after each epoch, which will allow us to use it for inference later:

In [None]:
from transformers.keras_callbacks import PushToHubCallback

callback = PushToHubCallback(
    output_dir=f"{model_name}-finetuned-amazonbooks-en-es", tokenizer=tokenizer
)

model.fit(
    tf_train_dataset, validation_data=tf_eval_dataset, callbacks=[callback], epochs=8

We got some loss values during training, but really we’d like to see the ROUGE metrics we computed earlier. To get those metrics, we’ll need to generate outputs from the model and convert them to strings. Let’s build some lists of labels and predictions for the ROUGE metric to compare (note that if you get import errors for this section, you may need to!pip install tqdm):

In [34]:
from tqdm import tqdm
import numpy as np

all_preds = []
all_labels = []
for batch in tqdm(tf_eval_dataset):
    predictions = model.generate(**batch)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = batch["labels"].numpy()
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]
    all_preds.extend(decoded_preds)
    all_labels.extend(decoded_labels)

100%|██████████| 30/30 [01:21<00:00,  2.73s/it]


Once we have our lists of label and prediction strings, computing the ROUGE score is easy:

In [35]:
result = rouge_score.compute(
    predictions=decoded_preds, references=decoded_labels, use_stemmer=True
)
result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
{k: round(v, 4) for k, v in result.items()}

{'rouge1': 30.9524, 'rouge2': 24.6032, 'rougeL': 30.9524, 'rougeLsum': 30.9524}

### Using your fine-tuned model
Once you’ve pushed the model to the Hub, you can play with it either via the inference widget or with a pipeline object, as follows:

In [36]:
from transformers import pipeline

# Change the username to your Hub profile
hub_model_id = "huggingface-course/mt5-small-finetuned-amazon-en-es"
summarizer = pipeline("summarization", model=hub_model_id)

Downloading:   0%|          | 0.00/682 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.12G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/398 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/4.11M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/7.94M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

We can feed some examples from the test set (which the model has not seen) to our pipeline to get a feel for the quality of the summaries. First let’s implement a simple function to show the review, title, and generated summary together:

In [44]:
def print_summary(idx):
    review = books_dataset["test"][idx]["review_body"]
    title = books_dataset["test"][idx]["review_title"]
    summary = summarizer(books_dataset["test"][idx]["review_body"])[0]["summary_text"]
    print(f"'>>> Review: {review}'")
    print(f"\n'>>> Title: {title}'")
    print(f"\n'>>> Summary: {summary}'")

Let’s take a look at one of the English examples we get:

In [46]:
print_summary(50)

'>>> Review: Overall this is an amazing book! Only lost one star due to poor packaging, but besides a few bent and out of place pages this was great!'

'>>> Title: Good product, needs better packaging'

'>>> Summary: Overall this is an amazing book! Only lost one star due to poor packaging'


This is not too bad! We can see that our model has actually been able to perform abstractive summarization by augmenting parts of the review with new words. And perhaps the coolest aspect of our model is that it is bilingual, so we can also generate summaries of Spanish reviews:

In [47]:
print_summary(0)

'>>> Review: Es una trilogia que se hace muy facil de leer. Me ha gustado, no me esperaba el final para nada'

'>>> Title: Buena literatura para adolescentes'

'>>> Summary: Muy facil de leer'


The summary translates into “Very easy to read” in English, which we can see in this case was extracted directly from the review. Nevertheless, this shows the versatility of the mT5 model and has given you a taste of what it’s like to deal with a multilingual corpus!

Next, we’ll turn our attention to a slightly more complex task: training a language model from scratch.