## Summarization

This section looks at how Transformer models can be used to condense long documents into summaries, a task known as text summarization. It is one of the most challenging NLP tasks because it requires a range of abilities, such as understanding long passages and generating coherent text that captures the main topics of a document. When done well, text summarization can speed up various business processes by relieving domain experts of the burden of reading long documents in detail.

Along the way, we'll train a bilingual Transformer model that summarizes customer reviews in English and Spanish. By the end of this section, you will have a model that can summarize customer reviews in both languages.

### Preparing a multilingual corpus

To create a bilingual text summarization model, we'll utilize the [Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi), a collection of Amazon product reviews in six languages. Traditionally employed for evaluating multilingual classifiers, this corpus proves valuable for our summarization task due to the inclusion of short titles alongside each review. These titles will serve as target summaries for our model. To initiate the process, we'll download the English and Spanish subsets from the Hugging Face Hub.

In [None]:
from datasets import load_dataset

spanish_dataset = load_dataset("amazon_reviews_multi", "es")
english_dataset = load_dataset("amazon_reviews_multi", "en")
english_dataset

The Multilingual Amazon Reviews Corpus contains 200,000 reviews for the training split in each language and 5,000 reviews for each of the validation and test splits. The relevant review information is stored in the `review_body` and `review_title` columns. To examine a few examples, we can create a simple function that retrieves a random sample from the training set using the techniques introduced in Chapter 5.

In [None]:
def show_samples(dataset, num_samples=3, seed=42):
    sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    for example in sample:
        print(f"\n'>> Title: {example['review_title']}'")
        print(f"'>> Review: {example['review_body']}'")


show_samples(english_dataset)

This sample shows how varied online reviews are, ranging from positive to negative and everything in between. Although the example with the "meh" title is not very informative, the other titles look like decent summaries of the reviews themselves. However, training a summarization model on all 400,000 reviews would take far too long on a single GPU, so we'll focus on generating summaries for a single domain of products. To get a feel for the domains available, let's convert `english_dataset` to a `pandas.DataFrame` and count the number of reviews per product category:



In [None]:
english_dataset.set_format("pandas")
english_df = english_dataset["train"][:]
# Show counts for top 20 products
english_df["product_category"].value_counts()[:20]

The most popular products in the English dataset are household items, clothing, and wireless electronics. To stick with the Amazon theme, though, let's focus on summarizing book reviews; after all, this is what the company was founded on. Two product categories fit the bill (`book` and `digital_ebook_purchase`), so let's filter the datasets in both languages for just these categories. A simple function does the job:



In [None]:
def filter_books(example):
    return (
        example["product_category"] == "book"
        or example["product_category"] == "digital_ebook_purchase"
    )
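
To see what the filter keeps, here is a minimal sketch that applies the same category check to a few hand-written rows. The rows in `toy_examples` are hypothetical and only mimic the corpus schema; the membership test is an equivalent rewrite of the two comparisons above.

```python
# Hypothetical rows that mimic the corpus schema (not taken from the dataset)
toy_examples = [
    {"product_category": "book", "review_title": "Great read"},
    {"product_category": "home", "review_title": "Broke after a week"},
    {"product_category": "digital_ebook_purchase", "review_title": "Loved it"},
]


def filter_books(example):
    # Keep an example only if it belongs to one of the two book categories
    return example["product_category"] in {"book", "digital_ebook_purchase"}


kept_titles = [ex["review_title"] for ex in toy_examples if filter_books(ex)]
print(kept_titles)
```

Only the two book-related rows survive, which is the behavior we want from `Dataset.filter()` on the real data.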

Applying this function to `english_dataset` and `spanish_dataset` keeps only the rows in the two book categories. Before filtering, however, we need to switch the format of `english_dataset` back from `"pandas"` to the default `"arrow"` format. While the `"pandas"` format is active, the dataset returns `pandas` objects rather than plain Python dictionaries, and `filter_books()` is written for the latter.

In other words, the `"pandas"` format is convenient for exploration and analysis, while the default Arrow-backed format is the one to use for `filter()`, `map()`, and the rest of the processing pipeline.



In [None]:
english_dataset.reset_format()

We can then apply the filter function, and as a sanity check let’s inspect a sample of reviews to see if they are indeed about books:

In [None]:
spanish_books = spanish_dataset.filter(filter_books)
english_books = english_dataset.filter(filter_books)
show_samples(english_books)

The reviews are not strictly about books: some refer to things like calendars and electronic applications such as OneNote. Nevertheless, the domain seems about right for training a summarization model.

Before looking at suitable models, one last bit of data preparation remains: combining the English and Spanish reviews into a single `DatasetDict` object. 🤗 Datasets provides a `concatenate_datasets()` function that, as the name suggests, stacks two `Dataset` objects on top of each other. To create our bilingual dataset, we'll loop over each split (train, validation, test), concatenate the English and Spanish datasets for that split, and shuffle the result. Shuffling ensures the model sees both languages throughout training rather than overfitting to one of them.

In [None]:
from datasets import concatenate_datasets, DatasetDict

books_dataset = DatasetDict()

for split in english_books.keys():
    books_dataset[split] = concatenate_datasets(
        [english_books[split], spanish_books[split]]
    )
    books_dataset[split] = books_dataset[split].shuffle(seed=42)

# Peek at a few examples
show_samples(books_dataset)

This certainly looks like a mix of English and Spanish reviews. Now that we have a training corpus, one final thing to check is the distribution of words in the reviews and their titles. This matters for summarization because short reference summaries in the data can bias the model toward generating summaries of only a word or two. The plots below show the word distributions; the titles are heavily skewed toward just 1-2 words.

![](2023-12-16-13-46-22.png)

To address this, we'll exclude instances with extremely short titles, allowing our model to generate more engaging summaries. Given the mix of English and Spanish texts, we can apply a basic approach to split the titles using whitespace. Then, employing the `Dataset.filter()` method, we can proceed in the following manner:

In [None]:
books_dataset = books_dataset.filter(lambda x: len(x["review_title"].split()) > 2)
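
As a quick illustration of the whitespace-based rule, the sketch below applies the same word-count test to a handful of made-up titles. The titles and the helper name `has_enough_words` are hypothetical; `min_words=3` is equivalent to the `> 2` condition used above.

```python
# Hypothetical titles in both languages, for illustration only
toy_titles = ["Meh", "Great book", "Un libro muy recomendable", "Not what I expected"]


def has_enough_words(title, min_words=3):
    # Whitespace splitting works for both English and Spanish titles
    return len(title.split()) >= min_words


kept = [title for title in toy_titles if has_enough_words(title)]
print(kept)
```

Titles with one or two words are dropped, while longer titles in either language are kept.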

Now that our corpus is ready, let's explore some potential Transformer models suitable for fine-tuning on this data!

### Models for text summarization

Summarization resembles machine translation: in both cases a model reads an input sequence and generates a new one, here a shorter text that preserves the salient points of the input. Accordingly, most Transformer-based summarizers use the encoder-decoder architecture, although there are exceptions such as the GPT family of models, which can be used for summarization in some settings. The table below lists popular pretrained models that can be fine-tuned for summarization.


| Transformer model | Description | Multilingual? |
|-------------------|-------------|---------------|
| GPT-2             | Although trained as an auto-regressive language model, you can make GPT-2 generate summaries by appending “TL;DR” at the end of the input text. | ❌ |
| PEGASUS           | Uses a pretraining objective to predict masked sentences in multi-sentence texts. This pretraining objective is closer to summarization than vanilla language modeling and scores highly on popular benchmarks. | ❌ |
| T5                | A universal Transformer architecture that formulates all tasks in a text-to-text framework; e.g., the input format for the model to summarize a document is `summarize: ARTICLE`. | ❌ |
| mT5               | A multilingual version of T5, pretrained on the multilingual Common Crawl corpus (mC4), covering 101 languages. | ✅ |
| BART              | A novel Transformer architecture with both an encoder and a decoder stack trained to reconstruct corrupted input that combines the pretraining schemes of BERT and GPT-2. | ❌ |
| mBART-50          | A multilingual version of BART, pretrained on 50 languages. | ✅ |


As the table shows, most Transformer models for summarization (and indeed for most NLP tasks) are monolingual. This is fine if your task is in a "high-resource" language like English or German, but less so for the many other languages in use across the world. Fortunately, there is a class of multilingual Transformer models, like mT5 and mBART, that address this. They are pretrained with a twist: instead of training on a corpus in a single language, they are trained jointly on texts in over 50 languages at once!

We'll focus on mT5, an architecture based on T5 that operates within a text-to-text framework. In T5, every NLP task is formulated in terms of a prompt prefix such as `summarize:`, which conditions the model to adapt the generated text to the prompt. This makes T5 extremely versatile, since a single model can handle many tasks, as shown in the figure below.

![](2023-12-16-14-05-59.png)

We've picked mT5 because it is multilingual and shares T5's versatile text-to-text framing. Next comes preparing the data for training.

Our first task is to tokenize and encode the reviews and their titles. As usual, we begin by loading the tokenizer associated with the pretrained model checkpoint. We'll use `mt5-small` as our checkpoint so that the model can be fine-tuned in a reasonable amount of time:

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

💡 In the initial phases of NLP projects, it's beneficial to train "small" models using a limited dataset. This approach facilitates quicker debugging and iteration towards establishing an end-to-end workflow. Once you've gained confidence in the outcomes, scaling up the model is straightforward—simply by switching to a different model checkpoint!

Let’s test out the mT5 tokenizer on a small example:

In [None]:
inputs = tokenizer("I loved reading the Hunger Games!")
inputs

Here we can see the familiar `input_ids` and `attention_mask` that we encountered in our first fine-tuning experiments in Chapter 3. Let's decode these input IDs with the tokenizer's `convert_ids_to_tokens()` function to see what kind of tokenizer we're dealing with:

In [None]:
tokenizer.convert_ids_to_tokens(inputs.input_ids)

The presence of the special Unicode character ▁ and the end-of-sequence token </s> indicates that the tokenizer in use is SentencePiece, which relies on the Unigram segmentation algorithm. Unigram proves particularly advantageous for multilingual corpora as it remains agnostic to accents, punctuation, and the absence of whitespace characters in certain languages like Japanese.
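
To make the role of the ▁ character concrete, here is a minimal sketch that rebuilds a sentence from SentencePiece-style pieces. The token list is hand-written for illustration and may differ from what the mT5 tokenizer actually produces; `detokenize` is a hypothetical helper, not a library function.

```python
# Hand-written SentencePiece-style tokens; the real mT5 output may differ
tokens = ["▁I", "▁loved", "▁reading", "▁the", "▁Hung", "er", "▁Games", "!", "</s>"]


def detokenize(pieces, eos_token="</s>"):
    # Drop the end-of-sequence marker, glue the pieces together,
    # and turn every ▁ back into the space it stands for
    text = "".join(piece for piece in pieces if piece != eos_token)
    return text.replace("▁", " ").strip()


print(detokenize(tokens))
```

Because whitespace is encoded as an ordinary symbol, the original text can be recovered without any language-specific rules about where words begin and end.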

Regarding tokenization for summarization, there's a crucial consideration: the labels, being textual data, might exceed the model's maximum context size. Hence, truncating both the reviews and their corresponding titles becomes essential to prevent overly lengthy inputs from being processed by the model. Fortunately, 🤗 Transformers' tokenizers offer a convenient `text_target` argument, allowing parallel tokenization of the labels and inputs. Below is an example demonstrating how inputs and targets are processed for mT5:

In [None]:
max_input_length = 512
max_target_length = 30


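def preprocess_function(examples):
    # Tokenize the review bodies (inputs), truncating to the input limit
    model_inputs = tokenizer(
        examples["review_body"],
        max_length=max_input_length,
        truncation=True,
    )
    # Tokenize the titles as targets via text_target, truncating to the target limit
    labels = tokenizer(
        text_target=examples["review_title"],
        max_length=max_target_length,
        truncation=True,
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs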

We begin by setting values for `max_input_length` and `max_target_length`, determining the maximum allowable lengths for the reviews and titles, respectively. Given that the review body tends to be longer than the title, we've scaled these values accordingly.

The `preprocess_function()` allows straightforward tokenization of the entire corpus using the `Dataset.map()` function, a utility extensively utilized throughout this course. This function ensures that the reviews and titles are tokenized while considering the maximum lengths specified for both inputs (`max_input_length`) and targets (`max_target_length`).

In [None]:
tokenized_datasets = books_dataset.map(preprocess_function, batched=True)

Now that the corpus has undergone preprocessing, let's delve into several metrics commonly employed for summarization. As we'll discover, there isn't a one-size-fits-all metric for gauging the quality of machine-generated text.

💡 It's worth noting that in the `Dataset.map()` function earlier, we utilized `batched=True`. This parameter encodes examples in batches, typically of size 1,000 by default, harnessing the multithreading capabilities of fast tokenizers within 🤗 Transformers. Whenever feasible, employing `batched=True` can optimize preprocessing and leverage the efficiency of tokenization processes!
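
The sketch below illustrates what batched processing means for the function you pass in: it receives a dictionary of lists (one list per column) rather than a single example. `map_in_batches` is a simplified, hypothetical stand-in written for this illustration; it is not how `Dataset.map()` is implemented and it only keeps the columns the function returns.

```python
def uppercase_titles(batch):
    # With batched processing the function receives a dict of lists
    # (one list per column) instead of a single example
    return {"review_title": [title.upper() for title in batch["review_title"]]}


def map_in_batches(columns, function, batch_size=1000):
    # Hypothetical stand-in for Dataset.map(..., batched=True):
    # slice every column into chunks and call `function` once per chunk
    num_rows = len(next(iter(columns.values())))
    output = {}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start : start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            output.setdefault(name, []).extend(values)
    return output


toy_columns = {"review_title": ["great read", "muy bueno", "not for me"]}
result = map_in_batches(toy_columns, uppercase_titles, batch_size=2)
print(result)
```

Here the function is called twice (once with two rows, once with one), which is why `preprocess_function()` indexes `examples["review_body"]` as a list.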

### Metrics for text summarization

When it comes to tasks like summarization or translation, measuring performance isn't as straightforward compared to most other tasks we've covered in this course. Consider a review such as "I loved reading the Hunger Games"; multiple valid summaries like "I loved the Hunger Games" or "Hunger Games is a great read" exist. It's evident that exact matching between the generated summary and the label isn't a suitable solution—humans themselves would struggle under such a metric due to individual writing styles.

For summarization tasks, the ROUGE score (short for Recall-Oriented Understudy for Gisting Evaluation) is one of the most frequently used metrics. This metric's core concept involves comparing a generated summary against a set of reference summaries typically crafted by humans. 

#### Brief note on ROUGE Score
ROUGE is a family of metrics rather than a single score. It is used to evaluate the quality of machine-generated text, especially in summarization and machine translation.

It's designed to measure the similarity between automatically generated summaries or translations and reference (human-created) summaries or translations. The metrics in the ROUGE family primarily focus on recall, evaluating how much of the information in the generated text overlaps with the reference text.

ROUGE computes various scores, such as ROUGE-N, ROUGE-L, and ROUGE-W, among others. These scores capture different aspects of overlap between the generated and reference texts:

- **ROUGE-N:** Measures the overlap of n-grams (sequences of n words) between the generated and reference texts. ROUGE-1 focuses on single words (unigrams), ROUGE-2 on pairs of words (bigrams), and so on.

- **ROUGE-L:** Computes the longest common subsequence between the generated and reference texts, considering sentence-level overlap and accounting for the order of words.

- **ROUGE-W:** Considers weighted LCS (Longest Common Subsequence) to give higher importance to consecutive matches.

These scores help assess the quality of machine-generated text by quantifying how much content overlaps with the reference text. Higher ROUGE scores indicate better agreement between the generated and reference text, suggesting better quality in tasks like summarization or translation.

For example, if a model generates a summary with a high ROUGE score, it implies that the summary captures important information present in the reference summary, indicating a better quality summary. It's an essential evaluation metric used in assessing the performance of models in text-related tasks.

To illustrate this in detail, let's consider comparing the following two summaries:

In [None]:
generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"

A straightforward method to compare these summaries might involve counting the number of overlapping words, which in this case would be 6. However, this approach is somewhat rudimentary. Hence, the ROUGE metric relies on calculating precision and recall scores specifically for the overlap between the generated summary and the reference summaries.

🙋 No need to worry if precision and recall are new concepts for you! We'll walk through explicit examples together to clarify these metrics. Typically encountered in classification tasks, precision and recall hold specific definitions in that context. If you're interested in understanding how precision and recall are defined in classification, exploring [guides](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html) from libraries like scikit-learn can be quite helpful.

In the context of ROUGE, recall gauges how much of the reference summary is captured by the generated one. If we're simply comparing words, recall can be calculated using the following formula:

![](2023-12-16-14-38-24.png)

For our previous straightforward example, applying this formula yields a perfect recall of \( \frac{6}{6} = 1 \), indicating that all words in the reference summary have been produced by the model.

However, consider a scenario where the generated summary is "I really really loved reading the Hunger Games all night." Surprisingly, this also achieves perfect recall (6/6), yet it's arguably a worse summary due to its verbosity. To address such cases, we compute precision, which in the ROUGE context measures how relevant the generated summary is:

![](2023-12-16-14-39-02.png)

Applying this to our verbose summary results in a precision of 6/10 = 0.6, significantly worse than the precision of 6/7 = 0.86 obtained by the shorter one.
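
The arithmetic above can be reproduced with a few lines of plain Python. This is a minimal sketch of unigram precision and recall, assuming lowercasing and whitespace tokenization; the function name `rouge1` is chosen for this illustration, and real ROUGE implementations tokenize more carefully. It also returns the F1-score, the harmonic mean of precision and recall.

```python
from collections import Counter


def rouge1(generated, reference):
    # Lowercase and split on whitespace; real implementations tokenize more carefully
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((gen_counts & ref_counts).values())
    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


reference = "I loved reading the Hunger Games"
concise = rouge1("I absolutely loved reading the Hunger Games", reference)
verbose = rouge1("I really really loved reading the Hunger Games all night", reference)
print([round(x, 2) for x in concise])  # [0.86, 1.0, 0.92]
print([round(x, 2) for x in verbose])  # [0.6, 1.0, 0.75]
```

Both summaries reach perfect recall, but the verbose one is penalized on precision and therefore on F1.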

In practice, both precision and recall are computed, and the F1-score (their harmonic mean) is what usually gets reported. The 🤗 Evaluate library makes this easy; its ROUGE metric relies on the `rouge_score` package, so we install both first:

In [None]:
!pip install evaluate rouge_score

and then loading the ROUGE metric as follows:

In [None]:
import evaluate

rouge_score = evaluate.load("rouge")

The `compute()` method of the metric object we just loaded (stored in the variable `rouge_score`) calculates all the ROUGE variants in one go:

In [None]:
scores = rouge_score.compute(
    predictions=[generated_summary], references=[reference_summary]
)
scores

The output contains one entry per ROUGE variant. In recent releases of 🤗 Evaluate, each entry is a single number: the F1-score for that variant. Older releases of the metric instead returned precision, recall, and F1-score together with bootstrap confidence intervals, exposed through `low`, `mid`, and `high` attributes; there you would read the score from `scores["rouge1"].mid.fmeasure`.

The variants differ in the granularity at which the generated and reference summaries are compared. For instance, `rouge1` is the overlap of unigrams, which is the word overlap we computed by hand earlier.

To verify this, let's pull out the `rouge1` value. With a precision of 6/7 and a recall of 1, the F1-score should be roughly 0.92.

In [None]:
# A plain float F1-score in recent 🤗 Evaluate releases; use .mid.fmeasure on older ones
scores["rouge1"]

The other ROUGE scores capture different aspects of the overlap:

- **`rouge2`** measures the overlap between bigrams, that is, pairs of consecutive words in the generated and reference summaries.
- **`rougeL`** and **`rougeLsum`** are both based on the longest common subsequence (LCS) of words shared by the generated and reference summaries.
    - **`rougeL`** ignores newlines and computes the LCS over each text treated as a single sequence.
    - **`rougeLsum`** treats newlines as sentence boundaries and combines the LCS matches found sentence by sentence into a summary-level score, which is why summaries are passed to it with one sentence per line.

Each of these ROUGE variants offers a distinct perspective on how well the generated summary aligns with the reference summaries, capturing nuances in the overlap and sequence matching at various text granularities.
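
To make the bigram and LCS variants concrete, here is a minimal sketch that computes them by hand for the two example summaries used earlier. It assumes lowercasing and whitespace tokenization and a single sentence per summary; the helper names `ngrams` and `lcs_length` are illustrative and not part of any ROUGE package.

```python
from collections import Counter


def ngrams(tokens, n):
    # All contiguous runs of n tokens
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def lcs_length(a, b):
    # Classic dynamic-programming table for the longest common subsequence
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, token_a in enumerate(a, start=1):
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


generated = "I absolutely loved reading the Hunger Games".lower().split()
reference = "I loved reading the Hunger Games".lower().split()

# ROUGE-2: clipped overlap of bigrams
bigram_overlap = sum((Counter(ngrams(generated, 2)) & Counter(ngrams(reference, 2))).values())
rouge2_recall = bigram_overlap / len(ngrams(reference, 2))
rouge2_precision = bigram_overlap / len(ngrams(generated, 2))

# ROUGE-L: longest common subsequence, which respects word order but allows gaps
lcs = lcs_length(generated, reference)
rougeL_recall = lcs / len(reference)
rougeL_precision = lcs / len(generated)

print(bigram_overlap, round(rouge2_precision, 2), rouge2_recall)
print(lcs, round(rougeL_precision, 2), rougeL_recall)
```

Inserting "absolutely" breaks one reference bigram, so the bigram recall drops to 4/5, whereas the LCS still covers the whole reference because a subsequence may skip words.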

Before diving into tracking our model's performance using ROUGE scores, let's execute a fundamental yet crucial step in NLP: establishing a robust yet straightforward baseline!

### Creating a strong baseline

A prevalent baseline for text summarization involves extracting the initial three sentences of an article, often referred to as the lead-3 baseline. While tracking sentence boundaries using full stops might falter with acronyms like "U.S." or "U.N.," we'll opt for a more robust solution provided by the nltk library. To install the package, you can use pip with the following command:

In [None]:
!pip install nltk

and then download the punctuation rules:

In [None]:
import nltk

# Newer NLTK releases may ask for the "punkt_tab" resource instead
nltk.download("punkt")

Following that, we'll import the sentence tokenizer from nltk and craft a straightforward function to extract the initial three sentences from a review. In text summarization, it's customary to delimit each summary with a newline character. Let's incorporate this practice and test it on a training example to ensure its functionality:

In [None]:
from nltk.tokenize import sent_tokenize


def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])


print(three_sentence_summary(books_dataset["train"][1]["review_body"]))
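
To see why splitting on full stops is fragile, consider this minimal sketch with a made-up review. The text is hypothetical; the point is only that a naive split treats the abbreviation "U.S." as a sentence boundary.

```python
text = "The U.S. edition arrived late. The cover was damaged. I still enjoyed it."

# Naive approach: treat every ". " as a sentence boundary
naive_sentences = text.split(". ")
print(naive_sentences)
# ['The U.S', 'edition arrived late', 'The cover was damaged', 'I still enjoyed it.']
```

The naive split yields four fragments for three sentences, so a lead-3 summary built on it would cut the review off early. A trained sentence tokenizer such as the one in nltk is designed to handle abbreviations like this.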

Next, let's implement a function that extracts these lead-3 "summaries" from a dataset and computes the ROUGE scores for the baseline:

In [None]:
def evaluate_baseline(dataset, metric):
    summaries = [three_sentence_summary(text) for text in dataset["review_body"]]
    return metric.compute(predictions=summaries, references=dataset["review_title"])

Using this function, we can compute the ROUGE scores over the validation set. Afterward, we can utilize Pandas to present these scores in a more organized and readable format, enhancing their clarity and comprehensibility.

In [None]:
import pandas as pd

score = evaluate_baseline(books_dataset["validation"], rouge_score)
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
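# Recent 🤗 Evaluate releases return each score as a plain float F-measure;
# on older releases that return AggregateScore objects, use score[rn].mid.fmeasure
rouge_dict = {rn: round(score[rn] * 100, 2) for rn in rouge_names}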
rouge_dict

The `rouge2` score is markedly lower than the others. This likely reflects the fact that review titles are typically concise, so the lead-3 baseline is too verbose and shares few bigrams with them. With a baseline in place, let's turn to fine-tuning mT5!

### Fine-tuning mT5 with the Trainer API

Fine-tuning a model for summarization follows a process similar to the other tasks we've explored. Initially, we start by loading the pretrained model from the `mt5-small` checkpoint. As summarization is a sequence-to-sequence task, we'll use the `AutoModelForSeq2SeqLM` class to load the model, allowing it to automatically fetch and cache the necessary weights:

In [None]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

The subsequent step involves logging in to the Hugging Face Hub. If you're executing this code in a notebook, you can utilize the following utility function to achieve this:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

which will display a widget where you can enter your credentials. Alternatively, you can run this command in your terminal and log in there:

In [None]:
huggingface-cli login

To compute ROUGE scores during training, we'll require the generation of summaries. Fortunately, 🤗 Transformers offers specialized classes such as `Seq2SeqTrainingArguments` and `Seq2SeqTrainer` designed to handle this task seamlessly. To observe this in action, let's initiate the definition of hyperparameters and other essential arguments for our experiments:

In [None]:
from transformers import Seq2SeqTrainingArguments

batch_size = 8
num_train_epochs = 8
# Show the training loss with every epoch
logging_steps = len(tokenized_datasets["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned-amazon-en-es",
    evaluation_strategy="epoch",
    learning_rate=5.6e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
    # push_to_hub=True,
)

Here, `predict_with_generate=True` tells the trainer to generate summaries during evaluation so that ROUGE scores can be computed for each epoch. As discussed in Chapter 1, the decoder performs inference by predicting tokens one at a time, and this is what the model's `generate()` method implements.

Several default hyperparameters have been adjusted, including the learning rate, number of epochs, and weight decay. Additionally, the `save_total_limit` option has been set to retain only up to 3 checkpoints during training to conserve storage space, considering that even the "small" version of mT5 utilizes around a GB of hard drive space.

The `push_to_hub=True` argument enables pushing the trained model to the Hub after completion of training. This action creates a repository under your user profile, specified by `output_dir`. It's noteworthy that you can define the repository's name using the `hub_model_id` argument, especially when pushing to an organization. For instance, when pushing the model to the huggingface-course organization, we appended `hub_model_id="huggingface-course/mt5-finetuned-amazon-en-es"` to Seq2SeqTrainingArguments.

The subsequent step involves providing the trainer with a `compute_metrics()` function to evaluate the model during training. For summarization tasks, this process requires more than simply applying `rouge_score.compute()` on the model's predictions. Instead, we must decode the outputs and labels into text before computing the ROUGE scores. The function below accomplishes this task, utilizing the `sent_tokenize()` function from nltk to separate the summary sentences with newlines:

In [None]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Decode generated summaries into text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    # Decode reference summaries into text
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # ROUGE expects a newline after each sentence
    decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]
    # Compute ROUGE scores
    result = rouge_score.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
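    # Recent 🤗 Evaluate releases return float F-measures; scale them to percentages
    # (on older releases that return AggregateScore objects, use value.mid.fmeasure)
    result = {key: value * 100 for key, value in result.items()}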
    return {k: round(v, 4) for k, v in result.items()}
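
The `-100` replacement inside `compute_metrics()` is easiest to understand on a toy array. In this minimal sketch the label IDs are made up, and the pad ID of 0 is an assumption taken from the tokenizer behavior described in this section.

```python
import numpy as np

pad_token_id = 0  # assumed pad ID, as described for the mT5 tokenizer in this section
# Hypothetical label IDs; -100 marks padded positions ignored by the loss
labels = np.array([[259, 1042, 1, -100, -100], [3921, 287, 7505, 259, 1]])

# Positions equal to -100 cannot be decoded back into text,
# so swap them for the pad ID before calling batch_decode()
decodable = np.where(labels != -100, labels, pad_token_id)
print(decodable)
```

After the swap, every ID is a valid vocabulary entry, and `skip_special_tokens=True` removes the padding when the labels are decoded.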

The next step involves defining a data collator specifically tailored for our sequence-to-sequence task. When working with an encoder-decoder Transformer model like mT5, a crucial aspect in batch preparation is the shifting of labels to the right by one during decoding. This shift is necessary to ensure that the decoder only receives the previous ground truth labels, preventing it from accessing current or future labels, which could lead to memorization by the model. This process aligns with how masked self-attention is applied to inputs in tasks like [causal language modeling](https://huggingface.co/course/chapter7/6).

Fortunately, 🤗 Transformers offers a `DataCollatorForSeq2Seq` collator designed to handle dynamic padding for both inputs and labels. Instantiating this collator involves providing the tokenizer and model as arguments:

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

Before passing a small batch of examples to the collator, let's remove the columns containing strings because the collator won't be able to pad these elements appropriately:

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(
    books_dataset["train"].column_names
)

To align with the collator's expectations, which require a list of dictionaries where each dictionary represents a single example in the dataset, we'll need to format the data accordingly before passing it to the data collator:

In [None]:
features = [tokenized_datasets["train"][i] for i in range(2)]
data_collator(features)

The main thing to notice is that the two examples differ in length. As a result, the `input_ids` and `attention_mask` of the second example have been right-padded with the `<pad>` token (ID 0). Similarly, the `labels` have been padded with `-100` so that these positions are ignored by the loss function. Finally, there is a new `decoder_input_ids` field, which contains the labels shifted to the right by inserting a `<pad>` token as the first entry.
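
The label padding and right shift can be sketched in plain Python. This is an illustration of the idea, not the collator's actual implementation: the helper names `pad_labels` and `shift_right` are chosen for this example, the label IDs are made up, and using ID 0 as both the pad token and the decoder start token is an assumption based on the T5 family's convention.

```python
def pad_labels(batch_labels, ignore_index=-100):
    # Pad every label sequence to the longest one in the batch with -100
    longest = max(len(labels) for labels in batch_labels)
    return [labels + [ignore_index] * (longest - len(labels)) for labels in batch_labels]


def shift_right(labels, decoder_start_token_id=0, pad_token_id=0, ignore_index=-100):
    # Prepend the start token, drop the last label, and replace -100 with the pad ID
    shifted = [decoder_start_token_id] + labels[:-1]
    return [pad_token_id if token == ignore_index else token for token in shifted]


# Hypothetical label IDs; 1 stands for the end-of-sequence token
batch_labels = pad_labels([[7483, 259, 2364, 1], [3392, 1]])
decoder_input_ids = [shift_right(labels) for labels in batch_labels]
print(batch_labels)       # [[7483, 259, 2364, 1], [3392, 1, -100, -100]]
print(decoder_input_ids)  # [[0, 7483, 259, 2364], [0, 3392, 1, 0]]
```

At each position the decoder input holds the previous label, so the model predicts token *t* while seeing only the tokens before it.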

With all these elements in place, we have everything essential for training! Now, the next step involves simply instantiating the trainer using the standard arguments:

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

and launch our training run:

In [None]:
trainer.train()

During the training process, you'll notice the training loss decreasing while the ROUGE scores gradually increase with each epoch, reflecting the model's improving performance. Upon completion of training, you can access the final ROUGE scores by executing `Trainer.evaluate()`:

In [None]:
trainer.evaluate()

The obtained scores indicate that our model has significantly surpassed the performance of our lead-3 baseline — an excellent achievement! The final step involves pushing the model weights to the Hub, accomplished as follows:

In [None]:
# trainer.push_to_hub(commit_message="Training complete", tags="summarization")

Executing this command will save the checkpoint and configuration files to the specified output directory before uploading all files to the Hub. By specifying the `tags` argument, we also ensure that the Hub displays a summarization pipeline widget instead of the default text generation one associated with the mT5 architecture. For more information on model tags, see the 🤗 [Hub documentation](https://huggingface.co/docs/hub/main#how-is-a-models-type-of-inference-api-and-widget-determined). The output of `trainer.push_to_hub()` is a URL to the Git commit hash, so you can easily see the changes that were made to the model repository.

To conclude this section, let's explore fine-tuning mT5 using the low-level features provided by 🤗 Accelerate.

### Fine-tuning mT5 with 🤗 Accelerate

Fine-tuning our model with 🤗 Accelerate mirrors the process we encountered in the text classification example from Chapter 3. However, there are a few distinctions, notably the necessity to explicitly generate summaries during training and define the process for computing the ROUGE scores. Unlike the `Seq2SeqTrainer`, which handled generation during training, we'll have to manage these aspects manually within 🤗 Accelerate. Let's explore how we can incorporate these requirements using 🤗 Accelerate!

#### Preparing everything for training

To begin, we'll create a `DataLoader` for each split in our dataset. Since PyTorch dataloaders expect batches of tensors, we must set the format of our datasets to `"torch"`:

In [None]:
tokenized_datasets.set_format("torch")

With our datasets now structured to contain tensors, the subsequent step involves instantiating the `DataCollatorForSeq2Seq` once more. To achieve this, we'll need a fresh version of the model. Let's proceed by reloading it from our cache:

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

We can then instantiate the data collator and use this to define our dataloaders:

In [None]:
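from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq

# Re-create the collator so it is tied to the freshly loaded model
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

batch_size = 8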
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=batch_size,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=batch_size
)

Continuing with our setup, the subsequent step involves defining the optimizer we intend to use. Similar to our other examples, we'll opt for AdamW, known for its effectiveness across various problem domains:

In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

Finally, we'll provide our model, optimizer, and dataloaders to the `accelerator.prepare()` method to finalize the preparations:

In [None]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Now that our objects are prepared, there are three remaining tasks to complete:

1. Define the learning rate schedule.
2. Implement a function to post-process the summaries for evaluation.
3. Create a repository on the Hub to which we can push our model.

For the learning rate schedule, we'll employ the standard linear schedule used in previous sections:

In [None]:
from transformers import get_scheduler

num_train_epochs = 10
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
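As a rough illustration of what the `"linear"` schedule does to the learning rate, the sketch below computes the rate by hand at a few points in training. The function `linear_lr` and the dataset size are hypothetical; in practice the object returned by `get_scheduler()` does this for you.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


# Hypothetical sizes: 1,000 batches per epoch, trained for 10 epochs
steps_per_epoch = 1000
total_steps = 10 * steps_per_epoch

for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, base_lr=2e-5, num_training_steps=total_steps))
```

With no warmup, the rate starts at the full `2e-5`, is halved at the midpoint, and reaches zero on the final step, which is why `num_training_steps` must match the number of optimizer updates.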

To post-process the generated summaries, we require a function that splits them into sentences separated by newlines. This format aligns with the expectation of the ROUGE metric. The following code snippet achieves this:

In [None]:
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # ROUGE expects a newline after each sentence
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

    return preds, labels

This should look familiar if you recall how we defined the `compute_metrics()` function for the `Seq2SeqTrainer`.
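To see what this formatting produces without downloading NLTK's tokenizer data, the sketch below swaps `nltk.sent_tokenize` for a naive regex splitter. Both `naive_sent_split` and `postprocess_text_demo` are hypothetical stand-ins, and the splitter is much cruder than NLTK's, so use it only to inspect the output format.

```python
import re


def naive_sent_split(text):
    """Hypothetical stand-in for nltk.sent_tokenize: split after ., ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def postprocess_text_demo(preds, labels):
    # Same shape as postprocess_text, but using the naive splitter above
    preds = ["\n".join(naive_sent_split(pred)) for pred in preds]
    labels = ["\n".join(naive_sent_split(label)) for label in labels]
    return preds, labels


preds, labels = postprocess_text_demo(
    ["  Great book. Easy to read!  "], ["Loved it. Highly recommended."]
)
print(repr(preds[0]))
print(repr(labels[0]))
```

Each sentence lands on its own line, which is the layout the ROUGE-Lsum variant relies on when it compares summaries sentence by sentence.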

The last piece of setup is to create a model repository on the Hugging Face Hub, for which we can use the 🤗 Hub library. We only need to define a name for the repository, and the library provides a utility function that combines this name with the user profile:

In [None]:
from huggingface_hub import get_full_repo_name

model_name = "mt5-finetuned-amazon-en-es-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

With this repository name established, we can proceed to clone a local version into our results directory, which will serve as the storage for the training artifacts:

In [None]:
from huggingface_hub import Repository

output_dir = "results-mt5-finetuned-amazon-en-es-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

This allows us to push the artifacts back to the Hub by calling the `repo.push_to_hub()` method during training. Let's wrap up by writing the training loop.

#### Training loop

The training loop for summarization closely resembles the other 🤗 Accelerate examples we've encountered and can be broadly divided into four primary steps:

1. Train the model by iterating over all examples in `train_dataloader` for each epoch.
2. Generate model summaries at the end of each epoch by initially generating the tokens and then decoding them (along with the reference summaries) into text.
3. Compute the ROUGE scores using the methods we previously discussed.
4. Save the checkpoints and push everything to the Hub. To speed this up, we pass `blocking=False` to `repo.push_to_hub()`, which pushes the checkpoints for each epoch _asynchronously_. Training can then continue without waiting for the somewhat slow upload of a GB-sized model.

These steps can be seen in the following block of code:

In [None]:
from tqdm.auto import tqdm
import torch
import numpy as np

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
            )

            generated_tokens = accelerator.pad_across_processes(
                generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
            )
            # If we did not pad to max length, we need to pad the labels too
            labels = accelerator.pad_across_processes(
                batch["labels"], dim=1, pad_index=tokenizer.pad_token_id
            )

            generated_tokens = accelerator.gather(generated_tokens).cpu().numpy()
            labels = accelerator.gather(labels).cpu().numpy()

            # Replace -100 in the labels as we can't decode them
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            if isinstance(generated_tokens, tuple):
                generated_tokens = generated_tokens[0]
            decoded_preds = tokenizer.batch_decode(
                generated_tokens, skip_special_tokens=True
            )
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

            decoded_preds, decoded_labels = postprocess_text(
                decoded_preds, decoded_labels
            )

            rouge_score.add_batch(predictions=decoded_preds, references=decoded_labels)

    # Compute metrics
    result = rouge_score.compute()
    # Extract the median ROUGE scores
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    result = {k: round(v, 4) for k, v in result.items()}
    print(f"Epoch {epoch}:", result)

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

That's a wrap! Running this should give you a model and results similar to those we obtained with the `Seq2SeqTrainer`.

### Using your fine-tuned model

Once you’ve pushed the model to the Hub, you can play with it either via the inference widget or with a pipeline object, as follows:

In [None]:
from transformers import pipeline

hub_model_id = "huggingface-course/mt5-small-finetuned-amazon-en-es"
summarizer = pipeline("summarization", model=hub_model_id)

To assess the quality of the summaries, we can provide a few examples from the test set (examples unseen by the model) to our pipeline. Initially, let's create a straightforward function that displays the review, title, and generated summary together:

In [None]:
def print_summary(idx):
    review = books_dataset["test"][idx]["review_body"]
    title = books_dataset["test"][idx]["review_title"]
    summary = summarizer(review)[0]["summary_text"]
    print(f"'>>> Review: {review}'")
    print(f"\n'>>> Title: {title}'")
    print(f"\n'>>> Summary: {summary}'")

Let’s take a look at one of the English examples we get:

In [None]:
print_summary(100)

That looks promising! The model appears to have executed abstractive summarization by augmenting sections of the review with new words. Moreover, the most fascinating aspect is its bilingual capability, enabling the generation of summaries for Spanish reviews as well:

In [None]:
print_summary(0)

That summary translates into "Very easy to read" in English, a direct extraction from the review in this instance. Nonetheless, this showcases the versatility of the mT5 model, offering a glimpse into handling a multilingual corpus!


