If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets as well as other dependencies. Run the following cell to install them.

In [1]:
! pip install datasets transformers rouge-score nltk
! pip install --upgrade accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting transformers
  Downloading transformers-4.29.1-py3-none-any.whl (7.1 MB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)

If you're opening this notebook locally, make sure your environment has the latest versions of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First, you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!). Then execute the following cell and enter your access token when prompted:

In [27]:
from huggingface_hub import notebook_login

notebook_login()


Then you need to install Git-LFS by running the following cell:

In [3]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


Make sure your version of Transformers is at least 4.11.0, since the functionality used to share the model was introduced in that version:

In [4]:
import transformers

print(transformers.__version__)

4.29.1
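
If you would rather have the notebook fail fast than read the version by eye, the check can be automated. The sketch below is an illustration and not part of the original notebook: `version_tuple` and `MIN_VERSION` are hypothetical names, and the parsing is deliberately simple (it stops at the first non-numeric component, so suffixes such as `dev0` are ignored).

```python
def version_tuple(version):
    """Convert a version string such as '4.29.1' into a tuple of ints."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'dev0' or 'rc1'
        parts.append(int(piece))
    return tuple(parts)


MIN_VERSION = (4, 11, 0)  # minimum version mentioned above

# In the notebook, pass transformers.__version__ instead of a literal.
installed = "4.29.1"
assert version_tuple(installed) >= MIN_VERSION, "Please upgrade transformers"

# Comparing the raw strings would be wrong: "4.9.0" sorts after "4.11.0".
print("4.9.0" >= "4.11.0")                    # True (misleading)
print(version_tuple("4.9.0") >= MIN_VERSION)  # False (correct)
```

Comparing tuples of integers avoids the lexicographic trap shown in the last two lines. For anything beyond a quick sanity check, a dedicated version parser is the safer choice.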


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).

We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.

In [5]:
# from transformers.utils import send_example_telemetry

# send_example_telemetry("summarization_notebook", framework="pytorch")

# Fine-tuning a model on a summarization task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models for a summarization task. We will use the [XSum dataset](https://arxiv.org/pdf/1808.08745.pdf) (for extreme summarization), which contains BBC articles accompanied by single-sentence summaries.

![Widget inference on a summarization task](https://github.com/huggingface/notebooks/blob/main/examples/images/summarization.png?raw=1)

We will see how to easily load the dataset for this task using 🤗 Datasets and how to fine-tune a model on it using the `Trainer` API.

In [6]:
model_checkpoint = "t5-small"

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`t5-small`](https://huggingface.co/t5-small) checkpoint.

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [7]:
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")

Downloading and preparing dataset xsum/default to /root/.cache/huggingface/datasets/xsum/default/1.2.0/082863bf4754ee058a5b6f6525d0cb2b18eadb62c7b370b095d1364050a52b71...

Dataset xsum downloaded and prepared to /root/.cache/huggingface/datasets/xsum/default/1.2.0/082863bf4754ee058a5b6f6525d0cb2b18eadb62c7b370b095d1364050a52b71. Subsequent calls will reuse this data.

The `raw_datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for each of the training, validation and test sets:

In [8]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

To access an actual element, you need to select a split first, then give an index:

In [9]:
raw_datasets["train"][0]

 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.',
 'id': '35232142'}

To get a sense of what the data looks like, the following function will show some examples picked at random from the dataset.

In [10]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct random indices in one call instead of retrying on duplicates.
    picks = random.sample(range(len(dataset)), num_examples)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [11]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,document,summary,id
0,"Many have been using the hashtag #Quanmianertai# - shorthand for ""Freeing up all aspects to have a second child"" - which has become a top trending topic on Thursday night and Friday on the Weibo microblogging network.\nPeople have been sharing their own stories about how strictly the one-child policy had been enforced over the decades it has been in place.\n""I can still remember when I was little, the family planning department broke down the door in my family home to grab my mum and sterilise her. I still carry this trauma to this day. What kind of methods would they use to make us have a second child?"" wrote user MuziD-AiLee.\nHer post attracted scores of sympathetic responses recounting similar experiences.\n""My first child turned out to be twin girls. Two and a half years ago, I was visited daily in my home by planning officials telling me to go for sterilisation. If I didn't get sterilised I would not get the hukou,"" said Shuangbaotaixiaoruhexiaoyi.\n""So I was forced to be sterilised - I was only 23 at that time. My heart hurt so much then - I'm so young and I can't have any more children. I hate the family planning unit.""\nThe hukou is China's identity registration system - officials often deny this to illegal children, making it difficult for them to travel around the country and gain access to state education and healthcare.\nRead more: Children denied an identity under China's one-child policy\nOthers told stories of how their families had to flee or were punished by officials for having more than one child.\n""I can remember how officials would come in the middle of the night to catch people, everyone in my village would wrap up in their blankets and run to the cemetery at a nearby hill to spend the night,"" said Haohaomami3257673354.\n""When my mother had my little brother, we were fined. We had no money so they took away our things in our home - our television set, our benches, even our food. 
After that there was still forced sterilisation, and now she has a disease. Who will take responsibility for this?"" said Wannengjunren.\nOthers shared sympathy for young Chinese born after 1979, characterising them as the only generation without any siblings.\n""Those born in the 80s and 90s are perhaps now the only generation in the entire history of Chinese civilisation to not have any brothers or sisters... My heart aches for them,"" said user Laiquzhijian, in a post that was shared more than 47,000 times online.\nOthers expressed bitterness at the heavy burden this generation would have in supporting their parents and grandparents, and the perceived lack of support from the Chinese government.\nZhaonenenene posted a picture of what appeared to be a series of newspaper editorial headlines throughout the decades, apparently reflecting the shift in government sentiment.\nA 1985 headline reads: ""Just having one is good, the government will take care of the elderly"". One from 2005 reads: ""When taking care of the elderly one cannot rely on the government!""\nSome, however, said they were happy to have been only children, and that they planned to stick to having one child themselves.\n""You have the love of the whole family, you have the most leeway to spend money, you can have all that your heart desires, your parents have more spending money and more time to enjoy life, why would you want another sibling?"" said Waguzhongren.\nHongyanxibaby shared several pictures of herself and a girl who appeared to be her young daughter, including one where the girl is clutching keys to a Volkswagen vehicle.\nShe said: ""They've told us not to waste our genes and have another. 
Too bad we have the money to have a child but not to raise it - and I mean raise it in wealth.""\n""I'm not willing to have children just so that they grow up poor.""","The abolition of China's controversial one-child policy has triggered an intense emotional discussion online, with netizens expressing a mixture of regret, bitterness and sympathy.",34674393
1,"The devices went on sale via Apple's website this week and a pair costs £159 in the UK.\nThey are sold with a charging case and connect to Apple devices such as iPhones and Macs via Bluetooth.\nOne tech analyst said the Airpods would be ""easier to lose"" than conventional, wired earphones but pointed out that the design also had some advantages.\nCustomers will also be charged £65 to replace a lost charging case. It costs £45 to service the battery in a single Airpod or the charging case itself.\nIt is not surprising that replacing a lost product would incur a charge, noted IHS Technology analyst Ian Fogg.\n""What's striking I think about this is more that, because the Airpod is so small and doesn't have a cable, it's going to be easier to lose,"" he told the BBC.\nHe added that one benefit of wireless, miniaturised devices like the Airpods was the fact that connecting wires would not snag on clothing, for example.\nBut it might be possible to develop features for wireless earphones in the future that would help users find a lost Airpod and avoid having to pay for a replacement.\n""Apple could have a 'find-my-Airpod' feature,"" suggested Mr Fogg.\nOther companies making wireless earphones also have replacement charges.\nFor example, Bragi's The Dash earbuds - which retail for $299 (£240) as a pair in the US - can be replaced for $129 each.",Apple has said customers who lose one of the new wireless Airpod earphones will be charged £65 for a replacement.,38340748
2,"Sheikh Miskeen, which lies on one of the main routes from Damascus to the city of Deraa and the Jordanian border, fell after a month-long battle.\nRussian warplanes were reported to have played a key role in the offensive.\nRussia's foreign minister meanwhile declared its intervention had changed the course of the conflict in Syria, ahead of the start of peace talks.\nSergei Lavrov also warned it would be impossible to negotiate a political settlement without allowing Kurdish groups to attend.\nThe UN has invited Syria's government and opposition to indirect ""proximity talks"" that are due to start in Geneva on Friday, but a statement released on Tuesday did not give any details as to who had been invited or how many groups may take part.\nThere was a delay in sending out the invitations because of disagreements over who should be included in the opposition delegation.\nGovernment soldiers and allied fighters, including members of Lebanon's Hezbollah movement, took control of Sheikh Miskeen overnight, the Syrian Observatory for Human Rights reported.\nBut fighting was continuing on the western outskirts of the town on Tuesday, the UK-based monitoring group said.\n""The town is very important for both sides. They have both fought fiercely. Now by taking it, the regime has cut off the rebels links between eastern and western Deraa [province],"" its director Rami Abdul Rahman, told the Reuters news agency. 
""The destruction in the town is huge.""\nThe region is the last where secular and nationalist rebel factions still hold substantial territory.\nThe government's offensive in Sheikh Miskeen was the first to be launched in the south following the start of Russia's air campaign against opponents of President Bashar al-Assad on 30 September.\nOn Tuesday, Russia's foreign minister told a news conference in Moscow that its intervention in Syria had ""really helped to turn around the situation in the country"" and ""helped towards reducing the territory controlled by terrorists"".\nMr Lavrov also stressed that no-one had supplied proof to support widespread allegations that Russian air strikes had caused civilian deaths in Syria.\nMoscow says it is targeting ""all terrorists"", above all members of the so-called Islamic State group (IS), but activists say many of its strikes have hit civilians and Western-backed rebels.\nThe Syrian Observatory, which relies on a network of sources on the ground, said last week that the Russian air campaign had killed 1,015 civilians, as well as 1,141 rebel fighters and 893 Islamic State (IS) militants.\nMr Lavrov also said it would be a ""grave mistake"" to accede to a demand by Turkey not to invite the Syrian Kurdish Democratic Union Party (PYD) to the upcoming peace talks.\nThe Kurdish party's YPG militia controls large parts of northern Syria and is a key ally of the US-led coalition against IS.\nThe PYD is an offshoot of the Kurdistan Workers' Party (PKK), which Turkey and a number of Western countries consider a terrorist organisation.\nMr Lavrov also denied that Russia's military intelligence chief had travelled to Damascus in an attempt to persuade President Assad to step down.\nMore than 250,000 people have been killed since the uprising against Mr Assad erupted in March 2011. 
Eleven million others have been driven from their homes.","Government forces have retaken control of a strategically important town in southern Syria, activists say.",35409423
3,"Let's start with the biggie - is there a Brexit effect that is frightening workers off from the British economy?\nAnecdotally, many of us who report on this field have picked up these stories. I spoke to a lot of Eastern European workers around the time of the general election who were rather nervous but somewhat resigned to Brexit. But not many of them suggested to me they were going to get on the first budget flight back home.\nBut today's data gives us a really good glimpse into the thousands of individual decisions that ordinary people make about their future.\nNet migration - that's the difference between the number of immigrants coming in for a year or more and the number of people who emigrate - has fallen substantially since the referendum. In March 2016, weeks out from the vote, it stood at almost 330,000.\nToday it is 81,000 down at 246,000 people - the lowest it has been for three years.\nThe estimates from the Office for National Statistics show that two-thirds of this fall in net migration is accounted for by changes in EU migration, and particularly by citizens of Eastern and Central Europe.\nIn the year to the end of March, fewer EU nationals arrived to live in the UK than in the previous 12 months - and there was an acceleration in the numbers leaving.\n81,000\ndecrease in net migration\n246,000\nnet migration to the UK, lowest figure for three years\nNet EU migration fell by 51,000\n'EU8' emigration rose by 17,000\nWhen you look at the figures for the 10 nations of Eastern and Central Europe, we can see that 62,000 of their citizens said ""do widzenia"" (""goodbye"") to the UK while 26,000 fewer of them arrived.\nWhen you drill down further, net migration from the A8 nations (Poland and others which joined the EU in 2004) has dropped very sharply. In the year to March 2016, 39,000 more of these citizens arrived than left. 
In the year to March 2017, that had crashed to just 7,000.\nInterestingly, notes Prof Jonathan Portes of King's College London, these figures show, for the first time, a stabilising of arrivals from the eight Eastern European nations - and that suggests they no longer regard the UK as as attractive as it once was.\n""Net migration from the A8 countries, which joined the EU in 2004, is now statistically insignificant for the first time since then,"" he says.\n""Moreover, figures for National Insurance registrations, which measure new arrivals registering to work, also fell, with the number of EU nationals registering in April to June falling more than 12% on the same period a year earlier.\n""These statistics confirm that Brexit is having a significant impact on migration flows, even before we have left the EU or any changes are made to law or policy.""\nFor its part, the ONS is cautioning that it's too early to say this is a long-term trend. So are there other factors beyond a suspected Brexit effect?\nSince the Brexit referendum, the falls in the pound on currency markets mean that money made in the UK buys less back home.\nThis is really important for workers who are sending cash back to their families - and a decisive factor in decisions to move all around the world.\nLast June, the pound bought almost 6 Polish zlotys. 
Today, it buys only 4.6 zlotys.\nWhat's more, when people choose to move to another country, they're not just looking at the circumstances there, but, fairly obviously, at the conditions at home.\nAnd there is no doubt that for some EU workers, coming to the UK isn't the slam-dunk deal it once was.\nThe Polish economy, for example, has one of the strongest growth rates in the EU and its government is lobbying workers to stay at home, rather than take their skills elsewhere.\nWhatever the precise factors, the government will want to present all this as a victory for its strategy and progress towards its net migration target.\nAnd while campaigners for falls will be buoyed by the statistics - some are urging caution.\n""This is a step forward but it is largely good fortune,"" says Lord Green, chairman of Migrationwatch UK.\n""It is mainly due to a reduction in the huge net inflow of East Europeans from 100,000 to 50,000. This should not obscure the fact that migration remains at an unacceptable level of a quarter of a million a year with massive implications for the scale and nature of our society.""\nThat's a pointer to the scale of the challenge ministers still face, if they are determined to stick to their target. Net migration from the rest of the world still stands at 180,000 people a year - and that is the one part of policy that the UK can currently completely control.\nThe August figures have also revealed some fascinating truths about migration, and people's intentions, that until now have been subjected to myth, fears and an awful lot of speculation - do people leave the UK when they should?\nWell, we don't really know - or at least we didn't until now. 
The ONS uses a large rolling survey at ports to estimate immigration and emigration - but it's only as good as a survey can be - it has limitations.\nNow, we have ""exit checks"" data - figures derived from the scans of passports and so on as people leave the UK at our ports.\nAnd the figures from the Home Office show, for the first time, that the vast majority of visitors to the UK who require a visa leave the UK when they should.\nSome 1.34 million visas granted to non-EEA nationals expired in 2016-17. Of those people who had not already secured a legal reason to stay on, 96.3% departed in time. A further 0.4% left after their visa expired. It's not quite clear what happened to the remaining 3.3%.\nSo, of all those visas, around 40,000 overstayed.\nAnd what's even more interesting are the figures around students. International students have been a hot topic in the migration debate with some claiming that they habitually overstay their visas. Some of the predictions for student over-stayers have been enormous.\nThe exit check data shows the rate of compliance - those who play by the rules - was 97.4%. And that suggests that assumptions about mass overstaying are either simply wrong or, alternatively, a thing of the past after a crackdown on bogus colleges.",There are so many important headlines in the August migration data that it is difficult to know where to begin.,41037021
4,"Martyn Galvin, 30, from Yarm, collected money from his friends, some he had known for 20 years, for a trip to Prague and a day at the races.\nBut the group was left ""shocked"" when he failed to turn up at the airport and found the trip had not been booked.\nGalvin was jailed for 20 months at Teesside Crown Court.\nOne of the group, who did not want to be named, told BBC Tees: ""It was shock more than anything and disbelief. We were just sat there thinking, 'Is that really happening?'.""\nHe said once the group had arrived at the airport, Galvin had text them to announce his cancer had just been diagnosed as terminal and apologised for a ""mix up"" with the booking.\nThe friend said: ""He [the groom] was in shock. He was on one side of the fence thinking his best mate's got terminal cancer but then on the other because of all the things that had happened...he started to think 'What's true and what's not?'.\n""It's almost as if he [Galvin] was starting to believe his own lies and living a lie. I still to this day don't believe it actually happened,"" he said.\nThe friend said Galvin had lied to them about having cancer four months before the stag-do.\nHe said: ""I remember incidents where I had picked him up from his house because the doctor said he wasn't allowed to drive and he would gingerly get into my car with a fresh bandage on that we later found out he was buying from Boots.""\nThe stag do still took place in Newcastle, Yarm and Middlesbrough and the wedding went ahead with a different best man.\nGalvin was also ordered to pay full compensation within 28 days.","A best man who conned his friends out of £8,000 for a stag-do and lied about having cancer has been jailed for fraud by false representation.",37129123


The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [12]:
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"
"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_aggregator: Return aggregates if this is set to True
Retu

You can call its `compute` method with your predictions and labels, which need to be lists of decoded strings:

In [13]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hello there", "general kenobi"]
metric.compute(predictions=fake_preds, references=fake_labels)

{'rouge1': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rouge2': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeL': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeLsum': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}
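
To make these numbers more concrete, here is a minimal sketch of what a ROUGE-1 score measures: the unigram overlap between a prediction and its reference. It is a simplified illustration, not the implementation behind the metric loaded above. It assumes whitespace tokenization, applies no stemming and does no bootstrap aggregation (so there is no low/mid/high), and `rouge1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F-measure (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


# Identical strings give a perfect score, as in the cell above.
print(rouge1("hello there", "hello there"))

# A partial match: one of two predicted tokens appears in the three-token reference.
print(rouge1("general kenobi", "hello there general"))
```

In the second call, precision is 1/2 and recall is 1/3, which is why ROUGE is usually reported with all three values rather than a single number.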

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in a format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [14]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


By default, the call above will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [15]:
tokenizer("Hello, this one sentence!")

{'input_ids': [8774, 6, 48, 80, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later); you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

Instead of one sentence, we can pass along a list of sentences:

In [16]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets:

In [17]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this one sentence!", "This is another sentence."]))

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}




If you are using one of the five T5 checkpoints, the inputs have to be prefixed with "summarize:" (the model can also translate, and it needs the prefix to know which task it has to perform).

In [18]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model. The padding will be dealt with later on (in a data collator), so that we pad examples to the longest length in the batch and not in the whole dataset.

In [19]:
max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [20]:
preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[21603, 10, 37, 423, 583, 13, 1783, 16, 20126, 16496, 6, 80, 13, 8, 844, 6025, 4161, 6, 19, 341, 271, 14841, 5, 7057, 161, 19, 4912, 16, 1626, 5981, 11, 186, 7540, 16, 1276, 15, 2296, 7, 5718, 2367, 14621, 4161, 57, 4125, 387, 5, 15059, 7, 30, 8, 4653, 4939, 711, 747, 522, 17879, 788, 12, 1783, 44, 8, 15763, 6029, 1813, 9, 7472, 5, 1404, 1623, 11, 5699, 277, 130, 4161, 57, 18368, 16, 20126, 16496, 227, 8, 2473, 5895, 15, 147, 89, 22411, 139, 8, 1511, 5, 1485, 3271, 3, 21926, 9, 472, 19623, 5251, 8, 616, 12, 15614, 8, 1783, 5, 37, 13818, 10564, 15, 26, 3, 9, 3, 19513, 1481, 6, 18368, 186, 1328, 2605, 30, 7488, 1887, 3, 18, 8, 711, 2309, 9517, 89, 355, 5, 3966, 1954, 9233, 15, 6, 113, 293, 7, 8, 16548, 13363, 106, 14022, 84, 47, 14621, 4161, 6, 243, 255, 228, 59, 7828, 8, 1249, 18, 545, 11298, 1773, 728, 8, 8347, 1560, 5, 611, 6, 255, 243, 72, 1709, 1528, 161, 228, 43, 118, 4006, 91, 12, 766, 8, 3, 19513, 1481, 410, 59, 5124, 5, 96, 196, 17, 19, 1256, 68, 27, 103, 317, 132
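
The "truncate now, pad later" strategy described above can be sketched in plain Python. This is an illustration under simplifying assumptions, not the tokenizer's or the data collator's actual code: `truncate` and `pad_batch` are hypothetical helpers, the token IDs are made up, `pad_id=0` is an assumed padding ID, and truncation is plain slicing.

```python
def truncate(token_ids, max_length):
    """Keep at most `max_length` token IDs (plain slicing, a simplification)."""
    return token_ids[:max_length]


def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(ids) for ids in batch)
    return {
        "input_ids": [ids + [pad_id] * (longest - len(ids)) for ids in batch],
        "attention_mask": [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch],
    }


# Hypothetical token IDs of very different lengths.
examples = [[5, 6, 7, 8, 9, 10], [11, 12], [13, 14, 15]]
truncated = [truncate(ids, max_length=4) for ids in examples]

short_batch = pad_batch(truncated[1:])  # padded to length 3, not 4
full_batch = pad_batch(truncated)       # padded to length 4
print(short_batch["input_ids"])         # [[11, 12, 0], [13, 14, 15]]
```

Padding per batch means a batch of short examples is not inflated to the length of the longest example in the dataset. In the notebook itself, `preprocess_function` above only truncates; the padding is left to the data collator.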

To apply `preprocess_function` to all the examples in our dataset, we just use the `map` method of the `raw_datasets` object we created earlier. This will apply the function to all the elements of all the splits in `raw_datasets`, so our training, validation and testing data will be preprocessed in one single command.

In [21]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)


Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and the cached data should therefore not be reused). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts in batches. This leverages the full benefit of the fast tokenizer we loaded earlier, which uses multi-threading to process the texts in a batch concurrently.
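
To make the effect of `batched=True` concrete, here is a minimal pure-Python sketch of how a batched mapping hands data to the function. The helper `map_batched` and the toy function `toy_preprocess` are hypothetical names used only for illustration; they mimic the calling convention (a dict of lists in, a dict of lists out), not the actual 🤗 Datasets implementation, which also handles caching and storage.

```python
def map_batched(rows, function, batch_size=1000):
    """Hypothetical stand-in for a batched map: calls `function` on dicts of lists."""
    output = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into one dict of columns (lists).
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = function(batch)
        # Turn the returned columns back into rows, merged with the originals.
        for i, row in enumerate(chunk):
            new_row = dict(row)
            new_row.update({key: values[i] for key, values in result.items()})
            output.append(new_row)
    return output


def toy_preprocess(examples):
    # `examples["document"]` is a list of strings, one per example in the batch.
    return {"n_words": [len(text.split()) for text in examples["document"]]}


rows = [
    {"document": "a short text", "summary": "short"},
    {"document": "another slightly longer text", "summary": "longer"},
    {"document": "third", "summary": "3"},
]
processed = map_batched(rows, toy_preprocess, batch_size=2)
print([row["n_words"] for row in processed])  # [3, 4, 1]
```

This is why `preprocess_function` can index `examples["summary"]` and pass the resulting list of strings straight to the tokenizer.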

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [22]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/242M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Note that we don't get a warning like in our classification example. This means we used all the weights of the pretrained model and there is no randomly initialized head in this case.

To instantiate a `Seq2SeqTrainer`, we will need to define three more things. The most important is the [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [23]:
batch_size = 16
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-xsum",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the cell and customize the weight decay. Since the `Seq2SeqTrainer` will save the model regularly and our dataset is quite large, we tell it to make three saves maximum. Lastly, we use the `predict_with_generate` option (to properly generate summaries) and activate mixed precision training (to go a bit faster).

The last argument sets everything up so we can push the model to the [Hub](https://huggingface.co/models) regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization rather than your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/t5-finetuned-xsum"` or `"huggingface/t5-finetuned-xsum"`).
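
To get a feel for what these settings mean in practice, the following back-of-the-envelope calculation estimates the number of optimizer steps. It is only a sketch: it assumes a single device, no gradient accumulation, and that the last partial batch is kept. The training-set size is the one shown in the `map` progress bar.

```python
import math

num_train_examples = 204045  # size of the training split shown in the `map` progress bar
batch_size = 16              # per-device batch size used above
num_devices = 1              # assumption: a single GPU
num_train_epochs = 1

# The last, partial batch still counts as one step, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / (batch_size * num_devices))
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)  # 12753 12753
```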

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels:

In [24]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
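
As a rough illustration of what this collator does, the sketch below pads a batch by hand. `pad_batch` is a hypothetical helper, not part of 🤗 Transformers; it assumes right-padding, a pad token id of 0 (the value T5 tokenizers use) and -100 as the padding value for labels, which is the index the loss ignores. The token ids in the toy batch are made up.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad `input_ids` and `labels` to the longest sequence in the batch."""
    max_input_len = max(len(f["input_ids"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Labels are padded with -100 so padded positions do not contribute to the loss.
        n_label_pad = max_label_len - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch


features = [
    {"input_ids": [21603, 10, 37, 1], "labels": [71, 1]},
    {"input_ids": [21603, 10, 1], "labels": [100, 19, 3, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])  # [[21603, 10, 37, 1], [21603, 10, 1, 0]]
print(batch["labels"])     # [[71, 1, -100, -100], [100, 19, 3, 1]]
```

The real `DataCollatorForSeq2Seq` additionally returns tensors, and when it is given `model=model` it can also prepare `decoder_input_ids` from the labels.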

The last thing to define for our `Seq2SeqTrainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier, and we have to do a bit of pre-processing to decode the predictions into texts:

In [25]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
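
To give an intuition for the scores `compute_metrics` reports, here is a deliberately simplified ROUGE-1 F-measure in pure Python. `rouge1_f` is a hypothetical helper that assumes lowercased whitespace tokenization and no stemming, so its numbers will not exactly match those of the `metric` used above.

```python
from collections import Counter


def rouge1_f(prediction, reference):
    """Simplified unigram-overlap F-measure (no stemming, whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f("the cat sat on the mat", "the cat lay on the mat")
print(round(score * 100, 4))  # 83.3333
```

Five of the six unigrams overlap, so precision and recall are both 5/6, and so is the F-measure. Multiplying by 100 mirrors the scaling applied in `compute_metrics`.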

Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`:

In [28]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Cloning https://huggingface.co/HoldenCaulfieldRye/t5-small-finetuned-xsum into local empty directory.


Download file pytorch_model.bin:   0%|          | 7.40k/231M [00:00<?, ?B/s]

Download file runs/May14_14-27-38_b1b5c837832f/1684076534.2936773/events.out.tfevents.1684076534.b1b5c837832f.…

Download file training_args.bin: 100%|##########| 4.00k/4.00k [00:00<?, ?B/s]

Clean file runs/May14_14-27-38_b1b5c837832f/1684076534.2936773/events.out.tfevents.1684076534.b1b5c837832f.459…

Clean file training_args.bin:  25%|##5       | 1.00k/4.00k [00:00<?, ?B/s]

Download file runs/May14_14-27-38_b1b5c837832f/events.out.tfevents.1684076534.b1b5c837832f.459.0: 100%|#######…

Clean file runs/May14_14-27-38_b1b5c837832f/events.out.tfevents.1684076534.b1b5c837832f.459.0:  11%|#1        …

Clean file pytorch_model.bin:   0%|          | 1.00k/231M [00:00<?, ?B/s]

We can now fine-tune our model by just calling the `train` method:

In [None]:
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss


You can now upload the result of the training to the Hub by executing this instruction:

In [None]:
trainer.push_to_hub()

# Use this instead if you prefer to make the model private
# trainer.push_to_hub(private=True)

You can now share this model with all your friends, family and favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"`, for instance:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("sgugger/my-awesome-model")
```