If you're opening this Notebook on colab, you will probably need to install 🤗 Transformers and 🤗 Datasets as well as other dependencies. Right now this requires the current master branch of both. Uncomment the following cell and run it.

In [None]:
#! pip install git+https://github.com/huggingface/transformers.git
#! pip install git+https://github.com/huggingface/datasets.git
#! pip install rouge-score nltk

If you're opening this notebook locally, make sure your environment has an install from the last version of those libraries.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!) then uncomment the following cell and input your username and password (this only works on Colab, in a regular notebook, you need to do this in a terminal):

In [1]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /home/matt/.huggingface/token


Then you need to install Git-LFS and setup Git if you haven't already. Uncomment the following instructions and adapt with your name and email:

In [2]:
# !apt install git-lfs
# !git config --global user.email "you@example.com"
# !git config --global user.name "Your Name"

Make sure your version of Transformers is at least 4.8.1 since the functionality was introduced in that version:

In [2]:
import transformers

print(transformers.__version__)

4.12.0.dev0


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).

# Fine-tuning a model on a summarization task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) model for a summarization task. We will use the [XSum dataset](https://arxiv.org/pdf/1808.08745.pdf) (for extreme summarization) which contains BBC articles accompanied with single-sentence summaries.

![Widget inference on a summarization task](images/summarization.png)

We will see how to easily load the dataset for this task using 🤗 Datasets and how to fine-tune a model on it using the `Trainer` API.

In [3]:
model_checkpoint = "t5-small"

This notebook is built to run  with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`t5-small`](https://huggingface.co/t5-small) checkpoint. 

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [4]:
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")

Using custom data configuration default
Reusing dataset xsum (/home/matt/.cache/huggingface/datasets/xsum/default/1.2.0/4957825a982999fbf80bca0b342793b01b2611e021ef589fb7c6250b3577b499)


  0%|          | 0/3 [00:00<?, ?it/s]

The `dataset` object itself is [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for the training, validation and test set:

In [5]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

To access an actual element, you need to select a split first, then give an index:

In [6]:
raw_datasets["train"][0]

{'summary': 'New Welsh Rugby Union chairman Gareth Davies believes a joint £3.3m WRU-regions fund should be used to retain home-based talent such as Liam Williams, not bring back exiled stars.',
 'document': 'Recent reports have linked some France-based players with returns to Wales.\n"I\'ve always felt - and this is with my rugby hat on now; this is not region or WRU - I\'d rather spend that money on keeping players in Wales," said Davies.\nThe WRU provides £2m to the fund and £1.3m comes from the regions.\nFormer Wales and British and Irish Lions fly-half Davies became WRU chairman on Tuesday 21 October, succeeding deposed David Pickering following governing body elections.\nHe is now serving a notice period to leave his role as Newport Gwent Dragons chief executive after being voted on to the WRU board in September.\nDavies was among the leading figures among Dragons, Ospreys, Scarlets and Cardiff Blues officials who were embroiled in a protracted dispute with the WRU that ended in 

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [7]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [8]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,document,summary,id
0,"The Dons trail Premiership leaders Celtic by four points having played a game more and McInnes knows his men can ill-afford any slip-ups.\n""We're going there to try to win because we know we don't have a lot of room for error, ""McInnes said.\n""We want to keep ourselves in the fight as much as possible.""\nMcInnes is hopeful Friday night's match at Tynecastle will go ahead after the Edinburgh side's scheduled game on Tuesday against Inverness Caledonian Thistle was postponed due to an outbreak of illness in Robbie Neilson's squad.\nUp to 12 Hearts players have been suffering from gastroenteritis, but there is no suggestion that Friday's match is under threat.\n""I'm encouraged by what's coming out of Hearts when they said yesterday that they're confident that the game will go ahead on Friday,"" the Aberdeen manager said.\n""Our preparation's the same. We're looking forward to the game. It's a very important game, as every game is at the minute.\n""We know there's no real margin for error at this stage of the season to keep close to Celtic.""\nAberdeen have visited Tynecastle twice this season, with mixed results.\nHearts knocked the Dons out of the Scottish Cup at the fourth round stage with a 1-0 victory in January.\nMcInnes has fonder memories of the 3-1 league win his side registered in Gorgie in September that extended Aberdeen's Premiership winning streak to eight matches.\n""We were strong that day,"" he said. ""We got ourselves in front and were quite clinical with our work in that first-half.\n""We thought Hearts would be a strong side this season with the squad they've managed to put together.\n""They've shown that and I think they deserve huge credit for the way they've performed this season. Very confident side, good experience in the right areas.\n""We've got a tough job. But, if we get that level of performance again, it gives us a chance to go and win the game.\n""We didn't set out this season to finish second. Last season, we were 20-odd points ahead of third spot and Hearts have certainly made sure that we are having to keep concentrating on that side of it.\n""Hearts will qualify for Europe, as we have done, and they'll be trying to pull back us towards them as we're trying to pull Celtic back towards us.\n""There's still plenty to play for. The three points are equally as important for both teams.""",Aberdeen manager Derek McInnes insists defeat is not an option for his side away to Hearts on Friday if they want to maintain their title challenge.,35970219
1,"The first minister has said On the Runs have letters ""stuffed in their pockets"" guaranteeing that they will not be prosecuted for any offence.\nThe DUP has claimed that amounts to a general amnesty for those concerned.\nBut is that the case? What do the letters actually say?\nWe do not know if all of the so-called ""letters of assurance"" were couched in identical terms, but evidence presented in private hearings at the Old Bailey suggests they were, and that legal safeguards were built in.\nIn his judgement in the case of John Downey who denied the murder of four soldiers in an IRA attack in Hyde Park in 1982, Mr Justice Sweeney refers to the fact that on 15 June 2000, Jonathan Powell, who was prime minister Tony Blair's Chief of Staff, wrote to Sinn FÃ©in president Gerry Adams enclosing letters representing decisions by the attorney general and the director of public prosecutions for England and Wales.\nThe letters stated that: ""Following a review of your case by the director of public prosecutions for England and Wales, he has concluded that on the evidence before him there is insufficient to afford a realistic prospect of convicting you for any such offence arising out of...""\nAnyone already convicted of paramilitary crimes became eligible for early release under the terms of the Northern Ireland Good Friday agreement of 1998.\nThe agreement did not cover:\nThey went on to say: ""You would not therefore face prosecution for any such offence should you return to the United Kingdom. That decision is based on the evidence currently available. Should such fresh evidence arise - and any statement made by you implicating yourself in... may amount to such evidence - the matter may have to be reconsidered.""\nThere are a number of key phrases. The statement that the decision is based on ""evidence currently available"" clearly suggests that if new evidence was to come to light, the issue could be reconsidered.\nThat is reinforced in the next sentence when it is spelt out clearly that should any fresh evidence arise ""the matter may have to be reconsidered.""\nThe judge also noted that on 22 March 2002, a briefing note was prepared for the prime minister for a meeting with Gerry Adams.\nBy that stage Sinn FÃ©in had provided a total of 161 names of On the Runs for clarification of their legal status.\nOf these, the judgement notes, ""47 had so far been cleared"". In a further 12 cases it said the director of public prosecutions for Northern Ireland ""had said there remained a requirement to prosecute, and in a further 10 the police had sufficient evidence to warrant arrest for questioning.""\nWe do not know details of the alleged offences this note referred to, but what it does make clear is that by that stage 22 of the 161 OTRs who sought legal clarification were not given assurances that they would not be prosecuted.\nIn the same month, the judgement notes that a note was prepared by a senior legal official following requests from Sinn FÃ©in for the administrative process dealing with OTRs to be speeded up.\nIt was noted that ""it would be necessary to include in the NIO's 'comfort letter' a qualification as to the level of comfort given.""\nA suggested draft again stated that the assurance that an individual was not wanted for arrest, questioning or charge was given ""on the basis of the information currently available.""\nIt added: ""If any other outstanding offence or offences come to light, or if any request for extradition were to be received these would have to be dealt with in the usual way.""\nPeter Hain, who was secretary of state when the process to deal with On the Runs was introduced, told the court that the key phraseology used in the personal letters was ""in essence common to all, that on the basis of current information they were not wanted and would not be arrested.""\nThere is also clear evidence, other than the contents of the judgement, that not all On the Runs received assurances that they would not be prosecuted.\nThe BBC has obtained a copy of a letter sent to the Northern Ireland Policing Board in April 2010 by Assistant Chief Constable Drew Harris, the PSNI officer with oversight of its role in the process, which was to establish whether named individuals were wanted for questioning or arrest.\nIn it, he told the board that:\n""Of the submitted names, 173 are not wanted, eight have been returned to prison and 11 remain wanted. In the year 2007 to 2008, three persons were arrested and referred to the court service. Of the remaining names, 10 have been referred to the PPS for direction, 11 are proceeding through Historical Enquiry Team review and two are ongoing live investigations.""\nThe letter sent to John Downey in July 2007 contained a caveat that his assurance could be reconsidered if new evidence came to light.\nHis letter said: ""The Secretary of State for Northern Ireland has been informed by the attorney general that on the basis of the information currently available, there is no outstanding direction for prosecution in Northern Ireland, there are no warrants in existence, nor are you wanted in Northern Ireland for arrest, questioning or charge by the police.\n""The Police Service of Northern Ireland are not aware of any interest in you from any other police force in the UK. If any other outstanding offence or offences come to light, or if any request for extradition were to be received, these would have to be dealt with in the usual way.""\nThe key phrase is once again ""information currently available.""\nThe problem for the Northern Ireland Office was that at the time the letter was sent, John Downey was listed as wanted by the Metropolitan Police in connection with the Hyde Park bombing.\nThe problem for the PSNI was that it was aware of this fact, but had not made the NIO nor attorney general aware of it when it carried out a review of John Downey's legal status.\nThat meant Downey was wrongly informed that there were no warrants for his arrest, and enabled his lawyers to argue that there had been an abuse of process when he was arrested at Gatwick airport in May, 2013.\nIt also meant the caveat about ""any other outstanding offence or offences"" coming to light was null and void, because at the time the letter was sent he was wanted in connection with the bombing.\nIt was not an outstanding or new offence but one the authorities should have been aware of when the assurance was given because he was listed on the Police National Computer as wanted.\nLikewise, the fact that he was wanted by the Metropolitan Police was ""information currently available"" when the letter was issued.\nLegal sources say the problem was not a lack of caveat, but the fact that the PSNI did not highlight the fact that John Downey was wanted.\nThe result was that when he was arrested, prosecution lawyers could not argue that it was based on new evidence, or information the state was not aware of at the time the assurance was given.\nLegal sources say that if the letters received by other OTRs contain similar wording, it would still be possible for them to be prosecuted at a later date if new evidence linking them to an offence comes to light.\nThat is of course, unless other mistakes have been made and assurances have been issued based on inaccurate information.\nThe PSNI is currently conducting a review of all OTR cases to determine if the information it gave the prosecution authorities was accurate.","Peter Robinson has called them ""get out of jail free cards"".",26376541
2,"The remains were found in a remote part of north-western Brittany after Hubert Caouissin was let out of custody to lead the investigators to the location.\nHe earlier confessed to killing Pascal Troadec, his wife Brigitte, and their two children in an inheritance row.\nHe said he had battered them to death with a crowbar at their home in Nantes.\nMr Caouissin was arrested last week along with his former wife Lydie, Mr Troadec's sister.\nThe discovery of body parts at the farm in de Pont-de-Buis-les-Quimerch was announced by prosecutor Pierre Sennes on Wednesday. Family jewellery was also found. The remains are yet to be identified.\nPascal and Brigitte Troadec, both aged 49, their son Sebastien, 21, and daughter Charlotte, 18, were last seen in mid-February.\nAt a news conference last week, Mr Sennes said Mr Caouissin had admitted using a crowbar to bludgeon the family at their home in Nantes.\nOn 16 February, he spied on the Troadecs' home, using a stethoscope to listen through the windows, Mr Sennes said.\nThat night he broke into the house, apparently with the aim of stealing a key. The family awoke when they heard a noise, and a fight broke out between the intruder and Pascal Troadec.\nMr Caouissin killed Mr Troadec first, and then the rest of the family, Mr Sennes added.\nThe prosecutor said Mr Caouissin dismembered the bodies, burying some parts and burning others.\nMr Caouissin has no previous criminal record. He now faces possible life imprisonment.\nThe role of Lydie Troadec is not yet clear, but she is accused of helping to clean the vehicle used to dispose of the bodies.\nThe inheritance argument reportedly centres on gold bars found during works at a building in Brest owned by Mr Troadec's father, who died several years ago.\nHowever, Mr Caouissin's mother has told journalists that the existence of the gold bars was a ""myth"".",French investigators say they have discovered body parts at the farm of a man who earlier admitted killing four of his relatives.,39210622
3,"Public payphones are set to become an even less familiar sight in Scotland as an estimated one third of the country's phone boxes have been earmarked for closure.\nNo calls were made from more than 700, of Scotland's 4,800 call boxes, last year.\nBT has begun consulting on plans to close about 1,500 phone boxes around the UK.\nThe firm has said that payphone usage has declined, by more than 90% over the last decade, as the popularity of mobile phones has surged.\nBut as these readers' pictures show some telephone boxes across Scotland have been getting a new lease of life.\nSend us your photos to scotlandpictures@bbc.co.uk or Instagram at #bbcscotlandpics",.,37349200
4,"Chairman Gareth Davies has written to each of the Union's 320 member clubs.\n""The changes we are making are part of our continued aim to modernise the way in which the game is governed in Wales,"" he said.\n""We have created four new 'sub-boards' which will meet to discuss specific areas of governance, prior to full Board meetings.""\nDavies added: ""All sporting bodies are being scrutinised closely on their governance structures and composition of their management boards, and, partly in response, but mainly as it is the right thing to do, we have recently reviewed our internal structures.""\nFour new sub-boards have been created to discuss specific aspects of governance prior to full board meetings, with the aim of streamlining the decision-making process in Welsh rugby.\nEach sub-board will discuss their specific areas - Commercial, Financial, professional/performance and community game.\nThe Union says this means full board meetings will not need to be held as regularly and should be ""less cumbersome.""\nThe sub-boards, which can co-opt members, will meet every month, while the full WRU Board will now be scheduled to meet in its entirety four times a year instead of 12.",The Welsh Rugby Union has outlined a major structural overhaul in a bid to modernise the way the game is governed.,39284480


The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [9]:
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each predictions
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"
"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_agregator: Return aggregates if this is set to True
Retu

You can call its `compute` method with your predictions and labels, which need to be list of decoded strings:

In [10]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hello there", "general kenobi"]
metric.compute(predictions=fake_preds, references=fake_labels)

{'rouge1': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rouge2': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeL': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeLsum': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer` which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

By default, the call above will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [12]:
tokenizer("Hello, this one sentence!")

{'input_ids': [8774, 6, 48, 80, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later), you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

Instead of one sentence, we can pass along a list of sentences:

In [13]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets:

In [14]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this one sentence!", "This is another sentence."]))

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}


If you are using one of the five T5 checkpoints we have to prefix the inputs with "summarize:" (the model can also translate and it needs the prefix to know which task it has to perform).

In [15]:
if model_checkpoint in ["t5-small", "t5-base", "t5-larg", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This will ensure that an input longer that what the model selected can handle will be truncated to the maximum length accepted by the model. The padding will be dealt with later on (in a data collator) so we pad examples to the longest length in the batch and not the whole dataset.

In [16]:
max_input_length = 1024
max_target_length = 128


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=max_target_length, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [17]:
preprocess_function(raw_datasets["train"][:2])

{'input_ids': [[21603, 10, 17716, 2279, 43, 5229, 128, 1410, 18, 390, 1508, 28, 5146, 12, 10256, 5, 96, 196, 31, 162, 373, 1800, 3, 18, 11, 48, 19, 28, 82, 22209, 3, 547, 30, 230, 117, 48, 19, 59, 1719, 42, 549, 8503, 3, 18, 27, 31, 26, 1066, 1492, 24, 540, 30, 2627, 1508, 16, 10256, 976, 243, 28571, 5, 37, 549, 8503, 795, 17586, 51, 12, 8, 3069, 11, 3996, 13606, 51, 639, 45, 8, 6266, 5, 18263, 10256, 11, 2390, 11, 7262, 10371, 7, 3971, 18, 17114, 28571, 1632, 549, 8503, 13404, 30, 2818, 1401, 1797, 6, 7229, 53, 20, 12151, 1955, 8356, 49, 53, 826, 3, 19585, 643, 9768, 5, 216, 19, 230, 3122, 3, 9, 2103, 1059, 12, 1175, 112, 1075, 38, 24260, 350, 16103, 10282, 7, 5752, 4297, 227, 271, 3, 11060, 30, 12, 8, 549, 8503, 1476, 16, 1600, 5, 28571, 47, 859, 8, 1374, 5638, 859, 10282, 7, 6, 411, 7, 2026, 63, 7, 6, 14586, 7677, 11, 26911, 2419, 7, 4298, 113, 130, 10960, 52, 26786, 16, 3, 9, 813, 11674, 11044, 28, 8, 549, 8503, 24, 3492, 16, 3, 9, 3996, 3328, 51, 1154, 16, 1660, 48, 215, 5, 86, 8,

To apply this function on all the pairs of sentences in our dataset, we just use the `map` method of our `dataset` object we created earlier. This will apply the function on all the elements of all the splits in `dataset`, so our training, validation and testing data will be preprocessed in one single command.

In [18]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

  0%|          | 0/205 [00:00<?, ?ba/s]

  0%|          | 0/12 [00:00<?, ?ba/s]

  0%|          | 0/12 [00:00<?, ?ba/s]

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to map has changed (and thus requires to not use the cache data). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files, you can pass `load_from_cache_file=False` in the call to `map` to not use the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [19]:
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

2021-10-21 13:49:08.571263: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-21 13:49:08.577018: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-21 13:49:08.578028: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-21 13:49:08.579417: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags

Note that  we don't get a warning like in our classification example. This means we used all the weights of the pretrained model and there is no randomly initialized head in this case.

Next we set some parameters like the learning rate and the `batch_size`and customize the weight decay. 

The last two arguments are to setup everything so we can push the model to the [Hub](https://huggingface.co/models) at the end of training. Remove the two of them if you didn't follow the installation steps at the top of the notebook, otherwise you can change the value of push_to_hub_model_id to something you would prefer.

In [20]:
batch_size = 8
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 1

model_name = model_checkpoint.split("/")[-1]
push_to_hub_model_id = f"{model_name}-finetuned-xsum"

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Note that our data collators are multi-framework, so make sure you set `return_tensors='tf'` so you get `tf.Tensor` objects back and not something else!

In [21]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

In [22]:
tokenized_datasets["train"]

Dataset({
    features: ['attention_mask', 'document', 'id', 'input_ids', 'labels', 'summary'],
    num_rows: 204045
})

Now we convert our input datasets to TF datasets using this collator. There's a built-in method for this: `to_tf_dataset()`. Make sure to specify the collator we just created as our `collate_fn`!

In [23]:
train_dataset = tokenized_datasets["train"].to_tf_dataset(
    batch_size=batch_size,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
validation_dataset = tokenized_datasets["validation"].to_tf_dataset(
    batch_size=8,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)

Now we initialize our loss and optimizer and compile the model. Note that most Transformers models compute loss internally - we can train on this as our loss value simply by not specifying a loss when we `compile()`.

In [24]:
from transformers import AdamWeightDecay
import tensorflow as tf

optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! Please ensure your labels are passed as the 'labels' key of the input dict so that they are accessible to the model during the forward pass. To disable this behaviour, please pass a loss argument, or explicitly pass loss=None if you do not want your model to compute a loss.


Now we can train our model. We can also add a callback to sync up our model with the Hub - this allows us to resume training from other machines and even test the model's inference quality midway through training! Make sure to change the `username` if you do. If you don't want to do this, simply remove the callbacks argument in the call to `fit()`.

In [26]:
from transformers.keras_callbacks import PushToHubCallback

callback = PushToHubCallback(
    output_dir="./summarization_model_save",
    tokenizer=tokenizer,
    hub_model_id=push_to_hub_model_id,
)

model.fit(train_dataset, validation_data=validation_dataset, epochs=1, callbacks=[callback])

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

/home/matt/PycharmProjects/notebooks/examples/summarization_model_save is already a clone of https://huggingface.co/Rocketknight1/t5-small-finetuned-xsum. Make sure you pull the latest changes with `repo.git_pull()`.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

2021-10-21 13:50:24.357782: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

Upload file tf_model.h5:   0%|          | 32.0k/231M [00:00<?, ?B/s]

To https://huggingface.co/Rocketknight1/t5-small-finetuned-xsum
   d4fe052..2ba19cf  main -> main



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


<keras.callbacks.History at 0x7f2d94409100>

Hopefully you saw your loss value declining as training continued, but that doesn't really tell us much about the quality of the model. Let's use the ROUGE metric we loaded earlier to quantify our model's ability in more detail. First we need to get the model's predictions for the validation set.

In [27]:
import numpy as np

decoded_predictions = []
decoded_labels = []
for batch in validation_dataset:
    labels = batch["labels"]
    predictions = model.predict_on_batch(batch)["logits"]
    predicted_tokens = np.argmax(predictions, axis=-1)
    decoded_predictions.extend(
        tokenizer.batch_decode(predicted_tokens, skip_special_tokens=True)
    )
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels.extend(tokenizer.batch_decode(labels, skip_special_tokens=True))

Now we need to prepare the data as the metric expects, with one sentence per line.

In [28]:
import nltk
import numpy as np

# Rouge expects a newline after each sentence
decoded_predictions = [
    "\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_predictions
]
decoded_labels = [
    "\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels
]

result = metric.compute(
    predictions=decoded_predictions, references=decoded_labels, use_stemmer=True
)
# Extract a few results
result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

# Add mean generated length
prediction_lens = [
    np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions
]
result["gen_len"] = np.mean(prediction_lens)

print({k: round(v, 4) for k, v in result.items()})

{'rouge1': 37.5294, 'rouge2': 14.0285, 'rougeL': 34.4575, 'rougeLsum': 35.1888, 'gen_len': 1092352.0}


If you used the callback above, you can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("your-username/my-awesome-model")
```