If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets as well as other dependencies. Uncomment the following cell and run it.

In [1]:
#! pip install datasets transformers rouge-score nltk

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then uncomment the following cell and enter your credentials when prompted (this only works on Colab; in a regular notebook, you need to do this in a terminal):

In [2]:
# !huggingface-cli login

Then you need to install Git-LFS and set up Git if you haven't already. Uncomment the following instructions and adapt them with your name and email:

In [3]:
# !pip install hf-lfs
# !git config --global user.email "you@example.com"
# !git config --global user.name "Your Name"

Make sure your version of Transformers is at least 4.8.1 since the functionality was introduced in that version:

In [4]:
import transformers

print(transformers.__version__)

4.11.0.dev0
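
If you would rather check the minimum version programmatically than read it off the printed output, a minimal sketch could look like the following. The helper names `release_tuple` and `is_at_least` are hypothetical (they are not part of Transformers), and the comparison deliberately ignores suffixes such as `.dev0` or `rc1`, so treat it as a rough guard rather than a full version parser.

```python
import re


def release_tuple(version):
    """Return the leading numeric components, e.g. '4.11.0.dev0' -> (4, 11, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            # Stop at the first non-numeric piece such as 'dev0'.
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_least(installed, minimum):
    """Compare two version strings on their numeric release components only."""
    return release_tuple(installed) >= release_tuple(minimum)


# The version string printed above, checked against the stated minimum.
print(is_at_least("4.11.0.dev0", "4.8.1"))
```

In the notebook you would pass `transformers.__version__` as the first argument.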


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).

# Fine-tuning a model on a summarization task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models for a summarization task. We will use the [XSum dataset](https://arxiv.org/pdf/1808.08745.pdf) (for extreme summarization), which contains BBC articles accompanied by single-sentence summaries.

![Widget inference on a summarization task](images/summarization.png)

We will see how to easily load the dataset for this task using 🤗 Datasets and how to fine-tune a model on it using the Keras API.

In [5]:
model_checkpoint = "t5-small"

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`t5-small`](https://huggingface.co/t5-small) checkpoint.

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [6]:
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")

Using custom data configuration default
Reusing dataset xsum (/home/matt/.cache/huggingface/datasets/xsum/default/1.2.0/4957825a982999fbf80bca0b342793b01b2611e021ef589fb7c6250b3577b499)


  0%|          | 0/3 [00:00<?, ?it/s]

The `raw_datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation and test sets:

In [7]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

To access an actual element, you need to select a split first, then give an index:

In [8]:
raw_datasets["train"][0]

{'document': 'Recent reports have linked some France-based players with returns to Wales.\n"I\'ve always felt - and this is with my rugby hat on now; this is not region or WRU - I\'d rather spend that money on keeping players in Wales," said Davies.\nThe WRU provides £2m to the fund and £1.3m comes from the regions.\nFormer Wales and British and Irish Lions fly-half Davies became WRU chairman on Tuesday 21 October, succeeding deposed David Pickering following governing body elections.\nHe is now serving a notice period to leave his role as Newport Gwent Dragons chief executive after being voted on to the WRU board in September.\nDavies was among the leading figures among Dragons, Ospreys, Scarlets and Cardiff Blues officials who were embroiled in a protracted dispute with the WRU that ended in a £60m deal in August this year.\nIn the wake of that deal being done, Davies said the £3.3m should be spent on ensuring current Wales-based stars remain there.\nIn recent weeks, Racing Metro fla

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [9]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    # Sample distinct row indices without replacement.
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [10]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,document,summary,id
0,"The parliament too was in uproar with opposition MPs shouting slogans accusing PM Narendra Modi's government of failing to protect the Dalits.\nThe four Dalit men were assaulted while trying to skin a dead cow.\nMany Hindus consider cows sacred and the slaughter of the animal is banned in many Indian states.\nLast year, a Muslim man was lynched by a violent mob that attacked his house over allegations that his family had been storing and consuming beef at home.\nThere have several other attacks across India where Muslim men have been accused of eating or smuggling beef.\nA video of the four Dalit men, believed to be tannery workers, being stripped and beaten with sticks allegedly by the members of a Hindu hardline group last week in Gujarat's Una town has gone viral and sparked massive protests by Dalit groups.\nOn Wednesday, groups of protesters were seen walking around the streets of Ahmedabad city, armed with wooden sticks and shouting slogans.\nProtests, which began on Monday, have now spread to several parts of the state.\nProtesters have set government buses on fire, blocked a national highway and clashed with the police.\nOn Tuesday, police fired teargas shells and used sticks to control the stone-pelting mobs.\nOne senior officer was killed and several others were wounded in the clash.\nHundreds of people have been detained.\nGujarat Chief Minister Anandiben Patel has said that her government was committed to protecting the Dalits and ordered an inquiry into the incident. Four policemen have been suspended.",Protests are continuing in the Indian state of Gujarat over last week's beating up of four low-caste Dalit men by cow protection vigilantes.,36844782
1,"The sites went offline at 22:00 local time (15:00 GMT) on Wednesday. Access was restored by Thursday morning.\nIt appeared to be a protest against the government's plan to limit access to sites deemed inappropriate.\nTens of thousands of people have signed a petition against the proposal they call the ""Great Firewall of Thailand"".\nThe name is a reference to the so-called ""Great Firewall of China"" commonly used to refer to the Chinese government's censorship over internet content.\nA DDoS attack works by exceeding a website's capacity to handle internet traffic. They are usually orchestrated by a program or bot.\nBut on Wednesday, calls went out on social media in Thailand encouraging people to visit the websites and repeatedly refresh them.\nAmong the targets were the site of the ministry of information, communications and technology (ICT) and the main government website thaigov.go.th.\nICT Deputy Permanent Secretary Somsak Khaosuwan said the site did not crash because of an attack but because it was overloaded by visitors checking to see whether and attack was happening, the Bangkok Post reports.\nSince seizing power, the Thai military government has increased censorship, blocked websites and criminally charged critics for comments made online.\nNews it was planning to set up a single government-controlled gateway as a ""tool to control inappropriate websites and information flows from other countries"" emerged last month.\nInternet gateways are the point at which countries connect to the world wide web.\nThe cabinet had ordered a single gateway to be imposed in order to block ""inappropriate websites"" and control the flow of information from overseas. That the decision, made at a cabinet meeting on 30 June, was kept secret has caused more alarm.\nA statement by Minister for Information Uttama Savanayana that the decision was not yet final, and that the single gateway was only intended to reduce the cost of internet access. 
This was met with disbelief by many Thais, and then the shutdown of government websites.\nThai netizens insist this is not an attack, but a form of civil disobedience. The military may still push ahead with its firewall, whatever the opposition. The need for control, as it confronts the task of managing a sensitive royal succession, will probably trump any concerns it may have for the digital economy.\nThailand used to have just a single gateway but slow internet speeds led to the liberalisation of the industry and today there are 10, operated by private and state-owned companies.\nThe apparent attack renewed the vibrant debate over the single gateway plan on social media, with many users declaring the end of privacy.\n""Thailand is developing. Thailand is developing into North Korea,"" one Twitter user said.\n""I personally & professionally support free flow of information & fair competition on ICTs,"" said Supinya Klangnarong from the National Broadcasting and Telecommunication Commissions (NBTC) on Facebook.\n""Hope NBTC's website won't be attacked tonight. An open debate is definitely better than a cyber warfare. Voices of reason shall be heard.""","Several Thai government websites have been hit by a suspected distributed-denial-of-service (DDoS) attack, making them impossible to access.",34409343
2,"But, with France facing Portugal in the Euro 2016 final on Sunday - the same day as the British Grand Prix - we thought we would take F1's biggest football fans and re-imagine them as players from the beautiful game.\nDisagree with our choices? Have some better F1-football comparisons? Use #bbcf1 to get involved.\nDon't forget, we will have live text coverage of the British Grand Prix from 11:00 BST on the BBC Sport website on Sunday, and there will be live commentary on BBC Radio 5 live from 12:30 BST, with the race at 13:00 BST.\nCan Hamilton become a Silverstone legend? British GP preview\nA master of his time? The purists will tell you Alonso is one of the most gifted drivers ever. And might describe Alonso's style as ""total F1"".\nSupports: Real Oviedo\nBrilliant, creative, maverick and susceptible to the occasional strop. He's yet to throw a microphone into a lake, mind...\nSupports: Arsenal\nWith 28 wins from 46 poles, Seb's conversion rate is like no other. Wins on the big occasions - every time. Stick him up front - he's a born finisher.\nSupports: Eintracht Frankfurt\nWith three different teams in just six years, a journeyman who will make a nuisance of himself in the midfield. Lovely hair, too.\nSupports: Club America\nSo talented, such a bright future. But at 22 years old could it be over at such a young age? After being demoted from Red Bull to Toro Rosso, the rumours are he might not even be offered a seat next season. Like Bentley, who retired at 29, has his time come early?\nSupports: Roma\nDependable, solid and has a magic moment in the locker but, like countryman and former Germany utility man Lahm, Hulkenberg is capable of finishing up in any number of positions.\nSupports: Bayern Munich\nQuick, powerful and, like when Carlos was signed by Real Madrid in 1996, could be set for a big-money move any day now.\nSupports: Real Madrid\nPlenty of flair and might come up with something surprising now and again. 
Although equally capable of colliding with his own team-mate.\nSupports: Botafogo\nJolyon's father Jonathan was pretty good at driving racing cars, and Palmer the younger is yet to prove he has the pedigree of daddio. Only time will tell if he will end up as the Alex Bruce - who played more than 100 games for Jolyon's favourite team Ipswich - to dad and former Manchester United defender Steve Bruce.\nSupports: Ipswich Town\nLooked like he was set for the lower leagues after leaving Sauber in 2014 to sit on the bench for Ferrari. But, much like Drinkwater leaving Manchester United for Leicester, a move to plucky underdogs Haas could well reinvent his career. You can probably get odds of 2,000-1 for him to the win title as well.\nSupports: Rayados de Monterrey\nLike Kasper, Nico is an extremely talented man when he puts the gloves on, but is quite a way behind his dad when it comes to titles won. Also, as events in Austria showed, he'll make it very hard for anyone to get past him - and will stubbornly stay on his line.\nSupports: Bayern Munich\nAt 36, Massa is the daddy of the line-up, and while his legs are far from giving up on him, there's still a feeling that his career - with 11 wins from 238 races - is slowly winding down. So it's only right he uses his experience to orchestrate from the sidelines. Little Phil will need to grow a moustache though.\nSupports: Sao Paolo","When you squeeze Formula 1 drivers into a football team, it gives a whole new meaning to the term formation lap.",36572418
3,"Early diagnosis in children can prevent a possibly life-threatening condition, called diabetic ketoacidosis.\nDKA happens when a severe lack of insulin leads to the body starting to break down other tissue as an alternative energy source to glucose.\nAbout one in four children diagnosed with type 1 diabetes already have DKA.\nWarning signs of type 1 diabetes can include increased thirst, feeling more tired, losing weight and needing to go to the toilet more often.\nJane-Claire Judson, director of Diabetes Scotland, said: ""A diagnosis of type 1 diabetes is a lot for any child and their family to take in and respond to.\n""It fundamentally changes a child's life and has significant repercussions for the family and how they live their lives.\n""What can make this transition even harder is if your child's symptoms are not picked up early and they experience severe diabetic ketoacidosis.""\nShe added: ""This is an avoidable situation and one that is traumatic and can have long-lasting impact on the child and the family.\n""DKA can lead to coma and brain damage. GPs will see more children displaying the signs and symptoms of type 1 diabetes than they will meningitis, and yet awareness of type 1 is lower.""\nScotland has the fifth highest incidence of type 1 diabetes globally and this is increasing by about 3% a year in common with most western countries.\nThe condition is not associated with lifestyle factors and the reasons why rates are increasing are not fully understood.\nPublic Health Minister Maureen Watt said: ""Sadly, there are still children who are seriously ill by the time they are diagnosed with onset type 1 diabetes.\n""This causes unnecessary suffering to them and to their families. 
By spotting the early warning signs and getting tested, all this can be avoided.\n""If your child has lost weight, is going to the toilet more often, is feeling constantly tired or is more thirsty, take them to the GP as soon as you can.\n""Your doctor will carry out a simple test and, if necessary, they will be referred to a specialist.""",A campaign has been launched in Scotland to encourage warning signs of type 1 diabetes to be spotted earlier in children.,34953765
4,"The Saints took an early lead when Manolo Gabbiadini blasted in an opener on his debut after a £14m move from Napoli on transfer deadline day.\nBut Andy Carroll equalised when he finished after a fine pass from Pedro Obiang, who then put the Hammers ahead with a 30-yard low effort.\nWest Ham grabbed a third as Mark Noble's free-kick deflected in off Saints midfielder Steve Davis.\nWhat a way to introduce yourself to your new club as Italy international Gabbiadini only needed 12 minutes to open his goalscoring account after his move from Serie A.\nIt came when he collected Jay Rodriguez's lofted ball over a high West Ham defence and the 25-year-old timed his run perfectly to beat the Hammers offside trap.\nHe then ran into the penalty area and blasted the ball from a tight angle past Darren Randolph.\nHowever, Gabbiadini should have done better later on when he shot over the bar after Cheikhou Kouyate's misjudged header had gifted the Italian a chance.\nWith club record signing Sofiane Boufal limping off in the second half with a foot injury, Gabbiadini will need to show his best form if the Saints are to stay away from the relegation zone.\nBut Saints fans may be getting nervous. They are now only seven points above the bottom three, with six defeats in their last seven league games, and they were not good enough against West Ham.\nWhat is the reason for Southampton's slump? 
Well, they are defending badly and have now conceded 10 goals in their last three matches in all competitions.\nTheir former captain Jose Fonte looked assured in the West Ham defence after his £8m move in January and with influential defender Virgil van Dijk out for up to three months with an ankle injury, the Saints look short of options at the back.\nThey provided little resistance as West Ham, who moved up to ninth in the table, equalised within two minutes of conceding the opener.\nThe Saints backline were caught square and Obiang was allowed to slot a pass through to Carroll, who then registered his fourth goal in as many games.\nJust before the break, Obiang was given time and space to drill in a low shot from 30 yards out, which went through a crowded penalty area and goalkeeper Fraser Forster could not react in time.\nIf the first two West Ham goals came from poor Southampton defending, the third was down to bad luck. Noble's free-kick was on target and heading straight at Forster before Davis' swipe at the ball steered it into his own net.\nSouthampton did force Hammers goalkeeper Darren Randolph into a number of routine saves late in the second half, but it made no difference to the result.\nSouthampton manager Claude Puel said: ""We started well - a fantastic goal from Gabbiadini - but after that we made mistakes and it was difficult.\n""In the second half we had many chances but without the possibility to come back. We have the quality to play, to score, now we need to keep a clean sheet. We have to correct this and find a clean sheet and find confidence about the situation.""\nOn Italian striker Manolo Gabbiadini, who scored on his debut, Puel added: ""I'm happy with his first game. We saw a great player for the future. He is technical, gives solutions, sees the game and is a very interesting player.""\nOn their defensive problems, the Frenchman said: ""It is important to give confidence to the squad about the defensive chances. 
We can't replace the best defender (the injured Virgil Van Dijk) in the Premier League.""\nWest Ham manager Slaven Bilic: ""The guys were fantastic, we had a gameplan and they executed it in the best way. We were solid behind the ball, we kept the ball, attacked well and it was a great team performance.\n""We have a team that is working hard for each other. We have a brilliant atmosphere in the dressing room, not because Payet left, but because we have won six of nine. It is a crazy league and there are 42 points to play for. There are crazy results and that is why we have to keep playing as we are.""\nOn Andy Carroll, who scored his fourth goal in four games, Bilic added: ""He is a matured man. He is happy, stable, has got three kids. The key is that and the number of training sessions. The best prevention of injuries is training.""\nWest Ham's Jose Fonte, who handed in a transfer request at Southampton before joining the London club, said: ""The move is still fresh and only a couple of weeks ago I was still a Southampton player.\n""No doubt it was tough, but the main thing was to stay focused on the game and do my job the best I could.\n""It was almost eight seasons so it was tough but with the help of my team-mates and West Ham supporters the most important thing was achieving what we got - the three points.\n""I always gave my best for Southampton - my sweat and blood - so my conscience is clear. The past was good but now I look forward to the new challenge ahead of me.""\nSouthampton play away at the Premier League's bottom team, Sunderland, on Saturday, 11 February (15:00 GMT), while West Ham entertain West Brom at the same time.\nMatch ends, Southampton 1, West Ham United 3.\nSecond Half ends, Southampton 1, West Ham United 3.\nAttempt missed. Nathan Redmond (Southampton) right footed shot from outside the box is high and wide to the right.\nAttempt missed. Manolo Gabbiadini (Southampton) header from the centre of the box is high and wide to the right. 
Assisted by James Ward-Prowse with a cross.\nAttempt missed. Michail Antonio (West Ham United) right footed shot from outside the box misses to the left. Assisted by Robert Snodgrass.\nSubstitution, West Ham United. Jonathan Calleri replaces Sofiane Feghouli.\nAttempt saved. Steven Davis (Southampton) left footed shot from the centre of the box is saved in the bottom right corner.\nOriol Romeu (Southampton) wins a free kick in the defensive half.\nFoul by Robert Snodgrass (West Ham United).\nAttempt missed. James Ward-Prowse (Southampton) right footed shot from outside the box is too high from a direct free kick.\nWinston Reid (West Ham United) is shown the yellow card for a bad foul.\nShane Long (Southampton) wins a free kick in the attacking half.\nFoul by Winston Reid (West Ham United).\nOriol Romeu (Southampton) wins a free kick in the defensive half.\nFoul by Robert Snodgrass (West Ham United).\nAttempt missed. Cédric Soares (Southampton) left footed shot from outside the box is high and wide to the left.\nFoul by Shane Long (Southampton).\nJames Collins (West Ham United) wins a free kick on the right wing.\nFoul by James Ward-Prowse (Southampton).\nWinston Reid (West Ham United) wins a free kick in the defensive half.\nFoul by Nathan Redmond (Southampton).\nSofiane Feghouli (West Ham United) wins a free kick on the right wing.\nFoul by Oriol Romeu (Southampton).\nRobert Snodgrass (West Ham United) wins a free kick in the defensive half.\nOffside, Southampton. Oriol Romeu tries a through ball, but Ryan Bertrand is caught offside.\nAaron Cresswell (West Ham United) is shown the yellow card.\nFoul by Manolo Gabbiadini (Southampton).\nPedro Obiang (West Ham United) wins a free kick in the defensive half.\nShane Long (Southampton) wins a free kick on the left wing.\nFoul by Mark Noble (West Ham United).\nAttempt missed. Maya Yoshida (Southampton) right footed shot from the centre of the box is too high. 
Assisted by Nathan Redmond with a cross following a corner.\nCorner, Southampton. Conceded by James Collins.\nAttempt missed. Nathan Redmond (Southampton) header from the centre of the box misses to the left. Assisted by Cédric Soares with a cross.\nFoul by Nathan Redmond (Southampton).\nMark Noble (West Ham United) wins a free kick on the right wing.\nAttempt saved. Steven Davis (Southampton) right footed shot from outside the box is saved in the bottom left corner.\nFoul by James Ward-Prowse (Southampton).\nMark Noble (West Ham United) wins a free kick in the attacking half.\nAttempt missed. Shane Long (Southampton) left footed shot from outside the box misses to the left. Assisted by Manolo Gabbiadini.\nAttempt missed. Steven Davis (Southampton) right footed shot from the right side of the box misses to the right following a corner.",West Ham came from behind to beat Southampton and move into the top half of the Premier League.,38779644


The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [11]:
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each predictions
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"\n"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_agregator: Return aggregates if this is set to True
Retu

You can call its `compute` method with your predictions and labels, which need to be lists of decoded strings:

In [12]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hello there", "general kenobi"]
metric.compute(predictions=fake_preds, references=fake_labels)

{'rouge1': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rouge2': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeL': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeLsum': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}
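
To build some intuition for the numbers above, here is a simplified sketch of ROUGE-1 (unigram overlap) for a single prediction and reference. It is not the implementation behind `load_metric("rouge")`: it assumes plain whitespace tokenization and lowercasing, and it skips stemming and the low/mid/high aggregation shown in the output. The function name `rouge1` is hypothetical.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Return (precision, recall, fmeasure) based on clipped unigram overlap."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps, per token, the minimum of the two counts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


print(rouge1("hello there", "hello there"))
print(rouge1("the cat sat", "the cat sat down"))
```

Identical strings score 1.0 on all three values, which matches the perfect scores of the fake predictions above; a prediction that misses a reference word keeps full precision but loses recall.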

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in the format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [13]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

By default, the call above will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [14]:
tokenizer("Hello, this one sentence!")

{'input_ids': [8774, 6, 48, 80, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later). You can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

Instead of one sentence, we can pass along a list of sentences:

In [15]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets:

In [16]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this one sentence!", "This is another sentence."]))

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}


If you are using one of the five T5 checkpoints, you have to prefix the inputs with "summarize:" (the model can also translate, and it needs the prefix to know which task it has to perform).

In [17]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than what the selected model can handle will be truncated to the maximum length accepted by the model. The padding will be dealt with later on (in a data collator), so we pad examples to the longest length in the batch and not in the whole dataset.

In [18]:
max_input_length = 1024
max_target_length = 128


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=max_target_length, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
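
The split between truncating now and padding later can be illustrated with plain Python lists. This is only a sketch of the idea, not what the tokenizer or the data collator do internally: the helpers `truncate` and `pad_batch` are hypothetical, and `pad_id=0` is an assumption made for the example. A real tokenizer also typically keeps the end-of-sequence token when it truncates, which this sketch does not.

```python
def truncate(ids, max_length):
    """Keep at most max_length token IDs."""
    return ids[:max_length]


def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in batch
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the two-sentence tokenizer output shown earlier.
batch = [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]]
padded = pad_batch(batch)
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
print(truncate(batch[0], 4))
```

Because padding is computed per batch, a batch of short examples stays short instead of being stretched to the longest example in the whole dataset.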

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [19]:
preprocess_function(raw_datasets["train"][:2])

{'input_ids': [[21603, 10, 17716, 2279, 43, 5229, 128, 1410, 18, 390, 1508, 28, 5146, 12, 10256, 5, 96, 196, 31, 162, 373, 1800, 3, 18, 11, 48, 19, 28, 82, 22209, 3, 547, 30, 230, 117, 48, 19, 59, 1719, 42, 549, 8503, 3, 18, 27, 31, 26, 1066, 1492, 24, 540, 30, 2627, 1508, 16, 10256, 976, 243, 28571, 5, 37, 549, 8503, 795, 17586, 51, 12, 8, 3069, 11, 3996, 13606, 51, 639, 45, 8, 6266, 5, 18263, 10256, 11, 2390, 11, 7262, 10371, 7, 3971, 18, 17114, 28571, 1632, 549, 8503, 13404, 30, 2818, 1401, 1797, 6, 7229, 53, 20, 12151, 1955, 8356, 49, 53, 826, 3, 19585, 643, 9768, 5, 216, 19, 230, 3122, 3, 9, 2103, 1059, 12, 1175, 112, 1075, 38, 24260, 350, 16103, 10282, 7, 5752, 4297, 227, 271, 3, 11060, 30, 12, 8, 549, 8503, 1476, 16, 1600, 5, 28571, 47, 859, 8, 1374, 5638, 859, 10282, 7, 6, 411, 7, 2026, 63, 7, 6, 14586, 7677, 11, 26911, 2419, 7, 4298, 113, 130, 10960, 52, 26786, 16, 3, 9, 813, 11674, 11044, 28, 8, 549, 8503, 24, 3492, 16, 3, 9, 3996, 3328, 51, 1154, 16, 1660, 48, 215, 5, 86, 8,

To apply this function to all the examples in our dataset, we just use the `map` method of the `raw_datasets` object we created earlier. This will apply the function to all the elements of all the splits in `raw_datasets`, so our training, validation and testing data will be preprocessed in one single command.

In [20]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

Loading cached processed dataset at /home/matt/.cache/huggingface/datasets/xsum/default/1.2.0/4957825a982999fbf80bca0b342793b01b2611e021ef589fb7c6250b3577b499/cache-055d7cd8d2783e3a.arrow
Loading cached processed dataset at /home/matt/.cache/huggingface/datasets/xsum/default/1.2.0/4957825a982999fbf80bca0b342793b01b2611e021ef589fb7c6250b3577b499/cache-3787dfe361232fda.arrow
Loading cached processed dataset at /home/matt/.cache/huggingface/datasets/xsum/default/1.2.0/4957825a982999fbf80bca0b342793b01b2611e021ef589fb7c6250b3577b499/cache-f8a1e29ea937ee19.arrow


Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and the cached data should therefore not be reused). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.
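To make the `batched=True` contract concrete, here is a minimal pure-Python sketch. It does not call 🤗 Datasets: `toy_tokenize`, `toy_preprocess` and `map_batched` are hypothetical stand-ins written for illustration only. The point is that, with batching, the function receives a dict of lists (a slice of each column) and returns a dict of lists.

```python
# Illustrative only: a toy stand-in for what a batched `map` passes to the
# preprocessing function. None of these helpers belong to 🤗 Datasets.


def toy_tokenize(text):
    """Hypothetical tokenizer: one fake id (the word length) per word."""
    return [len(word) for word in text.split()]


def toy_preprocess(examples):
    """Receives a dict of lists (one list per column), returns a dict of lists."""
    return {"input_ids": [toy_tokenize(doc) for doc in examples["document"]]}


def map_batched(columns, function, batch_size):
    """Apply `function` to consecutive slices of `columns` and concatenate the results."""
    num_rows = len(next(iter(columns.values())))
    output = {}
    for start in range(0, num_rows, batch_size):
        batch = {
            name: values[start : start + batch_size]
            for name, values in columns.items()
        }
        for name, values in function(batch).items():
            output.setdefault(name, []).extend(values)
    return output


columns = {"document": ["a short text", "another slightly longer text", "third"]}
encoded = map_batched(columns, toy_preprocess, batch_size=2)
print(encoded)
```

A real fast tokenizer benefits from this layout because it can process the whole list of texts in one call instead of being invoked once per example.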

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since summarization is a sequence-to-sequence task, we use the `TFAutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [21]:
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

2021-09-13 16:12:00.808738: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-13 16:12:00.815538: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-13 16:12:00.816536: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-13 16:12:00.818157: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags

Note that we don't get a warning like in our classification example. This means we used all the weights of the pretrained model and there is no randomly initialized head in this case.

Next we set some parameters like the learning rate and the `batch_size`, and customize the weight decay.

The last two variables set everything up so we can push the model to the [Hub](https://huggingface.co/models) at the end of training. Remove both of them if you didn't follow the installation steps at the top of the notebook; otherwise you can change the value of `push_to_hub_model_id` to something you would prefer.

In [22]:
batch_size = 8
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 1

model_name = model_checkpoint.split("/")[-1]
push_to_hub_model_id = f"{model_name}-finetuned-xsum"
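The `split("/")[-1]` expression keeps only the part of the checkpoint identifier that follows an organization prefix, if there is one. A quick check, using a hypothetical helper `hub_model_id` that mirrors the two lines above and two identifiers chosen purely for illustration:

```python
def hub_model_id(checkpoint, suffix="finetuned-xsum"):
    # Hypothetical helper mirroring the two lines above.
    model_name = checkpoint.split("/")[-1]  # drop any "organization/" prefix
    return f"{model_name}-{suffix}"


# Illustrative identifiers, not necessarily the checkpoint used in this notebook
print(hub_model_id("t5-small"))
print(hub_model_id("some-org/some-model"))
```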

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Note that our data collators are multi-framework, so make sure you set `return_tensors='tf'` so you get `tf.Tensor` objects back and not something else!

In [23]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
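As a rough picture of what such a collator does, the NumPy sketch below pads a batch of variable-length examples. It is a simplified illustration, not the library's implementation: `pad_batch` is a hypothetical helper, the pad token id of 0 is an assumption, and labels are padded with -100, the value that the decoding cell later in this notebook swaps back to the pad token id.

```python
import numpy as np


def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Hypothetical helper: right-pad inputs and labels to the longest item in the batch."""
    max_input_len = max(len(f["input_ids"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        n_label_pad = max_label_len - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return {key: np.array(value) for key, value in batch.items()}


# Made-up token ids for two examples of different lengths
features = [
    {"input_ids": [21603, 10, 17716, 1], "labels": [37, 1]},
    {"input_ids": [21603, 10, 1], "labels": [71, 126, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```

Padding to the longest example in each batch, rather than to a fixed global length, keeps the tensors as small as the batch allows.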

In [24]:
tokenized_datasets["train"]

Dataset({
    features: ['attention_mask', 'document', 'id', 'input_ids', 'labels', 'summary'],
    num_rows: 204045
})

Now we convert our input datasets to TF datasets using this collator. There's a built-in method for this: `to_tf_dataset()`. Make sure to specify the collator we just created as our `collate_fn`!

In [25]:
train_dataset = tokenized_datasets["train"].to_tf_dataset(
    batch_size=batch_size,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
validation_dataset = tokenized_datasets["validation"].to_tf_dataset(
    batch_size=batch_size,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)
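For a sense of scale, the number of training batches per epoch follows from the 204,045 training rows shown above and the batch size of 8. Whether the final partial batch is kept depends on how `to_tf_dataset()` handles the remainder in your installed version, so the sketch below simply computes both cases.

```python
import math

num_rows = 204045  # size of the training split shown above
batch_size = 8

full_batches, leftover = divmod(num_rows, batch_size)
batches_if_remainder_dropped = full_batches
batches_if_remainder_kept = math.ceil(num_rows / batch_size)

print(full_batches, leftover)
print(batches_if_remainder_kept)
```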



Now we initialize our loss and optimizer and compile the model. Note that most 🤗 Transformers models compute the loss internally, so all we need to do is create a 'dummy' loss function that passes this loss through. For the optimizer, we can use the `AdamWeightDecay` optimizer from the 🤗 Transformers library.

In [26]:
from transformers import AdamWeightDecay
import tensorflow as tf


def dummy_loss(y_true, y_pred):
    return tf.reduce_mean(y_pred)


optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model.compile(optimizer=optimizer, loss={"loss": dummy_loss})
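To see why a 'dummy' loss is enough, the NumPy sketch below mimics what the loss function receives. It assumes, as the cell above does, that the model's `loss` output arrives as `y_pred`. The numeric values are made up for illustration, and `dummy_loss_np` is a hypothetical NumPy counterpart of `dummy_loss`.

```python
import numpy as np


def dummy_loss_np(y_true, y_pred):
    # y_true is ignored entirely: the model already compared its predictions
    # with the labels internally and returned the resulting loss as y_pred.
    return np.mean(y_pred)


model_reported_loss = np.array([2.5, 3.5])  # made-up loss values from the model
placeholder_targets = np.zeros(2)  # stand-in for whatever is passed as y_true

print(dummy_loss_np(placeholder_targets, model_reported_loss))
```

Changing `placeholder_targets` has no effect on the result, which is the whole point: the value being minimized is the loss the model computed itself.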

Now we can train our model.

In [27]:
model.fit(train_dataset, validation_data=validation_dataset, epochs=num_train_epochs)

2021-09-13 16:12:02.236002: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method


<keras.callbacks.History at 0x7f5450134bb0>

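Hopefully you saw your loss value declining as training continued, but that doesn't really tell us much about the quality of the model. Let's use the ROUGE metric we loaded earlier to quantify our model's ability in more detail. First we need to get the model's predictions for the validation set. Note that the loop below takes the argmax of the logits from a single forward pass in which the labels are also fed to the model, so each token is predicted given the preceding reference tokens (teacher forcing) rather than generated freely with `generate()`; the resulting scores are therefore likely to be optimistic.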

In [31]:
import numpy as np

decoded_predictions = []
decoded_labels = []
for batch, dummy_labels in validation_dataset:
    labels = batch["labels"]
    predictions = model.predict_on_batch(batch)["logits"]
    predicted_tokens = np.argmax(predictions, axis=-1)
    decoded_predictions.extend(
        tokenizer.batch_decode(predicted_tokens, skip_special_tokens=True)
    )
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels.extend(tokenizer.batch_decode(labels, skip_special_tokens=True))
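The two array operations in this loop can be checked on a tiny example. The sketch below uses made-up logits over a five-token vocabulary and assumes a pad token id of 0. It only illustrates the `np.argmax` and `np.where` steps, not the tokenizer's decoding.

```python
import numpy as np

pad_token_id = 0  # assumed for illustration

# Made-up logits with shape (batch=1, sequence=3, vocab=5)
logits = np.array(
    [
        [
            [0.1, 0.2, 3.0, 0.0, 0.1],
            [0.0, 2.0, 0.1, 0.3, 0.2],
            [1.5, 0.1, 0.2, 0.3, 0.4],
        ]
    ]
)
# Highest-scoring token id at each position
predicted_tokens = np.argmax(logits, axis=-1)
print(predicted_tokens)

# Label positions padded with -100 are not valid token ids,
# so they are replaced with the pad id before decoding
labels = np.array([[2, 1, -100]])
clean_labels = np.where(labels != -100, labels, pad_token_id)
print(clean_labels)
```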



Now we need to prepare the data as the metric expects, with one sentence per line.

In [32]:
import nltk
import numpy as np

# Rouge expects a newline after each sentence
decoded_predictions = [
    "\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_predictions
]
decoded_labels = [
    "\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels
]

result = metric.compute(
    predictions=decoded_predictions, references=decoded_labels, use_stemmer=True
)
# Extract a few results
result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

# Add mean generated length, counted in tokens of each decoded prediction
prediction_lens = [
    len(tokenizer.tokenize(pred.replace("\n", " "))) for pred in decoded_predictions
]
result["gen_len"] = np.mean(prediction_lens)

print({k: round(v, 4) for k, v in result.items()})

{'rouge1': 37.4629, 'rouge2': 13.9471, 'rougeL': 34.3787, 'rougeLsum': 35.1249, 'gen_len': ...}
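For intuition about what these numbers measure, here is a simplified ROUGE-1 F-measure written from scratch. It is a sketch under strong assumptions: whitespace tokenization, lowercasing, no stemming, and no aggregation over the dataset (the `.mid` in the cell above picks the middle value of the aggregated score that the metric returns). `rouge1_f` is a hypothetical helper, and its output will not match the metric used above exactly.

```python
from collections import Counter


def rouge1_f(prediction, reference):
    """Hypothetical helper: unigram-overlap F-measure between two strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f("the cat sat on the mat", "the cat lay on the mat")
print(round(score * 100, 2))
```

Five of the six words in each sentence overlap, so precision and recall are both 5/6 and so is their harmonic mean. ROUGE-2 applies the same idea to pairs of consecutive words, and ROUGE-L to the longest common subsequence.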


And we're done! You can now upload the result of the training to the Hub in one line:

In [33]:
model.push_to_hub(push_to_hub_model_id)

ValueError: You must login to the Hugging Face hub on this computer by typing `transformers-cli login` and entering your credentials to use `use_auth_token=True`. Alternatively, you can pass your own token as the `use_auth_token` argument.

You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("sgugger/my-awesome-model")
```