If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets, as well as a few other dependencies. Run the following cell to install them.

In [48]:
! pip install datasets transformers rouge-score nltk
! pip install --upgrade accelerate

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
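
The warning above is emitted by the `tokenizers` library when the notebook process is forked (for example by a `!` shell command) after tokenizer parallelism has already been used. As the message itself suggests, one way to silence it is to set `TOKENIZERS_PARALLELISM` explicitly before any tokenizer is used. A minimal sketch, where the value `"false"` is an assumed choice rather than something this notebook prescribes:

```python
import os

# Assumption: we opt out of tokenizer parallelism to avoid the fork warning.
# Set this near the top of the notebook, before any tokenizer is used.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Setting the variable to `"true"` instead keeps parallelism on; either explicit value is what the warning asks for.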


If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and enter your token when prompted:

In [49]:
from huggingface_hub import notebook_login

notebook_login()


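Then you need to install Git-LFS. Run the following cell (`apt` is available on Debian-based systems such as Colab; on other systems, install Git-LFS with your platform's package manager):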

In [71]:
!apt install git-lfs

The operation couldn’t be completed. Unable to locate a Java Runtime that supports apt.
Please visit http://www.java.com for information on installing Java.



Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [51]:
import transformers

print(transformers.__version__)

4.29.1
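
Reading the printed version by eye works, but the check can also be automated. The sketch below compares version strings numerically, since a plain string comparison would wrongly rank `"4.9.2"` above `"4.11.0"`. The helper names `version_tuple` and `meets_minimum` are hypothetical and not part of Transformers; only the first three numeric components are considered, which is an assumption that ignores suffixes such as `.dev0`.

```python
import re
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '4.29.1' (or '4.30.0.dev0') into a tuple of up to three integers."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits of each piece
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.11.0") -> bool:
    """Return True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


print(meets_minimum("4.29.1"))  # the version printed above
print(meets_minimum("4.9.2"))   # older than 4.11.0 despite sorting later as a string
```

In the notebook you would call `meets_minimum(transformers.__version__)`.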


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).

We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.

In [52]:
from transformers.utils import send_example_telemetry

send_example_telemetry("summarization_notebook", framework="pytorch")

# Fine-tuning a model on a summarization task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models for a summarization task. We will use the [XSum dataset](https://arxiv.org/pdf/1808.08745.pdf) (for extreme summarization), which contains BBC articles accompanied by single-sentence summaries.

![Widget inference on a summarization task](https://github.com/huggingface/notebooks/blob/main/examples/images/summarization.png?raw=1)

We will see how to easily load the dataset for this task using 🤗 Datasets and how to fine-tune a model on it using the `Trainer` API.

In [53]:
model_checkpoint = "t5-small"

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`t5-small`](https://huggingface.co/t5-small) checkpoint.

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [54]:
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")




The `raw_datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation, and test sets:

In [55]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

To access an actual element, you need to select a split first, then give an index:

In [56]:
raw_datasets["train"][0]

 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.',
 'id': '35232142'}

To get a sense of what the data looks like, the following function shows a few examples picked at random from the dataset.

In [57]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
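
The picking loop in `show_random_elements` draws indices one at a time and retries on duplicates. As an alternative sketch (not the notebook's own method), the standard library's `random.sample` draws distinct indices in one call. The helper name `pick_unique_indices` and its `seed` parameter are hypothetical additions for illustration.

```python
import random
from typing import List, Optional


def pick_unique_indices(dataset_size: int, num_examples: int = 5,
                        seed: Optional[int] = None) -> List[int]:
    """Return `num_examples` distinct indices in the range [0, dataset_size)."""
    if num_examples > dataset_size:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # a local generator, so the global random state is untouched
    return rng.sample(range(dataset_size), num_examples)


picks = pick_unique_indices(dataset_size=100, num_examples=5, seed=0)
print(picks)
```

The resulting list can be used exactly like `picks` in the function above, for example `pd.DataFrame(dataset[picks])`.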

In [58]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,document,summary,id
0,"When he was elected Conservative leader in 2005, he was seen as the party's answer to Tony Blair, a young, modern leader, who would shake off the party's ""nasty"" image and turn it back into the election winning machine that had dominated much of 20th century British politics.\nBut despite big advances at the 2010 election, he was forced to form a coalition with the Liberal Democrats, handing ammunition to those on the right of the party who hated his brand of ""Liberal Conservatism"" and yearned for a more traditional Tory programme.\nMr Cameron's presentational skills were never in doubt.\nHis easy charm and ability to appear ""prime ministerial"" at news conferences and summits helped ensure his personal poll ratings remained well ahead of the Conservative Party's ratings.\nBut critics complained that it was difficult to pin down what he actually believed in.\nHis laid-back, almost patrician style - and tendency to surround himself with advisers from similar backgrounds - led to accusations that he was too remote from the concerns of his party's rank-and-file, some of whom drifted off to the UK Independence Party, with its traditional right-wing messages on Europe and immigration.\nThose critics have been silenced - for now.\nMr Cameron is one of the longest serving Conservative leaders in history. Only Stanley Baldwin, Lady Thatcher and Sir Winston Churchill remained longer in the top job in the modern era.\nAnd although he has shifted his political position on to more traditional Conservative ground, ditching much of the utopian talk of creating a Big Society in favour of a focus on low taxation and sound financial management, observers have noted how little being in office has changed Mr Cameron personally.\nThe one fact everyone knows about him is that he comes from a privileged background. 
He has never made a secret of it.\nNot only was he the first former pupil of Eton to hold office since the early 1960s, he can also trace his ancestry back to William IV, making him a distant relative of the Queen.\n""I'm a practical person, and pragmatic. I know where I want to get to, but I am not ideologically attached to one particular method,"" December 2005.\n""I'm going to be as radical a social reformer as Mrs Thatcher was an economic reformer,"" August 2008.\n""I don't support gay marriage in spite of being a Conservative. I support gay marriage because I am a Conservative,"" July 2013.\n""I hope they'd say I'm optimistic, I enjoy life and that I'm fun. But also that I'm quite driven in doing what I believe in,"" on what his friends would say about him,"" February 2015.\n""We will govern as a party of one nation, one United Kingdom. That means ensuring this recovery reaches all parts of our country, from north to south, to east to west,"" May 2015.\nAt 43, he was the youngest prime minister since Robert Banks Jenkinson, the 2nd Earl of Liverpool in 1812. He was six months younger than Tony Blair when he entered Downing Street in 1997.\nThe third of four children, David William Donald Cameron, was born on 9 October 1966 in London.\nHe spent the first three years of his life in Kensington and Chelsea before the family moved to an old rectory near Newbury, in Berkshire.\nMr Cameron has said he had a happy childhood, but one where ""whingeing was not on the menu"".\nHis stockbroker father Ian, who died in 2010, was born with severely deformed legs, which he eventually had to have amputated. He also lost the sight in one eye, but David's father said he never considered himself ""disabled"" and rarely complained about anything.\nMr Cameron's mother, Mary, served as a Justice of the Peace for 30 years. 
During her time on the bench she passed judgement on the Greenham Common protesters, including on one occasion her own sister, Mr Cameron revealed recently, and eco-warrior Swampy, who was protesting against the construction of the Newbury bypass.\nAt the age of seven, the young Cameron was packed off to Heatherdown, an exclusive preparatory school, which counted Princes Edward and Andrew among its pupils. Then, following in the family tradition, came Eton.\nHe has described his 12 O-levels as ""not very good"", but he gained three As at A-level, in history, history of art and economics with politics.\n""He's not someone - and most Englishmen aren't - who talks freely and easily in the open-hearted Oprahesque fashion that some do but he's extremely good company,"" friend and ministerial colleague Michael Gove\n""In my experience, Cameron never gave a straight answer when dissemblance was a plausible alternative,"" business journalist Jeff Randall\n""He's always been incredibly strong, and kind, and supportive,"" Samantha Cameron, wife\n""In good times and in bad, he's just the kind of partner that you want at your side. I trust him. He says what he does, and he does what he says,"" US President Barack Obama.\n""If there was an Olympic gold medal for chillaxing he would win it,"" Cameron ally quoted in Francis Elliott and James Hanning biography.\n""Though possessed of a first-class mind, Cameron is not a reflective politician. 
He rarely agonises over a problem, preferring to resolve dilemmas as quickly and pragmatically as he can, generally with a group of close allies,"" political journalist Matthew D'Ancona.\nHis biggest mention in the Eton school magazine came when he sprained his ankle dancing to bagpipes on a school trip to Rome.\nBefore going up to Oxford to study Philosophy, Politics and Economics he took a gap year, working initially for Sussex Conservative MP Tim Rathbone, before spending three months in Hong Kong, working for a shipping agent, and then returning by rail via the Soviet Union and Eastern Europe.\nAt Oxford, he avoided student politics because, according to one friend from the time, Steve Rathbone, ""he wanted to have a good time"".\nHe was captain of Brasenose College's tennis team and a member of the Bullingdon dining club, famed for its hard drinking and bad behaviour, a period Mr Cameron has always refused to talk about.\nHe has also consistently dodged the question of whether he took drugs at university.\nBut he evidently did not let his extra-curricular activities get in the way of his studies.\nHis tutor at Oxford, Prof Vernon Bogdanor, describes him as ""one of the ablest"" students he has taught, whose political views were ""moderate and sensible Conservative"".\nAfter gaining a first-class degree, he briefly considered a career in journalism or banking, before answering an advertisement for a job in the Conservative Research Department.\nConservative Central Office is reported to have received a telephone call on the morning of his interview in June 1988, from an unnamed male at Buckingham Palace, who said: ""I understand you are to see David Cameron.\n""I've tried everything I can to dissuade him from wasting his time on politics but I have failed. 
I am ringing to tell you that you are about to meet a truly remarkable young man.""\nMusic: Bob Dylan and indie rock such as The Killers, The Smiths, Radiohead and Pulp.\nBooks: Goodbye to All That by Robert Graves, Cider With Rosie, by Laurie Lee\nFilms: Lawrence of Arabia, The Godfather\nTV: Game of Thrones, Breaking Bad, The Killing\nLeisure time: Karaoke, playing computer games - he once revealed he was addicted to Angry Birds and Fruit Ninja\nHolidays: Turkey, France, Cornwall\nFood: Italian, home-grown vegetables, slow-roast lamb, occasional fry-up\nDrink: Red wine, real ale. Luxury item on Desert Island Discs was ""a case of malt whisky from Jura""\nSport: Playing tennis, snooker, running, watching cricket. Supports Aston Villa football club, although recently got mixed up and said West Ham\nMr Cameron says he did not know the call was being made or who made it, but it is sometimes held up by his opponents as an example of his gilded passage to the top.\nAs a researcher, Mr Cameron was seen as hard-working and bright. 
He worked with future shadow home secretary David Davis on the team briefing John Major for Prime Minister's Questions, and also hooked up with George Osborne, who would go on to be his chancellor and closest political ally.\nOther colleagues, in what became known as the ""brat pack"" were Steve Hilton, who was one of Mr Cameron's closest strategy advisers during his early days in Downing Street, and future Health Secretary Andrew Lansley.\nThese young researchers were credited with devising the attack on Labour's tax plans that unexpectedly swung the 1992 general election for John Major.\nBut the remainder of Mr Cameron's time as a backroom boy in the Conservative government was more turbulent.\nHe was poached by then Chancellor Norman Lamont as a political adviser, and was at Mr Lamont's side throughout Black Wednesday, which saw the pound crash out of the European Exchange Rate Mechanism.\nBy the early 1990s, Mr Cameron had decided he wanted to be an MP himself, but he also knew it was vital to gain experience outside of politics.\nSo after a brief spell as an adviser to then home secretary Michael Howard, he took a job in public relations with ITV television company Carlton.\nMr Cameron spent seven years at Carlton, as head of corporate communications, travelling the world with the firm's boss Michael Green, who has described him as ""board material"".\nMr Cameron went part-time from his job at Carlton in 1997 to unsuccessfully contest Stafford at that year's general election. 
Four years later, in 2001, he won the safe Conservative seat of Witney, in Oxfordshire, recently vacated by Shaun Woodward, who had defected to Labour.\nSamantha Cameron, who works as the creative director of upmarket stationery firm Smythson of Bond Street, which counts Stella McCartney, Kate Moss and Naomi Campbell among its clients, has been credited with transforming her husband's ""Tory boy"" image.\nShe has a tattoo on her ankle and went to art school in Bristol, where she says she was taught to play pool by rap star Tricky.\nThe couple were introduced by Mr Cameron's sister Clare, Samantha's best friend, at a party at the Cameron family home. They were married in 1996.\nThey have three young children, Nancy, Arthur, and Florence, who was born shortly after the family moved into Downing Street.\nTheir first child, Ivan, who was born profoundly disabled and needed round-the-clock care, died in February 2009.\nThe experience of caring for Ivan and witnessing at first hand the dedication of NHS hospital staff, is said by friends to have broadened Mr Cameron's horizons. He had, friends say, led an almost charmed life, to that point.\nOn entering Parliament in 2001, Mr Cameron rose rapidly through the ranks, serving first on the Home Affairs Select Committee, which recommended the liberalisation of drug laws.\nHe was taken under the wing of Michael Howard, who put him in charge of policy co-ordination and then made him shadow education secretary. He also had the key role of drafting the 2005 Conservative election manifesto.\nBut when he entered the race to succeed Mr Howard as party leader in 2005 few initially gave him a chance. 
He was a distant fourth at the bookmakers behind Ken Clarke, Liam Fox and frontrunner David Davis.\nIt took an eye-catching conference speech, delivered without notes, in what would become his trademark style, to change the minds of the party faithful.\nA few may have had second thoughts, when in the early months of his leadership he spoke about how some young offenders just needed love (caricatured by his opponents as his ""hug a hoodie"" speech) and was pictured with huskies in the Arctic Circle on a trip to investigate climate change.\nAt the start of his leadership, Mr Cameron was all about sunny optimism and ""sharing the proceeds of growth"". He told activists in his first party conference speech to ""let sunshine win the day"" and managed to get a round of applause for a mention of civil partnerships.\nThe media, eager for a new story after years of Tory failure and with an increasingly unpopular Labour government, gave him the glowing coverage he craved, helping him to ""decontaminate"" the Tory brand and move the party back towards the centre ground, where, the conventional wisdom has it, British elections are won and lost.\nHe ordered the party to end its obsession with Europe and tried to reposition it as the party of the environment and the NHS, as well as recruiting more women and candidates from ethnic minorities to winnable seats.\nHe also cannily used the expenses scandal that rocked Westminster to portray himself as a radical reformer bent on cleaning up politics.\nHe was rewarded with big poll leads - but the financial crisis forced Mr Cameron to ditch much of his upbeat rhetoric, in favour of a more sober, even gloomy, approach, warning voters they face tough times and spending cuts ahead.\nBut during the course of the 2010 general election campaign, he watched much of his poll lead evaporate, with the rise of Lib Dem leader Nick Clegg, a man with a similar background and smooth, telegenic manner.\nWhat's more, his big idea, the Big Society, the 
fruit of detailed policy work stretching back to the early days of his leadership, which envisaged parents setting up their own schools and groups of public sector workers forming co-operatives, failed to capture voters' imagination in the way he had hoped.\nMr Clegg was the surprise victor in the first televised prime ministerial debate ever held in Britain - an innovation Mr Cameron had pushed hard for - and although the Lib Dem leader's advantage had largely evaporated by polling day, the election ended without a clear winner.\nDespite gaining 97 seats, the Conservatives' biggest increase in decades, Mr Cameron fell just short of the majority he needed to form a government and was forced into coalition talks with Mr Clegg's Liberal Democrats.\nBefore he was elected Conservative leader in 2005, David Cameron famously described himself as the ""heir to Blair"".\nThere are certainly similarities with the way he has used a small group of modernisers to force change on a reluctant party, even if it did not, in the end, produce the same seismic effect at the ballot box.\nThe coalition he formed with Mr Clegg functioned better than anyone expected, managing to complete a full five-year term and introduce sweeping changes to the education system, the NHS, the benefits system, pensions and much else besides.\nMr Clegg took a big hit in the opinion polls over unpopular coalition policies such as the massive cuts to public spending aimed at paying off the deficit, while Mr Cameron earned praise for his statesmanlike handling of set-piece events, such as his Commons statement on the Bloody Sunday inquiry.\nTo the surprise of many, possibly including himself, the greatest difficulties he encountered were not in managing the coalition with the Lib Dems but with keeping control of the increasingly vocal and rebellious right-wing of his own party.\nHis decision to promise a referendum on Britain's membership was seen as an attempt to placate right-wingers and stem the rise of 
UKIP.\nBut there was still a sizeable minority on the Tory backbenches who did not trust Mr Cameron and hated their party's alliance with the Lib Dems.\nIn August 2013, he suffered a major blow to his authority when he became the first prime minister in more than 100 years to lose a foreign policy vote, after dozens of Conservative MPs joined forces with Labour to block his plans for military intervention in Syria.\nBut perhaps the biggest crisis of his premiership came in the run-up to the Scottish independence referendum in September 2014, when he cancelled Prime Minister's Questions to rush north of the border in an effort to save the Union, after a poll suggested the Yes campaign would win.\nHe was later forced to issue an apology to the Queen, after he was overheard telling New York mayor Michael Bloomberg Her Majesty had ""purred down the line"" when he informed her that Scotland had rejected independence.\nFor some, his handling of the referendum issue, by offering last-minute concessions to the nationalists, added to the idea of Mr Cameron as an ""essay crisis"" prime minister, who only gets fully engaged with an issue when all seems lost.\nIt was a criticism that came back to haunt him during the 2015 election campaign, which began with him musing about his desire not to serve a third term.\nOnly when the polls refused to budge, said the critics, did he roll up his sleeves and begin to display the passion some said had been lacking from his early performances.\nBut the late swing to the Conservatives that confounded the opinion pollsters and allowed him to form the first Tory government since John Major's in 1992 will be seen as a vindication of his risk-averse campaign strategy - his refusal to debate Labour leader Ed Miliband head-to-head and relentless focus on a handful of simple messages, in particular his ""long-term economic plan"".\nMr Cameron has always insisted that he works as hard as any of the previous residents of Number 10 and retains his 
zeal for social reform and the NHS, recently describing improving the health service as his ""life's work"".\nHe has always defended the coalition too - paying tribute to Nick Clegg in his victory speech on the steps of Downing Street - even though he increasingly spoke of his frustration at not being able to govern as a true Conservative prime minister.\nAll those who have wondered what he might have done had he been given a free hand, what sort of prime minister he might have been, will find out in the weeks and months ahead.",David Cameron has proved the doubters in his own party and beyond wrong by winning a majority of his own at the 2015 general election.,32592449
1,"HP is suing for alleged fraud. Separately, Mr Lynch and the former management of Autonomy plans to sue HP for more than £100m, alleging ""false and negligent statements"".\nUS-based HP bought software firm Autonomy in 2011 for $11bn.\nHP later wrote down the value of its purchase by three quarters.\nIndustry observers suggested there may have been problems with due diligence before Autonomy was bought. HP purchased Autonomy with the aim of moving more into software.\nBut shortly after buying it, HP claimed it had been misled by Autonomy as to the firm's true value.\nEarlier this year, the Serious Fraud Office closed its investigation into Autonomy's sale, saying that ""on the information available to it, there is insufficient evidence for a realistic prospect of conviction.""\nIt ceded legal jurisdiction to US authorities. Mr Lynch and Mr Hussain have consistently denied any allegation of impropriety.\nHP said in an emailed statement that: ""HP can confirm that, on 30 March, a Claim Form was filed against Michael Lynch and Sushovan Hussain.\n""The lawsuit seeks damages from them of approximately $5.1 billion. 
HP will not comment further until the proceedings have been served on the defendants,""\nHP said it had filed its claim in London's Chancery Division High Court.\nMeanwhile, representatives for Mr Lynch and his colleagues said in a separate statement: ""The former management of Autonomy announces today they will file claims against HP.\n""Former Autonomy CEO Mike Lynch's claim, which is likely to be in excess of £100 million, will be filed in the UK.""\nLate last summer, Autonomy filed papers in a San Francisco court accusing HP of ""mismanagement"" of the takeover.\nAutonomy's former chief financial officer, Sushovan Hussain, said then that HP wanted to ""cover up its mismanagement of the Autonomy integration"".\nAt the time HP dismissed Mr Hussain's complaint as ""preposterous"".",Hewlett-Packard (HP) is suing Autonomy co-founder Mike Lynch and former chief financial officer Sushovan Hussain for about $5.1bn (£3.4bn).,32131529
2,"Stuart Hindes, 53, from Leeds, had started his charity swim about 03:00 BST on Sunday from Dover harbour.\nMr Hindes said there was ""quite a swell"" but he had started to catch the tide and had taken his seasickness medication.\n""There's only so much you can put up with when you are retching,"" he added. His swim ended after about four hours.\nMore on this and other stories from Yorkshire\nMr Hindes described conditions as like being in a washing machine and said he had been battling seasickness after about 20 minutes of the attempt.\n""It was tough I gave it everything I could before I got in the boat,"" he said.\nThe shortest route between Dover and Calais is about 21 miles (33km) but a swimmer covers a longer distance due to water currents. Mr Hindes had planned to be in the water for about 16 hours.\nThe fastest swim has been completed in about seven hours and the slowest in nearly 27 hours, according to the Channel Swimming Association.\nWhen asked whether he would attempt the swim again, Mr Hindes said: ""It's too early to say, I'm still hurting.""\nMr Hindes had been part of a relay team of six swimmers that completed the cross-channel swim in 2011.\nOn his solo effort Mr Hindes was raising funds for mental health charity Mind and Clic Sargent, a cancer support charity for children, young people and their families.\nHe has raised £3,290 for CLIC Sargent so far, the charity said.",A swimmer had to abandon an attempt to cross the English Channel after a severe swell made him seasick.,40631429
3,"The aircraft will be based at RAF Northolt, taking part in eight days of training over London and the home counties until 10 May, as part of operation Exercise Olympic Guardian.\nIt is the first time fighter jets have been stationed at the west London site since WWII.\nBut anti-military campaigners warn the jets will create a ""climate of fear"".\nThe Typhoon jets, which can travel at up to 1,370 miles per hour, will put pilots through their paces, testing security in the skies ahead of their vital role during the 2012 Olympic Games, which start in July.\nMilitary chiefs have alerted residents in south-east England about the operation, warning that they will notice an increase in often loud air activity, especially on 4 and 5 May.\nDefence Secretary Phillip Hammond said: ""Whilst there is no specific threat to the Games, we have to be ready to assist in delivering a safe and secure Olympics for all to enjoy.""\nHe said the Typhoon operation at RAF Northolt underlined the ""commitment of the Ministry of Defence and our armed forces to keeping the public safe at a time when the world will be watching us"".\nBut the Stop the War Coalition has criticised the move as ""unacceptable"", arguing that heavy military activity in the capital will cause unnecessary fear.\nThe BBC's home of 2012: Latest Olympic news, sport, culture, torch relay, video and audio\nLindsey German of the campaign group said ordinary people should not be forced to put up with the measures.\n""Far from safeguarding Londoners as they go about their daily lives, they will bring a real fear of explosions and the prospect of these places becoming a target for terrorist attack.""\n""We are told by the Government that the war in Afghanistan is being fought so that we don't have to fight on the streets of London"", she said.\n""These manoeuvres give the lie to that, and show that the war has made Britain a more dangerous place.""\nAir Vice-Marshal Stuart Atha, air component commander for Olympics air 
security, said the training was ""essential"" and in line with ""preparations for most Olympics"" in recent years.\n""What we have is just prudent precautionary measures in place in the unlikely event that a threat from the air does manifest itself.""\nHe denied fighter aircraft would be patrolling the skies as a matter of course throughout the summer, adding the operation ""would not set a precedent for any sort of enduring military commitment"" in the area.\n""This is a once in a generation event and I think the UK public would expect us to be prepared for this"", he said.\nHowever, he added the MoD would try to keep the amount of flying to a minimum, balancing the need to reduce disturbance with the key aim of ensuring forces are ""ready for their important role delivering air security for the Olympics"".\n""We hope that people will understand the need for this very important training, and we thank them for their continued strong support"", he said.\nExercise Olympic Guardian is taking place on land, sea and air in the London and Weymouth areas between 2 May and 10 May.\nThe security operations will also include:\nIt was also recently revealed that surface-to-air missiles could be deployed at six sites in London during the games.\nLast month, a sonic boom from two Typhoon aircraft that were responding to an emergency signal was reported to have been heard in Bath, Coventry and Oxford.",Royal Air Force Typhoon jets have arrived at an airbase in London for a large-scale Olympic security exercise.,17922490
4,"1 June 2016 Last updated at 13:11 BST\nIt is the first fully funded state pension in East Africa.\nEach pensioner will be entitled to 20,000 Tanzanian shillings ($9, £6) a month.\nCampaigners have welcomed the move and say it will lead to a huge improvement, not just in the lives of the elderly - considered to be among the poorest in society - but for the rest of the country as well.\nBBC Africa's Sammy Awami reports.",Tanzania’s semi-autonomous archipelago of Zanzibar has introduced a pension scheme for all citizens aged 70 and over.,36427814


The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [59]:
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"
"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_aggregator: Return aggregates if this is set to True
Retu

You can call its `compute` method with your predictions and labels, which need to be lists of decoded strings:

In [60]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hello there", "general kenobi"]
metric.compute(predictions=fake_preds, references=fake_labels)

{'rouge1': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rouge2': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeL': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeLsum': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}
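To make these scores more concrete, here is a minimal sketch of how a ROUGE-1 style precision, recall and F-measure can be derived from unigram overlap. The function name `rouge1_sketch` is hypothetical, and the sketch skips stemming and the bootstrap aggregation behind the `low`/`mid`/`high` values, so it illustrates the idea rather than reproducing what `metric.compute` does internally.

```python
from collections import Counter

def rouge1_sketch(prediction, reference):
    """Unigram-overlap precision, recall and F-measure (illustrative only)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        return {"precision": 0.0, "recall": 0.0, "fmeasure": 0.0}
    fmeasure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}

perfect = rouge1_sketch("hello there", "hello there")
partial = rouge1_sketch("hello there general", "hello there")
print(perfect)
print(partial)
```

An exact match scores 1.0 everywhere, as in the output above, while an extra word in the prediction lowers precision but leaves recall untouched.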

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer` which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in a format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [61]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

By default, the call above will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [62]:
tokenizer("Hello, this one sentence!")

{'input_ids': [8774, 6, 48, 80, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later). You can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

Instead of one sentence, we can pass along a list of sentences:

In [63]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
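The shape of these outputs can be mimicked with a toy tokenizer. The sketch below is only an illustration: the vocabulary `TOY_VOCAB` and the helpers `toy_tokenize` and `toy_tokenize_batch` are hypothetical, and a real checkpoint uses a learned subword vocabulary rather than whitespace splitting.

```python
# Hypothetical toy vocabulary; real checkpoints ship a learned subword vocabulary.
TOY_VOCAB = {"<pad>": 0, "</s>": 1, "<unk>": 2, "hello": 3, "this": 4, "is": 5, "a": 6, "sentence": 7}

def toy_tokenize(text):
    """Split on whitespace, map tokens to IDs, append an end-of-sequence ID."""
    tokens = text.lower().split()
    input_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["<unk>"]) for tok in tokens]
    input_ids.append(TOY_VOCAB["</s>"])
    # Every position holds a real token here, so the mask is all ones.
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

def toy_tokenize_batch(texts):
    """Tokenize several texts and return one list per key."""
    encoded = [toy_tokenize(t) for t in texts]
    return {key: [e[key] for e in encoded] for key in ("input_ids", "attention_mask")}

single = toy_tokenize("hello this is a sentence")
batch = toy_tokenize_batch(["hello this is a sentence", "this is new"])
print(single)
print(batch)
```

As with the real tokenizer, a single text gives flat lists, and a list of texts gives a list of lists under each key.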

To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets:

In [64]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this one sentence!", "This is another sentence."]))

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}




If you are using one of the five T5 checkpoints, we have to prefix the inputs with "summarize:" (the model can also translate, and it needs the prefix to know which task it has to perform).

In [65]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model. Padding is dealt with later on (in a data collator), so that examples are padded to the longest length in the batch and not in the whole dataset.

In [66]:
max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [67]:
preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[21603, 10, 37, 423, 583, 13, 1783, 16, 20126, 16496, 6, 80, 13, 8, 844, 6025, 4161, 6, 19, 341, 271, 14841, 5, 7057, 161, 19, 4912, 16, 1626, 5981, 11, 186, 7540, 16, 1276, 15, 2296, 7, 5718, 2367, 14621, 4161, 57, 4125, 387, 5, 15059, 7, 30, 8, 4653, 4939, 711, 747, 522, 17879, 788, 12, 1783, 44, 8, 15763, 6029, 1813, 9, 7472, 5, 1404, 1623, 11, 5699, 277, 130, 4161, 57, 18368, 16, 20126, 16496, 227, 8, 2473, 5895, 15, 147, 89, 22411, 139, 8, 1511, 5, 1485, 3271, 3, 21926, 9, 472, 19623, 5251, 8, 616, 12, 15614, 8, 1783, 5, 37, 13818, 10564, 15, 26, 3, 9, 3, 19513, 1481, 6, 18368, 186, 1328, 2605, 30, 7488, 1887, 3, 18, 8, 711, 2309, 9517, 89, 355, 5, 3966, 1954, 9233, 15, 6, 113, 293, 7, 8, 16548, 13363, 106, 14022, 84, 47, 14621, 4161, 6, 243, 255, 228, 59, 7828, 8, 1249, 18, 545, 11298, 1773, 728, 8, 8347, 1560, 5, 611, 6, 255, 243, 72, 1709, 1528, 161, 228, 43, 118, 4006, 91, 12, 766, 8, 3, 19513, 1481, 410, 59, 5124, 5, 96, 196, 17, 19, 1256, 68, 27, 103, 317, 132
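The reason for truncating now and padding later can be seen in a small plain-Python sketch. The helpers `truncate`, `pad_to` and `batches`, the example lengths, the `max_length` of 8 and the batch size of 2 are all hypothetical values chosen for illustration. The sketch counts how many token slots are needed when padding to the longest example overall, compared with padding each batch to its own longest example.

```python
def truncate(ids, max_length):
    """Keep at most max_length token IDs."""
    return ids[:max_length]

def pad_to(ids, length, pad_id=0):
    """Right-pad a sequence of IDs with pad_id up to the given length."""
    return ids + [pad_id] * (length - len(ids))

def batches(items, size):
    """Yield consecutive chunks of `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical tokenized examples of lengths 3, 4, 12 and 5.
examples = [list(range(1, n + 1)) for n in (3, 4, 12, 5)]
truncated = [truncate(ids, max_length=8) for ids in examples]

# Option 1: pad everything to the longest example in the whole dataset.
global_len = max(len(ids) for ids in truncated)
global_total = sum(len(pad_to(ids, global_len)) for ids in truncated)

# Option 2: pad each batch of two only to its own longest example.
dynamic_total = 0
for batch in batches(truncated, size=2):
    batch_len = max(len(ids) for ids in batch)
    dynamic_total += sum(len(pad_to(ids, batch_len)) for ids in batch)

print(global_total, dynamic_total)
```

Per-batch padding needs fewer slots whenever a batch contains no example of the maximum length, which is why `preprocess_function` leaves padding to the data collator.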

To apply `preprocess_function` to all the document/summary pairs in our dataset, we just use the `map` method of the `raw_datasets` object we created earlier. This will apply the function to all the elements of all the splits in `raw_datasets`, so our training, validation and testing data will be preprocessed in one single command.

In [68]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)




Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data should not be reused). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts in batches. This leverages the full benefit of the fast tokenizer we loaded earlier, which uses multi-threading to process the texts in a batch concurrently.
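To illustrate what a batched function receives, here is a rough sketch of batched mapping in plain Python. `map_batched` and `count_words` are hypothetical stand-ins (they are not part of 🤗 Datasets), and the sketch ignores caching and multi-threading. It only shows that each call gets a dictionary of lists and returns one value per example.

```python
def map_batched(rows, function, batch_size=2):
    """Apply `function` to dict-of-lists batches and collect the new columns."""
    output = {}
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # A batch maps each column name to a list of values.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        for key, values in function(batch).items():
            output.setdefault(key, []).extend(values)
    return output

def count_words(batch):
    """Hypothetical stand-in for a preprocessing function: one result per example."""
    return {"n_words": [len(doc.split()) for doc in batch["document"]]}

rows = [
    {"document": "a b c", "summary": "a"},
    {"document": "d e", "summary": "d"},
    {"document": "f", "summary": "f"},
]
result = map_batched(rows, count_words, batch_size=2)
print(result)
```

This is why `preprocess_function` indexes `examples["document"]` and `examples["summary"]` as lists rather than as single strings.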

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [69]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Note that we don't get a warning like in our classification example. This means we used all the weights of the pretrained model and there is no randomly initialized head in this case.

To instantiate a `Seq2SeqTrainer`, we will need to define three more things. The most important is the [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [70]:
batch_size = 16
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-xsum",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

ValueError: FP16 Mixed precision training with AMP or APEX (`--fp16`) and FP16 half precision evaluation (`--fp16_full_eval`) can only be used on CUDA devices.

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the cell and customize the weight decay. Since the `Seq2SeqTrainer` will save the model regularly and our dataset is quite large, we tell it to keep three checkpoints at most. Lastly, we use the `predict_with_generate` option (to properly generate summaries) and activate mixed precision training (to go a bit faster). Mixed precision with `fp16=True` requires a CUDA device; on other hardware the cell fails with the `ValueError` shown above, so set `fp16=False` in that case.

The last argument sets everything up so we can push the model to the [Hub](https://huggingface.co/models) regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization and not your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/t5-finetuned-xsum"` or `"huggingface/t5-finetuned-xsum"`).

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels:

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
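As a rough picture of what such a collator produces, the sketch below pads a batch by hand. `collate_sketch` is a hypothetical helper, not the library's implementation, and the pad ID of 0 is an assumption. The `-100` label padding matches the value that `compute_metrics` later replaces before decoding.

```python
def collate_sketch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad inputs and labels to the longest sequence in this batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # The mask marks real tokens with 1 and padding with 0.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Labels get a separate padding value so padded positions can be told apart.
        n_label_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch

# Hypothetical tokenized features of different lengths.
features = [
    {"input_ids": [21603, 10, 37, 1], "labels": [8774, 1]},
    {"input_ids": [21603, 10, 1], "labels": [100, 19, 430, 1]},
]
batch = collate_sketch(features)
print(batch)
```

Inputs and labels are padded independently, since the longest document and the longest summary in a batch rarely belong to the same example.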

The last thing to define for our `Seq2SeqTrainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier, and we have to do a bit of pre-processing to decode the predictions into texts:

In [None]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
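Two steps of this function are easy to check in isolation: replacing `-100` in the labels and computing the mean generated length. The sketch below is a plain-Python equivalent of those two steps, with a hypothetical `PAD_ID` of 0 and made-up token IDs. It does not decode anything or call the metric.

```python
PAD_ID = 0  # assumed pad token ID for this sketch

def restore_padding(labels, pad_id=PAD_ID):
    """Replace the -100 placeholders with the pad ID so the IDs can be decoded."""
    return [[tok if tok != -100 else pad_id for tok in seq] for seq in labels]

def mean_generated_length(predictions, pad_id=PAD_ID):
    """Average number of non-padding tokens per prediction."""
    lengths = [sum(1 for tok in seq if tok != pad_id) for seq in predictions]
    return sum(lengths) / len(lengths)

# Hypothetical label and prediction IDs.
labels = [[8774, 1, -100, -100], [100, 19, 430, 1]]
predictions = [[0, 8774, 1, 0, 0], [0, 100, 19, 430, 1]]

restored = restore_padding(labels)
gen_len = mean_generated_length(predictions)
print(restored)
print(gen_len)
```

Without the replacement step, the tokenizer would be asked to decode `-100`, which is not a valid token ID.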

Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`:

In [None]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

We can now fine-tune our model by just calling the `train` method:

In [None]:
trainer.train()

You can now upload the result of the training to the Hub by executing this instruction:

In [None]:
trainer.push_to_hub()

You can now share this model with all your friends, family, and favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"`, for instance:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("sgugger/my-awesome-model")
```