Setup steps to run on `Colab`

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd /content/drive/MyDrive/Information\ Retrieval/transformers

/content/drive/MyDrive/Information Retrieval/transformers


In [None]:
! pip install datasets transformers rouge-score nltk

Collecting datasets
  Downloading https://files.pythonhosted.org/packages/46/1a/b9f9b3bfef624686ae81c070f0a6bb635047b17cdb3698c7ad01281e6f9a/datasets-1.6.2-py3-none-any.whl (221kB)
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting rouge-score
  Downloading https://files.pythonhosted.org/packages/1f/56/a81022436c08b9405a5247b71635394d44fe7e1dbedc4b28c740e09c2840/rouge_score-0.0.4-py2.py3-none-any.whl
Collecting fsspec
  Downloading https://files.pythonhosted.org/packages/e9/91/2ef649137816850fa4f4c97c6f2eabb1a79bf0aa2c8ed198e387e373455e/fsspec-2021.4.0-py3-none-any.whl (108kB)
Collecting xxhash
  Downloading https://files.pythonho

In [None]:
import nltk
nltk.download('punkt')

If you're opening this notebook locally, make sure your environment has the latest versions of those libraries installed.

You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).

# Fine-tuning a model on a summarization task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models for a summarization task. We will use the [XSum dataset](https://arxiv.org/pdf/1808.08745.pdf) (for extreme summarization), which contains BBC articles accompanied by single-sentence summaries.

![Widget inference on a summarization task](images/summarization.png)

We will see how to easily load the dataset for this task using 🤗 Datasets and how to fine-tune a model on it using the `Trainer` API.

In [None]:
model_checkpoint = "t5-small"

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`t5-small`](https://huggingface.co/t5-small) checkpoint.

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [None]:
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")

Using custom data configuration default



Downloading and preparing dataset xsum/default (download: 245.38 MiB, generated: 507.60 MiB, post-processed: Unknown size, total: 752.98 MiB) to /root/.cache/huggingface/datasets/xsum/default/1.2.0/4957825a982999fbf80bca0b342793b01b2611e021ef589fb7c6250b3577b499...



Dataset xsum downloaded and prepared to /root/.cache/huggingface/datasets/xsum/default/1.2.0/4957825a982999fbf80bca0b342793b01b2611e021ef589fb7c6250b3577b499. Subsequent calls will reuse this data.



In [None]:
import pandas as pd
from datasets import Dataset, DatasetDict
# Load the arXiv abstract/title data and wrap it in a 🤗 Dataset.
arxiv_df = pd.read_json('./data/arxiv_2010-2021_abstract_title.json', orient='split')
arxiv_dataset = Dataset.from_pandas(arxiv_df)
# Keep 80% for training, then split the remaining 20% evenly into validation and test.
arxiv_dataset_1 = arxiv_dataset.train_test_split(train_size=0.8)
arxiv_dataset_2 = arxiv_dataset_1['test'].train_test_split(train_size=0.5)
arxiv_dataset = DatasetDict({
    'train': arxiv_dataset_1['train'],
    'validation': arxiv_dataset_2['train'],
    'test': arxiv_dataset_2['test'],
})
# Drop the intermediate objects, which are no longer needed.
del arxiv_dataset_1, arxiv_dataset_2, arxiv_df
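
The two chained `train_test_split` calls above amount to an 80/10/10 split: 80% of the rows go to training, and the remaining 20% is halved into validation and test. The following is a minimal pure-Python sketch of that arithmetic only. `split_indices` is a hypothetical helper, and its shuffling and rounding are assumptions rather than a description of how 🤗 Datasets does it internally. The row count 59393 is the sum of the three split sizes reported for `arxiv_dataset` in this notebook.

```python
import random

def split_indices(n_rows, train_frac=0.8, val_frac_of_rest=0.5, seed=0):
    """Shuffle row indices and cut them into train/validation/test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = int(n_rows * train_frac)          # first cut: 80% for training
    rest = indices[n_train:]                    # the remaining 20%
    n_val = int(len(rest) * val_frac_of_rest)   # second cut: half of the rest
    return {
        "train": indices[:n_train],
        "validation": rest[:n_val],
        "test": rest[n_val:],
    }

splits = split_indices(59393)
sizes = {name: len(rows) for name, rows in splits.items()}
print(sizes)
```

Under these assumptions the three lists are disjoint and together cover every row exactly once.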

The `raw_datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation and test sets:

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

In [None]:
arxiv_dataset

DatasetDict({
    train: Dataset({
        features: ['summary', 'document'],
        num_rows: 47514
    })
    validation: Dataset({
        features: ['summary', 'document'],
        num_rows: 5939
    })
    test: Dataset({
        features: ['summary', 'document'],
        num_rows: 5940
    })
})

To access an actual element, you need to select a split first, then give an index:

In [None]:
raw_datasets["train"][0]

{'document': 'Recent reports have linked some France-based players with returns to Wales.\n"I\'ve always felt - and this is with my rugby hat on now; this is not region or WRU - I\'d rather spend that money on keeping players in Wales," said Davies.\nThe WRU provides £2m to the fund and £1.3m comes from the regions.\nFormer Wales and British and Irish Lions fly-half Davies became WRU chairman on Tuesday 21 October, succeeding deposed David Pickering following governing body elections.\nHe is now serving a notice period to leave his role as Newport Gwent Dragons chief executive after being voted on to the WRU board in September.\nDavies was among the leading figures among Dragons, Ospreys, Scarlets and Cardiff Blues officials who were embroiled in a protracted dispute with the WRU that ended in a £60m deal in August this year.\nIn the wake of that deal being done, Davies said the £3.3m should be spent on ensuring current Wales-based stars remain there.\nIn recent weeks, Racing Metro fla

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
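
The retry loop in `show_random_elements` exists only to guarantee that the picked indices are distinct. As a sketch of the same idea, the standard library's `random.sample` draws without replacement in one call. `pick_unique_indices` is a hypothetical helper name, not part of the original notebook.

```python
import random

def pick_unique_indices(dataset_len, num_examples=5, rng=random):
    """Return `num_examples` distinct indices in the range [0, dataset_len)."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # Sampling from a range draws without replacement, so no retry loop is needed.
    return rng.sample(range(dataset_len), num_examples)

# 204045 is the number of training rows reported for XSum in this notebook.
picks = pick_unique_indices(204045, num_examples=5)
print(picks)
```

The resulting list could be used in place of `picks` inside the function above.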

In [None]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,document,id,summary
0,"The gathering began with the national anthem and ended with a tribute from Castro's brother Raul.\nIt was attended by a number of world leaders - but some countries sent lower-level officials.\nFidel Castro, who came to power in 1959, died on Friday, aged 90. His ashes will be taken to the eastern city of Santiago later on Wednesday.\nOpinion on Fidel Castro, who ruled Cuba as a one-party state for almost half a century, remains divided.\nSupporters say he returned Cuba to the people and praise him for some of his social programmes, such as public health and education.\nBut critics call him a dictator, who led a government that did not tolerate opposition and dissent.\nThis division led to some countries. such as the US, sending lower-ranking emissaries. However, allies including left-wing Latin American leaders were among those attending the ceremony in Revolution Square, where Cubans once gathered to listen to Fidel Castro's fiery speeches.\nAttendance at the commemorative event reflects this division.\nOn Tuesday, the crowd chanted ""long live the revolution!"" and ""Fidel! Fidel!"" as the rally got under way.\nPresident Raul Castro closed the rally, referring to his brother Fidel as the leader of a revolution ""for the humble, and by the humble"".\nGreece's left-wing Prime Minister Alexis Tsipras was among those who addressed the crowd. 
The presidents of Mexico, Ecuador, Bolivia, Venezuela, Panama, South Africa and Zimbabwe also attended.\nIn his speech, South African President Jacob Zuma praised Cuba's record on health care and education and its support for African countries.\nEarlier on Tuesday, the left-wing presidents of Bolivia and Venezuela, Evo Morales and Nicolas Maduro, were among those who signed a book of condolences at the Jose Marti memorial where a photograph flanked by an honour guard has been on display since Monday.\nAnother admirer of Fidel Castro, Ecuadorean President Rafael Correa, is joining the two presidents at the commemoration.\nBut many Western leaders are not attending the event in person.\nThe White House announced that its nominee for the post of ambassador to Havana, Jeffrey DeLaurentis, and Deputy National Security Adviser Ben Rhodes would attend the commemorative event but that it was not sending an ""official delegation"" to Cuba.\nBen Rhodes was one of the US officials who negotiated the thaw between the US and the Cuban government announced in December 2014.\nPresident-elect Donald Trump on Monday threatened to end the detente if Cuba did not offer a ""better deal"".\nOn Wednesday Castro's ashes will be taken on a journey to Santiago, which is regarded as the Cuba's 1959 revolution.\nThe ashes will be placed on Sunday in the Ifigenia Cemetery in Santiago, where Cuban independence hero Jose Marti is buried.",38146156,A mass rally honouring the late Cuban revolutionary leader Fidel Castro has filled Revolution Square in Havana.
1,"A total of 670,000 Britons aged 15-24 have experimented with the substances at least once, it says in its 2013 World Drug Report.\nIt says there has been an alarming increase worldwide in new psychoactive substances, known as NPS.\nThe UK's crime prevention minister said the UK was addressing the threat.\nDrawing on European Commission data from 2011 and United Nations population statistics, the World Drug Report says the UK is Europe's largest market ""for legal substances that imitate the effects of illicit drugs"".\nBut the use of mephedrone - also known as meow meow or M-CAT - has declined in England and Wales since it was banned in 2010, the report said.\nCrime Prevention Minister Jeremy Browne said the UK is ""leading the global effort to address the serious threat"" from legal highs, ""adapting and innovating"" as new trends emerge.\n""We have introduced temporary class drug orders, a swift legislative response to protect the public while our independent experts prepare advice. We are working with law enforcement agencies overseas to break down supply chains and reduce demand.""\nHe added: ""Our Forensic Early Warning System and the Advisory Council on the Misuse of Drugs continue to closely monitor the prevalence and availability of these substances.""\nIt said the 670,000 Britons aged between 15 and 24 who had experimented with such substances at least once was 23% of the EU total in 2011.\nClose to 5% of people aged 15-24 in the EU have used NPS.\nThe world's biggest market for NPS is the United States, where use of these substances among youth ""appears to be more than twice as widespread as in the European Union"".\nThe UNODC said this is an alarming problem, as the substances have not been tested for safety and pose ""unforeseen public health challenges"".\n""Sold openly, including via the internet, NPS…. can be far more dangerous than traditional drugs. 
Street names, such as spice, meow-meow and bath salts mislead young people into believing that they are indulging in low-risk fun,"" the report said.\nIt added that while the use of traditional drugs such as heroin or cocaine is globally stable, new psychoactive substances ""are proliferating at an unprecedented rate"".\nAnd new substances are being identified all the time.\nAt the end of 2009, 166 NPS had been identified worldwide. By mid-2012 that had risen to 251.\n""For the first time, the number of NPS exceeded the total number of substances under international control (234), "" the report said.\nThe UNODC said authorities are struggling to keep up.\n""Given the almost infinite scope to alter the chemical structure of NPS, new formulations are outpacing efforts to impose international control. While law enforcement lags behind, criminals have been quick to tap into this lucrative market.""\nMany of these new psychoactive substances appear to originate in Asia and are spread via the internet.\nThe report said the number of online shops offering to supply customers in countries in the EU with NPS increased from 170 in January 2010 to 693 in January 2012.\nHowever the UNODC suggested in Europe, at least, the internet may be used more for the import and wholesale business.\nIt pointed to an EU survey which says most young consumers in Europe do not tend to buy NPS online, but get their supplies from friends or at parties and nightclubs.\nJustice Tettey, from the UNODC, said that while the UK had ""a large market in NPS"", it had also successfully introduced legislation to bring some of the substances under control.\nIn 2010-2011, mephedrone was the second most widely misused substance in England and Wales, on a par with cocaine powder, according to the report.\nBut following an import ban and classification as a Class B substance, mephedrone use has declined, after years of increase.\n""We have seen a decrease in use (in the UK) since the legislation got put in 
place,"" Mr Tettey said.",23048267,"The UK has the largest market for so-called ""legal highs"" in the European Union, according to the United Nations Office on Drugs and Crime (UNODC)."
2,"Border Precision Engineering was ""heavily reliant"" on the deal, according to Robin Knight of liquidators Alix Partners.\nHe said the firm faced ""severe trading issues"" after the contract ended.\nStaff were turned away from the factory in Kelso on Monday and later learned that the company was in liquidation, with the loss of 80 jobs.\nIt is understood that the contract was terminated about 10 days ago.\nMr Knight said: ""The business experienced severe trading issues with the loss of a major contract and it was heavily reliant on one contract.""\nThe company previously went into administration in 2013 - but was saved by a management buyout, backed by investors syndicate Tri Cap.\nThe liquidator said they were working to find a ""viable solution"" for the firm.\nHe said: ""This is a highly skilled workforce working in a precision market and we will be doing our very best to find a viable solution.""\nSNP MP Calum Kerr, who represents Berwickshire, Roxburgh and Selkirk, said it was a ""big blow"" for Kelso and the Borders.\nHe added : ""I am currently seeking urgent meetings with those involved to see if I can be of any assistance and if there may be a future for the business.\n""I will also be contacting Fergus Ewing, the enterprise minister in the Scottish government, to arrange an urgent discussion and to find out what kind of help he may be able to make available.""",33343267,"A specialised engineering firm in the Borders collapsed after losing a major contract, its liquidator has confirmed."
3,"The devastating quake caused the collapse of numerous buildings with the historic town of Amatrice the worst hit.\nThis is what we know about those who died.\nThe married couple, who were crushed under the rubble of a house in Amatrice, were found by rescuers in an embrace.\nMs Rascelli, who worked as a secretary, had recently celebrated her birthday with her husband Mr Trabalza, an internet technology specialist who ran the website for Acti-Roma, which offers help to those in need of heart transplants.\nThe couple lived in Ostia, a coastal town near Rome, and were in Amatrice for the holidays, according to reports.\nThey had posted photographs on social media of themselves enjoying their holiday just four hours before the earthquake struck.\nFriends of the couple posted Facebook messages describing the pair as ""always smiling and helpful"".\nOne of three Britons who died in the disaster has been identified as a teenager from London who was on holiday in Amatrice.\nHis parents, Anne-Louise and Simon Burnett, and his sister suffered minor injuries but survived when the building they were in collapsed.\nThe family has since paid tribute to the ""tireless work"" of the Italian rescue workers.\nThe two other Britons killed were understood to be staying at the same property as Marcos Burnett.\nThe couple were from Stockwell in London.\nA chef visiting the town for its food festival, he had travelled to Amatrice with friends.\nOne friend managed to escape, while another was pulled alive from the debris a few hours later, according to reports.\nMarco was later pulled from the rubble by his father, Filippo, who had rushed to the region after his son failed to answer his mobile phone.\nDied when the house she was staying in collapsed. 
Her boyfriend Claudio Leonetti was seriously hurt.\nMs Grossi, a flautist and recent graduate, was pronounced dead at the scene.\nThe teenager, from Rome, was spending time with her father and grandparents in Pescara del Tronto before returning to school.\nShe died when the house belonging to her grandparents collapsed. Her relatives, who were with her at the time, are believed to have survived.\nShe was said to have been a keen One Direction fan, and many fellow fans took to Twitter to express their condolences with the hashtag #riparianna.\nThe hairdresser, who lived in the eastern coastal region of Marche, was visiting his parents while on holiday with a friend in Amatrice when the quake struck.\nLocal media reports that while his friend was pulled alive from a large pile of debris and survived, Mr Neroni was overwhelmed by rubble and could not be saved.\nGiulia was found on top of her little sister, Giorgia, who was pulled out alive after 16 hours under the rubble in Pescara del Tronto.\nFirefighter's moving letter to child victim Giulia\nDied when the house he was in with his family in Amatrice collapsed. He had emigrated to Italy from Albania a number of years ago.\nHis wife and three children survived and were treated in hospital for their injuries, according to the Albanian foreign ministry.\nElisa Cafini, from Rome, was in the mountainous region of Pescara del Tronto with her cousin Gabriele Pratesi, 8, and grandmothers Irma Cafini, 81, and Rita Colaceci, 72.\nAccording to local media, all four were killed in the area, known for its tight grouping of old stone and wood houses, which was razed by the quake.\nTiziana Lo Presti was reportedly an earthquake expert who spent most of her life working for Italy's Civil Protection disaster management agency, planning how to deal with emergencies.\nShe lived in Rome but when the earthquake struck, she was in the hamlet of Saletta, near Amatrice, visiting her mother who was recovering after a stay in hospital. 
She died on the spot, reports La Repubblica, but her mother survived.\nThe couple were on holiday in Amatrice at a relatives' house. Their nine-year-old son, Alessandro, survived as he was staying with his father's parents.\nA Romanian national, Maricica Losub worked as a waitress in Amatrice where she had been living for the past 15 years. Sixteen Romanians remain unaccounted for, the Romanian foreign ministry says.",37195764,Almost 300 people are now known to have died after a 6.2-magnitude earthquake struck central Italy on 24 August.
4,"A member of the Montabaur flight school where Andreas Lubitz took lessons confirmed to BBC News the co-pilot had flown a glider over the region.\nMr Lubitz was on holiday at the time, several years ago, Dieter Wagner said.\nA French newspaper reports that the co-pilot holidayed at a local flying club with his parents from the age of nine.\nInvestigators are trying to establish what may have motivated Mr Lubitz to seize sole control of the Airbus A320 and crash it.\nGerman prosecutors believe he was concealing an illness from his employer, Germanwings, at the time of the crash.\nData from the voice recorder suggests the 27-year-old purposely started an eight-minute descent into the mountains after locking the pilot out of the flight deck.\nThere were no survivors when Flight 4U 9525 crashed in a remote Alpine valley on Tuesday while en route from Barcelona in Spain to Duesseldorf in Germany.\nProsecutors say there was no evidence of a political or religious motive for his actions and no suicide note has been found.\nMr Lubitz flew a glider over the southern French Alps during a holiday with the flight school in Montabaur, his home town, Dieter Wagner told the BBC.\nHe had been holidaying there before he became a professional airline pilot.\nMr Wagner, who says he last saw the young man five or six years ago, was quoted by French newspaper Le Parisien (in French) as saying the co-pilot had been ""passionate about the Alps and even obsessed [with them]"".\nAnother French news outlet, Metro News, reports that Mr Lubitz holidayed with his parents from the age of nine at the flying club in Sisteron, 69km (43 miles) from Le Vernet, a village near the crash site.\nQuoting a ""friend of his parents"", the paper said in its report (in French) the family had stayed at a nearby campsite and Andreas had come across as a ""normal boy"".\nMetro News quoted Francis Keser, a designer at the club in Sisteron, as saying Mr Lubitz had ""known the area well"".\nUnanswered 
questions\nWhat drives people to murder-suicide?\nWho was Andreas Lubitz?\nAccording to prosecutors, torn-up sick notes were found at the co-pilot's tow addresses in Germany, including one for the day of the crash.\nA hospital in the German city of Duesseldorf has confirmed Mr Lubitz was a patient there recently but it denied media reports that he had been treated for depression.\nThe theory that a mental illness such as depression had affected the co-pilot was suggested by German media, quoting internal aviation authority documents.\nThey said he had suffered a serious depressive episode while training in 2009.\nHe reportedly went on to receive treatment for a year and a half and was recommended regular psychological assessment.\nMr Lubitz's employers insisted that he had only been allowed to resume training after his suitability was ""re-established"".\nFrench police say the search for passenger remains and debris on the mountain slopes could take another two weeks.\nIn the aftermath of the crash, the EU's aviation regulator, the European Aviation Safety Agency, has urged airlines to adopt new safety rules.\nIn future, it says, two crew members should be present in the cockpit at all times.\nSource: Aviation Safety Network",32097106,"The co-pilot suspected of crashing a German airliner into the French Alps, killing himself and 149 others, knew the region from gliding holidays."


In [None]:
show_random_elements(arxiv_dataset["train"])

Unnamed: 0,summary,document
0,Reconstruction of Local Perturbations in Periodic Surfaces,"This paper concerns the inverse scattering problem to reconstruct a local perturbation in a periodic structure. Unlike the periodic problems, the periodicity for the scattered field no longer holds, thus classical methods, which reduce quasi-periodic fields in one periodic cell, are no longer available. Based on the Floquet-Bloch transform, a numerical method has been developed to solve the direct problem, that leads to a possibility to design an algorithm for the inverse problem. The numerical method introduced in this paper contains two steps. The first step is initialization, that is to locate the support of the perturbation by a simple method. This step reduces the inverse problem in an infinite domain into one periodic cell. The second step is to apply Newton-CG method to solve the associated optimization problem. The perturbation is then approximated by a finite spline basis. Numerical examples are given at the end of this paper, shows the efficiency of the numerical method."
1,Energy dependence of pbar/p ratio in p+p collisions,"We have compiled the experimentally measured pbar/p ratio at midrapidity in p+p collisions from \sqrt = 23 to 7000 GeV and compared it to various mechanisms of baryon production as implemented in PYTHIA, PHOJET and HIJING/B-Bbar models. For the models studied with default settings, PHOJET has the best agreement with the measurements, PYTHIA gives a higher value for \sqrt < 200 GeV and the ratios from HIJING/B-Bbar are consistently lower for all the \sqrt studied. Comparison of the data to different mechanisms of baryon production as implemented in PYTHIA shows that through a suitable tuning of the suppression of diquark-antidiquark pair production in the color field relative to quark-antiquark production and allowing the diquarks to split according to the popcorn scheme gives a fairly reasonable description of the measured pbar/p ratio for \sqrt < 200 GeV. Comparison of the beam energy dependence of the pbar/p ratio in p+p and nucleus-nucleus collisions at midrapidity shows that the baryon production is significantly more for A+A collisions relative to p+p collisions for \sqrt < 200 GeV. We also carry out a phenomenological fit to the y_beam dependence of the pbar/p ratio."
2,Glauber Gluons and Multiple Parton Interactions,We show that for hadronic transverse energy
3,Long range rapidity correlations as seen in the STAR experiment,"We analyze long range rapidity correlations observed in the STAR experiment at RHIC. Our goal is to extract properties of the two particle correlation matrix, accounting for the analysis method of the STAR experiment. We find a surprisingly large correlation strength for central collisions of gold nuclei at highest RHIC energies. We argue that such correlations cannot be the result of impact parameter fluctuations."
4,Charged Radial Infall for Spherical Central Bodies,"A massive, charged, spherical central body can be neutralized by attracting particles of opposite charge. We calculate the time it takes to neutralize these central bodies using classical mechanics, special relativistic mechanics, and finally, the forced trajectories of general relativity. While we can compare the classical and relativistic times, and find, predictably, that the special relativistic neutralization time is longer, a comparison of these times with the general relativistic result is not as directly possible. We offer the final calculation as a demonstration of dynamics in a general setting and in particular, the structural similarity of the relativistic problem to the other cases."


The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [None]:
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each predictions
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"
"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_agregator: Return aggregates if this is set to True
Retu

You can call its `compute` method with your predictions and labels, which need to be lists of decoded strings:

In [None]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hello there", "general kenobi"]
metric.compute(predictions=fake_preds, references=fake_labels)

{'rouge1': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rouge2': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeL': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeLsum': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}
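
To make the `precision`, `recall` and `fmeasure` fields above more concrete, here is a simplified sketch of ROUGE-1 for a single prediction/reference pair. It is an illustration under stated assumptions, not the implementation behind `load_metric("rouge")`: it lowercases and splits on whitespace, applies no stemming, and does no aggregation across examples. `rouge1` is a hypothetical helper name.

```python
from collections import Counter

def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F-measure for one pair of strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure

print(rouge1("hello there", "hello there"))
print(rouge1("the cat sat", "the cat sat on the mat"))
```

Identical strings score 1.0 on all three values, as in the output above. In the second call every predicted word appears in the reference (precision 1.0), but only three of the six reference words are covered (recall 0.5).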

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in a format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


By default, the call above will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [None]:
tokenizer("Hello, this one sentence!")

{'input_ids': [8774, 6, 48, 80, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [None]:
tokenizer("$_1, {this} a_1 one sentence!")

{'input_ids': [1514, 834, 4347, 3, 2, 8048, 2, 3, 9, 834, 536, 80, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later). You can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

Instead of one sentence, we can pass along a list of sentences:

In [None]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets:

In [None]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this one sentence!", "This is another sentence."]))

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}


If you are using one of the five T5 checkpoints, we have to prefix the inputs with "summarize:" (the model can also translate, and it needs the prefix to know which task to perform).

In [None]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model. The padding will be dealt with later on (in a data collator), so we pad examples to the longest length in the batch and not in the whole dataset.

In [None]:
max_input_length = 512
max_target_length = 64

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
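
To see why padding is left to the collator, here is a minimal sketch that counts the padding tokens needed when each batch is padded to its own longest sequence versus when everything is padded to the longest sequence in the dataset. The lengths are made up, and `padding_global` and `padding_per_batch` are hypothetical helpers for illustration, not part of 🤗 Transformers.

```python
def padding_global(lengths):
    """Pad tokens needed if every example is padded to the longest one overall."""
    longest = max(lengths)
    return sum(longest - n for n in lengths)


def padding_per_batch(lengths, batch_size):
    """Pad tokens needed if each batch is padded only to its own longest example."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total


# Hypothetical tokenized lengths, already truncated to at most 512 tokens.
lengths = [512, 120, 87, 95, 300, 310, 64, 70]

global_pad = padding_global(lengths)
batch_pad = padding_per_batch(lengths, batch_size=2)
print(global_pad, batch_pad)
```

In this toy case, a single long document forces every other example to be padded up to 512 tokens under global padding, while per-batch padding only pays that cost inside the batch that contains it.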


This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[21603, 10, 17716, 2279, 43, 5229, 128, 1410, 18, 390, 1508, 28, 5146, 12, 10256, 5, 96, 196, 31, 162, 373, 1800, 3, 18, 11, 48, 19, 28, 82, 22209, 3, 547, 30, 230, 117, 48, 19, 59, 1719, 42, 549, 8503, 3, 18, 27, 31, 26, 1066, 1492, 24, 540, 30, 2627, 1508, 16, 10256, 976, 243, 28571, 5, 37, 549, 8503, 795, 17586, 51, 12, 8, 3069, 11, 3996, 13606, 51, 639, 45, 8, 6266, 5, 18263, 10256, 11, 2390, 11, 7262, 10371, 7, 3971, 18, 17114, 28571, 1632, 549, 8503, 13404, 30, 2818, 1401, 1797, 6, 7229, 53, 20, 12151, 1955, 8356, 49, 53, 826, 3, 19585, 643, 9768, 5, 216, 19, 230, 3122, 3, 9, 2103, 1059, 12, 1175, 112, 1075, 38, 24260, 350, 16103, 10282, 7, 5752, 4297, 227, 271, 3, 11060, 30, 12, 8, 549, 8503, 1476, 16, 1600, 5, 28571, 47, 859, 8, 1374, 5638, 859, 10282, 7, 6, 411, 7, 2026, 63, 7, 6, 14586, 7677, 11, 26911, 2419, 7, 4298, 113, 130, 10960, 52, 26786, 16, 3, 9, 813, 11674, 11044, 28, 8, 549, 8503, 24, 3492, 16, 3, 9, 3996, 3328, 51, 1154, 16, 1660, 48, 215, 5, 86, 8,

In [None]:
preprocess_function(arxiv_dataset['train'][:2])

{'input_ids': [[21603, 10, 101, 4277, 8331, 115, 23, 107, 6, 3, 9, 1506, 3, 31761, 127, 28, 7951, 1693, 1339, 12, 199, 3962, 1705, 125, 31, 7, 1187, 3, 9, 1506, 733, 5, 421, 358, 8397, 1506, 3, 31801, 139, 984, 11, 3806, 7, 783, 10958, 24, 504, 8, 879, 685, 76, 10355, 13, 5099, 6, 8, 1952, 13, 17554, 232, 3040, 738, 6, 6676, 18, 18237, 2009, 6, 1374, 1827, 26403, 6, 879, 2835, 13, 5099, 6, 11, 3, 8389, 28, 1445, 12, 796, 3213, 11, 4064, 13, 3, 9, 1506, 12577, 5, 86, 811, 6, 62, 3269, 13590, 284, 1108, 12, 8432, 823, 34, 19, 17554, 232, 3040, 11, 12, 2082, 165, 3, 8389, 28, 1445, 12, 3, 9, 381, 13, 15202, 4064, 5, 1], [21603, 10, 101, 3, 11619, 120, 1428, 8, 3, 5359, 683, 329, 825, 6, 3, 13134, 3, 9, 192, 18, 11619, 4516, 24, 54, 36, 816, 13, 38, 3, 9, 3, 31, 7429, 1041, 31, 5, 100, 23734, 7, 251, 81, 321, 332, 18, 11, 180, 18, 1259, 10355, 6, 3, 23, 5, 15, 5, 8788, 4431, 11, 309, 18, 16099, 7, 16, 668, 11, 335, 8393, 5, 37, 2018, 4102, 3843, 4516, 44, 508, 584, 8878, 11, 508, 3, 157, 6

To apply this function to every document/summary pair in our dataset, we just use the `map` method of the `raw_datasets` object we created earlier. This applies the function to all the elements of all the splits in `raw_datasets`, so our training, validation, and test data are preprocessed in a single command.

In [None]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

In [None]:
tokenized_arxiv_datasets = arxiv_dataset.map(preprocess_function, batched=True)

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (so that the cached data must not be reused). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts in batches. This leverages the full benefit of the fast tokenizer we loaded earlier, which uses multi-threading to process the texts in a batch concurrently.
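
With `batched=True`, the function passed to `map` receives a dictionary of lists (one list per column) instead of a single example, which is why `preprocess_function` treats `examples["document"]` as a list. The sketch below mimics that calling convention in pure Python, with whitespace splitting standing in for the real tokenizer. `toy_preprocess` and `toy_map_batched` are hypothetical helpers, not the 🤗 Datasets implementation.

```python
def toy_preprocess(examples):
    """Receive a dict of lists and return a dict of lists of the same length."""
    return {
        "input_ids": [doc.split() for doc in examples["document"]],
        "labels": [summary.split() for summary in examples["summary"]],
    }


def toy_map_batched(rows, function, batch_size=1000):
    """Apply `function` to column-wise batches and collect the new columns."""
    output = {}
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into one dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        for key, values in function(batch).items():
            output.setdefault(key, []).extend(values)
    return output


rows = [
    {"document": "a long article", "summary": "short"},
    {"document": "another article", "summary": "also short"},
    {"document": "third one", "summary": "tiny"},
]
result = toy_map_batched(rows, toy_preprocess, batch_size=2)
print(result["input_ids"])
```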

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [None]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Note that we don't get a warning like in our classification example. This means we used all the weights of the pretrained model and there is no randomly initialized head in this case.

To instantiate a `Seq2SeqTrainer`, we will need to define three more things. The most important is the [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
batch_size = 16
args = Seq2SeqTrainingArguments(
    "test-summarization",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the cell, and customize the weight decay. Since the `Seq2SeqTrainer` will save the model regularly and our dataset is quite large, we tell it to keep at most three checkpoints. Lastly, we use the `predict_with_generate` option (to properly generate summaries) and activate mixed precision training (to go a bit faster).
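
As a rough picture of what `save_total_limit=3` means in practice, the sketch below simulates one epoch and tracks which checkpoint folders would remain. The dataset size is made up, a single device is assumed, and the 500-step save interval is an assumption about the default `save_steps`. `simulate_checkpoints` is a hypothetical helper, not trainer code.

```python
import math
from collections import deque


def simulate_checkpoints(num_examples, batch_size, save_steps, save_total_limit):
    """Return (optimizer steps in one epoch, checkpoint names still kept)."""
    total_steps = math.ceil(num_examples / batch_size)
    # A bounded deque drops its oldest entry once the limit is reached.
    kept = deque(maxlen=save_total_limit)
    for step in range(save_steps, total_steps + 1, save_steps):
        kept.append(f"checkpoint-{step}")
    return total_steps, list(kept)


# Hypothetical dataset size; batch size and limit match the arguments above.
steps, kept = simulate_checkpoints(
    num_examples=40_000, batch_size=16, save_steps=500, save_total_limit=3
)
print(steps, kept)
```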

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels:

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
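
To make the padding behaviour concrete, the sketch below pads `input_ids` with a pad token ID, builds the matching `attention_mask`, and pads `labels` with -100, the value that is ignored when the loss is computed. `toy_collate` is a hypothetical pure-Python stand-in, not the `DataCollatorForSeq2Seq` implementation, and the pad ID of 0 is an assumption that matches T5.

```python
def toy_collate(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad every field to the longest sequence in this batch only."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        n_label_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch


# Made-up features shaped like the output of preprocess_function.
features = [
    {"input_ids": [8774, 6, 48, 80, 7142, 55, 1], "labels": [100, 19, 1]},
    {"input_ids": [100, 19, 430, 7142, 5, 1], "labels": [8774, 1]},
]
batch = toy_collate(features)
print(batch["input_ids"][1])
print(batch["labels"][1])
```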

The last thing to define for our `Seq2SeqTrainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier, and we have to do a bit of pre-processing to decode the predictions into texts:

In [None]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Keep the mid F-measure of each ROUGE score, expressed as a percentage
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
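
The two NumPy steps in `compute_metrics` can be checked on a tiny example: swapping -100 back to the pad token ID so the labels can be decoded, and counting non-pad tokens to get the generated length. The arrays below are made up, and the pad ID of 0 is an assumption that matches T5.

```python
import numpy as np

pad_token_id = 0  # assumption: T5 uses 0 as its pad token ID

# Made-up label IDs as handed to compute_metrics, padded with -100.
labels = np.array([[461, 8, 166, 1, -100, -100],
                   [37, 4047, 1, -100, -100, -100]])
# Made-up generated IDs, padded with the pad token.
predictions = np.array([[0, 27495, 7, 1, 0, 0],
                        [0, 5961, 5316, 9, 1, 0]])

# -100 is not a valid vocabulary index, so replace it before decoding.
decodable = np.where(labels != -100, labels, pad_token_id)

# Count the tokens that are not padding in each prediction.
prediction_lens = [int(np.count_nonzero(pred != pad_token_id)) for pred in predictions]
gen_len = float(np.mean(prediction_lens))
print(decodable.tolist(), prediction_lens, gen_len)
```

Under this assumption the leading 0 in each prediction has the same ID as the pad token, so it is not counted towards the generated length.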

Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`:

In [None]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_arxiv_datasets["train"],
    eval_dataset=tokenized_arxiv_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

We can now fine-tune our model by just calling the `train` method:

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss


LookupError: ignored

Don't forget to [upload your model](https://huggingface.co/transformers/model_sharing.html) to the [🤗 Model Hub](https://huggingface.co/models). You can then use it to generate results like the one shown in the first picture of this notebook!

In [None]:
trainer.evaluate()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len,Runtime,Samples Per Second
1,2.755,2.504184,28.2043,7.6047,22.1211,22.1265,18.8256,556.8558,20.35


{'eval_gen_len': 18.8256,
 'eval_loss': 2.5041840076446533,
 'eval_rouge1': 28.2043,
 'eval_rouge2': 7.6047,
 'eval_rougeL': 22.1211,
 'eval_rougeLsum': 22.1265,
 'eval_runtime': 556.8558,
 'eval_samples_per_second': 20.35}

In [None]:
trainer.predict(tokenized_datasets["test"])



PredictionOutput(predictions=array([[    0, 27495,     7, ...,    65,   118,     3],
       [    0,  5961,  5316, ...,     3,     9,  2340],
       [    0,    37, 12580, ...,     3,  8715,  6087],
       ...,
       [    0,  8007,  2091, ...,     3,     9,  2752],
       [    0,    37,  1270, ...,     6,     3,     9],
       [    0,    71,  3116, ...,     3,     9, 24609]]), label_ids=array([[  461,     8,   166, ...,  -100,  -100,  -100],
       [   37,  4047,    31, ...,  -100,  -100,  -100],
       [  290,     7,     9, ...,  -100,  -100,  -100],
       ...,
       [ 2733,  3142,   100, ...,  -100,  -100,  -100],
       [   41,   254, 10227, ...,  -100,  -100,  -100],
       [22317,    43,     3, ...,  -100,  -100,  -100]]), metrics={'eval_loss': 2.5258328914642334, 'eval_rouge1': 28.1446, 'eval_rouge2': 7.5729, 'eval_rougeL': 22.0106, 'eval_rougeLsum': 22.0068, 'eval_gen_len': 18.829, 'eval_runtime': 554.5972, 'eval_samples_per_second': 20.436})