## T5 Model

Text-to-Text Transfer Transformer (T5) is a Transformer-based model using an encoder-decoder architecture. It is pretrained on a mix of unsupervised and supervised tasks, all converted into a text-to-text format. T5 excels in various sequence-to-sequence tasks like summarization and translation.
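
To make the text-to-text idea concrete, the following minimal sketch shows how different tasks can be expressed as plain input strings by prepending a task prefix. The prefixes follow the convention used by T5; the helper name `to_text_to_text`, the dictionary, and the sample sentence are illustrative assumptions and not part of any library.

```python
# Task prefixes in the style used by T5; this dictionary is illustrative only.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "translation_en_de": "translate English to German: ",
}

def to_text_to_text(task: str, text: str) -> str:
    """Return the input text with the task prefix prepended."""
    return TASK_PREFIXES[task] + text.strip()

print(to_text_to_text("summarization", "Two tourist buses were destroyed by fire."))
```

Because every task is reduced to "string in, string out", the same model, loss, and decoding procedure can be reused across tasks.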

In this project, we will fine-tune the pretrained T5 model for the Abstractive Summarization task using Hugging Face Transformers and the XSum dataset from Hugging Face Datasets. We will use the T5 Base variant for its balanced performance and computational efficiency.

The T5 architecture stacks Transformer encoder layers, which build contextual representations of the input text, and Transformer decoder layers, which attend to those representations while generating the output one token at a time. This simple and scalable design allowed T5 to achieve state-of-the-art results across multiple NLP benchmarks when it was introduced.
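
As a rough sense of scale, the sketch below tallies the parameters of the T5 Base variant from its layer dimensions. The hyperparameter values are assumptions taken from the publicly released `t5-base` configuration, since this notebook does not state them, and the tally assumes bias-free linear layers and an embedding matrix shared with the output head.

```python
# Hyperparameters assumed from the public t5-base configuration (not stated in this notebook).
VOCAB, D_MODEL, D_FF, N_LAYERS, N_HEADS, D_KV, REL_BUCKETS = 32128, 768, 3072, 12, 12, 64, 32

inner = N_HEADS * D_KV              # attention inner dimension
attn = 4 * D_MODEL * inner          # query, key, value, and output projections
ffn = 2 * D_MODEL * D_FF            # two linear layers of the feed-forward block
norm = D_MODEL                      # one scale vector per layer norm
rel_bias = REL_BUCKETS * N_HEADS    # relative position bias table, one per stack

embedding = VOCAB * D_MODEL         # assumed shared by encoder, decoder, and output head
encoder = N_LAYERS * (attn + ffn + 2 * norm) + norm + rel_bias
decoder = N_LAYERS * (2 * attn + ffn + 3 * norm) + norm + rel_bias

total_params = embedding + encoder + decoder
print(f"{total_params:,} parameters, ~{total_params * 4 / 1e6:.0f} MB in float32")
```

Under these assumptions the total comes to roughly 223 million parameters, which is in line with the size of the `model.safetensors` file downloaded later in this notebook.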

## T5 Base Architecture


<center>
    <img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2024/05/Screenshot-318.png" width="300" height="450">
<p style = "font-size: 14px;
            font-family: 'Georgia', serif;
            text-align: center;
            margin-top: 10px;">T5-Base architecture. <br>Source: <a href = "https://www.analyticsvidhya.com/blog/2024/05/text-summarization-using-googles-t5-base/">Analytics Vidhya</a></p>
</center>

## Dependencies


Installations and Setup

First, install the necessary libraries and import the required modules:

In [None]:
!pip install transformers                                     # Installing the transformers library (https://huggingface.co/docs/transformers/index)

In [13]:
from transformers import T5ForConditionalGeneration, T5Tokenizer         # T5 Tokenizer and architecture
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer        # These will help us to fine-tune our model
from transformers import DataCollatorForSeq2Seq                          # DataCollator to batch the data

In [2]:
!pip install accelerate -U

Collecting accelerate
  Downloading accelerate-0.32.1-py3-none-any.whl (314 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.w

Next, install the Hugging Face Datasets library in the environment.

In [3]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-16.1.0-cp310-cp310-manylinux_2_28_x86_64.whl (40.8 MB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting requests>=2.32.2 (from datasets)
  Downloading requests-2.32.3-py3-none-any.whl (64 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1

In [4]:
# !pip install --upgrade datasets

In [5]:
# !pip install --upgrade pyarrow

In [6]:
# !pip install --upgrade pyarrow datasets

## Load dataset

Use the Hugging Face Datasets library to download the data we need to use for training, evaluation and testing. This can be easily done with the load_dataset function.

In [1]:
from datasets import load_dataset

In [2]:
# Load the first 1% of the XSum training and validation splits
train_dataset = load_dataset("xsum", split="train[:1%]")
valid_dataset = load_dataset("xsum", split="validation[:1%]")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading builder script:   0%|          | 0.00/5.76k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/6.24k [00:00<?, ?B/s]

The repository for xsum contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/xsum.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Downloading data:   0%|          | 0.00/255M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.00M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/204045 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11332 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11334 [00:00<?, ? examples/s]

In [3]:
test_dataset = load_dataset("xsum", split="test[:1%]")

In [4]:
test_dataset

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 113
})

## Extreme Summarization (XSum) dataset
This dataset contains BBC articles along with single-sentence summaries. Each article includes an introductory sentence, professionally written by the article's author. The dataset comprises 226,711 articles, split into training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) sets.
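
The split sizes above can be checked with a few lines of arithmetic. The sketch also estimates how many rows a `[:1%]` slice of each split contains, under the assumption that the fractional row count is rounded down; that assumption is consistent with the row counts reported in this notebook (2040 and 113) but is not a statement about how the library rounds in general.

```python
# Split sizes as stated in the dataset description.
splits = {"train": 204_045, "validation": 11_332, "test": 11_334}
total = sum(splits.values())

for name, n in splits.items():
    share = 100 * n / total
    one_percent = n // 100  # rows in a "[:1%]" slice, assuming the fraction is rounded down
    print(f"{name:<10} {n:>7} rows  {share:5.2f}% of total  1% slice = {one_percent} rows")
```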

If sufficient computational resources are available, we can load the whole XSum dataset with the following code and train the model on it.

In [5]:
# dataset = load_dataset("EdinburghNLP/xsum")

# # Split the dataset into train, validation, and test sets
# train_dataset = dataset['train']
# valid_dataset = dataset['validation']
# test_dataset = dataset['test']

In [6]:
train_dataset

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 2040
})

In [7]:
# Print the first 5 examples in the training dataset
for i in range(5):
    print(train_dataset[i])

{'document': 'A fire alarm went off at the Holiday Inn in Hope Street at about 04:20 BST on Saturday and guests were asked to leave the hotel.\nAs they gathered outside they saw the two buses, parked side-by-side in the car park, engulfed by flames.\nOne of the tour groups is from Germany, the other from China and Taiwan. It was their first night in Northern Ireland.\nThe driver of one of the buses said many of the passengers had left personal belongings on board and these had been destroyed.\nBoth groups have organised replacement coaches and will begin their tour of the north coast later than they had planned.\nPolice have appealed for information about the attack.\nInsp David Gibson said: "It appears as though the fire started under one of the buses before spreading to the second.\n"While the exact cause is still under investigation, it is thought that the fire was started deliberately."', 'summary': 'Two tourist buses have been destroyed by fire in a suspected arson attack in Belfa

The raw printout above is hard to read. For better readability, let us convert the dataset to a DataFrame.

## Converting dataset to a Pandas DataFrame
We can convert the dataset into a Pandas DataFrame so that its first few entries are easier to inspect.

In [8]:
import pandas as pd

# Set Pandas options to display the full text in each cell
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

# Convert to a Pandas DataFrame
df = pd.DataFrame(train_dataset)
df.head(5)

Unnamed: 0,document,summary,id
0,"The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.\nRepair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.\nTrains on the west coast mainline face disruption due to damage at the Lamington Viaduct.\nMany businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.\nFirst Minister Nicola Sturgeon visited the area to inspect the damage.\nThe waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.\nJeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.\nHowever, she said more preventative work could have been carried out to ensure the retaining wall did not fail.\n""It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that - but it is almost like we're neglected or forgotten,"" she said.\n""That may not be true but it is perhaps my perspective over the last few days.\n""Why were you not ready to help us a bit more when the warning and the alarm alerts had gone out?""\nMeanwhile, a flood alert remains in place across the Borders because of the constant rain.\nPeebles was badly hit by problems, sparking calls to introduce more defences in the area.\nScottish Borders Council has put a list on its website of the roads worst affected and drivers have been urged not to ignore closure signs.\nThe Labour Party's deputy Scottish leader Alex Rowley was in Hawick on Monday to see the situation first hand.\nHe said it was important to get the flood protection plan right but backed calls to speed up the process.\n""I was quite taken aback by the amount of damage that has been done,"" he said.\n""Obviously it is heart-breaking for people who have been forced out of their homes and the impact on businesses.""\nHe said 
it was important that ""immediate steps"" were taken to protect the areas most vulnerable and a clear timetable put in place for flood prevention plans.\nHave you been affected by flooding in Dumfries and Galloway or the Borders? Tell us about your experience of the situation and how it was handled. Email us on selkirk.news@bbc.co.uk or dumfries@bbc.co.uk.",Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.,35232142
1,"A fire alarm went off at the Holiday Inn in Hope Street at about 04:20 BST on Saturday and guests were asked to leave the hotel.\nAs they gathered outside they saw the two buses, parked side-by-side in the car park, engulfed by flames.\nOne of the tour groups is from Germany, the other from China and Taiwan. It was their first night in Northern Ireland.\nThe driver of one of the buses said many of the passengers had left personal belongings on board and these had been destroyed.\nBoth groups have organised replacement coaches and will begin their tour of the north coast later than they had planned.\nPolice have appealed for information about the attack.\nInsp David Gibson said: ""It appears as though the fire started under one of the buses before spreading to the second.\n""While the exact cause is still under investigation, it is thought that the fire was started deliberately.""",Two tourist buses have been destroyed by fire in a suspected arson attack in Belfast city centre.,40143035
2,"Ferrari appeared in a position to challenge until the final laps, when the Mercedes stretched their legs to go half a second clear of the red cars.\nSebastian Vettel will start third ahead of team-mate Kimi Raikkonen.\nThe world champion subsequently escaped punishment for reversing in the pit lane, which could have seen him stripped of pole.\nBut stewards only handed Hamilton a reprimand, after governing body the FIA said ""no clear instruction was given on where he should park"".\nBelgian Stoffel Vandoorne out-qualified McLaren team-mate Jenson Button on his Formula 1 debut.\nVandoorne was 12th and Button 14th, complaining of a handling imbalance on his final lap but admitting the newcomer ""did a good job and I didn't"".\nMercedes were wary of Ferrari's pace before qualifying after Vettel and Raikkonen finished one-two in final practice, and their concerns appeared to be well founded as the red cars mixed it with the silver through most of qualifying.\nAfter the first runs, Rosberg was ahead, with Vettel and Raikkonen splitting him from Hamilton, who made a mistake at the final corner on his first lap.\nBut Hamilton saved his best for last, fastest in every sector of his final attempt, to beat Rosberg by just 0.077secs after the German had out-paced him throughout practice and in the first qualifying session.\nVettel rued a mistake at the final corner on his last lap, but the truth is that with the gap at 0.517secs to Hamilton there was nothing he could have done.\nThe gap suggests Mercedes are favourites for the race, even if Ferrari can be expected to push them.\nVettel said: ""Last year we were very strong in the race and I think we are in good shape for tomorrow. 
We will try to give them a hard time.""\nVandoorne's preparations for his grand prix debut were far from ideal - he only found out he was racing on Thursday when FIA doctors declared Fernando Alonso unfit because of a broken rib sustained in his huge crash at the first race of the season in Australia two weeks ago.\nThe Belgian rookie had to fly overnight from Japan, where he had been testing in the Super Formula car he races there, and arrived in Bahrain only hours before first practice on Friday.\nHe also had a difficult final practice, missing all but the final quarter of the session because of a water leak.\nButton was quicker in the first qualifying session, but Vandoorne pipped him by 0.064secs when it mattered.\nThe 24-year-old said: ""I knew after yesterday I had quite similar pace to Jenson and I knew if I improved a little bit I could maybe challenge him and even out-qualify him and that is what has happened.\n""Jenson is a very good benchmark for me because he is a world champion and he is well known to the team so I am very satisfied with the qualifying.""\nButton, who was 0.5secs quicker than Vandoorne in the first session, complained of oversteer on his final run in the second: ""Q1 was what I was expecting. Q2 he did a good job and I didn't. Very, very good job. 
We knew how quick he was.""\nThe controversial new elimination qualifying system was retained for this race despite teams voting at the first race in Australia to go back to the 2015 system.\nFIA president Jean Todt said earlier on Saturday that he ""felt it necessary to give new qualifying one more chance"", adding: ""We live in a world where there is too much over reaction.""\nThe system worked on the basis of mixing up the grid a little - Force India's Sergio Perez ended up out of position in 18th place after the team miscalculated the timing of his final run, leaving him not enough time to complete it before the elimination clock timed him out.\nBut it will come in for more criticism as a result of lack of track action at the end of each session. There were three minutes at the end of the first session with no cars on the circuit, and the end of the second session was a similar damp squib.\nOnly one car - Nico Hulkenberg's Force India - was out on the track with six minutes to go. The two Williams cars did go out in the final three minutes but were already through to Q3 and so nothing was at stake.\nThe teams are meeting with Todt and F1 commercial boss Bernie Ecclestone on Sunday at noon local time to decide on what to do with qualifying for the rest of the season.\nTodt said he was ""optimistic"" they would be able to reach unanimous agreement on a change.\n""We should listen to the people watching on TV,"" Rosberg said. 
""If they are still unhappy, which I am sure they will be, we should change it.""\nRed Bull's Daniel Ricciardo was fifth on the grid, ahead of the Williams cars of Valtteri Bottas and Felipe Massa and Force India's Nico Hulkenberg.\nRicciardo's team-mate Daniil Kvyat was eliminated during the second session - way below the team's expectation - and the Renault of Brit Jolyon Palmer only managed 19th fastest.\nGerman Mercedes protege Pascal Wehrlein managed an excellent 16th in the Manor car.\nBahrain GP qualifying results\nBahrain GP coverage details",Lewis Hamilton stormed to pole position at the Bahrain Grand Prix ahead of Mercedes team-mate Nico Rosberg.,35951548
3,"John Edward Bates, formerly of Spalding, Lincolnshire, but now living in London, faces a total of 22 charges, including two counts of indecency with a child.\nThe 67-year-old is accused of committing the offences between March 1972 and October 1989.\nMr Bates denies all the charges.\nGrace Hale, prosecuting, told the jury that the allegations of sexual abuse were made by made by four male complainants and related to when Mr Bates was a scout leader in South Lincolnshire and Cambridgeshire.\n""The defendant says nothing of that sort happened between himself and all these individuals. He says they are all fabricating their accounts and telling lies,"" said Mrs Hale.\nThe prosecutor claimed Mr Bates invited one 15 year old to his home offering him the chance to look at cine films made at scout camps but then showed him pornographic films.\nShe told the jury that the boy was then sexually abused leaving him confused and frightened.\nMrs Hale said: ""The complainant's recollection is that on a number of occasions sexual acts would happen with the defendant either in the defendant's car or in his cottage.""\nShe told the jury a second boy was taken by Mr Bates for a weekend in London at the age of 13 or 14 and after visiting pubs he was later sexually abused.\nMrs Hale said two boys from the Spalding group had also made complaints of being sexually abused.\nThe jury has been told that Mr Bates was in the RAF before serving as a Lincolnshire Police officer between 1976 and 1983.\nThe trial, which is expected to last two weeks, continues.","A former Lincolnshire Police officer carried out a series of sex attacks on boys, a jury at Lincoln Crown Court was told.",36266422
4,"Patients and staff were evacuated from Cerahpasa hospital on Wednesday after a man receiving treatment at the clinic threatened to shoot himself and others.\nOfficers were deployed to negotiate with the man, a young police officer.\nEarlier reports that the armed man had taken several people hostage proved incorrect.\nThe chief consultant of Cerahpasa hospital, Zekayi Kutlubay, who was evacuated from the facility, said that there had been ""no hostage crises"", adding that the man was ""alone in the room"".\nDr Kutlubay said that the man had been receiving psychiatric treatment for the past two years.\nHe said that the hospital had previously submitted a report stating that the man should not be permitted to carry a gun.\n""His firearm was taken away,"" Dr Kutlubay said, adding that the gun in the officer's possession on Wednesday was not his issued firearm.\nThe incident comes amid tension in Istanbul following several attacks in crowded areas, including the deadly assault on the Reina nightclub on New Year's Eve which left 39 people dead.","An armed man who locked himself into a room at a psychiatric hospital in Istanbul has ended his threat to kill himself, Turkish media report.",38826984


The dataset has the following fields:

document: the original BBC article to be summarized

summary: the single sentence summary of the BBC article

id: ID of the document-summary pair

In [9]:
df.describe()

Unnamed: 0,document,summary,id
count,2040,2040,2040
unique,2039,2038,2040
top,"This breaking news story is being updated and more details will be published shortly. Please refresh the page for the fullest version.\nIf you want to receive Breaking News alerts via email, or on a smartphone or tablet via the BBC News App then details on how to do so are available on this help page. You can also follow @BBCBreaking on Twitter to get the latest alerts.",All pictures are copyrighted.,39847414
freq,2,2,1


In [10]:
valid_dataset

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 113
})

In [11]:
# Convert to a Pandas DataFrame
valid_df = pd.DataFrame(valid_dataset)
valid_df.head(5)

Unnamed: 0,document,summary,id
0,"The ex-Reading defender denied fraudulent trading charges relating to the Sodje Sports Foundation - a charity to raise money for Nigerian sport.\nMr Sodje, 37, is jointly charged with elder brothers Efe, 44, Bright, 50 and Stephen, 42.\nAppearing at the Old Bailey earlier, all four denied the offence.\nThe charge relates to offences which allegedly took place between 2008 and 2014.\nSam, from Kent, Efe and Bright, of Greater Manchester, and Stephen, from Bexley, are due to stand trial in July.\nThey were all released on bail.",Former Premier League footballer Sam Sodje has appeared in court alongside three brothers accused of charity fraud.,38295789
1,"Voges was forced to retire hurt on 86 after suffering the injury while batting during the County Championship draw with Somerset on 4 June.\nMiddlesex hope to have the Australian back for their T20 Blast game against Hampshire at Lord's on 3 August.\nThe 37-year-old has scored 230 runs in four first-class games this season at an average of 57.50.\n""Losing Adam is naturally a blow as he contributes significantly to everything we do,"" director of cricket Angus Fraser said.\n""His absence, however, does give opportunities to other players who are desperate to play in the first XI.\n""In the past we have coped well without an overseas player and I expect us to do so now.""\nDefending county champions Middlesex are sixth in the Division One table, having drawn all four of their matches this season.\nVoges retired from international cricket in February with a Test batting average of 61.87 from 31 innings, second only to Australian great Sir Donald Bradman's career average of 99.94 from 52 Tests.",Middlesex batsman Adam Voges will be out until August after suffering a torn calf muscle in his right leg.,40202028
2,"Seven photographs taken in the Norfolk countryside by photographer Josh Olins will appear in the June edition.\nIn her first sitting for a magazine, the duchess is seen looking relaxed and wearing casual clothes.\nThe shoot was in collaboration with the National Portrait Gallery, where two images are being displayed in the Vogue 100: A Century of Style exhibition.\nThe duchess, who has a keen interest in photography, has been patron of the National Portrait Gallery since 2012.\nNicholas Cullinan, director of the National Portrait Gallery, said: ""Josh has captured the duchess exactly as she is - full of life, with a great sense of humour, thoughtful and intelligent, and in fact, very beautiful.""\nHe said the images also encapsulated what Vogue had done over the past 100 years - ""to pair the best photographers with the great personalities of the day, in order to reflect broader shifts in culture and society"".\nAlexandra Shulman, editor-in-chief of British Vogue, said: ""To be able to publish a photographic shoot with the Duchess of Cambridge has been one of my greatest ambitions for the magazine.""\nThe collaboration for the June edition had resulted in ""a true celebration of our centenary as well as a fitting tribute to a young woman whose interest in both photography and the countryside is well known"", she said.\nOther royal portraits to have featured in the fashion magazine include Diana, Princess of Wales - who graced the cover four times - and Princess Anne.\nThe duchess is to visit the exhibition at the National Portrait Gallery on Wednesday, Kensington Palace said.",The Duchess of Cambridge will feature on the cover of British Vogue to mark the magazine's centenary.,36177725
3,"Chris Poole - known as ""moot"" online - created the site in 2003.\nIt has gone on to be closely associated with offensive and often illegal activity, including instances where the images of child abuse were shared.\nIt was widely credited as being the first place where leaked images of nude celebrities were posted following 2014's well-publicised security breach affecting Apple's iCloud service. That incident prompted a policy change on the site.\nHowever, 4chan has also been the rallying point for many instances of online activism from the likes of Anonymous, the loosely organized hacktivism group.\nMr Poole shared news of his new position on blogging site Tumblr.\n""When meeting with current and former Googlers, I continually find myself drawn to their intelligence, passion, and enthusiasm - as well as a universal desire to share it with others.""\n""I'm also impressed by Google's commitment to enabling these same talented people to tackle some of the world's most interesting and important problems.\nHe added: ""I can't wait to contribute my own experience from a dozen years of building online communities, and to begin the next chapter of my career at such an incredible company.""\nMr Poole stepped down as the administrator of 4chan in January 2015. Now he is expected to turn his attentions to Google's social networking efforts.\nHis arrival was welcomed by Bradley Horowitz, the head of ""streams, photos and sharing"" at the search giant's floundering social network, Google+.\n""I'm thrilled he's joining our team here at Google,"" Mr Horowitz said.\n""Welcome Chris!""\nSeveral commentators described the appointment as ""unexpected"" but noted that Mr Poole's expertise with social media could prove useful to the search firm.\nFollow Dave Lee on Twitter @DaveLeeBBC and on Facebook",Google has hired the creator of one of the web's most notorious forums - 4chan.,35751255
4,"Four police officers were injured in the incident on Friday night.\nA man, aged 19, and a boy, aged 16, have been charged with six counts of aggravated vehicle taking.\nThey are due to appear before Belfast Magistrates' Court on Monday.\nThe 19-year-old man has also been charged with driving while disqualified and using a motor vehicle without insurance.",Two teenagers have been charged in connection with an incident in west Belfast in which a car collided with two police vehicles.,35275743


## Defining the model and tokenizer
Now we download the pretrained T5 model and its tokenizer, which we will fine-tune in the following sections.

In [14]:
# Initialize the model and tokenizer
model_name = "t5-base"  # You can use t5-small, t5-base, t5-large, etc.
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [15]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.2-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.2


In [16]:
import evaluate                                            # Hugging Face's library for model evaluation

## Preprocessing

Before we can feed the texts to our model, we need to preprocess them. This is done by a Hugging Face Transformers tokenizer, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), puts them in the format the model expects, and generates the other inputs that the model requires, such as the attention mask.
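
The steps a tokenizer performs can be illustrated with a toy example. The sketch below is a deliberately simplified stand-in: it splits on whitespace and uses a tiny hand-made vocabulary, whereas the real T5 tokenizer uses a SentencePiece subword vocabulary. The function `toy_encode` and the vocabulary are hypothetical; only the special token IDs (0 for padding, 1 for end of sequence, 2 for unknown) mirror T5's conventions.

```python
# Toy stand-in for a real tokenizer: whitespace splitting and a tiny hand-made vocabulary.
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2
vocab = {"<pad>": PAD_ID, "</s>": EOS_ID, "<unk>": UNK_ID,
         "the": 3, "fire": 4, "started": 5, "deliberately": 6}

def toy_encode(text: str, max_length: int) -> dict:
    """Map words to IDs, append EOS, truncate, then pad to max_length."""
    ids = [vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[: max_length - 1] + [EOS_ID]      # truncation keeps room for EOS
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * padding,
        "attention_mask": [1] * len(ids) + [0] * padding,  # 1 = real token, 0 = padding
    }

encoded = toy_encode("The fire started deliberately", max_length=8)
print(encoded["input_ids"])       # token IDs followed by padding
print(encoded["attention_mask"])  # marks which positions hold real tokens
```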

In [17]:
# Preprocess function: tokenize the articles (inputs) and the summaries (targets)
def preprocess_function(examples):
    inputs = examples["document"]
    targets = examples["summary"]
    # Truncate articles to 512 tokens and pad shorter ones to that length
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")

    # Tokenize the summaries as targets (text_target is used instead of the
    # deprecated as_target_tokenizer() context manager)
    labels = tokenizer(text_target=targets, max_length=128, truncation=True, padding="max_length")

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
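
As a possible variation, and not the approach used in this notebook, the preprocessing step could prepend a `summarize: ` task prefix to every article and leave padding to the data collator instead of padding everything to the maximum length. The function name `preprocess_with_prefix`, its parameters, and the choice to pass the tokenizer explicitly are assumptions made for this sketch; whether the prefix or dynamic padding changes the results of this particular run has not been tested here.

```python
PREFIX = "summarize: "  # T5-style task prefix; an assumption, not used in the cell above

def preprocess_with_prefix(examples, tokenizer, max_input_length=512, max_target_length=128):
    """Tokenize articles and summaries without padding; batches are padded later."""
    inputs = [PREFIX + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    # text_target tells the tokenizer that these strings are targets (labels)
    labels = tokenizer(text_target=examples["summary"],
                       max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```

It could be applied with `train_dataset.map(lambda batch: preprocess_with_prefix(batch, tokenizer), batched=True)`, after which each batch is padded only to the length of its longest example.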

In [18]:
# Apply preprocessing to all splits
train_dataset = train_dataset.map(preprocess_function, batched=True)
valid_dataset = valid_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/2040 [00:00<?, ? examples/s]



Map:   0%|          | 0/113 [00:00<?, ? examples/s]

In [19]:
test_dataset = test_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/113 [00:00<?, ? examples/s]

To train sequence-to-sequence models, we need a special kind of data collator that pads not only the inputs but also the labels to the maximum length in the batch. We therefore use the DataCollatorForSeq2Seq class provided by the Hugging Face Transformers library.

DataCollatorForSeq2Seq assembles individual examples into batches and applies padding automatically as it does so.
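
The padding behaviour described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function `collate` is hypothetical, it works on lists rather than tensors, and it omits other things a real collator may do. It assumes a padding token ID of 0 (as in T5) and pads labels with -100, the value that PyTorch's cross-entropy loss ignores by default.

```python
PAD_TOKEN_ID = 0      # T5's padding token ID
LABEL_PAD_ID = -100   # ignored by PyTorch's cross-entropy loss by default

def collate(batch):
    """Pad every field to the longest sequence in the batch."""
    max_in = max(len(ex["input_ids"]) for ex in batch)
    max_lab = max(len(ex["labels"]) for ex in batch)
    return {
        "input_ids": [ex["input_ids"] + [PAD_TOKEN_ID] * (max_in - len(ex["input_ids"]))
                      for ex in batch],
        "attention_mask": [[1] * len(ex["input_ids"]) + [0] * (max_in - len(ex["input_ids"]))
                           for ex in batch],
        "labels": [ex["labels"] + [LABEL_PAD_ID] * (max_lab - len(ex["labels"]))
                   for ex in batch],
    }

batch = collate([
    {"input_ids": [3, 4, 5, 1], "labels": [6, 1]},
    {"input_ids": [3, 1], "labels": [4, 5, 6, 1]},
])
print(batch["input_ids"])
print(batch["labels"])
```

Padding labels with -100 rather than the padding token ID keeps padded positions from contributing to the training loss.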

In [21]:
# Instantiating Data Collator
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

In [22]:
# # Verify installation
# import transformers
# import accelerate

# print(f"transformers version: {transformers.__version__}")
# print(f"accelerate version: {accelerate.__version__}")

In [23]:
# !pip install transformers==4.42.3
# !pip install accelerate==0.21.0

## Defining parameters for training

We now use the Seq2SeqTrainingArguments class to configure fine-tuning. We first define an output directory and then set the evaluation strategy, learning rate, batch sizes, number of epochs, and other options.

In [32]:
# Define the training arguments
seed = 42

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    seed=seed,
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    save_total_limit=4,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,                      # Enable mixed precision training
)

In [33]:
print(training_args)

Seq2SeqTrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_steps=None,
eval_strategy=epoch,
evaluation_strategy=None,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[]

## Creating the Trainer Instance
Now, create a Seq2SeqTrainer instance to manage the training and evaluation process for our T5 model. The Seq2SeqTrainer class is specifically designed for sequence-to-sequence tasks, like text summarization. This instance is initialized with the model, training arguments, training and validation datasets, tokenizer, and data collator, which together facilitate efficient model training and evaluation.

In [34]:
# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

## Starting the Training Process
Start training by calling the train method of the Seq2SeqTrainer instance. It fine-tunes the T5 base model on the training dataset using the arguments defined above and, because eval_strategy is set to "epoch", evaluates on the validation dataset at the end of each epoch.

In [35]:
# Start training
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.9152        | 0.476516        |
| 2     | 0.5044        | 0.469751        |
| 3     | 0.4682        | 0.470956        |
| 4     | 0.4458        | 0.472782        |


TrainOutput(global_step=2040, training_loss=0.5809187861049876, metrics={'train_runtime': 1030.684, 'train_samples_per_second': 7.917, 'train_steps_per_second': 1.979, 'total_flos': 4969096386969600.0, 'train_loss': 0.5809187861049876, 'epoch': 4.0})
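
The global_step value in the output follows from the batch size and the number of epochs. The sketch below reproduces that arithmetic under two assumptions: a single device with no gradient accumulation, and a training subset of 2,040 examples. The dataset size is inferred from the logged step count and throughput and is not stated in this section. The helper name `total_optimizer_steps` is hypothetical.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Approximate number of optimizer steps for a single device without gradient accumulation."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# num_examples=2040 is an assumption inferred from the training log above.
steps = total_optimizer_steps(num_examples=2040, batch_size=4, num_epochs=4)
print(steps)
```

With a batch size of 4, each epoch takes 510 steps, so four epochs give the 2,040 steps reported by the trainer.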


## Evaluating and Saving the Model

With training complete, we evaluate the model on the held-out test dataset using the trainer's evaluate method.

In [36]:
# Evaluating model performance on the test dataset
test = trainer.evaluate(eval_dataset = test_dataset)
print(test) # Printing results

{'eval_loss': 0.49926337599754333, 'eval_runtime': 3.6732, 'eval_samples_per_second': 30.763, 'eval_steps_per_second': 7.895, 'epoch': 4.0}
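
One way to read the reported eval_loss is to convert it to perplexity, the exponential of the mean cross-entropy. The snippet below is a small worked example, assuming eval_loss is the mean token-level cross-entropy in nats. Because the trainer averages the loss over batches, the result is an approximation. Perplexity describes how well the model predicts the reference tokens; it is not a direct measure of summary quality.

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


test_loss = 0.49926337599754333  # eval_loss reported above
print(f"Test perplexity: {perplexity(test_loss):.2f}")
```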


In [37]:
# Saving model to a custom directory
directory = "T5_finetuned_xsum_subset"
trainer.save_model(directory)

# Saving model tokenizer
tokenizer.save_pretrained(directory)

('T5_finetuned_xsum_subset/tokenizer_config.json',
 'T5_finetuned_xsum_subset/special_tokens_map.json',
 'T5_finetuned_xsum_subset/spiece.model',
 'T5_finetuned_xsum_subset/added_tokens.json')

## Inference

We now run inference with the fine-tuned model on an article from the test set. Hugging Face Transformers provides the pipeline function, which wraps preprocessing, generation, and decoding for a range of tasks; here we use the summarization pipeline.

The pipeline function takes the trained model and tokenizer as arguments.

In [39]:
from transformers import pipeline

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)

summarizer(
    test_dataset[0]["document"],
    min_length=64,
    max_length=128,
)

[{'summary_text': 'Prison leavers in Wales are living rough for up to a year before finding suitable accommodation, a charity says . "there\'s a desperate need for it," says a worker who has been jailed for 20 years for burglary offences . 20,000 new affordable homes will be built in the next five years, the government says.'}]
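
The pipeline can also be built from the saved directory instead of the in-memory objects, which is useful in a fresh session. The sketch below assumes the directory T5_finetuned_xsum_subset written by the save cell above exists in the working directory and holds both the model and the tokenizer files. The helper name `load_summarizer` is hypothetical.

```python
from pathlib import Path


def load_summarizer(model_dir):
    """Build a summarization pipeline from a directory holding model and tokenizer files."""
    if not Path(model_dir).is_dir():
        raise FileNotFoundError(f"No saved model directory at {model_dir!r}")
    # Imported lazily so the directory check above runs before transformers is loaded.
    from transformers import pipeline

    # Passing a path lets the pipeline load the weights and tokenizer from disk.
    return pipeline("summarization", model=model_dir, tokenizer=model_dir)


# Usage (assumes the save cell above has already been run):
# summarizer = load_summarizer("T5_finetuned_xsum_subset")
# summarizer("Some long article text ...", min_length=64, max_length=128)
```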

In [41]:
examples = [
    test_dataset[i]["document"]
    for i in range(5)
]

In [75]:
import pickle

# Save the example articles to disk
with open('Summarization_examples.pkl', 'wb') as f:
    pickle.dump(examples, f)

# Load them back to confirm the file round-trips
with open('Summarization_examples.pkl', 'rb') as f:
    loaded_examples = pickle.load(f)

In [76]:
from google.colab import files
files.download("Summarization_examples.pkl")

In [44]:
loaded_examples[0]

'Prison Link Cymru had 1,099 referrals in 2015-16 and said some ex-offenders were living rough for up to a year before finding suitable accommodation.\nWorkers at the charity claim investment in housing would be cheaper than jailing homeless repeat offenders.\nThe Welsh Government said more people than ever were getting help to address housing problems.\nChanges to the Housing Act in Wales, introduced in 2015, removed the right for prison leavers to be given priority for accommodation.\nPrison Link Cymru, which helps people find accommodation after their release, said things were generally good for women because issues such as children or domestic violence were now considered.\nHowever, the same could not be said for men, the charity said, because issues which often affect them, such as post traumatic stress disorder or drug dependency, were often viewed as less of a priority.\nAndrew Stevens, who works in Welsh prisons trying to secure housing for prison leavers, said the need for acc

## Saving the model

In [63]:
best_model_save_path = "Fine_tuned_T5_XSum"
model.save_pretrained(best_model_save_path)

Non-default generation parameters: {'max_length': 200, 'min_length': 30, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3}


In [64]:
tokenizer_save_path = "Tokenizer"
tokenizer.save_pretrained(tokenizer_save_path)

('Tokenizer/tokenizer_config.json',
 'Tokenizer/special_tokens_map.json',
 'Tokenizer/spiece.model',
 'Tokenizer/added_tokens.json')

## Compressing the model

In [65]:
zip_file_path, source_path = f"{best_model_save_path}.zip", best_model_save_path

In [66]:
import os
# Zipping the model to download
os.system(f"zip -r {zip_file_path} {source_path}")

0

In [80]:
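# Zipping the tokenizer directory (the name is case-sensitive and must match tokenizer_save_path)
os.system(f"zip -r {tokenizer_save_path}.zip {tokenizer_save_path}")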

0
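
As a portable alternative to calling the zip command through os.system, the standard library's shutil.make_archive can build the same archives without depending on an external binary. This is a minimal sketch rather than the notebook's original method; the helper name `zip_directory` is hypothetical.

```python
import shutil
from pathlib import Path


def zip_directory(source_dir):
    """Create <source_dir>.zip next to source_dir and return the archive path."""
    source = Path(source_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"Directory not found: {source}")
    archive = shutil.make_archive(
        base_name=str(source),         # archive path without the .zip extension
        format="zip",
        root_dir=str(source.parent),   # paths inside the archive are relative to this
        base_dir=source.name,          # keep the top-level folder inside the archive
    )
    return Path(archive)


# Usage (assumes the directories saved above exist):
# zip_directory("Fine_tuned_T5_XSum")
# zip_directory("Tokenizer")
```

Unlike os.system, which only returns an exit status, this raises an exception when the source directory is missing, so a wrong or mis-cased directory name is caught immediately.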

## Download the model

In [77]:
# Downloading the zipped model to the local machine
from google.colab import files

In [78]:
files.download(zip_file_path)

In [69]:
from google.colab import drive
drive.mount("drive")

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).


In [70]:
print(os.getcwd())

/content


In [72]:
!cp -r /content/Fine_tuned_T5_XSum /content/drive/MyDrive/DS_Projects/Text_Summarization_using_T5_XSUM

In [81]:
files.download("Tokenizer.zip")

In [82]:
!cp -r /content/Tokenizer /content/drive/MyDrive/DS_Projects/Text_Summarization_using_T5_XSUM