<a href="https://colab.research.google.com/github/Shrishti18/Text-Summarization/blob/main/Pegasus%2CT5%2CBART.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This project demonstrates how to use the T5, Pegasus, and BART transformers with Hugging Face for text summarization, applied to a news dataset.**


In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
news = pd.read_excel("Inshorts Cleaned Data.xlsx", sheet_name='Sheet1')
news.head()

Unnamed: 0,Headline,Short,Source,Time,Publish Date
0,4 ex-bank officials booked for cheating bank o...,The CBI on Saturday booked four former officia...,The New Indian Express,09:25:00,2017-03-26
1,Supreme Court to go paperless in 6 months: CJI,Chief Justice JS Khehar has said the Supreme C...,Outlook,22:18:00,2017-03-25
2,"At least 3 killed, 30 injured in blast in Sylh...","At least three people were killed, including a...",Hindustan Times,23:39:00,2017-03-25
3,Why has Reliance been barred from trading in f...,Mukesh Ambani-led Reliance Industries (RIL) wa...,Livemint,23:08:00,2017-03-25
4,Was stopped from entering my own studio at Tim...,TV news anchor Arnab Goswami has said he was t...,YouTube,23:24:00,2017-03-25


In [4]:
news.drop(['Source ', 'Time ', 'Publish Date'], axis=1, inplace=True)


In [6]:
news.isnull().sum()


Headline    0
Short       0
dtype: int64

The **transformers** package, developed by Hugging Face, provides a wide range of pre-trained models for various NLP tasks, such as BERT, GPT-2, and T5.


The **sentencepiece** package is a library for subword tokenization: it splits text into subword units drawn from a learned vocabulary, so a word that is not in the vocabulary can still be represented as a sequence of smaller pieces. This technique is commonly used in NLP to handle out-of-vocabulary (OOV) words. The Pegasus and T5 tokenizers are SentencePiece-based.
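
To make the idea of subword units concrete, here is a toy sketch. It is not SentencePiece's actual algorithm (SentencePiece learns its vocabulary from data); it only shows how a word that is missing from the vocabulary can still be encoded as known pieces. The vocabulary `TOY_VOCAB` and the function `segment` are hypothetical names invented for this illustration.

```python
# Toy illustration only: greedy longest-match segmentation over a tiny,
# hand-written vocabulary. SentencePiece itself learns its pieces from data.
TOY_VOCAB = {"summar", "ization", "token", "izer", "s", "un", "known"}

def segment(word, vocab=TOY_VOCAB, unk="<unk>"):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece matches at this position: emit <unk> for one character.
            pieces.append(unk)
            start += 1
    return pieces

print(segment("summarization"))  # ['summar', 'ization']
print(segment("tokenizers"))     # ['token', 'izer', 's']
```

Neither word is in the toy vocabulary as a whole, yet both are fully covered by smaller pieces, which is the property that lets subword models avoid OOV tokens.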



In [7]:
!pip install transformers sentencepiece


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.98-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)

The next code cell loads the Pegasus model and tokenizer from pre-trained weights hosted on the Hugging Face model hub. The tokenizer converts input text into the token IDs the model expects, and the model is the pre-trained neural network that generates the summaries.



**pipeline("summarization", model=model, tokenizer=tokenizer, device=0)** creates a pipeline object for the *summarization* task. The *model* and *tokenizer* arguments are the pre-trained summarization model and tokenizer to use, and *device* selects the hardware to run on (`0` is the first GPU). The optional *framework* argument names the backend (`"pt"` for PyTorch, `"tf"` for TensorFlow) and has to match the class the model was loaded with: `AutoModelForSeq2SeqLM` is the PyTorch class, and `TFAutoModelForSeq2SeqLM` is its TensorFlow counterpart. Once the pipeline is created, you can summarize input text by calling it directly, for example `summarizer1(text)`.
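
As a rough sketch of what the pipeline hands back: calling a summarization pipeline on one string yields a list containing a single dict with a `summary_text` key, which is the shape visible in the notebook outputs. The helper names `summary_text` and `summarize_all` are hypothetical, and `fake_summarizer` is a stand-in (not a real model) so that the snippet runs without downloading any weights; with the real pipeline you would pass `summarizer1` instead.

```python
def summary_text(pipeline_output):
    """Pull the summary string out of a result shaped like
    [{'summary_text': '...'}]."""
    return pipeline_output[0]["summary_text"]

def summarize_all(summarizer, texts, **generate_kwargs):
    """Summarize each text and return plain strings instead of nested lists."""
    return [summary_text(summarizer(text, **generate_kwargs)) for text in texts]

def fake_summarizer(text, min_length=10, max_length=30):
    """Stand-in with the same return shape as the pipeline. Assumption:
    it just keeps the first sentence; the length arguments are ignored."""
    first_sentence = text.split(". ")[0].rstrip(".") + "."
    return [{"summary_text": first_sentence}]

# Short sample input, abridged for illustration.
shorts = ["Arsenal and Manchester City played out a 2-2 draw. Four goals were scored."]
print(summarize_all(fake_summarizer, shorts, min_length=10, max_length=30))
```

Storing plain strings rather than the raw list-of-dict output keeps later steps, such as metric computation, working on the summary text itself.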


In [17]:
# Pegasus (the pipeline infers the framework from the loaded model)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")
# AutoModelForSeq2SeqLM is the PyTorch auto class
model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-xsum")
# device=0 places the pipeline on the first GPU
summarizer1 = pipeline("summarization", model=model, tokenizer=tokenizer, device=0)


In [18]:
# reducing the dataset for faster process
news=news.sample(n=20)
news.reset_index(inplace = True,drop = True)
news.head()



In [19]:
l = []
for i in range(len(news['Short'])):
    l.append(summarizer1(news['Short'][i], min_length=10, max_length=30))
news.loc[:,'Pegasus'] = l
news.head()



Unnamed: 0,Headline,Short,Pegasus
0,Mob attacks Radiohead fans in Turkey,About 20 people reportedly beat up customers a...,[{'summary_text': 'Turkish police are investig...
1,"Man City, Arsenal play out a 2-2 draw",Arsenal and Manchester City played out a 2-2 d...,[{'summary_text': 'Arsenal manager Arsene Weng...
2,"Tanmay Bhat speaks on his video on Lata, Sachin","Stand-up comedian Tanmay Bhat, who is associat...",[{'summary_text': 'Comedian Tanmay Bhat has be...
3,Pro-Russian rebels cut trade ties with Ukraine...,The pro-Russian rebel group in Ukraine has dec...,[{'summary_text': 'Pro-Russian separatists in ...
4,World may lose two-thirds of its wildlife by 2...,The world may lose two-thirds (68%) of its wil...,[{'summary_text': 'The world's wildlife popula...


In [20]:
news['Short'][1]

'Arsenal and Manchester City played out a 2-2 draw at the Etihad Stadium in the Premier League on Sunday. Sergio Aguero and Kevin de Bruyne scored for the home side while Olivier Giroud and Alexis Sanchez scored for the Gunners. Notably, Man City will lose out on a UEFA Champions League spot if Manchester United win their next two games.'

In [21]:
news['Headline'][1]

'Man City, Arsenal play out a 2-2 draw'

In [22]:
news['Pegasus'][1]

[{'summary_text': 'Arsenal manager Arsene Wenger and Manchester City counterpart Pep Guardiola had a heated exchange at the end of their game on Sunday.'}]

In [23]:
# T5 (the pipeline infers the framework from the loaded model)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("t5-base")
# AutoModelForSeq2SeqLM is the PyTorch auto class
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
# device=0 places the pipeline on the first GPU
summarizer2 = pipeline("summarization", model=model, tokenizer=tokenizer, device=0)


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.



In [28]:
l = []
for i in range(len(news['Short'])):
    l.append(summarizer2(news['Short'][i], min_length=10, max_length=30))
news.loc[:,'T5'] = l
news.head()



Unnamed: 0,Headline,Short,Pegasus,T5
0,Mob attacks Radiohead fans in Turkey,About 20 people reportedly beat up customers a...,[{'summary_text': 'Turkish police are investig...,{'summary_text': 'the mob was reportedly angry...
1,"Man City, Arsenal play out a 2-2 draw",Arsenal and Manchester City played out a 2-2 d...,[{'summary_text': 'Arsenal manager Arsene Weng...,{'summary_text': 'arsenal and manchester city ...
2,"Tanmay Bhat speaks on his video on Lata, Sachin","Stand-up comedian Tanmay Bhat, who is associat...",[{'summary_text': 'Comedian Tanmay Bhat has be...,{'summary_text': ''drought hit farmers wonderi...
3,Pro-Russian rebels cut trade ties with Ukraine...,The pro-Russian rebel group in Ukraine has dec...,[{'summary_text': 'Pro-Russian separatists in ...,{'summary_text': 'pro-Russian rebels cut econo...
4,World may lose two-thirds of its wildlife by 2...,The world may lose two-thirds (68%) of its wil...,[{'summary_text': 'The world's wildlife popula...,{'summary_text': 'the world may lose two-third...


In [29]:
news['Short'][1]

'Arsenal and Manchester City played out a 2-2 draw at the Etihad Stadium in the Premier League on Sunday. Sergio Aguero and Kevin de Bruyne scored for the home side while Olivier Giroud and Alexis Sanchez scored for the Gunners. Notably, Man City will lose out on a UEFA Champions League spot if Manchester United win their next two games.'

In [30]:
news['Headline'][1]

'Man City, Arsenal play out a 2-2 draw'

In [31]:
news['T5'][1]

{'summary_text': 'arsenal and manchester city play out a 2-2 draw at the Etihad on sunday . man city will lose out on '}

In [32]:
# BART in PyTorch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
summarizer3 = pipeline("summarization", model=model, tokenizer=tokenizer, min_length=5, max_length=30, device=0)



In [33]:
l = []
for i in range(len(news['Short'])):
    l.append(summarizer3(news['Short'][i], min_length=10, max_length=30))
news.loc[:,'BART'] = l
news.head()



Unnamed: 0,Headline,Short,Pegasus,T5,BART
0,Mob attacks Radiohead fans in Turkey,About 20 people reportedly beat up customers a...,[{'summary_text': 'Turkish police are investig...,{'summary_text': 'the mob was reportedly angry...,[{'summary_text': 'The mob was reportedly angr...
1,"Man City, Arsenal play out a 2-2 draw",Arsenal and Manchester City played out a 2-2 d...,[{'summary_text': 'Arsenal manager Arsene Weng...,{'summary_text': 'arsenal and manchester city ...,[{'summary_text': 'Manchester City and Arsenal...
2,"Tanmay Bhat speaks on his video on Lata, Sachin","Stand-up comedian Tanmay Bhat, who is associat...",[{'summary_text': 'Comedian Tanmay Bhat has be...,{'summary_text': ''drought hit farmers wonderi...,[{'summary_text': 'Comedian Tanmay Bhat shared...
3,Pro-Russian rebels cut trade ties with Ukraine...,The pro-Russian rebel group in Ukraine has dec...,[{'summary_text': 'Pro-Russian separatists in ...,{'summary_text': 'pro-Russian rebels cut econo...,[{'summary_text': 'Pro-Russian rebel group in ...
4,World may lose two-thirds of its wildlife by 2...,The world may lose two-thirds (68%) of its wil...,[{'summary_text': 'The world's wildlife popula...,{'summary_text': 'the world may lose two-third...,[{'summary_text': 'The world may lose two-thir...


In [34]:
news['Short'][1]

'Arsenal and Manchester City played out a 2-2 draw at the Etihad Stadium in the Premier League on Sunday. Sergio Aguero and Kevin de Bruyne scored for the home side while Olivier Giroud and Alexis Sanchez scored for the Gunners. Notably, Man City will lose out on a UEFA Champions League spot if Manchester United win their next two games.'

In [35]:
news['Headline'][1]

'Man City, Arsenal play out a 2-2 draw'

In [36]:
news['BART'][1]

[{'summary_text': 'Manchester City and Arsenal played out a 2-2 draw at the Etihad Stadium on Sunday. Sergio Aguero and Kevin de Bruyne'}]

In [37]:
news

Unnamed: 0,Headline,Short,Pegasus,T5,BART
0,Mob attacks Radiohead fans in Turkey,About 20 people reportedly beat up customers a...,[{'summary_text': 'Turkish police are investig...,{'summary_text': 'the mob was reportedly angry...,[{'summary_text': 'The mob was reportedly angr...
1,"Man City, Arsenal play out a 2-2 draw",Arsenal and Manchester City played out a 2-2 d...,[{'summary_text': 'Arsenal manager Arsene Weng...,{'summary_text': 'arsenal and manchester city ...,[{'summary_text': 'Manchester City and Arsenal...
2,"Tanmay Bhat speaks on his video on Lata, Sachin","Stand-up comedian Tanmay Bhat, who is associat...",[{'summary_text': 'Comedian Tanmay Bhat has be...,{'summary_text': ''drought hit farmers wonderi...,[{'summary_text': 'Comedian Tanmay Bhat shared...
3,Pro-Russian rebels cut trade ties with Ukraine...,The pro-Russian rebel group in Ukraine has dec...,[{'summary_text': 'Pro-Russian separatists in ...,{'summary_text': 'pro-Russian rebels cut econo...,[{'summary_text': 'Pro-Russian rebel group in ...
4,World may lose two-thirds of its wildlife by 2...,The world may lose two-thirds (68%) of its wil...,[{'summary_text': 'The world's wildlife popula...,{'summary_text': 'the world may lose two-third...,[{'summary_text': 'The world may lose two-thir...
5,Bumrah records most T20I wickets in calendar yr,Indian pacer Jasprit Bumrah has set the record...,"[{'summary_text': 'Bumrah, who took four wicke...",{'summary_text': 'jasprit Bumrah took 28 wicke...,[{'summary_text': 'Jasprit Bumrah has set the ...
6,Not interested in Sunil Dutt's role in San...,"Actor Aamir Khan, who was offered the role of ...","[{'summary_text': 'Actor Aamir Khan, who was o...",{'summary_text': ''i thought you were coming t...,[{'summary_text': 'Aamir Khan was offered the ...
7,May not accept election result: Trump,US presidential nominee Donald Trump on Wednes...,[{'summary_text': 'Donald Trump says he will n...,{'summary_text': 'he says he will keep people ...,[{'summary_text': 'Donald Trump said he would ...
8,"Cop assaulted, forced to wave saffron flag",An Assistant Sub-Inspector was assaulted and m...,[{'summary_text': 'An Indian police officer ha...,{'summary_text': 'sub-inspector was assaulted ...,[{'summary_text': 'The 50-year-old officer and...
9,Navy retires Sea Harrier aircraft squadron,The Indian Navy on Wednesday retired the Sea H...,[{'summary_text': 'The Indian Navy on Wednesda...,{'summary_text': 'the retired Sea Harrier airc...,"[{'summary_text': 'Inducted in 1983, these wil..."


In [53]:
!pip install datasets rouge_score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting multiprocess
  Downloading multiprocess-0.70.14-py310-none-any.whl (134 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting dill<0.3.7,>=0.3.0
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)

In [61]:

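# load_metric is deprecated in recent `datasets` releases; the `evaluate` package provides evaluate.load('rouge')
from datasets import load_metric
rouge_metric = load_metric('rouge')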


In [70]:
records = []


ROUGE scores for a single summary, using the headline *news['Headline'][1]* as the reference

In [69]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

reference = news['Headline'][1]

rouge_metric.add(prediction = news['Pegasus'][1], reference = reference )
score = rouge_metric.compute()
rouge_dict = dict((rn, score[rn].mid.fmeasure ) for rn in rouge_names )
print('rouge_dict ', rouge_dict )

rouge_dict  {'rouge1': 0.18181818181818182, 'rouge2': 0.0, 'rougeL': 0.12121212121212122, 'rougeLsum': 0.12121212121212122}
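
One detail worth checking here: `news['Pegasus'][1]` is the raw pipeline output, a list holding a dict, not a plain string. The sketch below uses a simplified ROUGE-1 F-measure (lowercased alphanumeric tokens, no stemming; `rouge1_f` is a hypothetical helper, not the `rouge_score` implementation) to compare scoring the stringified output against scoring only the extracted summary text. Under these assumptions the stringified version gives 6/33 ≈ 0.1818, the same value as the `rouge1` score printed above, which suggests the extra tokens `summary` and `text` were counted as part of the prediction.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase and keep runs of letters/digits (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rouge1_f(prediction, reference):
    """Unigram-overlap F-measure with clipped counts."""
    pred, ref = Counter(tokens(prediction)), Counter(tokens(reference))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Reference and prediction copied from the notebook outputs above.
reference = "Man City, Arsenal play out a 2-2 draw"
raw_output = [{"summary_text": "Arsenal manager Arsene Wenger and Manchester City "
               "counterpart Pep Guardiola had a heated exchange at the end of "
               "their game on Sunday."}]

as_string = rouge1_f(str(raw_output), reference)                # list turned into text
extracted = rouge1_f(raw_output[0]["summary_text"], reference)  # summary only
print(round(as_string, 4), round(extracted, 4))  # 0.1818 0.1935
```

If the goal is to score only the generated text, passing `news['Pegasus'][1][0]['summary_text']` (and the equivalent for the other columns) as the prediction would avoid the extra tokens; the exact library scores would need to be recomputed to confirm the effect.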


Average ROUGE scores over all 20 summaries

In [77]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

references = news['Headline'].tolist()
predictions = news['Pegasus'].tolist()

rouge_scores = {rn: 0 for rn in rouge_names}

for i in range(len(references)):
    reference = references[i]
    prediction = predictions[i]

    rouge_metric.add(prediction=prediction, reference=reference)
    score = rouge_metric.compute()

    for rn in rouge_names:
        rouge_scores[rn] += score[rn].mid.fmeasure

rouge_avg = {rn: rouge_scores[rn] / len(references) for rn in rouge_names}
rouge_dict= dict(rouge_avg)
print('rouge_avg:', rouge_avg)
records.append(rouge_dict)


rouge_avg: {'rouge1': 0.23212276703556012, 'rouge2': 0.06564274301753079, 'rougeL': 0.18066450727429476, 'rougeLsum': 0.18066450727429476}


In [73]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

reference = news['Headline'][1]


rouge_metric.add(prediction = news['T5'][1], reference = reference )
score = rouge_metric.compute()
rouge_dict = dict((rn, score[rn].mid.fmeasure ) for rn in rouge_names )
print('rouge_dict ', rouge_dict )

rouge_dict  {'rouge1': 0.5625, 'rouge2': 0.39999999999999997, 'rougeL': 0.43750000000000006, 'rougeLsum': 0.43750000000000006}


In [78]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

references = news['Headline'].tolist()
predictions = news['T5'].tolist()

rouge_scores = {rn: 0 for rn in rouge_names}

for i in range(len(references)):
    reference = references[i]
    prediction = predictions[i]

    rouge_metric.add(prediction=prediction, reference=reference)
    score = rouge_metric.compute()

    for rn in rouge_names:
        rouge_scores[rn] += score[rn].mid.fmeasure

rouge_avg = {rn: rouge_scores[rn] / len(references) for rn in rouge_names}
rouge_dict = dict(rouge_avg)
print('rouge_avg:', rouge_avg)
records.append(rouge_dict)



rouge_avg: {'rouge1': 0.25028621095936227, 'rouge2': 0.09895132585455166, 'rougeL': 0.216269258702004, 'rougeLsum': 0.216269258702004}


In [75]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

reference = news['Headline'][1]


rouge_metric.add(prediction = news['BART'][1], reference = reference )
score = rouge_metric.compute()
rouge_dict = dict((rn, score[rn].mid.fmeasure ) for rn in rouge_names )
print('rouge_dict ', rouge_dict )

rouge_dict  {'rouge1': 0.42424242424242425, 'rouge2': 0.25806451612903225, 'rougeL': 0.42424242424242425, 'rougeLsum': 0.42424242424242425}


In [79]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

references = news['Headline'].tolist()
predictions = news['BART'].tolist()

rouge_scores = {rn: 0 for rn in rouge_names}

for i in range(len(references)):
    reference = references[i]
    prediction = predictions[i]

    rouge_metric.add(prediction=prediction, reference=reference)
    score = rouge_metric.compute()

    for rn in rouge_names:
        rouge_scores[rn] += score[rn].mid.fmeasure

rouge_avg = {rn: rouge_scores[rn] / len(references) for rn in rouge_names}
rouge_dict=dict(rouge_avg)
print('rouge_avg:', rouge_avg)
records.append(rouge_dict)



rouge_avg: {'rouge1': 0.24152549927488587, 'rouge2': 0.09653769029480605, 'rougeL': 0.20920683655181138, 'rougeLsum': 0.20920683655181138}


In [84]:
rouge_scores=pd.DataFrame.from_records(records )

new_index = ['Pegasus', 'T5', 'Bart']
rouge_scores = rouge_scores.set_index(pd.Index(new_index))

rouge_scores

Model,rouge1,rouge2,rougeL,rougeLsum
Pegasus,0.232123,0.065643,0.180665,0.180665
T5,0.250286,0.098951,0.216269,0.216269
Bart,0.241525,0.096538,0.209207,0.209207


In [59]:
# For this dataset, Pegasus has the lowest score on every ROUGE metric; T5 scores highest, with BART close behind.
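
The comparison can also be read off programmatically. This small sketch copies the rounded averages from the results table into a plain dict (the names `scores` and `best` are just for illustration) and reports the best-scoring model per metric.

```python
# Rounded average F-measures copied from the results table.
scores = {
    "Pegasus": {"rouge1": 0.232123, "rouge2": 0.065643, "rougeL": 0.180665, "rougeLsum": 0.180665},
    "T5":      {"rouge1": 0.250286, "rouge2": 0.098951, "rougeL": 0.216269, "rougeLsum": 0.216269},
    "Bart":    {"rouge1": 0.241525, "rouge2": 0.096538, "rougeL": 0.209207, "rougeLsum": 0.209207},
}
metrics = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

# For each metric, pick the model with the highest average score.
best = {m: max(scores, key=lambda name: scores[name][m]) for m in metrics}
for m in metrics:
    print(m, "->", best[m])
```

On these numbers T5 comes out on top for all four metrics, with Bart second and Pegasus third.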