# Comparison of NLP Summarization Models

#### Dataset Source: 

#### Import Necessary Libraries

In [1]:
from transformers import pipeline, set_seed
from datasets import Dataset, DatasetDict, load_metric
import nltk
from nltk.tokenize import sent_tokenize
import pandas as pd
import numpy as np

#### Download NLTK's 'punkt' Package

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/briandunn/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Ingest data & Split it into Train/Test/Validation Datasets

In [3]:
data = pd.read_csv('/Users/briandunn/Documents/nlpnn/Datasets/bbc/all_data_combined.csv')

# Convert pandas dataframe to a dataset
dataset = Dataset.from_pandas(data)

# Split the dataset 80/20, then halve the held-out 20% into test and valid subsets
train_testvalid = dataset.train_test_split(test_size=0.20)
test_valid = train_testvalid['test'].train_test_split(test_size=0.50)

# Combine the train/test/valid into one datasetdict
dataset = DatasetDict({
    'train' : train_testvalid['train'],
    'test' : test_valid['test'],
    'valid' : test_valid['train']
})
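The two-stage split above first holds out 20% of the rows and then divides that held-out portion in half. As a quick sanity check of the resulting proportions, here is a minimal arithmetic sketch; `split_fractions` is a hypothetical helper, not part of the `datasets` API, and it ignores the rounding that happens on real row counts.

```python
def split_fractions(first_test_size, second_test_size):
    """Return (train, valid, test) fractions for a two-stage split.

    Mirrors the notebook's layout: the second split's 'train' part
    becomes 'valid' and its 'test' part becomes 'test'.
    """
    train = 1.0 - first_test_size
    held_out = first_test_size
    test = held_out * second_test_size
    valid = held_out - test
    return train, valid, test


train_frac, valid_frac, test_frac = split_fractions(0.20, 0.50)
print(f"train={train_frac:.2f} valid={valid_frac:.2f} test={test_frac:.2f}")
```

Under these assumptions the split works out to roughly 80% train, 10% valid, and 10% test.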

## Part 1: Comparison Using the ROUGE Metric

#### Use a Single Sample (Index 1) from the Training Dataset

In [None]:
ds = dataset['train'][1]
ds

{'Unnamed: 0': 80,
 'Article': 'Consumer concern over RFID tags  Consumers are very concerned about the use of radio frequency ID (RFID) tags in shops, a survey says.  More than half of 2,000 people surveyed said they had privacy worries about the tags, which can be used to monitor stock on shelves or in warehouses. Some consumer groups have expressed concern that the tags could be used to monitor shoppers once they had left shops with their purchases. The survey showed that awareness of tags among consumers in Europe was low. The survey of consumers in the UK, France, Germany and the Netherlands was carried out by consultancy group Capgemini. The firm works on behalf of more than 30 firms who are seeking to promote the growth of RFID technology. The tags are a combination of computer chip and antenna which can be read by a scanner - each item contains a unique identification number.  More than half (55%) of the respondents said they were either concerned or very concerned that RFID ta

#### Set Baseline as First Three Sentences of Text

In [None]:
def first_three_lines_summary(text):
    return '\n'.join(sent_tokenize(text)[:3])

set_seed(42)

summaries_ds = {}
summaries_ds['baseline'] = first_three_lines_summary(ds['Article'])

summaries_ds

{'baseline': 'Consumer concern over RFID tags  Consumers are very concerned about the use of radio frequency ID (RFID) tags in shops, a survey says.\nMore than half of 2,000 people surveyed said they had privacy worries about the tags, which can be used to monitor stock on shelves or in warehouses.\nSome consumer groups have expressed concern that the tags could be used to monitor shoppers once they had left shops with their purchases.'}
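The baseline simply keeps the leading sentences of the article. For readers without NLTK installed, the same idea can be sketched with a crude regex splitter. `naive_sent_split` and `lead_n_summary` are hypothetical names introduced here, and the regex is an assumption that will mis-split abbreviations and quotes that `punkt` handles better.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (much cruder than punkt)."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def lead_n_summary(text, n=3):
    """Join the first n sentences with newlines, like the baseline above."""
    return '\n'.join(naive_sent_split(text)[:n])


sample = (
    "Consumers are very concerned about RFID tags. "
    "More than half had privacy worries. "
    "Awareness was low. "
    "The survey covered four countries."
)
lead3 = lead_n_summary(sample)
print(lead3)
```

The newline-joined form matters later: `rougeLsum` treats each newline-separated line as a separate sentence.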

#### GPT-2 Model

In [None]:
# GPT-2
pipe = pipeline("text-generation", model="gpt2-xl") 
gpt2_query = ds['Article'] + "\nTL;DR\n"
pipe_out = pipe(gpt2_query, clean_up_tokenization_spaces=True, max_new_tokens=1024)
summaries_ds["gpt2"] = "\n".join(sent_tokenize(pipe_out[0]["generated_text"][len(gpt2_query) :]))
summaries_ds["gpt2"]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"If retailers are putting RFID chips everywhere in their shops and stores that's a good cause for concern.\nIf RFID is to be used as a legitimate way to track prices, a scan of each item in a store should be taken before making the purchase.\nIt may be more efficient to track items once purchased rather than after a sale and this would also prevent scanning items again unless they have been used."

#### T5

In [None]:
pipe = pipeline("summarization", model="t5-large", max_length=260, clean_up_tokenization_spaces=True)
pipe_out = pipe(ds['Article'])
summaries_ds['t5'] = '\n'.join(sent_tokenize(pipe_out[0]["summary_text"]))
summaries_ds['t5']

Token indices sequence length is longer than the specified maximum sequence length for this model (518 > 512). Running this sequence through the model will result in indexing errors


'more than half of 2,000 people surveyed said they had privacy worries about the tags .\nRFID tags can be used to monitor stock on shelves or in warehouses .\nsurvey shows that awareness of tags among consumers in Europe is low .'

#### BART

In [None]:
pipe = pipeline("summarization", model="facebook/bart-large-cnn")
pipe_out = pipe(ds['Article'])
summaries_ds['bart'] = '\n'.join(sent_tokenize(pipe_out[0]["summary_text"]))

#### Pegasus

In [None]:
# Pegasus is left out of this comparison. The model requires an additional download
# that could not be installed on this machine; some hardware setups fail on it, and
# not always predictably. The code is kept below, commented out, for reference.

#pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail")
#pipe_out = pipe(ds['Article'])
#summaries_ds['Pegasus'] = pipe_out[0]["summary_text"].replace(" <n>", ".\n")

#### Print Out All Summaries Sequentially

In [None]:
print("GROUND TRUTH")
print(ds['Summary'])
print("")

for model_name in summaries_ds:
    print(model_name.upper())
    print(summaries_ds[model_name])
    print("")

GROUND TRUTH
Mr Vetham said the majority of people surveyed (52%) believed that RFID tags could be read from a distance.He said that the survey also showed people would accept RFID if they felt that the technology could mean a reduction in car theft or faster recovery of stolen items.Fifty nine percent of people said they were worried that RFID tags would allow data to be used more freely by third parties.Ard Jan Vetham, Capgemini's principal consultant on RFID, said the survey showed that retailers needed to inform and educate people about RFID before it would become accepted technology.More than half (55%) of the respondents said they were either concerned or very concerned that RFID tags would allow businesses to track consumers via product purchases.Consumers are very concerned about the use of radio frequency ID (RFID) tags in shops, a survey says.At least once consumer group - Consumers Against Supermarket Privacy Invasion and Numbering (Caspian) - has claimed that RFID chips cou

#### Model Evaluation Using the ROUGE Metric

In [None]:
rouge_metric = load_metric('rouge')

reference = dataset['train'][1]['Summary']
records = []
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

for model_name in summaries_ds:
    rouge_metric.add(prediction=summaries_ds[model_name], reference=reference)
    score = rouge_metric.compute()
    rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
    records.append(rouge_dict)
pd.DataFrame.from_records(records, index=summaries_ds.keys())

| Model | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| baseline | 0.368421 | 0.212121 | 0.203008 | 0.285714 |
| gpt2 | 0.040201 | 0.010152 | 0.040201 | 0.040201 |
| t5 | 0.257511 | 0.12987 | 0.206009 | 0.214592 |
| bart | 0.274194 | 0.089431 | 0.145161 | 0.16129 |
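To make the `rouge1` and `rouge2` columns easier to interpret, here is a minimal sketch of the ROUGE-N F-measure computed by hand. It assumes plain lowercased whitespace tokenization, so its numbers will not exactly match the `rouge` metric used above, which applies its own tokenizer and reports an aggregated `mid` value. The helper names and toy sentences are illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(prediction, reference, n=1):
    """ROUGE-N F-measure with naive whitespace tokenization (an assumption)."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


toy_prediction = "the tags monitor stock on shelves"
toy_reference = "the tags can monitor stock in warehouses"
r1 = rouge_n_f1(toy_prediction, toy_reference, n=1)
r2 = rouge_n_f1(toy_prediction, toy_reference, n=2)
print(round(r1, 3), round(r2, 3))  # 0.615 0.364
```

Four of the six predicted unigrams appear in the seven-word reference, giving F = 8/13; only two bigrams match, giving F = 4/11. This is why `rouge2` is consistently lower than `rouge1` in the table.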


## Part 2: Applying BLEU & ROUGE to Examples

#### Example 1 (BLEU)

In [None]:
article1 = """Broadband fuels online expression  Fast web access is encouraging more people to express themselves online, research suggests.  
A quarter of broadband users in Britain regularly upload content and have personal sites, according to a report by UK think-tank Demos. 
It said that having an always-on, fast connection is changing the way people use the internet. More than five million households in the UK 
have broadband and that number is growing fast.  The Demos report looked at the impact of broadband on people's net habits. It found that 
more than half of those with broadband logged on to the web before breakfast. One in five even admitted to getting up in the middle of the 
night to browse the web.  More significantly, argues the report, broadband is encouraging people to take a more active role online. It 
found that one in five post something on the net everyday, ranging from comments or opinions on sites to uploading photographs. "Broadband 
is putting the 'me' in media as it shifts power from institutions and into the hands of the individual," said John Craig, co-author of the 
Demos report. "From self-diagnosis to online education, broadband creates social innovation that moves the debate beyond simple questions 
of access and speed." The Demos report, entitled Broadband Britain: The End Of Asymmetry?, was commissioned by net provider AOL. "Broadband 
is moving the perception of the internet as a piece of technology to an integral part of home life in the UK," said Karen Thomson, Chief 
Executive of AOL UK, "with many people spending time on their computers as automatically as they might switch on the television or radio." 
According to analysts Nielsen//NetRatings, more than 50% of the 22.8 million UK net users regularly accessing the web from home each month 
are logging on at high speed They spend twice as long online than people on dial-up connections, viewing an average of 1,444 pages per month. 
The popularity of fast net access is growing, partly fuelled by fierce competition over prices and services."""

summary1 = '''More than five million households in the UK have broadband and that number is growing fast.The Demos report looked at the impact 
of broadband on people's net habits.More significantly, argues the report, broadband is encouraging people to take a more active role online.
The Demos report, entitled Broadband Britain: The End Of Asymmetry?, was commissioned by net provider AOL.A quarter of broadband users in Britain 
regularly upload content and have personal sites, according to a report by UK think-tank Demos.Fast web access is encouraging more people to 
express themselves online, research suggests.'''

bleu_metric = load_metric("sacrebleu")

# Score the full article (used as the prediction) against its reference summary

bleu_metric.add(prediction=article1, reference=[summary1])
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results['precisions'] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])

| Field | Value |
|---|---|
| score | 27.437036 |
| counts | [110, 108, 102, 97] |
| totals | [381, 380, 379, 378] |
| precisions | [28.87, 28.42, 26.91, 25.66] |
| bp | 1.0 |
| sys_len | 381 |
| ref_len | 110 |
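The reported `score` can be reproduced from the other fields: BLEU is the brevity penalty times the geometric mean of the four n-gram precisions, scaled to 0-100. The sketch below recomputes it from the `counts`, `totals`, and `bp` values shown above. `bleu_from_counts` is a hypothetical helper, not a SacreBLEU function, and it assumes no n-gram order has a zero count, so smoothing plays no role.

```python
import math


def bleu_from_counts(counts, totals, bp):
    """Recompute a BLEU score (0-100) from per-order match counts and totals."""
    precisions = [c / t for c, t in zip(counts, totals)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return 100 * bp * math.exp(log_mean)


# Values copied from the Example 1 output above
score1 = bleu_from_counts(
    counts=[110, 108, 102, 97],
    totals=[381, 380, 379, 378],
    bp=1.0,
)
print(round(score1, 2))  # 27.44
```

Note that `counts[0]` equals `ref_len` (110): every reference token also occurs in the article, so the precisions are limited mainly by how much longer the article is than the summary.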


#### Example 2 (BLEU)

In [None]:
article2 = '''Mobile music challenges 'iPod age'  Nokia and Microsoft have agreed a deal to work on delivery of music to handsets, while Sony 
Ericsson has unveiled its phone Walkman and Motorola is working on an iTunes phone.  Can mobile phones replace the MP3 player in your pocket? 
The music download market has been growing steadily since record firms embraced digital distribution. Ease of use, relative low price and 
increased access to broadband has helped drive the phenomenal growth of MP3 players.  Full-length music downloads on mobile phones have not 
taken off so quickly - held back by technical challenges as well as issues over music availability. But the mobile music industry is confident 
that the days of dedicated MP3 players are numbered.  Gilles Babinet, chief executive of mobile music firm Musiwave, said: "Music downloads on 
mobiles have the potential to be the biggest-ever medium for music."  Musiwave provides downloading infrastructure for the mobile phone market 
and Mr Babinet said the industry was enjoying "definite momentum." But there are hurdles to overcome. Mobile phones offer limited storage for 
music - certainly nothing to rival Apple's 60GB iPod. But the first mobile phones with hard disk players will be on the market soon and the 
current generation of mobiles using flash technology can store up to one gigabyte of music - enough for 250 songs. "We are working in the hard 
disk area and we will be bringing out exciting devices," Jonas Guest, vice president for entertainment at Nokia, told the BBC News website. But 
will mobiles become mere storage devices? "One of the problems we could have is that mobiles are used just for storage and playback while PCs 
are used for downloading," said Mr Babinet  "We don't want people to cast aside their PCs - we want mobile users to hook up into the existing 
ecosystems," explained Mr Guest. "You must enable people to transfer music from a PC to a handset and vice versa."  One of the key elements of 
the Nokia and Microsoft deal is the agreed ability to transfer songs between a handset and a PC. Microsoft will adopt open standards allowing 
music to cross boundaries for the first time. Songs can be downloaded on PC or mobile and transferred between the platforms. "The line between 
online and wireless is going to blur," predicted Ted Cohen, senior vice president of digital development and distribution at EMI. He said: 
"The market is more regional in its maturity. In Asia it is beyond belief. "The majority of our digital revenues in Asia comes from mobiles. 
In North America it is fixed line while there is equilibrium in Europe."  EMI currently offers its entire 200,000 download catalogue for use 
by both by PCs and mobile phones. Mr Cohen said: "It's going to be just as important to connect through 3G or wireless as it is through your PC. 
"We want music to be a continuum." The seamless experience of mobiles and PC downloads is approaching, he predicted. Mr Babinet said the mobile 
phone had a number of advantages over PCs which would see it become the focus for music downloading in the future. "Getting music from your PC 
onto a device is not an easy experience. You have to switch the PC on, load the operating system, load the program, buy the music, download the 
music, and then transfer the music. "All of these steps can be done in one step on a mobile phone." He said the mobile phone's billing system 
would make it easier for teenagers to embrace downloads, because pre-paid cards were already accepted by the age group.  "Certainly, we have a 
problem with battery, memory and bandwidth. But it's not about the current status. It's about the potential. "You will have all of your music 
on your mobile." All three men said that the social interaction of mobile music would drive the market. Mr Cohen said: "I can send you the song 
and it is either billed to me or I send it to you and if you listen to it and want to keep, it is billed to you. "It's a social phenomenon." 
Mr Babinet said: "Today you use radio and TV to discover music. Tomorrow you will discover and consume music via one device - the mobile."'''

summary2 = '''"You will have all of your music on your mobile."Gilles Babinet, chief executive of mobile music firm Musiwave, said: "Music 
downloads on mobiles have the potential to be the biggest-ever medium for music."All three men said that the social interaction of mobile 
music would drive the market.But the first mobile phones with hard disk players will be on the market soon and the current generation of 
mobiles using flash technology can store up to one gigabyte of music - enough for 250 songs.Mr Babinet said the mobile phone had a number 
of advantages over PCs which would see it become the focus for music downloading in the future.Full-length music downloads on mobile phones 
have not taken off so quickly - held back by technical challenges as well as issues over music availability.Tomorrow you will discover and 
consume music via one device - the mobile."Mobile phones offer limited storage for music - certainly nothing to rival Apple's 60GB iPod. You 
have to switch the PC on, load the operating system, load the program, buy the music, download the music, and then transfer the music.But 
the mobile music industry is confident that the days of dedicated MP3 players are numbered."One of the problems we could have is that mobiles 
are used just for storage and playback while PCs are used for downloading," said Mr Babinet  "We don't want people to cast aside their PCs - 
we want mobile users to hook up into the existing ecosystems," explained Mr Guest.Mr Babinet said: "Today you use radio and TV to discover music.
Musiwave provides downloading infrastructure for the mobile phone market and Mr Babinet said the industry was enjoying "definite momentum.""We 
want music to be a continuum."Can mobile phones replace the MP3 player in your pocket?The seamless experience of mobiles and PC downloads is 
approaching, he predicted."You must enable people to transfer music from a PC to a handset and vice versa."'''

# Score the second article (used as the prediction) against its reference summary
bleu_metric.add(prediction=article2, reference=[summary2])
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results['precisions'] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])

| Field | Value |
|---|---|
| score | 44.486598 |
| counts | [377, 371, 354, 338] |
| totals | [810, 809, 808, 807] |
| precisions | [46.54, 45.86, 43.81, 41.88] |
| bp | 1.0 |
| sys_len | 810 |
| ref_len | 377 |
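In both BLEU examples `bp` is 1.0 because the prediction (the full article) is longer than the reference summary, so the brevity penalty never applies. The sketch below shows the standard penalty formula and, purely as a hypothetical, what it would be if the shorter summary were scored as the prediction against the longer article. `brevity_penalty` is an illustrative helper name, and the swapped case is not something the notebook runs.

```python
import math


def brevity_penalty(sys_len, ref_len):
    """BLEU brevity penalty: 1.0 unless the prediction is shorter than the reference."""
    if sys_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / sys_len)


# Lengths copied from the Example 2 output above
bp_as_run = brevity_penalty(sys_len=810, ref_len=377)

# Hypothetical: roles swapped, summary as prediction and article as reference
bp_swapped = brevity_penalty(sys_len=377, ref_len=810)

print(bp_as_run, round(bp_swapped, 3))
```

With the roles swapped the penalty drops well below 1, which illustrates why the choice of prediction and reference matters when reading these scores.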
