# **Install necessary packages**

In [2]:
!pip install rouge-score
!pip install bert-extractive-summarizer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24954 sha256=caeccc04cf91b0ca7e9b48f66ecbcadc2d4a10cd19027ddd520342d9929f5045
  Stored in directory: /root/.cache/pip/wheels/9b/3d/39/09558097d3119ca0a4d462df68f22c6f3c1b345ac63a09b86e
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bert-extractive-summarizer
  Downloading bert_extractive_summarizer-0.10.1-py3-none-any.whl (25 kB)
Collecting transformers
  Downloading transformers-4.28.1-py3-none-an

# **Import libraries**

In [4]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import pandas as pd
from summarizer import Summarizer, TransformerSummarizer
from rouge_score import rouge_scorer
from tqdm.notebook import tqdm

# **Mount Google Drive**

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **Load and preprocess the CNN-DailyMail test data**

In [8]:
# Skip malformed rows instead of aborting the load.
df = pd.read_csv("/content/test.csv", on_bad_lines='skip')

# Set to False to evaluate on a smaller subset for a quicker run.
is_use_whole_set = True

if is_use_whole_set:
    data = df.drop_duplicates().copy()
    print("The total number of test samples is: {}".format(len(data)))
else:
    number_of_samples = 1000
    data = df.drop_duplicates().head(number_of_samples).copy()

assert len(data["highlights"]) == len(data["article"])
# Flatten multi-line highlights, collapse double spaces,
# and drop the space that precedes punctuation marks.
data["highlights"] = data["highlights"].str.replace("\n", " ", regex=False)
data["highlights"] = data["highlights"].str.replace("  ", " ", regex=False)
data["highlights"] = data["highlights"].str.replace(r"\s([^\w\s])", r"\1", regex=True)

The total number of test samples is: 11490
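
To make the three clean-up steps concrete, here is a minimal sketch that applies the same replacements to a single string with the standard `re` module instead of pandas. The function name `clean_highlights` and the sample text are made up for illustration and are not part of the notebook.

```python
import re

def clean_highlights(text):
    """Hypothetical single-string version of the three pandas replacements."""
    text = text.replace("\n", " ")   # join multi-line highlights
    text = text.replace("  ", " ")   # collapse double spaces
    # Remove the whitespace that precedes a punctuation character.
    return re.sub(r"\s([^\w\s])", r"\1", text)

sample = "Seats are shrinking .\nExperts warn of risks ."
cleaned = clean_highlights(sample)
print(cleaned)  # Seats are shrinking. Experts warn of risks.
```

The regular expression matches one whitespace character followed by a character that is neither a word character nor whitespace, and keeps only the latter.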


In [None]:
# Inspect a single example before running the full evaluation.
original_text = data["article"].iloc[0]
true_summary = data["highlights"].iloc[0]
print("Original Text: {}".format(original_text))
print("True Summary: {}".format(true_summary))

# Summarizer() loads bert-large-uncased by default (see the log below).
BERT_model = Summarizer()
summary = BERT_model(original_text)
print("BERT Summary: {}".format(summary))

# score(target, prediction): the reference summary goes first.
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])
scores = scorer.score(true_summary, summary)
print("Scores: {}".format(scores))

Original Text: Ever noticed how plane seats appear to be getting smaller and smaller? With increasing numbers of people taking to the skies, some experts are questioning if having such packed out planes is putting passengers at risk. They say that the shrinking space on aeroplanes is not only uncomfortable - it's putting our health and safety in danger. More than squabbling over the arm rest, shrinking space on planes putting our health and safety in danger? This week, a U.S consumer advisory group set up by the Department of Transportation said at a public hearing that while the government is happy to set standards for animals flying on planes, it doesn't stipulate a minimum amount of space for humans. 'In a world where animals have more rights to space and food than humans,' said Charlie Leocha, consumer representative on the committee. 'It is time that the DOT and FAA take a stand for humane treatment of passengers.' But could crowding on planes lead to more serious issues than figh

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BERT Summary: Ever noticed how plane seats appear to be getting smaller and smaller? This week, a U.S consumer advisory group set up by the Department of Transportation said at a public hearing that while the government is happy to set standards for animals flying on planes, it doesn't stipulate a minimum amount of space for humans. ' But these tests are conducted using planes with 31 inches between each row of seats, a standard which on some airlines has decreased, reported the Detroit News. The distance between two seats from one point on a seat to the same point on the seat behind it is known as the pitch.
Scores: {'rouge1': Score(precision=0.1559633027522936, recall=0.5, fmeasure=0.2377622377622378), 'rouge2': Score(precision=0.05555555555555555, recall=0.18181818181818182, fmeasure=0.0851063829787234), 'rougeL': Score(precision=0.11009174311926606, recall=0.35294117647058826, fmeasure=0.16783216783216784)}
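
As a rough illustration of what the `rouge1` entry measures, the sketch below computes unigram-overlap precision, recall and F-measure by hand. It is a simplification under stated assumptions (lowercasing, alphanumeric tokens only, no stemming) and is not the `rouge_score` implementation; the function name `rouge1_sketch` and the example sentences are hypothetical.

```python
import re
from collections import Counter

def rouge1_sketch(reference, candidate):
    """Simplified ROUGE-1: clipped unigram overlap between two texts."""
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    overlap = sum((ref & cand).values())  # each token counted at most min(ref, cand) times
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure

p, r, f = rouge1_sketch("the cat sat on the mat",
                        "the cat lay on the mat today")
print(p, r, f)  # 5 of 7 candidate tokens and 5 of 6 reference tokens overlap
```

Read this way, the recall of 0.5 and precision of about 0.16 in the output above are what one would expect when the extracted summary is much longer than the reference highlights.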


**Test BERT + k-means clustering on the CNN-DailyMail data**

In [None]:
rouge1 = []
rouge2 = []
rougeL = []

for ot, ts in tqdm(zip(data["article"], data["highlights"]), total=len(data["article"])):
    summary = BERT_model(ot)
    scores = scorer.score(ts, summary)
    # Keep only the F-measure of each metric.
    rouge1.append(scores["rouge1"].fmeasure)
    rouge2.append(scores["rouge2"].fmeasure)
    rougeL.append(scores["rougeL"].fmeasure)

print("ROUGE-1: {}".format(np.array(rouge1).mean()))
print("ROUGE-2: {}".format(np.array(rouge2).mean()))
print("ROUGE-L: {}".format(np.array(rougeL).mean()))

ROUGE-1: 0.3016095678779888
ROUGE-2: 0.11547925538764356
ROUGE-L: 0.188339220205746
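
The "k-means clustering" in the heading refers to how the extractive summarizer picks sentences: roughly, each sentence is embedded, the embeddings are clustered, and the sentence nearest each cluster centre is kept. The sketch below illustrates that idea only. It is not the library's code: the 2-D vectors stand in for BERT embeddings, the seeding is deterministic for reproducibility, and `kmeans` and `select_sentences` are hypothetical names.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def select_sentences(sentences, embeddings, k):
    """Keep the sentence closest to each centroid, in document order."""
    chosen = set()
    for centroid in kmeans(embeddings, k):
        chosen.add(min(range(len(embeddings)),
                       key=lambda i: math.dist(embeddings[i], centroid)))
    return [sentences[i] for i in sorted(chosen)]

# Toy data: two topics, three sentences each, with made-up 2-D "embeddings".
sentences = ["Seats are shrinking.", "Legroom keeps falling.",
             "Planes feel more cramped.", "Regulators held a hearing.",
             "An advisory group met.", "Officials discussed passenger space."]
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.4, 0.0],
              [10.0, 10.0], [11.0, 10.0], [10.4, 10.0]]
summary_sketch = select_sentences(sentences, embeddings, k=2)
print(summary_sketch)
```

On this toy input the result is one sentence per topic, namely the one whose vector lies closest to its cluster mean.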


**Test GPT-2 + k-means clustering on the CNN-DailyMail data**

In [None]:
GPT2_model = TransformerSummarizer(transformer_type="GPT2", transformer_model_key="gpt2-medium")

rouge1 = []
rouge2 = []
rougeL = []

for ot, ts in tqdm(zip(data["article"], data["highlights"]), total=len(data["article"])):
    summary = GPT2_model(ot)
    scores = scorer.score(ts, summary)
    # Keep only the F-measure of each metric.
    rouge1.append(scores["rouge1"].fmeasure)
    rouge2.append(scores["rouge2"].fmeasure)
    rougeL.append(scores["rougeL"].fmeasure)

print("ROUGE-1: {}".format(np.array(rouge1).mean()))
print("ROUGE-2: {}".format(np.array(rouge2).mean()))
print("ROUGE-L: {}".format(np.array(rougeL).mean()))

ROUGE-1: 0.3059349406751119
ROUGE-2: 0.11779683993570841
ROUGE-L: 0.1907057737973608
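
The BERT and GPT-2 evaluation loops differ only in the model being called, so they could be folded into one helper. The sketch below is one possible shape for it, not code from the notebook: `mean_f1` is a hypothetical name, and the toy summarizer and scorer exist only so the example runs without downloading a model.

```python
from collections import namedtuple
from statistics import mean

# Stand-in with the same field names as rouge_score's Score tuple.
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])

def mean_f1(summarize, score, articles, references,
            metrics=("rouge1", "rouge2", "rougeL")):
    """Average each metric's F-measure over all (article, reference) pairs."""
    per_metric = {m: [] for m in metrics}
    for article, reference in zip(articles, references):
        scores = score(reference, summarize(article))
        for m in metrics:
            per_metric[m].append(scores[m].fmeasure)
    return {m: mean(values) for m, values in per_metric.items()}

# Toy stand-ins (assumptions for illustration only).
def first_sentence(text):
    return text.split(".")[0].strip()

def exact_match(reference, summary):
    f = 1.0 if reference == summary else 0.0
    return {m: Score(f, f, f) for m in ("rouge1", "rouge2", "rougeL")}

result = mean_f1(first_sentence, exact_match,
                 ["a cat. it sat", "a dog. it ran"], ["a cat", "a bird"])
print(result)
```

In the notebook, the same helper would presumably be called as `mean_f1(BERT_model, scorer.score, data["article"], data["highlights"])`, and likewise with `GPT2_model`.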


# **Load and preprocess the Reddit TIFU test data**

In [25]:
ds = tfds.load("reddit_tifu/long_split", split="test")
reddit_df = tfds.as_dataframe(ds)

In [26]:
encoding = 'utf-8'
reddit_df["documents"] = reddit_df["documents"].apply(lambda text: text.decode(encoding))
reddit_df["tldr"] = reddit_df["tldr"].apply(lambda text: text.decode(encoding))

In [27]:
print(reddit_df.head(5))

                                           documents         id  num_comments  \
1  this happened 2 days ago and i am sitting on m...  b'3elaq7'          58.0   
2  i'll start this by telling you a bit about my ...  b'30d4ge'          31.0   
3  as like all other fuck-ups, this did not happe...  b'5ymdci'          51.0   
4  so today comes with my post of shame.  this ac...  b'3i4xnx'          29.0   

   score                                              title  \
0    3.0  b'agreeing to go to a girls house with my frie...   
1    0.0  b'forgetting to cut off a price tag on a reusa...   
2   10.0  b'trying a relationship online with someone i ...   
3    0.0    b'making my family eat my very own semen salt.'   
4   35.0                    b'assuming hot wings were not.'   

                                                tldr   ups  upvote_ratio  
0  went to a girls house with 2 friends, smoked w...   3.0          0.54  
1  forgot to cut off price tags off a reusable ba...   0.0       

In [30]:
print("Original Text: {}".format(reddit_df["documents"][0]))
print("True Summary: {}".format(reddit_df["tldr"][0]))
assert len(reddit_df["documents"]) == len(reddit_df["tldr"])


me and two of my best friends, let's call them william and finley, were hanging out and we decided we'd like to buy some weed because we hadn't had any in a while, so finley calls a (girl)friend with access to a dealer and we arrange to go to her house so she'd buy it for us if we gave her the money. this was all well and good, we get there, and two girls are there, we'll call **them** kash and lucy. kash goes off to meet the dealer whilst william, finley, and me chat to lucy. we got to know her and we found out that she goes to a special school because she stabbed a girl in the head with a pen last year during maths class because she pissed her off for some petty reason i forget now, so, a little unnerved, we laughed it off and made chit-chat. kash soon returned with a baggie filled with weed and she closed all the windows and the door to her smallish bedroom, just so nobody would smell the smoke or whatever. it got pretty stuffy but it was alright at the time. finley says he feels a

In [31]:
BERT_model = Summarizer()
GPT2_model = TransformerSummarizer(transformer_type="GPT2", transformer_model_key="gpt2-medium")

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [33]:
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])

**Test BERT + k-means clustering on the Reddit TIFU data**

In [34]:
rouge1 = []
rouge2 = []
rougeL = []

for ot, ts in tqdm(zip(reddit_df["documents"], reddit_df["tldr"]), total=len(reddit_df["documents"])):
    summary = BERT_model(ot)
    scores = scorer.score(ts, summary)
    # Keep only the F-measure of each metric.
    rouge1.append(scores["rouge1"].fmeasure)
    rouge2.append(scores["rouge2"].fmeasure)
    rougeL.append(scores["rougeL"].fmeasure)

print("ROUGE-1: {}".format(np.array(rouge1).mean()))
print("ROUGE-2: {}".format(np.array(rouge2).mean()))
print("ROUGE-L: {}".format(np.array(rougeL).mean()))

ROUGE-1: 0.1544627550429196
ROUGE-2: 0.028070363090465015
ROUGE-L: 0.10594627875773312


**Test GPT-2 + k-means clustering on the Reddit TIFU data**

In [35]:
rouge1 = []
rouge2 = []
rougeL = []

for ot, ts in tqdm(zip(reddit_df["documents"], reddit_df["tldr"]), total=len(reddit_df["documents"])):
    summary = GPT2_model(ot)
    scores = scorer.score(ts, summary)
    # Keep only the F-measure of each metric.
    rouge1.append(scores["rouge1"].fmeasure)
    rouge2.append(scores["rouge2"].fmeasure)
    rougeL.append(scores["rougeL"].fmeasure)

print("ROUGE-1: {}".format(np.array(rouge1).mean()))
print("ROUGE-2: {}".format(np.array(rouge2).mean()))
print("ROUGE-L: {}".format(np.array(rougeL).mean()))

ROUGE-1: 0.1574151802277536
ROUGE-2: 0.02742486616963162
ROUGE-L: 0.10829248703749408
