Mengfei Li, Sixtine Sphabmixay, Saïd Saterih, Juliana Carvalho de Souza

# Replicating the BARTScore Results for Summarization from "Evaluating Generated Text as Text Generation" by Weizhe Yuan, Graham Neubig, and Pengfei Liu

The objective of this project was to replicate the summarization results from the paper "Evaluating Generated Text as Text Generation" by Weizhe Yuan, Graham Neubig, and Pengfei Liu. For this purpose, we used the datasets provided on GitHub at https://github.com/neulab/BARTScore/tree/main/SUM, as recommended in the paper.

The project is divided into three parts:

Dataset Analysis: In the first part, we explored the datasets to understand their structure and components.

Custom BART Scorer: In the second part, we implemented a custom (vanilla) BARTScore scorer from scratch to evaluate summary quality.

Evaluation and Comparison: In the final part, we computed several of the metrics mentioned in the paper (ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, and MoverScore). We then measured how well each of them, along with our custom BARTScore implementation, correlates with the human scores provided in the dataset.



## 1. Dataset Analysis

In [36]:
import requests

# URLs of the raw newsroom files
urls = {
    'news_eval.pkl': 'https://raw.githubusercontent.com/neulab/BARTScore/main/SUM/Newsroom/data.pkl',
}

# Download each file
for filename, url in urls.items():
    response = requests.get(url)
    if response.status_code == 200:
        with open(filename, 'wb') as f:
            f.write(response.content)
        print(f'Successfully downloaded {filename}')
    else:
        print(f'Failed to download {filename} from {url}')

Successfully downloaded news_eval.pkl


In [37]:
# Open the dataset
import pickle

with open('news_eval.pkl', 'rb') as f:
    data = pickle.load(f)

# get the type of the dataset
type(data)

dict

In [38]:
# Get the keys
data.keys()

dict_keys(['2140', '7569', '30385', '2591', '1943', '8167', '10113', '7651', '2350', '31823', '9821', '30429', '7208', '10376', '7666', '9823', '30821', '9092', '10506', '32563', '6476', '6130', '8727', '10580', '34005', '8801', '5678', '5626', '10488', '144', '31425', '9912', '10623', '30117', '31059', '6578', '31788', '10062', '8607', '1807', '32146', '3829', '32202', '6726', '9642', '4003', '5926', '6638', '32716', '32670', '350', '8212', '6906', '33233', '6314', '9950', '34184', '171', '7118', '1295'])

Each key appears to be a unique identifier for one article.

In [3]:
# What does a specific element look like?
data['2140'].keys()

dict_keys(['src', 'ref_summ', 'sys_summs'])

Each element contains three keys: src, ref_summ, and sys_summs.

In [4]:
# What does 'src' mean?
data['2140']['src']

"A worker sets up a polling station the morning of the GOP primary in Florida . Fewer voters than expected turned out . Editor 's note : John Avlon is a CNN contributor and senior political columnist for Newsweek and The Daily Beast . He is co-editor of the book `` Deadline Artists : America 's Greatest Newspaper Columns . '' ( CNN ) -- Beneath Rick Santorum 's stunning three-state sweep on Tuesday stands another stubborn sign of dissatisfaction with the status quo : Republican turnout is down . I 'm talking embarrassingly , disturbingly , hey-don't-you-know-it's-an-election-year bad . It is a sign of a serious enthusiasm gap among the rank and file , and a particularly bad omen for Mitt Romney and the GOP in the general election . Here 's the tale of the tape , state by state , beginning with Tuesday night : Minnesota had just more than 47,000 people turn out for its caucuses this year -- four years ago it was nearly 63,000 -- and Romney came in first , not a distant third as he did T

This appears to be the source text that we want to summarize.

In [5]:
# What does 'ref_summ' mean?
data['2140']['ref_summ']

'John Avlon says low voter turnout in the primaries is a sign of a serious enthusiasm gap among the rank and file , a bad omen for the GOP .'

This appears to be the reference summary written by a human.

In [6]:
# What does 'sys_summs' mean?
data['2140']['sys_summs'].keys()

dict_keys(['fragments', 'textrank', 'abstractive', 'pointer_c', 'pointer_n', 'pointer_s', 'lede3'])

These keys likely correspond to different summarization systems. Each one provides its own summary together with a set of scores, as seen below.

In [7]:
# What is inside 'textrank' ?
data['2140']['sys_summs']['textrank'].keys()

dict_keys(['sys_summ', 'scores'])

In [8]:
data['2140']['sys_summs']['textrank']['sys_summ']

'In New Hampshire , the same dynamic applied -- 245,000 voters turned out in 2012 , compared with 241,000 four years before , despite Republicans being the only game in town and independents making up 47 % of the total turnout in 2012 , according to CNN exit polls .'

In [9]:
data['2140']['sys_summs']['textrank']['scores']

{'coherence': 4.0,
 'fluency': 4.0,
 'informativeness': 3.3333333333333335,
 'relevance': 4.0}

In [10]:
def print_structure(d, indent=0):
    """
    Recursively prints the structure of keys and subkeys in a hierarchical format.
    :param d: The dictionary to traverse
    :param indent: The current level of indentation for hierarchy
    """
    for key, value in d.items():
        print("  " * indent + f"- {key}")
        if isinstance(value, dict):
            print_structure(value, indent + 1)

In [11]:
print_structure(data['2140'])

- src
- ref_summ
- sys_summs
  - fragments
    - sys_summ
    - scores
      - coherence
      - fluency
      - informativeness
      - relevance
  - textrank
    - sys_summ
    - scores
      - coherence
      - fluency
      - informativeness
      - relevance
  - abstractive
    - sys_summ
    - scores
      - coherence
      - fluency
      - informativeness
      - relevance
  - pointer_c
    - sys_summ
    - scores
      - coherence
      - fluency
      - informativeness
      - relevance
  - pointer_n
    - sys_summ
    - scores
      - coherence
      - fluency
      - informativeness
      - relevance
  - pointer_s
    - sys_summ
    - scores
      - coherence
      - fluency
      - informativeness
      - relevance
  - lede3
    - sys_summ
    - scores
      - coherence
      - fluency
      - informativeness
      - relevance
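
Since every system summary carries the same four scores, the nested dictionary can be flattened into one record per (article, system) pair, which is the unit that the later correlation analysis works on. The sketch below is illustrative only: `flatten_records` and `toy_data` are names introduced here, and the toy values are made up to mirror the nesting printed above.

```python
def flatten_records(data):
    """Turn the nested dataset dict into one flat record per (article, system) pair."""
    records = []
    for doc_id, element in data.items():
        for system, entry in element["sys_summs"].items():
            record = {
                "doc_id": doc_id,
                "system": system,
                "src": element["src"],
                "ref_summ": element["ref_summ"],
                "sys_summ": entry["sys_summ"],
            }
            # Copy coherence, fluency, informativeness, relevance into the record
            record.update(entry["scores"])
            records.append(record)
    return records


# Toy stand-in with the same nesting as the real file (all values are made up)
toy_data = {
    "doc-1": {
        "src": "Full article text.",
        "ref_summ": "Reference summary.",
        "sys_summs": {
            "lede3": {
                "sys_summ": "First system summary.",
                "scores": {"coherence": 4.0, "fluency": 4.0,
                           "informativeness": 3.0, "relevance": 3.5},
            },
            "textrank": {
                "sys_summ": "Second system summary.",
                "scores": {"coherence": 2.0, "fluency": 3.0,
                           "informativeness": 2.5, "relevance": 2.0},
            },
        },
    },
}

toy_records = flatten_records(toy_data)
print(len(toy_records), toy_records[0]["system"], toy_records[0]["coherence"])
```

Applied to the real `data`, this should give 60 × 7 = 420 records, assuming every one of the 60 articles has all seven system summaries.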


## 2. Custom BART Scorer

In [12]:
# Import necessary packages
from transformers import BartTokenizer, BartForConditionalGeneration
import torch
!pip install pyemd
!pip install pytorch_pretrained_bert
!pip install moverscore
from moverscore import word_mover_score, get_idf_dict



In [13]:
# Use GPU if you can, otherwise CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

In [14]:
# Build our custom Bart Scorer
class CustomBartScorer:
    def __init__(self, model_name="facebook/bart-large-cnn", device=device):
        """
        Initialize the tokenizer and model for computing BartScore.
        Args:
            model_name (str): Pretrained BART model checkpoint.
            device (str): Device to run computations
        """
        self.device = device
        self.tokenizer = BartTokenizer.from_pretrained(model_name)
        self.model = BartForConditionalGeneration.from_pretrained(model_name)
        self.model.to(device)
        self.model.eval()

    def compute_log_probs(self, src_text, tgt_text):
        """
        Compute the log probabilities of the target text given the source text.
        Args:
            src_text (str): Source text
            tgt_text (str): Target text
        Returns:
            log_prob (float): The log probability of the target text.
        """
        # Tokenize source and target texts
        src_inputs = self.tokenizer(src_text, return_tensors="pt", max_length=1024, truncation=True, padding=True).to(self.device)
        tgt_inputs = self.tokenizer(tgt_text, return_tensors="pt", max_length=1024, truncation=True, padding=True).to(self.device)

        # Forward pass with source as input and target as labels
        with torch.no_grad():
            outputs = self.model(**src_inputs, labels=tgt_inputs["input_ids"])
            logits = outputs.logits  # Logits: (batch_size, seq_len, vocab_size)

        # Compute log probabilities using log-softmax
        log_probs = torch.nn.functional.log_softmax(logits, dim=-1)

        # Gather log probabilities of the target tokens
        tgt_token_ids = tgt_inputs["input_ids"]
        tgt_mask = tgt_inputs["attention_mask"]
        seq_len = tgt_mask.sum(dim=1)

        # Collect log probabilities for the correct target tokens
        tgt_log_probs = log_probs.gather(2, tgt_token_ids.unsqueeze(-1)).squeeze(-1)

        # Mask out padding tokens and sum log probabilities
        tgt_log_probs = tgt_log_probs * tgt_mask
        total_log_probs = tgt_log_probs.sum(dim=1)

        # Normalize by sequence length
        normalized_log_probs = total_log_probs / seq_len

        return normalized_log_probs.item()

    def compute_bartscore(self, src, tgt):
        """
        Compute BartScore for a given source and target text.
        Args:
            src (str): Source text.
            tgt (str): Target text.
        Returns:
            score (float): BartScore value.
        """
        return self.compute_log_probs(src, tgt)
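
The score returned by `compute_log_probs` is the average log-probability of the target tokens: at each decoding step the logits are turned into log-probabilities with a log-softmax, the entry of the actual target token is picked out, padded positions are skipped, and the sum is divided by the number of real tokens. The following is a small pure-Python sketch of that arithmetic on made-up logits, not a replacement for the class above; `log_softmax` and `average_token_log_prob` are hypothetical helper names.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def average_token_log_prob(step_logits, target_ids, mask):
    """Average log-probability of the target tokens, skipping masked positions."""
    total = 0.0
    count = 0
    for logits, token_id, keep in zip(step_logits, target_ids, mask):
        if keep:
            total += log_softmax(logits)[token_id]
            count += 1
    return total / count


# Toy example: vocabulary of 3 tokens, 3 decoding steps, last step is padding
toy_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 1.0, 1.0]]
toy_targets = [0, 2, 1]
toy_mask = [1, 1, 0]

toy_score = average_token_log_prob(toy_logits, toy_targets, toy_mask)
# Both kept steps are uniform over 3 tokens, so the result is -log(3)
print(toy_score)
```

Because the score is an average of log-probabilities, it is never positive, and values closer to zero indicate a target that the model finds more likely given the source.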


In [15]:
# Give an instance of our bartscorer
scorer = CustomBartScorer(model_name="facebook/bart-large-cnn", device=device)


In [16]:
# Test it on one element of the dataset: score the reference summary
# conditioned on the textrank system summary
source = data['2140']["src"]
ref_summary = data['2140']["ref_summ"]
sys_summary = data['2140']["sys_summs"]['textrank']['sys_summ']
bart_score = scorer.compute_bartscore(sys_summary, ref_summary)
print(f"BartScore: {bart_score}")

BartScore: -3.3545756340026855
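
The call above scores the reference summary conditioned on the system summary, whereas the evaluation loop in section 3 conditions on the source article. The BARTScore paper names these directions: faithfulness (source to hypothesis), precision (reference to hypothesis), recall (hypothesis to reference), and an F score combining the last two. A minimal sketch of a helper that computes all four from any `score_fn(source_text, target_text)` is given below; `bartscore_variants` and `toy_score_fn` are hypothetical names introduced here, and the plain average used for the F score is an assumption of this sketch.

```python
def bartscore_variants(score_fn, src, ref, hypo):
    """Compute the four scoring directions from a score_fn(source_text, target_text)."""
    precision = score_fn(ref, hypo)   # reference -> hypothesis
    recall = score_fn(hypo, ref)      # hypothesis -> reference
    return {
        "faithfulness": score_fn(src, hypo),  # source -> hypothesis
        "precision": precision,
        "recall": recall,
        "f_score": (precision + recall) / 2,  # assumption: plain average
    }


def toy_score_fn(source_text, target_text):
    """Stand-in scorer (NOT BARTScore): negative word count of the target."""
    return -float(len(target_text.split()))


toy_variants = bartscore_variants(toy_score_fn, "a b c d", "a b", "a b c")
print(toy_variants)
```

With the scorer built in this notebook, the same helper could be called as `bartscore_variants(scorer.compute_bartscore, source, ref_summary, sys_summary)`.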


## 3. Evaluation and Comparison


In [17]:
# May need to install those two packages to run the code
!pip install evaluate
!pip install rouge_score

import evaluate
from nltk.tokenize import sent_tokenize

rouge = evaluate.load("rouge")


In [18]:
!pip install bert-score
from bert_score import BERTScorer # Import BERTScorer

# initialize BERTScorer
Bertscorer = BERTScorer(lang="en", rescale_with_baseline=True)



In [29]:
# Store ROUGE, BertScore and BartScore results
rouge1_scores = []
rouge2_scores = []
rougeL_scores = []
bert_scores = []
bart_scores = []

# Loop over each element in our dataset
for element in data.values():
    src = element["src"]
    ref_summ = element["ref_summ"]
    sys_summary = element["sys_summs"]

    # For each system summary variant (fragments, textrank, etc.), compute ROUGE and BartScore
    for key, sys_sum_dict in sys_summary.items():
        sys_sum = sys_sum_dict["sys_summ"]
        scores = sys_sum_dict["scores"]

        # Compute ROUGE score
        rouge_result = rouge.compute(predictions=[sys_sum], references=[ref_summ])

        # Store Rouge1 / Rouge2 / RougeL scores
        rouge1_score = rouge_result['rouge1']
        rouge2_score = rouge_result['rouge2']
        rougeL_score = rouge_result['rougeL']

        # Compute BertScore
        P, R, F = Bertscorer.score([sys_sum], [ref_summ])

        # Compute BartScore
        bart_score = scorer.compute_bartscore(src, sys_sum)

        # Append the results
        rouge1_scores.append(rouge1_score)
        rouge2_scores.append(rouge2_score)
        rougeL_scores.append(rougeL_score)
        bart_scores.append(bart_score)
        # F is a one-element tensor; store it as a plain Python float
        bert_scores.append(F.item())




In [30]:
# Initialize lists to store generated summaries and reference summaries
generated_texts = []
reference_texts = []

# Iterate through the dataset to extract generated and reference texts
for element in data.values():
    ref_summ = element['ref_summ']  # Reference summary
    sys_summaries = element['sys_summs']  # System-generated summaries collection

    for sys_name, sys_data in sys_summaries.items():
        sys_summ = sys_data['sys_summ']  # Extract system-generated summary

        # Add generated and reference texts to the respective lists
        generated_texts.append(sys_summ)
        reference_texts.append(ref_summ)


# Calculate the IDF for reference and generated summaries
idf_reference = get_idf_dict(reference_texts)
idf_generated = get_idf_dict(generated_texts)

# Calculate MoverScore
mover_scores = word_mover_score(
    reference_texts,          # List of reference summaries
    generated_texts,          # List of generated summaries
    idf_reference,            # IDF dictionary for reference texts
    idf_generated,            # IDF dictionary for generated texts
    stop_words=[],            # Stopwords, typically used to remove non-essential words
    n_gram=1,                 # Use n-gram, default is 1 (unigram)
    remove_subwords=True,     # Whether to remove subwords
    batch_size=8,             # Batch size, adjust to improve calculation speed
    device=device             # Reuse the device selected earlier ('cuda' or 'cpu')
)
print(f"MoverScore: {mover_scores}")

MoverScore: [0.6443659943650033, -0.0975651378471547, -0.1674827449099776, -0.06380617310423342, -0.022069119871935605, -0.026302616397309464, -0.022481765900183026, 0.027062012660817092, 0.8582724117840813, 0.24099229829800684, -0.08201828437115455, -0.14490503358300888, -0.020782506203897544, 0.1173039639128558, -0.048402421317363054, 0.5252036284162317, -0.05651812396520017, -0.22360462486757404, -0.025139835822427514, -0.015850034516798894, -0.09210653280002279, -0.17566555315860444, -0.03660430725619679, -0.08502075188982294, -0.14731689691116512, 0.28627290240799996, -0.20522721183164427, -0.21260498793138027, -0.11135289014875394, 0.15498794496845314, -0.13670811160668084, -0.12676933656232303, -0.1280275397616546, -0.10867235075851789, -0.1261064357303463, -0.07536147865991438, 0.37193658346650926, -0.07228040097024402, -0.17619373371331304, -0.022718053956901407, -0.17363561773568903, -0.11525647323524457, 0.4563836034336717, 0.9254228234068088, 0.20596120292667086, -0.3156646

In [21]:
# Retrieve and store the coherence, fluency, informativeness, relevance for the
# different summarization methods : 'fragments', 'textrank', 'abstractive',
# 'pointer_c', 'pointer_n', 'pointer_s', 'lede3'

# Initialize empty arrays to store the values
coh = []
flu = []
info = []
rel = []

# Loop over each element in our dataset
for element in data.values():
    src = element["src"]
    ref_summ = element["ref_summ"]
    sys_summary = element["sys_summs"]

    # Loop over the different summarization methods (fragments, textrank, etc.)
    for key, sys_sum_dict in sys_summary.items():

        # append the scores to their corresponding lists
        coh.append(sys_sum_dict['scores']["coherence"])
        flu.append(sys_sum_dict['scores']["fluency"])
        info.append(sys_sum_dict['scores']["informativeness"])
        rel.append(sys_sum_dict['scores']["relevance"])

In [22]:
# Import necessary package to compute the spearman correlation
from scipy.stats import spearmanr
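
Spearman's coefficient is the Pearson correlation of the rank-transformed values, with tied values sharing the average of their ranks. The sketch below spells this out in plain Python for illustration only; `average_ranks` and `spearman_rho` are names made up here, and the notebook itself keeps using `scipy.stats.spearmanr`, which additionally returns a p-value.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up metric scores and human ratings with identical ordering
toy_metric = [-3.2, -1.5, -2.0, -0.7]
toy_human = [2.0, 4.0, 3.0, 5.0]
print(spearman_rho(toy_metric, toy_human))
```

Only the ordering matters, which is why negative BARTScore values and 1-to-5 human ratings can be compared directly.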

In [32]:
# Compute the correlation between the coherence score and the other metrics
def compute_spearman_correlation_coh(rouge1_scores, rouge2_scores, rougeL_scores, bart_scores, bert_scores, mover_scores, coh):
    for scores, name in zip([rouge1_scores, rouge2_scores, rougeL_scores, bart_scores,bert_scores,mover_scores],
                            ['ROUGE-1', 'ROUGE-2', 'ROUGE-L', 'BartScore','BertScore','MoverScore']):
        corr, p_value = spearmanr(scores, coh)
        print(f"Spearman correlation between {name} and COH: {corr}\nP-value: {p_value}")

compute_spearman_correlation_coh(rouge1_scores, rouge2_scores, rougeL_scores, bart_scores,bert_scores,mover_scores, coh)


Spearman correlation between ROUGE-1 and COH: 0.08937495069583032
P-value: 0.06727295916026399
Spearman correlation between ROUGE-2 and COH: 0.0720808027461955
P-value: 0.14028428617124158
Spearman correlation between ROUGE-L and COH: 0.05166854003912919
P-value: 0.2907672589175126
Spearman correlation between BartScore and COH: 0.6247454492952434
P-value: 7.621659394815542e-47
Spearman correlation between BertScore and COH: 0.1694246568287391
P-value: 0.0004883511314455359
Spearman correlation between MoverScore and COH: 0.16533263993512104
P-value: 0.0006698000474395901


In [33]:
# Compute the correlation between the fluency score and the other metrics
def compute_spearman_correlation_flu(rouge1_scores, rouge2_scores, rougeL_scores, bart_scores,bert_scores, mover_scores,flu):
    for scores, name in zip([rouge1_scores, rouge2_scores, rougeL_scores, bart_scores,bert_scores,mover_scores],
                            ['ROUGE-1', 'ROUGE-2', 'ROUGE-L', 'BartScore','BertScore','MoverScore']):
        corr, p_value = spearmanr(scores, flu)
        print(f"Spearman correlation between {name} and FLU: {corr}\nP-value: {p_value}")

compute_spearman_correlation_flu(rouge1_scores, rouge2_scores, rougeL_scores, bart_scores, bert_scores,mover_scores,flu)

Spearman correlation between ROUGE-1 and FLU: 0.04913050265983297
P-value: 0.3151465253574176
Spearman correlation between ROUGE-2 and FLU: 0.04093256264643066
P-value: 0.402750779646013
Spearman correlation between ROUGE-L and FLU: 0.017426223538463723
P-value: 0.7217701692453289
Spearman correlation between BartScore and FLU: 0.5938853417089718
P-value: 2.1707612445024725e-41
Spearman correlation between BertScore and FLU: 0.15406962938632385
P-value: 0.0015401641155744902
Spearman correlation between MoverScore and FLU: 0.10763894433819626
P-value: 0.02739942574786073


In [34]:
# Compute the correlation between the informativeness score and the other metrics
def compute_spearman_correlation_info(rouge1_scores, rouge2_scores, rougeL_scores, bart_scores,bert_scores, mover_scores,info):
    for scores, name in zip([rouge1_scores, rouge2_scores, rougeL_scores, bart_scores,bert_scores,mover_scores],
                            ['ROUGE-1', 'ROUGE-2', 'ROUGE-L', 'BartScore','BertScore','MoverScore']):
        corr, p_value = spearmanr(scores, info)
        print(f"Spearman correlation between {name} and INFO: {corr}\nP-value: {p_value}")

compute_spearman_correlation_info(rouge1_scores, rouge2_scores, rougeL_scores, bart_scores, bert_scores,mover_scores,info)

Spearman correlation between ROUGE-1 and INFO: 0.13728730571285852
P-value: 0.0048244924217598975
Spearman correlation between ROUGE-2 and INFO: 0.14999723331994472
P-value: 0.002053917325813913
Spearman correlation between ROUGE-L and INFO: 0.10683759872077521
P-value: 0.02857771202815741
Spearman correlation between BartScore and INFO: 0.5977529745204947
P-value: 4.8444825655407734e-42
Spearman correlation between BertScore and INFO: 0.19615354758472714
P-value: 5.1799283857121995e-05
Spearman correlation between MoverScore and INFO: 0.22599638693025603
P-value: 2.8911047936252053e-06


In [35]:
# Compute the correlation between the relevance score and the other metrics
def compute_spearman_correlation_rel(rouge1_scores, rouge2_scores, rougeL_scores, bart_scores,bert_scores,mover_scores, rel):
    for scores, name in zip([rouge1_scores, rouge2_scores, rougeL_scores, bart_scores,bert_scores,mover_scores],
                            ['ROUGE-1', 'ROUGE-2', 'ROUGE-L', 'BartScore','BertScore','MoverScore']):
        corr, p_value = spearmanr(scores, rel)
        print(f"Spearman correlation between {name} and REL: {corr}\nP-value: {p_value}")

compute_spearman_correlation_rel(rouge1_scores, rouge2_scores, rougeL_scores, bart_scores, bert_scores,mover_scores,rel)

Spearman correlation between ROUGE-1 and REL: 0.11233278453959593
P-value: 0.02130314881091199
Spearman correlation between ROUGE-2 and REL: 0.11976021355378536
P-value: 0.014053845719354285
Spearman correlation between ROUGE-L and REL: 0.07417255217696875
P-value: 0.129103703354877
Spearman correlation between BartScore and REL: 0.5668580028177166
P-value: 4.491453997544971e-37
Spearman correlation between BertScore and REL: 0.17634863442108228
P-value: 0.00028144795501462147
Spearman correlation between MoverScore and REL: 0.18498063903139395
P-value: 0.00013749585892896044
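
The four cells above differ only in which human dimension they correlate against, so they could be folded into a single loop over metrics and dimensions. A minimal sketch of that idea follows; `correlation_table` and `pair_agreement` are hypothetical names, and `pair_agreement` is only a simple stand-in correlation used to keep the example dependency-free, not the Spearman coefficient reported above.

```python
def correlation_table(metric_scores, human_scores, corr_fn):
    """Correlate every metric with every human dimension using corr_fn(x, y)."""
    table = {}
    for metric_name, metric_values in metric_scores.items():
        table[metric_name] = {
            dimension: corr_fn(metric_values, ratings)
            for dimension, ratings in human_scores.items()
        }
    return table


def pair_agreement(x, y):
    """Stand-in correlation: (concordant - discordant pairs) / number of pairs."""
    total = 0
    pairs = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            product = (x[i] - x[j]) * (y[i] - y[j])
            total += (product > 0) - (product < 0)
            pairs += 1
    return total / pairs


# Made-up scores for two metrics and two human dimensions
toy_metrics = {"MetricA": [0.1, 0.4, 0.9], "MetricB": [0.9, 0.4, 0.1]}
toy_human = {"COH": [1.0, 2.0, 3.0], "FLU": [3.0, 2.0, 1.0]}

toy_table = correlation_table(toy_metrics, toy_human, pair_agreement)
for metric_name, row in toy_table.items():
    print(metric_name, row)
```

In the notebook, `corr_fn` would be a small wrapper that calls `spearmanr` and returns the first element of its result, with `metric_scores` built from the six score lists and `human_scores` from `coh`, `flu`, `info`, and `rel`.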
