# W266 Final Project - Evaluating LED and Baselines

**Description:** 

- This notebook evaluates (a) the 2 off-the-shelf LED checkpoints at 2 input-length settings; (b) the 40 LED-Base-16384 checkpoints produced by fine-tuning on the X-Science dataset for 2 epochs, with 20 checkpoints per epoch; and (c) a baseline that simply copies the first X tokens/sentences of the first abstract (i.e. the input document).
- An additional scoring mechanism is used alongside the usual ROUGE scores: ROUGE-L scores the percentage of words in the summary that are copied from the first abstract.
- The rationale is that, in a multi-document summarization task, a summary that merely copies from the first abstract likely fails to take into account the information in the other documents.
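
To make the copying score concrete, here is a minimal pure-Python sketch of an LCS-based precision, which is the idea behind using ROUGE-L precision against the first abstract. It is an illustration only: the notebook itself relies on the `rouge` metric, whose tokenization and stemming differ, and the helper names `lcs_length`, `copy_precision` and `lead_tokens` as well as the toy strings are assumptions made for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def copy_precision(summary, first_abstract):
    """Share of summary tokens that lie on the LCS with the first abstract."""
    summary_tokens = summary.lower().split()
    abstract_tokens = first_abstract.lower().split()
    if not summary_tokens:
        return 0.0
    return lcs_length(summary_tokens, abstract_tokens) / len(summary_tokens)


def lead_tokens(first_abstract, k):
    """Hypothetical lead baseline: keep the first k whitespace tokens."""
    return " ".join(first_abstract.split()[:k])


# Toy strings, not taken from the dataset
abstract = "agent tracking is one key capability required for intelligent interaction"
copied = lead_tokens(abstract, 5)
mixed = "agent tracking also draws on evidence from other papers"

print(copy_precision(copied, abstract))           # 1.0
print(round(copy_precision(mixed, abstract), 3))  # 0.222
```

A pure lead baseline scores 1.0 by construction, while a summary that brings in material from elsewhere scores lower, which is why a high value is read here as a sign that the other documents were ignored.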

## Setup

In [1]:
import evaluate
from pprint import pprint

## General plotting
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

## Managing memory
import gc
import pickle

## Text processing
import re
import numpy as np
from scipy import stats as st

In [2]:
from datasets import load_dataset, load_metric

In [3]:
## Checking if GPU is available when running locally
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

Using device: cuda



In [12]:
## Loading rouge
rouge = load_metric("rouge")

Using the latest cached version of the module from C:\Users\JustinTo\.cache\huggingface\modules\datasets_modules\metrics\rouge\0ffdb60f436bdb8884d5e4d608d53dbe108e82dac4f494a66f80ef3f647c104f (last modified on Sat Mar 11 13:07:34 2023) since it couldn't be found locally at rouge, or remotely on the Hugging Face Hub.


# 1. Loading X-Science Dataset (Test Set only)

## 1.1 Loading the dataset

In [4]:
## Loading the dataset
xsci_test = load_dataset('multi_x_science_sum', split='test')

## Separator for text processing, as X-Science does not concatenate the source articles
DOC_SEP = "|||||"

Using the latest cached version of the module from C:\Users\JustinTo\.cache\huggingface\modules\datasets_modules\datasets\multi_x_science_sum\2876ec0401f8f5c5acf7f4857dbc8d6229a390ab428321ab848f03f14b7f9729 (last modified on Mon Mar 13 14:43:54 2023) since it couldn't be found locally at multi_x_science_sum., or remotely on the Hugging Face Hub.
Found cached dataset multi_x_science_sum (C:/Users/JustinTo/.cache/huggingface/datasets/multi_x_science_sum/default/1.1.0/2876ec0401f8f5c5acf7f4857dbc8d6229a390ab428321ab848f03f14b7f9729)


## 1.2 Preprocessing

- Tokenization is not necessary, as all the answers from the models/baseline to be compared against the test labels are already in text form.
- So we only need to pre-process the X-Science dataset labels into the form we want, e.g. replacing the numbered citation markers (`@cite_N`) with `@cite`.
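
As a small illustration of the two steps described above, the sketch below applies the citation normalization and the abstract concatenation to a single hand-written record. The record is made up for this example and only mimics the field names used in the notebook; `CITE_PATTERN` is a stand-in name for the compiled pattern.

```python
import re

DOC_SEP = "|||||"
CITE_PATTERN = re.compile(r"@cite_[0-9]+")

# Made-up record shaped like one X-Science example (not real data)
example = {
    "abstract": "We study agent models.",
    "ref_abstract": {
        "abstract": ["Agents track other agents.", "", "Prices are volatile."]
    },
    "related_work": "Tracking is studied in @cite_12 and pricing in @cite_3 .",
}

# Join the main abstract with the non-empty reference abstracts
source = example["abstract"] + DOC_SEP + DOC_SEP.join(
    a for a in example["ref_abstract"]["abstract"] if a
)

# Collapse every numbered citation marker to a plain "@cite"
target = CITE_PATTERN.sub("@cite", example["related_work"])

print(source)
# We study agent models.|||||Agents track other agents.|||||Prices are volatile.
print(target)
# Tracking is studied in @cite and pricing in @cite .
```

Empty reference abstracts are dropped before joining, so no doubled separators appear in the concatenated input.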

In [5]:
pat = re.compile("@cite_[0-9]+")

In [6]:
def preprocess_dataset(example):
    output = {}
    output["abstracts"] = (
        example["abstract"].split("| Abstract: ")[-1]
        + DOC_SEP
        + DOC_SEP.join([x for x in example["ref_abstract"]["abstract"] if x])
    )
    output["related_work"] = pat.sub("@cite", example["related_work"])
    
    return output

In [29]:
def preprocess_dataset_batched(example):
    output = {}
    output["abstracts"] = []
    output["related_work"] = []
    output["main_article"] = []
    
    for abstract, ref_abstract in zip(
        example["abstract"], example["ref_abstract"]
    ):
        output["abstracts"].append(
            abstract.split("| Abstract: ")[-1]
            + DOC_SEP
            + DOC_SEP.join([x for x in ref_abstract["abstract"] if x])
        )
        
        # Main article added for calculating the degree of copying
        output["main_article"].append(abstract)
        
    for related_work in example["related_work"]:
        output["related_work"].append(pat.sub("@cite", related_work))
    
    return output

In [30]:
xsci_test_processed = xsci_test.map(
    # preprocess_dataset,
    preprocess_dataset_batched,
    remove_columns=xsci_test.column_names,
    batched=True,
    batch_size=1,
    )

  0%|          | 0/5093 [00:00<?, ?ba/s]

# 2. Baseline LED Results

## 2.1 Off-the-shelf Large Checkpoint ("LED-large-16384-arxiv")

### 2.1.1 Long Input Sequence Length (16384) & Sample Answers

In [10]:
## Loading pickled results
with open("answers_revised/baselines/LED_large_16384tokens.pkl", "rb") as f:
    answers_exp1 = pickle.load(f)

In [14]:
## Calculating the rouge score
metric_exp1 = rouge.compute(predictions=answers_exp1,
                            references=[ref for ref in xsci_test_processed['related_work']],
                            use_stemmer = True)

copying_metric_exp1 = rouge.compute(predictions=answers_exp1,
                                    references=[ref for ref in xsci_test_processed['main_article']],
                                    use_stemmer = True)

In [35]:
metric_exp1

{'rouge1': AggregateScore(low=Score(precision=0.2909941324368112, recall=0.35105082313707636, fmeasure=0.30077302726714655), mid=Score(precision=0.2942472156556126, recall=0.35374213462422277, fmeasure=0.3030829300045328), high=Score(precision=0.2970721131079737, recall=0.356640422810119, fmeasure=0.3052169281154026)),
 'rouge2': AggregateScore(low=Score(precision=0.05114039705929173, recall=0.061469198630368205, fmeasure=0.05257839595270015), mid=Score(precision=0.05250657974529506, recall=0.06306549953449214, fmeasure=0.05394162344303733), high=Score(precision=0.053884108434255444, recall=0.06468015800210036, fmeasure=0.05522453940119105)),
 'rougeL': AggregateScore(low=Score(precision=0.14947553643070308, recall=0.18391832303844824, fmeasure=0.15527962272327), mid=Score(precision=0.15112890755384387, recall=0.18579472374569955, fmeasure=0.1565722710198794), high=Score(precision=0.15275375825054335, recall=0.18756221498065537, fmeasure=0.15778411072390058)),
 'rougeLsum': AggregateSc

In [34]:
copying_metric_exp1

{'rouge1': AggregateScore(low=Score(precision=0.5804809712983097, recall=0.43098446626073994, fmeasure=0.4795182335301225), mid=Score(precision=0.5866903800577534, recall=0.4363999019267688, fmeasure=0.4848281946091112), high=Score(precision=0.5929851958465849, recall=0.441824876210364, fmeasure=0.49040639901921956)),
 'rouge2': AggregateScore(low=Score(precision=0.3632678438094504, recall=0.2677063043309678, fmeasure=0.2989803918609445), mid=Score(precision=0.3732204090197948, recall=0.27503091164081683, fmeasure=0.307423927697038), high=Score(precision=0.38266863048879474, recall=0.28216887324677353, fmeasure=0.3151152944739423)),
 'rougeL': AggregateScore(low=Score(precision=0.445590154977894, recall=0.32988122718600327, fmeasure=0.36738107006177745), mid=Score(precision=0.45340047127281485, recall=0.3360586042185112, fmeasure=0.37402672650395546), high=Score(precision=0.4613234726241873, recall=0.3420236033778624, fmeasure=0.38022869054501984)),
 'rougeLsum': AggregateScore(low=Sco

In [44]:
answers_exp1[0]

" in multi-agent environments, an intelligent agent often needs to interact with other individuals or groups of agents to achieve its goals. agent tracking is one key capability required for intelligent interaction. \n it involves monitoring the observable actions of other agents and inferring their unobserved actions, plans, goals and behaviors. \n this article examines the implications of such an agent tracking capability for agent architectures. it specifically focuses on real-time and dynamic environments, where an intelligent agent is faced with the challenge of tracking the highly flexible mix of goal-driven and reactive behaviors of other agents, in real-time. \n the key implication is that an agent architecture needs to provide direct support for flexible and efficient reasoning about other agents' models. in this article, such support takes the form of an architectural capability to execute the other agent s models, enabling mental simulation of their behaviors. \n other archi

### 2.1.2 Short Input Sequence Length (1024) & Sample Answers

In [36]:
## Loading pickled results
with open("answers_revised/baselines/LED_large_1024tokens.pkl", "rb") as f:
    answers_exp2 = pickle.load(f)

In [37]:
## Calculating the rouge score
metric_exp2 = rouge.compute(predictions=answers_exp2,
                            references=[ref for ref in xsci_test_processed['related_work']],
                            use_stemmer = True)

copying_metric_exp2 = rouge.compute(predictions=answers_exp2,
                                    references=[ref for ref in xsci_test_processed['main_article']],
                                    use_stemmer = True)

In [38]:
metric_exp2

{'rouge1': AggregateScore(low=Score(precision=0.29202173734823894, recall=0.3476110916538046, fmeasure=0.29941398786584933), mid=Score(precision=0.29515888545413427, recall=0.3503062240451726, fmeasure=0.3017650577927991), high=Score(precision=0.29811998899916725, recall=0.3531599191621091, fmeasure=0.3039226750095273)),
 'rouge2': AggregateScore(low=Score(precision=0.05109327022756113, recall=0.06064176630880606, fmeasure=0.05217138520824604), mid=Score(precision=0.05252993312220889, recall=0.062282521667073425, fmeasure=0.053504252967343895), high=Score(precision=0.05392380879732633, recall=0.06392864814481376, fmeasure=0.05484064290952403)),
 'rougeL': AggregateScore(low=Score(precision=0.1503566582559145, recall=0.18257182819385767, fmeasure=0.1550218335890702), mid=Score(precision=0.1520194357783777, recall=0.18445279412278887, fmeasure=0.15626907514442895), high=Score(precision=0.1535550065693761, recall=0.1862921612922484, fmeasure=0.15741066865253625)),
 'rougeLsum': AggregateS

In [39]:
copying_metric_exp2

{'rouge1': AggregateScore(low=Score(precision=0.5946941944928145, recall=0.4360596677977077, fmeasure=0.4877595209953568), mid=Score(precision=0.6011510151203083, recall=0.44168158913879896, fmeasure=0.49339745494004583), high=Score(precision=0.6074880685655073, recall=0.4471351026036722, fmeasure=0.49838492081865404)),
 'rouge2': AggregateScore(low=Score(precision=0.38500089632351125, recall=0.2806171160843172, fmeasure=0.3152320162216887), mid=Score(precision=0.39451993949593006, recall=0.2873483711825244, fmeasure=0.32274421092300376), high=Score(precision=0.4040176116807798, recall=0.2948062312139038, fmeasure=0.33067826858405786)),
 'rougeL': AggregateScore(low=Score(precision=0.4635350151715852, recall=0.33911062747048365, fmeasure=0.379926475297668), mid=Score(precision=0.4716740579323859, recall=0.3454384618777596, fmeasure=0.38650097915044423), high=Score(precision=0.47946183157903216, recall=0.3516316762927688, fmeasure=0.39324304836801555)),
 'rougeLsum': AggregateScore(low=

In [46]:
answers_exp2[0]

' we present our approach to the problem of how an agent, within an economic Multi-Agent System, can determine when it should behave strategically (i.e. learn and use models of other agents ), and when it should act as a simple price-taker. we provide a framework for the incremental implementation of modeling capabilities in agents, and a description of the forms of knowledge required. \n we have implemented an agent architecture, an experimental variant of the soar integrated architecture, that conforms to all of these requirements. \n agents based on this architecture have been implemented to execute two different tasks in a real-time, dynamic, multi-agent domain. \n the agents were implemented and different populations simulated in order to learn more about their behavior and the merits of using and learning agent models. \n our results show, among other lessons, how savvy buyers can avoid being cheated by sellers, how price volatility can be used to quantitatively predict the benef

## 2.2 Off-the-shelf Base Checkpoint ("LED-base-16384")

### 2.2.1 Long Input Sequence Length (16384) & Sample Answers

In [40]:
## Loading pickled results
with open("answers_revised/baselines/LED_base_16384tokens.pkl", "rb") as f:
    answers_exp3 = pickle.load(f)

In [41]:
## Calculating the rouge score
metric_exp3 = rouge.compute(predictions=answers_exp3,
                            references=[ref for ref in xsci_test_processed['related_work']],
                            use_stemmer = True)

copying_metric_exp3 = rouge.compute(predictions=answers_exp3,
                                    references=[ref for ref in xsci_test_processed['main_article']],
                                    use_stemmer = True)

In [42]:
metric_exp3

{'rouge1': AggregateScore(low=Score(precision=0.25830633435027256, recall=0.3968438262683915, fmeasure=0.29700336313124903), mid=Score(precision=0.26126087743485504, recall=0.399375050558616, fmeasure=0.29940693690755715), high=Score(precision=0.26391486209433423, recall=0.4020188104898597, fmeasure=0.30166923811178403)),
 'rouge2': AggregateScore(low=Score(precision=0.04430336443409369, recall=0.06852074159341255, fmeasure=0.051092768490646624), mid=Score(precision=0.045255209888487094, recall=0.06988058539713318, fmeasure=0.0520976031529633), high=Score(precision=0.04614715675220126, recall=0.07132512576368692, fmeasure=0.05302513627299239)),
 'rougeL': AggregateScore(low=Score(precision=0.12799178381235296, recall=0.20292193492404648, fmeasure=0.14819555110490149), mid=Score(precision=0.1292496808406487, recall=0.2046865870656942, fmeasure=0.14914610839134823), high=Score(precision=0.13045063779179358, recall=0.20638772726265217, fmeasure=0.1502112805002533)),
 'rougeLsum': Aggregat

In [43]:
copying_metric_exp3

{'rouge1': AggregateScore(low=Score(precision=0.795455328386271, recall=0.775267100188124, fmeasure=0.7635032226541796), mid=Score(precision=0.8012543025205973, recall=0.7815542031014585, fmeasure=0.769430494044121), high=Score(precision=0.8068480324379127, recall=0.7887987480599074, fmeasure=0.7752297877709678)),
 'rouge2': AggregateScore(low=Score(precision=0.7364697647471133, recall=0.7284697152197072, fmeasure=0.7139075466914486), mid=Score(precision=0.7448441298601729, recall=0.7368793284311195, fmeasure=0.7217796335805167), high=Score(precision=0.752990380210822, recall=0.7459295677570408, fmeasure=0.7300286198202143)),
 'rougeL': AggregateScore(low=Score(precision=0.7608502160313695, recall=0.7474052789074276, fmeasure=0.7339103361154311), mid=Score(precision=0.7686121547499548, recall=0.7553827136959743, fmeasure=0.7414086776059847), high=Score(precision=0.776300493132067, recall=0.7623542609378742, fmeasure=0.7483180357351624)),
 'rougeLsum': AggregateScore(low=Score(precision

In [47]:
answers_exp3[0]

"We present our approach to the problem of how an agent, within an economic Multi-Agent System, can determine when it should behave strategically (i.e. learn and use models of other agents), and when it should act as a simple price-taker. We provide a framework for the incremental implementation of modeling capabilities in agents, and a description of the forms of knowledge required. The agents were implemented and different populations simulated in order to learn more about their behavior and the merits of using and learning agent models. Our results show, among other lessons, how savvy buyers can avoid being cheated'' by sellers, how price volatility can be used to quantitatively predict the benefits of deeper models, and how specific types of agent populations influence system behavior.|||||In multi-agent environments, an intelligent agent often needs to interact with other individuals or groups of agents to achieve its goals. Agent tracking is one key capability required for intell

### 2.2.2 Short Input Sequence Length (1024) & Sample Answers

In [48]:
## Loading pickled results
with open("answers_revised/baselines/LED_base_1024tokens.pkl", "rb") as f:
    answers_exp4 = pickle.load(f)

In [49]:
## Calculating the rouge score
metric_exp4 = rouge.compute(predictions=answers_exp4,
                            references=[ref for ref in xsci_test_processed['related_work']],
                            use_stemmer = True)

copying_metric_exp4 = rouge.compute(predictions=answers_exp4,
                                    references=[ref for ref in xsci_test_processed['main_article']],
                                    use_stemmer = True)

In [50]:
metric_exp4

{'rouge1': AggregateScore(low=Score(precision=0.25830633435027256, recall=0.3968438262683915, fmeasure=0.29700336313124903), mid=Score(precision=0.26126087743485504, recall=0.399375050558616, fmeasure=0.29940693690755715), high=Score(precision=0.26391486209433423, recall=0.4020188104898597, fmeasure=0.30166923811178403)),
 'rouge2': AggregateScore(low=Score(precision=0.04430336443409369, recall=0.06852074159341255, fmeasure=0.051092768490646624), mid=Score(precision=0.045255209888487094, recall=0.06988058539713318, fmeasure=0.0520976031529633), high=Score(precision=0.04614715675220126, recall=0.07132512576368692, fmeasure=0.05302513627299239)),
 'rougeL': AggregateScore(low=Score(precision=0.12799178381235296, recall=0.20292193492404648, fmeasure=0.14819555110490149), mid=Score(precision=0.1292496808406487, recall=0.2046865870656942, fmeasure=0.14914610839134823), high=Score(precision=0.13045063779179358, recall=0.20638772726265217, fmeasure=0.1502112805002533)),
 'rougeLsum': Aggregat

In [51]:
copying_metric_exp4

{'rouge1': AggregateScore(low=Score(precision=0.795455328386271, recall=0.775267100188124, fmeasure=0.7635032226541796), mid=Score(precision=0.8012543025205973, recall=0.7815542031014585, fmeasure=0.769430494044121), high=Score(precision=0.8068480324379127, recall=0.7887987480599074, fmeasure=0.7752297877709678)),
 'rouge2': AggregateScore(low=Score(precision=0.7364697647471133, recall=0.7284697152197072, fmeasure=0.7139075466914486), mid=Score(precision=0.7448441298601729, recall=0.7368793284311195, fmeasure=0.7217796335805167), high=Score(precision=0.752990380210822, recall=0.7459295677570408, fmeasure=0.7300286198202143)),
 'rougeL': AggregateScore(low=Score(precision=0.7608502160313695, recall=0.7474052789074276, fmeasure=0.7339103361154311), mid=Score(precision=0.7686121547499548, recall=0.7553827136959743, fmeasure=0.7414086776059847), high=Score(precision=0.776300493132067, recall=0.7623542609378742, fmeasure=0.7483180357351624)),
 'rougeLsum': AggregateScore(low=Score(precision

In [54]:
answers_exp4[0]

"We present our approach to the problem of how an agent, within an economic Multi-Agent System, can determine when it should behave strategically (i.e. learn and use models of other agents), and when it should act as a simple price-taker. We provide a framework for the incremental implementation of modeling capabilities in agents, and a description of the forms of knowledge required. The agents were implemented and different populations simulated in order to learn more about their behavior and the merits of using and learning agent models. Our results show, among other lessons, how savvy buyers can avoid being cheated'' by sellers, how price volatility can be used to quantitatively predict the benefits of deeper models, and how specific types of agent populations influence system behavior.|||||In multi-agent environments, an intelligent agent often needs to interact with other individuals or groups of agents to achieve its goals. Agent tracking is one key capability required for intell

# 3. Finetuned LED-Models

## 3.1 Compiling results

In [53]:
## Results are available for Epochs 1 and 2, Runs 5, 10, 15 and 20

model_list = [(epoch, num) for epoch in [1,2] for num in [5, 10, 15, 20]]

answers_tunedLED = {}
metric_tunedLED = {}
copying_metric_tunedLED = {}

for (epoch, num) in model_list:
    PATH = f"answers_revised/epoch{epoch}/LED_xsci_finetuned_run{num}.pkl"
    with open(PATH, "rb") as f:
        answers_tunedLED[(epoch, num)] = pickle.load(f)
    
for (key, value) in answers_tunedLED.items():
    metric_tunedLED[key] = rouge.compute(predictions=value,
                                         references=[ref for ref in xsci_test_processed['related_work']],
                                         use_stemmer = True)

    copying_metric_tunedLED[key] = rouge.compute(predictions=value,
                                                 references=[ref for ref in xsci_test_processed['main_article']],
                                                 use_stemmer = True)

In [58]:
metric_tunedLED.keys()

dict_keys([(1, 5), (1, 10), (1, 15), (1, 20), (2, 5), (2, 10), (2, 15), (2, 20)])

In [62]:
for (key, value) in metric_tunedLED.items():
    print(f"\n-----Results for Epoch {key[0]} Run {key[1]} Tuned LED Model-----")
    for (name, figures) in value.items():
        print(name.capitalize())
        print(figures)


-----Results for Epoch 1 Run 5 Tuned LED Model-----
Rouge1
AggregateScore(low=Score(precision=0.43096455287873037, recall=0.25111143174866973, fmeasure=0.298003255363893), mid=Score(precision=0.43512150987668774, recall=0.25362619435654454, fmeasure=0.30009905518756474), high=Score(precision=0.4388452341964497, recall=0.256417712889918, fmeasure=0.30224150244925657))
Rouge2
AggregateScore(low=Score(precision=0.08980536975033572, recall=0.05144090068252852, fmeasure=0.06117450713194414), mid=Score(precision=0.09171552038463286, recall=0.052573429410541925, fmeasure=0.062397025605467275), high=Score(precision=0.09374341221858144, recall=0.05379855034876872, fmeasure=0.06368920883418691))
Rougel
AggregateScore(low=Score(precision=0.25305664450449245, recall=0.14679973063146354, fmeasure=0.17391009204917518), mid=Score(precision=0.25543295003753413, recall=0.14843453471314955, fmeasure=0.1751987240474616), high=Score(precision=0.2579235219950028, recall=0.15012688752823256, fmeasure=0.176

In [63]:
for (key, value) in copying_metric_tunedLED.items():
    print(f"\n-----Results for Epoch {key[0]} Run {key[1]} Tuned LED Model-----")
    for (name, figures) in value.items():
        print(name.capitalize())
        print(figures)


-----Results for Epoch 1 Run 5 Tuned LED Model-----
Rouge1
AggregateScore(low=Score(precision=0.5226669879338464, recall=0.18182631614630482, fmeasure=0.2595916872195215), mid=Score(precision=0.5270972002290121, recall=0.1839541434785873, fmeasure=0.26188062737870965), high=Score(precision=0.5315583041008248, recall=0.18605108802815992, fmeasure=0.2642760697272397))
Rouge2
AggregateScore(low=Score(precision=0.1884356754463796, recall=0.06043040244380299, fmeasure=0.08797707849195342), mid=Score(precision=0.19336956242199865, recall=0.06212871641061322, fmeasure=0.0903653409964461), high=Score(precision=0.19838776624664275, recall=0.06369035936031123, fmeasure=0.0924940470829754))
Rougel
AggregateScore(low=Score(precision=0.34331872527699303, recall=0.11660297818811133, fmeasure=0.16721704170019797), mid=Score(precision=0.3476666548494963, recall=0.11807125193809927, fmeasure=0.16903793373128517), high=Score(precision=0.3516851683235269, recall=0.1195573144114709, fmeasure=0.1708252819
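
The printed `AggregateScore` objects are hard to compare across checkpoints, so a compact table of the `mid` values may help. The sketch below is one possible way to flatten a dictionary keyed by `(epoch, run)`; the `Score` and `AggregateScore` namedtuples are stand-ins that only mirror the field names visible in the output above, the helper `mid_scores` is hypothetical, and the numbers are placeholders rather than results. The `low` and `high` fields appear to be bootstrap interval bounds from the underlying aggregator, which is treated here as an assumption.

```python
from collections import namedtuple

# Stand-ins mirroring the field names seen in the printed output
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])
AggregateScore = namedtuple("AggregateScore", ["low", "mid", "high"])


def mid_scores(results):
    """Flatten {(epoch, run): {metric: AggregateScore}} into one row per checkpoint."""
    rows = []
    for (epoch, run), metrics in sorted(results.items()):
        row = {"epoch": epoch, "run": run}
        for name, agg in metrics.items():
            row[f"{name}_precision"] = agg.mid.precision
            row[f"{name}_recall"] = agg.mid.recall
            row[f"{name}_fmeasure"] = agg.mid.fmeasure
        rows.append(row)
    return rows


# Placeholder values for illustration only (not measured scores)
placeholder = AggregateScore(
    low=Score(0.4, 0.2, 0.3), mid=Score(0.5, 0.25, 0.35), high=Score(0.6, 0.3, 0.4)
)
toy_results = {
    (2, 5): {"rougeL": placeholder},
    (1, 5): {"rougeL": placeholder},
}

rows = mid_scores(toy_results)
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)` for sorting or plotting, with the same function applied to both `metric_tunedLED` and `copying_metric_tunedLED`.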

## Sandbox

In [188]:
gc.collect()

0