# W266 Final Project - Evaluating LED and Baselines

**Description:** 

- This notebook evaluates the performance of the fine-tuned Centrum model provided by Milan.
- Alongside the usual ROUGE scores, it adds a copying score: ROUGE-L computed against the first abstract, which estimates the percentage of words in the summary that are copied from that abstract.
- The rationale is that, in a multi-document summarization task, a summary that mostly copies from the first abstract probably fails to take the information in the other documents into account.
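
As a minimal sketch of the copying score described above, ROUGE-L precision is the length of the longest common subsequence (LCS) between the summary and the first abstract, divided by the summary length. This sketch assumes whitespace tokenization and no stemming, whereas the notebook itself relies on the `rouge` metric with `use_stemmer=True`. The function names are illustrative and do not appear elsewhere in the notebook.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def copying_precision(summary, first_abstract):
    """ROUGE-L precision of the summary against the first abstract.

    Assumption: lowercase + whitespace tokenization, no stemming.
    """
    summary_tokens = summary.lower().split()
    abstract_tokens = first_abstract.lower().split()
    if not summary_tokens:
        return 0.0
    return lcs_length(summary_tokens, abstract_tokens) / len(summary_tokens)
```

A value close to 1 means nearly every word of the summary can be found, in order, in the first abstract.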

## Setup

In [1]:
import evaluate
from pprint import pprint

## General plotting
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

## Managing memory
import gc
import pickle

## Text processing
import re
import numpy as np
from scipy import stats as st

In [2]:
## For printing out model summary in PyTorch
from torchvision import models
from torchsummary import summary

In [3]:
from transformers import (
    AdamW, AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

In [4]:
from datasets import load_dataset, load_metric

In [5]:
## Checking if GPU is available when running locally
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

Using device: cuda



In [6]:
## Loading rouge
rouge = load_metric("rouge")



# 1. Loading X-Science Dataset (Test Set only)

## 1.1 Loading the dataset

In [36]:
## Loading the dataset
dataset = load_dataset('multi_x_science_sum')

## For text processing, as X-Science does not concatenate the source articles
DOC_SEP = " ||||| "
BATCH_SIZE = 16
MAX_LENGTH_ENC = 4096
MAX_LENGTH_DEC = 256
docsep_token_id = 50266

Found cached dataset multi_x_science_sum (C:/Users/JustinTo/.cache/huggingface/datasets/multi_x_science_sum/default/1.1.0/2876ec0401f8f5c5acf7f4857dbc8d6229a390ab428321ab848f03f14b7f9729)



## 1.2 Preprocessing

- Tokenization is not needed for scoring, because the model and baseline outputs that are compared against the test labels are already in text form.
- So, for scoring, we only need to pre-process the X-Science labels into the form we want, e.g. replacing numbered citation markers with `@cite`.
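
The two preprocessing steps can be sketched on a made-up record as follows. The regular expression and separator match the ones used in this notebook; the helper names and the sample strings are invented for illustration.

```python
import re

DOC_SEP = " ||||| "                  # same separator string the notebook uses
pat = re.compile(r"@cite_[0-9]+")    # numbered citation markers


def normalize_citations(text):
    """Replace every numbered marker such as @cite_12 with a bare @cite."""
    return pat.sub("@cite", text)


def join_documents(main_abstract, ref_abstracts):
    """Concatenate the main abstract and the non-empty reference abstracts."""
    kept = [a for a in ref_abstracts if a]
    return main_abstract + DOC_SEP + DOC_SEP.join(kept)


# Hypothetical sample data, not taken from the dataset
example_related_work = "Prior work @cite_12 extends @cite_3."
example_input = join_documents("Main abstract.", ["Ref one.", "", "Ref two."])
```

Empty reference abstracts are dropped before joining, so they do not produce consecutive separators.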

In [8]:
pat = re.compile("@cite_[0-9]+")

In [41]:
def preprocess_dataset(example):

    abstracts = example["abstract"].split("| Abstract: ")[-1]
    related_work = pat.sub("@cite", example["related_work"])
    ref_abstracts = filter(bool, example["ref_abstract"]["abstract"])
    output = {
        "abstracts": f"{abstracts}{DOC_SEP}{DOC_SEP.join(ref_abstracts)}",
        "related_work": related_work,
        "main_article": abstracts,  # Main article added for calculating the degree of copying
    }
    return output

def preprocess_dataset_batched(example):
    abstracts = [
        abstract.split("| Abstract: ")[-1] + DOC_SEP + DOC_SEP.join([x for x in ref_abstract["abstract"] if x])
        for abstract, ref_abstract in zip(example["abstract"], example["ref_abstract"])
    ]
    related_work = [pat.sub("@cite", rw) for rw in example["related_work"]]
    abstract = [abstract.split("| Abstract: ")[-1] for abstract in example["abstract"]]
    output = {
        "abstracts": abstracts,
        "related_work": related_work,
        "main_article": abstract, # Main article added for calculating the degree of copying
    }
    return output

dataset_processed = {}
for split in dataset.keys():
    dataset_processed[split] = dataset[split].map(
        preprocess_dataset_batched,
        remove_columns=dataset[split].column_names,
        batched=True,
        batch_size=BATCH_SIZE,
    )


In [45]:
def tokenize_dataset_batched(example):
    # Tokenizer input
    input_encoding = centrum_tokenizer(
        example["abstracts"],
        padding="max_length",
        truncation=True,
        max_length=MAX_LENGTH_ENC,
        return_tensors="pt",
    )

    # Tokenizer output
    output_encoding = centrum_tokenizer(
        example["related_work"],
        padding="max_length",
        truncation=True,
        max_length=MAX_LENGTH_DEC,
        return_tensors="pt",
    )

    # Replace padding token ids in the labels with -100 so they are excluded from the loss
    # (PyTorch's cross-entropy loss ignores targets equal to -100 by default)
    labels = output_encoding["input_ids"].clone()
    labels[labels == centrum_tokenizer.pad_token_id] = -100

    # Global attention with vectorized operations (optimized for GPU)
    input_ids = input_encoding["input_ids"]
    docsep_token_id = centrum_tokenizer.convert_tokens_to_ids(DOC_SEP)
    global_attention_mask = (input_ids == centrum_tokenizer.cls_token_id) | (input_ids == docsep_token_id)

    return {
        "input_ids": input_encoding["input_ids"],
        "attention_mask": input_encoding["attention_mask"],
        "global_attention_mask": global_attention_mask.float(),
        "labels": labels,
    }

dataset_tokenized = {}
for split in dataset_processed.keys():
    dataset_tokenized[split] = (
        dataset_processed[split]
        .select(range(len(dataset_processed[split])))
        .map(
            tokenize_dataset_batched,
            remove_columns=dataset_processed[split].column_names,
            batched=True,
            batch_size=BATCH_SIZE,
        )
    )


In [49]:
dataset_processed['test'][0]

{'related_work': 'Within the MAS community, some work @cite has focused on how artificial AI-based learning agents would fare in communities of similar agents. For example, @cite and @cite show how agents can learn the capabilities of others via repeated interactions, but these agents do not learn to predict what actions other might take. Most of the work in MAS also fails to recognize the possible gains from using explicit agent models to predict agent actions. @cite is an exception and gives another approach for using nested agent models. However, they do not go so far as to try to quantify the advantages of their nested models or show how these could be learned via observations. We believe that our research will bring to the foreground some of the common observations seen in these research areas and help to clarify the implications and utility of learning and using nested agent models.',
 'abstracts': "We present our approach to the problem of how an agent, within an economic Multi-

In [57]:
dataset_tokenized['test']

Dataset({
    features: ['input_ids', 'attention_mask', 'global_attention_mask', 'labels'],
    num_rows: 5093
})

# 2. Generating Finetuned Centrum Results

## 2.1 Loading Model Checkpoint and Saved Weights from Finetuning

In [18]:
def get_tokenizer(host_tokenizer: str):
  """return the tokenizer and model for LLM training"""

  return (AutoTokenizer.from_pretrained(host_tokenizer, 
                                        use_cache=False, 
                                        gradient_checkpointing=True), 
          AutoModelForSeq2SeqLM.from_pretrained(host_tokenizer, 
                                                use_cache=False, 
                                                gradient_checkpointing=True).to("cuda").half())


centrum_tokenizer, centrum_model = get_tokenizer("ratishsp/Centrum")

In [24]:
DOC_SEP = " ||||| "

centrum_tokenizer.add_tokens(DOC_SEP, special_tokens=True)
centrum_model.resize_token_embeddings(len(centrum_tokenizer))
docsep_token_id = centrum_tokenizer.convert_tokens_to_ids(DOC_SEP)

In [25]:
centrum_model.load_state_dict(torch.load("../milan_working_file/centrum_xsci_test_2.pt"))

<All keys matched successfully>

## 2.2 Generating Results (Using No_repeat_ngram_size = 3; Global ID not fixed)

In [88]:
test_inputs_base = centrum_tokenizer(dataset_processed['test']['abstracts'],
                                padding="max_length",
                                max_length=MAX_LENGTH_ENC,
                                return_tensors="pt",
                                truncation=True)

In [100]:
def generate_abstract_batched(batch_size=2, start=0, no_repeat_ngram_size=3):
    
    try:
        del test_input_ids, attention_mask, global_attention_mask, predicted_abstract_ids
    except NameError:
        # Nothing to free yet: the tensors have not been created in this call
        pass
        
    gc.collect()

    test_input_ids = test_inputs_base['input_ids'][start:start+batch_size].to("cuda")
    attention_mask = test_inputs_base['attention_mask'][start:start+batch_size].to("cuda")

    global_attention_mask = torch.zeros_like(attention_mask)
    global_attention_mask[:, 0] = 1

    predicted_abstract_ids = centrum_model.generate(test_input_ids,
                                                    attention_mask=attention_mask, 
                                                    global_attention_mask=global_attention_mask, 
                                                    max_length=MAX_LENGTH_DEC,
                                                    no_repeat_ngram_size=no_repeat_ngram_size,
                                                    num_beams=4)

    predicted_abstract = centrum_tokenizer.batch_decode(predicted_abstract_ids, skip_special_tokens=True)
    
    return predicted_abstract

In [101]:
## Generating answers
test_batch_size = 2
no_repeat_ngram_size = 3
answers = []

for i in range(0, dataset_processed['test'].num_rows, test_batch_size):
    if i%500 == 0:
        print(f"Handling sample {i} now..")
        
    answers.append(generate_abstract_batched(start=i,
                                             batch_size=test_batch_size,
                                             no_repeat_ngram_size=no_repeat_ngram_size))
    
print(f"Completed, {i+1} data points handled.")

Handling sample 0 now..
Handling sample 500 now..
Handling sample 1000 now..
Handling sample 1500 now..
Handling sample 2000 now..
Handling sample 2500 now..
Handling sample 3000 now..
Handling sample 3500 now..
Handling sample 4000 now..
Handling sample 4500 now..
Handling sample 5000 now..
Completed, 5093 data points handled.


In [102]:
formatted_answers = []
for answer in answers:
    formatted_answers += answer

## Pickling results
with open("answers_revised/centrum/Centrum_finetuned_norepeat3.pkl", "wb") as f:
    pickle.dump(formatted_answers, f)

In [103]:
## Calculating the rouge score
metric_norepeat3 = rouge.compute(predictions=formatted_answers,
                                 references=[ref for ref in dataset_processed['test']['related_work']],
                                 use_stemmer = True)

copying_metric_norepeat3 = rouge.compute(predictions=formatted_answers,
                                         references=[ref for ref in dataset_processed['test']['main_article']],
                                         use_stemmer = True)

In [104]:
metric_norepeat3

{'rouge1': AggregateScore(low=Score(precision=0.3826156154413282, recall=0.26467030607223624, fmeasure=0.28980812020532937), mid=Score(precision=0.3866420951409433, recall=0.26744880399072335, fmeasure=0.2919580650349817), high=Score(precision=0.3901736385572176, recall=0.27036412353959066, fmeasure=0.29412808216862935)),
 'rouge2': AggregateScore(low=Score(precision=0.07192393346813435, recall=0.04792333904761617, fmeasure=0.053020161476219185), mid=Score(precision=0.07383598551371859, recall=0.04923474006873839, fmeasure=0.05438841734910711), high=Score(precision=0.07583711950717102, recall=0.050578569241275344, fmeasure=0.05574052725393261)),
 'rougeL': AggregateScore(low=Score(precision=0.21730830694718284, recall=0.14922890504528968, fmeasure=0.16311323743283307), mid=Score(precision=0.21975685175962914, recall=0.15095210399028297, fmeasure=0.16448325744895015), high=Score(precision=0.22243514112870227, recall=0.15276261717836814, fmeasure=0.16597635256121948)),
 'rougeLsum': Aggr

In [105]:
copying_metric_norepeat3

{'rouge1': AggregateScore(low=Score(precision=0.487419089761421, recall=0.20843581441831296, fmeasure=0.27821733049209924), mid=Score(precision=0.4920875973697113, recall=0.2112583712493406, fmeasure=0.2812356020172335), high=Score(precision=0.4968481177299069, recall=0.21409026644542298, fmeasure=0.2842420762559606)),
 'rouge2': AggregateScore(low=Score(precision=0.14896204722506382, recall=0.06190707065900231, fmeasure=0.08312978305796795), mid=Score(precision=0.15433009939597667, recall=0.06449500056554985, fmeasure=0.08640884730340798), high=Score(precision=0.15992727251486, recall=0.0672960420613433, fmeasure=0.08987080579019033)),
 'rougeL': AggregateScore(low=Score(precision=0.30866548313810493, recall=0.12926681060357437, fmeasure=0.17326579194689481), mid=Score(precision=0.3130410635125256, recall=0.1316757684912946, fmeasure=0.1760270355844274), high=Score(precision=0.3179158716007658, recall=0.13412013056713074, fmeasure=0.1789344608489763)),
 'rougeLsum': AggregateScore(low
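
The printed `AggregateScore` objects are hard to compare by eye. As a sketch, assuming the metric returns a dict of named tuples with `low`, `mid` and `high` fields, each holding `precision`, `recall` and `fmeasure` (as the printed output suggests), the mid-point values can be pulled into a flat dict. The named tuples below are stand-ins that mimic that structure, `mid_scores` is a hypothetical helper, and the numbers are the ROUGE-L copying scores printed above, rounded to four decimals.

```python
from collections import namedtuple

# Stand-ins mimicking the structure of the printed metric output
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])
AggregateScore = namedtuple("AggregateScore", ["low", "mid", "high"])


def mid_scores(metric, field="precision"):
    """Return {rouge_type: mid-point value of `field`} rounded to 4 decimals."""
    return {name: round(getattr(agg.mid, field), 4) for name, agg in metric.items()}


copying_example = {
    "rougeL": AggregateScore(
        low=Score(0.3087, 0.1293, 0.1733),
        mid=Score(0.3130, 0.1317, 0.1760),
        high=Score(0.3179, 0.1341, 0.1789),
    )
}

print(mid_scores(copying_example))              # precision: share of summary words matched
print(mid_scores(copying_example, "fmeasure"))
```

For the copying score, precision is the most relevant field, since it is normalized by the length of the generated summary.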

In [111]:
formatted_answers[0]

"In @cite, the authors present a framework for the incremental implementation of agent models, and a description of the forms of knowledge required. The agents were implemented to execute two different tasks in a real-time, dynamic, multi-agent domain. The authors present experimental results illustrating the agents' dynamic behavior, and show, among other lessons, how savvy buyers can avoid being cheated'' by sellers, how price volatility can be used to quantitatively predict the benefits of deeper models and how specific types of agent populations influence system behavior."

In [112]:
formatted_answers[2942]

"In @cite, the authors study the problem of minimizing convex and concave functions with access to an erroneous zeroth-order oracle. In particular, they consider optimization when one is given access to absolute error oracles that return values in [f(x) - @math, @math ] or relative error oracle that return value in @math. In this paper, we consider the class of all @math -player non-cooperative games with at least one NE such that the players' utility functions satisfy a certain (differential) constraint."

## 2.3 Generating Results (Using No_repeat_ngram_size = 3; Global ID Fixed)

In [116]:
def generate_abstract_batched2(batch_size=2, start=0, no_repeat_ngram_size=3):
    
    try:
        del test_input_ids, attention_mask, global_attention_mask, predicted_abstract_ids
    except NameError:
        # Nothing to free yet: the tensors have not been created in this call
        pass
        
    gc.collect()

    test_input_ids = test_inputs_base['input_ids'][start:start+batch_size].to("cuda")
    attention_mask = test_inputs_base['attention_mask'][start:start+batch_size].to("cuda")

    global_attention_mask = (test_input_ids == centrum_tokenizer.cls_token_id) | (test_input_ids == docsep_token_id)

    predicted_abstract_ids = centrum_model.generate(test_input_ids,
                                                    attention_mask=attention_mask, 
                                                    global_attention_mask=global_attention_mask, 
                                                    max_length=MAX_LENGTH_DEC,
                                                    no_repeat_ngram_size=no_repeat_ngram_size,
                                                    num_beams=4)

    predicted_abstract = centrum_tokenizer.batch_decode(predicted_abstract_ids, skip_special_tokens=True)
    
    return predicted_abstract
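
The only difference between `generate_abstract_batched` and `generate_abstract_batched2` is which positions receive global attention. A plain-Python sketch of the two strategies on a toy sequence follows. The token ids (0 for the CLS token, 50266 for the separator, the latter taken from the constant set in section 1.1), the toy sequence and the helper names are assumptions for illustration; the notebook itself builds these masks as tensors.

```python
CLS_ID = 0          # assumed id of the CLS token
DOCSEP_ID = 50266   # separator id, as set in section 1.1


def first_token_mask(input_ids):
    """Global attention on the first position only (as in section 2.2)."""
    return [1 if pos == 0 else 0 for pos in range(len(input_ids))]


def cls_and_docsep_mask(input_ids):
    """Global attention on CLS and on every document separator (as in section 2.3)."""
    return [1 if tok in (CLS_ID, DOCSEP_ID) else 0 for tok in input_ids]


# Toy sequence: CLS, two tokens, separator, one token, separator, end token
toy_ids = [0, 713, 16, 50266, 4028, 50266, 2]
print(first_token_mask(toy_ids))
print(cls_and_docsep_mask(toy_ids))
```

With the second mask, each document boundary becomes a position that every other token can attend to, which matches the mask built in `tokenize_dataset_batched`.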

In [117]:
## Generating answers
test_batch_size = 2
no_repeat_ngram_size = 3
answers_fixed = []

for i in range(0, dataset_processed['test'].num_rows, test_batch_size):
    if i%500 == 0:
        print(f"Handling sample {i} now..")
        
    answers_fixed.append(generate_abstract_batched2(start=i,
                                                    batch_size=test_batch_size,
                                                    no_repeat_ngram_size=no_repeat_ngram_size))
    
print(f"Completed, {i+1} data points handled.")

Handling sample 0 now..
Handling sample 500 now..
Handling sample 1000 now..
Handling sample 1500 now..
Handling sample 2000 now..
Handling sample 2500 now..
Handling sample 3000 now..
Handling sample 3500 now..
Handling sample 4000 now..
Handling sample 4500 now..
Handling sample 5000 now..
Completed, 5093 data points handled.


In [118]:
formatted_answers_fixed = []
for answer in answers_fixed:
    formatted_answers_fixed += answer

## Pickling results
with open("answers_revised/centrum/Centrum_finetuned_norepeat3_run2.pkl", "wb") as f:
    pickle.dump(formatted_answers_fixed, f)

In [119]:
## Calculating the rouge score
metric_norepeat3_fixed = rouge.compute(predictions=formatted_answers_fixed,
                                       references=[ref for ref in dataset_processed['test']['related_work']],
                                       use_stemmer = True)

copying_metric_norepeat3_fixed = rouge.compute(predictions=formatted_answers_fixed,
                                               references=[ref for ref in dataset_processed['test']['main_article']],
                                               use_stemmer = True)

In [120]:
metric_norepeat3_fixed

{'rouge1': AggregateScore(low=Score(precision=0.320784094716328, recall=0.36518329307704006, fmeasure=0.3193008623424025), mid=Score(precision=0.32417024919388254, recall=0.36820641847673274, fmeasure=0.321638950436123), high=Score(precision=0.3272306003875265, recall=0.3713844963120369, fmeasure=0.3239246523238588)),
 'rouge2': AggregateScore(low=Score(precision=0.06405146501804442, recall=0.07365160389211617, fmeasure=0.06393093961746889), mid=Score(precision=0.06546831952443201, recall=0.07521408116224615, fmeasure=0.06528984301029793), high=Score(precision=0.0669052603366083, recall=0.07688665569078304, fmeasure=0.06662614204458325)),
 'rougeL': AggregateScore(low=Score(precision=0.1720198279171965, recall=0.19593044026402306, fmeasure=0.17042503868407696), mid=Score(precision=0.1741259549871223, recall=0.19787217588740136, fmeasure=0.17170238861300774), high=Score(precision=0.17621028476498926, recall=0.19978810126793864, fmeasure=0.17296350020707307)),
 'rougeLsum': AggregateScor

In [121]:
copying_metric_norepeat3_fixed

{'rouge1': AggregateScore(low=Score(precision=0.4672562294870131, recall=0.33557713190865324, fmeasure=0.36630590415028774), mid=Score(precision=0.47178908758737786, recall=0.3399449933560247, fmeasure=0.37015139193277635), high=Score(precision=0.47663374034493977, recall=0.3444413694653993, fmeasure=0.3739836903034003)),
 'rouge2': AggregateScore(low=Score(precision=0.18672037197812436, recall=0.13632903372419836, fmeasure=0.14819614859686972), mid=Score(precision=0.1921146209083192, recall=0.1402435867374232, fmeasure=0.15221777075488108), high=Score(precision=0.19732511320700097, recall=0.14381982468723026, fmeasure=0.156170964466887)),
 'rougeL': AggregateScore(low=Score(precision=0.28938901322709043, recall=0.2044642754447174, fmeasure=0.22400905836985419), mid=Score(precision=0.29357470460097856, recall=0.20774352178940067, fmeasure=0.22713156295583278), high=Score(precision=0.29848490244113324, recall=0.21098499955877997, fmeasure=0.2306672304074923)),
 'rougeLsum': AggregateSco

In [124]:
dataset_processed['test']['related_work'][0]

'Within the MAS community, some work @cite has focused on how artificial AI-based learning agents would fare in communities of similar agents. For example, @cite and @cite show how agents can learn the capabilities of others via repeated interactions, but these agents do not learn to predict what actions other might take. Most of the work in MAS also fails to recognize the possible gains from using explicit agent models to predict agent actions. @cite is an exception and gives another approach for using nested agent models. However, they do not go so far as to try to quantify the advantages of their nested models or show how these could be learned via observations. We believe that our research will bring to the foreground some of the common observations seen in these research areas and help to clarify the implications and utility of learning and using nested agent models.'

In [122]:
formatted_answers_fixed[0]

"Our work is also related to agent tracking @cite, where agents are able to track other agents' behavior in real-time. However, the agent tracking problem is different from agent tracking in that it does not aim at tracking other agents, but rather at tracking their behavior in the real world. Agent tracking has also been studied in the context of multi-agent systems, where agents have to interact with other individuals or groups of agents to achieve their goals. In this work, we focus on the problem of how an agent can determine when it should act strategically (i.e. learn and use models of other agents), and how specific types of agent populations influence system behavior. In addition, we provide a framework for the incremental implementation of modeling capabilities in agents, and a description of the forms of knowledge required for the agents to learn their behavior and the merits of using and learning agent models. Our results show, among other lessons, how savvy buyers can avoid

In [125]:
dataset_processed['test']['related_work'][2942]

"The paper @cite considered a model in which the algorithm observes noisy versions of the oracle's response and established lower bounds on the complexity of convex optimization problems under first order as well as gradient-only oracles. In @cite , complexity lower bounds were obtained for convex optimization problems with a stochastic zero-order oracle. The paper @cite studied the complexity of convex optimization problems under a zero-order stochastic oracle in which the optimization algorithm submits two queries at each iteration and the oracle responds to both queries. These results were extended to the case in which the algorithm makes queries about multiple points at each iteration in @cite . In @cite , the complexity of convex optimization problems was studied under an erroneous oracle model wherein the oracle's responses to queries are subject to absolute relative errors."

In [123]:
formatted_answers_fixed[2942]

"In @cite, the authors consider the problem of minimizing convex and concave functions with access to an erroneous zeroth-order oracle. The authors consider optimization when one is given access to absolute error oracles that return values in [f(x) - @math, @math ) or relative error oracle that return value in @math. The authors also consider the case in which the players communicate with a set of system nodes over noisy communication channels. In this paper, we consider the class @math of all @math -player non-cooperative games with at least one NE such that the players' utility functions satisfy a certain (differential) constraint. The lower bound on the complexity of solving this class under Gaussian noise models is derived by establishing a connection between the Kullback-Leibler distance and Fisher information. The work of the authors in this paper is related to the work of, who study the convergence rate of derivative free optimization (DFO) with noisy function evaluations. In pa

## 2.4 Generating Results (Using No_repeat_ngram_size = 4; Global ID Fixed)

In [126]:
## Generating answers
test_batch_size = 4
no_repeat_ngram_size = 4
answers2_fixed = []

for i in range(0, dataset_processed['test'].num_rows, test_batch_size):
    if i%500 == 0:
        print(f"Handling sample {i} now..")
        
    answers2_fixed.append(generate_abstract_batched2(start=i,
                                                     batch_size=test_batch_size,
                                                     no_repeat_ngram_size=no_repeat_ngram_size))
    
print(f"Completed, {i+1} data points handled.")

Handling sample 0 now..
Handling sample 500 now..
Handling sample 1000 now..
Handling sample 1500 now..
Handling sample 2000 now..
Handling sample 2500 now..
Handling sample 3000 now..
Handling sample 3500 now..
Handling sample 4000 now..
Handling sample 4500 now..
Handling sample 5000 now..
Completed, 5093 data points handled.


In [127]:
formatted_answers2_fixed = []
for answer in answers2_fixed:
    formatted_answers2_fixed += answer

## Pickling results
with open("answers_revised/centrum/Centrum_finetuned_norepeat4_run2.pkl", "wb") as f:
    pickle.dump(formatted_answers2_fixed, f)

In [128]:
## Calculating the rouge score
metric_norepeat4_fixed = rouge.compute(predictions=formatted_answers2_fixed,
                                       references=[ref for ref in dataset_processed['test']['related_work']],
                                       use_stemmer = True)

copying_metric_norepeat4_fixed = rouge.compute(predictions=formatted_answers2_fixed,
                                               references=[ref for ref in dataset_processed['test']['main_article']],
                                               use_stemmer = True)

In [129]:
metric_norepeat4_fixed

{'rouge1': AggregateScore(low=Score(precision=0.35975180173421456, recall=0.34479356485971835, fmeasure=0.3267262189577247), mid=Score(precision=0.36325724741295884, recall=0.34821202061763984, fmeasure=0.32921339218420587), high=Score(precision=0.36689916912063053, recall=0.35142895490369114, fmeasure=0.3315751849656482)),
 'rouge2': AggregateScore(low=Score(precision=0.07473983027722655, recall=0.07111844735950187, fmeasure=0.06746659138880694), mid=Score(precision=0.0763077963682686, recall=0.07273474904438323, fmeasure=0.06887394785023747), high=Score(precision=0.07802365137195881, recall=0.07450545973912151, fmeasure=0.07033540514982335)),
 'rougeL': AggregateScore(low=Score(precision=0.19567283460523266, recall=0.18677177928391997, fmeasure=0.1764319945690786), mid=Score(precision=0.19785944377485978, recall=0.18881022074538373, fmeasure=0.17798090145329193), high=Score(precision=0.20027583525278567, recall=0.19084499337223415, fmeasure=0.17932125915421393)),
 'rougeLsum': Aggreg

In [130]:
copying_metric_norepeat4_fixed

{'rouge1': AggregateScore(low=Score(precision=0.4630942160405471, recall=0.2766736136786596, fmeasure=0.32641800898041473), mid=Score(precision=0.46783457219624813, recall=0.2802923733814354, fmeasure=0.32978022520020744), high=Score(precision=0.472841747242718, recall=0.2842072880254012, fmeasure=0.3334775978256094)),
 'rouge2': AggregateScore(low=Score(precision=0.1623897294385033, recall=0.0962015396428156, fmeasure=0.11389491303742359), mid=Score(precision=0.1678383499008111, recall=0.09951655700555598, fmeasure=0.11755747270757966), high=Score(precision=0.17305862916509765, recall=0.10288575604778666, fmeasure=0.12143901194558603)),
 'rougeL': AggregateScore(low=Score(precision=0.2895870988269311, recall=0.16981696333155638, fmeasure=0.20109347498978591), mid=Score(precision=0.2944825196649199, recall=0.17303022565544604, fmeasure=0.2045848083616214), high=Score(precision=0.299370752233194, recall=0.17619108669280426, fmeasure=0.20801586488350002)),
 'rougeLsum': AggregateScore(lo

In [131]:
dataset_processed['test']['related_work'][0]

'Within the MAS community, some work @cite has focused on how artificial AI-based learning agents would fare in communities of similar agents. For example, @cite and @cite show how agents can learn the capabilities of others via repeated interactions, but these agents do not learn to predict what actions other might take. Most of the work in MAS also fails to recognize the possible gains from using explicit agent models to predict agent actions. @cite is an exception and gives another approach for using nested agent models. However, they do not go so far as to try to quantify the advantages of their nested models or show how these could be learned via observations. We believe that our research will bring to the foreground some of the common observations seen in these research areas and help to clarify the implications and utility of learning and using nested agent models.'

In [132]:
formatted_answers2_fixed[0]

'Our work is also related to agent tracking @cite @cite, where agents are required to interact with other individuals or groups of agents to achieve their goals. Agent tracking is one of the most important aspects of agent behavior @cite. It involves monitoring the observable actions of other agents and inferring their unobserved actions, plans, goals and behaviors. In this paper, we focus on agent tracking in a multi-agent environment, where an intelligent agent is faced with the challenge of tracking the highly flexible mix of goal-driven and reactive behaviors of other agents, in real-time.'

In [133]:
dataset_processed['test']['related_work'][2942]

"The paper @cite considered a model in which the algorithm observes noisy versions of the oracle's response and established lower bounds on the complexity of convex optimization problems under first order as well as gradient-only oracles. In @cite , complexity lower bounds were obtained for convex optimization problems with a stochastic zero-order oracle. The paper @cite studied the complexity of convex optimization problems under a zero-order stochastic oracle in which the optimization algorithm submits two queries at each iteration and the oracle responds to both queries. These results were extended to the case in which the algorithm makes queries about multiple points at each iteration in @cite . In @cite , the complexity of convex optimization problems was studied under an erroneous oracle model wherein the oracle's responses to queries are subject to absolute relative errors."

In [134]:
formatted_answers2_fixed[2942]

'In @cite, the authors study the problem of minimizing convex and concave functions with access to an erroneous zeroth-order oracle. In @cite the authors consider the problem of optimizing convex functions over polytopes in a distributed manner in which the players communicate with a set of system nodes over noisy communication channels. The authors in @cite study the problem in the context of convex optimization in which the goal is to minimize every function in a given class using as few queries as possible. The authors of @cite and @cite consider the problem in which the objective function is a convex function and the goal of the algorithm is to minimize all functions in the class of convex functions. In the context of non-cooperative games, the authors present a lower bound on the complexity of solving the class @math that depends on the Kolmogorov @math -capacity of the constraint set and the total capacity of the communication channel. The lower bound of solving @math is derived 

# 3. Using Centrum as Second Model of Two-Step Model

## 3.1 Loading the generated results from the first step (finetuned LED)

In [135]:
## Results generated in the "Naive_Two-step_Model.ipynb" notebook
with open("misc_data/XSci_test_2step.pkl", "rb") as f:
    first_step_results = pickle.load(f)

In [136]:
first_step_results

Dataset({
    features: ['related_work', 'abstracts', 'main_article', 'short_abstracts'],
    num_rows: 5093
})

In [142]:
first_step_results['short_abstracts'][0:2943:2942]

["Our work is closely related to the work of @cite, in which agents are trained to behave strategically (i.e. learn and use models of other agents), and when they should act as a simple price-taker. However, our work is different in that we do not use the agent as a price-taker, but rather as an agent that learns and uses models of the other agents. In contrast, in our work, agents are trained in order to learn and use agent models, and we do not require the agent to be a pricetaker, but we do require that agents learn to use agent models in order to act as a pricetaker. In addition, we do not need the agent to learn to behave strategically.|||||The Soar integrated architecture @cite is a variant of the soar integrated architecture, which allows for simultaneous execution of multiple agent models. However, unlike the Soar integrated, it does not provide direct support for flexible and efficient reasoning about other agents' actions, plans, goals and behaviors. In contrast, our architec

In [148]:
## Additional processing because the Centrum doc_sep has spaces (' ') before and after "|||||"
def process_first_step_output_batched(example):
    output = {}
    # These don't need changes
    output["abstracts"] = example["abstracts"]
    output["related_work"] = example["related_work"]
    output["main_article"] = example["main_article"]
    
    # Only this field changes: pad each "|||||" with spaces to match Centrum's doc_sep
    output['short_abstracts'] = [
        short_abstract.replace("|||||", DOC_SEP)
        for short_abstract in example['short_abstracts']
    ]
    
    return output
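
The plain `str.replace` above assumes the first-step output never already has spaces around `|||||`; if it did, the spaces would be doubled. A minimal sketch of an idempotent alternative, assuming the same `DOC_SEP` value as defined earlier. `normalize_doc_sep` is a hypothetical helper name, not something the notebook defines:

```python
import re

DOC_SEP = " ||||| "  # same value as the notebook's DOC_SEP

# Hypothetical helper: collapse any whitespace around "|||||" into exactly one DOC_SEP
_SEP_PATTERN = re.compile(r"\s*\|{5}\s*")

def normalize_doc_sep(text: str) -> str:
    return _SEP_PATTERN.sub(DOC_SEP, text)

# Mixed input: one separator without spaces, one already padded
raw = "first abstract.|||||second abstract. ||||| third abstract."
normalized = normalize_doc_sep(raw)
print(normalized)
```

Because the pattern absorbs surrounding whitespace, running the helper twice gives the same string as running it once.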

In [149]:
first_step_results_processed = first_step_results.map(
    process_first_step_output_batched,
    batched=True,
    batch_size=1,
)

  0%|          | 0/5093 [00:00<?, ?ba/s]

In [150]:
first_step_results_processed['short_abstracts'][0:2943:2942]

["Our work is closely related to the work of @cite, in which agents are trained to behave strategically (i.e. learn and use models of other agents), and when they should act as a simple price-taker. However, our work is different in that we do not use the agent as a price-taker, but rather as an agent that learns and uses models of the other agents. In contrast, in our work, agents are trained in order to learn and use agent models, and we do not require the agent to be a pricetaker, but we do require that agents learn to use agent models in order to act as a pricetaker. In addition, we do not need the agent to learn to behave strategically. ||||| The Soar integrated architecture @cite is a variant of the soar integrated architecture, which allows for simultaneous execution of multiple agent models. However, unlike the Soar integrated, it does not provide direct support for flexible and efficient reasoning about other agents' actions, plans, goals and behaviors. In contrast, our archit

## 3.2 Feeding the first step results to the second step Centrum model

In [151]:
first_step_inputs = centrum_tokenizer(first_step_results_processed['short_abstracts'],
                                      padding="max_length",
                                      max_length=MAX_LENGTH_ENC,
                                      return_tensors="pt",
                                      truncation=True)

In [152]:
def generate_abstract_batched3(batch_size=2, start=0, no_repeat_ngram_size=3):
    
    ## Release unreferenced objects from earlier batches before allocating new tensors
    gc.collect()

    test_input_ids = first_step_inputs['input_ids'][start:start+batch_size].to(device)
    attention_mask = first_step_inputs['attention_mask'][start:start+batch_size].to(device)

    ## Global attention on the <s> token and on every doc-sep token
    global_attention_mask = (test_input_ids == centrum_tokenizer.cls_token_id) | (test_input_ids == docsep_token_id)

    predicted_abstract_ids = centrum_model.generate(test_input_ids,
                                                    attention_mask=attention_mask, 
                                                    global_attention_mask=global_attention_mask, 
                                                    max_length=MAX_LENGTH_DEC,
                                                    no_repeat_ngram_size=no_repeat_ngram_size,
                                                    num_beams=4)

    predicted_abstract = centrum_tokenizer.batch_decode(predicted_abstract_ids, skip_special_tokens=True)
    
    return predicted_abstract
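
The global attention mask built in the function above flags the start-of-sequence token and every document-separator token. A framework-free sketch of that rule on plain lists follows. The toy token ids are made up for illustration; `50266` is the `docsep_token_id` set earlier, and `CLS_ID = 0` is an assumption about the tokenizer that should be checked against `centrum_tokenizer.cls_token_id`:

```python
DOCSEP_ID = 50266  # docsep_token_id from the notebook
CLS_ID = 0         # assumption: id of the <s> token for this tokenizer

def build_global_attention_mask(batch_input_ids):
    """Return 1 where a token should receive global attention, else 0."""
    return [
        [1 if tok in (CLS_ID, DOCSEP_ID) else 0 for tok in row]
        for row in batch_input_ids
    ]

# Toy batch: two padded sequences with one and two doc-sep tokens
toy_batch = [
    [0, 11, 12, 50266, 13, 2, 1, 1],
    [0, 21, 50266, 22, 50266, 23, 2, 1],
]
mask = build_global_attention_mask(toy_batch)
for row in mask:
    print(row)
```

Every other position keeps local (windowed) attention only, which is why the separator spacing fixed in section 3.1 matters: a separator that does not tokenize to `docsep_token_id` would not be flagged.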

In [153]:
## Generating answers
test_batch_size = 4
no_repeat_ngram_size = 4
answers_second_step = []

for i in range(0, dataset_processed['test'].num_rows, test_batch_size):
    if i%500 == 0:
        print(f"Handling sample {i} now..")
        
    answers_second_step.append(generate_abstract_batched3(start=i,
                                                          batch_size=test_batch_size,
                                                          no_repeat_ngram_size=no_repeat_ngram_size))
    
print(f"Completed, {sum(len(batch) for batch in answers_second_step)} data points handled.")

Handling sample 0 now..
Handling sample 500 now..
Handling sample 1000 now..
Handling sample 1500 now..
Handling sample 2000 now..
Handling sample 2500 now..
Handling sample 3000 now..
Handling sample 3500 now..
Handling sample 4000 now..
Handling sample 4500 now..
Handling sample 5000 now..
Completed, 5093 data points handled.
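
The loop above walks the test set in strides of `test_batch_size`, so the final batch can be smaller than the others. A small sketch of the index arithmetic, using the test-set size and batch size from this run; `iter_batches` is a hypothetical helper, not part of the notebook:

```python
import math

def iter_batches(n_items, batch_size):
    """Yield (start, stop) index pairs that together cover range(n_items)."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

n_rows, batch_size = 5093, 4  # test-set size and batch size used above
batches = list(iter_batches(n_rows, batch_size))

# Count handled samples from the batches themselves rather than from the loop index
n_handled = sum(stop - start for start, stop in batches)
print(len(batches), batches[-1], n_handled)
```

Counting from the batch sizes stays correct for any batch size, whereas the last loop index plus one matches the sample count only when the final batch holds exactly one item.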


In [154]:
## Flattening the list of batches into one list of summaries
formatted_answers_second_step = [
    answer for batch in answers_second_step for answer in batch
]

## Pickling results
with open("misc_data/XSci_test_2step_CENTRUM.pkl", "wb") as f:
    pickle.dump(formatted_answers_second_step, f)

In [156]:
## Calculating the ROUGE scores against the gold related-work sections
metric_second_step = rouge.compute(predictions=formatted_answers_second_step,
                                   references=list(dataset_processed['test']['related_work']),
                                   use_stemmer=True)

## Copying measure: ROUGE of the same predictions against the main article's abstract
copying_second_step = rouge.compute(predictions=formatted_answers_second_step,
                                    references=list(dataset_processed['test']['main_article']),
                                    use_stemmer=True)
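
The second call scores the generated summaries against `main_article` rather than the gold related work, so its ROUGE-L precision can be read as the share of summary tokens that follow the first abstract in order. A simplified sketch of ROUGE-L via the longest common subsequence, assuming lowercased whitespace tokens and no stemming. This is not the `rouge` metric's implementation, which tokenizes differently and applies a stemmer here:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(prediction, reference):
    """Return (precision, recall, fmeasure) of a simplified ROUGE-L."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    precision = lcs / len(pred) if pred else 0.0
    recall = lcs / len(ref) if ref else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure

# Toy example: every summary token appears, in order, in the source
summary = "agents learn models of other agents"
source = "agents learn and use models of other agents in markets"
precision, recall, fmeasure = rouge_l(summary, source)
print(precision, recall, fmeasure)
```

A precision close to 1 against the first abstract would indicate a summary that is almost entirely lifted from it.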

In [157]:
metric_second_step

{'rouge1': AggregateScore(low=Score(precision=0.3108908383484029, recall=0.3539702264895049, fmeasure=0.31196283472387626), mid=Score(precision=0.31407146830053895, recall=0.3566337106608922, fmeasure=0.3141585091294947), high=Score(precision=0.317159131391955, recall=0.3592039419974559, fmeasure=0.3162616561043695)),
 'rouge2': AggregateScore(low=Score(precision=0.05742338580510461, recall=0.06521734368981709, fmeasure=0.05742004134612701), mid=Score(precision=0.05880825531056981, recall=0.06668000059827006, fmeasure=0.058664311529910466), high=Score(precision=0.060077683531292295, recall=0.06816161972427576, fmeasure=0.05991960858637221)),
 'rougeL': AggregateScore(low=Score(precision=0.16799056201364704, recall=0.1938778199220727, fmeasure=0.16904550795517964), mid=Score(precision=0.16974126038069148, recall=0.19560857677949478, fmeasure=0.17025284654403422), high=Score(precision=0.1714614829064525, recall=0.1974343294220886, fmeasure=0.17130246780620778)),
 'rougeLsum': AggregateSc

In [158]:
copying_second_step

{'rouge1': AggregateScore(low=Score(precision=0.4300065072497933, recall=0.3030435305501919, fmeasure=0.33782384013412775), mid=Score(precision=0.43398183486202024, recall=0.3062554707654349, fmeasure=0.3405969523174616), high=Score(precision=0.4380928565419001, recall=0.3095118361127674, fmeasure=0.34348501295171086)),
 'rouge2': AggregateScore(low=Score(precision=0.13285644629300325, recall=0.09223971606990994, fmeasure=0.10280562027804581), mid=Score(precision=0.13726132188746637, recall=0.09526409311601927, fmeasure=0.10612481331187881), high=Score(precision=0.14152314193942114, recall=0.09783980112765547, fmeasure=0.10891609004450528)),
 'rougeL': AggregateScore(low=Score(precision=0.24376174693362435, recall=0.16942594280862433, fmeasure=0.18929911577898295), mid=Score(precision=0.246975249385711, recall=0.17163252930586032, fmeasure=0.19135157740747943), high=Score(precision=0.2500888790716374, recall=0.17379896114862295, fmeasure=0.19328067299090773)),
 'rougeLsum': AggregateSc
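
The two printed results are easier to compare once the `mid` estimates are pulled out. A minimal sketch using stand-in namedtuples with the same field names as the objects printed above; the `rougeL` numbers are copied from those outputs, and `mid_scores` is a hypothetical helper:

```python
from collections import namedtuple

# Stand-ins mirroring the shape of the printed result objects
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])
AggregateScore = namedtuple("AggregateScore", ["low", "mid", "high"])

def mid_scores(result):
    """Map each ROUGE type to its mid (precision, recall, fmeasure) in percent."""
    return {
        name: tuple(round(100 * value, 2) for value in agg.mid)
        for name, agg in result.items()
    }

# rougeL entries copied from metric_second_step and copying_second_step above
metric_example = {"rougeL": AggregateScore(
    low=Score(0.16799056201364704, 0.1938778199220727, 0.16904550795517964),
    mid=Score(0.16974126038069148, 0.19560857677949478, 0.17025284654403422),
    high=Score(0.1714614829064525, 0.1974343294220886, 0.17130246780620778),
)}
copying_example = {"rougeL": AggregateScore(
    low=Score(0.24376174693362435, 0.16942594280862433, 0.18929911577898295),
    mid=Score(0.246975249385711, 0.17163252930586032, 0.19135157740747943),
    high=Score(0.2500888790716374, 0.17379896114862295, 0.19328067299090773),
)}

metric_mid = mid_scores(metric_example)
copying_mid = mid_scores(copying_example)
print("vs related_work:", metric_mid)
print("vs main_article:", copying_mid)
```

On these numbers the ROUGE-L precision is higher against `main_article` than against the gold related work, which is the copying signal this notebook sets out to measure.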

In [159]:
dataset_processed['test']['related_work'][0]

'Within the MAS community, some work @cite has focused on how artificial AI-based learning agents would fare in communities of similar agents. For example, @cite and @cite show how agents can learn the capabilities of others via repeated interactions, but these agents do not learn to predict what actions other might take. Most of the work in MAS also fails to recognize the possible gains from using explicit agent models to predict agent actions. @cite is an exception and gives another approach for using nested agent models. However, they do not go so far as to try to quantify the advantages of their nested models or show how these could be learned via observations. We believe that our research will bring to the foreground some of the common observations seen in these research areas and help to clarify the implications and utility of learning and using nested agent models.'

In [160]:
formatted_answers_second_step[0]

'The formal definition of intelligence is closely related to the formal conception of intelligence @cite @cite. In the formal conception, the goal is to create a system that is robust enough to allow the cumulative development of robust systems and general results. In the informal conception, the system is designed to be robust enough to be applied to a wide range of applications. In contrast, our work is different in that we do not use the agent as a price-taker, but rather as an agent that learns and uses models of the other agents.'

In [161]:
dataset_processed['test']['related_work'][2942]

"The paper @cite considered a model in which the algorithm observes noisy versions of the oracle's response and established lower bounds on the complexity of convex optimization problems under first order as well as gradient-only oracles. In @cite , complexity lower bounds were obtained for convex optimization problems with a stochastic zero-order oracle. The paper @cite studied the complexity of convex optimization problems under a zero-order stochastic oracle in which the optimization algorithm submits two queries at each iteration and the oracle responds to both queries. These results were extended to the case in which the algorithm makes queries about multiple points at each iteration in @cite . In @cite , the complexity of convex optimization problems was studied under an erroneous oracle model wherein the oracle's responses to queries are subject to absolute relative errors."

In [162]:
formatted_answers_second_step[2942]

"In @cite, the authors derive lower bounds on the complexity of solving games in a distributed manner in which the players communicate with a set of system nodes over noisy communication channels. The authors in @cite also derive lower bounds for solving games in the non-Gaussian case. However, they do not consider the non-cooperative case. In @cite @cite the authors derive the lower bounds on solving games with at least one NE such that the players' utility functions satisfy a certain (differential) constraint. The authors of @cite derive lower bounds of solving games with @math and @math in the Gaussian case."

## Sandbox

In [188]:
gc.collect()

0