In [22]:
!pip install rouge_score

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [21]:
!pip install evaluate



In [1]:
import re

from datasets import load_dataset
import evaluate
import nltk
import nltk.data
import numpy as np
import torch
import torch.nn.functional as F
from transformers import AdamW, AutoTokenizer, AutoModelForSeq2SeqLM

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/luka/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Setup

In [3]:
rouge_score = evaluate.load("rouge")

## Load multi_news data

In [4]:
dataset = load_dataset('multi_news')

Found cached dataset multi_news (/Users/luka/.cache/huggingface/datasets/multi_news/default/1.0.0/2f1f69a2bedc8ad1c5d8ae5148e4755ee7095f465c1c01ae8f85454342065a72)


  0%|          | 0/3 [00:00<?, ?it/s]

In [5]:
dataset.column_names

{'train': ['document', 'summary'],
 'validation': ['document', 'summary'],
 'test': ['document', 'summary']}

In [6]:
n = 10
input_documents = dataset["train"]["document"][:n]
input_summaries = dataset["train"]["summary"][:n]

In [14]:
input_documents[:2]

['National Archives \n \n Yes, it’s that time again, folks. It’s the first Friday of the month, when for one ever-so-brief moment the interests of Wall Street, Washington and Main Street are all aligned on one thing: Jobs. \n \n A fresh update on the U.S. employment situation for January hits the wires at 8:30 a.m. New York time offering one of the most important snapshots on how the economy fared during the previous month. Expectations are for 203,000 new jobs to be created, according to economists polled by Dow Jones Newswires, compared to 227,000 jobs added in February. The unemployment rate is expected to hold steady at 8.3%. \n \n Here at MarketBeat HQ, we’ll be offering color commentary before and after the data crosses the wires. Feel free to weigh-in yourself, via the comments section. And while you’re here, why don’t you sign up to follow us on Twitter. \n \n Enjoy the show. ||||| Employers pulled back sharply on hiring last month, a reminder that the U.S. economy may not be g

In [15]:
input_summaries[:2]

['– The unemployment rate dropped to 8.2% last month, but the economy only added 120,000 jobs, when 203,000 new jobs had been predicted, according to today\'s jobs report. Reaction on the Wall Street Journal\'s MarketBeat Blog was swift: "Woah!!! Bad number." The unemployment rate, however, is better news; it had been expected to hold steady at 8.3%. But the AP notes that the dip is mostly due to more Americans giving up on seeking employment.',
 '– Shelly Sterling plans "eventually" to divorce her estranged husband Donald, she tells Barbara Walters at ABC News. As for her stake in the Los Angeles Clippers, she plans to keep it, the AP notes. Sterling says she would "absolutely" fight any NBA decision to force her to sell the team. The team is her "legacy" to her family, she says. "To be honest with you, I\'m wondering if a wife of one of the owners … said those racial slurs, would they oust the husband? Or would they leave the husband in?"']

# Baseline

First 3 sentences from each news article

Room for improvement
* How to properly get the first 3 sentences? Splitting on `". "` also breaks on abbreviations such as "U.S."
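One possible direction for the open question above, as a rough sketch rather than a settled answer: normalise whitespace, split each document on a sentence-boundary regex, and join the leading sentences with single spaces. The names `lead_k` and `_SENT_BOUNDARY` are hypothetical, and the regex is only a stand-in. It still splits after abbreviations like "U.S." when a capitalised word follows, so a trained splitter such as NLTK Punkt is likely the more robust choice.

```python
import re

DOC_SEP = "|||||"  # document separator used inside each multi_news cluster

# Hypothetical, rough boundary rule: split on whitespace that follows ., ! or ?
# and precedes an uppercase letter or a quote. Abbreviations are NOT handled.
_SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'])")


def lead_k(cluster: str, k: int = 3) -> str:
    """Join the first k sentences of every document in a cluster."""
    parts = []
    for doc in cluster.split(DOC_SEP):
        doc = " ".join(doc.split())  # collapse newlines and repeated spaces
        if not doc:
            continue
        sentences = _SENT_BOUNDARY.split(doc)
        parts.append(" ".join(sentences[:k]))
    return " ".join(parts)


example = "One. Two! Three? Four. ||||| Alpha. Beta."
print(lead_k(example, k=2))  # One. Two! Alpha. Beta.
```

Compared with joining on `"."`, this avoids doubled or missing spaces at document boundaries and drops empty documents.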

In [7]:
preds_baseline = [
    ".".join([". ".join(each.split(". ")[:3]) for each in sequence.split("|||||")]) + "."
    for sequence in input_documents
]

In [8]:
preds_baseline[:2]

['National Archives \n \n Yes, it’s that time again, folks. It’s the first Friday of the month, when for one ever-so-brief moment the interests of Wall Street, Washington and Main Street are all aligned on one thing: Jobs. \n \n A fresh update on the U.S. Employers pulled back sharply on hiring last month, a reminder that the U.S. economy may not be growing fast enough to sustain robust job growth. The unemployment rate dipped, but mostly because more Americans stopped looking for work.',
 'LOS ANGELES (AP) — In her first interview since the NBA banned her estranged husband, Shelly Sterling says she will fight to keep her share of the Los Angeles Clippers and plans one day to divorce Donald Sterling. \n \n (Click Prev or Next to continue viewing images.) \n \n ADVERTISEMENT (Click Prev or Next to continue viewing images.) \n \n Los Angeles Clippers co-owner Shelly Sterling, below, watches the Clippers play the Oklahoma City Thunder along with her attorney, Pierce O\'Donnell, in the fir

In [9]:
scores = rouge_score.compute(
    predictions=preds_baseline, references=input_summaries, use_stemmer=True
)
scores

{'rouge1': 0.43601002287852186,
 'rouge2': 0.14265791909311992,
 'rougeL': 0.19956066282373178,
 'rougeLsum': 0.2729012505428157}

# Model 1: Default Centrum

Boooo!!!

Notes
* rouge includes padding in calculation

Room for improvement
* Trim output after the end-of-sequence token
* Weird special characters (Fixed)
* max_length?
* beam search? (Fixed)
* skip repeated n gram (Fixed)

`tokenizer.pad_token_id`?
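A minimal sketch of the "trim after end" idea from the notes above, written on plain lists of token ids so it does not depend on any library. The function name `trim_after_eos` and the ids in the example are assumptions for illustration; in the notebook the real values would come from `tokenizer.eos_token_id` and `tokenizer.pad_token_id`. The search starts at index 1 because some seq2seq models begin decoding with the EOS id.

```python
from typing import List


def trim_after_eos(token_ids: List[int], eos_id: int, pad_id: int) -> List[int]:
    """Cut a generated sequence at the first EOS after position 0 and drop padding."""
    end = len(token_ids)
    # Skip index 0: the decoder start token may itself be the EOS id.
    for pos in range(1, len(token_ids)):
        if token_ids[pos] == eos_id:
            end = pos
            break
    return [tok for tok in token_ids[:end] if tok != pad_id]


# Hypothetical ids: 2 = EOS, 1 = PAD, 0 = BOS.
generated = [2, 0, 10, 11, 2, 1, 1]
print(trim_after_eos(generated, eos_id=2, pad_id=1))  # [2, 0, 10, 11]
```

The remaining leading special tokens would still be removed by `skip_special_tokens=True` at decode time.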

In [10]:
CHECKPOINT = "ratishsp/Centrum"
DOC_SEP_ = "|||||"

In [11]:
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

In [62]:
tokenizer.add_tokens(DOC_SEP_, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
docsep_token_id = tokenizer.convert_tokens_to_ids(DOC_SEP_)

In [14]:
batch = tokenizer(
    input_documents,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

In [43]:
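# Put global attention on the CLS and document-separator tokens.
# Built as a tensor so it matches the other entries in `batch`.
batch["global_attention_mask"] = torch.tensor([
    [
        1 if token in (tokenizer.cls_token_id, docsep_token_id) else 0
        for token in each
    ]
    for each in batch["input_ids"].tolist()
])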

In [48]:
led_output = model.generate(
    **batch,
    no_repeat_ngram_size=3,
    max_length=512,
    num_beams=4,
)

In [70]:
preds_model1 = tokenizer.batch_decode(
    led_output,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

In [71]:
preds_model1 = [each.strip() for each in preds_model1]

In [72]:
preds_model1[:2]

['The unemployment rate fell to 8.2 percent, the lowest since January 2009. The rate dropped because fewer people searched for jobs. The official unemployment tally only includes those seeking work. The economy has added 858,000 jobs since December — the best four months of hiring in two years. But Federal Reserve Chairman Ben Bernanke has cautioned that the current hiring pace is unlikely to continue without more consumer spending. The unemployment rate is expected to hold steady at 8.3 percent. Here at MarketBeat HQ, we’ll be offering color commentary before and after the data crosses the wires. Feel free to weigh-in yourself, via the comments section. And while you’re here, why don’t you sign up to follow us on Twitter.',
 'LOS ANGELES — In her first interview since the NBA banned her estranged husband, Shelly Sterling says she will fight to keep her share of the Los Angeles Clippers and plans one day to divorce Donald Sterling. “I will fight that decision,” she told ABC News’ Barba

In [73]:
scores = rouge_score.compute(
    predictions=preds_model1, references=input_summaries, use_stemmer=True
)
scores

{'rouge1': 0.4221617749790788,
 'rouge2': 0.1418855987672803,
 'rougeL': 0.20409749797205406,
 'rougeLsum': 0.20413145617293332}

# Multi XScience

In [1]:
import re

from datasets import load_dataset
import evaluate
import nltk
import nltk.data
import numpy as np
import torch
import torch.nn.functional as F
from transformers import AdamW, AutoTokenizer, AutoModelForSeq2SeqLM

In [2]:
torch.backends.mps.is_available()

True

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/luka/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
DATASET_NAME = "multi_x_science_sum"
DOC_SEP = " ||||| "
BATCH_SIZE = 64

## Set up evaluation

In [5]:
rouge_score = evaluate.load("rouge")

## Load dataset

In [6]:
dataset = load_dataset(DATASET_NAME)

Found cached dataset multi_x_science_sum (/Users/luka/.cache/huggingface/datasets/multi_x_science_sum/default/1.1.0/2876ec0401f8f5c5acf7f4857dbc8d6229a390ab428321ab848f03f14b7f9729)


  0%|          | 0/3 [00:00<?, ?it/s]

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['aid', 'mid', 'abstract', 'related_work', 'ref_abstract'],
        num_rows: 30369
    })
    test: Dataset({
        features: ['aid', 'mid', 'abstract', 'related_work', 'ref_abstract'],
        num_rows: 5093
    })
    validation: Dataset({
        features: ['aid', 'mid', 'abstract', 'related_work', 'ref_abstract'],
        num_rows: 5066
    })
})

In [8]:
dataset["train"][0]

{'aid': 'math9912167',
 'mid': '1631980677',
 'abstract': 'Author(s): Kuperberg, Greg; Thurston, Dylan P. | Abstract: We give a purely topological definition of the perturbative quantum invariants of links and 3-manifolds associated with Chern-Simons field theory. Our definition is as close as possible to one given by Kontsevich. We will also establish some basic properties of these invariants, in particular that they are universally finite type with respect to algebraically split surgery and with respect to Torelli surgery. Torelli surgery is a mutual generalization of blink surgery of Garoufalidis and Levine and clasper surgery of Habiro.',
 'related_work': 'Two other generalizations that can be considered are invariants of graphs in 3-manifolds, and invariants associated to other flat connections @cite_16 . We will analyze these in future work. Among other things, there should be a general relation between flat bundles and links in 3-manifolds on the one hand and finite covers and b

## Format dataset to our needs

In [9]:
pat = re.compile("@cite_[0-9]+")

In [10]:
def preprocess_dataset(example):
    output = {}
    output["abstracts"] = (
        example["abstract"].split("| Abstract: ")[-1]
        + DOC_SEP
        + DOC_SEP.join([x for x in example["ref_abstract"]["abstract"] if x])
    )
    output["related_work"] = pat.sub("@cite", example["related_work"])
    
    return output

In [11]:
def preprocess_dataset_batched(example):
    output = {}
    output["abstracts"] = []
    output["related_work"] = []
    
    for abstract, ref_abstract in zip(
        example["abstract"], example["ref_abstract"]
    ):
        output["abstracts"].append(
            abstract.split("| Abstract: ")[-1]
            + DOC_SEP
            + DOC_SEP.join([x for x in ref_abstract["abstract"] if x])
        )
    for related_work in example["related_work"]:
        output["related_work"].append(pat.sub("@cite", related_work))
    
    return output

In [12]:
dataset_processed = {}
for split in dataset.keys():
    dataset_processed[split] = dataset[split].map(
        # preprocess_dataset,
        preprocess_dataset_batched,
        remove_columns=dataset[split].column_names,
        batched=True,
        batch_size=BATCH_SIZE,
    )

Map:   0%|          | 0/30369 [00:00<?, ? examples/s]

Map:   0%|          | 0/5093 [00:00<?, ? examples/s]

Map:   0%|          | 0/5066 [00:00<?, ? examples/s]

# Baseline

First 2 sentences from each abstract

Room for improvement
* How to properly get the first sentences? (Here via the NLTK Punkt tokenizer.)

In [13]:
dataset_processed["test"]

Dataset({
    features: ['related_work', 'abstracts'],
    num_rows: 5093
})

In [14]:
punkt_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [15]:
test_pred_baselines = [
    " ".join(
        [
            " ".join(punkt_tokenizer.tokenize(each)[:2]).strip()
            for each in element["abstracts"].split("|||||")
        ]
    ).strip()
    for element in dataset_processed["test"]
]

In [16]:
scores = rouge_score.compute(
    predictions=test_pred_baselines,
    references=dataset_processed["test"]["related_work"],
    use_stemmer=True,
)
scores

{'rouge1': 0.29248682507658086,
 'rouge2': 0.054016188777851686,
 'rougeL': 0.14639129491285108,
 'rougeLsum': 0.1463775641142901}

# Model 1: Default Centrum

* Probably need to figure out the distribution of `dataset_processed["test"]["abstracts"]` so that we can estimate the best `max_length`.
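A small sketch of how that length distribution could be summarised, assuming nearest-rank percentiles are good enough for picking `max_length`. The helper `length_percentiles` and the whitespace counter are hypothetical. In the notebook the counter would more plausibly be something like `lambda t: len(tokenizer(t)["input_ids"])`, applied to `dataset_processed["test"]["abstracts"]`.

```python
import math
from typing import Callable, Dict, Iterable, Sequence


def length_percentiles(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    percentiles: Sequence[int] = (50, 90, 95, 99),
) -> Dict[int, int]:
    """Return nearest-rank percentiles of the token counts of `texts`."""
    lengths = sorted(count_tokens(t) for t in texts)
    if not lengths:
        raise ValueError("texts must not be empty")
    result = {}
    for p in percentiles:
        rank = max(1, math.ceil(p / 100 * len(lengths)))  # 1-based rank
        result[p] = lengths[rank - 1]
    return result


def whitespace_count(text: str) -> int:
    """Stand-in token counter for the example; not the model's tokenizer."""
    return len(text.split())


toy = ["a b c", "a", "a b", "a b c d e"]
print(length_percentiles(toy, whitespace_count, percentiles=(50, 100)))
# {50: 2, 100: 5}
```

If, say, the 95th percentile sits well below 1024, a smaller `max_length` would cut padding and generation time at the cost of truncating only the longest inputs.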

In [17]:
CHECKPOINT = "ratishsp/Centrum"

In [18]:
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

In [19]:
tokenizer.add_tokens(DOC_SEP, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
docsep_token_id = tokenizer.convert_tokens_to_ids(DOC_SEP)

In [20]:
dataset_tokenized = {}

dataset_tokenized["test"] = tokenizer(
    dataset_processed["test"]["abstracts"],
    padding=True,
    truncation=True,
    max_length=1024,
    return_tensors="pt",
)

In [21]:
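# Put global attention on the CLS and document-separator tokens.
# Built as a torch tensor so it matches `input_ids` and `attention_mask`.
dataset_tokenized["test"]["global_attention_mask"] = torch.tensor([
    [
        1 if token in (tokenizer.cls_token_id, docsep_token_id) else 0
        for token in each
    ]
    for each in dataset_tokenized["test"]["input_ids"].tolist()
])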

In [51]:
import time

In [56]:
BATCH_SIZE = 32

In [57]:
start_time = time.time()

led_output_model1 = []
# for i in range(0, len(dataset_tokenized["test"]["input_ids"]), BATCH_SIZE):
for i in range(0, 100, BATCH_SIZE):
    
    input_ids = dataset_tokenized["test"]["input_ids"][i:i+BATCH_SIZE]
    attention_mask = dataset_tokenized["test"]["attention_mask"][i:i+BATCH_SIZE]
    global_attention_mask = dataset_tokenized["test"]["global_attention_mask"][i:i+BATCH_SIZE]

    led_output_model1.append(
        model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            global_attention_mask=global_attention_mask,
            no_repeat_ngram_size=3,
            max_length=128,
            num_beams=4,
        )
    )
    
    # `i` advances in steps of BATCH_SIZE, so this logs once per batch.
    print(f"Generating sample {i}.")
    print(f"Time elapsed: {time.time() - start_time}s")
        
led_output_model1 = torch.cat(led_output_model1)

Generating sample 0.
Time elapsed: 74.9034059047699s


KeyboardInterrupt: 
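One thing to watch in the loop above: each `model.generate` call pads its output only to the longest sequence in that batch, so batches can come back with different widths, and `torch.cat` needs matching widths. Below is a minimal sketch of the padding step on plain Python lists. The function name `pad_batches` and the example ids are assumptions. With tensors, the same idea could be expressed with `torch.nn.functional.pad` before concatenating.

```python
from typing import List


def pad_batches(batches: List[List[List[int]]], pad_id: int) -> List[List[int]]:
    """Right-pad every sequence to the global maximum length and flatten the batches."""
    max_len = max((len(seq) for batch in batches for seq in batch), default=0)
    return [
        seq + [pad_id] * (max_len - len(seq))
        for batch in batches
        for seq in batch
    ]


# Two hypothetical batches with widths 3 and 2; 1 is the pad id.
batches = [
    [[5, 6, 7], [8, 9, 1]],
    [[3, 4], [2, 1]],
]
print(pad_batches(batches, pad_id=1))
# [[5, 6, 7], [8, 9, 1], [3, 4, 1], [2, 1, 1]]
```

An alternative that avoids padding altogether is to decode each batch right after generation and extend a list of strings.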

In [44]:
test_pred_model1 = tokenizer.batch_decode(
    led_output_model1,  # already concatenated into one tensor by the generation loop
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

In [45]:
test_pred_model1 = [each.strip() for each in test_pred_model1]

In [46]:
test_pred_model1[:2]

["The long-term goal of our field is the creation and understanding of intelligence. Productive research in AI, both practical and theoretical, benefits from a notion of intelligence that is precise enough to allow the cumulative development of robust systems and general results. This paper outlines a gradual evolution in our formal conception of Intelligence that brings it closer to our informal conception and simultaneously reduces the gap between theory and practice. The article presents experimental results illustrating the agents' dynamic behavior. I. Introduction, 488. — II. The model with automobiles as an example, 489. — III. Examples and applications, 492. — IV. Counter",
 '“Interaction in virtual reality (VR) environments (e.g. grasping and manipulating virtual objects) is essential to ensure a pleasant and immersive experience. In this work, we propose a visually realistic, flexible and robust grasping system that enables real-time interactions in virtual environments. Resul

In [48]:
scores = rouge_score.compute(
    predictions=test_pred_model1,
    # Score only the samples that were actually generated.
    references=dataset_processed["test"]["related_work"][: len(test_pred_model1)],
    use_stemmer=True,
)
scores

{'rouge1': 0.29331846458195443,
 'rouge2': 0.050375425158183273,
 'rougeL': 0.15445020608642712,
 'rougeLsum': 0.15443973495826438}

In [None]:
%%time

led_output_model1 = model.generate(
    **dataset_tokenized["test"],
    # input_ids=dataset_tokenized["test"]["input_ids"][:n],
    # attention_mask=dataset_tokenized["test"]["attention_mask"][:n],
    # global_attention_mask=dataset_tokenized["test"]["global_attention_mask"][:n],
    no_repeat_ngram_size=3,
    max_length=128,
    num_beams=4,
)

In [None]:
test_pred_model1 = tokenizer.batch_decode(
    led_output_model1,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

In [None]:
test_pred_model1 = [each.strip() for each in test_pred_model1]

In [None]:
test_pred_model1[:2]

In [None]:
scores = rouge_score.compute(
    predictions=test_pred_model1,
    references=dataset_processed["test"]["related_work"],
    use_stemmer=True,
)
scores

# Model 1: Fine-tune Centrum