In [1]:
from transformers import (
    AutoTokenizer,
    LEDForConditionalGeneration,
    LEDConfig
)
from datasets import load_dataset, load_metric
import torch
from collections import OrderedDict



First, we load the **Multi-News** dataset from the Hugging Face Hub.

In [2]:
dataset=load_dataset('multi_news', trust_remote_code=True)

In [6]:
data = dataset['train'][0]
print(data['document'])
print(data['summary'])

National Archives 
 
 Yes, it’s that time again, folks. It’s the first Friday of the month, when for one ever-so-brief moment the interests of Wall Street, Washington and Main Street are all aligned on one thing: Jobs. 
 
 A fresh update on the U.S. employment situation for January hits the wires at 8:30 a.m. New York time offering one of the most important snapshots on how the economy fared during the previous month. Expectations are for 203,000 new jobs to be created, according to economists polled by Dow Jones Newswires, compared to 227,000 jobs added in February. The unemployment rate is expected to hold steady at 8.3%. 
 
 Here at MarketBeat HQ, we’ll be offering color commentary before and after the data crosses the wires. Feel free to weigh-in yourself, via the comments section. And while you’re here, why don’t you sign up to follow us on Twitter. 
 
 Enjoy the show. ||||| Employers pulled back sharply on hiring last month, a reminder that the U.S. economy may not be growing fas

Next, we load the fine-tuned PRIMERA model. Please download [it](https://storage.googleapis.com/primer_summ/PRIMER_multinews.tar.gz) to your local computer first and point `PRIMER_path` at the extracted folder.

In [16]:
PRIMER_path = 'F:\\resources\\PRIMER_multinews'
TOKENIZER = AutoTokenizer.from_pretrained(PRIMER_path)

# The checkpoint stores its weights under 'model...' keys, while
# LEDForConditionalGeneration expects 'led...' keys, so rename them.
states = torch.load(PRIMER_path + '\\pytorch_model.bin')
new_states = OrderedDict()
for k in states:
    new_k = k.replace('model', 'led')
    new_states[new_k] = states[k]
# Drop the first two rows of each position-embedding matrix.
new_states['led.encoder.embed_positions.weight'] = states['model.encoder.embed_positions.weight'][2:]
new_states['led.decoder.embed_positions.weight'] = states['model.decoder.embed_positions.weight'][2:]
# Reuse the shared token embeddings as the output projection.
new_states['lm_head.weight'] = states['model.shared.weight']
config = LEDConfig.from_pretrained(PRIMER_path)

MODEL = LEDForConditionalGeneration(config).cuda()
MODEL.load_state_dict(new_states)
MODEL.gradient_checkpointing_enable()
PAD_TOKEN_ID = TOKENIZER.pad_token_id
DOCSEP_TOKEN_ID = TOKENIZER.convert_tokens_to_ids("<doc-sep>")
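
The key remapping in the cell above can be illustrated on a toy state dict. This is only a minimal sketch: plain Python lists stand in for tensors, and the helper name `remap_keys` and the toy keys and values are hypothetical. It mirrors the three steps of the cell (renaming the prefix, dropping the first two position rows, reusing the shared embedding as `lm_head.weight`). Unlike the cell, it renames only a leading `model.` prefix, on the assumption that this is the intent.

```python
from collections import OrderedDict


def remap_keys(states):
    """Toy version of the checkpoint remapping (lists stand in for tensors)."""
    new_states = OrderedDict()
    for key, value in states.items():
        # Rename only a leading 'model.' prefix to 'led.'.
        if key.startswith("model."):
            key = "led." + key[len("model."):]
        new_states[key] = value
    # Drop the first two rows of the position embeddings.
    for side in ("encoder", "decoder"):
        name = f"led.{side}.embed_positions.weight"
        new_states[name] = new_states[name][2:]
    # Reuse the shared token embeddings as the output projection.
    new_states["lm_head.weight"] = states["model.shared.weight"]
    return new_states


# Hypothetical toy checkpoint with one-column "matrices".
toy = OrderedDict([
    ("model.shared.weight", [[0.1], [0.2]]),
    ("model.encoder.embed_positions.weight", [[0], [1], [2], [3]]),
    ("model.decoder.embed_positions.weight", [[0], [1], [2]]),
])
remapped = remap_keys(toy)
print(list(remapped))
```

With real tensors the same slicing (`[2:]`) applies along the first dimension.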

We then define the functions to pre-process the data, as well as the function to generate summaries.

In [18]:
def process_document(documents):
    input_ids_all=[]
    for data in documents:
        all_docs = data.split("|||||")[:-1]
        for i, doc in enumerate(all_docs):
            doc = doc.replace("\n", " ")
            doc = " ".join(doc.split())
            all_docs[i] = doc

        #### concat with global attention on doc-sep
        input_ids = []
        for doc in all_docs:
            input_ids.extend(
                TOKENIZER.encode(
                    doc,
                    truncation=True,
                    max_length=4096 // len(all_docs),
                )[1:-1]
            )
            input_ids.append(DOCSEP_TOKEN_ID)
        input_ids = (
            [TOKENIZER.bos_token_id]
            + input_ids
            + [TOKENIZER.eos_token_id]
        )
        input_ids_all.append(torch.tensor(input_ids))
    input_ids = torch.nn.utils.rnn.pad_sequence(
        input_ids_all, batch_first=True, padding_value=PAD_TOKEN_ID
    )
    return input_ids.cuda()


def batch_process(batch):
    input_ids=process_document(batch['document'])
    # Build the global attention mask: 1 on the leading <s> token
    # and on every <doc-sep> token, 0 everywhere else.
    global_attention_mask = torch.zeros_like(input_ids)
    global_attention_mask[:, 0] = 1
    global_attention_mask[input_ids == DOCSEP_TOKEN_ID] = 1
    generated_ids = MODEL.generate(
        input_ids=input_ids,
        global_attention_mask=global_attention_mask,
        use_cache=True,
        max_length=1024,
        num_beams=5,
    )
    generated_str = TOKENIZER.batch_decode(
            generated_ids.tolist(), skip_special_tokens=True
        )
    result={}
    result['generated_summaries'] = generated_str
    result['gt_summaries']=batch['summary']
    return result
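
To make the pre-processing easier to follow, here is a minimal, dependency-free sketch of how one document cluster becomes an input sequence and a global attention mask. Everything named here is hypothetical: `toy_encode` is a stand-in tokenizer, the ids `BOS`, `EOS` and `DOCSEP` are made up, and `build_input` simplifies the real function (it truncates plain token lists rather than calling the tokenizer with `truncation=True` and stripping the special tokens).

```python
BOS, EOS, DOCSEP = 0, 2, 99  # hypothetical special-token ids


def toy_encode(text):
    """Hypothetical tokenizer: one fake id (100 + word length) per word."""
    return [100 + len(word) for word in text.split()]


def build_input(cluster, max_total=4096):
    """Concatenate the documents of one cluster, separated by DOCSEP."""
    # As in the notebook, the text after the last '|||||' is dropped.
    docs = [" ".join(d.split()) for d in cluster.split("|||||")[:-1]]
    budget = max_total // len(docs)  # equal token budget per document
    ids = []
    for doc in docs:
        ids.extend(toy_encode(doc)[:budget])
        ids.append(DOCSEP)
    ids = [BOS] + ids + [EOS]
    # Global attention on the first token and on every separator.
    mask = [1 if i == 0 or tok == DOCSEP else 0 for i, tok in enumerate(ids)]
    return ids, mask


ids, mask = build_input("a b ||||| c d e |||||", max_total=4)
print(ids)
print(mask)
```

Splitting the budget evenly means that a cluster with many documents keeps only a short prefix of each one.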

Next, we run the model on 10 randomly chosen test examples (or any number of examples you want).

In [19]:
import random
data_idx = random.choices(range(len(dataset['test'])),k=10)
dataset_small = dataset['test'].select(data_idx)
result_small = dataset_small.map(batch_process, batched=True, batch_size=2)

Map:   0%|          | 0/10 [00:00<?, ? examples/s]Input ids are automatically padded from 825 to 1024 to be a multiple of `config.attention_window`: 512
Map:  20%|██        | 2/10 [01:16<05:07, 38.46s/ examples]Input ids are automatically padded from 3163 to 3584 to be a multiple of `config.attention_window`: 512
Map:  40%|████      | 4/10 [17:58<31:02, 310.40s/ examples]Input ids are automatically padded from 4093 to 4096 to be a multiple of `config.attention_window`: 512
Map:  60%|██████    | 6/10 [48:00<38:40, 580.13s/ examples]Input ids are automatically padded from 2812 to 3072 to be a multiple of `config.attention_window`: 512
Map:  80%|████████  | 8/10 [50:31<12:41, 380.96s/ examples]Input ids are automatically padded from 998 to 1024 to be a multiple of `config.attention_window`: 512
Map: 100%|██████████| 10/10 [51:37<00:00, 309.77s/ examples]
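
The padding messages in the log above follow from rounding each input length up to the next multiple of the attention window. A minimal sketch of that arithmetic (the function name `padded_length` is hypothetical and not part of the library):

```python
def padded_length(seq_len, attention_window=512):
    """Round seq_len up to the next multiple of attention_window."""
    return -(-seq_len // attention_window) * attention_window


# Input lengths taken from the log above.
for n in (825, 3163, 4093, 2812, 998):
    print(n, "->", padded_length(n))
```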


After generating the summaries, we load the evaluation metric.


(Note: the original code does not use the default aggregator; it simply averages the scores over all examples.
For simplicity, this notebook uses the aggregator's 'mid' value instead.)
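
The difference between the two aggregation styles can be sketched as follows. The per-example scores are made-up numbers and both function names are hypothetical. The bootstrap below is a simplified stand-in, assuming that 'mid' is the median of the means of resampled score sets; it is not the metric's actual implementation.

```python
import random
from statistics import mean, median


def simple_average(scores):
    """Plain mean over per-example scores."""
    return mean(scores)


def bootstrap_mid(scores, n_samples=1000, seed=0):
    """Median of the means of bootstrap resamples (simplified sketch)."""
    rng = random.Random(seed)
    means = [
        mean(rng.choices(scores, k=len(scores)))  # resample with replacement
        for _ in range(n_samples)
    ]
    return median(means)


# Hypothetical per-example F-scores.
f_scores = [0.40, 0.50, 0.60, 0.45, 0.55]
print(simple_average(f_scores))
print(bootstrap_mid(f_scores))
```

On a reasonably large evaluation set the two values are usually close, but they are not guaranteed to be identical.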

In [19]:
rouge = load_metric("rouge")

In [20]:
result_small['generated_summaries']

['– Boston’s New England Holocaust Memorial was vandalized for the second time in two months last night, when a teen shattered a glass panel etched with the numbers that Nazis tattooed on concentration camp victims, the AP reports. Police were called about 6:40pm to the downtown landmark and said witnesses helped them identify a 17-year-old suspect. The suspect’s name was not released because he is a juvenile. He is due to be arraigned today in Boston Municipal Court. Police were investigating the motive.',
 '– The electric chair has been sitting in storage since 1966, when the last inmate was executed in it for the murder of his cellmate. But that may be about to change. The city of McAlester, home to the state\'s death chamber, says it owns the chair, which was transferred to the state corrections department a few years ago, the Guardian reports. The city\'s mayor says the chair should be put on display to the public. "I would like to get it displayed somewhere since it is a historic

In [33]:
score=rouge.compute(predictions=result_small["generated_summaries"], references=result_small["gt_summaries"])
print(score['rouge1'].mid)
print(score['rouge2'].mid)
print(score['rougeL'].mid)

Score(precision=0.509437078378281, recall=0.43832461548851936, fmeasure=0.4644188580686355)
Score(precision=0.17689604682544763, recall=0.14564519595131636, fmeasure=0.1581222605371442)
Score(precision=0.2362355904256852, recall=0.19669444890277293, fmeasure=0.21194685290367665)
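
To see what the `precision`, `recall` and `fmeasure` fields mean, here is a simplified ROUGE-1 computed by hand. It is only a sketch: it lowercases and splits on whitespace, with no stemming or other normalization, so its numbers will not match the metric used above. The function name `rouge1` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F1 (simplified ROUGE-1)."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as in both texts.
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = rouge1("the cat sat", "the cat sat on the mat")
print(p, r, f)
```

Here every predicted word appears in the reference (precision 1.0), but only half of the reference words are covered (recall 0.5).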



In [30]:
random.choices(range(5000),k=5)

[4496, 1390, 2088, 2130, 1604]
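
Note that `random.choices` samples with replacement, so the same index can be drawn more than once. If distinct examples are wanted, `random.sample` is one possible alternative. A minimal sketch, with a small range and a fixed seed chosen purely for illustration:

```python
import random

rng = random.Random(0)  # hypothetical seed, only for reproducibility

# With replacement: duplicates are possible.
with_replacement = rng.choices(range(5), k=5)

# Without replacement: every index appears at most once.
without_replacement = rng.sample(range(5), k=5)

print(with_replacement)
print(without_replacement)
```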

– Facebook removed a photo of two men kissing in protest of a London pub’s decision to eject a same-sex couple for kissing, reports the America Blog. “Shares that contain nudity, or any kind of graphic or sexually suggestive content, are not permitted on Facebook,” the administrators of the Dangerous Minds Facebook page said in an email. The decision to remove the photo has prompted scores of people to post their own pictures of same-sex couples kissing in protest— dozens in the last few hours alone.

– Facebook has removed a photo from a protest page for a gay pub that booted a same-sex couple for kissing, USA Today reports. The Dangerous Minds Facebook page was trying to promote a “gay kiss-in” demonstration in London to protest the pub. The page used a photo of two men kissing to promote the event. But Facebook quickly removed the photo, saying in an email, “Shares that contain nudity, or any kind of graphic or sexually suggestive content, are not permitted on Facebook.” The decision to remove the photo has prompted scores of people to post their own pictures of same-sex couples kissing in protest— dozens in the last few hours alone.