<a href="https://colab.research.google.com/github/NikitasTsingenopoulos/WikiSum/blob/main/WikiSum.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -U datasets
!pip install -U evaluate
!pip install rouge_score



In [2]:
!pip install bert-score



In [3]:
from transformers import (AutoTokenizer, LEDConfig, LEDForConditionalGeneration)
from datasets import load_dataset, Dataset
import torch
import pandas as pd

In [4]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [5]:
from bert_score import score
from nltk.tokenize import sent_tokenize

In [6]:
from huggingface_hub import hf_hub_download

In [7]:
repo_id = "d0rj/wikisum"
splits = {'train': 'data/train-00000-of-00001-b28959cff7dcaf55.parquet', 'validation': 'data/validation-00000-of-00001-21f2c9acfa77bab4.parquet', 'test': 'data/test-00000-of-00001-52a8a7cd640a9fff.parquet'}

In [8]:
dataset_train = pd.read_parquet(hf_hub_download(repo_id=repo_id, repo_type="dataset", filename=splits["train"]))
dataset_val = pd.read_parquet(hf_hub_download(repo_id=repo_id, repo_type="dataset", filename=splits["validation"]))
dataset_test = pd.read_parquet(hf_hub_download(repo_id=repo_id, repo_type="dataset", filename=splits["test"]))



In [9]:
dataset_train = Dataset.from_pandas(dataset_train)
dataset_val = Dataset.from_pandas(dataset_val)
dataset_test = Dataset.from_pandas(dataset_test)

In [10]:
tokenizer = AutoTokenizer.from_pretrained('allenai/PRIMERA')

config = LEDConfig.from_pretrained('allenai/PRIMERA')

model = LEDForConditionalGeneration.from_pretrained('allenai/PRIMERA')
model.gradient_checkpointing_enable()

PAD_TOKEN_ID = tokenizer.pad_token_id
DOCSEP_TOKEN_ID = tokenizer.convert_tokens_to_ids("<doc-sep>")


In [11]:
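def process_document(documents):
    """Encode each article as <s> tokens <doc-sep> </s> and pad the batch."""
    input_ids_all = []

    for data in documents:
        # Collapse newlines and runs of whitespace into single spaces
        cleaned_text = " ".join(data.split())
        # Each article is treated as a single-document cluster
        input_chunks = [cleaned_text]

        input_ids = []
        for doc in input_chunks:
            # encode() adds <s> and </s>; strip them and append <doc-sep> instead.
            # max_length=4095 leaves room so the final sequence stays within 4096.
            input_ids.extend(
                tokenizer.encode(
                    doc,
                    truncation=True,
                    max_length=4095,
                )[1:-1]
            )
            input_ids.append(DOCSEP_TOKEN_ID)

        input_ids = (
            [tokenizer.bos_token_id]
            + input_ids
            + [tokenizer.eos_token_id]
        )
        input_ids_all.append(torch.tensor(input_ids))

    # Right-pad every sequence to the longest one in the batch
    input_ids = torch.nn.utils.rnn.pad_sequence(
        input_ids_all, batch_first=True, padding_value=PAD_TOKEN_ID
    )
    return input_ids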
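The sequence layout used by `process_document` is `<s> tokens <doc-sep> </s>`, so three positions go to special tokens. The following is a minimal sketch of that length arithmetic only. The ids `BOS`, `EOS`, and `DOCSEP` and the helper `build_input` are made up for illustration; the real ids come from the PRIMERA tokenizer.

```python
# Hypothetical ids for illustration only; the real values come from the tokenizer.
BOS, EOS, DOCSEP = 0, 2, 9
MAX_LEN = 4096  # input length limit used in the notebook


def build_input(content_ids, max_len=MAX_LEN):
    """Lay out one document as <s> content <doc-sep> </s>, truncating content to fit."""
    budget = max_len - 3  # reserve room for <s>, <doc-sep>, and </s>
    return [BOS] + list(content_ids[:budget]) + [DOCSEP, EOS]


short = build_input([11, 12, 13])
long = build_input(range(100, 100 + 5000))
print(short)      # [0, 11, 12, 13, 9, 2]
print(len(long))  # 4096
```

Reserving the special-token positions before truncating keeps the longest articles from exceeding the model's input limit.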

In [12]:
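def bert_apply(generated, reference_summary, top_n=4):
    """Keep the top_n generated sentences ranked by BERTScore F1 against the reference."""
    summary_sents = sent_tokenize(generated)
    if not summary_sents:
        return generated
    # Repeat the reference summary once per candidate sentence
    reference = [reference_summary] * len(summary_sents)

    # Score each sentence against the full reference summary
    P, R, F1 = score(summary_sents, reference, lang="en", verbose=True)

    # Rank by F1 (highest first) and keep the top_n sentences.
    # Note: the output follows score order, not the original sentence order.
    ranked = sorted(zip(summary_sents, F1.tolist()), key=lambda x: x[1], reverse=True)
    concise_summary = " ".join(sent for sent, _ in ranked[:top_n])
    return concise_summary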
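`bert_apply` joins the selected sentences in score order. One possible variant, sketched below, keeps the same top-N selection but restores the original sentence order. To stay runnable without downloading a model, it uses a unigram-overlap F1 (`overlap_f1`, a hypothetical stand-in, not BERTScore), and `select_top_sentences` is likewise an illustrative helper rather than part of the notebook.

```python
import re
from collections import Counter


def overlap_f1(candidate, reference):
    """Unigram-overlap F1; a cheap stand-in for BERTScore F1 in this sketch."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def select_top_sentences(sentences, reference, top_n=4, scorer=overlap_f1):
    """Keep the top_n highest-scoring sentences, in their original order."""
    scored = [(scorer(s, reference), i) for i, s in enumerate(sentences)]
    # Highest score first; ties are broken by earlier position
    best = sorted(scored, key=lambda x: (-x[0], x[1]))[:top_n]
    keep = sorted(i for _, i in best)
    return " ".join(sentences[i] for i in keep)


sentences = [
    "Find your tank.",
    "Clean the filter annually.",
    "The weather was nice.",
    "Pump the tank every few years.",
]
reference = "Clean the filter annually and pump the tank every few years."
selected = select_top_sentences(sentences, reference, top_n=2)
print(selected)  # Clean the filter annually. Pump the tank every few years.
```

Swapping `scorer` for a function that wraps BERTScore would give the same selection logic with the notebook's metric.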

In [13]:
def batch_process(batch):
    input_ids = process_document(batch['article'])
    # Mask out padding so the encoder ignores pad tokens
    attention_mask = (input_ids != PAD_TOKEN_ID).long()
    # Put global attention on the <s> token and on every <doc-sep> token
    global_attention_mask = torch.zeros_like(input_ids)
    global_attention_mask[:, 0] = 1
    global_attention_mask[input_ids == DOCSEP_TOKEN_ID] = 1
    generated_ids = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        global_attention_mask=global_attention_mask,
        use_cache=True,
        max_length=512,
        min_length=50,
        num_beams=5,
        length_penalty=2,
        no_repeat_ngram_size=3,
        early_stopping=True
    )
    generated_str = tokenizer.batch_decode(
        generated_ids.tolist(), skip_special_tokens=True
    )

    # Trim each generated summary to its best-scoring sentences
    generated_str_bert = [
        bert_apply(generated, ref)
        for generated, ref in zip(generated_str, batch['summary'])
    ]

    result = {}
    result['generated_summaries'] = generated_str_bert
    result['gt_summaries'] = batch['summary']
    return result
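`batch_process` gives global attention to the `<s>` token and to every `<doc-sep>` token. The sketch below shows how those positions map to a mask for a small padded batch, alongside an ordinary padding mask. It uses plain lists and made-up token ids (`BOS`, `EOS`, `PAD`, `DOCSEP` are assumptions for illustration) instead of tensors and the real tokenizer ids.

```python
# Hypothetical token ids for illustration; real ids come from the tokenizer.
BOS, EOS, PAD, DOCSEP = 0, 2, 1, 9


def build_masks(batch_ids):
    """Return (attention_mask, global_attention_mask) as nested lists of 0/1."""
    attention, global_attention = [], []
    for row in batch_ids:
        # Ordinary attention: 1 for real tokens, 0 for padding
        attention.append([0 if tok == PAD else 1 for tok in row])
        # Global attention: first position (<s>) and every <doc-sep>
        global_attention.append(
            [1 if pos == 0 or tok == DOCSEP else 0 for pos, tok in enumerate(row)]
        )
    return attention, global_attention


batch = [
    [BOS, 11, 12, 13, DOCSEP, EOS],
    [BOS, 21, DOCSEP, EOS, PAD, PAD],
]
attention, global_attention = build_masks(batch)
print(attention)         # [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0]]
print(global_attention)  # [[1, 0, 0, 0, 1, 0], [1, 0, 1, 0, 0, 0]]
```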

In [14]:
import random

docs = random.sample(range(len(dataset_test)),k=5)
docs

[436, 136, 1687, 832, 117]
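The indices above change on every run because the sample is unseeded. A minimal sketch of a reproducible draw follows, assuming a private seeded generator is acceptable; `sample_indices`, the seed value, and the size 2000 are illustrative choices, not values from the notebook.

```python
import random


def sample_indices(n_items, k, seed=0):
    """Draw k distinct indices from range(n_items) with a private, seeded RNG."""
    rng = random.Random(seed)
    return rng.sample(range(n_items), k=k)


first = sample_indices(2000, 5, seed=42)
second = sample_indices(2000, 5, seed=42)
print(first == second)  # True
```

Using a local `random.Random` instance avoids changing the global random state that other cells may rely on.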

In [15]:
dataset_small = dataset_test.select(docs)
result_small = dataset_small.map(batch_process, batched=True, batch_size=2)

Input ids are automatically padded from 1781 to 2048 to be a multiple of `config.attention_window`: 512
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
done in 48.29 seconds, 0.66 sentences/sec
done in 21.21 seconds, 0.90 sentences/sec
Input ids are automatically padded from 2043 to 2048 to be a multiple of `config.attention_window`: 512
done in 27.72 seconds, 1.23 sentences/sec
Input ids are automatically padded from 899 to 1024 to be a multiple of `config.attention_window`: 512
done in 16.59 seconds, 0.96 sentences/sec
done in 17.97 seconds, 0.83 sentences/sec
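The padding messages in the output above follow from rounding each batch length up to the next multiple of the attention window (512). A short sketch of that arithmetic, with `padded_length` as a hypothetical helper name:

```python
def padded_length(seq_len, attention_window=512):
    """Round seq_len up to the next multiple of attention_window."""
    remainder = seq_len % attention_window
    return seq_len if remainder == 0 else seq_len + attention_window - remainder


for n in (1781, 2043, 899):
    print(n, "->", padded_length(n))
# 1781 -> 2048
# 2043 -> 2048
# 899 -> 1024
```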


In [16]:
import evaluate

rouge = evaluate.load("rouge")


In [17]:
rouge_scores = rouge.compute(predictions=result_small["generated_summaries"], references=result_small["gt_summaries"])

print(f"ROUGE-1: {rouge_scores['rouge1']:.4f}")
print(f"ROUGE-2: {rouge_scores['rouge2']:.4f}")
print(f"ROUGE-L: {rouge_scores['rougeL']:.4f}")

ROUGE-1: 0.3821
ROUGE-2: 0.1399
ROUGE-L: 0.2216
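For intuition about the ROUGE-L figure, the sketch below computes an F1 score from the longest common subsequence (LCS) of two whitespace-tokenized strings. It is a simplified illustration: the `rouge` metric applies its own tokenization, so values from this helper (`rouge_l_f1`, a hypothetical name) may not match the library exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """LCS-based F1 on lowercased, whitespace-split tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


value = rouge_l_f1(
    "pump the tank every few years",
    "clean the filter and pump the tank",
)
print(round(value, 4))  # 0.4615
```

Here the LCS is "pump the tank" (3 tokens), giving precision 3/6 and recall 3/7.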


In [18]:
for i in range(len(dataset_small)):
    example_summ = dataset_small[i]['summary']
    produced_summ = result_small[i]['generated_summaries']
    display(example_summ)   # reference summary
    display(produced_summ)  # generated summary
    print("\n")

"Keeping your septic tank clean will save you from costly repairs down the road. To clean your septic tank, you'll need to clean the filter annually and pump the tank every few years. Your filter should be located in the tank's outlet baffle and is often brightly colored. To clean it, all you need to do is spray it with a hose over the tank or dip it in a bucket of clean water. If you use a bucket, pour the dirty water into the tank when you're finished. You'll need to pump your tank every 1 to 3 years or whenever the sludge and scum levels reach a third of the tank. This needs to be done with a cast-iron pump and the waste will need to be disposed of in a government-chosen location, so it's best to get this done professionally."

'Pump the tank every few years. Locating the tank now saves time and money later regardless if you or an inspector clean the tank. Find your tank. When this fat and sludge layer is only three inches (7.62 cm) above the bottoms of the outlet pipes, you will end up saving thousands of dollars in costly repairs.'





"To throw a kitten shower to benefit a local shelter or humane society, make sure to talk to the organization first, since they will most likely be the ones to host it, then start recruiting volunteers and asking local businesses for donations of food and raffle prizes. Once you've set a date for your event, publicize it on social media and send press releases to your local news media to spread the word. During the kitten shower, make sure your volunteers know their responsibilities and maintain the different stations, which should include a place to donate as well as somewhere to foster and adopt. The kitten shower should also include raffles and games, such as pin the tail on the kitten, or a drawing where the winner gets to name kittens, for example."

'After throwing your successful kitten shower, rally your clean up crew and break down the event. In the weeks leading up to the shower, have your volunteers promote the event regularly via their Twitter, Facebook, and Instagram accounts. If your troop, group, or organization has a web presence, be sure to promote the shower there as well. You can create an Amazon.com registry and share it via email and social media.'





"To get skinny fast, start by figuring out how much weight you want to lose and in what time frame. Next, figure out how many calories you need to cut per day to reach your goal. For example, if you want to lose 1-2 pounds per week, you'll need to burn 500-1,000 calories every day by cutting calories from your diet and exercising. Try keeping a food journal or using an app to help you keep track of calories and stay motivated!"

'To lose 1 – 2 pounds per week, you\'ll need to burn 500 – 1,000 calories every day. Your goal could be, for example, to lose 5 pounds. A good example of a goal is: "My goal is to lose five pounds in two weeks by sticking to a balanced 1,200 calorie-a-day diet and exercise for at least 30 minutes a day." Cardio is one of the best ways to burn extra calories and support your goal of getting thin quickly.'





"To prevent ascarids in dogs, get an over-the-counter deworming medication that specifies it's for ascarids on the packaging. Make sure to read the label to check that the medication is suitable for the age and weight of your dog. If you need help choosing a medication, ask your vet for their advice. When you have the medication, follow the dosing directions and give it to your dog. Once you've dewormed your dog, remember to continue doing so regularly, or 2 to 4 times a year for adult dogs."

"Adults dogs should be dewormed monthly with fecal checks 2 - 4 times a year to check the effectiveness of the product used. This is best done under veterinary supervision who can advise you about the correct dose for your dog's pre-pregnant weight. If you know your dog has roundworms then you should keep it away from other dogs. How often worming takes place depends on the age of the dog and the results of fecal tests to show whether infection is still present."





"If you want to create a post on Facebook, you can use your mobile app or your desktop. Start by clicking on the Facebook icon or going to your Facebook page and logging in. You can then go to your page, and click on the post box. Tap the text field to write something. If your post is 130 characters or fewer, you can also add color to make your post stand out. To add media, tap “Photo Video” and select the images to upload. You can also click “Add to your post” to check in at a location, put a feeling or sticker on your post, or tag people. When you're done, tap “Post.”"

"Go to the page where you want to create your post, this will vary: Your page - You can create a post for your page from the top of the News Feed. Tap Photo/Video near the middle of the post screen, then select a photo and video to upload and tap Done. Skip this step if you don't want to upload a text-only post. You can tap multiple photos or videos to upload them all at once."



