<a href="https://colab.research.google.com/github/NikitasTsingenopoulos/WikiSum/blob/main/WikiSum.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -U datasets
!pip install -U evaluate
!pip install rouge_score



In [None]:
!pip install bert-score



In [None]:
from transformers import (AutoTokenizer, LEDConfig, LEDForConditionalGeneration)
from datasets import load_dataset, Dataset
import torch
import pandas as pd

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
from bert_score import score
from nltk.tokenize import sent_tokenize

In [None]:
from huggingface_hub import hf_hub_download

In [None]:
repo_id = "d0rj/wikisum"
splits = {'train': 'data/train-00000-of-00001-b28959cff7dcaf55.parquet', 'validation': 'data/validation-00000-of-00001-21f2c9acfa77bab4.parquet', 'test': 'data/test-00000-of-00001-52a8a7cd640a9fff.parquet'}

In [None]:
dataset_train = pd.read_parquet(hf_hub_download(repo_id=repo_id, repo_type="dataset", filename=splits["train"]))
dataset_val = pd.read_parquet(hf_hub_download(repo_id=repo_id, repo_type="dataset", filename=splits["validation"]))
dataset_test = pd.read_parquet(hf_hub_download(repo_id=repo_id, repo_type="dataset", filename=splits["test"]))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
dataset_train = Dataset.from_pandas(dataset_train)
dataset_val = Dataset.from_pandas(dataset_val)
dataset_test = Dataset.from_pandas(dataset_test)

In [None]:
# Load the PRIMERA tokenizer, config, and LED-based model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained('allenai/PRIMERA')

config = LEDConfig.from_pretrained('allenai/PRIMERA')

model = LEDForConditionalGeneration.from_pretrained('allenai/PRIMERA')
model.gradient_checkpointing_enable()

# Token ids used when building model inputs.
PAD_TOKEN_ID = tokenizer.pad_token_id
DOCSEP_TOKEN_ID = tokenizer.convert_tokens_to_ids("<doc-sep>")

In [None]:
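def process_document(documents):
    """Tokenize a batch of articles into one padded tensor of PRIMERA input ids."""
    input_ids_all = []

    for data in documents:
        # Collapse newlines and repeated whitespace into single spaces.
        cleaned_text = " ".join(data.split())
        # Each WikiSum example is a single article, so there is one chunk per input.
        input_chunks = [cleaned_text]

        input_ids = []
        for doc in input_chunks:
            # Encode, then strip the <s> and </s> that encode() adds ([1:-1]).
            # max_length is 4095 rather than 4096 so that, once <doc-sep>, <s>,
            # and </s> are added back below, the sequence is at most 4096 ids
            # (assumed to be the PRIMERA encoder limit).
            input_ids.extend(
                tokenizer.encode(
                    doc,
                    truncation=True,
                    max_length=4095,
                )[1:-1]
            )
            # Mark the end of the document.
            input_ids.append(DOCSEP_TOKEN_ID)

        input_ids = (
            [tokenizer.bos_token_id]
            + input_ids
            + [tokenizer.eos_token_id]
        )
        input_ids_all.append(torch.tensor(input_ids))

    # Right-pad every sequence to the longest one in the batch.
    input_ids = torch.nn.utils.rnn.pad_sequence(
        input_ids_all, batch_first=True, padding_value=PAD_TOKEN_ID
    )
    return input_ids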
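
The token bookkeeping in `process_document` can be checked with plain arithmetic. The sketch below is only an illustration: `final_sequence_length` is a hypothetical helper, and it assumes that `tokenizer.encode` adds exactly two special tokens (`<s>` and `</s>`) and that truncation counts them toward `max_length`.

```python
def final_sequence_length(raw_tokens: int, max_length: int) -> int:
    """Length of the id sequence built for one article (hypothetical helper).

    raw_tokens: number of article tokens before any special tokens.
    max_length: value passed to tokenizer.encode(..., truncation=True).
    """
    # encode() adds <s> and </s>, then truncates the total to max_length.
    encoded = min(raw_tokens + 2, max_length)
    # The [1:-1] slice strips those two special tokens again.
    body = max(encoded - 2, 0)
    # One <doc-sep> is appended, then <s> and </s> are wrapped around the result.
    return body + 1 + 2


print(final_sequence_length(1000, 4096))   # short article: 1003
print(final_sequence_length(10000, 4096))  # long article: 4097
print(final_sequence_length(10000, 4095))  # long article: 4096
```

Under these assumptions, a truncation limit of 4096 can produce 4097 ids for a long article, one more than a 4096-position encoder accepts, while a limit of 4095 stays within it.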

In [None]:
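def bert_apply(generated, reference_summary, top_n=4):
    """Keep the top_n generated sentences that best match the reference summary."""
    summary_sents = sent_tokenize(generated)
    if not summary_sents:
        # Nothing to rank; return the generated text unchanged.
        return generated

    # Pair every candidate sentence with the same full reference summary.
    reference = [reference_summary] * len(summary_sents)

    # BERTScore precision, recall, and F1 for each sentence against the reference.
    P, R, F1 = score(summary_sents, reference, lang="en", verbose=True)

    # Rank sentences by F1 (highest first) and keep the top_n.
    # Note: the output follows score order, not the original sentence order.
    ranked = sorted(zip(summary_sents, F1), key=lambda x: x[1], reverse=True)
    concise_summary = " ".join([sent for sent, _ in ranked[:top_n]])
    return concise_summary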
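
`bert_apply` joins the selected sentences in score order, which can scramble the narrative order of the generated summary. A minimal sketch of an alternative is shown below: pick the top-scoring sentences, then restore their original order. `select_top_sentences` is a hypothetical helper, not part of the notebook, and it takes precomputed scores (for example the F1 values from `score`) so that it runs without loading a model.

```python
from typing import Sequence


def select_top_sentences(
    sentences: Sequence[str], scores: Sequence[float], top_n: int = 4
) -> str:
    """Join the top_n highest-scoring sentences in their original order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # Indices of the top_n highest-scoring sentences.
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    # Re-sort the chosen indices so the sentences keep their original order.
    return " ".join(sentences[i] for i in sorted(best))


sents = ["A.", "B.", "C.", "D.", "E."]
f1 = [0.2, 0.9, 0.1, 0.8, 0.5]
print(select_top_sentences(sents, f1, top_n=3))  # B. D. E.
```

Whether order-preserving selection improves ROUGE-L on this dataset is untested here; the sketch only shows the mechanics.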

In [None]:
def batch_process(batch):
    # Tokenize and pad the articles in this batch.
    input_ids = process_document(batch['article'])

    # Global attention goes on the <s> token and on every <doc-sep> token;
    # all other positions use local (windowed) attention only.
    global_attention_mask = torch.zeros_like(input_ids)
    global_attention_mask[:, 0] = 1
    global_attention_mask[input_ids == DOCSEP_TOKEN_ID] = 1

    generated_ids = model.generate(
        input_ids=input_ids,
        global_attention_mask=global_attention_mask,
        use_cache=True,
        max_length=512,
        min_length=50,
        num_beams=5,
        length_penalty=2,
        no_repeat_ngram_size=3,
        early_stopping=True
    )
    generated_str = tokenizer.batch_decode(
        generated_ids.tolist(), skip_special_tokens=True
    )

    # Keep only the generated sentences that score best against the reference summary.
    generated_str_bert = [bert_apply(generated, ref) for generated, ref in zip(generated_str, batch['summary'])]

    result = {}
    result['generated_summaries'] = generated_str_bert
    result['gt_summaries'] = batch['summary']
    return result
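
The global attention pattern built in `batch_process` can be illustrated on plain Python lists, without loading the model. This is a sketch only: `build_global_attention_mask` is a hypothetical helper and the token ids are made up for the example.

```python
from typing import List


def build_global_attention_mask(
    input_ids: List[List[int]], docsep_id: int
) -> List[List[int]]:
    """Return 1 for positions with global attention, 0 elsewhere."""
    mask = []
    for row in input_ids:
        # Global attention on every <doc-sep> token.
        row_mask = [1 if tok == docsep_id else 0 for tok in row]
        # Global attention on the first token (<s>).
        if row_mask:
            row_mask[0] = 1
        mask.append(row_mask)
    return mask


DOCSEP, PAD = 99, 1  # hypothetical ids, for illustration only
batch = [[0, 11, 12, DOCSEP, 2], [0, 13, DOCSEP, 2, PAD]]
print(build_global_attention_mask(batch, DOCSEP))
# [[1, 0, 0, 1, 0], [1, 0, 1, 0, 0]]
```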

In [None]:
import random

docs = random.sample(range(len(dataset_test)),k=5)
docs

[880, 1434, 1095, 1049, 1573]

In [None]:
dataset_small = dataset_test.select(docs)
result_small = dataset_small.map(batch_process, batched=True, batch_size=2)

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Input ids are automatically padded from 1169 to 1536 to be a multiple of `config.attention_window`: 512
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


  0%|          | 0/1 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/1 [00:00<?, ?it/s]

done in 37.22 seconds, 0.51 sentences/sec


Input ids are automatically padded from 1897 to 2048 to be a multiple of `config.attention_window`: 512


done in 20.30 seconds, 0.94 sentences/sec


done in 10.99 seconds, 1.09 sentences/sec


Input ids are automatically padded from 474 to 512 to be a multiple of `config.attention_window`: 512


done in 26.32 seconds, 0.95 sentences/sec


done in 8.15 seconds, 1.72 sentences/sec
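
The "automatically padded" messages in the log above follow from rounding each batch length up to the next multiple of the attention window. A small sketch of that arithmetic, assuming the window of 512 reported in the log (`padded_length` is a hypothetical helper):

```python
def padded_length(seq_len: int, attention_window: int = 512) -> int:
    """Round seq_len up to the next multiple of attention_window."""
    # Ceiling division written with integer arithmetic only.
    return -(-seq_len // attention_window) * attention_window


for n in (1169, 1897, 474):
    print(n, "->", padded_length(n))
# 1169 -> 1536
# 1897 -> 2048
# 474 -> 512
```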


In [None]:
import evaluate

rouge = evaluate.load("rouge")

In [None]:
rouge_scores = rouge.compute(predictions=result_small["generated_summaries"], references=result_small["gt_summaries"])

print(f"ROUGE-1: {rouge_scores['rouge1']:.4f}")
print(f"ROUGE-2: {rouge_scores['rouge2']:.4f}")
print(f"ROUGE-L: {rouge_scores['rougeL']:.4f}")

ROUGE-1: 0.3956
ROUGE-2: 0.1552
ROUGE-L: 0.1972
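
For intuition about what the ROUGE-1 number measures, here is a simplified unigram-overlap F1. It is a sketch only: it splits on whitespace and lowercases, so its values will generally differ from those of the `rouge` metric loaded above, which applies its own tokenization. `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1 based on clipped unigram overlap."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(round(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"), 4))  # 0.8333
```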


In [None]:
# Show each reference summary followed by the summary produced for it.
for i in range(len(dataset_small)):
    example_summ = dataset_small[i]['summary']
    produced_summ = result_small[i]['generated_summaries']
    display(example_summ)
    display(produced_summ)
    print("\n")

"To get a dried butter stain out of clothing, first dampen the stain with some warm water. Then, squirt a dollop of dish soap directly on the stain. Rub the soap into the stain with your finger using a smooth circular motion. Rinse the stain under warm water, then spray it with some prewash stain remover. Machine wash your garment like normal. To remove a dried butter stain, try using baking soda. Sprinkle the baking soda over the stain so it's completely covered, then let it sit overnight. The next day, shake the baking soda off and machine wash your garment. Repeat these steps as needed until the butter stain is gone. If you're dealing with a butter stain that's still wet, first scrape up any globs of butter. Then, dip a napkin in warm water and coat it in salt. Put a dry napkin behind the stain to brace the fabric, then press the salt-soaked napkin against the butter stain without rubbing. Hold it there for 30 seconds to give the salt time to absorb the oils. Finally, rinse the area

'If the stain has not yet been removed, repeat the process of applying dish soap, rinsing, pretreating the stain, and washing one more time before putting the garment through the dryer. When you cover the fresh butter stain with a generous layer of either product, the powder will draw the butter out of the garment. Test the product on a small, easily hidden patch of fabric on the garment before applying it to the stain. The hotter the water, the higher the likelihood of the butter stain coming out, so use the hottest temperature allowable for the fabric of the stained piece of clothing.'





"If your cat's joints are swollen or if it drags one of its feet when it walks, it may be suffering from septic arthritis, which is a condition caused by bacteria in your cat's joints. To treat your cat's septic arthritis, take it to the veterinarian, who will most likely perform surgery to determine what type of bacteria is causing the infection. From here, they will prescribe antibiotics that you will need to administer to your cat for as long as recommended, even after symptoms disappear. To make sure your cat heals fully and properly, you should also try to keep it from moving around more than it needs to by placing its litter box, food, water, and bed all in the same area."

"However, lack of appetite is a serious symptom regardless of what causes it, so if you notice it in your cat, you should take it to the vet right away. If your cat is in enough pain, it may lose its appetite. If you are testing your cat's joints, do so gently. The length of time your cat has been suffering from septic arthritis can affect how much additional damage there is."





'If you want to make Nigerian fried rice, parboil the rice for 5 minutes, drain some of the water from the pot, and simmer for a further 3 minutes. Saute the onion for 7-8 minutes, then add the ground crayfish and cook for 1 minute. Next, stir in the diced vegetables, Nigerian curry powder, ground pepper, and stock cubes before frying the mixture for 2-5 minutes. To finish, mix the rice into the vegetables, fry for 2 minutes, and add the smoked prawns.'

"If you want to adjust the flavor of the rice, add more ground crayfish, curry, or ground pepper. Sauté the onions for 7 to 8 minutes. Stir in the smoked prawns and stir the mixture until it's just combined. Store the leftover Nigerian fried rice within 2 hours of making it."





"If you need to clear mild acne fast, start by trying a benzoyl peroxide cream. Benzoyl peroxide comes in a variety of strengths, from 2.5 to 10 percent, and it's best to start with the lowest strength for mild acne. If benzoyl peroxide alone isn't getting the job done, try adding salicylic acid to the mix. You can also try alpha hydroxy acids, like glycolic acid and lactic acid. These acids can fight acne, minimize acne scars, and make pores look smaller. All of these products are available over-the-counter, but be sure to give each treatment at least a week so you can properly gauge the results."

"It can lessen the appearance of acne and acne scars and make skin look younger and lighter overall. It's a good idea to go without makeup until acne is eliminated, but another option is to talk to your doctor or skin care expert about better choices of makeup, such as loose mineral-based powders. The ingredients are generally stronger and may help clear up your acne faster than drugstore brands. Eliminate potential irritants If you're looking to cure a breakout fast, eliminate any potential skin irritants until your acne is gone."





'To make cinnamon sugar, measure out 1 cup of granulated white sugar and 1/4 cup of ground cinnamon and put the ingredients in a small bowl. Next, mix the cinnamon and sugar together thoroughly with a fork until the mixture has a uniform color and consistency. Then, you can store the mixture in an airtight container at room temperature indefinitely until you use it all up.'

'Store any leftover mixture in an airtight container at room temperature. Stop mixing when the butter is an even color and consistency. You can make the cinnamon sugar ahead of time to add to butter whenever you want. Mash and stir the sugar and cinnamon into the butter until it is evenly distributed.'





In [None]:
text = """In a case, you can start by raising a question. You could quote someone you interviewed. Make sure to include background information on your study site, why your interviewees are a good sample, and what makes your problem pressing to give your audience a panoramic view of the issue. After you've clearly stated the problem at hand, of course. Include photos or a video if it would benefit your work to be persuasive and personalized. As you go through each one, take adequate notes so you can find the info later on! Search for case studies that have been published on the same or similar subject matter. Talk to your professors, go to the library, surf the web until your bum falls asleep. You don't want to replicate the research that has already been done. Find out what has been written before, and read the important articles about your case's situation. . You also need to ask questions that will give you facts that might not be available from an article--make your work different and purposeful. Set up interviews with subject matter experts (account managers in a corporation, clients and customers using applicable tools and services, etc. ). Make sure all your informants are aware of what you're doing. If you have written a good case, they will have enough information to understand the situation and have a lively class discussion. Add references and appendices (if any). Just like you would in any other paper, reference your sources. That's why you got credible ones in the first place. And if you have any information that relates to the study but would have interrupted the flow of the body, include it now. You may have terms that would be hard for other cultures to understand. As your work is forming, you'll notice that it may morph into an object you didn't otherwise expect. If it does so, make additions and deletions as needed. Go over your study section by section, but also as a whole. 
Each data point needs to fit into both it's place and the entirety of the work.If you can't find an appropriate place for something, stick it in the appendix. Edit and proofread your work. Now that your paper is formulated, look for minute revisions."""

tokens = tokenizer.tokenize(text)
token_count = len(tokens)

print("Token count:", token_count)

Token count: 443
