### Summarization

In questo notebook esploreremo diversi modi di fare summarization in modo da poter capire le diverse alternative a nostra disposizione e come utilizzarle

In [28]:
text = """"
Lack of fluids can lead to dry mouth, which is a leading cause of bad breath. Water
can also dilute any chemicals in your mouth or gut that are causing bad breath., Studies show that
eating 6 ounces of yogurt a day reduces the level of odor-causing compounds in the mouth. In
particular, look for yogurt containing the active bacteria Streptococcus thermophilus or
Lactobacillus bulgaricus., The abrasive nature of fibrous fruits and vegetables helps to clean
teeth, while the vitamins, antioxidants, and acids they contain improve dental health.Foods that can
be particularly helpful include:Apples — Apples contain vitamin C, which is necessary for health
gums, as well as malic acid, which helps to whiten teeth.Carrots — Carrots are rich in vitamin A,
which strengthens tooth enamel.Celery — Chewing celery produces a lot of saliva, which helps to
neutralize bacteria that cause bad breath.Pineapples — Pineapples contain bromelain, an enzyme that
cleans the mouth., These teas have been shown to kill the bacteria that cause bad breath and
plaque., An upset stomach can lead to burping, which contributes to bad breath. Don’t eat foods that
upset your stomach, or if you do, use antacids. If you are lactose intolerant, try lactase tablets.,
They can all cause bad breath. If you do eat them, bring sugar-free gum or a toothbrush and
toothpaste to freshen your mouth afterwards., Diets low in carbohydrates lead to ketosis — a state
in which the body burns primarily fat instead of carbohydrates for energy. This may be good for your
waistline, but it also produces chemicals called ketones, which contribute to bad breath.To stop the
problem, you must change your diet. Or, you can combat the smell in one of these ways:Drink lots of
water to dilute the ketones.Chew sugarless gum or suck on sugarless mints.Chew mint leaves.
"""

### T5 trained on Wikihow

In [29]:
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

def summarize_t5(text):
    tokenizer = AutoTokenizer.from_pretrained("deep-learning-analytics/wikihow-t5-small")
    model = AutoModelWithLMHead.from_pretrained("deep-learning-analytics/wikihow-t5-small")

    device = torch.device("cpu")
    model = model.to(device)



    preprocess_text = text.strip().replace("\n","")
    tokenized_text = tokenizer.encode(preprocess_text, return_tensors="pt").to(device)

    summary_ids = model.generate(
                tokenized_text,
                max_length=150, 
                num_beams=2,
                repetition_penalty=2.5, 
                length_penalty=1.0, 
                early_stopping=True
            )

    output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return output

print ("\n\nSummarized text: \n",summarize_t5(text))





Summarized text: 
 Drink water.Eat yogurt.Eat fibrous fruits and vegetables.Try teas.Eat lactose-intolerant foods.Eat sugar-free gum.Drink plenty of water.


### BigBird

In [38]:
from transformers import BigBirdPegasusForConditionalGeneration, AutoTokenizer

def summarize_bigbird(text):
    tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")

    # by default encoder-attention is `block_sparse` with num_random_blocks=3, block_size=64
    model = BigBirdPegasusForConditionalGeneration.from_pretrained("google/bigbird-pegasus-large-arxiv")

    inputs = tokenizer(text, return_tensors='pt')
    prediction = model.generate(**inputs)
    prediction = tokenizer.batch_decode(prediction)
    return prediction[0]

print(summarize_bigbird(text))

Attention type 'block_sparse' is not possible if sequence_length: 400 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3.Changing attention type to 'original_full'...


<s> sugarless mint leaves are shown to dilute chemicals which cause bad breath which dilutes any chemicals which contribute to bad breath which dilutes any chemicals which contribute to bad breath which dilutes any foods which contribute to bad breath which dilutes any chemicals which contribute to bad breath which dilutes any foods which contribute to bad breath which dilutes any gluconic acid which helps to improve health which dilutes any gluconic acid which helps to improve health which dilutes any gluconic acid which helps to improve health which dilutes any gluconic acid which helps to improve health which dilutes any gluconic acid which helps to improve health which dilutes any gluconic acid which helps to improve health which dilutes any gluconic acid which helps to improve health which dilutes any gluconic acid which helps to improve health which dilutes any gluconic acid which helps to improve health which dilutes any gluconic acid which helps to improve health which dilutes 

In [39]:
from transformers import pipeline

def summarize_bigbird_pipeline(text):
    summarizer = pipeline("summarization", model="google/bigbird-pegasus-large-arxiv")
    output = summarizer(text, min_length=5, max_length=64)
    return output[0]['summary_text']

print(summarize_bigbird_pipeline(text))

Attention type 'block_sparse' is not possible if sequence_length: 400 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3.Changing attention type to 'original_full'...


sugarless mint leaves are shown to dilute chemicals which cause bad breath which dilutes any chemicals which contribute to bad breath which dilutes any chemicals which cause bad breath which dilutes any foods which contribute to bad breath which dilutes any chemicals which contribute to bad breath which dilutes any foods which contribute to bad breath which


### Pegasus

In [40]:
from transformers import pipeline

def summarize_pegasus(text):
    summarizer = pipeline("summarization", model="google/pegasus-xsum")
    output = summarizer(text, min_length=5, max_length=64)
    return output[0]['summary_text']

print(summarize_pegasus(text))


Drinking lots of water is one of the best ways to combat bad breath.


### BART

In [41]:
from transformers import pipeline

def summarize_bart(text):
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    output = summarizer(text, min_length=5, max_length=64)
    return output[0]['summary_text']

print(summarize_bart(text))


Lack of fluids can lead to dry mouth, which is a leading cause of bad breath. Studies show that eating 6 ounces of yogurt a day reduces the level of odor-causing compounds. The abrasive nature of fibrous fruits and vegetables helps to clean teeth.


### Proviamo su un vero dataset

In [15]:
from datasets import load_dataset

dataset = load_dataset("scitldr")

Downloading:   0%|          | 0.00/2.80k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.59k [00:00<?, ?B/s]

No config specified, defaulting to: scitldr/Abstract


Downloading and preparing dataset scitldr/Abstract (download: 5.23 MiB, generated: 4.58 MiB, post-processed: Unknown size, total: 9.81 MiB) to /home/calogero/.cache/huggingface/datasets/scitldr/Abstract/0.0.0/72d6e2195786c57e1d343066fb2cc4f93ea39c5e381e53e6ae7c44bbfd1f05ef...


Downloading:   0%|          | 0.00/1.01M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/356k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/378k [00:00<?, ?B/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

Dataset scitldr downloaded and prepared to /home/calogero/.cache/huggingface/datasets/scitldr/Abstract/0.0.0/72d6e2195786c57e1d343066fb2cc4f93ea39c5e381e53e6ae7c44bbfd1f05ef. Subsequent calls will reuse this data.


In [44]:
from tqdm import tqdm
from rouge import Rouge
import numpy as np

def rouge_score(prd, tgt):
    rouge = Rouge()
    return rouge.get_scores(prd, tgt)[0]['rouge-l']['f']
    

rouge_t5 = []
rouge_bb = []
rouge_bb_p = []
rouge_pgs = []
rouge_bart = []

summarizers = [summarize_t5, summarize_bigbird, summarize_bigbird_pipeline, summarize_pegasus, summarize_bart]

for i,src in enumerate(tqdm(dataset['test']['source'])):
    src = ' '.join(src)
    tgt = ' '.join(dataset['test']['target'][i])
    prd_t5 =  summarize_t5(src)
    rouge_t5.append( rouge_score(prd_t5, tgt) )
    
    prd_bb =  summarize_bigbird(src)
    rouge_bb.append( rouge_score(prd_bb, tgt) )
    
    prd_bb_p =  summarize_bigbird_pipeline(src)
    rouge_bb_p.append( rouge_score(prd_bb_p, tgt) )
    
    prd_pgs =  summarize_pegasus(src)
    rouge_pgs.append( rouge_score(prd_pgs, tgt) )
    
    prd_bart =  summarize_bart(src)
    rouge_bart.append( rouge_score(prd_bart, tgt) )
 

  0%|                                                                                                                       | 0/618 [00:00<?, ?it/s]Attention type 'block_sparse' is not possible if sequence_length: 217 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3.Changing attention type to 'original_full'...
Attention type 'block_sparse' is not possible if sequence_length: 217 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3.Changing attention type to 'original_full'...
  0%|▏                                                                          

KeyboardInterrupt: 

In [45]:
print("Performance T5")
print(f"Rouge - AVG: {np.mean(rouge_t5)}, STD: {np.std(rouge_t5)}")

Performance T5
Rouge - AVG: 0.16014457432930365, STD: 0.0750278380011422


In [46]:
print("Performance BigBird")
print(f"Rouge - AVG: {np.mean(rouge_bb)}, STD: {np.std(rouge_bb)}")

Performance BigBird
Rouge - AVG: 0.22359598779377526, STD: 0.08992183801335628


In [47]:
print("Performance BigBird with Pipeline")
print(f"Rouge - AVG: {np.mean(rouge_bb_p)}, STD: {np.std(rouge_bb_p)}")

Performance BigBird with Pipeline
Rouge - AVG: 0.227299649374868, STD: 0.10127713232946191


In [48]:
print("Performance Pegasus")
print(f"Rouge - AVG: {np.mean(rouge_pgs)}, STD: {np.std(rouge_pgs)}")

Performance Pegasus
Rouge - AVG: 0.1777226738761773, STD: 0.10845422655714473


In [49]:
print("Performance BART")
print(f"Rouge - AVG: {np.mean(rouge_bart)}, STD: {np.std(rouge_bart)}")

Performance BART
Rouge - AVG: 0.24533799362249537, STD: 0.08998582976732775
