# Inference

In this notebook we load a custom dataset, built from real WhatsApp chats, to evaluate the summarization performance of our previously fine-tuned model.

In [1]:
!pip install datasets pytesseract transformers rouge --upgrade

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting pytesseract
  Downloading pytesseract-0.3.10-py3-none-any.whl (14 kB)
Collecting transformers
  Downloading transformers-4.42.3-py3-none-any.whl (9.3 MB)
Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloa

In [2]:
from transformers import GenerationConfig, AutoTokenizer, AutoModelForSeq2SeqLM
import json
from datasets import Dataset
from pprint import pprint
import torch
from collections import defaultdict
from rouge import Rouge


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

Here we initialize the model.

In [3]:
gen_model_id="Seba213/flan-t5-base-samsum"
gen_tokenizer = AutoTokenizer.from_pretrained(gen_model_id)
gen_model = AutoModelForSeq2SeqLM.from_pretrained(gen_model_id).to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



The generation config below uses the following arguments, set to improve the quality of the generated summaries:

*   `max_length`: the maximum number of tokens in the generated sequence
*   `min_length`: the minimum number of tokens in the generated sequence
*   `length_penalty`: an exponential penalty on sequence length, used with beam search; values < 0.0 encourage shorter sequences, values > 0.0 encourage longer ones
*   `num_beams`: the number of beams for beam search
*   `repetition_penalty`: the penalty applied to repeated tokens; 1.0 means no penalty
*   `early_stopping`: controls when beam search stops. With *True*, generation stops as soon as there are `num_beams` complete candidates; with *False*, a heuristic is applied and generation stops when it is very unlikely that better candidates will be found
*   `no_repeat_ngram_size`: if set to n > 0, no n-gram of size n may appear more than once in the output
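
To make the effect of `length_penalty` more concrete, here is a minimal sketch. It assumes that a finished beam hypothesis is scored as its summed token log-probability divided by `length ** length_penalty`, which is how the Transformers documentation describes the penalty. The function `beam_score` and the two example hypotheses are hypothetical and only serve as illustration; they are not part of the notebook or of the library.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized score of a finished beam hypothesis (illustrative only)."""
    return sum_logprobs / (length ** length_penalty)


# Two made-up finished hypotheses: (sum of token log-probabilities, length in tokens)
short_hyp = (-4.0, 5)
long_hyp = (-6.0, 10)

for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(*short_hyp, penalty)
    long_score = beam_score(*long_hyp, penalty)
    # The hypothesis with the higher (less negative) score is preferred
    winner = "long" if long_score > short_score else "short"
    print(f"length_penalty={penalty}: short={short_score:.3f}, long={long_score:.3f} -> {winner}")
```

Because log-probabilities are negative, dividing by a larger power of the length moves the score of a long hypothesis closer to zero, which is why positive values favour longer summaries.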





In [4]:
generation_config = GenerationConfig(
    max_length=256,
    min_length=50,
    length_penalty=2.0,
    num_beams=12,
    repetition_penalty=2.5,
    early_stopping=True,
    no_repeat_ngram_size=3,
    # Reuse the special-token ids stored with the fine-tuned model
    bos_token_id=gen_model.config.bos_token_id,
    decoder_start_token_id=gen_model.config.decoder_start_token_id,
    eos_token_id=gen_model.generation_config.eos_token_id,
    pad_token_id=gen_model.generation_config.pad_token_id,
    # Note: 0 and 2 are the BOS/EOS ids of BART-style vocabularies; in the T5
    # vocabulary 0 is <pad> and 1 is </s>, so these values may be worth revisiting.
    forced_bos_token_id=0,
    forced_eos_token_id=2,
)

## Load dataset


In [5]:
with open('test.json', 'r') as file:
  whatsapp_chats = json.load(file)

test_dataset = Dataset.from_list(whatsapp_chats)
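
The contents of `test.json` are not shown in this notebook. Since the later cells read the `dialogue` and `summary` columns, the file is presumably a JSON list of objects with those two fields. The sketch below illustrates that assumed layout with a made-up record and checks that every record carries both fields; the example text and the names `example_chats`, `required` and `missing` are hypothetical.

```python
import json

# Hypothetical record: the field names come from the notebook, the text is made up
example_chats = [
    {
        "dialogue": "Anna: Are we still on for lunch? Ben: Yes, 1 pm at the usual place.",
        "summary": "Anna and Ben will have lunch at 1 pm at the usual place.",
    },
]

# Round-trip through JSON to mimic what json.load returns for such a file
loaded = json.loads(json.dumps(example_chats))

# Every record needs both fields that the inference cell reads later
required = {"dialogue", "summary"}
missing = [i for i, record in enumerate(loaded) if not required.issubset(record)]
print(f"{len(loaded)} record(s), {len(missing)} with missing fields")
```

A check like this can be run on the loaded list before calling `Dataset.from_list`, so that a malformed record is caught early rather than inside the generation loop.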

The inference cell below runs over the whole dataset and also computes the ROUGE scores of each generated summary against its reference.
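
For intuition about what the ROUGE-1 F-score measures, here is a simplified sketch based on clipped unigram overlap between a generated summary and its reference. It is not the implementation used by the `rouge` package, whose tokenization and counting may differ in detail; the function `rouge1_f` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1_f(hypothesis: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on whitespace-split, lowercased text."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())  # share of generated words that match
    recall = overlap / sum(ref_counts.values())     # share of reference words recovered
    return 2 * precision * recall / (precision + recall)


print(rouge1_f("the cat sat", "the cat sat on the mat"))
```

In this example all three generated words appear in the reference (precision 1.0), but only half of the six reference words are recovered (recall 0.5), so the F-score is about 0.67.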

In [6]:
rouge = Rouge()  # create the scorer once and reuse it for every sample
rouge_sum = 0.0
references = test_dataset['summary']  # load the column once instead of on every iteration
for dialogue, reference in zip(test_dataset['dialogue'], references):
  # Tokenize one dialogue and move the tensors to the model's device
  inputs = gen_tokenizer(dialogue, return_tensors="pt").to(device)
  encoded_output = gen_model.generate(**inputs, generation_config=generation_config)
  decoded_output = gen_tokenizer.batch_decode(encoded_output, skip_special_tokens=True)
  # get_scores takes (hypothesis, reference) and returns one score dict per pair
  scores = rouge.get_scores(decoded_output[0], reference)[0]
  pprint(f"dialogue: \n{dialogue}\n---------------")
  print('\n')
  pprint(f"reference summary:\n{reference}")
  print('\n')
  pprint(f"flan-t5-base summary:\n{decoded_output}")
  print('\n')
  rouge_sum += scores['rouge-1']['f']
  print(f"ROUGE-1: {scores['rouge-1']['f']}")
  print(f"ROUGE-2: {scores['rouge-2']['f']}")
  print(f"ROUGE-L: {scores['rouge-l']['f']}")
  print('------------------------------------------------------------')
# Average the ROUGE-1 F-score over the dataset size rather than a hard-coded 12
print(f"ROUGE-1 mean: {rouge_sum / len(test_dataset)}")

Token indices sequence length is longer than the specified maximum sequence length for this model (566 > 512). Running this sequence through the model will result in indexing errors


('dialogue: \n'
 'Alex: Did you guys watch the derby? What a game! Sam: Yes! Inter Milan was '
 "on fire  Jordan: Totally! They really showed Milan who's boss. Taylor: Sent "
 'an image with this description: Paulo Dybala, wearing a blue t-shirt, with '
 "tattoos on his hands, celebrating after scoring Inter Milan's second goal. "
 'The background is blurred, suggesting the focus is on Dyba and his '
 'celebration. That first goal by Lautaro Martinez was amazing! What a '
 'celebration tho Casey: Absolutely, Lautaro was unstoppable tonight. Alex: '
 "And that assist from Barella, so smooth! Sam: I think Milan's defense just "
 "couldn't keep up. Jordan: Agreed. They looked so disorganized at the back. "
 'Chris: As a Milan fan, this was painful to watch  Jamie: Same here, Chris. '
 "Milan just didn't show up tonight. Taylor: What did you think of Inter's "
 "second goal? Perfect counter-attack. Chris: It was a great goal, but Milan's "
 'defense was nowhere to be seen. Casey: That was 