# Summarization (PyTorch)

Install the Transformers and Datasets libraries to run this notebook.

In [1]:
# Install dependencies
!pip install datasets transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!apt install git-lfs


In [2]:
# Load English and Spanish datasets
from datasets import load_dataset

spanish_dataset = load_dataset("amazon_reviews_multi", "es")
english_dataset = load_dataset("amazon_reviews_multi", "en")
english_dataset

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
})

In [3]:
# Print a few examples of the data
def show_samples(dataset, num_samples=3, seed=42):
    sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    for example in sample:
        print(f"\n'>> Title: {example['review_title']}'")
        print(f"'>> Review: {example['review_body']}'")


show_samples(english_dataset)

'>> Title: Worked in front position, not rear'
'>> Review: 3 stars because these are not rear brakes as stated in the item description. At least the mount adapter only worked on the front fork of the bike that I got it for.'

'>> Title: meh'
'>> Review: Does it’s job and it’s gorgeous but mine is falling apart, I had to basically put it together again with hot glue'

'>> Title: Can't beat these for the money'
'>> Review: Bought this for handling miscellaneous aircraft parts and hanger "stuff" that I needed to organize; it really fit the bill. The unit arrived quickly, was well packaged and arrived intact (always a good sign). There are five wall mounts-- three on the top and two on the bottom. I wanted to mount it on the wall, so all I had to do was to remove the top two layers of plastic drawers, as well as the bottom corner drawers, place it when I wanted and mark it; I then used some of the new plastic screw in wall anchors (the 50 pound variety) and it easily mounted to the wall. 

In [4]:
# Check out product category value counts
english_dataset.set_format("pandas")
english_df = english_dataset["train"][:]
# Show counts for the 20 most common product categories
english_df["product_category"].value_counts()[:20]

home                      17679
apparel                   15951
wireless                  15717
other                     13418
beauty                    12091
drugstore                 11730
kitchen                   10382
toy                        8745
sports                     8277
automotive                 7506
lawn_and_garden            7327
home_improvement           7136
pet_products               7082
digital_ebook_purchase     6749
pc                         6401
electronics                6186
office_product             5521
shoes                      5197
grocery                    4730
book                       3756
Name: product_category, dtype: int64

In [5]:
# Function to filter to books and ebooks
def filter_books(example):
    return (
        example["product_category"] == "book"
        or example["product_category"] == "digital_ebook_purchase"
    )

In [6]:
# Reset data format back to arrow
english_dataset.reset_format()

In [7]:
# Apply filtering function to datasets
spanish_books = spanish_dataset.filter(filter_books)
english_books = english_dataset.filter(filter_books)
show_samples(english_books)

'>> Title: I'm dissapointed.'
'>> Review: I guess I had higher expectations for this book from the reviews. I really thought I'd at least like it. The plot idea was great. I loved Ash but, it just didnt go anywhere. Most of the book was about their radio show and talking to callers. I wanted the author to dig deeper so we could really get to know the characters. All we know about Grace is that she is attractive looking, Latino and is kind of a brat. I'm dissapointed.'

'>> Title: Good art, good price, poor design'
'>> Review: I had gotten the DC Vintage calendar the past two years, but it was on backorder forever this year and I saw they had shrunk the dimensions for no good reason. This one has good art choices but the design has the fold going through the picture, so it's less aesthetically pleasing, especially if you want to keep a picture to hang. For the price, a good calendar'

'>> Title: Helpful'
'>> Review: Nearly all the tips useful and. I consider myself an intermediate to a

In [8]:
# Create a combined dataset
from datasets import concatenate_datasets, DatasetDict

books_dataset = DatasetDict()

for split in english_books.keys():
    books_dataset[split] = concatenate_datasets(
        [english_books[split], spanish_books[split]]
    )
    books_dataset[split] = books_dataset[split].shuffle(seed=42)

# Peek at a few examples
show_samples(books_dataset)

'>> Title: Easy to follow!!!!'
'>> Review: I loved The dash diet weight loss Solution. Never hungry. I would recommend this diet. Also the menus are well rounded. Try it. Has lots of the information need thanks.'

'>> Title: PARCIALMENTE DAÑADO'
'>> Review: Me llegó el día que tocaba, junto a otros libros que pedí, pero la caja llegó en mal estado lo cual dañó las esquinas de los libros porque venían sin protección (forro).'

'>> Title: no lo he podido descargar'
'>> Review: igual que el anterior'


In [9]:
# Filter out examples whose titles are two words or fewer
books_dataset = books_dataset.filter(lambda x: len(x["review_title"].split()) > 2)

In [10]:
# Load the tokenizer for the pretrained mT5 checkpoint
from transformers import AutoTokenizer

model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

  "The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option"


In [11]:
# Tokenization example
inputs = tokenizer("I loved reading the Hunger Games!")
inputs

{'input_ids': [336, 259, 28387, 11807, 287, 62893, 295, 12507, 309, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [12]:
# Convert input IDs to tokens
tokenizer.convert_ids_to_tokens(inputs.input_ids)

['▁I', '▁', 'loved', '▁reading', '▁the', '▁Hung', 'er', '▁Games', '!', '</s>']

In [13]:
# Preprocessing function for tokenization
max_input_length = 512
max_target_length = 30


def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["review_body"], max_length=max_input_length, truncation=True
    )
    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["review_title"], max_length=max_target_length, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [14]:
# Tokenize the dataset!
tokenized_datasets = books_dataset.map(preprocess_function, batched=True)

In [15]:
# Example generated and reference summaries
generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"

In [16]:
# Install ROUGE score library dependency
!pip install rouge_score



In [17]:
# Load in ROUGE metric
from datasets import load_metric

rouge_score = load_metric("rouge")

In [18]:
# Compute ROUGE scores for the example generated and reference summaries above
scores = rouge_score.compute(
    predictions=[generated_summary], references=[reference_summary]
)
scores

{'rouge1': AggregateScore(low=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), mid=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), high=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923)),
 'rouge2': AggregateScore(low=Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272), mid=Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272), high=Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272)),
 'rougeL': AggregateScore(low=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), mid=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), high=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923)),
 'rougeLsum': AggregateScore(low=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), mid=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.92307692307
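
The precision, recall, and F-measure values above can be reproduced by hand. The following is a minimal sketch of ROUGE-N, assuming lowercased whitespace tokenization and no stemming; the helper names `ngrams` and `rouge_n` are illustrative and are not part of the `rouge_score` library, whose tokenization rules may differ on other inputs.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """Compute ROUGE-N precision, recall, and F1 on lowercased whitespace tokens."""
    pred_counts = ngrams(prediction.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    # Clipped overlap: an n-gram counts only as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


generated = "I absolutely loved reading the Hunger Games"
reference = "I loved reading the Hunger Games"
for n in (1, 2):
    p, r, f = rouge_n(generated, reference, n)
    print(f"rouge{n}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
# rouge1: precision=0.8571 recall=1.0000 f1=0.9231
# rouge2: precision=0.6667 recall=0.8000 f1=0.7273
```

The generated summary has seven unigrams, six of which appear in the reference, so precision is 6/7 while recall is 6/6. The extra word "absolutely" breaks two bigrams, which is why ROUGE-2 is lower.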

In [20]:
# Install nltk
!pip install nltk



In [21]:
# Import nltk and download the Punkt sentence tokenizer data
import nltk

nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [22]:
# Create a lead-3 baseline: use the first three sentences of the review as the summary
from nltk.tokenize import sent_tokenize


def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])


print(three_sentence_summary(books_dataset["train"][1]["review_body"]))

I grew up reading Koontz, and years ago, I stopped,convinced i had "outgrown" him.
Still,when a friend was looking for something suspenseful too read, I suggested Koontz.
She found Strangers.
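
To see the lead-3 idea without the NLTK dependency, here is a rough sketch that splits sentences with a regular expression instead of `sent_tokenize`. The function name `lead_n_summary` and the sample review are made up for illustration, and the regex is a simplification: it will mis-split abbreviations such as "Dr." that Punkt handles.

```python
import re


def lead_n_summary(text, n=3):
    """Join the first n sentences with newlines, using a naive sentence split."""
    # Split after ., !, or ? when followed by whitespace (a simplification of Punkt)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(sentences[:n])


# Made-up review text, used only to exercise the function
sample_review = (
    "The plot was gripping. I read it in one night! "
    "Would I recommend it? Absolutely."
)
print(lead_n_summary(sample_review))
```

Because review titles are usually much shorter than three sentences, a baseline like this tends to score low on precision even when the opening sentences are on topic.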


In [23]:
# Function for evaluating baseline summary
def evaluate_baseline(dataset, metric):
    summaries = [three_sentence_summary(text) for text in dataset["review_body"]]
    return metric.compute(predictions=summaries, references=dataset["review_title"])

In [24]:
# Evaluate the lead-3 baseline on the validation set
import pandas as pd

score = evaluate_baseline(books_dataset["validation"], rouge_score)
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_dict = dict((rn, round(score[rn].mid.fmeasure * 100, 2)) for rn in rouge_names)
rouge_dict

{'rouge1': 16.79, 'rouge2': 8.83, 'rougeL': 15.56, 'rougeLsum': 15.99}

In [25]:
# Load sequence-to-sequence model
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [26]:
# Instantiate model training arguments
from transformers import Seq2SeqTrainingArguments

batch_size = 4
num_train_epochs = 4
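# Aim to log the training loss once per epoch; this count is based on the full train split,
# so when training on a smaller subset the loss may never be logged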
logging_steps = len(tokenized_datasets["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned-amazon-en-es",
    evaluation_strategy="epoch",
    learning_rate=5.6e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
    push_to_hub=False,
)
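
The relationship between dataset size, batch size, and step counts is easy to check by hand. The sketch below assumes a single device and no gradient accumulation; `full_train_size` is a hypothetical value chosen for illustration, while the 1000-example subset and batch size of 4 match the training log in this notebook.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps per epoch on one device with no gradient accumulation."""
    return math.ceil(num_examples / batch_size)


example_batch_size = 4
example_epochs = 4
subset_size = 1000  # training subset size reported in the training log
full_train_size = 9000  # hypothetical size of the full filtered train split

per_epoch = steps_per_epoch(subset_size, example_batch_size)
total_steps = per_epoch * example_epochs
# Same formula the notebook uses for logging_steps, applied to the full split
example_logging_steps = full_train_size // example_batch_size

print("steps per epoch:", per_epoch)  # 250
print("total optimization steps:", total_steps)  # 1000
print("logging_steps:", example_logging_steps)  # 2250
print("loss logged at least once:", example_logging_steps <= total_steps)  # False
```

Under these assumptions the logging interval exceeds the total number of optimization steps, which would be consistent with a "No log" entry in the training-loss column.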

In [27]:
# Define the metric function used for evaluation during training
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Decode generated summaries into text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    # Decode reference summaries into text
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # ROUGE expects a newline after each sentence
    decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]
    # Compute ROUGE scores
    result = rouge_score.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    # Extract the median scores
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    return {k: round(v, 4) for k, v in result.items()}

In [28]:
# Import and instantiate data collator
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
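
As a rough picture of what a seq2seq collator does, the sketch below pads a toy batch in plain Python. It assumes a pad token ID of 0 and a label padding value of -100 (the value the loss ignores); `pad_batch` and the token IDs are made up for illustration, and the real `DataCollatorForSeq2Seq` also returns tensors and can prepare decoder inputs, which is omitted here.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad inputs and labels to the longest sequence in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        input_pad = max_input - len(f["input_ids"])
        label_pad = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * input_pad)
        # Real tokens get mask 1, padding gets mask 0
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * input_pad)
        # Labels are padded with -100 so padded positions do not contribute to the loss
        batch["labels"].append(f["labels"] + [label_pad_token_id] * label_pad)
    return batch


# Toy features with made-up token IDs
toy_features = [
    {"input_ids": [11, 12, 13, 1], "labels": [21, 1]},
    {"input_ids": [14, 1], "labels": [22, 23, 24, 1]},
]
toy_batch = pad_batch(toy_features)
print(toy_batch["input_ids"])  # [[11, 12, 13, 1], [14, 1, 0, 0]]
print(toy_batch["labels"])  # [[21, 1, -100, -100], [22, 23, 24, 1]]

# Before decoding labels, -100 has to be mapped back to a real token ID
decodable = [[tok if tok != -100 else 0 for tok in row] for row in toy_batch["labels"]]
print(decodable)  # [[21, 1, 0, 0], [22, 23, 24, 1]]
```

Padding to the longest sequence in each batch, rather than to a global maximum, keeps batches of short reviews small.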

In [29]:
# Remove the original dataset columns, keeping only the tokenized features
tokenized_datasets = tokenized_datasets.remove_columns(
    books_dataset["train"].column_names
)

In [30]:
# Example of the dynamic padding performed by the data collator
features = [tokenized_datasets["train"][i] for i in range(2)]
data_collator(features)

{'input_ids': tensor([[   653,   1957,   1314,    261,   2757,   1280,    435,    259,  29166,
            263,    269,    774,   5547,      1,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,      0,
              

In [31]:
# Instantiate the Hugging Face Seq2SeqTrainer
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"].select(range(1000)),
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [32]:
# Train!
trainer.train()

***** Running training *****
  Num examples = 1000
  Num Epochs = 4
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 1000


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum |
|---|---|---|---|---|---|---|
| 1 | No log | 6.281906 | 2.7186 | 0.3849 | 2.5045 | 2.5038 |
| 2 | No log | 4.781202 | 4.4 | 0.6151 | 4.2896 | 4.3328 |
| 3 | No log | 4.18102 | 6.6675 | 1.8033 | 6.5625 | 6.5321 |
| 4 | No log | 4.056785 | 6.6704 | 1.6617 | 6.6272 | 6.561 |



TrainOutput(global_step=1000, training_loss=9.3855029296875, metrics={'train_runtime': 597.7905, 'train_samples_per_second': 6.691, 'train_steps_per_second': 1.673, 'total_flos': 483876191723520.0, 'train_loss': 9.3855029296875, 'epoch': 4.0})

In [33]:
# Post-training evaluation
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 238
  Batch size = 4


{'epoch': 4.0,
 'eval_loss': 4.056784629821777,
 'eval_rouge1': 6.6704,
 'eval_rouge2': 1.6617,
 'eval_rougeL': 6.6272,
 'eval_rougeLsum': 6.561,
 'eval_runtime': 30.117,
 'eval_samples_per_second': 7.903,
 'eval_steps_per_second': 1.992}

In [34]:
# Set data format to torch
tokenized_datasets.set_format("torch")

In [35]:
# Reload the pretrained model so the custom training loop starts from the original weights
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

loading configuration file https://huggingface.co/google/mt5-small/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/97693496c1a0cae463bd18428187f9e9924d2dfbadaa46e4d468634a0fc95a41.dadce13f8f85f4825168354a04675d4b177749f8f11b167e87676777695d4fe4
Model config MT5Config {
  "_name_or_path": "google/mt5-small",
  "architectures": [
    "MT5ForConditionalGeneration"
  ],
  "d_ff": 1024,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "mt5",
  "num_decoder_layers": 8,
  "num_heads": 6,
  "num_layers": 8,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.19.1",
  "use_cache": true,
  "vocab_size": 250112
}

loading wei

In [53]:
# Create dataloaders
from torch.utils.data import DataLoader

batch_size = 4
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=batch_size,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=batch_size
)

In [54]:
# Get optimizer
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

In [55]:
# Pass to accelerator
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [56]:
# Get learning rate scheduler
from transformers import get_scheduler

num_train_epochs = 10
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
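
For intuition, the "linear" schedule with zero warmup simply decays the learning rate from its base value to zero over training. The sketch below mirrors that multiplier in plain Python; `linear_schedule_lr` is an illustrative helper, and the total of 1000 steps is an assumed value rather than the number this notebook would compute.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        # Warmup: ramp up linearly from 0 to base_lr
        return base_lr * step / max(1, num_warmup_steps)
    # Decay: ramp down linearly from base_lr to 0 at the final step
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


example_total_steps = 1000  # assumed value for illustration
for step in (0, 250, 500, 1000):
    lr = linear_schedule_lr(step, 2e-5, 0, example_total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

With no warmup the first optimizer step already uses the full base rate of 2e-5, and the rate reaches half that value at the midpoint of training.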

In [57]:
# Post processing function
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # ROUGE expects a newline after each sentence
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

    return preds, labels

In [None]:
# Custom training loop!
from tqdm.auto import tqdm
import torch
import numpy as np

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
            )

            generated_tokens = accelerator.pad_across_processes(
                generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
            )
            # If we did not pad to max length, we need to pad the labels too
            labels = accelerator.pad_across_processes(
                batch["labels"], dim=1, pad_index=tokenizer.pad_token_id
            )

            generated_tokens = accelerator.gather(generated_tokens).cpu().numpy()
            labels = accelerator.gather(labels).cpu().numpy()

            # Replace -100 in the labels as we can't decode them
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            if isinstance(generated_tokens, tuple):
                generated_tokens = generated_tokens[0]
            decoded_preds = tokenizer.batch_decode(
                generated_tokens, skip_special_tokens=True
            )
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

            decoded_preds, decoded_labels = postprocess_text(
                decoded_preds, decoded_labels
            )

            rouge_score.add_batch(predictions=decoded_preds, references=decoded_labels)

    # Compute metrics
    result = rouge_score.compute()
    # Extract the median ROUGE scores
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    result = {k: round(v, 4) for k, v in result.items()}
    print(f"Epoch {epoch}:", result)

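    # Save the model and tokenizer after each epoch
    # NOTE: `output_dir` is not defined elsewhere in this notebook; the name below is an assumption
    output_dir = f"{model_name}-finetuned-amazon-en-es-accelerate"
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        # Uploading requires a `repo` object (e.g. a `huggingface_hub.Repository` cloned
        # into `output_dir`), which this notebook never creates; uncomment once it exists:
        # repo.push_to_hub(
        #     commit_message=f"Training in progress epoch {epoch}", blocking=False
        # )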

In [60]:
# Load model
from transformers import pipeline

hub_model_id = "huggingface-course/mt5-small-finetuned-amazon-en-es"
summarizer = pipeline("summarization", model=hub_model_id)


In [61]:
# Function to print examples
def print_summary(idx):
    review = books_dataset["test"][idx]["review_body"]
    title = books_dataset["test"][idx]["review_title"]
    summary = summarizer(review)[0]["summary_text"]
    print(f"'>>> Review: {review}'")
    print(f"\n'>>> Title: {title}'")
    print(f"\n'>>> Summary: {summary}'")

In [62]:
# Print example 1
print_summary(100)

Your max_length is set to 20, but you input_length is only 9. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=4)


'>>> Review: Muy apropiado para mis hijas'

'>>> Title: Libros para el verano'

'>>> Summary: Muy buen libro'


In [63]:
# Print example 2
print_summary(0)

'>>> Review: Es una trilogia que se hace muy facil de leer. Me ha gustado, no me esperaba el final para nada'

'>>> Title: Buena literatura para adolescentes'

'>>> Summary: Muy facil de leer'
