# UNAM Digital Humanities Summer School
# Practical course: generating text (summarization)

# September 2023

Benjamin Piwowarski (benjamin.piwowarski@cnrs.fr)  -- MLIA/ISIR, Sorbonne Université

In this notebook, we'll look at a (conditional) text generation task: we want to generate a text in response to an input (a document to summarize, a user's question, etc.).

We'll also look at how to use the Hugging Face `Trainer` API to simplify writing the learning loop.

## Setting up the environment

The following cells create the appropriate Python environment, import important modules, and define a few useful functions.

In [None]:
# Installing the packages

try:
    from easypip import easyimport, easyinstall, is_notebook
except ModuleNotFoundError as e:
    get_ipython().run_line_magic("pip", "install easypip")
    from easypip import easyimport, easyinstall, is_notebook

easyinstall("torch")
easyinstall("numpy")
easyinstall("pandas")
easyinstall("nltk")
easyinstall("rouge_score")
easyinstall("datasets")
easyinstall("accelerate")
easyinstall("transformers")

In [None]:
import numpy as np
import torch
import random
import pandas as pd
import nltk
from datasets import load_dataset, load_metric
import datasets
from IPython.display import display, HTML

nltk.download('punkt')

%load_ext tensorboard

In [None]:
# Define the device on which to run the model

device = torch.device("cpu")
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")

print("Selected device", device)

In [None]:
def show_random_elements(dataset, num_examples=10):
    """Shows a random subset of a Hugging Face dataset as an HTML table"""
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Sample distinct indices (without replacement)
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))


In the following, we are going to use the [T5](https://arxiv.org/abs/1910.10683) model, which is a Transformer pre-trained on a set of tasks
in *seq2seq* mode. In particular, we use here the ["small" version](https://huggingface.co/t5-small) from Huggingface.

We use `AutoModelForSeq2SeqLM` to load encoder-decoder models.

In [None]:
from transformers import AutoModelForSeq2SeqLM

model_checkpoint = "t5-small"
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(device)

## Loading data

We use the [Datasets library](https://github.com/huggingface/datasets) provided by huggingface. This library provides access to a large number of text resources.

In [None]:
# To speed up processing, we only use 10% of the train set, and 10% of the validation one (5% for validation, 5% for test)
raw_datasets = load_dataset("xsum", split={"train": "train[:10%]", "validation": "validation[:5%]", "test": "validation[5%:10%]"})
metric = load_metric("rouge")

The `raw_datasets` object is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains the datasets used for training, validation and testing.

In [None]:
raw_datasets

# Dataset extract

Here's a random extract from the dataset, with a document (to be summarized) and the expected summary

In [None]:
show_random_elements(raw_datasets["train"], 4)

The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [None]:
metric

We can use the `compute` method to calculate a metric on a set of text pairs.

ROUGE metrics measure the overlap between the generated text and the reference, mostly in terms of shared n-grams (unigrams for ROUGE-1, bigrams for ROUGE-2, etc.). See this [blog](https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460) for a fuller description.
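
To make the n-gram overlap idea concrete, here is a minimal sketch of a ROUGE-N style score. It is a simplification, not the `rouge_score` implementation: it assumes whitespace tokenization and no stemming, and the helper names `ngrams` and `rouge_n` are introduced here for illustration only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Returns the multiset of n-grams of a token list"""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(prediction, reference, n=1):
    """Simplified ROUGE-N (assumption: whitespace tokenization, no stemming)"""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: an n-gram counts at most as often as it appears in both texts
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Same pair of texts as in the next cell
print(rouge_n("I was going to the park", "I was hiking", n=1))
print(rouge_n("I was going to the park", "I was hiking", n=2))
```

Precision is the share of predicted n-grams found in the reference, recall is the share of reference n-grams found in the prediction, and the F-measure combines both.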

In [None]:
predictions = ["I was going to the park"]
expected = ["I was hiking"]
metric.compute(predictions=predictions, references=expected)

## Data pre-processing

We're going to use the tokenizer corresponding to the pre-trained Transformer model `T5-small`.

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

To pre-process the data, we add "summarize: " to each text we need to summarize (this is how `T5` was trained). We therefore define a pre-processing function


In [None]:
max_input_length = 512
max_target_length = 128

def preprocess_function(examples):
    inputs = ["summarize: " + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

An example of use is shown below

In [None]:
for key, value in preprocess_function(raw_datasets['train'][:2]).items():
    print(key, value[0][:10], "...")

The dataset can be pre-processed to speed up further processing - using the `map` function.

In [None]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

# Using a seq2seq model

The Transformers library makes it very easy to use a Seq2Seq model, using the `generate` method ([documentation](https://huggingface.co/transformers/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin.generate)). We explore the various possibilities next.

We first define a `summarize` function that generates text by encoding the sentence to summarize with a prefix (by default, `summarize: `).

In [None]:
import torch

def summarize(model, sentence, prefix="summarize: ", **args):
    encoder_inputs = tokenizer(f"{prefix}{sentence}", return_tensors="pt")
    output = model.generate(encoder_inputs["input_ids"].to(device),  **args)
    return tokenizer.batch_decode(output, skip_special_tokens=True)

In [None]:
sentence = "Mary went to the zoo yesterday with her father. She saw plenty of animals. The one she preferred was a tiger, but she also enjoyed the monkeys."

display(HTML(f"Sentence to summarize: <b>{sentence}</b>"))

## Sampling

The simplest method is to sample each token from the distribution predicted by the model (`do_sample=True`).

In [None]:
summarize(model, sentence, num_return_sequences=10, max_length=50, output_scores=True, do_sample=True, early_stopping=True)
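
As an illustration of what sampling does at each step, the sketch below draws tokens from a fixed next-token distribution. The vocabulary and the probabilities are made up for the example; a real model would produce a new distribution after every generated token.

```python
import random
from collections import Counter

def sample_token(probs, rng=random):
    """Draws one token index according to its probability"""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical next-token distribution over a 4-token vocabulary
vocab = ["tiger", "monkey", "zoo", "father"]
probs = [0.6, 0.25, 0.1, 0.05]

rng = random.Random(0)  # fixed seed, for reproducibility
counts = Counter(vocab[sample_token(probs, rng)] for _ in range(1000))
print(counts.most_common())
```

Frequent tokens dominate, but unlikely tokens are still drawn from time to time, which is why pure sampling can produce incoherent text.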

## Nucleus sampling

In [None]:
# top_p is a probability mass, so it must be between 0 and 1 (0.9 is an illustrative value)
summarize(model, sentence, num_return_sequences=10, max_length=50, top_p=0.9, output_scores=True, do_sample=True, early_stopping=True)
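
Nucleus (top-p) sampling restricts sampling to the smallest set of most probable tokens whose cumulative probability reaches `top_p`. The sketch below shows this filtering step on a made-up distribution. `top_p_filter` is a hypothetical helper written for this example, not a function of the Transformers library.

```python
def top_p_filter(probs, top_p=0.9):
    """Keeps the most probable tokens until their mass reaches top_p, then renormalizes"""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    # Token indices sorted by decreasing probability
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so that the kept probabilities sum to 1
    return {i: probs[i] / mass for i in kept}

# Hypothetical next-token distribution
probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, top_p=0.7))
```

With `top_p=0.7`, only the two most probable tokens are kept. The next token is then sampled from this renormalized distribution.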

## Top-K sampling

In [None]:
summarize(model, sentence, num_return_sequences=10, max_length=50, do_sample=True, top_k=10, early_stopping=True)
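
Top-K sampling keeps only the `k` most probable tokens before sampling. A minimal sketch of this filtering step, on a made-up distribution and with a hypothetical helper `top_k_filter`, could look like this:

```python
def top_k_filter(probs, k=10):
    """Keeps the k most probable tokens and renormalizes their probabilities"""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in order)
    return {i: probs[i] / mass for i in order}

# Hypothetical next-token distribution
probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, k=2))
```

Unlike nucleus sampling, the number of candidate tokens is fixed, whatever the shape of the distribution.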

## Beam search

In [None]:
# Beam search

display(summarize(model, sentence, max_length=50, num_beams=5, num_return_sequences=5, early_stopping=True))
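
Beam search keeps the `num_beams` most probable partial sequences at each step instead of a single one. The sketch below illustrates the idea on a toy "model" whose transition probabilities are invented for the example. It is a simplified version: the `generate` method has further options that are not reproduced here.

```python
import math

# Hypothetical toy "model": next-token probabilities given the last token only
TRANSITIONS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"tiger": 0.4, "zoo": 0.35, "</s>": 0.25},
    "a": {"tiger": 0.9, "</s>": 0.1},
    "tiger": {"</s>": 1.0},
    "zoo": {"</s>": 1.0},
}

def beam_search(num_beams=2, max_length=5):
    """Returns the finished sequences, sorted by decreasing log-probability"""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens)
    finished = []
    for _ in range(max_length):
        # Extend every beam with every possible next token
        candidates = []
        for score, tokens in beams:
            for token, p in TRANSITIONS[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [token]))
        # Keep only the num_beams best hypotheses
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:num_beams]:
            (finished if tokens[-1] == "</s>" else beams).append((score, tokens))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[0], reverse=True)

for score, tokens in beam_search(num_beams=2):
    print(f"{math.exp(score):.2f}", " ".join(tokens))
```

With a single beam (greedy decoding), the search commits to "the" at the first step. With two beams, the sequence starting with "a" is kept and ends up with a higher overall probability.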

## Finetuning T5

As you can see, the model doesn't work all that well directly. We therefore need to fine-tune its parameters in order to obtain better-quality summaries. To do this, we'll use the `Seq2SeqTrainer` class.

The first step is to define the arguments via the [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments) class. The first argument is the name of the folder that will contain the *checkpoints*. There are plenty of other arguments that can control learning, but the main ones are given below (at least, in the context of this notebook!).

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq, AutoTokenizer

gradient_accumulation_steps = 4 if device.type == "cuda" else 1
eval_steps = 25
max_steps = 100

args = Seq2SeqTrainingArguments(
    "xp/summarization",

    # We evaluate every n steps
    evaluation_strategy = "steps",
    eval_steps=eval_steps * gradient_accumulation_steps,

    # Number of learning steps
    max_steps=max_steps * gradient_accumulation_steps,

    # Optimizer settings
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=25,

    # Batch size
    per_device_train_batch_size=64 // gradient_accumulation_steps,
    gradient_accumulation_steps=gradient_accumulation_steps,
    per_device_eval_batch_size=32,

    # Logging
    logging_dir="xp/summarization/runs",
    logging_steps=.1, # log 10 times
    logging_strategy="steps",

    # 3 checkpoints maximum
    save_total_limit=3,

    # Save the model at the same frequency as the evaluation
    save_steps=eval_steps * gradient_accumulation_steps,

    # Generate to evaluate
    predict_with_generate=True,

    # Speed up training using FP16 (floats with 16 bits)
    # only CUDA
    fp16=device.type == "cuda",

)
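
The batch-size settings above are chosen so that the number of examples used per parameter update stays the same with and without gradient accumulation. The short computation below checks this arithmetic. `effective_batch_size` is a helper introduced for this example, and a single device is assumed.

```python
def effective_batch_size(per_device_batch_size, accumulation_steps, num_devices=1):
    """Number of examples contributing to one parameter update"""
    return per_device_batch_size * accumulation_steps * num_devices

for accumulation_steps in (1, 4):
    # Same formula as in the training arguments
    per_device = 64 // accumulation_steps
    total = effective_batch_size(per_device, accumulation_steps)
    print(f"accumulation={accumulation_steps}, per device={per_device}, effective={total}")
```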

To prepare the data, specific pre-processing is required (padding of inputs as well as outputs) - the [DataCollatorForSeq2Seq](https://huggingface.co/transformers/main_classes/data_collator.html#datacollatorforseq2seq) class is used.

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
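
To give an idea of what this padding step involves, here is a simplified sketch in plain Python. It is not the actual `DataCollatorForSeq2Seq` code (which, among other things, returns tensors), and `pad_batch` is a hypothetical helper. It assumes a padding token id of 0 for the inputs and the value -100 for label positions that the loss should ignore.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pads inputs and labels to the longest sequence of the batch"""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        # The attention mask is 1 for real tokens and 0 for padding
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        # Padded label positions are marked so that they can be ignored by the loss
        batch["labels"].append(f["labels"] + [label_pad_token_id] * (max_lab - n_lab))
    return batch

# Two made-up examples of different lengths
features = [
    {"input_ids": [5, 6, 7], "labels": [8]},
    {"input_ids": [9], "labels": [10, 11]},
]
print(pad_batch(features))
```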

Next, you need to specify how the `Seq2SeqTrainer` metrics are to be calculated from the predictions. We will use the ROUGE metrics defined above.

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # We replace -100 (ignored label positions) by the padding token id so we can decode
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Adds "\n" after each sentence (1 sentence per line for nltk)
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    # Computes the metric
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Results (x 100)
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

    # Adds mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

Lastly, a `Seq2SeqTrainer` is used to start the learning process:

In [None]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

We set up TensorBoard to follow the learning process:

In [None]:
%tensorboard --logdir xp/summarization/runs

Now just use the `train` method - and wait (about 5 minutes if you haven't changed the parameters)!

In [None]:
trainer.train()
model.to(device); # just in case

# Using the model

We can now look at the results after learning.

In [None]:
display(summarize(model.to(device), sentence, max_length=50, num_beams=5, num_return_sequences=5, early_stopping=True))

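Note that the model has been saved during training and can be reloaded later, which avoids training it again. Checkpoint folders are named `checkpoint-<step>`, so adjust the step number below to a checkpoint that exists in `xp/summarization`:

```py
model = AutoModelForSeq2SeqLM.from_pretrained("xp/summarization/checkpoint-10").to(device)
```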

# Going deeper

The objective of this section is to look more closely at how a Transformer model like T5 generates text. Instead of using the `generate` method, we are going to use the model at a lower level, step by step.

## 1. Tokenization

In [None]:
tokenized = tokenizer(f"summarize: {sentence}", return_tensors="pt")

# Show how the text has been converted
" / ".join(tokenizer.convert_ids_to_tokens(tokenized.input_ids[0])[:30])

## 2. Now, we encode the input using the model encoder

The encoder takes a sequence of $n$ tokens and computes their contextual representations in $\mathbb{R}^d$, available in the `last_hidden_state` attribute of the encoder output.

In [None]:
out_encoder=model.encoder(**tokenized.to(device))

# The output is 1 x n x d
out_encoder.last_hidden_state.shape


## 3. Generating the output

To generate the first token, we start from the special token `<pad>`, which T5 uses as the decoder start token. We then run the decoder on this sequence, conditioned on the encoder output. We obtain a representation of the decoder output for `<pad>`, which we project onto the vocabulary.

In [None]:
output = tokenizer("<pad>", return_tensors="pt", add_special_tokens=False).to(device)
decoder_output = model.decoder(**output, encoder_hidden_states=out_encoder.last_hidden_state)

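def get_top(decoder_output, k=10, do_print=True):
    """Returns the top-k logits and token ids for the last decoded position"""
    hidden = decoder_output.last_hidden_state
    if model.config.tie_word_embeddings:
        # T5 rescales the decoder output when input and output embeddings are tied
        hidden = hidden * (model.config.d_model ** -0.5)
    # Logits for the last position of the (single) sequence in the batch
    all_logits = model.lm_head(hidden)[0, -1]
    probs = all_logits.softmax(0)  # probabilities over the whole vocabulary
    top_logits, top_token_ids = all_logits.topk(k)
    if do_print:
        print(", ".join(f"p({token}) = {probs[ix]:.3f}" for token, ix in zip(tokenizer.convert_ids_to_tokens(top_token_ids), top_token_ids)))
    return top_logits, top_token_ids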

get_top(decoder_output);
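
The projection onto the vocabulary can be illustrated with a toy example: each vocabulary entry has an output vector, the logit of a token is the dot product between this vector and the decoder representation, and a softmax turns the logits into probabilities. The dimensions and values below are made up, and `project_and_softmax` is a helper written for this sketch.

```python
import math

def project_and_softmax(hidden, output_embeddings):
    """Computes one logit per vocabulary entry, then normalizes with a softmax"""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in output_embeddings]
    # Subtract the maximum for numerical stability
    m = max(logits)
    exps = [math.exp(logit - m) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical decoder representation (d = 2) and output matrix (3 tokens x d)
hidden = [1.0, 0.0]
output_embeddings = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]

probs = project_and_softmax(hidden, output_embeddings)
print([round(p, 3) for p in probs])
```

Greedy decoding then simply picks the token with the highest probability.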

In the next cell, we use the greedy decoding strategy to generate the summarized sentence. Feel free to implement others, like sampling, nucleus sampling or beam search!

In [None]:
# For reference: the output of `generate` with greedy decoding (a single beam)
print(summarize(model.to(device), sentence, max_length=50, num_beams=1, early_stopping=True))

# Loop and construct the sentence step by step
output_ids = tokenizer("<pad>", return_tensors="pt", add_special_tokens=False).to(device).input_ids

for ix in range(20):
    print(tokenizer.convert_ids_to_tokens(output_ids[0]), " -> ", tokenizer.decode(output_ids[0]))

    decoder_output = model.decoder(input_ids=output_ids, encoder_hidden_states=out_encoder.last_hidden_state)
    top_logits, top_token_ids = get_top(decoder_output)

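    # Append the most probable token (greedy choice) to the output sequence
    next_token_id = top_token_ids[0].view(1, 1)
    output_ids = torch.cat((output_ids, next_token_id), dim=1)
    print()

    # Stop once the end-of-sequence token has been generated
    if next_token_id.item() == tokenizer.eos_token_id:
        break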
