<a href="https://colab.research.google.com/github/Taaniya/explore-T5-model/blob/main/Explore_T5_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers SentencePiece


In [2]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Model, T5EncoderModel
import torch

#### Training
T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher forcing. This means that for training, we always need an input sequence and a corresponding target sequence. The input sequence is fed to the model using `input_ids`. The target sequence is shifted to the right, i.e., **prepended by a start-sequence token** and fed to the decoder using the `decoder_input_ids`. In teacher-forcing style, the target sequence is then appended by the EOS token and corresponds to the labels.

* T5 uses the PAD token as the start-sequence token.
* T5 can be trained / fine-tuned both in a supervised and unsupervised fashion.
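To make the relationship between the labels and the decoder inputs concrete, here is a minimal sketch in plain Python. The helper `shift_right` is hypothetical (written only for illustration, with lists standing in for tensors), and the ids are the ones the `t5-small` tokenizer produces for "Studies show that".

```python
# Minimal sketch of how teacher-forcing inputs are derived from a target sequence.
# Plain Python lists stand in for tensors; shift_right is a hypothetical helper.

PAD_TOKEN_ID = 0  # T5 uses the pad token as the decoder start token
EOS_TOKEN_ID = 1  # id of "</s>"


def shift_right(labels, start_token_id=PAD_TOKEN_ID):
    """Prepend the start token and drop the last position."""
    return [start_token_id] + labels[:-1]


# Token ids for the target "Studies show that", ending with the EOS token.
labels = [6536, 504, 24, EOS_TOKEN_ID]
decoder_input_ids = shift_right(labels)

# At step t the decoder reads decoder_input_ids[t] and is trained to predict labels[t].
for seen, target in zip(decoder_input_ids, labels):
    print(f"decoder input {seen:>5} -> label {target:>5}")
```

Each decoder position therefore only sees the tokens that precede the one it has to predict, which is what teacher forcing requires.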

Let's explore the [T5ForConditionalGeneration](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5ForConditionalGeneration) class first.
This class is a T5 model with a language modelling head on top of the decoder. Its forward pass returns the **prediction scores of the language modelling head** (scores for each vocabulary token before SoftMax) and, if labels are provided, also the **language modelling loss**. See snippets below.

In [3]:
T5ForConditionalGeneration

transformers.models.t5.modeling_t5.T5ForConditionalGeneration

In [4]:
T5ForConditionalGeneration.base_model

<property at 0x7fd70a716890>

#### Unsupervised denoising training

In this setup, spans of the input sequence are masked by so-called **sentinel tokens** (a.k.a. unique mask tokens), and the output sequence is formed as a concatenation of the same sentinel tokens and the real masked tokens. Each sentinel token stands for one masked span in the sentence, and the sentinels are numbered in order: `<extra_id_0>`, `<extra_id_1>`, … up to `<extra_id_99>`. By default, 100 sentinel tokens are available in `T5Tokenizer`.

For instance, the sentence “The cute dog walks in the park” with the masks put on “cute dog” and “the” should be processed as follows:

In [5]:
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565

The vocabulary is shared between the input and the output of the model. Hence the same tokenizer is used to tokenize both the encoder inputs and the labels (from which the decoder inputs are derived).

In [6]:
tokenizer.vocab_size

32100

In [7]:
input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids

# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels).loss
loss.item()

3.7837319374084473

In [8]:
input_ids

tensor([[   37, 32099, 10681,    16, 32098,  2447,     1]])

In [9]:
labels

tensor([[32099,  5295,  1782, 32098,     8, 32097,     1]])

By default, the tokenizer appends the `</s>` (EOS) token, id 1, to the end of every tokenized sequence, as seen at the end of both the inputs and the labels above.

In [10]:
tokenizer.decode(1)

'</s>'

In [11]:
tokenizer.decode(32099)

'<extra_id_0>'
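The sentinel layout used above can be sketched as a small helper. This is only an illustration in plain Python: `corrupt_spans` and `sentinel_id` are hypothetical names, the spans are chosen by hand rather than sampled, and the id formula is an assumption read off the `t5-small` outputs above (vocabulary size 32100, `<extra_id_0>` = 32099, `<extra_id_1>` = 32098).

```python
def corrupt_spans(words, spans):
    """Replace each (start, end) word span with a sentinel and build the target.

    spans must be sorted and non-overlapping; end is exclusive.
    """
    source, target = [], []
    cursor = 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        source.extend(words[cursor:start])  # keep the unmasked words
        source.append(sentinel)             # one sentinel per masked span
        target.append(sentinel)
        target.extend(words[start:end])     # the words hidden by this sentinel
        cursor = end
    source.extend(words[cursor:])
    target.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(source), " ".join(target)


def sentinel_id(n, vocab_size=32100):
    """Assumed id of <extra_id_n>: sentinels occupy the end of the vocabulary."""
    return vocab_size - 1 - n


words = "The cute dog walks in the park".split()
source, target = corrupt_spans(words, [(1, 3), (5, 6)])
print(source)
print(target)
print([sentinel_id(n) for n in range(3)])
```

The two printed strings are the same input and label texts that were passed to the tokenizer above.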

#### Supervised training
In this setup, the input sequence and output sequence form a standard sequence-to-sequence input-output mapping (no masking required).

Suppose that we want to fine-tune the model for translation for example, and we have a training example: the input sequence “The house is wonderful.” and output sequence “Das Haus ist wunderbar.”, then they should be prepared for the model as follows:

In [12]:
input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids

# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels).loss
loss.item()

0.25424423813819885

As we can see, only 2 inputs are required for the model in order to compute a loss: `input_ids` (which are the input_ids of the encoded input sequence) and `labels` (which are the `input_ids` of the encoded target sequence).

The model automatically creates the `decoder_input_ids` based on the `labels` provided, by shifting them one position to the right and prepending the `config.decoder_start_token_id`, which for T5 is equal to 0 (i.e. the id of the pad token). Also note the task prefix: we prepend the input sequence with ‘translate English to German: ’ before encoding it. This will help in improving the performance, as this task prefix was used during T5’s pre-training.

In [13]:
tokenizer.pad_token, tokenizer.pad_token_id

('<pad>', 0)

However, the example above only shows a single training example. In practice, one trains deep learning models in batches, which means that we must pad or truncate examples to the same length. For encoder-decoder models, one typically defines a `max_source_length` and a `max_target_length`, which determine the maximum length of the input and output sequences respectively (longer sequences are truncated). These should be set carefully depending on the task.
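As a rough illustration of what padding to the longest sequence means, the sketch below pads two id lists of different length and builds the matching attention mask. `pad_batch` is a hypothetical helper in plain Python, not the tokenizer's implementation, and its truncation is a naive slice (a real tokenizer treats special tokens such as `</s>` more carefully).

```python
def pad_batch(sequences, pad_token_id=0, max_length=None):
    """Truncate to max_length (naively), then right-pad to the longest sequence."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_token_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that attention should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Token ids of two inputs of different length (9 and 14 tokens, both ending in </s> = 1).
short = [13959, 1566, 12, 2379, 10, 5242, 12, 13465, 1]
long = [13959, 1566, 12, 2379, 10, 11560, 3896, 371, 3302, 19, 3, 9, 349, 1]

input_ids, attention_mask = pad_batch([short, long], max_length=512)
print(input_ids[0])
print(attention_mask[0])
```

The shorter sequence receives five pad ids, and its attention mask has zeros in exactly those positions.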

In addition, we must make sure that the padding token ids of the labels are not taken into account by the loss function. In PyTorch and TensorFlow, this can be done by replacing them with -100, which is the **ignore_index** of the CrossEntropyLoss.

According to the [PyTorch documentation for CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#crossentropyloss), the optional parameter `ignore_index` specifies a target value that is ignored and does not contribute to the input gradient. When `size_average` is True, the loss is averaged over non-ignored targets (`size_average` is deprecated in current PyTorch; `reduction="mean"` is the equivalent setting).
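The effect of `ignore_index` can be checked by hand on a toy example. The sketch below is a plain-Python re-implementation written for illustration only (it is not PyTorch's code); `mask_padding` and `cross_entropy` are hypothetical helpers, and the 3-token vocabulary with id 0 as the pad token is made up.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_padding(labels, pad_token_id=0):
    """Replace pad ids in the labels by IGNORE_INDEX."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in labels]


def cross_entropy(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose target is not ignored."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # this position contributes neither loss nor gradient
        log_norm = math.log(sum(math.exp(x) for x in row))
        losses.append(log_norm - row[target])
    return sum(losses) / len(losses)


# Toy 3-token vocabulary where id 0 plays the role of the pad token.
logits = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [5.0, 0.0, 0.0],  # padded position: the model confidently predicts "pad"
]
padded_labels = [2, 1, 0]
masked_labels = mask_padding(padded_labels)

loss_masked = cross_entropy(logits, masked_labels)
loss_unmasked = cross_entropy(logits, padded_labels)
print(masked_labels)
print(loss_masked, loss_unmasked)
```

Without masking, the easy-to-predict padding position pulls the average loss down, so the reported loss no longer reflects the real target tokens.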

We also pass `attention_mask` as an additional input to the model, which makes sure that padding tokens of the inputs are ignored. The code example below illustrates all of this.

In [25]:
# the following 2 hyperparameters are task-specific
max_source_length = 512
max_target_length = 128

# Suppose we have the following 2 training examples:
input_sequence_1 = "Welcome to NYC"
output_sequence_1 = "Bienvenue à NYC"

input_sequence_2 = "HuggingFace is a company"
output_sequence_2 = "HuggingFace est une entreprise"

# encode the inputs
task_prefix = "translate English to French: "
input_sequences = [input_sequence_1, input_sequence_2]

encoding = tokenizer(
    [task_prefix + sequence for sequence in input_sequences],
    padding="longest",
    max_length=max_source_length,
    truncation=True,
    return_tensors="pt",
)

input_ids, attention_mask = encoding.input_ids, encoding.attention_mask

# encode the targets
target_encoding = tokenizer(
    [output_sequence_1, output_sequence_2],
    padding="longest",
    max_length=max_target_length,
    truncation=True,
    return_tensors="pt",
)
labels = target_encoding.input_ids

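# replace padding token ids of the labels by -100 so they are ignored by the loss.
# Note: labels is the same tensor object as target_encoding.input_ids, so this
# in-place assignment also changes target_encoding (visible in its output below).
labels[labels == tokenizer.pad_token_id] = -100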

# forward pass
loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
loss.item()

0.1880139261484146

In [26]:
encoding

{'input_ids': tensor([[13959,  1566,    12,  2379,    10,  5242,    12, 13465,     1,     0,
             0,     0,     0,     0],
        [13959,  1566,    12,  2379,    10, 11560,  3896,   371,  3302,    19,
             3,     9,   349,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [33]:
target_encoding

{'input_ids': tensor([[10520, 15098,     3,    85, 13465,     1,  -100,  -100],
        [11560,  3896,   371,  3302,   259,   245, 11089,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])}

#### Inference

At inference time, it is recommended to use `generate()`. This method takes care of encoding the input, feeding the encoded hidden states via cross-attention layers to the decoder, and auto-regressively generating the decoder output. There’s also [this blog post](https://huggingface.co/blog/encoder-decoder#encoder-decoder) which explains how generation works in general in encoder-decoder models.

In [34]:
input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
outputs = model.generate(input_ids)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Das Haus ist wunderbar.




In [35]:
input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
outputs = model.generate(input_ids)

print(tokenizer.decode(outputs[0]))

<pad> Das Haus ist wunderbar.</s>


Note that T5 uses the pad_token_id as the decoder_start_token_id, so when doing generation without using generate(), make sure you start it with the pad_token_id.
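To illustrate what starting from the pad token looks like in a hand-written loop, here is a minimal greedy-decoding sketch. It is deliberately model-free: `next_token_scores` is a hypothetical callable standing in for a decoder forward pass (with a real T5 model it would return the last-position scores for the current decoder ids), and `toy_scores` is a made-up stand-in over a 5-token vocabulary.

```python
def greedy_decode(next_token_scores, start_token_id=0, eos_token_id=1, max_new_tokens=20):
    """Greedy autoregressive decoding.

    next_token_scores is a hypothetical callable: it receives the decoder ids
    generated so far and returns one score per vocabulary id for the next token.
    """
    decoder_ids = [start_token_id]  # T5: generation starts from the pad token
    for _ in range(max_new_tokens):
        scores = next_token_scores(decoder_ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        decoder_ids.append(next_id)
        if next_id == eos_token_id:
            break  # stop once </s> has been produced
    return decoder_ids


# Toy stand-in for a model over a 5-token vocabulary: it prefers
# "previous id + 2", and prefers EOS (id 1) once the previous id reaches 4.
def toy_scores(decoder_ids):
    last = decoder_ids[-1]
    preferred = 1 if last >= 4 else last + 2
    return [1.0 if i == preferred else 0.0 for i in range(5)]


print(greedy_decode(toy_scores))  # [0, 2, 4, 1]
```

The returned sequence begins with the start (pad) id and ends with the EOS id, the same shape as the `<pad> ... </s>` output shown above.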

The example above only shows a single example. You can also do batched inference, like so:

In [None]:
task_prefix = "translate English to German: "
# use different length sentences to test batching
sentences = ["The house is wonderful.", "I like to work in NYC."]

inputs = tokenizer([task_prefix + sentence for sentence in sentences], return_tensors="pt", padding=True)

output_sequences = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=False,  # disable sampling to test if batching affects output
)

print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))

['Das Haus ist wunderbar.', 'Ich arbeite gerne in NYC.']


Because T5 has been trained with the span-mask denoising objective, it can be used to predict the sentinel (masked-out) tokens during inference. The predicted tokens will then be placed between the sentinel tokens.

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids

sequence_ids = model.generate(input_ids)
sequences = tokenizer.batch_decode(sequence_ids)
sequences


['<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>']
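A decoded output like the one above can be split back into one span per sentinel and substituted into the masked input. The following is a small sketch using only the standard library; `split_sentinel_output` and `fill_masks` are hypothetical helper names, not part of `transformers`.

```python
import re

SENTINEL = re.compile(r"<extra_id_(\d+)>")


def split_sentinel_output(decoded):
    """Map each sentinel index to the text generated after it."""
    text = decoded.replace("<pad>", "").replace("</s>", "")
    # re.split with one capture group yields [prefix, idx0, text0, idx1, text1, ...]
    parts = SENTINEL.split(text)
    return {int(idx): span.strip() for idx, span in zip(parts[1::2], parts[2::2])}


def fill_masks(masked_input, fills):
    """Substitute each sentinel in the input by its predicted span, if any."""
    return SENTINEL.sub(lambda m: fills.get(int(m.group(1)), m.group(0)), masked_input)


decoded = "<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>"
fills = split_sentinel_output(decoded)
print(fills)
print(fill_masks("The <extra_id_0> walks in <extra_id_1> park", fills))
```

Only the spans for `<extra_id_0>` and `<extra_id_1>` are used here, because the input contains just those two sentinels; the text after `<extra_id_2>` is extra generation.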

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids

sequence_ids = model.generate(input_ids)
sequences = tokenizer.batch_decode(sequence_ids)
sequences

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


['<pad><extra_id_0> park is a short walk from the park. There are<extra_id_1> the<extra_id_2>park is']

Next, let's explore the [T5Model class](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Model). This class is the bare T5 transformer: it outputs raw hidden states without any specific head on top, so it cannot return a loss.

`last_hidden_state` here is the sequence of hidden states at the **output of the last layer of the decoder** of the model.

Also, because `T5Model` takes no `labels` argument, it does not build the decoder inputs for us: we need to shift the target sequence one position to the right and prepend the `decoder_start_token_id` ourselves before feeding it to the decoder as `decoder_input_ids`.

The `forward` method of this model instance includes the argument `decoder_input_ids` which expects the indices of decoder input sequence tokens in the vocabulary as its input.

In [None]:
model = T5Model.from_pretrained("t5-small")

input_ids = tokenizer(
    "Studies have been shown that owning a dog is good for you", return_tensors="pt"
).input_ids  # Batch size 1

decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1
print(f"decoder_input_ids before shifting right: {decoder_input_ids}")
# T5Model takes no labels argument, so we shift the decoder inputs ourselves:
# _shift_right prepends the start token (the pad token for T5) and drops the last position.
# T5ForConditionalGeneration does this internally when labels are passed.

decoder_input_ids = model._shift_right(decoder_input_ids)
print(f"decoder_input_ids after shifting right: {decoder_input_ids}")

# forward pass
outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
last_hidden_state = outputs.last_hidden_state

decoder_input_ids before shifting right: tensor([[6536,  504,   24,    1]])
decoder_input_ids after shifting right: tensor([[   0, 6536,  504,   24]])


In [None]:
last_hidden_state.shape

torch.Size([1, 4, 512])

Finally, let's explore the [T5EncoderModel class](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel). This class is the bare T5 encoder: it outputs the **encoder’s raw hidden states** without any specific head on top.

Unlike `T5Model`, the `forward` method of this model does not accept `decoder_input_ids`.

In [None]:
model = T5EncoderModel.from_pretrained("t5-small")
input_ids = tokenizer(
    "Studies have been shown that owning a dog is good for you", return_tensors="pt"
).input_ids  # Batch size 1
outputs = model(input_ids=input_ids)
last_hidden_states = outputs.last_hidden_state

In [None]:
last_hidden_states.shape

torch.Size([1, 15, 512])

#### References
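* [Hugging Face T5 documentation, Training section](https://huggingface.co/docs/transformers/model_doc/t5#training)
* [T5 paper](https://arxiv.org/pdf/1910.10683.pdf)
* [Hugging Face blog post on encoder-decoder models](https://huggingface.co/blog/encoder-decoder)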