In [1]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch 

# Transformers
Here we use the Transformer model [T5](https://huggingface.co/docs/transformers/model_doc/t5) to translate an English sentence into German.
## Input Embedding
### Tokenization
For more information on Hugging Face tokenizers, see [Summary of the tokenizers](https://huggingface.co/docs/transformers/tokenizer_summary).

The T5 tokenizer is based on [SentencePiece](https://huggingface.co/docs/transformers/tokenizer_summary#sentencepiece) in combination with the [Unigram](https://huggingface.co/docs/transformers/tokenizer_summary#unigram) algorithm (click [here](https://huggingface.co/docs/tokenizers/quicktour) to learn how to build your own tokenizer). SentencePiece treats the input as a raw stream of characters and includes the space in its set of characters. It then uses the Unigram algorithm to construct the vocabulary. Decoding with SentencePiece is straightforward: all tokens are concatenated, and the word-boundary marker `▁` is replaced by a space.
#### Converting words into IDs (and back)

In [2]:
# Using t5 model
checkpoint = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(checkpoint, legacy=False)
model = T5ForConditionalGeneration.from_pretrained(checkpoint, output_hidden_states=True)



Let's start by looking at the vocabulary and printing the tokens stored at a few selected indices.

In [3]:
voc_first_indices = [0, 1, 2, 3, 4, 5, 1674, 653]
for voc_token in voc_first_indices:
    print(tokenizer.decode(voc_token))

<pad>
</s>
<unk>

X
.
Ich
try


As we can see, the first three positions in the vocabulary are occupied by the special tokens `<pad>`, `</s>` (the end-of-sequence token, `EOS`), and `<unk>`. These are followed by individual characters and short pieces, and further down in the vocabulary we find complete words (such as `Ich` at position `1674`).

##### Tokenize the input sentence and show the input IDs
Next, we examine how the input sequence *"I try to understand the Transformer architecture"* is represented as **IDs** in the computer, so that the Transformer model is able to work with it.

*Note:* T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task. For our translation example we will use the prefix *"translate English to German: "*.

In [4]:
english_sentence = "I try to understand the Transformer architecture"
task_prefix = "translate English to German: "
input_text = task_prefix + english_sentence

input_ids = tokenizer.encode(input_text, return_tensors="pt")
input_ids

tensor([[13959,  1566,    12,  2968,    10,    27,   653,    12,   734,     8,
         31220,  4648,     1]])

##### Reconstructing the input sentence from the IDs
Each ID is the index of a **token** in the tokenizer's vocabulary. So let's convert the IDs back to tokens:

In [5]:
tokenizer.convert_ids_to_tokens(input_ids[0])

['▁translate',
 '▁English',
 '▁to',
 '▁German',
 ':',
 '▁I',
 '▁try',
 '▁to',
 '▁understand',
 '▁the',
 '▁Transformer',
 '▁architecture',
 '</s>']
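
The decoding rule described earlier (concatenate the pieces, then turn the `▁` marker into a space) can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the tokenizer's actual implementation: the helper name `pieces_to_text` is made up for this example, and the special tokens are dropped by hand.

```python
# Simplified sketch of SentencePiece-style detokenization.
# `pieces_to_text` is a hypothetical helper, not part of the transformers API.
SPECIAL_TOKENS = {"<pad>", "</s>", "<unk>"}
WORD_BOUNDARY = "\u2581"  # the "▁" marker that stands for a preceding space


def pieces_to_text(pieces):
    """Join token pieces and turn the word-boundary marker back into spaces."""
    kept = [piece for piece in pieces if piece not in SPECIAL_TOKENS]
    return "".join(kept).replace(WORD_BOUNDARY, " ").strip()


pieces = [
    "▁translate", "▁English", "▁to", "▁German", ":", "▁I", "▁try", "▁to",
    "▁understand", "▁the", "▁Transformer", "▁architecture", "</s>",
]
text = pieces_to_text(pieces)
print(text)
# translate English to German: I try to understand the Transformer architecture
```

In practice `tokenizer.decode(...)` does this work for you; the sketch only makes the rule visible.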

##### Run the model and show the token IDs
For inference, the `generate()` method is recommended. It encodes the input, passes the encoded hidden states to the decoder via the cross-attention layers, and then generates the decoder output auto-regressively, one token at a time. See [here](https://huggingface.co/blog/how-to-generate) for more information on how to generate text with Transformers.

The decoder output is also represented by the token IDs:

In [6]:
output = model.generate(input_ids, do_sample=False)
output

tensor([[    0,  1674,     3, 27085,    15,     6,    67, 31220,    18, 23533,
            23, 15150,  2905,   170, 19163,     1]])

##### Convert output IDs to tokens
After converting the token IDs back into tokens, the resulting translation, including the special tokens, is as follows:

In [7]:
decoded_output = tokenizer.decode(output[0], skip_special_tokens=False)
decoded_output

'<pad> Ich versuche, die Transformer-Architektur zu verstehen</s>'

*Note*: T5 uses the `pad` token ID as the start token ID of the decoder.
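
To make the auto-regressive loop and the role of the start token more concrete, here is a toy sketch of greedy decoding in plain Python. It rests on several assumptions: `toy_next_token_scores` is a hypothetical stand-in for the decoder's forward pass, the score table is invented, and the non-special token IDs are arbitrary placeholders. A real decoder conditions on the whole prefix and on the encoder's hidden states, and `generate()` supports other strategies besides greedy search.

```python
# Toy sketch of greedy auto-regressive decoding.
# `toy_next_token_scores` and TOY_SCORES are made up for illustration only.
PAD_ID = 0  # T5 also uses this ID as the decoder start token
EOS_ID = 1  # </s>

TOY_SCORES = {
    0: {1674: 0.9, 67: 0.1},
    1674: {67: 0.7, 1: 0.3},
    67: {1: 0.8, 1674: 0.2},
}


def toy_next_token_scores(prefix):
    """Return a score per candidate token, here based only on the last token."""
    return TOY_SCORES[prefix[-1]]


def greedy_decode(max_new_tokens=10):
    output = [PAD_ID]  # decoding starts from the start token
    for _ in range(max_new_tokens):
        scores = toy_next_token_scores(output)
        next_id = max(scores, key=scores.get)  # greedy: highest score wins
        output.append(next_id)
        if next_id == EOS_ID:  # stop once end-of-sequence is produced
            break
    return output


decoded_ids = greedy_decode()
print(decoded_ids)  # [0, 1674, 67, 1]
```

As in the real output above, the sequence begins with ID `0` and ends with ID `1`.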

So far we have seen how to turn each word of the input sentence into an ID that identifies that token (or that word in this example). That's the first step of the Transformer architecture! In the same manner, we have now witnessed the final step of the Transformer - that is, how the token IDs produced by the decoder are converted back into tokens/words. This conversion enables us to ultimately obtain the translation of the input sentence in German, making it readable for us humans.

### Embedding
The numbers obtained from tokenization, such as `653` for the word `try`, carry no meaning by themselves. The next step, in which the model captures some of the meaning behind the words (and what they could represent), is called **embedding**.
#### Breathe meaning into numbers
So how can we breathe meaning into numbers? This is done through a so-called **embedding matrix**.
##### Showing the embedding matrix of the model
This matrix holds one high-dimensional vector (here `512`-dimensional) for every token ID in the vocabulary, including the IDs of the special tokens. The result is an embedding matrix with dimensions $(n\_vocab, d\_model)$. Each vector of 512 numbers captures some of the meaning behind its token, and that is what the model uses to make sense of the input text. The values in this matrix are learned during training.

In [8]:
model.get_input_embeddings()  # Dimensions are: (Number of tokens in vocabulary, dimension of model)

Embedding(32128, 512)
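
Conceptually, an embedding layer is a lookup table with one row per token ID. The following dependency-free sketch illustrates that idea with a tiny matrix. The sizes, the random values, and the helper name `embed` are assumptions made for this example; the real model stores learned weights in a tensor of shape `(32128, 512)`.

```python
# Toy sketch: an embedding layer is a lookup table with one row per token ID.
# Sizes and values are made up; a real model learns these values in training.
import random

random.seed(0)
n_vocab, d_model = 6, 4

# Embedding matrix of shape (n_vocab, d_model), stored as a list of rows.
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in range(n_vocab)
]


def embed(token_ids):
    """Map each token ID to its row in the embedding matrix."""
    return [embedding_matrix[token_id] for token_id in token_ids]


vectors = embed([5, 2, 2])
print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[1] == vectors[2])  # True: the same ID always gives the same vector
```

A sequence of `seq_len` token IDs therefore becomes a `(seq_len, d_model)` array of vectors.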

In the remainder of this section, an example will illustrate the embedding for the input token `try` with ID `653`.

##### Get token with input ID 653 ('try')

In [9]:
tokenizer.decode(input_ids[0][6])

'try'

##### Get the embedding vector of the token with the input ID 653 ('try')

In [10]:
model.encoder.embed_tokens(torch.tensor(653))

tensor([ 16.6250,   0.8086, -24.6250, -19.7500, -17.6250,  -7.6875,   1.0312,
         -2.5781,  -0.6602,  22.0000, -49.7500,   4.1562,   0.8008,   5.2188,
        -23.6250,  12.0625,  14.6875,  -3.1094,   0.4609,  -8.5625,   1.0000,
         -3.1250,  -2.2188, -11.5000,  -6.4375,  24.5000,  13.1250,  -0.4023,
          3.1562,  21.3750,  10.0625, -13.5000, -17.0000,  23.0000, -28.6250,
         -9.0625,  17.7500, -21.5000, -10.6250,  11.4375,   6.9688,  -6.9688,
          8.7500,   2.6406,   0.2236,  -2.5312,   5.0312, -21.2500,  23.5000,
         36.2500,   2.2812, -40.7500, -16.7500, -20.8750,  -6.0000,  10.8125,
         16.0000,   6.5312,  -6.8750, -18.2500,  -6.1875,   5.9688,  23.1250,
        -22.3750,   7.8438,  30.3750,  10.6875,  26.3750,  -3.2812,  -3.7812,
         -3.7969, -14.1875, -24.1250, -29.7500, -11.3125,   2.8281,   8.9375,
         -9.5625,  -3.5781,  -5.2188,   2.3594,  15.3750,   3.7500,  25.1250,
          2.9531,   0.3047,  25.3750, -15.1875, -24.5000,  -6.43

In [11]:
print(f"The embedding vector is of dimension {len(model.encoder.embed_tokens(torch.tensor(653)))}!")

The embedding vector is of dimension 512!


The example above processes only a single sentence. You can also run batched inference, as the following section shows.

### Extra

Padding tokens play a vital role in Transformer implementations when dealing with batched inputs of varying lengths. Batching is employed for computational efficiency and parallel processing during Transformer training. However, the inherent variability in the length of natural language sequences leads to batches containing inputs of unequal sizes, making them unsuitable for direct transformation into fixed-size tensors. To address this challenge, practitioners utilize padding and truncation techniques. Padding involves appending a designated `pad` token to shorter sequences within the batch or adjusting them to conform to the model's maximum accepted sequence length, ensuring uniform sequence lengths across the batch. This uniformity is critical for processing the batch as a single tensor, enabling efficient parallelization and training. Conversely, truncation manages lengthy sequences by shortening them to meet specified length criteria. For advanced strategies and a detailed explanation of the provided API, please refer to the section on [Padding and truncation](https://huggingface.co/docs/transformers/pad_truncation).

#### Padding the outputs to the longest sentence when encoding multiple sentences
When encoding multiple sentences, you can automatically pad the outputs to the longest sentence present by doing:

In [12]:
input_1 = "What do you do?"
input_2 = "I try to understand the Transformer architecture."
sentences = [input_1, input_2]
inputs = tokenizer([task_prefix + sentence for sentence in sentences], return_tensors="pt", padding=True)
inputs["input_ids"]

tensor([[13959,  1566,    12,  2968,    10,   363,   103,    25,   103,    58,
             1,     0,     0,     0],
        [13959,  1566,    12,  2968,    10,    27,   653,    12,   734,     8,
         31220,  4648,     5,     1]])

As expected, the first sentence, which is three tokens shorter than the second, is extended by three `0`s so that both sequences have the same length.

*A brief reminder:* <br>
`0` is the token ID of the `<pad>` token. This becomes particularly evident after converting the IDs into tokens:

In [13]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

['▁translate',
 '▁English',
 '▁to',
 '▁German',
 ':',
 '▁What',
 '▁do',
 '▁you',
 '▁do',
 '?',
 '</s>',
 '<pad>',
 '<pad>',
 '<pad>']

In [14]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"][1])

['▁translate',
 '▁English',
 '▁to',
 '▁German',
 ':',
 '▁I',
 '▁try',
 '▁to',
 '▁understand',
 '▁the',
 '▁Transformer',
 '▁architecture',
 '.',
 '</s>']
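
What `padding=True` did to the two ID sequences above can be sketched in plain Python, together with the attention mask that tells the model which positions are real tokens (`1`) and which are padding (`0`). The helper `pad_batch` is hypothetical and only illustrates the idea; its `max_length` option is a naive truncation that may cut off the final `</s>`, which the real tokenizer handles more carefully.

```python
# Plain-Python sketch of padding a batch to its longest sequence.
# `pad_batch` is a hypothetical helper, not part of the transformers API.
PAD_ID = 0


def pad_batch(sequences, pad_id=PAD_ID, max_length=None):
    """Right-pad (and optionally naively truncate) ID lists to a common length."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (target - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


batch = [
    [13959, 1566, 12, 2968, 10, 363, 103, 25, 103, 58, 1],
    [13959, 1566, 12, 2968, 10, 27, 653, 12, 734, 8, 31220, 4648, 5, 1],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[0])
# [13959, 1566, 12, 2968, 10, 363, 103, 25, 103, 58, 1, 0, 0, 0]
print(attention_mask[0])
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

The tokenizer call above returns such a mask as `inputs["attention_mask"]`.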

To obtain the decoder output as human-readable text, you can use the `batch_decode` method provided by Hugging Face:

In [15]:
output_sequences = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=False,  # disable sampling to test if batching affects output
)
tokenizer.batch_decode(output_sequences, skip_special_tokens=True)

['Was tun Sie?', 'Ich versuche, die Transformer-Architektur zu verstehen.']

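And now we repeat this step without hiding the special tokens. Because all output sequences in a batch must have the same length, the shorter translation is filled with `<pad>` tokens after its `</s>`: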

In [16]:
tokenizer.batch_decode(output_sequences, skip_special_tokens=False)

['<pad> Was tun Sie?</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>',
 '<pad> Ich versuche, die Transformer-Architektur zu verstehen.</s>']