Sheet 3.1: Tokenization & Transformers
==============

**Author:** Polina Tsvilodub

In this sheet, we will focus on two topics: *tokenization*, the process of converting raw text into single tokens which are mapped onto neural-network friendly numerical representations; and *transformers*, the architecture behind state-of-the-art language models. The learning goals for this sheet are:
* understand core steps of tokenization
* learn to use state-of-the-art tokenization correctly
* learn to build transformers in PyTorch
* understand pretrained transformers like GPT-2 which can be loaded from HuggingFace, so as to be able to customize them for different tasks.

The contents of the sheet are inspired by [this](https://huggingface.co/learn/nlp-course/chapter2/2) tutorial. 

**To do**: say a word about sentence representation.

tokenization & dealing with context window size, understanding transformers and their architecture and configs (with native Torch code), positional encodings, different model heads, some plots and results of training dynamics and how to interpret them (if time suffices)

additional materials could be an outlook to different architectures: seq2seq, bidirectional transformers

## Tokenization

Under the hood, neural networks are complicated numerical functions; therefore, they cannot deal with raw text represented as strings, but instead use numerical representations of the text. This conversion is done by a *tokenizer*, and the process is called *tokenization*. Tokenization commonly consists of the following steps: 
1. splitting the input text into words, subwords, or symbols (like punctuation) that are called *tokens*
2. mapping each token to an integer (or, index)
3. adding additional inputs (or, *special tokens*) that may be useful to the model

The set of unique tokens of a given tokenizer is often called the *vocabulary*.
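
The three steps above can be sketched with a toy word-level tokenizer. This is only an illustration under simplifying assumptions, not how pretrained tokenizers are implemented: the splitting rule, the special-token strings `<s>`, `</s>`, `<unk>` and all function names are hypothetical choices made for this example.

```python
import re

# Hypothetical special-token strings, chosen for illustration only.
BOS, EOS, UNK = "<s>", "</s>", "<unk>"

def split_into_tokens(text):
    """Step 1: split the text into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(corpus):
    """Assign an integer index to every unique token, special tokens first."""
    vocab = {BOS: 0, EOS: 1, UNK: 2}
    for text in corpus:
        for token in split_into_tokens(text):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Steps 2 and 3: map tokens to indices, then add special tokens."""
    ids = [vocab.get(token, vocab[UNK]) for token in split_into_tokens(text)]
    return [vocab[BOS]] + ids + [vocab[EOS]]

toy_vocab = build_vocab(["Hi there!", "Hi again."])
print(toy_vocab)
print(encode("Hi there, friend!", toy_vocab))  # unseen tokens map to <unk>
```

Here the vocabulary is simply the set of keys of `toy_vocab`; anything that was not seen when building it (the comma and "friend") falls back to the unknown token.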

> <strong><span style="color:#D83D2B;">Exercise 3.1.1: Simple tokenization </span></strong>
> 
> We have seen the simplest, explicit version of tokenization in [sheet 2.4](https://cogsciprag.github.io/Understanding-LLMs-course/tutorials/02d-char-level-RNN.html). There, the "tokenizer" is just a simple mapping.
> 
> 1. What are the minimal units, i.e., tokens in sheet 2.4?
> 2. What is the range of indices representing tokens in sheet 2.4? I.e., how many different tokens are there?

However, this simple approach has several limitations. Specifically, the vocabulary is very limited and is manually defined a priori. While such an approach may work for simple tasks or very specific domains like predicting names, it is not very useful for dealing with more general texts which may include numbers, emojis, or different languages with different alphabets. Under this simple approach, we wouldn't be able to represent any of these things.

To allow for more flexibility but avoid having to manually specify all possible tokens, special tokenization approaches have been developed. The most prominent tokenization algorithm used, e.g., for the GPT models, is the so-called *byte-pair-encoding* (BPE) tokenization. BPE tokenizers are trained, i.e., they can be fitted to specific texts so as to optimally represent the data.

### BPE tokenization

Here are the core steps behind BPE tokenization:
* Text from the training corpus is normalized. That is, text undergoes general preprocessing like removing unnecessary whitespace, lower-casing, and possibly Unicode normalization.
* Text in the corpus is pre-tokenized. Here, the text is commonly split, e.g., into single words (i.e., by whitespace) and punctuation.
* Next, all characters used in the preprocessed text are identified. Special tokens are added (more on these below).
* Next, frequencies of pairs of adjacent symbols (initially, single characters) in the training corpus are counted. The most frequent pair is then *merged* into a single token, and the frequencies are counted again, now using the merged representation of the identified pair. This is repeated until a desired vocabulary size is reached. 
* After training, to tokenize a new text, the learned merge rules are applied to it and the respective tokens are assigned to the text.
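
The merge loop described above can be sketched in plain Python. This is a minimal toy version under several assumptions: the corpus is already normalized and pre-tokenized into words with made-up counts, special tokens are omitted, and training stops after a fixed number of merges instead of at a target vocabulary size. All function names are hypothetical.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` by one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn an ordered list of merge rules from word frequencies."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges

def apply_bpe(word, merges):
    """Tokenize a new word by applying the learned merges in order."""
    words = {tuple(word): 1}
    for pair in merges:
        words = merge_pair(words, pair)
    return list(next(iter(words)))

# Made-up word counts standing in for a pre-tokenized corpus.
toy_corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges = train_bpe(toy_corpus, num_merges=3)
print(merges)
print(apply_bpe("hugs", merges), apply_bpe("bug", merges))
```

Note that a word never seen in training ("bug") can still be tokenized, because it is built from known characters and merged pieces.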

The important take-aways are: 
* Tokenizers depend on their training data. This means that what exactly is represented by single tokens depends on the frequencies of different words and their contexts in the corpus. The size of the vocabulary is determined by the developers.
* Same characters or word-pieces can be mapped onto different tokens, depending on their context! For instance, the same word at the beginning or in the middle of a sentence can be represented with different tokens.
  
If you want to dive deeper into the algorithm, take a look at [this](https://huggingface.co/learn/nlp-course/en/chapter6/5#byte-pair-encoding-tokenization) tutorial. Another common approach is WordPiece tokenization; you can learn more [here]().

### Special tokens

Special tokens are an important aspect of tokenizers. They are called "special" because they carry special meaning rather than simply representing parts of a text. Common SOTA tokenizers have the following special tokens:
* **beginning-of-sequence (BOS)** (or, start-of-sequence, SOS) token: it is prepended to the start of every training sequence, and at inference time, it is used to signal to the LM that text should be predicted. In a sense, it represents a "signal to act".
   *  Different tokenizers use different strings as BOS. For instance, these could be "\<s\>", "|startofsequence|" or something similar.
* **end-of-sequence (EOS)** token: it is appended to the end of every training sequence so as to signal to the LM that a text is "finished" at a certain point. Thereby the LM learns to predict finite sequences. At inference time, once the LM samples the EOS, it stops generating further tokens. 
   *  Different tokenizers use different strings as EOS. For instance, these could be "\</s\>", "|endofsequence|" or something similar.
* **pad** token: it is used to extend a sequence to a certain number of tokens. This is needed for training LMs on batches of sequences. Since each batch of $n$ sequences is represented by a matrix with the dimensions $n \times m$, each sequence has to have a length of $m$ tokens. If a sequence has fewer than $m$ tokens, we add pad tokens until it has $m$ tokens.  
   *  The pad token could be, e.g., "[PAD]". Sometimes, however, pad tokens are not provided by a pretrained tokenizer. In this case, usually the EOS token is set as the pad token.
   *  It is important to note that pad tokens are an "engineering" necessity rather than tokens carrying meaning. Therefore, these are often *masked* during training, while other special tokens are not. More on masking below.
   *  We can set the padding side for a tokenizer, i.e., the side of the sequence on which the pad tokens are added. For auto-regressive models (i.e., common LMs), the padding side should be left when generating text for a batch of inputs, so that the model continues directly from the last real token of each input rather than from pad tokens. Concretely, a padded sequence should, e.g., look like this: "[PAD] [PAD] [PAD] Hi there!"
* **unknown (UNK)** token: it is used to represent a character or a part of a sequence which cannot be mapped to known tokens. Under some tokenization approaches (e.g., byte-level byte-pair encoding), every input can be represented by known tokens, so UNK tokens are not needed, and such tokenizers may not have them.
* **system** tokens: these tokens have been introduced more recently with chat-optimized and assistant LMs, and are used to mark different types of content for an LM. These are, for instance, special tokens which delineate system prompts, user inputs, previous model responses, etc. (more on prompting and assistants in the next lectures). 

These special tokens are added to the vocabulary of a tokenizer, i.e., represented by their own token indices.
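
To make the role of the pad token concrete, here is a small sketch of padding a batch of token-index sequences to a common length $m$. It is written in plain Python rather than with a tokenizer library; the function name `pad_batch`, the pad index `0` and the token indices are hypothetical. The sketch also builds an attention mask with `1` for real tokens and `0` for pad tokens, which is one common way of telling a model which positions to ignore.

```python
def pad_batch(sequences, pad_id, padding_side="left"):
    """Pad lists of token indices to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_pad = [0] * len(padding)   # pad positions are masked out
        mask_real = [1] * len(seq)      # real tokens are attended to
        if padding_side == "left":
            input_ids.append(padding + seq)
            attention_mask.append(mask_pad + mask_real)
        else:
            input_ids.append(seq + padding)
            attention_mask.append(mask_real + mask_pad)
    return input_ids, attention_mask

# Made-up token indices for a batch of n = 2 sequences.
batch = [[7], [11, 12, 13]]
ids, mask = pad_batch(batch, pad_id=0, padding_side="left")
print(ids)
print(mask)
```

With left padding, the last column of `ids` always holds a real token, which is the position an auto-regressive LM continues from.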

The important take-away is that, in order to get optimal performance from a trained LM, one must tokenize inputs in the same way as they were tokenized when the model was trained!

Outlook to tight LMs re: EOS tokens.

### Pretrained tokenizers

In practice, as with language models, we don't have to create tokenizers ourselves -- we can download pretrained tokenizers that were created for specific LMs and are shipped with them on HuggingFace (HF). We have already used a pretrained tokenizer under the hood in the previous sheets. Below, we take a closer look at the pretrained GPT-2 tokenizer.

TODO: Each token is mapped onto an embedding; therefore, vocab size and model must match. TODO: inspect the embedding layer of gpt-2.

In [1]:
from transformers import AutoTokenizer


In [2]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [3]:
tokenizer("Привет", return_tensors="pt")

{'input_ids': tensor([[  140,   253, 21169, 18849, 38857, 16843, 20375]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

In [4]:
tokenizer("Hello", return_tensors="pt")

{'input_ids': tensor([[15496]]), 'attention_mask': tensor([[1]])}
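
The outputs above show that "Привет" is split into seven tokens while "Hello" is a single token. One way to make sense of this, given that GPT-2 uses a byte-level BPE tokenizer operating on UTF-8 bytes, is to compare the number of characters and bytes of the two strings. The helper `utf8_summary` below is a hypothetical name used only for this sketch.

```python
def utf8_summary(text):
    """Return the number of characters and the number of UTF-8 bytes."""
    return len(text), len(text.encode("utf-8"))

for text in ["Hello", "Привет"]:
    n_chars, n_bytes = utf8_summary(text)
    print(f"{text!r}: {n_chars} characters, {n_bytes} bytes")
    print(list(text.encode("utf-8")))
```

Each Cyrillic letter takes two bytes, so the tokenizer starts from twelve byte-level symbols for "Привет". That it ends up with seven tokens rather than one is presumably because few merges covering Cyrillic were learned from the mostly English training data.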

> <strong><span style="color:#D83D2B;">Exercise 3.1.2: Pretrained tokenizers</span></strong>
> 
> 1. Inspect the three tokenization-related files that are loaded together with the pretrained GPT-2 [here](https://huggingface.co/openai-community/gpt2/tree/main). What does each of these files contain?
> 2. Find out about special Llama chat tokens.

**TODO**: masking