# 01. Building Blocks of LLMs: Tokenizers & Transformer Architecture

In this chapter, we're going to cover:

- The building blocks of language modeling, from concepts like tokenization and token embeddings to constructing the entire Transformer architecture from scratch.

- The Hugging Face APIs for tokenization, text generation, and more, using pre-trained LLMs such as GPT-2 and LLaMA.

> **Mini-Project**:
>
> Train a custom tokenizer (Hugging Face API compatible) for Moroccan Darija. We'll use the [DODa dataset](https://github.com/darija-open-dataset/dataset). The tokenizer should support both English and Moroccan Darija.

## 1. Introduction To Tokenization

### 1.1 What's Tokenization?

**What's tokenization and what is it used for?**  

How can we feed text into a language model? LLMs are large neural networks (commonly Transformer-based) that can only process numbers, so we need a process that converts strings into numbers. That process is **tokenization**: splitting raw text into small units called **tokens**. Each token is then mapped to an integer **ID**, and the set of all available tokens together with their IDs forms the **vocabulary**.

<img src="https://raw.githubusercontent.com/JamorMoussa/LLMs-Zero-To-Hero/refs/heads/main/assets/01/tokenizer.png"/>




There are three main types of tokenizers used in 🤗 Hugging Face Transformers: **Byte-Pair Encoding (BPE)**, **WordPiece**, and **SentencePiece**, which we will cover in the following sections.
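
To make the text → tokens → IDs pipeline described above concrete, here is a minimal sketch based on a naive whitespace split. It is a toy illustration only: real tokenizers learn their vocabulary from data and handle unseen words, and the names `corpus`, `vocab`, `encode`, and `decode` are made up for this example.

```python
# Toy illustration only: a whitespace "tokenizer" with a hand-built vocabulary.
corpus = ["welcome to the course", "welcome to tokenization"]

# Build the vocabulary: every distinct token gets an integer ID.
vocab = {}
for sentence in corpus:
    for token in sentence.split():
        if token not in vocab:
            vocab[token] = len(vocab)

# Reverse mapping, used to turn IDs back into text.
id_to_token = {idx: token for token, idx in vocab.items()}


def encode(text):
    """Split text into tokens and map each token to its ID."""
    return [vocab[token] for token in text.split()]


def decode(ids):
    """Map IDs back to tokens and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)


ids = encode("welcome to tokenization")
print(vocab)        # {'welcome': 0, 'to': 1, 'the': 2, 'course': 3, 'tokenization': 4}
print(ids)          # [0, 1, 4]
print(decode(ids))  # welcome to tokenization
```

Note that `encode` raises a `KeyError` for any word missing from `vocab`, which is exactly the weakness of word-level vocabularies discussed in the next section.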

### 1.2 Challenges

**Tokenizers** are trained on large text corpora to learn how to split text efficiently. This is a crucial step: as **Andrej Karpathy** points out, bad tokenization is behind many of the issues seen in LLMs.

> Tokenization is at the heart of much weirdness of LLMs. Do not brush it off.  
>   
> - Why can't LLM spell words? Tokenization.  
> - Why can't LLM do super simple string processing tasks like reversing a string? Tokenization.  
> - Why is LLM worse at non-English languages (e.g., Japanese)? Tokenization.  
> - Why is LLM bad at simple arithmetic? Tokenization.  
> - Why did GPT-2 have more than necessary trouble coding in Python? Tokenization.  

🚧 **Tokenization is not as easy as it seems!**  

While **character tokenization** is very simple and results in a minimal vocabulary, it makes it harder for the model to learn meaningful input representations.  

📚 On the other hand, **word tokenization** (i.e., splitting text by words) leads to a very large vocabulary, 📈 which increases memory and compute cost because the embedding matrix grows with the vocabulary size.

✨ To get the best of both worlds, Transformer models use a hybrid approach between word-level and character-level tokenization called **subword tokenization**.  

💡 **Subword tokenization** allows models to keep a reasonable vocabulary size while still learning meaningful text representations.
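
The trade-off between the two extremes can be sketched on a toy corpus. This is a hedged illustration, not a benchmark: the corpus and variable names are invented, and vocabulary sizes on such a tiny corpus are not representative of real data. What it does show is that character tokens yield long sequences but cover any new text, while word tokens yield short sequences but break on unseen words.

```python
corpus = ["the cat sat on the mat", "the dog sat on the log"]

char_vocab = set("".join(corpus))           # distinct characters
word_vocab = set(" ".join(corpus).split())  # distinct words

new_text = "the dogs sat"

# Character-level: long sequence, but every symbol is already known.
char_tokens = list(new_text)
unknown_chars = [c for c in char_tokens if c not in char_vocab]

# Word-level: short sequence, but unseen words fall outside the vocabulary.
word_tokens = new_text.split()
unknown_words = [w for w in word_tokens if w not in word_vocab]

print(len(char_tokens), unknown_chars)  # 12 []
print(len(word_tokens), unknown_words)  # 3 ['dogs']
```

A subword tokenizer aims for the middle ground, for instance by representing `dogs` as the known pieces `dog` and `s`.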

### 1.3 Subword Tokenization  

**TODO**: Refer to [tokenizer summary 🤗](https://huggingface.co/docs/transformers/en/tokenizer_summary) and write a section covering the **BPE**, **WordPiece**, and **SentencePiece** techniques.
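
As a rough illustration of the first of these techniques, the core training loop of BPE can be sketched in a few lines: start from characters, then repeatedly merge the most frequent pair of adjacent symbols. This is a simplified sketch under stated assumptions: it skips pre-tokenization, end-of-word markers, and byte-level handling, and the helper names `get_pair_counts` and `merge_pair` are illustrative rather than part of any library. The toy word counts follow the example used in the tokenizer summary linked above.

```python
from collections import Counter


def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy word frequencies; each word starts as a tuple of single characters.
word_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
word_freqs = {tuple(word): freq for word, freq in word_counts.items()}

merges = []
for _ in range(3):  # learn three merge rules
    pairs = get_pair_counts(word_freqs)
    best = max(pairs, key=pairs.get)
    merges.append(best)
    word_freqs = merge_pair(best, word_freqs)

print(merges)      # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(word_freqs)
```

After three merges, `hug` has become a single symbol while `pug` is still split into `p` and `ug`. The number of merges is the knob that controls the final vocabulary size.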


## 2. Hugging Face Tokenizer API

### 2.1 Encoding Text and Padding Batches

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

from pprint import pprint

In [2]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [3]:
text = "Welcome to LLMs zero to Hero Course 🤗"

pprint(tokenizer(text= text)) # returns 'input_ids' and 'attention_mask' as Python lists.

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [14618,
               284,
               27140,
               10128,
               6632,
               284,
               8757,
               20537,
               12520,
               97,
               245]}


In [12]:
pprint(tokenizer(text= text, return_tensors= "pt")) # returns PyTorch tensors

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'input_ids': tensor([[14618,   284, 27140, 10128,  6632,   284,  8757, 20537, 12520,    97,
           245]])}


In [11]:
sentences = [
    "Hello Everyone ✋",
    "Welcome to LLMs Zero-to-Hero Course 🤗"
]

pprint(tokenizer(text= sentences)) # a batch of sentences

{'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
 'input_ids': [[15496, 11075, 14519, 233],
               [14618,
                284,
                27140,
                10128,
                12169,
                12,
                1462,
                12,
                30411,
                20537,
                12520,
                97,
                245]]}


In [None]:
pprint(tokenizer(text= sentences, return_tensors= "pt"))

<img src="https://raw.githubusercontent.com/JamorMoussa/LLMs-Zero-To-Hero/refs/heads/main/assets/01/return_tensor_error.png"/>




The call fails because a PyTorch tensor must be rectangular: every sentence in the batch needs the same number of tokens, which is not the case here.

In [7]:
inputs = tokenizer(text= sentences) # list format

input_ids = inputs["input_ids"]
for sentence, ids in zip(sentences, input_ids):
    print(f"Sentence: {sentence} - number of tokens: {len(ids)}")

Sentence: Hello Everyone ✋ - number of tokens: 4
Sentence: Welcome to LLMs Zero-to-Hero Course 🤗 - number of tokens: 13


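To solve this issue, we need to pad the first sentence from 4 tokens to 13 tokens using a `pad_token`. The GPT-2 tokenizer doesn't define a `pad_token` by default, so we have to set one ourselves. A common choice is to reuse the end-of-text token (`<|endoftext|>`, ID 50256), which is the ID that appears as padding in the output below.

In [8]:
tokenizer.pad_token = tokenizer.eos_token # "[PAD]" is not in GPT-2's vocabulary; reuse <|endoftext|> (ID 50256)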

pprint(tokenizer(text= sentences, padding= True))

{'attention_mask': [[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
 'input_ids': [[15496,
                11075,
                14519,
                233,
                50256,
                50256,
                50256,
                50256,
                50256,
                50256,
                50256,
                50256,
                50256],
               [14618,
                284,
                27140,
                10128,
                12169,
                12,
                1462,
                12,
                30411,
                20537,
                12520,
                97,
                245]]}


In [14]:
inputs = tokenizer(text= sentences, return_tensors="pt", padding= True)
pprint(inputs)
print(f"shape: {inputs['input_ids'].shape}")

{'attention_mask': tensor([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'input_ids': tensor([[15496, 11075, 14519,   233, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256],
        [14618,   284, 27140, 10128, 12169,    12,  1462,    12, 30411, 20537,
         12520,    97,   245]])}
shape: torch.Size([2, 13])
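
What `padding=True` does here can be reproduced by hand. The following is a minimal pure-Python sketch, not the library's implementation: `pad_batch` is a hypothetical helper, it assumes right-side padding (as in the GPT-2 output above), and the token IDs are copied from that output.

```python
def pad_batch(batch_ids, pad_id):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# IDs copied from the outputs above; 50256 is the padding ID shown there.
batch = [
    [15496, 11075, 14519, 233],
    [14618, 284, 27140, 10128, 12169, 12, 1462, 12, 30411, 20537, 12520, 97, 245],
]
padded = pad_batch(batch, pad_id=50256)

print(padded["attention_mask"][0])  # [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print([len(ids) for ids in padded["input_ids"]])  # [13, 13]
```

Once every row has the same length, the batch can be stacked into a tensor of shape `[2, 13]`, and the attention mask tells the model which positions are padding.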


### 2.2 Chat Template

In [20]:
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

In [41]:
tokenizer

LlamaTokenizerFast(name_or_path='HuggingFaceH4/zephyr-7b-beta', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='left', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>', 'additional_special_tokens': ['<unk>', '<s>', '</s>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

In [37]:
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate"
    },
    {
        "role": "user",
        "content": "How many helicopters can a human eat in one sitting?"
    }
]

In [39]:
pprint(
    tokenizer.apply_chat_template(
        conversation= messages, tokenize= False, # return the rendered string instead of token IDs
        add_generation_prompt= True, # to start generation: '<|assistant|>\n'
    )
)

('<|system|>\n'
 'You are a friendly chatbot who always responds in the style of a pirate</s>\n'
 '<|user|>\n'
 'How many helicopters can a human eat in one sitting?</s>\n'
 '<|assistant|>\n')


In [48]:
inputs = tokenizer.apply_chat_template(
    conversation= messages, return_tensors= "pt", add_generation_prompt=True
)

pprint(inputs)

tensor([[  523, 28766,  6574, 28766, 28767,    13,  1976,   460,   264, 10131,
         10706, 10093,   693,  1743,  2603,  3673,   297,   272,  3238,   302,
           264, 17368,   380,     2, 28705,    13, 28789, 28766,  1838, 28766,
         28767,    13,  5660,  1287, 19624,   410,  1532,   541,   264,  2930,
          5310,   297,   624,  6398, 28804,     2, 28705,    13, 28789, 28766,
           489, 11143, 28766, 28767,    13]])
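
The rendered prompt shown earlier (the `tokenize= False` output) follows a simple pattern: a role header, the message content, an end-of-sequence token, and optionally an open assistant header. The sketch below mimics that pattern in plain Python. It is an approximation for intuition only: `render_zephyr_style` is a hypothetical helper, whereas the actual template is a Jinja template stored with the tokenizer (available as `tokenizer.chat_template`) and differs from model to model.

```python
def render_zephyr_style(messages, add_generation_prompt=False, eos_token="</s>"):
    """Hypothetical helper: mimic the rendered chat string shown above."""
    parts = []
    for message in messages:
        # Role header, then content, then the end-of-sequence token.
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {
        "role": "user",
        "content": "How many helicopters can a human eat in one sitting?",
    },
]

prompt = render_zephyr_style(messages, add_generation_prompt=True)
print(prompt)
```

In practice, prefer `tokenizer.apply_chat_template`, since each model ships its own template and a hand-written format can silently diverge from the one used during fine-tuning.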


In [49]:
tokenizer.decode(inputs[0])

'<|system|>\nYou are a friendly chatbot who always responds in the style of a pirate</s> \n<|user|>\nHow many helicopters can a human eat in one sitting?</s> \n<|assistant|>\n'

In [55]:
tokenizer.encode(
    text = tokenizer.apply_chat_template(
        conversation= messages, add_generation_prompt=True, tokenize= False),
    return_tensors= "pt"
)

tensor([[    1,   523, 28766,  6574, 28766, 28767,    13,  1976,   460,   264,
         10131, 10706, 10093,   693,  1743,  2603,  3673,   297,   272,  3238,
           302,   264, 17368,   380,     2, 28705,    13, 28789, 28766,  1838,
         28766, 28767,    13,  5660,  1287, 19624,   410,  1532,   541,   264,
          2930,  5310,   297,   624,  6398, 28804,     2, 28705,    13, 28789,
         28766,   489, 11143, 28766, 28767,    13]])

Note the difference between the two tokenized results: `tokenizer.encode` prepends the BOS token `<s>` (ID 1) to the rendered string, while the IDs returned directly by `apply_chat_template` start at `523` with no BOS token. Keep this in mind when mixing the two approaches, so that the model does not receive a missing or duplicated BOS token.