# Understanding LLMs and Transformers: Tokenization and Embeddings

## Introduction

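Large Language Models (LLMs) such as GPT, and related pretrained models such as BERT, are neural networks trained on large text corpora to process natural language: GPT-style models generate text, while BERT-style models produce representations used for understanding tasks.

They rely on the **Transformer architecture**, an attention-based design that has reshaped Natural Language Processing (NLP).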

Before diving into how transformers work, it is essential to understand the foundational concepts:
1. **Tokenization**: How text is broken into smaller units for the model to process.
2. **Embeddings**: How tokens are represented as numerical vectors.

---


## Tokenization

### What is Tokenization?

Tokenization is the process of splitting text into smaller units called **tokens**. Tokens could be:
- **Words**: `"The quick brown fox"` → `["The", "quick", "brown", "fox"]`
- **Subwords**: `"unbelievable"` → `["un", "believable"]`
- **Characters**: `"hello"` → `["h", "e", "l", "l", "o"]`

### Why Tokenize?

Models work with numbers, not raw text. Tokenization is the first step of that conversion: it splits text into tokens, and each token is then mapped to an integer ID that the model can process.

### Types of Tokenizers
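1. **Word-based Tokenizer**: Splits at spaces and punctuation (simple and fast, but it needs a very large vocabulary and cannot represent words it has never seen).
2. **Character-based Tokenizer**: Breaks text into individual characters (small vocabulary and no unknown words, at the cost of much longer sequences; can help with scripts that do not separate words with spaces).
3. **Subword-based Tokenizer**: Used in modern LLMs (e.g., Byte Pair Encoding, WordPiece).
   - Balances vocabulary size and token granularity: frequent words stay whole, while rare words are split into common subwords.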
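
To make the subword idea concrete, here is a minimal sketch of the merge loop behind Byte Pair Encoding. It assumes a tiny hypothetical corpus of word counts; the corpus, the `num_merges` value, and the helper names `count_pairs` and `merge_pair` are illustrative and not taken from any library. Production tokenizers add byte-level handling, special tokens, and many optimizations.

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: each word is a tuple of characters with a count.
corpus = {
    tuple("low"): 5,
    tuple("lower"): 2,
    tuple("lowest"): 3,
}

num_merges = 2  # illustrative; real vocabularies are built from many more merges
merges = []
for _ in range(num_merges):
    pairs = count_pairs(corpus)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    corpus = merge_pair(corpus, best)

print("Learned merges:", merges)
print("Segmented words:", list(corpus))
```

On this corpus the two merges build up the symbol `low`, so `lower` ends up segmented as `low`, `e`, `r`: the frequent stem becomes one token while the rarer suffix stays split.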


---

### Code Example: Tokenization

In [1]:
# Simple Tokenization Example
text = "Transformers are amazing! 🤯 AIJId878vs¤13"

# Naive word-based tokenizer
tokens = text.split()
print("Word-based Tokens:", tokens)

# Character-based tokenizer
char_tokens = list(text)
print("Character-based Tokens:", char_tokens)

Word-based Tokens: ['Transformers', 'are', 'amazing!', '🤯', 'AIJId878vs¤13']
Character-based Tokens: ['T', 'r', 'a', 'n', 's', 'f', 'o', 'r', 'm', 'e', 'r', 's', ' ', 'a', 'r', 'e', ' ', 'a', 'm', 'a', 'z', 'i', 'n', 'g', '!', ' ', '🤯', ' ', 'A', 'I', 'J', 'I', 'd', '8', '7', '8', 'v', 's', '¤', '1', '3']
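
Note that `text.split()` only splits on whitespace, so punctuation stays attached (`'amazing!'`). A minimal sketch of a word tokenizer that also separates punctuation, assuming a simple regular expression is good enough for illustration (the helper name `word_tokenize` is hypothetical):

```python
import re

def word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+     matches a run of letters, digits, or underscores
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Transformers are amazing!"))
```

This prints `['Transformers', 'are', 'amazing', '!']`, with the exclamation mark as its own token.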


In [2]:
# Using Hugging Face Tokenizer (Subword-based)
from transformers import AutoTokenizer

# Load a pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the text
subword_tokens = tokenizer.tokenize(text)
print("Subword Tokens:", subword_tokens)


Subword Tokens: ['transformers', 'are', 'amazing', '!', '[UNK]', 'ai', '##ji', '##d', '##8', '##7', '##8', '##vs', '##¤', '##13']


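### Output

After running the code above, you will see:
- **Word-based Tokens**: `['Transformers', 'are', 'amazing!', '🤯', 'AIJId878vs¤13']`. Punctuation stays attached to `amazing!` because `split()` only splits on whitespace.
- **Character-based Tokens**: `['T', 'r', 'a', 'n', ...]`, one token per character, including spaces.
- **Subword-based Tokens**: `['transformers', 'are', 'amazing', '!', '[UNK]', 'ai', '##ji', ...]`. The uncased BERT tokenizer lowercases the text, splits off punctuation, maps the emoji to `[UNK]` because it is not in the vocabulary, and breaks the unfamiliar string into smaller pieces. The `##` prefix marks a piece that continues the current word.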

---

### Mapping Tokens to Embeddings

**What Does "Mapping Tokens to Embeddings" Mean?**

Once the text is tokenized, the tokens are still symbolic representations (e.g., `['transformers', 'are']`). 

The model cannot process these directly because neural networks operate on numerical data. To enable computation:

1. Tokens are first **mapped to unique IDs** using a **vocabulary**.
2. These IDs are then mapped to their corresponding embeddings using an **embedding matrix**.

---

**Step 1: Token-to-ID Mapping**

- Each token in the vocabulary has a unique integer ID, fixed when the tokenizer was built.
- The tokenizer's vocabulary is a lookup table mapping tokens to IDs.
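
As a minimal sketch of what such a lookup table does, the snippet below uses a plain dictionary as a toy vocabulary. The five entries, their IDs, and the helper name `tokens_to_ids` are hypothetical; real vocabularies hold tens of thousands of entries.

```python
# Hypothetical toy vocabulary; IDs are made up for illustration.
vocab = {"[UNK]": 0, "transformers": 1, "are": 2, "amazing": 3, "!": 4}

def tokens_to_ids(tokens, vocab):
    """Look up each token, falling back to the [UNK] ID for unknown tokens."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(token, unk_id) for token in tokens]

# Reverse table for turning IDs back into tokens.
id_to_token = {idx: token for token, idx in vocab.items()}

ids = tokens_to_ids(["transformers", "are", "amazing", "!", "🤯"], vocab)
print(ids)
print([id_to_token[i] for i in ids])
```

The emoji is not in the toy vocabulary, so it maps to the `[UNK]` ID and decodes back as `[UNK]`; the original character is lost.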

**Code Example: Token-to-ID Mapping**

In [3]:
from transformers import AutoTokenizer

# Load a pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the text
# Alternative input: uncomment to see how the tokenizer handles Arabic script
# text = "سارة هي في كمبوديا أوتش شيلار"
text = "Transformers are amazing!"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens:", tokens)
print("Token IDs:", token_ids)

Tokens: ['transformers', 'are', 'amazing', '!']
Token IDs: [19081, 2024, 6429, 999]


**Step 2: Mapping IDs to Embeddings**

The model uses an embedding matrix (a large table) where:
- Rows represent tokens in the vocabulary.
- Columns represent dimensions of the embedding (e.g., 768 in BERT).

A token ID is used as an index to retrieve its corresponding row in the embedding matrix.

**How It Works:**

- Assume the vocabulary size is $V$ (e.g., 30,522 for `bert-base-uncased`) and the embedding size is $D$ (e.g., 768).
- The embedding matrix is then a $V \times D$ matrix.
- For each token ID, the model retrieves the corresponding row as the embedding.
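
One way to see why a row lookup is all that is needed: retrieving row $i$ gives the same result as multiplying a one-hot vector for $i$ by the $V \times D$ matrix. A minimal pure-Python sketch with a small made-up matrix ($V = 4$, $D = 3$); the values and helper names are illustrative only.

```python
# Hypothetical embedding matrix with V = 4 rows (tokens) and D = 3 columns.
E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_matrix_product(vec, matrix):
    """Compute vec @ matrix for a length-V vector and a V x D matrix."""
    num_cols = len(matrix[0])
    return [
        sum(vec[row] * matrix[row][col] for row in range(len(matrix)))
        for col in range(num_cols)
    ]

token_id = 2
by_lookup = E[token_id]  # direct row lookup
by_matmul = vector_matrix_product(one_hot(token_id, len(E)), E)

print(by_lookup)
print(by_matmul)
```

Both lines print the same row. Frameworks use the lookup form because it avoids building and multiplying a mostly-zero vector of length $V$.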


In [4]:
import torch

# Example embedding matrix (vocab_size=5, embedding_dim=3 for simplicity)
vocab_size = 5
embedding_dim = 3
embedding_matrix = torch.randn(vocab_size, embedding_dim)

print("Embedding Matrix:\n", embedding_matrix)


Embedding Matrix:
 tensor([[ 0.5127, -0.1147,  0.8532],
        [ 0.4294, -1.5529,  1.0748],
        [ 0.6209,  1.3002,  0.1265],
        [-0.2731, -0.3250,  0.1171],
        [ 1.4866,  0.8414,  0.6397]])


In [5]:
# Example token IDs
token_ids = torch.tensor([0, 3, 4])  # Simulated token IDs
print(f'token_ids: {token_ids}')

token_ids: tensor([0, 3, 4])


In [6]:
# Retrieve embeddings
embeddings = embedding_matrix[token_ids]
print("Retrieved Embeddings:\n", embeddings)

Retrieved Embeddings:
 tensor([[ 0.5127, -0.1147,  0.8532],
        [-0.2731, -0.3250,  0.1171],
        [ 1.4866,  0.8414,  0.6397]])
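
In PyTorch models this lookup is usually wrapped in `torch.nn.Embedding`, which stores the matrix as the layer's weight. A minimal sketch, assuming the same toy sizes as above (the random values will differ from the output shown earlier), that checks the layer returns the same rows as direct indexing:

```python
import torch

torch.manual_seed(0)  # fixed seed so repeated runs give the same random values

vocab_size, embedding_dim = 5, 3
embedding_matrix = torch.randn(vocab_size, embedding_dim)

# Wrap the existing matrix in an embedding layer (freeze=True: weights are not trained).
embedding_layer = torch.nn.Embedding.from_pretrained(embedding_matrix, freeze=True)

token_ids = torch.tensor([0, 3, 4])
from_layer = embedding_layer(token_ids)    # lookup through the layer
from_index = embedding_matrix[token_ids]   # lookup by direct indexing

print(from_layer.shape)
print(torch.equal(from_layer, from_index))
```

With `freeze=False` the same layer would update its rows during training, which is how a model learns its embeddings.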


---

### Pretrained Models: Token Mapping with Embeddings

In real-world models like BERT, the embedding matrix is learned during training, so tokens that appear in similar contexts tend to end up with similar vectors.
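
A common way to quantify how related two embeddings are is cosine similarity. The sketch below uses hand-picked 3-dimensional vectors that are purely hypothetical (they do not come from BERT or any trained model) to show the computation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, chosen by hand for illustration only.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

print("cat vs dog:", round(cat_dog, 3))
print("cat vs car:", round(cat_car, 3))
```

Because the vectors for `cat` and `dog` were chosen to point in nearly the same direction, their similarity is close to 1, while `cat` and `car` score much lower. The same function can be applied to rows of a trained embedding matrix, although similarities there are typically noisier than in this toy case.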

**Code Example: Mapping with a Pretrained Model**

In [11]:
from transformers import AutoTokenizer, AutoModel

# Load tokenizer and model (the tokenizer is reloaded so this cell runs on its own)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Input text
text = "Transformers are amazing!"

# Tokenize and get token IDs
inputs = tokenizer(text, return_tensors="pt")  # Token IDs are in inputs['input_ids']
token_ids = inputs["input_ids"]

# Retrieve the embedding matrix
embedding_layer = model.embeddings.word_embeddings

# Get embeddings for input tokens
embeddings = embedding_layer(token_ids)

print("Token IDs:", token_ids)
print("Embeddings Shape:", embeddings.shape)  # (batch_size, seq_len, hidden_dim)

Token IDs: tensor([[  101, 19081,  2024,  6429,   999,   102]])
Embeddings Shape: torch.Size([1, 6, 768])
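
The output has six IDs even though Step 1 produced four. Calling the tokenizer directly, rather than `tokenize` followed by `convert_tokens_to_ids`, also adds BERT's special tokens: `[CLS]` (ID 101) at the start and `[SEP]` (ID 102) at the end. A minimal sketch of that wrapping in plain Python; the helper name `wrap_with_special_tokens` is hypothetical, and the two IDs are those of the `bert-base-uncased` vocabulary.

```python
CLS_ID = 101  # [CLS] in the bert-base-uncased vocabulary
SEP_ID = 102  # [SEP] in the bert-base-uncased vocabulary

def wrap_with_special_tokens(token_ids):
    """Wrap a single sequence as [CLS] ... [SEP]."""
    return [CLS_ID] + list(token_ids) + [SEP_ID]

plain_ids = [19081, 2024, 6429, 999]  # IDs from Step 1
print(wrap_with_special_tokens(plain_ids))
```

This prints `[101, 19081, 2024, 6429, 999, 102]`, matching the `input_ids` tensor above, which is why the embeddings have a sequence length of 6.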


In [10]:
embeddings

tensor([[[ 0.0136, -0.0265, -0.0235,  ...,  0.0087,  0.0071,  0.0151],
         [ 0.0189, -0.0289, -0.0768,  ...,  0.0116, -0.0212,  0.0171],
         [-0.0134, -0.0135,  0.0250,  ...,  0.0013, -0.0183,  0.0227],
         [ 0.0168, -0.0245, -0.0513,  ..., -0.0209, -0.0529, -0.0645],
         [ 0.0298, -0.0373, -0.0356,  ...,  0.0161,  0.0192,  0.0173],
         [-0.0145, -0.0100,  0.0060,  ..., -0.0250,  0.0046, -0.0015]]],
       grad_fn=<EmbeddingBackward0>)

In [9]:
embeddings[0, 2]  # embedding vector of the third token ("are") in the first sequence

tensor([-1.3416e-02, -1.3482e-02,  2.5002e-02, -5.1527e-02, -5.1794e-02,
         9.6558e-03,  6.7361e-03, -5.4607e-02,  2.5705e-02,  7.2570e-03,
         4.4699e-03, -1.5962e-02, -6.9308e-02, -1.6379e-02, -1.2439e-02,
         8.5646e-04,  1.3288e-02, -3.7826e-02,  8.5045e-03, -2.1850e-02,
         3.6346e-02, -4.0202e-02, -1.4137e-02, -5.1339e-02, -1.6540e-02,
        -3.9799e-02,  1.6954e-02,  1.0004e-02, -2.2243e-02,  4.5188e-02,
         3.2355e-02, -8.1226e-05,  3.9237e-02, -2.4096e-02,  8.4741e-03,
         7.3312e-03,  6.7126e-03, -1.9774e-03, -8.1773e-02,  9.8314e-03,
         1.2894e-02,  3.2907e-02,  3.0951e-02,  3.2138e-02,  1.4520e-02,
        -2.9762e-02, -5.1314e-02,  1.1868e-02, -3.0954e-02,  1.3684e-02,
        -4.3889e-02, -1.1218e-02,  1.1214e-02, -4.5011e-02,  2.4642e-02,
        -1.6498e-03,  2.1277e-02, -1.3931e-02, -1.3391e-02,  3.5602e-03,
         3.8181e-02, -4.9932e-02, -2.6025e-02,  5.4804e-03,  3.5982e-03,
        -2.2312e-02, -5.7361e-03, -4.6598e-02,  9.9

In [16]:
len(embeddings[0, 2])  # length of one embedding vector (the hidden size)

768