# Understanding LLMs and Transformers: Tokenization and Embeddings

## Introduction

Large Language Models (LLMs) such as GPT and BERT are neural networks trained on large text corpora to understand and generate natural language.

They are built on the **Transformer architecture**, which has become the dominant approach in Natural Language Processing (NLP).

Before diving into how transformers work, it is essential to understand two foundational concepts:
1. **Tokenization**: How text is broken into smaller units for the model to process.
2. **Embeddings**: How tokens are represented as numerical vectors.

---


## Tokenization

### What is Tokenization?

Tokenization is the process of splitting text into smaller units called **tokens**. Depending on the tokenizer, a token can be a word, a subword, or a single character:
- **Words**: `"The quick brown fox"` → `["The", "quick", "brown", "fox"]`
- **Subwords**: `"unbelievable"` → `["un", "believable"]`
- **Characters**: `"hello"` → `["h", "e", "l", "l", "o"]`
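
The subword split shown above can be produced by greedy longest-match lookup against a vocabulary. The sketch below is a simplified illustration, not any library's implementation: `greedy_subword_split` and `toy_vocab` are hypothetical names, and the vocabulary is made up to reproduce the `"unbelievable"` example. Real WordPiece tokenizers additionally mark continuation pieces (for example with a `##` prefix), which is omitted here.

```python
def greedy_subword_split(word, vocab):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: treat the whole word as unknown.
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only to reproduce the example above.
toy_vocab = {"un", "believable", "believ", "able", "hello"}

print(greedy_subword_split("unbelievable", toy_vocab))  # ['un', 'believable']
print(greedy_subword_split("hello", toy_vocab))         # ['hello']
print(greedy_subword_split("xyz", toy_vocab))           # ['[UNK]']
```

Because the longest match wins, `"believable"` is preferred over the shorter `"believ"` even though both are in the toy vocabulary.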

### Why Tokenize?

Models work with numbers, not raw text. Tokenization is the first step of that conversion: it splits raw text into tokens, and each token is then mapped to an integer ID that the model can process.

### Types of Tokenizers
1. **Word-based Tokenizer**: Splits at spaces and punctuation. Simple and fast, but the vocabulary grows large and unseen words cannot be represented.
2. **Character-based Tokenizer**: Breaks text into individual characters. The vocabulary is tiny, and it can help with scripts that do not separate words with spaces, but sequences become much longer.
3. **Subword-based Tokenizer**: Used in modern LLMs (e.g., Byte Pair Encoding, WordPiece).
   - Balances vocabulary size and token granularity by splitting rare words into common subwords.
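
To give a feel for where "common subwords" come from, here is a minimal sketch of the merge-learning loop behind Byte Pair Encoding: start from characters and repeatedly merge the most frequent adjacent pair. The corpus, its word counts, and the helper names `most_frequent_pair` and `merge_pair` are all assumptions made for illustration; production tokenizers add many details (byte-level fallback, pre-tokenization, end-of-word markers) that are left out.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # Ties are broken by whichever pair was counted first in this sketch.
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Hypothetical toy corpus: each word as a tuple of characters, with a made-up count.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    merges.append(pair)

print("Learned merges:", merges)
print("Corpus after merging:", corpus)
```

With this toy corpus the first two merges build the frequent suffix `est` into a single symbol, which is the kind of reusable subword the tokenizer ends up storing in its vocabulary.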


---

### Code Example: Tokenization

In [None]:
# Simple Tokenization Example
text = "Transformers are amazing! 🤯 AIJId878vs¤13"

# Naive word-based tokenizer: splits on whitespace only, so punctuation stays attached
tokens = text.split()
print("Word-based Tokens:", tokens)

# Character-based tokenizer
char_tokens = list(text)
print("Character-based Tokens:", char_tokens)

In [None]:
# Using Hugging Face Tokenizer (Subword-based)
from transformers import AutoTokenizer

# Load a pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the text
subword_tokens = tokenizer.tokenize(text)
print("Subword Tokens:", subword_tokens)

### Output

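After running the code above, you will see output along these lines:
- **Word-based Tokens**: `['Transformers', 'are', 'amazing!', '🤯', 'AIJId878vs¤13']`
- **Character-based Tokens**: `['T', 'r', 'a', 'n', ...]`
- **Subword-based Tokens**: starts with `['transformers', 'are', 'amazing', '!', ...]`; the emoji and the random string are broken into further pieces or mapped to `[UNK]`, depending on the vocabulary.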
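
The naive `split()` call leaves punctuation attached to the preceding word (`'amazing!'`). A slightly less naive word-based tokenizer can separate punctuation with a regular expression. This is a minimal sketch; the variable name `sample` and the exact pattern are choices made for this illustration.

```python
import re

sample = "Transformers are amazing!"

# \w+ matches a run of letters, digits, or underscores;
# [^\w\s] matches any single character that is neither a word character nor whitespace.
word_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(word_tokens)  # ['Transformers', 'are', 'amazing', '!']
```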

---

### Mapping Tokens to Embeddings

**What Does "Mapping Tokens to Embeddings" Mean?**

Once the text is tokenized, the tokens are still strings (e.g., `['transformers', 'are']`).

Neural networks operate on numerical data, so the model cannot process these strings directly. To enable computation:

1. Tokens are first **mapped to unique IDs** using a **vocabulary**.
2. These IDs are then mapped to their corresponding embeddings using an **embedding matrix**.

---

**Step 1: Token-to-ID Mapping**

- Each token has a unique integer ID, assigned by the tokenizer.
- The tokenizer's vocabulary is a lookup table mapping tokens to IDs.
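
Conceptually, this lookup table behaves like a dictionary. The sketch below uses a hypothetical five-entry vocabulary (`toy_vocab`) and a hypothetical helper (`tokens_to_ids`) to show the idea, including the common convention of falling back to an unknown-token ID for tokens that are not in the vocabulary.

```python
# Hypothetical toy vocabulary; real tokenizers ship with tens of thousands of entries.
toy_vocab = {"[UNK]": 0, "transformers": 1, "are": 2, "amazing": 3, "!": 4}


def tokens_to_ids(tokens, vocab):
    """Look up each token's ID, falling back to the unknown-token ID."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(token, unk_id) for token in tokens]


print(tokens_to_ids(["transformers", "are", "amazing", "!"], toy_vocab))  # [1, 2, 3, 4]
print(tokens_to_ids(["transformers", "rock"], toy_vocab))                 # [1, 0]
```
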

**Code Example: Token-to-ID Mapping**

In [None]:
from transformers import AutoTokenizer

# Load a pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

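# Text to tokenize (swap in the commented-out Arabic sentence to see how a non-Latin script is split)
#text = "سارة هي في كمبوديا أوتش شيلار"
text = "Transformers are amazing!"
# Split into subword tokens, then look up each token's ID in the vocabulary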
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens:", tokens)
print("Token IDs:", token_ids)

**Step 2: Mapping IDs to Embeddings**

The model uses an embedding matrix (a large table) where:
- Rows represent tokens in the vocabulary.
- Columns represent dimensions of the embedding (e.g., 768 in BERT).

A token ID is used as an index to retrieve its corresponding row in the embedding matrix.

**How It Works:**

- Assume the vocabulary size is $V$ (e.g., about 30,000; `bert-base-uncased` uses 30,522) and the embedding size is $D$ (e.g., 768).
- The embedding matrix is then a $V \times D$ matrix.
- For each token ID, the model retrieves the corresponding row as the embedding.
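
The row lookup described above can equivalently be written as a matrix product with a one-hot vector. The symbols $E$, $x_i$, and $e_i$ are notation introduced here for illustration, and the numbers in the worked example are made up:

$$
e_i = x_i^\top E, \qquad E \in \mathbb{R}^{V \times D}, \quad x_i \in \{0, 1\}^{V} \text{ with a single 1 at position } i
$$

For example, with $V = 3$, $D = 2$, and token ID $i = 1$ (counting from zero):

$$
\begin{pmatrix} 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \\ 0.5 & 0.6 \end{pmatrix}
=
\begin{pmatrix} 0.3 & 0.4 \end{pmatrix}
$$

In practice the row is fetched directly by index, which gives the same result without the cost of multiplying by a mostly-zero vector.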


In [None]:
import torch

# Example embedding matrix (vocab_size=5, embedding_dim=3 for simplicity)
vocab_size = 5
embedding_dim = 3
embedding_matrix = torch.randn(vocab_size, embedding_dim)

print("Embedding Matrix:\n", embedding_matrix)


In [None]:
# Example token IDs
token_ids = torch.tensor([0, 3, 4])  # Simulated token IDs
print(f'token_ids: {token_ids}')

In [None]:
# Retrieve embeddings
embeddings = embedding_matrix[token_ids]
print("Retrieved Embeddings:\n", embeddings)

---

### Pretrained Models: Token Mapping with Embeddings

In real-world models like BERT, the embedding matrix is learned during training, so tokens that appear in similar contexts tend to end up with similar vectors.
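
One common way to quantify how related two embedding vectors are is cosine similarity. The sketch below is purely illustrative: the three-dimensional vectors in `toy_embeddings` are made up and do not come from BERT or any other model, and `cosine_similarity` is a hypothetical helper written with the standard library only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means they point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration only; not taken from any real model.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])
print(f"cat vs dog: {sim_cat_dog:.2f}")  # about 0.99
print(f"cat vs car: {sim_cat_car:.2f}")  # about 0.30
```

The same function could be applied to rows of a trained embedding matrix, although the similarities found there depend on the model and are not guaranteed to match intuition.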

**Code Example: Mapping with a Pretrained Model**

In [None]:
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the model (the weights are downloaded on first use)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Input text
text = "Transformers are amazing!"

# Tokenize and convert to IDs in one call; BERT's tokenizer also adds the special [CLS] and [SEP] tokens
inputs = tokenizer(text, return_tensors="pt")  # Token IDs are in inputs['input_ids']
token_ids = inputs["input_ids"]

# Retrieve the token embedding layer (its weight is the V x D embedding matrix)
embedding_layer = model.embeddings.word_embeddings

# Get embeddings for input tokens
embeddings = embedding_layer(token_ids)

print("Token IDs:", token_ids)
print("Embeddings Shape:", embeddings.shape)  # (batch_size, seq_len, hidden_dim)

In [None]:
# Embedding vector of the token at position 2 in the first (and only) sequence
embeddings[0, 2]