# Day 4 - Understanding Byte Pair Encoding (BPE) Tokenizer 

- So far, we explored what a tokenizer is and even built our own from scratch. 
- However, one of the key limitations of building a custom tokenizer is handling unknown or rare words. 
- This is where advanced tokenizers like OpenAI's tiktoken, which uses Byte Pair Encoding (BPE), really shine. 
- We also saw that language models don't read or understand text the way humans do. 
- Before any text can be processed by a model, it needs to be tokenized, that is, broken into smaller chunks called tokens. 
- One of the most efficient and widely adopted techniques to perform this is called Byte Pair Encoding (BPE). 

## What is Byte Pair Encoding? 
- Byte Pair Encoding is a data compression algorithm adapted for tokenization.
- Instead of treating words as whole units, it breaks them down into smaller, more frequent subword units. This allows it to:
    - Handle unknown words gracefully.
    - Strike a balance between character-level and word-level tokenization.
    - Keep the vocabulary much smaller than a word-level vocabulary.

## How BPE Works
### Step 1: Start with Characters 
- We begin by breaking every word in our corpus into individual characters: 
- "low", "lower", "newest", "widest"
- → ["l", "o", "w"], ["l", "o", "w", "e", "r"], ...

### Step 2: Count Pair Frequencies
- We count the frequency of adjacent character pairs (bigrams). 
- "l o": 2, "o w": 2, "w e": 2, "e s": 2, ...
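The counting step can be sketched in a few lines of Python. This is a minimal illustration, not tiktoken's implementation: the helper name `count_pairs` is made up for this example, and it assumes each word appears once in the corpus (real BPE training usually weights pairs by word frequency).

```python
from collections import Counter

# Toy corpus from the example above, split into characters (Step 1)
corpus = ["low", "lower", "newest", "widest"]
split_words = [list(word) for word in corpus]


def count_pairs(words):
    """Count how often each adjacent symbol pair occurs across all words."""
    pairs = Counter()
    for symbols in words:
        # zip(symbols, symbols[1:]) yields every adjacent pair in order
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs


pair_counts = count_pairs(split_words)
for pair in [("l", "o"), ("o", "w"), ("w", "e"), ("e", "s")]:
    print(pair, pair_counts[pair])  # each of these pairs occurs twice
```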

### Step 3: Merge the Most Frequent Pair 
- Merge the most frequent pair into a new token. In this toy corpus several pairs tie at a count of 2, so we pick one of them: 
- Merge "e s" → "es"
- Now "newest" becomes: ["n", "e", "w", "es", "t"]
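A single merge can be sketched as a small helper that scans a word's symbols left to right and fuses each occurrence of the chosen pair. The name `merge_pair` is hypothetical and only used for illustration.

```python
def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        # If the current symbol and the next one form the pair, fuse them
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


print(merge_pair(list("newest"), ("e", "s")))  # ['n', 'e', 'w', 'es', 't']
print(merge_pair(list("widest"), ("e", "s")))  # ['w', 'i', 'd', 'es', 't']
```

Words that do not contain the pair, such as "low", come back unchanged.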

### Step 4: Repeat Until Vocabulary Limit
- Continue this process until we reach the desired vocabulary size or until no more merges are possible. 
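Putting the four steps together gives a small training loop. This is a sketch under a few assumptions: `train_bpe`, `count_pairs`, and `merge_pair` are illustrative names, `num_merges` stands in for the vocabulary limit (each merge adds one new token), and ties are broken by whichever pair was seen first. Because of that tie-break, this sketch merges "l o" before "e s", which is a different but equally valid order from the walkthrough above.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs across all words (Step 2)."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` into one symbol (Step 3)."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (Step 4)."""
    words = [list(word) for word in corpus]  # Step 1: start with characters
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:  # every word is a single symbol: nothing left to merge
            break
        best = max(pairs, key=pairs.get)  # on ties, the first-seen pair wins
        merges.append(best)
        words = [merge_pair(symbols, best) for symbols in words]
    return merges, words


merges, words = train_bpe(["low", "lower", "newest", "widest"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('e', 's')]
print(words)
```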

## Why is BPE Powerful? 
- Efficient: It reuses frequent subwords to reduce redundancy. 
- Flexible: Handles rare and compound words better than word-level tokenizers. 
- Compact Vocabulary: A smaller vocabulary keeps the model's embedding and output layers smaller, which matters in large models. 
- It solves a key problem: How to tokenize unknown or rare words without bloating the vocabulary. 
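To see how this plays out for a word that never appeared in training, the sketch below applies a learned merge list to new input. The merge list is a hypothetical one for the toy corpus ("low", "lower", "newest", "widest"), and `encode_word` and `merge_pair` are illustrative names rather than part of any library.

```python
def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` into one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def encode_word(word, merges):
    """Split a word into characters, then replay the merges in learned order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Hypothetical merge rules learned from the toy corpus
merges = [("l", "o"), ("lo", "w"), ("e", "s")]

print(encode_word("lowest", merges))  # ['low', 'es', 't']
print(encode_word("xyz", merges))     # ['x', 'y', 'z']
```

"lowest" was never in the corpus, yet it is covered by known subwords, and a completely unfamiliar string simply falls back to single characters instead of an unknown-word token.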

## Example: Using tiktoken for BPE Tokenization 

```python
%pip install tiktoken
```

```python
import tiktoken

# Load the cl100k_base BPE encoding
encoding = tiktoken.get_encoding("cl100k_base")

text = "IdeaWeaver is building a tokenizer using BPE"

# Text -> token IDs
token_ids = encoding.encode(text)
print("Token IDs:", token_ids)

# Token IDs -> text (round trip)
decoded_text = encoding.decode(token_ids)
print("Decoded Text:", decoded_text)

# Decode each ID on its own to see the individual subword tokens
tokens = [encoding.decode([token_id]) for token_id in token_ids]
print("Tokens:", tokens)
```

Output:

```
Token IDs: [40, 56188, 1687, 7403, 374, 4857, 264, 47058, 1701, 426, 1777]
Decoded Text: IdeaWeaver is building a tokenizer using BPE
Tokens: ['I', 'dea', 'We', 'aver', ' is', ' building', ' a', ' tokenizer', ' using', ' B', 'PE']
```


## Final Thoughts
- Byte Pair Encoding may sound simple, but it's one of the key innovations that made today's large language models possible. 
- It strikes a balance between efficiency, flexibility, and robustness in handling diverse language input. 