## Tokens & Embeddings

1. **Tokens and embeddings** are fundamental to how large language models (LLMs) operate and are key to understanding their past, present, and future.

2. **Tokens** are small chunks of text (such as words or subwords) and are the basic units of input that LLMs process.

3. **Embeddings** are numeric vector representations of tokens that allow LLMs to perform computations on language.

4. The chapter explores **tokenization methods**—how raw text is broken into tokens for processing by LLMs. It also introduces **word2vec**, an early embedding technique that laid the groundwork for modern embedding approaches and is still used in systems like commercial recommendation engines.

5. The discussion progresses from token-level embeddings to **sentence or text embeddings**, where an entire sentence or document is represented as a single vector. These embeddings are critical for downstream tasks such as search, summarization, recommendation, and understanding context.


## Let's look at the tokens generated by Phi-3

In [None]:
%%capture
# Quote the requirements so the shell does not treat ">=" as an output redirect
!pip install "transformers>=4.40.1" "accelerate>=0.27.2"

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
# Load the corresponding tokenizer to handle text-token conversion
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
prompt = 'I am studing in grade IIIrd.Write a 5 line paragraph on global warming.'

In [None]:
token_ids = tokenizer(prompt,return_tensors='pt').input_ids.to('cuda')

In [None]:
token_ids

tensor([[  306,   626,  1921,   292,   297, 19468,  4786,  5499, 29889,  6113,
           263, 29871, 29945,  1196, 14880,   373,  5534,  1370,  4056, 29889]],
       device='cuda:0')

In [None]:
for i in token_ids[0]:
  print(f'{i:<5} --> {tokenizer.decode(i)}')

306   --> I
626   --> am
1921  --> stud
292   --> ing
297   --> in
19468 --> grade
4786  --> III
5499  --> rd
29889 --> .
6113  --> Write
263   --> a
29871 --> 
29945 --> 5
1196  --> line
14880 --> paragraph
373   --> on
5534  --> global
1370  --> war
4056  --> ming
29889 --> .


In [None]:
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",              # Place the model on the GPU
    torch_dtype="auto",             # Use the data type the checkpoint was saved in
    trust_remote_code=False         # Disable execution of remote custom code for security
)


In [None]:
tokens_generated = model.generate(token_ids,max_new_tokens=50,do_sample=True)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [None]:
tokens_generated

tensor([[  306,   626,  1921,   292,   297, 19468,  4786,  5499, 29889,  6113,
           263, 29871, 29945,  1196, 14880,   373,  5534,  1370,  4056, 29889,
            13,    13,    13,    13, 18567, 29991,   306, 29915, 29885,   263,
          2586,  6365,  1255,  1048,  5534,  1370,  4056,   322,   967,  9545,
         29889,   739,   338, 14171,   263,  4655,  1108,  9826, 29889,  9267,
         10916,   526, 28967,  1623, 10697,   304,  2693,  1009, 14368, 29889,
           910,   338,  3907,   278,  4799, 21180,  3860,   322,   884, 27668]],
       device='cuda:0')

In [None]:
for i in tokens_generated[0][token_ids.shape[-1]:]:
  print(f'{i:<5} --> {tokenizer.decode(i)}')

13    --> 

13    --> 

13    --> 

13    --> 

18567 --> Hi
29991 --> !
306   --> I
29915 --> '
29885 --> m
263   --> a
2586  --> bit
6365  --> wor
1255  --> ried
1048  --> about
5534  --> global
1370  --> war
4056  --> ming
322   --> and
967   --> its
9545  --> effects
29889 --> .
739   --> It
338   --> is
14171 --> becoming
263   --> a
4655  --> major
1108  --> problem
9826  --> today
29889 --> .
9267  --> Many
10916 --> countries
526   --> are
28967 --> cutting
1623  --> down
10697 --> trees
304   --> to
2693  --> develop
1009  --> their
14368 --> cities
29889 --> .
910   --> This
338   --> is
3907  --> making
278   --> the
4799  --> air
21180 --> poll
3860  --> uted
322   --> and
884   --> also
27668 --> reducing


In [None]:
# Let's look at BERT tokens

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


In [None]:
token_ids = tokenizer(prompt,return_tensors='pt').input_ids.to('cuda')

In [None]:
token_ids

tensor([[  101,   146,  1821, 24084,  1158,  1107,  3654,  2684,  2956,   119,
           160, 10587,   170,   126,  1413, 24950,  1113,  4265, 14110,   119,
           102]], device='cuda:0')

In [None]:
for i in token_ids[0]:
  print(f'{i:<5} --> {tokenizer.decode(i)}')

101   --> [CLS]
146   --> I
1821  --> am
24084 --> stud
1158  --> ##ing
1107  --> in
3654  --> grade
2684  --> III
2956  --> ##rd
119   --> .
160   --> W
10587 --> ##rite
170   --> a
126   --> 5
1413  --> line
24950 --> paragraph
1113  --> on
4265  --> global
14110 --> warming
119   --> .
102   --> [SEP]


In [None]:
# Let's look at GPT-2 tokens
tokenizer = AutoTokenizer.from_pretrained("gpt2")


In [None]:
token_ids = tokenizer('explain BPE in points 📌',return_tensors='pt').input_ids.to('cuda')

In [None]:
token_ids

tensor([[20676,   391,   347, 11401,   287,  2173, 12520,   241,   234]],
       device='cuda:0')

In [None]:
for i in token_ids[0]:
  print(f'{i:<5} --> {tokenizer.decode(i)}')

20676 --> expl
391   --> ain
347   -->  B
11401 --> PE
287   -->  in
2173  -->  points
12520 -->  �
241   --> �
234   --> �


In [None]:
tokenizer.decode([12520,241,234])

' 📌'
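
The `�` lines above appear because GPT-2's BPE works on bytes: each of the last three tokens holds only part of the emoji's UTF-8 byte sequence, so decoding one of them alone cannot produce a valid character. The sketch below illustrates the same effect with plain Python and no tokenizer; the two-byte slice is an arbitrary choice for illustration, not the exact split GPT-2 uses.

```python
# Inspect the UTF-8 bytes behind the pin emoji (no tokenizer needed).
emoji = "\U0001F4CC"  # the pin emoji used in the prompt above
raw = emoji.encode("utf-8")
print(len(raw))                 # number of bytes in the encoding
print([hex(b) for b in raw])    # the individual byte values

# Decoding an incomplete byte sequence cannot produce the emoji;
# errors="replace" substitutes U+FFFD, which prints as the "?"-diamond glyph.
partial = raw[:2].decode("utf-8", errors="replace")
print(repr(partial))

# Decoding all the bytes together recovers the character.
whole = raw.decode("utf-8")
print(whole == emoji)
```

This is why `tokenizer.decode([12520,241,234])` recovers the emoji while decoding each ID separately does not.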

## How does a tokenizer break down text?

1. **Tokenizer behavior** depends on three major factors: the tokenization method, tokenizer design choices, and the training dataset.

2. **Tokenization methods** like Byte Pair Encoding (BPE) and WordPiece are chosen at model design time and define how text is split into tokens.

3. **Tokenizer design choices** include decisions about vocabulary size and special tokens (like padding, start/end markers, etc.).

4. The **training dataset** used to train the tokenizer plays a crucial role—different datasets (e.g., English, code, multilingual) result in different vocabularies, even with the same tokenization method.

5. Tokenizers are **trained to create the most efficient vocabulary** to represent the input data they are exposed to.

6. Tokenizers are used not only to convert **text into token IDs (input)** but also to convert **model output token IDs back into text**. Therefore, the tokenizer impacts both **model input and output interpretation**, making it a critical component in how language models function.
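
The two directions described in the last point can be sketched with a toy tokenizer. Everything here is a simplifying assumption: the five-entry vocabulary is made up, and splitting on whitespace stands in for a trained method such as BPE or WordPiece.

```python
# Toy tokenizer: a fixed vocabulary mapping tokens to IDs and back.
# The vocabulary and the whitespace splitting are illustrative assumptions.
vocab = {"[UNK]": 0, "global": 1, "warming": 2, "is": 3, "real": 4}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text):
    """Text -> token IDs (the model input side)."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

def decode(ids):
    """Token IDs -> text (the model output side)."""
    return " ".join(id_to_token[i] for i in ids)

ids = encode("Global warming is real")
print(ids)
print(decode(ids))

# A word missing from the vocabulary falls back to the unknown token.
print(decode(encode("Global cooling is real")))
```

A vocabulary built from a different dataset would assign different IDs and cover different words, which is why the training data matters as much as the method.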


## Word Versus Subword Versus Character Versus Byte Tokens

1. **Subword tokenization** is the most common scheme in modern NLP and combines whole and partial word tokens to improve vocabulary efficiency and handle unseen words.

2. **Word tokenization** (used in older models like word2vec) treats each word as a token but struggles with new or rare words and creates a bloated vocabulary.

3. **Subword tokens** solve this by breaking words into reusable parts (e.g., "apolog" + "-y", "-ize", etc.), allowing better generalization and compact vocabularies.

4. **Character tokenization** breaks text into individual characters (e.g., "p-l-a-y") which makes tokenization simple and robust to new words but increases modeling complexity and reduces context efficiency.

5. Subword tokenization is **more efficient** for Transformer models with limited context windows, fitting **more text** per sequence compared to character tokens.

6. **Byte tokenization** splits text into the individual bytes that encode its Unicode characters, supporting tokenization-free approaches that are especially useful in **multilingual contexts**. **ByT5** works directly on UTF-8 bytes, while **CANINE** works on Unicode characters; both explore models that eliminate the need for a learned subword vocabulary.

7. Some subword tokenizers (like those in **GPT-2** and **RoBERTa**) include **bytes as fallback tokens** for unknown characters, but they’re **not fully byte-level models**.

8. Each tokenization method has trade-offs in **efficiency**, **handling unknown inputs**, and **model complexity**.
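
To make the sequence-length trade-off concrete, the sketch below splits the same string at word, character, and byte level using only built-in Python. Subword splitting is left out because it needs a trained vocabulary; the example strings are arbitrary.

```python
text = "playing fair"

word_tokens = text.split()                 # word-level
char_tokens = list(text)                   # character-level
byte_tokens = list(text.encode("utf-8"))   # byte-level (integers 0-255)

print(len(word_tokens), word_tokens)
print(len(char_tokens), char_tokens)
print(len(byte_tokens), byte_tokens[:5])

# Non-ASCII characters take several bytes, so byte sequences can be
# longer than character sequences for the same text.
accented = "caf\u00e9"
print(len(list(accented)), len(accented.encode("utf-8")))
```

The finer the granularity, the longer the sequence for the same text, which is the context-efficiency cost mentioned above.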


## BPE and WordPiece

### 🔍 What is Byte Pair Encoding (BPE)?

**Byte Pair Encoding (BPE)** is a **subword tokenization algorithm** used in many modern language models (e.g., GPT-2, GPT-3, RoBERTa). It breaks down words into smaller units (subwords), allowing models to handle **rare or unseen words** efficiently.

---

### 🧠 Core Idea:

BPE starts with characters as tokens and **iteratively merges the most frequent pairs** of adjacent tokens (initially characters) into new tokens. This process continues until a predefined **vocabulary size** is reached.

---

### 🧩 Step-by-Step Example:

Let’s say your corpus is:

```
low lower newest wider
```

#### Step 1: Initial Tokenization (Character-level)

```
l o w
l o w e r
n e w e s t
w i d e r
```

Each word is split into characters, and a special end-of-word marker (like `</w>`) may be added to mark boundaries.

---

#### Step 2: Count Most Frequent Pairs

Count all adjacent token pairs:

* "l o": 2
* "o w": 2
* "e r": 2
* "n e": 1
* ...

---

#### Step 3: Merge the Most Frequent Pair

Several pairs tie at a count of 2 here. Suppose the tie is broken in favor of "e r", and merge it:

* "e r" → "er"

Update all sequences:

```
l o w
l o w er
n e w e s t
w i d er
```

---

#### Step 4: Repeat

The next most frequent pair might be "l o" → "lo":

```
lo w
lo w er
n e w e s t
w i d er
```

Keep merging until the vocabulary reaches the desired size.
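
The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by any particular library: it omits the `</w>` end-of-word marker, and it breaks ties by taking the first pair encountered, so the merge order can differ from the walkthrough (which assumed "e r" was merged first). The function names are hypothetical.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        # Assumption: ties go to the first pair seen; real tokenizers differ.
        best = max(counts, key=counts.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, words = train_bpe("low lower newest wider", num_merges=3)
print(merges)
print([" ".join(w) for w in words])
```

In practice `num_merges` is set indirectly through the target vocabulary size, since each merge adds one new token.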

---

### ✅ Why is BPE Useful?

1. **Handles rare words** by splitting them into known subwords.
2. Reduces **vocabulary size**, compared to full-word tokenization.
3. Balances between **character-level flexibility** and **word-level efficiency**.
4. Allows representation of **unseen words** by combining known subword tokens.

---

### 💡 Analogy:

Think of BPE like **file compression**:

* You find the most repeated patterns (pairs of characters).
* Replace them with shorthand codes (merged tokens).
* This makes storage (vocabulary) efficient while retaining information.

---

### 🧪 Real-World Use:

* GPT-2, GPT-3 use **BPE**.
* RoBERTa also uses **BPE**, but with slightly different preprocessing.

### 🔍 What is WordPiece Tokenization?

**WordPiece** is a **subword tokenization algorithm**, similar in spirit to Byte Pair Encoding (BPE), but with a key difference in how it chooses **which tokens to add to the vocabulary**. It was originally introduced by Google for models like **BERT** and **ALBERT**.

---

### 🧠 Core Idea:

Instead of merging the most frequent **pairs** like BPE, WordPiece selects the **subword that increases the likelihood of the training data the most** based on a probabilistic language model.
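
One commonly cited way to approximate this objective scores each candidate pair as `freq(pair) / (freq(first) * freq(second))`, so a pair wins when its parts appear together more often than their individual frequencies would suggest. The sketch below assumes that formula; the original training code is not something this section documents, and the toy corpus and function name are hypothetical.

```python
from collections import Counter

def wordpiece_scores(words):
    """Score each adjacent pair as freq(pair) / (freq(first) * freq(second)).

    `words` maps a tuple of symbols to that word's corpus frequency.
    The scoring formula is an assumption based on public descriptions.
    """
    symbol_freq, pair_freq = Counter(), Counter()
    for symbols, freq in words.items():
        for s in symbols:
            symbol_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    return {
        pair: count / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, count in pair_freq.items()
    }

# Hypothetical toy corpus: words pre-split into characters, with "##"
# marking continuation pieces, mapped to their frequencies.
words = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}
scores = wordpiece_scores(words)
best = max(scores, key=scores.get)
most_frequent = ("##u", "##g")  # the pair plain BPE would pick (count 20)

print(best, round(scores[best], 4))
print(most_frequent, round(scores[most_frequent], 4))
```

Here the highest-scoring pair is not the most frequent one, which is the practical difference from BPE's merge rule.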

---

### 📌 Step-by-Step Overview:

Let’s say you have the word:

```
"unhappiness"
```

#### Step 1: Start with Characters

Break the word into characters:

```
[u, n, h, a, p, p, i, n, e, s, s]
```

#### Step 2: Learn and Merge Subwords

WordPiece doesn't just look at frequency — it uses a **language model objective** to decide what merges would **maximize the likelihood** of reconstructing the training corpus.

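Learned subwords might include:

* "un"
* "##happi"
* "##ness"

So one possible split is shown below. The pieces must concatenate back to the original word, and the actual split depends on the trained vocabulary:

```
"unhappiness" → ["un", "##happi", "##ness"]
```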

🔸 `##` indicates the token **does not begin a word** (i.e., it's a **continuation** token).
This helps the model understand where **word boundaries** are.
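
Once a vocabulary exists, BERT-style WordPiece tokenizers split a word greedily, taking the longest matching piece at each position. The sketch below shows that matching step under simplifying assumptions: the vocabulary is hypothetical, and details such as casing, punctuation handling, and maximum word length are ignored.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary, chosen only for illustration.
vocab = {"un", "##happi", "##ness", "happy", "stud", "##ing"}

print(wordpiece_tokenize("unhappiness", vocab))
print(wordpiece_tokenize("studing", vocab))
print(wordpiece_tokenize("xyz", vocab))
```

The second call mirrors the `stud` / `##ing` split seen in the `bert-base-cased` output earlier in the notebook.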

---

### ✅ Key Features of WordPiece:

| Feature                      | Explanation                                                            |
| ---------------------------- | ---------------------------------------------------------------------- |
| **Subword Vocabulary**       | Allows breaking unknown words into known parts.                        |
| **Prefix Convention (`##`)** | Clearly marks subwords that continue a word.                           |
| **Smarter Merging**          | Uses a likelihood-based selection rather than just frequency like BPE. |
| **Efficient Vocabulary**     | Maintains a compact set of tokens with broad coverage.                 |

---

### 🔁 BPE vs. WordPiece – What’s the Difference?

| Feature        | BPE                                                       | WordPiece                          |
| -------------- | --------------------------------------------------------- | ---------------------------------- |
| Merge Strategy | Most frequent pair                                        | Maximizes training data likelihood |
| Word Boundary  | Marks word ends (`</w>`) or leading spaces (`Ġ` in GPT-2) | Uses `##` to show continuation     |
| Used In        | GPT, RoBERTa                                              | BERT, ALBERT, DistilBERT           |

---

### 💡 Analogy:

Imagine you're typing on your phone:

* **WordPiece** is like an **autocomplete** dictionary that keeps the word parts most likely to appear together in its training text.
* **BPE** is like compressing text by replacing frequent letter pairs.

---

### 📌 Summary:

* **WordPiece** tokenizes words into subwords to handle rare or unknown terms.
* It uses a **likelihood-based approach** to build its vocabulary.
* It’s widely used in models like **BERT** to ensure good **coverage**, **efficiency**, and **generalization**.


## Token Embeddings

1. **Tokenization** breaks language into sequences of tokens, which allows language models to process and learn patterns from large text datasets.

2. After tokenization, the next step is to find **numerical representations** of tokens so the model can understand and process language effectively.

3. These numerical representations are called **embeddings**, which capture the meanings and relationships between tokens to power the model’s capabilities.
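
One common way to read "relationships between tokens" from embeddings is cosine similarity: vectors pointing in similar directions represent related tokens. The sketch below uses hand-picked three-dimensional vectors purely for illustration; real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked toy vectors (an assumption, not values from any model).
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])
print(round(king_queen, 3))
print(round(king_apple, 3))
```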

## A Language Model Holds Embeddings for the Vocabulary of Its Tokenizer

1. **Language models are tightly linked to their tokenizer** because each token in the tokenizer's vocabulary has a corresponding embedding vector in the model, so swapping in a different tokenizer does not work without retraining.

2. **Embedding vectors are part of the model’s parameters**, initially random but optimized during training to capture meaningful representations of each token for effective language understanding.
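
The link between tokenizer and model can be pictured as a matrix with one row per token ID. The sketch below is a plain-Python illustration under assumed values: the five-token vocabulary, the embedding size of 4, and the `embed` helper are all hypothetical, and the random rows stand in for weights before training.

```python
import random

random.seed(0)
vocab = ["[PAD]", "I", "am", "stud", "ing"]   # hypothetical tiny vocabulary
embedding_dim = 4

# One row per token ID, randomly initialised as it would be before training.
embedding_matrix = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in vocab
]

def embed(token_ids):
    """Look up the embedding row for each token ID."""
    return [embedding_matrix[i] for i in token_ids]

token_ids = [1, 2, 3, 4]          # "I am stud ing" under this vocabulary
vectors = embed(token_ids)
print(len(embedding_matrix), len(embedding_matrix[0]))  # vocabulary size, dimension
print(len(vectors), len(vectors[0]))                    # tokens, dimension

# An ID produced by a tokenizer with a larger vocabulary has no row here.
try:
    embed([30000])
except IndexError:
    print("token ID 30000 is outside this model's embedding matrix")
```

Even when an ID from another tokenizer happens to fall inside the matrix, it would select a row trained for a different token, which is why the two must be kept together.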


## Creating Contextualized Word Embeddings with Language Models

1. **Contextualized Embeddings**: Unlike static embeddings (same vector for each word), language models generate **contextualized word embeddings**, which adjust based on the word's surrounding context.

2. **Improved Text Representation**: These embeddings enhance the model's understanding of meaning, making them highly useful for **NLP tasks** like **named-entity recognition** and **extractive summarization**.

3. **Dynamic Token Representation**: A single word can have **different vector representations** depending on its usage in a sentence, enabling more accurate interpretation.

4. **Broader Applications**: Beyond text tasks, **contextualized embeddings** are also foundational in **multimodal AI**, such as **image generation models** like DALL·E, Midjourney, and Stable Diffusion.

5. **Reusability**: These embeddings can be fed into other downstream systems, making them a versatile component in modern AI pipelines.
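
The difference between static and contextualized vectors can be shown with a deliberately crude toy. The sketch below mixes each token's static vector with the sentence average; this is only a stand-in for what attention layers do, and the two-dimensional vectors and the `contextualize` helper are made up for illustration.

```python
# Static embeddings: one fixed vector per word, regardless of context.
static = {
    "river": [1.0, 0.0],
    "bank":  [0.5, 0.5],
    "loan":  [0.0, 1.0],
}

def contextualize(tokens):
    """Mix each token's static vector with the sentence average.

    A crude stand-in for attention, used only to show that the output
    for a token depends on the words around it.
    """
    dim = len(static[tokens[0]])
    mean = [sum(static[t][d] for t in tokens) / len(tokens) for d in range(dim)]
    return {t: [(static[t][d] + mean[d]) / 2 for d in range(dim)] for t in tokens}

bank_near_river = contextualize(["river", "bank"])["bank"]
bank_near_loan = contextualize(["bank", "loan"])["bank"]

print(static["bank"])      # the same in every sentence
print(bank_near_river)     # pulled toward "river"
print(bank_near_loan)      # pulled toward "loan"
```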


In [None]:
from transformers import AutoModel, AutoTokenizer

In [None]:
model = AutoModel.from_pretrained('microsoft/deberta-base')


In [None]:
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-base')


In [None]:
tokens = tokenizer('King loves apples while queen likes oranges',return_tensors='pt')

In [None]:
for i in tokens['input_ids']:
  print(tokenizer.decode(i))

[CLS]King loves apples while queen likes oranges[SEP]


In [None]:
# Index 0 of the model output is the last hidden state: one vector per token
context_embedding = model(**tokens)[0]

In [None]:
context_embedding

tensor([[[ 0.0599, -0.0210, -0.0796,  ...,  0.0100,  0.0525, -0.0580],
         [ 0.7146, -0.3095, -0.0693,  ...,  0.7228,  0.0781,  0.3667],
         [-1.6079, -0.3470,  0.0576,  ...,  0.5438, -0.5955,  0.9199],
         ...,
         [-1.9216,  0.2889,  0.5016,  ...,  0.6557, -1.0178,  1.2307],
         [-0.2176,  0.9261,  0.7574,  ...,  0.1845, -1.3310,  0.9372],
         [ 0.2082,  0.0447, -0.1084,  ...,  0.0755,  0.1981,  0.0765]]],
       grad_fn=<AddBackward0>)

In [None]:
context_embedding.shape

torch.Size([1, 9, 768])
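
The shape reads as 1 sentence in the batch, 9 tokens (the 7 words plus `[CLS]` and `[SEP]`), and 768 values per token. One simple way to go from these token vectors to a single sentence vector is to average them. The sketch below shows that mean pooling on a small nested list standing in for the tensor; the numbers and the `mean_pool` helper are illustrative assumptions, and averaging is only one of several pooling choices.

```python
# Toy stand-in for a [batch, tokens, hidden] output: 1 sentence, 3 tokens, 4 dims.
batch = [[
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]]

def mean_pool(token_vectors):
    """Average token vectors into a single vector for the whole sentence."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

sentence_embeddings = [mean_pool(sentence) for sentence in batch]

print(len(batch), len(batch[0]), len(batch[0][0]))   # batch, tokens, hidden
print(sentence_embeddings)                           # one vector per sentence
```

On the tensor above, the analogous operation would be `context_embedding.mean(dim=1)`, which gives a tensor of shape `[1, 768]`.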