### Understanding Tokenization and Embeddings for LLMs

This Jupyter Notebook explains **tokenization** and **embeddings**, two foundational concepts behind Large Language Models (LLMs). We explore how text is converted into tokens, how different models tokenize the same input differently, and how embeddings represent semantic meaning in vector space.

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1_5",
    dtype=torch.float16 if device.type == "cuda" else torch.float32,
).to(device)

print("Model and tokenizer loaded on", device)



Model and tokenizer loaded on cpu




Let's generate some text with our model and see how it responds to the prompt.

In [5]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>"

# Tokenize the input prompt
# Move the IDs to the same device as the model
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100
)

# Print the output
print(tokenizer.decode(generation_output[0]))

Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>

Email 3:
Subject: Apology for the Tragic Gardening Mishap

Dear Sarah,

I hope this email finds you well. I wanted to take a moment to apologize for the unfortunate incident that occurred in your garden. It was truly a heartbreaking sight to see your beautiful flowers and plants destroyed.

I understand that accidents happen, and I am truly sorry for the damage caused. I assure you that I will do everything in my power to make it right.


Let's look at the input and output token IDs behind the text.

In [6]:
# For inspection, print the input IDs
print(input_ids)

tensor([[16594,   281,  3053, 47401,   284, 10490,   329,   262, 15444, 46072,
         29406,   499,    13, 48605,   703,   340,  3022, 29847,    91,   562,
         10167,    91,    29]])


In [7]:
# Decode and print each input ID sequence
for i in input_ids:
    print(tokenizer.decode(i))

Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>


In [8]:
# Decode and print each input ID in the first batch
for i in input_ids[0]:
    print(tokenizer.decode(i))

Write
 an
 email
 apologizing
 to
 Sarah
 for
 the
 tragic
 gardening
 mish
ap
.
 Explain
 how
 it
 happened
.<
|
ass
istant
|
>


In [9]:
generation_output

tensor([[16594,   281,  3053, 47401,   284, 10490,   329,   262, 15444, 46072,
         29406,   499,    13, 48605,   703,   340,  3022, 29847,    91,   562,
         10167,    91,    29,   198,   198, 15333,   513,    25,   198, 19776,
            25,  5949,  1435,   329,   262,   833,  9083, 12790,  3101, 39523,
           499,   198,   198, 20266, 10490,    11,   198,   198,    40,  2911,
           428,  3053,  7228,   345,   880,    13,   314,  2227,   284,  1011,
           257,  2589,   284, 16521,   329,   262, 14855,  4519,   326,  5091,
           287,   534, 11376,    13,   632,   373,  4988,   257, 37154,  6504,
           284,   766,   534,  4950, 12734,   290,  6134,  6572,    13,   198,
           198,    40,  1833,   326, 17390,  1645,    11,   290,   314,   716,
          4988,  7926,   329,   262,  2465,  4073,    13,   314, 19832,   345,
           326,   314,   481,   466,  2279,   287,   616,  1176,   284,   787,
           340,   826,    13]])

In [10]:
print(tokenizer.decode(16594))
print(tokenizer.decode(281))
print(tokenizer.decode([16594, 281]))

Write
 an
Write an
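
The decoding step above can be pictured as a plain lookup followed by string concatenation. The sketch below is a simplified illustration, not the tokenizer's actual implementation: `toy_vocab` and `toy_decode` are hypothetical names, and the three ID-to-string pairs are copied from the outputs above. It shows why ` an` carries its own leading space: decoding simply joins the pieces.

```python
# Toy vocabulary: three ID-to-string pairs copied from the outputs above.
# A real tokenizer vocabulary holds tens of thousands of entries.
toy_vocab = {16594: "Write", 281: " an", 3053: " email"}


def toy_decode(ids):
    """Join the string piece of each ID (hypothetical helper)."""
    return "".join(toy_vocab[i] for i in ids)


print(repr(toy_decode([16594])))             # 'Write'
print(repr(toy_decode([281])))               # ' an'
print(repr(toy_decode([16594, 281, 3053])))  # 'Write an email'
```

Because the space belongs to the token, no separator has to be inserted when the pieces are joined.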


#### Compare LLM Tokenizers

Let's compare the tokenizers of several trained LLMs and see how each one splits the same text.

In this part, we will write a simple function that loads each tokenizer and prints its token count and tokens.

In [11]:
def compare_tokenizers(sentence, tokenizer_names):
    """
    Compare how different tokenizers split the same text.
    Shows token count and the actual tokenization for each model.
    """
    print(f"\n{'='*80}")
    print(f"Comparing tokenizers on text: {sentence[:500]}...")
    print(f"{'='*80}\n")

    for tokenizer_name in tokenizer_names:
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        token_ids = tokenizer(sentence).input_ids
        tokens = [tokenizer.decode(t, skip_special_tokens=False) for t in token_ids]

        print(f" {tokenizer_name}")
        print(f"   Token count: {len(token_ids)}")
        print(f"   Tokens: {tokens} \n")

In [13]:
# Example: Compare how different tokenizers handle the same text
test_sentence = "CAPITALIZATION and other stuff"

compare_tokenizers(test_sentence, [
    "microsoft/phi-1_5",
    "bert-base-uncased",
    "google/flan-t5-small",
    "microsoft/Phi-3-mini-4k-instruct"
])



Comparing tokenizers on text: CAPITALIZATION and other stuff...

 microsoft/phi-1_5
   Token count: 7
   Tokens: ['CAP', 'ITAL', 'IZ', 'ATION', ' and', ' other', ' stuff'] 

 bert-base-uncased
   Token count: 7
   Tokens: ['[CLS]', 'capital', '##ization', 'and', 'other', 'stuff', '[SEP]'] 

 google/flan-t5-small
   Token count: 9
   Tokens: ['CA', 'PI', 'TAL', 'IZ', 'ATION', 'and', 'other', 'stuff', '</s>'] 

 microsoft/Phi-3-mini-4k-instruct
   Token count: 9
   Tokens: ['C', 'AP', 'IT', 'AL', 'IZ', 'ATION', 'and', 'other', 'stuff'] 



### Meaning of the special tokens in tokenization

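The output above contains some unusual tokens such as `[CLS]`, `[SEP]`, and `</s>`. Other tokenizers also define tokens like `[UNK]`, `[PAD]`, and `[MASK]`.

These are special tokens that carry structural information for the model rather than text content. Let's review some of them:


- `[CLS]`  is the classification token. BERT-style tokenizers place it at the start of the input, and its output representation is commonly used for classification tasks.

- `[SEP]`  is the separator token. It marks the end of a segment and separates two segments (for example, a question and a passage) when both are given in one input.

- `[UNK]`  is the unknown token. It replaces any piece of text that the tokenizer cannot map to an entry in its vocabulary.

- `[PAD]`  is the padding token. It pads shorter sequences so that every sequence in a batch has the same length.

- `[MASK]`  is the mask token. It hides tokens during masked language model training so that the model learns to predict them.

- `<s>` is the start token. It is used to indicate the start of the text.

- `</s>` is the end token. It is used to indicate the end of the text.

- `##` is the WordPiece continuation prefix. It marks a token that continues the previous word, as in `##ization` above. `�` is the Unicode replacement character; it can appear when a single decoded token covers only part of a multi-byte character.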
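
To make the roles of `[CLS]`, `[SEP]`, `[UNK]`, and the `##` prefix concrete, here is a minimal sketch of a WordPiece-style greedy longest-match tokenizer. It assumes a tiny hand-written vocabulary (`toy_vocab`), and the helper names `wordpiece_tokenize` and `encode` are hypothetical. It illustrates the idea only and is not the code used by `bert-base-uncased`.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word greedily, always taking the longest piece in `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the ## prefix
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen to mirror the BERT output above
toy_vocab = {"capital", "##ization", "and", "other", "stuff"}


def encode(sentence):
    """Lowercase, split on whitespace, and wrap the pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece_tokenize(word, toy_vocab))
    tokens.append("[SEP]")
    return tokens


print(encode("CAPITALIZATION and other stuff"))
# ['[CLS]', 'capital', '##ization', 'and', 'other', 'stuff', '[SEP]']
print(encode("CAPITALIZATION and zebra"))
# ['[CLS]', 'capital', '##ization', 'and', '[UNK]', '[SEP]']
```

With this toy vocabulary, the word "zebra" has no matching piece, so it collapses into `[UNK]`. Real vocabularies include single characters to make this rare.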

OK, but why are there different tokenization techniques?
 1. Vocabulary size vs. efficiency trade-off: Different techniques balance vocabulary size against token efficiency. BPE (used by phi-1_5) builds subword units by merging frequent symbol pairs, WordPiece (BERT) marks word continuations with `##`, and SentencePiece (T5) works on raw text without pre-splitting on whitespace, which helps with multilingual text. Fewer tokens per text means faster generation and lower cost, but overly aggressive splitting breaks words into pieces that carry little meaning on their own.

 2. Language and domain specialization: Some tokenizers are optimized for specific use cases. bert-base-uncased lowercases everything (good for case-insensitive tasks), while others preserve case (critical for code, names, or case-sensitive languages).
 Sometimes domain terms should not be split at all. For example, medical or code-specific tokenizers keep domain terms as single tokens instead of splitting "acetaminophen" into meaningless pieces.

 3. Direct impact on generation quality: During generation, the model predicts one token at a time. If "CAPITALIZATION" is 4 tokens (CAP|ITAL|IZ|ATION) vs 2 tokens (capital|##ization), the model needs 4 prediction steps vs 2. Inefficient tokenization means: (1) slower generation, (2) more chances for errors to compound, (3) the model may not "understand" rare words split into unusual subwords, and (4) the context limit is reached sooner (e.g., a 2048-token window fills up faster).

 TL;DR: The tokenizer determines how the model "sees" and generates text; a mismatch between tokenizer and task hurts both speed and quality.
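
To illustrate how a BPE vocabulary comes about, the sketch below runs a few merge steps of a simplified byte-pair-encoding training loop. The corpus, its word frequencies, and the helper names `count_pairs` and `merge_pair` are assumptions made for illustration. Production tokenizers work on bytes and add many refinements that are omitted here.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical toy corpus: word -> frequency
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

# Start from single characters
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(4):
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)

print(merges)
print(list(words))
```

In this toy run, frequent fragments such as `est` and `low` end up as single symbols, while rarer words stay split into smaller pieces. The number of merges controls the vocabulary size, which is the trade-off described in point 1.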

#### Embeddings

Now that we understand how text is split into tokens, let's see how tokens are represented in a vector space. A model never operates on the token IDs directly: each ID is first mapped to a vector of floats, called an embedding.

What is the purpose of embeddings?
The purpose of an embedding is to represent text in a well-structured vector space, so that we can model the text and perform operations such as similarity search and clustering.

The patterns in the embeddings are learned during training, and they reflect how well the model captures a specific language or task.

When we download a model from the hub, the embedding matrix is part of the model's weights; the configuration file only records its dimensions (vocabulary size and hidden size). Before training begins, the embedding matrix is randomly initialized, like the model's other weights, and training then adjusts it to the task and language.


Let's start by loading a model and tokenizer and inspecting the embeddings they produce.

In [14]:
from transformers import AutoModel

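# Load a tokenizer and a model
# Note: these are two different checkpoints. In practice, load both from the
# same checkpoint so that token IDs index the matching embedding rows.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")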

# Tokenize the sentence
tokens = tokenizer('Hello world', return_tensors='pt')

In [15]:
tokens

{'input_ids': tensor([[    1, 31414,   232,     2]]), 'token_type_ids': tensor([[0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1]])}

### Important distinction: Token IDs vs Embeddings
**Token IDs (integers):**

- These are the numerical IDs from the tokenizer's vocabulary
- Example: `[1, 31414, 232, 2]` where:
  - 1 is the `[CLS]` token added at the start
  - 31414 is `Hello`
  - 232 is ` world`
  - 2 is the `[SEP]` token added at the end
- You CAN decode these back to text using `tokenizer.decode()`

**Embeddings (float vectors):**

- These are dense vector representations learned by the model
- Example: `[-3.4816, 0.0861, -0.1819, ...]` (384 dimensions)
- The model output below has shape `[1, 4, 384]`: 1 is the batch size, 4 is the number of tokens, and 384 is the embedding dimension
- You CANNOT decode these back to text
- Each token ID gets mapped to its embedding vector.  
The flow is: **Text → Token IDs → Embeddings → Model Processing**
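
The "Token IDs → Embeddings" step is a row lookup in the embedding matrix. The sketch below shows that lookup in plain Python under stated assumptions: the sizes `vocab_size` and `embedding_dim` and the token IDs are made up, and the matrix is random, as it would be before training. In a framework, `torch.nn.Embedding` performs the same kind of lookup on tensors.

```python
import random

rng = random.Random(0)
vocab_size, embedding_dim = 10, 4  # hypothetical toy sizes

# One row of floats per token ID; random values stand in for an untrained matrix
embedding_matrix = [
    [rng.gauss(0.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

token_ids = [1, 7, 3, 2]  # made-up IDs for a 4-token input

# The lookup: each ID selects its row of the matrix
embeddings = [embedding_matrix[i] for i in token_ids]

print(len(embeddings), len(embeddings[0]))  # 4 4
```

The result has one vector per token, which is why a 4-token input yields 4 rows. Going back from a vector to an ID is not a lookup, which is why embeddings cannot simply be decoded.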

In [16]:
# Process the tokens
output = model(**tokens)[0]
display(output)

tensor([[[-3.4816,  0.0861, -0.1819,  ..., -0.0612, -0.3911,  0.3017],
         [ 0.1898,  0.3208, -0.2315,  ...,  0.3714,  0.2478,  0.8048],
         [ 0.2071,  0.5036, -0.0485,  ...,  1.2175, -0.2292,  0.8582],
         [-3.4278,  0.0645, -0.1427,  ...,  0.0658, -0.4367,  0.3834]]],
       grad_fn=<NativeLayerNormBackward0>)

In [17]:
# Check the shape of the output
output.shape

torch.Size([1, 4, 384])

In [18]:
# Decode the TOKEN IDs
tokenizer.decode(tokens['input_ids'][0])

'[CLS]Hello world[SEP]'

Here we CANNOT decode `output`, because it contains embeddings (float vectors), not token IDs. It is a tensor of shape `[1, 4, 384]` that holds one 384-dimensional embedding vector per token.

These are continuous float values that represent semantic meaning, not discrete token IDs that can be converted back to text.  

In simple terms:

- Input: `tokens['input_ids']` = `[1, 31414, 232, 2]` → CAN decode to text
- Output: `output` = `[[-3.48, 0.086, ...], [0.19, 0.32, ...], ...]` → CANNOT decode to text

### Play with word embeddings
Now that we understand the purpose of embeddings, let's experiment with word embeddings to see how they represent text in a vector space and how we can use them.

In [18]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [20]:
import gensim.downloader as api

# Download embeddings (66MB, glove, trained on wikipedia, vector size: 50)
model = api.load("glove-wiki-gigaword-50")

In [21]:
# Print the 11 nearest neighbors of "engineer"; querying by vector returns the word itself first
model.most_similar([model['engineer']], topn=11)

[('engineer', 1.0),
 ('mechanic', 0.7610689997673035),
 ('technician', 0.7588814496994019),
 ('engineers', 0.7152685523033142),
 ('worked', 0.7083118557929993),
 ('pioneer', 0.7055997848510742),
 ('retired', 0.6979386806488037),
 ('chemist', 0.6946015954017639),
 ('engineering', 0.6913756132125854),
 ('contractor', 0.686898410320282),
 ('builder', 0.6847971081733704)]
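
The scores in this output are cosine similarities between embedding vectors. As a minimal sketch of what such a query computes, the example below ranks words by cosine similarity using made-up 3-dimensional vectors. The vectors in `toy_vectors` and the helper names are hypothetical and much smaller than the 50-dimensional GloVe vectors loaded above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration only
toy_vectors = {
    "engineer": [0.9, 0.8, 0.1],
    "mechanic": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}


def toy_most_similar(word, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = toy_vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in toy_vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


for other, score in toy_most_similar("engineer"):
    print(f"{other}: {score:.3f}")
```

A vector compared with itself scores 1.0, which matches the first entry of the output above, where the query word is returned as its own nearest neighbor.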

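This kind of embedding similarity search is useful for finding the words closest to a given word, and the same idea powers simple recommendation systems. Note that the vectors above are GloVe embeddings, not word2vec; the approach carries over to any embedding model. Let's explore this use case to recommend songs.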

In [22]:
import io
import pandas as pd
from urllib import request

# URLs of the data
PLAYLIST_URL = "https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt"
SONGS_URL = "https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt"

# Load the playlist file
with request.urlopen(PLAYLIST_URL) as resp:
    text = resp.read().decode("utf-8")

# Skip the first two metadata lines
lines = text.splitlines()[2:]

# Build playlists, removing those with only one song
playlists = []
for line in lines:
    line = line.strip()
    if not line:
        continue
    parts = line.split()
    if len(parts) > 1:
        playlists.append(parts)

# Load the song metadata file
with request.urlopen(SONGS_URL) as resp:
    songs_text = resp.read().decode("utf-8")

# Read metadata into a DataFrame
songs_buffer = io.StringIO(songs_text)
songs_df = pd.read_csv(
    songs_buffer,
    sep="\t",
    header=None,
    names=["id", "title", "artist"],
    dtype={"id": str, "title": str, "artist": str},
)

# Clean up and set index
songs_df = songs_df.dropna(how="all")
songs_df = songs_df.set_index("id")

In [23]:
print('Playlist #1:\n ', playlists[0], '\n')
print('Playlist #2:\n ', playlists[1])

Playlist #1:
  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] 

Playlist #2:
  ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117',

#### Train a word2vec model for a song recommendation system

Now that we have the playlists and the song metadata, we can train a word2vec model that recommends songs based on playlists. Each playlist is treated as a sentence and each song ID as a word, so songs that appear in similar playlist contexts end up with similar vectors. We will use the `Word2Vec` model from the gensim library with the following parameters:

- `vector_size`: the dimensionality of the embedding vectors.

- `window`: the maximum distance between a song and the neighboring songs used as its context.

- `negative`: the number of negative samples drawn for each positive example during training.

- `min_count`: the minimum number of times a song must appear in the playlists to be included in the model.

- `workers`: the number of worker threads used for training.
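
To give a feel for what `window` does, the sketch below lists the (center, context) pairs that a fixed window would produce for one short playlist. It is a simplified illustration with a hypothetical helper `context_pairs` and a made-up playlist. gensim's actual sampling differs in details, and its default training mode combines the context songs rather than using them pair by pair.

```python
def context_pairs(playlist, window):
    """Return (center, context) pairs for songs at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(playlist):
        lo = max(0, i - window)
        hi = min(len(playlist), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, playlist[j]))
    return pairs


toy_playlist = ["0", "1", "2", "3"]  # made-up song IDs
print(context_pairs(toy_playlist, window=1))
# [('0', '1'), ('1', '0'), ('1', '2'), ('2', '1'), ('2', '3'), ('3', '2')]
```

A larger window, such as the value of 20 used here, treats songs that are far apart in a playlist as related, which suits playlists better than it would suit sentences.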

In [24]:
from gensim.models import Word2Vec

# train our Word2Vec model
model = Word2Vec(
    playlists,
    vector_size=32,
    window=20,
    negative=50,
    min_count=1,
    workers=4
)

In [25]:
# Ask the model for songs similar to song #2172
song_id = 2172
model.wv.most_similar(positive=str(song_id))

[('2849', 0.9975500106811523),
 ('1849', 0.997077465057373),
 ('5549', 0.9966534376144409),
 ('2976', 0.9963942766189575),
 ('2104', 0.9961690306663513),
 ('2014', 0.9961365461349487),
 ('3094', 0.9956143498420715),
 ('5586', 0.9955135583877563),
 ('6626', 0.9954299926757812),
 ('3117', 0.9953515529632568)]

In [26]:
print(songs_df.iloc[2172])

title     Fade To Black
artist        Metallica
Name: 2172 , dtype: object


In [27]:
import numpy as np

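def print_recommendations(song_id):
    # most_similar returns (id, score) pairs; keep the IDs and cast them to
    # integers, since iloc expects integer positions
    similar_songs = np.array(
        model.wv.most_similar(positive=str(song_id), topn=5)
    )[:, 0].astype(int)
    return songs_df.iloc[similar_songs]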

# print recommendations in clean text
print_recommendations(2172)

id,title,artist
2849,Run To The Hills,Iron Maiden
1849,Bad Company,Five Finger Death Punch
5549,November Rain,Guns N' Roses
2976,I Don't Know,Ozzy Osbourne
2104,Life Won't Wait,Ozzy Osbourne


In [28]:
print(songs_df.iloc[842])

title     California Love (w\/ Dr. Dre & Roger Troutman)
artist                                              2Pac
Name: 842 , dtype: object


In [29]:
print_recommendations(842)

id,title,artist
413,If I Ruled The World (Imagine That) (w\/ Laury...,Nas
886,Heartless,Kanye West
6741,Love In This Club (w\/ Young Jeezy),Usher
5668,How We Do (w\/ 50 Cent),The Game
329,Stronger,Kanye West


### Conclusion
Now that we have trained a word2vec model and seen how it works, the same approach can be used to recommend other items, such as movies or products. I encourage you to experiment with the code and try it on other datasets and use cases.