# Hands-On with Tokenization and Embeddings

This notebook covers two fundamental concepts in Large Language Models (LLMs):

1. Tokenization: The process of splitting text into smaller units called tokens, which can be words or subwords.

2. Embeddings: Dense vector representations of words or sentences that capture their semantic meanings.

## Tokenization




The Tiktokenizer tool (https://tiktokenizer.vercel.app/) is a useful web-based application for exploring tokenization in OpenAI models like GPT.

✅ Real-time Tokenization – Instantly see how input text is broken into tokens.

✅ Cost Estimation – Shows the number of tokens, useful for estimating API costs in GPT models.

✅ Model-Specific Behavior – Allows you to compare how different OpenAI models (e.g., GPT-3.5, GPT-4) tokenize the same text.

✅ Visual Debugging – Helpful for understanding token splitting, subword units, and unexpected tokenization behaviors.


## Let us write a naive tokenizer

In [20]:
corpus = ["the quick brown fox jumps over the lazy dog",
          "the dog barks",
          "the fox runs"]

# Tokenization and vocabulary creation
# Split each sentence into words
tokens = [sentence.split() for sentence in corpus]
# Flatten token list and remove duplicates to create vocabulary
vocab = sorted(set(word for sentence in tokens for word in sentence))
# Create word-to-index and index-to-word mappings
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}
vocab_size = len(vocab)

In [21]:
# Print results
print("Vocabulary:", vocab)
print("Word to Index Mapping:", word_to_idx)
print("Index to Word Mapping:", idx_to_word)
print("Vocabulary Size:", vocab_size)

Vocabulary: ['barks', 'brown', 'dog', 'fox', 'jumps', 'lazy', 'over', 'quick', 'runs', 'the']
Word to Index Mapping: {'barks': 0, 'brown': 1, 'dog': 2, 'fox': 3, 'jumps': 4, 'lazy': 5, 'over': 6, 'quick': 7, 'runs': 8, 'the': 9}
Index to Word Mapping: {0: 'barks', 1: 'brown', 2: 'dog', 3: 'fox', 4: 'jumps', 5: 'lazy', 6: 'over', 7: 'quick', 8: 'runs', 9: 'the'}
Vocabulary Size: 10
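The two mappings above are enough to turn a sentence into IDs and back. The following is a minimal sketch of that round trip; the `encode` and `decode` helpers and the `<unk>` entry for out-of-vocabulary words are illustrative additions, not part of the original cells.

```python
# Rebuild the mappings from the cell above so this sketch runs on its own.
corpus = ["the quick brown fox jumps over the lazy dog",
          "the dog barks",
          "the fox runs"]
vocab = sorted({word for sentence in corpus for word in sentence.split()})

# Hypothetical addition: reserve one extra ID for out-of-vocabulary words.
UNK = "<unk>"
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
word_to_idx[UNK] = len(word_to_idx)
idx_to_word = {idx: word for word, idx in word_to_idx.items()}


def encode(sentence):
    """Map each whitespace-separated word to its ID, or to the <unk> ID."""
    return [word_to_idx.get(word, word_to_idx[UNK]) for word in sentence.split()]


def decode(ids):
    """Map IDs back to words and join them with single spaces."""
    return " ".join(idx_to_word[i] for i in ids)


ids = encode("the lazy cat runs")
print(ids)          # [9, 5, 10, 8]
print(decode(ids))  # the lazy <unk> runs
```

The word "cat" never appears in the corpus, so it collapses to `<unk>` and cannot be recovered when decoding. This limitation of word-level vocabularies is what subword tokenizers are designed to reduce.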


## Now let us explore Tokenizer for an open source model

In [1]:
from huggingface_hub import notebook_login

notebook_login()



In [3]:
# importing libraries
import random
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

In [28]:
# Load BERT tokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [5]:
# Set a random seed
random_seed = 42
random.seed(random_seed)

# Set a random seed for PyTorch (for GPU as well)
torch.manual_seed(random_seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(random_seed)

In [29]:
# Take a sentence, break it into words or subwords using the BERT tokenizer, and display the resulting tokens.
text = "The firefighter bravely entered the burning building, while the newscaster reported on the unfolding events."

# Tokenize the text
tokenized_text = bert_tokenizer.tokenize(text)
# Print the tokenized text
print(f"tokenized Text: {tokenized_text}")


tokenized Text: ['the', 'fire', '##fighter', 'brave', '##ly', 'entered', 'the', 'burning', 'building', ',', 'while', 'the', 'newscast', '##er', 'reported', 'on', 'the', 'un', '##folding', 'events', '.']


- Subword Tokenization: BERT uses subword tokenization (WordPiece), meaning it breaks words into smaller units that can be recombined to form the original word. This lets a fixed-size vocabulary cover rare and unseen words by composing them from common subword units.

- Special Characters: Words with special characters or accents might be split into multiple tokens. The example sentence above contains none, so this effect is not visible here.

- Punctuation: Punctuation marks are usually treated as individual tokens.

- "##" Prefix: The "##" prefix is used to indicate that a token is a continuation of a previous token and should be combined with it when interpreting the tokenized output.
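The points above can be sketched in a few lines of plain Python. The block below is a simplified illustration of WordPiece-style greedy longest-match splitting, plus a helper that glues "##" pieces back together. The function names and the tiny `toy_vocab` are assumptions made for this example; the real BERT vocabulary is far larger, and the library implementation handles more cases.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces


def merge_wordpieces(tokens):
    """Glue '##' continuation tokens back onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return words


# Toy vocabulary (hypothetical, chosen to mirror the output above).
toy_vocab = {"fire", "##fighter", "brave", "##ly", "un", "##folding"}

print(wordpiece_tokenize("firefighter", toy_vocab))  # ['fire', '##fighter']
print(wordpiece_tokenize("bravely", toy_vocab))      # ['brave', '##ly']
print(wordpiece_tokenize("zebra", toy_vocab))        # ['[UNK]']
print(merge_wordpieces(["the", "fire", "##fighter", "brave", "##ly"]))
# ['the', 'firefighter', 'bravely']
```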

In [30]:
# Tokenize and encode the text to get the token IDs.

# batch_encode_plus returns a dictionary containing the token IDs and attention masks
encoding = bert_tokenizer.batch_encode_plus(
    [text],                   # List of input texts
    padding=True,             # Pad every sequence to the longest one in the batch
    truncation=True,          # Truncate sequences longer than the model's maximum length
    return_tensors='pt',      # Return PyTorch tensors
    add_special_tokens=True   # Add the special tokens [CLS] and [SEP]
)

input_ids = encoding['input_ids'] # Token IDs
# print input IDs
print(f"Input ID: {input_ids}")

Input ID: tensor([[  101,  1996,  2543, 20027,  9191,  2135,  3133,  1996,  5255,  2311,
          1010,  2096,  1996, 20306,  2121,  2988,  2006,  1996,  4895, 21508,
          2824,  1012,   102]])


In [31]:
# Decode the token IDs back to text
decoded_text = bert_tokenizer.decode(input_ids[0], skip_special_tokens=True)
# Print the decoded text (lowercase, because bert-base-uncased lowercases its input)
print(f"Decoded Text: {decoded_text}")

Decoded Text: the firefighter bravely entered the burning building, while the newscaster reported on the unfolding events.


In [16]:
attention_mask = encoding['attention_mask'] # Attention mask
# print attention mask
print(f"Attention mask: {attention_mask}")

Attention mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])


- No Padding: A value of 1 marks a real token that the model should attend to; a value of 0 marks padding. Every value here is 1 because the batch contains a single sentence, so nothing had to be padded.
- Full Attention: The model processes every token in the sequence, including the special [CLS] and [SEP] tokens, and ignores none of them.
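Zeros only appear in the mask once a batch mixes sequences of different lengths. The sketch below pads two ID lists by hand to show the effect. The `pad_batch` helper is hypothetical, `pad_id=0` is an assumption (it corresponds to `[PAD]` in `bert-base-uncased`), and apart from 101, 102 and 1996 the IDs are made up.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad ID lists to the longest one and build matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Two sequences of different length; 101 and 102 stand for [CLS] and [SEP].
batch = [
    [101, 1996, 3899, 102],              # 4 tokens (middle IDs are made up)
    [101, 1996, 4419, 3216, 3435, 102],  # 6 tokens (middle IDs are made up)
]

padded_ids, mask = pad_batch(batch)
print(padded_ids[0])  # [101, 1996, 3899, 102, 0, 0]
print(mask[0])        # [1, 1, 1, 1, 0, 0]
print(mask[1])        # [1, 1, 1, 1, 1, 1]
```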

## Let us consider a Mistral tokenizer

In [32]:
from transformers import AutoTokenizer
BASELINE_MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"
mistral_tokenizer = AutoTokenizer.from_pretrained(BASELINE_MODEL_NAME)

In [33]:
tokens = mistral_tokenizer.tokenize(text)
print(tokens)

['▁The', '▁fire', 'fig', 'h', 'ter', '▁br', 'av', 'ely', '▁entered', '▁the', '▁burning', '▁building', ',', '▁while', '▁the', '▁new', 'sc', 'aster', '▁reported', '▁on', '▁the', '▁unf', 'olding', '▁events', '.']


- Subword Tokenization: Mistral, like many modern tokenizers, uses subword tokenization. It breaks "firefighter" into "▁fire", "fig", "h", "ter" and "bravely" into "▁br", "av", "ely". This lets a fixed-size vocabulary cover rare and unseen words.
- Spaces: The "▁" character marks a space before the token (e.g., "▁The", "▁fire"); tokens without it, such as "fig", continue the current word.
- Punctuation: Punctuation marks are typically treated as individual tokens.
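Because the "▁" marker records where spaces were, the original text can be rebuilt from the token strings alone. The sketch below does this for the tokens printed above. The `detokenize` helper is a simplified stand-in written for this example, not the tokenizer's own decoding logic.

```python
SPACE_MARKER = "\u2581"  # the "▁" character that marks a preceding space

# Tokens copied from the Mistral tokenizer output above.
tokens = ['▁The', '▁fire', 'fig', 'h', 'ter', '▁br', 'av', 'ely', '▁entered',
          '▁the', '▁burning', '▁building', ',', '▁while', '▁the', '▁new', 'sc',
          'aster', '▁reported', '▁on', '▁the', '▁unf', 'olding', '▁events', '.']


def detokenize(pieces):
    """Concatenate the pieces, turn markers into spaces, drop the leading space."""
    return "".join(pieces).replace(SPACE_MARKER, " ").lstrip(" ")


print(detokenize(tokens))
# The firefighter bravely entered the burning building, while the newscaster reported on the unfolding events.
```

Unlike the BERT example, the capital "T" survives the round trip, because this tokenizer does not lowercase its input.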

In [34]:
ids = mistral_tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[415, 3339, 885, 28716, 360, 1170, 494, 723, 8426, 272, 13136, 3667, 28725, 1312, 272, 633, 824, 1993, 5745, 356, 272, 10077, 25107, 3926, 28723]


In [35]:
decoded_string = mistral_tokenizer.decode(ids)
print(decoded_string)

The firefighter bravely entered the burning building, while the newscaster reported on the unfolding events.


# Word Embeddings

In [27]:
# Load BERT model
model = BertModel.from_pretrained('bert-base-uncased')


In [36]:
# Generate embeddings using the BERT model
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    word_embeddings = outputs.last_hidden_state  # Contextual embedding of every token

# Output the shape of word embeddings
print(f"Shape of Word Embeddings: {word_embeddings.shape}")

Shape of Word Embeddings: torch.Size([1, 23, 768])


The output is a tensor of shape [batch size, sequence length, embedding dimension]. Here the batch holds 1 sentence of 23 tokens (including [CLS] and [SEP]), and the embedding dimension is 768.

In [40]:
# Print the embedding at position 0, which is the special [CLS] token (the first word, "the", is at position 1).
# Only the first 20 coordinates are shown.
print(word_embeddings[0][0][:20])

tensor([-0.1181, -0.2146,  0.1180,  0.0116,  0.1689, -0.4654,  0.3359,  0.4142,
         0.5390,  0.1416,  0.0765, -0.1458, -0.3116,  0.5731,  0.4801,  0.3392,
        -0.4910,  0.5232,  0.3460, -0.2849])


In [41]:
# Compute the average of word embeddings to get the sentence embedding
sentence_embedding = word_embeddings.mean(dim=1) # Average pooling along the sequence length dimension

# Output the shape of the sentence embedding
print(f"Shape of Sentence Embedding: {sentence_embedding.shape}")

Shape of Sentence Embedding: torch.Size([1, 768])
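With a single sentence, a plain mean over the sequence dimension is fine. Once several sentences are batched together, padded positions would also enter the average, so a common alternative is to average only the positions whose attention mask is 1. The following is a minimal pure-Python sketch of that idea on toy 2-dimensional vectors. The `masked_mean_pool` helper is hypothetical, and it assumes at least one unmasked token.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                total[j] += value
    return [value / count for value in total]


# Three real tokens followed by one padding position (toy values).
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

pooled = masked_mean_pool(vectors, mask)
plain = [sum(column) / len(vectors) for column in zip(*vectors)]

print(pooled)  # [3.0, 4.0]   padding ignored
print(plain)   # [4.5, 5.25]  padding pulls the average away
```

With tensors, the same idea is usually written by multiplying the hidden states by the mask (expanded with a trailing dimension), summing over the sequence axis, and dividing by the number of unmasked tokens.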


In [43]:
# Let us compare the semantic meaning of the following sentences using the BERT model

sentences = ["That is a happy person", "Today is a sunny day", "That is a very happy person", "That is a happy dog"]

# Tokenize and encode the example sentences
sentences_encoding = bert_tokenizer.batch_encode_plus(
    sentences,
    padding=True,
    truncation=True,
    return_tensors='pt',
    add_special_tokens=True
)
sentences_input_ids = sentences_encoding['input_ids']
sentences_attention_mask = sentences_encoding['attention_mask']

# Generate embeddings for the sentences
# Note: this plain mean also averages over [PAD] positions in any sentence shorter than the longest one
with torch.no_grad():
    sentences_outputs = model(sentences_input_ids, attention_mask=sentences_attention_mask)
    sentences_embedding = sentences_outputs.last_hidden_state.mean(dim=1)

# Compute cosine similarity between the first sentence embedding and the others
for i in range(1, len(sentences)):
    similarity = cosine_similarity([sentences_embedding[0]], [sentences_embedding[i]])[0][0]
    print(f"Cosine similarity between : '{sentences[0]}' and '{sentences[i]}': {similarity:.3f}")

Cosine similarity between : 'That is a happy person' and 'Today is a sunny day': 0.649
Cosine similarity between : 'That is a happy person' and 'That is a very happy person': 0.928
Cosine similarity between : 'That is a happy person' and 'That is a happy dog': 0.852
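The scores above come from scikit-learn's `cosine_similarity`. To make the quantity concrete, here is a small from-scratch sketch of the formula (the dot product divided by the product of the vector norms) on toy 2-dimensional vectors. The `cosine` helper is written for illustration and assumes non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [2.0, 0.0]))            # 1.0  (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))            # 0.0  (orthogonal)
print(round(cosine([1.0, 1.0], [1.0, 0.0]), 3))  # 0.707
```

Only the direction of the vectors matters, not their length, which is why the score is a convenient way to compare embeddings of sentences with different magnitudes.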


# Embedding on your own data using a NN


In [47]:
import torch
import torch.nn as nn

# Toy dataset (replace with your own data)
corpus = ["The cat sat on the mat", "The dog chased the ball in the park",
          "Birds sing beautiful songs in the morning.", "The sun shines brightly during the day."]

# Tokenization (split sentences into words)
tokens = [sentence.split() for sentence in corpus]

tokens


[['The', 'cat', 'sat', 'on', 'the', 'mat'],
 ['The', 'dog', 'chased', 'the', 'ball', 'in', 'the', 'park'],
 ['Birds', 'sing', 'beautiful', 'songs', 'in', 'the', 'morning.'],
 ['The', 'sun', 'shines', 'brightly', 'during', 'the', 'day.']]

In [48]:
vocab = set(word for sentence in tokens for word in sentence)
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
print(vocab)
print(word_to_idx)

{'sun', 'The', 'morning.', 'chased', 'day.', 'cat', 'park', 'shines', 'beautiful', 'Birds', 'sing', 'sat', 'songs', 'ball', 'the', 'mat', 'during', 'in', 'brightly', 'on', 'dog'}
{'sun': 0, 'The': 1, 'morning.': 2, 'chased': 3, 'day.': 4, 'cat': 5, 'park': 6, 'shines': 7, 'beautiful': 8, 'Birds': 9, 'sing': 10, 'sat': 11, 'songs': 12, 'ball': 13, 'the': 14, 'mat': 15, 'during': 16, 'in': 17, 'brightly': 18, 'on': 19, 'dog': 20}
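In the output above, "The" and "the" receive separate IDs, and "morning." and "day." keep their trailing periods. Because a `set` has no guaranteed order, the IDs can also change between runs. The sketch below shows one possible way to normalize the tokens and make the mapping deterministic. The `simple_tokenize` helper and the `normalized_*` names are assumptions made for this example.

```python
import re

corpus = ["The cat sat on the mat", "The dog chased the ball in the park",
          "Birds sing beautiful songs in the morning.", "The sun shines brightly during the day."]


def simple_tokenize(sentence):
    """Lowercase the sentence and keep only runs of letters (drops punctuation)."""
    return re.findall(r"[a-z]+", sentence.lower())


normalized_tokens = [simple_tokenize(sentence) for sentence in corpus]

# Sorting gives every word the same ID on every run.
normalized_vocab = sorted({word for sentence in normalized_tokens for word in sentence})
normalized_word_to_idx = {word: idx for idx, word in enumerate(normalized_vocab)}

print(normalized_tokens[2])           # ['birds', 'sing', 'beautiful', 'songs', 'in', 'the', 'morning']
print(len(normalized_vocab))          # 20
print(normalized_word_to_idx["cat"])  # 4
```

The vocabulary shrinks from 21 to 20 entries because "The" and "the" now share one ID.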


In [68]:
# Create an embedding layer with one row per vocabulary entry

# Hyperparameters
vocab_size = len(vocab)
embedding_dim = 100

# Initialize an embedding layer
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

# Example: get the embedding for the word "cat".
# Calling the layer with an index performs a row lookup in its weight matrix.
word_idx = word_to_idx["cat"]
word_embedding = embedding_layer(torch.tensor([word_idx], dtype=torch.long))

print(f"Embedding for 'cat': {word_embedding[0][:20]}")
print(word_embedding.shape)


Embedding for 'cat': tensor([ 1.8723, -0.4072, -0.2770, -0.1821,  0.5865,  0.2617,  0.5766, -0.7246,
         0.3305, -0.5248,  2.2697, -0.8297, -1.4333, -0.3883,  0.5718, -0.7854,
        -0.2839,  0.0305,  0.8492,  0.2427], grad_fn=<SliceBackward0>)
torch.Size([1, 100])


In [67]:
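# Another way of obtaining that embedding: index the weight matrix directly.
# Note: the values printed below differ from the previous output, most likely because the cells
# were run out of order (In [67] ran before In [68] re-created the layer with new random weights).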
embedding_layer.weight[word_idx][:20]

tensor([ 0.6007,  0.9561, -0.3089,  0.6745,  0.8899, -2.0931, -1.4284, -0.6044,
        -1.6035, -0.2440, -0.2733,  0.6231,  0.7620, -1.1159,  0.3083, -0.5413,
         0.1813,  1.3681,  1.2292,  1.9382], grad_fn=<SliceBackward0>)
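Both cells above read one row of the same weight matrix. Conceptually, this lookup is equivalent to multiplying a one-hot vector by that matrix. The following pure-Python sketch checks the equivalence on a small random table that stands in for `embedding_layer.weight`. The sizes, the seed, and the helper names are arbitrary choices for this example.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 5, 3

# Random table standing in for the embedding layer's weight matrix.
weights = [[random.gauss(0.0, 1.0) for _ in range(embedding_dim)]
           for _ in range(vocab_size)]


def lookup(idx):
    """Embedding lookup: return row `idx` of the table."""
    return weights[idx]


def one_hot_matmul(idx):
    """Same result via a one-hot row vector times the weight matrix."""
    one_hot = [1.0 if i == idx else 0.0 for i in range(vocab_size)]
    return [sum(one_hot[i] * weights[i][j] for i in range(vocab_size))
            for j in range(embedding_dim)]


row = lookup(2)
product = one_hot_matmul(2)
print(all(abs(a - b) < 1e-12 for a, b in zip(row, product)))  # True
```

The direct lookup avoids the multiplications by zero, which is why embedding layers are implemented as an index operation rather than as a matrix product.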