### Agenda


> - Understand the need for tokenization and its overall flow
> - How BPE and WordPiece work
> - Implementation of BPE with OpenAI's vocabulary (using tiktoken with the GPT-4o tokenizer)
> - Implementation of BPE and WordPiece using Hugging Face (**TIME FOR ASSIGNMENT-1**)
> - Introduction to embeddings: sparse vectors and dense vectors
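
Before turning to the libraries, the core BPE training loop can be sketched in plain Python: count adjacent symbol pairs, merge the most frequent pair, and repeat. This is a minimal illustration under simplifying assumptions (whitespace-split words, no end-of-word marker, ties broken by first occurrence); the function names and the toy corpus are hypothetical and are not taken from tiktoken or Hugging Face.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges


merges = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # → [('l', 'o'), ('lo', 'w')]
```

Production tokenizers add details this sketch leaves out, such as byte-level alphabets, pre-tokenization rules, and special tokens.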

## Using tiktoken

Visualization:
- Token IDs with model selection: https://tiktokenizer.vercel.app/

In [None]:
import tiktoken

In [None]:
# GPT-2 (does not merge spaces)
enc = tiktoken.get_encoding("gpt2")
print(enc.encode("    hello world!!!"))

# GPT-4 (merges spaces)
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("    hello world!!!"))

# GPT-4o (merges spaces)
enc = tiktoken.get_encoding("o200k_base")
print(enc.encode("    hello world!!!"))

[220, 220, 220, 23748, 995, 10185]
[262, 24748, 1917, 12340]
[271, 40617, 2375, 10880]


In [None]:
texts = [
    "Naruto trained hard to master the Rasengan technique.",
    "The stock market fluctuates daily due to global events.",
    "Deep learning models require large amounts of data to perform well.",
    "In astronomy, black holes bend light through gravitational lensing.",
    "Luffy is the captain of Straw Hat Pirates",
]

In [None]:
tokenizer = tiktoken.encoding_for_model("gpt-4o")

In [None]:
tokenizer

<Encoding 'o200k_base'>

In [None]:
for i, text in enumerate(texts, 1):
    print(f"Text {i}: {text}")

    tokens = tokenizer.encode(text)
    print(f"Tokens: {tokens}")
    print(f"Token count: {len(tokens)}")

    # Decode tokens back to text - same as the original text
    # decoded_text = tokenizer.decode(tokens)
    # print(f"Decoded: {decoded_text}")

    # Show individual token meanings
    print("Individual tokens:")
    for token_id in tokens:
        token_text = tokenizer.decode([token_id])
        print(f"  {token_id} -> '{token_text}'")

    print("-" * 50)

### Actual usage - tiktoken

In [None]:
long_text = """
Large language models like GPT-4 are powerful tools for natural language processing.
They can understand context, generate human-like text, and perform various tasks.
However, API calls are charged based on the number of tokens processed.
Understanding tokenization helps you estimate costs and optimize your prompts.
"""

In [None]:
tokens = tokenizer.encode(long_text)
print(f"Total tokens: {len(tokens)}")

Total tokens: 72


In [None]:
cost = (len(tokens) / 1000) * 0.03
print(f"Estimated cost (assuming $0.03 per 1K tokens): ${cost}")

Estimated cost (assuming $0.03 per 1K tokens): $0.0021599999999999996
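
The long decimal tail in the output above is ordinary binary floating-point rounding, not extra precision. A minimal sketch of a reusable helper, assuming a single flat price per 1K tokens (the $0.03 figure is the notebook's placeholder rate, and `estimate_cost` is a hypothetical name):

```python
def estimate_cost(num_tokens: int, price_per_1k: float) -> float:
    """Return the estimated cost in dollars for `num_tokens` tokens."""
    return num_tokens / 1000 * price_per_1k


# 72 tokens is the count printed for `long_text` above.
cost = estimate_cost(72, 0.03)

# Format to a fixed number of decimals to hide float rounding noise.
print(f"Estimated cost: ${cost:.5f}")  # → Estimated cost: $0.00216
```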


## Hugging Face

In [None]:
!pip install tokenizers

In [None]:
from tokenizers import Tokenizer, models, pre_tokenizers
from tokenizers import decoders, trainers, processors

In [None]:
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

### Using a simple list of data

In [None]:
texts = [
    "Hello Naruto trained hard to master the Rasengan technique.",
    "The stock market fluctuates daily due to global events.",
    "Deep learning models require large amounts of data to perform well.",
    "In astronomy, black holes bend light through gravitational lensing models",
    "Hello Luffy is the captain of Straw Hat Pirates",
]

In [None]:
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
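
The `Whitespace` pre-tokenizer is documented as splitting on the pattern `\w+|[^\w\s]+`, meaning runs of word characters or runs of punctuation, with whitespace dropped. A rough plain-Python equivalent, assuming Python's `re` treats `\w` closely enough to the library's regex engine for simple text (the helper name is hypothetical):

```python
import re


def whitespace_pre_tokenize(text):
    """Split text into word runs and punctuation runs, discarding whitespace."""
    return re.findall(r"\w+|[^\w\s]+", text)


pieces = whitespace_pre_tokenize("Hello, y'all! How are you?")
print(pieces)
# → ['Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '?']
```

BPE merges are then learned and applied inside each of these pieces, never across them.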

In [None]:
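# BPEDecoder turns the "</w>" end-of-word suffix back into spaces when decoding.
# Caveat: the BPE model above is created without end_of_word_suffix, so its tokens
# carry no "</w>" and decode() will likely join them without spaces.
tokenizer.decoder = decoders.BPEDecoder(suffix="</w>")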

In [None]:
trainer = trainers.BpeTrainer(
    vocab_size=10000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

tokenizer.train_from_iterator(texts, trainer)

In [None]:
test_text = "Hello machine learning models tokenization!"

encoded = tokenizer.encode(test_text)
print("Original text:", test_text)
print("Token IDs:", encoded.ids)
print("Tokens:", encoded.tokens)

Original text: Hello machine learning models tokenization!
Token IDs: [68, 27, 16, 18, 23, 44, 20, 26, 20, 39, 28, 65, 69, 42, 25, 43, 24, 0, 40, 24, 59, 0]
Tokens: ['Hello', 'm', 'a', 'c', 'h', 'in', 'e', 'l', 'e', 'ar', 'n', 'ing', 'models', 'to', 'k', 'en', 'i', '[UNK]', 'at', 'i', 'on', '[UNK]']
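
The two `[UNK]` tokens above line up with characters that never occur in the training texts. Without a byte-level fallback, the trainer's base alphabet comes from the characters it has seen, so an unseen character cannot be built from any merge. A small plain-Python check of that explanation (the variable names are illustrative):

```python
train_texts = [
    "Hello Naruto trained hard to master the Rasengan technique.",
    "The stock market fluctuates daily due to global events.",
    "Deep learning models require large amounts of data to perform well.",
    "In astronomy, black holes bend light through gravitational lensing models",
    "Hello Luffy is the captain of Straw Hat Pirates",
]
test_text = "Hello machine learning models tokenization!"

# Every character that appears anywhere in the training data.
seen_chars = set("".join(train_texts))

# Non-whitespace characters of the test text that were never seen in training.
unseen = sorted({ch for ch in test_text if not ch.isspace() and ch not in seen_chars})
print(unseen)  # → ['!', 'z']
```

The heavy fragmentation of `machine` and `learning` has a different cause: with only five sentences and `min_frequency=2`, very few merges are learned.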


### Using open-source data

In [None]:
from datasets import load_dataset
ds = load_dataset("wikitext", "wikitext-103-raw-v1")

In [None]:
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

In [None]:
trainer.vocab_size

30000

In [None]:
trainer.min_frequency

2

In [None]:
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.decoder = decoders.BPEDecoder(suffix="</w>")

In [None]:
ds['validation']['text']

Column(['', ' = Homarus gammarus = \n', '', ' Homarus gammarus , known as the European lobster or common lobster , is a species of clawed lobster from the eastern Atlantic Ocean , Mediterranean Sea and parts of the Black Sea . It is closely related to the American lobster , H. americanus . It may grow to a length of 60 cm ( 24 in ) and a mass of 6 kilograms ( 13 lb ) , and bears a conspicuous pair of claws . In life , the lobsters are blue , only becoming " lobster red " on cooking . Mating occurs in the summer , producing eggs which are carried by the females for up to a year before hatching into planktonic larvae . Homarus gammarus is a highly esteemed food , and is widely caught using lobster pots , mostly around the British Isles . \n', ''])

In [None]:
val_file = "wiki.validation.txt"
with open(val_file, "w", encoding="utf-8") as f:
    for line in ds['validation']["text"]:
        f.write(line + "\n")

In [None]:
!cat wiki.validation.txt

In [None]:
tokenizer.train([val_file], trainer)

In [None]:
tokenizer.save("tokenizer-validation.json")

In [None]:
tokenizer = Tokenizer.from_file("tokenizer-validation.json")

In [None]:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")

In [None]:
print(output.tokens)

['Hel', 'lo', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?']


In [None]:
print(output.ids)

[5491, 330, 15, 88, 10, 396, 5, 4392, 397, 1867, 0, 34]
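
The emoji maps to `[UNK]` (ID 0) because it does not appear in the character alphabet learned from the validation file. Byte-level BPE tokenizers, such as the ones tiktoken loads earlier in this notebook, avoid this by starting from the 256 possible byte values, so any UTF-8 string can be encoded. The sketch below shows only that byte-level starting point in plain Python, not a trained tokenizer:

```python
text = "you 😁 ?"

# Every string becomes a sequence of byte values in the range 0..255.
byte_ids = list(text.encode("utf-8"))
print(byte_ids)

# The emoji alone occupies four bytes.
emoji_bytes = list("😁".encode("utf-8"))
print(emoji_bytes)  # → [240, 159, 152, 129]

# Decoding the bytes restores the original text exactly, so nothing is lost.
restored = bytes(byte_ids).decode("utf-8")
print(restored == text)  # → True
```

In the `tokenizers` library, `pre_tokenizers.ByteLevel` together with `decoders.ByteLevel` is the usual way to get this behaviour; wiring it into the pipeline above is left as an exercise.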


## Assignment 1

> Implement WordPiece using Hugging Face
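
As background for the assignment, WordPiece encoding in the BERT style is commonly described as greedy longest-match-first: take the longest vocabulary entry that prefixes the remaining word, and mark non-initial pieces with `##`. The sketch below covers only that encoding step with a hypothetical hand-written vocabulary. It does not train a vocabulary and does not use the Hugging Face API that the assignment asks for.

```python
def wordpiece_encode(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # if any part cannot be matched, the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
vocab = {"token", "##ization", "##ize", "un", "##related"}

known = wordpiece_encode("tokenization", vocab)
unknown = wordpiece_encode("tokens", vocab)
print(known)    # → ['token', '##ization']
print(unknown)  # → ['[UNK]']
```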

## Embeddings - Sparse and Dense Vectors

Visuals: https://projector.tensorflow.org/

In [None]:
!pip install fastembed

In [None]:
import numpy as np
from fastembed import TextEmbedding, SparseTextEmbedding

In [None]:
texts = [
    "I'll become the strongest ninja in the village!",
    "Believe in yourself and create your own destiny!",
    "Power without friends is meaningless.",
    "Even if I die, I'll protect everyone!"
]

In [None]:
dense_model = TextEmbedding(model_name="jinaai/jina-embeddings-v2-base-en")
sparse_model = SparseTextEmbedding(model_name="Qdrant/bm25")

In [None]:
dense_embeddings = list(dense_model.embed(texts))

In [None]:
len(dense_embeddings)

4

In [None]:
dense_embeddings[0].shape

(768,)

In [None]:
print("DENSE VECTORS:")
for text, emb in zip(texts, dense_embeddings):
    print(f"'{text}' -> {emb.shape}, Values: {emb[:5]}")

DENSE VECTORS:
'I'll become the strongest ninja in the village!' -> (768,), Values: [-0.0177785  -0.0243032   0.01639675 -0.00791665 -0.02232398]
'Believe in yourself and create your own destiny!' -> (768,), Values: [-0.01656694 -0.01500895  0.03705287  0.041886    0.00065335]
'Power without friends is meaningless.' -> (768,), Values: [-0.01274847 -0.00664657  0.07809768  0.01914289 -0.04791698]
'Even if I die, I'll protect everyone!' -> (768,), Values: [-0.01920971 -0.02787863  0.03291088  0.03493232 -0.01571201]
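
Dense vectors are usually compared by cosine similarity, which measures the angle between two vectors and ignores their length. A minimal plain-Python sketch with toy three-dimensional vectors (the values are made up for illustration and are not model outputs):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


u = [1.0, 0.0, 1.0]
v = [1.0, 0.0, 0.0]
w = [0.0, 1.0, 0.0]

sim_uv = cosine_similarity(u, v)
sim_uw = cosine_similarity(u, w)
print(round(sim_uv, 4))  # → 0.7071
print(sim_uw)            # → 0.0
```

The same function should also accept the notebook's vectors, for example `cosine_similarity(dense_embeddings[0], dense_embeddings[3])`, since NumPy arrays can be iterated element by element.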


In [None]:
sparse_embeddings = list(sparse_model.embed(texts))

In [None]:
sparse_embeddings[0].values.shape

(4,)

In [None]:
sparse_embeddings[1].values.shape

(3,)

In [None]:
sparse_embeddings[2].values.shape

(4,)

In [None]:
print(f"Dense: {dense_embeddings[0].shape[0]} dims")
print(f"Sparse: {len(sparse_embeddings[0].values)} non-zero values")

Dense: 768 dims
Sparse: 4 non-zero values
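
A sparse embedding stores only its non-zero entries as parallel `indices` and `values` arrays, which is why the shapes above are 4, 3 and 4 rather than a fixed dimension. Those counts are consistent with one entry per distinct non-stopword term, although that is an inference from the output and not something the notebook verifies. The sketch below uses made-up indices and weights to show how such a pair can be scored against another sparse vector or expanded to dense form:

```python
def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Dot product of two sparse vectors given as (indices, values) pairs."""
    b = dict(zip(idx_b, val_b))
    # Only indices present in both vectors contribute to the score.
    return sum(v * b[i] for i, v in zip(idx_a, val_a) if i in b)


def to_dense(indices, values, dim):
    """Expand a sparse (indices, values) pair into a dense list of length `dim`."""
    dense = [0.0] * dim
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Hypothetical term indices and weights, not real BM25 output.
doc_idx, doc_val = [2, 5, 7], [1.5, 0.5, 2.0]
qry_idx, qry_val = [5, 7, 9], [1.0, 1.0, 1.0]

score = sparse_dot(doc_idx, doc_val, qry_idx, qry_val)
dense_doc = to_dense(doc_idx, doc_val, dim=10)
print(score)      # → 2.5
print(dense_doc)  # → [0.0, 0.0, 1.5, 0.0, 0.0, 0.5, 0.0, 2.0, 0.0, 0.0]
```

With the notebook's objects, the corresponding inputs would be `sparse_embeddings[i].indices` and `sparse_embeddings[i].values`.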
