# Gemma 3 Tokenizer Expansion Demo - Tibetan Language

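This notebook demonstrates the `expand_tokenizer` function using the Gemma 3 tokenizer and a small Tibetan corpus. We compare how a sample sentence is tokenized before and after expanding the vocabulary, first with a target of 100 new tokens and then with a target of 1000.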


## 1. Setup and Imports


In [1]:
import os
import sys
import logging
from typing import List

# Add the Tokenizers directory to the import path.
# In Jupyter, os.getcwd() returns the kernel's working directory, which is
# normally the directory containing the notebook. If the notebook is not in
# Tokenizers/, adjust the path.
sys.path.insert(0, os.getcwd())

from transformers import AutoTokenizer
from model_processors import get_processor

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)




## 2. Define Tibetan Corpus and Sample Text


In [2]:
# Tibetan corpus for training
TIBETAN_CORPUS = [
    "བོད་ཀྱི་སྐད་ཡིག་ནི་བོད་ཀྱི་མི་རིགས་ཀྱི་སྐད་ཡིག་རེད།",
    "ང་ཚོས་བོད་སྐད་སློབ་གཉེར་བྱེད་དགོས།",
    "བོད་ལྗོངས་ནི་ལྷོ་རྒྱ་གར་དང་བལ་ཡུལ་དང་འབྲེལ་བ་ཡོད།",
    "དཀའ་ལས་མང་པོ་ཡོད་རུང་ང་ཚོས་སློབ་གཉེར་བྱེད་དགོས།",
    "བོད་ཀྱི་རིག་གནས་དང་ལོ་རྒྱུས་ནི་ཧ་ཅང་རིང་པོ་རེད།",
] * 100  # Repeat for more data

# Sample Tibetan text for demonstration (different from corpus)
SAMPLE_TIBETAN_TEXT = "བོད་ཡིག་ནི་རྒྱལ་ཁབ་ཀྱི་སྐད་ཡིག་རེད། ང་ཚོས་དེ་སློབ་གཉེར་བྱེད་དགོས།"

print(f"Tibetan corpus size: {len(TIBETAN_CORPUS)} samples")
print(f"Sample text: {SAMPLE_TIBETAN_TEXT}")


Tibetan corpus size: 500 samples
Sample text: བོད་ཡིག་ནི་རྒྱལ་ཁབ་ཀྱི་སྐད་ཡིག་རེད། ང་ཚོས་དེ་སློབ་གཉེར་བྱེད་དགོས།


## 3. Load Original Gemma 3 Tokenizer


In [3]:
# Model configuration for Gemma 3
model_config = {
    "model_name": "google/gemma-3-1b-it",
    "algorithm": "SENTENCEPIECE_BPE"
}

# Get processor and original tokenizer
processor = get_processor("gemma", model_config)
original_tokenizer = processor.original_tokenizer

# Display original vocabulary size
original_vocab_size = len(original_tokenizer.get_vocab())
print(f"✅ Loaded Gemma 3 tokenizer")
print(f"Original vocabulary size: {original_vocab_size:,}")


2025-11-12 00:38:28,962 - model_processors - INFO - Loaded google/gemma-3-1b-it tokenizer
2025-11-12 00:38:28,962 - model_processors - INFO - Initialized Gemma processor


✅ Loaded Gemma 3 tokenizer
Original vocabulary size: 262,145


## 4. Tokenization BEFORE Training


In [4]:
# Encode sample text with original tokenizer
before_token_ids = original_tokenizer.encode(SAMPLE_TIBETAN_TEXT)
before_tokens = original_tokenizer.convert_ids_to_tokens(before_token_ids)

print("=" * 80)
print("BEFORE TRAINING")
print("=" * 80)
print(f"\nSample text: {SAMPLE_TIBETAN_TEXT}")
print(f"\nToken IDs: {before_token_ids}")
print(f"\nNumber of tokens: {len(before_token_ids)}")
print(f"\nToken strings:")
for i, (token_id, token) in enumerate(zip(before_token_ids, before_tokens)):
    print(f"  [{i}] ID: {token_id:6d} | Token: '{token}'")
print(f"\nVocabulary size: {original_vocab_size:,}")


BEFORE TRAINING

Sample text: བོད་ཡིག་ནི་རྒྱལ་ཁབ་ཀྱི་སྐད་ཡིག་རེད། ང་ཚོས་དེ་སློབ་གཉེར་བྱེད་དགོས།

Token IDs: [2, 240332, 144766, 242070, 212637, 240392, 55765, 203431, 100206, 242268, 155394, 242079, 151310, 202660, 73436, 242070, 212637, 240451, 240640, 239938, 241594, 236743, 52855, 242545, 235862, 239938, 120295, 239502, 243797, 239985, 155394, 239935, 243835, 240640, 92089, 157435, 201958, 239938, 239935, 239985, 239502, 241594]

Number of tokens: 42

Token strings:
  [0] ID:      2 | Token: '<bos>'
  [1] ID: 240332 | Token: 'བ'
  [2] ID: 144766 | Token: 'ོད་'
  [3] ID: 242070 | Token: 'ཡ'
  [4] ID: 212637 | Token: 'ིག་'
  [5] ID: 240392 | Token: 'ན'
  [6] ID:  55765 | Token: 'ི་'
  [7] ID: 203431 | Token: 'རྒྱ'
  [8] ID: 100206 | Token: 'ལ་'
  [9] ID: 242268 | Token: 'ཁ'
  [10] ID: 155394 | Token: 'བ་'
  [11] ID: 24207
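
A raw token count is easier to interpret relative to the amount of text. The sketch below computes tokens per Tibetan syllable for the sample sentence, assuming syllables can be approximated by splitting on the tsheg (U+0F0B), the shad (U+0F0D), and whitespace. `count_syllables` and `tokens_per_syllable` are hypothetical helpers, not part of `model_processors`.

```python
import re

# Approximate syllable boundaries: tsheg (U+0F0B), shad (U+0F0D), whitespace.
_SYLLABLE_SEP = re.compile(r"[\u0F0B\u0F0D\s]+")


def count_syllables(text: str) -> int:
    """Count non-empty chunks between tsheg/shad/whitespace separators."""
    return sum(1 for chunk in _SYLLABLE_SEP.split(text) if chunk)


def tokens_per_syllable(num_tokens: int, text: str, special_tokens: int = 1) -> float:
    """Tokens per syllable, excluding `special_tokens` markers such as <bos>."""
    syllables = count_syllables(text)
    if syllables == 0:
        raise ValueError("text contains no syllables")
    return (num_tokens - special_tokens) / syllables


SAMPLE_TIBETAN_TEXT = "བོད་ཡིག་ནི་རྒྱལ་ཁབ་ཀྱི་སྐད་ཡིག་རེད། ང་ཚོས་དེ་སློབ་གཉེར་བྱེད་དགོས།"

# 42 is the token count reported by the original tokenizer (including <bos>).
print(f"Syllables: {count_syllables(SAMPLE_TIBETAN_TEXT)}")
print(f"Tokens per syllable: {tokens_per_syllable(42, SAMPLE_TIBETAN_TEXT):.2f}")
```

A value well above 1 means that most syllables are split into several pieces, which is the situation vocabulary expansion is meant to improve.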

## 5. Train and Expand Tokenizer


In [5]:
# Expand tokenizer with Tibetan corpus
print("=" * 80)
print("TRAINING TOKENIZER")
print("=" * 80)
print(f"\nTraining on {len(TIBETAN_CORPUS)} Tibetan corpus samples...")
print(f"Target: Add 100 new tokens\n")

expanded_tokenizer, tokens_added, new_tokens = processor.expand_tokenizer(
    algorithm_name="SENTENCEPIECE_BPE",
    max_tokens=100,
    training_corpus=TIBETAN_CORPUS
)

new_vocab_size = len(expanded_tokenizer.get_vocab())

print("\n" + "=" * 80)
print("TRAINING COMPLETE")
print("=" * 80)
print(f"✅ Tokens added: {tokens_added}")
print(f"✅ Original vocabulary size: {original_vocab_size:,}")
print(f"✅ New vocabulary size: {new_vocab_size:,}")
print(f"✅ Vocabulary increase: {new_vocab_size - original_vocab_size:,}")
if new_tokens:
    print(f"\nSample new tokens: {new_tokens[:10]}{'...' if len(new_tokens) > 10 else ''}")


TRAINING TOKENIZER

Training on 500 Tibetan corpus samples...
Target: Add 100 new tokens



2025-11-12 00:38:31,712 - model_processors - INFO - 
=== EXPANDING GOOGLE/GEMMA-3-1B-IT WITH SENTENCEPIECE_BPE ===
2025-11-12 00:38:31,712 - model_processors - INFO - Original vocabulary size: 262,145
2025-11-12 00:38:31,712 - model_processors - INFO - Training corpus size: 500
2025-11-12 00:38:31,712 - model_processors - INFO - Using train_new_from_iterator approach for SENTENCEPIECE_BPE with 500 corpus samples
2025-11-12 00:38:31,712 - model_processors - INFO - Training new tokenizer from base using corpus...
2025-11-12 00:38:34,496 - model_processors - INFO - Base tokenizer model type: BPE
2025-11-12 00:38:34,496 - model_processors - INFO - New tokenizer model type: BPE
2025-11-12 00:38:34,512 - model_processors - INFO - Trying 1.5x buffer: target vocab size = 150 (150 new tokens to select 100 best from)
2025-11-12 00:38:34,512 - model_processors - INFO - Training new tokenizer from base using corpus...
Adding merges & tokens:  95%|█████████▍| 105/111 [00:01<00:0


TRAINING COMPLETE
✅ Tokens added: 100
✅ Original vocabulary size: 262,145
✅ New vocabulary size: 262,245
✅ Vocabulary increase: 100

Sample new tokens: ['ར་དང་', 'ོ་ཡོད་རུ', 'ལྷོ་རྒྱ་གར་དང་', '་གནས་', 'ཀྱ', 'རིག', 'ཚོ', 'ཧ་ཅ', 'སྐད་ཡིག', '་རེད།']...
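
The log line `Trying 1.5x buffer ... (150 new tokens to select 100 best from)` suggests that `expand_tokenizer` over-generates candidates and then keeps the best ones. How it ranks them is not shown in this notebook. The following is a minimal sketch of one possible ranking, by raw substring frequency in the corpus; `select_new_tokens` is a hypothetical helper and should not be read as the actual implementation.

```python
from collections import Counter
from typing import Iterable, List, Set


def select_new_tokens(
    candidates: Iterable[str],
    existing_vocab: Set[str],
    corpus: Iterable[str],
    max_tokens: int,
) -> List[str]:
    """Keep up to `max_tokens` unseen candidates, most frequent in the corpus first."""
    corpus = list(corpus)
    # Drop duplicates (order-preserving) and anything the base vocab already has.
    unseen = [c for c in dict.fromkeys(candidates) if c and c not in existing_vocab]
    freq = Counter({c: sum(text.count(c) for text in corpus) for c in unseen})
    # Rank by frequency, breaking ties in favour of longer tokens.
    ranked = sorted(unseen, key=lambda c: (-freq[c], -len(c)))
    # Candidates that never occur in the corpus are discarded.
    return [c for c in ranked if freq[c] > 0][:max_tokens]


# Toy data (assumption: Latin placeholders stand in for Tibetan candidates).
corpus = ["abab abab", "abab cd"]
picked = select_new_tokens(["ab", "abab", "cd", "zz", "a"], {"a", "b"}, corpus, max_tokens=2)
print(picked)
```

Over-generating and then filtering lets the selection step discard candidates that duplicate existing vocabulary entries without falling short of the requested count.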


## 6. Tokenization AFTER Training


In [6]:
# Encode same sample text with expanded tokenizer
after_token_ids = expanded_tokenizer.encode(SAMPLE_TIBETAN_TEXT)
after_tokens = expanded_tokenizer.convert_ids_to_tokens(after_token_ids)

print("=" * 80)
print("AFTER TRAINING")
print("=" * 80)
print(f"\nSample text: {SAMPLE_TIBETAN_TEXT}")
print(f"\nToken IDs: {after_token_ids}")
print(f"\nNumber of tokens: {len(after_token_ids)}")
print(f"\nToken strings:")
for i, (token_id, token) in enumerate(zip(after_token_ids, after_tokens)):
    print(f"  [{i}] ID: {token_id:6d} | Token: '{token}'")
print(f"\nVocabulary size: {new_vocab_size:,}")


AFTER TRAINING

Sample text: བོད་ཡིག་ནི་རྒྱལ་ཁབ་ཀྱི་སྐད་ཡིག་རེད། ང་ཚོས་དེ་སློབ་གཉེར་བྱེད་དགོས།

Token IDs: [2, 262144, 242070, 212637, 262153, 203431, 100206, 242268, 155394, 242079, 151310, 262156, 242070, 212637, 240451, 262168, 236743, 52855, 242545, 235862, 239938, 120295, 262167, 239985, 155394, 239935, 262159, 92089, 157435, 201958, 262177]

Number of tokens: 31

Token strings:
  [0] ID:      2 | Token: '<bos>'
  [1] ID: 262144 | Token: 'བོད་'
  [2] ID: 242070 | Token: 'ཡ'
  [3] ID: 212637 | Token: 'ིག་'
  [4] ID: 262153 | Token: 'ནི་'
  [5] ID: 203431 | Token: 'རྒྱ'
  [6] ID: 100206 | Token: 'ལ་'
  [7] ID: 242268 | Token: 'ཁ'
  [8] ID: 155394 | Token: 'བ་'
  [9] ID: 242079 | Token: 'ཀ'
  [10] ID: 151310 | Token: 'ྱི་'
  [11] ID: 262156 | Token: 'སྐད་'
  [12] ID: 242070 | Token: 'ཡ'
  [13] ID: 21263
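
One detail in the output above is worth checking: the original vocabulary is reported as 262,145 entries, yet the new token at position 1 has ID 262144. If the original IDs run contiguously from 0, that ID would already belong to an existing token. The sketch below is one way to test for this, assuming both vocabularies are available as the `{token: id}` dictionaries returned by `get_vocab()`; `find_id_collisions` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def find_id_collisions(
    original_vocab: Dict[str, int], expanded_vocab: Dict[str, int]
) -> List[Tuple[int, str, str]]:
    """Return (id, original_token, new_token) for every ID that changed owner."""
    id_to_original = {idx: tok for tok, idx in original_vocab.items()}
    collisions = []
    for token, idx in expanded_vocab.items():
        owner = id_to_original.get(idx)
        if owner is not None and owner != token:
            collisions.append((idx, owner, token))
    return sorted(collisions)


# Toy vocabularies (not real Gemma data): ID 3 is reused by a new token.
original = {"<bos>": 0, "a": 1, "b": 2, "<special>": 3}
expanded = {"<bos>": 0, "a": 1, "b": 2, "ab": 3, "ba": 4}
print(find_id_collisions(original, expanded))
```

With the notebook's objects this could be called as `find_id_collisions(original_tokenizer.get_vocab(), expanded_tokenizer.get_vocab())`; an empty list means no ID was reassigned.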

## 7. Side-by-Side Encode Comparison


In [7]:
# Show encode() results side-by-side
print("=" * 80)
print("ENCODE() FUNCTION COMPARISON")
print("=" * 80)
print(f"\nSample text: {SAMPLE_TIBETAN_TEXT}\n")

# Get encode results
before_encode = original_tokenizer.encode(SAMPLE_TIBETAN_TEXT)
after_encode = expanded_tokenizer.encode(SAMPLE_TIBETAN_TEXT)

print("BEFORE (Original Tokenizer):")
print(f"  encode() result: {original_tokenizer.convert_ids_to_tokens(before_encode)}")
print(f"  Length: {len(before_encode)} tokens\n")

print("AFTER (Expanded Tokenizer):")
print(f"  encode() result: {expanded_tokenizer.convert_ids_to_tokens(after_encode)}")
print(f"  Length: {len(after_encode)} tokens\n")

print("=" * 80)
print("DIFFERENCE:")
print("=" * 80)
if len(before_encode) != len(after_encode):
    print(f"  Token count changed: {len(before_encode)} → {len(after_encode)} ({len(after_encode) - len(before_encode):+d})")
else:
    print(f"  Token count: Same ({len(before_encode)} tokens)")

if before_encode != after_encode:
    print(f"  Token IDs changed: Different tokenization")
    # Show which positions differ
    max_len = max(len(before_encode), len(after_encode))
    differences = []
    for i in range(max_len):
        before_val = before_encode[i] if i < len(before_encode) else None
        after_val = after_encode[i] if i < len(after_encode) else None
        if before_val != after_val:
            differences.append((i, before_val, after_val))
    
    if differences:
        print(f"\n  Positions with different token IDs:")
        for pos, before_val, after_val in differences[:10]:  # Show first 10 differences
            print(f"    Position {pos}: {before_val} → {after_val}")
        if len(differences) > 10:
            print(f"    ... and {len(differences) - 10} more differences")
else:
    print(f"  Token IDs: Identical")


ENCODE() FUNCTION COMPARISON

Sample text: བོད་ཡིག་ནི་རྒྱལ་ཁབ་ཀྱི་སྐད་ཡིག་རེད། ང་ཚོས་དེ་སློབ་གཉེར་བྱེད་དགོས།

BEFORE (Original Tokenizer):
  encode() result: ['<bos>', 'བ', 'ོད་', 'ཡ', 'ིག་', 'ན', 'ི་', 'རྒྱ', 'ལ་', 'ཁ', 'བ་', 'ཀ', 'ྱི་', 'སྐ', 'ད་', 'ཡ', 'ིག་', 'ར', 'ེ', 'ད', '།', '▁', 'ང་', 'ཚ', 'ོས་', 'ད', 'ེ་', 'ས', 'ླ', 'ོ', 'བ་', 'ག', 'ཉ', 'ེ', 'ར་', 'བྱ', 'ེད་', 'ད', 'ག', 'ོ', 'ས', '།']
  Length: 42 tokens

AFTER (Expanded Tokenizer):
  encode() result: ['<bos>', 'བོད་', 'ཡ', 'ིག་', 'ནི་', 'རྒྱ', 'ལ་', 'ཁ', 'བ་', 'ཀ', 'ྱི་', 'སྐད་', 'ཡ', 'ིག་', 'ར', 'ེད།', '▁', 'ང་', 'ཚ', 'ོས་', 'ད', 'ེ་', 'སླ', 'ོ', 'བ་', 'ག', 'ཉེ', 'ར
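
Because the two ID lists differ in length, an index-by-index comparison reports almost every position after the first merge as different. An aligned diff is often more informative. The sketch below uses `difflib.SequenceMatcher` from the standard library on the leading IDs from the BEFORE and AFTER outputs; `merged_spans` is a hypothetical helper name.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def merged_spans(
    before: Sequence[int], after: Sequence[int]
) -> List[Tuple[List[int], List[int]]]:
    """Return (before_ids, after_ids) pairs for every span that is not identical."""
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    return [
        (list(before[i1:i2]), list(after[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Leading IDs copied from the BEFORE and AFTER outputs in this notebook.
before_ids = [2, 240332, 144766, 242070, 212637, 240392, 55765, 203431]
after_ids = [2, 262144, 242070, 212637, 262153, 203431]

for old, new in merged_spans(before_ids, after_ids):
    print(f"{old} -> {new}")
```

Each printed pair shows a run of original tokens that the expanded tokenizer replaced, which makes the individual merges visible instead of a list of shifted positions.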

## 8. Train Tokenizer with 1000 Tokens


In [8]:
# Expand tokenizer with Tibetan corpus - 1000 tokens
print("=" * 80)
print("TRAINING TOKENIZER WITH 1000 TOKENS")
print("=" * 80)
print(f"\nTraining on {len(TIBETAN_CORPUS)} Tibetan corpus samples...")
print(f"Target: Add 1000 new tokens\n")

expanded_tokenizer_1000, tokens_added_1000, new_tokens_1000 = processor.expand_tokenizer(
    algorithm_name="SENTENCEPIECE_BPE",
    max_tokens=1000,
    training_corpus=TIBETAN_CORPUS
)

new_vocab_size_1000 = len(expanded_tokenizer_1000.get_vocab())

print("\n" + "=" * 80)
print("TRAINING COMPLETE (1000 tokens)")
print("=" * 80)
print(f"✅ Tokens added: {tokens_added_1000}")
print(f"✅ Original vocabulary size: {original_vocab_size:,}")
print(f"✅ New vocabulary size: {new_vocab_size_1000:,}")
print(f"✅ Vocabulary increase: {new_vocab_size_1000 - original_vocab_size:,}")
if new_tokens_1000:
    print(f"\nSample new tokens: {new_tokens_1000[:10]}{'...' if len(new_tokens_1000) > 10 else ''}")


TRAINING TOKENIZER WITH 1000 TOKENS

Training on 500 Tibetan corpus samples...
Target: Add 1000 new tokens



2025-11-12 00:38:44,746 - model_processors - INFO - 
=== EXPANDING GOOGLE/GEMMA-3-1B-IT WITH SENTENCEPIECE_BPE ===
2025-11-12 00:38:44,746 - model_processors - INFO - Original vocabulary size: 262,145
2025-11-12 00:38:44,746 - model_processors - INFO - Training corpus size: 500
2025-11-12 00:38:44,762 - model_processors - INFO - Using train_new_from_iterator approach for SENTENCEPIECE_BPE with 500 corpus samples
2025-11-12 00:38:44,763 - model_processors - INFO - Training new tokenizer from base using corpus...
2025-11-12 00:38:47,562 - model_processors - INFO - Base tokenizer model type: BPE
2025-11-12 00:38:47,562 - model_processors - INFO - New tokenizer model type: BPE
2025-11-12 00:38:47,562 - model_processors - INFO - Trying 1.5x buffer: target vocab size = 1,500 (1500 new tokens to select 1000 best from)
2025-11-12 00:38:47,562 - model_processors - INFO - Training new tokenizer from base using corpus...
Adding merges & tokens: 100%|██████████| 122/122 [00:02<


TRAINING COMPLETE (1000 tokens)
✅ Tokens added: 113
✅ Original vocabulary size: 262,145
✅ New vocabulary size: 262,258
✅ Vocabulary increase: 113

Sample new tokens: ['ར་དང་', 'ོ་ཡོད་རུ', 'ལྷོ་རྒྱ་གར་དང་', '་གནས་', 'ཀྱ', 'རིག', 'ཚོ', 'ཧ་ཅ', 'སྐད་ཡིག', '་རེད།']...
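
A target of 1000 yielded only 113 new tokens. One plausible explanation, not confirmed by the logs, is that the corpus holds just five distinct sentences: repeating them 100 times raises pair frequencies but creates no new character sequences for BPE to merge. The sketch below measures that diversity. `corpus_diversity` is a hypothetical helper, and syllables are approximated by splitting on the tsheg (U+0F0B), the shad (U+0F0D), and whitespace.

```python
import re
from typing import Dict, Sequence

_SEP = re.compile(r"[\u0F0B\u0F0D\s]+")


def corpus_diversity(corpus: Sequence[str]) -> Dict[str, int]:
    """Count samples, distinct sentences, and distinct syllables."""
    syllables = {s for text in corpus for s in _SEP.split(text) if s}
    return {
        "samples": len(corpus),
        "unique_sentences": len(set(corpus)),
        "unique_syllables": len(syllables),
    }


# The five corpus sentences, repeated as in the notebook.
corpus = [
    "བོད་ཀྱི་སྐད་ཡིག་ནི་བོད་ཀྱི་མི་རིགས་ཀྱི་སྐད་ཡིག་རེད།",
    "ང་ཚོས་བོད་སྐད་སློབ་གཉེར་བྱེད་དགོས།",
    "བོད་ལྗོངས་ནི་ལྷོ་རྒྱ་གར་དང་བལ་ཡུལ་དང་འབྲེལ་བ་ཡོད།",
    "དཀའ་ལས་མང་པོ་ཡོད་རུང་ང་ཚོས་སློབ་གཉེར་བྱེད་དགོས།",
    "བོད་ཀྱི་རིག་གནས་དང་ལོ་རྒྱུས་ནི་ཧ་ཅང་རིང་པོ་རེད།",
] * 100

print(corpus_diversity(corpus))
```

If the number of distinct sentences stays this small, a larger `max_tokens` is unlikely to help; a more varied corpus would be the first thing to try.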


## 9. Tokenization with 1000-Token Tokenizer


In [9]:
# Encode same sample text with 1000-token expanded tokenizer
after_token_ids_1000 = expanded_tokenizer_1000.encode(SAMPLE_TIBETAN_TEXT)
after_tokens_1000 = expanded_tokenizer_1000.convert_ids_to_tokens(after_token_ids_1000)

print("=" * 80)
print("AFTER TRAINING (1000 tokens)")
print("=" * 80)
print(f"\nSample text: {SAMPLE_TIBETAN_TEXT}")
print(f"\nToken IDs: {after_token_ids_1000}")
print(f"\nNumber of tokens: {len(after_token_ids_1000)}")
print(f"\nToken strings:")
for i, (token_id, token) in enumerate(zip(after_token_ids_1000, after_tokens_1000)):
    print(f"  [{i}] ID: {token_id:6d} | Token: '{token}'")
print(f"\nVocabulary size: {new_vocab_size_1000:,}")


AFTER TRAINING (1000 tokens)

Sample text: བོད་ཡིག་ནི་རྒྱལ་ཁབ་ཀྱི་སྐད་ཡིག་རེད། ང་ཚོས་དེ་སློབ་གཉེར་བྱེད་དགོས།

Token IDs: [2, 262144, 242070, 212637, 262153, 203431, 100206, 242268, 155394, 242079, 151310, 262156, 242070, 212637, 240451, 262168, 236743, 52855, 242545, 235862, 239938, 120295, 262167, 239985, 155394, 239935, 262159, 92089, 157435, 201958, 262177]

Number of tokens: 31

Token strings:
  [0] ID:      2 | Token: '<bos>'
  [1] ID: 262144 | Token: 'བོད་'
  [2] ID: 242070 | Token: 'ཡ'
  [3] ID: 212637 | Token: 'ིག་'
  [4] ID: 262153 | Token: 'ནི་'
  [5] ID: 203431 | Token: 'རྒྱ'
  [6] ID: 100206 | Token: 'ལ་'
  [7] ID: 242268 | Token: 'ཁ'
  [8] ID: 155394 | Token: 'བ་'
  [9] ID: 242079 | Token: 'ཀ'
  [10] ID: 151310 | Token: 'ྱི་'
  [11] ID: 262156 | Token: 'སྐད་'
  [12] ID: 242070 | Token: 'ཡ'
  

## 10. Encode Comparison: All Three Tokenizers


In [10]:
# Show encode() results for all three tokenizers
print("=" * 80)
print("ENCODE() FUNCTION COMPARISON - ALL THREE TOKENIZERS")
print("=" * 80)
print(f"\nSample text: {SAMPLE_TIBETAN_TEXT}\n")

# Get encode results
before_encode = original_tokenizer.encode(SAMPLE_TIBETAN_TEXT)
after_encode_100 = expanded_tokenizer.encode(SAMPLE_TIBETAN_TEXT)
after_encode_1000 = expanded_tokenizer_1000.encode(SAMPLE_TIBETAN_TEXT)

print("ORIGINAL (Before Training):")
print(f"  encode() result: {original_tokenizer.convert_ids_to_tokens(before_encode)}")
print(f"  Length: {len(before_encode)} tokens\n")

print("EXPANDED (100 tokens):")
print(f"  encode() result: {expanded_tokenizer.convert_ids_to_tokens(after_encode_100)}")
print(f"  Length: {len(after_encode_100)} tokens\n")

print("EXPANDED (1000 tokens):")
print(f"  encode() result: {expanded_tokenizer_1000.convert_ids_to_tokens(after_encode_1000)}")
print(f"  Length: {len(after_encode_1000)} tokens\n")

print("=" * 80)
print("SUMMARY:")
print("=" * 80)
print(f"  Original:  {len(before_encode):3d} tokens")
print(f"  100 tokens: {len(after_encode_100):3d} tokens ({len(after_encode_100) - len(before_encode):+d} change)")
print(f"  1000 tokens: {len(after_encode_1000):3d} tokens ({len(after_encode_1000) - len(before_encode):+d} change)")
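token_counts = {
    "Original": len(before_encode),
    "100-token": len(after_encode_100),
    "1000-token": len(after_encode_1000),
}
# min() returns the first minimal entry, so ties go to the smaller vocabulary
best_label = min(token_counts, key=token_counts.get)
print(f"\n  Best (fewest tokens): {best_label} tokenizer")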


ENCODE() FUNCTION COMPARISON - ALL THREE TOKENIZERS

Sample text: བོད་ཡིག་ནི་རྒྱལ་ཁབ་ཀྱི་སྐད་ཡིག་རེད། ང་ཚོས་དེ་སློབ་གཉེར་བྱེད་དགོས།

ORIGINAL (Before Training):
  encode() result: ['<bos>', 'བ', 'ོད་', 'ཡ', 'ིག་', 'ན', 'ི་', 'རྒྱ', 'ལ་', 'ཁ', 'བ་', 'ཀ', 'ྱི་', 'སྐ', 'ད་', 'ཡ', 'ིག་', 'ར', 'ེ', 'ད', '།', '▁', 'ང་', 'ཚ', 'ོས་', 'ད', 'ེ་', 'ས', 'ླ', 'ོ', 'བ་', 'ག', 'ཉ', 'ེ', 'ར་', 'བྱ', 'ེད་', 'ད', 'ག', 'ོ', 'ས', '།']
  Length: 42 tokens

EXPANDED (100 tokens):
  encode() result: ['<bos>', 'བོད་', 'ཡ', 'ིག་', 'ནི་', 'རྒྱ', 'ལ་', 'ཁ', 'བ་', 'ཀ', 'ྱི་', 'སྐད་', 'ཡ', 'ིག་', 'ར', 'ེད།', '▁', 'ང་', 'ཚ', 'ོས་', 'ད', 'ེ་', 'སླ', 'ོ', 'བ་', 'ག

## 11. Complete Comparison: All Three Tokenizers


In [11]:
print("=" * 80)
print("COMPLETE COMPARISON: ORIGINAL vs 100 TOKENS vs 113 TOKENS")
print("=" * 80)

print(f"\n{'Metric':<30} {'Original':<20} {'100 tokens':<20} {'113 tokens':<20}")
print("-" * 90)
print(f"{'Vocabulary Size':<30} {original_vocab_size:>18,} {new_vocab_size:>18,} {new_vocab_size_1000:>18,}")
print(f"{'Tokens Added':<30} {'N/A':>18} {tokens_added:>18} {tokens_added_1000:>18}")
print(f"{'Number of Tokens':<30} {len(before_token_ids):>18} {len(after_token_ids):>18} {len(after_token_ids_1000):>18}")

print(f"\n{'='*80}")
print("TOKEN COMPARISON")
print(f"{'='*80}")

print(f"\nOriginal ({len(before_token_ids)} tokens):")
print(f"  IDs: {before_token_ids}")
print(f"  Tokens: {before_tokens[:5]}{'...' if len(before_tokens) > 5 else ''}")

print(f"\n100 tokens ({len(after_token_ids)} tokens):")
print(f"  IDs: {after_token_ids}")
print(f"  Tokens: {after_tokens[:5]}{'...' if len(after_tokens) > 5 else ''}")

print(f"\n1000 tokens ({len(after_token_ids_1000)} tokens):")
print(f"  IDs: {after_token_ids_1000}")
print(f"  Tokens: {after_tokens_1000[:5]}{'...' if len(after_tokens_1000) > 5 else ''}")

# Calculate improvements
improvement_100 = len(before_token_ids) - len(after_token_ids)
improvement_1000 = len(before_token_ids) - len(after_token_ids_1000)

print(f"\n{'='*80}")
print("IMPROVEMENT ANALYSIS")
print(f"{'='*80}")
if improvement_100 > 0:
    reduction_pct_100 = (improvement_100 / len(before_token_ids)) * 100
    print(f"✅ 100 tokens: {improvement_100} fewer tokens ({reduction_pct_100:.1f}% reduction)")
elif improvement_100 == 0:
    print(f"➡️  100 tokens: Same number of tokens")
else:
    print(f"⚠️  100 tokens: {abs(improvement_100)} more tokens")

if improvement_1000 > 0:
    reduction_pct_1000 = (improvement_1000 / len(before_token_ids)) * 100
    print(f"✅ 1000 tokens: {improvement_1000} fewer tokens ({reduction_pct_1000:.1f}% reduction)")
elif improvement_1000 == 0:
    print(f"➡️  1000 tokens: Same number of tokens")
else:
    print(f"⚠️  1000 tokens: {abs(improvement_1000)} more tokens")

if improvement_1000 > improvement_100:
    print(f"\n💡 1000-token tokenizer provides better compression than 100-token tokenizer")
elif improvement_1000 < improvement_100:
    print(f"\n💡 100-token tokenizer provides better compression than 1000-token tokenizer")
else:
    print(f"\n💡 Both expanded tokenizers provide similar compression")


COMPLETE COMPARISON: ORIGINAL vs 100 TOKENS vs 113 TOKENS

Metric                         Original             100 tokens           113 tokens          
------------------------------------------------------------------------------------------
Vocabulary Size                           262,145            262,245            262,258
Tokens Added                                  N/A                100                113
Number of Tokens                               42                 31                 31

TOKEN COMPARISON

Original (42 tokens):
  IDs: [2, 240332, 144766, 242070, 212637, 240392, 55765, 203431, 100206, 242268, 155394, 242079, 151310, 202660, 73436, 242070, 212637, 240451, 240640, 239938, 241594, 236743, 52855, 242545, 235862, 239938, 120295, 239502, 243797, 239985, 155394, 239935, 243835, 240640, 92089, 157435, 201958, 239938, 239935, 239985, 239502, 241594]
  Tokens: ['<bos>', 'བ', 'ོད་', 'ཡ', 'ིག་']...

100 tokens (31 tokens):
  IDs: [2, 262144, 242070, 2
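
The reduction computed by the improvement analysis follows directly from the token counts in the comparison table: 42 tokens before and 31 after, for both expanded tokenizers. Here is a minimal sketch of the arithmetic, with `reduction_pct` as a hypothetical helper. Both counts include the `<bos>` token, so the second call shows the figure with that token excluded.

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage of tokens saved relative to the original count."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


# Counts taken from the comparison table (both include <bos>).
print(f"{reduction_pct(42, 31):.1f}%")  # 26.2%
# Excluding the single <bos> token from each count.
print(f"{reduction_pct(41, 30):.1f}%")  # 26.8%
```

Since the 100-token and 113-token vocabularies both produce 31 tokens for this sentence, the 13 extra tokens bring no additional saving on this particular sample.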