# Training a Fast Whisper Tokenizer for Tibetan

This notebook demonstrates how to train a Whisper tokenizer for Tibetan text using the `PreTrainedTokenizerFast` class, which provides the `train_new_from_iterator` method.

We'll follow these steps:
1. Load a fast tokenizer from the Whisper model
2. Prepare a Tibetan text corpus
3. Test the base tokenizer on Tibetan text
4. Train the tokenizer on the Tibetan corpus
5. Test and evaluate the trained tokenizer
6. Analyze the new vocabulary
7. Save the trained tokenizer

In [None]:
# Install necessary packages
!pip install transformers tokenizers tqdm

In [2]:
import os
from pathlib import Path
from transformers import PreTrainedTokenizerFast
from tqdm.notebook import tqdm

## 1. Load the Base Fast Tokenizer

We'll load the tokenizer from the Whisper model using `PreTrainedTokenizerFast`, which supports the `train_new_from_iterator` method.

In [12]:
# Local directory holding the base tokenizer files
BASE_MODEL = "/home/gangagyatso/Desktop/stt-bpe-trainer/data/whisper_tokenizer_added_tibetan"

# Load the fast tokenizer
print(f"🚀 Loading fast tokenizer from {BASE_MODEL}...")
tokenizer = PreTrainedTokenizerFast.from_pretrained(BASE_MODEL)
print("✅ Fast tokenizer loaded successfully")

# Check vocabulary size
print(f"Vocabulary size: {len(tokenizer)}")

🚀 Loading fast tokenizer from /home/gangagyatso/Desktop/stt-bpe-trainer/data/whisper_tokenizer_added_tibetan...
✅ Fast tokenizer loaded successfully
Vocabulary size: 53014


## 2. Prepare the Tibetan Corpus

Check that the corpus file exists, preview a few lines, and define an iterator that streams the corpus line by line.

In [4]:
# Path to your Tibetan corpus file
CORPUS_FILE = "data/corpus/mergedcorpus.txt"

# Check if the file exists
if not os.path.exists(CORPUS_FILE):
    print(f"❌ Corpus file not found: {CORPUS_FILE}")
else:
    print(f"✅ Found corpus file: {CORPUS_FILE}")
    
    # Look at a sample of the corpus
    with open(CORPUS_FILE, "r", encoding="utf-8") as f:
        # Read up to 3 lines; zip stops early if the file is shorter
        sample_lines = [line.strip() for _, line in zip(range(3), f)]
    
    print("\nSample of corpus:")
    for i, line in enumerate(sample_lines):
        print(f"Line {i+1}: {line[:100]}..." if len(line) > 100 else f"Line {i+1}: {line}")

✅ Found corpus file: data/corpus/mergedcorpus.txt

Sample of corpus:
Line 1: ཨེ། སུའི་སྣང་བར་དེ་ཡོད་ན་
Line 2: སེམས་ཅན་ཐམས་ཅད་ཀྱི་ཚབ་ལ་གཞན་བསམ་ཚར་དུས་བདག་འཛིན་ཡོད་མ་རེད་
Line 3: རང་ཐོག་ལ་དངོས་སུ་སྙིང་རྗེ་ཡིས་འབྲེལ་བ་ཡོང་གི་ཡོད་རེད་


In [5]:
# Define corpus iterator function
def corpus_iterator():
    with open(CORPUS_FILE, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # Skip empty lines
                yield line
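
The iterator above yields one line at a time. The `train_new_from_iterator` documentation describes its input as a generator of batches of texts, so a batched variant can be sketched as follows. The function name `batched_corpus_iterator` and the default `batch_size` are assumptions for illustration, not part of the original notebook.

```python
from itertools import islice
from typing import Iterator, List


def batched_corpus_iterator(path: str, batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` non-empty, stripped lines from `path`."""
    with open(path, "r", encoding="utf-8") as f:
        stripped = (line.strip() for line in f)
        non_empty = (line for line in stripped if line)
        while True:
            # Pull the next chunk of lines without loading the whole file
            batch = list(islice(non_empty, batch_size))
            if not batch:
                break
            yield batch
```

It could be passed to training in place of `corpus_iterator()`, for example `tokenizer.train_new_from_iterator(batched_corpus_iterator(CORPUS_FILE), vocab_size=...)`.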

## 3. Test Initial Tokenizer on Tibetan

Let's first see how the original tokenizer handles Tibetan text.

In [6]:
# Example Tibetan text
tibetan_example = "བྱང་ཆུབ་ཀྱི་སེམས་རྣམ་པ་གཉིས་ཡོད་རེད།"

# Tokenize with the original tokenizer
tokens = tokenizer.tokenize(tibetan_example)
token_ids = tokenizer.encode(tibetan_example)

print("Original tokenizer results:")
print(f"Text: {tibetan_example}")
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")

# Check roundtrip
decoded = tokenizer.decode(token_ids)
print(f"\nRoundtrip: {tibetan_example in decoded}")
print(f"Decoded text: {decoded}")

Original tokenizer results:
Text: བྱང་ཆུབ་ཀྱི་སེམས་རྣམ་པ་གཉིས་ཡོད་རེད།
Token count: 18
Tokens: ['བྱང', '་', 'ཆུབ', '་', 'ཀྱི', '་', 'སེམས', '་', 'རྣམ', '་', 'པ', '་', 'གཉིས', '་', 'ཡོད', '་', 'རེད', '།']
Token IDs: [50258, 50363, 52134, 51866, 52161, 51866, 51999, 51866, 52010, 51866, 52110, 51866, 51893, 51866, 52089, 51866, 51979, 51866, 51971, 51868, 50257]

Roundtrip: True
Decoded text: བྱང་ཆུབ་ཀྱི་སེམས་རྣམ་པ་གཉིས་ཡོད་རེད།<|endoftext|>
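
Two details in the output above are worth noting. The token count is 18 but there are 21 token IDs, because `encode` adds special tokens by default while `tokenize` does not. The round-trip check also uses substring containment, which passes even when extra text such as `<|endoftext|>` is appended. A stricter check can be sketched as below; the helper names are hypothetical, and the sketch assumes that decoding with `skip_special_tokens=True` returns the text without any other normalization.

```python
def strict_roundtrip(tokenizer, text: str) -> bool:
    """Return True if encoding then decoding reproduces `text` exactly."""
    token_ids = tokenizer.encode(text)
    decoded = tokenizer.decode(token_ids, skip_special_tokens=True)
    return decoded == text


def count_special_ids(tokenizer, text: str) -> int:
    """Number of IDs that `encode` adds on top of the tokens from `tokenize`."""
    return len(tokenizer.encode(text)) - len(tokenizer.tokenize(text))
```

For the example sentence, `count_special_ids(tokenizer, tibetan_example)` should account for the gap between the 18 tokens and the 21 IDs shown above.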


## 4. Train the Tokenizer on Tibetan Corpus

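Now we'll train a tokenizer on our Tibetan corpus. Note that `train_new_from_iterator` does not extend the existing vocabulary: it learns a new vocabulary from scratch on the corpus, reusing only the base tokenizer's pipeline and special tokens. The `vocab_size` argument is the maximum total size of that new vocabulary, so the result can be much smaller than the target if the corpus does not support that many merges.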

In [7]:
# Set target vocabulary size (the total size of the new vocabulary, not an increment)
TARGET_VOCAB_SIZE = len(tokenizer) + 10000

print(f"🧠 Original vocabulary size: {len(tokenizer)}")
print(f"🎯 Target vocabulary size: {TARGET_VOCAB_SIZE}")

print("⏳ Starting tokenizer training... This may take a few minutes.")

# Train the tokenizer
new_tokenizer = tokenizer.train_new_from_iterator(corpus_iterator(), vocab_size=TARGET_VOCAB_SIZE)

print("✅ Training complete!")
print(f"📈 New vocabulary size: {len(new_tokenizer)}")

🧠 Original vocabulary size: 53014
🎯 Target vocabulary size: 63014
⏳ Starting tokenizer training... This may take a few minutes.
The OrderedVocab you are attempting to save contains holes for indices [50258, 50259, 50260, ... (consecutive indices through 50371; output truncated)

## 5. Test the Trained Tokenizer

Let's check how the newly trained tokenizer handles Tibetan text.

In [8]:
# Test examples
test_examples = [
    "བྱང་ཆུབ་ཀྱི་སེམས་རྣམ་པ་གཉིས་ཡོད་རེད།",  # Simple sentence
    "བོད་ཀྱི་སྐད་ཡིག་ནི་ལོ་རྒྱུས་ཧ་ཅང་རིང་པོ་ཡོད་པའི་སྐད་ཡིག་ཅིག་རེད།",  # Another example
    "བྱང་ཆུབ་ཀྱི་སེམས་རྣམ་པ་གཉིས་ཡོད་རེད། བྱང་ཆུབ་ཀྱི་སེམས་ཡོད་ན་ཡེ་ཤེས་ལྷ་ལ་འགྱུར་འགྲོ་གི་ཡོད་རེད། བྱང་ཆུབ་ཀྱི་སེམས་མེད་ན། དེ་ནས་འདི་"
]

# Function to analyze tokenization
def analyze_tokenization(tokenizer, text, name="Tokenizer"):
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.encode(text)
    decoded = tokenizer.decode(token_ids)
    
    print(f"\n--- {name} Results ---")
    print(f"Text: {text}")
    print(f"Token count: {len(tokens)}")
    print(f"Tokens: {tokens}")
    print(f"Roundtrip successful: {text in decoded}")
    
    return len(tokens)

In [9]:
# Compare tokenization between original and new tokenizer
for i, example in enumerate(test_examples):
    print(f"\n==== Test Example {i+1} ====\n")
    
    # Test original tokenizer
    orig_count = analyze_tokenization(tokenizer, example, "Original Tokenizer")
    
    # Test new tokenizer
    new_count = analyze_tokenization(new_tokenizer, example, "Trained Tokenizer")
    
    # Compare
    if new_count < orig_count:
        print(f"\n✨ Improvement: {orig_count - new_count} fewer tokens used!")
    elif new_count == orig_count:
        print("\n🔄 Same number of tokens used.")
    else:
        print(f"\n⚠️ New tokenizer uses {new_count - orig_count} more tokens.")


==== Test Example 1 ====


--- Original Tokenizer Results ---
Text: བྱང་ཆུབ་ཀྱི་སེམས་རྣམ་པ་གཉིས་ཡོད་རེད།
Token count: 18
Tokens: ['བྱང', '་', 'ཆུབ', '་', 'ཀྱི', '་', 'སེམས', '་', 'རྣམ', '་', 'པ', '་', 'གཉིས', '་', 'ཡོད', '་', 'རེད', '།']
Roundtrip successful: True

--- Trained Tokenizer Results ---
Text: བྱང་ཆུབ་ཀྱི་སེམས་རྣམ་པ་གཉིས་ཡོད་རེད།
Token count: 32
Tokens: ['à½ĸ', 'à¾±', 'à½Ħ', 'à¼ĭ', 'à½Ĩ', 'à½´', 'à½ĸ', 'à¼ĭ', 'à½Ģ', 'à¾±à½²à¼ĭ', 'à½¦', 'à½º', 'à½ĺà½¦', 'à¼ĭ', 'à½¢', 'à¾£', 'à½ĺ', 'à¼ĭ', 'à½Ķ', 'à¼ĭ', 'à½Ĥà½ī', 'à½²', 'à½¦', 'à¼ĭ', 'à½¡', 'à½¼', 'à½ĳ', 'à¼ĭ', 'à½¢', 'à½º', 'à½ĳ', 'à¼į']
Roundtrip successful: True

⚠️ New tokenizer uses 14 more tokens.

==== Test Example 2 ====


--- Original Tokenizer Results ---
Text: བོད་ཀྱི་སྐད་ཡིག་ནི་ལོ་རྒྱུས་ཧ་ཅང་རིང་པོ་ཡོད་པའི་སྐད་ཡིག་ཅིག་རེད།
Token count: 34
Tokens: ['བོད', '་', 'ཀྱི', '་', 'སྐད', '་', 'ཡིག', '་', 'ནི', '་', 'ལོ', '་', 'རྒྱུས', '་', 'ཧ', '་', 'ཅང', '་', 'རིང', '་', 'པོ', '་', 'ཡོད', '་', 'པའི', '་', 'སྐད', '་', 'ཡིག',
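
The trained tokenizer's tokens (for example `'à½ĸ'`) are not corrupted text. Byte-level BPE displays each UTF-8 byte as a printable stand-in character, and a Tibetan character takes three bytes. The sketch below reimplements the byte-to-character table used by GPT-2 style byte-level tokenizers and inverts it. It assumes the trained tokenizer uses that standard mapping, and `decode_byte_level_token` is a hypothetical helper name.

```python
def bytes_to_unicode() -> dict:
    """Byte -> printable character table used by GPT-2 style byte-level BPE."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are shifted to code points 256 and above
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_byte_level_token(token: str) -> str:
    """Map a byte-level token string back to the text it encodes."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    # Tokens that end mid-character decode to U+FFFD instead of raising
    return raw.decode("utf-8", errors="replace")


# The first token in the trained output, written with escapes for safety
first_token = "\u00e0\u00bd\u0138"
print(decode_byte_level_token(first_token))  # the Tibetan letter U+0F56
```

Read this way, the first four tokens of the trained output decode to a base letter, a subjoined letter, another base letter, and the tsheg, so the syllable is split into single characters rather than kept whole.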

## 6. Analyze the New Vocabulary

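Let's compare the two vocabularies and look for Tibetan-specific tokens. Because the retrained tokenizer has its own vocabulary, "new" here means tokens present in the retrained vocabulary but absent from the original one.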

In [10]:
# Get vocabularies
old_vocab = tokenizer.get_vocab()
new_vocab = new_tokenizer.get_vocab()

# Find new tokens
new_tokens = [token for token in new_vocab.keys() if token not in old_vocab]
print(f"Added {len(new_tokens)} new tokens to the vocabulary")

# Find Tibetan tokens
tibetan_range = (0x0F00, 0x0FFF)  # Unicode range for Tibetan
new_tibetan_tokens = [token for token in new_tokens
                      if any(tibetan_range[0] <= ord(c) <= tibetan_range[1] for c in token)]

print(f"Added {len(new_tibetan_tokens)} new Tibetan tokens")

# Show some examples
if new_tibetan_tokens:
    print("\nSample new Tibetan tokens:")
    for token in new_tibetan_tokens[:20]:  # Show up to 20
        print(token)

Added 2695 new tokens to the vocabulary
Added 0 new Tibetan tokens
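
The count of 0 Tibetan tokens is most likely an artifact of the check rather than a property of the vocabulary. The retrained vocabulary stores tokens in byte-level form, so no raw token string contains a code point between U+0F00 and U+0FFF. A possible fix, sketched below, converts each token back to text with `convert_tokens_to_string` before testing it. The helper names are hypothetical, and tokens that cover only part of a multi-byte character are expected to decode to a replacement character and so are not counted.

```python
from typing import Iterable, List

TIBETAN_START, TIBETAN_END = 0x0F00, 0x0FFF


def contains_tibetan(text: str) -> bool:
    """True if any character falls in the Tibetan Unicode block."""
    return any(TIBETAN_START <= ord(ch) <= TIBETAN_END for ch in text)


def tibetan_tokens(tokenizer, tokens: Iterable[str]) -> List[str]:
    """Return the decoded text of every token that contains Tibetan."""
    found = []
    for token in tokens:
        # Undo the byte-level representation before inspecting characters
        text = tokenizer.convert_tokens_to_string([token])
        if contains_tibetan(text):
            found.append(text)
    return found
```

A call such as `tibetan_tokens(new_tokenizer, new_vocab.keys())` would then list the Tibetan pieces the retrained tokenizer actually learned.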


## 7. Save the Trained Tokenizer

Save the new tokenizer for future use.

In [11]:
# Define output directory
OUTPUT_DIR = "data/whisper_tibetan_tokenizer_retrained"

# Create directory if it doesn't exist
Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

# Save the tokenizer
new_tokenizer.save_pretrained(OUTPUT_DIR)
print(f"💾 Trained tokenizer saved to: {OUTPUT_DIR}")

# Test loading it back
loaded_tokenizer = PreTrainedTokenizerFast.from_pretrained(OUTPUT_DIR)
print(f"✅ Successfully loaded tokenizer with {len(loaded_tokenizer)} tokens")

💾 Trained tokenizer saved to: data/whisper_tibetan_tokenizer_retrained
✅ Successfully loaded tokenizer with 3059 tokens
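
The reloaded tokenizer reports 3059 tokens, far below the 63014 target. One possible explanation, stated here as an assumption that has not been checked against this tokenizer's configuration, is that the pre-tokenizer groups Unicode letters separately from other characters, as GPT-2 style patterns do. Tibetan vowel signs and subjoined consonants are combining marks rather than letters, so syllables would be cut at each of them, leaving few multi-character sequences for BPE to merge. The simplified sketch below illustrates the splitting; it ignores the whitespace and digit handling of a real pre-tokenization pattern.

```python
import unicodedata
from itertools import groupby
from typing import List


def is_letter(ch: str) -> bool:
    """True for characters in a Unicode letter category (Lu, Ll, Lo, ...)."""
    return unicodedata.category(ch).startswith("L")


def split_letters_from_marks(text: str) -> List[str]:
    """Group consecutive characters by whether they are Unicode letters."""
    return ["".join(group) for _, group in groupby(text, key=is_letter)]


# A base letter, subjoined letter, vowel sign, and tsheg (U+0F40 U+0FB1 U+0F72 U+0F0B)
syllable = "\u0f40\u0fb1\u0f72\u0f0b"
print([unicodedata.category(ch) for ch in syllable])
print(split_letters_from_marks(syllable))
```

The resulting pieces, a single base letter followed by a run of marks and punctuation, have the same shape as the tokens in the trained tokenizer's output in section 5.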


## 8. Next Steps

Now that you've trained a tokenizer for Tibetan, here's how you can use it with a Whisper model:

```python
from transformers import WhisperForConditionalGeneration, PreTrainedTokenizerFast

# Load your trained tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("data/whisper_tibetan_tokenizer_retrained")

# Load a Whisper model
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Resize the token embeddings to match your tokenizer
model.resize_token_embeddings(len(tokenizer))

# Now you can fine-tune or use the model with your tokenizer
```

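Remember to resize the model's token embeddings after loading your custom tokenizer. Resizing only changes the number of embedding rows: a retrained tokenizer assigns different IDs than the original Whisper vocabulary, so the pretrained embeddings no longer correspond to the same tokens and the model needs fine-tuning before it produces useful output.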
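
After resizing, it can be worth confirming that every token ID the tokenizer can emit has an embedding row. When a vocabulary has gaps in its ID sequence, `len(tokenizer)` may be smaller than the largest ID plus one, so the sketch below compares against the largest ID instead. The helper name is hypothetical.

```python
def embedding_covers_vocab(model, tokenizer) -> bool:
    """True if the largest token ID indexes a valid embedding row."""
    rows = model.get_input_embeddings().weight.shape[0]
    max_id = max(tokenizer.get_vocab().values())
    return max_id < rows
```

A `False` result means some token IDs would index past the embedding matrix, and the model should be resized to at least `max_id + 1` rows before fine-tuning.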