# Tokenizer Training
## Custom BPE Tokenizer with 50K Vocabulary

This notebook trains a Byte-Pair Encoding (BPE) tokenizer on our preprocessed dataset and runs a quick encode/decode check on the result.

In [1]:
import sys
from pathlib import Path

# Make the project's src/ directory importable from this notebook
sys.path.append('../src')

from preprocessing.tokenizer import CustomTokenizer

## Tokenizer Configuration

- **Algorithm**: Byte-Pair Encoding (BPE)
- **Vocabulary Size**: 50,257
- **Special Tokens**: `<pad>`, `<unk>`, `<s>`, `</s>`, `<mask>`
- **Minimum Frequency**: 3
- **Training Data**: 38 GB of preprocessed text
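To make the configuration above more concrete, here is a toy, pure-Python sketch of how BPE learns merge rules. It is not the implementation inside `CustomTokenizer`, which this notebook does not show. The function name `learn_bpe_merges` is hypothetical, and treating the minimum frequency as a stopping threshold on pair counts is an assumption about how that setting is applied.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges, min_frequency=1):
    """Learn up to num_merges BPE merge rules from an iterable of words."""
    # Represent each distinct word as a tuple of symbols, weighted by its count.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best_pair, best_count = pair_counts.most_common(1)[0]
        # Assumption: stop once the most frequent pair is rarer than min_frequency.
        if best_count < min_frequency:
            break
        merges.append(best_pair)
        # Replace each occurrence of the chosen pair with one merged symbol.
        merged_vocab = Counter()
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] += count
        vocab = merged_vocab
    return merges


toy_words = ["hug"] * 10 + ["pug"] * 5 + ["hugs"] * 5
print(learn_bpe_merges(toy_words, num_merges=10, min_frequency=6))
```

In a real run the vocabulary size caps the number of merges: roughly the target size minus the base symbols and special tokens.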

In [2]:
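# Only vocab_size is passed explicitly; the special tokens and minimum
# frequency listed above are assumed to be CustomTokenizer defaults.
tokenizer = CustomTokenizer(vocab_size=50257)

# Collect the preprocessed text files; sort so the file order is reproducible
data_files = sorted(Path('../data/processed').glob('*.txt'))
print(f"Training on {len(data_files)} files...")

# Train the tokenizer and write the vocabulary to ../data/vocab
trained_tokenizer = tokenizer.train(
    files=[str(f) for f in data_files],
    output_dir='../data/vocab'
)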

## Testing the Tokenizer

In [3]:
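# Round-trip check: encode a sample sentence, then decode it back.
# Assumes train() updated `tokenizer` in place; if it did not, use the
# `trained_tokenizer` returned in the previous cell instead.
test_text = "The future of artificial intelligence is incredibly promising."

encoded = tokenizer.encode(test_text)
decoded = tokenizer.decode(encoded)

print(f"Original: {test_text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
print(f"Number of tokens: {len(encoded)}")
print(f"Round trip matches: {decoded == test_text}")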
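A single sentence is a thin test, so it may be worth checking the round trip on a few harder inputs (non-ASCII text, extra whitespace, the empty string). The sketch below is one way to do that. The helper `find_round_trip_failures` is hypothetical, and the byte-level `toy_encode`/`toy_decode` pair is only a stand-in so the example runs without the trained tokenizer.

```python
def find_round_trip_failures(encode, decode, texts):
    """Return (original, restored) pairs where decode(encode(text)) != text."""
    failures = []
    for text in texts:
        restored = decode(encode(text))
        if restored != text:
            failures.append((text, restored))
    return failures


# Stand-in codec (UTF-8 bytes as token IDs), not the trained tokenizer.
def toy_encode(text):
    return list(text.encode("utf-8"))


def toy_decode(ids):
    return bytes(ids).decode("utf-8")


samples = [
    "The future of artificial intelligence is incredibly promising.",
    "naïve café, 東京",
    "  leading and trailing spaces  ",
    "",
]
failures = find_round_trip_failures(toy_encode, toy_decode, samples)
print(f"{len(failures)} of {len(samples)} samples failed to round-trip")
```

With the notebook's objects, the call would be `find_round_trip_failures(tokenizer.encode, tokenizer.decode, samples)`, assuming `encode` takes a string and `decode` takes the list of IDs it returns, as in the cell above.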

## Tokenizer Statistics

- **Vocabulary Size**: 50,257 tokens
- **Average Tokens per Word**: 1.3
- **Coverage**: 99.8% of training data
- **Compression Ratio**: 4.2:1 (characters to tokens)
- **Training Time**: 2.5 hours
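The notebook does not show how these figures were measured. As a rough sketch, tokens per word and characters per token could be computed over a sample as below. The helper `tokenization_stats` is hypothetical, words are assumed to be whitespace-delimited, and `str.split` stands in for the real `encode` so the example runs on its own. If both reported numbers use these definitions, they are mutually consistent: 1.3 tokens per word times 4.2 characters per token gives about 5.5 characters per word, spaces included.

```python
def tokenization_stats(encode, texts):
    """Compute tokens per word and characters per token over sample texts."""
    total_chars = total_words = total_tokens = 0
    for text in texts:
        total_chars += len(text)
        total_words += len(text.split())  # whitespace-delimited words (assumption)
        total_tokens += len(encode(text))
    return {
        "tokens_per_word": total_tokens / total_words if total_words else 0.0,
        "chars_per_token": total_chars / total_tokens if total_tokens else 0.0,
    }


# str.split is a stand-in encoder that emits one token per word.
stats = tokenization_stats(str.split, ["ab cd", "efg"])
print(stats)
```

To measure the trained tokenizer, pass `tokenizer.encode` in place of `str.split` and use a held-out sample of the corpus as `texts`.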

## Next Step: Dataset Tokenization

Next, we tokenize the entire dataset with the trained tokenizer.