# 🧠 Batch Training of Subword Tokenizers for Medium-Grade LLMs on WikiText-103

# 📌 Problem Statement
Training an effective **tokenizer** is a critical step in building **medium-grade language models (100M–350M parameters)**. Tokenizers convert raw text into subword units that the model can understand, and the choice of tokenizer affects:

- Vocabulary coverage and out-of-vocabulary (OOV) handling  
- Sequence length efficiency  
- Downstream model performance on generation, summarization, and understanding tasks  

This project aims to **train and compare a batch of tokenizers**—**BPE, WordPiece, Unigram, and optional Char-level**—on the **WikiText-103 dataset**, each with a **vocabulary size of ~60,000 tokens**. By training multiple tokenizer types on the same corpus, we can evaluate which approach best balances **coverage, efficiency, and adaptability** for medium-grade LLMs.
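
The comparison criteria above can be made measurable. A minimal sketch, assuming two simple metrics: the average number of tokens per whitespace-separated word (often called fertility) and the share of tokens that map to the unknown token. `tokenization_stats` and `char_tokenize` are hypothetical helpers written for this illustration and are not part of the notebook; any callable that maps a string to a list of tokens can be passed in.

```python
def tokenization_stats(tokenize, texts, unk_token="[UNK]"):
    """Return (tokens per word, unknown-token rate) for a tokenize callable."""
    n_words = n_tokens = n_unk = 0
    for text in texts:
        tokens = tokenize(text)
        n_words += len(text.split())
        n_tokens += len(tokens)
        n_unk += sum(1 for tok in tokens if tok == unk_token)
    fertility = n_tokens / n_words if n_words else 0.0
    unk_rate = n_unk / n_tokens if n_tokens else 0.0
    return fertility, unk_rate


def char_tokenize(text):
    # Hypothetical stand-in tokenizer: one token per non-space character.
    return [ch for ch in text if not ch.isspace()]


fertility, unk_rate = tokenization_stats(char_tokenize, ["a bc", "def"])
print(fertility, unk_rate)
```

For a trained tokenizer, a callable such as `lambda t: tokenizer.encode(t).tokens` could be passed as `tokenize`, ideally on held-out text rather than the training split.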

# 📂 Dataset Overview

**Dataset Name:** WikiText-103 (raw version)  
**Source:** [Hugging Face - WikiText](https://huggingface.co/datasets/wikitext)  

**Description:**  
WikiText-103 is a large-scale English language modeling dataset extracted from verified Good and Featured articles on Wikipedia. It contains over **100 million tokens** and provides high-quality, clean text suitable for training tokenizers and language models.  

**Split Used:**  
- **Train Split:** `train` (streaming mode used for memory efficiency)  

**Key Features:**  
- **Text-based**: Contains full articles with proper formatting.  
- **High-quality language**: Extracted from curated Wikipedia articles.  
- **Size:** ~103 million tokens  
- **Use Case:** Ideal for training subword tokenizers and general-purpose medium-grade LLMs.  


# 🔹 Tokenizer Types and References

In this project, we will train and compare **multiple types of tokenizers** on the WikiText-103 dataset. Each tokenizer has different properties and advantages for language modeling:

1. **BPE (Byte Pair Encoding)**  
   - Subword-level tokenizer that merges frequent character pairs.  
   - Efficient for handling rare words while keeping vocabulary compact.  
   - Widely used in GPT-style models.  
   - **References:**  
     - [Original Paper: Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/abs/1508.07909)  
     - [Hugging Face BPE Tokenizer](https://huggingface.co/docs/tokenizers/python/latest/components.html#byte-level-bpe)

2. **WordPiece**  
   - Builds subwords based on maximizing likelihood of the training data.  
   - Used in BERT and similar Transformer models.  
   - Balances vocabulary size and coverage efficiently.  
   - **References:**  
     - [Paper that popularized WordPiece: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)  
     - [Hugging Face WordPiece Tokenizer](https://huggingface.co/docs/tokenizers/python/latest/components.html#wordpiece)

3. **Unigram (SentencePiece)**  
   - Probabilistic tokenizer that selects subwords to maximize corpus likelihood.  
   - Flexible and often achieves better tokenization on diverse text.  
   - **References:**  
     - [Paper: SentencePiece: A simple and language independent subword tokenizer and detokenizer](https://arxiv.org/abs/1808.06226)  
     - [Hugging Face Unigram Tokenizer](https://huggingface.co/docs/tokenizers/python/latest/components.html#unigram)

**General Documentation & Resources:**  
- [Hugging Face Tokenizers Documentation](https://huggingface.co/docs/tokenizers/index)  
- [WikiText-103 Dataset on Hugging Face](https://huggingface.co/datasets/wikitext)


---
# 🛠️ Project Build
---

---
## BPE (Byte Pair Encoding) Tokeniser
---

### 🔹 Phase 1 : Setup and Configuration

#### 🛠 Dependencies

The project requires the following packages and tools:

- **datasets**: For loading and streaming the WikiText-103 dataset efficiently.  
- **evaluate**: Provides metrics for evaluating tokenizer and model performance.  
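- **transformers with SentencePiece support**: Provides `PreTrainedTokenizerFast` for wrapping the trained tokenizers; installing it also pulls in the `tokenizers` library that performs the actual training.  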
- **git-lfs**: Handles large files when working with Hugging Face models or datasets.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!apt install git-lfs

Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.5-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.5
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


#### 📚 Library Imports

In [4]:
from google.colab import userdata
from huggingface_hub import login, notebook_login
from datasets import load_dataset
from tokenizers import (
    Regex,
    Tokenizer,
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
)
from tokenizers.normalizers import NFD, Lowercase, StripAccents, Replace, Strip
from transformers import PreTrainedTokenizerFast

#### 🔑 Hugging Face API Key and Login Setup

To access datasets and models from Hugging Face, the following steps are required:

- Retrieve the **Hugging Face API token** from the Colab secrets store (saved here under the name `HFToken`).  
- Use the API token to **authenticate** with the Hugging Face Hub via `login`.  
- Perform a **notebook login** later, before uploading, so the tokenizers can be pushed directly from Google Colab.  

Reading the public WikiText dataset works without authentication; the token is needed to push the trained tokenizers to the Hub.

In [11]:
hf_token = userdata.get('HFToken')
login(token=hf_token)

### 🔹 Phase 2 : Corpus and Data Loading

#### 📥 Data Loading

- Loaded the **WikiText-103 (raw) dataset** from Hugging Face.  
- Used the **train split** of the dataset for tokenizer training.  
- Enabled **streaming mode** to efficiently handle the large dataset without consuming excessive memory.  
- Verified the dataset by displaying the returned `IterableDataset` object.

In [12]:
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)
dataset

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


IterableDataset({
    features: ['text'],
    num_shards: 2
})

#### 🔄 Iterator / Training Corpus Configuration

- Defined a **batch iterator** to process the dataset in manageable chunks.  
- Each batch contains **1000 text samples** by default.  
- The iterator **yields batches of text** for efficient tokenizer training.  
- Ensures that the entire dataset can be streamed without loading it fully into memory.  
- Handles any remaining samples at the end of the dataset to avoid data loss.

In [13]:
def get_training_corpus(batch_size=1000):
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
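
The batching logic can be checked on a small in-memory list before pointing it at the streamed dataset. The sketch below is a generic variant of the iterator above; `batched_texts` and the sample records are hypothetical names used only for this illustration.

```python
def batched_texts(examples, batch_size=1000):
    """Yield lists of at most batch_size strings taken from example["text"]."""
    batch = []
    for example in examples:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, shorter batch instead of dropping it
        yield batch


sample = [{"text": f"line {i}"} for i in range(5)]
batches = list(batched_texts(sample, batch_size=2))
print([len(b) for b in batches])  # [2, 2, 1]
```

Because the function is a generator, each call starts a fresh pass over the data, which is why `get_training_corpus()` is called again for every tokenizer trained later.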

### 🔹 Phase 3 : Tokeniser Model Building

#### ⚙️ Tokenizer Initialization and Pre-tokenization

1. **Tokenizer Initialization**  
   - Created a **BPE (Byte Pair Encoding) tokenizer** instance.  
   - This defines the base model that will learn subword units from the dataset.  

2. **Pre-tokenizer Configuration**  
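   - Set the **pre-tokenizer** to `ByteLevel`, which splits text into word-like chunks and represents every byte as a visible character, so a leading space stays attached to its word and is displayed as `Ġ`.  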
   - Pre-tokenization prepares the raw text into smaller units before applying BPE merges.  

3. **Pre-tokenization Test**  
   - Tested the pre-tokenizer with a sample string `"Let's test pre-tokenization!"`.  
   - Ensures that text is correctly split into tokens, validating the pre-tokenizer setup before training.  


In [None]:
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test pre-tokenization!")

[('Let', (0, 3)),
 ("'s", (3, 5)),
 ('Ġtest', (5, 10)),
 ('Ġpre', (10, 14)),
 ('-', (14, 15)),
 ('tokenization', (15, 27)),
 ('!', (27, 28))]
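
The `Ġ` in the output above is not a character from the input: byte-level pre-tokenization displays every byte as a printable character, and the space byte ends up as `Ġ`. The sketch below reproduces the byte-to-character table popularized by GPT-2 in plain Python. It is an illustration of the idea rather than the library's internal code, and `bytes_to_unicode` and `to_visible` are names chosen for this example.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in keep}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Remaining bytes (space, control bytes, ...) are moved above U+00FF.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()


def to_visible(text):
    """Encode text as UTF-8 and render each byte with its visible character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(to_visible(" test"))  # Ġtest
print(to_visible("é"))      # Ã©
```

Since all 256 byte values have a symbol, any input string can be represented without an unknown token, which is the main appeal of the byte-level approach.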

#### 🏋️ Tokenizer Model Training and Config

- Configured a **BPE trainer** with a **vocabulary size of 60,000 tokens**.  
- Added **special tokens** (e.g., `<|endoftext|>`) required for downstream tasks.  
- Used the **training corpus iterator** to feed the tokenizer with batches of text.  
- Called the training function to **learn the subword vocabulary and merge rules** from the dataset.  
- Ensured that the tokenizer is ready to encode and decode text efficiently for medium-grade LLMs.


In [None]:
trainer = trainers.BpeTrainer(vocab_size=60000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
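
To make "learning merge rules" concrete, here is a toy version of the BPE training loop in plain Python. It is a sketch under simplifying assumptions (a tiny hand-written word-frequency table, character-level start symbols, no byte-level mapping) and not the optimized implementation behind `BpeTrainer`; `learn_bpe_merges` is a hypothetical helper.

```python
from collections import Counter


def learn_bpe_merges(word_freqs, num_merges):
    """Learn up to num_merges merge rules from a {word: frequency} table."""
    # Every word starts as a tuple of single-character symbols.
    splits = {word: tuple(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge rule to every word.
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = tuple(merged)
    return merges, splits


word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, splits = learn_bpe_merges(word_freqs, num_merges=3)
print(merges)          # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(splits["hugs"])  # ('hug', 's')
```

In the real trainer the loop continues until the vocabulary (base symbols plus merged symbols plus special tokens) reaches `vocab_size`.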

### 🔹 Phase 4 : Tokeniser Model Saving

#### 💾 Tokenizer Saving

- **Saved the trained tokenizer** locally as a single JSON file, `english_bpe_tokenizer.json`.  
- Ensures the tokenizer can be **reloaded easily** for future model training or inference.  
- Provides a persistent version of the tokenizer for **reuse across projects**.

In [None]:
tokenizer.save("english_bpe_tokenizer.json")

#### ☁️ Tokenizer Wrapping and Hub Upload

- **Wrapped the trained tokenizer** using `PreTrainedTokenizerFast` to make it compatible with Hugging Face models.  
- Assigned **special tokens**: `<|endoftext|>` is used as the unknown, padding, beginning-of-sequence, and end-of-sequence token.  
- **Pushed the tokenizer to the Hugging Face Hub**, enabling public access and version control.  
- This allows the tokenizer to be **shared, reused, and integrated** into other projects or LLM training pipelines.

In [None]:
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="english_bpe_tokenizer.json",
    unk_token="<|endoftext|>",
    pad_token="<|endoftext|>",
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>"
)

wrapped_tokenizer.save_pretrained("./wrapped_tokenizer")

('./wrapped_tokenizer/tokenizer_config.json',
 './wrapped_tokenizer/special_tokens_map.json',
 './wrapped_tokenizer/tokenizer.json')

In [None]:
notebook_login()

In [None]:
wrapped_tokenizer.push_to_hub("english_bpe_tokenizer-60k")

---
## WordPiece Tokeniser
---

### 🧩 WordPiece Tokenizer – Normalizer Definition

- Since the dataset and iterator were already prepared in earlier steps, we directly moved to **defining the normalizer** for the WordPiece tokenizer.  
- **Why a normalizer is required here:**  
  - WordPiece relies heavily on consistent text preprocessing because it builds its vocabulary from **subword units** that are sensitive to character case, punctuation, and diacritics.  
  - Inconsistent casing or hidden Unicode variations can fragment the vocabulary, reducing efficiency and increasing unknown tokens (`[UNK]`).  
- **Why not needed for BPE-based tokenizer:**  
  - The byte-level BPE tokenizer above follows the GPT-2 convention of training on raw text, so case and accents are preserved. Variants such as `Hello` and `hello` simply receive separate tokens.  
  - Normalization can still be used with BPE; it is simply omitted here, whereas the WordPiece tokenizer follows the uncased BERT setup, where normalization keeps the vocabulary compact.


### 🔹 Phase 3 : Tokeniser Model Building

#### 🔧 Normalization Step
We applied a sequence of normalization operations to ensure clean, consistent input before tokenization:

1. **NFD()** – Converts characters into their *decomposed Unicode form* (e.g., `é` → `e` + the combining acute accent U+0301).
2. **Lowercase()** – Converts all characters to lowercase for case-insensitive processing.
3. **StripAccents()** – Removes accent marks and diacritics from characters.
4. **Replace(Regex(r"[\u0000-\u001F\u007F]"), " ")** – Replaces non-printable ASCII control characters with a space.
5. **Replace(Regex(r"\s+"), " ")** – Collapses any run of whitespace (spaces, tabs, newlines) into a single space.
6. **Strip()** – Removes any leading or trailing whitespace.

**Result:**  
Input: `"Héllò hôw are ü?"`  
Output after normalization: `"hello how are u?"`  

This ensures the WordPiece model works with **uniform, noise-free text**, reducing fragmentation of subword vocabulary.


In [14]:
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([
    NFD(),
    Lowercase(),
    StripAccents(),
    Replace(Regex(r"[\u0000-\u001F\u007F]"), " "),
    Replace(Regex(r"\s+"), " "),
    Strip()
])
print(tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

hello how are u?
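
The same sequence of steps can be approximated with the Python standard library, which is handy for checking what each stage does to a string. This is a sketch for illustration, not the `tokenizers` implementation; `normalize` is a hypothetical helper, and edge cases (for example locale-specific lowercasing) may differ from the library.

```python
import re
import unicodedata


def normalize(text):
    """Stdlib approximation of the normalizer sequence defined above."""
    text = unicodedata.normalize("NFD", text)            # decompose: é -> e + U+0301
    text = text.lower()                                  # lowercase
    text = "".join(ch for ch in text
                   if unicodedata.category(ch) != "Mn")  # drop combining marks
    text = re.sub(r"[\u0000-\u001F\u007F]", " ", text)   # control chars -> space
    text = re.sub(r"\s+", " ", text)                     # collapse whitespace runs
    return text.strip()                                  # trim both ends


print(normalize("Héllò hôw are ü?"))        # hello how are u?
print(normalize("  Caf\u00e9\tau\nlait "))  # cafe au lait
```

The order matters: accents can only be stripped after NFD has split them off as separate combining characters.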


#### 🔧 Pre-tokenization Step (WordPiece Tokenizer)

With the normalizer in place, the next step is **pre-tokenization**: splitting the normalized text into words and punctuation. WordPiece then learns subword units inside these pieces and never merges across their boundaries.  

We experimented with different pre-tokenization approaches to split the text into smaller, meaningful units before training:

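1. **Whitespace Tokenization** (`Whitespace()`) – Splits on whitespace and also separates punctuation from words.  
   Example: `"Let's test my pre-tokenizer."` → `["Let", "'", "s", "test", "my", "pre", "-", "tokenizer", "."]`

2. **Whitespace Split** (`WhitespaceSplit()`) – Splits on whitespace only, so punctuation stays attached to the neighbouring word.  
   Example: `"Let's test my pre-tokenizer."` → `["Let's", "test", "my", "pre-tokenizer."]`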

3. **Sequence Pre-tokenization** – Combines multiple rules:  
   - `WhitespaceSplit()` → Splits on spaces  
   - `Punctuation()` → Separates punctuation marks from words  
   - `Digits(individual_digits=False)` → Keeps multi-digit numbers together

4. **ByteLevel Pre-tokenization** – Splits into byte-level tokens while preserving exact spacing and characters. With `add_prefix_space=True`, a space is prepended to the text so that the first word is handled like every other word (`ĠLet`).

**Why Normalization Here but Not in BPE?**  
The byte-level BPE tokenizer was trained on raw text, so cased and accented variants simply receive their own tokens. For WordPiece, variants like `"Hello"` vs `"hello"` or `"café"` vs `"cafe"` would take up separate vocabulary entries for the same word. Normalization (lowercasing, NFD/strip accents, etc.) ensures uniform input, improving efficiency and vocabulary quality.


In [15]:
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

[('Let', (0, 3)),
 ("'", (3, 4)),
 ('s', (4, 5)),
 ('test', (6, 10)),
 ('my', (11, 13)),
 ('pre', (14, 17)),
 ('-', (17, 18)),
 ('tokenizer', (18, 27)),
 ('.', (27, 28))]

In [16]:
pre_tokenizer = pre_tokenizers.WhitespaceSplit()
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

[("Let's", (0, 5)),
 ('test', (6, 10)),
 ('my', (11, 13)),
 ('pre-tokenizer.', (14, 28))]

In [17]:
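# A rule-based pipeline: split on spaces, isolate punctuation, keep numbers whole.
sequence_pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),
    pre_tokenizers.Punctuation(),
    pre_tokenizers.Digits(individual_digits=False)
])
# A byte-level alternative; a separate name keeps the Sequence above from being overwritten.
byte_level_pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)

# Only the ByteLevel result is displayed; tokenizer.pre_tokenizer is still Whitespace().
byte_level_pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")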

[('ĠLet', (0, 3)),
 ("'s", (3, 5)),
 ('Ġtest', (5, 10)),
 ('Ġmy', (10, 13)),
 ('Ġpre', (13, 17)),
 ('-', (17, 18)),
 ('tokenizer', (18, 27)),
 ('.', (27, 28))]
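
The difference between the first two options comes down to the splitting pattern. The sketch below imitates them with Python's `re` module, assuming the pattern `\w+|[^\w\s]+` that the `tokenizers` documentation gives for `Whitespace`; the function names are hypothetical and the offsets are plain Python string indices.

```python
import re


def whitespace_pre_tokenize(text):
    """Words and punctuation runs, like the `Whitespace()` output above."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]+", text)]


def whitespace_split_pre_tokenize(text):
    """Split on whitespace only, like the `WhitespaceSplit()` output above."""
    return [(m.group(), m.span()) for m in re.finditer(r"\S+", text)]


sentence = "Let's test my pre-tokenizer."
print(whitespace_pre_tokenize(sentence))
print(whitespace_split_pre_tokenize(sentence))
```

Isolating punctuation keeps pieces such as `tokenizer.` out of the vocabulary, because the trailing period never sits inside the same pre-token as the word.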

### 🔹 Phase 4 : Tokeniser Model Building

#### 🏋️ Model Setup: WordPiece Tokenizer Training

We configure a **WordPieceTrainer** to train our tokenizer with the following parameters:

- **`vocab_size=60000`** → Limits the vocabulary to the top 60,000 tokens (including special tokens).
- **`special_tokens`** → Defines special-purpose tokens:
  - `[UNK]` → Unknown token (for out-of-vocabulary words)
  - `[PAD]` → Padding token (for sequence length alignment)
  - `[CLS]` → Classification token (for sequence-level tasks)
  - `[SEP]` → Separator token (e.g., between sentences)
  - `[MASK]` → Masking token (for masked language modeling)
- **`show_progress=True`** → Displays training progress.
- **`min_frequency=2`** → A pair of subwords must occur at least 2 times in the dataset to be merged into a new token.
- **`limit_alphabet=1000`** → Caps the number of distinct characters kept in the initial alphabet at 1,000.
- **`continuing_subword_prefix="##"`** → Prefix for subword tokens that continue from a previous word (e.g., `playing` → `play`, `##ing`).

This setup ensures a **balanced vocabulary** with efficient handling of rare words, subwords, and special-purpose tokens for downstream NLP tasks.


In [18]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(
    vocab_size=60000,
    special_tokens=special_tokens,
    show_progress=True,
    min_frequency=2,
    limit_alphabet=1000,
    continuing_subword_prefix="##"
)

#### 🏋️Training the WordPiece Tokenizer
With the trainer configured, we now train the WordPiece tokenizer on our prepared dataset iterator.

- **`train_from_iterator()`**  
  This method consumes batches of text generated by our `get_training_corpus()` function and learns the vocabulary and subword units according to the WordPiece algorithm.

- **Training flow**  
  1. The normalizer applies case folding and the other cleanup steps defined above.  
  2. The pre-tokenizer (`Whitespace()` here) splits the normalized text into words and punctuation.  
  3. The WordPiece algorithm iteratively builds the vocabulary starting with characters, merging the most frequent subword pairs until the target vocab size (`60000`) is reached.  
  4. Special tokens are reserved and remain untouched during merges.

- **Key outcome**  
  After this phase, the tokenizer has a fully learned vocabulary. Unlike BPE, the saved WordPiece model keeps no merge rules: at encoding time each word is split by greedy longest-match-first lookup against that vocabulary, which yields consistent WordPiece tokens for any input text.


In [19]:
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

In [20]:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

['let', "'", 's', 'test', 'this', 'token', '##izer', '.']
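
The split of `tokenizer` into `token` and `##izer` above follows from how WordPiece encodes a word: it repeatedly takes the longest vocabulary entry that matches at the current position. A minimal sketch of that lookup and of the reverse joining step, using a hand-made toy vocabulary (the helper names and the vocabulary are hypothetical, not taken from the trained tokenizer):

```python
def wordpiece_encode_word(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word by greedy longest-match-first lookup in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry ##
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def wordpiece_join(pieces, prefix="##"):
    """Glue continuation pieces back onto the preceding piece."""
    text = ""
    for piece in pieces:
        if piece.startswith(prefix):
            text += piece[len(prefix):]
        else:
            text += (" " if text else "") + piece
    return text


toy_vocab = {"token", "##izer", "##ize", "##r", "test", "t", "##e"}
print(wordpiece_encode_word("tokenizer", toy_vocab))  # ['token', '##izer']
print(wordpiece_encode_word("xyz", toy_vocab))        # ['[UNK]']
print(wordpiece_join(["token", "##izer", "test"]))    # tokenizer test
```

Note that a single unmatched position turns the whole word into `[UNK]`, which is one reason a generous `limit_alphabet` and a clean normalizer matter for WordPiece.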


### 🔹 Phase 5 : Post Processing

#### 🔧 Post-Tokenization: Special Token Handling
In this phase, the IDs of reserved special tokens such as `[CLS]` and `[SEP]` are retrieved from the tokenizer’s vocabulary.  
This step ensures that these tokens are correctly recognized and mapped, which is crucial for downstream tasks like classification or sentence-pair processing where such tokens serve as sequence markers.


In [21]:
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)

2 3


#### 🔧 Post-Processing Template Setup
Here, a post-processing template is defined to control how special tokens are inserted during tokenization.  
The configuration specifies:
- **Single sequence format**: `[CLS]` at the start, followed by the input tokens, ending with `[SEP]`.
- **Pair sequence format**: `[CLS]` → first sequence → `[SEP]` → second sequence → `[SEP]`.
- **Special tokens mapping**: `[CLS]` and `[SEP]` are explicitly linked to their IDs.  

A test encoding is then run on a paired input to verify:
- Correct insertion of special tokens.
- Proper assignment of **type IDs** (segment IDs) to distinguish the two sequences.

In [22]:
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)
encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")
print(encoding.tokens)
print(encoding.type_ids)

['[CLS]', 'let', "'", 's', 'test', 'this', 'token', '##izer', '...', '[SEP]', 'on', 'a', 'pair', 'of', 'sentences', '.', '[SEP]']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
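
What the template does can be written out by hand. The sketch below builds the same token and type-ID layout for already-tokenized inputs; `apply_template` is a hypothetical helper that mirrors the `single` and `pair` patterns above and is not how `TemplateProcessing` is implemented.

```python
def apply_template(first, second=None):
    """Add [CLS]/[SEP] and segment IDs following the BERT-style layout."""
    tokens = ["[CLS]"] + first + ["[SEP]"]
    type_ids = [0] * len(tokens)          # [CLS], first sequence, first [SEP] -> 0
    if second is not None:
        tokens += second + ["[SEP]"]
        type_ids += [1] * (len(second) + 1)  # second sequence and its [SEP] -> 1
    return tokens, type_ids


tokens, type_ids = apply_template(["let", "'", "s"], ["on", "a"])
print(tokens)    # ['[CLS]', 'let', "'", 's', '[SEP]', 'on', 'a', '[SEP]']
print(type_ids)  # [0, 0, 0, 0, 0, 1, 1, 1]
```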


#### 🔧 Decoder Setup
The decoder is configured to reverse the WordPiece tokenization process.  
- **Prefix handling**: The `"##"` prefix (used in WordPiece to indicate a subword continuation) is removed during decoding.  
- **Purpose**: Converts token IDs back into a readable text string by joining subwords seamlessly.  

Finally, the encoded token IDs are decoded back into text. The result matches the normalized input rather than the original string: it is lowercased, and the apostrophe in `Let's` comes back surrounded by spaces.


In [23]:
tokenizer.decoder = decoders.WordPiece(prefix="##")
tokenizer.decode(encoding.ids)

"let ' s test this tokenizer... on a pair of sentences."

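### 🔹 Phase 6 : Model Saving

#### 💾 Tokenizer Saving
- **Saved the trained tokenizer** locally as `english_wordpiece_tokenizer.json`, then wrapped it with `PreTrainedTokenizerFast` and saved the result to `./english-wordpiece-tokenizer-60k`.  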
- Ensures the tokenizer can be **reloaded easily** for future model training or inference.  
- Provides a persistent version of the tokenizer for **reuse across projects**.

In [24]:
tokenizer.save("english_wordpiece_tokenizer.json")
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="english_wordpiece_tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

wrapped_tokenizer.save_pretrained("./english-wordpiece-tokenizer-60k")


('./english-wordpiece-tokenizer-60k/tokenizer_config.json',
 './english-wordpiece-tokenizer-60k/special_tokens_map.json',
 './english-wordpiece-tokenizer-60k/tokenizer.json')

#### ☁️ Hub Upload

- **Wrapped the trained tokenizer** using `PreTrainedTokenizerFast` to make it compatible with Hugging Face models.  
- Assigned **special tokens** for unknown, padding, classification, separator, and mask.  
- **Pushed the tokenizer to the Hugging Face Hub**, enabling public access and version control.  
- This allows the tokenizer to be **shared, reused, and integrated** into other projects or LLM training pipelines.

In [25]:
notebook_login()

In [26]:
wrapped_tokenizer.push_to_hub("english-wordpiece-tokenizer-60k")

CommitInfo(commit_url='https://huggingface.co/yakul259/english-wordpiece-tokenizer-60k/commit/798b5f179cff3bd7bd075f617ce78c3450af1dee', commit_message='Upload tokenizer', commit_description='', oid='798b5f179cff3bd7bd075f617ce78c3450af1dee', pr_url=None, repo_url=RepoUrl('https://huggingface.co/yakul259/english-wordpiece-tokenizer-60k', endpoint='https://huggingface.co', repo_type='model', repo_id='yakul259/english-wordpiece-tokenizer-60k'), pr_revision=None, pr_num=None)

---
## Unigram Tokeniser
---

### 🔹 Phase 3 : Tokeniser Model Building

#### 🔧 Normalization for Unigram Tokenizer

Since we already have our dataset (`wikitext-103-raw-v1`) and the training corpus generator ready from previous phases, we will now proceed directly to **normalization** for our Unigram tokenizer.

In this phase, we will:
- Replace doubled backticks and doubled single quotes (`''`), which some corpora use as opening and closing quotation marks, with a plain double quote (`"`).
- Apply **NFKD Unicode normalization** to ensure consistent character representation.
- Strip accents from characters for uniformity.
- Collapse runs of two or more spaces into a single space.

Unlike the WordPiece pipeline, this normalizer does not lowercase the text, so the Unigram vocabulary stays case-sensitive. The normalization still ensures that different visual forms of the same character are treated identically, improving the tokenizer’s generalization and reducing unnecessary vocabulary entries.


In [27]:
tokenizer = Tokenizer(models.Unigram())
tokenizer.normalizer = normalizers.Sequence(
    [
        normalizers.Replace("``", '"'),
        normalizers.Replace("''", '"'),
        normalizers.NFKD(),
        normalizers.StripAccents(),
        normalizers.Replace(Regex(" {2,}"), " "),
    ]
)

#### 🔧 Pre-tokenization Step (Unigram Tokenizer)
After normalization, we move to **pre-tokenization**, which splits text into manageable segments before actual token learning.

Here, we use the **Metaspace** pre-tokenizer:
- Replaces spaces with a special visible character (default `▁`, U+2581), making spaces explicit in the tokenization process.
- Ensures consistent handling of whitespace across the dataset.
- Preserves original text boundaries while still allowing the tokenizer to operate on subword units.

Example:
- Input: `"Let's test the pre-tokenizer!"`
- Output: Spaces become `▁` and the text is split in front of each one, so every piece carries its own leading-space marker.

In [28]:
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test the pre-tokenizer!")

[("▁Let's", (0, 5)),
 ('▁test', (5, 10)),
 ('▁the', (10, 14)),
 ('▁pre-tokenizer!', (14, 29))]
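
The Metaspace behaviour shown above is simple enough to imitate in a few lines. This is a simplified sketch with hypothetical function names: it always prefixes the first word with the marker and handles plain spaces only, so it may differ from the library on edge cases such as text that already starts with a space.

```python
def metaspace_pre_tokenize(text, replacement="\u2581"):
    """Mark spaces with the replacement character and split before each mark."""
    marked = replacement + text.replace(" ", replacement)  # prefix the first word too
    pieces, start = [], 0
    for i in range(1, len(marked)):
        if marked[i] == replacement:
            pieces.append(marked[start:i])
            start = i
    pieces.append(marked[start:])
    return pieces


def metaspace_decode(pieces, replacement="\u2581"):
    """Join pieces, turn markers back into spaces, drop the leading one."""
    text = "".join(pieces).replace(replacement, " ")
    return text[1:] if text.startswith(" ") else text


pieces = metaspace_pre_tokenize("Let's test the pre-tokenizer!")
print(pieces)
print(metaspace_decode(pieces))
```

Because the space information lives inside the tokens themselves, decoding is a plain string join followed by a character replacement.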

### 🔹 Phase 4 : Tokeniser Model Building

#### 🏋️ Model Setup: Unigram Tokenizer Config
In this phase, we train the **Unigram Tokenizer** using our prepared corpus and configuration.

- **Trainer Used:** `UnigramTrainer`
- **Vocabulary Size:** `60,000` tokens, matching the BPE and WordPiece tokenizers and the `vocab_size=60000` passed to the trainer.
- **Special Tokens:**  
  - `<cls>` — Classification token (for sentence-level tasks).  
  - `<sep>` — Separator token (for splitting sequences).  
  - `<unk>` — Unknown token (for unseen words/subwords).  
  - `<pad>` — Padding token (for batch processing).  
  - `<mask>` — Masking token (for MLM tasks).  
  - `<s>` and `</s>` — Start and end of sequence tokens.

- **Unknown Token Handling:**  
  Any text fragment not in the vocabulary will map to `<unk>`.

- **Training Process:**  
  Uses `train_from_iterator(get_training_corpus(), trainer=trainer)` — iteratively feeds batches of text to the trainer for subword vocabulary optimization.

- **Goal:**  
  Build a compact and efficient subword vocabulary that captures the most probable token combinations based on frequency and likelihood.

In [29]:
special_tokens = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>", "<s>", "</s>"]
trainer = trainers.UnigramTrainer(
    vocab_size=60000,
    special_tokens=special_tokens,
    unk_token="<unk>"
)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

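#### 🏋️ Model Setup: Unigram Tokenizer Training

In this step, we explicitly re-initialize the **Unigram model** inside our tokenizer and run the training again. The previous cell already trained the tokenizer, so this cell repeats that training from scratch; running either one is sufficient.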

- **Model Initialization:**  
  `tokenizer.model = models.Unigram()`  
  This sets the tokenizer's core algorithm to **Unigram**, a probabilistic subword segmentation method that selects the most likely tokenization of text based on learned frequencies.

- **Training Execution:**  
  `tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)`  
  - **`get_training_corpus()`** → Streams batches of raw text from our dataset to avoid memory overload.
  - **`trainer`** → Contains vocabulary size, special tokens, and unknown token settings.
  - The training process evaluates token candidates and iteratively removes the least probable ones until the vocabulary reaches the target size (60,000 in our case).

- **Why Unigram?**  
  - Produces a **probabilistic** tokenization, allowing multiple valid segmentations.
  - Robust for noisy, multi-domain text data.
  - Maintains better coverage for rare words compared to purely greedy approaches.

- **Outcome:**  
  A fully trained **Unigram Tokenizer model** with:
  - Optimized vocabulary.
  - Special token handling.
  - Ready for encoding and decoding text in downstream tasks.




In [30]:
tokenizer.model = models.Unigram()
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

In [31]:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

['▁Let', "'", 's', '▁test', '▁this', '▁token', 'izer', '.']
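
At encoding time, a Unigram model picks the segmentation whose pieces have the highest total log-probability, which is typically found with a Viterbi search. The sketch below shows that search on a toy table of log-probabilities; the scores are invented for illustration and `viterbi_segment` is a hypothetical helper, not the library routine.

```python
import math


def viterbi_segment(text, log_probs):
    """Return the highest-scoring split of text into pieces from log_probs."""
    n = len(text)
    best_score = [0.0] + [-math.inf] * n  # best_score[i]: best score for text[:i]
    back = [0] * (n + 1)                  # back[i]: start index of the last piece
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end], back[end] = score, start
    if best_score[n] == -math.inf:
        return None  # text cannot be covered by the given pieces
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Invented log-probabilities for a handful of candidate pieces.
toy_log_probs = {
    "\u2581token": -6.0, "izer": -7.0, "\u2581to": -5.0, "ken": -8.0,
    "ize": -7.5, "r": -4.0, "\u2581": -3.0, "token": -7.0,
}
print(viterbi_segment("\u2581tokenizer", toy_log_probs))  # ['▁token', 'izer']
```

Here `▁token` + `izer` scores -13.0 and beats alternatives such as `▁` + `token` + `izer` (-17.0), which matches the split in the output above.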


### 🔹 Phase 5 : Post Processing

#### 🔧 Post-Tokenization: Special Token Handling
In this phase, the IDs of reserved special tokens such as `<cls>` and `<sep>` are retrieved from the tokenizer’s vocabulary.  
This step ensures that these tokens are correctly recognized and mapped, which is crucial for downstream tasks like classification or sentence-pair processing where such tokens serve as sequence markers.


In [32]:
cls_token_id = tokenizer.token_to_id("<cls>")
sep_token_id = tokenizer.token_to_id("<sep>")
print(cls_token_id, sep_token_id)

0 1


#### 🔧 Post-Processing Template Setup
Here, a post-processing template is defined to control how special tokens are inserted during tokenization. Unlike the BERT-style layout used for WordPiece, this template places the special tokens at the end of the sequence (the convention used by XLNet).  
The configuration specifies:
- **Single sequence format**: the input tokens, then `<sep>`, then `<cls>`.
- **Pair sequence format**: first sequence → `<sep>` → second sequence → `<sep>` → `<cls>`.
- **Special tokens mapping**: `<sep>` and `<cls>` are explicitly linked to their IDs.  

A test encoding is then run on a paired input to verify:
- Correct insertion of special tokens.
- Proper assignment of **type IDs**: `0` for the first sequence, `1` for the second, and `2` for the trailing `<cls>`.

In [33]:
tokenizer.post_processor = processors.TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[("<sep>", sep_token_id), ("<cls>", cls_token_id)],
)
encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences!")
print(encoding.tokens)
print(encoding.type_ids)

['▁Let', "'", 's', '▁test', '▁this', '▁token', 'izer', '.', '.', '.', '<sep>', '▁', 'on', '▁', 'a', '▁pair', '▁of', '▁sentence', 's', '!', '<sep>', '<cls>']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]


#### 🔧 Decoder Setup
The decoder is configured to reverse the Metaspace pre-tokenization used by the Unigram tokenizer.  
- **Purpose**: Converts token IDs back into a readable text string by joining subwords and turning each `▁` marker back into a space.  

The cell below only sets the decoder; calling `tokenizer.decode(encoding.ids)` afterwards is a quick way to check that the text is reconstructed as expected.


In [34]:
tokenizer.decoder = decoders.Metaspace()

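### 🔹 Phase 6 : Model Saving

#### 💾 Tokenizer Saving
- **Saved the trained tokenizer** locally as `english_unigram_tokenizer.json`, then wrapped it with `PreTrainedTokenizerFast` and saved the result to `./english-unigram-tokenizer-60k`.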
- Ensures the tokenizer can be **reloaded easily** for future model training or inference.  
- Provides a persistent version of the tokenizer for **reuse across projects**.

In [35]:
tokenizer.save("english_unigram_tokenizer.json")
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="english_unigram_tokenizer.json",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="left",
)
wrapped_tokenizer.save_pretrained("./english-unigram-tokenizer-60k")

('./english-unigram-tokenizer-60k/tokenizer_config.json',
 './english-unigram-tokenizer-60k/special_tokens_map.json',
 './english-unigram-tokenizer-60k/tokenizer.json')

#### ☁️ Hub Upload

- **Wrapped the trained tokenizer** using `PreTrainedTokenizerFast` to make it compatible with Hugging Face models.  
- Assigned **special tokens** for beginning-of-sequence, end-of-sequence, unknown, padding, classification, separator, and mask, with padding applied on the left.  
- **Pushed the tokenizer to the Hugging Face Hub**, enabling public access and version control.  
- This allows the tokenizer to be **shared, reused, and integrated** into other projects or LLM training pipelines.

In [36]:
wrapped_tokenizer.push_to_hub("english-unigram-tokenizer-60k")

CommitInfo(commit_url='https://huggingface.co/yakul259/english-unigram-tokenizer-60k/commit/66f76b305f9b8af5f6af498e2d8047b7ba7d489b', commit_message='Upload tokenizer', commit_description='', oid='66f76b305f9b8af5f6af498e2d8047b7ba7d489b', pr_url=None, repo_url=RepoUrl('https://huggingface.co/yakul259/english-unigram-tokenizer-60k', endpoint='https://huggingface.co', repo_type='model', repo_id='yakul259/english-unigram-tokenizer-60k'), pr_revision=None, pr_num=None)