# Tokenizer

This notebook builds a tokenizer for the training dataset of a small language model (SLM). Unlike the large language models (LLMs) that power applications such as ChatGPT, Gemini, and Meta AI, SLMs are built using smaller datasets.

The dataset used is NaijaWeb [1], a corpus of text from websites popular among Nigerians, which provides a rich resource for modeling Nigerian linguistic and cultural contexts.

## Imports

In [1]:
import re

## Loading and tokenizing dataset

In [2]:
from datasets import load_dataset

naija_web = load_dataset("saheedniyi/naijaweb")
dataset = naija_web["train"]["text"][:500]
print(f"The dataset consists of {len(dataset)} paragraphs.")

The dataset consists of 500 paragraphs.


In [None]:
import re
from typing import List

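def space_tokenizer(text: str) -> List[str]:
    """Splits a string into whitespace-separated tokens after replacing selected symbols with spaces.

    Characters outside the replaced set, such as periods and commas, stay attached to tokens.
    """
    # Replace the listed punctuation marks and symbols with a space before splitting
    text = re.sub(r'[!;:()"\[\]{}<>/\\`~@#$%^&*_+=|\n“”]', ' ', text)
    # Split on runs of whitespace
    tokens = re.split(r'\s+', text)
    # Drop empty strings left at the start or end of the text
    tokens = [token for token in tokens if token]
    return tokens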

space_tokenizer(dataset[0])

['Governor',
 'Samuel',
 'Ortom',
 'of',
 'Benue',
 'State',
 'By',
 'Peter',
 'Duru',
 'Governor',
 'Samuel',
 'Ortom',
 'of',
 'Benue',
 'state',
 'has',
 'commended',
 'President',
 'Muhammadu',
 'Buhari',
 'for',
 'his',
 'directive',
 'to',
 'security',
 'agents',
 'to',
 'shoot',
 'anyone',
 'illegally',
 'bearing',
 'AK47',
 'rifle',
 'in',
 'the',
 'country.',
 'The',
 'Governor',
 'who',
 'gave',
 'the',
 'commendation',
 'Thursday',
 'in',
 'Makurdi',
 'said',
 'the',
 'President’s',
 'order',
 'would',
 'reduce',
 'the',
 'level',
 'of',
 'criminality,',
 'banditry',
 'and',
 'militia',
 'herders’',
 'attacks',
 'on',
 'Benue',
 'communities',
 'as',
 'well',
 'as',
 'in',
 'other',
 'parts',
 'of',
 'the',
 'country.',
 'According',
 'to',
 'him,',
 'the',
 'order',
 'would',
 'also',
 'make',
 'the',
 'communities',
 'safer',
 'for',
 'displaced',
 'farmers',
 'to',
 'return',
 'to',
 'their',
 'ancestral',
 'homes.',
 'I',
 'wish',
 'to',
 'commend',
 'Mr.',
 'President',
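
The output above shows that tokens such as `'country.'`, `'criminality,'` and `'him,'` keep their trailing punctuation, so `'country.'` and `'country'` would occupy separate vocabulary entries. One possible variant, sketched below, emits punctuation marks as tokens of their own. The function name `punct_tokenizer` and its regular expression are illustrative assumptions, not part of this notebook's pipeline.

```python
import re
from typing import List

def punct_tokenizer(text: str) -> List[str]:
    """Hypothetical variant: words and punctuation marks become separate tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches any single character that is neither of those nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "in the country. According to him, the order"
print(punct_tokenizer(sample))
```

A side effect of this choice is that possessives such as `President’s` are split into three tokens, which may or may not be desirable for the model.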

In [4]:
tokens = []

for paragraph in dataset:
    tokens.extend(space_tokenizer(paragraph))

print(f"Total number of tokens in the dataset: {len(tokens):,}")

Total number of tokens in the dataset: 318,350


In [None]:
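unknown_token = "<UNK>"  # Stands in for tokens that are not in the vocabulary
pad_token = "<PAD>"  # Reserved for padding sequences to a common length

def build_vocabulary(tokens: List[str]) -> List[str]:
    """Builds a vocabulary from a list of tokens.

    The padding token comes first (index 0) and the unknown token comes last.
    """
    return [pad_token] + list(set(tokens)) + [unknown_token]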

In [6]:
vocabulary = build_vocabulary(tokens)

print(f"There are {len(vocabulary)} unique tokens in the Naijaweb dataset.")

vocabulary[:20]

There are 34838 unique tokens in the Naijaweb dataset.


['counterparts,',
 'doomsday',
 'plague,not',
 'Statement',
 'queried.',
 'territories',
 'healing',
 'unprepared,',
 'vendor',
 'collection.',
 'needed.',
 'face,',
 'legitimacy',
 'arrows',
 'March',
 'Ukachukwu',
 'battle.',
 'ABUJA',
 'quintuplets.',
 'Qaqa']
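
Because `build_vocabulary` passes the tokens through `set`, the order of the vocabulary, and therefore every token ID, can differ from one Python session to the next (string hashing is randomized per process by default). If reproducible IDs matter, for example when a trained model is saved and reloaded, one option is to sort the unique tokens. The sketch below is an assumption about how that could look; `build_sorted_vocabulary`, `PAD_TOKEN`, and `UNK_TOKEN` are hypothetical names.

```python
from typing import List

PAD_TOKEN = "<PAD>"
UNK_TOKEN = "<UNK>"

def build_sorted_vocabulary(tokens: List[str]) -> List[str]:
    """Hypothetical variant of build_vocabulary with a stable ordering."""
    # Sorting the unique tokens gives the same order on every run.
    return [PAD_TOKEN] + sorted(set(tokens)) + [UNK_TOKEN]

toy_tokens = ["the", "Governor", "of", "Benue", "the", "of"]
print(build_sorted_vocabulary(toy_tokens))
```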

## Mapping tokens to token IDs (indices)

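Language models cannot operate on raw text directly, so each token must be mapped to a unique integer ID, and each ID must map back to its token so that model output can be converted to text again.

Converting text to token IDs is called encoding, and converting IDs back to text is called decoding. Both are plain dictionary lookups; they share their names with the encoder and decoder of the transformer architecture but are a separate preprocessing step.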
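
As a minimal illustration of the two-way mapping, the sketch below uses a made-up five-entry vocabulary (`toy_vocabulary` is not taken from the dataset) and shows that looking a token up, then looking its ID up, returns the original token.

```python
# Toy vocabulary for illustration only; the real one is built from the dataset.
toy_vocabulary = ["<PAD>", "Benue", "Governor", "of", "<UNK>"]

# Dictionary comprehensions build both directions of the mapping.
toy_token_to_index = {token: index for index, token in enumerate(toy_vocabulary)}
toy_index_to_token = {index: token for index, token in enumerate(toy_vocabulary)}

ids = [toy_token_to_index[token] for token in ["Governor", "of", "Benue"]]
print(ids)                                    # [2, 3, 1]
print([toy_index_to_token[i] for i in ids])   # ['Governor', 'of', 'Benue']
```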

In [7]:
token_to_index = {}
index_to_token = {}

for index, token in enumerate(vocabulary):
    token_to_index[token] = index
    index_to_token[index] = token

In [None]:
unk_index = token_to_index[unknown_token]

def encode(text: str) -> List[int]:
    """Encodes a text sequence into a list of indices based on the vocabulary."""

    indices = []
    for token in space_tokenizer(text):
        token_index = token_to_index.get(token, unk_index)
        indices.append(token_index)

    return indices


def decode(indices: int | List[int]) -> str:
    """Decodes a list of indices (or a single index) back into a space-joined string."""

    # If a single integer is passed, convert it into a list.
    if isinstance(indices, int):
        indices = [indices]

    tokens = []

    for index in indices:
        # Fall back to the unknown token (a string) for indices outside the vocabulary.
        token = index_to_token.get(index, unknown_token)
        tokens.append(token)

    # Join the decoded tokens into a single string.
    return " ".join(tokens)

In [12]:
text = dataset[0]
print(text)

Governor Samuel Ortom of Benue State
By Peter Duru
Governor Samuel Ortom of Benue state has commended President Muhammadu Buhari for his directive to security agents to shoot anyone illegally bearing AK47 rifle in the country.
The Governor who gave the commendation Thursday in Makurdi said the President’s order would reduce the level of criminality, banditry and militia herders’ attacks on Benue communities as well as in other parts of the country.
According to him, “the order would also make the communities safer for displaced farmers to return to their ancestral homes.
“I wish to commend Mr. President for his recent order against those bearing AK47 rifles. This I am sure will reduce the high rate of criminality, banditary and militia herdsmen attacks on our farming communities,” the Governor said.
He noted that President Buhari had done the right thing by listening to the calls he and other concerned Nigerians made on the need for the Federal Government to act faster and decisively t

In [13]:
token_id = encode(text)
len(token_id)

254

In [14]:
decode(token_id)

'Governor Samuel Ortom of Benue State By Peter Duru Governor Samuel Ortom of Benue state has commended President Muhammadu Buhari for his directive to security agents to shoot anyone illegally bearing AK47 rifle in the country. The Governor who gave the commendation Thursday in Makurdi said the President’s order would reduce the level of criminality, banditry and militia herders’ attacks on Benue communities as well as in other parts of the country. According to him, the order would also make the communities safer for displaced farmers to return to their ancestral homes. I wish to commend Mr. President for his recent order against those bearing AK47 rifles. This I am sure will reduce the high rate of criminality, banditary and militia herdsmen attacks on our farming communities, the Governor said. He noted that President Buhari had done the right thing by listening to the calls he and other concerned Nigerians made on the need for the Federal Government to act faster and decisively to 

In [16]:
new_sentence = "The quick brown fox jumps over the lazy dog and blockchain technology is fascinating."
encoded_ids = encode(new_sentence)

print("Original sentence:", new_sentence)
print("Encoded IDs:", encoded_ids)

decoded_sentence = decode(encoded_ids)
print("Decoded sentence:", decoded_sentence)

Original sentence: The quick brown fox jumps over the lazy dog and blockchain technology is fascinating.
Encoded IDs: [13580, 20167, 34837, 34837, 7956, 9632, 18350, 13001, 13562, 30022, 27059, 32002, 28032, 34837]
Decoded sentence: The quick UNK UNK jumps over the lazy dog and blockchain technology is UNK
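
Words the vocabulary has not seen, such as `brown` and `fox` here, collapse to the unknown token, and `fascinating.` may be missed only because of its attached period. A rough way to quantify this is the fraction of tokens that fall outside the vocabulary. The sketch below assumes a plain whitespace split and a small made-up vocabulary; `unknown_rate` is a hypothetical helper, not part of this notebook.

```python
from typing import Iterable, Set

def unknown_rate(tokens: Iterable[str], known: Set[str]) -> float:
    """Hypothetical helper: fraction of tokens that are not in the vocabulary."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in known)
    return unknown / len(tokens)

# Made-up vocabulary that lacks "brown" and "fox".
toy_known = {"The", "quick", "jumps", "over", "the", "lazy", "dog"}
sentence = "The quick brown fox jumps over the lazy dog"
print(f"{unknown_rate(sentence.split(), toy_known):.2f}")
```

A high rate on held-out text would suggest that the 500-paragraph sample is too small, or that the tokenizer is splitting text too coarsely.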


## References

[1] Saheed Azeez. (2024). *Naijaweb: A Web Scraped Nigerian Context Dataset* (Version 1.0.0). Hugging Face Datasets. Available at: [https://huggingface.co/datasets/saheedniyi/naijaweb](https://huggingface.co/datasets/saheedniyi/naijaweb)