# Omeife Technologies Internship – Week 7: Tokenizer Models

**Prepared by:** Abdul Samod

---

This notebook implements three tokenizers (whitespace, rule-based, and subword) on a Yoruba-language health dataset. Run the cells in order.

## Setup

Install the required packages by running the cell below in Colab. If you are running the notebook locally in VS Code, install the same packages into your environment.

The cell installs `nltk`, `sentencepiece`, and, optionally, `transformers`.

In [1]:
!pip install -q nltk sentencepiece transformers

# For NLTK punkt tokenizer data
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Sample Corpus

This sample corpus is only a placeholder. The tokenizer sections below load the actual dataset instead.

In [2]:
texts = [
    "Omeife Technologies dey build cool AI products.",
    "They sabi create tools for Nigerian languages like Yoruba, Igbo, and Hausa.",
    "Machine learning no easy, but we go still make am work well well!",
    "This is an unseenword test: NaijatechwizGPT should be handled by subword tokenizer.",
]

# Quick preview
for i, t in enumerate(texts, 1):
    print(f"{i}.", t)


1. Omeife Technologies dey build cool AI products.
2. They sabi create tools for Nigerian languages like Yoruba, Igbo, and Hausa.
3. Machine learning no easy, but we go still make am work well well!
4. This is an unseenword test: NaijatechwizGPT should be handled by subword tokenizer.


## 1) Whitespace Tokenizer

A simple `str.split()` implementation. It is fast but naive: punctuation stays attached to the neighbouring word (for example, `mpox.`).

In [5]:
import time
import pandas as pd

# Loading the dataset and accessing the column
df = pd.read_excel("/content/yoruba_health_cleaned.xlsx")

# Preview to confirm column names
print("Columns:", df.columns)

# dropna() keeps missing cells from becoming the literal string "nan"
texts = df["Yoruba Language"].dropna().astype(str).tolist()


# Define whitespace tokenizer
def whitespace_tokenize(texts):
    return [t.split() for t in texts]

# Apply tokenizer
ws_tokens = whitespace_tokenize(texts)

# Preview first few tokenized lines
for i, s in enumerate(ws_tokens[:5], 1):
    print(f"{i}: {s}\n")

Columns: Index(['Yoruba Language', 'English Translation'], dtype='object')
1: ['igbesẹ', 'yii', 'waye', 'lati', 'jẹ', 'kí', 'abẹrẹ', 'naa', 'tete', 'de', 'sibi', 'ti', 'wọn', 'ti', 'nilo', 'rẹ', 'lati', 'gbe', 'ogun', 'ti', 'itankalẹ', 'aarun', 'mpox.']

2: ['bavarian', 'nordic', 'a/s', 'lo', 'pelo', 'abẹrẹ', 'naa,', 'ti', 'ajọ', 'europe', 'medicine', 'agency', 'si', 'se', 'ayẹwo', 'rẹ.']

3: ['tẹlẹ,', 'ko', 'si', 'abẹrẹ', 'ajẹsara', 'fun', 'aarun', 'mpox,', 'eyi', 'to', 'fa', 'ti', 'wọn', 'n', 'fi', 'lo', 'abẹrẹ', 'ajẹsara', 'small', 'pox', 'lati', 'koju', 'rẹ', 'nitori', 'aarun', 'mejeeji', 'jọ', 'ara', 'wọn.']

4: ['ọga', 'agba', 'fun', 'who,', 'dokita', 'tedros', 'adhanom', 'ghebreyesus', 'gba', 'niyanju', 'pe', 'ki', 'wọn', 'yara', 'lati', 'gbe', 'abẹrẹ', 'ajẹsara', 'naa', 'lọ', 'ilẹ', 'afrika', 'ati', 'awọn', 'agbegbe', 'mii', 'ti', 'aarun', 'naa', 'ti', 'n', 'ja.']

5: ['dokita', 'yukiko', 'nakatani', 'salaye', 'pe', 'bibuwọlu', 'abẹrẹ', 'ajẹsara', 'yii', 'yoo', 'ran', 'ijọba', 
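The output above shows the main weakness of whitespace splitting: punctuation stays glued to the word, so `mpox.` and `mpox,` count as different vocabulary entries. A minimal sketch of that effect follows, using two shortened sentences adapted from the dataset preview. The `sample` list and the variable names are illustrative and not part of the notebook.

```python
from collections import Counter

# Two shortened lines adapted from the dataset preview (illustrative only)
sample = [
    "lati gbe ogun ti aarun mpox.",
    "ko si fun aarun mpox, eyi to fa",
]

# Same logic as whitespace_tokenize, flattened into one token list
tokens = [tok for line in sample for tok in line.split()]
counts = Counter(tokens)

# The same word appears under two spellings because of trailing punctuation
mpox_variants = sorted(tok for tok in counts if tok.startswith("mpox"))
print(mpox_variants)
print("aarun occurrences:", counts["aarun"])
```

Words without adjacent punctuation, such as `aarun`, are counted consistently; only tokens next to punctuation fragment the vocabulary.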

## 2) Rule-based Tokenizer

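Of the two rule-based variants, I chose the regex one. The pattern matches runs of word characters and keeps `.`, `,`, `!`, `?`, and `;` as separate tokens. Any other symbol, such as the `/` in `a/s`, is dropped.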

In [2]:
import re
import pandas as pd

# Loading the dataset and accessing the column
df = pd.read_excel("/content/yoruba_health_cleaned.xlsx")
# dropna() keeps missing cells from becoming the literal string "nan"
texts = df["Yoruba Language"].dropna().astype(str).tolist()

# Regex tokenizer function
def regex_tokenize(texts):
    pattern = r"\b\w+\b|[.,!?;]"
    return [re.findall(pattern, t) for t in texts]

# Quick demo
print('Regex:')
for s in regex_tokenize(texts):
    print(s)

Regex:
['igbesẹ', 'yii', 'waye', 'lati', 'jẹ', 'kí', 'abẹrẹ', 'naa', 'tete', 'de', 'sibi', 'ti', 'wọn', 'ti', 'nilo', 'rẹ', 'lati', 'gbe', 'ogun', 'ti', 'itankalẹ', 'aarun', 'mpox', '.']
['bavarian', 'nordic', 'a', 's', 'lo', 'pelo', 'abẹrẹ', 'naa', ',', 'ti', 'ajọ', 'europe', 'medicine', 'agency', 'si', 'se', 'ayẹwo', 'rẹ', '.']
['tẹlẹ', ',', 'ko', 'si', 'abẹrẹ', 'ajẹsara', 'fun', 'aarun', 'mpox', ',', 'eyi', 'to', 'fa', 'ti', 'wọn', 'n', 'fi', 'lo', 'abẹrẹ', 'ajẹsara', 'small', 'pox', 'lati', 'koju', 'rẹ', 'nitori', 'aarun', 'mejeeji', 'jọ', 'ara', 'wọn', '.']
['ọga', 'agba', 'fun', 'who', ',', 'dokita', 'tedros', 'adhanom', 'ghebreyesus', 'gba', 'niyanju', 'pe', 'ki', 'wọn', 'yara', 'lati', 'gbe', 'abẹrẹ', 'ajẹsara', 'naa', 'lọ', 'ilẹ', 'afrika', 'ati', 'awọn', 'agbegbe', 'mii', 'ti', 'aarun', 'naa', 'ti', 'n', 'ja', '.']
['dokita', 'yukiko', 'nakatani', 'salaye', 'pe', 'bibuwọlu', 'abẹrẹ', 'ajẹsara', 'yii', 'yoo', 'ran', 'ijọba', 'ati', 'awọn', 'ajọ', 'bi', 'gavi', 'ati', 'unicef',
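Two details of this pattern are worth checking. First, `\w` in Python matches precomposed letters such as `ẹ` but not standalone combining marks, so text stored in decomposed (NFD) form would be cut at every dotted or toned vowel. Normalizing to NFC first avoids most of this. Whether the dataset needs it is an assumption, since the output above looks correctly composed. Second, `a/s` loses its slash. The sketch below shows both effects. `VARIANT_PATTERN` and the `tokenize` helper are hypothetical and not the notebook's code.

```python
import re
import unicodedata

ORIGINAL_PATTERN = r"\b\w+\b|[.,!?;]"
# Hypothetical variant: allow /, ' or - inside a word, and keep every
# other non-space symbol as its own token instead of dropping it.
VARIANT_PATTERN = r"\w+(?:[/'-]\w+)*|[^\w\s]"

def tokenize(text, pattern=ORIGINAL_PATTERN, normalize=True):
    """Tokenize one string, optionally normalizing it to NFC first."""
    if normalize:
        text = unicodedata.normalize("NFC", text)
    return re.findall(pattern, text)

# "abẹrẹ naa, a/s." written with escapes so the composed form is explicit
composed = "ab\u1eb9r\u1eb9 naa, a/s."
decomposed = unicodedata.normalize("NFD", composed)

without_nfc = tokenize(decomposed, normalize=False)  # words break at the dots
with_nfc = tokenize(decomposed)                      # words stay whole
with_variant = tokenize(decomposed, pattern=VARIANT_PATTERN)  # a/s stays whole

print(without_nfc)
print(with_nfc)
print(with_variant)
```

Vowels that carry both a dot below and a tone mark have no single precomposed code point, so NFC alone may not cover every Yoruba word.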

## 3) Subword Tokenizer (SentencePiece)

This section trains a SentencePiece unigram model on the Yoruba corpus and uses it to split text into subword pieces.

In [3]:
import pandas as pd

df = pd.read_excel("/content/yoruba_health_cleaned.xlsx")

# Extract the Yoruba text column and convert to string list
# (dropna() keeps missing cells from becoming the literal string "nan")
texts = df["Yoruba Language"].dropna().astype(str).tolist()

# Check first few lines
print(texts[:5])


['igbesẹ yii waye lati jẹ kí abẹrẹ naa tete de sibi ti wọn ti nilo rẹ lati gbe ogun ti itankalẹ aarun mpox.', 'bavarian nordic a/s lo pelo abẹrẹ naa, ti ajọ europe medicine agency si se ayẹwo rẹ.', 'tẹlẹ, ko si abẹrẹ ajẹsara fun aarun mpox, eyi to fa ti wọn n fi lo abẹrẹ ajẹsara small pox lati koju rẹ nitori aarun mejeeji jọ ara wọn.', 'ọga agba fun who, dokita tedros adhanom ghebreyesus gba niyanju pe ki wọn yara lati gbe abẹrẹ ajẹsara naa lọ ilẹ afrika ati awọn agbegbe mii ti aarun naa ti n ja.', 'dokita yukiko nakatani salaye pe bibuwọlu abẹrẹ ajẹsara yii yoo ran ijọba ati awọn ajọ bi gavi ati unicef lọwọ lati gba abẹrẹ naa, ti wọn yoo si pese rẹ fun awọn to nilo rẹ.']


In [4]:
# Converting data to txt file
with open("yoruba_health_corpus.txt", "w", encoding="utf-8") as f:
    for line in texts:
        line = line.strip()
        if line:  # ignore empty lines
            f.write(line + "\n")

print("Yoruba health corpus file created successfully!")


Yoruba health corpus file created successfully!


In [5]:
import sentencepiece as spm

# Train SentencePiece (Unigram model)
spm.SentencePieceTrainer.train(
    input="yoruba_health_corpus.txt",
    model_prefix="yoruba_health",
    vocab_size=2000,          # Adjust based on dataset size; 1000–5000 is a good start
    model_type="unigram",     # Unigram or BPE both work fine
    character_coverage=1.0,   # Important for full Yoruba tone mark coverage
    train_extremely_large_corpus=False
)

print("SentencePiece model trained successfully!")


SentencePiece model trained successfully!
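`vocab_size` has to fit the corpus: SentencePiece needs at least enough pieces to cover every distinct character, and on a small corpus it can reject a value that is too high. A rough sketch for inspecting the corpus before training follows, using only the standard library. The `corpus_stats` helper is hypothetical, and its numbers are a sanity check rather than a rule from the SentencePiece documentation.

```python
from collections import Counter

def corpus_stats(lines):
    """Count non-empty lines, distinct characters and distinct words."""
    chars = Counter()
    words = Counter()
    n_lines = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, as the corpus-writing cell does
        n_lines += 1
        chars.update(ch for ch in line if not ch.isspace())
        words.update(line.split())
    return {
        "lines": n_lines,
        "distinct_chars": len(chars),
        "distinct_words": len(words),
    }

# Tiny illustrative input; any iterable of strings works
stats = corpus_stats(["ko si", "", "ko lo"])
print(stats)
```

To run it on the notebook's corpus, pass an open file handle for `yoruba_health_corpus.txt` (opened with `encoding="utf-8"`) as `lines`. A `vocab_size` far above `distinct_words` is a hint that the value may be too large for the data.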


In [7]:
# Load trained SentencePiece model
sp = spm.SentencePieceProcessor(model_file="yoruba_health.model")

# Test a few sentences
for sentence in texts[:5]:
    tokens = sp.encode(sentence, out_type=str)
    print(tokens)

['▁igbe', 'sẹ', '▁yii', '▁waye', '▁lati', '▁jẹ', '▁kí', '▁abẹ', 'rẹ', '▁naa', '▁t', 'ete', '▁de', '▁si', 'bi', '▁ti', '▁wọn', '▁ti', '▁nilo', '▁rẹ', '▁lati', '▁gbe', '▁ogun', '▁ti', '▁itankal', 'ẹ', '▁aarun', '▁mpox', '.']
['▁ba', 'v', 'a', 'ri', 'an', '▁n', 'o', 'r', 'di', 'c', '▁a', '/', 's', '▁lo', '▁pe', 'lo', '▁abẹ', 'rẹ', '▁naa', ',', '▁ti', '▁ajọ', '▁europe', '▁m', 'e', 'di', 'c', 'ine', '▁a', 'ge', 'n', 'c', 'y', '▁si', '▁se', '▁ayẹwo', '▁rẹ', '.']
['▁tẹlẹ', ',', '▁ko', '▁si', '▁abẹ', 'rẹ', '▁ajẹsara', '▁fun', '▁aarun', '▁mpox', ',', '▁eyi', '▁to', '▁fa', '▁ti', '▁wọn', '▁n', '▁fi', '▁lo', '▁abẹ', 'rẹ', '▁ajẹsara', '▁', 'small', '▁', 'po', 'x', '▁lati', '▁ko', 'ju', '▁rẹ', '▁ni', 'tor', 'i', '▁aarun', '▁meje', 'e', 'j', 'i', '▁jọ', '▁ara', '▁wọn', '.']
['▁ọ', 'ga', '▁agba', '▁fun', '▁who', ',', '▁', 'dok', 'ita', '▁', 'ted', 'ros', '▁ad', 'han', 'om', '▁ghebreyesus', '▁gba', '▁ni', 'yan', 'ju', '▁pe', '▁ki', '▁wọn', '▁ya', 'ra', '▁lati', '▁gbe', '▁abẹ', 'rẹ', '▁ajẹsara', '▁naa'
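In the output above, `▁` (U+2581) marks where a space preceded the piece, which is what makes SentencePiece tokenization reversible. A minimal sketch of the reverse step in plain Python follows, assuming pieces like those printed above. In the notebook itself the trained model's own decoding methods do this; `detokenize` is only an illustrative helper.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" marker placed where a space used to be

def detokenize(pieces):
    """Join subword pieces and turn boundary markers back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()

# First four pieces of the first sentence in the output above
pieces = ["▁igbe", "sẹ", "▁yii", "▁waye"]
text = detokenize(pieces)
print(text)
```

Pieces without the marker, such as `sẹ`, attach to the previous piece, which is why a word split into several pieces still round-trips to the original text.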

## Summary: Which Tokenizer Performed Best



*   Vocabulary size: All three tokenizers ran on the same dataset. The whitespace tokenizer handled it most gracefully, needing no pattern or training step; speed is covered under processing time. The subword vocabulary is fixed at 2,000 pieces by `vocab_size`, while the other two vocabularies grow with the corpus.
*   Average tokens per sentence: The whitespace and rule-based tokenizers produced similar averages of 19 to 21 tokens per sentence, while the subword tokenizer averaged 24 to 26.
*   Handling of unknown words: The whitespace and rule-based tokenizers handled unknown words better than the subword model because they keep each word intact. The subword model split rare or foreign words unevenly (`bavarian` became `▁ba`, `v`, `a`, `ri`, `an`) and also split some familiar Yoruba words (`abẹrẹ` became `▁abẹ`, `rẹ`).
*   Processing time: The whitespace tokenizer ran in 3.574 s and the rule-based tokenizer in 3.566 s. The subword tokenizer ran in 1.171 s, although its work was split across several cells.
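The notebook does not show how the figures above were measured, so the following is only a sketch of one way to compute them for the two tokenizers that need no training. It times the tokenizer call alone (not the Excel load) with `time.perf_counter`. The `evaluate` helper and `sample_texts` are assumptions for illustration.

```python
import re
import time

def whitespace_tokenize(texts):
    return [t.split() for t in texts]

def regex_tokenize(texts):
    pattern = r"\b\w+\b|[.,!?;]"
    return [re.findall(pattern, t) for t in texts]

def evaluate(name, tokenize_fn, texts):
    """Return vocabulary size, average tokens per sentence and run time."""
    start = time.perf_counter()
    tokenized = tokenize_fn(texts)
    elapsed = time.perf_counter() - start

    total_tokens = sum(len(sentence) for sentence in tokenized)
    vocab = {tok for sentence in tokenized for tok in sentence}
    average = total_tokens / len(tokenized) if tokenized else 0.0
    return {
        "tokenizer": name,
        "vocab_size": len(vocab),
        "avg_tokens_per_sentence": average,
        "seconds": elapsed,
    }

# Illustrative input; in the notebook this would be the `texts` list
sample_texts = [
    "ko si fun aarun mpox.",
    "aarun mpox, aarun naa.",
]

results = [
    evaluate("whitespace", whitespace_tokenize, sample_texts),
    evaluate("regex", regex_tokenize, sample_texts),
]
for row in results:
    print(row)
```

Timing every tokenizer through the same helper, on the same list, keeps the comparison fair. The subword model could be added by passing a function that wraps its encode call.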




