## Section 1: Load Yambeta Sentences from Excel
This section extracts a corpus of Yambeta-language text from an Excel file of Bible passages. Sentences are read from the column labeled 'Bible text (YAT)', and rows with missing values (NaN) are dropped so that only non-null entries reach later steps. The resulting list is the training input for the tokenizer.

In [None]:
import pandas as pd

# Load the Excel file
file_path = 'final_dataset.xlsx'
df = pd.read_excel(file_path)

# Assuming the sentences are in a column named 'Bible text (YAT)'
sentences_column = 'Bible text (YAT)'

# Extract the sentences and store them in a list
yambeta_sentences = df[sentences_column].dropna().tolist()

print(f"Loaded {len(yambeta_sentences)} Yambeta sentences.")


Loaded 7897 Yambeta sentences.
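
As a sketch of the filtering described above, the helper below additionally checks that the column exists and drops whitespace-only cells, which `dropna()` alone would keep. The name `extract_sentences` and the demo rows are illustrative and not part of the notebook; the example runs on a small in-memory DataFrame rather than on final_dataset.xlsx.

```python
import pandas as pd


def extract_sentences(df: pd.DataFrame, column: str) -> list[str]:
    """Return the non-empty, whitespace-stripped strings stored in `column`."""
    if column not in df.columns:
        raise KeyError(f"Column {column!r} not found; available: {list(df.columns)}")
    # Drop missing cells, coerce the rest to str, and trim surrounding whitespace
    series = df[column].dropna().astype(str).str.strip()
    # Discard cells that are empty after trimming
    return series[series != ""].tolist()


# Illustrative stand-in for the Excel sheet (hypothetical rows)
demo = pd.DataFrame({"Bible text (YAT)": ["Yə́sus", None, "   ", " pɔɔd "]})
sentences = extract_sentences(demo, "Bible text (YAT)")
print(f"Loaded {len(sentences)} Yambeta sentences.")
```

Failing early on a missing column gives a clearer message than the bare `KeyError` pandas raises when the sheet layout changes.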


## Section 2: Helper Functions for Batch Processing and Saving to Hugging Face Hub
In this section, we define utility functions for batching the Yambeta corpus and for publishing the trained tokenizer to the Hugging Face Hub. The batch_iterator function yields the corpus in batches of batch_size sentences, which is the form the tokenizer trainer consumes. The save_to_hf_hub function saves the tokenizer locally, writes a model card (README.md), and uploads the result to the DS4H-ICTU/yat-bert-tokenizer repository on the Hub, which the code creates as a private repository.

In [None]:
pip install huggingface_hub



In [None]:
from huggingface_hub import HfApi
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer
from transformers import BertTokenizerFast
from google.colab import drive
import os


batch_size = 1000

# Batch helper function
def batch_iterator():
    for i in range(0, len(yambeta_sentences), batch_size):
        batch = yambeta_sentences[i : i + batch_size]
        batch_texts = [str(item) for item in batch]
        yield batch_texts

def check_local_readme():
    file_path = "yat-bert-tokenizer/README.md"
    if os.path.exists(file_path):
        print(f"README.md exists at {file_path}")
    else:
        print(f"README.md does not exist at {file_path}.")


# Hugging Face saver function
def save_to_hf_hub_old(tokenizer):
    drive.mount('/content/drive/')
    token_file_path = '/content/drive/MyDrive/hf/pt4c-huggingface_token.txt'
    with open(token_file_path, 'r') as file:
        huggingface_token = file.read().strip()

    tokenizer.save_pretrained('yat-bert-tokenizer')
    tokenizer.push_to_hub("DS4H-ICTU/yat-bert-tokenizer", token=huggingface_token)

import os
from transformers import PreTrainedTokenizerFast

def save_to_hf_hub(tokenizer):
    # Mount drive (if using Google Colab)
    drive.mount('/content/drive/')

    # Get the Hugging Face token from the specified file
    token_file_path = '/content/drive/MyDrive/hf/pt4c-huggingface_token.txt'
    with open(token_file_path, 'r') as file:
        huggingface_token = file.read().strip()

    # Save the tokenizer to the local directory
    tokenizer.save_pretrained('yat-bert-tokenizer')

    # Create a model card with metadata
    model_card = generate_model_card()

    # Save the model card in the tokenizer directory
    model_card_path = "yat-bert-tokenizer/README.md"
    with open(model_card_path, "w") as f:
        f.write(model_card)

    # Check if README.md is correctly saved
    check_local_readme()

    # Create the model repository on the Hub if it does not exist yet
    repo_id = "DS4H-ICTU/yat-bert-tokenizer"

    api = HfApi()
    try:
        # create_repo is a method of HfApi; exist_ok avoids an error on re-runs
        api.create_repo(repo_id, repo_type="model", private=True, exist_ok=True, token=huggingface_token)
        print(f"Repository ready: {repo_id}")
    except Exception as e:
        print(f"Error creating repository: {e}")

    # Upload the saved tokenizer files together with README.md
    api.upload_folder(
        folder_path="yat-bert-tokenizer",
        repo_id=repo_id,
        repo_type="model",
        token=huggingface_token
    )

    print("Tokenizer and model card uploaded successfully!")

def generate_model_card():
    # Template for the model card with metadata placeholders
    model_card_template = """# Yambeta Tokenizer for NLP tasks

## Model Description
This tokenizer was developed for Yambeta, a Bantu language from Cameroon. It uses the WordPiece model and was trained on Yambeta text so that it handles the phonetic and diacritical features of the language.

- **Developed by**: DS4H-ICTU Research Group in Cooperation with the
- **Language(s)**: Yambeta (Bantu language from Cameroon)
- **License**: Apache 2.0 (or specify if different)
- **Model Type**: Tokenizer (WordPiece)

## Model Sources
- **Repository**: [Your repository URL]
- **Paper**: [Link to related paper if available]
- **Demo**: [Optional: link to demo]

## Uses
- **Direct Use**: This tokenizer is designed for NLP tasks such as Named Entity Recognition (NER), translation, and text generation in the Yambeta language.
- **Downstream Use**: Can be used as a foundation for models processing Yambeta text.

## Bias, Risks, and Limitations
- **Biases**: The tokenizer might not perfectly capture linguistic nuances due to the limited size of the Yambeta corpus.
- **Out-of-Scope Use**: The tokenizer may not perform well for non-Yambeta languages.

## Training Details
- **Training Data**: Extracted from Yambeta Bible text corpus (final_dataset.xlsx).
- **Training Procedure**: Preprocessing of text involved normalization of diacritics, tokenization using WordPiece, and post-processing to handle special tokens.
- **Training Hyperparameters**:
  - Vocabulary Size: 25,000
  - Special Tokens: [UNK], [PAD], [CLS], [SEP], [MASK]

## Evaluation
- **OOV Rate**: 0.36%
- **Tokenization Efficiency**: Average tokens per sentence: 23.25
- **Special Character Handling**: Successfully handles diacritics and tone markers in Yambeta.

## Environmental Impact
- **Hardware Type**: Google Colab GPU
- **Hours Used**: 4 hours (training time)
- **Cloud Provider**: Google Cloud
- **Carbon Emitted**: Estimated using [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700) calculator

## Citation
If you use this tokenizer in your work, please cite it using the following format:

```
@misc{yambeta_tokenizer,
  title = {Yambeta Tokenizer},
  author = {Dr.-Ing. Philippe Tamla},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/DS4H-ICTU/yat-bert-tokenizer}
}
```

## Contact Information
For more information, contact the developers at: philiptamla@gmail.com"""

    return model_card_template
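
The `batch_iterator` helper above reads the module-level `yambeta_sentences` and `batch_size`. A parameterized variant is sketched below under the assumption that passing the data explicitly is acceptable; the name `iter_batches` is hypothetical and the input list is made up for illustration.

```python
from typing import Iterator, Sequence


def iter_batches(items: Sequence[object], size: int) -> Iterator[list[str]]:
    """Yield consecutive batches of at most `size` items, each item cast to str."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        # The final batch may be shorter than `size`
        yield [str(item) for item in items[start:start + size]]


batches = list(iter_batches(["a", "b", 3, "d", "e"], size=2))
print(batches)  # [['a', 'b'], ['3', 'd'], ['e']]
```

Because the function takes its inputs as arguments, it can be tested on a small list before being handed to `train_from_iterator`.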

## Section 3: Train BERT Tokenizer for Yambeta Language
This section trains a BERT-style WordPiece tokenizer on the Yambeta corpus. The tokenizer is configured with normalization (Unicode NFD, without lowercasing or accent stripping), BERT pre-tokenization, and template post-processing that adds [CLS] and [SEP]. Bracketed special tokens for Cameroonian-language consonants, vowels, and tone-marked vowels, plus a set of punctuation symbols, are added to the vocabulary alongside the standard BERT special tokens. The tokenizer is trained from scratch on the Yambeta corpus, wrapped in BertTokenizerFast, and saved for downstream tasks such as language modeling and named entity recognition.
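
To illustrate what NFD normalization does to tone-marked letters, here is a small standard-library sketch. It is independent of the tokenizer and uses Python's `unicodedata` module only; the variable names are illustrative.

```python
import unicodedata

# "á" exists as a single precomposed code point (U+00E1)
precomposed = "\u00e1"
decomposed = unicodedata.normalize("NFD", precomposed)

print(len(precomposed), len(decomposed))  # 1 2
print([unicodedata.name(ch) for ch in decomposed])
# ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']

# "ɛ́" has no precomposed form: it is always ɛ (U+025B) plus U+0301
open_e_acute = "\u025b\u0301"
nfc_length = len(unicodedata.normalize("NFC", open_e_acute))
print(nfc_length)  # 2
```

After NFD, every tone mark is a separate combining character, so the same word typed with precomposed or decomposed letters reaches the tokenizer as the same code point sequence.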

In [None]:
# Train a BERT-style WordPiece tokenizer for the Yambeta language
def train_bert_tokenizer():
    tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

    # 1. Normalization
    tokenizer.normalizer = normalizers.Sequence([
        normalizers.NFD(),
        # Optionally enable lowercasing and stripping accents if needed
        # normalizers.Lowercase(),
        # normalizers.StripAccents()
    ])

    # 2. Pre-Tokenization
    tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

    # 3. Model Training
    special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
    cameroonian_consonants = ['p', 't', 'k', 'kp', 'b', 'd', 'g', 'gb', 'ɓ', 'ɗ', 'ƴ', 'pf', 'tf', 'ts', 'c', 'kf', 'bv', 'dv', 'dz', 'j', 'gv', 'f', 's', 'sh', 'x', 'xf', 'h', 'v', 'z', 'zh', 'gh', 'hv', 'm', 'n', 'ny', 'ŋ', 'ŋm', 'l', 'sl', 'zl', 'ʙ', 'vb', 'r', 'ẅ', 'y', 'w']
    cameroonian_vowels = ['i', 'ɨ', 'ʉ', 'u', 'e', 'ø', 'ɤ', 'o', 'ɛ', 'œ', 'ə', 'ɔ', 'æ', 'a', 'ɑ', 'α']
    cameroonian_tones = ['áà', 'àá', 'áa', 'aá', 'áá', 'əə́', 'ɛ́ɛ', 'ɛ́ɛ́', 'ə́ə́', 'ú', 'ó', 'ɔ́', 'ɔ́ɔ́', 'á', 'ə́', 'ɔɔ́', 'óó', 'ɛ́ɛ́', 'í', 'Ɛ́']

    # Merging special characters
    other_special_characters = ["...", "-", "—", "–", "_", "°", "«", "»", "(", ")", "[", "]", "{", "}", "<", ">", "&", "*", "#", "$", "£", "%", "+", "=", "<", ">", "|", "/", "\\", "@", "www"]
    special_tokens = special_tokens + [f"[{char}]" for char in cameroonian_consonants + cameroonian_vowels + cameroonian_tones + other_special_characters]

    # Train the tokenizer
    trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
    tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)

    # 4. Post-Processing
    cls_token_id = tokenizer.token_to_id("[CLS]")
    sep_token_id = tokenizer.token_to_id("[SEP]")

    tokenizer.post_processor = processors.TemplateProcessing(
        single=f"[CLS]:0 $A:0 [SEP]:0",
        pair=f"[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
        special_tokens=[
            ("[CLS]", cls_token_id),
            ("[SEP]", sep_token_id),
        ],
    )

    # Test encoding
    encoding = tokenizer.encode("Moóŋí waam nyɔ́ onómɛɛd nyɔ́ osaá a kɔɔ́dɔ́ŋɔ́n Pol. Kogóón. Pɔɔd pálɛ na ɛyóŋánán agobɛ́.")
    tokenizer.decoder = decoders.WordPiece(prefix="##")

    # Wrapping the tokenizer inside Transformers for easy use
    bert_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)
    return bert_tokenizer

yat_bert_tokenizer = train_bert_tokenizer()
save_to_hf_hub(yat_bert_tokenizer)




Mounted at /content/drive/
README.md exists at yat-bert-tokenizer/README.md
Error creating repository: name 'create_repo' is not defined


- empty or missing yaml metadata in repo card


Tokenizer and model card uploaded successfully!
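
The "empty or missing yaml metadata in repo card" warning above appears because the README.md starts directly with a heading, whereas Hub model cards are expected to begin with a YAML block delimited by `---` lines. A minimal sketch of prepending one follows. The helper name `add_yaml_front_matter` and the metadata values are assumptions: `yat` is taken from the column label and assumed to be the language code, and the license follows the model card text.

```python
def add_yaml_front_matter(card_body: str, metadata: dict[str, object]) -> str:
    """Prepend a simple YAML front-matter block to a model card body."""
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, (list, tuple)):
            # Lists are written as YAML block sequences
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + card_body


# Assumed metadata values, to be adjusted to the actual project
metadata = {
    "language": ["yat"],
    "license": "apache-2.0",
    "tags": ["tokenizer", "wordpiece"],
}
card = add_yaml_front_matter("# Yambeta Tokenizer for NLP tasks\n", metadata)
print(card)
```

In the notebook, the same wrapper could be applied to the string returned by `generate_model_card()` before it is written to README.md.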


## Section 4: Tokenization of Sample Sentences
This section demonstrates the tokenizer's capability by applying it to a set of sample Yambeta sentences. The tokenizer converts the input sentences into tokens suitable for further NLP tasks such as machine translation and named entity recognition. The output provides insights into the tokenizer’s handling of Yambeta diacritics and linguistic structures.

In [None]:
sample_sentences = [
    "Táá wọ́nɔ́ ná yoog ɛ pɔɔd yɛ́ Yə́sus Kilíʼtus, kɛnannán kɛ́ Tə́fid nyɔ́ ayɛ́ɛ nyɔ́lɛ́nyɔ́amɔɛ́d tɛn kɛnannán kɛ́ Ábɛlaam əyə́biə́níí a yɛ́lɛ́ aa yɛ́ɛnɛ pálɛɛ́ ɔsɔ́g pɔ́nɔ́:",
    "Ábɛlaam yiíbíən Ɛ́sag, Ɛ́sag əə́bíən Yáʼkɔb. Yáʼkɔb əə́bíən Yúda na pɔɔ́n pə́mmú pɛ́ndɛ́ŋ, pomóŋŋí pá Yúda.",
    "Yúda əə́bíən na oʼkán Tamáal lɛ́ na Fálɛs, na Sɛ́la. Fálɛs əə́bíən Ɛ́sɛlɔm, Ɛ́sɛlɔm əə́bíən Álam,",
    "Álam əə́bíən Amɛnadáab, Amɛnadáab əə́bíən Násɔŋ, Násɔŋ əə́bíən Sálmɔn,",
    "Sálmɔn əə́bíən Póos. (Ŋŋí o Póos ayɛ́ɛ niiŋ lɛ́ Ɛlaáab.) Póos aáság kubíən Obɛ́ɛd. (Ŋŋí wo Obɛ́ɛd ayɛ́ɛ niiŋ lɛ́ Ulúud.) Obɛ́ɛd əə́bíən Yəsə́ə,",
    "Yəsə́ə əə́bíən Tə́fid nyɔ́ yɛɛ́bág nkúm yɛ Ɛ́sɛlayɛl. Tə́fid əə́bíən Salomɔ́ɔŋ. (Əyímubíən na oʼkán ó Úli.)",
    "Salomɔ́ɔŋ əə́bíən Olobóam, Olobóam əə́bíən Ábɛa, Ábɛa əə́bíən Asáaf,",
    "Asáaf əə́bíən Yosafáad, Yosafáad əə́bíən Yoláam, Yoláam əə́bíən Osɛ́as,",
    "Osɛ́as aáság kubíən Yoáʼtam, Yoáʼtam əə́bíən Aʼkáas, Aʼkáas əə́bíən Ɛsɛ́ʼkɛas,",
    "Ɛsɛ́ʼkɛas əə́bíən Manasə́ə, Manasə́ə əə́bíən Amɔ́ɔŋ, Amɔ́ɔŋ əə́bíən Yosɛ́as.",
    "Əəbíən mɔɔ́n ɔnɔ́mɛɛd, ólog mɔɔ́n nyóon lɛ́ Yə́sus. Nyɔ́lɛ́ aa alɛ́ɛ́ kɔɔyɛɛ́ pɔɔd a mabɛ́ mɔ́ɔ́bɔn.",
    "Yə́sus əyə́biə́níí a Pɛ́ʼtɛlɛɛm, pálɛ́ɛg yimmú yɛ́ a nigúu nɛ́ Siudə́ə. A kɛnɛŋ kɛ́go kɛ́ɛg, Ɛlóod aa ayɛ́ɛ nkúm. Náan aa pɔɔd pə́mmú pə́yíím pádɛ́ɛmɛn kɔ́gɔ́ɔg a noá nó ándɛ koany kóagaáyɛnɛ, pááság kiim alon a Yolósalɛm. Páyɛ́ɛ pɔɔd pá páyɛ́ɛ agobógɛla na muə́dədəʼ."
]

# Test tokenizer on sample sentence
yat_bert_tokenizer.tokenize(sample_sentences[11])


['Yə́sus',
 'əyə́biə́níí',
 'a',
 'Pɛ́ʼtɛlɛɛm',
 ',',
 'pálɛ́ɛg',
 'yimmú',
 'yɛ́',
 'a',
 'nigúu',
 'nɛ́',
 'Siudə́ə',
 '.',
 'A',
 'kɛnɛŋ',
 'kɛ́go',
 'kɛ́ɛg',
 ',',
 'Ɛlóod',
 'aa',
 'ayɛ́ɛ',
 'nkúm',
 '.',
 'Náan',
 'aa',
 'pɔɔd',
 'pə́mmú',
 'pə́yíím',
 'pádɛ́ɛmɛn',
 'kɔ́gɔ́ɔg',
 'a',
 'noá',
 'nó',
 'ándɛ',
 'koany',
 'kóagaáyɛnɛ',
 ',',
 'pááság',
 'kiim',
 'alon',
 'a',
 'Yolósalɛm',
 '.',
 'Páyɛ́ɛ',
 'pɔɔd',
 'pá',
 'páyɛ́ɛ',
 'agobógɛla',
 'na',
 'muə́dədəʼ',
 '.']

## Section 5: Evaluating the Tokenizer
This section provides the evaluation strategies to assess the performance of the Yambeta tokenizer. We focus on important metrics such as vocabulary size, tokenization efficiency, handling of special characters, out-of-vocabulary (OOV) rate, and decoding accuracy. These metrics help ensure that the tokenizer is well-suited for Yambeta text and maintains linguistic integrity.

### 5.1: Vocabulary Size
Measure the size of the tokenizer's vocabulary after training to ensure that it efficiently represents the Yambeta corpus.

In [None]:
# Get the size of the vocabulary
vocab_size = len(yat_bert_tokenizer.get_vocab())
print(f"Vocabulary Size: {vocab_size}")


Vocabulary Size: 25000


### 5.2: Tokenization Efficiency
Evaluate how efficiently the tokenizer represents Yambeta sentences by measuring the average number of tokens per sentence. A well-optimized tokenizer should reduce the number of tokens while maintaining sentence integrity.

In [None]:
# Measure tokenization efficiency by calculating average tokens per sentence
def calculate_tokenization_efficiency(tokenizer, sentences):
    total_tokens = 0
    total_sentences = len(sentences)

    for sentence in sentences:
        encoding = tokenizer(sentence)
        total_tokens += len(encoding['input_ids'])  # Count the number of tokens for each sentence

    avg_tokens_per_sentence = total_tokens / total_sentences
    print(f"Average tokens per sentence: {avg_tokens_per_sentence}")

# Test tokenization efficiency on sample sentences
calculate_tokenization_efficiency(yat_bert_tokenizer, sample_sentences)


Average tokens per sentence: 23.25
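
The reported average counts the [CLS] and [SEP] tokens that the post-processor adds to every sentence. The short calculation below, a sketch using only the numbers shown in this notebook (12 sample sentences, 2 special tokens per sentence), separates them out.

```python
avg_with_specials = 23.25   # average reported above
num_sentences = 12          # len(sample_sentences)
specials_per_sentence = 2   # [CLS] and [SEP] from the single-sentence template

total_tokens = avg_with_specials * num_sentences
content_tokens = total_tokens - specials_per_sentence * num_sentences
avg_content_tokens = content_tokens / num_sentences

print(total_tokens)        # 279.0
print(avg_content_tokens)  # 21.25
```

So the sample sentences contain 21.25 content tokens on average, and 279 tokens in total when special tokens are included.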


### 5.3: Handling of Special Characters
Assess how well the tokenizer handles special characters, diacritics, and tone markers in Yambeta by tokenizing sentences and reviewing the tokenization output.

In [None]:
# Test tokenization of special characters and diacritics
special_char_sentence = "Yə́sus Kilíʼtus kɛnannán kɛ́ Tə́fid nyɔ́ ayɛ́ɛ nyɔ́lɛ́nyɔ́amɔɛ́d."
tokens = yat_bert_tokenizer.tokenize(special_char_sentence)

print(f"Original Sentence: {special_char_sentence}")
print(f"Tokens: {tokens}")


Original Sentence: Yə́sus Kilíʼtus kɛnannán kɛ́ Tə́fid nyɔ́ ayɛ́ɛ nyɔ́lɛ́nyɔ́amɔɛ́d.
Tokens: ['Yə́sus', 'Kilíʼtus', 'kɛnannán', 'kɛ́', 'Tə́fid', 'nyɔ́', 'ayɛ́ɛ', 'nyɔ́lɛ́nyɔ́amɔɛ́d', '.']


### 5.4: Out-of-Vocabulary (OOV) Rate
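Estimate the out-of-vocabulary (OOV) rate as the share of [UNK] tokens among all tokens the tokenizer produces. The cell below computes it on the sample sentences, so the count includes the [CLS] and [SEP] tokens added to each sentence; running it on the full corpus or on held-out text gives a better picture of the tokenizer's coverage of Yambeta vocabulary.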

In [None]:
# Calculate the Out-of-Vocabulary (OOV) rate
def calculate_oov_rate(tokenizer, sentences):
    oov_count = 0
    total_tokens = 0

    for sentence in sentences:
        encoding = tokenizer(sentence)
        total_tokens += len(encoding['input_ids'])
        # Count OOV tokens (usually represented as [UNK] or a specific token ID)
        oov_count += encoding['input_ids'].count(tokenizer.unk_token_id)

    oov_rate = (oov_count / total_tokens) * 100
    print(f"OOV Rate: {oov_rate:.2f}%")

# Evaluate the OOV rate
calculate_oov_rate(yat_bert_tokenizer, sample_sentences)


OOV Rate: 0.36%
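
Since [CLS] and [SEP] are counted in the denominator above, the rate is slightly diluted. A sketch of a variant that excludes special tokens is shown below. It works on plain lists of token IDs; the function name `unk_rate` and the toy IDs are hypothetical and do not come from the trained tokenizer.

```python
def unk_rate(encoded: list[list[int]], unk_id: int, special_ids: set[int]) -> float:
    """Percentage of [UNK] among content tokens, ignoring IDs in `special_ids`."""
    content = [tid for ids in encoded for tid in ids if tid not in special_ids]
    if not content:
        return 0.0
    unk_count = sum(1 for tid in content if tid == unk_id)
    return 100.0 * unk_count / len(content)


# Toy IDs: 0 = [UNK], 2 = [CLS], 3 = [SEP]; everything else is a regular token
toy_encoded = [[2, 10, 0, 11, 3], [2, 12, 13, 3]]
rate = unk_rate(toy_encoded, unk_id=0, special_ids={2, 3})
print(f"OOV Rate: {rate:.2f}%")  # OOV Rate: 20.00%
```

With the real tokenizer, the ID lists would come from `tokenizer(sentence)['input_ids']`, and the special IDs from the tokenizer's [CLS], [SEP], and [PAD] tokens.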


### 5.5: Decoding Accuracy
Test how well the tokenizer decodes sentences back to their original form. This metric helps determine how accurately the tokenizer preserves the structure and meaning of Yambeta sentences during tokenization and detokenization.

In [None]:
# Test decoding accuracy by encoding and then decoding a sentence
sentence = "Táá wọ́nɔ́ ná yoog ɛ pɔɔd yɛ́ Yə́sus Kilíʼtus, kɛnannán kɛ́ Tə́fid nyɔ́."
encoded = yat_bert_tokenizer(sentence)['input_ids']

# Decode the token IDs back to the original sentence
decoded_sentence = yat_bert_tokenizer.decode(encoded)

print(f"Original Sentence: {sentence}")
print(f"Decoded Sentence: {decoded_sentence}")


Original Sentence: Táá wọ́nɔ́ ná yoog ɛ pɔɔd yɛ́ Yə́sus Kilíʼtus, kɛnannán kɛ́ Tə́fid nyɔ́.
Decoded Sentence: [CLS] Táá [UNK] ná yoog ɛ pɔɔd yɛ́ Yə́sus Kilíʼtus, kɛnannán kɛ́ Tə́fid nyɔ́. [SEP]
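
To locate exactly which words fail the round trip, the original and decoded strings can be compared word by word after normalizing both to NFD. The sketch below is an illustration, not part of the notebook: the helper name `word_mismatches` is hypothetical, and the inputs are shortened versions of the sentences shown above.

```python
import unicodedata

SPECIALS = {"[CLS]", "[SEP]", "[PAD]"}


def word_mismatches(original: str, decoded: str) -> list[tuple[int, str, str]]:
    """Return (position, original_word, decoded_word) for every differing word."""
    orig_words = unicodedata.normalize("NFD", original).split()
    # Drop the markers added by the post-processor before comparing
    dec_words = [w for w in unicodedata.normalize("NFD", decoded).split()
                 if w not in SPECIALS]
    if len(orig_words) != len(dec_words):
        raise ValueError("word counts differ; compare the strings manually")
    return [(i, o, d) for i, (o, d) in enumerate(zip(orig_words, dec_words)) if o != d]


diffs = word_mismatches("Táá wọ́nɔ́ ná yoog", "[CLS] Táá [UNK] ná yoog [SEP]")
for position, orig_word, dec_word in diffs:
    print(position, orig_word, "->", dec_word)
```

Normalizing both sides first avoids false mismatches between precomposed and decomposed spellings of the same tone-marked letter.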


Evaluation Metrics:

| **Metric** | **Result** |
|---|---|
| **Vocabulary Size** | 25,000 |
| **Tokenization Efficiency** | 23.25 tokens per sentence on average |
| **Handling of Special Characters** | Original Sentence: Yə́sus Kilíʼtus kɛnannán kɛ́ Tə́fid nyɔ́ ayɛ́ɛ nyɔ́lɛ́nyɔ́amɔɛ́d. <br> Tokens: ['Yə́sus', 'Kilíʼtus', 'kɛnannán', 'kɛ́', 'Tə́fid', 'nyɔ́', 'ayɛ́ɛ', 'nyɔ́lɛ́nyɔ́amɔɛ́d', '.'] |
| **Out-of-Vocabulary (OOV) Rate** | 0.36% |
| **Decoding Accuracy** | Original Sentence: Táá wọ́nɔ́ ná yoog ɛ pɔɔd yɛ́ Yə́sus Kilíʼtus, kɛnannán kɛ́ Tə́fid nyɔ́. <br> Decoded Sentence: [CLS] Táá [UNK] ná yoog ɛ pɔɔd yɛ́ Yə́sus Kilíʼtus, kɛnannán kɛ́ Tə́fid nyɔ́. [SEP] |


## Interpretation

**Vocabulary Size:**

The tokenizer has a vocabulary of 25,000 entries, which equals the vocab_size limit passed to the trainer, so training filled the whole budget. The vocabulary contains whole words, subword pieces, and the special tokens. This measurement alone does not establish that 25,000 is the right size: for a corpus of 7,897 sentences it is large enough that many words are stored as whole tokens, which is consistent with the sample output above, where no word is split into ## pieces. A smaller vocabulary would force more subword splitting and might generalize better to unseen text; comparing sizes on held-out data would settle the question.

**Tokenization Efficiency:**

The average is 23.25 tokens per sentence over the 12 sample sentences. The count includes the [CLS] and [SEP] tokens added to every sentence, so the average without special tokens is 21.25. Because most words in the sample output map to a single token, the token count stays close to the number of words and punctuation marks. Short sequences matter for tasks such as translation and named entity recognition, where sequence length drives computation time. The figure comes from a small sample and should be confirmed on the full corpus.

**Handling of Special Characters:**

The tokenizer handled the special characters and diacritics in the test sentence. In Yə́sus Kilíʼtus kɛnannán kɛ́ Tə́fid nyɔ́ ayɛ́ɛ nyɔ́lɛ́nyɔ́amɔɛ́d, words such as Yə́sus and Kilíʼtus come out as single tokens with their tone marks and the ʼ character intact. Because the normalizer applies NFD without stripping accents, tone marks are kept as combining characters rather than removed. Preserving them is important for keeping linguistic meaning in downstream tasks such as text classification or machine translation.

**Out-of-Vocabulary (OOV) Rate:**

The OOV rate is 0.36%, which corresponds to a single [UNK] among the 279 tokens produced for the 12 sample sentences (12 × 23.25, special tokens included). It is therefore a statement about this sample, not about the whole Yambeta corpus, and the sample sentences appear to come from the same Bible text used for training. WordPiece can split rare words into smaller known pieces, which helps keep the rate low, but a word that cannot be fully covered by vocabulary entries still becomes [UNK]. Measuring the rate on held-out text would show how well the tokenizer covers unseen Yambeta.

**Decoding Accuracy:**

The round trip reproduces the sentence except for one word: wọ́nɔ́ is decoded as [UNK]. The tokenizer could not represent that word, most likely because it contains a character or character sequence absent from the learned vocabulary, in which case WordPiece replaces the whole word with [UNK]. The remaining words, punctuation, and diacritics are restored. The [CLS] and [SEP] tokens in the output are added by the post-processor, as expected from a BERT-style tokenizer.

**Conclusion:**

Overall, on the sample sentences the tokenizer keeps Yambeta's diacritics and tone marks intact, produces short token sequences, and yields a low [UNK] rate. The one [UNK] in the decoding test shows that coverage is not complete; extending the training data or checking which characters are missing from the vocabulary are the natural next steps. Because all measurements here use a small sample, evaluation on the full corpus and on held-out text is needed before drawing firm conclusions about performance.