<a href="https://colab.research.google.com/github/DibiaCorp85/fine-tuning_nllb-200_600M/blob/main/_Fine_Tuning_En_Yo_LaTn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Install Dependencies**

In [None]:
!pip install --quiet chainlit pyngrok datasets transformers evaluate accelerate peft sacrebleu rouge_score bitsandbytes


In [None]:
!pip install --quiet --upgrade fsspec datasets

## **Import Core Libraries**

In [None]:
import os
import random
import torch
import requests
from torch.optim import AdamW
from datasets import (load_dataset,
                      concatenate_datasets,
                      DatasetDict,
                      Dataset,
                      get_dataset_config_names,
                      Features,
                      ClassLabel,
                      Value,
                      Translation
                      )

from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    BitsAndBytesConfig,
)
from torch.utils.tensorboard import SummaryWriter
import evaluate
import pandas as pd
from sklearn.model_selection import train_test_split

from peft import (
    TaskType,
    LoraConfig,
    get_peft_model,
    PeftModel,
    PeftConfig,
)

from huggingface_hub import login
from google.colab import drive
import getpass
from pyngrok import conf, ngrok
import subprocess
import time

In [None]:
# Mount Google Drive to save model and logs
drive.mount('/content/drive', force_remount = True)
save_dir = "/content/drive/MyDrive/Colab Notebooks/NLLB_200/En-Yo_LaTn"
os.makedirs(save_dir, exist_ok = True)

Mounted at /content/drive


## **Load Datasets**

English-to-target-language parallel datasets are loaded from Hugging Face.

For this process, the following target language is considered:
* Yoruba

The following datasets are used:

* Opus100, which contains the language listed above paired with English.
* UdS-LSV/menyo20k_mt, which provides additional English-Yoruba pairs.

### **Opus100 Dataset**

In [None]:
login("hf access token key")

# Desired target languages, keyed by ISO code (each one is paired with English)
target_language = {"yo" : "Yoruba"}

source_language = "en" # Fixed source language

desired_pairs = [f"{source_language}-{tgt}" for tgt in target_language]

# Fetch all configurations from Opus100
available_configs = get_dataset_config_names("opus100")

# Filter those that exist in Opus100
present_pairs = [pair for pair in desired_pairs if pair in available_configs]
missing_pairs = [pair for pair in desired_pairs if pair not in available_configs]

# Print results
print(" The En-Yo language pair is present in Opus100 dataset:")
for pair in present_pairs:
    print(f" - {pair} ({target_language[pair.split('-')[1]]})")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/65.4k [00:00<?, ?B/s]

 The En-Yo language pair is present in Opus100 dataset:
 - en-yo (Yoruba)


#### **Load dataset**

In [None]:
# English-Yoruba
selected_language = ["yo"]

if "yo" in selected_language:
    try:
        opus_en_yo = load_dataset("opus100", "en-yo")
        print("English-Yoruba language pair downloaded!")
    except Exception as e:
        print("Failed to download English-Yoruba:", e)

train-00000-of-00001.parquet:   0%|          | 0.00/391k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10375 [00:00<?, ? examples/s]

English-Yoruba language pair downloaded!


### **UdS-LSV/menyo20k_mt for English-Yoruba Pairs**

In [None]:
# English-Yoruba
selected_languages = ["yo"]

if "yo" in selected_languages:
    try:
        menyo_en_yo = load_dataset("UdS-LSV/menyo20k_mt", trust_remote_code = True)
        print("English-Yoruba language pair II downloaded!")
    except Exception as e:
        print("Failed to download English-Yoruba:", e)


README.md:   0%|          | 0.00/6.33k [00:00<?, ?B/s]

menyo20k_mt.py:   0%|          | 0.00/4.64k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.49M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/850k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.87M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10070 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3397 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/6633 [00:00<?, ? examples/s]

English-Yoruba language pair II downloaded!


## **Qualitative and Quantitative Examination of Datasets**

The following checklist is used to examine each dataset:

1. Splits
2. Features/columns
3. Format
4. Missing rows
5. Internal usage of language pairs

### **Splits Check**

In [None]:
# Check splits
def print_split_info(name, dataset):
    print(f"\n📊 Dataset: {name}")

    if isinstance(dataset, DatasetDict):
        for split_name, split in dataset.items():
            print(f"  ➤ Split: {split_name} | Rows: {split.num_rows}")
    elif isinstance(dataset, Dataset):
        print(f"  ➤ Single split | Rows: {dataset.num_rows}")
    else:
        print("❌ Unrecognized dataset type.")

In [None]:
print_split_info("Opus100: English-Yoruba Language Pair", opus_en_yo)
print_split_info("Menyo: English-Yoruba Language Pair II", menyo_en_yo)


📊 Dataset: Opus100: English-Yoruba Language Pair
  ➤ Split: train | Rows: 10375

📊 Dataset: Menyo: English-Yoruba Language Pair II
  ➤ Split: train | Rows: 10070
  ➤ Split: validation | Rows: 3397
  ➤ Split: test | Rows: 6633


### **Features/Columns/Schema Check**

In [None]:
# Check schema/column names using just the train split
def print_dataset_features(datasets_with_names):
    """
    Prints the .features of multiple Hugging Face datasets with names.

    Args:
        datasets_with_names (list of tuples): List of (dataset, name) pairs.
    """
    for dataset, name in datasets_with_names:
        print(f"\n📘 Features for: {name}")
        print(dataset.features)

In [None]:
print_dataset_features([
    (opus_en_yo['train'], "Opus EN-YO"),
    (menyo_en_yo['train'], "Menyo EN-YO II")
    ])


📘 Features for: Opus EN-YO
{'translation': Translation(languages=['en', 'yo'], id=None)}

📘 Features for: Menyo EN-YO II
{'translation': Translation(languages=('en', 'yo'), id=None)}


### **NLLB-Format-Compatibility Check**

In [None]:
# The NLLB model expects the following format:

"""
{
  "translation": {
    "source language": "source sentence",
    "target language": "target sentence"
  }
}
"""

'\n{\n  "translation": {\n    "source language": "source sentence",\n    "target language": "target sentence"\n  }\n}\n'

In [None]:
# Confirm dataset format

def is_nllb_format(dataset, lang_pair=None, name="Unnamed"):
    """
    Checks if a Hugging Face dataset follows NLLB-style format.

    Args:
        dataset: Hugging Face dataset to inspect.
        lang_pair: Tuple (source_lang, target_lang), e.g., ("en", "yo")
        name: Name for logging
    """
    print(f"\n🔍 Checking NLLB format for: {name}")
    passed = True

    for i, example in enumerate(dataset.select(range(min(10, len(dataset))))):  # check first 10 examples
        if "translation" not in example:
            print(f"❌ Missing 'translation' field at index {i}")
            passed = False
            break
        if not isinstance(example["translation"], dict):
            print(f"❌ 'translation' is not a dict at index {i}")
            passed = False
            break
        if lang_pair:
            src, tgt = lang_pair
            if src not in example["translation"] or tgt not in example["translation"]:
                print(f"❌ Missing expected lang codes {src}/{tgt} at index {i}")
                passed = False
                break
            if not example["translation"][src] or not example["translation"][tgt]:
                print(f"❌ Empty values for {src}/{tgt} at index {i}")
                passed = False
                break
        elif len(example["translation"]) != 2:
            print(f"⚠️ Unexpected number of languages in 'translation' at index {i}: {example['translation'].keys()}")
            passed = False
            break

    if passed:
        print("✅ Format is NLLB-compatible.")
    else:
        print("❗ Not NLLB-compatible.")

In [None]:
is_nllb_format(opus_en_yo['train'], lang_pair=("en", "yo"), name="Opus EN-YO")
is_nllb_format(menyo_en_yo['train'], lang_pair=("en", "yo"), name="MENYO EN-YO")


🔍 Checking NLLB format for: Opus EN-YO
✅ Format is NLLB-compatible.

🔍 Checking NLLB format for: MENYO EN-YO
✅ Format is NLLB-compatible.


### **Check Missing Rows**

In [None]:
def check_missing_rows_all_splits(dataset_dict, name, src_lang=None, tgt_lang=None):
    """
    Checks for missing or invalid rows across all splits in a DatasetDict.

    Args:
        dataset_dict (DatasetDict): The dataset with multiple splits.
        name (str): Dataset name for reporting.
        src_lang (str): Source language key (e.g., 'en').
        tgt_lang (str): Target language key (e.g., 'yo').
    """
    for split_name, split_dataset in dataset_dict.items():
        total = len(split_dataset)
        missing = 0

        for row in split_dataset:
            try:
                if "translation" in row:
                    trans = row["translation"]
                    src = trans.get(src_lang, "").strip() if src_lang else ""
                    tgt = trans.get(tgt_lang, "").strip() if tgt_lang else ""
                else:
                    src = row.get(src_lang, "").strip()
                    tgt = row.get(tgt_lang, "").strip()

                if not src or not tgt or len(src) <= 1 or len(tgt) <= 1:
                    missing += 1
            except Exception:
                missing += 1

        print(f"🔍 {name} ({split_name}): {missing} missing / {total} total rows")


In [None]:
# Run checks
check_missing_rows_all_splits(opus_en_yo, "Opus EN-YO", src_lang="en", tgt_lang="yo")
check_missing_rows_all_splits(menyo_en_yo, "Menyo EN-YO II", src_lang="en", tgt_lang="yo")

🔍 Opus EN-YO (train): 5 missing / 10375 total rows
🔍 Menyo EN-YO II (train): 0 missing / 10070 total rows
🔍 Menyo EN-YO II (validation): 0 missing / 3397 total rows
🔍 Menyo EN-YO II (test): 0 missing / 6633 total rows


### **Check Internal Usage of Language Pairs**

In [None]:
def preview_dataset_rows(dataset, name, src_lang=None, tgt_lang=None, n=3):
    """
    Prints the first `n` rows of a dataset, showing raw string content using repr().

    Args:
        dataset: Hugging Face Dataset object
        name (str): Dataset name for display
        src_lang (str): Source language key
        tgt_lang (str): Target language key
        n (int): Number of rows to preview
    """
    print(f"\n📘 Preview for: {name}")
    for i in range(min(n, len(dataset))):
        row = dataset[i]
        try:
            if "translation" in row:
                src = row["translation"].get(src_lang, "")
                tgt = row["translation"].get(tgt_lang, "")
            else:
                src = row.get(src_lang, "")
                tgt = row.get(tgt_lang, "")
            print(f"{i+1}. {src_lang}: {repr(src)}")
            print(f"   {tgt_lang}: {repr(tgt)}")
        except Exception as e:
            print(f"{i+1}. ❌ Error reading row {i}: {e}")


In [None]:
# Show first 3 rows of each dataset
preview_dataset_rows(opus_en_yo['train'], "Opus EN-YO", src_lang="en", tgt_lang="yo")
preview_dataset_rows(menyo_en_yo['train'], "Menyo EN-YO II", src_lang="en", tgt_lang="yo")


📘 Preview for: Opus EN-YO
1. en: 'Mozilla (HTML)'
   yo: 'Mozilla (HTML)'
2. en: 'Workspace Switcher Preferences'
   yo: 'Àwọn ìkúndùǹ Ìjánu-ìsún Ààyè-iṣẹ́'
3. en: 'Set'
   yo: 'Dí'

📘 Preview for: Menyo EN-YO II
1. en: 'Unit 1: What is Creative Commons?'
   yo: '\ufeffÌdá 1: Kín ni Creative Commons?'
2. en: 'This work is licensed under a Creative Commons Attribution 4.0 International License.'
   yo: 'Iṣẹ́ yìí wà lábẹ́ àṣẹ Creative Commons Attribution 4.0 International License.'
3. en: 'Creative Commons is a set of legal tools, a nonprofit organization, as well as a global network and a movement — all inspired by people’s willingness to share their creativity and knowledge, and enabled by a set of open copyright licenses.'
   yo: 'Creative Commons jẹ́ àwọn ọ̀kan-ò-jọ̀kan ohun-èlò ajẹmófin, iléeṣẹ́ àìlérèlórí, àti àjọ àwọn ènìyàn eléròǹgbà kan náà kárí àgbáńlá ayé— tí í ṣe ìmísí àwọn ènìyànkan tí ó ní ìfẹ́ tinútinú láti pín àwọn iṣẹ́-àtinúdá àti ìmọ̀ wọn èyí tí ó ní àtìlẹ

## **Data Cleaning**

Before cleaning, here are the main observations from the examination stage:

1. The Opus100 English-Yoruba language pair (`opus_en_yo`) has just one split with fewer than 11k rows, which is too small for the task at hand, so the Menyo dataset is added.
2. Missing rows
  * Opus EN-YO (train) has 5 missing rows.  
  Missing rows are eliminated in this stage.
3. Menyo EN-YO II contains a BOM character (`\ufeff`, seen in `'\ufeffÌdá 1: ...'`), which must be removed.

  The Unicode BOM (Byte Order Mark) is invisible when printed, but it pollutes the text internally. Models treat it as a real character, which can cause:
  * Garbage tokens during tokenization
  * Poor fine-tuning
  * Lower translation quality

  It must be removed to avoid these problems.
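To make the BOM issue concrete, here is a minimal standard-library sketch. It uses the first Yoruba string from the Menyo preview as sample input and shows that the BOM is a real code point even though it does not render. The variable names are illustrative only.

```python
# Sample string copied from the Menyo preview (first training row)
raw = "\ufeffÌdá 1: Kín ni Creative Commons?"

# Removing the BOM is a plain string replacement
clean = raw.replace("\ufeff", "")

print(repr(raw[0]))          # the hidden first character
print(len(raw), len(clean))  # lengths differ by exactly one
print(raw == clean)          # the two strings are not equal
```

Using `repr()` rather than `print()` when previewing rows is what exposes the character in the first place.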


### **Clear Missing Rows**

In [None]:
from datasets import DatasetDict

def clean_missing_rows(dataset_dict, src_lang=None, tgt_lang=None):
    """
    Removes missing/invalid rows across all splits in a DatasetDict.

    Args:
        dataset_dict (DatasetDict): The dataset with splits (e.g. train, test, validation).
        src_lang (str): Source language key (e.g., 'en').
        tgt_lang (str): Target language key (e.g., 'yo').

    Returns:
        DatasetDict: Cleaned dataset with bad rows removed.
    """
    cleaned_splits = {}

    for split_name, split_dataset in dataset_dict.items():
        def is_valid(row):
            try:
                if "translation" in row:
                    src = row["translation"].get(src_lang, "").strip()
                    tgt = row["translation"].get(tgt_lang, "").strip()
                else:
                    src = row.get(src_lang, "").strip()
                    tgt = row.get(tgt_lang, "").strip()
                return bool(src and tgt and len(src) > 1 and len(tgt) > 1)
            except Exception:
                return False

        print(f"🧹 Cleaning split: {split_name}...")
        cleaned_split = split_dataset.filter(is_valid)
        cleaned_splits[split_name] = cleaned_split
        print(f"✅ {len(cleaned_split)} rows retained from {len(split_dataset)}")

    return DatasetDict(cleaned_splits)


In [None]:
opus_en_yo_clean = clean_missing_rows(opus_en_yo, src_lang="en", tgt_lang="yo")
menyo_clean = clean_missing_rows(menyo_en_yo, src_lang="en", tgt_lang="yo")

🧹 Cleaning split: train...


Filter:   0%|          | 0/10375 [00:00<?, ? examples/s]

✅ 10370 rows retained from 10375
🧹 Cleaning split: train...


Filter:   0%|          | 0/10070 [00:00<?, ? examples/s]

✅ 10070 rows retained from 10070
🧹 Cleaning split: validation...


Filter:   0%|          | 0/3397 [00:00<?, ? examples/s]

✅ 3397 rows retained from 3397
🧹 Cleaning split: test...


Filter:   0%|          | 0/6633 [00:00<?, ? examples/s]

✅ 6633 rows retained from 6633


### **Remove BOM Text from Menyo Dataset**

In [None]:
# Define cleaning function
def remove_bom(example):
    translation = example["translation"]
    cleaned_translation = {
        "yo": translation["yo"].replace("\ufeff", "") if "yo" in translation else None,
        "en": translation["en"].replace("\ufeff", "") if "en" in translation else None,
    }
    return {"translation": cleaned_translation}

# Apply cleaning to all splits of menyo_clean, the filtered copy that is concatenated later
for split in ["train", "validation", "test"]:
    if split in menyo_clean:
        menyo_clean[split] = menyo_clean[split].map(remove_bom)

Map:   0%|          | 0/10070 [00:00<?, ? examples/s]

Map:   0%|          | 0/3397 [00:00<?, ? examples/s]

Map:   0%|          | 0/6633 [00:00<?, ? examples/s]

In [None]:
# Check
print(menyo_clean['train'][0])
print(menyo_clean['validation'][0])
print(menyo_clean['test'][0])
print("✅ BOM characters removed from dataset.")

{'translation': {'en': 'Unit 1: What is Creative Commons?', 'yo': 'Ìdá 1: Kín ni Creative Commons?'}}
{'translation': {'en': 'We prepare the saddle, and the goat presents itself; is it a burden for the lineage of goats?', 'yo': 'A di gàárì sílẹ̀ ewúrẹ́ ń yọjú; ẹrù ìran rẹ̀ ni?'}}
{'translation': {'en': 'Pending the time she would finally pack and go, everybody should be content with eating just anything.', 'yo': 'Títí di ìgbà tí ó máa fi kó ẹrù rẹ̀ lọ pátápátá, kí oníkálùkù ní ìtẹ́lọ̀rùn pẹ̀lú ohunkóhun tó bá rí jẹ.'}}
✅ BOM characters removed from dataset.


## **Standardize Feature Schema**

In [None]:
# Define consistent translation

translation_features = Features({
    "translation": Translation(languages=("en", "yo"))
})


In [None]:
# Prepare dataset list to combine
datasets_to_concat = [opus_en_yo_clean, menyo_clean]

# Cast all splits in all datasets to ensure schema alignment
for i in range(len(datasets_to_concat)):
    for split in datasets_to_concat[i]:
        datasets_to_concat[i][split] = datasets_to_concat[i][split].cast(translation_features)

Casting the dataset:   0%|          | 0/10370 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/10070 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/3397 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/6633 [00:00<?, ? examples/s]

## **Concatenate Datasets**

In [None]:
def concatenate_yo_datasets(datasets_dicts):
    splits = ["train", "validation", "test"]
    combined = {}

    for split in splits:
        # Only include datasets that have this split
        split_datasets = [ds[split] for ds in datasets_dicts if split in ds]
        combined[split] = concatenate_datasets(split_datasets)

    return DatasetDict(combined)

# Combine datasets
datasets_to_concat = [opus_en_yo_clean, menyo_clean]
yoruba_dataset = concatenate_yo_datasets(datasets_to_concat)

# Check number of rows
for split in yoruba_dataset:
    print(f"{split}: {len(yoruba_dataset[split])} examples")

train: 20440 examples
validation: 3397 examples
test: 6633 examples
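The split sizes above can be cross-checked without loading any data. The sketch below mirrors the per-split logic of `concatenate_yo_datasets` using plain dictionaries of row counts taken from the cleaning output. `combined_rows` is a hypothetical helper written for this check, not part of the `datasets` library.

```python
# Row counts after cleaning, taken from the outputs above
opus_rows = {"train": 10_370}
menyo_rows = {"train": 10_070, "validation": 3_397, "test": 6_633}

def combined_rows(*datasets, splits=("train", "validation", "test")):
    """Sum row counts per split, skipping datasets that lack that split."""
    return {
        split: sum(ds[split] for ds in datasets if split in ds)
        for split in splits
    }

totals = combined_rows(opus_rows, menyo_rows)
print(totals)
```

Only the train split grows, because Opus100 EN-YO ships a single train split.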


In [None]:
# Check
print(yoruba_dataset["train"].features)

{'translation': Translation(languages=['en', 'yo'], id=None)}


In [None]:
# Inspect examples from each split
for split in ["train", "validation", "test"]:
    print(f"\n🧪 Split: {split}")
    print(split)
    print(yoruba_dataset[split][1])


🧪 Split: train
train
{'translation': {'en': 'Workspace Switcher Preferences', 'yo': 'Àwọn ìkúndùǹ Ìjánu-ìsún Ààyè-iṣẹ́'}}

🧪 Split: validation
validation
{'translation': {'en': 'You have been crowned a king, and yet you make good-luck charms; would you be crowned God?', 'yo': 'A fi ọ́ jọba ò ń ṣàwúre o fẹ́ jẹ Ọlọ́run ni?'}}

🧪 Split: test
test
{'translation': {'en': 'She knew how best she was going to take care of herself and Tinu.', 'yo': 'Ó mọ bí ó ṣe má a tọ́jú ara rẹ̀ àti Tinú.'}}


Everything looks fine

## **Save Dataset to Disk**

In [None]:
save_path = f"{save_dir}/CombinedYorubaDataset"
yoruba_dataset.save_to_disk(save_path)
print(f"✅ Dataset saved to: {save_path}")

Saving the dataset (0/1 shards):   0%|          | 0/20440 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/3397 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/6633 [00:00<?, ? examples/s]

✅ Dataset saved to: /content/drive/MyDrive/Colab Notebooks/NLLB_200/En-Yo_LaTn/CombinedYorubaDataset


## **Tokenization**

### **Load Tokenizer**

In [None]:
model_checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.padding_side = "right"  # Right padding keeps the attention mask aligned with the tokens, which suits an encoder-decoder model like NLLB

tokenizer_config.json:   0%|          | 0.00/564 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/4.85M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.3M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/3.55k [00:00<?, ?B/s]

### **Preprocess**

In [None]:
# Isolate data splits
train_dataset = yoruba_dataset["train"]
val_dataset = yoruba_dataset["validation"]
test_dataset = yoruba_dataset["test"]


# Set the source and target languages for the NLLB tokenizer globally
tokenizer.src_lang = "eng_Latn"
tokenizer.tgt_lang = "yor_Latn"

def preprocess(example):
    source = example.get("translation", {}).get("en", None)
    target = example.get("translation", {}).get("yo", None)

    if not source or not target:
        return {
            "input_ids": [],
            "attention_mask": [],
            "labels": []
            }

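    # Prepend a ">>yor_Latn<<" target-language tag to the source text.
    # Note: the NLLB tokenizer already prepends the `src_lang` code on its own, and the
    # target language is normally selected at generation time via `forced_bos_token_id`.
    input_text = f">>yor_Latn<< {source}"

    # Tokenize the source text (the tokenizer prepends the `src_lang` code)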
    model_inputs = tokenizer(
        input_text,
        max_length=128,
        padding="max_length",
        truncation=True,
    )

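    # Tokenize the target text. Without `text_target=`, the tokenizer runs in source mode,
    # so these labels start with the `src_lang` code (eng_Latn) rather than the `tgt_lang` code.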
    target_inputs = tokenizer(
        target,
        max_length=128,
        padding="max_length",
        truncation=True,
    )

    model_inputs["labels"] = target_inputs["input_ids"]
    return model_inputs
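The `preprocess` function pads the labels to `max_length` with the pad token, so padded positions count toward the training loss. A common practice in Hugging Face seq2seq training is to replace padded label positions with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows that replacement on a plain list. `mask_label_padding` is a hypothetical helper, and the pad id of `1` is taken from the tokenized output in this notebook.

```python
PAD_TOKEN_ID = 1      # pad id seen in the tokenized output (assumed equal to tokenizer.pad_token_id)
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss by default

def mask_label_padding(label_ids, pad_token_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Return a copy of label_ids with padding positions replaced by ignore_index."""
    return [ignore_index if tok == pad_token_id else tok for tok in label_ids]

# Shortened version of the labels printed for the first training example
labels = [256047, 179399, 104, 234972, 248161, 2, 1, 1, 1, 1]
masked = mask_label_padding(labels)
print(masked)
```

If adopted, the helper would be applied to `target_inputs["input_ids"]` before assigning `model_inputs["labels"]`. Doing so would change the loss values reported later in this notebook.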

In [None]:
# Map Preprocessing on all splits

# preprocess() returns empty lists for invalid rows, so drop rows whose labels are empty
train_tokenized = train_dataset.map(
    preprocess,
    remove_columns=["translation"],
    num_proc=4,  # Optional: use multiple processes
    desc="Tokenizing train set"
).filter(lambda example: len(example["labels"]) > 0)

val_tokenized = val_dataset.map(
    preprocess,
    remove_columns=["translation"],
    num_proc=4,
    desc="Tokenizing val set"
).filter(lambda example: len(example["labels"]) > 0)

test_tokenized = test_dataset.map(
    preprocess,
    remove_columns=["translation"],
    num_proc=4,
    desc="Tokenizing test set"
).filter(lambda example: len(example["labels"]) > 0)

Tokenizing train set (num_proc=4):   0%|          | 0/20440 [00:00<?, ? examples/s]

Filter:   0%|          | 0/20440 [00:00<?, ? examples/s]

Tokenizing val set (num_proc=4):   0%|          | 0/3397 [00:00<?, ? examples/s]

Filter:   0%|          | 0/3397 [00:00<?, ? examples/s]

Tokenizing test set (num_proc=4):   0%|          | 0/6633 [00:00<?, ? examples/s]

Filter:   0%|          | 0/6633 [00:00<?, ? examples/s]

In [None]:
# Check a sample from tokenized data to confirm tokenization
print(train_tokenized[0])

{'input_ids': [256047, 20545, 256198, 57642, 179399, 104, 234972, 248161, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': [256047, 179399, 104, 234972, 248161, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [None]:
# The structure above is interpreted as:
#
# 'input_ids'      : token IDs for the source text. The first ID (256047) is the
#                    language code added by the tokenizer for src_lang, 2 marks the
#                    end of the sequence, and 1 is the padding token that fills the
#                    sequence up to max_length=128.
# 'attention_mask' : 1 for real tokens, 0 for padding positions.
# 'labels'         : token IDs for the target (Yoruba) text, padded the same way.
#
# The raw 'translation' column is no longer present because it was dropped with
# remove_columns=["translation"].

In [None]:
print(train_tokenized)

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 20440
})


## **Fine-Tune NLLB**

### **Configure BitsAndBytes**

In [None]:
# BitsAndBytes parameters
################################################################################
use_4bit = True # 4-bit precision on base model loading
bnb_4bit_compute_dtype = torch.float16 # compute datatype for 4-bit base model
bnb_4bit_quant_type = "nf4" # quantization type
use_nested_quant = False # activate nested quantization for 4-bit base models (double quantization)


bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint,
                                              device_map="auto",
                                              low_cpu_mem_usage=True,  # Explicitly set to avoid the warning
                                              quantization_config=bnb_config)

config.json:   0%|          | 0.00/846 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.46G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.46G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

### **Test Model with Zero Shot Inferencing**

In [None]:
%%time

# 🌍 Define source & target languages (FLORES-200 codes: ISO 639-3 language + script)
src_lang = "eng_Latn"
tgt_lang = "yor_Latn"

# ✏️ Example input sentence in English
input_sentence = "The weather today is very pleasant."

# 🔡 Set the source language before tokenizing so the matching language code is prepended
tokenizer.src_lang = src_lang
inputs = tokenizer(
    input_sentence,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# ✨ Force the decoder to start with the target language code
forced_bos_token_id = tokenizer.convert_tokens_to_ids(tgt_lang)

# 🔁 Run inference
with torch.no_grad():
    output_tokens = model.generate(
        **inputs,
        forced_bos_token_id=forced_bos_token_id,
        max_length=128,
        num_beams=4,
        early_stopping=True
    )

# 🗣️ Decode result
translated_text = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)[0]
print(f"🔤 English: {input_sentence}")
print(f"🌍 Yoruba: {translated_text}")

🔤 English: The weather today is very pleasant.
🌍 Yoruba: Ojú ọjọ́ òde òní dára gan-an.
CPU times: user 1.17 s, sys: 631 ms, total: 1.8 s
Wall time: 5.89 s


### **Set Up Model with LoRA (PEFT)**

In [None]:
# Set up LoRA config with target_modules
lora_config = LoraConfig(
    r = 8,  # Rank of the decomposition
    lora_alpha = 32,  # Scaling factor for LoRA updates
    lora_dropout = 0.05,  # Dropout rate for LoRA
    task_type = TaskType.SEQ_2_SEQ_LM,  # Sequence-to-sequence task
    bias = 'none',
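    target_modules = ["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention projections (query, value, key, output); NLLB (M2M100) appears to name its output projection "out_proj", so "o_proj" may match no module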
)

# Apply LoRA adapters to the model
model_with_lora = get_peft_model(model, lora_config)

In [None]:
# Check trainable parameters
model_with_lora.print_trainable_parameters()

trainable params: 1,769,472 || all params: 616,843,264 || trainable%: 0.2869
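The trainable-parameter count can be reproduced by hand. LoRA adds two small matrices to each adapted linear layer, `A` of shape `r x d_in` and `B` of shape `d_out x r`, for `r * (d_in + d_out)` new parameters. The sketch below is a back-of-the-envelope check under stated assumptions: hidden size 1024, 12 encoder and 12 decoder layers, and only `q_proj`, `k_proj`, `v_proj` being matched. None of these are confirmed by the notebook output itself.

```python
def lora_params_per_linear(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) to each adapted linear layer."""
    return r * (d_in + d_out)

D_MODEL = 1024        # assumed hidden size of nllb-200-distilled-600M
ENCODER_LAYERS = 12   # assumed
DECODER_LAYERS = 12   # assumed
R = 8                 # rank from lora_config

# One self-attention block per encoder layer; self- plus cross-attention per decoder layer
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS

# Assumption: only q_proj, k_proj and v_proj are matched; "o_proj" finds no module
# if the output projection is named "out_proj"
matched_projections = 3

trainable = attention_blocks * matched_projections * lora_params_per_linear(D_MODEL, D_MODEL, R)
print(f"{trainable:,}")
print(f"{100 * trainable / 616_843_264:.4f}%")
```

Under these assumptions the result agrees with the figure printed by `print_trainable_parameters()`, which suggests the output projection is not being adapted.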


### **Prepare DataCollator**

In [None]:
# Instantiate the data collator for sequence-to-sequence tasks
data_collator = DataCollatorForSeq2Seq(tokenizer,
                                       model = model_with_lora,
                                       padding = True)
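Because `preprocess` already pads every example to `max_length=128`, the collator's `padding = True` has nothing left to pad here. For comparison, the sketch below shows the dynamic (per-batch) padding a collator performs when examples are left unpadded. `pad_batch` is a hypothetical helper written for illustration, and the token IDs are made-up sample values.

```python
def pad_batch(sequences, pad_value):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]

# Two unpadded examples of different lengths
batch_input_ids = [
    [256047, 179399, 104, 2],
    [256047, 20545, 2],
]
padded = pad_batch(batch_input_ids, pad_value=1)
for row in padded:
    print(row)
```

Padding to the longest sequence in each batch, rather than to a fixed 128 tokens, usually reduces wasted computation on short sentences.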

### **Define Seq2SeqTrainingArguments**

In [None]:
# Training arguments

training_args = Seq2SeqTrainingArguments(
    eval_strategy = "epoch",  # Evaluate after every epoch
    logging_dir = f"{save_dir}/logs",  # Directory for storing logs
    logging_strategy = "steps",  # Log every N steps
    logging_steps = 25,  # Log every 25 steps
    save_strategy = "epoch",  # Save a checkpoint after every epoch
    save_total_limit = 3,  # Keep at most 3 checkpoints
    per_device_train_batch_size = 4,  # Batch size per device for training
    per_device_eval_batch_size = 4,  # Batch size per device for evaluation
    gradient_accumulation_steps = 2,  # Accumulate gradients for 2 steps before updating weights
    num_train_epochs = 3,  # Total number of epochs
    predict_with_generate = True,  # Use generate() for predictions during evaluation
    weight_decay = 0.01,  # Weight decay
    lr_scheduler_type = "linear",  # Linear learning rate scheduler
    optim = "paged_adamw_32bit",  # Paged AdamW optimizer from bitsandbytes
    learning_rate = 2e-5,  # Initial learning rate
    eval_steps = 500,  # Only used when eval_strategy = "steps"; ignored here
    fp16 = True,  # Use fp16 mixed-precision training (bf16 is a common alternative on GPUs that support it, such as the A100)
    load_best_model_at_end = True,  # Reload the best checkpoint (by metric_for_best_model) when training ends
    metric_for_best_model = "eval_loss",  # Metric used to pick the best checkpoint (validation loss here)
    greater_is_better = False,  # Lower eval_loss is better
    report_to = "none",  # Disable logging integrations such as TensorBoard
    disable_tqdm = False,  # Keep the tqdm progress bar enabled
    save_steps = 500,  # Only used when save_strategy = "steps"; ignored here
    label_names = ["labels"],  # Name of the label column in the dataset
  )
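The batch-size and epoch settings above determine how many optimizer updates the run performs. The sketch below works through that arithmetic for the 20,440 training rows, assuming a single GPU. The exact step count reported by the trainer may differ slightly across library versions.

```python
import math

train_examples = 20_440          # rows in train_tokenized
per_device_batch_size = 4
gradient_accumulation_steps = 2
num_epochs = 3
num_devices = 1                  # assumption: a single Colab GPU

# Examples that contribute to one optimizer update
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

# Forward/backward passes per epoch, then optimizer updates per epoch
batches_per_epoch = math.ceil(train_examples / (per_device_batch_size * num_devices))
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_updates = updates_per_epoch * num_epochs

print(effective_batch_size, batches_per_epoch, updates_per_epoch, total_updates)
```

With `logging_steps = 25`, this many updates yields several hundred logged loss values over the run.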


### **Compute Metrics**

In [None]:
# metric = evaluate.load("sacrebleu")

# def compute_metrics(eval_preds):
#     preds, labels = eval_preds
#     decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
#     decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
#     result = metric.compute(predictions=decoded_preds, references=[[l] for l in decoded_labels])
#     return {"bleu": result["score"]}


### **Training Setup**

In [None]:
trainer = Seq2SeqTrainer(
    model=model_with_lora,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized.select(range(min(len(val_tokenized), 2000))),
    data_collator=data_collator,
    #compute_metrics=compute_metrics  # compute BLEU
)

### **Train**

In [None]:
# Compute total training time
start_time = time.time()
print(f"Training starts at {start_time}")

trainer.train()

end_time = time.time()
print(f"Training ends at {end_time}")

total_seconds = end_time - start_time
hours = int(total_seconds // 3600)
minutes = int((total_seconds % 3600) // 60)
seconds = int(total_seconds % 60)

print(f"Training time: {hours}h {minutes}m {seconds}s")

Training starts at 1747760575.7574346


Epoch,Training Loss,Validation Loss
1,6.6725,6.007836
2,6.3915,5.964256
3,6.5479,5.956226


Training ends at 1747764421.0577939
Training time: 1h 4m 5s
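
The start and end values printed above are Unix epoch timestamps in seconds. As a sanity check, the sketch below reproduces the reported duration from the two logged values using `divmod`, and also shows one way to print the start time in a readable form. The helper name `format_duration` is introduced here for illustration only.

```python
from datetime import datetime, timezone

# Values copied from the training log above.
start_time = 1747760575.7574346
end_time = 1747764421.0577939


def format_duration(total_seconds: float) -> str:
    """Format a duration in seconds as 'Xh Ym Zs', truncating fractional seconds."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes}m {seconds}s"


print(f"Training time: {format_duration(end_time - start_time)}")

# Human-readable (UTC) version of the raw start timestamp.
print(datetime.fromtimestamp(start_time, tz=timezone.utc).isoformat())
```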


## **Save Model and Tokenizer**

In [None]:
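# Save the fine-tuned weights; for a LoRA-wrapped model this is expected to be the adapter
# (the upload log further down lists adapter_model.safetensors), not a full copy of the base model
trainer.save_model(f"{save_dir}/En-Yo_FT_model")
# Save the tokenizer files alongside the model so the directory can be loaded on its own
tokenizer.save_pretrained(f"{save_dir}/En-Yo_FT_model")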

('/content/drive/MyDrive/Colab Notebooks/NLLB_200/En-Yo_LaTn/En-Yo_FT_model/tokenizer_config.json',
 '/content/drive/MyDrive/Colab Notebooks/NLLB_200/En-Yo_LaTn/En-Yo_FT_model/special_tokens_map.json',
 '/content/drive/MyDrive/Colab Notebooks/NLLB_200/En-Yo_LaTn/En-Yo_FT_model/sentencepiece.bpe.model',
 '/content/drive/MyDrive/Colab Notebooks/NLLB_200/En-Yo_LaTn/En-Yo_FT_model/added_tokens.json',
 '/content/drive/MyDrive/Colab Notebooks/NLLB_200/En-Yo_LaTn/En-Yo_FT_model/tokenizer.json')

## **Push to Hugging Face**

In [None]:
# !pip install huggingface_hub



In [None]:
# !huggingface-cli login



In [None]:
# from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# # Define repo details
# repo_name = "drakensberg85/English-Yoruba_NLLB_FT_model"
# model_path = f"{save_dir}/En-Yo_FT_model"

# # Upload the saved model and tokenizer with transformers' push_to_hub
# AutoModelForSeq2SeqLM.from_pretrained(model_path).push_to_hub(repo_name)
# AutoTokenizer.from_pretrained(model_path).push_to_hub(repo_name)

adapter_model.safetensors:   0%|          | 0.00/7.11M [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

sentencepiece.bpe.model:   0%|          | 0.00/4.85M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/32.2M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/drakensberg85/English-Yoruba_NLLB_FT_model/commit/9cc03e33b1f9fe0dcc2c12b2fbd594326c88f228', commit_message='Upload tokenizer', commit_description='', oid='9cc03e33b1f9fe0dcc2c12b2fbd594326c88f228', pr_url=None, repo_url=RepoUrl('https://huggingface.co/drakensberg85/English-Yoruba_NLLB_FT_model', endpoint='https://huggingface.co', repo_type='model', repo_id='drakensberg85/English-Yoruba_NLLB_FT_model'), pr_revision=None, pr_num=None)

## **TensorBoard Logging and Setup**

In [None]:
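# Note: training_args sets report_to = "none", so the Trainer itself writes no TensorBoard event files;
# the log directory below stays empty unless something is written to it explicitly.
# writer = SummaryWriter(f"{save_dir}/logs")
# print("Training complete. View metrics using TensorBoard:")
# print(f"Run this in Colab terminal: tensorboard --logdir={save_dir}/logs")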

In [None]:
# %reload_ext tensorboard
# %tensorboard --logdir "/content/drive/MyDrive/Colab Notebooks/NLLB_600M/En-Yo_LaTn/logs"

## **Inference Check**

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Path to the fine-tuned model saved above
model_path = f"{save_dir}/En-Yo_FT_model"

# Load tokenizer and model from local files
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path, local_files_only=True)

# Send model to appropriate device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

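# Example sentence for translation
english_sentence = "I want to go home."
# Prefix the target-language tag; this should match the input format used when preparing the training data
source_sentence = f">>yor_Latn<< {english_sentence}"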

# Tokenize input and move to device
inputs = tokenizer(source_sentence, return_tensors="pt").to(device)

# Generate translation
with torch.no_grad():
    output = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)

# Decode and print translation
yoruba_translation = tokenizer.decode(output[0], skip_special_tokens=True)

print(f"English: {english_sentence}")
print(f"Yoruba: {yoruba_translation}")


English: I want to go home.
Yoruba: Mo fẹ́ lọ sílé.


## **Incorporating Chainlit**

In [None]:
%%writefile app.py

import chainlit as cl
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch


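# Load model & tokenizer.
# app.py runs as a separate script, so notebook variables such as save_dir are not defined here;
# the path is written out in full and matches the directory the notebook saved the model to.
model_path = "/content/drive/MyDrive/Colab Notebooks/NLLB_200/En-Yo_LaTn/En-Yo_FT_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to("cuda" if torch.cuda.is_available() else "cpu")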

@cl.on_chat_start
async def start():
    await cl.Message(content="👋 Welcome! Type something in English and I'll translate it to Yoruba!").send()

@cl.on_message
async def main(message: cl.Message):
    input_text = f">>yor_Latn<< {message.content}"
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    # Create an empty Chainlit message to stream into
    response = cl.Message(content="")
    await response.send()

    # Generate the full translation in a single call (greedy decoding, no gradients needed)
    with torch.no_grad():
        output_tokens = model.generate(
            **inputs,
            max_length=128,
            num_beams=1,  # Greedy decoding; no beam search
            do_sample=False,
            output_scores=False,
            return_dict_in_generate=True
        )

    # Decode the generated sequence into text
    output_text = tokenizer.decode(output_tokens.sequences[0], skip_special_tokens=True)

    # Simulate streaming by appending one whitespace-separated word at a time
    for token in output_text.split():
        response.content += token + " "
        await response.update()

    # Final update with the trailing space removed
    response.content = response.content.strip()
    await response.update()


Writing app.py
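
The app above produces the whole translation first and then reveals it word by word. The sketch below isolates that simulated-streaming pattern so it can be tried without Chainlit or the model. The names `stream_words` and `collect` and the `delay` parameter are hypothetical helpers introduced here, not part of Chainlit or the notebook.

```python
import asyncio


async def stream_words(text: str, delay: float = 0.0):
    """Yield the text accumulated so far, one whitespace-separated word at a time."""
    shown = []
    for word in text.split():
        shown.append(word)
        yield " ".join(shown)
        # A small pause between chunks makes the output appear to stream.
        await asyncio.sleep(delay)


async def collect(text: str, delay: float = 0.0) -> list:
    """Gather every intermediate state that a UI would display."""
    return [partial async for partial in stream_words(text, delay)]


# Sample sentence taken from the inference check above.
partials = asyncio.run(collect("Mo fẹ́ lọ sílé."))
for partial in partials:
    print(partial)
```

In a notebook cell, where an event loop is already running, `await collect(...)` can be used directly in place of `asyncio.run(...)`. In the Chainlit handler, each yielded string would be assigned to `response.content` before calling `await response.update()`.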
