#Overview

This script fine-tunes the facebook/nllb-200-distilled-600M model to perform Yoruba-to-English translation using a custom dataset of Yoruba phrases and their literal English translations. It handles data loading, tokenization, training, evaluation, and deployment via Gradio. The goal is to build a context-aware machine translation model for low-resource languages like Yoruba.

The input is a clean CSV file consisting of 61 rows and 4 columns:

"Phrase" which comprises of proverbs in the source language(Yoruba)

"Literal Translation" which are the corresponding literal translations of the proverbs in the target language(English)

"Phrase_Tokens" and "Literal_Translation_Tokens" which contains the word tokens of the "Phrase" and "Literal Translation" columns respectively.

The output includes:

Gradio web interface: where users can input Yoruba text and receive the English translation.

Evaluation Metric:
BLEU score for evaluating translation performance.

The trained Translation model "nllb_translation_model"

In [1]:
!pip install datasets



In [1]:
# Import required libraries
import pandas as pd                                                                        # For reading and handling tabular data
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM                              # To load pretrained model and tokenizer
from datasets import Dataset                                                               # Hugging Face datasets library for handling datasets
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq  # For training a sequence-to-sequence model
import os                                                                                  # For environment variable management

# Disable Weights & Biases logging
os.environ["WANDB_DISABLED"] = "true"

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Load the preprocessed CSV file
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Project 2 Context Matters/yoruba_phrases_processesed.csv")

In [4]:
# Drop the token columns as they are not needed for training the model
df = df.drop(["Phrase_Tokens", "Literal_Translation_Tokens"], axis=1)

#Preprocessing and Tokenization

The preprocessing, applied during dataset preparation using Hugging Face’s tokenizer for the NLLB model, involves truncating and padding inputs to a fixed maximum length, tokenizing both the source (Yoruba) and target (English) text, and setting the tokenized English sequence as the label for supervised training.

In [5]:
# Convert the Pandas dataframe to a Hugging Face dataset format
hf_dataset = Dataset.from_pandas(df)

# Split the dataset into train and test
split_dataset = hf_dataset.train_test_split(test_size=0.2)

In [6]:
# Define the pretrained model name
model_name = "facebook/nllb-200-distilled-600M"

# Load the tokenizer for the model and set the source language to Yoruba (Latin script)
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="yor_Latn")

# Load the actual pretrained model for translation (sequence-to-sequence task)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/564 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/4.85M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.3M [00:00<?, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/846 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.46G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.46G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

In [7]:
# Define a function to preprocess each example in the dataset
def preprocess(example):

    # Tokenize the source text (Yoruba phrase)
    src = tokenizer(example["Phrase"], truncation=True, padding="max_length", max_length=128)

    # Tokenize the target text (literal English translation)
    tgt = tokenizer(example["Literal Translation"], truncation=True, padding="max_length", max_length=128)

    # Set the target token IDs as labels for the model to learn to predict
    src["labels"] = tgt["input_ids"]
    return src

# Apply the preprocessing function to training and testing datasets
tokenized_train = split_dataset["train"].map(preprocess, batched=True)
tokenized_eval = split_dataset["test"].map(preprocess, batched=True)

Map:   0%|          | 0/48 [00:00<?, ? examples/s]

Map:   0%|          | 0/12 [00:00<?, ? examples/s]

In [8]:
# Define the training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./nllb_yo_model",           # Directory where the model checkpoints will be saved
    eval_strategy="epoch",                  # Evaluate the model after each epoch
    learning_rate=2e-5,                     # Learning rate (how fast the model learns)
    per_device_train_batch_size=4,          # Number of examples to process per batch on each device
    num_train_epochs=10,                    # Number of times the model will go through the whole dataset
    save_total_limit=2,                     # Save only the 2 most recent checkpoints
    predict_with_generate=True,             # Allows generation-based evaluation during training
    logging_strategy="steps",               # Log at regular step intervals
    logging_steps=50,                       # Log every 50 training steps
    report_to="none",                       # Avoid external logging services
    save_strategy="epoch",                  # Save model at the end of each epoch
)

# Define a data collator to dynamically pad batches to the longest sequence in each batch
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Set up the trainer that will handle training the model
trainer = Seq2SeqTrainer(
    model=model,                           # The model to train
    args=training_args,                    # Training arguments defined above
    train_dataset=tokenized_train,         # Tokenized training dataset
    eval_dataset=tokenized_eval,           # Tokenized evaluation dataset (same as training in this example)
    tokenizer=tokenizer,                   # The tokenizer
    data_collator=data_collator            # Handles padding
)

# Train the model
trainer.train()


  trainer = Seq2SeqTrainer(
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.58.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss
1,No log,11.472418
2,No log,10.990137
3,No log,10.512846
4,No log,9.979028
5,10.993300,9.395634




Epoch,Training Loss,Validation Loss
1,No log,11.472418
2,No log,10.990137
3,No log,10.512846
4,No log,9.979028
5,10.993300,9.395634
6,10.993300,8.973236
7,10.993300,8.690907
8,10.993300,8.505245
9,9.149100,8.398514
10,9.149100,8.364431


TrainOutput(global_step=120, training_loss=9.826325352986654, metrics={'train_runtime': 3341.3851, 'train_samples_per_second': 0.144, 'train_steps_per_second': 0.036, 'total_flos': 130026276126720.0, 'train_loss': 9.826325352986654, 'epoch': 10.0})

The translate_nllb function takes a Yoruba phrase as input, tokenizes it using the loaded NLLB tokenizer, and generates an English translation using the trained model. It sets the source language (yor_Latn) and forces the beginning-of-sentence token for the target language (eng_Latn), then moves the model and input tensors to the GPU if available or otherwise uses CPU. It performs translation using beam search and finally decodes and returns a clean English string as output.

In [9]:
# Import PyTorch to use the model for inference (translation)
import torch

# Define a function that translates a Yoruba phrase to English using the trained model
def translate_nllb(text, src_lang="yor_Latn", tgt_lang="eng_Latn"):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Use GPU if available

    model.to(device)  # Move model to device (GPU or CPU)
    tokenizer.src_lang = src_lang  # Set source language for tokenizer

    # Tokenize the input text and move to device
    encoded = tokenizer(text, return_tensors="pt").to(device)

    # Generate translation, setting the target language
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang)
    )

    # Decode the output tokens into text
    return tokenizer.decode(generated[0], skip_special_tokens=True)


In [10]:
# Show 5 sample translations
for i in range(5):
    print("YO:", df["Phrase"][i])                  # Yoruba input
    print("EN REF:", df["Literal Translation"][i])              # Reference English
    print("EN PRED:", translate_nllb(df["Phrase"][i]))  # Predicted English
    print()


YO: ile oba t'o jo ewa lo busi
EN REF: when a king's palace burns down the rebuilt palace is more beautiful
EN PRED: the house is not a big house

YO: gbogbo alangba lo d'anu dele a ko mo eyi t'inu nrun
EN REF: all lizards lie flat on their stomach and it is difficult to determine which has a stomach ache
EN PRED: every one of you is a little bit of a mess

YO: ile la ti n ko eso re ode
EN REF: charity begins at home
EN PRED: the house is growing its fruit

YO: a pę ko to jęun ki ję ibaję
EN REF: the person that eat late will not eat spoiled food
EN PRED: a pęę nie before the day is over

YO: eewu bę loko longę longę fun ara rę eewu ni
EN REF: there is danger at longę's farm longę is a name of a yoruba legend longę himself is danger
EN PRED: the danger of being a long man is a danger to the self



#Deployment

Gradio creates a lightweight web interface with a textbox for Yoruba input, a text output for the English translation, and an optional title and description for context; the interface is launched using interface.launch(share=True), which makes it accessible via a public link.

In [11]:
# Deploying with Gradio
import gradio as gr

# Define Gradio interface
interface = gr.Interface(
    fn=translate_nllb,                     # using translation function
    inputs=gr.Textbox(lines=2, placeholder="Enter Yoruba phrase..."),
    outputs="text",
    title="Yoruba to English Translator",
    description="Translate Yoruba text to English using NLLB model"
)

# Launch the app
interface.launch(share=True)


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://048b03f46767b74997.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




In [12]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.5-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m5.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.5


In [None]:
# Load BLEU score metric for evaluation
from evaluate import load
bleu = load("bleu")

# Translate all Yoruba phrases in the dataset
preds = [translate_nllb(text) for text in df["Phrase"]]

# Prepare references in the required format for BLEU (list of lists)
refs = [[ref] for ref in df["Literal Translation"]]

# Compute BLEU score to evaluate the quality of the translation
print(bleu.compute(predictions=preds, references=refs))


##Inferences from the evaluation score
The low BLEU score (\~1.7%) shows the model struggles to generate accurate or fluent English translations from Yoruba phrases.

The model recognizes basic words  but as the n-gram length increases, it's accuracy drops sharply, indicating limited fluency due to early training or insufficient data.


🧮 Brevity Penalty: 0.6657
The model's translations are too short when compared to the reference, leading to a low brevity penalty and indicating incomplete or truncated outputs.


📏 Length Ratio: 0.71
The model's translations are too short, often ending early and failing to fully express the Yoruba phrases in English.

The model is undertrained or lacks sufficient data to produce high-quality translations.

Increasing the size and quality of the dataset is the best approach to improve vocabulary and fluency of the model.

In [14]:
# Save the trained model and tokenizer to Google Drive
save_path = "/content/drive/MyDrive/Colab Notebooks/Project 2 Context Matters/nllb_translation_model"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

# Uncomment to load the model
# from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# model = AutoModelForSeq2SeqLM.from_pretrained("./nllb_translation_model")
# tokenizer = AutoTokenizer.from_pretrained("./nllb_translation_model")


('/content/drive/MyDrive/Colab Notebooks/Project 2 Context Matters/nllb_translation_model/tokenizer_config.json',
 '/content/drive/MyDrive/Colab Notebooks/Project 2 Context Matters/nllb_translation_model/special_tokens_map.json',
 '/content/drive/MyDrive/Colab Notebooks/Project 2 Context Matters/nllb_translation_model/sentencepiece.bpe.model',
 '/content/drive/MyDrive/Colab Notebooks/Project 2 Context Matters/nllb_translation_model/added_tokens.json',
 '/content/drive/MyDrive/Colab Notebooks/Project 2 Context Matters/nllb_translation_model/tokenizer.json')