<a href="https://colab.research.google.com/github/KnextKoder/Mein_LLM/blob/main/yor_finetune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Introduction & Goal 🎯

## Fine-tuning a Language Model for Translation

Welcome! This notebook is your hands-on guide to fine-tuning a pre-trained model for a new task.
We'll take a model that understands English and teach it how to translate from **English to Yoruba**.

### What You'll Learn:
1.  **Setup**: How to install and import the necessary libraries.
2.  **Load Data**: How to load a standard translation dataset from the Hugging Face Hub.
3.  **Preprocess**: How to prepare the text data for the model using a tokenizer.
4.  **Fine-Tune**: The core process of training the model on the new data.
5.  **Inference**: How to use your newly fine-tuned model to translate sentences.

In [None]:
# 2. Setup 🛠️ | Installing Libraries
# First things first, we need to install the libraries that will do the heavy lifting.

# - **transformers**: Provides the pre-trained models (like T5) and the training tools.
# - **datasets**: Makes it super easy to load datasets from the Hugging Face Hub.
# - **sacrebleu**: A standard library for evaluating translation quality.
# - **accelerate**: Helps PyTorch (the backend for transformers) run smoothly on GPUs or TPUs.

!pip install transformers[torch] datasets sacrebleu accelerate -q

In [None]:
# 3. Loading the Dataset 📚
# We need data to teach our model. We'll use the '0xmarvel/soro-en-yor' dataset from the Hugging Face Hub, which contains pairs of sentences in English ('en') and Yoruba ('yo').

from datasets import load_dataset

# Load the full 'train' split of the English-Yoruba dataset.
raw_datasets = load_dataset("0xmarvel/soro-en-yor", split="train")

# The dataset is currently one big block. Let's split it into a training set and a testing set.
# 90% for training, 10% for testing.
split_datasets = raw_datasets.train_test_split(train_size=0.9, seed=42)

# Rename the 'test' split to 'validation' which is a more common term in training.
split_datasets["validation"] = split_datasets.pop("test")

# Let's see what a sample looks like!
print("A sample from our dataset:")
print(split_datasets["train"][1])

A sample from our dataset:
{'en': '"Based on the advice of the Federal Ministry of Health and the NCDC, I am directing the cessation of all movements in Lagos and the FCT for an initial period of 14 days with effect from 11pm on Monday, 30th March 2020.', 'yo': 'Nípa ìmọ̀ràn láti ọ̀dọ̀ àwọn àjọ tó ń mójútó ètò ìlera àti àjọ tó ń mójútó gbígbógun ti ààrùn lórílẹ́ èdè Nàìjíríà, nítorí náà mo pàṣẹ, pé kó ní sí wíwọlé tàbí jíjáde nílùú Èkó, Ògùn àti FCT Àbújá fún odidi ọjọ́ mẹ́rìnlá gbáko bẹ̀rẹ̀ láti aago mọ́kànlá, ọgbọ̀ọjọ́, oṣù kẹta,ọdún 2020.'}
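
To make the split step more concrete, here is a minimal pure-Python sketch of a seeded 90/10 split. It only illustrates the idea: `toy_train_test_split` and the toy rows are hypothetical, and this shuffle will not reproduce the exact rows that `train_test_split` from `datasets` selects.

```python
import random

def toy_train_test_split(rows, train_size=0.9, seed=42):
    """Shuffle rows with a fixed seed, then cut them into train and validation lists."""
    rng = random.Random(seed)   # private RNG, so the global random state is untouched
    shuffled = list(rows)       # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_size)
    return {"train": shuffled[:cut], "validation": shuffled[cut:]}

# Hypothetical placeholder rows standing in for the real sentence pairs.
pairs = [{"en": f"source {i}", "yo": f"target {i}"} for i in range(20)]
splits = toy_train_test_split(pairs)
print(len(splits["train"]), len(splits["validation"]))
```

Because the seed is fixed, calling the function again returns the same split, which is why the notebook passes `seed=42`: the validation set stays the same from run to run.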


In [None]:
# 4. Preprocessing the Data ✍️
# Models don't understand words; they understand numbers. The process of converting words to numbers is called "tokenization".
# We'll use a "tokenizer" that was created alongside our pre-trained model to ensure the numbers match what the model expects.

from transformers import AutoTokenizer

# The model we'll be fine-tuning is 't5-small'. It's a good balance of size and performance for a tutorial.
model_checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
source_lang = "en"
target_lang = "yo"

# T5 is a sequence-to-sequence model. It needs a specific prefix to know what task it should be doing.
# For translation, we'll tell it "translate English to Yoruba: ".
prefix = "translate English to Yoruba: "

def preprocess_function(examples):
    """This function takes a batch of examples and tokenizes them."""
    # Access the lists of sentences directly using the language keys
    inputs = [prefix + text for text in examples[source_lang]]
    targets = [text for text in examples[target_lang]]

    # Tokenize the inputs and targets
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)
    labels = tokenizer(text_target=targets, max_length=128, truncation=True)

    # The 'labels' are what the model should learn to predict.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Now, apply this function to our entire dataset. The 'map' function is a powerful way to do this quickly.
tokenized_datasets = split_datasets.map(preprocess_function, batched=True)

print("\nSample of tokenized data (the model sees numbers, not words):")
print(tokenized_datasets["train"][1])

Map:   0%|          | 0/664 [00:00<?, ? examples/s]


Sample of tokenized data (the model sees numbers, not words):
{'en': '"Based on the advice of the Federal Ministry of Health and the NCDC, I am directing the cessation of all movements in Lagos and the FCT for an initial period of 14 days with effect from 11pm on Monday, 30th March 2020.', 'yo': 'Nípa ìmọ̀ràn láti ọ̀dọ̀ àwọn àjọ tó ń mójútó ètò ìlera àti àjọ tó ń mójútó gbígbógun ti ààrùn lórílẹ́ èdè Nàìjíríà, nítorí náà mo pàṣẹ, pé kó ní sí wíwọlé tàbí jíjáde nílùú Èkó, Ògùn àti FCT Àbújá fún odidi ọjọ́ mẹ́rìnlá gbáko bẹ̀rẹ̀ láti aago mọ́kànlá, ọgbọ̀ọjọ́, oṣù kẹta,ọdún 2020.', 'input_ids': [13959, 1566, 12, 6545, 14446, 9, 10, 96, 25557, 30, 8, 1867, 13, 8, 5034, 7849, 13, 1685, 11, 8, 445, 23125, 6, 27, 183, 3, 26243, 8, 1830, 7, 257, 13, 66, 9780, 16, 29461, 11, 8, 377, 6227, 21, 46, 2332, 1059, 13, 968, 477, 28, 1504, 45, 850, 2028, 30, 2089, 6, 604, 189, 1332, 6503, 5, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
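
The structure that `preprocess_function` returns can be sketched without downloading a tokenizer. The block below is a toy stand-in, not the T5 tokenizer: `toy_encode` and `toy_preprocess` are hypothetical helpers that split on whitespace and invent ids on the fly. The only detail borrowed from the real output above is that each sequence ends with id `1`.

```python
EOS_ID = 1  # the real output above also ends with id 1; every other id here is made up

def toy_encode(text, vocab, max_length=128):
    """Map whitespace-separated words to integer ids, truncate, and append EOS."""
    ids = []
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 2   # ids 0 and 1 are reserved for padding and EOS
        ids.append(vocab[word])
    return ids[: max_length - 1] + [EOS_ID]

def toy_preprocess(examples, prefix="translate English to Yoruba: ", max_length=128):
    """Mirror the shape of preprocess_function: inputs get the prefix, targets become labels."""
    vocab = {}
    input_ids = [toy_encode(prefix + text, vocab, max_length) for text in examples["en"]]
    labels = [toy_encode(text, vocab, max_length) for text in examples["yo"]]
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * len(ids) for ids in input_ids],
        "labels": labels,
    }

# A fragment pair taken from the sample printed above.
batch = {"en": ["14 days"], "yo": ["ọjọ́ mẹ́rìnlá"]}
encoded = toy_preprocess(batch)
print(encoded)
```

The point of the sketch is the layout: one list of ids per input sentence, a matching attention mask of ones, and a separate list of ids for the target stored under `labels`.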

In [None]:
# 5. Fine-Tuning the Model 🔥
# This is where the magic happens! We'll load the pre-trained T5 model and set up a "Trainer" that handles the entire training loop for us.

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq

# Load the pre-trained T5 model.
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# A data collator is a helper that batches our tokenized data together nicely for the model.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Define the training arguments. These are like settings for our training session.
args = Seq2SeqTrainingArguments(
    output_dir="t5-small-en-to-yor",          # Where to save the model
    eval_strategy="epoch",            # Evaluate at the end of each epoch
    learning_rate=2e-5,                     # A standard learning rate for fine-tuning
    per_device_train_batch_size=16,         # How many examples to process at once during training
    per_device_eval_batch_size=16,          # How many examples to process at once during evaluation
    weight_decay=0.01,                      # Helps prevent overfitting
    save_total_limit=3,                     # Keep only the 3 most recent checkpoints
    num_train_epochs=100,                   # Train for 100 full passes over the data
    predict_with_generate=True,             # Use generate() during evaluation (needed for metrics such as BLEU)
    push_to_hub=False,                      # Set to True if you want to upload to Hugging Face Hub
)

# Create the Trainer object.
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Start training! With 100 epochs this can take a while, even on a Colab GPU.
# Make sure your runtime is set to GPU (Runtime -> Change runtime type -> T4 GPU)
print("Starting the fine-tuning process...")
trainer.train()
print("Training complete!")

Starting the fine-tuning process...


Epoch,Training Loss,Validation Loss
1,No log,2.473315
2,2.857800,2.270584
3,2.435500,2.158744
4,2.435500,2.085432
5,2.298900,2.030452
6,2.209300,1.984345
7,2.136500,1.947982
8,2.136500,1.915529
9,2.091900,1.887195
10,2.047500,1.862744


Training complete!
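
The "No log" entry and the repeated training-loss values in the table can be explained with a little arithmetic. The sketch below rests on assumptions that the notebook does not state: roughly 5,976 training examples (inferred by treating the 664 in the `Map` progress bar as the 10% validation split), a single GPU with no gradient accumulation, and the `Trainer` default of logging the training loss every 500 steps. The helper names are hypothetical.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps in one epoch; the last, smaller batch still counts as a step."""
    return math.ceil(num_examples / batch_size)

def last_logged_step(epoch, per_epoch, logging_steps=500):
    """Most recent step at which a training loss was logged by the end of `epoch` (0 = none yet)."""
    return (epoch * per_epoch) // logging_steps * logging_steps

train_examples = 5976   # assumption: inferred from the split sizes, not printed by the notebook
per_epoch = steps_per_epoch(train_examples, 16)
total_steps = per_epoch * 100   # num_train_epochs=100

print(per_epoch, total_steps)
for epoch in range(1, 5):
    print(epoch, last_logged_step(epoch, per_epoch))
```

Under these assumptions, epoch 1 ends before step 500, which is consistent with "No log", and epochs 3 and 4 both end between steps 1000 and 1500, which is consistent with both rows showing 2.435500. The total also lines up with the `checkpoint-37400` directory loaded in the inference cell.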


In [None]:
# 6. Inference (Using Your Model) 🗣️
# The model is trained! Now for the fun part: let's give it an English sentence and see how it does.

from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
import os
import torch

# Find the latest checkpoint directory
output_dir = "t5-small-en-to-yor"
checkpoints = [os.path.join(output_dir, d) for d in os.listdir(output_dir) if d.startswith('checkpoint-')]
latest_checkpoint = max(checkpoints, key=os.path.getmtime)

print(f"Loading model from: {latest_checkpoint}")

# Load the fine-tuned model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(latest_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(latest_checkpoint)

# Define the prefix for the translation task
prefix = "translate English to Yoruba: "

def translate_en_to_yor(text):
    """Translates English text to Yoruba using the fine-tuned model."""
    inputs = [prefix + text]
    # Tokenize the input text
    tokenized_inputs = tokenizer(inputs, return_tensors="pt", padding=True, truncation=True)

    # Generate the translation
    # Add a check to move tensors to GPU if available
    if torch.cuda.is_available():
        tokenized_inputs = {k: v.to("cuda") for k, v in tokenized_inputs.items()}
        model.to("cuda")

    outputs = model.generate(**tokenized_inputs, max_length=128)

    # Decode the generated tokens back to text
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translated_text

# Let's try translating a sentence.
english_sentence = "Dogs."
yoruba_translation = translate_en_to_yor(english_sentence)

print(f"English: {english_sentence}")
print(f"Model's Yoruba Translation: {yoruba_translation}")

print("\n--- Another example ---")
english_sentence_2 = "Bird."
yoruba_translation_2 = translate_en_to_yor(english_sentence_2)
print(f"English: {english_sentence_2}")
print(f"Model's Yoruba Translation: {yoruba_translation_2}")

Loading model from: t5-small-en-to-yor/checkpoint-37400
English: Dogs.
Model's Yoruba Translation: wn   e.

--- Another example ---
English: Bird.
Model's Yoruba Translation: wn m.
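
The translations above look like they are missing letters. One way to investigate is to compare a Yoruba string with what comes back after encoding and decoding it. The helper below is a hypothetical, pure-Python sketch of that comparison; it does not call the tokenizer itself. The example pair is a guess about the cause, not something this notebook verifies: `wn` is what `àwọn` (a word from the dataset sample) would look like with `à` and `ọ` dropped.

```python
import unicodedata

def lost_characters(original, round_tripped):
    """Return the characters present in `original` but missing from `round_tripped`."""
    # Normalize both strings so that composed and decomposed accents compare equal.
    original = unicodedata.normalize("NFC", original)
    round_tripped = unicodedata.normalize("NFC", round_tripped)
    return sorted(set(original) - set(round_tripped) - {" "})

print(lost_characters("àwọn", "wn"))
```

To run the check against the real tokenizer, the second argument could be `tokenizer.decode(tokenizer(text)["input_ids"], skip_special_tokens=True)`. If characters go missing in that round trip, the vocabulary probably cannot represent them, and no amount of fine-tuning on this tokenizer will bring them back.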


# 7. Conclusion & Next Steps 🎉

## Congratulations!

You have successfully fine-tuned a pre-trained T5 model to translate from English to Yoruba.

### What We Accomplished:
- Loaded and prepared a real-world translation dataset.
- Tokenized the text so the model could understand it.
- Ran a complete fine-tuning process using the Hugging Face Trainer.
- Used the final model to generate new translations.

### Where to Go From Here:
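- **Train on more data**: More sentence pairs, or more epochs if the validation loss is still falling, may improve performance.
- **Try a larger or multilingual model**: `t5-base` or `t5-large` may give better results (but take longer to train). The sample translations above appear to be missing accented Yoruba letters, which may mean the `t5-small` vocabulary does not cover them; a multilingual checkpoint such as `google/mt5-small` is one option to try.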
- **Translate other languages**: Find another dataset on the Hugging Face Hub and try fine-tuning for a different language pair!