# **Fine-Tuning M2M100 Model for English-to-Tigrinya Translation**


This code demonstrates how to fine-tune the Facebook M2M100 multilingual translation model for translating text from English to Tigrinya. It follows a systematic approach that includes loading datasets, tokenizing text, training the model, and evaluating its performance. The key steps include:

1. **Model Setup**:  
   - Load the pre-trained `m2m100_418M` model and tokenizer.  
   - Move the model to GPU for faster computation when available.

2. **Dataset Preparation**:  
   - Load English-to-Tigrinya training, validation, and test datasets from CSV files.  
   - Preprocess data by tokenizing text, with English (`en`) as the source language and Amharic (`am`) as the target language code.  
     - *Note: M2M100 has no language code for Tigrinya. The Amharic code is used as a stand-in because Amharic is closely related to Tigrinya and is written in the same Ge'ez script.*


3. **Baseline Evaluation**:  
   - Generate baseline translations using the pre-trained model.  
   - Compute the BLEU score to measure the quality of the translations.

4. **Fine-Tuning**:  
   - Define training parameters such as learning rate, batch size, and number of epochs.  
   - Fine-tune the M2M100 model using the training dataset and validate its performance on the validation dataset.  
   - Save the fine-tuned model and tokenizer for deployment.

5. **Evaluation of Fine-Tuned Model**:  
   - Generate translations for the validation dataset with the fine-tuned model.  
   - Compute the BLEU score to evaluate improvements over the baseline.

6. **Deployment and Testing**:  
   - Save the fine-tuned model and tokenizer to disk.  
   - Test the model on example sentences in English and generate translations in Tigrinya.

### Conclusion
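Despite fine-tuning, BLEU remains very low: about 0.08 for the baseline and about 0.83 for the fine-tuned model, on sacreBLEU's 0 to 100 scale. The two scores are computed on different files (`en_to_ti_val.csv` for the baseline and `en_to_ti_sample_test.csv` for the fine-tuned model), so they are not strictly comparable. Even so, both results suggest that this model, as fine-tuned here, is not effective for English-to-Tigrinya translation. Alternative models or additional high-quality data need to be explored to improve performance for this language pair.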


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **Install and Import Libraries**

In [None]:
!pip install transformers datasets sacrebleu
!pip install evaluate

In [None]:
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
import torch
from datasets import Dataset
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
import evaluate
import pandas as pd

## **Load Data and Run the Baseline Model on the Validation Dataset**

In [None]:
# Load the pretrained M2M100 model and tokenizer
model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Move the model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)



In [None]:
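# Load the sample training split and the validation split as pandas DataFrames
train_dataset = pd.read_csv("/Capstone/Dataset_csv/en_to_ti_sample_train.csv")
val_dataset = pd.read_csv("/Capstone/Dataset_csv/en_to_ti_val.csv")
# The baseline is scored on the validation split (kept under the name test_dataset)
test_dataset = Dataset.from_pandas(val_dataset)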
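
The later cells read the `Source` and `Target` columns from these CSV files. A minimal sketch of a sanity check for that layout follows, using only the standard library. The helper name `check_parallel_csv` and the sample rows are hypothetical and are not part of the original notebook.

```python
import csv
import io

REQUIRED_COLUMNS = ("Source", "Target")  # column names used by the later cells

def check_parallel_csv(file_obj):
    """Return (sentence pairs, number of dropped rows) for a parallel-corpus CSV."""
    reader = csv.DictReader(file_obj)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing required columns: {missing}")
    pairs, dropped = [], 0
    for row in reader:
        source = (row["Source"] or "").strip()
        target = (row["Target"] or "").strip()
        if source and target:
            pairs.append((source, target))
        else:
            dropped += 1  # rows with an empty side cannot be used for training
    return pairs, dropped

# Hypothetical in-memory sample standing in for a real CSV file
sample = io.StringIO(
    "Source,Target\n"
    "first source sentence,first target sentence\n"
    "second source sentence,\n"
)
pairs, dropped = check_parallel_csv(sample)
print(f"usable pairs: {len(pairs)}, dropped rows: {dropped}")
```

With a real file, the same function can be called on `open(path, newline="", encoding="utf-8")`.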

In [None]:
# The tokenizer and model loaded above are reused for the baseline run.
# Set the source and target languages. M2M100 has no Tigrinya code,
# so the Amharic code ("am") stands in for it.
tokenizer.src_lang = "en"
tokenizer.tgt_lang = "am"

In [None]:
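def generate_translation_in_batches(texts, batch_size=32):
    """Translate English sentences in batches and return the decoded outputs."""
    translations = []
    tokenizer.src_lang = "en"  # Source language is English
    # M2M100 picks the output language from the first generated token, so force
    # the Amharic tag ("am"), which stands in for Tigrinya in this notebook
    target_lang_id = tokenizer.get_lang_id("am")
    for i in range(0, len(texts), batch_size):
        batch = list(texts[i:i + batch_size])
        # Tokenize and move inputs to the same device as the model
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
        # Generate translations
        outputs = model.generate(**inputs, forced_bos_token_id=target_lang_id, max_length=128)
        # Decode the translations
        batch_translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        translations.extend(batch_translations)
    return translations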

# Extract test source texts (English)
test_source_texts = test_dataset["Source"]

# Generate translations for the baseline model
baseline_translations = generate_translation_in_batches(test_source_texts)


In [None]:

test_reference_texts = test_dataset["Target"]
# Load BLEU metric
metric = evaluate.load("sacrebleu")

# Compute BLEU score for baseline model
baseline_result = metric.compute(predictions=baseline_translations, references=test_reference_texts)
print(f"Baseline BLEU Score: {baseline_result['score']}")


Baseline BLEU Score: 0.08116495341885102
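
To help interpret this score, here is a simplified sketch of corpus-level BLEU. It is not sacreBLEU's implementation: it assumes whitespace tokenization, one reference per sentence, and no smoothing, so its numbers will differ from the metric used in this notebook. The helper names `ngram_counts` and `simple_corpus_bleu` are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_corpus_bleu(predictions, references, max_n=4):
    """Simplified corpus BLEU on a 0-100 scale (whitespace tokens, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    pred_len = ref_len = 0
    for pred, ref in zip(predictions, references):
        pred_tokens, ref_tokens = pred.split(), ref.split()
        pred_len += len(pred_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            pred_ngrams = ngram_counts(pred_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Clipped matches: each n-gram counts at most as often as in the reference
            matches[n - 1] += sum((pred_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(pred_ngrams.values())
    if pred_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    # Geometric mean of the 1-gram to max_n-gram precisions
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize outputs that are shorter than the references
    brevity_penalty = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)

print(simple_corpus_bleu(["a b c d e"], ["a b c d e"]))  # identical output and reference
print(simple_corpus_bleu(["a b c d x"], ["a b c d e"]))  # one wrong token out of five
```

sacreBLEU reports scores on the same 0 to 100 scale, so a baseline of about 0.08 means there is almost no n-gram overlap between the generated text and the Tigrinya references.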


In [None]:
# Load the pretrained M2M100 model and tokenizer
model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Move the model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)


In [None]:
# Load the training split and the sample test split; the latter serves as the
# validation set during fine-tuning
train_dataset = pd.read_csv("/Capstone/Dataset_csv/en_to_ti_sample_train.csv")
test_dataset = pd.read_csv("/Capstone/Dataset_csv/en_to_ti_sample_test.csv")


# Convert to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_dataset)
validation_dataset = Dataset.from_pandas(test_dataset)

# Tokenize the data
def preprocess_function(examples):
    tokenizer.src_lang = "en"  # Source language is English
    # M2M100 has no Tigrinya code ("tir" is not in its vocabulary), so the
    # Amharic code ("am") stands in as the target language tag
    tokenizer.tgt_lang = "am"

    # text_target tokenizes the targets with the target language tag and
    # stores their token ids under "labels"
    return tokenizer(
        examples["Source"],
        text_target=examples["Target"],
        truncation=True,
        max_length=128,
        padding="max_length",
    )

# Apply tokenization
tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True)
tokenized_validation_dataset = validation_dataset.map(preprocess_function, batched=True)



Map:   0%|          | 0/6148 [00:00<?, ? examples/s]

Map:   0%|          | 0/1538 [00:00<?, ? examples/s]
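
Because `preprocess_function` pads every sequence to `max_length`, the label sequences contain padding ids that would otherwise be counted in the training loss. PyTorch's cross-entropy loss, which the Hugging Face seq2seq models use, skips positions whose label is `-100`. The sketch below shows that masking step on plain lists. The helper name `mask_padding_in_labels`, the token ids, and the pad id `1` are illustrative assumptions; in the notebook the pad id would come from `tokenizer.pad_token_id`.

```python
def mask_padding_in_labels(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding ids in each label sequence so the loss ignores them."""
    return [
        [ignore_index if token_id == pad_token_id else token_id for token_id in sequence]
        for sequence in label_ids
    ]

# Hypothetical padded label sequences (pad id assumed to be 1 for illustration)
padded_labels = [
    [128022, 54, 907, 2, 1, 1],
    [128022, 311, 2, 1, 1, 1],
]
masked_labels = mask_padding_in_labels(padded_labels, pad_token_id=1)
print(masked_labels)
```

An alternative is to tokenize without `padding="max_length"` and pass a `DataCollatorForSeq2Seq` to the trainer, which pads each batch dynamically and fills label padding with `-100` by default.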

## **Fine-Tune and Evaluate the Pre-Trained Model**

In [None]:
# Define training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="/Capstone/m2m100_finetune_results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    save_total_limit=2,
    predict_with_generate=True,
    fp16=True,  # Enable mixed precision for faster training
)

# Define the trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_validation_dataset,
    tokenizer=tokenizer,
)

# Fine-tune the model
trainer.train()




| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 1.30406         |
| 2     | 2.822700      | 1.113568        |
| 3     | 1.147000      | 1.07425         |




TrainOutput(global_step=1155, training_loss=1.8625578116544914, metrics={'train_runtime': 347.6681, 'train_samples_per_second': 53.051, 'train_steps_per_second': 3.322, 'total_flos': 4996259660169216.0, 'train_loss': 1.8625578116544914, 'epoch': 3.0})
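
The `global_step=1155` value and the `No log` entry for epoch 1 can be reproduced with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the `logging_steps=500` default of `TrainingArguments`; the training set size of 6148 comes from the `Map` output above.

```python
import math

NUM_TRAIN_EXAMPLES = 6148    # training rows, from the Map progress output
BATCH_SIZE = 16              # per_device_train_batch_size
NUM_EPOCHS = 3               # num_train_epochs
LOGGING_STEPS = 500          # assumption: TrainingArguments default
TRAIN_RUNTIME_S = 347.6681   # train_runtime reported in TrainOutput

# The last batch of each epoch is partial, hence the ceiling
steps_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS

def last_logged_step(epoch):
    """Latest step with a training-loss log by the end of `epoch` (0 means none yet)."""
    return (epoch * steps_per_epoch // LOGGING_STEPS) * LOGGING_STEPS

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
for epoch in range(1, NUM_EPOCHS + 1):
    print(f"epoch {epoch}: last training-loss log at step {last_logged_step(epoch)}")

samples_per_second = NUM_TRAIN_EXAMPLES * NUM_EPOCHS / TRAIN_RUNTIME_S
print(f"samples per second: {samples_per_second:.2f}")
```

Under these assumptions, the first epoch ends before the first training-loss log at step 500, which is consistent with `No log` in the table. The training losses shown for epochs 2 and 3 would then be the values logged at steps 500 and 1000, not end-of-epoch losses.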

In [None]:
# Load the BLEU metric
metric = evaluate.load("sacrebleu")

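# Generate translations for the validation dataset with the fine-tuned model
def generate_translation_in_batches(texts, batch_size=32):
    """Translate English sentences in batches and return the decoded outputs."""
    translations = []
    tokenizer.src_lang = "en"  # Source language is English
    # Force the Amharic tag ("am"), which stands in for Tigrinya, as the first token
    target_lang_id = tokenizer.get_lang_id("am")
    for i in range(0, len(texts), batch_size):
        batch = list(texts[i:i + batch_size])
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
        outputs = model.generate(**inputs, forced_bos_token_id=target_lang_id, max_length=128)
        batch_translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        translations.extend(batch_translations)
    return translations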

validation_source_texts = validation_dataset["Source"]
references = [[ref] for ref in validation_dataset["Target"]]
predictions = generate_translation_in_batches(validation_source_texts)

# Compute BLEU score
fine_tuned_bleu = metric.compute(predictions=predictions, references=references)
print(f"Fine-Tuned BLEU Score: {fine_tuned_bleu['score']}")


Fine-Tuned BLEU Score: 0.8301035630609231


## **Save the Fine-Tuned Model**

In [None]:
# Save the fine-tuned model and tokenizer
model.save_pretrained("/Capstone/m2m100_fine_tuned")
tokenizer.save_pretrained("/Capstone/m2m100_fine_tuned")

print("Fine-tuned model saved successfully!")


Fine-tuned model saved successfully!


## **Run Examples**

In [None]:
def translate_text(text, src_lang="en", tgt_lang="am"):
    """
    Translates a single text input using the fine-tuned model.

    Args:
    text (str): The text to translate.
    src_lang (str): Source language code (default is English).
    tgt_lang (str): Target language code. M2M100 has no Tigrinya code, so the
        default is Amharic ("am"), which stands in for Tigrinya in this notebook.

    Returns:
    str: The translated text.
    """
    tokenizer.src_lang = src_lang  # Set the source language
    tokenizer.tgt_lang = tgt_lang  # Set the target language

    # Tokenize and prepare inputs
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128).to(device)

    # Generate translations, forcing the target language tag as the first token
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.get_lang_id(tgt_lang),
        max_length=128,
        num_beams=5,
    )

    # Decode the output
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translated_text


In [None]:
# Define example sentences
examples = [
    "Within three months, I will be able to read, write and speak.",  # English to Tigrinya
    "If there were no telephones, it would be inconvenient.",  # English to Tigrinya
]

# Generate translations for each example ("am" is the stand-in code for Tigrinya)
for i, example in enumerate(examples):
    translated_text = translate_text(example, src_lang="en", tgt_lang="am")
    print(f"Example {i+1}:")
    print(f"Source (English): {example}")
    print(f"Translation (Tigrinya): {translated_text}")



Example 1:
Source (English): Within three months, I will be able to read, write and speak.
Translation (Tigrinya): ድሕሪ ሰለስተ መዓልቲ፡ ክትብል፡ ክትብል፡ ክትብል፡ ክትብል፡ ክትብል ኣለኒ።

Example 2:
Source (English): If there were no telephones, it would be inconvenient.
Translation (Tigrinya): telephones እንተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘይተዘ

