# Evaluating Helsinki-NLP Model for Tigrinya-to-English Translation

This notebook fine-tunes the Helsinki-NLP `opus-mt-ti-en` model for translating text from Tigrinya to English. The process includes dataset preparation, model training, and evaluation using BLEU and chrF++. Below is an overview of the key steps and findings:

---

## Key Steps

### 1. **Model Setup**
- Load the pre-trained Helsinki-NLP `opus-mt-ti-en` model and tokenizer.
- Move the model to GPU for faster computation.

### 2. **Dataset Preparation**
- Load training, validation, and test datasets containing Tigrinya (source) and English (target) text pairs.
- Tokenize the datasets and prepare them for training using Hugging Face's `Dataset` API.

### 3. **Baseline Evaluation**
- Generate translations for the test dataset using the pre-trained model.
- Compute baseline metrics:
  - **BLEU Score:** 8.06 (8.10 on the validation set)
  - **chrF++ Score:** 31.36

### 4. **Fine-Tuning**
- Define training arguments (e.g., learning rate, batch size, epochs) and fine-tune the model using the training dataset.
- Save the fine-tuned model for evaluation and deployment.

### 5. **Post-Fine-Tuning Evaluation**
- Evaluate the fine-tuned model on the test dataset.
- Compute translation quality using metrics:
  - **BLEU Score:** Improved to **27.58**, a large gain in n-gram overlap with the references.
  - **chrF++ Score:** Improved to **48.17**, reflecting better character-level overlap with the references.
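
The size of the gain can be made explicit with a little arithmetic. The sketch below uses the test-set scores printed by the evaluation cells later in this notebook, rounded to two decimals; the `scores` dictionary and the `improvement` helper are illustrative names, not part of the notebook.

```python
# Test-set scores as printed by the evaluation cells (rounded to 2 decimals).
scores = {
    "BLEU": {"baseline": 8.06, "fine_tuned": 27.58},
    "chrF++": {"baseline": 31.36, "fine_tuned": 48.17},
}


def improvement(baseline, fine_tuned):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = fine_tuned - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


for name, s in scores.items():
    absolute, relative = improvement(s["baseline"], s["fine_tuned"])
    print(f"{name}: +{absolute:.2f} points ({relative:.0f}% relative)")
```

Reporting both the absolute and the relative gain helps here because the baseline BLEU is low, so the relative figure alone can look more dramatic than the point difference.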

---

## **Observations**
- The fine-tuned model improves substantially over the baseline on both BLEU and chrF++, indicating translations that are closer to the references.

---

## **Conclusion**
After fine-tuning, the Helsinki-NLP `opus-mt-ti-en` model improves substantially on Tigrinya-to-English translation, with test-set BLEU rising to about 27.6 and chrF++ to about 48.2. This result illustrates how effective fine-tuning a pre-trained model can be for low-resource language translation tasks.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **Install and Import Libraries**

In [21]:
!pip install transformers datasets evaluate sacrebleu

In [10]:
import evaluate
import pandas as pd
import torch
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

In [12]:
# Check GPU availability
print(f"CUDA Available: {torch.cuda.is_available()}")
print(f"Device Name: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'No GPU'}")


CUDA Available: True
Device Name: NVIDIA A100-SXM4-40GB


## **Load Data and Run the Baseline Model on the Validation Dataset**

In [13]:
# Load the 10% sample dataset
train_data = pd.read_csv("/Capstone/Dataset_csv/ti_to_en_train.csv")
test_data = pd.read_csv("/Capstone/Dataset_csv/ti_to_en_test.csv")
val_data = pd.read_csv("/Capstone/Dataset_csv/ti_to_en_val.csv")

print(f"Training Set: {len(train_data)} rows")
print(f"Testing Set: {len(test_data)} rows")
print(f"Validation Set: {len(val_data)} rows")


Training Set: 286500 rows
Testing Set: 35813 rows
Validation Set: 35813 rows
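
The printed row counts suggest an 80/10/10 split. A quick check, using only the numbers shown above (the `split_sizes` name is illustrative):

```python
# Row counts printed by the cell above.
split_sizes = {"train": 286_500, "test": 35_813, "validation": 35_813}

total = sum(split_sizes.values())
shares = {name: 100 * rows / total for name, rows in split_sizes.items()}

for name, rows in split_sizes.items():
    print(f"{name:<10} {rows:>7} rows  {shares[name]:5.1f}%")
```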


In [14]:
print(test_data.head())

   Unnamed: 0                                             Target  \
0      169388  'The women who are part of parliament are the ...   
1       59682              'Sometimes, it 's time to break up. '   
2      144968  'It has been said that a consul was short of t...   
3      269661             'This stunned the presiding officers.'   
4      338063  'Manchester United, CHELSEA, Manchester City a...   

                                              Source  
0    'እተን ገበርትን ሓደግትን እተን ኣባላት ባይቶ ዝኾና ደቂ ኣንስትዮ’የን።'  
1              'ሓደ ሓደ ግዜ ኣብ ግዜኡ ምፍልላይ የዋጽእ’ዩ” በለተን።'  
2        'ቈናኖ ቀደም ኣብ ከምኡ ዝበለ እዋን ግዜ ይሓጽረን ነይሩ ይበሃል።'  
3                 'እዚ ከኣ ነቶም ዝተኣዘዙ ሓለፍቲ ኣመና ኣደንጸዎም።'  
4  'ማንቸስተር ዩናይትድ፡ ቸልሲ፡ ማንቸስተር ሲቲ ኣብዚ ግዜ’ዚ ኸኣ ሌስተር...  
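
The preview shows that each sentence is stored with literal single quotes around it. The notebook trains on the text as-is; the sketch below is one possible cleaning step, not something the notebook applies, and `strip_wrapping_quotes` is a hypothetical helper. Applying it would change the training data and therefore the scores reported here.

```python
def strip_wrapping_quotes(text: str) -> str:
    """Remove one pair of matching quote characters wrapping the whole string."""
    text = text.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in ("'", '"'):
        return text[1:-1].strip()
    return text


rows = [
    "'This stunned the presiding officers.'",
    "'Sometimes, it 's time to break up. '",
    "No wrapping quotes here.",
]
cleaned = [strip_wrapping_quotes(row) for row in rows]
for row in cleaned:
    print(row)
```

With pandas, the same function could be applied per column, for example `test_data["Target"].map(strip_wrapping_quotes)`.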


In [15]:
# Load the pre-trained model and tokenizer
model_name = "Helsinki-NLP/opus-mt-ti-en"  # Tigrinya -> English checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)




In [16]:
# Convert training and validation data to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_data)
val_dataset = Dataset.from_pandas(val_data)

def preprocess_function(examples):
    # Tokenize the source (Tigrinya)
    model_inputs = tokenizer(
        examples["Source"],  # Source column: Tigrinya text
        max_length=128,
        truncation=True,
        padding="max_length",
    )
    # Tokenize the target (English).
    # Note: as_target_tokenizer() is deprecated in recent transformers releases;
    # tokenizer(text_target=...) is the replacement.
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["Target"],  # Target column: English text
            max_length=128,
            truncation=True,
            padding="max_length",
        )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Tokenize both datasets
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_val = val_dataset.map(preprocess_function, batched=True)
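
Because the labels are padded to `max_length`, the padded positions count toward the loss unless they are masked. A common convention in Hugging Face seq2seq examples is to replace padding ids in the labels with `-100`, which PyTorch's cross-entropy loss ignores by default. The sketch below shows that step on plain lists. The notebook does not apply this masking, `mask_label_padding` is a hypothetical helper, and `pad_token_id=0` is a made-up value for illustration (in practice it would be `tokenizer.pad_token_id`).

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_label_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX in a batch of label sequences."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]


# Toy batch: two label sequences padded to length 5 with id 0 (assumed pad id).
labels = [[17, 42, 5, 0, 0], [8, 5, 0, 0, 0]]
masked = mask_label_padding(labels, pad_token_id=0)
print(masked)
```

Inside `preprocess_function`, this would amount to passing `labels["input_ids"]` and `tokenizer.pad_token_id` through the helper before assigning `model_inputs["labels"]`. Doing so changes the loss values, so they would no longer be comparable with the ones logged in this notebook.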



In [17]:
# Move the model to the GPU
model = model.to("cuda")


In [18]:
def generate_translation_in_batches(texts, batch_size=32):
    translations = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]

        # Tokenize and move inputs to GPU
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=128)
        inputs = {key: value.to("cuda") for key, value in inputs.items()}  # Move to GPU

        # Generate translations
        outputs = model.generate(**inputs, max_length=128)

        # Decode and store translations
        batch_translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        translations.extend(batch_translations)
    return translations
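
The function above batches texts in their original order, so a batch that mixes short and long sentences is padded to its longest member. One optional refinement, not used in this notebook, is to group texts of similar length and then restore the original order. The sketch below shows the idea with a stand-in `translate_batch` callable; all names are hypothetical, and character length is only a rough proxy for token length.

```python
def length_sorted_batches(texts, batch_size):
    """Yield (indices, batch) pairs with texts grouped by similar length."""
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    for start in range(0, len(order), batch_size):
        indices = order[start:start + batch_size]
        yield indices, [texts[i] for i in indices]


def translate_all(texts, translate_batch, batch_size=32):
    """Translate texts in length-sorted batches, returning results in input order."""
    results = [None] * len(texts)
    for indices, batch in length_sorted_batches(texts, batch_size):
        for i, output in zip(indices, translate_batch(batch)):
            results[i] = output
    return results


# Stand-in for the model call: upper-case each string in the batch.
demo = translate_all(["bbb", "a", "cc"], lambda batch: [s.upper() for s in batch], batch_size=2)
print(demo)
```

In the notebook's setting, `translate_batch` would wrap the tokenize, `model.generate`, and `batch_decode` steps from the function above.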

In [19]:
# Extract validation source texts
val_source_texts = val_dataset["Source"]

# Generate baseline translations for the validation set
baseline_translations = generate_translation_in_batches(val_source_texts)


In [20]:
# Load BLEU metric
metric = evaluate.load("sacrebleu")

# Prepare references
references = [[text] for text in val_dataset["Target"]]

# Compute BLEU score
baseline_bleu = metric.compute(predictions=baseline_translations, references=references)
print(f"Baseline BLEU Score: {baseline_bleu['score']}")


Baseline BLEU Score: 8.103436946749364
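
To give a feel for what the BLEU number measures, the sketch below computes clipped n-gram precision, the core ingredient of BLEU, on whitespace-split tokens. It is a simplified illustration and not sacreBLEU's computation, which also applies its own tokenization, combines n-gram orders 1 to 4, and adds a brevity penalty. The function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(prediction, reference, n):
    """Fraction of predicted n-grams found in the reference, with counts clipped."""
    pred = ngrams(prediction.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(pred.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in pred.items())
    return matched / total


unigram = clipped_precision("the the cat", "the cat sat", n=1)
bigram = clipped_precision("the the cat", "the cat sat", n=2)
print(unigram, bigram)
```

Clipping is what stops a prediction from scoring well by repeating a common word: the second "the" in the example earns no credit because the reference contains it only once.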


## **Fine-Tune and Evaluate the Pre-Trained Model**

In [21]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=4,
    save_total_limit=2,
    predict_with_generate=True,
)




In [22]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
)
# Start fine-tuning
trainer.train()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.3585        | 0.328554        |
| 2     | 0.3143        | 0.29868         |
| 3     | 0.2946        | 0.286036        |
| 4     | 0.2826        | 0.282256        |




TrainOutput(global_step=35816, training_loss=0.326682807346367, metrics={'train_runtime': 6898.2506, 'train_samples_per_second': 166.129, 'train_steps_per_second': 5.192, 'total_flos': 3.8847526207488e+16, 'train_loss': 0.326682807346367, 'epoch': 4.0})
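
The step count and throughput in `TrainOutput` can be cross-checked against the dataset size and training arguments. The sketch below assumes a single GPU, no gradient accumulation, and that the last partial batch of each epoch is kept; these are the defaults, but the notebook does not state them explicitly.

```python
import math

train_rows = 286_500         # training set size printed earlier
batch_size = 32              # per_device_train_batch_size
epochs = 4                   # num_train_epochs
train_runtime_s = 6898.2506  # 'train_runtime' from TrainOutput

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_rows * epochs / train_runtime_s

print(f"steps per epoch:    {steps_per_epoch}")
print(f"total steps:        {total_steps}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"runtime in hours:   {train_runtime_s / 3600:.2f}")
```

Both derived values agree with the logged `global_step` and `train_samples_per_second`, which supports the assumption that training ran on one device without gradient accumulation.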

In [23]:
# Reuse `metric` and `generate_translation_in_batches` defined in the baseline cells;
# `model` now holds the fine-tuned weights.

validation_source_texts = val_dataset["Source"]
references = [[ref] for ref in val_dataset["Target"]]
predictions = generate_translation_in_batches(validation_source_texts)

# Compute BLEU score
fine_tuned_bleu = metric.compute(predictions=predictions, references=references)
print(f"Fine-Tuned BLEU Score: {fine_tuned_bleu['score']}")


Fine-Tuned BLEU Score: 27.946216540913316


## **Save the Fine-Tuned Model**

In [24]:
# Define the model name for clarity
model_name = "opus-mt-ti-en_fine_tuned"
save_path = f"/content/drive/MyDrive/Capstone/{model_name}"

# Save the fine-tuned model and tokenizer
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Fine-tuned Helsinki-NLP model saved successfully at: {save_path}")


Fine-tuned Helsinki-NLP model saved successfully at: /content/drive/MyDrive/Capstone/opus-mt-ti-en_fine_tuned


## **Evaluate the Pre-Trained and Fine-Tuned Models on the Test Dataset**

In [2]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "opus-mt-ti-en_fine_tuned"

# Path where the fine-tuned model was saved
load_path = f"/Capstone/{model_name}"

# Reload the fine-tuned model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(load_path)
model = AutoModelForSeq2SeqLM.from_pretrained(load_path)

# Move the model to GPU if available
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

print(f"Fine-tuned Helsinki-NLP model '{model_name}' loaded successfully!")




Fine-tuned Helsinki-NLP model 'opus-mt-ti-en_fine_tuned' loaded successfully!


In [3]:
def translate_sentence(sentence, source_lang="ti", target_lang="en"):
    # Set source and target language tokens for Helsinki-NLP
    tokenizer.src_lang = source_lang
    tokenizer.tgt_lang = target_lang

    # Tokenize and move inputs to GPU
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)

    # Generate translation
    outputs = model.generate(**inputs, max_length=128)

    # Decode the translation
    translated_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translated_sentence

# Example: translate a Tigrinya sentence to English
tigrinya_sentence = "ኣብ ዓመት ብገምጋም ካብ 20 ክሳዕ 30 ሚእታዊት ዝኾኑ ኣናህብ።"
english_translation = translate_sentence(tigrinya_sentence, source_lang="ti", target_lang="en")
print("Translated to English:", english_translation)


Translated to English: On average, 20 to 30 percent of bees per year.'


In [12]:
import pandas as pd

# Load the testing data
test_data = pd.read_csv("/Capstone/Dataset_csv/ti_to_en_test.csv")


In [5]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the pre-trained model and tokenizer
model_name = "Helsinki-NLP/opus-mt-ti-en"  # Replace with your model name
baseline_model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda")
baseline_tokenizer = AutoTokenizer.from_pretrained(model_name)



In [13]:
from datasets import Dataset

# Convert the test data to a Hugging Face Dataset
test_dataset = Dataset.from_pandas(test_data)


In [14]:
def generate_baseline_translation_in_batches(texts, batch_size=32):
    translations = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = baseline_tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=128).to("cuda")
        outputs = baseline_model.generate(**inputs, max_length=128, num_beams=5)
        batch_translations = baseline_tokenizer.batch_decode(outputs, skip_special_tokens=True)
        translations.extend(batch_translations)
    return translations

# Generate translations for the test dataset
test_source_texts = test_dataset["Source"]
baseline_translations = generate_baseline_translation_in_batches(test_source_texts)


In [15]:

import evaluate

test_reference_texts = test_dataset["Target"]
# Load BLEU metric
metric = evaluate.load("sacrebleu")

# Compute BLEU score for baseline model
baseline_result = metric.compute(predictions=baseline_translations, references=test_reference_texts)
print(f"Baseline BLEU Score: {baseline_result['score']}")


Baseline BLEU Score: 8.059410989071239


In [16]:
# Load the fine-tuned model and tokenizer
fine_tuned_model_path = "/Capstone/opus-mt-ti-en_fine_tuned"
fine_tuned_model = AutoModelForSeq2SeqLM.from_pretrained(fine_tuned_model_path).to("cuda")
fine_tuned_tokenizer = AutoTokenizer.from_pretrained(fine_tuned_model_path)



In [17]:
def generate_fine_tuned_translation_in_batches(texts, batch_size=32):
    translations = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = fine_tuned_tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=128).to("cuda")
        outputs = fine_tuned_model.generate(**inputs, max_length=128, num_beams=5)
        batch_translations = fine_tuned_tokenizer.batch_decode(outputs, skip_special_tokens=True)
        translations.extend(batch_translations)
    return translations

# Generate translations for the test dataset
fine_tuned_translations = generate_fine_tuned_translation_in_batches(test_source_texts)

In [18]:
# Compute BLEU score for fine-tuned model
fine_tuned_result = metric.compute(predictions=fine_tuned_translations, references=test_reference_texts)
print(f"Fine-Tuned BLEU Score: {fine_tuned_result['score']}")


Fine-Tuned BLEU Score: 27.576832796294102


In [19]:
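# Load the chrF metric. Note: with default arguments it computes plain chrF
# (word_order=0); chrF++ requires passing word_order=2 to compute().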
chrf_metric = evaluate.load("chrf")

# Compute chrF++ score
chrf_result = chrf_metric.compute(predictions=baseline_translations, references=test_reference_texts)
print(f"Baseline chrF++ Score: {chrf_result['score']}")

Baseline chrF++ Score: 31.3638776976514


In [20]:
# Reuse the chrF metric loaded in the previous cell (same default settings)

# Compute chrF++ score
chrf_result = chrf_metric.compute(predictions=fine_tuned_translations, references=test_reference_texts)
print(f"Fine-Tuned chrF++ Score: {chrf_result['score']}")

Fine-Tuned chrF++ Score: 48.17007127080958
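
For intuition about what the character-level score measures, the sketch below computes an F-score over character n-grams of a single order, with recall weighted more heavily than precision (beta = 2). It is a simplified illustration, not the sacreBLEU implementation: the real chrF averages over character n-gram orders 1 to 6, and chrF++ additionally includes word n-grams. The function names are illustrative.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def char_fscore(prediction, reference, n=2, beta=2.0):
    """F-beta score over character n-grams of a single order."""
    pred = char_ngrams(prediction, n)
    ref = char_ngrams(reference, n)
    overlap = sum((pred & ref).values())  # multiset intersection clips counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


print(char_fscore("abc", "abc"))
print(char_fscore("abcd", "abxx"))
print(char_fscore("abc", "xyz"))
```

Because it works on characters rather than whole words, a score of this kind gives partial credit for near-miss word forms, which is one reason character-level metrics are often reported alongside BLEU.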
