# Fine-Tuning T5 Model for English-to-Tigrinya Translation

This script fine-tunes the `t5-large` model for English-to-Tigrinya translation. The process involves data preprocessing, model training, and evaluation using the BLEU metric. Below is a summary of the workflow and observations.

---

## Key Steps

### 1. **Model and Tokenizer Initialization**
- The pre-trained `t5-large` model and tokenizer are loaded from the Hugging Face library.
- The tokenizer is configured to handle English as the source language and Tigrinya as the target language.

### 2. **Data Preprocessing**
- Training and validation datasets are tokenized with padding and truncation to a maximum sequence length of 128 tokens.
- Labels are generated for the target language using the tokenizer.

### 3. **Training Process**
- The model is fine-tuned for 3 epochs using the Hugging Face `Trainer` class with:
  - Learning rate: `3e-5`
  - Batch size: 8
  - Weight decay: 0.01
- Dynamic padding is implemented via a `DataCollatorForSeq2Seq`.

### 4. **Evaluation**
- Translations are generated for the validation set using the fine-tuned model.
- BLEU scores are calculated to evaluate translation quality.

---

## Observations

### Baseline BLEU Score
- The baseline BLEU score after fine-tuning is **0.0028**, indicating very poor translation quality.

### Challenges with the T5 Model
1. **Low BLEU Score**:
   - Despite fine-tuning, the BLEU score is extremely low, suggesting that the model struggles to generate accurate translations for the English-to-Tigrinya task.
   - This reflects the model's inability to effectively learn patterns between the two languages.

2. **Inadequate Pre-training for Tigrinya**:
   - The `t5-large` model is not specifically pre-trained on Tigrinya, a low-resource language, limiting its capacity to understand and translate to/from Tigrinya.

3. **Metric Limitations**:
   - The BLEU score may fail to fully capture the quality of translations for morphologically rich languages like Tigrinya.

---

## Conclusion

The fine-tuned T5 model is not effective for English-to-Tigrinya translation. The BLEU score remains negligible even after fine-tuning, highlighting the following limitations:
- Lack of sufficient pre-training on Tigrinya data.
- Challenges posed by the complexity and morphology of the Tigrinya language.

### Recommendations
- Use a model pre-trained or fine-tuned on multilingual datasets that include Tigrinya.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install transformers datasets evaluate


Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m11.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m9

In [3]:
import pandas as pd
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import evaluate


## Load and Prepare the Dataset


In [10]:
# Load the dataset
train_df = pd.read_csv("/Capstone/Dataset_csv/en_to_ti_sample_train.csv")
val_df = pd.read_csv("/Capstone/Dataset_csv/en_to_ti_sample_test.csv")

# Convert to Hugging Face Dataset
hf_train = Dataset.from_pandas(train_df)
hf_val = Dataset.from_pandas(val_df)

In [11]:
val_df.head()

Unnamed: 0.1,Unnamed: 0,Source,Target
0,76374,'10. We work hard in the greatest possible way...,'10. ደረቕ፡ ስልኩይን ስራሕ-ኣልቦን ከይንኸውን ብዝለዓለ መልክዑ ንጽዕር።'
1,48373,'Sanchez has also previously received requests...,'ሳንቸስ ኣቐዲሙ’ውን ካብ ኢንተር ሚላን ጠለብ ረኺቡ ኔሩ’ዩ።'
2,17909,"'For pass over the isles of Chittim, and see; ...",'ናብ ደሴታት ኪቲም ተሳገሩ እሞ ርኣዩ፡ ናብ ቄዳር ልኣኹ እሞ ኣጸቢቕኩም...
3,70958,"'Thanks!"" And 'on the other side ""? ""We are fi...",'“መስገን!” ብኣኻ ወገን’ከ? “ደሓን ኢና ጽቡቕ።'
4,75930,"'A further 29 liberals, including eight newbor...",'በቲ ዝሓለፈ ሰሉስ ለይቲ ዘጋጠመ ሓደጋ፡ ካልኦት ሸሞንተ ናጽላታት ዝርከ...


## Load the T5 Model and Tokenizer

In [12]:
# Load T5 model and tokenizer
model_name = "t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


## Tokenize the Dataset
Prepare the dataset for the model by tokenizing the inputs and outputs.

In [14]:
def preprocess_function(examples):
    # Tokenize the source (English in this case)
    model_inputs = tokenizer(
        examples["Source"],
        max_length=128,
        truncation=True,
        padding="max_length",
    )
    # Tokenize the target (Tigrinya in this case)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["Target"],
            max_length=128,
            truncation=True,
            padding="max_length",
        )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Tokenize the dataset
tokenized_train = hf_train.map(preprocess_function, batched=True)
tokenized_val = hf_val.map(preprocess_function, batched=True)


Map:   0%|          | 0/6148 [00:00<?, ? examples/s]



Map:   0%|          | 0/1538 [00:00<?, ? examples/s]

## Generate Translations
Run the model to generate translations on the validation dataset.

In [15]:
def generate_translation_in_batches(texts, batch_size=32):
    translations = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=128)
        outputs = model.generate(**inputs, max_length=128)
        batch_translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        translations.extend(batch_translations)
    return translations

# Generate translations for validation set
validation_texts = hf_val["Source"]
translations = generate_translation_in_batches(validation_texts)


In [17]:
!pip install sacrebleu

Collecting sacrebleu
  Downloading sacrebleu-2.4.3-py3-none-any.whl.metadata (51 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/51.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.8/51.8 kB[0m [31m2.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting portalocker (from sacrebleu)
  Downloading portalocker-3.0.0-py3-none-any.whl.metadata (8.5 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading sacrebleu-2.4.3-py3-none-any.whl (103 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m104.0/104.0 kB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Downloading portalocker-3.0.0-py3-none-any.whl (19 kB)
Installing collected packages: portalocker, colorama, sacrebleu
Successfully installed colorama-0.4.6 portalocker-3.0.0 sacrebleu-2.4.3


In [18]:
# Load BLEU metric
metric = evaluate.load("sacrebleu")

# Prepare references
references = [[text] for text in hf_val["Target"]]

# Compute BLEU score
result = metric.compute(predictions=translations, references=references)
print(f"Baseline BLEU Score: {result['score']}")

Baseline BLEU Score: 0.010315360646188536


In [20]:
from transformers import DataCollatorForSeq2Seq, TrainingArguments, Trainer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Define the data collator for dynamic padding
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True,
)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",             # Directory to save the checkpoints and model
    evaluation_strategy="epoch",       # Evaluate after each epoch
    save_strategy="epoch",             # Save model after each epoch
    learning_rate=3e-5,                # Learning rate
    per_device_train_batch_size=8,     # Training batch size
    per_device_eval_batch_size=8,      # Evaluation batch size
    num_train_epochs=3,                # Number of training epochs
    weight_decay=0.01,                 # Weight decay for regularization
    logging_dir="./logs",              # Directory to save logs
    logging_steps=500,                 # Log every 500 steps
    save_total_limit=2,                # Limit the number of saved checkpoints
)


# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Save the fine-tuned model and tokenizer
model.save_pretrained("./fine_tuned_t5")
tokenizer.save_pretrained("./fine_tuned_t5")


  trainer = Trainer(
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss
1,0.7065,0.043272
2,0.0437,0.040859
3,0.0415,0.039712


('./fine_tuned_t5/tokenizer_config.json',
 './fine_tuned_t5/special_tokens_map.json',
 './fine_tuned_t5/spiece.model',
 './fine_tuned_t5/added_tokens.json',
 './fine_tuned_t5/tokenizer.json')

In [22]:
# Load the BLEU metric
metric = evaluate.load("sacrebleu")

def generate_translation_in_batches(texts, batch_size=32):
    translations = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]

        # Tokenize and move inputs to GPU
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=128)
        inputs = {key: value.to("cuda") for key, value in inputs.items()}  # Move to GPU

        # Generate translations
        outputs = model.generate(**inputs, max_length=128)

        # Decode and store translations
        batch_translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        translations.extend(batch_translations)
    return translations

validation_source_texts = hf_val["Source"]
references = [[ref] for ref in hf_val["Target"]]
predictions = generate_translation_in_batches(validation_source_texts)

# Compute BLEU score
fine_tuned_bleu = metric.compute(predictions=predictions, references=references)
print(f"Fine-Tuned BLEU Score: {fine_tuned_bleu['score']}")



Fine-Tuned BLEU Score: 0.002799700973168112
