mT5 is an existing pre-trained multilingual language model covering 101 languages, and I will fine-tune it for a translation task.

It uses a **sequence-to-sequence** format for language translation.

It can be adapted for low-resource languages.

The model is trained with Simple Transformers, a library built on top of Hugging Face Transformers.

The underlying Transformer architecture comes from the paper "Attention Is All You Need".

It has an encoder and a decoder, each made up of stacked sub-layers:

 Some layers apply dropout regularization, and training can optionally run on a GPU.

 Dropout reduces overfitting and improves the generalization of a model. It randomly "drops out" (i.e., sets to zero) a certain fraction of the neurons or units in a neural network during each training iteration.
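To make the dropout description concrete, here is a minimal sketch of "inverted" dropout on a plain Python list. It is an illustration only, not the code used inside mT5 or Simple Transformers; the function name `dropout` and its parameters are assumptions chosen for this example.

```python
import random


def dropout(activations, p=0.1, training=True, rng=None):
    """Illustrative inverted dropout on a list of floats (not library code)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # At evaluation time dropout is switched off and values pass through.
        return list(activations)
    rng = rng or random.Random()
    keep_prob = 1.0 - p
    # Zero each unit with probability p and rescale the survivors by
    # 1 / keep_prob so the expected activation stays the same.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


units = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(units, p=0.5, rng=random.Random(0))  # some units zeroed
eval_out = dropout(units, p=0.5, training=False)         # unchanged
print(train_out)
print(eval_out)
```

During training a different random subset of units is zeroed on every call, which is what discourages the network from relying on any single unit.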

In [None]:
!pip install simpletransformers

In [None]:
import os
import pandas as pd
import warnings

# Suppress the UserWarning related to do_sample and top_p
warnings.filterwarnings("ignore", category=UserWarning, module="transformers.generation.configuration_utils")


## Data preparation
This function splits the data into training and evaluation datasets in an 80:20 ratio (80% training, 20% evaluation).

In [None]:
def prepare_translation_datasets(csv_filename, train_ratio=0.8):
    data_df = pd.read_csv(csv_filename)
    train_size = int(len(data_df) * train_ratio)
    # Copy the slices so that adding a column does not raise SettingWithCopyWarning
    train_df = data_df[:train_size].copy()
    eval_df = data_df[train_size:].copy()

    # Add the task prefix to both splits
    train_df["prefix"] = "translate Luhya to Swahili"
    eval_df["prefix"] = "translate Luhya to Swahili"

    return train_df, eval_df

In [None]:
train_df, eval_df = prepare_translation_datasets("preprocessed_data.csv")

In [None]:
train_df

In [None]:
train_df.to_csv("train.tsv", sep="\t")
eval_df.to_csv("eval.tsv", sep="\t")

## Model Training

**logging.basicConfig()**:
* Configures Python's logging system to record messages with a severity level of "INFO" and above. It is a common way to set up logging for debugging.
* This means that messages with "INFO", "WARNING", "ERROR", and "CRITICAL" severity are recorded, while "DEBUG" messages are dropped.
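The level filtering described above can be checked with a small, self-contained sketch. The logger name `level_demo` and the `ListHandler` class are hypothetical helpers for this illustration; they are not part of the notebook's training code.

```python
import logging


class ListHandler(logging.Handler):
    """Collects messages in memory so the effect of the level is easy to see."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


demo_logger = logging.getLogger("level_demo")  # hypothetical logger name
demo_logger.setLevel(logging.INFO)             # same threshold as in this notebook
demo_logger.propagate = False                  # keep the demo away from the root logger
handler = ListHandler()
demo_logger.addHandler(handler)

demo_logger.debug("tokenizing batch")      # below INFO, dropped
demo_logger.info("epoch started")          # recorded
demo_logger.warning("sequence truncated")  # recorded

print(handler.messages)
```

Only the INFO and WARNING messages reach the handler; the DEBUG message is filtered out by the logger's level.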



In [None]:
import logging
# T5Model and T5Args cover T5-based models (including mT5) and their training arguments
from simpletransformers.t5 import T5Model, T5Args

logging.basicConfig(level=logging.INFO)  # record INFO and above
# Keep the transformers library quieter: only WARNING and above
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

In [None]:
train_df = pd.read_csv("train.tsv", sep="\t").astype(str)
eval_df = pd.read_csv("eval.tsv", sep="\t").astype(str)

train_df["prefix"] = ""
eval_df["prefix"] = ""

A prefix value tells an mT5 (or T5) model which task to perform. This is useful when training a single model to perform multiple tasks.
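As a rough illustration of how prefixes separate tasks, the sketch below builds prefixed inputs using the `prefix: text` convention that the `translate_text` function later in this notebook follows. The helper `add_task_prefix` and the second (reverse) task are hypothetical; this notebook trains only the Luhya to Swahili direction.

```python
def add_task_prefix(prefix, text):
    """Join a task prefix and an input sentence as 'prefix: text'."""
    return f"{prefix}: {text}"


examples = [
    ("translate Luhya to Swahili", "si ndareta"),
    ("translate Swahili to Luhya", "nilileta"),  # hypothetical second task
]

# Each input carries its own task instruction, so one model can serve both tasks.
model_inputs = [add_task_prefix(prefix, text) for prefix, text in examples]
for line in model_inputs:
    print(line)
```

Whatever convention is chosen, the same prefix format should be used at training time and at prediction time.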

In [None]:
model_args = T5Args()
model_args.max_seq_length = 128  # maximum number of tokens per input sequence
model_args.train_batch_size = 4  # number of samples processed together in each training step
model_args.eval_batch_size = 4
model_args.num_train_epochs = 8  # number of full passes over the training dataset
model_args.evaluate_during_training = True  # evaluate on the validation set while training
model_args.evaluate_during_training_steps = 20000  # evaluation interval, measured in training steps
model_args.use_multiprocessing = False  # do not spawn extra processes when preparing the data
model_args.fp16 = False
model_args.learning_rate = 3e-4
model_args.save_steps = -1
model_args.save_eval_checkpoints = False
model_args.no_cache = True
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.preprocess_inputs = False
model_args.num_return_sequences = 1
model_args.wandb_project = "Luhya-Swahili Translation"
model_args.use_cuda = False

model = T5Model("mt5", "google/mt5-base", args=model_args)

In [None]:
print(train_df.columns)

In [None]:
train_df["prefix"] = "translate Luhya to Swahili"
train_df["input_text"] = train_df["Luhya"]
train_df["target_text"] = train_df["Swahili"]

In [None]:
eval_df["prefix"] = "translate Luhya to Swahili"
eval_df["input_text"] = eval_df["Luhya"]
eval_df["target_text"] = eval_df["Swahili"]

In [None]:
model.train_model(train_df, eval_data=eval_df)

(8240,
 {'global_step': [1030, 2060, 3090, 4120, 5150, 6180, 7210, 8240],
  'eval_loss': [4.436595710673074,
   4.850429329761239,
   4.103908126668412,
   4.065062848172446,
   4.021116838898769,
   4.3030334184336105,
   4.37587941709415,
   4.736433420994485],
  'train_loss': [5.03427791595459,
   4.277291774749756,
   2.3910603523254395,
   1.544602394104004,
   2.1327333450317383,
   0.8446611166000366,
   1.3017898797988892,
   1.428533673286438]})
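As a small worked example, the history returned by `train_model` above can be scanned for the evaluation step with the lowest loss. The values are copied from the output (rounded to four decimals); the variable names are illustrative.

```python
history = {
    "global_step": [1030, 2060, 3090, 4120, 5150, 6180, 7210, 8240],
    "eval_loss": [4.4366, 4.8504, 4.1039, 4.0651, 4.0211, 4.3030, 4.3759, 4.7364],
    "train_loss": [5.0343, 4.2773, 2.3911, 1.5446, 2.1327, 0.8447, 1.3018, 1.4285],
}

# Index of the evaluation with the smallest eval_loss
best_idx = min(range(len(history["eval_loss"])), key=history["eval_loss"].__getitem__)
best_step = history["global_step"][best_idx]
best_loss = history["eval_loss"][best_idx]

print(f"lowest eval_loss {best_loss:.4f} at step {best_step} (evaluation {best_idx + 1} of 8)")
```

The lowest evaluation loss falls at step 5150. After that point it rises again while the training loss stays well below its starting value, a pattern that may indicate overfitting.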

In [None]:
model

<simpletransformers.t5.t5_model.T5Model at 0x7a1efc4df610>

sacrebleu is the library used to calculate the BLEU score.

In [None]:
!pip install sacrebleu

In [None]:
import sacrebleu
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

In [None]:
val_df = pd.read_csv("eval.tsv", sep="\t").astype(str)

# The model was trained in one direction only (Luhya to Swahili), so both pairs
# below select the same rows: Luhya inputs and Swahili reference translations.
luhya_truth = [eval_df.loc[eval_df["prefix"] == "translate Luhya to Swahili"]["target_text"].tolist()]
to_luhya = eval_df.loc[eval_df["prefix"] == "translate Luhya to Swahili"]["input_text"].tolist()

swahili_truth = [eval_df.loc[eval_df["prefix"] == "translate Luhya to Swahili"]["target_text"].tolist()]
to_swahili = eval_df.loc[eval_df["prefix"] == "translate Luhya to Swahili"]["input_text"].tolist()


In [None]:
import torch

In [None]:
torch.cuda.empty_cache()

In [None]:
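# Predict
# Note: to_luhya and to_swahili hold the same Luhya inputs, and both truth lists
# hold the same Swahili references, so the two BLEU scores below measure the
# same Luhya-to-Swahili direction, which is why they match in the output.
luhya_preds = model.predict(to_luhya)

swa_luh_bleu = sacrebleu.corpus_bleu(luhya_preds, luhya_truth)
print("--------------------------")
print("Swahili to Luhya: ", swa_luh_bleu.score)

swahili_preds = model.predict(to_swahili)

luh_swa_bleu = sacrebleu.corpus_bleu(swahili_preds, swahili_truth)
print("Luhya to Swahili: ", luh_swa_bleu.score)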

--------------------------
Swahili to Luhya:  3.0748070049807694


Luhya to Swahili:  3.0748070049807694


In [None]:
import os

Trial: additional evaluation with METEOR

In [None]:
import nltk
from nltk.translate import meteor_score
import pandas as pd
import sacrebleu
import logging

In [None]:
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

In [None]:
import locale
from nltk import download

# Set the locale to UTF-8
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

# Download the NLTK data
download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
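# METEOR scores
# Caution: meteor_score scores a single hypothesis against its references and,
# in recent NLTK versions, expects pre-tokenized input. Passing the full corpus
# lists, as done here, does not give a corpus-level METEOR, so treat the values
# printed below with care.
meteor_swa_luh = meteor_score.meteor_score(swahili_truth, luhya_preds)
meteor_luh_swa = meteor_score.meteor_score(luhya_truth, swahili_preds)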

print("--------------------------")
print("METEOR Scores:")
print("Swahili to Luhya:")
print("METEOR Score:", meteor_swa_luh)

print("\nLuhya to Swahili:")
print("METEOR Score:", meteor_luh_swa)

--------------------------
METEOR Scores:
Swahili to Luhya:
METEOR Score: 0.000970873786407767

Luhya to Swahili:
METEOR Score: 0.000970873786407767
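The near-zero values above suggest the metric was not applied sentence by sentence. One possible way to get a corpus-level figure is to score each sentence pair separately and average the results. The sketch below is an assumption about how that could be done, not the method used in this notebook: `corpus_average` and `token_overlap` are hypothetical helpers, the data is a toy example, and the stand-in scorer is there only so the block runs without NLTK.

```python
from statistics import mean


def corpus_average(references, hypotheses, sentence_scorer, tokenize=str.split):
    """Average a sentence-level metric over aligned reference/hypothesis strings."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    scores = [
        # One tokenized reference list per sentence, plus the tokenized hypothesis
        sentence_scorer([tokenize(ref)], tokenize(hyp))
        for ref, hyp in zip(references, hypotheses)
    ]
    return mean(scores)


def token_overlap(tokenized_refs, tokenized_hyp):
    """Stand-in scorer: share of hypothesis tokens that occur in the first reference."""
    if not tokenized_hyp:
        return 0.0
    ref_tokens = set(tokenized_refs[0])
    return sum(tok in ref_tokens for tok in tokenized_hyp) / len(tokenized_hyp)


refs = ["nilileta", "alienda"]  # toy references
hyps = ["nilileta", "ndege"]    # toy predictions: one match, one miss
print(corpus_average(refs, hyps, token_overlap))

# With NLTK installed, the same helper could be called with METEOR instead:
# corpus_average(refs, hyps, meteor_score.meteor_score)
```

Passing the scorer as an argument keeps the averaging logic independent of the metric, so the same loop works for any sentence-level score with this calling convention.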


Loading the fine-tuned model with Hugging Face Transformers

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model_path = "outputs"  # Replace with the path to your trained model
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

tokenizer.save_pretrained("custom_tokenizer")




('custom_tokenizer/tokenizer_config.json',
 'custom_tokenizer/special_tokens_map.json',
 'custom_tokenizer/spiece.model',
 'custom_tokenizer/added_tokens.json',
 'custom_tokenizer/tokenizer.json')

In [None]:
def translate_text(original_text, model, tokenizer):
    inputs = tokenizer.encode("translate Luhya to Swahili: " + original_text, return_tensors="pt")
    translation = model.generate(inputs, max_length=90, num_return_sequences=1)
    translated_text = tokenizer.decode(translation[0], skip_special_tokens=True)
    return translated_text


In [None]:
original_text = "si ndareta"
translated_text = translate_text(original_text, model, tokenizer)
print("Original Text:", original_text)
print("Translated Text:", translated_text)

Original Text: si ndareta
Translated Text: nilileta


In [None]:
original_text = "lidudu"
translated_text = translate_text(original_text, model, tokenizer)
print("Original Text:", original_text)
print("Translated Text:", translated_text)

Original Text: lidudu
Translated Text: ndege


In [None]:
original_text = "yatsia"
translated_text = translate_text(original_text, model, tokenizer)
print("Original Text:", original_text)
print("Translated Text:", translated_text)

Original Text: yatsia
Translated Text: alienda


In [None]:
!pip install huggingface_hub


In [None]:
from huggingface_hub import Repository

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
model.push_to_hub("")


# Push the tokenizer to the Hugging Face Hub
tokenizer.push_to_hub("")