<a href="https://colab.research.google.com/github/Firojpaudel/GenAI-Chronicles/blob/main/Seq2Seq/Seq2Seq_BART_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **mBART**: *As a Translator* `(Eng-Nep)`

---

Previously, I implemented BART as a summarizer. Now it's time to use it as a language translator.

I'll set up English-to-Nepali translation using the `facebook/mbart-large-50-many-to-many-mmt` model from Hugging Face.

---

In [1]:
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name= "facebook/mbart-large-50-many-to-many-mmt"

tokenizer= MBart50TokenizerFast.from_pretrained(model_name)
model= MBartForConditionalGeneration.from_pretrained(model_name)

In [2]:
tokenizer.src_lang = "en_XX"
target_lang= "ne_NP"

In [3]:
def translate(text):
  inputs= tokenizer(text, return_tensors="pt").to('cpu')

  translated_tokens= model.generate(
      **inputs,
      forced_bos_token_id= tokenizer.lang_code_to_id[target_lang]
  )

  return tokenizer.decode(translated_tokens[0], skip_special_tokens= True)

In [4]:
eng_text = "Welcome to the notebook file!"

nep_translation= translate(eng_text)

print(nep_translation)

नोटबुक फाइलमा स्वागत छ!
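The `translate` function above handles one string at a time on the CPU. Below is a minimal sketch of a batched, device-aware variant. The name `translate_batch`, its parameters, and the `max_new_tokens` default are illustrative assumptions, not part of the notebook. It looks up the target-language token with `convert_tokens_to_ids`, an alternative to `lang_code_to_id`.

```python
def translate_batch(texts, model, tokenizer, target_lang="ne_NP", max_new_tokens=64):
    """Translate a list of strings in one generate() call (hypothetical helper)."""
    # Pad to the longest sentence so the batch forms one rectangular tensor,
    # then move the inputs to whichever device the model lives on.
    inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)

    generated = model.generate(
        **inputs,
        # Force the first generated token to be the target-language code.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
        max_new_tokens=max_new_tokens,
    )
    # Decode every sequence in the batch, dropping language codes and padding.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

With the objects loaded above, this would be called as `translate_batch([eng_text], model, tokenizer)`, which returns a list with one translation per input string.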


Well, it's working as intended, but what if we want to fine-tune the model further?

Okay, so I have curated a very small CSV file that pairs Gen-Z English with appropriate Nepali translations. Now, let's fine-tune on it.

In [None]:
!pip install datasets

In [6]:
from datasets import Dataset

import pandas as pd

df= pd.read_csv("/content/low_key.csv")
df.columns= df.columns.str.strip()
print(df.columns)

## Converting the dataframe to Hugging Face Dataset

dataset = Dataset.from_pandas(df)
print(dataset.features)

Index(['english', 'nepali'], dtype='object')
{'english': Value(dtype='string', id=None), 'nepali': Value(dtype='string', id=None)}


In [7]:
tokenizer= MBart50TokenizerFast.from_pretrained(model_name, src_lang= 'en_XX', tgt_lang="ne_NP")

In [8]:
##@ Preprocessing the Dataset:

def preprocess_func(examples):
  inputs = [ex for ex in examples['english']]
  targets = [ex for ex in examples['nepali']]
  model_inputs = tokenizer(inputs, max_length=128, truncation=True, padding= 'max_length')

  # Tokenize targets with the `text_target` keyword argument
  labels = tokenizer(text_target=targets, max_length=128, truncation=True, padding= 'max_length')

  model_inputs["labels"] = labels["input_ids"]
  return model_inputs

tokenized_dataset = dataset.map(preprocess_func, batched=True)
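Because `preprocess_func` pads the labels to `max_length`, the padded positions also count toward the loss unless they are masked. Hugging Face seq2seq models ignore label positions set to `-100`, so a common step is to replace pad ids with that value. Here is a minimal sketch. `mask_padding` is a hypothetical helper, and the token ids in the toy batch (including the pad id of `1`) are made up for illustration.

```python
def mask_padding(label_ids, pad_token_id, ignore_index=-100):
    """Replace pad ids with ignore_index in a batch of label sequences."""
    return [
        [ignore_index if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Toy batch of already-padded label ids (values are illustrative only).
labels = [[5, 17, 42, 2, 1, 1], [5, 99, 2, 1, 1, 1]]
masked = mask_padding(labels, pad_token_id=1)
print(masked)  # [[5, 17, 42, 2, -100, -100], [5, 99, 2, -100, -100, -100]]
```

Inside `preprocess_func`, this would be applied as `model_inputs["labels"] = mask_padding(labels["input_ids"], tokenizer.pad_token_id)`.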


In [9]:
from transformers import Trainer, TrainingArguments
import torch
##@ Setting up the training args

training_args= TrainingArguments(
    output_dir= './results',
    eval_strategy= 'epoch',
    learning_rate= 2e-5,
    per_device_train_batch_size= 4,
    per_device_eval_batch_size= 4,
    logging_steps= 5,
    weight_decay = 0.01,
    num_train_epochs=30,
    fp16= torch.cuda.is_available()
)

In [10]:
trainer = Trainer(
    model= model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,
    tokenizer=tokenizer,
)
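The `Trainer` above receives the same `tokenized_dataset` for both training and evaluation, so the reported validation loss is measured on data the model has already seen. Below is a minimal sketch of a reproducible hold-out split in plain Python. `split_rows`, the 20% fraction, the seed, and the placeholder rows are all assumptions for illustration.

```python
import random

def split_rows(rows, eval_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, eval) lists."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                   # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for repeatability
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

# Placeholder rows standing in for the 19 sentence pairs in the CSV.
rows = [{"english": f"en-{i}", "nepali": f"ne-{i}"} for i in range(19)]
train_rows, eval_rows = split_rows(rows)
print(len(train_rows), len(eval_rows))  # 15 4
```

With a Hugging Face `Dataset`, `dataset.train_test_split(test_size=0.2, seed=42)` should give a comparable split, and its `train` and `test` parts could then be passed as `train_dataset` and `eval_dataset`.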


In [11]:
torch.cuda.empty_cache()

In [12]:
trainer.train()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: firojpaudel (firojpaudel-madan-bhandari-memorial-college). Use `wandb login --relogin` to force relogin


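| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 11.8377 | 11.309142 |
| 2 | 10.8745 | 10.280623 |
| 3 | 10.106 | 9.661011 |
| 4 | 9.517 | 9.078266 |
| 5 | 8.9408 | 8.493481 |
| 6 | 8.3706 | 7.914397 |
| 7 | 7.8062 | 7.351004 |
| 8 | 7.2536 | 6.798064 |
| 9 | 6.7147 | 6.259949 |
| 10 | 6.1871 | 5.740569 |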




TrainOutput(global_step=150, training_loss=4.6632652091979985, metrics={'train_runtime': 153.0709, 'train_samples_per_second': 3.724, 'train_steps_per_second': 0.98, 'total_flos': 154407995965440.0, 'train_loss': 4.6632652091979985, 'epoch': 30.0})

Since the dataset is tiny (19 examples) and training ran for only 30 epochs, it's no surprise that the loss is still high. However, it decreases steadily from epoch to epoch, so the model is learning from the data.
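The `global_step=150` reported by `trainer.train()` can be checked by hand: 19 examples at a batch size of 4 give 5 optimizer steps per epoch (the last batch is smaller), and 30 epochs give 150 steps. The sketch below assumes a single device and no gradient accumulation, which matches the `TrainingArguments` shown earlier.

```python
import math

num_examples = 19   # rows in the dataset
batch_size = 4      # per_device_train_batch_size
num_epochs = 30     # num_train_epochs

# The final batch of each epoch holds the 3 leftover examples.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 5 150
```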

---

In [13]:
##@ first saving the finetuned model and testing

model.save_pretrained('./finetuned_mbart')
tokenizer.save_pretrained('./finetuned_mbart')

('./finetuned_mbart/tokenizer_config.json',
 './finetuned_mbart/special_tokens_map.json',
 './finetuned_mbart/sentencepiece.bpe.model',
 './finetuned_mbart/added_tokens.json',
 './finetuned_mbart/tokenizer.json')

In [14]:
##@ Time to test

model_name= "./finetuned_mbart"

tokenizer= MBart50TokenizerFast.from_pretrained(model_name)
model= MBartForConditionalGeneration.from_pretrained(model_name)

In [15]:
def translate_after(input_text):

    source_lang="en_XX"
    target_lang="ne_NP"
    # Set the tokenizer's source language
    tokenizer.src_lang = source_lang

    # Tokenize the input text
    inputs = tokenizer(input_text, return_tensors="pt").to('cpu')

    # Generate translation with the target language token forced at the beginning
    translated_tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id[target_lang]
    )

    # Decode the generated tokens to get the translated text
    return tokenizer.decode(translated_tokens[0], skip_special_tokens=True)

In [20]:
input_text= '''hey Bae! what's up?

 I Bet you are fine.

 How's your fam?
 '''

translated_after_text = translate_after(input_text)

print(translated_after_text)

हे प्रेमी! के भइरहेछ? म पक्का तिमी राम्रो छौ। तिम्रा परिवारमा के भइरहेछ?
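The test input above packs three short sentences into one multi-line string. One option, sketched here as an assumption rather than something the notebook does, is to translate each non-empty line separately and join the results. `translate_lines` is a hypothetical helper that accepts any single-string translation function, such as `translate_after`.

```python
def translate_lines(text, translate_fn):
    """Translate each non-empty line of text independently (hypothetical helper)."""
    # Drop blank lines and surrounding whitespace before translating.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(translate_fn(line) for line in lines)
```

It would be called as `print(translate_lines(input_text, translate_after))`. Whether line-by-line translation actually improves the output for this model would need to be checked against the single-call result.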


😵‍💫 Well it's working I guess 🤷‍♂️