<a href="https://colab.research.google.com/github/B0BWAX/MT5-TRANSLATION-FINETUNING/blob/main/MT5_TRANSLATION_FINETUNING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install datasets transformers sentencepiece sacrebleu



In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Load Model and Tokenizer

Model: [Google MT5-Small](https://huggingface.co/google/mt5-small)

In [3]:
model_checkpoint = "google/mt5-small"

In [4]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)



In [5]:
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
model = model.cuda()

## Load Dataset

Translation Dataset: [XNLI](https://huggingface.co/datasets/xnli)

In [6]:
import datasets
from datasets import load_dataset, Dataset

In [7]:
dataset = load_dataset("xnli", "all_languages")

In [8]:
dataset

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 392702
    })
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 5010
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 2490
    })
})

In [9]:
train = dataset['train'].shuffle(seed=342).select(range(30000))
validation = dataset['validation']
test = dataset['test'].select(range(3000))

In [24]:
test[10]['premise']

{'ar': 'هناك الكثير تستطيع التحدث عنه  وأنا سوف أتاجاوز ذلك تماما .',
 'bg': 'Има толкова много, което може да се разкаже за това, че просто ще го пропусна.',
 'de': 'Es gibt so viel was ich darüber erzählen könnte, ich überspringe das einfach.',
 'el': "Υπάρχουν τόσα πολλά που θα μπορούσες να μιλήσεις γι 'αυτό απλά θα τα παραλείψω.",
 'en': "There's so much you could talk about on that I'll just skip that.",
 'es': 'Hay tanto que se puede decir sobre eso, que sencillamente me voy a saltar eso.',
 'fr': "Il y a tellement de choses dont vous pourriez parler que je vais juste m'en passer.",
 'hi': 'इतना है कि आप इसके बारे में बात कर सकते हैं कि मैं इसे छोड़ूँगा|',
 'ru': 'Об этом можно так много говорить, что я опущу подробности.',
 'sw': 'Kuna mengi ambayo unaweza kuzungumzia kuhusu hilo lakini  nitaachana nayo tu.',
 'th': 'มันมีอีกมากที่คุณสามารถพูดคุยเกี่ยวกับสิ่งนั้น ฉันจะข้ามไปละกัน',
 'tr': 'Bu konu hakkında söyleyebileceğin çok şey var pas geçiyorum.',
 'ur': 'بہت اتنا ہے کہ آپ ا

## Preprocessing

We will fine-tune the model on the following languages:
* English
* French
* Spanish
* Arabic

We specify which language the model should translate into by adding a language token as a prefix at the beginning of the input.
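A minimal sketch of the prefixing idea, independent of the tokenizer: the target-language token is prepended to the source sentence, so the same input text can request different output languages. The helper name `add_lang_prefix` and the local `lang_tokens` dictionary are illustrative assumptions that mirror the mapping used in this notebook; they are not part of any library.

```python
# Illustrative mapping from language code to prefix token
# (assumed here; it mirrors the mapping this notebook uses).
lang_tokens = {'en': '<en>', 'fr': '<fr>', 'es': '<es>', 'ar': '<ar>'}

def add_lang_prefix(text, target_lang, lang_token_map=lang_tokens):
    """Prepend the target-language token to the source text (hypothetical helper)."""
    return lang_token_map[target_lang] + text

source = "There's so much you could talk about on that I'll just skip that."

# The same English sentence, tagged for two different target languages
print(add_lang_prefix(source, 'fr'))
print(add_lang_prefix(source, 'es'))
```

Only the prefix changes between the two calls; the model has to learn during fine-tuning that the token selects the output language.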

In [11]:
max_seq_len = 32

In [41]:
LANG_TOKEN_MAPPING = {
    'en': '<en>',
    'fr': '<fr>',
    'es': '<es>',
    'ar': '<ar>'
}

In [13]:
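# Register the language tokens as special tokens, then resize the
# embedding matrix so the model has a row for each new token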
special_tokens_dict = {'additional_special_tokens': list(LANG_TOKEN_MAPPING.values())}
tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))

Embedding(250104, 512)

In [14]:
import numpy as np
import torch

def encode_input_str(text, target_lang, tokenizer, seq_len,
                     lang_token_map=LANG_TOKEN_MAPPING):
  target_lang_token = lang_token_map[target_lang]

  # Tokenize and add special tokens
  input_ids = tokenizer.encode(
      text = target_lang_token + text,
      return_tensors = 'pt',
      padding = 'max_length',
      truncation = True,
      max_length = seq_len)

  return input_ids[0]

def encode_target_str(text, tokenizer, seq_len,
                      lang_token_map=LANG_TOKEN_MAPPING):
  token_ids = tokenizer.encode(
      text = text,
      return_tensors = 'pt',
      padding = 'max_length',
      truncation = True,
      max_length = seq_len)

  return token_ids[0]

def format_translation_data(translations, lang_token_map,
                            tokenizer, seq_len=128):
  # Choose two distinct languages at random: one for input, one for output
  langs = list(lang_token_map.keys())
  input_lang, target_lang = np.random.choice(langs, size=2, replace=False)

  # Get the input and target sentences for this example
  input_text = translations[input_lang]
  target_text = translations[target_lang]

  if input_text is None or target_text is None:
    return None

  input_token_ids = encode_input_str(
      input_text, target_lang, tokenizer, seq_len, lang_token_map)

  target_token_ids = encode_target_str(
      target_text, tokenizer, seq_len, lang_token_map)

  return input_token_ids, target_token_ids

def transform_batch(batch, lang_token_map, tokenizer):
  inputs = []
  targets = []
  for translation_set in batch['premise']:
    formatted_data = format_translation_data(
        translation_set, lang_token_map, tokenizer, max_seq_len)

    if formatted_data is None:
      continue

    input_ids, target_ids = formatted_data
    inputs.append(input_ids.unsqueeze(0))
    targets.append(target_ids.unsqueeze(0))

  batch_input_ids = torch.cat(inputs).cuda()
  batch_target_ids = torch.cat(targets).cuda()

  return batch_input_ids, batch_target_ids

def get_data_generator(dataset, lang_token_map, tokenizer, batch_size=32):
  dataset = dataset.shuffle()
  for i in range(0, len(dataset), batch_size):
    raw_batch = dataset[i:i+batch_size]
    yield transform_batch(raw_batch, lang_token_map, tokenizer)

In [15]:
# Testing `format_translation_data`
in_ids, out_ids = format_translation_data(
    train[10]['premise'], LANG_TOKEN_MAPPING, tokenizer)

print(' '.join(tokenizer.convert_ids_to_tokens(in_ids)))
print(' '.join(tokenizer.convert_ids_to_tokens(out_ids)))

# Testing data generator
data_gen = get_data_generator(train, LANG_TOKEN_MAPPING, tokenizer, 8)
data_batch = next(data_gen)
print('Input shape:', data_batch[0].shape)
print('Output shape:', data_batch[1].shape)

<en> ▁اه ▁ , ▁لا ▁ , ▁ انها ▁ ليست ▁ مزد حم ة ▁ انها ▁بع يدة ▁ بما ▁ فيه ▁الك فا ية ▁ لكي ▁لا ▁ت كون ▁ مزد حم ة ▁ وهي ▁محا طة ▁بال ▁ال منح درات ▁ ل ذا ▁ مهما ▁ كانت ▁الر ياح ▁ ه ناك ▁دائ ما ▁جز ء ▁كبير ▁من ▁البح يرة ▁ هاد ئة </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
▁ah ▁no ▁it ▁ ' s ▁not ▁over ▁crowd ed ▁it ▁ ' s ▁ uh ▁it ▁it ▁far ▁ enough ▁away ▁so ▁it ▁ ' s ▁not ▁over ▁crowd ed ▁and ▁it ▁ ' s ▁ surround ed ▁by ▁ cliff s ▁so ▁no ▁matter ▁how ▁much ▁wind ▁it ▁is ▁there ▁is ▁ always ▁ a ▁big ▁ portion ▁of ▁the ▁lake ▁that ▁ ' s ▁calm </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 

## Finetuning

In [16]:
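# `transformers.AdamW` is deprecated; use the PyTorch optimizer instead
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

n_epochs = 10
batch_size = 40
print_freq = 100
checkpoint_freq = 500
lr = 5e-4
n_batches = int(np.ceil(len(train) / batch_size))
total_steps = n_epochs * n_batches
n_warmup_steps = int(total_steps * 0.01)

# weight_decay=0.0 matches the default of the old `transformers.AdamW`
optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.0)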
scheduler = get_linear_schedule_with_warmup(
    optimizer, n_warmup_steps, total_steps)

losses = []
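
The learning-rate values printed during training can be checked by hand. The sketch below is a plain-Python restatement of a linear warmup followed by linear decay, which is the assumed behaviour of `get_linear_schedule_with_warmup`. The function name `linear_warmup_lr` is hypothetical, and the constants are copied from the cell above with 30,000 training examples.

```python
import math

def linear_warmup_lr(step, base_lr, n_warmup_steps, total_steps):
    """Learning rate after `step` scheduler updates (hypothetical helper)."""
    if step < n_warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr
        return base_lr * step / max(1, n_warmup_steps)
    # Decay: ramp linearly from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - n_warmup_steps)

# Constants from the hyperparameter cell (30,000 examples, batch size 40)
n_batches = math.ceil(30000 / 40)          # batches per epoch
total_steps = 10 * n_batches               # n_epochs * n_batches
n_warmup_steps = int(total_steps * 0.01)   # 1% of all steps

# Learning rate after the first 100 optimizer steps
print(linear_warmup_lr(100, 5e-4, n_warmup_steps, total_steps))
```

With these settings warmup lasts only 75 steps, so by step 100 of the first epoch the rate is already decaying, which is consistent with the first logged value of roughly 0.000498.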



In [17]:
def eval_model(model, gdataset, max_iters=8):
  test_generator = get_data_generator(gdataset, LANG_TOKEN_MAPPING,
                                      tokenizer, batch_size)
  eval_losses = []
  # Evaluation needs no gradients; skipping them saves GPU memory
  with torch.no_grad():
    for i, (input_batch, label_batch) in enumerate(test_generator):
      if i >= max_iters:
        break

      model_out = model.forward(
          input_ids = input_batch,
          labels = label_batch)
      eval_losses.append(model_out.loss.item())

  return np.mean(eval_losses)

In [18]:
from tqdm.notebook import tqdm

for epoch_idx in range(n_epochs):
  # Randomize data order
  data_generator = get_data_generator(train, LANG_TOKEN_MAPPING,
                                      tokenizer, batch_size)

  for batch_idx, (input_batch, label_batch) \
      in tqdm(enumerate(data_generator), total=n_batches):
    optimizer.zero_grad()

    # Forward pass
    model_out = model.forward(
        input_ids = input_batch,
        labels = label_batch)

    # Calculate loss and update weights
    loss = model_out.loss
    losses.append(loss.item())
    loss.backward()
    optimizer.step()
    scheduler.step()

    # Print training update info
    if (batch_idx + 1) % print_freq == 0:
      avg_loss = np.mean(losses[-print_freq:])
      print('Epoch: {} | Step: {} | Avg. loss: {:.3f} | lr: {}'.format(
          epoch_idx+1, batch_idx+1, avg_loss, scheduler.get_last_lr()[0]))

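    # Checkpoint periodically. The message says "test loss", but the
    # value is the loss on the validation split.
    if (batch_idx + 1) % checkpoint_freq == 0:
      validation_loss = eval_model(model, validation)
      print('Saving model with test loss of {:.3f}'.format(validation_loss))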
      torch.save(model.state_dict(), '/content/drive/MyDrive/mt5_finetuned_state_dict.pt')
      torch.save(model, '/content/drive/MyDrive/mt5_finetuned_model.pt')



  0%|          | 0/750 [00:00<?, ?it/s]

Epoch: 1 | Step: 100 | Avg. loss: 12.383 | lr: 0.0004983164983164984
Epoch: 1 | Step: 200 | Avg. loss: 3.761 | lr: 0.0004915824915824916
Epoch: 1 | Step: 300 | Avg. loss: 2.738 | lr: 0.0004848484848484849
Epoch: 1 | Step: 400 | Avg. loss: 2.420 | lr: 0.0004781144781144781
Epoch: 1 | Step: 500 | Avg. loss: 2.210 | lr: 0.0004713804713804714
Saving model with test loss of 2.630
Epoch: 1 | Step: 600 | Avg. loss: 2.060 | lr: 0.0004646464646464646
Epoch: 1 | Step: 700 | Avg. loss: 1.965 | lr: 0.00045791245791245794


Epoch: 2 | Step: 100 | Avg. loss: 1.832 | lr: 0.00044781144781144786
Epoch: 2 | Step: 200 | Avg. loss: 1.819 | lr: 0.00044107744107744107
Epoch: 2 | Step: 300 | Avg. loss: 1.748 | lr: 0.0004343434343434344
Epoch: 2 | Step: 400 | Avg. loss: 1.685 | lr: 0.0004276094276094276
Epoch: 2 | Step: 500 | Avg. loss: 1.685 | lr: 0.0004208754208754209
Saving model with test loss of 2.462
Epoch: 2 | Step: 600 | Avg. loss: 1.657 | lr: 0.0004141414141414142
Epoch: 2 | Step: 700 | Avg. loss: 1.626 | lr: 0.0004074074074074074


Epoch: 3 | Step: 100 | Avg. loss: 1.571 | lr: 0.00039730639730639736
Epoch: 3 | Step: 200 | Avg. loss: 1.554 | lr: 0.00039057239057239056
Epoch: 3 | Step: 300 | Avg. loss: 1.532 | lr: 0.00038383838383838383
Epoch: 3 | Step: 400 | Avg. loss: 1.516 | lr: 0.0003771043771043771
Epoch: 3 | Step: 500 | Avg. loss: 1.494 | lr: 0.00037037037037037035
Saving model with test loss of 2.387
Epoch: 3 | Step: 600 | Avg. loss: 1.508 | lr: 0.00036363636363636367
Epoch: 3 | Step: 700 | Avg. loss: 1.492 | lr: 0.0003569023569023569


Epoch: 4 | Step: 100 | Avg. loss: 1.433 | lr: 0.0003468013468013468
Epoch: 4 | Step: 200 | Avg. loss: 1.433 | lr: 0.00034006734006734006
Epoch: 4 | Step: 300 | Avg. loss: 1.416 | lr: 0.0003333333333333333
Epoch: 4 | Step: 400 | Avg. loss: 1.377 | lr: 0.00032659932659932664
Epoch: 4 | Step: 500 | Avg. loss: 1.381 | lr: 0.00031986531986531985
Saving model with test loss of 2.308
Epoch: 4 | Step: 600 | Avg. loss: 1.403 | lr: 0.00031313131313131316
Epoch: 4 | Step: 700 | Avg. loss: 1.366 | lr: 0.00030639730639730637


Epoch: 5 | Step: 100 | Avg. loss: 1.334 | lr: 0.0002962962962962963
Epoch: 5 | Step: 200 | Avg. loss: 1.322 | lr: 0.0002895622895622896
Epoch: 5 | Step: 300 | Avg. loss: 1.312 | lr: 0.0002828282828282828
Epoch: 5 | Step: 400 | Avg. loss: 1.312 | lr: 0.00027609427609427613
Epoch: 5 | Step: 500 | Avg. loss: 1.300 | lr: 0.00026936026936026934
Saving model with test loss of 2.230
Epoch: 5 | Step: 600 | Avg. loss: 1.293 | lr: 0.00026262626262626266
Epoch: 5 | Step: 700 | Avg. loss: 1.315 | lr: 0.0002558922558922559


Epoch: 6 | Step: 100 | Avg. loss: 1.256 | lr: 0.0002457912457912458
Epoch: 6 | Step: 200 | Avg. loss: 1.256 | lr: 0.00023905723905723905
Epoch: 6 | Step: 300 | Avg. loss: 1.262 | lr: 0.0002323232323232323
Epoch: 6 | Step: 400 | Avg. loss: 1.239 | lr: 0.0002255892255892256
Epoch: 6 | Step: 500 | Avg. loss: 1.266 | lr: 0.00021885521885521886
Saving model with test loss of 2.377
Epoch: 6 | Step: 600 | Avg. loss: 1.251 | lr: 0.00021212121212121213
Epoch: 6 | Step: 700 | Avg. loss: 1.246 | lr: 0.0002053872053872054


Epoch: 7 | Step: 100 | Avg. loss: 1.200 | lr: 0.00019528619528619528
Epoch: 7 | Step: 200 | Avg. loss: 1.195 | lr: 0.00018855218855218854
Epoch: 7 | Step: 300 | Avg. loss: 1.192 | lr: 0.00018181818181818183
Epoch: 7 | Step: 400 | Avg. loss: 1.229 | lr: 0.0001750841750841751
Epoch: 7 | Step: 500 | Avg. loss: 1.189 | lr: 0.00016835016835016836
Saving model with test loss of 2.255


KeyboardInterrupt: 

In [111]:
#@title Test Interface
input_text = "hello what is your name?" #@param {type:"string"}
output_language = 'ar' #@param ["en", "ar", "fr", "es"]

input_ids = encode_input_str(
    text = input_text,
    target_lang = output_language,
    tokenizer = tokenizer,
    seq_len = 128,
    lang_token_map = LANG_TOKEN_MAPPING)
input_ids = input_ids.unsqueeze(0).cuda()

output_tokens = model.generate(input_ids, num_beams=20, length_penalty=0.2)
print(input_text + '  ->  ' + \
      tokenizer.decode(output_tokens[0], skip_special_tokens=True))

hello what is your name?  ->  هلا, ما هو اسمك?


In [26]:
def translate(text, target_lang):
    # Encode with the requested target language (not the global `output_language`)
    input_ids = encode_input_str(text = text, target_lang = target_lang, tokenizer = tokenizer,
                                 seq_len = 128, lang_token_map = LANG_TOKEN_MAPPING)
    input_ids = input_ids.unsqueeze(0).cuda()

    # Beam search over 10 candidate sequences
    translated_ids = model.generate(input_ids=input_ids, num_beams=10, length_penalty=0.2)

    # Drop <pad>, </s>, and the language tokens from the decoded text
    translated_sentence = tokenizer.decode(translated_ids[0], skip_special_tokens=True)
    return translated_sentence

In [29]:
translate("you are not being very nice to me", 'ar')

'انت لا تكون فخمة بالنسبة لي.'