<a href="https://colab.research.google.com/github/Mahdi-Golizadeh/Natural-Language-Processing/blob/main/transformers/translators/translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Translation Model (EN-FA)
In this notebook I build a transformer model for a translation task.
* The task is to train a transformer model with a T5-like structure, but smaller.
* The Hugging Face `transformers` library is used.
* I train a small transformer model to translate from English to Farsi.

## Install necessary libraries

In [1]:
!pip install -q datasets
!pip install -q transformers
!pip install -q sentencepiece
!pip install -q sacrebleu
!pip install -q evaluate

## Import necessary libraries

In [2]:
import datasets
import transformers
import evaluate
import numpy as np

## Dataset

To keep the dataset small for faster training, the KDE4 dataset is used here.

First, download the dataset:

In [3]:
raw_datasets = datasets.load_dataset("kde4", lang1= "en", lang2= "fa")

To see a sample of the dataset:

In [4]:
raw_datasets["train"][2]

{'id': '2',
 'translation': {'en': 'Add All Found Feeds to Akregator',
  'fa': 'افزودن همۀ خوراندنهای یافته\u200cشده به Akregator'}}

This dataset only has a train split, so we create a test split to use for validation:

In [5]:
split_datasets = raw_datasets["train"].train_test_split(train_size= .9, seed= 49)

Now rename the test split to validation:

In [6]:
split_datasets["validation"] = split_datasets.pop("test")

To inspect the newly created dataset:

In [7]:
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 74788
    })
    validation: Dataset({
        features: ['id', 'translation'],
        num_rows: 8310
    })
})
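
The split sizes above can be checked by hand. This is a minimal sketch, assuming `train_test_split` takes the floor of `train_size` times the row count for the train split and gives the remainder to the held-out split; the row counts are the ones printed above.

```python
import math

# Row counts reported by the notebook for the two splits.
n_train_reported = 74788
n_validation_reported = 8310
n_total = n_train_reported + n_validation_reported  # rows in the original train split

train_size = 0.9

# Assumption: floor for the train split, remainder for the held-out split.
n_train = math.floor(train_size * n_total)
n_validation = n_total - n_train

print(n_train, n_validation)
```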

## Selecting Checkpoint

Select a pretrained checkpoint.

I choose mt5-base and use it only for its tokenizer; the model itself is built from a custom configuration in the Model section.

In [8]:
checkpoint = "google/mt5-base"

Next, download the matching tokenizer:

In [9]:
tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint)

In [10]:
tokenizer

PreTrainedTokenizerFast(name_or_path='google/mt5-base', vocab_size=250100, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'})

### Example use of the tokenizer

Take a sample from the dataset:

In [11]:
src_sent = split_datasets["validation"][48]["translation"]["en"]
tgt_sent = split_datasets["validation"][48]["translation"]["fa"]
src_sent, tgt_sent

('D-Bus Call Failed', '& نامه... \u200c')

To test the tokenizer:

In [12]:
model_inputs = tokenizer(src_sent, text_target= tgt_sent, return_tensors= "pt")
model_inputs

{'input_ids': tensor([[   431,    264,  69114,  10633, 111099,    345,      1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[  549, 30968,   302,   259,     1]])}

Now decode the tokenizer output with `batch_decode` to check that it works properly:

In [13]:
tokenizer.batch_decode(model_inputs["input_ids"]), tokenizer.batch_decode(model_inputs["labels"])

(['D-Bus Call Failed</s>'], ['& نامه... </s>'])

## Preprocessing Data

To tokenize the dataset we need a function that separates the source and target sentences and tokenizes the targets as labels:

In [14]:
max_length= 64
def preprocess(example):
    inputs = [ex["en"] for ex in example["translation"]]
    targets = [ex["fa"] for ex in example["translation"]]
    model_inputs = tokenizer(inputs, text_target= targets, max_length= max_length, truncation= True)

    return model_inputs
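
Because the function is applied with `batched= True`, `example` is a dict of columns, and `example["translation"]` is a list of `{"en": ..., "fa": ...}` dicts. The sketch below shows only that unpacking; `toy_tokenize` and `preprocess_sketch` are hypothetical stand-ins (a whitespace split instead of the real mT5 tokenizer), and the two rows are the samples printed earlier in this notebook.

```python
# A batch in the shape that a batched map passes in: one key per column,
# each holding a list of rows.
batch = {
    "id": ["2", "48"],
    "translation": [
        {"en": "Add All Found Feeds to Akregator",
         "fa": "افزودن همۀ خوراندنهای یافته\u200cشده به Akregator"},
        {"en": "D-Bus Call Failed", "fa": "& نامه... \u200c"},
    ],
}

def toy_tokenize(texts):
    """Stand-in for the real tokenizer: split on whitespace (illustration only)."""
    return [text.split() for text in texts]

def preprocess_sketch(example, max_length=64):
    # Pull the source and target sentences out of each translation dict.
    inputs = [ex["en"] for ex in example["translation"]]
    targets = [ex["fa"] for ex in example["translation"]]
    # Truncate both sides to max_length tokens, as truncation=True would.
    return {
        "input_ids": [tokens[:max_length] for tokens in toy_tokenize(inputs)],
        "labels": [tokens[:max_length] for tokens in toy_tokenize(targets)],
    }

out = preprocess_sketch(batch)
print(out["input_ids"][1])
```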

Map the preprocess function over the dataset and remove the unnecessary columns:

In [15]:
tokenized_datasets = split_datasets.map(preprocess, batched= True, remove_columns= split_datasets["train"].column_names)

A sample of the preprocessed dataset:

In [16]:
tokenized_datasets["train"][0]

{'input_ids': [653, 259, 185168, 265, 299, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1],
 'labels': [259, 14268, 259, 7259, 20331, 1]}

I will use `Seq2SeqTrainer`, a subclass of `Trainer` that lets us handle evaluation properly (it can call `generate` to produce predictions).

## Model

A custom model with the T5 structure, down-sized so that it is trainable on Colab:

In [17]:
model_config = transformers.MT5Config(
    d_model= 128,
    d_ff= 256,
    num_layers= 2,
    d_kv= 16,
    num_heads= 8,
)

In [18]:
model = transformers.AutoModelForSeq2SeqLM.from_config(model_config)
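
To get a feel for where the parameters of this down-sized model go, here is a rough hand count. It is a sketch, assuming the settings shown in the saved `config.json` further down (vocabulary of 250112, gated-GELU feed-forward, untied input and output embeddings) and no bias terms in the linear layers; the variable names are mine.

```python
# Values taken from the notebook's config and the saved config.json.
vocab_size = 250112
d_model, d_ff, d_kv, num_heads = 128, 256, 16, 8
num_layers = num_decoder_layers = 2
rel_buckets = 32                      # relative_attention_num_buckets

inner = num_heads * d_kv              # attention projection width (128)
attention = 4 * d_model * inner       # q, k, v, o projections
feed_forward = 3 * d_model * d_ff     # gated feed-forward: wi_0, wi_1, wo
layer_norm = d_model                  # one scale vector per layer norm

# Encoder block: self-attention + feed-forward, each with a layer norm.
encoder_block = (attention + layer_norm) + (feed_forward + layer_norm)
# Decoder block: self-attention + cross-attention + feed-forward.
decoder_block = 2 * (attention + layer_norm) + (feed_forward + layer_norm)
# Relative position bias lives only in the first block of each stack.
rel_bias = rel_buckets * num_heads

# Each stack also ends with a final layer norm.
encoder = num_layers * encoder_block + rel_bias + layer_norm
decoder = num_decoder_layers * decoder_block + rel_bias + layer_norm

embeddings = 2 * vocab_size * d_model  # input embedding + untied output head
total = embeddings + encoder + decoder
print(total, embeddings, encoder + decoder)
```

Under these assumptions the count matches the 64817152 trainable parameters that the trainer reports, and almost all of them sit in the two embedding matrices rather than in the transformer blocks.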

## Data Collation

I will use `DataCollatorForSeq2Seq` to take care of the processing the model inputs need (such as padding each batch):

In [19]:
data_collator = transformers.DataCollatorForSeq2Seq(
    model= model,
    tokenizer= tokenizer,
)
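
The padding step can be pictured with a pure-Python sketch. It is not the collator's actual code: `pad_batch` is a hypothetical helper, and it assumes the default behaviour of padding inputs with the pad token id (0 here) and labels with -100 so that padded label positions are ignored by the loss. The real collator also returns tensors and, when given the model, prepares `decoder_input_ids`, which this sketch leaves out. The two features are the tokenized samples printed earlier.

```python
PAD_TOKEN_ID = 0           # pad_token_id of the mT5 tokenizer/config
LABEL_PAD_TOKEN_ID = -100  # label positions with this value are skipped by the loss

def pad_batch(features):
    """Pad every sequence in the batch to the longest one (right padding)."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad_in = max_in - len(f["input_ids"])
        pad_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * pad_in)
        batch["attention_mask"].append(f["attention_mask"] + [0] * pad_in)
        batch["labels"].append(f["labels"] + [LABEL_PAD_TOKEN_ID] * pad_lab)
    return batch

features = [
    {"input_ids": [653, 259, 185168, 265, 299, 1],
     "attention_mask": [1, 1, 1, 1, 1, 1],
     "labels": [259, 14268, 259, 7259, 20331, 1]},
    {"input_ids": [431, 264, 69114, 10633, 111099, 345, 1],
     "attention_mask": [1, 1, 1, 1, 1, 1, 1],
     "labels": [549, 30968, 302, 259, 1]},
]

batch = pad_batch(features)
print(batch["input_ids"][0])
print(batch["labels"][1])
```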

In [20]:
data_collator

DataCollatorForSeq2Seq(tokenizer=PreTrainedTokenizerFast(name_or_path='google/mt5-base', vocab_size=250100, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}), model=MT5ForConditionalGeneration(
  (shared): Embedding(250112, 128)
  (encoder): T5Stack(
    (embed_tokens): Embedding(250112, 128)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=128, out_features=128, bias=False)
              (k): Linear(in_features=128, out_features=128, bias=False)
              (v): Linear(in_features=128, out_features=128, bias=False)
              (o): Linear(in_features=128, out_features=128, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
         

## Metrics

The default metric for almost every translation task is BLEU, so SacreBLEU is used for evaluation:

In [21]:
metric = evaluate.load("sacrebleu")

To see the details of the metric:

In [22]:
metric

EvaluationModule(name: "sacrebleu", module_type: "metric", features: [{'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}], usage: """
Produces BLEU scores along with its sufficient statistics
from a source against one or more references.

Args:
    predictions (`list` of `str`): list of translations to score. Each translation should be tokenized into a list of tokens.
    references (`list` of `list` of `str`): A list of lists of references. The contents of the first sub-list are the references for the first prediction, the contents of the second sub-list are for the second prediction, etc. Note that there must be the same number of references for each prediction (i.e. all sub-lists must be of the same length).
    smooth_method (`str`): The smoothing method to use, defaults to `'e

We need a function that converts the model output into the form the metric expects during evaluation:

In [23]:
def compute_metric(eval_preds):
    preds, labels = eval_preds
    # Predictions may arrive as a tuple; the generated token ids come first.
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(
        preds, skip_special_tokens= True,
    )
    # -100 marks padded label positions; swap it for the pad token id so the labels can be decoded.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens= True,)
    # Strip surrounding whitespace before scoring.
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [label.strip() for label in decoded_labels]
    result = metric.compute(
        predictions= decoded_preds,
        references= decoded_labels,
    )
    return {"BLEU": result["score"]}
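
The `np.where` line is the least obvious step in this function, so here it is in isolation. This is a small sketch on a made-up padded label batch (two label sequences from earlier in the notebook, the shorter one padded with -100), assuming a pad token id of 0 as in the saved config.

```python
import numpy as np

PAD_TOKEN_ID = 0  # assumed pad token id (pad_token_id in the saved config)

# Label ids as they reach the metric function: -100 marks padded positions.
labels = np.array([
    [259, 14268, 259, 7259, 20331, 1],
    [549, 30968, 302, 259, 1, -100],
])

# Keep real token ids, replace every -100 with the pad token id so the
# tokenizer can decode the row (and drop the padding as a special token).
restored = np.where(labels != -100, labels, PAD_TOKEN_ID)
print(restored.tolist())
```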

## Training Process

Define the arguments required for training:

In [35]:
args = transformers.Seq2SeqTrainingArguments(
    "mt5-trans-en-fa",
    evaluation_strategy= "no",
    save_strategy= "epoch",
    overwrite_output_dir= True,
    save_total_limit= 1,
    learning_rate= 2e-5,
    per_device_train_batch_size= 32,
    per_device_eval_batch_size= 32,
    weight_decay= .1,
    num_train_epochs= 3,
    predict_with_generate= True,
    fp16= True,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Now define the trainer:

In [36]:
trainer = transformers.Seq2SeqTrainer(
    model,
    args,
    train_dataset= tokenized_datasets["train"],
    eval_dataset= tokenized_datasets["validation"],
    data_collator= data_collator,
    tokenizer= tokenizer,
    compute_metrics= compute_metric
)

Using cuda_amp half precision backend


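To evaluate the model on the validation set (the `'epoch': 3.0` entry in the output below suggests this cell was last executed after a training run rather than before it):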

In [39]:
trainer.evaluate(max_length= max_length)

***** Running Evaluation *****
  Num examples = 8310
  Batch size = 32


{'eval_loss': 19.34880828857422,
 'eval_BLEU': 0.008929368712566728,
 'eval_runtime': 74.6611,
 'eval_samples_per_second': 111.303,
 'eval_steps_per_second': 3.482,
 'epoch': 3.0}

Now train the model:

In [None]:
log = trainer.train()

***** Running training *****
  Num examples = 74788
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 7014
  Number of trainable parameters = 64817152


Step,Training Loss
500,23.2772
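
The number of optimization steps in the log follows from the dataset size and the training arguments. This is a quick check, assuming a single device, no gradient accumulation (as the log states), and that the last partial batch of each epoch is kept.

```python
import math

num_examples = 74788   # rows in the train split
batch_size = 32        # per_device_train_batch_size
num_epochs = 3         # num_train_epochs

# The last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

This is also where the `checkpoint-7014` directory name used below comes from: with `save_strategy= "epoch"`, the last checkpoint is written at the final step.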


## Model Evaluation

To test the trained model:

In [43]:
trans = transformers.pipeline("translation", model= "/content/mt5-trans-en-fa/checkpoint-7014" )

loading configuration file /content/mt5-trans-en-fa/checkpoint-7014/config.json
Model config MT5Config {
  "_name_or_path": "/content/mt5-trans-en-fa/checkpoint-7014",
  "architectures": [
    "MT5ForConditionalGeneration"
  ],
  "d_ff": 256,
  "d_kv": 16,
  "d_model": 128,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "mt5",
  "num_decoder_layers": 2,
  "num_heads": 8,
  "num_layers": 2,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "use_cache": true,
  "vocab_size": 250112
}

In [44]:
trans("tomorrow I will come")

[{'translation_text': ''}]

The model returns an empty translation: it needs more training, and because it is not big enough its performance isn't satisfactory.