# Fine-Tuning mBart for English to Arabic Subtitle Translation 🎥📝🤖

## Introduction

<img src='https://production-media.paperswithcode.com/methods/Screen_Shot_2020-06-01_at_9.49.47_PM.png' />

In this notebook, the pre-trained [mBart 50](https://arxiv.org/abs/2008.00401) model from the [Hugging Face Model Hub](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) is fine-tuned on English-Arabic sentence pairs from the `ar-en` configuration of the [OPUS-100 dataset](https://huggingface.co/datasets/Helsinki-NLP/opus-100). Only the first 5,000 training pairs and the first 1,000 validation pairs are used, and the encoder is frozen during training. The primary goal of this fine-tuning process is to enhance the model's ability to generate translations that match the style and tone typical of subtitles.

After three epochs, the fine-tuned model reaches a SacreBLEU score of about 12 on the validation subset. The inference section shows sample translations of a few well-known film lines.

The final model has been [pushed to the Hugging Face Hub](https://huggingface.co/Messam174/fine-tune-mBart-larg-50-many-to-many-ar-en) for open-source access and further development by the community. This notebook walks through the full fine-tuning process, from loading the data to publishing the model.



In [1]:
# install dependencies
!pip install datasets sacrebleu evaluate

Successfully installed evaluate-0.4.3 portalocker-2.10.1 sacrebleu-2.4.3


In [2]:
from huggingface_hub import notebook_login

notebook_login()


# Load Dataset 📂📊

In [3]:
import pandas as pd
import numpy as np
from datasets import load_dataset


# load the dataset (from Hugging Face datasets hub)
raw_datasets = load_dataset("Helsinki-NLP/opus-100", "ar-en")


In [4]:
# Raw dataset structure
raw_datasets

DatasetDict({
    test: Dataset({
        features: ['translation'],
        num_rows: 2000
    })
    train: Dataset({
        features: ['translation'],
        num_rows: 1000000
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 2000
    })
})

In [5]:
ds_train = raw_datasets["train"].select(range(5000))
ds_val = raw_datasets["validation"].select(range(1000))
ds_test = raw_datasets["test"]

In [6]:
# Reshape the data into flat 'source' (English) and 'target' (Arabic) columns
def convert_to_new_format(examples):
    # With batched=True, examples['translation'] is a list of {'en': ..., 'ar': ...} dicts
    return {
        'source': [pair['en'] for pair in examples['translation']],
        'target': [pair['ar'] for pair in examples['translation']],
    }


new_ds_train = ds_train.map(convert_to_new_format, batched=True)
new_ds_val = ds_val.map(convert_to_new_format, batched=True)
new_ds_test = ds_test.map(convert_to_new_format, batched=True)

# Keep only the first 1000 examples for training
# new_ds_train = new_ds_train.select(range(1000))


print(new_ds_train)



Dataset({
    features: ['translation', 'source', 'target'],
    num_rows: 5000
})
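
To make the reshaping step concrete, here is a minimal sketch that applies the same column-splitting logic to a hand-written batch. The toy batch and the helper name `split_translation_pairs` are illustrative assumptions; with `batched=True`, `map()` passes each column as a list, which is the layout imitated here.

```python
def split_translation_pairs(examples):
    """Flatten a batch of {'en': ..., 'ar': ...} pairs into two columns."""
    pairs = examples["translation"]
    return {
        "source": [pair["en"] for pair in pairs],
        "target": [pair["ar"] for pair in pairs],
    }


# Hypothetical batch in the same layout that map(..., batched=True) provides
toy_batch = {
    "translation": [
        {"en": "And this?", "ar": "و هذه؟"},
        {"en": "I love you", "ar": "أحبك"},
    ]
}

flat = split_translation_pairs(toy_batch)
print(flat["source"])
print(flat["target"])
```

Because `map()` adds the returned columns to the existing ones, the original `translation` column is still present afterwards, which matches the `features` list printed above.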


In [7]:
# First train sample - Arabic
new_ds_train[0]['target']

'و هذه؟'

Optionally, select a random subset of the training samples for faster training or testing.

In [8]:
# randomly select some of the train samples in case that you do not need all of them
#num_samples = 1000
#raw_datasets['train'] = raw_datasets['train'].shuffle(seed=42).select(range(num_samples))


# Tokenization 🔤✂️

Here, you must select the **checkpoint** path. This is crucial because the tokenization and model structure are determined based on this path.

In [10]:
from transformers import AutoTokenizer

checkpoint = 'facebook/mbart-large-50-many-to-many-mmt' # Pre-trained mBart model checkpoint
# return_tensors is an argument of the tokenizer call, not of from_pretrained, so it is omitted here
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "ar_AR"




Let's examine how the tokenizer performs on a single instance.

In [11]:
# a single tokenization example

en_sentence = new_ds_train[0]['source']
ar_sentence = new_ds_train[0]['target']

inputs = tokenizer(en_sentence, text_target=ar_sentence) # This is referred to as input because it will be fed to the model.

In [12]:
# The tokenization result for a single instance is as follows
inputs

{'input_ids': [250004, 3493, 903, 32, 2], 'attention_mask': [1, 1, 1, 1, 1], 'labels': [250001, 65, 3070, 1245, 2]}

In [13]:
inputs.keys()

dict_keys(['input_ids', 'attention_mask', 'labels'])

In [14]:
# Tokens for the English instance:
print(tokenizer.convert_ids_to_tokens(inputs['input_ids']))

['en_XX', '▁And', '▁this', '?', '</s>']


In [15]:
# Tokens for the Arabic instance
print(tokenizer.convert_ids_to_tokens(inputs['labels']))

['ar_AR', '▁و', '▁هذه', '؟', '</s>']


In [16]:
max_length = 128 # The maximum length of the tokenization output can be adjusted according to your data.

# Define a function to implement tokenization on the raw_datasets using the map() method.

def preprocess_function(examples):
    inputs = [ex for ex in examples['source']]
    targets = [ex for ex in examples["target"]]
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs
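
With `truncation=True`, sequences longer than `max_length` are cut so that the total length, including the language-code prefix and the closing `</s>`, fits the limit. The sketch below imitates that behaviour on plain token lists. `truncate_with_specials` is a hypothetical helper written for illustration, not a function of the tokenizer, and it assumes the `[lang_code] tokens </s>` layout seen in the tokenization output above.

```python
def truncate_with_specials(tokens, lang_code, max_length, eos="</s>"):
    """Simplified imitation of truncation for the [lang_code] tokens </s> layout.

    Assumption: the two special tokens (prefix and suffix) are always kept,
    so at most max_length - 2 content tokens survive.
    """
    budget = max(max_length - 2, 0)
    return [lang_code] + list(tokens[:budget]) + [eos]


short = truncate_with_specials(["▁And", "▁this", "?"], "en_XX", max_length=128)
long = truncate_with_specials([f"tok{i}" for i in range(300)], "en_XX", max_length=128)

print(short)      # a short sentence keeps all of its tokens
print(len(long))  # a long sequence is capped at max_length
```

Subtitle lines are usually short, so a limit of 128 tokens should rarely truncate anything; checking the length distribution of the tokenized data is a simple way to confirm this for a given dataset.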

In [17]:
# Tokenize raw_datasets
train_ds = new_ds_train.map(preprocess_function, batched= True, remove_columns=['translation', 'source', 'target'])
validation_ds = new_ds_val.map(preprocess_function, batched= True, remove_columns=['translation', 'source', 'target'])



In [18]:
train_ds

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 5000
})

In [19]:
validation_ds

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 1000
})

# Model 🤖🧠

In [20]:
# Load the pre-trained mBart model from Hugging Face.
from transformers import AutoModelForSeq2SeqLM

model= AutoModelForSeq2SeqLM.from_pretrained(checkpoint)


## Freeze 🧊🔒

In [21]:
# If you want to freeze the pre-trained layers, there are different approaches to do this. In this case, the encoder layers are frozen while the decoder layers will be updated during fine-tuning.
for param in model.model.encoder.parameters():
    param.requires_grad = False

## Create Batches using DataCollator 📦

In [22]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer= tokenizer, model= model)

In [23]:
batch = data_collator([train_ds[i] for i in range(1, 3)])
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels', 'decoder_input_ids'])
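
The extra `decoder_input_ids` key appears because the collator is given the model: it pads `labels` with `-100` (a value ignored by the loss) and then derives the decoder inputs by shifting the labels one position to the right. The pure-Python sketch below imitates those two steps for mBart, where the last non-padding token (`</s>`) is wrapped around to the front. The function names are hypothetical, the second label sequence is made up, and the padding id of 1 is an assumption about this tokenizer; check `tokenizer.pad_token_id` before relying on it.

```python
LABEL_PAD = -100  # positions with this value are ignored by the loss
PAD_ID = 1        # assumed pad token id; verify with tokenizer.pad_token_id


def pad_labels(batch_labels):
    """Right-pad every label sequence to the longest one in the batch."""
    width = max(len(seq) for seq in batch_labels)
    return [seq + [LABEL_PAD] * (width - len(seq)) for seq in batch_labels]


def shift_right_mbart(labels):
    """Build decoder inputs: move the last real token to the front, shift the rest."""
    ids = [PAD_ID if tok == LABEL_PAD else tok for tok in labels]
    last_real = sum(tok != PAD_ID for tok in ids) - 1
    return [ids[last_real]] + ids[:-1]


batch = pad_labels([
    [250001, 65, 3070, 1245, 2],          # labels of the first sample shown earlier
    [250001, 65, 3070, 1245, 65, 65, 2],  # made-up longer sequence
])
for labels in batch:
    print(labels, "->", shift_right_mbart(labels))
```

During training the decoder therefore sees `</s>`, then the target language code, then the target tokens, and learns to predict the label sequence one step ahead.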

# Evaluation Definition 📊🔍

In [24]:
import evaluate

metric = evaluate.load("sacrebleu")


In [25]:
import numpy as np


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}
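
The `-100` replacement matters because `-100` is not a valid vocabulary id and cannot be decoded. The sketch below shows the same masking step and the nested reference format in plain Python. The tiny vocabulary and the `decode` helper are invented stand-ins for the tokenizer, used only so the example runs without downloading anything, and the pad id of 1 is an assumption.

```python
PAD_ID = 1  # assumed pad token id; the notebook uses tokenizer.pad_token_id

# Hypothetical miniature vocabulary standing in for the real tokenizer
VOCAB = {1: "<pad>", 2: "</s>", 250001: "ar_AR", 65: "و", 3070: "هذه", 1245: "؟"}
SPECIAL = {"<pad>", "</s>", "ar_AR"}


def decode(ids):
    """Toy decoder: map ids to tokens and drop special tokens."""
    return " ".join(VOCAB[i] for i in ids if VOCAB[i] not in SPECIAL)


labels = [[250001, 65, 3070, 1245, 2, -100, -100]]

# Same effect as np.where(labels != -100, labels, pad_token_id)
cleaned = [[tok if tok != -100 else PAD_ID for tok in row] for row in labels]

# sacreBLEU expects one list of references per prediction
references = [[decode(row).strip()] for row in cleaned]
print(references)
```

Each prediction is a plain string, while each reference is wrapped in its own list because the metric supports several reference translations per sentence.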

# Training 🤖📚

In [26]:
from transformers import Seq2SeqTrainingArguments

# Training settings
args = Seq2SeqTrainingArguments(
    checkpoint,
    evaluation_strategy="steps",
    eval_steps=1000,
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
    gradient_accumulation_steps=2,
    dataloader_num_workers=16,
    logging_strategy="steps",
    logging_steps=500
)



In [27]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=train_ds,
    eval_dataset=validation_ds,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)



In [28]:
# Start fine-tuning process
trainer.train()

| Step | Training Loss | Validation Loss | Bleu |
|------|---------------|-----------------|------|
| 1000 | 2.1894 | 2.368187 | 10.742858 |
| 2000 | 1.7455 | 2.349487 | 12.255348 |
| 3000 | 1.3893 | 2.395577 | 11.255426 |



TrainOutput(global_step=3750, training_loss=1.80242119140625, metrics={'train_runtime': 5896.6181, 'train_samples_per_second': 2.544, 'train_steps_per_second': 0.636, 'total_flos': 691437143556096.0, 'train_loss': 1.80242119140625, 'epoch': 3.0})
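
As a sanity check on the reported `global_step=3750`, the number of optimizer steps can be estimated from the training settings. The sketch below does that arithmetic. The device count is not stated in the notebook, so `num_devices=2` is an assumption, chosen because it reproduces the reported figure; a single device would give twice as many steps.

```python
import math


def optimizer_steps(num_samples, per_device_batch, grad_accum, num_devices, epochs):
    """Estimate total optimizer steps for a run with gradient accumulation."""
    batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_devices))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


# Values taken from the dataset subset and the training arguments above
settings = dict(num_samples=5000, per_device_batch=1, grad_accum=2, epochs=3)

print(optimizer_steps(num_devices=1, **settings))
print(optimizer_steps(num_devices=2, **settings))
```

The reported throughput fits the same picture: 15,000 training samples (5,000 samples over 3 epochs) in about 5,897 seconds is roughly 2.54 samples per second.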

In [44]:
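# Note: the first positional argument of trainer.push_to_hub() is the commit message, not the repo id.
# The target repo name is derived from the output directory, as the CommitInfo below shows.
trainer.push_to_hub("Messam174/fine-tune-mBart-larg-50-many-to-many-ar-en")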

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/Messam174/mbart-large-50-many-to-many-mmt/commit/ae8bb78ba285f86231c10c17f468a16db3f84265', commit_message='Messam174/fine-tune-mBart-larg-50-many-to-many-ar-en', commit_description='', oid='ae8bb78ba285f86231c10c17f468a16db3f84265', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Messam174/mbart-large-50-many-to-many-mmt', endpoint='https://huggingface.co', repo_type='model', repo_id='Messam174/mbart-large-50-many-to-many-mmt'), pr_revision=None, pr_num=None)

# Evaluate the Model 📊🔍

## Scores 📊

In [29]:
trainer.evaluate(max_length=max_length)



{'eval_loss': 2.391180992126465,
 'eval_bleu': 12.067443653112056,
 'eval_runtime': 406.3279,
 'eval_samples_per_second': 2.461,
 'eval_steps_per_second': 1.231,
 'epoch': 3.0}

## Inference 🔍

In [30]:
from transformers import pipeline

# Replace this with your own checkpoint
fine_tuned_checkpoint = "/kaggle/working/facebook/mbart-large-50-many-to-many-mmt/checkpoint-3750"
translator = pipeline("translation", model=fine_tuned_checkpoint, src_lang = "en_XX", tgt_lang = "ar_AR")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [31]:
translator("I'm gonna make him an offer he can't refuse.")

[{'translation_text': 'سأقدم له عرضاً لا يستطيع رفضه'}]

In [32]:
translator("Toto, I've a feeling we're not in Kansas anymore.")

[{'translation_text': '(توتو) أشعر أننا لسنا في (كانساس) بعد الآن'}]

# Pushing the Model to the Hugging Face Hub 🚀🤗☁️

## Push to Hub 🚀

In [42]:
model = AutoModelForSeq2SeqLM.from_pretrained( "/kaggle/working/facebook/mbart-large-50-many-to-many-mmt/checkpoint-3750")
tokenizer = AutoTokenizer.from_pretrained("/kaggle/working/facebook/mbart-large-50-many-to-many-mmt/checkpoint-3750")

In [43]:
model.push_to_hub("Messam174/fine-tune-mBart-larg-50-many-to-many-ar-en")
tokenizer.push_to_hub("Messam174/fine-tune-mBart-larg-50-many-to-many-ar-en")


CommitInfo(commit_url='https://huggingface.co/Messam174/fine-tune-mBart-larg-50-many-to-many-ar-en/commit/5db81a97102555c5b7216363a4890b610c214999', commit_message='Upload tokenizer', commit_description='', oid='5db81a97102555c5b7216363a4890b610c214999', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Messam174/fine-tune-mBart-larg-50-many-to-many-ar-en', endpoint='https://huggingface.co', repo_type='model', repo_id='Messam174/fine-tune-mBart-larg-50-many-to-many-ar-en'), pr_revision=None, pr_num=None)

## Load the Pushed Model  📥☁️

In [54]:
model= AutoModelForSeq2SeqLM.from_pretrained( "Messam174/mbart-large-50-many-to-many-mmt")
tokenizer = AutoTokenizer.from_pretrained("Messam174/mbart-large-50-many-to-many-mmt")

In [56]:
from transformers import pipeline

fine_tuned_checkpoint = "Messam174/mbart-large-50-many-to-many-mmt"
trans = pipeline("translation", model=fine_tuned_checkpoint, src_lang = "en_XX", tgt_lang = "ar_AR")



In [57]:
trans("I Love you")

[{'translation_text': 'أحبك'}]

In [52]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("translation", model="Messam174/mbart-large-50-many-to-many-mmt",src_lang = "en_XX", tgt_lang = "ar_AR")



In [53]:
# The pipeline object created above is named `pipe`; calling `pip(...)` runs the pip shell command instead
pipe("I love you so much.")
