# **English-Russian Translator**

## **Introduction**

In today's interconnected world, effective language translation plays a crucial role in breaking down communication barriers and enabling cross-cultural understanding. An automated translation system can facilitate seamless interactions, enhance information dissemination, and support users in understanding content in their preferred language.

For this project, I have chosen to fine-tune a pretrained model found on https://huggingface.co/tasks/translation that enables accurate and contextually relevant translation from English to Russian. The goal is to enhance the accessibility of information across language barriers and improve communication between users who speak different languages.

Using a pretrained model has substantial advantages: it reduces compute costs and environmental impact, and it gives access to state-of-the-art models without training from scratch. The Transformers library offers thousands of pretrained models for a wide range of tasks. Once you have chosen a pretrained model, you continue training it on a dataset specific to your task, a technique known as fine-tuning.

The model is fine-tuned on the KDE4 dataset from Hugging Face, which contains pairs of short English and Russian sentences.

In [12]:
# Load the dataset
from datasets import load_dataset

raw_datasets = load_dataset("kde4", lang1="en", lang2="ru")


In [None]:
pip install datasets

In [13]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 180793
    })
})

In [14]:
# Access the "train" dataset within the DatasetDict and preview the first few rows
train_dataset = raw_datasets["train"]
print(train_dataset[0:5])  # Print the first 5 examples

{'id': ['0', '1', '2', '3', '4'], 'translation': [{'en': 'Lauri Watts', 'ru': 'Lauri Watts'}, {'en': 'ROLES_OF_TRANSLATORS', 'ru': 'Andrei Darashenka adorosh@ chat. ru Перевод на русский'}, {'en': '2006-02-26 3.5.1', 'ru': '2002- 09- 02 3. 10. 00'}, {'en': 'The Babel & konqueror; plugin gives you quick access to the Babelfish translation service.', 'ru': 'Babelfish, модуль & konqueror;, позволяет быстро обращаться к сервису перевода Babelfish.'}, {'en': 'KDE', 'ru': 'KDE'}]}


We have 180,793 pairs of sentences, but in one single split, so we will need to create our own validation set.

In [15]:
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 162713
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 18080
    })
})
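
The split sizes above can be checked by hand. The sketch below assumes that the training size is rounded down and the remainder goes to the test set; that rounding rule is an assumption about how train_test_split behaves, not something the notebook states.

```python
import math

# Total number of sentence pairs reported for the single "train" split.
n_total = 180793
train_fraction = 0.9

# Assumed rounding rule: floor for the training part, remainder for the test part.
n_train = math.floor(n_total * train_fraction)
n_test = n_total - n_train

print(n_train, n_test)
```

Under that assumption the two numbers agree with the num_rows values printed in the DatasetDict.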

In [16]:
# Rename the "test" key to "validation"
split_datasets["validation"] = split_datasets.pop("test")

In [17]:
# Let’s take a look at one element of the dataset:
split_datasets["train"][1]["translation"]

{'en': 'Destination:', 'ru': 'Место назначения:'}

We can use the translation pipeline from the Transformers library with a model checkpoint trained for a specific language pair, in our case English to Russian.

In [18]:
# Load the pretrained model
from transformers import pipeline

model_checkpoint = "Helsinki-NLP/opus-mt-en-ru"
translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")


[{'translation_text': 'По умолчанию расширенные нити'}]

In [None]:
!pip install transformers

In [None]:
pip install transformers sentencepiece -q

In [None]:
pip install sentencepiece

In [19]:
# Let's take a look at one more example and see how our pretrained model works:
split_datasets["train"][172]["translation"]

{'en': '& Assign Shortcut...', 'ru': 'Привязать & комбинацию клавиш...'}

In [20]:
translator(
    "Unable to import %1 using the OFX importer plugin. This file is not the correct format."
)

[{'translation_text': 'Невозможно импортировать% 1 с помощью импортера OFX. Этот файл не является правильным форматом.'}]

# **Preprocessing**

In natural language processing (NLP), tokenization is the process of breaking text into discrete units called tokens. Tokens are often words, but they can also be phrases, subwords, or even single characters, depending on the application. Tokenization underlies many NLP tasks, including language modeling, machine translation, and text classification. After tokenization, the text is converted into numerical IDs that serve as input for machine learning models.

In [21]:
from transformers import AutoTokenizer

model_checkpoint = "Helsinki-NLP/opus-mt-en-ru"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, return_tensors="pt")

To prepare our data we need to ensure that the tokenizer processes the targets in the output language. We can do this by passing the targets to the text_target argument of the tokenizer's __call__ method.

To see how this works, let’s process one sample of each language in the training set:

In [22]:
en_sentence = split_datasets["train"][1]["translation"]["en"]
ru_sentence = split_datasets["train"][1]["translation"]["ru"]

inputs = tokenizer(en_sentence, text_target=ru_sentence)
inputs

{'input_ids': [23793, 13888, 1969, 38, 0], 'attention_mask': [1, 1, 1, 1, 1], 'labels': [23866, 2992, 38, 0]}

As we can see, the output contains the input IDs for the English sentence, while the IDs for the Russian sentence are stored in the labels field. If we forget to indicate that we are tokenizing labels, they will be tokenized by the input tokenizer, which for this model leads to poor results.

In [23]:
wrong_targets = tokenizer(ru_sentence)
print(tokenizer.convert_ids_to_tokens(wrong_targets["input_ids"]))
print(tokenizer.convert_ids_to_tokens(inputs["labels"]))

['▁', 'М', 'е', 'с', 'т', 'о', '▁', 'н', 'а', 'з', 'н', 'а', 'ч', 'е', 'н', 'и', 'я', ':', '</s>']
['▁Место', '▁назначения', ':', '</s>']


As we can see, applying the English tokenizer to a Russian sentence produces many more tokens, because that tokenizer does not know any Russian words and falls back to single characters.
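
To illustrate why an unfamiliar vocabulary inflates the token count, here is a minimal sketch of a greedy longest-match subword splitter with a single-character fallback. The two vocabularies and the greedy_tokenize helper are hypothetical toys; the real tokenizer uses learned SentencePiece models, which this sketch does not reproduce.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match subword split with single-character fallback."""
    tokens = []
    for word in text.split():
        piece = "▁" + word  # mark the start of a word, as in the output above
        start = 0
        while start < len(piece):
            # Try the longest candidate first, then shrink down to one character.
            for end in range(len(piece), start, -1):
                if piece[start:end] in vocab or end == start + 1:
                    tokens.append(piece[start:end])
                    start = end
                    break
    return tokens


# Hypothetical toy vocabularies, not the real SentencePiece models.
source_vocab = {"▁Destination", ":"}
target_vocab = {"▁Место", "▁назначения", ":"}

sentence = "Место назначения:"
print(greedy_tokenize(sentence, source_vocab))  # falls back to characters
print(greedy_tokenize(sentence, target_vocab))  # finds whole-word pieces
```

With the toy source vocabulary the sentence breaks into single characters, while the toy target vocabulary yields three pieces, which mirrors the contrast in the two printed lists above (apart from the end-of-sequence token).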

Given that 'inputs' is a dictionary following our standard keys (input IDs, attention mask, etc.), the final stage involves defining the preprocessing function that we will utilize on the datasets:

In [24]:
max_length = 128


def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["ru"] for ex in examples["translation"]]
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs

It's worth noting that we've established an identical maximum length for both our input and output sequences, opting for a length of 128 since the texts we are working with appear to be relatively concise.
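
To make the shape of the data seen by preprocess_function concrete, the sketch below builds a hypothetical two-example batch in the layout used with batched=True (one dict whose values are lists) and shows a simplified view of truncation. The truncate helper is an illustration only; the real tokenizer also accounts for special tokens when it truncates.

```python
# A hypothetical batch in the layout passed to the function when batched=True:
# one dict whose values are lists, not a list of dicts.
examples = {
    "id": ["0", "1"],
    "translation": [
        {"en": "Destination:", "ru": "Место назначения:"},
        {"en": "KDE", "ru": "KDE"},
    ],
}

# The same list comprehensions as in preprocess_function.
inputs = [ex["en"] for ex in examples["translation"]]
targets = [ex["ru"] for ex in examples["translation"]]


def truncate(token_ids, max_length):
    """Keep at most max_length IDs (a simplification of truncation=True)."""
    return token_ids[:max_length]


long_sequence = list(range(200))  # stand-in for a 200-token sentence
print(inputs)
print(targets)
print(len(truncate(long_sequence, 128)))
```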

We can now implement this preprocessing across all segments of our dataset simultaneously.

In [25]:
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)

# **Fine-tuning the model with the Trainer API**

We are going to use a model that was trained on a translation task and can actually be used already, so there is no warning about missing weights or newly initialized ones.

In [26]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

For dynamic batching and handling padding, we require a data collator. In this scenario, we cannot simply employ a DataCollatorWithPadding since it exclusively pads the input components (input IDs, attention mask, and token type IDs). We need to ensure that our labels are also padded to match the maximum length found in the label sequences.

In [27]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

To test this on a few samples, we just call it on a list of examples from our tokenized training set:

In [28]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels', 'decoder_input_ids'])

We can check our labels have been padded to the maximum length of the batch, using -100:

In [29]:
batch["labels"]

tensor([[23866,  2992,    38,     0,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100],
        [ 5074,    52,  8527,    31,  2583,   471,  3345,  1680,   295,  1896,
          3038,    12,    92,   363, 10476,    31,  1266,  2168,   355,    12,
             3, 52183,    31,  1266,  2168,   355,    12, 41419,    78,     7,
            40, 22678, 51729,     7,    21,  2629,     3,  5946,   533,  1858,
          2492,   422,    38,     0]])
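
The padding behaviour can be sketched in plain Python. The pad_labels and masked_mean helpers below are hypothetical illustrations, not the collator's code, and the second label list is shortened for readability. The value -100 is used because PyTorch's cross-entropy loss ignores that index by default, so padded positions do not contribute to the loss.

```python
def pad_labels(label_lists, pad_value=-100):
    """Right-pad every label sequence to the longest one in the batch."""
    longest = max(len(labels) for labels in label_lists)
    return [labels + [pad_value] * (longest - len(labels)) for labels in label_lists]


def masked_mean(per_token_losses, labels, ignore_index=-100):
    """Average the losses over positions whose label is not ignore_index."""
    kept = [loss for loss, label in zip(per_token_losses, labels) if label != ignore_index]
    return sum(kept) / len(kept)


# First list is taken from the output above; the second is shortened for illustration.
batch_labels = pad_labels([[23866, 2992, 38, 0], [5074, 52, 8527, 31, 2583, 0]])
print(batch_labels)

# Made-up per-token losses: the two padded positions are excluded from the mean.
print(masked_mean([1.0, 2.0, 3.0, 2.0, 9.0, 9.0], batch_labels[0]))
```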

And we can also have a look at the decoder input IDs, to see that they are shifted versions of the labels:

In [30]:
batch["decoder_input_ids"]

tensor([[62517, 23866,  2992,    38,     0, 62517, 62517, 62517, 62517, 62517,
         62517, 62517, 62517, 62517, 62517, 62517, 62517, 62517, 62517, 62517,
         62517, 62517, 62517, 62517, 62517, 62517, 62517, 62517, 62517, 62517,
         62517, 62517, 62517, 62517, 62517, 62517, 62517, 62517, 62517, 62517,
         62517, 62517, 62517, 62517],
        [62517,  5074,    52,  8527,    31,  2583,   471,  3345,  1680,   295,
          1896,  3038,    12,    92,   363, 10476,    31,  1266,  2168,   355,
            12,     3, 52183,    31,  1266,  2168,   355,    12, 41419,    78,
             7,    40, 22678, 51729,     7,    21,  2629,     3,  5946,   533,
          1858,  2492,   422,    38]])
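
The relationship between labels and decoder input IDs can be sketched as a right shift. The function below is a simplified list-based illustration, not the library's implementation. It assumes, based on the output above, that 62517 serves both as the decoder start token and as the padding token for this checkpoint.

```python
def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    """Prepend the start token, drop the last label, and replace -100 with padding."""
    shifted = [decoder_start_token_id] + labels[:-1]
    return [pad_token_id if token == -100 else token for token in shifted]


# Start of the first row of batch["labels"], cut to six positions for readability.
labels = [23866, 2992, 38, 0, -100, -100]

# 62517 is read off the printed decoder_input_ids (assumed pad and start ID).
decoder_input_ids = shift_tokens_right(labels, pad_token_id=62517, decoder_start_token_id=62517)
print(decoder_input_ids)
```

At each position the decoder therefore sees the previous target token and is trained to predict the current label.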

Here are the labels for the second and third elements of our training set (indices 1 and 2):

In [31]:
for i in range(1, 3):
    print(tokenized_datasets["train"][i]["labels"])

[23866, 2992, 38, 0]
[5074, 52, 8527, 31, 2583, 471, 3345, 1680, 295, 1896, 3038, 12, 92, 363, 10476, 31, 1266, 2168, 355, 12, 3, 52183, 31, 1266, 2168, 355, 12, 41419, 78, 7, 40, 22678, 51729, 7, 21, 2629, 3, 5946, 533, 1858, 2492, 422, 38, 0]


We will pass this data_collator along to the Seq2SeqTrainer. Our next step will be to look at metrics.

In [None]:
!pip install sacrebleu

# **Metrics**

The traditional metric employed for translation evaluation is the BLEU score. This metric assesses the proximity of translations to their reference labels. Notably, it does not gauge the intelligibility or grammatical correctness of the model's generated outputs. Instead, it relies on statistical rules to ensure that all words in the generated outputs are also present in the reference translations. Furthermore, BLEU incorporates rules that penalize repetitive words in the generated output if they do not align with repetition in the reference translations (to prevent the model from producing sentences like "the the the the the") and also penalizes shorter output sentences compared to their reference counterparts (to discourage the model from generating sentences like "the").

However, BLEU has a limitation: it assumes that the text is already tokenized, making it challenging to compare scores across models that use different tokenization methods. Consequently, the prevailing metric for benchmarking translation models today is SacreBLEU, which addresses this limitation and other issues by standardizing the tokenization process. To use this metric, the SacreBLEU library must be installed first; we can then load it through the Evaluate library:

In [32]:
import evaluate

metric = evaluate.load("sacrebleu")

In [None]:
pip install evaluate

This metric is configured to accept both inputs and targets in the form of text. Its design allows for the inclusion of multiple acceptable targets, acknowledging that there can be various valid translations for the same sentence. While the dataset we are currently utilizing provides only a single reference, it is not uncommon in natural language processing (NLP) to encounter datasets that offer multiple sentences as labels. Consequently, when using this metric, the predictions should be presented as a list of sentences, whereas the references should be structured as a list of lists, each containing sentences.

Let's look at an example:

In [33]:
predictions = [
    "This plugin lets you translate web pages between several languages automatically."
]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

{'score': 46.750469682990165,
 'counts': [11, 6, 4, 3],
 'totals': [12, 11, 10, 9],
 'precisions': [91.66666666666667,
  54.54545454545455,
  40.0,
  33.333333333333336],
 'bp': 0.9200444146293233,
 'sys_len': 12,
 'ref_len': 13}

This results in a commendable BLEU score of 46.75, signifying strong performance. Conversely, when we evaluate the two unfavorable prediction types commonly produced by translation models—namely, those characterized by excessive word repetitions or overly brief sentences—we observe notably low BLEU scores:

In [34]:
predictions = ["This This This This"]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

{'score': 1.683602693167689,
 'counts': [1, 0, 0, 0],
 'totals': [4, 3, 2, 1],
 'precisions': [25.0, 16.666666666666668, 12.5, 12.5],
 'bp': 0.10539922456186433,
 'sys_len': 4,
 'ref_len': 13}

In [35]:
predictions = ["This plugin"]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

{'score': 0.0,
 'counts': [2, 1, 0, 0],
 'totals': [2, 1, 0, 0],
 'precisions': [100.0, 100.0, 0.0, 0.0],
 'bp': 0.004086771438464067,
 'sys_len': 2,
 'ref_len': 13}

The score ranges from 0 to 100, with higher values indicating superior performance.
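
To make the arithmetic behind these scores concrete, here is a minimal sketch of sentence-level BLEU with clipped n-gram precisions and a brevity penalty. It is not SacreBLEU itself: the regular-expression tokenizer is a simplified stand-in for its standardized tokenization, and the sketch applies no smoothing, so it returns 0 whenever some n-gram order has no match (SacreBLEU smooths such cases, which is why the repetitive example above still scores slightly above zero).

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplified stand-in for SacreBLEU's tokenizer: words and punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_n=4):
    pred, ref = tokenize(prediction), tokenize(reference)
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matches = sum((pred_counts & ref_counts).values())
        total = sum(pred_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize predictions shorter than the reference.
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


reference = "This plugin allows you to automatically translate web pages between several languages."
score = simple_bleu(
    "This plugin lets you translate web pages between several languages automatically.",
    reference,
)
print(round(score, 2))
```

For the first example the sketch finds 11, 6, 4 and 3 matching n-grams out of 12, 11, 10 and 9, with a brevity penalty of about 0.92, in line with the counts, totals and bp fields reported by the metric.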

To transform the model outputs into texts that the metric can utilize, we will employ the tokenizer.batch_decode() method. It is essential to remove all instances of -100s from the labels (note that the tokenizer will automatically handle the padding token in a similar manner):

In [36]:
import numpy as np


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}
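
The label clean-up step in compute_metrics can be illustrated without the real tokenizer. The vocabulary, IDs and decode helper below are hypothetical toys; the list comprehension has the same effect as the np.where call above, written with plain lists.

```python
# Hypothetical toy vocabulary; the real tokenizer has tens of thousands of entries.
id_to_token = {0: "</s>", 1: "<pad>", 5: "Место", 6: "назначения", 7: ":"}
pad_token_id = 1
special_ids = {0, 1}


def decode(ids):
    """Toy counterpart of decoding with skip_special_tokens=True."""
    return " ".join(id_to_token[i] for i in ids if i not in special_ids)


labels = [[5, 6, 7, 0, -100, -100]]

# -100 is not a valid token ID, so it is swapped for the padding ID before decoding.
cleaned = [[pad_token_id if i == -100 else i for i in row] for row in labels]

# Each reference is wrapped in its own list, as the metric expects a list of lists.
decoded = [[decode(row).strip()] for row in cleaned]
print(cleaned)
print(decoded)
```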

Now we are ready to fine-tune our model.

# **Fine-tuning the model**

In addition to the standard hyperparameters such as learning rate, number of epochs, batch size, and some weight decay, there are a few notable deviations from what we observed in the previous sections:

We refrain from configuring regular evaluations since the evaluation process consumes significant time. Instead, we will assess our model only once before the commencement of training and once again after its completion.

We enable the use of fp16 (floating-point 16-bit precision), a setting that enhances training speed on contemporary GPUs.

We activate the predict_with_generate option so that the trainer calls the model's generate() method during evaluation, producing token IDs that compute_metrics can decode into text.

In [37]:
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "marian-finetuned-kde4-en-to-ru",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    overwrite_output_dir=True,
)

In [38]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Before training, we first evaluate the pretrained model to obtain a baseline score. This lets us check afterwards that fine-tuning has not inadvertently degraded the model's performance.

In [39]:
trainer.evaluate(max_length=max_length)

{'eval_loss': 2.4358930587768555,
 'eval_bleu': 20.884771972743884,
 'eval_runtime': 1307.2189,
 'eval_samples_per_second': 13.831,
 'eval_steps_per_second': 0.216}

A BLEU score of 20.88 falls on the lower end of the scale, indicating that the pretrained model struggles to translate the English sentences in this dataset into Russian.

Now, it's time to proceed with the training phase.

In [40]:
trainer.train()

Step,Training Loss
500,2.203
1000,1.9943
1500,1.8908
2000,1.8244
2500,1.7772
3000,1.7559
3500,1.7065
4000,1.6981
4500,1.6545
5000,1.6214


TrainOutput(global_step=15255, training_loss=1.5475764881935483, metrics={'train_runtime': 2857.4506, 'train_samples_per_second': 170.83, 'train_steps_per_second': 5.339, 'total_flos': 9410862482325504.0, 'train_loss': 1.5475764881935483, 'epoch': 3.0})
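
The step count and throughput in this output can be reproduced from the settings used earlier. The sketch below assumes a single device and no gradient accumulation, neither of which is stated in the notebook.

```python
import math

train_examples = 162713    # size of the training split
batch_size = 32            # per_device_train_batch_size
epochs = 3                 # num_train_epochs
train_runtime = 2857.4506  # seconds, taken from the TrainOutput

# Assumes one device and no gradient accumulation (not stated in the notebook).
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime

print(steps_per_epoch, total_steps, round(samples_per_second, 2))
```

Under those assumptions the total matches global_step=15255, and the throughput agrees with the reported train_samples_per_second.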

After training completes, we evaluate the model again, hoping to see an improvement in the BLEU score.

In [41]:
trainer.evaluate(max_length=max_length)

{'eval_loss': 1.3764605522155762,
 'eval_bleu': 29.18742957939016,
 'eval_runtime': 1380.227,
 'eval_samples_per_second': 13.099,
 'eval_steps_per_second': 0.205,
 'epoch': 3.0}

A score of approximately 29.19 shows that the model's translations have improved over the earlier BLEU score of 20.88. An improvement of about 8.3 points is certainly commendable, but the model's overall performance still falls short of our initial expectations.

In [None]:
pip install transformers[torch]

In [None]:
pip install accelerate -U