#### Using mt5-base for translation

Polish-to-Japanese translation: fine-tuning mT5-base with LoRA on the Tatoeba dataset.

Dataset used:

[Tatoeba](https://opus.nlpl.eu/Tatoeba/corpus/version/Tatoeba)

Citation: J. Tiedemann, 2012, [Parallel Data, Tools and Interfaces in OPUS](http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf). In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012).

### Loading the dataset

In [None]:
!pip install transformers datasets evaluate scikit-learn peft -Uqq

In [None]:
import datasets
from datasets import load_dataset

# Tatoeba data, already split into train/valid/test and converted to the HF dataset format.
# These files are already shuffled; remember to shuffle any other dataset before splitting.
dataset_train = load_dataset("json", data_files="/content/drive/MyDrive/Datasets/Tatoeba_train.json")
dataset_valid = load_dataset("json", data_files="/content/drive/MyDrive/Datasets/Tatoeba_valid.json")
dataset_test = load_dataset("json", data_files="/content/drive/MyDrive/Datasets/Tatoeba_test.json")
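
The JSON files loaded above were prepared outside this notebook, and that step is not shown. The following is a minimal sketch of one way to do it, assuming the sentence pairs are available as `(Polish, Japanese)` tuples: shuffle with a fixed seed, cut the list into three parts, and write one JSON object per line with the `Source`/`Target` keys used later. The function names, the 90/5/5 proportions, and the seed are assumptions for illustration, not the script actually used.

```python
import json
import random

def split_pairs(pairs, valid_frac=0.05, test_frac=0.05, seed=42):
    """Shuffle (source, target) pairs and cut them into train/valid/test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_frac)
    n_test = int(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_valid - n_test
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }

def write_jsonl(pairs, path):
    """Write one {"Source": ..., "Target": ...} object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for source, target in pairs:
            row = {"Source": source, "Target": target}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Toy data standing in for the real Tatoeba pairs
toy_pairs = [(f"zdanie {i}", f"文 {i}") for i in range(100)]
splits = split_pairs(toy_pairs)
print({name: len(rows) for name, rows in splits.items()})
```

Each split can then be written with `write_jsonl(splits["train"], "Tatoeba_train.json")` and read back with `load_dataset("json", ...)` as in the cell above.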


### Train/valid/test split

In [None]:
from datasets import DatasetDict

ds_splits = DatasetDict({
    'train': dataset_train['train'],
    'valid': dataset_valid['train'],
    'test': dataset_test["train"]
})

In [None]:
ds_splits

DatasetDict({
    train: Dataset({
        features: ['Source', 'Target'],
        num_rows: 22350
    })
    valid: Dataset({
        features: ['Source', 'Target'],
        num_rows: 1242
    })
    test: Dataset({
        features: ['Source', 'Target'],
        num_rows: 1242
    })
})

In [None]:
ds_splits["train"][0]

{'Source': 'Dlaczego powiedziałeś coś tak głupiego?',
 'Target': 'どうしてそんなに馬鹿なことを言ったの？'}

### Check GPU availability

In [None]:
import torch


if torch.cuda.is_available():
  print("CUDA available. Device count:")
  print(torch.cuda.device_count())
  device_id = torch.cuda.current_device()
  print(torch.cuda.get_device_name(device_id))
else:
  print("CUDA unavailable")

CUDA available. Device count:
1
NVIDIA A100-SXM4-40GB


### Get the model and wrap it in the peft object

In [None]:
from transformers import T5Tokenizer, MT5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/mt5-base")
original_model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")



In [None]:
from peft import LoraConfig, get_peft_model, TaskType

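lora_config = LoraConfig(
    r=32,                  # rank of the low-rank update matrices
    lora_alpha=32,         # scaling numerator; the update is scaled by lora_alpha / r = 1.0
    lora_dropout=0.05,     # dropout applied on the LoRA branch during training
    task_type=TaskType.SEQ_2_SEQ_LM,
)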

In [None]:
model = get_peft_model(original_model, lora_config)

In [None]:
model.print_trainable_parameters()

trainable params: 3,538,944 || all params: 585,940,224 || trainable%: 0.6040
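
The trainable-parameter count above can be reproduced by hand. The sketch below assumes that PEFT's default for this model type adapts only the `q` and `v` attention projections (the truncated printout shows `q` only), and uses the mT5-base dimensions: 12 encoder and 12 decoder blocks with 768-wide projections. Under those assumptions the arithmetic lands exactly on the reported 3,538,944 parameters.

```python
# Architecture numbers for google/mt5-base (the printed model shows the 768-wide q projection)
d_model = 768          # input width of each attention projection
inner_dim = 768        # output width of q and v (12 heads x 64)
encoder_layers = 12
decoder_layers = 12
r = 32                 # LoRA rank from lora_config

# Assumption: only the q and v projections of each attention module are adapted
adapted_per_attention = 2

# Each encoder block has self-attention; each decoder block has self- and cross-attention
attention_modules = encoder_layers * 1 + decoder_layers * 2

# One adapted projection adds lora_A (d_model x r) and lora_B (r x inner_dim)
params_per_projection = d_model * r + r * inner_dim

trainable = attention_modules * adapted_per_attention * params_per_projection
all_params = 585_940_224  # "all params" as reported by print_trainable_parameters()

print(f"{trainable:,}")
print(f"{100 * trainable / all_params:.4f}%")
```

Because the count scales linearly with `r`, halving the rank to 16 would halve the number of trainable parameters under the same assumptions.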


In [None]:
print(model)

PeftModelForSeq2SeqLM(
  (base_model): LoraModel(
    (model): MT5ForConditionalGeneration(
      (shared): Embedding(250112, 768)
      (encoder): MT5Stack(
        (embed_tokens): Embedding(250112, 768)
        (block): ModuleList(
          (0): MT5Block(
            (layer): ModuleList(
              (0): MT5LayerSelfAttention(
                (SelfAttention): MT5Attention(
                  (q): lora.Linear(
                    (base_layer): Linear(in_features=768, out_features=768, bias=False)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.05, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=768, out_features=32, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=32, out_features=768, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
     

### Test the tokenizer

In [None]:
def test_tokenizer(input_text):
  input_tokenized = tokenizer(input_text, return_tensors="pt")
  print(input_tokenized)
  out = tokenizer.decode(input_tokenized.input_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
  print(f"In: {input_text}")
  print(f"Out: {out}")

test_tokenizer("Samochód")
test_tokenizer("Chodźmy do żabki")
test_tokenizer("ザブカへ行きましょう")

{'input_ids': tensor([[22115, 55337,   285,     1]]), 'attention_mask': tensor([[1, 1, 1, 1]])}
In: Samochód
Out: Samochód
{'input_ids': tensor([[  8144,  15732,   1813,    342,  50478, 111528,      1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
In: Chodźmy do żabki
Out: Chodźmy do żabki
{'input_ids': tensor([[  259, 16786, 11594,  6388,  6031, 68222, 46265,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
In: ザブカへ行きましょう
Out: ザブカへ行きましょう


### Tokenize

In [None]:
def preprocess_function(examples):
    inputs = [f"Translate Polish to Japanese: {source_text}" for source_text in examples["Source"]]
    targets = examples["Target"]

    # Tokenize inputs and targets; pad or truncate both to 128 tokens
    model_inputs = tokenizer(inputs, max_length=128, truncation=True, padding='max_length')
    labels = tokenizer(targets, max_length=128, truncation=True, padding='max_length')
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

# Preprocess the dataset
tokenized_dataset = ds_splits.map(preprocess_function, batched=True)
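
As written, `preprocess_function` pads the labels with the pad token, so the padded positions are scored by the loss like any other token. A common alternative is to replace label padding with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The helper below is a sketch of that variant, not what this notebook ran; the name `mask_label_padding` is hypothetical, and the pad id of 0 is taken from the tokenizer output shown in this notebook.

```python
def mask_label_padding(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace padding ids in a batch of label sequences with ignore_index."""
    return [
        [token if token != pad_token_id else ignore_index for token in sequence]
        for sequence in label_ids
    ]

# Toy batch: two label sequences padded to length 6 with id 0 (EOS is id 1)
labels = [[318, 82729, 1, 0, 0, 0], [52770, 291, 964, 3376, 1, 0]]
print(mask_label_padding(labels))
```

Inside `preprocess_function` this would be applied as `model_inputs["labels"] = mask_label_padding(labels["input_ids"], tokenizer.pad_token_id)`. Two caveats: the `-100` entries must be mapped back to the pad id before calling `tokenizer.decode` on the labels, and losses obtained this way are not comparable with the ones reported in this notebook.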

In [None]:
tokenized_dataset["train"][0]

{'Source': 'Dlaczego powiedziałeś coś tak głupiego?',
 'Target': 'どうしてそんなに馬鹿なことを言ったの？',
 'input_ids': [89349,
  259,
  58459,
  288,
  30865,
  267,
  259,
  30104,
  22099,
  259,
  58942,
  78179,
  964,
  3376,
  756,
  259,
  318,
  82729,
  52770,
  291,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
 

### Training

In [None]:
!pip install "fugashi[unidic-lite]"



In [None]:
from fugashi import Tagger

tagger = Tagger('-Owakati')
def tokenize_japanese(text):
  return [word.surface for word in tagger(text)]

In [None]:
text = "麩菓子は、麩を主材料とした日本の菓子。"
tokenize_japanese(text)

['麩', '菓子', 'は', '、', '麩', 'を', '主材', '料', 'と', 'し', 'た', '日本', 'の', '菓子', '。']

In [None]:
!pip install sacrebleu



In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
from transformers import EarlyStoppingCallback

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=4e-4,
    per_device_train_batch_size=32, # fits on the A100; on a T4, 32 and 16 ran out of memory, 8 was fine
    per_device_eval_batch_size=32,
    num_train_epochs=4,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    report_to="none",
    save_total_limit=2,
    load_best_model_at_end=True,
    save_strategy = "epoch",
    metric_for_best_model='eval_loss',
    predict_with_generate=True
)


trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["valid"],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.0)]
)

trainer.train()



Epoch,Training Loss,Validation Loss
1,0.3516,0.260545
2,0.3036,0.222252
3,0.2826,0.213637
4,0.2823,0.21112


TrainOutput(global_step=2796, training_loss=1.3506807056875187, metrics={'train_runtime': 1510.6134, 'train_samples_per_second': 59.181, 'train_steps_per_second': 1.851, 'total_flos': 2.7041662107648e+16, 'train_loss': 1.3506807056875187, 'epoch': 4.0})
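
The `global_step=2796` in the output follows from the dataset size and batch size. A quick check, assuming a single GPU, no gradient accumulation, and that the final partial batch of each epoch is kept (the defaults are assumed here, since the training arguments above do not override them):

```python
import math

train_rows = 22_350      # rows in the train split
batch_size = 32          # per_device_train_batch_size
epochs = 4               # num_train_epochs

steps_per_epoch = math.ceil(train_rows / batch_size)  # the last batch is smaller
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

Note that `training_loss=1.35` in `TrainOutput` appears to be the mean over all 2796 steps, which would explain why it is much higher than the per-epoch values in the table: the early steps, where the loss is still large, pull the average up.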

In [None]:
from datetime import datetime
save_time = datetime.now()
save_time_str = save_time.strftime("%Y-%m-%d_%H-%M-%S")
save_dir = f"mt5-base-pl-ja-adapter-{save_time_str}"
print("Saving the model")
model.save_pretrained(save_dir)

Saving the model


In [None]:
import os

zip_filename = f"{save_dir}.zip"
drive_path = f"/content/drive/MyDrive/Models/{zip_filename}"
print("Zipping the model")
os.system(f"zip -r {zip_filename} {save_dir}")

Zipping the model


0

In [None]:
os.system(f"mv {zip_filename} '{drive_path}'")

0

In [None]:
print("Model moved to google drive!")

Model moved to google drive!


In [None]:
print("Model filename:")
print(zip_filename)

Model filename:
mt5-base-pl-ja-adapter-2025-02-12_15-01-23.zip
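
The zip-and-move steps above shell out through `os.system`, which depends on the `zip` binary and on correct shell quoting. A standard-library alternative is sketched below using `shutil`; the function name `archive_and_move` and the demo directory names are hypothetical, and the demo runs on a throwaway directory rather than a real adapter.

```python
import shutil
import tempfile
from pathlib import Path

def archive_and_move(model_dir, destination_dir):
    """Zip model_dir and move the archive into destination_dir; return the new path."""
    model_dir = Path(model_dir)
    destination_dir = Path(destination_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)
    # make_archive appends ".zip" itself and returns the path of the archive
    archive_path = shutil.make_archive(
        base_name=str(model_dir),
        format="zip",
        root_dir=str(model_dir.parent),
        base_dir=model_dir.name,
    )
    return Path(shutil.move(archive_path, str(destination_dir)))

# Demo on a temporary directory standing in for the saved adapter
with tempfile.TemporaryDirectory() as tmp:
    adapter_dir = Path(tmp) / "demo-adapter"
    adapter_dir.mkdir()
    (adapter_dir / "adapter_config.json").write_text("{}")
    moved = archive_and_move(adapter_dir, Path(tmp) / "Models")
    print(moved.name, moved.exists())
```

In the notebook this would be called as `archive_and_move(save_dir, "/content/drive/MyDrive/Models")`.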


### Testing the model

In [None]:
tokenizer.decode(tokenized_dataset["train"]["input_ids"][0])

'Translate Polish to Japanese: Dlaczego powiedziałeś coś tak głupiego?</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>'

In [None]:
tokenizer.decode(tokenized_dataset["train"]["labels"][0])

'どうしてそんなに馬鹿なことを言ったの?</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>'

In [None]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def text_to_translation_prompt(source_text):
  return f"Translate Polish to Japanese: {source_text}"

def translate_text(source_text, temperature=0.3, top_k=20):
  input_text = text_to_translation_prompt(source_text)
  input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
  with torch.no_grad():
    output_ids = model.generate(input_ids=input_ids,
                                top_k=top_k,
                                temperature=temperature,
                                do_sample=True)
  print(f"PL: {source_text}")
  print(f"JP: {tokenizer.decode(output_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)}")
  print("---")

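texts_to_translate = [
    # The output below was recorded without the comma after the first string,
    # so the first two sentences were sent to the model as a single input.
    "Pogasiła wszystkie światła o dziesiątej.",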
    "Chodźmy do żabki",
    "Chodźmy do kina",
    "Spokojnie jak na wojnie",
    "lol",
    "Wiedźmin to super gra",
    "Lubię programować",
    "Polski to trudny język",
    "Kalendarz Gregoriański został wprowadzony w 1582 roku",
    "Nie można oczekiwać świetnych wyników od modelu który nie uczył się nawet na całym zbiorze danych",
    "Test test test",
    "Jestem głodny"
]

for text in texts_to_translate:
  translate_text(text)

PL: Pogasiła wszystkie światła o dziesiątej.Chodźmy do żabki
JP: 彼女はすべての光を雨に焚いた。
---
PL: Chodźmy do kina
JP: 私たちは映画に行こう。
---
PL: Spokojnie jak na wojnie
JP: 戦争中だ。
---
PL: lol
JP: 笑う。
---
PL: Wiedźmin to super gra
JP: ミミは素晴らしいゲームだ。
---
PL: Lubię programować
JP: プログラムが好きだ。
---
PL: Polski to trudny język
JP: 日本語は難しすぎる。
---
PL: Kalendarz Gregoriański został wprowadzony w 1582 roku
JP: Gregoriのカレンダーは1582年改められた。
---
PL: Nie można oczekiwać świetnych wyników od modelu który nie uczył się nawet na całym zbiorze danych
JP: そのモデルは全部データに理解できなかった。
---
PL: Test test test
JP: テストテストテストを英語で翻訳した。
---
PL: Jestem głodny
JP: 疲れている。
---


In [None]:
translate_text("samochód")

PL: samochód
JP: 車は運転手だ。
---


In [None]:
texts_to_translate = [
    "Test test test",
    "Mam na imię Adrian",
    "Tłumaczenie jest trudne",
    "Mają ulubioną potrawą jest omlet",
    "Tom ma bardzo szybki samochód",
    "Samochód Toma jest bardzo szybki"
]

for text in texts_to_translate:
  translate_text(text)

PL: Test test test
JP: テストテストテストをテストした。
---
PL: Mam na imię Adrian
JP: 私の名前はAdrianです。
---
PL: Tłumaczenie jest trudne
JP: 翻訳は難しそう。
---
PL: Mają ulubioną potrawą jest omlet
JP: 彼らはお気に入りの料理はオリーブです。
---
PL: Tom ma bardzo szybki samochód
JP: トムは速い車を持っている。
---
PL: Samochód Toma jest bardzo szybki
JP: トムは速い。
---


In [None]:
texts_to_translate = [
    "Zagrajmy w grę",
    "Poczekaj chwilę!",
    "Nie wiem co zrobić",
    "Gdzie jest stacja kolejowa?",
    "Jak dojść na stację kolejową?",
    "Smutno mi",
    "Cicho bądź!"
]

for text in texts_to_translate:
  translate_text(text)

PL: Zagrajmy w grę
JP: ゲームをプレイしましょう。
---
PL: Poczekaj chwilę!
JP: 時間があったら待つ。
---
PL: Nie wiem co zrobić
JP: 何をすればいいか分からない。
---
PL: Gdzie jest stacja kolejowa?
JP: 電車の駅はどこですか。
---
PL: Jak dojść na stację kolejową?
JP: 列車に行きますか。
---
PL: Smutno mi
JP: とても悲しく。
---
PL: Cicho bądź!
JP: あなたは、いいよ。
---


In [None]:
for text in texts_to_translate:
  translate_text(text, temperature=1, top_k=100)

PL: Zagrajmy w grę
JP: ゲームをスタートしましょう。
---
PL: Poczekaj chwilę!
JP: しばらく待ってくれ。
---
PL: Nie wiem co zrobić
JP: 何すればいいのよ。
---
PL: Gdzie jest stacja kolejowa?
JP: 電車の近くの駅はどこですか。
---
PL: Jak dojść na stację kolejową?
JP: なぜ新幹線に乗って帰ったかを知りました。
---
PL: Smutno mi
JP: 私には緊張していると私が悪い。
---
PL: Cicho bądź!
JP: 色いいですね。
---
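
The two runs above differ only in `temperature` and `top_k`. To illustrate why the second run drifts further from the source, here is a small sketch of how these two settings reshape a next-token distribution: only the `top_k` highest-scoring tokens stay eligible, and dividing the logits by the temperature sharpens (below 1) or flattens (above 1) the softmax. The logits are made-up numbers, and the function is an illustration of the idea rather than the library's implementation.

```python
import math

def top_k_temperature_probs(logits, temperature=1.0, top_k=None):
    """Return the sampling distribution after top-k filtering and temperature scaling."""
    # Keep only the top_k highest logits (all of them if top_k is None)
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = set(ranked[:top_k])
    # Divide by the temperature, then apply a numerically stable softmax over kept tokens
    scaled = {i: logits[i] / temperature for i in kept}
    peak = max(scaled.values())
    exps = {i: math.exp(v - peak) for i, v in scaled.items()}
    total = sum(exps.values())
    return [exps[i] / total if i in kept else 0.0 for i in range(len(logits))]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
for temperature, top_k in [(0.3, 2), (1.0, 4)]:
    probs = top_k_temperature_probs(toy_logits, temperature, top_k)
    print(temperature, top_k, [round(p, 3) for p in probs])
```

With the low temperature and small `top_k`, almost all probability mass sits on the best-scoring token; with `temperature=1` and a large `top_k`, unlikely tokens are sampled much more often.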


### Checking the BLEU score

In [None]:
model_predictions = []
batch_size = 64
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

test_input_ids = tokenized_dataset['test']["input_ids"]
true_translations = tokenized_dataset['test']["Target"]

# Generate translations batch by batch; no sampling arguments are passed here,
# so the model's default generation settings are used
for i in range(0, len(test_input_ids), batch_size):
    batch_input_ids = test_input_ids[i:i + batch_size]

    input_ids_tensor = torch.tensor(batch_input_ids).to(device)

    with torch.no_grad():
        output_ids = model.generate(input_ids=input_ids_tensor)

    # Decode each generated sequence back to text, dropping </s> and <pad>
    batch_predictions = [tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=False) for output in output_ids]
    model_predictions.extend(batch_predictions)




In [None]:
model_predictions[0]

'明日、この場所に行ってきます。'

In [None]:
len(model_predictions)

1242

In [None]:
!pip install nltk



In [None]:
from fugashi import Tagger

# Same word-level tokenizer as defined in the Training section
tagger = Tagger('-Owakati')
def tokenize_japanese(text):
  return [word.surface for word in tagger(text)]

tokenized_target = []
tokenized_predictions = []

for text in tokenized_dataset['test']["Target"]:
    tokenized_target.append([tokenize_japanese(text)]) # Single reference

In [None]:
for text in model_predictions:
    tokenized_predictions.append(tokenize_japanese(text))

In [None]:
tokenized_dataset['test']["Target"][0]

'明日の今頃は大阪を見物しているでしょう。'

In [None]:
tokenized_target[0]

[['明日', 'の', '今頃', 'は', '大阪', 'を', '見物', 'し', 'て', 'いる', 'でしょう', '。']]

In [None]:
tokenized_predictions[0]

['明日', '、', 'この', '場所', 'に', '行っ', 'て', 'き', 'ます', '。']

In [None]:
len(tokenized_target)

1242

In [None]:
len(tokenized_predictions)

1242

In [None]:
from nltk.translate.bleu_score import corpus_bleu

# Each entry of tokenized_target is a list of references (one per sentence here)
corpus_bleu(tokenized_target, tokenized_predictions)

0.12440461525629558
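
To make the score less opaque, the sketch below computes the building blocks of BLEU for the first test sentence, using the tokenized reference and prediction printed above: clipped n-gram matches and the brevity penalty. This is a simplified illustration with hypothetical helper names; the corpus-level score sums such counts over all sentences before combining the 1- to 4-gram precisions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, hypothesis, n):
    """Clipped n-gram matches and total hypothesis n-grams for a single reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matches, sum(hyp_counts.values())

reference = ['明日', 'の', '今頃', 'は', '大阪', 'を', '見物', 'し', 'て', 'いる', 'でしょう', '。']
hypothesis = ['明日', '、', 'この', '場所', 'に', '行っ', 'て', 'き', 'ます', '。']

for n in (1, 2):
    print(n, modified_precision(reference, hypothesis, n))

# Brevity penalty: penalises hypotheses that are shorter than the reference
brevity_penalty = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
print(round(brevity_penalty, 4))
```

For this pair, 3 of the 10 predicted unigrams appear in the reference and none of the 9 bigrams do, which shows how a fluent but wrong translation contributes almost nothing to the corpus score.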