# Fine-tuning mT5 for Low-resource Languages

## mT5

Multilingual T5 ([mT5](https://github.com/google-research/multilingual-t5)) is a massively multilingual pretrained text-to-text
transformer model, trained following a similar recipe as
[T5](https://github.com/google-research/text-to-text-transfer-transformer). T5 was introduced in the paper [_Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer_](https://arxiv.org/abs/1910.10683). 

<img src="https://1.bp.blogspot.com/-o4oiOExxq1s/Xk26XPC3haI/AAAAAAAAFU8/NBlvOWB84L0PTYy9TzZBaLf6fwPGJTR0QCLcBGAsYHQ/s1600/image3.gif" width="700" height="300" />

## Languages covered

mT5 is pretrained on the [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual_nights_stay) corpus, covering 101 languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque,
Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese,
Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino,
Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole,
Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian,
Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish,
Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy,
Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian,
Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan,
Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali,
Sotho, Spanish, Sundanese, **Swahili**, Swedish, Tajik, Tamil, Telugu, Thai,
Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa,
Yiddish, Yoruba, Zulu.

*PS: This notebook is built on Kaggle using ***GPU T4x2*** accelerator and it is prepared based on https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/flan-t5-samsum-summarization.ipynb*

## Task: Instruct mT5 to summarize Swahili content

We use [**"XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages"**](https://aclanthology.org/2021.findings-acl.413/) to evaluate summarization performance in Swahili

In [None]:
# python
!pip install pytesseract transformers datasets py7zr --upgrade
!pip install evaluate rouge-score

In [5]:
from datasets import load_dataset

# Only use the 10% of the test split for a fast demonstration of evaluation 
xlsum_swa_test = load_dataset("csebuetnlp/xlsum", "swahili", split='test[:10%]')

In [6]:
swa_sample = xlsum_swa_test[0]

print(f"Text: \n{swa_sample['text']}\n---------------")
print(f"Summary: \n{swa_sample['summary']}\n---------------")

Text: 
Messi na Mbappe Ingawa kila mwamba ngoma, ngozi huivutia kwake, ukweli ni kuwa, wachezaji wapya wametumia vyema jukwaa lililopatikana Kombe la Dunia Urusi 2018. Baada ya kutawala vinywa vya wengi ndani ya zaidi ya kipindi cha miaka 10, kuondoka kwa Argentina na Ureno Urusi hatua za mchujo, imewanyima Messi na Ronaldo muda zaidi wa kujadiliwa. Kurunzi za soka sasa zimeelekezwa kwa Neymar wa Brazil, Paul Pogba, Kylian Mbappe na Antoine Griezmann wa Ufaransa, Edison Cavani na Luis Suarez wa Uruguay, Harry Kane wa Uingereza na Romelu Lukaku wa Ubelgiji. Orodha ni ndefu, lakini yote yataamuliwa kulingana na mchango wa wachezaji hawa kwa ufanisi wa timu zao. Mfungaji bora kombe la Dunia 2018 Hatua hii inamaanisha kwamba Kuondoka kwa Ureno imekuwa pigo la kipekee kwa Ronaldo, aliyejitahidi kuzidisha mabao yake Urusi kwani ni wazi sasa wenzake watampiku kwenye uwaniaji wa tuzo ya mfungaji bora. Ronaldo, alianza vyema baada ya kuondoka na mpira wa mechi dhidi ya Uhispania alipofunga hat-

In [3]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id="google/mt5-base"

# Load tokenizer of mt5-base
tokenizer = AutoTokenizer.from_pretrained(model_id)

max_input_len = 256
max_target_len = 64

tokenizer_config.json:   0%|          | 0.00/376 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/702 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/4.31M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [4]:
def preprocess_xlsum(examples, padding="max_length"):
    inputs = [f'Summarize the follow text:\n{text}' for text in examples["text"]]
    
     # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_input_len, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=examples["summary"], max_length=max_target_len, padding=padding, truncation=True)
    
    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_xlsum_swa_test = xlsum_swa_test.map(preprocess_xlsum, batched=True)

Map:   0%|          | 0/99 [00:00<?, ? examples/s]

In [5]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Metric
metric = evaluate.load("rouge")

# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Replace -100 in the predictions as we can't decode them.    
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

2024-02-21 04:11:44.017316: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-21 04:11:44.017440: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-21 04:11:44.187700: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [6]:
from transformers import AutoModelForSeq2SeqLM

# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

pytorch_model.bin:   0%|          | 0.00/2.33G [00:00<?, ?B/s]

  return self.fget.__get__(instance, owner)()


generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

## Q1: Can we leverage an existing high quality instruction dataset for this task which are commonly only available **in English**

In [7]:
dolly_english = load_dataset("databricks/databricks-dolly-15k")

def preprocess_dolly(examples, padding="max_length"):
    inputs = []
    targets = []
    for instruction, context in zip(examples["instruction"], examples["context"]):
        if len(context) > 0:
          inputs.append(f'{instruction}\nContext: {context}')
        else:
          inputs.append(instruction)
    
    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_input_len, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=examples["response"], max_length=max_target_len, padding=padding, truncation=True)
    
    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dolly_english = dolly_english.map(preprocess_dolly, batched=True)

Downloading readme:   0%|          | 0.00/8.20k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/13.1M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/15011 [00:00<?, ? examples/s]

In [10]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

In [11]:
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

train_batch_size = 2
eval_batch_size = 8

# Hugging Face repository id
eng_model_id = f"mt5_base_eng_dolly"

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir=eng_model_id,
        per_device_train_batch_size=train_batch_size,
        per_device_eval_batch_size=eval_batch_size,
gradient_accumulation_steps=4,
        predict_with_generate=True,
        #fp16=True, # Overflows with fp16
        learning_rate=1e-3,
        max_steps=1000,
        # logging & evaluation strategies
        logging_strategy="steps",
        logging_steps=250,
        evaluation_strategy="no",
        save_strategy="no",
        load_best_model_at_end=True,
        generation_max_length=max_target_len,
        report_to="none",
        push_to_hub=False,
    ),
    data_collator=data_collator,
    train_dataset=tokenized_dolly_english['train'],
    eval_dataset=tokenized_xlsum_swa_test,
    compute_metrics=compute_metrics,
)

In [12]:
# Start training
trainer_en.train()

# Save the model
trainer_en.save_model()



Step,Training Loss
250,4.92
500,3.6492
750,3.4266
1000,3.2679


In [12]:
# Evaluate the fine-tuned model
trainer.evaluate()



{'eval_loss': 5.133183479309082,
 'eval_rouge1': 11.5638,
 'eval_rouge2': 4.4385,
 'eval_rougeL': 10.0082,
 'eval_rougeLsum': 10.0471,
 'eval_gen_len': 46.17171717171717,
 'eval_runtime': 16.8475,
 'eval_samples_per_second': 5.876,
 'eval_steps_per_second': 0.415,
 'epoch': 0.27}

In [16]:
import torch

# Free memory for the second training 
del model
del trainer_en
torch.cuda.empty_cache()

In [29]:
# English example
en_context = "Gundogan, 26, told BBC Sport he can see the finishing line after tearing cruciate knee ligaments in December, but will not rush his return. The German missed the 2014 World Cup following back surgery that kept him out for a year, and sat out Euro 2016 because of a dislocated kneecap. He said that it is heavy mentally to accept that. Gundogan will not be fit for the start of the Premier League season at Brighton on 12 August but said his recovery time is now being measured in weeks rather than months. Gundogan made 15 appearances and scored five goals in his debut season for City following his £20m move from Borussia Dortmund. He is eager to get on the field again and was impressed at the club's 4-1 win over Real Madrid in a pre-season game in Los Angeles on Wednesday. Manager Pep Guardiola has made five new signings already this summer and continues to have an interest in Arsenal forward Alexis Sanchez and Monaco's Kylian Mbappe. Gundogan said that we felt that last year as well but it was a completely new experience for all of us. We know the Premier League a bit more now and can't wait for the season to start." "City complete their three-match tour of the United States against Tottenham in Nashville on Saturday. Chelsea manager Antonio Conte said earlier this week he did not feel Tottenham were judged by the same standards as his own side, City and Manchester United. Spurs have had the advantage in their recent meetings with City, winning three and drawing one of their last four Premier League games. And Gundogan thinks they are a major threat."
prompt = f"Summarize the following text:\n{en_context}"
inputs = tokenizer(prompt, max_length=256, return_tensors="pt").to("cuda")
output = model.generate(**inputs, do_sample=False)
output = tokenizer.batch_decode(output, skip_special_tokens=True)
print(f"Text: \n{prompt}\n---------------")
print(f"Summary: \n{output}\n---------------")

Text: 
Summarize the following text:
 Gundogan, 26, told BBC Sport he can see the finishing line after tearing cruciate knee ligaments in December, but will not rush his return. The German missed the 2014 World Cup following back surgery that kept him out for a year, and sat out Euro 2016 because of a dislocated kneecap. He said that it is heavy mentally to accept that. Gundogan will not be fit for the start of the Premier League season at Brighton on 12 August but said his recovery time is now being measured in weeks rather than months. Gundogan made 15 appearances and scored five goals in his debut season for City following his £20m move from Borussia Dortmund. He is eager to get on the field again and was impressed at the club's 4-1 win over Real Madrid in a pre-season game in Los Angeles on Wednesday. Manager Pep Guardiola has made five new signings already this summer and continues to have an interest in Arsenal forward Alexis Sanchez and Monaco's Kylian Mbappe. Gundogan said that

In [35]:
# Swahili example
swa_text = swa_sample["text"]
prompt = f"Summarize the following text:\n{swa_text}"
inputs = tokenizer(prompt, max_length=256, return_tensors="pt").to("cuda")
output = model.generate(**inputs, do_sample=False, max_new_tokens=64)
output = tokenizer.batch_decode(output, skip_special_tokens=True)
print(f"Text: \n{prompt}\n---------------")
print(f"Summary: \n{output}\n---------------")

Text: 
Summarize the following text:
 Magari zaidi ya 50 aina ya Toyota Prado hayajapatikana Chama tawala kilihesabu magari hayo mwezi mmoja kabla ya kuingia madarakani baada ya kupata ushindi uchaguzini mwezi Desemba. Imekuwa kawaida kwa baadhi ya maafisa wa serikali inayoondoka kutorejesha magari ya serikali, na hulazimu serikali mpya kuyatwaa kwa nguvu nchini Ghama. Waziri mmoja katika serikali iliyoondoka ya John Mahama hata hivyo amesema kuenezwa kwa habari kwamba wenzake walitekeleza uhalifu ni makosa. Aliyekuwa waziri wa usalama Omane Boamah ameambia mwandishi wa BBC Thomas Naadi kwamba hiyo ni "mbinu inayotumiwa na serikali mpya kuipa sababu za kununua magari mapya." Msemaji wa rais Eugene Arhin aliambia wanahabari kwamba maafisa wa serikali mpya walipata magari: Kituo cha redio cha Citi FM nchini Ghana kimeripoti kwamba rais amelazimika kutumia gari aina ya BMW lililoundwa miaka 10 iliyopita kutokana na kutorejeshwa kwa magari hayo. Nana Akufo-Addo (kulia) alimshinda John Maha

## Q2: How beneficial is the instruction dataset in the target language?

In [None]:
# Our Swahili instruction dataset
# This dataset is translated from Dolly-15k English instructions, later filtered and post-edited by Toloka
!wget https://github.com/AligningLLMtoLRL/AligningLLMtoLRL.github.io/raw/main/materials/Dataset.zip
!unzip Dataset.zip

In [17]:
import pandas as pd

dolly_swahili_df = pd.read_excel("/kaggle/working/translated_ds.xlsx")
dolly_swahili_df.head(2)

Unnamed: 0,task_id,INPUT:context_tr,INPUT:context_src,INPUT:response_tr,INPUT:response_src,INPUT:instruction_tr,INPUT:instruction_src,toloka probabilities
0,000287b55d--656f562fa7ccfa2fa62cbad5,"""I'm So Excited"" ni wimbo wa mwimbaji wa Aust...","""I'm So Excited"" is a song by Australian singe...","""I'm So Excited"" ni wimbo wa mwimbaji wa Austr...","""I'm So Excited"" is a song by Australian singe...",Ni nani mwimbaji wa wimbo wa I'm So Excited?,Who is the singer of the song I'm So Excited?,0.988446
1,000287b55d--656f562fa7ccfa2fa62cbb0b,,,Kupanga safari ya kwenda Ulaya ni sawa na kupa...,Planning a trip to Europe is similar to planni...,"Je, nifanyeje kuhusu kupanga safari ya kwenda...",How should I go about planning a trip to Europe?,0.982769


In [18]:
from datasets import Dataset

# Load our Swahili instruction dataset
# This dataset is translated from Dolly-15k English instructions, later filtered and post-edited
dolly_swahili = Dataset.from_pandas(dolly_swahili_df)

# Modify dataset to make consistent with original dolly
dolly_swahili = dolly_swahili.rename_column("INPUT:context_tr", "context")
dolly_swahili = dolly_swahili.rename_column("INPUT:instruction_tr", "instruction")
dolly_swahili = dolly_swahili.rename_column("INPUT:response_tr", "response")

# Preprocess the dataset
tokenized_dolly_swahili = dolly_swahili.map(preprocess_dolly, batched=True,
                                           remove_columns=["INPUT:context_src", "INPUT:instruction_src", "INPUT:response_src", "toloka probabilities", "task_id"]))

Map:   0%|          | 0/12125 [00:00<?, ? examples/s]

In [21]:
# load mT5 from the hub for a new fine-tuning
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

train_batch_size = 2
eval_batch_size = 8

# Hugging Face repository id
swa_model_id = f"mt5_base_swa_dolly"

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir=swa_model_id,
        per_device_train_batch_size=train_batch_size,
        per_device_eval_batch_size=eval_batch_size,
        gradient_accumulation_steps=4,
        predict_with_generate=True,
        #fp16=True, # Overflows with fp16
        learning_rate=1e-3,
        max_steps=1000,
        # logging & evaluation strategies
        logging_strategy="steps",
        logging_steps=250,
        evaluation_strategy="no",
        save_strategy="no",
        load_best_model_at_end=True,
        generation_max_length=max_target_len,
        report_to="none",
        push_to_hub=False,
    ),
    data_collator=data_collator,

In [22]:
# Start training
trainer.train()

# Save the model
trainer.save_model()



Step,Training Loss
250,5.9137
500,3.8241
750,3.5676
1000,3.346


In [27]:
# Evaluate the fine-tuned model
trainer.evaluate()



{'eval_loss': 3.438464641571045,
 'eval_rouge1': 18.7487,
 'eval_rouge2': 5.0003,
 'eval_rougeL': 15.0974,
 'eval_rougeLsum': 15.1395,
 'eval_gen_len': 35.81818181818182,
 'eval_runtime': 19.2231,
 'eval_samples_per_second': 5.15,
 'eval_steps_per_second': 0.364}

In [16]:
# Swahili example
swa_text = swa_sample["text"]
prompt = f"Summarize the following text:\n{swa_text}"
inputs = tokenizer(prompt, max_length=256, return_tensors="pt").to("cuda")
output = model.generate(**inputs, do_sample=False, max_new_tokens=64)
output = tokenizer.batch_decode(output, skip_special_tokens=True)
print(f"Text: \n{prompt}\n---------------")
print(f"Summary: \n{output}\n---------------")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Text: 
Summarize the following text:
 Magari zaidi ya 50 aina ya Toyota Prado hayajapatikana Chama tawala kilihesabu magari hayo mwezi mmoja kabla ya kuingia madarakani baada ya kupata ushindi uchaguzini mwezi Desemba. Imekuwa kawaida kwa baadhi ya maafisa wa serikali inayoondoka kutorejesha magari ya serikali, na hulazimu serikali mpya kuyatwaa kwa nguvu nchini Ghama. Waziri mmoja katika serikali iliyoondoka ya John Mahama hata hivyo amesema kuenezwa kwa habari kwamba wenzake walitekeleza uhalifu ni makosa. Aliyekuwa waziri wa usalama Omane Boamah ameambia mwandishi wa BBC Thomas Naadi kwamba hiyo ni "mbinu inayotumiwa na serikali mpya kuipa sababu za kununua magari mapya." Msemaji wa rais Eugene Arhin aliambia wanahabari kwamba maafisa wa serikali mpya walipata magari: Kituo cha redio cha Citi FM nchini Ghana kimeripoti kwamba rais amelazimika kutumia gari aina ya BMW lililoundwa miaka 10 iliyopita kutokana na kutorejeshwa kwa magari hayo. Nana Akufo-Addo (kulia) alimshinda John Maha