# Aya: Multilingual Instruction Following 

The [Aya-101](https://huggingface.co/CohereForAI/aya-101) is a massively multilingual generative language model that follows instructions in 101 languages. Aya outperforms mT0 and BLOOMZ a wide variety of automatic and human evaluations despite covering double the number of languages. The Aya model is trained using [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset), [Aya Collection](https://huggingface.co/datasets/CohereForAI/aya_collection), [xP3x](https://huggingface.co/datasets/CohereForAI/xP3x), a subset of [DataProvenance collection](https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection) and ShareGPT-Command. 

[Aya-101](https://huggingface.co/CohereForAI/aya-101) is based on 13 billion parameter [mT5](https://github.com/google-research/multilingual-t5) model and further instruction fine-tuned by [Cohere For AI](https://cohere.com/research). 

<img src="https://huggingface.co/CohereForAI/aya-101/resolve/main/aya-fig1.png" width="1000" height="600"/>

*PS: This notebook is built on Kaggle using ***GPU T4x2*** accelerator and it is prepared based on https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/flan-t5-samsum-summarization.ipynb and https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k?usp=sharing*

In [None]:
!pip install transformers sentencepiece --upgrade
!pip install datasets --upgrade
!pip install ipywidgets torch
!pip install evaluate rouge-score nltk

In [None]:
!pip install peft --upgrade
!pip install accelerate bitsandbytes loralib --upgrade 

## Task: Instruct Aya-101 to summarize Swahili content

We use [**"XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages"**](https://aclanthology.org/2021.findings-acl.413/) to evaluate summarization performance in Swahili

In [3]:
from datasets import load_dataset 

# Only use the 10% of the test split for a fast demonstration of evaluation 
xlsum_swa_test = load_dataset("csebuetnlp/xlsum", "swahili", split='test[:10%]')

Downloading data:   0%|          | 0.00/12.2M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.59M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.62M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7898 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/987 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/987 [00:00<?, ? examples/s]

In [4]:
swa_sample = xlsum_swa_test[1]

print(f"Text: \n{swa_sample['text']}\n---------------")
print(f"Summary: \n{swa_sample['summary']}\n---------------")

Text: 
Trump alijitetea kwenye mdahalo Jumapili kwa kumshambulia mumewe Hillary Clinton, Bill Clinton Spika wa Bunge la Wawakilishi Paul Ryan ameapa kuangazia sasa kuhakikisha wagombea wa chama hicho wanatetea viti vyao katika Bunge la Congress. Hata hivyo, hajabatilisha uamuzi wake wa kumuidhinisha mgombea huyo. Bw Trump naye amemjibu Bw Ryan kupitia mtandao wa Twitter na kusema hafai kupoteza wakati akimpigania. Awali, mpinzani wa Bw Trump kutoka chama cha Democratic Hillary Clinton ametilia shaka hatua ya Bw Trump kuomba radhi kutokana na matamshi hayo aliyoyatoa miaka 11 iliyopita. Kwenye kadha hiyo ya video, Bw Trump anaonekana akisema alivyoomba kushiriki mapenzi na mwanamke aliyeolewa. Aidha, anatoa matamshi ya kudhalilisha kuhusu wanawake. Lakini Jumapili, Bw Trump alisema maneno yake kwenye kanda hiyo ya video yalikuwa "mazungumzo ya mzaha faraghani". Hata hivyo alisema anajutia kuyasema. Akizungumza wakati wa mdahalo wa urais Jumapili, Bw Trump hata hivyo alisema hakumnyanyas

# [Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model](https://arxiv.org/abs/2402.07827)

In [5]:
from datasets import load_dataset

# Load Aya collection
aya_collection = load_dataset("CohereForAI/aya_collection", "translated_wikiqa")
print(aya_collection)

# List of languages
languages = sorted(set(aya_collection["train"]["language"]))
print(f'Number of languages in Aya collection, translated_dolly subset: {len(languages)}\n')
print(f'Languages in Aya collection, translated_dolly subset: {languages}\n')

Downloading readme:   0%|          | 0.00/71.1k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/6.46M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/23.1M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/3.25M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/34867 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/123760 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/16660 [00:00<?, ? examples/s]

DatasetDict({
    test: Dataset({
        features: ['id', 'inputs', 'targets', 'dataset_name', 'sub_dataset_name', 'task_type', 'template_id', 'language', 'script', 'split'],
        num_rows: 34867
    })
    train: Dataset({
        features: ['id', 'inputs', 'targets', 'dataset_name', 'sub_dataset_name', 'task_type', 'template_id', 'language', 'script', 'split'],
        num_rows: 123760
    })
    validation: Dataset({
        features: ['id', 'inputs', 'targets', 'dataset_name', 'sub_dataset_name', 'task_type', 'template_id', 'language', 'script', 'split'],
        num_rows: 16660
    })
})
Number of languages in Aya collection, translated_dolly subset: 113

Languages in Aya collection, translated_dolly subset: ['ace', 'acm', 'acq', 'aeb', 'afr', 'ajp', 'als', 'amh', 'apc', 'arb', 'ars', 'ary', 'arz', 'azb', 'azj', 'bel', 'ben', 'bjn', 'bul', 'cat', 'ceb', 'ces', 'ckb', 'cym', 'dan', 'deu', 'ell', 'eng', 'epo', 'est', 'eus', 'fin', 'fra', 'gla', 'gle', 'glg', 'guj', 'hat', 'hau',

In [6]:
# Load Aya dataset
aya_dataset = load_dataset("CohereForAI/aya_dataset")
print(aya_dataset)

# List of languages
languages = sorted(set(aya_dataset["train"]["language_code"]))
print(f'Number of languages in Aya dataset: {len(languages)}\n')
print(f'Languages in Aya dataset: {languages}\n')

# Annotation type in the dataset
annotation_type = sorted(set(aya_dataset["train"]["annotation_type"]))
print(f'Annotation types in Aya dataset: {annotation_type}')

Downloading readme:   0%|          | 0.00/13.3k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/137M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/978k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/202364 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1750 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['inputs', 'targets', 'language', 'language_code', 'annotation_type', 'user_id'],
        num_rows: 202364
    })
    test: Dataset({
        features: ['inputs', 'targets', 'language', 'language_code', 'annotation_type', 'user_id'],
        num_rows: 1750
    })
})
Number of languages in Aya dataset: 70

Languages in Aya dataset: ['acq', 'ajp', 'als', 'amh', 'arb', 'ars', 'ary', 'arz', 'ben', 'ceb', 'ckb', 'dan', 'deu', 'ell', 'eng', 'eus', 'fil', 'fin', 'fra', 'gle', 'guj', 'hat', 'hau', 'hin', 'hun', 'ibo', 'ind', 'ita', 'jav', 'jpn', 'kan', 'kir', 'kor', 'lit', 'mal', 'mar', 'mya', 'nld', 'npi', 'nso', 'nya', 'pan', 'pbt', 'pes', 'plt', 'pol', 'por', 'rus', 'sin', 'sna', 'snd', 'som', 'spa', 'srp', 'sun', 'swe', 'swh', 'tam', 'tel', 'tha', 'tur', 'ukr', 'urd', 'vie', 'wol', 'xho', 'yor', 'zho', 'zsm', 'zul']

Annotation types in Aya dataset: ['original-annotations', 're-annotations']


In [None]:
import torch
from transformers import AutoModelForSeq2SeqLM
from transformers import logging as hf_logging

aya = AutoModelForSeq2SeqLM.from_pretrained("CohereForAI/aya-101", torch_dtype=torch.bfloat16, device_map='auto')

In [7]:
from transformers import AutoTokenizer

model_id = "CohereForAI/aya-101"
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [75]:
# Swahili example
swa_text = swa_sample["text"]
prompt = f"Summarize the following text:\n{swa_text}"
inputs = tokenizer(prompt, max_length=256, return_tensors="pt").to("cuda")
output = aya.generate(**inputs, do_sample=False, max_new_tokens=64)
output = tokenizer.batch_decode(output, skip_special_tokens=True)
print(f"Text: \n{prompt}\n---------------")
print(f"Summary: \n{output}\n---------------")

Text: 
Summarize the following text:
Trump alijitetea kwenye mdahalo Jumapili kwa kumshambulia mumewe Hillary Clinton, Bill Clinton Spika wa Bunge la Wawakilishi Paul Ryan ameapa kuangazia sasa kuhakikisha wagombea wa chama hicho wanatetea viti vyao katika Bunge la Congress. Hata hivyo, hajabatilisha uamuzi wake wa kumuidhinisha mgombea huyo. Bw Trump naye amemjibu Bw Ryan kupitia mtandao wa Twitter na kusema hafai kupoteza wakati akimpigania. Awali, mpinzani wa Bw Trump kutoka chama cha Democratic Hillary Clinton ametilia shaka hatua ya Bw Trump kuomba radhi kutokana na matamshi hayo aliyoyatoa miaka 11 iliyopita. Kwenye kadha hiyo ya video, Bw Trump anaonekana akisema alivyoomba kushiriki mapenzi na mwanamke aliyeolewa. Aidha, anatoa matamshi ya kudhalilisha kuhusu wanawake. Lakini Jumapili, Bw Trump alisema maneno yake kwenye kanda hiyo ya video yalikuwa "mazungumzo ya mzaha faraghani". Hata hivyo alisema anajutia kuyasema. Akizungumza wakati wa mdahalo wa urais Jumapili, Bw Trump h

In [8]:
max_input_len = 256
max_target_len = 64

def preprocess_xlsum(examples, padding="max_length"):
    inputs = [f'Summarize the follow text:\n{text}' for text in examples["text"]]
    
     # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_input_len, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=examples["summary"], max_length=max_target_len, padding=padding, truncation=True)
    
    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_xlsum_swahili_test = xlsum_swa_test.map(preprocess_xlsum, batched=True)

Map:   0%|          | 0/99 [00:00<?, ? examples/s]

In [None]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=aya,
    label_pad_token_id=label_pad_token_id, # tokenizer.pad_token_id,
    pad_to_multiple_of=8
)

In [9]:
import evaluate
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

# Metric
metric = evaluate.load("rouge")

# helper function to postprocess text
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result

2024-02-21 12:00:32.548458: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-21 12:00:32.548590: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-21 12:00:32.690888: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [116]:
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

train_batch_size = 0
eval_batch_size = 1

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir=model_id,
        do_train=False,
        per_device_eval_batch_size=eval_batch_size,
        predict_with_generate=True,
        generation_max_length=max_target_len,
        report_to="none",
        push_to_hub=False,
    ),
    data_collator=data_collator,
    train_dataset=tokenized_dolly_english['train'],
    eval_dataset=tokenized_xlsum_swa_test,
    compute_metrics=compute_metrics,
)

In [117]:
trainer.evaluate()

{'eval_loss': 1.3297851085662842,
 'eval_rouge1': 35.1247,
 'eval_rouge2': 14.4541,
 'eval_rougeL': 27.442,
 'eval_rougeLsum': 27.7463,
 'eval_gen_len': 40.3,
 'eval_runtime': 67.4628,
 'eval_samples_per_second': 0.148,
 'eval_steps_per_second': 0.148}

In [None]:
import torch

# Free memory for the second training 
del model
del trainer_en
torch.cuda.empty_cache()

In [None]:
# Our Swahili instruction dataset
# This dataset is translated from Dolly-15k English instructions, later filtered and post-edited by Toloka
!wget https://github.com/AligningLLMtoLRL/AligningLLMtoLRL.github.io/raw/main/materials/Dataset.zip
!unzip Dataset.zip

In [38]:
import pandas as pd

dolly_swahili_df = pd.read_excel("/kaggle/working/translated_ds.xlsx")
dolly_swahili_df.head(2)

Unnamed: 0,task_id,INPUT:context_tr,INPUT:context_src,INPUT:response_tr,INPUT:response_src,INPUT:instruction_tr,INPUT:instruction_src,toloka probabilities
0,000287b55d--656f562fa7ccfa2fa62cbad5,"""I'm So Excited"" ni wimbo wa mwimbaji wa Aust...","""I'm So Excited"" is a song by Australian singe...","""I'm So Excited"" ni wimbo wa mwimbaji wa Austr...","""I'm So Excited"" is a song by Australian singe...",Ni nani mwimbaji wa wimbo wa I'm So Excited?,Who is the singer of the song I'm So Excited?,0.988446
1,000287b55d--656f562fa7ccfa2fa62cbb0b,,,Kupanga safari ya kwenda Ulaya ni sawa na kupa...,Planning a trip to Europe is similar to planni...,"Je, nifanyeje kuhusu kupanga safari ya kwenda...",How should I go about planning a trip to Europe?,0.982769


In [23]:
from datasets import Dataset

# Load our Swahili instruction dataset
dolly_swahili = Dataset.from_pandas(dolly_swahili_df)

def preprocess_dolly(examples, padding="max_length"):
    inputs = []
    targets = []
    for instruction, context in zip(examples["INPUT:instruction_tr"], examples["INPUT:context_tr"]):
        if len(context) > 0:
          inputs.append(f'{instruction}\nContext: {context}')
        else:
          inputs.append(instruction)
    
    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_input_len, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=examples["INPUT:response_tr"], max_length=max_target_len, padding=padding, truncation=True)
    
    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Preprocess the dataset
tokenized_dolly_swahili = dolly_swahili.map(preprocess_dolly, batched=True, 
                                            remove_columns=["INPUT:context_src", "INPUT:instruction_src", "INPUT:response_src", "toloka probabilities", "task_id"])

Map:   0%|          | 0/12125 [00:00<?, ? examples/s]

In [24]:
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

config.json:   0%|          | 0.00/836 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/11 [00:00<?, ?it/s]

model-00001-of-00011.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00011.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00011.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00004-of-00011.safetensors:   0%|          | 0.00/4.90G [00:00<?, ?B/s]

model-00005-of-00011.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00006-of-00011.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00007-of-00011.safetensors:   0%|          | 0.00/4.87G [00:00<?, ?B/s]

model-00008-of-00011.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00009-of-00011.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00010-of-00011.safetensors:   0%|          | 0.00/2.99G [00:00<?, ?B/s]

model-00011-of-00011.safetensors:   0%|          | 0.00/4.10G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/11 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/142 [00:00<?, ?B/s]

In [25]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [26]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8, 
    lora_alpha=32, 
    target_modules=["k", "q", "v"],
    lora_dropout=0.05, 
    bias="none", 
    task_type="TaskType.SEQ_2_SEQ_LM"
)

model = get_peft_model(model, config)

In [28]:
from transformers import DataCollatorForSeq2Seq

# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

In [34]:
from huggingface_hub import HfFolder
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

train_batch_size = 2
eval_batch_size = 1

# Hugging Face repository id
new_aya_id = f"aya-swa-dolly-qlora"

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir=new_aya_id,
        per_device_train_batch_size=train_batch_size,
        per_device_eval_batch_size=eval_batch_size,
        gradient_accumulation_steps=4,
        predict_with_generate=True,
        learning_rate=1e-4,
        max_steps=200,
        # logging & evaluation strategies
        logging_strategy="steps",
        logging_steps=50,
        evaluation_strategy="no",
        save_strategy="no",
        load_best_model_at_end=True,
        generation_max_length=max_target_len,
        report_to="none",
        push_to_hub=False,
        optim="paged_adamw_8bit"
    ),
    data_collator=data_collator,
    train_dataset=tokenized_dolly_swahili,
    eval_dataset=tokenized_xlsum_swahili_test,
    compute_metrics=compute_metrics,
)

In [None]:
# Fine-tune the model
trainer.train()

In [None]:
# Evaluate the fine-tuned model
trainer.evaluate()

```json
{'eval_loss': 1.3190653324127197,
 'eval_rouge1': 38.0103,
 'eval_rouge2': 15.7488,
 'eval_rougeL': 28.7387,
 'eval_rougeLsum': 28.7794,
 'eval_gen_len': 37.9,
 'eval_runtime': 88.4342,
 'eval_samples_per_second': 0.113,
 'eval_steps_per_second': 0.113}
```

In [None]:
# Save model and tokenizer
trainer.model.save_pretrained(repository_id)
tokenizer.save_pretrained(repository_id)