<a href="https://colab.research.google.com/github/Saputoa21/Computational_Linguistics-Crosslingual_Methods/blob/main/Saputo_Bonus_Exercise_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bonus Exercises 2: Low-Rank Adaptation and Crosslingual Transfer**



This notebook is the second bonus exercise for the lecture Multilingual and Crosslingual Methods and Language Resources (2024W 340168-1). For each successfully completed bonus exercise, a maximum of three points can be earned, which will be added to the points of the final exam. The tasks to be completed in this notebook are marked with 👋 ⚒.

---




In this notebook, we will use Low-Rank Adaptation to fine-tune XLM-R on the task of linguistic acceptability in English and then test its zero-shot capability in other languages.

# **Make sure to set your runtime to GPU before you start training.**

(Tab: Runtime/Change Runtime Type -> Select GPU)

-----------
## **Fine-Tuning on English**

The first part has already been prepared for you. We will load and preprocess the Corpus of Linguistic Acceptability (CoLA) dataset from GLUE and then use Low-Rank Adaptation to fine-tune XLM-R.

### Installation

As always, we first need to install the necessary libraries. One that is new in this notebook is the Parameter-Efficient Fine-Tuning (PEFT) library.

In [None]:
!pip install -U evaluate
!pip install -U datasets
!pip install -U transformers
!pip install -U peft



### Loading the Dataset

In this notebook, we will first use the CoLA dataset from the GLUE benchmark and then a multilingual extension.
We will first train on English, then transfer to another language, and finally evaluate zero-shot transfer on one more language (see [here](https://huggingface.co/datasets/Geralt-Targaryen/MELA) for a selection).

In [None]:
from datasets import load_dataset

dataset_en = load_dataset("glue", "cola")
dataset_en.num_rows


{'train': 8551, 'validation': 1043, 'test': 1063}

Let us take a look at the components of the dataset.

In [None]:
dataset_en['train'].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['unacceptable', 'acceptable'], id=None),
 'idx': Value(dtype='int32', id=None)}

Hugging Face Datasets is designed to be interoperable with libraries like Pandas, NumPy, PyTorch, TensorFlow, and JAX. To enable conversion between these third-party libraries, Hugging Face Datasets provides the `Dataset.set_format()` function. This function only changes the output format of the dataset, so you can easily switch to another format without affecting the underlying data format, which is Apache Arrow. The formatting is done in-place, so let’s convert our dataset to Pandas and look at a random sample:

In [None]:
from IPython.display import display, HTML

dataset_en.set_format("pandas")
df = dataset_en["train"][:]
# Create a random sample
sample = df.sample(n=5, random_state=42)
display(HTML(sample.to_html()))

| | sentence | label | idx |
|---|---|---|---|
| 2389 | Angela characterized Shelly as a lifesaver. | 1 | 2389 |
| 5048 | They're not finding it a stress being in the same office. | 1 | 5048 |
| 3133 | Paul exhaled on Mary. | 0 | 3133 |
| 5955 | I ordered if John drink his beer. | 0 | 5955 |
| 625 | Press the stamp against the pad completely. | 1 | 625 |


The Pandas dataframe can now be used as we would always use Pandas, for instance to count how often each label occurs in the `label` column.

In [None]:
df["label"].value_counts()

| label | count |
|---|---|
| 1 | 6023 |
| 0 | 2528 |


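We can see that the two labels are not evenly distributed: about 70% of the training sentences (6,023 of 8,551) are labeled acceptable (1) and about 30% (2,528) are labeled unacceptable (0).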
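The label counts above also give a useful reference point for later accuracy numbers. The following is a minimal sketch, using only the two counts printed by `value_counts()`, of the accuracy a trivial classifier would reach by always predicting the most frequent label. The variable names are illustrative and not part of the original notebook.

```python
# Label counts copied from the value_counts() output above.
label_counts = {1: 6023, 0: 2528}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
majority_share = label_counts[majority_label] / total

print(f"total training sentences: {total}")
print(f"majority label: {majority_label}")
print(f"majority-class baseline accuracy: {majority_share:.4f}")
```

Any fine-tuned model should clearly beat this baseline on data with a similar label distribution before its accuracy is taken as evidence of learning.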

This was just a brief detour to show how datasets can be nicely manipulated and displayed using other libraries. We will now get back to our usual datasets library from Hugging Face. To this end, we will reset the format.

In [None]:
dataset_en.reset_format()

### Preprocessing the Dataset

In this example, we model CoLA as a binary sequence classification task: each sentence is encoded on its own as one input to our `xlm-roberta-large` model, which predicts whether the sentence is acceptable. Using `dataset.map()`, we can pass the full dataset through the tokenizer in batches.

In [None]:
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
batch_size = 32

def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding=True, truncation=True)

def preprocess_dataset(dataset):
  token_dataset = dataset.map(tokenize_function, batched=True, batch_size=batch_size)
  tokenized_dataset = token_dataset.rename_column("label", "labels")
  return tokenized_dataset

data_set_en_with_test = DatasetDict(
    train=dataset_en['train'].shuffle(seed=24).select(range(7488)),
    validation=dataset_en['validation'],
    test=dataset_en['train'].shuffle(seed=24).select(range(7488, 8551)),
)

tokenized_dataset_en = preprocess_dataset(data_set_en_with_test)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


In [None]:
tokenized_dataset_en["train"][1]

{'sentence': 'John knows that she left and whether she will come back.',
 'labels': 1,
 'idx': 7246,
 'input_ids': [0, 4939, 93002, 450, 2412, 25737, 136, 36766, 2412, 1221, 1380, 4420, 5, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
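The trailing `1` values in `input_ids` and the trailing `0` values in `attention_mask` come from padding: within a batch, shorter sentences are filled up to the length of the longest one, and the mask tells the model which positions to ignore. The sketch below reproduces that behaviour in plain Python. The helper `pad_batch` and the second, shorter sequence are hypothetical and only serve as illustration; the padding id `1` is taken from the output above.

```python
PAD_ID = 1  # padding token id, as seen in the tokenized output above


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# First sequence: the unpadded token ids of the example printed above.
long_seq = [0, 4939, 93002, 450, 2412, 25737, 136, 36766, 2412, 1221, 1380, 4420, 5, 2]
# Second sequence: a made-up shorter input, for illustration only.
short_seq = [0, 4939, 5, 2]

ids, mask = pad_batch([long_seq, short_seq])
for row_ids, row_mask in zip(ids, mask):
    print(row_ids)
    print(row_mask)
```

The real tokenizer does this per call, which is why the amount of padding depends on the other sentences that happen to be in the same batch.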

In [None]:
tokenized_dataset_en["train"]["sentence"][1]

'John knows that she left and whether she will come back.'

In [None]:
tokenized_dataset_en["train"]["labels"][1]

1

-----------
## **Low-Rank Adaptation (LoRA)**

In order to perform low-rank adaptation (LoRA) on a pretrained language model for parameter-efficient fine-tuning (PEFT), we need to set a few parameters in the LoRA Configuration. Hugging Face offers some [documentation on LoRA](https://huggingface.co/docs/peft/main/en/developer_guides/lora).

The `task_type` specifies which task the model is fine-tuned on and must match the way the model is loaded. If we load a model for sequence classification, the task type must be `SEQ_CLS`, an abbreviation for sequence classification. The dataset must then consist of input sequences and a fixed set of target classes.

The `target_modules` depend on the model architecture; for XLM-R we use `["query", "value"]`, i.e. the query and value projections of each self-attention layer. Since we wish to update model parameters, `inference_mode` is set to `False`. The variable `r` is the rank of the low-rank update matrices. The variable `lora_alpha` is a scaling parameter: the low-rank update is multiplied by `lora_alpha / r`, so setting both to the same value gives a scaling factor of 1.0. With small datasets, or if unsure, the rank and alpha can be the same. Finally, `lora_dropout` randomly sets a fraction of the inputs to the LoRA layers to zero during training, mostly to avoid overfitting.

Feel free to play with and adapt these parameters if you are interested in seeing the effect.
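To make the roles of `r` and `lora_alpha` more concrete, here is a minimal sketch of what a single LoRA-adapted projection computes: the frozen weight `W` plus a low-rank update `B @ A` scaled by `lora_alpha / r`. It uses plain Python lists and toy sizes (hidden size 4, rank 2) that are assumptions for illustration, it omits dropout, and it is not the PEFT implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Frozen projection W x plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)               # frozen pretrained weights
    update = matvec(B, matvec(A, x))  # project down to rank r, then back up
    scaling = lora_alpha / r
    return [b + scaling * u for b, u in zip(base, update)]


d, r = 4, 2  # toy hidden size and rank (hypothetical values)
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
A = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]                # shape r x d
B_init = [[0.0] * r for _ in range(d)]                          # shape d x r, all zeros
B_trained = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]    # made-up "trained" values
x = [1.0, 2.0, 3.0, 4.0]

out_init = lora_forward(W, A, B_init, x, r, lora_alpha=2)
out_same = lora_forward(W, A, B_trained, x, r, lora_alpha=2)
out_double = lora_forward(W, A, B_trained, x, r, lora_alpha=4)
print(out_init, out_same, out_double)
```

With `B` set to zero the adapted layer behaves exactly like the frozen one, which is why LoRA training can start from the pretrained model's behaviour. Doubling `lora_alpha` at a fixed rank doubles the contribution of the learned update.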


In [None]:
from peft import LoraConfig, PeftType, get_peft_model
from transformers import AutoModelForSequenceClassification

peft_type = PeftType.LORA
peft_config = LoraConfig(task_type="SEQ_CLS", target_modules=["query", "value"], inference_mode=False, r=8, lora_alpha=8, lora_dropout=0.1)
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-large")
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
model

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 1,838,082 || all params: 561,730,564 || trainable%: 0.3272


PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): XLMRobertaForSequenceClassification(
      (roberta): XLMRobertaModel(
        (embeddings): XLMRobertaEmbeddings(
          (word_embeddings): Embedding(250002, 1024, padding_idx=1)
          (position_embeddings): Embedding(514, 1024, padding_idx=1)
          (token_type_embeddings): Embedding(1, 1024)
          (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): XLMRobertaEncoder(
          (layer): ModuleList(
            (0-23): 24 x XLMRobertaLayer(
              (attention): XLMRobertaAttention(
                (self): XLMRobertaSdpaSelfAttention(
                  (query): lora.Linear(
                    (base_layer): Linear(in_features=1024, out_features=1024, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
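The number reported by `print_trainable_parameters()` can be reproduced by hand. The sketch below assumes that the trainable parameters consist of the LoRA matrices on the query and value projections plus the complete classification head (`classifier.dense` and `classifier.out_proj`), which is what the printed total suggests. All sizes are read off the model summary and the `LoraConfig` above.

```python
hidden_size = 1024   # Linear(in_features=1024, out_features=1024) in the summary above
num_layers = 24      # "(0-23): 24 x XLMRobertaLayer"
num_labels = 2       # unacceptable / acceptable
r = 8                # LoRA rank from LoraConfig
target_modules = ["query", "value"]

# Each adapted projection gets A (r x hidden) and B (hidden x r), without bias.
lora_per_module = r * hidden_size + hidden_size * r
lora_total = num_layers * len(target_modules) * lora_per_module

# Classification head: dense (hidden -> hidden) and out_proj (hidden -> labels), both with bias.
head_dense = hidden_size * hidden_size + hidden_size
head_out = hidden_size * num_labels + num_labels
head_total = head_dense + head_out

trainable = lora_total + head_total
print(f"LoRA parameters: {lora_total:,}")
print(f"classification head parameters: {head_total:,}")
print(f"trainable parameters: {trainable:,}")
```

Under these assumptions, more than half of the trainable parameters sit in the classification head rather than in the LoRA matrices.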
     

In [None]:
import numpy as np
import evaluate
from transformers import TrainingArguments, Trainer, EvalPrediction
from datasets import concatenate_datasets

num_train_epochs = 5
logging_steps = len(tokenized_dataset_en["train"]) // (batch_size * num_train_epochs)
accuracy = evaluate.load("accuracy")

training_args = TrainingArguments(
    learning_rate=2e-4,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=logging_steps,
    output_dir="./training_output",
    overwrite_output_dir=True,
    report_to='none',
    load_best_model_at_end=True,
    # Drop dataset columns that the model's forward() does not accept (e.g. 'sentence', 'idx'); 'labels' is kept
    remove_unused_columns=True,
)

def compute_metrics(eval_pred):
    """Called at the end of validation. Gives accuracy"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # calculates the accuracy
    return accuracy.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_en["train"],
    eval_dataset=tokenized_dataset_en["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Once we have configured the model with PEFT, we can train the PEFT model as usual.

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.5877,0.598689,0.693193
2,0.5198,0.495831,0.761266
3,0.4804,0.479715,0.782359
4,0.4416,0.507901,0.778523
5,0.3952,0.495295,0.797699


TrainOutput(global_step=1170, training_loss=0.509758540096446, metrics={'train_runtime': 586.7646, 'train_samples_per_second': 63.808, 'train_steps_per_second': 1.994, 'total_flos': 2688914965511424.0, 'train_loss': 0.509758540096446, 'epoch': 5.0})
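The `global_step` value and the checkpoint directory names follow directly from the dataset size, batch size, and number of epochs. The following sketch assumes a single device and no gradient accumulation. It also lists which epoch has the lowest validation loss, which is the checkpoint `load_best_model_at_end=True` would select if the default selection metric (validation loss) is in effect.

```python
import math

train_size = 7488      # size of the English training split selected above
batch_size = 32
num_train_epochs = 5

steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# One checkpoint directory per epoch, because save_strategy="epoch".
checkpoints = {
    epoch: f"checkpoint-{epoch * steps_per_epoch}"
    for epoch in range(1, num_train_epochs + 1)
}

# Validation losses copied from the training table above.
val_loss = {1: 0.598689, 2: 0.495831, 3: 0.479715, 4: 0.507901, 5: 0.495295}
best_epoch = min(val_loss, key=val_loss.get)

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"lowest validation loss: epoch {best_epoch} ({checkpoints[best_epoch]})")
print(f"last epoch: {checkpoints[num_train_epochs]}")
```

The epoch with the lowest validation loss and the epoch with the highest validation accuracy differ in this run, so the choice of checkpoint directory determines which of the two criteria the evaluated model reflects.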

👋 ⚒ Evaluate on the English test set to see how well the fine-tuning has worked.

In [None]:
# Your code for the evaluation here
import torch
from peft import PeftModel, PeftConfig

peft_model_id = "/content/training_output/checkpoint-1170"
config = PeftConfig.from_pretrained(peft_model_id)
inference_model = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

inference_model = PeftModel.from_pretrained(inference_model, peft_model_id)

model_inputs = tokenizer("I ordered if John dink his beer.", return_tensors="pt")
outputs = inference_model(**model_inputs)
prediction = outputs.logits.argmax(dim=-1)
print(prediction)
print(["unacceptable", "acceptable"][prediction])

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tensor([0])
unacceptable


In [None]:
from torch.utils.data import DataLoader

eval_dataloader = DataLoader(tokenized_dataset_en['test'], batch_size=8)

inference_model.eval()

for batch in eval_dataloader:
    # Re-tokenize the raw sentences of this batch (avoids shadowing the built-in `input`)
    inputs = tokenizer(batch['sentence'], padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = inference_model(**inputs)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    accuracy.add_batch(predictions=predictions, references=batch['labels'])

accuracy.compute()

{'accuracy': 0.8052681091251176}
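The evaluation loop above collects predictions batch by batch and computes the accuracy only once at the end. As a rough sketch of that pattern, the class below is a hypothetical pure-Python stand-in for the `add_batch()` / `compute()` interface. It is not the `evaluate` implementation, and the two toy batches are made up.

```python
class RunningAccuracy:
    """Minimal stand-in for the add_batch()/compute() pattern used above."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        # Count matching prediction/label pairs in this batch.
        self.correct += sum(int(p == y) for p, y in zip(predictions, references))
        self.total += len(references)

    def compute(self):
        # Accuracy over everything added so far.
        return {"accuracy": self.correct / self.total}


metric = RunningAccuracy()
metric.add_batch(predictions=[1, 0, 1], references=[1, 1, 1])  # toy batch, 2 of 3 correct
metric.add_batch(predictions=[0, 0], references=[0, 1])        # toy batch, 1 of 2 correct
result = metric.compute()
print(result)
```

Accumulating counts rather than averaging per-batch accuracies keeps the result correct even when the last batch is smaller than the others.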

## **Crosslingual Transfer**

In this section, we will be using the Multilingual Evaluation of Linguistic Acceptability ([MELA](https://github.com/sjtu-compling/mela?tab=readme-ov-file)), which is also [available on Hugging Face](https://huggingface.co/datasets/Geralt-Targaryen/MELA) to test the transfer and zero-shot capabilities of XLM-R with LoRA Fine-Tuning.

We will first fine-tune on German and then test on German as well as, in a zero-shot approach, on another language of your choice.

Please be aware of the fact that MELA "only" offers a dev and a test set - no train, validation, test split. Thus, the preprocessing needs to be slightly adapted.

In [None]:
from datasets import load_dataset

de = load_dataset("Geralt-Targaryen/MELA", "de")
dataset_de = preprocess_dataset(de)
batch_size = 16
print(dataset_de["test"][20])

{'idx': 'c1-1.1_n9-a', 'labels': 1, 'sentence': 'Wenn du glaubst, dass er sich geirrt habe, kannst du dann alles verstehen', 'input_ids': [0, 7896, 115, 24682, 5829, 271, 4, 1421, 72, 833, 6, 128696, 3198, 3260, 4, 32540, 115, 3700, 4174, 85516, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [None]:
print(dataset_de["dev"][0])

{'idx': 'c5-5.1_n1-f-1', 'labels': 1, 'sentence': 'Er hat nicht ausgeschlossen, dass es so gewesen sein könnte ', 'input_ids': [0, 1004, 1256, 749, 206941, 4, 1421, 198, 221, 72888, 2988, 25482, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}


👋 ⚒ Use the German dev partition to further fine-tune the previously configured model and then evaluate on the test partition of the German dataset.

In [None]:
# Fine-tuning on German dev set here
import numpy as np
import evaluate
from transformers import TrainingArguments, Trainer, EvalPrediction
from datasets import concatenate_datasets

num_train_epochs = 5
logging_steps = len(dataset_de["dev"]) // (batch_size * num_train_epochs)
accuracy = evaluate.load("accuracy")

training_args = TrainingArguments(
    learning_rate=2e-4,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=logging_steps,
    output_dir="./training_output2",
    overwrite_output_dir=True,
    report_to='none',
    load_best_model_at_end=True,
    # Drop dataset columns that the model's forward() does not accept (e.g. 'sentence', 'idx'); 'labels' is kept
    remove_unused_columns=True,
)

def compute_metrics(eval_pred):
    """Called at the end of validation. Gives accuracy"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # calculates the accuracy
    return accuracy.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=inference_model,
    args=training_args,
    train_dataset=dataset_de["dev"],
    eval_dataset=dataset_de["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,1.001,0.526841,0.738624
2,0.4307,0.549932,0.715344
3,0.332,0.485426,0.759788
4,0.4677,0.4682,0.778836
5,0.3656,0.469248,0.777778


TrainOutput(global_step=35, training_loss=0.7001416385173798, metrics={'train_runtime': 31.0494, 'train_samples_per_second': 16.103, 'train_steps_per_second': 1.127, 'total_flos': 22890086700000.0, 'train_loss': 0.7001416385173798, 'epoch': 5.0})

Evaluation on German Dataset

In [None]:
import torch
from peft import PeftModel, PeftConfig

peft_model_id_de = "/content/training_output2/checkpoint-28"
config = PeftConfig.from_pretrained(peft_model_id_de)
inference_model_de = AutoModelForSequenceClassification.from_pretrained(config.base_model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the adapter weights from the German checkpoint (peft_model_id_de), not the English one
inference_model_de = PeftModel.from_pretrained(inference_model_de, peft_model_id_de)

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model_inputs = tokenizer("Er sagt, dass man diesen Satz grammatisch machen kann.", return_tensors="pt") #labeled as 1
outputs = inference_model_de(**model_inputs)
prediction = outputs.logits.argmax(dim=-1)
print(prediction)
print(["unacceptable", "acceptable"][prediction])

tensor([1])
acceptable


In [None]:
model_inputs = tokenizer("Jeder, der hat gelacht.", return_tensors="pt") #labeled as 0
outputs = inference_model_de(**model_inputs)
prediction = outputs.logits.argmax(dim=-1)
print(prediction)
print(["unacceptable", "acceptable"][prediction])

tensor([1])
acceptable


In [None]:
from torch.utils.data import DataLoader

eval_dataloader = DataLoader(dataset_de['test'], batch_size=8)

inference_model_de.eval()

for batch in eval_dataloader:
    # Re-tokenize the raw sentences of this batch (avoids shadowing the built-in `input`)
    inputs = tokenizer(batch['sentence'], padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = inference_model_de(**inputs)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    accuracy.add_batch(predictions=predictions, references=batch['labels'])

accuracy.compute()

{'accuracy': 0.7587301587301587}

👋 ⚒ Select another language of your choice from the [MELA dataset](https://huggingface.co/datasets/Geralt-Targaryen/MELA) to only evaluate the fine-tuned model (zero-shot capability).

**Alternative**: Feel free to create your own mini-dataset of a few (non)-acceptable sentences in a language of your choice to test the model's zero-shot capacity.

Russian examples to evaluate the model's performance on a language from a different branch of the Indo-European family than German (an East Slavic language)

In [None]:
model_inputs = tokenizer("Проснешься не торопясь, посердишься на что-нибудь, поворчишь.", return_tensors="pt") #labeled as 1
outputs = inference_model_de(**model_inputs)
prediction = outputs.logits.argmax(dim=-1)
print(prediction)
print(["unacceptable", "acceptable"][prediction])

tensor([1])
acceptable


In [None]:
model_inputs = tokenizer("Те, кто мечтает стать инженером, исследователем, лётчиком, космонавтом, должен развивать свою зрительную память.", return_tensors="pt") #labeled as 0
outputs = inference_model_de(**model_inputs)
prediction = outputs.logits.argmax(dim=-1)
print(prediction)
print(["unacceptable", "acceptable"][prediction])

tensor([1])
acceptable


Italian examples to evaluate the model's performance on a language from a different branch of the Indo-European family than German (a Romance language)

In [None]:
# Your evaluation here
model_inputs = tokenizer("Tommaso legge il giornale.", return_tensors="pt") #labeled as 1
outputs = inference_model_de(**model_inputs)
prediction = outputs.logits.argmax(dim=-1)
print(prediction)
print(["unacceptable", "acceptable"][prediction])

tensor([1])
acceptable


In [None]:
model_inputs = tokenizer("Uno studente parlato poco fa.", return_tensors="pt") #labeled as 0
outputs = inference_model_de(**model_inputs)
prediction = outputs.logits.argmax(dim=-1)
print(prediction)
print(["unacceptable", "acceptable"][prediction])

tensor([0])
unacceptable


Icelandic examples to evaluate the model's performance on a language from the same Germanic branch as German (a North Germanic language)

In [None]:
model_inputs = tokenizer("Það er ekki gott að vanta einan í tíma.", return_tensors="pt") #labeled as 1
outputs = inference_model_de(**model_inputs)
prediction = outputs.logits.argmax(dim=-1)
print(prediction)
print(["unacceptable", "acceptable"][prediction])

tensor([1])
acceptable


In [None]:
model_inputs = tokenizer("Þetta eru þessar bækur fjórar mínar.", return_tensors="pt") #labeled as 0
outputs = inference_model_de(**model_inputs)
prediction = outputs.logits.argmax(dim=-1)
print(prediction)
print(["unacceptable", "acceptable"][prediction])

tensor([1])
acceptable
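The six spot checks above can be tallied per language. The sketch below only restates the gold labels from the code comments and the predictions printed in the cell outputs; the variable names are illustrative.

```python
# (language, gold label, predicted label), copied from the cells above.
spot_checks = [
    ("Russian", 1, 1),
    ("Russian", 0, 1),
    ("Italian", 1, 1),
    ("Italian", 0, 0),
    ("Icelandic", 1, 1),
    ("Icelandic", 0, 1),
]

# Map each language to (number of correct predictions, number of examples).
per_language = {}
for language, gold, predicted in spot_checks:
    hits, seen = per_language.get(language, (0, 0))
    per_language[language] = (hits + int(gold == predicted), seen + 1)

for language, (hits, seen) in per_language.items():
    print(f"{language}: {hits}/{seen} correct")

predicted_acceptable = sum(predicted for _, _, predicted in spot_checks)
print(f"predicted 'acceptable' in {predicted_acceptable} of {len(spot_checks)} cases")
```

With only two sentences per language, these counts are anecdotal. They mainly show that the model tends to predict "acceptable", so a larger evaluation on the corresponding MELA test partitions would be needed for a reliable comparison between languages.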
