Title: AIDI 1002 Final Term Project Report

Members' Names or Individual's Name: Aquilav Johnson
Emails: 200577021@student.georgianc.on.ca

# Introduction:

#### Problem Description:

The problem addressed is the inefficiency of fine-tuning large pre-trained models for multiple downstream tasks in natural language processing (NLP). Because every task needs its own fully fine-tuned model, the number of parameters to train and store, and the computational cost, grow with the number of tasks.

#### Context of the Problem:

The context of the problem is the need for a transfer learning strategy that allows for efficient training of models on multiple downstream tasks without sacrificing performance.

#### Limitations of Other Approaches:

Approaches based on full fine-tuning are hard to apply in practice, especially in scenarios where models must be trained for many tasks that arrive from customers in sequence, such as in cloud services.

#### Solution:

The proposed method of adapter-based tuning aims to address the inefficiency of fine-tuning by introducing adapter modules that add only a small number of trainable parameters per task, allowing for efficient parameter sharing and performance retention.

# Background


| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| Collobert et al. [1] | Trains on multiple tasks simultaneously, sharing network parameters across tasks to exploit task regularities and improve performance. | SQuAD dataset for QA | Only 80% accuracy |
| Bengio et al. [2] | Addresses language processing challenges by using neural networks to generate probabilistic language representations. | SQuAD v2 dataset for QA | High accuracy but poor on unknown answers |
| Adapter-based tuning | Aims to achieve parameter-efficient transfer learning for NLP tasks. | Various NLP classification tasks | Future improvement: understanding the impact of adapter size on performance |


# Methodology

The existing paper introduces the concept of adapter-based tuning for NLP tasks, which involves the integration of adapter modules into pre-trained models to achieve parameter-efficient transfer learning. The proposed contribution is the demonstration of the effectiveness of adapter-based tuning for a wide range of text classification tasks.
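
For reference, the two adapter variants that appear in this report can be written compactly. The notation below ($h$, $x$, $d$, $m$, $r$, $\alpha$, $W_{\text{down}}$, $W_{\text{up}}$, $A$, $B$, $\sigma$) is chosen here for illustration and does not come from the report itself. A bottleneck adapter projects a $d$-dimensional hidden state down to a small dimension $m \ll d$, applies a nonlinearity $\sigma$, projects back up, and adds a skip connection:

$$
\mathrm{Adapter}(h) = h + W_{\text{up}}\,\sigma\!\left(W_{\text{down}}\,h + b_{\text{down}}\right) + b_{\text{up}},
\qquad W_{\text{down}} \in \mathbb{R}^{m \times d},\; W_{\text{up}} \in \mathbb{R}^{d \times m}
$$

Including biases, this adds $2md + d + m$ trainable parameters per adapter. LoRA, the configuration used in the implementation below, instead keeps a pre-trained weight matrix $W_0$ frozen and learns a low-rank update of rank $r$:

$$
h = W_0\,x + \frac{\alpha}{r}\,B A\,x,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times d}
$$

which adds $2dr$ trainable parameters per adapted $d \times d$ matrix.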

# Implementation

The implementation of the adapter-based tuning method involves integrating adapter modules into pre-trained models and training them on various NLP tasks. The code and its explanation will be provided in the subsequent section.

In addition to the model implementation, the model was trained with two different sets of hyperparameters and the results were recorded.

## Installation

First, let's install the required libraries:

In [1]:
!pip install -qq adapters datasets
!pip install transformers[torch]

Collecting accelerate>=0.20.3 (from transformers[torch])
  Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.25.0


## Dataset Preprocessing

Before we start to train our adapter, we first prepare the training data. Our training dataset can be loaded via HuggingFace `datasets` using one line of code:

In [2]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")
dataset.num_rows

{'train': 8530, 'validation': 1066, 'test': 1066}

In [3]:
dataset['train'][0]

{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'label': 1}

Now, we need to encode all dataset samples to valid inputs for our Transformer model. Since we want to train on `roberta-base`, we load the corresponding `RobertaTokenizer`. Using `dataset.map()`, we can pass the full dataset through the tokenizer in batches:

In [4]:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

def encode_batch(batch):
  """Encodes a batch of input data using the model tokenizer."""
  return tokenizer(batch["text"], max_length=80, truncation=True, padding="max_length")

# Encode the input data
dataset = dataset.map(encode_batch, batched=True)
dataset = dataset.rename_column(original_column_name="label", new_column_name="labels")
# Transform to pytorch tensors and only output the required columns
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
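
To make the effect of `max_length=80`, `truncation=True` and `padding="max_length"` concrete, here is a minimal pure-Python sketch of what padding and truncation do to a sequence of token ids. It is only an illustration, not the tokenizer's implementation: the helper `pad_or_truncate` and the token ids are made up, and a short `max_length` is used so the result is easy to read. The pad id of 1 is RoBERTa's `<pad>` token id.

```python
# Illustrative stand-in for padding="max_length" combined with truncation=True.
# PAD_ID = 1 is RoBERTa's <pad> id; the input ids below are made up.
PAD_ID = 1

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Truncation: drop everything past max_length.
    # (The real tokenizer also keeps the closing </s> token; this sketch does not.)
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)  # number of pad positions needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([0, 100, 200, 2], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with the same length, which is what allows the dataset to be returned as fixed-shape PyTorch tensors.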

Training the model...

## Training

We use a pre-trained RoBERTa model checkpoint from the Hugging Face Hub. We load it with [`AutoAdapterModel`](https://docs.adapterhub.ml/classes/models/auto.html), a class unique to `adapters`. In addition to regular _Transformers_ classes, this class comes with all sorts of adapter-specific functionality, allowing flexible management and configuration of multiple adapters and prediction heads. [Learn more](https://docs.adapterhub.ml/prediction_heads.html#adaptermodel-classes).

In [5]:
from transformers import RobertaConfig
from adapters import AutoAdapterModel

config = RobertaConfig.from_pretrained(
    "roberta-base",
    num_labels=2,
)
model = AutoAdapterModel.from_pretrained(
    "roberta-base",
    config=config,
)

Some weights of RobertaAdapterModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'heads.default.3.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Here comes the important part!**

We add a new adapter to our model by calling `add_adapter()`. We pass a name (`"rotten_tomatoes"`) and an adapter configuration. `"lora"` denotes a [LoRA](https://docs.adapterhub.ml/methods.html#lora) configuration.
_Adapters_ supports a diverse range of adapter configurations. For example, `config="seq_bn"` can be passed to train a [sequential bottleneck adapter](https://docs.adapterhub.ml/methods.html#bottleneck-adapters), or `config="prefix_tuning"` for [prefix tuning](https://docs.adapterhub.ml/methods.html#prefix-tuning). You can find all currently supported configs [here](https://docs.adapterhub.ml/methods.html#prefix-tuning).

Next, we add a binary classification head. It's convenient to give the prediction head the same name as the adapter. This allows us to activate both together in the next step. The `train_adapter()` method does two things:

1. It freezes all weights of the pre-trained model, so only the adapter weights are updated during training.
2. It activates the adapter and the prediction head such that both are used in every forward pass.

In [6]:
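# Add a LoRA adapter named "rotten_tomatoes"
model.add_adapter("rotten_tomatoes", config="lora")
# Alternatively, e.g. a sequential bottleneck adapter:
# model.add_adapter("rotten_tomatoes", config="seq_bn")

# Add a binary classification head with the same name as the adapter
model.add_classification_head(
    "rotten_tomatoes",
    num_labels=2,
    id2label={0: "👎", 1: "👍"},
)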

model.train_adapter("rotten_tomatoes")
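
One way to check the effect of the freezing step is to count how many parameters still require gradients. The sketch below is a hedged illustration: `trainable_fraction` is a helper name introduced here, and `FakeParam` is a hypothetical stand-in with made-up sizes so the block runs without loading a model. It relies only on `numel()` and `requires_grad`, which PyTorch parameters also expose.

```python
from dataclasses import dataclass

@dataclass
class FakeParam:
    """Hypothetical stand-in for a torch parameter (only the fields used here)."""
    size: int
    requires_grad: bool

    def numel(self):
        return self.size

def trainable_fraction(params):
    """Return (trainable, total, fraction) for objects with numel() and requires_grad."""
    params = list(params)  # allow generators such as model.parameters()
    total = sum(p.numel() for p in params)
    trainable = sum(p.numel() for p in params if p.requires_grad)
    return trainable, total, trainable / total

# Made-up sizes: one large frozen weight and two small trainable adapter weights.
toy = [FakeParam(1000, False), FakeParam(20, True), FakeParam(30, True)]
print(trainable_fraction(toy))
```

With the real model, the same helper could be called as `trainable_fraction(model.parameters())` after `train_adapter()`; the fraction should then be small, since only the adapter and head weights remain trainable.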

For training an adapter, we use the `AdapterTrainer` class built into _Adapters_. This class is largely identical to the _Transformers_ `Trainer`, with some helpful tweaks, e.g. for checkpointing only adapter weights.

We configure the training process using a `TrainingArguments` object and define a method that will calculate the evaluation accuracy in the end. We pass both, together with the training and validation split of our dataset, to the trainer instance.

**Note the differences in hyperparameters compared to full fine-tuning.** Adapter training usually requires a few more training epochs than full fine-tuning.

In [12]:
import numpy as np
from transformers import TrainingArguments, EvalPrediction
from adapters import AdapterTrainer

training_args = TrainingArguments(
    learning_rate=1e-4,
    num_train_epochs=6,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_steps=200,
    output_dir="./training_output",
    overwrite_output_dir=True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=False,
)

training_args_1 = TrainingArguments(
    learning_rate=1e-3,
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_steps=200,
    output_dir="./training_output",
    overwrite_output_dir=True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=False,
)

def compute_accuracy(p: EvalPrediction):
  preds = np.argmax(p.predictions, axis=1)
  return {"acc": (preds == p.label_ids).mean()}

trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_accuracy,
)

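# Note: trainer_1 wraps the same `model` object (and output_dir) as `trainer`,
# so training it after `trainer` continues from the already-trained adapter weights.
trainer_1 = AdapterTrainer(
    model=model,
    args=training_args_1,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_accuracy,
)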


Start the training 🚀

In [13]:
trainer.train()

Step,Training Loss
200,0.2793
400,0.2556
600,0.2544
800,0.2411
1000,0.2353
1200,0.2384
1400,0.2365
1600,0.2364


TrainOutput(global_step=1602, training_loss=0.24701393431938543, metrics={'train_runtime': 441.6154, 'train_samples_per_second': 115.893, 'train_steps_per_second': 3.628, 'total_flos': 2141601149692800.0, 'train_loss': 0.24701393431938543, 'epoch': 6.0})

In [14]:
trainer_1.train()

Step,Training Loss
200,0.35
400,0.3219
600,0.2885
800,0.2518
1000,0.2759
1200,0.2317
1400,0.2161
1600,0.2303
1800,0.1778
2000,0.1923


TrainOutput(global_step=2670, training_loss=0.22892126626289738, metrics={'train_runtime': 391.8344, 'train_samples_per_second': 108.847, 'train_steps_per_second': 6.814, 'total_flos': 1784667624744000.0, 'train_loss': 0.22892126626289738, 'epoch': 5.0})

## Observations after changing a few hyperparameters:

The two runs differ in learning rate (1e-4 vs. 1e-3), number of epochs (6 vs. 5) and batch size (32 vs. 16). Both trainers wrap the same `model` object, so Training Session 2 (`trainer_1`) started from the adapter weights already trained in Training Session 1 (`trainer`). The comparison should be read with that in mind.

**Average training loss:** Training Session 2 reported a lower training loss (0.2289) than Training Session 1 (0.2470). The `training_loss` value in `TrainOutput` is the mean over the whole run, not the loss at the final step, and Session 2 began from already-trained adapter weights, so the lower value does not by itself show that its hyperparameters converge better.

**Training duration:** Training Session 2 finished sooner (391.83 seconds vs. 441.62 seconds). It also ran 5 epochs instead of 6, which accounts for much of the difference.

**Samples per second:** Training Session 1 processed more samples per second (115.89 vs. 108.85), which is consistent with its larger batch size.

**Steps per second:** Training Session 2 completed more steps per second (6.81 vs. 3.63), but each of its steps processes 16 samples instead of 32. Sample throughput, not step rate, is therefore the better measure of efficiency here.

**Total floating-point operations:** Training Session 1 involved more total FLOPs (2.14e15) than Training Session 2 (1.78e15). The ratio matches the 6:5 ratio of epochs, so the computational cost per epoch is the same in both runs.
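
The step counts, throughput figures and FLOP totals in the two `TrainOutput` lines can be cross-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; `total_steps` is a helper name introduced here. All numbers are copied from the outputs above.

```python
import math

TRAIN_EXAMPLES = 8530  # size of the train split reported by dataset.num_rows

def total_steps(batch_size, epochs, n_examples=TRAIN_EXAMPLES):
    """Optimizer steps when the last partial batch of each epoch is kept."""
    return math.ceil(n_examples / batch_size) * epochs

steps_run1 = total_steps(batch_size=32, epochs=6)
steps_run2 = total_steps(batch_size=16, epochs=5)
print(steps_run1, steps_run2)  # compare with global_step in the TrainOutput lines

# Steps per second times batch size approximates samples per second.
approx_sps_run1 = 3.628 * 32
approx_sps_run2 = 6.814 * 16
print(round(approx_sps_run1, 1), round(approx_sps_run2, 1))

# FLOPs per epoch, using the total_flos values reported by each run.
flops_per_epoch_run1 = 2141601149692800 // 6
flops_per_epoch_run2 = 1784667624744000 // 5
print(flops_per_epoch_run1 == flops_per_epoch_run2)
```

The step counts reproduce the reported `global_step` values (1602 and 2670), and the per-epoch FLOPs are identical, which suggests the differences in total FLOPs and step rate come from the epoch count and batch size rather than from one run being cheaper per sample.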

In [9]:
trainer.evaluate()

{'eval_loss': 0.27291080355644226,
 'eval_acc': 0.8855534709193246,
 'eval_runtime': 4.3183,
 'eval_samples_per_second': 246.856,
 'eval_steps_per_second': 7.873,
 'epoch': 6.0}

In [15]:
trainer.evaluate()

{'eval_loss': 0.35869863629341125,
 'eval_acc': 0.8930581613508443,
 'eval_runtime': 4.7274,
 'eval_samples_per_second': 225.493,
 'eval_steps_per_second': 7.192,
 'epoch': 6.0}

We can put our trained model into a _Transformers_ pipeline to be able to make new predictions conveniently:

In [10]:
from transformers import TextClassificationPipeline

classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=training_args.device.index)

classifier("I am Good")

The model 'RobertaAdapterModel' is not supported for . Supported models are ['AlbertForSequenceClassification', 'BartForSequenceClassification', 'BertForSequenceClassification', 'BigBirdForSequenceClassification', 'BigBirdPegasusForSequenceClassification', 'BioGptForSequenceClassification', 'BloomForSequenceClassification', 'CamembertForSequenceClassification', 'CanineForSequenceClassification', 'LlamaForSequenceClassification', 'ConvBertForSequenceClassification', 'CTRLForSequenceClassification', 'Data2VecTextForSequenceClassification', 'DebertaForSequenceClassification', 'DebertaV2ForSequenceClassification', 'DistilBertForSequenceClassification', 'ElectraForSequenceClassification', 'ErnieForSequenceClassification', 'ErnieMForSequenceClassification', 'EsmForSequenceClassification', 'FalconForSequenceClassification', 'FlaubertForSequenceClassification', 'FNetForSequenceClassification', 'FunnelForSequenceClassification', 'GPT2ForSequenceClassification', 'GPT2ForSequenceClassification', 

[{'label': '👍', 'score': 0.7858206033706665}]
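
The `score` in the pipeline output is a probability. For a two-label head, the pipeline by default applies a softmax to the model's logits and reports the most probable label. The following is a minimal sketch of that step, not the pipeline's code: the logits are hypothetical values chosen only for illustration, and the label names are plain-text stand-ins for the emoji labels used in the classification head.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "negative", 1: "positive"}  # stand-ins for the emoji labels
logits = [-0.65, 0.65]                     # hypothetical, not taken from the model

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the top label
print({"label": id2label[best], "score": probs[best]})
```

Because the two probabilities sum to 1, a score of about 0.79 for one label implies about 0.21 for the other.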

Finally, we can extract the adapter from our model and save it separately for later reuse. Note the size difference compared to a full model!

In [11]:
model.save_adapter("./final_adapter", "rotten_tomatoes")

!ls -lh final_adapter

total 3.5M
-rw-r--r-- 1 root root  488 Dec 16 04:26 adapter_config.json
-rw-r--r-- 1 root root  443 Dec 16 04:26 head_config.json
-rw-r--r-- 1 root root 1.2M Dec 16 04:26 pytorch_adapter.bin
-rw-r--r-- 1 root root 2.3M Dec 16 04:26 pytorch_model_head.bin
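
The size of `pytorch_adapter.bin` can be estimated from the LoRA parameter count. The sketch below rests on assumptions that the report does not state: that the `"lora"` configuration uses rank 8 and adapts the query and value projections of every attention layer (which I understand to be the library defaults), and that weights are stored as 32-bit floats. The architecture constants are those of `roberta-base`; `lora_param_count` is a helper name introduced here.

```python
# Assumed LoRA settings (believed to be library defaults; not stated in the report).
RANK = 8              # LoRA rank r
ADAPTED_MATRICES = 2  # query and value projections in each attention layer

# roberta-base architecture
HIDDEN = 768
LAYERS = 12

BYTES_PER_PARAM = 4   # float32

def lora_param_count(hidden=HIDDEN, layers=LAYERS, rank=RANK, matrices=ADAPTED_MATRICES):
    """Each adapted d x d matrix gets A (r x d) and B (d x r): 2 * d * r parameters."""
    return layers * matrices * 2 * hidden * rank

n_params = lora_param_count()
size_mib = n_params * BYTES_PER_PARAM / 2**20
print(n_params, round(size_mib, 3))
```

Under these assumptions the adapter has roughly 295 thousand parameters, or about 1.1 MiB of weights, which is in line with the 1.2M file listed above once file overhead and rounding in `ls -lh` are taken into account. For comparison, `roberta-base` has on the order of 125 million parameters.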


# Conclusion and Future Direction

This project examined the inefficiency of fully fine-tuning large pre-trained models for every NLP task and investigated adapter-based tuning as a more efficient alternative. On the Rotten Tomatoes sentiment dataset, a LoRA adapter trained on top of a frozen `roberta-base` model reached a validation accuracy of about 0.89, while the saved adapter takes only 1.2 MB and its prediction head 2.3 MB, a small fraction of the full model.

Future work should examine in detail how adapter size affects performance, that is, how the scale or dimensions of an adapter influence how well it adapts to a specific NLP task. The project also aims to explore further enhancements and optimizations within the adapter-based tuning framework, so that the technique remains a viable way to improve efficiency in NLP tasks.

# References:

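[1]: R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.

[2]: Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, 2003.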