# Tutorial 2: Insert a LoRA adapter to Finetune Bert for Sequence Classification

When we import a pretrained transformer model from HuggingFace, we receive the encoder/decoder weights, which aren't that useful on their own. To perform a useful task such as sequence classification, we add a classification head on top of the model and train those weights on the required dataset. In this tutorial, we'll look at fine-tuning a Bert model for sequence classification with two approaches. First, we'll attempt full Supervised Fine-Tuning (SFT) and see how this is infeasible without specialized hardware. Then, we'll use the Mase stack to add a LoRA adapter to the model. [LoRA](https://arxiv.org/abs/2106.09685) was proposed by a research team at Microsoft in 2021 as a technique for Parameter-Efficient Fine-Tuning (PEFT). It enables us to achieve accuracies comparable to full fine-tuning while training only a fraction of the parameters.

## Sentiment Analysis with the IMDb Dataset

The IMDb dataset, introduced in [this 2011 paper](https://aclanthology.org/P11-1015/) from Stanford, is commonly used for sentiment analysis in the Natural Language Processing (NLP) community. It is a collection of 50k movie reviews from the IMDb website, each labelled as either "positive" or "negative". Here is an example of a positive review:

> I turned over to this film in the middle of the night and very nearly skipped right passed it. It was only because there was nothing else on that I decided to watch it. In the end, I thought it was great.<br /><br />An interesting storyline, good characters, a clever script and brilliant directing makes this a fine film to sit down and watch. This was, in fact, the first I'd heard of this movie, but I would have been happy to have paid money to see this at the cinema.<br /><br />My IMDB Rating : 8 out of 10<br /><br />

And a negative review:

> its a totally average film with a few semi-alright action sequences that make the plot seem a little better and remind the viewer of the classic van dam films. parts of the plot don't make sense and seem to be added in to use up time. the end plot is that of a very basic type that doesn't leave the viewer guessing and any twists are obvious from the beginning. the end scene with the flask backs don't make sense as they are added in and seem to have little relevance to the history of van dam's character. not really worth watching again, bit disappointed in the end production, even though it is apparent it was shot on a low budget certain shots and sections in the film are of poor directed quality

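Here, we use the HuggingFace ``datasets`` library to load the dataset, then tokenize it with the ``bert-base-uncased`` tokenizer. The data collator pads each batch to the length of its longest sequence.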

In [None]:
from datasets import load_dataset
from transformers import DataCollatorWithPadding, AutoTokenizer


def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)


raw_datasets = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
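
To make the role of the data collator more concrete, the sketch below imitates dynamic padding in plain Python. It is not the library's implementation: the helper `pad_batch` and the example token ids are made up for illustration, and the pad id of 0 is assumed to match the `[PAD]` token of the `bert-base-uncased` vocabulary.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to the longest one and build attention masks.

    Hypothetical helper: pad_id=0 is assumed to be the [PAD] token id.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens are followed by padding up to the batch maximum
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative token ids for two reviews of different lengths
batch = pad_batch([[101, 2023, 2003, 2307, 102], [101, 2919, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps short batches small, which is why `truncation=True` is set during tokenization but no fixed padding length is.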

## Generate a MaseGraph with Custom Arguments

By inspecting the implementation of the Bert model in HuggingFace, we can see the forward function has a signature similar to the following.

```python
    class BertForSequenceClassification(BertPreTrainedModel):
        def __init__(self, config):
            super().__init__(config)
            self.bert = BertModel(config)
            ...

        def forward(
            self,
            input_ids: Optional[torch.Tensor] = None,
            attention_mask: Optional[torch.Tensor] = None,
            token_type_ids: Optional[torch.Tensor] = None,
            position_ids: Optional[torch.Tensor] = None,
            head_mask: Optional[torch.Tensor] = None,
            inputs_embeds: Optional[torch.Tensor] = None,
            labels: Optional[torch.Tensor] = None,
            output_attentions: Optional[bool] = None,
            output_hidden_states: Optional[bool] = None,
            return_dict: Optional[bool] = None,
        ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
            ...
```

By default, the MaseGraph constructor chooses to use the `input_ids` argument, ignoring the other optional arguments. However, you can specify which inputs to drive during symbolic tracing using the `hf_input_names` argument. In the following cell, we also drive the `attention_mask` and `labels` inputs. By specifying the `labels` argument, we include an `nn.CrossEntropyLoss` module at the end of the model, so the traced graph calculates the loss directly.

> **Task:** Remove the `attention_mask` and `labels` arguments from the `hf_input_names` list and re-run the following cell. Use `mg.draw()` to visualize the graph in each case. Can you see any changes in the graph topology? Can you explain why this happens?

In [None]:
import torch
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer

from chop import MaseGraph
import chop.passes as passes

checkpoint = "prajjwal1/bert-tiny"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

model.config.problem_type = "single_label_classification"

mg = MaseGraph(
    model,
    hf_input_names=[
        "input_ids",
        "attention_mask",
        "labels",
    ],
)

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

dummy_input = tokenizer(
    [
        "AI may take over the world one day",
        "This is why you should learn ADLS",
    ],
    return_tensors="pt",
)

dummy_input["labels"] = torch.tensor([1, 0])

mg, _ = passes.init_metadata_analysis_pass(mg)
mg, _ = passes.add_common_metadata_analysis_pass(
    mg, pass_args={"dummy_in": dummy_input}
)
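
Because `labels` is traced, the loss computation becomes part of the graph. As a rough illustration of what that final node computes, the sketch below reimplements mean cross-entropy over raw logits in plain Python. The function `cross_entropy` and the example logits are assumptions for illustration; the model itself uses `nn.CrossEntropyLoss`.

```python
import math


def cross_entropy(logits, labels):
    """Mean cross-entropy between raw logits and integer class labels."""
    total = 0.0
    for row, label in zip(logits, labels):
        # Subtract the row maximum before exponentiating for numerical stability
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        # Negative log-probability of the correct class
        total += log_sum_exp - row[label]
    return total / len(labels)


# Made-up logits for two sentences, with the same labels as the dummy input
logits = [[0.0, 0.0], [2.0, -1.0]]
labels = [1, 0]
loss = cross_entropy(logits, labels)
print(f"loss = {loss:.4f}")
```

An undecided prediction (equal logits) contributes `log(2)` to the loss, while a confident correct prediction contributes close to zero.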

## Full Supervised Finetuning (SFT)

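Set up the trainer. We run on the CPU (`use_cpu=True`) for a single epoch, disable experiment logging, and pass a `compute_accuracy` function so that evaluation reports classification accuracy.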

In [None]:
from time import time
from datasets import load_dataset
import evaluate
import numpy as np
from transformers import (
    DataCollatorWithPadding,
    TrainingArguments,
    AutoModelForSequenceClassification,
    Trainer,
)

accuracy = evaluate.load("accuracy")


def compute_accuracy(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)


training_args = TrainingArguments(
    "test-trainer",
    use_cpu=True,
    report_to="none",
    num_train_epochs=1,
)

trainer = Trainer(
    mg.model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_accuracy,
)

Before running any fine-tuning, let's see how the model performs out of the box, using the `Trainer` defined above together with the simple `compute_accuracy` function. Without any fine-tuning, the model does no better than a random guess: there are two labels in the dataset, so this corresponds to an accuracy of around 50%.

In [None]:
# Evaluate accuracy
eval_results = trainer.evaluate()
print(f"Evaluation accuracy: {eval_results['eval_accuracy']}")
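
As a small worked example of what `compute_accuracy` measures, the sketch below takes the argmax of each row of logits and compares it to the label, in plain Python. The function name `accuracy_from_logits` and the logits are made up for illustration.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    # Index of the largest logit in each row is the predicted class
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return correct / len(labels)


# Four made-up predictions, two of which match their labels
example_logits = [[0.2, 0.9], [1.5, -0.3], [0.1, 0.4], [0.7, 0.6]]
example_labels = [1, 0, 0, 1]
print(accuracy_from_logits(example_logits, example_labels))
```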

We should also see how many trainable parameters there are in the model. If you're familiar with Keras, you might have used the `model.summary()` API before, but it's not as easy to do the same in PyTorch. Luckily, Mase has a module-level pass with this functionality. Below we see that our `bert-tiny` checkpoint with the appended classification head has roughly 4.4M parameters in total.

In [None]:
from chop.passes.module import report_trainable_parameters_analysis_pass

report_trainable_parameters_analysis_pass(mg.model)

By now, you might be getting the sense that training this model without a large number of GPUs is infeasible. Nonetheless, run the cell below, which calls `trainer.train()` and times a single training epoch.

In [None]:
# Launch training and measure the wall-clock time of a single epoch
def train():
    start_time = time()
    trainer.train()
    end_time = time()

    print(f"Training for 1 epoch took {end_time - start_time} seconds")


train()

## Parameter Efficient Finetuning with LoRA

An alternative to full fine-tuning is PEFT, which trains only a small number of parameters while aiming for similar performance. LoRA is one such technique: it freezes the pretrained weights and learns a low-rank update alongside them. We can add a LoRA adapter to the model using a pass in Mase:

In [None]:
mg, _ = passes.insert_lora_adapter_transform_pass(
    mg,
    pass_args={
        "rank": 6,
        "alpha": 1.0,
        "dropout": 0.5,
    },
)
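
To give some intuition for the `rank` and `alpha` arguments, here is a minimal sketch of a LoRA-augmented linear layer in plain Python, following the formulation in the LoRA paper: the output is `W x + (alpha / rank) * B (A x)`, with `W` frozen and only `A` and `B` trained. How the Mase pass implements the adapter internally is not shown in this tutorial, so the helpers `matvec` and `lora_linear` and the toy dimensions are assumptions for illustration only.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_linear(x, W, A, B, alpha, rank):
    """Frozen linear layer plus a scaled low-rank update (bias omitted)."""
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # trainable path through the rank-r bottleneck
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, rank, alpha = 4, 3, 2, 1.0
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]  # zero init, as in the LoRA paper

x = [1.0, -2.0, 0.5, 3.0]
base_out = matvec(W, x)
lora_out = lora_linear(x, W, A, B, alpha, rank)
print(lora_out == base_out)
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and training moves it away from that starting point only through the low-rank path.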

Similar to before, let's report the number of trainable parameters.

In [None]:
report_trainable_parameters_analysis_pass(mg.model)
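
The reduction in trainable parameters can be estimated with simple arithmetic. The sketch below compares the weight count of a single dense projection with that of its rank-6 LoRA factors, assuming a 128 x 128 projection (the hidden size of `bert-tiny`). Which layers the Mase pass actually adapts is not stated here, so treat the helper `lora_param_counts` and this single-layer comparison as an illustration rather than a prediction of the report above.

```python
def lora_param_counts(d_in, d_out, rank):
    """Weights in a dense layer versus its two LoRA factors (biases ignored)."""
    full = d_in * d_out            # W has shape d_out x d_in
    lora = rank * (d_in + d_out)   # A is rank x d_in, B is d_out x rank
    return full, lora


full, lora = lora_param_counts(d_in=128, d_out=128, rank=6)
print(f"dense: {full}, LoRA: {lora}, ratio: {lora / full:.2%}")
```

The saving grows with layer width: the dense count scales with the product of the dimensions, while the LoRA count scales with their sum.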

Run a single training epoch.

In [None]:
train()