<a href="https://colab.research.google.com/github/Akitsuyoshi/lora_sandbox/blob/main/lora_trainer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text classification with a fine-tuned Bert model using LoRA

The goal of this notebook is to apply the `LoRA` PEFT technique to the `MRPC` text classification task with a `Bert` model. The model is given two sentences from the dataset and determines whether they are paraphrases, that is, whether they have the same meaning. We use `Accuracy` and `F1` as metrics and `CrossEntropyLoss` as the training loss. In this notebook, we go through the following steps.

0. Install necessary libraries
1. Prepare the Bert model
2. Perform lightweight fine-tune, using LoRA
3. Perform inference with fine-tuned model
4. Future feature
5. References

Each section below discusses the details.

**Keywords**:
* PEFT technique: [LoRA](https://huggingface.co/papers/2309.15223)
* Model: [DistilBERT](https://huggingface.co/docs/transformers/v4.38.1/en/model_doc/distilbert#distilbert)
* Evaluation approach: Accuracy and F1
* Fine-tuning dataset: MRPC in [GLUE dataset](https://huggingface.co/datasets/glue)

**Note**:

This notebook is supposed to be run on Google Colab with a `GPU`. It may work in the Udacity workspace, but Colab is the recommended way to run this notebook. This notebook doesn't work on a `TPU`, since we use dynamic padding in the training process.
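
Dynamic padding, mentioned in the note above, pads each batch only to the length of its longest sequence instead of to one fixed global length. The sketch below illustrates the idea in plain Python. `pad_batch` and `pad_id=0` are hypothetical names and values chosen for illustration; this is not the actual `DataCollatorWithPadding` implementation.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two batches end up with two different padded lengths.
short_batch = pad_batch([[101, 7592, 102], [101, 102]])
long_batch = pad_batch([[101, 1, 2, 3, 4, 102], [101, 5, 102]])
print(len(short_batch["input_ids"][0]), len(long_batch["input_ids"][0]))
```

Because the padded length changes from batch to batch, tensor shapes are not fixed. TPU compilation generally favors fixed shapes, which is consistent with the GPU recommendation in the note above.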

## Install necessary libraries


In [None]:
!pip install transformers accelerate datasets evaluate peft

Collecting accelerate
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Collecting datasets
  Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting peft
  Downloading peft-0.9.0-py3-none-any.whl (190 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)

## Prepare the Bert model

This section proceeds in the following order.

1. Load mrpc dataset
2. Check the raw mrpc data
3. Load Bert tokenizer
4. Preprocess the dataset
5. Set metrics, accuracy and f1
6. Load Bert model
7. Evaluate initial Bert performance without training

### Load mrpc dataset

The dataset has three splits. We use the `train` and `validation` splits for training and the `test` split for inference. The dataset is relatively small, so I use each split in full rather than taking a subset.

In [None]:
from datasets import load_dataset

ds = load_dataset("glue", "mrpc")
ds

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

### Check the raw mrpc data

The dataset contains 4 features: `sentence1`, `sentence2`, `label`, and `idx`. We use the two sentences as input features and `label` as the true label. If `sentence1` and `sentence2` are paraphrases of one another, the instance is labeled 1. If not, the label is 0.

In [None]:
ds["train"][0], ds["validation"][0], ds["test"][0]

({'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
  'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
  'label': 1,
  'idx': 0},
 {'sentence1': "He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .",
  'sentence2': '" The foodservice pie business does not fit our long-term growth strategy .',
  'label': 1,
  'idx': 9},
 {'sentence1': "PCCW 's chief operating officer , Mike Butcher , and Alex Arena , the chief financial officer , will report directly to Mr So .",
  'sentence2': 'Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So .',
  'label': 1,
  'idx': 0})

### Load Bert tokenizer and preprocess the dataset

We load DistilBERT, a lightweight BERT with pre-trained weights, from Hugging Face, and then preprocess the datasets for later training. The key point here is to use the same checkpoint, `distilbert-base-uncased`, for both the tokenizer and the model loaded later.

In [None]:
from transformers import AutoTokenizer, DataCollatorWithPadding

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)

def tokenize_function(example):
  return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

ds = ds.map(tokenize_function, batched=True)
# Commented out because Trainer uses dynamic padding by default when a tokenizer is passed
# data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
ds

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 1725
    })
})

The two sentences are separated by a `[SEP]` token, and `sentence1` begins with a `[CLS]` token. The special tokens differ from tokenizer to tokenizer.

In [None]:
tokenizer.decode(ds["train"][0]["input_ids"]), tokenizer.decode(ds["validation"][0]["input_ids"])

('[CLS] amrozi accused his brother, whom he called " the witness ", of deliberately distorting his evidence. [SEP] referring to him as only " the witness ", amrozi accused his brother of deliberately distorting his evidence. [SEP]',
 '[CLS] he said the foodservice pie business doesn\'t fit the company\'s long - term growth strategy. [SEP] " the foodservice pie business does not fit our long - term growth strategy. [SEP]')
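
The layout seen in the decoded output can be sketched without the real tokenizer. The helper below is a toy illustration only: `build_pair` is a hypothetical function, and lowercasing plus splitting on whitespace stands in for DistilBERT's actual WordPiece tokenization.

```python
def build_pair(sentence1, sentence2, cls="[CLS]", sep="[SEP]"):
    """Lay out a sentence pair as: [CLS] sentence1 [SEP] sentence2 [SEP]."""
    tokens1 = sentence1.lower().split()
    tokens2 = sentence2.lower().split()
    return [cls] + tokens1 + [sep] + tokens2 + [sep]

pair = build_pair("He said hello .", "He greeted us .")
print(" ".join(pair))
```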

### Set metrics, accuracy and f1

We use 2 metrics, `accuracy` and `f1`. Both are provided by `evaluate.load`, which returns the metric set defined for the mrpc task of GLUE.

In [None]:
import numpy as np
import evaluate

metrics = evaluate.load("glue", "mrpc")
metrics.compute(predictions=[0, 1, 0, 1, 0], references=[0, 1, 1, 1, 1])

{'accuracy': 0.6, 'f1': 0.6666666666666666}
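
To make the two numbers above concrete, the following sketch recomputes them by hand in plain Python. `accuracy_and_f1` is a hypothetical helper written for this illustration, not part of the `evaluate` library, and it assumes that label 1 is the positive class.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Compute accuracy and binary F1 for the positive class."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}

result = accuracy_and_f1([0, 1, 0, 1, 0], [0, 1, 1, 1, 1])
print(result)
```

For this input there are 2 true positives, 0 false positives, and 2 false negatives, so precision is 1.0, recall is 0.5, and F1 is 2/3. Three of the five predictions are correct, so accuracy is 0.6.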

In [None]:
def compute_metrics(eval_preds):
  # eval_preds is a (logits, labels) pair supplied by Trainer
  logits, labels = eval_preds
  # pick the class with the highest logit for each example
  predictions = np.argmax(logits, axis=-1)
  # reuse the `metrics` object loaded in the previous cell
  return metrics.compute(predictions=predictions, references=labels)

compute_metrics(([[1, 0], [0, 1], [4, 1], [3, 4], [2, 1]], [0, 1, 1, 1, 1]))

{'accuracy': 0.6, 'f1': 0.6666666666666666}
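
The `argmax` step in `compute_metrics` turns each row of logits into a predicted class index. A minimal plain-Python sketch of that step follows. `argmax_rows` is a hypothetical stand-in for `np.argmax(..., axis=-1)`.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row (first index on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# Same toy logits as in the cell above.
toy_logits = [[1, 0], [0, 1], [4, 1], [3, 4], [2, 1]]
toy_predictions = argmax_rows(toy_logits)
print(toy_predictions)
```

The resulting predictions are `[0, 1, 0, 1, 0]`, the same list passed directly to `metrics.compute` in the earlier cell, which is why both cells report identical scores.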

### Load Bert model

We load the `distilbert-base-uncased` weights to get the classification model. We set `num_labels=2` so that the model has two outputs, paraphrase or not. DistilBERT is an encoder model, so it is well suited to sentence classification tasks such as this mrpc dataset.

The warning below is expected. The pretrained weights of the base model are loaded from the checkpoint, but the weights of the newly added classification head (`pre_classifier` and `classifier`) are initialized randomly.

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.classifier

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Linear(in_features=768, out_features=2, bias=True)

### Evaluate initial Bert performance without training

We evaluate the initial model's performance before any training and get the result below. Remember that we track two metrics, `accuracy` and `f1`.



In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    eval_dataset=ds["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
init_eval = trainer.evaluate()
init_eval

{'eval_loss': 0.7170845866203308,
 'eval_accuracy': 0.336231884057971,
 'eval_f1': 0.0034812880765883376,
 'eval_runtime': 9.781,
 'eval_samples_per_second': 176.362,
 'eval_steps_per_second': 22.084}
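
For context on the `eval_loss` above, a classifier that assigns equal probability to both classes has a cross-entropy of ln 2, about 0.693. The sketch below computes cross-entropy from raw logits in plain Python. `cross_entropy` is a hypothetical helper for illustration, not the loss implementation used by `Trainer`.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_sum_exp - logits[label]

# Equal logits mean equal probabilities for the two classes.
uniform_loss = cross_entropy([0.0, 0.0], label=1)
print(round(uniform_loss, 4))
```

An initial loss of about 0.717 is close to this value, which is consistent with a classification head whose weights are still random.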

## Perform lightweight fine-tune, using LoRA

This section proceeds in the following order.

1. Create a PEFT model
2. Train the PEFT model
3. Save the PEFT model

### Create a PEFT model

We create a PEFT model with `LoRA`. We set `target_modules="all-linear"` so that adapters are added to all linear layers, as in QLoRA. Even so, the trainable parameters make up only a small percentage of all parameters.
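
As a rough picture of what an adapter does, LoRA keeps the pretrained weight `W` frozen and adds a scaled low-rank update, so a linear layer computes `W x + (alpha / r) * B (A x)`. The sketch below is a toy plain-Python version with made-up numbers. `matvec` and `lora_forward` are hypothetical helpers, and the values of `alpha` and `r` are not the notebook's configuration. Starting `B` at zero follows the initialization described in the LoRA paper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]  # frozen weight, shape (2, 3)
A = [[0.5, -0.5, 1.0]]                   # trainable, shape (r=1, 3)
B_zero = [[0.0], [0.0]]                  # trainable, shape (2, r=1), starts at zero
x = [1.0, 1.0, 1.0]

# With B at zero the adapted layer behaves exactly like the frozen layer.
start = lora_forward(W, A, B_zero, x, alpha=8, r=1)
# After B moves away from zero, the output shifts while W is untouched.
B_trained = [[0.25], [0.0]]
later = lora_forward(W, A, B_trained, x, alpha=8, r=1)
print(start, later)
```

Only `A` and `B` would receive gradient updates in this picture, which is why the trainable fraction stays small.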

In [None]:
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(task_type="SEQ_CLS",
                         lora_dropout=0.1,
                         target_modules="all-linear") # add LoRA adapters to all linear layers of the model
lora_model = get_peft_model(model, peft_config)
lora_model.print_trainable_parameters()

trainable params: 1,274,130 || all params: 68,247,588 || trainable%: 1.8669231211511828
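
The trainable parameter count printed above can be reproduced by hand under a few assumptions that the notebook does not state: the default LoRA rank `r=8`, six DistilBERT transformer blocks with hidden size 768 and feed-forward size 3072, and a classification head (`pre_classifier` and `classifier`) that receives adapters and is also kept fully trainable. A rank-`r` adapter on a linear layer adds `r * (d_in + d_out)` weights. The helper names below are hypothetical.

```python
def lora_params(d_in, d_out, r=8):
    """Weights added by one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

def linear_params(d_in, d_out):
    """Weights plus biases of a plain linear layer."""
    return d_in * d_out + d_out

hidden, ffn, num_layers, num_labels = 768, 3072, 6, 2  # assumed DistilBERT sizes

per_block = (
    4 * lora_params(hidden, hidden)  # query, key, value, attention output
    + lora_params(hidden, ffn)       # feed-forward up projection
    + lora_params(ffn, hidden)       # feed-forward down projection
)
# Assumption: the head gets adapters and is also trained in full.
head_adapters = lora_params(hidden, hidden) + lora_params(hidden, num_labels)
head_full = linear_params(hidden, hidden) + linear_params(hidden, num_labels)

trainable = num_layers * per_block + head_adapters + head_full
print(trainable)
```

Under these assumptions the total is 1,274,130, which matches the printed value. A little under half of it comes from the fully trained head rather than from the adapters.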


### Train the PEFT model

We train the LoRA PEFT model for `6 epochs`.

**Note**:

We assume that this notebook runs on a GPU, not a TPU. If no GPU is available, the training either fails or takes a very long time to finish.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments("lora_trainer",
                                  evaluation_strategy="epoch",
                                  save_strategy="epoch",
                                  per_device_eval_batch_size=32,
                                  per_device_train_batch_size=32,
                                  num_train_epochs=6,
                                  weight_decay=0.01,
                                  label_smoothing_factor=0.01,
                                  load_best_model_at_end=True,
                                  fp16=True,
                                  )
lora_trainer = Trainer(
    lora_model,
    training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
lora_trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.596742 | 0.683824 | 0.812227 |
| 2 | No log | 0.551365 | 0.720588 | 0.822981 |
| 3 | No log | 0.509742 | 0.735294 | 0.824675 |
| 4 | No log | 0.498819 | 0.754902 | 0.839744 |
| 5 | 0.561000 | 0.479981 | 0.772059 | 0.843697 |
| 6 | 0.561000 | 0.477317 | 0.77451 | 0.845638 |


TrainOutput(global_step=690, training_loss=0.5419005794801574, metrics={'train_runtime': 83.127, 'train_samples_per_second': 264.751, 'train_steps_per_second': 8.301, 'total_flos': 468933823767168.0, 'train_loss': 0.5419005794801574, 'epoch': 6.0})
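
The step count and the `No log` entries above follow from simple arithmetic. The sketch below assumes a single device, no gradient accumulation, and a logging interval of 500 steps (the `TrainingArguments` default for `logging_steps`, which this notebook does not override).

```python
import math

train_examples = 3668  # size of the train split shown earlier
batch_size = 32        # per_device_train_batch_size
epochs = 6             # num_train_epochs
logging_steps = 500    # assumed default logging interval

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Epochs that end before the first logging step show "No log".
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)
print(steps_per_epoch, total_steps, first_logged_epoch)
```

This gives 115 steps per epoch and 690 steps in total, matching `global_step=690`. The first training loss is logged at step 500, during epoch 5, which is why epochs 1 to 4 report `No log`.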

### Save the PEFT model

We save the trained model in the `distilbert_lora` directory. Listing that directory shows that the saved adapter files occupy only a small amount of storage.

In [None]:
lora_model_path = "distilbert_lora"
lora_model.save_pretrained(lora_model_path)
!ls {lora_model_path}

adapter_config.json  adapter_model.safetensors	README.md
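
One way to check the storage claim is to sum the file sizes in the output directory. The sketch below uses only the standard library. `dir_size_bytes` and `format_mb` are hypothetical helpers, and the demonstration runs on a temporary directory with a dummy file rather than on the real adapter.

```python
import os
import tempfile

def dir_size_bytes(path):
    """Sum the sizes of all regular files directly inside `path`."""
    total = 0
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            total += os.path.getsize(full)
    return total

def format_mb(num_bytes):
    """Format a byte count as megabytes with two decimals."""
    return f"{num_bytes / 1e6:.2f} MB"

# Demonstration on a throwaway directory holding one 1.5 MB dummy file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "dummy.bin"), "wb") as f:
        f.write(b"\0" * 1_500_000)
    demo_size = dir_size_bytes(tmp)
print(format_mb(demo_size))
```

In the notebook, the same helper could be called as `format_mb(dir_size_bytes(lora_model_path))` after saving.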


## Perform inference with fine-tuned model

This section proceeds in the following order.

1. Load the saved PEFT model
2. Evaluate the fine-tuned model
3. Compare initial model's performance with that of fine-tuned PEFT model
4. Check the data examples on which the trained model makes wrong predictions

### Load the saved PEFT model

We load the saved PEFT model from the local directory, `distilbert_lora`. After loading it, we evaluate the trained model on the test set.

In [None]:
from peft import AutoPeftModelForSequenceClassification

lora_model = AutoPeftModelForSequenceClassification.from_pretrained(lora_model_path)
lora_trainer = Trainer(
    lora_model,
    eval_dataset=ds["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
lora_eval = lora_trainer.evaluate()
lora_eval

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'eval_loss': 0.5058317184448242,
 'eval_accuracy': 0.7524637681159421,
 'eval_f1': 0.825214899713467,
 'eval_runtime': 5.6279,
 'eval_samples_per_second': 306.511,
 'eval_steps_per_second': 38.381}

### Compare initial model's performance with that of fine-tuned PEFT model

The trained model performs better than the initial model. Across repeated runs, the trained PEFT model consistently outperformed the initial untrained model.

In [None]:
import pandas as pd

pd.DataFrame([init_eval, lora_eval], index=["Untrained Model", "Trained Model"])[["eval_accuracy", "eval_f1"]]

| | eval_accuracy | eval_f1 |
|---|---|---|
| Untrained Model | 0.336232 | 0.003481 |
| Trained Model | 0.752464 | 0.825215 |


### Check the data examples on which the trained model makes wrong predictions

In [None]:
# show full cell output
pd.set_option("display.max_colwidth", None)

# take a random sample of 500 test examples
sampled_testsets = ds["test"].shuffle().select(range(500))
df = pd.DataFrame(sampled_testsets)
df = df[["sentence1", "sentence2", "label"]]

# predict labels for the sample and show the first misclassified rows
predictions = lora_trainer.predict(sampled_testsets)
df["predicted_label"] = np.argmax(predictions.predictions, axis=-1)
df[df["label"]!=df["predicted_label"]].head()

| | sentence1 | sentence2 | label | predicted_label |
|---|---|---|---|---|
| 5 | The stock rose $ 2.11 , or about 11 percent , to close on Friday at $ 21.51 on the New York Stock Exchange . | PG & E Corp. shares jumped $ 1.63 or 8 percent to $ 21.03 on the New York Stock Exchange on Friday . | 1 | 0 |
| 6 | Entrenched interests are positioning themselves to control the network 's chokepoints and they are lobbying the FCC to aid and abet them . | It may be dying because entrenched interests are positioning themselves to control the Internet 's choke-points and they are lobbying the FCC to aid and abet them . " | 0 | 1 |
| 14 | Gyorgy Heizler , head of the local disaster unit , said the coach had been carrying 38 passengers . | The head of the local disaster unit , Gyorgy Heizler , said the coach driver had failed to heed red stop lights . | 0 | 1 |
| 19 | On Thursday , a Washington Post article argued that a 50 basis point cut from the Fed was more likely , contrary to the Wall Street Journal 's line . | On Thursday , a Post article argued that a 50 basis point cut from the Fed was more likely . | 0 | 1 |
| 23 | Montreal-based Bombardier 's Class B shares rose 6 Canadian cents to C $ 3.80 in Toronto on Friday . | Bombardier 's class B shares were up 13 Canadian cents or 3.2 percent at C $ 3.93 on the Toronto Stock Exchange late Monday morning . | 1 | 0 |
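
To see which kind of mistake dominates, the wrong predictions can be split into false positives (label 0, predicted 1) and false negatives (label 1, predicted 0). The sketch below does this for the five rows shown above. `error_types` is a hypothetical helper, and five rows are far too few to draw conclusions about the model.

```python
from collections import Counter

def error_types(labels, predicted):
    """Count false positives and false negatives among the mistakes."""
    counts = Counter()
    for label, pred in zip(labels, predicted):
        if label == 0 and pred == 1:
            counts["false_positive"] += 1
        elif label == 1 and pred == 0:
            counts["false_negative"] += 1
    return counts

# Labels and predictions copied from the five rows shown above.
shown_labels = [1, 0, 0, 0, 1]
shown_predicted = [0, 1, 1, 1, 0]
shown_errors = error_types(shown_labels, shown_predicted)
print(dict(shown_errors))
```

The same function could be applied to the full `df["label"]` and `df["predicted_label"]` columns to get counts over the whole sample.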


## Future feature

One possible improvement is to apply hyperparameter tuning to the model. The [code example](https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb) by Hugging Face will be helpful for that.
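
As a starting point, a simple grid over a few settings could be enumerated as follows. This is only a sketch: the candidate values and the `iter_configs` helper are hypothetical, and the linked example uses its own search setup.

```python
from itertools import product

# Hypothetical candidate values, chosen only for illustration.
search_space = {
    "learning_rate": [5e-5, 1e-4, 3e-4],
    "lora_r": [4, 8, 16],
    "lora_dropout": [0.05, 0.1],
}

def iter_configs(space):
    """Yield one dict per combination of the candidate values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(search_space))
print(len(configs), configs[0])
```

Each configuration would then be used to build a fresh `LoraConfig` and `TrainingArguments`, with the runs compared on the `validation` split.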

## References

* [GLUE dataset](https://huggingface.co/datasets/glue)
* [PEFT documentation](https://huggingface.co/docs/peft/en/index)
* [LoRA documentation](https://huggingface.co/docs/peft/package_reference/lora#lora)
* [DistilBERT documentation](https://huggingface.co/docs/transformers/v4.38.1/en/model_doc/distilbert#distilbert)
* [Official blog about PEFT](https://huggingface.co/blog/peft)
* [Tutorial about how to fine-tune a pretrained model by Hugging Face](https://huggingface.co/learn/nlp-course/chapter3/1?fw=pt#introduction)
* [Text-classification code example from Hugging Face repo](https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb)
* [Text-classification code example, using LoRA from Hugging Face Space](https://huggingface.co/spaces/PEFT/sequence-classification/blob/main/LoRA.ipynb)
* [Blog about basic usage of using LoRA with Bert for text classification](https://medium.com/@karkar.nizar/fine-tuning-bert-for-text-classification-with-lora-f12af7fa95e4)
* [Blog about LoraConfig](https://medium.com/@manyi.yim/more-about-loraconfig-from-peft-581cf54643db)