## Fine-tune large models using 🤗 `peft` adapters, `transformers` & `bitsandbytes`

In this tutorial we will cover how we can fine-tune large language models using the very recent `peft` library and `bitsandbytes` for loading large models in 8-bit.
The fine-tuning method will rely on a recent method called "Low Rank Adapters" (LoRA), instead of fine-tuning the entire model you just have to fine-tune these adapters and load them properly inside the model.
After fine-tuning the model you can also share your adapters on the 🤗 Hub and load them very easily. Let's get started!

### Install requirements

First, run the cells below to install the requirements:

In [1]:
!pip install --upgrade transformers
!pip install --upgrade peft
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


### Model loading

Here let's load the `opt-6.7b` model, its weights in half-precision (float16) are about 13GB on the Hub! If we load them in 8-bit we would require around 7GB of memory instead.

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-large")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [3]:
from datasets import load_dataset

dataset_id="glue"
dataset_config="sst2"

dataset = load_dataset(dataset_id, dataset_config)

def process(examples):
    tokenized_inputs = tokenizer(
        examples["sentence"], truncation=True, max_length=512, padding='max_length'
    )
    return tokenized_inputs

tokenized_datasets = dataset.map(process, batched=True)
tokenized_datasets = tokenized_datasets.rename_column("label","labels")

tokenized_datasets["train"].set_format(type='torch', columns=['input_ids', 'idx', 'attention_mask', 'labels'])
tokenized_datasets["validation"].set_format(type='torch', columns=['input_ids', 'idx', 'attention_mask', 'labels'])

# dataloader = torch.utils.data.DataLoader(tokenized_datasets["train"], batch_size=4)

tokenized_datasets["test"].features

{'sentence': Value(dtype='string', id=None),
 'labels': ClassLabel(names=['negative', 'positive'], id=None),
 'idx': Value(dtype='int32', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

In [4]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, AutoModelForSequenceClassification, BitsAndBytesConfig

labels = tokenized_datasets["train"].features["labels"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

# Create a BitsAndBytesConfig object with the desired quantization settings
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # or load_in_8bit=True, as needed
    load_in_8bit=False  # if using 4-bit, set this to False
)

model = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/roberta-large",
    # quantization_config=quantization_config,
    device_map='auto',
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Post-processing on the model

Finally, we need to apply some post-processing on the 8-bit model to enable training, let's freeze all our layers, and cast the layer-norm in `float32` for stability. We also cast the output of the last layer in `float32` for the same reasons.

In [5]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
# model.lm_head = CastOutputToFloat(model.lm_head)

### Apply LoRA

Here comes the magic with `peft`! Let's load a `PeftModel` and specify that we are going to use low-rank adapters (LoRA) using `get_peft_model` utility function from `peft`.

In [6]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [7]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_CLS"
)

config = LoraConfig(task_type="SEQ_CLS",
                    r=16,
                    lora_alpha=32,
                    lora_dropout=0.01,
                    target_modules = ["query", "value"])

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 2624514 || all params: 357986308 || trainable%: 0.7331325085204097


### Training

In [9]:
from datasets import load_metric
import numpy as np

# define metrics and metrics function
accuracy = load_metric( "accuracy")

# define an evaluation function to pass into trainer later
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)

    return {"accuracy": accuracy.compute(predictions=predictions, references=labels)}

  accuracy = load_metric( "accuracy")


In [19]:
import transformers
import numpy as np

training_args = transformers.TrainingArguments(
    output_dir="outputs",
    evaluation_strategy="steps",
    logging_strategy="steps",
    logging_steps=1,  # Log every 100 steps
    eval_steps=1,  # Evaluate every 100 stepsz
    max_steps=10,  # Increase the max steps to allow more training iterations
    num_train_epochs=1,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    gradient_accumulation_steps=4,
    learning_rate=5e-4,  # Lower the learning rate
    weight_decay=0.01,
    warmup_steps=100,
    fp16=True,
    seed=42
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    args=training_args,
    compute_metrics=compute_metrics,
    data_collator=transformers.default_data_collator
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
1,0.141,0.152812,{'accuracy': 0.9495412844036697}
2,0.1654,0.152512,{'accuracy': 0.948394495412844}
3,0.1753,0.151932,{'accuracy': 0.944954128440367}
4,0.1725,0.151441,{'accuracy': 0.9461009174311926}
5,0.1629,0.151161,{'accuracy': 0.9461009174311926}
6,0.1631,0.151439,{'accuracy': 0.9495412844036697}
7,0.2094,0.15163,{'accuracy': 0.9472477064220184}
8,0.1504,0.15197,{'accuracy': 0.948394495412844}
9,0.1751,0.152127,{'accuracy': 0.948394495412844}
10,0.203,0.151477,{'accuracy': 0.948394495412844}


Trainer is attempting to log a value of "{'accuracy': 0.9495412844036697}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.948394495412844}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.944954128440367}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.9461009174311926}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.9461009174311926}" of 

TrainOutput(global_step=10, training_loss=0.17181369811296462, metrics={'train_runtime': 149.4727, 'train_samples_per_second': 34.254, 'train_steps_per_second': 0.067, 'total_flos': 4812768616120320.0, 'train_loss': 0.17181369811296462, 'epoch': 0.07590132827324478})

In [None]:
id2label

{'0': 'negative', '1': 'positive'}

In [None]:
# define list of examples
text_list = ["a feel-good picture in the best sense of the term .", "resourceful and ingenious entertainment .", "it 's just incredibly dull .", "the movie 's biggest offense is its complete and utter lack of tension .",
             "impresses you with its open-endedness and surprises .", "unless you are in dire need of a diesel fix , there is no real reason to see it ."]

print("Untrained model predictions:")
print("----------------------------")
for text in text_list:
    # tokenize text
    inputs = tokenizer.encode(text, return_tensors="pt")
    # compute logits
    logits = model(inputs).logits
    # convert logits to label
    predictions = torch.argmax(logits).item()

    print(text + " - " + id2label[str(predictions)])

Untrained model predictions:
----------------------------
a feel-good picture in the best sense of the term . - positive




resourceful and ingenious entertainment . - positive
it 's just incredibly dull . - negative
the movie 's biggest offense is its complete and utter lack of tension . - negative
impresses you with its open-endedness and surprises . - positive
unless you are in dire need of a diesel fix , there is no real reason to see it . - negative


In [None]:
from transformers import TrainingArguments, Trainer
import transformers

In [None]:
# hyperparameters
lr = 1e-3
batch_size = 4
num_epochs = 5

# define training arguments
training_args = TrainingArguments(
    output_dir= 'outputs',
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    # fp16=True,
)



In [None]:
data_collator=transformers.default_data_collator

In [12]:
from datasets import load_metric
import numpy as np

# define metrics and metrics function
accuracy = load_metric( "accuracy")

# define an evaluation function to pass into trainer later
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)

    return {"accuracy": accuracy.compute(predictions=predictions, references=labels)}

In [None]:
# creater trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
# train model
trainer.train()



Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

## Share adapters on the 🤗 Hub

In [21]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [22]:
model.push_to_hub("hassanalsawadi/roberta-large-lora", use_auth_token=True)



adapter_model.safetensors:   0%|          | 0.00/10.5M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/hassanalsawadi/roberta-large-lora/commit/c21b9608689675a57a95caec618a354b2c3aa53d', commit_message='Upload model', commit_description='', oid='c21b9608689675a57a95caec618a354b2c3aa53d', pr_url=None, pr_revision=None, pr_num=None)

## Load adapters from the Hub

You can also directly load adapters from the Hub using the commands below:

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = "ybelkada/opt-6.7b-lora"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id)


Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 112
CUDA SETUP: Loading binary /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda112.so...


  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)"adapter_model.bin";:   0%|          | 0.00/33.6M [00:00<?, ?B/s]

## Inference

You can then directly use the trained model or the model that you have loaded from the 🤗 Hub for inference as you would do it usually in `transformers`.

In [None]:
batch = tokenizer("Two things are infinite: ", return_tensors='pt')

with torch.cuda.amp.autocast():
  output_tokens = model.generate(**batch, max_new_tokens=50)

print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))





 Two things are infinite:  the universe and human stupidity; and I'm not sure about the universe.  -Albert Einstein
I'm not sure about the universe either.


As you can see by fine-tuning for few steps we have almost recovered the quote from Albert Einstein that is present in the [training data](https://huggingface.co/datasets/Abirate/english_quotes).