# Efficient Large Language Model training with LoRA and Hugging Face

This notebook shows how to apply Low-Rank Adaptation of Large Language Models (LoRA) to fine-tune FLAN-T5 XXL (11 billion parameters) on a single GPU.

## Setup Development Environment

In [1]:
!pip install -q -U peft
!pip install -q -U datasets
!pip install -q -U transformers
!pip install -q -U scikit-learn
!pip install -q -U accelerate
!pip install -q -U evaluate
!pip install -q -U bitsandbytes
!pip install -q -U loralib
!pip install -q -U rouge-score
!pip install -q -U tensorboard
!pip install -q -U py7zr

## Load and prepare the dataset

We use the samsum dataset, a collection of about 16k messenger-like conversations with summaries. The conversations were created and written down by linguists fluent in English.

In [2]:
from datasets import load_dataset
 
# Load dataset from the hub
dataset = load_dataset("samsum", trust_remote_code=True)
 
print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")
 
# Train dataset size: 14732
# Test dataset size: 819

Train dataset size: 14732
Test dataset size: 819


In [3]:
train_sample = dataset['train'].shuffle(seed=42).select(range(1000))
test_sample = dataset['test'].shuffle(seed=42).select(range(100))

print(f"Sample train dataset size: {len(train_sample)}")
print(f"Sample test dataset size: {len(test_sample)}")

Sample train dataset size: 1000
Sample test dataset size: 100


To train our model, we need to convert our inputs (text) to token IDs.
This is done by a Transformers Tokenizer.

In [4]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
 
model_id="google/flan-t5-xxl"
 
# Load tokenizer of FLAN-T5 XXL
tokenizer = AutoTokenizer.from_pretrained(model_id)

Before we can start training, we need to preprocess our data. 
Abstractive Summarization is a text-generation task.
Our model will take a text as input and generate a summary as output.
To batch our data efficiently, we first measure how long the tokenized inputs and outputs are.

In [5]:
from datasets import concatenate_datasets
import numpy as np

# Concatenate the train and test datasets
combined_dataset = concatenate_datasets([train_sample, test_sample])

# Tokenize the dataset
tokenized_inputs = combined_dataset.map(
    lambda x: tokenizer(x["dialogue"], truncation=True),
    batched=True, remove_columns=["dialogue", "summary"]
)
# Calculate input lengths
input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]
max_source_length = int(np.percentile(input_lengths, 85))
print(f"Max source length: {max_source_length}")


tokenized_targets = combined_dataset.map(
    lambda x: tokenizer(x["summary"], truncation=True),
    batched=True, remove_columns=["dialogue", "summary"]
)
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
max_target_length = int(np.percentile(target_lengths, 90))
print(f"Max target length: {max_target_length}")


Max source length: 272
Max target length: 51
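Choosing a percentile instead of the maximum length trades a little truncation for much less padding. The sketch below illustrates that trade-off on made-up lengths; `toy_lengths` and both helper names are hypothetical and not part of the notebook.

```python
import numpy as np

def length_cutoff(lengths, percentile):
    """Return the integer sequence length at the given percentile."""
    return int(np.percentile(lengths, percentile))

def truncated_fraction(lengths, cutoff):
    """Fraction of sequences that are longer than the cutoff and would be truncated."""
    return float((np.asarray(lengths) > cutoff).mean())

# Made-up token counts with one long outlier (illustrative only)
toy_lengths = [10, 20, 30, 40, 50, 60, 70, 80, 90, 400]

cutoff = length_cutoff(toy_lengths, 85)
print(f"85th percentile cutoff: {cutoff}")
print(f"Fraction truncated: {truncated_fraction(toy_lengths, cutoff)}")
print(f"Padded slots at cutoff: {cutoff * len(toy_lengths)}")
print(f"Padded slots at max length: {max(toy_lengths) * len(toy_lengths)}")
```

With a single outlier, padding every example to the maximum length would spend most of each batch on padding tokens, while the percentile cutoff truncates only the longest few examples.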


We preprocess our dataset before training and save it to disk.
You could run this step on your local machine or a CPU and upload it to the Hugging Face Hub.

In [6]:
def preprocess_function(sample, padding='max_length'):
    inputs = ["summarize: " + item for item in sample["dialogue"]]
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)

    # Replace pad token id with -100 to ignore padding in the loss
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]
    
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train_sample = train_sample.map(
    preprocess_function, batched=True, remove_columns=["dialogue", "summary", "id"]
)
tokenized_test_sample = test_sample.map(
    preprocess_function, batched=True, remove_columns=["dialogue", "summary", "id"]
)

# Print the keys of the tokenized dataset
print(f"Keys of tokenized train dataset: {list(tokenized_train_sample.features)}")
print(f"Keys of tokenized test dataset: {list(tokenized_test_sample.features)}")

# Save the tokenized datasets to disk
tokenized_train_sample.save_to_disk("data/train_sample")
tokenized_test_sample.save_to_disk("data/test_sample")

Keys of tokenized train dataset: ['input_ids', 'attention_mask', 'labels']
Keys of tokenized test dataset: ['input_ids', 'attention_mask', 'labels']


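The preprocessing function replaces padding token IDs in the labels with `-100` so that padded positions do not contribute to the loss. The following is a minimal NumPy sketch of that idea, not the code PyTorch or `transformers` actually runs; the pad ID of `0` and both function names are assumptions made for this toy example.

```python
import numpy as np

IGNORE_INDEX = -100  # label value that the loss skips
PAD_TOKEN_ID = 0     # assumed pad token ID for this toy example

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace pad token IDs with IGNORE_INDEX, as the preprocessing step does."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in label_ids]

def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    keep = labels != IGNORE_INDEX
    # Numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the target token at each kept position
    picked = log_probs[keep, labels[keep]]
    return float(-picked.mean())

labels = mask_padding([5, 7, 1, 0, 0])   # the last two positions are padding
logits = np.zeros((5, 10))               # uniform predictions over a 10-token vocabulary
print(labels)
print(masked_cross_entropy(logits, labels))
```

Changing the logits at the two padded positions leaves the loss unchanged, which is the point of the masking.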

In addition to LoRA, we use bitsandbytes LLM.int8() to quantize our frozen LLM to int8. This reduces the memory needed for FLAN-T5 XXL by roughly 4x.
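A back-of-the-envelope estimate makes the memory saving concrete. This sketch counts weight storage only, ignoring activations, optimizer state, and any modules kept in higher precision. It uses the rounded 11 billion parameter figure from the introduction, and it assumes the roughly 4x figure compares int8 against fp32; against the fp16 checkpoint the same arithmetic gives 2x.

```python
NUM_PARAMS = 11_000_000_000  # rounded "11 billion parameters" from the introduction

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gib(num_params, dtype):
    """Approximate weight storage in GiB for the given dtype (weights only)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")

fp32_vs_int8 = weight_memory_gib(NUM_PARAMS, "fp32") / weight_memory_gib(NUM_PARAMS, "int8")
fp16_vs_int8 = weight_memory_gib(NUM_PARAMS, "fp16") / weight_memory_gib(NUM_PARAMS, "int8")
print(f"fp32 / int8: {fp32_vs_int8:.0f}x, fp16 / int8: {fp16_vs_int8:.0f}x")
```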

The first step of our training is to load the model. We use philschmid/flan-t5-xxl-sharded-fp16, a sharded version of google/flan-t5-xxl. Sharding helps us avoid running out of memory when loading the model.

In [7]:
from transformers import AutoModelForSeq2SeqLM
 
# huggingface hub model id
model_id = "philschmid/flan-t5-xxl-sharded-fp16"
 
# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


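The deprecation warning in the output above recommends passing a `BitsAndBytesConfig` through `quantization_config` instead of `load_in_8bit=True`. A minimal sketch of that variant follows; the wrapper function `load_int8_model` is a hypothetical helper added here so that nothing is downloaded until it is called, and it was not run as part of this notebook.

```python
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

# Same int8 quantization as load_in_8bit=True, expressed as a config object
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

def load_int8_model(model_id="philschmid/flan-t5-xxl-sharded-fp16"):
    """Load the sharded checkpoint in int8 (hypothetical helper; requires a GPU and bitsandbytes)."""
    return AutoModelForSeq2SeqLM.from_pretrained(
        model_id,
        quantization_config=quantization_config,
        device_map="auto",
    )
```

Calling `model = load_int8_model()` would then take the place of the `from_pretrained` call in the cell above.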


Now we prepare our model for LoRA int8 training using `peft`.

In [8]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


trainable params: 18,874,368 || all params: 11,154,206,720 || trainable%: 0.1692


As you can see, we are only training about 0.17% of the model's parameters! This large reduction in trainable parameters lets us fine-tune the model without memory issues.
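The trainable parameter count can be reproduced by hand. Each adapted `q` or `v` projection gains two small matrices, `lora_A` (4096 x 16) and `lora_B` (16 x 4096). The sketch below assumes FLAN-T5 XXL has 24 encoder blocks and 24 decoder blocks, with each decoder block holding both a self-attention and a cross-attention module; those layer counts are not shown in this notebook, but the result matches the printed output.

```python
D_MODEL = 4096          # in_features / out_features of the q and v projections
RANK = 16               # r in LoraConfig
TARGETS_PER_ATTN = 2    # target_modules=["q", "v"]

ENCODER_BLOCKS = 24     # assumption: FLAN-T5 XXL layout
DECODER_BLOCKS = 24     # assumption: FLAN-T5 XXL layout

# Encoder blocks have one attention module; decoder blocks have two (self + cross)
attention_modules = ENCODER_BLOCKS * 1 + DECODER_BLOCKS * 2

# lora_A has D_MODEL * RANK weights and lora_B has RANK * D_MODEL weights
params_per_target = RANK * (D_MODEL + D_MODEL)

trainable = attention_modules * TARGETS_PER_ATTN * params_per_target
all_params = 11_154_206_720  # "all params" from print_trainable_parameters()
trainable_pct = 100 * trainable / all_params

print(f"trainable params: {trainable:,}")
print(f"trainable%: {trainable_pct:.4f}")
```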

Next, we create a `DataCollator` that will take care of padding our inputs and labels. We will use the `DataCollatorForSeq2Seq` from the transformers library.

In [9]:
from transformers import DataCollatorForSeq2Seq
 
# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)
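To illustrate what `pad_to_multiple_of=8` and `label_pad_token_id=-100` mean for the labels, here is a simplified sketch of the padding rule. It is an illustration only, not the implementation in `transformers`, and both function names are hypothetical.

```python
def round_up_to_multiple(length, multiple):
    """Smallest multiple of `multiple` that is >= length."""
    return ((length + multiple - 1) // multiple) * multiple

def pad_label_batch(label_batch, pad_to_multiple_of=8, label_pad_token_id=-100):
    """Right-pad every label list to a common length that is a multiple of pad_to_multiple_of."""
    longest = max(len(labels) for labels in label_batch)
    target = round_up_to_multiple(longest, pad_to_multiple_of)
    return [labels + [label_pad_token_id] * (target - len(labels)) for labels in label_batch]

batch = [[11, 12, 13], [21, 22, 23, 24, 25, 26, 27, 28, 29]]
padded = pad_label_batch(batch)
print([len(labels) for labels in padded])
print(round_up_to_multiple(51, 8))  # labels of length 51 round up to the next multiple of 8
```

Because the fill value is `-100`, the extra positions are ignored by the loss in the same way as the padding replaced during preprocessing.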

Define the hyperparameters (`TrainingArguments`) we want to use for our training.

In [10]:
from torch.optim import AdamW  # the AdamW class in transformers is deprecated
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
 
output_dir="lora-flan-t5-xxl"
 
# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # higher learning rate
    num_train_epochs=5,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)
 
# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_train_sample,
    optimizers=(
        AdamW(model.parameters(), lr=training_args.learning_rate),
        None
    )
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
model.to(training_args.device)



PeftModelForSeq2SeqLM(
  (base_model): LoraModel(
    (model): T5ForConditionalGeneration(
      (shared): Embedding(32128, 4096)
      (encoder): T5Stack(
        (embed_tokens): Embedding(32128, 4096)
        (block): ModuleList(
          (0): T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(
                (SelfAttention): T5Attention(
                  (q): lora.Linear8bitLt(
                    (base_layer): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.05, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=4096, out_features=16, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=16, out_features=4096, bias=False)
                    )
                    (lora_embedding_A): ParameterD

In [11]:
trainer.train()

Step,Training Loss
500,1.1703
1000,1.2961
1500,1.0984
2000,1.1666
2500,0.9025
3000,0.9028
3500,0.644
4000,0.6939
4500,0.4673
5000,0.4437


TrainOutput(global_step=5000, training_loss=0.8785602874755859, metrics={'train_runtime': 2614.0781, 'train_samples_per_second': 1.913, 'train_steps_per_second': 1.913, 'total_flos': 8.994446770176e+16, 'train_loss': 0.8785602874755859, 'epoch': 5.0})

Training took ~45 minutes (a `train_runtime` of about 2,614 seconds) on a single 3090 GPU and cost about $0.11.
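The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below uses only values reported above; the per-device batch size and the hourly price are derived here and are not stated in the notebook, so treat them as implied estimates.

```python
TRAIN_SAMPLES = 1000      # size of train_sample
EPOCHS = 5                # num_train_epochs
GLOBAL_STEPS = 5000       # global_step in TrainOutput
RUNTIME_S = 2614.0781     # train_runtime in TrainOutput
COST_USD = 0.11           # cost quoted in the text

samples_seen = TRAIN_SAMPLES * EPOCHS
effective_batch_size = samples_seen / GLOBAL_STEPS   # samples consumed per optimizer step
runtime_minutes = RUNTIME_S / 60
samples_per_second = samples_seen / RUNTIME_S
implied_usd_per_hour = COST_USD / (RUNTIME_S / 3600)

print(f"effective batch size: {effective_batch_size:.0f}")
print(f"runtime: {runtime_minutes:.1f} minutes")
print(f"samples/second: {samples_per_second:.3f}")
print(f"implied price: ${implied_usd_per_hour:.2f}/hour")
```

An effective batch size of 1 suggests that `auto_find_batch_size=True` reduced the batch size until a single example fit in GPU memory, which is also why `train_samples_per_second` and `train_steps_per_second` are identical in the output.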

Next, we save the model so we can use it for inference and evaluation. We will save it to disk for now.

In [12]:
# Save our LoRA model & tokenizer results
peft_model_id="results"
trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)
# To also save the base model, call:
# trainer.model.base_model.save_pretrained(peft_model_id)

('results/tokenizer_config.json',
 'results/special_tokens_map.json',
 'results/tokenizer.json')

## Evaluate & run Inference with LoRA FLAN-T5

We use the `evaluate` library to compute the ROUGE score.
We can run inference using `PEFT` and `transformers`.
For our FLAN-T5 XXL model, we need at least 18GB of GPU memory.

In [1]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
 
# Load peft config for pre-trained checkpoint etc.
peft_model_id = "results"
config = PeftConfig.from_pretrained(peft_model_id)
 
# load base LLM model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path,  load_in_8bit=True,  device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
 
# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id, device_map={"":0})
model.eval()
 
print("Peft model loaded")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Peft model loaded


We load the dataset again and pick a random sample to try the summarization.

In [2]:
from datasets import load_dataset
from random import randrange
 
 
# Load dataset from the hub and get a sample
dataset = load_dataset("samsum")
sample = dataset['test'][randrange(len(dataset["test"]))]
 
input_ids = tokenizer(sample["dialogue"], return_tensors="pt", truncation=True).input_ids.cuda()
# with torch.inference_mode():
outputs = model.generate(input_ids=input_ids, max_new_tokens=10, do_sample=True, top_p=0.9)
print(f"input sentence: {sample['dialogue']}\n{'---'* 20}")
 
print(f"summary:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]}")

input sentence: Breonna: Melissa so are you ok with men acting like pigs and grabbing you whenever they feel like it?
Melissa: Of course not! 
Breonna: So why you’re saying this?
Melissa: No one is saying men should behave like animals. Its that kind of thinking that makes men afraid to just be masculine in general though. Not all men are pigs.
Hailey: Thank you! I had a friend get yelled at by a girl for 5 minutes because he held the door for her and said “Ladies first”.
Melissa: Poor him. 
Christine: Men aren't allowed to be men? how?
Hailey: Yeah, I felt sorry for him.
Holly: Melissa yes this!
Michelle: not one of those fruit loop "feminists" speak for me!!! In fact MOST women cant stand them. gtfoh with your pussy hats, your metoo crap, you "screaming" at Potus through your vaginas etc. Dont try to shove your thinking on everyone. how about HIMTOO!!! i LOVE OUR strong REAL MEN!!!! TOO BAD IF ANY OF THOSE CRAZY CAT LADIES DONT LIKE IT
------------------------------------------------

In [3]:
import evaluate
import numpy as np
from datasets import load_from_disk
from tqdm import tqdm
 
# Metric
metric = evaluate.load("rouge")
 
def evaluate_peft_model(sample,max_target_length=50):
    # generate summary
    outputs = model.generate(input_ids=sample["input_ids"].unsqueeze(0).cuda(), do_sample=True, top_p=0.9, max_new_tokens=max_target_length)
    prediction = tokenizer.decode(outputs[0].detach().cpu().numpy(), skip_special_tokens=True)
    # decode eval sample
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(sample['labels'] != -100, sample['labels'], tokenizer.pad_token_id)
    labels = tokenizer.decode(labels, skip_special_tokens=True)
 
    # Some simple post-processing
    return prediction, labels
 
# load test dataset from disk
test_dataset = load_from_disk("data/test_sample/").with_format("torch")
 
# run predictions
# this can take ~45 minutes
predictions, references = [] , []
for sample in tqdm(test_dataset):
    p,l = evaluate_peft_model(sample)
    predictions.append(p)
    references.append(l)
 
# compute metric
rogue = metric.compute(predictions=predictions, references=references, use_stemmer=True)
 
# print results
print(f"Rogue1: {rogue['rouge1']* 100:2f}%")
print(f"rouge2: {rogue['rouge2']* 100:2f}%")
print(f"rougeL: {rogue['rougeL']* 100:2f}%")
print(f"rougeLsum: {rogue['rougeLsum']* 100:2f}%")
 
# Rogue1: 50.386161%
# rouge2: 24.842412%
# rougeL: 41.370130%
# rougeLsum: 41.394230%

100%|██████████| 100/100 [05:43<00:00,  3.44s/it]


Rogue1: 50.248456%
rouge2: 24.313786%
rougeL: 40.275036%
rougeLsum: 40.313621%


Our PEFT fine-tuned FLAN-T5 XXL achieved a ROUGE-1 score of about `50%` on the 100-example test sample.
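For intuition about what ROUGE-1 measures, here is a simplified sketch of a unigram-overlap F1 score. It is not the implementation behind the `evaluate` metric: it splits on whitespace, lowercases, and skips the stemming enabled above with `use_stemmer=True`, so its numbers will not match the reported scores exactly. The function name and example sentences are made up.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a predicted and a reference summary (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat", "the cat sat on the mat")
print(f"ROUGE-1 F1: {score:.4f}")
```

Here every predicted word appears in the reference (precision 1.0), but the prediction covers only half of the reference words (recall 0.5), so the F1 score sits between the two.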