# Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering 

**Author:** {ref}Sean Song <sean-song>, AI Software Solutions\
**Read time:** 10 minutes\
**Last edited:** 4 Jan 2024

In this blog, we show you how to fine-tune Llama 2 on an AMD GPU with ROCm. We use Low-Rank Adaptation of Large Language Models (LoRA) to overcome memory and computing limitations and make open-source large language models (LLMs) more accessible. We also show you how to fine-tune and upload models to Hugging Face.

## Introduction<a class="anchor" id="Introduction"></a>
In the dynamic realm of Generative AI (GenAI), fine-tuning LLMs (such as Llama 2) poses distinctive challenges related to substantial computational and memory requirements. LoRA introduces a compelling solution, allowing rapid and cost-effective fine-tuning of state-of-the-art LLMs. This breakthrough capability not only expedites the tuning process, but also lowers associated costs.

To explore the benefits of LoRA, we provide a comprehensive walkthrough of the fine-tuning process for Llama 2 using LoRA specifically tailored for question-answering (QA) tasks on an AMD GPU.

Before jumping in, let's take a moment to briefly review the three pivotal components that form the foundation of our discussion:

- Llama 2: Meta's advanced language model with variants that scale up to 70 billion parameters.
- Fine-tuning: A crucial process that refines LLMs for specialized tasks, optimizing their performance.
- LoRA: The algorithm employed for fine-tuning Llama 2, ensuring effective adaptation to specialized tasks.

### Llama 2<a class="anchor" id="Understanding_Llama_2"></a>
[Llama 2: Open Foundation and Fine-Tuned Chat Models](https://arxiv.org/abs/2307.09288) is a collection of second-generation, open-source LLMs from Meta; it comes with a commercial license. Llama 2 is designed to handle a wide range of natural language processing (NLP) tasks, with models ranging in scale from 7 billion to 70 billion parameters.

Llama 2 Chat, which is optimized for dialogue, has shown similar performance to popular closed-source models like ChatGPT and PaLM. You can improve the performance of this model by fine-tuning it with a high-quality conversational data set. In this blog post, we delve into the process of refining a Llama 2 Chat model using a QA data set.

### Fine-tuning a model<a class="anchor" id="Model_Fine_Tuning"></a>
Fine-tuning in machine learning is the process of adjusting the weights and parameters of a pre-trained model using new data in order to improve its performance on a specific task. It involves using a new data set--one that is specific to the current task--to update the model's weights. It's typically not possible to fine-tune LLMs on consumer hardware due to inadequate memory and computing power. However, in this tutorial, we use LoRA to overcome these challenges.

### LoRA<a class="anchor" id="Lora"></a>
[LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) is a technique, developed by researchers at Microsoft, that addresses the challenges of fine-tuning LLMs. Instead of updating every weight, LoRA freezes the pre-trained weights and trains small low-rank matrices that are added to selected layers. This reduces the number of parameters that need to be fine-tuned (by a factor of up to 10,000), which significantly reduces GPU memory requirements. To learn more about the fundamental principles of LoRA, refer to [Using LoRA for efficient fine-tuning: Fundamental principles](../lora-fundamentals/README.md).
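
As a rough illustration of the low-rank update, the following sketch computes `W x + (alpha / r) * B (A x)` on plain Python lists. The helper names (`matvec`, `lora_forward`) and the toy matrices are hypothetical and exist only for this example; real implementations (such as the `peft` library used later) operate on GPU tensors.

```python
# Minimal LoRA forward pass on plain Python lists (illustrative only).

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x).

    W is the frozen d_out x d_in pre-trained weight.
    A (r x d_in) and B (d_out x r) are the only trainable matrices.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes (assumed for illustration): d_in = d_out = 4, rank r = 2.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
A = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8]]
B_init = [[0.0, 0.0] for _ in range(4)]  # B starts at zero, so the update starts at zero
x = [1.0, 2.0, 3.0, 4.0]

y = lora_forward(W, A, B_init, x, alpha=8, r=2)
print(y)
```

Because `B` is initialized to zero, the adapted layer initially produces the same output as the frozen layer; training then moves only `A` and `B`.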

## Step-by-step Llama 2 fine-tuning<a class="anchor" id="Step-By-Step-Guide"></a>
Standard (full-parameter) fine-tuning updates all of the model's parameters. It requires significant memory to hold gradients and optimizer states in addition to the weights. The resulting memory footprint is typically about four times larger than the model itself. For example, loading a 7 billion parameter model (e.g. Llama 2) in FP32 (4 bytes per parameter) requires approximately 28 GB of GPU memory, while fine-tuning demands around 28 × 4 = 112 GB of GPU memory. Note that the 112 GB figure is an empirical estimate; factors such as batch size, data precision, and gradient accumulation also affect overall memory usage.
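
The arithmetic above can be sketched in a few lines. The function name and the default multiplier of 4 are assumptions taken from the rule of thumb in this section, not a precise memory model.

```python
def full_finetune_memory_gb(num_params, bytes_per_param=4, training_multiplier=4):
    """Rough estimate: weight memory, and weight memory times an empirical multiplier."""
    weights_gb = num_params * bytes_per_param / 1e9
    training_gb = weights_gb * training_multiplier
    return weights_gb, training_gb


# 7 billion parameters in FP32 (4 bytes per parameter)
weights_gb, training_gb = full_finetune_memory_gb(7e9)
print(f"weights: {weights_gb:.0f} GB, full fine-tuning: {training_gb:.0f} GB")
```

Changing `bytes_per_param` to 2 gives a comparable estimate for FP16 or BF16 weights.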

To overcome this memory limitation, you can use a parameter-efficient fine-tuning (PEFT) technique, such as LoRA.

This example uses the two Graphics Compute Dies (GCDs) of an AMD MI250 GPU; each GCD is equipped with 64 GB of VRAM. This setup lets us explore different settings for fine-tuning the Llama-2-7b weights with and without LoRA.

Our setup:
* Hardware: AMD Instinct MI250
* Software:
  * [ROCm 6.1.0+](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/)
  * [PyTorch 2.0.1+](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/3rd-party/pytorch-install.html#installing-pytorch-for-rocm)
* Libraries: `transformers`, `accelerate`, `peft`, `trl`, `bitsandbytes`, `scipy`

In this blog, we conducted our experiment using a single MI250 GPU with the Docker image [rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2](https://hub.docker.com/layers/rocm/pytorch/rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2/images/sha256-f6ea7cee8aae299c7f6368187df7beed29928850c3929c81e6f24b34271d652b?context=explore).

### Step 1: Getting started

First, let's confirm the availability of the GPU.

In [None]:
!rocm-smi --showproductname

```text
========================= ROCm System Management Interface =========================
=================================== Product Info ===================================
GPU[0]		: Card series: 		AMD INSTINCT MI250 (MCM) OAM AC MBA
GPU[0]		: Card model: 		0x0b0c
GPU[0]		: Card vendor: 		Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]		: Card SKU: 		D65209
GPU[1]		: Card series: 		AMD INSTINCT MI250 (MCM) OAM AC MBA
GPU[1]		: Card model: 		0x0b0c
GPU[1]		: Card vendor: 		Advanced Micro Devices, Inc. [AMD/ATI]
GPU[1]		: Card SKU: 		D65209
====================================================================================
=============================== End of ROCm SMI Log ================================
```
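
ROCm builds of PyTorch expose GPUs through the `torch.cuda` API, so the same check can also be made from Python. This is a small sketch; the helper name `list_visible_gpus` is hypothetical, and it returns an empty list when PyTorch is not installed or no GPU is visible.

```python
def list_visible_gpus():
    """Return the names of GPUs visible to PyTorch, or an empty list."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]


gpu_names = list_visible_gpus()
print(f"{len(gpu_names)} GPU(s) visible: {gpu_names}")
```

On an MI250, each GCD is expected to show up as a separate device, as in the `rocm-smi` listing above.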

Next, install the required libraries.

In [2]:
!pip install -q pandas peft transformers==4.31.0 trl==0.4.7 accelerate scipy


#### Installing bitsandbytes
ROCm needs a special version of bitsandbytes (bitsandbytes-rocm).
1. Install bitsandbytes using the following code.

In [None]:
%%bash
git clone --recurse https://github.com/ROCm/bitsandbytes
cd bitsandbytes
git checkout rocm_enabled
pip install -r requirements-dev.txt
cmake -DCOMPUTE_BACKEND=hip -S . #Use -DBNB_ROCM_ARCH="gfx90a;gfx942" to target specific gpu arch
make
pip install .

2. Check the bitsandbytes version.
At the time of writing this blog, the version is 0.43.0.

In [None]:
%%bash
pip list | grep bitsandbytes
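
If you prefer to check versions from Python rather than the shell, the standard library's `importlib.metadata` can report what is installed. The helper name `installed_version` is hypothetical; it returns `None` for packages that are not installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


for name in ("bitsandbytes", "transformers", "trl", "peft"):
    print(f"{name}: {installed_version(name)}")
```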

#### Import the required packages

In [5]:
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)
from peft import LoraConfig
from trl import SFTTrainer


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /opt/conda/envs/py_3.10/lib/python3.10/site-packages/bitsandbytes-0.39.1-py3.10.egg/bitsandbytes/libbitsandbytes_hip_nohipblaslt.so
CUDA SETUP: Loading binary /opt/conda/envs/py_3.10/lib/python3.10/site-packages/bitsandbytes-0.39.1-py3.10.egg/bitsandbytes/libbitsandbytes_hip_nohipblaslt.so...


### Step 2: Configuring the model and data

You can access Meta's official Llama-2 model from Hugging Face after making a request, which can
take a couple of days. Instead of waiting, we'll use NousResearch’s Llama-2-7b-chat-hf as our base
model (it's the same as the original, but quicker to access).

In [6]:
# Model and tokenizer names
base_model_name = "NousResearch/Llama-2-7b-chat-hf"
new_model_name = "llama-2-7b-enhanced"  # You can choose your own name for the fine-tuned model

# Tokenizer
llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"  

# Model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto"
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.70s/it]


After you have the base model, you can start fine-tuning. We fine-tune our base model for a
question-and-answer task using a small data set called
[mlabonne/guanaco-llama2-1k](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k), which
is a subset (1,000 samples) of the timdettmers/openassistant-guanaco data set. That data set is in
turn derived from [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1), a
human-generated, human-annotated, assistant-style conversation corpus that
contains 161,443 messages in 35 different languages, annotated with 461,292 quality ratings,
resulting in over 10,000 fully annotated conversation trees.

In [7]:
# Dataset
data_name = "mlabonne/guanaco-llama2-1k"
training_data = load_dataset(data_name, split="train")
# check the data
print(training_data.shape)
# #11 is a QA sample in English
print(training_data[11])

In [8]:
# tensorboardX is needed for TensorBoard logging during training
!pip install tensorboardX


### Step 3: Start fine-tuning

To set your training parameters, use the following code:

In [9]:
# Training Params
train_params = TrainingArguments(
    output_dir="./results_modified",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=50,
    learning_rate=4e-5,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
) 
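
To see what these settings imply, the number of optimizer steps per epoch can be estimated from the data set size and batch settings. This is a simplified sketch: the helper name is hypothetical, and it assumes a single data-parallel worker, which is consistent with the `global_step=250` reported in the training output later in this post.

```python
import math


def steps_per_epoch(num_samples, per_device_batch_size, grad_accum_steps=1, num_devices=1):
    """Approximate optimizer steps per epoch for the given batch settings."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return math.ceil(num_samples / effective_batch)


# 1,000 training samples, per_device_train_batch_size=4, gradient_accumulation_steps=1
total_steps = steps_per_epoch(num_samples=1000, per_device_batch_size=4)
log_points = total_steps // 50  # logging_steps=50
print(f"steps per epoch: {total_steps}, logged loss values: {log_points}")
```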

#### Training with LoRA configuration<a class="anchor" id="Training_with_LoRA_configuration"></a>

Now you can integrate LoRA into the base model and assess its additional parameters. LoRA essentially adds pairs of rank-decomposition weight matrices (called update matrices) to existing weights, and only trains the newly added weights.

In [10]:
from peft import get_peft_model
# LoRA Config
peft_parameters = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.1,
    r=8,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, peft_parameters)
model.print_trainable_parameters()

trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
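
The trainable parameter count above can be reproduced by hand. The sketch below assumes that LoRA is applied to two 4096 × 4096 projection matrices (query and value) in each of the model's 32 decoder layers; these shapes and targets are assumptions about the default configuration for Llama 2 7B, not values printed by the code above. The base parameter count is the one reported for the model without LoRA later in this post.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in one LoRA pair: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


hidden_size = 4096       # assumed hidden size of Llama 2 7B
num_layers = 32          # assumed number of decoder layers
targets_per_layer = 2    # assumed targets: query and value projections
r = 8                    # LoRA rank from LoraConfig

trainable = num_layers * targets_per_layer * lora_param_count(hidden_size, hidden_size, r)
base_params = 6_738_415_616
all_params = base_params + trainable
pct = 100 * trainable / all_params

print(f"trainable params: {trainable:,} || all params: {all_params:,} || trainable%: {pct:.4f}")
```

Under these assumptions, the result matches the figures printed by `print_trainable_parameters()`.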


LoRA adds only 0.062% additional parameters, a tiny fraction of the original model. These are the only parameters we update through fine-tuning, as follows.

In [11]:
# Trainer with LoRA configuration
fine_tuning = SFTTrainer(
    model=base_model,
    train_dataset=training_data,
    peft_config=peft_parameters,
    dataset_text_field="text",
    tokenizer=llama_tokenizer,
    args=train_params
)

# Training
fine_tuning.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 50 | 1.9764 |
| 100 | 1.6135 |
| 150 | 1.4091 |
| 200 | 1.3915 |
| 250 | 1.3773 |


TrainOutput(global_step=250, training_loss=1.5535581665039062, metrics={'train_runtime': 484.7942, 'train_samples_per_second': 2.063, 'train_steps_per_second': 0.516, 'total_flos': 1.701064079130624e+16, 'train_loss': 1.5535581665039062, 'epoch': 1.0})

To save your model, run this code:

In [12]:
# Save Model
fine_tuning.model.save_pretrained(new_model_name)

##### Checking memory usage during training with LoRA<a class="anchor" id="Checking_memory_usage_during_training_with_LoRA"></a>
During training, you can check the memory usage by running the `rocm-smi` command in a terminal.
This command produces the following output:

To facilitate a comparison between fine-tuning with and without LoRA, our subsequent phase involves running a thorough fine-tuning process on the base model. This involves updating all parameters within the base model. We then analyze differences in memory usage, training speed, training loss, and other relevant metrics.

#### Training without LoRA configuration<a class="anchor" id="Training_without_LoRA_configuration"></a>

***To run this section, you need to restart the kernel and skip the [Training with LoRA configuration](#Training_with_LoRA_configuration) section.***

For a direct comparison between the two approaches, we keep the same `train_params` settings for full-parameter fine-tuning, apart from a lower learning rate (set below).

To check the trainable parameters in your base model, use the following code.

In [13]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
    )

print_trainable_parameters(base_model)

trainable params: 6738415616 || all params: 6738415616 || trainable%: 100.00

In [14]:
# Set a lower learning rate for full-parameter fine-tuning
train_params.learning_rate = 4e-7
print(train_params.learning_rate)

In [15]:
# Trainer without LoRA configuration

fine_tuning_full = SFTTrainer(
    model=base_model,
    train_dataset=training_data,
    dataset_text_field="text",
    tokenizer=llama_tokenizer,
    args=train_params
)

# Training
fine_tuning_full.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 50 | 1.7123 |
| 100 | 1.487 |
| 150 | 1.3638 |
| 200 | 1.3711 |
| 250 | 1.3683 |


TrainOutput(global_step=250, training_loss=1.4604909362792968, metrics={'train_runtime': 10993.7995, 'train_samples_per_second': 0.091, 'train_steps_per_second': 0.023, 'total_flos': 1.6999849383985152e+16, 'train_loss': 1.4604909362792968, 'epoch': 1.0})

##### Checking memory usage during training without LoRA<a class="anchor" id="Checking_memory_usage_during_training_without_LoRA"></a>
During training, you can check the memory usage by running the `rocm-smi` command in a terminal.
This command produces the following output:

### Step 4: Comparison between fine-tuning with LoRA and full-parameter fine-tuning<a class="anchor" id="Comparison"></a>
Comparing the results from the [Training with LoRA configuration](#Training_with_LoRA_configuration) and [Training without LoRA configuration](#Training_without_LoRA_configuration) sections, note the following:
- Memory usage:
    - In the case of full-parameter fine-tuning, there are **6,738,415,616** trainable parameters, leading to significant memory consumption during backpropagation.
    - In contrast, LoRA introduces only **4,194,304** trainable parameters, about **0.062%** of the parameters trained in full-parameter fine-tuning.
    - Monitoring memory usage during training with and without LoRA reveals that fine-tuning with LoRA
        uses only **65%** of the memory consumed by full-parameter fine-tuning. This presents an
        opportunity to increase batch size and max sequence length, and to train on larger data sets using
        limited hardware resources.
- Training speed:
    - The results demonstrate that full-parameter fine-tuning takes **hours** to complete (about 3 hours
        in our run), while fine-tuning with LoRA finishes in less than **9 minutes**. Several factors
        contribute to this acceleration:
        - Fewer trainable parameters in LoRA translate to fewer derivative calculations and less memory
        required to store and update weights.
        - Full-parameter fine-tuning is more prone to being memory-bound, where data movement
        becomes a bottleneck for training. This is reflected in lower GPU utilization. Although adjusting
        training settings can alleviate this, it may require more resources (additional GPUs) and a smaller
        batch size.
- Accuracy:
    - In both training sessions, a notable reduction in training loss was observed. At step 250, the
        training loss was closely aligned for both approaches: **1.368** for full-parameter fine-tuning and
        **1.377** for fine-tuning with LoRA. If you're interested in understanding the impact of LoRA on
        fine-tuning performance, refer to
        [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685).
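
To put a number on the speed difference, the following sketch converts the `train_runtime` values from the two `TrainOutput` results in this post into minutes and hours and computes their ratio. The variable names are illustrative.

```python
# Runtimes in seconds, copied from the two TrainOutput results in this post.
lora_runtime_s = 484.7942
full_runtime_s = 10993.7995

speedup = full_runtime_s / lora_runtime_s

print(f"LoRA fine-tuning:           {lora_runtime_s / 60:.1f} minutes")
print(f"Full-parameter fine-tuning: {full_runtime_s / 3600:.2f} hours")
print(f"Speedup with LoRA:          {speedup:.1f}x")
```

Keep in mind that this ratio reflects a single run on one hardware setup with the settings shown above; other batch sizes or precisions may give different results.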

### Step 5: Test the fine-tuned model with LoRA<a class="anchor" id="Test_the_model"></a>

To test your model, run the following code:


In [16]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
from peft import LoraConfig, PeftModel
model = PeftModel.from_pretrained(base_model, new_model_name)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.34s/it]


Uploading the model to Hugging Face lets you conduct subsequent tests or share your model with
others (to proceed with this step, you'll need an active Hugging Face account).

In [17]:
from huggingface_hub import login
# Use your own Hugging Face access token
login("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
# Push the model to Hugging Face. This can take several minutes, depending on the model size and your network speed.
model.push_to_hub(new_model_name, use_temp_dir=False)
tokenizer.push_to_hub(new_model_name, use_temp_dir=False)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


CommitInfo(commit_url='https://huggingface.co/superbigtree/llama-2-7b-enhanced/commit/cd267caccfaa60711403ff5f44173801ea493c25', commit_message='Upload tokenizer', commit_description='', oid='cd267caccfaa60711403ff5f44173801ea493c25', pr_url=None, pr_revision=None, pr_num=None)
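
Hard-coding an access token in a notebook makes it easy to leak by accident. One alternative is to read it from an environment variable. The sketch below is an assumption about how you might organize this: the helper `get_hf_token` is hypothetical, and the variable name `HF_TOKEN` is a choice made for this example.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face access token from an environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face access token."
        )
    return token


# Usage (requires the huggingface_hub package):
# from huggingface_hub import login
# login(token=get_hf_token())
```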

Now we can test the base model (original) and the fine-tuned model.

In [20]:
# Generate Text using base model
query = "What do you think is the most important part of building an AI chatbot?"
text_gen = pipeline(task="text-generation", model=base_model_name, tokenizer=llama_tokenizer, max_length=200)
output = text_gen(f"<s>[INST] {query} [/INST]")
print(output[0]['generated_text'])

Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.36it/s]


<s>[INST] What do you think is the most important part of building an AI chatbot? [/INST]  There are several important aspects to consider when building an AI chatbot, but here are some of the most critical elements:

1. Natural Language Processing (NLP): A chatbot's ability to understand and interpret human language is crucial for effective communication. NLP is the foundation of any chatbot, and it involves training the AI model to recognize patterns in language, interpret meaning, and generate responses.
2. Conversational Flow: A chatbot's conversational flow refers to the way it interacts with users. A well-designed conversational flow should be intuitive, easy to follow, and adaptable to different user scenarios. This involves creating a dialogue flowchart that guides the conversation and ensures the chatbot responds appropriately to user inputs.
3. Domain Knowledge: A chat


In [21]:
# Generate Text using fine-tuned model
query = "What do you think is the most important part of building an AI chatbot?"
text_gen = pipeline(task="text-generation", model=new_model_name, tokenizer=llama_tokenizer, max_length=200)
output = text_gen(f"<s>[INST] {query} [/INST]")
print(output[0]['generated_text'])

Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.80s/it]


<s>[INST] What do you think is the most important part of building an AI chatbot? [/INST] The most important part of building an AI chatbot is to ensure that it is able to understand and respond to user input in a way that is both accurate and natural-sounding. This requires a combination of natural language processing (NLP) capabilities and a well-designed conversational flow.

Here are some key factors to consider when building an AI chatbot:

1. Natural Language Processing (NLP): The chatbot must be able to understand and interpret user input, including both text and voice commands. This requires a robust NLP engine that can handle a wide range of language and dialects.
2. Conversational Flow: The chatbot must be able to respond to user input in a way that is both natural and intuitive. This requires a well-designed conversational flow that can handle a wide range


You can observe the outputs of the two models based on a given query. These outputs exhibit slight
differences due to the fine-tuning process altering the model weights.
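
The generated text includes the prompt itself, which makes side-by-side comparison harder to read. The following sketch builds the same `[INST]` prompt used above and strips it from the output. Both helper names are hypothetical, and the sample string is a shortened stand-in for real model output.

```python
def build_prompt(query):
    """Wrap a query in the instruction format used in this post."""
    return f"<s>[INST] {query} [/INST]"


def extract_response(generated_text, query):
    """Remove the leading prompt from generated text, if present."""
    prompt = build_prompt(query)
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


query = "What do you think is the most important part of building an AI chatbot?"
# Shortened stand-in for output[0]['generated_text']
sample = build_prompt(query) + "  There are several important aspects to consider."
print(extract_response(sample, query))
```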