# Fine-tune Llama model with LoRA: Customizing a large language model for question-answering 

**Author:** {ref}Sean Song <sean-song>, AI Software Solutions\
**Read time:** 10 minutes\
**Last edited:** 4 Jan 2024

In this blog, we show you how to fine-tune the Llama model on an AMD GPU with ROCm. We use Low-Rank Adaptation of Large Language Models (LoRA) to overcome memory and computing limitations and make open-source large language models (LLMs) more accessible. We also show you how to fine-tune and upload models to Hugging Face.

## Introduction<a class="anchor" id="Introduction"></a>
In the dynamic realm of Generative AI (GenAI), fine-tuning LLMs (such as Llama 2) is challenging because of its substantial computational and memory requirements. LoRA offers a compelling solution: by training only a small set of added parameters, it makes fine-tuning state-of-the-art LLMs faster and less costly.

To explore the benefits of LoRA, we provide a comprehensive walkthrough of the fine-tuning process for Llama 2 using LoRA specifically tailored for question-answering (QA) tasks on an AMD GPU.

Before jumping in, let's take a moment to briefly review the three pivotal components that form the foundation of our discussion:

- Llama 2: Meta's advanced language model with variants that scale up to 70 billion parameters.
- Fine-tuning: A crucial process that refines LLMs for specialized tasks, optimizing their performance.
- LoRA: The algorithm employed for fine-tuning Llama 2, ensuring effective adaptation to specialized tasks.

### Llama 2<a class="anchor" id="Understanding_Llama_2"></a>
[Llama 2: Open Foundation and Fine-Tuned Chat Models](https://arxiv.org/abs/2307.09288) is a collection of second-generation, open-source LLMs from Meta; it comes with a commercial license. Llama 2 is designed to handle a wide range of natural language processing (NLP) tasks, with models ranging in scale from 7 billion to 70 billion parameters.

Llama 2 Chat, which is optimized for dialogue, has shown similar performance to popular closed-source models like ChatGPT and PaLM. You can improve the performance of this model by fine-tuning it with a high-quality conversational data set. In this blog post, we delve into the process of refining a Llama 2 Chat model using a QA data set.

### Fine-tuning a model<a class="anchor" id="Model_Fine_Tuning"></a>
Fine-tuning in machine learning is the process of adjusting the weights and parameters of a pre-trained model using new data in order to improve its performance on a specific task. It involves using a new data set--one that is specific to the current task--to update the model's weights. It's typically not possible to fine-tune LLMs on consumer hardware due to inadequate memory and computing power. However, in this tutorial, we use LoRA to overcome these challenges.

### LoRA<a class="anchor" id="Lora"></a>
[LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) is an innovative technique--developed by researchers at Microsoft--designed to address the challenges of fine-tuning LLMs. Instead of updating all of the pre-trained weights, LoRA freezes them and trains small low-rank matrices that are injected into the model's layers. This results in a significant reduction in the number of parameters (by a factor of up to 10,000) that need to be fine-tuned, which significantly reduces GPU memory requirements. To learn more about the fundamental principles of LoRA, refer to [Using LoRA for efficient fine-tuning: Fundamental principles](../lora-fundamentals/README.md).
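
To make the low-rank idea more concrete, here is a minimal pure-Python sketch. It is not the `peft` implementation. It follows the formulation in the LoRA paper, `h = W0 x + (alpha / r) * B A x`, where `W0` is the frozen weight and only the small matrices `A` and `B` would be trained. The helper names (`matmul`, `lora_forward`) and the toy 3x3 values are assumptions made for illustration.

```python
# Toy LoRA forward pass (illustrative only; helper names are hypothetical).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(w0, lora_a, lora_b, x, alpha, r):
    """Return W0 x + (alpha / r) * B (A x) for a column vector x."""
    scale = alpha / r
    base = matmul(w0, x)                        # frozen path, shape d x 1
    update = matmul(lora_b, matmul(lora_a, x))  # low-rank path, shape d x 1
    return [[b[0] + scale * u[0]] for b, u in zip(base, update)]

# Frozen 3x3 weight (identity, for readability) and a rank-1 adapter.
w0 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
lora_a = [[1, 0, 2]]        # shape r x k = 1 x 3
lora_b = [[1], [0], [1]]    # shape d x r = 3 x 1
x = [[1], [2], [3]]

h = lora_forward(w0, lora_a, lora_b, x, alpha=2, r=1)
print(h)  # [[15.0], [2.0], [17.0]]

# Parameters in the full matrix versus the two adapter matrices.
full_params = len(w0) * len(w0[0])                                         # d * k
lora_params = len(lora_a) * len(lora_a[0]) + len(lora_b) * len(lora_b[0])  # r * (d + k)
print(full_params, lora_params)  # 9 6
```

At this toy size the saving is small, but the adapter grows as `r * (d + k)` while the full matrix grows as `d * k`, so the gap widens quickly when `d` and `k` are in the thousands.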

## Step-by-step Llama 2 fine-tuning<a class="anchor" id="Step-By-Step-Guide"></a>
Standard (full-parameter) fine-tuning updates every parameter in the model. In addition to the weights, it must keep gradients and optimizer states in memory, so the resulting footprint is typically about four times larger than the model itself. For example, loading a 7 billion parameter model (e.g. Llama 2) in FP32 (4 bytes per parameter) requires approximately 28 GB of GPU memory, while fine-tuning demands around 28 × 4 = 112 GB of GPU memory. Note that the 112 GB figure is derived empirically, and various factors like batch size, data precision, and gradient accumulation contribute to overall memory usage.
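
The back-of-the-envelope estimate above can be reproduced in a few lines of Python. This is only a sketch: the 4x training multiplier is the empirical rule of thumb quoted in the text, not a guarantee, and `estimate_memory_gb` is a hypothetical helper name.

```python
def estimate_memory_gb(num_params, bytes_per_param=4, training_multiplier=4):
    """Return (load_gb, train_gb) in decimal gigabytes (1 GB = 1e9 bytes).

    training_multiplier is the rough empirical factor from the text; real usage
    also depends on batch size, precision, and gradient accumulation.
    """
    load_gb = num_params * bytes_per_param / 1e9
    train_gb = load_gb * training_multiplier
    return load_gb, train_gb

# 7 billion parameters stored in FP32 (4 bytes each).
load_gb, train_gb = estimate_memory_gb(7e9)
print(f"load: {load_gb:.0f} GB, full fine-tuning: {train_gb:.0f} GB")
# load: 28 GB, full fine-tuning: 112 GB
```

Passing `bytes_per_param=2` gives the corresponding estimate for FP16 or BF16 weights.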

To overcome this memory limitation, you can use a parameter-efficient fine-tuning (PEFT) technique, such as LoRA.

This example uses two GCDs (Graphics Compute Dies) of an AMD MI250 GPU, each equipped with 64 GB of VRAM. This setup allows us to explore different settings for fine-tuning the Llama 2 7B weights with and without LoRA.

Our setup:
* Hardware: AMD Instinct MI250
* Software:
  * [ROCm 6.1.0+](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/)
  * [PyTorch 2.0.1+](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/3rd-party/pytorch-install.html#installing-pytorch-for-rocm)
* Libraries: `transformers`, `accelerate`, `peft`, `trl`, `bitsandbytes`, `scipy`

In this blog, we conducted our experiment using a single MI250 GPU with the Docker image [rocm/pytorch:rocm6.1.2_ubuntu22.04_py3.10_pytorch_release-2.1.2](https://hub.docker.com/layers/rocm/pytorch/rocm6.1.2_ubuntu22.04_py3.10_pytorch_release-2.1.2/images/sha256-c8b4e8dfcc64e9bf68bf1b38a16fbc5d65b653ec600f98d3290f66e16c8b6078?context=explore).

### Step 1: Getting started

First, let's confirm the availability of the GPU.

In [1]:
!rocm-smi --showproductname



GPU[0]		: Card Series: 		AMD Instinct MI300X OAM
GPU[0]		: Card Model: 		0x74a1
GPU[0]		: Card Vendor: 		Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]		: Card SKU: 		MI3SRIOV
GPU[0]		: Subsystem ID: 	0x74a1
GPU[0]		: Device Rev: 		0x00
GPU[0]		: Node ID: 		2
GPU[0]		: GUID: 		47056
GPU[0]		: GFX Version: 		gfx942


Next, install the required libraries.

In [1]:
!pip install -q pandas peft==0.9.0 transformers==4.31.0 trl==0.4.7 accelerate scipy


#### Import the required packages

In [2]:
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    pipeline
)
from peft import LoraConfig
from trl import SFTTrainer

### Step 2: Configuring the model and data

You can access Meta's official Llama models from Hugging Face after making a request, which can
take a couple of days. Instead of waiting, we'll use NousResearch's Llama-2-7b-chat-hf as our base
model (it's the same as the original, but quicker to access).

In [3]:
# Model and tokenizer names
base_model_name = "NousResearch/Llama-2-7b-chat-hf"
new_model_name = "Llama-2-7b-chat-hf-enhanced"  # You can choose your own name for the fine-tuned model

# Tokenizer
llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"  

# Model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto"
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

After you have the base model, you can start fine-tuning. We fine-tune our base model for a
question-and-answer task using a small data set called
[mlabonne/guanaco-llama2-1k](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k), which
is a subset (1,000 samples) of the `timdettmers/openassistant-guanaco` data set, itself drawn from
[OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1).
The OpenAssistant corpus is a human-generated, human-annotated, assistant-style conversation corpus that
contains 161,443 messages in 35 different languages, annotated with 461,292 quality ratings. It
comprises over 10,000 fully annotated conversation trees.

In [4]:
# Dataset
data_name = "mlabonne/guanaco-llama2-1k"
training_data = load_dataset(data_name, split="train")
# check the data
print(training_data.shape)
# #11 is a QA sample in English
print(training_data[11])

(1000, 1)
{'text': '<s>[INST] write me a 1000 words essay about deez nuts. [/INST] The Deez Nuts meme first gained popularity in 2015 on the social media platform Vine. The video featured a young man named Rodney Bullard, who recorded himself asking people if they had heard of a particular rapper. When they responded that they had not, he would respond with the phrase "Deez Nuts" and film their reactions. The video quickly went viral, and the phrase became a popular meme. \n\nSince then, Deez Nuts has been used in a variety of contexts to interrupt conversations, derail discussions, or simply add humor to a situation. It has been used in internet memes, in popular music, and even in politics. In the 2016 US presidential election, a 15-year-old boy named Brady Olson registered as an independent candidate under the name Deez Nuts. He gained some traction in the polls and even made appearances on national news programs.\n\nThe Deez Nuts meme has had a significant impact on popular culture
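
The sample above follows the Llama 2 chat template, with the user turn wrapped in `[INST] ... [/INST]`. If you want to fine-tune on your own QA pairs, a helper along the following lines could produce records in the same shape. This is a sketch under assumptions: `format_llama2_sample` is a hypothetical name, and the closing `</s>` token is assumed from the Llama 2 chat convention, because the printed sample is truncated before its end.

```python
def format_llama2_sample(question: str, answer: str) -> dict:
    """Wrap one QA pair in a single-turn Llama 2 chat layout.

    The trailing ' </s>' is an assumption (the sample shown above is cut off
    before its end). The 'text' key mirrors the column name in the data set.
    """
    text = f"<s>[INST] {question.strip()} [/INST] {answer.strip()} </s>"
    return {"text": text}

sample = format_llama2_sample(
    "What is ROCm?",
    "ROCm is AMD's open software stack for GPU computing.",
)
print(sample["text"])
```

Keeping the formatted string under a `text` key means such records can sit alongside the existing data set column without further changes.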

In [6]:
## There is a dependency during training
# !pip install tensorboardX

### Step 3: Start fine-tuning

To set your training parameters, use the following code:

In [5]:
# Training Params
train_params = TrainingArguments(
    output_dir="./results_modified",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    save_steps=50,
    logging_steps=50,
    learning_rate=4e-5,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant"
) 
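
As a quick sanity check on these settings, you can estimate how many optimizer steps one epoch takes. The sketch below is an approximation, not the `transformers` implementation: `optimizer_steps_per_epoch` is a hypothetical helper, and it assumes a single data-parallel replica (`num_devices=1`).

```python
import math

def optimizer_steps_per_epoch(num_samples, per_device_batch_size,
                              grad_accum_steps=1, num_devices=1):
    """Approximate optimizer steps per epoch.

    num_devices is the number of data-parallel replicas (assumed to be 1 here).
    """
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return math.ceil(num_samples / effective_batch)

# 1,000 training samples, per_device_train_batch_size=4, gradient_accumulation_steps=1
steps = optimizer_steps_per_epoch(1000, 4, 1)
print(steps)  # 250
```

With `save_steps=50` and `logging_steps=50`, this corresponds to five logged loss values and five checkpoints per epoch.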

#### Training with LoRA configuration<a class="anchor" id="Training_with_LoRA_configuration"></a>

Now you can integrate LoRA into the base model and assess its additional parameters. LoRA essentially adds pairs of rank-decomposition weight matrices (called update matrices) to existing weights, and only trains the newly added weights.

In [8]:
from peft import get_peft_model
# LoRA Config
peft_parameters = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.1,
    r=8,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, peft_parameters)
model.print_trainable_parameters()

trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199


Note that the parameters added by LoRA amount to only about 0.062% of the total, which is a tiny fraction of the original model. These are the only parameters we'll update through fine-tuning, as follows.
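
The trainable-parameter count can be reproduced by hand. The sketch below rests on assumptions that are not stated in this post: that Llama 2 7B has 32 decoder layers with a hidden size of 4096, and that, with no `target_modules` given, `peft` adapts the query and value projections (two square matrices per layer). `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(hidden_size, num_layers, rank, modules_per_layer):
    """Count LoRA parameters for square (hidden_size x hidden_size) projections."""
    # Each adapted projection gets A (rank x hidden) and B (hidden x rank).
    per_module = rank * (hidden_size + hidden_size)
    return num_layers * modules_per_layer * per_module

# Assumed architecture values for Llama 2 7B; rank r=8 comes from the LoraConfig above.
lora_params = lora_param_count(hidden_size=4096, num_layers=32, rank=8, modules_per_layer=2)
base_params = 6_738_415_616  # base model parameter count reported later in this post
total = base_params + lora_params

print(f"{lora_params:,} || {total:,} || {100 * lora_params / total:.4f}%")
# 4,194,304 || 6,742,609,920 || 0.0622%
```

Under these assumptions the result agrees with the output of `print_trainable_parameters()`. Doubling `r` would double the number of trainable parameters.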

In [9]:
# Trainer with LoRA configuration
fine_tuning = SFTTrainer(
    model=base_model,
    train_dataset=training_data,
    peft_config=peft_parameters,
    dataset_text_field="text",
    tokenizer=llama_tokenizer,
    args=train_params
)

# Training
fine_tuning.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 50   | 1.9834        |
| 100  | 1.6183        |
| 150  | 1.4129        |
| 200  | 1.3955        |
| 250  | 1.3791        |




TrainOutput(global_step=250, training_loss=1.5578253479003907, metrics={'train_runtime': 196.455, 'train_samples_per_second': 5.09, 'train_steps_per_second': 1.273, 'total_flos': 1.701064079130624e+16, 'train_loss': 1.5578253479003907, 'epoch': 1.0})

To save your model, run this code:

In [10]:
# Save Model
fine_tuning.model.save_pretrained(new_model_name)

##### Checking memory usage during training with LoRA<a class="anchor" id="Checking_memory_usage_during_training_with_LoRA"></a>
During training, you can check the memory usage by running the `rocm-smi` command in a terminal.
This command produces the following output:

To facilitate a comparison between fine-tuning with and without LoRA, our subsequent phase involves running a thorough fine-tuning process on the base model. This involves updating all parameters within the base model. We then analyze differences in memory usage, training speed, training loss, and other relevant metrics.

#### Training without LoRA configuration<a class="anchor" id="Training_without_LoRA_configuration"></a>

***To run this section, you need to restart the kernel and skip the [Training with LoRA configuration](#Training_with_LoRA_configuration) section.***

For a direct comparison between the two approaches, we keep the settings in `train_params` unchanged for full-parameter fine-tuning, apart from the lower learning rate set below.

To check the trainable parameters in your base model, use the following code.

In [6]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
    )

print_trainable_parameters(base_model)

trainable params: 6738415616 || all params: 6738415616 || trainable%: 100.00


In [7]:
# Set a lower learning rate for full-parameter fine-tuning
train_params.learning_rate = 4e-7
print(train_params.learning_rate)

4e-07


In [8]:
# Trainer without LoRA configuration

fine_tuning_full = SFTTrainer(
    model=base_model,
    train_dataset=training_data,
    dataset_text_field="text",
    tokenizer=llama_tokenizer,
    args=train_params
)

# Training
fine_tuning_full.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 50   | 1.7211        |
| 100  | 1.491         |
| 150  | 1.3664        |
| 200  | 1.3742        |
| 250  | 1.3698        |


TrainOutput(global_step=250, training_loss=1.4645071716308593, metrics={'train_runtime': 683.314, 'train_samples_per_second': 1.463, 'train_steps_per_second': 0.366, 'total_flos': 1.6999849383985152e+16, 'train_loss': 1.4645071716308593, 'epoch': 1.0})

##### Checking memory usage during training without LoRA<a class="anchor" id="Checking_memory_usage_during_training_without_LoRA"></a>
During training, you can check the memory usage by running the `rocm-smi` command in a terminal.
This command produces the following output:

### Step 4: Comparison between fine-tuning with LoRA and full-parameter fine-tuning
<a class="anchor" id="Comparison"></a>
Comparing the results from the [Training with LoRA configuration](#Training_with_LoRA_configuration) and [Training without LoRA configuration](#Training_without_LoRA_configuration) sections, note the following:
- Memory usage: 
    - In the case of full-parameter fine-tuning, there are **6,738,415,616** trainable parameters, leading to significant memory consumption during the training backpropagation stage.
    - In contrast, LoRA introduces only **4,194,304** trainable parameters, accounting for a mere **0.06%** of the total trainable parameters in full-parameter fine-tuning.
    - Monitoring memory usage during training with and without LoRA reveals that fine-tuning with LoRA
        uses only **57%** of the memory consumed by full-parameter fine-tuning. This presents an
        opportunity to increase the batch size and max sequence length and train on larger data sets using
        limited hardware resources.
- Training speed: 
    - The results demonstrate that full-parameter fine-tuning takes about **11 minutes** to complete, while
        fine-tuning with LoRA finishes in around **3 minutes**.
    - Several factors contribute to this acceleration:
        - Fewer trainable parameters in LoRA translate to fewer derivative calculations and less memory
        required to store and update weights.
        - Full-parameter fine-tuning is more prone to being memory-bound, where data movement
        becomes a bottleneck for training. This is reflected in lower GPU utilization. Although adjusting
        training settings can alleviate this, it may require more resources (additional GPUs) and a smaller
        batch size.
- Accuracy: 
    - In both training sessions, a notable reduction in training loss was observed. We achieved a closely
        aligned training loss for both approaches: **1.370** for full-parameter fine-tuning and
        **1.379** for fine-tuning with LoRA. If you're interested in understanding the impact of LoRA on
        fine-tuning performance, refer to
        [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685).
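
The speed comparison can be derived directly from the `TrainOutput` metrics logged by the two runs. The following sketch simply copies the `train_runtime` and `train_samples_per_second` values from those outputs; the dictionary names are illustrative.

```python
# Metrics copied from the two TrainOutput logs in this post.
lora = {"train_runtime": 196.455, "train_samples_per_second": 5.09}
full = {"train_runtime": 683.314, "train_samples_per_second": 1.463}

speedup = full["train_runtime"] / lora["train_runtime"]
throughput_ratio = lora["train_samples_per_second"] / full["train_samples_per_second"]

print(f"LoRA: {lora['train_runtime'] / 60:.1f} min")
print(f"Full: {full['train_runtime'] / 60:.1f} min")
print(f"Speedup: {speedup:.2f}x (throughput ratio {throughput_ratio:.2f}x)")
```

Both ratios come out at roughly 3.5x in favor of LoRA for this single-epoch run. The figure is specific to this hardware, batch size, and data set.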

### Step 5: Test the fine-tuned model with LoRA
<a class="anchor" id="Test_the_model"></a>

To test your model, run the following code:


In [13]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
from peft import LoraConfig, PeftModel
model = PeftModel.from_pretrained(base_model, new_model_name)
model_enhanced = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
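
Conceptually, merging folds the scaled low-rank product into the base weight, `W = W0 + (alpha / r) * B A`, so inference no longer needs a separate adapter path. The pure-Python sketch below checks this equivalence on toy matrices. It illustrates the arithmetic only, not how `peft` implements `merge_and_unload()`, and all helper names and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w0, lora_a, lora_b, alpha, r):
    """Return the merged weight W0 + (alpha / r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

# Toy 3x3 base weight with a rank-2 adapter.
w0 = [[1, 2, 0], [0, 1, 0], [3, 0, 1]]
lora_a = [[1, 0, 0], [0, 1, 0]]      # shape r x k = 2 x 3
lora_b = [[1, 0], [0, 1], [1, 1]]    # shape d x r = 3 x 2
x = [[1], [1], [2]]
alpha, r = 4, 2

# Path 1: merge first, then apply a single matrix.
merged_out = matmul(merge_lora(w0, lora_a, lora_b, alpha, r), x)

# Path 2: keep the adapter separate and add its scaled output.
scale = alpha / r
base = matmul(w0, x)
update = matmul(lora_b, matmul(lora_a, x))
adapter_out = [[b[0] + scale * u[0]] for b, u in zip(base, update)]

print(merged_out)                 # [[5.0], [3.0], [9.0]]
print(merged_out == adapter_out)  # True
```

Because the two paths give the same result, a merged model can be used like an ordinary checkpoint without carrying the adapter separately.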

In [14]:
from transformers import GenerationConfig

def inference(llm_model, tokenizer, user_input):
    """Generate a response to user_input with llm_model and print it."""
    def formatted_prompt(question) -> str:
        return f"<|start|>user\n{question}<|end|>\n<|start|>assistant:"

    prompt = formatted_prompt(user_input)
    generation_config = GenerationConfig(
        penalty_alpha=0.6, do_sample=True, top_k=3, temperature=0.5,
        repetition_penalty=1.2, max_new_tokens=200,
        pad_token_id=tokenizer.eos_token_id
    )
    # Tokenize once and move the tensors to the GPU (PyTorch on ROCm addresses the device as 'cuda')
    inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
    outputs = llm_model.generate(**inputs, generation_config=generation_config)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Now we can test with the base model (original) and the fine-tuned model.

In [15]:
inference(base_model, tokenizer, user_input='What do you think is the most important part of building an AI chatbot?')

<|start|>user
What do you think is the most important part of building an AI chatbot?<|end|>
<|start|>assistant: That's a great question! I would say that understanding user needs and intentions are crucial for creating successful chatbots. It can be challenging to identify what users want from their interactions with your bot, but it’s essential to get this right if you want to build something truly useful and engaging. There are many different approaches to identifying user intents in natural language processing (NLP), so there isn’t one single best way to go about doing things here either – instead we need look at all available methods before deciding which approach might work best given specific context/problem set(s).<|end|> 


In [16]:
inference(model_enhanced, tokenizer, user_input='What do you think is the most important part of building an AI chatbot?')

<|start|>user
What do you think is the most important part of building an AI chatbot?<|end|>
<|start|>assistant: I believe that having a good understanding of human psychology and behavioral patterns is essential. This allows us to create more realistic responses, which in turn makes for a better user experience. Additionally, we need to ensure that our system can handle complex queries with ease, as this will help prevent frustration on behalf of users who may have questions or issues they want assistance with. Lastly, it’s crucial that our technology remains up-to-date so as not only does everything run smoothly but also because new developments mean there are always ways to improve upon what already exists – especially when considering how rapidly things change within both fields!<|end|>


You can compare the outputs of the two models for the same query. The outputs differ because
fine-tuning altered the model weights. Note that generation also uses sampling (`do_sample=True`),
so some variation between runs is expected as well.