Fine-Tune a Large Language Model (LLM) on a Custom Dataset with QLoRA

Large language models (LLMs) have reshaped the field of natural language processing. Trained on extensive text datasets, these models excel at tasks like text generation, translation, summarization, and question answering. Despite their power, general-purpose LLMs are not always well aligned with a specific task or domain.

In this tutorial, we will explore how fine-tuning an LLM can improve its performance on a target task, cost far less than training a model from scratch, and produce more accurate, context-specific results.

What is LLM Fine-tuning?
Fine-tuning an LLM means further training a pre-trained model, which has already learned patterns and features from an extensive dataset, on a smaller, domain-specific dataset. Here, LLM stands for “Large Language Model,” such as the GPT series by OpenAI. This approach matters because training a large language model from scratch is highly resource-intensive in both computational power and time. Reusing the knowledge embedded in the pre-trained model makes it possible to reach high performance on specific tasks with substantially less data and compute.

Below are some of the key steps involved in LLM Fine-tuning:

Select a pre-trained model: The first step is to carefully select a base pre-trained model that matches the architecture and capabilities we need. Pre-trained models are general-purpose models that have been trained on a large corpus of unlabeled data.
Gather a relevant dataset: Next, we gather a dataset that is relevant to our task. The dataset should be labeled or structured in a way that the model can learn from.
Preprocess the dataset: Once the dataset is ready, we preprocess it for fine-tuning by cleaning it, splitting it into training, validation, and test sets, and making sure its format is compatible with the model we want to fine-tune.
Fine-tuning: We then fine-tune the selected pre-trained model on the preprocessed dataset, which is more specific to the task at hand. The dataset may relate to a particular domain or application, allowing the model to adapt and specialize for that context.
Task-specific adaptation: During fine-tuning, the model’s parameters are adjusted based on the new dataset, helping it better understand and generate content relevant to the specific task. This process retains the general language knowledge gained during pre-training while tailoring the model to the nuances of the target domain.
Fine-tuning LLMs is commonly used in natural language processing tasks such as sentiment analysis, named entity recognition, summarization, translation, or any other application where understanding context and generating coherent language is crucial. It helps leverage the knowledge encoded in pre-trained models for more specialized and domain-specific tasks.
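The preprocessing step above mentions splitting the data into training, validation, and test sets. A minimal sketch of such a split, using only the Python standard library, might look like the following. The function name `split_dataset`, the 80/10/10 ratios, and the toy records are illustrative assumptions, not part of any library.

```python
import random

def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle records and split them into train/validation/test lists.

    Hypothetical helper for illustration; the ratios are assumptions.
    """
    shuffled = list(records)                # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains goes to the test set
    return train, val, test

# Toy records shaped like a summarization dataset (made-up data).
records = [{"dialogue": f"dialogue {i}", "summary": f"summary {i}"} for i in range(100)]
train, val, test = split_dataset(records)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the split stable between runs, which matters when results from different fine-tuning experiments are compared on the same test set.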

Fine-tuning methods
Fine-tuning a Large Language Model (LLM) involves a supervised learning process. In this method, a dataset comprising labeled examples is utilized to adjust the model’s weights, enhancing its proficiency in specific tasks. Now, let’s delve into some noteworthy techniques employed in the fine-tuning process.

Full Fine-Tuning (Instruction fine-tuning): Instruction fine-tuning is a strategy to enhance a model’s performance across various tasks by training it on examples that show how it should respond to queries. The choice of dataset is crucial and is tailored to the specific task, such as summarization or translation. When this training updates all of the model’s weights, it is known as full fine-tuning and produces a new version of the model with improved capabilities. However, like pre-training, it demands enough memory and compute to store and process the gradients, optimizer states, and other components used during training.
Parameter-Efficient Fine-Tuning (PEFT): PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning. Training a language model, especially with full fine-tuning, demands significant computational resources. Memory is needed not only to store the model but also for the gradients and optimizer states kept during training, which is a challenge for modest hardware. PEFT addresses this by updating only a small subset of parameters (or a small set of added parameters) and “freezing” the rest. This reduces the number of trainable parameters, making memory requirements more manageable and reducing the risk of catastrophic forgetting. Because PEFT leaves the original LLM weights unchanged, previously learned information is preserved. It also eases storage when fine-tuning for multiple tasks, since only the small set of trained parameters is saved per task. There are various ways of achieving parameter-efficient fine-tuning; Low-Rank Adaptation (LoRA) and QLoRA are among the most widely used and effective.



What is LoRA (Low-Rank Adaptation)?
LoRA, or Low-Rank Adaptation, is an advanced method for fine-tuning large language models (LLMs). Traditional fine-tuning methods require updating the entire set of model parameters, which can be computationally expensive and memory-intensive, especially when dealing with massive models. LoRA addresses these challenges by introducing a more efficient approach that significantly reduces the number of trainable parameters during the fine-tuning process.

Core Concept of LoRA
LoRA is built upon the idea of approximating the update to the large weight matrices of a pre-trained LLM using two smaller matrices. Instead of fine-tuning all the parameters in the original model, LoRA focuses on fine-tuning these smaller matrices, which are collectively known as the LoRA adapter. This approach maintains the integrity of the original model while allowing for specialized adaptations to specific tasks or domains.

How LoRA Works:
Weight Matrix Decomposition:

In a neural network, each layer has weight matrices that are responsible for transforming input data into output features. Typically, these weight matrices are large and dense, making them expensive to train.
LoRA keeps these large weight matrices frozen and learns a low-rank update to them. Given a weight matrix $W$, LoRA represents the adapted weights as:
$W' = W + \Delta W$

where $W$ is the original pre-trained weight matrix, and $\Delta W$ is the adaptation matrix that needs to be fine-tuned.
Instead of directly training $\Delta W$, LoRA decomposes it into two smaller matrices $A$ and $B$, such that:
$\Delta W = A \times B$

Here, $A$ and $B$ have much lower dimensions than the original matrix $W$, reducing the number of trainable parameters. If $W$ has shape $d \times k$, then $A$ has shape $d \times r$ and $B$ has shape $r \times k$, where the rank $r$ is much smaller than $d$ and $k$.
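To see how much the decomposition saves, the following NumPy sketch builds a frozen weight matrix and a pair of low-rank factors and compares parameter counts. The shapes (1024 x 1024, rank 8) and the random values are assumptions chosen for illustration, and starting one factor at zero is just one common way to make the update begin at zero.

```python
import numpy as np

d, k, r = 1024, 1024, 8                  # assumed layer shape and LoRA rank

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))          # frozen pre-trained weight (stand-in values)
A = rng.standard_normal((d, r)) * 0.01   # trainable factor, shape d x r
B = np.zeros((r, k))                     # trainable factor, shape r x k; zeros so the update starts at 0

delta_W = A @ B                          # low-rank update with the same shape as W
W_adapted = W + delta_W                  # W' = W + delta_W

full_params = W.size                     # parameters if W itself were trained
lora_params = A.size + B.size            # parameters LoRA actually trains
print(full_params, lora_params, lora_params / full_params)
```

With these assumed sizes the adapter holds 16,384 trainable values against 1,048,576 in the full matrix, roughly 1.6 percent.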
Fine-Tuning the LoRA Adapter:

During the fine-tuning process, only the matrices $A$ and $B$ are trained, while the original weight matrix $W$ remains unchanged. This process allows the model to adapt to new tasks or domains without the need to adjust the entire model's parameters.
The LoRA adapter effectively captures the specific knowledge required for the new task while keeping the general knowledge encoded in the original model intact.
Inference with LoRA:

After fine-tuning, the original LLM and the LoRA adapter are combined during inference. The adapter's matrices $A$ and $B$ are applied to the original weight matrix $W$, allowing the model to utilize the task-specific knowledge encoded in the adapter.
The combination of the original LLM and the LoRA adapter enables the model to perform specialized tasks without requiring a complete retraining of the entire model.
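The combination at inference time can be done in two equivalent ways: keep the adapter separate and add its contribution on the fly, or merge it into the base weight once. The sketch below checks that equivalence on toy data; the dimensions and the random stand-ins for trained factors are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 32, 4                   # assumed toy dimensions
W = rng.standard_normal((d, k))       # frozen base weight
A = rng.standard_normal((d, r))       # "trained" adapter factors (random stand-ins)
B = rng.standard_normal((r, k))
x = rng.standard_normal((5, d))       # batch of 5 input vectors

# Option 1: keep the adapter separate and add its contribution at run time.
y_separate = x @ W + (x @ A) @ B

# Option 2: merge the adapter into the base weight once, then use a single matmul.
W_merged = W + A @ B
y_merged = x @ W_merged

print(np.allclose(y_separate, y_merged))
```

Keeping the adapter separate makes it easy to swap adapters, while merging avoids the extra matrix multiplications at inference time.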
Advantages of LoRA:
Reduced Memory and Computational Requirements:

Since LoRA trains only a small number of parameters compared with the original model, it requires significantly less memory and computational power. This makes it feasible to fine-tune large models on standard hardware, such as GPUs with limited memory.
The size of the LoRA adapter is often a small fraction of the original LLM size, typically in the range of megabytes (MBs) rather than gigabytes (GBs). This reduction in size makes LoRA particularly useful in scenarios where storage and memory resources are constrained.
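A rough back-of-the-envelope calculation shows why adapters land in the megabyte range. Every number below (hidden size, layer count, number of adapted matrices, rank, 16-bit storage) is an assumption picked for illustration rather than a measurement of any particular model.

```python
hidden = 2560             # assumed hidden size
n_layers = 32             # assumed number of transformer layers
matrices_per_layer = 4    # assumed: adapters on four hidden x hidden projections
rank = 16                 # assumed LoRA rank
bytes_per_param = 2       # 16-bit storage

# Each adapted matrix adds rank * (rows + cols) trainable parameters.
adapter_params = n_layers * matrices_per_layer * rank * (hidden + hidden)
adapter_mib = adapter_params * bytes_per_param / 1024**2

base_params = 2_700_000_000   # a 2.7-billion-parameter base model
base_gib = base_params * bytes_per_param / 1024**3

print(f"adapter: {adapter_params:,} params, {adapter_mib:.1f} MiB")
print(f"base model: {base_gib:.2f} GiB")
```

Under these assumptions the adapter is about 20 MiB, while the 16-bit base model needs about 5 GiB for its weights alone.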
Reusability Across Multiple Tasks:

One of the most significant benefits of LoRA is the ability to create multiple LoRA adapters for different tasks while using the same base LLM. This means that instead of maintaining multiple copies of large fine-tuned models, we can store and load lightweight LoRA adapters as needed.
For instance, a single LLM can be fine-tuned with different LoRA adapters for tasks like sentiment analysis, translation, and summarization. During inference, the appropriate adapter is loaded into the LLM, enabling task-specific performance without the need to store and deploy separate models for each task.
Task-Specific Specialization:

LoRA allows the LLM to specialize in specific tasks or domains while retaining its general language understanding capabilities. This is particularly beneficial when adapting a general-purpose LLM to a niche domain, such as legal text analysis, medical document processing, or technical content generation.
The fine-tuning process ensures that the model becomes proficient in the target task without compromising the broader knowledge it acquired during the initial pre-training phase.
Efficient Handling of Multi-Tasking:

By leveraging LoRA adapters, developers can efficiently handle multiple tasks with a single LLM. This approach reduces overall memory requirements and streamlines the deployment of specialized models across various applications.
The flexibility to switch between different tasks by loading different adapters into the same base model enhances productivity and simplifies model management.
Practical Example:
Suppose we have a pre-trained GPT-based model that we want to fine-tune for two different tasks: legal document classification and customer service chatbot responses. Using LoRA, we can create two separate adapters:

The first adapter is fine-tuned on a dataset of legal documents to classify them by type (e.g., contracts, wills, patents).
The second adapter is fine-tuned on customer service transcripts to improve the chatbot's ability to handle user queries in a conversational manner.
During deployment, the same GPT-based model can be used for both tasks by loading the appropriate LoRA adapter. This setup allows us to maintain a single, large LLM while efficiently adapting it to diverse, task-specific requirements with minimal overhead.
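The two-adapter setup can be mimicked with a toy example: one shared base weight and a dictionary of adapters selected by task name. The task names, dimensions, and random values are hypothetical stand-ins for trained adapters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 16, 2                        # assumed toy dimensions
W = rng.standard_normal((d, k))            # one shared, frozen base weight

# Hypothetical adapters keyed by task name; random values stand in for trained ones.
adapters = {
    "legal_classification": (rng.standard_normal((d, r)), rng.standard_normal((r, k))),
    "customer_service": (rng.standard_normal((d, r)), rng.standard_normal((r, k))),
}

def forward(x, task):
    """Apply the shared base weight plus the adapter selected for `task`."""
    A, B = adapters[task]
    return x @ W + (x @ A) @ B

x = rng.standard_normal((1, d))
y_legal = forward(x, "legal_classification")
y_service = forward(x, "customer_service")

# Same base weight, different adapters, different outputs.
print(np.allclose(y_legal, y_service))
```

Only the small `(A, B)` pairs differ between tasks, so adding a third task means storing one more pair rather than another copy of `W`.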

Conclusion
LoRA represents a significant advancement in the fine-tuning of large language models, offering a more resource-efficient and scalable approach to adapting pre-trained models to specific tasks. By focusing on fine-tuning only the low-rank matrices within the LoRA adapter, developers can achieve high performance in specialized domains without the need for extensive computational resources. This methodology not only reduces the cost and complexity of fine-tuning but also enables the reuse of the base LLM across multiple tasks, making it an invaluable tool in the era of large-scale natural language processing.

What is Quantized LoRA (QLoRA)?
Quantized LoRA (QLoRA) is an extension of the Low-Rank Adaptation (LoRA) fine-tuning technique, designed to make the fine-tuning process even more memory-efficient. QLoRA achieves this by not only training low-rank adapter matrices, as in LoRA, but also by quantizing the frozen weights of the pre-trained base model to lower precision. This quantization significantly reduces the memory footprint, making it feasible to fine-tune large language models (LLMs) on more modest hardware, such as a single GPU.

Core Concept of QLoRA
The key innovation in QLoRA lies in storing the frozen base model weights at lower precision, typically 4-bit, instead of the usual 16-bit or higher, while the LoRA adapters themselves are kept and trained in higher precision. This reduction in precision decreases the memory needed to hold the model, allowing it to be fine-tuned on less hardware while maintaining performance comparable to standard LoRA.

In QLoRA:

Quantization: The frozen weights of the pre-trained model are quantized to 4-bit precision, reducing the amount of memory required to store them. The LoRA adapter matrices remain in higher precision and are the only weights updated during training, so they can still capture the necessary task-specific information.
Memory Efficiency: By loading the pre-trained model into GPU memory with quantized 4-bit weights, QLoRA can fit larger models or multiple tasks into the same memory space that would otherwise be required for higher-precision weights.
Comparable Effectiveness: Despite the reduction in bit precision, QLoRA manages to maintain performance levels that are comparable to those achieved using LoRA, making it an attractive option for scenarios where memory resources are limited.
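To make the idea of 4-bit quantization concrete, here is a toy sketch of blockwise absmax quantization to 16 levels. It is a simplified uniform scheme written for illustration, not the NF4 data type or the kernels that bitsandbytes uses; the function names and the block size of 64 are assumptions, and the codes are stored in `int8` for readability rather than packed two per byte.

```python
import numpy as np

def quantize_4bit(weights, block_size=64):
    """Toy blockwise absmax quantization to signed 4-bit integers in [-7, 7]."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0   # one scale per block
    scales[scales == 0] = 1.0                                   # avoid division by zero
    codes = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return codes, scales

def dequantize_4bit(codes, scales, shape):
    """Map the integer codes back to approximate floating-point weights."""
    return (codes * scales).reshape(shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128)).astype(np.float32)

codes, scales = quantize_4bit(W)
W_hat = dequantize_4bit(codes, scales, W.shape)

max_err = np.abs(W - W_hat).max()
print(codes.min(), codes.max(), max_err)
```

The reconstruction error is bounded by half a quantization step per block, which gives a feel for the accuracy that is traded for the smaller memory footprint.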
Detailed Steps to Fine-Tune an LLM Using QLoRA
Now, let’s explore how to fine-tune an LLM on a custom dataset using QLoRA, all on a single GPU. Below is a step-by-step guide:

1. Setting Up the Notebook
To start with QLoRA, you need to set up a development environment, typically in a Jupyter Notebook or similar platform that allows you to run Python code interactively. Ensure that your GPU is available and CUDA-compatible for optimal performance.

Install required libraries

Now, let’s install the necessary libraries for this experiment.

In [1]:
!pip install -q -U bitsandbytes transformers peft accelerate datasets scipy einops evaluate trl rouge_score

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gensim 4.3.3 requires scipy<1.14.0,>=1.7.0, but you have scipy 1.14.1 which is incompatible.


1. Bitsandbytes
Purpose: Bitsandbytes is a library designed to optimize the performance of large language models by providing custom CUDA (Compute Unified Device Architecture) kernels that accelerate key operations. It focuses on making LLMs faster and more memory-efficient by optimizing matrix multiplication, quantization, and gradient computation.

Key Features:

Quantization: One of the most powerful features of Bitsandbytes is its ability to perform 4-bit and 8-bit quantization of model weights. Quantization reduces the precision of the model's weights from floating-point (e.g., 16-bit or 32-bit) to a lower precision, significantly reducing memory usage while maintaining a high level of model accuracy.
Custom CUDA Kernels: Bitsandbytes provides highly optimized CUDA kernels for operations like matrix multiplication and optimizers, which are critical for training large models efficiently on GPUs. These kernels are specifically tailored to handle the large-scale operations involved in LLMs.
Efficient Memory Usage: By leveraging low-precision arithmetic and optimized CUDA functions, Bitsandbytes enables the loading and training of larger models on limited hardware resources, such as a single GPU, without compromising performance.
Importance in QLoRA: In QLoRA, Bitsandbytes is used to load the pre-trained model with quantized weights. This makes the entire fine-tuning process much more memory-efficient, allowing users to work with larger models or multiple tasks within the constraints of their hardware.
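The memory effect of lower-precision weights is simple arithmetic. The sketch below estimates the storage needed for the weights alone at several bit widths, assuming a 2.7-billion-parameter model; it ignores activations, gradients, optimizer states, and any per-block quantization constants, and `weight_memory_gib` is a hypothetical helper name.

```python
n_params = 2_700_000_000  # assumed model size: 2.7 billion parameters

def weight_memory_gib(n_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gib(n_params, bits):.2f} GiB")
```

Going from 16-bit to 4-bit cuts the weight storage by a factor of four, which is what lets a model of this size fit comfortably on a single consumer GPU.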

2. Transformers
Purpose: Transformers is a comprehensive library developed by Hugging Face that provides easy access to state-of-the-art pre-trained models for natural language processing (NLP). The library offers a wide range of transformer-based models, including GPT, BERT, and T5, along with tools for fine-tuning these models on specific tasks.

Key Features:

Pre-trained Models: The library hosts an extensive collection of pre-trained models that can be used out-of-the-box for various NLP tasks like text classification, translation, summarization, and more.
Model Architecture: Transformers provides implementations of several transformer-based architectures, making it easy to switch between different models or modify existing ones.
Training Utilities: It includes utilities for tokenization, model training, and evaluation, streamlining the process of adapting pre-trained models to specific datasets or tasks.
Community Support: With an active community and regular updates, Transformers stays at the forefront of NLP research, ensuring users have access to the latest advancements in the field.
Importance in QLoRA: The Transformers library is essential for loading the pre-trained models that are fine-tuned using QLoRA. It also provides the necessary tools for tokenizing data, training models, and evaluating performance, making it a cornerstone of the QLoRA fine-tuning process.

3. PEFT (Parameter-Efficient Fine-Tuning)
Purpose: The PEFT library, also developed by Hugging Face, focuses on parameter-efficient fine-tuning methods like LoRA and QLoRA. It enables the adaptation of large models to specific tasks without the need for extensive computational resources.

Key Features:

LoRA Support: PEFT directly supports Low-Rank Adaptation (LoRA), allowing users to fine-tune only a small subset of model parameters (the LoRA adapters) instead of the entire model, significantly reducing the computational cost.
Easy Integration: PEFT is designed to integrate seamlessly with the Transformers library, making it straightforward to apply LoRA or QLoRA to any model supported by Transformers.
Efficiency: By fine-tuning only a small number of parameters, PEFT enables faster training and reduces the risk of overfitting, which is particularly useful when working with limited datasets.
Importance in QLoRA: PEFT is critical for implementing the QLoRA fine-tuning process. It allows users to configure and apply LoRA adapters efficiently, ensuring that the fine-tuning process is both memory and compute-efficient.

4. Accelerate
Purpose: Accelerate is a library by Hugging Face that abstracts the complexity of scaling up model training across multiple GPUs, TPUs, or other hardware accelerators. It simplifies the process of handling distributed training, mixed precision, and other advanced training techniques.

Key Features:

Multi-GPU/TPU Support: Accelerate makes it easy to distribute the training process across multiple GPUs or TPUs, enabling the efficient training of large models on high-performance clusters.
Mixed Precision Training: It supports mixed precision training, which uses lower precision (e.g., FP16) to speed up training and reduce memory usage without sacrificing model accuracy.
Minimal Code Changes: One of the main advantages of Accelerate is that it requires minimal changes to the existing codebase, making it easy to integrate into existing projects.
Importance in QLoRA: In the context of QLoRA, Accelerate can be used to manage the distribution of training across multiple GPUs or to optimize the training process with mixed precision, ensuring that the fine-tuning process is both efficient and scalable.
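The trade-off behind mixed precision can be illustrated with plain NumPy, independent of Accelerate itself. This small sketch, with an arbitrarily chosen array size, shows that FP16 halves the memory of FP32 but can silently drop small increments, which is one reason mixed-precision setups typically keep some values in higher precision.

```python
import numpy as np

x32 = np.ones((1024, 1024), dtype=np.float32)
x16 = x32.astype(np.float16)

print(x32.nbytes, x16.nbytes)   # bytes used by each array

# FP16 has fewer mantissa bits, so a small increment to 1.0 can be rounded away.
small = np.float16(1.0) + np.float16(0.0001)
print(small == np.float16(1.0))
```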

5. Datasets
Purpose: The Datasets library, also by Hugging Face, provides easy access to a vast collection of datasets for NLP tasks. It is designed to handle large datasets efficiently, making it ideal for use with LLMs.

Key Features:

Wide Range of Datasets: Datasets provides access to thousands of datasets, covering various NLP tasks like text classification, machine translation, summarization, and more.
Efficient Data Handling: The library is optimized for performance, allowing users to load, process, and manipulate large datasets with minimal memory overhead.
Integration with Transformers: Datasets integrates seamlessly with the Transformers library, enabling easy tokenization and preparation of data for model training.
Importance in QLoRA: Datasets is crucial for loading and preprocessing the data used in QLoRA fine-tuning. It simplifies the process of preparing datasets for training, ensuring that the data pipeline is efficient and scalable.

6. Einops
Purpose: Einops is a library that simplifies the manipulation and transformation of tensors in deep learning. It provides a high-level, readable syntax for performing complex tensor operations, making it easier to work with multi-dimensional data.

Key Features:

Flexible Tensor Operations: Einops allows for easy reshaping, rearranging, and combining of tensors, which are common operations when working with deep learning models.
Readable Syntax: The library’s syntax is designed to be both intuitive and expressive, making tensor operations more understandable and reducing the likelihood of errors.
Compatibility: Einops is compatible with major deep learning frameworks like PyTorch and TensorFlow, making it a versatile tool for researchers and developers.
Importance in QLoRA: Although not specifically tied to QLoRA, Einops can be valuable when working with the tensor data structures involved in fine-tuning large models. It simplifies the process of preparing data for input into the model and manipulating outputs, which can be particularly useful in complex fine-tuning tasks.

In [2]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    Trainer,
    GenerationConfig
)

from tqdm import tqdm
from trl import SFTTrainer
import torch
import time
import pandas as pd
import numpy as np
from huggingface_hub import interpreter_login
interpreter_login()


In [3]:
import os
# disable Weights and Biases
os.environ['WANDB_DISABLED'] = "true"

In [5]:
from datasets import load_dataset

huggingface_dataset = 'neil-code/dialogsum-test'
dataset = load_dataset(huggingface_dataset)


Downloading readme: 100%|██████████| 4.56k/4.56k [00:00<00:00, 18.4kB/s]
Downloading data: 100%|██████████| 1.81M/1.81M [00:02<00:00, 898kB/s]
Downloading data: 100%|██████████| 441k/441k [00:00<00:00, 508kB/s]
Downloading data: 100%|██████████| 447k/447k [00:01<00:00, 397kB/s]
Generating train split: 100%|██████████| 1999/1999 [00:00<00:00, 20100.29 examples/s]
Generating validation split: 100%|██████████| 499/499 [00:00<00:00, 45407.28 examples/s]
Generating test split: 100%|██████████| 499/499 [00:00<00:00, 49884.59 examples/s]


Create Bitsandbytes configuration

To load the model, we need a configuration class that specifies how we want the quantization to be performed. We’ll be using BitsAndBytesConfig to load our model in 4-bit format. This will reduce memory consumption considerably, at a cost of some accuracy.

In [7]:
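compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # load the weights in 4-bit precision
    bnb_4bit_quant_type='nf4',             # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=compute_dtype,  # dtype used for computation
    bnb_4bit_use_double_quant=False        # do not quantize the quantization constants
)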

Loading the Pre-Trained model

Microsoft recently open-sourced Phi-2, a small language model (SLM) with 2.7 billion parameters. Here, we will use Phi-2 for the fine-tuning process. According to Microsoft, this model exhibits strong reasoning and language understanding capabilities, with state-of-the-art performance among base language models with fewer than 13 billion parameters.

Let’s now load Phi-2 from Hugging Face using 4-bit quantization.

In [10]:
# Model name and device map (place the whole model on GPU 0)
model_name = 'microsoft/phi-2'
device_map = {"": 0}

# Load the pre-trained model with the 4-bit quantization config defined above
original_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    quantization_config=bnb_config,
    trust_remote_code=True,
    token=True  # use the Hugging Face token saved at login
)


In [12]:
# Alternative, left commented out: load the model on CPU without quantization
# from transformers import AutoModelForCausalLM, AutoTokenizer
# import torch

# # Model name and device map
# model_name = 'microsoft/phi-2'

# # Load model on CPU
# original_model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     device_map="cpu",  # Ensure model is loaded on CPU
#     trust_remote_code=True,
#     use_auth_token=True
# )

# # Optionally, if the model supports float16 precision, you can set it (this is typically for GPU, but you can set it for consistency)
# original_model = original_model.to(torch.float16) if torch.cuda.is_available() else original_model

# # Load tokenizer
# tokenizer = AutoTokenizer.from_pretrained(model_name)

# # Check if GPU is available
# print("CUDA available:", torch.cuda.is_available())


In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
    use_fast=False
)
# Reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token

Test the Model with Zero Shot Inferencing
We will evaluate the base model that we loaded above using a few sample inputs.

In [None]:
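%%time
from transformers import set_seed
seed = 42
set_seed(seed)

def gen(model, prompt, max_new_tokens=100):
    # NOTE: `gen` is not defined anywhere in this notebook; this is an assumed
    # minimal helper that generates text with the `tokenizer` defined above
    # and returns a list of decoded strings (prompt plus continuation).
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id
        )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

index = 10

prompt = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

formatted_prompt = f"Instruct: Summarize the following conversation.\n{prompt}\nOutput:\n"
res = gen(original_model, formatted_prompt, 100)
# The decoded text echoes the prompt, so keep only what follows "Output:"
output = res[0].split('Output:\n')[1]

dash_line = '-' * 99
print(dash_line)
print(f'INPUT PROMPT:\n{formatted_prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')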
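The prompt template and the output-extraction step in the cell above can be isolated into two small helpers, which makes them easy to check without loading a model. The names `build_prompt` and `extract_output`, the sample dialogue, and the stand-in generation are all hypothetical; only the template string and the `dialogue` field come from the cell above.

```python
def build_prompt(dialogue):
    """Wrap a dialogue in the instruction template used for summarization."""
    return f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

def extract_output(generated_text):
    """Return the text that follows the 'Output:' marker, or '' if it is missing."""
    marker = "Output:\n"
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else ""

# Made-up sample; a real one would come from dataset['test'][index].
sample = {"dialogue": "#Person1#: Can we meet at 3 pm?\n#Person2#: Sure, see you then."}
prompt = build_prompt(sample["dialogue"])

# Stand-in for model output: decoded text normally echoes the prompt first.
fake_generation = prompt + "They agree to meet at 3 pm."
print(extract_output(fake_generation))
```

Using `partition` instead of `split(...)[1]` avoids an `IndexError` when the model output does not contain the marker at all.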