# LangQA – Language-powered question and answer system

## Imports

In [2]:
import torch
from trl import SFTTrainer
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from transformers import pipeline, TrainingArguments
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory
from langchain.chains import LLMChain
import warnings

warnings.filterwarnings('ignore')

  from .autonotebook import tqdm as notebook_tqdm


## Load Data

https://huggingface.co/datasets/nlpie/Llama2-MedTuned-Instructions

In [3]:
# Load dataset
dataset = load_dataset('nlpie/Llama2-MedTuned-Instructions')

In [4]:
train_data = dataset['train'].select(indices=range(1000))

train_data

Dataset({
    features: ['instruction', 'input', 'output', 'source'],
    num_rows: 1000
})

In [5]:
# Selecting the lines to test the model
test_data = dataset['train'].select(indices=range(1000, 1200))

## Understanding the format of the text

In [6]:
for i in range(3):
    data = dataset['train'][i]
    print(f"Data point {i + 1}:")
    print("Instruction:", data['instruction'])
    print("Input:", data['input'])
    print("Output:", data['output'])
    print("\n-----------------------------\n")

Data point 1:
Instruction: In your role as a medical professional, address the user's medical questions and concerns.
Input: My relative suffering from secondary lever cancer ( 4th stage as per Allopathic doctor) and primary is in rectum. He is continuously with 103 to 104 degree F fever. Allpathic doctor suggested chemo only after fever subsidises. Is treatment possible at Lavanya & what is the time scale of recover.
Output: Hi, dairy have gone through your question. I can understand your concern. He has rectal cancer with liver metastasis. It is stage 4 cancer. Surgery is not possible at this stage. Only treatment options are chemotherapy and radiotherapy according to type of cancer. Inspite of all treatment prognosis is poor. Life expectancy is not good. Consult your doctor and plan accordingly. Hope I have answered your question, if you have any doubts then contact me at bit.ly/ Chat Doctor. Thanks for using Chat Doctor. Wish you a very good health.

-----------------------------



## Automating the Creation of Prompts for Model Training

In [7]:
# Defines a function that takes a dictionary named sample
def create_prompt(sample):

    # Defines a pre_prompt string that serves as a template for the first part of the prompt
    pre_prompt = """[INST]<<SYS>> {instruction}\n"""

    # Concatenates pre_prompt with additional strings to form the complete prompt
    prompt = pre_prompt + "{input}" +"[/INST]"+"\n{output}"

    # Assigns the value of the 'instruction' key of the dictionary sample to the variable example_instruction
    example_instruction = sample['instruction']

    # Assigns the value of the 'input' key of the dictionary sample to the variable example_input
    example_input = sample['input']

    # Assigns the value of the 'output' key of the dictionary sample to the variable example_output
    example_output = sample['output']

    # Creates an instance of PromptTemplate with the previously defined prompt and input variables
    prompt_template = PromptTemplate(template = prompt,
    input_variables = ["instruction", "input", "output"])

    # Uses the format method of the prompt_template instance to replace the variables
    # in the template with the specified values
    unique_prompt = prompt_template.format(instruction = example_instruction,
                                          input = example_input,
                                          output = example_output)

    # Returns the formatted prompt
    return [unique_prompt]

In [8]:
# Testing the function
prompt = create_prompt(train_data[0])
print(prompt)

["[INST]<<SYS>> In your role as a medical professional, address the user's medical questions and concerns.\nMy relative suffering from secondary lever cancer ( 4th stage as per Allopathic doctor) and primary is in rectum. He is continuously with 103 to 104 degree F fever. Allpathic doctor suggested chemo only after fever subsidises. Is treatment possible at Lavanya & what is the time scale of recover.[/INST]\nHi, dairy have gone through your question. I can understand your concern. He has rectal cancer with liver metastasis. It is stage 4 cancer. Surgery is not possible at this stage. Only treatment options are chemotherapy and radiotherapy according to type of cancer. Inspite of all treatment prognosis is poor. Life expectancy is not good. Consult your doctor and plan accordingly. Hope I have answered your question, if you have any doubts then contact me at bit.ly/ Chat Doctor. Thanks for using Chat Doctor. Wish you a very good health."]


## Quantization Process

**PT-BR**

O processo de quantização serve para reduzir o tamanho do modelo e melhorar a eficiência computacional, tornando-o mais rápido e acessível a execução em máquinas com hardware de menor capacidade. A quantização reduz a precisão dos pesos e ativações do modelo, que geralmente estão representados em pontos flutuantes de 32 bits (FP32), para formatos de menor precisão, como int8, int4 e até mesmo int2.

Principais Benefícios da Quantização:
* Redução do uso de memória: Modelos quantizados ocupam menos espaço na RAM, tornando possível a execução em máquinas com menos memória, CPUs e GPUs menos potentes;
* Aceleração do tempo de inferência: Operações com números inteiros são mais rápidas do que operações com pontos flutuantes, reduzindo a latência de inferência;
* Menor consumo de energia: Como a quantização exige menos processamento de ponto flutuante, o gasto energético é menor, tornando possível a execução em dispositivos embarcados e em dispositivos móveis. 

---
**EN-US**

The quantization process serves to reduce the size of the model and improve computational efficiency, making it faster and more accessible to run on machines with lower-capacity hardware. Quantization reduces the precision of the model's weights and activations, which are usually represented in 32-bit floating point (FP32) formats, to lower-precision formats, such as int8, int4, and even int2.

Main Benefits of Quantization:
* Reduced memory usage: Quantized models take up less space in RAM, making it possible to run on machines with less memory, less powerful CPUs and GPUs;
* Accelerated inference time: Operations with integers are faster than operations with floating points, reducing inference latency;
* Lower energy consumption: Since quantization requires less floating point processing, energy expenditure is lower, making it possible to run on embedded devices and mobile devices.

#### Bitsandbytes Config

The `BitsAndBytesConfig` object from the **bitsandbytes** package is used to configure **model quantization** during loading. This quantization reduces the model's weight size and **makes it possible to train and infer large LLMs on smaller GPUs**, such as those with 8GB, 16 GB or 24 GB of VRAM. 

---

### 🔹 **Explanation of the Parameters**
#### `load_in_4bit=True`
- **What it does:** Enables **4-bit** quantization to reduce VRAM usage.  
- **Alternative:** `load_in_8bit=True`, which uses **8-bit** quantization instead of 4-bit.  
- **Why use it:** 4-bit models take up **half** the memory of an 8-bit model but may lose some precision.  

---

#### `bnb_4bit_quant_type="nf4"`
- **What it does:** Defines the quantization type used to store weights.  
- **Available options:**  
  - `"fp4"` → **Float4**, a 4-bit floating-point format.  
  - `"nf4"` → **Normal Float4**, an optimized version of Float4 that improves precision.  
- **Why use `"nf4"`:** This format has been optimized for **AI models**, providing better precision than `"fp4"` while reducing weight size.  

P.S.: The **NF4 (Normal Float 4)** format improves precision compared to **FP4 (Float 4)** because it uses a **non-linear distribution** of representable values, optimizing bit allocation to represent numbers that occur more frequently in AI models. It was specifically designed for **Deep Learning**, especially to handle **LLMs**.

---

#### `bnb_4bit_compute_dtype="float16"`
- **What it does:** Sets the data type used for computations during training/inference.  
- **Available options:**  
  - `"float16"` → Uses `torch.float16`, good for GPUs that support 16-bit calculations.  
  - `"bfloat16"` → Uses `torch.bfloat16`, better for modern GPUs like A100/H100.  
  - `"float32"` → Uses `torch.float32`, offering higher precision but consuming more memory.  
- **Why use `"float16"`:** Most GPUs support `float16`, which balances precision and efficiency. If using GPUs like A100 or H100, `bfloat16` might be a better choice.  

---

#### `bnb_4bit_use_double_quant=False`
- **What it does:** Controls whether **double quantization** will be used.  
- **Available options:**  
  - `False` → Only a single quantization is applied.  
  - `True` → **Applies a second quantization** to the already quantized weights.  
- **Why use `False`:** If your GPU has **enough memory** (e.g., 24 GB), double quantization **may not be necessary**. If you want to **save even more VRAM**, you can try `True`.  

---

### 🔹 **Other Possible Configurations**
Here are some variations you can test, depending on your hardware and goals:

#### 🔸 **For maximum memory efficiency**
```python
  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,  
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype="bfloat16",  # Uses less memory on modern GPUs
      bnb_4bit_use_double_quant=True      # Enables double quantization to save VRAM
)
```
**Recommended use:** If running a very large model on a GPU with **limited VRAM** (e.g., RTX 3090, 4090, or A100 with 40 GB).

---

#### 🔸 **For better precision during inference**
```python
  bnb_config = BitsAndBytesConfig(
      load_in_8bit=True,  # Uses 8-bit instead of 4-bit
      bnb_4bit_compute_dtype="float32"  # Uses float32 for more precise calculations
)
```
**Recommended use:** When you **don't need to save much memory** and want a **more accurate model**.

---

### 🔹 **Conclusion**
Using `BitsAndBytesConfig` is essential for **running large models on limited hardware**. Here's a quick summary of the most important parameters:

| Parameter | Function | Common Values | When to Use |
|-----------|--------|---------------|------------|
| `load_in_4bit` | Enables 4-bit quantization | `True` or `False` | For maximum memory savings |
| `bnb_4bit_quant_type` | Type of quantization | `"fp4"` or `"nf4"` | `"nf4"` for better precision |
| `bnb_4bit_compute_dtype` | Data type for computations | `"float16"`, `"bfloat16"`, `"float32"` | `"bfloat16"` for modern GPUs |
| `bnb_4bit_use_double_quant` | Enables double quantization | `True` or `False` | `True` if VRAM is limited |

If you want more efficiency, you can test different combinations and check VRAM consumption using `torch.cuda.memory_allocated()`. 🚀  

In [9]:
# Enables loading of the base model with 4-bit precision
use_4bit = True

# Quantization type
bnb_4bit_quant_type = "nf4"

# Sets the dtype for the base model
bnb_4bit_compute_dtype = "bfloat16"

# Disables double quantization
use_nested_quant = True

# Sets the dtype for computation in PyTorch
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

In [10]:
# Defining the config
bnb_config = BitsAndBytesConfig(load_in_4bit = use_4bit,
                                bnb_4bit_quant_type = bnb_4bit_quant_type,
                                bnb_4bit_compute_dtype = compute_dtype,
                                bnb_4bit_use_double_quant = use_nested_quant)

In [12]:
# Verifying if the GPU supports bfloat16
if compute_dtype == torch.bfloat16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("The GPU suporrts bfloat16. You can accelerate the train using bf16=True")
        print("=" * 80)

The GPU suporrts bfloat16. You can accelerate the train using bf16=True


## Load the LLM and the Tokenizer

https://huggingface.co/NousResearch/Llama-2-7b-chat-hf

In [13]:
# LLM
# llm_name = "NousResearch/Llama-2-7b-chat-hf"
llm_name = "Qwen/Qwen2.5-7B-Instruct"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(llm_name)

# Load the base model with quantization
model = AutoModelForCausalLM.from_pretrained(llm_name,
                                              quantization_config = bnb_config,
                                            #   trust_remote_code=True,
                                              device_map = "auto",
                                              use_cache = False
                                              )

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
Loading checkpoint shards: 100%|██████████| 4/4 [00:24<00:00,  6.04s/it]


In [14]:
# Use the EOS token from the tokenizer to pad at the end of each sequence
tokenizer.pad_token = tokenizer.eos_token

# Enable padding at the end of each sentence
tokenizer.padding_side = "right"

## Configuring LoRa Adapters

Quantization represents data with fewer bits, making it a useful technique for reducing memory usage and speeding up inference, especially in the context of LLMs.  

Once a model is quantized, it is typically not trained **directly** for downstream tasks because training can become unstable due to the reduced precision of weights and activations. However, since PEFT methods only add extra trainable parameters, this allows for training a quantized model with a PEFT adapter on top! Combining quantization with PEFT can be a good strategy to train even the largest models on a single GPU. For example, QLoRA is a method that quantizes a model to 4 bits and then trains it with LoRA. This method enables fine-tuning a 65B parameter model on a single 48GB GPU, for instance.  

The goal of PEFT (Parameter-Efficient Fine-Tuning) is to keep most of the pre-trained model's parameters fixed while adjusting only a small subset of parameters to adapt the model to a specific task.

---

## PT-BR

**Parameter-Efficient Fine-Tuning (PEFT)**

PEFT, ou *Parameter-Efficient Fine-Tuning*, é uma abordagem de *fine-tuning* desenvolvida para ajustar grandes modelos de IA com eficiência, alterando apenas uma pequena parte de seus parâmetros. Em vez de atualizar todos os milhões ou bilhões de parâmetros de um modelo, como acontece no *fine-tuning* tradicional, o PEFT se concentra em modificar uma porção mínima deles. Isso reduz a necessidade de recursos computacionais e de memória, mantendo o desempenho próximo ou equivalente ao do *fine-tuning* completo.

As técnicas mais conhecidas de PEFT incluem:

1. **Adapters**: Inserem camadas adicionais entre as camadas originais do modelo, treinando apenas essas novas camadas. As camadas originais permanecem congeladas, o que preserva o conhecimento do pré-treinamento e permite ajustes leves e específicos para a tarefa.
2. **LoRA (Low-Rank Adaptation)**: Adiciona matrizes de baixa dimensionalidade que capturam pequenas mudanças nos parâmetros do modelo. O LoRA é treinado independentemente das camadas principais, reduzindo ainda mais os recursos necessários para adaptação.
3. **Prefix Tuning**: Adiciona vetores de "prefixo" no início de cada camada de atenção do modelo, alterando a forma como ele processa o contexto de entrada. Isso permite um ajuste eficaz sem modificar os parâmetros originais.
4. **Prompt Tuning**: Em vez de ajustar os parâmetros internos, o *Prompt Tuning* adiciona prompts ou "ganchos" específicos ao modelo, orientando suas respostas para uma tarefa particular.
5. **BitFit**: Congela todas as camadas do modelo, ajustando apenas os *bias terms* dos parâmetros. Isso reduz drasticamente o número de parâmetros treináveis, mantendo um desempenho relativamente bom para tarefas específicas.

Essas técnicas são úteis em casos onde se quer ajustar um modelo grande para tarefas específicas sem gastar muitos recursos. O PEFT é especialmente popular em modelos de linguagem natural (como *transformers*) devido ao seu custo-benefício e por possibilitar a adaptação de modelos muito grandes, como o GPT e o BERT, para múltiplas tarefas com eficiência.

---

**Low-Rank Adapatation (LoRA)**

LoRA, ou *Low-Rank Adaptation*, é uma técnica de *fine-tuning* eficiente, voltada para grandes modelos de IA, como redes neurais baseadas em *transformers*. Ela permite ajustar o modelo para uma tarefa específica modificando apenas uma pequena quantidade de parâmetros, em vez de realizar o ajuste completo de todos os milhões ou bilhões de parâmetros. O LoRA é um dos métodos do *Parameter-Efficient Fine-Tuning* (PEFT) e é especialmente eficaz para economizar recursos computacionais e de memória.

**Como o LoRA Funciona**

O princípio do LoRA é usar uma decomposição de baixa dimensionalidade (ou *low-rank*) para modelar as alterações necessárias no modelo original. Em vez de ajustar diretamente os pesos principais de cada camada, o LoRA adiciona matrizes de baixa dimensionalidade para capturar as pequenas variações de pesos necessárias para a nova tarefa. Essas matrizes extras ajustam a saída do modelo de forma a adaptá-lo, enquanto os parâmetros originais permanecem congelados.

**Principais Vantagens do LoRA**

1. **Redução de Recursos Computacionais**: Como apenas as matrizes de baixa dimensionalidade são treinadas, o uso de memória e de computação é muito menor do que em um ajuste completo.
2. **Preservação do Conhecimento do Modelo**: Ao manter os parâmetros originais do modelo congelados, o LoRA preserva o conhecimento geral do pré-treinamento, adaptando-o com eficiência a uma nova tarefa sem modificar seu comportamento central.
3. **Flexibilidade para Múltiplas Tarefas**: É possível treinar e armazenar várias matrizes LoRA para diferentes tarefas em um único modelo base, ativando-as conforme necessário. Isso permite uma adaptação rápida a várias tarefas sem a necessidade de *fine-tuning* completo para cada uma.

**Casos de Uso do LoRA**

LoRA é amplamente usado em processamento de linguagem natural, especialmente com grandes modelos de *transformers* em tarefas como classificação de texto, geração de linguagem e compreensão de linguagem. Ele também é útil em visão computacional e outras áreas onde grandes modelos são aplicados.

Em resumo, LoRA é uma abordagem prática e econômica para adaptar modelos de IA grandes e complexos a novos contextos ou tarefas específicas, tornando o processo de *fine-tuning* mais acessível e escalável.

---

**Quantized Low-Rank Adaptation (QLoRA)**

É uma técnica que combina dois métodos para tornar o *fine-tuning* de grandes modelos mais eficiente em termos de recursos computacionais: quantização e *low-rank adaptation* (LoRA). QLoRA permite que modelos enormes, como modelos de linguagem natural com bilhões de parâmetros, sejam ajustados em hardwares de menor potência, como GPUs com menor memória, sem sacrificar a precisão do modelo.

**Como Funciona o QLoRA**

1. **Quantização**: A quantização é o processo de reduzir a precisão dos parâmetros do modelo (por exemplo, de 16 bits para 4 bits). Isso reduz a quantidade de memória necessária para armazenar e processar os pesos do modelo, permitindo que modelos grandes sejam carregados em GPUs menores. Em QLoRA, o modelo é quantizado de forma cuidadosa para minimizar a perda de precisão.
2. **Low-Rank Adaptation (LoRA)**: O LoRA adiciona pequenas matrizes de baixa dimensão aos pesos do modelo, treinando apenas esses novos parâmetros em vez de atualizar o modelo inteiro. Essas matrizes capturam as mudanças necessárias para adaptar o modelo à tarefa específica, enquanto os pesos quantizados permanecem congelados.

A combinação desses dois métodos significa que QLoRA pode aproveitar a eficiência da quantização para reduzir o uso de memória, enquanto ainda permite a adaptação do modelo para uma tarefa específica usando LoRA, sem precisar ajustar todos os parâmetros.

**Principais Vantagens do QLoRA**

1. **Menor Consumo de Memória e Computação**: Com a quantização para 4 bits, o modelo ocupa significativamente menos espaço na memória, permitindo que seja carregado em GPUs de menor capacidade.
2. **Eficiência no Treinamento**: Apenas as matrizes de adaptação de baixa dimensionalidade são treinadas, reduzindo ainda mais o custo computacional e acelerando o processo de *fine-tuning*.
3. **Preservação do Desempenho do Modelo**: Apesar da quantização e da adaptação parcial dos parâmetros, o QLoRA consegue manter uma alta precisão, tornando-o uma alternativa viável para o *fine-tuning* de modelos muito grandes.

**Casos de Uso**

QLoRA é particularmente útil em modelos de linguagem com muitos parâmetros, onde os recursos computacionais podem ser um limitante para o *fine-tuning*. Isso o torna ideal para adaptar grandes modelos como LLaMA e GPT para tarefas específicas em processamento de linguagem natural (NLP) sem precisar de uma infraestrutura de hardware muito poderosa.

**Em Resumo**

QLoRA é uma técnica que possibilita o ajuste eficiente de grandes modelos ao combinar quantização com adaptação de baixa dimensionalidade. Essa abordagem torna o *fine-tuning* de grandes modelos acessível para mais usuários e aplicações, mantendo a eficiência de memória e o desempenho.

---

## EN

### **Parameter-Efficient Fine-Tuning (PEFT)**

PEFT, or *Parameter-Efficient Fine-Tuning*, is a fine-tuning approach designed to efficiently adjust large AI models by modifying only a small portion of their parameters. Instead of updating all the millions or billions of parameters in a model, as in traditional fine-tuning, PEFT focuses on modifying a minimal subset. This reduces computational and memory requirements while maintaining performance close to or equivalent to full fine-tuning.

The most well-known PEFT techniques include:

1. **Adapters**: Insert additional layers between the model's original layers, training only these new layers. The original layers remain frozen, preserving pre-trained knowledge while allowing for lightweight, task-specific adjustments.
2. **LoRA (Low-Rank Adaptation)**: Adds low-dimensional matrices that capture small changes in the model's parameters. LoRA is trained independently of the main layers, further reducing the resources required for adaptation.
3. **Prefix Tuning**: Adds "prefix" vectors at the beginning of each attention layer in the model, altering how it processes input context. This enables effective fine-tuning without modifying the original parameters.
4. **Prompt Tuning**: Instead of adjusting internal parameters, *Prompt Tuning* adds specific prompts or "hooks" to the model, guiding its responses toward a particular task.
5. **BitFit**: Freezes all model layers and adjusts only the *bias terms* of the parameters. This drastically reduces the number of trainable parameters while maintaining relatively good performance for specific tasks.

These techniques are useful when adjusting a large model for specific tasks without consuming excessive resources. PEFT is particularly popular in natural language processing models (such as *transformers*) due to its cost-effectiveness and ability to efficiently adapt very large models, like GPT and BERT, for multiple tasks.

---

### **Low-Rank Adaptation (LoRA)**

LoRA, or *Low-Rank Adaptation*, is an efficient fine-tuning technique designed for large AI models, such as transformer-based neural networks. It allows the model to be adapted for a specific task by modifying only a small number of parameters instead of fine-tuning all millions or billions of them. LoRA is one of the methods within *Parameter-Efficient Fine-Tuning* (PEFT) and is particularly effective in reducing computational and memory requirements.

#### **How LoRA Works**

The core idea of LoRA is to use a low-dimensional decomposition (*low-rank*) to model the necessary changes in the original model. Instead of directly adjusting the main weights of each layer, LoRA adds low-dimensional matrices to capture the small weight variations required for the new task. These extra matrices adjust the model’s output, adapting it while keeping the original parameters frozen.

#### **Key Advantages of LoRA**

1. **Reduced Computational Resources**: Since only the low-dimensional matrices are trained, memory and computational usage are much lower than in full fine-tuning.
2. **Preserving the Model’s Knowledge**: By keeping the model’s original parameters frozen, LoRA preserves the general knowledge from pre-training while efficiently adapting it to a new task without modifying its core behavior.
3. **Flexibility for Multiple Tasks**: It is possible to train and store multiple LoRA matrices for different tasks in a single base model, activating them as needed. This enables quick adaptation to various tasks without requiring full fine-tuning for each one.

#### **Use Cases of LoRA**

LoRA is widely used in natural language processing, particularly with large *transformer* models in tasks such as text classification, language generation, and language comprehension. It is also useful in computer vision and other fields where large models are applied.

In summary, LoRA is a practical and cost-effective approach for adapting large and complex AI models to new contexts or specific tasks, making the fine-tuning process more accessible and scalable.

---

### **Quantized Low-Rank Adaptation (QLoRA)**

QLoRA is a technique that combines two methods to make the fine-tuning of large models more efficient in terms of computational resources: **quantization** and **low-rank adaptation (LoRA)**. QLoRA allows enormous models, such as natural language models with billions of parameters, to be fine-tuned on lower-powered hardware, such as GPUs with limited memory, without sacrificing model accuracy.

#### **How QLoRA Works**

1. **Quantization**: Quantization is the process of reducing the precision of model parameters (e.g., from 16-bit to 4-bit). This reduces the amount of memory required to store and process model weights, enabling large models to run on smaller GPUs. In QLoRA, the model is quantized carefully to minimize precision loss.
2. **Low-Rank Adaptation (LoRA)**: LoRA adds small low-dimensional matrices to the model's weights, training only these new parameters instead of updating the entire model. These matrices capture the necessary changes to adapt the model to a specific task, while the quantized weights remain frozen.

By combining these two methods, QLoRA leverages **quantization** to reduce memory usage while still enabling model adaptation using **LoRA**, without requiring updates to all parameters.

#### **Key Advantages of QLoRA**

1. **Lower Memory and Compute Requirements**: With 4-bit quantization, the model takes up significantly less space in memory, allowing it to run on lower-capacity GPUs.
2. **Efficient Training**: Only the low-rank adaptation matrices are trained, further reducing computational costs and accelerating the fine-tuning process.
3. **Preserving Model Performance**: Despite quantization and partial parameter adaptation, QLoRA can maintain high accuracy, making it a viable alternative for fine-tuning very large models.

#### **Use Cases**

QLoRA is particularly useful in large-scale language models, where computational resources can be a limiting factor for fine-tuning. This makes it ideal for adapting massive models like LLaMA and GPT for specific tasks in **natural language processing (NLP)** without requiring a powerful hardware infrastructure.

#### **In Summary**

QLoRA is a technique that enables the efficient fine-tuning of large models by combining **quantization** with **low-rank adaptation**. This approach makes fine-tuning large models accessible to more users and applications while maintaining memory efficiency and performance.

In [15]:
# LoRa Parameters
peft_config = LoraConfig(r = 8,
                        lora_alpha = 16,
                        lora_dropout = 0.05,
                        bias = "none",
                        task_type = "CAUSAL_LM")

In [16]:
# Prepare the model to train
model = prepare_model_for_kbit_training(model)

In [17]:
# Merge the quantized model with the LoRa adapters
model = get_peft_model(model, peft_config=peft_config)

## Fine-tuning parameters

Sure! Let's break down each parameter in your `TrainingArguments` and understand their roles in fine-tuning your model.  

---

### **Understanding Each Parameter:**

1. **`output_dir = output_model`**  
   - Specifies where to save the trained model and checkpoints.  
   - `output_model` should be a string indicating a directory path.  

2. **`per_device_train_batch_size = 1`**  
   - Sets the number of training samples per GPU (or device) per step.  
   - Since it's set to `1`, each GPU processes only one example at a time.  
   - A small batch size is typically used when working with large models to save memory.  

3. **`gradient_accumulation_steps = 4`**  
   - Accumulates gradients over multiple steps before performing a weight update.  
   - Helps simulate a larger batch size without increasing memory consumption.  
   - Here, it means that the optimizer updates the model every **4 steps**, effectively simulating a batch size of `4`.  

4. **`optim = "paged_adamw_32bit"`**  
   - Chooses the optimizer.  
   - `"paged_adamw_32bit"` is a memory-efficient optimizer from `bitsandbytes`, optimized for handling large models.  
   - Uses AdamW (Adam with weight decay) but in a paged memory format, making it efficient for limited GPU memory.  

5. **`learning_rate = 2e-4`**  
   - Sets the learning rate for training.  
   - `2e-4` (or `0.0002`) is a relatively moderate learning rate, suitable for fine-tuning large models.  

6. **`lr_scheduler_type = "cosine"`**  
   - Defines the learning rate scheduler, which controls how the learning rate changes over training.  
   - `"cosine"` uses a cosine decay schedule, where the learning rate starts high and gradually decreases following a cosine function.  
   - Useful for fine-tuning as it avoids sudden changes in learning rate.  

7. **`save_strategy = "epoch"`**  
   - Determines when to save checkpoints.  
   - `"epoch"` means that the model will be saved at the end of each training epoch.  

8. **`logging_steps = 10`**  
   - Controls how often training logs (like loss and learning rate) are printed.  
   - Here, logs are printed every **10 training steps**.  

9. **`num_train_epochs = 3`**  
   - Sets the number of times the entire dataset is passed through the model.  
   - Here, the model will go through the dataset **3 times**.  

10. **`max_steps = 150`**  
    - Defines the maximum number of training steps before stopping.  
    - If set, this takes priority over `num_train_epochs`.  
    - Here, training will stop after **150 steps**, even if all epochs haven’t completed.  

11. **`fp16 = True`**  
    - Enables **mixed precision training**, where floating-point operations are done in 16-bit precision instead of 32-bit.  
    - Helps reduce memory usage and speed up training on GPUs that support FP16 (e.g., NVIDIA GPUs with Tensor Cores).  

---

### **Key Takeaways:**
- **Memory optimization**: `gradient_accumulation_steps`, `optim="paged_adamw_32bit"`, and `fp16=True` help reduce GPU memory usage.  
- **Training control**: `per_device_train_batch_size`, `num_train_epochs`, and `max_steps` define how the training progresses.  
- **Learning dynamics**: `learning_rate` and `lr_scheduler_type="cosine"` affect model convergence.  
- **Checkpointing & logging**: `save_strategy="epoch"` ensures periodic model saving, and `logging_steps=10` provides updates during training.  

Would you like recommendations based on your hardware or dataset size? 🚀

In [18]:
output_model = 'adjusted_model'

In [19]:
# Train arguments
training_arguments = TrainingArguments(output_dir = output_model,
                                       per_device_train_batch_size = 1,
                                       gradient_accumulation_steps = 4,
                                       optim = "paged_adamw_32bit",
                                       learning_rate = 2e-4,
                                       lr_scheduler_type = "cosine",
                                       save_strategy = "epoch",
                                       logging_steps = 10,
                                       num_train_epochs = 3,
                                       max_steps = 150,
                                       fp16 = True)

In [None]:
# Force the model to allocate memory correctly
model = model.to("cuda")

In [20]:
# Creates the Trainer
# Optimized for fine-tuning pre-trained models with smaller datasets on supervised learning tasks.
trainer = SFTTrainer(model = model,
                     peft_config = peft_config,
                    #  max_seq_length = 512,
                     tokenizer = tokenizer,
                    #  packing = True,
                     formatting_func = create_prompt,
                     args = training_arguments,
                     train_dataset = train_data,
                     eval_dataset = test_data)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


## Fine-tuning training

In [21]:
%%time
trainer.train()

Step,Training Loss
10,1.4782
20,1.1387
30,0.805
40,0.3209
50,0.0335
60,0.0109
70,0.0091
80,0.0085
90,0.008
100,0.0077


CPU times: total: 26.7 s
Wall time: 14min 35s


TrainOutput(global_step=150, training_loss=0.25709187308947246, metrics={'train_runtime': 875.8123, 'train_samples_per_second': 0.685, 'train_steps_per_second': 0.171, 'total_flos': 6518607917875200.0, 'train_loss': 0.25709187308947246, 'epoch': 150.0})

In [22]:
# Model save
trainer.save_model('final_model')

In [23]:
# Merge
merged_model = model.merge_and_unload()

## Building the pipeline of Text Generation with LangChain

In [24]:
# Create pre-prompt with the instruction
pre_prompt = """[INST] <<SYS>>\nAnalyze the question and answer with the best option.\n"""

# Create the prompt adding the input
prompt = pre_prompt + "Here is my question {context}" + "[\INST]"

# Create the prompt template with LangChain
prompt = PromptTemplate(template = prompt, input_variables=["context"])

Pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract away most of the complex code in the library, providing a simple API dedicated to a variety of tasks, including named entity recognition, masked language modeling, sentiment analysis, feature extraction, and question answering.

In [25]:
# Create the pipeline object
pipe = pipeline("text-generation",
                 model = merged_model,
                 tokenizer = tokenizer,
                 max_new_tokens = 512,
                 use_cache = False,
                 do_sample = True,
                 pad_token_id = tokenizer.eos_token_id,
                 top_p = 0.7,
                 temperature = 0.5)

Device set to use cuda:0


In [26]:
# Create the Hugging Face Pipeline
llm_pipeline = HuggingFacePipeline(pipeline = pipe)

## Creating the LLM Chain

In [27]:
# create the memory
memory = ConversationBufferMemory()

In [28]:
# Create the LLM Chain
chat_llm_chain = LLMChain(llm = llm_pipeline,
                          prompt = prompt,
                          verbose = False,
                          memory = memory)

## Deploying the Model and Using the Question and Answer System

In [29]:
context = '''###Question: All of the following provisions are included in the Primary health care according to the Alma Ata declaration except:
###Options:
A. Adequate supply of safe drinking water
B. Nutrition
C. Provision of free medicines
D. Basic sanitation'''

In [30]:
%%time
response = chat_llm_chain.predict(context = context)

CPU times: total: 1min 25s
Wall time: 7min 52s


In [31]:
print(response)

[INST] <<SYS>>
Analyze the question and answer with the best option.
Here is my question ###Question: All of the following provisions are included in the Primary health care according to the Alma Ata declaration except:
###Options:
A. Adequate supply of safe drinking water
B. Nutrition
C. Provision of free medicines
D. Basic sanitation[\INST] [ANS] The correct answer is C. Provision of free medicines.

The Alma Ata Declaration, adopted in 1978, emphasized the importance of primary health care (PHC) as a fundamental approach to achieving health for all by the year 2000. According to this declaration, PHC includes several key components such as adequate supply of safe drinking water, nutrition, and basic sanitation. However, it does not specifically mention the provision of free medicines as one of the core components of PHC.

- **Option A (Adequate supply of safe drinking water)**: This is a critical component of PHC as it directly impacts public health and disease prevention.
- **Optio