# LangQA – Language-powered question and answer system

## Imports

In [None]:
import torch
from trl import SFTTrainer
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from transformers import pipeline, TrainingArguments
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory
from langchain.chains import LLMChain
import warnings

warnings.filterwarnings('ignore')

## Load Data

https://huggingface.co/datasets/nlpie/Llama2-MedTuned-Instructions

In [None]:
# Load dataset
dataset = load_dataset('nlpie/Llama2-MedTuned-Instructions')

In [None]:
train_data = dataset['train'].select(indices=range(1000))

train_data

In [None]:
# Select the next 200 rows of the train split as a held-out evaluation set
test_data = dataset['train'].select(indices=range(1000, 1200))

## Understanding the format of the text

In [None]:
for i in range(3):
    data = dataset['train'][i]
    print(f"Data point {i + 1}:")
    print("Instruction:", data['instruction'])
    print("Input:", data['input'])
    print("Output:", data['output'])
    print("\n-----------------------------\n")

## Automating the Creation of Prompts for Model Training

In [None]:
# Defines a function that takes a dictionary named sample
def create_prompt(sample):

    # Defines a pre_prompt string that serves as a template for the first part of the prompt
    pre_prompt = """[INST]<<SYS>> {instruction}\n"""

    # Concatenates pre_prompt with additional strings to form the complete prompt
    prompt = pre_prompt + "{input}" +"[/INST]"+"\n{output}"

    # Assigns the value of the 'instruction' key of the dictionary sample to the variable example_instruction
    example_instruction = sample['instruction']

    # Assigns the value of the 'input' key of the dictionary sample to the variable example_input
    example_input = sample['input']

    # Assigns the value of the 'output' key of the dictionary sample to the variable example_output
    example_output = sample['output']

    # Creates an instance of PromptTemplate with the previously defined prompt and input variables
    prompt_template = PromptTemplate(template = prompt,
                                     input_variables = ["instruction", "input", "output"])

    # Uses the format method of the prompt_template instance to replace the variables
    # in the template with the specified values
    unique_prompt = prompt_template.format(instruction = example_instruction,
                                          input = example_input,
                                          output = example_output)

    # Returns the formatted prompt
    return [unique_prompt]

In [None]:
# Testing the function
prompt = create_prompt(train_data[0])
print(prompt)
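
To make the resulting layout concrete, here is a minimal sketch that fills the same template with plain `str.format`, without LangChain. The `format_sample` helper and the sample dictionary are hypothetical; the dictionary only mimics the three fields of the dataset.

```python
# Same template string that create_prompt builds by concatenation.
TEMPLATE = "[INST]<<SYS>> {instruction}\n{input}[/INST]\n{output}"


def format_sample(sample: dict) -> str:
    """Fill the template from a dict with 'instruction', 'input' and 'output' keys."""
    return TEMPLATE.format(
        instruction=sample["instruction"],
        input=sample["input"],
        output=sample["output"],
    )


# Hypothetical sample, not taken from the dataset.
example = {
    "instruction": "Answer the medical question.",
    "input": "What does BP stand for?",
    "output": "Blood pressure.",
}

formatted = format_sample(example)
print(formatted)
```

The instruction and input sit between `[INST]` and `[/INST]`, and the expected answer follows on the next line, which is the part the model learns to produce.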

## Quantization Process

The quantization process reduces the size of the model and improves computational efficiency, making it feasible to run on lower-capacity hardware. Quantization lowers the precision of the model's weights (and, in some schemes, activations), which are typically stored in 32-bit or 16-bit floating point (FP32, FP16/BF16), to lower-precision formats such as int8, int4, or even int2.

Main Benefits of Quantization:
* Reduced memory usage: Quantized models take up less memory, making it possible to run them on machines with less RAM or VRAM and on less powerful CPUs and GPUs;
* Faster inference: Low-precision arithmetic and reduced memory traffic can lower inference latency when the hardware and kernels support it;
* Lower energy consumption: Moving and processing fewer bits costs less energy, which helps on embedded and mobile devices.
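
The idea of trading precision for size can be illustrated with a small sketch of absmax int8 quantization in plain Python. This is only an illustration of the general principle under simple assumptions (one shared scale, symmetric range); it is not the algorithm bitsandbytes uses, and the weight values are made up.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(v) for v in values) / 127
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.1234, -0.5311, 0.9876, -1.27, 0.0402]

q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("largest round-trip error:", max_error)
```

Each weight is now stored as one small integer plus a shared scale, and the round-trip error is bounded by half the scale.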

#### Bitsandbytes Config

The `BitsAndBytesConfig` class from **transformers** configures **model quantization** through the **bitsandbytes** library at load time. Quantization shrinks the model's weights and **makes it possible to fine-tune and run large LLMs on smaller GPUs**, such as those with 16 GB or 24 GB of VRAM.

---

### 🔹 **Explanation of the Parameters**
#### `load_in_4bit=True`
- **What it does:** Enables **4-bit** quantization to reduce VRAM usage.  
- **Alternative:** `load_in_8bit=True`, which uses **8-bit** quantization instead of 4-bit.  
- **Why use it:** 4-bit weights take up roughly **half** the memory of 8-bit weights, at the cost of some precision.  
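
As a rough back-of-the-envelope check, the sketch below estimates the memory needed for the weights alone at different bit widths. The 7-billion parameter count is an assumed round figure, and the estimate ignores activations, the KV cache, quantization constants, and optimizer state.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights only, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7e9  # assumed round figure for a "7B" model

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```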

---

#### `bnb_4bit_quant_type="nf4"`
- **What it does:** Defines the quantization type used to store weights.  
- **Available options:**  
  - `"fp4"` → **Float4**, a 4-bit floating-point format.  
  - `"nf4"` → **NormalFloat4**, a 4-bit data type whose representable values are spaced for normally distributed weights.  
- **Why use `"nf4"`:** Pre-trained weights are roughly normally distributed, so `"nf4"` typically preserves more precision than `"fp4"` at the same 4-bit size.  

P.S.: The **NF4 (NormalFloat4)** format improves precision compared to **FP4 (Float4)** because it places its 16 representable values according to the quantiles of a normal distribution instead of on an even grid, so more levels fall where weight values occur most often. It was introduced with **QLoRA** for fine-tuning **LLMs**.
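
The non-uniform spacing can be sketched with the standard library. The helper below places 16 levels at evenly spaced quantiles of a standard normal distribution and rescales them to [-1, 1]. It is a simplified illustration of the idea and an assumption of this example, not the exact NF4 table shipped with bitsandbytes (which, among other differences, contains an exact zero).

```python
from statistics import NormalDist


def normal_quantile_levels(num_levels: int = 16):
    """Place levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    dist = NormalDist()
    probs = [(i + 0.5) / num_levels for i in range(num_levels)]
    raw = [dist.inv_cdf(p) for p in probs]
    largest = max(abs(x) for x in raw)
    return [x / largest for x in raw]


levels = normal_quantile_levels()
gaps = [b - a for a, b in zip(levels, levels[1:])]

print([round(x, 3) for x in levels])
print("gap around zero:", round(gaps[7], 3))
print("gap at the edge:", round(gaps[0], 3))
```

The gap between neighbouring levels is smallest around zero and largest at the edges, so values near zero, where most weights lie, are represented more finely.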

---

#### `bnb_4bit_compute_dtype="float16"`
- **What it does:** Sets the data type used for computations during training/inference.  
- **Available options:**  
  - `"float16"` → Uses `torch.float16`, good for GPUs that support 16-bit calculations.  
  - `"bfloat16"` → Uses `torch.bfloat16`, better for modern GPUs like A100/H100.  
  - `"float32"` → Uses `torch.float32`, offering higher precision but consuming more memory.  
- **Why use `"float16"`:** Most GPUs support `float16`, which balances precision and efficiency. If using GPUs like A100 or H100, `bfloat16` might be a better choice.  

---

#### `bnb_4bit_use_double_quant=False`
- **What it does:** Controls whether **double quantization** will be used.  
- **Available options:**  
  - `False` → Only a single quantization is applied.  
  - `True` → **Also quantizes the quantization constants** (the per-block scaling factors), rather than quantizing the weights a second time.  
- **Why use `False`:** If your GPU has **enough memory** (e.g., 24 GB), double quantization **may not be necessary**. If you want to **save even more VRAM**, you can try `True`.  
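
To get a feel for how much double quantization saves, the sketch below computes the per-parameter overhead of the quantization constants. The block sizes and bit widths (64 weights per block, 32-bit constants, 8-bit second-level constants in groups of 256) are the values reported in the QLoRA paper and are assumptions here; the defaults in your installed bitsandbytes version may differ.

```python
# Assumed values from the QLoRA paper, not read from bitsandbytes.
WEIGHT_BLOCK = 64      # weights sharing one quantization constant
CONSTANT_BITS = 32     # precision of each first-level constant
SECOND_BLOCK = 256     # first-level constants sharing one second-level constant
SECOND_BITS = 8        # precision of the re-quantized first-level constants

# Without double quantization: one 32-bit constant per 64 weights.
single = CONSTANT_BITS / WEIGHT_BLOCK

# With double quantization: 8-bit constants plus a 32-bit constant per 256 of them.
double = SECOND_BITS / WEIGHT_BLOCK + CONSTANT_BITS / (WEIGHT_BLOCK * SECOND_BLOCK)

saved_bits = single - double
num_params = 7e9  # assumed round figure for a "7B" model
saved_gib = saved_bits * num_params / 8 / 1024**3

print(f"overhead without double quantization: {single:.3f} bits/param")
print(f"overhead with double quantization:    {double:.3f} bits/param")
print(f"approximate saving for 7B params:     {saved_gib:.2f} GiB")
```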

---

### 🔹 **Other Possible Configurations**
Here are some variations you can test, depending on your hardware and goals:

#### 🔸 **For maximum memory efficiency**
```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",  # Same 16-bit size as float16, wider dynamic range (Ampere or newer GPUs)
    bnb_4bit_use_double_quant=True      # Enables double quantization to save VRAM
)
```
**Recommended use:** If running a very large model on a GPU with **limited VRAM** (e.g., RTX 3090, 4090, or A100 with 40 GB).

---

#### 🔸 **For better precision during inference**
```python
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True  # Uses 8-bit instead of 4-bit; the bnb_4bit_* options do not apply in this mode
)
```
**Recommended use:** When you **don't need to save much memory** and want a **more accurate model**.

---

### 🔹 **Conclusion**
Using `BitsAndBytesConfig` is essential for **running large models on limited hardware**. Here's a quick summary of the most important parameters:

| Parameter | Function | Common Values | When to Use |
|-----------|--------|---------------|------------|
| `load_in_4bit` | Enables 4-bit quantization | `True` or `False` | For maximum memory savings |
| `bnb_4bit_quant_type` | Type of quantization | `"fp4"` or `"nf4"` | `"nf4"` for better precision |
| `bnb_4bit_compute_dtype` | Data type for computations | `"float16"`, `"bfloat16"`, `"float32"` | `"bfloat16"` for modern GPUs |
| `bnb_4bit_use_double_quant` | Enables double quantization | `True` or `False` | `True` if VRAM is limited |

If you want more efficiency, you can test different combinations and check VRAM consumption using `torch.cuda.memory_allocated()`. 🚀  


In [None]:
# Enables loading of the base model with 4-bit precision
use_4bit = True

# Sets the dtype for the base model
bnb_4bit_compute_dtype = "float16"

# Quantization type
bnb_4bit_quant_type = "nf4"

# Disables double quantization
use_nested_quant = False

# Sets the dtype for computation in PyTorch
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

In [None]:
# Defining the config
bnb_config = BitsAndBytesConfig(load_in_4bit = use_4bit,
                                bnb_4bit_quant_type = bnb_4bit_quant_type,
                                bnb_4bit_compute_dtype = compute_dtype,
                                bnb_4bit_use_double_quant = use_nested_quant)

In [None]:
# Check whether the GPU supports bfloat16 (compute capability 8.0 or higher)
if compute_dtype == torch.float16 and use_4bit and torch.cuda.is_available():
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("The GPU supports bfloat16. You can speed up training with bf16=True")
        print("=" * 80)

## Load the LLM and the Tokenizer

https://huggingface.co/Qwen/Qwen2.5-7B-Instruct

https://huggingface.co/NousResearch/Llama-2-7b-chat-hf (alternative, commented out below)

In [None]:
# LLM
# llm_name = "NousResearch/Llama-2-7b-chat-hf"
llm_name = "Qwen/Qwen2.5-7B-Instruct"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(llm_name)

# Load the base model with quantization
model = AutoModelForCausalLM.from_pretrained(llm_name,
                                              quantization_config = bnb_config,
                                            #   trust_remote_code=True,
                                              device_map = "auto",
                                              use_cache = False
                                              )

In [None]:
# Use the EOS token from the tokenizer to pad at the end of each sequence
tokenizer.pad_token = tokenizer.eos_token

# Enable padding at the end of each sentence
tokenizer.padding_side = "right"

## Configuring LoRA Adapters

Quantization represents data with fewer bits, making it a useful technique for reducing memory usage and speeding up inference, especially in the context of LLMs.  

Once a model is quantized, it is typically not trained **directly** for downstream tasks, because training can become unstable with reduced-precision weights and activations. PEFT methods sidestep this: they keep the quantized weights frozen and add a small number of extra trainable parameters on top. Combining quantization with PEFT can therefore make it possible to train very large models on a single GPU. For example, QLoRA quantizes a model to 4 bits and then trains it with LoRA, which enables fine-tuning a 65B-parameter model on a single 48 GB GPU.  

The goal of PEFT (Parameter-Efficient Fine-Tuning) is to keep most of the pre-trained model's parameters fixed while adjusting only a small subset of parameters to adapt the model to a specific task.
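
A quick calculation shows how small the trainable subset is with LoRA. For a weight matrix of shape `d_out × d_in`, a rank-`r` adapter adds two matrices with `r × d_in` and `d_out × r` entries. The 4096 × 4096 layer size below is a hypothetical example, not a value read from the model; `r = 8` and `lora_alpha = 16` match the configuration used in this notebook.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters added by one rank-r LoRA adapter (matrices A and B)."""
    return r * d_in + d_out * r


D_OUT, D_IN = 4096, 4096   # hypothetical layer size
R, LORA_ALPHA = 8, 16      # values used in the LoraConfig of this notebook

full_params = D_OUT * D_IN
adapter_params = lora_param_count(D_OUT, D_IN, R)
fraction = adapter_params / full_params

# With the default LoRA scaling, the adapter update is multiplied by lora_alpha / r.
scaling = LORA_ALPHA / R

print(f"frozen weights in the layer: {full_params:,}")
print(f"trainable adapter weights:   {adapter_params:,} ({fraction:.2%})")
print(f"adapter scaling factor:      {scaling}")
```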

In [None]:
# LoRA parameters
peft_config = LoraConfig(r = 8,
                        lora_alpha = 16,
                        lora_dropout = 0.05,
                        bias = "none",
                        task_type = "CAUSAL_LM")

In [None]:
# Prepare the model to train
model = prepare_model_for_kbit_training(model)

In [None]:
# Wrap the quantized model with trainable LoRA adapters (the base weights stay frozen)
model = get_peft_model(model, peft_config=peft_config)

## Fine-tuning parameters

In [None]:
output_model = 'adjusted_model'

In [None]:
# Train arguments
training_arguments = TrainingArguments(output_dir = output_model,
                                       per_device_train_batch_size = 1,
                                       gradient_accumulation_steps = 4,
                                       optim = "paged_adamw_32bit",
                                       learning_rate = 2e-4,
                                       lr_scheduler_type = "cosine",
                                       save_strategy = "epoch",
                                       logging_steps = 10,
                                       num_train_epochs = 3,  # Ignored while max_steps is positive
                                       max_steps = 150,       # Takes precedence over num_train_epochs
                                       fp16 = True)
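
The arguments above determine how much data the run actually sees. The sketch below works through the arithmetic, assuming a single GPU and the 1,000-row training subset selected earlier.

```python
PER_DEVICE_BATCH = 1
GRAD_ACCUM_STEPS = 4
NUM_DEVICES = 1        # assumption: single-GPU run
MAX_STEPS = 150
TRAIN_SIZE = 1000      # rows selected for train_data

# Samples contributing to each optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES

# Total samples processed before training stops at MAX_STEPS.
samples_seen = effective_batch * MAX_STEPS
epochs_covered = samples_seen / TRAIN_SIZE

print(f"effective batch size: {effective_batch}")
print(f"samples seen:         {samples_seen}")
print(f"epochs covered:       {epochs_covered:.2f}")
```

Under these assumptions the run stops before completing one pass over the training subset, so a checkpoint strategy tied to epoch boundaries may never trigger during training.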

In [None]:
# Make sure the model is on the GPU (with device_map = "auto" it is usually placed there already)
model = model.to("cuda")

In [None]:
# Creates the Trainer
# Optimized for fine-tuning pre-trained models with smaller datasets on supervised learning tasks.
trainer = SFTTrainer(model = model,
                     peft_config = peft_config,
                    #  max_seq_length = 512,
                     tokenizer = tokenizer,
                    #  packing = True,
                     formatting_func = create_prompt,
                     args = training_arguments,
                     train_dataset = train_data,
                     eval_dataset = test_data)

## Fine-tuning training

In [None]:
%%time
trainer.train()

In [None]:
# Save the fine-tuned LoRA adapter
trainer.save_model('final_model')

In [None]:
# Merge the LoRA weights into the base model and remove the adapter wrappers
merged_model = model.merge_and_unload()

## Building the Text Generation Pipeline with LangChain

In [None]:
# Create pre-prompt with the instruction
pre_prompt = """[INST] <<SYS>>\nAnalyze the question and answer with the best option.\n"""

# Create the prompt adding the input
prompt = pre_prompt + "Here is my question {context}" + "[/INST]"

# Create the prompt template with LangChain
prompt = PromptTemplate(template = prompt, input_variables=["context"])

Pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract away most of the complex code in the library, providing a simple API dedicated to a variety of tasks, including named entity recognition, masked language modeling, sentiment analysis, feature extraction, and question answering.

In [None]:
# Create the pipeline object
pipe = pipeline("text-generation",
                 model = merged_model,
                 tokenizer = tokenizer,
                 max_new_tokens = 512,
                 use_cache = False,
                 do_sample = True,
                 pad_token_id = tokenizer.eos_token_id,
                 top_p = 0.7,
                 temperature = 0.5)
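
The `temperature` and `top_p` arguments shape how the next token is sampled. The following is a minimal pure-Python sketch of both ideas, assuming a tiny made-up vocabulary of four tokens. It illustrates the concept only and is not the Transformers implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens

nucleus_cold = top_p_filter(softmax_with_temperature(logits, 0.5), top_p=0.7)
nucleus_warm = top_p_filter(softmax_with_temperature(logits, 1.0), top_p=0.7)

print("temperature 0.5 keeps tokens:", sorted(nucleus_cold))
print("temperature 1.0 keeps tokens:", sorted(nucleus_warm))
```

With these made-up logits, the lower temperature concentrates enough probability on the top token that it alone satisfies `top_p = 0.7`, while the higher temperature leaves two candidates to sample from.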

In [None]:
# Create the Hugging Face Pipeline
llm_pipeline = HuggingFacePipeline(pipeline = pipe)

## Creating the LLM Chain

In [None]:
# Create the conversation memory (the prompt has no {history} placeholder, so past turns are stored but not injected)
memory = ConversationBufferMemory()

In [None]:
# Create the LLM Chain
chat_llm_chain = LLMChain(llm = llm_pipeline,
                          prompt = prompt,
                          verbose = False,
                          memory = memory)

## Deploying the Model and Using the Question and Answer System

In [None]:
context = '''###Question: All of the following provisions are included in the Primary health care according to the Alma Ata declaration except:
###Options:
A. Adequate supply of safe drinking water
B. Nutrition
C. Provision of free medicines
D. Basic sanitation'''

In [None]:
%%time
response = chat_llm_chain.predict(context = context)
print(response)