# Fine-Tuning a Quantized LLaMA Model with QLoRA

This notebook demonstrates:

1. Converting an Excel file to JSON format.
2. Loading a LLaMA model from Hugging Face with 4-bit quantization (update the model ID/path accordingly).
3. Fine-tuning the model using QLoRA with the PEFT library.
4. Chatting with the fine-tuned model through a simple ipywidgets interface.

Make sure to update file paths and model names as needed.

In [1]:
# Install dependencies
!pip install pandas openpyxl transformers accelerate peft bitsandbytes datasets huggingface_hub




[notice] A new release of pip is available: 24.3.1 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


## 1. Convert Excel to JSON

This cell reads an Excel file and converts it into a JSON file. Update the file paths as necessary.

In [2]:
import pandas as pd
import json

def excel_to_json(excel_file, output_json):
    # Read the Excel file into a pandas DataFrame
    df = pd.read_excel(excel_file)
    
    # Convert DataFrame to a list of dictionaries
    records = df.to_dict(orient='records')
    
    # Write the list of dictionaries to a JSON file
    with open(output_json, 'w', encoding='utf-8') as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

# Example usage: replace these paths with your own
excel_file = r'D:\LLM\SOP_Log_Verification_Dataset.xlsx'  # raw string so backslashes are not read as escape sequences
output_json = 'data.json'  # desired output JSON file
excel_to_json(excel_file, output_json)
print(f"Converted {excel_file} to {output_json}")

Converted D:\LLM\SOP_Log_Verification_Dataset.xlsx to data.json
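
Before training, it can help to confirm that every record in the JSON file carries the fields the tokenization step reads. The following is a minimal sketch, assuming the `Instruction` and `Input` column names used in the fine-tuning cell later in this notebook. `find_incomplete_records` and `check_json_file` are hypothetical helpers, not part of pandas or datasets. Empty Excel cells become `NaN` in pandas, and `json.dump` writes them as `NaN`, so the check treats those as missing.

```python
import json
import math

# Assumption: these are the field names read by the tokenization cell.
REQUIRED_FIELDS = ("Instruction", "Input")

def find_incomplete_records(records, required=REQUIRED_FIELDS):
    """Return indices of records with a missing, empty, or NaN required field."""
    bad = []
    for i, rec in enumerate(records):
        for field in required:
            value = rec.get(field)
            is_nan = isinstance(value, float) and math.isnan(value)
            is_blank = isinstance(value, str) and not value.strip()
            if value is None or is_nan or is_blank:
                bad.append(i)
                break  # one problem is enough to flag this record
    return bad

def check_json_file(path, required=REQUIRED_FIELDS):
    """Load a JSON file written by excel_to_json and report incomplete records."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return find_incomplete_records(records, required)
```

Calling `check_json_file("data.json")` returns an empty list when every record is complete.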


## 2. Download and Load the Quantized LLaMA Model

In this cell, we load a LLaMA model and quantize it at load time. **Important:**
- Replace `MODEL_NAME_OR_PATH` with the model ID or local path of a LLaMA model you have access to (for example, a gated model on Hugging Face).
- The model is loaded in 4-bit mode using bitsandbytes.

If you do not have a local copy, set `MODEL_NAME_OR_PATH` to the model's Hugging Face Hub ID (if available) and it will be downloaded on first load.
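
As a rough guide to why 4-bit loading matters on limited VRAM, the memory needed for the weights alone can be estimated from the parameter count. This is a back-of-the-envelope sketch, assuming about 3 billion parameters for the 3B model and ignoring quantization constants, activations, and optimizer state. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 3_000_000_000  # assumption: roughly 3B parameters

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Real usage will be higher than these figures, since some modules stay in higher precision and training adds gradients and optimizer state for the adapter.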

In [2]:
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu118
Note: you may need to restart the kernel to use updated packages.





In [3]:
pip install ipywidgets

Note: you may need to restart the kernel to use updated packages.





In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Check CUDA availability
print("CUDA available:", torch.cuda.is_available())

# Option 1: Interactive login (this will prompt you to enter your token)
from huggingface_hub import notebook_login
notebook_login()  # Follow the prompt to enter your Hugging Face access token

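# Option 2: Alternatively, read your token from the HF_TOKEN environment variable.
# Never hard-code an access token in a notebook; revoke any token that has been shared.
import os
ACCESS_TOKEN = os.environ.get("HF_TOKEN")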

# Set your model name or path. Replace with your model ID/path.
MODEL_NAME_OR_PATH = "meta-llama/Llama-3.2-3B"  # update as needed

# Load the tokenizer (using your token for gated access)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME_OR_PATH, use_auth_token=True)  # or use: use_auth_token=ACCESS_TOKEN
tokenizer.pad_token = tokenizer.eos_token  

try:
    # Load the model in 4-bit mode (using bitsandbytes)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME_OR_PATH,
        use_auth_token=True,  # or use: use_auth_token=ACCESS_TOKEN
        load_in_4bit=True,             # Enable 4-bit quantization (deprecated in favor of quantization_config=BitsAndBytesConfig(...))
        torch_dtype=torch.float16,     # Use float16 for the non-quantized modules
        device_map="auto"              # Automatically place the model on available GPU(s)
    )
    print("Model and tokenizer loaded successfully.")
except Exception as e:
    print("Error loading model:", e)
    print("\nIf you are on Windows or encountering CUDA detection issues with bitsandbytes,")
    print("please consider installing the multi-platform version of bitsandbytes as described at:")
    print("https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend")


CUDA available: True



The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model and tokenizer loaded successfully.


Token has not been saved to git credential helper.


## 3. QLoRA Fine-Tuning

This cell fine-tunes the quantized LLaMA model using QLoRA with the PEFT library. Update the dataset path and training parameters as needed.

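The example assumes you have a JSON dataset file (e.g., the output from the Excel-to-JSON conversion) whose records contain `Instruction` and `Input` fields; the tokenization function joins them into the training text. Adjust the field names to match your dataset.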
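
To see why LoRA keeps training cheap, it helps to count the parameters it adds. For a frozen weight matrix of shape d_out × d_in, a rank-r adapter trains two small matrices with r·(d_in + d_out) parameters in total, and the adapter's output is scaled by `lora_alpha / r`. The sketch below uses the `lora_r = 8` and `lora_alpha = 32` values from the cell that follows. The 3072 × 3072 layer shape is an illustrative assumption, not read from the model, and both functions are hypothetical helpers.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

def lora_scaling(lora_alpha, r):
    """Factor applied to the adapter output (standard LoRA scaling)."""
    return lora_alpha / r

d_in = d_out = 3072  # illustrative layer shape (assumption)
full = d_in * d_out
added = lora_param_count(d_in, d_out, r=8)

print(f"frozen weights: {full:,}")
print(f"LoRA weights:   {added:,} ({added / full:.2%} of the layer)")
print(f"scaling factor: {lora_scaling(32, 8)}")
```

Raising `lora_r` increases adapter capacity and memory linearly, while `lora_alpha` only changes how strongly the adapter's update is weighted.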

In [11]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM,
    AutoConfig,
    TrainingArguments, 
    Trainer,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# --------------------------
# Configuration for LoRA on an 8-bit Model (QLoRA proper uses 4-bit NF4)
# --------------------------
MODEL_NAME_OR_PATH = "meta-llama/Llama-3.2-3B"
OUTPUT_DIR = "./qlora_llama3b_finetuned"

# LoRA hyperparameters
lora_r = 8
lora_alpha = 32
lora_dropout = 0.05

# Training hyperparameters (reduced for limited VRAM)
NUM_EPOCHS = 3
BATCH_SIZE = 1
GRAD_ACC_STEPS = 1
LEARNING_RATE = 2e-4
MAX_SEQ_LENGTH = 512  # Reduced further for memory savings

# --------------------------
# BitsAndBytes Configuration for 8-bit Quantization
# --------------------------
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
    # Remove CPU offloading to keep everything on one device
)

# --------------------------
# Load Tokenizer and Model
# --------------------------
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME_OR_PATH, use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token  # Ensure pad token is set

# Force everything to be on the GPU
device_map = {'': torch.cuda.current_device()}

# Load the model with 8-bit quantization, keeping everything on GPU
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME_OR_PATH,
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    device_map=device_map,  # Force all components to same device
    use_auth_token=True
)

# Prepare model for k-bit training (QLoRA)
model = prepare_model_for_kbit_training(model)

# --------------------------
# Configure LoRA
# --------------------------
lora_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    bias="none",
    task_type="CAUSAL_LM"
)

# Wrap the model with LoRA modifications
model = get_peft_model(model, lora_config)

# --------------------------
# Load and Tokenize Dataset
# --------------------------
DATASET_PATH = "data.json"  # Update with your dataset file path
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")

# Tokenization function
def tokenize_function(examples):
    prompts = [instr + "\n" + inp for instr, inp in zip(examples["Instruction"], examples["Input"])]
    return tokenizer(
        prompts,
        truncation=True,
        max_length=MAX_SEQ_LENGTH,  # Adjusted to reduce memory usage
        padding="max_length"
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True)

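# Set labels for causal language modeling; padding positions get -100 so the loss ignores them
def set_labels(examples):
    examples["labels"] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(examples["input_ids"], examples["attention_mask"])
    ]
    return examples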

tokenized_dataset = tokenized_dataset.map(set_labels, batched=True)

# --------------------------
# Training Configuration and Execution
# --------------------------
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,  # Lowered batch size
    gradient_accumulation_steps=GRAD_ACC_STEPS,  # Lowered accumulation steps
    learning_rate=LEARNING_RATE,
    fp16=True,
    optim="paged_adamw_8bit",
    evaluation_strategy="no",
    save_strategy="epoch",
    logging_steps=10,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)

# Start training
trainer.train()

# Save the fine-tuned model and tokenizer
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

print(f"Model fine-tuned and saved to {OUTPUT_DIR}")


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]



ValueError: You can't train a model that has been loaded in 8-bit or 4-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device()}` or `device_map={'':torch.xpu.current_device()}`
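
One way to start investigating this error is to inspect where the model's modules were actually placed. Models loaded with a `device_map` expose the resulting placement as `model.hf_device_map`, a dictionary from module name to device. The sketch below works on such a dictionary. `summarize_device_map` and `is_single_gpu` are hypothetical helpers, and this is a diagnostic aid under those assumptions rather than a confirmed fix for the error above.

```python
def summarize_device_map(device_map):
    """Group module names by the device they were placed on."""
    placement = {}
    for module_name, device in device_map.items():
        placement.setdefault(str(device), []).append(module_name)
    return placement

def is_single_gpu(device_map):
    """True if every module sits on one device and none is on CPU or disk."""
    devices = {str(d) for d in device_map.values()}
    return len(devices) == 1 and not devices & {"cpu", "disk"}

# Example with a hand-written map; in the notebook you would pass model.hf_device_map.
example_map = {"": 0}
print(summarize_device_map(example_map))
print("single GPU:", is_single_gpu(example_map))
```

If the summary shows more than one device, or `cpu`/`disk` entries, the placement differs from what the training cell intended with `device_map = {'': torch.cuda.current_device()}`.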

In [9]:
import torch
import ipywidgets as widgets
from IPython.display import display
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# --------------------------
# Configuration for Inference
# --------------------------
BASE_MODEL_NAME = "meta-llama/Llama-3.2-3B"  # Base model used during fine-tuning
OUTPUT_DIR = "./qlora_llama3b_finetuned"      # Directory where your fine-tuned adapter is saved
MAX_SEQ_LENGTH = 512

# BitsAndBytes Configuration for 4-bit Quantization (Q4)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# Explicitly assign the model to GPU 0.
device_map = {"": 0}

# --------------------------
# Load the Model and Tokenizer
# --------------------------
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_NAME,
    quantization_config=quantization_config,
    device_map=device_map,
    torch_dtype=torch.float16,
    use_auth_token=True  # Replace or remove if not needed
)
model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
model.eval()  # device_map already placed the quantized model on the GPU, so no .to("cuda") is needed

tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR, use_auth_token=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# --------------------------
# Generation Function (Using Token Slicing)
# --------------------------
def generate_response(prompt):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=MAX_SEQ_LENGTH).to("cuda")
    input_length = inputs.input_ids.shape[1]  # Number of tokens in the prompt
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=150,
            do_sample=True,
            temperature=0.7,
            top_p=0.95,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    # Only decode tokens generated after the prompt
    generated_tokens = output_ids[0][input_length:]
    return tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()

# --------------------------
# Setup ipywidgets for a Plain-Text Chatbox Interface
# --------------------------
chat_display = widgets.Textarea(
    value="Conversation:\n",
    placeholder="Conversation history...",
    description="Chat:",
    disabled=True,
    layout=widgets.Layout(width="100%", height="300px")
)

user_input_widget = widgets.Text(
    placeholder="Type your message here...",
    description="User:",
    disabled=False,
)

# Global conversation history (plain text)
conversation_history = "Conversation:\n"

def on_submit(sender):
    global conversation_history
    user_text = sender.value.strip()
    if user_text.lower() == "exit":
        sender.disabled = True
        chat_display.value += "\nChat session ended."
        return
    # Append the user's message to the conversation history
    conversation_history += "User: " + user_text + "\n"
    # Create a prompt for the assistant
    prompt = conversation_history + "Assistant: "
    reply = generate_response(prompt)
    conversation_history += "Assistant: " + reply + "\n"
    chat_display.value = conversation_history
    sender.value = ""

user_input_widget.on_submit(on_submit)
display(chat_display, user_input_widget)


Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.14s/it]


ValueError: Can't find 'adapter_config.json' at './qlora_llama3b_finetuned'
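
This error means `OUTPUT_DIR` holds no saved adapter, most likely because the training cell stopped with a `ValueError` before `trainer.save_model` ran. A small guard can make that condition explicit before loading. This is a minimal sketch that checks only for the `adapter_config.json` file named in the error message; `has_saved_adapter` is a hypothetical helper.

```python
import os

def has_saved_adapter(adapter_dir):
    """True if adapter_dir contains the adapter_config.json that the loader looks for."""
    return os.path.isfile(os.path.join(adapter_dir, "adapter_config.json"))

OUTPUT_DIR = "./qlora_llama3b_finetuned"
if not has_saved_adapter(OUTPUT_DIR):
    print(f"No adapter found in {OUTPUT_DIR}; run the training cell to completion first.")
```

Placing this check ahead of `PeftModel.from_pretrained` avoids loading the base model only to fail on the missing adapter.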