<a href="https://colab.research.google.com/github/Lakshmiec/llama3-faq-classifier/blob/main/LLama_Finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 Fine-Tuning Llama-3 for SME Customer Support Logic
  
**Project Objective:** Transform a general Llama-3 8B model into a specialized assistant for customer service agents. The model classifies customer inquiries against existing FAQs to identify semantic similarity, enabling automated response suggestions.

---
## 🛠️ Technical Stack

* **Model:** Llama-3 8B (Quantized to 4-bit)
* **Technique:** QLoRA (Quantized Low-Rank Adaptation)
* **Frameworks:** Unsloth (for efficient training), Hugging Face TRL (for SFT), Neptune.ai (for tracking)
* **Environment:** Google Colab (Nvidia T4 GPU)

### Hardware & Optimization Architecture:

**Optimization for Google Colab (Nvidia T4 GPU)**

Training an 8-billion-parameter model normally requires high-end data-center GPUs (such as the A100 or H100). This project is instead tuned to run on a free-tier `Google Colab T4 GPU (16GB VRAM)` using three efficiency techniques:

1. **Unsloth Optimization (The Accelerator)**
   We use the Unsloth library, which provides hand-written OpenAI Triton kernels. According to Unsloth, this allows for:
   - **2x Faster Training:** Reducing the time spent on the GPU.
   - **70% Less Memory Usage:** Helping to avoid "Out of Memory" (OOM) errors during the fine-tuning of Llama-3.

2. **4-bit Quantization (QLoRA)**
   To fit the model into 16GB of VRAM, we load Llama-3 in 4-bit precision using bitsandbytes.
   - **Base Model:** Frozen at 4-bit.
   - **Trainable Adapters:** Only the small LoRA matrices $B$ and $A$ in $W = W_0 + BA$ are updated, in higher precision (FP16 or BF16), which keeps most of the base model's quality.

3. **Chain-of-Thought (CoT) Synthetic Reasoning**
   Instead of simple "Yes/No" labels, we enriched the dataset with GPT-4 generated explanations.
   - **The Benefit:** Training the model to predict the explanation before the label is intended to reduce hallucinations and ground the classification in explicit reasoning. This matters for SMEs, where accuracy in policy explanation is just as important as the classification itself.
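
To make the adapter update in item 2 concrete, the standard LoRA formulation can be written out as below. The dimensions $d_{out}$, $d_{in}$ and the input $x$ are notation introduced here for illustration; $r$ and $\alpha$ correspond to the `lora_r` and `lora_alpha` settings used later in this notebook.

$$
h = W_0 x + \frac{\alpha}{r}\, B A x, \qquad W_0 \in \mathbb{R}^{d_{out} \times d_{in}},\quad B \in \mathbb{R}^{d_{out} \times r},\quad A \in \mathbb{R}^{r \times d_{in}}
$$

$W_0$ stays frozen in 4-bit and only $A$ and $B$ receive gradients, so each adapted matrix adds $r\,(d_{in} + d_{out})$ trainable parameters instead of $d_{in} \cdot d_{out}$. With $r = \alpha = 16$ the scaling factor is 1, which reduces to the $W = W_0 + BA$ form above.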

## 1. Setup and Installations

In [19]:
!pip install weave

Collecting weave
  Downloading weave-0.52.26-py3-none-any.whl.metadata (25 kB)
Collecting diskcache==5.6.3 (from weave)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Collecting gql>=3.0.0 (from gql[httpx]>=3.0.0->weave)
  Downloading gql-4.0.0-py3-none-any.whl.metadata (10 kB)
Collecting polyfile-weave>=0.5.9 (from weave)
  Downloading polyfile_weave-0.5.9-py3-none-any.whl.metadata (8.5 kB)
Collecting graphql-core<3.3,>=3.2 (from gql>=3.0.0->gql[httpx]>=3.0.0->weave)
  Downloading graphql_core-3.2.7-py3-none-any.whl.metadata (11 kB)
Collecting backoff<3.0,>=1.11.1 (from gql>=3.0.0->gql[httpx]>=3.0.0->weave)
  Downloading backoff-2.2.1-py3-none-any.whl.metadata (14 kB)
Collecting abnf~=2.2.0 (from polyfile-weave>=0.5.9->weave)
  Downloading abnf-2.2.0-py3-none-any.whl.metadata (1.1 kB)
Collecting cint>=1.0.0 (from polyfile-weave>=0.5.9->weave)
  Downloading cint-1.0.0-py3-none-any.whl.metadata (511 bytes)
Collecting fickling>=0.0.8 (from polyfile-weave>=0.5.9->weave)
 

In [3]:
# Install Unsloth for 2x faster training and 70% less memory usage
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes
!pip install neptune
# !pip install trl==0.7.10
!pip install scikit-learn
!pip install python-dotenv

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-hrveqfn7/unsloth_1dfc41cf89be4a5ba2a877e31657eab9
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-hrveqfn7/unsloth_1dfc41cf89be4a5ba2a877e31657eab9
  Resolved https://github.com/unslothai/unsloth.git to commit ecd584a9167e1637b0a0e916af3c5b88690e24fb
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting trl!=0.19.0,<=0.24.0,>=0.18.2 (from unsloth_zoo>=2026.2.1->unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Using cached trl-0.24.0-py3-none-any.whl.metadata (11 kB)
Using cached trl-0.24.0-py3-none-any.whl (423 kB)
Inst

In [None]:
!pip uninstall -y transformers trl peft unsloth
!pip install transformers==4.36.2
!pip install trl==0.7.10
!pip install peft==0.7.1
!pip install unsloth

## 2. Load the Model

#### Load the Quantized Base Model




We use the **4-bit quantized** version of Llama-3 so that the model fits within the roughly 15GB of VRAM the T4 GPU actually exposes (16GB nominal). This cuts the memory footprint substantially, typically at only a small cost in output quality.
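
As a rough sanity check on why 4-bit loading is needed, the sketch below estimates the memory taken by the base weights alone at two precisions. The two parameter counts are the ones printed by `print_trainable_parameters()` later in this notebook; the helper `weight_memory_gb` is a hypothetical name, and the estimate ignores activations, optimizer state, and quantization metadata.

```python
# Back-of-the-envelope weight memory for the base model at different precisions.
# ALL_PARAMS and LORA_PARAMS are the counts reported later in this notebook;
# everything else here is illustrative.

ALL_PARAMS = 8_072_204_288   # base model + LoRA adapters
LORA_PARAMS = 41_943_040     # trainable adapter weights
BASE_PARAMS = ALL_PARAMS - LORA_PARAMS


def weight_memory_gb(n_params: int, bits_per_param: float) -> float:
    """Memory needed to store n_params weights, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(BASE_PARAMS, 16)
int4_gb = weight_memory_gb(BASE_PARAMS, 4)

print(f"base parameters: {BASE_PARAMS:,}")
print(f"16-bit weights : {fp16_gb:.1f} GB")
print(f"4-bit weights  : {int4_gb:.1f} GB")
```

The 16-bit figure alone already exceeds what the T4 offers. The downloaded checkpoint (5.70G in the output below) is larger than the 4-bit estimate, presumably because some tensors are stored at higher precision and the quantization constants add overhead.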

In [4]:
from unsloth import FastLanguageModel
import torch

model_parameters = {
   'model_name' : 'unsloth/llama-3-8b-bnb-4bit',
   'model_dtype' : None ,
   'model_load_in_4bit' : True,
   'model_max_seq_length' : 2048
}


model, tokenizer = FastLanguageModel.from_pretrained(
   model_name = model_parameters['model_name'],
   max_seq_length = model_parameters['model_max_seq_length'],
   dtype = model_parameters['model_dtype'],  # None lets Unsloth pick fp16 or bf16 for the GPU
   load_in_4bit = model_parameters['model_load_in_4bit'],
)


Please restructure your imports with 'import unsloth' at the top of your file.
  module = original_import(name, globals, locals, fromlist, level)


🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2026.2.1: Fast Llama patching. Transformers: 4.57.6.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.563 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/198 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

## 3. LoRA Configuration

#### Parameter-Efficient Fine-Tuning (LoRA)
Instead of updating all 8 billion parameters, we add small "adapter" matrices. This allows us to train the model significantly faster and with much lower hardware requirements.

In [5]:
lora_parameters = {
   'lora_r': 16, # Rank: Size of the adapters
   'target_modules': ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
   'lora_alpha': 16, # Scaling factor for the new weights
   'lora_dropout': 0,
   'lora_bias': "none",
   'lora_use_gradient_checkpointing': "unsloth",
   'lora_random_state': 42,
}

With this configuration, we can instantiate the model:

In [6]:
model = FastLanguageModel.get_peft_model(
   model,
   r = lora_parameters['lora_r'],
   target_modules = lora_parameters['target_modules'],
   lora_alpha = lora_parameters['lora_alpha'],
   lora_dropout = lora_parameters['lora_dropout'],
   bias = lora_parameters['lora_bias'],
   use_gradient_checkpointing = lora_parameters['lora_use_gradient_checkpointing'],
   random_state = lora_parameters['lora_random_state'],
)

Unsloth 2026.2.1 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
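
To see where the trainable-parameter count comes from, the sketch below adds up the LoRA matrices for every module in `target_modules`. The layer shapes are the published Llama-3 8B dimensions, which is an assumption of this sketch since they are not printed anywhere in this notebook; `lora_params` is a hypothetical helper name.

```python
# Trainable LoRA parameters for one transformer layer, then for the whole model.
# Layer shapes are the published Llama-3 8B dimensions (assumed, not read from the model).

HIDDEN = 4096          # model width
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 14336   # MLP width
NUM_LAYERS = 32
RANK = 16              # lora_r

# (in_features, out_features) of every module listed in target_modules
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A is (rank x in_features) and B is (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS

print(f"per layer: {per_layer:,}")
print(f"total    : {total:,}")
```

Under these assumptions the total comes to 41,943,040, the same figure the trainer reports as trainable parameters further down.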


## 4. Data Preprocessing

#### Instruction Dataset Preprocessing
We use a sampled version of the **Quora Question Pairs (QQP)** dataset. To improve reasoning, we use an **Instruction-based format** that includes an explanation before the final label.

#### Dataset Enrichment: Synthetic Reasoning


To transform the raw data into a format suitable for Instruction Fine-Tuning, we perform two key steps:

**1. Synthetic Data Enrichment (The Teacher-Student Method)**

The original Quora Question Pairs (QQP) dataset consists of binary labels (0 for dissimilar, 1 for similar). However, simple binary labels do not provide the model with the "reasoning" required for complex SME financial or customer service tasks.

To solve this, we use a version of the dataset enriched via Synthetic Data Generation. In this process, a "Teacher Model" (GPT-4) was used to analyze each pair and generate a natural language Explanation. By training our "Student Model" (Llama-3) on these explanations, we apply Chain-of-Thought (CoT) style supervision, which is intended to improve accuracy and reduce hallucinations.

**2. Data Attribution**

This project utilizes the GLUE-QQP Sampled Explanation dataset.

> Source: borismartirosyan/glue-qqp-sampled-explanation

Processing: The dataset provides 1,000 training and 200 validation examples, a sample small enough to fine-tune on a free Google Colab T4 GPU in a reasonable amount of time.

In [7]:
from datasets import load_dataset
dataset = load_dataset('borismartirosyan/glue-qqp-sampled-explanation')

prompt_template = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Compare the following two questions and determine if they are semantically similar. Provide an explanation first, then the label (0 for dissimilar, 1 for similar).

Question 1: {}
Question 2: {}

### Response:
Explanation: {} Label: {}"""

EOS_TOKEN = tokenizer.eos_token
def format_prompts(examples):
    texts = []
    for q1, q2, exp, lab in zip(examples["question1"], examples["question2"], examples["explanation"], examples["label"]):
        text = prompt_template.format(q1, q2, exp, lab) + EOS_TOKEN
        texts.append(text)
    return { "text": texts }

dataset = dataset.map(format_prompts, batched=True)

processed_train_data.jsonl: 0.00B [00:00, ?B/s]

processed_valid_data.jsonl: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/200 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]
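
The response layout matters later, when the label has to be parsed back out of generated text. The sketch below shows that round trip on a made-up example: the explanation string is invented for illustration, and `extract_label` is a hypothetical helper, not part of the notebook.

```python
from typing import Optional

# Illustrative round trip: fill the response part of the template, then recover the label.
# The explanation text is made up for this sketch; it is not taken from the dataset.

RESPONSE_TEMPLATE = "Explanation: {} Label: {}"   # same layout as the end of prompt_template
EOS = "<|end_of_text|>"                           # end-of-text token of the Llama-3 base model


def extract_label(generated_text: str) -> Optional[str]:
    """Return '0' or '1' if the text ends with a well-formed label, else None."""
    tail = generated_text.split("Label:")[-1].replace(EOS, "").strip()
    return tail if tail in {"0", "1"} else None


response = RESPONSE_TEMPLATE.format("Both questions ask how to reset a password.", 1) + EOS
print(response)
print(extract_label(response))
```

Note that the split is case-sensitive, so the marker used for parsing has to match the `Label:` spelling in the template exactly.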

## 5. Setting up the neptune.ai experiment tracker

In [14]:
import os
from dotenv import load_dotenv
load_dotenv() # This looks for the .env file and loads the variables

# Now you can use them safely
project = os.getenv("NEPTUNE_PROJECT")
api_token = os.getenv("NEPTUNE_API_TOKEN")
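
Because `os.getenv` silently returns `None` for a missing variable, a small guard can surface configuration problems before training starts. The helper below is a hypothetical sketch (the name `require_env` is not from Neptune or this notebook) that fails early with a readable message.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value


# Example usage (uncomment once the .env file has been loaded):
# project = require_env("NEPTUNE_PROJECT")
# api_token = require_env("NEPTUNE_API_TOKEN")
```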


## 6. Monitoring and configuring the fine-tuning

Training a model in a "black box" is risky. To ensure project success, we persist all training metadata to Neptune.ai. This allows us to track the Training Loss and Validation Loss in real-time.

**Why we track these metrics:**

*Training Loss*: Measures how well the model is fitting the training data.

*Validation Loss*: Measures how well the model generalizes to new, unseen data.

The Goal: We want to see both losses decrease. If Training Loss drops while Validation Loss rises, it is a clear signal to stop training to prevent overfitting.
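
The stopping rule described above can be sketched in a few lines. The losses are the first ten validation values reported in this notebook's training output; the `patience` parameter and both function names are assumptions made for this illustration.

```python
# Validation losses from the first ten evaluations reported in this notebook.
eval_log = [
    (10, 0.583669), (20, 0.44553), (30, 0.406999), (40, 0.39954), (50, 0.403326),
    (60, 0.398081), (70, 0.3981), (80, 0.398613), (90, 0.394622), (100, 0.395558),
]


def best_checkpoint(log):
    """Return the (step, loss) pair with the lowest validation loss."""
    return min(log, key=lambda item: item[1])


def should_stop(log, patience=3):
    """True when none of the last `patience` evaluations beat the earlier best."""
    if len(log) <= patience:
        return False
    best_before = min(loss for _, loss in log[:-patience])
    return all(loss >= best_before for _, loss in log[-patience:])


print(best_checkpoint(eval_log))
print(should_stop(eval_log))
```

The `transformers` library also ships an `EarlyStoppingCallback` that applies the same idea inside the trainer; this notebook does not use it.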


**Training Strategy & Hyperparameters**

We configure the TrainingArguments to pass into the SFTTrainer (from the TRL library). These settings define how the model learns and how we monitor that learning:

- `eval_strategy="steps"`: We evaluate our model every 10 steps (eval_steps=10) rather than waiting for an entire epoch. This provides immediate feedback on whether the model is generalizing or overfitting.

- `logging_steps=1`: We log training metadata at every single step to ensure a granular, high-resolution view of the loss curve in Neptune.

- `optim="adamw_8bit"`: We select an 8-bit optimizer, which stores the optimizer states in 8 bits instead of 32 and so cuts optimizer-state memory by about 75%, helping a larger model fit on a single T4 GPU.

- `lr_scheduler_type="cosine"`: We use a Cosine Learning Rate Scheduler. While linear schedulers are common, cosine annealing decreases the learning rate smoothly toward zero, which often works well for transformers.

- `Mixed Precision (fp16/bf16)`: We activate 16-bit mixed-precision training to speed up calculations and further reduce the memory footprint. On the T4, which does not support bfloat16, this resolves to fp16.

- `Regularization (weight_decay=0.01)`: We apply AdamW's decoupled weight decay, which shrinks the weights slightly at every step, to discourage the model from becoming overly complex and "memorizing" the training set.
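
To visualize what the warmup and cosine settings do to the learning rate, here is a minimal sketch of a linear-warmup-then-cosine schedule. The defaults mirror the values used in this notebook (`learning_rate=2e-4`, `warmup_steps=5`, 1,000 total steps); `cosine_lr` is a hypothetical function, and the trainer's own scheduler may differ in details.

```python
import math


def cosine_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=1000):
    """Linear warmup from zero, then half-cosine decay back to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Print the learning rate at a few points along the run
for step in (0, 5, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step):.2e}")
```

The rate peaks right after warmup and then falls slowly at first, faster in the middle of training, and slowly again near the end.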

In [15]:
from unsloth import is_bfloat16_supported  # import unsloth before trl/transformers so its patches apply
from trl import SFTTrainer
from transformers import TrainingArguments





training_arguments = {
   # Tracking parameters
   'eval_strategy' : "steps",
   'eval_steps': 10,
   'logging_strategy' : "steps",
   'logging_steps': 1,
   'save_strategy' : "epoch",

   # Training parameters
   'per_device_train_batch_size' : 2,
   'num_train_epochs' : 2,
   'optim' : "adamw_8bit",
   'fp16' : not is_bfloat16_supported(),
   'bf16' : is_bfloat16_supported(),
   'warmup_steps' : 5,
   'learning_rate' : 2e-4,
   'lr_scheduler_type' : "cosine",
   'weight_decay' : 0.01,

   'seed' : 3407,
   'output_dir' : "outputs",

}
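
These settings determine how many optimizer steps and evaluation passes the run performs. The arithmetic below is a rough sketch: the dataset size, gradient accumulation value, and per-evaluation time are taken from the outputs shown elsewhere in this notebook, and it assumes every evaluation pass takes about as long as the last one.

```python
import math

NUM_TRAIN_EXAMPLES = 1000   # size of the train split
PER_DEVICE_BATCH = 2        # per_device_train_batch_size
GRAD_ACCUM_STEPS = 1        # value reported in the trainer banner
NUM_EPOCHS = 2              # num_train_epochs
EVAL_STEPS = 10             # eval_steps
SECONDS_PER_EVAL = 57.35    # eval/runtime from the final run summary (assumed constant)

steps_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / (PER_DEVICE_BATCH * GRAD_ACCUM_STEPS))
total_steps = steps_per_epoch * NUM_EPOCHS
num_evals = total_steps // EVAL_STEPS
eval_seconds = num_evals * SECONDS_PER_EVAL

print(f"steps per epoch : {steps_per_epoch}")
print(f"total steps     : {total_steps}")
print(f"evaluation runs : {num_evals}")
print(f"time in eval    : {eval_seconds:.0f} s")
```

Under that assumption, evaluation would account for roughly four fifths of the 7,026 s `train_runtime` reported at the end of training, so a larger `eval_steps` is one way to shorten the run if less frequent feedback is acceptable.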

##### Initialize the Run

In [17]:
import neptune

run = neptune.init_run()

# Log your hyperparameters so you remember them later
params = {**lora_parameters, **model_parameters, **training_arguments}
run["parameters"] = params

[neptune] [info   ] Neptune initialized. Open in the app: https://app.neptune.ai/projectceks07cs/LLama-Finetuning/e/LLAM-5


## 7. Launching a training run

#### Supervised Fine-Tuning (SFT)
We launch the training loop using the `SFTTrainer`, with the **AdamW 8-bit optimizer** to save memory and a **Cosine learning rate scheduler** to decay the learning rate smoothly.

In [18]:
trainer = SFTTrainer(
   model = model,
   tokenizer = tokenizer,
   train_dataset = dataset['train'],
   eval_dataset = dataset['validation'],
   dataset_text_field = "text",
   max_seq_length = model_parameters['model_max_seq_length'],
   dataset_num_proc = 2,
   packing = False,
   args = TrainingArguments(
      **training_arguments
   ),
)

trainer.model.print_trainable_parameters()
trainer.train()

trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 2 | Total steps = 1,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


[neptune] [info   ] Neptune initialized. Open in the app: https://app.neptune.ai/projectceks07cs/LLama-Finetuning/e/LLAM-6


wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice:

 1


wandb: You chose 'Create a W&B account'
wandb: Create an account here: https://wandb.ai/authorize?signup=true&ref=models
wandb: After creating your account, create a new API key and store it securely.
wandb: Paste your API key and hit enter:

 ··········


wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: projectceks07cs (projectceks07cs-) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


wandb: Detected [huggingface_hub.inference, openai] in use.
wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/


| Step | Training Loss | Validation Loss |
|---|---|---|
| 10 | 0.7431 | 0.583669 |
| 20 | 0.402 | 0.44553 |
| 30 | 0.4352 | 0.406999 |
| 40 | 0.4531 | 0.39954 |
| 50 | 0.4119 | 0.403326 |
| 60 | 0.4647 | 0.398081 |
| 70 | 0.2764 | 0.3981 |
| 80 | 0.3578 | 0.398613 |
| 90 | 0.3639 | 0.394622 |
| 100 | 0.3418 | 0.395558 |




[neptune] [info   ] Shutting down background jobs, please wait a moment...
[neptune] [info   ] Done!
[neptune] [info   ] Waiting for the remaining 6 operations to synchronize with Neptune. Do not kill this process.
[neptune] [info   ] All 6 operations synced, thanks for waiting!
[neptune] [info   ] Explore the metadata in the Neptune app: https://app.neptune.ai/projectceks07cs/LLama-Finetuning/e/LLAM-6/metadata


| Metric | History |
|---|---|
| eval/loss | ▇▆▆▅▄▂▃▃▂▂▂▂▁▁▁▂▄▅▆▆▇▇▆▇▇▇██▇▇█▇▇▇▇▇▇▇▇▇ |
| eval/runtime | ▁▄▅█▅▅▅▅▅▅▅▄▅▅▅▅▅▄▅▅▅▅▄▅▅▄▅▄▅▅▅▄▄▅▄▄▅▄▅▅ |
| eval/samples_per_second | █▆▅▄▂▄▄▅▅▄▅▄▅▄▄▅▅▅▅▅▅▅▅▅▄▅▁▅▅▅▅▅▅▅▅▅▅▅▅▅ |
| eval/steps_per_second | ██▄▁▁▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ |
| train/epoch | ▁▁▁▂▂▂▂▂▂▃▄▄▄▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇█████ |
| train/global_step | ▁▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇█ |
| train/grad_norm | █▂▂▁▅▂▁▄▂▂▂▂▁▁▃▄▁▃▂▂▁▁▃▂▂▄▂▃▅▄▃▃▃▃▄▃▃▄▃▅ |
| train/learning_rate | ███████████████▇▇▇▇▆▆▆▆▆▆▆▄▄▃▃▂▂▂▂▁▁▁▁▁▁ |
| train/loss | █▃▁▁▂▃▂▃▂▂▃▃▂▃▃▃▁▁▁▁▂▂▁▂▁▁▁▂▂▂▁▁▁▂▂▂▁▁▂▂ |

| Metric | Value |
|---|---|
| eval/loss | 0.40165 |
| eval/runtime | 57.3545 |
| eval/samples_per_second | 3.487 |
| eval/steps_per_second | 1.744 |
| total_flos | 1.4304151936499712e+16 |
| train/epoch | 2 |
| train/global_step | 1000 |
| train/grad_norm | 0.46281 |
| train/learning_rate | 0.0 |
| train/loss | 0.2499 |


TrainOutput(global_step=1000, training_loss=0.3478887019753456, metrics={'train_runtime': 7026.3892, 'train_samples_per_second': 0.285, 'train_steps_per_second': 0.142, 'total_flos': 1.4304151936499712e+16, 'train_loss': 0.3478887019753456, 'epoch': 2.0})

We initialize the `SFTTrainer` to manage the fine-tuning process. By setting `dataset_text_field` to our pre-processed `text` column, we ensure the model trains on the enriched instruction-explanation-label format.

We verify the configuration via `print_trainable_parameters()`, confirming that only **0.52%** of the model's parameters (the LoRA adapters) are being updated, which keeps training efficient.

Finally, `trainer.train()` executes the optimization loop, streaming metrics to Neptune as training progresses.

In [None]:
from tqdm import tqdm

# Enable Unsloth's faster inference mode
FastLanguageModel.for_inference(trainer.model)
trainer.model.to('cuda')

RESPONSE_MARKER = "### Response:\n"
LABEL_MARKER = "Label:"   # must match the casing used in prompt_template

predicted_classes = []
for example in tqdm(dataset['validation']):
   # Keep only the instruction part; the stored text also contains the gold explanation and label
   prompt = example['text'].split(RESPONSE_MARKER)[0] + RESPONSE_MARKER
   inputs = tokenizer(prompt, return_tensors='pt').to('cuda')

   outputs = trainer.model.generate(**inputs, max_new_tokens = 400, use_cache = True)

   # Decode only the newly generated tokens, dropping special tokens such as <|end_of_text|>
   generated = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
   possible_label = generated.split(LABEL_MARKER)[-1].strip()
   if possible_label in ('0', '1'):
     predicted_classes.append(possible_label)
   else:
     # Keep the raw generation so malformed outputs can be inspected later
     predicted_classes.append(generated)

y_pred = predicted_classes
y_true = [x['text'].split(LABEL_MARKER)[-1].replace(EOS_TOKEN, '').strip() for x in dataset['validation']]

  9%|▉         | 18/200 [01:04<18:45,  6.18s/it]
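
Once the loop finishes, `y_pred` and `y_true` can be compared. The sketch below is one simple way to do that in plain Python; `score_predictions` is a hypothetical helper, the sample lists are made up, and predictions that did not parse to a single label (stored as raw generated text) are counted as wrong.

```python
def score_predictions(y_true, y_pred):
    """Return (accuracy, number of predictions that are not a clean '0' or '1')."""
    valid_labels = {"0", "1"}
    unparsed = sum(1 for pred in y_pred if pred not in valid_labels)
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if pred == truth)
    accuracy = correct / len(y_true) if y_true else 0.0
    return accuracy, unparsed


# Made-up example values, not results from this notebook
example_true = ["1", "0", "1", "0"]
example_pred = ["1", "0", "0", "Explanation: the questions differ"]

accuracy, unparsed = score_predictions(example_true, example_pred)
print(f"accuracy: {accuracy:.2f}, unparsed outputs: {unparsed}")
```

Reporting the unparsed count alongside accuracy helps separate formatting failures from genuine misclassifications. scikit-learn, installed during setup, provides `accuracy_score` and `classification_report` for a fuller breakdown.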