# LLM Fine-Tuning with QLoRA (Senior AI Engineer Notebook)

This notebook demonstrates **responsible, cost-efficient fine-tuning of a Large Language Model (LLM)** using **PEFT (LoRA / QLoRA)** under realistic compute constraints (Kaggle GPU).

## 1. Environment Setup
Enable GPU in Kaggle settings before running.

In [1]:
!pip install -q transformers datasets peft bitsandbytes accelerate

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [3]:
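import torch

# get_device_name(0) raises when no GPU is attached, so guard the call.
cuda_ok = torch.cuda.is_available()
print('CUDA available:', cuda_ok)
print('GPU:', torch.cuda.get_device_name(0) if cuda_ok else 'none')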

CUDA available: True
GPU: Tesla T4


## 2. Load Tokenizer and Base Model (4-bit Quantized)

In [4]:
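from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_name = 'microsoft/phi-3-mini-4k-instruct'

# Quantization settings go through BitsAndBytesConfig; passing
# load_in_4bit=True directly to from_pretrained is deprecated.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    quantization_config=bnb_config,
    dtype=torch.float16,  # named torch_dtype in older transformers releases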
)
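
To make the memory saving concrete, the arithmetic below estimates the weight footprint at 16-bit versus 4-bit precision. It is a rough sketch: the parameter count is an assumption (the `all params` figure reported later in this notebook minus the LoRA weights), and the estimate ignores quantization constants, layers that stay unquantized, activations, and optimizer state.

```python
# Assumed base parameter count: 3,825,798,144 total minus 4,718,592 LoRA weights.
N_PARAMS = 3_821_079_552


def weight_gib(n_params: int, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


fp16_gib = weight_gib(N_PARAMS, 16)
int4_gib = weight_gib(N_PARAMS, 4)

print(f"fp16 weights : {fp16_gib:.2f} GiB")
print(f"4-bit weights: {int4_gib:.2f} GiB")
```

The 4-bit figure is a quarter of the fp16 figure by construction; the real footprint on the GPU will be somewhat higher for the reasons listed above.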


## 3. Baseline Inference

In [5]:
prompt = 'Explain customer churn in one concise sentence for a business audience.'
inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
out = base_model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Explain customer churn in one concise sentence for a business audience. Customer churn refers to the rate atquantifying the percentage of customers who stop using a company's services or products within a given time frame.


This metric is crucial for business


## 4. Create Instruction Dataset

In [6]:
from datasets import Dataset
import json

concepts = [
    "Customer churn",
    "Customer lifetime value",
    "Churn reduction strategy",
    "Customer acquisition cost",
    "Net promoter score",
    "Monthly recurring revenue",
    "Annual recurring revenue",
    "User retention",
    "Upselling strategy",
    "Cross-selling strategy",
    "Customer segmentation",
    "Market penetration",
    "Pricing optimization",
    "Product-market fit",
    "Customer onboarding",
    "Customer engagement",
    "Subscription renewal",
    "Revenue forecasting",
    "Sales funnel",
    "Lead conversion rate"
]

templates = [
    {
        "definition": "is the percentage of customers who stop using a company's product or service over a defined period.",
        "impact": "directly affects revenue stability and increases customer acquisition costs.",
        "example": "A SaaS company losing 4% of users monthly experiences declining recurring revenue."
    },
    {
        "definition": "represents a key business metric used to evaluate customer behavior and long-term value.",
        "impact": "influences strategic decisions related to growth, retention, and profitability.",
        "example": "An e-commerce company prioritizes high-value customers for loyalty programs."
    },
    {
        "definition": "refers to a measurable indicator used to assess business performance over time.",
        "impact": "helps leadership identify risks and optimize operational strategies.",
        "example": "A telecom operator monitors this metric to reduce customer attrition."
    },
    {
        "definition": "is a business concept used to understand customer dynamics and revenue trends.",
        "impact": "plays a critical role in forecasting revenue and managing customer relationships.",
        "example": "A subscription platform adjusts pricing based on this indicator."
    },
    {
        "definition": "describes a strategic approach used by companies to improve customer outcomes.",
        "impact": "supports sustainable growth by aligning customer needs with business goals.",
        "example": "A fintech startup applies this strategy to reduce account closures."
    }
]

data = []

for concept in concepts:
    for t in templates:
        output_json = {
            "definition": f"{concept} {t['definition']}",
            "business_impact": f"This metric {t['impact']}",
            "example": t["example"]
        }

        data.append({
            "instruction": (
                "You are a senior business analyst. "
                "Answer STRICTLY in JSON with the keys: definition, business_impact, example. "
                "Do not add extra text."
            ),
            "input": concept,
            "output": json.dumps(output_json)
        })

dataset = Dataset.from_list(data)

len(dataset), dataset


(100,
 Dataset({
     features: ['instruction', 'input', 'output'],
     num_rows: 100
 }))
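
Because the instruction demands strict JSON with three fixed keys, it can help to check that every training target actually honors that contract before training. The helper below is a minimal sketch; `invalid_rows` and `REQUIRED_KEYS` are hypothetical names introduced here, and the two sample rows are made up for illustration.

```python
import json

REQUIRED_KEYS = {"definition", "business_impact", "example"}


def invalid_rows(rows):
    """Return (index, reason) pairs for rows whose output breaks the JSON contract."""
    problems = []
    for i, row in enumerate(rows):
        try:
            parsed = json.loads(row["output"])
        except json.JSONDecodeError as exc:
            problems.append((i, f"not valid JSON: {exc.msg}"))
            continue
        if not isinstance(parsed, dict) or set(parsed) != REQUIRED_KEYS:
            problems.append((i, "unexpected keys"))
    return problems


# Two hypothetical rows: the first follows the contract, the second does not.
sample = [
    {"output": json.dumps({"definition": "d", "business_impact": "b", "example": "e"})},
    {"output": "Customer churn is ..."},
]
print(invalid_rows(sample))
```

Calling `invalid_rows(dataset)` on the dataset built above should return an empty list, since every target is produced with `json.dumps`. The check covers format only: it does not catch semantic mismatches, such as the churn definition in the first template being paired with every concept.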

## 5. Format Dataset for Instruction Tuning

In [7]:
def format_example(e):
    return {
        'text': f"<|user|>\n{e['instruction']}\n{e['input']}\n<|assistant|>\n{e['output']}"
    }

formatted = dataset.map(format_example)
formatted

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 100
})
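
The template above never marks where a turn ends, so the model gets no training signal for when to stop generating. Assuming the Phi-3 instruct chat format closes each turn with an `<|end|>` token (worth confirming against `tokenizer.apply_chat_template` for this checkpoint), a variant could look like the sketch below. `format_with_end` is a hypothetical name, and the sample row is illustrative.

```python
END = "<|end|>"  # assumed Phi-3 end-of-turn marker; verify against the tokenizer


def format_with_end(e):
    """Same layout as format_example, with each turn closed by END."""
    return {
        "text": (
            f"<|user|>\n{e['instruction']}\n{e['input']}{END}\n"
            f"<|assistant|>\n{e['output']}{END}"
        )
    }


# Illustrative row, not taken from the dataset.
row = {
    "instruction": "Answer in JSON.",
    "input": "Customer churn",
    "output": '{"definition": "..."}',
}
print(format_with_end(row)["text"])
```

Switching templates changes the training data, so the loss values reported in this notebook would no longer apply.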

## 6. Tokenization

In [8]:
def tokenize(e):
    return tokenizer(e['text'], truncation=True, max_length=512)

tokenized = formatted.map(tokenize, remove_columns=formatted.column_names)
tokenized

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 100
})
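
The tokenized rows contain only `input_ids` and `attention_mask`; with `DataCollatorForLanguageModeling(mlm=False)` the labels are copied from `input_ids`, so the loss also covers the instruction tokens. A common alternative is to score only the response by setting prompt positions to `-100`, the label value ignored by PyTorch's cross-entropy loss. The sketch below shows the idea on toy token ids; `mask_prompt_labels` is a hypothetical helper, not part of this notebook's pipeline.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss in PyTorch


def mask_prompt_labels(prompt_ids, response_ids):
    """Build input_ids and labels so that only response tokens are scored."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}


# Toy token ids standing in for a tokenized prompt and response.
example = mask_prompt_labels([1, 15, 27, 42], [99, 100, 2])
print(example["labels"])  # [-100, -100, -100, -100, 99, 100, 2]
```

Using this in training would also require a collator that keeps and pads a `labels` column instead of regenerating it from `input_ids`.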

## 7. Configure and Attach LoRA Adapters

In [9]:
from peft import LoraConfig, get_peft_model

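lora_config = LoraConfig(
    r=8,                                    # rank of the low-rank update matrices
    lora_alpha=16,                          # updates are scaled by lora_alpha / r
    target_modules=['qkv_proj', 'o_proj'],  # attention projections in Phi-3
    lora_dropout=0.05,
    bias='none',                            # leave bias terms frozen
    task_type='CAUSAL_LM'
)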

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

trainable params: 4,718,592 || all params: 3,825,798,144 || trainable%: 0.1233
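
The trainable-parameter count can be reproduced by hand. Each LoRA adapter on a linear layer adds two matrices, A (r × in) and B (out × r), so it contributes r · (in + out) weights. The sketch below assumes Phi-3-mini has a hidden size of 3072, 32 decoder layers, and a fused `qkv_proj` whose output is three times the hidden size; these architecture values are assumptions, not read from the notebook.

```python
HIDDEN = 3072   # assumed hidden size of Phi-3-mini
N_LAYERS = 32   # assumed number of decoder layers
R = 8           # LoRA rank from lora_config


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Weights added by one adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


qkv = lora_params(HIDDEN, 3 * HIDDEN, R)  # fused query/key/value projection
o = lora_params(HIDDEN, HIDDEN, R)        # attention output projection
trainable = N_LAYERS * (qkv + o)

total = 3_825_798_144  # "all params" as reported above
print(f"trainable: {trainable:,}")
print(f"share    : {100 * trainable / total:.4f}%")
```

Under those assumptions the result matches the 4,718,592 trainable parameters and the 0.1233% share printed by `print_trainable_parameters()`.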


## 8. Training Configuration

In [10]:
from transformers import TrainingArguments

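training_args = TrainingArguments(
    output_dir='./lora-phi3',
    per_device_train_batch_size=1,   # one sequence per step keeps activation memory low
    gradient_accumulation_steps=8,   # effective batch size of 8 on a single GPU
    learning_rate=2e-4,
    num_train_epochs=3,
    fp16=True,                       # mixed-precision training
    logging_steps=10,
    report_to='none'                 # disable external experiment trackers
)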

## 9. Train LoRA Adapters

In [11]:
from transformers import Trainer, DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

trainer.train()

| Step | Training Loss |
|------|---------------|
| 10   | 1.9784        |
| 20   | 1.0731        |
| 30   | 0.6788        |


TrainOutput(global_step=39, training_loss=1.0764971757546449, metrics={'train_runtime': 130.9816, 'train_samples_per_second': 2.29, 'train_steps_per_second': 0.298, 'total_flos': 747397701918720.0, 'train_loss': 1.0764971757546449, 'epoch': 3.0})
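
The reported `global_step=39` follows from the dataset size and the batch settings. The arithmetic below is a sketch that assumes a single GPU and that the final, partial accumulation window of each epoch still triggers an optimizer step, which is consistent with the count reported here.

```python
import math

n_examples = 100       # rows in the tokenized dataset
per_device_batch = 1   # per_device_train_batch_size
grad_accum = 8         # gradient_accumulation_steps
epochs = 3             # num_train_epochs

effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(n_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print("effective batch size:", effective_batch)
print("optimizer steps per epoch:", steps_per_epoch)
print("total optimizer steps:", total_steps)
```

With `logging_steps=10`, a 39-step run logs the loss at steps 10, 20, and 30, which is why only three loss values appear.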

## 10. Save Fine-Tuned Adapters

In [12]:
trainer.save_model('./lora-phi3')

## 11. Load Fine-Tuned Model and Compare Outputs

In [13]:
from peft import PeftModel
from transformers import BitsAndBytesConfig

# Reload a fresh 4-bit base model and attach the saved adapters.
# BitsAndBytesConfig replaces the deprecated load_in_4bit=True argument.
ft_model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map='auto',
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        dtype=torch.float16,  # named torch_dtype in older transformers releases
    ),
    './lora-phi3'
)

# Caveat: get_peft_model() injected the LoRA layers into base_model in place,
# so this call also runs with the trained adapters and is not a true baseline.
base_out = base_model.generate(**inputs, max_new_tokens=40, do_sample=False)
ft_out = ft_model.generate(**inputs, max_new_tokens=40, do_sample=False)

print('BASE MODEL:\n', tokenizer.decode(base_out[0], skip_special_tokens=True))
print('\nFINE-TUNED MODEL:\n', tokenizer.decode(ft_out[0], skip_special_tokens=True))

BASE MODEL:
 Explain customer churn in one concise sentence for a business audience. Customer churn refers to the rate atemeny businesses monitor to gauge customer retention and loyalty.

Customer churn is a critical metric for businesses to understand customer retention and

FINE-TUNED MODEL:
 Explain customer churn in one concise sentence for a business audience. Customer churn refers to the rate atemeny businesses monitor to gauge customer retention and loyalty.

Customer churn is a critical metric for businesses to understand customer retention and
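
The two outputs above are identical. Two likely reasons are that `get_peft_model` modifies `base_model` in place, so both calls run with the adapters active, and that the prompt does not follow the `<|user|>` / `<|assistant|>` layout used in training. A more informative comparison might look like the sketch below, which prompts in the training layout and uses PEFT's `disable_adapter()` context manager to get base behaviour from the same wrapped model. The helper names `build_prompt`, `generate_text`, and `compare` are hypothetical.

```python
INSTRUCTION = (
    "You are a senior business analyst. "
    "Answer STRICTLY in JSON with the keys: definition, business_impact, example. "
    "Do not add extra text."
)


def build_prompt(concept, instruction=INSTRUCTION):
    """Training-time layout from format_example, cut off after the assistant tag."""
    return f"<|user|>\n{instruction}\n{concept}\n<|assistant|>\n"


def generate_text(model, tokenizer, prompt, device="cuda", max_new_tokens=120):
    """Greedy-decode a completion, mirroring the generate() calls in this notebook."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0], skip_special_tokens=True)


def compare(peft_model, tokenizer, concept, device="cuda"):
    """Return (base_text, tuned_text) produced by one PEFT-wrapped model."""
    prompt = build_prompt(concept)
    # Inside this context the LoRA layers are bypassed, giving base behaviour.
    with peft_model.disable_adapter():
        base_text = generate_text(peft_model, tokenizer, prompt, device)
    # Outside the context the adapters are active again.
    tuned_text = generate_text(peft_model, tokenizer, prompt, device)
    return base_text, tuned_text
```

In this notebook the call would be `base_text, tuned_text = compare(model, tokenizer, "Customer churn")`, where `model` is the object returned by `get_peft_model`. The outputs of such a run are not shown here because it has not been executed as part of this notebook.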


## 12. Conclusion

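- QLoRA makes fine-tuning feasible on a low-memory GPU such as the Tesla T4 used here: the base model is loaded in 4-bit and only the LoRA adapters are trained.
- Only about 0.12% of parameters (4.7M of 3.8B) are trainable.
- Fine-tuning is most useful for **behavior and format alignment**, not generic knowledge.
- The comparison in section 11 prints identical text for both models, so it does not demonstrate a behavior change: `base_model` already carries the adapters after `get_peft_model`, and the test prompt does not use the `<|user|>` / `<|assistant|>` layout seen in training.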