# LoRA + 4-bit Quantization + Chat UI with Unsloth

This Colab demonstrates multiple modern AI ideas together. Observe the following:

* Efficient fine-tuning (LoRA)

* Memory-saving quantization (bnb-4bit)

* Inference with a live chat interface

## 1. Setup & GPU Check

In [1]:
!pip install "pyarrow<20.0.0" -q
!pip install unsloth transformers datasets accelerate peft bitsandbytes gradio -q

!nvidia-smi || echo "No GPU found. Go to Runtime â†’ Change runtime type â†’ GPU â†’ Save."


[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 4.4.1 requires pyarrow>=21.0.0, but you have pyarrow 19.0.1 which is incompatible.[0m[31m
[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pylibcudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
cudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.[0m[31m
[0mThu Nov  6 03:08:20 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|--------------------------------

## 2. Import Required Libraries

In [2]:
from unsloth import FastLanguageModel
from datasets import load_dataset
import torch
import gradio as gr


ðŸ¦¥ Unsloth: Will patch your computer to enable 2x faster free finetuning.
ðŸ¦¥ Unsloth Zoo will now patch everything to make training faster!


## 3. Load a Quantized Model (Gemma-3-1B-IT)

In [3]:
model_name = "unsloth/gemma-3-1b-it-unsloth-bnb-4bit"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    load_in_4bit=True,   # quantized for efficiency
    dtype=None,
)

print(f"âœ… Loaded model: {model_name}")
print(f"Tokenizer vocab size: {len(tokenizer)}")

==((====))==  Unsloth 2025.11.1: Fast Gemma3 patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: Gemma3 does not support SDPA - switching to fast eager.
âœ… Loaded model: unsloth/gemma-3-1b-it-unsloth-bnb-4bit
Tokenizer vocab size: 262145


## 4. Add LoRA Adapters (Parameter-Efficient Fine-Tuning)

In [4]:
# LoRA reduces trainable params dramatically
model = FastLanguageModel.get_peft_model(
    model,
    r=8,                        # rank
    lora_alpha=16,              # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

model = FastLanguageModel.for_training(model)
print("âœ… LoRA adapters added and model ready for fine-tuning.")

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.


Unsloth: Making `model.base_model.model.model` require gradients
âœ… LoRA adapters added and model ready for fine-tuning.


## 5. Load and Prepare Dataset

In [5]:
dataset = load_dataset("tatsu-lab/alpaca", split="train[:500]")
print("âœ… Dataset loaded with", len(dataset), "examples.")
dataset[0]

âœ… Dataset loaded with 500 examples.


{'instruction': 'Give three tips for staying healthy.',
 'input': '',
 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.',
 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}

## 6. Format & Tokenize Data

In [6]:
def format_instruction(sample):
    if sample["input"]:
        return f"### Instruction:\n{sample['instruction']}\n\n### Input:\n{sample['input']}\n\n### Response:\n{sample['output']}"
    else:
        return f"### Instruction:\n{sample['instruction']}\n\n### Response:\n{sample['output']}"

dataset = dataset.map(lambda x: {"text": format_instruction(x)})
dataset = dataset.remove_columns(["instruction", "input", "output"])

tokenized_dataset = dataset.map(
    lambda x: tokenizer(
        x["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
    ),
    batched=True,
    remove_columns=["text"],
)

tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
print("âœ… Tokenization complete.")

âœ… Tokenization complete.


## 7. Train Using Hugging Face Trainer (LoRA Fine-Tuning)

In [7]:
tokenized_dataset = tokenized_dataset.map(lambda batch: {"labels": batch["input_ids"]})
print("âœ… Added labels column for training loss computation.")

âœ… Added labels column for training loss computation.


In [8]:
# ============================================
# ðŸš€ STEP 7: Manual LoRA fine-tuning loop (no AMP, no Trainer)
# ============================================

import torch
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
import torch.nn.functional as F

# Make sure dataset has labels
tokenized_dataset = tokenized_dataset.map(lambda b: {"labels": b["input_ids"]})
print("âœ… Labels added for loss computation.")

# DataLoader
train_loader = DataLoader(tokenized_dataset, batch_size=4, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model = model.half()                   # keep in float16 (bnb backend)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

print("ðŸš€ Starting manual LoRA fine-tuning loop â€¦")

num_epochs = 1
for epoch in range(num_epochs):
    total_loss = 0
    for batch in tqdm(train_loader):
        optimizer.zero_grad()
        # Move tensors to GPU
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)
        # Forward + loss
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1} average loss: {total_loss/len(train_loader):.4f}")

print("âœ… LoRA fine-tuning completed successfully!")


Map:   0%|          | 0/500 [00:00<?, ? examples/s]

âœ… Labels added for loss computation.
ðŸš€ Starting manual LoRA fine-tuning loop â€¦


  0%|          | 0/125 [00:00<?, ?it/s]

Unsloth: Will smartly offload gradients to save VRAM!
Epoch 1 average loss: nan
âœ… LoRA fine-tuning completed successfully!


## 8. Save the Fine-Tuned Model

In [9]:
save_path = "./gemma1b-lora-finetuned"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"âœ… Model and tokenizer saved at: {save_path}")

âœ… Model and tokenizer saved at: ./gemma1b-lora-finetuned


## 9. Inference: Test the Fine-Tuned Model

In [10]:
from unsloth import FastLanguageModel
import torch

# ðŸ”¹ Load your fine-tuned model freshly on CPU
model, tokenizer = FastLanguageModel.from_pretrained("./gemma1b-lora-finetuned")

# Put the model in eval (inference) mode
model.eval()
model.to("cpu")

print("âœ… Model loaded on CPU and ready for inference.")

==((====))==  Unsloth 2025.11.1: Fast Gemma3 patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: Gemma3 does not support SDPA - switching to fast eager.
âœ… Model loaded on CPU and ready for inference.


In [13]:
# ============================================
# âš¡ OPTION B: Inference with your fine-tuned LoRA model (GPU)
# ============================================
from unsloth import FastLanguageModel
import torch

# Load your saved LoRA-finetuned model
model, tokenizer = FastLanguageModel.from_pretrained("./gemma1b-lora-finetuned")

# Safe GPU setup
device = "cuda" if torch.cuda.is_available() else "cpu"
model.eval()
model.to(device)
# Keep generation short so it returns fast
MAX_NEW_TOKENS = 32

def generate_text(prompt, max_new_tokens=MAX_NEW_TOKENS):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,                  # deterministic/fast
            eos_token_id=tokenizer.eos_token_id,
            use_cache=True,
        )
    return tokenizer.decode(out[0], skip_special_tokens=True)

prompt = "Explain overfitting in machine learning with a simple example."
print("ðŸ’¡ Prompt:", prompt)
print("\nðŸ§  Model Response:\n", generate_text(prompt))


==((====))==  Unsloth 2025.11.1: Fast Gemma3 patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: Gemma3 does not support SDPA - switching to fast eager.
ðŸ’¡ Prompt: Explain overfitting in machine learning with a simple example.

ðŸ§  Model Response:
 Explain overfitting in machine learning with a simple example.
