# Fine-tune LLM for Next Gen UI Agent using Unsloth and LoRA

This notebook demonstrates how to:
1. Fine-tune a 3B or 8B parameter model using LoRA (Low-Rank Adaptation)
2. Load training data from a simple JSON file (`question`/`answer` format) or multiple files
3. Export the model in GGUF format for use with Ollama

**Model**: Llama-3.2-3B-Instruct (you can also use Qwen2.5-3B-Instruct or similar)

**Data Format**: Simple JSON with `question` and `answer` fields

---

## ⚠️ IMPORTANT: GPU Required

**This notebook requires a GPU to run!**

Before you start, make sure GPU is enabled:
1. Click **Runtime** → **Change runtime type**
2. Select **T4 GPU** (or better) under Hardware accelerator
3. Click **Save**

The notebook will check for GPU availability in Step 1 before installing dependencies.

**Note:** TPU is NOT supported. This notebook requires NVIDIA CUDA GPUs only (Unsloth and bitsandbytes do not support TPUs).


## Step 1: Verify GPU Availability

In [None]:
# Check if GPU is available
import torch

if not torch.cuda.is_available():
    print("❌ ERROR: No GPU detected!")
    print("\n📋 To enable GPU in Google Colab:")
    print("   1. Click 'Runtime' in the menu")
    print("   2. Select 'Change runtime type'")
    print("   3. Choose 'T4 GPU' under 'Hardware accelerator'")
    print("   4. Click 'Save'")
    print("   5. Re-run all cells from the beginning")
    raise RuntimeError("GPU is required for this notebook. Please enable GPU and restart.")
else:
    gpu_name = torch.cuda.get_device_name(0)
    print(f"✅ GPU detected: {gpu_name}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")


## Step 2: Install Required Packages

**Note:** This step installs all dependencies. It may take 3-5 minutes. If you get xformers errors, don't worry - there's a fallback option below.


In [None]:
# Install Unsloth and dependencies
print("Installing Unsloth and required packages...")
print("This may take 3-5 minutes...\n")

# Install Unsloth
%pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Install other dependencies without building from source
%pip install -q --no-deps "trl<0.9.0" peft accelerate bitsandbytes

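# Install xformers. pip uses a pre-built wheel when one matches this environment;
# --no-build-isolation only changes how a source build would run, it does not prevent one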
print("\nInstalling xformers...")
%pip install -q xformers --no-build-isolation

print("\n✅ Installation complete!")


### Alternative Installation (If xformers fails)

**Only run this cell if the above installation failed with xformers errors.**

Xformers is optional and provides a small speed boost, but the notebook works without it.


In [None]:
# Alternative installation without xformers
# Uncomment the lines below ONLY if you had xformers errors above

# print("Installing without xformers (optional speedup)...")
# %pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# %pip install -q --no-deps "trl<0.9.0" peft accelerate bitsandbytes
# print("✅ Installation complete (without xformers)")


## Step 3: Import Libraries


In [None]:
import json
import os
from unsloth import FastLanguageModel
from datasets import Dataset
from trl import SFTTrainer
from transformers import TrainingArguments

print("✅ All libraries imported successfully!")


## Step 4: Configuration


In [None]:
# Model configuration
max_seq_length = 2048  # Choose any! Unsloth supports RoPE Scaling internally
dtype = None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True  # Use 4bit quantization to reduce memory usage

# LoRA configuration
lora_r = 16  # LoRA rank
lora_alpha = 16  # LoRA alpha (scaling factor)
lora_dropout = 0  # Dropout for LoRA layers
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]  # Modules to apply LoRA

# Training configuration
output_dir = "./output"
num_train_epochs = 3
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
learning_rate = 2e-4
warmup_steps = 5
logging_steps = 1
save_steps = 50

# Model selection (choose one)
model_name = "unsloth/Llama-3.2-3B-Instruct" # 3B model

# Alternative 3B models:
# model_name = "unsloth/Qwen2.5-3B-Instruct"  # 3B model
# model_name = "unsloth/Phi-3.5-mini-instruct"  # 3.8B model

# Alternative 8B models:
# model_name = "unsloth/Meta-Llama-3.1-8B-Instruct" # 8B model
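
The training values above combine into an effective batch size and a step count, which helps when choosing `warmup_steps` and `save_steps`. The sketch below is a rough estimate only: `estimate_training_steps` is a hypothetical helper, the count of 100 training examples is an assumed value for illustration, a single GPU is assumed, and the exact rounding of the last partial batch depends on the installed `transformers` version.

```python
import math

def estimate_training_steps(num_examples, per_device_batch_size,
                            grad_accum_steps, num_epochs, num_devices=1):
    """Roughly estimate optimizer steps for a run (hypothetical helper)."""
    # One optimizer step consumes this many examples
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    # Approximation: a final partial batch is counted as a full step
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * num_epochs

# Values from the configuration cell, with an assumed 100 training examples
effective_batch, steps_per_epoch, total_steps = estimate_training_steps(
    num_examples=100, per_device_batch_size=2, grad_accum_steps=4, num_epochs=3
)
print(f"Effective batch size: {effective_batch}")
print(f"Approx. steps per epoch: {steps_per_epoch}, total: {total_steps}")
```

If the estimated total is below `save_steps` (50 here), the run ends before step 50 is ever reached, so lower `save_steps` if you want intermediate checkpoints on a small dataset.
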

## Step 5: Load Pre-trained Model


In [None]:
# Load model and tokenizer with Unsloth optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print(f"Model loaded: {model_name}")
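# Note: with load_in_4bit=True the quantized weights are stored in packed form,
# so this count can come out lower than the model's nominal size
print(f"Model parameters: {sum(p.numel() for p in model.parameters()) / 1e9:.2f}B")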


## Step 6: Configure LoRA Adapters


In [None]:
# Add LoRA adapters to the model
model = FastLanguageModel.get_peft_model(
    model,
    r=lora_r,
    target_modules=target_modules,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Efficient gradient checkpointing
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

print("LoRA adapters configured!")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.2f}M")
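
As a cross-check on the number printed above: a LoRA adapter of rank `r` on a linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` trainable weights. The sketch below applies that formula to the seven target modules. The layer shapes are assumptions based on the published Llama-3.2-3B configuration (hidden size 3072, MLP size 8192, 28 layers, 8 key/value heads of dimension 128); other models need their own values, and `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(rank, module_shapes, num_layers):
    """Count LoRA weights: each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (in_f + out_f) for in_f, out_f in module_shapes.values())
    return per_layer * num_layers

# (in_features, out_features) per target module; assumed Llama-3.2-3B shapes
llama_32_3b_shapes = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),   # 8 KV heads x 128 dims
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}

total = lora_trainable_params(rank=16, module_shapes=llama_32_3b_shapes, num_layers=28)
print(f"Expected trainable parameters: {total / 1e6:.2f}M")
```

Because the count scales linearly with the rank, doubling `lora_r` doubles the number of trainable weights.
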


## Step 7: Upload and Load Training Data

### Uploading Your Dataset

**Manual Upload Instructions:**
1. In Colab, click the **folder icon** (📁) in the left sidebar to open the file browser
2. Create a folder for your training data named `training_data` by right-clicking root directory → `New Folder`
3. Upload your JSON files into this folder by right-clicking the folder → `Upload`
4. You can upload multiple JSON files - they will be automatically combined and shuffled

**Alternative:** Upload a single `training_data.json` file to the root directory

**Required JSON format for each file:**
```json
[
  {
    "question": "What is machine learning?",
    "answer": "Machine learning is a branch of AI that..."
  },
  {
    "question": "What is deep learning?",
    "answer": "Deep learning is a subset of machine learning..."
  }
]
```

The notebook loads every JSON file in the specified folder, combines their examples, and shuffles the combined examples before training.
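
The loading cell only inspects the first record, so a malformed record further down surfaces later, during formatting. It can help to check every record before uploading. The following is a minimal sketch: `find_invalid_examples` is a hypothetical helper, not part of the notebook, and it treats empty or non-string fields as invalid, which is an assumption about what you want to allow.

```python
def find_invalid_examples(records):
    """Return (index, reason) pairs for records that do not match the expected format."""
    problems = []
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append((i, "record is not a JSON object"))
            continue
        for field in ("question", "answer"):
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"missing or empty '{field}'"))
    return problems

sample = [
    {"question": "What is machine learning?", "answer": "A branch of AI that..."},
    {"question": "What is deep learning?"},           # no answer
    {"question": "   ", "answer": "Blank question"},  # whitespace-only question
]
problems = find_invalid_examples(sample)
for index, reason in problems:
    print(f"Example {index}: {reason}")
```

An empty result means every record has a usable `question` and `answer`.
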


In [None]:
# Configure the path to your training data
# Option 1: Load from a folder containing multiple JSON files (recommended)
data_path = "training_data"  # Folder containing your JSON files

# Option 2: Load from a single JSON file (uncomment to use)
# data_path = "training_data.json"

print(f"Loading training data from: {data_path}")


In [None]:
# Load training data from JSON file(s)
import glob
import random

training_data = []

# Check if path is a directory or a file
if os.path.isdir(data_path):
    # Load all JSON files from the directory
    json_files = glob.glob(os.path.join(data_path, "*.json"))
    
    if not json_files:
        raise ValueError(f"No JSON files found in directory: {data_path}")
    
    print(f"Found {len(json_files)} JSON file(s) in {data_path}/")
    
    for json_file in sorted(json_files):
        print(f"  Loading: {os.path.basename(json_file)}")
        with open(json_file, 'r', encoding='utf-8') as f:
            data = json.load(f)
            if isinstance(data, list):
                training_data.extend(data)
                print(f"    Added {len(data)} examples")
            else:
                raise ValueError(f"File {json_file} must contain a JSON array")
    
elif os.path.isfile(data_path):
    # Load single JSON file
    print(f"Loading single file: {data_path}")
    with open(data_path, 'r', encoding='utf-8') as f:
        training_data = json.load(f)
        if not isinstance(training_data, list):
            raise ValueError(f"File {data_path} must contain a JSON array")
else:
    raise ValueError(f"Path not found: {data_path}")

# Validate data format
if not training_data:
    raise ValueError("Training data is empty!")

# Check first example has required fields
first_example = training_data[0]
if "question" not in first_example or "answer" not in first_example:
    raise ValueError("Training data must have 'question' and 'answer' fields!")

# Shuffle the combined data to mix examples from different files
print(f"\nShuffling {len(training_data)} training examples...")
random.seed(3407)  # Set seed for reproducibility
random.shuffle(training_data)

print(f"✅ Loaded and shuffled {len(training_data)} training examples")
print("\nFirst example after shuffle:")
print(json.dumps(training_data[0], indent=2))


## Step 8: Format Training Data

**Note:** This notebook uses the tokenizer's `apply_chat_template()` method to format prompts. This makes the code model-agnostic and automatically adapts to any model's expected chat format (ChatML, Llama, Alpaca, etc.). No hardcoded format tags needed!
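
To make the effect of `add_generation_prompt` concrete, the sketch below hand-renders a conversation in ChatML, the layout used by Qwen models. It is a simplified stand-in written for illustration (`render_chatml` is hypothetical); the real template ships with each tokenizer and may add details such as a default system prompt, so the notebook itself always relies on `apply_chat_template()`.

```python
def render_chatml(conversation, add_generation_prompt=False):
    """Simplified ChatML renderer, for illustration only."""
    text = ""
    for message in conversation:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer
        text += "<|im_start|>assistant\n"
    return text

training_example = [
    {"role": "user", "content": "What is deep learning?"},
    {"role": "assistant", "content": "A subset of machine learning."},
]
# Training: the full exchange, including the answer, is rendered
print(render_chatml(training_example))
# Inference: only the question, followed by an open assistant turn
print(render_chatml(training_example[:1], add_generation_prompt=True))
```

Training examples are rendered with the answer included and no generation prompt, while the test prompt in Step 11 stops at an open assistant turn.
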


In [None]:
# Format the data for instruction tuning using tokenizer's chat template
# Required JSON format: [{"question": "...", "answer": "..."}]

def format_prompt(example):
    """
    Format a single example into a prompt using the tokenizer's chat template.
    This makes the notebook work with any model automatically.
    
    Args:
        example: Dict with 'question' and 'answer' keys
        
    Returns:
        Dict with 'text' key containing the formatted prompt
    """
    # Extract question and answer
    try:
        question = example["question"]
        answer = example["answer"]
    except KeyError as e:
        raise ValueError(f"Example missing required field: {e}. Required: 'question' and 'answer'")
    
    # Create conversation in standard format
    conversation = [
        {"role": "system", "content": "You are a helpful UI design assistant."},
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer}
    ]
    
    # Use tokenizer's chat template (works with any model)
    prompt = tokenizer.apply_chat_template(
        conversation,
        tokenize=False,
        add_generation_prompt=False
    )
    
    return {"text": prompt}

# Convert to HuggingFace Dataset format
print("Formatting training data...")
formatted_data = [format_prompt(example) for example in training_data]
dataset = Dataset.from_list(formatted_data)

print(f"✅ Dataset created with {len(dataset)} examples")
print("\nExample formatted prompt:")
print(dataset[0]['text'][:500] + "...")


## Step 9: Configure Training


In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    warmup_steps=warmup_steps,
    learning_rate=learning_rate,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=logging_steps,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    save_steps=save_steps,
    save_total_limit=2,
    report_to="none", # Disables wandb/tensorboard prompts
)

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=training_args,
)

print("Trainer configured and ready!")


## Step 10: Train the Model


In [None]:
# Start training
print("Starting training...")
trainer_stats = trainer.train()
print("\nTraining completed!")
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"Training loss: {trainer_stats.metrics['train_loss']:.4f}")


## Step 11: Test the Fine-tuned Model


In [None]:
# Test the model with a sample prompt
FastLanguageModel.for_inference(model)  # Enable inference mode

test_question = "The data is an array with fifteen k8s pod objects with pod name, pod creation date, CPU utilization and consumed memory. User wants to see these data. What UI component to use?"
#test_question = "The data is an array with fifteen k8s pod objects with pod name, pod creation date, CPU utilization and consumed memory. User is browsing them to get overview. What UI component to use?"
#test_question = "The data is an array with fifteen k8s pod objects with pod name, pod creation date, CPU utilization and consumed memory. User is browsing them to get basic overview. What type of UI component to use?"
#test_question = "The data is an array with fifteen k8s pod objects with pod name, pod creation date, image url containing CPU utilization and consumed memory info. User is browsing them to get basic overview. What type of UI component to use?"

# Create test conversation using tokenizer's chat template
test_conversation = [
    {"role": "system", "content": "You are a helpful UI design assistant."},
    {"role": "user", "content": test_question}
]

# Format using tokenizer's chat template
test_prompt = tokenizer.apply_chat_template(
    test_conversation,
    tokenize=False,
    add_generation_prompt=True  # Adds the assistant prompt at the end
)

inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.1,
    top_p=0.9,
    do_sample=True,
)

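# Decode only the newly generated tokens so the prompt is not echoed in the response
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)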
print("Test question:")
print(test_question)
print("\nModel response:")
print(response)


## Step 12: Save the Model


In [None]:
# Save the fine-tuned model and LoRA adapters
model_save_path = "./finetuned_model"
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)
print(f"Model saved to {model_save_path}")


## Step 13: Export to GGUF for Ollama

This step exports the model directly to GGUF format for Ollama. The LoRA adapters are automatically merged during export.

**Note:** This step (merging the weights, installing `llama.cpp` and the GGUF packages, and quantizing the final model) takes around 20 minutes.


In [None]:
# Optional: Save merged model in 16-bit precision
# Uncomment this section ONLY if you need the merged model for:
# - Uploading to HuggingFace Hub
# - Using with other tools (vLLM, TGI, etc.)
# - Testing before GGUF conversion
# 
# For Ollama-only workflow, this step is NOT needed (GGUF export merges automatically)

# merged_model_path = "./merged_model_16bit"
# model.save_pretrained_merged(
#     merged_model_path,
#     tokenizer,
#     save_method="merged_16bit",  # Can also use "merged_4bit" for smaller size
# )
# print(f"Merged 16-bit model saved to {merged_model_path}")


In [None]:
# Export to GGUF format for Ollama
# Generate filename based on base model: NGUI-{model}.{quant_method}.gguf
model_basename = model_name.split('/')[-1].lower()  # e.g., "llama-3.2-3b-instruct"
model_basename = model_basename.replace('-instruct', '').replace('instruct', '').strip('-')

print(f"Base model: {model_basename}")

# Export `q4_k_m` quantization. Other quantization options `q5_k_m`, `q8_0`, `f16`
quant_method = "q4_k_m"
print(f"\nExporting to GGUF format with {quant_method} quantization...")

# save_pretrained_gguf saves the file in the current directory with the model's name
# The first parameter is just a directory name that gets created (but file is in root)
gguf_dir = "gguf_output"
model.save_pretrained_gguf(
    gguf_dir,
    tokenizer,
    quantization_method=quant_method,
)

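# Find the generated GGUF file. It is expected in the root directory; the export
# directory is also checked in case the installed Unsloth version writes it there.
import glob
import shutil
gguf_files = glob.glob("./*.gguf") + glob.glob(os.path.join(gguf_dir, "*.gguf"))
if gguf_files:
    # Pick the newest file so a leftover export from an earlier run is not used by mistake
    generated_file = max(gguf_files, key=os.path.getmtime)  # e.g., "./llama-3.2-3b-instruct.Q4_K_M.gguf"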
    
    # Rename to our desired format: NGUI-{model}.{quant_method}.gguf
    gguf_filename = f"NGUI-{model_basename}.{quant_method}.gguf"
    gguf_path = f"./{gguf_filename}"
    
    # Rename the file if it's not already named correctly
    if generated_file != gguf_path:
        shutil.move(generated_file, gguf_path)
    
    print(f"✅ GGUF model saved as: {gguf_filename}")
    
    # Check file size
    if os.path.exists(gguf_path):
        size_mb = os.path.getsize(gguf_path) / (1024 * 1024)
        print(f"   File size: {size_mb:.2f} MB")
    
    # Clean up the output directory if it exists
    if os.path.exists(gguf_dir):
        shutil.rmtree(gguf_dir, ignore_errors=True)
else:
    print("❌ Error: GGUF file not found after export")
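
The naming rule used in this cell can be tried on its own before running the long export. The sketch below repeats the same string operations in a helper (`ngui_gguf_filename` is a hypothetical name) and applies it to the model names offered in Step 4.

```python
def ngui_gguf_filename(model_name, quant_method="q4_k_m"):
    """Build NGUI-{model}.{quant_method}.gguf from a Hugging Face model name."""
    basename = model_name.split("/")[-1].lower()
    basename = basename.replace("-instruct", "").replace("instruct", "").strip("-")
    return f"NGUI-{basename}.{quant_method}.gguf"

for name in [
    "unsloth/Llama-3.2-3B-Instruct",
    "unsloth/Qwen2.5-3B-Instruct",
    "unsloth/Phi-3.5-mini-instruct",
    "unsloth/Meta-Llama-3.1-8B-Instruct",
]:
    print(f"{name} -> {ngui_gguf_filename(name)}")
```
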

## Step 14: Download the GGUF Model for Ollama

**Note:** The downloaded file is around 2 GB, so it may take a while before the browser's save dialog appears.
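
The file size follows from the parameter count and the quantization level. The sketch below is a back-of-the-envelope estimate only: the bits-per-weight values are approximate averages for llama.cpp quantization types and vary somewhat by model, the 3.21B parameter count for Llama-3.2-3B is an assumption taken from its model card, and `estimate_gguf_size_gb` is a hypothetical helper.

```python
# Approximate average bits per weight; assumed values, real files differ slightly
APPROX_BITS_PER_WEIGHT = {
    "q4_k_m": 4.85,
    "q5_k_m": 5.7,
    "q8_0": 8.5,
    "f16": 16.0,
}

def estimate_gguf_size_gb(num_params, quant_method):
    """Estimate GGUF file size in GB (10^9 bytes), ignoring metadata overhead."""
    bits = num_params * APPROX_BITS_PER_WEIGHT[quant_method]
    return bits / 8 / 1e9

for method in APPROX_BITS_PER_WEIGHT:
    size = estimate_gguf_size_gb(3.21e9, method)
    print(f"{method}: ~{size:.1f} GB")
```

Expect a noticeably larger download if you switch to an 8B model or a higher-precision quantization.
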

In [None]:
# Download the GGUF file to your local machine
from google.colab import files

# Download the quantized model
print(f"Downloading GGUF model: {gguf_filename}")
files.download(gguf_path)
print("✅ Download complete!")

## Step 15: Create Ollama Modelfile

**Note:** When the browser's save dialog appears, save the `Modelfile` to the same folder as the GGUF file.


In [None]:
# Generate a Modelfile for Ollama with the correct chat template
print("Generating Modelfile for Ollama...")

# Detect model type from model name
model_type = None
if "qwen" in model_name.lower():
    model_type = "qwen"
elif "llama" in model_name.lower():
    model_type = "llama"
elif "phi" in model_name.lower():
    model_type = "phi"

print(f"Detected model type: {model_type or 'unknown'}")
print(f"Using GGUF file: {gguf_filename}")

# Generate appropriate template based on model type
if model_type == "qwen":
    # Qwen uses ChatML format
    modelfile_content = f'''FROM ./{gguf_filename}

TEMPLATE """<|im_start|>system
{{{{ .System }}}}<|im_end|>
<|im_start|>user
{{{{ .Prompt }}}}<|im_end|>
<|im_start|>assistant
"""

PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"

SYSTEM """You are a helpful UI design assistant."""
'''
    print("✅ Using Qwen/ChatML template")

elif model_type == "llama":
    # Llama 3.2 uses Llama 3 format with special tokens
    modelfile_content = f'''FROM ./{gguf_filename}

TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{{{ .System }}}}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{{{ .Prompt }}}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER stop "<|eot_id|>"

SYSTEM """You are a helpful UI design assistant."""
'''
    print("✅ Using Llama 3.2 template")

elif model_type == "phi":
    # Phi uses a similar format to ChatML
    modelfile_content = f'''FROM ./{gguf_filename}

TEMPLATE """<|system|>
{{{{ .System }}}}<|end|>
<|user|>
{{{{ .Prompt }}}}<|end|>
<|assistant|>
"""

PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER stop "<|end|>"

SYSTEM """You are a helpful UI design assistant."""
'''
    print("✅ Using Phi template")

else:
    # Generic fallback template
    print("⚠️ Warning: Unknown model type, using generic template")
    modelfile_content = f'''FROM ./{gguf_filename}

TEMPLATE """{{{{ .System }}}}

User: {{{{ .Prompt }}}}
Assistant:"""

PARAMETER temperature 0.1
PARAMETER top_p 0.9

SYSTEM """You are a helpful UI design assistant."""
'''

# Save Modelfile
with open("Modelfile", "w") as f:
    f.write(modelfile_content)

print("\n✅ Modelfile created successfully!")
print("\nModelfile content:")
print(modelfile_content)

# Download the Modelfile
from google.colab import files
files.download("Modelfile")
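
A note on the quadruple braces in the templates: Ollama's `TEMPLATE` uses Go template placeholders such as `{{ .System }}`, and inside a Python f-string each literal brace must be doubled. The short check below shows the escaping and how `gguf_filename` is still interpolated; the file name used here is just an example value.

```python
gguf_filename = "NGUI-llama-3.2-3b.q4_k_m.gguf"  # example value for illustration

# In an f-string, "{{" and "}}" produce literal "{" and "}",
# so four braces are needed to emit the two that Ollama expects.
snippet = f'''FROM ./{gguf_filename}
TEMPLATE """{{{{ .System }}}} {{{{ .Prompt }}}}"""
'''
print(snippet)
```
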


## Step 16: Instructions for Using with Ollama

After downloading the GGUF file and `Modelfile`, follow these steps on your local machine:

```bash
# 1. Make sure the GGUF file and Modelfile are in the same directory, go to that directory

# 2. Create the Ollama model (to compare several fine-tuned variants, use the base model name in place of the `-model` suffix)
ollama create ngui-finetuned-model -f Modelfile

# 3. Run the model
ollama run ngui-finetuned-model

# 4. Test it with a prompt
# >>> What is machine learning?
```

You can also call the model programmatically through the Ollama REST API.
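
As one way to do that, the sketch below sends a prompt to Ollama's `/api/generate` endpoint using only the Python standard library. It assumes Ollama is running locally on its default port 11434 and that the model was created under the name `ngui-finetuned-model` as in step 2; `build_generate_request` and `ask_ollama` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint (assumed)

def build_generate_request(prompt, model="ngui-finetuned-model"):
    """Build a non-streaming request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask_ollama(prompt, model="ngui-finetuned-model", timeout=120):
    """Send the prompt and return the generated text (requires a running Ollama server)."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]

# Example call, to be run on the machine where Ollama is serving the model:
# print(ask_ollama("The data is an array of k8s pod objects. What UI component to use?"))
```

Because `stream` is `False`, the server returns a single JSON object whose `response` field holds the full answer.
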

## Summary

This notebook demonstrated:
- ✅ Verifying GPU availability
- ✅ Loading a 3B parameter base model (Llama-3.2-3B-Instruct or Qwen2.5-3B-Instruct)
- ✅ Adding LoRA adapters for efficient fine-tuning
- ✅ Loading and shuffling training data from multiple JSON files
- ✅ Fine-tuning the model on custom data
- ✅ Exporting to GGUF format with custom naming `NGUI-{model_basename}.{quant_method}.gguf`
- ✅ Creating a model-specific `Modelfile` for Ollama integration

The fine-tuned model is now ready to use with Ollama on your local machine!
