# 🦥 Qwen3-0.6B → Phone Deployment with Unsloth

**Fine-tune + Export GGUF for Android/iOS in 90 minutes**

| | |
|---|---|
| 🤖 Model | Qwen/Qwen3-0.6B (600M params) |
| ⚡ Framework | Unsloth (2x faster training) |
| 📱 Output | GGUF Q4_K_M (~400MB) |
| ⏱️ Runtime | ~90 min on T4 GPU |
| 🎯 Target | Pixel 6, iPhone 15, any modern phone |

---

## 🚀 Quick Start
1. Enable **GPU T4 x2**: Settings → Accelerator → GPU T4 x2
2. Enable **Internet**: Settings → Internet → On
3. **Run All** cells in order
4. Download output GGUF file
5. Deploy to phone with llama.cpp or PocketPal AI app

> ⚠️ **Run cells in order!** Dependencies are version-pinned to avoid conflicts.

## 1️⃣ Install Dependencies

In [None]:
%%capture
# Pin versions to avoid Kaggle dependency conflicts
!pip install -q --upgrade pip
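# Quote ">=" specifiers so the shell does not treat ">" as an output redirect
!pip install -q fsspec==2024.9.0 datasets==4.2.0 "huggingface_hub>=0.23.0"
!pip install -q psutil sentencepiece protobuf
# Qwen3 support requires transformers 4.51.0 or newer
!pip install -q peft accelerate bitsandbytes trl "transformers>=4.51.0"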

# Install Unsloth from GitHub (latest optimizations)
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [None]:
# Verify installation
import torch
print(f"✅ PyTorch: {torch.__version__}")
print(f"✅ CUDA: {torch.cuda.is_available()} - {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}")

if not torch.cuda.is_available():
    raise RuntimeError("❌ GPU required! Enable: Settings → Accelerator → GPU T4 x2")

## 2️⃣ Load Model
Load Qwen3-0.6B with Unsloth optimizations

In [None]:
# IMPORTANT: Import unsloth FIRST for kernel optimizations
import unsloth
import psutil  # Required by Unsloth trainer
from unsloth import FastLanguageModel, is_bfloat16_supported

print(f"✅ Unsloth: {unsloth.__version__}")

In [None]:
# Configuration
max_seq_length = 2048
dtype = None  # Auto-detect
load_in_4bit = False  # Keep 16-bit weights (no 4-bit quantization) for best quality

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-0.6B",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print(f"✅ Loaded Qwen3-0.6B: {model.num_parameters()/1e6:.0f}M params")

In [None]:
# Apply LoRA for efficient fine-tuning
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-saving gradient checkpointing
    random_state=42,
)

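# Report the measured number of trainable weights instead of a fixed estimate
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"✅ LoRA applied - {trainable/1e6:.1f}M trainable parameters")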
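
For a rough sense of how many weights the LoRA settings above train, the adapter count can be worked out by hand: each adapted linear layer with `in` inputs and `out` outputs gains `r * (in + out)` weights. The sketch below assumes the layer shapes from the published Qwen3-0.6B configuration (hidden size 1024, 28 layers, 16 query heads and 8 key/value heads of dimension 128, MLP width 3072). None of these values are stated in this notebook, so check them against the model's `config.json`.

```python
# Assumed Qwen3-0.6B shapes (not from this notebook; verify in config.json).
HIDDEN = 1024
LAYERS = 28
Q_DIM = 16 * 128    # query heads * head dimension
KV_DIM = 8 * 128    # key/value heads * head dimension
MLP = 3072
RANK = 16           # matches r=16 in get_peft_model

# (in_features, out_features) for every module listed in target_modules
shapes = {
    "q_proj": (HIDDEN, Q_DIM),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (Q_DIM, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """Weights added by one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * LAYERS
print(f"LoRA weights per layer: {per_layer:,}")
print(f"LoRA weights in total:  {total:,}")
```

Under these assumptions the adapters hold about 10.1 million weights, roughly 1.7% of a 600M-parameter model. The value printed by the notebook cell is the authoritative figure.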

## 3️⃣ Prepare Dataset
Uses the cleaned Alpaca instruction-following dataset (`yahma/alpaca-cleaned`)

In [None]:
from datasets import load_dataset

# Prompt template
prompt_template = """### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

# Load dataset
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

def format_prompt(examples):
    texts = []
    for i, inp, out in zip(examples["instruction"], examples["input"], examples["output"]):
        text = prompt_template.format(instruction=i, input=inp, output=out)
        texts.append(text + tokenizer.eos_token)
    return {"text": texts}

dataset = dataset.map(format_prompt, batched=True, remove_columns=dataset.column_names)
print(f"✅ Dataset: {len(dataset):,} samples")
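
To see what a single training string looks like after formatting, here is a small standalone sketch that applies the same template to one made-up record. The `EOS_TOKEN` value is a hypothetical placeholder; the notebook appends `tokenizer.eos_token`, whose actual text depends on the tokenizer.

```python
PROMPT_TEMPLATE = """### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

# Hypothetical stand-in for tokenizer.eos_token
EOS_TOKEN = "<eos>"

def format_example(instruction, input_text, output):
    """Fill the template for one record and append the end-of-sequence marker."""
    text = PROMPT_TEMPLATE.format(
        instruction=instruction, input=input_text, output=output
    )
    return text + EOS_TOKEN

# Made-up record, for illustration only
sample = format_example("Translate to French.", "Good morning", "Bonjour")
print(sample)
```

The end-of-sequence marker teaches the model where a response stops. When a record has an empty `input` field, the `### Input:` section is simply left blank, which is the same shape as the prompt used in the inference test.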

## 4️⃣ Train Model
~15 minutes for the 60-step demo (increase `max_steps` for better results)

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,  # ⚡ Demo: 60 steps. Production: 500-2000
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
        report_to="none",
    ),
)
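
The batch settings above determine how much of the dataset the demo run actually sees. The arithmetic can be sketched as follows, assuming training runs on a single GPU and that the dataset has roughly 52k rows; both are assumptions, and the dataset cell prints the real row count.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60

num_gpus = 1            # assumption: only one GPU is used for training
dataset_size = 51_760   # assumption: approximate row count of the dataset

# Samples contributing to each optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Samples consumed over the whole run
samples_seen = effective_batch * max_steps
epoch_fraction = samples_seen / dataset_size

print(f"Effective batch size: {effective_batch}")
print(f"Samples seen: {samples_seen} ({epoch_fraction:.1%} of one epoch)")
```

With these numbers the demo covers under 1% of one epoch, which is why the step count is flagged as a demo value and larger values are suggested for real use.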

In [None]:
print("🚀 Training started...")
stats = trainer.train()
print("\n✅ Training complete!")
print(f"   Steps: {stats.global_step}")
print(f"   Loss: {stats.training_loss:.4f}")

## 5️⃣ Save Model

In [None]:
# Save LoRA adapters
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
print("✅ LoRA adapters saved")

# Merge into full model
model.save_pretrained_merged("merged_model", tokenizer, save_method="merged_16bit")
print("✅ Merged model saved")

## 6️⃣ Export to GGUF
Convert to GGUF Q4_K_M format for mobile deployment (~15 min)

In [None]:
# Export to GGUF with Q4_K_M quantization
# This creates a ~400MB file optimized for mobile
model.save_pretrained_gguf(
    "qwen3_phone", 
    tokenizer, 
    quantization_method="q4_k_m"  # Best quality/size ratio for mobile
)

print("✅ GGUF export complete!")
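
As a back-of-the-envelope check on the expected file size, the estimate below multiplies the parameter count by an average number of bits per weight. The 4.85 bits-per-weight figure for Q4_K_M is an assumption (a typical reported average that varies by model), and the 600M parameter count is the rounded value quoted in this notebook.

```python
def estimated_size_mb(num_params, bits_per_weight):
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6

NUM_PARAMS = 600_000_000   # rounded parameter count quoted in this notebook
Q4_K_M_BPW = 4.85          # assumption: typical average, varies by model
F16_BPW = 16

q4_size = estimated_size_mb(NUM_PARAMS, Q4_K_M_BPW)
f16_size = estimated_size_mb(NUM_PARAMS, F16_BPW)

print(f"16-bit weights: ~{f16_size:.0f} MB")
print(f"Q4_K_M weights: ~{q4_size:.0f} MB")
```

The estimate lands a little below the ~400MB quoted above. A real file also carries metadata and the tokenizer, and some tensors may be stored at higher precision, so treat this as a lower bound rather than a prediction.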

## 7️⃣ Test Inference

In [None]:
# Quick inference test
FastLanguageModel.for_inference(model)

prompt = """### Instruction:
Explain quantum computing in simple terms.

### Input:

### Response:
"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
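# do_sample=True is needed for temperature to take effect; greedy decoding ignores it
outputs = model.generate(**inputs, max_new_tokens=100, use_cache=True, do_sample=True, temperature=0.7)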
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

## 8️⃣ Package for Download

In [None]:
import os
import shutil

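# Find GGUF files in the working directory and in the export folder
search_dirs = [d for d in (".", "qwen3_phone") if os.path.isdir(d)]
gguf_files = [os.path.join(d, f) for d in search_dirs
              for f in os.listdir(d) if f.endswith(".gguf")]
print("📦 Generated files:")
for f in gguf_files:
    size = os.path.getsize(f) / 1e6
    print(f"   {f} ({size:.1f} MB)")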

# Copy tokenizer
if os.path.exists("merged_model/tokenizer.json"):
    shutil.copy("merged_model/tokenizer.json", "tokenizer.json")
    print("   tokenizer.json")

print("\n✅ Ready to download!")
print("   1. Click 'Save Version' (top right)")
print("   2. After save, go to Output tab")
print("   3. Download .gguf and tokenizer.json")
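
Before copying a large file to a phone, it can help to confirm that it really is a GGUF file. GGUF files begin with the four ASCII bytes `GGUF` followed by a 32-bit format version. The helper below is a minimal sketch of that check; `read_gguf_header` is a name introduced here for illustration, it assumes the common little-endian layout, and the demo runs on a tiny stand-in file rather than a real model.

```python
import os
import struct
import tempfile

def read_gguf_header(path):
    """Return the GGUF format version, or None if the magic bytes are missing."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return None
    # Assumes a little-endian file, which is the usual case
    (version,) = struct.unpack("<I", header[4:8])
    return version

# Demo on stand-in files that contain only a header, not a real model
with tempfile.TemporaryDirectory() as tmp:
    fake_gguf = os.path.join(tmp, "fake.gguf")
    with open(fake_gguf, "wb") as f:
        f.write(b"GGUF" + struct.pack("<I", 3))
    not_gguf = os.path.join(tmp, "other.bin")
    with open(not_gguf, "wb") as f:
        f.write(b"\x00" * 16)
    demo_version = read_gguf_header(fake_gguf)
    demo_invalid = read_gguf_header(not_gguf)

print(demo_version, demo_invalid)
```

In the notebook, the same function could be called on each path in `gguf_files` to catch a truncated or mislabeled export before downloading it.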

## 9️⃣ Deploy to Phone

In [None]:
# Print deployment instructions
print("""
📱 PHONE DEPLOYMENT GUIDE
========================

Option 1: Android Apps (Easiest)
--------------------------------
• PocketPal AI (Play Store) - Free, supports GGUF
• MLC Chat (Play Store) - Open source, uses MLC-format models rather than GGUF
→ Import your .gguf file in PocketPal AI

Option 2: ADB Push (Advanced)
-----------------------------
adb shell mkdir -p /data/local/tmp/llm
adb push /path/to/model.gguf /data/local/tmp/llm/
adb push tokenizer.json /data/local/tmp/llm/

Option 3: Termux + llama.cpp
----------------------------
pkg install clang cmake git
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release
./build/bin/llama-cli -m /path/to/model.gguf -p "Hello!" -n 50

Option 4: iOS
-------------
• LLM Farm (App Store)
• Use Files app to import .gguf

🎉 Enjoy your fine-tuned LLM running locally!
""")

---

## 📊 Summary

| Step | Time | Output |
|------|------|--------|
| Install deps | 3 min | - |
| Load model | 2 min | 600M params |
| Train (60 steps) | 15 min | LoRA adapters |
| Merge model | 1 min | merged_model/ |
| GGUF export | 15 min | ~400MB .gguf |
| **Total** | **~35 min** | **Phone-ready model** |

## 🔗 Resources
- [Unsloth GitHub](https://github.com/unslothai/unsloth)
- [llama.cpp](https://github.com/ggerganov/llama.cpp)
- [PocketPal AI](https://play.google.com/store/apps/details?id=com.pocketpalai)
- [Qwen3 Models](https://huggingface.co/Qwen)

---
*Created with Unsloth + Kaggle T4 GPU*