# Fine-tune Gemma 2B for Code Audit

**Goal:** Fine-tune `gemma-2-2b-it` (4-bit quantized) on an audit dataset for specialized code review

**Dataset:** 100 examples covering 13 audit tools

**Output:** LoRA adapters saved locally as `audit-multi-v1`, optionally pushed to Hugging Face as `amitrosen/audit-multi-v1`

**Setup:**
1. GPU: T4 x2 or P100
2. Add dataset: Upload `data/audit_dataset.jsonl` to Kaggle
3. Run All (~30 min)

## 1. Install Dependencies (Nuclear Clean)

**Strategy:** Clean install with Kaggle-proven stable versions
- torch 2.5.1 + torchvision 0.20.1 (no conflicts)
- unsloth colab-new variant (Kaggle optimized)
- xformers 0.0.28.post1 (pre-built, no compile)

**Time:** ~3-4 minutes

In [None]:
# KAGGLE UNSLOTH NUCLEAR FIX 2026

# Nuclear clean - remove all conflicting packages
print("Step 1/4: Cleaning existing packages...")
!pip uninstall -y torch torchvision torchaudio xformers unsloth transformers -q 2>&1 | tail -3
!pip cache purge -q

# Kaggle stable versions (proven to work)
print("\nStep 2/4: Installing PyTorch 2.5.1 (Kaggle stable)...")
!pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121 -q 2>&1 | tail -5

# Unsloth + deps (no version conflicts)
print("\nStep 3/4: Installing Unsloth (colab-new variant)...")
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" --no-deps -q 2>&1 | tail -10

print("\nStep 4/4: Installing dependencies...")
!pip install xformers==0.0.28.post1 trl==0.9.6 peft accelerate bitsandbytes datasets -q 2>&1 | tail -5

print("\n" + "="*50)
print("[OK] CLEAN INSTALL COMPLETE!")
print("="*50)

# Verify versions
import torch
print(f"\n[OK] CUDA available: {torch.cuda.is_available()}")
print(f"[OK] torch: {torch.__version__}")
if torch.cuda.is_available():
    print(f"[OK] GPU: {torch.cuda.get_device_name(0)}")

print("\n[NEXT] Run the next cell to load the model!")
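
A later `pip install` can replace an earlier pin when a package declares its own requirements, so it may be worth confirming that the pinned versions are the ones actually installed. The sketch below is one way to do that with the standard library. The `EXPECTED_PINS` table and both helper names are hypothetical; the versions are copied from the install cell.

```python
from importlib import metadata

# Hypothetical pin table copied from the install cell; adjust if you change versions.
EXPECTED_PINS = {
    "torch": "2.5.1",
    "torchvision": "0.20.1",
    "xformers": "0.0.28.post1",
    "trl": "0.9.6",
}


def installed_versions(names):
    """Return {name: version or None} for each distribution name."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def pin_mismatches(expected, installed):
    """List (name, expected, actual) for every pin that is not satisfied.

    Local build suffixes such as '+cu121' are ignored when comparing.
    """
    problems = []
    for name, want in expected.items():
        have = installed.get(name)
        base = have.split("+")[0] if have else None
        if base != want:
            problems.append((name, want, have))
    return problems


for name, want, have in pin_mismatches(EXPECTED_PINS, installed_versions(EXPECTED_PINS)):
    print(f"[WARN] {name}: expected {want}, found {have}")
```

If the loop prints nothing, every pin matched. A warning for `torch` after Step 4 would suggest that one of the later packages pulled in a different build.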

## 2. Load Model

Loads `unsloth/gemma-2-2b-it-bnb-4bit`, a 4-bit quantized checkpoint that fits on a T4 GPU, and attaches LoRA adapters (rank 16) to the attention and MLP projection layers.

In [None]:
import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Configuration
max_seq_length = 2048
dtype = None  # Auto-detect
load_in_4bit = True
model_name = "unsloth/gemma-2-2b-it-bnb-4bit"

# Check GPU
print(f"🔧 CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"🔧 GPU: {torch.cuda.get_device_name(0)}")
    print(f"🔧 VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Load model
print(f"\n🔄 Loading {model_name}...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

model.print_trainable_parameters()
print("\n✅ Model ready with LoRA adapters")
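
To sanity-check the number reported by `model.print_trainable_parameters()`, the adapter size can be estimated by hand: each adapted linear layer of shape (in, out) adds `r * (in + out)` parameters. The sketch below applies that formula. The layer shapes and the layer count are assumptions taken from the published Gemma 2 2B configuration, not values read from the loaded model, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(rank, layer_shapes, num_layers):
    """Count LoRA parameters: each adapted (in, out) linear adds rank * (in + out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes.values())
    return per_layer * num_layers


# ASSUMED (in_features, out_features) per target module for Gemma 2 2B:
# hidden size 2304, MLP size 9216, 8 query heads and 4 KV heads of dim 256.
GEMMA2_2B_SHAPES = {
    "q_proj": (2304, 2048),
    "k_proj": (2304, 1024),
    "v_proj": (2304, 1024),
    "o_proj": (2048, 2304),
    "gate_proj": (2304, 9216),
    "up_proj": (2304, 9216),
    "down_proj": (9216, 2304),
}

# ASSUMED 26 decoder layers; r=16 matches the get_peft_model call.
total = lora_param_count(rank=16, layer_shapes=GEMMA2_2B_SHAPES, num_layers=26)
print(f"Estimated LoRA parameters: {total:,}")
```

If the estimate and the printed trainable-parameter count disagree, the most likely cause is that the assumed shapes do not match the checkpoint actually loaded.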

## 3. Load Audit Dataset

In [None]:
import glob

# Find dataset - try multiple paths
possible_paths = [
    "/kaggle/input/audit-dataset/audit_dataset.jsonl",
    "/kaggle/input/*/audit_dataset.jsonl",
    "/kaggle/input/*/*.jsonl",
    "../data/audit_dataset.jsonl",
]

DATASET_PATH = None
for pattern in possible_paths:
    # Sort matches so the same file is picked on every run
    matches = sorted(glob.glob(pattern))
    if matches:
        DATASET_PATH = matches[0]
        break

if DATASET_PATH:
    print(f"📂 Loading dataset from {DATASET_PATH}")
    dataset = load_dataset("json", data_files=DATASET_PATH, split="train")
else:
    raise FileNotFoundError(
        "❌ Dataset not found! Please upload 'audit_dataset.jsonl' to Kaggle.\n"
        "   Go to: Add Data -> Upload -> Select file"
    )

print(f"✅ Loaded {len(dataset)} audit examples")
print(f"\n📝 Sample:")
print(f"  Instruction: {dataset[0]['instruction'][:70]}...")
print(f"  Output: {dataset[0]['output'][:70]}...")
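
The formatting step later in the notebook reads the `instruction` and `output` fields of every record, so a single malformed line can fail the run partway through. A minimal validation sketch, assuming the JSONL schema implied by the sample printout above, could look like this. The helper name and the two sample records are hypothetical.

```python
import json

REQUIRED_FIELDS = ("instruction", "output")


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) for records that would break formatting."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty '{field}'"))
    return problems


# Hypothetical records for illustration only; the second one lacks 'output'.
sample = [
    '{"instruction": "Analyze test coverage", "output": "Coverage is 0%."}',
    '{"instruction": "Check for dead code"}',
]
print(validate_jsonl_lines(sample))
```

On Kaggle the same function can be pointed at the real file with `with open(DATASET_PATH) as f: validate_jsonl_lines(f)`, since a file object iterates line by line.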

## 4. Prepare Training Data

In [None]:
# Format dataset with an Alpaca-style prompt template
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    outputs = examples["output"]
    texts = []
    for instruction, output in zip(instructions, outputs):
        text = alpaca_prompt.format(instruction, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

print("🔄 Preparing dataset...")
dataset = dataset.map(formatting_prompts_func, batched=True)

print("✅ Dataset formatted for training")
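
With `batched=True`, `dataset.map` hands the function a dict of columns, each a list, rather than one record at a time. The dry run below mirrors `formatting_prompts_func` on a plain dict so the resulting strings can be inspected without a tokenizer. The shortened template, the `<eos>` placeholder, and the two sample rows are stand-ins for illustration, not the notebook's real values.

```python
# Shortened stand-in for alpaca_prompt; the notebook's template has a longer preamble.
PROMPT = "### Instruction:\n{}\n\n### Response:\n{}"
# Placeholder; the notebook uses tokenizer.eos_token.
EOS = "<eos>"


def format_batch(examples, template=PROMPT, eos=EOS):
    """Build one training string per row from a dict of equal-length lists."""
    return {
        "text": [
            template.format(instruction, output) + eos
            for instruction, output in zip(examples["instruction"], examples["output"])
        ]
    }


# Hypothetical rows, shaped the way a batched map call passes them.
batch = {
    "instruction": ["Analyze test coverage", "Check for dead code"],
    "output": ["Coverage is 0%.", "No dead code found."],
}
for text in format_batch(batch)["text"]:
    print(repr(text))
```

Every string ends with the end-of-sequence marker, which is what lets the fine-tuned model learn where a response stops.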

## 5. Train Model

In [None]:
# Training configuration
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=100,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

print("✅ Trainer configured")
print("\n🚀 Starting training...")

# Train!
trainer_stats = trainer.train()

print("\n🎉 Training complete!")
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f}s")
print(f"Samples/second: {trainer_stats.metrics['train_samples_per_second']:.2f}")
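
The training arguments imply a fixed sample budget that is easy to work out by hand. The sketch below does the arithmetic for the values in the cell above, assuming training runs on a single GPU (pass `num_devices=2` if both T4s are actually used in data-parallel mode). `training_budget` is a hypothetical helper.

```python
def training_budget(num_examples, per_device_batch, grad_accum, max_steps, num_devices=1):
    """Work out the effective batch size and how many passes over the data max_steps implies."""
    effective_batch = per_device_batch * grad_accum * num_devices
    samples_seen = effective_batch * max_steps
    return {
        "effective_batch": effective_batch,
        "samples_seen": samples_seen,
        "epochs": samples_seen / num_examples,
    }


# Values copied from the TrainingArguments above; 100 is the stated dataset size.
budget = training_budget(num_examples=100, per_device_batch=2, grad_accum=4, max_steps=100)
print(budget)
```

Under these assumptions the model sees each of the 100 examples about 8 times. If the test output looks memorized rather than generalized, lowering `max_steps` is one knob to try.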

## 6. Test Inference

In [None]:
# Test inference
FastLanguageModel.for_inference(model)

test_instruction = "Analyze test coverage: 330 files found, 5 executable, 0% coverage"
inputs = tokenizer(
    [alpaca_prompt.format(test_instruction, "")],
    return_tensors="pt"
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=256, use_cache=True)
result = tokenizer.batch_decode(outputs)

print("📝 Test Inference:")
print(result[0])
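
`tokenizer.batch_decode(outputs)` returns the prompt and the generated text together, so the printed result repeats the whole Alpaca preamble. A small post-processing sketch that keeps only the response is shown below. The helper name is hypothetical, and the `<bos>` and `<eos>` strings are assumed to be the tokenizer's special tokens.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded, special_tokens=("<bos>", "<eos>")):
    """Return the text after the first response marker, with special tokens removed."""
    _, found, tail = decoded.partition(RESPONSE_MARKER)
    text = tail if found else decoded  # fall back to the full text if no marker
    for token in special_tokens:
        text = text.replace(token, "")
    return text.strip()


# Hypothetical decoded string for illustration.
decoded = (
    "<bos>Below is an instruction that describes a task.\n\n"
    "### Instruction:\nAnalyze test coverage\n\n"
    "### Response:\nAdd tests for the executable files first.<eos>"
)
print(extract_response(decoded))
```

In the notebook this would be called as `extract_response(result[0])`.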

## 7. Save Model

In [None]:
# Save model locally
model.save_pretrained("audit-multi-v1")
tokenizer.save_pretrained("audit-multi-v1")

print("✅ Model saved locally to 'audit-multi-v1/'")

# Optional: Push to HuggingFace
# Uncomment and add your token:
# from huggingface_hub import login
# login(token="YOUR_HF_TOKEN")
# model.push_to_hub("amitrosen/audit-multi-v1", token="YOUR_HF_TOKEN")
# tokenizer.push_to_hub("amitrosen/audit-multi-v1", token="YOUR_HF_TOKEN")
# print("🚀 Model pushed to HuggingFace!")
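
Pasting a token into a notebook cell risks leaking it if the notebook is shared. One alternative, sketched below, reads it from an environment variable instead. The helper name is hypothetical, and it assumes the token has been stored under `HF_TOKEN`.

```python
import os


def get_hf_token(env=None, var_name="HF_TOKEN"):
    """Fetch the Hugging Face token from the environment, failing loudly if it is absent."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set {var_name} before pushing to the Hub.")
    return token


try:
    hf_token = get_hf_token()
    print("Token found; ready to push.")
except RuntimeError as err:
    print(f"Skipping push: {err}")
```

The returned value can then replace `"YOUR_HF_TOKEN"` in the commented `push_to_hub` calls. On Kaggle, the token can also be stored as a notebook secret and read with `UserSecretsClient().get_secret("HF_TOKEN")` from `kaggle_secrets`.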

## Summary

✅ **Model:** gemma-2-2b-it (Google's open-weight Gemma 2, loaded as a 4-bit Unsloth checkpoint)

✅ **Dataset:** 100 audit examples

✅ **Training:** ~30 min on T4 GPU

✅ **Output:** audit-multi-v1 (LoRA adapters)

**Next Steps:**
1. Download the model from Kaggle output
2. Test on real audit scenarios
3. Integrate into MCP server
4. Compare with base model performance