# Tool-Calling Fine-Tuning with SFT

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ProfSynapse/Toolset-Training/blob/main/Trainers/notebooks/sft_colab_tool_calling.ipynb)

Train models to use the **Claudesidian-MCP toolset** for Obsidian vault operations.

**Method:** SFT (Supervised Fine-Tuning) - Direct supervision for learning tool-calling behavior

**Recommended GPU:** 
- 7B models: T4 (15GB) - Free Colab tier
- 13B models: A100 (40GB) - Colab Pro
- 70B models: A100 (80GB) - Colab Pro+
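
As a rough sanity check on these recommendations, the sketch below estimates the memory taken by 4-bit quantized weights alone, assuming 0.5 bytes per parameter. `estimate_4bit_weights_gib` is an illustrative helper, not part of Unsloth. It ignores quantization metadata, LoRA weights, optimizer state, and activations, all of which add to real usage.

```python
def estimate_4bit_weights_gib(params_billions: float) -> float:
    """Approximate size of 4-bit weights in GiB (0.5 bytes per parameter)."""
    bytes_per_param = 0.5  # 4 bits; assumption: ignores quantization overhead
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / 1024**3


for size in (7, 13, 70):
    print(f"{size}B model: ~{estimate_4bit_weights_gib(size):.1f} GiB of weights")
```

The gap between these figures and the GPU sizes listed above is the headroom left for activations, gradients, and optimizer state during training.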

## 1. Installation

Install Unsloth and dependencies. This takes ~2 minutes.

In [None]:
%%capture
# Install Unsloth for faster training (%%capture must be the first line of the cell)
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [None]:
%%capture
# Install training dependencies with specific versions (%%capture must be the first line of the cell)
!pip install -U "transformers>=4.45.0"
!pip install "datasets==4.3.0"  # Specific version required by Unsloth
!pip install -U accelerate bitsandbytes
!pip install -U trl peft xformers triton

## 2. Configuration

Set your HuggingFace token and model parameters.

In [None]:
# HuggingFace credentials (get token from https://huggingface.co/settings/tokens)
import os
from google.colab import userdata

# Store your HF token in Colab secrets (left sidebar → key icon)
# Secret name: HF_TOKEN
HF_TOKEN = userdata.get('HF_TOKEN')
os.environ['HF_TOKEN'] = HF_TOKEN

print("✓ HuggingFace token loaded")
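
If the notebook is also run outside Colab, the token lookup could fall back to an environment variable. The following is a minimal sketch under that assumption; `load_hf_token` is a hypothetical helper name, and the broad `except` is a simplification that treats any Colab-side failure (not in Colab, secret missing, access not granted) the same way.

```python
import os


def load_hf_token(secret_name: str = "HF_TOKEN") -> str:
    """Return the token from Colab secrets if available, else from the environment."""
    token = None
    try:
        from google.colab import userdata  # only importable inside Colab
        token = userdata.get(secret_name)
    except Exception:
        # Not running in Colab, or the secret is missing / not shared with this notebook
        token = None
    token = token or os.environ.get(secret_name)
    if not token:
        raise RuntimeError(
            f"{secret_name} is not set; add it as a Colab secret or an environment variable"
        )
    return token
```

Failing early here gives a clearer error than a rejected download or upload later in the notebook.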

In [None]:
# Model Configuration
# Options:
#   7B:  "unsloth/mistral-7b-v0.3-bnb-4bit" (recommended for T4)
#   8B:  "unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit"
#   12B: "unsloth/Pixtral-12B-2409-bnb-4bit"
#   12B: "unsloth/gemma-3-12b-it-unsloth-bnb-4bit"
#   13B: "unsloth/llama-2-13b-bnb-4bit" (requires A100)
#   14B: "unsloth/Qwen3-14B-unsloth-bnb-4bit"
#   17B: "unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit" (MoE: 17B active parameters, 16 experts)
#   20B: "unsloth/gpt-oss-20b-unsloth-bnb-4bit"
#   24B: "unsloth/Mistral-Small-3.2-24B-Instruct-2506-unsloth-bnb-4bit"
#   32B: "unsloth/Qwen3-VL-32B-Instruct-unsloth-bnb-4bit"
#   70B: "unsloth/Llama-3.3-70B-Instruct-bnb-4bit" 

MODEL_NAME = "unsloth/mistral-7b-v0.3-bnb-4bit"  # Change this for different models
MAX_SEQ_LENGTH = 2048

# Dataset
DATASET_NAME = "professorsynapse/claudesidian-synthetic-dataset"
DATASET_FILE = "syngen_tools_sft_pingpong_11.18.25.jsonl"

# Output
OUTPUT_MODEL_NAME = "nexus-tools-sft"  # Will be uploaded to: your-username/nexus-tools-sft

print(f"Model: {MODEL_NAME}")
print(f"Dataset: {DATASET_NAME}/{DATASET_FILE}")
print(f"Output: {OUTPUT_MODEL_NAME}")

## 3. Load Model and Tokenizer

In [None]:
from unsloth import FastLanguageModel
import torch

# Check GPU
print(f"Using GPU: {torch.cuda.get_device_name(0)}")
print(f"CUDA version: {torch.version.cuda}")
print(f"Available VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
print()

In [None]:
# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # Auto-detect
    load_in_4bit=True,
    token=HF_TOKEN,
)

print("✓ Model loaded successfully")

## 4. Apply LoRA Adapters

LoRA allows efficient fine-tuning by training only a small percentage of parameters.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=32,  # LoRA rank
    lora_alpha=64,  # LoRA alpha scaling
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

print("✓ LoRA adapters applied")
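
To make "a small percentage of parameters" concrete, the sketch below counts the LoRA parameters added by the configuration above. The layer shapes are the published Mistral-7B dimensions, hard-coded here as an assumption rather than read from the loaded model; other base models will give different totals. `lora_param_count` is an illustrative helper.

```python
# Assumed Mistral-7B shapes (not read from the loaded model): (in_features, out_features)
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to every targeted linear layer."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_block * NUM_LAYERS


trainable = lora_param_count(rank=32)
print(f"Trainable LoRA parameters: {trainable:,}")
print(f"Adapter size at 4 bytes/param: {trainable * 4 / 1024**2:.0f} MiB")
```

Under these assumptions, rank 32 gives about 84 million trainable parameters, roughly 1% of a 7B model. Halving `r` halves both the parameter count and the adapter size.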

## 5. Load and Prepare Dataset

In [None]:
from datasets import load_dataset

# Load dataset from HuggingFace
print(f"Loading dataset: {DATASET_NAME}")
dataset = load_dataset(
    DATASET_NAME,
    data_files=DATASET_FILE,
    split="train"
)

print(f"✓ Loaded {len(dataset)} examples")
print("\nSample:")
print(dataset[0])

In [None]:
# Format dataset for SFT training
def format_chat_template(example):
    """Convert conversations to tokenizer's chat template."""
    conversations = example["conversations"]
    
    # Apply chat template
    text = tokenizer.apply_chat_template(
        conversations,
        tokenize=False,
        add_generation_prompt=False
    )
    
    return {"text": text}

# Apply formatting
dataset = dataset.map(
    format_chat_template,
    remove_columns=dataset.column_names,
    desc="Formatting dataset"
)

print("✓ Dataset formatted for training")
print("\nFormatted example (first 500 chars):")
print(dataset[0]["text"][:500])
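
`apply_chat_template` expects each conversation to be a list of messages with `role` and `content` keys, so a malformed row can make the formatting step fail partway through. The sketch below shows one way to screen rows first. The set of allowed roles and the sample rows are assumptions for illustration, not taken from the dataset.

```python
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}  # assumption: typical chat roles


def is_valid_conversation(example: dict) -> bool:
    """Check that example["conversations"] is a non-empty list of role/content messages."""
    messages = example.get("conversations")
    if not isinstance(messages, list) or not messages:
        return False
    for message in messages:
        if not isinstance(message, dict):
            return False
        if message.get("role") not in ALLOWED_ROLES:
            return False
        if not isinstance(message.get("content"), str):
            return False
    return True


# Hypothetical rows, for illustration only
good = {"conversations": [
    {"role": "user", "content": "Create a note called 'Ideas'"},
    {"role": "assistant", "content": "(tool call goes here)"},
]}
bad = {"conversations": [{"from": "human", "value": "hi"}]}
print(is_valid_conversation(good), is_valid_conversation(bad))
```

With a `datasets.Dataset`, this could be applied as `dataset.filter(is_valid_conversation)` before the `map` call, since `map` removes the original `conversations` column.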

## 6. Configure Training

SFT uses a higher learning rate than KTO because it teaches the behavior from scratch.

In [None]:
from trl import SFTTrainer, SFTConfig
from unsloth import is_bfloat16_supported

# Training configuration
training_args = SFTConfig(
    # Output
    output_dir="./outputs",
    
    # Batch configuration
    per_device_train_batch_size=2,  # Adjust based on GPU memory
    gradient_accumulation_steps=4,  # Effective batch = 8
    
    # Learning rate (MUCH higher than KTO)
    learning_rate=2e-4,
    max_grad_norm=1.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    
    # Training schedule
    num_train_epochs=3,
    
    # Sequence length
    max_seq_length=MAX_SEQ_LENGTH,
    
    # SFT-specific
    packing=False,
    dataset_text_field="text",
    
    # Optimizations
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    optim="adamw_8bit",
    gradient_checkpointing=True,
    
    # Logging
    logging_steps=5,
    save_steps=100,
    save_total_limit=2,
    
    # Misc
    seed=42,
    report_to="none",  # Disable W&B for Colab
)

print("✓ Training configuration ready")
print(f"  Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Epochs: {training_args.num_train_epochs}")
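
To see what `warmup_ratio=0.1` and the cosine schedule mean in terms of steps, the sketch below approximates the number of optimizer steps and the learning rate at a few points. The dataset size of 1000 is a placeholder, and the Trainer's exact rounding of steps per epoch may differ slightly from this arithmetic.

```python
import math


def training_plan(num_examples, per_device_batch=2, grad_accum=4, epochs=3, warmup_ratio=0.1):
    """Approximate optimizer-step counts; the Trainer's exact rounding may differ."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return total_steps, warmup_steps


def lr_at_step(step, total_steps, warmup_steps, peak_lr=2e-4):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total, warmup = training_plan(num_examples=1000)  # 1000 is a placeholder dataset size
for step in (0, warmup, total // 2, total):
    print(f"step {step:4d}: lr = {lr_at_step(step, total, warmup):.2e}")
```

Because `save_steps=100`, a run of this size would write only a few checkpoints, which is worth keeping in mind if a Colab session disconnects mid-training.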

## 7. Initialize Trainer

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
)

print("✓ Trainer initialized")

## 8. Train!

Expect roughly 45 minutes for 7B models and 1.5 hours for 13B models; actual time depends on the GPU and the dataset size.

In [None]:
# Check GPU memory before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
print()

In [None]:
# Start training
print("=" * 60)
print("STARTING TRAINING")
print("=" * 60)
print()

trainer_stats = trainer.train()

print()
print("=" * 60)
print("TRAINING COMPLETED")
print("=" * 60)

In [None]:
# Show memory stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

## 9. Save and Upload to HuggingFace

We'll save both the LoRA adapters (small, ~320MB) and the merged 16-bit model (full quality, ~14GB). Both sizes are for the default 7B model.

In [None]:
# Save LoRA adapters locally
model.save_pretrained("./lora_model")
tokenizer.save_pretrained("./lora_model")

print("✓ LoRA adapters saved locally")

In [None]:
# Upload LoRA adapters to HuggingFace
model.push_to_hub(
    OUTPUT_MODEL_NAME,
    token=HF_TOKEN,
    private=False
)
tokenizer.push_to_hub(
    OUTPUT_MODEL_NAME,
    token=HF_TOKEN,
    private=False
)

print("✓ LoRA adapters uploaded to HuggingFace")
print(f"  View at: https://huggingface.co/YOUR-USERNAME/{OUTPUT_MODEL_NAME}")

In [None]:
# Upload merged 16-bit model (full quality)
print("Merging LoRA weights into base model (16-bit)...")
print("This will take ~5 minutes...")

model.push_to_hub_merged(
    f"{OUTPUT_MODEL_NAME}-merged",
    tokenizer,
    save_method="merged_16bit",
    token=HF_TOKEN,
    private=False
)

print("✓ Merged model uploaded to HuggingFace")
print(f"  View at: https://huggingface.co/YOUR-USERNAME/{OUTPUT_MODEL_NAME}-merged")

## 10. Test the Model (Optional)

Quick test to see if the model learned tool-calling behavior.

In [None]:
# Enable inference mode
FastLanguageModel.for_inference(model)

# Test prompt
test_messages = [
    {"role": "user", "content": "Create a new note called 'Meeting Notes' with the content 'Discussed Q4 roadmap'"}
]

# Format and generate
inputs = tokenizer.apply_chat_template(
    test_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=512,
    temperature=0.2,
    top_p=0.9,
    do_sample=True
)

# Decode only the newly generated tokens so the prompt is not echoed back
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print("Model response:")
print("=" * 60)
print(response)
print("=" * 60)
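
Reading the raw response by eye works for one prompt, but a programmatic check is easier to repeat. The sketch below assumes the model emits its tool call as a JSON object somewhere in the text. That is an assumption about the output format, and the tool name and fields in the sample string are made up for illustration. `extract_first_json_object` is a hypothetical helper.

```python
import json


def extract_first_json_object(text: str):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # this brace did not start valid JSON; keep scanning
        if isinstance(obj, dict):
            return obj
    return None


# Hypothetical model output; the tool name and fields are made up for illustration
sample = 'Creating the note now. {"name": "create_note", "arguments": {"title": "Meeting Notes"}}'
call = extract_first_json_object(sample)
print(call["name"] if call else "no tool call found")
```

If the chat template wraps tool calls in special tokens, decoding with `skip_special_tokens=False` may be needed before applying a check like this.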

## Done!

Your model has been trained and uploaded to HuggingFace. 

**Next steps:**
1. Download the model for local testing
2. Create GGUF quantizations for Ollama/llama.cpp
3. Run the Evaluator to test tool-calling accuracy

**Model locations:**
- LoRA adapters: `https://huggingface.co/YOUR-USERNAME/{OUTPUT_MODEL_NAME}`
- Merged 16-bit: `https://huggingface.co/YOUR-USERNAME/{OUTPUT_MODEL_NAME}-merged`