# Tool-Calling Fine-Tuning with SFT

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ProfSynapse/Toolset-Training/blob/main/Trainers/notebooks/sft_colab_tool_calling.ipynb)

Train models to use the **Claudesidian-MCP toolset** for Obsidian vault operations.

**Method:** SFT (Supervised Fine-Tuning) - Direct supervision for learning tool-calling behavior

**Recommended GPU:** 
- 7B models: T4 (15GB) - Free Colab tier
- 13B models: A100 (40GB) - Colab Pro
- 70B models: A100 (80GB) - Colab Pro+
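
As a rough sanity check on these recommendations, the sketch below estimates the memory taken by the weights alone when a model is loaded in 4-bit, assuming about half a byte per parameter. It ignores quantization overhead, LoRA adapters, optimizer state, and activations, so real usage during training will be noticeably higher. The function name is illustrative and not part of any library.

```python
def estimate_4bit_weight_gib(params_billion: float) -> float:
    """Approximate GiB needed for the weights of a 4-bit quantized model."""
    bytes_per_param = 0.5  # assumption: 4 bits per weight, no overhead
    return params_billion * 1e9 * bytes_per_param / 1024**3


for size in (7, 13, 70):
    print(f"{size}B model: ~{estimate_4bit_weight_gib(size):.1f} GiB of weights")
```

The gap between these figures and the GPU sizes listed above is the headroom left for activations, gradients, and optimizer state.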

## 1. Installation

Install Unsloth and dependencies. This takes ~2 minutes.

In [None]:
%%capture
# Install Unsloth for faster training (%%capture must be the first line of the cell)
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [None]:
%%capture
# Install training dependencies with specific versions
!pip install -U "transformers>=4.45.0"
!pip install "datasets==4.3.0"  # Specific version required by Unsloth
!pip install -U accelerate bitsandbytes
!pip install -U trl peft xformers triton

## 2. Mount Google Drive (Optional but Recommended)

Save checkpoints to Google Drive so they persist if runtime disconnects.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Create output directory in Google Drive
import os
DRIVE_OUTPUT_DIR = "/content/drive/MyDrive/SFT_Training"
os.makedirs(DRIVE_OUTPUT_DIR, exist_ok=True)

print(f"✓ Google Drive mounted")
print(f"✓ Checkpoints will be saved to: {DRIVE_OUTPUT_DIR}")

In [ ]:
# HuggingFace credentials (get token from https://huggingface.co/settings/tokens)
import os
from google.colab import userdata

# Store your HF token in Colab secrets (left sidebar → key icon)
# Secret name: HF_TOKEN
HF_TOKEN = userdata.get('HF_TOKEN')
os.environ['HF_TOKEN'] = HF_TOKEN

# Get HuggingFace username
from huggingface_hub import HfApi
api = HfApi()
hf_user = api.whoami(token=HF_TOKEN)["name"]

print(f"✓ HuggingFace token loaded")
print(f"✓ Username: {hf_user}")

In [None]:
# Model Configuration
# Options:
#   7B:  "unsloth/mistral-7b-v0.3-bnb-4bit" (recommended for T4)
#   8B:  "unsloth/llama-3.1-8b-instruct-bnb-4bit"
#   13B: "unsloth/llama-2-13b-bnb-4bit" (requires A100)
#   70B: "unsloth/llama-3.1-70b-instruct-bnb-4bit" (requires A100 80GB)

MODEL_NAME = "unsloth/mistral-7b-v0.3-bnb-4bit"  # Change this for different models
MAX_SEQ_LENGTH = 2048

# Dataset
DATASET_NAME = "professorsynapse/claudesidian-synthetic-dataset"
DATASET_FILE = "syngen_tools_sft_pingpong_11.18.25.jsonl"

# Output (will be uploaded to: hf_user/OUTPUT_MODEL_NAME)
OUTPUT_MODEL_NAME = "nexus-tools-sft"

print(f"Model: {MODEL_NAME}")
print(f"Dataset: {DATASET_NAME}/{DATASET_FILE}")
print(f"Output will be uploaded to:")
print(f"  - LoRA: {hf_user}/{OUTPUT_MODEL_NAME}")
print(f"  - Merged: {hf_user}/{OUTPUT_MODEL_NAME}-merged")

## 3. Load Model and Tokenizer

In [None]:
from unsloth import FastLanguageModel
import torch

# Check GPU
print(f"Using GPU: {torch.cuda.get_device_name(0)}")
print(f"CUDA version: {torch.version.cuda}")
print(f"Available VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
print()

In [None]:
# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # Auto-detect
    load_in_4bit=True,
    token=HF_TOKEN,
)

print("✓ Model loaded successfully")

## 4. Apply LoRA Adapters

LoRA allows efficient fine-tuning by training only a small percentage of parameters.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=32,  # LoRA rank
    lora_alpha=64,  # LoRA alpha scaling
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

print("✓ LoRA adapters applied")
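
To make "a small percentage of parameters" concrete, the sketch below counts the weights that LoRA adds at rank 32 across the seven target modules. The layer shapes and layer count are assumptions taken from the published Mistral-7B configuration and will differ for other models. The helper is illustrative, not an Unsloth or PEFT API.

```python
# Assumed (in_features, out_features) per target module for Mistral-7B:
# hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128.
MISTRAL_7B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # assumption: Mistral-7B has 32 decoder layers


def lora_trainable_params(shapes, rank, num_layers):
    """Each adapted matrix gains A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


trainable = lora_trainable_params(MISTRAL_7B_SHAPES, rank=32, num_layers=NUM_LAYERS)
print(f"Trainable LoRA parameters: {trainable:,}")
```

Under these assumptions the adapters come to roughly 84 million weights, on the order of 1% of a 7B-parameter model.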

## 5. Load and Prepare Dataset

In [None]:
from datasets import load_dataset

# Load dataset from HuggingFace
print(f"Loading dataset: {DATASET_NAME}")
dataset = load_dataset(
    DATASET_NAME,
    data_files=DATASET_FILE,
    split="train"
)

print(f"✓ Loaded {len(dataset)} examples")
print(f"\nSample:")
print(dataset[0])

# IMPORTANT: Set chat template if not already set
if tokenizer.chat_template is None:
    print("\n⚠️  Tokenizer has no chat template, setting ChatML template...")
    tokenizer.chat_template = "{% for message in messages %}{% if message['role'] == 'user' %}{{ '<|im_start|>user\n' + message['content'] + '<|im_end|>\n' }}{% elif message['role'] == 'assistant' %}{{ '<|im_start|>assistant\n' + message['content'] + '<|im_end|>\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
    print("✓ Chat template set to ChatML format")
else:
    print("\n✓ Tokenizer already has chat template")

In [None]:
# Format dataset for SFT training
def format_chat_template(example):
    """Convert conversations to tokenizer's chat template."""
    conversations = example["conversations"]
    
    # Apply chat template
    text = tokenizer.apply_chat_template(
        conversations,
        tokenize=False,
        add_generation_prompt=False
    )
    
    return {"text": text}

# Apply formatting
dataset = dataset.map(
    format_chat_template,
    remove_columns=dataset.column_names,
    desc="Formatting dataset"
)

print("✓ Dataset formatted for training")
print(f"\nFormatted example (first 500 chars):")
print(dataset[0]["text"][:500])
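
For readers who want to see what the ChatML fallback template produces without loading a tokenizer, here is a minimal pure-Python sketch of the same rendering logic. It assumes each message is a dict with `role` and `content` keys. The sample conversation is hypothetical and does not reflect the dataset's actual tool-call syntax.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render user/assistant turns the way the ChatML fallback template does."""
    parts = []
    for message in messages:
        # The fallback template only emits user and assistant turns.
        if message["role"] in ("user", "assistant"):
            parts.append(
                f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
            )
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Hypothetical conversation; the real dataset's content may look different.
example = [
    {"role": "user", "content": "List my notes"},
    {"role": "assistant", "content": "Calling the vault tool."},
]
print(render_chatml(example))
```

Note that messages with any other role, such as `system`, are silently dropped by this template, so it is worth checking which roles the dataset actually contains.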

## 6. Training Configuration

In [ ]:
from trl import SFTTrainer, SFTConfig
from unsloth import is_bfloat16_supported
from datetime import datetime

# Create timestamped output directory
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_dir = f"{DRIVE_OUTPUT_DIR}/{timestamp}"

# Training configuration
training_args = SFTConfig(
    # Output (saved to Google Drive)
    output_dir=output_dir,
    
    # Batch configuration
    per_device_train_batch_size=2,  # Adjust based on GPU memory
    gradient_accumulation_steps=4,  # Effective batch = 8
    
    # Learning rate (MUCH higher than KTO)
    learning_rate=2e-4,
    max_grad_norm=1.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    
    # Training schedule
    num_train_epochs=3,
    
    # Sequence length
    max_seq_length=MAX_SEQ_LENGTH,
    
    # SFT-specific
    packing=False,
    dataset_text_field="text",
    
    # Optimizations
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    optim="adamw_8bit",
    gradient_checkpointing=True,
    
    # Logging & Checkpointing (saved to Google Drive)
    logging_steps=5,
    save_steps=100,  # Save checkpoint every 100 steps
    save_total_limit=3,  # Keep last 3 checkpoints
    
    # Misc
    seed=42,
    report_to="none",  # Disable W&B for Colab
)

print("✓ Training configuration ready")
print(f"  Output directory: {output_dir}")
print(f"  Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Checkpoints every: {training_args.save_steps} steps")
print(f"  Keeping last: {training_args.save_total_limit} checkpoints")
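
The batch, epoch, warmup, and checkpoint settings above interact, so a quick calculation helps predict how long a run is and when checkpoints appear. The sketch below is an approximation for a single GPU. The dataset size of 1000 is a hypothetical placeholder, and the exact step count reported by the trainer can differ by a step or two depending on the library version.

```python
import math


def training_schedule(num_examples, per_device_batch=2, grad_accum=4,
                      epochs=3, warmup_ratio=0.1, save_steps=100):
    """Approximate optimizer-step counts for a single-GPU run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
        # Steps at which a periodic checkpoint would be written
        "checkpoint_steps": list(range(save_steps, total_steps + 1, save_steps)),
    }


# 1000 is a hypothetical dataset size, not the size of the real dataset.
print(training_schedule(1000))
```

If the total step count is below `save_steps`, no periodic checkpoint is written at all, which is a good reason to lower `save_steps` for small datasets.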

## 7. Initialize Trainer

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
)

print("✓ Trainer initialized")

## 8. Train!

This will take ~45 minutes for 7B models, ~1.5 hours for 13B models.

In [None]:
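# Set to a checkpoint directory (e.g. f"{output_dir}/checkpoint-100") to resume
# an interrupted run; leave as None to train from scratch.
RESUME_FROM_CHECKPOINT = None

# Start training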
print("=" * 60)
print("STARTING TRAINING")
print("=" * 60)
print()

if RESUME_FROM_CHECKPOINT:
    print(f"Resuming from checkpoint: {RESUME_FROM_CHECKPOINT}\n")

trainer_stats = trainer.train(resume_from_checkpoint=RESUME_FROM_CHECKPOINT)

print()
print("=" * 60)
print("TRAINING COMPLETED")
print("=" * 60)

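## 9. Resume from Checkpoint (Optional)

If training was interrupted, set `RESUME_FROM_CHECKPOINT` to the path of the last checkpoint saved in the output directory and re-run the training cell. Before restarting, check how much GPU memory the runtime has already reserved.

In [None]:
# Check GPU memory before (re)starting training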
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
print()
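
Checkpoints are written as `checkpoint-<step>` folders inside the run's output directory, so the most recent one can be located programmatically. The helper below is a minimal sketch using only the standard library. Its name is hypothetical, and it assumes you pass the timestamped directory of the interrupted run rather than a freshly created one.

```python
import re
from pathlib import Path


def find_latest_checkpoint(run_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    candidates = []
    for path in Path(run_dir).glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare numerically so checkpoint-1000 beats checkpoint-200
    step, latest = max(candidates, key=lambda item: item[0])
    return str(latest)
```

The returned path (or `None`) can be assigned to `RESUME_FROM_CHECKPOINT`. Sorting by the numeric step matters because a plain string sort would rank `checkpoint-200` above `checkpoint-1000`.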

In [None]:
# Upload LoRA adapters to HuggingFace
model.push_to_hub(
    f"{hf_user}/{OUTPUT_MODEL_NAME}",
    token=HF_TOKEN,
    private=False
)
tokenizer.push_to_hub(
    f"{hf_user}/{OUTPUT_MODEL_NAME}",
    token=HF_TOKEN,
    private=False
)

print(f"✓ LoRA adapters uploaded to HuggingFace")
print(f"  View at: https://huggingface.co/{hf_user}/{OUTPUT_MODEL_NAME}")

In [ ]:
# Upload merged 16-bit model (full quality)
print("Merging LoRA weights into base model (16-bit)...")
print("This will take ~5 minutes...")

model.push_to_hub_merged(
    f"{hf_user}/{OUTPUT_MODEL_NAME}-merged",
    tokenizer,
    save_method="merged_16bit",
    token=HF_TOKEN,
    private=False
)

print(f"✓ Merged model uploaded to HuggingFace")
print(f"  View at: https://huggingface.co/{hf_user}/{OUTPUT_MODEL_NAME}-merged")

## 10. Create GGUF Quantizations (Optional)

Create GGUF versions for Ollama and llama.cpp. This takes ~10-15 minutes.

In [None]:
# Create GGUF quantizations
# Quantization types:
#   - q4_k_m: 4-bit, medium quality, ~4GB (recommended for most use cases)
#   - q5_k_m: 5-bit, higher quality, ~5GB
#   - q8_0: 8-bit, best quality, ~8GB

quantization_methods = ["q4_k_m", "q5_k_m", "q8_0"]

print("Creating GGUF quantizations...")
print(f"This will create {len(quantization_methods)} versions and upload them to HuggingFace")
print()

model.push_to_hub_gguf(
    f"{hf_user}/{OUTPUT_MODEL_NAME}",
    tokenizer,
    quantization_method=quantization_methods,
    token=HF_TOKEN,
)

print()
print("✓ GGUF quantizations created and uploaded!")
print(f"  View at: https://huggingface.co/{hf_user}/{OUTPUT_MODEL_NAME}")
print()
print("GGUF files created:")
for method in quantization_methods:
    print(f"  - {OUTPUT_MODEL_NAME}-{method.upper()}.gguf")

## Done!

Your model has been trained and uploaded to HuggingFace. 

**Next steps:**
1. Download GGUF files for Ollama/LM Studio
2. Run the Evaluator to test tool-calling accuracy
3. Deploy to production

**Model locations:**
- LoRA adapters: https://huggingface.co/{hf_user}/{OUTPUT_MODEL_NAME}
- Merged 16-bit: https://huggingface.co/{hf_user}/{OUTPUT_MODEL_NAME}-merged
- GGUF quantizations: https://huggingface.co/{hf_user}/{OUTPUT_MODEL_NAME} (Files tab)

**Using the GGUF models:**

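Ollama:
```bash
# Write a Modelfile that points at the Q4_K_M GGUF on HuggingFace
# (replace the placeholders with your username and model name)
echo "FROM hf.co/{hf_user}/{OUTPUT_MODEL_NAME}:Q4_K_M" > Modelfile

# Create and run the model
ollama create my-tool-model -f Modelfile
ollama run my-tool-model
```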

LM Studio:
1. Go to "Discover" tab
2. Search for `{hf_user}/{OUTPUT_MODEL_NAME}`
3. Download Q4_K_M or Q5_K_M version
4. Load and test!

## 11. Test the Model (Optional)

Quick test to see if the model learned tool-calling behavior.

In [None]:
# Enable inference mode
FastLanguageModel.for_inference(model)

# Test prompt
test_messages = [
    {"role": "user", "content": "Create a new note called 'Meeting Notes' with the content 'Discussed Q4 roadmap'"}
]

# Format and generate
inputs = tokenizer.apply_chat_template(
    test_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=512,
    temperature=0.2,
    top_p=0.9,
    do_sample=True
)

# Decode only the newly generated tokens, not the echoed prompt
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print("Model response:")
print("=" * 60)
print(response)
print("=" * 60)
