# Fine-tuning Llama 3.1 8B for Network Security Expert v2

This notebook fine-tunes Llama 3.1 8B Instruct to create a specialized **Network Security Expert AI** with:
- **Advanced tool calling** using native Llama 3.1 format (`<|python_tag|>`)
- **FireWeave orchestration** capabilities
- **Infosec conversational expertise**

**Training Configuration (Local VM - RTX 3090/4090 Optimized):**
- LoRA: r=32, alpha=32, dropout=0, **rsLoRA enabled**
- Learning rate: 2e-4 with cosine scheduler
- Max sequence length: 2048 (covers 99% of training data)
- **Packing disabled** for stability
- **NEFTune noise** (alpha=5) for better generalization
- Batch size: 2 with gradient accumulation 4 (effective=8)

**Runtime:** Local Ubuntu VM with GPU passthrough (RTX 3090/4090)
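
The batch-size figures above are plain arithmetic. As a minimal sketch, assuming a single GPU and a hypothetical dataset of 10,000 examples (the notebook does not state the real size), the effective batch size and the approximate number of optimizer steps can be worked out like this:

```python
def optimizer_steps(num_examples, per_device_batch=2, grad_accum=4, epochs=3):
    """Return (effective_batch, steps_per_epoch, total_steps) for one GPU."""
    effective_batch = per_device_batch * grad_accum
    # Floor division mirrors the rough per-epoch estimate printed later in the notebook.
    steps_per_epoch = num_examples // effective_batch
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

# 10_000 is an illustrative dataset size, not the notebook's real one.
print(optimizer_steps(10_000))  # (8, 1250, 3750)
```

The exact step count reported by the trainer may differ slightly, since it depends on how the last partial batch of each epoch is handled.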

## 1. Install Dependencies

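This cell does not install anything: the packages are assumed to be pre-installed via pip. It only sets Triton-related environment variables and clears a stale Triton cache.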
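
If you are unsure whether the environment is complete, a quick check along these lines may help. It is a sketch only: the package list is simply the set of top-level modules this notebook imports, and `missing_packages` is a hypothetical helper, not part of any library.

```python
import importlib.util

# Top-level modules imported somewhere in this notebook
REQUIRED = ["unsloth", "torch", "trl", "transformers", "datasets"]

def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("Missing packages:", missing or "none")
```

Anything reported as missing would need to be installed with pip before continuing.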

In [None]:
# Environment setup for local VM (packages already installed via pip)
import os

# Prevent Triton timeout issues
os.environ["TRITON_CACHE_MANAGER"] = "unsloth.triton_cache:TritonCacheManager"
os.environ["CUDA_LAUNCH_BLOCKING"] = "0"

# Clear Triton cache if it exists
import shutil
from pathlib import Path
triton_cache = Path.home() / ".triton" / "cache"
if triton_cache.exists():
    print(f"Clearing stale Triton cache...")
    shutil.rmtree(triton_cache, ignore_errors=True)

print("✓ Environment configured for stable training")
print("✓ Running on local VM (packages pre-installed)")

## 2. Load Model

Load Llama 3.1 8B Instruct with 4-bit quantization.
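
To see why 4-bit loading matters on a 24 GB card, here is a rough back-of-the-envelope estimate of weight memory. It assumes a nominal 8 billion parameters and ignores quantization metadata, layers kept in higher precision, activations, and optimizer state, so treat the numbers as a lower bound rather than a measurement.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

PARAMS = 8.0e9  # nominal "8B"; the exact parameter count is slightly different

for label, bits in [("bf16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(PARAMS, bits):.1f} GiB")
```

The GPU memory line printed by the loading cell gives the actual figure for your system.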

In [None]:
from unsloth import FastLanguageModel
import torch
import os

# Fix for Triton timeout issues on some systems
os.environ["TRITON_CACHE_MANAGER"] = "unsloth.triton_cache:TritonCacheManager"
os.environ["CUDA_LAUNCH_BLOCKING"] = "0"  # Set to "1" only for debugging

# Configuration (Optimized for stability)
max_seq_length = 2048  # Reduced from 8192 - covers 99% of training data
dtype = None  # Auto-detect (bfloat16 for Ampere+)
load_in_4bit = True  # Use 4-bit quantization for QLoRA

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3.1-8b-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

print(f"✓ Model loaded: Llama 3.1 8B Instruct (4-bit)")
print(f"✓ Max sequence length: {max_seq_length}")
print(f"✓ Data type: {dtype if dtype else 'Auto (bf16 on Ampere+)'}")

# GPU memory info
if torch.cuda.is_available():
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1024**3
    gpu_used = torch.cuda.memory_allocated(0) / 1024**3
    print(f"✓ GPU: {torch.cuda.get_device_name(0)} ({gpu_mem:.1f} GB total, {gpu_used:.1f} GB used)")

## 3. Configure LoRA (2025 Best Practices)

Add LoRA adapters with **rsLoRA** (rank-stabilized LoRA) for better scaling at higher ranks.
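
The difference between standard LoRA and rsLoRA is only the factor applied to the adapter output: `alpha / r` versus `alpha / sqrt(r)`. The sketch below computes both for this notebook's settings and also estimates the number of trainable parameters. The layer shapes are assumptions about Llama 3.1 8B (hidden size 4096, key/value projection size 1024, MLP size 14336, 32 layers); compare the result with the count printed by the cell below.

```python
import math

def lora_scaling(alpha, r, use_rslora):
    """Scaling factor applied to the low-rank update."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

def lora_param_count(r, shapes, num_layers):
    """Each adapted Linear(d_in -> d_out) adds A (r x d_in) and B (d_out x r)."""
    return num_layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Assumed (d_in, d_out) per layer: q, k, v, o, gate, up, down projections
SHAPES = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096),
          (4096, 14336), (4096, 14336), (14336, 4096)]

print(lora_scaling(32, 32, use_rslora=False))           # 1.0
print(round(lora_scaling(32, 32, use_rslora=True), 3))  # 5.657
print(lora_param_count(32, SHAPES, num_layers=32))      # 83886080
```

With `alpha` equal to `r`, rsLoRA therefore scales the update by `sqrt(r)` rather than 1, which is worth keeping in mind when choosing the learning rate.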

In [None]:
# Add LoRA adapters (2025 Best Practices with rsLoRA)
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,  # Higher rank with rsLoRA (was 16)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,  # Match rank when using rsLoRA
    lora_dropout = 0,  # Unsloth recommends 0 dropout
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # 30% less VRAM
    random_state = 3407,
    use_rslora = True,  # NEW: Rank-stabilized LoRA for better high-rank performance
    loftq_config = None,
)

trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())

print("✓ LoRA adapters configured (2025 Best Practices)")
print(f"  - Rank: 32 (increased from 16)")
print(f"  - Alpha: 32 (matched with rank for rsLoRA)")
print(f"  - Dropout: 0 (Unsloth recommended)")
print(f"  - rsLoRA: Enabled (scales by 1/sqrt(r) instead of 1/r)")
print(f"  - Target modules: All attention and MLP layers")
print(f"  - Trainable parameters: ~{trainable_params / 1e6:.1f}M ({100 * trainable_params / total_params:.2f}%)")
print(f"\n[rsLoRA] Better performance at higher ranks with proper gradient scaling")

## 4. Load Dataset

Load your network security training data in ShareGPT format: each record has a `conversations` list whose turns carry `from` and `value` keys.
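
The cells that follow read each record's `conversations` list and expect the roles `human`, `gpt`, and `tool`. A small validator such as the one below can catch malformed records before training. It is a sketch: `validate_record` is a hypothetical helper, and the tool parameters and tool response in the example are invented for illustration.

```python
VALID_ROLES = {"human", "gpt", "tool"}

def validate_record(record):
    """Return a list of problems found in one ShareGPT-style record."""
    convo = record.get("conversations")
    if not isinstance(convo, list) or not convo:
        return ["missing or empty 'conversations' list"]
    problems = []
    for i, turn in enumerate(convo):
        if turn.get("from") not in VALID_ROLES:
            problems.append(f"turn {i}: unexpected role {turn.get('from')!r}")
        value = turn.get("value")
        if not isinstance(value, str) or not value.strip():
            problems.append(f"turn {i}: empty or non-string 'value'")
    return problems

# Illustrative record; parameter names and the tool reply are hypothetical
example = {
    "conversations": [
        {"from": "human", "value": "Is traffic from 10.0.0.5 to 10.1.0.9 on port 443 allowed?"},
        {"from": "gpt", "value": '<|python_tag|>{"name": "check_traffic_flow", '
                                 '"parameters": {"source": "10.0.0.5", "port": 443}}'},
        {"from": "tool", "value": '{"allowed": true}'},
        {"from": "gpt", "value": "Yes, the flow is permitted."},
    ]
}
print(validate_record(example))  # []
```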

In [None]:
from datasets import load_dataset
import json
import os

# Local dataset path - adjust if needed
dataset_path = os.path.expanduser("~/finetuning/data/processed/all_training_data.json")

# Alternative paths to try
alt_paths = [
    "data/processed/all_training_data.json",
    "../data/processed/all_training_data.json",
    os.path.expanduser("~/finetuning/data/processed/high_quality_new.json"),
]

# Find the dataset
if not os.path.exists(dataset_path):
    for alt in alt_paths:
        if os.path.exists(alt):
            dataset_path = alt
            break

try:
    dataset = load_dataset("json", data_files=dataset_path, split="train")
    print(f"✓ Dataset loaded from: {dataset_path}")
    print(f"✓ Total examples: {len(dataset)}")
    
    # Count tool calling vs conversational
    tool_count = sum(1 for ex in dataset if '<|python_tag|>' in str(ex.get('conversations', [])))
    print(f"  - Tool calling: {tool_count} ({100*tool_count/len(dataset):.1f}%)")
    print(f"  - Conversational: {len(dataset) - tool_count}")
    
    # Show a sample
    print("\nSample conversation:")
    print("-" * 80)
    sample = dataset[0]['conversations']
    for msg in sample[:2]:
        role = msg['from']
        text = msg['value'][:200]
        print(f"{role.upper()}: {text}...\n")
    
except FileNotFoundError:
    print(f"❌ Dataset not found!")
    print(f"\nSearched locations:")
    print(f"  - {dataset_path}")
    for alt in alt_paths:
        print(f"  - {alt}")
    print("\nMake sure your training data is in ~/finetuning/data/processed/")
    raise

## 5. Format Dataset for Training

Apply the Llama 3.1 chat template to the tokenizer, then format each conversation with the native tool-calling tokens.
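
On the consuming side, a tool call in this format is an assistant message of the form `<|python_tag|>{"name": ..., "parameters": {...}}`. The sketch below shows one way to pull the call out of generated text. `parse_tool_call` is a hypothetical helper, and it assumes the message starts with the tag and contains a single JSON object.

```python
import json

PYTHON_TAG = "<|python_tag|>"

def parse_tool_call(assistant_text):
    """Return (name, parameters) if the text is a tool call, else None."""
    text = assistant_text.strip()
    if not text.startswith(PYTHON_TAG):
        return None
    payload = text[len(PYTHON_TAG):]
    # Drop a trailing end-of-message / end-of-turn marker if one is still present
    for marker in ("<|eom_id|>", "<|eot_id|>"):
        if payload.endswith(marker):
            payload = payload[: -len(marker)]
    try:
        call = json.loads(payload)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    return call.get("name"), call.get("parameters", {})

# "rule_id" is an invented parameter name used only for this example
print(parse_tool_call('<|python_tag|>{"name": "get_rule_hit_count", "parameters": {"rule_id": "r-1"}}<|eom_id|>'))
print(parse_tool_call("Port security limits the MAC addresses allowed on a port."))
```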

In [None]:
from unsloth.chat_templates import get_chat_template
import json

# Apply Llama 3.1 chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

# System prompt for tool calling
SYSTEM_PROMPT = """You are a Network Security Expert AI with FireWeave orchestration capabilities.

Available tools: check_traffic_flow, analyze_attack_path, run_compliance_scan, find_shadowed_rules, create_firewall_rule, get_rule_hit_count, calculate_blast_radius, fetch_jira_issues

When calling tools, use the format: <|python_tag|>{"name": "tool_name", "parameters": {...}}

Provide accurate, detailed technical guidance with specific commands and configurations."""

def formatting_prompts_func(examples):
    """Format conversations for Llama 3.1 native tool calling."""
    conversations = examples["conversations"]
    tools_list = examples.get("tools", [None] * len(conversations))
    texts = []
    
    for convo, tools in zip(conversations, tools_list):
        # Build text manually for proper tool calling format
        text = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        text += SYSTEM_PROMPT
        
        # Add tool definitions if available
        if tools:
            text += "\n\nAvailable tools:\n"
            text += json.dumps(tools, indent=2)
        
        text += "<|eot_id|>"
        
        for turn in convo:
            role = turn.get("from", "")
            value = turn.get("value", "")
            
            if role == "human":
                text += f"<|start_header_id|>user<|end_header_id|>\n\n{value}<|eot_id|>"
            elif role == "gpt":
                # Check if this is a tool call (contains <|python_tag|>)
                if "<|python_tag|>" in value:
                    # Tool call ends with <|eom_id|> (end of message, expecting tool response)
                    text += f"<|start_header_id|>assistant<|end_header_id|>\n\n{value}<|eom_id|>"
                else:
                    # Regular response ends with <|eot_id|>
                    text += f"<|start_header_id|>assistant<|end_header_id|>\n\n{value}<|eot_id|>"
            elif role == "tool":
                # Tool response uses ipython role
                text += f"<|start_header_id|>ipython<|end_header_id|>\n\n{value}<|eot_id|>"
        
        texts.append(text)
    
    return {"text": texts}

# Apply formatting
dataset = dataset.map(formatting_prompts_func, batched=True)

print("✓ Native Llama 3.1 tool calling format applied")
print("\nFormatted example (first 800 chars):")
print("-" * 80)
print(dataset[0]['text'][:800] + "...")

## 6. Configure Training (Stability Optimized)

Set up training hyperparameters for a 24 GB GPU:
- **Batch size 2**: fits in 24 GB of VRAM; drop to 1 if you hit Triton kernel timeouts
- **Gradient accumulation 4**: gives an effective batch size of 8 (use 8 if you drop the batch size to 1)
- **No packing**: avoids creating very long packed sequences
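
The training cell that follows uses a cosine learning-rate schedule with a peak of `2e-4` and `warmup_ratio = 0.03`. As a rough illustration of what that means per step, here is a sketch of linear warmup followed by cosine decay. It is written from the usual definition of this schedule and is not copied from the trainer's code; the total of 1,000 steps is a made-up figure.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL = 1000  # hypothetical number of optimizer steps
for step in (0, 15, 30, 515, 1000):
    print(step, f"{lr_at_step(step, TOTAL):.2e}")
```

The learning rate climbs for the first 3% of steps, peaks, and then falls smoothly towards zero by the end of training.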

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
import os

# Create output directory
os.makedirs("outputs/network-security-v2", exist_ok=True)

# Training configuration (Optimized for RTX 3090/4090 with 24GB VRAM)
training_args = TrainingArguments(
    # Output
    output_dir = "outputs/network-security-v2",
    
    # Batch size - increased for 24GB VRAM
    per_device_train_batch_size = 2,  # Can use 2 with 24GB VRAM
    gradient_accumulation_steps = 4,  # Effective batch size = 8
    
    # Training duration
    num_train_epochs = 3,
    
    # Learning rate (standard for QLoRA)
    learning_rate = 2e-4,
    lr_scheduler_type = "cosine",
    warmup_ratio = 0.03,
    
    # Optimization
    weight_decay = 0.01,
    max_grad_norm = 0.3,  # Gradient clipping for stability
    optim = "adamw_8bit",
    
    # Logging & saving
    logging_steps = 10,
    save_strategy = "steps",
    save_steps = 500,
    save_total_limit = 2,
    
    # Mixed precision
    fp16 = not torch.cuda.is_bf16_supported(),
    bf16 = torch.cuda.is_bf16_supported(),
    
    # Misc
    seed = 3407,
    report_to = "none",
    
    # Local VM settings
    dataloader_num_workers = 2,  # Can use more workers locally
)

print("✓ Training configuration (RTX 3090/4090 Optimized):")
print(f"  - Epochs: {training_args.num_train_epochs}")
print(f"  - Batch size: {training_args.per_device_train_batch_size}")
print(f"  - Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"  - Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  - Learning rate: {training_args.learning_rate}")
print(f"  - LR Scheduler: {training_args.lr_scheduler_type}")
print(f"  - Warmup ratio: {training_args.warmup_ratio}")
print(f"  - Gradient clipping: {training_args.max_grad_norm}")
print(f"  - Mixed precision: {'BF16' if training_args.bf16 else 'FP16'}")

## 7. Initialize Trainer (Stability Settings)

Create the SFTTrainer with stability optimizations:
- **Packing disabled**: Prevents creating extremely long sequences that time out
- **NEFTune**: Noisy embeddings for better generalization
- **Single process**: Avoids multiprocessing issues in local runtime
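
For intuition about what `neftune_noise_alpha` controls: NEFTune adds uniform noise to the token embeddings during training, scaled by `alpha / sqrt(L * d)` for sequence length `L` and embedding dimension `d`. The pure-Python sketch below only illustrates that scaling on a tiny matrix; the real implementation works on tensors inside the training libraries, and `neftune_noise` here is a hypothetical helper.

```python
import math
import random

def neftune_scale(seq_len, hidden_dim, alpha=5.0):
    """Maximum absolute noise added to any embedding entry."""
    return alpha / math.sqrt(seq_len * hidden_dim)

def neftune_noise(seq_len, hidden_dim, alpha=5.0, seed=0):
    """Return a seq_len x hidden_dim matrix of scaled uniform noise."""
    rng = random.Random(seed)
    scale = neftune_scale(seq_len, hidden_dim, alpha)
    return [[rng.uniform(-1.0, 1.0) * scale for _ in range(hidden_dim)]
            for _ in range(seq_len)]

# Tiny illustrative shape; real sequences are far longer and wider
noise = neftune_noise(seq_len=4, hidden_dim=8)
print(len(noise), len(noise[0]))

# Assuming a full 2048-token sequence and a 4096-wide embedding
print(f"{neftune_scale(2048, 4096):.6f}")
```

Because the scale shrinks with sequence length and width, the perturbation stays small relative to the embeddings themselves.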

In [None]:
from unsloth import is_bfloat16_supported

# Stability settings (packing disabled to prevent Triton timeout)
USE_PACKING = False     # Disabled - can cause very long sequences and timeout
NEFTUNE_ALPHA = 5.0     # Noisy embeddings for generalization

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 1,  # Single process for stability
    packing = USE_PACKING,
    neftune_noise_alpha = NEFTUNE_ALPHA,
    args = training_args,
)

print("✓ Trainer initialized (Stability Optimized)")
print(f"\n[Configuration]")
print(f"  Packing: {USE_PACKING} (disabled for stability)")
print(f"  NEFTune: alpha={NEFTUNE_ALPHA}")
print(f"  Max seq length: {max_seq_length}")
print(f"\nTraining {len(dataset)} examples")
print(f"Estimated steps per epoch: ~{len(dataset) // (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps)}")
print(f"\nNote: If training still times out, try reducing max_seq_length to 1024")

## 8. Start Training

Begin the fine-tuning process. This cell first clears the Triton cache to prevent stale kernel issues.

**Target loss:** 0.5-1.0 is generally good

**Red flags:**
- Loss not decreasing → adjust learning rate
- Loss near 0 → overfitting, reduce epochs
- "Triton Error: launch timed out" → reduce batch size or sequence length

**Common warnings (can be ignored):**
- "Model is already on multiple devices" - normal for QLoRA
- NEFTune warnings - normal during training
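
The rules of thumb above can be turned into a small check on the logged losses. This is only a heuristic sketch: `diagnose_loss` is a hypothetical helper, and the `near_zero` and `min_drop` thresholds are arbitrary assumptions, not values from the notebook.

```python
def diagnose_loss(losses, near_zero=0.1, min_drop=0.05):
    """Classify a list of logged training losses using simple rules of thumb."""
    if len(losses) < 2:
        return "not enough data"
    first, last = losses[0], losses[-1]
    if last < near_zero:
        return "possible overfitting: loss near 0"
    if first - last < min_drop:
        return "loss not decreasing: revisit learning rate"
    if 0.5 <= last <= 1.0:
        return "within target range"
    return "outside target range"

print(diagnose_loss([2.1, 1.4, 0.9, 0.7]))
print(diagnose_loss([2.1, 2.1, 2.2]))
print(diagnose_loss([1.5, 0.4, 0.02]))
```

After training, the logged values can be collected with `[e["loss"] for e in trainer.state.log_history if "loss" in e]` and passed to the function.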

In [None]:
# Clear any stale Triton cache before training
import shutil
from pathlib import Path

triton_cache = Path.home() / ".triton" / "cache"
if triton_cache.exists():
    print(f"Clearing Triton cache at {triton_cache}...")
    shutil.rmtree(triton_cache, ignore_errors=True)
    print("✓ Triton cache cleared")

# Note about "model is already on multiple devices" warning
print("\nNote: 'Model is already on multiple devices' warning is normal and can be ignored.")
print("="*50)

# Start training
print("\nStarting v2 training...")
print(f"Training on {len(dataset)} examples")
print(f"This will take several hours depending on your GPU.")
print("="*50)

trainer_stats = trainer.train()

print("="*50)
print("Training complete!")
print(f"Final loss: {trainer_stats.training_loss:.4f}")
print(f"Training time: {trainer_stats.metrics['train_runtime']/3600:.2f} hours")

## 9. Test the Model

Try out your fine-tuned model with some network security questions.

In [None]:
# Enable inference mode
FastLanguageModel.for_inference(model)

# Test questions
test_questions = [
    "How do I configure port security on a Cisco switch?",
    "Explain the difference between AWS Security Groups and Network ACLs.",
    "What Snort rules would detect SQL injection attempts?",
    "My VPN tunnel keeps dropping. How do I troubleshoot this?",
]

print("Testing the fine-tuned model...\n")
print("="*80)

for i, question in enumerate(test_questions, 1):
    print(f"\nTest {i}/{len(test_questions)}")
    print("-"*80)
    print(f"Question: {question}\n")
    
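    # Format as chat, reusing the system prompt the model was trained with
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question}
    ]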
    
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to("cuda")
    
    # Generate response
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )
    
    # Decode only the newly generated tokens (skip the prompt), which is more
    # robust than splitting the full text on the word "assistant"
    response = tokenizer.batch_decode(
        outputs[:, inputs.shape[1]:], skip_special_tokens=True
    )[0]
    
    print(f"Answer:\n{response}")
    print("\n" + "="*80)

print("\n✅ Testing complete!")

## 10. Save the Model

Save the LoRA adapter (small ~100-200MB file).
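
The adapter's size on disk follows from the number of trainable parameters and the precision they are stored in. The sketch below assumes about 83.9 million LoRA parameters (the figure you would expect for r=32 on all seven projections; use the count printed in section 3 for your run) and shows both 16-bit and 32-bit storage, since the saved dtype depends on your setup.

```python
def adapter_size_mb(num_params, bytes_per_param):
    """Approximate adapter size in megabytes (10^6 bytes), weights only."""
    return num_params * bytes_per_param / 1e6

LORA_PARAMS = 83_886_080  # assumed trainable-parameter count for r=32

for label, nbytes in [("16-bit", 2), ("32-bit", 4)]:
    print(f"{label}: ~{adapter_size_mb(LORA_PARAMS, nbytes):.0f} MB")
```

If the saved directory is noticeably larger than expected, the adapter weights were probably written in 32-bit precision.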

In [None]:
import os

# Create models directory
os.makedirs("models/network-security-lora", exist_ok=True)

# Save LoRA adapter locally
model.save_pretrained("models/network-security-lora")
tokenizer.save_pretrained("models/network-security-lora")

print("✓ LoRA adapter saved to: models/network-security-lora")
print(f"  Size: ~100-200 MB")
print("\nTo upload to Hugging Face Hub, uncomment the cell below.")

In [None]:
# Optional: Push to Hugging Face Hub
# Replace 'your-username' with your HF username

# model.push_to_hub("your-username/llama3-network-security-lora", token="YOUR_HF_TOKEN")
# tokenizer.push_to_hub("your-username/llama3-network-security-lora", token="YOUR_HF_TOKEN")

# print("‚úì Model uploaded to Hugging Face Hub!")

## 11. Export to GGUF for Ollama

Convert to GGUF format for use with Ollama on your local machine.

In [None]:
import os

# Create output directories
os.makedirs("models/merged-16bit", exist_ok=True)
os.makedirs("models/gguf", exist_ok=True)

# First, save merged 16-bit model
print("Step 1: Merging LoRA with base model...")
print("(This may take a few minutes...)")
model.save_pretrained_merged(
    "models/merged-16bit",
    tokenizer,
    save_method="merged_16bit"
)
print("✓ Merged model saved\n")

# Convert to GGUF with multiple quantization levels
print("Step 2: Converting to GGUF format...")
print("This will create 3 quantized versions (Q4_K_M, Q5_K_M, Q8_0)")
print("(This takes 10-30 minutes depending on your CPU...)\n")

model.save_pretrained_gguf(
    "models/gguf",
    tokenizer,
    quantization_method=["q4_k_m", "q5_k_m", "q8_0"]
)

print("\n" + "="*80)
print("✅ GGUF CONVERSION COMPLETE!")
print("="*80)
print("\nCreated files in models/gguf/:")
print("  - unsloth.Q4_K_M.gguf (~4.5GB) - Fastest, good quality")
print("  - unsloth.Q5_K_M.gguf (~5.5GB) - Balanced [RECOMMENDED]")
print("  - unsloth.Q8_0.gguf (~8GB) - Highest quality")
print("\n" + "="*80)
print("NEXT STEPS (run in terminal):")
print("="*80)
print("""
# 1. Rename the GGUF file
mv models/gguf/unsloth.Q4_K_M.gguf models/gguf/network-security-expert.Q4_K_M.gguf

# 2. Create Ollama model
cd models
ollama create network-security-expert -f Modelfile

# 3. Test your model
ollama run network-security-expert
""")

## Summary

You've fine-tuned Llama 3.1 8B to be a Network Security expert on your local VM!

### What You've Done:
1. ✅ Loaded Llama 3.1 8B Instruct with 4-bit quantization
2. ✅ Configured LoRA adapters for efficient training
3. ✅ Trained on your network security dataset
4. ✅ Tested the model with example questions
5. ✅ Saved the LoRA adapter
6. ✅ Converted to GGUF for Ollama deployment

### Files Created:
- `models/network-security-lora/` - LoRA adapter (~100-200MB)
- `models/merged-16bit/` - Full merged model
- `models/gguf/*.gguf` - Quantized models for Ollama

### Deploy with Ollama:
```bash
# Install Ollama (if not already)
curl -fsSL https://ollama.com/install.sh | sh

# Create and run your model
cd ~/finetuning/models
ollama create network-security-expert -f Modelfile
ollama run network-security-expert
```
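
The `ollama create` command expects a `Modelfile` in the current directory, which this notebook does not write itself. If your export did not produce one, a minimal sketch for generating it is shown below. `build_modelfile` is a hypothetical helper; the GGUF path assumes the rename step from section 11, the sampling values mirror the test cell, and the system prompt is shortened for illustration. Depending on your Ollama version, a `TEMPLATE` entry matching the Llama 3.1 chat format may also be needed, which is not covered here.

```python
def build_modelfile(gguf_path, system_prompt, temperature=0.7, top_p=0.9):
    """Return the text of a minimal Ollama Modelfile."""
    return (
        f"FROM {gguf_path}\n"
        f"PARAMETER temperature {temperature}\n"
        f"PARAMETER top_p {top_p}\n"
        f'SYSTEM """{system_prompt}"""\n'
    )

modelfile_text = build_modelfile(
    "./gguf/network-security-expert.Q4_K_M.gguf",  # assumed path relative to models/
    "You are a Network Security Expert AI with FireWeave orchestration capabilities.",
)
print(modelfile_text)
```

To use it, write the text to `models/Modelfile`, for example with `pathlib.Path("models/Modelfile").write_text(modelfile_text)`, and then run the commands above.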

### Test Your Model:
```
>>> How do I configure a Palo Alto firewall rule?
>>> What's the difference between IDS and IPS?
>>> Explain zero trust architecture
```

🎉 Training complete!