# Sys-Scan Security Model Fine-Tuning

This notebook orchestrates the fine-tuning of Mistral-7B-Instruct-v0.3 for security analysis tasks, using 4-bit quantization and LoRA adapters.

## Pipeline Overview

1. **Environment Setup**: Install dependencies and configure compute
2. **Data Generation**: Generate or load synthetic security findings
3. **Model Setup**: Load teacher and student models with quantization
4. **Training**: Fine-tune with Lion optimizer and optional distillation
5. **Evaluation**: Test on held-out security scenarios
6. **Export**: Save model for deployment in sys-scan-graph

## Requirements

- GPU with 24GB+ VRAM (A100, RTX 3090/4090, or equivalent); the default `use_bf16` setting needs bfloat16 support, which the V100 lacks
- Python 3.10+
- CUDA 11.8+


## 1. Environment Setup

In [None]:
# Install dependencies (run once)
!pip install -q --upgrade pip
!pip install -q torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install -q transformers==4.36.0 datasets==2.15.0 accelerate==0.25.0
!pip install -q peft==0.7.0 bitsandbytes==0.41.3 trl==0.7.9
!pip install -q sentencepiece protobuf tensorboard

print("✓ Dependencies installed")

In [None]:
# Imports
import os
import json
import torch
import logging
from pathlib import Path
from datetime import datetime

# Check GPU
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️  No GPU detected - training will be very slow!")

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

## 2. Configuration
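Several values in the configuration cell below only make sense in combination. The following sketch copies the numbers from `CONFIG` by hand and uses a hypothetical helper, `derived_settings`, to work out the effective batch size, an upper bound on tokens per optimizer step, and the LoRA scaling factor `lora_alpha / lora_r` that PEFT applies to the adapter output by default. The `num_gpus` parameter is an assumption added for illustration.

```python
def derived_settings(batch_size, grad_accum, max_seq_length, lora_r, lora_alpha, num_gpus=1):
    """Derive quantities implied by the training configuration."""
    effective_batch = batch_size * grad_accum * num_gpus
    return {
        "effective_batch_size": effective_batch,
        # Upper bound: assumes every sequence is padded or packed to max_seq_length.
        "max_tokens_per_step": effective_batch * max_seq_length,
        "lora_scaling": lora_alpha / lora_r,
    }


settings = derived_settings(
    batch_size=4, grad_accum=4, max_seq_length=2048, lora_r=64, lora_alpha=128
)
for name, value in settings.items():
    print(f"{name}: {value}")
```

Halving `batch_size` while doubling `gradient_accumulation_steps` keeps the effective batch size constant and lowers peak activation memory, which is the usual first adjustment when a run hits out-of-memory errors.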

In [None]:
# Training configuration
CONFIG = {
    # Models
    "teacher_model": "mistralai/Mistral-7B-Instruct-v0.3",
    "student_model": "mistralai/Mistral-7B-Instruct-v0.3",
    
    # Data paths (adjust for your environment)
    "data_dir": "/home/joseph-mazzini/sys-scan-embedded-agent/synthetic_data",
    "output_dir": "./mistral-7b-security-finetuned",
    "cache_dir": "./cache",
    
    # Training hyperparameters
    "num_epochs": 3,
    "batch_size": 4,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "max_seq_length": 2048,
    
    # LoRA
    "lora_r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    
    # Distillation
    "use_distillation": True,
    "temperature": 2.0,
    "distillation_alpha": 0.5,
    
    # Optimizer
    "use_lion": True,
    "lion_beta1": 0.9,
    "lion_beta2": 0.99,
    "weight_decay": 0.01,
    
    # Quantization
    "use_4bit": True,
    "use_bf16": True,
}

# Create output directory
Path(CONFIG["output_dir"]).mkdir(parents=True, exist_ok=True)
Path(CONFIG["cache_dir"]).mkdir(parents=True, exist_ok=True)

# Save config
with open(f"{CONFIG['output_dir']}/training_config.json", "w") as f:
    json.dump(CONFIG, f, indent=2)

print("✓ Configuration saved")
print(json.dumps(CONFIG, indent=2))

## 3. Generate Synthetic Data

In [None]:
# Option A: Generate new synthetic data
# Remove the surrounding triple quotes to enable, and modify producer counts as needed

"""
import sys
sys.path.append(CONFIG["data_dir"])

from synthetic_data_pipeline import run_synthetic_data_pipeline

result = run_synthetic_data_pipeline(
    output_path=f"{CONFIG['data_dir']}/training_dataset.json",
    producer_counts={
        "processes": 100,
        "network": 80,
        "kernel_params": 50,
        "modules": 40,
        "world_writable": 30,
        "suid": 25,
        "ioc": 60,
        "mac": 20,
        "dns": 70,
        "endpoint_behavior": 50
    },
    use_langchain=False,  # Set True if you have a LangChain API key
    compress=False,
    conservative_parallel=False  # Use True for local dev, False for cloud
)

print(f"✓ Generated {result['findings_count']} findings")
"""

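# Option B: Use existing data
# Sort for a stable order; match on the file name only, so that a "config"
# or "schema" directory somewhere in the path does not exclude every file
data_files = sorted(
    str(f) for f in Path(CONFIG["data_dir"]).glob("*.json")
    if "schema" not in f.name and "config" not in f.name
)

# Fail early instead of hitting an IndexError in the inspection cell
if not data_files:
    raise FileNotFoundError(f"No dataset JSON files found in {CONFIG['data_dir']}")

print(f"Found {len(data_files)} data files:")
for f in data_files:
    print(f"  - {Path(f).name}")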

## 4. Load and Inspect Data
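The traversal in the cell below implies a nested layout: category, then severity, then a list of findings, optionally wrapped in a top-level `data` key. The sketch below spells that layout out on a tiny hand-made dataset and counts findings per category. The helper name `count_findings`, the severity labels, and the finding fields are illustrative assumptions, not taken from a real data file.

```python
from collections import Counter


def count_findings(dataset: dict) -> Counter:
    """Count findings per category in a {category: {severity: [finding, ...]}} layout."""
    # Some files wrap the payload in a top-level "data" key.
    payload = dataset.get("data", dataset)
    counts = Counter()
    for category, severity_groups in payload.get("findings", {}).items():
        if not isinstance(severity_groups, dict):
            continue  # skip anything that does not follow the expected layout
        for findings_list in severity_groups.values():
            if isinstance(findings_list, list):
                counts[category] += len(findings_list)
    return counts


toy_dataset = {
    "data": {
        "findings": {
            "processes": {"high": [{"id": "p1"}, {"id": "p2"}], "low": [{"id": "p3"}]},
            "network": {"medium": [{"id": "n1"}]},
        }
    }
}

per_category = count_findings(toy_dataset)
print(dict(per_category))
print("total:", sum(per_category.values()))
```

A per-category breakdown makes it easier to spot a producer that generated far fewer findings than requested before any training time is spent.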

In [None]:
# Load one dataset to inspect
with open(data_files[0]) as f:
    sample_data = json.load(f)

print("Dataset structure:")
print(json.dumps({k: type(v).__name__ for k, v in sample_data.items()}, indent=2))

# Check findings
if 'data' in sample_data:
    findings = sample_data['data'].get('findings', {})
else:
    findings = sample_data.get('findings', {})

print(f"\nFinding categories: {list(findings.keys())}")

# Count total findings
total_findings = 0
for category, severity_groups in findings.items():
    if isinstance(severity_groups, dict):
        for findings_list in severity_groups.values():
            if isinstance(findings_list, list):
                total_findings += len(findings_list)

print(f"Total findings: {total_findings}")

## 5. Initialize Fine-Tuning Pipeline
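`use_distillation`, `temperature`, and `distillation_alpha` control how strongly the student is pulled toward the teacher's output distribution. The internals of `distillation_trainer` are not shown in this notebook, so the following is only a sketch of the common formulation, written in plain Python for a single token position: a temperature-softened KL term scaled by `T**2`, blended with the ordinary cross-entropy on the true label. Which of the two terms `distillation_alpha` weights is an assumption here (the soft, teacher-matching term); with the configured value of 0.5 both conventions coincide.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Blend a soft-target KL term with hard-label cross-entropy (assumed convention)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions.
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Standard cross-entropy against the ground-truth token at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    # T**2 keeps the soft-term gradients on a scale comparable to the hard term.
    return alpha * (temperature ** 2) * kl + (1 - alpha) * ce


student = [1.0, 0.5, -0.5]
teacher = [2.0, 0.2, -1.0]
print(f"loss = {distillation_loss(student, teacher, label=0):.4f}")
```

When student and teacher logits are identical the KL term vanishes and only the weighted cross-entropy remains, which is a quick sanity check for any implementation of this loss.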

In [None]:
# Import pipeline modules
import sys
sys.path.append(CONFIG["data_dir"])

from ml_finetuning_pipeline import (
    FinetuningConfig,
    SecurityFineTuningPipeline,
    SecurityDataPreprocessor
)
from lion_optimizer import Lion, create_lion_optimizer
from distillation_trainer import create_distillation_trainer, DistillationTrainer

# Create pipeline config from notebook config
pipeline_config = FinetuningConfig(
    teacher_model_name=CONFIG["teacher_model"],
    student_model_name=CONFIG["student_model"],
    lora_r=CONFIG["lora_r"],
    lora_alpha=CONFIG["lora_alpha"],
    lora_dropout=CONFIG["lora_dropout"],
    use_4bit=CONFIG["use_4bit"],
    num_train_epochs=CONFIG["num_epochs"],
    per_device_train_batch_size=CONFIG["batch_size"],
    gradient_accumulation_steps=CONFIG["gradient_accumulation_steps"],
    learning_rate=CONFIG["learning_rate"],
    weight_decay=CONFIG["weight_decay"],
    use_lion=CONFIG["use_lion"],
    lion_beta1=CONFIG["lion_beta1"],
    lion_beta2=CONFIG["lion_beta2"],
    use_distillation=CONFIG["use_distillation"],
    distillation_temperature=CONFIG["temperature"],
    distillation_alpha=CONFIG["distillation_alpha"],
    max_seq_length=CONFIG["max_seq_length"],
    output_dir=CONFIG["output_dir"],
    cache_dir=CONFIG["cache_dir"],
    bf16=CONFIG["use_bf16"],
)

print("✓ Pipeline configuration created")

## 6. Setup Models and Tokenizer
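Two back-of-the-envelope numbers help when reading the output of the cell below: how much memory the quantized weights need, and how many parameters LoRA makes trainable. The sketch assumes roughly 7.2 billion parameters per model and Mistral-7B's layer shapes (32 layers, hidden size 4096, 1024-wide key/value projections). The choice of `q_proj` and `v_proj` as LoRA targets is purely illustrative, since the target modules used by `SecurityFineTuningPipeline` are not shown here.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, ignoring activations and quantization metadata."""
    return num_params * bits_per_param / 8 / 1e9


def lora_trainable_params(shapes, rank: int, num_layers: int) -> int:
    """Each adapted (d_in, d_out) matrix adds rank * (d_in + d_out) parameters."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers


NUM_PARAMS = 7.2e9  # approximate size of one Mistral-7B model

print(f"4-bit weights: {weight_memory_gb(NUM_PARAMS, 4):.1f} GB per model")
print(f"bf16 weights:  {weight_memory_gb(NUM_PARAMS, 16):.1f} GB per model")

# Hypothetical target modules: q_proj (4096 -> 4096) and v_proj (4096 -> 1024).
trainable = lora_trainable_params([(4096, 4096), (4096, 1024)], rank=64, num_layers=32)
print(f"LoRA parameters (r=64): {trainable:,}")
```

Under these assumptions, teacher and student together need a little over 7 GB for 4-bit weights, leaving most of a 24 GB card for activations, LoRA gradients, and optimizer state. The figure printed by `print_trainable_parameters()` will differ if the pipeline adapts more modules than the two assumed here.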

In [None]:
# Initialize pipeline
pipeline = SecurityFineTuningPipeline(pipeline_config)

# Setup tokenizer
print("Loading tokenizer...")
tokenizer = pipeline.setup_tokenizer()
print(f"✓ Tokenizer loaded (vocab size: {len(tokenizer)})")

# Setup models
print("\nLoading models (this may take several minutes)...")
teacher_model, student_model = pipeline.setup_models()
print("✓ Models loaded")

# Print trainable parameters
if hasattr(student_model, 'print_trainable_parameters'):
    student_model.print_trainable_parameters()

## 7. Prepare Training Dataset
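`train_split=0.9` keeps 90% of the examples for training and holds out the rest for evaluation. How `prepare_dataset` shuffles and splits internally is not shown in this notebook; the following is a minimal, seeded sketch of such a split in plain Python, with `split_examples` and the seed value as illustrative assumptions.

```python
import random


def split_examples(examples, train_split=0.9, seed=42):
    """Shuffle a copy of the examples deterministically, then cut at train_split."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_split)
    return {"train": shuffled[:cut], "eval": shuffled[cut:]}


examples = [{"text": f"finding {i}"} for i in range(20)]
splits = split_examples(examples)
print(len(splits["train"]), "train /", len(splits["eval"]), "eval")
```

Fixing the seed keeps the held-out set identical across runs, so evaluation numbers from different training configurations stay comparable.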

In [None]:
# Prepare dataset from synthetic data files
print(f"Preparing dataset from {len(data_files)} files...")

dataset = pipeline.prepare_dataset(
    data_paths=data_files,
    train_split=0.9
)

print(f"\n✓ Dataset prepared:")
print(f"  Train examples: {len(dataset['train'])}")
print(f"  Eval examples: {len(dataset['eval'])}")

# Inspect a sample
print("\nSample training example:")
print(dataset['train'][0]['text'][:500] + "...")

## 8. Setup Training Arguments

In [None]:
# Get training arguments
training_args = pipeline.setup_training_args()

print("Training configuration:")
print(f"  Output dir: {training_args.output_dir}")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Weight decay: {training_args.weight_decay}")
print(f"  Optimizer: {training_args.optim}")
print(f"  BF16: {training_args.bf16}")
print(f"  Gradient checkpointing: {training_args.gradient_checkpointing}")

## 9. Train Model
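With `use_lion` enabled, training uses the Lion optimizer, which takes the sign of an interpolation between the momentum and the current gradient and keeps only a single momentum buffer per parameter. The code of `lion_optimizer` is not reproduced in this notebook; the following is a scalar, plain-Python sketch of the published update rule using the configured `lion_beta1`, `lion_beta2`, and `weight_decay`, with `lion_step` as a hypothetical name.

```python
def sign(x: float) -> int:
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=2e-4, beta1=0.9, beta2=0.99, weight_decay=0.01):
    """Apply one Lion update to lists of scalars; returns (new_params, new_momentum)."""
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # Direction: sign of the beta1-interpolation of momentum and gradient.
        update = sign(beta1 * m + (1 - beta1) * g)
        # Decoupled weight decay is added to the update before scaling by lr.
        new_params.append(p - lr * (update + weight_decay * p))
        # The momentum is refreshed with beta2 only after the update is computed.
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum


params, momentum = [0.5, -1.0], [0.0, 0.0]
params, momentum = lion_step(params, grads=[0.3, -2.0], momentum=momentum)
print(params, momentum)
```

Because every coordinate moves by the same magnitude `lr`, regardless of gradient size, the Lion paper recommends a learning rate roughly 3-10x smaller than a comparable AdamW setting. The configured `learning_rate` may be worth revisiting if training proves unstable.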

In [None]:
# Start training
print("="*60)
print("Starting training...")
print("="*60)

start_time = datetime.now()

trainer = pipeline.train(dataset, training_args)

end_time = datetime.now()
duration = end_time - start_time

print("\n" + "="*60)
print(f"✓ Training completed in {duration}")
print("="*60)

## 10. Evaluate Model
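Assuming `trainer` behaves like a Hugging Face `Trainer`, `trainer.evaluate()` reports the mean cross-entropy on the held-out split as `eval_loss`. Exponentiating it gives perplexity, which is often easier to compare across runs. A minimal sketch, using a made-up metrics dictionary in place of real results:

```python
import math


def perplexity(eval_loss: float) -> float:
    """Perplexity is exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(eval_loss)


# Placeholder values for illustration only, not results from this notebook.
example_metrics = {"eval_loss": 1.25, "eval_runtime": 42.0}
print(f"perplexity: {perplexity(example_metrics['eval_loss']):.2f}")
```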

In [None]:
# Run evaluation
print("Running final evaluation...")

eval_results = trainer.evaluate()

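print("\nEvaluation results:")
for key, value in eval_results.items():
    # Metrics are usually floats; fall back to str() for anything else
    formatted = f"{value:.4f}" if isinstance(value, float) else str(value)
    print(f"  {key}: {formatted}")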

# Save evaluation results
with open(f"{CONFIG['output_dir']}/eval_results.json", "w") as f:
    json.dump(eval_results, f, indent=2)

print(f"\n✓ Evaluation results saved to {CONFIG['output_dir']}/eval_results.json")

## 11. Test Inference
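The prompt in the cell below wraps the instruction and the JSON-encoded finding in Mistral's `[INST] ... [/INST]` markers. When several findings need to be scored, it can help to factor that layout into a function. `build_prompt` below is a hypothetical helper that reproduces the same layout; the second finding is made up for illustration.

```python
import json

INSTRUCTION = (
    "Analyze this security finding and provide risk scoring with subscores "
    "(impact, exposure, anomaly, confidence), severity classification, "
    "and actionability assessment."
)


def build_prompt(finding: dict, instruction: str = INSTRUCTION) -> str:
    """Format one finding using the same [INST] layout as the test cell."""
    return f"[INST] {instruction}\n\nInput:\n{json.dumps(finding, indent=2)}\n[/INST]\n"


findings = [
    {"id": "test_001", "title": "Suspicious process spawning shell", "category": "processes"},
    {"id": "test_002", "title": "World-writable file in /etc", "category": "world_writable"},
]
for finding in findings:
    print(build_prompt(finding))
```

Keeping the inference prompt identical to the format used during training matters more than the exact wording, since the fine-tuned model has only seen that one layout.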

In [None]:
# Test the fine-tuned model on a sample security finding

test_finding = {
    "id": "test_001",
    "title": "Suspicious process spawning shell",
    "description": "Process /usr/bin/python3 spawned /bin/bash with unusual arguments",
    "metadata": {
        "pid": 12345,
        "command": "/usr/bin/python3 -c 'import os; os.system(\"/bin/bash -i\")'",
        "user": "www-data",
        "ppid": 1234
    },
    "category": "processes"
}

prompt = f"""[INST] Analyze this security finding and provide risk scoring with subscores (impact, exposure, anomaly, confidence), severity classification, and actionability assessment.

Input:
{json.dumps(test_finding, indent=2)}
[/INST]
"""

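# Tokenize
inputs = tokenizer(prompt, return_tensors="pt").to(student_model.device)

# Generate (eval mode disables dropout, including LoRA dropout)
print("Generating response...")
student_model.eval()
with torch.no_grad():
    outputs = student_model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens, not the echoed prompt
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)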

print("\n" + "="*60)
print("Model Response:")
print("="*60)
print(response)
print("="*60)

## 12. Save Model

In [None]:
# Save the final model
print(f"Saving model to {CONFIG['output_dir']}...")

pipeline.save_model(trainer)

print("\n✓ Model saved successfully!")
print(f"\nModel location: {CONFIG['output_dir']}")
print("\nTo use this model in sys-scan-graph:")
print(f"1. Copy {CONFIG['output_dir']} to your sys-scan-graph agent directory")
print("2. Update sys_scan_agent/llm.py to load from this path")
print("3. Test with: sys-scan-graph analyze --report <report.json>")

## 13. Upload to HuggingFace Hub (Optional)

In [None]:
# Remove the surrounding triple quotes and configure to upload to HuggingFace

"""
from huggingface_hub import login, HfApi

# Login (you'll need a HuggingFace token)
login()

# Push to hub
repo_name = "your-username/mistral-7b-security-analyst"

student_model.push_to_hub(repo_name)
tokenizer.push_to_hub(repo_name)

print(f"✓ Model uploaded to https://huggingface.co/{repo_name}")
"""

## Summary

This notebook fine-tuned a Mistral-7B model for security analysis tasks using:

- **4-bit quantization** for memory efficiency
- **LoRA adapters** for parameter-efficient fine-tuning
- **Lion optimizer**, a sign-based optimizer that stores only one momentum buffer per parameter
- **Optional distillation** from teacher model
- **Synthetic security data** generated from sys-scan patterns

The resulting model is intended to analyze security findings, score risks, identify correlations, and generate actionable recommendations, all locally and without external API calls.