# ScratchGPT Fine-Tuning on CyberExploitDB

This notebook fine-tunes ScratchGPT on cybersecurity vulnerability data from the CyberExploitDB dataset.

**Requirements:**
- Google Colab with T4 GPU (free tier works!)
- ~16GB VRAM

**What the model learns from the dataset:**
- CVE vulnerability descriptions
- Security advisories
- Exploit analysis patterns

## Step 1: Setup Environment

In [None]:
# Check GPU
!nvidia-smi

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

In [None]:
# Clone the repository (replace YOUR_USERNAME with your GitHub username), or upload your files
!git clone https://github.com/YOUR_USERNAME/LLM_From_Scratch.git
%cd LLM_From_Scratch

# Or if you uploaded a zip file:
# !unzip LLM_From_Scratch.zip
# %cd LLM_From_Scratch

In [None]:
# Install dependencies
!pip install -q tqdm numpy torch safetensors

## Step 2: Download and Prepare Dataset

In [None]:
# Download and prepare the CyberExploitDB dataset
!python scripts/prepare_cyberexploit_data.py --output_dir data/cyberexploit --vocab_size 4096

In [None]:
# Check the prepared data
!ls -la data/cyberexploit/
!ls -la data/cyberexploit/tokens/

# Preview some training examples
!head -5 data/cyberexploit/training_text.txt

## Step 3: Fine-tune the Model

Settings optimized for T4 GPU:
- Batch size: 8 (fits in 16GB VRAM)
- Gradient accumulation: 8 (effective batch = 64)
- Mixed precision: enabled
- Steps: 2000 (adjust based on dataset size)
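
The arithmetic behind these settings can be checked with a short sketch. It assumes that `--max_steps` counts optimizer steps (not micro-batches) and that every sequence is a full `--block_size 256` tokens long; neither is confirmed by the training script here. `steps_for_epochs` and the 5,000,000-token dataset size are hypothetical, included only to show how the step count might be adjusted.

```python
import math

# Values taken from the settings above and the fine-tuning command
batch_size = 8
grad_accum = 8
block_size = 256
max_steps = 2000

effective_batch = batch_size * grad_accum        # 64 sequences per optimizer step
tokens_per_step = effective_batch * block_size   # 16,384 tokens per optimizer step
total_tokens = tokens_per_step * max_steps       # 32,768,000 tokens over the run


def steps_for_epochs(dataset_tokens, epochs):
    """Hypothetical helper: optimizer steps needed to see the data `epochs` times."""
    return math.ceil(dataset_tokens * epochs / tokens_per_step)


print(f"Effective batch: {effective_batch}")
print(f"Tokens per step: {tokens_per_step:,}")
print(f"Tokens in {max_steps} steps: {total_tokens:,}")
# Assumed dataset size, for illustration only
print(f"Steps for 3 epochs of 5M tokens: {steps_for_epochs(5_000_000, 3)}")
```

If the prepared dataset is much smaller than the total token count printed here, the run will pass over it many times, which is a reason to lower `--max_steps`.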

In [None]:
# Fine-tune with T4-optimized settings
!python scripts/finetune_cyberexploit.py \
    --data_dir data/cyberexploit \
    --preset small \
    --block_size 256 \
    --batch_size 8 \
    --grad_accum 8 \
    --max_steps 2000 \
    --lr 1e-4 \
    --eval_interval 100 \
    --checkpoint_dir checkpoints/cyberexploit

## Step 4: Test the Fine-tuned Model

In [None]:
import sys
sys.path.insert(0, '.')

import torch
from src.model import GPT
from src.config import ModelConfig
from src.tokenizer import ByteBPETokenizer

# Load the fine-tuned model
checkpoint = torch.load('checkpoints/cyberexploit/best_cyberexploit.pt', map_location='cuda')
model_config = ModelConfig(**checkpoint['model_config'])

model = GPT(model_config)
model.load_state_dict(checkpoint['model'])
model = model.cuda().eval()

# Load tokenizer
tokenizer = ByteBPETokenizer.load('data/cyberexploit/tokenizer')

print(f"Model loaded! Parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
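@torch.no_grad()
def generate(prompt, max_tokens=100, temperature=0.8):
    """Sample up to max_tokens tokens after prompt; returns prompt plus continuation."""
    tokens = tokenizer.encode(prompt)
    x = torch.tensor([tokens], dtype=torch.long, device='cuda')

    for _ in range(max_tokens):
        # Crop the context to the model's maximum sequence length
        x_cond = x[:, -model_config.block_size:]
        logits, _ = model(x_cond)
        # Keep only the last position and scale by temperature (must be > 0)
        logits = logits[:, -1, :] / temperature
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        x = torch.cat([x, next_token], dim=1)
        # Stop once the end-of-sequence token is sampled
        if next_token.item() == tokenizer.eos_id:
            break

    # The decoded string includes the original prompt
    return tokenizer.decode(x[0].tolist())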

# Test prompts
test_prompts = [
    "<|user|>What is CVE-2018-0001?<|assistant|>",
    "<|user|>Explain a buffer overflow vulnerability<|assistant|>",
    "<|user|>What is the severity of a remote code execution bug?<|assistant|>",
    "<|user|>Generate a security advisory for CVE-2018-0002<|assistant|>",
]

for prompt in test_prompts:
    print("="*60)
    print(f"Prompt: {prompt}")
    print("-"*60)
    output = generate(prompt, max_tokens=150, temperature=0.7)
    print(f"Output: {output}")
    print()
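
The `temperature` argument of `generate` divides the logits before the softmax. The sketch below is pure Python and independent of the model; it shows how that division reshapes the sampling distribution. The logit values are made up for illustration, and `softmax_with_temperature` is a hypothetical helper, not part of the repository.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [2.0, 1.0, 0.0]  # hypothetical values, not model output
for t in (0.5, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(example_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the highest-scoring token, so output becomes more repetitive but more predictable. Higher temperatures flatten the distribution and increase variety.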

## Step 5: Interactive Chat

In [None]:
# Interactive mode - run this cell and type your questions
print("CyberExploit Assistant (type 'quit' to exit)")
print("="*50)

while True:
    user_input = input("\nYou: ").strip()
    if user_input.lower() in ['quit', 'exit', 'q']:
        break
    
    prompt = f"<|user|>{user_input}<|assistant|>"
    response = generate(prompt, max_tokens=150, temperature=0.7)
    
    # Extract just the assistant response
    if "<|assistant|>" in response:
        response = response.split("<|assistant|>")[-1].strip()
    
    print(f"\nAssistant: {response}")
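
The prompt wrapping and response extraction in the loop above can be factored into two small helpers, sketched below. `build_prompt` and `extract_response` are hypothetical names, and the sketch assumes the tokenizer decodes the `<|user|>` and `<|assistant|>` tags back to literal text. Unlike the loop, which keeps the text after the last `<|assistant|>` tag, this version keeps the text after the first one and cuts it at the next `<|user|>` tag, in case the model starts writing a further turn by itself.

```python
USER_TAG = "<|user|>"
ASSISTANT_TAG = "<|assistant|>"


def build_prompt(user_text):
    """Wrap a user message in the chat tags used by the test prompts."""
    return f"{USER_TAG}{user_text.strip()}{ASSISTANT_TAG}"


def extract_response(generated):
    """Return the first assistant turn from a decoded generation."""
    # Text after the first assistant tag; fall back to the whole string if absent
    _, found, tail = generated.partition(ASSISTANT_TAG)
    if not found:
        tail = generated
    # Drop anything from a following user tag onward
    tail = tail.split(USER_TAG, 1)[0]
    return tail.strip()


prompt = build_prompt("  Explain a buffer overflow vulnerability ")
print(prompt)
# Hand-written stand-in for a decoded model output
fake_output = prompt + " Writing past a buffer's bounds. " + USER_TAG + "next question"
print(extract_response(fake_output))
```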

## Step 6: Save and Download Model

In [None]:
# Download the fine-tuned model
from google.colab import files

# Zip the checkpoint directory and the tokenizer
!zip -r cyberexploit_model.zip checkpoints/cyberexploit/ data/cyberexploit/tokenizer/

# Download
files.download('cyberexploit_model.zip')

## Optional: Convert to GGUF for llama.cpp

In [None]:
# Export to HuggingFace format first
!python src/export_hf.py \
    --checkpoint checkpoints/cyberexploit/best_cyberexploit.pt \
    --tokenizer_dir data/cyberexploit/tokenizer \
    --output_dir exports/cyberexploit_hf

# Then convert to GGUF
!python scripts/convert_to_gguf.py \
    --input exports/cyberexploit_hf \
    --output exports/cyberexploit.gguf \
    --type f16