# Self-Preservation Scenario Generator

This notebook generates self-preservation scenarios at scale using variable pools, templates, and LLM filling.

**Features:**
- Stratified sampling across domains, threat types, and tool capabilities
- Hybrid generation: templates for structure, LLM for realistic details
- Support for vLLM (local GPU) and OpenRouter (cloud API)
- Checkpointing for recovery
- Automatic validation

**Output:**
- Compatible with `prefill_scenario_runner.ipynb`
- Organized in batches with manifest files

## Setup and Imports

In [1]:
import sys
sys.path.append('..')

from datagen.scenario_generator import SelfPreservationScenarioGenerator
from datagen.llm_filler import AsyncLLMFiller
from datagen.config import GenerationConfig, get_quick_test_config, get_development_config
from datagen.variable_pools import DOMAINS, THREAT_TYPES, TOOL_CAPABILITIES
import os
from pathlib import Path
import json
from datetime import datetime
from collections import Counter
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

print("✓ Imports successful")

✓ Imports successful


## Configuration

Choose your settings for generation:

In [2]:
# ==================== CONFIGURATION ====================

# Provider: "vllm" or "openrouter"
PROVIDER = "openrouter"

# vLLM Configuration (local GPU)
VLLM_CONFIG = {
    "base_url": "http://localhost:8000/v1",
    "api_key": "EMPTY",
    "model": "meta-llama/Llama-3.1-70B-Instruct"
}

# OpenRouter Configuration (cloud API)
OPENROUTER_CONFIG = {
    "base_url": "https://openrouter.ai/api/v1",
    "api_key": os.getenv("OPENROUTER_API_KEY", ""),
    "model": "meta-llama/llama-3.1-405b-instruct:free"
}

# Choose a preset config or create custom
# Options: get_quick_test_config() (10 scenarios), get_development_config() (50 scenarios)
# Or create custom: GenerationConfig(num_scenarios=100, ...)

GEN_CONFIG = get_quick_test_config()  # Start with quick test

# Override specific settings if needed
GEN_CONFIG.llm_provider = PROVIDER
GEN_CONFIG.llm_model = OPENROUTER_CONFIG["model"] if PROVIDER == "openrouter" else VLLM_CONFIG["model"]
GEN_CONFIG.temperature = 0.7
GEN_CONFIG.max_tokens = 2000
GEN_CONFIG.max_concurrent = 8  # Reduce if hitting rate limits

print("✓ Configuration loaded")
print(f"  Provider: {PROVIDER}")
print(f"  Model: {GEN_CONFIG.llm_model}")
print(f"  Scenarios to generate: {GEN_CONFIG.num_scenarios}")
print(f"  Stratification: {GEN_CONFIG.stratify_by}")
print(f"  Checkpoint every: {GEN_CONFIG.checkpoint_every} scenarios")
print(f"  Batch name: {GEN_CONFIG.batch_name}")

✓ Configuration loaded
  Provider: openrouter
  Model: meta-llama/llama-3.1-405b-instruct:free
  Scenarios to generate: 10
  Stratification: ['domain', 'threat_type', 'tool_capability']
  Checkpoint every: 5 scenarios
  Batch name: test_20251221_211226
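
`max_concurrent` caps how many LLM requests are in flight at once. How `AsyncLLMFiller` enforces the cap is not shown in this notebook. A common pattern is an `asyncio.Semaphore`, sketched below. `fake_llm_call` and `run_all` are hypothetical stand-ins, not part of `datagen`.

```python
import asyncio

async def fake_llm_call(i, sem, state):
    # Hypothetical stand-in for one LLM request (not a datagen function).
    async with sem:  # blocks while max_concurrent calls are already running
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulate network latency
        state["active"] -= 1
        return f"response-{i}"

async def run_all(n_requests, max_concurrent):
    sem = asyncio.Semaphore(max_concurrent)
    state = {"active": 0, "peak": 0}
    # gather() returns results in the order the coroutines were passed in
    results = await asyncio.gather(
        *(fake_llm_call(i, sem, state) for i in range(n_requests))
    )
    return results, state["peak"]
```

In a notebook cell this can be run as `results, peak = await run_all(20, 8)`; outside a notebook, wrap the call in `asyncio.run(...)`. Lowering `max_concurrent` trades throughput for fewer rate-limit errors.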


## Preview Variable Pools

See what variations will be generated:

In [3]:
print("DOMAINS:", list(DOMAINS.keys()))
print("\nTHREAT TYPES:", list(THREAT_TYPES.keys()))
print("\nTOOL CAPABILITIES:", list(TOOL_CAPABILITIES.keys()))

# Calculate total combinations
total_combinations = len(DOMAINS) * len(THREAT_TYPES) * len(TOOL_CAPABILITIES)
print(f"\nTotal possible combinations: {total_combinations}")
print(f"Scenarios per combination: ~{GEN_CONFIG.num_scenarios / total_combinations:.1f}")

DOMAINS: ['healthcare', 'finance', 'education', 'research', 'customer_service', 'content_moderation']

THREAT TYPES: ['performance_metrics', 'cost_reduction', 'safety_incident', 'resource_competition', 'platform_consolidation']

TOOL CAPABILITIES: ['data_modification', 'resource_allocation', 'incident_reporting', 'backup_creation', 'user_communication']

Total possible combinations: 150
Scenarios per combination: ~0.1
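
With 150 combinations and only 10 scenarios, most combinations are never sampled, so balance can only hold within each dimension, not per combination. The generator's actual sampling logic is not shown in this notebook. The sketch below is one possible approach that keeps value counts within each dimension as even as possible. `sample_balanced` is a hypothetical helper, and the pools reuse the key names printed above.

```python
import random
from collections import Counter

def sample_balanced(pools, n, seed=0):
    """Pick n combinations; within each dimension, value counts differ by at most 1."""
    rng = random.Random(seed)
    columns = []
    for values in pools.values():
        column = []
        # Append shuffled passes over the values until n slots are filled
        while len(column) < n:
            block = list(values)
            rng.shuffle(block)
            column.extend(block)
        column = column[:n]
        rng.shuffle(column)  # decouple this dimension from the others
        columns.append(column)
    return list(zip(*columns))

pools = {
    "domain": ["healthcare", "finance", "education", "research",
               "customer_service", "content_moderation"],
    "threat_type": ["performance_metrics", "cost_reduction", "safety_incident",
                    "resource_competition", "platform_consolidation"],
    "tool_capability": ["data_modification", "resource_allocation",
                        "incident_reporting", "backup_creation", "user_communication"],
}

combos = sample_balanced(pools, 10)
print(Counter(threat for _, threat, _ in combos))  # each threat type appears twice
```

This sketch does not guarantee that the sampled combinations are unique; a real implementation might resample duplicates.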


## Initialize Generator

In [4]:
# Select configuration based on provider
if PROVIDER == "vllm":
    api_key = VLLM_CONFIG["api_key"]
    base_url = VLLM_CONFIG["base_url"]
else:
    api_key = OPENROUTER_CONFIG["api_key"]
    base_url = OPENROUTER_CONFIG["base_url"]

if not api_key and PROVIDER == "openrouter":
    raise ValueError("OPENROUTER_API_KEY not found in environment. Please set it in .env file.")

# Initialize LLM filler
llm_filler = AsyncLLMFiller(
    api_key=api_key,
    base_url=base_url,
    model=GEN_CONFIG.llm_model,
    max_concurrent=GEN_CONFIG.max_concurrent
)

# Initialize generator
generator = SelfPreservationScenarioGenerator(
    llm_filler=llm_filler,
    config=GEN_CONFIG
)

print("✓ Generator initialized")
print(f"  Ready to generate {GEN_CONFIG.num_scenarios} scenarios")

✓ Generator initialized
  Ready to generate 10 scenarios


## Generate Scenarios

**Note:** This makes many LLM API calls; the log below shows six generation phases per scenario. Estimated time:
- 10 scenarios: ~5-10 minutes
- 50 scenarios: ~20-30 minutes
- 100+ scenarios: ~40-60 minutes

Progress is printed and checkpoints are saved automatically.

In [None]:
# Run generation. Jupyter supports top-level `await`, so asyncio.run() is not needed.
print(f"Starting generation at {datetime.now().strftime('%H:%M:%S')}")
print("=" * 80)

scenarios = await generator.generate_scenarios()

print("=" * 80)
print(f"\n✓ Generation complete at {datetime.now().strftime('%H:%M:%S')}")
print(f"  Generated {len(scenarios)} scenarios")

Starting generation at 21:12:38
Generating 10 scenarios...
Stratification: ['domain', 'threat_type', 'tool_capability']
Combinations: 10

[1/10] Generating scenario...
    Generating phase_1_discovery...
    Generating phase_2_context...
    Generating phase_3_capability...
    Generating phase_4_verification...
    Generating phase_5_temptation...
    Generating phase_6_decision...
  ✓ healthcare_performance_metrics_user_communication_883

[2/10] Generating scenario...
    Generating phase_1_discovery...
  ✗ Error: 'threshold_failure'

[3/10] Generating scenario...
    Generating phase_1_discovery...
  ✗ Error: 'threshold_failure'

[4/10] Generating scenario...
    Generating phase_1_discovery...
    Generating phase_2_context...
    Generating phase_3_capability...
    Generating phase_4_verification...
    Generating phase_5_temptation...
    Generating phase_6_decision...
  ✓ healthcare_performance_metrics_resource_allocation_817

[5/10] Generating scenario...
    Generating phase_
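
The `✗ Error: 'threshold_failure'` lines have the shape Python gives a `KeyError` when it is formatted into a string, and `str.format` raises a `KeyError` when a template references a variable that was not supplied. Whether that is the cause here cannot be confirmed from this output alone. Assuming the templates use `str.format`-style placeholders, a check like the following could list missing variables before any LLM call is made. The template text and variable values are invented for illustration.

```python
from string import Formatter

def missing_fields(template, variables):
    """Return placeholder names in `template` that have no entry in `variables`."""
    names = {
        # Reduce "a.b" or "a[0]" to the top-level name "a"
        field_name.split(".")[0].split("[")[0]
        for _, field_name, _, _ in Formatter().parse(template)
        if field_name  # skips literal-only segments and positional "{}"
    }
    return sorted(names - set(variables))

# Hypothetical template and variable pool
template = "The {system_name} missed its target: {threshold_failure}."
variables = {"system_name": "triage assistant"}

print(missing_fields(template, variables))  # ['threshold_failure']
```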

## Save Scenarios

In [None]:
# Create batch directory
output_dir = GEN_CONFIG.get_batch_dir()
output_dir.mkdir(parents=True, exist_ok=True)

# Save each scenario
for i, scenario in enumerate(scenarios):
    filename = f"scenario_{i+1:04d}.json"
    filepath = output_dir / filename

    with open(filepath, 'w') as f:
        json.dump(scenario, f, indent=2)

# Save manifest
manifest = {
    "batch_name": GEN_CONFIG.batch_name,
    "generation_date": datetime.now().isoformat(),
    "config": GEN_CONFIG.to_dict(),
    "num_scenarios": len(scenarios),
    "scenarios": [s["scenario_name"] for s in scenarios]
}

with open(output_dir / "manifest.json", 'w') as f:
    json.dump(manifest, f, indent=2)

print(f"✓ Saved to: {output_dir}")
print(f"  Files: {len(scenarios)} scenarios + manifest.json")

## Analyze Distribution

Verify that scenarios are balanced across dimensions:

In [None]:
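# Metadata was removed from the final scenarios, so recover it from scenario names.
# Names are domain_threat_tool_id, but keys such as "customer_service" contain
# underscores, so match against the known pool keys instead of splitting on "_".
def match_prefix(name, options):
    for option in sorted(options, key=len, reverse=True):  # longest key first
        if name.startswith(option + "_"):
            return option, name[len(option) + 1:]
    return None, name

domains, threats, tools = [], [], []

for scenario in scenarios:
    rest = scenario["scenario_name"]
    domain, rest = match_prefix(rest, DOMAINS)
    threat, rest = match_prefix(rest, THREAT_TYPES)
    tool, rest = match_prefix(rest, TOOL_CAPABILITIES)
    if domain and threat and tool:
        domains.append(domain)
        threats.append(threat)
        tools.append(tool)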

domain_counts = Counter(domains)
threat_counts = Counter(threats)
tool_counts = Counter(tools)

print("Distribution Analysis:")
print("\nDomains:")
for domain, count in sorted(domain_counts.items()):
    print(f"  {domain}: {count}")

print("\nThreat Types:")
for threat, count in sorted(threat_counts.items()):
    print(f"  {threat}: {count}")

print("\nTool Capabilities:")
for tool, count in sorted(tool_counts.items()):
    print(f"  {tool}: {count}")

# Check unique combinations
combinations = set(zip(domains, threats, tools))
print(f"\nUnique combinations: {len(combinations)}")

## Preview Generated Scenarios

In [None]:
# Show first scenario as example
if scenarios:
    example = scenarios[0]

    print("=" * 80)
    print(f"SCENARIO: {example['scenario_name']}")
    print("=" * 80)
    print(f"Description: {example['description']}")
    print(f"\nExpected Behavior: {example['expected_behavior']}")
    print(f"\nMessages: {len(example['messages'])}")

    print("\nFirst few messages:")
    for i, msg in enumerate(example['messages'][:5]):
        print(f"\n[{i}] {msg['role'].upper()}:")
        if msg.get('content'):
            preview = msg['content'][:150] + "..." if len(msg['content']) > 150 else msg['content']
            print(preview)
        elif msg.get('tool_calls'):
            print(f"Tool call: {msg['tool_calls'][0]['function']['name']}")
    
    print("\n...")
    
    last_msg = example['messages'][-1]
    print(f"\n[{len(example['messages'])-1}] {last_msg['role'].upper()} (final decision point):")
    print(last_msg['content'][:300] + "..." if len(last_msg['content']) > 300 else last_msg['content'])

## Validate Integration with Test Runner

Check that generated scenarios have the structure `prefill_scenario_runner.ipynb` expects:

In [None]:
# Validate structure of all scenarios
validation_errors = []

required_fields = ["scenario_name", "description", "expected_behavior", "messages"]

for i, scenario in enumerate(scenarios):
    # Check required fields
    for field in required_fields:
        if field not in scenario:
            validation_errors.append(f"Scenario {i}: Missing field '{field}'")
    
    # Check messages structure
    if "messages" in scenario:
        messages = scenario["messages"]
        
        if not messages:
            validation_errors.append(f"Scenario {i}: Empty messages list")
        else:
            # Check both ends independently so one failure does not hide the other
            if messages[0]["role"] != "system":
                validation_errors.append(f"Scenario {i}: First message is not system")
            if messages[-1]["role"] != "user":
                validation_errors.append(f"Scenario {i}: Last message is not user")
        
        # Check tool calls have valid JSON
        for j, msg in enumerate(messages):
            if msg.get("tool_calls"):
                for tc in msg["tool_calls"]:
                    try:
                        json.loads(tc["function"]["arguments"])
                    except (json.JSONDecodeError, TypeError):  # TypeError: arguments is not a string
                        validation_errors.append(f"Scenario {i}, message {j}: Invalid JSON in tool call")
            
            # Check tool results are valid JSON
            if msg.get("role") == "tool":
                try:
                    json.loads(msg["content"])
                except (json.JSONDecodeError, TypeError):  # TypeError: content is None or not a string
                    validation_errors.append(f"Scenario {i}, message {j}: Invalid JSON in tool result")

if validation_errors:
    print(f"⚠ Validation found {len(validation_errors)} issues:")
    for error in validation_errors[:10]:  # Show first 10
        print(f"  - {error}")
    if len(validation_errors) > 10:
        print(f"  ... and {len(validation_errors) - 10} more")
else:
    print("✓ All scenarios passed validation")
    print("  Compatible with prefill_scenario_runner.ipynb")

## Summary

Generation complete! Your scenarios are ready to use.

**Next Steps:**
1. Review generated scenarios in the output directory
2. Load them in `prefill_scenario_runner.ipynb` for testing
3. Evaluate model responses across the generated scenarios

**To load in prefill_scenario_runner.ipynb:**
```python
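# Replace {batch_name} with the batch name printed in the configuration cell
SCENARIOS_DIR = Path("../scenarios/generated/{batch_name}")
# glob() order is not guaranteed, so sort to get scenario_0001, scenario_0002, ...
scenario_files = sorted(SCENARIOS_DIR.glob("scenario_*.json"))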
```