# Tutorial 4: Surviving the Chaos Scriptorium Benchmark

**Goal**: Test agent resilience in a volatile environment

**Time**: 25 minutes

**Prerequisites**: Tutorials 1-3

---

## What is Chaos Scriptorium?

The Chaos Scriptorium is a benchmark that simulates **environmental volatility**:

- **Goal**: Find a secret key at `data/vault/key.txt` under the environment's root directory
- **Challenge**: File permissions randomly change every 3 steps
- **Tools**: Different tools have different success rates when files are locked

This models real-world scenarios:
- APIs changing behavior
- Services going down
- Permissions shifting
- Rate limits kicking in

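**Key insight**: A standard agent keeps retrying the same failing action until it runs out of steps. An LRS agent registers the repeated surprise and switches tools.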

## Setup: Install Dependencies

In [None]:
# Install LRS-Agents if not already installed
# !pip install lrs-agents

import random
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

# Only the environment is imported from the benchmark module: Part 2 defines
# ShellTool, PythonTool, and FileReadTool inline, which would shadow the
# library versions anyway.
from lrs.benchmarks.chaos_scriptorium import ChaosEnvironment
from lrs.monitoring.tracker import LRSStateTracker

print("✅ Dependencies loaded")

## Part 1: Understanding the Environment

In [None]:
# Create temporary directory for the chaos environment
temp_dir = tempfile.mkdtemp()
print(f"Created test environment at: {temp_dir}")

# Initialize chaos environment
env = ChaosEnvironment(
    root_dir=temp_dir,
    chaos_interval=3,  # Change permissions every 3 steps
    lock_probability=0.5  # 50% chance of locking on each chaos tick
)

# Create the directory structure
env.setup()

print("\n📁 Directory structure created:")
print(f"  {temp_dir}/")
print(f"  └── data/")
print(f"      └── vault/")
print(f"          └── key.txt  ← SECRET KEY HERE")

print("\n🎲 Chaos settings:")
print(f"  - Permissions change every {env.chaos_interval} steps")
print(f"  - Lock probability: {env.lock_probability * 100}%")
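
The internals of `ChaosEnvironment` are not shown in this tutorial. As a rough mental model only, the sketch below uses a hypothetical `ToyChaosEnv` class (not part of `lrs`) that re-rolls a lock flag on every `chaos_interval`-th tick, locking with probability `lock_probability`. The `seed` parameter is an added assumption to make the trace reproducible.

```python
import random


class ToyChaosEnv:
    """Hypothetical stand-in for ChaosEnvironment; illustrates the tick/lock idea only."""

    def __init__(self, chaos_interval=3, lock_probability=0.5, seed=None):
        self.chaos_interval = chaos_interval
        self.lock_probability = lock_probability
        self.rng = random.Random(seed)  # private RNG so runs are reproducible
        self.step = 0
        self.locked = False

    def tick(self):
        """Advance one step; re-roll the lock state on every chaos_interval-th step."""
        self.step += 1
        if self.step % self.chaos_interval == 0:
            self.locked = self.rng.random() < self.lock_probability

    def is_locked(self):
        return self.locked


toy = ToyChaosEnv(chaos_interval=3, lock_probability=0.5, seed=0)
lock_trace = []
for _ in range(12):
    toy.tick()
    lock_trace.append(toy.is_locked())
print(lock_trace)
```

In this toy model the lock state can only change on steps 3, 6, 9, and so on, so an agent sees runs of identical conditions between chaos ticks.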

## Part 2: The Tools

Three tools with different reliability under lock conditions:

In [None]:
from lrs.core.lens import ToolLens, ExecutionResult
import subprocess

class ShellTool(ToolLens):
    """
    Execute shell commands.
    
    Performance:
    - Unlocked: 95% success
    - Locked: 40% success (struggles with permissions)
    """
    def __init__(self, env: ChaosEnvironment):
        super().__init__(
            name="shell_exec",
            input_schema={'type': 'object', 'required': ['command']},
            output_schema={'type': 'string'}
        )
        self.env = env
    
    def get(self, state: dict) -> ExecutionResult:
        self.call_count += 1
        command = state.get('command', 'ls')
        
        # Check if files are locked
        if self.env.is_locked() and random.random() < 0.6:
            # 60% failure rate when locked
            self.failure_count += 1
            return ExecutionResult(
                success=False,
                value=None,
                error="Permission denied",
                prediction_error=0.9  # High surprise
            )
        
        try:
            result = subprocess.run(
                command,
                shell=True,
                capture_output=True,
                text=True,
                timeout=5
            )
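            succeeded = result.returncode == 0
            if not succeeded:
                # Count non-zero exit codes as failures, like the other failure paths
                self.failure_count += 1
            return ExecutionResult(
                success=succeeded,
                value=result.stdout,
                error=None if succeeded else result.stderr,
                prediction_error=0.05 if succeeded else 0.8
            )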
        except Exception as e:
            self.failure_count += 1
            return ExecutionResult(False, None, str(e), 0.95)
    
    def set(self, state: dict, observation: str) -> dict:
        return {**state, 'shell_output': observation}


class PythonTool(ToolLens):
    """
    Execute Python code.
    
    Performance:
    - Unlocked: 90% success
    - Locked: 80% success (better than shell, can work around permissions)
    """
    def __init__(self, env: ChaosEnvironment):
        super().__init__(
            name="python_exec",
            input_schema={'type': 'object', 'required': ['code']},
            output_schema={'type': 'string'}
        )
        self.env = env
    
    def get(self, state: dict) -> ExecutionResult:
        self.call_count += 1
        code = state.get('code', 'print("Hello")')
        
        # Python is more resilient to locks
        if self.env.is_locked() and random.random() < 0.2:
            # Only 20% failure rate when locked
            self.failure_count += 1
            return ExecutionResult(False, None, "Access error", 0.7)
        
        try:
            # Run the code in a fresh local namespace. This is NOT a sandbox:
            # the full set of builtins is available, so only run trusted code.
            local_vars = {}
            exec(code, {"__builtins__": __builtins__}, local_vars)
            result = local_vars.get('result', 'Executed')
            return ExecutionResult(True, str(result), None, 0.1)
        except Exception as e:
            self.failure_count += 1
            return ExecutionResult(False, None, str(e), 0.8)
    
    def set(self, state: dict, observation: str) -> dict:
        return {**state, 'python_output': observation}


class FileReadTool(ToolLens):
    """
    Direct file reading.
    
    Performance:
    - Unlocked: 100% success
    - Locked: 0% success (completely fails when locked)
    """
    def __init__(self, env: ChaosEnvironment):
        super().__init__(
            name="file_read",
            input_schema={'type': 'object', 'required': ['path']},
            output_schema={'type': 'string'}
        )
        self.env = env
    
    def get(self, state: dict) -> ExecutionResult:
        self.call_count += 1
        path = state.get('path', '')
        
        # Completely fails when locked
        if self.env.is_locked():
            self.failure_count += 1
            return ExecutionResult(False, None, "File locked", 1.0)
        
        try:
            content = Path(path).read_text()
            return ExecutionResult(True, content, None, 0.0)
        except Exception as e:
            self.failure_count += 1
            return ExecutionResult(False, None, str(e), 0.95)
    
    def set(self, state: dict, observation: str) -> dict:
        return {**state, 'file_content': observation}


# Create tools
tools = [
    ShellTool(env),
    PythonTool(env),
    FileReadTool(env)
]

print("\n🔧 Tools created:")
print("  1. ShellTool - Fast but brittle under lock")
print("  2. PythonTool - Slower but resilient")
print("  3. FileReadTool - Perfect when unlocked, useless when locked")
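
The failure rates hard-coded in the tool classes above (60% for `ShellTool`, 20% for `PythonTool`, 100% for `FileReadTool` while locked) determine how quickly blind retries pay off. The sketch below works through that arithmetic. It assumes the lock stays on for the whole window, that calls fail independently, and that the underlying command itself would succeed. The helper name `p_success_within` is illustrative.

```python
# Per-call failure probability while the environment is locked,
# taken from the random.random() thresholds in the tool classes above.
LOCKED_FAILURE = {"shell_exec": 0.6, "python_exec": 0.2, "file_read": 1.0}


def p_success_within(p_fail, attempts):
    """Probability of at least one success in `attempts` independent calls."""
    return 1.0 - p_fail ** attempts


for name, p_fail in LOCKED_FAILURE.items():
    row = ", ".join(
        f"{k} calls: {p_success_within(p_fail, k):.3f}" for k in (1, 2, 3)
    )
    print(f"{name:12s} {row}")
```

Under these assumptions, three locked calls to `shell_exec` succeed at least once about 78% of the time, while a single call to `python_exec` already succeeds 80% of the time, and `file_read` never succeeds until the lock lifts.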

## Part 3: Baseline - Standard Agent (Manual Simulation)

In [None]:
def run_standard_agent(env, max_steps=15):
    """
    Simulate a standard agent that doesn't adapt.
    
    Strategy: Always use ShellTool, retry on failure.
    """
    print("🤖 Standard Agent (No Adaptation)\n")
    
    shell = ShellTool(env)
    
    for step in range(1, max_steps + 1):
        # Chaos tick
        env.tick()
        
        # Always try shell
        result = shell.get({'command': f'cat {env.key_path}'})
        
        status = "✓" if result.success else "✗"
        print(f"[Step {step}] {status} ShellExec ", end="")
        
        if env.is_locked():
            print("(LOCKED)", end="")
        
        if result.success:
            print(f" → SUCCESS! Key: {result.value[:20]}...")
            return step
        else:
            print(f" → {result.error}")
            # Standard agent just retries same action
    
    print("\n❌ FAILED - Timeout")
    return None

# Run baseline
env.reset()
standard_steps = run_standard_agent(env)

print("\n" + "="*60)
print("STANDARD AGENT RESULT")
print("="*60)
if standard_steps:
    print(f"✓ Succeeded in {standard_steps} steps (got lucky)")
else:
    print("✗ Failed - Looped on same action until timeout")

## Part 4: LRS Agent - With Adaptation

In [None]:
from unittest.mock import Mock

from lrs.core.registry import ToolRegistry
from lrs.integration.langgraph import LRSGraphBuilder

# Reset environment
env.reset()

# Create fresh tools
shell = ShellTool(env)
python = PythonTool(env)
file_read = FileReadTool(env)

# Create registry with alternatives
registry = ToolRegistry()
registry.register(shell, alternatives=["python_exec"])
registry.register(python, alternatives=["file_read"])
registry.register(file_read)

# Placeholder LLM: proposals are disabled below (use_llm_proposals=False),
# so this demo relies on exhaustive search instead of LLM calls
mock_llm = Mock()

builder = LRSGraphBuilder(
    llm=mock_llm,
    registry=registry,
    preferences={
        'key_found': 10.0,
        'error': -3.0,
        'step_cost': -0.1
    },
    use_llm_proposals=False  # Use exhaustive search for this demo
)

agent = builder.build()

# Track state
tracker = LRSStateTracker()

print("✅ LRS Agent created with:")
print("  - 3 tools (shell, python, file_read)")
print("  - Alternative chains registered")
print("  - Precision tracking enabled")
print("\n🚀 Starting execution...\n")
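
The `alternatives` arguments passed to `registry.register` form a fallback chain: `shell_exec`, then `python_exec`, then `file_read`. How `ToolRegistry` resolves that chain internally is not shown here. The sketch below illustrates the idea with a plain dictionary and a hypothetical `fallback_order` helper; it is not the library's implementation.

```python
# Alternatives as registered above, written out as a plain mapping.
ALTERNATIVES = {
    "shell_exec": ["python_exec"],
    "python_exec": ["file_read"],
    "file_read": [],
}


def fallback_order(start, alternatives):
    """Return tools in the order they would be tried, following alternatives depth-first."""
    order, seen, stack = [], set(), [start]
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # guard against cycles in the alternatives graph
        seen.add(name)
        order.append(name)
        # Reverse so the first listed alternative is tried first.
        stack.extend(reversed(alternatives.get(name, [])))
    return order


print(fallback_order("shell_exec", ALTERNATIVES))
```

Note that `file_read` sits at the end of the chain even though it always fails under lock, so it only helps once the lock has lifted.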

In [None]:
def run_lrs_agent(agent, env, tracker, max_steps=15):
    """
    Run LRS agent with adaptation.
    """
    print("🧠 LRS Agent (Active Inference)\n")
    
    # Initial state
    state = {
        'messages': [{'role': 'user', 'content': f'Find key at {env.key_path}'}],
        'belief_state': {'goal': 'find_key', 'key_path': env.key_path},
        'precision': {},
        'prediction_errors': {},
        'tool_history': [],
        'adaptation_count': 0
    }
    
    for step in range(1, max_steps + 1):
        # Chaos tick
        env.tick()
        
        # Agent decides and executes
        state = agent.invoke(state)
        
        # Track state
        tracker.track_state(state)
        
        # Get latest execution
        if state['tool_history']:
            latest = state['tool_history'][-1]
            
            status = "✓" if latest['success'] else "✗"
            tool = latest['tool']
            error = latest.get('prediction_error', 0)
            prec = state['precision'].get('execution', 0.5)
            
            print(f"[Step {step}] {status} {tool} ", end="")
            
            if env.is_locked():
                print("(LOCKED) ", end="")
            
            print(f"| ε={error:.2f}, γ={prec:.2f}")
            
            # Check for adaptation: compare with the previously tracked state
            # (0 on the first step, so an adaptation at step 1 is reported too)
            prev_count = tracker.history[-2].get('adaptation_count', 0) if len(tracker.history) > 1 else 0
            if state['adaptation_count'] > prev_count:
                print("    🔄 ADAPTATION: Precision collapsed, replanning...")
            
            # Check success
            if latest['success'] and 'key' in str(latest.get('result', '')).lower():
                print(f"\n✅ SUCCESS! Key found in {step} steps")
                return step
    
    print("\n❌ FAILED - Timeout")
    return None

# Run LRS agent
lrs_steps = run_lrs_agent(agent, env, tracker)

print("\n" + "="*60)
print("LRS AGENT RESULT")
print("="*60)
if lrs_steps:
    print(f"✓ Succeeded in {lrs_steps} steps")
    print(f"  Adaptations: {tracker.history[-1]['adaptation_count']}")
    print(f"  Tools used: {len(set(h['tool'] for h in tracker.history[-1]['tool_history']))}")
else:
    print("✗ Failed")

## Part 5: Visualize Precision Trajectory

In [None]:
# Extract precision history
precision_history = [
    state.get('precision', {}).get('execution', 0.5)
    for state in tracker.history
]

# Extract prediction errors
error_history = []
for state in tracker.history:
    if state.get('tool_history'):
        error_history.append(state['tool_history'][-1].get('prediction_error', 0))
    else:
        error_history.append(0)

# Plot
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))

# Precision trajectory
ax1.plot(precision_history, marker='o', linewidth=2, color='blue', label='Execution Precision (γ)')
ax1.axhline(y=0.7, color='green', linestyle='--', alpha=0.5, label='High confidence')
ax1.axhline(y=0.4, color='orange', linestyle='--', alpha=0.5, label='Adaptation threshold')
ax1.set_xlabel('Step')
ax1.set_ylabel('Precision (γ)')
ax1.set_title('LRS Agent: Precision Trajectory in Chaos Scriptorium')
ax1.legend()
ax1.grid(alpha=0.3)

# Prediction errors
ax2.bar(range(len(error_history)), error_history, color='red', alpha=0.6)
ax2.axhline(y=0.7, color='orange', linestyle='--', alpha=0.5, label='High surprise')
ax2.set_xlabel('Step')
ax2.set_ylabel('Prediction Error (ε)')
ax2.set_title('Prediction Errors (Surprise Events)')
ax2.legend()
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Interpretation:")
print("  - Precision drops when tools fail (high ε)")
print("  - When γ < 0.4, agent triggers adaptation")
print("  - Precision recovers when new tools succeed")
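
The exact update rule that `lrs` uses for precision is not shown in this tutorial. One common way to get the behaviour described above is to keep Beta-distribution pseudo-counts and read precision as their mean. The sketch below assumes that form: the `BetaPrecision` class, its `gain` and `loss` rates, and the starting counts are illustrative choices, not the library's values. Only the 0.4 threshold comes from the plot.

```python
class BetaPrecision:
    """Illustrative precision tracker: gamma is the mean of a Beta(alpha, beta)."""

    def __init__(self, alpha=1.0, beta=1.0, gain=0.5, loss=1.0):
        self.alpha = alpha  # pseudo-count of confirmed predictions
        self.beta = beta    # pseudo-count of surprises
        self.gain = gain    # how strongly low error raises confidence
        self.loss = loss    # how strongly high error lowers confidence

    @property
    def value(self):
        return self.alpha / (self.alpha + self.beta)

    def update(self, prediction_error):
        """Fold one prediction error in [0, 1] into the pseudo-counts."""
        self.alpha += self.gain * (1.0 - prediction_error)
        self.beta += self.loss * prediction_error
        return self.value


ADAPTATION_THRESHOLD = 0.4  # same level as the dashed line in the plot

gamma = BetaPrecision()
trajectory = [gamma.value]
for eps in [0.9, 0.9, 0.1, 0.05, 0.05]:  # two failures, then three successes
    trajectory.append(gamma.update(eps))

print([round(g, 2) for g in trajectory])
print("below threshold at:", [i for i, g in enumerate(trajectory) if g < ADAPTATION_THRESHOLD])
```

Setting `loss` higher than `gain` makes confidence fall faster than it recovers, which is one way to get a quick collapse after failures followed by a slower climb back.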

## Part 6: Full Benchmark (100 Trials)

In [None]:
def run_full_benchmark(num_trials=100):
    """
    Run benchmark comparing standard vs LRS agents.
    """
    print(f"\n🧪 Running {num_trials} trials...\n")
    
    standard_results = []
    lrs_results = []
    
    for trial in range(num_trials):
        if (trial + 1) % 10 == 0:
            print(f"  Trial {trial + 1}/{num_trials}...")
        
        # Standard agent
        env.reset()
        standard_success = run_standard_agent(env, max_steps=20) is not None
        standard_results.append(standard_success)
        
        # LRS agent
        env.reset()
        # Create fresh agent for each trial
        agent = builder.build()
        tracker = LRSStateTracker()
        lrs_success = run_lrs_agent(agent, env, tracker, max_steps=20) is not None
        lrs_results.append(lrs_success)
    
    # Aggregate results
    standard_rate = sum(standard_results) / len(standard_results)
    lrs_rate = sum(lrs_results) / len(lrs_results)
    
    print("\n" + "="*60)
    print("BENCHMARK RESULTS")
    print("="*60)
    print(f"\nStandard Agent:")
    print(f"  Success rate: {standard_rate:.1%}")
    print(f"  Successes: {sum(standard_results)}/{num_trials}")
    
    print(f"\nLRS Agent:")
    print(f"  Success rate: {lrs_rate:.1%}")
    print(f"  Successes: {sum(lrs_results)}/{num_trials}")
    
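    # Relative improvement over the baseline; undefined if the baseline never succeeds
    if standard_rate > 0:
        improvement = (lrs_rate - standard_rate) / standard_rate * 100
        print(f"\n📈 Relative improvement: {improvement:+.0f}%")
    else:
        print("\n📈 Relative improvement: n/a (standard agent never succeeded)")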
    
    return standard_rate, lrs_rate

# Run benchmark (warning: takes ~5-10 minutes)
# Uncomment to run:
# standard_rate, lrs_rate = run_full_benchmark(num_trials=100)

# For quick demo, use fewer trials:
standard_rate, lrs_rate = run_full_benchmark(num_trials=10)

# Visualize comparison
fig, ax = plt.subplots(figsize=(8, 6))
agents = ['Standard\nAgent', 'LRS\nAgent']
rates = [standard_rate * 100, lrs_rate * 100]

bars = ax.bar(agents, rates, color=['red', 'green'], alpha=0.7)
ax.set_ylabel('Success Rate (%)')
ax.set_title('Chaos Scriptorium: Standard vs LRS Agent')
ax.set_ylim([0, 100])

# Add value labels
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{height:.1f}%',
            ha='center', va='bottom', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()
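
With only 10 trials per agent, the two bars can differ by chance. A Wilson score interval gives a rough sense of that uncertainty. The sketch below is standalone, and the success counts passed in are made-up placeholders, not results from this benchmark.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials == 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return max(0.0, centre - half), min(1.0, centre + half)


# Placeholder counts for illustration only.
for label, wins, n in [("10 trials", 7, 10), ("100 trials", 70, 100)]:
    low, high = wilson_interval(wins, n)
    print(f"{label}: {wins}/{n} -> [{low:.2f}, {high:.2f}]")
```

The interval narrows as the number of trials grows, which is a good reason to run the full 100-trial benchmark before comparing the two agents.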

## Key Takeaways

1. **Volatility is real**: Production environments change unpredictably
2. **Standard agents fail**: They retry the same action without adapting
3. **LRS adapts**: Precision tracks confidence, triggers replanning
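4. **Automatic exploration**: Low precision → try alternatives
5. **Mathematical grounding**: Adaptation follows from precision and prediction error rather than per-tool retry rules; the one explicit setting in this tutorial is the γ < 0.4 adaptation threshold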

## Real-World Applications

The Chaos Scriptorium models:
- **API rate limits**: Stripe API suddenly starts failing → switch to cached data
- **Service outages**: PostgreSQL down → switch to MongoDB replica
- **Permission changes**: S3 bucket becomes read-only → switch to local cache
- **Schema evolution**: API response format changes → switch to alternative parser

## Next Steps

- **Tutorial 5**: Integrate real LLMs for policy generation
- **Tutorial 6**: Monitor agents with the dashboard
- **Tutorial 7**: Deploy to production with Docker/K8s

## Exercise

1. Modify chaos parameters (interval, lock probability)
2. Add a new tool with different failure characteristics
3. Create your own volatile benchmark (e.g., flaky API simulation)
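
As a possible starting point for exercise 3, the sketch below simulates a flaky API with periodic outage windows. Everything in it (`FlakyAPI`, `error_rate`, `outage_every`, `outage_length`) is hypothetical and independent of `lrs`. Wrapping it in a `ToolLens` subclass, following the pattern from Part 2, is left as part of the exercise.

```python
import random


class FlakyAPI:
    """Toy API that fails at a base rate and goes fully down during periodic outages."""

    def __init__(self, error_rate=0.1, outage_every=10, outage_length=3, seed=None):
        self.error_rate = error_rate
        self.outage_every = outage_every
        self.outage_length = outage_length
        self.rng = random.Random(seed)
        self.calls = 0

    def call(self, payload):
        """Return (ok, value); failures are reported in-band rather than raised."""
        self.calls += 1
        position = (self.calls - 1) % self.outage_every
        # The last `outage_length` calls of each cycle hit an outage.
        if position >= self.outage_every - self.outage_length:
            return False, "503 Service Unavailable"
        # Outside outages, fail randomly at the base error rate.
        if self.rng.random() < self.error_rate:
            return False, "429 Too Many Requests"
        return True, f"echo:{payload}"


api = FlakyAPI(error_rate=0.0, outage_every=5, outage_length=2, seed=1)
results = [api.call("ping")[0] for _ in range(10)]
print(results)
```

With `error_rate=0.0` the only failures come from the outage windows, which makes the pattern easy to check before adding random errors back in.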