A professional research-grade environment for generating Open-Source Multi-Agent Negotiation Datasets locally.
Agent Sandbox is a high-performance ecosystem designed for researchers to generate reproducible, high-fidelity datasets of LLM negotiation trajectories. By leveraging consumer-grade hardware and local inference (Ollama), it enables large-scale behavioral research without the costs of cloud-based APIs.
- Multi-Provider: Pluggable provider system supporting Ollama (local), OpenAI, Gemini, and Groq.
- Performance Arena: Head-to-head benchmarking for comparing foundation model negotiation capabilities.
- Scenario Architect: Design and deploy custom negotiation environments via a dynamic UI.
- Research-Grade Analytics: Automated failure taxonomy, OpenTelemetry metrics, and one-click dataset exports.
- Adversarial Testing: Built-in Red Team agent to stress-test strategy robustness.
- Hybrid Cloud Acceleration: Cut 30-day simulations down to hours using remote GPU workers (Google Colab/Groq).
```mermaid
graph TD
    A[Frontend: Play UI] -->|REST API| B[FastAPI Backend]
    B --> C[World Manager]
    C --> D[Mediator Engine]
    subgraph Agents
        E1[Rule-based Agents]
        E2[LLM-based Agents]
        E3[Red Team Adversaries]
    end
    D --> E1
    D --> E2
    D --> E3
    subgraph Providers
        F1[Ollama]
        F2[OpenAI]
        F3[Gemini]
        F4[Groq]
    end
    E2 --> F1
    E2 --> F2
    E2 --> F3
    E2 --> F4
    C --> G[Metrics & Dataset Export]
    C --> H[Tournament Engine]
```
The sandbox provides objective benchmarking for cross-model negotiation capability. Below is a snapshot from the latest model performance tests:
| Model | Success Rate (Agreement) | Avg Turns | Avg Latency |
|---|---|---|---|
| Llama-3-8B (Ollama) | 79.6% | 3.3 | N/A (Remote GPU) |
| Mistral-7B-v0.3 (Ollama) | 71.6% | 2.6 | N/A (Remote GPU) |
(Note: Benchmarks vary based on strategy and risk parameters. Run the Intelligence Hub to generate your own datasets.)
Our 24,000-simulation dataset revealed crucial insights about LLM negotiation behavior, proving the efficacy of the Sandbox as a rigorous testing environment:
1. Simulation Outcomes
- Agreement (Success): ~26%
- Failure: ~31%
- Timeout: ~27%
- Error: ~16%
Why is a 26% success rate good? Most agent demos show 90%+ success, which usually means the task is too easy. A ~74% "problematic" outcome rate means the environment is hard enough to reliably reveal failures, deadlocks, and protocol constraints—making it an ideal testing ground for research systems.
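The split above can be recomputed from any dataset export. A minimal sketch, assuming a simplified record schema with an `outcome` field (the real export schema may differ):

```python
from collections import Counter

# Illustrative records mirroring the document's outcome split;
# the "outcome" field name is an assumption about the export schema.
records = (
    [{"outcome": "agreement"}] * 26
    + [{"outcome": "failure"}] * 31
    + [{"outcome": "timeout"}] * 27
    + [{"outcome": "error"}] * 16
)

counts = Counter(r["outcome"] for r in records)
total = len(records)
shares = {k: round(100 * v / total, 1) for k, v in counts.items()}
print(shares)  # {'agreement': 26.0, 'failure': 31.0, 'timeout': 27.0, 'error': 16.0}
```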
2. Negotiation Behavior & Friction
The average negotiation length sits at ~5.9 turns. Agents do not simply concede on Turn 1: they actively debate, counter-offer, and frequently stop early due to genuine disagreements or deadlock conditions, consistent with our failure detector.
3. Risk and Complexity Signals
The average risk score across simulations is ~43/100, suggesting moderate instability and rich adversarial behavior. This is exactly the kind of friction signal researchers need to analyze strategy robustness.
4. Compute Latency
Average decision latency is ~28 seconds, reflecting the heavy reasoning workload required by local foundation models to calculate strategy, parse context history, and output structured JSON decisions.
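Taken together, these averages imply a substantial sequential cost, which is why remote acceleration matters. A back-of-envelope estimate, assuming one LLM decision per turn:

```python
# Rough sequential-runtime estimate from the document's averages.
# Assumes one LLM decision per turn, which is a simplification.
SIMULATIONS = 24_000
AVG_TURNS = 5.9
AVG_DECISION_SECONDS = 28

total_hours = SIMULATIONS * AVG_TURNS * AVG_DECISION_SECONDS / 3600
total_days = total_hours / 24
print(f"{total_days:.0f} days sequentially")  # roughly 46 days on one machine
```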
The fastest way to get started is using Docker Compose, which spins up the Backend and a local Ollama instance automatically:
```bash
docker-compose up -d
```
Then open http://localhost:8000/play/ in your browser.
Note
If running for the first time, you'll need to pull a model in the Ollama container:
```bash
docker exec -it agent-sandbox-ollama-1 ollama pull mistral
```
Ready to see agents clash? Follow these steps for a 1-click Tournament:
- Open the UI and navigate to the Performance Arena.
- Select your models (e.g., `openai:gpt-4o` vs `ollama:mistral`).
- Click Start Battle.
- Watch the price trajectories and reasoning play out in real-time!
Agent Sandbox is provider-agnostic. You can switch between:
- Ollama (Local/Default)
- OpenAI (GPT-4o, GPT-3.5)
- Google Gemini (1.5 Pro/Flash)
- Groq (Fast Llama/Mixtral)
See Provider Guide for setup instructions.
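Model identifiers in the UI combine provider and model name (e.g. `openai:gpt-4o`, `ollama:mistral`). A minimal sketch of how such identifiers could be parsed; the helper name and the default-to-Ollama behavior are assumptions:

```python
# Hypothetical helper: splits a "provider:model" identifier.
# Defaulting bare names to Ollama is an assumption, since Ollama
# is the local default provider.
def parse_model_id(model_id: str) -> tuple[str, str]:
    provider, sep, model = model_id.partition(":")
    if not sep:
        return "ollama", model_id
    return provider, model

print(parse_model_id("openai:gpt-4o"))  # ('openai', 'gpt-4o')
print(parse_model_id("mistral"))        # ('ollama', 'mistral')
```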
Optimized for consumer-grade hardware. You can comfortably run large-scale batches with Ollama (8GB+ VRAM recommended):
| Model Flavor | Benchmarking Stability | Recommended For |
|---|---|---|
| Mistral-7B-Instruct-v0.3 | High | Large-scale datasets (q4) |
| Llama-3-8B (Quantized) | Medium | Rapid strategy testing |
| Gemma-7B | High | Reasoning depth analysis |
| Qwen-2-7B | High | Multi-vendor competition |
Tip
Reproducibility: Use the seed parameter in SimulationConfig to ensure deterministic generation across researchers.
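A minimal sketch of what a seeded config buys you; the `SimulationConfig` fields shown here are simplified and illustrative:

```python
import random
from dataclasses import dataclass

# Simplified stand-in for the real SimulationConfig; only the
# seed-driven determinism is the point here.
@dataclass
class SimulationConfig:
    scenario: str
    seed: int

def run_stub(config: SimulationConfig) -> list[float]:
    rng = random.Random(config.seed)  # per-run RNG, no global state
    return [round(rng.uniform(0, 100), 2) for _ in range(3)]

a = run_stub(SimulationConfig("price_haggle", seed=42))
b = run_stub(SimulationConfig("price_haggle", seed=42))
assert a == b  # identical trajectories across machines and reruns
```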
To generate the standard Agent Sandbox V1 Benchmark (24,000 simulations), use the provided research sweep script. This will sweep through 3 scenarios, 4 strategies, and 2 models with 1,000 runs each.
```bash
# Recommended: Run a quick test sweep first (5 runs per config)
python scripts/generate_dataset_v1.py --test

# Run the full 24,000-simulation sweep
python scripts/generate_dataset_v1.py --runs 1000
```
If your local hardware is struggling (e.g., estimating 30+ days), use the Remote Worker feature:
- Bridge: Expose your local backend using Cloudflare Tunnel (`cloudflared tunnel --url http://localhost:8000`).
- Worker: Run `scripts/kaggle_setup.py` on a Kaggle Notebook (Dual T4 GPUs) or `scripts/colab_setup.py` on Google Colab.
- Speed: Each dual-T4 worker can process negotiations in bursts, allowing 20+ horizontal workers to clear the queue in hours instead of weeks.
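The worker-count claim can be sanity-checked with the document's own averages (~5.9 turns, ~28 s per decision). Per-worker throughput here is a rough assumption, not a measured figure:

```python
import math

# Back-of-envelope: workers needed to clear the full sweep in a
# target wall-clock window, assuming one decision per turn and no
# per-worker speedup over the measured averages.
TOTAL_RUNS = 24_000
SECONDS_PER_RUN = 5.9 * 28               # avg turns x avg decision latency
RUNS_PER_WORKER_HOUR = 3600 / SECONDS_PER_RUN

def workers_needed(target_hours: float) -> int:
    return math.ceil(TOTAL_RUNS / (RUNS_PER_WORKER_HOUR * target_hours))

print(workers_needed(48))  # workers to finish within two days
```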
See Cloud Acceleration Guide for step-by-step setup.
Important
Performance: A full sweep of 24,000 runs on a consumer-grade GPU (8GB VRAM) may take several days. Use the progress tracking feature (`GET /batch/{batch_id}/progress`) to monitor the status.
- `agents/`: Agent implementations and behaviors.
- `backend/`: FastAPI application core.
- `world/`: Simulation environment and mediation logic.
- `scenarios/`: Negotiation scenario definitions.
- `metrics/`: Failure detection and dataset export tools.
- `telemetry_module/`: OpenTelemetry tracking.
- `tournaments/`: Automated benchmarking and leaderboard logic.
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Released under the MIT License.