Futuresim

Futuresim is a forecasting simulator for LLM agents. It advances a dated question market, exposes only the information available at each simulated date, records agent forecasts, and scores them over time.

For background, see the Futuresim blog post and paper.

Quick Start

git clone https://github.com/OpenForecaster/futuresim.git
cd futuresim

uv sync
source .venv/bin/activate

cp .env.example .env
# Edit .env if you need OpenRouter keys, custom output paths, or local artifacts.

python scripts/run_forecast_sim.py --config configs/shared/default_sim.yaml

The default search-enabled configs use LanceDB and require FSIM_SEARCH_DB. For a no-retrieval smoke run, use:

python scripts/run_forecast_sim.py --config configs/shared/default_nosearch_sim.yaml

Configuration

scripts/run_forecast_sim.py loads .env from the repo root. Shell exports override .env values. ${FSIM_REPO_DIR} expands to this checkout.
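The precedence rule above can be illustrated with a short sketch. This is not the loader used by scripts/run_forecast_sim.py; the helper names parse_env_file and resolve_settings are hypothetical, and the parsing is deliberately minimal (plain KEY=VALUE lines only).

```python
def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_settings(env_text: str, shell_env: dict[str, str], repo_dir: str) -> dict[str, str]:
    """Merge .env values with shell exports; shell exports win on conflict."""
    merged = {**parse_env_file(env_text), **shell_env}
    # Expand ${FSIM_REPO_DIR} to the checkout path in every value.
    return {k: v.replace("${FSIM_REPO_DIR}", repo_dir) for k, v in merged.items()}


# Hypothetical values for illustration only.
env_text = "FSIM_OUTPUT_BASE=${FSIM_REPO_DIR}/outputs\nFSIM_DATASET_CACHE=${FSIM_REPO_DIR}/cache\n"
settings = resolve_settings(env_text, {"FSIM_OUTPUT_BASE": "/scratch/runs"}, "/home/me/futuresim")
print(settings)
```

Here the shell value for FSIM_OUTPUT_BASE replaces the .env entry, while FSIM_DATASET_CACHE keeps its .env value with the checkout path substituted.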

Common variables:

| Variable | Use |
| --- | --- |
| `OPENROUTER_API_KEY` | Required for OpenRouter-backed agents or answer matching |
| `FSIM_OUTPUT_BASE` | Simulation output root |
| `FSIM_DATASET_PATH` | Hugging Face dataset id or local dataset path |
| `FSIM_DATASET_CACHE` | Hugging Face dataset cache directory |
| `FSIM_SEARCH_DB` | LanceDB index path for bundled hybrid search |
| `FSIM_ARTICLES_BASE` | Dated article JSONL tree for filesystem article browsing |
| `FSIM_EMBEDDING_MODEL` | Embedding model used by the LanceDB index |
| `FSIM_MATCHER_MODEL` | OpenRouter/vLLM answer-matcher model |
| `FSIM_SIM_MATCHER_CACHE_DIR` | Optional shared answer-matcher cache directory |

One-off override example:

FSIM_OUTPUT_BASE=/scratch/$USER/futuresim-runs \
python scripts/run_forecast_sim.py --config configs/shared/default_sim.yaml

Data And Search

OpenForesight questions load from Hugging Face by default. The default config uses the aljazeera2026Q1 split.

Futuresim separates the question market from agent retrieval:

The environment owns dates, visible questions, visible article files, forecast ingestion, answer matching, and scoring.
Agents own their retrieval strategy. They can use the filesystem article corpus, the bundled LanceDB hybrid search tool, or a custom tool.

Download the prebuilt LanceDB artifact for the bundled hybrid search configs:

export FSIM_SEARCH_DB="${FSIM_SEARCH_DB:-$(pwd)/artifacts/forecast-news-embeddings}"

hf download shash42/forecast-news-embeddings \
  --repo-type dataset \
  --local-dir "$FSIM_SEARCH_DB" \
  --max-workers 8

python scripts/check_search_readiness.py --db-path "$FSIM_SEARCH_DB"

The public embedding model used with this index is Qwen/Qwen3-Embedding-8B. Set FSIM_EMBEDDING_MODEL to a local checkout, a model id, or an embedding server target supported by your search backend.

Download the browsable article corpus separately:

export FSIM_ARTICLES_BASE="${FSIM_ARTICLES_BASE:-$(pwd)/artifacts/forecast-news}"

hf download shash42/forecast-news \
  --repo-type dataset \
  --local-dir "$FSIM_ARTICLES_BASE" \
  --include '2025/12/**' \
  --include '2026/**' \
  --max-workers 8

FSIM_ARTICLES_BASE must point to a dated tree: YYYY/MM/DD/articles.jsonl. Rows should include title, source, date, and content; date_publish, url, id, and date_modify are optional.
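As a rough illustration of the layout described above, the following sketch walks a YYYY/MM/DD/articles.jsonl tree, keeps only files dated on or before a given day, and reports rows missing the expected fields. The function names are hypothetical, and the "on or before" cutoff is an assumption; the simulator's own visibility rule may differ.

```python
import json
from datetime import date
from pathlib import Path

REQUIRED_FIELDS = ("title", "source", "date", "content")


def visible_article_files(base: Path, as_of: date) -> list[Path]:
    """Return YYYY/MM/DD/articles.jsonl files dated on or before as_of, oldest first."""
    found = []
    for path in base.glob("*/*/*/articles.jsonl"):
        year, month, day = path.relative_to(base).parts[:3]
        try:
            file_date = date(int(year), int(month), int(day))
        except ValueError:
            continue  # skip directories whose names are not a valid date
        if file_date <= as_of:
            found.append((file_date, path))
    return [p for _, p in sorted(found, key=lambda item: item[0])]


def load_rows(path: Path):
    """Yield one dict per non-empty JSONL line."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def missing_fields(row: dict) -> list[str]:
    """List the expected fields that are absent or empty in a row."""
    return [field for field in REQUIRED_FIELDS if not row.get(field)]
```

A quick check over a downloaded corpus could loop over visible_article_files(Path(os.environ["FSIM_ARTICLES_BASE"]), some_date) after importing os, and print any row for which missing_fields returns a non-empty list.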

Custom Data

Use --dataset custom --dataset_path <file-or-dir> with CSV, JSONL, JSON, or Parquet. A directory may contain split files such as test.jsonl, test.parquet, or test-*.parquet.
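The file-or-directory behaviour can be pictured with the sketch below. It is an assumption about how split files might be matched, not the loader's actual logic: find_split_files is a hypothetical helper, and it accepts the `<split>-*` shard pattern for every supported extension even though the text above only mentions it for Parquet.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = (".csv", ".jsonl", ".json", ".parquet")


def find_split_files(dataset_path: Path, split: str = "test") -> list[Path]:
    """Return the files that would supply `split` for a file or directory path."""
    if dataset_path.is_file():
        return [dataset_path]
    matches = [
        p for p in dataset_path.iterdir()
        if p.suffix in SUPPORTED_SUFFIXES
        and (p.stem == split or p.stem.startswith(f"{split}-"))
    ]
    return sorted(matches)
```

For a directory containing test.jsonl, test-00000.parquet, and train.jsonl, this returns only the two test files.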

Required columns:

| Column | Accepted aliases |
| --- | --- |
| `qid` | `question_id`, `id` |
| `title` | `question_title`, `question` |
| `resolution_date` | `close_time`, `resolve_time` |
| `ground_truth_answer` | `ground_truth`, `answer`, `resolution`, `resolved_to` |

Optional columns: background, resolution_criteria, answer_type, options, source_split, and prompt.
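To show how the alias table might be applied to one record, here is a minimal sketch. The helper normalize_row is hypothetical, and the precedence (canonical name first, then aliases in the order listed) is an assumption rather than documented behaviour.

```python
ALIASES = {
    "qid": ("question_id", "id"),
    "title": ("question_title", "question"),
    "resolution_date": ("close_time", "resolve_time"),
    "ground_truth_answer": ("ground_truth", "answer", "resolution", "resolved_to"),
}


def normalize_row(row: dict) -> dict:
    """Rename accepted aliases to canonical column names; raise if a column is missing."""
    out = dict(row)
    for canonical, aliases in ALIASES.items():
        if canonical in out:
            continue  # canonical name already present, keep it
        for alias in aliases:
            if alias in out:
                out[canonical] = out.pop(alias)
                break
        else:
            raise KeyError(f"missing required column: {canonical}")
    return out


# Hypothetical record using only alias names.
record = {"id": 7, "question": "Will X happen?", "close_time": "2026-01-31", "answer": "yes"}
print(normalize_row(record))
```

Optional columns such as background or options pass through unchanged because the function copies the row before renaming.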

Example:

python scripts/run_forecast_sim.py \
  --dataset custom \
  --dataset_path /path/to/questions.jsonl \
  --split test

To use a custom search backend, implement the BaseSearchTool contract in agents/search_tools/base.py. For LanceDB, semantic/hybrid search needs an articles table with chunk ids, article ids, date fields, content, optional metadata, and vectors built with the configured embedding model.

Platform Integrations

Futuresim includes adapters for Prime Intellect Verifiers and OpenReward/ORS. They use the same SimulationEnvironment and run a MinimalHarness-compatible CLI agent through the packaged MCP server:

python -m futuresim_agents.minimalHarnessAgent.mcp_server

Important defaults:

The filesystem article corpus is the default information source.
Hybrid LanceDB search is opt-in via futuresim.enable_hybrid_search: true.
Hosted runs only accept forecasts submitted through MCP submit_forecasts and finalized with next_day.
Sandboxes block general internet by default to avoid future leakage.
Codex and Claude CLI reproductions require each user to supply their own CLI or provider credentials, either through platform secrets or through an equivalent private setup.

See integrations/README.md for sandbox image requirements, credential handling, network/egress guidance, and publication steps for Verifiers and OpenReward.

Common Commands

# Default shared simulation
python scripts/run_forecast_sim.py --config configs/shared/default_sim.yaml

# No-retrieval variant
python scripts/run_forecast_sim.py --config configs/shared/default_nosearch_sim.yaml

# Resume from the last day in a run directory
python scripts/run_forecast_sim.py --resume /path/to/output_dir

# Restart from a specific day while preserving prior forecasts
python scripts/run_forecast_sim.py \
  --restart_from /path/to/original/run \
  --restart_from_day 2025-04-05

Scaffold selection is explicit in config under defaults.scaffold:

basic, allQ, allqd: base chat-tools scaffolds.
qwenbasic, qwenallq: Qwen-named compatibility wrappers.
minimalHarness: external CLI backends such as Codex, Claude Code, and OpenCode.

Outputs

Runs are saved to FSIM_OUTPUT_BASE/<sim_name>/<timestamp>/.

Key files:

| File | Contents |
| --- | --- |
| `config.json` | Fully resolved run configuration |
| `actions.jsonl` | Predictions and resolutions |
| `daily_metrics.csv` | Cumulative metrics per wakeup session |
| `test_daily_metrics.csv` | Same metrics filtered to `source_split == "test"` |
| `matcher_cache.json` | Per-run answer-matcher cache unless shared caching is configured |
| `agents/<agent_id>/` | Per-agent transcripts, logs, and memory |

If FSIM_SIM_MATCHER_CACHE_DIR is set, split: "test" runs reuse <cache_dir>/<matcher_slug>.json and merge new entries back when the run exits. Other splits can opt in with top-level YAML: matcher_cache: {enabled: true, path: null}.
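The "reuse, then merge back on exit" pattern can be sketched as follows. This assumes the shared cache is a flat JSON object, which the text above does not confirm, and merge_into_shared_cache is a hypothetical name. Re-reading the file just before writing reduces, but does not eliminate, lost updates when several runs exit at once.

```python
import json
import os
import tempfile
from pathlib import Path


def merge_into_shared_cache(shared_path: Path, new_entries: dict) -> dict:
    """Merge new_entries into the JSON object at shared_path and rewrite it atomically."""
    existing = json.loads(shared_path.read_text()) if shared_path.exists() else {}
    merged = {**existing, **new_entries}
    shared_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the target
    # so readers never observe a half-written cache file.
    fd, tmp_name = tempfile.mkstemp(dir=shared_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(merged, handle)
    os.replace(tmp_name, shared_path)
    return merged
```

A run would load the shared file at startup, collect new matcher results in memory, and call this once when it exits.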

Notes

timegap_days changes the simulator from daily wakeups to one session every N days. Metrics for active questions are evaluated through the end of that wakeup interval.
OpenForesight configs can prepend train-split questions with prepend_train_resolution_start, prepend_train_resolution_end, and subsample_per_month.
Each OpenForesight question carries a source_split tag so split-specific metrics can be logged without a separate loader path.
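The effect of timegap_days on the schedule can be pictured with a small sketch. It assumes each wakeup covers an inclusive interval of N days starting on the wakeup day and that the last interval is clipped to the simulation end; the simulator's exact boundaries may differ, and wakeup_intervals is a hypothetical helper.

```python
from datetime import date, timedelta


def wakeup_intervals(start: date, end: date, timegap_days: int = 1):
    """Yield (wakeup_day, interval_end) pairs covering start..end inclusive."""
    if timegap_days < 1:
        raise ValueError("timegap_days must be >= 1")
    day = start
    while day <= end:
        # The final interval is clipped so it never runs past the simulation end.
        interval_end = min(day + timedelta(days=timegap_days - 1), end)
        yield day, interval_end
        day += timedelta(days=timegap_days)


for wakeup, interval_end in wakeup_intervals(date(2026, 1, 1), date(2026, 1, 10), timegap_days=7):
    print(wakeup.isoformat(), "->", interval_end.isoformat())
```

With timegap_days=1 every interval starts and ends on the same day, which corresponds to the default daily wakeups.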
