solverforge-bench

This repository is the official benchmark surface for SolverForge comparison work. It groups benchmarks by SolverForge solve shape and runs every problem through one shared framework. The framework owns the CLI, TOML configuration, benchmark registry, run matrix, timing, overshoot calculation, watchdog containment, result rows, CSV writing, production logging, solver stdout/stderr capture, and optional PostgreSQL persistence. Problem packages are adapters only: they load and select cases, create solver callables, validate and evaluate returned solutions, and expose native fields.

Benchmarks

list-variable/cvrp/ is the canonical CVRP comparison imported from ~/hack/cvrp_solver_comparison. It compares commercially usable Python solver integrations, native VROOM, rustvrp, Timefold, and the retained SolverForge list-variable runtime.
scalar-variable/employee-scheduling/ is the nurse rostering benchmark. It uses the bundled INRC-II TXT corpus and compares solverforge, timefold, and ortools on nurse-to-shift assignments.

Setup

Create the repository virtualenv and install every benchmark dependency into it:

python3.14 -m venv .venv
. .venv/bin/activate
pip install -e .

The root Makefile uses this same .venv for CVRP, employee scheduling, normalization, and nightly runs. make install-python-deps creates or refreshes it before benchmark builds.

CVRP Commands

Run CVRP validation and smoke benchmarks from the repository root:

make validate-cvrp
make bench-cvrp-quick
make bench-cvrp-solverforge-quick

bench-cvrp-quick uses all registered CVRP solvers on three instances at 1 and 10 seconds. bench-cvrp-solverforge-quick keeps the SolverForge-only development smoke path.

Run the full CVRP benchmark:

make bench-cvrp

The full CVRP path installs Python dependencies into the root .venv, builds the Timefold fat JAR, builds native OR-Tools, VROOM, and rustvrp binaries, and builds the local solverforge_cvrp Python extension into that same environment before running the shared benchmark harness. Generated CSV files are benchmark evidence artifacts; commit them only when they are intentional result records.

The CVRP benchmark code under list-variable/cvrp/ intentionally tracks the source checkout at ~/hack/cvrp_solver_comparison. Keep CVRP solver behavior source-identical there; put cross-category reporting, database-loading behavior, and shared execution policy at the repository level instead.

Employee Scheduling Commands

Run local INRC-II reference validation:

make validate-employee-scheduling
make validate-employee-model-parity

validate-employee-model-parity is not a benchmark run. It checks that the employee-scheduling adapters encode the same mathematical model contract: optimal/minimum coverage slot generation, hard feasibility clauses, candidate domains, and soft objective weights. It also re-validates bundled reference solution costs through the shared Python validator.

Build employee-scheduling native integrations without running a benchmark:

make build-employee-scheduling

Run the quick and canonical employee-scheduling benchmarks:

make bench-employee-scheduling-quick
make bench-employee-scheduling

The quick target runs n005w4 for solverforge, timefold, and ortools at 1 and 10 seconds. The canonical target uses the canonical group in scalar-variable/employee-scheduling/data/inrc2/manifest.json.

Unified Harness

The root entrypoint runs the shared framework with problem-specific specs:

PYTHONPATH=src:list-variable/cvrp/src:scalar-variable/employee-scheduling/src \
  .venv/bin/python3 scripts/run_benchmark.py cvrp --run-kind quick --num-instances 3 --time-limits 1 10

PYTHONPATH=src:list-variable/cvrp/src:scalar-variable/employee-scheduling/src \
  .venv/bin/python3 scripts/run_benchmark.py employee-scheduling --run-kind quick --datasets n005w4 --time-limits 1 10

Shared CLI options include:

--config CONFIG
--solver SOLVER
--time-limits SECONDS...
--wall-time-tolerance FLOAT
--watchdog-multiplier FLOAT
--watchdog-grace-seconds FLOAT
--output PATH
--run-kind quick|candidate|tag
--nightly | --no-nightly
--release-tag TAG
--save-postgres | --no-save-postgres
--postgres-url URL
--log-level LEVEL
--log-dir PATH
--log-file PATH
--show-solver-output | --no-show-solver-output
--capture-solver-output | --no-capture-solver-output

cvrp adds --num-instances. employee-scheduling adds --dataset-set and --datasets.

The benchmark budget and watchdog are separate. The requested time limit is passed to the solver and measured with wall-clock timing around solver(instance, time_limit). If the solver returns after the nominal budget but before the watchdog, the framework preserves the solution, validates it, records actual_time_seconds, and reports overshoot:

overshoot_seconds = max(0, actual_time_seconds - time_limit_seconds)
overshoot_ratio = overshoot_seconds / time_limit_seconds
wall_time_over_limit = actual_time_seconds > time_limit_seconds * 1.1

The watchdog exists only for runaway containment. By default it is max(time_limit * 1.25, time_limit + 5), configurable with --watchdog-multiplier and --watchdog-grace-seconds. A watchdog-killed invocation is the only case in which the solution is lost, because the child process is forcibly terminated before it can return one.
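The budget and watchdog arithmetic above can be sketched in a few lines of Python. This is an illustration rather than the framework's own code: the function names `overshoot_metrics` and `watchdog_limit` and the `tolerance_factor` parameter are hypothetical, and the defaults are copied from the formulas in this section.

```python
def overshoot_metrics(time_limit_seconds, actual_time_seconds, tolerance_factor=1.1):
    """Return the three overshoot fields for one solver invocation.

    tolerance_factor is a hypothetical name for the 1.1 in the formula above.
    """
    overshoot_seconds = max(0.0, actual_time_seconds - time_limit_seconds)
    return {
        "overshoot_seconds": overshoot_seconds,
        "overshoot_ratio": overshoot_seconds / time_limit_seconds,
        "wall_time_over_limit": actual_time_seconds > time_limit_seconds * tolerance_factor,
    }


def watchdog_limit(time_limit_seconds, multiplier=1.25, grace_seconds=5.0):
    """Default watchdog: the larger of the scaled budget and budget plus grace."""
    return max(time_limit_seconds * multiplier, time_limit_seconds + grace_seconds)
```

With these defaults, the grace term dominates for a 10-second budget (watchdog at 15 seconds), while the multiplier dominates for a 100-second budget (watchdog at 125 seconds).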

Logging

Every benchmark run writes a run log and, by default, per-solver stdout/stderr logs. The default paths are:

logs/<benchmark>_<run_stamp>/<benchmark>_<run_stamp>.log
logs/<benchmark>_<run_stamp>/solvers/<instance>__<solver>__<time_limit>s.stdout.log
logs/<benchmark>_<run_stamp>/solvers/<instance>__<solver>__<time_limit>s.stderr.log

When --log-dir PATH is provided, PATH is treated as a parent directory and the framework creates PATH/<benchmark>_<run_stamp>/.... When --log-file PATH is provided, the run log uses that exact file and solver captures are written under PATH's parent in a <log_stem>_<benchmark>_<run_stamp>/solvers/ directory.
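The path rules above can be sketched with pathlib. The helper names `resolve_log_paths` and `solver_capture_paths` are hypothetical, the run stamp is taken as an opaque string because its format is not specified here, and letting --log-file take precedence when both options are given is an assumption.

```python
from pathlib import Path


def resolve_log_paths(benchmark, run_stamp, log_dir=None, log_file=None):
    """Return (run_log, solver_dir) following the layout described above."""
    run_name = f"{benchmark}_{run_stamp}"
    if log_file is not None:
        # Assumption: --log-file wins if both options are supplied.
        run_log = Path(log_file)
        solver_dir = run_log.parent / f"{run_log.stem}_{run_name}" / "solvers"
        return run_log, solver_dir
    # --log-dir is a parent directory; the default parent is logs/.
    parent = Path(log_dir) if log_dir is not None else Path("logs")
    run_dir = parent / run_name
    return run_dir / f"{run_name}.log", run_dir / "solvers"


def solver_capture_paths(solver_dir, instance, solver, time_limit):
    """Return the (stdout, stderr) capture files for one invocation."""
    base = f"{instance}__{solver}__{time_limit}s"
    return solver_dir / f"{base}.stdout.log", solver_dir / f"{base}.stderr.log"
```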

Solver output is mirrored to the benchmark console and persisted to files. Use --no-show-solver-output to keep the console quieter while still capturing files, or --no-capture-solver-output to disable per-solver output files. Solver exceptions are caught per solver/case/time-limit and become result rows with run_error; full tracebacks are kept in the stderr/run logs. Fatal output integrity errors, such as CSV or PostgreSQL write failures, still fail the run.
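The per-invocation error handling described above might look roughly like the following. It is a simplified sketch, not the framework's loop: `run_matrix`, the `log` callback, and the `solution` key are hypothetical, and it omits the child process, watchdog, timing, and validation. Only the `solver(instance, time_limit)` call shape and the `run_error` field come from this document.

```python
import traceback


def run_matrix(solvers, cases, time_limits, log=print):
    """Run every solver/case/time-limit combination, one row per invocation."""
    rows = []
    for case_name, instance in cases.items():
        for solver_name, solver in solvers.items():
            for time_limit in time_limits:
                row = {
                    "instance": case_name,
                    "solver": solver_name,
                    "time_limit_seconds": time_limit,
                    "run_error": None,
                }
                try:
                    row["solution"] = solver(instance, time_limit)
                except Exception as exc:
                    # A failing solver yields a row with run_error set;
                    # the full traceback goes to the log, not the row.
                    row["run_error"] = f"{type(exc).__name__}: {exc}"
                    log(traceback.format_exc())
                rows.append(row)
    return rows
```

Errors raised while writing results are deliberately left outside the try block in this sketch, mirroring the rule that output integrity failures still fail the run.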

TOML Configuration

The unified harness can read benchmark settings from a TOML file:

PYTHONPATH=src:list-variable/cvrp/src:scalar-variable/employee-scheduling/src \
  .venv/bin/python3 scripts/run_benchmark.py --config benchmark.example.toml

The config file may select the benchmark, solvers, time limits, run catalog kind, nightly flag, output path, watchdog settings, benchmark-specific filters, logging settings, and PostgreSQL persistence:

benchmark = "cvrp"
solver = ["pyvrp"]
time_limits = [1]
run_kind = "quick"
nightly = false

[postgres]
save = false
url = "postgresql://postgres@localhost/solverforge_bench"

[logging]
level = "INFO"
show_solver_output = true
capture_solver_output = true

[benchmarks.cvrp]
num_instances = 3

[benchmarks.employee-scheduling]
dataset_set = "quick"
datasets = ["n005w4"]

Command-line options override TOML values. release_tag applies only when the effective run_kind is "tag". If a tag-oriented TOML file is run with a CLI override such as --run-kind quick or --run-kind candidate, the TOML release_tag is dropped for that run rather than carried into it. An explicit CLI --release-tag combined with a non-tag run kind is rejected. Make targets accept the same file through BENCH_CONFIG:

make bench-cvrp-quick BENCH_CONFIG=benchmark.example.toml
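The run_kind and release_tag precedence described above can be expressed as a small function. This is a sketch under assumptions: `resolve_run_catalog` and its parameters are hypothetical names, and falling back to "quick" when neither source sets a run kind is a guess, not documented behavior.

```python
def resolve_run_catalog(toml_cfg, cli_run_kind=None, cli_release_tag=None):
    """Return (run_kind, release_tag) after applying CLI-over-TOML precedence."""
    # Assumption: "quick" is used when neither the CLI nor the TOML sets run_kind.
    run_kind = cli_run_kind or toml_cfg.get("run_kind", "quick")
    if run_kind != "tag":
        if cli_release_tag is not None:
            # An explicit CLI tag with a non-tag run kind is rejected.
            raise ValueError("--release-tag requires run kind 'tag'")
        # A TOML release_tag is dropped for non-tag runs.
        return run_kind, None
    release_tag = cli_release_tag or toml_cfg.get("release_tag")
    if release_tag is None:
        raise ValueError("run kind 'tag' requires a release tag")
    return run_kind, release_tag
```

For example, a TOML file with run_kind = "tag" and release_tag = "v0.11.1" resolves to ("quick", None) when the CLI passes --run-kind quick.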

Set nightly = true for cron-driven runs that should be kept distinct from normal runs with the same run_kind. Set [postgres].save = true or pass --save-postgres to persist with the configured URL. A TOML PostgreSQL URL by itself does not enable persistence; an explicit CLI --postgres-url does. Use benchmark.nightly.example.toml as the cron-oriented template for the combined nightly job; the nightly Make target runs the same root harness once per benchmark with that config.
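The persistence rules above reduce to a short decision. In this sketch `resolve_postgres` is a hypothetical helper, `cli_save` stands for --save-postgres / --no-save-postgres (None when neither is passed), and letting an explicit --no-save-postgres win over --postgres-url is an assumption the text does not settle.

```python
DEFAULT_URL = "postgresql://postgres@localhost/solverforge_bench"


def resolve_postgres(toml_postgres, cli_save=None, cli_url=None):
    """Return (save, url) from the [postgres] table and the CLI flags."""
    url = cli_url or toml_postgres.get("url", DEFAULT_URL)
    if cli_save is not None:
        # Assumption: an explicit save/no-save flag overrides everything else.
        save = cli_save
    elif cli_url is not None:
        # An explicit CLI URL enables persistence on its own.
        save = True
    else:
        # A TOML URL alone does not enable persistence; only save = true does.
        save = bool(toml_postgres.get("save", False))
    return save, url
```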

PostgreSQL Results

PostgreSQL persistence uses plain SQL migrations in migrations/ with SQLx-compatible file naming. SQLx is the Rust-side migration convention here because it keeps schema changes as checked-in SQL, supports sqlx migrate run, and can embed the same migrations in Rust code with sqlx::migrate!() when a Rust service owns startup.

The local default database URL is:

postgresql://postgres@localhost/solverforge_bench

Set DATABASE_URL or BENCH_DATABASE_URL to point the Makefile database targets at another benchmark warehouse.

Prepare the database and apply migrations:

make db-check
make db-create
make db-migrate

db-migrate requires sqlx-cli on PATH:

cargo install sqlx-cli --no-default-features --features postgres

Save a benchmark run to PostgreSQL while still writing the normal CSV:

PYTHONPATH=src:list-variable/cvrp/src:scalar-variable/employee-scheduling/src \
  .venv/bin/python3 scripts/run_benchmark.py cvrp \
    --run-kind quick \
    --num-instances 3 \
    --time-limits 1 10 \
    --save-postgres

Run kinds are:

quick      smoke-scale runs
candidate  candidate comparison runs before release tagging
tag        SolverForge release snapshots, requires --release-tag

nightly is stored separately from run_kind; a run may be nightly and still be quick, candidate, or tag.

The normal benchmark Make targets are CSV-only. Use the -db variants to apply migrations and save the same run to PostgreSQL as well:

make bench-cvrp-quick
make bench-cvrp-quick-db
make bench-employee-scheduling-quick
make bench-employee-scheduling-quick-db
make bench-cvrp-db BENCH_ARGS="--run-kind tag --release-tag v0.11.1"
make bench-cvrp-db BENCH_ARGS="--run-kind quick --nightly"
make bench-nightly-db

make bench-nightly-db is the cron entrypoint. It builds both benchmark stacks, applies migrations once, then calls scripts/run_benchmark.py directly for CVRP and employee scheduling. By default it uses benchmark.nightly.example.toml; set BENCH_CONFIG for a different nightly config, or append explicit child harness overrides with NIGHTLY_ARGS. It does not pass --solver, so each benchmark uses its full default solver set.

Result Schema

Benchmark runs write a single global snake_case CSV schema directly. Native problem fields appear as stable optional columns, for example nurses, weeks, validator_model_delta, and score_drift.

PostgreSQL stores run-level catalog data in benchmark_runs, one solver-version row per solver involved in the run in benchmark_solver_versions, and one row per solver/case/time-limit result in benchmark_results. Each result references the corresponding solver-version row with a foreign key.

The live persistence path uses a Polars DataFrame ETL boundary fed by in-memory BenchmarkRow objects; generated CSV files are evidence artifacts, not the PostgreSQL source of truth. Core benchmark columns are typed columns. Benchmark-specific native fields are preserved in native_fields, and the complete emitted row is preserved in row_payload. Native solver versions are recorded from the built executables, not from Makefile default variables.

Runs have an independent nightly flag and are catalogued as running, completed, or failed; each completed result row is persisted immediately, so interrupted runs keep their partial rows but are excluded from latest-run display views. Run logs are linked from benchmark_runs.log_path; solver output logs are linked from benchmark_results.solver_stdout_path and benchmark_results.solver_stderr_path.

PostgreSQL is the warehouse source of truth. Display consumers should read benchmark_result_facts, latest_benchmark_runs, or latest_benchmark_result_facts instead of reconstructing the run/result join themselves. The latest-run views keep normal and nightly runs separate.

For file-based loading, scripts/normalize_results.py normalizes generated global CSV artifacts through Polars:

make normalize-results INPUT=path/to/benchmark.csv OUTPUT=results/normalized.csv
make normalize-results INPUT=path/to/benchmark.csv OUTPUT=results/normalized.ndjson ARGS="--format ndjson"

If a filtered run produces no result rows, the CSV remains a valid schema-only artifact and normalizes to a schema-only output file.

The normalized rows use these columns:

run_id, benchmark_name, benchmark_category, dataset, dataset_set, instance,
instance_size, solver, solver_version, time_limit_seconds, actual_time_seconds,
overshoot_seconds, overshoot_ratio, wall_time_over_limit,
watchdog_limit_seconds, watchdog_killed, run_error, solver_stdout_path,
solver_stderr_path, hard_feasible, cost, reported_cost, fresh_cost,
reference_cost, quality_ratio, validation_error, solution_artifact, nurses,
weeks, validator_model_delta, score_drift, source_file
