This repository is the official benchmark surface for SolverForge comparison work. It groups benchmarks by SolverForge solve shape and runs every problem through one shared framework: the framework owns the CLI, TOML configuration, benchmark registry, run matrix, timing, overshoot calculation, watchdog containment, result rows, CSV writing, production logging, solver stdout/stderr capture, and optional PostgreSQL persistence. Problem packages are adapters: they load/select cases, create solver callables, validate/evaluate returned solutions, and expose native fields.
list-variable/cvrp/ is the canonical CVRP comparison imported from ~/hack/cvrp_solver_comparison. It compares commercially usable Python solver integrations, native VROOM, rustvrp, Timefold, and the retained SolverForge list-variable runtime.
scalar-variable/employee-scheduling/ is the nurse rostering benchmark. It uses the bundled INRC-II TXT corpus and compares solverforge, timefold, and ortools on nurse-to-shift assignments.
Create the repository virtualenv and install every benchmark dependency into it:
python3.14 -m venv .venv
. .venv/bin/activate
pip install -e .
The root Makefile uses this same .venv for CVRP, employee scheduling,
normalization, and nightly runs. make install-python-deps creates or refreshes
it before benchmark builds.
Run CVRP validation and smoke benchmarks from the repository root:
make validate-cvrp
make bench-cvrp-quick
make bench-cvrp-solverforge-quick
bench-cvrp-quick uses all registered CVRP solvers on three instances at 1
and 10 seconds. bench-cvrp-solverforge-quick keeps the SolverForge-only
development smoke path.
Run the full CVRP benchmark:
make bench-cvrp
The full CVRP path installs Python dependencies into the root .venv, builds
the Timefold fat JAR, builds native OR-Tools, VROOM, and rustvrp binaries, and
builds the local solverforge_cvrp Python extension into that same environment
before running the shared benchmark harness. Generated CSV files are benchmark
evidence artifacts; commit them only when they are intentional result records.
The CVRP benchmark code under list-variable/cvrp/ intentionally tracks the
source checkout at ~/hack/cvrp_solver_comparison. Keep CVRP solver behavior
source-identical there; put cross-category reporting, database-loading behavior,
and shared execution policy at the repository level instead.
Run local INRC-II reference validation:
make validate-employee-scheduling
make validate-employee-model-parity
validate-employee-model-parity is not a benchmark run. It checks that the
employee-scheduling adapters encode the same mathematical model contract:
optimal/minimum coverage slot generation, hard feasibility clauses, candidate
domains, and soft objective weights. It also re-validates bundled reference
solution costs through the shared Python validator.
Build employee-scheduling native integrations without running a benchmark:
make build-employee-scheduling
Run the quick and canonical employee-scheduling benchmarks:
make bench-employee-scheduling-quick
make bench-employee-scheduling
The quick target runs n005w4 for solverforge, timefold, and ortools
at 1 and 10 seconds.
The canonical target uses the canonical group in
scalar-variable/employee-scheduling/data/inrc2/manifest.json.
The root entrypoint runs the shared framework with problem-specific specs:
PYTHONPATH=src:list-variable/cvrp/src:scalar-variable/employee-scheduling/src \
.venv/bin/python3 scripts/run_benchmark.py cvrp --run-kind quick --num-instances 3 --time-limits 1 10
PYTHONPATH=src:list-variable/cvrp/src:scalar-variable/employee-scheduling/src \
.venv/bin/python3 scripts/run_benchmark.py employee-scheduling --run-kind quick --datasets n005w4 --time-limits 1 10
Shared CLI options include:
--config CONFIG
--solver SOLVER
--time-limits SECONDS...
--wall-time-tolerance FLOAT
--watchdog-multiplier FLOAT
--watchdog-grace-seconds FLOAT
--output PATH
--run-kind quick|candidate|tag
--nightly | --no-nightly
--release-tag TAG
--save-postgres | --no-save-postgres
--postgres-url URL
--log-level LEVEL
--log-dir PATH
--log-file PATH
--show-solver-output | --no-show-solver-output
--capture-solver-output | --no-capture-solver-output
cvrp adds --num-instances. employee-scheduling adds --dataset-set and
--datasets.
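The paired --flag | --no-flag options above map naturally onto argparse's BooleanOptionalAction. The following is a minimal sketch of a few of the shared options, not the harness's actual parser: build_parser is a hypothetical name, and the float type for --time-limits and the None defaults (meaning "not passed on the command line") are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering a subset of the shared options."""
    parser = argparse.ArgumentParser(prog="run_benchmark.py")
    # Assumption: time limits are parsed as floats; the real type may differ.
    parser.add_argument("--time-limits", nargs="+", type=float, metavar="SECONDS")
    parser.add_argument("--run-kind", choices=["quick", "candidate", "tag"])
    parser.add_argument("--release-tag", metavar="TAG")
    # BooleanOptionalAction generates both --nightly and --no-nightly.
    # A default of None lets later code tell "not passed" from "passed as false".
    parser.add_argument("--nightly", action=argparse.BooleanOptionalAction, default=None)
    parser.add_argument(
        "--save-postgres", action=argparse.BooleanOptionalAction, default=None
    )
    return parser


args = build_parser().parse_args(
    ["--run-kind", "quick", "--time-limits", "1", "10", "--no-save-postgres"]
)
print(args.run_kind, args.time_limits, args.nightly, args.save_postgres)
```

Keeping the defaults at None is one way to let TOML values apply only when the matching option was not given explicitly.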
The benchmark budget and watchdog are separate. The requested time limit is
passed to the solver and measured with wall-clock timing around
solver(instance, time_limit). If the solver returns after the nominal budget
but before the watchdog, the framework preserves the solution, validates it,
records actual_time_seconds, and reports overshoot:
overshoot_seconds = max(0, actual_time_seconds - time_limit_seconds)
overshoot_ratio = overshoot_seconds / time_limit_seconds
wall_time_over_limit = actual_time_seconds > time_limit_seconds * 1.1
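The three formulas above can be sketched as one small function. This is an illustration rather than the framework's code: overshoot_metrics is a hypothetical name, and treating the 1.1 factor as 1 + a tolerance of 0.1 (the presumed role of --wall-time-tolerance) is an assumption.

```python
def overshoot_metrics(actual_time_seconds, time_limit_seconds, wall_time_tolerance=0.1):
    """Compute the overshoot fields from a measured wall-clock time.

    Assumption: wall_time_tolerance=0.1 reproduces the 1.1 factor in the text.
    """
    overshoot_seconds = max(0.0, actual_time_seconds - time_limit_seconds)
    return {
        "overshoot_seconds": overshoot_seconds,
        "overshoot_ratio": overshoot_seconds / time_limit_seconds,
        "wall_time_over_limit": (
            actual_time_seconds > time_limit_seconds * (1.0 + wall_time_tolerance)
        ),
    }


# A 10 s budget finished in 10.5 s: overshoot is recorded but within tolerance.
print(overshoot_metrics(10.5, 10.0))
# A 10 s budget finished in 12 s: overshoot is recorded and flagged.
print(overshoot_metrics(12.0, 10.0))
```

A solver that finishes early gets an overshoot of zero instead of a negative value, because of the max(0, ...) clamp.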
The watchdog exists only for runaway containment. By default it is
max(time_limit * 1.25, time_limit + 5), configurable with
--watchdog-multiplier and --watchdog-grace-seconds. Only watchdog-killed
invocations lose the returned solution because the child process was forcibly
terminated.
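As a worked example of the default watchdog formula, the sketch below evaluates it for a few budgets. The function name watchdog_limit_seconds is hypothetical; the defaults 1.25 and 5 are the ones stated above.

```python
def watchdog_limit_seconds(time_limit, multiplier=1.25, grace_seconds=5.0):
    """Return the kill deadline: the larger of the scaled and the padded budget."""
    return max(time_limit * multiplier, time_limit + grace_seconds)


for limit in (1, 10, 60):
    print(limit, watchdog_limit_seconds(limit))
# 1 6.0
# 10 15.0
# 60 75.0
```

With the defaults, the grace term decides the deadline for budgets below 20 seconds and the multiplier decides it above, so short smoke runs still get a few seconds of slack before being killed.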
Every benchmark run writes a run log and, by default, per-solver stdout/stderr logs. The default paths are:
logs/<benchmark>_<run_stamp>/<benchmark>_<run_stamp>.log
logs/<benchmark>_<run_stamp>/solvers/<instance>__<solver>__<time_limit>s.stdout.log
logs/<benchmark>_<run_stamp>/solvers/<instance>__<solver>__<time_limit>s.stderr.log
When --log-dir PATH is provided, PATH is treated as a parent directory and
the framework creates PATH/<benchmark>_<run_stamp>/.... When --log-file PATH
is provided, the run log uses that exact file and solver captures are written
under PATH's parent in a <log_stem>_<benchmark>_<run_stamp>/solvers/
directory.
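The path rules above can be sketched with pathlib. This is an illustration under stated assumptions, not the framework's implementation: resolve_log_paths is a hypothetical helper, the run stamp is passed in as an opaque string because its format is not described here, and letting --log-file take precedence when both options are given is a guess.

```python
from pathlib import Path


def resolve_log_paths(benchmark, run_stamp, log_dir=None, log_file=None):
    """Return (run_log_path, solver_capture_dir) following the rules above."""
    run_name = f"{benchmark}_{run_stamp}"
    if log_file is not None:
        # The run log uses the exact file; captures go next to it.
        log_file = Path(log_file)
        solver_dir = log_file.parent / f"{log_file.stem}_{run_name}" / "solvers"
        return log_file, solver_dir
    # --log-dir is a parent directory; the default parent is logs/.
    parent = Path(log_dir) if log_dir is not None else Path("logs")
    run_dir = parent / run_name
    return run_dir / f"{run_name}.log", run_dir / "solvers"


for kwargs in ({}, {"log_dir": "/tmp/out"}, {"log_file": "/tmp/out/run.log"}):
    run_log, solver_dir = resolve_log_paths("cvrp", "STAMP", **kwargs)
    print(run_log.as_posix(), solver_dir.as_posix())
```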
Solver output is mirrored to the benchmark console and persisted to files. Use
--no-show-solver-output to keep the console quieter while still capturing
files, or --no-capture-solver-output to disable per-solver output files.
Solver exceptions are caught per solver/case/time-limit and become result rows
with run_error; full tracebacks are kept in the stderr/run logs. Fatal output
integrity errors, such as CSV or PostgreSQL write failures, still fail the run.
The unified harness can read benchmark settings from a TOML file:
PYTHONPATH=src:list-variable/cvrp/src:scalar-variable/employee-scheduling/src \
.venv/bin/python3 scripts/run_benchmark.py --config benchmark.example.toml
The config file may select the benchmark, solvers, time limits, run catalog kind, nightly flag, output path, watchdog settings, benchmark-specific filters, logging settings, and PostgreSQL persistence:
benchmark = "cvrp"
solver = ["pyvrp"]
time_limits = [1]
run_kind = "quick"
nightly = false
[postgres]
save = false
url = "postgresql://postgres@localhost/solverforge_bench"
[logging]
level = "INFO"
show_solver_output = true
capture_solver_output = true
[benchmarks.cvrp]
num_instances = 3
[benchmarks.employee-scheduling]
dataset_set = "quick"
datasets = ["n005w4"]
Command-line options override TOML values. release_tag qualifies only the
effective run_kind = "tag" catalog. If a tag-oriented TOML file is run with a
CLI override such as --run-kind quick or --run-kind candidate, the TOML tag
does not carry into the overridden run; an explicit CLI --release-tag with a
non-tag run kind is rejected. Make targets accept the same file through
BENCH_CONFIG:
make bench-cvrp-quick BENCH_CONFIG=benchmark.example.toml
Set nightly = true for cron-driven runs that should be kept distinct from
normal runs with the same run_kind. Set [postgres].save = true or pass
--save-postgres to persist with the configured URL. A TOML PostgreSQL URL by
itself does not enable persistence; an explicit CLI --postgres-url does.
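The override rules for release_tag and PostgreSQL persistence can be sketched as a small resolution function. Everything here is illustrative: resolve_run_settings is a hypothetical name, the cli dictionary is assumed to hold only options the user actually passed, release_tag is assumed to be a top-level TOML key, and letting an explicit --no-save-postgres win over --postgres-url is a guess.

```python
def resolve_run_settings(toml_cfg, cli):
    """Merge parsed TOML values with explicitly passed CLI options."""
    run_kind = cli.get("run_kind", toml_cfg.get("run_kind"))

    if "release_tag" in cli:
        # An explicit CLI tag is only valid for the tag catalog.
        if run_kind != "tag":
            raise ValueError("--release-tag requires --run-kind tag")
        release_tag = cli["release_tag"]
    else:
        # A TOML tag does not carry into an overridden non-tag run.
        release_tag = toml_cfg.get("release_tag") if run_kind == "tag" else None

    postgres = toml_cfg.get("postgres", {})
    if "save_postgres" in cli:
        save = cli["save_postgres"]
    else:
        # A TOML URL alone does not enable saving; a CLI URL does.
        save = bool(postgres.get("save", False)) or "postgres_url" in cli
    url = cli.get("postgres_url", postgres.get("url"))

    return {
        "run_kind": run_kind,
        "release_tag": release_tag,
        "save_postgres": save,
        "postgres_url": url,
    }


tag_config = {
    "run_kind": "tag",
    "release_tag": "v0.11.1",
    "postgres": {"url": "postgresql://postgres@localhost/solverforge_bench"},
}
print(resolve_run_settings(tag_config, {"run_kind": "quick"}))
```

In the usage line, the quick override drops the TOML tag and persistence stays off, because the TOML file supplies a URL but never sets save = true.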
Use benchmark.nightly.example.toml as the cron-oriented template for the
combined nightly job; the nightly Make target runs the same root harness once
per benchmark with that config.
PostgreSQL persistence uses plain SQL migrations in migrations/ with
SQLx-compatible file naming. SQLx is the Rust-side migration convention here
because it keeps schema changes as checked-in SQL, supports sqlx migrate run,
and can embed the same migrations in Rust code with sqlx::migrate!() when a
Rust service owns startup.
The local default database URL is:
postgresql://postgres@localhost/solverforge_bench
Set DATABASE_URL or BENCH_DATABASE_URL to point the Makefile database
targets at another benchmark warehouse.
Prepare the database and apply migrations:
make db-check
make db-create
make db-migrate
db-migrate requires sqlx-cli on PATH:
cargo install sqlx-cli --no-default-features --features postgres
Save a benchmark run to PostgreSQL while still writing the normal CSV:
PYTHONPATH=src:list-variable/cvrp/src:scalar-variable/employee-scheduling/src \
.venv/bin/python3 scripts/run_benchmark.py cvrp \
--run-kind quick \
--num-instances 3 \
--time-limits 1 10 \
--save-postgres
Run kinds are:
quick      smoke-scale runs
candidate  candidate comparison runs before release tagging
tag        SolverForge release snapshots; requires --release-tag
nightly is stored separately from run_kind; a run may be nightly and still be
quick, candidate, or tag.
The normal benchmark Make targets are CSV-only. Use the -db variants to apply
migrations and save the same run to PostgreSQL as well:
make bench-cvrp-quick
make bench-cvrp-quick-db
make bench-employee-scheduling-quick
make bench-employee-scheduling-quick-db
make bench-cvrp-db BENCH_ARGS="--run-kind tag --release-tag v0.11.1"
make bench-cvrp-db BENCH_ARGS="--run-kind quick --nightly"
make bench-nightly-db
make bench-nightly-db is the cron entrypoint. It builds both benchmark stacks,
applies migrations once, then calls scripts/run_benchmark.py directly for
CVRP and employee scheduling. By default it uses
benchmark.nightly.example.toml; set BENCH_CONFIG for a different nightly
config, or append explicit child harness overrides with NIGHTLY_ARGS. It does
not pass --solver, so each benchmark uses its full default solver set.
Benchmark runs write one global snake_case CSV schema directly. Native
problem fields are stable optional columns, for example nurses, weeks,
validator_model_delta, and score_drift.
PostgreSQL stores run-level catalog data in benchmark_runs, one solver-version
row per solver involved in the run in benchmark_solver_versions, and one row
per solver/case/time-limit result in benchmark_results. Each result references
the corresponding solver-version row with a foreign key. The live persistence
path uses a Polars DataFrame ETL boundary fed by in-memory BenchmarkRow
objects; generated CSV files are evidence artifacts, not the PostgreSQL source
of truth. Core benchmark columns are typed columns. Benchmark-specific native
fields are preserved in native_fields, and the complete emitted row is
preserved in row_payload. Native solver versions are recorded from the built
executables, not from Makefile default variables.
Runs have an independent nightly flag and are catalogued as running,
completed, or failed; each completed result row is persisted immediately, so
interrupted runs keep their partial rows but are excluded from latest-run display
views.
Run logs are linked from benchmark_runs.log_path; solver output logs are
linked from benchmark_results.solver_stdout_path and
benchmark_results.solver_stderr_path.
PostgreSQL is the warehouse source of truth. Display consumers should read
benchmark_result_facts, latest_benchmark_runs, or
latest_benchmark_result_facts instead of reconstructing the run/result join
themselves. The latest-run views keep normal and nightly runs separate.
For file-based loading, scripts/normalize_results.py normalizes generated
global CSV artifacts through Polars:
make normalize-results INPUT=path/to/benchmark.csv OUTPUT=results/normalized.csv
make normalize-results INPUT=path/to/benchmark.csv OUTPUT=results/normalized.ndjson ARGS="--format ndjson"
If a filtered run produces no result rows, the CSV remains a valid schema-only artifact and normalizes to a schema-only output file.
The normalized rows use these columns:
run_id, benchmark_name, benchmark_category, dataset, dataset_set, instance,
instance_size, solver, solver_version, time_limit_seconds, actual_time_seconds,
overshoot_seconds, overshoot_ratio, wall_time_over_limit,
watchdog_limit_seconds, watchdog_killed, run_error, solver_stdout_path,
solver_stderr_path, hard_feasible, cost, reported_cost, fresh_cost,
reference_cost, quality_ratio, validation_error, solution_artifact, nurses,
weeks, validator_model_delta, score_drift, source_file
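To illustrate what a schema-only artifact looks like, the sketch below writes the normalized columns with no rows. It uses the standard-library csv module for brevity, whereas scripts/normalize_results.py goes through Polars; write_rows and the in-memory buffer are hypothetical, and only the column names come from the list above.

```python
import csv
import io

NORMALIZED_COLUMNS = (
    "run_id", "benchmark_name", "benchmark_category", "dataset", "dataset_set",
    "instance", "instance_size", "solver", "solver_version",
    "time_limit_seconds", "actual_time_seconds", "overshoot_seconds",
    "overshoot_ratio", "wall_time_over_limit", "watchdog_limit_seconds",
    "watchdog_killed", "run_error", "solver_stdout_path", "solver_stderr_path",
    "hard_feasible", "cost", "reported_cost", "fresh_cost", "reference_cost",
    "quality_ratio", "validation_error", "solution_artifact", "nurses", "weeks",
    "validator_model_delta", "score_drift", "source_file",
)


def write_rows(rows, stream):
    """Write a header plus any rows; optional native fields default to empty."""
    writer = csv.DictWriter(stream, fieldnames=NORMALIZED_COLUMNS, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


buffer = io.StringIO()
write_rows([], buffer)  # a filtered run that produced no result rows
lines = buffer.getvalue().splitlines()
print(len(lines), lines[0].split(",")[:3])
```

With an empty row list the output is a single header line, so downstream loaders still see every column name even when a run was filtered down to nothing.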