solverforge-bench

This repository is the official benchmark surface for SolverForge comparison work. It groups benchmarks by SolverForge solve shape and runs every problem through one shared framework. The framework owns the CLI, TOML configuration, benchmark registry, run matrix, timing, overshoot calculation, watchdog containment, result rows, CSV writing, production logging, solver stdout/stderr capture, and optional PostgreSQL persistence. Problem packages are adapters: they load and select cases, create solver callables, validate and evaluate returned solutions, and expose native fields.
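As a rough illustration of the adapter role described above, the following Python sketch models an adapter as a `typing.Protocol`. Every name here (`ProblemAdapter`, `load_cases`, `make_solver`, `evaluate`, `native_fields`, `ToyAdapter`) is hypothetical and chosen only for this example; the only detail taken from this README is the `solver(instance, time_limit)` call shape. The real interfaces live in the framework source.

```python
from typing import Any, Callable, Iterable, Mapping, Protocol, runtime_checkable

# solver(instance, time_limit) is the call shape this README describes.
Solver = Callable[[Any, float], Any]


@runtime_checkable
class ProblemAdapter(Protocol):
    """Hypothetical adapter surface; method names are assumptions."""

    def load_cases(self) -> Iterable[Any]: ...
    def make_solver(self, name: str) -> Solver: ...
    def evaluate(self, instance: Any, solution: Any) -> Mapping[str, Any]: ...
    def native_fields(self, instance: Any) -> Mapping[str, Any]: ...


class ToyAdapter:
    """Toy problem: 'solve' a list of numbers by sorting it."""

    def load_cases(self):
        return [[3, 1, 2]]

    def make_solver(self, name):
        # The toy solver ignores the time limit.
        return lambda instance, time_limit: sorted(instance)

    def evaluate(self, instance, solution):
        return {"hard_feasible": solution == sorted(instance), "cost": 0.0}

    def native_fields(self, instance):
        return {"instance_size": len(instance)}


adapter = ToyAdapter()
for case in adapter.load_cases():
    solution = adapter.make_solver("toy")(case, 1.0)
    print(adapter.evaluate(case, solution), adapter.native_fields(case))
```

The point of the sketch is the division of labor: the adapter knows the problem, while timing, containment, and result writing stay in the shared framework.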

Benchmarks

  • list-variable/cvrp/ is the canonical CVRP comparison imported from ~/hack/cvrp_solver_comparison. It compares commercially usable Python solver integrations, native VROOM, rustvrp, Timefold, and the retained SolverForge list-variable runtime.
  • scalar-variable/employee-scheduling/ is the nurse rostering benchmark. It uses the bundled INRC-II TXT corpus and compares solverforge, timefold, and ortools on nurse-to-shift assignments.

Setup

Create the repository virtualenv and install every benchmark dependency into it:

python3.14 -m venv .venv
. .venv/bin/activate
pip install -e .

The root Makefile uses this same .venv for CVRP, employee scheduling, normalization, and nightly runs. make install-python-deps creates or refreshes it before benchmark builds.

CVRP Commands

Run CVRP validation and smoke benchmarks from the repository root:

make validate-cvrp
make bench-cvrp-quick
make bench-cvrp-solverforge-quick

bench-cvrp-quick runs all registered CVRP solvers on three instances with 1- and 10-second time limits. bench-cvrp-solverforge-quick keeps the SolverForge-only development smoke path.

Run the full CVRP benchmark:

make bench-cvrp

The full CVRP path installs Python dependencies into the root .venv, builds the Timefold fat JAR, builds native OR-Tools, VROOM, and rustvrp binaries, and builds the local solverforge_cvrp Python extension into that same environment before running the shared benchmark harness. Generated CSV files are benchmark evidence artifacts; commit them only when they are intentional result records.

The CVRP benchmark code under list-variable/cvrp/ intentionally tracks the source checkout at ~/hack/cvrp_solver_comparison. Keep CVRP solver behavior source-identical there; put cross-category reporting, database-loading behavior, and shared execution policy at the repository level instead.

Employee Scheduling Commands

Run local INRC-II reference validation:

make validate-employee-scheduling
make validate-employee-model-parity

validate-employee-model-parity is not a benchmark run. It checks that the employee-scheduling adapters encode the same mathematical model contract: optimal/minimum coverage slot generation, hard feasibility clauses, candidate domains, and soft objective weights. It also re-validates bundled reference solution costs through the shared Python validator.

Build employee-scheduling native integrations without running a benchmark:

make build-employee-scheduling

Run the quick and canonical employee-scheduling benchmarks:

make bench-employee-scheduling-quick
make bench-employee-scheduling

The quick target runs n005w4 for solverforge, timefold, and ortools with 1- and 10-second time limits. The canonical target uses the canonical group in scalar-variable/employee-scheduling/data/inrc2/manifest.json.

Unified Harness

The root entrypoint runs the shared framework with problem-specific specs:

PYTHONPATH=src:list-variable/cvrp/src:scalar-variable/employee-scheduling/src \
  .venv/bin/python3 scripts/run_benchmark.py cvrp --run-kind quick --num-instances 3 --time-limits 1 10

PYTHONPATH=src:list-variable/cvrp/src:scalar-variable/employee-scheduling/src \
  .venv/bin/python3 scripts/run_benchmark.py employee-scheduling --run-kind quick --datasets n005w4 --time-limits 1 10

Shared CLI options include:

--config CONFIG
--solver SOLVER
--time-limits SECONDS...
--wall-time-tolerance FLOAT
--watchdog-multiplier FLOAT
--watchdog-grace-seconds FLOAT
--output PATH
--run-kind quick|candidate|tag
--nightly | --no-nightly
--release-tag TAG
--save-postgres | --no-save-postgres
--postgres-url URL
--log-level LEVEL
--log-dir PATH
--log-file PATH
--show-solver-output | --no-show-solver-output
--capture-solver-output | --no-capture-solver-output

cvrp adds --num-instances. employee-scheduling adds --dataset-set and --datasets.

The benchmark budget and watchdog are separate. The requested time limit is passed to the solver and measured with wall-clock timing around solver(instance, time_limit). If the solver returns after the nominal budget but before the watchdog, the framework preserves the solution, validates it, records actual_time_seconds, and reports overshoot:

overshoot_seconds = max(0, actual_time_seconds - time_limit_seconds)
overshoot_ratio = overshoot_seconds / time_limit_seconds
wall_time_over_limit = actual_time_seconds > time_limit_seconds * 1.1
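
The three formulas above can be written as a small Python function. This is a minimal sketch, not the framework's code: the `Overshoot` class and `compute_overshoot` function are names invented for this example, and treating the 1.1 factor as `1 + wall_time_tolerance` with a default of 0.1 is an assumption about how --wall-time-tolerance maps onto the formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Overshoot:
    overshoot_seconds: float
    overshoot_ratio: float
    wall_time_over_limit: bool


def compute_overshoot(actual_time_seconds, time_limit_seconds, wall_time_tolerance=0.1):
    """Apply the overshoot formulas; the tolerance default is an assumption."""
    if time_limit_seconds <= 0:
        raise ValueError("time_limit_seconds must be positive")
    overshoot_seconds = max(0.0, actual_time_seconds - time_limit_seconds)
    return Overshoot(
        overshoot_seconds=overshoot_seconds,
        overshoot_ratio=overshoot_seconds / time_limit_seconds,
        # With the assumed default this is time_limit_seconds * 1.1.
        wall_time_over_limit=actual_time_seconds
        > time_limit_seconds * (1.0 + wall_time_tolerance),
    )


# A 10-second budget that returns after 10.5 s overshoots by 0.5 s (ratio 0.05)
# but stays inside the 10% tolerance; returning after 12 s does not.
print(compute_overshoot(10.5, 10))
print(compute_overshoot(12.0, 10))
```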

The watchdog exists only for runaway containment. By default it is max(time_limit * 1.25, time_limit + 5), configurable with --watchdog-multiplier and --watchdog-grace-seconds. Only watchdog-killed invocations lose their solution, because the child process is forcibly terminated before it can return one.
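
A short worked example of the default watchdog formula, as a Python sketch. The function name `watchdog_limit_seconds` is chosen to match the result column of the same name, but the function itself is illustrative; the parameter names are assumptions that mirror the two CLI options.

```python
def watchdog_limit_seconds(time_limit_seconds, multiplier=1.25, grace_seconds=5.0):
    """Default watchdog: the larger of the scaled limit and the limit plus grace."""
    return max(time_limit_seconds * multiplier, time_limit_seconds + grace_seconds)


for limit in (1, 10, 60):
    print(limit, watchdog_limit_seconds(limit))
# 1  -> 6.0   (grace term wins)
# 10 -> 15.0  (grace term wins)
# 60 -> 75.0  (multiplier term wins)
```

With the defaults, the two terms are equal at a 20-second limit: the grace term dominates for shorter budgets and the multiplier for longer ones.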

Logging

Every benchmark run writes a run log and, by default, per-solver stdout/stderr logs. The default paths are:

logs/<benchmark>_<run_stamp>/<benchmark>_<run_stamp>.log
logs/<benchmark>_<run_stamp>/solvers/<instance>__<solver>__<time_limit>s.stdout.log
logs/<benchmark>_<run_stamp>/solvers/<instance>__<solver>__<time_limit>s.stderr.log

When --log-dir PATH is provided, PATH is treated as a parent directory and the framework creates PATH/<benchmark>_<run_stamp>/.... When --log-file PATH is provided, the run log uses that exact file and solver captures are written under PATH's parent in a <log_stem>_<benchmark>_<run_stamp>/solvers/ directory.
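
The path rules above can be sketched with `pathlib`. This is an illustration only: `resolve_log_paths` and `solver_capture_paths` are hypothetical helpers, the run stamp value is a placeholder (its real format is not specified here), and giving `log_file` precedence when both options are set is an assumption.

```python
from pathlib import Path


def resolve_log_paths(benchmark, run_stamp, log_dir=None, log_file=None):
    """Return (run_log, solver_dir) following the rules described above."""
    run_name = f"{benchmark}_{run_stamp}"
    if log_file is not None:
        # Exact run log file; solver captures go next to it.
        run_log = Path(log_file)
        solver_dir = run_log.parent / f"{run_log.stem}_{run_name}" / "solvers"
    else:
        # log_dir is a parent directory; "logs" is the default parent.
        run_dir = Path(log_dir if log_dir is not None else "logs") / run_name
        run_log = run_dir / f"{run_name}.log"
        solver_dir = run_dir / "solvers"
    return run_log, solver_dir


def solver_capture_paths(solver_dir, instance, solver, time_limit):
    """Return the (stdout, stderr) capture paths for one solver invocation."""
    stem = f"{instance}__{solver}__{time_limit}s"
    return solver_dir / f"{stem}.stdout.log", solver_dir / f"{stem}.stderr.log"


run_log, solver_dir = resolve_log_paths("cvrp", "STAMP")  # "STAMP" is a placeholder
print(run_log.as_posix())     # logs/cvrp_STAMP/cvrp_STAMP.log
print(solver_dir.as_posix())  # logs/cvrp_STAMP/solvers
```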

Solver output is mirrored to the benchmark console and persisted to files. Use --no-show-solver-output to keep the console quieter while still capturing files, or --no-capture-solver-output to disable per-solver output files. Solver exceptions are caught per solver/case/time-limit and become result rows with run_error; full tracebacks are kept in the stderr/run logs. Fatal output integrity errors, such as CSV or PostgreSQL write failures, still fail the run.
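
The per-cell exception containment described above follows a common pattern, sketched here in Python. `run_contained` is a hypothetical helper, not the framework's function. The `run_error` and `actual_time_seconds` keys match result columns named in this README; the `traceback` key is an assumption made for the example, since the framework keeps tracebacks in the logs instead.

```python
import time
import traceback


def run_contained(solver, instance, time_limit_seconds):
    """Run one solver/case/time-limit cell and always return a row dict."""
    row = {"time_limit_seconds": time_limit_seconds, "run_error": None}
    solution = None
    started = time.perf_counter()
    try:
        solution = solver(instance, time_limit_seconds)
    except Exception as exc:
        # Contain the failure to this cell so the rest of the matrix still runs.
        row["run_error"] = f"{type(exc).__name__}: {exc}"
        row["traceback"] = traceback.format_exc()
    row["actual_time_seconds"] = time.perf_counter() - started
    return row, solution


def failing_solver(instance, time_limit):
    raise RuntimeError("no route")


row, solution = run_contained(failing_solver, "toy-instance", 1)
print(row["run_error"])  # RuntimeError: no route
```

Output integrity failures are deliberately not caught by a wrapper like this, which matches the rule that CSV or PostgreSQL write errors still fail the run.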

TOML Configuration

The unified harness can read benchmark settings from a TOML file:

PYTHONPATH=src:list-variable/cvrp/src:scalar-variable/employee-scheduling/src \
  .venv/bin/python3 scripts/run_benchmark.py --config benchmark.example.toml

The config file may select the benchmark, solvers, time limits, run catalog kind, nightly flag, output path, watchdog settings, benchmark-specific filters, logging settings, and PostgreSQL persistence:

benchmark = "cvrp"
solver = ["pyvrp"]
time_limits = [1]
run_kind = "quick"
nightly = false

[postgres]
save = false
url = "postgresql://postgres@localhost/solverforge_bench"

[logging]
level = "INFO"
show_solver_output = true
capture_solver_output = true

[benchmarks.cvrp]
num_instances = 3

[benchmarks.employee-scheduling]
dataset_set = "quick"
datasets = ["n005w4"]

Command-line options override TOML values. release_tag qualifies only the effective run_kind = "tag" catalog. If a tag-oriented TOML file is run with a CLI override such as --run-kind quick or --run-kind candidate, the TOML tag does not carry into the overridden run; an explicit CLI --release-tag with a non-tag run kind is rejected. Make targets accept the same file through BENCH_CONFIG:

make bench-cvrp-quick BENCH_CONFIG=benchmark.example.toml
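
The override and release-tag rules above can be sketched as a small resolver. This is a hedged illustration: `resolve_run_catalog` is a hypothetical function that takes already-parsed TOML and CLI values as plain dicts, and the fallback to "quick" when neither source sets run_kind is an assumption, not documented behavior.

```python
def resolve_run_catalog(toml_values, cli_values):
    """Merge TOML and CLI values; CLI wins, release_tag applies only to 'tag'."""
    # The "quick" fallback is an assumption for this sketch.
    run_kind = cli_values.get("run_kind") or toml_values.get("run_kind") or "quick"
    cli_tag = cli_values.get("release_tag")

    if run_kind != "tag":
        if cli_tag is not None:
            # An explicit CLI --release-tag with a non-tag run kind is rejected.
            raise ValueError("--release-tag requires --run-kind tag")
        # A TOML release_tag does not carry into an overridden, non-tag run.
        return {"run_kind": run_kind, "release_tag": None}

    release_tag = cli_tag or toml_values.get("release_tag")
    if release_tag is None:
        raise ValueError("run_kind 'tag' requires a release tag")
    return {"run_kind": "tag", "release_tag": release_tag}


tag_config = {"run_kind": "tag", "release_tag": "v0.11.1"}
print(resolve_run_catalog(tag_config, {}))
print(resolve_run_catalog(tag_config, {"run_kind": "quick"}))
```

The first call keeps the tag from the file; the second drops it because the CLI changed the effective run kind.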

Set nightly = true for cron-driven runs that should be kept distinct from normal runs with the same run_kind. Set [postgres].save = true or pass --save-postgres to persist with the configured URL. A TOML PostgreSQL URL by itself does not enable persistence; an explicit CLI --postgres-url does. Use benchmark.nightly.example.toml as the cron-oriented template for the combined nightly job; the nightly Make target runs the same root harness once per benchmark with that config.
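
The persistence rule above (a TOML URL alone does not enable saving, an explicit CLI URL does) can be sketched as follows. `postgres_target` is a hypothetical helper. Letting an explicit --no-save-postgres win over --postgres-url is an assumption, since the interaction between those two flags is not described here.

```python
DEFAULT_URL = "postgresql://postgres@localhost/solverforge_bench"


def postgres_target(toml_postgres, cli_save=None, cli_url=None):
    """Return the URL to persist to, or None when persistence is off."""
    url = cli_url or toml_postgres.get("url") or DEFAULT_URL
    if cli_save is not None:
        # Explicit --save-postgres / --no-save-postgres (assumed to win).
        save = cli_save
    elif cli_url is not None:
        # An explicit CLI --postgres-url enables persistence by itself.
        save = True
    else:
        # A TOML url alone is not enough; [postgres].save must be true.
        save = bool(toml_postgres.get("save", False))
    return url if save else None


print(postgres_target({"url": DEFAULT_URL}))                # None
print(postgres_target({"url": DEFAULT_URL, "save": True}))  # the configured URL
print(postgres_target({}, cli_url="postgresql://bench@db/warehouse"))
```

The URL in the last call is a made-up example value.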

PostgreSQL Results

PostgreSQL persistence uses plain SQL migrations in migrations/ with SQLx-compatible file naming. SQLx is the Rust-side migration convention here because it keeps schema changes as checked-in SQL, supports sqlx migrate run, and can embed the same migrations in Rust code with sqlx::migrate!() when a Rust service owns startup.

The local default database URL is:

postgresql://postgres@localhost/solverforge_bench

Set DATABASE_URL or BENCH_DATABASE_URL to point the Makefile database targets at another benchmark warehouse.

Prepare the database and apply migrations:

make db-check
make db-create
make db-migrate

db-migrate requires sqlx-cli on PATH:

cargo install sqlx-cli --no-default-features --features postgres

Save a benchmark run to PostgreSQL while still writing the normal CSV:

PYTHONPATH=src:list-variable/cvrp/src:scalar-variable/employee-scheduling/src \
  .venv/bin/python3 scripts/run_benchmark.py cvrp \
    --run-kind quick \
    --num-instances 3 \
    --time-limits 1 10 \
    --save-postgres

Run kinds are:

quick      smoke-scale runs
candidate  candidate comparison runs before release tagging
tag        SolverForge release snapshots, requires --release-tag

nightly is stored separately from run_kind; a run may be nightly and still be quick, candidate, or tag.

The normal benchmark Make targets are CSV-only. Use the -db variants to apply migrations and save the same run to PostgreSQL as well:

make bench-cvrp-quick
make bench-cvrp-quick-db
make bench-employee-scheduling-quick
make bench-employee-scheduling-quick-db
make bench-cvrp-db BENCH_ARGS="--run-kind tag --release-tag v0.11.1"
make bench-cvrp-db BENCH_ARGS="--run-kind quick --nightly"
make bench-nightly-db

make bench-nightly-db is the cron entrypoint. It builds both benchmark stacks, applies migrations once, then calls scripts/run_benchmark.py directly for CVRP and employee scheduling. By default it uses benchmark.nightly.example.toml; set BENCH_CONFIG for a different nightly config, or append explicit child harness overrides with NIGHTLY_ARGS. It does not pass --solver, so each benchmark uses its full default solver set.

Result Schema

Benchmark runs write a single global snake_case CSV schema directly. Native problem fields are stable optional columns, for example nurses, weeks, validator_model_delta, and score_drift.

PostgreSQL stores run-level catalog data in benchmark_runs, one solver-version row per solver involved in the run in benchmark_solver_versions, and one row per solver/case/time-limit result in benchmark_results. Each result references the corresponding solver-version row with a foreign key.

The live persistence path uses a Polars DataFrame ETL boundary fed by in-memory BenchmarkRow objects; generated CSV files are evidence artifacts, not the PostgreSQL source of truth. Core benchmark columns are typed columns. Benchmark-specific native fields are preserved in native_fields, and the complete emitted row is preserved in row_payload. Native solver versions are recorded from the built executables, not from Makefile default variables.

Runs have an independent nightly flag and are catalogued as running, completed, or failed; each completed result row is persisted immediately, so interrupted runs keep their partial rows but are excluded from latest-run display views. Run logs are linked from benchmark_runs.log_path; solver output logs are linked from benchmark_results.solver_stdout_path and benchmark_results.solver_stderr_path.

PostgreSQL is the warehouse source of truth. Display consumers should read benchmark_result_facts, latest_benchmark_runs, or latest_benchmark_result_facts instead of reconstructing the run/result join themselves. The latest-run views keep normal and nightly runs separate.

For file-based loading, scripts/normalize_results.py normalizes generated global CSV artifacts through Polars:

make normalize-results INPUT=path/to/benchmark.csv OUTPUT=results/normalized.csv
make normalize-results INPUT=path/to/benchmark.csv OUTPUT=results/normalized.ndjson ARGS="--format ndjson"

If a filtered run produces no result rows, the CSV remains a valid schema-only artifact and normalizes to a schema-only output file.

The normalized rows use these columns:

run_id, benchmark_name, benchmark_category, dataset, dataset_set, instance,
instance_size, solver, solver_version, time_limit_seconds, actual_time_seconds,
overshoot_seconds, overshoot_ratio, wall_time_over_limit,
watchdog_limit_seconds, watchdog_killed, run_error, solver_stdout_path,
solver_stderr_path, hard_feasible, cost, reported_cost, fresh_cost,
reference_cost, quality_ratio, validation_error, solution_artifact, nurses,
weeks, validator_model_delta, score_drift, source_file
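
A consumer of the normalized CSV might check the header before using the rows. The sketch below uses only the standard library rather than Polars; `read_normalized` is a hypothetical helper, and checking for column presence (not column order) is an assumption about what a consumer needs. The column names are copied from the list above.

```python
import csv
import io

NORMALIZED_COLUMNS = (
    "run_id", "benchmark_name", "benchmark_category", "dataset", "dataset_set",
    "instance", "instance_size", "solver", "solver_version",
    "time_limit_seconds", "actual_time_seconds", "overshoot_seconds",
    "overshoot_ratio", "wall_time_over_limit", "watchdog_limit_seconds",
    "watchdog_killed", "run_error", "solver_stdout_path", "solver_stderr_path",
    "hard_feasible", "cost", "reported_cost", "fresh_cost", "reference_cost",
    "quality_ratio", "validation_error", "solution_artifact", "nurses",
    "weeks", "validator_model_delta", "score_drift", "source_file",
)


def read_normalized(stream):
    """Read normalized rows from a text stream, rejecting incomplete headers."""
    reader = csv.DictReader(stream)
    header = reader.fieldnames or []
    missing = [name for name in NORMALIZED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# A schema-only artifact (header, no rows) is valid and yields no rows.
schema_only = io.StringIO(",".join(NORMALIZED_COLUMNS) + "\n")
print(len(read_normalized(schema_only)))
```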
