Benchmarking

Joshua Shinavier edited this page Apr 13, 2026 · 10 revisions

Benchmarking in Hydra

Hydra's benchmarking infrastructure measures and compares performance across language implementations by running identical kernel tests in each. Benchmarking currently supports Haskell, Java, Python, and four Lisp dialects (Clojure, Common Lisp, Emacs Lisp, Scheme); Scala support is planned. It can:

  • Track performance over time: Detect regressions and measure improvements across commits
  • Compare implementations: Understand relative performance characteristics across supported languages
  • Identify bottlenecks: Find expensive operations that need optimization
  • Reduce noise: Run multiple repetitions per language and report median timing with standard deviation

Prerequisites

Before running benchmarks, you need a working development environment for each language you want to benchmark. See the Developers guide for general setup, and Testing for test-specific requirements.

Language   Requirements
Haskell    Stack installed; heads/haskell builds and tests pass (cd heads/haskell && stack test)
Java       JDK 11+ and Gradle; tests pass (./gradlew :hydra-java:test)
Python     Python 3.12+ with a virtual environment at heads/python/.venv; pytest installed; tests pass (cd heads/python && .venv/bin/python -m pytest src/test/python/test_suite_runner.py)

You also need Python 3 available as python3 (used by the benchmark script and dashboard).

Quick start

Run the default benchmarks (Haskell, Java, and Python) and view the dashboard:

bin/run-benchmark-tests.sh

Run a single language:

bin/run-benchmark-tests.sh --hosts java

Run multiple repetitions for more stable results (reports median with standard deviation):

bin/run-benchmark-tests.sh --hosts java --repeat 5
bin/run-benchmark-tests.sh --hosts all --repeat 3

Tag a run for easier identification later:

bin/run-benchmark-tests.sh --hosts java --tag baseline

This creates a directory like run_2026-02-26_063120_353_baseline.

View the dashboard for existing runs without executing any tests:

bin/run-benchmark-tests.sh dashboard

The dashboard can also be invoked directly with additional options:

python3 bin/benchmark-dashboard.py --dir benchmark/runs

What gets timed

Kernel tests are the benchmark target. They form a tree of TestGroup/TestCase values exercising type inference, type checking, term reduction, and other Hydra runtime logic. These are defined in the common test suite and code-generated into each supported language. Haskell is the source of truth; its test groups define the rows in the dashboard.

Generation tests (which verify that generated code compiles correctly) are not benchmarked, as their timing reflects string comparison speed rather than Hydra kernel performance.

For background on how the test suite is structured, see Testing.

How it works

Group-level wall-clock timing

The test runners measure wall-clock time around each test group rather than timing individual tests. When entering a group, the runner records the current time; when the last test in that group finishes, it records the time again. This avoids the resolution problem of per-test timing (where most individual tests report 0ms at the framework's timer resolution floor). Even a group of trivially fast tests produces a measurable total that is meaningful for cross-language comparison.
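The idea of timing a whole group rather than each test can be sketched in a few lines of Python. This is an illustration only, not the code in any of Hydra's test runners: `timed_group`, `run_group`, and the `group_times_ms` accumulator are hypothetical names, and tests are modeled as plain callables that return True on success.

```python
import time
from contextlib import contextmanager

# Hypothetical accumulator: group path -> elapsed wall-clock milliseconds.
group_times_ms = {}


@contextmanager
def timed_group(path):
    """Record wall-clock time around an entire test group."""
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        elapsed_ns = time.perf_counter_ns() - start
        group_times_ms[path] = group_times_ms.get(path, 0.0) + elapsed_ns / 1e6


def run_group(path, tests):
    """Run every test in a group under one timer and return the pass count."""
    passed = 0
    with timed_group(path):
        for test in tests:
            if test():
                passed += 1
    return passed
```

Because the timer brackets the whole loop, the recorded value is the sum of many sub-millisecond tests and does not round down to zero the way per-test measurements tend to.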

Each test runner checks the HYDRA_BENCHMARK_OUTPUT environment variable:

  • When unset: the runner behaves identically to a normal test run
  • When set: the runner additionally records group-level timing and writes a JSON file to the specified path

You do not normally need to set this variable yourself; the bin/run-benchmark-tests.sh script handles it.
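As a rough sketch of the switch described above, a runner could gate its JSON output on the environment variable as follows. The function name `maybe_write_benchmark` and the minimal payload are assumptions made for illustration; the real runners write the fuller schema shown under "JSON output format".

```python
import json
import os


def maybe_write_benchmark(groups, environ=os.environ):
    """Write timing JSON only when HYDRA_BENCHMARK_OUTPUT is set.

    Returns the output path, or None when benchmarking is disabled.
    """
    out_path = environ.get("HYDRA_BENCHMARK_OUTPUT")
    if not out_path:
        return None  # normal test run: no timing file is produced
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump({"groups": groups}, f, indent=2)
    return out_path
```

Passing the environment as a parameter keeps the sketch easy to exercise without modifying the real process environment.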

Repetitions and aggregation

With --repeat N, each language's tests are run N times in separate processes. Each process is independent (no shared JVM, no warm-start effects between runs). The dashboard computes the median timing across repetitions for each group and reports the standard deviation alongside it. The median is robust to outliers from OS-level noise (CPU scheduling, filesystem cache state, background load) without needing to manually discard runs.
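The aggregation can be illustrated with Python's standard statistics module. This is a minimal sketch, not the dashboard's actual code: the function name `aggregate` is hypothetical, and the choice of sample standard deviation (rather than population) is an assumption.

```python
import statistics


def aggregate(times_ms):
    """Return (median, standard deviation) for one group's repeated timings."""
    median = statistics.median(times_ms)
    # A single run has no spread; report 0.0, in line with the ±0.00
    # that the dashboard prints for single runs.
    # Assumption: sample standard deviation (statistics.stdev).
    stddev = statistics.stdev(times_ms) if len(times_ms) > 1 else 0.0
    return median, stddev
```

For example, with timings of 10, 12, and 50 ms, the median is 12 ms while the mean would be 24 ms, so one noisy repetition does not distort the reported time.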

Instrumentation by language

Haskell (heads/haskell/src/test/haskell/Hydra/TestSuiteSpec.hs): Uses Hspec beforeAll_/afterAll_ hooks around each describe block with Data.Time.Clock.POSIX.getPOSIXTime for timing. Results accumulate in IORef maps and are written as JSON in an afterAll_ hook.

Java (heads/java/src/test/java/hydra/TestSuiteRunner.java): Inserts sentinel DynamicTest nodes (000_TIMER_START/999_TIMER_END) at group boundaries using System.nanoTime(). Results are written via a JVM shutdown hook.

Python (heads/python/src/test/python/hydra/test_suite_runner.py): Inserts sentinel test functions (000_TIMER_START/999_TIMER_END) at group boundaries using time.perf_counter_ns(). Results are written via atexit.

The dashboard

The dashboard reads JSON benchmark files from benchmark/runs/ and displays results in the terminal. Columns are separated by | for readability.

Latest comparison (default)

Shows the most recent run for each language, side by side. Haskell is always the first column (as the source of truth for the test suite):

python3 bin/benchmark-dashboard.py --dir benchmark/runs
Latest runs:
  Haskell    2026-02-20T00:04 (54610c423, feature_254) "WIP" x5
  Java       2026-02-20T00:05 (54610c423, feature_254) "WIP" x5
  Python     2026-02-20T00:04 (54610c423, feature_254) "WIP" x5

Group                                    | Haskell                        | Java                           | Python
-------------------------------------------------------------------------------------------------------------------------------------------
hydra.lib.chars primitives               |   20             0.70    ±0.00 |   20            35.2   ±16.9   |   20              127     ±306
hydra.lib.lists primitives               |  180             6.20    ±0.20 |  180            57.2    ±2.70  |  180             93.2    ±1.10
checking                                 |  335             33.6    ±0.00 |  335              478    ±0.00 |  334    1         1550    ±0.00
  Fundamentals                           |  103             6.90    ±0.00 |  103              102    ±0.00 |  103              434     ±0.00
  Nominal types                          |  131             11.0    ±0.00 |  131              147    ±0.00 |  131              585     ±0.00
inference                                |  297        3     408    ±0.00 |  297        3     279    ±0.00 |  297        3    1232     ±0.00
  Expected failures                      |  113              379    ±0.00 |  113            53.1    ±0.00  |  113              258     ±0.00
rewriting                                |  176             42.8    ±1.50 |  176            34.4    ±3.00  |  176             50.5    ±0.20
  unshadowVariables                      |   31             1.20    ±0.00 |   --         --                |   31             13.0    ±0.00
...
-------------------------------------------------------------------------------------------------------------------------------------------
TOTAL                                    | 2133       21     663   ±44.4  | 2133       21    1278   ±33.0  | 2028     126     4398     ±307

Each row shows, for each language:

  • Passed count (blank if 0)
  • Failed count in red (blank if 0)
  • Skipped count in gray (blank if 0)
  • Median time in milliseconds
  • Standard deviation in gray (±0.00 for single runs)

Visual cues:

  • Groups missing from an implementation show --
  • Times exceeding 10x the Haskell time are highlighted in red
  • Times faster than Haskell are highlighted in green
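The highlighting rules above amount to a comparison against the Haskell baseline. The following sketch shows one way to express them; the function name `highlight` is hypothetical, and the strictness of the comparisons (greater than, less than) is an assumption.

```python
def highlight(time_ms, haskell_ms, slowdown=10.0):
    """Pick a color for a non-Haskell timing relative to the Haskell baseline."""
    if time_ms is None or haskell_ms is None:
        return "none"  # group missing on one side: the cell shows --
    if time_ms > slowdown * haskell_ms:
        return "red"  # slower than slowdown x Haskell
    if time_ms < haskell_ms:
        return "green"  # faster than Haskell
    return "none"
```

Applied to the sample output above, Java's 35.2 ms for hydra.lib.chars primitives against Haskell's 0.70 ms would be red, while Java's 279 ms for inference against Haskell's 408 ms would be green.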

Controlling depth

By default, subgroups are shown one level deep. Use --depth to control this:

# Top-level groups only
python3 bin/benchmark-dashboard.py --dir benchmark/runs --depth 1

# Two levels (default)
python3 bin/benchmark-dashboard.py --dir benchmark/runs --depth 2

# Three levels
python3 bin/benchmark-dashboard.py --dir benchmark/runs --depth 3

Selecting a specific run

By default, the latest view shows the most recent run. Use --run to display a specific run by directory name, tag, or substring:

# By tag
python3 bin/benchmark-dashboard.py --run baseline

# By full directory name
python3 bin/benchmark-dashboard.py --run run_2026-02-26_063120_353_baseline

# Combined with other options
python3 bin/benchmark-dashboard.py --run baseline --depth 1 --lang java

Filtering

# Show only a specific group and its subgroups
python3 bin/benchmark-dashboard.py --dir benchmark/runs --group rewriting

# Show only one language
python3 bin/benchmark-dashboard.py --dir benchmark/runs --lang java

Run diff

Compare the two most recent runs for each language (useful for before/after comparisons):

python3 bin/benchmark-dashboard.py diff
python3 bin/benchmark-dashboard.py diff --lang python

Use --old and --new to select specific runs by directory name, tag, or substring:

python3 bin/benchmark-dashboard.py diff --old baseline --new optimized
python3 bin/benchmark-dashboard.py diff --new baseline

When --new is provided without --old, the "old" run is the one immediately preceding it in chronological order.

The diff view shows the previous and current times for each group, along with the percentage change. Groups whose change exceeds the threshold (default 10%) are marked with <-.
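The delta calculation is ordinary percentage change. A minimal sketch, with hypothetical function names, and assuming that both slowdowns and speedups beyond the threshold are flagged (the absolute value is compared):

```python
def percent_delta(old_ms, new_ms):
    """Percentage change from old to new; None when the old time is zero."""
    if old_ms == 0:
        return None
    return (new_ms - old_ms) / old_ms * 100.0


def is_flagged(old_ms, new_ms, threshold=10.0):
    """True when the change exceeds the threshold in either direction."""
    delta = percent_delta(old_ms, new_ms)
    return delta is not None and abs(delta) > threshold
```

Under these assumptions, a group that goes from 100 ms to 115 ms (+15%) is flagged, and one that goes from 100 ms to 105 ms (+5%) is not.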

Branch comparison

Compare the latest run on two branches:

python3 bin/benchmark-dashboard.py compare main feature_x

Commit history

Show how a specific group's timing evolved across commits:

python3 bin/benchmark-dashboard.py history --group inference

Dashboard options

Option          Description
--dir DIR       Benchmark runs directory (default: benchmark/runs)
--lang LANG     Show only one language (haskell, java, python)
--group GROUP   Filter to a specific group
--depth N       How many levels deep to display (default: 2)
--run RUN       Show a specific run directory in latest mode (accepts exact name, tag, or substring)
--threshold N   Highlight deltas exceeding N percent (default: 10)
--slowdown N    Highlight times exceeding Nx Haskell's in red (default: 10)
--old RUN       Run directory for "previous" in diff mode (default: run before --new)
--new RUN       Run directory for "current" in diff mode (default: latest)
--last N        Show only N most recent runs (history mode)

Run log

Benchmark results are stored in timestamped directories under benchmark/runs/:

benchmark/runs/
  run_2026-02-19_202116_746/              # Single run (no --repeat)
    haskell.json
    java.json
    python.json
  run_2026-02-20_000451_332_baseline/     # --tag baseline, --repeat 3
    haskell_1.json
    haskell_2.json
    haskell_3.json
    java_1.json
    java_2.json
    java_3.json
    python_1.json
    python_2.json
    python_3.json

Each directory represents one benchmark session. Without --repeat, each language produces a single file (e.g., haskell.json). With --repeat N, each language produces N files (e.g., haskell_1.json through haskell_N.json). The dashboard automatically detects multiple files and aggregates them.
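Detecting single versus repeated runs comes down to grouping file names by their language prefix. The sketch below is one possible approach, not the dashboard's implementation; `FILE_RE` and `group_by_language` are hypothetical names, and only the file naming shown above is assumed.

```python
import re
from collections import defaultdict

# Matches "haskell.json" as well as "haskell_1.json", "haskell_2.json", ...
FILE_RE = re.compile(r"^(.+?)(?:_(\d+))?\.json$")


def group_by_language(filenames):
    """Map each language to the benchmark files found for it in one run directory."""
    by_lang = defaultdict(list)
    for name in sorted(filenames):
        m = FILE_RE.match(name)
        if m:
            by_lang[m.group(1)].append(name)
    return dict(by_lang)
```

A language with one file is treated as a single run; a language with several files would have its timings aggregated across them.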

Run directories can be renamed with a tag suffix for easier tracking (e.g. renaming run_2026-02-20_000451_332 to run_2026-02-20_000451_332_baseline), or tagged at creation time with --tag. Chronological ordering uses only the timestamp portion, so tags do not affect sort order.
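Ordering by the timestamp portion alone can be sketched with a regular expression that separates the timestamp from the optional tag. The pattern is inferred from the directory names shown on this page; `RUN_DIR`, `split_run_name`, and `sort_runs` are hypothetical names, and reading the trailing three digits as milliseconds is an assumption.

```python
import re

# run_<YYYY-MM-DD>_<HHMMSS>_<mmm>[_<tag>]
RUN_DIR = re.compile(r"^run_(\d{4}-\d{2}-\d{2}_\d{6}_\d{3})(?:_(.+))?$")


def split_run_name(name):
    """Return (timestamp, tag) for a run directory name; tag is None if absent."""
    m = RUN_DIR.match(name)
    if m is None:
        raise ValueError(f"not a run directory name: {name!r}")
    return m.group(1), m.group(2)


def sort_runs(names):
    """Sort run directories chronologically, ignoring any tag suffix."""
    return sorted(names, key=lambda n: split_run_name(n)[0])
```

Because the timestamp fields are zero-padded and ordered from most to least significant, comparing them as strings yields chronological order.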

The benchmark/runs/ directory is gitignored. It is an append-only local log; old runs can be deleted manually when no longer needed.

JSON output format

Each benchmark file follows this schema:

{
  "metadata": {
    "timestamp": "2026-02-19T20:21:00Z",
    "language": "haskell",
    "branch": "feature_254_bootstrapping_demo",
    "commit": "44730e270",
    "commitMessage": "WIP (stable)"
  },
  "groups": [
    {
      "path": "common/hydra.lib.chars primitives",
      "passed": 20,
      "failed": 0,
      "skipped": 0,
      "totalTimeMs": 0.8,
      "subgroups": [
        {
          "path": "common/hydra.lib.chars primitives/isAlphaNum",
          "passed": 4,
          "failed": 0,
          "skipped": 0,
          "totalTimeMs": 0.3
        }
      ]
    }
  ],
  "summary": {
    "totalPassed": 2133,
    "totalFailed": 0,
    "totalSkipped": 21,
    "totalTimeMs": 655.0
  }
}

The path field is the cross-language join key, derived from the TestGroup.name values in the common test suite.
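To illustrate how path serves as a join key, the following sketch flattens each language's group tree and lines up timings by path. It assumes only the schema shown above; `flatten` and `join_by_path` are hypothetical helpers, not functions from bin/benchmark-dashboard.py.

```python
def flatten(groups):
    """Map each group's path to its totalTimeMs, descending into subgroups."""
    times = {}
    for g in groups:
        times[g["path"]] = g["totalTimeMs"]
        times.update(flatten(g.get("subgroups", [])))
    return times


def join_by_path(reports):
    """Given language -> parsed benchmark JSON, return path -> {language: ms}."""
    joined = {}
    for lang, report in reports.items():
        for path, ms in flatten(report["groups"]).items():
            joined.setdefault(path, {})[lang] = ms
    return joined
```

A path present for one language but absent for another simply has no entry for the missing language, which corresponds to the -- cells in the dashboard.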

The benchmark script

bin/run-benchmark-tests.sh orchestrates benchmark runs:

bin/run-benchmark-tests.sh                               # Run default (Haskell + Java + Python)
bin/run-benchmark-tests.sh --hosts haskell               # Run Haskell only
bin/run-benchmark-tests.sh --hosts python                # Run Python only
bin/run-benchmark-tests.sh --hosts java                  # Run Java only
bin/run-benchmark-tests.sh --hosts all                   # Run all 7 implementations
bin/run-benchmark-tests.sh --hosts lisp                  # Run all four Lisp dialects
bin/run-benchmark-tests.sh dashboard                     # Just show the dashboard
bin/run-benchmark-tests.sh --hosts java --repeat 5       # Run Java 5 times (median + stddev)
bin/run-benchmark-tests.sh --hosts all --repeat 3        # Run everything 3 times each
bin/run-benchmark-tests.sh --hosts java --tag baseline   # Tag the run directory for easy reference

The script creates a timestamped directory under benchmark/runs/ (with an optional tag suffix), sets HYDRA_BENCHMARK_OUTPUT for each language's test runner, and displays the dashboard when complete.

See also

  • Testing - Test suite architecture, how to run tests, and how to add new test cases
  • Developers - Environment setup and source code organization
  • Issue #234 - Cross-language benchmarking design and implementation history
