Skip to content

Benchmarking

Joshua Shinavier edited this page Jan 16, 2026 · 10 revisions

Benchmarking in Hydra

Hydra's benchmarking infrastructure uses the Common Test Suite to measure and compare performance across implementations. By running identical test cases in Haskell, Java, and Python, we can:

  • Track performance over time: Detect regressions and measure improvements
  • Compare implementations: Understand relative performance characteristics
  • Identify bottlenecks: Find expensive operations that need optimization

Test Types for Benchmarking

Both kernel tests and generation tests from the Common Test Suite can be benchmarked:

Test Type What It Measures
Kernel tests Runtime performance (type inference, reduction, etc.)
Generation tests Generated code execution time

Python Benchmark Tool

The Python implementation includes a benchmark tool that runs all tests and outputs timing data in CSV format.

Quick Start

cd hydra-python
./bin/benchmark.sh

This runs all tests (kernel and generation) and outputs results to test_timings.csv.

Options

./bin/benchmark.sh [options]
Option Description
--include-slow Include tests tagged as slow (disabledForPython)
--kernel-only Only run kernel tests
--generation-only Only run generation tests
--all Run both kernel and generation tests (default)
-o FILE Output CSV filename (default: test_timings.csv)

Examples

# Run all tests (kernel + generation)
./bin/benchmark.sh

# Include slow tests (takes longer)
./bin/benchmark.sh --include-slow

# Run kernel tests only (faster)
./bin/benchmark.sh --kernel-only

Manual Invocation

You can also run the benchmark script directly:

cd hydra-python
PYTHONPATH=src/main/python:src/gen-main/python:src/gen-test/python:src/test/python \
    python3 src/test/python/benchmark_pytest.py [options]

Output Format

The benchmark produces a CSV file with:

Column Description Example
test_type Test category kernel or generation
hydra_path Cross-language test identifier common/inference/Fundamentals/Let terms/Nested let/#3
python_name Python pytest test name test_common_inference_fundamentals_let_terms_nested_let_case_3
status Test result passed, failed, skipped
duration_ms Duration in milliseconds 9330.00

The hydra_path column preserves the original test hierarchy from the test sources and can be used to correlate results across Hydra implementations.

Example Output

=== SUMMARY ===

KERNEL TESTS:
  Total: 1857
  Passed: 1647, Failed: 63, Skipped: 147
  Total time: 602.2s

Top 10 slowest tests:
   112320.00ms  common/inference/Fundamentals/Let terms/Let-polymorphism/#2
    83710.00ms  common/checking/Nominal types/.../Using kernel types/...
    45150.00ms  common/checking/Nominal types/.../case Either with polymorphic let bindings
    ...

Cross-Language Comparison

The benchmark is designed to support cross-language comparison (see Issue #234). The hydra_path column provides a language-agnostic test identifier that can be matched across implementations:

# Python
common/inference/Fundamentals/Let terms/Nested let/#3 -> 9330ms

# Haskell (future)
common/inference/Fundamentals/Let terms/Nested let/#3 -> 15ms

# Java (future)
common/inference/Fundamentals/Let terms/Nested let/#3 -> 45ms

Performance Tracking

Detecting Regressions

Compare benchmark results before and after changes:

# Before changes
python3 benchmark_pytest.py -o before.csv

# Make changes...

# After changes
python3 benchmark_pytest.py -o after.csv

# Compare (manually or with diff tools)

Known Slow Tests

Some tests are intentionally slow due to algorithmic complexity (e.g., complex polymorphic type inference). These are tagged with disabledForPython and excluded by default. Use --include-slow to benchmark them.

Tests taking >10 seconds typically involve:

  • Complex let-polymorphism
  • Multi-parameter polymorphic types
  • Nested recursive definitions
  • Large constraint solving

How It Works

The benchmark tool (benchmark_pytest.py) works by:

  1. Loading the test hierarchy: Imports HYDRA_PATH_MAP from test_suite_runner.py, which maps Python test names to their original Hydra paths (e.g., test_common_inference_fundamentals_...common/inference/Fundamentals/...)

  2. Running pytest with timing: Executes pytest with --durations=0 to capture timing for every test, and -v to capture pass/fail status

  3. Parsing output: Extracts test names, durations, and status from pytest's output

  4. Correlating paths: Uses HYDRA_PATH_MAP to convert Python test names back to Hydra paths

  5. Writing CSV: Outputs results sorted by duration (slowest first)

Path Mapping

The test_suite_runner.py module builds HYDRA_PATH_MAP by traversing the TestGroup hierarchy from the generated test suite:

# In test_suite_runner.py
HYDRA_PATH_MAP: dict[str, str] = {
    "test_common_inference_fundamentals_let_terms_nested_let_case_3":
        "common/inference/Fundamentals/Let terms/Nested let/#3",
    # ... 1800+ mappings
}

This mapping preserves the original test names from the test sources, enabling cross-language correlation.

Performance Optimizations

The Python implementation includes short-circuit optimizations for type substitution that significantly improve performance on complex tests. See python-test-performance.md for details.

See Also

Clone this wiki locally