Benchmarking

Benchmarking in Hydra

Hydra's benchmarking infrastructure uses the Common Test Suite to measure and compare performance across implementations. By running identical test cases in Haskell, Java, and Python, we can:

Track performance over time: Detect regressions and measure improvements
Compare implementations: Understand relative performance characteristics
Identify bottlenecks: Find expensive operations that need optimization

Test Types for Benchmarking

Both kernel tests and generation tests from the Common Test Suite can be benchmarked:

Test Type	What It Measures
Kernel tests	Runtime performance (type inference, reduction, etc.)
Generation tests	Generated code execution time

Python Benchmark Tool

The Python implementation includes a benchmark tool that runs all tests and outputs timing data in CSV format.

Quick Start

cd hydra-python
./bin/benchmark.sh

This runs all tests (kernel and generation) and outputs results to test_timings.csv.

Options

./bin/benchmark.sh [options]

Option	Description
`--include-slow`	Include tests tagged as slow (`disabledForPython`)
`--kernel-only`	Only run kernel tests
`--generation-only`	Only run generation tests
`--all`	Run both kernel and generation tests (default)
`-o FILE`	Output CSV filename (default: `test_timings.csv`)

Examples

# Run all tests (kernel + generation)
./bin/benchmark.sh

# Include slow tests (takes longer)
./bin/benchmark.sh --include-slow

# Run kernel tests only (faster)
./bin/benchmark.sh --kernel-only

Manual Invocation

You can also run the benchmark script directly:

cd hydra-python
PYTHONPATH=src/main/python:src/gen-main/python:src/gen-test/python:src/test/python \
    python3 src/test/python/benchmark_pytest.py [options]

Output Format

The benchmark produces a CSV file with the following columns:

Column	Description	Example
`test_type`	Test category	`kernel` or `generation`
`hydra_path`	Cross-language test identifier	`common/inference/Fundamentals/Let terms/Nested let/#3`
`python_name`	Python pytest test name	`test_common_inference_fundamentals_let_terms_nested_let_case_3`
`status`	Test result	`passed`, `failed`, `skipped`
`duration_ms`	Duration in milliseconds	`9330.00`

The hydra_path column preserves the original test hierarchy from the test sources and can be used to correlate results across Hydra implementations.
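As a minimal sketch of consuming this file, the following Python reads the rows and recomputes a summary (status counts, total time, slowest tests). It assumes the CSV begins with a header row whose names match the columns above. The helper names `load_timings` and `summarize` and the sample rows are hypothetical and not part of the benchmark tool.

```python
import csv
import io
from collections import Counter

# Made-up sample rows for illustration; real files come from ./bin/benchmark.sh.
# Assumption: the file begins with a header row naming the five columns.
SAMPLE_CSV = """\
test_type,hydra_path,python_name,status,duration_ms
kernel,example/group/#1,test_example_group_case_1,passed,120.50
kernel,example/group/#2,test_example_group_case_2,failed,30.00
generation,example/other/#1,test_example_other_case_1,passed,9330.00
"""


def load_timings(csv_file):
    """Read benchmark rows from an open CSV file; duration_ms becomes a float."""
    rows = []
    for row in csv.DictReader(csv_file):
        row["duration_ms"] = float(row["duration_ms"])
        rows.append(row)
    return rows


def summarize(rows, top_n=10):
    """Return (status counts, total milliseconds, the top_n slowest rows)."""
    status_counts = Counter(row["status"] for row in rows)
    total_ms = sum(row["duration_ms"] for row in rows)
    slowest = sorted(rows, key=lambda r: r["duration_ms"], reverse=True)[:top_n]
    return status_counts, total_ms, slowest


if __name__ == "__main__":
    counts, total_ms, slowest = summarize(load_timings(io.StringIO(SAMPLE_CSV)))
    print(dict(counts), f"total {total_ms / 1000:.1f}s")
    for row in slowest:
        print(f"{row['duration_ms']:>12.2f}ms  {row['hydra_path']}")
```

To summarize a real run, open the file with `open("test_timings.csv", newline="")` and pass the handle to `load_timings`.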

Example Output

=== SUMMARY ===

KERNEL TESTS:
  Total: 1857
  Passed: 1647, Failed: 63, Skipped: 147
  Total time: 602.2s

Top 10 slowest tests:
   112320.00ms  common/inference/Fundamentals/Let terms/Let-polymorphism/#2
    83710.00ms  common/checking/Nominal types/.../Using kernel types/...
    45150.00ms  common/checking/Nominal types/.../case Either with polymorphic let bindings
    ...

Cross-Language Comparison

The benchmark is designed to support cross-language comparison (see Issue #234). The hydra_path column provides a language-agnostic test identifier that can be matched across implementations:

# Python
common/inference/Fundamentals/Let terms/Nested let/#3 -> 9330ms

# Haskell (future)
common/inference/Fundamentals/Let terms/Nested let/#3 -> 15ms

# Java (future)
common/inference/Fundamentals/Let terms/Nested let/#3 -> 45ms
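Matching on hydra_path could look roughly like the sketch below. It assumes each implementation eventually produces a mapping from hydra_path to duration in milliseconds; only the Python tool exists today, so the Haskell and Java entries simply reuse the illustrative figures from the listing above. The function names `compare_by_path` and `slowdown_vs` are hypothetical.

```python
def compare_by_path(timings_by_language):
    """Line up durations for every hydra_path measured in all languages.

    timings_by_language maps a language name to {hydra_path: duration_ms}.
    Returns {hydra_path: {language: duration_ms}} for the shared paths only.
    """
    path_sets = [set(timings) for timings in timings_by_language.values()]
    shared = set.intersection(*path_sets) if path_sets else set()
    return {
        path: {lang: timings[path] for lang, timings in timings_by_language.items()}
        for path in sorted(shared)
    }


def slowdown_vs(table, language, baseline):
    """Ratio of language's duration to baseline's duration for each shared path."""
    return {
        path: durations[language] / durations[baseline]
        for path, durations in table.items()
        if durations[baseline] > 0
    }


if __name__ == "__main__":
    path = "common/inference/Fundamentals/Let terms/Nested let/#3"
    # Illustrative numbers only (taken from the listing above), not measurements.
    timings = {
        "python": {path: 9330.0},
        "haskell": {path: 15.0},
        "java": {path: 45.0},
    }
    table = compare_by_path(timings)
    for test_path, ratio in slowdown_vs(table, "python", "haskell").items():
        print(f"{test_path}: Python / Haskell = {ratio:.0f}x")
```

Restricting the comparison to paths present in every implementation avoids skew from tests that are skipped or disabled in one language.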

Performance Tracking

Detecting Regressions

Compare benchmark results before and after changes:

# Before changes (run from hydra-python)
./bin/benchmark.sh -o before.csv

# Make changes...

# After changes
./bin/benchmark.sh -o after.csv

# Compare (manually or with diff tools)
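One way to automate the comparison is sketched below. It assumes both files have a header row with `hydra_path` and `duration_ms` columns. The function names and the thresholds (a 1.5x slowdown and at least 100 ms of added time) are arbitrary choices for illustration, not part of the benchmark tool.

```python
import csv


def read_durations(csv_path):
    """Load {hydra_path: duration_ms} from a benchmark CSV file."""
    with open(csv_path, newline="") as f:
        return {row["hydra_path"]: float(row["duration_ms"]) for row in csv.DictReader(f)}


def find_regressions(before, after, min_ratio=1.5, min_delta_ms=100.0):
    """Return (path, old_ms, new_ms) for tests that got notably slower.

    A test is flagged only if it is at least min_ratio times slower AND
    gained at least min_delta_ms, which filters out noise on tiny tests.
    Tests that appear in only one run are ignored.
    """
    regressions = []
    for path, new_ms in after.items():
        old_ms = before.get(path)
        if old_ms is None:
            continue
        if new_ms - old_ms >= min_delta_ms and new_ms >= old_ms * min_ratio:
            regressions.append((path, old_ms, new_ms))
    # Largest absolute slowdown first.
    regressions.sort(key=lambda r: r[2] - r[1], reverse=True)
    return regressions


if __name__ == "__main__":
    before = read_durations("before.csv")
    after = read_durations("after.csv")
    for path, old_ms, new_ms in find_regressions(before, after):
        print(f"{old_ms:>10.2f}ms -> {new_ms:>10.2f}ms  {path}")
```

Single-run timings are noisy, so a flagged test is worth re-running before treating it as a real regression.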

Known Slow Tests

Some tests are inherently slow because of algorithmic complexity (e.g., complex polymorphic type inference). These are tagged with disabledForPython and excluded by default; use --include-slow to benchmark them.

Tests that take more than 10 seconds typically involve:

Complex let-polymorphism
Multi-parameter polymorphic types
Nested recursive definitions
Large constraint solving

How It Works

The benchmark tool (benchmark_pytest.py) works by:

Loading the test hierarchy: Imports HYDRA_PATH_MAP from test_suite_runner.py, which maps Python test names to their original Hydra paths (e.g., test_common_inference_fundamentals_... → common/inference/Fundamentals/...)
Running pytest with timing: Executes pytest with --durations=0 to capture timing for every test, and -v to capture pass/fail status
Parsing output: Extracts test names, durations, and status from pytest's output
Correlating paths: Uses HYDRA_PATH_MAP to convert Python test names back to Hydra paths
Writing CSV: Outputs results sorted by duration (slowest first)
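The parsing step can be sketched as follows. This is not the code in benchmark_pytest.py; it is a simplified illustration that assumes pytest's usual text output, where `-v` prints lines such as `file.py::test_name PASSED` and `--durations=0` prints lines such as `1.25s call     file.py::test_name`. The regular expressions, function names, and sample output are all hypothetical.

```python
import re

# Assumed shape of a --durations line: "<seconds>s <phase> <node id>"
DURATION_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)s\s+(setup|call|teardown)\s+\S+::(\S+)")
# Assumed shape of a -v result line: "<node id> <STATUS> ..."
STATUS_RE = re.compile(r"^\S+::(\S+)\s+(PASSED|FAILED|SKIPPED|ERROR|XFAIL|XPASS)\b")

SAMPLE_OUTPUT = """\
test_demo.py::test_alpha PASSED                                   [ 33%]
test_demo.py::test_beta FAILED                                    [ 66%]
test_demo.py::test_gamma SKIPPED (slow)                           [100%]
============================= slowest durations =============================
1.25s call     test_demo.py::test_alpha
0.50s call     test_demo.py::test_beta
0.01s setup    test_demo.py::test_alpha
"""


def parse_pytest_output(text):
    """Return {test_name: {"status": str or None, "duration_ms": float}}."""
    results = {}
    for line in text.splitlines():
        status_match = STATUS_RE.match(line)
        if status_match:
            name, status = status_match.groups()
            entry = results.setdefault(name, {"status": None, "duration_ms": 0.0})
            entry["status"] = status.lower()
            continue
        duration_match = DURATION_RE.match(line)
        # Only the "call" phase is counted; setup and teardown are ignored here.
        if duration_match and duration_match.group(2) == "call":
            seconds, _phase, name = duration_match.groups()
            entry = results.setdefault(name, {"status": None, "duration_ms": 0.0})
            entry["duration_ms"] = round(float(seconds) * 1000.0, 2)
    return results


def slowest_first(results):
    """Sort (name, entry) pairs by duration, slowest first, as in the CSV."""
    return sorted(results.items(), key=lambda item: item[1]["duration_ms"], reverse=True)


if __name__ == "__main__":
    for name, entry in slowest_first(parse_pytest_output(SAMPLE_OUTPUT)):
        print(f"{entry['duration_ms']:>10.2f}ms  {entry['status']:<8} {name}")
```

In this sketch, tests with no reported call duration (for example, skipped tests) keep a duration of 0.0 rather than being dropped.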

Path Mapping

The test_suite_runner.py module builds HYDRA_PATH_MAP by traversing the TestGroup hierarchy from the generated test suite:

# In test_suite_runner.py
HYDRA_PATH_MAP: dict[str, str] = {
    "test_common_inference_fundamentals_let_terms_nested_let_case_3":
        "common/inference/Fundamentals/Let terms/Nested let/#3",
    # ... 1800+ mappings
}

This mapping preserves the original test names from the test sources, enabling cross-language correlation.
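The traversal can be illustrated with a simplified stand-in. The `Group` class below is a hypothetical reduction of the generated TestGroup type, and the naming rules (lowercase, non-alphanumeric runs become `_`, unnamed case `#n` becomes `case_n`) are inferred from the single example mapping above, so the actual rules in test_suite_runner.py may differ.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical stand-in for a test group: a name, subgroups, and cases.

    Each case is either a name or None; unnamed cases are labeled "#<index>".
    """
    name: str
    subgroups: list[Group] = field(default_factory=list)
    cases: list[str | None] = field(default_factory=list)


def to_python_name(hydra_path):
    """Convert a Hydra path to a pytest-style name (assumed naming rules)."""
    parts = []
    for segment in hydra_path.split("/"):
        if segment.startswith("#"):
            parts.append("case_" + segment[1:])
        else:
            parts.append(re.sub(r"[^0-9a-z]+", "_", segment.lower()).strip("_"))
    return "test_" + "_".join(parts)


def build_path_map(group, parent_path=""):
    """Walk the group tree depth-first, returning {python_name: hydra_path}."""
    path = f"{parent_path}/{group.name}" if parent_path else group.name
    mapping = {}
    for index, case_name in enumerate(group.cases, start=1):
        label = case_name if case_name else f"#{index}"
        case_path = f"{path}/{label}"
        mapping[to_python_name(case_path)] = case_path
    for subgroup in group.subgroups:
        mapping.update(build_path_map(subgroup, path))
    return mapping


if __name__ == "__main__":
    nested_let = Group("Nested let", cases=[None, None, None])
    root = Group("common", [Group("inference", [Group("Fundamentals", [
        Group("Let terms", [nested_let])])])])
    for python_name, hydra_path in build_path_map(root).items():
        print(f"{python_name} -> {hydra_path}")
```

Because the name conversion is lossy (case and punctuation are discarded), the map is built in the forward direction and looked up by Python name, rather than trying to reconstruct paths from names.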

Performance Optimizations

The Python implementation includes short-circuit optimizations for type substitution that significantly improve performance on complex tests. See python-test-performance.md for details.
