-
Notifications
You must be signed in to change notification settings - Fork 24
Benchmarking
Hydra's benchmarking infrastructure uses the Common Test Suite to measure and compare performance across implementations. By running identical test cases in Haskell, Java, and Python, we can:
- Track performance over time: Detect regressions and measure improvements
- Compare implementations: Understand relative performance characteristics
- Identify bottlenecks: Find expensive operations that need optimization
Both kernel tests and generation tests from the Common Test Suite can be benchmarked:
| Test Type | What It Measures |
|---|---|
| Kernel tests | Runtime performance (type inference, reduction, etc.) |
| Generation tests | Generated code execution time |
The Python implementation includes a benchmark tool that runs all tests and outputs timing data in CSV format.
cd hydra-python
./bin/benchmark.shThis runs all tests (kernel and generation) and outputs results to test_timings.csv.
./bin/benchmark.sh [options]| Option | Description |
|---|---|
--include-slow |
Include tests tagged as slow (disabledForPython) |
--kernel-only |
Only run kernel tests |
--generation-only |
Only run generation tests |
--all |
Run both kernel and generation tests (default) |
-o FILE |
Output CSV filename (default: test_timings.csv) |
# Run all tests (kernel + generation)
./bin/benchmark.sh
# Include slow tests (takes longer)
./bin/benchmark.sh --include-slow
# Run kernel tests only (faster)
./bin/benchmark.sh --kernel-onlyYou can also run the benchmark script directly:
cd hydra-python
PYTHONPATH=src/main/python:src/gen-main/python:src/gen-test/python:src/test/python \
python3 src/test/python/benchmark_pytest.py [options]The benchmark produces a CSV file with:
| Column | Description | Example |
|---|---|---|
test_type |
Test category |
kernel or generation
|
hydra_path |
Cross-language test identifier | common/inference/Fundamentals/Let terms/Nested let/#3 |
python_name |
Python pytest test name | test_common_inference_fundamentals_let_terms_nested_let_case_3 |
status |
Test result |
passed, failed, skipped
|
duration_ms |
Duration in milliseconds | 9330.00 |
The hydra_path column preserves the original test hierarchy from the test sources and can be used to correlate results across Hydra implementations.
=== SUMMARY ===
KERNEL TESTS:
Total: 1857
Passed: 1647, Failed: 63, Skipped: 147
Total time: 602.2s
Top 10 slowest tests:
112320.00ms common/inference/Fundamentals/Let terms/Let-polymorphism/#2
83710.00ms common/checking/Nominal types/.../Using kernel types/...
45150.00ms common/checking/Nominal types/.../case Either with polymorphic let bindings
...
The benchmark is designed to support cross-language comparison (see Issue #234). The hydra_path column provides a language-agnostic test identifier that can be matched across implementations:
# Python
common/inference/Fundamentals/Let terms/Nested let/#3 -> 9330ms
# Haskell (future)
common/inference/Fundamentals/Let terms/Nested let/#3 -> 15ms
# Java (future)
common/inference/Fundamentals/Let terms/Nested let/#3 -> 45ms
Compare benchmark results before and after changes:
# Before changes
python3 benchmark_pytest.py -o before.csv
# Make changes...
# After changes
python3 benchmark_pytest.py -o after.csv
# Compare (manually or with diff tools)Some tests are intentionally slow due to algorithmic complexity (e.g., complex polymorphic type inference). These are tagged with disabledForPython and excluded by default. Use --include-slow to benchmark them.
Tests taking >10 seconds typically involve:
- Complex let-polymorphism
- Multi-parameter polymorphic types
- Nested recursive definitions
- Large constraint solving
The benchmark tool (benchmark_pytest.py) works by:
-
Loading the test hierarchy: Imports
HYDRA_PATH_MAPfromtest_suite_runner.py, which maps Python test names to their original Hydra paths (e.g.,test_common_inference_fundamentals_...→common/inference/Fundamentals/...) -
Running pytest with timing: Executes pytest with
--durations=0to capture timing for every test, and-vto capture pass/fail status -
Parsing output: Extracts test names, durations, and status from pytest's output
-
Correlating paths: Uses
HYDRA_PATH_MAPto convert Python test names back to Hydra paths -
Writing CSV: Outputs results sorted by duration (slowest first)
The test_suite_runner.py module builds HYDRA_PATH_MAP by traversing the TestGroup hierarchy from the generated test suite:
# In test_suite_runner.py
HYDRA_PATH_MAP: dict[str, str] = {
"test_common_inference_fundamentals_let_terms_nested_let_case_3":
"common/inference/Fundamentals/Let terms/Nested let/#3",
# ... 1800+ mappings
}This mapping preserves the original test names from the test sources, enabling cross-language correlation.
The Python implementation includes short-circuit optimizations for type substitution that significantly improve performance on complex tests. See python-test-performance.md for details.
- Testing - Complete testing documentation
- Common Test Suite - Test suite architecture
- Issue #234 - Cross-language benchmarking proposal