This repo stores benchmark results for CursorCult rules and rulesets.
Benchmarks live in repos named:

- `_benchmark_<RULE>` (benchmarks a specific rule pack)
Benchmarks should publish their results here via PRs so results are:
- versioned and reviewable
- easy to browse
- easy to compare across rule versions (`v0`, `v1`, `v2`, …)
This repo pins the exact versions used to generate results via git submodules:
- `_metrics/` -> `CursorCult/_metrics` (standard metrics)
- `rules/<RULE>/_benchmark/` -> `CursorCult/_benchmark_<RULE>` (rule benchmark implementation)
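For context, a pinning like this can be created with standard `git submodule add` commands. The sketch below is hypothetical and assumes GitHub hosting under a `CursorCult` org; the actual remote URLs may differ:

```
# Hypothetical setup commands; actual remote URLs may differ.
git submodule add https://github.com/CursorCult/_metrics _metrics
git submodule add https://github.com/CursorCult/_benchmark_TDD rules/TDD/_benchmark
```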
Updating a submodule pointer is what triggers “regenerate results”.
- Open a PR against `CursorCult/_results`.
- Update one or more submodules (e.g. bump `_metrics`, or bump `rules/TDD/_benchmark`); an end-to-end sketch follows this list.
- Regenerate results for what changed. The `scripts/generate_changed_results.py` script orchestrates running the benchmarks multiple times and averaging the results:
  ```
  git submodule update --init --recursive
  python3 scripts/generate_changed_results.py --bench TDD --runs 10
  ```

  - `--runs N`: specifies the number of times to execute each benchmark case. Results are averaged.
- Commit the updated `rules/**/RESULTS.md` files and push.
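As a rough end-to-end example, bumping the TDD benchmark pointer, regenerating, and committing might look like the following; the remote name, branch, and output path are illustrative and depend on the benchmark and the languages it covers:

```
# Move the TDD benchmark submodule to a newer commit (branch name is illustrative).
git -C rules/TDD/_benchmark fetch origin
git -C rules/TDD/_benchmark checkout origin/main

# Regenerate results for the changed benchmark, averaging over 10 runs.
python3 scripts/generate_changed_results.py --bench TDD --runs 10

# Commit the bumped pointer together with the regenerated results.
git add rules/TDD/_benchmark rules/TDD/python/RESULTS.md
git commit -m "Bump TDD benchmark and regenerate results"
```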
CI checks that submodule pointer bumps are accompanied by corresponding RESULTS.md updates (it does not re-run LLM benchmarks).
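The exact CI check is an implementation detail; a minimal sketch of the idea, assuming it diffs the PR head against its base branch, could look roughly like:

```
# Hypothetical sketch of the CI gate, not the actual implementation.
BASE="${BASE:-origin/main}"
CHANGED=$(git diff --name-only "$BASE"...HEAD)

# If any submodule pointer moved, require a RESULTS.md change in the same PR.
if echo "$CHANGED" | grep -qE '^(_metrics|rules/[^/]+/_benchmark)$'; then
  if ! echo "$CHANGED" | grep -q 'RESULTS\.md$'; then
    echo "Submodule bump without a corresponding RESULTS.md update" >&2
    exit 1
  fi
fi
```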
Results are laid out as:

- `rules/<RULE>/<language>/RESULTS.md` for per-rule results (example: `rules/TDD/python/RESULTS.md`)
- `rulesets/<RULESET>/<language>/RESULTS.md` for per-ruleset results
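Putting these paths together with the submodules and the script mentioned above, the tree roughly looks like this (illustrative, not exhaustive):

```
_metrics/                      # submodule -> CursorCult/_metrics
scripts/
  generate_changed_results.py
rules/
  TDD/
    _benchmark/                # submodule -> CursorCult/_benchmark_TDD
    python/
      RESULTS.md
rulesets/
  <RULESET>/
    <language>/
      RESULTS.md
```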
Ruleset result files should summarize/aggregate results from the relevant per-rule benchmarks.
Each RESULTS.md file should include, at minimum:

- the benchmark repo link (e.g. `CursorCult/_benchmark_TDD`)
- what was measured
- a table of results by rule version (`v0`, `v1`, `v2`, …); a hypothetical skeleton follows this list
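A hypothetical skeleton for such a file might look like the following; the metric names and all values are placeholders, not real results:

```
# TDD (python)

Benchmark: CursorCult/_benchmark_TDD
Measured: <one-line description of the metric(s)>

| Rule version | <metric> | Runs |
| ------------ | -------- | ---- |
| v0           | …        | …    |
| v1           | …        | …    |
| v2           | …        | …    |
```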
License: Unlicense / public domain. See LICENSE.