feat(evm): enable basic performance check in ci #335
Conversation
- Add matrix strategy (interpreter, multipass) to the
performance_regression_check job so both modes run in parallel;
fail-fast: false so a failure in one doesn't cancel the other.
- Baseline and summary files are namespaced by mode
(perf_baseline_{mode}.json, perf_summary_{mode}.md) to avoid
collisions between matrix runs.
- PR comments include the mode name in the title.
- Add ZEN_ENABLE_JIT_PRECOMPILE_FALLBACK=ON to benchmarksuite CMake
options so the JIT suitability checker is active and prevents
infinite compilation hangs on pathological synth bytecode.
- Add --benchmark-filter option to check_performance_regression.py
to allow callers to override the default external/* filter.
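The filter override described in the last bullet can be sketched with argparse. This is a minimal illustration, not the script's actual code: `build_parser` and the sample filter value are hypothetical, and the help text here mirrors the `external/total/*` default that the review below says the code actually applies.

```python
import argparse

def build_parser():
    # Hypothetical sketch of the new --benchmark-filter option.
    # When left as None, the caller would fall back to the
    # external/total/* filter applied inside run_benchmark().
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--benchmark-filter",
        default=None,
        help="Regex forwarded to evmone-bench --benchmark_filter "
             "(default: external/total/*)",
    )
    return parser

# Example override: restrict the run to a single benchmark family.
args = build_parser().parse_args(["--benchmark-filter", "external/blake2b/*"])
print(args.benchmark_filter)
```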
Co-authored-by: Cursor <cursoragent@cursor.com>
For fork PRs, the GITHUB_TOKEN in pull_request events is always read-only, causing the "Comment on PR" step to fail with 403.

- Write results to $GITHUB_STEP_SUMMARY for always-visible output
- Upload benchmark artifacts for cross-workflow access
- Add continue-on-error and try/catch to the PR comment step
- Add a workflow_run-triggered workflow to post PR comments with write permissions (works for fork PRs after merge to main)

Made-with: Cursor
…heck

The baseline step checked out main and ran run_test_suite.sh, but main has no benchmarksuite case, so no baseline file was produced. The PR step then failed with "Baseline file not found".

Fix: build the DTVM library on the base branch and save only the binary. Benchmarks for both the baseline and PR libraries run from the PR branch's infrastructure via a new BENCHMARK_BASELINE_LIB env var.

Also mitigate CI runner variance:
- Raise the default threshold from 10% to 15%
- Require at least 3 benchmarks to exceed the threshold before flagging a regression (--min-regressions), filtering out single-outlier noise

Made-with: Cursor
…line

Reduce CI benchmark variance on shared GitHub Actions runners:
- Run each benchmark 3x and use the median (--benchmark_repetitions)
- Pin benchmarks to CPU 0 via taskset when available
- Filter to external/total/* to avoid execute/total double-counting
- Skip micro-benchmarks (<5us) that are dominated by noise
- Raise the threshold from 15% to 20% and require >=5 regressions

Cache baseline results with actions/cache keyed by the base branch SHA, so the baseline build and benchmark run are skipped on subsequent pushes to the same PR. On a cache miss the baseline library is built once and benchmarked; the resulting JSON is cached for future runs.

Made-with: Cursor
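The median-plus-minimum-count noise filtering described in these commits can be sketched as follows. The function name and input shape are hypothetical illustrations, but the 20% threshold and the >=5 regression floor match the commit message above.

```python
from statistics import median

def detect_regressions(baseline, current, threshold=0.20, min_regressions=5):
    """Flag a regression only when at least `min_regressions` benchmarks
    slow down by more than `threshold`.

    Hypothetical sketch of the noise-reduction logic: each input maps a
    benchmark name to the list of per-repetition times (ns), and the
    median of the repetitions is compared, filtering single outliers.
    """
    regressed = []
    for name, base_runs in baseline.items():
        cur_runs = current.get(name)
        if not cur_runs:
            continue  # benchmark missing from the PR run
        base_t, cur_t = median(base_runs), median(cur_runs)
        if base_t <= 0:
            continue  # unusable baseline entry; skip rather than divide by zero
        if (cur_t - base_t) / base_t > threshold:
            regressed.append(name)
    return len(regressed) >= min_regressions, regressed
```

With this shape, five benchmarks that are each 30% slower trip the check, while a single 30% outlier does not.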
Pull request overview
Adds a CI-based performance regression check for the EVM path and ensures results are visible and (when possible) posted back to PRs, including fork PRs via a workflow_run follow-up workflow.
Changes:
- Introduces `tools/check_performance_regression.py` to run and parse `evmone-bench` JSON output, save baselines, and detect regressions with noise-reduction knobs.
- Extends `.ci/run_test_suite.sh` with a new `benchmarksuite` mode that builds and runs evmone benchmarks and compares against a baseline.
- Updates CI to run performance checks on PRs, upload artifacts and step summaries, and adds a `workflow_run` workflow to comment results with base-repo permissions.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `tools/check_performance_regression.py` | New benchmark runner/parser + baseline + regression comparator used by CI. |
| `.github/workflows/dtvm_evm_test_x86.yml` | Adds a matrix perf job, uploads artifacts, writes step summaries, attempts a direct PR comment. |
| `.github/workflows/perf_pr_comment.yml` | New `workflow_run` workflow that downloads artifacts and posts/updates a PR comment. |
| `.ci/run_test_suite.sh` | Adds a `benchmarksuite` case to clone evmone and execute the regression check. |
Comments suppressed due to low confidence (1)
tools/check_performance_regression.py:452
The `--benchmark-filter` help text says the default is `external/*`, but `run_benchmark()` actually applies `--benchmark_filter=external/total/*` when none is provided. Please make the help text consistent with the real default (or change the code default if `external/*` is intended).
```python
    "--benchmark-filter",
    default=None,
    help="Custom regex filter forwarded to evmone-bench --benchmark_filter (default: external/*)",
)
```
```sh
cp build/lib/* $EVMONE_DIR/
cd $EVMONE_DIR
```
`set -e` is enabled, so `cp build/lib/* $EVMONE_DIR/` will fail the whole benchmark run if the glob doesn't match (or if `build/lib` contains non-regular files). Since only `libdtvmapi.so` is needed for the benchmark runner, consider copying that specific file (and/or enabling `nullglob` / adding an explicit existence check) to make the CI step more robust.
Suggested change:
```sh
if [ -f "build/lib/libdtvmapi.so" ]; then
  cp build/lib/libdtvmapi.so "$EVMONE_DIR/"
else
  echo "Error: build/lib/libdtvmapi.so not found; cannot run benchmarks." >&2
  exit 1
fi
cd "$EVMONE_DIR"
```
```python
time_change = (c.time_ns - b.time_ns) / b.time_ns
cpu_change = (c.cpu_time_ns - b.cpu_time_ns) / b.cpu_time_ns

max_change = max(time_change, cpu_change)
```
`compare_benchmarks()` divides by `b.time_ns` and `b.cpu_time_ns` directly. If either baseline value is 0 (possible because `parse_benchmark_json()` defaults missing times to 0), this will raise `ZeroDivisionError` and fail the CI run. Please guard against zero/negative baseline times (e.g., skip those entries with a notice, or treat the change as `inf` and mark it as regression/unknown).
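A guard along the lines the reviewer suggests could look like this minimal sketch; `pct_change` is a hypothetical helper, not the script's actual `compare_benchmarks()` implementation.

```python
import math

def pct_change(baseline_ns: float, current_ns: float) -> float:
    """Relative change with a guard for zero/negative baselines.

    Illustrative sketch only: a missing or zero baseline time cannot be
    compared meaningfully, so it is reported as infinite change and the
    caller can decide to skip or flag that entry.
    """
    if baseline_ns <= 0:
        return math.inf
    return (current_ns - baseline_ns) / baseline_ns
```

A caller can then treat `math.inf` as "unknown/regressed" instead of crashing mid-run.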
```python
result = subprocess.run(
    cmd,
    env={**subprocess.os.environ, **env},
```
`subprocess.run(..., env={**subprocess.os.environ, **env})` uses `subprocess.os.environ` instead of `os.environ`. While it may work, it's non-idiomatic and couples this code to an internal detail of the `subprocess` module; please switch to `env={**os.environ, **env}` (and keep the existing merge behavior).
Suggested change:
```python
    env={**os.environ, **env},
```
```sh
# Check for regressions against baseline
python check_performance_regression.py --baseline baseline.json

# Check with custom threshold (default 10%)
```
The header docstring says the default threshold is 10% (and shows `--threshold 0.15` as "custom"), but the argparse default is `0.15` (15%). This mismatch will confuse users and future CI tuning; please update the docstring/examples to reflect the actual defaults (and consider aligning the example text that still references 10%).
Suggested change:
```sh
# Check with explicit threshold (default 15%)
```
```js
for (const mode of ['interpreter', 'multipass']) {
  const dir = `/tmp/perf-${mode}`;
  try {
    if (!prNumber) {
      prNumber = parseInt(fs.readFileSync(`${dir}/pr_number`, 'utf8').trim());
    }
    const outcome = fs.readFileSync(`${dir}/outcome`, 'utf8').trim();
    const summary = fs.readFileSync(`${dir}/summary.md`, 'utf8');
```
This workflow trusts `pr_number` from artifacts generated by a `pull_request` run. For fork PRs, artifact contents are attacker-controlled; with this workflow's write token, a malicious PR can set `pr_number` to any issue/PR and cause the workflow to spam or overwrite comments elsewhere in the repo. Please derive the PR number from `github.event.workflow_run.pull_requests` (and/or verify the artifact PR number matches that list) before posting/updating comments.
```yaml
      - completed

permissions:
  pull-requests: write
```
The workflow calls `github.rest.issues.*` (list/update/create comment), but the top-level `permissions:` does not grant `issues: write`. With fine-grained `GITHUB_TOKEN` permissions this can lead to 403s and the workflow failing to post comments. Please add `issues: write` (or switch to PR-specific APIs that work with only `pull-requests: write`).
Suggested change:
```yaml
  pull-requests: write
  issues: write
```
```yaml
- name: Comment on PR
  if: always()
  uses: actions/github-script@v6
```
`actions/github-script@v6` runs on the deprecated Node 16 runtime; GitHub Actions is removing Node 16 support, which can break this step in the future. Consider bumping to `actions/github-script@v7` (Node 20) to avoid upcoming CI failures.
Suggested change:
```yaml
  uses: actions/github-script@v7
```
1. Does this PR affect any open issues? (Y/N) If yes, add issue references (e.g. "fix #123", "re #123"):
2. What is the scope of this PR (e.g. component or file name):
`.github/workflows/dtvm_evm_test_x86.yml`, `.github/workflows/perf_pr_comment.yml`
3. Provide a description of the PR (e.g. more details, effects, motivations, or doc link):
Enable basic performance regression checking in CI, with fixes for fork PR token permissions.
Root cause of CI failure: For fork PRs (cross-repository), the GITHUB_TOKEN in pull_request events is always read-only — this is a GitHub security feature. The "Comment on PR" step tried to call issues.createComment which requires write permissions, causing a 403 error.
Fix:
- Write results to $GITHUB_STEP_SUMMARY for always-visible output
- Upload benchmark artifacts for cross-workflow access
- Add continue-on-error and try/catch to the PR comment step
- Add a workflow_run-triggered workflow to post PR comments with write permissions (works for fork PRs after merge to main)
4. Are there any breaking changes? (Y/N) If yes, describe the breaking changes (e.g. more details, motivations, or doc link):
5. Are there test cases for these changes? (Y/N) Select and add more details, references, or doc links:
The CI workflow changes can be verified by re-running the PR checks. The "Comment on PR" step should no longer cause a job failure for fork PRs.
6. Release note
None