
feat(evm): enable basic performance check in ci #335

Merged
zoowii merged 9 commits into DTVMStack:main from starwarfan:perf-ci
Feb 28, 2026

Conversation


@starwarfan starwarfan commented Feb 5, 2026

1. Does this PR affect any open issues? (Y/N) Add issue references (e.g. "fix #123", "re #123"):

  • N
  • Y

2. What is the scope of this PR (e.g. component or file name):

`.github/workflows/dtvm_evm_test_x86.yml`, `.github/workflows/perf_pr_comment.yml`

3. Provide a description of the PR (e.g. more details, effects, motivations or doc link):

  • Affects user behaviors
  • Contains CI/CD configuration changes
  • Contains documentation changes
  • Contains experimental features
  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Other

Enable basic performance regression checking in CI, with fixes for fork PR token permissions.

Root cause of CI failure: For fork PRs (cross-repository), the GITHUB_TOKEN in pull_request events is always read-only — this is a GitHub security feature. The "Comment on PR" step tried to call issues.createComment which requires write permissions, causing a 403 error.

Fix:

  1. Write to GITHUB_STEP_SUMMARY: Performance results are always visible in the GitHub Actions job summary, regardless of token permissions.
  2. Upload artifacts: Benchmark results and PR metadata are uploaded as workflow artifacts for cross-workflow access.
  3. Graceful PR comment: The direct PR comment step now uses continue-on-error with try/catch, so it succeeds for same-repo PRs and fails silently for fork PRs.
  4. New workflow_run workflow: A separate perf_pr_comment.yml workflow triggers on workflow_run events. Since it runs in the context of the base branch (main), it has write permissions and can post PR comments for fork PRs. This takes effect after merge to main.
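The GITHUB_STEP_SUMMARY approach in point 1 can be sketched as follows. This is a hedged illustration, not code from this PR; `publish_summary` is a hypothetical helper:

```python
import os

def publish_summary(markdown: str) -> None:
    """Append benchmark results to the GitHub Actions job summary.

    GITHUB_STEP_SUMMARY points at a file the runner renders as Markdown.
    Appending to it needs no GITHUB_TOKEN permissions at all, so it works
    even for fork PRs where the token is read-only.
    """
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not summary_path:
        # Not running under GitHub Actions; fall back to stdout.
        print(markdown)
        return
    with open(summary_path, "a", encoding="utf-8") as fh:
        fh.write(markdown + "\n")

publish_summary("## Performance Regression Check\n\nNo regressions detected.")
```

Because the summary is rendered per-job, each matrix mode (interpreter, multipass) gets its own visible result without any comment API call.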

4. Are there any breaking changes? (Y/N) If yes, describe them (e.g. more details, motivations or doc link):

  • N
  • Y

5. Are there test cases for these changes? (Y/N) Select and add more details, references or doc links:

  • Unit test
  • Integration test
  • Benchmark (add benchmark stats below)
  • Manual test (add detailed scripts or steps below)
  • Other

The CI workflow changes can be verified by re-running the PR checks. The "Comment on PR" step should no longer cause a job failure for fork PRs.

6. Release note

None

@starwarfan starwarfan marked this pull request as draft February 5, 2026 05:51
@starwarfan starwarfan force-pushed the perf-ci branch 4 times, most recently from 76e887c to 7e327e9 on February 9, 2026 06:31
starwarfan and others added 6 commits February 25, 2026 03:32
- Add matrix strategy (interpreter, multipass) to the
  performance_regression_check job so both modes run in parallel;
  fail-fast: false so a failure in one doesn't cancel the other.
- Baseline and summary files are namespaced by mode
  (perf_baseline_{mode}.json, perf_summary_{mode}.md) to avoid
  collisions between matrix runs.
- PR comments include the mode name in the title.
- Add ZEN_ENABLE_JIT_PRECOMPILE_FALLBACK=ON to benchmarksuite CMake
  options so the JIT suitability checker is active and prevents
  infinite compilation hangs on pathological synth bytecode.
- Add --benchmark-filter option to check_performance_regression.py
  to allow callers to override the default external/* filter.

Co-authored-by: Cursor <cursoragent@cursor.com>
For fork PRs, the GITHUB_TOKEN in pull_request events is always
read-only, causing the "Comment on PR" step to fail with 403.

- Write results to $GITHUB_STEP_SUMMARY for always-visible output
- Upload benchmark artifacts for cross-workflow access
- Add continue-on-error and try/catch to the PR comment step
- Add workflow_run-triggered workflow to post PR comments with
  write permissions (works for fork PRs after merge to main)

Made-with: Cursor
…heck

The baseline step checked out main and ran run_test_suite.sh, but main
has no benchmarksuite case, so no baseline file was produced. The PR
step then failed with "Baseline file not found".

Fix: build the DTVM library on the base branch and save only the
binary. Benchmarks for both baseline and PR libraries run from the PR
branch's infrastructure via a new BENCHMARK_BASELINE_LIB env var.

Also mitigate CI runner variance:
- Raise default threshold from 10% to 15%
- Require at least 3 benchmarks to exceed threshold before flagging
  regression (--min-regressions), filtering out single-outlier noise

Made-with: Cursor
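The --min-regressions gating described in the commit message above can be sketched roughly like this. Function and parameter names are illustrative, not the actual code in check_performance_regression.py:

```python
def flag_regression(changes, threshold=0.15, min_regressions=3):
    """Decide whether a benchmark run counts as a real regression.

    `changes` maps benchmark name -> fractional time change vs. baseline
    (0.20 means 20% slower). Requiring several benchmarks to exceed the
    threshold before failing the check filters out the single-outlier
    noise that shared CI runners routinely produce.
    """
    regressed = [name for name, change in changes.items() if change > threshold]
    return len(regressed) >= min_regressions
```

With this gate, one benchmark spiking 30% on a noisy runner does not fail the job, but a broad slowdown across several benchmarks still does.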
@starwarfan starwarfan marked this pull request as ready for review February 26, 2026 12:36
…line

Reduce CI benchmark variance on shared GitHub Actions runners:
- Run each benchmark 3x and use the median (--benchmark_repetitions)
- Pin benchmarks to CPU 0 via taskset when available
- Filter to external/total/* to avoid execute/total double-counting
- Skip micro-benchmarks (<5us) that are dominated by noise
- Raise threshold from 15% to 20% and require >=5 regressions
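A rough sketch of the median-and-skip aggregation above, under assumed names (the real logic lives in check_performance_regression.py):

```python
from statistics import median

def aggregate_runs(samples_ns, min_time_ns=5_000):
    """Collapse repeated benchmark runs to one stable value per benchmark.

    `samples_ns` maps benchmark name -> list of per-repetition times in
    nanoseconds. Taking the median of the repetitions damps scheduler
    noise; benchmarks faster than min_time_ns (5 us) are dropped
    entirely, since timer jitter dominates at that scale.
    """
    aggregated = {}
    for name, times in samples_ns.items():
        m = median(times)
        if m >= min_time_ns:
            aggregated[name] = m
    return aggregated
```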

Cache baseline results with actions/cache keyed by base branch SHA
so the baseline build and benchmark run are skipped on subsequent
pushes to the same PR. On cache miss the baseline library is built
once and benchmarked; the resulting JSON is cached for future runs.

Made-with: Cursor
@zoowii zoowii requested a review from Copilot February 28, 2026 07:35
@zoowii zoowii merged commit 56b42b8 into DTVMStack:main Feb 28, 2026
15 checks passed

Copilot AI left a comment


Pull request overview

Adds a CI-based performance regression check for the EVM path and ensures results are visible and (when possible) posted back to PRs, including fork PRs via a workflow_run follow-up workflow.

Changes:

  • Introduces tools/check_performance_regression.py to run/parse evmone-bench JSON output, save baselines, and detect regressions with noise-reduction knobs.
  • Extends .ci/run_test_suite.sh with a new benchmarksuite mode that builds/runs evmone benchmarks and compares against a baseline.
  • Updates CI to run performance checks on PRs, upload artifacts + step summaries, and adds a workflow_run workflow to comment results with base-repo permissions.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File Description
tools/check_performance_regression.py New benchmark runner/parser + baseline + regression comparator used by CI.
.github/workflows/dtvm_evm_test_x86.yml Adds a matrix perf job, uploads artifacts, writes step summaries, attempts direct PR comment.
.github/workflows/perf_pr_comment.yml New workflow_run workflow that downloads artifacts and posts/updates a PR comment.
.ci/run_test_suite.sh Adds benchmarksuite case to clone evmone and execute regression checking.
Comments suppressed due to low confidence (1)

tools/check_performance_regression.py:452

  • --benchmark-filter help text says the default is external/*, but run_benchmark() actually applies --benchmark_filter=external/total/* when none is provided. Please make the help text consistent with the real default (or change the code default if external/* is intended).
        "--benchmark-filter",
        default=None,
        help="Custom regex filter forwarded to evmone-bench --benchmark_filter (default: external/*)",
    )


Comment on lines +184 to +186
cp build/lib/* $EVMONE_DIR/

cd $EVMONE_DIR

Copilot AI Feb 28, 2026


set -e is enabled, so cp build/lib/* $EVMONE_DIR/ will fail the whole benchmark run if the glob doesn’t match (or if build/lib contains non-regular files). Since only libdtvmapi.so is needed for the benchmark runner, consider copying that specific file (and/or enabling nullglob / adding an explicit existence check) to make the CI step more robust.

Suggested change
cp build/lib/* $EVMONE_DIR/
cd $EVMONE_DIR
if [ -f "build/lib/libdtvmapi.so" ]; then
cp build/lib/libdtvmapi.so "$EVMONE_DIR/"
else
echo "Error: build/lib/libdtvmapi.so not found; cannot run benchmarks." >&2
exit 1
fi
cd "$EVMONE_DIR"

Comment on lines +241 to +244
time_change = (c.time_ns - b.time_ns) / b.time_ns
cpu_change = (c.cpu_time_ns - b.cpu_time_ns) / b.cpu_time_ns

max_change = max(time_change, cpu_change)

Copilot AI Feb 28, 2026


compare_benchmarks() divides by b.time_ns and b.cpu_time_ns directly. If either baseline value is 0 (possible because parse_benchmark_json() defaults missing times to 0), this will raise ZeroDivisionError and fail the CI run. Please guard against zero/negative baseline times (e.g., skip those entries with a notice, or treat change as inf and mark as regression/unknown).
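One way to guard the division, along the lines the comment suggests. This is a hedged sketch; `percent_change` is a hypothetical helper, not the reviewed code:

```python
import math

def percent_change(baseline_ns, current_ns):
    """Fractional change of current vs. baseline, guarding zero baselines.

    parse_benchmark_json() can default missing times to 0, and dividing
    by a zero baseline would raise ZeroDivisionError mid-CI. A
    non-positive baseline is reported as +inf so the caller can flag the
    entry as unknown/regressed instead of crashing the whole run.
    """
    if baseline_ns <= 0:
        return math.inf
    return (current_ns - baseline_ns) / baseline_ns
```

The comparator would then take `max(percent_change(b.time_ns, c.time_ns), percent_change(b.cpu_time_ns, c.cpu_time_ns))` and treat an infinite result as a special case rather than an ordinary regression percentage.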


result = subprocess.run(
cmd,
env={**subprocess.os.environ, **env},

Copilot AI Feb 28, 2026


subprocess.run(..., env={**subprocess.os.environ, **env}) is using subprocess.os.environ instead of os.environ. While it may work, it's non-idiomatic and couples this code to an internal detail of the subprocess module; please switch to env={**os.environ, **env} (and keep the existing merge behavior).

Suggested change
env={**subprocess.os.environ, **env},
env={**os.environ, **env},

Copilot uses AI. Check for mistakes.
# Check for regressions against baseline
python check_performance_regression.py --baseline baseline.json

# Check with custom threshold (default 10%)

Copilot AI Feb 28, 2026


The header docstring says the default threshold is 10% (and shows --threshold 0.15 as “custom”), but the argparse default is 0.15 (15%). This mismatch will confuse users and future CI tuning; please update the docstring/examples to reflect the actual defaults (and consider aligning the example text that still references 10%).

Suggested change
# Check with custom threshold (default 10%)
# Check with explicit threshold (default 15%)

Comment on lines +45 to +52
for (const mode of ['interpreter', 'multipass']) {
const dir = `/tmp/perf-${mode}`;
try {
if (!prNumber) {
prNumber = parseInt(fs.readFileSync(`${dir}/pr_number`, 'utf8').trim());
}
const outcome = fs.readFileSync(`${dir}/outcome`, 'utf8').trim();
const summary = fs.readFileSync(`${dir}/summary.md`, 'utf8');

Copilot AI Feb 28, 2026


This workflow trusts pr_number from artifacts generated by a pull_request run. For fork PRs, artifact contents are attacker-controlled; with this workflow’s write token, a malicious PR can set pr_number to any issue/PR and cause the workflow to spam or overwrite comments elsewhere in the repo. Please derive the PR number from github.event.workflow_run.pull_requests (and/or verify the artifact PR number matches that list) before posting/updating comments.
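The suggested validation could look roughly like this, shown in Python for illustration only (the actual workflow step uses actions/github-script, and `safe_pr_number` is a hypothetical helper):

```python
def safe_pr_number(artifact_value: str, run_pr_numbers) -> int:
    """Validate a PR number read from an untrusted artifact.

    Artifacts produced by a fork's pull_request run are attacker-
    controlled, so the value is only accepted if it matches a PR that
    the triggering workflow_run event actually belongs to
    (run_pr_numbers, taken from github.event.workflow_run.pull_requests).
    """
    try:
        candidate = int(artifact_value.strip())
    except ValueError:
        raise ValueError(f"artifact pr_number is not an integer: {artifact_value!r}")
    if candidate not in set(run_pr_numbers):
        raise ValueError(f"artifact pr_number {candidate} does not match the triggering run")
    return candidate
```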

- completed

permissions:
pull-requests: write

Copilot AI Feb 28, 2026


The workflow calls github.rest.issues.* (list/update/create comment), but the top-level permissions: does not grant issues: write. On fine-grained GITHUB_TOKEN permissions this can lead to 403s and the workflow failing to post comments. Please add issues: write (or switch to PR-specific APIs that work with only pull-requests: write).

Suggested change
pull-requests: write
pull-requests: write
issues: write


- name: Comment on PR
if: always()
uses: actions/github-script@v6

Copilot AI Feb 28, 2026


actions/github-script@v6 runs on the deprecated Node 16 runtime; GitHub Actions is removing Node 16 support, which can break this step in the future. Consider bumping to actions/github-script@v7 (Node 20) to avoid upcoming CI failures.

Suggested change
uses: actions/github-script@v6
uses: actions/github-script@v7
