docs(may-agent): 40-benchmark-strategy — EverMem + EvoAgent + Evil Agent Bench#20
Open
Fearvox wants to merge 7 commits into
…lack mirror sync

Adds two issue templates under .github/ISSUE_TEMPLATE/ for long-lived, auditable mirrors of in-flight upstream PRs:
- pr_tracker.yml: general PR mirror (scope, evidence, decision log, closure)
- security_tracker.yml: high-priority variant (CWE, severity, reachability, verification, disclosure hygiene)

Both carry a `pr-mirror` label so the Linear evermind-dash project and the Slack #bots channel can subscribe by label. Bilingual: English and Chinese.
…tream

Runs every 6 hours via cron + manual workflow_dispatch.
- Rebases fork main onto upstream/main (preserves fork-only commits like the issue templates)
- Force-pushes with --force-with-lease for safety
- Opens a tracking issue on conflict instead of failing silently

Uses default GITHUB_TOKEN; no PAT needed since we only push to fork.
Triggers on issues.opened and issues.labeled. When pr-mirror label is
present, creates a corresponding Linear issue in the EverMind-Dash
project via Linear GraphQL API. Comments back on GitHub with the
EVE-id link.
Idempotency: skips if a '🔗 Linear:' marker comment already exists.
Priority: 'urgent' label -> Linear urgent (1); otherwise medium (3).
On API failure: applies 'sync-failed' label for triage.
Requires (configured separately):
  Secret: LINEAR_API_KEY (Linear Personal API key, lin_api_*)
  Vars:   LINEAR_TEAM_ID (EverMind team UUID)
          LINEAR_PROJECT_ID (EverMind-Dash project UUID)
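
The idempotency and priority rules described in this commit message can be sketched as a few pure functions. This is an illustration only, not the workflow's actual code: the names `hasLinearMarker`, `linearPriority`, and `planSync` are hypothetical, and the use of `startsWith` for the marker check is an assumption. Only the marker text, the `pr-mirror` and `urgent` labels, and the priority numbers come from the message above.

```javascript
// Hypothetical helper names. Only the marker text and the priority
// numbers (urgent -> 1, otherwise 3) come from the commit message.
const LINEAR_MARKER = "🔗 Linear:";

// True when an existing comment already carries the marker, meaning an
// earlier run created the Linear issue. startsWith is an assumption.
function hasLinearMarker(comments) {
  return comments.some((c) => (c.body || "").startsWith(LINEAR_MARKER));
}

// Map GitHub label names to a Linear priority number.
function linearPriority(labelNames) {
  return labelNames.includes("urgent") ? 1 : 3;
}

// Decide what a single workflow run should do for one issue.
function planSync(issue, comments) {
  const labelNames = issue.labels.map((l) => l.name);
  if (!labelNames.includes("pr-mirror")) {
    return { action: "skip", reason: "no pr-mirror label" };
  }
  if (hasLinearMarker(comments)) {
    return { action: "skip", reason: "already mirrored" };
  }
  return { action: "create", priority: linearPriority(labelNames) };
}

const issue = { labels: [{ name: "pr-mirror" }, { name: "urgent" }] };
console.log(planSync(issue, []));
console.log(planSync(issue, [{ body: "🔗 Linear: EVE-0" }]));
```

Keeping the decision separate from the API call makes the skip path easy to check without touching Linear or GitHub.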
…nnel

Update the disclosure-hygiene checkbox to reference #p-evermind-dash (the actual Slack channel linked to the EverMind-Dash Linear project) instead of the placeholder #bots.
…ents

Two compounding fixes to avoid creating multiple Linear issues from a single GitHub issue creation:

1. concurrency group keyed on issue.number with cancel-in-progress=false serializes runs per issue. Second run will see the first run's comment and skip via existing idempotency check.
2. Tighten 'labeled' event filter to only fire when the added label is pr-mirror itself, not any other label. Eliminates the four extra runs that gh issue create --label A --label B ... triggers (one issues.opened + four issues.labeled = 5 events for a 4-label create).

Reproduction: gh issue create with 4 labels including pr-mirror was firing the workflow 5 times concurrently. Idempotency check has a ~5s race window before the first run posts its bot comment, so 2-3 runs created duplicate Linear issues before the rest skipped. Verified via Issue #4 sync producing both EVE-3 and EVE-4.
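
The tightened event filter can be illustrated with a small simulation. This is a sketch under assumptions, not the workflow's real condition. The predicate name `shouldRunSync` is hypothetical, the event objects only mimic the shape of the GitHub `issues` webhook payload, and every label name other than `pr-mirror` is made up for the example.

```javascript
// Hypothetical predicate mirroring the tightened trigger filter.
// `event` mimics the GitHub `issues` webhook payload shape:
// { action, label?: { name }, issue: { labels: [{ name }] } }.
function shouldRunSync(event) {
  const issueLabels = event.issue.labels.map((l) => l.name);
  if (event.action === "opened") return issueLabels.includes("pr-mirror");
  // Only react when the label that was just added is pr-mirror itself.
  if (event.action === "labeled") return event.label?.name === "pr-mirror";
  return false;
}

// Simulate a 4-label create: one opened event plus four labeled events.
// Label names other than pr-mirror are illustrative.
const labels = ["pr-mirror", "tracking", "security", "urgent"].map((name) => ({ name }));
const events = [
  { action: "opened", issue: { labels } },
  ...labels.map((label) => ({ action: "labeled", label, issue: { labels } })),
];

const runs = events.filter(shouldRunSync).length;
console.log(`${events.length} events, ${runs} pass the filter`); // 5 events, 2 pass the filter
```

Under these assumptions two events still pass (the `opened` event and the `labeled` event for `pr-mirror`). That is why the per-issue concurrency group is still needed: it serializes those two runs so the second one sees the first run's marker comment.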
Adds the fork overnight patrol workflow, Linear-aware tracking issue creation, and docs guard support for coming-soon use-case placeholders. Verified with local script checks and passing Docs CI.
…ent Bench

3-benchmark strategy: EverMem (memory recall, existing), EvoAgent (self-evolution, existing adapter needed), Evil Agent Bench (sandbox escape, proposed). Success criteria per benchmark. Publication strategy for Star growth narrative.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pull request overview
VERDICT: BLOCK
VERDICT_SUMMARY: Adds benchmark planning docs plus new fork/Linear tracking automation; core automation concept is sound.
Primary risk is inconsistent Linear marker semantics causing idempotency failures / potential duplicate Linear issues.
Next action: align Linear marker prefix usage, then address workflow operational hardening (concurrency + conflict issue spam) and re-scope PR metadata.
EVIDENCE:
- .github/scripts/overnight-watch.mjs: issue marker detection uses "Linear:" and the posted marker is "Linear:" (lines ~222-225, ~283-286), while .github/workflows/linear-sync.yml idempotency requires the exact "🔗 Linear:" prefix (line ~40) and posts "🔗 Linear:" (line ~123).
- .github/workflows/sync-upstream.yml: scheduled rebase + force-with-lease push has no concurrency guard (lines ~1-14) and conflict handling always creates a new issue (lines ~50-60), risking races and issue spam.
- PR metadata/title is benchmark-docs oriented, but the diff includes multiple new GitHub Actions workflows, a new GitHub script, and new issue templates.
Purpose: document a May Agent benchmark publication strategy while adding fork-maintenance automation (overnight watch, upstream sync) and Linear mirroring for tracking issues.
Changes:
- Add benchmark strategy planning doc covering EverMem, EvoAgent, and a proposed “Evil Agent Bench”.
- Introduce fork automation: upstream sync workflow, overnight watch workflow + Node script, and Linear mirroring workflow.
- Adjust docs link validation workflow to tolerate “Coming soon” entries; add PR/security tracking issue templates.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| docs/fork-playground/overnight-watch.md | Documents the fork overnight watch workflow behavior and local invocation. |
| .planning/may-agent/40-benchmark-strategy.md | Draft benchmark strategy and success criteria for EverMem/EvoAgent/Evil Agent benches. |
| .github/workflows/sync-upstream.yml | Scheduled fork sync job that rebases fork main onto upstream and pushes. |
| .github/workflows/overnight-watch.yml | Schedules/runs the overnight watch script and sets required env. |
| .github/workflows/linear-sync.yml | Mirrors pr-mirror labeled GitHub issues to Linear with idempotency marker. |
| .github/workflows/docs.yml | Tweaks docs validation to skip cells marked “Coming soon” without primary links. |
| .github/scripts/overnight-watch.mjs | Implements the overnight watch report + tracking issue + optional Linear mirroring. |
| .github/ISSUE_TEMPLATE/security_tracker.yml | Adds a security mirror issue template with evidence/closure hygiene. |
| .github/ISSUE_TEMPLATE/pr_tracker.yml | Adds a PR tracking/mirroring issue template for Linear/Slack workflows. |
Comment on lines +222 to +225

    function issueHasLinearMarker(issueNumber) {
      const comments = ghJson(`/repos/${repoSlug}/issues/${issueNumber}/comments?per_page=100`);
      return comments.some((comment) => comment.body.includes("Linear:"));
    }

    }

    const linearIssue = data.data.issueCreate.issue;
    const marker = `Linear: [${linearIssue.identifier}](${linearIssue.url})\n\n_Auto-created by overnight-watch._`;
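
The marker mismatch flagged in the review can be reproduced in isolation. The marker strings below come from the review evidence. The helper names, the sample identifier and URL, and the assumption that `linear-sync.yml` checks with `startsWith` are all hypothetical. The shared-prefix helpers at the end are one possible alignment, not the fix the author chose.

```javascript
// Marker strings come from the review evidence. Helper names and the
// sample identifier/URL are hypothetical.
const watchMarker = "Linear: [EVE-0](https://example.invalid/EVE-0)";   // shape posted by overnight-watch.mjs
const syncMarker = "🔗 Linear: [EVE-0](https://example.invalid/EVE-0)"; // shape posted by linear-sync.yml

const watchDetects = (body) => body.includes("Linear:");     // check in overnight-watch.mjs
const syncDetects = (body) => body.startsWith("🔗 Linear:"); // assumed check in linear-sync.yml

console.log(watchDetects(syncMarker)); // true
console.log(syncDetects(watchMarker)); // false

// One possible alignment: both files post and detect a single shared prefix.
const LINEAR_MARKER_PREFIX = "🔗 Linear:";
const buildMarker = (identifier, url) => `${LINEAR_MARKER_PREFIX} [${identifier}](${url})`;
const hasMarker = (comments) =>
  comments.some((c) => (c.body ?? "").includes(LINEAR_MARKER_PREFIX));

console.log(hasMarker([{ body: buildMarker("EVE-0", "https://example.invalid/EVE-0") }])); // true
```

The asymmetry is the risk: the watch script recognizes markers written by the sync workflow, but not the other way around, so an issue first mirrored by the watch script could be mirrored again. Comments already posted in the older `Linear:` form would not match the shared prefix and would need a one-time migration or a fallback check.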
Comment on lines +1 to +12

    # 40 — Benchmark Strategy: EverMem + EvoAgent Bench Harness Plan

    **Status**: Draft
    **Date**: 2026-05-13
    **Depends on**: 00-vision.md, `benchmarks/EverMemBench/`, `benchmarks/EvoAgentBench/`

    ## TL;DR

    Two benchmark suites validate the May Agent's two differentiators: memory
    quality (EverMem Bench) and agent self-evolution (EvoAgent Bench). A third
    benchmark (Evil Agent Bench) is proposed for sandbox escape testing. Target:
    beat OpenClaw on 2+ axes with publishable results.

    permissions:
      contents: write
      issues: write

Comment on lines +50 to +60

          - name: Open issue on conflict
            if: failure() && steps.rebase.outputs.conflict == 'true'
            uses: actions/github-script@v7
            with:
              script: |
                await github.rest.issues.create({
                  owner: context.repo.owner,
                  repo: context.repo.repo,
                  title: `[sync] Rebase conflict syncing fork from upstream (${new Date().toISOString().slice(0,10)})`,
                  body: `Auto-sync from \`upstream/main\` failed due to rebase conflicts.\n\nRun id: ${context.runId}\nWorkflow: ${context.workflow}\n\nResolve manually:\n\n\`\`\`\ncd ~/EverOS && git fetch upstream && git rebase upstream/main\n# resolve conflicts, then:\ngit push origin main --force-with-lease\n\`\`\``,
                  labels: ['tracking']
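
The review notes that this step creates a new issue on every conflict. One way to limit that, sketched here as an assumption and not as the author's fix, is to look for an open conflict issue first and comment on it when one exists. The names `findOpenConflictIssue` and `openOrReuseConflictIssue` are hypothetical. `github` and `context` stand for the objects that `actions/github-script` injects, and the title prefix and `tracking` label are copied from the snippet above.

```javascript
// Hypothetical helpers for an actions/github-script step. The title
// prefix and the "tracking" label are copied from the reviewed snippet.
// The reuse-and-comment policy is an assumption.
const CONFLICT_TITLE_PREFIX = "[sync] Rebase conflict syncing fork from upstream";

// Pure lookup so it can be checked without network access.
// The list endpoint also returns pull requests, so those are filtered out.
function findOpenConflictIssue(issues) {
  return issues.find(
    (i) => !i.pull_request && i.title.startsWith(CONFLICT_TITLE_PREFIX)
  );
}

async function openOrReuseConflictIssue(github, context, body) {
  const { owner, repo } = context.repo;
  const { data: open } = await github.rest.issues.listForRepo({
    owner, repo, state: "open", labels: "tracking", per_page: 100,
  });

  const existing = findOpenConflictIssue(open);
  if (existing) {
    // Append the new run's details instead of opening a duplicate issue.
    await github.rest.issues.createComment({
      owner, repo, issue_number: existing.number, body,
    });
    return existing.number;
  }

  const date = new Date().toISOString().slice(0, 10);
  const { data: created } = await github.rest.issues.create({
    owner, repo, title: `${CONFLICT_TITLE_PREFIX} (${date})`, body, labels: ["tracking"],
  });
  return created.number;
}
```

Inside the workflow step, the call would look like `await openOrReuseConflictIssue(github, context, body)`, with `body` built the same way as in the existing script.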
Summary
Test plan
🤖 Generated with Claude Code