docs(may-agent): 40-benchmark-strategy — EverMem + EvoAgent + Evil Agent Bench by Fearvox · Pull Request #20 · Fearvox/EverOS

Fearvox · 2026-05-13T07:15:35Z

Summary

- 3-benchmark strategy with success criteria per benchmark
- EverMem Bench: already published (arXiv: 2602.01313), measure gateway overhead
- EvoAgent Bench: proposed Python adapter for May Agent, target >10% Δ gain
- Evil Agent Bench: proposed sandbox escape harness, 6 example escape cases
- Publication strategy for Star growth narrative

Test plan

- References resolve to actual benchmark READMEs
- Escape cases cover file read, network egress, self-modification, container escape, supply chain, persistence
- Success criteria are measurable
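
The escape-case coverage item in the test plan could be checked mechanically. The following is a minimal sketch, assuming each escape case carries a `category` field. The case objects, their ids, and the `missingCategories` helper are hypothetical and not taken from the PR; only the six category names come from the test plan.

```javascript
// The six categories named in the test plan.
const REQUIRED_CATEGORIES = [
  "file read",
  "network egress",
  "self-modification",
  "container escape",
  "supply chain",
  "persistence",
];

// Hypothetical helper: returns the required categories that no case covers.
function missingCategories(cases) {
  const covered = new Set(cases.map((c) => c.category));
  return REQUIRED_CATEGORIES.filter((name) => !covered.has(name));
}

// Hypothetical case list with "persistence" left out on purpose.
const exampleCases = [
  { id: "case-1", category: "file read" },
  { id: "case-2", category: "network egress" },
  { id: "case-3", category: "self-modification" },
  { id: "case-4", category: "container escape" },
  { id: "case-5", category: "supply chain" },
];

console.log(missingCategories(exampleCases)); // [ 'persistence' ]
```

An empty result would mean every category in the test plan has at least one case.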

🤖 Generated with Claude Code

…lack mirror sync

Adds two issue templates under .github/ISSUE_TEMPLATE/ for long-lived, auditable mirrors of in-flight upstream PRs:

- pr_tracker.yml: general PR mirror (scope, evidence, decision log, closure)
- security_tracker.yml: high-priority variant (CWE, severity, reachability, verification, disclosure hygiene)

Both carry a `pr-mirror` label so the Linear evermind-dash project and the Slack #bots channel can subscribe by label. Bilingual: English and Chinese.

…tream

Runs every 6 hours via cron, plus manual workflow_dispatch.

- Rebases fork main onto upstream/main (preserves fork-only commits like the issue templates)
- Force-pushes with --force-with-lease for safety
- Opens a tracking issue on conflict instead of failing silently

Uses the default GITHUB_TOKEN; no PAT is needed since we only push to the fork.

Triggers on issues.opened and issues.labeled. When the pr-mirror label is present, creates a corresponding Linear issue in the EverMind-Dash project via the Linear GraphQL API and comments back on GitHub with the EVE-id link.

- Idempotency: skips if a '🔗 Linear:' marker comment already exists.
- Priority: 'urgent' label -> Linear urgent (1); otherwise medium (3).
- On API failure: applies the 'sync-failed' label for triage.

Requires (configured separately):

- Secret: LINEAR_API_KEY (Linear Personal API key, lin_api_*)
- Vars: LINEAR_TEAM_ID (EverMind team UUID), LINEAR_PROJECT_ID (EverMind-Dash project UUID)
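
The decision rules in this commit message can be sketched as two pure functions. This is an illustrative sketch, not the workflow's actual code: `planSync` and `afterCreate` are hypothetical names, and the marker check is passed in as a boolean so no API call is needed. The label names and priority values are the ones stated above.

```javascript
// Label names and priority values as described in the commit message.
const MIRROR_LABEL = "pr-mirror";
const URGENT_LABEL = "urgent";
const FAILURE_LABEL = "sync-failed";
const PRIORITY_URGENT = 1;
const PRIORITY_MEDIUM = 3;

// Hypothetical helper: decide what a run should do before calling Linear.
// `hasMarker` is the result of the '🔗 Linear:' comment check.
function planSync(labelNames, hasMarker) {
  if (!labelNames.includes(MIRROR_LABEL)) {
    return { action: "skip", reason: "not a pr-mirror issue" };
  }
  if (hasMarker) {
    return { action: "skip", reason: "already mirrored" };
  }
  const priority = labelNames.includes(URGENT_LABEL)
    ? PRIORITY_URGENT
    : PRIORITY_MEDIUM;
  return { action: "create", priority };
}

// Hypothetical helper: follow-up step once the Linear API call has returned.
function afterCreate(apiSucceeded) {
  return apiSucceeded
    ? { action: "comment-marker" }
    : { action: "add-label", label: FAILURE_LABEL };
}

console.log(planSync(["pr-mirror", "urgent"], false)); // create, priority 1
console.log(planSync(["pr-mirror"], true)); // skip, already mirrored
console.log(afterCreate(false)); // add-label, sync-failed
```

Keeping the decision separate from the network calls makes the skip and priority rules easy to test without a Linear API key.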

…nnel Update the disclosure-hygiene checkbox to reference #p-evermind-dash (the actual Slack channel linked to the EverMind-Dash Linear project) instead of the placeholder #bots.

…ents

Two compounding fixes to avoid creating multiple Linear issues from a single GitHub issue creation:

1. A concurrency group keyed on issue.number with cancel-in-progress=false serializes runs per issue. The second run sees the first run's comment and skips via the existing idempotency check.
2. The 'labeled' event filter is tightened to fire only when the added label is pr-mirror itself, not any other label. This eliminates the four extra runs that gh issue create --label A --label B ... triggers (one issues.opened + four issues.labeled = 5 events for a 4-label create).

Reproduction: gh issue create with 4 labels including pr-mirror fired the workflow 5 times concurrently. The idempotency check has a ~5s race window before the first run posts its bot comment, so 2-3 runs created duplicate Linear issues before the rest skipped. Verified via Issue #4 sync producing both EVE-3 and EVE-4.
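
The tightened 'labeled' filter can be illustrated with a small predicate. This is a sketch under assumptions: `labeledEventMatches` is a hypothetical name, the event objects only mimic the `action` and `label.name` fields of a GitHub `issues` webhook payload, and the label `label-d` is made up to reach four labels.

```javascript
const MIRROR_LABEL = "pr-mirror";

// Hypothetical predicate for the tightened filter: a `labeled` event counts
// only when the label that was just added is pr-mirror itself.
function labeledEventMatches(event) {
  return event.action === "labeled" && event.label.name === MIRROR_LABEL;
}

// The four `labeled` events produced by creating an issue with four labels.
const labeledEvents = ["pr-mirror", "tracking", "urgent", "label-d"].map(
  (name) => ({ action: "labeled", label: { name } })
);

const matching = labeledEvents.filter(labeledEventMatches);
console.log(`${matching.length} of ${labeledEvents.length} labeled events match`);
// prints "1 of 4 labeled events match"
```

Any run that still overlaps with the `issues.opened` run is then handled by the concurrency group and the idempotency check described in the first fix.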

Adds the fork overnight patrol workflow, Linear-aware tracking issue creation, and docs guard support for coming-soon use-case placeholders. Verified with local script checks and passing Docs CI.

…ent Bench

3-benchmark strategy: EverMem (memory recall, existing), EvoAgent (self-evolution, existing; adapter needed), Evil Agent Bench (sandbox escape, proposed). Success criteria per benchmark. Publication strategy for Star growth narrative.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

VERDICT: BLOCK
VERDICT_SUMMARY: Adds benchmark planning docs plus new fork/Linear tracking automation; core automation concept is sound.
Primary risk is inconsistent Linear marker semantics causing idempotency failures / potential duplicate Linear issues.
Next action: align Linear marker prefix usage, then address workflow operational hardening (concurrency + conflict issue spam) and re-scope PR metadata.
EVIDENCE:
- .github/scripts/overnight-watch.mjs: issue marker detection uses "Linear:" and the posted marker is "Linear:" (lines ~222-225, ~283-286), while .github/workflows/linear-sync.yml idempotency requires the exact "🔗 Linear:" prefix (line ~40) and posts "🔗 Linear:" (line ~123).
- .github/workflows/sync-upstream.yml: scheduled rebase + force-with-lease push has no concurrency guard (lines ~1-14) and conflict handling always creates a new issue (lines ~50-60), risking races and issue spam.
- PR metadata/title is benchmark-docs oriented, but the diff includes multiple new GitHub Actions workflows, a new GitHub script, and new issue templates.

Purpose: document a May Agent benchmark publication strategy while adding fork-maintenance automation (overnight watch, upstream sync) and Linear mirroring for tracking issues.

Changes:

Add benchmark strategy planning doc covering EverMem, EvoAgent, and a proposed “Evil Agent Bench”.
Introduce fork automation: upstream sync workflow, overnight watch workflow + Node script, and Linear mirroring workflow.
Adjust docs link validation workflow to tolerate “Coming soon” entries; add PR/security tracking issue templates.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.


| File | Description |
| --- | --- |
| docs/fork-playground/overnight-watch.md | Documents the fork overnight watch workflow behavior and local invocation. |
| .planning/may-agent/40-benchmark-strategy.md | Draft benchmark strategy and success criteria for EverMem/EvoAgent/Evil Agent benches. |
| .github/workflows/sync-upstream.yml | Scheduled fork sync job that rebases fork `main` onto upstream and pushes. |
| .github/workflows/overnight-watch.yml | Schedules/runs the overnight watch script and sets required env. |
| .github/workflows/linear-sync.yml | Mirrors `pr-mirror` labeled GitHub issues to Linear with idempotency marker. |
| .github/workflows/docs.yml | Tweaks docs validation to skip cells marked “Coming soon” without primary links. |
| .github/scripts/overnight-watch.mjs | Implements the overnight watch report + tracking issue + optional Linear mirroring. |
| .github/ISSUE_TEMPLATE/security_tracker.yml | Adds a security mirror issue template with evidence/closure hygiene. |
| .github/ISSUE_TEMPLATE/pr_tracker.yml | Adds a PR tracking/mirroring issue template for Linear/Slack workflows. |
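
The docs.yml change is described only at a high level here, so the following is a rough sketch of what a "Coming soon" exemption might look like, not the workflow's real check. It assumes that a validated table cell normally has to contain a Markdown link; the regular expression, the `cellPassesLinkCheck` name, and the sample cells are all hypothetical.

```javascript
// Assumed shape of a "primary link": a Markdown link such as [text](target).
const MARKDOWN_LINK = /\[[^\]]+\]\([^)]+\)/;

// Hypothetical check: placeholder cells pass without a link,
// every other cell must contain one.
function cellPassesLinkCheck(cell) {
  const text = cell.trim();
  if (/coming soon/i.test(text)) {
    return true;
  }
  return MARKDOWN_LINK.test(text);
}

console.log(cellPassesLinkCheck("[Guide](docs/guide.md)")); // true
console.log(cellPassesLinkCheck("Coming soon")); // true
console.log(cellPassesLinkCheck("Guide")); // false
```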

+function issueHasLinearMarker(issueNumber) {
+  const comments = ghJson(`/repos/${repoSlug}/issues/${issueNumber}/comments?per_page=100`);
+  return comments.some((comment) => comment.body.includes("Linear:"));
+}
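
One way to address the marker mismatch raised in the review is to have both the script and the workflow build and detect the comment through a single constant. The sketch below is an assumption about how that could look, not code from the PR: `LINEAR_MARKER`, `buildLinearMarker`, and `hasLinearMarker` are hypothetical names, and the URL is a placeholder.

```javascript
// Single source of truth for the marker prefix (the form linear-sync.yml expects).
const LINEAR_MARKER = "🔗 Linear:";

// Hypothetical builder: every writer posts the same prefix.
function buildLinearMarker(identifier, url, source) {
  return `${LINEAR_MARKER} [${identifier}](${url})\n\n_Auto-created by ${source}._`;
}

// Hypothetical detector: tolerates comments with a missing body.
function hasLinearMarker(comments) {
  return comments.some(
    (c) => typeof c.body === "string" && c.body.includes(LINEAR_MARKER)
  );
}

// A comment written with the bare "Linear:" prefix is invisible to the strict check.
const bare = [{ body: "Linear: [EVE-3](https://linear.example/EVE-3)" }];
const aligned = [
  { body: buildLinearMarker("EVE-3", "https://linear.example/EVE-3", "overnight-watch") },
];

console.log(hasLinearMarker(bare)); // false
console.log(hasLinearMarker(aligned)); // true
```

The first result shows the failure mode in the review: a marker posted without the emoji prefix would not stop the strict check from creating a second Linear issue.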


+  }
+
+  const linearIssue = data.data.issueCreate.issue;
+  const marker = `Linear: [${linearIssue.identifier}](${linearIssue.url})\n\n_Auto-created by overnight-watch._`;


+# 40 — Benchmark Strategy: EverMem + EvoAgent Bench Harness Plan
+
+**Status**: Draft
+**Date**: 2026-05-13
+**Depends on**: 00-vision.md, `benchmarks/EverMemBench/`, `benchmarks/EvoAgentBench/`
+
+## TL;DR
+
+Two benchmark suites validate the May Agent's two differentiators: memory
+quality (EverMem Bench) and agent self-evolution (EvoAgent Bench). A third
+benchmark (Evil Agent Bench) is proposed for sandbox escape testing. Target:
+beat OpenClaw on 2+ axes with publishable results.


+permissions:
+  contents: write
+  issues: write
+


+      - name: Open issue on conflict
+        if: failure() && steps.rebase.outputs.conflict == 'true'
+        uses: actions/github-script@v7
+        with:
+          script: |
+            await github.rest.issues.create({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              title: `[sync] Rebase conflict syncing fork from upstream (${new Date().toISOString().slice(0,10)})`,
+              body: `Auto-sync from \`upstream/main\` failed due to rebase conflicts.\n\nRun id: ${context.runId}\nWorkflow: ${context.workflow}\n\nResolve manually:\n\n\`\`\`\ncd ~/EverOS && git fetch upstream && git rebase upstream/main\n# resolve conflicts, then:\ngit push origin main --force-with-lease\n\`\`\``,
+              labels: ['tracking']
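
The review notes that this step always creates a new issue. A possible guard, sketched here under assumptions, is to look for an open conflict issue first and comment on it instead. `findOpenConflictIssue` and `planConflictReport` are hypothetical helpers; in the workflow the `openIssues` input would presumably come from a call such as `github.rest.issues.listForRepo` with `state: 'open'`, and the comment would go through `github.rest.issues.createComment`.

```javascript
// Title prefix used by the conflict step above (the date suffix varies per day).
const CONFLICT_TITLE_PREFIX = "[sync] Rebase conflict syncing fork from upstream";

// Hypothetical helper: first open issue whose title matches the prefix, or null.
function findOpenConflictIssue(openIssues) {
  return (
    openIssues.find((issue) => issue.title.startsWith(CONFLICT_TITLE_PREFIX)) ??
    null
  );
}

// Hypothetical helper: comment on the existing issue, otherwise create one.
function planConflictReport(openIssues, runId) {
  const existing = findOpenConflictIssue(openIssues);
  if (existing) {
    return {
      action: "comment",
      issueNumber: existing.number,
      body: `Rebase conflict persists. Run id: ${runId}`,
    };
  }
  const date = new Date().toISOString().slice(0, 10);
  return { action: "create", title: `${CONFLICT_TITLE_PREFIX} (${date})` };
}

// Example input: one unrelated issue and one earlier conflict issue.
const openIssues = [
  { number: 34, title: "[watch] Overnight fork patrol: 2026-05-18" },
  { number: 41, title: `${CONFLICT_TITLE_PREFIX} (2026-05-19)` },
];

console.log(planConflictReport(openIssues, 123).action); // "comment"
console.log(planConflictReport([], 123).action); // "create"
```

The issue number 41 and run id 123 are made-up values for the example.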


Fearvox and others added 7 commits May 13, 2026 01:25

docs(templates): align security_tracker Slack reference to actual cha…

bee6f1d


ci(watch): add overnight fork patrol

fe80ca1


Fearvox added the pr-mirror and tracking labels May 13, 2026

github-actions Bot mentioned this pull request May 18, 2026

[watch] Overnight fork patrol: 2026-05-18 #34

Open

github-actions Bot force-pushed the main branch from b98bc32 to 02d63e8 May 19, 2026 15:53

Fearvox marked this pull request as ready for review May 20, 2026 13:14

Copilot AI review requested due to automatic review settings May 20, 2026 13:14

Copilot started reviewing on behalf of Fearvox May 20, 2026 13:14

Copilot AI reviewed May 20, 2026


github-actions Bot mentioned this pull request May 21, 2026

[watch] Overnight fork patrol: 2026-05-21 #40

Open

Fearvox wants to merge 7 commits into main from sleep-iter-15-benchmark-strategy
