
docs(may-agent): 40-benchmark-strategy — EverMem + EvoAgent + Evil Agent Bench #20

Open
Fearvox wants to merge 7 commits into main from sleep-iter-15-benchmark-strategy

Conversation

@Fearvox
Owner

@Fearvox Fearvox commented May 13, 2026

Summary

  • 3-benchmark strategy with success criteria per benchmark
  • EverMem Bench: already published (arXiv: 2602.01313), measure gateway overhead
  • EvoAgent Bench: proposed Python adapter for May Agent, target >10% Δ gain
  • Evil Agent Bench: proposed sandbox escape harness, 6 example escape cases
  • Publication strategy for Star growth narrative

Test plan

  • References resolve to actual benchmark READMEs
  • Escape cases cover file read, network egress, self-modification, container escape, supply chain, persistence
  • Success criteria are measurable

🤖 Generated with Claude Code

Fearvox and others added 7 commits May 13, 2026 01:25
…lack mirror sync

Adds two issue templates under .github/ISSUE_TEMPLATE/ for long-lived,
auditable mirrors of in-flight upstream PRs:

- pr_tracker.yml: general PR mirror (scope, evidence, decision log, closure)
- security_tracker.yml: high-priority variant (CWE, severity, reachability,
  verification, disclosure hygiene)

Both carry a `pr-mirror` label so the Linear evermind-dash project and the
Slack #bots channel can subscribe by label. Bilingual EN + 中文.
…tream

Runs every 6 hours via cron + manual workflow_dispatch.
- Rebases fork main onto upstream/main (preserves fork-only commits like
  the issue templates)
- Force-pushes with --force-with-lease for safety
- Opens a tracking issue on conflict instead of failing silently

Uses default GITHUB_TOKEN; no PAT needed since we only push to the fork.

Triggers on issues.opened and issues.labeled. When pr-mirror label is
present, creates a corresponding Linear issue in the EverMind-Dash
project via Linear GraphQL API. Comments back on GitHub with the
EVE-id link.

Idempotency: skips if a '🔗 Linear:' marker comment already exists.
Priority: 'urgent' label -> Linear urgent (1); otherwise medium (3).
On API failure: applies 'sync-failed' label for triage.

Requires (configured separately):
  Secret:   LINEAR_API_KEY    (Linear Personal API key, lin_api_*)
  Vars:     LINEAR_TEAM_ID    (EverMind team UUID)
            LINEAR_PROJECT_ID (EverMind-Dash project UUID)
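
The decision flow described in this commit message can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the workflow's actual code: `planLinearSync`, `labelsAfterFailure`, their argument shapes, and the returned action objects are hypothetical. Only the marker string, the label names, and the priority values (1 and 3) come from the message above.

```javascript
// Hypothetical sketch of the linear-sync decision logic; names are illustrative.
const LINEAR_MARKER = "🔗 Linear:"; // marker string quoted in the commit message

// labels: array of label-name strings; commentBodies: array of comment body strings.
function planLinearSync(labels, commentBodies) {
  if (!labels.includes("pr-mirror")) {
    return { action: "skip", reason: "no pr-mirror label" };
  }
  // Idempotency: a marker comment means a Linear issue was already created.
  if (commentBodies.some((body) => body.includes(LINEAR_MARKER))) {
    return { action: "skip", reason: "marker comment already exists" };
  }
  // 'urgent' label maps to Linear priority 1 (urgent); otherwise 3 (medium).
  const priority = labels.includes("urgent") ? 1 : 3;
  return { action: "create", priority };
}

// On API failure the workflow is described as applying a 'sync-failed' label.
function labelsAfterFailure(labels) {
  return labels.includes("sync-failed") ? labels : [...labels, "sync-failed"];
}
```

Keeping the decision separate from the API calls makes the skip and priority rules easy to check without network access.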
…nnel

Update the disclosure-hygiene checkbox to reference #p-evermind-dash
(the actual Slack channel linked to the EverMind-Dash Linear project)
instead of the placeholder #bots.
…ents

Two compounding fixes to avoid creating multiple Linear issues from a
single GitHub issue creation:

1. concurrency group keyed on issue.number with cancel-in-progress=false
   serializes runs per issue. Second run will see the first run's
   comment and skip via existing idempotency check.

2. Tighten 'labeled' event filter to only fire when the added label is
   pr-mirror itself, not any other label. Eliminates the four extra
   runs that gh issue create --label A --label B ... triggers (one
   issues.opened + four issues.labeled = 5 events for a 4-label create).

Reproduction: gh issue create with 4 labels including pr-mirror was
firing the workflow 5 times concurrently. Idempotency check has a
~5s race window before the first run posts its bot comment, so 2-3
runs created duplicate Linear issues before the rest skipped.

Verified via Issue #4 sync producing both EVE-3 and EVE-4.
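
The tightened event filter from fix 2 can be sketched as follows. This is an illustration, not the workflow's actual condition: `shouldRun` and `eventsForCreate` are hypothetical helpers, and the event objects only imitate the `action`, `label.name`, and `issue.labels[].name` fields of GitHub's `issues` webhook payload. It also assumes the `opened` event already carries the labels passed at creation time.

```javascript
// Hypothetical sketch of the tightened trigger filter; names are illustrative.
const MIRROR_LABEL = "pr-mirror";

function shouldRun(event) {
  if (event.action === "opened") {
    return event.issue.labels.some((l) => l.name === MIRROR_LABEL);
  }
  if (event.action === "labeled") {
    // Only the pr-mirror label itself triggers a run, not any other label.
    return event.label.name === MIRROR_LABEL;
  }
  return false;
}

// Events produced by creating an issue with the given labels:
// one issues.opened plus one issues.labeled per label.
function eventsForCreate(labelNames) {
  const issue = { labels: labelNames.map((name) => ({ name })) };
  return [
    { action: "opened", issue },
    ...labelNames.map((name) => ({ action: "labeled", label: { name }, issue })),
  ];
}

// Label names other than pr-mirror are placeholders for this example.
const events = eventsForCreate(["pr-mirror", "tracking", "urgent", "docs"]);
console.log(events.length);                   // 5
console.log(events.filter(shouldRun).length); // 2
```

Under these assumptions two runs can still start for one issue, which is why the per-issue concurrency group from fix 1 is needed alongside the idempotency check.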
Adds the fork overnight patrol workflow, Linear-aware tracking issue creation, and docs guard support for coming-soon use-case placeholders. Verified with local script checks and passing Docs CI.
…ent Bench

3-benchmark strategy: EverMem (memory recall, existing), EvoAgent
(self-evolution, existing adapter needed), Evil Agent Bench (sandbox
escape, proposed). Success criteria per benchmark. Publication strategy
for Star growth narrative.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Fearvox Fearvox added the pr-mirror and tracking labels May 13, 2026
@Fearvox Fearvox marked this pull request as ready for review May 20, 2026 13:14
Copilot AI review requested due to automatic review settings May 20, 2026 13:14

Copilot AI left a comment


Pull request overview

VERDICT: BLOCK
VERDICT_SUMMARY: Adds benchmark planning docs plus new fork/Linear tracking automation; core automation concept is sound.
Primary risk is inconsistent Linear marker semantics causing idempotency failures / potential duplicate Linear issues.
Next action: align Linear marker prefix usage, then address workflow operational hardening (concurrency + conflict issue spam) and re-scope PR metadata.
EVIDENCE:
- .github/scripts/overnight-watch.mjs: issue marker detection uses "Linear:" and the posted marker is "Linear:" (lines ~222-225, ~283-286), while .github/workflows/linear-sync.yml idempotency requires the exact "🔗 Linear:" prefix (line ~40) and posts "🔗 Linear:" (line ~123).
- .github/workflows/sync-upstream.yml: scheduled rebase + force-with-lease push has no concurrency guard (lines ~1-14) and conflict handling always creates a new issue (lines ~50-60), risking races and issue spam.
- PR metadata/title is benchmark-docs oriented, but the diff includes multiple new GitHub Actions workflows, a new GitHub script, and new issue templates.

Purpose: document a May Agent benchmark publication strategy while adding fork-maintenance automation (overnight watch, upstream sync) and Linear mirroring for tracking issues.

Changes:

  • Add benchmark strategy planning doc covering EverMem, EvoAgent, and a proposed “Evil Agent Bench”.
  • Introduce fork automation: upstream sync workflow, overnight watch workflow + Node script, and Linear mirroring workflow.
  • Adjust docs link validation workflow to tolerate “Coming soon” entries; add PR/security tracking issue templates.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| docs/fork-playground/overnight-watch.md | Documents the fork overnight watch workflow behavior and local invocation. |
| .planning/may-agent/40-benchmark-strategy.md | Draft benchmark strategy and success criteria for EverMem/EvoAgent/Evil Agent benches. |
| .github/workflows/sync-upstream.yml | Scheduled fork sync job that rebases fork main onto upstream and pushes. |
| .github/workflows/overnight-watch.yml | Schedules/runs the overnight watch script and sets required env. |
| .github/workflows/linear-sync.yml | Mirrors pr-mirror labeled GitHub issues to Linear with idempotency marker. |
| .github/workflows/docs.yml | Tweaks docs validation to skip cells marked “Coming soon” without primary links. |
| .github/scripts/overnight-watch.mjs | Implements the overnight watch report + tracking issue + optional Linear mirroring. |
| .github/ISSUE_TEMPLATE/security_tracker.yml | Adds a security mirror issue template with evidence/closure hygiene. |
| .github/ISSUE_TEMPLATE/pr_tracker.yml | Adds a PR tracking/mirroring issue template for Linear/Slack workflows. |

Comment on lines +222 to +225
function issueHasLinearMarker(issueNumber) {
  const comments = ghJson(`/repos/${repoSlug}/issues/${issueNumber}/comments?per_page=100`);
  return comments.some((comment) => comment.body.includes("Linear:"));
}

}

const linearIssue = data.data.issueCreate.issue;
const marker = `Linear: [${linearIssue.identifier}](${linearIssue.url})\n\n_Auto-created by overnight-watch._`;
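
The marker mismatch flagged in the review evidence can be reproduced with a small sketch. It assumes a single shared prefix constant that both the script and the workflow would use. `LINEAR_MARKER_PREFIX`, `formatLinearMarker`, and `hasLinearMarker` are hypothetical names, and the URL is a placeholder.

```javascript
// Hypothetical shared helper; neither file in the PR is known to define it.
const LINEAR_MARKER_PREFIX = "🔗 Linear:";

function formatLinearMarker(identifier, url) {
  return `${LINEAR_MARKER_PREFIX} [${identifier}](${url})`;
}

function hasLinearMarker(commentBodies) {
  return commentBodies.some(
    (body) => typeof body === "string" && body.includes(LINEAR_MARKER_PREFIX)
  );
}

// Marker shape posted by overnight-watch.mjs in the snippet above (no emoji).
const legacy = "Linear: [EVE-3](https://example.invalid/EVE-3)";
const shared = formatLinearMarker("EVE-3", "https://example.invalid/EVE-3");

console.log(hasLinearMarker([legacy]));  // false: the strict prefix is absent
console.log(hasLinearMarker([shared]));  // true
console.log(shared.includes("Linear:")); // true: the looser check still matches
```

The asymmetry is the point: the looser `"Linear:"` check accepts both shapes, while the strict `"🔗 Linear:"` check misses markers posted without the emoji, so a second Linear issue could be created for an issue that already has one.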
Comment on lines +1 to +12
# 40 — Benchmark Strategy: EverMem + EvoAgent Bench Harness Plan

**Status**: Draft
**Date**: 2026-05-13
**Depends on**: 00-vision.md, `benchmarks/EverMemBench/`, `benchmarks/EvoAgentBench/`

## TL;DR

Two benchmark suites validate the May Agent's two differentiators: memory
quality (EverMem Bench) and agent self-evolution (EvoAgent Bench). A third
benchmark (Evil Agent Bench) is proposed for sandbox escape testing. Target:
beat OpenClaw on 2+ axes with publishable results.

permissions:
  contents: write
  issues: write

Comment on lines +50 to +60
- name: Open issue on conflict
  if: failure() && steps.rebase.outputs.conflict == 'true'
  uses: actions/github-script@v7
  with:
    script: |
      await github.rest.issues.create({
        owner: context.repo.owner,
        repo: context.repo.repo,
        title: `[sync] Rebase conflict syncing fork from upstream (${new Date().toISOString().slice(0,10)})`,
        body: `Auto-sync from \`upstream/main\` failed due to rebase conflicts.\n\nRun id: ${context.runId}\nWorkflow: ${context.workflow}\n\nResolve manually:\n\n\`\`\`\ncd ~/EverOS && git fetch upstream && git rebase upstream/main\n# resolve conflicts, then:\ngit push origin main --force-with-lease\n\`\`\``,
        labels: ['tracking']
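
One way to address the issue-spam risk raised in the review is to reuse an open conflict issue instead of always creating a new one. The sketch below shows only that decision. `findOpenConflictIssue` and `planConflictReport` are hypothetical helpers, and the caller is assumed to supply the list of open issues. The title prefix is copied from the step above.

```javascript
// Hypothetical dedupe helper; names and return shapes are illustrative.
const CONFLICT_TITLE_PREFIX = "[sync] Rebase conflict syncing fork from upstream";

// openIssues: array of { number, title } for currently open issues.
function findOpenConflictIssue(openIssues) {
  return (
    openIssues.find((issue) => issue.title.startsWith(CONFLICT_TITLE_PREFIX)) ?? null
  );
}

function planConflictReport(openIssues, runId) {
  const existing = findOpenConflictIssue(openIssues);
  if (existing) {
    // Comment on the existing issue rather than opening another one.
    return {
      action: "comment",
      issueNumber: existing.number,
      body: `Rebase conflict persists. Run id: ${runId}`,
    };
  }
  const date = new Date().toISOString().slice(0, 10);
  return { action: "create", title: `${CONFLICT_TITLE_PREFIX} (${date})` };
}
```

With a 6-hour schedule, an unresolved conflict would then add one comment per run to a single issue instead of opening a new issue on every run.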

Labels

pr-mirror: Long-lived mirror of an upstream PR for Linear/Slack tracking
tracking: Issue tracks a long-lived workflow


2 participants