feat(openclaw): upgrade integration with multi-user, agent playbooks, and 4-component design (skill, hook, rule, commands)#4

Merged: yilu331 merged 18 commits into main from feat/openclaw-integration-upgrade on Apr 13, 2026

Conversation

@yilu331 (Collaborator) commented Apr 13, 2026

Summary

  • Upgrade the OpenClaw integration to match Claude Code integration maturity: multi-user support, agent playbook aggregation, per-message search, and skill-driven publishing
  • Each OpenClaw agent instance (identified by agentId) becomes a unique Reflexio user — playbooks and profiles are isolated per-instance, while agent playbooks are shared across all instances
  • Single expert mode (no normal/expert split) — one unified set of skill, hook, rule, and commands covers search, publish, and aggregation
  • Follows the 4-component integration pattern from agent-integration-design.md: Skill (procedural guidance), Hook (deterministic enforcement), Rule (behavioral constraints), Command (user-triggered actions)

Changes

Hook (handler.ts)

  • Renamed from handler.js to handler.ts — OpenClaw's hook loader compiles TypeScript at gateway startup; .ts is the canonical handler format
  • Add resolveUserId(event) with 4-step fallback chain: env var → sessionKey agentId → OpenClaw config → "openclaw" default
  • Add handleSearchBeforeResponse for per-message message:received search injection (5s timeout, skips short/trivial prompts)
  • Add triggerAggregationIfNeeded() — fire-and-forget aggregation after successful publish with flag file concurrency guard
  • Add ensureServerRunning() — auto-starts local Reflexio server at bootstrap and on search connection errors (same pattern as Claude Code's session_start_hook.sh)
  • Added package-lock.json for reproducible npm installs of better-sqlite3
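
The fallback order used by resolveUserId(event) can be sketched as below. This is a Python illustration of the ordering only, not the handler.ts code. The session-key layout (`agent:<agentId>:...`) and the `agentId` config key are assumptions made for the example, as are the function and parameter names.

```python
import json
from pathlib import Path
from typing import Mapping


def resolve_user_id(
    event: Mapping[str, object],
    env: Mapping[str, str],
    config_path: Path,
) -> str:
    """Return a Reflexio user id using the four-step fallback chain."""
    # 1. An explicit environment override wins.
    explicit = env.get("REFLEXIO_USER_ID", "").strip()
    if explicit:
        return explicit

    # 2. agentId parsed from the session key.
    #    ASSUMPTION: keys look like "agent:<agentId>:<rest>".
    parts = str(event.get("sessionKey", "")).split(":")
    if len(parts) >= 2 and parts[0] == "agent" and parts[1]:
        return parts[1]

    # 3. OpenClaw config file.
    #    ASSUMPTION: a top-level "agentId" string in plain JSON.
    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
        agent_id = config.get("agentId") if isinstance(config, dict) else None
        if isinstance(agent_id, str) and agent_id:
            return agent_id
    except (OSError, ValueError):
        pass  # missing or unparsable config: fall through to the default

    # 4. Last resort.
    return "openclaw"
```

Passing `env` and `config_path` as arguments keeps the sketch testable without touching the real environment or `~/.openclaw/openclaw.json`.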

Skill (SKILL.md)

  • Complete rewrite as unified expert-mode skill: search before tasks + publish on corrections/completions
  • Two publish scenarios: (1) user correction with rich context, (2) key step completion with learnings
  • Agent playbook commands, multi-user concept, server management

Rule (rules/reflexio.md) — new

  • Always-active behavioral constraints loaded at every session start
  • Tells agent to follow injected REFLEXIO_CONTEXT blocks from the search hook
  • Manual search fallback when hook doesn't fire
  • Enforces transparency (never mention Reflexio), non-blocking (proceed if unavailable), silent infrastructure (never ask user to manage server)

Commands — new

  • /reflexio-extract — manual comprehensive conversation extraction
  • /reflexio-aggregate — manual playbook aggregation trigger with cron guidance

CLI fixes

  • Fix reflexio search to surface agent playbooks (was passing user_playbooks as agent_playbooks)
  • Add user_playbooks parameter to format_context() output formatter

Setup wizard (setup_cmd.py)

  • One-command install (reflexio setup openclaw) and uninstall (reflexio setup openclaw --uninstall) for all 4 components
  • Hook installed by copying (not symlinking) — works around OpenClaw symlink discovery bug (GitHub #11861)
  • Dynamic command discovery in both install and uninstall

Documentation

  • README.md: Fully rewritten with multi-user architecture, agent playbooks, cron guidance, all 4 integration components
  • TESTING.md (new): Step-by-step manual testing guide in 8 phases, covering install, auto-start, cold-start search, capture/publish, warm-start retrieval, manual commands, multi-user isolation, graceful degradation, and uninstall

Eval suite — new

  • 8 evaluation scenarios: correction capture, retrieval, multi-user isolation, aggregation dedup, search relevance, tool failure extraction, cron aggregation, graceful degradation
  • Data-driven runner with JSON dataset and dispatch table

E2E tests — new

  • 10 tests across 4 classes: multi-user, aggregation, unified search, graceful degradation

Known Limitations

  • Hook events: message:received and message:sent fire only in channel sessions (Telegram, WhatsApp, etc.), not in CLI (openclaw agent -m) or TUI mode. This is an OpenClaw architectural limitation: the events are emitted from dispatchReplyFromConfig(), which runs only on the channel dispatch path. The skill and rule provide fallback coverage for non-channel modes.
  • Rule loading: Custom .md files in ~/.openclaw/workspace/ are not loaded as workspace context — OpenClaw only loads specific named files (AGENTS.md, SOUL.md, etc.). The rule content may need to be appended to AGENTS.md instead.

Test Plan

  • All 10 E2E tests passing: uv run pytest tests/e2e_tests/test_openclaw_integration.py -o 'addopts='
  • Ruff linting clean on all modified Python files
  • Review-fix loop completed (6 findings, all fixed and verified)
  • Hook loads in OpenClaw gateway (6/6 ready after handler.ts rename + copy install)
  • agent:bootstrap fires correctly in all modes (CLI, TUI, channel)
  • Server auto-start triggers on bootstrap when local server is down
  • Skills discovered: reflexio, reflexio-extract, reflexio-aggregate all ✓ ready
  • Manual channel-based testing (Telegram) for message:received/message:sent events
  • Manual testing via TESTING.md

@yilu331 force-pushed the feat/openclaw-integration-upgrade branch from 24620cc to f18bb85 on April 13, 2026 04:13
@yilu331 changed the title from "feat(openclaw): upgrade integration with multi-user, agent playbooks, per-message search" to "feat(openclaw): upgrade integration with multi-user, agent playbooks, and 4-component design (skill, hook, rule, commands)" on Apr 13, 2026
@yilu331 force-pushed the feat/openclaw-integration-upgrade branch from 4ce7538 to 20dfd47 on April 13, 2026 07:17
yilu331 added 16 commits April 13, 2026 02:25
The search command was passing response.user_playbooks to the
agent_playbooks parameter of format_context, dropping actual agent
playbooks entirely. Fixed to pass response.agent_playbooks correctly,
added user_playbooks parameter to format_context to display both types,
and updated the count display to report all three result types.
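
A minimal sketch of the fix described above, assuming a response object with three result lists. The `SearchResponse` shape, the `profiles` field, the `render_search` helper, and the exact `format_context` signature are hypothetical; only the `user_playbooks` and `agent_playbooks` names come from the commit message.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class SearchResponse:
    """Hypothetical response shape used only for this illustration."""
    profiles: List[str] = field(default_factory=list)
    user_playbooks: List[str] = field(default_factory=list)
    agent_playbooks: List[str] = field(default_factory=list)


def format_context(
    profiles: Sequence[str],
    agent_playbooks: Sequence[str],
    user_playbooks: Sequence[str] = (),
) -> str:
    """Render each non-empty result type under its own counted heading."""
    sections = [
        ("Profiles", profiles),
        ("User playbooks", user_playbooks),
        ("Agent playbooks", agent_playbooks),
    ]
    lines: List[str] = []
    for title, items in sections:
        if items:
            lines.append(f"{title} ({len(items)}):")
            lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


def render_search(response: SearchResponse) -> str:
    # The bug was equivalent to agent_playbooks=response.user_playbooks,
    # which silently dropped the real agent playbooks. Each list now goes
    # to the parameter of the same name.
    return format_context(
        profiles=response.profiles,
        agent_playbooks=response.agent_playbooks,
        user_playbooks=response.user_playbooks,
    )
```

Using keyword arguments at the call site makes this kind of mix-up easier to spot in review.
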
- Add resolveUserId(event) with fallback chain: REFLEXIO_USER_ID env >
  agentId from sessionKey > ~/.openclaw/openclaw.json > "openclaw"
- Replace all hardcoded `process.env.REFLEXIO_USER_ID || "openclaw"` with
  resolveUserId(event) in handleBootstrap, handleMessageSent, handleSessionEnd
- Add message:received handler that runs reflexio search before each agent
  response, injecting results as REFLEXIO_CONTEXT.md bootstrap file
- Update HOOK.md events list and add message:received documentation
…blish

After each successful interactions publish, fire-and-forget a detached
reflexio agent-playbooks aggregate process. A flag file at
~/.reflexio/logs/.aggregation-running (checked via statSync mtime)
prevents concurrent runs within a 5-minute window.
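
The mtime-based guard can be sketched roughly as follows. This is a Python illustration of the idea, not the hook's TypeScript; the function name and the `now` parameter are made up for the example, and the check-then-touch sequence is not atomic, so it only reduces the chance of concurrent runs.

```python
import time
from pathlib import Path
from typing import Optional

WINDOW_SECONDS = 5 * 60  # the 5-minute window from the commit message


def try_acquire_flag(
    flag: Path,
    window: float = WINDOW_SECONDS,
    now: Optional[float] = None,
) -> bool:
    """Return True if the caller may start a run, False if one looks active."""
    current = time.time() if now is None else now
    try:
        # A flag touched within the window means a run was started recently.
        if current - flag.stat().st_mtime < window:
            return False
    except FileNotFoundError:
        pass  # no flag file: nothing is running
    # Create the flag, or refresh the mtime of a stale one.
    flag.parent.mkdir(parents=True, exist_ok=True)
    flag.touch()
    return True
```

A caller would spawn the detached aggregation process only when `try_acquire_flag(...)` returns True. A stale flag left by a crashed run expires on its own once the window has passed.
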
During `reflexio setup openclaw`, copy each subdirectory of the commands/
directory to ~/.openclaw/skills/<command-name> with dirs_exist_ok=True,
installing reflexio-extract and reflexio-aggregate as OpenClaw skills.
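
The copy step might look like the sketch below, with the matching dynamic uninstall included for symmetry. The function names are illustrative and not taken from setup_cmd.py; `shutil.copytree(..., dirs_exist_ok=True)` requires Python 3.8 or later.

```python
import shutil
from pathlib import Path
from typing import List


def install_commands(commands_dir: Path, skills_dir: Path) -> List[str]:
    """Copy every subdirectory of commands_dir into skills_dir."""
    installed: List[str] = []
    for entry in sorted(commands_dir.iterdir()):
        if not entry.is_dir():
            continue
        # dirs_exist_ok=True makes re-running the installer safe: files in
        # an existing target directory are overwritten in place.
        shutil.copytree(entry, skills_dir / entry.name, dirs_exist_ok=True)
        installed.append(entry.name)
    return installed


def uninstall_commands(commands_dir: Path, skills_dir: Path) -> List[str]:
    """Remove the skill directories that install_commands would have created."""
    removed: List[str] = []
    for entry in sorted(commands_dir.iterdir()):
        target = skills_dir / entry.name
        if entry.is_dir() and target.is_dir():
            shutil.rmtree(target)
            removed.append(entry.name)
    return removed
```

Discovering command names from the source directory in both directions means a newly added command does not need a second hardcoded list in the uninstaller.
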
…tegration

Tests the full data lifecycle for OpenClaw: multi-user scoped publishing,
profile isolation, playbook aggregation producing agent playbooks, unified
search returning both user and agent playbooks, and graceful degradation
for cold-start and minimal-input scenarios.
…t description

- _uninstall_openclaw() now removes reflexio-extract and reflexio-aggregate
  command directories from ~/.openclaw/skills/ alongside the main reflexio skill
- Update REFLEXIO_USER_ID default in HOOK.md from 'openclaw' to 'auto (from agentId)'
  to accurately reflect the 4-step fallback chain behaviour
- F001: set REFLEXIO_URL (not REFLEXIO_API_URL) in eval runner CLI subprocess env
- F002: scope _wait_extraction by user_id to prevent cross-scenario false positives
- F003: replace naive JSON5 comment stripping with string-aware line-by-line parser
- F004: simplify nested ternary in aggregation message logic
- F005: dynamically discover commands in uninstall (match install behavior)
- F006: truncate search prompt to 4096 chars to avoid ARG_MAX overflow
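
To illustrate why F003 matters: a naive regex that strips everything after `//` also destroys URLs inside string values. A string-aware, line-by-line approach could look like this Python sketch. It is not the actual fix; it handles only `//` comments and double-quoted strings, not block comments, single-quoted strings, or trailing commas.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that sit outside double-quoted strings."""
    cleaned = []
    for line in text.splitlines():
        in_string = False
        escaped = False
        cut = len(line)
        for i, ch in enumerate(line):
            if escaped:
                escaped = False  # this character was escaped; skip it
            elif ch == "\\" and in_string:
                escaped = True
            elif ch == '"':
                in_string = not in_string
            elif not in_string and line.startswith("//", i):
                cut = i  # comment starts here; drop the rest of the line
                break
        cleaned.append(line[:cut].rstrip())
    return "\n".join(cleaned)


def load_commented_json(text: str):
    """Parse JSON that may contain // line comments."""
    return json.loads(strip_line_comments(text))
```

Tracking the `in_string` state per character is what keeps a value such as `"http://host/path"` intact while a trailing `// note` on the same line is still removed.
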
…nstraints

Aligns with agent-integration-design.md which specifies 4 components:
Skill, Hook, Rule, Command. The Rule ensures agents follow injected
Reflexio context, run manual search fallback, and enforce transparency,
non-blocking, and silent infrastructure conventions.
- All test scenarios use self-contained prompts (no external project needed)
- Added 'How to view agent logs' section
- Phase 6: Added openclaw agents add/remove commands for creating test instances
- Use realistic agent names with note to substitute user's own names
- Added manual playbook seeding as fallback when extraction batch gate isn't met
- Added troubleshooting entries for multi-agent setup
Add ensureServerRunning() to bootstrap handler — checks server health via
'reflexio status check' and starts it in background if down (local only).
Also triggers auto-start from search hook on connection errors, so the
next message finds the server ready.

Uses flag file (~/.reflexio/logs/.server-starting) to prevent concurrent
start attempts, with 2-minute stale cleanup — same pattern as Claude Code.

Updated TESTING.md Phase 2 to verify auto-start and Phase 7 to verify
auto-recovery after server stop.
…er compatibility

OpenClaw's hook loader expects handler.ts (compiled at gateway startup).
Also add package-lock.json for reproducible npm installs.
@yilu331 force-pushed the feat/openclaw-integration-upgrade branch from 460b997 to 67c2f2f on April 13, 2026 09:59
yilu331 added 2 commits April 13, 2026 10:37
The reflexio CLI has ~2s Python startup overhead when invoked via
symlink, causing the health check to time out and falsely report the
server as down. Replace execFileSync('reflexio', ['status', 'check'])
with execFileSync('curl', ['-sf', '--max-time', '2', serverUrl/health]),
which completes in milliseconds.
@yilu331 merged commit 7b3a913 into main on Apr 13, 2026
@yyiilluu deleted the feat/openclaw-integration-upgrade branch on April 14, 2026 07:20
yilu331 added a commit that referenced this pull request May 1, 2026
Adds an opt-in LLM relevance-judge rerank stage to search_user_profiles
(and the playbook variants), parallel to the existing cross-encoder
rerank. The new stage closes synonym and brand→category gaps that pure
lexical/semantic models cannot, e.g. "Thrive Market" = grocery
service, "Suica card" = Tokyo transit, "TripIt app" = travel-organizer.
Cross-encoder upgrades (bge-reranker-v2-m3) were tested and rejected:
they don't have the retail-brand world knowledge needed.

Architecture:
- New helper score_pairs_llm() in reflexio/server/llm/rerank/llm_reranker.py
- New prompt rerank_relevance/v1.0.0 (relevance-judge with explicit
  brand→category and tool→use-case guidance, scoring rubric, and a rule
  that user-owned tools/cards/apps score 7-9 on help/tips questions)
- New tool arg llm_rerank: bool = False on SearchUserProfilesArgs and
  the playbook variants
- _maybe_rerank_hits dispatches through a fallback chain (LLM rerank →
  cross-encoder → hybrid order); any failure path returns None and the
  caller falls back gracefully
- Bundle wiring: search-tool handlers now receive llm_client +
  prompt_manager via _bundle_handler_with_llm

Search prompt v1.10.0 documents llm_rerank in the tool palette and adds
targeted exceptions to Patterns A, C, D, F where brand/proper-noun
profiles are likely the answer but don't share the question's literal
keywords. Pattern B explicitly OPTS OUT (recency dominates; rerank
scrambles date order). All exceptions are tightly scoped to the
question shape.

Tested:
- 16 unit tests for score_pairs_llm fallback chain
- 10 unit tests for _maybe_rerank_hits dispatch + fallback semantics
- Trip-wire test updated; semver-sort bug in _get_latest_prompt_version
  fixed (a lexical sort would have picked v1.9.0 over v1.10.0 as latest)
- Smoke test on gpt4_2ba83207 (grocery superlative): Thrive Market
  ranked #14 baseline → #4 with llm_rerank=True
- Smoke test on 0a34ad58 (Tokyo Suica/TripIt): TripIt missing baseline
  → #3 with llm_rerank=True
- LongMemEval tune-100 r93 vs r91: 76/100 vs 74/100 (+2 acc); macro
  81.6% vs 80.5% (+1.1pt); M-S +14pt (the target gain), SS-P +10pt;
  K-U regression observed but traced to extraction-time non-determinism
  (knowledge updates not captured during re-ingest), not the rerank
  changes
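
The semver-sort bug mentioned in the list above is easy to reproduce: compared as strings, "v1.9.0" sorts after "v1.10.0". A minimal sketch of a numeric comparison follows; the helper names are illustrative and this is not the code in _get_latest_prompt_version.

```python
import re
from typing import Iterable, Tuple

_VERSION = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_version(version: str) -> Tuple[int, int, int]:
    """Turn 'v1.10.0' into (1, 10, 0) so components compare as integers."""
    match = _VERSION.match(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def latest_version(versions: Iterable[str]) -> str:
    """Return the highest version by numeric, not lexical, ordering."""
    return max(versions, key=parse_version)
```

With `["v1.9.0", "v1.10.0"]`, plain `max()` compares character by character and returns "v1.9.0", while `latest_version()` returns "v1.10.0".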

Bundled prompt-bank state catch-up:
- answer_synthesis v1.3.0/v1.4.0 (rules 13/14 from earlier rounds)
- extraction_user_profile v1.1.0/v1.1.1/v1.1.2 (relative-time
  resolution, started/finished pair preservation)
- compress_session_for_query v1.0.0–v1.3.0 (the in-tool denoiser
  introduced earlier; currently hard-disabled at the code level)
- Older prompt versions flipped to active: false

Misc:
- LiteLLMClient seeds default to "42" for benchmark reproducibility
- /api/search response now exposes rehydrated_text (set by the search
  agent when it called read_session_text)