feat(openclaw): upgrade integration with multi-user, agent playbooks, and 4-component design (skill, hook, rule, commands) by yilu331 · Pull Request #4 · ReflexioAI/reflexio

yilu331 · 2026-04-13T04:00:27Z

Summary

Upgrade the OpenClaw integration to match Claude Code integration maturity: multi-user support, agent playbook aggregation, per-message search, and skill-driven publishing
Each OpenClaw agent instance (identified by agentId) becomes a unique Reflexio user — playbooks and profiles are isolated per-instance, while agent playbooks are shared across all instances
Single expert mode (no normal/expert split) — one unified set of skill, hook, rule, and commands covers search, publish, and aggregation
Follows the 4-component integration pattern from agent-integration-design.md: Skill (procedural guidance), Hook (deterministic enforcement), Rule (behavioral constraints), Command (user-triggered actions)

Changes

Hook (`handler.ts`)

Rename handler.js to handler.ts: OpenClaw's hook loader compiles TypeScript at gateway startup, and .ts is the canonical handler format
Add resolveUserId(event) with 4-step fallback chain: env var → sessionKey agentId → OpenClaw config → "openclaw" default
Add handleSearchBeforeResponse for per-message message:received search injection (5s timeout, skips short/trivial prompts)
Add triggerAggregationIfNeeded() — fire-and-forget aggregation after successful publish with flag file concurrency guard
Add ensureServerRunning() — auto-starts local Reflexio server at bootstrap and on search connection errors (same pattern as Claude Code's session_start_hook.sh)
Add package-lock.json for reproducible npm installs of better-sqlite3

Skill (`SKILL.md`)

Complete rewrite as unified expert-mode skill: search before tasks + publish on corrections/completions
Two publish scenarios: (1) user correction with rich context, (2) key step completion with learnings
Agent playbook commands, multi-user concept, server management

Rule (`rules/reflexio.md`) — new

Always-active behavioral constraints loaded at every session start
Tells agent to follow injected REFLEXIO_CONTEXT blocks from the search hook
Manual search fallback when hook doesn't fire
Enforces transparency (never mention Reflexio), non-blocking (proceed if unavailable), silent infrastructure (never ask user to manage server)

Commands — new

/reflexio-extract — manual comprehensive conversation extraction
/reflexio-aggregate — manual playbook aggregation trigger with cron guidance

CLI fixes

Fix reflexio search to surface agent playbooks (was passing user_playbooks as agent_playbooks)
Add user_playbooks parameter to format_context() output formatter

Setup wizard (`setup_cmd.py`)

One-command install (reflexio setup openclaw) and uninstall (reflexio setup openclaw --uninstall) for all 4 components
Hook installed by copying (not symlinking) — works around OpenClaw symlink discovery bug (GitHub #11861)
Dynamic command discovery in both install and uninstall

Documentation

README.md: Fully rewritten with multi-user architecture, agent playbooks, cron guidance, all 4 integration components
TESTING.md (new): Step-by-step manual testing guide covering 8 phases — install, auto-start, cold-start search, capture/publish, warm-start retrieval, manual commands, multi-user isolation, graceful degradation, uninstall

Eval suite — new

8 evaluation scenarios: correction capture, retrieval, multi-user isolation, aggregation dedup, search relevance, tool failure extraction, cron aggregation, graceful degradation
Data-driven runner with JSON dataset and dispatch table

E2E tests — new

10 tests across 4 classes: multi-user, aggregation, unified search, graceful degradation

Known Limitations

Hook events: message:received and message:sent only fire in channel sessions (Telegram, WhatsApp, etc.) — not in CLI (openclaw agent -m) or TUI mode. This is an OpenClaw architectural limitation where these events are emitted from dispatchReplyFromConfig() (channel dispatch path only). The skill and rule provide fallback coverage for non-channel modes.
Rule loading: Custom .md files in ~/.openclaw/workspace/ are not loaded as workspace context — OpenClaw only loads specific named files (AGENTS.md, SOUL.md, etc.). The rule content may need to be appended to AGENTS.md instead.

Test Plan

All 10 E2E tests passing: uv run pytest tests/e2e_tests/test_openclaw_integration.py -o 'addopts='
Ruff linting clean on all modified Python files
Review-fix loop completed (6 findings found and fixed, all verified)
Hook loads in OpenClaw gateway (6/6 ready after handler.ts rename + copy install)
agent:bootstrap fires correctly in all modes (CLI, TUI, channel)
Server auto-start triggers on bootstrap when local server is down
Skills discovered: reflexio, reflexio-extract, reflexio-aggregate all ✓ ready
Manual channel-based testing (Telegram) for message:received/message:sent events
Manual testing via TESTING.md

The search command was passing response.user_playbooks to the agent_playbooks parameter of format_context, dropping actual agent playbooks entirely. Fixed to pass response.agent_playbooks correctly, added user_playbooks parameter to format_context to display both types, and updated the count display to report all three result types.
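To make the fix concrete, here is a minimal sketch of a formatter that accepts both playbook types. Only the parameter names `agent_playbooks` and `user_playbooks` come from the description above; the `SearchResponse` shape, the `render_search` helper, the output layout, and the assumption that the third result type is profiles are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SearchResponse:
    # Hypothetical response shape; the real Reflexio response type may differ.
    profiles: list = field(default_factory=list)
    user_playbooks: list = field(default_factory=list)
    agent_playbooks: list = field(default_factory=list)


def format_context(profiles, agent_playbooks, user_playbooks=None):
    """Render all three result types, with a count line covering each."""
    user_playbooks = user_playbooks or []
    lines = [
        f"Found {len(profiles)} profiles, {len(user_playbooks)} user playbooks, "
        f"{len(agent_playbooks)} agent playbooks"
    ]
    for title, items in (
        ("Profiles", profiles),
        ("User playbooks", user_playbooks),
        ("Agent playbooks", agent_playbooks),
    ):
        if items:
            lines.append(f"## {title}")
            lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


def render_search(response):
    # The bug was the equivalent of agent_playbooks=response.user_playbooks,
    # which silently dropped the real agent playbooks.
    return format_context(
        response.profiles,
        agent_playbooks=response.agent_playbooks,
        user_playbooks=response.user_playbooks,
    )
```

Passing both lists by keyword makes a swapped argument much harder to introduce than with positional calls.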

- Add resolveUserId(event) with fallback chain: REFLEXIO_USER_ID env > agentId from sessionKey > ~/.openclaw/openclaw.json > "openclaw"
- Replace all hardcoded `process.env.REFLEXIO_USER_ID || "openclaw"` with resolveUserId(event) in handleBootstrap, handleMessageSent, handleSessionEnd
- Add message:received handler that runs reflexio search before each agent response, injecting results as REFLEXIO_CONTEXT.md bootstrap file
- Update HOOK.md events list and add message:received documentation
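The fallback chain can be sketched as a "first non-empty candidate wins" lookup. The real hook is TypeScript; this Python rendering is only an illustration of the ordering. The session key format `agent:<agentId>:<rest>` and the `config_agent_id` parameter (standing in for whatever value is read from ~/.openclaw/openclaw.json) are assumptions, not confirmed by the commit.

```python
import os

DEFAULT_USER_ID = "openclaw"


def agent_id_from_session_key(session_key):
    """Extract the agent id, ASSUMING keys look like 'agent:<agentId>:<rest>'."""
    if not session_key:
        return None
    parts = session_key.split(":")
    if len(parts) >= 2 and parts[0] == "agent" and parts[1]:
        return parts[1]
    return None


def resolve_user_id(session_key=None, config_agent_id=None, env=None):
    """Return the first non-empty candidate, in the order the commit describes."""
    env = os.environ if env is None else env
    candidates = (
        env.get("REFLEXIO_USER_ID"),             # 1. explicit override
        agent_id_from_session_key(session_key),  # 2. agentId from sessionKey
        config_agent_id,                         # 3. value read from the config file
        DEFAULT_USER_ID,                         # 4. final default
    )
    return next(c for c in candidates if c)
```

Because the last candidate is a non-empty constant, the lookup always returns a value, which is what lets every handler call it unconditionally.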

…blish

After each successful interactions publish, fire-and-forget a detached reflexio agent-playbooks aggregate process. A flag file at ~/.reflexio/logs/.aggregation-running (checked via statSync mtime) prevents concurrent runs within a 5-minute window.
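The flag-file guard can be sketched as follows. The hook itself is TypeScript and uses statSync; this Python version only illustrates the mtime check. The function name `try_acquire_flag` and the choice to refresh a stale flag with `touch` are assumptions.

```python
import time
from pathlib import Path

WINDOW_SECONDS = 5 * 60  # the 5-minute window from the commit message


def try_acquire_flag(flag_path, now=None, window=WINDOW_SECONDS):
    """Return True if the caller may start aggregation, False if one is recent."""
    flag = Path(flag_path)
    now = time.time() if now is None else now
    try:
        age = now - flag.stat().st_mtime
        if age < window:
            return False  # a run started within the window; skip this one
    except FileNotFoundError:
        pass  # no flag yet, so nothing is running
    flag.parent.mkdir(parents=True, exist_ok=True)
    flag.touch()  # create the flag, or refresh its mtime if it was stale
    return True
```

The check and the touch are two separate steps, so this is a best-effort guard rather than a strict lock. That trade-off seems acceptable for a fire-and-forget job where an occasional duplicate run is harmless.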

During `reflexio setup openclaw`, copy each subdirectory of the commands/ directory to ~/.openclaw/skills/<command-name> with dirs_exist_ok=True, installing reflexio-extract and reflexio-aggregate as OpenClaw skills.
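A minimal sketch of that copy step, together with an uninstall that discovers the same set of names, might look like this. The function names and the idea of driving uninstall from the source commands/ directory are assumptions; `shutil.copytree(..., dirs_exist_ok=True)` is the standard-library call (Python 3.8+).

```python
import shutil
from pathlib import Path


def discover_commands(commands_dir):
    """List command names: one per subdirectory of commands_dir."""
    return sorted(p.name for p in Path(commands_dir).iterdir() if p.is_dir())


def install_commands(commands_dir, skills_dir):
    """Copy every command directory into skills_dir/<command-name>."""
    installed = []
    for name in discover_commands(commands_dir):
        # dirs_exist_ok=True lets a repeated setup run overwrite in place
        shutil.copytree(
            Path(commands_dir) / name,
            Path(skills_dir) / name,
            dirs_exist_ok=True,
        )
        installed.append(name)
    return installed


def uninstall_commands(commands_dir, skills_dir):
    """Remove the same set of directories that install would have created."""
    removed = []
    for name in discover_commands(commands_dir):
        target = Path(skills_dir) / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name)
    return removed
```

Deriving both lists from one discovery function keeps install and uninstall in step when a new command directory is added.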

… search, publish, and aggregation

…gregation

…tegration

Tests the full data lifecycle for OpenClaw: multi-user scoped publishing, profile isolation, playbook aggregation producing agent playbooks, unified search returning both user and agent playbooks, and graceful degradation for cold-start and minimal-input scenarios.

…t description

- _uninstall_openclaw() now removes reflexio-extract and reflexio-aggregate command directories from ~/.openclaw/skills/ alongside the main reflexio skill
- Update REFLEXIO_USER_ID default in HOOK.md from 'openclaw' to 'auto (from agentId)' to accurately reflect the 4-step fallback chain behaviour

- F001: set REFLEXIO_URL (not REFLEXIO_API_URL) in eval runner CLI subprocess env
- F002: scope _wait_extraction by user_id to prevent cross-scenario false positives
- F003: replace naive JSON5 comment stripping with string-aware line-by-line parser
- F004: simplify nested ternary in aggregation message logic
- F005: dynamically discover commands in uninstall (match install behavior)
- F006: truncate search prompt to 4096 chars to avoid ARG_MAX overflow
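To illustrate why F003 matters: a naive strip of everything after `//` would truncate any URL stored in the config. The sketch below is one way a string-aware, line-by-line stripper could work. It is a Python illustration and not the code from the fix; it handles only `//` comments and double-quoted strings (no block comments, single quotes, or trailing commas), and the sample keys and URL in the usage lines are made up.

```python
import json


def strip_line_comments(text):
    """Remove // comments line by line, ignoring // inside double-quoted strings."""
    out_lines = []
    for line in text.splitlines():
        in_string = False
        escaped = False
        cut = len(line)
        for i, ch in enumerate(line):
            if escaped:
                escaped = False  # previous char was a backslash inside a string
            elif in_string and ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = not in_string
            elif not in_string and line.startswith("//", i):
                cut = i  # comment starts here
                break
        out_lines.append(line[:cut].rstrip())
    return "\n".join(out_lines)


# Hypothetical config text: the URL must survive, the comments must not.
sample = '{\n  // agent settings\n  "url": "http://localhost:8081", // local\n  "agentId": "sales-bot"\n}'
config = json.loads(strip_line_comments(sample))
```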

… extraction batch gate

…nstraints

Aligns with agent-integration-design.md which specifies 4 components: Skill, Hook, Rule, Command. The Rule ensures agents follow injected Reflexio context, run manual search fallback, and enforce transparency, non-blocking, and silent infrastructure conventions.

- All test scenarios use self-contained prompts (no external project needed)
- Added 'How to view agent logs' section
- Phase 6: Added openclaw agents add/remove commands for creating test instances
- Use realistic agent names with note to substitute user's own names
- Added manual playbook seeding as fallback when extraction batch gate isn't met
- Added troubleshooting entries for multi-agent setup

Add ensureServerRunning() to bootstrap handler — checks server health via 'reflexio status check' and starts it in background if down (local only). Also triggers auto-start from search hook on connection errors, so the next message finds the server ready. Uses flag file (~/.reflexio/logs/.server-starting) to prevent concurrent start attempts, with 2-minute stale cleanup — same pattern as Claude Code. Updated TESTING.md Phase 2 to verify auto-start and Phase 7 to verify auto-recovery after server stop.

…er compatibility

OpenClaw's hook loader expects handler.ts (compiled at gateway startup). Also add package-lock.json for reproducible npm installs.

The reflexio CLI has ~2s Python startup overhead when invoked via symlink, causing the health check to time out and falsely report the server as down. Replace execFileSync('reflexio', ['status', 'check']) with execFileSync('curl', ['-sf', '--max-time', '2', `${serverUrl}/health`]), which completes in milliseconds.

Adds an opt-in LLM relevance-judge rerank stage to search_user_profiles (and the playbook variants), parallel to the existing cross-encoder rerank. The new stage bridges synonym/brand→category gaps that pure lexical/semantic models can't bridge, e.g. "Thrive Market" = grocery service, "Suica card" = Tokyo transit, "TripIt app" = travel-organizer. Cross-encoder upgrades (bge-reranker-v2-m3) were tested and rejected: they don't have the retail-brand world knowledge needed.

Architecture:
- New helper score_pairs_llm() in reflexio/server/llm/rerank/llm_reranker.py
- New prompt rerank_relevance/v1.0.0 (relevance-judge with explicit brand→category and tool→use-case guidance, scoring rubric, and a rule that user-owned tools/cards/apps score 7-9 on help/tips questions)
- New tool arg llm_rerank: bool = False on SearchUserProfilesArgs and the playbook variants
- _maybe_rerank_hits dispatches LLM rerank → cross-encoder → hybrid order in fallback chain; any failure path returns None and the caller falls back gracefully
- Bundle wiring: search-tool handlers now receive llm_client + prompt_manager via _bundle_handler_with_llm

Search prompt v1.10.0 documents llm_rerank in the tool palette and adds targeted exceptions to Patterns A, C, D, F where brand/proper-noun profiles are likely the answer but don't share the question's literal keywords. Pattern B explicitly OPTS OUT (recency dominates; rerank scrambles date order). All exceptions are tightly scoped to the question shape.

Tested:
- 16 unit tests for score_pairs_llm fallback chain
- 10 unit tests for _maybe_rerank_hits dispatch + fallback semantics
- Trip-wire test updated; semver-sort bug in _get_latest_prompt_version fixed (would have locked v1.10.0 → v1.9.0 lexically)
- Smoke test on gpt4_2ba83207 (grocery superlative): Thrive Market ranked #14 baseline → #4 with llm_rerank=True
- Smoke test on 0a34ad58 (Tokyo Suica/TripIt): TripIt missing baseline → #3 with llm_rerank=True
- LongMemEval tune-100 r93 vs r91: 76/100 vs 74/100 (+2 acc); macro 81.6% vs 80.5% (+1.1pt); M-S +14pt (the target gain), SS-P +10pt; K-U regression observed but traced to extraction-time non-determinism (knowledge updates not captured during re-ingest), not the rerank changes

Bundled prompt-bank state catch-up:
- answer_synthesis v1.3.0/v1.4.0 (rules 13/14 from earlier rounds)
- extraction_user_profile v1.1.0/v1.1.1/v1.1.2 (relative-time resolution, started/finished pair preservation)
- compress_session_for_query v1.0.0–v1.3.0 (the in-tool denoiser introduced earlier; currently hard-disabled at the code level)
- Older prompt versions flipped to active: false

Misc:
- LiteLLMClient seeds default to "42" for benchmark reproducibility
- /api/search response now exposes rehydrated_text (set by the search agent when it called read_session_text)
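The semver-sort bug mentioned under "Tested" comes from comparing version strings character by character, where "9" sorts after "1". A minimal sketch of a numeric comparison follows; `version_key` and `latest_version` are hypothetical names, not the actual body of _get_latest_prompt_version, and the sketch assumes plain `MAJOR.MINOR.PATCH` versions without pre-release suffixes.

```python
def version_key(version):
    """Turn 'v1.10.0' or '1.10.0' into (1, 10, 0) for numeric comparison."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def latest_version(versions):
    """Pick the highest version by numeric components, not by string order."""
    return max(versions, key=version_key)


versions = ["v1.9.0", "v1.10.0"]
lexical_pick = max(versions)             # string comparison picks v1.9.0
numeric_pick = latest_version(versions)  # tuple comparison picks v1.10.0
```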

yilu331 force-pushed the feat/openclaw-integration-upgrade branch from 24620cc to f18bb85 Compare April 13, 2026 04:13

yilu331 changed the title ~~feat(openclaw): upgrade integration with multi-user, agent playbooks, per-message search~~ feat(openclaw): upgrade integration with multi-user, agent playbooks, and 4-component design (skill, hook, rule, commands) Apr 13, 2026

yilu331 force-pushed the feat/openclaw-integration-upgrade branch from 4ce7538 to 20dfd47 Compare April 13, 2026 07:17

yilu331 added 16 commits April 13, 2026 02:25

feat: rewrite OpenClaw SKILL.md as unified expert-mode skill covering…

233d0af

… search, publish, and aggregation

docs: update OpenClaw README with multi-user, agent playbooks, and ag…

8495981

…gregation

feat(openclaw): add eval dataset and runner for integration testing

66a90d5

fix: E2E tests seed playbooks/profiles directly instead of relying on…

9dd58ff

… extraction batch gate

docs(openclaw): add manual testing guide, improve uninstall feedback

7409548

fix(openclaw): rename handler.js to handler.ts for OpenClaw hook load…

67c2f2f

…er compatibility

OpenClaw's hook loader expects handler.ts (compiled at gateway startup). Also add package-lock.json for reproducible npm installs.

yilu331 force-pushed the feat/openclaw-integration-upgrade branch from 460b997 to 67c2f2f Compare April 13, 2026 09:59

yilu331 added 2 commits April 13, 2026 10:37

chore: bump version to 0.2.8 for PyPI release

2d4798e

yilu331 merged commit 7b3a913 into main Apr 13, 2026

yyiilluu deleted the feat/openclaw-integration-upgrade branch April 14, 2026 07:20

yilu331 merged 18 commits into main from feat/openclaw-integration-upgrade
