feat(openclaw): upgrade integration with multi-user, agent playbooks, and 4-component design (skill, hook, rule, commands) #4
Merged
The search command was passing response.user_playbooks to the agent_playbooks parameter of format_context, dropping actual agent playbooks entirely. Fixed to pass response.agent_playbooks correctly, added user_playbooks parameter to format_context to display both types, and updated the count display to report all three result types.
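A minimal sketch of the corrected call, assuming a response object that carries `profiles`, `user_playbooks`, and `agent_playbooks` lists. The `SearchResponse` class, the `profiles` field, the output layout, and the exact `format_context` signature are illustrative assumptions, not the project's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class SearchResponse:
    # Hypothetical stand-in for the CLI's search response object.
    profiles: list[str] = field(default_factory=list)
    user_playbooks: list[str] = field(default_factory=list)
    agent_playbooks: list[str] = field(default_factory=list)


def format_context(profiles, agent_playbooks, user_playbooks=()):
    """Render all three result types under a count header (assumed layout)."""
    lines = [
        f"{len(profiles)} profiles, {len(user_playbooks)} user playbooks, "
        f"{len(agent_playbooks)} agent playbooks"
    ]
    for title, items in (
        ("Profiles", profiles),
        ("User playbooks", user_playbooks),
        ("Agent playbooks", agent_playbooks),
    ):
        if items:
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


def render_search(response: SearchResponse) -> str:
    # The bug passed response.user_playbooks as agent_playbooks here,
    # so real agent playbooks never reached the formatter.
    return format_context(
        response.profiles,
        agent_playbooks=response.agent_playbooks,
        user_playbooks=response.user_playbooks,
    )
```

Passing both playbook lists by keyword makes this kind of argument mix-up harder to reintroduce.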
- Add resolveUserId(event) with fallback chain: REFLEXIO_USER_ID env > agentId from sessionKey > ~/.openclaw/openclaw.json > "openclaw"
- Replace all hardcoded `process.env.REFLEXIO_USER_ID || "openclaw"` with resolveUserId(event) in handleBootstrap, handleMessageSent, handleSessionEnd
- Add message:received handler that runs reflexio search before each agent response, injecting results as REFLEXIO_CONTEXT.md bootstrap file
- Update HOOK.md events list and add message:received documentation
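The fallback chain can be sketched as follows. This is a Python illustration of the logic only (the hook itself is TypeScript). The `agent:<agentId>:...` session-key shape and the `agentId` config key are assumptions that the source does not confirm.

```python
import json
import os
import re
from pathlib import Path


def resolve_user_id(event: dict, env=None, config_path=None) -> str:
    """Return the first user ID found along the four-step fallback chain."""
    env = os.environ if env is None else env
    config_path = config_path or Path.home() / ".openclaw" / "openclaw.json"

    # 1. An explicit environment override always wins.
    if env.get("REFLEXIO_USER_ID"):
        return env["REFLEXIO_USER_ID"]

    # 2. agentId parsed from the session key (assumed "agent:<agentId>:..." shape).
    match = re.match(r"^agent:([^:]+)", event.get("sessionKey") or "")
    if match:
        return match.group(1)

    # 3. OpenClaw config file ("agentId" is a hypothetical key name).
    try:
        config = json.loads(Path(config_path).read_text())
        if isinstance(config, dict) and config.get("agentId"):
            return str(config["agentId"])
    except (OSError, ValueError):
        pass  # Missing or unparsable config: fall through to the default.

    # 4. Last resort.
    return "openclaw"
```

Taking `env` and `config_path` as parameters keeps the function testable without touching the real environment or home directory.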
…blish
After each successful interactions publish, spawn a detached, fire-and-forget reflexio agent-playbooks aggregate process. A flag file at ~/.reflexio/logs/.aggregation-running (checked via statSync mtime) prevents concurrent runs within a 5-minute window.
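A minimal sketch of an mtime-based flag-file guard like the one described above, written in Python for illustration (the hook uses Node's statSync). The function name and the PID payload are hypothetical.

```python
import os
import time
from pathlib import Path

WINDOW_SECONDS = 5 * 60  # The 5-minute window from the commit message.


def try_acquire_flag(flag_path: Path, window: float = WINDOW_SECONDS, now=None) -> bool:
    """Return True if the caller may start a run, refreshing the flag file."""
    now = time.time() if now is None else now
    try:
        age = now - flag_path.stat().st_mtime
        if age < window:
            return False  # A recent run is assumed to still be in progress.
    except FileNotFoundError:
        pass  # No flag yet, so nothing is running.
    flag_path.parent.mkdir(parents=True, exist_ok=True)
    flag_path.write_text(str(os.getpid()))  # Rewriting the file resets its mtime.
    return True
```

The check-then-write sequence is not atomic, so two processes arriving at the same instant could both proceed. For a best-effort background job that trade-off is usually acceptable. A flag older than the window is treated as stale and overwritten.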
During `reflexio setup openclaw`, copy each subdirectory of the commands/ directory to ~/.openclaw/skills/<command-name> with dirs_exist_ok=True, installing reflexio-extract and reflexio-aggregate as OpenClaw skills.
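A sketch of the copy step under stated assumptions: the function name and its arguments are hypothetical, and only `shutil.copytree(..., dirs_exist_ok=True)` is taken from the description above.

```python
import shutil
from pathlib import Path


def install_commands(commands_dir: Path, skills_dir: Path) -> list[str]:
    """Copy each command subdirectory into skills_dir and return the names."""
    installed = []
    for entry in sorted(commands_dir.iterdir()):
        if not entry.is_dir():
            continue  # Only subdirectories are treated as commands.
        # dirs_exist_ok=True lets a repeated setup run overwrite files in place
        # instead of raising FileExistsError.
        shutil.copytree(entry, skills_dir / entry.name, dirs_exist_ok=True)
        installed.append(entry.name)
    return installed
```

Called with the package's commands/ directory and `Path.home() / ".openclaw" / "skills"`, this would produce the ~/.openclaw/skills/<command-name> layout described above. `dirs_exist_ok` requires Python 3.8 or later.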
… search, publish, and aggregation
…tegration
Tests the full data lifecycle for OpenClaw: multi-user scoped publishing, profile isolation, playbook aggregation producing agent playbooks, unified search returning both user and agent playbooks, and graceful degradation for cold-start and minimal-input scenarios.
…t description
- _uninstall_openclaw() now removes reflexio-extract and reflexio-aggregate command directories from ~/.openclaw/skills/ alongside the main reflexio skill
- Update REFLEXIO_USER_ID default in HOOK.md from 'openclaw' to 'auto (from agentId)' to accurately reflect the 4-step fallback chain behaviour
- F001: set REFLEXIO_URL (not REFLEXIO_API_URL) in eval runner CLI subprocess env
- F002: scope _wait_extraction by user_id to prevent cross-scenario false positives
- F003: replace naive JSON5 comment stripping with string-aware line-by-line parser
- F004: simplify nested ternary in aggregation message logic
- F005: dynamically discover commands in uninstall (match install behavior)
- F006: truncate search prompt to 4096 chars to avoid ARG_MAX overflow
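The string-aware approach named in F003 can be illustrated with a small line-by-line scanner. This is a Python sketch with a hypothetical function name, not the project's parser. It assumes only `//` comments and single-line string literals, and it does not handle block comments or trailing commas.

```python
def strip_line_comments(text: str) -> str:
    """Remove // comments that are not inside a quoted string."""
    out_lines = []
    for line in text.splitlines():
        in_string = None  # Holds the active quote character, if any.
        escaped = False
        cut = len(line)
        for i, ch in enumerate(line):
            if in_string:
                if escaped:
                    escaped = False  # This character was escaped; skip it.
                elif ch == "\\":
                    escaped = True
                elif ch == in_string:
                    in_string = None  # Closing quote.
            elif ch in "\"'":
                in_string = ch
            elif ch == "/" and line[i + 1 : i + 2] == "/":
                cut = i  # Comment starts here, outside any string.
                break
        out_lines.append(line[:cut].rstrip())
    return "\n".join(out_lines)
```

A naive strip that cuts at the first `//` would truncate any value containing a URL such as `"http://..."`. Tracking quote state avoids that.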
… extraction batch gate
…nstraints
Aligns with agent-integration-design.md which specifies 4 components: Skill, Hook, Rule, Command. The Rule ensures agents follow injected Reflexio context, run manual search fallback, and enforce transparency, non-blocking, and silent infrastructure conventions.
- All test scenarios use self-contained prompts (no external project needed)
- Added 'How to view agent logs' section
- Phase 6: Added openclaw agents add/remove commands for creating test instances
- Use realistic agent names with note to substitute user's own names
- Added manual playbook seeding as fallback when extraction batch gate isn't met
- Added troubleshooting entries for multi-agent setup
Add ensureServerRunning() to bootstrap handler — checks server health via 'reflexio status check' and starts it in background if down (local only). Also triggers auto-start from search hook on connection errors, so the next message finds the server ready. Uses flag file (~/.reflexio/logs/.server-starting) to prevent concurrent start attempts, with 2-minute stale cleanup — same pattern as Claude Code. Updated TESTING.md Phase 2 to verify auto-start and Phase 7 to verify auto-recovery after server stop.
…er compatibility
OpenClaw's hook loader expects handler.ts (compiled at gateway startup). Also add package-lock.json for reproducible npm installs.
The reflexio CLI has ~2s Python startup overhead when invoked via
symlink, causing the health check to time out and falsely report the
server as down. Replace execFileSync('reflexio', ['status', 'check'])
with execFileSync('curl', ['-sf', '--max-time', '2', `${serverUrl}/health`]),
which completes in milliseconds.
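The same idea, probing the /health endpoint directly with a short timeout instead of spawning the CLI, can be sketched in Python with the standard library standing in for curl. The function name and the default timeout are illustrative. The `/health` path is the one named in the commit message.

```python
import http.client
import urllib.error
import urllib.request


def is_server_healthy(server_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <server_url>/health answers with a 2xx status."""
    url = f"{server_url.rstrip('/')}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers connection refused, timeouts, HTTP error statuses
        # (urlopen raises HTTPError for 4xx/5xx), and malformed URLs.
        return False
```

Like `curl -sf`, this treats any HTTP error status as a failure, and it avoids interpreter startup cost because no subprocess is launched.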
yilu331 added a commit that referenced this pull request on May 1, 2026
Adds an opt-in LLM relevance-judge rerank stage to search_user_profiles (and the playbook variants), parallel to the existing cross-encoder rerank. The new stage bridges synonym/brand→category gaps that pure lexical/semantic models can't bridge, e.g. "Thrive Market" = grocery service, "Suica card" = Tokyo transit, "TripIt app" = travel-organizer. Cross-encoder upgrades (bge-reranker-v2-m3) were tested and rejected: they don't have the retail-brand world knowledge needed.

Architecture:
- New helper score_pairs_llm() in reflexio/server/llm/rerank/llm_reranker.py
- New prompt rerank_relevance/v1.0.0 (relevance-judge with explicit brand→category and tool→use-case guidance, scoring rubric, and a rule that user-owned tools/cards/apps score 7-9 on help/tips questions)
- New tool arg llm_rerank: bool = False on SearchUserProfilesArgs and the playbook variants
- _maybe_rerank_hits dispatches LLM rerank → cross-encoder → hybrid order in fallback chain; any failure path returns None and the caller falls back gracefully
- Bundle wiring: search-tool handlers now receive llm_client + prompt_manager via _bundle_handler_with_llm

Search prompt v1.10.0 documents llm_rerank in the tool palette and adds targeted exceptions to Patterns A, C, D, F where brand/proper-noun profiles are likely the answer but don't share the question's literal keywords. Pattern B explicitly OPTS OUT (recency dominates; rerank scrambles date order). All exceptions are tightly scoped to the question shape.

Tested:
- 16 unit tests for score_pairs_llm fallback chain
- 10 unit tests for _maybe_rerank_hits dispatch + fallback semantics
- Trip-wire test updated; semver-sort bug in _get_latest_prompt_version fixed (would have locked v1.10.0 → v1.9.0 lexically)
- Smoke test on gpt4_2ba83207 (grocery superlative): Thrive Market ranked #14 baseline → #4 with llm_rerank=True
- Smoke test on 0a34ad58 (Tokyo Suica/TripIt): TripIt missing baseline → #3 with llm_rerank=True
- LongMemEval tune-100 r93 vs r91: 76/100 vs 74/100 (+2 acc); macro 81.6% vs 80.5% (+1.1pt); M-S +14pt (the target gain), SS-P +10pt; K-U regression observed but traced to extraction-time non-determinism (knowledge updates not captured during re-ingest), not the rerank changes

Bundled prompt-bank state catch-up:
- answer_synthesis v1.3.0/v1.4.0 (rules 13/14 from earlier rounds)
- extraction_user_profile v1.1.0/v1.1.1/v1.1.2 (relative-time resolution, started/finished pair preservation)
- compress_session_for_query v1.0.0–v1.3.0 (the in-tool denoiser introduced earlier; currently hard-disabled at the code level)
- Older prompt versions flipped to active: false

Misc:
- LiteLLMClient seeds default to "42" for benchmark reproducibility
- /api/search response now exposes rehydrated_text (set by the search agent when it called read_session_text)
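The semver-sort fix mentioned under Tested comes down to comparing version components numerically rather than as strings. A minimal sketch follows. `parse_version` and `latest_version` are hypothetical names, not `_get_latest_prompt_version` itself, and the sketch assumes plain `vMAJOR.MINOR.PATCH` tags without pre-release suffixes.

```python
def parse_version(tag: str) -> tuple[int, ...]:
    """Turn 'v1.10.0' into (1, 10, 0) so that comparison is numeric."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def latest_version(tags: list[str]) -> str:
    """Pick the highest version by numeric components."""
    return max(tags, key=parse_version)


tags = ["v1.9.0", "v1.10.0", "v1.2.3"]
lexical = max(tags)            # "v1.9.0": the character "9" sorts after "1"
numeric = latest_version(tags)  # "v1.10.0"
```

With string comparison, `v1.9.0` outranks `v1.10.0` because the comparison stops at the first differing character, which is the lock-in the commit message describes.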
Summary
- … `agentId`) becomes a unique Reflexio user: playbooks and profiles are isolated per instance, while agent playbooks are shared across all instances
- `agent-integration-design.md`: Skill (procedural guidance), Hook (deterministic enforcement), Rule (behavioral constraints), Command (user-triggered actions)

Changes

Hook (`handler.ts`)
- `handler.js` to `handler.ts`: OpenClaw's hook loader compiles TypeScript at gateway startup; `.ts` is the canonical handler format
- `resolveUserId(event)` with 4-step fallback chain: env var → sessionKey agentId → OpenClaw config → "openclaw" default
- `handleSearchBeforeResponse` for per-message `message:received` search injection (5s timeout, skips short/trivial prompts)
- `triggerAggregationIfNeeded()`: fire-and-forget aggregation after successful publish with flag file concurrency guard
- `ensureServerRunning()`: auto-starts local Reflexio server at bootstrap and on search connection errors (same pattern as Claude Code's `session_start_hook.sh`)
- `package-lock.json` for reproducible npm installs of `better-sqlite3`

Skill (`SKILL.md`)

Rule (`rules/reflexio.md`, new)
- `REFLEXIO_CONTEXT` blocks from the search hook

Commands (new)
- `/reflexio-extract`: manual comprehensive conversation extraction
- `/reflexio-aggregate`: manual playbook aggregation trigger with cron guidance

CLI fixes
- … `reflexio search` to surface agent playbooks (was passing `user_playbooks` as `agent_playbooks`)
- `user_playbooks` parameter to `format_context()` output formatter

Setup wizard (`setup_cmd.py`)
- … `reflexio setup openclaw`) and uninstall (`reflexio setup openclaw --uninstall`) for all 4 components

Documentation

Eval suite (new)

E2E tests (new)
Known Limitations
- `message:received` and `message:sent` only fire in channel sessions (Telegram, WhatsApp, etc.), not in CLI (`openclaw agent -m`) or TUI mode. This is an OpenClaw architectural limitation: these events are emitted from `dispatchReplyFromConfig()` (channel dispatch path only). The skill and rule provide fallback coverage for non-channel modes.
- … `.md` files in `~/.openclaw/workspace/` are not loaded as workspace context; OpenClaw only loads specific named files (`AGENTS.md`, `SOUL.md`, etc.). The rule content may need to be appended to `AGENTS.md` instead.

Test Plan
- `uv run pytest tests/e2e_tests/test_openclaw_integration.py -o 'addopts='`
- `agent:bootstrap` fires correctly in all modes (CLI, TUI, channel)
- `reflexio`, `reflexio-extract`, `reflexio-aggregate` all ✓ ready
- … `message:received` / `message:sent` events