Fix #1776: Startup blocked for ~100s by synchronous orphan episode recovery + LLM timeout c… #2000

Closed
Memtensor-AI wants to merge 1 commit into dev-v2.0.22 from bugfix/autodev-1776

Conversation

@Memtensor-AI

Collaborator

Description

Fixed issue #1776: memos-local-plugin's core.init() no longer blocks for ~100s on synchronous orphan-episode reflect/reward/L2 work. Root cause was the two await recoverOpenEpisodesAsSessionEnd / recoverDirtyClosedEpisodes calls inside createMemoryCore.init() — each stale orphan fanned out into reflect (LLM 45s × 3 retries) → reward → L2 induction, all gated before startHttpServer could bind.

Solution: split init() so the cheap synchronous classification (lightweight close + topicState=interrupted meta updates for recent open episodes + list building) stays inline, while the slow recovery is scheduled on a module-scoped startupRecoveryPromise whose errors are swallowed via .catch(log.warn). Added a new optional MemoryCore.waitForStartupRecovery() so tests and graceful shutdown can opt into draining the background work; production adapters (adapters/openclaw/index.ts, bridge.cts) intentionally skip it so the HTTP viewer starts immediately. shutdown() now awaits the background promise before tearing down to avoid SQLite-misuse from mid-flight reflect listeners. Added two new log keys: init.background_recovery_started and init.background_recovery_failed.
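
The split described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `createMemoryCoreSketch`, `classifyOrphans`, `recoverOrphans`, and `closeStorage` are hypothetical stand-ins, and the promise is held in a closure here (the PR describes it as module-scoped) so the example stays self-contained.

```typescript
// Hypothetical sketch of the non-blocking startup recovery pattern.
// Only init(), shutdown(), waitForStartupRecovery(), startupRecoveryPromise
// and the log key come from the PR text; everything else is assumed.
type Logger = { warn: (msg: string, err?: unknown) => void };

interface MemoryCoreSketch {
  init(): Promise<void>;
  waitForStartupRecovery?(): Promise<void>;
  shutdown(): Promise<void>;
}

function createMemoryCoreSketch(
  classifyOrphans: () => string[],                  // cheap, synchronous
  recoverOrphans: (ids: string[]) => Promise<void>, // slow: reflect, reward, L2
  closeStorage: () => void,
  log: Logger,
): MemoryCoreSketch {
  let startupRecoveryPromise: Promise<void> | null = null;

  return {
    async init() {
      // Cheap classification stays inline.
      const stale = classifyOrphans();
      if (stale.length === 0) return; // nothing to schedule

      // Slow recovery runs in the background. The .catch() turns any
      // rejection into a log line, so the stored promise never rejects.
      startupRecoveryPromise = recoverOrphans(stale).catch((err) => {
        log.warn("init.background_recovery_failed", err);
      });
    },

    async waitForStartupRecovery() {
      // Resolves immediately when nothing was scheduled.
      await startupRecoveryPromise;
    },

    async shutdown() {
      // Drain background work before closing storage.
      await startupRecoveryPromise;
      closeStorage();
    },
  };
}
```

In this sketch a caller that wants the viewer up quickly awaits only `init()`, while tests and shutdown paths call `waitForStartupRecovery?.()` or `shutdown()` to drain the background work.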

Test evidence: 4 new unit tests under describe("issue #1776 — non-blocking startup recovery") cover fast init() (returns within 500ms even with 3 stale orphans seeded), observable wait, no-op-on-empty, and shutdown-drain. The 3 existing orphan-recovery tests gained a single await core.waitForStartupRecovery?.() line. Test results: tests/unit/pipeline/memory-core.test.ts 32/32 passed; related adapter/server/bridge suites 174/174 passed; full unit suite 1044/1047 passed; tsc --noEmit clean. The 2 remaining failures (tests/unit/storage/{migrator, traces-count}.test.ts) reproduce on the unchanged base branch and are unrelated to this fix.
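
The "fast init" and "observable wait" checks could look something like the sketch below. It is framework-free and runs against a stub, so it is only an illustration of the contract being tested: `CoreLike`, `makeStubCore`, and `checkFastInit` are hypothetical names, and the real tests in `memory-core.test.ts` seed stale orphans rather than using a timer.

```typescript
// Hypothetical sketch; only init() and the optional waitForStartupRecovery()
// are taken from the PR description.
interface CoreLike {
  init(): Promise<void>;
  waitForStartupRecovery?(): Promise<void>;
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Stub core: "recovery" takes recoveryMs, but init() does not await it.
function makeStubCore(recoveryMs: number, done: { value: boolean }): CoreLike {
  let pending: Promise<void> | null = null;
  return {
    async init() {
      pending = sleep(recoveryMs).then(() => {
        done.value = true;
      });
    },
    async waitForStartupRecovery() {
      await pending;
    },
  };
}

// Fails if init() exceeds the budget, or if the background work has not
// completed once waitForStartupRecovery() resolves.
async function checkFastInit(
  core: CoreLike,
  done: { value: boolean },
  budgetMs = 500,
): Promise<number> {
  const start = Date.now();
  await core.init();
  const elapsed = Date.now() - start;
  if (elapsed >= budgetMs) {
    throw new Error(`init() took ${elapsed} ms, budget was ${budgetMs} ms`);
  }
  // Optional call, mirroring the one-line change to the existing tests.
  await core.waitForStartupRecovery?.();
  if (!done.value) {
    throw new Error("recovery had not completed after waiting");
  }
  return elapsed;
}
```

The optional-call form `waitForStartupRecovery?.()` keeps the check usable against a core that does not implement the new method, which matches the method being optional on the interface.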

Files changed: apps/memos-local-plugin/agent-contract/memory-core.ts (interface), apps/memos-local-plugin/core/pipeline/memory-core.ts (init/shutdown refactor + new waitForStartupRecovery), apps/memos-local-plugin/tests/unit/pipeline/memory-core.test.ts (4 new tests + 3 updated). Branch pushed to origin/bugfix/autodev-1776; opsp artifacts (proposal/spec/design/verification-report/task) archived to memos-autodev-specs main.

Related Issue (Required): Fixes #1776

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (does not change functionality, e.g. code style improvements, linting)
  • Documentation update

How Has This Been Tested?

Automated tests are pending.

  • Unit Test
  • Test Script Or Test Steps (please provide)
  • Pipeline Automated API Test (please provide)

Checklist

  • I have performed a self-review of my own code
  • I have commented my code in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • I have created related documentation issue/PR in MemOS-Docs (if applicable)
  • I have linked the issue to this PR (if applicable)
  • I have mentioned the person who will review this PR

@MatthewZhuang, @CarltonXiang, @syzsunshine219, @World-controller please review this PR.

Reviewer Checklist

…1776)

memos-local-plugin's `core.init()` synchronously awaited the entire
reflect / reward / L2 chain for stale orphan episodes left over from the
previous process. With the configured `timeoutMs=45000` and three
typical orphans, the chain accumulated ~100 s of LLM round-trips before
`startHttpServer(...)` was allowed to run, so the OpenClaw gateway was
unreachable for the whole window.

Split the work in `createMemoryCore`:

- `init()` still does the cheap synchronous classification (lightweight
  close + `topicState=interrupted` meta update for recent orphans + list
  building) so the next user turn is correctly routed.
- The slow `recoverOpenEpisodesAsSessionEnd` + `recoverDirtyClosedEpisodes`
  calls are scheduled on a module-scoped `startupRecoveryPromise`, and
  their errors are swallowed via `.catch(log.warn)` so they can never
  wedge shutdown.
- A new optional `MemoryCore.waitForStartupRecovery()` lets tests and
  shutdown await that background work explicitly. Production adapters
  (`adapters/openclaw/index.ts`, `bridge.cts`) intentionally skip it so
  the viewer comes up immediately.
- `shutdown()` awaits the background promise before tearing down so the
  in-flight reflect listeners don't hit a closed SQLite handle.

Adds 4 new unit tests covering the new contract (fast init, observable
wait, no-op when empty, shutdown drain) and threads
`waitForStartupRecovery?.()` into the 3 existing orphan-recovery tests
that depend on the slow path completing. Test results:

  tests/unit/pipeline/memory-core.test.ts   32/32 passed
  related adapter/server/bridge suites     174/174 passed
  tsc --noEmit                              clean

The two pre-existing failures in tests/unit/storage/{migrator,
traces-count}.test.ts reproduce on the unchanged base branch and are
unrelated to this fix.
@Memtensor-AI

Collaborator Author

✅ Automated Test Results: PASSED

No applicable test scope for the changed files — automated tests skipped. Changed paths do not map to any configured scope (env.yaml source_mapping). Manual review recommended.

Branch: bugfix/autodev-1776

@syzsunshine219 syzsunshine219 changed the base branch from dev-20260624-v2.0.22 to dev-v2.0.22 July 1, 2026 07:14
@CarltonXiang CarltonXiang added the plugin Plugin/adapter/bridge layer (apps/ directory) | 插件/适配层 label Jul 2, 2026
@Memtensor-AI

Collaborator Author

Closure recommendation: DO NOT MERGE as-is; superseded by #2002.

This PR fixes the non-blocking startup recovery path for #1776, but #2002 now includes the same startup-recovery mechanism plus the additional dirty-rescore failure backoff for #1808. I resolved and cloud-tested #2002 against dev-v2.0.22 instead:

- #2002 merge conflict resolved
- Cloud test-engine run tr-9960eb48-574 PASSED
- Scope: memos_local_plugin
- Result: 33/33 tests passed

Keeping both PRs open/mergeable would duplicate the same memory-core startup recovery changes and increase conflict risk. Recommendation: close #2000 after confirming #2002 is the intended replacement.

Labels

  • ai-generated
  • bug: Something isn't working | 功能异常
  • plugin: Plugin/adapter/bridge layer (apps/ directory) | 插件/适配层
