
bug(hermes): _bridge_ok cache permanently disables provider after transient startup failure #1797

@MuBeiGe

Summary

daemon_manager.py's ensure_bridge_running(probe_only=True) caches _bridge_ok as a module-level global with no expiry mechanism. Once set to False by a transient failure (subprocess creation race, env-var load ordering), it stays False for the lifetime of the gateway process. This causes MemTensorProvider.is_available() to return False permanently, and agent_init.py skips adding the provider — MemOS is silently disabled despite the bridge daemon being perfectly healthy.

Root Cause

In adapters/hermes/memos_provider/daemon_manager.py, the caching logic at lines 130–133:

if _bridge_ok is not None and probe_only:
    return _bridge_ok  # ← returns stale False forever

The sequence that triggers the bug:

  1. Hermes gateway starts, loads .env (which sets MEMOS_NODE_BINARY)
  2. agent_init.py calls is_available() → ensure_bridge_running(probe_only=True)
  3. If _node_available() fails at this moment (e.g. transient subprocess error on Windows, or .env not yet loaded in os.environ), _bridge_ok is set to False
  4. Meanwhile ensure_viewer_daemon() (called during initialize() without probe_only) finds the bridge already running on port 18800 via _probe_viewer() and returns True without ever calling ensure_bridge_running() again, so _bridge_ok stays False forever
  5. is_available() returns False → agent_init.py:994 skips add_provider() → MemOS offline for the gateway's lifetime
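
The sequence above can be reproduced in isolation. The following is a minimal sketch, not the real module: ensure_bridge_running here is a simplified stand-in that takes the Node check as a parameter (an assumption made for illustration), and the probe results are scripted to fail once and then succeed.

```python
from typing import Callable, Optional

_bridge_ok: Optional[bool] = None  # module-level cache with no expiry


def ensure_bridge_running(
    node_available: Callable[[], bool], *, probe_only: bool = False
) -> bool:
    """Simplified stand-in for the cached probe; not the real function."""
    global _bridge_ok
    if _bridge_ok is not None and probe_only:
        return _bridge_ok  # cached value is returned without re-probing
    _bridge_ok = node_available()
    return _bridge_ok


# Hypothetical probe results: one transient failure, then healthy.
results = iter([False, True, True])

first = ensure_bridge_running(lambda: next(results), probe_only=True)
later = [
    ensure_bridge_running(lambda: next(results), probe_only=True)
    for _ in range(3)
]
print(first, later)  # False [False, False, False]
```

The probe is consulted exactly once. Every later probe_only call returns the cached False even though the scripted probe would now report success.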

Symptoms

  • Repeated "MemOS: Node.js not found on PATH" warnings in logs (from non-session contexts, no session ID prefix)
  • Bridge health endpoint (/api/v1/health) shows llm.available: true and embedder.available: true
  • But bridge_client never logs any activity — the provider was never registered
  • hermes doctor shows memtensor as unavailable

Proposed Fix

Three changes to daemon_manager.py, ~20 lines net new code:

1. TTL-based cache expiry

Add a _bridge_ok_at: float = 0.0 timestamp and a BRIDGE_OK_TTL_SEC = 60.0 constant. A cached result (True or False) expires after 60 seconds, which forces the next probe_only call to revalidate.

2. Running-bridge fallback

When _node_available() fails, check whether a bridge is already alive via _probe_viewer() == "running_memos". A live bridge process is definitive proof the environment is viable (the bridge itself was launched with Node.js).
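
For readers unfamiliar with the probe, here is a rough sketch of what a liveness check against the bridge could look like. It is not the actual _probe_viewer() implementation, which returns a status string such as "running_memos" and may inspect more than this. The helper name bridge_is_alive, the 127.0.0.1 host, and the 1-second timeout are assumptions; the port and health path are the ones mentioned in this report.

```python
import http.client
import json
import urllib.error
import urllib.request

# Ignore any proxy settings from the environment: the probe must reach
# the local port directly.
_LOCAL_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def bridge_is_alive(
    host: str = "127.0.0.1", port: int = 18800, timeout: float = 1.0
) -> bool:
    """Hypothetical helper: True if the health endpoint answers 200 with JSON."""
    url = f"http://{host}:{port}/api/v1/health"
    try:
        with _LOCAL_OPENER.open(url, timeout=timeout) as resp:
            if resp.status != 200:
                return False
            json.loads(resp.read().decode("utf-8"))  # must parse as JSON
            return True
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False
```

A short timeout matters here because the check runs on the availability path. A bridge that is down should cost about a second at most, not stall gateway startup.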

3. shutdown_bridge() resets both variables

Fixed ensure_bridge_running:

def ensure_bridge_running(*, probe_only: bool = False) -> bool:
    global _bridge_ok, _bridge_ok_at
    with _lock:
        now = time.time()
        if _bridge_ok is not None and probe_only:
            if (now - _bridge_ok_at) < BRIDGE_OK_TTL_SEC:
                return _bridge_ok
            # Cache expired — fall through to revalidate.

        script = _bridge_script()
        if not script.exists():
            logger.warning("MemOS: bridge script missing at %s", script)
            _bridge_ok = False
            _bridge_ok_at = now
            return False

        if _node_available():
            _bridge_ok = True
            _bridge_ok_at = now
            return True

        # Node binary check failed. Check if bridge is already running.
        if _probe_viewer() == "running_memos":
            _bridge_ok = True
            _bridge_ok_at = now
            return True

        logger.warning("MemOS: Node.js not found on PATH")
        _bridge_ok = False
        _bridge_ok_at = now
        return False
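
To check the recovery behaviour without waiting 60 real seconds, the TTL logic can be exercised with an injected clock. This is a test sketch under stated assumptions: BridgeCache is a hypothetical stand-in that holds the same two pieces of state as the module globals, and it omits the bridge-script check and the running-bridge fallback.

```python
from typing import Callable, Optional

BRIDGE_OK_TTL_SEC = 60.0


class BridgeCache:
    """Stand-in for the module-level state, with injectable probe and clock."""

    def __init__(
        self, node_available: Callable[[], bool], clock: Callable[[], float]
    ) -> None:
        self._node_available = node_available
        self._clock = clock
        self._ok: Optional[bool] = None
        self._ok_at = 0.0
        self.probes = 0  # how many times the real check actually ran

    def ensure(self, *, probe_only: bool = False) -> bool:
        now = self._clock()
        fresh = (now - self._ok_at) < BRIDGE_OK_TTL_SEC
        if self._ok is not None and probe_only and fresh:
            return self._ok
        self.probes += 1
        self._ok = self._node_available()
        self._ok_at = now
        return self._ok


fake_now = [1000.0]
answers = iter([False, True])  # transient failure, then healthy
cache = BridgeCache(lambda: next(answers), lambda: fake_now[0])

assert cache.ensure(probe_only=True) is False  # first probe fails
fake_now[0] += 30.0
assert cache.ensure(probe_only=True) is False  # still cached, no new probe
assert cache.probes == 1
fake_now[0] += 31.0                            # 61 s after the failure
assert cache.ensure(probe_only=True) is True   # TTL expired, re-probed
assert cache.probes == 2
```

Passing the clock in as a parameter is a choice made for testability. For measuring elapsed intervals in the real module, time.monotonic() may be worth considering over time.time(), since it is unaffected by system clock adjustments.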

Environment

  • MemOS version: 2.0.5 (@memtensor/memos-local-plugin)
  • Hermes Agent on Windows 10
  • Node.js v24.14.1 (path set via MEMOS_NODE_BINARY in .env)

Notes

Happy to submit a PR if this direction looks right. The fix has been tested on Windows — syntax verified, logic tested with fresh import.

Metadata

Labels: plugin (Plugin/adapter/bridge layer, apps/ directory)
