Skip to content

[RESILIENCE] Watchdog thread — monitor engine health and auto-recover from stalls #234

@ElioNeto

Description

@ElioNeto

Description

A background watchdog thread monitors engine health and automatically recovers from stalls, deadlocks, or hung operations.

Implementation

  1. Watchdog checks every 5s:
    • Compaction making progress? (check L0 count trend)
    • WAL fsyncs completing? (check latency metric)
    • Request queue draining? (check in-flight count)
    • Last successful mutation timestamp
  2. On stall detection:
    • LOG: dump current thread stacks (SIGQUIT on Unix)
    • ACTION: abort hung operations
    • ESCALATE: if no recovery after 3 attempts, enter READ_ONLY mode
  3. Expose watchdog state in /health

Configuration

resilience:
  watchdog:
    interval_secs: 5
    stall_threshold_secs: 30
    max_escalations: 3

Labels

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions