[RESILIENCE] Panic recovery — catch panics in worker threads and restart them #230

@ElioNeto

Description

A single panic in a compaction thread or request handler currently kills the entire server. ApexStore should recover from panics gracefully.

Implementation

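  1. Wrap all thread entry points with std::panic::catch_unwind (this only works with the default panic = "unwind" strategy; it cannot catch panics when the binary is built with panic = "abort")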
  2. On panic:
    • Log full panic payload and backtrace
    • Increment panic counter metric
    • Restart the thread after a delay
    • If panic rate exceeds threshold (5/min), enter DEGRADED mode
  3. Expose panic count in /metrics
  4. Add /admin/panic-info endpoint returning recent panic details
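The restart and rate-limit steps above could be sketched roughly as follows. This is only an illustration, not ApexStore code: `PanicRateTracker`, `Outcome`, and `supervise` are hypothetical names, "exceeds threshold" is read as "more than 5 panics inside a 60 second window", and the worker is restarted by looping inside the same thread rather than by spawning a fresh one.

```rust
use std::collections::VecDeque;
use std::panic::{self, AssertUnwindSafe};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical sliding-window counter for the "5 panics per minute" rule.
pub struct PanicRateTracker {
    window: Duration,
    threshold: usize,
    events: VecDeque<Instant>,
}

impl PanicRateTracker {
    pub fn new(window: Duration, threshold: usize) -> Self {
        Self { window, threshold, events: VecDeque::new() }
    }

    /// Records a panic at `now`; returns true once the number of panics
    /// inside the window is strictly greater than the threshold.
    pub fn record(&mut self, now: Instant) -> bool {
        self.events.push_back(now);
        // Drop events that have fallen out of the window.
        while let Some(&oldest) = self.events.front() {
            if now.duration_since(oldest) > self.window {
                self.events.pop_front();
            } else {
                break;
            }
        }
        self.events.len() > self.threshold
    }
}

/// Hypothetical result of supervising one worker.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// The worker returned normally.
    Finished,
    /// The panic rate exceeded the threshold; the caller should enter DEGRADED mode.
    Degraded { panics: usize },
}

/// Runs `work`, restarting it after `restart_delay` whenever it panics,
/// until it returns normally or the tracker reports too many panics.
pub fn supervise<F: FnMut()>(
    mut work: F,
    restart_delay: Duration,
    tracker: &mut PanicRateTracker,
) -> Outcome {
    let mut panics = 0;
    loop {
        match panic::catch_unwind(AssertUnwindSafe(|| work())) {
            Ok(()) => return Outcome::Finished,
            Err(_payload) => {
                panics += 1;
                // Logging and the metrics increment would go here.
                if tracker.record(Instant::now()) {
                    return Outcome::Degraded { panics };
                }
                thread::sleep(restart_delay);
            }
        }
    }
}
```

A thread entry point would then call something like `supervise(|| run_compaction(), Duration::from_secs(1), &mut tracker)` (with `run_compaction` standing in for the real work) and switch the server to DEGRADED mode when `Outcome::Degraded` comes back.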

Code pattern

let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
    // compaction logic
}));
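if let Err(panic_payload) = result {
    // The payload is a Box<dyn Any + Send>, so `{:?}` would only print `Any { .. }`.
    // Downcast to the two types that `panic!` normally produces instead.
    let msg = panic_payload
        .downcast_ref::<&str>()
        .map(|s| s.to_string())
        .or_else(|| panic_payload.downcast_ref::<String>().cloned())
        .unwrap_or_else(|| "non-string panic payload".to_string());
    error!("Compaction thread panicked: {}", msg);
    metrics.inc_panic_count();
    // restart thread after delay
}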
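One caveat with this pattern: the payload returned by catch_unwind carries no backtrace or source location, so those have to be captured at the panic site, for example in a panic hook. The sketch below shows one way to do that and to keep a bounded list of recent panics that a handler for /admin/panic-info could serialize. `PanicRecord`, `install_panic_recorder`, `recent_panics`, and the cap of 32 entries are assumptions made for illustration, not part of ApexStore.

```rust
use std::backtrace::Backtrace;
use std::panic;
use std::sync::Mutex;

/// Hypothetical record of one panic, suitable for an admin endpoint.
#[derive(Clone, Debug)]
pub struct PanicRecord {
    pub message: String,
    pub location: String,
    pub backtrace: String,
}

/// Assumed cap so the list cannot grow without bound.
const MAX_RECENT: usize = 32;

static RECENT_PANICS: Mutex<Vec<PanicRecord>> = Mutex::new(Vec::new());

/// Installs a process-wide panic hook that stores message, location and
/// backtrace. The hook runs on the panicking thread before unwinding starts,
/// so the backtrace still points at the panic site.
pub fn install_panic_recorder() {
    panic::set_hook(Box::new(|info| {
        let payload = info.payload();
        let message = payload
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| payload.downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "non-string panic payload".to_string());
        let location = info
            .location()
            .map(|l| format!("{}:{}", l.file(), l.line()))
            .unwrap_or_else(|| "unknown".to_string());
        let record = PanicRecord {
            message,
            location,
            // force_capture ignores RUST_BACKTRACE and always captures.
            backtrace: Backtrace::force_capture().to_string(),
        };
        // If the lock is poisoned, skip recording rather than panic inside the hook.
        if let Ok(mut recent) = RECENT_PANICS.lock() {
            if recent.len() >= MAX_RECENT {
                recent.remove(0);
            }
            recent.push(record);
        }
    }));
}

/// Returns a snapshot of the recorded panics, oldest first.
pub fn recent_panics() -> Vec<PanicRecord> {
    RECENT_PANICS.lock().map(|g| g.clone()).unwrap_or_default()
}
```

Note that set_hook replaces the default hook, so the usual stderr message is no longer printed; a real implementation would probably also log the record from inside the hook.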
