
fix: wrap datastore worker in catch_unwind to prevent poisoned locks#553

Closed
tobixen wants to merge 3 commits into ActivityWatch:master from tobixen:fix/catch-unwind-worker-panics

Conversation


@tobixen tobixen commented Jan 1, 2026

Summary

Use std::panic::catch_unwind() around the datastore worker loop as a safety net. If any panic occurs, the worker will exit gracefully instead of poisoning the Mutex lock and leaving the server in an unusable state.

Problem

When the datastore worker thread panics (for any reason), it poisons the Mutex lock. All subsequent requests then fail with "poisoned lock: another task failed inside" errors, and the server becomes completely unusable until restart.

This is the symptom described in #405.

Solution

Wrap the worker's work_loop() in catch_unwind():

  • Catches any panic that occurs
  • Logs the panic message
  • Lets the worker exit cleanly
  • The channel closes properly
  • Future requests get clean "channel closed" errors instead of poisoned lock errors

This is a defense-in-depth approach: ideally, panics should be replaced with proper error handling, but this safety net ensures no panic can leave the server in an unrecoverable state.
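The mechanism described above can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual diff: `work_loop()` here is a stand-in for the real function in aw-datastore/src/worker.rs, and `worker_thread_entry` is a hypothetical name for the wrapping frame.

```rust
use std::panic::{self, AssertUnwindSafe};

// Stand-in for the real work_loop() in aw-datastore/src/worker.rs; here it
// always panics so the safety net is exercised.
fn work_loop() {
    panic!("simulated panic inside the worker");
}

// Minimal sketch of the safety net. Returns true on a clean exit, false if
// a panic was caught; in the real worker the thread simply returns, the
// channel sender drops, and callers see "channel closed".
fn worker_thread_entry() -> bool {
    // AssertUnwindSafe: the captured state is discarded after a panic,
    // so asserting unwind safety is reasonable here.
    let result = panic::catch_unwind(AssertUnwindSafe(work_loop));
    match result {
        Ok(()) => true,
        Err(payload) => {
            // Panic payloads are usually &str or String.
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic payload".to_string());
            eprintln!("datastore worker panicked: {msg}");
            false
        }
    }
}

fn main() {
    // Silence the default panic hook so only our log line appears.
    panic::set_hook(Box::new(|_| {}));
    let clean = worker_thread_entry();
    println!("worker exited cleanly: {clean}");
}
```

Because no Mutex is held across the `catch_unwind` boundary, the unwind never poisons a lock; the thread just returns to this frame.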

Test plan

  • cargo build compiles successfully
  • cargo test --package aw-datastore passes (9 tests)

🤖 Generated with Claude Code


Important

Wrap work_loop() in catch_unwind() in aw-datastore/src/worker.rs to handle panics gracefully and prevent poisoned locks.

  • Behavior:
    • Logs panic messages and allows the worker to exit cleanly, preventing poisoned locks.
    • Ensures future requests receive clean "channel closed" errors instead of "poisoned lock" errors.

This description was created by Ellipsis for f8cf670.


@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed everything up to f8cf670 in 36 seconds.
  • Reviewed 35 lines of code in 1 file
  • Skipped 0 files when reviewing.
  • Skipped posting 3 draft comments. View those below.
1. aw-datastore/src/worker.rs:315
  • Draft comment:
    Good use of catch_unwind to safeguard against panics. Ensure the worker state (di) is unwind safe; using AssertUnwindSafe here assumes all captured state is safe to unwind.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
2. aw-datastore/src/worker.rs:327
  • Draft comment:
    Panic payload extraction is correctly handled. Optionally, consider logging additional context or backtrace to aid debugging.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
3. aw-datastore/src/worker.rs:338
  • Draft comment:
    After a panic, the worker thread exits, closing the channel. Confirm that this shutdown behavior (vs. automatically restarting a new worker) is acceptable for the server’s resilience strategy.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50%



greptile-apps bot commented Jan 1, 2026

Greptile Summary

Wrapped the datastore worker's work_loop() in std::panic::catch_unwind() to prevent mutex lock poisoning when panics occur. This defense-in-depth fix ensures the server can't be left in an unrecoverable state where all subsequent requests fail with "poisoned lock" errors.

Key Changes

  • Added catch_unwind wrapper around work_loop() call in worker thread
  • Extracts and logs panic messages when caught
  • Allows worker to exit cleanly, closing the channel properly
  • Clients now receive clean "channel closed" errors instead of poisoned lock errors

Analysis

The fix addresses the root symptom from issue #405 where panics in the worker thread would poison the Mutex, making the entire server unusable until restart. While the PR correctly notes that panics should ideally be replaced with proper error handling, this safety net is appropriate for:

  1. Defense-in-depth: Prevents any panic (expected or not) from leaving the server in an unrecoverable state
  2. Improved error experience: Users get cleaner error messages instead of cryptic "poisoned lock" failures
  3. Graceful degradation: Server can be restarted without manual intervention

The use of AssertUnwindSafe is appropriate here since the DatastoreWorker state is completely discarded after the panic - there's no risk of accessing corrupted state since the worker thread exits and all its owned data is dropped.
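For reference, the panic-payload handling discussed above can be sketched as a standalone helper. The name `extract_panic_message` matches the commit description in this PR, but the exact signature used in aw-datastore/src/worker.rs may differ; this sketch just covers the three payload branches.

```rust
use std::any::Any;
use std::panic;

// Sketch of the extract_panic_message helper: panic payloads are usually
// &'static str (literal panics) or String (formatted panics); anything
// else falls through to a placeholder.
fn extract_panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        (*s).to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "unknown panic payload".to_string()
    }
}

fn main() {
    // Silence the default panic hook so the demo output stays clean.
    panic::set_hook(Box::new(|_| {}));

    // Branch 1: &str literal panic, e.g. panic!("literal")
    let p = panic::catch_unwind(|| { panic!("literal"); }).unwrap_err();
    assert_eq!(extract_panic_message(&*p), "literal");

    // Branch 2: String (formatted) panic, e.g. panic!("formatted {}", 42)
    let p = panic::catch_unwind(|| { panic!("formatted {}", 42); }).unwrap_err();
    assert_eq!(extract_panic_message(&*p), "formatted 42");

    // Branch 3: arbitrary payload types fall through to the placeholder.
    let p = panic::catch_unwind(|| { panic::panic_any(7u32); }).unwrap_err();
    assert_eq!(extract_panic_message(&*p), "unknown panic payload");

    println!("all three branches covered");
}
```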

Confidence Score: 4/5

  • Safe to merge with high confidence - defensive fix that prevents server lockup without introducing new risks
  • Score of 4 reflects a well-implemented defensive fix with clear benefits and minimal risk. The implementation correctly uses catch_unwind to prevent lock poisoning, properly logs panics, and allows graceful degradation. The use of AssertUnwindSafe is justified since worker state is fully discarded. Not rated 5 only because this is a symptomatic fix rather than addressing the root causes of panics (lines 136, 145, 195 still use panic! for transaction errors), though the PR acknowledges this is defense-in-depth.
  • No files require special attention - the single-file change is straightforward and well-tested

Important Files Changed

Filename Overview
aw-datastore/src/worker.rs Wraps datastore worker loop in catch_unwind to prevent lock poisoning when panics occur - defensive fix that prevents server lockup

Sequence Diagram

sequenceDiagram
    participant Client as API Client
    participant Datastore as Datastore (Main Thread)
    participant Channel as MPSC Channel
    participant Worker as DatastoreWorker Thread
    participant CatchUnwind as catch_unwind Handler
    participant SQLite as SQLite Database

    Note over Datastore,Worker: Initialization
    Datastore->>Channel: Create MPSC channel
    Datastore->>Worker: Spawn worker thread
    Worker->>CatchUnwind: Wrap work_loop in catch_unwind
    CatchUnwind->>Worker: Execute work_loop(method)
    Worker->>SQLite: Open connection
    Worker->>Worker: Initialize DatastoreInstance

    Note over Client,SQLite: Normal Operation
    Client->>Datastore: Request (e.g., insert_events)
    Datastore->>Channel: Send command
    Channel->>Worker: Deliver command
    Worker->>SQLite: Execute SQL operation
    SQLite-->>Worker: Result
    Worker->>Channel: Send response
    Channel->>Datastore: Deliver response
    Datastore-->>Client: Return result

    Note over Client,SQLite: Panic Scenario (Before Fix)
    Client->>Datastore: Request triggers panic
    Datastore->>Channel: Send command
    Channel->>Worker: Deliver command
    Worker->>SQLite: Execute operation
    SQLite-->>Worker: Unexpected error
    Worker->>Worker: panic! occurs
    Note over Worker: Thread exits, Mutex poisoned
    Client->>Datastore: Subsequent request
    Datastore--XClient: Error: "poisoned lock"
    Note over Datastore: Server unusable until restart

    Note over Client,SQLite: Panic Scenario (After Fix)
    Client->>Datastore: Request triggers panic
    Datastore->>Channel: Send command
    Channel->>Worker: Deliver command
    Worker->>SQLite: Execute operation
    SQLite-->>Worker: Unexpected error
    Worker->>Worker: panic! occurs
    Worker->>CatchUnwind: Panic caught
    CatchUnwind->>CatchUnwind: Log panic message
    CatchUnwind->>Worker: Exit gracefully
    Worker->>Channel: Channel closes cleanly
    Note over Worker: No lock poisoning
    Client->>Datastore: Subsequent request
    Datastore->>Channel: Send command
    Channel--XDatastore: Channel closed error
    Datastore-->>Client: Clean error (channel closed)
    Note over Client: Clear error, can restart server gracefully


Author

tobixen commented Jan 1, 2026

This pull request was created by Claude the AI. I don't know rust well enough to have a subjective point of view if this change is good or not. The prompt given to Claude was like this:

I don't know anything about rust, but wouldn't it be possible to make a general fix so that panic!(...) would cause an immediate graceful exit rather than putting the server in an unusable state?

My rationale is that a server that is down may be auto-restarted, and it's also easy to catch this through monitoring. A server that is up and listening to the port but not working at all may be a little bit harder to monitor.

I run the server through systemd, and systemd supports auto-restarting. I'm not sure how things work when utilizing the aw-qt app.

Member

0xbrayo commented Jan 2, 2026

LGTM

Member

0xbrayo commented Jan 2, 2026

I'm not sure how things work when utilizing the aw-qt app.

I believe only non-zero exit codes get restarted (that's how it works in aw-tauri; not sure about aw-qt), since those indicate a crash. This change would cause the process to exit normally, so it might not be restarted. You could still re-raise a panic after gracefully handling the first one, though, if that is deemed necessary.
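The point about exit codes can be sketched as follows. This is a hypothetical alternative, not code from the PR: catch the panic, log it, then report a non-zero exit status so the service manager (systemd, aw-tauri, aw-qt) treats it as a crash and restarts the whole process.

```rust
use std::panic::{self, AssertUnwindSafe};

// Stand-in worker loop that crashes, to demonstrate the exit-code idea.
fn work_loop() {
    panic!("worker crashed");
}

// Hypothetical: translate a caught panic into a non-zero exit status so
// restart-on-crash service managers pick it up.
fn run() -> i32 {
    match panic::catch_unwind(AssertUnwindSafe(work_loop)) {
        Ok(()) => 0,
        Err(_) => {
            eprintln!("worker panicked; exiting non-zero so the service manager restarts us");
            1
        }
    }
}

fn main() {
    panic::set_hook(Box::new(|_| {})); // keep demo output tidy
    let code = run();
    println!("would exit with status {code}");
    // In a real binary: std::process::exit(code);
}
```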


codecov bot commented Jan 2, 2026

Codecov Report

❌ Patch coverage is 81.81818% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.97%. Comparing base (656f3c9) to head (6d234a3).
⚠️ Report is 38 commits behind head on master.

Files with missing lines Patch % Lines
aw-datastore/src/worker.rs 81.81% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #553      +/-   ##
==========================================
- Coverage   70.81%   67.97%   -2.84%     
==========================================
  Files          51       54       +3     
  Lines        2916     3151     +235     
==========================================
+ Hits         2065     2142      +77     
- Misses        851     1009     +158     


tobixen force-pushed the fix/catch-unwind-worker-panics branch from 3afcbd9 to 9d0bc84 on February 20, 2026 at 00:53
tobixen and others added 3 commits February 24, 2026 02:01
Use std::panic::catch_unwind() around the worker loop as a safety net.
If any panic occurs, the worker will exit gracefully instead of
poisoning the Mutex lock and leaving the server in an unusable state.

This ensures:
- Panics are caught and logged with their message
- The worker thread exits cleanly
- The channel closes properly
- Future requests get clean "channel closed" errors
- No poisoned locks requiring server restart

Related to: ActivityWatch#405

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add tests to verify that the catch_unwind wrapper properly handles
worker panics and prevents poisoned lock errors. Tests are gated
behind a `test-panic` feature flag.

Run with: cargo test --package aw-datastore --features test-panic

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… extraction

Extract the panic payload message logic into a standalone
`extract_panic_message` helper and add unit tests covering all three
branches: &str literal panics, String (formatted) panics, and unknown
payload types.

These tests run unconditionally (no feature flag required), so they are
picked up by tarpaulin in CI and close the coverage gap reported by
Codecov on the previous commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Inline review comment (Member) on this snippet in aw-datastore/src/worker.rs:

            panic_msg
        );
    }
    // Worker exits, channel closes, future requests get clean errors

Worker exits, channel closes, future requests get errors, but the server doesn't actually recover?

Determine how this plays together with #553 and see context in #405

Author

tobixen commented Mar 2, 2026

⚠️ This comment is AI-generated on behalf of @tobixen.

Closing this PR in light of the discussion today on issue #405 and PR #572.

Why closing:

The catch_unwind approach here has a fundamental limitation that ErikBjare correctly identified: after the worker exits, the server keeps listening but can never serve datastore requests again — it just returns errors forever without any recovery. Since the process doesn't crash, systemd won't restart it either. This leaves the server in the same kind of zombie state we were trying to avoid.

Recommendation — focus on #572:

PR #572 is closer to the right fix architecturally, but it also has a serious problem in its current form: the worker responds to clients before committing, so a commit failure silently and permanently drops events that clients believe were saved. ErikBjare flagged this too.

The correct fix for #572 would be to commit before responding — if the commit fails, the client gets an error and can retry. That requires restructuring the inner worker loop but would make the server truly correct under disk-full conditions.
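The commit-before-respond ordering can be illustrated with a small sketch. The names here (`handle_insert`, `Response`) are illustrative, not from the actual #572 code; the point is only the ordering of commit and acknowledgement.

```rust
// Hypothetical sketch: acknowledge the client only after the commit
// succeeds, so a failed commit surfaces as a retryable error instead of
// silently dropping events the client believes were saved.
#[derive(Debug, PartialEq)]
enum Response {
    Ok,
    Err(String),
}

fn handle_insert(commit: impl FnOnce() -> Result<(), String>) -> Response {
    // ... stage the events in the transaction here ...
    match commit() {
        // Commit first, respond second: Ok is only sent once data is durable.
        Ok(()) => Response::Ok,
        // On failure (e.g. disk full) the client gets an error and can retry.
        Err(e) => Response::Err(e),
    }
}

fn main() {
    assert_eq!(handle_insert(|| Ok(())), Response::Ok);
    assert_eq!(
        handle_insert(|| Err("disk full".to_string())),
        Response::Err("disk full".to_string())
    );
    println!("commit-before-respond ordering verified");
}
```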

That leaves panics from other causes (bugs, not disk-full) still unhandled. The catch_unwind idea from this PR could still be worth adding on top of a fixed #572 — but only if the catch handler restarts the worker thread rather than just logging and exiting, otherwise the server is still broken.
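The restart-the-worker variant suggested above could look something like this. Everything here is hypothetical (the real worker would re-open the SQLite connection and rebuild the DatastoreInstance inside the loop); the sketch just shows a supervisor that re-enters the loop after a caught panic instead of exiting.

```rust
use std::panic::{self, AssertUnwindSafe};

// Hypothetical restart-on-panic supervisor: instead of logging and
// exiting, the catch handler re-enters the loop so the server keeps
// serving. Returns the number of restarts before a clean exit.
fn run_worker_with_restart(max_restarts: u32) -> u32 {
    let mut attempts = 0;
    loop {
        let result = panic::catch_unwind(AssertUnwindSafe(|| {
            // In the real worker: re-open the SQLite connection, rebuild
            // the DatastoreInstance, and resume the request loop.
            if attempts < 2 {
                panic!("simulated transient failure");
            }
            // ... serve requests until a clean shutdown ...
        }));
        match result {
            Ok(()) => return attempts,
            Err(_) => {
                attempts += 1;
                if attempts > max_restarts {
                    eprintln!("worker kept panicking; giving up after {attempts} attempts");
                    return attempts;
                }
                eprintln!("worker panicked; restarting (attempt {attempts})");
            }
        }
    }
}

fn main() {
    panic::set_hook(Box::new(|_| {})); // keep demo output tidy
    let restarts = run_worker_with_restart(5);
    println!("worker recovered after {restarts} restarts");
}
```

A real implementation would also want a backoff or a restart budget, so a deterministic panic (a genuine bug) does not turn into a hot crash loop.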


@tobixen tobixen closed this Mar 2, 2026
Author

tobixen commented Mar 2, 2026

My first impression of Claude was just amazing: it apparently managed to understand some of my projects in a few minutes, even though I thought I was the only person in the world who fully understood them. It has made small scripts and tons of test code, and fixed bugs for me, much faster and better than I could do myself. Other times it just produces tons of crap, or replaces good code with bad code. I found the "this is fine" meme quite fitting: it seems happy to burn down houses without realizing it's a bad idea, and if one asked it to change the color of the curtains in the burning house, it would do so without asking any questions.

I've come to realize that delivering pull requests without understanding the project or the code it wants to change perhaps isn't the best idea.

Member

ErikBjare commented Mar 2, 2026

@tobixen Sometimes they are put on the spot and panic, and seem to just do something in the hope of satisfying some RL reward, like a short solution that passes the test suite 😆

They also seem to have an unusually difficult time with Rust, not sure why that is.

But yeah, making PRs to other's stuff still requires a little bit of self-review and early prodding so that they have some gating on their confidence. Still, I think it could probably have gotten there had I not been trying that other approach at the same time, with some feedback.

One solution that would be friendly to systemd etc would be to crash the entire app, that way the service manager (aw-qt, aw-tauri, systemd) would deal with restart instead of the server having to handle/restart poisoned locks. But should probably just try to clean up and restart in-process? We will continue that discussion in #572
