fix: resolve the deadlock by using atomics by adsharma · Pull Request #605 · LadybugDB/ladybug

adsharma · 2026-06-21T16:33:55Z

The deadlock is a classic lock‑order inversion between two TransactionManager mutexes. It is hit because TransactionContext::commit() (when it triggers an auto/force checkpoint) and TransactionContext::beginAutoTransaction() (→ beginTransactionInternal() → TransactionManager::beginTransaction()) acquire them in opposite orders.

The two locks involved:

mtxForSerializingPublicFunctionCalls ("public‑function" lock)
mtxForStartingNewTransactions ("write gate" / new‑transaction lock)
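
To make the inversion concrete, here is a minimal sketch that reproduces the opposite acquisition orders without actually hanging. It is not LadybugDB code: the two local mutexes are hypothetical stand-ins for the locks listed above, and try_lock is used for the second acquisition so that the cycle shows up as two failed attempts rather than a stuck process.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Outcome of each thread's attempt to take its second lock.
struct InversionResult {
    bool threadAGotSecondLock;
    bool threadBGotSecondLock;
};

// Hypothetical stand-ins for the two mutexes; not the real TransactionManager.
InversionResult demonstrateInversion() {
    std::mutex publicFunctionLock;     // plays mtxForSerializingPublicFunctionCalls
    std::mutex writeGate;              // plays mtxForStartingNewTransactions
    std::atomic<int> holdingFirst{0};  // threads that already hold their first lock
    std::atomic<int> finishedTry{0};   // threads that have attempted the second lock
    InversionResult result{true, true};

    auto waitUntil = [](const std::atomic<int>& counter, int target) {
        while (counter.load() < target) std::this_thread::yield();
    };

    // Checkpoint-style order: write gate first, then the public-function lock.
    std::thread threadA([&] {
        std::lock_guard<std::mutex> first(writeGate);
        holdingFirst.fetch_add(1);
        waitUntil(holdingFirst, 2);  // both threads now hold one lock each
        result.threadAGotSecondLock = publicFunctionLock.try_lock();
        if (result.threadAGotSecondLock) publicFunctionLock.unlock();
        finishedTry.fetch_add(1);
        waitUntil(finishedTry, 2);   // keep the first lock until both have tried
    });

    // Begin-transaction-style order: public-function lock first, then the write gate.
    std::thread threadB([&] {
        std::lock_guard<std::mutex> first(publicFunctionLock);
        holdingFirst.fetch_add(1);
        waitUntil(holdingFirst, 2);
        result.threadBGotSecondLock = writeGate.try_lock();
        if (result.threadBGotSecondLock) writeGate.unlock();
        finishedTry.fetch_add(1);
        waitUntil(finishedTry, 2);
    });

    threadA.join();
    threadB.join();
    return result;
}
```

Calling demonstrateInversion() should report that neither thread obtained its second lock. With blocking lock() calls in place of try_lock(), the same interleaving would leave both threads waiting on each other indefinitely.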

commit() calls checkpoint()/tryCheckpoint() → checkpointNoLock() when shouldAutoCheckpoint or shouldForceCheckpoint is set. So the cycle is:

Thread A (a writer whose commit() just fired an auto‑checkpoint): holds mtxForStartingNewTransactions (the write gate, kept after the drain loop exits), then tries to acquire mtxForSerializingPublicFunctionCalls to read lastTimestamp (line 305).
Thread B (another concurrent writer entering beginAutoTransaction() → beginTransaction(WRITE)): holds mtxForSerializingPublicFunctionCalls, then tries to acquire mtxForStartingNewTransactions (line 35).
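
The PR title says the deadlock is resolved "by using atomics", but the diff itself is not reproduced here. The following is therefore only a sketch of one way such a fix could look, assuming lastTimestamp is held in a std::atomic so that the checkpoint path can read it without taking the public‑function lock. ToyTransactionManager, beginWriteTransaction, checkpointReadTimestamp, and runStress are hypothetical names for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>

// Toy model only: member names mirror the discussion above, bodies are assumptions.
class ToyTransactionManager {
public:
    // Thread B's path: public-function lock first, then the write gate.
    std::uint64_t beginWriteTransaction() {
        std::lock_guard<std::mutex> publicLock(mtxForSerializingPublicFunctionCalls);
        std::lock_guard<std::mutex> gate(mtxForStartingNewTransactions);
        return lastTimestamp.fetch_add(1) + 1;
    }

    // Thread A's path: holds the write gate and needs lastTimestamp.
    // The atomic load replaces acquiring the public-function lock, so this
    // path never waits on a second mutex while holding the write gate.
    std::uint64_t checkpointReadTimestamp() {
        std::lock_guard<std::mutex> gate(mtxForStartingNewTransactions);
        return lastTimestamp.load();
    }

private:
    std::mutex mtxForSerializingPublicFunctionCalls;
    std::mutex mtxForStartingNewTransactions;
    std::atomic<std::uint64_t> lastTimestamp{0};
};

// Runs one writer and one checkpointer concurrently; returns the final timestamp.
std::uint64_t runStress(int iterations) {
    ToyTransactionManager manager;
    std::thread writer([&] {
        for (int i = 0; i < iterations; ++i) manager.beginWriteTransaction();
    });
    std::thread checkpointer([&] {
        for (int i = 0; i < iterations; ++i) manager.checkpointReadTimestamp();
    });
    writer.join();
    checkpointer.join();
    return manager.checkpointReadTimestamp();
}
```

In this sketch only one code path ever holds both mutexes, and it always takes them in the same order, so no wait cycle can form. A call such as runStress(20000) should return once both threads finish, with the final timestamp equal to the number of write transactions started.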

adsharma · 2026-06-21T18:27:03Z

Tested by importing the ladybug source repo into lscope as of this commit: adsharma/lscope@fd6cc12

python3 main.py --index /path/to/ladybug --language cpp --workers 8
Ingested 2602 file(s), 33273 semantic node(s), and 87726 resolved call(s) into test.db using 8 analysis thread(s)
  cpp: 2602 file(s)
       65.54 real        83.17 user         6.06 sys
          2065891328  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              142518  page reclaims
                  18  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
               17507  voluntary context switches
              950821  involuntary context switches
       1662480220213  instructions retired
        353909380349  cycles elapsed
          1539655816  peak memory footprint

…#2340)

* chore(deps): bump @ladybugdb/core to 0.18.0

Pins the release containing LadybugDB/ladybug#605 (TransactionManager lock-order-inversion deadlock fix). Checked for known post-release regressions specific to 0.18.0 via the Ladybug issue tracker; none found.

* fix(lbug): re-validate version-coupled comments and regexes for 0.18.0

Extends the LADYBUGDB-CONTRACT re-validation to two spots the marker convention doesn't catch (bridge-db.ts's LBUG_OPEN_RETRY_PATTERNS, conn-lock.ts's serialization rationale). Confirms via upstream source diff (v0.16.1..v0.18.0) that every matched error-text string is unchanged; conn-lock.ts's rationale is unaffected by #612/#623 since neither addresses concurrent queries on one connection. Adds a stemmer-sweep test proving the bundled 0.18.0 FTS extension accepts every entry in SUPPORTED_FTS_STEMMERS, not just the default porter. A live-trigger test for isMissingShadowSidecarError was attempted but abandoned after empirical probing showed it isn't reliably reproducible (even a SIGKILL-simulated crash didn't reproduce the error on reopen); documented as inspection-verified instead of overclaiming test coverage that doesn't exist.

* test(lbug): add concurrent multi-connection deadlock stress test (#2338)

Directly validates LadybugDB/ladybug#605 (the TransactionManager lock-order-inversion deadlock between a commit()-triggered checkpoint and a concurrent beginAutoTransaction()) under a shape close to GitNexus's real concurrent-writer load, independent of conn-lock.ts's app-level serialization. Comparison run against 0.17.1 (pre-fix): 1 of 4 runs hung for the full 60s timeout, a direct reproduction of the deadlock. 9 consecutive runs against 0.18.0 (post-fix) all passed cleanly. Production is unchanged: conn-lock.ts still serializes every write; this test validates the engine-level fix without shipping multi-writer as a default.

* fix(test): address code review findings in multiwriter deadlock test

- Reuse lbug-config.ts's createLbugDatabase (via GITNEXUS_WAL_CHECKPOINT_THRESHOLD) instead of a hand-duplicated 9-arg raw constructor call whose stated justification (needing to bypass createLbugDatabase for the threshold override) was incorrect: the env var already provides it.
- Close every QueryResult via the existing closeQueryResults helper (write loop, read loop, verify query, setup query) instead of leaking native cursors, matching lbug-adapter.ts's established pattern.
- Move all cleanup (timers, connections, db close, env var restore) into the outer finally block so it runs on every exit path, not just the happy path; a timeout or a writer exhausting its retry budget no longer leaves dangling timers/connections/abandoned query loops.

Verified: 8 consecutive runs after the refactor, all passing cleanly. Found via 8-angle parallel code review (medium effort); the two other findings (isDbBusyError not recognizing LadybugDB's 'Only one write transaction' message, and shadow-file poll timing sensitivity) are noted in the PR description as residual: the first is a production-code change beyond this validation test's scope, the second is inherent to observing a transient native sidecar file and not cleanly fixable without overengineering.

* fix(test): apply ce-code-review autofix findings

Fixes from an 8-persona parallel review round (correctness/testing/maintainability/project-standards/reliability/adversarial/agent-native/learnings):

- Extract the duplicated skipUnlessFtsAvailable/FTS_UNAVAILABLE_NOTE helper (previously copy-pasted between lbug-core-adapter.test.ts and fts-stemmer-sweep.test.ts) into a shared test/helpers/fts-availability.ts.
- Fix a native connection leak: verifyConn in the deadlock test's final verification block is now pushed into the readers array the outer finally already closes, so it's cleaned up even if the count query throws.
- Fix a latent TypeScript type error (tsconfig.test.json catches it, tsconfig.json doesn't): conn.query() types as QueryResult | QueryResult[]; narrow to the single-result case before calling .getAll() rather than assuming the array branch never happens.
- Replace repeated inline InstanceType<typeof import(...)> expressions with local LbugDatabase/LbugConnection type aliases.

Verified: 12 consecutive runs of the deadlock test all pass, full lbug-db project (336 tests) green. Cross-reviewer-confirmed but left as residual (design judgment calls, not mechanical fixes) for the PR description: isDbBusyError doesn't recognize LadybugDB's 'Only one write transaction' message (pre-existing production gap, confirmed independently by 3 reviewers); the deadlock test's timeout path doesn't cancel in-flight writer/reader loops before closing connections; the reader loop has no bounded retry for transient errors during the race window; pinning @ladybugdb/core with a caret range trades automatic patch updates for less re-validation certainty.

* docs: trim task-referencing JSDoc artifacts, add operator notes

The U2 re-validation pass left verbose 'Re-validated on the 0.17.0->0.18.0 bump (#2338): ...' paragraphs stacked onto 5 production files' docstrings, alongside the already-updated version numbers. That narrative (SIGKILL-probe methodology, diff commands run, issue cross-references) belongs in the PR description, not in code comments that will accumulate a new paragraph on every future bump and confuse readers who just want the current fact. Trimmed each to state only the durable, current-state fact:

- lbug-config.ts, sidecar-recovery.ts, lbug-adapter.ts, bridge-db.ts: dropped the bump-narrative paragraphs; kept only genuinely durable notes (e.g., which matchers are inspection-verified vs live-tested, what upstream wording changed).
- conn-lock.ts: compressed a 12-line, 3-issue-number enumeration into 2 lines stating the current conclusion (no upstream 0.18.0 fix addresses the same-connection-concurrent-query risk this lock guards against).

Also added operator-facing notes to GUARDRAILS.md and RUNBOOK.md's existing 'LadybugDB lock' sections: an isDbBusyError gap found during this validation (LadybugDB's 'Only one write transaction...' message isn't recognized by our busy/lock retry matcher) means that specific error can surface unretried. Documented so it's recognized as the same single-writer conflict, not a new failure mode.

* refactor(test): use gitnexus-shared's withRetry in multiwriter deadlock test

Replaces the hand-rolled writeWithRetry/sleep loop with the existing gitnexus-shared retry helper (already used by embeddings/hf-env.ts) instead of duplicating the pattern.

* fix(test): guarantee non-zero retry delay in deadlock test's writer loop

withRetry's isRetryable previously returned {retry: bool} with no afterMs, so computeBackoffMs's exponential-jitter formula gave a deterministic zero-delay on the first retry (floor(random()*1) is always 0 at attempt=0). This contradicted the file's own documented tuning, which specifically needs a non-zero 1-3ms delay to avoid tripping a different native guard. Return an explicit afterMs override on the retryable branch instead.

* docs(test): remove dangling doc references from deadlock test JSDoc

The JSDoc pointed to a local-session-only docs/plans/2026-07-01-001-... path (docs/ is repo-gitignored, so this never existed for anyone but the implementing session) and to "the PR description" as a source of truth that stops being current once the PR merges. Replace both with self-contained prose and durable references (issue/PR numbers, commit SHAs, GUARDRAILS.md/RUNBOOK.md) that stay resolvable after merge.

* fix(search): harden SUPPORTED_FTS_STEMMERS against external mutation

Type as ReadonlySet<string> to match this codebase's established convention for exported validation allowlists (EVAL_SERVER_TOOLS, STRUCTURAL_LABELS). Type-only change, no behavior change; both the internal .has() check and the sweep test's spread-iterate pattern continue to work unchanged.

* docs(guardrails): fold Known-gap note into the LadybugDB Sign's Why label

GUARDRAILS.md's own convention is strictly Trigger/Do/Why per Sign entry (stated in the file's header, followed by all 5 other entries). The new isDbBusyError gap note introduced a 4th label; fold it into Why instead, which is what it's actually explaining.

* fix(test): run the multi-writer deadlock test on Windows too

itLbugMultiwriter mirrored lbug-core-adapter.test.ts's win32 skip, but that pattern exists for a close-then-reopen-same-path lock lingering bug (kuzudb/kuzu#3872). This test never reopens the database (it holds connections open for the whole run), so the skip excluded the one test validating issue #2338's deadlock fix from the platform conn-lock.ts actually ships native bindings for.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 5 <noreply@anthropic.com>

fix: resolve the deadlock by using atomics

d369d44

adsharma force-pushed the write_deadlock branch from 37217d3 to d369d44 on June 21, 2026 16:38

adsharma merged commit d7b9f19 into main Jun 21, 2026
4 checks passed

adsharma deleted the write_deadlock branch June 21, 2026 18:27

This was referenced Jul 1, 2026

Validate Ladybug multi-writer deadlock fix (PR #605) against GitNexus large-repo ingestion abhigyanpatwari/GitNexus#2338

Closed

fix(deps): pin Ladybug 0.18.0, validate the multi-writer deadlock fix abhigyanpatwari/GitNexus#2340

Merged
