fix: resolve the deadlock by using atomics#605
Merged
Merged
Conversation
Contributor
Author
|
Tested by importing the ladybug source repo into lscope as of this commit: adsharma/lscope@fd6cc12 |
magyargergo
added a commit
to abhigyanpatwari/GitNexus
that referenced
this pull request
Jul 1, 2026
…#2340) * chore(deps): bump @ladybugdb/core to 0.18.0 Pins the release containing LadybugDB/ladybug#605 (TransactionManager lock-order-inversion deadlock fix). Checked for known post-release regressions specific to 0.18.0 via the Ladybug issue tracker — none found. * fix(lbug): re-validate version-coupled comments and regexes for 0.18.0 Extends the LADYBUGDB-CONTRACT re-validation to two spots the marker convention doesn't catch (bridge-db.ts's LBUG_OPEN_RETRY_PATTERNS, conn-lock.ts's serialization rationale). Confirms via upstream source diff (v0.16.1..v0.18.0) that every matched error-text string is unchanged; conn-lock.ts's rationale is unaffected by #612/#623 since neither addresses concurrent queries on one connection. Adds a stemmer-sweep test proving the bundled 0.18.0 FTS extension accepts every entry in SUPPORTED_FTS_STEMMERS, not just the default porter. A live-trigger test for isMissingShadowSidecarError was attempted but abandoned after empirical probing showed it isn't reliably reproducible (even a SIGKILL-simulated crash didn't reproduce the error on reopen) — documented as inspection-verified instead of overclaiming test coverage that doesn't exist. * test(lbug): add concurrent multi-connection deadlock stress test (#2338) Directly validates LadybugDB/ladybug#605 — the TransactionManager lock-order-inversion deadlock between a commit()-triggered checkpoint and a concurrent beginAutoTransaction() — under a shape close to GitNexus's real concurrent-writer load, independent of conn-lock.ts's app-level serialization. Comparison run against 0.17.1 (pre-fix): 1 of 4 runs hung for the full 60s timeout, a direct reproduction of the deadlock. 9 consecutive runs against 0.18.0 (post-fix) all passed cleanly. Production is unchanged — conn-lock.ts still serializes every write; this test validates the engine-level fix without shipping multi-writer as a default. * fix(test): address code review findings in multiwriter deadlock test - Reuse lbug-config.ts's createLbugDatabase (via GITNEXUS_WAL_CHECKPOINT_THRESHOLD) instead of a hand-duplicated 9-arg raw constructor call whose stated justification (needing to bypass createLbugDatabase for the threshold override) was incorrect — the env var already provides it. - Close every QueryResult via the existing closeQueryResults helper (write loop, read loop, verify query, setup query) instead of leaking native cursors, matching lbug-adapter.ts's established pattern. - Move all cleanup (timers, connections, db close, env var restore) into the outer finally block so it runs on every exit path, not just the happy path — a timeout or a writer exhausting its retry budget no longer leaves dangling timers/connections/abandoned query loops. Verified: 8 consecutive runs after the refactor, all passing cleanly. Found via 8-angle parallel code review (medium effort); the two other findings (isDbBusyError not recognizing LadybugDB's 'Only one write transaction' message, and shadow-file poll timing sensitivity) are noted in the PR description as residual — the first is a production-code change beyond this validation test's scope, the second is inherent to observing a transient native sidecar file and not cleanly fixable without overengineering. * fix(test): apply ce-code-review autofix findings Fixes from an 8-persona parallel review round (correctness/testing/ maintainability/project-standards/reliability/adversarial/agent-native/ learnings): - Extract the duplicated skipUnlessFtsAvailable/FTS_UNAVAILABLE_NOTE helper (previously copy-pasted between lbug-core-adapter.test.ts and fts-stemmer-sweep.test.ts) into a shared test/helpers/fts-availability.ts. - Fix a native connection leak: verifyConn in the deadlock test's final verification block is now pushed into the readers array the outer finally already closes, so it's cleaned up even if the count query throws. - Fix a latent TypeScript type error (tsconfig.test.json catches it, tsconfig.json doesn't): conn.query() types as QueryResult | QueryResult[]; narrow to the single-result case before calling .getAll() rather than assuming the array branch never happens. - Replace repeated inline InstanceType<typeof import(...)> expressions with local LbugDatabase/LbugConnection type aliases. Verified: 12 consecutive runs of the deadlock test all pass, full lbug-db project (336 tests) green. Cross-reviewer-confirmed but left as residual (design judgment calls, not mechanical fixes) for the PR description: isDbBusyError doesn't recognize LadybugDB's 'Only one write transaction' message (pre-existing production gap, confirmed independently by 3 reviewers); the deadlock test's timeout path doesn't cancel in-flight writer/reader loops before closing connections; the reader loop has no bounded retry for transient errors during the race window; pinning @ladybugdb/core with a caret range trades automatic patch updates for less re-validation certainty. * docs: trim task-referencing JSDoc artifacts, add operator notes The U2 re-validation pass left verbose 'Re-validated on the 0.17.0->0.18.0 bump (#2338): ...' paragraphs stacked onto 5 production files' docstrings, alongside the already-updated version numbers. That narrative (SIGKILL-probe methodology, diff commands run, issue cross-references) belongs in the PR description, not in code comments that will accumulate a new paragraph on every future bump and confuse readers who just want the current fact. Trimmed each to state only the durable, current-state fact: - lbug-config.ts, sidecar-recovery.ts, lbug-adapter.ts, bridge-db.ts: dropped the bump-narrative paragraphs; kept only genuinely durable notes (e.g., which matchers are inspection-verified vs live-tested, what upstream wording changed). - conn-lock.ts: compressed a 12-line, 3-issue-number enumeration into 2 lines stating the current conclusion (no upstream 0.18.0 fix addresses the same-connection-concurrent-query risk this lock guards against). Also added operator-facing notes to GUARDRAILS.md and RUNBOOK.md's existing 'LadybugDB lock' sections: an isDbBusyError gap found during this validation (LadybugDB's 'Only one write transaction...' message isn't recognized by our busy/lock retry matcher) means that specific error can surface unretried. Documented so it's recognized as the same single-writer conflict, not a new failure mode. * refactor(test): use gitnexus-shared's withRetry in multiwriter deadlock test Replaces the hand-rolled writeWithRetry/sleep loop with the existing gitnexus-shared retry helper (already used by embeddings/hf-env.ts) instead of duplicating the pattern. * fix(test): guarantee non-zero retry delay in deadlock test's writer loop withRetry's isRetryable previously returned {retry: bool} with no afterMs, so computeBackoffMs's exponential-jitter formula gave a deterministic zero-delay on the first retry (floor(random()*1) is always 0 at attempt=0). This contradicted the file's own documented tuning, which specifically needs a non-zero 1-3ms delay to avoid tripping a different native guard. Return an explicit afterMs override on the retryable branch instead. * docs(test): remove dangling doc references from deadlock test JSDoc The JSDoc pointed to a local-session-only docs/plans/2026-07-01-001-... path (docs/ is repo-gitignored, so this never existed for anyone but the implementing session) and to "the PR description" as a source of truth that stops being current once the PR merges. Replace both with self-contained prose and durable references (issue/PR numbers, commit SHAs, GUARDRAILS.md/RUNBOOK.md) that stay resolvable after merge. * fix(search): harden SUPPORTED_FTS_STEMMERS against external mutation Type as ReadonlySet<string> to match this codebase's established convention for exported validation allowlists (EVAL_SERVER_TOOLS, STRUCTURAL_LABELS). Type-only change — no behavior change; both the internal .has() check and the sweep test's spread-iterate pattern continue to work unchanged. * docs(guardrails): fold Known-gap note into the LadybugDB Sign's Why label GUARDRAILS.md's own convention is strictly Trigger/Do/Why per Sign entry (stated in the file's header, followed by all 5 other entries). The new isDbBusyError gap note introduced a 4th label; fold it into Why instead, which is what it's actually explaining. * fix(test): run the multi-writer deadlock test on Windows too itLbugMultiwriter mirrored lbug-core-adapter.test.ts's win32 skip, but that pattern exists for a close-then-reopen-same-path lock lingering bug (kuzudb/kuzu#3872). This test never reopens the database — it holds connections open for the whole run — so the skip excluded the one test validating issue #2338's deadlock fix from the platform conn-lock.ts actually ships native bindings for. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
a classic lock‑order inversion between two TransactionManager mutexes, hit because TransactionContext::commit() (when it triggers an auto/force checkpoint) and TransactionContext::beginAutoTransaction() (→ beginTransactionInternal() → TransactionManager::beginTransaction()) acquire them in opposite orders.
The two locks involved:
commit() calls checkpoint()/tryCheckpoint() → checkpointNoLock() when shouldAutoCheckpoint or shouldForceCheckpoint is set. So the cycle is: