Skip to content

fix: resolve the deadlock by using atomics#605

Merged
adsharma merged 1 commit into
mainfrom
write_deadlock
Jun 21, 2026
Merged

fix: resolve the deadlock by using atomics#605
adsharma merged 1 commit into
mainfrom
write_deadlock

Conversation

@adsharma

Copy link
Copy Markdown
Contributor

a classic lock‑order inversion between two TransactionManager mutexes, hit because TransactionContext::commit() (when it triggers an auto/force checkpoint) and TransactionContext::beginAutoTransaction() (→ beginTransactionInternal() → TransactionManager::beginTransaction()) acquire them in opposite orders.

The two locks involved:

  • mtxForSerializingPublicFunctionCalls ("public‑function" lock)
  • mtxForStartingNewTransactions ("write gate" / new‑transaction lock)

commit() calls checkpoint()/tryCheckpoint() → checkpointNoLock() when shouldAutoCheckpoint or shouldForceCheckpoint is set. So the cycle is:

  • Thread A (a writer whose commit() just fired an auto‑checkpoint): holds mtxForStartingNewTransactions (the write gate, kept after the drain loop exits), then tries to acquire mtxForSerializingPublicFunctionCalls to read lastTimestamp (line 305).
  • Thread B (another concurrent writer entering beginAutoTransaction() → beginTransaction(WRITE)): holds mtxForSerializingPublicFunctionCalls, then tries to acquire mtxForStartingNewTransactions (line 35).

@adsharma

Copy link
Copy Markdown
Contributor Author

Tested by importing the ladybug source repo into lscope as of this commit: adsharma/lscope@fd6cc12

python3 main.py --index /path/to/ladybug --language cpp --workers 8
Ingested 2602 file(s), 33273 semantic node(s), and 87726 resolved call(s) into test.db using 8 analysis thread(s)
  cpp: 2602 file(s)
       65.54 real        83.17 user         6.06 sys
          2065891328  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              142518  page reclaims
                  18  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
               17507  voluntary context switches
              950821  involuntary context switches
       1662480220213  instructions retired
        353909380349  cycles elapsed
          1539655816  peak memory footprint

@adsharma adsharma merged commit d7b9f19 into main Jun 21, 2026
4 checks passed
@adsharma adsharma deleted the write_deadlock branch June 21, 2026 18:27
magyargergo added a commit to abhigyanpatwari/GitNexus that referenced this pull request Jul 1, 2026
…#2340)

* chore(deps): bump @ladybugdb/core to 0.18.0

Pins the release containing LadybugDB/ladybug#605 (TransactionManager
lock-order-inversion deadlock fix). Checked for known post-release
regressions specific to 0.18.0 via the Ladybug issue tracker — none found.

* fix(lbug): re-validate version-coupled comments and regexes for 0.18.0

Extends the LADYBUGDB-CONTRACT re-validation to two spots the marker
convention doesn't catch (bridge-db.ts's LBUG_OPEN_RETRY_PATTERNS,
conn-lock.ts's serialization rationale). Confirms via upstream source
diff (v0.16.1..v0.18.0) that every matched error-text string is
unchanged; conn-lock.ts's rationale is unaffected by #612/#623 since
neither addresses concurrent queries on one connection. Adds a
stemmer-sweep test proving the bundled 0.18.0 FTS extension accepts
every entry in SUPPORTED_FTS_STEMMERS, not just the default porter.

A live-trigger test for isMissingShadowSidecarError was attempted but
abandoned after empirical probing showed it isn't reliably
reproducible (even a SIGKILL-simulated crash didn't reproduce the
error on reopen) — documented as inspection-verified instead of
overclaiming test coverage that doesn't exist.

* test(lbug): add concurrent multi-connection deadlock stress test (#2338)

Directly validates LadybugDB/ladybug#605 — the TransactionManager
lock-order-inversion deadlock between a commit()-triggered checkpoint
and a concurrent beginAutoTransaction() — under a shape close to
GitNexus's real concurrent-writer load, independent of conn-lock.ts's
app-level serialization.

Comparison run against 0.17.1 (pre-fix): 1 of 4 runs hung for the full
60s timeout, a direct reproduction of the deadlock. 9 consecutive runs
against 0.18.0 (post-fix) all passed cleanly. Production is unchanged —
conn-lock.ts still serializes every write; this test validates the
engine-level fix without shipping multi-writer as a default.

* fix(test): address code review findings in multiwriter deadlock test

- Reuse lbug-config.ts's createLbugDatabase (via GITNEXUS_WAL_CHECKPOINT_THRESHOLD)
  instead of a hand-duplicated 9-arg raw constructor call whose stated
  justification (needing to bypass createLbugDatabase for the threshold
  override) was incorrect — the env var already provides it.
- Close every QueryResult via the existing closeQueryResults helper
  (write loop, read loop, verify query, setup query) instead of leaking
  native cursors, matching lbug-adapter.ts's established pattern.
- Move all cleanup (timers, connections, db close, env var restore) into
  the outer finally block so it runs on every exit path, not just the
  happy path — a timeout or a writer exhausting its retry budget no
  longer leaves dangling timers/connections/abandoned query loops.

Verified: 8 consecutive runs after the refactor, all passing cleanly.

Found via 8-angle parallel code review (medium effort); the two other
findings (isDbBusyError not recognizing LadybugDB's 'Only one write
transaction' message, and shadow-file poll timing sensitivity) are
noted in the PR description as residual — the first is a production-code
change beyond this validation test's scope, the second is inherent to
observing a transient native sidecar file and not cleanly fixable
without overengineering.

* fix(test): apply ce-code-review autofix findings

Fixes from an 8-persona parallel review round (correctness/testing/
maintainability/project-standards/reliability/adversarial/agent-native/
learnings):

- Extract the duplicated skipUnlessFtsAvailable/FTS_UNAVAILABLE_NOTE
  helper (previously copy-pasted between lbug-core-adapter.test.ts and
  fts-stemmer-sweep.test.ts) into a shared test/helpers/fts-availability.ts.
- Fix a native connection leak: verifyConn in the deadlock test's final
  verification block is now pushed into the readers array the outer
  finally already closes, so it's cleaned up even if the count query
  throws.
- Fix a latent TypeScript type error (tsconfig.test.json catches it,
  tsconfig.json doesn't): conn.query() types as QueryResult |
  QueryResult[]; narrow to the single-result case before calling
  .getAll() rather than assuming the array branch never happens.
- Replace repeated inline InstanceType<typeof import(...)> expressions
  with local LbugDatabase/LbugConnection type aliases.

Verified: 12 consecutive runs of the deadlock test all pass, full
lbug-db project (336 tests) green.

Cross-reviewer-confirmed but left as residual (design judgment calls,
not mechanical fixes) for the PR description: isDbBusyError doesn't
recognize LadybugDB's 'Only one write transaction' message (pre-existing
production gap, confirmed independently by 3 reviewers); the deadlock
test's timeout path doesn't cancel in-flight writer/reader loops before
closing connections; the reader loop has no bounded retry for transient
errors during the race window; pinning @ladybugdb/core with a caret
range trades automatic patch updates for less re-validation certainty.

* docs: trim task-referencing JSDoc artifacts, add operator notes

The U2 re-validation pass left verbose 'Re-validated on the
0.17.0->0.18.0 bump (#2338): ...' paragraphs stacked onto 5 production
files' docstrings, alongside the already-updated version numbers. That
narrative (SIGKILL-probe methodology, diff commands run, issue
cross-references) belongs in the PR description, not in code comments
that will accumulate a new paragraph on every future bump and confuse
readers who just want the current fact. Trimmed each to state only the
durable, current-state fact:

- lbug-config.ts, sidecar-recovery.ts, lbug-adapter.ts, bridge-db.ts:
  dropped the bump-narrative paragraphs; kept only genuinely durable
  notes (e.g., which matchers are inspection-verified vs live-tested,
  what upstream wording changed).
- conn-lock.ts: compressed a 12-line, 3-issue-number enumeration into
  2 lines stating the current conclusion (no upstream 0.18.0 fix
  addresses the same-connection-concurrent-query risk this lock
  guards against).

Also added operator-facing notes to GUARDRAILS.md and RUNBOOK.md's
existing 'LadybugDB lock' sections: an isDbBusyError gap found during
this validation (LadybugDB's 'Only one write transaction...' message
isn't recognized by our busy/lock retry matcher) means that specific
error can surface unretried. Documented so it's recognized as the same
single-writer conflict, not a new failure mode.

* refactor(test): use gitnexus-shared's withRetry in multiwriter deadlock test

Replaces the hand-rolled writeWithRetry/sleep loop with the existing
gitnexus-shared retry helper (already used by embeddings/hf-env.ts)
instead of duplicating the pattern.

* fix(test): guarantee non-zero retry delay in deadlock test's writer loop

withRetry's isRetryable previously returned {retry: bool} with no afterMs,
so computeBackoffMs's exponential-jitter formula gave a deterministic
zero-delay on the first retry (floor(random()*1) is always 0 at attempt=0).
This contradicted the file's own documented tuning, which specifically
needs a non-zero 1-3ms delay to avoid tripping a different native guard.
Return an explicit afterMs override on the retryable branch instead.

* docs(test): remove dangling doc references from deadlock test JSDoc

The JSDoc pointed to a local-session-only docs/plans/2026-07-01-001-...
path (docs/ is repo-gitignored, so this never existed for anyone but the
implementing session) and to "the PR description" as a source of truth
that stops being current once the PR merges. Replace both with
self-contained prose and durable references (issue/PR numbers, commit
SHAs, GUARDRAILS.md/RUNBOOK.md) that stay resolvable after merge.

* fix(search): harden SUPPORTED_FTS_STEMMERS against external mutation

Type as ReadonlySet<string> to match this codebase's established
convention for exported validation allowlists (EVAL_SERVER_TOOLS,
STRUCTURAL_LABELS). Type-only change — no behavior change; both the
internal .has() check and the sweep test's spread-iterate pattern
continue to work unchanged.

* docs(guardrails): fold Known-gap note into the LadybugDB Sign's Why label

GUARDRAILS.md's own convention is strictly Trigger/Do/Why per Sign
entry (stated in the file's header, followed by all 5 other entries).
The new isDbBusyError gap note introduced a 4th label; fold it into
Why instead, which is what it's actually explaining.

* fix(test): run the multi-writer deadlock test on Windows too

itLbugMultiwriter mirrored lbug-core-adapter.test.ts's win32 skip, but
that pattern exists for a close-then-reopen-same-path lock lingering
bug (kuzudb/kuzu#3872). This test never reopens the database — it
holds connections open for the whole run — so the skip excluded the
one test validating issue #2338's deadlock fix from the platform
conn-lock.ts actually ships native bindings for.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant