Fix four high-severity concurrency/correctness bugs (swarm, locks, lsp, config) by gnanam1990 · Pull Request #261 · Gitlawb/zero

gnanam1990 · 2026-06-20T04:09:08Z

Summary

Four high-severity concurrency/correctness bugs surfaced by a full-tree audit. All are latent — go test ./... passes — because they live in race windows and provider/error edges the suite doesn't exercise. Each fix ships with a regression test.

Fixes

Area	Bug	Fix
swarm	`AdoptOrphans` re-dispatches a task whose member is queued over the concurrency cap → the same delegated task runs twice (duplicate file/shell side effects, broken one-member-per-task invariant).	`liveAgents()` now counts queued specs as live, so a merely-waiting member is never treated as an orphan.
config	An `_API_KEY` env var rewrites a same-named compatible-transport provider's kind, breaking proxy/gateway setups with a confusing "requires official baseURL" error.	When the env supplies credentials only (no base URL) and a same-named provider exists, leave the kind unset so the merge preserves the existing one.
cron / hooks / oauth	Stale-lock reclaim did a blind `Remove`+recreate; under contention two processes both take the lock — a mutual-exclusion violation (duplicate audit sequence numbers, reopened metadata read-modify-write race).	Atomic rename-with-verify reclaim (only one racer wins; a freshly-reacquired lock is restored, not stolen). Also falls through to the bounded wait instead of hot-spinning when a reclaim never wins.
lsp	A crashed/exited language-server session is never evicted → every later diagnostic errors forever, and self-correct fails on every changed file every iteration with no recovery short of a restart.	`Client.IsClosed()` + evict-and-restart in `sessionFor`.

Tests

TestLiveAgentsIncludesQueuedSpecs
TestEnvKeyPreservesExistingCompatibleProviderKind, TestEnvKeyCreatesStandardProviderWhenAbsent
TestReclaimStaleLock (atomic reclaim + the fresh-lock-restored protection)
TestSessionForEvictsDeadSession

go build ./..., go vet ./..., gofmt, and the full go test ./... are all green.

Summary by CodeRabbit

Release Notes

Bug Fixes
- Provider configuration now preserves existing compatible settings when environment credentials are supplied without a custom base URL.
- Improved stale lock handling across multiple components using atomic rename operations, preventing race conditions where live locks could be prematurely deleted.
- LSP session management now validates client connections and automatically recovers by starting fresh sessions when needed.
- Queued agent specs are now properly included in live agent tracking for correct orphan/adoption behavior.

Surfaced by a full-tree audit; all are latent (the suite passed) and live in race windows / error edges the tests don't exercise. Each ships a regression test.

- swarm: liveAgents() now counts queued specs, so AdoptOrphans can no longer re-dispatch (double-execute) a task whose member is merely waiting for a free slot over the concurrency cap.
- cron/hooks/oauth: stale-lock reclaim is now atomic (rename aside, then verify-and-restore if it turns out fresh) instead of a blind Remove that let two racers both reclaim and hold the lock; also falls through to the bounded wait instead of hot-spinning when a reclaim never wins.
- lsp: a dead language-server session is evicted and restarted instead of being returned forever, so one server crash no longer permanently breaks diagnostics (and spuriously fails self-correct every iteration).
- config: an API-key env var no longer overwrites a same-named compatible-transport provider's kind when no base URL is supplied.

coderabbitai · 2026-06-20T04:10:11Z

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Free

Run ID: 27f703fe-ad69-4d85-91d8-a3a42f9ae529

📥 Commits

Reviewing files that changed from the base of the PR and between 1c30f4c and 782b450.

📒 Files selected for processing (11)

internal/config/env_provider_kind_test.go
internal/config/resolver.go
internal/cron/lock.go
internal/cron/lock_reclaim_test.go
internal/hooks/lock.go
internal/lsp/client.go
internal/lsp/manager.go
internal/lsp/manager_test.go
internal/oauth/lock.go
internal/swarm/lifecycle_test.go
internal/swarm/team.go

Walkthrough

Four independent bug fixes: stale file lock reclamation in cron, hooks, and oauth packages switches from unconditional deletion to an atomic rename-verify-restore strategy; lsp.Manager.sessionFor gains dead-session eviction using a new Client.IsClosed method; config.applyProviderEnv preserves an existing provider's ProviderKind when only credentials are supplied via env; and Team.liveAgents includes queued MemberSpecs alongside running members.

Changes

Atomic Stale-Lock Reclaim (cron, hooks, oauth)

Layer / File(s)	Summary
`reclaimStaleLock` helper and integration into lock loops `internal/cron/lock.go`, `internal/hooks/lock.go`, `internal/oauth/lock.go`	Each lock file gets a new `reclaimStaleLock` helper that renames the stale lock to a token-unique `.stale.<token>` name, re-checks staleness after rename, restores the original if it became fresh, and removes it only when still stale. `lockJob`, `lockAudit`, and `acquireFileLock` now call this helper instead of directly removing the stale lock file.
Unit test for reclaim logic `internal/cron/lock_reclaim_test.go`	`TestReclaimStaleLock` asserts stale locks are removed, fresh locks are preserved, and missing files produce no reclaim.

LSP Dead Session Eviction

Layer / File(s)	Summary
`Client.IsClosed` and `Manager.sessionFor` eviction `internal/lsp/client.go`, `internal/lsp/manager.go`	`Client.IsClosed` non-blockingly reads the `c.closed` channel. `Manager.sessionFor` uses it at the cache-hit path (evicts and restarts if dead) and at the start-race-loser path (installs the new session instead of returning the dead cached one).
Dead session eviction test `internal/lsp/manager_test.go`	`TestSessionForEvictsDeadSession` verifies live-session reuse, eviction after `client.closed` is closed, and server restart count via a `starts` counter.

Config Provider Kind Preservation

Layer / File(s)	Summary
`applyProviderEnv` fix and tests `internal/config/resolver.go`, `internal/config/env_provider_kind_test.go`	When `baseURL` is empty, `applyProviderEnv` scans existing providers for a name match and clears `profile.ProviderKind` before merging, so `mergeProfile` retains the existing kind. Two tests cover both the preservation and the absent-provider creation paths.

Swarm Queue Inclusion in liveAgents

Layer / File(s)	Summary
`liveAgents` queue inclusion and regression test `internal/swarm/team.go`, `internal/swarm/lifecycle_test.go`	`Team.liveAgents` preallocates for `t.members + t.queue` and adds all queued `MemberSpec.ID`s. `TestLiveAgentsIncludesQueuedSpecs` verifies queued specs appear in the returned set under a `maxSize: 1` cap.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


github-actions · 2026-06-20T04:11:00Z

Zero automated PR review

Verdict: No blockers found

Blockers

None found.

Validation

[pass] Diff hygiene: git diff --check
[pass] Tests: go test ./...
[pass] Build: go run ./cmd/zero-release build
[pass] Smoke build: go run ./cmd/zero-release smoke

Scope

Head: 782b450e9cf2
Changed files (11): internal/config/env_provider_kind_test.go, internal/config/resolver.go, internal/cron/lock.go, internal/cron/lock_reclaim_test.go, internal/hooks/lock.go, internal/lsp/client.go, internal/lsp/manager.go, internal/lsp/manager_test.go, internal/oauth/lock.go, internal/swarm/lifecycle_test.go, internal/swarm/team.go

This deterministic review checks validation status and basic diff hygiene. A human reviewer still owns product judgment and design quality.

Vasanthdev2004

Review: Fix four high-severity concurrency/correctness bugs (swarm, locks, lsp, config)

Reviewed the diff against the current source in internal/{config,cron,hooks,oauth,lsp,swarm}. All four fixes are correct, well-reasoned, and come with focused regression tests. Approving.

H1 — swarm `liveAgents` omits queued specs (P1)

internal/swarm/team.go:305 — liveAgents() returned only running members, not specs parked in t.queue. AdoptOrphans (lifecycle.go:144) uses that set to decide which tasks still have an owner; a queued spec is owned and about to launch but was missing from the set, so AdoptOrphans would re-dispatch its task → double execution.

Fix is correct: liveAgents now snapshots t.queue under the same t.mu lock it already holds for members, so the orphan check sees a consistent owner set. Capacity sizing len(members)+len(queue) is a minor nice-to-have. TestLiveAgentsIncludesQueuedSpecs covers the regression precisely.
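
The owner-set reasoning above can be sketched in a few lines. This is a minimal illustration, not the code in internal/swarm/team.go: the `Team` and `MemberSpec` shapes, the `members` map, and the `orphaned` helper are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// MemberSpec and Team are hypothetical stand-ins for the real swarm types.
type MemberSpec struct{ ID string }

type Team struct {
	mu      sync.Mutex
	members map[string]struct{} // running members, keyed by ID
	queue   []MemberSpec        // specs waiting for a free slot under the cap
}

// liveAgents returns every agent ID that still owns a task, whether it is
// already running or only queued. Both sets are read under the same lock.
func (t *Team) liveAgents() map[string]struct{} {
	t.mu.Lock()
	defer t.mu.Unlock()
	live := make(map[string]struct{}, len(t.members)+len(t.queue))
	for id := range t.members {
		live[id] = struct{}{}
	}
	for _, spec := range t.queue {
		live[spec.ID] = struct{}{}
	}
	return live
}

// orphaned (hypothetical helper) returns the owners missing from the live set.
func orphaned(owners []string, live map[string]struct{}) []string {
	var out []string
	for _, o := range owners {
		if _, ok := live[o]; !ok {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	team := &Team{
		members: map[string]struct{}{"a": {}},
		queue:   []MemberSpec{{ID: "b"}},
	}
	// "b" is queued, so only "c" is reported as having no owner.
	fmt.Println(orphaned([]string{"a", "b", "c"}, team.liveAgents()))
}
```

If the queue loop were dropped, "b" would also be reported as orphaned, which is the double-dispatch described above.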

H2 — env key clobbers compatible provider kind (P2)

internal/config/resolver.go:472 — when OPENAI_API_KEY (credentials only, no baseURL) is exported and a same-named openai-compatible proxy/gateway already exists, applyProviderEnv set profile.ProviderKind = ProviderKindOpenAI, and mergeProfile (resolver.go:382) overwrites the existing kind because next.ProviderKind != "". Result: a configured gateway is silently downgraded to the standard OpenAI transport on the next env-applied load.

Fix is correct: when baseURL == "" and a same-named provider exists, set profile.ProviderKind = "" so mergeProfile preserves the existing kind. The name used in the pre-mergeProvider scan matches the name providerMergeName resolves inside mergeProvider (both operate on profile.Name post-providerEnvTargetName), so the "found existing" decision is consistent with the merge target. TestEnvKeyPreservesExistingCompatibleProviderKind and TestEnvKeyCreatesStandardProviderWhenAbsent cover both branches.
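
A small sketch of the merge behavior described above, under stated assumptions: `Profile`, the kind constants, and `envProfile` are simplified stand-ins invented for the example, and `mergeProfile` here only models the "non-empty field wins" rule mentioned in the review, not the real resolver.

```go
package main

import "fmt"

// Profile and the kind constants are hypothetical stand-ins for the config types.
type Profile struct {
	Name         string
	ProviderKind string
	BaseURL      string
	APIKey       string
}

const (
	kindOpenAI           = "openai"
	kindOpenAICompatible = "openai-compatible"
)

// mergeProfile overlays the non-empty fields of next onto base.
func mergeProfile(base, next Profile) Profile {
	if next.ProviderKind != "" {
		base.ProviderKind = next.ProviderKind
	}
	if next.BaseURL != "" {
		base.BaseURL = next.BaseURL
	}
	if next.APIKey != "" {
		base.APIKey = next.APIKey
	}
	return base
}

// envProfile builds the profile derived from an API-key env var. When the env
// supplies credentials only and a same-named provider already exists, the kind
// is left empty so the merge keeps the existing one.
func envProfile(existing []Profile, name, apiKey, baseURL string) Profile {
	p := Profile{Name: name, ProviderKind: kindOpenAI, APIKey: apiKey, BaseURL: baseURL}
	if baseURL == "" {
		for _, e := range existing {
			if e.Name == name {
				p.ProviderKind = ""
				break
			}
		}
	}
	return p
}

func main() {
	gateway := Profile{Name: "openai", ProviderKind: kindOpenAICompatible, BaseURL: "https://gateway.example/v1"}
	existing := []Profile{gateway}

	merged := mergeProfile(gateway, envProfile(existing, "openai", "sk-test", ""))
	fmt.Println(merged.ProviderKind, merged.BaseURL) // the gateway keeps its kind and URL

	created := envProfile(nil, "openai", "sk-test", "")
	fmt.Println(created.ProviderKind) // no existing provider: the standard kind is used
}
```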

H3 — stale lock reclaim breaks mutual exclusion (P1)

internal/{cron,hooks,oauth}/lock.go — the old _ = os.Remove(lockPath); continue let two racers both remove-and-recreate the same stale lock, so both believed they held it. Mutual exclusion was broken across processes whenever a stale lock was reclaimed under contention.

Fix is correct and materially improves safety. reclaimStaleLock renames the lock to a per-acquirer .stale.<token> name; os.Rename of a given source is atomic and succeeds for exactly one racer (POSIX rename is atomic; on Windows the loser gets ERROR_FILE_NOT_FOUND/ERROR_ACCESS_DENIED). The loser returns false and falls through to the bounded wait instead of hot-spinning on a reclaim that can never win. That is an improvement in its own right: the old continue could spin until the deadline even when the lock was genuinely held by someone else.

Three observations worth flagging, none blocking:

P3 — "restore if fresh" branch is effectively unreachable. After the rename, the moved file's mtime is preserved and only gets staler; no other process can update the renamed file's mtime (its name is unique to this acquirer), and a new acquirer can't O_EXCL-create over the old path while it still exists. So time.Since(info.ModTime()) <= staleAfter is false in practice for a file the caller already verified stale. The branch is harmless defensive code, not a defect.
P3 — orphaned .stale.<token> files. If the process dies between the rename and the os.Remove, the renamed file is left behind. It doesn't block anyone (it's not the lock path), but it could accumulate. A defer os.Remove(reclaimed) inside reclaimStaleLock would reap it on an early return or a panic, but not on a hard kill, since deferred calls do not run when the process is terminated; that case would need a later sweep of old .stale.* files. Not blocking.
P3 — three identical copies of reclaimStaleLock. cron, hooks, and oauth each define the same function. A shared internal helper would reduce drift risk, but the packages are deliberately separate and the function is small; duplication is acceptable. Not blocking.

TestReclaimStaleLock covers the genuinely-stale, fresh, and missing cases well.
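
The rename-verify-restore sequence can be sketched as below. This is an illustration under assumptions, not the helper from the PR: the signature, the `token` parameter, and the `demo` function are invented for the example, and the restore step uses a plain rename back, which the real code may handle differently.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// reclaimStaleLock moves the lock aside under a name unique to this caller,
// re-checks its age, and removes it only if it is still stale. It reports
// true only when this caller removed a stale lock.
func reclaimStaleLock(lockPath, token string, staleAfter time.Duration) bool {
	reclaimed := lockPath + ".stale." + token
	// Only one racer can rename a given source file; the others get an error.
	if err := os.Rename(lockPath, reclaimed); err != nil {
		return false
	}
	info, err := os.Stat(reclaimed)
	if err != nil {
		return false
	}
	if time.Since(info.ModTime()) <= staleAfter {
		// The file we moved is a fresh lock: put it back instead of stealing it.
		// Simplification: a plain rename back is assumed to be good enough here.
		_ = os.Rename(reclaimed, lockPath)
		return false
	}
	return os.Remove(reclaimed) == nil
}

// demo (hypothetical) runs the stale, fresh, and missing cases in a temp dir.
func demo() (stale, fresh, missing bool, err error) {
	dir, err := os.MkdirTemp("", "lockdemo")
	if err != nil {
		return
	}
	defer os.RemoveAll(dir)
	lock := filepath.Join(dir, "job.lock")

	// Case 1: a lock last modified an hour ago is stale and gets reclaimed.
	if err = os.WriteFile(lock, nil, 0o600); err != nil {
		return
	}
	old := time.Now().Add(-time.Hour)
	if err = os.Chtimes(lock, old, old); err != nil {
		return
	}
	stale = reclaimStaleLock(lock, "t1", time.Minute)

	// Case 2: a freshly written lock is restored, not stolen.
	if err = os.WriteFile(lock, nil, 0o600); err != nil {
		return
	}
	fresh = reclaimStaleLock(lock, "t2", time.Minute)

	// Case 3: with no lock file there is nothing to reclaim.
	if err = os.Remove(lock); err != nil {
		return
	}
	missing = reclaimStaleLock(lock, "t3", time.Minute)
	return
}

func main() {
	stale, fresh, missing, err := demo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println(stale, fresh, missing)
}
```

A caller that gets false back should fall through to its bounded wait rather than retry immediately, matching the behavior the review describes.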

H4 — LSP manager never evicts a dead session (P1)

internal/lsp/manager.go:129 — sessionFor returned the cached session forever. Once the language server crashed / exited / hit a read error, Client.failPending closes c.closed, but the manager kept handing out the dead session, so every subsequent diagnostic failed permanently.

Fix is correct. Client.IsClosed() is a non-blocking <-c.closed read (safe to call under m.mu). On the dead path, the session is deleted under the lock, the lock is released, and server.Shutdown is called outside the lock (good — a slow shutdown can't stall other sessions). The start-race re-check now also guards existing.client.IsClosed(), so a dead entry installed by a racing starter is replaced rather than reused. All three branches (live / dead / absent) unlock exactly once; no double-unlock or fall-through-with-lock-held.

The one residual window — IsClosed() returns false, then the client dies before the session is returned to the caller — self-heals on the next sessionFor call (the current caller gets a single failed op, which is the correct behavior for a just-crashed server). TestSessionForEvictsDeadSession verifies reuse-while-live, eviction-after-close, and restart-count.
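
The eviction path can be sketched as follows. All types here are hypothetical stand-ins for the lsp package: the `closed` channel, the `starts` counter, and the `start` method are assumptions for the example, and the sketch omits the server shutdown that the real code performs outside the lock.

```go
package main

import (
	"fmt"
	"sync"
)

// Client is a hypothetical stand-in; closed is closed when the server dies.
type Client struct{ closed chan struct{} }

// IsClosed reports whether closed has been closed, without blocking.
func (c *Client) IsClosed() bool {
	select {
	case <-c.closed:
		return true
	default:
		return false
	}
}

type session struct{ client *Client }

type Manager struct {
	mu       sync.Mutex
	sessions map[string]*session
	starts   int // number of servers started, kept only for the demo
}

// start simulates launching a language server.
func (m *Manager) start() *session {
	m.mu.Lock()
	m.starts++
	m.mu.Unlock()
	return &session{client: &Client{closed: make(chan struct{})}}
}

// sessionFor returns a live session for key, evicting a dead cached one.
func (m *Manager) sessionFor(key string) *session {
	m.mu.Lock()
	if s, ok := m.sessions[key]; ok {
		if !s.client.IsClosed() {
			m.mu.Unlock()
			return s // cache hit on a live session
		}
		delete(m.sessions, key) // evict the dead session under the lock
	}
	m.mu.Unlock()

	fresh := m.start() // started outside the lock so a slow start cannot stall others

	m.mu.Lock()
	defer m.mu.Unlock()
	if existing, ok := m.sessions[key]; ok && !existing.client.IsClosed() {
		return existing // lost the start race to a live session
	}
	m.sessions[key] = fresh // nothing cached, or the cached entry is dead
	return fresh
}

func main() {
	m := &Manager{sessions: make(map[string]*session)}
	a := m.sessionFor("go")
	b := m.sessionFor("go")
	fmt.Println(a == b, m.starts) // the live session is reused

	close(a.client.closed) // simulate a server crash
	c := m.sessionFor("go")
	fmt.Println(c != a, m.starts) // the dead session is replaced by a new start
}
```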

Verdict

Approve. All four are real bugs, the fixes are sound, the concurrency reasoning holds, and every fix ships with a regression test. The P3 nits above are follow-up cleanup, not blockers.

Vasanthdev2004 approved these changes Jun 20, 2026


gnanam1990 merged commit b0a280e into main Jun 20, 2026
7 checks passed

Vasanthdev2004 deleted the fix/audit-batch-1-highs branch June 28, 2026 08:27

gnanam1990 mentioned this pull request Jul 3, 2026

fix(oauth): retry secret-lock on Windows ERROR_ACCESS_DENIED (delete-pending race) #445

Open

4 tasks

gnanam1990 merged 1 commit into main from fix/audit-batch-1-highs