Fix four high-severity concurrency/correctness bugs (swarm, locks, lsp, config)#261
Conversation
Surfaced by a full-tree audit; all are latent (the suite passed) and live in race windows / error edges the tests don't exercise. Each ships a regression test. - swarm: liveAgents() now counts queued specs, so AdoptOrphans can no longer re-dispatch (double-execute) a task whose member is merely waiting for a free slot over the concurrency cap. - cron/hooks/oauth: stale-lock reclaim is now atomic (rename aside, then verify-and-restore if it turns out fresh) instead of a blind Remove that let two racers both reclaim and hold the lock; also falls through to the bounded wait instead of hot-spinning when a reclaim never wins. - lsp: a dead language-server session is evicted and restarted instead of being returned forever, so one server crash no longer permanently breaks diagnostics (and spuriously fails self-correct every iteration). - config: an API-key env var no longer overwrites a same-named compatible-transport provider's kind when no base URL is supplied.
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Free Run ID: 📒 Files selected for processing (11)
WalkthroughFour independent bug fixes: stale file lock reclamation in ChangesAtomic Stale-Lock Reclaim (cron, hooks, oauth)
LSP Dead Session Eviction
Config Provider Kind Preservation
Swarm Queue Inclusion in liveAgents
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Note 🎁 Summarized by CodeRabbit FreeYour organization is on the Free plan. CodeRabbit will generate a high-level summary and a walkthrough for each pull request. For a comprehensive line-by-line review, please upgrade your subscription to CodeRabbit Pro by visiting https://app.coderabbit.ai/login. Comment |
Zero automated PR reviewVerdict: No blockers found Blockers
Validation
ScopeHead: This deterministic review checks validation status and basic diff hygiene. A human reviewer still owns product judgment and design quality. |
Vasanthdev2004
left a comment
There was a problem hiding this comment.
Review: Fix four high-severity concurrency/correctness bugs (swarm, locks, lsp, config)
Reviewed the diff against the current source in internal/{config,cron,hooks,oauth,lsp,swarm}. All four fixes are correct, well-reasoned, and come with focused regression tests. Approving.
H1 — swarm liveAgents omits queued specs (P1)
internal/swarm/team.go:305 — liveAgents() returned only running members, not specs parked in t.queue. AdoptOrphans (lifecycle.go:144) uses that set to decide which tasks still have an owner; a queued spec is owned and about to launch but was missing from the set, so AdoptOrphans would re-dispatch its task → double execution.
Fix is correct: liveAgents now snapshots t.queue under the same t.mu lock it already holds for members, so the orphan check sees a consistent owner set. Capacity sizing len(members)+len(queue) is a minor nice-to-have. TestLiveAgentsIncludesQueuedSpecs covers the regression precisely.
H2 — env key clobbers compatible provider kind (P2)
internal/config/resolver.go:472 — when OPENAI_API_KEY (credentials only, no baseURL) is exported and a same-named openai-compatible proxy/gateway already exists, applyProviderEnv set profile.ProviderKind = ProviderKindOpenAI, and mergeProfile (resolver.go:382) overwrites the existing kind because next.ProviderKind != "". Result: a configured gateway is silently downgraded to the standard OpenAI transport on the next env-applied load.
Fix is correct: when baseURL == "" and a same-named provider exists, set profile.ProviderKind = "" so mergeProfile preserves the existing kind. The name used in the pre-mergeProvider scan matches the name providerMergeName resolves inside mergeProvider (both operate on profile.Name post-providerEnvTargetName), so the "found existing" decision is consistent with the merge target. TestEnvKeyPreservesExistingCompatibleProviderKind and TestEnvKeyCreatesStandardProviderWhenAbsent cover both branches.
H3 — stale lock reclaim breaks mutual exclusion (P1)
internal/{cron,hooks,oauth}/lock.go — the old _ = os.Remove(lockPath); continue let two racers both remove-and-recreate the same stale lock, so both believed they held it. Mutual exclusion was broken across processes whenever a stale lock was reclaimed under contention.
Fix is correct and materially improves safety. reclaimStaleLock renames the lock to a per-acquirer .<token> name; os.Rename of a given source is atomic and wins for exactly one racer (POSIX rename is atomic; on Windows the loser gets ERROR_FILE_NOT_FOUND/ERROR_ACCESS_DENIED). The loser returns false and falls through to the bounded wait instead of hot-spinning on a reclaim that can never win (good — the old continue could spin until the deadline even when the lock was genuinely held by someone else).
One observation worth flagging but not blocking:
-
P3 — "restore if fresh" branch is effectively unreachable. After the rename, the moved file's mtime is preserved and only gets staler; no other process can update the renamed file's mtime (its name is unique to this acquirer), and a new acquirer can't O_EXCL-create over the old path while it still exists. So
time.Since(info.ModTime()) <= staleAfteris false in practice for a file the caller already verified stale. The branch is harmless defensive code, not a defect. -
P3 — orphaned
.stale.<token>files. If the process dies between the rename and theos.Remove, the renamed file is left behind. It doesn't block anyone (it's not the lock path), but it could accumulate. Adefer os.Remove(reclaimed)insidereclaimStaleLockwould reap it even on a crash between rename and the explicit remove. Not blocking. -
P3 — three identical copies of
reclaimStaleLock.cron,hooks, andoautheach define the same function. A shared internal helper would reduce drift risk, but the packages are deliberately separate and the function is small; duplication is acceptable. Not blocking.
TestReclaimStaleLock covers the genuinely-stale, fresh, and missing cases well.
H4 — LSP manager never evicts a dead session (P1)
internal/lsp/manager.go:129 — sessionFor returned the cached session forever. Once the language server crashed / exited / hit a read error, Client.failPending closes c.closed, but the manager kept handing out the dead session, so every subsequent diagnostic failed permanently.
Fix is correct. Client.IsClosed() is a non-blocking <-c.closed read (safe to call under m.mu). On the dead path, the session is deleted under the lock, the lock is released, and server.Shutdown is called outside the lock (good — a slow shutdown can't stall other sessions). The start-race re-check now also guards existing.client.IsClosed(), so a dead entry installed by a racing starter is replaced rather than reused. All three branches (live / dead / absent) unlock exactly once; no double-unlock or fall-through-with-lock-held.
The one residual window — IsClosed() returns false, then the client dies before the session is returned to the caller — self-heals on the next sessionFor call (the current caller gets a single failed op, which is the correct behavior for a just-crashed server). TestSessionForEvictsDeadSession verifies reuse-while-live, eviction-after-close, and restart-count.
Verdict
Approve. All four are real bugs, the fixes are sound, the concurrency reasoning holds, and every fix ships with a regression test. The P3 nits above are follow-up cleanup, not blockers.
Summary
Four high-severity concurrency/correctness bugs surfaced by a full-tree audit. All are latent —
go test ./...passes — because they live in race windows and provider/error edges the suite doesn't exercise. Each fix ships with a regression test.Fixes
AdoptOrphansre-dispatches a task whose member is queued over the concurrency cap → the same delegated task runs twice (duplicate file/shell side effects, broken one-member-per-task invariant).liveAgents()now counts queued specs as live, so a merely-waiting member is never treated as an orphan.*_API_KEYenv var rewrites a same-named compatible-transport provider's kind, breaking proxy/gateway setups with a confusing "requires official baseURL" error.Remove+recreate; under contention two processes both take the lock — a mutual-exclusion violation (duplicate audit sequence numbers, reopened metadata read-modify-write race).Client.IsClosed()+ evict-and-restart insessionFor.Tests
TestLiveAgentsIncludesQueuedSpecsTestEnvKeyPreservesExistingCompatibleProviderKind,TestEnvKeyCreatesStandardProviderWhenAbsentTestReclaimStaleLock(atomic reclaim + the fresh-lock-restored protection)TestSessionForEvictsDeadSessiongo build ./...,go vet ./...,gofmt, and the fullgo test ./...are all green.Summary by CodeRabbit
Release Notes