
Fix four high-severity concurrency/correctness bugs (swarm, locks, lsp, config) #261

Merged
gnanam1990 merged 1 commit into main from fix/audit-batch-1-highs on Jun 20, 2026

gnanam1990 commented Jun 20, 2026

Collaborator

Summary

Four high-severity concurrency/correctness bugs surfaced by a full-tree audit. All are latent (go test ./... passes) because they live in race windows and provider/error edges the suite doesn't exercise. Each fix ships with a regression test.

Fixes

| Area | Bug | Fix |
| --- | --- | --- |
| swarm | AdoptOrphans re-dispatches a task whose member is queued over the concurrency cap → the same delegated task runs twice (duplicate file/shell side effects, broken one-member-per-task invariant). | liveAgents() now counts queued specs as live, so a merely-waiting member is never treated as an orphan. |
| config | An *_API_KEY env var rewrites a same-named compatible-transport provider's kind, breaking proxy/gateway setups with a confusing "requires official baseURL" error. | When the env supplies credentials only (no base URL) and a same-named provider exists, leave the kind unset so the merge preserves the existing one. |
| cron / hooks / oauth | Stale-lock reclaim did a blind Remove+recreate; under contention two processes both take the lock, a mutual-exclusion violation (duplicate audit sequence numbers, reopened metadata read-modify-write race). | Atomic rename-with-verify reclaim (only one racer wins; a freshly-reacquired lock is restored, not stolen). Also falls through to the bounded wait instead of hot-spinning when a reclaim never wins. |
| lsp | A crashed/exited language-server session is never evicted → every later diagnostic errors forever, and self-correct fails on every changed file every iteration with no recovery short of a restart. | Client.IsClosed() + evict-and-restart in sessionFor. |

Tests

  • TestLiveAgentsIncludesQueuedSpecs
  • TestEnvKeyPreservesExistingCompatibleProviderKind, TestEnvKeyCreatesStandardProviderWhenAbsent
  • TestReclaimStaleLock (atomic reclaim + the fresh-lock-restored protection)
  • TestSessionForEvictsDeadSession

go build ./..., go vet ./..., gofmt, and the full go test ./... are all green.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Provider configuration now preserves existing compatible settings when environment credentials are supplied without a custom base URL.
    • Improved stale lock handling across multiple components using atomic rename operations, preventing race conditions where live locks could be prematurely deleted.
    • LSP session management now validates client connections and automatically recovers by starting fresh sessions when needed.
    • Queued agent specs are now properly included in live agent tracking for correct orphan/adoption behavior.

Surfaced by a full-tree audit; all are latent (the suite passed) and live
in race windows / error edges the tests don't exercise. Each ships a
regression test.

- swarm: liveAgents() now counts queued specs, so AdoptOrphans can no longer
  re-dispatch (double-execute) a task whose member is merely waiting for a
  free slot over the concurrency cap.
- cron/hooks/oauth: stale-lock reclaim is now atomic (rename aside, then
  verify-and-restore if it turns out fresh) instead of a blind Remove that
  let two racers both reclaim and hold the lock; also falls through to the
  bounded wait instead of hot-spinning when a reclaim never wins.
- lsp: a dead language-server session is evicted and restarted instead of
  being returned forever, so one server crash no longer permanently breaks
  diagnostics (and spuriously fails self-correct every iteration).
- config: an API-key env var no longer overwrites a same-named
  compatible-transport provider's kind when no base URL is supplied.

coderabbitai Bot commented Jun 20, 2026

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Free

Run ID: 27f703fe-ad69-4d85-91d8-a3a42f9ae529

📥 Commits

Reviewing files that changed from the base of the PR and between 1c30f4c and 782b450.

📒 Files selected for processing (11)
  • internal/config/env_provider_kind_test.go
  • internal/config/resolver.go
  • internal/cron/lock.go
  • internal/cron/lock_reclaim_test.go
  • internal/hooks/lock.go
  • internal/lsp/client.go
  • internal/lsp/manager.go
  • internal/lsp/manager_test.go
  • internal/oauth/lock.go
  • internal/swarm/lifecycle_test.go
  • internal/swarm/team.go

Walkthrough

Four independent bug fixes: stale file lock reclamation in cron, hooks, and oauth packages switches from unconditional deletion to an atomic rename-verify-restore strategy; lsp.Manager.sessionFor gains dead-session eviction using a new Client.IsClosed method; config.applyProviderEnv preserves an existing provider's ProviderKind when only credentials are supplied via env; and Team.liveAgents includes queued MemberSpecs alongside running members.

Changes

Atomic Stale-Lock Reclaim (cron, hooks, oauth)

| Layer / File(s) | Summary |
| --- | --- |
| reclaimStaleLock helper and integration into lock loops (internal/cron/lock.go, internal/hooks/lock.go, internal/oauth/lock.go) | Each lock file gets a new reclaimStaleLock helper that renames the stale lock to a token-unique .stale.<token> name, re-checks staleness after rename, restores the original if it became fresh, and removes it only when still stale. lockJob, lockAudit, and acquireFileLock now call this helper instead of directly removing the stale lock file. |
| Unit test for reclaim logic (internal/cron/lock_reclaim_test.go) | TestReclaimStaleLock asserts stale locks are removed, fresh locks are preserved, and missing files produce no reclaim. |

LSP Dead Session Eviction

| Layer / File(s) | Summary |
| --- | --- |
| Client.IsClosed and Manager.sessionFor eviction (internal/lsp/client.go, internal/lsp/manager.go) | Client.IsClosed non-blockingly reads the c.closed channel. Manager.sessionFor uses it at the cache-hit path (evicts and restarts if dead) and at the start-race-loser path (installs the new session instead of returning the dead cached one). |
| Dead session eviction test (internal/lsp/manager_test.go) | TestSessionForEvictsDeadSession verifies live-session reuse, eviction after client.closed is closed, and server restart count via a starts counter. |

Config Provider Kind Preservation

| Layer / File(s) | Summary |
| --- | --- |
| applyProviderEnv fix and tests (internal/config/resolver.go, internal/config/env_provider_kind_test.go) | When baseURL is empty, applyProviderEnv scans existing providers for a name match and clears profile.ProviderKind before merging, so mergeProfile retains the existing kind. Two tests cover both the preservation and the absent-provider creation paths. |

Swarm Queue Inclusion in liveAgents

| Layer / File(s) | Summary |
| --- | --- |
| liveAgents queue inclusion and regression test (internal/swarm/team.go, internal/swarm/lifecycle_test.go) | Team.liveAgents preallocates for t.members + t.queue and adds all queued MemberSpec.IDs. TestLiveAgentsIncludesQueuedSpecs verifies queued specs appear in the returned set under a maxSize: 1 cap. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes



@github-actions

Contributor

Zero automated PR review

Verdict: No blockers found

Blockers

  • None found.

Validation

  • [pass] Diff hygiene: git diff --check
  • [pass] Tests: go test ./...
  • [pass] Build: go run ./cmd/zero-release build
  • [pass] Smoke build: go run ./cmd/zero-release smoke

Scope

Head: 782b450e9cf2
Changed files (11): internal/config/env_provider_kind_test.go, internal/config/resolver.go, internal/cron/lock.go, internal/cron/lock_reclaim_test.go, internal/hooks/lock.go, internal/lsp/client.go, internal/lsp/manager.go, internal/lsp/manager_test.go, internal/oauth/lock.go, internal/swarm/lifecycle_test.go, internal/swarm/team.go

This deterministic review checks validation status and basic diff hygiene. A human reviewer still owns product judgment and design quality.

@Vasanthdev2004 Vasanthdev2004 left a comment

Collaborator

Review: Fix four high-severity concurrency/correctness bugs (swarm, locks, lsp, config)

Reviewed the diff against the current source in internal/{config,cron,hooks,oauth,lsp,swarm}. All four fixes are correct, well-reasoned, and come with focused regression tests. Approving.

H1 — swarm liveAgents omits queued specs (P1)

internal/swarm/team.go:305: liveAgents() returned only running members, not specs parked in t.queue. AdoptOrphans (lifecycle.go:144) uses that set to decide which tasks still have an owner; a queued spec is owned and about to launch but was missing from the set, so AdoptOrphans would re-dispatch its task → double execution.

Fix is correct: liveAgents now snapshots t.queue under the same t.mu lock it already holds for members, so the orphan check sees a consistent owner set. Capacity sizing len(members)+len(queue) is a minor nice-to-have. TestLiveAgentsIncludesQueuedSpecs covers the regression precisely.

H2 — env key clobbers compatible provider kind (P2)

internal/config/resolver.go:472: when OPENAI_API_KEY (credentials only, no baseURL) is exported and a same-named openai-compatible proxy/gateway already exists, applyProviderEnv set profile.ProviderKind = ProviderKindOpenAI, and mergeProfile (resolver.go:382) then overwrote the existing kind because next.ProviderKind != "". Result: a configured gateway is silently downgraded to the standard OpenAI transport on the next env-applied load.

Fix is correct: when baseURL == "" and a same-named provider exists, set profile.ProviderKind = "" so mergeProfile preserves the existing kind. The name used in the pre-mergeProvider scan matches the name providerMergeName resolves inside mergeProvider (both operate on profile.Name post-providerEnvTargetName), so the "found existing" decision is consistent with the merge target. TestEnvKeyPreservesExistingCompatibleProviderKind and TestEnvKeyCreatesStandardProviderWhenAbsent cover both branches.
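A minimal sketch of the merge rule, assuming a simplified Profile type and a map of providers. The field names, the string values of the kinds, and the applyEnvKey helper are illustrative assumptions, not the contents of resolver.go.

```go
package main

import "fmt"

// ProviderKind values and the Profile shape are assumptions for illustration.
type ProviderKind string

const (
	ProviderKindOpenAI           ProviderKind = "openai"
	ProviderKindOpenAICompatible ProviderKind = "openai-compatible"
)

type Profile struct {
	Name         string
	ProviderKind ProviderKind
	BaseURL      string
	APIKey       string
}

// mergeProfile overlays the non-empty fields of next onto prev, so an empty
// ProviderKind in next leaves the existing kind untouched.
func mergeProfile(prev, next Profile) Profile {
	if next.ProviderKind != "" {
		prev.ProviderKind = next.ProviderKind
	}
	if next.BaseURL != "" {
		prev.BaseURL = next.BaseURL
	}
	if next.APIKey != "" {
		prev.APIKey = next.APIKey
	}
	return prev
}

// applyEnvKey is a hypothetical stand-in for the env-application step.
func applyEnvKey(providers map[string]Profile, name, apiKey, baseURL string) {
	profile := Profile{Name: name, ProviderKind: ProviderKindOpenAI, BaseURL: baseURL, APIKey: apiKey}
	prev, found := providers[name]
	if baseURL == "" && found {
		// Credentials only: clear the kind so the merge keeps the configured one.
		profile.ProviderKind = ""
	}
	if found {
		providers[name] = mergeProfile(prev, profile)
		return
	}
	providers[name] = profile
}

func main() {
	providers := map[string]Profile{
		"openai": {Name: "openai", ProviderKind: ProviderKindOpenAICompatible, BaseURL: "https://gateway.example/v1"},
	}
	applyEnvKey(providers, "openai", "sk-test", "") // existing gateway keeps its kind
	applyEnvKey(providers, "other", "sk-test", "")  // absent provider gets the standard kind
	fmt.Println(providers["openai"].ProviderKind, providers["other"].ProviderKind)
}
```

The two calls in main mirror the two regression tests named above: one preserves an existing compatible provider, the other creates a standard provider when none exists.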

H3 — stale lock reclaim breaks mutual exclusion (P1)

internal/{cron,hooks,oauth}/lock.go — the old _ = os.Remove(lockPath); continue let two racers both remove-and-recreate the same stale lock, so both believed they held it. Mutual exclusion was broken across processes whenever a stale lock was reclaimed under contention.

Fix is correct and materially improves safety. reclaimStaleLock renames the lock to a per-acquirer .stale.<token> name; os.Rename of a given source is atomic and wins for exactly one racer (POSIX rename is atomic; on Windows the loser gets ERROR_FILE_NOT_FOUND/ERROR_ACCESS_DENIED). The loser returns false and falls through to the bounded wait instead of hot-spinning on a reclaim that can never win (good: the old continue could spin until the deadline even when the lock was genuinely held by someone else).
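The single-winner property of rename is easy to observe in isolation. The sketch below is a standalone demonstration, not project code: raceRename is a hypothetical helper that has several goroutines rename the same source file to distinct destinations and counts how many succeed.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
	"sync/atomic"
)

// raceRename lets n goroutines try to rename the same file to n distinct
// names and returns how many renames succeeded.
func raceRename(n int) (int, error) {
	dir, err := os.MkdirTemp("", "rename-race")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	src := filepath.Join(dir, "job.lock")
	if err := os.WriteFile(src, []byte("stale"), 0o600); err != nil {
		return 0, err
	}

	var winners int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each racer uses its own destination, like a per-acquirer token.
			dst := fmt.Sprintf("%s.stale.%d", src, i)
			if os.Rename(src, dst) == nil {
				atomic.AddInt32(&winners, 1)
			}
		}(i)
	}
	wg.Wait()
	return int(atomic.LoadInt32(&winners)), nil
}

func main() {
	winners, err := raceRename(8)
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("successful renames:", winners)
}
```

Once one racer has moved the source, the others find no file at the source path and their rename fails, so the count should be one regardless of how many goroutines compete.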

Three observations worth flagging, none blocking:

  • P3 — "restore if fresh" branch is effectively unreachable. After the rename, the moved file's mtime is preserved and only gets staler; no other process can update the renamed file's mtime (its name is unique to this acquirer), and a new acquirer can't O_EXCL-create over the old path while it still exists. So time.Since(info.ModTime()) <= staleAfter is false in practice for a file the caller already verified stale. The branch is harmless defensive code, not a defect.

  • P3: orphaned .stale.<token> files. If the process dies between the rename and the os.Remove, the renamed file is left behind. It doesn't block anyone (it's not the lock path), but it could accumulate. A defer os.Remove(reclaimed) inside reclaimStaleLock would reap it on early-return and panic paths, though a defer cannot run if the process is killed outright. Not blocking.

  • P3 — three identical copies of reclaimStaleLock. cron, hooks, and oauth each define the same function. A shared internal helper would reduce drift risk, but the packages are deliberately separate and the function is small; duplication is acceptable. Not blocking.

TestReclaimStaleLock covers the genuinely-stale, fresh, and missing cases well.
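For readers who want the whole reclaim flow in one place, here is a rough sketch of a rename-verify-restore helper together with the three cases the test exercises. It is written from the description in this review, not copied from the PR: the signature, the use of os.Link for the restore step, and the writeLock and runScenarios helpers are all assumptions.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// reclaimStaleLock moves the lock aside under a token-unique name, re-checks
// its age, and reports true only if this caller removed a stale lock.
func reclaimStaleLock(lockPath, token string, staleAfter time.Duration) bool {
	reclaimed := lockPath + ".stale." + token
	if err := os.Rename(lockPath, reclaimed); err != nil {
		return false // missing, or another racer moved it first
	}
	info, err := os.Stat(reclaimed)
	if err == nil && time.Since(info.ModTime()) <= staleAfter {
		// Fresh after all: put it back. os.Link is used here (an assumption)
		// because it fails rather than overwriting a lock created meanwhile.
		_ = os.Link(reclaimed, lockPath)
		_ = os.Remove(reclaimed)
		return false
	}
	_ = os.Remove(reclaimed)
	return true
}

// writeLock creates a lock file whose mtime lies age in the past.
func writeLock(path string, age time.Duration) error {
	if err := os.WriteFile(path, []byte("pid"), 0o600); err != nil {
		return err
	}
	mtime := time.Now().Add(-age)
	return os.Chtimes(path, mtime, mtime)
}

type outcome struct {
	StaleReclaimed, FreshReclaimed, FreshKept, MissingReclaimed bool
}

// runScenarios exercises the stale, fresh, and missing cases in a temp dir.
func runScenarios() (outcome, error) {
	var out outcome
	dir, err := os.MkdirTemp("", "reclaim")
	if err != nil {
		return out, err
	}
	defer os.RemoveAll(dir)
	lock := filepath.Join(dir, "job.lock")

	if err := writeLock(lock, time.Hour); err != nil {
		return out, err
	}
	out.StaleReclaimed = reclaimStaleLock(lock, "t1", time.Minute)

	if err := writeLock(lock, 0); err != nil {
		return out, err
	}
	out.FreshReclaimed = reclaimStaleLock(lock, "t2", time.Minute)
	_, statErr := os.Stat(lock)
	out.FreshKept = statErr == nil

	_ = os.Remove(lock)
	out.MissingReclaimed = reclaimStaleLock(lock, "t3", time.Minute)
	return out, nil
}

func main() {
	out, err := runScenarios()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Printf("%+v\n", out)
}
```

In this sketch a caller that gets false is expected to fall through to its normal bounded wait and retry, matching the behavior the review describes for the loser of the rename race.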

H4 — LSP manager never evicts a dead session (P1)

internal/lsp/manager.go:129: sessionFor returned the cached session forever. Once the language server crashed / exited / hit a read error, Client.failPending closes c.closed, but the manager kept handing out the dead session, so every subsequent diagnostic failed permanently.

Fix is correct. Client.IsClosed() is a non-blocking <-c.closed read (safe to call under m.mu). On the dead path, the session is deleted under the lock, the lock is released, and server.Shutdown is called outside the lock (good — a slow shutdown can't stall other sessions). The start-race re-check now also guards existing.client.IsClosed(), so a dead entry installed by a racing starter is replaced rather than reused. All three branches (live / dead / absent) unlock exactly once; no double-unlock or fall-through-with-lock-held.

The one residual window — IsClosed() returns false, then the client dies before the session is returned to the caller — self-heals on the next sessionFor call (the current caller gets a single failed op, which is the correct behavior for a just-crashed server). TestSessionForEvictsDeadSession verifies reuse-while-live, eviction-after-close, and restart-count.
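The evict-and-restart pattern can be sketched as follows. This is a deliberately reduced model: the Client, session, and Manager types are hypothetical, the "server start" is just an allocation, and the whole operation runs under one lock, whereas the real sessionFor is described above as releasing the lock around shutdown and handling a start race.

```go
package main

import (
	"fmt"
	"sync"
)

// Client is a hypothetical stand-in; closed is closed when the server dies.
type Client struct{ closed chan struct{} }

// IsClosed checks the channel without blocking: a closed channel is always
// ready to receive, an open one falls through to default.
func (c *Client) IsClosed() bool {
	select {
	case <-c.closed:
		return true
	default:
		return false
	}
}

type session struct{ client *Client }

type Manager struct {
	mu       sync.Mutex
	sessions map[string]*session
	starts   int // how many sessions have been started
}

// sessionFor returns the cached session while it is alive and otherwise
// evicts it and starts a replacement.
func (m *Manager) sessionFor(key string) *session {
	m.mu.Lock()
	defer m.mu.Unlock()
	if s, ok := m.sessions[key]; ok {
		if !s.client.IsClosed() {
			return s
		}
		delete(m.sessions, key) // dead: evict before restarting
	}
	s := &session{client: &Client{closed: make(chan struct{})}}
	m.starts++
	m.sessions[key] = s
	return s
}

func main() {
	m := &Manager{sessions: make(map[string]*session)}
	first := m.sessionFor("go")
	again := m.sessionFor("go")
	close(first.client.closed) // simulate a server crash
	fresh := m.sessionFor("go")
	fmt.Println(first == again, fresh != first, m.starts)
}
```

The usage in main follows the same three checks the regression test is said to make: reuse while live, eviction after close, and a restart count of two.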

Verdict

Approve. All four are real bugs, the fixes are sound, the concurrency reasoning holds, and every fix ships with a regression test. The P3 nits above are follow-up cleanup, not blockers.

@gnanam1990 gnanam1990 merged commit b0a280e into main Jun 20, 2026
7 checks passed
@Vasanthdev2004 Vasanthdev2004 deleted the fix/audit-batch-1-highs branch June 28, 2026 08:27
2 participants