Skip to content

fix(stack): harden MergeStackEnvVars — in-tx deleting guard, lock_timeout, honest keys_set (panel #7)#238

Merged
mastermanas805 merged 1 commit into
masterfrom
fix/mergestackenvvars-intx-hardening-2026-06-04
Jun 4, 2026
Merged

fix(stack): harden MergeStackEnvVars — in-tx deleting guard, lock_timeout, honest keys_set (panel #7)#238
mastermanas805 merged 1 commit into
masterfrom
fix/mergestackenvvars-intx-hardening-2026-06-04

Conversation

@mastermanas805

Copy link
Copy Markdown
Member

What (design-review follow-up to #235)

Three hardenings to the atomic env merge:

  1. In-tx teardown guard (closes a TOCTOU). The "stack mid-teardown" check was a handler pre-read (stack.Status == deleting before the merge) — racy: the teardown worker can flip status between GetStackBySlug and the merge, committing an env change onto a deleting stack. The SELECT now fetches status under the FOR UPDATE lock and returns models.ErrStackDeleting (→ 409 stack_deleting); the handler pre-read is removed so the in-tx check is authoritative. Same posture as the resource-DELETE / SetTTL hardenings.
  2. SET LOCAL lock_timeout = '3s' as the first tx statement — a PATCH racing a long-held row lock fails fast to a retryable 503 instead of hanging the request goroutine.
  3. keys_set counts only actual upserts (v != ""). The old len(body.Env) - deletes over-counted a no-op delete (empty value for an absent key), making the rule-12 audit lie.

Coverage

Sites:          MergeStackEnvVars (model) + StackHandler.UpdateEnv
Coverage test:  TestMergeStackEnvVars_Branches rewritten for new SQL + adds lock_timeout-error
                and status='deleting'→ErrStackDeleting cases (MergeStackEnvVars 100%);
                existing real-DB deleting handler tests now exercise the in-tx path
Fault tests:    2 stack-env fault tests' failAfter bumped +1 (in-tx SET LOCAL exec is counted)
Live verified:  go test vs real Postgres+Redis; full ./...Stack suite green; all changed lines covered

🤖 Generated with Claude Code

…eout, honest keys_set (panel #7)

Design-review follow-up to #235. Three hardenings to the atomic env merge:

1. In-tx teardown guard (closes a TOCTOU). The "stack is mid-teardown"
   check was a pre-read in the handler (stack.Status == deleting before
   MergeStackEnvVars) — racy: the teardown worker can flip status between
   GetStackBySlug and the merge, committing an env change onto a deleting
   stack. The SELECT now fetches status under the FOR UPDATE lock and
   returns models.ErrStackDeleting (→ 409 stack_deleting); the handler
   pre-read is removed so the in-tx check is authoritative. Same posture as
   the resource-DELETE / SetTTL hardenings this session.

2. SET LOCAL lock_timeout = '3s' as the first tx statement, so a PATCH that
   races a long-held row lock fails fast to a retryable 503 instead of
   hanging the request goroutine indefinitely.

3. keys_set audit count now counts only actual upserts (v != ""). The old
   `len(body.Env) - deletes` over-counted a no-op delete (empty value for an
   absent key: not a delete, not a set), making the rule-12 audit lie.

Tests: TestMergeStackEnvVars_Branches rewritten for the new SQL (SET LOCAL
exec + status,env_vars SELECT) and adds the lock_timeout-error and
status='deleting'→ErrStackDeleting cases — MergeStackEnvVars stays 100%.
The existing real-DB deleting handler tests now exercise the in-tx path
deterministically. The two stack-env fault tests' failAfter bumped +1 for
the added in-tx SET LOCAL exec (which the fault harness counts). Verified
vs real Postgres+Redis; all changed lines covered.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 enabled auto-merge (squash) June 4, 2026 02:32
@mastermanas805 mastermanas805 merged commit c334806 into master Jun 4, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant