Summary
Workers exhibit a recurring behaviour pattern of marking step checkboxes [x] and setting **Status:** ✅ Complete in STATUS.md before the final code-review verdict for that step has returned APPROVE. This pattern is the proximate trigger for the death-spiral in #537, but it's also observable in successful runs — it's "lucky" when the in-flight review returns APPROVE; "unlucky" when it returns REVISE.
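This is closely related to #1 but worth tracking separately because the symptoms differ: #1 shows up as a deadlock, whereas this issue is a behavioural pattern that is also present in runs that succeed.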
Hardening against the pattern (regardless of whether REVISE eventually arrives) reduces a class of latent risk.
Reproduction
Run any task at Review Level: 2 with non-trivial implementation work (so the review takes more than a few seconds). Watch the worker's STATUS.md edits relative to its review_step calls.
Observation: the worker frequently:
1. Implements step N
2. Updates STATUS.md to mark all checkboxes [x] and set Status: ✅ Complete
3. Commits with feat(<task>): complete Step N — ...
4. Calls review_step(step=N, type='code')
Steps 2 and 3 commit the worker to a "this step is done" stance before step 4 has confirmed it.
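The ordering problem can be checked mechanically from a run's timeline. The following is a minimal sketch, assuming a hypothetical list of `(kind, step, detail)` tuples; the event labels `status_complete`, `review_called`, and `verdict` are illustrative names, not taskplane's actual log schema.

```python
from typing import List, Tuple

# Hypothetical event shape for illustration only: (kind, step, detail).
Event = Tuple[str, int, str]


def premature_completions(events: List[Event]) -> List[int]:
    """Return steps that were marked complete before an APPROVE verdict was seen."""
    approved = set()
    flagged: List[int] = []
    for kind, step, detail in events:
        if kind == "verdict" and detail == "APPROVE":
            approved.add(step)
        elif kind == "status_complete" and step not in approved:
            # Status flipped to complete with no APPROVE on record yet.
            if step not in flagged:
                flagged.append(step)
    return flagged


# Timeline mirroring the observed sequence: mark complete, then ask for review.
timeline = [
    ("implemented", 2, ""),
    ("status_complete", 2, ""),
    ("review_called", 2, "code"),
    ("verdict", 2, "REVISE"),
]
print(premature_completions(timeline))
```

A timeline in which the verdict event precedes `status_complete` for the same step yields an empty list, which is the ordering the issue argues for.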
Concrete evidence
Failed batch 20260506T105850 (Step 2):
- Worker committed 0ae7f49 feat(MIG-002): complete Step 2 — demo-request bot-protection wiring + email limiter + tests ← STATUS marked complete here
- R006-code-step2.md returned REVISE after the commit
Successful batch 20260506T131717 (Step 3), same pattern, different luck:
- Worker committed bd1b09e feat(MIG-002): complete Step 3 — wire DemoForm to Turnstile + form-token
- STATUS.md showed Step 3 ✅ Complete
- .reviewer-state.json showed status: 'running', reviewType: 'code', reviewStep: 3
- R009 happened to return APPROVE; if it had returned REVISE we'd have repeated the death-spiral
- This batch succeeded despite the same anti-pattern, not because of correct protocol
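The snapshot from the successful batch suggests a simple cross-check: a step that reads as complete while a code review for that same step is still running is in the "lucky or unlucky" window. This is a rough sketch, assuming `.reviewer-state.json` is plain JSON carrying the three fields quoted above; the function name and the way the step status is passed in are hypothetical.

```python
import json


def completed_while_review_running(step: int, step_status: str, reviewer_state_json: str) -> bool:
    """True if `step` is marked complete while its code review is still in flight.

    Only the fields quoted in the evidence (status, reviewType, reviewStep)
    are read; any other structure of the state file is not assumed.
    """
    state = json.loads(reviewer_state_json)
    review_in_flight = (
        state.get("status") == "running"
        and state.get("reviewType") == "code"
        and state.get("reviewStep") == step
    )
    return review_in_flight and "Complete" in step_status


# Values taken from the batch 20260506T131717 observation.
snapshot = '{"status": "running", "reviewType": "code", "reviewStep": 3}'
print(completed_while_review_running(3, "✅ Complete", snapshot))
```

The check says nothing about whether the eventual verdict is APPROVE or REVISE; it only detects that the worker committed to "done" before the verdict existed.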
The pattern recurs because the current worker prompt's "Order of operations" guidance is implicit — the order is implied by the description of what each phase does, but no explicit rule forbids marking the step complete before the final verdict arrives.
Root cause
The base task-worker prompt (templates/agents/task-worker.md) describes:
- "Hydrate STATUS.md with sub-checkboxes for the step's work"
- "Implement the step"
- "Run targeted tests"
- "Call review_step(step=N, type='code')"
- "Handle the verdict"
But the prompt doesn't explicitly say when the worker should set checkboxes to [x] or change Status to ✅ Complete. A reasonable interpretation is "mark complete when you're done implementing", which is wrong (it should be "when the final code-review APPROVE has landed").
This is the same root cause as issue #537 but viewed from the behavioural angle rather than the deadlock angle. The fix proposed in #537 (explicit ordering rule + recovery recipe) also resolves this issue.
Fix proposals
A. Explicit ordering rule (mirrors #1)
Add to the base prompt:
## When to mark a step complete
Set `[x]` on a checkbox the moment the underlying outcome is satisfied (e.g.,
"DemoForm.astro adds hidden form_token input" → tick once the input is
written and committed).
Set `**Status:** ✅ Complete` on a step ONLY after:
1. All checkboxes are `[x]`, AND
2. For Review Level ≥ 2: the final code-review for that step has returned APPROVE.
If you mark `Status: ✅ Complete` and a subsequent review returns REVISE,
revert the status update (set back to `🟨 In Progress`), commit the revert,
then handle the REVISE through the normal recipe.
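Expressed as code, the proposed rule is a small predicate plus a revert step. This is only a sketch of the rule's logic under the wording above; the function names are hypothetical and nothing here reflects how taskplane itself evaluates status.

```python
from typing import List, Optional

COMPLETE = "✅ Complete"
IN_PROGRESS = "🟨 In Progress"


def may_mark_complete(checkboxes: List[bool], review_level: int,
                      final_code_verdict: Optional[str]) -> bool:
    """Apply the two conditions from the proposed prompt rule."""
    if not all(checkboxes):
        return False  # condition 1: every checkbox must be [x]
    if review_level >= 2:
        # condition 2: the final code review must have returned APPROVE
        return final_code_verdict == "APPROVE"
    return True


def status_after_verdict(current_status: str, verdict: str) -> str:
    """Revert a premature 'Complete' when a later review returns REVISE."""
    if current_status == COMPLETE and verdict == "REVISE":
        return IN_PROGRESS
    return current_status


print(may_mark_complete([True, True], review_level=2, final_code_verdict=None))
print(status_after_verdict(COMPLETE, "REVISE"))
```

Keeping the checkbox condition separate from the verdict condition matches the rule's intent: checkboxes may be ticked as outcomes land, while the step status waits for the review.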
B. STATUS.md template hint
Update the STATUS.md template (in references/prompt-template.md, or wherever the template actually lives) so each step has a textual note:
### Step N: ...
**Status:** ⬜ Not Started
> Set Status to ✅ Complete only after the final code-review APPROVE.
> See worker prompt §"When to mark a step complete".
- [ ] ...
The visible reminder in the per-step block is a constant nudge.
C. Reviewer pre-check
When the reviewer is invoked for review_step(type='code', step=N), it can read STATUS.md and surface a Pattern Violations finding if the step is already marked complete:
Pattern Violations: Step N is already marked ✅ Complete in STATUS.md before this review's verdict was issued. Mark steps complete only after final APPROVE — see worker prompt §"When to mark a step complete". This is a process-only finding; it does not change the verdict on the implementation itself.
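A pre-check along these lines could be sketched as follows, assuming STATUS.md follows the per-step layout shown in proposal B (a `### Step N:` heading followed by a `**Status:**` line). The helper names are hypothetical, and the real reviewer may read STATUS.md differently.

```python
import re
from typing import Optional

STEP_HEADING = re.compile(r"^### Step (\d+):", re.MULTILINE)
STATUS_LINE = re.compile(r"^\*\*Status:\*\*\s*(.+)$", re.MULTILINE)


def read_step_status(status_md: str, step: int) -> Optional[str]:
    """Return the Status text of the given step's block, or None if absent."""
    headings = list(STEP_HEADING.finditer(status_md))
    for i, heading in enumerate(headings):
        if int(heading.group(1)) != step:
            continue
        # The step's block ends where the next step heading begins.
        end = headings[i + 1].start() if i + 1 < len(headings) else len(status_md)
        found = STATUS_LINE.search(status_md, heading.end(), end)
        return found.group(1).strip() if found else None
    return None


def pre_check(status_md: str, step: int) -> Optional[str]:
    """Return a process-only finding if the step is already marked complete."""
    status = read_step_status(status_md, step)
    if status is not None and "Complete" in status:
        return (f"Pattern Violations: Step {step} is already marked ✅ Complete "
                "in STATUS.md before this review's verdict was issued.")
    return None


sample = (
    "### Step 2: wire form\n**Status:** 🟨 In Progress\n- [ ] a\n\n"
    "### Step 3: tests\n**Status:** ✅ Complete\n- [x] b\n"
)
print(pre_check(sample, 3))
```

Returning `None` when the step is not marked complete keeps the check purely additive: it can only append a finding, never alter the verdict on the implementation.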
Recommendation
Ship A + B. They reinforce each other: the prompt teaches; the template reminds. C is a nice-to-have observability layer that flags the issue when it slips through.
Why this is P2 not P1
The pattern is observed in production but doesn't always cause failure — it only fails when the in-flight review returns REVISE. So the impact is "latent risk that occasionally manifests as #1's deadlock". Fixing #1 mitigates the worst case; fixing this issue removes the latent risk entirely.
Acceptance criteria
Related
Affected version: taskplane@0.28.4. Two production batches (20260506T105850 failed, 20260506T131717 succeeded) both exhibit the pattern. Detailed event logs available on request.