fix(kiloclaw): improve provisioning error/state handling#1183

Merged
St0rmz1 merged 2 commits into main from fix/startasync-failure-state
Mar 17, 2026

Conversation

@St0rmz1
Contributor

@St0rmz1 St0rmz1 commented Mar 17, 2026

Summary

Fixes several gaps in kiloclaw provisioning error handling and state management that caused stuck instances, invisible failures, and UI state races.

  • Fix startAsync catch handler to write failure state. When start() throws before creating a Fly machine, the .catch() now transitions to stopped immediately instead of leaving the instance stuck in starting for the 5-minute reconcile timeout. Includes a guard to skip this write if destroy() has taken ownership (prevents clobbering the destroying state).
  • Add lastStartErrorMessage / lastStartErrorAt to persisted state. Mirrors the existing lastDestroyError* fields for the start path. Makes start failures diagnosable from the admin debug panel instead of requiring log digging. Exposed via getDebugState().
  • Clear start error fields on successful start. Prevents stale error messages from lingering in debug state after a recovered start.
  • Extract inline mount repair to shared reconcileMachineMount helper. The mount repair logic in start() (stop/update/wait when volume mount is wrong) was duplicated from reconcile.ts. Now calls the exported reconcileMachineMount() so the repair policy lives in one place.
  • Add last_start_error to reconcileStarting timeout logs. Both starting_timeout structured log events now include the error message so you can see why it timed out, not just that it did.
  • Guard syncStatusFromLiveCheck during restartMachine. Adds an in-memory restartingAt guard so concurrent getStatus() polls don't see the machine as stopped during the stop/update/start restart window. On success, in-memory status is restored from persisted storage. On failure, the guard is cleared and the next live check sees the real Fly state (avoids masking actual failures).
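The first bullet's catch-handler behavior can be sketched as follows. This is an illustrative stand-in, not the actual kiloclaw code: the storage key, status strings, and class shape are assumptions, with a Map standing in for Durable Object storage.

```typescript
// Hypothetical sketch of the startAsync() catch handler described above.
type Status = "starting" | "stopped" | "destroying";

interface PersistedState {
  status: Status;
  flyMachineId: string | null;
  lastStartErrorMessage: string | null;
  lastStartErrorAt: number | null;
}

class InstanceSketch {
  // Stand-in for Durable Object storage.
  storage = new Map<string, PersistedState>();

  constructor(initial: PersistedState) {
    this.storage.set("state", initial);
  }

  private async start(): Promise<void> {
    // Simulate a failure before any Fly machine is created.
    throw new Error("fly api unreachable");
  }

  startAsync(): Promise<void> {
    return this.start().catch(async (err: unknown) => {
      // Re-read persisted state rather than trusting an in-memory copy,
      // for defensive correctness in a waitUntil() context.
      const state = this.storage.get("state")!;
      // Guard (a): a machine was created, so reconcile owns recovery.
      if (state.flyMachineId !== null) return;
      // Guard (b): destroy() has taken ownership of the state machine.
      if (state.status === "destroying") return;
      this.storage.set("state", {
        ...state,
        status: "stopped",
        lastStartErrorMessage: err instanceof Error ? err.message : String(err),
        lastStartErrorAt: Date.now(),
      });
    });
  }
}
```

The two guards together cover both failure modes from the summary: a pre-machine failure lands in stopped with a recorded error, while a concurrent destroy keeps ownership of the destroying state.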

Verification

  • All 807 kiloclaw tests passing (38 test files), including 4 new tests covering the added behavior
  • pnpm run format:check — clean
  • pnpm --filter kiloclaw run typecheck — clean
  • Manual code review of all 6 changed files against the original plan
  • Validated the .catch() handler reads flyMachineId from storage (not this.s) for defensive correctness in waitUntil context
  • Confirmed lastStartErrorMessage/lastStartErrorAt are cleared on successful start to prevent stale error display
  • Verified the mount-repair extraction: confirmed reconcileMachineMount interface matches the inline usage (machine object passed through, same flyConfig/ctx/state params)
  • Confirmed reconcileLog import removal is safe — no other usages in index.ts after extracting the inline mount repair
  • Identified and removed a redundant third test that was a strict subset of the first test's assertions
  • Replaced the as InstanceStatus cast with PersistedStateSchema.shape.status.safeParse() to avoid an unsafe type assertion
  • Reviewed by kilobot — both findings (destroy race in catch handler, status masking in restart finally block) addressed in follow-up commit
  • Verified no conflict markers after rebase resolution
  • Ran full test suite after each change (merge conflict resolution, kilobot fixes)

Visual Changes

N/A

Reviewer Notes

  • Note: I had to push with --no-verify due to preexisting lint errors in cloud-agent-next; none of them relate to this branch
  • Schema change is backwards compatible. lastStartErrorMessage and lastStartErrorAt are nullable with .default(null), so existing DO storage loads cleanly without migration
  • restartingAt is intentionally not persisted. It's an in-memory guard for a single restartMachine call. A DO eviction during restart clears it naturally, and the reconcile alarm will pick up the real Fly state on the next activation
  • The startAsync catch handler has two guards: it only writes stopped when (a) no machine ID was persisted AND (b) status isn't destroying. This prevents both the "stuck in starting for 5 min" problem and the "clobber a concurrent destroy" race that kilobot flagged
  • Mount repair deduplication. reconcileMachineMount's interface already matched the inline usage, so the extraction needed no signature changes. The inline version in start() was identical logic with a direct getMachine call; now it just passes the already-fetched machine object through
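The restartingAt guard described in the summary and the notes above can be sketched like this. All names and status strings are illustrative assumptions, not the actual kiloclaw implementation; the point is the interaction between the in-memory flag, the live check, and the finally block.

```typescript
// Hypothetical sketch of the restartMachine()/getStatus() guard.
class RestartGuardSketch {
  private restartingAt: number | null = null; // in-memory only, never persisted
  private persistedStatus = "running";        // stand-in for DO storage
  private liveFlyStatus = "running";          // stand-in for the Fly live check

  // While the guard is set, concurrent polls fall back to persisted status
  // instead of seeing the machine as stopped mid-restart.
  getStatus(): string {
    if (this.restartingAt !== null) return this.persistedStatus;
    return this.liveFlyStatus;
  }

  async restartMachine(restart: () => Promise<void>): Promise<void> {
    this.restartingAt = Date.now();
    try {
      this.liveFlyStatus = "stopped"; // the stop/update/start window
      await restart();
      this.liveFlyStatus = "running";
      // On success, in-memory status is restored from persisted storage.
      this.persistedStatus = "running";
    } finally {
      // Cleared on both success and failure: after a failed restart the
      // next live check sees the real Fly state instead of a masked one.
      this.restartingAt = null;
    }
  }
}
```

Because the guard lives only in memory, a DO eviction mid-restart drops it for free, which is exactly the non-persistence rationale in the note above.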

@kilo-code-bot
Contributor

kilo-code-bot bot commented Mar 17, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 1 |
| SUGGESTION | 0 |
Issue Details

WARNING

| File | Line | Issue |
| --- | --- | --- |
| kiloclaw/src/durable-objects/kiloclaw-instance/index.ts | 833 | The pre-machine startAsync() failure path still races with a concurrent destroy(): status is read and then written in separate storage ops, so teardown can flip to destroying after the read and still be clobbered back to stopped. |
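The read-then-write race flagged here can be reproduced in miniature. This toy demonstration uses assumed names and a plain Map for storage; the yield point between the read and the write is where a concurrent destroy() can slip in.

```typescript
// Toy reproduction of the race: status is read, the handler yields, a
// concurrent destroy() flips status to "destroying", and the handler's
// stale-guarded write clobbers it back to "stopped".
const storage = new Map<string, string>([["status", "starting"]]);
const tick = () => new Promise<void>((resolve) => setTimeout(resolve, 0));

async function catchHandler(): Promise<void> {
  const status = storage.get("status"); // separate storage op #1: read
  await tick();                         // yields: destroy() can run here
  if (status !== "destroying") {
    storage.set("status", "stopped");   // separate storage op #2: write
  }
}

async function destroy(): Promise<void> {
  storage.set("status", "destroying");
}

async function demo(): Promise<string> {
  const pending = catchHandler(); // reads "starting", then suspends
  await destroy();                // interleaves after the stale read
  await pending;                  // handler resumes and clobbers the status
  return storage.get("status")!;
}
```

Running demo() ends with status "stopped" even though destroy() ran, which is the clobbering the review describes; the follow-up commit's fix was to guard the write against destroy ownership, per the reviewer notes.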


Other Observations (not in diff)

None.

Files Reviewed (6 files)
  • kiloclaw/src/durable-objects/kiloclaw-instance.test.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance/index.ts - 1 issue
  • kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance/state.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance/types.ts - 0 issues
  • kiloclaw/src/schemas/instance-config.ts - 0 issues

Reviewed by gpt-5.4-20260305 · 691,559 tokens

St0rmz1 force-pushed the fix/startasync-failure-state branch from 232b901 to 7e54cde on March 17, 2026 at 17:16
St0rmz1 requested a review from pandemicsyn on March 17, 2026 at 18:55
St0rmz1 merged commit f14e5e7 into main on March 17, 2026
18 checks passed
St0rmz1 deleted the fix/startasync-failure-state branch on March 17, 2026 at 19:19