fix(kiloclaw): reverse region list on capacity recovery #1059
…oute
The rename from restartGateway to restartMachine has been deployed and all frontends now use /api/admin/machine/restart. Remove the old /api/admin/gateway/restart alias that was kept for rollout safety.
…prioritizing
deprioritizeRegion compared the concrete region (e.g. 'ord') against
meta-regions ('us', 'eu'), so it never matched and the recovery volume
landed back in the same exhausted region. Reverse the configured region
list instead so recovery deterministically tries the opposite geographic
region first.
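The mismatch described in this commit message can be illustrated with a minimal sketch. The function below is a hypothetical stand-in named `deprioritizeRegionSketch`, not the repository's actual `deprioritizeRegion`; it only assumes the behaviour the message describes (exact string comparison, failed region moved to the back).

```typescript
// Hypothetical stand-in for deprioritizeRegion (an assumption, not the real code):
// move every entry equal to `failed` to the end, keeping the order of the rest.
function deprioritizeRegionSketch(
  regions: readonly string[],
  failed: string | null,
): string[] {
  if (failed === null) return [...regions];
  const others = regions.filter((r) => r !== failed);
  const matches = regions.filter((r) => r === failed);
  return [...others, ...matches];
}

// The configured list holds meta-regions, but the stranded volume reports a
// concrete region. 'ord' equals neither 'us' nor 'eu', so nothing moves.
const metaResult = deprioritizeRegionSketch(["us", "eu"], "ord");
console.log(metaResult); // [ 'us', 'eu' ]

// With concrete codes in the list, the same comparison does match.
const explicitResult = deprioritizeRegionSketch(["iad", "ord", "cdg"], "ord");
console.log(explicitResult); // [ 'iad', 'cdg', 'ord' ]
```

Under these assumptions the meta-region list comes back unchanged, which is consistent with the recovery volume landing in the same region again.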
  const hasUserData = state.lastStartedAt !== null;
- const allRegions = shuffleRegions(parseRegions(env.FLY_REGION ?? DEFAULT_FLY_REGION));
- const regions = deprioritizeRegion(allRegions, oldRegion);
+ const regions = parseRegions(env.FLY_REGION ?? DEFAULT_FLY_REGION).reverse();
WARNING: Recovery can retry the same failed region
Reversing the configured region list drops the previous deprioritizeRegion(oldRegion) behavior. If FLY_REGION is an explicit list like iad,ord,cdg and the stranded volume is in ord, this becomes cdg,ord,iad, so recovery can hit ord again before trying every other region. That reintroduces the same capacity failure this path is supposed to avoid.
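The ordering in question can be checked with a small sketch. `parseRegionList` here is a hypothetical parser (comma-split and trim) standing in for the repository's `parseRegions`, whose real behaviour is not shown in this thread.

```typescript
// Hypothetical parser (an assumption): split a comma-separated FLY_REGION
// value into trimmed, non-empty region codes.
function parseRegionList(raw: string): string[] {
  return raw
    .split(",")
    .map((r) => r.trim())
    .filter((r) => r.length > 0);
}

// Recovery order as in the diff: the configured list, reversed.
function reversedRecoveryOrder(raw: string): string[] {
  return parseRegionList(raw).reverse();
}

// Two meta-regions: reversal simply swaps them.
const metaOrder = reversedRecoveryOrder("us,eu");
console.log(metaOrder); // [ 'eu', 'us' ]

// Explicit list: the region that failed ('ord') stays in the middle,
// so it is tried before 'iad'.
const explicitOrder = reversedRecoveryOrder("iad,ord,cdg");
console.log(explicitOrder); // [ 'cdg', 'ord', 'iad' ]
console.log(explicitOrder.indexOf("ord") < explicitOrder.indexOf("iad")); // true
```

With exactly two entries the reversal is a swap; with three or more, the position of the failed region depends only on where it sits in the configured list.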
There are only 2 meta-regions.
Code Review Summary
Status: 1 issue found | Recommendation: Address before merge
Other Observations (not in diff): None.
Files Reviewed: 2 files
  const hasUserData = state.lastStartedAt !== null;
- const allRegions = shuffleRegions(parseRegions(env.FLY_REGION ?? DEFAULT_FLY_REGION));
- const regions = deprioritizeRegion(allRegions, oldRegion);
+ const regions = parseRegions(env.FLY_REGION ?? DEFAULT_FLY_REGION).reverse();
WARNING: Recovery order no longer accounts for the failed region
reverse() always prefers the last configured entry, regardless of where the stranded volume actually lives. With explicit region lists like dfw,ord,cdg, a volume that failed in ord would now retry ord again before dfw, which defeats the purpose of the 412 recovery path and can keep the retry pinned to the constrained region.
We use meta-regions now, and there are only 2 of them.
Code Review Summary
Status: 1 issue found | Recommendation: Address before merge
Other Observations (not in diff): N/A
Files Reviewed: 2 files
Reviewed by gpt-5.4-20260305 · 418,357 tokens
…1059)
With explicit region codes instead of the 'us' alias, deprioritizeRegion can now correctly match concrete regions (e.g. 'iad') against the list. The original bug (meta-region 'us' never matching concrete region codes) no longer applies.
…es (#1090)
* fix(kiloclaw): omit ord from Fly region list due to provisioning issues
* revert(kiloclaw): restore shuffle+deprioritize in capacity recovery (#1059)
With explicit region codes instead of the 'us' alias, deprioritizeRegion can now correctly match concrete regions (e.g. 'iad') against the list. The original bug (meta-region 'us' never matching concrete region codes) no longer applies.
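A rough sketch of the restored shuffle-then-deprioritize order with explicit region codes follows. All names here (`shuffleSketch`, `moveToBack`, `recoveryOrder`) and the example region list are assumptions for illustration; the repository's `shuffleRegions` and `deprioritizeRegion` may differ in detail.

```typescript
// Hypothetical Fisher-Yates shuffle (an assumption); returns a new array.
function shuffleSketch<T>(
  items: readonly T[],
  random: () => number = Math.random,
): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = out[i] as T;
    out[i] = out[j] as T;
    out[j] = tmp;
  }
  return out;
}

// Hypothetical deprioritize step: move the failed region to the back.
function moveToBack(regions: readonly string[], failed: string): string[] {
  return [
    ...regions.filter((r) => r !== failed),
    ...regions.filter((r) => r === failed),
  ];
}

// Shuffle first for spread, then push the failed region last so every
// other configured region is tried before it.
function recoveryOrder(configured: readonly string[], failed: string): string[] {
  return moveToBack(shuffleSketch(configured), failed);
}

// Example list is illustrative only; the failed region always ends up last.
const order = recoveryOrder(["iad", "dfw", "cdg"], "iad");
console.log(order[order.length - 1]); // iad
```

Because the list now holds concrete codes, the equality check in the deprioritize step can match the stranded volume's region, which is the condition the revert relies on.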
Summary
replaceStrandedVolume used deprioritizeRegion to push the failed region to the back of the list, but this compared the concrete Fly region (e.g. ord) against meta-regions (us, eu), which never matched. The recovery volume landed back in the same exhausted region, causing the retry createNewMachine to fail with the same capacity error. Axiom logs confirmed a 76% failure rate (136/178) on capacity recovery attempts, with replacement volumes landing in ord 113/178 times.
Fix: reverse the configured region list on recovery so the opposite geographic region is always tried first.
Verification
- pnpm typecheck: pass
- pnpm test: 537 tests pass (28 test files)
- … ord 113/178 times)
Visual Changes
N/A
Reviewer Notes
- The deprioritizeRegion function and its tests are kept (still exported) but are no longer used in the recovery path. It could be cleaned up separately if desired.
- ensureVolume still uses shuffleRegions for randomized first-attempt region selection; only the capacity recovery path is changed to deterministic reverse order.
- With FLY_REGION=us,eu, recovery now always tries [eu, us], ensuring the replacement volume targets the opposite continent first.