fix(kiloclaw): reverse region list on capacity recovery#1059

Merged
pandemicsyn merged 3 commits into
mainfrom
florian/chore/capacity
Mar 12, 2026

Conversation

@pandemicsyn pandemicsyn commented Mar 12, 2026

Summary

  • Fix capacity recovery region fallback: replaceStrandedVolume used deprioritizeRegion to push the failed region to the back of the list, but this compared the concrete Fly region (e.g. ord) against meta-regions (us, eu) — which never matched. The recovery volume landed back in the same exhausted region, causing the retry createNewMachine to fail with the same capacity error. Axiom logs confirmed a 76% failure rate (136/178) on capacity recovery attempts, with replacement volumes landing in ord 113/178 times. Fix: reverse the configured region list on recovery so the opposite geographic region is always tried first.
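
The mismatch described above can be reproduced in a few lines. This is only a sketch: the body of `deprioritizeRegion` is not shown in this PR, so `deprioritizeSketch` below assumes it moves exact string matches to the back of the list, and `parseRegionsSketch` is a hypothetical stand-in for `parseRegions`.

```typescript
// Hypothetical stand-in for parseRegions: splits a comma-separated region list.
function parseRegionsSketch(raw: string): string[] {
  return raw
    .split(",")
    .map((r) => r.trim())
    .filter((r) => r.length > 0);
}

// Assumed behavior of deprioritizeRegion: move exact matches to the back.
function deprioritizeSketch(regions: string[], failed: string | null): string[] {
  if (failed === null) return [...regions];
  const rest = regions.filter((r) => r !== failed);
  const matched = regions.filter((r) => r === failed);
  return [...rest, ...matched];
}

const configuredMeta = parseRegionsSketch("us,eu");

// "ord" is a concrete region, but the list only contains meta-regions,
// so there is nothing to move and the order is unchanged.
const afterDeprioritize = deprioritizeSketch(configuredMeta, "ord");
console.log(afterDeprioritize);
```

Under that assumption the call is a no-op for any concrete region code, which matches the behavior the logs pointed to.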

Verification

  • pnpm typecheck — pass
  • pnpm test — 537 tests pass (28 test files)
  • Verified via Axiom log analysis that the capacity recovery was firing but replacement volumes were landing back in the same region (ord 113/178 times)
  • Additional verification after deploy

Visual Changes

N/A

Reviewer Notes

  • The deprioritizeRegion function and its tests are kept (still exported) but no longer used in the recovery path. It could be cleaned up separately if desired.
  • Initial provision and ensureVolume still use shuffleRegions for randomized first-attempt region selection — only the capacity recovery path is changed to deterministic reverse order.
  • With FLY_REGION=us,eu, recovery now always tries [eu, us], ensuring the replacement volume targets the opposite continent first.
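
As a rough illustration of the last note, the recovery order can be sketched as below. The helper names are hypothetical, and the value of `DEFAULT_FLY_REGION` is not shown in this PR, so the default used here is an assumption.

```typescript
// Assumed default; the real DEFAULT_FLY_REGION value is not shown in this PR.
const DEFAULT_FLY_REGION_SKETCH = "us,eu";

// Hypothetical stand-in for parseRegions.
function parseRegionList(raw: string): string[] {
  return raw
    .split(",")
    .map((r) => r.trim())
    .filter((r) => r.length > 0);
}

// Recovery path as changed in this PR: parse the configured list, then reverse it.
// No shuffle is involved, so the result is the same on every call.
function recoveryOrder(flyRegion: string | undefined): string[] {
  return parseRegionList(flyRegion ?? DEFAULT_FLY_REGION_SKETCH).reverse();
}

const configuredOrder = recoveryOrder("us,eu");
const fallbackOrder = recoveryOrder(undefined);
console.log(configuredOrder, fallbackOrder);
```

Note that the reversal only guarantees "opposite region first" when the list has exactly two entries and the first entry corresponds to the region that failed.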

…oute

The rename from restartGateway to restartMachine has been deployed and
all frontends now use /api/admin/machine/restart. Remove the old
/api/admin/gateway/restart alias that was kept for rollout safety.
…prioritizing

deprioritizeRegion compared the concrete region (e.g. 'ord') against
meta-regions ('us', 'eu'), so it never matched and the recovery volume
landed back in the same exhausted region. Reverse the configured region
list instead so recovery deterministically tries the opposite geographic
region first.
  const hasUserData = state.lastStartedAt !== null;
- const allRegions = shuffleRegions(parseRegions(env.FLY_REGION ?? DEFAULT_FLY_REGION));
- const regions = deprioritizeRegion(allRegions, oldRegion);
+ const regions = parseRegions(env.FLY_REGION ?? DEFAULT_FLY_REGION).reverse();

WARNING: Recovery can retry the same failed region

Reversing the configured region list drops the previous deprioritizeRegion(oldRegion) behavior. If FLY_REGION is an explicit list like iad,ord,cdg and the stranded volume is in ord, this becomes cdg,ord,iad, so recovery can hit ord again before trying every other region. That reintroduces the same capacity failure this path is supposed to avoid.
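
The ordering concern in this comment can be checked with a small sketch. The helper names are hypothetical and the region lists are the examples from the comment, not the deployed configuration.

```typescript
// Hypothetical stand-in for parseRegions.
function splitRegions(raw: string): string[] {
  return raw
    .split(",")
    .map((r) => r.trim())
    .filter((r) => r.length > 0);
}

// Position of the failed region in the attempt order when recovery simply
// reverses the configured list. Returns -1 if the failed region is not listed.
function attemptIndexAfterReverse(raw: string, failed: string): number {
  return splitRegions(raw).reverse().indexOf(failed);
}

// Explicit three-region list: "ord" sits in the middle after reversal,
// so it is retried before "iad".
const explicitOrder = splitRegions("iad,ord,cdg").reverse();
const explicitIndex = attemptIndexAfterReverse("iad,ord,cdg", "ord");

// Meta-region list: the concrete region "ord" never appears in it.
const metaIndex = attemptIndexAfterReverse("us,eu", "ord");
console.log(explicitOrder, explicitIndex, metaIndex);
```

Whether the concern applies therefore depends on whether the configured list holds explicit region codes or only the two meta-regions.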

Contributor Author

There are only 2 meta-regions.

kilo-code-bot Bot commented Mar 12, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 1 |
| SUGGESTION | 0 |

Issue Details

WARNING

| File | Line | Issue |
| --- | --- | --- |
| kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts | 68 | Reversing the configured region list can retry the same failed region before untried regions when FLY_REGION contains 3+ explicit regions. |

Other Observations (not in diff)

None.

Files Reviewed (2 files)
  • kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts - 1 issue
  • kiloclaw/src/durable-objects/kiloclaw-instance.test.ts - 0 issues

  const hasUserData = state.lastStartedAt !== null;
- const allRegions = shuffleRegions(parseRegions(env.FLY_REGION ?? DEFAULT_FLY_REGION));
- const regions = deprioritizeRegion(allRegions, oldRegion);
+ const regions = parseRegions(env.FLY_REGION ?? DEFAULT_FLY_REGION).reverse();

WARNING: Recovery order no longer accounts for the failed region

reverse() always prefers the last configured entry, regardless of where the stranded volume actually lives. With explicit region lists like dfw,ord,cdg, a volume that failed in ord would now retry ord again before dfw, which defeats the purpose of the 412 recovery path and can keep the retry pinned to the constrained region.

Contributor Author

We use meta-regions now, and there are only 2 of them.

kilo-code-bot Bot commented Mar 12, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 1 |
| SUGGESTION | 0 |

Issue Details

WARNING

| File | Line | Issue |
| --- | --- | --- |
| kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts | 68 | Recovery now ignores the failed region, so explicit region lists can retry the same constrained region before healthier alternatives. |

Other Observations (not in diff)

N/A

Files Reviewed (2 files)
  • kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts - 1 issue
  • kiloclaw/src/durable-objects/kiloclaw-instance.test.ts - 0 issues

Reviewed by gpt-5.4-20260305 · 418,357 tokens

@pandemicsyn pandemicsyn enabled auto-merge March 12, 2026 16:16
@pandemicsyn pandemicsyn merged commit c975a72 into main Mar 12, 2026
18 checks passed
@pandemicsyn pandemicsyn deleted the florian/chore/capacity branch March 12, 2026 16:16
jrf0110 added a commit that referenced this pull request Mar 13, 2026
…1059)

With explicit region codes instead of the 'us' alias, deprioritizeRegion
can now correctly match concrete regions (e.g. 'iad') against the list.
The original bug — meta-region 'us' never matching concrete region codes —
no longer applies.
jrf0110 added a commit that referenced this pull request Mar 13, 2026
…es (#1090)

* fix(kiloclaw): omit ord from Fly region list due to provisioning issues

* revert(kiloclaw): restore shuffle+deprioritize in capacity recovery (#1059)

With explicit region codes instead of the 'us' alias, deprioritizeRegion
can now correctly match concrete regions (e.g. 'iad') against the list.
The original bug — meta-region 'us' never matching concrete region codes —
no longer applies.
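
A minimal sketch of why the restored shuffle-then-deprioritize order works once the list holds concrete codes. The function bodies are assumptions (the real `shuffleRegions` and `deprioritizeRegion` are not shown here), and the region list is hypothetical, since the list configured by the follow-up commit is not given.

```typescript
// Assumed shape of shuffleRegions: a Fisher-Yates shuffle on a copy of the list.
// The rng parameter is added here only to make the sketch testable.
function shuffleSketch(regions: string[], rng: () => number = Math.random): string[] {
  const out = [...regions];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    const tmp = out[i] as string;
    out[i] = out[j] as string;
    out[j] = tmp;
  }
  return out;
}

// Assumed shape of deprioritizeRegion: move exact matches to the back.
function moveToBack(regions: string[], failed: string): string[] {
  return [...regions.filter((r) => r !== failed), ...regions.filter((r) => r === failed)];
}

// Hypothetical explicit list of concrete region codes.
const explicitRegions = ["iad", "dfw", "cdg"];

// The failed region is a concrete code that appears in the list, so the match
// succeeds and "iad" is always tried last, whatever the shuffle produced.
const restoredOrder = moveToBack(shuffleSketch(explicitRegions), "iad");
console.log(restoredOrder);
```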