
fix(cloud-agent): use execution heartbeat for idle cleanup instead of kiloServerLastActivity #3176

Merged
eshurakov merged 2 commits into main from fix/cloud-agent-idle-cleanup
May 11, 2026

Conversation

@eshurakov
Contributor

Summary

The kiloServerLastActivity field in session metadata was only set at session prepare and wrapper start time — never updated during an active execution. After 15 minutes (the idle timeout), cleanupIdleKiloServer would see a stale timestamp and SIGTERM the container, even though the execution was actively running and heartbeating every 30s. This caused false "Container shutdown: SIGTERM" interruptions for any session running longer than 15 minutes.
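To make the failure mode concrete, here is a small simulation of the two idle measurements. It is a sketch under stated assumptions, not the service's code: `sampleIdle` and `IdleSample` are hypothetical names, and the execution is assumed to start at t=0 and heartbeat exactly every 30 seconds. Only the 15-minute timeout, the 30-second heartbeat interval, and the `idleMs` value come from this PR.

```typescript
// Illustrative reconstruction of the pre-fix behavior. Only the timeout,
// the heartbeat interval, and the sampled idleMs value come from the PR.
const IDLE_TIMEOUT_MS = 15 * 60 * 1000; // 900000, the idleTimeoutMs seen in the logs
const HEARTBEAT_INTERVAL_MS = 30 * 1000;

interface IdleSample {
  staleIdleMs: number; // idle time measured from kiloServerLastActivity
  heartbeatIdleMs: number; // idle time measured from the latest heartbeat
}

// Assumes the execution starts at t=0 and heartbeats every 30 seconds.
function sampleIdle(nowMs: number): IdleSample {
  // Set once at wrapper start and never refreshed afterwards (the bug).
  const kiloServerLastActivity = 0;
  // Most recent heartbeat at or before nowMs.
  const lastHeartbeat =
    Math.floor(nowMs / HEARTBEAT_INTERVAL_MS) * HEARTBEAT_INTERVAL_MS;
  return {
    staleIdleMs: nowMs - kiloServerLastActivity,
    heartbeatIdleMs: nowMs - lastHeartbeat,
  };
}

const sample = sampleIdle(959890); // the idleMs value reported in the logs
const staleLooksIdle = sample.staleIdleMs > IDLE_TIMEOUT_MS;
const heartbeatLooksIdle = sample.heartbeatIdleMs > IDLE_TIMEOUT_MS;
console.log({ ...sample, staleLooksIdle, heartbeatLooksIdle });
```

Measured from the stale timestamp the session looks idle, while measured from the heartbeat it was active less than 30 seconds ago.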

Replace kiloServerLastActivity with execution-level data:

  • If there's an active execution, skip cleanup immediately (no point checking idle when something is running)
  • Otherwise, derive the last activity timestamp from the latest execution's lastHeartbeat, completedAt, or startedAt (in that priority order)

This removes the need for the separate kiloServerLastActivity field and the recordKiloServerActivity() RPC method entirely, along with all its call sites in the router and orchestrator.
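The replacement logic described above can be sketched as a pure decision function. This is a minimal illustration, not the code in CloudAgentSession.ts: `decideIdleCleanup` and `CleanupDecision` are hypothetical names, the `ExecutionMetadata` shape is reduced to the three timestamp fields named in this PR, and the strict `>` comparison against the timeout is an assumption.

```typescript
// Field names follow the PR description; everything else is illustrative.
interface ExecutionMetadata {
  startedAt: number; // required, so the fallback chain always yields a number
  lastHeartbeat?: number;
  completedAt?: number;
}

type CleanupDecision =
  | { action: "skip"; reason: "active-execution" | "no-executions" | "not-idle" }
  | { action: "stop"; idleMs: number };

function decideIdleCleanup(
  activeExecutionId: string | null,
  executions: ExecutionMetadata[], // assumed to be in insertion order
  nowMs: number,
  idleTimeoutMs: number,
): CleanupDecision {
  // Something is running: never treat the session as idle.
  if (activeExecutionId !== null) {
    return { action: "skip", reason: "active-execution" };
  }
  // The last element is the most recently started execution.
  const latest = executions[executions.length - 1];
  if (!latest) {
    return { action: "skip", reason: "no-executions" };
  }
  // Priority order from the PR: lastHeartbeat, then completedAt, then startedAt.
  const lastActivity =
    latest.lastHeartbeat ?? latest.completedAt ?? latest.startedAt;
  const idleMs = nowMs - lastActivity;
  // Assumption: strictly greater than the timeout means idle.
  return idleMs > idleTimeoutMs
    ? { action: "stop", idleMs }
    : { action: "skip", reason: "not-idle" };
}

const finished: ExecutionMetadata[] = [{ startedAt: 0, lastHeartbeat: 60000 }];
console.log(decideIdleCleanup("exec-1", finished, 1000000, 900000));
console.log(decideIdleCleanup(null, finished, 1000000, 900000));
```

Keeping the decision separate from the side effect (stopping the wrapper) makes each branch easy to unit test.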

Verification

Triggered a Cloud Agent session on Kilo-Org/kilocode and confirmed via Axiom logs that:

  • All Worker outcomes were ok with zero exceptions
  • The session was killed by our own cleanupIdleKiloServer once the 15-minute idle timeout had elapsed (idleMs=959890, roughly 16 minutes, against idleTimeoutMs=900000)
  • The "No wrapper found to stop" log confirmed the container was already gone by the time the alarm tried to stop it
  • The callback was delivered with status "interrupted" and reason "Container shutdown: SIGTERM"

Visual Changes

N/A

Reviewer Notes

The kiloServerLastActivity field is being removed from the session metadata schema. Existing sessions in DO storage that have this field set will simply have it ignored — the zod schema uses .optional() and no code path reads it anymore. No migration is needed.

eshurakov added 2 commits May 11, 2026 16:02
… kiloServerLastActivity

The kiloServerLastActivity field was only set at session prepare and
wrapper start time — never updated during an active execution. After
15 minutes, cleanupIdleKiloServer would see a stale timestamp and
SIGTERM the container, even though the execution was actively running
and heartbeating every 30s.

Replace kiloServerLastActivity with execution-level data: if there's
an active execution, skip cleanup immediately; otherwise derive the
last activity timestamp from the latest execution's lastHeartbeat,
completedAt, or startedAt (in that priority order).

Remove the now-unnecessary recordKiloServerActivity() RPC method and
all its call sites.
@kilo-code-bot
Contributor

kilo-code-bot Bot commented May 11, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Overview

This is a clean, well-scoped fix. The root cause (stale kiloServerLastActivity never being updated during active executions) is clearly described, and the solution is a meaningful improvement: deriving idle activity from execution-level data that already tracks heartbeats and completion timestamps.

Notes

One behavioral change worth being aware of (not a bug): Sessions that have been prepared (kilo server started) but have zero executions will now return early from cleanupIdleKiloServer at the !latestExecution check (CloudAgentSession.ts:1707). With the old approach, recordKiloServerActivity() was called at prepare time, so the idle timeout would still fire for these sessions. In practice this degrades gracefully — the sandbox has its own sleep timer (SANDBOX_SLEEP_AFTER_SECONDS) which reclaims the container independently — but the DO-level idle cleanup won't proactively stop the wrapper for prepared-but-never-used sessions.
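The difference for prepared-but-never-used sessions can be illustrated with a side-by-side sketch. Both functions are hypothetical simplifications: `oldWouldStop` assumes the pre-fix check simply compared the prepare-time timestamp against the timeout, and `newWouldStop` assumes the early return behaves as described in the note above.

```typescript
const IDLE_TIMEOUT_MS = 15 * 60 * 1000;

interface ExecutionMetadata {
  startedAt: number;
  lastHeartbeat?: number;
  completedAt?: number;
}

// Pre-fix (assumed shape): the timestamp recorded at prepare time drives the check.
function oldWouldStop(kiloServerLastActivity: number, nowMs: number): boolean {
  return nowMs - kiloServerLastActivity > IDLE_TIMEOUT_MS;
}

// Post-fix (assumed shape): without an execution there is no timestamp to derive.
function newWouldStop(executions: ExecutionMetadata[], nowMs: number): boolean {
  const latestExecution = executions[executions.length - 1];
  if (!latestExecution) return false; // the early return the note refers to
  const lastActivity =
    latestExecution.lastHeartbeat ??
    latestExecution.completedAt ??
    latestExecution.startedAt;
  return nowMs - lastActivity > IDLE_TIMEOUT_MS;
}

const preparedAt = 0;
const now = 20 * 60 * 1000; // 20 minutes after prepare, with no executions
console.log(oldWouldStop(preparedAt, now)); // true
console.log(newWouldStop([], now)); // false
```

In this sketch only the old check would stop the wrapper; per the note, the sandbox's own sleep timer is what reclaims the container in the new flow.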

Logic correctness of the new approach:

  • The early return when activeExecutionId !== null prevents any false cleanup during an active run — this is the fix for the original bug.
  • getAll() returns executions in insertion order (via executions.push()), so executions[executions.length - 1] correctly gives the most recently started execution.
  • The fallback chain lastHeartbeat ?? completedAt ?? startedAt is always a number since startedAt is required on ExecutionMetadata.
  • The active execution check is now done before the expensive getAll() call, which is a minor performance improvement over the old ordering.

All removed symbols (recordKiloServerActivity, kiloServerLastActivity, withDORetry imports) are fully cleaned up with no dangling references.

Files Reviewed (5 files)
  • services/cloud-agent-next/src/execution/orchestrator.ts
  • services/cloud-agent-next/src/persistence/CloudAgentSession.ts
  • services/cloud-agent-next/src/persistence/schemas.ts
  • services/cloud-agent-next/src/persistence/types.ts
  • services/cloud-agent-next/src/router/handlers/session-prepare.ts



Reviewed by claude-sonnet-4.6 · 646,262 tokens

@eshurakov eshurakov merged commit 6ab4151 into main May 11, 2026
23 of 24 checks passed
@eshurakov eshurakov deleted the fix/cloud-agent-idle-cleanup branch May 11, 2026 15:27