Skip to content

fix(deploy): wait for quickcheck containers to be ready#349

Merged
fslongjin merged 1 commit into
TencentCloud:masterfrom
fslongjin:fix/quickcheck-container-wait
May 25, 2026
Merged

fix(deploy): wait for quickcheck containers to be ready#349
fslongjin merged 1 commit into
TencentCloud:masterfrom
fslongjin:fix/quickcheck-container-wait

Conversation

@fslongjin
Copy link
Copy Markdown
Member

@fslongjin fslongjin commented May 25, 2026

Summary

  • Poll one-click quickcheck container state while Docker reports starting.
  • Fix a deployment race where cube-webui can still be starting because its container health check interval is 10s; the quickcheck script can also reach the container check at around 10s, before Docker has refreshed the health status, causing deployment to fail with:
[one-click-runtime] ERROR: container is not ready: cube-webui (status=starting)
  • Keep failing fast for missing or unhealthy containers, with a configurable timeout via CUBE_QUICKCHECK_CONTAINER_TIMEOUT.

Assisted-by: Cursor:GPT-5.5

@cubesandboxbot
Copy link
Copy Markdown

Claude Code is working…

I'll analyze this and get back to you.

View job run

Allow one-click quickcheck to tolerate containers that are still starting by polling until they become ready or a timeout is reached.

Signed-off-by: jinlong <jinlong@tencent.com>
@fslongjin fslongjin force-pushed the fix/quickcheck-container-wait branch from 8541480 to a054efd Compare May 25, 2026 12:53
@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented May 25, 2026

Claude finished @fslongjin's task in 32s —— View job


Review: fix(deploy): wait for quickcheck containers to be ready

Summary

This PR replaces the single-shot container state check in check_container_ready() with a polling loop that retries while Docker reports starting, up to a configurable timeout (default 60s). Containers in healthy or running states are accepted immediately; any other state (missing, unhealthy, etc.) fails fast.

Overall assessment

The change is sound and well-structured. The polling approach is the right fix for the race where containers are still in starting when the quickcheck runs. Here's the analysis:

✅ Good things

  • Configurable timeout via CUBE_QUICKCHECK_CONTAINER_TIMEOUT (default 60s) — good for tuning in slower environments.
  • Correct semantics: healthy and running both accepted, starting silently retried, everything else fails fast.
  • Consistent failure mode: uses die() from common.sh for both timeout and invalid states, matching the rest of the script.
  • Clean git hygiene: single commit, focused change, no unrelated modifications.

🔍 Minor observations

1. Timeout accuracy off by up to interval seconds

When a container stays in starting state, the timeout check (elapsed >= timeout) is evaluated after the case statement but before the sleep. However, elapsed is only incremented after the sleep, so the actual wall-clock timeout for the starting case is timeout + interval (up to 62s with defaults).

This is negligible in practice (2s variance on a 60s timeout) and unlikely to matter. Not a blocker.

2. check_container_ready() duplicates wait_for_health() from common.sh

File deploy/one-click/scripts/one-click/common.sh:233-248 already has a wait_for_health() function that polls container state with retries:

wait_for_health() {
  local container="$1"
  local retries="${2:-40}"
  local delay="${3:-2}"
  ...
}

The key difference is that wait_for_health() returns 1 on timeout while check_container_ready() calls die(). Since quickcheck.sh already sources common.sh, one could argue for reusing wait_for_health() and wrapping it with die on failure. However:

  • wait_for_health uses retry-count rather than wall-clock timeout
  • The scripts are separate deployment modules — some duplication is acceptable

This is not a change request, just an observation.

3. Arithmetic context is safe under set -e

The (( ... )) arithmetic returns 1 when the expression evaluates to 0, which would trigger set -e. However, since the comparison is elapsed >= timeout (which never evaluates to 0 when true), and elapsed starts at 0 while timeout defaults to 60, this is safe. The first iteration's check (( 0 >= 60 )) evaluates to 0 (false), but this is inside a while loop, and set -e does not trigger inside while test conditions. Good.

Verdict

Approved. The change is correct, well-tested, and addresses the container readiness race condition effectively. The minor timeout drift is acceptable.

@fslongjin fslongjin merged commit 6b8f774 into TencentCloud:master May 25, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant