
deploy: wait for network-agent readiness in one-click startup #304

Merged: kinwin-ustc merged 1 commit into TencentCloud:master from xiongxz:codex/wait-network-agent-readiness on May 19, 2026

Conversation

@xiongxz
Contributor

@xiongxz xiongxz commented May 19, 2026

Fix one-click startup racing network-agent readiness.

up.sh and up-compute.sh previously started network-agent and immediately continued with the rest of the stack. On slower hosts, Cubelet can start while network-agent is still coming up, which can leave Cubelet-dependent services unavailable during template artifact distribution. In that case template creation can fail with:

cubemastercli run fail: artifact distribution failed on all 1 nodes: rpc error: code = Unimplemented desc = unknown service cubelet.services.images.v1.Images

This waits for the network-agent /readyz endpoint before starting Cubelet-dependent services, and exposes NETWORK_AGENT_READY_TIMEOUT so one-click deployments can tune the wait budget. The default Cubelet config also extends the internal network-agent initialization timeout to match the one-click startup budget.
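
For readers unfamiliar with the pattern, the readiness wait amounts to a bounded polling loop. The following is a minimal sketch under stated assumptions, not the repository's `wait_for_http` (its signature is not shown in this PR): `wait_until` is a hypothetical helper, and the `/readyz` URL in the commented example is an assumed address.

```shell
# Hypothetical helper (NOT the repository's wait_for_http): run a probe
# command until it succeeds or roughly <timeout> seconds of sleeping have passed.
# Usage: wait_until <timeout_seconds> <delay_seconds> <command> [args...]
# <delay_seconds> must be a positive integer, otherwise the loop never ends.
wait_until() {
  local timeout="$1" delay="$2" elapsed=0
  shift 2
  # Probe first, so an already-ready service returns immediately.
  while ! "$@" >/dev/null 2>&1; do
    # Budget exhausted: report failure to the caller.
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$delay"
    # Only sleep time is counted; time spent inside the probe is ignored.
    elapsed=$((elapsed + delay))
  done
  return 0
}

# Example call (commented out; the URL is an assumption for illustration):
# wait_until "${NETWORK_AGENT_READY_TIMEOUT:-120}" 1 \
#   curl -fsS "http://127.0.0.1:19090/readyz" \
#   || { echo "network-agent did not become ready" >&2; exit 1; }
```

Passing the probe as a command keeps the loop independent of curl, so the same helper could wrap any readiness check.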

Validation

bash -n deploy/one-click/scripts/one-click/up.sh deploy/one-click/scripts/one-click/up-compute.sh

Assisted-by: Codex:GPT-5.5

@xiongxz xiongxz changed the title from "[codex] one-click: wait for network-agent readiness" to "deploy: wait for network-agent readiness in one-click startup" May 19, 2026
Wait for network-agent readiness before starting Cubelet-dependent services and extend the Cubelet network-agent initialization timeout.

This avoids template artifact distribution racing Cubelet startup when network-agent is still coming up.

Assisted-by: Codex:GPT-5.5
Signed-off-by: xiongxz <xiuzhang.xiong@lexmount.com>
@xiongxz xiongxz force-pushed the codex/wait-network-agent-readiness branch from fe2e298 to 6c52e93 May 19, 2026 03:44
@kinwin-ustc
Collaborator

/cubebot review

@github-actions

github-actions Bot commented May 19, 2026

Claude finished @kinwin-ustc's task in 1m 9s


Review Checklist

  • Review code changes
  • Analyze for bugs, security issues, and edge cases
  • Post review feedback

Code Review: PR #304

This PR fixes a race condition where network-agent hasn't finished initializing by the time Cubelet starts, causing template artifact distribution to fail with rpc error: code = Unimplemented. The fix adds a readiness check (/readyz) on the network-agent before proceeding.

What Was Changed

| File | Change |
| --- | --- |
| deploy/one-click/scripts/one-click/up.sh | +4 lines: NETWORK_AGENT_READY_TIMEOUT variable and wait_for_http call after network-agent start |
| deploy/one-click/scripts/one-click/up-compute.sh | +4/-2: same readiness wait, plus the new timeout variable |
| deploy/one-click/env.example | +1: documentation for NETWORK_AGENT_READY_TIMEOUT=120 |
| Cubelet/config/config.toml | +2: network_agent_init_timeout = "120s" in the network plugin config |

Positive Aspects

  1. Correct diagnosis: The race is real. Starting Cubelet before network-agent has registered its gRPC service causes the Unimplemented error on the Images service. Waiting on /readyz before proceeding is the right fix.

  2. Consistent pattern: Both up.sh and up-compute.sh use the same approach: start network-agent normally, then call wait_for_http before launching dependent services. The scripts remain structurally parallel, which is good for maintainability.

  3. Good reuse of existing utilities: wait_for_http from common.sh (line 219) already handles retry loops, configurable delays, and curl -f error semantics. No new polling machinery was introduced.

  4. Configurable timeout: NETWORK_AGENT_READY_TIMEOUT defaults to 120s but can be overridden via env or .one-click.env. This is essential because the timeout that works on a fast dev machine may be too short on a slower host.

  5. Matched timeouts: network_agent_init_timeout = "120s" in config.toml aligns Cubelet's internal patience with the one-click startup script's readiness budget. Without this, Cubelet could internally give up on network-agent before the startup script's wait_for_http even notices a problem.

  6. Documented in env.example: The new variable is added to env.example at line 106, making it visible to deployers tuning their configuration.
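
As an illustration of the configurable timeout in point 4, a default-with-override lookup could look like the sketch below. `resolve_ready_timeout` is a hypothetical name and the integer validation is an added assumption; the PR only states that NETWORK_AGENT_READY_TIMEOUT defaults to 120 and can be overridden.

```shell
# Hypothetical helper: read the readiness timeout from the environment,
# falling back to the documented default of 120 seconds.
resolve_ready_timeout() {
  local value="${NETWORK_AGENT_READY_TIMEOUT:-120}"
  case "$value" in
    *[!0-9]*)
      # Assumed validation: reject anything that is not a plain integer.
      echo "NETWORK_AGENT_READY_TIMEOUT must be a non-negative integer, got: $value" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$value"
}

# Example: print the effective budget next to the Cubelet-style duration
# so the two values can be compared by eye.
timeout_seconds="$(resolve_ready_timeout)" \
  && echo "startup wait: ${timeout_seconds}s; config.toml should use: \"${timeout_seconds}s\""
```

Validating early would let a typo such as `NETWORK_AGENT_READY_TIMEOUT=2m` fail with a clear message instead of surfacing later inside the polling loop.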

Potential Concerns

No -L on curl, but that's correct
The wait_for_http call uses curl -fsS without -L (follow redirects). This is fine — a /readyz endpoint should return 200 directly, not redirect. If it ever redirected, we'd want to know rather than silently follow.

The die error message references LOG_DIR
If network-agent fails to become ready:

die "network-agent did not become ready, check logs under ${LOG_DIR}"

This references ${LOG_DIR}, which is correct — the network-agent's stdout/stderr is redirected to "${LOG_DIR}/network-agent.log" by start_with_pidfile. The error message is actionable.

Edge case: network-agent goes down after /readyz passes but before Cubelet connects
The network_agent_init_timeout = "120s" in Cubelet's config handles this independently — Cubelet will retry connections for up to 120s before giving up. This is a reasonable safety net and doesn't depend on the one-click startup script's checks.

Security

No security concerns. The readiness check is against 127.0.0.1:19090 (localhost-only), no credentials are transmitted, and no filesystem changes are made during the check.

Nit

The NETWORK_AGENT_READY_TIMEOUT default is 120, and wait_for_http uses a 1-second delay between retries, meaning up to 120 HTTP requests will be made to /readyz in the worst case. This is negligible load on localhost but might be worth a brief comment noting that the delay is intentionally short so the first successful readiness probe isn't unnecessarily late. Not blocking.
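
The request count above follows from dividing the budget by the delay. A small sketch of that arithmetic, assuming a loop that sleeps a fixed delay between probes (`max_probes` is a hypothetical helper, and the exact count in wait_for_http may differ by one depending on whether it probes before or after the final sleep):

```shell
# Hypothetical helper: number of sleep intervals needed to cover
# <timeout> seconds with a fixed <delay>, using ceiling division.
max_probes() {
  local timeout="$1" delay="$2"
  echo $(( (timeout + delay - 1) / delay ))
}

max_probes 120 1   # prints 120
max_probes 120 5   # prints 24
```

Raising the delay to 5 seconds would cut the worst case to 24 requests, at the cost of noticing readiness up to 5 seconds late.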

Summary

This is a clean, focused fix for a genuine race condition. The implementation is consistent across both script variants, uses existing infrastructure, and has appropriate configurability. The matched timeouts between the startup script and Cubelet's internal config are a thoughtful touch.


Branch: codex/wait-network-agent-readiness

@xiongxz xiongxz marked this pull request as ready for review May 19, 2026 04:24
@kinwin-ustc kinwin-ustc merged commit c9364ac into TencentCloud:master May 19, 2026
7 of 8 checks passed
@xiongxz xiongxz deleted the codex/wait-network-agent-readiness branch May 19, 2026 07:36