Skip to content

Fix nodectl service unresponsive when a node's control-server is unreachable#140

Merged
Keshoid merged 7 commits into
release/nodectl/v0.5.0from
feature/sma-88-nodectl-api-commands-hang-when-service-port-is-unavailable
May 15, 2026
Merged

Fix nodectl service unresponsive when a node's control-server is unreachable#140
Keshoid merged 7 commits into
release/nodectl/v0.5.0from
feature/sma-88-nodectl-api-commands-hang-when-service-port-is-unavailable

Conversation

@Keshoid
Copy link
Copy Markdown
Contributor

@Keshoid Keshoid commented May 14, 2026

Root cause

AdnlClient::connect in src/adnl/src/adnl/client.rs performs a synchronous TCP connect via socket2::Socket::connect_timeout. The call uses poll(2) under the hood and parks the tokio worker thread until the kernel either completes the handshake or hits the configured timeout (default 20s).

When a node's control-server port is firewalled / black-holed, every ADNL connect attempt parks a worker for the full timeout. On a typical k8s pod (2–4 vCPU → 2–4 tokio workers) a handful of concurrent ADNL calls (elections runner per binding + voting provider + GET /v1/nodes probe + retry loop in do_rq up to 4 attempts each) is enough to park every worker. With no worker available, axum::serve's listener.accept() is never polled. The HTTP server stops accepting new requests — including /health.

Fix

Primary

Introduce AdnlClient::timeout_connect alongside the existing connect. The new method uses tokio::net::TcpStream::connect wrapped in tokio::time::timeout, so the worker yields back to the runtime while the kernel performs the handshake or waits on an unresponsive peer. The original SO_LINGER 0 option is preserved via socket2::SockRef::from(&stream).

AdnlClient::connect is left untouched to avoid disturbing src/node/bin/console.rs and ADNL test suites. ControlClientAdnl in node-control/control-client/ is switched to timeout_connect.

Secondary

  • common::resolve_ip is made async (tokio::net::lookup_host) to remove another blocking DNS call from async paths.
  • nodectl CLI HTTP client gains a connect timeout (5s default) and overall request timeout (10s default) for service REST calls, with NODECTL_API_CONNECT_TIMEOUT_SECS / NODECTL_API_REQUEST_TIMEOUT_SECS env overrides. Timeout / connect failures are mapped to actionable error messages that include the attempted URL.

Side effect

timeout_connect selects the socket address family from the resolved SocketAddr (IPv4 or IPv6). The legacy connect hard-coded Domain::IPV4, so any IPv6 control-server address previously failed with EAFNOSUPPORT. IPv6 now works.

Out of scope (follow-ups for milestone 0.6.0)

  • ClientJsonRpc / toncenter_rs JSON-RPC client also lacks timeouts — separate class of hang (service-side blocking on unreachable ton-http-api endpoint). Worth bounded retries / circuit breaker.
  • Memory pressure / RSS metrics endpoint for easier triage in future incidents.

Closes SMA-88

Copilot AI review requested due to automatic review settings May 14, 2026 22:48
@linear
Copy link
Copy Markdown

linear Bot commented May 14, 2026

SMA-88

@Keshoid Keshoid requested review from ITBear and mrnkslv May 14, 2026 22:48
@Keshoid Keshoid changed the base branch from master to release/nodectl/v0.5.0 May 14, 2026 22:50
…-commands-hang-when-service-port-is-unavailable
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses nodectl service unresponsiveness when a node’s control-server is unreachable by eliminating blocking TCP/DNS operations on tokio worker threads and by adding client-side timeouts, plus it extends the control-plane API/CLI to handle voting via REST.

Changes:

  • Add async-native AdnlClient::timeout_connect and switch the nodectl control client to use it to avoid tokio worker starvation on black-holed connects.
  • Make common::resolve_ip async using tokio::net::lookup_host to avoid blocking DNS resolution on async paths.
  • Add REST voting endpoints to the service, migrate nodectl vote to a REST client, and introduce HTTP connect/request timeouts with actionable CLI errors.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/node/tests/test_run_net_py/run_singlehost_nodectl.py Extends single-host nodectl E2E to smoke-test voting REST/CLI and optionally create a proposal.
src/node/tests/test_run_net_py/readme.md Documents the nodectl single-host E2E bootstrap and voting smoke/creation options.
src/node/tests/test_load_net/scripts/topup.ts Increases post-send balance polling timeout for top-up script.
src/node/tests/test_load_net/scripts/create-proposal.ts Improves error message when proposal expiry is below minimum.
src/node/tests/test_load_net/scripts/add-nominators-to-pool.ts Refactors nominator funding/deploy to use a highload wallet and adds readiness polling.
src/node/tests/test_load_net/package.json Adds @tonkite/highload-wallet-v3 dependency.
src/node/tests/test_load_net/bun.lock Locks @tonkite/highload-wallet-v3 dependency.
src/node-control/service/src/task/mod.rs Adjusts elections task wiring after moving elections into the service crate.
src/node-control/service/src/service_main_task.rs Updates imports for elections task callback after module move.
src/node-control/service/src/runtime_config.rs Makes config file persistence optional (tests) and adds test helper with_path.
src/node-control/service/src/lib.rs Exposes new elections module from the service crate.
src/node-control/service/src/http/mod.rs Adds CRUD handler test module.
src/node-control/service/src/http/http_server_task.rs Adds voting REST routes and OpenAPI registration for voting types.
src/node-control/service/src/http/entity_crud_handlers_tests.rs Adds REST mutation tests for CRUD endpoints and persistence-to-disk behavior.
src/node-control/service/src/http/config_handlers.rs Implements voting REST handlers (/v1/voting/*) and related DTOs.
src/node-control/service/src/http/config_handlers_tests.rs Adds tests for voting config snapshot and basic voting proposal add/rm behaviors.
src/node-control/service/src/http/auth_tests.rs Expands auth/role-gating tests to cover voting routes and CRUD route matrices.
src/node-control/service/src/elections/runner.rs Updates module paths after elections module move.
src/node-control/service/src/elections/runner_tests.rs Updates type paths in tests after elections module move.
src/node-control/service/src/elections/providers/traits.rs Introduces provider traits and shared types inside service crate.
src/node-control/service/src/elections/providers/mod.rs Exposes provider modules from the new in-crate elections module.
src/node-control/service/src/elections/providers/default.rs Updates imports to use in-crate provider traits/types.
src/node-control/service/src/elections/mod.rs Adds elections module definition in the service crate.
src/node-control/service/src/elections/election_task.rs Updates module paths after elections move into service crate.
src/node-control/service/src/elections/election_emulator.rs Adds election emulation and stake threshold computation utilities + tests.
src/node-control/service/src/elections/adaptive_strategy.rs Updates imports after elections module move.
src/node-control/service/Cargo.toml Removes external elections crate dependency and adds mockall for tests.
src/node-control/README.md Updates nodectl vote documentation to reflect REST-based behavior and auth requirements.
src/node-control/Makefile Removes elections package from test target list.
src/node-control/elections/Cargo.toml Removes standalone elections crate manifest (elections now lives in service crate).
src/node-control/control-client/src/client_adnl.rs Switches to timeout_connect and adds a starvation regression test.
src/node-control/control-client/Cargo.toml Adds tokio dev-dependency for the new async test.
src/node-control/common/src/socket_utils.rs Makes resolve_ip async using tokio DNS and updates tests accordingly.
src/node-control/common/src/app_config.rs Updates ADNL config building to await the new async resolver.
src/node-control/commands/src/commands/nodectl/vote_cmd.rs Migrates voting CLI commands to call the service voting REST API.
src/node-control/commands/src/commands/nodectl/utils.rs Adds reqwest client timeouts, env overrides, and actionable error mapping + tests.
src/node-control/commands/src/commands/nodectl/service_api_cmd.rs Switches API client usage to the timeout-enabled builder and mapped errors.
src/node-control/commands/src/commands/nodectl/config_wallet_cmd.rs Adjusts elections provider import path and improves wallet address formatting.
src/node-control/commands/src/commands/nodectl/config_elections_cmd.rs Fixes table formatting with ANSI coloring by padding before coloring.
src/node-control/commands/Cargo.toml Removes unused hex and external elections dependency; relies on service elections module.
src/Makefile Removes formatting target for deleted node-control/elections crate.
src/Cargo.toml Removes node-control/elections from workspace members.
src/Cargo.lock Updates lockfile for workspace crate graph changes and new dev-deps.
src/adnl/src/adnl/client.rs Adds AdnlClient::timeout_connect with tokio-based connect + timeout and linger preservation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 22 to +26
if let Ok(socket_addr) = addr.parse::<SocketAddr>() {
return Ok(socket_addr);
}

let mut resolved =
addr.to_socket_addrs().with_context(|| format!("failed to resolve address: {addr}"))?;

resolved.next().with_context(|| format!("resolver returned no addresses for: {addr}"))
tokio::net::lookup_host(addr)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@Keshoid Keshoid merged commit e5ad45f into release/nodectl/v0.5.0 May 15, 2026
5 of 6 checks passed
@Keshoid Keshoid deleted the feature/sma-88-nodectl-api-commands-hang-when-service-port-is-unavailable branch May 15, 2026 10:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants