Fix nodectl service unresponsive when a node's control-server is unreachable#140
Merged
Conversation
…-commands-hang-when-service-port-is-unavailable
Contributor
There was a problem hiding this comment.
Pull request overview
This PR addresses nodectl service unresponsiveness when a node’s control-server is unreachable by eliminating blocking TCP/DNS operations on tokio worker threads and by adding client-side timeouts, plus it extends the control-plane API/CLI to handle voting via REST.
Changes:
- Add async-native
AdnlClient::timeout_connectand switch the nodectl control client to use it to avoid tokio worker starvation on black-holed connects. - Make
common::resolve_ipasync usingtokio::net::lookup_hostto avoid blocking DNS resolution on async paths. - Add REST voting endpoints to the service, migrate
nodectl voteto a REST client, and introduce HTTP connect/request timeouts with actionable CLI errors.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| src/node/tests/test_run_net_py/run_singlehost_nodectl.py | Extends single-host nodectl E2E to smoke-test voting REST/CLI and optionally create a proposal. |
| src/node/tests/test_run_net_py/readme.md | Documents the nodectl single-host E2E bootstrap and voting smoke/creation options. |
| src/node/tests/test_load_net/scripts/topup.ts | Increases post-send balance polling timeout for top-up script. |
| src/node/tests/test_load_net/scripts/create-proposal.ts | Improves error message when proposal expiry is below minimum. |
| src/node/tests/test_load_net/scripts/add-nominators-to-pool.ts | Refactors nominator funding/deploy to use a highload wallet and adds readiness polling. |
| src/node/tests/test_load_net/package.json | Adds @tonkite/highload-wallet-v3 dependency. |
| src/node/tests/test_load_net/bun.lock | Locks @tonkite/highload-wallet-v3 dependency. |
| src/node-control/service/src/task/mod.rs | Adjusts elections task wiring after moving elections into the service crate. |
| src/node-control/service/src/service_main_task.rs | Updates imports for elections task callback after module move. |
| src/node-control/service/src/runtime_config.rs | Makes config file persistence optional (tests) and adds test helper with_path. |
| src/node-control/service/src/lib.rs | Exposes new elections module from the service crate. |
| src/node-control/service/src/http/mod.rs | Adds CRUD handler test module. |
| src/node-control/service/src/http/http_server_task.rs | Adds voting REST routes and OpenAPI registration for voting types. |
| src/node-control/service/src/http/entity_crud_handlers_tests.rs | Adds REST mutation tests for CRUD endpoints and persistence-to-disk behavior. |
| src/node-control/service/src/http/config_handlers.rs | Implements voting REST handlers (/v1/voting/*) and related DTOs. |
| src/node-control/service/src/http/config_handlers_tests.rs | Adds tests for voting config snapshot and basic voting proposal add/rm behaviors. |
| src/node-control/service/src/http/auth_tests.rs | Expands auth/role-gating tests to cover voting routes and CRUD route matrices. |
| src/node-control/service/src/elections/runner.rs | Updates module paths after elections module move. |
| src/node-control/service/src/elections/runner_tests.rs | Updates type paths in tests after elections module move. |
| src/node-control/service/src/elections/providers/traits.rs | Introduces provider traits and shared types inside service crate. |
| src/node-control/service/src/elections/providers/mod.rs | Exposes provider modules from the new in-crate elections module. |
| src/node-control/service/src/elections/providers/default.rs | Updates imports to use in-crate provider traits/types. |
| src/node-control/service/src/elections/mod.rs | Adds elections module definition in the service crate. |
| src/node-control/service/src/elections/election_task.rs | Updates module paths after elections move into service crate. |
| src/node-control/service/src/elections/election_emulator.rs | Adds election emulation and stake threshold computation utilities + tests. |
| src/node-control/service/src/elections/adaptive_strategy.rs | Updates imports after elections module move. |
| src/node-control/service/Cargo.toml | Removes external elections crate dependency and adds mockall for tests. |
| src/node-control/README.md | Updates nodectl vote documentation to reflect REST-based behavior and auth requirements. |
| src/node-control/Makefile | Removes elections package from test target list. |
| src/node-control/elections/Cargo.toml | Removes standalone elections crate manifest (elections now lives in service crate). |
| src/node-control/control-client/src/client_adnl.rs | Switches to timeout_connect and adds a starvation regression test. |
| src/node-control/control-client/Cargo.toml | Adds tokio dev-dependency for the new async test. |
| src/node-control/common/src/socket_utils.rs | Makes resolve_ip async using tokio DNS and updates tests accordingly. |
| src/node-control/common/src/app_config.rs | Updates ADNL config building to await the new async resolver. |
| src/node-control/commands/src/commands/nodectl/vote_cmd.rs | Migrates voting CLI commands to call the service voting REST API. |
| src/node-control/commands/src/commands/nodectl/utils.rs | Adds reqwest client timeouts, env overrides, and actionable error mapping + tests. |
| src/node-control/commands/src/commands/nodectl/service_api_cmd.rs | Switches API client usage to the timeout-enabled builder and mapped errors. |
| src/node-control/commands/src/commands/nodectl/config_wallet_cmd.rs | Adjusts elections provider import path and improves wallet address formatting. |
| src/node-control/commands/src/commands/nodectl/config_elections_cmd.rs | Fixes table formatting with ANSI coloring by padding before coloring. |
| src/node-control/commands/Cargo.toml | Removes unused hex and external elections dependency; relies on service elections module. |
| src/Makefile | Removes formatting target for deleted node-control/elections crate. |
| src/Cargo.toml | Removes node-control/elections from workspace members. |
| src/Cargo.lock | Updates lockfile for workspace crate graph changes and new dev-deps. |
| src/adnl/src/adnl/client.rs | Adds AdnlClient::timeout_connect with tokio-based connect + timeout and linger preservation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
22
to
+26
| if let Ok(socket_addr) = addr.parse::<SocketAddr>() { | ||
| return Ok(socket_addr); | ||
| } | ||
|
|
||
| let mut resolved = | ||
| addr.to_socket_addrs().with_context(|| format!("failed to resolve address: {addr}"))?; | ||
|
|
||
| resolved.next().with_context(|| format!("resolver returned no addresses for: {addr}")) | ||
| tokio::net::lookup_host(addr) |
ITBear
approved these changes
May 15, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Root cause
AdnlClient::connectinsrc/adnl/src/adnl/client.rsperforms a synchronous TCP connect viasocket2::Socket::connect_timeout. The call usespoll(2)under the hood and parks the tokio worker thread until the kernel either completes the handshake or hits the configured timeout (default 20s).When a node's control-server port is firewalled / black-holed, every ADNL connect attempt parks a worker for the full timeout. On a typical k8s pod (2–4 vCPU → 2–4 tokio workers) a handful of concurrent ADNL calls (elections runner per binding + voting provider +
GET /v1/nodesprobe + retry loop indo_rqup to 4 attempts each) is enough to park every worker. With no worker available,axum::serve'slistener.accept()is never polled. The HTTP server stops accepting new requests — including/health.Fix
Primary
Introduce
AdnlClient::timeout_connectalongside the existingconnect. The new method usestokio::net::TcpStream::connectwrapped intokio::time::timeout, so the worker yields back to the runtime while the kernel performs the handshake or waits on an unresponsive peer. The originalSO_LINGER 0option is preserved viasocket2::SockRef::from(&stream).AdnlClient::connectis left untouched to avoid disturbingsrc/node/bin/console.rsand ADNL test suites.ControlClientAdnlinnode-control/control-client/is switched totimeout_connect.Secondary
common::resolve_ipis made async (tokio::net::lookup_host) to remove another blocking DNS call from async paths.nodectlCLI HTTP client gains a connect timeout (5s default) and overall request timeout (10s default) for service REST calls, withNODECTL_API_CONNECT_TIMEOUT_SECS/NODECTL_API_REQUEST_TIMEOUT_SECSenv overrides. Timeout / connect failures are mapped to actionable error messages that include the attempted URL.Side effect
timeout_connectselects the socket address family from the resolvedSocketAddr(IPv4 or IPv6). The legacyconnecthard-codedDomain::IPV4, so any IPv6 control-server address previously failed withEAFNOSUPPORT. IPv6 now works.Out of scope (follow-ups for milestone 0.6.0)
ClientJsonRpc/toncenter_rsJSON-RPC client also lacks timeouts — separate class of hang (service-side blocking on unreachable ton-http-api endpoint). Worth bounded retries / circuit breaker.Closes SMA-88