Fix TCP PeerAgent advertised host#111
Conversation
JimyMa
commented
Jun 4, 2026
- Fix TCP PeerAgent cross-host advertised address handling: TCP PeerAgent connections now default to a peer-reachable local host derived from registered topology and, when that address is loopback, fall back to the local route used to reach NanoCtrl;
- TCP rendezvous now fails fast when endpoint.connect does not actually establish the underlying endpoint;
- add a two-process TCP one-sided read example plus focused tests for TCP host resolution and conn_meta reuse.
There was a problem hiding this comment.
Code Review
This pull request introduces TCP support and dynamic local host resolution to PeerAgent, including a new two-process TCP read example and corresponding unit tests. Key feedback includes: ensuring that connection reuse in _ensure_connection_from_meta filters by transport type to avoid returning RDMA connections for TCP requests; wrapping endpoint.connect in a try-except block to handle exceptions and prevent threads from blocking on failed connections; dynamically resolving the socket family in _local_ip_for_remote to support IPv6; and avoiding the external use of the private _resolve_tcp_local_host method in the example script to preserve encapsulation.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| endpoint.connect(peer_qp_info) | ||
| if hasattr(endpoint, "is_connected") and not endpoint.is_connected(): | ||
| logger.warning( | ||
| "StreamMailbox %s: endpoint.connect(%s) failed for endpoint_info=%s", | ||
| self._agent.alias, | ||
| peer, | ||
| peer_qp_info, | ||
| ) | ||
| self._agent._mark_connection_failed(conn_id) | ||
| return |
There was a problem hiding this comment.
If endpoint.connect(peer_qp_info) raises an exception (e.g., due to connection refused, network timeout, or invalid parameters), the exception will propagate out of _try_connect_peer_inner and be caught in the main listen loop. Because _mark_connection_failed is never called in this scenario, the connection state remains stuck in "connecting". Any thread waiting on conn.wait() will block for the full timeout duration (default 60 seconds) before raising a TimeoutError. Wrapping the connection attempt in a try...except block ensures that any failure is caught immediately, marking the connection as failed and allowing waiting threads to fail fast.
try:
endpoint.connect(peer_qp_info)
if hasattr(endpoint, "is_connected") and not endpoint.is_connected():
raise RuntimeError("Endpoint is not connected after connect()")
except Exception as e:
logger.warning(
"StreamMailbox %s: endpoint.connect(%s) failed for endpoint_info=%s: %s",
self._agent.alias,
peer,
peer_qp_info,
e,
)
self._agent._mark_connection_failed(conn_id)
returnCo-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request improves TCP connection handling in the PeerAgent by dynamically resolving local TCP host/IP addresses instead of defaulting to 0.0.0.0, preventing remote peers from receiving unreachable addresses. It also adds connection reuse for TCP, robust error handling during connection handshakes (marking connections as failed and raising errors), a new two-process TCP one-sided read example, and corresponding unit tests. A review comment correctly points out a duplicate @staticmethod decorator in _agent.py that should be removed.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>