Skip to content

Fix TCP PeerAgent advertised host#111

Merged
JimyMa merged 8 commits into
mainfrom
fix-tcp-peer-agent-advertised-host
Jun 4, 2026
Merged

Fix TCP PeerAgent advertised host#111
JimyMa merged 8 commits into
mainfrom
fix-tcp-peer-agent-advertised-host

Conversation

@JimyMa
Copy link
Copy Markdown
Contributor

@JimyMa JimyMa commented Jun 4, 2026

  • Fix TCP PeerAgent cross-host advertised address handling: TCP PeerAgent connections now default to a peer-reachable local host derived from registered topology and, when that address is loopback, fall back to the local route used to reach NanoCtrl;
  • TCP rendezvous now fails fast when endpoint.connect does not actually establish the underlying endpoint;
  • add a two-process TCP one-sided read example plus focused tests for TCP host resolution and conn_meta reuse.

@JimyMa JimyMa requested a deployment to self-hosted-rdma June 4, 2026 13:51 — with GitHub Actions Waiting
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces TCP support and dynamic local host resolution to PeerAgent, including a new two-process TCP read example and corresponding unit tests. Key feedback includes: ensuring that connection reuse in _ensure_connection_from_meta filters by transport type to avoid returning RDMA connections for TCP requests; wrapping endpoint.connect in a try-except block to handle exceptions and prevent threads from blocking on failed connections; dynamically resolving the socket family in _local_ip_for_remote to support IPv6; and avoiding the external use of the private _resolve_tcp_local_host method in the example script to preserve encapsulation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread dlslime/dlslime/peer_agent/_agent.py
Comment thread dlslime/dlslime/peer_agent/_mailbox.py Outdated
Comment on lines +375 to +384
endpoint.connect(peer_qp_info)
if hasattr(endpoint, "is_connected") and not endpoint.is_connected():
logger.warning(
"StreamMailbox %s: endpoint.connect(%s) failed for endpoint_info=%s",
self._agent.alias,
peer,
peer_qp_info,
)
self._agent._mark_connection_failed(conn_id)
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If endpoint.connect(peer_qp_info) raises an exception (e.g., due to connection refused, network timeout, or invalid parameters), the exception will propagate out of _try_connect_peer_inner and be caught in the main listen loop. Because _mark_connection_failed is never called in this scenario, the connection state remains stuck in "connecting". Any thread waiting on conn.wait() will block for the full timeout duration (default 60 seconds) before raising a TimeoutError. Wrapping the connection attempt in a try...except block ensures that any failure is caught immediately, marking the connection as failed and allowing waiting threads to fail fast.

        try:
            endpoint.connect(peer_qp_info)
            if hasattr(endpoint, "is_connected") and not endpoint.is_connected():
                raise RuntimeError("Endpoint is not connected after connect()")
        except Exception as e:
            logger.warning(
                "StreamMailbox %s: endpoint.connect(%s) failed for endpoint_info=%s: %s",
                self._agent.alias,
                peer,
                peer_qp_info,
                e,
            )
            self._agent._mark_connection_failed(conn_id)
            return

Comment thread dlslime/dlslime/peer_agent/_agent.py
Comment thread dlslime/examples/python/p2p_tcp_rc_read_peer_agent_two_process.py Outdated
Comment thread dlslime/examples/python/p2p_tcp_rc_read_peer_agent_two_process.py Outdated
@JimyMa JimyMa requested a deployment to self-hosted-rdma June 4, 2026 14:13 — with GitHub Actions Waiting
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@JimyMa JimyMa requested a deployment to self-hosted-rdma June 4, 2026 14:18 — with GitHub Actions Waiting
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@JimyMa JimyMa requested a deployment to self-hosted-rdma June 4, 2026 14:18 — with GitHub Actions Waiting
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@JimyMa JimyMa requested a deployment to self-hosted-rdma June 4, 2026 14:18 — with GitHub Actions Waiting
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@JimyMa JimyMa requested a deployment to self-hosted-rdma June 4, 2026 14:19 — with GitHub Actions Waiting
@JimyMa JimyMa temporarily deployed to self-hosted-rdma June 4, 2026 14:20 — with GitHub Actions Inactive
@JimyMa
Copy link
Copy Markdown
Contributor Author

JimyMa commented Jun 4, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves TCP connection handling in the PeerAgent by dynamically resolving local TCP host/IP addresses instead of defaulting to 0.0.0.0, preventing remote peers from receiving unreachable addresses. It also adds connection reuse for TCP, robust error handling during connection handshakes (marking connections as failed and raising errors), a new two-process TCP one-sided read example, and corresponding unit tests. A review comment correctly points out a duplicate @staticmethod decorator in _agent.py that should be removed.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread dlslime/dlslime/peer_agent/_agent.py Outdated
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@JimyMa JimyMa temporarily deployed to self-hosted-rdma June 4, 2026 14:22 — with GitHub Actions Inactive
@JimyMa JimyMa merged commit 8721270 into main Jun 4, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant