
Does UCX support TCP multi-rail? #9763

Open
huzhijiang opened this issue Mar 19, 2024 Discussed in #9753 · 4 comments · May be fixed by #9839

Comments

@huzhijiang

Discussed in #9753

Originally posted by huzhijiang March 17, 2024
Here is the TCP multi-rail test result. It seems UCX always chooses one rail to send messages, not both, and the performance also drops a bit when trying to use two TCP devices:

```
[root@promote ucx-1.15.0]# UCX_TLS=tcp UCX_NET_DEVICES=enp7s0  ucx_perftest 192.168.1.199 -t ucp_am_bw -s 32768 -n 100000
[1710655695.035228] [promote:69998:0]        perftest.c:783  UCX  WARN  CPU affinity is not set (bound to 12 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[thread 0]             27279      2.000    36.659    36.659      852.46     852.46       27279       27279
[thread 0]             55450      2.000    35.498    36.069      880.33     866.40       28171       27725
[thread 0]             83648      2.000    35.464    35.865      881.16     871.32       28197       27882
Final:                100000      2.000    35.615    35.824      877.43     872.31       28078       27914


[root@promote ucx-1.15.0]# UCX_TLS=tcp UCX_NET_DEVICES=enp8s0  ucx_perftest 192.168.1.199 -t ucp_am_bw -s 32768 -n 100000
[1710655711.658207] [promote:70004:0]        perftest.c:783  UCX  WARN  CPU affinity is not set (bound to 12 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[thread 0]             26849      6.000    37.248    37.248      838.97     838.97       26847       26847
[thread 0]             54565      2.000    36.084    36.657      866.04     852.51       27713       27280
[thread 0]             82901      2.000    35.293    36.191      885.43     863.48       28334       27631
Final:                100000      2.000    39.720    36.794      786.77     849.32       25177       27178

[root@promote ucx-1.15.0]# UCX_TLS=tcp UCX_NET_DEVICES=enp7s0,enp8s0  ucx_perftest 192.168.1.199 -t ucp_am_bw -s 32768 -n 100000
[1710655666.995142] [promote:69992:0]        perftest.c:783  UCX  WARN  CPU affinity is not set (bound to 12 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[thread 0]             22812      2.000    43.838    43.838      712.85     712.85       22811       22811
[thread 0]             45946      2.000    43.231    43.533      722.86     717.85       23132       22971
[thread 0]             68145      2.000    45.047    44.026      693.71     709.81       22199       22714
[thread 0]             91011      2.000    43.739    43.954      714.47     710.97       22863       22751
Final:                100000      2.000    44.435    43.997      703.27     710.27       22505       22729

```

@huzhijiang

I checked the related code; it seems I must construct the data in a way that triggers both am_zcopy_first and am_zcopy_middle in order to drive more than one TCP iface. So I added the following parameters to ucx_perftest and tried again:
-D iov -s 2048,2048,2048,2048,2048,2048,2048,2048,2048,2048,2048,2048,2048,2048,2048,2048
According to the profiling result, am_zcopy_first and am_zcopy_middle are now both triggered. But according to ifconfig on the sender side, still only one NIC is sending.

From lsof -Pn -p, I can see that ucx_perftest builds two sockets to the remote node, but both use the same local and remote IP addresses (only the ports differ), which seems like an issue:

ucx_perft 3264 root 15u IPv4 60939 0t0 TCP 192.168.200.2:37768->192.168.200.1:57119 (ESTABLISHED)
ucx_perft 3264 root 16u IPv4 53571 0t0 TCP 192.168.200.2:46983->192.168.200.1:37442 (ESTABLISHED)
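
For what it's worth, a quick way to confirm from inside a process which local address a connected socket actually uses (matching what lsof shows above) is getsockname(). This is a minimal standalone sketch, not part of ucx_perftest:

```
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Print the local IPv4 address and port that a connected socket is using. */
static void print_local_addr(int fd)
{
    struct sockaddr_in local;
    socklen_t len = sizeof(local);
    char buf[INET_ADDRSTRLEN];

    if (getsockname(fd, (struct sockaddr *)&local, &len) == 0) {
        inet_ntop(AF_INET, &local.sin_addr, buf, sizeof(buf));
        printf("fd %d -> local %s:%u\n", fd, buf, ntohs(local.sin_port));
    }
}
```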

I would like to promote this discussion to an issue for more attention, many thanks!

Version:
Library version: 1.15.0
Library path: /lib64/libucs.so.0
API headers version: 1.15.0
Git branch '', revision
Configured with: --enable-examples --enable-logging --enable-debug-data=yes --enable-profiling=yes

ucx_info -d
ucx_info_d_0319.txt

Run command:
UCX_LOG_LEVEL=data UCX_TLS=tcp UCX_NET_DEVICES=enp7s0,enp8s0 UCX_MAX_EAGER_RAILS=2 UCX_MAX_RNDV_RAILS=2 ucx_perftest 192.168.1.199 -t ucp_am_bw -D iov -s 2048,2048,2048,2048,2048,2048,2048,2048,2048,2048,2048,2048,2048,2048,2048,2048 -n 10000 > run_log_0319.txt

Run log:
run_log_0319.txt

Profiling result output (not with -n 10000 but with a smaller value):
profiling_0319.txt

@huzhijiang

huzhijiang commented Mar 20, 2024

By examining the run log, it seems the following things happened (on one side of ucx_perftest; the other side is not listed) to eventually cause the problem:

  1. After trying to connect to the peer and receiving the peer's connect request, the two ifaces hold the following fds and eps:

     • iface of active lane 0 (192.168.200.2):

           connect fd:14   192.168.100.2:45664 -> 192.168.100.1:35323    ep: 0xc2e240
           accept  fd:16   192.168.200.2:58575 -> 192.168.200.1:55026    ep: 0x7f5fec000ba0

       Note: iface 200.2 tries to connect to 100.1 because uct_tcp_iface_is_reachable() does not do a real reachability check, and the OS routing module picks 100.2 as the source address, which makes the connection succeed.

     • iface of active lane 1 (192.168.100.2):

           connect fd:15   192.168.200.2:59820 -> 192.168.200.1:60177    ep: 0xc2e2f0
           accept  fd:17   192.168.100.2:39369 -> 192.168.100.1:60170    ep: 0x7f5fec000c50

       Note: iface 100.2 tries to connect to 200.1 for the same reason, and the OS routing module picks 200.2 as the source address, which makes the connection succeed.

  2. Then, after some connection-matching work, uct_tcp_cm_handle_simult_conn() resolves the duplicate connections (illustrated by the sketch below):
     Since ep 0xc2e240's peer addr (100.1) is smaller than the ep's iface addr (200.2), uct_tcp_cm_simult_conn_accept_remote_conn() is called and accept fd:16 replaces connect fd:14.
     Since ep 0xc2e2f0's peer addr (200.1) is larger than the ep's iface addr (100.2), uct_tcp_cm_simult_conn_accept_remote_conn() is NOT called and connect fd:15 replaces accept fd:17.

  3. So in the end the two am_bw lanes keep two fds (15 and 16), and both of them use 200.2 as the source address to talk to the peer. I believe this is why only one NIC is used for sending even though am_zcopy_first and am_zcopy_middle are both triggered.
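
To make the tie-break in step 2 easier to follow, here is a simplified illustration of the comparison described above (a sketch, not UCX's actual uct_tcp_cm_handle_simult_conn() code): the accepted connection survives only when the peer address compares lower than the local iface address; otherwise the locally initiated connection survives.

```
#include <netinet/in.h>
#include <string.h>

/* Returns nonzero when the accepted (remote-initiated) connection should
 * replace the locally initiated one: per the analysis above, this happens
 * only when the peer address is "smaller" than the local iface address. */
static int accept_remote_conn_wins(const struct sockaddr_in *iface_addr,
                                   const struct sockaddr_in *peer_addr)
{
    /* IPv4 addresses are in network byte order, so a byte-wise comparison
     * matches the numeric comparison of the dotted-quad values. */
    return memcmp(&peer_addr->sin_addr, &iface_addr->sin_addr,
                  sizeof(peer_addr->sin_addr)) < 0;
}
```

Plugging in the addresses from the log: ep 0xc2e240 (iface 200.2, peer 100.1) keeps the accepted fd 16, ep 0xc2e2f0 (iface 100.2, peer 200.1) keeps the connect fd 15, and both surviving sockets carry 200.2 as their source address.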

I do not know how to solve this. If the above analysis is right, some current ideas are:
a) Make uct_tcp_iface_is_reachable() do a real reachability check? This seems to be the wrong way, because routing can still establish a full connection.
b) Invoke bind() on the local iface address before trying to connect to the peer (a standalone sketch follows this list)? This may make connect() use the right source address, but I don't know whether it can cause the connect to fail, nor how to deal with such a failure.
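
For completeness, a minimal standalone sketch of idea (b) using plain POSIX sockets, outside of UCX (the function name and address arguments are placeholders, not existing UCX code): binding the client socket to the intended interface address with an ephemeral port before connect() pins the source address instead of leaving the choice to the kernel's routing decision.

```
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect to peer_ip:peer_port, forcing local_ip as the source address. */
static int connect_from(const char *local_ip, const char *peer_ip,
                        uint16_t peer_port)
{
    struct sockaddr_in local = {0}, peer = {0};
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0) {
        return -1;
    }

    local.sin_family = AF_INET;
    local.sin_port   = 0;                      /* ephemeral local port */
    inet_pton(AF_INET, local_ip, &local.sin_addr);

    /* Pin the source address so the traffic leaves through the intended NIC
     * rather than whichever one the routing table would pick. */
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
        perror("bind");
        close(fd);
        return -1;
    }

    peer.sin_family = AF_INET;
    peer.sin_port   = htons(peer_port);
    inet_pton(AF_INET, peer_ip, &peer.sin_addr);

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        close(fd);
        return -1;
    }
    return fd;
}
```

Whether a bind()/connect() pair like this can fail in configurations where routing would have chosen a different interface depends on the host's routing and ARP settings, which is exactly the open question in (b) (see the arp_filter note later in this thread).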

@huzhijiang

Confirmed: by adding the following code right before the endpoint connects to the peer (idea (b) above):

```
/* Bind the connecting socket to the iface's own address (with an ephemeral
 * port) so the OS routing decision cannot substitute another NIC's address. */
struct sockaddr_storage bind_addr = iface->config.ifaddr;
size_t addr_len;

ucs_sockaddr_set_port((struct sockaddr*)&bind_addr, 0);
ucs_sockaddr_sizeof((struct sockaddr*)&bind_addr, &addr_len);
bind(ep->fd, (struct sockaddr*)&bind_addr, addr_len);
```

the problem seems to be solved; both NICs are now sending out traffic to the peer. P.S. the traffic rate is 1.8 to 1.4.

@huzhijiang

huzhijiang commented Apr 1, 2024


The bind() trick above needs arp_filter=0 or 2, so for those who really need arp_filter=1 it is not a solution. So this issue seems to be a bug?

evgeny-leksikov linked a pull request Apr 22, 2024 that will close this issue