Why NVLS not available on 4 or more machines? #1031

holmes313 · 2023-10-21T15:42:43Z

I enabled NVLS on 4 H100 servers, each with 8 GPUs, but I see no acceleration: the all_reduce bus bandwidth (busbw) is 180 GB/s.
When I tested NVLS on 2 servers, the all_reduce bus bandwidth was 310 GB/s.
Why is NVLS not available on 4 or more machines?
Thanks.

AddyLaddy · 2023-10-22T01:37:47Z

NVLS is an acronym for NVLink SHARP.
NVLink SHARP is an AllReduce offload system over NVLink only.
Currently NVLink is only supported within a single node (e.g. 8 or 16 GPUs).
Scaling beyond a single node requires IB (InfiniBand) or RoCE.
There is a similar AllReduce offload system for IB networks, which we now call IB SHARP.

The performance reported by the NCCL tests on 2 nodes can be misleading when comparing it to >2 nodes unless you set NCCL_ALGO=RING when testing on 2 nodes.
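
For readers comparing these numbers: the bus bandwidth that nccl-tests prints for AllReduce is derived from the measured algorithm bandwidth with a correction factor of 2(n-1)/n, where n is the number of ranks. A minimal sketch of that conversion follows; the function name and the sample figures are illustrative and not taken from this thread.

```python
def allreduce_busbw(algbw_gbs: float, n_ranks: int) -> float:
    """Convert AllReduce algorithm bandwidth (GB/s) to bus bandwidth (GB/s).

    Uses the 2*(n-1)/n factor that nccl-tests documents for AllReduce.
    The function name is hypothetical; it is not part of NCCL or nccl-tests.
    """
    if n_ranks < 2:
        raise ValueError("AllReduce bus bandwidth needs at least 2 ranks")
    return algbw_gbs * 2 * (n_ranks - 1) / n_ranks


# Illustrative inputs only: 100 GB/s algorithm bandwidth on 16 ranks.
print(allreduce_busbw(100.0, 16))
```

The factor describes the traffic of a ring AllReduce, which is one reason numbers obtained with a different algorithm on 2 nodes are hard to compare directly with larger runs.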

How many NICs are there per node and what speed are they?
Is this an IB or RoCE network?
What is the node architecture?

Perhaps you can attach the NCCL_DEBUG=INFO log for us to examine.
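
One possible way to collect that log, and to apply the NCCL_ALGO=RING setting mentioned above for the 2-node comparison, is to build the environment in a small launcher script. This is only a sketch: the helper name is hypothetical, and how the environment reaches every rank depends on your launcher (mpirun, srun, and so on).

```python
import os


def nccl_test_env(base=None, force_ring=True):
    """Return a copy of the environment with NCCL debug settings applied.

    Hypothetical helper, not part of NCCL. Only sets the two variables
    named in this thread: NCCL_DEBUG and NCCL_ALGO.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"      # verbose log, as requested above
    if force_ring:
        env["NCCL_ALGO"] = "RING"   # makes 2-node results comparable to >2 nodes
    return env


env = nccl_test_env()
print(env["NCCL_DEBUG"], env.get("NCCL_ALGO"))
```

The returned dictionary can be passed as `env=` to `subprocess.run` together with whatever command you normally use to start the test.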

holmes313 · 2023-10-24T03:09:51Z

Thanks for your reply. I tested again, and NVLS works with fewer than 4 nodes. But I have a new question:
Each node has 8 dual-port CX7 NICs, and each port runs at 200 Gb/s. The network is in RoCE mode. I found that nccl-tests hangs when running on 4 or more nodes with NCCL_NVLS_ENABLE=1 and NCCL_ALGO=NVLSTree.
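
For context when reading the bandwidth figures in this thread, the NIC configuration described here can be converted to an aggregate per-node line rate. The sketch below is plain arithmetic under the assumption that all ports can be driven at full rate simultaneously, ignoring protocol overhead; the function name is hypothetical.

```python
def node_network_gbytes_per_s(nics: int, ports_per_nic: int, port_gbits: float) -> float:
    """Aggregate per-node network line rate in GB/s.

    Assumes every port runs at full rate at the same time and ignores
    protocol overhead, so this is an upper bound, not a measurement.
    """
    total_gbits = nics * ports_per_nic * port_gbits
    return total_gbits / 8  # 8 bits per byte


# 8 dual-port NICs at 200 Gb/s per port, as described in this comment.
print(node_network_gbytes_per_s(8, 2, 200))  # 400.0
```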

holmes313 closed this as completed Oct 27, 2023
