
[BUG] NCCL 2.20.5 hits "Message truncated : received 1024 bytes instead of 256" error while 2.18.5 does not #1273

Open
shh2000 opened this issue Apr 30, 2024 · 8 comments

Comments

@shh2000

shh2000 commented Apr 30, 2024

env

  • env1: 3 HGX-H100 nodes with 24 GPUs in total. Same bare-metal hardware & environment, with NVIDIA driver 535.129.03
  • env2: 3 HGX-A100 nodes with 24 GPUs in total. Same bare-metal hardware & environment, with NVIDIA driver 470.141.10

test code

import torch
import torch.distributed as dist

if __name__ == "__main__":

    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # new_group() must be called by every rank, even ranks outside [13, 15, 17, 19, 21]
    group = dist.new_group([13, 15, 17, 19, 21])
    print(rank)
    print(world_size)

    a = torch.tensor([1, 2]).to(rank % 8)
    print(a)
    dist.all_reduce(a, group=group, op=dist.ReduceOp.SUM)

    # dist.all_reduce(a, op=dist.ReduceOp.SUM)
    dist.barrier()

    dist.barrier()
    print(a)
    dist.destroy_process_group()

result

1. env1 with ngctorch 24.03 (NCCL 2.20.5) failed with "Message truncated : received 1024 bytes instead of 256"
2. env1 with ngctorch 23.09 (NCCL 2.18.5) passed
3. env2 with ngctorch 24.03 passed
4. env2 with ngctorch 23.09 passed
@sjeaugey
Member

Can you provide the log with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=GRAPH,INIT,ENV for the run on env1 which failed?
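(For anyone reproducing this, one possible way to make sure the requested debug variables reach every rank is to set them in the script itself before init_process_group; exporting them in each node's launch environment works just as well. This is only a sketch, and the NCCL_DEBUG_FILE line is an optional extra for per-host log files.)

import os
import torch.distributed as dist

# Sketch: set the debug variables in-process before NCCL is initialized,
# so they apply even if the launcher does not forward the shell environment.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "GRAPH,INIT,ENV")
# Optional: write one log file per host/PID instead of mixing everything on stdout.
os.environ.setdefault("NCCL_DEBUG_FILE", "/tmp/nccl.%h.%p.log")

dist.init_process_group(backend="nccl")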

@shh2000
Author

shh2000 commented Apr 30, 2024

Additional information:

  1. When using 4 HGX-H100 nodes with 32 GPUs in total, all NCCL versions passed.
  2. When using 2 or 4 HGX-H100 nodes to train 70B/45B/xxxB LLMs with Megatron core_v0.6.0, training passed; MFU = 45%, which is reasonable for Megatron-Core dense-LLM training on HGX-H100.
  3. Each HGX-H100 node passed nccl-tests individually (the all_reduce_perf results are reasonable; a rough PyTorch-level analogue is sketched below).
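(A rough PyTorch-level analogue of that per-node sanity check, for completeness. This is only a sketch, not a replacement for nccl-tests: the payload size and iteration counts are arbitrary, and it assumes 8 GPUs per node.)

import time
import torch
import torch.distributed as dist

# Sketch of an in-framework all_reduce bandwidth sanity check
# (assumes 8 GPUs per node; sizes/iterations are illustrative).
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % 8)

nbytes = 256 * 1024 * 1024                     # 256 MiB payload
x = torch.zeros(nbytes // 4, device="cuda")    # float32 elements

for _ in range(5):                             # warm-up
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.time() - start) / iters

if rank == 0:
    # Algorithmic bandwidth = bytes moved / time; bus bandwidth differs by a ring factor.
    print(f"all_reduce {nbytes / 2**20:.0f} MiB: {elapsed * 1e3:.2f} ms, "
          f"{nbytes / elapsed / 1e9:.1f} GB/s algbw")
dist.destroy_process_group()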

@shh2000
Author

shh2000 commented Apr 30, 2024

Can you provide the log with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=GRAPH,INIT,ENV for the run on env1 which failed?

Sure, I'll run with those environment variables and provide the logs, thanks.

@shh2000
Author

shh2000 commented Apr 30, 2024

@sjeaugey here's the log of node rank 1 (ranks 8-15).
noderank1_log.txt

@kiskra-nvidia
Member

We've fixed a similar-looking bug in NCCL 2.21; can you try with the latest version?
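(One quick way to confirm which NCCL version a given container's PyTorch is actually built against, before and after any upgrade; a sketch, and the exact return format of the version call may vary across PyTorch releases.)

import torch

# torch.cuda.nccl.version() typically returns a (major, minor, patch) tuple.
print("torch:", torch.__version__)
print("nccl:", ".".join(str(v) for v in torch.cuda.nccl.version()))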

@shh2000
Author

shh2000 commented May 6, 2024

@kiskra-nvidia Thanks for the information. We may try ngctorch 24.04 or some other way to upgrade to NCCL 2.21+.

By the way, are there any publicly disclosable details about the NCCL 2.20 bug? (Not to resolve the problem itself, just out of curiosity about the technology involved.) My guess is that maybe 2.20 picked wrong MPI paths in some cases (drivers + nnodes + topology)? Looking forward to your reply, thanks!

@sjeaugey
Member

sjeaugey commented May 6, 2024

Actually I'm not sure upgrading will help. The bug was a mix-up of the connect with the following barrier, and the barrier size was 8 bytes. Here all your sizes are larger than 8 bytes.

The log you provided only shows one node. Could it be that your environment was not forwarded to the other nodes? That would also explain the crash, as the other node might have a different configuration, ending up in a mismatch and a discrepancy in the sizes we're trying to exchange.
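(A minimal sketch to test that hypothesis: have every launched process print the NCCL-related environment it actually sees, so per-node differences become visible. It assumes a torchrun-style launcher that sets RANK.)

import os
import socket

# Each process reports its own view of the NCCL_* environment
# (RANK is set by torchrun-style launchers; adjust if using another launcher).
rank = os.environ.get("RANK", "?")
nccl_env = {k: v for k, v in sorted(os.environ.items()) if k.startswith("NCCL_")}
print(f"rank {rank} on {socket.gethostname()}: {nccl_env}")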

@shh2000
Author

shh2000 commented May 7, 2024

@sjeaugey Hi, my 3 nodes have the same bare-metal config (8× H100 + 4 activated HDR NICs out of 8 + 2 CPUs + PCIe Gen5), and the containers are run from the same image (ngctorch 24.03 + Megatron core_v0.6.0). If your guess is right, could my bug be reproduced by testing P2P between every pair of ranks (C(24,2) = 24×23/2 = 276 cases)? And if all 276 P2P communications are OK, could the bug still show up when a specific set of 5 ranks does an all-reduce?
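(A possible sketch of the pairwise test described above, assuming 24 ranks with 8 GPUs per node. Creating 276 extra communicators is slow and memory-hungry, so this is only an illustration of the idea, not a recommended diagnostic.)

import itertools
import torch
import torch.distributed as dist

# Run a tiny all_reduce inside every 2-rank sub-group to see whether any
# particular pair of ranks misbehaves (assumes 8 GPUs per node).
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
device = rank % 8
torch.cuda.set_device(device)

for i, j in itertools.combinations(range(world_size), 2):
    # new_group() must be called by every rank, even ranks outside the pair.
    pair = dist.new_group([i, j])
    if rank in (i, j):
        t = torch.ones(2, device=device)
        dist.all_reduce(t, group=pair, op=dist.ReduceOp.SUM)
        torch.cuda.synchronize()
        if rank == i:
            print(f"pair ({i}, {j}) ok: {t.tolist()}")
    dist.barrier()  # keep all ranks in lockstep between pairs

dist.destroy_process_group()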
