
[BUG] NCCL 2.20.5 hits "Message truncated : received 1024 bytes instead of 256" error while 2.18.5 does not #1273

Open
shh2000 opened this issue Apr 30, 2024 · 8 comments

Comments

@shh2000

shh2000 commented Apr 30, 2024

env

  • env1: 3 HGX-H100 nodes with 24 GPUs in total. Same bare-metal hardware & environment, with NVIDIA driver 535.129.03
  • env2: 3 HGX-A100 nodes with 24 GPUs in total. Same bare-metal hardware & environment, with NVIDIA driver 470.141.10

test code

import torch
import torch.distributed as dist

if __name__ == "__main__":

    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # new_group() must be called by every rank, even ranks outside [13, 15, 17, 19, 21]
    group = dist.new_group([13, 15, 17, 19, 21])
    print(rank)
    print(world_size)

    a = torch.tensor([1, 2]).to(rank % 8)
    print(a)
    dist.all_reduce(a, group=group, op=dist.ReduceOp.SUM)

    # dist.all_reduce(a, op=dist.ReduceOp.SUM)
    dist.barrier()

    dist.barrier()
    print(a)
    dist.destroy_process_group()

result

1. env1 with ngctorch 24.03 (NCCL 2.20.5) failed with "Message truncated : received 1024 bytes instead of 256"
2. env1 with ngctorch 23.09 (NCCL 2.18.5) passed
3. env2 with ngctorch 24.03 passed
4. env2 with ngctorch 23.09 passed
@sjeaugey
Member

Can you provide the log with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=GRAPH,INIT,ENV for the run on env1 which failed?
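(For anyone reproducing this, one possible way to make sure the requested debug variables reach every rank is to set them in the script itself before init_process_group; exporting them in each node's launch environment works just as well. This is only a sketch, and the NCCL_DEBUG_FILE line is an optional extra for per-host log files.)

import os
import torch.distributed as dist

# Sketch: set the debug variables in-process before NCCL is initialized,
# so they apply even if the launcher does not forward the shell environment.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "GRAPH,INIT,ENV")
# Optional: write one log file per host/PID instead of mixing everything on stdout.
os.environ.setdefault("NCCL_DEBUG_FILE", "/tmp/nccl.%h.%p.log")

dist.init_process_group(backend="nccl")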

@shh2000
Author

shh2000 commented Apr 30, 2024

Additional information:

  1. When using 4 HGX-H100 nodes with 32 GPUs in total, all NCCL versions passed.
  2. When using 2 or 4 HGX-H100 nodes to train 70B/45B/xxxB LLMs with Megatron core_v0.6.0, training passed; MFU = 45%, which is reasonable for Megatron-Core dense-LLM training on HGX-H100.
  3. Each HGX-H100 node passed nccl-tests individually (the all_reduce_perf results are reasonable; a rough PyTorch-level analogue is sketched below).
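(A rough PyTorch-level analogue of that per-node sanity check, for completeness. This is only a sketch, not a replacement for nccl-tests: the payload size and iteration counts are arbitrary, and it assumes 8 GPUs per node.)

import time
import torch
import torch.distributed as dist

# Sketch of an in-framework all_reduce bandwidth sanity check
# (assumes 8 GPUs per node; sizes/iterations are illustrative).
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % 8)

nbytes = 256 * 1024 * 1024                     # 256 MiB payload
x = torch.zeros(nbytes // 4, device="cuda")    # float32 elements

for _ in range(5):                             # warm-up
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.time() - start) / iters

if rank == 0:
    # Algorithmic bandwidth = bytes moved / time; bus bandwidth differs by a ring factor.
    print(f"all_reduce {nbytes / 2**20:.0f} MiB: {elapsed * 1e3:.2f} ms, "
          f"{nbytes / elapsed / 1e9:.1f} GB/s algbw")
dist.destroy_process_group()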

@shh2000
Author

shh2000 commented Apr 30, 2024

Can you provide the log with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=GRAPH,INIT,ENV for the run on env1 which failed?

Sure, I'll run with those environment variables and provide the logs, thanks.

@shh2000
Author

shh2000 commented Apr 30, 2024

@sjeaugey here's the log of node rank 1 (ranks 8-15).
noderank1_log.txt

@kiskra-nvidia
Member

We've fixed a similar-looking bug in NCCL 2.21; can you try with the latest version?
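(One quick way to confirm which NCCL version a given container's PyTorch is actually built against, before and after any upgrade; a sketch, and the exact return format of the version call may vary across PyTorch releases.)

import torch

# torch.cuda.nccl.version() typically returns a (major, minor, patch) tuple.
print("torch:", torch.__version__)
print("nccl:", ".".join(str(v) for v in torch.cuda.nccl.version()))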

@shh2000
Author

shh2000 commented May 6, 2024

@kiskra-nvidia Thanks for the information. We may try ngctorch 24.04 or some other way to upgrade to NCCL 2.21+.

By the way, are there any publicly disclosable details about the NCCL 2.20 bug? (Not to resolve the problem itself, just out of curiosity about the technology involved.) My guess is that maybe 2.20 picked wrong MPI paths in some cases (drivers + nnodes + topology)? Looking forward to your reply, thanks!

@sjeaugey
Member

sjeaugey commented May 6, 2024

Actually I'm not sure upgrading will help. The bug was a mix-up of the connect with the following barrier, and the barrier size was 8 bytes. Here all your sizes are larger than 8 bytes.

The log you provided only shows one node. Could it be that your environment was not forwarded to the other nodes? That would also explain the crash, as the other node might have a different configuration, ending up in a mismatch and a discrepancy in the sizes we're trying to exchange.
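(A minimal sketch to test that hypothesis: have every launched process print the NCCL-related environment it actually sees, so per-node differences become visible. It assumes a torchrun-style launcher that sets RANK.)

import os
import socket

# Each process reports its own view of the NCCL_* environment
# (RANK is set by torchrun-style launchers; adjust if using another launcher).
rank = os.environ.get("RANK", "?")
nccl_env = {k: v for k, v in sorted(os.environ.items()) if k.startswith("NCCL_")}
print(f"rank {rank} on {socket.gethostname()}: {nccl_env}")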

@shh2000
Author

shh2000 commented May 7, 2024

@sjeaugey Hi, my 3 nodes have the same bare-metal config (8× H100 + 4 activated HDR NICs out of 8 + 2 CPUs + PCIe Gen5), and the containers are run from the same image (ngctorch 24.03 + Megatron core_v0.6.0). If your guess is right, could my bug be reproduced by testing P2P between every pair of ranks (C(24,2) = 24×23/2 = 276 cases)? And if all 276 P2P communications are OK, could the bug still show up when a specific set of 5 ranks does an all-reduce?
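(A possible sketch of the pairwise test described above, assuming 24 ranks with 8 GPUs per node. Creating 276 extra communicators is slow and memory-hungry, so this is only an illustration of the idea, not a recommended diagnostic.)

import itertools
import torch
import torch.distributed as dist

# Run a tiny all_reduce inside every 2-rank sub-group to see whether any
# particular pair of ranks misbehaves (assumes 8 GPUs per node).
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
device = rank % 8
torch.cuda.set_device(device)

for i, j in itertools.combinations(range(world_size), 2):
    # new_group() must be called by every rank, even ranks outside the pair.
    pair = dist.new_group([i, j])
    if rank in (i, j):
        t = torch.ones(2, device=device)
        dist.all_reduce(t, group=pair, op=dist.ReduceOp.SUM)
        torch.cuda.synchronize()
        if rank == i:
            print(f"pair ({i}, {j}) ok: {t.tolist()}")
    dist.barrier()  # keep all ranks in lockstep between pairs

dist.destroy_process_group()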
