Distributed Data Parallel (DDP) Training on PyTorch with AMD GPUs (ROCm) and RCCL test hangs #1129
Comments
Please re-run. Also, can you share the ROCm kernel version installed on your machine?
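For reference, a sketch of commands commonly used to gather this information on a ROCm system (these specific commands are an assumption, not necessarily the ones originally requested):

```bash
# Kernel version of the host
uname -a
# Installed ROCm release (conventional ROCm install location)
cat /opt/rocm/.info/version
```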
Hi @nileshnegi, thanks for looking into this. The RCCL version:
The ROCm kernel version:
Hi @nileshnegi, I have additional info. I was looking at this issue from the PyTorch side and found that there was actually an issue with applying `GPU_MAX_HW_QUEUES=1`.
After setting this environment variable, the test works!
The question now is: why do I need to set `GPU_MAX_HW_QUEUES=1` at all?
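As a concrete illustration, a minimal sketch of applying this workaround to both the RCCL test and a two-GPU DDP launch (`train.py` is a placeholder script name, not from the thread):

```bash
# Limit each GPU to a single hardware queue (the workaround discussed above)
GPU_MAX_HW_QUEUES=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2

# Same workaround applied to a two-GPU PyTorch DDP launch
GPU_MAX_HW_QUEUES=1 torchrun --nproc_per_node=2 train.py
```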
Additional info 2:
This shows:
I checked, but I don't have `librccl-net.so` on my system. How do I install it?
`librccl-net.so` is an optional plugin for handling network transports that aren't natively supported (IB/sockets), such as libFabric. It's not really an issue in this case.
This is likely due to trying to use P2P communication on machines with PCIe-connected GPUs. Could you try the test again with `HSA_FORCE_FINE_GRAIN_PCIE=1`?
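A sketch of the suggested retry, reusing the same rccl-tests command from the issue body:

```bash
# Force fine-grained PCIe memory for P2P transfers, then re-run the benchmark
HSA_FORCE_FINE_GRAIN_PCIE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
```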
Can you check if you have large BAR enabled in your system? We expect a 64-bit region 0 for each GPU.
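One common way to check (an assumption on my part; the exact command from the original comment was not preserved) is to inspect the PCI BARs of the AMD GPUs:

```bash
# 1002 is AMD's PCI vendor ID; look for "Region 0: Memory at ... (64-bit, prefetchable)"
# with a size matching the GPU's VRAM (e.g. [size=32G] on an MI100)
sudo lspci -vv -d 1002: | grep -E "Region 0|Memory at"
```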
Hi @gilbertlee-amd, |
Hi @wenkaidu, yes I think it is a 64-bit region:
Can you try adding `iommu=pt` to the kernel command line to remove this warning?
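A sketch of how that kernel parameter is typically added on Ubuntu (a generic GRUB edit, not the exact steps from the thread):

```bash
# Append iommu=pt to the default kernel command line, then regenerate the
# GRUB config and reboot. Review /etc/default/grub manually before and after.
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 iommu=pt"/' /etc/default/grub
sudo update-grub
sudo reboot
```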
Hi @wenkaidu, I added `iommu=pt` to the kernel command line and rebooted.
Now, the RCCL test works!
Not sure if this is related, but the bandwidth is also much higher (see previous results)... I also checked my PyTorch training scripts, and they work now, without adding `GPU_MAX_HW_QUEUES=1`. So, why does this work? What is the `pt` value, and what does `iommu=pt` actually do?
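A sketch of how the active setting can be verified after reboot (standard Linux commands, not taken from the thread):

```bash
# Confirm the parameter made it onto the running kernel's command line
tr ' ' '\n' < /proc/cmdline | grep iommu

# Check how the kernel configured the IOMMU at boot (AMD IOMMU logs as AMD-Vi)
sudo dmesg | grep -i -e iommu -e amd-vi
```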
I am glad to hear the issue has been fixed!
PS: Is this something that needs to be added to the ROCm documentation? Or is it something very specific to my system? Do others not need to add this kernel boot parameter?
@wenkaidu, all, thanks for your help! I see in the link above that `iommu=pt` is recommended; does this apply to all systems?
Yes, the IOMMU setting is applicable to all systems. I agree this needs to be clearly called out in the ROCm documentation.
I have some more feedback about the documentation, maybe you can relay this? This week I have been building a custom server with an AMD Ryzen Threadripper PRO 5975WX and two AMD MI100 GPUs (to be extended to 6x MI100 next week). I expected it to be harder to set up the AMD GPUs on Ubuntu 22.04 for training with PyTorch, so that's a good sign. However, I had the following documentation-related issues:
Thanks for the feedback. Will do!
I can confirm this problem also exists in ROCm 5.6.1, within a VM running Ubuntu 22.04.4.
@Trat8547 Can you please try ROCm 6.1, which has recently been released?
Sorry for the late response. The 100% GPU stall issue still persists in both ROCm 6.1 and 6.1.1, with both the 5.19 and 6.5 kernels. `iommu=pt` doesn't seem to have any effect, and the `GPU_MAX_HW_QUEUES=1` trick also doesn't help. I'm running under Xen virtualization with the following kernel parameter set: `GRUB_CMDLINE_XEN_DEFAULT="iommu=1,pass-through"`. Just to be thorough, I've also set `GRUB_CMDLINE_LINUX="iommu=pt"`, but this doesn't fix the stall issue either.
I think there should be no IOMMU setting for a kernel running inside a VM. Can you try removing it? Also, what happens when testing on bare metal? Does it also hang?
Problem Description
I have an Ubuntu 22.04 machine with two AMD MI100 GPUs installed. When trying to run a PyTorch training script using DDP and `backend="nccl"` (which under the hood should use `rccl`), the script hangs with GPU use at 100%, without the expected GPU temperature buildup.

At first I thought it was related to my PyTorch installation, but when I tried the `all_reduce_perf` test of `rccl-tests` I observed the same behaviour: the script hangs with GPU use at 100%, without the expected GPU temperature buildup.

Output of `all_reduce_perf`:

Nothing happens after this initial output.
Output of `rocm-smi`:

The version of rccl I have installed:
Am I missing anything in my installation?
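For reference, a sketch of how this state is commonly inspected on a ROCm box (generic commands assuming a default package-based install; the original outputs above were not preserved):

```bash
# Installed RCCL version (Debian/Ubuntu package query)
dpkg -l | grep rccl

# GPU utilization and temperature; during the hang this shows ~100% use
# with no corresponding temperature rise
watch -n 1 /opt/rocm/bin/rocm-smi
```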
Operating System
Ubuntu 22.04.4 LTS (Jammy Jellyfish)
CPU
AMD Ryzen Threadripper PRO 5975WX 32-Cores
GPU
AMD Instinct MI100
ROCm Version
ROCm 6.0.0
ROCm Component
rccl
Steps to Reproduce
Run `./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2` from the `rccl-tests` repo; it should perform the `all_reduce` test without blocking.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
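A sketch of building the test binary first (assuming the CMake flow described in the rccl-tests README; adjust paths for your ROCm install):

```bash
git clone https://github.com/ROCm/rccl-tests.git
cd rccl-tests
mkdir build && cd build
cmake -DCMAKE_PREFIX_PATH=/opt/rocm ..
make -j"$(nproc)"
cd ..
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
```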
Additional Information