Multicard training error with rocm5.7 #2494
Comments
RCCL seems to work well; the RCCL test runs correctly.
(1) Can you provide steps that we can reproduce?
For (1): after I removed the kernel parameter amd_iommu=off, the problem went away, apparently by chance. It appeared again after I set the DDP option gradient_as_bucket_view, and I cannot get back to the correct behavior.
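Since the behavior seems to depend on whether amd_iommu=off is active, it may help to confirm what the running kernel was actually booted with. The following is a minimal sketch, not part of the original report: the helper names `kernel_params` and `read_cmdline` are made up for illustration, and it assumes a Linux system where `/proc/cmdline` is readable. It does not handle quoted values that contain spaces.

```python
from pathlib import Path


def kernel_params(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}.

    Bare flags such as "quiet" map to None. Quoted values containing
    spaces are not handled in this sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_cmdline(path: str = "/proc/cmdline") -> str:
    # /proc/cmdline holds the parameters the running kernel was booted with.
    return Path(path).read_text().strip()
```

On the affected machine, `kernel_params(read_cmdline()).get("amd_iommu")` would be expected to return `"off"` while the parameter is set and `None` after it has been removed and the system rebooted.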
Here is the template Dockerfile:
Here is the test code:
The command I run is:
I am plagued by these errors using PyTorch with Stable Diffusion on my RX 7800 XT, including on 5.7.1. It will work fine for a bit, then my dmesg logs get spammed with the sq_intr errors shown above and my GPU resets. Is this a firmware issue? If not, is it likely to be fixed in ROCm 6.0?
I was able to run the test.py successfully using ROCm's pytorch-nightly Docker image, but it failed on other Docker images, including @sdli1995's Dockerfile. Will need to investigate further.
@sdli1995 Have you tested with PyTorch 2.1 and ROCm 5.7? I don't have access to a 7900XTX node at this time. I failed to run your test code with your Dockerfile on an MI250, and it also failed on older PyTorch/ROCm combinations. However, I am able to run your test code successfully with PyTorch 2.1 and ROCm 5.7, using the rocm/pytorch:latest Docker image on an MI210. Can you use
I ran test.py several times in the rocm/pytorch:latest Docker image, and it works correctly on the 7900XTX nodes.
@hongxiayang Additionally, I tried ROCm 5.7.1 with the 5.7.1 amdgpu-dkms driver, and it has worked well until now. Weird...
@sdli1995 Feel free to close this issue since it works now.
In ROCm 6.0 it appears again: [ 2746.403998] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:0 pasid:0, for process pid 0 thread pid 0)
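When dmesg is flooded with lines like the one above, pulling out the fields makes it easier to see which device and ring are involved. This is a rough sketch only: the regular expression is derived from the single line quoted in this thread, other amdgpu driver versions may format the message differently, and `parse_page_fault` is a hypothetical helper name.

```python
import re

# Pattern based only on the one page-fault line quoted in this thread.
PAGE_FAULT_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+amdgpu\s+(?P<pci>[0-9a-fA-F:.]+):\s+amdgpu:\s+"
    r"\[(?P<hub>\w+)\]\s+page fault\s+"
    r"\(src_id:(?P<src_id>\d+)\s+ring:(?P<ring>\d+)\s+"
    r"vmid:(?P<vmid>\d+)\s+pasid:(?P<pasid>\d+)"
)


def parse_page_fault(line: str):
    """Return the fields of an amdgpu page-fault line, or None if it does not match."""
    match = PAGE_FAULT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    result = {
        "time": float(fields["time"]),  # seconds since boot, as printed by dmesg
        "pci": fields["pci"],           # PCI address of the faulting device
        "hub": fields["hub"],
    }
    for key in ("src_id", "ring", "vmid", "pasid"):
        result[key] = int(fields[key])
    return result
```

Feeding each dmesg line through `parse_page_fault` and counting by `pci` would show whether one card or both are faulting.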
@sdli1995 Can you please test with latest ROCm 6.1.1 to see if it happens with this build? Thanks!
GPU Model: 7900XTX
OS and other system details:
CPU: AMD EPYC 7542
Motherboard: H12SSL-i
RAM: 8*32G
OS: Debian 12 with kernel 6.1.0.11
Kernel params: amd_iommu=off
amdgpu version: [amdgpu_5.7.50700-1652687.22.04_amd64.deb]
Describe your Problem Provide sufficient information to reproduce your problem. Explain why the current behavior is a concern.
When I use 2 GPUs to run a Transformer model with PyTorch 2.0.1, it crashes. After that the graphics cards are no longer usable, and the kernel dmesg shows a large number of errors. On the first reboot the system cannot detect the graphics cards; I need to reset the BIOS so that they are detected again.
The command is:
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=localhost:29400 task/asr/train.py
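For anyone trying to reproduce this: torchrun starts one worker process per GPU and passes each worker its position through environment variables such as RANK, LOCAL_RANK and WORLD_SIZE. The sketch below shows one way a worker could read them. It is an assumption-laden illustration, not the contents of task/asr/train.py (which is not shown in this issue), and `distributed_context` is a hypothetical helper name.

```python
import os


def distributed_context(env=None) -> dict:
    """Read the rank information that torchrun exports to each worker.

    Falls back to single-process defaults when the variables are absent,
    for example when the script is started directly with python.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global index of this worker
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index on this node, usually the GPU index
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of workers
    }
```

With --nnodes=1 --nproc_per_node=2 as in the command above, the two workers would be expected to see a world size of 2 and local ranks 0 and 1. Logging this at startup helps match each worker to the PCI device reported in dmesg.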
Output:
dmesg output