RuntimeError: HIP error when running ResNet-50 on AMD PRO W7900 GPU with PyTorch #1398

Closed
liangyong928 opened this issue Apr 23, 2024 · 4 comments


@liangyong928

The following code runs normally on an AMD PRO W7900 GPU:

import torch
device = torch.device("cuda")
x = torch.randn(128,10,224,224).to(device)
model = torch.nn.Conv2d(10, 64, 5).to(device)
output = model(x)
print(output.device)

However, when running the code below, I encounter an error:

import torch
import torchvision.models as models
from torchvision.models import ResNet50_Weights
x_large = torch.randn(128, 3, 224, 224)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#device = torch.device("cpu")
weights = ResNet50_Weights.IMAGENET1K_V1
model = models.resnet50(weights=weights).to(device)
model.eval()
x_large = x_large.to(device)
output = model(x_large)
print(output.device)

The error message is as follows:

Traceback (most recent call last):
  File "/root/test/testnew2.py", line 11, in <module>
    output = model(x_large)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/usr/local/lib/python3.10/dist-packages/torchvision/models/resnet.py", line 278, in _forward_impl
    x = self.avgpool(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/pooling.py", line 1194, in forward
    return F.adaptive_avg_pool2d(input, self.output_size)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 1228, in adaptive_avg_pool2d
    return torch._C._nn.adaptive_avg_pool2d(input, _output_size)
RuntimeError: HIP error: the operation cannot be performed in the present state
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

Why does the first block of code run without any issues while the second block throws an error when using the AMD PRO W7900 GPU for computation? I would appreciate any insights or suggestions for resolving this issue.
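
A minimal sketch that isolates the operation named in the traceback (assuming ResNet-50's standard [N, 2048, 7, 7] feature map for 224x224 inputs); if this also fails, the problem is in the pooling kernel rather than in the model code:

import torch
import torch.nn.functional as F

device = torch.device("cuda")
# ResNet-50 reduces a 224x224 input to a [N, 2048, 7, 7] feature map,
# which is the tensor shape reaching the avgpool layer in the traceback.
feat = torch.randn(128, 2048, 7, 7, device=device)
out = F.adaptive_avg_pool2d(feat, (1, 1))
torch.cuda.synchronize()  # force any asynchronous HIP error to surface here
print(out.shape, out.device)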

@briansp2020

What versions of ROCm and PyTorch are you using? It helps to provide as much information as possible when asking for help.
I just tried it with ROCm 6.1 and a PyTorch nightly build on a 7900XTX, and it seems fine...
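
A minimal sketch (assuming a ROCm build of PyTorch, where torch.version.hip is set) that prints the relevant version information:

import torch

# Report the PyTorch build, the HIP runtime it was built against,
# and the GPU it sees.
print("torch:", torch.__version__)
print("HIP:", torch.version.hip)  # None on non-ROCm builds
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))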

@liangyong928 (Author)

@briansp2020
I installed PyTorch following the "Install PyTorch via PIP" section of https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/install-pytorch.html with these commands:

wget https://repo.radeon.com/rocm/manylinux/rocm-rel-6.0.2/torch-2.1.2+rocm6.0-cp310-cp310-linux_x86_64.whl
wget https://repo.radeon.com/rocm/manylinux/rocm-rel-6.0.2/torchvision-0.16.1+rocm6.0-cp310-cp310-linux_x86_64.whl
pip3 install --force-reinstall torch-2.1.2+rocm6.0-cp310-cp310-linux_x86_64.whl torchvision-0.16.1+rocm6.0-cp310-cp310-linux_x86_64.whl

The versions of Python and PyTorch are as follows:

root@yong-System:/home/yong# python3
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.1.2+rocm6.0'

For ROCm, I followed the "Option A: Graphics usecase" section of https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/install-radeon.html# and ran:

sudo apt update
wget https://repo.radeon.com/amdgpu-install/23.40.2/ubuntu/jammy/amdgpu-install_6.0.60002-1_all.deb
sudo apt install ./amdgpu-install_6.0.60002-1_all.deb
sudo amdgpu-install -y --usecase=graphics,rocm
sudo usermod -a -G render,video $LOGNAME

After installation, the ROCm version is 6.0.2:

root@yong-System:/home/yong# ls /opt/rocm
rocm/       rocm-6.0.2/ 
root@yong-System:/home/yong# ls /opt/rocm-6.0.2/
amdgcn  bin  bin.bak  include  lib  libexec  llvm  share

@briansp2020

I'd try the newly released PyTorch 2.3:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0

or the nightly version

pip3 install --pre --force-reinstall torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0
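
After reinstalling, a quick check along these lines (a minimal sketch; weights=None just skips the ImageNet weight download, and the small batch keeps it fast) would confirm the new build runs the avgpool path that previously failed:

import torch
import torchvision.models as models

# Confirm the upgraded build and the HIP runtime it was built against.
print(torch.__version__, torch.version.hip)

device = torch.device("cuda")
# weights=None avoids downloading the ImageNet checkpoint; random weights
# are enough to exercise the convolution and adaptive_avg_pool2d kernels.
model = models.resnet50(weights=None).to(device).eval()
x = torch.randn(8, 3, 224, 224, device=device)
with torch.no_grad():
    out = model(x)
torch.cuda.synchronize()
print(out.shape, out.device)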

@liangyong928 (Author)

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0

Thanks, the issue has been resolved.
