Pytorch not working with CUDA 11.2 and CUDA 11.7 #32

Open
TC-MCZ opened this issue Oct 12, 2023 · 4 comments
Labels
pytorch Issues related to pytorch


TC-MCZ commented Oct 12, 2023

Hi, I have some problems when running PyTorch with Cricket. I pulled the latest code and built PyTorch locally with the modifications mentioned in the docs.

My CUDA version is 11.2 and cuDNN is 8.9.2 on a Tesla P4, but I get this problem:

server:
+08:01:00.423212 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:00.445168 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:00.445403 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:00.447247 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:00.448076 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:07.164339 ERROR: cuda_device_prop_result size mismatch in cpu-server-runtime.c:367
+08:02:22.370950 INFO: RPC deinit requested.
+08:08:54.324012 INFO: have a nice day!
client:
`+08:00:36.417392 WARNING: could not find .nv.info section. This means this binary does not contain any kernels. in cpu-elf2.c:922
+08:00:36.418684 WARNING: could not find .nv.info section. This means this binary does not contain any kernels. in cpu-elf2.c:922
+08:00:36.420058 WARNING: could not find .nv.info section. This means this binary does not contain any kernels. in cpu-elf2.c:922
call failed: RPC: Timed out
call failed: RPC: Timed out
call failed: RPC: Timed out
+08:02:01.851255 ERROR: something went wrong in cpu-client-runtime.c:444
Traceback (most recent call last):
  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 242, in _lazy_init
    queued_call()
  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 125, in _check_capability
    capability = get_device_capability(d)
  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 357, in get_device_capability
    prop = get_device_properties(device)
  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 375, in get_device_properties
    return _get_device_properties(device) # type: ignore[name-defined]
RuntimeError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/lwh/cricket/tests/test_apps/pytorch_minimal.py", line 39, in <module>
    x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
  File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 246, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error:

CUDA call was originally invoked at:

[' File "/home/lwh/cricket/tests/test_apps/pytorch_minimal.py", line 31, in <module>\n import torch\n', ' File "", line 991, in _find_and_load\n', ' File "", line 975, in _find_and_load_unlocked\n', ' File "", line 671, in _load_unlocked\n', ' File "", line 843, in exec_module\n', ' File "", line 219, in _call_with_frames_removed\n', ' File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/__init__.py", line 798, in <module>\n _C._initExtension(manager_path())\n', ' File "", line 991, in _find_and_load\n', ' File "", line 975, in _find_and_load_unlocked\n', ' File "", line 671, in _load_unlocked\n', ' File "", line 843, in exec_module\n', ' File "", line 219, in _call_with_frames_removed\n', ' File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 179, in <module>\n _lazy_call(_check_capability)\n', ' File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/__init__.py", line 177, in _lazy_call\n _queued_calls.append((callable, traceback.format_stack()))\n']
+08:02:27.007890 ERROR: call failed. in cpu-client.c:213
+08:02:27.012036 INFO: api-call-cnt: 14
+08:02:27.012051 INFO: memcpy-cnt: 0`

Is my CUDA version wrong, or is there some other reason?

Originally posted by @Tlhaoge in #6 (comment)
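
When triaging logs like the ones above, it can help to pull out the earliest `ERROR` entry on each side before reading the Python traceback. The following is a minimal sketch, assuming every Cricket log line has the shape seen in this issue (`+HH:MM:SS.ffffff LEVEL: message in file.c:LINE`). The names `LOG_LINE`, `LogEntry`, `parse_log`, and `first_error` are hypothetical helpers, not part of Cricket.

```python
import re
from typing import List, NamedTuple, Optional

# Line shape ASSUMED from the logs quoted in this issue:
#   +HH:MM:SS.ffffff LEVEL: message in source-file.c:LINE
LOG_LINE = re.compile(
    r"^\+(?P<time>\d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<level>[A-Z]+): "
    r"(?P<message>.*?)"
    r"(?: in (?P<source>[\w.-]+):(?P<line>\d+))?$"
)


class LogEntry(NamedTuple):
    time: str
    level: str
    message: str
    source: Optional[str]
    line: Optional[int]


def parse_log(text: str) -> List[LogEntry]:
    """Return every line of `text` that matches the assumed log shape."""
    entries = []
    for raw in text.splitlines():
        # Tolerate stray backticks left over from the issue's markup.
        match = LOG_LINE.match(raw.strip().strip("`"))
        if match is None:
            continue  # traceback lines, "call failed: RPC: Timed out", etc.
        line = match.group("line")
        entries.append(LogEntry(
            time=match.group("time"),
            level=match.group("level"),
            message=match.group("message"),
            source=match.group("source"),
            line=int(line) if line is not None else None,
        ))
    return entries


def first_error(text: str) -> Optional[LogEntry]:
    """Return the earliest ERROR entry, or None if there is none."""
    errors = [e for e in parse_log(text) if e.level == "ERROR"]
    # Fixed-width timestamps sort correctly as plain strings.
    return min(errors, key=lambda e: e.time, default=None)


# Three lines copied from the server log in this issue.
SERVER_LOG = """\
+08:01:00.447247 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:07.164339 ERROR: cuda_device_prop_result size mismatch in cpu-server-runtime.c:367
+08:02:22.370950 INFO: RPC deinit requested.
"""

print(first_error(SERVER_LOG))
```

On the server log this picks out the `cuda_device_prop_result size mismatch` entry at 08:01:07, which precedes the client's first `ERROR` at 08:02:01. That ordering suggests starting the investigation on the server side.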

TC-MCZ closed this as completed Oct 12, 2023
TC-MCZ reopened this Oct 12, 2023
TC-MCZ changed the title from "HI, I have the same problem with cuda 11.7, how do I fix it?" to "@n-eiling, HI, I have the same problem with cuda 11.7, how do I fix it?" Oct 12, 2023
TC-MCZ changed the title from "@n-eiling, HI, I have the same problem with cuda 11.7, how do I fix it?" to "HI, I have the same problem with cuda 11.7, how do I fix it?" Oct 12, 2023
n-eiling changed the title from "HI, I have the same problem with cuda 11.7, how do I fix it?" to "Pytorch not working with CUDA 11.2 and CUDA 11.7" Dec 30, 2023
n-eiling added the "pytorch" (Issues related to pytorch) label Dec 30, 2023

leonardosul commented Jan 23, 2024

Encountering the same issue. Using CUDA 11.7 and cuDNN 8.7.0. Running on an AWS EC2 instance.

It would be really nice to have a GitHub workflow that builds and runs the RPC server and Docker container together to ensure that it works as described in the docs. Although this would require a GPU-enabled runner... probably not as easy as I imagined 🤔

n-eiling (Member) commented

There is a CI that tests Cricket with a GPU-enabled runner. There is no test for PyTorch yet, and yes, we should add one. However, I'm not surprised there are issues with PyTorch support. PyTorch is really complex and uses a lot of CUDA features in unusual ways that make testing pretty difficult.
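
A PyTorch smoke test of the kind discussed here might look roughly like the sketch below. It assumes the Cricket client is injected with `LD_PRELOAD` and locates the server through a `REMOTE_GPU_ADDRESS` environment variable; both names should be checked against the Cricket docs. The variables `CRICKET_TEST_SERVER` and `CRICKET_CLIENT_LIB`, the default library path, and all function names are hypothetical. Only the script path `tests/test_apps/pytorch_minimal.py` comes from the traceback in this issue.

```python
import os
import subprocess
import sys
from typing import Dict, Mapping, Optional


def cricket_env(server: str, client_lib: str,
                base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy `base` (default: os.environ) and add the Cricket client settings.

    ASSUMPTION: the client is injected via LD_PRELOAD and finds the server
    through REMOTE_GPU_ADDRESS. Verify both names against the Cricket docs.
    """
    env = dict(os.environ if base is None else base)
    env["LD_PRELOAD"] = client_lib
    env["REMOTE_GPU_ADDRESS"] = server
    return env


def run_under_cricket(script: str, server: str, client_lib: str,
                      timeout_s: float = 300.0) -> subprocess.CompletedProcess:
    """Run a Python script with the Cricket client preloaded.

    Raises subprocess.TimeoutExpired if the script hangs, for example on
    repeated RPC time-outs.
    """
    return subprocess.run(
        [sys.executable, script],
        env=cricket_env(server, client_lib),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )


def smoke_test_pytorch_minimal() -> None:
    # Hypothetical settings; a CI job would supply the real values.
    server = os.environ.get("CRICKET_TEST_SERVER", "localhost")
    client_lib = os.environ.get("CRICKET_CLIENT_LIB", "cpu/cricket-client.so")
    result = run_under_cricket(
        "tests/test_apps/pytorch_minimal.py", server, client_lib
    )
    # Surface the client log on failure so the first ERROR is visible in CI.
    assert result.returncode == 0, result.stderr
```

The explicit timeout matters because the failure reported in this issue shows up as a series of RPC time-outs before the error is raised. Without a timeout, a CI job could stall instead of failing.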

@RWTH-ACS RWTH-ACS deleted a comment from Cattacker Jan 24, 2024
leonardosul commented

@n-eiling Thanks for the reply! I can see that you use GitLab CI. I can have a look and see if I can write a workflow that tests PyTorch with Cricket.

Outside of that, how would you recommend I go about mapping the unusual ways that PyTorch uses CUDA? That might be a good place to start, I guess.
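
One possible starting point for that mapping, offered as a sketch and not as something the maintainers proposed in this thread, is to list the CUDA symbols that a PyTorch shared library imports. The code below assumes GNU `nm` is installed and that the library links the CUDA libraries dynamically; statically linked CUDA code would not show up. The prefix list in `CUDA_SYMBOL` and the function names are assumptions chosen for illustration.

```python
import re
import subprocess
from typing import List, Set

# ASSUMED naming pattern: a known CUDA library prefix followed by an
# uppercase letter, with optional leading underscores (e.g. __cudaRegister...).
CUDA_SYMBOL = re.compile(
    r"^_*(cuda|cudnn|cublas|cusparse|cusolver|cufft|curand|cu)[A-Z]"
)


def cuda_imports(nm_output: str) -> List[str]:
    """Extract CUDA-looking undefined symbols from `nm -D --undefined-only` output."""
    found: Set[str] = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # Undefined entries look like "U name" (or "w"/"v" for weak symbols).
        if len(parts) < 2 or parts[-2] not in ("U", "w", "v"):
            continue
        # Drop a symbol-version suffix such as "@libcudart.so.11.0".
        symbol = parts[-1].split("@", 1)[0]
        if CUDA_SYMBOL.match(symbol):
            found.add(symbol)
    return sorted(found)


def list_cuda_imports(library_path: str) -> List[str]:
    """Run GNU nm on a shared library and return its CUDA-looking imports."""
    output = subprocess.run(
        ["nm", "-D", "--undefined-only", library_path],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return cuda_imports(output)


# Hand-written sample in the shape nm prints, used here instead of a real library.
SAMPLE = """\
                 U cudaGetDeviceProperties@libcudart.so.11.0
                 U malloc@GLIBC_2.2.5
                 w __gmon_start__
                 U cuModuleLoadData
                 U curl_easy_init
"""

print(cuda_imports(SAMPLE))
```

This only shows which API entry points are imported, not how they are called. Comparing the resulting list against the calls Cricket already forwards would still narrow down where to look.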

leeyiding commented

Hello, I encountered the same problem when running pytorch_minimal.py with CUDA 11.8 and cuDNN 8.9. Does anyone have a solution now?
