stress:gpu crashes when compiled with CUDA but run without a device #641
Comments
There are 2 different failure modes:
I pushed some commits in PR #642 to handle the lack of a device more gracefully, both in the test and in the runtime system. However, the test still fails in the CI, since there is no way to fix that at the runtime/test level. The issue is with the CI logic here: serotonin is a machine that has both NVIDIA and ROCm installed at the software stack level, but only a ROCm device. The CMake logic defaults to NVIDIA in that case, and we should pass ROCm. I don't think we should complicate the CMake logic of the tests further here: they fail because we asked them to run on NVIDIA and there are no NVIDIA cards. Geri is working on preparing a PR on the CI to fix this issue at the CI logic level.
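As a rough illustration of the selection rule discussed in this comment (choose the backend for which a device is actually present, not merely the one whose software stack is installed), here is a minimal C sketch. All names (`gpu_backend_t`, `host_info_t`, `choose_backend`) are hypothetical and exist only for this example; they are not PaRSEC, CMake, or CI identifiers, and the actual fix is expected to live in the CI configuration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical backend identifiers, for illustration only. */
typedef enum { BACKEND_NONE, BACKEND_CUDA, BACKEND_ROCM } gpu_backend_t;

/* Hypothetical description of a CI host: which software stacks are
 * installed and how many devices of each kind are physically present. */
typedef struct {
    int cuda_sdk_installed;
    int rocm_sdk_installed;
    int cuda_device_count;
    int rocm_device_count;
} host_info_t;

/* Pick a backend only if both its software stack and at least one
 * matching device are present; otherwise fall through to the next one. */
gpu_backend_t choose_backend(const host_info_t *h)
{
    assert(h != NULL);
    if (h->cuda_sdk_installed && h->cuda_device_count > 0)
        return BACKEND_CUDA;
    if (h->rocm_sdk_installed && h->rocm_device_count > 0)
        return BACKEND_ROCM;
    return BACKEND_NONE;
}
```

For a host like the one described here (both stacks installed, only a ROCm device), `choose_backend` would return `BACKEND_ROCM` instead of defaulting to CUDA.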
The CI instances are tagged with their devices. As an example, serotonin is tagged with
There is a third problem: the JDF of the stress tester is not symmetrical:
parsec/tests/runtime/cuda/stress.jdf, line 106 in 1ababbe
parsec/tests/runtime/cuda/stress.jdf, line 128 in 1ababbe
Buggy behavior
This occasionally causes the following behavior
Potential fix
We believe that the READ B should be from READ_A(m, r), without the (m+r)%mt randomization.
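To make the asymmetry concrete, here is a toy C model, not the actual JDF. It assumes (hypothetically) that producer `p` sends its output to consumer `p`, while consumer `m` names its input producer either as `(m + r) % mt` (the randomized form) or as `m` (the proposed fix). The function names and the identity output mapping are assumptions made for this sketch only.

```c
#include <assert.h>

/* Toy model: index of the producer that consumer (m, r) names as its input.
 * randomized != 0 mimics the (m+r)%mt form; 0 mimics the proposed fix. */
int input_source(int m, int r, int mt, int randomized)
{
    assert(mt > 0 && m >= 0 && m < mt && r >= 0);
    return randomized ? (m + r) % mt : m;
}

/* Assumption for this sketch: producer p sends its output to consumer p. */
int output_target(int p)
{
    return p;
}

/* Count consumers whose named producer does not send back to them,
 * i.e. dependencies declared on one side but not mirrored on the other. */
int count_mismatches(int mt, int r, int randomized)
{
    int bad = 0;
    for (int m = 0; m < mt; m++) {
        int p = input_source(m, r, mt, randomized);
        if (output_target(p) != m)
            bad++;
    }
    return bad;
}
```

In this toy model the randomized form is mismatched whenever `r % mt != 0`, and dropping the randomization makes every input dependency mirror an output dependency.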
Describe the bug
The stress:gpu test, when compiled with CUDA support, may crash when run on a system without a CUDA GPU. See
https://github.com/ICLDisco/parsec/actions/runs/8190649752/job/22398165838?pr=515
Note that there appears to be some interaction with Intel ZE libraries that is unexpected here.
To Reproduce
See https://github.com/ICLDisco/parsec/actions/runs/8190649752/job/22398165838?pr=515