stress:gpu crashes when compiled with CUDA but run without a device #641

Closed
abouteiller opened this issue Mar 7, 2024 · 4 comments · Fixed by #644
Labels
bug Something isn't working
Milestone
v4.0

Comments

@abouteiller
Contributor

Describe the bug

The stress:gpu test, when compiled with CUDA support, may crash when run on a system without a CUDA GPU. See

https://github.com/ICLDisco/parsec/actions/runs/8190649752/job/22398165838?pr=515

Note that there appears to be an unexpected interaction with the Intel ZE (Level Zero) libraries here.

To Reproduce

See https://github.com/ICLDisco/parsec/actions/runs/8190649752/job/22398165838?pr=515

@abouteiller abouteiller added the bug Something isn't working label Mar 7, 2024
@abouteiller abouteiller added this to the v4.0 milestone Mar 7, 2024
@abouteiller
Contributor Author

There are two different failure modes:

  1. stress tries to allocate 33 GB of memory, which may or may not be possible, especially on low-end CUDA devices or as host memory.
  2. stage tries to run hook = tc->incarnations[chore_id], but disabling the CUDA device nullified the hook for the only valid chore_id of 0 (the incarnations array does not have a CPU implementation); see the sketch after this list.
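
A minimal sketch of the guard the second failure mode calls for, assuming PaRSEC-style names taken from the comment above (tc->incarnations[chore_id].hook, parsec_warning, and PARSEC_HOOK_RETURN_ERROR are assumptions about the runtime API, not verified against the actual sources); this only illustrates the failure path, not the fix that landed in #644:

  #include <parsec.h>  /* assumption: public header providing the types used below */

  /* Dispatch one incarnation of a task class, failing cleanly when the selected
   * chore has been nullified (e.g. the CUDA device was disabled at runtime and
   * the incarnations array has no CPU fallback). Field names follow the comment
   * above and may not match the actual PaRSEC internals exactly. */
  static int dispatch_incarnation(parsec_execution_stream_t *es,
                                  parsec_task_t *this_task,
                                  const parsec_task_class_t *tc,
                                  int chore_id)
  {
      parsec_hook_t *hook = tc->incarnations[chore_id].hook;
      if (NULL == hook) {
          parsec_warning("task class %s has no runnable incarnation", tc->name);
          return PARSEC_HOOK_RETURN_ERROR;
      }
      return hook(es, this_task);
  }

The point is only that a chore disabled at runtime leaves a NULL hook that the generic scheduling path must not call blindly.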

@therault
Contributor

I pushed some commits in PR #642 to handle the lack of a device more gracefully, both in the test and in the runtime system.

However, the test still fails in the CI, since there is no way to fix that at the runtime/test level. The issue is with the CI logic here: serotonin is a machine that has both the NVIDIA and ROCm software stacks installed, but only a ROCm device. The CMake logic defaults to NVIDIA in that case, whereas we should pass ROCm.

I don't think we should complicate the CMake logic of the tests any further here: they fail because we asked them to run on NVIDIA and there are no NVIDIA cards.

Geri is preparing a PR on the CI side to fix this issue at the CI logic level.

@bosilca
Contributor

bosilca commented Mar 12, 2024

The CI instances are tagged with their devices. As an example, serotonin is tagged with gpu_amd while guyot is tagged with gpu_nvidia. The CI should use these tags to drive the correct set of tests.

@abouteiller
Contributor Author

abouteiller commented Apr 6, 2024

There is a third problem:

The JDF of the stress tester is not symmetric:

-> B GEMM( m, 0 .. NGPUs-1, r )

READ B <- A READ_A( (m+g) % descA->super.mt, r)

Buggy behavior

This occasionally causes the following behavior:

  flow GEMM(1,1,0) B <- READ_A(1,0) accesses a data repo entry that has been freed as part of the termination for GEMM(1,0,0)
  entry 0x1bda310/READ_A(1, 0) of hash table this_task->data._f_B.source_repo has a usage count of 4/4 and is not retained: freeing it at stress.c:1041 @__data_repo_entry_used_once:120

This happens when the test runs on the CPU (devices exist on b00, but OOM with memory_use=90); it is not clear why this would be related to the changes in the PR.

Potential fix

We believe that READ B should come from READ_A(m, r), without the (m+g) % mt randomization.
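
A sketch of the proposed change to the stress JDF, assuming the dependency quoted above is the one to modify (untested; the symmetric form is taken from the suggestion above):

  READ B <- A READ_A( m, r )    /* instead of READ_A( (m+g) % descA->super.mt, r ) */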
