stress:gpu crashes when compiled with CUDA but run without a device #641

Closed
abouteiller opened this issue Mar 7, 2024 · 4 comments · Fixed by #644
Labels
bug Something isn't working
Milestone
v4.0

Comments

@abouteiller
Contributor

Describe the bug

The stress:gpu test, when compiled with CUDA support, may crash when run on a system without a CUDA GPU. See

https://github.com/ICLDisco/parsec/actions/runs/8190649752/job/22398165838?pr=515

Note that there appears to be an unexpected interaction with the Intel ZE (Level Zero) libraries here.

To Reproduce

See https://github.com/ICLDisco/parsec/actions/runs/8190649752/job/22398165838?pr=515

@abouteiller abouteiller added the bug Something isn't working label Mar 7, 2024
@abouteiller abouteiller added this to the v4.0 milestone Mar 7, 2024
@abouteiller
Contributor Author

There are two different failure modes:

  1. stress tries to allocate 33 GB of memory, which may or may not be possible, especially on low-end CUDA devices or as host memory.
  2. stage tries to run hook = tc->incarnations[chore_id], but disabling the CUDA device nullified the hook for the only valid chore_id of 0 (the incarnations array does not have a CPU implementation); see the sketch after this list.
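
A minimal sketch of the guard the second failure mode calls for, assuming PaRSEC-style names taken from the comment above (tc->incarnations[chore_id].hook, parsec_warning, and PARSEC_HOOK_RETURN_ERROR are assumptions about the runtime API, not verified against the actual sources); this only illustrates the failure path, not the fix that landed in #644:

  #include <parsec.h>  /* assumption: public header providing the types used below */

  /* Dispatch one incarnation of a task class, failing cleanly when the selected
   * chore has been nullified (e.g. the CUDA device was disabled at runtime and
   * the incarnations array has no CPU fallback). Field names follow the comment
   * above and may not match the actual PaRSEC internals exactly. */
  static int dispatch_incarnation(parsec_execution_stream_t *es,
                                  parsec_task_t *this_task,
                                  const parsec_task_class_t *tc,
                                  int chore_id)
  {
      parsec_hook_t *hook = tc->incarnations[chore_id].hook;
      if (NULL == hook) {
          parsec_warning("task class %s has no runnable incarnation", tc->name);
          return PARSEC_HOOK_RETURN_ERROR;
      }
      return hook(es, this_task);
  }

The point is only that a chore disabled at runtime leaves a NULL hook that the generic scheduling path must not call blindly.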

@therault
Contributor

I pushed some commits in PR #642 to handle the lack of a device more gracefully, both in the test and in the runtime system.

However, the test still fails in the CI, since there is no way to fix that at the runtime/test level. The issue is with the CI logic here: serotonin is a machine that has both the NVIDIA and ROCm software stacks installed, but only a ROCm device. The CMake logic defaults to NVIDIA in that case, whereas we should pass ROCm.

I don't think we should complicate the CMake logic of the tests any further here: they fail because we asked them to run on NVIDIA and there are no NVIDIA cards.

Geri is preparing a PR on the CI side to fix this issue at the CI logic level.

@bosilca
Contributor

bosilca commented Mar 12, 2024

The CI instances are tagged with their devices. As an example, serotonin is tagged with gpu_amd while guyot is tagged with gpu_nvidia. The CI should use these tags to drive the correct set of tests.

@abouteiller
Contributor Author

abouteiller commented Apr 6, 2024

There is a third problem:

The JDF of the stress tester is not symmetric:

-> B GEMM( m, 0 .. NGPUs-1, r )

READ B <- A READ_A( (m+g) % descA->super.mt, r)

Buggy behavior

This occasionally causes the following behavior:

  flow GEMM(1,1,0) B <- READ_A(1,0) accesses a data repo entry that has been freed as part of the termination for GEMM(1,0,0)
  entry 0x1bda310/READ_A(1, 0) of hash table this_task->data._f_B.source_repo has a usage count of 4/4 and is not retained: freeing it at stress.c:1041 @__data_repo_entry_used_once:120

This happens when the test runs on the CPU (devices exist on b00, but OOM with memory_use=90); it is not clear why this would be related to the changes in the PR.

Potential fix

We believe that READ B should come from READ_A(m, r), without the (m+g) % mt randomization.
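
A sketch of the proposed change to the stress JDF, assuming the dependency quoted above is the one to modify (untested; the symmetric form is taken from the suggestion above):

  READ B <- A READ_A( m, r )    /* instead of READ_A( (m+g) % descA->super.mt, r ) */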
