
Handle error in CUDA/HIP module init and configurable max_streams #351

Merged
bosilca merged 2 commits into ICLDisco:master from therault:cuda-graceful-failure
Jun 16, 2022


@therault
Contributor

A misconfiguration on leconte leaves none of the devices able to allocate a CUDA stream.

This leads to a SEGFAULT, because we try to free things that have not been allocated.

With this patch, the failing devices are simply removed (with a warning), and execution proceeds.
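For readers who want the shape of the fix without opening the diff, here is a minimal sketch of the pattern described above. It is not the code from this PR: it does not call CUDA or PaRSEC, and every name in it (`sketch_device_t`, `sketch_stream_create`, `sketch_device_init`, `sketch_device_fini`, `sketch_init_all`, the `broken` flag) is hypothetical. It only assumes that stream creation can fail per device, that cleanup must tolerate partially initialized devices, and that the number of streams is passed in as a parameter.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a device module; not the PaRSEC structure. */
typedef struct {
    int  id;
    int  nb_streams;  /* number of streams actually created so far */
    int *streams;     /* NULL until the stream array is allocated */
} sketch_device_t;

/* Simulated stream creation: fails when the device is flagged as broken.
 * Returns 0 on success, -1 on failure. */
static int sketch_stream_create(int *stream, int broken)
{
    if (broken) return -1;
    *stream = 1;
    return 0;
}

/* Cleanup that is safe on a partially initialized device:
 * it accepts NULL, and free(NULL) is a no-op in standard C. */
static void sketch_device_fini(sketch_device_t *dev)
{
    if (dev == NULL) return;
    free(dev->streams);
    free(dev);
}

/* Initialize one device with max_streams streams.
 * On any failure, warn, release what was allocated, and return NULL. */
static sketch_device_t *sketch_device_init(int id, int max_streams, int broken)
{
    assert(max_streams > 0);
    sketch_device_t *dev = calloc(1, sizeof(*dev));
    if (dev == NULL) return NULL;
    dev->id = id;
    dev->streams = calloc((size_t)max_streams, sizeof(int));
    if (dev->streams == NULL) goto fail;
    for (int i = 0; i < max_streams; i++) {
        if (sketch_stream_create(&dev->streams[i], broken) != 0) {
            fprintf(stderr,
                    "warning: device %d: cannot create stream %d, "
                    "disabling device\n", id, i);
            goto fail;
        }
        dev->nb_streams++;
    }
    return dev;
fail:
    sketch_device_fini(dev);
    return NULL;
}

/* Try every device; keep only the usable ones, packed at the front of out[].
 * Returns the number of usable devices, which may be 0. */
static int sketch_init_all(const int *broken, int ndev, int max_streams,
                           sketch_device_t **out)
{
    int usable = 0;
    for (int i = 0; i < ndev; i++) {
        sketch_device_t *dev = sketch_device_init(i, max_streams, broken[i]);
        if (dev != NULL) out[usable++] = dev;
    }
    return usable;
}
```

With this structure, a caller that gets 0 back from `sketch_init_all` simply continues without accelerators, which mirrors the "removed with a warning, execution proceeds" behavior described above.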

@therault therault requested a review from bosilca as a code owner March 17, 2022 02:00
@abouteiller abouteiller added the bug Something isn't working label Mar 17, 2022
Comment thread parsec/mca/device/cuda/device_cuda_module.c Outdated
@abouteiller abouteiller added this to the v4.0 milestone May 6, 2022
@therault therault requested a review from a team as a code owner June 9, 2022 19:57
Comment thread parsec/mca/device/cuda/device_cuda_component.c
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
@abouteiller abouteiller force-pushed the cuda-graceful-failure branch from 46a39f9 to 5096959 June 13, 2022 20:38
@abouteiller abouteiller requested a review from bosilca June 13, 2022 20:39
@abouteiller abouteiller changed the title from "Handle more gracefully cases of error in CUDA module init" to "Handle error in CUDA module init and configurable max_streams" Jun 13, 2022
@abouteiller abouteiller changed the title from "Handle error in CUDA module init and configurable max_streams" to "Handle error in CUDA/HIP module init and configurable max_streams" Jun 14, 2022
@bosilca bosilca merged commit c041a1d into ICLDisco:master Jun 16, 2022


3 participants