Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidated error handling when GPU only tests execute on CPU systems #644

Merged
merged 6 commits into from
May 24, 2024

Conversation

abouteiller
Copy link
Contributor

@abouteiller abouteiller commented Mar 12, 2024

Fixes issue #641

  • capture ci tag and compile for Cuda or HIP based on this tag, not on availability of the software stack.
  • decide if we keep the check in stage_main about counting GPUs. We now will detect (later) erroneous behavior and produce a parsec_fatal so maybe that is already clear enough?
  • cleanup error cases for when hook==NULL, cases have been split between when the dyld function was not populated (goto next_chore) and other cases (this is a bug, assert/fail).
  • stress tester is not symmetrical stress:gpu crashes when compiled with CUDA but run without a device #641 (comment)

parsec/scheduling.c Outdated Show resolved Hide resolved
@abouteiller abouteiller changed the title Bugfix/no gpu found Consolidated error handling when GPU only tests execute on CPU systems Mar 13, 2024
@abouteiller abouteiller added this to the v4.0 milestone Mar 13, 2024
@abouteiller abouteiller linked an issue Mar 13, 2024 that may be closed by this pull request
@abouteiller abouteiller added the bug Something isn't working label Mar 13, 2024
@abouteiller
Copy link
Contributor Author

Ok so current approach is that the tests will not fail, just put out a warning in the output.
That is not ideal but we have to choose between two bad options due to how ctest works: either we 1. cause spurious failures when running ctest when we don't have the hardware to run the test, or 2. we cause spurious success when doing so, there is no way to report that the test actually did not run.

This happens in CI, but also happens when running ctest manually (e.g., running with srun -wxyz ctest where xyz doesn't have a CUDA gpu).

@therault has been working with Geri to integrate the workflow tags into the ctest logic, but that is currently stalled so lets move with this for now.

@abouteiller abouteiller marked this pull request as ready for review May 21, 2024 23:40
@abouteiller abouteiller requested a review from a team as a code owner May 21, 2024 23:40
@bosilca
Copy link
Contributor

bosilca commented May 22, 2024

Why not returning a specific error code from the tests to indicate that the test requires hardware not available. We could then use expect to identify tests that cannot be run because the CI does not have the specific hardware (aka. the tag).

@abouteiller
Copy link
Contributor Author

abouteiller commented May 22, 2024

In the end, ctest will report the test as failed of passed, even if the test itself has a ternary output, ctest can only report binary results. So there will be loss of information, we just have to decide how we want it rounded up (passed) or down (failed, current behavior).
https://cmake.org/cmake/help/v3.19/prop_test/SKIP_RETURN_CODE.html

@abouteiller abouteiller merged commit a5f49ab into ICLDisco:master May 24, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

stress:gpu crashes when compiled with CUDA but run without a device
3 participants