Initialize low-level GPU API once instead of at each Variorum API call #443

Closed
wants to merge 4 commits into from

Conversation

amarathe84
Collaborator

@amarathe84 amarathe84 commented Jul 7, 2023

Description

This PR adds state-preserving logic to the GPU functionality in Variorum. The goal is to call the initialization functions provided by the low-level, device-specific APIs once, rather than at each Variorum API call.
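A minimal sketch of the initialize-once idea, not the code in this PR: `backend_init`, `backend_read_power`, and `example_get_gpu_power` are hypothetical stand-ins for a device library's init/query calls and a Variorum-style entry point. The backend only counts how often it is initialized, so the effect of the guard is visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a low-level device library. It only counts
 * init calls and returns a placeholder reading. */
static int backend_init_calls = 0;

static int backend_init(void)
{
    backend_init_calls++;
    return 0; /* 0 means success in this sketch */
}

static int backend_read_power(void)
{
    return 42; /* placeholder value, not a real measurement */
}

/* State preserved across API calls: has the backend been initialized? */
static bool gpu_initialized = false;

static int ensure_gpu_initialized(void)
{
    if (!gpu_initialized) {
        if (backend_init() != 0) {
            return -1;
        }
        gpu_initialized = true;
    }
    return 0;
}

/* Hypothetical public entry point: initializes lazily on first use,
 * then reuses the preserved state on later calls. */
int example_get_gpu_power(int *watts)
{
    assert(watts != NULL);
    if (ensure_gpu_initialized() != 0) {
        return -1;
    }
    *watts = backend_read_power();
    return 0;
}
```

Calling `example_get_gpu_power` repeatedly triggers `backend_init` only on the first call. The flag is a plain `static`, so this sketch is not thread-safe; a real implementation would likely need a lock or a once-primitive.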

Fixes #427

Closes #459 once all tests pass.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Because this fix touches both the AMD and Nvidia GPU code paths, it was tested on the following systems:

  • Lassen/Alehouse: Nvidia V100 GPUs
  • nvhpc1: Nvidia A100 GPUs
  • Corona/Tioga: AMD GPUs

@amarathe84 amarathe84 force-pushed the gpu-init-fix branch 2 times, most recently from c897425 to 5150443 Compare July 7, 2023 23:06
@amarathe84 amarathe84 self-assigned this Jul 7, 2023
@amarathe84 amarathe84 marked this pull request as ready for review July 25, 2023 22:57
@amarathe84 amarathe84 added status-ready-for-review Formatted, and tested on multiple systems. and removed status-work-in-progress In progress, not ready to merge. labels Jul 25, 2023
@slabasan slabasan mentioned this pull request Aug 2, 2023
@slabasan
Collaborator

slabasan commented Aug 4, 2023

@amarathe84 This PR seems to be failing on lassen-gpu-only. Can you take a look? It could be an error with the runner and not the code.

@slabasan slabasan added status-work-in-progress In progress, not ready to merge. and removed status-ready-for-review Formatted, and tested on multiple systems. labels Jan 17, 2024
@amarathe84
Collaborator Author

While testing this PR, we noticed a double-free corruption error in the Nvidia GPU build of Variorum on Lassen. Debugging showed that the double free occurs because the second Variorum call made by the test application tries to use a deallocated NVML device object without NVML being initialized. Resolving this will require introducing stateful variorum_init() and variorum_finalize() APIs in Variorum. Users of those APIs will need to manually call nvmlInit() after variorum_init() and nvmlShutdown() before variorum_finalize() in their code. This PR remains in WIP state until those advanced APIs are introduced in Variorum.
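The call ordering described above can be sketched as follows. This is an illustration under stated assumptions, not the proposed Variorum API: `sketch_init`, `sketch_query`, and `sketch_finalize` are hypothetical names, and `fake_nvml_init` / `fake_nvml_shutdown` are stubs standing in for `nvmlInit()` / `nvmlShutdown()` so the example builds without NVML.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Stands in for per-device handles owned by the library (hypothetical). */
typedef struct {
    int index;
} fake_device_t;

static fake_device_t *devices = NULL; /* allocated in init, freed in finalize */
static bool nvml_up = false;          /* tracks the stubbed NVML lifetime */

/* Stubs for nvmlInit()/nvmlShutdown(); the real calls live in NVML. */
int fake_nvml_init(void)     { nvml_up = true;  return 0; }
int fake_nvml_shutdown(void) { nvml_up = false; return 0; }

/* Hypothetical stateful init: allocate device state exactly once. */
int sketch_init(void)
{
    if (devices != NULL) {
        return 0; /* already initialized */
    }
    devices = calloc(1, sizeof *devices);
    return devices != NULL ? 0 : -1;
}

/* A query refuses to run unless both the state and the library are live,
 * instead of touching a freed device object. */
int sketch_query(int *out)
{
    assert(out != NULL);
    if (devices == NULL || !nvml_up) {
        return -1;
    }
    *out = devices[0].index;
    return 0;
}

/* Hypothetical stateful finalize: free once and clear the pointer, so a
 * repeated call is a no-op rather than a double free. */
int sketch_finalize(void)
{
    free(devices);
    devices = NULL;
    return 0;
}
```

A caller following the ordering in the comment would run `sketch_init()`, then `fake_nvml_init()`, any number of `sketch_query()` calls, then `fake_nvml_shutdown()`, and finally `sketch_finalize()`. Clearing `devices` after `free` is what makes a second finalize harmless in this sketch.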

@tpatki
Member

tpatki commented Feb 7, 2024

#459 is tracking this for our GitLab tests. Closing this PR.

@tpatki tpatki closed this Feb 7, 2024
Development

Successfully merging this pull request may close these issues.

Performance issues with nvml_init/rsmi_init