Initialize low-level GPU API once instead of at each Variorum API call #443

amarathe84 · 2023-07-07T21:22:14Z

Description

This PR adds state-preserving logic to the GPU functionality in Variorum. The goal is to call the initialization functions provided by the low-level device-specific APIs (nvmlInit() on Nvidia GPUs, rsmi_init() on AMD GPUs) once, instead of at each Variorum API call.
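For illustration, a minimal sketch of the init-once pattern, assuming a simple static flag. The names `backend_init()`, `ensure_gpu_initialized()`, and `example_print_gpu_power()` are hypothetical stand-ins and are not taken from the Variorum source; `backend_init()` plays the role of nvmlInit() or rsmi_init().

```c
/* Hypothetical stand-in for a low-level initializer such as nvmlInit()
 * or rsmi_init(). It counts its calls so the effect of the guard is visible. */
static int backend_init_calls = 0;

static int backend_init(void)
{
    backend_init_calls++;
    return 0; /* 0 means success in this sketch */
}

/* State preserved across API calls: nonzero once the backend is initialized. */
static int gpu_initialized = 0;

/* Initialize the backend on first use only; later calls are no-ops. */
static int ensure_gpu_initialized(void)
{
    if (!gpu_initialized)
    {
        int rc = backend_init();
        if (rc != 0)
        {
            return rc; /* leave the flag unset so a later call can retry */
        }
        gpu_initialized = 1;
    }
    return 0;
}

/* Hypothetical top-level API call that needs the GPU backend. */
int example_print_gpu_power(void)
{
    int rc = ensure_gpu_initialized();
    if (rc != 0)
    {
        return rc;
    }
    /* The device query would go here. */
    return 0;
}
```

This sketch is not thread-safe, and it never shuts the backend down; a real implementation would need a matching teardown point.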

Fixes #427

Closes #459 once all tests pass.

Type of change

Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Since this fix touches both the AMD and Nvidia GPU code, it was tested on the following systems:

Lassen/Alehouse: Nvidia V100 GPUs
nvhpc1: Nvidia A100 GPUs
Corona/Tioga: AMD GPUs

slabasan · 2023-08-04T22:16:45Z

@amarathe84 This PR seems to be failing on lassen-gpu-only. Can you take a look? It could be an error with the runner and not the code.

amarathe84 · 2024-01-17T21:50:25Z

While testing this PR, we noticed a double-free corruption error in the Nvidia GPU build of Variorum on Lassen. Debugging revealed that the double free happens because the second Variorum call made by the test application tries to use a deallocated NVML device object without NVML being initialized. Resolving this will require introducing stateful variorum_init() and variorum_finalize() APIs in Variorum. Users of those APIs will need to manually call nvmlInit() after variorum_init() and nvmlShutdown() before variorum_finalize() in their code. This PR is in a WIP state while it waits for these advanced APIs to be introduced in Variorum.
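The call ordering described above could be sketched roughly as follows. Since variorum_init() and variorum_finalize() do not exist yet, every function here is a hypothetical stub (prefixed `stub_`) that only tracks state; none of the signatures are taken from Variorum or NVML.

```c
/* Error code used only in this sketch. */
enum { ERR_NOT_INITIALIZED = -1 };

static int lib_active = 0;  /* set between stub_variorum_init/finalize */
static int nvml_active = 0; /* set between stub_nvml_init/shutdown */

/* Stand-in for the proposed variorum_init(). */
int stub_variorum_init(void)
{
    lib_active = 1;
    return 0;
}

/* Stand-in for nvmlInit(), called by the user after variorum_init(). */
int stub_nvml_init(void)
{
    if (!lib_active)
    {
        return ERR_NOT_INITIALIZED;
    }
    nvml_active = 1;
    return 0;
}

/* Stand-in for nvmlShutdown(), called by the user before variorum_finalize(). */
int stub_nvml_shutdown(void)
{
    nvml_active = 0;
    return 0;
}

/* Stand-in for the proposed variorum_finalize(). */
int stub_variorum_finalize(void)
{
    lib_active = 0;
    return 0;
}

/* Stand-in for any GPU query: valid only while both layers are active,
 * so a query after shutdown fails cleanly instead of touching freed state. */
int stub_query_gpu(void)
{
    return (lib_active && nvml_active) ? 0 : ERR_NOT_INITIALIZED;
}
```

With this ordering, repeated queries between the init and shutdown calls share one initialized NVML state, which is the situation the second Variorum call in the failing test lacked.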

tpatki · 2024-02-07T18:53:12Z

#459 is tracking this for our GitLab tests. Closing this PR.

amarathe84 force-pushed the gpu-init-fix branch 2 times, most recently from c897425 to 5150443 Compare July 7, 2023 23:06

amarathe84 self-assigned this Jul 7, 2023

amarathe84 added status-work-in-progress In progress, not ready to merge. type-bug area-hardware-support labels Jul 11, 2023

amarathe84 force-pushed the gpu-init-fix branch from 9729117 to 0682a7c Compare July 25, 2023 22:55

amarathe84 marked this pull request as ready for review July 25, 2023 22:57

amarathe84 requested review from tpatki and slabasan July 25, 2023 22:58

amarathe84 added status-ready-for-review Formatted, and tested on multiple systems. and removed status-work-in-progress In progress, not ready to merge. labels Jul 25, 2023

slabasan mentioned this pull request Aug 2, 2023

PR from fork/443 #459

Closed

amarathe84 added 3 commits January 16, 2024 22:04

Nvidia GPUs: Call nvmlInit() once (state-preservation)

282d7c4

AMD GPUs: Call rsmi_init() once

21bfe3e

Initialize ARM CPU model ID

a701bf0

slabasan force-pushed the gpu-init-fix branch from 0682a7c to a701bf0 Compare January 17, 2024 06:05

format

67f232c

slabasan added status-work-in-progress In progress, not ready to merge. and removed status-ready-for-review Formatted, and tested on multiple systems. labels Jan 17, 2024

tpatki closed this Feb 7, 2024
