Here are my conclusions based on the CI logs.
The failing tests are the ones that call nvmlInit.
The tests fail because the watchdog aborts them when the 15-minute timeout expires.
The tests fail only under valgrind.
The thread hangs in a waitpid call.
I also have the following assumptions.
Several nvmlDeviceGetProcessUtilization calls are present in the backtrace. I think this is because several processes are running and NVML tries to get the utilization for each of them. Moreover, according to the documentation, two nvmlDeviceGetProcessUtilization calls are required to retrieve the information. The PIDs can be obtained from the second call, and I guess the obtained PIDs are then passed to the waitpid system call.
NVML is only used to limit the buffer size in gtest. However, under valgrind we already limit the buffer size to 1 MB, which is stricter than the limit obtained through the NVML API. Thus, I will work around the issue by adding a condition before the call so that NVML is not queried under valgrind.
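The workaround described above could look roughly like the sketch below. It is not the actual gtest code: `limit_buffer_size`, `kValgrindMaxBufferSize`, and the `query_gpu_limit` callback are hypothetical names used only for illustration, and the callback stands in for whatever NVML-based query the test harness performs. The `under_valgrind` flag is assumed to come from the harness (for example, from Valgrind's `RUNNING_ON_VALGRIND` client request).

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>

// Assumption: the 1 MB cap that is already applied under valgrind.
constexpr std::size_t kValgrindMaxBufferSize = 1024 * 1024;

// Hypothetical helper: returns the buffer size a test is allowed to use.
// query_gpu_limit stands in for the NVML-based query; it is only invoked
// when the process is not running under valgrind, so the NVML code path
// that hangs is never reached in that case.
std::size_t limit_buffer_size(std::size_t requested,
                              bool under_valgrind,
                              const std::function<std::size_t()> &query_gpu_limit)
{
    if (under_valgrind) {
        // The valgrind cap is stricter than the NVML-derived one,
        // so the NVML query can be skipped entirely.
        return std::min(requested, kValgrindMaxBufferSize);
    }

    return std::min(requested, query_gpu_limit());
}
```

With this shape, the valgrind runs keep their existing 1 MB limit, while regular runs still use the GPU-derived limit.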
Describe the bug
GTest failures when running on the L40G GPU.
Log
Steps to Reproduce
Rerun failed jobs
https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=72701