GTest failures when running on the L40G GPU #9531

Alexey-Rivkin · 2023-12-05T11:41:52Z

Describe the bug

GTest failures when running on the L40G GPU.
Log

Steps to Reproduce

Rerun failed jobs
https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=72701

rakhmets · 2023-12-08T15:17:39Z

The following conclusions are based on the CI logs.
The failed tests are the ones that contain a call to nvmlInit.
The tests fail because the watchdog aborts them when the 15-minute timeout expires.
The tests fail only under valgrind.
The thread hangs in a waitpid call.

I also have the following assumptions.
Several nvmlDeviceGetProcessUtilization calls are present in the backtrace. I think this is because several processes are running and NVML tries to get the utilization of each one. Moreover, two nvmlDeviceGetProcessUtilization calls are required to retrieve the information (according to the documentation). The PIDs can be obtained from the second call, and I guess the obtained PIDs are passed to the waitpid system call.

NVML is only used to limit the size of the buffer in gtest. However, we already limit the buffer size under valgrind to 1 MB. This limit is stricter than the one obtained through the NVML API. Thus, I will work around the issue by adding a condition before the NVML call.
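For illustration, the workaround described above could look roughly like the sketch below. It is not the actual patch: `max_test_buffer_size`, `valgrind_max_buffer`, and the `query_bar1_free` callback are hypothetical names, and the callback only stands in for whatever NVML-based query the test performs. In real code the `under_valgrind` flag would presumably come from something like the `RUNNING_ON_VALGRIND` macro in `valgrind/valgrind.h`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical constant: the 1 MB cap that is already applied under valgrind.
constexpr std::size_t valgrind_max_buffer = 1024 * 1024;

// Returns the upper bound for a test buffer.
// `query_bar1_free` is a stand-in for the NVML-based query (an assumption,
// not the real UCX helper). It is invoked only when not under valgrind,
// so the NVML code path that hangs is never entered there.
std::size_t max_test_buffer_size(
        bool under_valgrind,
        const std::function<std::size_t()> &query_bar1_free)
{
    assert(query_bar1_free); // a callable must be supplied

    if (under_valgrind) {
        // The valgrind cap is stricter than the NVML-derived limit,
        // so the NVML query can be skipped entirely.
        return valgrind_max_buffer;
    }

    return query_bar1_free();
}
```

The point of passing the query as a callback is only to make the skip visible: under valgrind the callback is never invoked, and the result falls back to the existing 1 MB cap.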

Alexey-Rivkin added the Bug label Dec 5, 2023

rakhmets self-assigned this Dec 8, 2023

rakhmets mentioned this issue Dec 8, 2023

GTEST/UCT: Do not query BAR1 memory using NVML under valgrind. #9540

Closed

artemry-nv mentioned this issue Dec 13, 2023

AZP: Add L40G test #9491

Open
