GTest failures when running on the L40G GPU #9531

Open
Alexey-Rivkin opened this issue Dec 5, 2023 · 1 comment

@Alexey-Rivkin
Collaborator

Describe the bug

GTest failures when running on the L40G GPU.
Log

Steps to Reproduce

Rerun failed jobs
https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=72701

@rakhmets rakhmets self-assigned this Dec 8, 2023
@rakhmets
Collaborator

rakhmets commented Dec 8, 2023

The following conclusions are based on the CI logs:
The failed tests are the ones that contain a call to nvmlInit.
The tests fail because the watchdog aborts them when the 15-minute timeout expires.
The tests fail only under valgrind.
The thread hangs in a waitpid call.

I also have the following assumptions.
Several nvmlDeviceGetProcessUtilization calls are present in the backtrace. I think this is because several processes are running, and NVML tries to get the utilization for each of them. Moreover, two nvmlDeviceGetProcessUtilization calls are required to retrieve the information (according to the documentation). The PIDs can be obtained from the second call, and I guess the obtained PIDs are then passed to the waitpid system call.

NVML is only used to limit the buffer size in gtest. However, under valgrind we already limit the buffer size to 1 MB, which is stricter than the limit obtained through the NVML API. Thus, I will work around the issue by adding a condition before the call, so that NVML is not invoked under valgrind.
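
A minimal sketch of the kind of guard described above, not the actual UCX gtest code. The names `max_buffer_size`, `VALGRIND_MAX_BUFFER_SIZE`, and the `query_device_limit` callback are hypothetical; the callback stands in for whatever NVML-based query the tests perform, and the `under_valgrind` flag is assumed to come from the test framework's own valgrind detection.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Assumption: the 1 MB cap mentioned in the comment above.
constexpr std::size_t VALGRIND_MAX_BUFFER_SIZE = 1024 * 1024;

// Hypothetical helper illustrating the workaround.
// query_device_limit represents the NVML-based path (the one that ends up
// calling nvmlInit); it is only invoked when not running under valgrind.
std::size_t max_buffer_size(bool under_valgrind,
                            const std::function<std::size_t()> &query_device_limit)
{
    assert(query_device_limit && "a device limit query must be provided");

    if (under_valgrind) {
        // The valgrind cap is already the stricter limit, so the NVML
        // query is skipped entirely and cannot hang.
        return VALGRIND_MAX_BUFFER_SIZE;
    }

    return query_device_limit();
}
```

With this shape, a valgrind run never reaches the NVML code path, while a regular run still uses the device-derived limit.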
