@baldvin-kovacs

After upgrading to nvidia 580.76.05-6 from nvidia 575.64.05-4, my Incus containers failed to boot up.

Running Incus in debug mode, I could find this log line:

lxc baldvin-test2 20250828094627.156 DEBUG utils - ../src/lxc/utils.c:run_buffer:560 - Script exec /usr/share/lxc/hooks/nvidia produced output: nvidia-container-cli: ldcache error: process /usr/bin/ldconfig terminated with signal 9

After I inserted strace into /usr/share/lxc/hooks/nvidia, the strace output contained 600MB+ worth of lines like this:

clock_gettime(CLOCK_MONOTONIC, 0x7fff12e98060) = -1 EPERM (Operation not permitted)
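A trace of that size is easier to read as a tally of failed calls than line by line. The following is a minimal sketch, not part of the change itself: the helper name `count_failed_syscalls` and the regular expression are illustrative assumptions, written against the line format shown above.

```python
import re
from collections import Counter

# Matches failed calls in strace output, for example:
#   clock_gettime(CLOCK_MONOTONIC, 0x7fff12e98060) = -1 EPERM (Operation not permitted)
# An optional leading PID column (as written by `strace -f`) is tolerated.
FAILED_CALL = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*\)\s+=\s+-1\s+([A-Z0-9]+)\b"
)


def count_failed_syscalls(lines):
    """Count (syscall name, errno name) pairs among failed calls.

    `lines` can be any iterable of strings, including an open file,
    so a very large trace is streamed rather than loaded into memory.
    """
    counts = Counter()
    for line in lines:
        match = FAILED_CALL.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Passing an open trace file to `count_failed_syscalls` and printing `.most_common(10)` on the result would show at a glance whether a single rejected syscall dominates the output.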

This CL adds clock_gettime to the list of permitted calls, which fixed the issue on my machine.
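To illustrate the idea, here is a toy model of a default-deny syscall allowlist. It is only a sketch under stated assumptions: the set contents and the function name `filter_syscall` are hypothetical, and this is not the project's actual filter code, which lives in the hook's native implementation.

```python
import errno

# Hypothetical, deliberately short allowlist for illustration only;
# the real list of permitted calls is longer and is not reproduced here.
ALLOWED_SYSCALLS = {"read", "write", "openat", "close", "mmap", "exit_group"}


def filter_syscall(name, allowed):
    """Model a default-deny filter.

    Returns 0 when the call is permitted, otherwise -EPERM, which strace
    displays as "-1 EPERM (Operation not permitted)".
    """
    return 0 if name in allowed else -errno.EPERM


# Before the change: clock_gettime is not on the list, so it is rejected.
before = filter_syscall("clock_gettime", ALLOWED_SYSCALLS)

# After the change: adding the one name is enough to let the call through.
after = filter_syscall("clock_gettime", ALLOWED_SYSCALLS | {"clock_gettime"})
```

In this model the fix is a one-entry addition to the set, which mirrors the scope of the change described above.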

@greyltc

greyltc commented Sep 5, 2025

@elezar
Member

elezar commented Sep 29, 2025

I don't really see how the driver upgrade would cause this. From the perspective of ldconfig, which is being run to update the ldcache in the container, the driver libraries are opaque.

@baldvin-kovacs do you know what is triggering the clock_gettime call as part of updating the ldconfig?

@baldvin-kovacs
Author

baldvin-kovacs commented Sep 29, 2025

> I don't really see how the driver upgrade would cause this. From the perspective of ldconfig, which is being run to update the ldcache in the container, the driver libraries are opaque.
>
> @baldvin-kovacs do you know what is triggering the clock_gettime call as part of updating the ldconfig?

Sorry, no clue. I just know that after the upgrade, my incus containers stopped working. I was stracing it, and saw this call rejected. I made this change, and all of a sudden my incus containers were working again :)

So yes, it could be a change somewhere else in the system. It may be that this particular call was made earlier too, but the execution environment never rejected it and has now become stricter. Unfortunately I upgraded everything at once with pacman -Syu rather than the nvidia driver separately, so I did not trace the issue back to a change in the nvidia driver versus a change in the execution environment, sorry...

@elezar
Member

elezar commented Sep 29, 2025

@baldvin-kovacs would you be able to sign off your commit as required by the DCO? Otherwise I could create a PR with this change.

@elezar
Member

elezar commented Sep 29, 2025

@baldvin-kovacs we should be able to verify that it's not specific to the driver by running ldconfig through strace on the host system directly.

@elezar
Member

elezar commented Sep 30, 2025

@lahwaacz have you been able to narrow down the change in ldconfig that requires this additional syscall?

@lahwaacz

No, I did not debug this at all.

@elezar
Member

elezar commented Oct 1, 2025

I created #326 to be able to get this in before the upcoming v1.18.0 release.

@elezar elezar closed this in #326 Oct 1, 2025
@jgehrcke

jgehrcke commented Oct 1, 2025

Especially given the CLOCK_MONOTONIC argument this is likely used to implement a deadline / timeout criterion, and whatever component tries to do that probably rightfully does so.
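For context, a deadline check of that kind typically looks like the sketch below. This is an illustrative assumption about the general pattern, not the code of whichever component issues the call; the function name `wait_until` and its parameters are hypothetical.

```python
import time


def wait_until(predicate, timeout_s, poll_interval_s=0.01):
    """Poll `predicate` until it returns True or `timeout_s` seconds elapse.

    The deadline is computed from CLOCK_MONOTONIC, which is unaffected by
    wall-clock adjustments and is therefore the usual choice for timeouts.
    """
    deadline = time.clock_gettime(time.CLOCK_MONOTONIC) + timeout_s
    while True:
        if predicate():
            return True
        if time.clock_gettime(time.CLOCK_MONOTONIC) >= deadline:
            return False
        time.sleep(poll_interval_s)
```

A loop like this reads the clock on every iteration, so a caller that depends on it cannot make progress decisions correctly if the clock read itself is rejected.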

5 participants