Skip to content

Update Validator for /usr/bin symlink#2464

Open
JunAr7112 wants to merge 1 commit into
NVIDIA:mainfrom
JunAr7112:fix_symlink
Open

Update Validator for /usr/bin symlink#2464
JunAr7112 wants to merge 1 commit into
NVIDIA:mainfrom
JunAr7112:fix_symlink

Conversation

@JunAr7112
Copy link
Copy Markdown
Contributor

@JunAr7112 JunAr7112 commented May 15, 2026

Description

This MR is in response to this issue: #1357. In certain scenarios, we might have a symlink with /usr/bin which causes os.Lstat() to be unable to properly parse the nvidia-smi location. Inside the validator container, the host root filesystem is mounted at /host. So the validator tries to resolve: /host/usr/bin/nvidia-smi
But because /host/usr/bin points to an absolute symlink target, /run/.../bin, the path resolution happens from the validator container’s root, not from /host. In other words, it looks for: /run/.../bin/nvidia-smi inside the container namespace, instead of /host/run/.../bin/nvidia-smi. So the validator falsely concludes that the host-installed driver is not present.

Change

nvidiaSMIPath, err := symlink.FollowSymlinkInScope(filepath.Join(hostRootCtrPath, "usr/bin/nvidia-smi"), hostRootCtrPath)

We will switch to using symlink.FollowSymlinkInScope to resolve absolute paths in a helper function resolveHostNvidiaSMI(hostRootCtrPath string) (os.FileInfo, error). For testing purposes, it can be set to a different value. We will need to update the func (v *VGPUManager) runValidation(...) and the func validateHostDriver(...).

Testing

Added a Unit test:

local-agadiyar@ipp1-2290:~/gpu-operator$ go test ./cmd/nvidia-validator -run TestResolveHostNvidiaSMI -count=1
ok github.com/NVIDIA/gpu-operator/cmd/nvidia-validator 0.020s

TestResolveHostNvidiaSMI verifies that the validator can find host nvidia-smi through the mounted host root, including when /usr/bin is an absolute symlink inside that host root. It builds small fake host filesystems in temp dirs, creates regular files or symlinks from the test case map, calls resolveHostNvidiaSMI(hostRoot), and checks that the helper succeeds for valid layouts and fails when nvidia-smi is missing.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@JunAr7112 JunAr7112 force-pushed the fix_symlink branch 2 times, most recently from a345546 to 037a688 Compare May 15, 2026 23:18
Comment thread cmd/nvidia-validator/main.go Outdated
}
fileInfo, err := os.Lstat("/host/usr/bin/nvidia-smi")

nvidiaSMIPath := "/host/usr/bin/nvidia-smi"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this newly added logic to a new function which correctly resolves the nvidia-smi path? Then, we can easily write few unit-tests for it as well.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Let's please write unit tests in this PR.

Copy link
Copy Markdown
Contributor Author

@JunAr7112 JunAr7112 May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added Test_validateHostDriverWithAbsoluteUsrBinSymlink(). For the unit test design, a variable hostRootControlPath was created that defaults to /host.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original suggestion was to move the symlink resolution code to its own function. I would strongly prefer that we do that since 1) we resolve the path to the host's nvidia-smi in two places, and 2) it greatly simplifies the unit test code.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a helper function resolveHostNvidiaSMI(hostRootCtrPath string)

@rahulait
Copy link
Copy Markdown
Contributor

Overall, LGTM. Just a minor suggestion to have a separate function for it and also add unit-tests for it.

Copy link
Copy Markdown
Contributor

@cdesiniotis cdesiniotis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend we use existing packages for symlink resolution within a root. For example, we could use moby's FollowSymlinkInScope or https://pkg.go.dev/github.com/cyphar/filepath-securejoin. The former will probably suffice since the host path to nvidia-smi is assumed to be trustworthy in this context.

Comment thread cmd/nvidia-validator/main.go Outdated
}
fileInfo, err := os.Lstat("/host/usr/bin/nvidia-smi")

nvidiaSMIPath := "/host/usr/bin/nvidia-smi"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Let's please write unit tests in this PR.

Comment thread cmd/nvidia-validator/main.go Outdated
if filepath.IsAbs(target) {
nvidiaSMIPath = filepath.Join("/host", strings.TrimPrefix(target, "/"), "nvidia-smi")
} else {
nvidiaSMIPath = filepath.Join("/host/usr", target, "nvidia-smi")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not following this logic. For example, what happens if /host/usr/bin/nvidia-smi is a symlink to ../../nvidia-smi?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can try using symlinkinScope here. wrt the current format, if /usr/bin points to an absolute host path, it prefixes /host; if it is relative, it resolves it under /host/usr

@JunAr7112 JunAr7112 force-pushed the fix_symlink branch 12 times, most recently from f73c4ac to 2b5b248 Compare May 22, 2026 17:37
Comment thread cmd/nvidia-validator/main_test.go Outdated
"testing"
)

func Test_validateHostDriverWithAbsoluteUsrBinSymlink(t *testing.T) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels overly complicated. As suggested in a prior comment, let's move the nvidia-smi resolution code to its own function and just write a unit test method for that code.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I updated to realign the format more to match this

@JunAr7112 JunAr7112 force-pushed the fix_symlink branch 4 times, most recently from 14b3a1a to f09159d Compare May 26, 2026 17:51
Comment thread cmd/nvidia-validator/main.go Outdated

// resolveHostNvidiaSMI resolves and stats nvidia-smi within the mounted host root.
func resolveHostNvidiaSMI(hostRootCtrPath string) (os.FileInfo, error) {
nvidiaSMIPath, err := symlink.FollowSymlinkInScope(filepath.Join(hostRootCtrPath, "usr/bin/nvidia-smi"), hostRootCtrPath)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cdesiniotis Should we use cyphar.SecureJoin here instead?

Copy link
Copy Markdown
Contributor Author

@JunAr7112 JunAr7112 May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think cyphar.SecureJoin is legacy code. libpathrs is recommended instead. Would that be needed for this use case though? It would be a larger dependency

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate?

Copy link
Copy Markdown
Contributor Author

@JunAr7112 JunAr7112 May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This library was originally just an implementation of SecureJoin which was golang/go#20126 as a safer filepath.Join that would restrict the path lookup to be inside a root directory.

The implementation was based on code that existed in several container runtimes. Unfortunately, this API is fundamentally unsafe against attackers that can modify path components after SecureJoin returns and before the caller uses the path, allowing for some fairly trivial TOCTOU attacks.

https://github.com/cyphar/filepath-securejoin

In the readme they were recommending libpathrs. I think it might be easier to use a smaller dependency though, unless we need anything else from libpathrs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in #2464 (review), either should work. Using pathrs is a more secure approach since it only operates on file descriptors. In contrast symlink.FollowSymlinkInScope returns a path which must be resolved again when opening / executing the file. Since the host's root filesystem is a "trusted" filesystem in this context, I am not too concerned with using symlink.FollowSymlinkInScope.

However, with that said, @JunAr7112 let's switch to using https://github.com/cyphar/filepath-securejoin/tree/main/pathrs-lite in this PR. This is more secure and aligns with the work we recently did in the toolkit. The pseudocode would be:

f, _ := pathrs.OpenInRoot("/host", "/usr/bin/nvidia-smi")
fInfo, _ := f.Stat()
. . .

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I can switch to using pathrs for more consistency

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok @cdesiniotis I updated the MR to use pathrs

@JunAr7112 JunAr7112 force-pushed the fix_symlink branch 2 times, most recently from 25fc202 to 58290d3 Compare May 26, 2026 19:02
Comment thread cmd/nvidia-validator/main.go Outdated

// resolveHostNvidiaSMI opens and stats nvidia-smi within the mounted host root.
func resolveHostNvidiaSMI(hostRootCtrPath string) (os.FileInfo, error) {
f, err := pathrs.OpenInRoot(hostRootCtrPath, "/usr/bin/nvidia-smi")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also attempt paths like /usr/sbin/nvidia-smi and /usr/lib/wsl/lib/nvidia-smi?

cc @cdesiniotis

Copy link
Copy Markdown
Contributor Author

@JunAr7112 JunAr7112 May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tariq1890 In what systems would we see /usr/sbin/nvidia-smi and /usr/lib/wsl/lib/nvidia-smi symlinks? I think the original error was wrt NixOS systems, but I can generalize to include this as well if its a common bug

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/usr/lib/wsl/lib/nvidia-smi is used in WSL. I believe Talos and other similar distros use /usr/sbin/nvidia-smi

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should expand the search paths here, but we could do that in a follow-up. For reference, we have precedent for searching for nvidia-smi in multiple directories, e.g. https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/blob/510086482c09c9cdde2886c292fc0006fc9df926/hack/kubelet-plugin-prestart.sh#L44-L52

@JunAr7112 JunAr7112 force-pushed the fix_symlink branch 2 times, most recently from fbb0f00 to eb1c2e5 Compare May 26, 2026 21:12
Comment thread cmd/nvidia-validator/main.go Outdated
if err != nil {
return nil, fmt.Errorf("failed to open 'nvidia-smi' on the host: %w", err)
}
defer func() { _ = f.Close() }()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
defer func() { _ = f.Close() }()
defer f.Close()

Comment thread cmd/nvidia-validator/main.go Outdated
}

func runCommand(command string, args []string, silent bool) error {
var runCommand = func(command string, args []string, silent bool) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's revert this change since it is no longer needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted to func


// resolveHostNvidiaSMI opens and stats nvidia-smi within the mounted host root.
func resolveHostNvidiaSMI(hostRootCtrPath string) (os.FileInfo, error) {
f, err := pathrs.OpenInRoot(hostRootCtrPath, "/usr/bin/nvidia-smi")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should expand the search paths here, but we could do that in a follow-up. For reference, we have precedent for searching for nvidia-smi in multiple directories, e.g. https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/blob/510086482c09c9cdde2886c292fc0006fc9df926/hack/kubelet-plugin-prestart.sh#L44-L52

Comment thread cmd/nvidia-validator/main_test.go
Comment thread cmd/nvidia-validator/main_test.go Outdated
Comment on lines +61 to +65
setup: func(t *testing.T, hostRoot string) {
outsideDir := filepath.Join(filepath.Dir(hostRoot), "outside", "bin")
require.NoError(t, os.MkdirAll(outsideDir, 0755))
require.NoError(t, os.WriteFile(filepath.Join(outsideDir, "nvidia-smi"), []byte("fake nvidia-smi"), 0600))
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this? Note, the target of the symlink does not need to exist for this test case. I would vote to remove this.

Also, isn't the directory your are creating, filepath.Join(filepath.Dir(hostRoot), "outside", "bin") , within the root?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I simplified the format for this case to just use "/usr/bin": "symlink=../../", to represent the symlink target being outside the root case and got rid of setup.

For the older format, filepath.Join(filepath.Dir(hostRoot), "outside", "bin") would yield tmp/outside/bin which would be parallel to the hostRoot: /tmp/TestResolveHostNvidiaSMI.../, not within it. This format would've been more explicit with having a folder outside the root but is probably not strictly necessary.

Signed-off-by: Arjun <agadiyar@nvidia.com>
@cdesiniotis
Copy link
Copy Markdown
Contributor

@tariq1890 can you re-review this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants