Update Validator for /usr/bin symlink#2464
Conversation
a345546 to
037a688
Compare
| } | ||
| fileInfo, err := os.Lstat("/host/usr/bin/nvidia-smi") | ||
|
|
||
| nvidiaSMIPath := "/host/usr/bin/nvidia-smi" |
There was a problem hiding this comment.
Can we move this newly added logic to a new function which correctly resolves the nvidia-smi path? Then, we can easily write few unit-tests for it as well.
There was a problem hiding this comment.
+1. Let's please write unit tests in this PR.
There was a problem hiding this comment.
Added Test_validateHostDriverWithAbsoluteUsrBinSymlink(). For the unit test design, a variable hostRootControlPath was created that defaults to /host.
There was a problem hiding this comment.
The original suggestion was to move the symlink resolution code to its own function. I would strongly prefer that we do that since 1) we resolve the path to the host's nvidia-smi in two places, and 2) it greatly simplifies the unit test code.
There was a problem hiding this comment.
I created a helper function resolveHostNvidiaSMI(hostRootCtrPath string)
|
Overall, LGTM. Just a minor suggestion to have a separate function for it and also add unit-tests for it. |
cdesiniotis
left a comment
There was a problem hiding this comment.
I would recommend we use existing packages for symlink resolution within a root. For example, we could use moby's FollowSymlinkInScope or https://pkg.go.dev/github.com/cyphar/filepath-securejoin. The former will probably suffice since the host path to nvidia-smi is assumed to be trustworthy in this context.
| } | ||
| fileInfo, err := os.Lstat("/host/usr/bin/nvidia-smi") | ||
|
|
||
| nvidiaSMIPath := "/host/usr/bin/nvidia-smi" |
There was a problem hiding this comment.
+1. Let's please write unit tests in this PR.
| if filepath.IsAbs(target) { | ||
| nvidiaSMIPath = filepath.Join("/host", strings.TrimPrefix(target, "/"), "nvidia-smi") | ||
| } else { | ||
| nvidiaSMIPath = filepath.Join("/host/usr", target, "nvidia-smi") |
There was a problem hiding this comment.
I am not following this logic. For example, what happens if /host/usr/bin/nvidia-smi is a symlink to ../../nvidia-smi?
There was a problem hiding this comment.
I can try using symlinkinScope here. wrt the current format, if /usr/bin points to an absolute host path, it prefixes /host; if it is relative, it resolves it under /host/usr
f73c4ac to
2b5b248
Compare
| "testing" | ||
| ) | ||
|
|
||
| func Test_validateHostDriverWithAbsoluteUsrBinSymlink(t *testing.T) { |
There was a problem hiding this comment.
This feels overly complicated. As suggested in a prior comment, let's move the nvidia-smi resolution code to its own function and just write a unit test method for that code.
There was a problem hiding this comment.
For reference, here is a method which tests path resolution in a root: https://github.com/NVIDIA/nvidia-container-toolkit/blob/2925ef90ea40767ada6599c03a1c6a949bae85ee/cmd/nvidia-cdi-hook/cudacompat/container-root_test.go#L300-L391
There was a problem hiding this comment.
Ok I updated to realign the format more to match this
14b3a1a to
f09159d
Compare
|
|
||
| // resolveHostNvidiaSMI resolves and stats nvidia-smi within the mounted host root. | ||
| func resolveHostNvidiaSMI(hostRootCtrPath string) (os.FileInfo, error) { | ||
| nvidiaSMIPath, err := symlink.FollowSymlinkInScope(filepath.Join(hostRootCtrPath, "usr/bin/nvidia-smi"), hostRootCtrPath) |
There was a problem hiding this comment.
@cdesiniotis Should we use cyphar.SecureJoin here instead?
There was a problem hiding this comment.
I think cyphar.SecureJoin is legacy code. libpathrs is recommended instead. Would that be needed for this use case though? It would be a larger dependency
There was a problem hiding this comment.
This library was originally just an implementation of SecureJoin which was golang/go#20126 as a safer filepath.Join that would restrict the path lookup to be inside a root directory.
The implementation was based on code that existed in several container runtimes. Unfortunately, this API is fundamentally unsafe against attackers that can modify path components after SecureJoin returns and before the caller uses the path, allowing for some fairly trivial TOCTOU attacks.
https://github.com/cyphar/filepath-securejoin
In the readme they were recommending libpathrs. I think it might be easier to use a smaller dependency though, unless we need anything else from libpathrs
There was a problem hiding this comment.
As mentioned in #2464 (review), either should work. Using pathrs is a more secure approach since it only operates on file descriptors. In contrast symlink.FollowSymlinkInScope returns a path which must be resolved again when opening / executing the file. Since the host's root filesystem is a "trusted" filesystem in this context, I am not too concerned with using symlink.FollowSymlinkInScope.
However, with that said, @JunAr7112 let's switch to using https://github.com/cyphar/filepath-securejoin/tree/main/pathrs-lite in this PR. This is more secure and aligns with the work we recently did in the toolkit. The pseudocode would be:
f, _ := pathrs.OpenInRoot("/host", "/usr/bin/nvidia-smi")
fInfo, _ := f.Stat()
. . .
There was a problem hiding this comment.
Alright, I can switch to using pathrs for more consistency
There was a problem hiding this comment.
Ok @cdesiniotis I updated the MR to use pathrs
25fc202 to
58290d3
Compare
|
|
||
| // resolveHostNvidiaSMI opens and stats nvidia-smi within the mounted host root. | ||
| func resolveHostNvidiaSMI(hostRootCtrPath string) (os.FileInfo, error) { | ||
| f, err := pathrs.OpenInRoot(hostRootCtrPath, "/usr/bin/nvidia-smi") |
There was a problem hiding this comment.
Should we also attempt paths like /usr/sbin/nvidia-smi and /usr/lib/wsl/lib/nvidia-smi?
cc @cdesiniotis
There was a problem hiding this comment.
@tariq1890 In what systems would we see /usr/sbin/nvidia-smi and /usr/lib/wsl/lib/nvidia-smi symlinks? I think the original error was wrt NixOS systems, but I can generalize to include this as well if its a common bug
There was a problem hiding this comment.
/usr/lib/wsl/lib/nvidia-smi is used in WSL. I believe Talos and other similar distros use /usr/sbin/nvidia-smi
There was a problem hiding this comment.
I think we should expand the search paths here, but we could do that in a follow-up. For reference, we have precedent for searching for nvidia-smi in multiple directories, e.g. https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/blob/510086482c09c9cdde2886c292fc0006fc9df926/hack/kubelet-plugin-prestart.sh#L44-L52
fbb0f00 to
eb1c2e5
Compare
| if err != nil { | ||
| return nil, fmt.Errorf("failed to open 'nvidia-smi' on the host: %w", err) | ||
| } | ||
| defer func() { _ = f.Close() }() |
There was a problem hiding this comment.
| defer func() { _ = f.Close() }() | |
| defer f.Close() |
| } | ||
|
|
||
| func runCommand(command string, args []string, silent bool) error { | ||
| var runCommand = func(command string, args []string, silent bool) error { |
There was a problem hiding this comment.
Let's revert this change since it is no longer needed.
|
|
||
| // resolveHostNvidiaSMI opens and stats nvidia-smi within the mounted host root. | ||
| func resolveHostNvidiaSMI(hostRootCtrPath string) (os.FileInfo, error) { | ||
| f, err := pathrs.OpenInRoot(hostRootCtrPath, "/usr/bin/nvidia-smi") |
There was a problem hiding this comment.
I think we should expand the search paths here, but we could do that in a follow-up. For reference, we have precedent for searching for nvidia-smi in multiple directories, e.g. https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/blob/510086482c09c9cdde2886c292fc0006fc9df926/hack/kubelet-plugin-prestart.sh#L44-L52
| setup: func(t *testing.T, hostRoot string) { | ||
| outsideDir := filepath.Join(filepath.Dir(hostRoot), "outside", "bin") | ||
| require.NoError(t, os.MkdirAll(outsideDir, 0755)) | ||
| require.NoError(t, os.WriteFile(filepath.Join(outsideDir, "nvidia-smi"), []byte("fake nvidia-smi"), 0600)) | ||
| }, |
There was a problem hiding this comment.
Do we really need this? Note, the target of the symlink does not need to exist for this test case. I would vote to remove this.
Also, isn't the directory your are creating, filepath.Join(filepath.Dir(hostRoot), "outside", "bin") , within the root?
There was a problem hiding this comment.
Ok I simplified the format for this case to just use "/usr/bin": "symlink=../../", to represent the symlink target being outside the root case and got rid of setup.
For the older format, filepath.Join(filepath.Dir(hostRoot), "outside", "bin") would yield tmp/outside/bin which would be parallel to the hostRoot: /tmp/TestResolveHostNvidiaSMI.../, not within it. This format would've been more explicit with having a folder outside the root but is probably not strictly necessary.
Signed-off-by: Arjun <agadiyar@nvidia.com>
|
@tariq1890 can you re-review this? |
Description
This MR is in response to this issue: #1357. In certain scenarios, we might have a symlink with /usr/bin which causes os.Lstat() to be unable to properly parse the nvidia-smi location. Inside the validator container, the host root filesystem is mounted at /host. So the validator tries to resolve: /host/usr/bin/nvidia-smi
But because /host/usr/bin points to an absolute symlink target, /run/.../bin, the path resolution happens from the validator container’s root, not from /host. In other words, it looks for: /run/.../bin/nvidia-smi inside the container namespace, instead of /host/run/.../bin/nvidia-smi. So the validator falsely concludes that the host-installed driver is not present.
Change
nvidiaSMIPath, err := symlink.FollowSymlinkInScope(filepath.Join(hostRootCtrPath, "usr/bin/nvidia-smi"), hostRootCtrPath)
We will switch to using symlink.FollowSymlinkInScope to resolve absolute paths in a helper function resolveHostNvidiaSMI(hostRootCtrPath string) (os.FileInfo, error). For testing purposes, it can be set to a different value. We will need to update the func (v *VGPUManager) runValidation(...) and the func validateHostDriver(...).
Testing
Added a Unit test:
local-agadiyar@ipp1-2290:~/gpu-operator$ go test ./cmd/nvidia-validator -run TestResolveHostNvidiaSMI -count=1
ok github.com/NVIDIA/gpu-operator/cmd/nvidia-validator 0.020s
TestResolveHostNvidiaSMI verifies that the validator can find host nvidia-smi through the mounted host root, including when /usr/bin is an absolute symlink inside that host root. It builds small fake host filesystems in temp dirs, creates regular files or symlinks from the test case map, calls resolveHostNvidiaSMI(hostRoot), and checks that the helper succeeds for valid layouts and fails when nvidia-smi is missing.