-
Notifications
You must be signed in to change notification settings - Fork 440
Add nvidia-container-cli.compat-mode config option #1055
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3b27c30 to
1db1b88
Compare
977723e to
aa0cb99
Compare
klueska
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the test only tests with the "ldconfig" method and nothing else. Is this intentional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just to check that the default config is generated correctly. We don't currently have a setting to override this, but we probably want to expose it.
internal/config/cli.go
Outdated
| CUDACompatModeMount = "mount" | ||
| CUDACompatModeLdconfig = "ldconfig" | ||
| CUDACompatModeHook = "hook" | ||
| CUDACompatModeDisabled = "disabled" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the runtime has one more "mode" than the nvidia-container-cli, i.e. the hook mode -- is that the interpretation here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is correct. Thinking about it now, I don't think we should put this setting in the nvidia-container-cli section, but rather in the nvidia-container-runtime section.
aa0cb99 to
6b33bcb
Compare
internal/config/cli.go
Outdated
| // CUDACompatMode sets the mode to be used to make CUDA Forward Compat | ||
| // libraries discoverable in the container. | ||
| CUDACompatMode cudaCompatMode `toml:"cuda-compat-mode,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this rather be at:
nvidia-container-runtime.cuda-compat-mode = ldconfig
or
nvidia-container-runtime.modes.legacy.cuda-compat-mode = ldconfig
I think the latter since it actually only affects the behaviour of the legacy runtime mode, with the hook always being injected in other modes except if it is explicitly opted out of by the feature flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion here, but I think your latter suggestion makes sense. At first, I thought this made more sense in the nvidia-container-cli section, but because of the value hook, which only applies when the NVIDIA Container Runtime is used, I think it is okay to nest this option under nvidia-container-runtime.modes.legacy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it's cleaner this way. I have updated the implementation accordingly.
d2daf38 to
742ff37
Compare
| args = append(args, "--no-cntlibs") | ||
| switch hook.NVIDIAContainerRuntimeConfig.Modes.Legacy.CUDACompatMode { | ||
| case config.CUDACompatModeLdconfig: | ||
| args = append(args, "--cuda-compat-mode=ldconfig") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of repeating args = append(args, "<flag>") statements, can we assign the resolved flag to a local variable and append it to args after the switch-case block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a method to the hookConfig type to pull this logic into. I can see us doing something similar for the other flags as a follow up.
742ff37 to
50959cf
Compare
This change adds an nvidia-container-runtime.modes.legacy.cuda-compat-mode config option. This can be set to one of four values: * ldconfig (default): the --cuda-compat-mode=ldconfig flag is passed to the nvidia-container-cli * mount: the --cuda-compat-mode=mount flag is passed to the nvidia-conainer-cli * disabled: the --cuda-compat-mode=disabled flag is passed to the nvidia-container-cli * hook: the --cuda-compat-mode=disabled flag is passed to the nvidia-container-cli AND the enable-cuda-compat hook is used to provide forward compatibility. Note that the disable-cuda-compat-lib-hook feature flag will prevent the enable-cuda-compat hook from being used. This change also means that the allow-cuda-compat-libs-from-container feature flag no longer has any effect. Signed-off-by: Evan Lezar <elezar@nvidia.com>
50959cf to
f4981f0
Compare
This change adds a config option to trigger the
--cuda-compat-mode=ldconfigoption added to thenvidia-container-cliin NVIDIA/libnvidia-container#307.The
nvidia-container-runtime.modes.legacy.cuda-compat-modeoption controls the behaviour of both thenvidia-container-runtime-hookand thenvidia-container-runtime. The main motivation being that the hook-based CUDA Forward Compat support that was added as part of the v1.17.5 release requires thenvidia-container-runtimeand as such does not work when only the Docker--gpusflag is specified.Testing:
cuda-compat-mode unset (upgrades)
--gpusflagnvidiaruntimecuda-compat-mode=ldconfig
--gpusflagnvidiaruntimecuda-compat-mode=mount
--gpusflagnvidiaruntimecuda-compat-mode=disabled
--gpusflagnvidiaruntimecuda-compat-mode=hook
--gpusflagNote that this has CUDA Forward Compat disabled
nvidiaruntimenvidiaruntime withdisable-cuda-compat-hook = true