-
Notifications
You must be signed in to change notification settings - Fork 417
Implement Drop-In Lifecycle Support for Containerd and Crio #1272
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pull Request Test Coverage Report for Build 17823407768Details
💛 - Coveralls |
e7d8b8b
to
c8c059c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, I think the approach here is differs from what my first instinct was to more clearly split SOURCE and TARGET / NVIDIA runtimes but to keep the logic pretty much as is. Let us assume that we make the changes as described in #1279, then the switch to drop in files SHOULD reduce to implementing GetDefaultRuntimeOptions
to pull these from the SOURCE config, whereas any functions that MODIFY the config should act on the TARGET config.
It may be useful to list the various combinations more clearly in the design doc or some kind of test here (please point it out if it is there).
|
||
defaultContainerdDropInDir = "/etc/containerd/conf.d" | ||
defaultCrioDropInDir = "/etc/crio/conf.d" | ||
defaultDropInFileName = "99.nvidia.toml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defaultDropInFileName = "99.nvidia.toml" | |
defaultDropInFileName = "99-nvidia.toml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the following:
nvidia-ctk runtime configure \
--runtime=containerd \
--config=/etc/containerd/config.toml \
--nvidia-config=/etc/containerd/conf.d/99-nvidia.toml
and
nvidia-ctk runtime configure \
--runtime=crio \
--config=/etc/crio/crio.conf \
--nvidia-config=/etc/crio/conf.d/99-nvidia.toml
the drop in dirs can be inferred as the PARENT dir of the nvidia-config
file.
Note that IFF --nvidia-config=""
we could trigger the same behaviour as --dry-run
in which case no config modifications are made to the specified config
files.
Does it make sense to set up tests for the various combinations here and then use that to drive development?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have implemented this approach. PTAL
&cli.StringFlag{ | ||
Name: "drop-in-config-path", | ||
Usage: "the path to the drop-in configuration file for the containerd runtime", | ||
Destination: &config.dropInConfigPath, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a note: Here we use the same option and in the toolkit container logic we only expose a flag for containerd.
return 0, fmt.Errorf("failed to create drop-in directory %s: %w", c.dropInConfigPath, err) | ||
} | ||
|
||
return c.NVConfig.Save(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this may be saving to the wrong path. Why is this not the drop in file?
} | ||
|
||
// Main config exists, ensure it has imports directive | ||
return d.ensureImportsDirective() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the only difference between these two cases not be whether the original config is empty or not. Updating the imports should not require different logic.
|
||
var dropInConfigPath string | ||
if b.dropInDir != "" { | ||
dropInConfigPath = b.dropInDir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be an inconsistency here. Is this a directory or a file path?
pkg/config/engine/crio/crio.go
Outdated
} | ||
|
||
var dropInConfigPath string | ||
if b.dropInDir != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue that IF the dropIn config path is ""
then we should have the same behviour as we currently have.
cmd/nvidia-ctk-installer/container/runtime/containerd/config_v2_test.go
Outdated
Show resolved
Hide resolved
Fixes #1260 |
c8c059c
to
aafd1c4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements drop-in lifecycle support for container runtimes (containerd and CRI-O) by adding support for NVIDIA-specific configuration files that can be managed separately from the main runtime configuration files.
- Adds a new
WithNvidiaConfig
option for both containerd and CRI-O configs to specify a separate NVIDIA configuration file path - Implements specialized save logic that creates NVIDIA-specific config files containing only runtime configurations
- Updates the configure command and installer to support the new
--nvidia-config
flag
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
File | Description |
---|---|
pkg/config/engine/crio/option.go | Adds WithNvidiaConfig option to specify NVIDIA-specific config file path |
pkg/config/engine/crio/crio.go | Implements NVIDIA config file support with specialized save/remove logic |
pkg/config/engine/containerd/option.go | Adds WithNvidiaConfig option for containerd configuration |
pkg/config/engine/containerd/containerd.go | Adds nvidiaConfig field to Config struct |
pkg/config/engine/containerd/config_v1.go | Updates Save method call to use pointer receiver |
pkg/config/engine/containerd/config.go | Implements NVIDIA config file support with import handling for containerd |
cmd/nvidia-ctk/runtime/configure/configure.go | Adds --nvidia-config flag and passes it to runtime configs |
cmd/nvidia-ctk-installer/container/runtime/runtime.go | Adds nvidia-config flag to installer runtime options |
cmd/nvidia-ctk-installer/container/runtime/crio/crio.go | Updates CRI-O installer to use nvidia config option |
cmd/nvidia-ctk-installer/container/runtime/containerd/containerd.go | Updates containerd installer to use nvidia config option |
cmd/nvidia-ctk-installer/container/container.go | Adds NvidiaConfig field to shared container options |
// If using NVIDIA-specific configuration, handle file cleanup | ||
if c.nvidiaConfig != "" { | ||
// Check if all NVIDIA runtimes are being removed | ||
remainingNvidiaRuntimes := 0 | ||
if runtimes := c.GetPath([]string{"crio", "runtime", "runtimes"}); runtimes != nil { | ||
if runtimesTree, ok := runtimes.(*toml.Tree); ok { | ||
for _, runtimeName := range runtimesTree.Keys() { | ||
if c.isNvidiaRuntime(runtimeName) && runtimeName != name { | ||
remainingNvidiaRuntimes++ | ||
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The runtime counting logic iterates through all runtimes every time a runtime is removed. Consider caching the NVIDIA runtime count or using a more efficient approach when removing multiple runtimes in sequence.
Copilot uses AI. Check for mistakes.
// If using NVIDIA-specific configuration, handle file cleanup | ||
if c.nvidiaConfig != "" { | ||
// Check if all NVIDIA runtimes are being removed | ||
remainingNvidiaRuntimes := 0 | ||
if runtimes := c.GetPath([]string{"plugins", c.CRIRuntimePluginName, "containerd", "runtimes"}); runtimes != nil { | ||
if runtimesTree, ok := runtimes.(*toml.Tree); ok { | ||
for _, runtimeName := range runtimesTree.Keys() { | ||
if c.isNvidiaRuntime(runtimeName) && runtimeName != name { | ||
remainingNvidiaRuntimes++ | ||
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This runtime counting logic is duplicated between CRI-O and containerd implementations. Consider extracting this into a shared helper function to reduce code duplication.
Copilot uses AI. Check for mistakes.
// Update main config with imports directive | ||
if err := c.updateMainConfigImports(path); err != nil { | ||
// Try to clean up the NVIDIA config file on error | ||
os.Remove(c.nvidiaConfig) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error from os.Remove
is silently ignored during cleanup. Consider logging the cleanup failure or handling it appropriately, as this could indicate permission issues or other problems.
os.Remove(c.nvidiaConfig) | |
if removeErr := os.Remove(c.nvidiaConfig); removeErr != nil { | |
c.Logger.Errorf("failed to remove NVIDIA config file during cleanup: %v", removeErr) | |
} |
Copilot uses AI. Check for mistakes.
if imports == nil { | ||
mainConfig.Set("imports", []string{importPattern}) | ||
} else if importsList, ok := imports.([]interface{}); ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The imports handling logic has complex type checking with multiple nested conditions. Consider refactoring this into separate helper functions to improve readability and maintainability.
Copilot uses AI. Check for mistakes.
This PR now rebases #1279 and builds on the concepts introduced there; now this PR is smaller and focused. |
aafd1c4
to
9a43829
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
func (c ConfigV1) Save(path string) (int64, error) { | ||
return (Config)(c).Save(path) | ||
return (*Config)(&c).Save(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type conversion creates a pointer to a copy of the struct instead of using the existing pointer. It should be return ((*Config)(c)).Save(path)
to avoid creating an unnecessary copy.
Copilot uses AI. Check for mistakes.
func (c *Config) isNvidiaRuntime(name string) bool { | ||
return name == "nvidia" || name == "nvidia-cdi" || name == "nvidia-legacy" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hardcoded runtime names are duplicated between containerd and crio implementations. Consider extracting these to a shared constant or configuration to avoid inconsistencies.
Copilot uses AI. Check for mistakes.
// isNvidiaRuntime checks if the runtime name is an NVIDIA runtime | ||
func (c *Config) isNvidiaRuntime(name string) bool { | ||
return name == "nvidia" || name == "nvidia-cdi" || name == "nvidia-legacy" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hardcoded runtime names are duplicated between containerd and crio implementations. Consider extracting these to a shared constant or configuration to avoid inconsistencies.
// isNvidiaRuntime checks if the runtime name is an NVIDIA runtime | |
func (c *Config) isNvidiaRuntime(name string) bool { | |
return name == "nvidia" || name == "nvidia-cdi" || name == "nvidia-legacy" | |
// nvidiaRuntimeNames contains the recognized NVIDIA runtime names. | |
var nvidiaRuntimeNames = []string{"nvidia", "nvidia-cdi", "nvidia-legacy"} | |
// isNvidiaRuntime checks if the runtime name is an NVIDIA runtime | |
func (c *Config) isNvidiaRuntime(name string) bool { | |
for _, n := range nvidiaRuntimeNames { | |
if name == n { | |
return true | |
} | |
} | |
return false |
Copilot uses AI. Check for mistakes.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
9a43829
to
516cac1
Compare
Name: "drop-in-config", | ||
Usage: "Path to the NVIDIA-specific config file to create. When specified, runtime configurations are saved to this file instead of modifying the main config file", | ||
Destination: &opts.NvidiaConfig, | ||
Value: defaultNVIDIARuntimeConfigFilePath, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: How does this work for crio? Do we also have the ability to update the imports there? Is this something that we want to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is possible to point to a diff location than the default one (see https://www.ibm.com/docs/en/zoscp/1.1.0?topic=options-crio-commands-flags#crio-commands__title__4)
--config-dir | Path to the configuration drop-in directory.
So we could offer the same drop-in-config
flag on the crio
side.
Let me see how that would look like
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear. I'm ok for runtime-specific defaults here. We already do this for other settings here:
nvidia-container-toolkit/cmd/nvidia-ctk-installer/container/runtime/runtime.go
Lines 128 to 159 in 470b841
// Apply the runtime-specific config changes. | |
switch runtime { | |
case containerd.Name: | |
if opts.Config == runtimeSpecificDefault { | |
opts.Config = containerd.DefaultConfig | |
} | |
if opts.Socket == runtimeSpecificDefault { | |
opts.Socket = containerd.DefaultSocket | |
} | |
if opts.RestartMode == runtimeSpecificDefault { | |
opts.RestartMode = containerd.DefaultRestartMode | |
} | |
case crio.Name: | |
if opts.Config == runtimeSpecificDefault { | |
opts.Config = crio.DefaultConfig | |
} | |
if opts.Socket == runtimeSpecificDefault { | |
opts.Socket = crio.DefaultSocket | |
} | |
if opts.RestartMode == runtimeSpecificDefault { | |
opts.RestartMode = crio.DefaultRestartMode | |
} | |
case docker.Name: | |
if opts.Config == runtimeSpecificDefault { | |
opts.Config = docker.DefaultConfig | |
} | |
if opts.Socket == runtimeSpecificDefault { | |
opts.Socket = docker.DefaultSocket | |
} | |
if opts.RestartMode == runtimeSpecificDefault { | |
opts.RestartMode = docker.DefaultRestartMode | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think is ok to go with defaults for now
No description provided.