Skip to content

Conversation

ArangoGutierrez
Copy link
Collaborator

No description provided.

Copilot

This comment was marked as outdated.

@coveralls
Copy link

coveralls commented Aug 27, 2025

Pull Request Test Coverage Report for Build 17823407768

Details

  • 6 of 188 (3.19%) changed or added relevant lines in 10 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.5%) to 36.229%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/config/engine/containerd/config_v1.go 0 1 0.0%
cmd/nvidia-ctk/runtime/configure/configure.go 0 2 0.0%
pkg/config/engine/containerd/option.go 0 4 0.0%
pkg/config/engine/crio/option.go 0 4 0.0%
cmd/nvidia-ctk-installer/container/runtime/runtime.go 0 6 0.0%
pkg/config/engine/crio/crio.go 3 58 5.17%
pkg/config/engine/containerd/config.go 0 110 0.0%
Totals Coverage Status
Change from base Build 17802483936: -0.5%
Covered Lines: 4804
Relevant Lines: 13260

💛 - Coveralls

Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, I think the approach here is differs from what my first instinct was to more clearly split SOURCE and TARGET / NVIDIA runtimes but to keep the logic pretty much as is. Let us assume that we make the changes as described in #1279, then the switch to drop in files SHOULD reduce to implementing GetDefaultRuntimeOptions to pull these from the SOURCE config, whereas any functions that MODIFY the config should act on the TARGET config.

It may be useful to list the various combinations more clearly in the design doc or some kind of test here (please point it out if it is there).


defaultContainerdDropInDir = "/etc/containerd/conf.d"
defaultCrioDropInDir = "/etc/crio/conf.d"
defaultDropInFileName = "99.nvidia.toml"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
defaultDropInFileName = "99.nvidia.toml"
defaultDropInFileName = "99-nvidia.toml"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the following:

nvidia-ctk runtime configure \
    --runtime=containerd \
    --config=/etc/containerd/config.toml \
    --nvidia-config=/etc/containerd/conf.d/99-nvidia.toml

and

nvidia-ctk runtime configure \
    --runtime=crio \
    --config=/etc/crio/crio.conf \
    --nvidia-config=/etc/crio/conf.d/99-nvidia.toml

the drop in dirs can be inferred as the PARENT dir of the nvidia-config file.

Note that IFF --nvidia-config="" we could trigger the same behaviour as --dry-run in which case no config modifications are made to the specified config files.

Does it make sense to set up tests for the various combinations here and then use that to drive development?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have implemented this approach. PTAL

Comment on lines 178 to 182
&cli.StringFlag{
Name: "drop-in-config-path",
Usage: "the path to the drop-in configuration file for the containerd runtime",
Destination: &config.dropInConfigPath,
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a note: Here we use the same option and in the toolkit container logic we only expose a flag for containerd.

return 0, fmt.Errorf("failed to create drop-in directory %s: %w", c.dropInConfigPath, err)
}

return c.NVConfig.Save(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this may be saving to the wrong path. Why is this not the drop in file?

}

// Main config exists, ensure it has imports directive
return d.ensureImportsDirective()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the only difference between these two cases not be whether the original config is empty or not. Updating the imports should not require different logic.


var dropInConfigPath string
if b.dropInDir != "" {
dropInConfigPath = b.dropInDir
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be an inconsistency here. Is this a directory or a file path?

}

var dropInConfigPath string
if b.dropInDir != "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that IF the dropIn config path is "" then we should have the same behviour as we currently have.

@ArangoGutierrez
Copy link
Collaborator Author

Fixes #1260

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements drop-in lifecycle support for container runtimes (containerd and CRI-O) by adding support for NVIDIA-specific configuration files that can be managed separately from the main runtime configuration files.

  • Adds a new WithNvidiaConfig option for both containerd and CRI-O configs to specify a separate NVIDIA configuration file path
  • Implements specialized save logic that creates NVIDIA-specific config files containing only runtime configurations
  • Updates the configure command and installer to support the new --nvidia-config flag

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pkg/config/engine/crio/option.go Adds WithNvidiaConfig option to specify NVIDIA-specific config file path
pkg/config/engine/crio/crio.go Implements NVIDIA config file support with specialized save/remove logic
pkg/config/engine/containerd/option.go Adds WithNvidiaConfig option for containerd configuration
pkg/config/engine/containerd/containerd.go Adds nvidiaConfig field to Config struct
pkg/config/engine/containerd/config_v1.go Updates Save method call to use pointer receiver
pkg/config/engine/containerd/config.go Implements NVIDIA config file support with import handling for containerd
cmd/nvidia-ctk/runtime/configure/configure.go Adds --nvidia-config flag and passes it to runtime configs
cmd/nvidia-ctk-installer/container/runtime/runtime.go Adds nvidia-config flag to installer runtime options
cmd/nvidia-ctk-installer/container/runtime/crio/crio.go Updates CRI-O installer to use nvidia config option
cmd/nvidia-ctk-installer/container/runtime/containerd/containerd.go Updates containerd installer to use nvidia config option
cmd/nvidia-ctk-installer/container/container.go Adds NvidiaConfig field to shared container options

Comment on lines +144 to +156
// If using NVIDIA-specific configuration, handle file cleanup
if c.nvidiaConfig != "" {
// Check if all NVIDIA runtimes are being removed
remainingNvidiaRuntimes := 0
if runtimes := c.GetPath([]string{"crio", "runtime", "runtimes"}); runtimes != nil {
if runtimesTree, ok := runtimes.(*toml.Tree); ok {
for _, runtimeName := range runtimesTree.Keys() {
if c.isNvidiaRuntime(runtimeName) && runtimeName != name {
remainingNvidiaRuntimes++
}
}
}
}
Copy link
Preview

Copilot AI Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The runtime counting logic iterates through all runtimes every time a runtime is removed. Consider caching the NVIDIA runtime count or using a more efficient approach when removing multiple runtimes in sequence.

Copilot uses AI. Check for mistakes.

Comment on lines +134 to +146
// If using NVIDIA-specific configuration, handle file cleanup
if c.nvidiaConfig != "" {
// Check if all NVIDIA runtimes are being removed
remainingNvidiaRuntimes := 0
if runtimes := c.GetPath([]string{"plugins", c.CRIRuntimePluginName, "containerd", "runtimes"}); runtimes != nil {
if runtimesTree, ok := runtimes.(*toml.Tree); ok {
for _, runtimeName := range runtimesTree.Keys() {
if c.isNvidiaRuntime(runtimeName) && runtimeName != name {
remainingNvidiaRuntimes++
}
}
}
}
Copy link
Preview

Copilot AI Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This runtime counting logic is duplicated between CRI-O and containerd implementations. Consider extracting this into a shared helper function to reduce code duplication.

Copilot uses AI. Check for mistakes.

// Update main config with imports directive
if err := c.updateMainConfigImports(path); err != nil {
// Try to clean up the NVIDIA config file on error
os.Remove(c.nvidiaConfig)
Copy link
Preview

Copilot AI Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error from os.Remove is silently ignored during cleanup. Consider logging the cleanup failure or handling it appropriately, as this could indicate permission issues or other problems.

Suggested change
os.Remove(c.nvidiaConfig)
if removeErr := os.Remove(c.nvidiaConfig); removeErr != nil {
c.Logger.Errorf("failed to remove NVIDIA config file during cleanup: %v", removeErr)
}

Copilot uses AI. Check for mistakes.

Comment on lines +273 to +275
if imports == nil {
mainConfig.Set("imports", []string{importPattern})
} else if importsList, ok := imports.([]interface{}); ok {
Copy link
Preview

Copilot AI Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The imports handling logic has complex type checking with multiple nested conditions. Consider refactoring this into separate helper functions to improve readability and maintainability.

Copilot uses AI. Check for mistakes.

@ArangoGutierrez
Copy link
Collaborator Author

As discussed offline, I think the approach here is differs from what my first instinct was to more clearly split SOURCE and TARGET / NVIDIA runtimes but to keep the logic pretty much as is. Let us assume that we make the changes as described in #1279, then the switch to drop in files SHOULD reduce to implementing GetDefaultRuntimeOptions to pull these from the SOURCE config, whereas any functions that MODIFY the config should act on the TARGET config.

It may be useful to list the various combinations more clearly in the design doc or some kind of test here (please point it out if it is there).

This PR now rebases #1279 and builds on the concepts introduced there; now this PR is smaller and focused.
The test bit's are also in an independent PR #1293

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Comment on lines 124 to +125
func (c ConfigV1) Save(path string) (int64, error) {
return (Config)(c).Save(path)
return (*Config)(&c).Save(path)
Copy link
Preview

Copilot AI Sep 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type conversion creates a pointer to a copy of the struct instead of using the existing pointer. It should be return ((*Config)(c)).Save(path) to avoid creating an unnecessary copy.

Copilot uses AI. Check for mistakes.

Comment on lines +264 to +266
func (c *Config) isNvidiaRuntime(name string) bool {
return name == "nvidia" || name == "nvidia-cdi" || name == "nvidia-legacy"
}
Copy link
Preview

Copilot AI Sep 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardcoded runtime names are duplicated between containerd and crio implementations. Consider extracting these to a shared constant or configuration to avoid inconsistencies.

Copilot uses AI. Check for mistakes.

Comment on lines +312 to +314
// isNvidiaRuntime checks if the runtime name is an NVIDIA runtime
func (c *Config) isNvidiaRuntime(name string) bool {
return name == "nvidia" || name == "nvidia-cdi" || name == "nvidia-legacy"
Copy link
Preview

Copilot AI Sep 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardcoded runtime names are duplicated between containerd and crio implementations. Consider extracting these to a shared constant or configuration to avoid inconsistencies.

Suggested change
// isNvidiaRuntime checks if the runtime name is an NVIDIA runtime
func (c *Config) isNvidiaRuntime(name string) bool {
return name == "nvidia" || name == "nvidia-cdi" || name == "nvidia-legacy"
// nvidiaRuntimeNames contains the recognized NVIDIA runtime names.
var nvidiaRuntimeNames = []string{"nvidia", "nvidia-cdi", "nvidia-legacy"}
// isNvidiaRuntime checks if the runtime name is an NVIDIA runtime
func (c *Config) isNvidiaRuntime(name string) bool {
for _, n := range nvidiaRuntimeNames {
if name == n {
return true
}
}
return false

Copilot uses AI. Check for mistakes.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Name: "drop-in-config",
Usage: "Path to the NVIDIA-specific config file to create. When specified, runtime configurations are saved to this file instead of modifying the main config file",
Destination: &opts.NvidiaConfig,
Value: defaultNVIDIARuntimeConfigFilePath,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: How does this work for crio? Do we also have the ability to update the imports there? Is this something that we want to do?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to point to a diff location than the default one (see https://www.ibm.com/docs/en/zoscp/1.1.0?topic=options-crio-commands-flags#crio-commands__title__4)
--config-dir | Path to the configuration drop-in directory.

So we could offer the same drop-in-config flag on the crio side.
Let me see how that would look like

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear. I'm ok for runtime-specific defaults here. We already do this for other settings here:

// Apply the runtime-specific config changes.
switch runtime {
case containerd.Name:
if opts.Config == runtimeSpecificDefault {
opts.Config = containerd.DefaultConfig
}
if opts.Socket == runtimeSpecificDefault {
opts.Socket = containerd.DefaultSocket
}
if opts.RestartMode == runtimeSpecificDefault {
opts.RestartMode = containerd.DefaultRestartMode
}
case crio.Name:
if opts.Config == runtimeSpecificDefault {
opts.Config = crio.DefaultConfig
}
if opts.Socket == runtimeSpecificDefault {
opts.Socket = crio.DefaultSocket
}
if opts.RestartMode == runtimeSpecificDefault {
opts.RestartMode = crio.DefaultRestartMode
}
case docker.Name:
if opts.Config == runtimeSpecificDefault {
opts.Config = docker.DefaultConfig
}
if opts.Socket == runtimeSpecificDefault {
opts.Socket = docker.DefaultSocket
}
if opts.RestartMode == runtimeSpecificDefault {
opts.RestartMode = docker.DefaultRestartMode
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think is ok to go with defaults for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants