Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add platform detection logic from nvidia-container-toolkit #28

Merged
merged 4 commits into from
May 21, 2024

Conversation

elezar
Copy link
Member

@elezar elezar commented Mar 26, 2024

This ports the function from nvidia-container-toolkit to go-nvlib for more consistent reuse.

See:

which incorporate these changes into the toolkit and device plugin.

@elezar elezar requested a review from klueska March 26, 2024 08:47
@elezar elezar self-assigned this Mar 26, 2024
pkg/nvlib/info/info.go Outdated Show resolved Hide resolved
Comment on lines 134 to 102
// UsesNVGPUModule checks whether the nvgpu module is used.
// We use the device name to signal this, since devices that use the nvgpu module have their device
// names as:
//
// GPU 0: Orin (nvgpu) (UUID: 54d0709b-558d-5a59-9c65-0c5fc14a21a4)
//
// This function returns true if ALL devices use the nvgpu module.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly is the nvgpu module? Is this the name of the kernel module for iGPUs? Maybe mention that here somehwere if so.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is the kernel module for igpus. I can update the docstring.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated this in the latest. Please take another look.

@elezar elezar force-pushed the fix-igpu-nvml-auto branch 4 times, most recently from 2202e44 to 9bbfd8e Compare April 6, 2024 10:32
@elezar elezar requested a review from klueska April 6, 2024 10:33
@elezar elezar changed the title Add UsesNVGPUModule check to info module Add platform detection logic from nvidia-container-toolkit Apr 22, 2024
@elezar elezar requested a review from cdesiniotis April 23, 2024 08:58
cdesiniotis
cdesiniotis previously approved these changes Apr 23, 2024
Copy link
Contributor

@cdesiniotis cdesiniotis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good. I left one non-blocking question concerning the behavior when platform detection fails.

if (isTegra && !hasNVML) || usesNVGPUModule {
return "csv"
}
return "nvml"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of scope for this PR - would it be better to return "" if an unknown platform is detected? I understand our components currently default to using "nvml" mode if platform detection fails, but would it be better for each client to decide how they want to handle this case where platform detection fails?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, now that I think about it, we definitely should not be returning nvml by default here. For example, while reviewing NVIDIA/k8s-device-plugin#673, I notice we fail early when platform != info.PlatformNVML && platform != info.PlatformTegra.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right. This was initially a mode check, but I reworked it to be a platform check intead. I will add a PlatformUnknown return value instead.

@@ -28,7 +29,7 @@ build:
GOOS=$(GOOS) go build ./...

all: check build binary
check: assert-fmt lint vet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't necessarily have an opinion here, just curious what made you decide to remove assert-fmt and vet from the checks?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all handled by golangci-lint (could have been a different PR though).

"github.com/NVIDIA/go-nvml/pkg/dl"
)

type root string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this rootPath?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for a more generic name than *Path is that this type is intended to be used to extract more information about the driver root. There are multiple implementations of it in various projects and I need to consolidate them at some point soon.

The idea is that this struct will at least be able to:

  • Return the path to the driver libraries at this root.
  • Return the path to the device nodes at this path (or /)

I don't know whether the Path suffix conveys this.

I'm happy to make the change here now if you feel strongly about it since we don't have the other functionality consolidated yet.

Comment on lines 19 to 38
type builder struct {
root root
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we call this options since its the place to enumerate / gather the options passed in?

Comment on lines 35 to 78
func (b *builder) build() Interface {
return &infolib{
root: root(b.root),
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this extra function necessary? Do we call it anywhere other than in in New()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, your're right. The initial implementation was structured a bit differently, and I did not change this when I simplified things. Let me rework it.

@@ -23,9 +23,10 @@ type Interface interface {

// Properties provides a set of functions to query capabilities of the system.
//
//go:generate moq -stub -out properties_mock.go . Properties
//go:generate moq -rm -stub -out properties_mock.go . Properties
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does the -rm do? Remove before regenerating?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I added it here, because I always get errors and have to manually remove the file before generating the mocks.

@@ -29,11 +37,21 @@ func New(opts ...Option) Interface {
if b.root == "" {
b.root = "/"
}
if b.nvmllib == nil {
b.nvmllib = nvml.New(
nvml.WithLibraryPath(b.root.tryResolveLibrary("libnvidia-ml.so.1")),
Copy link
Contributor

@klueska klueska Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have a constant for "libnvidia-ml.so.1"? Does it actually make sense to define the constant in go-nvml itself?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have:

	defaultNvmlLibraryName      = "libnvidia-ml.so.1"

in go-nvml. I suppose we can make it public.

// GPU 0: Orin (nvgpu) (UUID: 54d0709b-558d-5a59-9c65-0c5fc14a21a4)
//
// This function returns true if ALL devices use the nvgpu module.
func (i *infolib) UsesNVGPUModule() (uses bool, reason string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe call it OnlyUsesNVGPUModule (or something else to indicate that it only returns true if all. GPUs are managed by this module).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's fair. Let me update.

Comment on lines 97 to 110
// We ensure that this function never panics
defer func() {
if err := recover(); err != nil {
uses = false
reason = fmt.Sprintf("panic: %v", err)
}
}()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What might trigger a panic here to necessitate this (and not necessitate it in the other calls)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to possibly panic in nvml.Shutdown. This is no longer the case and as such it can probably be removed.

//
// Devices that use the nvgpu module have their device names as:
//
// GPU 0: Orin (nvgpu) (UUID: 54d0709b-558d-5a59-9c65-0c5fc14a21a4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this naming likely to be stable? I guess we can keep adding cases to this function to adapt as things change, but it still feel "brittle" to rely solely on the name like this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's definitely brittle. If you recall we were pushing for a more stable mechanism to check this when nvml support was added to Orin-based systems. This was the suggested workaround. We should definitely revisit it.

Comment on lines 25 to 29
// Resolver defines a function to resolve a mode.
type Resolver interface {
ResolvePlatform() Platform
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolve a mode or resolve a platform?

Also, as a caller, info.Resolver() is not a very user-friendly name. maybe the more verbose PlatformResolver?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This was initially copying the concept of a mode from the NVIDIA container toolkit, but it isn't flexible enought and needs to be a lower level to also support the device plugin use cases. The comment needs to be updated and I can update the type too.

root root
nvmllib nvml.Interface
devicelib device.Interface
}

var _ Properties = &info{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm starting to get lost in all the differences in naming and levels of indirection.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a fair point. I made another pass at this that I think is a lot cleaner.

I now call the interface PropertyExtractor and implement the Has* or Uses* functions there. The info type has also been explicitly renamed to propertyExtractor.

Comment on lines 29 to 33
type infolib struct {
logger basicLogger
Properties
Resolver
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So an infolib now IS a "Properties" and IS a "Resolver", and HAS a logger. Is there a better name for the Properties interface? I think that is where I am getting confused.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose Properties because the interface implements a set of bool, string functions that return information about the system. These system properties are then used to make decissions -- such as determining the platform.

I'm happy to discuss alternative names.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The top-level logger is not needed and has been removed. I have also renamed Properties to ProperyExtractor in the lates revision.

Properties
}

func (p platformResolver) ResolvePlatform() Platform {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This abstraction seems nice and straightforward. Where will it be consumed exactly? Is it meant as a replacement for direct calls to HasNvml() IsTegraSystem() etc.?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For most cases, yes, the ResolvePlatform() call will replace the individual calls and replicating the conditional logic.

I have already updated the nvidia-container-toolkit and device-plugin to make use of this functionality:

Note that the "Properties" themselves may still be required for edge cases, but we could probably make them internal at somet point.

@elezar elezar force-pushed the fix-igpu-nvml-auto branch 3 times, most recently from 659c752 to 5c049b1 Compare April 24, 2024 10:33
elezar added 4 commits May 2, 2024 14:07
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a PropertyExtractor interface to encapsulate functions
that query a system for certain capabilities.

The IsTegraSystem has been renamed to HasTegraFiles function and marked
as Deprecated.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar requested a review from klueska May 2, 2024 12:42
@elezar elezar merged commit 7604335 into main May 21, 2024
4 checks passed
@elezar elezar deleted the fix-igpu-nvml-auto branch May 21, 2024 09:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants