Conversation


@rahulait rahulait commented Dec 25, 2025

Dependencies

Depends on: NVIDIA/k8s-device-plugin#1550

Description

Problem

GPU Operator supports deploying multiple driver versions within a single Kubernetes cluster through the use of multiple NvidiaDriver custom resources (CRs). However, despite supporting multiple driver instances, the GPU Operator currently deploys only a single, cluster-wide NVIDIA Container Toolkit DaemonSet and a single NVIDIA Device Plugin DaemonSet.
This architecture introduces a limitation when different NvidiaDriver CRs enable different driver-dependent features - such as GPUDirect Storage (GDS), GDRCopy, or other optional components. Because the Container Toolkit and Device Plugin are deployed once per cluster and configured uniformly, they cannot be tailored to the feature differences across driver instances. As a result, nodes running drivers with differing enabled features cannot be supported correctly or independently.

Proposed solution

During reconciliation, the GPU Operator will inject additional driver-enablement environment variables into the nvidia-driver container based on the ClusterPolicy or NvidiaDriver CR selected for the node. The driver container will then persist these variables to the filesystem of the host it runs on.
With this mechanism, each node records a node-local view of the enabled additional drivers, accurately reflecting the features configured for that node via its ClusterPolicy or NvidiaDriver CR.
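As a rough sketch of the injection step (the package, function name, and signature below are hypothetical, not the operator's actual code; only the variable names match the flags file shown in the test output), the idea looks roughly like this:

```go
// Hypothetical sketch: append feature-enablement environment variables to the
// nvidia-driver container based on the feature toggles resolved from the
// ClusterPolicy or NVIDIADriver CR selected for the node.
package transforms

import corev1 "k8s.io/api/core/v1"

// addDriverFeatureEnvVars records which optional drivers are enabled so the
// driver container can persist them to the host filesystem it runs on.
func addDriverFeatureEnvVars(c *corev1.Container, gds, gdrcopy, gpuDirectRDMA bool) {
	set := func(name string, enabled bool) {
		value := "false"
		if enabled {
			value = "true"
		}
		c.Env = append(c.Env, corev1.EnvVar{Name: name, Value: value})
	}
	set("GDS_ENABLED", gds)
	set("GDRCOPY_ENABLED", gdrcopy)
	set("GPU_DIRECT_RDMA_ENABLED", gpuDirectRDMA)
}
```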

We are also updating the GPU Operator's driver validation logic so that it waits for all enabled drivers to be installed before proceeding.
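A minimal sketch of that waiting behavior, assuming the validator reads the node-local flags file and polls until each enabled feature's kernel module is present (the file format matches the test output below; the module names and the /sys/module check are illustrative assumptions, not the validator's exact code):

```go
// Illustrative sketch only: wait until every additional driver enabled for this
// node (per the node-local flags file) has its kernel module loaded.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// Assumed mapping from enablement flag to kernel module name.
var moduleFor = map[string]string{
	"GDS_ENABLED":     "nvidia_fs",
	"GDRCOPY_ENABLED": "gdrdrv",
}

// enabledModules parses lines of the form "GDS_ENABLED: true".
func enabledModules(path string) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var mods []string
	s := bufio.NewScanner(f)
	for s.Scan() {
		parts := strings.SplitN(s.Text(), ":", 2)
		if len(parts) != 2 {
			continue
		}
		key, val := strings.TrimSpace(parts[0]), strings.TrimSpace(parts[1])
		if val == "true" {
			if mod, ok := moduleFor[key]; ok {
				mods = append(mods, mod)
			}
		}
	}
	return mods, s.Err()
}

func main() {
	mods, err := enabledModules("/run/nvidia/driver/.additional-drivers-flags")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, m := range mods {
		// A directory under /sys/module appears once the module is loaded.
		for {
			if _, err := os.Stat("/sys/module/" + m); err == nil {
				break
			}
			fmt.Printf("waiting for kernel module %s\n", m)
			time.Sleep(5 * time.Second)
		}
	}
	fmt.Println("all enabled additional drivers are installed")
}
```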

The NVIDIA device plugin is already resilient to missing devices or drivers and does not crash when a particular device is absent from the node. We are updating the device plugin to always attempt discovery of all supported devices and driver features.
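A minimal sketch of the always-attempt-discovery idea (the probe names and marker paths are assumptions for illustration, not the device plugin's actual discovery code): an absent optional driver simply yields nothing and a log line instead of an error.

```go
// Illustrative sketch: probe every supported optional feature and treat an
// absent driver as a normal, empty result rather than a fatal error.
package main

import (
	"errors"
	"log"
	"os"
)

// featureProbe describes a path assumed to exist only when the feature's driver is loaded.
type featureProbe struct {
	name string
	path string
}

func main() {
	probes := []featureProbe{
		{name: "gds", path: "/sys/module/nvidia_fs"}, // assumed marker for GDS
		{name: "gdrcopy", path: "/dev/gdrdrv"},       // assumed marker for GDRCopy
	}
	for _, p := range probes {
		if _, err := os.Stat(p.path); errors.Is(err, os.ErrNotExist) {
			// Expected on nodes that did not enable this feature; keep going.
			log.Printf("feature %q not present on this node, skipping", p.name)
			continue
		}
		log.Printf("feature %q discovered", p.name)
	}
}
```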

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)

Testing

  • Unit tests (make coverage)
  • Manual cluster testing (describe below)
  • N/A or Other (docs, CI config, etc.)

Test details:
Manual testing was done to validate the changes.

To test with ClusterPolicy, the following values.yaml was used:

driver:
  enabled: true
  nvidiaDriverCRD:
    enabled: false
    deployDefaultCR: false
  kernelModuleType: open
  repository: nvcr.io/nvidia
  image: driver
  version: 580.105.08
  imagePullPolicy: Always
  rdma:
    enabled: false
    useHostMofed: false
gds:
  enabled: true
  repository: nvcr.io/nvidia/cloud-native
  image: nvidia-fs
  version: "2.26.6"
  imagePullPolicy: IfNotPresent
gdrcopy:
  enabled: true
  repository: nvcr.io/nvidia/cloud-native
  image: gdrdrv
  version: "v2.5.1"
  imagePullPolicy: Always
operator:
  repository: rahulsharm810
  image: gpu-operator
  version: nvd3
  imagePullPolicy: Always
devicePlugin:
  repository: docker.io/rahulsharm810
  image: k8s-device-plugin
  version: nvd2
  imagePullPolicy: Always
cdi:
  enabled: false
validator:
  repository: rahulsharm810
  image: gpu-operator
  version: nvd3
  imagePullPolicy: Always

Pods after install:

root@test:~# kgpo
NAME                                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-rdcrd                                1/1     Running     0          8m54s
gpu-operator-6457b8f76d-ldm8g                              1/1     Running     0          9m20s
nvidia-container-toolkit-daemonset-v72cb                   1/1     Running     0          8m54s
nvidia-cuda-validator-6hgln                                0/1     Completed   0          6m36s
nvidia-dcgm-exporter-6f86g                                 1/1     Running     0          8m54s
nvidia-device-plugin-daemonset-7pslg                       1/1     Running     0          8m54s
nvidia-driver-daemonset-kltm9                              3/3     Running     0          9m1s
nvidia-mig-manager-62vnq                                   1/1     Running     0          8m54s
nvidia-operator-validator-7fscv                            1/1     Running     0          8m54s
nvidiagpu-node-feature-discovery-gc-6d484cd547-sfgd5       1/1     Running     0          9m20s
nvidiagpu-node-feature-discovery-master-7d466cdd75-mg6nq   1/1     Running     0          9m20s
nvidiagpu-node-feature-discovery-worker-ltv95              1/1     Running     0          9m20s
root@test:~# cat /run/nvidia/driver/.additional-drivers-flags
GDRCOPY_ENABLED: true
GDS_ENABLED: true
GPU_DIRECT_RDMA_ENABLED: false
root@test:~#

Testing with the NVIDIADriver CR:

values.yaml file:

driver:
  enabled: true
  nvidiaDriverCRD:
    enabled: false
    deployDefaultCR: false
operator:
  repository: rahulsharm810
  image: gpu-operator
  version: nvd2
  imagePullPolicy: Always
devicePlugin:
  repository: docker.io/rahulsharm810
  image: k8s-device-plugin
  version: nvd2
  imagePullPolicy: Always
cdi:
  enabled: false
validator:
  repository: rahulsharm810
  image: gpu-operator
  version: nvd2
  imagePullPolicy: Always

The NVIDIADriver CR was installed using:

kind: NVIDIADriver
metadata:
  name: demo-test
spec:
  driverType: gpu
  gdrcopy:
    enabled: true
    repository: nvcr.io/nvidia/cloud-native
    image: gdrdrv
    version: v2.5.1
    imagePullPolicy: IfNotPresent
    imagePullSecrets: []
    env: []
    args: []
  kernelModuleType: open
  rdma:
    enabled: false
    useHostMofed: false
  gds:
    enabled: false
    repository: nvcr.io/nvidia/cloud-native
    image: nvidia-fs
    version: "2.26.6"
    imagePullPolicy: IfNotPresent
  startupProbe:
    failureThreshold: 120
    initialDelaySeconds: 60
    periodSeconds: 10
    timeoutSeconds: 60
  image: driver
  repository: nvcr.io/nvidia
  imagePullPolicy: Always
  version: 580.105.08
  usePrecompiled: false

Status after install:

root@test:~# kgpo
NAME                                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-j8nlg                                1/1     Running     0          34m
gpu-operator-6457b8f76d-vtzbj                              1/1     Running     0          36m
nvidia-container-toolkit-daemonset-9hzvt                   1/1     Running     0          34m
nvidia-cuda-validator-8h769                                0/1     Completed   0          33m
nvidia-dcgm-exporter-2rzzf                                 1/1     Running     0          34m
nvidia-device-plugin-daemonset-v7fzj                       1/1     Running     0          34m
nvidia-gpu-driver-ubuntu24.04-6585477fb6-c4pm2             2/2     Running     0          35m
nvidia-mig-manager-s7m5q                                   1/1     Running     0          32m
nvidia-operator-validator-4sr4t                            1/1     Running     0          34m
nvidiagpu-node-feature-discovery-gc-6d484cd547-bc8k6       1/1     Running     0          36m
nvidiagpu-node-feature-discovery-master-7d466cdd75-vfqg2   1/1     Running     0          36m
nvidiagpu-node-feature-discovery-worker-cx8r9              1/1     Running     0          36m
root@test:~# cat /run/nvidia/driver/.additional-drivers-flags
GDRCOPY_ENABLED: true
GDS_ENABLED: false
GPU_DIRECT_RDMA_ENABLED: false
root@test:~#

CDI was enabled and disabled across both tests to confirm the changes work both with and without CDI.


copy-pr-bot bot commented Dec 25, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.



rahulait commented Jan 7, 2026

/ok to test 4457fca


rahulait commented Jan 9, 2026

/ok to test 1fefa07

@rahulait rahulait requested a review from cdesiniotis January 13, 2026 19:49

@cdesiniotis cdesiniotis left a comment


One minor comment, but otherwise lgtm!

@rahulait rahulait force-pushed the validate-additional-drivers branch 2 times, most recently from 7b28c83 to 0456f2b on January 13, 2026 21:37
@rahulait

/ok to test 0456f2b

@cdesiniotis

@tariq1890 requesting your review on this.

@tariq1890

Thanks for the detailed description @rahulait! Can we also add test cases to ensure the overall coverage doesn't drop?

@rahulait rahulait force-pushed the validate-additional-drivers branch 3 times, most recently from f851b75 to fb3d97e on January 14, 2026 21:19
Changes include:
* storing additional enabled drivers on the nodes itself so that container
toolkit and validation pods can check to see which drivers are enabled on that node.
* remove nvidia-fs and gdrcopy from driver validation, fix tests

Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
@rahulait rahulait force-pushed the validate-additional-drivers branch from fb3d97e to b430ca0 on January 14, 2026 21:24
@rahulait

/ok to test b430ca0

@rahulait rahulait requested a review from tariq1890 January 14, 2026 21:34
@cdesiniotis cdesiniotis merged commit 6504cfb into NVIDIA:main Jan 15, 2026
16 checks passed