nvidia-cuda-validator v25.10.0 fails to allocate vector #1900

@davidshen84

I installed the nvidia-gpu-operator v25.10 on my K3s cluster. Most of the GPU Operator pods start successfully, except for the CUDA validator, which fails with the following message:

    cuda-validation Failed to allocate device vector A (error code no CUDA-capable device is detected)!
    cuda-validation [Vector addition of 50000 elements]
    stream closed EOF for gpu-operator/nvidia-cuda-validator-r6nsb (cuda-validation)

I downgraded to v25.3.2 and everything worked.

My host system is Gentoo. I installed the NVIDIA driver and nvidia-container-toolkit directly on the host with the system package manager.

I customised the operator with the following values:

    driver:
      enabled: false
    toolkit:
      enabled: false

    devicePlugin:
      config:
        name: device-plugin-config
        create: true
        default: "time-slicing"
        data:
          time-slicing: |-
            version: v1
            flags:
              migStrategy: none
            sharing:
              timeSlicing:
                renameByDefault: false
                failRequestsGreaterThanOne: true
                resources:
                  - name: nvidia.com/gpu
                    replicas: 4
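
To help separate a validator problem from a runtime problem, the same vector-addition workload could be run as a standalone pod. The manifest below is only a sketch under stated assumptions: the pod name, the sample image tag, and the existence of a RuntimeClass called `nvidia` are not taken from this cluster and may need adjusting. It requests a single `nvidia.com/gpu`, which is consistent with `failRequestsGreaterThanOne: true` in the values above.

```yaml
# Hypothetical reproduction pod; the name, image tag, and runtimeClassName
# are assumptions and are not taken from the cluster described above.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-repro        # assumed name
  namespace: default
spec:
  restartPolicy: Never              # run once and keep the logs
  runtimeClassName: nvidia          # assumes a RuntimeClass named "nvidia" exists
  containers:
    - name: vectoradd
      # Assumed sample image; any CUDA vectorAdd sample image should behave similarly.
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1         # one time-sliced replica
```

If this pod fails with the same "no CUDA-capable device is detected" error, the problem probably lies in how the container runtime exposes the GPU rather than in the validator itself. If it succeeds, the validator pod's own configuration would be the next place to look.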
