[Bug]: Decouple host client shutdown from host driver setting #2501

@Arc676

Describe the bug
The chart couples WITH_SHUTDOWN_HOST_GPU_CLIENTS and IS_HOST_DRIVER under the assumption that the host clients can and should be restarted if the driver is installed on the host.

# manually export additional envs required by mig-manager
export WITH_SHUTDOWN_HOST_GPU_CLIENTS=$IS_HOST_DRIVER
echo "WITH_SHUTDOWN_HOST_GPU_CLIENTS=$WITH_SHUTDOWN_HOST_GPU_CLIENTS"

Since mig-parted uses systemd to restart the host clients, this implicitly assumes that systemd is available on every system where the driver is installed on the host. On systems where this is not the case, such as Talos Linux, mig-parted hangs because it unconditionally attempts to connect to systemd and never receives a response.

In addition, the MIG manager attempts to copy the mig-parted binary to the host if the host clients need to be shut down. This assumes that the host filesystem is writable. That holds for most systems, but Talos Linux is an immutable OS, so the host copy always fails there. mig-parted otherwise works fine on Talos Linux, so (unofficial) support for Talos can be obtained simply by skipping the host copy and the connection to systemd.

See also NVIDIA/mig-parted#356.

To Reproduce
Deploy the GPU operator on a system that has the NVIDIA drivers installed on the host (i.e. with gpu-operator.driver.enabled = false) and that does not have systemd; mig-parted then hangs while connecting to systemd. To reproduce the host-copy failure, deploy the operator on a system where nvidia-mig-manager cannot copy files to the host. Deploying on Talos covers both cases.

Expected behavior
The user should be able to set these two properties independently. On Talos, the user should be able to install the NVIDIA drivers via a system extension (i.e. on the host) but force WITH_SHUTDOWN_HOST_GPU_CLIENTS=false to prevent mig-parted from attempting tasks that cannot work on Talos.
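
One possible shape for the decoupling, as a minimal sketch rather than a proposed patch: keep the current behaviour as the default, but let an explicitly provided WITH_SHUTDOWN_HOST_GPU_CLIENTS take precedence over IS_HOST_DRIVER. The helper name resolve_shutdown_host_gpu_clients is hypothetical, and how the chart would expose the override (for example as a values entry) is an assumption not covered here.

```shell
#!/bin/sh
# Hypothetical helper (not part of the chart): prefer an explicitly set
# WITH_SHUTDOWN_HOST_GPU_CLIENTS and fall back to IS_HOST_DRIVER, which
# matches the current coupled behaviour when no override is given.
resolve_shutdown_host_gpu_clients() {
    # The trailing "false" default is an assumption for standalone runs
    # where neither variable is set.
    echo "${WITH_SHUTDOWN_HOST_GPU_CLIENTS:-${IS_HOST_DRIVER:-false}}"
}

# manually export additional envs required by mig-manager
WITH_SHUTDOWN_HOST_GPU_CLIENTS=$(resolve_shutdown_host_gpu_clients)
export WITH_SHUTDOWN_HOST_GPU_CLIENTS
echo "WITH_SHUTDOWN_HOST_GPU_CLIENTS=$WITH_SHUTDOWN_HOST_GPU_CLIENTS"
```

With this shape, a Talos deployment could keep IS_HOST_DRIVER=true while passing WITH_SHUTDOWN_HOST_GPU_CLIENTS=false, and existing deployments that set neither would see no change.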

Environment (please provide the following information):

  • GPU Operator Version: v25.10.1
  • OS: Talos Linux 1.13.0
  • Kernel Version: 6.18.24-talos
  • Container Runtime Version: containerd 2.2.3
  • Kubernetes Distro and Version: Talos, Kubernetes 1.35.1

Labels: bug
