Describe the bug
The chart couples WITH_SHUTDOWN_HOST_GPU_CLIENTS and IS_HOST_DRIVER under the assumption that the host clients can and should be restarted if the driver is installed on the host.
gpu-operator/assets/state-mig-manager/0420_configmap.yaml, lines 22 to 24 in 8460187:

```shell
# manually export additional envs required by mig-manager
export WITH_SHUTDOWN_HOST_GPU_CLIENTS=$IS_HOST_DRIVER
echo "WITH_SHUTDOWN_HOST_GPU_CLIENTS=$WITH_SHUTDOWN_HOST_GPU_CLIENTS"
```
Since mig-parted uses systemd to restart the host clients, this implicitly assumes that systemd is available on every system where the driver is installed on the host. On systems where this is not the case, such as Talos Linux, mig-parted hangs: it unconditionally attempts to connect to systemd and never receives an answer.
In addition, the MIG manager attempts to copy the mig-parted binary to the host whenever the host clients need to be shut down. This assumes that the host filesystem is writable. That holds for most systems, but Talos Linux is an immutable OS, so the host copy always fails. mig-parted otherwise works fine on Talos Linux, so (unofficial) support for Talos can be obtained simply by skipping the host copy and the connection to systemd.
See also NVIDIA/mig-parted#356.
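The two host assumptions described above can be checked explicitly. The following is only a sketch: the function names and the host-root argument are hypothetical and are not part of the chart or of mig-parted. It relies on the fact that a systemd-booted system has the directory /run/systemd/system (the same check sd_booted(3) performs), and it tests writability by actually creating a file, since permission bits alone do not reveal a read-only mount.

```shell
# Hypothetical probes, not part of the chart or of mig-parted.
# Both take the path under which the host filesystem is visible
# (use "/" when running directly on the host).

host_has_systemd() {
    # systemd creates /run/systemd/system at boot; sd_booted(3) uses the same check.
    [ -d "${1%/}/run/systemd/system" ]
}

host_dir_writable() {
    # Try to create and remove a temporary file in the target directory.
    # The subshell keeps a failed redirection from aborting the calling shell.
    probe="${1%/}/.mig-write-probe.$$"
    ( : > "$probe" ) 2>/dev/null && rm -f "$probe"
}
```

A caller could then skip the host copy and the systemd connection when either probe fails, for example `host_has_systemd /host || echo "no systemd on host"`, where `/host` is an assumed mount point.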
To Reproduce
To reproduce the systemd hang, deploy the GPU operator on a system that has the NVIDIA drivers installed on the host (i.e. with gpu-operator.driver.enabled = false) and that does not have systemd. To reproduce the host-copy failure, deploy the operator on a system where nvidia-mig-manager cannot copy files to the host. Deploying on Talos covers both cases.
Expected behavior
The user should be able to set these properties separately. On Talos, the user should be able to install the NVIDIA drivers via system extension (i.e. on the host) but force WITH_SHUTDOWN_HOST_GPU_CLIENTS=false to prevent mig-parted from attempting to perform tasks that cannot work on Talos.
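One possible way to decouple the two variables is sketched below. It is an illustration, not the chart's actual fix: the helper name `resolve_shutdown_host_gpu_clients` is hypothetical, and it assumes an explicit WITH_SHUTDOWN_HOST_GPU_CLIENTS value has somehow been placed in the MIG manager's environment before this script runs. When no explicit value is present, it falls back to the current behaviour of following IS_HOST_DRIVER.

```shell
# Hypothetical helper: pick the effective WITH_SHUTDOWN_HOST_GPU_CLIENTS value.
#   $1: value of IS_HOST_DRIVER
#   $2: explicit override, may be empty or omitted
resolve_shutdown_host_gpu_clients() {
    if [ -n "${2:-}" ]; then
        # An explicit setting wins, e.g. "false" on Talos.
        printf '%s\n' "$2"
    else
        # No override: keep the existing coupling to IS_HOST_DRIVER.
        printf '%s\n' "$1"
    fi
}

# Assumption: IS_HOST_DRIVER defaults to "false" here only so the sketch runs standalone.
WITH_SHUTDOWN_HOST_GPU_CLIENTS="$(resolve_shutdown_host_gpu_clients \
    "${IS_HOST_DRIVER:-false}" "${WITH_SHUTDOWN_HOST_GPU_CLIENTS:-}")"
export WITH_SHUTDOWN_HOST_GPU_CLIENTS
echo "WITH_SHUTDOWN_HOST_GPU_CLIENTS=$WITH_SHUTDOWN_HOST_GPU_CLIENTS"
```

With this shape, a Talos user with host-installed drivers would have IS_HOST_DRIVER=true but could still force WITH_SHUTDOWN_HOST_GPU_CLIENTS=false, while existing deployments that set nothing would behave as before.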
Environment (please provide the following information):
- GPU Operator Version: v25.10.1
- OS: Talos Linux 1.13.0
- Kernel Version: 6.18.24-talos
- Container Runtime Version: containerd 2.2.3
- Kubernetes Distro and Version: Talos, Kubernetes 1.35.1