Description
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
Describe the bug
Hi,
I am trying to configure MIG for a Kubernetes cluster. I have a GPU node with an NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition attached.
I then installed the gpu-operator with MIG enabled; all daemonsets come up fine with no issues.
However, when I labeled the node to apply a MIG configuration, I ran into errors. The steps to reproduce are below.
To Reproduce
Step 1: I install the gpu-operator chart with this command (PowerShell):
helm upgrade --install gpu-operator -n nvidia nvidia/gpu-operator `
--version v24.3.0 `
--set driver.enabled=true `
--set driver.version="580.126.16" `
--set mig.strategy=mixed `
--set migManager.env[0].name=WITH_REBOOT `
--set-string migManager.env[0].value=true `
--set toolkit.enabled=true `
--set toolkit.env[0].name=CONTAINERD_CONFIG `
--set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl `
--set toolkit.env[1].name=CONTAINERD_SOCKET `
--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock `
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS `
--set toolkit.env[2].value=nvidia `
--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT `
--set-string toolkit.env[3].value=true `
--set psp.enabled=true `
--set validator.driver.env[0].name=DISABLE_DEV_CHAR_SYMLINK_CREATION `
--set-string validator.driver.env[0].value=true `
--set daemonsets.tolerations[0].operator=Exists
Step 2: I prepare a ConfigMap like the one below:
apiVersion: v1
kind: ConfigMap
metadata:
  name: blackwell-mig-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false
      all-1g.24gb.gfx:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.24gb+gfx": 4
      all-1g.24gb.me.all:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.24gb+me.all": 1
      all-1g.24gb-me:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.24gb-me": 4
      all-2g.48gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.48gb": 2
      all-2g.48gb.gfx:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.48gb+gfx": 2
      all-2g.48gb.me.all:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.48gb+me.all": 1
      all-2g.48gb-me:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.48gb-me": 2
      all-4g.96gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "4g.96gb": 1
      all-4g.96gb.gfx:
        - devices: all
          mig-enabled: true
          mig-devices:
            "4g.96gb+gfx": 1
Step 3: I edit the ClusterPolicy with the command below:
kubectl patch clusterpolicies.nvidia.com cluster-policy `
-n nvidia `
--type='json' `
-p='[{\"op\":\"replace\", \"path\":\"/spec/mig/strategy\", \"value\":\"mixed\"},{\"op\":\"replace\", \"path\":\"/spec/migManager/config/name\", \"value\":\"blackwell-mig-config\"}]'
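The escaped quotes suggest the patch is issued from PowerShell. One way to avoid quoting problems is to generate the patch body programmatically; the sketch below just rebuilds the same two JSON-patch operations as the command above:

```python
import json

# The same two JSON-patch operations as in the kubectl command above.
patch = [
    {"op": "replace", "path": "/spec/mig/strategy", "value": "mixed"},
    {"op": "replace", "path": "/spec/migManager/config/name",
     "value": "blackwell-mig-config"},
]

# The serialized form can be written to a file and passed to
# `kubectl patch --type=json --patch-file`, sidestepping shell escaping.
print(json.dumps(patch))
```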
Step 4: I label the node to apply the MIG config all-1g.24gb.gfx:
kubectl label nodes k3s-ai-worker nvidia.com/mig.config=all-1g.24gb.gfx --overwrite
Step 5: When I check the logs of the nvidia-mig-manager pod, I get the output and error below:
2026-03-25T19:50:19.477812141Z time="2026-03-25T19:50:19Z" level=info msg="Updating to MIG config: all-1g.24gb.gfx"
2026-03-25T19:50:19.492286796Z Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
2026-03-25T19:50:19.541596886Z Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
2026-03-25T19:50:19.541616317Z Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
2026-03-25T19:50:19.588315431Z Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
2026-03-25T19:50:19.588332443Z Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
2026-03-25T19:50:19.635636196Z Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
2026-03-25T19:50:19.635652978Z Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
2026-03-25T19:50:19.682024793Z Current value of 'nvidia.com/gpu.deploy.dcgm=true'
2026-03-25T19:50:19.682041979Z Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
2026-03-25T19:50:19.728575198Z Current value of 'nvidia.com/gpu.deploy.nvsm=paused-for-mig-change'
2026-03-25T19:50:19.728592864Z Asserting that the requested configuration is present in the configuration file
2026-03-25T19:50:19.732904771Z time="2026-03-25T19:50:19Z" level=fatal msg="Error parsing config file: unmarshal error: error unmarshaling JSON: while decoding JSON: error validating values in 'mig-devices' field: invalid format for '1g.24gb-me': cannot parse fields of '1g.24gb-me': missing 'gb' from '%!d(MISSING)gb'"
2026-03-25T19:50:19.733292085Z Unable to validate the selected MIG configuration
2026-03-25T19:50:19.733353601Z Restarting any GPU clients previously shutdown on the host by restarting their systemd services
2026-03-25T19:50:19.884533391Z Restarting any GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
2026-03-25T19:50:19.940637565Z node/k3s-ai-worker not labeled
2026-03-25T19:50:19.941788776Z Changing the 'nvidia.com/mig.config.state' node label to 'failed'
2026-03-25T19:50:19.995878919Z node/k3s-ai-worker not labeled
2026-03-25T19:50:19.997760562Z time="2026-03-25T19:50:19Z" level=error msg="Error: exit status 1"
2026-03-25T19:50:19.997766181Z time="2026-03-25T19:50:19Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
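Judging from the error message ("missing 'gb'"), the parser expects profile names of the form `<slices>g.<memory>gb` with an optional `+`-suffix, so the `-me` entries are the ones rejected. A quick check with a regex (my guess at the format, inferred only from the error above, not from mig-parted's actual grammar):

```python
import re

# Hypothetical profile pattern inferred from the mig-manager error message:
# "<slices>g.<mem>gb" plus an optional "+gfx" / "+me" / "+me.all" suffix.
# This is an assumption, not mig-parted's authoritative grammar.
PROFILE_RE = re.compile(r"^\d+g\.\d+gb(\+(gfx|me(\.all)?))?$")

profiles = [
    "1g.24gb+gfx",
    "1g.24gb+me.all",
    "1g.24gb-me",    # from the failing config
    "2g.48gb",
    "2g.48gb-me",    # from the failing config
]

for p in profiles:
    status = "ok" if PROFILE_RE.match(p) else "REJECTED"
    print(f"{p}: {status}")
```

Under this assumption, only the two `-me` profiles fail to parse, matching the profile named in the fatal log line.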
Step 6: I suspected the ConfigMap was wrong, so I edited it to:
apiVersion: v1
kind: ConfigMap
metadata:
  name: blackwell-mig-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false
      all-2g.48gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.48gb": 2
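As a sanity check on the profile arithmetic, the request should fit within the whole GPU. Assuming 4 compute slices and 96 GB total (inferred from the `4g.96gb` profile in step 2, not from a spec sheet), the `all-2g.48gb` config exactly fills the card:

```python
# Each profile name encodes <slices>g.<memory>gb. The totals requested
# should not exceed the whole GPU: 4 slices / 96 GB here, an assumption
# inferred from the 4g.96gb profile in the earlier ConfigMap.
TOTAL_SLICES, TOTAL_MEM_GB = 4, 96

requested = {"2g.48gb": 2}  # the all-2g.48gb config

slices = sum(int(name.split("g.")[0]) * n for name, n in requested.items())
mem = sum(int(name.split(".")[1].rstrip("gb")) * n for name, n in requested.items())
print(slices <= TOTAL_SLICES and mem <= TOTAL_MEM_GB)  # → True
```

So the request itself is not oversubscribed; the failure in step 8 looks unrelated to profile sizing.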
Step 7: I label the node with the command below to apply the all-2g.48gb MIG config:
kubectl label nodes k3s-ai-worker nvidia.com/mig.config=all-2g.48gb --overwrite
Step 8: When I check the logs of the nvidia-mig-manager pod, I get the output and error below:
time="2026-03-25T19:58:56Z" level=info msg="Updating to MIG config: all-2g.48gb"
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=paused-for-mig-change'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=paused-for-mig-change'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=paused-for-mig-change'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=paused-for-mig-change'
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm=paused-for-mig-change'
Asserting that the requested configuration is present in the configuration file
Selected MIG configuration is valid
Getting current value of the 'nvidia.com/mig.config.state' node label
Current value of 'nvidia.com/mig.config.state=rebooting'
Checking if the selected MIG config is currently applied or not
2026/03/25 19:58:57 WARNING: unable to get device name: [failed to find device with id '2bb4']
time="2026-03-25T19:58:57Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Checking if the MIG mode setting in the selected config is currently applied or not
If the state is 'rebooting', we expect this to always return true
2026/03/25 19:58:57 WARNING: unable to get device name: [failed to find device with id '2bb4']
time="2026-03-25T19:58:57Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
MIG mode change did not take effect after rebooting
Restarting any GPU clients previously shutdown on the host by restarting their systemd services
Restarting any GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
node/k3s-ai-worker labeled
Changing the 'nvidia.com/mig.config.state' node label to 'failed'
node/k3s-ai-worker labeled
time="2026-03-25T19:58:57Z" level=error msg="Error: exit status 1"
time="2026-03-25T19:58:57Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
---> It seems there is some misconfiguration in my cluster. Could you help me check?
Expected behavior
The GPU node should have MIG enabled.
Environment (please provide the following information):
- GPU Operator Version: v24.3.0
- OS: Ubuntu 24.04.4 LTS
- Kernel Version: 6.8.0-106-generic
- Container Runtime Version: containerd://2.1.5-k3s1
- Kubernetes Distro and Version: k3s
Information to attach (optional if deemed irrelevant)
- kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
- kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
- If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- If a pod/ds is in an error state or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- containerd logs: journalctl -u containerd > containerd.log
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: operator_feedback@nvidia.com