Description
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
Describe the bug
Hi,
I am trying to configure MIG for a Kubernetes cluster. I have a GPU node with an NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition attached.
I then installed the gpu-operator with MIG enabled; all daemonsets come up fine with no issues.
However, when I labeled the node to apply a MIG configuration, I ran into errors. The steps to reproduce are below.
To Reproduce
Step 1: I install the gpu-operator chart with this command (PowerShell):
helm upgrade --install gpu-operator -n nvidia nvidia/gpu-operator `
--version v24.3.0 `
--set driver.enabled=true `
--set driver.version="580.126.16" `
--set mig.strategy=mixed `
--set migManager.env[0].name=WITH_REBOOT `
--set-string migManager.env[0].value=true `
--set toolkit.enabled=true `
--set toolkit.env[0].name=CONTAINERD_CONFIG `
--set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl `
--set toolkit.env[1].name=CONTAINERD_SOCKET `
--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock `
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS `
--set toolkit.env[2].value=nvidia `
--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT `
--set-string toolkit.env[3].value=true `
--set psp.enabled=true `
--set validator.driver.env[0].name=DISABLE_DEV_CHAR_SYMLINK_CREATION `
--set-string validator.driver.env[0].value=true `
--set daemonsets.tolerations[0].operator=Exists
Step 2: I prepare a ConfigMap like the one below:
apiVersion: v1
kind: ConfigMap
metadata:
  name: blackwell-mig-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false
      all-1g.24gb.gfx:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.24gb+gfx": 4
      all-1g.24gb.me.all:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.24gb+me.all": 1
      all-1g.24gb-me:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.24gb-me": 4
      all-2g.48gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.48gb": 2
      all-2g.48gb.gfx:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.48gb+gfx": 2
      all-2g.48gb.me.all:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.48gb+me.all": 1
      all-2g.48gb-me:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.48gb-me": 2
      all-4g.96gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "4g.96gb": 1
      all-4g.96gb.gfx:
        - devices: all
          mig-enabled: true
          mig-devices:
            "4g.96gb+gfx": 1
Step 3: I edit the ClusterPolicy with the command below:
kubectl patch clusterpolicies.nvidia.com cluster-policy `
-n nvidia `
--type='json' `
-p='[{\"op\":\"replace\", \"path\":\"/spec/mig/strategy\", \"value\":\"mixed\"},{\"op\":\"replace\", \"path\":\"/spec/migManager/config/name\", \"value\":\"blackwell-mig-config\"}]'
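The escaped quotes suggest the patch is issued from PowerShell. One way to avoid quoting problems is to generate the patch body programmatically; the sketch below just rebuilds the same two JSON-patch operations as the command above:

```python
import json

# The same two JSON-patch operations as in the kubectl command above.
patch = [
    {"op": "replace", "path": "/spec/mig/strategy", "value": "mixed"},
    {"op": "replace", "path": "/spec/migManager/config/name",
     "value": "blackwell-mig-config"},
]

# The serialized form can be written to a file and passed to
# `kubectl patch --type=json --patch-file`, sidestepping shell escaping.
print(json.dumps(patch))
```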
Step 4: I label the node to apply the MIG config all-1g.24gb.gfx:
kubectl label nodes k3s-ai-worker nvidia.com/mig.config=all-1g.24gb.gfx --overwrite
Step 5: When I check the logs of the nvidia-mig-manager pod, I get the output and error below:
2026-03-25T19:50:19.477812141Z time="2026-03-25T19:50:19Z" level=info msg="Updating to MIG config: all-1g.24gb.gfx"
2026-03-25T19:50:19.492286796Z Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
2026-03-25T19:50:19.541596886Z Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
2026-03-25T19:50:19.541616317Z Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
2026-03-25T19:50:19.588315431Z Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
2026-03-25T19:50:19.588332443Z Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
2026-03-25T19:50:19.635636196Z Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
2026-03-25T19:50:19.635652978Z Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
2026-03-25T19:50:19.682024793Z Current value of 'nvidia.com/gpu.deploy.dcgm=true'
2026-03-25T19:50:19.682041979Z Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
2026-03-25T19:50:19.728575198Z Current value of 'nvidia.com/gpu.deploy.nvsm=paused-for-mig-change'
2026-03-25T19:50:19.728592864Z Asserting that the requested configuration is present in the configuration file
2026-03-25T19:50:19.732904771Z time="2026-03-25T19:50:19Z" level=fatal msg="Error parsing config file: unmarshal error: error unmarshaling JSON: while decoding JSON: error validating values in 'mig-devices' field: invalid format for '1g.24gb-me': cannot parse fields of '1g.24gb-me': missing 'gb' from '%!d(MISSING)gb'"
2026-03-25T19:50:19.733292085Z Unable to validate the selected MIG configuration
2026-03-25T19:50:19.733353601Z Restarting any GPU clients previously shutdown on the host by restarting their systemd services
2026-03-25T19:50:19.884533391Z Restarting any GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
2026-03-25T19:50:19.940637565Z node/k3s-ai-worker not labeled
2026-03-25T19:50:19.941788776Z Changing the 'nvidia.com/mig.config.state' node label to 'failed'
2026-03-25T19:50:19.995878919Z node/k3s-ai-worker not labeled
2026-03-25T19:50:19.997760562Z time="2026-03-25T19:50:19Z" level=error msg="Error: exit status 1"
2026-03-25T19:50:19.997766181Z time="2026-03-25T19:50:19Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
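Judging from the error message ("missing 'gb'"), the parser expects profile names of the form `<slices>g.<memory>gb` with an optional `+`-suffix, so the `-me` entries are the ones rejected. A quick check with a regex (my guess at the format, inferred only from the error above, not from mig-parted's actual grammar):

```python
import re

# Hypothetical profile pattern inferred from the mig-manager error message:
# "<slices>g.<mem>gb" plus an optional "+gfx" / "+me" / "+me.all" suffix.
# This is an assumption, not mig-parted's authoritative grammar.
PROFILE_RE = re.compile(r"^\d+g\.\d+gb(\+(gfx|me(\.all)?))?$")

profiles = [
    "1g.24gb+gfx",
    "1g.24gb+me.all",
    "1g.24gb-me",    # from the failing config
    "2g.48gb",
    "2g.48gb-me",    # from the failing config
]

for p in profiles:
    status = "ok" if PROFILE_RE.match(p) else "REJECTED"
    print(f"{p}: {status}")
```

Under this assumption, only the two `-me` profiles fail to parse, matching the profile named in the fatal log line.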
Step 6: I suspected the ConfigMap was wrong, so I edited it to:
apiVersion: v1
kind: ConfigMap
metadata:
  name: blackwell-mig-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false
      all-2g.48gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.48gb": 2
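As a sanity check on the profile arithmetic, the request should fit within the whole GPU. Assuming 4 compute slices and 96 GB total (inferred from the `4g.96gb` profile in step 2, not from a spec sheet), the `all-2g.48gb` config exactly fills the card:

```python
# Each profile name encodes <slices>g.<memory>gb. The totals requested
# should not exceed the whole GPU: 4 slices / 96 GB here, an assumption
# inferred from the 4g.96gb profile in the earlier ConfigMap.
TOTAL_SLICES, TOTAL_MEM_GB = 4, 96

requested = {"2g.48gb": 2}  # the all-2g.48gb config

slices = sum(int(name.split("g.")[0]) * n for name, n in requested.items())
mem = sum(int(name.split(".")[1].rstrip("gb")) * n for name, n in requested.items())
print(slices <= TOTAL_SLICES and mem <= TOTAL_MEM_GB)  # → True
```

So the request itself is not oversubscribed; the failure in step 8 looks unrelated to profile sizing.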
Step 7: I label the node with the command below to apply the all-2g.48gb MIG config:
kubectl label nodes k3s-ai-worker nvidia.com/mig.config=all-2g.48gb --overwrite
Step 8: When I check the logs of the nvidia-mig-manager pod, I get the output and error below:
time="2026-03-25T19:58:56Z" level=info msg="Updating to MIG config: all-2g.48gb"
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=paused-for-mig-change'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=paused-for-mig-change'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=paused-for-mig-change'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=paused-for-mig-change'
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm=paused-for-mig-change'
Asserting that the requested configuration is present in the configuration file
Selected MIG configuration is valid
Getting current value of the 'nvidia.com/mig.config.state' node label
Current value of 'nvidia.com/mig.config.state=rebooting'
Checking if the selected MIG config is currently applied or not
2026/03/25 19:58:57 WARNING: unable to get device name: [failed to find device with id '2bb4']
time="2026-03-25T19:58:57Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Checking if the MIG mode setting in the selected config is currently applied or not
If the state is 'rebooting', we expect this to always return true
2026/03/25 19:58:57 WARNING: unable to get device name: [failed to find device with id '2bb4']
time="2026-03-25T19:58:57Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
MIG mode change did not take effect after rebooting
Restarting any GPU clients previously shutdown on the host by restarting their systemd services
Restarting any GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
node/k3s-ai-worker labeled
Changing the 'nvidia.com/mig.config.state' node label to 'failed'
node/k3s-ai-worker labeled
time="2026-03-25T19:58:57Z" level=error msg="Error: exit status 1"
time="2026-03-25T19:58:57Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
---> It seems there is some misconfiguration in my cluster. Could you help me check?
Expected behavior
The GPU node should have MIG enabled.
Environment (please provide the following information):
- GPU Operator Version: v24.3.0
- OS: Ubuntu 24.04.4 LTS
- Kernel Version: 6.8.0-106-generic
- Container Runtime Version: containerd://2.1.5-k3s1
- Kubernetes Distro and Version: k3s
Information to attach (optional if deemed irrelevant)
- kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
- kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
- If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- If a pod/ds is in an error state or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- containerd logs: journalctl -u containerd > containerd.log
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: operator_feedback@nvidia.com