A100: The GPU operator will not install the mig-manager #652

### 1. Quick Debug Information
* OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu22.04
* Kernel Version: 5.15.0-60-generic
* Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
* K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): v1.24.15
* GPU Operator Version: v23.6.0
* GPU: A100 PCIe 40GB


### 2. Issue or feature description
With the driver and NFD installed in advance, I tried to install gpu-operator in an A100 environment. The installation failed and the mig-manager is missing.
GPU:
`root@master1:~# lspci | grep NVIDIA
2f:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)`
The gpu-operator pod info:
```
root@master1:~# kubectl get pod -n gpu-operator
NAME                                       READY   STATUS                  RESTARTS         AGE
gpu-feature-discovery-kht9c                1/1     Running                 1 (36m ago)      57m
gpu-operator-8597b78788-4ncg7              1/1     Running                 1 (36m ago)      57m
nvidia-container-toolkit-daemonset-pldv5   1/1     Running                 1 (36m ago)      57m
nvidia-cuda-validator-tgqqk                0/1     Init:CrashLoopBackOff   1 (17s ago)      19s
nvidia-dcgm-exporter-m7hg7                 1/1     Running                 1 (36m ago)      57m
nvidia-device-plugin-daemonset-gjlp7       0/1     CrashLoopBackOff        17 (4m47s ago)   57m
nvidia-operator-validator-7969z            0/1     Init:2/4                6 (2m59s ago)    57m
```

There is no nvidia-mig-manager pod.
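
The same check can be scripted. This is a minimal shell sketch, assuming the default `kubectl get pod` column layout shown above. The function names `unhealthy_pods` and `has_mig_manager` are hypothetical helpers for illustration and are not part of the GPU Operator.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: they only parse text in the default
# `kubectl get pod` layout (NAME READY STATUS ...), read from stdin.

# Print "NAME STATUS" for every pod whose STATUS is neither Running nor Completed.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Exit 0 if any pod name starts with nvidia-mig-manager, otherwise exit 1.
has_mig_manager() {
  awk 'NR > 1 && $1 ~ /^nvidia-mig-manager/ { found = 1 } END { exit !found }'
}

# Example usage against a live cluster:
#   kubectl get pod -n gpu-operator | unhealthy_pods
#   kubectl get pod -n gpu-operator | has_mig_manager || echo "no mig-manager pod"
```

On the listing above, `unhealthy_pods` would report the cuda-validator, device-plugin, and operator-validator pods, and `has_mig_manager` would fail.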
The logs of the failing device plugin pod are as follows:
```
root@master1:~# kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-gjlp7
...
I0109 12:33:24.644789       1 main.go:256] Retreiving plugins.
I0109 12:33:24.645553       1 factory.go:107] Detected NVML platform: found NVML library
I0109 12:33:24.645594       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0109 12:33:24.699911       1 main.go:123] error starting plugins: error getting plugins: failed to construct NVML resource managers: error building device map: error building device map from config.resources: invalid MIG configuration: At least one device with migEnabled=true was not configured correctly: error visiting device: device 0 has an invalid MIG configuration
```
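
The error refers to `migEnabled=true` on device 0, so it may help to confirm which MIG mode the driver reports and whether any MIG devices exist. This is a minimal shell sketch, assuming `nvidia-smi` is available on the host. `mig_enabled_gpus` is a hypothetical helper that parses the CSV printed by the query shown in the comments.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the index of every GPU whose current MIG mode
# is "Enabled". Expects CSV lines of the form "index, mig.mode.current" on stdin.
mig_enabled_gpus() {
  awk -F', *' '$2 == "Enabled" { print $1 }'
}

# Example usage on the GPU node:
#   nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader | mig_enabled_gpus
#   nvidia-smi -L    # lists GPUs and any MIG devices created on them
```

If a GPU is reported as MIG-enabled but `nvidia-smi -L` lists no MIG devices under it, that would be consistent with the device plugin error above. This is an assumption about the cause, not a confirmed diagnosis.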

### 3. Steps to reproduce the issue
* Install a Kubernetes cluster.
* Install NFD:
```
root@master1:~# kubectl get pod -n node-feature-discovery
NAME                                                         READY   STATUS    RESTARTS       AGE
nfd-release-node-feature-discovery-master-5564946bcf-x6qzs   1/1     Running   13 (49m ago)   43d
nfd-release-node-feature-discovery-worker-x7nff              1/1     Running   11 (49m ago)   43d
```
* Install the GPU driver (Driver Version: 535.129.03):
![nvidia-smi](https://github.com/NVIDIA/gpu-operator/assets/51413062/77dd83bc-604d-468a-902a-354c9a1c4782)
* Install the operator: `helm install gpu-operator -n gpu-operator --create-namespace ./gpu-operator --set driver.enabled=false --set nfd.enabled=false`
* Check the gpu-operator pods.
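
As an extra check after these steps, the node labels can be inspected, since NFD was installed separately. This is a minimal shell sketch. `label_value` is a hypothetical helper, the node name `master1` is taken from the shell prompt above, and the label keys `nvidia.com/mig.capable` and `nvidia.com/gpu.deploy.mig-manager` are assumed to be the ones relevant to mig-manager scheduling in this operator version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of label KEY ($1) from a
# comma-separated key=value list on stdin, as printed in the LABELS column
# of `kubectl get node --show-labels`. Prints nothing if the label is absent.
label_value() {
  tr ',' '\n' | awk -F'=' -v key="$1" '$1 == key { print $2 }'
}

# Example usage (node name is an assumption):
#   labels="$(kubectl get node master1 --show-labels --no-headers | awk '{ print $NF }')"
#   printf '%s\n' "$labels" | label_value nvidia.com/mig.capable
#   printf '%s\n' "$labels" | label_value nvidia.com/gpu.deploy.mig-manager
```

An empty result for either key would mean that the label is not set on the node.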



