
A100: The GPU operator will not install the mig-manager #652

@lsyLearn


1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu22.04
  • Kernel Version: 5.15.0-60-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): v1.24.15
  • GPU Operator Version: v23.6.0
  • GPU: A100 PCIe 40GB

2. Issue or feature description

I installed the driver and NFD in advance and then tried to install gpu-operator on an A100 node. The installation failed and no mig-manager pod was deployed.
GPU:
root@master1:~# lspci | grep NVIDIA
2f:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
The gpu-operator pod info:

root@master1:~# kubectl get pod -n gpu-operator
NAME                                       READY   STATUS                  RESTARTS         AGE
gpu-feature-discovery-kht9c                1/1     Running                 1 (36m ago)      57m
gpu-operator-8597b78788-4ncg7              1/1     Running                 1 (36m ago)      57m
nvidia-container-toolkit-daemonset-pldv5   1/1     Running                 1 (36m ago)      57m
nvidia-cuda-validator-tgqqk                0/1     Init:CrashLoopBackOff   1 (17s ago)      19s
nvidia-dcgm-exporter-m7hg7                 1/1     Running                 1 (36m ago)      57m
nvidia-device-plugin-daemonset-gjlp7       0/1     CrashLoopBackOff        17 (4m47s ago)   57m
nvidia-operator-validator-7969z            0/1     Init:2/4                6 (2m59s ago)    57m

There is no nvidia-mig-manager pod.
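
The pod listing above can be checked mechanically. The following is a minimal sketch, not part of the original report: the helper names `unhealthy_pods` and `has_mig_manager` are hypothetical, and both only parse the text that `kubectl get pod` prints on stdin.

```shell
# Print "NAME STATUS" for every pod that is neither Running nor Completed.
# Expects `kubectl get pod` output (with its header line) on stdin.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Succeed only if a pod whose name starts with nvidia-mig-manager is listed.
has_mig_manager() {
  grep -q '^nvidia-mig-manager'
}

# Example usage on a cluster (commented out so the sketch has no side effects):
#   kubectl get pod -n gpu-operator | unhealthy_pods
#   kubectl get pod -n gpu-operator | has_mig_manager || echo "no mig-manager pod"
```

Run against the listing above, `unhealthy_pods` would report the cuda-validator, device-plugin, and operator-validator pods, and `has_mig_manager` would fail.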
The logs of the failing device-plugin pod are as follows:

root@master1:~# kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-gjlp7
...
I0109 12:33:24.644789       1 main.go:256] Retreiving plugins.
I0109 12:33:24.645553       1 factory.go:107] Detected NVML platform: found NVML library
I0109 12:33:24.645594       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0109 12:33:24.699911       1 main.go:123] error starting plugins: error getting plugins: failed to construct NVML resource managers: error building device map: error building device map from config.resources: invalid MIG configuration: At least one device with migEnabled=true was not configured correctly: error visiting device: device 0 has an invalid MIG configuration
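
The error above says that device 0 has migEnabled=true but an invalid MIG configuration. A reasonable first check, sketched here on the assumption that `nvidia-smi` is available on the host, is to look at the GPU's current and pending MIG mode and at whether any GPU instances exist. The helper name `mig_enabled_gpus` is hypothetical; it only parses the CSV that `nvidia-smi` prints.

```shell
# Print the index of every GPU whose current MIG mode is "Enabled".
# Expects lines such as "0, Enabled, Enabled" on stdin, i.e. the output of
#   nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv,noheader
mig_enabled_gpus() {
  awk -F', *' '$2 == "Enabled" { print $1 }'
}

# Example usage on the GPU node (commented out so the sketch has no side effects):
#   nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv,noheader | mig_enabled_gpus
#   nvidia-smi mig -lgi   # list the GPU instances that currently exist, if any
```

If a GPU is reported as MIG-enabled but has no GPU instances, that would be consistent with the message in the log, although the report itself does not confirm this as the cause.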

3. Steps to reproduce the issue

  • Install the k8s cluster.
  • Install NFD:
root@master1:~# kubectl get pod -n node-feature-discovery
NAME                                                         READY   STATUS    RESTARTS       AGE
nfd-release-node-feature-discovery-master-5564946bcf-x6qzs   1/1     Running   13 (49m ago)   43d
nfd-release-node-feature-discovery-worker-x7nff              1/1     Running   11 (49m ago)   43d
  • Install the GPU driver (Driver Version: 535.129.03) and verify it with nvidia-smi.
  • Install the operator: helm install gpu-operator -n gpu-operator --create-namespace ./gpu-operator --set driver.enabled=false --set nfd.enabled=false
  • Check the pods in the gpu-operator namespace.
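
After these steps, it may also help to look at the MIG-related labels on the node, since NFD and the driver were installed outside the operator. This is a sketch under assumptions: the helper name `mig_labels` is hypothetical, and it assumes the relevant labels use the `nvidia.com/mig.` prefix or the `nvidia.com/gpu.deploy.mig-manager` key. It only filters label text from stdin.

```shell
# Extract MIG-related labels from `kubectl get node --show-labels` output on stdin.
# Assumption: the labels of interest start with nvidia.com/mig. or are
# nvidia.com/gpu.deploy.mig-manager; prints one label per line.
mig_labels() {
  grep -oE 'nvidia\.com/(mig\.|gpu\.deploy\.mig-manager)[^,[:space:]]*'
}

# Example usage on a cluster (commented out so the sketch has no side effects):
#   kubectl get node master1 --show-labels --no-headers | mig_labels
```

An empty result would suggest the node has not been labeled as MIG-capable, which is one possible reason for a mig-manager pod not being scheduled.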

Labels: lifecycle/stale (denotes an issue or PR that has remained open with no activity and has become stale)
