Closed
Labels
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
Description
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu22.04
- Kernel Version: 5.15.0-60-generic
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): v1.24.15
- GPU Operator Version: v23.6.0
- GPU: A100 PCIe 40GB
2. Issue or feature description
The driver and NFD were installed in advance. When I then tried to install gpu-operator in the A100 environment, the installation failed and the mig-manager pod was missing.
GPU:
root@master1:~# lspci | grep NVIDIA
2f:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
The gpu-operator pod info:
root@master1:~# kubectl get pod -n gpu-operator
NAME                                      READY  STATUS                 RESTARTS         AGE
gpu-feature-discovery-kht9c               1/1    Running                1 (36m ago)      57m
gpu-operator-8597b78788-4ncg7             1/1    Running                1 (36m ago)      57m
nvidia-container-toolkit-daemonset-pldv5  1/1    Running                1 (36m ago)      57m
nvidia-cuda-validator-tgqqk               0/1    Init:CrashLoopBackOff  1 (17s ago)      19s
nvidia-dcgm-exporter-m7hg7                1/1    Running                1 (36m ago)      57m
nvidia-device-plugin-daemonset-gjlp7      0/1    CrashLoopBackOff       17 (4m47s ago)   57m
nvidia-operator-validator-7969z           0/1    Init:2/4               6 (2m59s ago)    57m
There is no nvidia-mig-manager pod.
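For triage, the pod listing above can also be checked mechanically. The following is only a minimal sketch: the helper names (`parse_pods`, `unhealthy`, `has_component`) are hypothetical, and the parsing assumes the plain-text column layout of `kubectl get pod` exactly as pasted in this report (NAME, READY, STATUS as the first three whitespace-separated fields).

```python
# Hypothetical triage helpers, not part of gpu-operator or kubectl.
# Assumption: the first three whitespace-separated fields of each row
# are NAME, READY (as "ready/total") and STATUS, as in the listing above.

LISTING = """\
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-kht9c 1/1 Running 1 (36m ago) 57m
gpu-operator-8597b78788-4ncg7 1/1 Running 1 (36m ago) 57m
nvidia-container-toolkit-daemonset-pldv5 1/1 Running 1 (36m ago) 57m
nvidia-cuda-validator-tgqqk 0/1 Init:CrashLoopBackOff 1 (17s ago) 19s
nvidia-dcgm-exporter-m7hg7 1/1 Running 1 (36m ago) 57m
nvidia-device-plugin-daemonset-gjlp7 0/1 CrashLoopBackOff 17 (4m47s ago) 57m
nvidia-operator-validator-7969z 0/1 Init:2/4 6 (2m59s ago) 57m
"""


def parse_pods(listing):
    """Parse NAME, READY and STATUS from `kubectl get pod` text output."""
    pods = []
    for line in listing.strip().splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "NAME":
            continue  # skip the header row and malformed lines
        ready, total = fields[1].split("/")
        pods.append({
            "name": fields[0],
            "ready": int(ready),
            "total": int(total),
            "status": fields[2],
        })
    return pods


def unhealthy(pods):
    """Names of pods with unready containers, ignoring Completed pods."""
    return [
        p["name"]
        for p in pods
        if p["ready"] < p["total"] and p["status"] != "Completed"
    ]


def has_component(pods, prefix):
    """True if any pod name starts with the given component prefix."""
    return any(p["name"].startswith(prefix) for p in pods)


PODS = parse_pods(LISTING)
print("unhealthy pods:", unhealthy(PODS))
print("mig-manager present:", has_component(PODS, "nvidia-mig-manager"))
```

Run against the listing in this report, the sketch flags the three pods that are not fully ready and confirms that no pod name starts with `nvidia-mig-manager`.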
The logs of the failing pod are as follows:
root@master1:~# kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-gjlp7
...
I0109 12:33:24.644789 1 main.go:256] Retreiving plugins.
I0109 12:33:24.645553 1 factory.go:107] Detected NVML platform: found NVML library
I0109 12:33:24.645594 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0109 12:33:24.699911 1 main.go:123] error starting plugins: error getting plugins: failed to construct NVML resource managers: error building device map: error building device map from config.resources: invalid MIG configuration: At least one device with migEnabled=true was not configured correctly: error visiting device: device 0 has an invalid MIG configuration
3. Steps to reproduce the issue
- Install the k8s cluster;
- Install NFD:
root@master1:~# kubectl get pod -n node-feature-discovery
NAME                                                         READY  STATUS    RESTARTS       AGE
nfd-release-node-feature-discovery-master-5564946bcf-x6qzs  1/1    Running   13 (49m ago)   43d
nfd-release-node-feature-discovery-worker-x7nff              1/1    Running   11 (49m ago)   43d
