Installation failed k8s-device-plugin(v0.9.0) #253
Comments
@Kwonho could you describe your setup a little bit more clearly? The code path for the error you are seeing should only be triggered if one (or more) of the devices on your system are configured with MIG mode enabled and a If you are using this in "standalone" mode (i.e. without the GPU operator), it may be that the underlying NVIDIA Container Toolkit components also need to be updated.
@elezar I used a DGX-A100 for the MIG test. When I install v0.7.3, there is no problem.
Are you using the GPU-operator? Or is this a standard device plugin install? Did you update the NVIDIA Container Runtime components as part of updating to 0.9.0? Which versions of
If I recall correctly, there was a change in
I am using a standard device plugin install (helm or yaml). The runtime component versions are below.
Could you update I will create a ticket to track adding this requirement to the documentation.
I am also facing an issue while deploying the NVIDIA device plugin v0.9.0 on an A100 GPU with MIG enabled.
kubectl nvidia-plugin logs
Hi @anaconda2196. Is there only a single device in the host? Which version of the CUDA driver and CUDA Container Toolkit (nvidia-docker) do you have installed? See https://docs.nvidia.com/datacenter/cloud-native/kubernetes/mig-k8s.html#mig-support-in-kubernetes
Hi @elezar, k8s version: 1.20.2. With migstrategy=single I face the same issue, not only with nvidia-plugin v0.9.0 but also with v0.7.0.
migstrategy - single
The problem is with the resource type on my A100 GPU node. I am getting
with migstrategy=single After upgrading drivers
The gpu-feature-discovery pod runs correctly and applies the correct labels to the A100 GPU node whether migstrategy is single or mixed. The problem is with the nvidia-plugin pod, which is crashing.
v0.9.0
v0.7.0
Same here, crashlooping with 0.12.2:
panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0
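Two different fatal messages have been quoted in this thread so far, and they are easy to mix up when skimming pod logs. As a rough triage aid, the sketch below scans log text for those two messages and reports which one occurred and for which path. The pattern strings are copied from the logs quoted here; the names `PATTERNS` and `classify` are hypothetical, and nothing below is part of the device plugin itself.

```python
import re

# Both patterns are copied from messages quoted in this thread.
# The dictionary keys are hypothetical labels, not plugin error codes.
PATTERNS = {
    "missing_capability_path": re.compile(
        r"missing MIG GPU instance capability path: (\S+)"
    ),
    "no_mig_devices": re.compile(r"No MIG devices associated with (\S+)"),
}


def classify(log_text):
    """Return a sorted list of unique (kind, path) pairs found in log_text."""
    found = set()
    for line in log_text.splitlines():
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                found.add((kind, match.group(1)))
    return sorted(found)


if __name__ == "__main__":
    sample = (
        "panic: At least one device with migEnabled=true was not configured "
        "correctly: No MIG devices associated with /dev/nvidia0"
    )
    for kind, path in classify(sample):
        print(kind, path)
```

Other failure messages may exist that this does not recognize; an empty result only means neither of these two strings was found.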
Looks like this is a race condition issue. Having the label
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
1. Issue or feature description
I could install k8s-device-plugin (v0.7.3), but when I try to upgrade to v0.9.0, the following errors occur.
2. Information to attach (optional if deemed irrelevant)
Common error checking:
2021/06/07 07:38:35 Loading NVML
2021/06/07 07:38:35 Starting FS watcher.
2021/06/07 07:38:35 Starting OS watcher.
2021/06/07 07:38:35 Retreiving plugins.
2021/06/07 07:38:35 Fatal: missing MIG GPU instance capability path: /proc/driver/nvidia/capabilities/gpu0/mig/gi7/access
2021/06/07 07:38:35 Shutdown of NVML returned:
panic: Fatal: missing MIG GPU instance capability path: /proc/driver/nvidia/capabilities/gpu0/mig/gi7/access
goroutine 1 [running]:
log.Panicln(0xc42057b910, 0x2, 0x2)
/usr/local/go/src/log/log.go:340 +0xc0
main.check(0xadec60, 0xc420481000)
/go/src/nvidia-device-plugin/nvidia.go:61 +0x81
main.(*MigDeviceManager).Devices(0xc42000c500, 0x0, 0x0, 0x0)
/go/src/nvidia-device-plugin/nvidia.go:129 +0x287
main.start(0xc4202c0ec0, 0x0, 0x0)
/go/src/nvidia-device-plugin/main.go:155 +0x5d1
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc420432000, 0xae5a40, 0xc42002c018, 0xc42001e070, 0x7, 0x7, 0x0, 0x0)
/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc420432000, 0xc42001e070, 0x7, 0x7, 0x4567e0, 0xc42034df50)
/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
/go/src/nvidia-device-plugin/main.go:88 +0x751
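The fatal error in the log above names a capability file that the plugin expected to find under /proc. To check by hand on the node whether such files exist, something like the following sketch could be used. The path layout is copied from the error message; the helper names `gi_access_path` and `missing_gi_paths` are hypothetical, and which GPU instance IDs to check is an assumption the reader has to fill in from their own MIG configuration.

```python
from pathlib import Path

# Path layout copied from the error message in this issue. The root is a
# parameter so the check can be tried without an NVIDIA driver installed.
DEFAULT_ROOT = Path("/proc/driver/nvidia/capabilities")


def gi_access_path(gpu_index, gi_id, root=DEFAULT_ROOT):
    """Build the capability path for one GPU instance on one GPU."""
    return Path(root) / f"gpu{gpu_index}" / "mig" / f"gi{gi_id}" / "access"


def missing_gi_paths(gpu_index, gi_ids, root=DEFAULT_ROOT):
    """Return the capability paths that do not exist for the given GPU instances."""
    paths = (gi_access_path(gpu_index, gi_id, root) for gi_id in gi_ids)
    return [path for path in paths if not path.exists()]


if __name__ == "__main__":
    # GPU 0 and GPU instance 7 are the values from the log above.
    for path in missing_gi_paths(0, [7]):
        print(f"missing: {path}")
```

If the file is reported missing on the host as well, that would be consistent with the maintainers' suggestion elsewhere in this thread to look at the driver and container toolkit versions, though this check alone does not identify the cause.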
Additional information that might help better understand your environment and reproduce the bug:
Kubernetes version is below
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:07:13Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}