Installation failed k8s-device-plugin(v0.9.0) #253

Open · 1 task
Kwonho opened this issue Jun 7, 2021 · 13 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@Kwonho

Kwonho commented Jun 7, 2021

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

I could install k8s-device-plugin (v0.7.3), but when I try to upgrade to v0.9.0, the errors below occur.

2. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The k8s-device-plugin container logs
    2021/06/07 07:38:35 Loading NVML
    2021/06/07 07:38:35 Starting FS watcher.
    2021/06/07 07:38:35 Starting OS watcher.
    2021/06/07 07:38:35 Retreiving plugins.
    2021/06/07 07:38:35 Fatal: missing MIG GPU instance capability path: /proc/driver/nvidia/capabilities/gpu0/mig/gi7/access
    2021/06/07 07:38:35 Shutdown of NVML returned:
    panic: Fatal: missing MIG GPU instance capability path: /proc/driver/nvidia/capabilities/gpu0/mig/gi7/access

goroutine 1 [running]:
log.Panicln(0xc42057b910, 0x2, 0x2)
/usr/local/go/src/log/log.go:340 +0xc0
main.check(0xadec60, 0xc420481000)
/go/src/nvidia-device-plugin/nvidia.go:61 +0x81
main.(*MigDeviceManager).Devices(0xc42000c500, 0x0, 0x0, 0x0)
/go/src/nvidia-device-plugin/nvidia.go:129 +0x287
main.start(0xc4202c0ec0, 0x0, 0x0)
/go/src/nvidia-device-plugin/main.go:155 +0x5d1
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc420432000, 0xae5a40, 0xc42002c018, 0xc42001e070, 0x7, 0x7, 0x0, 0x0)
/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc420432000, 0xc42001e070, 0x7, 0x7, 0x4567e0, 0xc42034df50)
/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
/go/src/nvidia-device-plugin/main.go:88 +0x751

Additional information that might help better understand your environment and reproduce the bug:

The Kubernetes version is below:
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:07:13Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}

@elezar
Member

elezar commented Jun 7, 2021

@Kwonho could you describe your setup a little bit more clearly? The code path for the error you are seeing should only be triggered if one (or more) of the devices on your system are configured with MIG mode enabled and a mig.strategy other than mig.strategy=none is configured.

If you are using this in "standalone" mode (i.e. without the GPU operator), it may be that the underlying NVIDIA Container Toolkit components also need to be updated.
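
For reference, a sketch of how the strategy is usually passed when installing the plugin standalone via Helm (commands follow the plugin README's install flow for v0.9.0; adjust the release name and namespace to your setup):

$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ helm repo update
# migStrategy=none treats MIG-enabled GPUs as whole GPUs; single/mixed expose the MIG devices
$ helm install nvdp nvdp/nvidia-device-plugin \
    --namespace kube-system \
    --version 0.9.0 \
    --set migStrategy=single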

@Kwonho
Author

Kwonho commented Jun 7, 2021

@elezar I used a DGX A100 for the MIG test. When I install v0.7.3, there is no problem,
but when I install v0.9.0 the same way, the errors occur.
If you need more information, please let me know. Thanks.

@elezar
Member

elezar commented Jun 7, 2021

Are you using the GPU-operator? Or is this a standard device plugin install?

Did you update the NVIDIA Container Runtime components as part of updating to 0.9.0? Which versions of libnvidia-container, nvidia-container-runtime, nvidia-container-toolkit, and nvidia-docker2 are installed (if any)?
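
(For reference, a quick way to list the installed components, on Debian/Ubuntu and RHEL-based hosts respectively:)

$ dpkg -l | grep -E 'nvidia-(docker2|container)|libnvidia-container'
$ yum list installed | grep nvidia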

@elezar
Member

elezar commented Jun 7, 2021

If I recall correctly, there was a change in libnvidia-container 1.4.0 that was required due to how the /proc/driver/nvidia folder was being managed by the driver. This may be what we're seeing here.
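
As a quick check, the capability paths the plugin is panicking on can be inspected directly on the host (the gpu0/gi7 path below mirrors the one in the panic message; the actual GPU and instance IDs will differ per system):

$ ls /proc/driver/nvidia/capabilities/
$ ls /proc/driver/nvidia/capabilities/gpu0/mig/
$ cat /proc/driver/nvidia/capabilities/gpu0/mig/gi7/access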

@Kwonho
Author

Kwonho commented Jun 7, 2021

I am using a standard device plugin install (Helm or YAML).

The runtime component versions are below:
ii libnvidia-container1:amd64 1.1.0-1
ii nvidia-container-runtime 3.1.4-1
ii nvidia-container-toolkit 1.0.6-1
ii nvidia-docker2 2.2.2-1

@elezar
Member

elezar commented Jun 8, 2021

Could you update nvidia-docker2 to 2.6.0? This should pull in the other dependencies.

I will create a ticket to track adding this requirement to the documentation.
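
A minimal upgrade sketch for a Debian/Ubuntu host (on RHEL-based systems the equivalent goes through yum/dnf); upgrading nvidia-docker2 should pull in the newer libnvidia-container1, nvidia-container-toolkit, and nvidia-container-runtime as dependencies:

$ sudo apt-get update
$ sudo apt-get install --only-upgrade nvidia-docker2
$ sudo systemctl restart docker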

@anaconda2196

I am also facing this issue while deploying the NVIDIA device plugin v0.9.0.

A100 GPU with MIG enabled.

nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684)
  MIG 3g.20gb Device 0: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/1/0)
  MIG 2g.10gb Device 1: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/5/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/13/0)

nvidia-device-plugin pod logs (kubectl logs):

panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684

goroutine 1 [running]:
main.(*migStrategyMixed).GetPlugins(0x1042638, 0x5, 0xae1140, 0x1042638)
	/go/src/nvidia-device-plugin/mig-strategy.go:171 +0xa41
main.start(0xc4201a6e80, 0x0, 0x0)
	/go/src/nvidia-device-plugin/main.go:146 +0x54c
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc42017b080, 0xae5a40, 0xc42019e010, 0xc4201a8000, 0x7, 0x7, 0x0, 0x0)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc42017b080, 0xc4201a8000, 0x7, 0x7, 0x4567e0, 0xc420211f50)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
	/go/src/nvidia-device-plugin/main.go:88 +0x751

@elezar
Member

elezar commented Jul 15, 2021

Hi @anaconda2196. Is there only a single device in the host?

Which version of the CUDA driver and CUDA Container Toolkit (nvidia-docker) do you have installed? See https://docs.nvidia.com/datacenter/cloud-native/kubernetes/mig-k8s.html#mig-support-in-kubernetes

@anaconda2196

anaconda2196 commented Jul 15, 2021

Hi @elezar

k8s version - 1.20.2

If I try with mig strategy: single, I face the same issue, not only with nvidia-device-plugin v0.9.0 but also with v0.7.0 (see #257).

yum list installed | grep nvidia
Repository libnvidia-container is listed more than once in the configuration
Repository libnvidia-container-experimental is listed more than once in the configuration
Repository nvidia-container-runtime is listed more than once in the configuration
Repository nvidia-container-runtime-experimental is listed more than once in the configuration
libnvidia-container-tools.x86_64
                                1.4.0-1                        @libnvidia-container
libnvidia-container1.x86_64     1.4.0-1                        @libnvidia-container
nvidia-container-runtime.x86_64 3.5.0-1                        @nvidia-container-runtime
nvidia-container-toolkit.x86_64 1.5.1-2                        @nvidia-container-runtime
nvidia-docker2.noarch           2.6.0-1                        @nvidia-docker   


migstrategy - single

# nvidia-smi
Thu Jul 15 13:58:34 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:86:00.0 Off |                   On |
| N/A   53C    P0    34W / 250W |     25MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    8   0   1  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   3  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   4  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   5  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   14   0   6  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/7/0)
  MIG 1g.5gb Device 1: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/8/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/9/0)
  MIG 1g.5gb Device 3: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/11/0)
  MIG 1g.5gb Device 4: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/12/0)
  MIG 1g.5gb Device 5: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/13/0)
  MIG 1g.5gb Device 6: (UUID: MIG-GPU-fe14a161-d1f3-7706-c257-3b22fe15c684/14/0)


# nvidia-smi mig -lgi
+----------------------------------------------------+
| GPU instances:                                     |
| GPU   Name          Profile  Instance   Placement  |
|                       ID       ID       Start:Size |
|====================================================|
|   0  MIG 1g.5gb       19        7          4:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19        8          5:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19        9          6:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       11          0:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       12          1:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       13          2:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       14          3:1     |
+----------------------------------------------------+

# nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances:                                                 |
| GPU     GPU       Name             Profile   Instance   Placement  |
|       Instance                       ID        ID       Start:Size |
|         ID                                                         |
|====================================================================|
|   0      7       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0      8       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0      9       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     11       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     12       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     13       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     14       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+

kubectl get node GPU-NODE -o yaml
It looks like the gpu-feature-discovery pod is running correctly and has assigned labels to the A100 GPU node.

labels:
...


nvidia.com/cuda.driver.major=450
                    nvidia.com/cuda.driver.minor=80
                    nvidia.com/cuda.driver.rev=02
                    nvidia.com/cuda.runtime.major=11
                    nvidia.com/cuda.runtime.minor=0
                    nvidia.com/gfd.timestamp=1626381819
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=0
                    nvidia.com/gpu.count=7
                    nvidia.com/gpu.engines.copy=1
                    nvidia.com/gpu.engines.decoder=0
                    nvidia.com/gpu.engines.encoder=0
                    nvidia.com/gpu.engines.jpeg=0
                    nvidia.com/gpu.engines.ofa=0
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=ProLiant-DL380-Gen10
                    nvidia.com/gpu.memory=4864
                    nvidia.com/gpu.multiprocessors=14
                    nvidia.com/gpu.product=A100-PCIE-40GB-MIG-1g.5gb
                    nvidia.com/gpu.slices.ci=1
                    nvidia.com/gpu.slices.gi=1
                    nvidia.com/mig.strategy=single
...

kubectl -n kube-system get pods
NAME                                                          READY   STATUS    RESTARTS   AGE

gpu-feature-discovery-dfrwf                                   1/1     Running   0          6m18s

nfd-master-6dd87d999-spkqp                                    1/1     Running   0          6m33s
nfd-worker-w2wbf                                              1/1     Running   0          6m33s
nvidia-device-plugin-9p462                                    0/1     Error     6          6m37s

$ kubectl -n kube-system logs nvidia-device-plugin-9p462
2021/07/15 20:49:32 Loading NVML
2021/07/15 20:49:32 Starting FS watcher.
2021/07/15 20:49:32 Starting OS watcher.
2021/07/15 20:49:32 Retreiving plugins.
2021/07/15 20:49:32 Shutdown of NVML returned: <nil>
panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684

goroutine 1 [running]:
main.(*migStrategySingle).GetPlugins(0x1042638, 0x6, 0xae11c0, 0x1042638)
	/go/src/nvidia-device-plugin/mig-strategy.go:102 +0x890
main.start(0xc4201acec0, 0x0, 0x0)
	/go/src/nvidia-device-plugin/main.go:146 +0x54c
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc42017ad80, 0xae5a40, 0xc4201a4010, 0xc4201ae000, 0x7, 0x7, 0x0, 0x0)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc42017ad80, 0xc4201ae000, 0x7, 0x7, 0x4567e0, 0xc420219f50)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
	/go/src/nvidia-device-plugin/main.go:88 +0x751

The problem is with the resource type on my A100 GPU node. I am getting:

kubectl describe node
...
Capacity:
  nvidia.com/gpu: 0
...
Allocatable:
  nvidia.com/gpu: 0
...

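With mig.strategy=single the expectation is a non-zero nvidia.com/gpu count (7 here, one per MIG device); with mixed, per-profile resources such as nvidia.com/mig-1g.5gb instead. A quick way to check what the node actually advertises (node name is a placeholder):

$ kubectl get node <gpu-node> -o jsonpath='{.status.allocatable}'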

@anaconda2196

With migstrategy=single, checked with both versions (v0.7.0 and v0.9.0).

After upgrading the drivers:

nvidia-smi
Thu Jul 15 16:40:54 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:86:00.0 Off |                   On |
| N/A   56C    P0    35W / 250W |     13MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    8   0   1  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   3  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   4  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   5  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   14   0   6  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# yum list installed | grep nvidia
Repository libnvidia-container is listed more than once in the configuration
Repository libnvidia-container-experimental is listed more than once in the configuration
Repository nvidia-container-runtime is listed more than once in the configuration
Repository nvidia-container-runtime-experimental is listed more than once in the configuration
libnvidia-container-tools.x86_64
                                1.4.0-1                        @libnvidia-container
libnvidia-container1.x86_64     1.4.0-1                        @libnvidia-container
nvidia-container-runtime.x86_64 3.5.0-1                        @nvidia-container-runtime
nvidia-container-toolkit.x86_64 1.5.1-2                        @nvidia-container-runtime
nvidia-docker2.noarch           2.6.0-1                        @nvidia-docker   

nvidia-smi mig -lgi
+----------------------------------------------------+
| GPU instances:                                     |
| GPU   Name          Profile  Instance   Placement  |
|                       ID       ID       Start:Size |
|====================================================|
|   0  MIG 1g.5gb       19        7          4:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19        8          5:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19        9          6:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       11          0:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       12          1:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       13          2:1     |
+----------------------------------------------------+
|   0  MIG 1g.5gb       19       14          3:1     |
+----------------------------------------------------+

nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances:                                                 |
| GPU     GPU       Name             Profile   Instance   Placement  |
|       Instance                       ID        ID       Start:Size |
|         ID                                                         |
|====================================================================|
|   0      7       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0      8       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0      9       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     11       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     12       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     13       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+
|   0     14       MIG 1g.5gb           0         0          0:1     |
+--------------------------------------------------------------------+

The gpu-feature-discovery pod runs correctly and applies the correct labels to the A100 GPU node with either migstrategy=single or mixed.

The problem is the nvidia-device-plugin pod, which is crash-looping.

v0.9.0

kubectl -n kube-system logs nvidia-device-plugin-xgv7t
2021/07/15 23:49:47 Loading NVML
2021/07/15 23:49:47 Starting FS watcher.
2021/07/15 23:49:47 Starting OS watcher.
2021/07/15 23:49:47 Retreiving plugins.
2021/07/15 23:49:47 Shutdown of NVML returned: <nil>
panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0: GPU-fe14a161-d1f3-7706-c257-3b22fe15c684

goroutine 1 [running]:
main.(*migStrategySingle).GetPlugins(0x1042638, 0x6, 0xae11c0, 0x1042638)
	/go/src/nvidia-device-plugin/mig-strategy.go:102 +0x890
main.start(0xc42016eec0, 0x0, 0x0)
	/go/src/nvidia-device-plugin/main.go:146 +0x54c
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).RunContext(0xc4202e2000, 0xae5a40, 0xc42002c018, 0xc42001e070, 0x7, 0x7, 0x0, 0x0)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:315 +0x6c8
nvidia-device-plugin/vendor/github.com/urfave/cli/v2.(*App).Run(0xc4202e2000, 0xc42001e070, 0x7, 0x7, 0x4567e0, 0xc4201fbf50)
	/go/src/nvidia-device-plugin/vendor/github.com/urfave/cli/v2/app.go:215 +0x61
main.main()
	/go/src/nvidia-device-plugin/main.go:88 +0x751


v0.7.0

kubectl -n kube-system logs nvidia-device-plugin-jl85z
2021/07/15 23:57:37 Loading NVML
2021/07/15 23:57:37 Starting FS watcher.
2021/07/15 23:57:37 Starting OS watcher.
2021/07/15 23:57:37 Retreiving plugins.
2021/07/15 23:57:37 Shutdown of NVML returned: <nil>
panic: No MIG devices present on node

goroutine 1 [running]:
main.(*migStrategySingle).GetPlugins(0xfdbb58, 0x6, 0xa9a700, 0xfdbb58)
	/go/src/nvidia-device-plugin/mig-strategy.go:115 +0x43f
main.main()
	/go/src/nvidia-device-plugin/main.go:103 +0x413

@dimm0

dimm0 commented Jun 23, 2022

Same here, crashlooping with 0.12.2

panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0

@abn

abn commented Aug 24, 2022

Looks like this is a race condition. Having the label nvidia.com/mig.config set on the node in question should trigger the mig-manager, allowing the device plugin to succeed.
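
A hypothetical sketch of what that looks like with kubectl (all-1g.5gb is just an example profile from the default mig-parted configs; use whichever profile your mig-manager config defines, and substitute your node name):

$ kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.5gb --overwrite
# mig-manager should then reconfigure the GPU, after which the device plugin pod can start cleanly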


This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 28, 2024