GPU CI is not running #14346

@otaj

🐛 Bug

GPU CI is not running. The job already fails at the "Initialize containers" step with an NVML driver/library version mismatch error, before any tests start. The log below is taken from the affected job.

[View raw log](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_apis/build/builds/91171/logs/5)

Starting: Initialize containers
/usr/bin/docker version --format '{{.Server.APIVersion}}'
'1.41'
Docker daemon API version: '1.41'
/usr/bin/docker version --format '{{.Client.APIVersion}}'
'1.41'
Docker client API version: '1.41'
/usr/bin/docker ps --all --quiet --no-trunc --filter "label=b0e97f"
/usr/bin/docker network prune --force --filter "label=b0e97f"
/usr/bin/docker pull pytorchlightning/pytorch_lightning:base-cuda-py3.9-torch1.12-cuda11.3.1
base-cuda-py3.9-torch1.12-cuda11.3.1: Pulling from pytorchlightning/pytorch_lightning
Digest: sha256:7ecd427dfdd098338d638676d9a985ece75dc4feed634feb2f67a28f9de8df88
Status: Image is up to date for pytorchlightning/pytorch_lightning:base-cuda-py3.9-torch1.12-cuda11.3.1
docker.io/pytorchlightning/pytorch_lightning:base-cuda-py3.9-torch1.12-cuda11.3.1
/usr/bin/docker info -f "{{range .Plugins.Network}}{{println .}}{{end}}"
bridge
host
ipvlan
macvlan
null
overlay
/usr/bin/docker network create --label b0e97f vsts_network_0d959b252f39483d928a2e3d23ebe36d
d6c557a66378ac2495f9b88fdf224792fb90d7edb4d9da5a3de7bdcf32004276
/usr/bin/docker inspect --format="{{index .Config.Labels \"com.azure.dev.pipelines.agent.handler.node.path\"}}" pytorchlightning/pytorch_lightning:base-cuda-py3.9-torch1.12-cuda11.3.1
/usr/bin/docker create --name 9ddd9a0fd7fb4af58cc739cebeb6f4bf_image_264214 --label b0e97f --network vsts_network_0d959b252f39483d928a2e3d23ebe36d --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all --shm-size=512m -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/agent/_work/1":"/__w/1" -v "/agent/_work/_temp":"/__w/_temp" -v "/agent/_work/_tasks":"/__w/_tasks" -v "/agent/_work/_tool":"/__t" -v "/agent/externals":"/__a/externals":ro -v "/agent/_work/.taskkey":"/__w/.taskkey" pytorchlightning/pytorch_lightning:base-cuda-py3.9-torch1.12-cuda11.3.1 "/__a/externals/node/bin/node" -e "setInterval(function(){}, 24 * 60 * 60 * 1000);"
88023cc7e7845f3c62d253bbebcc1acebfcf833d7085c518e39f3da039358385
/usr/bin/docker start 88023cc7e7845f3c62d253bbebcc1acebfcf833d7085c518e39f3da039358385
Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown
Error: failed to start containers: 88023cc7e7845f3c62d253bbebcc1acebfcf833d7085c518e39f3da039358385
##[error]Docker start fail with exit code 1
Finishing: Initialize containers
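The `driver/library version mismatch` message from NVML usually means the loaded NVIDIA kernel module and the user-space `libnvidia-ml` library on the agent host report different versions, which often happens after a driver package upgrade without a reboot or module reload. The log does not confirm that this is the cause here, so the following is only a minimal diagnostic sketch to run on the agent host. The function names are hypothetical, and the library directories are assumptions that vary by distribution.

```python
import re
from pathlib import Path
from typing import Iterable, Optional

# Matches the version in /proc/driver/nvidia/version, e.g.
# "NVRM version: NVIDIA UNIX x86_64 Kernel Module  <version>  <build date>"
_KERNEL_RE = re.compile(r"Kernel Module(?:\s+for\s+\S+)?\s+(\d+(?:\.\d+)+)")
# Matches the fully versioned library file, not the libnvidia-ml.so.1 symlink.
_LIB_RE = re.compile(r"libnvidia-ml\.so\.(\d+(?:\.\d+)+)$")


def kernel_module_version(proc_text: str) -> Optional[str]:
    """Extract the loaded kernel module version from the /proc text."""
    match = _KERNEL_RE.search(proc_text)
    return match.group(1) if match else None


def library_version(filenames: Iterable[str]) -> Optional[str]:
    """Return the version suffix of the first versioned libnvidia-ml file."""
    for name in filenames:
        match = _LIB_RE.search(name)
        if match:
            return match.group(1)
    return None


def versions_mismatch(kernel: Optional[str], library: Optional[str]) -> bool:
    """True only when both versions are known and they differ."""
    return kernel is not None and library is not None and kernel != library


if __name__ == "__main__":
    proc_file = Path("/proc/driver/nvidia/version")
    kernel = kernel_module_version(proc_file.read_text()) if proc_file.exists() else None

    # Assumed library locations (Debian/Ubuntu and RHEL-style layouts).
    candidates = []
    for lib_dir in (Path("/usr/lib/x86_64-linux-gnu"), Path("/usr/lib64")):
        if lib_dir.is_dir():
            candidates.extend(str(p) for p in lib_dir.glob("libnvidia-ml.so.*"))
    library = library_version(sorted(candidates))

    print(f"kernel module: {kernel}, libnvidia-ml: {library}")
    if versions_mismatch(kernel, library):
        print("Versions differ: reboot the host or reload the NVIDIA kernel modules.")
```

If the two versions differ, rebooting the agent machine (or unloading and reloading the NVIDIA kernel modules) is the usual remedy. If they match, the mismatch lies elsewhere and this check will not explain it.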

cc @tchaton @rohitgr7 @carmocca @akihironitta @Borda

Labels

bug (Something isn't working), ci (Continuous Integration), priority: 0 (High priority task)
