nvidia-container-cli timeout error when running ECS tasks #3960

Open
monirul opened this issue May 15, 2024 · 0 comments
Labels
area/accelerated-computing (Issues related to GPUs/ASICs), status/needs-proposal (Needs a more detailed proposal for next steps), type/bug (Something isn't working)

Comments

@monirul
Contributor

monirul commented May 15, 2024

Image I'm using:
aws-ecs-2-nvidia (1.20.0)

What I expected to happen:
The container that requires an NVIDIA GPU should run successfully on the ECS variant of Bottlerocket, and the ECS task should complete successfully.

What actually happened:
When I tried to run a workload that requires an NVIDIA GPU in the ECS cluster, the ECS task failed with the following error:

CannotStartContainerError: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: initialization error: driver rpc error: timed out: unknown containerKnown 

How to reproduce the problem:

  1. Create an ECS cluster.
  2. Provision a p5 instance with the ECS NVIDIA variant and configure it to join the ECS cluster created in the first step.
  3. Create a task that runs a workload requiring an NVIDIA GPU (in my case, the NVIDIA smoke test); see the task definition sketch after this list.
  4. Launch the task in the ECS cluster.
  5. Observe the error message indicating a failure to start the container.
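
For context, a minimal container definition fragment that requests a GPU from ECS could look roughly like the sketch below; the family, image, and command are illustrative placeholders, not the exact workload I ran:

```json
{
  "family": "gpu-smoke-test",
  "requiresCompatibilities": ["EC2"],
  "containerDefinitions": [
    {
      "name": "gpu-smoke-test",
      "image": "nvidia/cuda:12.2.0-base-ubuntu22.04",
      "command": ["nvidia-smi"],
      "memory": 512,
      "resourceRequirements": [
        { "type": "GPU", "value": "1" }
      ]
    }
  ]
}
```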

Root Cause
The issue is caused by a timeout while loading the NVIDIA driver right before running the container. Generally, the NVIDIA driver gets unloaded when there is no client connected to it. If the kernel mode driver is not already running or connected to a target GPU, the invocation of any program that attempts to interact with that GPU will transparently cause the driver to load and/or initialize the GPU. When that on-demand initialization takes too long, nvidia-container-cli hits its RPC timeout and the container fails to start.

Workaround:
To avoid the timeout error, we can enable NVIDIA driver persistence mode by running the command nvidia-smi -pm 1. This keeps the GPUs initialized even when no clients are connected and prevents the kernel module from fully unloading software and hardware state when there are no connected clients. This way, the driver does not need to be loaded again before running the containers, which prevents the timeout error.
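
A minimal sketch of applying and checking the workaround on the host, assuming the driver's nvidia-smi is on the PATH:

```sh
# Enable persistence mode on all GPUs (the legacy nvidia-smi mechanism).
nvidia-smi -pm 1

# Verify that persistence mode is now reported as Enabled for each GPU.
nvidia-smi --query-gpu=index,persistence_mode --format=csv
```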

Solution
According to NVIDIA documentation, to address this error and minimize the initial driver load time, NVIDIA provides a user-space daemon for Linux, nvidia-persistenced. The daemon keeps driver state persistent across CUDA job runs, which is a more reliable solution than the persistence mode workaround above.
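
As a rough illustration of how the daemon is typically run on a generic Linux host (not the exact Bottlerocket integration; the unprivileged account name is an assumption):

```sh
# Start the NVIDIA persistence daemon; it initializes the GPUs and keeps
# the driver state loaded even when no CUDA clients are connected.
nvidia-persistenced --user nvidia-persistenced --verbose
```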

Proposal
I propose including the nvidia-persistenced binary, which ships with the NVIDIA driver, in Bottlerocket, and running it as a systemd unit so that the NVIDIA driver remains loaded and available, preventing the timeout error from occurring.
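
A minimal sketch of what such a unit could look like, loosely modeled on the sample service file NVIDIA ships with the driver; the binary path and Type=forking are assumptions rather than the final Bottlerocket integration:

```ini
# nvidia-persistenced.service (sketch)
[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
# Path is an assumption; Bottlerocket may install the binary elsewhere.
ExecStart=/usr/bin/nvidia-persistenced --verbose
# Remove the daemon's runtime directory when the unit stops.
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target
```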

@monirul added the type/bug (Something isn't working) and status/needs-triage (Pending triage or re-evaluation) labels on May 15, 2024
@monirul changed the title from "ECS task fails to run on bottlerocket with an error nvidia-container-cli: initialization error: driver rpc error: timed out" to "ECS task fails with an error nvidia-container-cli: initialization error: driver rpc error: timed out" on May 15, 2024
@monirul changed the title from "ECS task fails with an error nvidia-container-cli: initialization error: driver rpc error: timed out" to "nvidia-container-cli timeout error when running ECS tasks" on May 15, 2024
@vigh-m added the area/accelerated-computing (Issues related to GPUs/ASICs) and status/needs-proposal (Needs a more detailed proposal for next steps) labels and removed the status/needs-triage label on May 16, 2024