Containers losing access to GPUs after some time on AWS ECS Amazon Linux 2 with Nvidia 550.73 #465
Comments
@johnnybenson could there be something updating the container? As a workaround for this, you could see if requesting the following device nodes explicitly on the `docker run` command line helps.
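The exact device-node list did not survive in the comment above, so the following is only a sketch. It assumes the usual NVIDIA device nodes on a single-GPU host, and the image name `my-wgpu-app` is a placeholder:

```sh
# Sketch: explicitly request the common NVIDIA device nodes in addition to the
# GPU support injected by the NVIDIA container runtime. Image name is a placeholder.
docker run --rm --runtime=nvidia \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  --device /dev/nvidiactl \
  --device /dev/nvidia0 \
  --device /dev/nvidia-uvm \
  --device /dev/nvidia-uvm-tools \
  my-wgpu-app
```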
Thanks for the fast reply @elezar. I added the device nodes to my docker run command and I can see them in the container. The process inside the container still fails around an hour into running. Without restarting anything, while the original process continues to fail to obtain the device, I can still successfully access the GPU from a fresh process. This leads me to think that the issue may be in the application layer, or perhaps a bug in wgpu. Our workaround for now: we are able to detect when this happens, exit the program, then allow the service to spawn a new container instance. If our "unit of work" approaches 30-40 minutes, we may be in trouble again. For now, this feels like an okay path forward. Thanks again!
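For anyone reproducing this, a quick way to confirm from the host that the device nodes and the driver are still reachable inside the running container might look like the following sketch; the container name `wgpu-task` is a placeholder:

```sh
# Placeholder container name; substitute the actual container from `docker ps`.
docker exec wgpu-task sh -c 'ls -l /dev/nvidia*'   # are the device nodes still present inside?
docker exec wgpu-task nvidia-smi                    # does the driver still respond from inside?
nvidia-smi                                          # and from the host, for comparison
```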
For anyone with the same issue: seems like there is a workaround here. Run the nvidia command referenced there and set it to run at boot; see the sketch below.
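The command itself did not survive in the comment above. A commonly cited fix for GPUs disappearing from running containers (documented in #48 and NVIDIA's troubleshooting docs) is to recreate the `/dev/char` symlinks for the NVIDIA device nodes, so a plausible sketch, assuming that is the workaround being referred to, would be:

```sh
# Assumption: the workaround is the dev-char symlink fix from issue #48.
# Recreate the /dev/char symlinks for the NVIDIA device nodes now...
sudo nvidia-ctk system create-dev-char-symlinks --create-all

# ...and register the same command to run at every boot via root's crontab.
( sudo crontab -l 2>/dev/null; \
  echo '@reboot /usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all' ) | sudo crontab -
```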
Opening an issue here because I feel like I have exhausted what's out there.
I don't believe my setup fits the criteria from the pinned issue (#48), and I am not seeing any errors such as "Failed to initialize NVML: Unknown Error".
The containers start up, can access the GPU, and work great for minutes or hours until suddenly my program can no longer access the GPU and it remains in this state until I restart the task.
I'm using an ECS Optimized Amazon Linux AMI where I install nvidia drivers and the nvidia container toolkit.
The Docker container base uses `debian:sid-slim`, installs `libglvnd-dev`, includes env vars for `NVIDIA_DRIVER_CAPABILITIES=all` and `NVIDIA_VISIBLE_DEVICES=all`, and finally executes a binary compiled from Rust with wgpu 19.3. When the GPU is available, the adapter output from Rust / wgpu is: `Vulkan, Tesla T4, DiscreteGpu, NVIDIA (550.73)`.
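As a sketch only, the container setup described above would correspond to a Dockerfile along these lines; the binary name `gpu-worker` is a placeholder for the real Rust executable:

```dockerfile
# Sketch of the setup described above; "gpu-worker" is a placeholder binary name.
FROM debian:sid-slim

RUN apt-get update && \
    apt-get install -y --no-install-recommends libglvnd-dev && \
    rm -rf /var/lib/apt/lists/*

# Tell the NVIDIA container runtime to expose all GPUs and driver capabilities.
ENV NVIDIA_DRIVER_CAPABILITIES=all
ENV NVIDIA_VISIBLE_DEVICES=all

# The wgpu-based Rust binary, built elsewhere and copied into the image.
COPY gpu-worker /usr/local/bin/gpu-worker
ENTRYPOINT ["/usr/local/bin/gpu-worker"]
```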
I apologize for dumping out all of this information and asking for help here. I can't find any errors; the setup works until it doesn't. The problem described in NVIDIA/nvidia-docker#1469 led me to #48, which sounds very similar to the issue I am having, just with newer drivers and a newer toolkit.
Any advice on where to look to learn more and diagnose this better would be tremendously appreciated.
[ec2-user@ip-10-0-80-174 ~]$ uname -r
4.14.336-257.568.amzn2.x86_64

[ec2-user@ip-10-0-80-174 ~]$ runc --version
runc version 1.1.11
commit: 4bccb38cc9cf198d52bebf2b3a90cd14e7af8c06
spec: 1.0.2-dev
go: go1.20.12
libseccomp: 2.5.2

[ec2-user@ip-10-0-80-174 ~]$ nvidia-container-cli -V
cli-version: 1.13.5
lib-version: 1.13.5
build date: 2023-07-18T11:37+0000
build revision: 66607bd046341f7aad7de80a9f022f122d1f2fce
build compiler: gcc 7.3.1 20180712 (Red Hat 7.3.1-15)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
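For anyone trying to narrow down where access is lost when this happens, a few host-side checks may help. This is a sketch assuming standard NVIDIA and Docker tooling, with `<container>` as a placeholder:

```sh
# Does the host itself still see the GPU?
nvidia-smi

# Are the /dev/char symlinks for the NVIDIA devices still in place? (See issue #48.)
ls -l /dev/char | grep -i nvidia

# Does the driver still respond from inside the affected container?
docker exec <container> nvidia-smi

# Any driver or Xid errors logged around the time the container lost the GPU?
dmesg | grep -iE 'nvrm|xid'
```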