
Containers losing access to GPUs after some time on AWS ECS Amazon Linux 2 with Nvidia 550.73 #465

Closed
johnnybenson opened this issue Apr 17, 2024 · 3 comments

@johnnybenson

Opening an issue here because I feel like I have exhausted what's out there.

I don't believe my setup fits the criteria from the pinned issue (#48), and I am not seeing any errors such as "Failed to initialize NVML: Unknown Error".

The containers start up, can access the GPU, and work fine for minutes or hours; then my program suddenly loses access to the GPU and stays in that state until I restart the task.

I'm using an ECS-optimized Amazon Linux AMI on which I install the NVIDIA drivers and the NVIDIA Container Toolkit.
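
For reference, a rough sketch of that host-side setup on Amazon Linux 2, assuming the toolkit is installed from NVIDIA's yum repository (the driver install and repo setup steps are omitted; these commands are illustrative, not the exact ones used here):

# install the container toolkit and register the nvidia runtime with Docker
sudo yum install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker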

The Docker image is based on debian:sid-slim, installs libglvnd-dev, sets NVIDIA_DRIVER_CAPABILITIES=all and NVIDIA_VISIBLE_DEVICES=all, and finally runs a binary compiled from Rust with wgpu 19.3. When the GPU is available, the adapter reported by Rust/wgpu is: Vulkan, Tesla T4, DiscreteGpu, NVIDIA (550.73).
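
A minimal sketch of the image described above (the COPY source path is illustrative; bakery is the binary that appears in the nvidia-smi process list later in this report):

FROM debian:sid-slim
RUN apt-get update && apt-get install -y --no-install-recommends libglvnd-dev \
    && rm -rf /var/lib/apt/lists/*
ENV NVIDIA_DRIVER_CAPABILITIES=all
ENV NVIDIA_VISIBLE_DEVICES=all
# illustrative path; the release binary is built with wgpu 19.3
COPY target/release/bakery /usr/local/bin/bakery
ENTRYPOINT ["/usr/local/bin/bakery"]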

I apologize for dumping all of this information here and asking for help. I can't find any errors; the setup works until it doesn't. The problem described in NVIDIA/nvidia-docker#1469 led me to #48, which sounds very similar to the issue I am having, just with newer drivers and a newer toolkit.

Any advice on where to look to learn more and diagnose this better would be tremendously appreciated.

Docker: 20.10.25
Nvidia Drivers: 550.73
nvidia-container-toolkit-base-1.13.5-1.x86_64
libnvidia-container1-1.13.5-1.x86_64
nvidia-container-toolkit-1.13.5-1.x86_64
libnvidia-container-tools-1.13.5-1.x86_64
[ec2-user@ip-10-0-80-174 ~]$ uname -r
4.14.336-257.568.amzn2.x86_64
[ec2-user@ip-10-0-80-174 ~]$ docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.0.0+unknown)

Server:
 Containers: 3
  Running: 2
  Paused: 0
  Stopped: 1
 Images: 5
 Server Version: 20.10.25
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 64b8a811b07ba6288238eefc14d898ee0b5b99ba
 runc version: 4bccb38cc9cf198d52bebf2b3a90cd14e7af8c06
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.336-257.568.amzn2.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 30.95GiB
 Name: ip-10-0-80-174.us-east-2.compute.internal
 ID: USE5:M43H:FX5T:OLTT:GQ6T:YC2P:AF6C:BBC3:B2W6:VZTG:NP7N:G2AO
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
[ec2-user@ip-10-0-80-174 ~]$ runc --version
runc version 1.1.11
commit: 4bccb38cc9cf198d52bebf2b3a90cd14e7af8c06
spec: 1.0.2-dev
go: go1.20.12
libseccomp: 2.5.2
[ec2-user@ip-10-0-80-174 ~]$ nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Wed Apr 17 21:00:00 2024
Driver Version                            : 550.73
CUDA Version                              : 12.4

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Product Name                          : Tesla T4
    Product Brand                         : NVIDIA
    Product Architecture                  : Turing
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1563820007202
    GPU UUID                              : GPU-4b1fa49d-95c3-e9ce-9ac5-8ada3e8dcb94
    Minor Number                          : 0
    VBIOS Version                         : 90.04.96.00.02
    MultiGPU Board                        : No
    Board ID                              : 0x1e
    Board Part Number                     : 900-2G183-0000-001
    GPU Part Number                       : 1EB8-895-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G183.0200.00.02
        OEM Object                        : 1.1
        ECC Object                        : 5.0
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Pass-Through
        Host VGPU Mode                    : N/A
        vGPU Heterogeneous Mode           : N/A
    vGPU Software Licensed Product
        Product Name                      : NVIDIA Virtual Applications
        License Status                    : Licensed (Expiry: N/A)
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    GSP Firmware Version                  : 550.73
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x00
        Device                            : 0x1E
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x2
        Device Id                         : 0x1EB810DE
        Bus Id                            : 00000000:00:1E.0
        Sub System Id                     : 0x12A210DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
                Device Current            : 1
                Device Max                : 3
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P8
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 15360 MiB
        Reserved                          : 443 MiB
        Used                              : 467 MiB
        Free                              : 14451 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 16 MiB
        Free                              : 240 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 30 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 96 C
        GPU Slowdown Temp                 : 93 C
        GPU Max Operating Temp            : 85 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Power Draw                        : 14.69 W
        Current Power Limit               : 70.00 W
        Requested Power Limit             : 70.00 W
        Default Power Limit               : 70.00 W
        Min Power Limit                   : 60.00 W
        Max Power Limit                   : 70.00 W
    GPU Memory Power Readings
        Power Draw                        : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 300 MHz
        SM                                : 300 MHz
        Memory                            : 405 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : 1590 MHz
        Memory                            : 5001 MHz
    Default Applications Clocks
        Graphics                          : 585 MHz
        Memory                            : 5001 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1590 MHz
        SM                                : 1590 MHz
        Memory                            : 5001 MHz
        Video                             : 1470 MHz
    Max Customer Boost Clocks
        Graphics                          : 1590 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
        CliqueId                          : N/A
        ClusterUUID                       : N/A
        Health
            Bandwidth                     : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 6781
            Type                          : C+G
            Name                          : /usr/local/bin/bakery
            Used GPU Memory               : 451 MiB
[ec2-user@ip-10-0-80-174 ~]$ nvidia-container-cli -V
cli-version: 1.13.5
lib-version: 1.13.5
build date: 2023-07-18T11:37+0000
build revision: 66607bd046341f7aad7de80a9f022f122d1f2fce
build compiler: gcc 7.3.1 20180712 (Red Hat 7.3.1-15)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
[ec2-user@ip-10-0-80-174 ~]$ nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0417 21:00:23.020428 11394 nvc.c:376] initializing library context (version=1.13.5, build=66607bd046341f7aad7de80a9f022f122d1f2fce)
I0417 21:00:23.020471 11394 nvc.c:350] using root /
I0417 21:00:23.020480 11394 nvc.c:351] using ldcache /etc/ld.so.cache
I0417 21:00:23.020490 11394 nvc.c:352] using unprivileged user 1000:1000
I0417 21:00:23.020511 11394 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0417 21:00:23.020628 11394 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0417 21:00:23.022026 11395 nvc.c:273] failed to set inheritable capabilities
W0417 21:00:23.022069 11395 nvc.c:274] skipping kernel modules load due to failure
I0417 21:00:23.022256 11396 rpc.c:71] starting driver rpc service
I0417 21:00:23.057461 11399 rpc.c:71] starting nvcgo rpc service
I0417 21:00:23.058265 11394 nvc_info.c:798] requesting driver information with ''
I0417 21:00:23.059290 11394 nvc_info.c:176] selecting /usr/lib64/vdpau/libvdpau_nvidia.so.550.73
I0417 21:00:23.059386 11394 nvc_info.c:176] selecting /usr/lib64/libnvoptix.so.550.73
I0417 21:00:23.059434 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-tls.so.550.73
I0417 21:00:23.059469 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-rtcore.so.550.73
I0417 21:00:23.059508 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.550.73
I0417 21:00:23.059556 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-pkcs11.so.550.73
I0417 21:00:23.059585 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-pkcs11-openssl3.so.550.73
I0417 21:00:23.059616 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-opticalflow.so.550.73
I0417 21:00:23.059664 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-opencl.so.550.73
I0417 21:00:23.059701 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-nvvm.so.550.73
I0417 21:00:23.059751 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-ngx.so.550.73
I0417 21:00:23.059789 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-ml.so.550.73
I0417 21:00:23.059850 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-gpucomp.so.550.73
I0417 21:00:23.059886 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-glvkspirv.so.550.73
I0417 21:00:23.059922 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-glsi.so.550.73
I0417 21:00:23.059958 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-glcore.so.550.73
I0417 21:00:23.059994 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-fbc.so.550.73
I0417 21:00:23.060042 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-encode.so.550.73
I0417 21:00:23.060091 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-eglcore.so.550.73
I0417 21:00:23.060130 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-cfg.so.550.73
I0417 21:00:23.060180 11394 nvc_info.c:176] selecting /usr/lib64/libnvidia-allocator.so.550.73
I0417 21:00:23.060228 11394 nvc_info.c:176] selecting /usr/lib64/libnvcuvid.so.550.73
I0417 21:00:23.060382 11394 nvc_info.c:176] selecting /usr/lib64/libcudadebugger.so.550.73
I0417 21:00:23.060418 11394 nvc_info.c:176] selecting /usr/lib64/libcuda.so.550.73
I0417 21:00:23.060494 11394 nvc_info.c:176] selecting /usr/lib64/libGLX_nvidia.so.550.73
I0417 21:00:23.060539 11394 nvc_info.c:176] selecting /usr/lib64/libGLESv2_nvidia.so.550.73
I0417 21:00:23.060580 11394 nvc_info.c:176] selecting /usr/lib64/libGLESv1_CM_nvidia.so.550.73
I0417 21:00:23.060622 11394 nvc_info.c:176] selecting /usr/lib64/libEGL_nvidia.so.550.73
I0417 21:00:23.060671 11394 nvc_info.c:176] selecting /usr/lib/vdpau/libvdpau_nvidia.so.550.73
I0417 21:00:23.060720 11394 nvc_info.c:176] selecting /usr/lib/libnvidia-tls.so.550.73
I0417 21:00:23.060759 11394 nvc_info.c:176] selecting /usr/lib/libnvidia-ptxjitcompiler.so.550.73
I0417 21:00:23.060818 11394 nvc_info.c:176] selecting /usr/lib/libnvidia-opticalflow.so.550.73
I0417 21:00:23.060869 11394 nvc_info.c:176] selecting /usr/lib/libnvidia-opencl.so.550.73
I0417 21:00:23.060909 11394 nvc_info.c:176] selecting /usr/lib/libnvidia-nvvm.so.550.73
I0417 21:00:23.060965 11394 nvc_info.c:176] selecting /usr/lib/libnvidia-ml.so.550.73
I0417 21:00:23.061016 11394 nvc_info.c:176] selecting /usr/lib/libnvidia-gpucomp.so.550.73
I0417 21:00:23.061056 11394 nvc_info.c:176] selecting /usr/lib/libnvidia-glvkspirv.so.550.73
I0417 21:00:23.061094 11394 nvc_info.c:176] selecting /usr/lib/libnvidia-glsi.so.550.73
I0417 21:00:23.061137 11394 nvc_info.c:176] selecting /usr/lib/libnvidia-glcore.so.550.73
I0417 21:00:23.061176 11394 nvc_info.c:176] selecting /usr/lib/libnvidia-fbc.so.550.73
I0417 21:00:23.061231 11394 nvc_info.c:176] selecting /usr/lib/libnvidia-encode.so.550.73
I0417 21:00:23.061283 11394 nvc_info.c:176] selecting /usr/lib/libnvidia-eglcore.so.550.73
I0417 21:00:23.061326 11394 nvc_info.c:176] selecting /usr/lib/libnvidia-allocator.so.550.73
I0417 21:00:23.061387 11394 nvc_info.c:176] selecting /usr/lib/libnvcuvid.so.550.73
I0417 21:00:23.061442 11394 nvc_info.c:176] selecting /usr/lib/libcuda.so.550.73
I0417 21:00:23.061499 11394 nvc_info.c:176] selecting /usr/lib/libGLX_nvidia.so.550.73
I0417 21:00:23.061538 11394 nvc_info.c:176] selecting /usr/lib/libGLESv2_nvidia.so.550.73
I0417 21:00:23.061589 11394 nvc_info.c:176] selecting /usr/lib/libGLESv1_CM_nvidia.so.550.73
I0417 21:00:23.061627 11394 nvc_info.c:176] selecting /usr/lib/libEGL_nvidia.so.550.73
W0417 21:00:23.061650 11394 nvc_info.c:402] missing library libnvidia-nscq.so
W0417 21:00:23.061659 11394 nvc_info.c:402] missing library libnvidia-fatbinaryloader.so
W0417 21:00:23.061673 11394 nvc_info.c:402] missing library libnvidia-compiler.so
W0417 21:00:23.061685 11394 nvc_info.c:402] missing library libnvidia-ifr.so
W0417 21:00:23.061692 11394 nvc_info.c:402] missing library libnvidia-cbl.so
W0417 21:00:23.061699 11394 nvc_info.c:406] missing compat32 library libnvidia-cfg.so
W0417 21:00:23.061704 11394 nvc_info.c:406] missing compat32 library libnvidia-nscq.so
W0417 21:00:23.061732 11394 nvc_info.c:406] missing compat32 library libcudadebugger.so
W0417 21:00:23.061745 11394 nvc_info.c:406] missing compat32 library libnvidia-fatbinaryloader.so
W0417 21:00:23.061749 11394 nvc_info.c:406] missing compat32 library libnvidia-compiler.so
W0417 21:00:23.061754 11394 nvc_info.c:406] missing compat32 library libnvidia-pkcs11.so
W0417 21:00:23.061760 11394 nvc_info.c:406] missing compat32 library libnvidia-pkcs11-openssl3.so
W0417 21:00:23.061774 11394 nvc_info.c:406] missing compat32 library libnvidia-ngx.so
W0417 21:00:23.061781 11394 nvc_info.c:406] missing compat32 library libnvidia-ifr.so
W0417 21:00:23.061788 11394 nvc_info.c:406] missing compat32 library libnvidia-rtcore.so
W0417 21:00:23.061798 11394 nvc_info.c:406] missing compat32 library libnvoptix.so
W0417 21:00:23.061805 11394 nvc_info.c:406] missing compat32 library libnvidia-cbl.so
I0417 21:00:23.061904 11394 nvc_info.c:302] selecting /usr/bin/nvidia-smi
I0417 21:00:23.061923 11394 nvc_info.c:302] selecting /usr/bin/nvidia-debugdump
I0417 21:00:23.061946 11394 nvc_info.c:302] selecting /usr/bin/nvidia-persistenced
I0417 21:00:23.061978 11394 nvc_info.c:302] selecting /usr/bin/nvidia-cuda-mps-control
I0417 21:00:23.062000 11394 nvc_info.c:302] selecting /usr/bin/nvidia-cuda-mps-server
W0417 21:00:23.062058 11394 nvc_info.c:428] missing binary nv-fabricmanager
I0417 21:00:23.062102 11394 nvc_info.c:488] listing firmware path /lib/firmware/nvidia/550.73/gsp_ga10x.bin
I0417 21:00:23.062112 11394 nvc_info.c:488] listing firmware path /lib/firmware/nvidia/550.73/gsp_tu10x.bin
I0417 21:00:23.062135 11394 nvc_info.c:561] listing device /dev/nvidiactl
I0417 21:00:23.062145 11394 nvc_info.c:561] listing device /dev/nvidia-uvm
I0417 21:00:23.062152 11394 nvc_info.c:561] listing device /dev/nvidia-uvm-tools
I0417 21:00:23.062159 11394 nvc_info.c:561] listing device /dev/nvidia-modeset
I0417 21:00:23.062190 11394 nvc_info.c:346] listing ipc path /run/nvidia-persistenced/socket
W0417 21:00:23.062213 11394 nvc_info.c:352] missing ipc path /var/run/nvidia-fabricmanager/socket
W0417 21:00:23.062232 11394 nvc_info.c:352] missing ipc path /tmp/nvidia-mps
I0417 21:00:23.062243 11394 nvc_info.c:854] requesting device information with ''
I0417 21:00:23.076755 11394 nvc_info.c:745] listing device /dev/nvidia0 (GPU-4b1fa49d-95c3-e9ce-9ac5-8ada3e8dcb94 at 00000000:00:1e.0)
NVRM version:   550.73
CUDA version:   12.4

Device Index:   0
Device Minor:   0
Model:          Tesla T4
Brand:          Nvidia
GPU UUID:       GPU-4b1fa49d-95c3-e9ce-9ac5-8ada3e8dcb94
Bus Location:   00000000:00:1e.0
Architecture:   7.5
I0417 21:00:23.076816 11394 nvc.c:434] shutting down library context
I0417 21:00:23.076861 11399 rpc.c:95] terminating nvcgo rpc service
I0417 21:00:23.077176 11394 rpc.c:135] nvcgo rpc service terminated successfully
I0417 21:00:23.086508 11396 rpc.c:95] terminating driver rpc service
I0417 21:00:23.086661 11394 rpc.c:135] driver rpc service terminated successfully
[ec2-user@ip-10-0-80-174 ~]$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
[ec2-user@ip-10-0-80-174 log]$ cat /etc/nvidia-container-runtime/config.toml
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"

# Specify the runtimes to consider. This list is processed in order and the PATH
# searched for matching executables unless the entry is an absolute path.
runtimes = [
    "docker-runc",
    "runc",
]

mode = "auto"

    [nvidia-container-runtime.modes.csv]

    mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
@elezar
Member

elezar commented Apr 18, 2024

@johnnybenson could there be something updating the container? Note that with the legacy injection mechanism, where the nvidia-container-runtime-hook makes cgroup modifications to a container, the container engine (such as Docker) is not aware of these modifications, and running a docker update command would effectively remove access.
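
A hypothetical way to exercise the failure mode described above on a test container (the container name and the chosen flag are illustrative):

# any resource update causes Docker to re-apply the container's cgroup settings,
# which drops the device access injected by the legacy hook
docker update --cpuset-cpus 0-3 <container>
docker exec <container> nvidia-smi   # expected to fail once access has been dropped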

As a workaround, you could check whether explicitly requesting the following device nodes on the docker command line addresses the issue:

/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
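
A hypothetical docker run along those lines (the image name and non-device flags are illustrative; /dev/nvidia0 is the per-GPU node listed in the nvidia-container-cli output above):

docker run -d --runtime=nvidia \
  --device /dev/nvidiactl \
  --device /dev/nvidia0 \
  --device /dev/nvidia-uvm \
  --device /dev/nvidia-uvm-tools \
  --device /dev/nvidia-modeset \
  -e NVIDIA_DRIVER_CAPABILITIES=all -e NVIDIA_VISIBLE_DEVICES=all \
  <image>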

@johnnybenson
Author

Thanks for the fast reply @elezar.

I added the device nodes to my docker run command; I can see them in the container with ls /dev/nvidia* and they are listed in docker inspect. I tested this on a plain EC2 instance without the added layers of ECS instrumentation.

The process inside the container still fails about an hour into its run with an "Unrecognized device" error (ERROR_INITIALIZATION_FAILED) when wgpu requests the device.

Without restarting anything, while the original process continues to fail to obtain the device, I can successfully docker exec -it <the-same-container> <the-same-command> and everything works fine. The original long-lived process started with docker run, however, never recovers.

This leads me to think that the issue may be in the application layer, or perhaps a bug in wgpu.

Our workaround for now: we detect when this happens, exit the program, and let the service spawn a new container instance. If our "unit of work" approaches 30-40 minutes we may be in trouble again, but for now this feels like an acceptable path forward.
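
A minimal sketch of that detect-and-exit approach, assuming a wgpu 0.19-style API and the pollster crate (function names and messages are illustrative, not the actual application code):

// Re-acquire the adapter/device; if either fails, exit non-zero so the
// supervising service (ECS in this case) replaces the container.
fn acquire_device_or_exit() -> (wgpu::Device, wgpu::Queue) {
    let instance = wgpu::Instance::default();
    let adapter = pollster::block_on(
        instance.request_adapter(&wgpu::RequestAdapterOptions::default()),
    );
    let Some(adapter) = adapter else {
        eprintln!("no GPU adapter available; exiting so the task is replaced");
        std::process::exit(1);
    };
    match pollster::block_on(adapter.request_device(&wgpu::DeviceDescriptor::default(), None)) {
        Ok(pair) => pair,
        Err(err) => {
            eprintln!("request_device failed ({err}); exiting so the task is replaced");
            std::process::exit(1);
        }
    }
}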

Thanks again!

@sardorkhon2002

For anyone with the same issue: it seems there is a workaround here:
#48

Run nvidia-ctk once:

sudo nvidia-ctk system create-dev-char-symlinks --create-all

Set the command to run automatically via a udev rule:

echo 'ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"' | sudo tee /lib/udev/rules.d/71-nvidia-dev-char.rules
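
One way to check that the symlinks were created (assuming the default /dev/char location used by create-dev-char-symlinks):

ls -l /dev/char | grep nvidia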
