Summary
We are experiencing a critical regression with NVENC hardware encoding when using NVIDIA driver version 570.x in a multi-GPU Kubernetes environment. On a node with four identical GPUs, any containerized application managed by the GPU Operator can only use the NVENC encoder successfully if it is scheduled on the last enumerated GPU (e.g., GPU 3 of 4). Pods scheduled on any other GPU (0, 1, or 2) fail to initialize the encoder.
This issue is a clear regression, as the entire setup works perfectly with the 550.x driver series. Host-level encoding works on all cards, and we have confirmed there is no driver version mismatch between the host and the container. The problem appears to be specific to how the 570.x driver exposes NVENC capabilities to the containerized environment in a multi-GPU configuration.
Environment Details
- Hardware:
- CPU: AMD Ryzen Threadripper 7970X (32-Cores)
- GPU: 4 x NVIDIA GeForce RTX 4080 SUPER
- Motherboard: ASUSTeK Pro WS TRX50-SAGE WIFI
- Software:
- Orchestrator: Kubernetes
- GPU Management: NVIDIA GPU Operator
- Host Driver (Problematic): 570.x (e.g., 570.124.06)
- Host Driver (Working): 550.x series
- Container: Using a container image whose user-space libraries are correctly matched to the host driver (verified as sketched after this list).
- Application: An Unreal Engine-based rendering service, and standard ffmpeg.
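
For what it's worth, this is roughly how we verified the host/container driver match; the pod name below is a placeholder for any GPU pod on the node:

```bash
# Compare the driver version reported by the host with the one visible inside a pod.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
kubectl exec <any-gpu-pod> -- nvidia-smi --query-gpu=driver_version --format=csv,noheader
```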
Steps to Reproduce
- Configure a Kubernetes node with multiple identical GPUs (e.g., 4x 4080 SUPER) and install NVIDIA host driver 570.x.
- Deploy the NVIDIA GPU Operator.
- Deploy a Kubernetes Deployment that requests a single GPU (spec.containers.resources.limits: nvidia.com/gpu: 1); a minimal example manifest is sketched after this list.
- Ensure pods from the Deployment are scheduled on different physical GPUs (e.g., GPU 0, GPU 1, etc.).
- Inside a pod scheduled on any GPU except the last one, attempt to initialize an NVENC session using any application (ffmpeg, custom code, etc.).
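
For step 3, a minimal Deployment might look like the sketch below. The names and image are placeholders (any image with ffmpeg and NVENC user-space libraries matched to the host driver should do); replicas is set to 4 so that, with one GPU per pod, the scheduler spreads the pods across all four GPUs of the node.

```bash
# Hypothetical repro Deployment: 4 replicas x 1 GPU each, one pod per physical GPU.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nvenc-repro
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nvenc-repro
  template:
    metadata:
      labels:
        app: nvenc-repro
    spec:
      containers:
      - name: encoder
        image: nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder; use an image with ffmpeg and matched NVENC libs
        command: ["sleep", "infinity"]
        env:
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,video,utility"   # "video" makes the container toolkit mount the NVENC libraries
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
```

Each replica then lands on a different physical GPU, which makes it easy to compare encoder behavior per device.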
Expected Behavior
The containerized application should be able to successfully initialize the NVENC hardware encoder and perform video encoding, regardless of which physical GPU (0, 1, 2, or 3) is assigned to the pod.
Actual Behavior
- Consistent Failure on the first N-1 GPUs: NVENC initialization fails on pods assigned to GPU 0, GPU 1, and GPU 2 (how we map pods to physical GPUs is sketched after this list).
- Consistent Success on the last GPU: A pod that is scheduled on GPU 3 works perfectly and can encode video without issue.
- Application-Agnostic Failure: The issue is not tied to our application. A standard ffmpeg command inside a failing pod reproduces the error perfectly:

  ```
  $ ffmpeg -f lavfi -i testsrc=size=1920x1080:rate=30 -t 10 -c:v h264_nvenc -f null -
  ...
  [h264_nvenc @ 0x55de29791c00] OpenEncodeSessionEx failed: unsupported device (2): (no details)
  [h264_nvenc @ 0x55de29791c00] No capable devices found
  Error initializing output stream 0:0 -- Error while opening encoder for output stream #0:0
  ```
- Our Unreal Engine application logs corresponding errors:

  ```
  LogAVCodecs: Error: Error Creating: Failed to create encoder [NVENC 2]
  LogPixelStreaming: Error: Could not create encoder.
  ```
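
For reference, the pod-to-GPU mapping above is checked roughly as follows (the pod name is a placeholder, and this assumes nvidia-smi is available in the image):

```bash
# The GPU UUID visible inside the pod can be matched against `nvidia-smi -L` on the host.
kubectl exec <failing-pod> -- nvidia-smi -L
# The device plugin typically records the allocated device in this environment variable.
kubectl exec <failing-pod> -- printenv NVIDIA_VISIBLE_DEVICES
```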
Troubleshooting and Analysis
- This is a clear regression, as downgrading the host driver to the 550.x series resolves the issue completely on the exact same hardware and software stack.
- The issue is specific to the container environment. Running ffmpeg with NVENC directly on the host OS works correctly for all 4 GPUs simultaneously (see the host-side check sketched after this list).
- The problem is tied to the logical GPU index, not a specific faulty card. Physically swapping the GPUs does not change the behavior; the failure always occurs on the first N-1 logical GPUs.
- Based on this evidence, the behavior strongly suggests a bug in the 570.x driver or a related component of the GPU Operator toolkit. The issue likely lies in the enumeration or initialization process for NVENC capabilities when exposing them to a container in a multi-GPU system.
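
The host-level check mentioned above is done with something like the following loop; the -gpu option of h264_nvenc selects which device the encode session is created on.

```bash
# Host-side sanity check: run a short NVENC encode pinned to each GPU index in turn.
for i in 0 1 2 3; do
  echo "=== GPU $i ==="
  ffmpeg -hide_banner -f lavfi -i testsrc=size=1920x1080:rate=30 -t 5 \
         -c:v h264_nvenc -gpu "$i" -f null - \
    && echo "GPU $i: NVENC OK" || echo "GPU $i: NVENC FAILED"
done
```

On the 570.x host this passes for all four devices; the equivalent command inside a pod only passes on the last GPU.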
Workaround
The only known workaround is to downgrade the NVIDIA host driver to a version in the 550.x series.
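
As a rough sketch of both downgrade paths (the exact 550.x release below is only an example; use whatever version you have validated):

```bash
# Option A (assumes the GPU Operator manages the driver via its driver container):
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
  --reuse-values --set driver.version=550.127.05   # example 550.x release

# Option B (assumes an Ubuntu host with a pre-installed driver package):
sudo apt-get install nvidia-driver-550
sudo apt-mark hold nvidia-driver-550   # prevent an automatic upgrade back to 570.x
```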