
NVENC Fails in Kubernetes Pods on all but the last GPU with Driver 570.x or 580.x #1249

@FLM210

Description


Summary

We are experiencing a critical regression with NVENC hardware encoding when using NVIDIA driver version 570.x in a multi-GPU Kubernetes environment. On a node with four identical GPUs, any containerized application managed by the GPU Operator can only use the NVENC encoder successfully if it is scheduled on the last enumerated GPU (e.g., GPU 3 of 4). Pods scheduled on any other GPU (0, 1, or 2) fail to initialize the encoder.

This issue is a clear regression, as the entire setup works perfectly with the 550.x driver series. Host-level encoding works on all cards, and we have confirmed there is no driver version mismatch between the host and the container. The problem appears to be specific to how the 570.x driver exposes NVENC capabilities to the containerized environment in a multi-GPU configuration.

Environment Details

  • Hardware:
    • CPU: AMD Ryzen Threadripper 7970X (32-Cores)
    • GPU: 4 x NVIDIA GeForce RTX 4080 SUPER
    • Motherboard: ASUSTeK Pro WS TRX50-SAGE WIFI
  • Software:
    • Orchestrator: Kubernetes
    • GPU Management: NVIDIA GPU Operator
    • Host Driver (Problematic): 570.x (e.g., 570.124.06)
    • Host Driver (Working): 550.x series
    • Container: a container image whose NVIDIA user-space libraries match the host driver version.
    • Application: An Unreal Engine-based rendering service, and standard ffmpeg.

Steps to Reproduce

  1. Configure a Kubernetes node with multiple identical GPUs (e.g., 4x 4080 SUPER) and install NVIDIA host driver 570.x.
  2. Deploy the NVIDIA GPU Operator.
  3. Deploy a Kubernetes Deployment that requests a single GPU per pod (spec.containers[].resources.limits: nvidia.com/gpu: 1); a manifest sketch follows this list.
  4. Ensure pods from the Deployment are scheduled on different physical GPUs (e.g., GPU 0, GPU 1, etc.).
  5. Inside a pod scheduled on any GPU except the last one, attempt to initialize an NVENC session using any application (ffmpeg, custom code, etc.).
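
For concreteness, a minimal sketch of steps 3 and 4, assuming the GPU Operator's default device-plugin configuration; the Deployment name (nvenc-repro) and the container image are illustrative placeholders rather than our production manifest:

# Steps 3-4: four replicas, each requesting one GPU, so the device plugin
# spreads the pods across the four physical GPUs on the node.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nvenc-repro
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nvenc-repro
  template:
    metadata:
      labels:
        app: nvenc-repro
    spec:
      containers:
      - name: encoder
        image: <cuda-image-matching-the-570.x-host-driver>  # placeholder
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1   # a single GPU per pod
EOF

The ffmpeg test from step 5 can then be run in each pod via kubectl exec (see the sketch under "Actual Behavior").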

Expected Behavior

The containerized application should be able to successfully initialize the NVENC hardware encoder and perform video encoding, regardless of which physical GPU (0, 1, 2, or 3) is assigned to the pod.

Actual Behavior

  1. Consistent Failure on the first N-1 GPUs: NVENC initialization fails in pods assigned to GPU 0, GPU 1, or GPU 2 (the sketch after this list shows how we map each pod to its physical GPU).
  2. Consistent Success on the last GPU: A pod that is scheduled on GPU 3 works perfectly and can encode video without issue.
  3. Application-Agnostic Failure: The issue is not tied to our application. A standard ffmpeg command inside a failing pod reproduces the error:
    $ ffmpeg -f lavfi -i testsrc=size=1920x1080:rate=30 -t 10 -c:v h264_nvenc -f null -
    ...
    [h264_nvenc @ 0x55de29791c00] OpenEncodeSessionEx failed: unsupported device (2): (no details)
    [h264_nvenc @ 0x55de29791c00] No capable devices found
    Error initializing output stream 0:0 -- Error while opening encoder for output stream #0:0
  4. Our Unreal Engine application logs corresponding errors:
    LogAVCodecs: Error: Error Creating: Failed to create encoder [NVENC 2]
    LogPixelStreaming: Error: Could not create encoder.
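
For reference, this is roughly how we map each pod to its physical GPU and reproduce the pass/fail pattern above (a sketch, assuming the illustrative nvenc-repro Deployment from the repro steps; nvidia-smi -L prints the UUID of the GPU handed to each pod):

# Run the encode test in every pod and record which physical GPU it holds.
for pod in $(kubectl get pods -l app=nvenc-repro -o name); do
  echo "== $pod =="
  kubectl exec "$pod" -- nvidia-smi -L    # UUID of the GPU visible to this pod
  kubectl exec "$pod" -- ffmpeg -f lavfi -i testsrc=size=1920x1080:rate=30 \
    -t 10 -c:v h264_nvenc -f null -
done

Only the pod holding the last enumerated GPU completes the encode; the other three fail with the OpenEncodeSessionEx error shown above.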
    

Troubleshooting and Analysis

  • This is a clear regression, as downgrading the host driver to the 550.x series resolves the issue completely on the exact same hardware and software stack.
  • The issue is specific to the container environment. Running ffmpeg with NVENC directly on the host OS works correctly on all 4 GPUs simultaneously (a verification sketch follows this list).
  • The problem is tied to the logical GPU index, not a specific faulty card. Physically swapping the GPUs does not change the behavior; the failure always occurs on the first N-1 logical GPUs.
  • Based on this evidence, the behavior strongly suggests a bug in the 570.x driver or a related component of the GPU Operator toolkit. The issue likely lies in the enumeration or initialization process for NVENC capabilities when exposing them to a container in a multi-GPU system.
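
The host-side check behind the second bullet looks roughly like this (a sketch; it relies on h264_nvenc's -gpu option, which selects the encoding device by index):

# On the host OS, exercise NVENC on each of the four GPUs in turn.
for i in 0 1 2 3; do
  echo "== GPU $i =="
  ffmpeg -f lavfi -i testsrc=size=1920x1080:rate=30 -t 10 \
    -c:v h264_nvenc -gpu "$i" -f null - && echo "GPU $i: NVENC OK"
done

With the 570.x driver, all four iterations succeed on the host; the index-dependent failure appears only inside containers.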

Workaround

The only known workaround is to downgrade the NVIDIA host driver to a version in the 550.x series.
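
If the driver is deployed by the GPU Operator's driver container (rather than preinstalled on the host), the downgrade can be expressed through the chart's driver.version value; the release name and the exact 550-series build below are placeholders for whatever your platform's driver image registry provides:

# Pin the Operator-managed driver to a 550-series build (placeholders marked).
helm upgrade <release-name> nvidia/gpu-operator \
  --namespace gpu-operator \
  --reuse-values \
  --set driver.version=<550.x-version>

If the 550.x driver is installed directly on the host instead, a normal distro package downgrade applies and the Operator is deployed with driver.enabled=false.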
