
NVENC Fails in Kubernetes Pods on all but the last GPU with Driver 570.x or 580.x #1249

@FLM210

Description


Summary

We are experiencing a critical regression with NVENC hardware encoding when using NVIDIA driver version 570.x in a multi-GPU Kubernetes environment. On a node with four identical GPUs, any containerized application managed by the GPU Operator can only use the NVENC encoder successfully if it is scheduled on the last enumerated GPU (e.g., GPU 3 of 4). Pods scheduled on any other GPU (0, 1, or 2) fail to initialize the encoder.

This issue is a clear regression, as the entire setup works perfectly with the 550.x driver series. Host-level encoding works on all cards, and we have confirmed there is no driver version mismatch between the host and the container. The problem appears to be specific to how the 570.x driver exposes NVENC capabilities to the containerized environment in a multi-GPU configuration.

Environment Details

  • Hardware:
    • CPU: AMD Ryzen Threadripper 7970X (32-Cores)
    • GPU: 4 x NVIDIA GeForce RTX 4080 SUPER
    • Motherboard: ASUSTeK Pro WS TRX50-SAGE WIFI
  • Software:
    • Orchestrator: Kubernetes
    • GPU Management: NVIDIA GPU Operator
    • Host Driver (Problematic): 570.x (e.g., 570.124.06)
    • Host Driver (Working): 550.x series
    • Container: a container image whose NVIDIA user-space libraries match the host driver version.
    • Application: An Unreal Engine-based rendering service, and standard ffmpeg.

Steps to Reproduce

  1. Configure a Kubernetes node with multiple identical GPUs (e.g., 4x 4080 SUPER) and install NVIDIA host driver 570.x.
  2. Deploy the NVIDIA GPU Operator.
  3. Deploy a Kubernetes Deployment that requests a single GPU per pod (spec.containers[].resources.limits: nvidia.com/gpu: 1); a manifest sketch follows this list.
  4. Ensure pods from the Deployment are scheduled on different physical GPUs (e.g., GPU 0, GPU 1, etc.).
  5. Inside a pod scheduled on any GPU except the last one, attempt to initialize an NVENC session using any application (ffmpeg, custom code, etc.).
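
For concreteness, a minimal sketch of steps 3 and 4, assuming the GPU Operator's default device-plugin configuration; the Deployment name (nvenc-repro) and the container image are illustrative placeholders rather than our production manifest:

# Steps 3-4: four replicas, each requesting one GPU, so the device plugin
# spreads the pods across the four physical GPUs on the node.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nvenc-repro
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nvenc-repro
  template:
    metadata:
      labels:
        app: nvenc-repro
    spec:
      containers:
      - name: encoder
        image: <cuda-image-matching-the-570.x-host-driver>  # placeholder
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1   # a single GPU per pod
EOF

The ffmpeg test from step 5 can then be run in each pod via kubectl exec (see the sketch under "Actual Behavior").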

Expected Behavior

The containerized application should be able to successfully initialize the NVENC hardware encoder and perform video encoding, regardless of which physical GPU (0, 1, 2, or 3) is assigned to the pod.

Actual Behavior

  1. Consistent Failure on the first N-1 GPUs: NVENC initialization fails in pods assigned to GPU 0, GPU 1, or GPU 2 (the sketch after this list shows how we map each pod to its physical GPU).
  2. Consistent Success on the last GPU: A pod that is scheduled on GPU 3 works perfectly and can encode video without issue.
  3. Application-Agnostic Failure: The issue is not tied to our application. A standard ffmpeg command inside a failing pod reproduces the error:
    $ ffmpeg -f lavfi -i testsrc=size=1920x1080:rate=30 -t 10 -c:v h264_nvenc -f null -
    ...
    [h264_nvenc @ 0x55de29791c00] OpenEncodeSessionEx failed: unsupported device (2): (no details)
    [h264_nvenc @ 0x55de29791c00] No capable devices found
    Error initializing output stream 0:0 -- Error while opening encoder for output stream #0:0
  4. Our Unreal Engine application logs corresponding errors:
    LogAVCodecs: Error: Error Creating: Failed to create encoder [NVENC 2]
    LogPixelStreaming: Error: Could not create encoder.
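
For reference, this is roughly how we map each pod to its physical GPU and reproduce the pass/fail pattern above (a sketch, assuming the illustrative nvenc-repro Deployment from the repro steps; nvidia-smi -L prints the UUID of the GPU handed to each pod):

# Run the encode test in every pod and record which physical GPU it holds.
for pod in $(kubectl get pods -l app=nvenc-repro -o name); do
  echo "== $pod =="
  kubectl exec "$pod" -- nvidia-smi -L    # UUID of the GPU visible to this pod
  kubectl exec "$pod" -- ffmpeg -f lavfi -i testsrc=size=1920x1080:rate=30 \
    -t 10 -c:v h264_nvenc -f null -
done

Only the pod holding the last enumerated GPU completes the encode; the other three fail with the OpenEncodeSessionEx error shown above.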
    

Troubleshooting and Analysis

  • This is a clear regression, as downgrading the host driver to the 550.x series resolves the issue completely on the exact same hardware and software stack.
  • The issue is specific to the container environment. Running ffmpeg with NVENC directly on the host OS works correctly on all 4 GPUs simultaneously (a verification sketch follows this list).
  • The problem is tied to the logical GPU index, not a specific faulty card. Physically swapping the GPUs does not change the behavior; the failure always occurs on the first N-1 logical GPUs.
  • Based on this evidence, the behavior strongly suggests a bug in the 570.x driver or a related component of the GPU Operator toolkit. The issue likely lies in the enumeration or initialization process for NVENC capabilities when exposing them to a container in a multi-GPU system.
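
The host-side check behind the second bullet looks roughly like this (a sketch; it relies on h264_nvenc's -gpu option, which selects the encoding device by index):

# On the host OS, exercise NVENC on each of the four GPUs in turn.
for i in 0 1 2 3; do
  echo "== GPU $i =="
  ffmpeg -f lavfi -i testsrc=size=1920x1080:rate=30 -t 10 \
    -c:v h264_nvenc -gpu "$i" -f null - && echo "GPU $i: NVENC OK"
done

With the 570.x driver, all four iterations succeed on the host; the index-dependent failure appears only inside containers.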

Workaround

The only known workaround is to downgrade the NVIDIA host driver to a version in the 550.x series.
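
If the driver is deployed by the GPU Operator's driver container (rather than preinstalled on the host), the downgrade can be expressed through the chart's driver.version value; the release name and the exact 550-series build below are placeholders for whatever your platform's driver image registry provides:

# Pin the Operator-managed driver to a 550-series build (placeholders marked).
helm upgrade <release-name> nvidia/gpu-operator \
  --namespace gpu-operator \
  --reuse-values \
  --set driver.version=<550.x-version>

If the 550.x driver is installed directly on the host instead, a normal distro package downgrade applies and the Operator is deployed with driver.enabled=false.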
