RTD3 fails on GA103M (RTX 3080 Ti Mobile): driver holds unconditional pm_runtime baseline ref, kernels 6.14 + 6.17, open + proprietary 580/590 #1121

@johntmorehead

Summary

On a ThinkPad X1 Extreme Gen 5 (12th-gen Alder Lake) with GA103M
(10de:2420, RTX 3080 Ti Mobile), the dGPU never enters runtime D3.
After all userspace clients are gone and power/control is set to
auto, runtime_status stays active and runtime_usage stays at
1 with zero userspace openers per fuser. The driver
simultaneously self-reports Runtime D3 status: Enabled (fine-grained)
and Video Memory: Active.

With zero userspace openers and only the bare nvidia module bound, the
remaining runtime_usage=1 appears to be a driver-internal pm_runtime
reference. No userspace action I tried clears it.

The bug reproduces identically across a full 2×2 driver matrix and two
kernel versions, so it appears to be a driver/firmware issue rather
than kernel-side.

Battery cost on this hardware is roughly 5–10 W of dGPU idle power.

Hardware

| Component | Detail |
|---|---|
| Laptop | Lenovo ThinkPad X1 Extreme Gen 5 (21DECTO1WW) |
| BIOS | N3JET37W 1.21, dated 2023-11-07 |
| CPU | Intel i9-12900H (Alder Lake-P) |
| iGPU | 8086:46a6 Iris Xe Graphics @ 0000:00:02.0 |
| dGPU | 10de:2420 GA103M / RTX 3080 Ti Mobile @ 0000:01:00.0 (rev a1) |
| PCIe Root Port | 8086:460d @ 0000:00:01.0 |
| HDA function | 10de:2288 @ 0000:01:00.1 |
| Distro | Zorin OS 18.1 (Ubuntu 24.04 noble base) |
| Init | systemd 255 |
| Mode | envycontrol hybrid (PRIME render-offload via prime-run) |
| Session | GNOME on X11 (6.14) / Wayland (6.17); bug present in both |

Repro

# 1. Resolve dGPU DRM node (numbers shift across boots; resolve by vendor)
for c in /dev/dri/card* /dev/dri/renderD*; do
  n=$(basename "$c")
  pci=$(readlink -f /sys/class/drm/$n/device 2>/dev/null | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9]' | tail -1)
  printf "%-25s %s\n" "$c" "$pci"
done
# dGPU is the cardN/renderDN whose path is 0000:01:00.0

# 2. Trigger RTD3
echo auto | sudo tee /sys/bus/pci/devices/0000:01:00.0/power/control
sleep 3

# 3. Observe state
cat /sys/bus/pci/devices/0000:01:00.0/power/{control,runtime_status,runtime_usage}
cat /proc/driver/nvidia/gpus/0000:01:00.0/power
# Replace cardN/renderDN with the dGPU nodes resolved in step 1
sudo fuser -v /dev/nvidia* /dev/dri/cardN /dev/dri/renderDN

Expected

control=auto
runtime_status=suspended
runtime_usage=0
Video Memory: Off

Actual (every test)

control=auto
runtime_status=active
runtime_usage=1            <-- driver-internal ref
Video Memory: Active
Runtime D3 status: Enabled (fine-grained)
fuser: (no openers)
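
The expected-versus-actual comparison above can be scripted. This is a minimal sketch that only looks at the three sysfs attributes (it does not parse /proc/driver/nvidia/gpus/.../power); the function name rtd3_check and the PASS/FAIL wording are illustrative, not part of any tool.

```shell
# Classify a PCI device's runtime-PM state from its sysfs directory.
# Returns 0 if the device is runtime-suspended, 1 if not, 2 on read errors.
rtd3_check() {
  local dev_dir="$1"
  local control status usage
  control=$(cat "$dev_dir/power/control") || return 2
  status=$(cat "$dev_dir/power/runtime_status") || return 2
  usage=$(cat "$dev_dir/power/runtime_usage") || return 2

  printf 'control=%s runtime_status=%s runtime_usage=%s\n' \
    "$control" "$status" "$usage"

  if [ "$control" = auto ] && [ "$status" = suspended ] && [ "$usage" = 0 ]; then
    echo "PASS: device is runtime-suspended"
  else
    echo "FAIL: device did not reach runtime D3"
    return 1
  fi
}
```

Usage on this machine would be `rtd3_check /sys/bus/pci/devices/0000:01:00.0`, run a few seconds after writing auto to power/control.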

Test matrix

All combinations were tested with all the standard hybrid prerequisites
already in place: Mutter mutter-device-ignore udev tag on the dGPU,
Xwayland EGL/GLX defaults pointed at Mesa, nvidia-persistenced.service
masked, no Electron apps running, ollama stopped. fuser confirms
zero userspace openers in every failed test.

| Driver | Variant | Version | Kernel | DPM | Result |
|---|---|---|---|---|---|
| nvidia-driver-580 | open | 580.126.09 | 6.17.0-22-generic | 0x03 | FAIL: usage=1, VRAM Active, 0 openers |
| nvidia-driver-580 | proprietary | 580.126.09 | 6.17.0-22-generic | 0x03 | FAIL: usage=1, VRAM Active, 0 openers |
| nvidia-driver-580 | proprietary | 580.126.09 | 6.17.0-22-generic | 0x02 | FAIL: usage=1, VRAM Active, 0 openers |
| nvidia-driver-590 | proprietary | 590.48.01 | 6.17.0-22-generic | 0x03 | FAIL: usage=1, VRAM Active, 0 openers |
| nvidia-driver-590 | open | 590.48.01 | 6.17.0-22-generic | 0x03 | FAIL: usage=1, VRAM Active, 0 openers |
| nvidia-driver-590 | open | 590.48.01 | 6.14.0-37-generic | 0x03 | FAIL: usage=1, VRAM Active, 0 openers |

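DPM = NVreg_DynamicPowerManagement. On 0x02 the driver reports
Runtime D3 status: Enabled (fine-grained), the same string as on 0x03.
That matches NVIDIA's README, where 0x02 selects fine-grained control
explicitly and 0x03 (the default) resolves to fine-grained on supported
Ampere-or-later notebooks, so the status string does not distinguish
the two settings on this GPU.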
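
For readers unfamiliar with the DPM values in the table, a small decoder may help. The mode names follow my reading of NVIDIA's driver README and are an assumption rather than something measured in this report; the function name dpm_mode_name is illustrative.

```shell
# Map an NVreg_DynamicPowerManagement value to a human-readable mode name.
# Accepts the hex form used in modprobe.d (0x03) or the decimal form
# shown in /proc/driver/nvidia/params (3).
dpm_mode_name() {
  case "$1" in
    0|0x0|0x00) echo "disabled" ;;
    1|0x1|0x01) echo "coarse-grained" ;;
    2|0x2|0x02) echo "fine-grained" ;;
    3|0x3|0x03) echo "default" ;;
    *) echo "unknown"; return 1 ;;
  esac
}
```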

The 6.14 kernel was Ubuntu's HWE backport of upstream 6.14.11, package
linux-image-6.14.0-37-generic from noble-updates, with
linux-modules-extra-6.14.0-37-generic installed (i915 lives there on
noble HWE).

Module options in effect

# /etc/modprobe.d/nvidia.conf (envycontrol-generated, lightly edited)
options nvidia NVreg_DynamicPowerManagement=0x03
options nvidia NVreg_UsePageAttributeTable=1
options nvidia NVreg_InitializeSystemMemoryAllocations=0
options nvidia_drm modeset=1

# /etc/modprobe.d/nvidia-graphics-drivers-kms.conf (distro)
options nvidia_drm modeset=1
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/var

/proc/driver/nvidia/params confirms all options are loaded
(DynamicPowerManagement: 3, PreserveVideoMemoryAllocations: 1,
InitializeSystemMemoryAllocations: 0, etc.).
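
A sketch of how the loaded values can be pulled out of that file for a report like this one. It assumes the `Name: value` line format quoted above; the helper name nv_param is hypothetical.

```shell
# Print the value of one option from an nvidia params file.
# The second argument defaults to the live procfs file.
# Exits non-zero if the option is not present.
nv_param() {
  local key="$1"
  local file="${2:-/proc/driver/nvidia/params}"
  awk -F': *' -v k="$key" '
    $1 == k { print $2; found = 1 }
    END     { exit !found }
  ' "$file"
}
```

For example, `nv_param DynamicPowerManagement` would print 3 with the module options listed above.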

/sys/bus/pci/devices/0000:01:00.0/power/autosuspend_delay_ms returns
EIO with the open module on both kernels. The kernel returns EIO for
this attribute when the bound driver has not enabled autosuspend, so
this is consistent with the open module suspending without a delay and
is not a bug indicator.

Things ruled out

  • gnome-shell / Mutter opening the dGPU: mutter-device-ignore
    udev tag verified present on the dGPU's DRM node in TAGS and
    CURRENT_TAGS; gnome-shell does not appear in fuser.
  • Xwayland: __EGL_VENDOR_LIBRARY_FILENAMES and
    __GLX_VENDOR_LIBRARY_NAME point at Mesa; Xwayland does not open
    the dGPU.
  • ollama / CUDA-using userspace: stopping the service does not
    drop runtime_usage below 1.
  • Electron/Chromium apps: open the render node while running but
    release on quit; not the persistent ref.
  • Module unload chain: stopping ollama and modprobe -r nvidia_uvm nvidia_drm nvidia_modeset leaves only the bare nvidia module
    bound to the PCI device, and runtime_usage stays at 1.
    Conclusion: the bare nvidia module bound to the device holds the
    reference.
  • GDM/Wayland-vs-X11: bug present on 6.17 Wayland and 6.14 X11.
  • DPM mode: 0x02 and 0x03 both fail.
  • Ubuntu kernel package: bug also present on 6.14, so it is not a
    6.17-specific regression in the Ubuntu tree.
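
The udev-tag check from the first bullet can be reproduced roughly as follows. This is a sketch that assumes `udevadm info` prints properties as `E: TAGS=:a:b:` and `E: CURRENT_TAGS=:a:b:`; the function name tag_in_both is illustrative.

```shell
# Read `udevadm info` output on stdin and succeed only if the given tag
# appears in both the TAGS and CURRENT_TAGS properties.
tag_in_both() {
  local tag="$1"
  local info
  info=$(cat)
  printf '%s\n' "$info" | grep -Eq "^E: TAGS=.*:${tag}:" &&
    printf '%s\n' "$info" | grep -Eq "^E: CURRENT_TAGS=.*:${tag}:"
}
```

Usage, with cardN replaced by the dGPU node: `udevadm info --query=all --name=/dev/dri/cardN | tag_in_both mutter-device-ignore && echo present`.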

Boot-time observation (separate, possibly related)

On every boot of every driver variant, the dGPU comes up with
power/control=on despite the envycontrol-generated udev bind rule
writing auto. Manually writing auto after boot sticks. Hypothesis:
the bind rule fires before power/control is fully registered, so
the rule's TEST=="power/control" guard fails and the write is skipped.
This can be mitigated in userspace and is not the subject of this
issue; I mention it in case it correlates with the baseline-ref
behavior.
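
One possible userspace mitigation is a late retry, for example from a oneshot unit. This is only a sketch built on the unconfirmed hypothesis above; the function name, the retry count, and the 0.5 s interval are arbitrary choices.

```shell
# Wait for <device-dir>/power/control to become writable, then set it to
# auto if it is not already. Returns 1 if the attribute never appears.
force_runtime_auto() {
  local ctl="$1/power/control"
  local tries="${2:-10}"
  while [ ! -w "$ctl" ] && [ "$tries" -gt 0 ]; do
    sleep 0.5
    tries=$((tries - 1))
  done
  [ -w "$ctl" ] || return 1
  # Avoid a redundant write when the value is already correct.
  [ "$(cat "$ctl")" = auto ] || echo auto > "$ctl"
}
```

Run as root, this would be `force_runtime_auto /sys/bus/pci/devices/0000:01:00.0`. It only addresses the boot-time control=on state, not the runtime_usage=1 reference.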

Captures

Public gist with three files:
https://gist.github.com/johntmorehead/85276a8decc5f20cfd5f8e240b852ea1

  • 6.17_baseline.txt — modinfo, params, modprobe.d, udev rules,
    current power state, full failure pattern on 6.17.
  • 6.17_journalctl_kernel.txt — kernel log filtered for
    nvidia/pcie/d3/gsp on 6.17 boot.
  • 6.14.0-37-generic_capture.txt — full identical capture on 6.14
    HWE kernel.

What I'd find useful

  • Confirmation that this is a known issue on GA103M / Ampere mobile
    with GSP-RM, or pointers to a tracking issue.
  • Any debug knob (NVreg_*, RmMsg, etc.) that would surface what
    pm_runtime reference the driver is holding.
  • Guidance on whether nouveau/NVK is the right path for users who
    prioritize idle power over CUDA/NVENC on this hardware.

Happy to gather more data — dmesg extracts, RmMsg traces, additional
NVreg_* permutations, ftrace of the pm_runtime put path, etc.
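
For the ftrace option, this is roughly what I would run unless told otherwise. It is a sketch that assumes the kernel exposes the rpm trace event group under tracefs and that its events carry a `name` field holding the device name; the function name rpm_trace_enable is illustrative.

```shell
# Enable the runtime-PM (rpm) trace events, filtered to one device.
# The second argument defaults to the usual tracefs mount point.
rpm_trace_enable() {
  local dev="$1"
  local tracefs="${2:-/sys/kernel/tracing}"
  # Restrict the whole rpm event group to the given device name.
  echo "name == \"$dev\"" > "$tracefs/events/rpm/filter" &&
    echo 1 > "$tracefs/events/rpm/enable" &&
    echo 1 > "$tracefs/tracing_on"
}
```

Run as root with `rpm_trace_enable 0000:01:00.0`, write auto to power/control, wait a few seconds, then read /sys/kernel/tracing/trace.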
