Fix: Prevent infinite idle holdoff loop blocking D3cold re-entry after AC replug by DhineshPonnarasan · Pull Request #1074 · NVIDIA/open-gpu-kernel-modules

DhineshPonnarasan · 2026-03-23T06:06:34Z

Problem

On Turing GPUs with:

NVreg_DynamicPowerManagement=0x02 (FINE mode)
NVreg_EnableGpuFirmware=0 (GSP disabled)

the GPU fails to return to D3cold after AC power is reinserted.

Observed behavior

Boot (AC connected) → GPU correctly enters D3cold
Unplug AC → GPU behavior remains correct
Reinsert AC → GPU transitions to D0
GPU never returns to D3cold afterward

This results in persistent power usage (~7–10W) until reboot.

Root Cause Analysis

The issue originates in the idle holdoff removal logic inside: RmRemoveIdleHoldoff()

Failure scenario

After AC replug:

GPU transitions to D0 and enters an active state
Idle detection relies on:
- RmCheckForGcxSupportOnCurrentState()
- idle_precondition_check_callback_scheduled

In AC-powered mode:

GC6 (deep idle) may be unavailable
RmCheckForGcxSupportOnCurrentState() repeatedly returns false

As a result, the following loop occurs:
RmRemoveIdleHoldoff()
→ GC6 not available
→ idle precondition callback not scheduled
→ reschedule RmRemoveIdleHoldoff()
→ repeat indefinitely

Consequence

nv_indicate_idle() is never called
pm_runtime_put_noidle() is never triggered
Runtime suspend is never reached
GPU remains stuck in D0

Solution

Introduce a bounded retry mechanism to break the infinite rescheduling loop.

Key idea

Allow a limited number of retries for GC6 eligibility, then force idle indication.

Implementation details

Add a counter: idle_holdoff_reschedule_count
Define a retry limit: MAX_IDLE_HOLDOFF_RESCHEDULES (e.g., 4)
Modify RmRemoveIdleHoldoff()

Behavior

If GC6 becomes available OR idle preconditions are met:
Proceed normally
Call nv_indicate_idle()
Reset counter
If GC6 is still unavailable:
Retry up to N times (~20 seconds total)
After threshold is reached:
Force nv_indicate_idle()
Reset counter
Allow autosuspend fallback

Why this works

The fix ensures:

Infinite rescheduling is eliminated
Idle indication is eventually triggered
Runtime PM flow resumes correctly

Resulting flow

nv_indicate_idle()
→ pm_runtime_put_noidle()
→ runtime suspend scheduled
→ nv_pmops_runtime_suspend()
→ GPU transitions to D3cold

Safety & Impact Analysis

No functional regression

Battery mode behavior unchanged
GC6-enabled systems unaffected
Default and disabled modes unaffected

Minimal scope

Change localized to RmRemoveIdleHoldoff()
No modification to core RM or PM logic

Safe fallback behavior

Only triggers when GC6 is persistently unavailable
Uses existing autosuspend path
Avoids introducing new power states or transitions

Verification

This fix has been:

Verified via detailed static analysis of execution flow
Validated for:
- loop termination
- correct counter handling
- safe state transitions
- absence of race conditions

Note:
This change has not been tested on real hardware due to environment limitations (WSL).

Expected Outcome

After applying this fix:

GPU may wake to D0 on AC replug
After ~20 seconds of inactivity:
- GPU correctly returns to D3cold
Eliminates persistent power drain

Request for Validation

Testing on affected systems (Turing + RTD3 FINE mode) would be greatly appreciated to confirm:

D3cold re-entry after AC replug
No regressions in suspend/resume or idle behavior

References

Fixes: Yet another rtd3/d3cold bug variant with Turing/580 #1071

Related components:

dynamic-power.c
runtime PM (RTD3)
GC6 idle state handling

Summary

This change resolves a timing-dependent infinite rescheduling condition by introducing a bounded retry mechanism, ensuring the GPU can always re-enter a low-power state even when GC6 is unavailable.

Hi @dagbdagb ,
Please review these changes and let me know if any further modifications are needed. If you notice any issues, please leave a comment below and I’ll address them. Thank you!

…ling

CLAassistant · 2026-03-23T06:06:50Z

All committers have signed the CLA.

dagbdagb · 2026-03-23T07:59:26Z

The suggested patch applies cleanly on top of main, but it does not build for me.

 [ nvidia            ]  CC           arch/nvalloc/unix/src/osapi.c
 [ nvidia            ]  CC           arch/nvalloc/unix/src/osinit.c
arch/nvalloc/unix/src/dynamic-power.c: In function ‘RmCanEnterGcxUnderGpuLock’:
arch/nvalloc/unix/src/dynamic-power.c:326:48: error: ‘nv_dynamic_power_t’ {aka ‘struct nv_dynamic_power_s’} has no member named ‘gcoff_max_fb_size’
  326 |               (usedFbSize <= nvp->dynamic_power.gcoff_max_fb_size) &&
      |                                                ^
arch/nvalloc/unix/src/dynamic-power.c:327:34: error: ‘nv_dynamic_power_t’ {aka ‘struct nv_dynamic_power_s’} has no member named ‘clients_gcoff_disallow_refcount’
  327 |               (nvp->dynamic_power.clients_gcoff_disallow_refcount == 0)))
      |                                  ^
arch/nvalloc/unix/src/dynamic-power.c: In function ‘osClientGcoffDisallowRefcount’:
arch/nvalloc/unix/src/dynamic-power.c:690:27: error: ‘nv_dynamic_power_t’ {aka ‘struct nv_dynamic_power_s’} has no member named ‘clients_gcoff_disallow_refcount’
  690 |         nvp->dynamic_power.clients_gcoff_disallow_refcount++;
      |                           ^
arch/nvalloc/unix/src/dynamic-power.c:694:27: error: ‘nv_dynamic_power_t’ {aka ‘struct nv_dynamic_power_s’} has no member named ‘clients_gcoff_disallow_refcount’
  694 |         nvp->dynamic_power.clients_gcoff_disallow_refcount--;
      |                           ^
arch/nvalloc/unix/src/dynamic-power.c: In function ‘rm_init_dynamic_power_management’:
arch/nvalloc/unix/src/dynamic-power.c:935:23: error: ‘nv_dynamic_power_t’ {aka ‘struct nv_dynamic_power_s’} has no member named ‘gcoff_max_fb_size’
  935 |     nvp->dynamic_power.gcoff_max_fb_size =
      |                       ^
arch/nvalloc/unix/src/dynamic-power.c: In function ‘RmInitDeferredDynamicPowerManagement’:
arch/nvalloc/unix/src/dynamic-power.c:2202:31: error: ‘nv_dynamic_power_t’ {aka ‘struct nv_dynamic_power_s’} has no member named ‘clients_gcoff_disallow_refcount’
 2202 |             nvp->dynamic_power.clients_gcoff_disallow_refcount = 0;
      |                               ^
arch/nvalloc/unix/src/dynamic-power.c: In function ‘RmCheckForGcOffPM’:
 [ nvidia-modeset    ]  CC           _out/Linux_x86_64/g_nvid_string.c
arch/nvalloc/unix/src/dynamic-power.c:2244:31: error: ‘nv_dynamic_power_t’ {aka ‘struct nv_dynamic_power_s’} has no member named ‘clients_gcoff_disallow_refcount’
 2244 |         if (nvp->dynamic_power.clients_gcoff_disallow_refcount != 0)
      |                               ^
arch/nvalloc/unix/src/dynamic-power.c:2247:47: error: ‘nv_dynamic_power_t’ {aka ‘struct nv_dynamic_power_s’} has no member named ‘gcoff_max_fb_size’
 2247 |         gcoff_max_fb_size = nvp->dynamic_power.gcoff_max_fb_size;
      |                                               ^
 [ nvidia-modeset    ]  LD           _out/Linux_x86_64/nv-modeset-kernel.o
make[1]: *** [Makefile:203: _out/Linux_x86_64/dynamic-power.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/home/dagb/gits/open-gpu-kernel-modules/src/nvidia-modeset'
cd kernel-open/nvidia-modeset/ && ln -sf ../../src/nvidia-modeset/_out/Linux_x86_64/nv-modeset-kernel.o nv-modeset-kernel.o_binary
make[1]: Leaving directory '/home/dagb/gits/open-gpu-kernel-modules/src/nvidia'
make: *** [Makefile:34: src/nvidia/_out/Linux_x86_64/nv-kernel.o] Error 2

If this is sufficient to get my hardware to behave nicely with 590/main and GSP firmware, I would of course use it.
But out of the box, I have had more success with 580.

My observations and workarounds in #1071 are made in the context of 580 and with firmware loading disabled.

DhineshPonnarasan · 2026-03-23T10:18:09Z

Hi @dagbdagb ,
Thanks for catching the build break.
You were right: the issue was that two existing members in the dynamic power struct were accidentally dropped while adding the new retry counter.

I have restored both fields in nv-priv.h:

clients_gcoff_disallow_refcount
gcoff_max_fb_size

So the references in dynamic-power.c now resolve again.

Could you please pull the latest commit from PR #1074 and rebuild?
The previous missing-member errors in dynamic-power.c should be gone.

Also, thanks for the note about 580 and firmware-disabled behavior in #1071. That context is very useful and I’ll keep it in mind while validating this on 590/main.

dagbdagb · 2026-03-23T16:45:11Z

First things first:
I can now pull/reinsert power and have the card come back in d3cold.

Second:
I have not actually verified if this patch was what fixed it, but the RTD3 kernel messages are massively helpful.

I did as follows:

removed all installed nvidia-drivers
pulled open-gpu-kernel-modules from GH and merged this PR on top
built the open-gpu-kernel-modules drivers package make modules -j$(nproc)
installed the nvidia-drivers package (595.45.04)
deleted the 5 nvidia*.ko drivers in /lib/modules/linux......
installed the newly built kernel drivers in this repo: make modules_install -j$(nproc)
reboot

With this, I saw the following:

Booting ok:
[    1.588070] nvidia: loading out-of-tree module taints kernel.
[    1.603218] nvidia-nvlink: Nvlink Core is being initialized, major device number 243
[    1.606502] nvidia 0000:01:00.0: enabling device (0000 -> 0003)
[    1.608927] nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[    1.659522] nvidia 0000:01:00.0: NVRM: [RTD3] nv_indicate_idle: pm_runtime_put_noidle, usage_count=1
[    1.710253] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  595.45.04  Release Build  (dagb@gillette)  ma. 23. mars 16:15:51 +0100 2026
[    1.715624] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    1.716201] [drm] Initialized nvidia-drm 0.0.0 for 0000:01:00.0 on minor 2
[    2.714169] nvidia 0000:01:00.0: NVRM: [RTD3] pmops_runtime_suspend: entry, usage_count=0
[    2.714766] nvidia 0000:01:00.0: NVRM: [RTD3] transition_dynamic_power: enter(suspend) skipped (not initialized)
[    2.715416] nvidia 0000:01:00.0: NVRM: [RTD3] pmops_runtime_suspend: exit ok, err=0

(card in d3cold at this time)

pulling out power
reinserting power
[  130.777115] nvidia 0000:01:00.0: NVRM: [RTD3] pmops_runtime_resume: entry, usage_count=0
[  130.777129] nvidia 0000:01:00.0: NVRM: [RTD3] transition_dynamic_power: exit(resume) skipped (not initialized)
[  130.777145] nvidia 0000:01:00.0: NVRM: [RTD3] pmops_runtime_suspend: entry, usage_count=0
[  130.777150] nvidia 0000:01:00.0: NVRM: [RTD3] transition_dynamic_power: enter(suspend) skipped (not initialized)
[  130.777156] nvidia 0000:01:00.0: NVRM: [RTD3] pmops_runtime_suspend: exit ok, err=0
[  130.903665] nvidia 0000:01:00.0: NVRM: [RTD3] pmops_runtime_resume: entry, usage_count=1
[  130.903680] nvidia 0000:01:00.0: NVRM: [RTD3] transition_dynamic_power: exit(resume) skipped (not initialized)

(card in d0 at this time)


starting llama.cpp

[  214.460616] nvidia 0000:01:00.0: NVRM: [RTD3] nv_indicate_not_idle: pm_runtime_get_noresume, usage_count=2
[  214.461140] Loading firmware: nvidia/595.45.04/gsp_tu10x.bin
[  214.513509] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20250807/nsarguments-61)
[  214.513776] ACPI Warning: \_SB.NPCF._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20250807/nsarguments-61)
[  215.447652] NVRM: GPU0 nvAssertOkFailedNoLog: Assertion failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from PlatformRequestHandler failed to get target temp from SBIOS @ platform_request_handler_ctrl.c:2171
[  215.447662] NVRM: GPU0 nvAssertOkFailedNoLog: Assertion failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from PlatformRequestHandler failed to get platform power mode from SBIOS @ platform_request_handler_ctrl.c:2114

exiting llama.cpp

[  324.059517] llama-server (1520) used greatest stack depth: 7320 bytes left
[  324.431218] nvidia 0000:01:00.0: NVRM: [RTD3] nv_indicate_idle: pm_runtime_put_noidle, usage_count=1

(card remains in d0)

A couple of observations at this point in time:

firmware loading is enforced, but delayed. NVreg_EnableGpuFirmware=0 is silently ignored.
(the README in this repo appears to state that the firmware now is mandatory)
My UEFI firmware is slightly buggy(?)
something appears to start talking to the dGPU on power insert

I spent all of 14 seconds scanning my process list, before I realized TLP was running.
I deinstalled TLP and rebooted.

After reboot, when I unplug/replug power, dmesg is quiet.

And the dGPU now remains in d3cold.

Additonal testing:

suspend when dGPU is in use:
works only after setting NVreg_PreserveVideoMemoryAllocations=0
(will not suspend at all if set to 1, I think this may be documented somewhere)
if dGPU isn't in use, dGPU comes back in d3cold after having been suspended
/proc/driver/nvidia/gpus/0000\:01\:00.0/power is/becomes confused:

cat /proc/driver/nvidia/gpus/0000\:01\:00.0/power
Runtime D3 status:          ?
Tegra iGPU Rail-Gating:     Disabled
Video Memory:               ?

GPU Hardware Support:
 Video Memory Self Refresh: ?
 Video Memory Off:          ?

S0ix Power Management:
 Platform Support:          Not Supported
 Status:                    ?

Notebook Dynamic Boost:     ?

dagbdagb · 2026-03-23T16:45:53Z

Please let me know what you want me to test, and in what sequence.

Fix D3cold re-entry after AC replug by bounding idle holdoff reschedu…

c32536f

…ling

DhineshPonnarasan marked this pull request as ready for review March 23, 2026 06:12

Fix build: restore nv_dynamic_power_s GCOFF fields

e6f3c3a

dagbdagb mentioned this pull request Mar 23, 2026

Yet another rtd3/d3cold bug variant with Turing/580 #1071

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: Prevent infinite idle holdoff loop blocking D3cold re-entry after AC replug#1074

Fix: Prevent infinite idle holdoff loop blocking D3cold re-entry after AC replug#1074
DhineshPonnarasan wants to merge 2 commits intoNVIDIA:mainfrom
DhineshPonnarasan:fix/rtd3-d3cold-reentry-issue-1071

DhineshPonnarasan commented Mar 23, 2026 •

edited

Loading

Uh oh!

CLAassistant commented Mar 23, 2026 •

edited

Loading

Uh oh!

dagbdagb commented Mar 23, 2026 •

edited

Loading

Uh oh!

DhineshPonnarasan commented Mar 23, 2026

Uh oh!

dagbdagb commented Mar 23, 2026 •

edited

Loading

Uh oh!

dagbdagb commented Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

DhineshPonnarasan commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Observed behavior

Root Cause Analysis

Failure scenario

Consequence

Solution

Key idea

Implementation details

Behavior

Why this works

Resulting flow

Safety & Impact Analysis

No functional regression

Minimal scope

Safe fallback behavior

Verification

Expected Outcome

Request for Validation

References

Summary

Uh oh!

CLAassistant commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dagbdagb commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DhineshPonnarasan commented Mar 23, 2026

Uh oh!

dagbdagb commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dagbdagb commented Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

DhineshPonnarasan commented Mar 23, 2026 •

edited

Loading

CLAassistant commented Mar 23, 2026 •

edited

Loading

dagbdagb commented Mar 23, 2026 •

edited

Loading

dagbdagb commented Mar 23, 2026 •

edited

Loading