Fix: Prevent infinite idle holdoff loop blocking D3cold re-entry after AC replug #1074
DhineshPonnarasan wants to merge 2 commits into NVIDIA:main
Conversation
The suggested patch applies cleanly on top of main, but it does not build for me. If this is sufficient to get my hardware to behave nicely with 590/main and GSP firmware, I would of course use it. My observations and workarounds in #1071 are made in the context of 580 and with firmware loading disabled.
Hi @dagbdagb, I have restored both fields in nv-priv.h:
So the references in dynamic-power.c now resolve again. Could you please pull the latest commit from PR #1074 and rebuild? Also, thanks for the note about 580 and firmware-disabled behavior in #1071. That context is very useful and I’ll keep it in mind while validating this on 590/main.
First things first: Second: I did as follows:
With this, I saw the following: A couple of observations at this point in time:
I spent all of 14 seconds scanning my process list before I realized TLP was running. After reboot, when I unplug/replug power, dmesg is quiet, and the dGPU now remains in D3cold. Additional testing:
Please let me know what you want me to test, and in what sequence.
Problem
On Turing GPUs with:
NVreg_DynamicPowerManagement=0x02 (FINE mode)
NVreg_EnableGpuFirmware=0 (GSP disabled)
the GPU fails to return to D3cold after AC power is reinserted.
Observed behavior
This results in persistent power usage (~7–10W) until reboot.
Root Cause Analysis
The issue originates in the idle holdoff removal logic inside: RmRemoveIdleHoldoff()
Failure scenario
After AC replug:
RmCheckForGcxSupportOnCurrentState()
idle_precondition_check_callback_scheduled
In AC-powered mode:
RmCheckForGcxSupportOnCurrentState() repeatedly returns false
As a result, the following loop occurs:
RmRemoveIdleHoldoff()
→ GC6 not available
→ idle precondition callback not scheduled
→ reschedule RmRemoveIdleHoldoff()
→ repeat indefinitely
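The loop above can be modelled in a few lines of plain C. This is only a sketch of the control flow as described here, not the driver code: the struct, its field names, and the `run_unbounded()` driver function are hypothetical, and the timer that re-arms the callback is replaced by a simple loop with a tick cap so the model terminates.

```c
#include <stdbool.h>

/* Hypothetical model state; names are illustrative, not the driver's. */
struct holdoff_model {
    bool gc6_supported;             /* stands in for RmCheckForGcxSupportOnCurrentState() */
    bool precondition_cb_scheduled; /* stands in for idle_precondition_check_callback_scheduled */
    bool idle_indicated;            /* set where the driver would call nv_indicate_idle() */
    unsigned reschedules;           /* how often the callback re-armed itself */
};

/* One invocation of the modelled RmRemoveIdleHoldoff().
 * Returns true when the callback reschedules itself. */
static bool remove_idle_holdoff_unbounded(struct holdoff_model *m)
{
    if (m->gc6_supported || m->precondition_cb_scheduled) {
        m->idle_indicated = true;   /* normal path: idle is indicated */
        return false;
    }
    m->reschedules++;               /* neither condition holds: try again later */
    return true;
}

/* Fire the callback for at most max_ticks timer expirations (the cap exists
 * only so the model terminates). Returns the number of invocations. */
static unsigned run_unbounded(struct holdoff_model *m, unsigned max_ticks)
{
    unsigned calls = 0;

    while (calls < max_ticks) {
        calls++;
        if (!remove_idle_holdoff_unbounded(m))
            break;
    }
    return calls;
}
```

With both conditions false, `run_unbounded()` uses up every tick it is given and `idle_indicated` stays false, which is the stuck state described above.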
Consequence
nv_indicate_idle() is never called
pm_runtime_put_noidle() is never triggered
Solution
Introduce a bounded retry mechanism to break the infinite rescheduling loop.
Key idea
Allow a limited number of retries for GC6 eligibility, then force idle indication.
Implementation details
Add a counter: idle_holdoff_reschedule_count
Define a retry limit: MAX_IDLE_HOLDOFF_RESCHEDULES (e.g., 4)
Modify RmRemoveIdleHoldoff()
Behavior
If GC6 becomes available OR idle preconditions are met:
Proceed normally
Call nv_indicate_idle()
Reset counter
If GC6 is still unavailable:
Retry up to N times (~20 seconds total)
After threshold is reached:
Force nv_indicate_idle()
Reset counter
Allow autosuspend fallback
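The behavior listed above can be sketched as a standalone C model. It reuses the names `idle_holdoff_reschedule_count` and `MAX_IDLE_HOLDOFF_RESCHEDULES` from this description; everything else (the struct, the result enum, `run_until_idle()`) is hypothetical, and the exact comparison used in the actual patch may differ.

```c
#include <stdbool.h>

#define MAX_IDLE_HOLDOFF_RESCHEDULES 4 /* example limit from the description */

/* Hypothetical model state, not the driver's nv_priv_t layout. */
struct holdoff_state {
    bool gc6_supported;                     /* GC6 eligibility check result */
    bool precondition_cb_scheduled;         /* idle precondition callback pending */
    unsigned idle_holdoff_reschedule_count; /* retries used so far */
    bool idle_indicated;                    /* set where nv_indicate_idle() would run */
};

enum holdoff_result {
    HOLDOFF_RESCHEDULED, /* still waiting for GC6, callback re-armed */
    HOLDOFF_IDLE_NORMAL, /* GC6 or preconditions OK, idle indicated */
    HOLDOFF_IDLE_FORCED  /* retry budget exhausted, idle forced */
};

/* One invocation of the modelled, bounded RmRemoveIdleHoldoff(). */
static enum holdoff_result remove_idle_holdoff_bounded(struct holdoff_state *s)
{
    if (s->gc6_supported || s->precondition_cb_scheduled) {
        s->idle_indicated = true;
        s->idle_holdoff_reschedule_count = 0;
        return HOLDOFF_IDLE_NORMAL;
    }

    if (s->idle_holdoff_reschedule_count < MAX_IDLE_HOLDOFF_RESCHEDULES) {
        s->idle_holdoff_reschedule_count++;
        return HOLDOFF_RESCHEDULED;
    }

    /* Threshold reached: indicate idle anyway so autosuspend can proceed. */
    s->idle_indicated = true;
    s->idle_holdoff_reschedule_count = 0;
    return HOLDOFF_IDLE_FORCED;
}

/* Invoke the callback until it stops rescheduling; returns the call count. */
static unsigned run_until_idle(struct holdoff_state *s)
{
    unsigned calls = 0;
    enum holdoff_result r;

    do {
        r = remove_idle_holdoff_bounded(s);
        calls++;
    } while (r == HOLDOFF_RESCHEDULED);

    return calls;
}
```

In this model, a GPU that never becomes GC6-eligible is rescheduled `MAX_IDLE_HOLDOFF_RESCHEDULES` times and then forced idle on the next invocation. If each reschedule waits about 5 seconds (an interval inferred from the "~20 seconds total" figure, not stated in this description), four retries add up to roughly that total.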
Why this works
The fix ensures:
Resulting flow
nv_indicate_idle()
→ pm_runtime_put_noidle()
→ runtime suspend scheduled
→ nv_pmops_runtime_suspend()
→ GPU transitions to D3cold
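Why the first step in this flow matters can be illustrated with a toy model of a runtime PM usage counter. This is a simplification under stated assumptions: the helpers below are hypothetical stand-ins, not kernel APIs, and the model assumes the idle holdoff holds one usage reference that only the nv_indicate_idle() path releases.

```c
#include <stdbool.h>

/* Toy model of a device's runtime PM state; not the kernel's struct device. */
struct pm_model {
    int usage_count; /* models the runtime PM usage counter */
    bool suspended;  /* true once the modelled runtime suspend has run */
};

/* Models taking the idle holdoff: a reference that blocks runtime suspend. */
static void model_hold(struct pm_model *d)
{
    d->usage_count++;
    d->suspended = false;
}

/* Models the effect of pm_runtime_put_noidle(): drop one reference,
 * never going below zero, without triggering an idle check. */
static void model_put_noidle(struct pm_model *d)
{
    if (d->usage_count > 0)
        d->usage_count--;
}

/* Models a later autosuspend attempt: it succeeds only when no
 * references remain. Returns whether the device is now suspended. */
static bool model_try_autosuspend(struct pm_model *d)
{
    if (d->usage_count == 0)
        d->suspended = true;
    return d->suspended;
}
```

While the reference from `model_hold()` is outstanding, `model_try_autosuspend()` keeps failing; once `model_put_noidle()` has run, the next attempt succeeds. This mirrors why a missing nv_indicate_idle() call keeps the GPU out of D3cold.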
Safety & Impact Analysis
No functional regression
Minimal scope
RmRemoveIdleHoldoff()
Safe fallback behavior
Verification
This fix has been:
Note:
This change has not been tested on real hardware due to environment limitations (WSL).
Expected Outcome
After applying this fix:
Request for Validation
Testing on affected systems (Turing + RTD3 FINE mode) would be greatly appreciated to confirm:
References
Related components:
dynamic-power.c
Summary
This change resolves a timing-dependent infinite rescheduling condition by introducing a bounded retry mechanism, ensuring the GPU can always re-enter a low-power state even when GC6 is unavailable.
Hi @dagbdagb,
Please review these changes and let me know if any further modifications are needed. If you notice any issues, please leave a comment below and I’ll address them. Thank you!