Hi! I've been debugging Thunderbolt eGPU hot-removal on Wayland and ended up with a small,
hardware-validated patch series against 610.43.02. Per CONTRIBUTING.md I'm raising it here before
sending a PR, to check scope and how you'd want it structured.
The problem. Hot-removing a TB eGPU (an enclosure power brownout, or pulling the Type-C cable)
hangs the whole desktop for minutes: tasks are unkillable and only a hard reset recovers the machine. The driver keeps running waits and
teardown steps on a GPU that is already off the bus, while holding the RM API and GPU locks. Through
KMS those locks reach the compositor's main thread, so every GPU client lands in uninterruptible
D-state and the session freezes.
The approach. The driver already tracks this state (PDB_PROP_GPU_IS_LOST, API_GPU_ATTACHED_SANITY_CHECK) and already short-circuits on it in places like kgspRpcSanityCheck().
The series just extends that to the wait and teardown sites that don't check it yet. Every guard is a
no-op when the GPU is present, so the normal path is unchanged, and Hopper (no Falcon Booter teardown)
is untouched by design.
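To make the pattern concrete, here is a minimal standalone sketch of a guarded wait. It is not code from the series: `struct fake_gpu`, `guarded_poll()` and the simulated clock are hypothetical stand-ins, and `is_lost` only plays the role that PDB_PROP_GPU_IS_LOST plays in the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for per-GPU state; not a real RM structure. */
struct fake_gpu {
    bool     is_lost;   /* plays the role of PDB_PROP_GPU_IS_LOST */
    bool     hw_done;   /* what the polled register would report */
    uint64_t now_ms;    /* simulated clock, advanced by the poll loop */
};

enum poll_status { POLL_OK, POLL_TIMEOUT, POLL_GPU_LOST };

/*
 * Poll until the hardware reports completion, the timeout expires, or the
 * GPU is known to be lost. When is_lost is false the extra check never
 * fires, so the loop behaves exactly like an unguarded poll.
 */
enum poll_status guarded_poll(struct fake_gpu *gpu, uint64_t timeout_ms)
{
    assert(gpu != NULL);
    uint64_t deadline = gpu->now_ms + timeout_ms;

    for (;;) {
        if (gpu->is_lost)
            return POLL_GPU_LOST;   /* bail out instead of waiting on dead hardware */
        if (gpu->hw_done)
            return POLL_OK;
        if (gpu->now_ms >= deadline)
            return POLL_TIMEOUT;
        gpu->now_ms += 1;           /* simulated 1 ms delay between reads */
    }
}
```

With the flag set, the call returns immediately without consuming any of its timeout. With the flag clear, the completion and timeout behaviour is the same as it would be without the guard.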
The result. The same removal becomes a clean detach: the compositor migrates to the iGPU, the
kernel stays responsive, no hung_task, no Oops, no hard reset. Tested on a GA102 TB3 eGPU (Intel Titan
Ridge enclosure, AMD Strix Halo host, kernel 6.19) with both a PSU brownout and a cable unplug under
an active session.
Six commits, about +390/-7, off tag 610.43.02.
The three hang sites, with file:line:
Display notifier waits. nvEvoMakeRoom() and EvoCheckNotifier() in src/nvidia-modeset/src/nvkms-dma.c spin on an EVO FIFO GET pointer that never advances once the
GPU is gone, holding nvkms_lock. Two variants: the link-up engine-dead case hits nvEvoMakeRoom(), and link-down
removal hits EvoCheckNotifier().
GSP and Falcon teardown. On removal, kgspTeardown_TU102()
(gsp/arch/turing/kernel_gsp_tu102.c) runs FWSEC-SB and Booter-Unload. Each s_dmaPoll_GA102()
Falcon DMA poll (gsp/arch/ampere/kernel_gsp_falcon_ga102.c) waits out its full 4-second timeout,
and 10 to 15 such polls run back to back. threadStateResetTimeout() at the top of teardown re-arms the deadline, so
the existing per-wait lost detection never collapses them; they run to completion holding the RM API
and GPU locks for minutes.
GC6 refcount and teardown accounting. RmGc6BlockerRefCntDec() dereferences a NULL pointer on a removed device, externalKernelClientCount underflows, and outputs are not shut down before drm_mode_config_cleanup(), which leaks DP-MST state.
For the link-up engine-dead case, where RM never detects loss because the BAR stays mapped, the nvkms
waiters use a "GET pointer has not advanced in 30 seconds" backstop rather than a wall-clock timeout,
so a slow but healthy GPU won't trip it.
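The backstop logic can be sketched in isolation like this. It is an illustration of the idea rather than the nvkms change itself: `struct stall_tracker` and both functions are hypothetical names, and the caller is assumed to supply the GET value and a millisecond timestamp each time it samples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STALL_LIMIT_MS 30000u   /* the 30-second figure from the text above */

/* Hypothetical tracker; the names are illustrative, not taken from nvkms. */
struct stall_tracker {
    uint32_t last_get;          /* last GET value observed */
    uint64_t last_progress_ms;  /* time at which GET last changed */
};

void stall_tracker_init(struct stall_tracker *t, uint32_t get, uint64_t now_ms)
{
    assert(t != NULL);
    t->last_get = get;
    t->last_progress_ms = now_ms;
}

/*
 * Feed one GET sample. Returns true once GET has been unchanged for
 * STALL_LIMIT_MS. Any movement restarts the window, so total elapsed
 * time alone never trips the check.
 */
bool stall_tracker_sample(struct stall_tracker *t, uint32_t get, uint64_t now_ms)
{
    assert(t != NULL);
    if (get != t->last_get) {
        t->last_get = get;
        t->last_progress_ms = now_ms;
        return false;
    }
    return (now_ms - t->last_progress_ms) >= STALL_LIMIT_MS;
}
```

A GPU that advances GET once every 20 seconds never trips this, however long the whole operation takes. A GET value frozen for 30 seconds does.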
The commits: nvkms EVO push-buffer bail (plus a pushbuffer-overrun guard from my own review); tolerate
external-client-count underflow on a lost GPU; honor drm_dev_unplug() in the atomic check path; shut
down outputs before mode-config cleanup; guard the GC6 blocker refcount dec against a removed device;
skip Falcon teardown ucodes when the GPU is lost.
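Two of those commits are small accounting guards. As a rough sketch of the idea only, the helper below folds both into one function. The actual patches touch separate call sites, and `struct fake_dev`, `client_count_dec()` and the result codes are hypothetical names rather than the driver's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device record; not the driver's real structure. */
struct fake_dev {
    bool     gpu_lost;
    uint32_t client_count;
};

enum dec_result {
    DEC_OK,                   /* count decremented normally */
    DEC_NO_DEVICE,            /* device already removed, nothing to do */
    DEC_UNDERFLOW_TOLERATED,  /* unbalanced decrement on a lost GPU, ignored */
    DEC_UNDERFLOW_BUG         /* unbalanced decrement on a present GPU */
};

/*
 * Drop one reference without ever dereferencing a removed device or
 * wrapping the unsigned counter below zero.
 */
enum dec_result client_count_dec(struct fake_dev *dev)
{
    if (dev == NULL)
        return DEC_NO_DEVICE;

    if (dev->client_count == 0)
        return dev->gpu_lost ? DEC_UNDERFLOW_TOLERATED : DEC_UNDERFLOW_BUG;

    dev->client_count--;
    assert(dev->client_count != UINT32_MAX);  /* the counter never wraps */
    return DEC_OK;
}
```

An unbalanced decrement on a present GPU is still reported as a bug, so the guard hides nothing on the normal path.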
What I'd like to know:
Scope. Is "don't wedge the system on GPU surprise-removal" something you'd want here, or do you
treat it as out of scope?
Structure. If you'd take it, how would you like it shaped (target branch, any splitting or
squashing, one series vs one PR per subsystem)?
One honest note. The GSP/Falcon teardown hang is in shared RM code, so it likely affects the
proprietary driver too; it isn't open-specific. That's why I'm in Discussions rather than the bug
form. Is discussion plus a PR the right path, or do you have an internal route you'd prefer?
Happy to share full backtraces and test logs. Thanks!