[RFC]: Surviving Thunderbolt eGPU surprise-removal: a hardware-validated open-tree patch series #1190

atassis · 2026-06-10T23:03:07Z


Hi! I've been debugging Thunderbolt eGPU hot-removal on Wayland and ended up with a small,
hardware-validated patch series against 610.43.02. Per CONTRIBUTING.md I'm raising it here before
sending a PR, to check scope and how you'd want it structured.

The problem. Hot-removing a TB eGPU (an enclosure power brownout, or pulling the Type-C cable)
hangs the whole desktop for minutes. The hang is unkillable and only a hard reset recovers the
machine. The driver keeps running waits and
teardown steps on a GPU that is already off the bus, while holding the RM API and GPU locks. Through
KMS those locks reach the compositor's main thread, so every GPU client lands in uninterruptible
D-state and the session freezes.

The approach. The driver already tracks this state (PDB_PROP_GPU_IS_LOST,
API_GPU_ATTACHED_SANITY_CHECK) and already short-circuits on it in places like kgspRpcSanityCheck().
The series just extends that to the wait and teardown sites that don't check it yet. Every guard is a
no-op when the GPU is present, so the normal path is unchanged, and Hopper (no Falcon Booter teardown)
is untouched by design.
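
To make the guard pattern concrete, here is a minimal standalone sketch. It is not code from the series: the struct, the status enum, and sketch_wait_idle() are hypothetical stand-ins, and the is_lost field only models a flag in the spirit of PDB_PROP_GPU_IS_LOST. It shows a poll loop that checks the lost flag before each poll, so it is a no-op on a present GPU and returns immediately on a lost one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's GPU object (not the real one). */
struct sketch_gpu {
    bool     is_lost;     /* models a PDB_PROP_GPU_IS_LOST-style flag */
    uint32_t busy_polls;  /* polls left until the modeled engine goes idle */
};

enum sketch_status {
    SKETCH_OK = 0,
    SKETCH_ERR_GPU_LOST,
    SKETCH_ERR_TIMEOUT
};

/*
 * Poll until the modeled engine is idle, for at most max_polls iterations.
 * The lost check runs before every poll: with the GPU present it never
 * fires, with the GPU lost the wait ends at once instead of timing out.
 */
enum sketch_status sketch_wait_idle(struct sketch_gpu *gpu, uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; i++) {
        if (gpu->is_lost)
            return SKETCH_ERR_GPU_LOST;   /* bail out, hardware is gone */
        if (gpu->busy_polls == 0)
            return SKETCH_OK;             /* normal completion path */
        gpu->busy_polls--;                /* stands in for one register poll */
    }
    return SKETCH_ERR_TIMEOUT;
}
```

In this model a present GPU that goes idle in time takes exactly the path it would take without the guard, which is the property the series relies on.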

The result. The same removal becomes a clean detach: the compositor migrates to the iGPU, the
kernel stays responsive, no hung_task, no Oops, no hard reset. Tested on a GA102 TB3 eGPU (Intel Titan
Ridge enclosure, AMD Strix Halo host, kernel 6.19) with both a PSU brownout and a cable unplug under
an active session.

Six commits, about +390/-7, off tag 610.43.02:

The three hang sites, with file:line

Display notifier waits. nvEvoMakeRoom() and EvoCheckNotifier() in
src/nvidia-modeset/src/nvkms-dma.c spin on an EVO FIFO GET pointer that never advances once the
GPU is gone, holding nvkms_lock. There are two variants: when the link is up but the engine is
dead, the hang is in nvEvoMakeRoom(); when the link is down after removal, it is in
EvoCheckNotifier().

GSP and Falcon teardown. On removal, kgspTeardown_TU102()
(gsp/arch/turing/kernel_gsp_tu102.c) runs FWSEC-SB and Booter-Unload. Each s_dmaPoll_GA102()
Falcon DMA poll (gsp/arch/ampere/kernel_gsp_falcon_ga102.c) waits out its full 4-second timeout,
and 10 to 15 of them stack up. threadStateResetTimeout() at the top of teardown re-arms the
deadline, so the existing per-wait lost detection never collapses them; they run to completion
holding the RM API and GPU locks for minutes.

GC6 refcount and teardown accounting. RmGc6BlockerRefCntDec dereferences NULL on a removed device,
externalKernelClientCount underflows, and outputs are not shut down before
drm_mode_config_cleanup, which leaks DP-MST state.
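
The accounting fixes in the third item come down to two defensive checks, sketched below in standalone C. This is an illustration under assumptions, not the patch itself: struct sketch_dev, its fields, and both helpers are hypothetical names, and the real driver's data structures differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device bookkeeping, for illustration only. */
struct sketch_dev {
    uint32_t gc6_blocker_refcnt;     /* models the GC6 blocker refcount */
    uint32_t external_client_count;  /* models externalKernelClientCount */
    bool     removed;                /* set once the device is off the bus */
};

/*
 * Decrement a counter without wrapping below zero.
 * Returns true if the counter was decremented, false if it was already 0.
 */
bool sketch_dec_saturating(uint32_t *count)
{
    if (*count == 0)
        return false;   /* tolerate the extra put instead of underflowing */
    (*count)--;
    return true;
}

/*
 * Drop one GC6 blocker reference. A NULL or removed device is treated as
 * "nothing to do" rather than being dereferenced.
 */
bool sketch_gc6_blocker_dec(struct sketch_dev *dev)
{
    if (dev == NULL || dev->removed)
        return false;
    return sketch_dec_saturating(&dev->gc6_blocker_refcnt);
}
```

Returning a boolean rather than asserting lets the caller log the unexpected case on a lost GPU while still finishing teardown.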

For the link-up engine-dead case, where RM never detects loss because the BAR stays mapped, the nvkms
waiters use a "GET pointer has not advanced in 30 seconds" backstop rather than a wall-clock timeout,
so a slow but healthy GPU won't trip it.
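
The difference between a progress-based backstop and a wall-clock timeout can be sketched as follows. This is a simplified standalone model, not the nvkms code: the struct, the helper names, and the millisecond clock passed in by the caller are all assumptions made for illustration; only the 30-second figure comes from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* 30 s without GET progress, per the description above. */
#define SKETCH_STALL_LIMIT_MS 30000u

/* Hypothetical watcher state; the caller supplies GET and the current time. */
struct sketch_stall_watch {
    uint32_t last_get;          /* last observed GET pointer value */
    uint64_t last_progress_ms;  /* time at which GET last changed */
};

void sketch_stall_init(struct sketch_stall_watch *w,
                       uint32_t get, uint64_t now_ms)
{
    w->last_get = get;
    w->last_progress_ms = now_ms;
}

/*
 * Call on every poll iteration. Any movement of GET resets the window,
 * so a slow but advancing GPU never trips the backstop. Returns true only
 * when GET has been unchanged for SKETCH_STALL_LIMIT_MS or longer.
 */
bool sketch_stall_check(struct sketch_stall_watch *w,
                        uint32_t get, uint64_t now_ms)
{
    if (get != w->last_get) {
        w->last_get = get;
        w->last_progress_ms = now_ms;
        return false;
    }
    return (now_ms - w->last_progress_ms) >= SKETCH_STALL_LIMIT_MS;
}
```

A wall-clock timeout would measure from the start of the wait; here the window restarts on every observed advance, which is what keeps a slow but healthy GPU from tripping it.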

The commits: nvkms EVO push-buffer bail (plus a push-buffer overrun guard that came out of my own
review); tolerate external-client-count underflow on a lost GPU; honor drm_dev_unplug() in the
atomic check path; shut down outputs before mode-config cleanup; guard the GC6 blocker refcount dec
against a removed device; skip Falcon teardown ucodes when the GPU is lost.

What I'd like to know:

Scope. Is "don't wedge the system on GPU surprise-removal" something you'd want here, or do you
treat it as out of scope?

Structure. If you'd take it, how would you like it shaped (target branch, any splitting or
squashing, one series vs one PR per subsystem)?

One honest note. The GSP/Falcon teardown hang is in shared RM code, so it likely affects the
proprietary driver too; it isn't open-specific. That's why I'm in Discussions rather than the bug
form. Is discussion plus a PR the right path, or do you have an internal route you'd prefer?

Happy to share full backtraces and test logs. Thanks!
