nvidia: make Resizable BAR resize failure non-fatal, and skip proactively on Thunderbolt by orkineric · Pull Request #1109 · NVIDIA/open-gpu-kernel-modules

orkineric · 2026-04-16T00:30:29Z

nvidia: make Resizable BAR resize failure non-fatal, and skip proactively on Thunderbolt

Problem

nv_resize_pcie_bars() is invoked during nv_pci_probe() as an
optimization: it tries to grow BAR1 to the largest size the hardware
advertises so the CPU can address the full VRAM directly. When the
resize fails, the driver currently treats this as a fatal probe error
and bails out via err_zero_dev, preventing the GPU from binding at
all.

This is the code path in question (kernel-open/nvidia/nv-pci.c,
595.58.03 line numbers):

if (nv_resize_pcie_bars(pci_dev)) {
    nv_printf(NV_DBG_ERRORS,
        "NVRM: Fatal Error while attempting to resize PCIe BARs.\n");
    goto err_zero_dev;
}

The "fatal error" framing is too strong. Resizable BAR is an optional
PCI Express capability, not a correctness requirement. A GPU
with a non-resized BAR1 is still fully functional for CUDA, graphics,
and everything else the driver supports -- it just uses the DMA path
instead of direct CPU-mapped access to the full VRAM.

Motivating case: Thunderbolt 5 eGPU enclosures

This limitation is particularly visible with Thunderbolt / USB4 eGPU
enclosures, which have become much more common with products like the
Gigabyte Aorus RTX 5090 AI Box, Razer Core X, Sonnet Breakaway Box,
OneXGPU, etc. TB/USB4 hotplug PCIe bridges typically have a prefetchable
MMIO window on the order of hundreds of MiB, which cannot accommodate the
GiB-scale BAR1 that modern NVIDIA GPUs advertise. As a result,
pci_resize_resource() returns -ENOENT, nv_resize_pcie_bars()
returns non-zero, the existing code path fails, and the eGPU silently
never appears in nvidia-smi.

A reproduction:

Host: ASUS ProArt Z890-CREATOR WIFI, Core Ultra 9 285K
Enclosure: Gigabyte Aorus RTX 5090 AI Box (GV-N5090IXEB-32GD) over
Thunderbolt 5 (Intel JHL9580 "Barlow Ridge" host controller on
motherboard, JHL9480 hub in enclosure)
OS: Fedora 43, kernel 7.0.0-rc7
Driver: open-gpu-kernel-modules 595.58.03 (DKMS), unmodified
Kernel cmdline: pci=assign-busses,hpbussize=0x10,hpmmiosize=64M,hpmmioprefsize=384M,realloc pcie_port_pm=off pcie_aspm.policy=performance intel_iommu=off thunderbolt.clx=0
(The kernel cmdline sets the hotplug bridge prefetchable window to
384 MiB, which fits BAR1 at the default 256 MiB size but cannot fit
a 32 GiB resized BAR1.)
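
The window arithmetic above can be checked with a short sketch. It assumes the Resizable BAR size encoding, in which size code n stands for 2^(n + 20) bytes. The helper names are hypothetical, and the model is deliberately simplified: it ignores alignment and any other BARs that share the bridge window.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/* Resizable BAR size code n means 2^(n + 20) bytes, so code 0 is 1 MiB. */
static uint64_t rebar_code_to_bytes(unsigned int code)
{
    assert(code < 44);              /* keep the shift inside 64 bits */
    return 1ULL << (code + 20);
}

/*
 * Largest size code whose BAR fits in a window of `window_bytes`,
 * or -1 if not even a 1 MiB BAR fits.
 * Simplifying assumption: the BAR is the only resource in the window.
 */
static int largest_rebar_code_that_fits(uint64_t window_bytes)
{
    int best = -1;

    for (unsigned int code = 0; code < 44; code++) {
        if (rebar_code_to_bytes(code) <= window_bytes)
            best = (int)code;
    }
    return best;
}
```

With hpmmioprefsize=384M, largest_rebar_code_that_fits(384 * MIB) yields code 8, i.e. 256 MiB, while a 32 GiB BAR1 corresponds to code 15 and cannot fit.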

Observed without this PR:

[   ...] nvidia 0000:8b:00.0: enabling device (0000 -> 0002)
[   ...] nvidia 0000:8b:00.0: BAR 14 [mem size 0x100000000 64bit pref]: failed to resize: -2
[   ...] NVRM: Fatal Error while attempting to resize PCIe BARs.
(device never binds)

Observed with this PR:

[   ...] nvidia 0000:8b:00.0: enabling device (0000 -> 0002)
[   ...] nvidia 0000:8b:00.0: device is downstream of Thunderbolt, skipping BAR1 resize
[   ...] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  595.58.03  Release Build
(device binds, nvidia-smi shows the eGPU alongside the internal GPU,
CUDA workloads run normally)

Beyond eGPUs, the same non-fatal semantics help in other constrained
topologies:

- Hypervisor guests with a passed-through GPU and a constrained MMIO
  window from the host.
- Older chipsets with small prefetchable windows that haven't been
  tuned for ReBAR.
- Platforms where firmware has locked resources and
  host->preserve_config is set but the existing guard didn't cover
  every failure path.

Fix (two commits)

Commit 1: "nvidia: make Resizable BAR resize failure non-fatal"

This is the correctness fix. Replace the goto err_zero_dev with a
warning print and continue probe. The device will bind with whatever
BAR size was allocated at PCI enumeration time. This commit alone
fixes the reported symptom.
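
The behavioral change in commit 1 can be modeled in userspace as follows. This is only a sketch of the control flow, not the driver code: struct mock_gpu, the mock_* functions, and the wording of the warning are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for per-device probe state; not the real driver struct. */
struct mock_gpu {
    int  resize_result;  /* what the mocked BAR resize returns; 0 means success */
    bool bound;          /* set once probe has completed */
    bool warned;         /* set when the non-fatal path logged a warning */
};

/* Mock of nv_resize_pcie_bars(): reports the preconfigured result. */
static int mock_resize_pcie_bars(const struct mock_gpu *gpu)
{
    assert(gpu != NULL);
    return gpu->resize_result;
}

/* Old behavior: a failed resize aborts probe, so the device never binds. */
static int mock_probe_fatal(struct mock_gpu *gpu)
{
    int rc = mock_resize_pcie_bars(gpu);

    if (rc) {
        fprintf(stderr, "NVRM: Fatal Error while attempting to resize PCIe BARs.\n");
        return rc;
    }
    gpu->bound = true;
    return 0;
}

/* New behavior: warn (hypothetical message text) and keep probing. */
static int mock_probe_nonfatal(struct mock_gpu *gpu)
{
    if (mock_resize_pcie_bars(gpu)) {
        fprintf(stderr, "NVRM: BAR resize failed, continuing with enumerated BAR sizes.\n");
        gpu->warned = true;
    }
    gpu->bound = true;
    return 0;
}
```

With resize_result set to -2 (the -ENOENT seen in the log), mock_probe_fatal() returns the error and leaves the device unbound, whereas mock_probe_nonfatal() records the warning and binds.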

Commit 2: "nvidia: skip Resizable BAR for Thunderbolt-attached devices"

This is a complementary optimization on top of commit 1. Detect
Thunderbolt attachment via pci_is_thunderbolt_attached() (in
<linux/pci.h> since Linux v4.15 / 2017) and skip the resize attempt
proactively, rather than attempting it and falling through to the
warning from commit 1. This saves some probe time and keeps the
kernel log free of the uninformative -ENOENT from
pci_resize_resource().

The helper is gated behind a new conftest
(NV_PCI_IS_THUNDERBOLT_ATTACHED_PRESENT) in
kernel-open/conftest.sh and declared in kernel-open/nvidia/nvidia.Kbuild,
so older kernels without the helper still build cleanly and fall back
to the generic commit-1 behavior. Non-Thunderbolt devices are
unaffected -- pci_is_thunderbolt_attached() returns false for a GPU
on a native PCIe slot, so normal ReBAR continues to run.
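
A userspace model of the commit 2 predicate and its compile-time gate might look like the sketch below. The types and names are hypothetical: struct mock_pci_dev keeps only the two fields the check needs, and MOCK_HAVE_THUNDERBOLT_HELPER stands in for NV_PCI_IS_THUNDERBOLT_ATTACHED_PRESENT. The walk is modeled loosely on the in-kernel helper, on the assumption that it checks the device itself and then every bridge above it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stands in for the conftest result; remove the define to model an old kernel. */
#define MOCK_HAVE_THUNDERBOLT_HELPER 1

/* Hypothetical minimal PCI device: a Thunderbolt flag and a link to the upstream bridge. */
struct mock_pci_dev {
    bool is_thunderbolt;
    const struct mock_pci_dev *upstream;  /* NULL at the root of the hierarchy */
};

/* True if the device or any bridge above it is part of a Thunderbolt controller. */
static bool mock_is_thunderbolt_attached(const struct mock_pci_dev *dev)
{
    assert(dev != NULL);
    for (const struct mock_pci_dev *p = dev; p != NULL; p = p->upstream) {
        if (p->is_thunderbolt)
            return true;
    }
    return false;
}

/* Decide whether probe should attempt the BAR1 resize at all. */
static bool mock_should_try_resize(const struct mock_pci_dev *dev)
{
#if defined(MOCK_HAVE_THUNDERBOLT_HELPER)
    if (mock_is_thunderbolt_attached(dev))
        return false;   /* skip proactively behind a Thunderbolt bridge */
#else
    (void)dev;          /* no helper available: always attempt the resize */
#endif
    return true;
}
```

A GPU hanging off a bridge with is_thunderbolt set is skipped, while a GPU directly below a non-Thunderbolt root port still gets the resize attempt. Without the define, every device takes the resize attempt and relies on the commit 1 warning path.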

Compatibility and review notes

- Either commit stands alone: commit 1 alone fixes the bug.
  Commit 2 can be dropped if reviewers prefer a minimal change. But
  the two together are tidier because commit 2 keeps the kernel log
  clean on what is now the most common failure case (TB eGPUs).
- No new module parameters, no new registry keys. This is a pure
  behavioral fix; nothing to document in nv-reg.h.
- Non-TB GPUs unchanged. The commit 2 fast-path predicate is
  topology-based and only matches GPUs behind TB bridges.
- Preserves existing diagnostic behavior for the old "really
  fatal" path: nv_pci_validate_bars() still bails out hard on
  BAR0 checks elsewhere in nv_pci_probe(), and err_zero_dev is
  still used for the failures that are genuinely fatal (e.g.
  rm_init_private_state()). This PR only changes the one specific
  failure mode that had false-positive fatality.
- Conftest pattern mirrors existing neighbors (see the
  pci_rebar_get_possible_sizes case immediately above the new
  entry).

Related

Composes cleanly with PR #984 ("nvidia: add RmForceExternalGpu
registry key" by @roger-pmta). #984 teaches the driver to treat a
specified GPU as external; this PR makes sure the driver can
actually bind a TB-attached GPU in the first place, which is the
prerequisite for #984's registry key to do anything useful on a
TB5 eGPU like the Aorus AI Box. We've been running both patches
together in production for a dual-RTX-5090 workstation (internal
+ TB5 eGPU). They solve different problems and are orthogonal.

Reference recipe for the full TB5 eGPU enablement is documented at
https://egpu.io/forums/builds/2023-14-lenovo-thinkpad-x1-carbon-gen-11-13th10cu-rtx-5080-32gbps-tb4-sonnet-breakaway-box-850-t5-linux-rocky-10-1/
(5080 / Sonnet) and in the companion configuration notes for the
RTX 5090 + Aorus AI Box setup this PR was developed against.

nv_resize_pcie_bars() is an optimization: it tries to grow BAR1 to the
largest size the hardware advertises so the CPU can address the full
VRAM directly. When the resize fails -- typically because the upstream
bridge's prefetchable MMIO window is too small to accommodate the
requested size -- the driver currently treats this as a fatal probe
error and bails out via err_zero_dev, preventing the GPU from binding
at all.

This is overly aggressive. The GPU is still perfectly usable with its
existing (un-resized) BAR allocation; that is the entire point of
Resizable BAR being an optional enhancement rather than a hard
requirement.

Systems that cannot accommodate the full resize include:

- Thunderbolt / USB4 eGPU enclosures, where the hotplug PCIe bridge
  prefetchable window is typically hundreds of MiB, not tens of GiB.
  With a modern GPU advertising a maximum BAR1 size of 16-32 GiB,
  pci_resize_resource() returns -ENOENT and nv_pci_probe() fails the
  whole device, so the eGPU silently never appears in nvidia-smi.
- Hypervisor guests where the host has passed a constrained MMIO
  window through to the guest.
- Older chipsets with small prefetchable windows.
- Platforms where the firmware has locked resources conservatively
  (preserve_config set).

The existing code already detects preserve_config and returns early
without failure -- this patch extends the same "skip but keep going"
principle to all other failure modes.

Replace the goto err_zero_dev with a warning print and continue probe.
The device will bind with whatever BAR size was allocated at PCI
enumeration time, which for constrained bridges is already the largest
size that fits.

Tested on an RTX 5090 in a Gigabyte Aorus RTX 5090 AI Box (TB5) on
Fedora 43 / kernel 7.0.0-rc7 + open-gpu-kernel-modules 595.58.03.
Without this patch, the eGPU fails to bind during probe with the
"Fatal Error while attempting to resize PCIe BARs" message and no
further action is possible. With this patch, the eGPU binds
successfully with its initial 256 MiB BAR1 (the largest that fits the
Thunderbolt hotplug bridge prefetch window) and works normally for
CUDA compute workloads.

Signed-off-by: Eric Christenson <eric@neuralnetwork.media>

The previous commit ("nvidia: make Resizable BAR resize failure
non-fatal") is the primary bug fix: it ensures that a failed resize no
longer prevents device binding. This commit is a complementary
optimization on top of that fix.

Thunderbolt / USB4 hotplug PCIe bridges fundamentally cannot host a
GiB-scale prefetchable MMIO window: the bridge prefetchable allocation
on these buses is typically bounded to hundreds of MiB, which is far
smaller than the multi-GiB BAR1 a modern NVIDIA GPU advertises.
Attempting the resize on such a device wastes probe time, emits an
uninformative ENOENT in the kernel log, and then takes the failure
path (now softened to a warning by the previous commit).

Avoid all of that by detecting Thunderbolt attachment up front via
pci_is_thunderbolt_attached(), which walks the parent bridge chain
looking for any bridge with is_thunderbolt set (set by the PCI core's
existing quirks table for known Intel TB host controllers). The helper
has been available in <linux/pci.h> since Linux v4.15 (2017-12-04).
For older kernels, the code is gated behind a conftest check
(NV_PCI_IS_THUNDERBOLT_ATTACHED_PRESENT) and the original resize
attempt is used unchanged; older kernels also predate most of the
hardware this optimization targets, so the protection is low-value
there but the guard keeps the driver build-clean on ancient trees.

Non-Thunderbolt devices are unaffected: pci_is_thunderbolt_attached()
returns false for any GPU on a native PCIe slot (CPU root complex or
chipset downstream port), so normal ReBAR continues to run and GPUs
keep their full resized BAR1.

Tested alongside the previous commit on an RTX 5090 in a Gigabyte
Aorus RTX 5090 AI Box (TB5) alongside an internal RTX 5090 in a
PCIe 5.0 x16 slot. Result: internal card keeps its full 32 GiB resized
BAR1 (verified by lspci and /sys/bus/pci/devices/.../resource); the
eGPU stays at 256 MiB BAR1 (the largest that fits the TB5 hotplug
bridge window) and binds cleanly without the resize attempt.

Signed-off-by: Eric Christenson <eric@neuralnetwork.media>
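
For anyone repeating the sysfs verification mentioned in the commit message, the sketch below computes a region size from one line of a PCI resource file. It assumes the usual "start end flags" hexadecimal layout of those lines; the function name is hypothetical, and the sample addresses in the note that follows are illustrative, not values captured from the test machine.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse one line of a PCI sysfs `resource` file ("start end flags", all hex)
 * and return the region size in bytes. Returns 0 for a line that does not
 * parse or that describes an unused region.
 */
static uint64_t resource_line_size(const char *line)
{
    uint64_t start = 0, end = 0, flags = 0;

    assert(line != NULL);
    if (sscanf(line, "%" SCNx64 " %" SCNx64 " %" SCNx64,
               &start, &end, &flags) != 3)
        return 0;
    if (end < start || (start == 0 && end == 0))
        return 0;               /* unused regions are reported as all zeros */
    return end - start + 1;     /* `end` is an inclusive address */
}
```

A line such as "0x0000004000000000 0x00000047ffffffff 0x000000000014220c" works out to 32 GiB, and "0x00000000a0000000 0x00000000afffffff 0x000000000014220c" to 256 MiB.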

CLAassistant · 2026-04-16T00:30:36Z

All committers have signed the CLA.

orkineric added 2 commits April 15, 2026 19:22


orkineric wants to merge 2 commits into NVIDIA:main from orkineric:rebar-nonfatal-thunderbolt
