
kernel 7.0.3 + nvidia-open 595.71.05 on RTX 3090: __nv_drm_gem_nvkms_map requests range exceeding PCI BAR1 → Xid 31 → Xid 154 (Node Reboot Required) under Chromium GPU workload #1134

@Zeus-Deus

Description

NVIDIA Open GPU Kernel Modules Version

595.71.05 (Arch package nvidia-open-dkms 595.71.05-2)

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [ ] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Arch Linux (rolling release)

Kernel Release

Linux host 7.0.3-arch1-2 #1 SMP PREEMPT_DYNAMIC Fri, 01 May 2026 15:49:22 +0000 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-)

Describe the bug

On a single-GPU RTX 3090 desktop running Linux 7.0.3 with
nvidia-open-dkms 595.71.05, the kernel logged a resource sanity check
warning naming __nv_drm_gem_nvkms_map as the caller of an mmap that
"spans more than" the device's BAR1 region. The same instant, the GPU
took an MMU fault on Copy Engine 2 (Xid 31) and the driver self-declared
the GPU unrecoverable (Xid 154, "Node Reboot Required") with
uvm encountered global fatal error 0x60. GSP RPC then timed out
(Xid 175). The display compositor's vblank stalled, the screen froze, and
neither nvidia-smi nor systemctl reboot could complete; recovery
required a hardware power-cycle. The trigger workload was a
Chromium-based browser (Brave) starting a new renderer process.

To Reproduce

  • Wayland compositor (Hyprland) running, ~2 hours uptime since boot
  • Brave (Chromium-based browser) open with several tabs
  • Brave subprocess started a new renderer/GPU process — call stack shows
    Chromium worker thread deep in kperfBoostSet_IMPL → rpcRmApiControl_GSP →
    _kgspRpcRecvPoll, consistent with a GPU-frequency-boost RPC during
    renderer spin-up
  • No CUDA process active; no userspace had /dev/nvidia-uvm open
  • System RAM healthy: 7.6 GiB / 61 GiB used, no swap pressure
  • Single occurrence so far; not yet a deterministic reproducer
  • See "Smoking-gun evidence" and "Fault sequence" in More Info below

Bug Incidence

Once

nvidia-bug-report.log.gz

More Info

Note: I have not tested with the proprietary nvidia-dkms package, so I have
left the proprietary-driver-confirmation checkbox unchecked. The kernel's
own resource sanity check warning names __nv_drm_gem_nvkms_map+0x99/0xf0 [nvidia_drm] as the caller, which is specific to nvidia-open's DRM layer.
I am happy to test the proprietary driver if maintainers think it would
help isolate the regression.

Smoking-gun evidence

Single line, logged by the kernel core (not by NVRM) at t = 0:

resource: resource sanity check: requesting [mem 0x000000fccfdd0000-0x000000fcd00fffff], which spans more than 0000:01:00.0 [mem 0xfcc0000000-0xfccfffffff 64bit pref]
caller __nv_drm_gem_nvkms_map+0x99/0xf0 [nvidia_drm] mapping multiple BARs

The requested range is 0x330000 bytes (~3.2 MiB) long: it starts ~2.2 MiB
before the end of BAR1 (0xfcc0000000-0xfccfffffff) and runs ~1 MiB past it,
into BAR3, which starts at 0xfcd0000000. The kernel's resource validation
flags the request as spanning multiple BARs, and the subsequent
[drm:_nv_drm_gem_nvkms_map] ERROR Failed to map NvKmsKapiMemory 0x00000000616506ff
confirms the map failed.
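
For reference, the overrun can be checked directly from the addresses in the
sanity-check line. The following standalone C snippet is not driver code; the
constants are copied from the log, and it just redoes that arithmetic:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Values copied verbatim from the resource sanity check line. */
        const uint64_t req_start = 0xfccfdd0000ULL;  /* requested range start */
        const uint64_t req_end   = 0xfcd00fffffULL;  /* requested range end   */
        const uint64_t bar1_end  = 0xfccfffffffULL;  /* 0000:01:00.0 BAR1 end */

        printf("request size      : 0x%llx bytes\n",
               (unsigned long long)(req_end - req_start + 1));
        printf("inside BAR1       : 0x%llx bytes\n",
               (unsigned long long)(bar1_end - req_start + 1));
        printf("overrun past BAR1 : 0x%llx bytes\n",
               (unsigned long long)(req_end - bar1_end));
        return 0;
    }

It prints a request of 0x330000 bytes (~3.2 MiB), of which 0x230000 (~2.2 MiB)
falls inside BAR1 and 0x100000 (1 MiB) falls past its end.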

Immediately preceding this, NVRM logged ~25 repetitions of:

NVRM: dmaAllocMapping_GM107: can't alloc VA space for mapping.
NVRM: nvAssertOkFailedNoLog: ... [NV_ERR_NO_MEMORY] (0x00000051) ... @ mapping_reuse.c:273
... @ kern_bus_gm107.c:3141 // ("pBar1VaInfo->reuseDb")

so BAR1 VA space was being repeatedly exhausted in the seconds leading up
to the bad-range request. That suggests the bad mapping is a fallback (or
an arithmetic mistake) on the BAR1-VA-exhausted path rather than a
random misuse of the pci_resource_* helpers.
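
To make that hypothesis concrete: the sketch below is purely illustrative
(none of these names or structures come from the nvidia-open source). It only
shows that a fallback which keeps using its allocation cursor after the BAR1
VA allocator reports exhaustion would produce exactly the request seen in the
log, starting a couple of MiB under the aperture end and spilling into the
neighboring BAR.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical illustration only, not driver code: a trivial bump
     * allocator over a BAR1-sized window, plus a broken fallback that
     * ignores the allocation failure and maps from the cursor anyway. */
    #define BAR1_BASE 0xfcc0000000ULL
    #define BAR1_SIZE (256ULL << 20)        /* 256 MiB, per the sanity-check line */

    static uint64_t cursor;                 /* bytes already handed out */

    static int alloc_va(uint64_t size, uint64_t *out)
    {
        if (cursor + size > BAR1_SIZE)
            return -1;                      /* analogue of NV_ERR_NO_MEMORY */
        *out = BAR1_BASE + cursor;
        cursor += size;
        return 0;
    }

    int main(void)
    {
        uint64_t addr, map_size = 0x330000; /* ~3.2 MiB, as in the log */

        cursor = BAR1_SIZE - 0x230000;      /* window almost exhausted */

        if (alloc_va(map_size, &addr) != 0) {
            /* BUGGY fallback: map from the cursor anyway, overrunning BAR1. */
            addr = BAR1_BASE + cursor;
            printf("would request [mem 0x%llx-0x%llx], past BAR1 end 0x%llx\n",
                   (unsigned long long)addr,
                   (unsigned long long)(addr + map_size - 1),
                   (unsigned long long)(BAR1_BASE + BAR1_SIZE - 1));
        }
        return 0;
    }

This reproduces the exact [mem 0xfccfdd0000-0xfcd00fffff] range from the
sanity-check message, which is why the BAR1-exhaustion path looks like the
place to start looking.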

Fault sequence

All times relative to t = 0 (the resource sanity check line above).
Full redacted log in kernel-log-excerpt.txt.

Offset Event
t+0:00:00 resource sanity check, __nv_drm_gem_nvkms_map ... mapping multiple BARs, Failed to map NvKmsKapiMemory.
t+0:00:00 Xid 31 — MMU Fault: ENGINE CE2 HUBCLIENT_CE0 faulted @ 0x1_21000000, FAULT_PTE ACCESS_TYPE_VIRT_WRITE.
t+0:00:00 nvGpuOpsReportFatalError: uvm encountered global fatal error 0x60, requiring os reboot to recover.
t+0:00:00 Xid 154 — GPU recovery action changed from 0x0 (None) to 0x2 (Node Reboot Required).
t+0:00:00 Brave GPU subprocess receives SIGILL (trap invalid opcode ... in brave[...]).
t+0:00:01 [drm:nv_drm_atomic_apply_modeset_config] Failed to initialize semaphore for plane fence, nv_drm_atomic_commit Error code: -11.
t+0:01:15 _kgspIsHeartbeatTimedOut: diff 75117 timeout 5200. GSP heartbeat lost.
t+0:01:45 Memory Subsystem Error detected. kgmmuInvalidateTlb failed.
t+0:01:45 Xid 175 — Timeout after 75s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL). Originating thread name ThreadPoolSingl (Chromium worker).
t+0:01:48 Call trace dumped: _kgspRpcRecvPoll → _issueRpcAndWait → rpcRmApiControl_GSP → kperfBoostSet_IMPL → resControl_IMPL → ... → nvidia_unlocked_ioctl.
t+0:01:48 onward RC watchdog: GPU is probably locked! Notify Timeout Seconds: 7 repeats every 30-60 s. Hundreds of NV_ERR_RESET_REQUIRED assertions firing as the fullchip-reset path itself fails its preconditions.
t+0:06:18 Xid 16, Head 00000003 Count ..., RM has detected that 7 Seconds without a Vblank Counter Update on head:D0. Display visibly froze.
t+0:12:48 Second Xid 16 / vblank-watchdog.

Recovery

  • nvidia-smi accepted the ioctl but never returned (killed manually after ~5 min).

  • The driver's own RC path tried FULLCHIP_RESET repeatedly; every attempt failed with NV_ERR_RESET_REQUIRED precondition assertions — the chip-reset path itself was wedged.

  • systemctl reboot was invoked from an SSH session and hung at nvidia_drm module teardown for >5 minutes without progress.

  • Recovery required holding the hardware power button.

The system was otherwise functional throughout: SSH stayed up, the Wayland compositor's main thread was alive in do_epoll_wait, no processes were in D-state. The wedge is entirely below nvidia_drm.

What I have ruled out

  • Hardware fault on the GPU. This 3090 had been stable for many months on the previous linux 6.19.11 + nvidia-open 595.58.03 stack with the same workload. After the hardware power-cycle, the system came up cleanly on the same 7.0.3 + 595.71.05 stack and has so far been stable.

  • Host OOM. 7.6 GiB / 61 GiB host RAM in use at fault time. No swap pressure. No oom_reaper activity in the journal. The OOM was GPU-VA, not host RAM.

  • Userspace-only fault. The kernel core's resource sanity check was emitted from inside nvidia_drm's __nv_drm_gem_nvkms_map. The subsequent Xid 31 MMU fault is a consequence of the bad mapping being used. The Brave SIGILL came after the kernel error and looks like a downstream consequence of the GPU buffer the renderer expected being inaccessible.

  • DKMS build mismatch / firmware mismatch. DKMS built nvidia-open 595.71.05 cleanly for both kernels at upgrade time; modules load cleanly; firmware version matches the driver expectations (linux-firmware-nvidia 20260410-1).

I cannot rule out — and want to be careful not to overclaim — which component regressed. The kernel and the driver were both upgraded in the same transaction, so this could be a bug in nvidia-open's PCI BAR-range arithmetic, a kernel-side change to the resource validation that nvidia-open is the first to trip, or a problem in the combination (e.g. a new pci_resource_* semantic on 7.0.x that nvidia-open hasn't adopted yet). I have not yet had the opportunity to bisect.
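
For context on that last possibility, the kind of clamp that keeps a single
mapping inside one BAR, written against the stock pci_resource_* helpers, is
sketched below. The function and parameter names are placeholders of mine, not
code from the nvidia-open tree; it is only meant to show what a bounds check
on such a path would look like.

    #include <linux/pci.h>
    #include <linux/io.h>

    /* Illustrative only; names are placeholders, not from the nvidia-open
     * source. Refuse to ioremap() a window that does not fit entirely
     * inside the chosen BAR. */
    static void __iomem *map_within_bar(struct pci_dev *pdev, int bar,
                                        resource_size_t offset,
                                        resource_size_t size)
    {
        resource_size_t bar_len = pci_resource_len(pdev, bar);

        if (offset > bar_len || size > bar_len - offset)
            return NULL;    /* would span past the BAR, as in the log line */

        return ioremap(pci_resource_start(pdev, bar) + offset, size);
    }

Whether the actual __nv_drm_gem_nvkms_map path performs an equivalent check on
595.71.05 is exactly the kind of thing a bisect should tell us.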

Open questions

If you (or anyone reading) have seen this signature before, I'd value pointers on any of:

  1. Does this reproduce on nvidia-dkms (proprietary kernel module) at 595.71.05, holding kernel 7.0.3 fixed?

  2. Does this reproduce on kernel 6.19.11 with nvidia-open 595.71.05?

  3. Does disabling Chromium-side GPU acceleration (e.g. --disable-gpu-rasterization, --disable-gpu) prevent it on 7.0.3 + 595.71.05?

  4. Does the resource sanity check line precede every freeze of this form, or are there freezes without it? (I have only this one occurrence.)
