
GSP firmware halt on sm_120 (RTX Pro 6000 Blackwell) under sustained zero-gap llama.cpp inference — silent hard hang #1111

@vjureta

Description


NVIDIA Open GPU Kernel Modules Version

580.126.20 (Ubuntu package nvidia-driver-580-open 580.126.20-1ubuntu1, built Feb 18 2026)

    $ cat /proc/driver/nvidia/version
    NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  580.126.20  Release Build
    GCC version:  gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04.1)

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 24.04.4 LTS

Kernel Release

Linux TRX5 6.14.0-37-generic #37~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 20 10:25:38 UTC 2 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (UUID: GPU-93808743-a5f5-6561-b7ea-e70aa68a7e75)

Describe the bug

nvidia-bug-report.log.gz

Hardware: GPU

$ nvidia-smi -L
GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (UUID: GPU-93808743-a5f5-6561-b7ea-e70aa68a7e75)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-5251258f-f189-8197-1523-06bc955c221d)
GPU 2: NVIDIA GeForce RTX 3090 (UUID: GPU-61483aef-c24c-1dd1-d09e-7323c87380d8)
GPU 3: NVIDIA GeForce RTX 3090 (UUID: GPU-ecd9b283-ad80-8315-b9d0-09af02b663bc)

The crash is specific to GPU 0 (Blackwell GB202, sm_120). The three 3090s (sm_86, Ampere) run the same sustained workload on separate ports without issue.

Summary

Sustained zero-gap LLM inference against a local llama.cpp server on GPU 0 (RTX Pro 6000 Blackwell, GB202, sm_120) causes a silent system-wide hard hang
after ~45 minutes / ~3000 sequential chat-completion requests. The entire host becomes unresponsive: SSH drops, nvidia-smi doesn't return, the physical display
freezes. Only a hard power-cycle / BIOS watchdog reboot recovers.

Distinguishing signature: no Xid, no NVRM error, no kernel log before freeze

Unlike #1080 (GB202 + Vulkan, logs Xid 109 / Xid 8) and #1045 (RTX 5080, logs Xid 62 / 45 / 119 / 154), my hang produces zero diagnostic output:

  • No Xid in dmesg
  • No NVRM / nvidia-modeset error
  • journalctl output after recovery shows the journal simply truncating at the hang moment — the kernel cannot flush before the freeze
  • nvidia-smi never returns after onset
  • krcWatchdog never fires

This is a harder failure mode than the other GSP timeout bugs already filed — the GSP firmware appears to take the kernel down with it before any signal can be
raised.
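Because nothing reaches the logs, the only external detection I can suggest is a wall-clock probe from user space. This is a hypothetical helper (`gpu_responsive` is my name, not part of the reproducer): it treats a probe command that hangs past a timeout as the onset signature, since on this failure mode nvidia-smi simply never returns.

```python
import subprocess

def gpu_responsive(cmd=("nvidia-smi", "-L"), timeout=10.0):
    """Return False if the probe command hangs past `timeout` or is absent.

    No Xid or NVRM line is ever logged in this hang class, so a wall-clock
    timeout on a short probe process is the only signal user space gets.
    """
    try:
        subprocess.run(cmd, capture_output=True, timeout=timeout)
        return True
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
```

Caveat: once the kernel itself locks up, this watchdog dies with the host, which is why in practice only the BIOS watchdog recovers the machine; the probe is useful mainly for catching the window just before full lockup.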

Reproduced across 5+ occurrences

Over 3 weeks of batch inference workloads, the same hang has reproduced at least 5 separate times, always on GPU 0 (Blackwell) and always under sustained
chat-completion load. Switching identical workload to a tensor-parallel pair of sm_86 Ampere GPUs (RTX 3090×2 on the same machine, NVLink NV4 56 GB/s) is stable for
20+ hour runs with the same request rate.

Mitigations tried (none fix the hang)

| Attempt | Effect |
| --- | --- |
| BIOS ASUS WRX90E-SAGE SE 0901 → 1317 | No change |
| llama.cpp rebuild from HEAD (SHA ff5ef8278, version 8763) | No change |
| `--parallel 2` → `--parallel 1` | No change |
| Stopped secondary llama.cpp instance on port 8012 (same GPU) | No change |
| CUDA 12.8.0 runtime | No change |

Workaround in production

Route all sustained LLM workloads to a separate tensor-parallel llama.cpp instance on GPU 2 + GPU 3 (RTX 3090 × 2, NVLink NV4). GPU 0 is restricted to bursty /
interactive use only. This works reliably for multi-hour batch inference.

Related existing issues (different workload / different failure trace, likely same family)

  • #1080 — GB202 + Vulkan, dies with Xid 109 / Xid 8 logged
  • #1045 — RTX 5080, dies with Xid 62 / 45 / 119 / 154 logged

To Reproduce

Environment

  • GPU under test: GPU 0 only — NVIDIA RTX PRO 6000 Blackwell Workstation Edition (GB202, sm_120, 97 GB VRAM, PCIe 5.0 x16)
  • Motherboard / CPU / PSU: ASUS WRX90E-SAGE SE, BIOS 1317 (latest, 2026-02-03) / AMD Threadripper Pro / 1600W Platinum
  • Driver: 580.126.20 open kernel module (package nvidia-driver-580-open 580.126.20-1ubuntu1)
  • CUDA runtime: 12.8.0 (inside nvidia/cuda:12.8.0-runtime-ubuntu24.04 container)
  • Secure Boot: disabled (required for nvidia-open DKMS)
  • llama.cpp: commit ff5ef8278, version tag 8763, flash-attn ON

Reproduction

  1. Pull a large GGUF (reproduced on Qwen_Qwen3-32B-Q4_K_M.gguf, ~19 GB).

  2. Run llama.cpp bound to GPU 0 with exactly these flags:
    llama-server \
      --model /models/Qwen_Qwen3-32B-Q4_K_M.gguf \
      --host 0.0.0.0 --port 8008 \
      --ctx-size 32768 \
      --n-gpu-layers 999 \
      --threads 8 \
      --parallel 1 \
      --batch-size 2048 \
      --cont-batching \
      --flash-attn on \
      --reasoning-budget 0

  3. Issue a tight loop of sequential chat-completion requests with no idle gap:

import httpx, asyncio

async def main():
    async with httpx.AsyncClient() as c:
        for i in range(10_000):
            r = await c.post(
                "http://127.0.0.1:8008/v1/chat/completions",
                json={
                    "model": "qwen3-32b",
                    "messages": [
                        {"role": "system", "content": "You extract structured legal obligations."},
                        {"role": "user", "content": "Clause text: <1-2 KB body>. Return JSON array."},
                    ],
                    "temperature": 0.0,
                    "max_tokens": 2048,
                    "response_format": {"type": "json_schema", "json_schema": {...}},
                },
                timeout=300,
            )
            r.raise_for_status()

asyncio.run(main())

   Each request: ~12 K prompt tokens in, ~500–1500 tokens out. No asyncio.sleep, no backoff, no batching pause: each iteration fires as soon as the previous response returns.

  4. Watch `nvidia-smi dmon` in another terminal. GPU 0 sits at 60–90 % utilization continuously.
  5. After roughly 3,000 requests / ~45 minutes wall-clock, the system freezes silently. SSH drops; the physical console is non-responsive; a hard reset is required.
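Because journald truncates at the hang moment (the kernel cannot flush), buffered logging loses the final records. A fsync'ed progress file pins the exact request count and timestamp at which the freeze hit; `log_progress` below is a hypothetical addition to the client loop above, not part of the original reproducer.

```python
import os, time

def log_progress(path, i):
    """Append one fsync'ed line per completed request.

    Only data already forced to disk survives the hard freeze, so flush +
    fsync makes the last completed request recoverable after the reset.
    """
    with open(path, "a") as f:
        f.write(f"{time.time():.3f} request={i}\n")
        f.flush()
        os.fsync(f.fileno())
```

Calling this after each `r.raise_for_status()` (e.g. `log_progress("/var/tmp/hang_probe.log", i)`) brackets the hang point to within one request.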

Control experiment (same workload, different GPU)

Identical client code, identical model, same machine, pointing at a tensor-parallel llama.cpp on GPU 2 + GPU 3 (RTX 3090 × 2, sm_86, NVLink NV4):

llama-server \
  --model /models/Qwen_Qwen3-32B-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8009 \
  --tensor-split 0,0,1,1 \
  --ctx-size 16384 --parallel 1 \
  --n-gpu-layers 999 --flash-attn on

Stable for 20+ hour batch runs at the same request rate. No hang, no Xid, no warnings.

### Bug Incidence

Always (when GPU 0 serves sustained zero-gap inference for >30 min).

Reproduced 5+ times over ~3 weeks. Mean time to hang: ~45 min.
Zero occurrences on sm_86 (Ampere) with identical workload on the same host.
### nvidia-bug-report.log.gz

[nvidia-bug-report.log.gz](https://github.com/user-attachments/files/26814657/nvidia-bug-report.log.gz)

nvidia-bug-report.sh will now collect information about your
system and create the file 'nvidia-bug-report.log.gz' in the current
directory.  It may take several seconds to run.  In some
cases, it may hang trying to capture data generated dynamically
by the Linux kernel and/or the NVIDIA kernel module.  While
the bug report log file will be incomplete if this happens, it
may still contain enough data to diagnose your problem.

If nvidia-bug-report.sh hangs, consider running with the --safe-mode
and --extra-system-data command line arguments.

Please include the 'nvidia-bug-report.log.gz' log file when reporting
your bug via the NVIDIA Linux forum (see forums.developer.nvidia.com)
or by sending email to 'linux-bugs@nvidia.com'.

By delivering 'nvidia-bug-report.log.gz' to NVIDIA, you acknowledge
and agree that personal information may inadvertently be included in
the output.  Notwithstanding the foregoing, NVIDIA will use the
output only for the purpose of investigating your reported issue.

Running nvidia-bug-report.sh...
Detected driver type: Running RM Driver on GPUs
complete.


Summary of Skipped Sections:

Skipped Component                   | Details
================================================================================
ldd output                          | glxinfo not found
--------------------------------------------------------------------------------
vulkaninfo output                   | vulkaninfo not found
--------------------------------------------------------------------------------
ibstat output                       | ibstat not found
--------------------------------------------------------------------------------
acpidump output                     | acpidump not found
--------------------------------------------------------------------------------
mst output                          | mst not found
--------------------------------------------------------------------------------
nvlsm-bug-report.sh output          | nvlsm-bug-report.sh not found
--------------------------------------------------------------------------------

Summary of Errors:

Error Component                     | Details                                                      | Resolution
=========================================================================================================================


### More Info

### Additional observations

- **Not a thermal issue**: GPU 0 package temp sits at 60–65 °C throughout, with the fan curve responding normally. Full thermal headroom.
- **Not PSU / PCIe power**: 1600 W Platinum PSU with all 12V-2×6 rails dedicated per vendor spec; no dropouts, and no correctable errors on GPU 0's PCIe link until after the hang.
- **Post-reboot `dmesg` shows correctable AER errors on an unrelated PCIe device** (`0000:02:00.0`, an Intel NIC, not the GPU), which appear to be a downstream side-effect of the hard reset, not the cause.
- **Can't test against the proprietary driver** because Blackwell (sm_120) is supported only via the open kernel module; the `.run` proprietary package does not
build against this silicon. This forecloses the standard "try proprietary" debugging path for anyone using RTX Pro 6000 Blackwell.

### What I expected

Either:
1. The GSP firmware to self-recover (as it does on my sm_86 GPUs under equivalent or heavier sustained load), **or**
2. At minimum, an Xid plus a recoverable kernel error; even Xid 119 / 154 would let the kernel log the event and let operators build a watchdog around it.

Instead the failure mode is an un-diagnosable silent lockup, which is a significant step backward in operability compared to Ampere and Hopper GSP behavior.

### Ask

- Confirmation whether the silent-hang class is tracked internally, or if it's a new variant of the #1080 / #1045 GSP-firmware-death family.
- Any workaround flag / firmware parameter that would either
  (a) lower the sustained-inference timeout so the GSP resets before locking the host, or
  (b) force Xid/NVRM reporting on GSP death so a user-space watchdog can act.
- Guidance on whether this is expected to be addressed in the 595.x / 600.x driver line; the 590 line regressed Blackwell support and dropped sm_120, so 580-open is currently the only path forward for this card.

Thank you for maintaining the open driver. Happy to run any additional diagnostics, capture perf counters before a controlled hang, or test patches against this
reproducer.
