
GSP firmware halt on sm_120 (RTX Pro 6000 Blackwell) under sustained zero-gap llama.cpp inference — silent hard hang #1111

@vjureta

Description


NVIDIA Open GPU Kernel Modules Version

580.126.20 (Ubuntu package nvidia-driver-580-open 580.126.20-1ubuntu1, built Feb 18 2026)

    $ cat /proc/driver/nvidia/version
    NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  580.126.20  Release Build
    GCC version:  gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04.1)

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 24.04.4 LTS

Kernel Release

Linux TRX5 6.14.0-37-generic #37~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 20 10:25:38 UTC 2 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (UUID: GPU-93808743-a5f5-6561-b7ea-e70aa68a7e75)

Describe the bug

nvidia-bug-report.log.gz

Hardware: GPU

$ nvidia-smi -L
GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (UUID: GPU-93808743-a5f5-6561-b7ea-e70aa68a7e75)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-5251258f-f189-8197-1523-06bc955c221d)
GPU 2: NVIDIA GeForce RTX 3090 (UUID: GPU-61483aef-c24c-1dd1-d09e-7323c87380d8)
GPU 3: NVIDIA GeForce RTX 3090 (UUID: GPU-ecd9b283-ad80-8315-b9d0-09af02b663bc)

The crash is specific to GPU 0 (Blackwell GB202, sm_120). The three 3090s (sm_86, Ampere) run the same sustained workload on separate ports without issue.

Summary

Sustained zero-gap LLM inference against a local llama.cpp server on GPU 0 (RTX Pro 6000 Blackwell, GB202, sm_120) causes a silent system-wide hard hang
after ~45 minutes / ~3000 sequential chat-completion requests. The entire host becomes unresponsive: SSH drops, nvidia-smi doesn't return, the physical display
freezes. Only a hard power-cycle / BIOS watchdog reboot recovers.

Distinguishing signature: no Xid, no NVRM error, no kernel log before freeze

Unlike #1080 (GB202 + Vulkan, logs Xid 109 / Xid 8) and #1045 (RTX 5080, logs Xid 62 / 45 / 119 / 154), my hang produces zero diagnostic output:

  • No Xid in dmesg
  • No NVRM / nvidia-modeset error
  • journalctl output after recovery shows the journal simply truncating at the hang moment — the kernel cannot flush before the freeze
  • nvidia-smi never returns after onset
  • krcWatchdog never fires

This is a harder failure mode than the other GSP timeout bugs already filed — the GSP firmware appears to take the kernel down with it before any signal can be
raised.
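Because nothing reaches the logs, the only external detection I can suggest is a wall-clock probe from user space. This is a hypothetical helper (`gpu_responsive` is my name, not part of the reproducer): it treats a probe command that hangs past a timeout as the onset signature, since on this failure mode nvidia-smi simply never returns.

```python
import subprocess

def gpu_responsive(cmd=("nvidia-smi", "-L"), timeout=10.0):
    """Return False if the probe command hangs past `timeout` or is absent.

    No Xid or NVRM line is ever logged in this hang class, so a wall-clock
    timeout on a short probe process is the only signal user space gets.
    """
    try:
        subprocess.run(cmd, capture_output=True, timeout=timeout)
        return True
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
```

Caveat: once the kernel itself locks up, this watchdog dies with the host, which is why in practice only the BIOS watchdog recovers the machine; the probe is useful mainly for catching the window just before full lockup.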

Reproduced across 5+ occurrences

Over 3 weeks of batch inference workloads, the same hang has reproduced at least 5 separate times, always on GPU 0 (Blackwell) and always under sustained
chat-completion load. Switching identical workload to a tensor-parallel pair of sm_86 Ampere GPUs (RTX 3090×2 on the same machine, NVLink NV4 56 GB/s) is stable for
20+ hour runs with the same request rate.

Mitigations tried (none fix the hang)

| Attempt | Effect |
| --- | --- |
| BIOS ASUS WRX90E-SAGE SE 0901 → 1317 | No change |
| llama.cpp rebuild from HEAD (SHA ff5ef8278, version 8763) | No change |
| `--parallel 2` → `--parallel 1` | No change |
| Stopped secondary llama.cpp instance on port 8012 (same GPU) | No change |
| CUDA 12.8.0 runtime | No change |

Workaround in production

Route all sustained LLM workloads to a separate tensor-parallel llama.cpp instance on GPU 2 + GPU 3 (RTX 3090 × 2, NVLink NV4). GPU 0 is restricted to bursty /
interactive use only. This works reliably for multi-hour batch inference.

Related existing issues (different workload / different failure trace, likely same family)

  • #1080 — GB202 + Vulkan, dies with Xid 109 / Xid 8 logged
  • #1045 — RTX 5080, dies with Xid 62 / 45 / 119 / 154 logged

To Reproduce

Environment

  • GPU under test: GPU 0 only — NVIDIA RTX PRO 6000 Blackwell Workstation Edition (GB202, sm_120, 97 GB VRAM, PCIe 5.0 x16)
  • Motherboard / CPU / PSU: ASUS WRX90E-SAGE SE, BIOS 1317 (latest, 2026-02-03) / AMD Threadripper Pro / 1600W Platinum
  • Driver: 580.126.20 open kernel module (package nvidia-driver-580-open 580.126.20-1ubuntu1)
  • CUDA runtime: 12.8.0 (inside nvidia/cuda:12.8.0-runtime-ubuntu24.04 container)
  • Secure Boot: disabled (required for nvidia-open DKMS)
  • llama.cpp: commit ff5ef8278, version tag 8763, flash-attn ON

Reproduction

  1. Pull a large GGUF (reproduced on Qwen_Qwen3-32B-Q4_K_M.gguf, ~19 GB).

  2. Run llama.cpp bound to GPU 0 with exactly these flags:
    llama-server \
      --model /models/Qwen_Qwen3-32B-Q4_K_M.gguf \
      --host 0.0.0.0 --port 8008 \
      --ctx-size 32768 \
      --n-gpu-layers 999 \
      --threads 8 \
      --parallel 1 \
      --batch-size 2048 \
      --cont-batching \
      --flash-attn on \
      --reasoning-budget 0

  3. Issue a tight loop of sequential chat-completion requests with no idle gap:

import httpx, asyncio

async def main():
    async with httpx.AsyncClient() as c:
        for i in range(10_000):
            r = await c.post(
                "http://127.0.0.1:8008/v1/chat/completions",
                json={
                    "model": "qwen3-32b",
                    "messages": [
                        {"role": "system", "content": "You extract structured legal obligations."},
                        {"role": "user", "content": "Clause text: <1-2 KB body>. Return JSON array."},
                    ],
                    "temperature": 0.0,
                    "max_tokens": 2048,
                    "response_format": {"type": "json_schema", "json_schema": {...}},
                },
                timeout=300,
            )
            r.raise_for_status()

asyncio.run(main())

   Each request: ~12 K prompt tokens in, ~500–1500 tokens out. No asyncio.sleep, no backoff, no batching pause: each iteration fires as soon as the previous response returns.

  4. Watch `nvidia-smi dmon` in another terminal. GPU 0 sits at 60–90 % utilization continuously.
  5. After roughly 3,000 requests / ~45 minutes wall-clock, the system freezes silently. SSH drops; the physical console is non-responsive; a hard reset is required.
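Because journald truncates at the hang moment (the kernel cannot flush), buffered logging loses the final records. A fsync'ed progress file pins the exact request count and timestamp at which the freeze hit; `log_progress` below is a hypothetical addition to the client loop above, not part of the original reproducer.

```python
import os, time

def log_progress(path, i):
    """Append one fsync'ed line per completed request.

    Only data already forced to disk survives the hard freeze, so flush +
    fsync makes the last completed request recoverable after the reset.
    """
    with open(path, "a") as f:
        f.write(f"{time.time():.3f} request={i}\n")
        f.flush()
        os.fsync(f.fileno())
```

Calling this after each `r.raise_for_status()` (e.g. `log_progress("/var/tmp/hang_probe.log", i)`) brackets the hang point to within one request.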

Control experiment (same workload, different GPU)

Identical client code, identical model, same machine, pointing at a tensor-parallel llama.cpp on GPU 2 + GPU 3 (RTX 3090 × 2, sm_86, NVLink NV4):

llama-server \
  --model /models/Qwen_Qwen3-32B-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8009 \
  --tensor-split 0,0,1,1 \
  --ctx-size 16384 --parallel 1 \
  --n-gpu-layers 999 --flash-attn on

Stable for 20+ hour batch runs at the same request rate. No hang, no Xid, no warnings.

### Bug Incidence

Always (when GPU 0 serves sustained zero-gap inference for >30 min).

Reproduced 5+ times over ~3 weeks. Mean time to hang: ~45 min.
Zero occurrences on sm_86 (Ampere) with identical workload on the same host.
### nvidia-bug-report.log.gz

[nvidia-bug-report.log.gz](https://github.com/user-attachments/files/26814657/nvidia-bug-report.log.gz)

nvidia-bug-report.sh will now collect information about your
system and create the file 'nvidia-bug-report.log.gz' in the current
directory.  It may take several seconds to run.  In some
cases, it may hang trying to capture data generated dynamically
by the Linux kernel and/or the NVIDIA kernel module.  While
the bug report log file will be incomplete if this happens, it
may still contain enough data to diagnose your problem.

If nvidia-bug-report.sh hangs, consider running with the --safe-mode
and --extra-system-data command line arguments.

Please include the 'nvidia-bug-report.log.gz' log file when reporting
your bug via the NVIDIA Linux forum (see forums.developer.nvidia.com)
or by sending email to 'linux-bugs@nvidia.com'.

By delivering 'nvidia-bug-report.log.gz' to NVIDIA, you acknowledge
and agree that personal information may inadvertently be included in
the output.  Notwithstanding the foregoing, NVIDIA will use the
output only for the purpose of investigating your reported issue.

Running nvidia-bug-report.sh...
Detected driver type: Running RM Driver on GPUs
complete.


Summary of Skipped Sections:

Skipped Component                   | Details
================================================================================
ldd output                          | glxinfo not found
--------------------------------------------------------------------------------
vulkaninfo output                   | vulkaninfo not found
--------------------------------------------------------------------------------
ibstat output                       | ibstat not found
--------------------------------------------------------------------------------
acpidump output                     | acpidump not found
--------------------------------------------------------------------------------
mst output                          | mst not found
--------------------------------------------------------------------------------
nvlsm-bug-report.sh output          | nvlsm-bug-report.sh not found
--------------------------------------------------------------------------------

Summary of Errors:

Error Component                     | Details                                                      | Resolution
=========================================================================================================================


### More Info

### Additional observations

- **Not a thermal issue**: GPU 0 package temp sits at 60–65 °C throughout, with the fan curve responding normally. Full thermal headroom.
- **Not PSU / PCIe power**: 1600 W Platinum PSU with all 12V-2×6 rails dedicated per vendor spec; no dropouts, and no correctable errors on GPU 0's PCIe link until after the hang.
- **Post-reboot `dmesg` shows correctable AER errors on an unrelated PCIe device** (`0000:02:00.0`, an Intel NIC, not the GPU), which appear to be a downstream side-effect of the hard reset, not the cause.
- **Can't test against the proprietary driver** because Blackwell (sm_120) is supported only via the open kernel module; the `.run` proprietary package does not
build against this silicon. This forecloses the standard "try proprietary" debugging path for anyone using RTX Pro 6000 Blackwell.

### What I expected

Either:
1. The GSP firmware to self-recover (as it does on my sm_86 GPUs under equivalent or heavier sustained load), **or**
2. At minimum, an Xid plus a recoverable kernel error; even Xid 119 / 154 would let the kernel log the event and let operators build a watchdog around it.

Instead the failure mode is an un-diagnosable silent lockup, which is a significant step backward in operability compared to Ampere and Hopper GSP behavior.

### Ask

- Confirmation whether the silent-hang class is tracked internally, or if it's a new variant of the #1080 / #1045 GSP-firmware-death family.
- Any workaround flag / firmware parameter that would either
  (a) lower the sustained-inference timeout so the GSP resets before locking the host, or
  (b) force Xid/NVRM reporting on GSP death so a user-space watchdog can act.
- Guidance on whether this is expected to be addressed in the 595.x / 600.x driver line; the 590 line regressed Blackwell support and dropped sm_120, so 580-open is currently the only path forward for this card.

Thank you for maintaining the open driver. Happy to run any additional diagnostics, capture perf counters before a controlled hang, or test patches against this
reproducer.
