GH200: test_get_bar_size_in_kb fails with CUDA_ERROR_NOT_SUPPORTED #2299

Description

@kkraus14

Problem

The GH200 nightly-standard row re-enabled by #2296 now reaches the cuda.bindings test suite, but tests/test_cufile.py::test_get_bar_size_in_kb fails on the GH200 runner.

Failure: https://github.com/NVIDIA/cuda-python/actions/runs/28600139877/job/84805962749?pr=2296#step:33:582

Environment:

  • Linux aarch64
  • Python 3.14
  • CUDA Toolkit 13.3.0
  • NVIDIA GH200 480GB
  • Runner group nv-gpu-arm64-gh200-1gpu

Failure

tests/test_cufile.py::test_get_bar_size_in_kb FAILED

bar_size_kb = cufile.get_bar_size_in_kb(0)

cuda.bindings.cufile.cuFileError:
SUCCESS (0): cufile success; CUDA status: CUDA_ERROR_NOT_SUPPORTED (801)

The remainder of the bindings suite completed with 419 passed and 22 skipped; this was the only failure.
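For triage, it may help to split the error text into its two parts: the cuFile operation status and the accompanying CUDA status. The sketch below assumes the message always follows the shape of the single log line above; `parse_cufile_error_message` is a hypothetical helper, not part of cuda.bindings, and the regular expression is inferred from that one sample only.

```python
import re
from typing import NamedTuple, Optional


class CuFileErrorParts(NamedTuple):
    cufile_name: str
    cufile_code: int
    cuda_name: str
    cuda_code: int


# Pattern inferred from one observed message; other layouts will not match.
_MESSAGE_RE = re.compile(
    r"(?P<cufile_name>\w+) \((?P<cufile_code>-?\d+)\): .*?; "
    r"CUDA status: (?P<cuda_name>\w+) \((?P<cuda_code>-?\d+)\)"
)


def parse_cufile_error_message(message: str) -> Optional[CuFileErrorParts]:
    """Hypothetical helper: return both statuses, or None if the text does not match."""
    match = _MESSAGE_RE.search(message)
    if match is None:
        return None
    return CuFileErrorParts(
        cufile_name=match["cufile_name"],
        cufile_code=int(match["cufile_code"]),
        cuda_name=match["cuda_name"],
        cuda_code=int(match["cuda_code"]),
    )


observed = "SUCCESS (0): cufile success; CUDA status: CUDA_ERROR_NOT_SUPPORTED (801)"
parts = parse_cufile_error_message(observed)
```

Read this way, the cuFile status is SUCCESS (0) and the failure comes entirely from the CUDA status, CUDA_ERROR_NOT_SUPPORTED (801).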

Assessment

GH200 connects the Grace CPU and Hopper GPU through NVLink-C2C rather than using PCIe as the CPU-GPU data path. cuFile also has a C2C P2P mode, so this does not imply that cuFile/GDS is generally unsupported on GH200.

A narrower and more likely explanation is that GH200 does not expose or use the GPU BAR aperture that cuFileGetBARSizeInKB expects, which makes this specific query inapplicable. The platform still has PCIe I/O, and the runner reports a PCI-style GPU bus ID (00000000:FD:00.0), so "GH200 has no PCIe support" would be too broad.

One uncertainty remains: the cuFileGetBARSizeInKB API reference does not document CUDA_ERROR_NOT_SUPPORTED as a return for this function, so we should confirm whether the observed result is expected platform behavior or a cuFile/documentation gap.

Expected outcome

  • Confirm whether CUDA_ERROR_NOT_SUPPORTED is expected from cuFileGetBARSizeInKB on GH200/C2C systems.
  • If expected, update test_get_bar_size_in_kb to skip or xfail only on this specific "no applicable BAR aperture" result, so that it no longer fails the full bindings suite.
  • If unexpected, follow up with cuFile and update the test once the intended API behavior is known.
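If the result turns out to be expected, the skip could look roughly like the sketch below. It relies only on `cufile.get_bar_size_in_kb` and `cufile.cuFileError`, both of which appear in the failure log. Matching on the message text is an assumption made here because the exception's attributes are not shown in this issue; `is_not_supported_error` is a hypothetical helper, the final assertions are placeholders, and whatever driver or context setup the real test performs is omitted.

```python
def is_not_supported_error(exc: BaseException) -> bool:
    """Hypothetical helper: True if the error text reports CUDA_ERROR_NOT_SUPPORTED."""
    # Assumption: string matching stands in for a structured status check,
    # since the exception's attributes are not documented in this issue.
    return "CUDA_ERROR_NOT_SUPPORTED" in str(exc)


def test_get_bar_size_in_kb():
    # Imported lazily so the helper above stays usable without pytest or a GPU.
    import pytest

    cufile = pytest.importorskip("cuda.bindings.cufile")

    try:
        bar_size_kb = cufile.get_bar_size_in_kb(0)
    except cufile.cuFileError as exc:
        if is_not_supported_error(exc):
            # Skip only for the "no applicable BAR aperture" result;
            # every other cuFile error still fails the test.
            pytest.skip("BAR size query not supported on this platform")
        raise

    # Placeholder checks; the existing test's own assertions would go here.
    assert isinstance(bar_size_kb, int)
    assert bar_size_kb > 0
```

Keeping the condition this narrow means other cuFile errors on GH200, and any error on non-C2C platforms, continue to surface as failures.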

Labels: bug, cuda.bindings, test, triage