Cuda 12.8 Multi GPU Blackwell/A100 Fails #858

@vladrad

Description

NVIDIA Open GPU Kernel Modules Version

575.51.03, 570.133.20, and 570.124.06

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Description: Ubuntu 22.04.3 LTS

Kernel Release

6.8.0-59-generic #61~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 15 17:03:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, GPU 1: NVIDIA A100 80GB PCIe

Describe the bug

I am not able to enumerate the available GPUs in multiple applications.

nvidia-smi shows the correct GPUs, but using both of them in a PyTorch or CUDA application fails when querying the available devices.
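For anyone trying to narrow this down without PyTorch or a compiled sample, here is a minimal sketch that calls the driver API directly through ctypes. It assumes the driver library is loadable as libcuda.so.1; the function name probe_cuda_driver is made up for illustration. It only reports the status codes returned by cuInit and cuDeviceGetCount, so a non-zero status from cuInit (such as the error 3 in the CUPTI output further down) would show up here as well.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Return (status, device_count) from the CUDA driver API.

    status is None if the driver library cannot be loaded;
    device_count is None if initialization or enumeration failed.
    The library name is an assumption about the local install.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        # Driver library is missing or not on the loader path.
        return None, None

    # cuInit must succeed before any other driver API call.
    status = libcuda.cuInit(0)
    if status != 0:
        return status, None

    count = ctypes.c_int(0)
    status = libcuda.cuDeviceGetCount(ctypes.byref(count))
    return status, (count.value if status == 0 else None)


if __name__ == "__main__":
    status, count = probe_cuda_driver()
    print(f"status={status} device_count={count}")
```

Running it once per CUDA_VISIBLE_DEVICES setting should show whether the failure is already at cuInit or only at device enumeration.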

If I export CUDA_VISIBLE_DEVICES=0 or 1 and reload the nvidia_uvm module:

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

it seems to work. I understand that there may be a mismatch between the GPUs; however, even using them independently fails unless I do the steps above.
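The workaround can be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, the order (reload first, then pin the device) is my assumption, and dry_run defaults to True so nothing is executed unless explicitly requested. The two module commands are the same ones listed above.

```python
import os
import subprocess


def uvm_reload_commands():
    """Commands that unload and reload the nvidia_uvm kernel module."""
    return [
        ["sudo", "rmmod", "nvidia_uvm"],
        ["sudo", "modprobe", "nvidia_uvm"],
    ]


def reload_uvm_and_pin(device_index, dry_run=True):
    """Reload nvidia_uvm and build an environment pinned to one GPU.

    Returns (commands, env). With dry_run=True the commands are only
    returned, not run. env is a copy of the current environment with
    CUDA_VISIBLE_DEVICES set to the given index.
    """
    commands = uvm_reload_commands()
    if not dry_run:
        for cmd in commands:
            # check=True raises if rmmod or modprobe fails.
            subprocess.run(cmd, check=True)
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(device_index))
    return commands, env


if __name__ == "__main__":
    commands, env = reload_uvm_and_pin(0)
    for cmd in commands:
        print(" ".join(cmd))
    print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"])
```

The returned env can be passed to subprocess.run(..., env=env) to launch the actual application on a single GPU.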

I tried CUDA 12.8 and 12.9, and neither works. At this point I am assuming this may be a driver issue. Or is it a CUDA issue?

To Reproduce

Using any compiled library such as vLLM or llama.cpp triggers the issue.

Also tried:

export CUDA_VISIBLE_DEVICES=0,1
/usr/local/cuda-12.8/extras/CUPTI/samples/event_multi_gpu$ ./event_multi_gpu
Usage: ./event_multi_gpu [event_name]

Error: event_multi_gpu.cu:63: Function cuInit(0) failed with error(3): initialization error.
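To compare the single-GPU and dual-GPU cases side by side, a small harness like the one below can run the same command under each CUDA_VISIBLE_DEVICES value. It is a sketch, not part of any NVIDIA tooling: run_with_visible_devices is a hypothetical helper, and the probe in the example is a stand-in that only echoes the variable. Swapping in something like ["./event_multi_gpu"] would exercise the real failure.

```python
import os
import subprocess
import sys


def run_with_visible_devices(command, settings=("0", "1", "0,1")):
    """Run command once per CUDA_VISIBLE_DEVICES value.

    Returns a dict mapping each setting to
    (return code, stripped stdout, stripped stderr).
    """
    results = {}
    for setting in settings:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=setting)
        proc = subprocess.run(command, env=env, capture_output=True, text=True)
        results[setting] = (proc.returncode, proc.stdout.strip(), proc.stderr.strip())
    return results


if __name__ == "__main__":
    # Stand-in probe: prints the variable it was started with.
    probe = [
        sys.executable,
        "-c",
        "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])",
    ]
    for setting, (code, out, err) in run_with_visible_devices(probe).items():
        print(f"CUDA_VISIBLE_DEVICES={setting} exit={code} stdout={out!r} stderr={err!r}")
```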

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log

More Info

No response

Metadata

Labels

bug (Something isn't working)
