CUDA Device Binding Runtime Error When Running GPT-3 in Multi-Node Mode Using Slurm #3123

@glara76

Description

System Info

2-node configuration (4× A100 GPUs per node)

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  • sbatch script
    #!/bin/bash
    #SBATCH --account=rainbow
    #SBATCH --gres=gpu:4
    #SBATCH --ntasks-per-node=4
    #SBATCH --partition=debug
    #SBATCH --job-name=multinode_GPT3
    #SBATCH --time=1:00:00
    #SBATCH --nodes=2
    #SBATCH --ntasks=8
    #SBATCH --output=gpt_run_%j.log

    srun --mpi=pmix --ntasks=${tp_size} --ntasks-per-node=4 --nodes=2 \
        --export=ALL,gpt_model=${gpt_model},tp_size=${tp_size} \
        ./multi_run_profile.sh

  • multi_run_profile.sh
    source gpt_config.sh
    export NCCL_DEBUG=INFO
    export NCCL_IB_DISABLE=1
    export NCCL_P2P_DISABLE=0
    export NCCL_NET_GDR_LEVEL=2
    export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID

    case $SLURMD_NODENAME in
      slurm1-04)
        export NCCL_SOCKET_IFNAME=ens21f0
        ;;
      slurm2-03)
        export NCCL_SOCKET_IFNAME=eno8403
        ;;
      *)
        echo "[ERROR] Unknown node $SLURMD_NODENAME"
        exit 1
        ;;
    esac

    echo "====================== DEBUG INFO ======================"
    echo "[Rank $SLURM_PROCID] LocalID=$SLURM_LOCALID"
    echo "[Rank $SLURM_PROCID] Host=$(hostname)"
    echo "[Rank $SLURM_PROCID] Node=$SLURMD_NODENAME"
    echo "[Rank $SLURM_PROCID] CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
    echo "[Rank $SLURM_PROCID] torch sees $(python3 -c 'import torch; print(torch.cuda.device_count())') GPUs"
    echo "========================================================"

    python3 ../run.py --engine_dir ${gpt_model}/trt_engines/fp16/${tp_size}-gpu \
    --tokenizer_dir gpt \
    --input_file input.txt \
    --max_output_len 1024 \
    --max_input_length 2048 \
    --output_csv output.csv \
    --run_profiling
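
One detail worth noting in the script above: `export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID` leaves each task exactly one visible GPU, and CUDA renumbers whatever survives the mask to ordinal 0. A minimal, GPU-free sketch of that renumbering (the parsing helper is hypothetical, for illustration only):

```python
def visible_ordinals(mask: str) -> list[int]:
    """Hypothetical helper: after CUDA_VISIBLE_DEVICES is applied,
    the surviving devices are renumbered 0..k-1 regardless of their
    physical indices on the node."""
    return list(range(len([d for d in mask.split(",") if d != ""])))

# Rank 3 exports CUDA_VISIBLE_DEVICES=3 -> one device, seen as ordinal 0.
print(visible_ordinals("3"))        # -> [0]
# With no mask, all four GPUs keep ordinals 0..3.
print(visible_ordinals("0,1,2,3"))  # -> [0, 1, 2, 3]
```

So a rank that later asks for device index 1, 2, or 3 is requesting an ordinal that no longer exists from its own point of view.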

Expected behavior

Run GPT-3 with TP=8 across 2 nodes (8× A100 GPUs in total)

Actual behavior

  • Error
    Traceback (most recent call last):
    File "/host/TensorRT-LLM/examples/gpt/../run.py", line 579, in
    main(args)
    File "/host/TensorRT-LLM/examples/gpt/../run.py", line 428, in main
    runner = runner_cls.from_dir(**runner_kwargs)
    File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 182, in from_dir
    session = GptSession(config=session_config,
    RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaSetDevice(device): invalid device ordinal

Additional notes

I successfully ran TensorRT-LLM on a single node with tensor parallelism (TP=4) without using Slurm, and everything worked fine. Now, after installing and configuring Slurm for multi-node execution, I'm encountering the following issue:

RuntimeError: CUDA error: invalid device ordinal
This error appears when torch.cuda.set_device() is called.

I believe this is due to the GPU device visibility being limited by Slurm, but I couldn’t find clear guidance on how to configure torch.cuda.set_device() properly under Slurm.

How should device selection be handled in TensorRT-LLM when running in a Slurm-managed multi-node environment?
Is there a recommended way to set the CUDA device per rank while respecting Slurm’s GPU binding?
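
For context, a pattern I have seen in other multi-node launchers (an assumption on my side, not confirmed TensorRT-LLM guidance) is to leave all GPUs visible to every task and map each local rank onto a device with a modulo, e.g. `local_rank % device_count`:

```python
def device_for_rank(local_rank: int, visible_devices: int) -> int:
    # Wrap the Slurm local rank onto whatever devices are visible,
    # so the chosen ordinal always exists.
    return local_rank % visible_devices

# All 4 GPUs visible: local ranks 0..3 map one-to-one.
print([device_for_rank(r, 4) for r in range(4)])  # -> [0, 1, 2, 3]
# Visibility masked down to a single GPU: every rank must use ordinal 0.
print([device_for_rank(r, 1) for r in range(4)])  # -> [0, 0, 0, 0]
```

If that is the intended approach here, it would mean not exporting `CUDA_VISIBLE_DEVICES=$SLURM_LOCALID` at all, but I would appreciate confirmation.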
