CUDA Device Binding Runtime Error When Running GPT-3 in Multi-Node Mode Using Slurm #3123

@glara76

Description

System Info

2-node configuration (4× A100 GPUs per node)

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  • sbatch script
    #!/bin/bash
    #SBATCH --account=rainbow
    #SBATCH --gres=gpu:4
    #SBATCH --ntasks-per-node=4
    #SBATCH --partition=debug
    #SBATCH --job-name=multinode_GPT3
    #SBATCH --time=1:00:00
    #SBATCH --nodes=2
    #SBATCH --ntasks=8
    #SBATCH --output=gpt_run_%j.log

    srun --mpi=pmix --ntasks=${tp_size} --ntasks-per-node=4 --nodes=2 \
        --export=ALL,gpt_model=${gpt_model},tp_size=${tp_size} \
        ./multi_run_profile.sh

  • multi_run_profile.sh
    source gpt_config.sh
    export NCCL_DEBUG=INFO
    export NCCL_IB_DISABLE=1
    export NCCL_P2P_DISABLE=0
    export NCCL_NET_GDR_LEVEL=2
    export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID

    case $SLURMD_NODENAME in
      slurm1-04)
        export NCCL_SOCKET_IFNAME=ens21f0
        ;;
      slurm2-03)
        export NCCL_SOCKET_IFNAME=eno8403
        ;;
      *)
        echo "[ERROR] Unknown node $SLURMD_NODENAME"
        exit 1
        ;;
    esac

    echo "====================== DEBUG INFO ======================"
    echo "[Rank $SLURM_PROCID] LocalID=$SLURM_LOCALID"
    echo "[Rank $SLURM_PROCID] Host=$(hostname)"
    echo "[Rank $SLURM_PROCID] Node=$SLURMD_NODENAME"
    echo "[Rank $SLURM_PROCID] CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
    echo "[Rank $SLURM_PROCID] torch sees $(python3 -c 'import torch; print(torch.cuda.device_count())') GPUs"
    echo "========================================================"

    python3 ../run.py --engine_dir ${gpt_model}/trt_engines/fp16/${tp_size}-gpu \
    --tokenizer_dir gpt \
    --input_file input.txt \
    --max_output_len 1024 \
    --max_input_length 2048 \
    --output_csv output.csv \
    --run_profiling
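
One detail worth noting in the script above: `export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID` leaves each task exactly one visible GPU, and CUDA renumbers whatever survives the mask to ordinal 0. A minimal, GPU-free sketch of that renumbering (the parsing helper is hypothetical, for illustration only):

```python
def visible_ordinals(mask: str) -> list[int]:
    """Hypothetical helper: after CUDA_VISIBLE_DEVICES is applied,
    the surviving devices are renumbered 0..k-1 regardless of their
    physical indices on the node."""
    return list(range(len([d for d in mask.split(",") if d != ""])))

# Rank 3 exports CUDA_VISIBLE_DEVICES=3 -> one device, seen as ordinal 0.
print(visible_ordinals("3"))        # -> [0]
# With no mask, all four GPUs keep ordinals 0..3.
print(visible_ordinals("0,1,2,3"))  # -> [0, 1, 2, 3]
```

So a rank that later asks for device index 1, 2, or 3 is requesting an ordinal that no longer exists from its own point of view.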

Expected behavior

Run GPT-3 with TP=8 across 2 nodes (8× A100 GPUs in total)

Actual behavior

  • Error
    Traceback (most recent call last):
    File "/host/TensorRT-LLM/examples/gpt/../run.py", line 579, in
    main(args)
    File "/host/TensorRT-LLM/examples/gpt/../run.py", line 428, in main
    runner = runner_cls.from_dir(**runner_kwargs)
    File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 182, in from_dir
    session = GptSession(config=session_config,
    RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaSetDevice(device): invalid device ordinal

Additional notes

I successfully ran TensorRT-LLM on a single node with tensor parallelism (TP=4) without using Slurm, and everything worked fine. Now, after installing and configuring Slurm for multi-node execution, I'm encountering the following issue:

RuntimeError: CUDA error: invalid device ordinal
This error appears when torch.cuda.set_device() is called.

I believe this is due to the GPU device visibility being limited by Slurm, but I couldn’t find clear guidance on how to configure torch.cuda.set_device() properly under Slurm.

How should device selection be handled in TensorRT-LLM when running in a Slurm-managed multi-node environment?
Is there a recommended way to set the CUDA device per rank while respecting Slurm’s GPU binding?
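
For context, a pattern I have seen in other multi-node launchers (an assumption on my side, not confirmed TensorRT-LLM guidance) is to leave all GPUs visible to every task and map each local rank onto a device with a modulo, e.g. `local_rank % device_count`:

```python
def device_for_rank(local_rank: int, visible_devices: int) -> int:
    # Wrap the Slurm local rank onto whatever devices are visible,
    # so the chosen ordinal always exists.
    return local_rank % visible_devices

# All 4 GPUs visible: local ranks 0..3 map one-to-one.
print([device_for_rank(r, 4) for r in range(4)])  # -> [0, 1, 2, 3]
# Visibility masked down to a single GPU: every rank must use ordinal 0.
print([device_for_rank(r, 1) for r in range(4)])  # -> [0, 0, 0, 0]
```

If that is the intended approach here, it would mean not exporting `CUDA_VISIBLE_DEVICES=$SLURM_LOCALID` at all, but I would appreciate confirmation.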
