Description
System Info
2-node configuration (4×A100 GPUs per node)
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- sbatch script
```bash
#!/bin/bash
#SBATCH --account=rainbow
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --partition=debug
#SBATCH --job-name=multinode_GPT3
#SBATCH --time=1:00:00
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --output=gpt_run_%j.log

srun --mpi=pmix --ntasks=${tp_size} --ntasks-per-node=4 --nodes=2 \
    --export=ALL,gpt_model=${gpt_model},tp_size=${tp_size} \
    ./multi_run_profile.sh
```
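For reference, with `--ntasks=8 --ntasks-per-node=4 --nodes=2`, srun's default block distribution places ranks 0-3 on the first node and ranks 4-7 on the second. A minimal sketch of that mapping (my own illustration, not Slurm code):

```python
def rank_layout(ntasks_per_node: int, procid: int) -> tuple:
    """Derive (node_index, local_id) for a global rank, mirroring srun's
    default block distribution of tasks over nodes."""
    return procid // ntasks_per_node, procid % ntasks_per_node

# With 8 tasks and 4 per node: rank 0 -> node 0 / LocalID 0,
# rank 5 -> node 1 / LocalID 1, rank 7 -> node 1 / LocalID 3.
```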
- multi_run_profile.sh
```bash
source gpt_config.sh

export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=0
export NCCL_NET_GDR_LEVEL=2
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID

case $SLURMD_NODENAME in
  slurm1-04)
    export NCCL_SOCKET_IFNAME=ens21f0
    ;;
  slurm2-03)
    export NCCL_SOCKET_IFNAME=eno8403
    ;;
  *)
    echo "[ERROR] Unknown node $SLURMD_NODENAME"
    exit 1
    ;;
esac

echo "====================== DEBUG INFO ======================"
echo "[Rank $SLURM_PROCID] LocalID=$SLURM_LOCALID"
echo "[Rank $SLURM_PROCID] Node=$SLURMD_NODENAME"
echo "[Rank $SLURM_PROCID] CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
echo "========================================================"

python3 ../run.py --engine_dir ${gpt_model}/trt_engines/fp16/${tp_size}-gpu \
    --tokenizer_dir gpt \
    --input_file input.txt \
    --max_output_len 1024 \
    --max_input_length 2048 \
    --output_csv output.csv \
    --run_profiling
```
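One detail worth noting about the script above: because each rank exports `CUDA_VISIBLE_DEVICES=$SLURM_LOCALID`, every process sees exactly one GPU, and that GPU is renumbered to ordinal 0 inside the process. A minimal sketch of that renumbering effect (my own illustration, not TensorRT-LLM code):

```python
import os

def visible_ordinal(local_id: int, pin_with_env: bool) -> int:
    """Sketch: if CUDA_VISIBLE_DEVICES pins a single GPU per rank, the only
    visible device is always ordinal 0; passing the raw local rank to
    cudaSetDevice() would then be out of range."""
    if pin_with_env:
        os.environ["CUDA_VISIBLE_DEVICES"] = str(local_id)
        return 0  # the pinned GPU is renumbered to ordinal 0
    return local_id  # all node GPUs visible: local rank maps 1:1 to ordinal
```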
Expected behavior
Run a TP=8 GPT-3 engine on 2 nodes (8×A100 GPUs in total).
Actual behavior
- Error
```text
Traceback (most recent call last):
  File "/host/TensorRT-LLM/examples/gpt/../run.py", line 579, in
    main(args)
  File "/host/TensorRT-LLM/examples/gpt/../run.py", line 428, in main
    runner = runner_cls.from_dir(**runner_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 182, in from_dir
    session = GptSession(config=session_config,
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaSetDevice(device): invalid device ordinal
```
Additional notes
I successfully ran TensorRT-LLM on a single node with tensor parallelism (TP=4) without using Slurm, and everything worked fine. Now, after installing and configuring Slurm for multi-node execution, I'm encountering the following issue:
RuntimeError: CUDA error: invalid device ordinal
This error appears when torch.cuda.set_device() is called.
I believe this is because Slurm limits GPU visibility per task, but I couldn't find clear guidance on how to call torch.cuda.set_device() correctly under Slurm.
How should device selection be handled in TensorRT-LLM when running in a Slurm-managed multi-node environment?
Is there a recommended way to set the CUDA device per rank while respecting Slurm’s GPU binding?
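Concretely, the kind of per-rank selection I have in mind looks like this (a sketch under my own assumptions, not something confirmed by the TensorRT-LLM docs): leave CUDA_VISIBLE_DEVICES unrestricted so all 4 GPUs on a node stay visible, and derive the device ordinal from SLURM_LOCALID.

```python
import os

def device_for_rank(env=os.environ) -> int:
    """Sketch: with CUDA_VISIBLE_DEVICES left unset, every task on a node
    sees all of that node's GPUs, so the Slurm local rank can index them
    directly. On a GPU node one would then call
    torch.cuda.set_device(local_rank) before any CUDA context is created."""
    local_rank = int(env.get("SLURM_LOCALID", "0"))
    return local_rank
```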