I have been trying to run the example chess_finetune.py with FSDP. I have multiple nodes with multiple GPUs each. Here is my configuration:
#SBATCH --job-name=chess_finetune # create a short name for your job
#SBATCH --output=chess_finetune.out # file to write stdout
#SBATCH --nodes=2 # node count
#SBATCH --cpus-per-task=4 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --gres=gpu:2 # number of gpus per node
#SBATCH --time=01:00:00 # total run time limit (HH:MM:SS)
#SBATCH --ntasks=2 # total number of tasks across all nodes
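For reference, with 2 nodes, 2 GPUs per node and --ntasks=2 this gives one task per node, so each task has to start one worker per GPU itself. A minimal launch sketch, assuming torchrun is used and taking the master address from the first host in the SLURM node list (the rendezvous port 29500 is arbitrary):

# derive a rendezvous endpoint from the allocation
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# one srun task per node; torchrun starts one process per GPU on that node
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=2 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:29500 \
    chess_finetune.py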
Training seems to start but then gets stuck without any further output. Is there any way to debug this and check for bugs?
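One way to get more information out of a hung run (a sketch; the environment variables below are standard PyTorch/NCCL logging knobs, and py-spy is an optional extra dependency):

# more verbose rendezvous and collective logging
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_CPP_LOG_LEVEL=INFO

# on a node where the job is stuck, dump the Python stacks of one worker
# to see which collective or barrier it is waiting in
py-spy dump --pid <worker_pid>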
Logs:
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING]
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] *****************************************
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] *****************************************
OriAlpha changed the title from "Multinode and multi gpu traning gets stuck" to "Multinode and multi gpu training gets stuck" on Apr 13, 2024.
Hey, did you get any fix? I am facing the same issue, specifically with --ntasks-per-node=2 and gpu=2. When I set --ntasks-per-node=1 it works, but it is slow since only one process is handling both GPUs. I feel there is a deadlock or something similar happening.
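If it only hangs with --ntasks-per-node=2, it may be worth checking what each task actually sees before the training script starts, e.g. whether two tasks on the same node both see the same GPUs and each start their own launcher. A quick sketch (CUDA_VISIBLE_DEVICES may be unset depending on how the cluster configures GPU binding):

# print the task-to-node/GPU mapping from inside the same job script
srun bash -c \
  'echo "node=$(hostname) task=$SLURM_PROCID gpus=${CUDA_VISIBLE_DEVICES:-unset}"'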