
Multinode and multi gpu training gets stuck #3

Open
OriAlpha opened this issue Apr 13, 2024 · 1 comment

Comments


OriAlpha commented Apr 13, 2024

I have been trying to run the example chess_finetune.py with FSDP. I have multiple nodes with multiple GPUs each. Here is my configuration:

#SBATCH --job-name=chess_finetune  # create a short name for your job
#SBATCH --output=chess_finetune.out      # file to write stdout
#SBATCH --nodes=2                  # node count
#SBATCH --cpus-per-task=4          # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --gres=gpu:2               # number of gpus per node
#SBATCH --time=01:00:00            # total run time limit (HH:MM:SS)
#SBATCH --ntasks=2                 # total number of tasks across all nodes
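
For reference, a minimal sketch of what the launch portion of such an sbatch script could look like (this is illustrative only, assuming torchrun is used inside the job script; the rendezvous endpoint and port are placeholders, not taken from the actual script):

# One launcher task per node; torchrun spawns one worker per GPU.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=2 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
    chess_finetune.py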

Training seems to start but then gets stuck without any progress. Is there a way to debug this and check where it hangs?
Logs:

[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] 
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] *****************************************
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] *****************************************
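
A few standard PyTorch/NCCL debug switches can help narrow down where a multi-node job hangs (set them in the sbatch script before the launch command; these are generic suggestions, not specific to this repository):

export NCCL_DEBUG=INFO                  # print NCCL init and transport details for each rank
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # have torch.distributed report mismatched collectives
export TORCH_CPP_LOG_LEVEL=INFO         # surface c10d rendezvous/store log messages

If every rank stops inside the initial rendezvous or NCCL init, the usual suspects are the master address/port, hostname resolution, or a firewall between the nodes, rather than the training code itself.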
@Divyanshupy

Hey, did you find a fix? I am facing the same issue, specifically with --ntasks-per-node=2 and gpu=2. When I set ntasks-per-node=1 it works, but it is slow since only one process is handling both GPUs. I feel there is a deadlock or something similar happening.
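
For comparison, the one-task-per-GPU layout usually looks like the sketch below (whether it applies depends on how the example is launched: plain srun wants one task per GPU, while torchrun expects one task per node and spawns the per-GPU workers itself):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2   # one task per GPU when each task runs the script directly via srun
#SBATCH --gpus-per-task=1     # bind each task to its own GPU
#SBATCH --cpus-per-task=4

Mixing the two layouts, e.g. requesting one task per GPU while torchrun also spawns one worker per GPU, is a common way to end up with ranks waiting on each other exactly like this.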
