
Multinode and multi gpu training gets stuck #3

Open
OriAlpha opened this issue Apr 13, 2024 · 1 comment

Comments


OriAlpha commented Apr 13, 2024

I have been trying to run the example chess_finetune.py with FSDP. I have multiple nodes with multiple GPUs each. Here is my configuration:

#SBATCH --job-name=chess_finetune  # create a short name for your job
#SBATCH --output=chess_finetune.out      # file to write stdout
#SBATCH --nodes=2                  # node count
#SBATCH --cpus-per-task=4          # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --gres=gpu:2               # number of gpus per node
#SBATCH --time=01:00:00            # total run time limit (HH:MM:SS)
#SBATCH --ntasks=2                 # total number of tasks across all nodes
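
For reference, a minimal sketch of what the launch portion of such an sbatch script could look like (this is illustrative only, assuming torchrun is used inside the job script; the rendezvous endpoint and port are placeholders, not taken from the actual script):

# One launcher task per node; torchrun spawns one worker per GPU.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=2 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
    chess_finetune.py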

Training seems to start but then gets stuck without any progress. Is there a way to debug this and check where it hangs?
Logs:

[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] 
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] *****************************************
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] *****************************************
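
A few standard PyTorch/NCCL debug switches can help narrow down where a multi-node job hangs (set them in the sbatch script before the launch command; these are generic suggestions, not specific to this repository):

export NCCL_DEBUG=INFO                  # print NCCL init and transport details for each rank
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # have torch.distributed report mismatched collectives
export TORCH_CPP_LOG_LEVEL=INFO         # surface c10d rendezvous/store log messages

If every rank stops inside the initial rendezvous or NCCL init, the usual suspects are the master address/port, hostname resolution, or a firewall between the nodes, rather than the training code itself.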
@Divyanshupy

Hey, did you find a fix? I am facing the same issue, specifically with --ntasks-per-node=2 and gpu=2. When I set ntasks-per-node=1 it works, but it is slow since only one process is handling both GPUs. I feel there is a deadlock or something similar happening.
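
For comparison, the one-task-per-GPU layout usually looks like the sketch below (whether it applies depends on how the example is launched: plain srun wants one task per GPU, while torchrun expects one task per node and spawns the per-GPU workers itself):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2   # one task per GPU when each task runs the script directly via srun
#SBATCH --gpus-per-task=1     # bind each task to its own GPU
#SBATCH --cpus-per-task=4

Mixing the two layouts, e.g. requesting one task per GPU while torchrun also spawns one worker per GPU, is a common way to end up with ranks waiting on each other exactly like this.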
