[Tacotron2/Pytorch] Multi-node error on saving checkpoints? #1092
Open
Labels
bug: Something isn't working
Description
Related to Model/Framework(s)
PyTorch Distributed Training
Describe the bug
The bug happens with multi-node training because the training script uses local_rank to decide which process saves checkpoints, so the save is repeated on every node. This sometimes produces an error when more than one node tries to write the checkpoint files at the same time, or to create the symlink to the last checkpoint.
ERROR:
File "train.py", line 229, in save_checkpoint
print("Updating symlink", symlink_dst, "to point to", symlink_src)
FileExistsError: [Errno 17] File exists: 'checkpoint_Tacotron2_0.pt' -> 'output/checkpoint_Tacotron2_last.pt'
To Reproduce
Steps to reproduce the behavior:
- Train on a multi-node cluster.
Expected behavior
I think a good solution is to gate checkpoint saving on the global rank instead of local_rank, so the checkpoint is written exactly once.
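The suggested fix can be sketched as a guard on the global rank. With torchrun or torch.distributed.launch, the RANK environment variable holds the global rank across all nodes, while LOCAL_RANK is only unique within a node; the same value is returned by torch.distributed.get_rank() once the process group is initialized. The helper names below (is_main_process, save_checkpoint_guarded) are hypothetical, not from the repo:

```python
import os

def is_main_process():
    # torchrun exports RANK as the *global* rank across all nodes;
    # LOCAL_RANK repeats on every node, which is why rank-0 of each
    # node was writing checkpoints. Default to 0 for single-process runs.
    return int(os.environ.get("RANK", 0)) == 0

def save_checkpoint_guarded(save_fn, *args, **kwargs):
    # Only the single global rank-0 process writes checkpoint files,
    # so two nodes never race on the same path or symlink.
    if is_main_process():
        save_fn(*args, **kwargs)
```

The same check can be written as `torch.distributed.get_rank() == 0` inside the training script once `init_process_group` has run.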