[Tacotron2/Pytorch] Multi-node error on saving checkpoints? #1092

@bodasadallah

Description

Related to Model/Framework(s)
PyTorch Distributed Training

Describe the bug
The bug happens with multi-node training: the training script uses local_rank to decide when to save checkpoints, so the save is repeated on every node. This sometimes produces an error when more than one node tries to write the checkpoint files at the same time, or tries to create the symlink to the last checkpoint.

ERROR:

File "train.py", line 229, in save_checkpoint
    print("Updating symlink", symlink_dst, "to point to", symlink_src)
FileExistsError: [Errno 17] File exists: 'checkpoint_Tacotron2_0.pt' -> 'output/checkpoint_Tacotron2_last.pt' 

To Reproduce
Steps to reproduce the behavior:

  1. Train on a multinode cluster.

Expected behavior
I think a good solution is to save checkpoints based on the global rank (e.g., only on rank 0), so each checkpoint is written exactly once across the whole job.
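The guard above can be sketched without touching the rest of the training loop. torchrun (and torch.distributed.launch) export RANK as the global rank and LOCAL_RANK as the per-node rank, so the check can read the environment directly; the function names here are hypothetical illustrations, not the repo's API:

```python
import os

def is_main_process() -> bool:
    """True only on global rank 0 across all nodes.

    LOCAL_RANK is 0 once *per node*, so guarding saves with it makes
    every node write to the shared output directory. RANK is unique
    across the whole job, so exactly one process passes this check.
    """
    return int(os.environ.get("RANK", "0")) == 0

def maybe_save_checkpoint(save_fn) -> None:
    # Only the single global-rank-0 process writes checkpoints and
    # updates the symlink; all other ranks skip the filesystem work.
    if is_main_process():
        save_fn()
```

With this gate in place, only one process ever reaches the symlink code, which removes the multi-writer race entirely.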

Labels: bug