[Tacotron2/Pytorch] Multi-node error on saving checkpoints? #1092

@bodasadallah

Description

Related to Model/Framework(s)
PyTorch Distributed Training

Describe the bug
The bug happens with multi-node training: the training script uses local_rank to decide when to save checkpoints, so the save is repeated on every node. This sometimes produces an error when more than one node tries to write the checkpoint files at the same time, or tries to create the symlink to the last checkpoint.

ERROR:

File "train.py", line 229, in save_checkpoint
    print("Updating symlink", symlink_dst, "to point to", symlink_src)
FileExistsError: [Errno 17] File exists: 'checkpoint_Tacotron2_0.pt' -> 'output/checkpoint_Tacotron2_last.pt' 

To Reproduce
Steps to reproduce the behavior:

  1. Train on a multinode cluster.

Expected behavior
I think a good solution is to save checkpoints based on the global rank (e.g., only on rank 0), so each checkpoint is written exactly once across the whole job.
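The guard above can be sketched without touching the rest of the training loop. torchrun (and torch.distributed.launch) export RANK as the global rank and LOCAL_RANK as the per-node rank, so the check can read the environment directly; the function names here are hypothetical illustrations, not the repo's API:

```python
import os

def is_main_process() -> bool:
    """True only on global rank 0 across all nodes.

    LOCAL_RANK is 0 once *per node*, so guarding saves with it makes
    every node write to the shared output directory. RANK is unique
    across the whole job, so exactly one process passes this check.
    """
    return int(os.environ.get("RANK", "0")) == 0

def maybe_save_checkpoint(save_fn) -> None:
    # Only the single global-rank-0 process writes checkpoints and
    # updates the symlink; all other ranks skip the filesystem work.
    if is_main_process():
        save_fn()
```

With this gate in place, only one process ever reaches the symlink code, which removes the multi-writer race entirely.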

Labels: bug