Cannot train with multi GPUs #13

Closed
Yablon opened this issue Dec 4, 2019 · 7 comments

@Yablon

Yablon commented Dec 4, 2019

I cloned the repository to my local server and started training on my own dataset.

I can run with one GPU, and the logs are as follows:

FP16 Run: False
Dynamic Loss Scaling: False
Distributed Run: False
cuDNN Enabled: True
cuDNN Benchmark: False
Epoch: 0
/home/yablon/mellotron/yin.py:44: RuntimeWarning: invalid value encountered in true_divide
  cmndf = df[1:] * range(1, N) / np.cumsum(df[1:]).astype(float) #scipy method
Train loss 0 18.868097 Grad Norm 6.209010 19.63s/it
Validation loss 0: 63.929592
Saving model and optimizer state at iteration 0 to /home/yablon/training/mellotron/output/checkpoint_0
Train loss 1 39.906715 Grad Norm 18.103324 3.63s/it

But when I run with multiple GPUs, life becomes difficult for me.

The first problem is an "apply_gradient_allreduce is not defined" error. OK, that's easy to fix: I just import it from distributed (sketched below).
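
For reference, the import fix is just one line at the top of train.py. A minimal sketch, assuming distributed.py here exposes apply_gradient_allreduce the same way NVIDIA's tacotron2 does (which this code appears to be derived from):

# train.py -- sketch of the fix for "apply_gradient_allreduce is not defined".
# Assumption: distributed.py in this repo provides apply_gradient_allreduce,
# as in NVIDIA's tacotron2.
from distributed import apply_gradient_allreduce

# train.py already calls it when distributed_run is enabled, roughly:
#     if hparams.distributed_run:
#         model = apply_gradient_allreduce(model)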

The next problem is that training seems to stop at "Done initializing distributed"; no further logs are printed.

Can you fix this? Thank you!

@rafaelvalle
Contributor

Pull from master and try again with FP16 enabled and disabled.

@Yablon
Author

Yablon commented Dec 5, 2019

Hi, rafaelvalle. I tried, and it still seems to be stuck here for a long time.
I changed nothing in the hparams except setting "fp16_run" and "distributed_run" to True.

FP16 Run: True
Dynamic Loss Scaling: False
Distributed Run: True
cuDNN Enabled: True
cuDNN Benchmark: False
Initializing Distributed
Done initializing distributed
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
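
For reference, the hparams change above is just these two flags; a sketch against hparams.py, with everything else left at its defaults:

# hparams.py -- the only values I changed for this run (sketch; other defaults untouched)
fp16_run=True,
distributed_run=True,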

@rafaelvalle
Contributor

rafaelvalle commented Dec 5, 2019

Try with fp16_run=False

@n5-suzuki

Hi, rafaelvalle. I got the same error.
I pulled the newest code and set distributed_run=True in hparams.py.
Then I executed the command below:
python train.py -o out_dir -l logdir -g

After a few minutes, the log below appeared and the process seemed to stop.

FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: True
cuDNN Enabled: True
cuDNN Benchmark: False
Initializing Distributed
Done initializing distributed

I checked my network status with netstat -atno.
I found "localhost:54321 LISTEN" and "localhost => localhost:54321".
But the process still seems to be stuck...
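
That port matches the default dist_url in this code lineage (I assume tcp://localhost:54321 from the tacotron2-style hparams), so the rendezvous socket is up; my guess is the single process is waiting for the other ranks to join. A rough sketch of what the setup does, under those assumptions:

# Sketch of the distributed init in this code lineage (assumed defaults:
# dist_backend="nccl", dist_url="tcp://localhost:54321" -- the port seen in netstat).
# With world_size set to the GPU count but only one train.py launched, either
# init_process_group or the first collective blocks waiting for the missing ranks,
# which looks like a hang right after "Done initializing distributed".
import torch
import torch.distributed as dist

def init_distributed(n_gpus, rank):
    torch.cuda.set_device(rank % torch.cuda.device_count())
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://localhost:54321",
        world_size=n_gpus,
        rank=rank)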

@pneumoman

@n5-suzuki: for multi-GPU training you should be launching via multiproc:
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
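
For context, the multiproc launcher in this tacotron2-style layout (as far as I can tell; the repo's multiproc.py may differ in details) just spawns one train.py process per visible GPU with its own --rank and a shared --group_name, so every rank joins the distributed group instead of rank 0 waiting alone. A simplified sketch:

# Simplified sketch of a tacotron2/mellotron-style multiproc launcher.
# Invoked as "python -m multiproc train.py <args>", so sys.argv[1:] already
# begins with "train.py" and its arguments.
import subprocess
import sys
import time

import torch

args = sys.argv[1:]
num_gpus = torch.cuda.device_count()
group = "group_{}".format(time.strftime("%Y_%m_%d-%H%M%S"))

workers = []
for rank in range(num_gpus):
    cmd = [sys.executable] + args + [
        "--n_gpus={}".format(num_gpus),
        "--rank={}".format(rank),
        "--group_name={}".format(group)]
    workers.append(subprocess.Popen(cmd))

for p in workers:
    p.wait()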

@aijianiula0601

I got the same error. It's the same problem as with tacotron-pytorch. So sad!

@Yablon
Author

Yablon commented Apr 23, 2020

I think we can learn from this project and study how it synthesizes music, rather than running it directly. So I am manually closing this for lack of activity. If anybody has a solution, you are welcome to reopen and share it below.

@Yablon Yablon closed this as completed Apr 23, 2020