Cannot train with multi GPUs #13

Closed
Yablon opened this issue Dec 4, 2019 · 7 comments

@Yablon

Yablon commented Dec 4, 2019

I cloned the repository to my local server and started training on my own dataset.

I can run with one GPU, and the logs are as follows:

FP16 Run: False
Dynamic Loss Scaling: False
Distributed Run: False
cuDNN Enabled: True
cuDNN Benchmark: False
Epoch: 0
/home/yablon/mellotron/yin.py:44: RuntimeWarning: invalid value encountered in true_divide
  cmndf = df[1:] * range(1, N) / np.cumsum(df[1:]).astype(float) #scipy method
Train loss 0 18.868097 Grad Norm 6.209010 19.63s/it
Validation loss 0: 63.929592
Saving model and optimizer state at iteration 0 to /home/yablon/training/mellotron/output/checkpoint_0
Train loss 1 39.906715 Grad Norm 18.103324 3.63s/it

But when I run with multiple GPUs, life becomes difficult for me.

The first problem is an "apply_gradient_allreduce is not defined" error. OK, that's easy to fix: I just import it from distributed (sketched below).
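
For reference, the import fix is just one line at the top of train.py. A minimal sketch, assuming distributed.py here exposes apply_gradient_allreduce the same way NVIDIA's tacotron2 does (which this code appears to be derived from):

# train.py -- sketch of the fix for "apply_gradient_allreduce is not defined".
# Assumption: distributed.py in this repo provides apply_gradient_allreduce,
# as in NVIDIA's tacotron2.
from distributed import apply_gradient_allreduce

# train.py already calls it when distributed_run is enabled, roughly:
#     if hparams.distributed_run:
#         model = apply_gradient_allreduce(model)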

The next problem is that training seems to stop at "Done initializing distributed"; no further logs are printed.

Can you fix this? Thank you!

@rafaelvalle
Contributor

Pull from master and try again with FP16 enabled and disabled.

@Yablon
Author

Yablon commented Dec 5, 2019

Hi, rafaelvalle. I tried, and it still seems to be stuck here for a long time.
I changed nothing in the hparams except setting "fp16_run" and "distributed_run" to True.

FP16 Run: True
Dynamic Loss Scaling: False
Distributed Run: True
cuDNN Enabled: True
cuDNN Benchmark: False
Initializing Distributed
Done initializing distributed
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
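
For reference, the hparams change above is just these two flags; a sketch against hparams.py, with everything else left at its defaults:

# hparams.py -- the only values I changed for this run (sketch; other defaults untouched)
fp16_run=True,
distributed_run=True,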

@rafaelvalle
Contributor

rafaelvalle commented Dec 5, 2019

Try with fp16_run=False

@n5-suzuki

Hi, rafaelvalle. I got the same error.
I pulled the newest code and set distributed_run=True in hparams.py.
Then I executed the command below:
python train.py -o out_dir -l logdir -g

After a few minutes, the log below appeared and the process seemed to stop.

FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: True
cuDNN Enabled: True
cuDNN Benchmark: False
Initializing Distributed
Done initializing distributed

I checked my network status with netstat -atno.
I found "localhost:54321 LISTEN" and "localhost => localhost:54321".
But the process still seems to be stuck...
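
That port matches the default dist_url in this code lineage (I assume tcp://localhost:54321 from the tacotron2-style hparams), so the rendezvous socket is up; my guess is the single process is waiting for the other ranks to join. A rough sketch of what the setup does, under those assumptions:

# Sketch of the distributed init in this code lineage (assumed defaults:
# dist_backend="nccl", dist_url="tcp://localhost:54321" -- the port seen in netstat).
# With world_size set to the GPU count but only one train.py launched, either
# init_process_group or the first collective blocks waiting for the missing ranks,
# which looks like a hang right after "Done initializing distributed".
import torch
import torch.distributed as dist

def init_distributed(n_gpus, rank):
    torch.cuda.set_device(rank % torch.cuda.device_count())
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://localhost:54321",
        world_size=n_gpus,
        rank=rank)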

@pneumoman

@n5-suzuki: for multi-GPU training you should be launching via multiproc:
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
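
For context, the multiproc launcher in this tacotron2-style layout (as far as I can tell; the repo's multiproc.py may differ in details) just spawns one train.py process per visible GPU with its own --rank and a shared --group_name, so every rank joins the distributed group instead of rank 0 waiting alone. A simplified sketch:

# Simplified sketch of a tacotron2/mellotron-style multiproc launcher.
# Invoked as "python -m multiproc train.py <args>", so sys.argv[1:] already
# begins with "train.py" and its arguments.
import subprocess
import sys
import time

import torch

args = sys.argv[1:]
num_gpus = torch.cuda.device_count()
group = "group_{}".format(time.strftime("%Y_%m_%d-%H%M%S"))

workers = []
for rank in range(num_gpus):
    cmd = [sys.executable] + args + [
        "--n_gpus={}".format(num_gpus),
        "--rank={}".format(rank),
        "--group_name={}".format(group)]
    workers.append(subprocess.Popen(cmd))

for p in workers:
    p.wait()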

@aijianiula0601

I got the same error. It's the same problem as with tacotron-pytorch. So sad!

@Yablon
Author

Yablon commented Apr 23, 2020

I think we can learn from this project and study how it synthesizes music, rather than running it directly. So I am manually closing this for lack of activity. If anybody has a solution, you are welcome to reopen and share it below.

@Yablon Yablon closed this as completed Apr 23, 2020