
Where to download the pretrained model? #31

Closed
OptimusPrimeCao opened this issue Jun 8, 2018 · 13 comments

@OptimusPrimeCao

Is there a way to get checkpoint_15500 in inference file?

@rafaelvalle
Contributor

rafaelvalle commented Jun 8, 2018

Checkpoint files are saved to the output_directory specified when running train.py. The example command in our repo saves them to outdir.
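For example, with the invocation used elsewhere in this thread, checkpoints land in outdir as checkpoint_<iteration> files (the save interval depends on your hparams):

python train.py --output_directory=outdir --log_directory=logdir
ls outdir   # checkpoint_<iteration> files, e.g. checkpoint_15500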

@OptimusPrimeCao
Author

@rafaelvalle
When I run train.py with the default hparams on an 8GB 1080 GPU, I get the error below. I changed the batch size to 24, but the error still occurs:

src/tcmalloc.cc:278] Attempt to free invalid pointer 0x100000009
Traceback (most recent call last):
  File "train.py", line 291, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 209, in train
    for i, batch in enumerate(train_loader):
  File "/data00/home/caoyuetian/share/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 330, in __next__
    idx, batch = self._get_batch()
  File "/data00/home/caoyuetian/share/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 309, in _get_batch
    return self.data_queue.get()
  File "/data00/home/caoyuetian/share/anaconda3/lib/python3.6/multiprocessing/queues.py", line 335, in get
    res = self._reader.recv_bytes()
  File "/data00/home/caoyuetian/share/anaconda3/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/data00/home/caoyuetian/share/anaconda3/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/data00/home/caoyuetian/share/anaconda3/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/data00/home/caoyuetian/share/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 227, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 125263) is killed by signal: Aborted.
Segmentation fault (core dumped)

@rafaelvalle
Contributor

This is probably related to your CPU not being able to keep up with data loading.
Try reducing the number of DataLoader workers or using a smaller batch size.
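If it helps, the two knobs involved are the DataLoader's num_workers and batch_size. A minimal, self-contained sketch of reducing both, using a dummy dataset purely for illustration in place of the repo's actual (text, mel) loader:

import torch
from torch.utils.data import DataLoader, Dataset

class DummyMelDataset(Dataset):
    # Stand-in dataset, only to illustrate the DataLoader settings.
    def __len__(self):
        return 64
    def __getitem__(self, idx):
        return torch.randn(80, 100)   # fake 80-band mel spectrogram

loader = DataLoader(
    DummyMelDataset(),
    num_workers=0,    # 0 = load in the main process, so no worker can be killed
    batch_size=16,    # smaller batches lower peak host and GPU memory
    drop_last=True,
)

for batch in loader:
    pass              # a training step would go here

In the actual repo the DataLoader is built inside train.py, so that is where num_workers would need to change; batch_size is presumably set through the hparams, as done elsewhere in this thread.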

@MXGray

MXGray commented Jun 26, 2018

@rafaelvalle
Can you share your pretrained Tacotron2 model and hparams that generated the sample audio?

@rafaelvalle
Contributor

@OptimusPrimeCao The model uses approximately 300MB per sample. Try reducing your batch size to 16.
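Rough arithmetic with that figure: 24 samples × ~300MB ≈ 7.2GB, which leaves almost no headroom on an 8GB 1080 once model weights and the CUDA context are counted, while 16 × ~300MB ≈ 4.8GB fits much more comfortably.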

@rafaelvalle
Contributor

@MXGray We can share the hparams.
Please do not post the same message on multiple issues. I deleted your message from the Audio Examples issue.

@gsoul

gsoul commented Jun 26, 2018

Please share the hparams.

@vijaysumaravi

Is there a way I can continue training my model from a particular point?

To be specific, my training crashed at checkpoint_32000 because of a memory issue, which I have since fixed. Can I resume training from that checkpoint, or do I have to start again from scratch? If resuming is possible, how do I do it?

I wasn't sure whether this warranted a new issue, hence posting my comment here.

Any help is appreciated. Thanks!

@vijaysumaravi

Never mind, figured it out.

python train.py --output_directory=outdir --log_directory=logdir --checkpoint_path='outdir/checkpoint_32500'
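For reference, --checkpoint_path resumes training by loading the saved state back into the model and optimizer. A generic PyTorch sketch of the idea; the key names below are assumptions about the checkpoint format, so check the repo's train.py for the actual loading code:

import torch

def resume_from_checkpoint(model, optimizer, checkpoint_path):
    # Load onto CPU first; the caller can move the model back to GPU afterwards.
    checkpoint = torch.load(checkpoint_path, map_location='cpu')
    model.load_state_dict(checkpoint['state_dict'])      # assumed key name
    optimizer.load_state_dict(checkpoint['optimizer'])   # assumed key name
    return checkpoint.get('iteration', 0)                # resume step, if stored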

@rafaelvalle
Contributor

The pre-trained model has been made available on our README page.

@beknazar

beknazar commented Mar 2, 2019

> Never mind, figured it out.
>
> python train.py --output_directory=outdir --log_directory=logdir --checkpoint_path='outdir/checkpoint_32500'

@vijaysumaravi Mine also stopped after 32.5 epochs; did you figure out the reason?

@vijaysumaravi

Reducing my batch size helped. I was training it on a single GPU.

@ErfolgreichCharismatisch

Tutorial: Training on GPU with Colab, Inference with CPU on Server here.
