
RuntimeError: code is too big #125

Closed · annlaumets opened this issue Jan 9, 2019 · 15 comments

@annlaumets

Hi!
I'm getting a RuntimeError saying the code is too big. I'm using the LJSpeech dataset and trying to train my own model. The machine has a GTX 1080 Ti GPU, and I installed PyTorch 1.0 with CUDA 10.0. Here's the traceback:

(tacotron2-py3) annika@mavs:~/tacotron2$ python train.py --output_directory=data/lj_speech --log_directory=logs/lj_speech
FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: False
cuDNN Enabled: True
cuDNN Benchmark: False
Input dir: None
Output dir: data/lj_speech
Batch size: 64
Epoch: 0
Traceback (most recent call last):
  File "train.py", line 289, in <module>
    train(args.input_directory, args.output_directory, args.log_directory, args.checkpoint_path, args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 205, in train
    for i, batch in enumerate(train_loader):
  File "/home/annika/.conda/envs/tacotron2-py3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 468, in __next__
    return self._process_next_batch(batch)
  File "/home/annika/.conda/envs/tacotron2-py3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 489, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/home/annika/.conda/envs/tacotron2-py3/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/annika/.conda/envs/tacotron2-py3/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/annika/tacotron2/data_utils.py", line 63, in __getitem__
    return self.get_mel_text_pair(self.audiopaths_and_text[index])
  File "/home/annika/tacotron2/data_utils.py", line 34, in get_mel_text_pair
    mel = self.get_mel(audiopath)
  File "/home/annika/tacotron2/data_utils.py", line 48, in get_mel
    melspec = self.stft.mel_spectrogram(audio_norm)
  File "/home/annika/tacotron2/layers.py", line 76, in mel_spectrogram
    magnitudes, phases = self.stft_fn.transform(y)
  File "/home/annika/tacotron2/stft.py", line 95, in transform
    padding=0)
RuntimeError: code is too big

@peter05010402

How about reducing the batch size to 32?

@annlaumets
Author

I tried reducing the batch size. Even with batch size 8 the problem was still there.

@annlaumets
Author

Hi!
The problem still persists. I even tried it with batch size 1. Can somebody help me?

(tacotron2-uus) [annilau@rocket tacotron2]$ tail -f slurm-4104404.out 
Start training
FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: False
cuDNN Enabled: True
cuDNN Benchmark: False
Text cleaners: ['basic_cleaners']
Epoch: 0
Traceback (most recent call last):
  File "train.py", line 285, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 205, in train
    for i, batch in enumerate(train_loader):
  File "/gpfs/hpchome/annilau/anaconda3/envs/tacotron2-uus/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/gpfs/hpchome/annilau/anaconda3/envs/tacotron2-uus/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/gpfs/hpchome/annilau/anaconda3/envs/tacotron2-uus/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/gpfs/hpchome/annilau/anaconda3/envs/tacotron2-uus/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/gpfs/rocket/home/annilau/tacotron2-uus/tacotron2/data_utils.py", line 61, in __getitem__
    return self.get_mel_text_pair(self.audiopaths_and_text[index])
  File "/gpfs/rocket/home/annilau/tacotron2-uus/tacotron2/data_utils.py", line 34, in get_mel_text_pair
    mel = self.get_mel(audiopath)
  File "/gpfs/rocket/home/annilau/tacotron2-uus/tacotron2/data_utils.py", line 46, in get_mel
    melspec = self.stft.mel_spectrogram(audio_norm)
  File "/gpfs/rocket/home/annilau/tacotron2-uus/tacotron2/layers.py", line 76, in mel_spectrogram
    magnitudes, phases = self.stft_fn.transform(y)
  File "/gpfs/rocket/home/annilau/tacotron2-uus/tacotron2/stft.py", line 95, in transform
    padding=0)
RuntimeError: code is too big

Job finished

@Xueyuan-Zhang

Xueyuan-Zhang commented Feb 14, 2019

Same problem here with the same dataset, batch_size 24, segment_len 16000, on a V100.

@androidof2008

When I used torch-1.0.1.post2, I got the same error. I uninstalled torch-1.0.1.post2 and installed torch-1.0.0; that error went away, but I got another one:

Traceback (most recent call last):
  File "train.py", line 289, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 217, in train
    y_pred = model(x)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/NLP/waveglow_test/tacotron2/model.py", line 519, in forward
    encoder_outputs, targets, memory_lengths=input_lengths)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/NLP/waveglow_test/tacotron2/model.py", line 419, in forward
    mel_outputs, gate_outputs, alignments)
  File "/work/NLP/waveglow_test/tacotron2/model.py", line 329, in parse_decoder_outputs
    gate_outputs = torch.stack(gate_outputs).transpose(0, 1)
RuntimeError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

@androidof2008

Use PyTorch 1.0.0 and don't set batch_size to 1.
It works now.
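
For context, a likely mechanism (an assumption, not verified against the repo's code): with batch size 1, a squeeze somewhere in the decoder can drop the batch dimension, so the stacked gate outputs end up one-dimensional and transpose(0, 1) has no dimension 1 to swap. A minimal reproduction of that same error message:

    import torch

    # With batch size 1, a squeeze() turns each per-step gate output
    # from shape (1,) into a 0-d scalar tensor.
    gate_outputs = [torch.randn(1).squeeze() for _ in range(3)]
    stacked = torch.stack(gate_outputs)  # shape (3,) -- the batch dim is gone
    stacked.transpose(0, 1)              # RuntimeError: Dimension out of range
                                         # (expected to be in range of [-1, 0], but got 1)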

@rafaelvalle
Contributor

There seems to be some CPU memory issue.
You can try this hack:

        # Run the convolution on the GPU instead of the CPU, then move the
        # result back: the input and the filter weights both go to the GPU
        # with .cuda(), and the output returns to the CPU with .cpu().
        forward_transform = F.conv1d(
            input_data.cuda(),
            Variable(self.forward_basis, requires_grad=False).cuda(),
            stride=self.hop_length,
            padding=0).cpu()
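
(Per the tracebacks above, the conv1d call this hack replaces is the one in STFT.transform in stft.py, around line 95.)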

@vikrantsharma7

Tried this, but got RuntimeError: CUDA error: initialization error.
However, using num_workers=0 in both the train and valid loaders in train.py, along with the above hack, seems to work.
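
A minimal sketch of that workaround, assuming the usual torch DataLoader setup in train.py (trainset, collate_fn, and batch_size are placeholders here, not the repo's exact names):

    from torch.utils.data import DataLoader

    train_loader = DataLoader(
        trainset,
        num_workers=0,  # load batches in the main process; CUDA cannot be
                        # re-initialized in forked worker processes, which is
                        # what triggers the initialization error once the
                        # .cuda() hack runs inside the dataset
        shuffle=True,
        batch_size=batch_size,
        drop_last=True,
        collate_fn=collate_fn)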

@rakshithvasudev

Maybe this is useful: pytorch/pytorch#24174
However, @rafaelvalle's hack worked for me.

Thanks!

coinsbarboss added a commit to coinsbarboss/tacotron2 that referenced this issue Sep 12, 2019
@Artanis09

Try using the latest PyTorch 1.4.0 ("conda install pytorch").
It works very well.

@Text2-m

Text2-m commented Jul 27, 2020

> Tried this, but got RuntimeError: CUDA error: initialization error.

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

@JosefJoubert

Should this still be an issue? I'm running PyTorch 1.5 and it still happens. Reducing num_workers does reduce the amount of memory required significantly, but this is not an ideal solution.

@rafaelvalle's solution still works, but it is a bit of a hack.

Is it possible to instead calculate the stft/mel_spec using librosa? I quickly implemented it, but the two functions gave different results, and I don't currently have the time to investigate why.
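
For reference, a hedged sketch of what that librosa version might look like; the parameter values (n_fft=1024, hop_length=256, n_mels=80, fmax=8000) mirror the usual Tacotron 2 hyperparameters and are assumptions here. librosa compresses and normalizes differently from the repo's STFT class, which likely explains the mismatched results:

    import numpy as np
    import librosa

    def librosa_mel_spectrogram(audio, sr=22050):
        # magnitude (power=1.0) mel spectrogram, to match an STFT-magnitude
        # pipeline rather than librosa's default power spectrogram
        mel = librosa.feature.melspectrogram(
            y=audio, sr=sr,
            n_fft=1024, hop_length=256, win_length=1024,
            n_mels=80, fmin=0.0, fmax=8000.0,
            power=1.0)
        # Tacotron 2 style dynamic-range compression: natural log with a
        # floor, not librosa's power_to_db (10*log10 with dB clipping)
        return np.log(np.clip(mel, a_min=1e-5, a_max=None))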

@Mingrg

Mingrg commented Sep 23, 2020

> Tried this, but got RuntimeError: CUDA error: initialization error.
> RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

I got that too. Do you know how to resolve this?

@JosefJoubert

> Tried this, but got RuntimeError: CUDA error: initialization error.
> RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
>
> I got that too. Do you know how to resolve this?

What the error means is that the data you're passing in to the function resides on the GPU, while the model itself is still on the CPU.

If you're getting the error around the code @rafaelvalle shared, then most likely you forgot the second .cuda() call. Note there are three things happening in that snippet (see the sketch after this list):

  1. Move the input data to the GPU
  2. Move the filter weights to the GPU
  3. When done, move the results back to the CPU
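
A minimal illustration of the three steps, assuming a conv1d call shaped like the one in stft.py (the tensor sizes here are made up):

    import torch
    import torch.nn.functional as F

    audio = torch.randn(1, 1, 16000)    # fake 1-second clip, on the CPU
    basis = torch.randn(1026, 1, 1024)  # fake STFT filter bank, on the CPU

    out = F.conv1d(
        audio.cuda(),                   # 1. input data to the GPU
        basis.cuda(),                   # 2. filter weights to the GPU; skipping
                                        #    this .cuda() raises the FloatTensor /
                                        #    cuda.FloatTensor mismatch above
        stride=256,
        padding=0).cpu()                # 3. result back to the CPU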

@Mingrg

Mingrg commented Sep 23, 2020

> What the error means is that the data you're passing in to the function resides on the GPU, while the model itself is still on the CPU. […]

Oh, I see. Thank you so much!!
