
RuntimeError: code is too big #125

Closed · annlaumets opened this issue Jan 9, 2019 · 15 comments

@annlaumets

Hi!
I'm getting a RuntimeError saying the code is too big. I'm using the LJSpeech dataset and trying to train my own model. The machine has a GTX 1080 Ti GPU, and I installed PyTorch 1.0 with CUDA 10.0. Here's the traceback:

(tacotron2-py3) annika@mavs:~/tacotron2$ python train.py --output_directory=data/lj_speech --log_directory=logs/lj_speech
FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: False
cuDNN Enabled: True
cuDNN Benchmark: False
Input dir: None
Output dir: data/lj_speech
Batch size: 64
Epoch: 0
Traceback (most recent call last):
  File "train.py", line 289, in <module>
    train(args.input_directory, args.output_directory, args.log_directory, args.checkpoint_path, args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 205, in train
    for i, batch in enumerate(train_loader):
  File "/home/annika/.conda/envs/tacotron2-py3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 468, in __next__
    return self._process_next_batch(batch)
  File "/home/annika/.conda/envs/tacotron2-py3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 489, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/home/annika/.conda/envs/tacotron2-py3/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/annika/.conda/envs/tacotron2-py3/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/annika/tacotron2/data_utils.py", line 63, in __getitem__
    return self.get_mel_text_pair(self.audiopaths_and_text[index])
  File "/home/annika/tacotron2/data_utils.py", line 34, in get_mel_text_pair
    mel = self.get_mel(audiopath)
  File "/home/annika/tacotron2/data_utils.py", line 48, in get_mel
    melspec = self.stft.mel_spectrogram(audio_norm)
  File "/home/annika/tacotron2/layers.py", line 76, in mel_spectrogram
    magnitudes, phases = self.stft_fn.transform(y)
  File "/home/annika/tacotron2/stft.py", line 95, in transform
    padding=0)
RuntimeError: code is too big

@peter05010402

How about reducing the batch size to 32?

@annlaumets
Author

I tried reducing the batch size. Even with batch size 8 the problem was still there.

@annlaumets
Author

Hi!
The problem still persists. I even tried it with batch size 1. Can somebody help me?

(tacotron2-uus) [annilau@rocket tacotron2]$ tail -f slurm-4104404.out 
Start training
FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: False
cuDNN Enabled: True
cuDNN Benchmark: False
Text cleaners: ['basic_cleaners']
Epoch: 0
Traceback (most recent call last):
  File "train.py", line 285, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 205, in train
    for i, batch in enumerate(train_loader):
  File "/gpfs/hpchome/annilau/anaconda3/envs/tacotron2-uus/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/gpfs/hpchome/annilau/anaconda3/envs/tacotron2-uus/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/gpfs/hpchome/annilau/anaconda3/envs/tacotron2-uus/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/gpfs/hpchome/annilau/anaconda3/envs/tacotron2-uus/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/gpfs/rocket/home/annilau/tacotron2-uus/tacotron2/data_utils.py", line 61, in __getitem__
    return self.get_mel_text_pair(self.audiopaths_and_text[index])
  File "/gpfs/rocket/home/annilau/tacotron2-uus/tacotron2/data_utils.py", line 34, in get_mel_text_pair
    mel = self.get_mel(audiopath)
  File "/gpfs/rocket/home/annilau/tacotron2-uus/tacotron2/data_utils.py", line 46, in get_mel
    melspec = self.stft.mel_spectrogram(audio_norm)
  File "/gpfs/rocket/home/annilau/tacotron2-uus/tacotron2/layers.py", line 76, in mel_spectrogram
    magnitudes, phases = self.stft_fn.transform(y)
  File "/gpfs/rocket/home/annilau/tacotron2-uus/tacotron2/stft.py", line 95, in transform
    padding=0)
RuntimeError: code is too big

Job finished

@Xueyuan-Zhang

Xueyuan-Zhang commented Feb 14, 2019

Same problem here with the same dataset, batch_size 24, segment_len 16000, on a V100.

@androidof2008

When I used torch-1.0.1.post2, I got the same error. I uninstalled torch-1.0.1.post2 and installed torch-1.0.0; that error went away, but I got another one:

Traceback (most recent call last):
  File "train.py", line 289, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 217, in train
    y_pred = model(x)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/NLP/waveglow_test/tacotron2/model.py", line 519, in forward
    encoder_outputs, targets, memory_lengths=input_lengths)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/NLP/waveglow_test/tacotron2/model.py", line 419, in forward
    mel_outputs, gate_outputs, alignments)
  File "/work/NLP/waveglow_test/tacotron2/model.py", line 329, in parse_decoder_outputs
    gate_outputs = torch.stack(gate_outputs).transpose(0, 1)
RuntimeError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

@androidof2008

Use PyTorch 1.0.0 and don't set batch_size to 1.
It works now.
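
For context, a likely mechanism (an assumption, not verified against the repo's code): with batch size 1, a squeeze somewhere in the decoder can drop the batch dimension, so the stacked gate outputs end up one-dimensional and transpose(0, 1) has no dimension 1 to swap. A minimal reproduction of that same error message:

    import torch

    # With batch size 1, a squeeze() turns each per-step gate output
    # from shape (1,) into a 0-d scalar tensor.
    gate_outputs = [torch.randn(1).squeeze() for _ in range(3)]
    stacked = torch.stack(gate_outputs)  # shape (3,) -- the batch dim is gone
    stacked.transpose(0, 1)              # RuntimeError: Dimension out of range
                                         # (expected to be in range of [-1, 0], but got 1)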

@rafaelvalle
Contributor

There seems to be some CPU memory issue.
You can try this hack:

        # Run the convolution on the GPU instead of the CPU, then move the
        # result back: the input and the filter weights both go to the GPU
        # with .cuda(), and the output returns to the CPU with .cpu().
        forward_transform = F.conv1d(
            input_data.cuda(),
            Variable(self.forward_basis, requires_grad=False).cuda(),
            stride=self.hop_length,
            padding=0).cpu()
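
(Per the tracebacks above, the conv1d call this hack replaces is the one in STFT.transform in stft.py, around line 95.)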

@vikrantsharma7

Tried this, but got RuntimeError: CUDA error: initialization error.
However, using num_workers=0 in both the train and valid loaders in train.py, along with the above hack, seems to work.
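
A minimal sketch of that workaround, assuming the usual torch DataLoader setup in train.py (trainset, collate_fn, and batch_size are placeholders here, not the repo's exact names):

    from torch.utils.data import DataLoader

    train_loader = DataLoader(
        trainset,
        num_workers=0,  # load batches in the main process; CUDA cannot be
                        # re-initialized in forked worker processes, which is
                        # what triggers the initialization error once the
                        # .cuda() hack runs inside the dataset
        shuffle=True,
        batch_size=batch_size,
        drop_last=True,
        collate_fn=collate_fn)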

@rakshithvasudev

Maybe this is useful: pytorch/pytorch#24174
However, @rafaelvalle's hack worked for me.

Thanks!

coinsbarboss added a commit to coinsbarboss/tacotron2 that referenced this issue Sep 12, 2019
@Artanis09

Try using the latest PyTorch 1.4.0 ("conda install pytorch").
It works very well.

@Text2-m

Text2-m commented Jul 27, 2020

> Tried this, but got RuntimeError: CUDA error: initialization error.

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

@JosefJoubert

Should this still be an issue? I'm running PyTorch 1.5 and it still happens. Reducing num_workers does reduce the amount of memory required significantly, but this is not an ideal solution.

@rafaelvalle's solution still works, but it is a bit of a hack.

Is it possible to instead calculate the stft/mel_spec using librosa? I quickly implemented it, but the two functions gave different results, and I don't currently have the time to investigate why.
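
For reference, a hedged sketch of what that librosa version might look like; the parameter values (n_fft=1024, hop_length=256, n_mels=80, fmax=8000) mirror the usual Tacotron 2 hyperparameters and are assumptions here. librosa compresses and normalizes differently from the repo's STFT class, which likely explains the mismatched results:

    import numpy as np
    import librosa

    def librosa_mel_spectrogram(audio, sr=22050):
        # magnitude (power=1.0) mel spectrogram, to match an STFT-magnitude
        # pipeline rather than librosa's default power spectrogram
        mel = librosa.feature.melspectrogram(
            y=audio, sr=sr,
            n_fft=1024, hop_length=256, win_length=1024,
            n_mels=80, fmin=0.0, fmax=8000.0,
            power=1.0)
        # Tacotron 2 style dynamic-range compression: natural log with a
        # floor, not librosa's power_to_db (10*log10 with dB clipping)
        return np.log(np.clip(mel, a_min=1e-5, a_max=None))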

@Mingrg

Mingrg commented Sep 23, 2020

> Tried this, but got RuntimeError: CUDA error: initialization error.
> RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

I got that too. Do you know how to resolve this?

@JosefJoubert

> Tried this, but got RuntimeError: CUDA error: initialization error.
> RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
>
> I got that too. Do you know how to resolve this?

What the error means is that the data you're passing in to the function resides on the GPU, while the model itself is still on the CPU.

If you're getting the error around the code @rafaelvalle shared, then most likely you forgot the second .cuda() call. Note there are three things happening in that snippet (see the sketch after this list):

  1. Move the input data to the GPU
  2. Move the filter weights to the GPU
  3. When done, move the results back to the CPU
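
A minimal illustration of the three steps, assuming a conv1d call shaped like the one in stft.py (the tensor sizes here are made up):

    import torch
    import torch.nn.functional as F

    audio = torch.randn(1, 1, 16000)    # fake 1-second clip, on the CPU
    basis = torch.randn(1026, 1, 1024)  # fake STFT filter bank, on the CPU

    out = F.conv1d(
        audio.cuda(),                   # 1. input data to the GPU
        basis.cuda(),                   # 2. filter weights to the GPU; skipping
                                        #    this .cuda() raises the FloatTensor /
                                        #    cuda.FloatTensor mismatch above
        stride=256,
        padding=0).cpu()                # 3. result back to the CPU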

@Mingrg

Mingrg commented Sep 23, 2020

> What the error means is that the data you're passing in to the function resides on the GPU, while the model itself is still on the CPU. […]

Oh, I see. Thank you so much!!
