
Out of memory during training #304

Closed
AlexMikhalev opened this issue May 25, 2018 · 14 comments

@AlexMikhalev

I am running out of memory on every epoch. I have merged the AN4 and TED datasets and am trying to train on the merged set, and I hit an out-of-memory error every epoch:

Epoch: [4][13/4336]	Time 0.538 (0.680)	Data 0.003 (0.003)	Loss 58.5850 (69.9133)	
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 304, in <module>
    loss.backward()
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCStorage.cu:58

Is there a way to set the maximum per-process GPU memory in PyTorch, similar to TF:

sess_config = tf.ConfigProto()
sess_config.gpu_options.per_process_gpu_memory_fraction = 0.90

Fortunately I am able to resume from checkpoints. This seems related to issue #172.
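
(For later readers: PyTorch had no equivalent at the time of this thread, but versions 1.8+ do expose a per-process cap. A minimal sketch, assuming such a recent PyTorch is installed; this is not part of this repo's training script:)

import torch

# Cap the CUDA caching allocator for this process at ~90% of device 0's memory.
# Available in PyTorch >= 1.8 only; the 0.4.x used in this thread has no equivalent.
if hasattr(torch.cuda, "set_per_process_memory_fraction"):
    torch.cuda.set_per_process_memory_fraction(0.90, device=0)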

@AlexMikhalev
Author

I am using GeForce GTX 1080.

@miguelvr
Contributor

You can run benchmark.py to check how large a batch size you can use with your setup.
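
(benchmark.py automates this; for reference, here is a generic sketch of the idea — probe decreasing batch sizes and keep the first one that survives a forward/backward pass. make_batch, model, and criterion below are placeholder names, not this repo's API:)

import torch

def probe_batch_size(make_batch, model, criterion, sizes=(64, 32, 16, 8, 4, 2, 1)):
    # Try a single forward/backward pass at each size, largest first.
    for n in sizes:
        try:
            inputs, targets = make_batch(n)
            loss = criterion(model(inputs), targets)
            loss.backward()
            model.zero_grad()
            return n
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise
            torch.cuda.empty_cache()  # drop cached blocks before trying a smaller size
    return None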

@oguzelibol

I'm having the same issue. It has nothing to do with the batch size; GPU memory keeps increasing regardless. I am using CUDA 8 and PyTorch 0.4.0 with Python 3.5. Has anyone figured out a solution to this?

@SeanNaren
Owner

@oguzelibol how have you checked this has nothing to do with batch size?

@bliunlpr

bliunlpr commented Jul 21, 2018

I'm having the same issue too. I am also using CUDA 8 and PyTorch 0.4.0 with Python 3.5. I have set batch-size=1, but GPU memory keeps increasing. I am using a TITAN Xp. Any ideas?
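
(Not confirmed to be the cause here, but a common reason for memory growing every step regardless of batch size in PyTorch 0.4-era training loops is accumulating the loss tensor itself, which keeps every iteration's autograd graph alive. Illustrative snippet only:)

# Holds on to each iteration's graph -> memory grows every step:
total_loss += loss

# Detaches to a plain Python float -> the graph can be freed:
total_loss += loss.item()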

@bliunlpr

@oguzelibol It might help to reinstall PyTorch and warp-ctc.

@ageojo

ageojo commented Jul 27, 2018

@oguzelibol Setting the Docker shm-size parameter helped (5G worked well).

@oguzelibol

@SeanNaren Yes, I have experimented with several different batch sizes and had the same issue.

Here is the solution that worked for me (and worked regardless of the batch size); hopefully this also says something about the root cause:

- Roll back to PyTorch 0.3.1.
- Also roll back pytorch audio: git checkout 0fe0305a54d84161008d84492268d279279fb3bb before python setup.py install.
- warp-ctc then no longer works either, so git checkout 02a292986c8ac67a4a6c7959aef67cfdea79c688 for warp-ctc as well.

@zhang-wy15

I'm having the same issue when training. It may be related to PyTorch and the RNN. Are there any solutions to this? @SeanNaren

@zzvara
Contributor

zzvara commented May 19, 2019

Same issue here on PyTorch 1.0.0 with the latest warp-ctc and the latest pytorch audio. CUDA goes OOM irrespective of layer dimensions or batch size.

@rajeevbaalwan

Still the same issue here on the latest pull of the repo with PyTorch 1.2 and the latest warp-ctc. CUDA goes OOM irrespective of layer dimensions or batch size.
@SeanNaren any help on this?

@SeanNaren
Owner

@rajeevbaalwan have you tried with a batch size of 1? How much VRAM (GPU memory) do you have?

@rajeevbaalwan

@SeanNaren I am using a 2080 Ti with 11 GB of memory.

@stale

stale bot commented Feb 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 27, 2020
@stale stale bot closed this as completed Mar 12, 2020