
Out of memory during training #304

Closed
AlexMikhalev opened this issue May 25, 2018 · 14 comments

@AlexMikhalev

I am running out of memory on every epoch. I have merged the AN4 and TED datasets and am trying to train on the merged set, and I hit an out-of-memory error every epoch:

Epoch: [4][13/4336]	Time 0.538 (0.680)	Data 0.003 (0.003)	Loss 58.5850 (69.9133)	
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 304, in <module>
    loss.backward()
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCStorage.cu:58

Is there a way to set the maximum per-process GPU memory in PyTorch, similar to TF:

sess_config = tf.ConfigProto()
sess_config.gpu_options.per_process_gpu_memory_fraction = 0.90

Fortunately I am able to resume from checkpoints. This seems related to issue #172.
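
(For later readers: PyTorch had no equivalent at the time of this thread, but versions 1.8+ do expose a per-process cap. A minimal sketch, assuming such a recent PyTorch is installed; this is not part of this repo's training script:)

import torch

# Cap the CUDA caching allocator for this process at ~90% of device 0's memory.
# Available in PyTorch >= 1.8 only; the 0.4.x used in this thread has no equivalent.
if hasattr(torch.cuda, "set_per_process_memory_fraction"):
    torch.cuda.set_per_process_memory_fraction(0.90, device=0)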

@AlexMikhalev
Author

I am using GeForce GTX 1080.

@miguelvr
Contributor

You can run benchmark.py to check how large a batch size you can use with your setup.
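
(benchmark.py automates this; for reference, here is a generic sketch of the idea — probe decreasing batch sizes and keep the first one that survives a forward/backward pass. make_batch, model, and criterion below are placeholder names, not this repo's API:)

import torch

def probe_batch_size(make_batch, model, criterion, sizes=(64, 32, 16, 8, 4, 2, 1)):
    # Try a single forward/backward pass at each size, largest first.
    for n in sizes:
        try:
            inputs, targets = make_batch(n)
            loss = criterion(model(inputs), targets)
            loss.backward()
            model.zero_grad()
            return n
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise
            torch.cuda.empty_cache()  # drop cached blocks before trying a smaller size
    return None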

@oguzelibol

I'm having the same issue. It has nothing to do with the batch size; GPU memory keeps increasing regardless. I am using CUDA 8 and PyTorch 0.4.0 with Python 3.5. Has anyone figured out a solution to this?

@SeanNaren
Owner

@oguzelibol how have you checked this has nothing to do with batch size?

@bliunlpr

bliunlpr commented Jul 21, 2018

I'm having the same issue too. I am also using CUDA 8 and PyTorch 0.4.0 with Python 3.5. I have set batch-size=1, but GPU memory keeps increasing. I am using a TITAN Xp. Any ideas?
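
(Not confirmed to be the cause here, but a common reason for memory growing every step regardless of batch size in PyTorch 0.4-era training loops is accumulating the loss tensor itself, which keeps every iteration's autograd graph alive. Illustrative snippet only:)

# Holds on to each iteration's graph -> memory grows every step:
total_loss += loss

# Detaches to a plain Python float -> the graph can be freed:
total_loss += loss.item()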

@bliunlpr

@oguzelibol It might help to reinstall PyTorch and warp-ctc.

@ageojo

ageojo commented Jul 27, 2018

@oguzelibol Setting the Docker shm-size parameter helped (5G worked well).

@oguzelibol

@SeanNaren Yes, I have experimented with several different batch sizes and had the same issue.

Here is the solution that worked for me (and worked regardless of the batch size); hopefully this also says something about the root cause:

- Roll back to PyTorch 0.3.1.
- Also roll back pytorch audio: git checkout 0fe0305a54d84161008d84492268d279279fb3bb before python setup.py install.
- warp-ctc then no longer works either, so git checkout 02a292986c8ac67a4a6c7959aef67cfdea79c688 for warp-ctc as well.

@zhang-wy15

I'm having the same issue when training. It may be related to PyTorch and the RNN. Are there any solutions to this? @SeanNaren

@zzvara
Contributor

zzvara commented May 19, 2019

Same issue here on PyTorch 1.0.0 with the latest warp-ctc and the latest pytorch audio. CUDA goes OOM irrespective of layer dimensions or batch size.

@rajeevbaalwan

Still the same issue here on the latest pull of the repo with PyTorch 1.2 and the latest warp-ctc. CUDA goes OOM irrespective of layer dimensions or batch size.
@SeanNaren any help on this?

@SeanNaren
Owner

@rajeevbaalwan have you tried with a batch size of 1? How much VRAM (GPU memory) do you have?

@rajeevbaalwan

@SeanNaren I am using a 2080 Ti with 11 GB of memory.

@stale

stale bot commented Feb 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 27, 2020
@stale stale bot closed this as completed Mar 12, 2020