Using train_sr for second stage results out of memory #73

Closed
Rose-sys opened this issue Jan 24, 2021 · 5 comments

@Rose-sys

After trying second-stage training, I get an out-of-memory error:

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 536,870,912 bytes (allocated so far: 10,940,589,568 bytes).
localuser@localuser-All-Series:~/vc/become-yukarin$ python3 train_sr.py config_sr.json ../2ndstage/
/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/connection/convolution_2d.py:228: PerformanceWarning: The best algo of conv fwd might not be selected due to lack of workspace size (8388608)
  auto_tune=auto_tune, tensor_core=tensor_core)
predictor/loss
Exception in main training loop: Out of memory allocating 536,870,912 bytes (allocated so far: 10,940,589,568 bytes).
Traceback (most recent call last):
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/trainer.py", line 315, in run
    update()
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
    self.update_core()
  File "/home/localuser/vc/become-yukarin/become_yukarin/updater/sr_updater.py", line 79, in update_core
    opt_predictor.update(loss.get, 'predictor')
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/optimizer.py", line 685, in update
    loss.backward(loss_scale=self._loss_scale)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 981, in backward
    self._backward_main(retain_grad, loss_scale)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 1061, in _backward_main
    func, target_input_indexes, out_grad, in_grad)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 179, in backprop_step
    _reduce(gx)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 10, in _reduce
    grad_list[:] = [chainer.functions.add(*grad_list)]
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 229, in add
    return Add().apply((lhs, rhs))[0]
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/function_node.py", line 263, in apply
    outputs = self.forward(in_data)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 156, in forward
    y = utils.force_array(x[0] + x[1])
  File "cupy/core/core.pyx", line 968, in cupy.core.core.ndarray.__add__
  File "cupy/core/_kernel.pyx", line 930, in cupy.core._kernel.ufunc.__call__
  File "cupy/core/_kernel.pyx", line 397, in cupy.core._kernel._get_out_args
  File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 1243, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1264, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1042, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1062, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 784, in cupy.cuda.memory._try_malloc
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "train_sr.py", line 83, in <module>
    trainer.run()
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/trainer.py", line 329, in run
    six.reraise(*sys.exc_info())
  File "/home/localuser/.local/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/trainer.py", line 315, in run
    update()
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
    self.update_core()
  File "/home/localuser/vc/become-yukarin/become_yukarin/updater/sr_updater.py", line 79, in update_core
    opt_predictor.update(loss.get, 'predictor')
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/optimizer.py", line 685, in update
    loss.backward(loss_scale=self._loss_scale)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 981, in backward
    self._backward_main(retain_grad, loss_scale)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 1061, in _backward_main
    func, target_input_indexes, out_grad, in_grad)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 179, in backprop_step
    _reduce(gx)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 10, in _reduce
    grad_list[:] = [chainer.functions.add(*grad_list)]
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 229, in add
    return Add().apply((lhs, rhs))[0]
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/function_node.py", line 263, in apply
    outputs = self.forward(in_data)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 156, in forward
    y = utils.force_array(x[0] + x[1])
  File "cupy/core/core.pyx", line 968, in cupy.core.core.ndarray.__add__
  File "cupy/core/_kernel.pyx", line 930, in cupy.core._kernel.ufunc.__call__
  File "cupy/core/_kernel.pyx", line 397, in cupy.core._kernel._get_out_args
  File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 1243, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1264, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1042, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1062, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 784, in cupy.cuda.memory._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 536,870,912 bytes (allocated so far: 10,940,589,568 bytes).

It seems to allocate too much, as hardly anything else is using the video card's memory:

$ nvidia-smi
Sun Jan 24 16:27:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:05:00.0  On |                  N/A |
|  0%   39C    P8    11W / 275W |      1MiB / 11177MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
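
For reference, a minimal sketch that queries free device memory through CuPy's CUDA runtime bindings, the same allocation path the trainer uses, to cross-check what nvidia-smi reports:

import cupy

# memGetInfo() returns (free_bytes, total_bytes) for the current device.
free_bytes, total_bytes = cupy.cuda.runtime.memGetInfo()
print(f"free: {free_bytes / 1024**2:.0f} MiB / total: {total_bytes / 1024**2:.0f} MiB")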
@Rose-sys (Author)

In case you need more information:

Environment:
$ python3 -V
Python 3.7.5

$ pip3 list
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
absl-py (0.11.0)
alembic (1.5.2)
appdirs (1.4.4)
asn1crypto (0.24.0)
astor (0.8.1)
audioread (2.1.9)
become-yukarin (1.0.0)
Brlapi (0.6.6)
cached-property (1.5.2)
certifi (2020.12.5)
cffi (1.14.4)
chainer (5.4.0)
chainerui (0.11.0)
chardet (4.0.0)
click (7.1.2)
cryptography (2.1.4)
cupshelpers (1.0)
cupy-cuda110 (8.3.0)
cycler (0.10.0)
Cython (0.29.21)
decorator (4.4.2)
defer (1.0.6)
distro-info (0.18ubuntu0.18.04.1)
evdev (1.4.0)
fastdtw (0.3.4)
fastrlock (0.5)
filelock (3.0.12)
Flask (1.1.2)
gast (0.4.0)
gevent (21.1.2)
google-pasta (0.2.0)
greenlet (1.0.0)
grpcio (1.35.0)
h5py (3.1.0)
httplib2 (0.9.2)
idna (2.10)
imageio (2.9.0)
imageio-ffmpeg (0.4.3)
importlib-metadata (3.4.0)
itsdangerous (1.1.0)
Jinja2 (2.11.2)
joblib (1.0.0)
Keras-Applications (1.0.8)
Keras-Preprocessing (1.1.2)
keyring (10.6.0)
keyrings.alt (3.0)
kiwisolver (1.3.1)
launchpadlib (1.10.6)
lazr.restfulclient (0.13.5)
lazr.uri (1.0.3)
librosa (0.8.0)
llvmlite (0.35.0)
louis (3.5.0)
macaroonbakery (1.1.3)
Mako (1.1.4)
Markdown (3.3.3)
MarkupSafe (1.1.1)
matplotlib (3.3.3)
moviepy (1.0.3)
msgpack (1.0.2)
netifaces (0.10.4)
numba (0.52.0)
numpy (1.19.5)
oauth (1.0.1)
olefile (0.45.1)
packaging (20.8)
pexpect (4.2.1)
Pillow (8.1.0)
pip (9.0.1)
pooch (1.3.0)
proglog (0.1.9)
protobuf (3.14.0)
psutil (5.8.0)
PyAudio (0.2.11)
pycairo (1.16.2)
pycparser (2.20)
pycrypto (2.6.1)
pycups (1.9.73)
Pygments (2.2.0)
pygobject (3.26.1)
pymacaroons (0.13.0)
PyNaCl (1.1.2)
pynput (1.7.2)
pyparsing (2.4.7)
pyRFC3339 (1.0)
pysptk (0.1.18)
python-apt (1.6.5+ubuntu0.5)
python-dateutil (2.8.1)
python-debian (0.1.32)
python-editor (1.0.4)
python-xlib (0.29)
pytz (2018.3)
pyworld (0.2.12)
pyxdg (0.25)
PyYAML (5.4.1)
reportlab (3.4.0)
requests (2.25.1)
requests-unixsocket (0.1.5)
resampy (0.2.2)
scikit-learn (0.24.1)
scipy (1.6.0)
screen-resolution-extra (0.0.0)
SecretStorage (2.3.1)
setuptools (52.0.0)
simplejson (3.13.2)
six (1.15.0)
SoundFile (0.10.3.post1)
SQLAlchemy (1.3.22)
ssh-import-id (5.7)
structlog (20.2.0)
systemd-python (234)
tensorboard (1.14.0)
tensorboard-chainer (0.5.3)
tensorflow (1.14.0)
tensorflow-estimator (1.14.0)
termcolor (1.1.0)
threadpoolctl (2.1.0)
tqdm (4.56.0)
typing (3.7.4.3)
typing-extensions (3.7.4.3)
ufw (0.36)
unattended-upgrades (0.1)
urllib3 (1.26.2)
usb-creator (0.3.3)
wadllib (1.3.2)
Werkzeug (1.0.1)
wheel (0.36.2)
world4py (0.1.1)
wrapt (1.12.1)
xkit (0.0.0)
yukarin (0.1.0)
zipp (3.4.0)
zope.event (4.5.0)
zope.interface (5.2.0)

@Rose-sys (Author)

I am wondering which CUDA, driver, and CuPy versions you use, so I can rebuild the environment. Since the recommendations mention a version below 7.0.0, I installed CUDA 10.0 with cupy-cuda100==5.4.0 (I do not see a CuPy version 6), yet it still gave the out-of-memory error, although training somehow took longer. I am on Ubuntu 18.04, and I could not install drivers older than the 450.xx series.

@Rose-sys (Author)

I want to add that I had 12 .npy files after generating the spectrogram pairs, 55 MB in total (around 4 MB each on average). I removed a few .npy files, reducing the set to 7, and it no longer crashes.

Is it supposed to be like that? It looks a bit odd, as it now uses 7.4 GB with 7 files. I might be able to squeeze an 8th file in, but isn't it better the more data you have?

It is probably something to do with CuPy/Chainer; looking further into this, their pages say to process in batches. So I am not sure whether that is possible with your code.

I will await your reply to see what you recommend (I could not use too few files, as it would complain about dividing the epoch by zero, but more than 8 just eats all 11 GB of my video card and crashes).
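
For reference, a minimal sketch of one CuPy-side knob, assuming CuPy v8 or newer (such as the cupy-cuda110 8.3.0 listed above): the default memory pool can be capped so allocations fail early at a chosen bound instead of growing to fill the card. This only bounds the allocator; it does not reduce what the model itself needs per batch.

import cupy

# Cap CuPy's default memory pool at 8 GiB (the figure is an arbitrary
# example). MemoryPool.set_limit is available from CuPy v8 onward.
pool = cupy.get_default_memory_pool()
pool.set_limit(size=8 * 1024**3)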

@Hiroshiba (Owner)

Try lowering the batch size here.

"batchsize": 8,

I will close this issue for now, but please reopen it if you need anything else.
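
For anyone reconstructing the fix, a minimal sketch of lowering that value programmatically, assuming the key lives under a "train" section of config_sr.json (editing the file by hand works just as well; adjust the key path to your actual config layout):

import json

# Load the second-stage config, halve the batch size, and write it back.
# NOTE: the "train" key path is an assumption about the config layout.
with open("config_sr.json") as f:
    config = json.load(f)

config["train"]["batchsize"] = 4  # was 8; smaller batches need less GPU memory

with open("config_sr.json", "w") as f:
    json.dump(config, f, indent=2)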

@Rose-sys (Author)

Thanks, it helped! I am using about 20 files now with the batch size set to 5.
