Using train_sr for second stage results out of memory #73

Closed
Rose-sys opened this issue Jan 24, 2021 · 5 comments

@Rose-sys

After trying second-stage training, I get an out-of-memory error:

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 536,870,912 bytes (allocated so far: 10,940,589,568 bytes).
localuser@localuser-All-Series:~/vc/become-yukarin$ python3 train_sr.py config_sr.json ../2ndstage/
/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/connection/convolution_2d.py:228: PerformanceWarning: The best algo of conv fwd might not be selected due to lack of workspace size (8388608)
  auto_tune=auto_tune, tensor_core=tensor_core)
predictor/loss
Exception in main training loop: Out of memory allocating 536,870,912 bytes (allocated so far: 10,940,589,568 bytes).
Traceback (most recent call last):
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/trainer.py", line 315, in run
    update()
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
    self.update_core()
  File "/home/localuser/vc/become-yukarin/become_yukarin/updater/sr_updater.py", line 79, in update_core
    opt_predictor.update(loss.get, 'predictor')
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/optimizer.py", line 685, in update
    loss.backward(loss_scale=self._loss_scale)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 981, in backward
    self._backward_main(retain_grad, loss_scale)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 1061, in _backward_main
    func, target_input_indexes, out_grad, in_grad)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 179, in backprop_step
    _reduce(gx)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 10, in _reduce
    grad_list[:] = [chainer.functions.add(*grad_list)]
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 229, in add
    return Add().apply((lhs, rhs))[0]
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/function_node.py", line 263, in apply
    outputs = self.forward(in_data)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 156, in forward
    y = utils.force_array(x[0] + x[1])
  File "cupy/core/core.pyx", line 968, in cupy.core.core.ndarray.__add__
  File "cupy/core/_kernel.pyx", line 930, in cupy.core._kernel.ufunc.__call__
  File "cupy/core/_kernel.pyx", line 397, in cupy.core._kernel._get_out_args
  File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 1243, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1264, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1042, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1062, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 784, in cupy.cuda.memory._try_malloc
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "train_sr.py", line 83, in <module>
    trainer.run()
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/trainer.py", line 329, in run
    six.reraise(*sys.exc_info())
  File "/home/localuser/.local/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/trainer.py", line 315, in run
    update()
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 165, in update
    self.update_core()
  File "/home/localuser/vc/become-yukarin/become_yukarin/updater/sr_updater.py", line 79, in update_core
    opt_predictor.update(loss.get, 'predictor')
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/optimizer.py", line 685, in update
    loss.backward(loss_scale=self._loss_scale)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 981, in backward
    self._backward_main(retain_grad, loss_scale)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/variable.py", line 1061, in _backward_main
    func, target_input_indexes, out_grad, in_grad)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 179, in backprop_step
    _reduce(gx)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/_backprop_utils.py", line 10, in _reduce
    grad_list[:] = [chainer.functions.add(*grad_list)]
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 229, in add
    return Add().apply((lhs, rhs))[0]
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/function_node.py", line 263, in apply
    outputs = self.forward(in_data)
  File "/home/localuser/.local/lib/python3.7/site-packages/chainer/functions/math/basic_math.py", line 156, in forward
    y = utils.force_array(x[0] + x[1])
  File "cupy/core/core.pyx", line 968, in cupy.core.core.ndarray.__add__
  File "cupy/core/_kernel.pyx", line 930, in cupy.core._kernel.ufunc.__call__
  File "cupy/core/_kernel.pyx", line 397, in cupy.core._kernel._get_out_args
  File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 1243, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1264, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1042, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1062, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 784, in cupy.cuda.memory._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 536,870,912 bytes (allocated so far: 10,940,589,568 bytes).

It seems to allocate too much, as hardly anything else is using the video card's memory:

$ nvidia-smi
Sun Jan 24 16:27:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:05:00.0  On |                  N/A |
|  0%   39C    P8    11W / 275W |      1MiB / 11177MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
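
For reference, a minimal sketch that queries free device memory through CuPy's CUDA runtime bindings, the same allocation path the trainer uses, to cross-check what nvidia-smi reports:

import cupy

# memGetInfo() returns (free_bytes, total_bytes) for the current device.
free_bytes, total_bytes = cupy.cuda.runtime.memGetInfo()
print(f"free: {free_bytes / 1024**2:.0f} MiB / total: {total_bytes / 1024**2:.0f} MiB")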
@Rose-sys (Author)

In case you need more information:

Environment:
$ python3 -V
Python 3.7.5

$ pip3 list
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
absl-py (0.11.0)
alembic (1.5.2)
appdirs (1.4.4)
asn1crypto (0.24.0)
astor (0.8.1)
audioread (2.1.9)
become-yukarin (1.0.0)
Brlapi (0.6.6)
cached-property (1.5.2)
certifi (2020.12.5)
cffi (1.14.4)
chainer (5.4.0)
chainerui (0.11.0)
chardet (4.0.0)
click (7.1.2)
cryptography (2.1.4)
cupshelpers (1.0)
cupy-cuda110 (8.3.0)
cycler (0.10.0)
Cython (0.29.21)
decorator (4.4.2)
defer (1.0.6)
distro-info (0.18ubuntu0.18.04.1)
evdev (1.4.0)
fastdtw (0.3.4)
fastrlock (0.5)
filelock (3.0.12)
Flask (1.1.2)
gast (0.4.0)
gevent (21.1.2)
google-pasta (0.2.0)
greenlet (1.0.0)
grpcio (1.35.0)
h5py (3.1.0)
httplib2 (0.9.2)
idna (2.10)
imageio (2.9.0)
imageio-ffmpeg (0.4.3)
importlib-metadata (3.4.0)
itsdangerous (1.1.0)
Jinja2 (2.11.2)
joblib (1.0.0)
Keras-Applications (1.0.8)
Keras-Preprocessing (1.1.2)
keyring (10.6.0)
keyrings.alt (3.0)
kiwisolver (1.3.1)
launchpadlib (1.10.6)
lazr.restfulclient (0.13.5)
lazr.uri (1.0.3)
librosa (0.8.0)
llvmlite (0.35.0)
louis (3.5.0)
macaroonbakery (1.1.3)
Mako (1.1.4)
Markdown (3.3.3)
MarkupSafe (1.1.1)
matplotlib (3.3.3)
moviepy (1.0.3)
msgpack (1.0.2)
netifaces (0.10.4)
numba (0.52.0)
numpy (1.19.5)
oauth (1.0.1)
olefile (0.45.1)
packaging (20.8)
pexpect (4.2.1)
Pillow (8.1.0)
pip (9.0.1)
pooch (1.3.0)
proglog (0.1.9)
protobuf (3.14.0)
psutil (5.8.0)
PyAudio (0.2.11)
pycairo (1.16.2)
pycparser (2.20)
pycrypto (2.6.1)
pycups (1.9.73)
Pygments (2.2.0)
pygobject (3.26.1)
pymacaroons (0.13.0)
PyNaCl (1.1.2)
pynput (1.7.2)
pyparsing (2.4.7)
pyRFC3339 (1.0)
pysptk (0.1.18)
python-apt (1.6.5+ubuntu0.5)
python-dateutil (2.8.1)
python-debian (0.1.32)
python-editor (1.0.4)
python-xlib (0.29)
pytz (2018.3)
pyworld (0.2.12)
pyxdg (0.25)
PyYAML (5.4.1)
reportlab (3.4.0)
requests (2.25.1)
requests-unixsocket (0.1.5)
resampy (0.2.2)
scikit-learn (0.24.1)
scipy (1.6.0)
screen-resolution-extra (0.0.0)
SecretStorage (2.3.1)
setuptools (52.0.0)
simplejson (3.13.2)
six (1.15.0)
SoundFile (0.10.3.post1)
SQLAlchemy (1.3.22)
ssh-import-id (5.7)
structlog (20.2.0)
systemd-python (234)
tensorboard (1.14.0)
tensorboard-chainer (0.5.3)
tensorflow (1.14.0)
tensorflow-estimator (1.14.0)
termcolor (1.1.0)
threadpoolctl (2.1.0)
tqdm (4.56.0)
typing (3.7.4.3)
typing-extensions (3.7.4.3)
ufw (0.36)
unattended-upgrades (0.1)
urllib3 (1.26.2)
usb-creator (0.3.3)
wadllib (1.3.2)
Werkzeug (1.0.1)
wheel (0.36.2)
world4py (0.1.1)
wrapt (1.12.1)
xkit (0.0.0)
yukarin (0.1.0)
zipp (3.4.0)
zope.event (4.5.0)
zope.interface (5.2.0)

@Rose-sys (Author)

I am wondering which CUDA, driver, and CuPy versions you use, so I can rebuild the environment. Since the recommendations mention a version below 7.0.0, I installed CUDA 10.0 with cupy-cuda100==5.4.0 (I do not see a CuPy version 6), yet it still gave the out-of-memory error, although training somehow took longer. I am on Ubuntu 18.04, and I could not install drivers older than the 450.xx series.

@Rose-sys (Author)

I want to add that I had 12 .npy files after generating the spectrogram pairs, 55 MB in total (around 4 MB each on average). I removed a few .npy files, reducing the set to 7, and it no longer crashes.

Is it supposed to be like that? It looks a bit odd, as it now uses 7.4 GB with 7 files. I might be able to squeeze an 8th file in, but isn't it better the more data you have?

It is probably something to do with CuPy/Chainer; looking further into this, their pages say to process in batches. So I am not sure whether that is possible with your code.

I will await your reply to see what you recommend (I could not use too few files, as it would complain about dividing the epoch by zero, but more than 8 just eats all 11 GB of my video card and crashes).
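
For reference, a minimal sketch of one CuPy-side knob, assuming CuPy v8 or newer (such as the cupy-cuda110 8.3.0 listed above): the default memory pool can be capped so allocations fail early at a chosen bound instead of growing to fill the card. This only bounds the allocator; it does not reduce what the model itself needs per batch.

import cupy

# Cap CuPy's default memory pool at 8 GiB (the figure is an arbitrary
# example). MemoryPool.set_limit is available from CuPy v8 onward.
pool = cupy.get_default_memory_pool()
pool.set_limit(size=8 * 1024**3)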

@Hiroshiba (Owner)

Try lowering the batch size here.

"batchsize": 8,

I will close this issue for now, but please reopen it if you need anything else.
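
For anyone reconstructing the fix, a minimal sketch of lowering that value programmatically, assuming the key lives under a "train" section of config_sr.json (editing the file by hand works just as well; adjust the key path to your actual config layout):

import json

# Load the second-stage config, halve the batch size, and write it back.
# NOTE: the "train" key path is an assumption about the config layout.
with open("config_sr.json") as f:
    config = json.load(f)

config["train"]["batchsize"] = 4  # was 8; smaller batches need less GPU memory

with open("config_sr.json", "w") as f:
    json.dump(config, f, indent=2)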

@Rose-sys (Author)

Thanks, it helped! I am using about 20 files now with the batch size set to 5.
