Queue full error - Multi-GPU 1M custom dataset #1454

Closed
pbamotra opened this issue Jun 4, 2019 · 2 comments

@pbamotra commented Jun 4, 2019

Traceback (most recent call last):
  File "/opt/conda/envs/learn-dev/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/envs/learn-dev/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/OpenNMT-py/train.py", line 127, in batch_producer
    q.put(b, False)
  File "/opt/conda/envs/learn-dev/lib/python3.7/multiprocessing/queues.py", line 83, in put
    raise Full
queue.Full
[2019-06-04 07:32:18,291 INFO] Step 1200/100000; acc:  79.56; ppl:  1.82; xent: 0.60; lr: 1.00000; 33996/13919 tok/s;    401 sec
Traceback (most recent call last):
  File "train.py", line 196, in <module>
    main(opt)
  File "train.py", line 78, in main
    p.join()
  File "/opt/conda/envs/learn-dev/lib/python3.7/multiprocessing/process.py", line 140, in join
    res = self._popen.wait(timeout)
  File "/opt/conda/envs/learn-dev/lib/python3.7/multiprocessing/popen_fork.py", line 48, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/opt/conda/envs/learn-dev/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "train.py", line 180, in signal_handler
    raise Exception(msg)
Exception: 

-- Tracebacks above this line can probably
                 be ignored --

Traceback (most recent call last):
  File "/workspace/OpenNMT-py/train.py", line 138, in run
    single_main(opt, device_id, batch_queue, semaphore)
  File "/workspace/OpenNMT-py/onmt/train_single.py", line 139, in main
    valid_steps=opt.valid_steps)
  File "/workspace/OpenNMT-py/onmt/trainer.py", line 224, in train
    self._accum_batches(train_iter)):
  File "/workspace/OpenNMT-py/onmt/trainer.py", line 162, in _accum_batches
    for batch in iterator:
  File "/workspace/OpenNMT-py/onmt/train_single.py", line 116, in _train_iter
    batch = batch_queue.get()
  File "/opt/conda/envs/learn-dev/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/opt/conda/envs/learn-dev/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 109, in rebuild_cuda_tensor
    event_sync_required)
RuntimeError: CUDA error: unknown error
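
The `queue.Full` in the first traceback is the standard behaviour of a non-blocking put on a bounded `multiprocessing.Queue`: `q.put(b, False)` raises immediately when the queue is already at capacity instead of waiting for a consumer to drain it. A minimal standalone sketch of that behaviour (not OpenNMT-py code; the queue size and items are arbitrary):

```python
import multiprocessing as mp

def producer(q):
    # With block=False, put() raises queue.Full as soon as the queue
    # reaches its maxsize, rather than waiting for the consumer.
    for i in range(10):
        q.put(i, False)  # same call pattern as batch_producer's q.put(b, False)

if __name__ == "__main__":
    q = mp.Queue(maxsize=2)   # small bound so the error triggers quickly
    p = mp.Process(target=producer, args=(q,))
    p.start()
    p.join()                  # the child dies with queue.Full on the third put
```
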
@vince62s (Contributor) commented Jun 4, 2019
Can you please post your command line?

francoishernandez added a commit to francoishernandez/OpenNMT-py that referenced this issue Jun 4, 2019

@francoishernandez (Contributor) commented Jun 4, 2019
Hey @pbamotra
My bad, it was introduced in the last commit of #1450 when switching back to sequential queue filling.
#1455 fixes it, we'll merge asap.
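
The producer/consumer setup visible in the traceback (`batch_producer` putting into `batch_queue`, gated by a `semaphore`) avoids this error when the producer blocks instead of doing a non-blocking put. A rough sketch of that pattern, assuming the semaphore is released by the GPU process after each batch it consumes; this is only an illustration, not the actual patch in #1455:

```python
def batch_producer(batch_iter, q, semaphore):
    # Sketch only: wait until the consumer has freed a slot, then use a
    # blocking put, so a temporarily full queue stalls the producer
    # instead of raising queue.Full.
    for b in batch_iter:
        semaphore.acquire()   # released by the consumer after it gets a batch
        q.put(b)              # block=True by default
```
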

vince62s added a commit that referenced this issue Jun 4, 2019
