
RuntimeError: received 0 items of ancdata #701

Closed
petteriTeikari opened this issue Jul 6, 2020 · 10 comments
Labels
question Further information is requested

Comments

@petteriTeikari

"Stochastic" issue happening with training at some point. Training starts okay for x number of epochs and at some point this often happens with Pytorch Lightning (quite close still to the Build a segmentation workflow (with PyTorch Lightning)
)
, and is probably propagating from Pytorch code? (e.g. fastai/fastai#23)

reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata
TypeError: 'NoneType' object is not iterable

At first I thought this was happening with the CacheDataset because it is quite RAM-intensive:

train_ds = CacheDataset(data=datalist_train, transform=train_trans, cache_rate=1, num_workers=4)
val_ds = CacheDataset(data=datalist_val, transform=val_trans, cache_rate=1, num_workers=4)

but the same behavior also occurs with the vanilla Dataset:

train_ds = Dataset(data=datalist_train, transform=train_trans)
val_ds = Dataset(data=datalist_val, transform=val_trans)

with the following transformation

[image: screenshot of the transform pipeline]

I guess this depends on the environment in which the code is run, but do you have any ideas on how to get rid of this?
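
For reference, a minimal sketch of how these datasets get wrapped into loaders on my side (the batch sizes and num_workers here are illustrative assumptions, not my exact values); the error is raised while the worker processes send loaded tensors back to the main process:

from torch.utils.data import DataLoader

# multi-worker loading is what triggers the fd-based tensor sharing seen in the traceback below
train_loader = DataLoader(train_ds, batch_size=2, shuffle=True, num_workers=4)
val_loader = DataLoader(val_ds, batch_size=1, shuffle=False, num_workers=4)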

Full trace:

MONAI version: 0.2.0
Python version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)  [GCC 7.3.0]
Numpy version: 1.18.5
Pytorch version: 1.5.0

Optional dependencies:
Pytorch Ignite version: 0.3.0
Nibabel version: 3.1.0
scikit-image version: 0.17.2
Pillow version: 7.1.2
Tensorboard version: 2.2.2

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type     | Params
-------------------------------------------
0 | _model        | UNet     | 4 M   
1 | loss_function | DiceLoss | 0     
Validation sanity check: 100%|████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.89s/it]
Validation sanity check: 100%|████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.89s/it]
current epoch: 0 current mean loss: 0.6968 best mean loss: 0.6968 (best dice at that loss 0.0061) at epoch 0
Epoch 1: 100%|███████████████████████████████████████████████████████████████████████████████| 222/222 [08:21<00:00,  2.26s/it, loss=0.604, v_num=0]
...
Epoch 41:  83%|████████████████████████████████████████████████████████████████▋             | 184/222 [06:40<01:22,  2.18s/it, loss=0.281, v_num=0]
Traceback (most recent call last):                                                                                                                  
  trainer.fit(net)
  site-packages/pytorch_lightning/trainer/trainer.py", line 918, in fit
    self.single_gpu_train(model)
  site-packages/pytorch_lightning/trainer/distrib_parts.py", line 176, in single_gpu_train
    self.run_pretrain_routine(model)
  site-packages/pytorch_lightning/trainer/trainer.py", line 1093, in run_pretrain_routine
    self.train()
  site-packages/pytorch_lightning/trainer/training_loop.py", line 375, in train
    self.run_training_epoch()
  site-packages/pytorch_lightning/trainer/training_loop.py", line 445, in run_training_epoch
    enumerate(_with_is_last(train_dataloader)), "get_train_batch"
  site-packages/pytorch_lightning/profiler/profilers.py", line 64, in profile_iterable
    value = next(iterator)
  site-packages/pytorch_lightning/trainer/training_loop.py", line 844, in _with_is_last
    for val in it:
  site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/petteri/anaconda3/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  site-packages/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
    fd = df.detach()
  File "/home/petteri/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/petteri/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/petteri/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))

RuntimeError: received 0 items of ancdata
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  site-packages/tqdm/std.py", line 1086, in __del__
  site-packages/tqdm/std.py", line 1293, in close
  site-packages/tqdm/std.py", line 1471, in display
  site-packages/tqdm/std.py", line 1089, in __repr__
  site-packages/tqdm/std.py", line 1433, in format_dict
TypeError: 'NoneType' object is not iterable
@Nic-Ma
Contributor

Nic-Ma commented Jul 7, 2020

Hi @petteriTeikari ,

Thanks for the bug report. Could you please provide a test program that reproduces this issue?
Hi @marksgraham , have you seen this issue with PyTorch Lightning before?

Thanks.

@Nic-Ma Nic-Ma added the question Further information is requested label Jul 7, 2020
@petteriTeikari
Author

I will try to put together a minimal example for you, @Nic-Ma (my codebase has grown and is hard to share as-is), but I assume the problem is somewhere deeper, in not properly handling the case where the data is inaccessible for a while (for a reason unknown on my end).

@marksgraham
Contributor

I haven't seen this problem before - did you make any progress with it, @petteriTeikari ?

@petteriTeikari
Author

I "solved" this by using the standard train_ds = Dataset instead of the train_ds = CacheDataset and that has not occurred now, and can report back (and try to do a more compact example) when I go back to cached world

@Nic-Ma Nic-Ma closed this as completed Jul 20, 2020
@petteriTeikari
Author

petteriTeikari commented Jul 25, 2020

@Nic-Ma : I came across this error again when trying a custom loss outside MONAI and adding one more volume to the dataloader, thus giving the CPUs more to compute. I could not get past the first epoch.

Apparently it is the multiprocessing that is causing the headaches, which makes the problem somewhat local to my machine and hard to reproduce.

I tried some of the fixes from pytorch/pytorch#973 and at least got past the first epoch; I will see how robust these workarounds are (a combined sketch follows after the individual snippets below).

pytorch/pytorch#973 (comment):

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

pytorch/pytorch#973 (comment):

pool = torch.multiprocessing.Pool(torch.multiprocessing.cpu_count(), maxtasksperchild=1)

pytorch/pytorch#973 (comment):

import resource
rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (4096, rlimit[1]))

@sampathweb had this suggestion for debugging in pytorch/pytorch#973 (comment):
"If the core devs want to see the error, just reduce the ulimit to 1024 and run the code of @kamo-naoyuki above and you might see the same problem."
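
For completeness, a rough sketch of how I combine these workarounds at the top of the training script, before any datasets or loaders are created (the 4096 value assumes the hard limit is at least that high; this is just what I am testing, not a verified fix):

import resource
import torch.multiprocessing

# share tensors via files in shared memory instead of passing file descriptors,
# so the DataLoader workers no longer exhaust the per-process fd limit
torch.multiprocessing.set_sharing_strategy('file_system')

# alternatively (or additionally), raise the soft open-file limit towards the hard limit
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (4096, hard))

# for debugging (per @sampathweb): lower the soft limit instead, e.g. back to 1024,
# to make the failure reproduce quickly
# resource.setrlimit(resource.RLIMIT_NOFILE, (1024, hard))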

@Nic-Ma
Contributor

Nic-Ma commented Jul 25, 2020

Hi @petteriTeikari ,

I see. So what's the latest status, have you solved this issue?
I would suggest first trying it on an Ubuntu machine to make sure the program runs as expected.

Thanks.

@petteriTeikari
Author

@Nic-Ma Yes, I started training last night and it has not crashed so far, so in that sense the fix seems to be working.

@Nic-Ma
Contributor

Nic-Ma commented Jul 26, 2020

Sounds good!

@cuge1995

import resource
rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (4096, rlimit[1]))

Thanks, it solved my problem.

@wyli
Contributor

wyli commented Aug 10, 2022

The issue is reproducible with this script (pytorch/pytorch#973 (comment)):

import torch
import torch.multiprocessing as multiprocessing

torch.multiprocessing.set_sharing_strategy('file_descriptor')

def _worker_loop(data_queue):
    while True:
        t = torch.FloatTensor(1)
        data_queue.put(t)


if __name__ == '__main__':
    data_queue = multiprocessing.Queue(maxsize=1)
    p = multiprocessing.Process(
        target=_worker_loop,
        args=(data_queue,))

    p.daemon = True
    p.start()
    lis = []
    for i in range(10000):
        try:
            lis.append(data_queue.get())
        except:
            print('i = {}'.format(i))
            raise

when ulimit -n is set to 1024 (the Ubuntu default).
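
For convenience, a small sketch to check (and optionally lower) the soft open-file limit from within Python instead of using ulimit -n in the shell; the 1024 value mirrors the Ubuntu default mentioned above:

import resource

# the repro above fails once roughly this many received tensors are held alive,
# because each one keeps a file descriptor open under the file_descriptor strategy
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('RLIMIT_NOFILE: soft={}, hard={}'.format(soft, hard))

# lower the soft limit to reproduce quickly (uncomment to apply)
# resource.setrlimit(resource.RLIMIT_NOFILE, (1024, hard))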
