
RuntimeError: received 0 items of ancdata #701

Closed
petteriTeikari opened this issue Jul 6, 2020 · 10 comments
Labels
question Further information is requested

Comments

@petteriTeikari

"Stochastic" issue happening with training at some point. Training starts okay for x number of epochs and at some point this often happens with Pytorch Lightning (quite close still to the Build a segmentation workflow (with PyTorch Lightning)
)
, and is probably propagating from Pytorch code? (e.g. fastai/fastai#23)

reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata
TypeError: 'NoneType' object is not iterable

At first I thought this was happening with the CacheDataset because it is quite RAM-intensive:

train_ds = CacheDataset(data=datalist_train, transform=train_trans, cache_rate=1, num_workers=4)
val_ds = CacheDataset(data=datalist_val, transform=val_trans, cache_rate=1, num_workers=4)

but the same behavior also occurs with the vanilla Dataset:

train_ds = Dataset(data=datalist_train, transform=train_trans)
val_ds = Dataset(data=datalist_val, transform=val_trans)

with the following transformation

[image: screenshot of the transform pipeline]

I guess this depends on the environment in which the code is run, but do you have any ideas on how to get rid of this?
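
For reference, a minimal sketch of how these datasets get wrapped into loaders on my side (the batch sizes and num_workers here are illustrative assumptions, not my exact values); the error is raised while the worker processes send loaded tensors back to the main process:

from torch.utils.data import DataLoader

# multi-worker loading is what triggers the fd-based tensor sharing seen in the traceback below
train_loader = DataLoader(train_ds, batch_size=2, shuffle=True, num_workers=4)
val_loader = DataLoader(val_ds, batch_size=1, shuffle=False, num_workers=4)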

Full trace:

MONAI version: 0.2.0
Python version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)  [GCC 7.3.0]
Numpy version: 1.18.5
Pytorch version: 1.5.0

Optional dependencies:
Pytorch Ignite version: 0.3.0
Nibabel version: 3.1.0
scikit-image version: 0.17.2
Pillow version: 7.1.2
Tensorboard version: 2.2.2

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type     | Params
-------------------------------------------
0 | _model        | UNet     | 4 M   
1 | loss_function | DiceLoss | 0     
Validation sanity check: 100%|████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.89s/it]
Validation sanity check: 100%|████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.89s/it]
current epoch: 0 current mean loss: 0.6968 best mean loss: 0.6968 (best dice at that loss 0.0061) at epoch 0
Epoch 1: 100%|███████████████████████████████████████████████████████████████████████████████| 222/222 [08:21<00:00,  2.26s/it, loss=0.604, v_num=0]
...
Epoch 41:  83%|████████████████████████████████████████████████████████████████▋             | 184/222 [06:40<01:22,  2.18s/it, loss=0.281, v_num=0]
Traceback (most recent call last):                                                                                                                  
  trainer.fit(net)
  site-packages/pytorch_lightning/trainer/trainer.py", line 918, in fit
    self.single_gpu_train(model)
  site-packages/pytorch_lightning/trainer/distrib_parts.py", line 176, in single_gpu_train
    self.run_pretrain_routine(model)
  site-packages/pytorch_lightning/trainer/trainer.py", line 1093, in run_pretrain_routine
    self.train()
  site-packages/pytorch_lightning/trainer/training_loop.py", line 375, in train
    self.run_training_epoch()
  site-packages/pytorch_lightning/trainer/training_loop.py", line 445, in run_training_epoch
    enumerate(_with_is_last(train_dataloader)), "get_train_batch"
  site-packages/pytorch_lightning/profiler/profilers.py", line 64, in profile_iterable
    value = next(iterator)
  site-packages/pytorch_lightning/trainer/training_loop.py", line 844, in _with_is_last
    for val in it:
  site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/petteri/anaconda3/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  site-packages/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
    fd = df.detach()
  File "/home/petteri/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/petteri/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/petteri/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))

RuntimeError: received 0 items of ancdata
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  site-packages/tqdm/std.py", line 1086, in __del__
  site-packages/tqdm/std.py", line 1293, in close
  site-packages/tqdm/std.py", line 1471, in display
  site-packages/tqdm/std.py", line 1089, in __repr__
  site-packages/tqdm/std.py", line 1433, in format_dict
TypeError: 'NoneType' object is not iterable
@Nic-Ma
Contributor

Nic-Ma commented Jul 7, 2020

Hi @petteriTeikari ,

Thanks for the bug report. Could you please provide a test program that reproduces this issue?
Hi @marksgraham , have you seen this issue with PyTorch Lightning before?

Thanks.

@Nic-Ma Nic-Ma added the question Further information is requested label Jul 7, 2020
@petteriTeikari
Author

I will try to put together a minimal example for you, @Nic-Ma (my codebase has grown and is hard to share as-is), but I assume the problem is somewhere deeper, in not properly handling the case where the data is inaccessible for a while (for a reason unknown on my end).

@marksgraham
Contributor

I haven't seen this problem before - did you make any progress with it, @petteriTeikari ?

@petteriTeikari
Author

I "solved" this by using the standard train_ds = Dataset instead of the train_ds = CacheDataset and that has not occurred now, and can report back (and try to do a more compact example) when I go back to cached world

@Nic-Ma Nic-Ma closed this as completed Jul 20, 2020
@petteriTeikari
Author

petteriTeikari commented Jul 25, 2020

@Nic-Ma : I came across this error again when trying a custom loss outside MONAI and adding one more volume to the dataloader, thus giving the CPUs more to compute. I could not get past the first epoch.

Apparently it is the multiprocessing that is causing the headaches, which makes the problem somewhat local to my machine and hard to reproduce.

I tried some of the fixes from pytorch/pytorch#973 and at least got past the first epoch; I will see how robust these workarounds are (a combined sketch follows after the individual snippets below).

pytorch/pytorch#973 (comment):

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

pytorch/pytorch#973 (comment):

pool = torch.multiprocessing.Pool(torch.multiprocessing.cpu_count(), maxtasksperchild=1)

pytorch/pytorch#973 (comment):

import resource
rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (4096, rlimit[1]))

@sampathweb had this suggestion for debugging in pytorch/pytorch#973 (comment):
"If the core devs want to see the error, just reduce the ulimit to 1024 and run the code of @kamo-naoyuki above and you might see the same problem."
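
For completeness, a rough sketch of how I combine these workarounds at the top of the training script, before any datasets or loaders are created (the 4096 value assumes the hard limit is at least that high; this is just what I am testing, not a verified fix):

import resource
import torch.multiprocessing

# share tensors via files in shared memory instead of passing file descriptors,
# so the DataLoader workers no longer exhaust the per-process fd limit
torch.multiprocessing.set_sharing_strategy('file_system')

# alternatively (or additionally), raise the soft open-file limit towards the hard limit
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (4096, hard))

# for debugging (per @sampathweb): lower the soft limit instead, e.g. back to 1024,
# to make the failure reproduce quickly
# resource.setrlimit(resource.RLIMIT_NOFILE, (1024, hard))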

@Nic-Ma
Contributor

Nic-Ma commented Jul 25, 2020

Hi @petteriTeikari ,

I see. So what's the latest status, have you solved this issue?
I would suggest first trying it on an Ubuntu machine to make sure the program runs as expected.

Thanks.

@petteriTeikari
Author

@Nic-Ma Yes, I started training last night and it has not crashed so far, so in that sense the fix seems to be working.

@Nic-Ma
Contributor

Nic-Ma commented Jul 26, 2020

Sounds good!

@cuge1995

import resource
rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (4096, rlimit[1]))

Thanks, it solved my problem.

@wyli
Contributor

wyli commented Aug 10, 2022

The issue is reproducible with this script (pytorch/pytorch#973 (comment)):

import torch
import torch.multiprocessing as multiprocessing

torch.multiprocessing.set_sharing_strategy('file_descriptor')

def _worker_loop(data_queue):
    while True:
        t = torch.FloatTensor(1)
        data_queue.put(t)


if __name__ == '__main__':
    data_queue = multiprocessing.Queue(maxsize=1)
    p = multiprocessing.Process(
        target=_worker_loop,
        args=(data_queue,))

    p.daemon = True
    p.start()
    lis = []
    for i in range(10000):
        try:
            lis.append(data_queue.get())
        except:
            print('i = {}'.format(i))
            raise

when ulimit -n is set to 1024 (the Ubuntu default).
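
For convenience, a small sketch to check (and optionally lower) the soft open-file limit from within Python instead of using ulimit -n in the shell; the 1024 value mirrors the Ubuntu default mentioned above:

import resource

# the repro above fails once roughly this many received tensors are held alive,
# because each one keeps a file descriptor open under the file_descriptor strategy
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('RLIMIT_NOFILE: soft={}, hard={}'.format(soft, hard))

# lower the soft limit to reproduce quickly (uncomment to apply)
# resource.setrlimit(resource.RLIMIT_NOFILE, (1024, hard))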
