@drugilsberg @jannisborn I'm opening this issue to revise and discuss whether there should be any device handling, and what the default device or even datatype of items should be. Much of this is motivated by PyTorch Lightning.
Would you agree that we should not cast to a tensor if there is no benefit within the dataset itself?

First of all, a Dataset is not required to return a torch.Tensor when indexed (also briefly discussed in #98); e.g., SmilesDataset returns str. The default torch DataLoader/collate_fn handles np.arrays, strings, even Mappings such as dicts. Actually, it handles np.arrays even better than we do, avoiding a copy where possible (https://github.com/pytorch/pytorch/blob/b643dbb8a4bf9850171b2de848b6b89206973972/torch/utils/data/_utils/collate.py#L51).
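To illustrate, here is a minimal sketch (ToyStringDataset below is a made-up example, not a pytoda class) of the default collate_fn batching non-tensor items such as strings:

```python
from torch.utils.data import Dataset, DataLoader

class ToyStringDataset(Dataset):
    """Returns plain SMILES strings: no tensor casting, no device handling."""

    def __init__(self, smiles):
        self.smiles = smiles

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, index):
        return self.smiles[index]  # a str, not a torch.Tensor

loader = DataLoader(ToyStringDataset(['CCO', 'c1ccccc1', 'CC(=O)O', 'CN']), batch_size=2)
for batch in loader:
    print(batch)  # default_collate leaves strings alone, e.g. ['CCO', 'c1ccccc1']
```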
I think our handling of device is uncommon. One would usually only send data to the gpu ahead of the training loop when one knows that the entire dataset fits into GPU memory, since that avoids any transfer of data between devices during training. This is not the case for us: most of the time we end up calling tensor.to(device) in __getitem__.
I'm assuming now that the data originates on the cpu anyway, and we merely have to decide when the transfer to other devices should happen. In the generic case where not all data fits on the gpu, doing the transfer in __getitem__ has a big overhead compared to sending whole batches, so it should be done in the training loop. (Lightning does this for us, by the way.)
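As a sketch of the pattern argued for here (the names are illustrative, not pytoda code): the dataset yields cpu tensors and the training loop moves whole batches to the device.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Dataset and DataLoader stay entirely on cpu.
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True)

model = torch.nn.Linear(16, 2).to(device)
for features, labels in loader:
    # One transfer per batch here, instead of one transfer per item in __getitem__.
    features, labels = features.to(device), labels.to(device)
    logits = model(features)
```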
So I argue that we should never set the device parameter in our datasets to a gpu. I think we should remove any device handling from pytoda.datasets; at the very least, we should change the default to 'cpu'.
Leaving data on the cpu also allows sending only parts of the data to specific devices later, as torch.utils.data.DistributedSampler does (again, Lightning handles this for us under the hood), which enables multi-gpu training. Currently our datasets would send all data to all gpus.
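A minimal sketch of that sharding, assuming torch.distributed has already been initialised elsewhere (e.g. by Lightning):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dataset = TensorDataset(torch.randn(1024, 16))  # lives on cpu
sampler = DistributedSampler(dataset)           # each rank only sees its shard of indices
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # keeps the shuffling consistent across ranks per epoch
    for (features,) in loader:
        ...  # move only this rank's batches to this rank's device here
```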
Also, the DataLoader could prepare batches on the cpu with multiple workers. So instead of worrying about the device, we should make sure that the datasets perform well on the cpu and can be used with multiple workers (the torch docs explicitly warn against returning cuda tensors here!).
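A sketch of what that could look like; ToyTokenizingDataset is a made-up stand-in for a pytoda dataset, not actual library code:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyTokenizingDataset(Dataset):
    """Made-up stand-in: per-item featurization happens on cpu, inside workers."""

    def __init__(self, smiles, max_length=8):
        self.smiles = smiles
        self.max_length = max_length

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, index):
        # Cheap stand-in for real tokenization: pad/truncate character codes.
        codes = [ord(c) for c in self.smiles[index]][: self.max_length]
        codes += [0] * (self.max_length - len(codes))
        return torch.tensor(codes, dtype=torch.long)  # cpu tensor, never cuda

if __name__ == '__main__':  # guard required for multi-process data loading
    loader = DataLoader(
        ToyTokenizingDataset(['CCO', 'c1ccccc1', 'CC(=O)O', 'CN'] * 100),
        batch_size=32,
        num_workers=2,  # workers run __getitem__ in separate cpu processes
        shuffle=True,
    )
    for batch in loader:
        pass  # cpu LongTensor batches of shape (batch_size, 8)
```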
We should check whether we have to do anything to support DataLoader(..., pin_memory=True); see the sketch after the reading list below.

Reading:
https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html
https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
https://pytorch.org/docs/stable/data.html#multi-process-data-loading
https://pytorch.org/docs/stable/data.html#memory-pinning
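On the pin_memory point, a minimal sketch of the consumer side, assuming our datasets simply return plain cpu tensors; pinned batches can then be copied to the gpu asynchronously:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(4096, 16))                # plain cpu tensors
loader = DataLoader(dataset, batch_size=64, pin_memory=True)  # batches land in pinned memory

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
for (features,) in loader:
    # Copies from pinned memory can overlap with compute when non_blocking=True.
    features = features.to(device, non_blocking=True)
```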
I agree with the sentiment of this issue, but I'm closing it due to low priority (#wontfix label).
Indeed, there's a significant overhead when each sample is sent to the gpu in __getitem__. The reason for not following your suggestion is practical: removing the device option entirely from pytoda.datasets is drastic, since we would effectively kill a feature that might be useful to some users. Also, some dataset helpers like StochasticItems can be used independently from the datasets but are quite intertwined with them.
The alternative would be changing the default to cpu. This seems like a good idea at first, but for the case where the user does want the gpu option, we would need to ensure that the argument is passed down all the way to every helper (e.g., transforms like ToTensor). That is totally doable, but it's time-consuming to debug, and issues might only surface later during training.
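Purely as an illustration of the plumbing concern (this is not pytoda's actual API; both classes below are hypothetical), a device-aware transform and a dataset that has to forward the argument to it:

```python
import torch
from torch.utils.data import Dataset

class ToTensor:
    """Hypothetical transform: casts token indices to a tensor on a given device."""

    def __init__(self, device=torch.device('cpu')):
        self.device = device  # under the proposal, this default becomes cpu everywhere

    def __call__(self, token_indices):
        return torch.tensor(token_indices, dtype=torch.long, device=self.device)

class TokenizedDataset(Dataset):
    """Hypothetical dataset that must forward `device` to each helper it builds."""

    def __init__(self, samples, device=torch.device('cpu')):
        self.samples = samples
        self.transform = ToTensor(device=device)  # one of several places to keep in sync

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.transform(self.samples[index])
```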
I'm not happy with keeping this as is, but due to bandwidth, this would be my go-to.