@drugilsberg @jannisborn I'm opening this issue to revise and discuss whether there should be any device handling, and what the default device or even datatype of items should be. Much of this is motivated by PyTorch Lightning.
Would you agree that we should not cast to a tensor if there is no benefit within the dataset itself?

First of all, a Dataset is not required to return a torch.Tensor when indexed (also briefly discussed in #98); e.g., SmilesDataset returns str. The default torch DataLoader/collate_fn handles np.arrays, strings, even Mappings such as dicts. Actually, it handles np.arrays even better than we do, avoiding a copy where possible (https://github.com/pytorch/pytorch/blob/b643dbb8a4bf9850171b2de848b6b89206973972/torch/utils/data/_utils/collate.py#L51).
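To illustrate, here is a minimal sketch (ToyStringDataset below is a made-up example, not a pytoda class) of the default collate_fn batching non-tensor items such as strings:

```python
from torch.utils.data import Dataset, DataLoader

class ToyStringDataset(Dataset):
    """Returns plain SMILES strings: no tensor casting, no device handling."""

    def __init__(self, smiles):
        self.smiles = smiles

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, index):
        return self.smiles[index]  # a str, not a torch.Tensor

loader = DataLoader(ToyStringDataset(['CCO', 'c1ccccc1', 'CC(=O)O', 'CN']), batch_size=2)
for batch in loader:
    print(batch)  # default_collate leaves strings alone, e.g. ['CCO', 'c1ccccc1']
```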
I think our handling of device is uncommon. One would usually only send data to the gpu ahead of the training loop when one knows that the entire dataset fits into GPU memory, since that avoids any transfer of data between devices during training. This is not the case for us: most of the time we end up calling tensor.to(device) in __getitem__.
I'm assuming now that the data originates on the cpu anyway, and we merely have to decide when the transfer to other devices should happen. In the generic case where not all data fits on the gpu, doing the transfer in __getitem__ has a big overhead compared to sending whole batches, so it should be done in the training loop. (Lightning does this for us, by the way.)
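As a sketch of the pattern argued for here (the names are illustrative, not pytoda code): the dataset yields cpu tensors and the training loop moves whole batches to the device.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Dataset and DataLoader stay entirely on cpu.
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True)

model = torch.nn.Linear(16, 2).to(device)
for features, labels in loader:
    # One transfer per batch here, instead of one transfer per item in __getitem__.
    features, labels = features.to(device), labels.to(device)
    logits = model(features)
```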
So I argue that we should never set the device parameter in our datasets to a gpu. I think we should remove any device handling from pytoda.datasets; at the very least, we should change the default to 'cpu'.
Leaving data on the cpu also allows sending only parts of the data to specific devices later, as torch.utils.data.DistributedSampler does (again, Lightning handles this for us under the hood), which enables multi-gpu training. Currently our datasets would send all data to all gpus.
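A minimal sketch of that sharding, assuming torch.distributed has already been initialised elsewhere (e.g. by Lightning):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dataset = TensorDataset(torch.randn(1024, 16))  # lives on cpu
sampler = DistributedSampler(dataset)           # each rank only sees its shard of indices
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # keeps the shuffling consistent across ranks per epoch
    for (features,) in loader:
        ...  # move only this rank's batches to this rank's device here
```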
Also, the DataLoader could prepare batches on the cpu with multiple workers. So instead of worrying about the device, we should make sure that the datasets perform well on the cpu and can be used with multiple workers (the torch docs explicitly warn against returning cuda tensors here!).
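A sketch of what that could look like; ToyTokenizingDataset is a made-up stand-in for a pytoda dataset, not actual library code:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyTokenizingDataset(Dataset):
    """Made-up stand-in: per-item featurization happens on cpu, inside workers."""

    def __init__(self, smiles, max_length=8):
        self.smiles = smiles
        self.max_length = max_length

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, index):
        # Cheap stand-in for real tokenization: pad/truncate character codes.
        codes = [ord(c) for c in self.smiles[index]][: self.max_length]
        codes += [0] * (self.max_length - len(codes))
        return torch.tensor(codes, dtype=torch.long)  # cpu tensor, never cuda

if __name__ == '__main__':  # guard required for multi-process data loading
    loader = DataLoader(
        ToyTokenizingDataset(['CCO', 'c1ccccc1', 'CC(=O)O', 'CN'] * 100),
        batch_size=32,
        num_workers=2,  # workers run __getitem__ in separate cpu processes
        shuffle=True,
    )
    for batch in loader:
        pass  # cpu LongTensor batches of shape (batch_size, 8)
```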
We should check whether we have to do anything to support DataLoader(..., pin_memory=True); see the sketch after the reading list below.

Reading:
https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html
https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
https://pytorch.org/docs/stable/data.html#multi-process-data-loading
https://pytorch.org/docs/stable/data.html#memory-pinning
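On the pin_memory point, a minimal sketch of the consumer side, assuming our datasets simply return plain cpu tensors; pinned batches can then be copied to the gpu asynchronously:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(4096, 16))                # plain cpu tensors
loader = DataLoader(dataset, batch_size=64, pin_memory=True)  # batches land in pinned memory

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
for (features,) in loader:
    # Copies from pinned memory can overlap with compute when non_blocking=True.
    features = features.to(device, non_blocking=True)
```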
I agree with the sentiment of this issue, but I'm closing it due to low priority (#wontfix label).
Indeed, there's a significant overhead when each sample is sent to the gpu in __getitem__. The reason for not following your suggestion is practical: removing the device option entirely from pytoda.datasets is drastic, since we would effectively kill a feature that might be useful to some users. Also, some dataset helpers like StochasticItems can be used independently from the datasets but are quite intertwined with them.
The alternative would be changing the default to cpu. This seems like a good idea at first, but for the case where the user does want the gpu option, we would need to ensure that the argument is passed down all the way to every helper (e.g., transforms like ToTensor). That is totally doable, but it's time-consuming to debug, and issues might only surface later during training.
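Purely as an illustration of the plumbing concern (this is not pytoda's actual API; both classes below are hypothetical), a device-aware transform and a dataset that has to forward the argument to it:

```python
import torch
from torch.utils.data import Dataset

class ToTensor:
    """Hypothetical transform: casts token indices to a tensor on a given device."""

    def __init__(self, device=torch.device('cpu')):
        self.device = device  # under the proposal, this default becomes cpu everywhere

    def __call__(self, token_indices):
        return torch.tensor(token_indices, dtype=torch.long, device=self.device)

class TokenizedDataset(Dataset):
    """Hypothetical dataset that must forward `device` to each helper it builds."""

    def __init__(self, samples, device=torch.device('cpu')):
        self.samples = samples
        self.transform = ToTensor(device=device)  # one of several places to keep in sync

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.transform(self.samples[index])
```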
I'm not happy with keeping this as is, but due to bandwidth, this would be my go-to.