Introduce PyTorch DataPipes #173

gsakkis · 2022-08-10T07:40:33Z

This PR introduces torchdata, a PyTorch library of common, modular data-loading primitives for constructing flexible and performant data pipelines. The reader is now composed from these primitives rather than implemented as a custom, monolithic IterableDataset subclass.

Currently the main practical benefit is related to shuffling:

- Replace the custom _iter_shuffled function with the Shuffler datapipe.
- Apply shuffling to the whole range of the key dimension even with multiple workers, instead of separately shuffling the subrange assigned to each worker.
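
To illustrate the difference described in the second point, here is a minimal pure-Python sketch. It is not the code in this PR: the function names, the contiguous split, and the round-robin dealing of keys to workers are assumptions made for illustration only, while the actual change relies on the Shuffler datapipe.

```python
import random
from typing import List, Sequence


def shuffle_per_worker(keys: Sequence[int], num_workers: int, seed: int) -> List[List[int]]:
    """Previous behaviour (sketch): split the keys into contiguous subranges,
    one per worker, and shuffle each subrange on its own."""
    rng = random.Random(seed)
    chunk = max(1, -(-len(keys) // num_workers))  # ceiling division, at least 1
    shards = [list(keys[i:i + chunk]) for i in range(0, len(keys), chunk)]
    for shard in shards:
        rng.shuffle(shard)  # a key never leaves its worker's subrange
    return shards


def shuffle_then_shard(keys: Sequence[int], num_workers: int, seed: int) -> List[List[int]]:
    """New behaviour (sketch): shuffle the whole key range first, then hand
    the shuffled keys out to the workers (round-robin is an assumption)."""
    rng = random.Random(seed)
    shuffled = list(keys)
    rng.shuffle(shuffled)  # any key can end up on any worker
    return [shuffled[w::num_workers] for w in range(num_workers)]


keys = list(range(10))
per_worker = shuffle_per_worker(keys, num_workers=2, seed=0)
whole_range = shuffle_then_shard(keys, num_workers=2, seed=0)
```

In the first function, worker 0 only ever sees keys 0 to 4 and worker 1 only keys 5 to 9, in some order. In the second, each worker can receive keys from anywhere in the range, which is the effect this PR is after.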

In the future we might consider exposing the composed IterDataPipe in addition to (or instead of) the DataLoader, so that for example the user can apply a mapping datapipe before passing it to the DataLoader.

shortcut-integration · 2022-08-10T07:40:37Z

This pull request has been linked to Shortcut Story #19597: Investigate iterable-style DataPipes for Pytorch.

georgeSkoumas · 2022-08-16T09:49:32Z

tiledb/ml/readers/pytorch.py

-        schema.kind is not TensorKind.DENSE for schema in schemas
-    ):
-        raise NotImplementedError("https://github.com/pytorch/pytorch/issues/20248")
+    datapipe_for_key_range = partial(_get_datapipe, schemas)


Please add some comments because it's a tricky part.

gsakkis · Aug 16, 2022

Comments & docstrings added at a93e675
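
For readers following the thread, the `partial` call in the diff above binds `schemas` once, so the resulting callable only needs whatever remaining arguments `_get_datapipe` takes. The sketch below shows the pattern with a hypothetical stand-in. The name `get_datapipe_sketch`, its `start`/`stop` parameters, and the string schemas are assumptions for illustration, not the actual signature in tiledb/ml/readers/pytorch.py.

```python
from functools import partial
from typing import Iterator, Sequence, Tuple


def get_datapipe_sketch(schemas: Sequence[str], start: int, stop: int) -> Iterator[Tuple[str, ...]]:
    """Hypothetical stand-in for _get_datapipe: yield one row per key in
    [start, stop), with one field per schema."""
    for key in range(start, stop):
        yield tuple(f"{schema}[{key}]" for schema in schemas)


# Bind the schemas up front; callers now supply only the key range.
datapipe_for_key_range = partial(get_datapipe_sketch, ["x", "y"])

rows = list(datapipe_for_key_range(0, 2))
```

Binding the shared argument this way lets the same callable be reused for each key subrange without passing `schemas` around again.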

georgeSkoumas

LGTM

Replace the _PyTorchTileDBDataset IterableDataset with an IterDataPipe

57fa672

gsakkis requested review from georgeSkoumas and ktsitsi August 10, 2022 07:40

Run shuffling on the whole data stream

64ee90e

gsakkis force-pushed the gsa/sc-19597/torch-iterdatapipe branch from 1d34834 to 64ee90e Compare August 11, 2022 11:15

georgeSkoumas reviewed Aug 16, 2022

georgeSkoumas approved these changes Aug 16, 2022

Add missing docstrings and comments in readers.pytorch

a93e675

gsakkis merged commit c836ee9 into master Aug 17, 2022

gsakkis deleted the gsa/sc-19597/torch-iterdatapipe branch August 17, 2022 11:58

This was referenced Aug 19, 2022

Fix PyTorchTileDBDataLoader to be iterable more than once #177

Merged

Bump pytorch version to 1.12 #178

Merged
