@gsakkis gsakkis commented Aug 10, 2022

This PR introduces torchdata, a PyTorch library of common modular data loading primitives, so that flexible and performant data pipelines can be composed from small pieces instead of being written as a custom monolithic IterableDataset subclass.

Currently the main practical benefit is related to shuffling:

In the future, we might consider exposing the composed IterDataPipe in addition to (or instead of) the DataLoader so that, for example, the user can apply a mapping datapipe before passing it to the DataLoader.
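
To make the composition idea concrete, here is a minimal, library-free sketch. It does not use torchdata or reproduce this PR's code: `shuffle_buffer`, `map_stage`, `batch_stage`, and `build_pipeline` are hypothetical names, and plain Python generators stand in for datapipes and for the loader. It only illustrates how small iterable stages can be chained, with a user-supplied mapping applied before the final consumer.

```python
import random
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def shuffle_buffer(source: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Hypothetical stage: approximate shuffling with a bounded buffer."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in source:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Once the buffer is full, emit one randomly chosen element.
            yield buffer.pop(rng.randrange(len(buffer)))
    # Drain the remaining elements in random order.
    rng.shuffle(buffer)
    yield from buffer


def map_stage(source: Iterable[T], fn: Callable[[T], U]) -> Iterator[U]:
    """Hypothetical stage: lazily apply fn to every item."""
    for item in source:
        yield fn(item)


def batch_stage(source: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Hypothetical stand-in for the loader: group items into lists."""
    batch: List[T] = []
    for item in source:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def build_pipeline(rows: Iterable[int]) -> Iterator[List[int]]:
    """Chain the stages: shuffle, then map, then batch."""
    shuffled = shuffle_buffer(rows, buffer_size=4, seed=0)
    mapped = map_stage(shuffled, lambda x: x * 10)  # the user-supplied mapping
    return batch_stage(mapped, batch_size=3)


if __name__ == "__main__":
    for batch in build_pipeline(range(10)):
        print(batch)
```

Because each stage only consumes and produces an iterable, a caller holding the composed pipeline can insert or swap a stage (such as the mapping above) without touching the others, which is the flexibility a monolithic subclass does not offer.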

@shortcut-integration

This pull request has been linked to Shortcut Story #19597: Investigate iterable-style DataPipes for Pytorch.

@gsakkis gsakkis force-pushed the gsa/sc-19597/torch-iterdatapipe branch from 1d34834 to 64ee90e Compare August 11, 2022 11:15
        schema.kind is not TensorKind.DENSE for schema in schemas
    ):
        raise NotImplementedError("https://github.com/pytorch/pytorch/issues/20248")
    datapipe_for_key_range = partial(_get_datapipe, schemas)

Please add some comments here, because this is a tricky part.


Comments and docstrings added in a93e675.

@georgeSkoumas georgeSkoumas left a comment

LGTM

@gsakkis gsakkis merged commit c836ee9 into master Aug 17, 2022
@gsakkis gsakkis deleted the gsa/sc-19597/torch-iterdatapipe branch August 17, 2022 11:58