Add iterable dataset support for multiprocess DataLoader #25558

Merged

@heavengate heavengate commented Jul 16, 2020

PR types

New features

PR changes

APIs

Describe

add IterableDataset support for multiprocess DataLoader

  • add paddle.io.IterableDataset base class
  • add paddle.io.get_worker_info to get worker process information for data splitting in IterableDataset

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

def get_worker_info():
    """
    Get DataLoader worker process information function, this function is
    used to splitd data copy in worker process for IterableDataset
Contributor

splitd typo?

Contributor Author

Done, thanks!


class IterableDataset(Dataset):
    """
    An abstract class to encapsulates methods and behaviors of iterable datasets.
Contributor

encapsulates -> encapsulate

Contributor Author

Done, thanks!

An abstract class to encapsulates methods and behaviors of iterable datasets.

All datasets in iterable-style(can only get sample one by one sequentially, like
a python iterater) should be a subclass of `paddle.io.IterableDataset`. All subclasses should
Contributor

iterater -> iterator

Contributor Author

Done, thanks!


:code:`__iter__`: yield sample sequentially. This method is required by reading dataset sample in :code:`paddle.io.DataLoader`.

NOTE: do not implement :code:`__getitem__` and :code:`__len__` in IterableDataset, should not be called either.
Contributor

use new doc style? NOTE -> .. note::

Contributor Author

Done, thanks!
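
To make the note above concrete, here is a minimal, paddle-free sketch of an iterable-style dataset: it exposes only `__iter__`, and random access or length queries fail loudly. `IterableStyleDataset` and `EvenNumbers` are hypothetical names used for illustration, and the choice of `RuntimeError` is an assumption, not a statement about what `paddle.io.IterableDataset` does internally.

```python
class IterableStyleDataset:
    """Hypothetical stand-in for an iterable-style base class (not paddle.io.IterableDataset)."""

    def __iter__(self):
        raise NotImplementedError("subclasses must implement __iter__")

    def __getitem__(self, idx):
        # Random access is meaningless for a sequential stream of samples.
        raise RuntimeError("__getitem__ should not be called on an iterable-style dataset")

    def __len__(self):
        # The length of a stream is generally not known in advance.
        raise RuntimeError("__len__ should not be called on an iterable-style dataset")


class EvenNumbers(IterableStyleDataset):
    """Yields the first `count` even numbers, one sample at a time."""

    def __init__(self, count):
        self.count = count

    def __iter__(self):
        for i in range(self.count):
            yield 2 * i


dataset = EvenNumbers(3)
# Samples can only be consumed sequentially, by iterating.
samples = [sample for sample in dataset]
```

A loader written against this contract only ever calls `iter(dataset)`, which is why `__getitem__` and `__len__` are left unsupported.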

print(img, lbl)

When :attr:`num_workers > 0`, each worker has a different copy of the dataset object and
will yield whole dataset samples, which means samples in dataset will be repeat in
Contributor

repeat -> repeated

Contributor Author

Done, thanks!
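
The duplication described in the docstring above can be reproduced without any worker processes. The sketch below only simulates workers by giving each one its own deep copy of the dataset; `RangeStream` and `simulate_workers` are hypothetical helpers, not part of this PR.

```python
import copy
from collections import Counter


class RangeStream:
    """Hypothetical iterable-style dataset yielding the integers in [start, end)."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __iter__(self):
        return iter(range(self.start, self.end))


def simulate_workers(dataset, num_workers):
    """Each simulated worker iterates its own copy of the dataset in full."""
    collected = []
    for _ in range(num_workers):
        worker_copy = copy.deepcopy(dataset)  # every worker holds a separate copy
        collected.extend(worker_copy)
    return collected


samples = simulate_workers(RangeStream(2, 9), num_workers=2)
counts = Counter(samples)  # how many times each sample was produced
```

With two simulated workers every sample is produced twice, which is the behavior the docstring warns about when no per-worker split is configured.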

@@ -286,6 +288,10 @@ def forward(self, image, label=None):

# -------------------------------------------------------

Note:
Contributor

NOTE -> .. note:: ?

Contributor Author

Done, thanks!

        format(shuffle))
if batch_sampler is not None:
    raise ValueError(
        "IterableDataset expect unspecified batch_sample")
Contributor

batch_sample -> batch_sampler ?

Contributor Author (@heavengate, Aug 6, 2020)

Done, thanks!
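
The hunk above rejects options that make no sense for a sequential stream. A standalone sketch of that kind of check follows; `resolve_dataset_kind` and the string return values are hypothetical, and the `shuffle` branch is inferred from the `format(shuffle))` fragment rather than shown in full in the diff.

```python
def resolve_dataset_kind(is_iterable_dataset, shuffle=False, batch_sampler=None):
    """Return "iter" or "map", rejecting options an iterable-style dataset cannot honor."""
    if not is_iterable_dataset:
        return "map"
    if shuffle:
        # Assumption: shuffling needs random access, which a stream does not offer.
        raise ValueError(
            "an iterable-style dataset does not support shuffle, "
            "got shuffle={}".format(shuffle))
    if batch_sampler is not None:
        # Index-based batch sampling has nothing to index into.
        raise ValueError(
            "an iterable-style dataset expects batch_sampler to be unspecified")
    return "iter"
```

Failing at construction time keeps the error close to the misconfigured call instead of surfacing later inside a worker process.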

while not self._thread_done_event.is_set():
    # For IterableDataset, batch indices is generate infinitely
Contributor

is generate -> is generated

Contributor Author

Done, thanks!
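
One way to read "batch indices are generated infinitely": because a stream has no indices, the sampler can only emit placeholder batches forever, and the loop ends when the underlying iterator runs dry. The sketch below illustrates that idea under this assumption; `InfinitePlaceholderSampler` and `iterate_batches` are hypothetical names and do not reproduce the PR's internal classes.

```python
import itertools


class InfinitePlaceholderSampler:
    """Yields placeholder 'index' batches forever; only their length matters."""

    def __init__(self, batch_size=1):
        self.batch_size = batch_size

    def __iter__(self):
        while True:
            yield [None] * self.batch_size


def iterate_batches(dataset, batch_size, drop_last=False):
    """Pull up to batch_size samples per placeholder batch until the stream ends."""
    data_iter = iter(dataset)
    for placeholders in InfinitePlaceholderSampler(batch_size):
        batch = list(itertools.islice(data_iter, len(placeholders)))
        if not batch:
            return  # stream exhausted: this is what terminates the infinite sampler
        if drop_last and len(batch) < len(placeholders):
            return  # discard the final partial batch
        yield batch
```

For example, `list(iterate_batches(range(5), 2))` groups the five samples into two full batches and one partial batch.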

"""
An abstract class to encapsulates methods and behaviors of iterable datasets.

All datasets in iterable-style(can only get sample one by one sequentially, like
Contributor

iterable-style(can -> iterable-style ( can
python -> Python

Contributor Author

Done, thanks!

place = fluid.CPUPlace()
with fluid.dygraph.guard(place):
    dataset = SplitedIterableDataset(start=2, end=9)
    dataloader = DataLoader(
Contributor

Why is fluid.dygraph.guard needed for DataLoader? Can't it be used in static graph mode?

Contributor Author (@heavengate, Aug 6, 2020)

In static mode, fluid.data variables must be defined and passed as the feed_list parameter, which is not what this test case is about, so the test uses dynamic mode to keep the code simple.

@heavengate changed the title from "Add iterable dataset support" to "Add iterable dataset support for multiprocess DataLoader" on Aug 7, 2020
def __init__(self, dataset, batch_size=1):
    assert isinstance(
        dataset, IterableDataset
    ), "dataset should be an instnace of paddle.io.IterableDataset"
Contributor

instance

Contributor Author

Done, thanks!


When :attr:`num_workers > 0`, each worker has a different copy of the dataset object and
will yield whole dataset samples, which means samples in dataset will be repeated in
:attr:`num_workers` times. If it is require that each sample to be yield only once, there
Contributor

If it is required for each sample to yield once only, ...

Contributor Author

Done, thanks!

will yield whole dataset samples, which means samples in dataset will be repeated in
:attr:`num_workers` times. If it is require that each sample to be yield only once, there
are two methods to configure different copy in each worker process to avoid duplicate data
among workers as follows. In both the two methods, worker information that can be get in
Contributor

In both the methods, ... can be getted in...

Contributor Author

Done, thanks!
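
As a rough illustration of using worker information to avoid duplicates, the sketch below splits a range `[start, end)` into contiguous per-worker shards. It is paddle-free: `WorkerInfo` is a hypothetical stand-in for whatever `paddle.io.get_worker_info` returns, and the field names `num_workers` and `id` are assumptions made for this example.

```python
import math
from collections import namedtuple

# Hypothetical stand-in for the object returned inside a worker process.
WorkerInfo = namedtuple("WorkerInfo", ["num_workers", "id"])


def worker_range(start, end, worker_info=None):
    """Return the [iter_start, iter_end) slice one worker should iterate."""
    if worker_info is None:
        # Single-process loading: this process owns the whole range.
        return start, end
    per_worker = int(math.ceil((end - start) / float(worker_info.num_workers)))
    iter_start = start + worker_info.id * per_worker
    iter_end = min(iter_start + per_worker, end)
    return iter_start, iter_end


# The range used in the PR's test case, split across two workers.
shards = [worker_range(2, 9, WorkerInfo(num_workers=2, id=i)) for i in range(2)]
```

A dataset's `__iter__` could call such a helper and then loop over `range(iter_start, iter_end)`, so that each sample is yielded by exactly one worker.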

@@ -136,7 +137,8 @@ class DataLoader(object):

 Args:
     dataset(Dataset): the dataset to load data from, should be an
-        instance of subclass of :code:`paddle.io.Dataset`.
+        instance of subclass of :code:`paddle.io.Dataset` or
+        :code:`paddle.io.IterableDataset`.
     feed_list (list(Variable)|tuple(Variable)): feed variable list.
Contributor

According to the new documentation guidelines, every mention of "variable" should be changed to "tensor": feed_list (list(Tensor)|tuple(Tensor)): feed tensor list. Please update the wording in the other places as well.

Contributor Author

Done, thanks!

"""
Get DataLoader worker process information function, this function is
used to split data copy in worker process for IterableDataset
(see :code:`paddle.io.IterableDataset`), worker informations contains
Contributor

information

Contributor Author

Done, thanks!

Contributor

@chenwhql left a comment

LGTM

Contributor

@Heeenrrry left a comment

LGTM

Contributor

@guoshengCS left a comment

LGTM

        "IterableDataset expect unspecified batch_sampler")
else:
    self.dataset_kind = _DatasetKind.MAP

if batch_sampler is not None:
    assert isinstance(batch_sampler, BatchSampler), \
        "batch_sampler should be None or subclass instance " \
Contributor

Can we remove the BatchSampler check later? Maybe it could also be an iterable object.

Additionally, can we support a user-specified batch_sampler for IterableDataset later? It seems that users can't customize sampling or batching strategies themselves, since iterable data is only supported through IterableDataset and _InfiniteIterableSampler.

And will we also support Sampler, in addition to BatchSampler, later?

Contributor Author

BatchSampler can be customized for a map-style dataset (one that implements __getitem__). For IterableDataset, which can only get samples sequentially, I couldn't think of scenarios that require batch_sampler customization. Of course, it should be supported if there are customization requirements.

Sampler is mostly a sub-function of BatchSampler. IMHO, a custom Sampler can be defined inside a custom BatchSampler?

Contributor

An example of batch_sampler customization is Transformer: it counts batch size by number of words rather than number of sentences. Currently it uses a map-style dataset.

When custom batching strategies are needed, Sampler may be abstracted out of BatchSampler so that the sampling strategies can be reused.

However, this is not a blocker and we can consider it later. I am also trying to provide some helpers so that it can be used like this

Contributor Author

Thanks, I'll try to do some research and try to add this later~
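
For readers unfamiliar with the Transformer-style batching mentioned above, here is a rough sketch of grouping a sequential stream of sentences by total word count instead of sentence count. `batch_by_token_count` is a hypothetical helper, and summing token counts is a simplification chosen for this example (real implementations often budget for padded length); it is not the strategy used by any particular Paddle model.

```python
def batch_by_token_count(sentences, max_tokens):
    """Group a stream of tokenized sentences so each batch holds at most max_tokens tokens.

    A single sentence longer than max_tokens still forms its own batch.
    """
    batch, tokens_in_batch = [], 0
    for sentence in sentences:
        num_tokens = len(sentence)
        if batch and tokens_in_batch + num_tokens > max_tokens:
            # Adding this sentence would exceed the budget: emit the current batch.
            yield batch
            batch, tokens_in_batch = [], 0
        batch.append(sentence)
        tokens_in_batch += num_tokens
    if batch:
        yield batch  # final, possibly smaller, batch


stream = [["a", "b"], ["c"], ["d", "e", "f"], ["g"]]
batches = list(batch_by_token_count(stream, max_tokens=3))
```

Because it only needs sequential access, this kind of batching works on an iterable-style dataset without a batch_sampler.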

@heavengate heavengate merged commit dbc88bb into PaddlePaddle:develop Aug 12, 2020
@heavengate heavengate deleted the add_iterable_dataset_support branch August 12, 2020 02:30
7 participants