
Low level reader design #4394

Closed

Conversation

Contributor

@typhoonzero typhoonzero commented Sep 26, 2017

This design doc describes a solution for #3675

Better to review from here


## Background

We have several issues reporting that training with the v2 API gets bad training performance,
Contributor

@helinwang helinwang Sep 29, 2017


I think we need more profiling before reaching the conclusion that the v2 API gets bad training performance because of the data reader (and to understand where the bottleneck is). Please take a look at this comment; we don't have enough evidence that the reader is the reason for the low performance.

That being said, the reader performance can be improved (one tip for the users and one actionable item for the development team):

  1. Use the paddle.reader.buffered decorator to prefetch the data in a background thread. Please see the example here (a minimal sketch also follows this list).

  2. Even if data is prefetched from a background thread, the paddle.train implementation does not pre-transfer the training data from the Python world to the C++ world. We can add that support without modifying the reader interface that the user uses.
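As a rough illustration of tip 1, here is a minimal sketch (assuming the v2 API, with a made-up `my_reader`) of wrapping a user-defined reader with `paddle.reader.buffered` so that samples are prefetched in a background thread before batching:

```python
import paddle.v2 as paddle


def my_reader():
    # Hypothetical user-defined reader: yields one (feature, label) sample
    # per iteration; replace with real data loading.
    for i in range(10000):
        yield [float(i % 255)] * 784, i % 10


# Prefetch up to 1024 samples in a background thread, then batch them.
buffered_reader = paddle.reader.buffered(my_reader, 1024)
train_reader = paddle.batch(buffered_reader, batch_size=128)

# train_reader is then passed to the trainer exactly like an ordinary reader.
```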

Contributor Author


That's correct. My first thought was, before implementing the reader, to benchmark the v1 reader and the v2 reader by running a training instance with the same dataset and model, and then measure the throughput of training, e.g. in MB/s.
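A simple way to get that throughput number, independent of the model, might look like the sketch below; `dummy_reader` is a stand-in and the sizes are made up, so this is only a template for benchmarking the real v1/v2 readers:

```python
import time

import numpy as np


def dummy_reader():
    # Stand-in reader: yields a 784-dim float32 feature and an int label;
    # replace with the real v1 or v2 dataset reader to compare them.
    sample = np.random.rand(784).astype("float32")
    for i in range(100000):
        yield sample, i % 10


def measure_throughput(reader, warmup=1000, measured=50000):
    nbytes = 0
    start = time.time()
    for i, (feature, label) in enumerate(reader()):
        if i == warmup:
            start = time.time()  # restart the clock after warm-up
            nbytes = 0
        nbytes += feature.nbytes + 8  # rough per-sample size in bytes
        if i >= warmup + measured:
            break
    elapsed = time.time() - start
    return nbytes / elapsed / (1 << 20)  # MB/s


print("reader throughput: %.1f MB/s" % measure_throughput(dummy_reader))
```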

Contributor

@gongweibao gongweibao Oct 1, 2017


In past inference jobs, we could get >=80% GPU utilization just by using a multi-threaded reader and a queue, just like v2 does now.

  1. We need a benchmark to prove that copying from Python to C++ is the bottleneck (a rough way to measure this is sketched after the quote below).
  2. In most scenarios it is other parts, not copying data, that take the longest time. We have to consider how much things will improve after we improve the reading speed.

#3675 (comment)

Later we removed all the generalized features of the users and the video content and kept only the id features. The reader time did not change (because the data fields being read are still the same as before; they are simply not used during model training, so the processing time of this part is certainly the same),
but the processing time of each pass became about 10x faster, dropping from roughly 70 seconds to roughly 7 seconds; the detailed numbers are given in the linked comment.
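As a crude starting point for the benchmark asked for in point 1, one could time how long it takes to turn a Python-side batch into a contiguous float32 buffer as a proxy for the Python-to-C++ feed cost, then compare that against the measured time of one training batch. The batch shape below is made up:

```python
import time

import numpy as np

BATCH_SIZE = 128
DIM = 784

# A batch roughly as the v2 reader hands it over: a list of per-sample lists.
batch = [[float(j) for j in range(DIM)] for _ in range(BATCH_SIZE)]

start = time.time()
for _ in range(100):
    dense = np.array(batch, dtype="float32")  # copy into contiguous memory
feed_cost = (time.time() - start) / 100

print("approx. feed cost per batch: %.3f ms" % (feed_cost * 1000))
# If this is a small fraction of the per-batch training time, copying data
# is unlikely to be the bottleneck.
```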

```c++
Tensor* parse(const std::string& line);
```
1. Buffer:
A `DoubleBuffer` that is able to asynchronously load data while calculations are running.
Contributor


I think we can have a queue OP and connect the reader OP to the queue OP. In this way the reader OP does not need to implement a double buffer, which would duplicate the functionality of the queue OP.
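To make the reader-OP-plus-queue-OP idea concrete, here is a language-agnostic sketch in plain Python (not the actual OP implementation): a producer thread reads samples into a bounded queue while the consumer drains it, so the reader itself needs no double buffer.

```python
import threading

try:
    from Queue import Queue  # Python 2
except ImportError:
    from queue import Queue  # Python 3


def reader():
    # Stand-in for the reader OP: yields raw samples.
    for i in range(1000):
        yield i


def producer(q):
    for sample in reader():
        q.put(sample)  # blocks when the queue is full (back pressure)
    q.put(None)        # sentinel marking the end of data


q = Queue(maxsize=256)  # the "queue OP": a bounded prefetch buffer
threading.Thread(target=producer, args=(q,)).start()

while True:
    sample = q.get()    # the consumer (trainer) side drains the queue
    if sample is None:
        break
    # ... run one training step on `sample` ...
```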

Contributor Author


Yep. This is a better design. I should have split the design for the layer-based reader and op-based reader.

Also, the queue op can be used for queueing messages between multi-device and multi-node training.

1. Buffer:
A `DoubleBuffer` that is able to asynchronously load data while calculations are running.
We can choose between a general "memory buffer" and an "mmap" buffer.
1. Thread pool:
Contributor

@helinwang helinwang Sep 29, 2017


In my opinion, after the refactor, the thread pool should be managed by the scheduler, independent of the reader OP.

For the current V2 reader, we can use the paddle.reader.buffered decorator to prefetch the data in a background thread. Please see the example here.

A low-level reader contains the components below:

1. Data parser:
A data parser will use `dlopen` to open a user-defined parser plugin, and call
Contributor

@helinwang helinwang Sep 29, 2017


With the current V2 API data reader, the user can implement the C++ parser as a Python module and call it from the user-defined data reader. Maybe we don't have to do anything on our side.

After the refactor, the reader will be an OP. In this case we can have a post-process OP that allows dlopen-ing any *.so file. It's a cool idea.

Curious what the priority of this feature is, e.g., is there any feature request from users that needs custom post-processing code? I think there is a limited set of post-processing steps, such as image enhancement and mean subtraction; maybe we can provide the corresponding OPs so users don't have to write their own versions. Just my 2 cents.

Contributor Author


> the user can implement the C++ parser as a Python module and call it from the user-defined data reader. Maybe we don't have to do anything on our side.

Do you mean a "Python extension"? If so, maybe this will be less efficient because the data must be transferred through python-extension.so -> python -> numpy (an extension too) -> paddle C++.

> what is the priority of this feature?

Improving v2 reader performance is a high priority because users need to improve their jobs.

> any feature request from users that needs custom post-processing code

What do you mean by "post-processing"?

Contributor

@helinwang helinwang Sep 29, 2017


Thanks for the reply!

> Do you mean a "Python extension"? If so, maybe this will be less efficient because the data must be transferred through python-extension.so -> python -> numpy (an extension too) -> paddle C++

Yes, that's what I mean. This cross-language inefficiency is due to the fact that the data reader is in Python. I think if we want to eliminate this part of the inefficiency, the user should use the read-data OP after the refactor.

> What do you mean by "post-processing"?

My bad, I meant "pre-processing". What I have in mind is that we will have a binary reader OP which outputs bytes, so the MyFeeder class (that converts bytes to Tensor) you mentioned is actually a "pre-processing" module. We are eventually talking about the same thing. But I think we should classify it as a "pre-processing" module; that way the implementation only cares about the parsing and does not need to care about the reading (it can take different readers as input).

> what is the priority of this feature?

Sorry, I meant: what is the priority of allowing users to write a custom feeder OP (equivalent to a pre-processing OP, for the reason mentioned above) vs. us implementing many common pre-processing OPs for users to use?

I agree that "improving v2 reader performance is a high priority because users need to improve their jobs", but the reader performance will be solved once we have a reader OP and a queue OP. I think "do we want to allow users to write a custom *.so file that PaddlePaddle can dlopen" is more a flexibility problem than a performance problem.

The plugin should implement one interface:
```c++
// return Matrix* for v2 API call
Tensor* parse(const std::string& line);
```
Contributor

@lcy-seso lcy-seso Sep 29, 2017


In my own experience, I agree that if a user is not careful enough, the current reader interface may cause some efficiency problems, especially when dense vectors or dense vector sequences are used as inputs.

The data feeding process always involves:

  1. Implementing a Python module to read the original data from disk (sometimes users do not implement this carefully, which also causes efficiency problems) and returning a Python list/tuple or a numpy array;
  2. Some related operations, for example shuffling, which is always a concern in SGD training;
  3. Constructing a data batch and converting it into a Paddle Argument (in the V2 API, an Argument contains more than the Matrix; it also carries sequence information) / Tensor.

Currently, Paddle does steps 2 and 3; these steps are not exposed to the users.

This design allows users to parse the raw data and directly feed an Argument (it cannot be a Matrix in V2) / Tensor (or a LoDTensor?) to the C++ end.


I have some questions:

  1. Here, does the user return one sample or a sample batch?
    I guess the user returns a single sample here? Am I right?

  2. Who will do the shuffle?
    I think leaving shuffling to users is acceptable, but if we only require the user to parse one sample, we have to implement shuffling ourselves. However, it is also very easy to directly shuffle the data pool, so it may not be a big problem....

  3. Are the buffer and thread pool all implemented by the users or by Paddle?


If a user returns a sample batch:

  • Does this mean we have to expose too many details about how data is organized in Paddle's C++ code? Is this necessary?

    How data is organized in Paddle's C++ code may become much simpler after refactoring, because all types of input data are unified into Tensor (or LoDTensor?).
    But returning a Matrix is not enough in the v2 version: sequence information is stored separately in sequenceStartPositions and subSequenceStartPositions, and we even have Matrix/SparseMatrix for the different types of inputs.

  • I am not very sure: from this interface, `Tensor* parse(const std::string& line);`, it seems that the user directly returns a data batch. Does this mean the user should allocate the memory and release it correctly?

  • If a data batch is directly returned, does this mean the user should also consider shuffling the data and return an already shuffled data batch? Shuffling is now done in a Python reader decorator. Do we intend to leave all these details to the user?

Contributor Author


Thanks for such detailed comments.

> Here, does the user return one sample or a sample batch?
> I guess the user returns a single sample here? Am I right?

You are right. The user only needs to define a "parser" that parses one raw sample into a Tensor.

> Who will do the shuffle?

Let Paddle do it, of course. The low-level reader is not so "low-level" that the user has to do everything. The user interface should be the same as the Python one, except that it uses C++.

> Are the buffer and thread pool all implemented by the users or by Paddle?

Same as above: let Paddle implement them.


> Does this mean we have to expose too many details about how data is organized in Paddle's C++ code? Is this necessary?

We only expose Tensor and Vector (for v1/v2).

> I am not very sure: from this interface, `Tensor* parse(const std::string& line);`, it seems that the user directly returns a data batch. Does this mean the user should allocate the memory and release it correctly?

The user should allocate the memory but should not release it; let Paddle do that job.

> If a data batch is directly returned, does this mean the user should also consider shuffling the data and return an already shuffled data batch? Shuffling is now done in a Python reader decorator. Do we intend to leave all these details to the user?

Buffering, queueing, and shuffling should be done by Paddle, e.g.:

reader = api.low_level_reader(parser="myparser.so", buffer_size=8192, buffer_type="mmap", batch_size=128, shuffle=True)
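For comparison, the proposed api.low_level_reader call above (a hypothetical API) roughly consolidates what is composed today from the existing v2 Python decorators; a sketch with a made-up raw_reader:

```python
import paddle.v2 as paddle


def raw_reader():
    # Stand-in for the user's per-sample reader.
    for i in range(10000):
        yield [float(i % 255)] * 784, i % 10


reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.reader.buffered(raw_reader, 8192),  # prefetch in background
        buf_size=8192),                            # shuffle within a buffer
    batch_size=128)
```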

Contributor


I see, thank you~

Contributor

@helinwang helinwang Sep 29, 2017


I like the interface:

reader = api.low_level_reader(parser="myparser.so", buffer_size=8192, buffer_type="mmap", batch_size=128, shuffle=True)

I am a little uncertain about whether we need to allow users to provide a custom *.so file, because it has associated problems:

  1. Users need to compile a *.so file for the same architecture as Paddle Cloud and upload it to the cloud.
  2. Paddle Cloud after the refactor is just a runtime engine that executes a user-defined graph; no user code runs on it. But allowing a custom *.so file would let user code run on Paddle Cloud after the refactor.

I agree that a *.so file is more flexible. However, I am curious: if we implement some common parsing OPs, will that be sufficient? Like:

reader = api.low_level_reader(parser_chain=[paddle.decode_jpeg, paddle.substract_mean], buffer_size=8192, batch_size=128, shuffle=True)

What do you think? @lcy-seso @typhoonzero
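paddle.decode_jpeg and paddle.substract_mean above are hypothetical OPs. As a generic illustration of the parser_chain idea, chaining per-sample transforms can be as simple as folding a list of functions over each sample the reader yields; all names below are made up:

```python
import numpy as np


def decode(raw_bytes):
    # Stand-in decode step: raw bytes -> float32 ndarray.
    return np.frombuffer(raw_bytes, dtype="uint8").astype("float32")


def subtract_mean(img):
    return img - img.mean()


def chain_parsers(reader, parser_chain):
    def parsed_reader():
        for sample in reader():
            for parse in parser_chain:
                sample = parse(sample)
            yield sample
    return parsed_reader


def raw_reader():
    for _ in range(100):
        yield b"\x01\x02\x03\x04" * 196  # fake raw sample bytes


reader = chain_parsers(raw_reader, parser_chain=[decode, subtract_mean])
```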

Contributor

@lcy-seso lcy-seso Sep 30, 2017


I am also concerned that compiling a *.so is not very convenient. Personally, I prefer to implement some common parsing OPs and enhance the buffering, queueing, and shuffling on the PaddlePaddle end, which can be transparent to users. The low-level API can be exposed to users if they are extremely concerned about accelerating the I/O speed.

Contributor Author


Common parsers are fine, like parsers that process specific datasets. We can gather these parser implementations gradually and replace the data readers under python/paddle/v2/datasets.

For normal jobs where the reader's performance is not a problem, they can use: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/reader/decorator.py#L162

Jobs that cannot get >80% GPU utilization can consider using the low-level reader.

@helinwang
Contributor

helinwang commented Oct 3, 2017

Love the low-level reader idea.
I think we need to discuss whether we want to allow custom *.so files. They are flexible but complex; most people do not know how to use them / do not need them. Maybe those who want to use them can contribute a specific low-level reader OP to PaddlePaddle?

@typhoonzero typhoonzero closed this Apr 2, 2018