
Low level reader design #4394

Closed

Conversation

Contributor

@typhoonzero typhoonzero commented Sep 26, 2017

This design doc describes a solution for #3675

Better to review from here


## Background

We have several issues reporting that training with the v2 API gets bad training performance,
Contributor

@helinwang helinwang Sep 29, 2017


I think we need more profiling before reaching the conclusion that the v2 API gets bad training performance because of the data reader (and to understand where the bottleneck is). Please take a look at this comment; we don't have enough evidence that the reader is the reason for the low performance.

That being said, the reader performance can be improved (one tip for the users and one actionable item for the development team):

  1. Use the paddle.reader.buffered decorator to prefetch the data in a background thread. Please see the example here (a minimal sketch also follows this list).

  2. Even if data is prefetched from a background thread, the paddle.train implementation does not pre-transfer the training data from the Python world to the C++ world. We can add that support without modifying the reader interface that the user uses.
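As a rough illustration of tip 1, here is a minimal sketch (assuming the v2 API, with a made-up `my_reader`) of wrapping a user-defined reader with `paddle.reader.buffered` so that samples are prefetched in a background thread before batching:

```python
import paddle.v2 as paddle


def my_reader():
    # Hypothetical user-defined reader: yields one (feature, label) sample
    # per iteration; replace with real data loading.
    for i in range(10000):
        yield [float(i % 255)] * 784, i % 10


# Prefetch up to 1024 samples in a background thread, then batch them.
buffered_reader = paddle.reader.buffered(my_reader, 1024)
train_reader = paddle.batch(buffered_reader, batch_size=128)

# train_reader is then passed to the trainer exactly like an ordinary reader.
```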

Contributor Author


That's correct. My first thought was, before implementing the reader, to benchmark the v1 reader and the v2 reader by running a training instance with the same dataset and model, and then measure the throughput of training, e.g. in MB/s.
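A simple way to get that throughput number, independent of the model, might look like the sketch below; `dummy_reader` is a stand-in and the sizes are made up, so this is only a template for benchmarking the real v1/v2 readers:

```python
import time

import numpy as np


def dummy_reader():
    # Stand-in reader: yields a 784-dim float32 feature and an int label;
    # replace with the real v1 or v2 dataset reader to compare them.
    sample = np.random.rand(784).astype("float32")
    for i in range(100000):
        yield sample, i % 10


def measure_throughput(reader, warmup=1000, measured=50000):
    nbytes = 0
    start = time.time()
    for i, (feature, label) in enumerate(reader()):
        if i == warmup:
            start = time.time()  # restart the clock after warm-up
            nbytes = 0
        nbytes += feature.nbytes + 8  # rough per-sample size in bytes
        if i >= warmup + measured:
            break
    elapsed = time.time() - start
    return nbytes / elapsed / (1 << 20)  # MB/s


print("reader throughput: %.1f MB/s" % measure_throughput(dummy_reader))
```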

Contributor

@gongweibao gongweibao Oct 1, 2017


In past inference jobs, we could get >=80% GPU utilization just by using a multi-threaded reader and a queue, just like v2 does now.

  1. We need a benchmark to prove that copying from Python to C++ is the bottleneck (a rough way to measure this is sketched after the quote below).
  2. In most scenarios it is other parts, not copying data, that take the longest time. We have to consider how much things will improve after we improve the reading speed.

#3675 (comment)

Later we removed all the generalized features of the users and the video content and kept only the id features. The reader time did not change (because the data fields being read are still the same as before; they are simply not used during model training, so the processing time of this part is certainly the same),
but the processing time of each pass became about 10x faster, dropping from roughly 70 seconds to roughly 7 seconds; the detailed numbers are given in the linked comment.
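As a crude starting point for the benchmark asked for in point 1, one could time how long it takes to turn a Python-side batch into a contiguous float32 buffer as a proxy for the Python-to-C++ feed cost, then compare that against the measured time of one training batch. The batch shape below is made up:

```python
import time

import numpy as np

BATCH_SIZE = 128
DIM = 784

# A batch roughly as the v2 reader hands it over: a list of per-sample lists.
batch = [[float(j) for j in range(DIM)] for _ in range(BATCH_SIZE)]

start = time.time()
for _ in range(100):
    dense = np.array(batch, dtype="float32")  # copy into contiguous memory
feed_cost = (time.time() - start) / 100

print("approx. feed cost per batch: %.3f ms" % (feed_cost * 1000))
# If this is a small fraction of the per-batch training time, copying data
# is unlikely to be the bottleneck.
```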

```c++
Tensor* parse(const std::string& line);
```
1. Buffer:
A `DoubleBuffer` that is able to asynchronously load data while calculations are running.
Contributor


I think we can have a queue OP and connect the reader OP to the queue OP. In this way the reader OP does not need to implement a double buffer, which would duplicate the functionality of the queue OP.
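To make the reader-OP-plus-queue-OP idea concrete, here is a language-agnostic sketch in plain Python (not the actual OP implementation): a producer thread reads samples into a bounded queue while the consumer drains it, so the reader itself needs no double buffer.

```python
import threading

try:
    from Queue import Queue  # Python 2
except ImportError:
    from queue import Queue  # Python 3


def reader():
    # Stand-in for the reader OP: yields raw samples.
    for i in range(1000):
        yield i


def producer(q):
    for sample in reader():
        q.put(sample)  # blocks when the queue is full (back pressure)
    q.put(None)        # sentinel marking the end of data


q = Queue(maxsize=256)  # the "queue OP": a bounded prefetch buffer
threading.Thread(target=producer, args=(q,)).start()

while True:
    sample = q.get()    # the consumer (trainer) side drains the queue
    if sample is None:
        break
    # ... run one training step on `sample` ...
```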

Contributor Author


Yep. This is a better design. I should have split the design for the layer-based reader and op-based reader.

Also, the queue op can be used for queueing messages between multi-device and multi-node training.

1. Buffer:
A `DoubleBuffer` that is able to asynchronously load data while calculations are running.
We can choose between a general "memory buffer" and an "mmap" buffer.
1. Thread pool:
Contributor

@helinwang helinwang Sep 29, 2017


In my opinion, after the refactor, the thread pool should be managed by the scheduler, independent of the reader OP.

For the current V2 reader, we can use the paddle.reader.buffered decorator to prefetch the data in a background thread. Please see the example here.

A low-level reader contains the components below:

1. Data parser:
A data parser will use `dlopen` to open a user-defined parser plugin, and call
Contributor

@helinwang helinwang Sep 29, 2017


With the current V2 API data reader, the user can implement the C++ parser as a Python module and call it from the user-defined data reader. Maybe we don't have to do anything on our side.

After the refactor, the reader will be an OP. In this case we can have a post-process OP that allows dlopen-ing any *.so file. It's a cool idea.

Curious what the priority of this feature is, e.g., is there any feature request from users that needs custom post-processing code? I think there is a limited set of post-processing steps, such as image enhancement and mean subtraction; maybe we can provide the corresponding OPs so users don't have to write their own versions. Just my 2 cents.

Contributor Author


> the user can implement the C++ parser as a Python module and call it from the user-defined data reader. Maybe we don't have to do anything on our side.

Do you mean a "Python extension"? If so, maybe this will be less efficient because the data must be transferred through python-extension.so -> python -> numpy (an extension too) -> paddle C++.

> what is the priority of this feature?

Improving v2 reader performance is a high priority because users need to improve their jobs.

> any feature request from users that needs custom post-processing code

What do you mean by "post-processing"?

Contributor

@helinwang helinwang Sep 29, 2017


Thanks for the reply!

> Do you mean a "Python extension"? If so, maybe this will be less efficient because the data must be transferred through python-extension.so -> python -> numpy (an extension too) -> paddle C++

Yes, that's what I mean. This cross-language inefficiency is due to the fact that the data reader is in Python. I think if we want to eliminate this part of the inefficiency, the user should use the read-data OP after the refactor.

> What do you mean by "post-processing"?

My bad, I meant "pre-processing". What I have in mind is that we will have a binary reader OP which outputs bytes, so the MyFeeder class (that converts bytes to Tensor) you mentioned is actually a "pre-processing" module. We are eventually talking about the same thing. But I think we should classify it as a "pre-processing" module; that way the implementation only cares about the parsing and does not need to care about the reading (it can take different readers as input).

> what is the priority of this feature?

Sorry, I meant: what is the priority of allowing users to write a custom feeder OP (equivalent to a pre-processing OP, for the reason mentioned above) vs. us implementing many common pre-processing OPs for users to use?

I agree that "improving v2 reader performance is a high priority because users need to improve their jobs", but the reader performance will be solved once we have a reader OP and a queue OP. I think "do we want to allow users to write a custom *.so file that PaddlePaddle can dlopen" is more a flexibility problem than a performance problem.

The plugin should implement one interface:
```c++
// return Matrix* for v2 API call
Tensor* parse(const std::string& line);
```
Contributor

@lcy-seso lcy-seso Sep 29, 2017


In my own experience, I agree that if a user is not careful enough, the current reader interface may cause some efficiency problems, especially when dense vectors or dense vector sequences are used as inputs.

The data feeding process always involves:

  1. Implementing a Python module to read the original data from disk (sometimes users do not implement this carefully, which also causes efficiency problems) and returning a Python list/tuple or a numpy array;
  2. Some related operations, for example shuffling, which is always a concern in SGD training;
  3. Constructing a data batch and converting it into a Paddle Argument (in the V2 API, an Argument contains more than the Matrix; it also carries sequence information) / Tensor.

Currently, Paddle does steps 2 and 3; these steps are not exposed to the users.

This design allows users to parse the raw data and directly feed an Argument (it cannot be a Matrix in V2) / Tensor (or a LoDTensor?) to the C++ end.


I have some questions:

  1. Here, does the user return one sample or a sample batch?
    I guess the user returns a single sample here? Am I right?

  2. Who will do the shuffle?
    I think leaving shuffling to users is acceptable, but if we only require the user to parse one sample, we have to implement shuffling ourselves. However, it is also very easy to directly shuffle the data pool, so it may not be a big problem....

  3. Are the buffer and thread pool all implemented by the users or by Paddle?


If a user returns a sample batch:

  • Does this mean we have to expose too many details about how data is organized in Paddle's C++ code? Is this necessary?

    How data is organized in Paddle's C++ code may become much simpler after refactoring, because all types of input data are unified into Tensor (or LoDTensor?).
    But returning a Matrix is not enough in the v2 version: sequence information is stored separately in sequenceStartPositions and subSequenceStartPositions, and we even have Matrix/SparseMatrix for the different types of inputs.

  • I am not very sure: from this interface, `Tensor* parse(const std::string& line);`, it seems that the user directly returns a data batch. Does this mean the user should allocate the memory and release it correctly?

  • If a data batch is directly returned, does this mean the user should also consider shuffling the data and return an already shuffled data batch? Shuffling is now done in a Python reader decorator. Do we intend to leave all these details to the user?

Contributor Author


Thanks for such detailed comments.

> Here, does the user return one sample or a sample batch?
> I guess the user returns a single sample here? Am I right?

You are right. The user only needs to define a "parser" that parses one raw sample into a Tensor.

> Who will do the shuffle?

Let Paddle do it, of course. The low-level reader is not so "low-level" that the user has to do everything. The user interface should be the same as the Python one, except that it uses C++.

> Are the buffer and thread pool all implemented by the users or by Paddle?

Same as above: let Paddle implement them.


> Does this mean we have to expose too many details about how data is organized in Paddle's C++ code? Is this necessary?

We only expose Tensor and Vector (for v1/v2).

> I am not very sure: from this interface, `Tensor* parse(const std::string& line);`, it seems that the user directly returns a data batch. Does this mean the user should allocate the memory and release it correctly?

The user should allocate the memory but should not release it; let Paddle do that job.

> If a data batch is directly returned, does this mean the user should also consider shuffling the data and return an already shuffled data batch? Shuffling is now done in a Python reader decorator. Do we intend to leave all these details to the user?

Buffering, queueing, and shuffling should be done by Paddle, e.g.:

reader = api.low_level_reader(parser="myparser.so", buffer_size=8192, buffer_type="mmap", batch_size=128, shuffle=True)
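For comparison, the proposed api.low_level_reader call above (a hypothetical API) roughly consolidates what is composed today from the existing v2 Python decorators; a sketch with a made-up raw_reader:

```python
import paddle.v2 as paddle


def raw_reader():
    # Stand-in for the user's per-sample reader.
    for i in range(10000):
        yield [float(i % 255)] * 784, i % 10


reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.reader.buffered(raw_reader, 8192),  # prefetch in background
        buf_size=8192),                            # shuffle within a buffer
    batch_size=128)
```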

Contributor


I see, thank you~

Contributor

@helinwang helinwang Sep 29, 2017


I like the interface:

reader = api.low_level_reader(parser="myparser.so", buffer_size=8192, buffer_type="mmap", batch_size=128, shuffle=True)

I am a little uncertain about whether we need to allow users to provide a custom *.so file, because it has associated problems:

  1. Users need to compile a *.so file for the same architecture as Paddle Cloud and upload it to the cloud.
  2. Paddle Cloud after the refactor is just a runtime engine that executes a user-defined graph; no user code runs on it. But allowing a custom *.so file would let user code run on Paddle Cloud after the refactor.

I agree that a *.so file is more flexible. However, I am curious: if we implement some common parsing OPs, will that be sufficient? Like:

reader = api.low_level_reader(parser_chain=[paddle.decode_jpeg, paddle.substract_mean], buffer_size=8192, batch_size=128, shuffle=True)

What do you think? @lcy-seso @typhoonzero
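paddle.decode_jpeg and paddle.substract_mean above are hypothetical OPs. As a generic illustration of the parser_chain idea, chaining per-sample transforms can be as simple as folding a list of functions over each sample the reader yields; all names below are made up:

```python
import numpy as np


def decode(raw_bytes):
    # Stand-in decode step: raw bytes -> float32 ndarray.
    return np.frombuffer(raw_bytes, dtype="uint8").astype("float32")


def subtract_mean(img):
    return img - img.mean()


def chain_parsers(reader, parser_chain):
    def parsed_reader():
        for sample in reader():
            for parse in parser_chain:
                sample = parse(sample)
            yield sample
    return parsed_reader


def raw_reader():
    for _ in range(100):
        yield b"\x01\x02\x03\x04" * 196  # fake raw sample bytes


reader = chain_parsers(raw_reader, parser_chain=[decode, subtract_mean])
```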

Contributor

@lcy-seso lcy-seso Sep 30, 2017


I am also concerned that compiling a *.so is not very convenient. Personally, I prefer to implement some common parsing OPs and enhance the buffering, queueing, and shuffling on the PaddlePaddle end, which can be transparent to users. The low-level API can be exposed to users if they are extremely concerned about accelerating the I/O speed.

Contributor Author


Common parsers are fine, like parsers that process specific datasets. We can gather these parser implementations gradually and replace the data readers under python/paddle/v2/datasets.

For normal jobs where the reader's performance is not a problem, they can use: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/reader/decorator.py#L162

Jobs that cannot get >80% GPU utilization can consider using the low-level reader.

@helinwang
Contributor

helinwang commented Oct 3, 2017

Love the low-level reader idea.
I think we need to discuss whether we want to allow custom *.so files. They are flexible but complex; most people do not know how to use them / do not need them. Maybe those who want to use them can contribute a specific low-level reader OP to PaddlePaddle?

@typhoonzero typhoonzero closed this Apr 2, 2018