Create data_pipeline.md #10179

wangkuiyi · 2018-04-24T20:05:54Z

No description provided.

abhinavarora

Thank you for the document. This is really useful. Apart from a few minor language issues, everything else looks pretty good.

abhinavarora · 2018-04-24T20:41:57Z

doc/fluid/design/data_pipeline/data_pipeline.md

+
+### Case 1: Data from Files
+
+Consider a Fluid trianing program, `resnet50.py`, needs to read data from disk:


trianing -> training

Can we have a link to resnet.py?

Sorry I don't have a resnet.py program. It is just a placeholder here.

abhinavarora · 2018-04-24T20:46:45Z

doc/fluid/design/data_pipeline/data_pipeline.md

+cat data | python resnet50.py
+```
+
+The fact that the who collect the data might not be the same person who wrote `resnet50.py` inspires that


This line can be re-phrased as follows:

Since the person who collects the data might be different from the person who wrote resnet50.py:

abhinavarora · 2018-04-24T20:47:01Z

doc/fluid/design/data_pipeline/data_pipeline.md

+The fact that the who collect the data might not be the same person who wrote `resnet50.py` inspires that
+
+1. Fluid operators used in `resnet50.py` need to recognize the file format of `data`, or, we need a standard data format.
+1. These operators need to be able to read the standard input.


read the standard input -> read from the standard input

wangkuiyi · 2018-04-25T05:33:20Z

Thanks to @abhinavarora for the comments. I followed all of them.

reyoung · 2018-04-25T10:16:36Z

doc/fluid/design/data_pipeline/data_pipeline.md

+  }
+  message Field {
+    required Type type = 1;
+    repeated bytes image = 2; // PNG image


I think we should not support so many types. For simplicity, the recordio format should store only LoDTensor.

Files of other types, like PNG images, should first be preprocessed into LoDTensor and then saved into a recordio file.

If we allow many kinds of binary formats to be stored in recordio, the format becomes too flexible and we cannot reuse the data augmentation program.

Good point. It looks to me that your point is the alternative I described in the Open Discussion section below, isn't it? If so, I agree with you that it would be convenient to save Fluid variables instead of raw data in records. I am updating the document accordingly.

reyoung · 2018-04-25T10:24:51Z

doc/fluid/design/data_pipeline/data_pipeline.md

+It is far from trivial to implement the many augmentation operations as Fluid operators, so we'd adopt a more flexible approach -- write the data augmentation program in arbitrary languages and pipe them up.  For example:
+
+```bash
+cat data | go run add_noise.go | python resnet50.py


Data augmentation can be the performance bottleneck of a training process, especially when using many GPUs. Image models need GPU-based augmentation, and that is hard to do when augmentation runs in a separate program. Maybe we should provide some common data augmentation algorithms as operators. We can also provide a flexible approach by piping multiple processes.

Reference: High-Performance Models by TensorFlow. Half of that article describes how to feed data.

I agree that we should provide some augmentation operators. The ability to use a pipeline is also critical; I am afraid that it is intractable to implement all augmentation operations as Fluid operators.
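To make the pipeline idea concrete, here is a minimal sketch of one augmentation stage written as a stream filter. It assumes, purely for illustration, that each record is a text line of whitespace-separated floats; the design under review uses RecordIO instead, and the function name `add_noise` and its parameters are hypothetical.

```python
import random


def add_noise(src, dst, sigma=0.1, seed=None):
    """Read records from src, add Gaussian noise, write them to dst.

    Assumption for this sketch: one record per line, each line holding
    whitespace-separated floats. This is not the RecordIO format.
    """
    rng = random.Random(seed)
    for line in src:
        values = [float(tok) for tok in line.split()]
        noisy = [v + rng.gauss(0.0, sigma) for v in values]
        dst.write(" ".join(repr(v) for v in noisy) + "\n")
```

A script that calls `add_noise(sys.stdin, sys.stdout)` could then sit between `cat data` and the training program in a shell pipeline.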

reyoung · 2018-04-25T10:41:04Z

doc/fluid/design/data_pipeline/data_pipeline.md

+Consider a Fluid training program, `resnet50.py`, needs to read data from disk:
+
+```bash
+cat data | python resnet50.py


A note about pipes.

Fluid can read data not only from stdin but also from files. A pipe and stdin are special files. We can even use mkfifo to create a named pipe, which acts just like a real file.
Currently, Fluid supports an operator named open_files. It can open multiple files and read them concurrently. We can create many named pipes and use many processes to generate data.

```bash
mkfifo data1
mkfifo data2
./data_generator > data1 &  # run this process in the background
./data_generator > data2 &  # data1 and data2 will be generated concurrently

# in train.py:
#   file_obj = fluid.open_files(['data1', 'data2'])
#   img, label = read(file_obj)
python train.py
```
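The same named-pipe pattern can be sketched in plain Python, without any Fluid API. This is only an illustration under assumptions: it runs on POSIX systems only (`os.mkfifo`), the generators are simulated by threads rather than separate processes, and the helper names `write_lines`, `read_lines`, and `demo` are hypothetical.

```python
import os
import tempfile
import threading


def write_lines(path, lines):
    # Opening a FIFO for writing blocks until a reader opens the other end.
    with open(path, "w") as f:
        for line in lines:
            f.write(line + "\n")


def read_lines(path, out):
    # Opening a FIFO for reading blocks until a writer opens the other end.
    with open(path) as f:
        for line in f:
            out.append(line.rstrip("\n"))


def demo():
    workdir = tempfile.mkdtemp()
    paths = [os.path.join(workdir, name) for name in ("data1", "data2")]
    for p in paths:
        os.mkfifo(p)  # POSIX only

    results = {p: [] for p in paths}
    threads = []
    for i, p in enumerate(paths):
        records = [f"gen{i}-rec{j}" for j in range(3)]
        threads.append(threading.Thread(target=write_lines, args=(p, records)))
        threads.append(threading.Thread(target=read_lines, args=(p, results[p])))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    for p in paths:
        os.remove(p)
    os.rmdir(workdir)
    return [results[p] for p in paths]
```

Each pipe is read by its own thread, so a slow generator on one pipe does not stall reading from the other.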

Thanks for pointing this out!

I agree it is a good idea to improve the data generation performance by reading from multiple data generators via pipes.

I also want to note that OpenFiles is not a good design; instead, what we need are finer-grained APIs:

- OpenFile (File, not Files), which opens a file as a byte stream,
- CreateRecordIOReader, which extracts records, instead of bytes, from the stream,
- CreateMultiplexReader, which monitors multiple readers and reads from one if it has any record ready for reading.

Together these allow usage like the following:

```
r = CreateMultiplexReader(
    OpenFile("/afs/resnet50/traininig-data-00000-of-00005"),
    OpenFile("/dev/stdin"),
    OpenFile("a_named_pipe"));
```
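One way to picture the multiplexing behavior is the Python sketch below. It is not the proposed Fluid API: the name `create_multiplex_reader` is hypothetical, readers are modeled as plain Python iterables, and "ready" is approximated by one pump thread per reader feeding a shared queue.

```python
import queue
import threading


def create_multiplex_reader(*readers):
    """Yield records from several iterables as soon as any has one ready.

    Order is preserved within each reader but not across readers.
    """
    q = queue.Queue()  # unbounded here; a real reader would likely bound it
    done = object()    # sentinel marking the end of one reader

    def pump(reader):
        try:
            for record in reader:
                q.put(record)
        finally:
            q.put(done)

    for r in readers:
        threading.Thread(target=pump, args=(r,), daemon=True).start()

    remaining = len(readers)
    while remaining:
        item = q.get()
        if item is done:
            remaining -= 1
        else:
            yield item
```

Usage would look like `for record in create_multiplex_reader(reader_a, reader_b): ...`, where each argument could be a file, stdin, or a named pipe wrapped in a record iterator.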

I am adding the above discussion to the document.

@JiayiFeng MultiplexReader is an alternative way to implement open_files. Please take a look.

reyoung · 2018-04-25T10:46:04Z

doc/fluid/design/data_pipeline/data_pipeline.md

+
+we need
+
+1. the data format is fault-tolerable.


We also need Fluid to be able to read from multiple files concurrently. If there are many data generators, they should write their data into separate files to avoid interfering with each other.

Yes, I agree. And the solution in my mind is as in #10179 (comment)

JiayiFeng · 2018-04-25T12:13:04Z

doc/fluid/design/data_pipeline/data_pipeline.md

+
+Usually, each instance contains multiple fields.  For example, each instance of ResNet includes an image and one or more text labels. In another example of text classification, each instance contains text and its labels.
+
+Data reading operators of Fluid must understand not only the file format, but also the record format, so could it map fields into Fluid variables of various types.  For this purpose, we propose to standardize the record format as a protobuf message:


A protobuf message has a size limit. I remember it is 64MB by default. We should take it into consideration, although a single instance seems unlikely to be larger than 64MB.

Thanks for the reminder. Yes, I assume that each training instance stays under 64MB, and in practice under 32MB, because the protobuf serialization/deserialization mechanism warns if a message is over 32MB and errors if it is over 64MB.
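A writer-side guard for these limits could look like the sketch below. The two thresholds are taken from the comment above rather than read from the protobuf library, and the function name `check_record_size` is hypothetical.

```python
# Thresholds as stated in this thread; treated as assumptions, not
# values queried from the protobuf runtime.
WARN_BYTES = 32 * 1024 * 1024
MAX_BYTES = 64 * 1024 * 1024


def check_record_size(num_bytes):
    """Classify a serialized record size as 'ok', 'warn', or 'error'."""
    if num_bytes > MAX_BYTES:
        return "error"
    if num_bytes > WARN_BYTES:
        return "warn"
    return "ok"
```

A data generator could call `check_record_size(len(serialized))` before writing each record and reject or split instances that would not decode.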

JiayiFeng · 2018-04-25T12:18:16Z

doc/fluid/design/data_pipeline/data_pipeline.md

+cat data | go run add_noise.go | python resnet50.py
+```
+
+As we use standard data format, `add_noise.go` must be able to decode `data` and encode its outputs into the RecordIO format.  We could provide the Go binding of the RecordIO API.


What's the format of data? If it's RecordIO, do we need to provide users with tools to convert their own data to the RecordIO format? If it is a user-defined format, how do we make sure that add_noise.go can be applied to it?

I should write data.recordio, instead of data, to make it clear that the format is RecordIO.

Yes, we need to provide at least one of the two kinds of tools, as I mentioned in this design:

- a library that can be called by data augmentation programs
- a pair of decoding/encoding executable binaries: recordio_encode and recordio_decode.

wangkuiyi · 2018-05-01T05:11:12Z

doc/fluid/design/data_pipeline/data_pipeline.md

+  }
+  message Field {
+    required Type type = 1;
+    optional bytes png_image = 2;


Merge png_image, jpg_image, text into one field value.

wangkuiyi added 3 commits April 24, 2018 13:05

Create data_pipeline.md

eca6d41

Update data_pipeline.md

e6caf2f

Grammarly.com checks

f48f096

wangkuiyi added this to TODO in Complete Fluid via automation Apr 24, 2018

abhinavarora reviewed Apr 24, 2018

Update data_pipeline.md

418445f

reyoung reviewed Apr 25, 2018

reyoung requested a review from JiayiFeng April 25, 2018 10:34

reyoung reviewed Apr 25, 2018

JiayiFeng reviewed Apr 25, 2018

wangkuiyi added 2 commits April 25, 2018 15:20

Update data_pipeline.md

fc918a6

Integrate Yu Yang's suggestions

ba74bc7

wangkuiyi commented May 1, 2018

luotao1 closed this Oct 15, 2018

Complete Fluid automation moved this from TODO to Done Oct 15, 2018

luotao1 deleted the wangkuiyi-patch-2 branch October 15, 2018 10:15
