
Fluid data pipeline interface #10102

Closed · helinwang opened this issue Apr 20, 2018 · 3 comments

helinwang (Contributor) commented Apr 20, 2018

The data pipeline (processing and reading) in a traditional deep learning framework is typically defined in Python scripts that rely on many Python libraries. We propose a new data pipeline scheme based on Unix pipes, which works with any executable, including Python scripts.

Read Data from stdin

A typical use case is a Fluid training program that reads mini-batches from stdin:

$ cat ~/dataset/fit_a_line | paddle run train.py

In the above command, ~/dataset/fit_a_line is a file encoded in the Fluid data format; it contains a sequence of training data entries. paddle run train.py runs a Fluid program that trains a neural network, reading the training data from stdin.
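Because the data source is simply stdin, any program that emits the Fluid data format can act as the producer. For example (assuming a gzip-compressed copy of the same dataset), decompression can be chained in directly:

$ gunzip -c ~/dataset/fit_a_line.gz | paddle run train.py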

Here is an example of train.py:

from paddle import fluid

reader = fluid.reader("/dev/stdin")
with reader.iterate():
  x = fluid.layers.data(name="feature")
  label = fluid.layers.data(name="label")
  y_predict = fluid.layers.fc(input=x, size=1, act=None)
  cost = fluid.layers.square_error_cost(input=y_predict, label=label)
  avg_cost = fluid.layers.mean(cost)
  sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
  sgd_optimizer.minimize(avg_cost)

# Save the trained parameters to stdout so they can be piped onward.
writer = fluid.writer("/dev/stdout")
fluid.save_parameters(writer=writer)

The code block inside the with statement is executed once per mini-batch, until EOF is reached on stdin.
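For intuition, the reading loop behaves roughly like the plain-Python sketch below. The length-prefixed framing here is only an illustrative stand-in; this proposal does not fix the Fluid data format.

import struct
import sys

def read_batches(stream):
  # Illustrative stand-in for the Fluid data format: each mini-batch is
  # framed as a 4-byte little-endian length followed by the payload.
  while True:
    header = stream.read(4)
    if len(header) < 4:  # EOF on stdin: stop iterating
      return
    (size,) = struct.unpack("<I", header)
    yield stream.read(size)

for batch in read_batches(sys.stdin.buffer):
  pass  # the block under `with reader.iterate():` runs once per mini-batch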

Output Data to stdout

In the train.py above, the trained parameters are saved to stdout after the training iterations:

writer = fluid.writer("/dev/stdout")
fluid.save_parameters(writer=writer)
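Since the parameters are written to stdout, they can be piped onward or simply redirected into a file (the filename here is illustrative):

$ cat ~/dataset/fit_a_line | paddle run train.py > fit_a_line.model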

We can have a test.py that reads the parameters from stdin and tests the model accuracy:

from paddle import fluid

model_reader = fluid.reader("/dev/stdin")  # trained parameters arrive on stdin
test_reader = fluid.reader("~/dataset/testset/test")  # test data comes from a file
writer = fluid.writer("/dev/stdout")
fluid.load_parameters(reader=model_reader)
with test_reader.iterate():
  x = fluid.layers.data(name="feature")
  label = fluid.layers.data(name="label")
  y_predict = fluid.layers.fc(input=x, size=1, act=None)
  cost = fluid.layers.square_error_cost(input=y_predict, label=label)
  writer.write(cost)

We can use this command to train, test, and pretty-print the test result:

$ cat ~/dataset/fit_a_line | paddle run train.py | paddle run test.py | paddle pretty_print
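Because every stage is an ordinary Unix process, standard tools compose with the pipeline. For example, tee(1) can capture the trained parameters in a file (name illustrative) while still forwarding them to the test stage:

$ cat ~/dataset/fit_a_line | paddle run train.py | tee fit_a_line.model | paddle run test.py | paddle pretty_print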

Interprocess Communication with Non-Fluid Programs

Non-Fluid programs can communicate with Fluid programs as long as they understand the Fluid data format. We will provide APIs for non-Fluid programs to read and write data through file descriptors.

In the code below, preprocess.py preprocesses the data from stdin before writing it to stdout:

from paddle import fluid

w = fluid.write_to("/dev/stdout")
w.set_columns("feature", "label")
r = fluid.read_from("/dev/stdin")

for entry in r:
  feature = entry["feature"]
  label = entry["label"]
  # feature and label are both numpy.ndarray
  new_feature = some_preprocessing(feature)  # some_preprocessing is user-defined
  # new_feature is a numpy.ndarray; it will be serialized to a format
  # that can be deserialized into a fluid.lod_tensor.
  w.write(new_feature, label)

We can then chain it to our training process:

$ cat ~/dataset/fit_a_line | python preprocess.py | paddle run train.py

We will provide an equivalent C++ API so users can write highly efficient C++ programs that interact with Fluid programs.

Run Fluid Program from Python

All the above examples run Fluid programs from the shell with paddle run *.py. We can run Fluid from Python as well, in a way fully compatible with the Fluid data pipeline interface. Please see the examples in #9912.

wangkuiyi (Collaborator) commented Apr 20, 2018

It is common that a training program needs to evaluate the model quality (e.g., precision, recall) w.r.t. iterations. This implies that the training program would need two input streams -- the training data and the test/dev data.

We might duplex these two streams on stdin, but this would be technically much more difficult than having multiple input streams.
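One lightweight way to get multiple input streams without duplexing stdin is bash process substitution, which exposes each stream as a /dev/fd path that the proposed fluid.reader could open like any other file. A sketch (train_and_eval.py and its two-filename argument convention are hypothetical):

$ paddle run train_and_eval.py <(cat ~/dataset/train) <(cat ~/dataset/dev)

Each <(...) expands to a path such as /dev/fd/63, so the "/dev/stdin"-style reader API generalizes unchanged.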

At the moment, what I have in mind is that we need the following four operators:

https://github.com/PaddlePaddle/Paddle/pull/10179/files#r184158489

Also, in case we want to read records from stdin, we should be able to pass a specific filename to OpenRecordIO, like

CreateRecordIOReader(OpenFile("/dev/stdin"))

helinwang (Contributor, Author) commented Apr 20, 2018

@wangkuiyi thanks! Allowing multiple input streams is a great idea! I really like the "/dev/stdin" approach (I have updated the example code to use it).

Please see the section "Output Data to stdout" for an example that uses stdin as the model input and a separate file as the test data input.

In terms of data format, I have some doubts about whether recordio can meet our needs without modification, so I have not stated any opinion regarding the file format (e.g., the examples use "~/dataset/testset/test" rather than "~/dataset/testset/test.recordio"). I will state the file format requirements in this issue so we can evaluate the recordio format.

@helinwang changed the title from "Fluid streaming data format design" to "Fluid data pipeline interface" on Apr 20, 2018
shanyi15 (Collaborator) commented
Hello, this issue has not been updated in the past month, so we will close it later today. If you still need to follow up after it is closed, feel free to reopen it and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!
