
Fluid data pipeline interface #10102

Closed · helinwang opened this issue Apr 20, 2018 · 3 comments

helinwang (Contributor) commented Apr 20, 2018

The data pipeline (processing and reading) in a traditional deep learning framework is typically defined in Python scripts that rely on many Python libraries. We propose a new data pipeline scheme based on Unix pipes, which works with any executable, including Python scripts.

Read Data from stdin

A typical use case is a Fluid training program that reads mini-batches from stdin:

$ cat ~/dataset/fit_a_line | paddle run train.py

In the above command, ~/dataset/fit_a_line is a file encoded in the Fluid data format; it contains a sequence of training data entries. paddle run train.py runs a Fluid program that trains a neural network, reading the training data from stdin.
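Because the data source is simply stdin, any program that emits the Fluid data format can act as the producer. For example (assuming a gzip-compressed copy of the same dataset), decompression can be chained in directly:

$ gunzip -c ~/dataset/fit_a_line.gz | paddle run train.py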

Here is an example of train.py:

from paddle import fluid

reader = fluid.reader("/dev/stdin")
with reader.iterate():
  x = fluid.layers.data(name="feature")
  label = fluid.layers.data(name="label")
  y_predict = fluid.layers.fc(input=x, size=1, act=None)
  cost = fluid.layers.square_error_cost(input=y_predict, label=label)
  avg_cost = fluid.layers.mean(cost)
  sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
  sgd_optimizer.minimize(avg_cost)

# Save the trained parameters to stdout so they can be piped onward.
writer = fluid.writer("/dev/stdout")
fluid.save_parameters(writer=writer)

The code block inside the with statement is executed once per mini-batch, until EOF is reached on stdin.
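For intuition, the reading loop behaves roughly like the plain-Python sketch below. The length-prefixed framing here is only an illustrative stand-in; this proposal does not fix the Fluid data format.

import struct
import sys

def read_batches(stream):
  # Illustrative stand-in for the Fluid data format: each mini-batch is
  # framed as a 4-byte little-endian length followed by the payload.
  while True:
    header = stream.read(4)
    if len(header) < 4:  # EOF on stdin: stop iterating
      return
    (size,) = struct.unpack("<I", header)
    yield stream.read(size)

for batch in read_batches(sys.stdin.buffer):
  pass  # the block under `with reader.iterate():` runs once per mini-batch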

Output Data to stdout

In the train.py above, the trained parameters are saved to stdout after the training iterations:

writer = fluid.writer("/dev/stdout")
fluid.save_parameters(writer=writer)
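Since the parameters are written to stdout, they can be piped onward or simply redirected into a file (the filename here is illustrative):

$ cat ~/dataset/fit_a_line | paddle run train.py > fit_a_line.model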

We can have a test.py that reads the parameters from stdin and tests the model accuracy:

from paddle import fluid

model_reader = fluid.reader("/dev/stdin")  # trained parameters arrive on stdin
test_reader = fluid.reader("~/dataset/testset/test")  # test data comes from a file
writer = fluid.writer("/dev/stdout")
fluid.load_parameters(reader=model_reader)
with test_reader.iterate():
  x = fluid.layers.data(name="feature")
  label = fluid.layers.data(name="label")
  y_predict = fluid.layers.fc(input=x, size=1, act=None)
  cost = fluid.layers.square_error_cost(input=y_predict, label=label)
  writer.write(cost)

We can use this command to train, test, and pretty-print the test result:

$ cat ~/dataset/fit_a_line | paddle run train.py | paddle run test.py | paddle pretty_print
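Because every stage is an ordinary Unix process, standard tools compose with the pipeline. For example, tee(1) can capture the trained parameters in a file (name illustrative) while still forwarding them to the test stage:

$ cat ~/dataset/fit_a_line | paddle run train.py | tee fit_a_line.model | paddle run test.py | paddle pretty_print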

Interprocess Communication with Non-Fluid Programs

Non-Fluid programs can communicate with Fluid programs as long as they understand the Fluid data format. We will provide APIs for non-Fluid programs to read and write data through file descriptors.

In the code below, preprocess.py preprocesses the data from stdin before writing it to stdout:

from paddle import fluid

w = fluid.write_to("/dev/stdout")
w.set_columns("feature", "label")
r = fluid.read_from("/dev/stdin")

for entry in r:
  feature = entry["feature"]
  label = entry["label"]
  # feature and label are both numpy.ndarray
  new_feature = some_preprocessing(feature)  # some_preprocessing is user-defined
  # new_feature is a numpy.ndarray; it will be serialized to a format
  # that can be deserialized into a fluid.lod_tensor.
  w.write(new_feature, label)

We can then chain it to our training process:

$ cat ~/dataset/fit_a_line | python preprocess.py | paddle run train.py

We will provide an equivalent C++ API so users can write highly efficient C++ programs that interact with Fluid programs.

Run Fluid Program from Python

All the above examples run Fluid programs from the shell with paddle run *.py. We can run Fluid from Python as well, in a way fully compatible with the Fluid data pipeline interface. Please see the examples in #9912.

wangkuiyi (Collaborator) commented Apr 20, 2018

It is common that a training program needs to evaluate the model quality (e.g., precision, recall) w.r.t. iterations. This implies that the training program would need two input streams -- the training data and the test/dev data.

We might duplex these two streams on stdin, but this would be technically much more difficult than having multiple input streams.
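One lightweight way to get multiple input streams without duplexing stdin is bash process substitution, which exposes each stream as a /dev/fd path that the proposed fluid.reader could open like any other file. A sketch (train_and_eval.py and its two-filename argument convention are hypothetical):

$ paddle run train_and_eval.py <(cat ~/dataset/train) <(cat ~/dataset/dev)

Each <(...) expands to a path such as /dev/fd/63, so the "/dev/stdin"-style reader API generalizes unchanged.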

At the moment, what I have in mind is that we need the following four operators:

https://github.com/PaddlePaddle/Paddle/pull/10179/files#r184158489

Also, in case we want to read records from stdin, we should be able to pass a specific filename to OpenRecordIO, like

CreateRecordIOReader(OpenFile("/dev/stdin"))

helinwang (Contributor, Author) commented Apr 20, 2018

@wangkuiyi thanks! Allowing multiple input streams is a great idea! I really like the "/dev/stdin" approach (I have updated the example code to use it).

Please see the section "Output Data to stdout" for an example that uses stdin as the model input and a separate file as the test data input.

In terms of data format, I have some doubts about whether recordio can meet our needs without modification, so I have not stated any opinion regarding the file format (e.g., the examples use "~/dataset/testset/test" rather than "~/dataset/testset/test.recordio"). I will state the file format requirements in this issue so we can evaluate the recordio format.

@helinwang changed the title from "Fluid streaming data format design" to "Fluid data pipeline interface" on Apr 20, 2018
shanyi15 (Collaborator) commented
Hello, this issue has not been updated in the past month, so we will close it later today. If you still need to follow up after it is closed, feel free to reopen it and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!
