Fluid data pipeline interface #10102
It is common that a training program needs to evaluate model quality (e.g., precision, recall) w.r.t. iterations. This implies that the training program needs two input streams -- the training data and the test/dev data. We might multiplex these two streams on stdin, but that would be technically much more difficult than having multiple input streams. At the moment, I think we need the following four operators: https://github.com/PaddlePaddle/Paddle/pull/10179/files#r184158489 Also, in case we want to read records from stdin, we should be able to pass a specific filename to OpenRecordIO, like `CreateRecordIOReader(OpenFile("/dev/stdin"))`.
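The composition suggested above could look roughly like this sketch. `OpenFile` and `CreateRecordIOReader` are the names from the comment, but their signatures and the record framing used here (a 4-byte little-endian length prefix) are assumptions for illustration, not the actual recordio format:

```python
import struct

def OpenFile(path):
    """Open a file (including /dev/stdin) as a binary stream."""
    return open(path, "rb")

def CreateRecordIOReader(stream):
    """Wrap a binary stream into an iterator of raw records.

    Assumed framing: each record is a 4-byte little-endian length
    prefix followed by that many payload bytes.
    """
    def records():
        while True:
            header = stream.read(4)
            if len(header) < 4:  # EOF or truncated header
                return
            (length,) = struct.unpack("<I", header)
            yield stream.read(length)
    return records()

# Composing the two, as suggested in the comment:
# reader = CreateRecordIOReader(OpenFile("/dev/stdin"))
```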
@wangkuiyi thanks! Allowing multiple input streams is a great idea! I really like the "/dev/stdin" approach (I updated the example code to use it). Please see the section "Output Data to stdout" for an example that uses stdin as the model input and another file as the test data input. In terms of data format, I have some doubts about whether recordio can serve our needs without any modification, so I have not stated any opinion regarding the file format (e.g., using …).
Hello, this issue has had no updates for nearly a month, so we will close it within the day. If you still need to follow up after it is closed, you can reopen it and we will reply within 24 hours. We apologize for any inconvenience caused by the closing, and thank you for your support of PaddlePaddle!
The data pipeline (data processing and reading) in traditional deep learning frameworks is typically defined in Python scripts, relying on many Python libraries. We propose a new data pipeline scheme based on Unix pipes, which can compose arbitrary executables, including Python scripts.
## Read Data from stdin
A typical use case is a Fluid training program that reads mini-batches from stdin, e.g., `cat ~/dataset/fit_a_line | paddle run train.py`.

In the above command, `~/dataset/fit_a_line` is a file encoded in the Fluid data format; it contains a sequence of training data entries. `paddle run train.py` runs a Fluid program that trains a neural network, reading the training data from stdin.

Here is an example of `train.py`. The code block inside the `with` statement is executed once per mini-batch, until EOF is read from stdin.
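A stdlib-only sketch of such a stdin-driven mini-batch loop. The actual Fluid data format is not specified here, so this assumes a hypothetical encoding in which each record is a 4-byte little-endian length prefix followed by the payload:

```python
import struct

def read_batches(stream):
    """Yield raw mini-batch records from a binary stream until EOF.

    Assumed framing: 4-byte little-endian length prefix per record.
    """
    while True:
        header = stream.read(4)
        if len(header) < 4:          # EOF or truncated header
            return
        (length,) = struct.unpack("<I", header)
        payload = stream.read(length)
        if len(payload) < length:    # truncated record
            return
        yield payload

def train(stream):
    """Iterate once per mini-batch, as the `with` block in train.py would."""
    steps = 0
    for batch in read_batches(stream):
        steps += 1  # a real train.py would run one training step here
    return steps

# In train.py this loop would be driven by: train(sys.stdin.buffer)
```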
## Output Data to stdout
In the `train.py` above, the trained parameters are written to stdout after the training iterations finish. We can have a `test.py` that reads the parameters from stdin and tests the model's accuracy, and then pipe the two programs together in a single command to train, test, and pretty-print the test result.
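A sketch of the parameter hand-off between `train.py` and `test.py`: one end writes parameters to stdout, the other reads them back from stdin. The on-wire parameter format is an assumption here (JSON), not anything the proposal specifies:

```python
import json

def save_params(params, stream):
    """Write trained parameters to a binary stream, e.g. sys.stdout.buffer."""
    stream.write(json.dumps(params).encode("utf-8"))

def load_params(stream):
    """Read parameters back from a binary stream, e.g. sys.stdin.buffer."""
    return json.loads(stream.read().decode("utf-8"))
```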
## Interprocess Communication with Non-Fluid Programs
Non-Fluid programs can communicate with Fluid programs as long as they understand the Fluid data format. We will provide APIs for non-Fluid programs to read and write data through file descriptors.
In the code below, `preprocess.py` preprocesses the data from stdin before writing it to stdout. We can then chain it into our training pipeline.
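A sketch of what such a filter might look like, using text (one record per line) purely for illustration; the actual Fluid data format would be binary, and the cleanup performed is hypothetical:

```python
def preprocess(line):
    # Hypothetical transformation: collapse whitespace and lowercase the text.
    return " ".join(line.split()).lower()

def run_filter(src, dst):
    """Read records from src (e.g. sys.stdin), write cleaned ones to dst."""
    for line in src:
        dst.write(preprocess(line) + "\n")

# preprocess.py would call: run_filter(sys.stdin, sys.stdout)
# and be chained as:  cat raw.txt | python preprocess.py | paddle run train.py
```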
We will also provide an equivalent C++ API so that users can write highly efficient C++ programs that interact with Fluid programs.
## Run Fluid Programs from Python
All the above examples run Fluid programs inside bash with `paddle run *.py`. We can run Fluid from Python as well, fully compatible with the Fluid data pipeline interface. Please see the examples in #9912.
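One way to drive such a pipeline from Python is via the stdlib `subprocess` module. The `paddle run` invocation in the comment below is illustrative, not the actual API from #9912:

```python
import subprocess

def run_stage(args, data):
    """Feed `data` (bytes) to a child process's stdin and return its stdout."""
    result = subprocess.run(args, input=data,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout

# Chaining stages from Python, equivalent to a shell pipeline such as
# `cat data | python preprocess.py | paddle run train.py` (hypothetical):
# cleaned = run_stage([sys.executable, "preprocess.py"], raw_bytes)
# run_stage(["paddle", "run", "train.py"], cleaned)
```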