Feature/recordio file reader #8830

reyoung · 2018-03-07T08:21:40Z

dzhwinter · 2018-03-10T01:59:42Z

paddle/fluid/pybind/recordio.cc

+  }
+
+ private:
+  std::vector<framework::LoDTensor> tensors_;


As I understand it, a batch of LoDTensor is a bunch of big blobs. Should we use a list here to avoid unnecessary copies?
Then the code could be:

    std::forward_list<framework::LoDTensor> tensors_;

    template <typename Container>
    void WriteToRecordIO(recordio::Writer &writer, const Container &tensors,
                         const platform::DeviceContext &dev_ctx) {
      for (auto &tensor : tensors) {
        TensorToStream();  // arguments omitted
      }
    }

LoDTensor holds a shared_ptr, so copying a LoDTensor is as cheap as passing a pointer.
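The point about cheap copies can be illustrated with a minimal sketch. `FakeTensor` and `CopiesShareBuffers` are hypothetical names introduced here and are not part of this PR; the sketch only assumes, as the reply states, that the tensor type holds a shared_ptr to its buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a tensor that only holds a shared_ptr to its
// buffer (an assumption taken from the reply, not the real LoDTensor).
struct FakeTensor {
  std::shared_ptr<std::vector<float>> data;
};

// Copies every handle in `src` and reports whether each copy still points
// at the same underlying buffer as the original.
inline bool CopiesShareBuffers(const std::vector<FakeTensor>& src) {
  std::vector<FakeTensor> copy = src;  // copies the handles, not the buffers
  for (std::size_t i = 0; i < src.size(); ++i) {
    if (copy[i].data.get() != src[i].data.get()) return false;
  }
  return true;
}
```

Under this assumption, copying a vector of such handles only bumps reference counts, so a linked list would not save a copy of the payload itself.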

dzhwinter · 2018-03-10T02:35:18Z

paddle/fluid/recordio/scanner.h

+
+class Scanner {
+public:
+  explicit Scanner(std::unique_ptr<std::istream>&& stream);


Since we store a pointer to the istream object, it makes no difference whether the parameter is an rvalue reference or not.

Scanner reads the istream until it ends. It takes ownership of the stream and frees it internally, so an rvalue reference is used here.

BTW, this API is not used by operators. It is only used in unit tests.

Sorry, I didn't state my opinion clearly. I meant that we could just take a std::istream* as the constructor parameter; otherwise other modules are forced to use unique_ptr. Never mind, since it is only used in unit tests.
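The ownership transfer discussed in this thread can be sketched as follows. `OwningLineScanner` is a hypothetical, simplified class that reads lines rather than RecordIO chunks; it is not the Scanner from this PR and only mirrors the constructor signature shown in the diff.

```cpp
#include <cassert>
#include <istream>
#include <memory>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical, simplified scanner used only to illustrate ownership.
class OwningLineScanner {
 public:
  // The rvalue reference makes the ownership transfer visible at the call
  // site: the caller has to write std::move(stream).
  explicit OwningLineScanner(std::unique_ptr<std::istream>&& stream)
      : stream_(std::move(stream)) {}

  // Reads the next line; returns false once the stream is exhausted.
  bool Next(std::string* out) {
    return static_cast<bool>(std::getline(*stream_, *out));
  }

 private:
  std::unique_ptr<std::istream> stream_;  // freed with the scanner
};
```

A caller would construct a `std::unique_ptr<std::istream>` (for example around a `std::istringstream`) and pass it with `std::move`; afterwards the caller's pointer is empty. A raw `std::istream*` parameter would avoid forcing unique_ptr on callers, at the cost of leaving the lifetime of the stream to them.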

dzhwinter · 2018-03-10T02:43:26Z

paddle/fluid/recordio/scanner.h

+
+private:
+  std::unique_ptr<std::istream> stream_;
+  Chunk cur_chunk_;


The chunk stores a massive number of LoDTensors, so it should be placed on the heap instead of the stack.

A LoDTensor is just a pointer to the data memory.
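A small sketch of why a by-value member need not occupy much space in its owner. `FakeChunk` is a hypothetical type that keeps its records in a std::vector; the real Chunk in this PR may be laid out differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical chunk type: the records live in a std::vector member.
struct FakeChunk {
  std::vector<std::string> records;
};

// Returns true when the vector's element storage lies outside the bytes of
// the chunk object itself, i.e. the elements were allocated separately.
inline bool RecordsLiveOutsideChunk(const FakeChunk& chunk) {
  if (chunk.records.empty()) return true;
  const auto begin = reinterpret_cast<std::uintptr_t>(&chunk);
  const auto end = begin + sizeof(FakeChunk);
  const auto elem = reinterpret_cast<std::uintptr_t>(chunk.records.data());
  return elem < begin || elem >= end;
}
```

With such a layout, holding the chunk by value inside another object costs only the size of the container handle; the bulk of the data sits in dynamically allocated storage either way.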

dzhwinter · 2018-03-10T03:01:56Z

paddle/fluid/operators/reader/create_recordio_file_reader_op.cc

+
+    auto* out = scope.FindVar(Output("Out"))
+                    ->template GetMutable<framework::ReaderHolder>();
+    out->Reset(new RecordIOFileReader(filename, shapes));


My concern is that users have to convert their data into the RecordIO format. This takes twice the disk space and wastes a lot of time when converting a massive amount of data.
RecordIOReader should be a stream reader that supports in-memory conversion, and RecordIOFileReader would be a special case of a stream.

What is in-memory conversion? (Almost) everything in Unix is a file, including memory. If we want to read RecordIO directly from memory, we can just mount memory into the filesystem.

JiayiFeng · 2018-03-13T06:14:43Z

paddle/fluid/operators/reader/create_recordio_file_reader_op.cc

+            platform::CPUPlace())) {}
+
+  void ReadNext(std::vector<framework::LoDTensor>* out) override {
+    *out = framework::ReadFromRecordIO(scanner_, dev_ctx_);


According to the conclusion of our last discussion, a FileReader should check the shape of the result before returning it. However, we can merge this PR first, and I will refine this in my next PR.
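The kind of check mentioned here might look like the sketch below. `Shape`, `ShapeMatches`, and `AllShapesMatch` are hypothetical helpers, not code from this PR or the follow-up, and treating a negative expected dimension as "any size" is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Shape = std::vector<std::int64_t>;

// Returns true when `actual` matches `expected`. A negative expected
// dimension is treated as a wildcard (an assumption of this sketch).
inline bool ShapeMatches(const Shape& expected, const Shape& actual) {
  if (expected.size() != actual.size()) return false;
  for (std::size_t i = 0; i < expected.size(); ++i) {
    if (expected[i] >= 0 && expected[i] != actual[i]) return false;
  }
  return true;
}

// Checks every tensor shape read from a file against the declared shapes.
inline bool AllShapesMatch(const std::vector<Shape>& expected,
                           const std::vector<Shape>& actual) {
  if (expected.size() != actual.size()) return false;
  for (std::size_t i = 0; i < expected.size(); ++i) {
    if (!ShapeMatches(expected[i], actual[i])) return false;
  }
  return true;
}
```

A reader could run such a check on the tensors it has just read and report an error when it fails, instead of handing mismatched data to downstream operators.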

JiayiFeng · 2018-03-13T06:18:00Z

paddle/fluid/operators/reader/reader_op_registry.cc

@@ -35,7 +35,7 @@ FileReaderMakerBase::FileReaderMakerBase(
    framework::OpProtoAndCheckerMaker::OpProto* op_proto,
    framework::OpAttrChecker* op_checker)
    : OpProtoAndCheckerMaker(op_proto, op_checker) {
-  AddOutput("Out", "(ReaderHolder) The created random reader.");
+  AddOutput("Out", "(ReaderHolder) The created random reader.").AsDuplicable();


Why is it duplicable?

The output of a file reader should be a list of variables.

dzhwinter

LGTM. I only have a few questions.

dzhwinter · 2018-03-13T08:01:51Z

paddle/fluid/recordio/scanner.h

+
+class Scanner {
+public:
+  explicit Scanner(std::unique_ptr<std::istream>&& stream);


Sorry, I didn't state my opinion clearly. I meant that we could just take a std::istream* as the constructor parameter; otherwise other modules are forced to use unique_ptr. Never mind, since it is only used in unit tests.

dzhwinter · 2018-03-13T08:35:41Z

paddle/fluid/recordio/scanner.h

+namespace paddle {
+namespace recordio {
+
+class Scanner {


In the previous implementation, Scanner refers to a multi-file reader: it holds multiple files, and each range_scanner builds the index for and parses a single recordio file. Thanks to range_scanner, a scanner can scan through a list of files and generate a sequence of records.
In this PR, Scanner is a reader or parser for a single recordio file.
Should we rename it or leave a comment? Using Scanner for a single-file reader seems confusing.

dzhwinter · 2018-03-13T08:39:35Z

python/paddle/fluid/recordio_writer.py

+
+@contextlib.contextmanager
+def create_recordio_writer(filename,
+                           compressor=core.RecordIOWriter.Compressor.Snappy,


These default values have already been set in convert_reader_to_recordio_file.

reyoung added 6 commits March 7, 2018 10:43

Add magic number in recordio

f1d61e6

Merge branch 'feature/extract_reader_ops' into feature/recordio_file_reader

1034312

Add Writer/Scanner

bcb8075

Make vec<Tensor> can be serialized to RecordIO

Complete RecordIO reader op

72be7a6

Fix CI

5cb7952

Polish codes and comments

db46778

reyoung requested review from JiayiFeng and dzhwinter March 8, 2018 03:13

reyoung added 3 commits March 8, 2018 11:14

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into feature/recordio_file_reader

b536799

Fix CI

9d4c93a

Use new program when unittest

a305cb2

reyoung force-pushed the feature/recordio_file_reader branch from 3739828 to a305cb2 Compare March 8, 2018 08:02

dzhwinter reviewed Mar 10, 2018

reyoung added 3 commits March 12, 2018 10:49

Refine

fea4307

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into feature/recordio_file_reader

cfca8a3

Polish RecordIO

7eedced

JiayiFeng reviewed Mar 13, 2018

dzhwinter approved these changes Mar 13, 2018

dzhwinter reviewed Mar 13, 2018

reyoung merged commit e13aec6 into PaddlePaddle:develop Mar 13, 2018

dzhwinter mentioned this pull request Mar 15, 2018

Feature/recordio #8759

Closed
