Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/recordio file reader #8830

Merged
merged 12 commits into from
Mar 13, 2018

Conversation

reyoung
Copy link
Collaborator

@reyoung reyoung commented Mar 7, 2018

Fix #8861

@reyoung reyoung force-pushed the feature/recordio_file_reader branch from 3739828 to a305cb2 Compare March 8, 2018 08:02
}

private:
std::vector<framework::LoDTensor> tensors_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my mind, a batch of LoDTensor means a bunch of big blobs, should we use list here to avoid useless copy?
Then the code can be

std::forward_list<framework::LoDTensor> tensors_;

template<typename Container>
void WriteToRecordIO(recordio::Writer &writer,
                     const Container &tensors,
                     const platform::DeviceContext &dev_ctx) {
      for(auto& tensor : tensors) {
           TensorToStream();
           //
      }
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LoDTensor holds a shared_ptr. Copy a LoDTensor is just like passing by pointers.


class Scanner {
public:
explicit Scanner(std::unique_ptr<std::istream>&& stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we store the pointer of an istream object, there is no difference between make a right value copy or not.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scanner will read the istream until ends. It will steal the ownership of stream and free the stream inside. So, here use right-value reference.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, this API is not used for operators. It just used for unit tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't point out my opinion clearly. I thought that we can only use an std::istream* as the constructor parameter. Otherwise, it will force other modules to use unique_ptr. Nevermind since it just used for unit tests.


private:
std::unique_ptr<std::istream> stream_;
Chunk cur_chunk_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chunk store a massive amount of LoDTensor, it should be put into the heap instead of the stack.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A LoDTensor just a pointer to data memory.


auto* out = scope.FindVar(Output("Out"))
->template GetMutable<framework::ReaderHolder>();
out->Reset(new RecordIOFileReader(filename, shapes));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that user had to convert data into the recordio format, this operation will cost double size disk and waste a lot of time to convert massive amount of data.
RecordIOReader should be a stream reader support in memory converting, and RecordIOFileReader is a special case of stream.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is in memory converting? (Almost) everything in unix is a file, including memory. If we want to read recordio directly from memory, we can just mount memory into the filesystem.

platform::CPUPlace())) {}

void ReadNext(std::vector<framework::LoDTensor>* out) override {
*out = framework::ReadFromRecordIO(scanner_, dev_ctx_);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the conclusion of our last discussion, a FileReader should check the result's shape before return. However, we can merge this PR first and I will refine this in my next PR.

@@ -35,7 +35,7 @@ FileReaderMakerBase::FileReaderMakerBase(
framework::OpProtoAndCheckerMaker::OpProto* op_proto,
framework::OpAttrChecker* op_checker)
: OpProtoAndCheckerMaker(op_proto, op_checker) {
AddOutput("Out", "(ReaderHolder) The created random reader.");
AddOutput("Out", "(ReaderHolder) The created random reader.").AsDuplicable();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why it's duplicable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output of a file reader should be a list of variable.

Copy link
Contributor

@dzhwinter dzhwinter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. There are only some questions in my side.


class Scanner {
public:
explicit Scanner(std::unique_ptr<std::istream>&& stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't point out my opinion clearly. I thought that we can only use an std::istream* as the constructor parameter. Otherwise, it will force other modules to use unique_ptr. Nevermind since it just used for unit tests.

namespace paddle {
namespace recordio {

class Scanner {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In previous implementation, Scanneris a concept refer to multi-files reader, it holds multiple files, and each range_scanner construct the index and parse a single recordio file. At the same time, benefits from the range_scanner, a scanner can scan through a list of files and generates a sequence of records.
In this PR, Scanner is a Reader or Parser to a single recordio file.
Should we change it to another name or leave a comment? Scanner as a single file reader seems confused.


@contextlib.contextmanager
def create_recordio_writer(filename,
compressor=core.RecordIOWriter.Compressor.Snappy,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these default value has been set in convert_reader_to_recordio_file .

@reyoung reyoung merged commit e13aec6 into PaddlePaddle:develop Mar 13, 2018
@dzhwinter dzhwinter mentioned this pull request Mar 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants