-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/recordio file reader #8830
Feature/recordio file reader #8830
Conversation
Make vec<Tensor> can be serialized to RecordIO
… feature/recordio_file_reader
3739828
to
a305cb2
Compare
} | ||
|
||
private: | ||
std::vector<framework::LoDTensor> tensors_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my mind, a batch of LoDTensor means a bunch of big blobs, should we use list here to avoid useless copy?
Then the code can be
std::forward_list<framework::LoDTensor> tensors_;
template<typename Container>
void WriteToRecordIO(recordio::Writer &writer,
const Container &tensors,
const platform::DeviceContext &dev_ctx) {
for(auto& tensor : tensors) {
TensorToStream();
//
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LoDTensor holds a shared_ptr. Copy a LoDTensor is just like passing by pointers.
|
||
class Scanner { | ||
public: | ||
explicit Scanner(std::unique_ptr<std::istream>&& stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we store the pointer of an istream object, there is no difference between make a right value copy or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scanner
will read the istream until ends. It will steal the ownership of stream and free the stream inside. So, here use right-value reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, this API is not used for operators. It just used for unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't point out my opinion clearly. I thought that we can only use an std::istream*
as the constructor parameter. Otherwise, it will force other modules to use unique_ptr. Nevermind since it just used for unit tests.
|
||
private: | ||
std::unique_ptr<std::istream> stream_; | ||
Chunk cur_chunk_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The chunk store a massive amount of LoDTensor, it should be put into the heap instead of the stack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A LoDTensor just a pointer to data memory.
|
||
auto* out = scope.FindVar(Output("Out")) | ||
->template GetMutable<framework::ReaderHolder>(); | ||
out->Reset(new RecordIOFileReader(filename, shapes)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern is that user had to convert data into the recordio format, this operation will cost double size disk and waste a lot of time to convert massive amount of data.
RecordIOReader
should be a stream reader support in memory converting, and RecordIOFileReader is a special case of stream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is in memory converting
? (Almost) everything in unix is a file, including memory. If we want to read recordio directly from memory, we can just mount memory into the filesystem.
… feature/recordio_file_reader
platform::CPUPlace())) {} | ||
|
||
void ReadNext(std::vector<framework::LoDTensor>* out) override { | ||
*out = framework::ReadFromRecordIO(scanner_, dev_ctx_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the conclusion of our last discussion, a FileReader
should check the result's shape before return. However, we can merge this PR first and I will refine this in my next PR.
@@ -35,7 +35,7 @@ FileReaderMakerBase::FileReaderMakerBase( | |||
framework::OpProtoAndCheckerMaker::OpProto* op_proto, | |||
framework::OpAttrChecker* op_checker) | |||
: OpProtoAndCheckerMaker(op_proto, op_checker) { | |||
AddOutput("Out", "(ReaderHolder) The created random reader."); | |||
AddOutput("Out", "(ReaderHolder) The created random reader.").AsDuplicable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why it's duplicable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The output of a file reader should be a list of variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. There are only some questions in my side.
|
||
class Scanner { | ||
public: | ||
explicit Scanner(std::unique_ptr<std::istream>&& stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't point out my opinion clearly. I thought that we can only use an std::istream*
as the constructor parameter. Otherwise, it will force other modules to use unique_ptr. Nevermind since it just used for unit tests.
namespace paddle { | ||
namespace recordio { | ||
|
||
class Scanner { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In previous implementation, Scanner
is a concept refer to multi-files reader
, it holds multiple files, and each range_scanner
construct the index and parse a single recordio file. At the same time, benefits from the range_scanner, a scanner can scan through a list of files and generates a sequence of records.
In this PR, Scanner
is a Reader
or Parser
to a single recordio file.
Should we change it to another name or leave a comment? Scanner
as a single file reader seems confused.
|
||
@contextlib.contextmanager | ||
def create_recordio_writer(filename, | ||
compressor=core.RecordIOWriter.Compressor.Snappy, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these default value has been set in convert_reader_to_recordio_file
.
Fix #8861