Plain Datumfile input format for minimum memory usage #2193
Conversation
@@ -181,6 +181,87 @@ class LMDB : public DB {
  MDB_dbi mdb_dbi_;
};

#define MAX_BUF 10485760  // max entry size
Why this value?
It prevents a too-large key_size or data_size being read from the file, e.g. due to file corruption. I thought 10M for a datum is large enough, or maybe it should be larger? 100M?
Thanks for the review @sguada !
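To illustrate the point being discussed, here is a minimal sketch (not the PR's actual code; `ReadRecord` and the record layout are assumptions for illustration) of why capping entry sizes at MAX_BUF matters: each record stores key_size and data_size as raw integers, so a corrupted file could yield an absurd length, and the guard fails fast instead of attempting a huge allocation.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>

// Cap on a single entry, as in the patch: 10 MB.
const uint32_t MAX_BUF = 10485760;

// Hypothetical reader for a [key_size][data_size][key][data] record.
// Returns false on EOF, short read, or an implausible size field.
bool ReadRecord(std::ifstream& in, std::string* key, std::string* data) {
  uint32_t key_size = 0, data_size = 0;
  if (!in.read(reinterpret_cast<char*>(&key_size), sizeof(key_size))) return false;
  if (!in.read(reinterpret_cast<char*>(&data_size), sizeof(data_size))) return false;
  // Sanity-check sizes before allocating: a corrupted file could
  // otherwise request gigabytes here.
  if (key_size > MAX_BUF || data_size > MAX_BUF) {
    std::cerr << "corrupt record: size exceeds MAX_BUF" << std::endl;
    return false;
  }
  key->resize(key_size);
  data->resize(data_size);
  return static_cast<bool>(in.read(&(*key)[0], key_size)) &&
         static_cast<bool>(in.read(&(*data)[0], data_size));
}
```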
@immars Thanks for the pull! I have been using it, and found that when I start N training jobs accessing the same datumfile, each one only uses 100/N % of CPU. Is it normal? I am not sure if it is going to make the training slower or not. |
@weiliu89 this should not be happening, not according to my test. No locking is used, training process should not be IO bound either. Are you running N process? what's your |
Closing as better addressed by the
When training with leveldb/lmdb, memory increases linearly with the number of iterations, and even with the Next() calls issued for the DataLayer's rand_skip. Here is a simple plain datum file format, implemented via std::fstream, to address this issue.
Pros
Cons
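As a rough sketch of the plain datum file idea described above (assumed layout and helper names, not the PR's actual code): records are appended as [key_size][data_size][key][data], and the reader can skip a record with seekg instead of buffering it, so resident memory stays flat no matter how many records rand_skip walks past.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>

// Append one [key_size][data_size][key][data] record to the file.
void AppendRecord(std::ofstream& out, const std::string& key,
                  const std::string& data) {
  uint32_t key_size = static_cast<uint32_t>(key.size());
  uint32_t data_size = static_cast<uint32_t>(data.size());
  out.write(reinterpret_cast<const char*>(&key_size), sizeof(key_size));
  out.write(reinterpret_cast<const char*>(&data_size), sizeof(data_size));
  out.write(key.data(), key_size);
  out.write(data.data(), data_size);
}

// Skip one record without reading its payload into memory -- this is
// why rand_skip costs no extra resident memory with a plain file,
// unlike a leveldb/lmdb cursor that touches every page it passes.
bool SkipRecord(std::ifstream& in) {
  uint32_t key_size = 0, data_size = 0;
  if (!in.read(reinterpret_cast<char*>(&key_size), sizeof(key_size))) return false;
  if (!in.read(reinterpret_cast<char*>(&data_size), sizeof(data_size))) return false;
  in.seekg(static_cast<std::streamoff>(key_size) + data_size, std::ios::cur);
  return in.good();
}
```

Rewinding with seekg(0) would stand in for the wrap-around a DB cursor's SeekToFirst provides.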