
Plain Datumfile input format for minimum memory usage #2193

Closed
wants to merge 6 commits

Conversation


@immars commented Mar 25, 2015

When training with leveldb/lmdb, memory usage increases linearly with iterations, and even with the Next() calls made for DataLayer's rand_skip.

Here's a simple plain datum file format via std::fstream to address this issue.

Pros

  • RAM usage stays roughly constant (<1 GB on GoogLeNet with small batches) during training, as expected
  • datum file size falls between the leveldb and lmdb formats
  • no noticeable impact on training speed, thanks to prefetching
  • concurrent reads from multiple processes

Cons

  • no random reads. But Caffe does not need random key-value access anyway.
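
For illustration only (the patch's on-disk layout isn't shown in this thread): a minimal sketch of a length-prefixed record format written with std::fstream, where all names and the exact layout are assumptions rather than the patch's actual code, could look like this:

    // Hypothetical sketch: append one record as
    // [key_size][key bytes][data_size][serialized Datum bytes],
    // back-to-back, so a reader can scan the file sequentially.
    #include <cstdint>
    #include <fstream>
    #include <string>

    void AppendRecord(std::ofstream& out,
                      const std::string& key,
                      const std::string& serialized_datum) {
      uint32_t key_size = static_cast<uint32_t>(key.size());
      uint32_t data_size = static_cast<uint32_t>(serialized_datum.size());
      out.write(reinterpret_cast<const char*>(&key_size), sizeof(key_size));
      out.write(key.data(), key_size);
      out.write(reinterpret_cast<const char*>(&data_size), sizeof(data_size));
      out.write(serialized_datum.data(), data_size);
    }

Sequential, append-only records are consistent with the flat RAM usage reported above: a reader only ever needs the current record in memory.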

@@ -181,6 +181,87 @@ class LMDB : public DB {
MDB_dbi mdb_dbi_;
};


#define MAX_BUF 10485760 // max entry size
Contributor:

Why this value?

Author:

It prevents reading a too-large key_size or data_size from the file, e.g. due to file corruption.
I thought 10 MB per datum was large enough, or maybe it should be larger? 100 MB?
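
To make that guard concrete, here is a hypothetical reader sketch (again, not the patch's actual code) that uses a MAX_BUF-style cap to reject implausible size fields before allocating or reading them:

    #include <cstdint>
    #include <fstream>
    #include <string>

    const uint32_t kMaxBuf = 10485760;  // 10 MB cap per entry, mirroring MAX_BUF

    // Read one [key_size][key][data_size][data] record; refuse sizes above the
    // cap, which would most likely indicate a corrupted or misaligned file.
    bool ReadRecord(std::ifstream& in, std::string* key, std::string* value) {
      uint32_t key_size = 0, data_size = 0;
      if (!in.read(reinterpret_cast<char*>(&key_size), sizeof(key_size))) {
        return false;  // clean end of file or read error
      }
      if (key_size > kMaxBuf) return false;   // likely corruption
      key->resize(key_size);
      if (!in.read(&(*key)[0], key_size)) return false;
      if (!in.read(reinterpret_cast<char*>(&data_size), sizeof(data_size))) {
        return false;
      }
      if (data_size > kMaxBuf) return false;  // likely corruption
      value->resize(data_size);
      return static_cast<bool>(in.read(&(*value)[0], data_size));
    }

With the cap in place, a bad size field fails fast instead of triggering a multi-gigabyte allocation.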


immars commented Mar 27, 2015

Thanks for the review, @sguada!

weiliu89 added a commit to weiliu89/caffe that referenced this pull request Apr 14, 2015
Plain Datumfile input format for minimum memory usage
weiliu89 added a commit to weiliu89/caffe that referenced this pull request Apr 16, 2015
@weiliu89

@immars Thanks for the pull! I have been using it, and found that when I start N training jobs accessing the same datumfile, each one only uses 100/N % of the CPU. Is that normal? I am not sure whether it will make training slower or not.


immars commented Apr 18, 2015

@weiliu89 that should not be happening, at least not according to my tests. No locking is used, and the training process should not be I/O bound either. Are you running N processes? What is your iostat -kx 1 output? Or nvidia-smi?

@shelhamer
Member

Closing as better addressed by the Python layer. There are many types of data, and as long as it can be handled in Python it can be handled as a Python layer.

@shelhamer shelhamer closed this Apr 14, 2017