Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Don't Merge] Rebase and Clean up Hdf5DataLayer Prefetch #2892

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

ronghanghu
Copy link
Member

Cleaned up #2271 to adapt to #2836. Original authors are @jeffdonahue and @pclove1

Note: this adds prefetching but disables shuffling rows for hdf5, and still preserve shuffling files.

DO NOT MERGE

@shelhamer shelhamer added the JD label Aug 9, 2015
@shelhamer
Copy link
Member

It might be best if @erictzeng takes a look at this given his recent hdf5 work.

@ronghanghu
Copy link
Member Author

If I understand correctly, shuffling is still preserved in FillHDF5FileData

https://github.com/ronghanghu/caffe/blob/hdf5-prefetch/src/caffe/layers/hdf5_data_layer.cpp#L61-L64

@shelhamer
Copy link
Member

This still shuffles the order of hdf5 files but no longer shuffles rows within hdf5. Generally rows >> files so shuffling is limited by this change. This is no different than lmdb / leveldb however.

Adapt HDF5DataLayer Prefetch to BVLC#2836
@ronghanghu
Copy link
Member Author

@shelhamer OK, I see. This may be a serious drawback of this PR.

Instead of this PR, we can also keep the current shuffle behavior and just implement the prefetch (using additional prefetch memory blob, like in other prefetch data layers). I am hacking that directly based upon #2870.

@ronghanghu ronghanghu changed the title Rebase and Clean up Hdf5DataLayer Prefetch [Don't Merge] Rebase and Clean up Hdf5DataLayer Prefetch Aug 10, 2015
@jeffdonahue
Copy link
Contributor

Another disadvantage of the PR (which I didn't realize until I started using HDF5DataLayer myself after writing the initial version of this PR) is the optimization in the case of the entire dataset being a single HDF5 file -- with the current HDF5DataLayer, the entire file is loaded into memory initially and nothing is ever read from disk again. I'm not sure an HDF5 prefetching PR should be merged until the optimization of this important special case is somehow brought back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants