
Caffe cannot handle HDF5 files as large as 20GB? #2953

Closed
mgarbade opened this issue Aug 21, 2015 · 15 comments

Comments

@mgarbade

I have a training database stored in HDF5 format. However, Caffe immediately breaks down when it tries to train on it. Error message:

I0820 16:56:50.634572 15886 hdf5_data_layer.cpp:80] Loading list of HDF5 filenames from: /home/Databases/train.txt
I0820 16:56:50.634627 15886 hdf5_data_layer.cpp:94] Number of HDF5 files: 1
F0820 16:56:50.655230 15886 blob.cpp:101] Check failed: data_ 
*** Check failure stack trace: ***
    @     0x7f5f7eebcdaa  (unknown)
    @     0x7f5f7eebcce4  (unknown)
    @     0x7f5f7eebc6e6  (unknown)
    @     0x7f5f7eebf687  (unknown)
    @     0x7f5f7f2b63ce  caffe::Blob<>::mutable_cpu_data()
    @     0x7f5f7f20e85d  caffe::hdf5_load_nd_dataset<>()
    @     0x7f5f7f2575ae  caffe::HDF5DataLayer<>::LoadHDF5FileData()
    @     0x7f5f7f2563d8  caffe::HDF5DataLayer<>::LayerSetUp()
    @     0x7f5f7f2d0332  caffe::Net<>::Init()
    @     0x7f5f7f2d1df2  caffe::Net<>::Net()
    @     0x7f5f7f2ddec0  caffe::Solver<>::InitTrainNet()
    @     0x7f5f7f2defd3  caffe::Solver<>::Init()
    @     0x7f5f7f2df1a6  caffe::Solver<>::Solver()
    @           0x40c4b0  caffe::GetSolver<>()
    @           0x406481  train()
    @           0x404a21  main
    @     0x7f5f7e3cdec5  (unknown)
    @           0x404fcd  (unknown)
    @              (nil)  (unknown)

When I split off a smaller chunk of my training database (~13GB), everything works fine (all other parameters remained unchanged).
So I guess Caffe has a problem with large HDF5 files?
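
For reference, a short h5py script can show the shape, dtype, and total element count of each dataset in such a file (a sketch only: the path below is illustrative, and h5py/numpy are assumed to be available):

```python
import numpy as np
import h5py

# Illustrative path; point this at the file listed in train.txt.
with h5py.File('/home/Databases/trainDataset.h5', 'r') as f:
    for name, dset in f.items():
        # Assumes the top level of the file contains only datasets.
        print(name, dset.shape, dset.dtype,
              'elements:', int(np.prod(dset.shape)))
```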

@bhack
Contributor

bhack commented Aug 21, 2015

You need to compile Caffe in debug mode, run it with gdb, and send the stack trace.

@mgarbade
Author

So I compiled Caffe in debug mode. This is the output:

I0828 11:55:43.010573  9445 hdf5_data_layer.cpp:94] Number of HDF5 files: 1
I0828 11:55:43.010650  9445 hdf5_data_layer.cpp:29] Loading HDF5 file:     /path/to/data/trainDataset.h5
F0828 11:55:43.055294  9445 blob.cpp:29] Check failed: shape[i] <= 2147483647 / count_ (833 vs.     715) blob size exceeds INT_MAX
*** Check failure stack trace: ***
    @     0x7f51df2f6daa  (unknown)
    @     0x7f51df2f6ce4  (unknown)
    @     0x7f51df2f66e6  (unknown)
    @     0x7f51df2f9687  (unknown)
    @     0x7f51dfb254dd  caffe::Blob<>::Reshape()
    @     0x7f51dfa7132d  caffe::hdf5_load_nd_dataset_helper<>()
    @     0x7f51dfa70006  caffe::hdf5_load_nd_dataset<>()
    @     0x7f51dfab2e9f  caffe::HDF5DataLayer<>::LoadHDF5FileData()
    @     0x7f51dfab25e0  caffe::HDF5DataLayer<>::LayerSetUp()
    @     0x7f51dfacf4ba  caffe::Layer<>::SetUp()
    @     0x7f51dfb32602  caffe::Net<>::Init()
    @     0x7f51dfb30779  caffe::Net<>::Net()
    @     0x7f51dfb4fe43  caffe::Solver<>::InitTrainNet()
    @     0x7f51dfb4f665  caffe::Solver<>::Init()
    @     0x7f51dfb4f15a  caffe::Solver<>::Solver()
    @           0x41b9e3  caffe::SGDSolver<>::SGDSolver()
    @           0x419363  caffe::GetSolver<>()
    @           0x41503b  train()
    @           0x4173fa  main
    @     0x7f51de807ec5  (unknown)
    @           0x413fd9  (unknown)
    @              (nil)  (unknown)

Unfortunately I don't know how to use gdb to help further in this case.

@bhack
Contributor

bhack commented Aug 28, 2015

Never mind. That is enough.

@bhack
Contributor

bhack commented Aug 28, 2015

There is an intrinsic limit on the blob shape size: CHECK_LE(shape[i], INT_MAX / count_).
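
In Python terms, that check amounts to the following sketch (a rough equivalent of the validation in Blob::Reshape, not Caffe's actual code):

```python
INT_MAX = 2**31 - 1  # 2147483647

def check_blob_shape(shape):
    """Mimic the blob.cpp check: the running element count must stay <= INT_MAX.
    Assumes positive dimensions."""
    count = 1
    for dim in shape:
        # Equivalent to CHECK_LE(shape[i], INT_MAX / count_); on failure
        # Caffe aborts with "blob size exceeds INT_MAX".
        if dim > INT_MAX // count:
            raise ValueError('blob size exceeds INT_MAX: %r' % (shape,))
        count *= dim
    return count

print(check_blob_shape([100, 3, 256, 256]))    # small blob: fine
# check_blob_shape([40000, 3, 256, 256])       # > INT_MAX elements: raises
```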

@bhack
Contributor

bhack commented Aug 28, 2015

So the blob has a 2GB limit minus 1 byte. You are over this limit.

@mgarbade
Author

OK. But what should I do then? I guess the number of training samples should not matter; surely there are people with more than 2GB of training data.

I could cut my training data into chunks of < 2GB, train on the first chunk, save the caffemodel file, then load the next chunk and finetune the caffemodel on it, and so on...

Or is there a more elegant way?

Thanks for your help so far

@bhack
Contributor

bhack commented Aug 30, 2015

This is not a bug. You need to close this ticket and continue the discussion on the caffe-users mailing list.

@seanbell

@mgarbade I believe you can have multiple HDF5 files, each with less than 2GB of data, where the combination of all of them is above 2GB. You specify all the files in a list, and the data layer will then cycle through that list. You can also get it to shuffle the list of files itself. See: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/hdf5_data_layer.cpp#L138
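
A minimal sketch of that workaround, assuming the data is prepared in Python with h5py and numpy (the dataset names 'data'/'label', the paths, and the array shapes are illustrative only):

```python
import numpy as np
import h5py

def split_to_hdf5(data, labels, out_prefix, samples_per_file):
    """Write data/labels in chunks, one HDF5 file per chunk; return the paths."""
    paths = []
    for i, start in enumerate(range(0, len(data), samples_per_file)):
        path = '%s_%03d.h5' % (out_prefix, i)
        with h5py.File(path, 'w') as f:
            f.create_dataset('data', data=data[start:start + samples_per_file])
            f.create_dataset('label', data=labels[start:start + samples_per_file])
        paths.append(path)
    return paths

# Toy arrays just to show the mechanics; replace with your real data.
data = np.zeros((1000, 3, 32, 32), dtype=np.float32)
labels = np.zeros((1000, 1), dtype=np.float32)
paths = split_to_hdf5(data, labels, '/tmp/train', samples_per_file=250)

# train.txt is what the HDF5Data layer's `source` parameter reads:
# one file path per line, cycled (and optionally shuffled) during training.
with open('/tmp/train.txt', 'w') as f:
    f.write('\n'.join(paths) + '\n')
```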

@lukeyeager
Contributor

I just ran into this issue as well.

My batch size is 100, so my blob shape should be (100,3,256,256). Altogether, that's 19,660,800 floats. Should be fine.

19M << 2147M (INT_MAX)

The whole dataset, however, would be (35676,3,256,256) as a single blob. Altogether that's 7,014,187,008 floats.

7014M > 2147M (INT_MAX)
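
Spelling out the arithmetic (element counts only, independent of dtype):

```python
INT_MAX = 2**31 - 1                    # 2,147,483,647
batch_blob = 100 * 3 * 256 * 256       # 19,660,800
dataset_blob = 35676 * 3 * 256 * 256   # 7,014,187,008
print(batch_blob <= INT_MAX)           # True  -> a per-batch blob would be fine
print(dataset_blob <= INT_MAX)         # False -> the whole dataset as one blob is not
```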

Is the HDF5Data layer trying to read the whole HDF5 dataset into a single blob? Why?

@lukeyeager
Contributor

I just verified that a batch size of 10,923 fails (10923 * 3 * 256 * 256 = 2,147,549,184) and a batch size of 10,922 doesn't (10922 * 3 * 256 * 256 = 2,147,352,576). That is true whether the HDF5 dataset dtype is float32 (8.1G file) or uint8 (2.1G file) (requires #2978 to test). So the actual file size doesn't matter. What matters is the product of the dimensions.
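
That boundary follows directly from the element-count limit; a quick sketch (not part of Caffe) of the largest safe batch size for a 3x256x256 sample:

```python
INT_MAX = 2**31 - 1
per_sample = 3 * 256 * 256                 # 196,608 elements per sample
max_batch = INT_MAX // per_sample          # 10,922

print(max_batch)
print(max_batch * per_sample)              # 2,147,352,576 <= INT_MAX: passes
print((max_batch + 1) * per_sample)        # 2,147,549,184 >  INT_MAX: fails
```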

  1. Why is everyone talking about the filesize as if it matters? Is that a separate error?
  2. Why is there a restriction on the amount of data in a blob? If it's an indexing problem, why wouldn't it be UINT_MAX?
  3. Why does the HDF5Data layer read all data into a single blob?

@bhack
Contributor

bhack commented Sep 2, 2015

@lukeyeager For (1), I was always talking about the blob limit of 2GB minus 1 byte, i.e. 2147483647.

@bhack
Contributor

bhack commented Sep 2, 2015

For (2), see #1470.

@lukeyeager
Contributor

(1) Yeah, but if it's 4 bytes per number (for the float32 dtype), isn't that an (implicit) 8GB file limit? That's what I'm seeing.

(2,3) Aha, so the HDF5Data layer doesn't prefetch? That's vexing. I still don't see a need for the INT_MAX limit, but it won't matter after #2892.

@rogertrullo

Hi, as @lukeyeager said, I can't see the need for using a signed int for the count variable in blobs. Is there a particular reason for this, instead of using an unsigned int? I am having issues with big 3D data.
Thanks!

@shelhamer
Member

Closing as duplicate of #1470.
