
Caffe cannot handle HDF5 files as large as 20GB? #2953

Closed
mgarbade opened this issue Aug 21, 2015 · 15 comments

Comments

@mgarbade

I have a training database stored in HDF5 format. However, Caffe immediately breaks down when it tries to train on it. Error message:

I0820 16:56:50.634572 15886 hdf5_data_layer.cpp:80] Loading list of HDF5 filenames from: /home/Databases/train.txt
I0820 16:56:50.634627 15886 hdf5_data_layer.cpp:94] Number of HDF5 files: 1
F0820 16:56:50.655230 15886 blob.cpp:101] Check failed: data_ 
*** Check failure stack trace: ***
    @     0x7f5f7eebcdaa  (unknown)
    @     0x7f5f7eebcce4  (unknown)
    @     0x7f5f7eebc6e6  (unknown)
    @     0x7f5f7eebf687  (unknown)
    @     0x7f5f7f2b63ce  caffe::Blob<>::mutable_cpu_data()
    @     0x7f5f7f20e85d  caffe::hdf5_load_nd_dataset<>()
    @     0x7f5f7f2575ae  caffe::HDF5DataLayer<>::LoadHDF5FileData()
    @     0x7f5f7f2563d8  caffe::HDF5DataLayer<>::LayerSetUp()
    @     0x7f5f7f2d0332  caffe::Net<>::Init()
    @     0x7f5f7f2d1df2  caffe::Net<>::Net()
    @     0x7f5f7f2ddec0  caffe::Solver<>::InitTrainNet()
    @     0x7f5f7f2defd3  caffe::Solver<>::Init()
    @     0x7f5f7f2df1a6  caffe::Solver<>::Solver()
    @           0x40c4b0  caffe::GetSolver<>()
    @           0x406481  train()
    @           0x404a21  main
    @     0x7f5f7e3cdec5  (unknown)
    @           0x404fcd  (unknown)
    @              (nil)  (unknown)

When I split off a smaller chunk of my training database (~13GB), everything works fine (all other parameters remained unchanged).
So I guess Caffe has a problem with large HDF5 files?
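
For reference, a short h5py script can show the shape, dtype, and total element count of each dataset in such a file (a sketch only: the path below is illustrative, and h5py/numpy are assumed to be available):

```python
import numpy as np
import h5py

# Illustrative path; point this at the file listed in train.txt.
with h5py.File('/home/Databases/trainDataset.h5', 'r') as f:
    for name, dset in f.items():
        # Assumes the top level of the file contains only datasets.
        print(name, dset.shape, dset.dtype,
              'elements:', int(np.prod(dset.shape)))
```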

@bhack
Contributor

bhack commented Aug 21, 2015

You need to compile Caffe in debug mode, run it with gdb, and send the stack trace.

@mgarbade
Author

So I compiled Caffe in debug mode. This is the output:

I0828 11:55:43.010573  9445 hdf5_data_layer.cpp:94] Number of HDF5 files: 1
I0828 11:55:43.010650  9445 hdf5_data_layer.cpp:29] Loading HDF5 file:     /path/to/data/trainDataset.h5
F0828 11:55:43.055294  9445 blob.cpp:29] Check failed: shape[i] <= 2147483647 / count_ (833 vs.     715) blob size exceeds INT_MAX
*** Check failure stack trace: ***
    @     0x7f51df2f6daa  (unknown)
    @     0x7f51df2f6ce4  (unknown)
    @     0x7f51df2f66e6  (unknown)
    @     0x7f51df2f9687  (unknown)
    @     0x7f51dfb254dd  caffe::Blob<>::Reshape()
    @     0x7f51dfa7132d  caffe::hdf5_load_nd_dataset_helper<>()
    @     0x7f51dfa70006  caffe::hdf5_load_nd_dataset<>()
    @     0x7f51dfab2e9f  caffe::HDF5DataLayer<>::LoadHDF5FileData()
    @     0x7f51dfab25e0  caffe::HDF5DataLayer<>::LayerSetUp()
    @     0x7f51dfacf4ba  caffe::Layer<>::SetUp()
    @     0x7f51dfb32602  caffe::Net<>::Init()
    @     0x7f51dfb30779  caffe::Net<>::Net()
    @     0x7f51dfb4fe43  caffe::Solver<>::InitTrainNet()
    @     0x7f51dfb4f665  caffe::Solver<>::Init()
    @     0x7f51dfb4f15a  caffe::Solver<>::Solver()
    @           0x41b9e3  caffe::SGDSolver<>::SGDSolver()
    @           0x419363  caffe::GetSolver<>()
    @           0x41503b  train()
    @           0x4173fa  main
    @     0x7f51de807ec5  (unknown)
    @           0x413fd9  (unknown)
    @              (nil)  (unknown)

Unfortunately I don't know how to use gdb to help further in this case.

@bhack
Contributor

bhack commented Aug 28, 2015

Never mind. That is enough.

@bhack
Contributor

bhack commented Aug 28, 2015

There is an intrinsic limit on the blob shape size: CHECK_LE(shape[i], INT_MAX / count_).
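
In Python terms, that check amounts to the following sketch (a rough equivalent of the validation in Blob::Reshape, not Caffe's actual code):

```python
INT_MAX = 2**31 - 1  # 2147483647

def check_blob_shape(shape):
    """Mimic the blob.cpp check: the running element count must stay <= INT_MAX.
    Assumes positive dimensions."""
    count = 1
    for dim in shape:
        # Equivalent to CHECK_LE(shape[i], INT_MAX / count_); on failure
        # Caffe aborts with "blob size exceeds INT_MAX".
        if dim > INT_MAX // count:
            raise ValueError('blob size exceeds INT_MAX: %r' % (shape,))
        count *= dim
    return count

print(check_blob_shape([100, 3, 256, 256]))    # small blob: fine
# check_blob_shape([40000, 3, 256, 256])       # > INT_MAX elements: raises
```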

@bhack
Contributor

bhack commented Aug 28, 2015

So the blob has a 2GB limit minus 1 byte. You are over this limit.

@mgarbade
Author

OK. But what should I do then? I guess the number of training samples should not matter; surely there are people with more than 2GB of training data.

I could cut my training data into chunks of < 2GB, train on the first chunk, save the caffemodel file, then load the next chunk and finetune the caffemodel on it, and so on...

Or is there a more elegant way?

Thanks for your help so far

@bhack
Contributor

bhack commented Aug 30, 2015

This is not a bug. You need to close this ticket and continue the discussion on the caffe-users mailing list.

@seanbell

@mgarbade I believe you can have multiple HDF5 files, each with less than 2GB of data, where the combination of all of them is above 2GB. You specify all the files in a list, and the data layer will then cycle through that list. You can also get it to shuffle the list of files itself. See: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/hdf5_data_layer.cpp#L138
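
A minimal sketch of that workaround, assuming the data is prepared in Python with h5py and numpy (the dataset names 'data'/'label', the paths, and the array shapes are illustrative only):

```python
import numpy as np
import h5py

def split_to_hdf5(data, labels, out_prefix, samples_per_file):
    """Write data/labels in chunks, one HDF5 file per chunk; return the paths."""
    paths = []
    for i, start in enumerate(range(0, len(data), samples_per_file)):
        path = '%s_%03d.h5' % (out_prefix, i)
        with h5py.File(path, 'w') as f:
            f.create_dataset('data', data=data[start:start + samples_per_file])
            f.create_dataset('label', data=labels[start:start + samples_per_file])
        paths.append(path)
    return paths

# Toy arrays just to show the mechanics; replace with your real data.
data = np.zeros((1000, 3, 32, 32), dtype=np.float32)
labels = np.zeros((1000, 1), dtype=np.float32)
paths = split_to_hdf5(data, labels, '/tmp/train', samples_per_file=250)

# train.txt is what the HDF5Data layer's `source` parameter reads:
# one file path per line, cycled (and optionally shuffled) during training.
with open('/tmp/train.txt', 'w') as f:
    f.write('\n'.join(paths) + '\n')
```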

@lukeyeager
Contributor

I just ran into this issue as well.

My batch size is 100, so my blob shape should be (100,3,256,256). Altogether, that's 19,660,800 floats. Should be fine.

19M << 2147M (INT_MAX)

The whole dataset, however, would be (35676,3,256,256) as a single blob. Altogether that's 7,014,187,008 floats.

7014M > 2147M (INT_MAX)
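
Spelling out the arithmetic (element counts only, independent of dtype):

```python
INT_MAX = 2**31 - 1                    # 2,147,483,647
batch_blob = 100 * 3 * 256 * 256       # 19,660,800
dataset_blob = 35676 * 3 * 256 * 256   # 7,014,187,008
print(batch_blob <= INT_MAX)           # True  -> a per-batch blob would be fine
print(dataset_blob <= INT_MAX)         # False -> the whole dataset as one blob is not
```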

Is the HDF5Data layer trying to read the whole HDF5 dataset into a single blob? Why?

@lukeyeager
Contributor

I just verified that a batch size of 10,923 fails (10923 * 3 * 256 * 256 = 2,147,549,184) and a batch size of 10,922 doesn't (10922 * 3 * 256 * 256 = 2,147,352,576). That is true whether the HDF5 dataset dtype is float32 (8.1G file) or uint8 (2.1G file) (requires #2978 to test). So the actual file size doesn't matter. What matters is the product of the dimensions.
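
That boundary follows directly from the element-count limit; a quick sketch (not part of Caffe) of the largest safe batch size for a 3x256x256 sample:

```python
INT_MAX = 2**31 - 1
per_sample = 3 * 256 * 256                 # 196,608 elements per sample
max_batch = INT_MAX // per_sample          # 10,922

print(max_batch)
print(max_batch * per_sample)              # 2,147,352,576 <= INT_MAX: passes
print((max_batch + 1) * per_sample)        # 2,147,549,184 >  INT_MAX: fails
```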

  1. Why is everyone talking about the filesize as if it matters? Is that a separate error?
  2. Why is there a restriction on the amount of data in a blob? If it's an indexing problem, why wouldn't it be UINT_MAX?
  3. Why does the HDF5Data layer read all data into a single blob?

@bhack
Contributor

bhack commented Sep 2, 2015

@lukeyeager For (1), I was always talking about the blob limit of 2GB minus 1 byte, i.e. 2147483647.

@bhack
Contributor

bhack commented Sep 2, 2015

For (2), see #1470.

@lukeyeager
Contributor

(1) Yeah, but if it's 4 bytes per number (for the float32 dtype), isn't that an (implicit) 8GB file limit? That's what I'm seeing.

(2,3) Aha, so the HDF5Data layer doesn't prefetch? That's vexing. I still don't see a need for the INT_MAX limit, but it won't matter after #2892.

@rogertrullo

Hi, as @lukeyeager said, I can't see the need for using a signed int for the count variable in blobs. Is there a particular reason for this, instead of using an unsigned int? I am having issues with big 3D data.
Thanks!

@shelhamer
Member

Closing as duplicate of #1470.
