Segmentation fault with large hdf5 file #1322

Closed
twerdster opened this issue Oct 19, 2014 · 5 comments

@twerdster

I originally tested my network on a 6GB HDF5 file of 25,000 128x128 images and 25,000 9-dimensional labels, and everything works and trains really well. Then I recreated my dataset with 10 variations of each image, and the HDF5 file became 35GB (i.e. 250k images). I apply exactly the same solver and network .prototxt files and it crashes with a segmentation fault.
(If I knew how to use leveldb to create a dataset with a 9D float label then I would just modify the create_imagenet.cpp file, but the Datum has only a char label that I don't know how to modify to what I need...)

Here is the command, run from IPython:
!../../build/tools/caffe train -solver solver.prototxt

And here is the output:

I1019 03:50:39.860368 6013 caffe.cpp:99] Use GPU with device ID 0
I1019 03:50:39.994917 6013 caffe.cpp:107] Starting Optimization
I1019 03:50:39.994995 6013 solver.cpp:32] Initializing solver from parameters:
test_iter: 15
test_interval: 500
base_lr: 0.001
display: 20
max_iter: 100000
lr_policy: "step"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0005
stepsize: 10000
snapshot: 2000
snapshot_prefix: "mynet_train"
solver_mode: GPU
net: "train_val.prototxt"
I1019 03:50:39.995048 6013 solver.cpp:67] Creating training net from net file: train_val.prototxt
I1019 03:50:39.995362 6013 net.cpp:39] Initializing net from parameters:
name: "Mynet"
layers {
top: "data"
top: "label"
name: "data"
type: HDF5_DATA
hdf5_data_param {
source: "train_data.txt"
batch_size: 2
}
}
layers {
bottom: "data"
top: "conv1"
name: "conv1"
type: CONVOLUTION
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 96
kernel_size: 23
stride: 2
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
value: 0
}
}
}
layers {
bottom: "conv1"
top: "conv1"
name: "relu1"
type: RELU
}
layers {
bottom: "conv1"
top: "norm1"
name: "norm1"
type: LRN
lrn_param {
local_size: 5
alpha: 0.0001
beta: 0.75
}
}
layers {
bottom: "norm1"
top: "pool1"
name: "pool1"
type: POOLING
pooling_param {
pool: MAX
kernel_size: 3
stride: 2
}
}
layers {
bottom: "pool1"
top: "conv2"
name: "conv2"
type: CONVOLUTION
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 256
kernel_size: 5
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
value: 0
}
}
}
layers {
bottom: "conv2"
top: "conv2"
name: "relu2"
type: RELU
}
layers {
bottom: "conv2"
top: "norm2"
name: "norm2"
type: LRN
lrn_param {
local_size: 5
alpha: 0.0001
beta: 0.75
}
}
layers {
bottom: "norm2"
top: "pool2"
name: "pool2"
type: POOLING
pooling_param {
pool: MAX
kernel_size: 3
stride: 2
}
}
layers {
bottom: "pool2"
top: "fc1"
name: "fc1"
type: INNER_PRODUCT
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
inner_product_param {
num_output: 1000
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
}
}
layers {
bottom: "fc1"
top: "fc1"
name: "relu2"
type: RELU
}
layers {
bottom: "fc1"
top: "fc2"
name: "fc2"
type: INNER_PRODUCT
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
inner_product_param {
num_output: 200
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
}
}
layers {
bottom: "fc2"
top: "fc3"
name: "fc3"
type: INNER_PRODUCT
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
inner_product_param {
num_output: 9
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
}
}
layers {
bottom: "fc3"
bottom: "label"
top: "loss"
name: "loss"
type: EUCLIDEAN_LOSS
}
state {
phase: TRAIN
}
I1019 03:50:39.995841 6013 net.cpp:67] Creating Layer data
I1019 03:50:39.995851 6013 net.cpp:356] data -> data
I1019 03:50:39.995863 6013 net.cpp:356] data -> label
I1019 03:50:39.995872 6013 net.cpp:96] Setting up data
I1019 03:50:39.995879 6013 hdf5_data_layer.cpp:57] Loading filename from train_data.txt
I1019 03:50:39.995925 6013 hdf5_data_layer.cpp:69] Number of files: 1
I1019 03:50:39.995932 6013 hdf5_data_layer.cpp:29] Loading HDF5 file/home/gipuser/Documents/Synch1_10/data_train.h5
Segmentation fault (core dumped)

Any ideas on how to get over this issue?

@shelhamer
Member

Split your input into separate data and label sources and define two data layers. Store the images in an LMDB and the 9D labels in HDF5. Two data layers work fine because Caffe understands DAGs.

Please search and post on the caffe-users mailing list for modeling questions and usage of multiple data layers.

Thanks for noting the HDF5 crash under stress.
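
For reference, here is a minimal sketch of the HDF5 half of that setup, assuming the 9D targets are already in a NumPy array; the file and dataset names (train_labels.h5, the "label" dataset matching the top: "label" blob, train_label_list.txt) are placeholders:

```python
import h5py
import numpy as np

# Hypothetical (N, 9) float32 array of regression targets.
labels = np.random.rand(1000, 9).astype(np.float32)

# The HDF5 data layer reads datasets named after its top blobs,
# so the dataset is called "label" to line up with top: "label".
with h5py.File('train_labels.h5', 'w') as f:
    f.create_dataset('label', data=labels)

# hdf5_data_param.source points at a text file listing one HDF5 file per line.
with open('train_label_list.txt', 'w') as f:
    f.write('train_labels.h5\n')
```

The images themselves would go into an LMDB (e.g. via the convert_imageset tool) and be read by a separate DATA layer using the same batch_size as the HDF5 label layer.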

@nihiladrem

I believe this is probably not an HDF5 issue, but rather related to the Reshape function for the data blob. The capacity_ and count_ variables are both 32-bit signed ints, and there is no check for integer overflow. A data file with a large number of images (that are themselves not small) can easily lead to an overflow.

@ryanamelon

I believe nihiladrem is correct.
The count_ variable that defines capacity_ is {num_images} x {channels} x {height} x {width} (within blob.cpp, line 19).

The maximum value of count_ as a 32-bit signed int is about 2.1x10^9 (it is declared as int in blob.hpp, line 136).

With 25,000 single-channel 128x128 images, count_ is about 4.1x10^8, which is within bounds.

With 10 variations of each image, count_ would be pushed to about 4.1x10^9, which it cannot store.
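
To make the numbers concrete, here is a quick sanity check of that overflow (a sketch, using the single-channel 128x128 assumption above):

```python
INT32_MAX = 2**31 - 1             # largest value a 32-bit signed int can hold

def as_int32(x):
    """Simulate 32-bit signed wraparound, i.e. what count_ ends up holding."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x

n, c, h, w = 250000, 1, 128, 128
count = n * c * h * w
print(count)                      # 4096000000 -> no longer fits in int32
print(count > INT32_MAX)          # True
print(as_int32(count))            # -198967296: the wrapped value the blob sees

# Largest number of single-channel 128x128 images one blob can hold safely:
print(INT32_MAX // (c * h * w))   # 131071
```

So any single HDF5 file with more than roughly 131k of these images overflows the count; splitting the data into several smaller files listed in the source text file stays under the limit.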

@twerdster
Author

Yes, indeed this does seem to be the issue. I changed the method to use multiple input files as Evan suggested and it works almost as expected. However, there is a strange behaviour every time a new input hdf5 file is loaded in: the loss jumps dramatically. It stabilizes after a few hundred iterations, but it doesn't make sense to me that it would jump. Are the gradients being reset accidentally? Or is there some other initialization procedure that might affect the loss after loading in a new file?

For example:
I1025 18:07:56.066611 25680 solver.cpp:206] Train net output #0: loss = 0.0599947 (* 1 = 0.0599947 loss)
I1025 18:07:56.066622 25680 solver.cpp:403] Iteration 740, lr = 0.01
I1025 18:08:02.819130 25680 hdf5_data_layer.cpp:29] Loading HDF5 file/home/gipuser/Documents/Synch1_5_Pose/data_train_004.h5
I1025 18:08:28.625567 25680 hdf5_data_layer.cpp:49] Successully loaded 19001 rows
I1025 18:08:28.756755 25680 solver.cpp:191] Iteration 760, loss = 0.324694
I1025 18:08:28.756798 25680 solver.cpp:206] Train net output #0: loss = 0.324694 (* 1 = 0.324694 loss)
I1025 18:08:28.756811 25680 solver.cpp:403] Iteration 760, lr = 0.01
I1025 18:08:34.151319 25680 solver.cpp:191] Iteration 780, loss = 0.107596
I1025 18:08:34.151368 25680 solver.cpp:206] Train net output #0: loss = 0.107596 (* 1 = 0.107596 loss)
I1025 18:08:34.151381 25680 solver.cpp:403] Iteration 780, lr = 0.01
I1025 18:08:39.546066 25680 solver.cpp:191] Iteration 800, loss = 0.106797
I1025 18:08:39.546113 25680 solver.cpp:206] Train net output #0: loss = 0.106797 (* 1 = 0.106797 loss)
I1025 18:08:39.546123 25680 solver.cpp:403] Iteration 800, lr = 0.01
I1025 18:08:46.238445 25680 solver.cpp:191] Iteration 820, loss = 0.108824
I1025 18:08:46.238481 25680 solver.cpp:206] Train net output #0: loss = 0.108824 (* 1 = 0.108824 loss)
I1025 18:08:46.238492 25680 solver.cpp:403] Iteration 820, lr = 0.01
I1025 18:08:53.109293 25680 solver.cpp:191] Iteration 840, loss = 0.0657912

and then for the next file:

I1025 18:09:27.503116 25680 solver.cpp:206] Train net output #0: loss = 0.0503713 (* 1 = 0.0503713 loss)
I1025 18:09:27.503142 25680 solver.cpp:403] Iteration 940, lr = 0.01
I1025 18:09:30.816984 25680 hdf5_data_layer.cpp:29] Loading HDF5 file/home/gipuser/Documents/Synch1_5_Pose/data_train_005.h5
I1025 18:09:56.727255 25680 hdf5_data_layer.cpp:49] Successully loaded 19001 rows
I1025 18:09:59.587239 25680 solver.cpp:191] Iteration 960, loss = 0.347449
I1025 18:09:59.587276 25680 solver.cpp:206] Train net output #0: loss = 0.347449 (* 1 = 0.347449 loss)
I1025 18:09:59.587285 25680 solver.cpp:403] Iteration 960, lr = 0.01
I1025 18:10:04.953788 25680 solver.cpp:191] Iteration 980, loss = 0.163337
I1025 18:10:04.954114 25680 solver.cpp:206] Train net output #0: loss = 0.163337 (* 1 = 0.163337 loss)


@shelhamer
Member

@twerdster make sure the data is shuffled -- otherwise the learning can overfit to each chunk of data in sequence with the loss jumping between chunks. Please ask usage questions on the caffe-users group.
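
A minimal sketch of doing both at once (splitting into several HDF5 files and shuffling globally before writing the chunks), assuming the whole dataset fits in memory as NumPy arrays; names like data_train_%03d.h5 and train_data.txt just mirror the paths in the logs above:

```python
import h5py
import numpy as np

def write_shuffled_chunks(images, labels, chunk_size, list_file='train_data.txt'):
    """Shuffle the whole dataset once, then split it into several HDF5 files so
    no single blob exceeds the 32-bit count limit and every chunk is a random
    sample of the full distribution (avoiding per-chunk loss jumps)."""
    n = images.shape[0]
    order = np.random.permutation(n)           # one global shuffle
    images, labels = images[order], labels[order]

    paths = []
    for i, start in enumerate(range(0, n, chunk_size)):
        path = 'data_train_%03d.h5' % i
        with h5py.File(path, 'w') as f:
            f.create_dataset('data', data=images[start:start + chunk_size])
            f.create_dataset('label', data=labels[start:start + chunk_size])
        paths.append(path)

    with open(list_file, 'w') as f:
        f.write('\n'.join(paths) + '\n')
```

With a chunk_size around 19,000, as in the logs above, each file stays far below the int32 limit while the chunks remain statistically interchangeable.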
