Segmentation fault with large hdf5 file #1322
Comments
Split your input into separate data and label sources and define two data layers. Please search and post on the caffe-users mailing list for modeling questions. Thanks for noting the HDF5 crash under stress.
Evan Shelhamer
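A sketch of that suggestion in the V1 prototxt syntax used below, assuming an HDF5 data layer that reads one dataset per top (as in later Caffe versions); the source file names are hypothetical, and each .txt would list HDF5 files containing only the matching dataset:

layers {
  top: "data"
  name: "data"
  type: HDF5_DATA
  hdf5_data_param {
    source: "train_data_only.txt"
    batch_size: 2
  }
}
layers {
  top: "label"
  name: "label"
  type: HDF5_DATA
  hdf5_data_param {
    source: "train_label_only.txt"
    batch_size: 2
  }
}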
I believe that this is probably not an HDF5 issue, but rather related to the reshape function for the data blob. The capacity_ and count_ variables are both 32-bit signed ints, and there is no check for integer overflow. A data file with a large number of images (that are themselves not small) can easily lead to an overflow.
I believe nihiladrem is correct. The maximum value of count_ as a 32-bit signed int is 2.1x10^9 (it is declared as int in blob.hpp, line 136). With 25,000 128x128 images (assuming a single channel), count_ is 4.1x10^8, which is within bounds. With 10 variations of each, count_ would be pushed to 4.1x10^9, which it cannot store.
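The arithmetic is easy to verify with a quick standalone check (illustrative only, not Caffe code):

#include <cstdint>
#include <iostream>
#include <limits>

int main() {
  // Dimensions from this thread: 250k single-channel 128x128 images.
  const int64_t num = 250000, channels = 1, height = 128, width = 128;
  const int64_t count = num * channels * height * width;
  std::cout << "count = " << count << "\n";  // 4096000000
  std::cout << "fits in int32? "
            << (count <= std::numeric_limits<int32_t>::max()) << "\n";  // 0
  return 0;
}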
Yes, indeed this does seem to be the issue. I changed the method to use [...] For example:

I1025 18:09:27.503116 25680 solver.cpp:206] Train net output #0: loss = [...]
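twerdster's message above is truncated, so the exact change is not recorded here; one way to guard the reshape along the lines of nihiladrem's diagnosis would be to accumulate the count in 64 bits and fail loudly on overflow (a sketch, not the actual Caffe patch):

#include <cstdint>
#include <limits>
#include <stdexcept>

// Illustrative guard: compute the element count in 64 bits and reject
// any shape whose count cannot fit in a 32-bit signed count_.
int SafeCount(int num, int channels, int height, int width) {
  int64_t count = 1;
  for (int64_t dim : {num, channels, height, width}) {
    count *= dim;
    if (count > std::numeric_limits<int32_t>::max())
      throw std::overflow_error("blob count overflows 32-bit int");
  }
  return static_cast<int>(count);
}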
@twerdster make sure the data is shuffled -- otherwise the learning can overfit to each chunk of data in sequence, with the loss jumping between chunks. Please ask usage questions on the caffe-users group.
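One common way to satisfy this when generating chunked HDF5 data is to shuffle a global index once before writing; a minimal sketch (the example count and seed are assumptions):

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

int main() {
  const int kNumExamples = 250000;  // e.g. 25k base images x 10 variations

  // Shuffle a global index once, then write examples to the HDF5
  // chunks in this order so no chunk is a contiguous run of
  // variations of the same base image.
  std::vector<int> order(kNumExamples);
  std::iota(order.begin(), order.end(), 0);
  std::shuffle(order.begin(), order.end(), std::mt19937(1234));
  // order[i] is the source index of the i-th example written out.
  return 0;
}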
I originally tested my network on a 6 GB HDF5 file of 25,000 128x128 images and 25,000 9-dimensional labels, and everything works and trains really well. Then I recreated my dataset with 10 variations of each image, and the HDF5 file became 35 GB (i.e. 250k images). I apply exactly the same solver and .prototxt files and it crashes with a segmentation fault.
(If I knew how to use leveldb to create a dataset with a 9-dimensional float label then I would just modify the create_imagenet.cpp file, but the Datum has only a char label that I don't know how to modify to what I need...)
Here is the input line in IPython:
!../../build/tools/caffe train -solver solver.prototxt
And here is the output:
I1019 03:50:39.860368 6013 caffe.cpp:99] Use GPU with device ID 0
I1019 03:50:39.994917 6013 caffe.cpp:107] Starting Optimization
I1019 03:50:39.994995 6013 solver.cpp:32] Initializing solver from parameters:
test_iter: 15
test_interval: 500
base_lr: 0.001
display: 20
max_iter: 100000
lr_policy: "step"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0005
stepsize: 10000
snapshot: 2000
snapshot_prefix: "mynet_train"
solver_mode: GPU
net: "train_val.prototxt"
I1019 03:50:39.995048 6013 solver.cpp:67] Creating training net from net file: train_val.prototxt
I1019 03:50:39.995362 6013 net.cpp:39] Initializing net from parameters:
name: "Mynet"
layers {
top: "data"
top: "label"
name: "data"
type: HDF5_DATA
hdf5_data_param {
source: "train_data.txt"
batch_size: 2
}
}
layers {
bottom: "data"
top: "conv1"
name: "conv1"
type: CONVOLUTION
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 96
kernel_size: 23
stride: 2
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
value: 0
}
}
}
layers {
bottom: "conv1"
top: "conv1"
name: "relu1"
type: RELU
}
layers {
bottom: "conv1"
top: "norm1"
name: "norm1"
type: LRN
lrn_param {
local_size: 5
alpha: 0.0001
beta: 0.75
}
}
layers {
bottom: "norm1"
top: "pool1"
name: "pool1"
type: POOLING
pooling_param {
pool: MAX
kernel_size: 3
stride: 2
}
}
layers {
bottom: "pool1"
top: "conv2"
name: "conv2"
type: CONVOLUTION
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
convolution_param {
num_output: 256
kernel_size: 5
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
value: 0
}
}
}
layers {
bottom: "conv2"
top: "conv2"
name: "relu2"
type: RELU
}
layers {
bottom: "conv2"
top: "norm2"
name: "norm2"
type: LRN
lrn_param {
local_size: 5
alpha: 0.0001
beta: 0.75
}
}
layers {
bottom: "norm2"
top: "pool2"
name: "pool2"
type: POOLING
pooling_param {
pool: MAX
kernel_size: 3
stride: 2
}
}
layers {
bottom: "pool2"
top: "fc1"
name: "fc1"
type: INNER_PRODUCT
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
inner_product_param {
num_output: 1000
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
}
}
layers {
bottom: "fc1"
top: "fc1"
name: "relu2"
type: RELU
}
layers {
bottom: "fc1"
top: "fc2"
name: "fc2"
type: INNER_PRODUCT
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
inner_product_param {
num_output: 200
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
}
}
layers {
bottom: "fc2"
top: "fc3"
name: "fc3"
type: INNER_PRODUCT
blobs_lr: 1
blobs_lr: 2
weight_decay: 1
weight_decay: 0
inner_product_param {
num_output: 9
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
}
}
layers {
bottom: "fc3"
bottom: "label"
top: "loss"
name: "loss"
type: EUCLIDEAN_LOSS
}
state {
phase: TRAIN
}
I1019 03:50:39.995841 6013 net.cpp:67] Creating Layer data
I1019 03:50:39.995851 6013 net.cpp:356] data -> data
I1019 03:50:39.995863 6013 net.cpp:356] data -> label
I1019 03:50:39.995872 6013 net.cpp:96] Setting up data
I1019 03:50:39.995879 6013 hdf5_data_layer.cpp:57] Loading filename from train_data.txt
I1019 03:50:39.995925 6013 hdf5_data_layer.cpp:69] Number of files: 1
I1019 03:50:39.995932 6013 hdf5_data_layer.cpp:29] Loading HDF5 file /home/gipuser/Documents/Synch1_10/data_train.h5
Segmentation fault (core dumped)
Any ideas on how to get over this issue?