Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDF5 snapshot crashes on duplicate layer names #2885

Closed
jeffdonahue opened this issue Aug 8, 2015 · 8 comments · Fixed by #3011
Closed

HDF5 snapshot crashes on duplicate layer names #2885

jeffdonahue opened this issue Aug 8, 2015 · 8 comments · Fixed by #3011
Labels

Comments

@jeffdonahue
Copy link
Contributor

With the new HDF5 snapshotting (#2836), if two or more layers of the net have the same name, Caffe crashes when it tries to snapshot. (This includes the case of two layers with unspecified names, which are interpreted by Caffe via the protobuf default as two layers named "".) Maybe Caffe shouldn't allow layers to have the same name (or unspecified names), but it should probably be checked in Net::Init if that's going to be the rule. We might want to temporarily switch the default serialization back to proto until we figure out what to do about this.

An example of the error message that comes with the crash is below (was training this model, which has two layers named data):

I0807 19:34:51.663398 23450 solver.cpp:375] Snapshotting to HDF5 file ./examples/coco_caption/lrcn_iter_5000.caffemodel.h5
HDF5-DIAG: Error detected in HDF5 (1.8.13) thread 139730974058368:
  #000: H5G.c line 312 in H5Gcreate2(): unable to create group
    major: Symbol table
    minor: Unable to initialize object
  #001: H5Gint.c line 194 in H5G__create_named(): unable to create and link to group
    major: Symbol table
    minor: Unable to initialize object
  #002: H5L.c line 1638 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: H5L.c line 1882 in H5L_create_real(): can't insert link
    major: Symbol table
    minor: Unable to insert object
  #004: H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #006: H5L.c line 1674 in H5L_link_cb(): name already exists
    major: Symbol table
    minor: Object already exists
F0807 19:34:51.663725 23450 net.cpp:871] Check failed: layer_data_hid >= 0 (-1 vs. 0) Error saving weights to ./examples/coco_caption/lrcn_iter_5000.caffemodel.h5.
*** Check failure stack trace: ***
    @     0x7f15a65c3daa  (unknown)
    @     0x7f15a65c3ce4  (unknown)
    @     0x7f15a65c36e6  (unknown)
    @     0x7f15a65c6687  (unknown)
    @     0x7f15a6a26ac1  caffe::Net<>::ToHDF5()
    @     0x7f15a6a32f3c  caffe::Solver<>::SnapshotToHDF5()
    @     0x7f15a6a3c790  caffe::Solver<>::Snapshot()
    @     0x7f15a6a3d00c  caffe::Solver<>::Step()
    @     0x7f15a6a3d31f  caffe::Solver<>::Solve()
    @           0x4079b6  train()
    @           0x405d71  main
    @     0x7f15a5063ec5  (unknown)
    @           0x4063cb  (unknown)
    @              (nil)  (unknown)
./examples/coco_caption/train_lrcn.sh: line 17: 23450 Aborted                 (core dumped) ./build/tools/caffe train -solver ./examples/coco_caption/lrcn_solver.prototxt -weights $WEIGHTS -gpu $GPU_ID
jeffdonahue added a commit to jeffdonahue/caffe that referenced this issue Aug 8, 2015
out of anticipation for user issues due to issue BVLC#2885, which causes
Caffe to crash when it attempts to snapshot nets with duplicate layer
names
jeffdonahue added a commit that referenced this issue Aug 8, 2015
out of anticipation for user issues due to issue #2885, which causes
Caffe to crash when it attempts to snapshot nets with duplicate layer
names
@seanbell
Copy link

seanbell commented Aug 8, 2015

I like the idea of disallowing two layers from having the same name, and enforcing it upon initialization. For example, in the python interface (_Net_forward), you can specify a start and end layer, by name. This doesn't work if there are duplicate names. That function (and probably others) seems to assume without checking that the names are unique.

@bhack
Copy link
Contributor

bhack commented Aug 8, 2015

Are you really sure that we want hdf5 default?

@ronghanghu
Copy link
Member

@bhack HDF5 default has been revoked in #2886

@bhack
Copy link
Contributor

bhack commented Aug 8, 2015

I know but it is only a temporary choice caused by the issue generated on merge. I'm really guessing if flatbuffer was evaluated to solve protobuf limit.

@jeffdonahue
Copy link
Contributor Author

@bhack switching all of caffe from PB to flatbuffers is probably a multi-thousand line change... the addition of the option to use HDF5 for snapshotting was a relatively small change that fixes a single isolated issue with PB serialization in Caffe, and is completely backward compatible and optional. Maybe switching to flatbuffers is a good idea, but it's mostly irrelevant here. If you want to discuss flatbuffers, rather than bringing it up in every loosely related discussion, I'd suggest opening an issue for it that lists the advantages of switching and discusses what it would take to port the entirety of Caffe to use it.

@bhack
Copy link
Contributor

bhack commented Aug 8, 2015

I not meant to immediately substitute all protobuf use in caffe with faltbuffers. My comment was releated to snapshotting. Please see my original old undiscussed comment at #2006 (comment)

@jeffdonahue
Copy link
Contributor Author

It's not clear to me that there is any advantage to using flatbuffers over HDF5 purely for the purpose of serialization; it seems like we'd just be needlessly adding a new dependency with little or no improvement in code complexity in that case...

As far as why your comments aren't being discussed extensively, it's because they have little to no content to discuss in them. Simply name-dropping other projects is really not helpful. Post an issue if you want to discuss flatbuffers: describe specifically how you think it could be used in Caffe, what problems it solves, and what its actual technical advantages over the status quo in the context of Caffe are.

@bhack
Copy link
Contributor

bhack commented Aug 9, 2015

Hdf5 was not the status quo for snapshotting at the time of my first comment (on April). And my comment was in the clear context of the issue opened on 2gb size limit of your historical protobuf caffe choice. My thought was really simple: ask to core dev if they had evaluated the natural google successor of protobuf.
Probably there is any advantage or there is someone but a reply months ago like "yes we have evaluated but we don't like flatbuffer" or no we have not evaluated because we like hdf5 as we already depend on it probably could be more constructive that a simple internal shadow planning on a solution that culminate with a "fast track" PR. It is an example of policy for me. So it is not so important hdf5 or flatbuffer. It is always an issue on how do you want to relate to people that are around caffe (I don't know if you want a community).

matthiasplappert pushed a commit to matthiasplappert/caffe that referenced this issue Aug 10, 2015
out of anticipation for user issues due to issue BVLC#2885, which causes
Caffe to crash when it attempts to snapshot nets with duplicate layer
names
rokm pushed a commit to rokm/caffe that referenced this issue Aug 10, 2015
out of anticipation for user issues due to issue BVLC#2885, which causes
Caffe to crash when it attempts to snapshot nets with duplicate layer
names
cbfinn pushed a commit to cbfinn/caffe that referenced this issue Aug 12, 2015
out of anticipation for user issues due to issue BVLC#2885, which causes
Caffe to crash when it attempts to snapshot nets with duplicate layer
names
@longjon longjon added the bug label Aug 16, 2015
darrengarvey added a commit to darrengarvey/caffe that referenced this issue Sep 1, 2015
Commit 4227828 set the default binary format from HDF5 to BINARYPROTO to
fix BVLC#2885. This broke the cifar10 examples which relied on this default.

This commit specifies the snapshot_format explicitly since the rest of the
example relies on this being HDF5.
darrengarvey added a commit to darrengarvey/caffe that referenced this issue Sep 1, 2015
Commit 4227828 set the default binary format from HDF5 to BINARYPROTO to
fix BVLC#2885. This broke the cifar10 examples which relied on this default.

This commit specifies the snapshot_format explicitly since the rest of the
example relies on this being HDF5.
darrengarvey added a commit to darrengarvey/caffe that referenced this issue Sep 1, 2015
Commit 4227828 set the default binary format from HDF5 to BINARYPROTO to
fix BVLC#2885. This broke the cifar10 examples which relied on this default.

This commit specifies the snapshot_format explicitly since the rest of the
example relies on this being HDF5.
wangyida pushed a commit to wangyida/caffe that referenced this issue Sep 14, 2015
out of anticipation for user issues due to issue BVLC#2885, which causes
Caffe to crash when it attempts to snapshot nets with duplicate layer
names
wangyida pushed a commit to wangyida/caffe that referenced this issue Sep 14, 2015
Commit 4227828 set the default binary format from HDF5 to BINARYPROTO to
fix BVLC#2885. This broke the cifar10 examples which relied on this default.

This commit specifies the snapshot_format explicitly since the rest of the
example relies on this being HDF5.
wangyida pushed a commit to wangyida/caffe that referenced this issue Sep 15, 2015
out of anticipation for user issues due to issue BVLC#2885, which causes
Caffe to crash when it attempts to snapshot nets with duplicate layer
names
wangyida pushed a commit to wangyida/caffe that referenced this issue Sep 15, 2015
Commit 4227828 set the default binary format from HDF5 to BINARYPROTO to
fix BVLC#2885. This broke the cifar10 examples which relied on this default.

This commit specifies the snapshot_format explicitly since the rest of the
example relies on this being HDF5.
wangyida pushed a commit to wangyida/caffe that referenced this issue Sep 15, 2015
out of anticipation for user issues due to issue BVLC#2885, which causes
Caffe to crash when it attempts to snapshot nets with duplicate layer
names
wangyida pushed a commit to wangyida/caffe that referenced this issue Sep 15, 2015
Commit 4227828 set the default binary format from HDF5 to BINARYPROTO to
fix BVLC#2885. This broke the cifar10 examples which relied on this default.

This commit specifies the snapshot_format explicitly since the rest of the
example relies on this being HDF5.
wangyida pushed a commit to wangyida/caffe that referenced this issue Sep 16, 2015
out of anticipation for user issues due to issue BVLC#2885, which causes
Caffe to crash when it attempts to snapshot nets with duplicate layer
names
wangyida pushed a commit to wangyida/caffe that referenced this issue Sep 16, 2015
Commit 4227828 set the default binary format from HDF5 to BINARYPROTO to
fix BVLC#2885. This broke the cifar10 examples which relied on this default.

This commit specifies the snapshot_format explicitly since the rest of the
example relies on this being HDF5.
wangyida pushed a commit to wangyida/caffe that referenced this issue Sep 22, 2015
out of anticipation for user issues due to issue BVLC#2885, which causes
Caffe to crash when it attempts to snapshot nets with duplicate layer
names
wangyida pushed a commit to wangyida/caffe that referenced this issue Sep 22, 2015
Commit 4227828 set the default binary format from HDF5 to BINARYPROTO to
fix BVLC#2885. This broke the cifar10 examples which relied on this default.

This commit specifies the snapshot_format explicitly since the rest of the
example relies on this being HDF5.
acmiyaguchi pushed a commit to acmiyaguchi/caffe that referenced this issue Nov 13, 2017
Commit 4227828 set the default binary format from HDF5 to BINARYPROTO to
fix BVLC#2885. This broke the cifar10 examples which relied on this default.

This commit specifies the snapshot_format explicitly since the rest of the
example relies on this being HDF5.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants