Data source identification breaks the loading of multiple instances of the same net #3108

Closed
beniz opened this issue Sep 22, 2015 · 8 comments

beniz commented Sep 22, 2015

Source identification at https://github.com/BVLC/caffe/blob/master/include/caffe/data_reader.hpp#L68 was introduced by bcc8f50 and triggers the fatal check at https://github.com/BVLC/caffe/blob/master/src/caffe/data_reader.cpp#L98 whenever two nearly identical nets are trained concurrently (e.g. on a single GPU).

The problem occurs if two nets are trained concurrently and share:

  • layer names
  • data source

This typically occurs when training several nets with different layer parameters but identical source and layer names.
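
For illustration, here is a minimal sketch of how the collision arises, assuming the reader keys its shared queues on a string built from the layer name and the data source (the function and map names below are simplified stand-ins, not the actual Caffe internals):

```cpp
#include <iostream>
#include <map>
#include <string>

// Simplified stand-in for the per-source body map kept by the data reader.
static std::map<std::string, int> bodies_;

// Hypothetical key builder: layer name and source joined with ':'.
std::string source_key(const std::string& layer_name, const std::string& source) {
  return layer_name + ":" + source;
}

void register_source(const std::string& layer_name, const std::string& source) {
  const std::string key = source_key(layer_name, source);
  if (bodies_.count(key)) {
    // In the real DataReader this duplicate key trips a CHECK and aborts.
    std::cerr << "FATAL: source already registered: " << key << "\n";
    return;
  }
  bodies_[key] = 1;
}

int main() {
  register_source("data", "examples/train_lmdb");  // net 1: registers fine
  register_source("data", "examples/train_lmdb");  // net 2: same names and source -> fatal check
}
```

In the real DataReader the duplicate key is a CHECK failure that aborts the process, which is the failure reported here.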

My current workaround is to enrich the source identification routine with a hash of the running thread, but my understanding is that this might break the original detection of identical sources within the same net. For that reason I am holding off on a PR until I can gather more thoughts on this issue.

mohomran (Contributor) commented:

Different symptom but same underlying issue reported here: #3037

longjon (Contributor) commented Nov 5, 2015

I can confirm that this is a bug in need of attention.

  • This breaks multiple nets using the same source (even, e.g., reloading a net from an interactive Python session).
  • The encoding of sources as strings is bogus and unnecessary: ":" is a valid character in both layer names and file names, so spurious collisions are possible (e.g., a layer named "data:train" reading "db" and a layer named "data" reading "train:db" both encode to "data:train:db").

beniz (Author) commented Nov 5, 2015

Then for the sake of discussion, here is my fix from last month: jolibrain@70fd4f7

One of the many reasons you may not want this in vanilla Caffe is that it requires a C++11 compiler.
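
For reference, a rough sketch of the idea behind that fix (simplified, not the exact code in the commit): a hash of the calling thread's id is appended to the source key, which is what pulls in the C++11 requirement (std::this_thread::get_id and std::hash<std::thread::id>):

```cpp
#include <functional>  // std::hash
#include <sstream>
#include <string>
#include <thread>      // std::this_thread::get_id (C++11)

// Hypothetical variant of the key builder: each loading thread contributes
// its own hash, so two nets trained from different threads no longer collide.
// Identical sources within a single net (same thread) still map to one key.
std::string source_key_with_thread(const std::string& layer_name,
                                   const std::string& source) {
  const std::size_t tid_hash =
      std::hash<std::thread::id>()(std::this_thread::get_id());
  std::ostringstream key;
  key << layer_name << ":" << source << ":" << tid_hash;
  return key.str();
}
```

Two nets loaded from different threads then get distinct keys, while duplicate sources within a single net are still detected as before.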

shelhamer changed the title from "Data source identification breaks the training of multiple instances of the same net" to "Data source identification breaks the loading of multiple instances of the same net" on Mar 31, 2016
tarunsharma1 commented:

So I have the same issue...is there a fix yet?

beniz (Author) commented Jul 23, 2016

@tarunsharma1 if you use the fork I maintain at https://github.com/beniz/caffe it should work fine. The fork tracks master with a short delay. You'd need a C++11 compiler, however.

tarunsharma1 commented:

For other/new users who hit the same issue and want a quick, easy hack around it (this is not a permanent solution):

It turns out the issue is with opening the same LMDB twice, regardless of whether you use CPU or GPU. A quick fix is to make a copy of your LMDB under a different name and point the two networks at the two different LMDBs respectively.

Does not work
Net1 -> train_lmdb
Net2 -> train_lmdb

Works
Net1 -> train_lmdb
Net2 -> train_lmdb_copy

beniz (Author) commented Jul 26, 2016

For the record, I've tested my branch against #3037 and it does not fix the problem there. I believe this is because my fix keys on the thread id, which resolves the issue when training multiple models from the same data source in multiple threads, but not when the duplicate sources come from within the same model.

shelhamer (Member) commented:

Fixed by the parallelism reformation in #4563
