
A MultiGPU bug with multiple input layers #2977

Closed
raingo opened this issue Aug 26, 2015 · 15 comments

@raingo

raingo commented Aug 26, 2015

With #2903, the data is logically laid out sequentially across multiple solvers. However, there is no coordination across multiple input layers. The following is an example of the resulting logical error.

Suppose solvers S1 and S2 are running in parallel, loading from two input layers D1 and D2. S1 gets the first batch of D1, but then somehow gets evicted from the run queue. In parallel, S2 gets the second batch of D1 and goes on to get the first batch of D2. Both S1 and S2 end up with inconsistent data, which is a serious bug.
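
To make the interleaving concrete, here is a minimal standalone sketch (plain C++11, not Caffe code). It assumes each source simply hands out batches in order to whichever solver thread asks first, with no coordination between the sources:

// Toy illustration of the race: two shared sources, two solver threads.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> d1_cursor(0);  // next batch index of source D1
std::atomic<int> d2_cursor(0);  // next batch index of source D2

void solver(const char* name) {
  // A real solver would fetch one batch per source per iteration.
  int b1 = d1_cursor++;  // batch taken from D1
  int b2 = d2_cursor++;  // batch taken from D2
  // With two solvers running concurrently, nothing forces b1 == b2.
  std::printf("%s got D1 batch %d and D2 batch %d\n", name, b1, b2);
}

int main() {
  std::thread s1(solver, "S1"), s2(solver, "S2");
  s1.join();
  s2.join();
  return 0;
}

Nothing ties the two cursors together, so a run can easily end up with a solver pairing the second batch of D1 with the first batch of D2.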

Suggestions (either... or...):

  1. Add a logical check in the net init function to disallow multiple input layers in a multi-solver process;
  2. Forget about sharing data loaders, and use the original random-skip parameter to make sure each solver loads different data batches.
@ronghanghu
Member

@raingo I don't quite understand here; could you be more specific?

Suppose we are using DataLayer that loads LMDB/LEVELDB, which is not shared.

With #2903, the data is logically laid out sequentially across multiple solvers

Yes. Sequential loading for the (LMDB/LEVELDB) DataLayer is ensured by DataReader, which distributes data in a round-robin style, and each solver's training net owns its own data layer.

Suppose solvers S1 and S2 are running in parallel, loading from two input layers D1 and D2.

Yes, this is what's happening. Each solver has a different input data layer in its train net.

But:

S1 gets the first batch of D1, but then somehow gets evicted from the run queue. In parallel, S2 gets the second batch of D1 and goes on to get the first batch of D2.

This cannot happen. S1 can only load from D1, while S2 can only load from D2. S1 owns D1 in its train net, and S2 owns D2 in its train net. There is no way for S2 to load data from D1. Also, data is distributed in a round-robin fashion by DataReader.
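
Roughly, the dispatch I have in mind looks like this (a simplified standalone sketch in plain C++, not the actual DataReader code; it assumes one reading body per source and one queue per solver, filled round-robin):

#include <cstddef>
#include <queue>
#include <vector>

struct Batch { int index; };

// The single reading body for one source cycles over the per-solver queues,
// so each solver sees a disjoint, in-order slice of that one source.
void ReadSource(std::vector<std::queue<Batch> >* solver_queues, int num_batches) {
  std::size_t next = 0;
  for (int b = 0; b < num_batches; ++b) {
    Batch batch;
    batch.index = b;
    (*solver_queues)[next].push(batch);
    next = (next + 1) % solver_queues->size();
  }
}

int main() {
  std::vector<std::queue<Batch> > queues(2);  // one queue per solver
  ReadSource(&queues, 6);  // solver 0 receives batches 0, 2, 4; solver 1 receives 1, 3, 5
  return 0;
}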

Please correct me if I'm wrong.

@raingo
Author

raingo commented Aug 26, 2015

@ronghanghu Both S1 and S2 will load data from both D1 and D2. I'm only concerned about the training phase here. The following is an example prototxt.

layer {
  name: "D1"
  type: "Data"
  top: "data"
  top: "label"
  data_param {
    source: "path1"
  }
}
layer {
  name: "D2"
  type: "Data"
  top: "data2"
  top: "label2"
  data_param {
    source: "path2"
  }
}

# both data/label and data2/label2 will be consumed by the following layers

Actual use cases of multiple input layers are #1698 and #1873.

@ronghanghu
Member

Oh, I see what you mean here. You mean two different data layers in one net. Yes, this is indeed a problem.

@ronghanghu ronghanghu added the bug label Aug 26, 2015
@ronghanghu
Member

I agree this should be fixed. I'm not sure whether this is also a problem for the (LMDB/LEVELDB) DataLayer illustrated in your case (which relies on DataReader). I'll check that.

@raingo
Author

raingo commented Aug 26, 2015

I think all the input layers, including HDF5 and DataLayer, have this problem.

Essentially, the entire data-loading part of the net should be serialized. A solver parameter to lock all the input layers would be a quick solution, and it would make #2903 unnecessary.
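
Something along these lines is what I have in mind (a rough standalone sketch in plain C++11, not Caffe code; input_lock, InputLayer, and LoadAllInputs are hypothetical names):

#include <cstddef>
#include <mutex>
#include <vector>

// Stands in for a real input layer; cursor is the next batch index of its source.
struct InputLayer {
  int cursor;
  InputLayer() : cursor(0) {}
  int LoadNextBatch() { return cursor++; }  // placeholder for a real DB/HDF5 read
};

std::mutex input_lock;  // hypothetical lock shared by all solvers

// Called once per iteration by a solver for all input layers of its net.
// Because the whole loop runs under one lock, every source advances by exactly
// one batch for this solver before any other solver can load anything.
std::vector<int> LoadAllInputs(const std::vector<InputLayer*>& inputs) {
  std::lock_guard<std::mutex> guard(input_lock);
  std::vector<int> batch_indices;
  for (std::size_t i = 0; i < inputs.size(); ++i) {
    batch_indices.push_back(inputs[i]->LoadNextBatch());
  }
  return batch_indices;
}

int main() {
  InputLayer d1, d2;  // shared sources, e.g. D1 and D2 above
  std::vector<InputLayer*> inputs;
  inputs.push_back(&d1);
  inputs.push_back(&d2);
  std::vector<int> batch = LoadAllInputs(inputs);  // first call yields {0, 0}
  return batch[0] == batch[1] ? 0 : 1;  // indices always match within one call
}

Because a solver takes one batch from every source under a single lock, the batch indices it receives always match, at the cost of serializing the loads.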

@ronghanghu
Member

Essentially, the entire data-loading part of the net should be serialized.

This is probably unaffordable, since it would introduce a big overhead for data loading from LMDB/LEVELDB. I'll first check whether there is such a problem in DataLayer, and if so, whether it can be fixed in the DataReader class.

Anyway, we should fix this issue.

@ctensmeyer

Might it be sufficient to extend the Datum protobuf to include an arbitrary number of arbitrarily shaped inputs, eliminating the need for multiple synchronized input layers?

@raingo
Author

raingo commented Aug 26, 2015

@ronghanghu Agreed, there is an overhead, but it might be a good idea to offer this as an option. If it is implemented properly, the overhead can be minimized: the prefetching threads can still work in the background to fill a buffer, and the forward thread simply memcpys from the buffer.
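
For example, something like the following (a minimal standalone sketch in plain C++11, not Caffe code; PrefetchBuffer and its methods are hypothetical) would let the forward thread block only on the copy, never on the actual disk read:

#include <condition_variable>
#include <cstddef>
#include <cstring>
#include <mutex>
#include <queue>
#include <vector>

class PrefetchBuffer {
 public:
  explicit PrefetchBuffer(std::size_t capacity) : capacity_(capacity) {}

  // Producer side: the prefetch thread pushes a filled batch.
  void Push(std::vector<float> batch) {
    std::unique_lock<std::mutex> lock(mu_);
    not_full_.wait(lock, [this] { return ready_.size() < capacity_; });
    ready_.push(std::move(batch));
    not_empty_.notify_one();
  }

  // Consumer side: the forward thread copies the next ready batch into its
  // blob memory (count must not exceed the stored batch size).
  void PopInto(float* dst, std::size_t count) {
    std::unique_lock<std::mutex> lock(mu_);
    not_empty_.wait(lock, [this] { return !ready_.empty(); });
    std::memcpy(dst, ready_.front().data(), count * sizeof(float));
    ready_.pop();
    not_full_.notify_one();
  }

 private:
  const std::size_t capacity_;
  std::queue<std::vector<float> > ready_;
  std::mutex mu_;
  std::condition_variable not_empty_, not_full_;
};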

Anyway, thanks a lot for looking at this issue!

@ronghanghu
Member

I checked DataReader today. Unfortunately, it seems to suffer from this issue as well, since in DataReader::DataReader there is no guarantee on the order of the push in body_->new_queue_pairs_.push(queue_pair_);

https://github.com/BVLC/caffe/blob/master/src/caffe/data_reader.cpp#L30
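
To see why the push order matters, here is a standalone toy example (plain C++, not the actual DataReader code). Each source's body serves its queue pairs round-robin in the order they were pushed, so if S1's reader happened to register first with D1's body but second with D2's body, S1 would be fed D1 batches 0, 2, 4, ... but D2 batches 1, 3, 5, ..., i.e. mismatched pairs:

#include <cstdio>

int main() {
  const int kNumSolvers = 2;
  // Hypothetical registration orders (solver id stored per round-robin slot).
  int d1_slots[kNumSolvers] = {0, 1};  // S1 registered first with D1's body
  int d2_slots[kNumSolvers] = {1, 0};  // ...but second with D2's body
  for (int batch = 0; batch < 4; ++batch) {
    std::printf("batch %d of D1 -> S%d, batch %d of D2 -> S%d\n",
                batch, d1_slots[batch % kNumSolvers] + 1,
                batch, d2_slots[batch % kNumSolvers] + 1);
  }
  return 0;
}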

@thatguymike
Contributor

Interesting. The original code specifically used a single data layer for all solvers, but it had other issues.

So @ronghanghu, where do you want to go? Either we need to move to an ordered queue or unwind some of the data layer changes. It sounds like we basically need to guarantee that the order of building the batch is always stable and that the solvers are notified only when the batch is fully completed.

The complication in the future will be distributed training, where we will have to figure out distributed loading of data and synchronization of the data layers. We should probably think this through carefully before we hack something in.

@ronghanghu
Member

@thatguymike I'm not sure where to go at this moment, as it takes careful consideration.

The complication in the future will be distributed training, where we will have to figure out distributed loading of data and synchronization of the data layers. We should probably think this through carefully before we hack something in.

I agree.

@cypof
Member

cypof commented Aug 28, 2015

Maybe one way to fix the DB code, which might also simplify other data layer types, would be to create all solvers on the main thread and hand them off to the worker threads only when training starts?

@ronghanghu
Member

After another check today, it seems to me that solvers are indeed created in the main thread, which sort of confused me...

On one hand,

Worker solvers are created in P2PSync<Dtype>::P2PSync, which is called by P2PSync<Dtype>::run and train() in
https://github.com/BVLC/caffe/blob/master/src/caffe/parallel.cpp#L224
https://github.com/BVLC/caffe/blob/master/src/caffe/parallel.cpp#L416
https://github.com/BVLC/caffe/blob/master/tools/caffe.cpp#L208-L209
and the root solver is created in train() in
https://github.com/BVLC/caffe/blob/master/tools/caffe.cpp#L195-L196
All of these are called from the main thread. P2PSync<Dtype>::InternalThreadEntry() only calls solver_->Step. So it seems that all solver construction is done in the main thread, and only solving is performed in each solver's thread.

Therefore, there shouldn't be a race in constructing solvers, since they are all constructed on the main thread sequentially. And DataReader::DataReader is called in DataLayer<Dtype>::DataLayer, which should only be called during net construction as part of solver construction. Thus, at a glance there is no concurrent access to body_->new_queue_pairs_.push(queue_pair_); in
https://github.com/BVLC/caffe/blob/master/src/caffe/data_reader.cpp#L30

On the other hand,

I see that DataReader::DataReader is protected by a mutex, as in #2903 (originally #2114):
https://github.com/BVLC/caffe/blob/master/src/caffe/data_reader.cpp#L22
so according to this design, I previously assumed that there should actually be concurrent calls to DataReader::DataReader.

So...

I am a bit confused about whether there are concurrent calls to, and a race condition in, DataReader::DataReader (I hope there aren't and the mutex is just future-proofing). Maybe I mixed something up; I'll double-check.

@cypof @thatguymike perhaps you can enlighten me on this?

@cypof
Member

cypof commented Sep 4, 2015

Oh, you're right, worker solvers used to be created in P2PSync::InternalThreadEntry(); I don't remember when that changed. The lock in the data reader might still be necessary in case a solver exits while another one is still being constructed. That's not really possible today, but it might be if we go asynchronous.

@ronghanghu
Member

OK, then I expect the multi-GPU bug described in this issue doesn't apply to DataLayer (LMDB/LevelDB). I'll try to test it sometime over the weekend.
