
A MultiGPU bug with multiple input layers #2977

Closed
raingo opened this issue Aug 26, 2015 · 15 comments

@raingo

raingo commented Aug 26, 2015

With #2903, the data is logically laid out sequentially across multiple solvers. However, there is no coordination across multiple input layers. The following is an example of the resulting logical error.

Suppose solvers S1 and S2 are running in parallel, loading from two input layers D1 and D2. S1 gets the first batch of D1, but then somehow gets evicted from the run queue. In parallel, S2 gets the second batch of D1 and goes on to get the first batch of D2. Both S1 and S2 end up with inconsistent data, which is a serious bug.
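
To make the interleaving concrete, here is a minimal standalone sketch (plain C++11, not Caffe code). It assumes each source simply hands out batches in order to whichever solver thread asks first, with no coordination between the sources:

// Toy illustration of the race: two shared sources, two solver threads.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> d1_cursor(0);  // next batch index of source D1
std::atomic<int> d2_cursor(0);  // next batch index of source D2

void solver(const char* name) {
  // A real solver would fetch one batch per source per iteration.
  int b1 = d1_cursor++;  // batch taken from D1
  int b2 = d2_cursor++;  // batch taken from D2
  // With two solvers running concurrently, nothing forces b1 == b2.
  std::printf("%s got D1 batch %d and D2 batch %d\n", name, b1, b2);
}

int main() {
  std::thread s1(solver, "S1"), s2(solver, "S2");
  s1.join();
  s2.join();
  return 0;
}

Nothing ties the two cursors together, so a run can easily end up with a solver pairing the second batch of D1 with the first batch of D2.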

Suggestions (either... or...):

  1. Add a logical check in the net init function to disallow multiple input layers in a multi-solver process;
  2. Forget about sharing data loaders, and use the original random-skip parameter to make sure each solver loads different data batches.
@ronghanghu
Member

@raingo I don't quite understand here; could you be more specific?

Suppose we are using DataLayer that loads LMDB/LEVELDB, which is not shared.

With #2903, the data is logically laid out sequentially across multiple solvers

Yes. Sequential loading for the (LMDB/LEVELDB) DataLayer is ensured by DataReader, which distributes data in a round-robin style, and each solver's training net owns its own data layer.

Suppose solvers S1 and S2 are running in parallel, loading from two input layers D1 and D2.

Yes, this is what's happening. Each solver has a different input data layer in its train net.

But:

S1 gets the first batch of D1, but then somehow gets evicted from the run queue. In parallel, S2 gets the second batch of D1 and goes on to get the first batch of D2.

This cannot happen. S1 can only load from D1, while S2 can only load from D2. S1 owns D1 in its train net, and S2 owns D2 in its train net. There is no way for S2 to load data from D1. Also, data is distributed in a round-robin fashion by DataReader.
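
Roughly, the dispatch I have in mind looks like this (a simplified standalone sketch in plain C++, not the actual DataReader code; it assumes one reading body per source and one queue per solver, filled round-robin):

#include <cstddef>
#include <queue>
#include <vector>

struct Batch { int index; };

// The single reading body for one source cycles over the per-solver queues,
// so each solver sees a disjoint, in-order slice of that one source.
void ReadSource(std::vector<std::queue<Batch> >* solver_queues, int num_batches) {
  std::size_t next = 0;
  for (int b = 0; b < num_batches; ++b) {
    Batch batch;
    batch.index = b;
    (*solver_queues)[next].push(batch);
    next = (next + 1) % solver_queues->size();
  }
}

int main() {
  std::vector<std::queue<Batch> > queues(2);  // one queue per solver
  ReadSource(&queues, 6);  // solver 0 receives batches 0, 2, 4; solver 1 receives 1, 3, 5
  return 0;
}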

Please correct me if I'm wrong.

@raingo
Author

raingo commented Aug 26, 2015

@ronghanghu Both S1 and S2 will load data from both D1 and D2. I'm only concerned about the training phase here. The following is an example prototxt.

layer {
  name: "D1"
  type: "Data"
  top: "data"
  top: "label"
  data_param {
    source: "path1"
  }
}
layer {
  name: "D2"
  type: "Data"
  top: "data2"
  top: "label2"
  data_param {
    source: "path2"
  }
}

# both data/label and data2/label2 will be consumed by the following layers

Actual use cases of multiple input layers are #1698 and #1873.

@ronghanghu
Member

Oh, I see what you mean here. You mean two different data layers in one net. Yes, this is indeed a problem.

@ronghanghu ronghanghu added the bug label Aug 26, 2015
@ronghanghu
Member

I agree this should be fixed. I'm not sure whether this is also a problem for the (LMDB/LEVELDB) DataLayer illustrated in your case (which relies on DataReader). I'll check that.

@raingo
Author

raingo commented Aug 26, 2015

I think all the input layers, including HDF5 and DataLayer, have this problem.

Essentially, the entire data-loading part of the net should be serialized. A solver parameter to lock all the input layers would be a quick solution, and it would make #2903 unnecessary.
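
Something along these lines is what I have in mind (a rough standalone sketch in plain C++11, not Caffe code; input_lock, InputLayer, and LoadAllInputs are hypothetical names):

#include <cstddef>
#include <mutex>
#include <vector>

// Stands in for a real input layer; cursor is the next batch index of its source.
struct InputLayer {
  int cursor;
  InputLayer() : cursor(0) {}
  int LoadNextBatch() { return cursor++; }  // placeholder for a real DB/HDF5 read
};

std::mutex input_lock;  // hypothetical lock shared by all solvers

// Called once per iteration by a solver for all input layers of its net.
// Because the whole loop runs under one lock, every source advances by exactly
// one batch for this solver before any other solver can load anything.
std::vector<int> LoadAllInputs(const std::vector<InputLayer*>& inputs) {
  std::lock_guard<std::mutex> guard(input_lock);
  std::vector<int> batch_indices;
  for (std::size_t i = 0; i < inputs.size(); ++i) {
    batch_indices.push_back(inputs[i]->LoadNextBatch());
  }
  return batch_indices;
}

int main() {
  InputLayer d1, d2;  // shared sources, e.g. D1 and D2 above
  std::vector<InputLayer*> inputs;
  inputs.push_back(&d1);
  inputs.push_back(&d2);
  std::vector<int> batch = LoadAllInputs(inputs);  // first call yields {0, 0}
  return batch[0] == batch[1] ? 0 : 1;  // indices always match within one call
}

Because a solver takes one batch from every source under a single lock, the batch indices it receives always match, at the cost of serializing the loads.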

@ronghanghu
Member

Essentially, the entire data-loading part of the net should be serialized.

This is probably unaffordable, since it would introduce a big overhead for data loading from LMDB/LEVELDB. I'll first check whether there is such a problem in DataLayer, and if so, whether it can be fixed in the DataReader class.

Anyway, we should fix this issue.

@ctensmeyer

Might it be sufficient to extend the Datum protobuf to include an arbitrary number of arbitrarily shaped inputs, eliminating the need for multiple synchronized input layers?

@raingo
Author

raingo commented Aug 26, 2015

@ronghanghu Agreed, there is an overhead, but it might be a good idea to offer this as an option. If it is implemented properly, the overhead can be minimized: the prefetching threads can still work in the background to fill a buffer, and the forward thread simply memcpys from the buffer.
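
For example, something like the following (a minimal standalone sketch in plain C++11, not Caffe code; PrefetchBuffer and its methods are hypothetical) would let the forward thread block only on the copy, never on the actual disk read:

#include <condition_variable>
#include <cstddef>
#include <cstring>
#include <mutex>
#include <queue>
#include <vector>

class PrefetchBuffer {
 public:
  explicit PrefetchBuffer(std::size_t capacity) : capacity_(capacity) {}

  // Producer side: the prefetch thread pushes a filled batch.
  void Push(std::vector<float> batch) {
    std::unique_lock<std::mutex> lock(mu_);
    not_full_.wait(lock, [this] { return ready_.size() < capacity_; });
    ready_.push(std::move(batch));
    not_empty_.notify_one();
  }

  // Consumer side: the forward thread copies the next ready batch into its
  // blob memory (count must not exceed the stored batch size).
  void PopInto(float* dst, std::size_t count) {
    std::unique_lock<std::mutex> lock(mu_);
    not_empty_.wait(lock, [this] { return !ready_.empty(); });
    std::memcpy(dst, ready_.front().data(), count * sizeof(float));
    ready_.pop();
    not_full_.notify_one();
  }

 private:
  const std::size_t capacity_;
  std::queue<std::vector<float> > ready_;
  std::mutex mu_;
  std::condition_variable not_empty_, not_full_;
};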

Anyway, thanks a lot for looking at this issue!

@ronghanghu
Member

I checked DataReader today. Unfortunately, it seems to suffer from this issue as well, since in DataReader::DataReader there is no guarantee on the order of the push in body_->new_queue_pairs_.push(queue_pair_);

https://github.com/BVLC/caffe/blob/master/src/caffe/data_reader.cpp#L30
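
To see why the push order matters, here is a standalone toy example (plain C++, not the actual DataReader code). Each source's body serves its queue pairs round-robin in the order they were pushed, so if S1's reader happened to register first with D1's body but second with D2's body, S1 would be fed D1 batches 0, 2, 4, ... but D2 batches 1, 3, 5, ..., i.e. mismatched pairs:

#include <cstdio>

int main() {
  const int kNumSolvers = 2;
  // Hypothetical registration orders (solver id stored per round-robin slot).
  int d1_slots[kNumSolvers] = {0, 1};  // S1 registered first with D1's body
  int d2_slots[kNumSolvers] = {1, 0};  // ...but second with D2's body
  for (int batch = 0; batch < 4; ++batch) {
    std::printf("batch %d of D1 -> S%d, batch %d of D2 -> S%d\n",
                batch, d1_slots[batch % kNumSolvers] + 1,
                batch, d2_slots[batch % kNumSolvers] + 1);
  }
  return 0;
}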

@thatguymike
Contributor

Interesting. The original code specifically used a single data layer for all solvers, but it had other issues.

So @ronghanghu, where do you want to go? Either we need to move to an ordered queue or unwind some of the data layer changes. It sounds like we basically need to guarantee that the order of building the batch is always stable and that the solvers are notified only when the batch is fully completed.

The complication in the future will be distributed training, where we will have to figure out distributed loading of data and synchronization of the data layers. We should probably think this through carefully before we hack something in.

@ronghanghu
Member

@thatguymike I'm not sure where to go at this moment, as it takes careful consideration.

The complication in the future will be distributed training, where we will have to figure out distributed loading of data and synchronization of the data layers. We should probably think this through carefully before we hack something in.

I agree.

@cypof
Member

cypof commented Aug 28, 2015

Maybe one way to fix the DB code, which might also simplify other data layer types, would be to create all solvers on the main thread and hand them off to the worker threads only when training starts?

@ronghanghu
Member

After another check today, it seems to me that solvers are indeed created in the main thread, which sort of confused me...

On one hand,

Worker solvers are created in P2PSync<Dtype>::P2PSync, which is called by P2PSync<Dtype>::run and train() in
https://github.com/BVLC/caffe/blob/master/src/caffe/parallel.cpp#L224
https://github.com/BVLC/caffe/blob/master/src/caffe/parallel.cpp#L416
https://github.com/BVLC/caffe/blob/master/tools/caffe.cpp#L208-L209
and the root solver is created in train() in
https://github.com/BVLC/caffe/blob/master/tools/caffe.cpp#L195-L196
All of these are called from the main thread. P2PSync<Dtype>::InternalThreadEntry() only calls solver_->Step. So it seems that all solver construction is done in the main thread, and only solving is performed in each solver's thread.

Therefore, there shouldn't be a race in constructing solvers, since they are all constructed on the main thread sequentially. And DataReader::DataReader is called in DataLayer<Dtype>::DataLayer, which should only be called during net construction as part of solver construction. Thus, at a glance there is no concurrent access to body_->new_queue_pairs_.push(queue_pair_); in
https://github.com/BVLC/caffe/blob/master/src/caffe/data_reader.cpp#L30

On the other hand,

I see that DataReader::DataReader is protected by a mutex, as in #2903 (originally #2114):
https://github.com/BVLC/caffe/blob/master/src/caffe/data_reader.cpp#L22
so according to this design, I previously assumed that there should actually be concurrent calls to DataReader::DataReader.

So...

I am a bit confused about whether there are concurrent calls to, and a race condition in, DataReader::DataReader (I hope there aren't and the mutex is just future-proofing). Maybe I mixed something up; I'll double-check.

@cypof @thatguymike perhaps you can enlighten me on this?

@cypof
Member

cypof commented Sep 4, 2015

Oh, you're right, worker solvers used to be created in P2PSync::InternalThreadEntry(); I don't remember when that changed. The lock in the data reader might still be necessary in case a solver exits while another one is still being constructed. That's not really possible today, but it might be if we go asynchronous.

@ronghanghu
Member

OK, then I expect the multi-GPU bug described in this issue doesn't apply to DataLayer (LMDB/LevelDB). I'll try to test it sometime over the weekend.
