A MultiGPU bug with multiple input layers #2977
Comments
@raingo I don't quite understand; could you be more specific? Suppose we are using a DataLayer that loads LMDB/LEVELDB, which is not shared.
Yes. Sequential loading for the (LMDB/LEVELDB) DataLayer is ensured by DataReader, which distributes data in a round-robin style, and each solver's training net owns its own data layer.
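(For illustration, a minimal sketch of the round-robin scheme just described, with hypothetical names, not the actual DataReader code: a single reader thread owns the database cursor and deals consecutive batches to per-solver queues in a fixed order, so with n solvers, solver i always receives batches i, i+n, i+2n, and so on.)

```cpp
// Minimal sketch of round-robin batch distribution (hypothetical
// names; not the actual Caffe DataReader). One reader thread owns
// the DB cursor; each solver owns a queue and only ever pops from it.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <vector>

struct BlockingQueue {
  std::queue<int> q;  // batch ids stand in for real batches
  std::mutex m;
  std::condition_variable cv;
  void push(int v) {
    { std::lock_guard<std::mutex> l(m); q.push(v); }
    cv.notify_one();
  }
  int pop() {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [this] { return !q.empty(); });
    int v = q.front(); q.pop(); return v;
  }
};

// The single reader thread: deals consecutive batches to solvers 0..n-1.
void ReaderLoop(std::vector<BlockingQueue>& solver_queues, int num_batches) {
  for (int batch = 0; batch < num_batches; ++batch)
    solver_queues[batch % solver_queues.size()].push(batch);
}

int main() {
  std::vector<BlockingQueue> queues(2);  // one queue per solver
  ReaderLoop(queues, 4);                 // deal batches 0..3
  int a = queues[0].pop(), b = queues[0].pop();
  int c = queues[1].pop(), d = queues[1].pop();
  std::printf("solver 0 gets batches %d, %d\n", a, b);  // 0, 2
  std::printf("solver 1 gets batches %d, %d\n", c, d);  // 1, 3
}
```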
Yes, this is what's happening. Each solver has a different input data layer in its train net. But:
This cannot happen. S1 can only load from D1, and S2 only from D2. S1 owns D1 in its train net, and S2 owns D2 in its train net. There is no way that S2 can load data from D1. Also, data is distributed in a round-robin fashion in DataReader. Please correct me if I'm wrong.
@ronghanghu Both S1 and S2 will load data from both D1 and D2. We are only concerned with the training phase here. Following is an example prototxt.
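(The original prototxt example did not survive; below is a minimal hypothetical reconstruction of the kind of net being described: a single train net with two Data layers, D1 and D2, reading from separate LMDBs. All names, paths, and batch sizes are invented for illustration.)

```
# Hypothetical sketch: two input layers in one train net.
layer {
  name: "D1"
  type: "Data"
  top: "data"
  include { phase: TRAIN }
  data_param {
    source: "examples/my_dataset/images_lmdb"  # invented path
    batch_size: 64
    backend: LMDB
  }
}
layer {
  name: "D2"
  type: "Data"
  top: "label"
  include { phase: TRAIN }
  data_param {
    source: "examples/my_dataset/labels_lmdb"  # invented path
    batch_size: 64
    backend: LMDB
  }
}
```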
Actual use cases of multiple input layers are in #1698 and #1873.
Oh, I see what you mean here. You mean two different data layers in one net. Yes, this is indeed a problem.
I agree this should be fixed. I'm not sure whether this is also a problem for the (LMDB/LEVELDB) DataLayer in your case (which relies on DataReader). I'll check.
I think all the input layers, including HDF5 and DataLayer, have this problem. Essentially, the entire data-loading part of the net should be serialized. A solver parameter to lock all the input layers would be a quick solution, and it would make #2903 unnecessary.
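(A minimal sketch of what such a solver-level lock could look like, with hypothetical types and names, not actual Caffe code: a process-wide mutex is held while one solver reads a batch from every one of its input layers, so reads from D1 and D2 can never interleave across solvers.)

```cpp
// Sketch of the proposed "lock all input layers" option
// (hypothetical types and names, not actual Caffe code).
#include <mutex>
#include <vector>

// Stand-in for any input layer (DataLayer, HDF5DataLayer, ...).
struct InputLayer {
  virtual void LoadBatch() = 0;  // reads the next batch from its source
  virtual ~InputLayer() {}
};

// One process-wide mutex shared by all solvers.
static std::mutex data_loading_mutex;

// Called at the start of each solver's forward pass: while the lock
// is held, this solver reads one batch from *every* input layer, so
// its batches from D1 and D2 always come from the same position in
// the data stream.
void LoadAllInputs(const std::vector<InputLayer*>& inputs) {
  std::lock_guard<std::mutex> lock(data_loading_mutex);
  for (InputLayer* layer : inputs) layer->LoadBatch();
}
```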
This is probably unaffordable, since it would introduce a big overhead for data loading from LMDB/LEVELDB. I'll first check whether DataLayer has this problem, and if so, whether it can be fixed in the DataReader class. Either way, we should fix this issue.
Might it be sufficient to extend the Datum protobuf to include an arbitrary number of arbitrarily shaped inputs, eliminating the need for multiple synchronized input layers?
@ronghanghu Agreed, there is an overhead, but it might be a good idea to offer it as an option. If implemented properly, the overhead can be minimized: the prefetching threads can still work in the background to fill a buffer, and the forward thread simply memcpys from the buffer. Anyway, thanks a lot for looking at this issue!
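(For illustration, a rough sketch of that arrangement with hypothetical names: prefetch threads keep filling a buffer in the background, and the forward thread's only serialized work is a memcpy out of that buffer.)

```cpp
// Sketch: prefetching keeps running while only the final copy is
// serialized (hypothetical names, not actual Caffe code).
#include <condition_variable>
#include <cstring>
#include <mutex>
#include <queue>
#include <vector>

std::queue<std::vector<float>> buffer;  // prefetched batches
std::mutex buffer_mutex;
std::condition_variable buffer_cv;

// Prefetch thread: loads and decodes batches in the background.
void PrefetchLoop(int num_batches, int batch_elems) {
  for (int i = 0; i < num_batches; ++i) {
    std::vector<float> batch(batch_elems, float(i));  // stand-in for real I/O
    std::lock_guard<std::mutex> l(buffer_mutex);
    buffer.push(std::move(batch));
    buffer_cv.notify_one();
  }
}

// Forward thread: the only work done under the lock is a memcpy.
void ConsumeBatch(float* top_data, int batch_elems) {
  std::unique_lock<std::mutex> l(buffer_mutex);
  buffer_cv.wait(l, [] { return !buffer.empty(); });
  std::memcpy(top_data, buffer.front().data(), batch_elems * sizeof(float));
  buffer.pop();
}
```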
I checked DataReader today. Unfortunately, it seems to suffer from this issue as well; see https://github.com/BVLC/caffe/blob/master/src/caffe/data_reader.cpp#L30
Interesting. The original code specifically used a single layer for all solvers, but it had other issues. So, @ronghanghu, where do you want to go? Either we need to move to an ordered queue or unwind some of the data layer changes. It sounds like we basically need to guarantee that the order of building the batch is always stable and that solvers are only notified when the batch is fully completed. A further complication in the future will be distributed training, where we will have to figure out distributed loading of data and synchronization of the data layers. We should probably think this through carefully before we hack something in.
@thatguymike I'm not sure where to go at this moment, as it takes careful consideration.
I agree.
Maybe one way to fix the DB code, which might also simplify other data layer types, would be to create all solvers on the main thread and hand them off to the worker threads only when training starts?
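(A minimal sketch of that ordering, with hypothetical names, not actual Caffe code: construct all solvers sequentially on the main thread, so their data layers register in a deterministic order, and only then launch one worker thread per solver.)

```cpp
// Sketch: construct solvers sequentially on the main thread, then
// hand each one to a worker thread (hypothetical names).
#include <memory>
#include <thread>
#include <vector>

struct Solver {
  explicit Solver(int device) : device_(device) {
    // Nets and data layers are constructed here, on the main thread,
    // so any registration with a shared reader happens in a fixed order.
  }
  void Step(int iters) { /* training loop for iters iterations */ }
  int device_;
};

int main() {
  const int kNumGpus = 2;

  // Phase 1 (main thread): deterministic, race-free construction.
  std::vector<std::unique_ptr<Solver>> solvers;
  for (int d = 0; d < kNumGpus; ++d)
    solvers.emplace_back(new Solver(d));

  // Phase 2: hand off to workers only after everything is built.
  std::vector<std::thread> workers;
  for (auto& s : solvers)
    workers.emplace_back([&s] { s->Step(100); });
  for (auto& w : workers) w.join();
}
```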
After another check today, it seems to me that solvers are indeed created in the main thread, which sort of confused me. On one hand, worker solvers are created in …, so there shouldn't be a race in constructing solvers, since they all run in the main thread sequentially. On the other hand, I see that …. So I am a bit confused about whether there is concurrency and a race condition on …. @cypof @thatguymike, perhaps you can enlighten me on this?
Oh, you're right: worker solvers used to be created in P2PSync::InternalThreadEntry(); I don't remember when that changed. The lock in the data reader might still be necessary in case a solver exits while another one is still being constructed. That's not really possible today, but it might matter if we go asynchronous.
OK, then I expect the multi-GPU bug described in this issue doesn't apply to DataLayer (LMDB/LevelDB). I'll try to test it sometime over the weekend.
With #2903, the data is logically laid out sequentially for multiple solvers. However, there is no coordination across multiple input layers. The following is an example of the logical error.
Suppose solvers S1 and S2 are running in parallel, both loading from two input layers D1 and D2. S1 gets the first batch of D1, but is then evicted from the running queue. Meanwhile, S2 gets the second batch of D1 and goes on to get the first batch of D2. When S1 resumes, it gets the second batch of D2. Both S1 and S2 thus end up with inconsistent data, which is a serious bug.
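(For concreteness, a deterministic replay of this interleaving, with hypothetical names, showing both solvers ending up with mismatched pairs:)

```cpp
// Deterministic replay of the interleaving described above
// (hypothetical names): each source hands out batches in order,
// but nothing ties a solver's read from D1 to its read from D2.
#include <cstdio>

struct Source {
  int next_batch = 0;
  int Read() { return next_batch++; }
};

int main() {
  Source D1, D2;
  // S1 reads its batch from D1, then is preempted.
  int s1_d1 = D1.Read();  // batch 0 of D1
  // S2 runs: it reads the *second* batch of D1, then the first of D2.
  int s2_d1 = D1.Read();  // batch 1 of D1
  int s2_d2 = D2.Read();  // batch 0 of D2
  // S1 resumes and reads from D2.
  int s1_d2 = D2.Read();  // batch 1 of D2
  std::printf("S1 got D1[%d], D2[%d]\n", s1_d1, s1_d2);  // D1[0], D2[1]
  std::printf("S2 got D1[%d], D2[%d]\n", s2_d1, s2_d2);  // D1[1], D2[0]
  // Both solvers now hold mismatched (data, label) pairs.
}
```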
Suggestions (either ... or ...):