New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data Reader #2386

Closed
wants to merge 5 commits into
base: master
from

Conversation

Projects
None yet
3 participants
@cypof
Member

cypof commented Apr 29, 2015

Part of #2351, but switched to a round-robin way of distributing data to solvers, instead of the shared queue that was not deterministic. Combined with random seeds initialization on threads in #2367, it should make parallel training reproducible.

The data reader sits between a database and each solver's prefetch thread. It makes sure each solver processes a different subset of the database. It also prefetches data, but to host memory, and the amount is configurable. Solvers prefetch threads instead only prefetch a fixed small amount of data, since it is stored in GPU memory.

  • A single reading thread is created per source, even if multiple solvers are running in parallel.
  • Sources are identified by layer name + source path, in case net has multiple data layers on same DB.
  • Databases are read sequentially, for better performance.
  • Each solver sees a different subset of the database.
  • Data is distributed to solvers in a round-robin way to keep parallel training deterministic.

@cypof cypof referenced this pull request Apr 29, 2015

Closed

Multi-GPU #2114

cypof added some commits Apr 28, 2015

Changed the way threads are started and stopped
- Interrupt the thread before waiting on join
- Provide a method for looping threads to exit on demand
- CHECK if start and stop succeed instead of returning an error
Added DataReader for parallel training with one DB session
- Makes sure each solver accesses a different subset of the data
- Sequential reading of DB for performance
- Prefetches a configurable amount of data to host memory
- Distributes data to solvers in round-robin way for determinism
@cdoersch

This comment has been minimized.

Show comment
Hide comment
@cdoersch

cdoersch Jun 1, 2015

The variable/class names in this file and the associated comments are pretty uninformative. I can't think of a better name than QueuePair (though it's pretty uninformative). However, 'body' could mean many different things. 'DataSource' would be better for body, but it would be good to have something even more descriptive if you can think of one.

cdoersch commented on include/caffe/data_reader.hpp in 0bd8238 Jun 1, 2015

The variable/class names in this file and the associated comments are pretty uninformative. I can't think of a better name than QueuePair (though it's pretty uninformative). However, 'body' could mean many different things. 'DataSource' would be better for body, but it would be good to have something even more descriptive if you can think of one.

@cdoersch

This comment has been minimized.

Show comment
Hide comment
@cdoersch

cdoersch Jun 1, 2015

Whenever I see these two variables referenced elsewhere in the code, I think they're booleans. How about 'populated_datums_' and 'unpopulated_datums_'?

cdoersch commented on include/caffe/data_reader.hpp in 0bd8238 Jun 1, 2015

Whenever I see these two variables referenced elsewhere in the code, I think they're booleans. How about 'populated_datums_' and 'unpopulated_datums_'?

@cdoersch

This comment has been minimized.

Show comment
Hide comment
@cdoersch

cdoersch Jun 1, 2015

You know there's problems with variable names when there are comments like this one. To me, this says there are these things called 'sources' and each one has a 'body,' and nothing else. What do you mean by source? What types of sources are currently supported? A body is not just paired with a source, it's the code's primary representation of a source; any access to a source must be done through a body. But then why is it called a 'Body'? Why not call it a 'Source'?

cdoersch commented on include/caffe/data_reader.hpp in 0bd8238 Jun 1, 2015

You know there's problems with variable names when there are comments like this one. To me, this says there are these things called 'sources' and each one has a 'body,' and nothing else. What do you mean by source? What types of sources are currently supported? A body is not just paired with a source, it's the code's primary representation of a source; any access to a source must be done through a body. But then why is it called a 'Body'? Why not call it a 'Source'?

This comment has been minimized.

Show comment
Hide comment
@cypof

cypof Jun 9, 2015

Owner

OK not very clear. The name body is in the sense of this http://c2.com/cgi/wiki?HandleBodyPattern. I will probably switch to DataSource or something similar, and comment that a single instance gets created even if multiple readers use it.

Owner

cypof replied Jun 9, 2015

OK not very clear. The name body is in the sense of this http://c2.com/cgi/wiki?HandleBodyPattern. I will probably switch to DataSource or something similar, and comment that a single instance gets created even if multiple readers use it.

@cdoersch

This comment has been minimized.

Show comment
Hide comment
@cdoersch

cdoersch Jun 1, 2015

I don't understand this comment.

cdoersch commented on src/caffe/data_reader.cpp in 0bd8238 Jun 1, 2015

I don't understand this comment.

This comment has been minimized.

Show comment
Hide comment
@cypof

cypof Jun 9, 2015

Owner

A body gets created only if one doesn't exist for this source. A source here is identified by its source_key().

Owner

cypof replied Jun 9, 2015

A body gets created only if one doesn't exist for this source. A source here is identified by its source_key().

@cdoersch

This comment has been minimized.

Show comment
Hide comment
@cdoersch

cdoersch Jun 1, 2015

Maybe a nitpick, but it seems odd to me that this variable would be solver-specific, since it's possible for other code to use this for multithreaded access.

Perhaps more importantly, looking at your later code, it seems somewhat hacky to read this value from a global variable; it's a sort of hidden dependency. Is there no way for this to be passed in? I guess DataReaders are currently created by layers, but layers aren't really supposed to understand that they're multi-threaded. The necessity of this global variable access here suggests that this may be the wrong design. One alternative approach is to have this function wait for new_queue_pairs_ indefinitely, until it gets a NULL. Then we can have a separate static function in Body that's called at the end of network setup, which goes through all existing Body's and sends a NULL to each one's new_queue_pairs_. Not sure I like this approach that much better than what we have now, but I think it's an improvement.

Later we can change the bodies_ variable to not be static, but instead passed in by Caffe during solver construction. In general, I think we want to get away from the assumption that there's only one solver running at a time, because there's nothing obviously blocking that pattern in the Python/Matlab interfaces.

cdoersch commented on src/caffe/data_reader.cpp in 0bd8238 Jun 1, 2015

Maybe a nitpick, but it seems odd to me that this variable would be solver-specific, since it's possible for other code to use this for multithreaded access.

Perhaps more importantly, looking at your later code, it seems somewhat hacky to read this value from a global variable; it's a sort of hidden dependency. Is there no way for this to be passed in? I guess DataReaders are currently created by layers, but layers aren't really supposed to understand that they're multi-threaded. The necessity of this global variable access here suggests that this may be the wrong design. One alternative approach is to have this function wait for new_queue_pairs_ indefinitely, until it gets a NULL. Then we can have a separate static function in Body that's called at the end of network setup, which goes through all existing Body's and sends a NULL to each one's new_queue_pairs_. Not sure I like this approach that much better than what we have now, but I think it's an improvement.

Later we can change the bodies_ variable to not be static, but instead passed in by Caffe during solver construction. In general, I think we want to get away from the assumption that there's only one solver running at a time, because there's nothing obviously blocking that pattern in the Python/Matlab interfaces.

This comment has been minimized.

Show comment
Hide comment
@cypof

cypof Jun 9, 2015

Owner

The number of solvers if used in a few places. It would be great to have a concept of a group of solvers, or some object representing the current training task that can be passed around. I am not sure we should change things until we have a better idea of the right way to express this. For now, I feel it OK to store this on the Caffe object (and a couple other variables like is this the root solver, and random generator). It's only visible to the current thread.

Owner

cypof replied Jun 9, 2015

The number of solvers if used in a few places. It would be great to have a concept of a group of solvers, or some object representing the current training task that can be passed around. I am not sure we should change things until we have a better idea of the right way to express this. For now, I feel it OK to store this on the Caffe object (and a couple other variables like is this the root solver, and random generator). It's only visible to the current thread.

@shelhamer

This comment has been minimized.

Show comment
Hide comment
@shelhamer

shelhamer Sep 26, 2015

Member

Merged with revisions in #2903, thanks.

Member

shelhamer commented Sep 26, 2015

Merged with revisions in #2903, thanks.

@shelhamer shelhamer closed this Sep 26, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment