Streams #31

Open

untom opened this issue Sep 9, 2015 · 14 comments

@untom (Collaborator) commented Sep 9, 2015

Sooner or later, we should think about introducing CUDA streams for our GPU implementation. Side effect: looking at the profiling outputs across various examples, the most expensive call we make is usually the set_from_numpy call in the PyCudaHandler. We should be able to eliminate the cost of that call completely once we use streams, since the memory transfers can all be done asynchronously (and we could finally implement sensible double buffering on GPUs).
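For illustration, here is roughly what an asynchronous set_from_numpy could look like with PyCUDA (just a sketch; the helper name and the pinned staging buffer are assumptions, not existing handler code):

    import numpy as np
    import pycuda.autoinit  # creates a CUDA context for this sketch
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray

    def set_from_numpy_async(dest, src, stream):
        """Hypothetical async variant of set_from_numpy.

        The copy only overlaps with compute if the host source is
        page-locked (pinned), so we stage it through a pinned buffer.
        """
        pinned = drv.pagelocked_empty(src.shape, src.dtype)
        pinned[...] = src
        drv.memcpy_htod_async(dest.gpudata, pinned, stream)
        return pinned  # must stay alive until the stream is synchronized

    stream = drv.Stream()
    x = np.random.randn(256, 128).astype(np.float32)
    dev = gpuarray.empty(x.shape, x.dtype)
    staging = set_from_numpy_async(dev, x, stream)
    # ... enqueue kernels that work on `dev` on the same stream here ...
    stream.synchronize()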

I can think of two ways to add Streams:

  1. Specify Stream for each Call
    Add a stream=None optional argument to all the handler functions, so that the caller can specify the stream on which to execute. When the stream is not specified, we run on the default stream. We could pass either real CUDA streams or just stream IDs (integers). Calls would then maybe look like this:

        _h.dot_add_mm(dIa[t], x[t], dWi, transa=True, stream=_h.stream[1])
        _h.dot_add_mm(dFa[t], x[t], dWf, transa=True, stream=_h.stream[2])
        _h.dot_add_mm(dOa[t], x[t], dWo, transa=True, stream=_h.stream[3])
        _h.dot_add_mm(dZa[t], x[t], dWz, transa=True, stream=_h.stream[4])
        ...
        _h.synchronize_all_streams()
    
  2. Add a separate function for specifying streams:

        _h.set_stream(1)
        _h.dot_add_mm(dIa[t], x[t], dWi, transa=True)
        _h.set_stream(2)
        _h.dot_add_mm(dFa[t], x[t], dWf, transa=True)
        _h.set_stream(3)
        _h.dot_add_mm(dOa[t], x[t], dWo, transa=True)
        _h.set_stream(4)
        _h.dot_add_mm(dZa[t], x[t], dWz, transa=True)
        ...
        _h.synchronize_all_streams() 
    

In this short example, option 1 clearly looks better (IMO), but I can see option 2 working out nicely, too.

Another thing to consider is that we might set up some rules about streams. For example, something like "outputs should always be computed on streams 0-4"... or maybe it even makes sense to have different streams for outputs, internals and parameters, so we know which ones we need to synchronize on before starting computations in a new layer (or not, IDK).
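For reference, the stream bookkeeping that option 1 assumes on the handler could be as simple as the following sketch (the stream attribute and synchronize_all_streams follow the naming in the example above; this is not existing PyCudaHandler code):

    import pycuda.autoinit  # creates a CUDA context for this sketch
    import pycuda.driver as drv

    class StreamPool(object):
        """Sketch of per-handler stream bookkeeping for option 1."""

        def __init__(self, num_streams=8):
            # index 0 is reserved for the default stream (None in PyCUDA calls)
            self.stream = [None] + [drv.Stream() for _ in range(num_streams)]

        def synchronize_all_streams(self):
            for s in self.stream:
                if s is not None:
                    s.synchronize()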

@flukeskywalker (Collaborator)

Some handler operations might need multiple streams, so I guess it needs to accept a list.
_h.set_stream([...]) can simply set the stream IDs and then return the handler. That way it will be:

_h.set_stream(1).dot_add_mm(dIa[t], x[t], dWi, transa=True)
_h.set_stream(2).dot_add_mm(dFa[t], x[t], dWf, transa=True)
_h.set_stream(3).dot_add_mm(dOa[t], x[t], dWo, transa=True)
_h.set_stream(4).dot_add_mm(dZa[t], x[t], dWz, transa=True)

@untom (Collaborator, Author) commented Sep 9, 2015

Yeah, that looks nice!

@Qwlouse (Collaborator) commented Sep 9, 2015

How about we (ab)use indexing notation for that:

_h[1].dot_add_mm(dIa[t], x[t], dWi, transa=True)
_h[2].dot_add_mm(dFa[t], x[t], dWf, transa=True)
_h[3].dot_add_mm(dOa[t], x[t], dWo, transa=True)
_h[4].dot_add_mm(dZa[t], x[t], dWz, transa=True)

If _h[0] returns a thin wrapper around the handler, you could even assign it to a name if several operations need to use the same stream:

h1 = _h[1]
h1.dot_add_mm(dFa[t], x[t], dW, transa=True)
h1.dot_add_mm(dOa[t], x[t], dW, transa=True)
h1.dot_add_mm(dZa[t], x[t], dW, transa=True)

@flukeskywalker (Collaborator)

Another thing to keep in mind: it'd be nice if streams could be specified for layers too. Then we could run layers in parallel.

Of course, just like one needs to know how many streams are used by an operation while writing a layer implementation, one would also need to know how many streams are used by a layer while building a network. This isn't too much to ask: the docs should take care of it ;)

@untom (Collaborator, Author) commented Sep 10, 2015

I don't like the abused indexing notation; it's a bit too unintuitive for someone who doesn't know the codebase well. I'd rather do something like

h = _h.get_stream_handler(streamid=1)

where get_stream_handler() returns an instance of a PyCudaHandler subclass that always operates on a specific stream.

@Qwlouse (Collaborator) commented Sep 10, 2015

Ok, that's a fair point.

What I don't like about _h.set_stream(4).dot_add_mm(...) is that it actually sets the stream, i.e. changes the state of the handler. So all of these would for example use stream 1:

_h.set_stream(1).dot_add_mm(dIa[t], x[t], dWi, transa=True)
_h.dot_add_mm(dFa[t], x[t], dWf, transa=True)
_h.dot_add_mm(dOa[t], x[t], dWo, transa=True)
_h.dot_add_mm(dZa[t], x[t], dWz, transa=True)

We could make some kind of with_stream function that returns a thin wrapper and use it like this:

_h.with_stream(1).dot_add_mm(dIa[t], x[t], dWi, transa=True)
_h.with_stream(2).dot_add_mm(dFa[t], x[t], dWf, transa=True)
_h.with_stream(3).dot_add_mm(dOa[t], x[t], dWo, transa=True)
_h.with_stream(4).dot_add_mm(dZa[t], x[t], dWz, transa=True)

But that of course implies some (small) overhead.
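One possible shape for that wrapper, assuming the underlying handler methods accept a stream keyword internally (a sketch, not existing code):

    class _StreamView(object):
        """Thin wrapper that forwards every call to the handler with the
        chosen stream filled in; constructing one small object per call
        site is the (small) overhead mentioned above."""

        def __init__(self, handler, stream_id):
            self._handler = handler
            self._stream_id = stream_id

        def __getattr__(self, name):
            op = getattr(self._handler, name)
            return lambda *args, **kwargs: op(*args, stream=self._stream_id,
                                              **kwargs)

    # On the handler itself this would just be:
    #
    #     def with_stream(self, stream_id):
    #         return _StreamView(self, stream_id)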

@flukeskywalker (Collaborator)

Alright, to summarize:

1 - We can add stream as an argument to all operations, but then we'd have to do it for other handlers too, which may not use streams, so it's a bit weird.

2 - We can use set_streams() without returning anything. Then we'd do

_h.set_streams([1])
_h.dot_add_mm(flat_dH, W, out=flat_in_delta_buffer)
_h.set_streams([2])
_h.dot_mm(flat_dH, flat_input, out=dW, transa=True)
_h.sum_t(flat_dH, axis=0, out=dbias)  # runs on stream 2

This option means that

  • You have to set streams before calling operations, and thus you need to know how many streams those operations expect the handler to provide. In Option 1, you still need to know this, but the operation docs can help you.
  • You need to call something like _h.set_streams(None) to reset the handler to the default stream after you are done calling ops. This looks like a problem.

3 - We can use _h.with_streams([...]) to return a wrapper which provides access to those streams. This option retains issue 2a but is better wrt issue 2b:

_h.with_streams([1]).dot_add_mm(flat_dH, W, out=flat_in_delta_buffer)
_h.with_streams([2]).dot_mm(flat_dH, flat_input, out=dW, transa=True)
_h.sum_t(flat_dH, axis=0, out=dbias)  # runs on default stream

We should pick one and start working on it.

@Qwlouse (Collaborator) commented Sep 11, 2015

Option 4:

with _h.streams(1):
    _h.dot_add_mm(flat_dH, W, out=flat_in_delta_buffer)
with _h.streams(2):
    _h.dot_mm(flat_dH, flat_input, out=dW, transa=True)
    _h.sum_t(flat_dH, axis=0, out=dbias)
_h.sum_t(flat_dH, axis=0, out=dbias)  # runs on default stream

Considering issue 2a, we could do the following: say the handler internally uses 15 streams (0-14), but we group them into five groups of 3 streams [(0, 1, 2), (3, 4, 5), ...]. So when you set a stream in the layer code, it is really a group of 3 streams. With these numbers, operations could internally use up to 3 streams, and for implementing the layers you could use 5 groups of streams.
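A minimal sketch of that grouping (numbers as above; not existing handler code):

    import pycuda.autoinit  # creates a CUDA context for this sketch
    import pycuda.driver as drv

    GROUP_SIZE = 3   # streams an operation may use internally
    NUM_GROUPS = 5   # "streams" visible to layer code

    _streams = [drv.Stream() for _ in range(GROUP_SIZE * NUM_GROUPS)]
    stream_groups = [_streams[i * GROUP_SIZE:(i + 1) * GROUP_SIZE]
                     for i in range(NUM_GROUPS)]

    # layer code asks for group 2; operations get up to 3 real streams
    group = stream_groups[2]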

@flukeskywalker (Collaborator)

@TimDettmers, this issue may be of interest.

@TimDettmers

I will look into this and double buffering after I have taken a closer look at the codebase and the PyCUDA API. Double buffering is a bit more complicated, because even with streams there are synchronous parts when you do host -> GPU copies.
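For what it's worth, the usual way around that is to stage the data through page-locked (pinned) host buffers and alternate between two of them; a rough sketch, purely for illustration and not existing code:

    import numpy as np
    import pycuda.autoinit  # creates a CUDA context for this sketch
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray

    shape, dtype = (128, 784), np.float32
    streams = [drv.Stream(), drv.Stream()]
    pinned = [drv.pagelocked_empty(shape, dtype) for _ in range(2)]
    device = [gpuarray.empty(shape, dtype) for _ in range(2)]

    def feed(chunks):
        """Copy chunk i into one buffer while the GPU works on the other."""
        for i, chunk in enumerate(chunks):
            buf = i % 2
            streams[buf].synchronize()  # wait for previous use of this buffer
            pinned[buf][...] = chunk    # host-side staging copy
            drv.memcpy_htod_async(device[buf].gpudata, pinned[buf], streams[buf])
            # ... launch kernels on device[buf] using streams[buf] here ...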

@flukeskywalker (Collaborator)

Great! Let us know if you need any clarifications. There is some restructuring of layers going on in a branch right now, but this does not affect the overall architecture and philosophy.

@untom (Collaborator, Author) commented Oct 14, 2015

Coming back to this: I like option 3 the best. The problem with option 4 is that it gets too wordy too quickly, especially considering that you'll often want to interleave ops on different streams. The initial example would become:

with _h.set_stream(1):
    _h.dot_add_mm(dIa[t], x[t], dWi, transa=True)
with _h.set_stream(2):
    _h.dot_add_mm(dFa[t], x[t], dWf, transa=True)
with _h.set_stream(3):
    _h.dot_add_mm(dOa[t], x[t], dWo, transa=True)
with _h.set_stream(4):
    _h.dot_add_mm(dZa[t], x[t], dWz, transa=True)

which doubles the line-count AND adds a lot of indentation.

@flukeskywalker (Collaborator)

I agree.
I don't have much experience with streams, but @TimDettmers shared some thoughts recently which seemed to suggest that streams won't buy us much, except in special cases, since the GPU already performs ops concurrently when this can be done.
@TimDettmers: comments?
EDIT: The above does not appear to be true based on a quick look around. Perhaps I misunderstood what was said.

@Qwlouse (Collaborator) commented Oct 15, 2015

I think this should be post-release. It is important, so it shouldn't be rushed. Let's set up a benchmarking suite first and do a little bit of profiling.

WRT option 3 vs. option 4: actually, those are not exclusive. If with_streams constructs a wrapper anyway, we could allow both:

_h.with_streams([1]).dot_add_mm(flat_dH, W, out=flat_in_delta_buffer)
_h.with_streams([2]).dot_mm(flat_dH, flat_input, out=dW, transa=True)
_h.sum_t(flat_dH, axis=0, out=dbias)  # runs on default stream

with _h.with_streams([1]) as h1:
    h1.dot_add_mm(flat_dH, W, out=flat_in_delta_buffer)
    h1.dot_mm(flat_dH, flat_input, out=dW, transa=True)
    h1.sum_t(flat_dH, axis=0, out=dbias)  
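If the wrapper from the sketch further up also implements the context-manager protocol, both spellings come for free (again just an illustration, not existing code):

    class _StreamView(object):
        """Wrapper sketch supporting both the chained-call style and the
        with-block style shown above."""

        def __init__(self, handler, stream_ids):
            self._handler = handler
            self._stream_ids = stream_ids

        def __getattr__(self, name):
            op = getattr(self._handler, name)
            return lambda *args, **kw: op(*args, streams=self._stream_ids, **kw)

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            return False  # nothing to release; the handler owns the streams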
