
Plans for Queues/Data Loaders for async loading of data to the GPU? #5986

Open

botev opened this issue May 28, 2017 · 4 comments
Comments

botev (Contributor) commented May 28, 2017

So I have had this discussion quite a few times and have made my own attempts at this, but it seems Theano was not developed to allow asynchronous access to variables outside of its own internals. As such, are there any plans for creating operators similar to TensorFlow's queues or PyTorch's data loaders to speed up data transfers between main memory and the GPU?

nouiz (Member) commented May 31, 2017

In the past, every time this was requested, I asked for a benchmark showing that the transfer was taking a significant amount of time, and I never saw one. Do you have one?

This is useful only in a specific case: the CPU->GPU transfer must be significant compared to the computation, but not so large that doing it in parallel stops being useful.
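
A quick way to check which case you are in could look like the rough sketch below (I have not run it; the stand-in model, sizes, and the use of wall-clock timing are placeholders, and GPU work is asynchronous, so treat the numbers as a ballpark only):

```python
# Compare the time of one host->GPU copy against one full call of a compiled
# Theano function.  Replace the stand-in graph with your real training step.
import time
import numpy as np
import theano
import theano.tensor as T

# stand-in model: replace with your real training function
x = T.matrix('x')
w = theano.shared(np.random.randn(4096, 4096).astype('float32'))
train_fn = theano.function([x], T.tanh(T.dot(x, w)).sum())

batch = np.random.randn(256, 4096).astype('float32')
x_shared = theano.shared(batch)              # lives on the GPU when device=cuda

t0 = time.time()
for _ in range(100):
    x_shared.set_value(batch)                # host -> device copy each iteration
copy_time = (time.time() - t0) / 100

t0 = time.time()
for _ in range(100):
    train_fn(batch)                          # copy + compute for one minibatch
step_time = (time.time() - t0) / 100

print("copy %.5fs vs. full step %.5fs per batch" % (copy_time, step_time))
```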

There is code in the new back-end that would allow that, but it requires a sync mechanism that adds overhead in other places, so it is disabled by default. If you can confirm that you are in the timing window where this would help you, I can describe in more detail how to do it. Mostly: change a Theano flag, make the Theano function take a GPU object as input, and start the async transfer outside the Theano function.

nouiz (Member) commented Jun 6, 2017

I discussed this yesterday with @lamblin and he reminded me that one flag can, in some cases, already do what you want. The flag to try is: gpuarray.single_stream=False

Explanation:
If you have a loop that calls one Theano function at each iteration, and that Theano function returns before all the computation is done, then with that flag the transfer for the next iteration can start in parallel with the computation of the previous iteration.

How does that work? In the new GPU back-end, with that flag, we use two streams: one for memory copies and one for computation. We don't enable it by default because, when you don't benefit from it, it adds overhead and slows down the computation a little bit.

When can a Theano function return while not all computation is done? This can happen when the last nodes executed in the graph don't cause a sync, mostly when they don't transfer to/from the GPU. If your Theano function has an output like the loss or the error, this is probably not a problem for the training function, as that output can be computed in the middle of the graph.

If you want to try it in your own code, don't enable the Theano profiler while doing so: it forces a sync for each node, which would defeat this feature.
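
For example, one way to set the flag is via the environment before Theano is imported (only the flag name comes from this thread; device=cuda is just the usual new back-end setup):

```python
# Set the flags before importing theano; overrides any THEANO_FLAGS already set.
import os
os.environ["THEANO_FLAGS"] = "device=cuda,floatX=float32,gpuarray.single_stream=False"
import theano  # imported after the flags on purpose
```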

If that doesn't work, you could make the Theano function take a gpundarray as input, start the transfer of two batches in Python, and pass the oldest one to the Theano function. This wasn't tested.
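
Roughly, it could look like the untested sketch below; the exact calls (GpuArrayType, get_context, pygpu.gpuarray.asarray) are my reading of the gpuarray back-end and pygpu APIs and may need adjusting for your install:

```python
# Untested sketch of keeping two batches in flight: declare the function input as a
# GPU array, start the host->GPU transfer of batch i+1 from Python, then call the
# function on batch i.  GpuArrayType / get_context / pygpu.gpuarray.asarray are
# assumptions about the back-end API, not something verified here.
import numpy as np
import pygpu
import theano
import theano.tensor as T
from theano.gpuarray.type import GpuArrayType, get_context

ctx = get_context(None)                        # default context (assumes device=cuda)

# The input is a GPU matrix, so the host->device copy is not done inside f.
x_gpu = GpuArrayType('float32', (False, False))('x_gpu')
w = theano.shared(np.random.randn(1024, 512).astype('float32'))
f = theano.function([x_gpu], T.tanh(T.dot(x_gpu, w)).sum())

def to_gpu(batch):
    # start the transfer of one minibatch outside the Theano function
    return pygpu.gpuarray.asarray(batch, context=ctx)

def minibatches(n=10):
    for _ in range(n):
        yield np.random.randn(256, 1024).astype('float32')

it = minibatches()
pending = to_gpu(next(it))                     # first batch already on its way
for batch in it:
    next_on_gpu = to_gpu(batch)                # kick off the next transfer...
    loss = f(pending)                          # ...while computing on the oldest batch
    pending = next_on_gpu
loss = f(pending)                              # last batch
```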

botev (Contributor, Author) commented Jun 6, 2017

I will give it a shot when I get the time.

I think I tried option 2 with the gpuarray, but I got some errors from pygpu (see #5929). Note that there I am trying to set shared variables, but I got similar issues when just passing pygpu arrays.

nouiz (Member) commented Jul 7, 2017

I have a PR that could make this easy: #6125

Mostly, if you try this PR with the Theano flag gpuarray.single_stream=False, transfer and computation for different minibatches can overlap. Note: take care to have very few operations between two calls to the Theano function with different minibatches.
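
For example, keep the hot loop to just the function calls and do the bookkeeping afterwards (a sketch; train_fn and training_batches are placeholders for your own function and data loader):

```python
# The hot loop does nothing but fetch the next batch and call the compiled
# function; averaging, printing and other bookkeeping wait until after the loop.
import numpy as np

losses = []
for x_batch, y_batch in training_batches:      # placeholder iterable of numpy arrays
    losses.append(train_fn(x_batch, y_batch))  # back-to-back calls; transfers can overlap
print("mean loss over epoch: %f" % np.mean(losses))
```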
