Plans for Queues/Data Loaders for async loading of data to the GPU? #5986
Comments
In the past, every time this was requested, I asked for a benchmark showing that the transfer takes significant time, and I never saw one. Do you have one? This is useful in a specific case: when the CPU-to-GPU transfer is significant compared to the computation, but not so large that overlapping it with computation stops being useful. There is code in the new back-end that would allow this, but it requires a synchronization mechanism that adds overhead in other places, so it was disabled by default. If you can confirm that you are in the timing window where this would help you, I can describe in more detail how to do it. Mostly: change a Theano flag, make the Theano function take a GPU object as input, and start the async transfer outside the Theano function.
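A minimal sketch of the kind of benchmark being asked for, with stub functions standing in for the real work (`fake_transfer` and `fake_compute` are illustrative names, not Theano APIs; a real measurement would time the actual host-to-GPU copy and the compiled Theano function):

```python
import time
import numpy as np

def time_it(fn, *args, repeats=20):
    """Return the average wall-clock time of fn(*args) over several runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

batch = np.random.rand(256, 1024).astype("float32")

# Stand-in for the host->GPU copy: a plain memory copy.
def fake_transfer(x):
    return x.copy()

# Stand-in for the forward/backward computation on the batch.
def fake_compute(x):
    return (x @ x.T).sum()

t_transfer = time_it(fake_transfer, batch)
t_compute = time_it(fake_compute, batch)
ratio = t_transfer / t_compute

# Overlapping helps roughly when the transfer is a noticeable fraction
# of the compute time, but does not dominate it.
print(f"transfer/compute ratio: {ratio:.2f}")
```

If the ratio is tiny, the transfer is already negligible; if it is much larger than one, the transfer dominates and overlapping cannot hide it either way.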
I discussed this yesterday with @lamblin and he reminded me that one flag can, in some cases, already do what you want. The flag to try is: gpuarray.single_stream=False

How does it work? In the new GPU back-end, with that flag, we use two streams: one for memory copies and one for computation. We don't enable this by default because, when you don't benefit from it, it adds overhead and slows down the computation a little.

When can a Theano function return while not all computation is done? This can happen if the last nodes executed in the graph don't cause a sync, mostly when they don't transfer to/from the GPU. If your Theano function has outputs like the loss or error, this is probably not a problem for the training function, as those can be computed in the middle of the graph.

If you want to try it in your own code, don't enable the Theano profiler while doing so: it forces a sync for each node, which would kill that feature. If that doesn't work, you could make the Theano function take a GPU array as input, start the transfer of 2 batches in Python, and pass the oldest one to the Theano function. This wasn't tested.
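Concretely, the flag from the comment above can be set from the environment when launching a script (the `device=cuda` selector assumes the new gpuarray back-end, and `train.py` is a placeholder for your own script):

```shell
# Enable the two-stream mode (one copy stream, one compute stream) in the
# new gpuarray back-end. It is off by default because it adds a small
# overhead when there is nothing to overlap.
THEANO_FLAGS="device=cuda,gpuarray.single_stream=False" python train.py
```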
I will give it a shot when I get the time. I think I tried option 2 with the gpuarray, but I got from
I have a PR that could make this easy: #6125. Mostly, if you try that PR with the Theano flag gpuarray.single_stream=False, transfers and computation for different minibatches can overlap. Note: take care to have very few operations between the two calls to the Theano function with different minibatches.
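The two-batch pipelining idea above can be sketched in plain Python. The `start_transfer` and `compute` functions below are hypothetical stand-ins (a real version would start an asynchronous host-to-GPU copy and call the compiled Theano function); the point is the ordering: the transfer of the next minibatch is started immediately, with as little as possible between it and the call that consumes the previous one.

```python
from concurrent.futures import ThreadPoolExecutor

def start_transfer(pool, batch):
    """Stand-in for starting an async host->GPU copy; returns a future."""
    return pool.submit(lambda b: list(b), batch)  # "copy" the batch

def compute(gpu_batch):
    """Stand-in for the compiled Theano function."""
    return sum(gpu_batch)

def pipelined_run(batches):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = start_transfer(pool, batches[0])
        for nxt in batches[1:]:
            current = pending.result()           # wait for the oldest transfer
            pending = start_transfer(pool, nxt)  # start the next one right away
            results.append(compute(current))     # compute overlaps the new copy
        results.append(compute(pending.result()))
    return results

print(pipelined_run([[1, 2], [3, 4], [5, 6]]))  # [3, 7, 11]
```

Here the thread pool only simulates asynchrony; with the two-stream flag, the real overlap would come from the copy stream running alongside the compute stream on the GPU.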
So I have had this discussion quite a few times and made my own attempts at this, but it seems Theano has not been developed to allow asynchronous access to variables outside of its own internals. Given that, are there any plans for creating operators similar to TensorFlow's Queue or PyTorch's data loaders, to speed up data transfers between main memory and the GPU?