Make automatic insertions of dnn_conv() use the optimal cuDNN code path (2) #2273
Continuation of #2272, which has already been merged.
This also chooses between GpuDnnConv and GpuDnnConvGradW depending on the original purpose of the convolution in the graph. For me, the gradient w.r.t. the weights is faster than before, but still quite a bit slower than using GpuDnnConv manually. I cannot look into this any more today, but if anyone wants to have a try, please go ahead. The goal would be to find out whether it is slower because of extra dimshuffles/copies, or because I did something wrong. Please don't merge before this has been investigated, and also not before the tests have been re-run.
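For context, the reason the weight gradient can be mapped onto a dedicated op like GpuDnnConvGradW at all is that, for a valid correlation-style convolution, the gradient of a loss w.r.t. the kernel is itself a valid correlation of the input with the output gradient. A minimal pure-Python 1-D sketch of that identity (no Theano/cuDNN involved; the function names are mine, not the library's):

```python
def correlate_valid(x, w):
    """Valid cross-correlation: y[i] = sum_j x[i+j] * w[j]."""
    n = len(x) - len(w) + 1
    return [sum(x[i + j] * w[j] for j in range(len(w))) for i in range(n)]

def grad_wrt_weights(x, dy, klen):
    """dL/dw[j] = sum_i x[i+j] * dy[i]: the same correlation pattern,
    with the input and the upstream gradient as its two operands."""
    return [sum(x[i + j] * dy[i] for i in range(len(dy))) for j in range(klen)]

x = [1.0, 2.0, -1.0, 3.0, 0.5]   # input signal
w = [0.5, -1.0, 2.0]             # kernel
dy = [1.0, -2.0, 0.5]            # pretend upstream gradient dL/dy

# Weight gradient has the kernel's shape and is computed by correlating
# the input with dy -- the structure a grad-weights kernel exploits.
dw = grad_wrt_weights(x, dy, len(w))
```

This is why an optimizer can recognize a gradient-of-convolution subgraph and swap in the specialized op instead of expressing it through the forward op plus reshuffling.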
OK, found it: both local_conv_dnn() and dnn_conv() wrapped the images and kernels in gpu_contiguous(). Because dnn_conv() needs to do a lot of flipping and dimshuffling when not taking the forward-pass code path, this introduced additional copy operations. The graph produced by the automatic insertion still contains a number of redundant operations (dimshuffles and flips), but they do not incur any practical overhead.
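To illustrate why the flips show up at all: what cuDNN computes is a cross-correlation, so emulating a true (kernel-flipping) convolution means reversing the kernel first, and a reversed view wrapped in a contiguity op materializes as a copy. A pure-Python 1-D sketch of the flip identity (function names are mine, for illustration only):

```python
def correlate_valid(x, w):
    """Correlation, as a cuDNN-style 'convolution' computes it."""
    n = len(x) - len(w) + 1
    return [sum(x[i + j] * w[j] for j in range(len(w))) for i in range(n)]

def convolve_valid(x, w):
    """True convolution: the kernel is traversed in reverse."""
    k = len(w)
    n = len(x) - k + 1
    return [sum(x[i + j] * w[k - 1 - j] for j in range(k)) for i in range(n)]

x = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 0.0, -1.0]

# A true convolution equals correlation with the reversed kernel; it is
# this reversal (plus dimshuffles) that, when wrapped in gpu_contiguous,
# forced the extra copies described above.
assert convolve_valid(x, w) == correlate_valid(x, w[::-1])
```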
On my desktop, I get:
Before, I had:
I'll rebase and then it's ready to merge!