Rewrite crop layer GPU implementation #5548

Merged
merged 2 commits into from May 4, 2017

Contributor

erictzeng commented Apr 19, 2017

The crop layer currently in Caffe is really slow on GPU. For example, in fcn8s on a 500x375 image, the final crop layer alone takes 8.3ms out of a 65.1ms forward pass (12.7%)!

This seems to be because the original GPU implementation is a fairly faithful reproduction of the CPU version. The CPU version is a series of recursive calls that eventually delegate to caffe_copy to copy one contiguous portion of the crop at a time. The original GPU version is thus a similar series of recursive calls that eventually delegate to a CUDA kernel. This ends up being horribly inefficient in practice, since we are forced to sync after each copy, and we do a large number of copies (one for each leaf of the recursion tree).
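To make the cost concrete, here is a minimal sketch of the recursive pattern described above, written as plain C++ over a row-major array. It is not Caffe's actual code: `crop_recursive` and its parameters are hypothetical names, and `std::copy_n` stands in for `caffe_copy` (or, on the GPU, for one kernel launch plus sync). The return value counts the leaf copies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not Caffe's implementation: recursively crops an
// N-d row-major array and returns how many contiguous copies were issued.
// src_base and dst_base hold the flattened index of the outer coordinates
// chosen so far, already scaled by the size of the current axis.
int crop_recursive(const std::vector<float>& src, std::vector<float>& dst,
                   const std::vector<int>& src_shape,
                   const std::vector<int>& dst_shape,
                   const std::vector<int>& offsets,
                   std::size_t axis = 0, int src_base = 0, int dst_base = 0) {
  assert(src_shape.size() == dst_shape.size());
  assert(offsets.size() == dst_shape.size());
  if (axis + 1 == dst_shape.size()) {
    // Leaf: the innermost axis is contiguous, so copy one run.
    std::copy_n(src.begin() + src_base + offsets[axis], dst_shape[axis],
                dst.begin() + dst_base);
    return 1;
  }
  int copies = 0;
  for (int i = 0; i < dst_shape[axis]; ++i) {
    copies += crop_recursive(
        src, dst, src_shape, dst_shape, offsets, axis + 1,
        (src_base + offsets[axis] + i) * src_shape[axis + 1],
        (dst_base + i) * dst_shape[axis + 1]);
  }
  return copies;
}
```

The number of leaf copies is the product of all output dimensions except the innermost one, so it grows with batch size, channels, and height even though each individual copy is short.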

This PR rewrites the GPU implementation to do the entire operation in a single kernel call. Under the same conditions as before, the crop layer's forward pass now takes 0.3ms instead of 8.3ms, which is roughly a 28x speedup. In practice, the speedup depends on the size of the input, with the largest gains on the largest input blobs.
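For illustration, a single-pass crop can be sketched as a mapping from each flat output index to its flat source index. This is a sketch under assumptions, not the kernel from this PR: `crop_src_index` and `crop_flat` are hypothetical names, the layout is assumed row-major, and a plain CPU loop stands in for the GPU threads.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch, not the kernel from this PR: maps one flat output
// index to the matching flat source index for a row-major N-d crop.
int crop_src_index(int dst_index, const std::vector<int>& src_shape,
                   const std::vector<int>& dst_shape,
                   const std::vector<int>& offsets) {
  int src_index = 0;
  int src_stride = 1;
  // Peel coordinates off from the innermost axis outward.
  for (std::size_t a = dst_shape.size(); a-- > 0;) {
    const int coord = dst_index % dst_shape[a];
    dst_index /= dst_shape[a];
    src_index += (coord + offsets[a]) * src_stride;
    src_stride *= src_shape[a];
  }
  return src_index;
}

// On a GPU, each thread would compute the mapping for its own output
// index, so the whole crop is one launch. Here a loop stands in for that.
void crop_flat(const std::vector<float>& src, std::vector<float>& dst,
               const std::vector<int>& src_shape,
               const std::vector<int>& dst_shape,
               const std::vector<int>& offsets) {
  for (std::size_t i = 0; i < dst.size(); ++i) {
    dst[i] = src[crop_src_index(static_cast<int>(i), src_shape, dst_shape,
                                offsets)];
  }
}
```

Because every output element is independent, nothing needs to sync between elements; the per-element cost is a handful of integer divisions rather than a separate launch per contiguous run.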

I think this should be good to go. Let me know if anything seems off.

Eric Tzeng added some commits Apr 19, 2017

@shelhamer shelhamer merged commit 7d3f8a7 into BVLC:master May 4, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Owner

shelhamer commented May 4, 2017

Thanks for the speed-up Eric!
