
DenseNet-121 is faster than CondenseNet-74 (C=G=4) on GTX 1080 Ti #3

Open

ivankreso opened this issue Nov 30, 2017 · 3 comments

@ivankreso
I compared the forward-pass speed of the larger ImageNet model against DenseNet-121, and the latter is actually faster. After benchmarking, my guess is that the CondenseConv layer causes the slowdown, due to the memory transfers in ShuffleLayer and torch.index_select.
@ShichenLiu, can you comment on this? Did you see better performance compared to DenseNet-121 in your experiments?
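
For reference, a minimal timing sketch of the kind of measurement described above (the tensor shape, group count, and iteration count are illustrative assumptions, not the exact benchmark):

```python
# Measures the standalone cost of a channel shuffle done via torch.index_select
# on GPU. The shuffle does no arithmetic, so this is pure memory traffic.
import torch

x = torch.randn(64, 256, 56, 56, device='cuda')
groups = 4
c = x.size(1)
# Permutation that interleaves the channel groups, as a shuffle layer does.
idx = torch.arange(c, device='cuda').view(groups, c // groups).t().reshape(-1)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()
start.record()
for _ in range(100):
    y = torch.index_select(x, 1, idx)  # extra read + write of the whole tensor
end.record()
torch.cuda.synchronize()
print(f'index_select: {start.elapsed_time(end) / 100:.3f} ms per call')
```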

@ShichenLiu
Owner

ShichenLiu commented Nov 30, 2017

Our model is mainly designed for mobile devices, on which actual inference time correlates strongly with theoretical complexity. However, group convolutions and the index/shuffle operations are not efficiently implemented on GPUs.
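
For context: the shuffle itself performs zero FLOPs, so on a GPU its cost is entirely the memory copy. A minimal sketch of the standard channel-shuffle formulation (assumed to be equivalent to the repo's ShuffleLayer):

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channel groups: pure data movement, no arithmetic."""
    n, c, h, w = x.size()
    # (n, c, h, w) -> (n, g, c/g, h, w) -> swap group and channel dims.
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()  # .contiguous() is the actual copy
    return x.view(n, c, h, w)
```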

@lvdmaaten
Collaborator

GPUs tend to be memory-bound rather than compute-bound, in particular for small models that require additional memory transfers, such as ShuffleNets and CondenseNets. On mobile devices, embedded systems, etc., the ratio between compute (in FLOPS) and memory bandwidth is very different: convnets tend to be compute-bound on such platforms. If you ran the same comparison on such a platform, you would find that a CondenseNet is much faster than a DenseNet (see Table 5 of the paper for actual timing results on an ARM processor).
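
A back-of-the-envelope roofline sketch of this argument (the GPU specs are rough published figures; the layer shape and the mobile SoC numbers are purely illustrative assumptions):

```python
# An op is memory-bound when its arithmetic intensity (FLOPs per byte moved)
# falls below the device's compute/bandwidth ratio.

# Grouped 1x1 conv: batch 64, 256 -> 128 channels, 56x56 maps, G=4, fp32.
n, cin, cout, hw, g = 64, 256, 128, 56 * 56, 4
flops = 2 * n * hw * cin * cout // g
bytes_moved = 4 * (n * hw * cin + n * hw * cout + cin * cout // g)
intensity = flops / bytes_moved  # ~10.7 FLOPs/byte

gtx_1080ti = 11.3e12 / 484e9  # ~23 FLOPs/byte (11.3 TFLOPS, 484 GB/s)
mobile_soc = 10e9 / 10e9      # ~1 FLOP/byte (illustrative mobile figures)

print(f'intensity: {intensity:.1f} FLOPs/byte')
print(f'memory-bound on GTX 1080 Ti: {intensity < gtx_1080ti}')  # True
print(f'memory-bound on mobile SoC:  {intensity < mobile_soc}')  # False -> compute-bound
```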

@ivankreso ivankreso changed the title DenseNet-121 is faster than CondenseNet-74 (C=G=4) in my benchmark DenseNet-121 is faster than CondenseNet-74 (C=G=4) on GTX 1080 Ti Dec 1, 2017
@ivankreso
Author

Thanks for the clarification. I suspected that was the reason after measuring the time spent in the 1x1 bottleneck layer and the grouped 3x3 layer: the forward pass spends twice as much time in the 1x1 layer as in the 3x3 layer.
I think the additional memory transfers on the GPU could be avoided if cuDNN allowed specifying a custom ordering of the output feature maps of a grouped convolution. I don't know whether cuDNN exposes such a feature, but if it does, all of the feature-shuffling ops could be removed.
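
As an aside, a related trick already works in plain PyTorch when the layer *after* the shuffle is an ungrouped convolution: the shuffle can be folded into that layer by permuting the input-channel dimension of its weights. A sketch with hypothetical shapes (note this does not cover the grouped 3x3 case, where each output channel is tied to a fixed slice of input channels):

```python
import torch
import torch.nn.functional as F

n, c, h, w, groups = 2, 8, 4, 4, 4
x = torch.randn(n, c, h, w)
weight = torch.randn(16, c, 1, 1)  # the next (ungrouped) 1x1 conv

# Shuffle permutation that interleaves the channel groups, and its inverse.
perm = torch.arange(c).view(groups, c // groups).t().reshape(-1)
inv = torch.empty_like(perm)
inv[perm] = torch.arange(c)

y_shuffled = F.conv2d(x[:, perm], weight)  # explicit shuffle, then conv
y_folded = F.conv2d(x, weight[:, inv])     # shuffle folded into the weights

print(torch.allclose(y_shuffled, y_folded, atol=1e-5))  # True
```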
