
Identify the critical parts of computation time in GPU mode #102

Closed
kloudkl opened this issue Feb 13, 2014 · 2 comments

Comments

kloudkl (Contributor) commented Feb 13, 2014

There are three motivations to do this.

First, pull #99 referenced the benchmark results of pull #85. As noted in the latter, the experiments conducted in GPU mode were not very accurate because the batch size was set to 1 due to the limited memory of the GPU in question. This severely reduced the data throughput and probably distorted the layer-wise distribution of computation time. To make fairer comparisons, new benchmarks should use devices with more memory.

The second objective is to compare and analyze the distributions of computation time of Caffe [1] and DeCAF [2]. During training on the ImageNet dataset, nearly 60% of the computation time was spent on the last three fully connected layers in DeCAF, which can only run on the CPU. This is not necessarily the case for Caffe, especially in GPU mode.

The third, more practical, purpose is to help future optimization efforts avoid the root of all evil, premature optimization (#81). This relates to the first motivation and is of the greatest value among the three.

[1] Yangqing Jia. Caffe: An Open Source Convolutional Architecture for Fast Feature Embedding. http://caffe.berkeleyvision.org/. 2013.
[2] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, Trevor Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. arXiv:1310.1531 [cs.CV]. 2013.
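A minimal sketch of the kind of per-layer timing this issue asks for, using NumPy stand-ins rather than Caffe itself; the layer names and shapes below are made up for illustration:

```python
# Per-layer timing sketch (NumPy stand-ins, not Caffe code).
# Each "layer" is modeled as a dense matrix multiply: (batch, in) x (in, out);
# the shapes are illustrative, not a real network's dimensions.
import time
import numpy as np

rng = np.random.default_rng(0)
batch = 64

layers = {
    "conv-ish": (4096, 4096),
    "fc6":      (9216, 4096),
    "fc7":      (4096, 4096),
    "fc8":      (4096, 1000),
}

timings = {}
for name, (n_in, n_out) in layers.items():
    w = rng.standard_normal((n_in, n_out), dtype=np.float32)
    a = rng.standard_normal((batch, n_in), dtype=np.float32)
    t0 = time.perf_counter()
    for _ in range(10):
        a @ w
    timings[name] = (time.perf_counter() - t0) / 10  # average over 10 runs

total = sum(timings.values())
for name, t in timings.items():
    print(f"{name:8s} {t * 1e3:7.2f} ms  {100 * t / total:5.1f}%")
```

The same pattern, timing each layer's forward pass and reporting its share of the total, is what a Caffe benchmark on a larger-memory GPU would need to produce to answer the question above.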

Yangqing (Member) commented
The speed reported in the DeCAF paper is for a single image only. With batches, less time is spent on the fc layers because the underlying operation changes from gemv to gemm. This holds for both CPUs and GPUs.

Yangqing
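A small sketch of the gemv-to-gemm effect described above, assuming NumPy (not Caffe code): an fc layer with weight matrix W processes one image as a matrix-vector product (gemv), but a batch of N images as a single matrix-matrix product (gemm), which reuses W across the batch.

```python
# gemv vs gemm for a fully connected layer (NumPy sketch, not Caffe code).
import time
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 4096, 4096, 64

W = rng.standard_normal((n_in, n_out), dtype=np.float32)  # fc weights
X = rng.standard_normal((batch, n_in), dtype=np.float32)  # a batch of inputs

# Batch size 1 path: one matrix-vector product (gemv) per image.
t0 = time.perf_counter()
for i in range(batch):
    X[i] @ W
gemv_time = time.perf_counter() - t0

# Batched path: the whole batch in one matrix-matrix product (gemm).
t0 = time.perf_counter()
X @ W
gemm_time = time.perf_counter() - t0

print(f"gemv loop: {gemv_time * 1e3:.1f} ms, "
      f"gemm: {gemm_time * 1e3:.1f} ms "
      f"({gemv_time / gemm_time:.1f}x)")
```

The batched product amortizes the cost of reading the weight matrix over all images in the batch, which is why a batch-size-1 benchmark (as in #85) overstates the relative cost of the fully connected layers.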


kloudkl (Contributor, Author) commented Feb 13, 2014

This explains why the convolutional layers dominate the training time.
