
Identify the critical parts of computation time in GPU mode #102

Closed
kloudkl opened this issue Feb 13, 2014 · 2 comments

Comments

kloudkl (Contributor) commented Feb 13, 2014

There are three motivations to do this.

First, pull #99 referenced the benchmark results of pull #85. As noted in the latter, the experiments conducted in GPU mode were not very accurate because the batch size was set to 1 due to the limited memory of the GPU in question. This severely reduced the data throughput and probably distorted the layer-wise distribution of computation time. To make fairer comparisons, new benchmarks should use devices with more memory.

The second objective is to compare and analyze the distributions of computation time of Caffe [1] and DeCAF [2]. During training on the ImageNet dataset, nearly 60% of the computation time was spent on the last three fully connected layers in DeCAF, which can only run on the CPU. This is not necessarily the case for Caffe, especially in GPU mode.

The third, more practical, purpose is to help future optimization efforts avoid the root of all evil, premature optimization (#81). This relates to the first motivation and is of the greatest value among the three.

[1] Yangqing Jia. Caffe: An Open Source Convolutional Architecture for Fast Feature Embedding. http://caffe.berkeleyvision.org/. 2013.
[2] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, Trevor Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. arXiv:1310.1531 [cs.CV]. 2013.
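A minimal sketch of the kind of per-layer timing this issue asks for, using NumPy stand-ins rather than Caffe itself; the layer names and shapes below are made up for illustration:

```python
# Per-layer timing sketch (NumPy stand-ins, not Caffe code).
# Each "layer" is modeled as a dense matrix multiply: (batch, in) x (in, out);
# the shapes are illustrative, not a real network's dimensions.
import time
import numpy as np

rng = np.random.default_rng(0)
batch = 64

layers = {
    "conv-ish": (4096, 4096),
    "fc6":      (9216, 4096),
    "fc7":      (4096, 4096),
    "fc8":      (4096, 1000),
}

timings = {}
for name, (n_in, n_out) in layers.items():
    w = rng.standard_normal((n_in, n_out), dtype=np.float32)
    a = rng.standard_normal((batch, n_in), dtype=np.float32)
    t0 = time.perf_counter()
    for _ in range(10):
        a @ w
    timings[name] = (time.perf_counter() - t0) / 10  # average over 10 runs

total = sum(timings.values())
for name, t in timings.items():
    print(f"{name:8s} {t * 1e3:7.2f} ms  {100 * t / total:5.1f}%")
```

The same pattern, timing each layer's forward pass and reporting its share of the total, is what a Caffe benchmark on a larger-memory GPU would need to produce to answer the question above.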

Yangqing (Member) commented
The speed reported in the DeCAF paper is for a single image only. With batches, less time is spent on the fc layers because the underlying operation changes from gemv to gemm. This holds for both CPUs and GPUs.

Yangqing
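A small sketch of the gemv-to-gemm effect described above, assuming NumPy (not Caffe code): an fc layer with weight matrix W processes one image as a matrix-vector product (gemv), but a batch of N images as a single matrix-matrix product (gemm), which reuses W across the batch.

```python
# gemv vs gemm for a fully connected layer (NumPy sketch, not Caffe code).
import time
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 4096, 4096, 64

W = rng.standard_normal((n_in, n_out), dtype=np.float32)  # fc weights
X = rng.standard_normal((batch, n_in), dtype=np.float32)  # a batch of inputs

# Batch size 1 path: one matrix-vector product (gemv) per image.
t0 = time.perf_counter()
for i in range(batch):
    X[i] @ W
gemv_time = time.perf_counter() - t0

# Batched path: the whole batch in one matrix-matrix product (gemm).
t0 = time.perf_counter()
X @ W
gemm_time = time.perf_counter() - t0

print(f"gemv loop: {gemv_time * 1e3:.1f} ms, "
      f"gemm: {gemm_time * 1e3:.1f} ms "
      f"({gemv_time / gemm_time:.1f}x)")
```

The batched product amortizes the cost of reading the weight matrix over all images in the batch, which is why a batch-size-1 benchmark (as in #85) overstates the relative cost of the fully connected layers.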


kloudkl (Contributor, Author) commented Feb 13, 2014

This explains why the convolutional layers dominate the training time.
