Added bvlc_googlenet prototxt and weights #1598
Conversation
@shelhamer @longjon @jeffdonahue do you know why Travis keeps failing because it cannot download CUDA?
The publication link points to AlexNet research. Should it be http://arxiv.org/abs/1409.4842?
@emasa thanks for catching the wrong link
@sguada please push the GoogLeNet paper link and merge. Thanks. (The Travis failure is just an intermittent issue with bandwidth that doesn't matter. Feel free to ignore it.)
Force-push: 6270223 to f15210c (Added bvlc_googlenet prototxt and weights)
@sguada I ran your implementation on the newest caffe-dev and an error occurred. Using the alexnet prototxt in the caffe model directory also causes this bug. How can it be fixed? layers {
@AnshanTJU, to double-check I recompiled and tried again and got no errors. So try …
@sguada @shelhamer The latest caffe-dev code cannot pass the "make runtest" tests; the log is attached below. The master branch code passes all the tests, but it doesn't support the "poly" learning rate policy. [ RUN ] NetTest/2.TestBottomNeedBackward
Great, thanks!
@sguada I trained GoogLeNet with quick_solver.prototxt. After 730,000 iterations the top-1 accuracy is just 25.57, 33.59, 39.39. Is that a problem? What were your results during training?
@yulingzhou it seems a bit low but it's ok; mine was around 41 top-1 accuracy. With quick_solver.prototxt it gains most of the accuracy near the end. If you want to get a reasonably good model faster, let's say 60 top-1, you can lower …
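For reference, here is a sketch of the quick_solver-style settings discussed in this thread. The field values are inferred from the comments here (the "poly" policy, the 2,400,000-iteration snapshot) and may differ from the bundled models/bvlc_googlenet/quick_solver.prototxt, so treat the exact numbers as assumptions:

```
# Sketch of the quick_solver settings discussed above (values assumed,
# not copied verbatim from the bundled file).
net: "models/bvlc_googlenet/train_val.prototxt"
base_lr: 0.01          # assumed starting learning rate
lr_policy: "poly"      # decays smoothly to zero at max_iter
power: 0.5             # square-root-shaped decay
max_iter: 2400000      # matches the 2,400,000-iteration bundled snapshot
momentum: 0.9
weight_decay: 0.0002
snapshot: 40000
solver_mode: GPU
```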
@sguada @shelhamer The version of caffe-dev from Dec. 20 has unknown bugs which cause the registry count issue. I downloaded the latest caffe-dev an hour ago and it runs well. Thanks to @sguada for the great work.
@sguada It's now at 1,840,000 iterations, and the accuracy is just 35.73, 45.3, 51.64, which is much lower than yours. The current lr is 0.0048. I trained GoogLeNet with the code from the bvlc_googlenet branch. What might the problem be?
@yulingzhou I think it is going ok. As I said, until you get close to max_iter the accuracy should grow slowly but steadily; when you get near the end you should expect a rapid increase in accuracy.
@sguada Was this trained by first resizing/warping all training images to 256x256 (and then taking random 224x224 crops)? The dataset preparation details aren't mentioned above or on the ModelZoo page. Also, thanks for releasing the model!
@seanbell Yes, I used the same pre-processed data as for the caffe_reference model. Using more elaborate data pre-processing, such as different scales and aspect ratios, should lead to better results.
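To make that pre-processing concrete, here is a sketch of a training-time data layer matching the description above (images warped to 256x256 offline, random 224x224 crops plus mirroring at training time). The layer name, mean values, and LMDB path are assumptions, and the modern `layer` syntax is used rather than the older `layers` format from this era:

```
layer {
  name: "data"               # hypothetical name
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  transform_param {
    mirror: true             # random horizontal flips
    crop_size: 224           # random 224x224 crops from the 256x256 inputs
    mean_value: 104          # assumed per-channel means (BGR order)
    mean_value: 117
    mean_value: 123
  }
  data_param {
    source: "examples/imagenet/ilsvrc12_train_lmdb"  # assumed path
    batch_size: 32
    backend: LMDB
  }
}
```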
Thanks for this! The tops read top: "loss1/loss1", top: "loss2/loss1", top: "loss3/loss3". Typo?
Thank you very much for sharing!
You should use the latest version of caffe-dev, @RobotiAi.
@AnshanTJU Thank you so much!
@sguada Iteration 140000, Testing net (#0) …
I finished training GoogLeNet exactly like @sguada. I0329 02:24:37.016726 21069 solver.cpp:248] Iteration 2400000, loss = 9.71667 And the accuracy vs. iterations graph: To further illustrate the weirdness, I used @sguada's provided .caffemodel file for feature extraction with the C++ tool, and everything went fine. Specifically, the output features of the components below start producing identical features per image. The outputs of these layers are passed on to the next inception module, and along to the loss2 and loss3 classifiers. I am attaching an image of the GoogLeNet structure to show the inception module and the layers where this occurs (2.6 MB image). Any idea why this is happening? I'm guessing I should have stopped at the first occurrence of this behaviour. Thanks in advance.
@npit I'm not sure what went wrong with your training, but the loss2 and loss3 values definitely indicate that the upper layers are not learning anything. A loss around 6.9 means that the network is guessing randomly. It probably got a bad initialization and couldn't recover.
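For completeness, the 6.9 figure is just the cross-entropy of a uniform guess over ImageNet's 1000 classes:

$$-\ln\frac{1}{1000} = \ln 1000 \approx 6.91$$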
I will, thanks for your response.
@drdan14 Thanks! Is there by any chance a bash version of the parser?
@npit yes, it's sitting right next to the python version: https://github.com/BVLC/caffe/blob/master/tools/extra/parse_log.sh Please move this sort of question to the Google Group: https://groups.google.com/forum/#!forum/caffe-users
@drdan14 I meant your updated log parser, not the standard one.
Nope, WYSIWYG. But you can run the python version from the command line (type …
Alright, thanks.
@sguada Hi Sguada, why is there a std field in the "xavier" filler? Isn't the magnitude decided by the number of fan-in and fan-out units? Thank you.
It is not used by the "xavier" filler, but left there just in case someone wants …
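For context, a sketch of what such a filler block looks like in the prototxt. The "xavier" filler derives its scale from the layer's fan-in (and optionally fan-out), so the std field is parsed but ignored; the value 0.1 below is an illustrative assumption:

```
weight_filler {
  type: "xavier"   # scale computed from fan-in/fan-out, not from std
  std: 0.1         # ignored by "xavier"; only used by fillers like "gaussian"
}
```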
@sguada Got it. Thank you.
@sguada May I ask how you chose the poly learning rate policy with the 0.5 power parameter?
I tried different options and that one seemed to be more consistent and …
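For reference, Caffe's "poly" policy (as documented in caffe.proto) computes the rate at iteration $t$ as:

$$\text{lr}(t) = \text{base\_lr} \cdot \left(1 - \frac{t}{\text{max\_iter}}\right)^{\text{power}}$$

With power = 0.5 the schedule stays relatively high for most of training and drops steeply only near max_iter, which matches the "rapid increase in accuracy near the end" described earlier in the thread.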
Thanks for the swift reply.
This PR adds GoogLeNet to the set of models provided by BVLC; it includes the prototxts needed for training and deployment.
This model is a replication of the model described in the GoogLeNet publication. We would like to thank Christian Szegedy for all his help in replicating the GoogLeNet model.
Differences:
The bundled model is the iteration 2,400,000 snapshot (60 epochs) trained with quick_solver.prototxt.
This bundled model obtains a top-1 accuracy of 68.7% (31.3% error) and a top-5 accuracy of 88.9% (11.1% error) on the validation set, using only the center crop.
(Using the average of 10 crops, i.e. (4 corners + 1 center) * 2 mirrors, should give slightly higher accuracy.)
For timings of bvlc_googlenet with cuDNN using batch_size 128 on a K40c, look at #1317.