
How to speed up? I'm trying to run GoogLeNet with Batch normalization, but it is not fast enough #312

Closed
lim0606 opened this issue Jun 26, 2015 · 16 comments


@lim0606

lim0606 commented Jun 26, 2015

Hi, everyone.

I'm trying to run GoogLeNet with batch normalization, but it is not fast enough.

My implementation is here (I still have to write some scripts to read/write images and handle learning rate scheduling):
https://github.com/lim0606/lasagne-googlenet

I'm testing the running time, and it took about 1.8 sec per iteration.
(I'm using an Intel i7-4820K, a Titan Z, 24 GB RAM, and the OS is running on an SSD.)

I used to use Caffe, and I already tested GoogLeNet with BN on Caffe; it took 0.6 sec per iteration.

Is there anyone who can help me speed up this model? :)

Thank you,

Jaehyun

@benanne
Member

benanne commented Jun 26, 2015

Your best bet is to use the Theano profiler first and see if there's anything that's taking much longer than it should. Just run your script with THEANO_FLAGS=profile=True python script.py and abort it after a minute or so of training, then Theano will print profiling info.

EDIT: it's also worthwhile to define a helper function that creates an inception module. It reduces code duplication a lot.
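
For example, something along these lines (just a rough sketch using Lasagne's stock layers; the filter-count arguments are placeholders and batch normalization is left out):

```python
from lasagne.layers import Conv2DLayer, MaxPool2DLayer, ConcatLayer

def inception_module(incoming, n1x1, n3x3r, n3x3, n5x5r, n5x5, npool):
    # 1x1 convolution branch
    b1 = Conv2DLayer(incoming, num_filters=n1x1, filter_size=1)
    # 1x1 reduction followed by a 3x3 convolution
    b2 = Conv2DLayer(incoming, num_filters=n3x3r, filter_size=1)
    b2 = Conv2DLayer(b2, num_filters=n3x3, filter_size=3, pad=1)
    # 1x1 reduction followed by a 5x5 convolution
    b3 = Conv2DLayer(incoming, num_filters=n5x5r, filter_size=1)
    b3 = Conv2DLayer(b3, num_filters=n5x5, filter_size=5, pad=2)
    # 3x3 max pooling followed by a 1x1 projection
    b4 = MaxPool2DLayer(incoming, pool_size=3, stride=1, pad=1)
    b4 = Conv2DLayer(b4, num_filters=npool, filter_size=1)
    # concatenate the four branches along the channel axis
    return ConcatLayer([b1, b2, b3, b4], axis=1)
```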

Also with the Titan Z, which is a dual GPU card, Theano will only ever use one of them (at least for now). I don't know what the multi-GPU situation is with Caffe right now, but maybe that could explain part of the performance difference.

@f0k
Member

f0k commented Jun 26, 2015

Just run your script with THEANO_FLAGS=profile=True python script.py

For GPU profiling, you'll also need CUDA_LAUNCH_BLOCKING=1 (but Theano will tell you if you forget it).

@f0k
Member

f0k commented Jun 26, 2015

Have you tried just using T.std(input) here? https://github.com/lim0606/lasagne-googlenet/blob/master/googlenet/layers/bn.py#L89
There's no need for you to reuse mean, Theano will merge those in the final graph. If self.input_shape[0] is an int64, the division will upcast the result to double precision and move it off the GPU. Not sure if that happens here, as input_shape should just be a Python int, but it's a possible pitfall in Theano. Things like these can be found by profiling, feel free to post (part of) your profile output here for investigation.
On a side note, shouldn't you divide by self.input_shape[0] + self.input_shape[2] + self.input_shape[3]? Feel free to have a look at my batch normalization implementation for comparison: https://gist.github.com/f0k/f1a6bd3c8585c400c190
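
For reference, the computation I mean is roughly this (just a sketch, not the code from either implementation; the axes and the epsilon value are assumptions on my side):

```python
import numpy as np
import theano.tensor as T

x = T.tensor4('x')  # (batch, channels, rows, cols)
axes = (0, 2, 3)    # normalize over everything except the channel axis
mean = x.mean(axes, keepdims=True)
std = x.std(axes, keepdims=True)
# keep the constant in float32 so the division does not upcast the
# result to float64 and push the computation off the GPU
eps = np.float32(1e-6)
x_normed = (x - mean) / (std + eps)
```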

In any case, it would be cool to have GoogLeNet in Lasagne/Recipes, so let's try to make it work!

@lim0606
Author

lim0606 commented Jun 26, 2015

Thank you for comments!!

First of all, I'm sorry; I made a huge mistake when measuring the processing time. Currently, GoogLeNet with batch normalisation on my system takes 0.4~0.5 sec per iteration (Intel i7-4820K, Titan Z, 24 GB RAM, Ubuntu 14.04.2 running on an SSD, cuDNN v2 installed). That is slightly faster than GoogLeNet with batch normalisation running on caffe-dev with cuDNN v1. Cool!

Your best bet is to use the Theano profiler first and see if there's anything that's taking much longer than it should.

@benanne I will try it asap!

I don't know what the multi-GPU situation is with Caffe right now, but maybe that could explain part of the performance difference.

The version of caffe I'm using now is caffe-dev, and it works with only one GPU at a time. Anyway,

On a side note, shouldn't you divide by self.input_shape[0] + self.input_shape[2] + self.input_shape[3]? Feel free to have a look at my batch normalization implementation for comparison: https://gist.github.com/f0k/f1a6bd3c8585c400c190

@f0k you're right! I fixed it. Your implementation looks cooler than mine :) I will also try it.

I will implement read/write functions for ImageNet and learning rate scheduling. I will report the results of training and upload the trained weights :)

@nouiz

nouiz commented Jun 26, 2015

Can you still do the Theano profile? As you're comparing with Caffe using a different cuDNN version, this could show some possible speed-ups in Theano.


@lim0606
Author

lim0606 commented Jun 27, 2015

@nouiz
I ran it with the profiling option :) Here is the log:
https://github.com/lim0606/lasagne-googlenet/blob/master/profile_20150627_100133.txt

As you're comparing with Caffe using a different cuDNN version, this could show some possible speed-ups in Theano.

I've tried to run GoogLeNet on Caffe with cuDNN v2, but there was an issue I couldn't solve: BVLC/caffe#2508
I got the errors described in that issue when running GoogLeNet on the Caffe master branch. I got the same errors on the caffe-dev branch I'm personally using (I applied pull request BVLC/caffe#1731 to my caffe-dev to use cuDNN v2).

@lim0606
Author

lim0606 commented Jun 29, 2015

I've tried to run GoogLeNet on Caffe with cuDNN v2, but there was an issue I couldn't solve: BVLC/caffe#2508

Sorry! This error should not happen if the network architecture is properly designed. The above error was raised because I changed the kernel size for my personal experiments.

Anyway, GoogLeNet with BN on caffe-dev (w/ cuDNN v2) took 0.525 sec per iteration (about 21 sec per 40 iterations).

I'm using an Intel i7-4820K, a Titan Z, 24 GB RAM, Ubuntu 14.04.2 running on an SSD, and a 1 TB HDD for the ImageNet data.

@nouiz

nouiz commented Jun 29, 2015

Thanks for the profile.

It confirms that Theano runs relatively OK. It shows that one elemwise optimization I started but didn't finish would probably also benefit GoogLeNet:

31.0% 31.0% 174.307s 2.03e-04s C 860720 3074 theano.sandbox.cuda.basic_ops.GpuElemwise


@lim0606
Author

lim0606 commented Jul 4, 2015

Hi,

I'm sorry that I made a mistake (again) and gave you the wrong information about the processing time.

I'm testing the running time, and it took about 1.8 sec per iteration.

This was the right number, not the 0.4~0.5 sec per iteration.

Thus, my implementation is about 3.5 times slower than caffe-dev with cuDNN v2.

I updated my repository to include an ImageNet batch iterator, which can read LMDB data, crop images, and get a shuffled batch at each iteration (https://github.com/lim0606/lasagne-googlenet).

I'm now running my system, and it seems to be working fine. If the training finishes (properly), I will update the results, including the weights.

Thank you for all your assistance.

@benanne
Member

benanne commented Jul 4, 2015

Both Caffe and Theano use the same primitives (cublas, cudnn) so if the Theano version really is 3.5x slower we need to get to the bottom of this. If it is a Theano issue we should find a way to fix it in Theano itself, but if it is an issue in your code we should figure out how to ensure that people do not run into it when they write their own code. So either way we should definitely investigate what's causing this slowness.

Where did the 0.4~0.5 sec come from then? What was the source of the confusion?

@nouiz

nouiz commented Jul 4, 2015

I think the first step is to find where the time is spent. Can you check how much time is spent in dataset loading, minibatch creation, and the Theano function?
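
For example, something like this (a sketch; batch_iterator and train_fn are placeholder names for your data iterator and compiled Theano training function):

```python
import time

t_data, t_theano = 0.0, 0.0
for _ in range(100):
    t0 = time.time()
    inputs, targets = next(batch_iterator)   # dataset loading / minibatch creation
    t1 = time.time()
    loss = train_fn(inputs, targets)         # compiled Theano training function
    t2 = time.time()
    t_data += t1 - t0
    t_theano += t2 - t1
print("data: %.3f s, theano function: %.3f s" % (t_data, t_theano))
```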

This would tell us where a more detailed investigation is needed. ImageNet is so big that in Fuel we use a separate process to load and create the minibatches. This could also be done via threads on the GPU.

Also, I suppose you're using the Theano dev version. If not, update; there have been speed-ups since the last release.

I won't be available this week and will be very busy the next one, so I can't help much in the short term.


@lim0606
Author

lim0606 commented Jul 4, 2015

Thank you for comments!

Where did the 0.4~0.5 sec come from then? What was the source of the confusion?

@benanne I made the mistake of putting start_time = time.time() in the wrong place. I thought I was measuring the time spent per epoch on a small dataset I made to debug my script. It consists of 128 examples, so with batch size 32 each epoch takes 4 iterations. However, I was actually measuring the time spent per iteration, since I put start_time = time.time() at the beginning of each iteration as well as at the beginning of each epoch. The 1.8 sec printed by my script should have been read as the time per iteration, but I took it to mean 1.8 / 4 = 0.45 sec.
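
Roughly, the bug looked like this (a simplified sketch, not my actual script; num_epochs, data, iterate_minibatches, and train_fn are placeholder names):

```python
import time

for epoch in range(num_epochs):
    start_time = time.time()        # intended: time one whole epoch
    for inputs, targets in iterate_minibatches(data, batchsize=32):
        start_time = time.time()    # bug: the timer is also reset every iteration
        train_fn(inputs, targets)
    # so this prints the duration of the last iteration, not of the epoch
    print("epoch took %.3f sec" % (time.time() - start_time))
```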

Can you check how much time is spent in data set loading, the minibatch creation and the time
in the Theano function?

@nouiz The mini-batch creation takes about 0.01 sec per mini-batch. I uploaded an example Python script to test mini-batch creation for the ImageNet dataset (https://github.com/lim0606/lasagne-googlenet/blob/master/tools/example_batchiterator.py). I also uploaded a log file (https://github.com/lim0606/lasagne-googlenet/blob/master/example_batchiterator_20150705_014348.log). I ran the example after uncommenting lines 150, 182, 183, 282, 327, and 328 in https://github.com/lim0606/lasagne-googlenet/blob/master/utils/batchiterator.py. This batch iterator reads an LMDB dataset converted via Caffe.

Also, I suppose you use Theano dev version. If not, update, there is speed
up since the last release.

I just checked my version, and it is 0.7.0.dev-e6a4c073e97cadde0a8398b9a672dd91120882f8.

@lim0606
Author

lim0606 commented Jul 26, 2015

Hi, everyone

I've trained GoogLeNet with batch normalisation, but with Caffe.

I got 71.7% top-1 accuracy and 90.68% top-5 accuracy on the ImageNet validation dataset. The model was trained

  • with batch size 32
  • over 1,000,000 iterations (about 30 epochs)
  • without data augmentation (except random crop)
  • with random crop

The result was evaluated with only a single random crop.

I'd like to share the results, including the learning rate schedule, after I confirm that they are reproducible. I'm re-running the training from the beginning.

Anyway, I strongly believe it is worth training this model with Lasagne (and Theano), but my Lasagne implementation is still not fast enough.

Best regards,

Jaehyun

@matsuyamax

Has anybody succeeded in training a state-of-the-art computer vision model in Theano (Lasagne or otherwise), such as GoogLeNet on ImageNet?

I find that Theano scales very poorly to complex models and large datasets. I'd be happy to have examples to the contrary!

@benanne
Member

benanne commented Sep 19, 2015

Theano's lack of multi-GPU support (for now) is the main thing holding it back in this respect. Also, people who work on convnets for computer vision seem to prefer other libraries (e.g. Caffe). Theano seems to be more popular in different deep learning subfields.

As a result, there is indeed very little work in terms of training these large-scale image classification models in Theano directly, but I don't think it's impossible. The fact is that these models are a lot less demanding in terms of the types of neural network building blocks they consist of, and all the more demanding in terms of raw training performance, so Theano might not be the best fit in that case.

That said, it definitely isn't impossible, and nothing stops you from implementing custom optimized ops for certain operations yourself. I believe this is currently being done for batch normalization: Theano/Theano#3410

This type of work is what many other libraries expect you to do all the time anyway. Theano's symbolic paradigm just makes rapid prototyping of new ideas a bit easier.

Besides GoogLeNet though, I don't really know of any 'complex' models that Theano is less suitable for. (OxfordNet shouldn't be a problem.) Do you have any more examples?

@f0k
Member

f0k commented Jun 24, 2016

Closing because it's not really a Lasagne issue.

@f0k f0k closed this as completed Jun 24, 2016