
Rework Torch multi-GPU training #828

Merged
merged 1 commit into NVIDIA:master from gheinrich:dev/rework-torch-multigpu on Aug 22, 2016

Conversation


@gheinrich gheinrich commented Jun 9, 2016

This change enables automatic encapsulation in an nn.DataParallelTable container for multi-GPU training. This comes from @thatguymike's realization that this would simplify the programming model. A disableAutoDataParallelism flag is added to the internal parameters to disable this mechanism. This flag is intended for users who wish to retain control over what exactly gets parallelized. As an example, the standard AlexNet model sets this flag so that only the feature part of the network is parallelized.

Note: this change does not enable threading in nn.DataParallelTable when doing the automatic encapsulation. Some of the numbers below show training time with and without threading.
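For illustration, here is a minimal Lua sketch of what the automatic encapsulation amounts to (not the exact DIGITS code; the autoParallelize helper and the network/model/nGpus names are assumptions for the example):

```lua
-- Sketch only: wrap the whole model in nn.DataParallelTable when more than
-- one GPU is selected and the model does not opt out via the flag.
require 'cunn'

local function autoParallelize(network, model, nGpus)
    if nGpus > 1 and not network.disableAutoDataParallelism then
        local dpt = nn.DataParallelTable(1)  -- split each batch along dimension 1
        for i = 1, nGpus do
            cutorch.setDevice(i)
            dpt:add(model:clone():cuda(), i)
        end
        cutorch.setDevice(1)
        return dpt  -- threading intentionally not enabled here (see note above)
    end
    return model
end
```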

For the record, some training time numbers:

LeNet:
1 GPU => 1m3s
2 GPUs, all parallel, single-thread => 1m33s;1m34s
2 GPUs, all parallel, multi-thread => 3m41s

Alexnet:
1 GPU => 11m6s
2 GPUs, feature parallel, single-thread => 10m29s
2 GPUs, feature parallel, multi-thread => 9m34s;10m6s
2 GPUs, all parallel, threaded => 13m47s

GoogLeNet:
1 GPU => 16m28s
2 GPUs, all parallel, single-thread => 12m28s
2 GPUs, all parallel, multi-thread => 11m6s;11m7s

Text classification (1D convolutions followed by FC layers):
1 GPU => 32m35s;32m14s
2 GPUs, all parallel, single-thread => 19m41s
2 GPUs, all parallel, multi-thread => 18m46s

Very Simple segmentation FCN (only 2 conv layers followed by a deconv):
1 GPU => 7m12s;7m11s
2 GPUs, all parallel, single-thread => 19m48s
2 GPUs, all parallel, multi-thread => 21m41s

GPUs are Titan X.

@gheinrich

@pansk @TimZaman any feedback on this? Thanks!

@gheinrich gheinrich changed the title [DONT MERGE] Rework Torch multi-GPU training Rework Torch multi-GPU training Jun 10, 2016
@@ -106,6 +106,7 @@ labelHook | function | No | A function(input,dblabel) tha
trainBatchSize | number | No | If specified, sets train batch size. May be overridden by user in DIGITS UI.
validationBatchSize | number | No | If specified, sets validation batch size. May be overridden by user in DIGITS UI.
fineTuneHook | function | No | A function(net) that returns the model to be used for fine-tuning. The untuned model is passed as a function parameter.
disableAutoDataParallelism | boolean | No | By default models are encapsulated in a nn.DataParallelTable container to enable multi-GPU training when more than 1 GPUs are selected. Setting this flag to `true` disables this mechanism.
Contributor
How about enableAutoDataParallelism to avoid double negatives, i.e. (..) not network.disableAutoDataParallelism?

Contributor Author

I would like to turn on automatic encapsulation by default. I am not sure how to avoid the double negative in that case?

Contributor

disableAutoDataParallelism | boolean | No
to
enableAutoDataParallelism | boolean | Yeah

Contributor Author

Yeah :) Actually this column specifies whether the field is mandatory, but I see what you mean; I could simply make it optional and default to true. Let me do that, thanks!

Contributor Author

Actually I'm a bit unsure... with disableAutoDataParallelism the user can disable encapsulation with disableAutoDataParallelism=true and the Lua code looks like:

if not network.disableAutoDataParallelism then

If I change the flag to enableAutoDataParallelism then the user can disable automatic encapsulation with enableAutoDataParallelism=false and the Lua code looks like:

if network.enableAutoDataParallelism ~= false then

The former method sounds more like positive logic to me given that we want to give an option to disable encapsulation (as we want encapsulation by default).
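(For reference, a small illustration of the Lua nil handling behind both spellings; network here is just a bare table with neither flag set:)

```lua
local network = {}                                 -- user did not set either flag
print(not network.disableAutoDataParallelism)      -- true: parallelism on by default
print(network.enableAutoDataParallelism ~= false)  -- also true, but the test reads oddly
```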

Contributor

I see what you mean and I get it now.

@TimZaman

TimZaman commented Jun 10, 2016

Looks good to me.
Off topic:
I don't want to be a one-cent PITA, but how about putting all the standard networks in the same function template as well? So instead of lenet= and googlenet=, make it model= and pass the param structure the same way, etc. Maybe even fill in a template/skeleton in the UI when someone presses 'Custom Network'.

On topic:

  • Am I correct in thinking that by default the UI selects '1 GPU', so MNIST will run on 1 GPU by default? We don't want to parallelize the smaller networks, I reckon.
  • How is your training time affected by batch size, since you are splitting along the batch dimension? The speed-up is significant, but we might be able to tune it.
  • Any reason for not testing for 3 and 4 GPUs?

@gheinrich

Am I correct in thinking that by default the UI selects '1 GPU', so MNIST will run on 1 GPU by default? We don't want to parallelize the smaller networks, I reckon.

Exactly. As you can see LeNet and the tiny segmentation FCN both show worse results when doing multi-GPU training. By default, only 1 GPU is used in DIGITS.

How is your training time affected by batch size, since you are splitting along the batch dimension? The speed-up is significant, but we might be able to tune it.

We are doing strong scaling (keeping the batch size constant and splitting it across the GPUs). I think we'd get a greater speed-up if we increased the batch size as the number of GPUs grows, but then the optimization would be different: you would possibly have to change the learning rate, etc. It's a good point though, I'll try that.

Any reason for not testing for 3 and 4 GPUs?

No good reason. I have only 2 GPUs on my workstation :-(

@TimZaman

TimZaman commented Jun 10, 2016

Any reason for not testing for 3 and 4 GPUs?
No good reason. I have only 2 GPUs on my workstation :-(

https://developer.nvidia.com/employee_gpu_seeding

I'll test with 4.

@@ -82,6 +82,7 @@ return function(params)
end
return {
model = createModel(params.ngpus, channels, nclasses),
disableAutoDataParallelism = true,
@TimZaman TimZaman Jun 10, 2016

Why this and not a UI button? Same goes for the other return{ } parameters like croplen (why is it in there).

Contributor Author

Good point! croplen, trainBatchSize and validationBatchSize are there for feature parity with Caffe, which supports defining these in .prototxt. Besides, it's convenient to have a suitable default in the model description (e.g. GoogLeNet requires a much smaller batch size than LeNet) and it would be cumbersome to propagate these defaults to the UI. By extension I am inclined to keep the auto parallelism option in the model description as it's very specific to Torch and the appropriate value is model dependent.
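For illustration, a sketch of a Torch model description returning such defaults (field names as in the docs table above; createModel, channels and nclasses are assumed to be defined earlier, as in the diff, and the values are arbitrary):

```lua
-- Sketch only: the return table of a DIGITS Torch model description, with
-- per-model defaults that the user may override in the UI.
-- createModel, channels and nclasses are assumed to be defined above,
-- exactly as in the diff shown earlier.
return function(params)
    return {
        model = createModel(params.ngpus, channels, nclasses),
        croplen = 224,                      -- arbitrary example values
        trainBatchSize = 32,
        validationBatchSize = 16,
        disableAutoDataParallelism = true,  -- keep manual control over parallelization
    }
end
```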

Contributor

(..), it's convenient to have a suitable default in the model description (e.g. GoogLeNet requires a much smaller batch size than LeNet) and it would be cumbersome to propagate these defaults to the UI

In my opinion all these settings should be propagated from the UI. In an ideal scenario one would not pre-select a standard 'Network' but a standard 'Classification Job', which would also fill in the correct (meta and) hyperparameters. But that'd be another PR altogether.

@TimZaman

TimZaman commented Jun 11, 2016

NooOoOooOo! You forgot to remove the ngpu stuff in alexnet.lua!
https://github.com/gheinrich/DIGITS/blob/2da0b05cfb3c70a29029c429bc5d813df458ff25/digits/standard-networks/torch/ImageNet-Training/alexnet.lua#L48-L67
Now I'm going to redo all my tests so far XD

Update: Ah, I see now: you left it there intentionally, and you didn't use it in GoogLeNet and LeNet because it's optional. I see.
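For context, the pattern alexnet.lua keeps is roughly the classic feature-only wrapping (a sketch, not the exact file contents), which is why it sets disableAutoDataParallelism = true:

```lua
-- Sketch: parallelize only the convolutional 'features' part of the network;
-- the fully-connected classifier stays on a single GPU.
local function makeFeatureParallel(features, nGPU)
    if nGPU > 1 then
        local dpt = nn.DataParallelTable(1)
        for i = 1, nGPU do
            cutorch.setDevice(i)
            dpt:add(features:clone():cuda(), i)
        end
        cutorch.setDevice(1)
        return dpt
    end
    return features
end

-- model = nn.Sequential():add(makeFeatureParallel(features, nGPU)):add(classifier)
```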

@@ -99,28 +99,10 @@ function createModel(nGPU, nChannels, nClasses)
local splitter = nn.Concat(2)
splitter:add(main_branch):add(aux_classifier)
--local googlenet = nn.Sequential():add(features):add(splitter)
Contributor

What's the idea of L99-L101?

@TimZaman

TimZaman commented Jun 13, 2016

This looks good to me. Quite an elegant solution.
I am still running benchmarks for 40 hours on 3 and 4 GPUs. Preliminary results are in the table below.
Running vanilla GoogLeNet (with fixed batch size) on ILSVRC2012 for 5 epochs.

On 2 GPUs, the autoDataParallelism from this PR improves training time by around 10% (18h27m to 16h43m). Increasing the batch size from 32 to 128 decreases the time further to 14h19m, around a 30% decrease from the original time (!). (But this requires 2 GPUs with 8 GB of memory.)

An important note: by default it does a validation pass 6 times (epochs 0,1,2,3,4,5), which takes a non-trivial amount of time.

| #GPUs | disableAutoDataParallelism | batchSize | time (HH:MM) | notes |
|-------|----------------------------|-----------|---------------|-------|
| 1 | false | 32 | 18:25 | |
| 1 | true | 32 | 17:14 | 1 val epoch: 0h18 |
| 2 | false | 32 | 16:43 | |
| 2 | true | 32 | 18:27 | 1 val epoch: 0h18 |
| 2 | false | 128 | 14:19 | |
| 2 | true | 128 | out of memory | |
| 3 | false | 32 | 13:41 | |
| 3 | false | 128 | 14:17 | |
| 3 | true | 32 | 16:49 | 1 val epoch: 0h17 |
| 4 | false | 32 | 14:06 | slower than 3 GPUs! |
| 4 | false | 128 | 8:06 | HOLY CREPE |
| 4 | false | 256 | TBD | |
| 4 | true | 32 | 24:10 | disturbingly long, 1 val epoch: 1h10 |

It is also fun to see that 4 GPUs with batch size 32 are actually slower than 3 GPUs with batch size 32, as the overhead becomes really significant.

@gheinrich

Thank you very much for taking the time to make those measurements. Hopefully all the disableAutoDataParallelism=true runs will replicate the timings of the single-GPU case (otherwise something really odd is going on!).

With "auto data parallelism" I have had to disable threading in DataParallelTable because Torch threads annoyingly require that you require all the modules you need in all sub-threads. For example a model of mine is using dpnn. If I want to enable threading in DataParallelTable then I need to require('dpnn') from within each sub-thread. I didn't see an elegant way to let users do this without some cumbersome callbacks, etc. I have also measured that this accounts for roughly 10% hit on performance. Users who really want those 10% back can set disableAutoDataParallelism=true and do the parallelization in the model description.

Validation does take a long time indeed. You probably know this already, but just in case: you can validate less often by setting the validation interval to more than 1 in the UI.

@TimZaman

TimZaman commented Jun 16, 2016

Hey @gheinrich, look at those measurements ^. Look at the impact of the batch size. Also, a 4-GPU run with a batch size of 128 only takes 4 GB per GPU. Should we maybe further define 'batch_size', when auto data parallelism is turned on, as 'batch size per GPU', and then do batchSize = ngpu * batchSize in main.lua?
Just to repeat the above table for 4 GPUs:

| batchSize | time (HH:MM) |
|-----------|--------------|
| 32 | 14:06 |
| 128 | 8:06 |

@gheinrich

Thanks! I don't understand why training time differs between 1,2,3,4 GPUs when disableAutoDataParallelism=true. Did you manually encapsulate your model in a DataParallelTable in those cases?

In DIGITS we are doing "strong scaling" (ref) to keep the optimization problem exactly the same. If you increase the overall batch size then you're training something slightly different.

@TimZaman

I don't understand that either. I used the vanilla GoogLeNet from this PR. I might try running it again. But seriously, I just made this queue in one go and didn't touch the server in the meantime... Must be that planetary alignment thing.

@gheinrich

@TimZaman's data points from #828 (comment) are very interesting. I ran a few more tests to double-check some results that seemed somewhat unexpected to me, using my poor man's 2-Titan X workstation, the vanilla GoogLeNet, and the VOC dataset, doing 2 training epochs + 2 validations:

| #GPUs | disableAutoDataParallelism | batchSize | time (MM:SS) |
|-------|----------------------------|-----------|--------------|
| 1 | false | 32 | 10:50 |
| 1 | true | 32 | 10:56 |
| 2 | false | 32 | 8:22 |
| 2 | true | 32 | 11:01 |
| 2 | true | 64 | 7:10 |
| 2 | true | 128 | 6:34 |

This looks more like what I expected, i.e.

  • it doesn't matter what you set disableAutoDataParallelism to when you have only one GPU,
  • when disableAutoDataParallelism=true it doesn't matter how many GPUs you have,
  • with more than 1 GPU and disableAutoDataParallelism=false you start seeing a scaling benefit (on GoogLeNet),
  • increasing batch size helps speed things up, though it changes the optimization dynamics; I did notice that the loss was going down more slowly when using a larger batch size (mostly because I didn't change the learning rate, I suppose).

@TimZaman

That looks a lot more sensible indeed; I wish mine did too, but I can't help it. Changing the parameters depending on ngpu can still be considered strong scaling if you define the 'total problem' as 'an epoch' instead of 'one batch'.
And as you have just experienced, with the current implementation you are still training something slightly different; are the GPU-specific gradients averaged?

@gheinrich

With strong scaling (keeping the total number of processed samples per iteration the same) we are solving exactly the same problem, since all gradients are added together and it doesn't matter which GPU they are calculated on. This means we can keep the learning rate the same, and if we initialize the random seed in the same way we will converge towards the same solution. That is the theory; in practice I did notice a tiny difference between runs (even on a single GPU). I am going to try and see where it's coming from. The reproducibility of results is very important to some people.
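(As a small worked identity, assuming the loss is an average over the batch and the batch B is split as B = B₁ ∪ B₂ across two GPUs:

$$\frac{1}{|B|}\sum_{i\in B}\nabla_\theta \ell_i(\theta) \;=\; \frac{1}{|B|}\Big(\sum_{i\in B_1}\nabla_\theta \ell_i(\theta) \;+\; \sum_{i\in B_2}\nabla_\theta \ell_i(\theta)\Big)$$

so the accumulated update is exactly the single-GPU update, which is why the learning rate can stay the same.)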

If you change the batch size then you increase the number of samples between each update of the parameters, and even if you adjust the learning rate you will converge towards a pretty different solution.

@TimZaman

Hey Greg, do you know if the GPUs indeed have thermal throttling? That'd explain some of the inconsistent results I had, given that it was running on the office devbox in warm weather, etc.

@gheinrich

I am no expert in that area, but in my understanding the clock frequency can be increased up to 1075 MHz with GPU Boost if the temperature is low; however, the base frequency (1 GHz) is guaranteed. This shouldn't affect performance tremendously unless the CPU does get throttled.

@gheinrich

I will merge this since it has been reviewed and tested and we're a long way from the next release.

@gheinrich gheinrich merged commit ff29278 into NVIDIA:master Aug 22, 2016
@gheinrich gheinrich deleted the dev/rework-torch-multigpu branch August 22, 2016 13:06
SlipknotTN pushed a commit to cynnyx/DIGITS that referenced this pull request Mar 30, 2017