
Rework Torch multi-GPU training #828

Merged
merged 1 commit into NVIDIA:master from gheinrich:dev/rework-torch-multigpu on Aug 22, 2016

Conversation


@gheinrich gheinrich commented Jun 9, 2016

This change enables automatic encapsulation in an nn.DataParallelTable container for multi-GPU training. This comes from @thatguymike's realization that this would simplify the programming model. A disableAutoDataParallelism flag is added to the internal parameters to disable this mechanism. This flag is intended for users who wish to retain control over what exactly gets parallelized. As an example, the standard AlexNet model sets this flag so that only the feature part of the network is parallelized.

Note: this change does not enable threading in nn.DataParallelTable when doing the automatic encapsulation. Some of the numbers below show training time with and without threading.
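For illustration, here is a minimal Lua sketch of what the automatic encapsulation amounts to (not the exact DIGITS code; the autoParallelize helper and the network/model/nGpus names are assumptions for the example):

```lua
-- Sketch only: wrap the whole model in nn.DataParallelTable when more than
-- one GPU is selected and the model does not opt out via the flag.
require 'cunn'

local function autoParallelize(network, model, nGpus)
    if nGpus > 1 and not network.disableAutoDataParallelism then
        local dpt = nn.DataParallelTable(1)  -- split each batch along dimension 1
        for i = 1, nGpus do
            cutorch.setDevice(i)
            dpt:add(model:clone():cuda(), i)
        end
        cutorch.setDevice(1)
        return dpt  -- threading intentionally not enabled here (see note above)
    end
    return model
end
```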

For the record, some training time numbers:

LeNet:
1 GPU => 1m3s
2 GPUs, all parallel, single-thread => 1m33s;1m34s
2 GPUs, all parallel, multi-thread => 3m41s

Alexnet:
1 GPU => 11m6s
2 GPUs, feature parallel, single-thread => 10m29s
2 GPUs, feature parallel, multi-thread => 9m34s;10m6s
2 GPUs, all parallel, threaded => 13m47s

GoogLeNet:
1 GPU => 16m28s
2 GPUs, all parallel, single-thread => 12m28s
2 GPUs, all parallel, multi-thread => 11m6s;11m7s

Text classification (1D convolutions followed by FC layers):
1 GPU => 32m35s;32m14s
2 GPUs, all parallel, single-thread => 19m41s
2 GPUs, all parallel, multi-thread => 18m46s

Very Simple segmentation FCN (only 2 conv layers followed by a deconv):
1 GPU => 7m12s;7m11s
2 GPUs, all parallel, single-thread => 19m48s
2 GPUs, all parallel, multi-thread => 21m41s

GPUs are Titan X.

@gheinrich

@pansk @TimZaman any feedback on this? Thanks!

@gheinrich gheinrich changed the title [DONT MERGE] Rework Torch multi-GPU training Rework Torch multi-GPU training Jun 10, 2016
@@ -106,6 +106,7 @@ labelHook | function | No | A function(input,dblabel) tha
trainBatchSize | number | No | If specified, sets train batch size. May be overridden by user in DIGITS UI.
validationBatchSize | number | No | If specified, sets validation batch size. May be overridden by user in DIGITS UI.
fineTuneHook | function | No | A function(net) that returns the model to be used for fine-tuning. The untuned model is passed as a function parameter.
disableAutoDataParallelism | boolean | No | By default models are encapsulated in a nn.DataParallelTable container to enable multi-GPU training when more than 1 GPUs are selected. Setting this flag to `true` disables this mechanism.
Contributor
How about enableAutoDataParallelism to avoid double negatives, i.e. (..) not network.disableAutoDataParallelism?

Contributor Author

I would like to turn on automatic encapsulation by default. I am not sure how to avoid the double negative in that case?

Contributor

disableAutoDataParallelism | boolean | No
to
enableAutoDataParallelism | boolean | Yeah

Contributor Author

Yeah :) Actually this column specifies whether the field is mandatory, but I see what you mean; I could simply make it optional and default to true. Let me do that, thanks!

Contributor Author

Actually I'm a bit unsure... with disableAutoDataParallelism the user can disable encapsulation with disableAutoDataParallelism=true and the Lua code looks like:

if not network.disableAutoDataParallelism then

If I change the flag to enableAutoDataParallelism then the user can disable automatic encapsulation with enableAutoDataParallelism=false and the Lua code looks like:

if network.enableAutoDataParallelism ~= false then

The former method sounds more like positive logic to me given that we want to give an option to disable encapsulation (as we want encapsulation by default).
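(For reference, a small illustration of the Lua nil handling behind both spellings; network here is just a bare table with neither flag set:)

```lua
local network = {}                                 -- user did not set either flag
print(not network.disableAutoDataParallelism)      -- true: parallelism on by default
print(network.enableAutoDataParallelism ~= false)  -- also true, but the test reads oddly
```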

Contributor

I see what you mean and I get it now.

@TimZaman

TimZaman commented Jun 10, 2016

Looks good to me.
Off topic:
I don't want to be a one-cent PITA, but how about putting all the standard networks in the same function template as well? So instead of lenet= and googlenet=, make it model= and pass the param structure the same way, etc. Maybe even fill in a template/skeleton in the UI when someone presses 'Custom Network'.

On topic:

  • Am I correct in thinking that by default the UI selects '1 GPU', so MNIST will run on 1 GPU by default? We don't want to parallelize the smaller networks, I reckon.
  • How is your training time affected by batch size, since you are splitting along the batch dimension? The speed-up is significant, but we might be able to tune it.
  • Any reason for not testing for 3 and 4 GPUs?

@gheinrich

Am I correct in thinking that by default the UI selects '1 GPU', so MNIST will run on 1 GPU by default? We don't want to parallelize the smaller networks, I reckon.

Exactly. As you can see LeNet and the tiny segmentation FCN both show worse results when doing multi-GPU training. By default, only 1 GPU is used in DIGITS.

How is your training time affected by batch size, since you are splitting along the batch dimension? The speed-up is significant, but we might be able to tune it.

We are doing strong scaling (keeping the batch size constant and splitting it across the GPUs). I think we'd get a greater speed-up if we increased the batch size as the number of GPUs grows, but then the optimization would be different: you would possibly have to change the learning rate, etc. It's a good point though, I'll try that.

Any reason for not testing for 3 and 4 GPUs?

No good reason. I have only 2 GPUs on my workstation :-(

@TimZaman

TimZaman commented Jun 10, 2016

Any reason for not testing for 3 and 4 GPUs?
No good reason. I have only 2 GPUs on my workstation :-(

https://developer.nvidia.com/employee_gpu_seeding

I'll test with 4.

@@ -82,6 +82,7 @@ return function(params)
end
return {
model = createModel(params.ngpus, channels, nclasses),
disableAutoDataParallelism = true,
@TimZaman TimZaman Jun 10, 2016

Why this and not a UI button? Same goes for the other return{ } parameters like croplen (why is it in there).

Contributor Author

Good point! croplen, trainBatchSize and validationBatchSize are there for feature parity with Caffe, which supports defining these in .prototxt. Besides, it's convenient to have a suitable default in the model description (e.g. GoogLeNet requires a much smaller batch size than LeNet) and it would be cumbersome to propagate these defaults to the UI. By extension I am inclined to keep the auto parallelism option in the model description as it's very specific to Torch and the appropriate value is model dependent.
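For illustration, a sketch of a Torch model description returning such defaults (field names as in the docs table above; createModel, channels and nclasses are assumed to be defined earlier, as in the diff, and the values are arbitrary):

```lua
-- Sketch only: the return table of a DIGITS Torch model description, with
-- per-model defaults that the user may override in the UI.
-- createModel, channels and nclasses are assumed to be defined above,
-- exactly as in the diff shown earlier.
return function(params)
    return {
        model = createModel(params.ngpus, channels, nclasses),
        croplen = 224,                      -- arbitrary example values
        trainBatchSize = 32,
        validationBatchSize = 16,
        disableAutoDataParallelism = true,  -- keep manual control over parallelization
    }
end
```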

Contributor

(..), it's convenient to have a suitable default in the model description (e.g. GoogLeNet requires a much smaller batch size than LeNet) and it would be cumbersome to propagate these defaults to the UI

In my opinion all these settings should be propagated from the UI. In an ideal scenario one would not pre-select a standard 'Network' but a standard 'Classification Job', which would also fill in the correct (meta and) hyperparameters. But that'd be another PR altogether.

@TimZaman

TimZaman commented Jun 11, 2016

NooOoOooOo! You forgot to remove the ngpu stuff in alexnet.lua!
https://github.com/gheinrich/DIGITS/blob/2da0b05cfb3c70a29029c429bc5d813df458ff25/digits/standard-networks/torch/ImageNet-Training/alexnet.lua#L48-L67
Now I'm going to redo all my tests so far XD

Update: Ah, I see now: you left it there intentionally, and you didn't use it in GoogLeNet and LeNet because it's optional. I see.
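For context, the pattern alexnet.lua keeps is roughly the classic feature-only wrapping (a sketch, not the exact file contents), which is why it sets disableAutoDataParallelism = true:

```lua
-- Sketch: parallelize only the convolutional 'features' part of the network;
-- the fully-connected classifier stays on a single GPU.
local function makeFeatureParallel(features, nGPU)
    if nGPU > 1 then
        local dpt = nn.DataParallelTable(1)
        for i = 1, nGPU do
            cutorch.setDevice(i)
            dpt:add(features:clone():cuda(), i)
        end
        cutorch.setDevice(1)
        return dpt
    end
    return features
end

-- model = nn.Sequential():add(makeFeatureParallel(features, nGPU)):add(classifier)
```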

@@ -99,28 +99,10 @@ function createModel(nGPU, nChannels, nClasses)
local splitter = nn.Concat(2)
splitter:add(main_branch):add(aux_classifier)
--local googlenet = nn.Sequential():add(features):add(splitter)
Contributor

What's the idea of L99-L101?

@TimZaman

TimZaman commented Jun 13, 2016

This looks good to me. Quite an elegant solution.
I am still running benchmarks for 40 hours on 3 and 4 GPUs. Preliminary results are in the table below.
Running vanilla GoogLeNet (with fixed batch size) on ILSVRC2012 for 5 epochs.

On 2 GPUs, the autoDataParallelism from this PR improves training time by around 10% (18h27m to 16h43m). Increasing the batch size from 32 to 128 decreases the time further to 14h19m, around a 30% decrease from the original time (!). (But this requires 2 GPUs with 8 GB of memory.)

An important note: by default it does a validation pass 6 times (epochs 0,1,2,3,4,5), which takes a non-trivial amount of time.

| #GPUs | disableAutoDataParallelism | batchSize | time (HH:MM) | notes |
|-------|----------------------------|-----------|---------------|-------|
| 1 | false | 32 | 18:25 | |
| 1 | true | 32 | 17:14 | 1 val epoch: 0h18 |
| 2 | false | 32 | 16:43 | |
| 2 | true | 32 | 18:27 | 1 val epoch: 0h18 |
| 2 | false | 128 | 14:19 | |
| 2 | true | 128 | out of memory | |
| 3 | false | 32 | 13:41 | |
| 3 | false | 128 | 14:17 | |
| 3 | true | 32 | 16:49 | 1 val epoch: 0h17 |
| 4 | false | 32 | 14:06 | slower than 3 GPUs! |
| 4 | false | 128 | 8:06 | HOLY CREPE |
| 4 | false | 256 | TBD | |
| 4 | true | 32 | 24:10 | disturbingly long, 1 val epoch: 1h10 |

It is also fun to see that 4 GPUs with batch size 32 are actually slower than 3 GPUs with batch size 32, as the overhead becomes really significant.

@gheinrich

Thank you very much for taking the time to make those measurements. Hopefully all the disableAutoDataParallelism=true runs will replicate the timings of the single-GPU case (otherwise something really odd is going on!).

With "auto data parallelism" I have had to disable threading in DataParallelTable because Torch threads annoyingly require that you require all the modules you need in all sub-threads. For example a model of mine is using dpnn. If I want to enable threading in DataParallelTable then I need to require('dpnn') from within each sub-thread. I didn't see an elegant way to let users do this without some cumbersome callbacks, etc. I have also measured that this accounts for roughly 10% hit on performance. Users who really want those 10% back can set disableAutoDataParallelism=true and do the parallelization in the model description.

Validation does take a long time indeed. You probably know this already, but just in case: you can validate less often by setting the validation interval to more than 1 in the UI.

@TimZaman

TimZaman commented Jun 16, 2016

Hey @gheinrich, look at those measurements ^. Look at the impact of the batch size. Also, a 4-GPU run with a batch size of 128 only takes 4 GB per GPU. Should we maybe further define 'batch_size', when auto data parallelism is turned on, as 'batch size per GPU', and then do batchSize = ngpu * batchSize in main.lua?
Just to repeat the above table for 4 GPUs:

| batchSize | time (HH:MM) |
|-----------|--------------|
| 32 | 14:06 |
| 128 | 8:06 |

@gheinrich

Thanks! I don't understand why training time differs between 1,2,3,4 GPUs when disableAutoDataParallelism=true. Did you manually encapsulate your model in a DataParallelTable in those cases?

In DIGITS we are doing "strong scaling" (ref) to keep the optimization problem exactly the same. If you increase the overall batch size then you're training something slightly different.

@TimZaman

I don't understand that either. I used the vanilla GoogLeNet from this PR. I might try running it again. But seriously, I just made this queue in one go and didn't touch the server in the meantime... Must be that planetary alignment thing.

@gheinrich

@TimZaman's data points from #828 (comment) are very interesting. I ran a few more tests to double-check some results that seemed somewhat unexpected to me, using my poor man's 2-Titan X workstation, the vanilla GoogLeNet, and the VOC dataset, doing 2 training epochs + 2 validations:

| #GPUs | disableAutoDataParallelism | batchSize | time (MM:SS) |
|-------|----------------------------|-----------|--------------|
| 1 | false | 32 | 10:50 |
| 1 | true | 32 | 10:56 |
| 2 | false | 32 | 8:22 |
| 2 | true | 32 | 11:01 |
| 2 | true | 64 | 7:10 |
| 2 | true | 128 | 6:34 |

This looks more like what I expected, i.e.

  • it doesn't matter what you set disableAutoDataParallelism to when you have only one GPU,
  • when disableAutoDataParallelism=true it doesn't matter how many GPUs you have,
  • with more than 1 GPU and disableAutoDataParallelism=false you start seeing a scaling benefit (on GoogLeNet),
  • increasing batch size helps speed things up, though it changes the optimization dynamics; I did notice that the loss was going down more slowly when using a larger batch size (mostly because I didn't change the learning rate, I suppose).

@TimZaman

That looks a lot more sensible indeed; I wish mine did too, but I can't help it. Changing the parameters depending on ngpu can still be considered strong scaling if you define the 'total problem' as 'an epoch' instead of 'one batch'.
And as you have just experienced, with the current implementation you are still training something slightly different; are the GPU-specific gradients averaged?

@gheinrich

With strong scaling (keeping the total number of processed samples per iteration the same) we are solving exactly the same problem, since all gradients are added together and it doesn't matter which GPU they are calculated on. This means we can keep the learning rate the same, and if we initialize the random seed in the same way we will converge towards the same solution. That is the theory; in practice I did notice a tiny difference between runs (even on a single GPU). I am going to try and see where it's coming from. The reproducibility of results is very important to some people.
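(As a small worked identity, assuming the loss is an average over the batch and the batch B is split as B = B₁ ∪ B₂ across two GPUs:

$$\frac{1}{|B|}\sum_{i\in B}\nabla_\theta \ell_i(\theta) \;=\; \frac{1}{|B|}\Big(\sum_{i\in B_1}\nabla_\theta \ell_i(\theta) \;+\; \sum_{i\in B_2}\nabla_\theta \ell_i(\theta)\Big)$$

so the accumulated update is exactly the single-GPU update, which is why the learning rate can stay the same.)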

If you change the batch size then you increase the number of samples between each update of the parameters, and even if you adjust the learning rate you will converge towards a pretty different solution.

@TimZaman

Hey Greg, do you know if the GPUs indeed have thermal throttling? That'd explain some of the inconsistent results I had, given that it was running on the office devbox in warm weather, etc.

@gheinrich

I am no expert in that area, but in my understanding the clock frequency can be increased up to 1075 MHz with GPU Boost if the temperature is low; however, the base frequency (1 GHz) is guaranteed. This shouldn't affect performance tremendously unless the CPU does get throttled.

@gheinrich

I will merge this since it has been reviewed and tested and we're a long way from the next release.

@gheinrich gheinrich merged commit ff29278 into NVIDIA:master Aug 22, 2016
@gheinrich gheinrich deleted the dev/rework-torch-multigpu branch August 22, 2016 13:06
SlipknotTN pushed a commit to cynnyx/DIGITS that referenced this pull request Mar 30, 2017