Rework Torch multi-GPU training #828
Conversation
8d358d9 to 2da0b05
@@ -106,6 +106,7 @@ labelHook | function | No | A function(input,dblabel) tha
trainBatchSize | number | No | If specified, sets train batch size. May be overridden by user in DIGITS UI.
validationBatchSize | number | No | If specified, sets validation batch size. May be overridden by user in DIGITS UI.
fineTuneHook | function | No | A function(net) that returns the model to be used for fine-tuning. The untuned model is passed as a function parameter.
disableAutoDataParallelism | boolean | No | By default models are encapsulated in a nn.DataParallelTable container to enable multi-GPU training when more than one GPU is selected. Setting this flag to `true` disables this mechanism.
How about enableAutoDataParallelism, to avoid double negatives, i.e. `not network.disableAutoDataParallelism`?
I would like to turn on automatic encapsulation by default. I am not sure how to avoid the double negative in that case?
disableAutoDataParallelism | boolean | Non
to
enableAutoDataParallelism | boolean | Ouais
Ouais :) Actually this column specifies whether the field is mandatory, but I see what you mean: I could simply make it optional and default to true. Let me do that, thanks!
Actually I'm a bit unsure... With `disableAutoDataParallelism`, the user can disable encapsulation with `disableAutoDataParallelism=true` and the Lua code looks like `if not network.disableAutoDataParallelism then`.
If I change the flag to `enableAutoDataParallelism`, then the user can disable automatic encapsulation with `enableAutoDataParallelism=false` and the Lua code looks like `if network.enableAutoDataParallelism ~= false then`.
The former sounds more like positive logic to me, given that we want to give an option to disable encapsulation (since we want encapsulation by default).
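For illustration, here is a minimal sketch of how that default-on check could wrap the model (the `maybeParallelize` helper and its signature are hypothetical; only `network.disableAutoDataParallelism` and `nn.DataParallelTable` come from this PR):

```lua
require 'cunn'  -- provides nn.DataParallelTable

-- Hypothetical sketch: wrap the model for multi-GPU training unless the
-- model description opts out via disableAutoDataParallelism.
local function maybeParallelize(network, model, nGpus)
    if nGpus > 1 and not network.disableAutoDataParallelism then
        local gpus = torch.range(1, nGpus):totable()  -- {1, 2, ..., nGpus}
        -- split each input batch along dimension 1 across the selected GPUs
        return nn.DataParallelTable(1):add(model, gpus)
    end
    return model
end
```

With the flag optional and off by default, a plain model description gets encapsulated automatically, while a model that sets `disableAutoDataParallelism = true` keeps full control over what gets parallelized.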
I see what you mean and I get it now.
Looks good to me. On topic:
Exactly. As you can see, we are doing strong scaling (keeping the batch size constant and splitting it across the GPUs). I think we'd get a greater speed-up if we increased the batch size as the number of GPUs grows, but then the optimization would be different: you would possibly have to change the learning rate, etc. But it's a good point, I'll try that.
No good reason. I have only 2 GPUs on my workstation :-(
https://developer.nvidia.com/employee_gpu_seeding I'll test with 4.
@@ -82,6 +82,7 @@ return function(params)
    end
    return {
        model = createModel(params.ngpus, channels, nclasses),
        disableAutoDataParallelism = true,
Why this and not a UI button? The same goes for the other `return { }` parameters, like `croplen` (why is it in there)?
Good point! `croplen`, `trainBatchSize` and `validationBatchSize` are there for feature parity with Caffe, which supports defining these in .prototxt. Besides, it's convenient to have a suitable default in the model description (e.g. GoogLeNet requires a much smaller batch size than LeNet) and it would be cumbersome to propagate these defaults to the UI. By extension I am inclined to keep the auto parallelism option in the model description as it's very specific to Torch and the appropriate value is model dependent.
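For context, here is a hedged sketch of what such a model description return table can look like with those defaults in place (the field names are the ones documented above; the numeric values are placeholders, not values taken from this PR):

```lua
-- Illustrative only: optional defaults carried in the model description,
-- each of which the user can still override in the DIGITS UI.
return {
    model = createModel(params.ngpus, channels, nclasses),
    croplen = 224,                      -- placeholder default crop size
    trainBatchSize = 32,                -- placeholder default train batch size
    validationBatchSize = 16,           -- placeholder default validation batch size
    disableAutoDataParallelism = true,  -- keep manual control over what gets parallelized
}
```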
> (..) it's convenient to have a suitable default in the model description (e.g. GoogLeNet requires a much smaller batch size than LeNet) and it would be cumbersome to propagate these defaults to the UI

In my opinion all these settings should be propagated from the UI. In an ideal scenario one would not pre-select a standard 'Network' but a standard 'Classification Job' that also fills in the correct (meta and) hyperparameters. But that'd be another PR altogether.
NooOoOooOo! You forgot to remove the ngpu stuff in
Update: Ah, I now see, you left it there intentionally, and you didn't use it in GoogLeNet and LeNet because it's optional. I see.
@@ -99,28 +99,10 @@ function createModel(nGPU, nChannels, nClasses)
    local splitter = nn.Concat(2)
    splitter:add(main_branch):add(aux_classifier)
    --local googlenet = nn.Sequential():add(features):add(splitter)
What's the idea of L99-L101?
This looks good to me. Quite an elegant solution. On 2 GPUs, training with the auto data parallelism from this PR is around 10% faster (18h27m to 16h43m). Increasing the batch size from 32 to 128 decreases the time to 14h19m, a decrease of around 30% from the original speed (!), but this requires 2 GPUs with 8 GB of memory. An important note is also that by default it does a validation pass 6 times (epochs 0, 1, 2, 3, 4, 5), which takes a non-trivial amount of time.
Fun to see as well that 4 GPUs with batch size 32 is actually slower than 3 GPUs with batch size 32, as the overhead gets really significant.
Thank you very much for taking the time to make those measurements. Hopefully all the
With "auto data parallelism" I have had to disable threading in
Validation takes a long time indeed. You probably know this already, but just in case: you can validate less often by setting
Hey @gheinrich look at those measurements ^. Look at the impact of the batch size. Also, a 4-GPU run with a batch size of 128 only takes 4 GB per GPU. Maybe we should further define 'batch_size', when enabledataparallelism is turned on, as 'batch size per GPU', and then in
Thanks! I don't understand why training time differs between 1, 2, 3, 4 GPUs when
In DIGITS we are doing "strong scaling" (ref) to keep the optimization problem exactly the same. If you increase the overall batch size then you're training something slightly different.
I don't understand that either. I used the vanilla GoogLeNet from this PR. I might try running it again. But seriously, I just made this queue in one go and didn't touch the server in the meantime. Must be that planetary alignment thing.
@TimZaman's data points from #828 (comment) are very interesting. I ran a few more tests to double-check some that seemed somewhat unexpected to me. Using my poor man's 2-TitanX workstation, the vanilla GoogLeNet and VOC dataset, and doing 2 training epochs + 2 validations:
This looks more like what I expected, i.e.
That looks a lot more sensible indeed, I wish mine did too but I can't help it. Changing the parameters dependent on ngpu can still be considered strong scaling if you define the 'total problem' as 'an epoch' instead of defining the problem as 'one batch'.
With strong scaling (keeping the total number of processed samples the same per iteration) we are solving exactly the same problem, as all gradients are added together, so it doesn't matter which GPU they are calculated on. This means we can keep the learning rate the same, and if we initialize the random seed in the same way we will converge towards the same solution. That is the theory; in practice I did notice a tiny difference between runs (even on a single GPU). I am going to try and see where it's coming from, since the reproducibility of results is very important to some people. If you change the batch size then you increase the number of samples between each update of the parameters, and even if you update the learning rate you will converge towards a pretty different solution.
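As a quick sanity check of that claim (standard mini-batch SGD algebra written out for illustration, not something taken from this PR): splitting a batch across GPUs and summing the weighted per-GPU gradients reproduces the single-GPU gradient exactly.

```latex
% Batch B of size N split into disjoint per-GPU shards B_1, ..., B_k,
% with |B_j| = N_j and N_1 + ... + N_k = N.
L(B_j) = \frac{1}{N_j} \sum_{x \in B_j} \ell(x; \theta)

\nabla_\theta L(B)
  = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell(x_i; \theta)
  = \sum_{j=1}^{k} \frac{N_j}{N} \, \nabla_\theta L(B_j)
```

Increasing the overall batch size instead (weak scaling) changes N itself, so the sequence of parameter updates changes even with a rescaled learning rate, which is why the resulting solution can differ.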
Hey Greg, do you know if the GPUs do indeed have thermal throttling? That would explain some of the inconsistent results I had, given it was running on the office devbox in warm weather, etc.
I am no expert in that area, but to my understanding the clock frequency can be increased up to 1075 MHz with GPU Boost if the temperature is low; however, the base frequency (1 GHz) is guaranteed. This shouldn't affect performance tremendously unless the CPU does get throttled.
I will merge this since it has been reviewed and tested and we're a long way from the next release.
Rework Torch multi-GPU training
This change enables automatic encapsulation in an `nn.DataParallelTable` container for multi-GPU training. This comes from @thatguymike's realization that this would simplify the programming model. A `disableAutoDataParallelism` flag is added to the internal parameters to disable this mechanism. This flag is intended for users who wish to retain control over what exactly gets parallelized; as an example, the standard Alexnet model enables this flag to parallelize only the feature part of the network.

Note: this change does not enable threading in `nn.DataParallelTable` when doing the automatic encapsulation. Some of the numbers below show training time with/without threading.

For the record, some training time numbers:
LeNet:
1 GPU => 1m3s
2 GPUs, all parallel, single-thread => 1m33s;1m34s
2 GPUs, all parallel, multi-thread => 3m41s
Alexnet:
1 GPU => 11m6s
2 GPUs, feature parallel, single-thread => 10m29s
2 GPUs, feature parallel, multi-thread => 9m34s;10m6s
2 GPUs, all parallel, threaded => 13m47s
GoogLeNet:
1 GPU => 16m28s
2 GPUs, all parallel, single-thread => 12m28s
2 GPUs, all parallel, multi-thread => 11m6s;11m7s
Text classification (1D convolutions followed by FC layers):
1 GPU => 32m35s;32m14s
2 GPUs, all parallel, single-thread => 19m41s
2 GPUs, all parallel, multi-thread => 18m46s
Very Simple segmentation FCN (only 2 conv layers followed by a deconv):
1 GPU => 7m12s;7m11s
2 GPUs, all parallel, single-thread => 19m48s
2 GPUs, all parallel, multi-thread => 21m41s
GPUs are Titan X.