# Rework Torch multi-GPU training #828
```diff
@@ -55,7 +55,7 @@ local function inception(input_size, config)
   return concat
 end
 
-function createModel(nGPU, nChannels, nClasses)
+function createModel(nChannels, nClasses)
   -- batch normalization added on top of convolutional layers in feature branch
   -- in order to help the network learn faster
   local features = nn.Sequential()
```
```diff
@@ -99,28 +99,10 @@ function createModel(nGPU, nChannels, nClasses)
   local splitter = nn.Concat(2)
   splitter:add(main_branch):add(aux_classifier)
   --local googlenet = nn.Sequential():add(features):add(splitter)
   local googlenet = nn.Sequential():add(features):add(main_branch)
 
-  local model
-  if nGPU>1 then
-    local gpus = torch.range(1, nGPU):totable()
-    local fastest, benchmark
-    local use_cudnn = cudnn ~= nil
-    if use_cudnn then
-      fastest, benchmark = cudnn.fastest, cudnn.benchmark
-    end
-    model = nn.DataParallelTable(1, true, true):add(googlenet,gpus):threads(function()
-      if use_cudnn then
-        local cudnn = require 'cudnn'
-        cudnn.fastest, cudnn.benchmark = fastest, benchmark
-      end
-    end)
-    model.gradInput = nil
-  else
-    model = googlenet
-  end
-
-  return model
+  return googlenet
 end
 
 -- return function that returns network definition
```

**Reviewer:** What's the idea of L99-L101?
```diff
@@ -135,7 +117,7 @@ return function(params)
     assert(params.inputShape[2]==256 and params.inputShape[3]==256, 'Network expects 256x256 images')
   end
   return {
-    model = createModel(params.ngpus, channels, nclasses),
+    model = createModel(channels, nclasses),
     croplen = 224,
     trainBatchSize = 32,
     validationBatchSize = 16,
```
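For illustration, a complete model description after this change might look as follows. This is a sketch, not code from the PR: the return-table values are copied from the diff above, the commented-out flag comes from the documentation change in this PR, and the `params` field accesses are assumptions about what DIGITS passes in.

```lua
-- Sketch (not the PR's exact file): a DIGITS Torch model description
-- after this change. createModel no longer takes nGPU; multi-GPU
-- encapsulation is handled by DIGITS itself unless explicitly disabled.
return function(params)
    local nclasses = params.nclasses or 1  -- assumption: DIGITS supplies nclasses
    local channels = params.inputShape[1]  -- assumption: shape is C x H x W
    return {
        model = createModel(channels, nclasses),
        croplen = 224,
        trainBatchSize = 32,
        validationBatchSize = 16,
        -- uncomment to opt out of automatic nn.DataParallelTable wrapping:
        -- disableAutoDataParallelism = true,
    }
end
```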
```diff
@@ -106,6 +106,7 @@ labelHook | function | No | A function(input,dblabel) tha
 trainBatchSize | number | No | If specified, sets train batch size. May be overridden by user in DIGITS UI.
 validationBatchSize | number | No | If specified, sets validation batch size. May be overridden by user in DIGITS UI.
 fineTuneHook | function | No | A function(net) that returns the model to be used for fine-tuning. The untuned model is passed as a function parameter.
+disableAutoDataParallelism | boolean | No | By default models are encapsulated in a nn.DataParallelTable container to enable multi-GPU training when more than one GPU is selected. Setting this flag to `true` disables this mechanism.
```
**Reviewer:** How about `enableAutoDataParallelism` to avoid double negatives, i.e.

**Author:** I would like to turn on automatic encapsulation by default. I am not sure how to avoid the double negative in that case?

**Author:** Yeah :) actually this column specifies whether the field is mandatory, but I see what you mean; I could simply make it optional and default to

**Author:** Actually I'm a bit unsure... the current check is `if not network.disableAutoDataParallelism then`. If I change the flag, it becomes `if network.enableAutoDataParallelism ~= false then`. The former method sounds more like positive logic to me given that we want to give an option to disable encapsulation (as we want encapsulation by default).

**Reviewer:** I see what you mean and I get it now.
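To make the default-on semantics from the discussion above concrete, here is a minimal sketch of how the DIGITS-side check could work. `maybeParallelize` is a hypothetical helper name invented for illustration; the `nn.DataParallelTable` construction mirrors the code removed from the model file in this PR, not the PR's actual framework code.

```lua
-- Hypothetical helper (name is an assumption, not from the PR): wrap a
-- model in nn.DataParallelTable unless the model description opts out.
local function maybeParallelize(network, model, nGPU)
    -- "if not disableAutoDataParallelism": encapsulation is the default,
    -- so an absent flag (nil) means "wrap whenever more than one GPU is used".
    if nGPU > 1 and not network.disableAutoDataParallelism then
        local gpus = torch.range(1, nGPU):totable()
        -- split along dimension 1 (the batch dimension), as in the removed code
        local dpt = nn.DataParallelTable(1, true, true)
        dpt:add(model, gpus)
        return dpt
    end
    return model
end
```

This keeps the double-negative flag name out of the hot path: the check reads as positive logic ("encapsulate unless disabled"), which matches the author's stated preference.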
### Tensors
**Reviewer:** Why this and not a UI button? Same goes for the other `return { }` parameters like `croplen` (why is it in there).

**Author:** Good point! `croplen`, `trainBatchSize` and `validationBatchSize` are there for feature parity with Caffe, which supports defining these in .prototxt. Besides, it's convenient to have a suitable default in the model description (e.g. GoogLeNet requires a much smaller batch size than LeNet) and it would be cumbersome to propagate these defaults to the UI. By extension I am inclined to keep the auto parallelism option in the model description as it's very specific to Torch and the appropriate value is model dependent.

**Reviewer:** In my opinion all these settings should be propagated from the UI. In an ideal scenario one would not pre-select a standard 'Network' but a standard 'Classification Job'; that would also fill in the correct (meta and) hyper parameters. But that'd be another PR altogether.