Rework Torch multi-GPU training #828

Merged (1 commit, Aug 22, 2016)

@@ -82,6 +82,7 @@ return function(params)
end
return {
model = createModel(params.ngpus, channels, nclasses),
+ disableAutoDataParallelism = true,
@TimZaman (Contributor) commented on Jun 10, 2016:

Why this and not a UI button? Same goes for the other `return { }` parameters, like `croplen` (why is it in there?).

Contributor (PR author) replied:

Good point! `croplen`, `trainBatchSize` and `validationBatchSize` are there for feature parity with Caffe, which supports defining these in .prototxt. Besides, it's convenient to have a suitable default in the model description (e.g. GoogLeNet requires a much smaller batch size than LeNet), and it would be cumbersome to propagate these defaults to the UI. By extension, I am inclined to keep the auto-parallelism option in the model description, as it's very specific to Torch and the appropriate value is model-dependent. (See the sketch after this file's diff for what such defaults look like.)

Contributor replied:

> (...) it's convenient to have a suitable default in the model description (e.g. GoogLeNet requires a much smaller batch size than LeNet) and it would be cumbersome to propagate these defaults to the UI

In my opinion all these settings should be propagated from the UI. In an ideal scenario one would not pre-select a standard 'Network' but a standard 'Classification Job' that also fills in the correct (meta and) hyperparameters. But that would be another PR altogether.

croplen = 224,
trainBatchSize = 128,
validationBatchSize = 32,
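
For illustration, here is what such per-model defaults look like in a model description. This is a hypothetical sketch patterned on the GoogLeNet entry in this PR, not code from the diff; `createModel` and the exact `params` fields are assumptions.

```lua
-- Hypothetical model description: the defaults travel with the model.
return function(params)
    local channels = params.inputShape[1]   -- assumed layout: {channels, height, width}
    local nclasses = params.nclasses or 10  -- assumed to be provided by DIGITS
    return {
        model = createModel(channels, nclasses),
        croplen = 224,            -- GoogLeNet is trained on 224x224 crops
        trainBatchSize = 32,      -- needs a much smaller batch than LeNet
        validationBatchSize = 16,
    }
end
```
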
26 changes: 4 additions & 22 deletions digits/standard-networks/torch/ImageNet-Training/googlenet.lua
@@ -55,7 +55,7 @@ local function inception(input_size, config)
return concat
end

- function createModel(nGPU, nChannels, nClasses)
+ function createModel(nChannels, nClasses)
-- batch normalization added on top of convolutional layers in feature branch
-- in order to help the network learn faster
local features = nn.Sequential()
@@ -99,28 +99,10 @@ function createModel(nGPU, nChannels, nClasses)
local splitter = nn.Concat(2)
splitter:add(main_branch):add(aux_classifier)
--local googlenet = nn.Sequential():add(features):add(splitter)
Contributor commented:

What's the idea of L99-L101?

- local googlenet = nn.Sequential():add(features):add(main_branch)

- local model
- if nGPU>1 then
-    local gpus = torch.range(1, nGPU):totable()
-    local fastest, benchmark
-    local use_cudnn = cudnn ~= nil
-    if use_cudnn then
-       fastest, benchmark = cudnn.fastest, cudnn.benchmark
-    end
-    model = nn.DataParallelTable(1, true, true):add(googlenet,gpus):threads(function()
-       if use_cudnn then
-          local cudnn = require 'cudnn'
-          cudnn.fastest, cudnn.benchmark = fastest, benchmark
-       end
-    end)
-    model.gradInput = nil
- else
-    model = googlenet
- end
+ local googlenet = nn.Sequential():add(features):add(main_branch)

- return model
+ return googlenet
end

-- return function that returns network definition
@@ -135,7 +117,7 @@ return function(params)
assert(params.inputShape[2]==256 and params.inputShape[3]==256, 'Network expects 256x256 images')
end
return {
- model = createModel(params.ngpus, channels, nclasses),
+ model = createModel(channels, nclasses),
croplen = 224,
trainBatchSize = 32,
validationBatchSize = 16,
21 changes: 1 addition & 20 deletions digits/standard-networks/torch/lenet.lua
@@ -44,27 +44,8 @@ return function(params)
lenet:add(nn.Linear(500, nclasses)) -- 500 -> nclasses
lenet:add(nn.LogSoftMax())

- local model
- if params.ngpus > 1 then
-    local gpus = torch.range(1, params.ngpus):totable()
-    local fastest, benchmark
-    local use_cudnn = cudnn ~= nil
-    if use_cudnn then
-       fastest, benchmark = cudnn.fastest, cudnn.benchmark
-    end
-    model = nn.DataParallelTable(1, true, true):add(lenet,gpus):threads(function()
-       if use_cudnn then
-          local cudnn = require 'cudnn'
-          cudnn.fastest, cudnn.benchmark = fastest, benchmark
-       end
-    end)
-    model.gradInput = nil
- else
-    model = lenet
- end
-
return {
- model = model,
+ model = lenet,
loss = nn.ClassNLLCriterion(),
trainBatchSize = 64,
validationBatchSize = 32,
1 change: 1 addition & 0 deletions docs/GettingStartedTorch.md
@@ -106,6 +106,7 @@ labelHook | function | No | A function(input,dblabel) tha
trainBatchSize | number | No | If specified, sets train batch size. May be overridden by user in DIGITS UI.
validationBatchSize | number | No | If specified, sets validation batch size. May be overridden by user in DIGITS UI.
fineTuneHook | function | No | A function(net) that returns the model to be used for fine-tuning. The untuned model is passed as a function parameter.
+ disableAutoDataParallelism | boolean | No | By default, models are encapsulated in an nn.DataParallelTable container to enable multi-GPU training when more than one GPU is selected. Setting this flag to `true` disables this mechanism.
Contributor commented:

How about `enableAutoDataParallelism`, to avoid the double negative, i.e. `not network.disableAutoDataParallelism`?

Contributor (PR author) replied:

I would like to turn on automatic encapsulation by default. I am not sure how to avoid the double negative in that case?

Contributor replied:

How about changing

disableAutoDataParallelism | boolean | No

to

enableAutoDataParallelism | boolean | Yeah

Contributor (PR author) replied:

Yeah :) Actually, this column specifies whether the field is mandatory, but I see what you mean; I could simply make it optional and default to true. Let me do that, thanks!

Contributor (PR author) replied:

Actually I'm a bit unsure... With disableAutoDataParallelism, the user can disable encapsulation with disableAutoDataParallelism=true and the Lua code reads:

`if not network.disableAutoDataParallelism then`

If I change the flag to enableAutoDataParallelism, then the user can disable automatic encapsulation with enableAutoDataParallelism=false and the Lua code reads:

`if network.enableAutoDataParallelism ~= false then`

The former sounds more like positive logic to me, given that we want to offer an option to disable encapsulation (as we want encapsulation by default).

Contributor replied:

I see what you mean and I get it now.


### Tensors

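To make the new flag concrete, here is a hedged sketch of a model description that opts out of the automatic encapsulation, for instance because it wires up its own multi-GPU handling. The field names follow the table above; the tiny network and the `params.nclasses` field are stand-ins.

```lua
-- Hypothetical model description that opts out of automatic
-- nn.DataParallelTable encapsulation: main.lua then leaves the
-- model untouched even when several GPUs are selected.
return function(params)
    local nclasses = params.nclasses or 10  -- assumed to be provided by DIGITS
    local net = nn.Sequential()
    net:add(nn.View(-1, 28*28))             -- assumes 1x28x28 inputs, LeNet-style
    net:add(nn.Linear(28*28, 500))
    net:add(nn.ReLU())
    net:add(nn.Linear(500, nclasses))
    net:add(nn.LogSoftMax())
    return {
        model = net,
        loss = nn.ClassNLLCriterion(),
        disableAutoDataParallelism = true,  -- opt out of the default wrapping
        trainBatchSize = 64,
        validationBatchSize = 32,
    }
end
```
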
6 changes: 6 additions & 0 deletions tools/torch/main.lua
@@ -280,6 +280,12 @@ local parameters = {
network = network_func(parameters)
local model = network.model

+ -- embed model in parallel table unless explicitly disallowed in user-defined description
+ if nGpus > 1 and not network.disableAutoDataParallelism then
+    local gpus = torch.range(1, nGpus):totable()
+    model = nn.DataParallelTable(1, true, true):add(model, gpus)
+ end
+
-- if the loss criterion was not defined in the network
-- use nn.ClassNLLCriterion() by default
local loss = network.loss or nn.ClassNLLCriterion()
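
For context, a standalone sketch of what this encapsulation does at training time; this is not part of the PR and assumes the cunn package is installed.

```lua
-- nn.DataParallelTable(dim, flattenParams, useNCCL) replicates the
-- module on each listed GPU, splits dimension 1 (the batch) across
-- them on forward/backward, and reduces gradients onto the first GPU.
-- The (1, true, true) arguments match the wrapping used in main.lua above.
require 'cunn'

local nGpus = cutorch.getDeviceCount()
local net = nn.Sequential():add(nn.Linear(100, 10)):cuda()

local model = net
if nGpus > 1 then
    local gpus = torch.range(1, nGpus):totable()
    model = nn.DataParallelTable(1, true, true):add(net, gpus)
end

local batch = torch.CudaTensor(64, 100):uniform()
local output = model:forward(batch)           -- each GPU processes a 64/nGpus slice
model:backward(batch, output:clone():fill(1)) -- gradients are accumulated on GPU 1
```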