Loss generalization #686
Merged
jeffdonahue merged 10 commits into BVLC:dev on Aug 13, 2014

Conversation

jeffdonahue (Contributor)

This PR generalizes the loss to allow any top blob to produce a loss L = a * (blob_0 + blob_1 + blob_2 + ... + blob_{N-1}), where blob_0, ..., blob_{N-1} are the N elements of the blob and a is a scalar coefficient.

This is accomplished by changing the interface of Forward_{cpu,gpu} implemented by layers: they become void Forward_{cpu,gpu} rather than Dtype Forward_{cpu,gpu}. The current loss layers now all produce a singleton top blob (and don't return a value), which I think they all already did because of @sguada's changes. For backwards compatibility -- so users can still use a loss layer without explicitly specifying a top blob -- I added a layer property bool AutoTopBlobs() that automatically creates the MinTopBlobs() or ExactNumTopBlobs() required by that layer; currently only the loss layers override AutoTopBlobs() to return true.

To set the scalar coefficient, you add a loss_weight field -- a float for each top blob -- to your LayerParameter definition. For example:

  layers {
    name: "loss"
    type: SOFTMAX_LOSS
    bottom: "ip2"
    bottom: "label"
  }

That's the "old" way of specifying a SOFTMAX_LOSS layer. It still works -- it has an implicit top blob with an implicit loss_weight of 1. It's equivalent to this:

  layers {
    name: "loss"
    type: SOFTMAX_LOSS
    bottom: "ip2"
    bottom: "label"
    top: "softmax_error"
    loss_weight: 1
  }

If you'd instead specified loss_weight: 2, that would have exactly the same effect as doubling your base_lr and halving your weight_decay (I confirmed this with lenet_consolidated_solver.prototxt, which sets a seed -- the training losses were always exactly doubled; the test losses were unchanged since I didn't set loss_weight: 2 in the test net). So the loss_weight coefficients don't give you any extra power if you have only one loss, but if you have multiple losses, these extra parameters let you scale the different losses appropriately.
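
For example, a net with two losses might weight a secondary Euclidean loss at a tenth of the main softmax loss. This is only a sketch -- the aux_* layer and blob names are made up for illustration:

  layers {
    name: "loss"
    type: SOFTMAX_LOSS
    bottom: "ip2"
    bottom: "label"
    top: "softmax_loss"
    loss_weight: 1
  }
  layers {
    name: "aux_loss"
    type: EUCLIDEAN_LOSS
    bottom: "aux_pred"
    bottom: "aux_target"
    top: "aux_loss"
    loss_weight: 0.1
  }

The overall objective is then 1 * softmax_loss + 0.1 * aux_loss.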

*_LOSS layers are the only ones with a non-zero default loss_weight (of 1); all other layers default to loss_weight: 0, but as long as they can perform Backward they can now produce a loss. I'm not entirely sure how useful this will be, but it seemed like a pretty elegant generalization and required little extra work. The only layers whose backward passes actually had to change were the LOSS layers themselves: the scale coefficient is stored in the diff() of the top blob, and since a loss layer's top blob is a singleton, each loss layer now multiplies its gradients by the scale read from that singleton diff. All other layers already knew how to backprop their diffs and could be used as is. The only annoying part was that to let a top blob be both an input to other layers and a loss, I had to use split layers, since that is functionally the same as sending the output to two different consumers (the diff must be accumulated from the direct loss and from any layers the blob feeds).
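
As a sketch of that generalization (the inner product layer below and its parameters are hypothetical, not taken from the PR), a non-loss top blob can be given a non-zero loss_weight while still feeding another layer, and the net inserts the required split automatically:

  layers {
    name: "ip2"
    type: INNER_PRODUCT
    bottom: "ip1"
    top: "ip2"
    inner_product_param { num_output: 10 }
    # hypothetical: contributes 0.01 * (sum of ip2's elements) to the loss,
    # while ip2 also feeds the softmax loss below through an automatic split
    loss_weight: 0.01
  }
  layers {
    name: "loss"
    type: SOFTMAX_LOSS
    bottom: "ip2"
    bottom: "label"
  }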

Another nice thing about this is that it allows you to put an ACCURACY layer in a train net in a non-hacky way. Since the accuracy layer produces 0 loss, the net is able to figure out that it can skip running Backward through the accuracy layer. (The exception to this would be if you tried to specify loss_weight: <something != 0> in your ACCURACY layer, in which case it appropriately breaks.) I added an ACCURACY layer to the lenet_consolidated_solver.prototxt train net as a preview of this.
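
A minimal sketch of that usage in a train net, reusing the ip2 and label blobs from the example above:

  layers {
    name: "accuracy"
    type: ACCURACY
    bottom: "ip2"
    bottom: "label"
    top: "accuracy"
  }

Because ACCURACY keeps its default loss_weight: 0, the net determines that no backprop is needed through this layer and skips its Backward pass.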

@jeffdonahue (Contributor, Author)

This is rebased and fairly well-tested. In addition to the several new unit tests, I've verified (seeded) ImageNet training behaves as before (with and without an ACCURACY layer), and verified many variations of lenet training that should be equivalent are equivalent (including the two versions of the SOFTMAX_LOSS I pasted in the original comment above).

I hope someone will get a chance to take a look at this at some point soonish to avoid constant rebases. I know it's a lot of code so I understand it might be a little while before someone has the time though -- sorry about that.

Possible disadvantages of this PR that I've thought of are the following:

  • it breaks the method signature (changes the return type) of Forward_{cpu,gpu} -- but note that these are the protected methods called only by the public Forward, so this only requires people who have written their own layers to change the return type to void and put the loss into the top blob rather than returning it.
  • it changes the SetUp protocol -- layers now implement their layer-specific setup in FurtherSetUp, which is called by the base SetUp. I did this because there is now more generic setup for each layer that is a pain to have to remember to put in every layer; to keep the current protocol, every layer implementing SetUp would have to add code to both the beginning and the end of its implementation. My personal preference is to just change the name of the overridden method rather than requiring so much of each layer's SetUp, but I won't do that unilaterally -- what do other people think? As with the previous disadvantage, this only requires changes from people implementing their own layers (and here the change is just renaming the method from SetUp to FurtherSetUp).

I think code that only uses public Caffe interfaces (including the C library and prototxts) will be completely unaffected.

shelhamer self-assigned this on Jul 29, 2014

@shelhamer (Member)

This is trivial but can you fix your commit messages? They don't have headers and are just long lines.

@shelhamer (Member)

This is cosmetic but it seems to me like SetUp() should keep its name and original purpose as layer initialization and PreSetUp() should prepare the infrastructure. I only say this because SetUp() was the interface method exposed for layer development. SetUp() reads more naturally to me as a method to override.

All the same, we've said again and again that now is the time to fix interfaces, so I don't have strong feelings on this.

@jeffdonahue (Contributor, Author)

Yup, I'll clean up the commits.

My reasoning for choosing SetUp as the parent method name was that the parent's SetUp now calls the child layer's setup method (rather than the other way around, as before). So unless we were going to break the public Layer interface, the Layer-implemented SetUp has to keep the name SetUp, and the method for child classes to override has to be named something else. FurtherSetUp is one option, but I agree it's not very natural.

Open to suggestions on names and overall design -- including switching back to the old way, where each child explicitly calls the parent Layer<Dtype>::SetUp(...) at the beginning of its SetUp. But in that case they'd also now have to call the parent's Layer<Dtype>::PostSetUp(...) at the end, which starts to get to the point where I'd prefer it to be an automated thing (but there could definitely be a better way to do things like this in C++ for all I know).

Commit messages from this pull request (excerpts shown in the timeline; some titles truncated):

  • Check that the loss and gradients throughout the net are appropriately scaled for a few loss_weight values, assuming a default weight of 1 in the loss layer only. Also modify test_gradient_check_util to associate a loss of 2 rather than 1 with the top blob, so that loss layer tests fail if they don't scale their diffs.
  • "…its elements are summed with a scalar coefficient." (truncated title)
  • Forward for layers no longer returns a loss; instead all loss layers must have top blobs. Existing loss layers are given a top blob automatically by Net::Init, with an associated top_loss_weight of 1 (set in LossLayer::FurtherSetUp). Due to the increased amount of common SetUp logic, the SetUp interface is modified such that all subclasses should normally override FurtherSetUp only, which is called by SetUp.
  • Test that we can call backward with an ACCURACY layer. This currently fails, but should be possible now that we explicitly associate a loss weight with each top blob.

@jeffdonahue (Contributor, Author)

After discussing with @shelhamer, I've changed the name of the function that layers will now override from FurtherSetUp to LayerSetUp. Merging this momentarily.

jeffdonahue added a commit that referenced this pull request on Aug 13, 2014
jeffdonahue merged commit 34831e2 into BVLC:dev on Aug 13, 2014
jeffdonahue deleted the loss-generalization branch on August 13, 2014 at 22:44
This was referenced on Sep 18, 2014
mitmul pushed a commit to mitmul/caffe that referenced this pull request on Sep 30, 2014
RazvanRanca pushed a commit to RazvanRanca/caffe that referenced this pull request on Nov 4, 2014