Better normalization options for SoftmaxWithLoss layer #3296

Merged
merged 1 commit on Nov 23, 2015

Contributor

cdoersch commented Nov 6, 2015

The current SoftmaxWithLoss layer has two options for normalizing the output: either divide by the number of 'valid' samples (those without the ignore label), or by the batch size. Notably missing is any way to turn normalization off completely. This is needed in the case where the batches are not all the same size, and hence batches with more examples need to get more weight. One might expect that normalize = false would do this, but confusingly it still divides by the batch size.

This PR replaces the existing 'normalize' boolean parameter in caffe.proto with an enum that has four different options, with more informative names. Two of the options (VALID and BATCH_SIZE) mimic existing behavior, and the current boolean parameter is still supported for backwards compatibility (but is deprecated). The NONE option allows you to turn normalization off completely, and the FULL option allows you to normalize by the full shape of the output map, i.e., like VALID but locations with the 'ignore' label are still included in the count for normalization.

Note that there's still a bit of a mess here, since it seems that SoftmaxWithLoss is the only layer that actually reads LossParameter's normalization options. This remains unchanged for this PR.
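The four modes described above can be sketched as a standalone function (hypothetical names and signature; in the PR the enum lives in caffe.proto's LossParameter and the logic in a layer method, but the mode-to-denominator mapping is the idea):

```cpp
// Hypothetical standalone sketch of the four normalization modes described
// in this PR; not the PR's actual code.
enum NormalizationMode { FULL, VALID, BATCH_SIZE, NONE };

// outer_num: batch size; inner_num: spatial locations per example;
// valid_count: number of labels that are not the ignore label.
double get_normalizer(NormalizationMode mode, int outer_num, int inner_num,
                      int valid_count) {
  switch (mode) {
    case FULL:       return double(outer_num) * inner_num;  // ignored locations still counted
    case VALID:      return double(valid_count);            // old normalize = true
    case BATCH_SIZE: return double(outer_num);              // old normalize = false
    case NONE:       return 1.0;                            // normalization off
  }
  return 1.0;  // unreachable; silences compiler warnings
}
```

For example, for a batch of 8 single-location outputs where 3 carry the ignore label, VALID divides by 5 while FULL and BATCH_SIZE both divide by 8, and NONE leaves the summed loss untouched.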

Contributor

jeffdonahue commented Nov 10, 2015

Thanks @cdoersch -- this makes the normalization options clearer and looks correct to me. Ideally there would be a test for each option, but maybe that's not needed with the current set of tests passing, and the choice of normalization constant factored into a separate method that's called in both Forward and Backward. Maybe @longjon would like to take a look before merge since he added the normalize option in #1654 and this deprecates it?

Contributor

seanbell commented Nov 10, 2015

It makes sense to address the divide-by-zero problem in this PR as well, since that's an issue with normalization. If there are 0 valid items (which can happen in many applications for some batches), then I think the loss should be 0 and the gradient also 0.

Simple ways to address this: (a) make the minimum allowed denominator 1 (since it's always an integer) or (b) special-case the denominator code to not divide if 0.
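Both guards fit in a line each. A sketch with hypothetical helper names (`loss` and `count` standing for the accumulated loss and the valid-label count):

```cpp
#include <algorithm>

// Two hypothetical one-line guards for the zero-denominator case,
// matching options (a) and (b) above; names are illustrative.
double normalize_a(double loss, int count) {
  return loss / std::max(count, 1);        // (a) clamp the denominator at 1
}

double normalize_b(double loss, int count) {
  return count == 0 ? 0.0 : loss / count;  // (b) special-case zero
}
```

Note that when `count` is 0 the accumulated loss is also 0 (every location carried the ignore label), so option (a) likewise yields a loss of 0 rather than NaN.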

@seanbell seanbell commented on the diff Nov 10, 2015

src/caffe/layers/softmax_loss_layer.cpp
@@ -71,11 +106,7 @@ void SoftmaxWithLossLayer<Dtype>::Forward_cpu(
++count;
}
}
- if (normalize_) {
- top[0]->mutable_cpu_data()[0] = loss / count;
- } else {
- top[0]->mutable_cpu_data()[0] = loss / outer_num_;
- }
+ top[0]->mutable_cpu_data()[0] = loss / get_normalizer(normalization_, count);
@seanbell

seanbell Nov 10, 2015

Contributor

Potential divide-by-zero here (and other similar locations).

Yes, the old code had the same problem, but it might as well get fixed here, since a separate PR fixing the divide-by-zero would conflict with this one.

Contributor

cdoersch commented Nov 10, 2015

@seanbell this is a good point. I initially didn't worry about this because a zero denominator indicates a batch with zero valid examples; it seemed to me like you would want to error out if this happens. However, on second thought, I realize that in some cases datasets may accidentally contain examples that are totally unlabeled, which means that once in a great while a randomly-selected batch will contain no labels at all, leading to heisenbug behavior.

Do you think we should log a warning if we correct the denominator away from zero?

Contributor

seanbell commented Nov 11, 2015

I don't think a warning is always necessary. With multi-task setups, you can have auxiliary classification tasks that only have valid labels once in a while, so the log would get flooded with warnings in those cases. I'd be happy with a warning that was turned on by default, but easily disable-able. More debug info is better than less.

Contributor

cdoersch commented Nov 11, 2015

But would you really want normalization turned on at all in that use case? i.e., if the labeled examples are so rare that many batches don't even contain one, do people really want an example to get half the weight if it just happens to occur in the same batch with another example where the label is defined?

I guess I don't really have a strong opinion about this, so if you have a use case, then I'm fine with doing it your way.

Contributor

jeffdonahue commented Nov 11, 2015

My personal opinion here is that scaling down the loss by the number of non-ignored labels is the wrong thing to do to begin with, and this normalization should not be the default; but if it is used, it shouldn't have a special case for 0 non-ignored labels, as the NaN/inf problem that occurs with 0 non-ignored labels is just the most extreme and illustrative case of the general problem with this normalization strategy. For example, say you're using a batch size of 128. Then, relative to a batch with 128 valid labels, in a batch with 64 valid labels each valid instance's error gradient is scaled up by a factor of 2; with 16 valid labels, by a factor of 8; with 1 valid label, by a factor of 128. Naturally, with no valid labels, the error gradient is scaled up by infinity. That is of course bad, but only exactly as bad as it should be. With 1 valid label things are already pretty bad: scaling that one instance's gradient up by a factor of 128 is likely to lead to quite unstable, high-variance learning if you chose your learning rate expecting your update to reflect the gradient averaged over 128 independent instances.
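The scale factors in that example follow directly from dividing by the valid count instead of the batch size; a tiny sketch (hypothetical helper, not PR code):

```cpp
// Relative gradient scale of a single valid instance under VALID
// normalization, compared to a fully-labeled batch of the same size:
// B / valid_count. Hypothetical illustration only.
double relative_scale(int batch_size, int valid_count) {
  // Diverges to infinity as valid_count approaches 0.
  return static_cast<double>(batch_size) / valid_count;
}
```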

That said, I think we should change the default to BATCH_SIZE normalization, but that would change existing behavior so it might deserve a separate PR with more discussion later.

Contributor

seanbell commented Nov 11, 2015

with no valid labels, the error gradient is scaled up by infinity. Which is of course bad, but only exactly as bad as it should be.

In my use case, I have a very large model, small batchsize (4), and use gradient accumulation to fit it on the GPU. With this setup, you only need a single iteration to have no valid labels to totally kill training.

The moment you have NaN, everything is dead, so you might as well crash with a debug message in the normalizer code. What I am saying is that you may want to instead skip the iteration and avoid the crash. I can see that it might not be the best default, but I think it's an option worth considering.

Contributor

jeffdonahue commented Nov 11, 2015

I'm not sure about a check as it's possible you might not actually be backpropagating the error and just want to compute the softmax loss for debugging/display purposes (e.g. you're using loss_weight: 0), which is actually the one case where I'd agree with using the VALID normalization :). I'd prefer such cases be caught by something more generic like #1349.

Contributor

seanbell commented Nov 11, 2015

That's a valid use case too (though the loss_weight: 0 case may need fixing, #2895).

I was just saying that I have a use case for avoiding divide-by-zero (noisy partially labeled datasets with small batchsizes, and certain multitask setups), and I thought others might as well. It could be a configurable option, with the default being to divide-by-zero and output NaN. But if nobody else runs into this problem, it could be left out for now.

Edit: Another use case. I have a multitask setup where some batches have semantic segmentation labels, and others do not. For those that do have labels, I use "valid" normalization. For those that don't, the normalizing constant is 0, and I set loss = 0 instead of divide-by-zero.

Contributor

seanbell commented Nov 11, 2015

My personal opinion here is that scaling down the loss by the number of non-ignored labels is the wrong thing to do to begin with

Since weight decay is the same each iteration, I think it does make sense to scale by the number of valid labels. Otherwise, the ratio of regularizer-to-data changes from batch to batch.

Sorry if I'm side-tracking this PR. I'm arguing for something (avoiding divide-by-zero) which could be discussed in another PR.

Contributor

jeffdonahue commented Nov 11, 2015

Since weight decay is the same each iteration, I think it does make sense to scale by the number of valid labels. Otherwise, the ratio of regularizer-to-data changes from batch to batch.

The ratio does change from minibatch to minibatch, but it's still an unbiased estimator of the objective if minibatches are sampled uniformly. But this statement made me go write down some math, and I realized I was wrong -- the VALID normalization (with your proposed special case for 0 valid labels) also gives an unbiased estimator of an objective, just one with a different weighting of the data term from the BATCH normalization (amounting to a different weighting of the regularization term). Note that this assumes minibatches are drawn randomly, however; if you're stepping through the dataset sequentially the VALID normalization is biased in that some examples -- ones that appear in batches with more ignored instances -- are actually always weighted higher than others, whereas even with sequential training I think the BATCH strategy remains an unbiased estimator of the objective at each iteration, assuming the dataset was shuffled to begin with.

Anyway though, I retract my earlier hardliner and probably mathematically unfounded position and would support the special case you suggested @seanbell (doesn't need to be in the PR, but also fine if it is, to me). Thanks for the discussion.

Contributor

cdoersch commented Nov 11, 2015

Considering that we have one 'don't care', one 'weak support', and one 'support' from someone who actually has a use case, I think it makes sense to add this to the PR. I'll implement it as a one-line change at the end of the get_normalizer function where I return max(1.0, normalizer).
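That one-liner, sketched as a free function (assuming `normalizer` holds whatever value the selected normalization mode computed):

```cpp
#include <algorithm>

// Hypothetical sketch of the proposed final line of get_normalizer:
// clamp the denominator at 1 so a batch with zero valid labels yields
// loss 0 (since the summed loss is also 0) rather than NaN.
double clamp_normalizer(double normalizer) {
  return std::max(1.0, normalizer);
}
```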

Contributor

jeffdonahue commented Nov 23, 2015

LGTM, thanks for the additional normalization options and clarifications of existing ones @cdoersch!

@jeffdonahue jeffdonahue added a commit that referenced this pull request Nov 23, 2015

@jeffdonahue jeffdonahue Merge pull request #3296 from cdoersch/normalize_batch
Better normalization options for SoftmaxWithLoss layer
8e8d97d

@jeffdonahue jeffdonahue merged commit 8e8d97d into BVLC:master Nov 23, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Contributor

seanbell commented Nov 23, 2015

Thanks! This is very helpful.

@shelhamer shelhamer added a commit to shelhamer/caffe that referenced this pull request Nov 17, 2016

@shelhamer shelhamer sigmoid cross-entropy loss: normalize by one/batch size/all/etc.
sig-ce loss handles all the same normalizations as the softmax loss;
refer to #3296 for more detail
3231a1d

@shelhamer shelhamer added a commit to shelhamer/caffe that referenced this pull request Nov 17, 2016

@shelhamer shelhamer sigmoid cross-entropy loss: normalize by one/batch size/all/etc.
sig-ce loss handles all the same normalizations as the softmax loss;
refer to #3296 for more detail
bcbf54d

@shelhamer shelhamer added a commit to shelhamer/caffe that referenced this pull request Nov 17, 2016

@shelhamer shelhamer sigmoid cross-entropy loss: normalize by one/batch size/all/etc.
sig-ce loss handles all the same normalizations as the softmax loss;
refer to #3296 for more detail

n.b. this changes the default normalization for this loss! previously
this loss normalized by batch size, but now it normalizes by the total
number of outputs/targets.
38d1033

@shelhamer shelhamer added a commit to shelhamer/caffe that referenced this pull request Nov 17, 2016

@shelhamer shelhamer sigmoid cross-entropy loss: normalize by one/batch size/all/etc.
sig-ce loss handles all the same normalizations as the softmax loss;
refer to #3296 for more detail

note: the default normalization remains batch size, but valid
normalization might be a better idea
f4d7208

@shelhamer shelhamer added a commit to shelhamer/caffe that referenced this pull request Nov 17, 2016

@shelhamer shelhamer sigmoid cross-entropy loss: normalize by one/batch size/all/etc.
sig-ce loss handles all the same normalizations as the softmax loss;
refer to #3296 for more detail

n.b. this changes the default normalization for this loss! previously
this loss normalized by batch size, but now it normalizes by the total
number of outputs/targets.
4001295

@shelhamer shelhamer added a commit to shelhamer/caffe that referenced this pull request Nov 17, 2016

@shelhamer shelhamer sigmoid cross-entropy loss: normalize loss by different schemes
sig-ce loss handles all the same normalizations as the softmax loss;
refer to #3296 for more detail.

this preserves the default normalization for sig-ce loss: batch size.
3d62e3c

@ZhangXinNan ZhangXinNan added a commit to ZhangXinNan/caffe that referenced this pull request Dec 23, 2016

@shelhamer @ZhangXinNan shelhamer + ZhangXinNan sigmoid cross-entropy loss: normalize loss by different schemes
sig-ce loss handles all the same normalizations as the softmax loss;
refer to #3296 for more detail.

this preserves the default normalization for sig-ce loss: batch size.
0413140