Contrastive loss layer differs from loss equation #2308

Closed
imelekhov opened this issue Apr 13, 2015 · 10 comments

@imelekhov

Hi,
I am a little bit confused about the implementation of the contrastive loss function. As pointed out in LeCun's paper
http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf the loss should be: L = 0.5 * (1 - Y) * D^2 + 0.5 * Y * {max(0, margin - D)}^2 (equation 4 in the original paper). As far as I understand, the source code of the contrastive loss layer implements this loss function differently. Here is a piece of code (lines 48-52 of contrastive_loss_layer.cpp):

if (static_cast<int>(bottom[2]->cpu_data()[i])) {  // similar pairs
      loss += dist_sq_.cpu_data()[i];
    } else {  // dissimilar pairs
      loss += std::max(margin-dist_sq_.cpu_data()[i], Dtype(0.0));
    }

Is it a bug in the implementation?
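
For reference, here is a minimal Python sketch of the two per-pair costs being compared. The names and the margin value are illustrative only; `similar` follows the label convention of the snippet above (1 = similar pair):

def pair_loss_paper(d, similar, margin=1.0):
    # Eq. 4 of Hadsell et al. (2006): 0.5 * D^2 for similar pairs,
    # 0.5 * max(0, margin - D)^2 for dissimilar pairs
    if similar:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

def pair_loss_snippet(d, similar, margin=1.0):
    # What the quoted forward pass accumulates per pair (before the batch
    # averaging): D^2 for similar pairs, max(0, margin - D^2) otherwise
    d_sq = d ** 2
    if similar:
        return d_sq
    return max(0.0, margin - d_sq)

# The dissimilar-pair terms are genuinely different functions of D:
for d in (0.2, 0.5, 0.9):
    print(d, pair_loss_paper(d, False), pair_loss_snippet(d, False))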

@melgor

melgor commented Apr 14, 2015

@SlevinKelevra I have been analysing it recently too. I also thought it was a bug, but then I understood it :)

  1. LeCun's function averages separately over the similar and the dissimilar pairs. In Caffe there is only an average over all pairs: loss = loss / static_cast<Dtype>(bottom[0]->num()) / Dtype(2); . This is not exactly the same; it depends on your input data.
  2. D^2 is computed by dist_sq_.mutable_cpu_data()[i] = caffe_cpu_dot(channels,
    diff_.cpu_data() + (i*channels), diff_.cpu_data() + (i*channels));
  3. (1 - Y) * D^2 is done by loss += dist_sq_.cpu_data()[i];
    Y * {max(0, margin - D)}^2 is done by std::max(margin - dist_sq_.cpu_data()[i], Dtype(0.0));

So, except for the averaging, everything is OK.
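
For what it's worth, a compact numpy sketch of the forward pass described above; the batch size, feature dimension, and variable names are illustrative, not Caffe's:

import numpy as np

def contrastive_forward(feat_a, feat_b, labels, margin=1.0):
    # Mirror of the forward pass described above:
    # dist_sq[i] = dot(diff_i, diff_i), then D^2 for similar pairs,
    # max(0, margin - D^2) for dissimilar pairs, averaged over all pairs.
    n = feat_a.shape[0]
    diff = feat_a - feat_b                       # like diff_ in the layer
    dist_sq = np.einsum('ij,ij->i', diff, diff)  # like caffe_cpu_dot per pair
    loss = 0.0
    for i in range(n):
        if labels[i]:                            # similar pair
            loss += dist_sq[i]
        else:                                    # dissimilar pair
            loss += max(margin - dist_sq[i], 0.0)
    return loss / n / 2.0                        # single average over all pairs

# toy batch: 4 pairs of 128-D features
rng = np.random.default_rng(0)
a, b = rng.normal(size=(4, 128)), rng.normal(size=(4, 128))
print(contrastive_forward(a, b, labels=[1, 1, 0, 0]))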

@imelekhov
Author

Hm, I am not so sure... Look at your third point.
The code std::max(margin - dist_sq_.cpu_data()[i], Dtype(0.0));
certainly doesn't implement the {max(0, margin - D)}^2 equation! It implements
{max(0, margin - D^2)}, doesn't it? I might be wrong, but I think that's the main problem, not the averaging.
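
A one-number check of the difference (margin and D chosen arbitrarily):

margin, D = 1.0, 0.8
print(max(0.0, margin - D ** 2))   # implemented: max(0, 1 - 0.64) = 0.36
print(max(0.0, margin - D) ** 2)   # paper:       max(0, 1 - 0.8)^2 = 0.04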

@melgor

melgor commented Apr 14, 2015

You are right, there is a difference, good catch.
Briefly analysing it, the main impacts are:

  • you need to set a different "margin" value (see the sketch after this comment)
  • the ratio between the loss of positive and negative samples is imbalanced

I will try to implement your idea and see if there is any difference in a practical experiment.
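
A rough Python sketch of the margin point: the distance at which a dissimilar pair stops contributing to the loss differs between the two forms, so the numeric margin has to be chosen differently (values illustrative):

import numpy as np

def cutoff_distance_implemented(margin):
    # max(0, margin - D^2) becomes zero once D >= sqrt(margin)
    return np.sqrt(margin)

def cutoff_distance_paper(margin):
    # 0.5 * max(0, margin - D)^2 becomes zero once D >= margin
    return margin

for m in (0.5, 1.0, 2.0):
    print(m, cutoff_distance_implemented(m), cutoff_distance_paper(m))
# To keep the same cutoff distance D0, one would set margin = D0^2 in the
# implemented form but margin = D0 in the paper's form.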

@imelekhov
Author

OK, I will test it independently. We'll see; maybe these changes significantly affect the results. I think it would be good to point @shelhamer to this thread. He may help us :)

@melgor

melgor commented Apr 15, 2015

I have a quick implementation of it: https://gist.github.com/melgor/962800c3200efcfb78c1
I only worked on the CUDA code.
Changes:
lines 34 to 48: calculate Abs(diff) and Sum(Abs(diff)). I use abs, but we could use sqrt(diff^2)
lines 57, 58: implement {max(0, margin - D)}^2 (return 0 or (margin - D)^2)
line 102: change from dist^2 to abs(dist), to implement margin - D

As a result, I get a bigger loss value at the last step of learning, but no change in the final result. How can we measure whether this influences the result?

@SlevinKelevra Could you check it? Note that a bug in the gradient was reported in #2312. I think it is connected to the bug in the loss function.

@imelekhov
Author

Unfortunately, I haven't checked it yet. I am trying to figure out how to implement the backpropagation part in the CUDA code.
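
In case it helps, here is a small numpy sketch (reference math only, not CUDA) of the analytic gradient of the Hadsell et al. per-pair loss with respect to one feature vector, checked against a finite difference; the names and the margin value are illustrative:

import numpy as np

def pair_gradient(a, b, similar, margin=1.0, eps=1e-12):
    # Gradient w.r.t. a of the per-pair Hadsell et al. loss:
    #   similar:    d/da [0.5 * D^2]              = (a - b)
    #   dissimilar: d/da [0.5 * max(0, m - D)^2]  = -(m - D)/D * (a - b) if D < m, else 0
    # (eps guards against dividing by a near-zero distance)
    diff = a - b
    if similar:
        return diff
    dist = np.sqrt(np.dot(diff, diff))
    if dist >= margin:
        return np.zeros_like(diff)
    return -(margin - dist) / max(dist, eps) * diff

# Quick finite-difference check of one coordinate for a dissimilar pair.
rng = np.random.default_rng(1)
a, b = 0.1 * rng.normal(size=8), 0.1 * rng.normal(size=8)

def dissimilar_loss(x, margin=1.0):
    return 0.5 * max(0.0, margin - np.linalg.norm(x - b)) ** 2

h, e0 = 1e-6, np.eye(8)[0]
numeric = (dissimilar_loss(a + h * e0) - dissimilar_loss(a - h * e0)) / (2 * h)
print(pair_gradient(a, b, similar=False)[0], numeric)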

@nickcarlevaris

At first glance I thought this wasn't really an issue, but it does result in a noticeably different cost function.

It would be interesting to see if it results in better embeddings. An easy test would be with the MNIST data, using the notebook in the examples folder. It might not result in a significantly different embedding, because both cost functions encourage similar things. However, learning might be easier / faster with the original cost function from Hadsell et al., especially when you consider the gradients for non-matching pairs near dist = 0.0 and dist = margin.

import numpy as np
import matplotlib.pyplot as plt

# d and dsq were presumably defined earlier in the notebook; a reasonable
# reconstruction (Euclidean distance and its square, margin = 1.0):
d = np.linspace(0.0, 1.2, 200)
dsq = d ** 2

plt.figure(1, figsize=(8, 6))
plt.plot(d, dsq, '-g')
plt.plot(d, np.maximum(1.0 - dsq, 0.0), '.-r')
plt.plot(d, np.power(np.maximum(1.0 - np.sqrt(dsq), 0.0), 2.0), '-r')
plt.ylabel('Cost')
plt.xlabel('Euclidean Distance in Feature Space')
plt.legend(['Matching', 'Non-Matching [Implemented]', 'Non-Matching [Hadsell et al]'],
           loc=2)
plt.xlim(0, 1.2)
plt.grid()
plt.show()

[Plot: cost vs. Euclidean distance in feature space for the matching and the two non-matching terms]
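
To make the gradient remark concrete, here is a small check of the derivatives of the two non-matching terms with respect to D (margin = 1 as in the plot; the comparison is mine, not from the example):

import numpy as np

d = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

# d/dD of the implemented non-matching term max(0, 1 - D^2): -2D inside the margin
grad_implemented = np.where(d < 1.0, -2.0 * d, 0.0)

# d/dD of the Hadsell et al. term max(0, 1 - D)^2: -2(1 - D) inside the margin
grad_hadsell = np.where(d < 1.0, -2.0 * (1.0 - d), 0.0)

print(grad_implemented)  # zero gradient at D = 0, where the push-apart should be strongest
print(grad_hadsell)      # strongest gradient at D = 0, fading to zero at D = margin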

@nickcarlevaris

@SlevinKelevra and @melgor, if you guys want to double-check it and give it a try, I created a PR (#2321) which fixes this. I ran the MNIST example using both versions and didn't see a big difference, but that is just a simple problem. My inclination would be to just fix it so that it matches the Hadsell et al. paper.

Here are the learning curve and embedding using the current version.
[Images: learning curve and embedding, implemented cost]

And here are the same plots with the fixed cost function.
[Images: learning curve and embedding, Hadsell et al. cost]

@imelekhov
Author

Great. I didn't get big improvements in my project after applying the fix either. I will double-check a little bit later; maybe I missed something.
By the way, I have a little question which is not related to the main topic. Do you know a straightforward way of extracting the feature vectors of a given layer for the whole training dataset? For example, I have a training dataset (an lmdb file) with Z records (Z is quite a large number) and the dimension of the target layer is 128x1. My goal is to get a 128xZ matrix containing the feature vectors. Moreover, I can't set batch_size to Z in the TEST phase because my GPU runs out of memory. Have you ever faced this problem? I would highly appreciate your help.
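
Not a full answer, but the usual workaround is to keep a small batch_size in the TEST-phase data layer and call forward() repeatedly, concatenating the target blob. A rough pycaffe sketch; the file names, the 'feat' blob name, and Z are placeholders for your setup:

import numpy as np
import caffe

caffe.set_mode_gpu()
# Placeholder file names; the prototxt's data layer should point at the lmdb
# with a batch_size that fits in GPU memory.
net = caffe.Net('feature_extractor.prototxt', 'trained.caffemodel', caffe.TEST)

Z = 60000                 # number of records in the lmdb (placeholder)
blob_name = 'feat'        # name of the 128-D layer top (placeholder)
batch_size = net.blobs[blob_name].data.shape[0]

chunks = []
for _ in range(int(np.ceil(Z / float(batch_size)))):
    net.forward()                                    # data layer loads the next batch
    chunks.append(net.blobs[blob_name].data.copy())  # copy: the blob is reused
features = np.concatenate(chunks)[:Z].T              # drop wrap-around, shape (128, Z)
print(features.shape)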

@shelhamer
Member

Closing for fix in #2321.

@shelhamer shelhamer changed the title Contrastive loss layer implementation issue Contrastive loss layer differs from loss equation Apr 30, 2015