Contrastive loss layer differs from loss equation #2308

Closed
imelekhov opened this issue Apr 13, 2015 · 10 comments

@imelekhov

Hi,
I am a little bit confused about the implementation of the contrastive loss function. As pointed out in LeCun's paper
http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf the loss should be: L = 0.5 * (1 - Y) * D^2 + 0.5 * Y * {max(0, margin - D)}^2 (equation 4 in the original paper). As far as I understand, the source code of the contrastive loss layer implements this loss function differently. Here is a piece of code (lines 48-52 of contrastive_loss_layer.cpp):

if (static_cast<int>(bottom[2]->cpu_data()[i])) {  // similar pairs
      loss += dist_sq_.cpu_data()[i];
    } else {  // dissimilar pairs
      loss += std::max(margin-dist_sq_.cpu_data()[i], Dtype(0.0));
    }

Is it a bug in the implementation?
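
For reference, here is a minimal Python sketch of the two per-pair costs being compared. The names and the margin value are illustrative only; `similar` follows the label convention of the snippet above (1 = similar pair):

def pair_loss_paper(d, similar, margin=1.0):
    # Eq. 4 of Hadsell et al. (2006): 0.5 * D^2 for similar pairs,
    # 0.5 * max(0, margin - D)^2 for dissimilar pairs
    if similar:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

def pair_loss_snippet(d, similar, margin=1.0):
    # What the quoted forward pass accumulates per pair (before the batch
    # averaging): D^2 for similar pairs, max(0, margin - D^2) otherwise
    d_sq = d ** 2
    if similar:
        return d_sq
    return max(0.0, margin - d_sq)

# The dissimilar-pair terms are genuinely different functions of D:
for d in (0.2, 0.5, 0.9):
    print(d, pair_loss_paper(d, False), pair_loss_snippet(d, False))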

@melgor

melgor commented Apr 14, 2015

@SlevinKelevra I have been analysing it recently too. I also thought it was a bug, but then I understood it :)

  1. LeCun's function averages separately over the similar and the dissimilar pairs. In Caffe there is only an average over all pairs: loss = loss / static_cast<Dtype>(bottom[0]->num()) / Dtype(2); . This is not exactly the same; it depends on your input data.
  2. D^2 is computed by dist_sq_.mutable_cpu_data()[i] = caffe_cpu_dot(channels,
    diff_.cpu_data() + (i*channels), diff_.cpu_data() + (i*channels));
  3. (1 - Y) * D^2 is done by loss += dist_sq_.cpu_data()[i];
    Y * {max(0, margin - D)}^2 is done by std::max(margin - dist_sq_.cpu_data()[i], Dtype(0.0));

So, except for the averaging, everything is OK.
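
For what it's worth, a compact numpy sketch of the forward pass described above; the batch size, feature dimension, and variable names are illustrative, not Caffe's:

import numpy as np

def contrastive_forward(feat_a, feat_b, labels, margin=1.0):
    # Mirror of the forward pass described above:
    # dist_sq[i] = dot(diff_i, diff_i), then D^2 for similar pairs,
    # max(0, margin - D^2) for dissimilar pairs, averaged over all pairs.
    n = feat_a.shape[0]
    diff = feat_a - feat_b                       # like diff_ in the layer
    dist_sq = np.einsum('ij,ij->i', diff, diff)  # like caffe_cpu_dot per pair
    loss = 0.0
    for i in range(n):
        if labels[i]:                            # similar pair
            loss += dist_sq[i]
        else:                                    # dissimilar pair
            loss += max(margin - dist_sq[i], 0.0)
    return loss / n / 2.0                        # single average over all pairs

# toy batch: 4 pairs of 128-D features
rng = np.random.default_rng(0)
a, b = rng.normal(size=(4, 128)), rng.normal(size=(4, 128))
print(contrastive_forward(a, b, labels=[1, 1, 0, 0]))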

@imelekhov
Author

Hm, I am not so sure... Look at your third point.
The code std::max(margin - dist_sq_.cpu_data()[i], Dtype(0.0));
certainly doesn't implement the {max(0, margin - D)}^2 equation! It implements
{max(0, margin - D^2)}, doesn't it? I might be wrong, but I think that's the main problem, not the averaging.
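
A one-number check of the difference (margin and D chosen arbitrarily):

margin, D = 1.0, 0.8
print(max(0.0, margin - D ** 2))   # implemented: max(0, 1 - 0.64) = 0.36
print(max(0.0, margin - D) ** 2)   # paper:       max(0, 1 - 0.8)^2 = 0.04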

@melgor

melgor commented Apr 14, 2015

You are right, there is a difference, good catch.
Briefly analysing it, the main impacts are:

  • you need to set a different "margin" value (see the sketch after this comment)
  • the ratio between the loss of positive and negative samples is imbalanced

I will try to implement your idea and see if there is any difference in a practical experiment.
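
A rough Python sketch of the margin point: the distance at which a dissimilar pair stops contributing to the loss differs between the two forms, so the numeric margin has to be chosen differently (values illustrative):

import numpy as np

def cutoff_distance_implemented(margin):
    # max(0, margin - D^2) becomes zero once D >= sqrt(margin)
    return np.sqrt(margin)

def cutoff_distance_paper(margin):
    # 0.5 * max(0, margin - D)^2 becomes zero once D >= margin
    return margin

for m in (0.5, 1.0, 2.0):
    print(m, cutoff_distance_implemented(m), cutoff_distance_paper(m))
# To keep the same cutoff distance D0, one would set margin = D0^2 in the
# implemented form but margin = D0 in the paper's form.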

@imelekhov
Author

OK, I will test it independently. We'll see; maybe these changes significantly affect the results. I think it would be good to point @shelhamer to this thread. He may help us :)

@melgor

melgor commented Apr 15, 2015

I have a quick implementation of it: https://gist.github.com/melgor/962800c3200efcfb78c1
I only worked on the CUDA code.
Changes:
lines 34 to 48: calculate Abs(diff) and Sum(Abs(diff)). I use abs, but we could use sqrt(diff^2)
lines 57, 58: implement {max(0, margin - D)}^2 (return 0 or (margin - D)^2)
line 102: change from dist^2 to abs(dist), to implement margin - D

As a result, I get a bigger loss value at the last step of learning, but no change in the final result. How can we measure whether this influences the result?

@SlevinKelevra Could you check it? Note that a bug in the gradient was reported in #2312. I think it is connected to the bug in the loss function.

@imelekhov
Author

Unfortunately, I haven't checked it yet. I am trying to figure out how to implement the backpropagation part in the CUDA code.
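
In case it helps, here is a small numpy sketch (reference math only, not CUDA) of the analytic gradient of the Hadsell et al. per-pair loss with respect to one feature vector, checked against a finite difference; the names and the margin value are illustrative:

import numpy as np

def pair_gradient(a, b, similar, margin=1.0, eps=1e-12):
    # Gradient w.r.t. a of the per-pair Hadsell et al. loss:
    #   similar:    d/da [0.5 * D^2]              = (a - b)
    #   dissimilar: d/da [0.5 * max(0, m - D)^2]  = -(m - D)/D * (a - b) if D < m, else 0
    # (eps guards against dividing by a near-zero distance)
    diff = a - b
    if similar:
        return diff
    dist = np.sqrt(np.dot(diff, diff))
    if dist >= margin:
        return np.zeros_like(diff)
    return -(margin - dist) / max(dist, eps) * diff

# Quick finite-difference check of one coordinate for a dissimilar pair.
rng = np.random.default_rng(1)
a, b = 0.1 * rng.normal(size=8), 0.1 * rng.normal(size=8)

def dissimilar_loss(x, margin=1.0):
    return 0.5 * max(0.0, margin - np.linalg.norm(x - b)) ** 2

h, e0 = 1e-6, np.eye(8)[0]
numeric = (dissimilar_loss(a + h * e0) - dissimilar_loss(a - h * e0)) / (2 * h)
print(pair_gradient(a, b, similar=False)[0], numeric)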

@nickcarlevaris

At first glance I thought this wasn't really an issue, but it does result in a noticeably different cost function.

It would be interesting to see if it results in better embeddings. An easy test would be with the MNIST data, using the notebook in the examples folder. It might not result in a significantly different embedding, because both cost functions encourage similar things. However, learning might be easier / faster with the original cost function from Hadsell et al., especially when you consider the gradients for non-matching pairs near dist = 0.0 and dist = margin.

import numpy as np
import matplotlib.pyplot as plt

# d and dsq were presumably defined earlier in the notebook; a reasonable
# reconstruction (Euclidean distance and its square, margin = 1.0):
d = np.linspace(0.0, 1.2, 200)
dsq = d ** 2

plt.figure(1, figsize=(8, 6))
plt.plot(d, dsq, '-g')
plt.plot(d, np.maximum(1.0 - dsq, 0.0), '.-r')
plt.plot(d, np.power(np.maximum(1.0 - np.sqrt(dsq), 0.0), 2.0), '-r')
plt.ylabel('Cost')
plt.xlabel('Euclidean Distance in Feature Space')
plt.legend(['Matching', 'Non-Matching [Implemented]', 'Non-Matching [Hadsell et al]'],
           loc=2)
plt.xlim(0, 1.2)
plt.grid()
plt.show()

[Plot: cost vs. Euclidean distance in feature space for the matching and the two non-matching terms]
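
To make the gradient remark concrete, here is a small check of the derivatives of the two non-matching terms with respect to D (margin = 1 as in the plot; the comparison is mine, not from the example):

import numpy as np

d = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

# d/dD of the implemented non-matching term max(0, 1 - D^2): -2D inside the margin
grad_implemented = np.where(d < 1.0, -2.0 * d, 0.0)

# d/dD of the Hadsell et al. term max(0, 1 - D)^2: -2(1 - D) inside the margin
grad_hadsell = np.where(d < 1.0, -2.0 * (1.0 - d), 0.0)

print(grad_implemented)  # zero gradient at D = 0, where the push-apart should be strongest
print(grad_hadsell)      # strongest gradient at D = 0, fading to zero at D = margin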

@nickcarlevaris

@SlevinKelevra and @melgor, if you guys want to double-check it and give it a try, I created a PR (#2321) which fixes this. I ran the MNIST example using both versions and didn't see a big difference, but that is just a simple problem. My inclination would be to just fix it so that it matches the Hadsell et al. paper.

Here are the learning curve and embedding using the current version.
[Images: learning curve and embedding, implemented cost]

And here are the same plots with the fixed cost function.
[Images: learning curve and embedding, Hadsell et al. cost]

@imelekhov
Author

Great. I didn't get big improvements in my project after applying the fix either. I will double-check a little bit later; maybe I missed something.
By the way, I have a little question which is not related to the main topic. Do you know a straightforward way of extracting the feature vectors of a given layer for the whole training dataset? For example, I have a training dataset (an lmdb file) with Z records (Z is quite a large number) and the dimension of the target layer is 128x1. My goal is to get a 128xZ matrix containing the feature vectors. Moreover, I can't set batch_size to Z in the TEST phase because my GPU runs out of memory. Have you ever faced this problem? I would highly appreciate your help.
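
Not a full answer, but the usual workaround is to keep a small batch_size in the TEST-phase data layer and call forward() repeatedly, concatenating the target blob. A rough pycaffe sketch; the file names, the 'feat' blob name, and Z are placeholders for your setup:

import numpy as np
import caffe

caffe.set_mode_gpu()
# Placeholder file names; the prototxt's data layer should point at the lmdb
# with a batch_size that fits in GPU memory.
net = caffe.Net('feature_extractor.prototxt', 'trained.caffemodel', caffe.TEST)

Z = 60000                 # number of records in the lmdb (placeholder)
blob_name = 'feat'        # name of the 128-D layer top (placeholder)
batch_size = net.blobs[blob_name].data.shape[0]

chunks = []
for _ in range(int(np.ceil(Z / float(batch_size)))):
    net.forward()                                    # data layer loads the next batch
    chunks.append(net.blobs[blob_name].data.copy())  # copy: the blob is reused
features = np.concatenate(chunks)[:Z].T              # drop wrap-around, shape (128, Z)
print(features.shape)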

@shelhamer
Member

Closing for fix in #2321.

@shelhamer shelhamer changed the title Contrastive loss layer implementation issue Contrastive loss layer differs from loss equation Apr 30, 2015