RFC: cross entropy changes #9

Closed
tbreloff opened this issue Feb 26, 2016 · 14 comments

@tbreloff
Member

I think this is wrong. I just did a quick back-of-the-napkin derivation, and I think the "deriv" should be dE/dy (not dE/ds):

crossentropy_deriv.pdf

code should probably be:

function value(l::CrossentropyLoss, y::Number, t::Number)
    if t == 0
        -log(1 - y)
    elseif t == 1
        -log(y)
    else
        -(t * log(y) + (1-t) * log(1-y))
    end
end

deriv(l::CrossentropyLoss, y::Number, t::Number) = (1-t) / (1-y) - t / y
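
For what it's worth, here's a quick standalone sanity check of that derivative against a central finite difference (plain functions, not the package's CrossentropyLoss type):

# value and deriv as proposed above, written as free functions
xent(y, t)       = -(t * log(y) + (1 - t) * log(1 - y))
xent_deriv(y, t) = (1 - t) / (1 - y) - t / y

let y = 0.3, t = 1.0, h = 1e-6
    # central finite difference of the value w.r.t. y
    fd = (xent(y + h, t) - xent(y - h, t)) / (2h)
    @assert isapprox(fd, xent_deriv(y, t); atol = 1e-4)
end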

Note I switched y/t in the value function (I'm assuming y is the estimated probability and t is the target value)

Note that these calculations only make sense for y in (0,1)... if y is exactly 0 or 1 then we start to get infinities/NaNs. Do we want to check for this somehow?
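
One option (just a sketch with an arbitrary ϵ, whether and where to clamp is up for discussion):

# clamp the probability away from exactly 0 and 1 before taking logs
clamp01(y, ϵ = 1e-12) = clamp(y, ϵ, 1 - ϵ)

value_clamped(y, t) = -(t * log(clamp01(y)) + (1 - t) * log(1 - clamp01(y)))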

Note that most computations actually care about the sensitivity of the error to the input to a sigmoid function. So if y = sigmoid(s), we actually care about delta = dE / ds = (dE / dy) * (dy / ds). When you work out the math of the derivative of the sigmoid times the derivative of the cross-entropy function, you end up with the simple: delta = y - t, which is what you see everywhere. The problem is that for generic libraries, the two derivatives need to be computed in different abstractions (loss vs activation).
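
To spell that out numerically (again plain functions, independent of the package types):

sigmoid(s) = 1 / (1 + exp(-s))

let s = 0.7, t = 1.0
    y     = sigmoid(s)
    dE_dy = (1 - t) / (1 - y) - t / y   # cross-entropy derivative w.r.t. y
    dy_ds = y * (1 - y)                 # sigmoid derivative w.r.t. s
    @assert isapprox(dE_dy * dy_ds, y - t)   # chain rule collapses to y - t
end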

@Evizero
Member

Evizero commented Feb 26, 2016

The notation is based on this paper and the book Support Vector Machines. There y is the target and t the prediction. I also thought it was a bit off-putting at first. I am open to the idea of changing that (but that sounds like tedious refactoring).

The way it is currently implemented is not really general: the derivative is based on the assumption that the prediction function is a sigmoid, which it usually is. Where I wanted to go was to have this simplification in the EmpiricalRisk implementation, where both the predictor and the Loss are specified, but I never got that far.

tldr: you are absolutely right

@tbreloff
Member Author

There y is the target and t the prediction. I also thought it was a bit off-putting at first. I am open to the idea of changing that (but that sounds like tedious refactoring)

If we're considering refactoring, it might be worthwhile to consider longer, more descriptive names: prediction and target? (let the bikeshedding begin) I deal with so many different notations between researchers... sometimes it's good to be explicit.

@Evizero
Member

Evizero commented Feb 26, 2016

I used to think that the variables should be close to the math, but I agree in this case. prediction and target sound good.

@Evizero
Member

Evizero commented Feb 26, 2016

How about changing Param to Coefficient? What's your opinion on that? Damn, wrong issue.

@tbreloff
Member Author

Back on names... how do you feel about input, output, and target? They are short and clear IMO, and don't have unnecessary connotation attached like x, y, t, or prediction. So the method signature:

function value(l::PredictionLoss, y::AbstractVecOrMat, t::AbstractVecOrMat)

would now become:

function value(loss::PredictionLoss, target::AbstractVecOrMat, output::AbstractVecOrMat)

I'd also prefer to use longer names like loss instead of l when possible.

Thoughts?

@Evizero
Member

Evizero commented Feb 29, 2016

👍

@Evizero
Member

Evizero commented Feb 29, 2016

note though that currently t denotes the prediction / output.

Edit: you wrote it correctly of course; I just wanted to recapitulate that since I recognise it is a somewhat confusing notation.

@tbreloff
Member Author

You have variables r and yt... I assume r means residual? What about yt?

@Evizero
Member

Evizero commented Feb 29, 2016

You are right about r. yt is literally the result of y * t. Margin-based loss functions can be "simplified" this way.
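
As an illustration (using the hinge loss as a textbook example, written as a plain function rather than however this package spells it), a margin-based loss only ever needs the single number yt:

hinge(yt) = max(0, 1 - yt)   # L(y, t) = max(0, 1 - y*t), but only yt matters

hinge( 1.0 * 2.5)   # 0.0 -> correct side of the margin, no penalty
hinge(-1.0 * 2.5)   # 3.5 -> wrong side, penalized linearly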

@tbreloff
Member Author

hmm... can you think of a good, short variable name to use in place of yt?

@Evizero
Member

Evizero commented Feb 29, 2016

There doesn't seem to be a description in the book and I can't think of a good name.

Since y is in {-1, 1}, it is basically the input, where the sign denotes whether it is on the "correct" side of the margin. Does that spark any ideas?

I have to leave right now, but I'll think of something and get back to you. Let's rename that one later (I could do that refactoring once we've found a name).

@tbreloff
Member Author

I can't think of a good name for that, but here are a few things I considered (and didn't love)... maybe it'll spark an idea for you:

  • sgn: sign of class
  • match: did the output match target?
  • 'same'

I'll leave it as-is for now.

@Evizero
Member

Evizero commented Feb 29, 2016

sign of class

That doesn't really fit because it's still a real number. It's a curious trick, really. (Note: Please don't interpret the following as me trying to "educate you" or something. I am just trying to put my understanding of it into words, which helps me structure my thoughts)

The output is basically the orthogonal distance from the margin, where positive numbers correspond to the halfspace either further away from or closer to the origin, depending on the margin's orientation and bias. So the output can be an arbitrary real number.

the "margin" is defined to be the hyperplane at x'w + b == 0 (so the bias is more or less the (sometimes negative, depending on the margin orientation) orthogonal offset of the margin from the origin) (in notation one usually sees x'w - b == 0 for some reason, but I have yet to see it implemented this way. All that would change is the sign of the bias, though).

Now the neat trick is the target vector convention, which says that the target has to be in {-1, +1}. Let's play out what happens when we multiply target * output for all the different combinations ("positive" means a positive output, "negative" a negative one):

target   output     result
  +1     positive   positive
  -1     positive   negative
  +1     negative   negative
  -1     negative   positive

So basically the result, currently called yt, denotes how much the output agrees with the true label, given the margin. This is why classification loss functions are designed to penalize increasingly negative values of yt more heavily, as the following typical image (just googled) shows. The x-axis is yt.

[image: typical margin-based loss functions plotted against yt]

So how about we call it agreement?

@tbreloff
Member Author

This was a well-explained and thoughtful argument. I like agreement. (And please don't apologize for fully explaining yourself! There's no downside.)

@Evizero closed this as completed Jun 30, 2016