RFC: cross entropy changes #9

Closed
tbreloff opened this issue Feb 26, 2016 · 14 comments

@tbreloff
Member

I think this is wrong. I just did a quick back-of-the-napkin derivation, and I think the "deriv" should be dE/dy (not dE/ds):

crossentropy_deriv.pdf

code should probably be:

function value(l::CrossentropyLoss, y::Number, t::Number)
    if t == 0
        -log(1 - y)
    elseif t == 1
        -log(y)
    else
        -(t * log(y) + (1-t) * log(1-y))
    end
end

deriv(l::CrossentropyLoss, y::Number, t::Number) = (1-t) / (1-y) - t / y
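
For what it's worth, here's a quick standalone sanity check of that derivative against a central finite difference (plain functions, not the package's CrossentropyLoss type):

# value and deriv as proposed above, written as free functions
xent(y, t)       = -(t * log(y) + (1 - t) * log(1 - y))
xent_deriv(y, t) = (1 - t) / (1 - y) - t / y

let y = 0.3, t = 1.0, h = 1e-6
    # central finite difference of the value w.r.t. y
    fd = (xent(y + h, t) - xent(y - h, t)) / (2h)
    @assert isapprox(fd, xent_deriv(y, t); atol = 1e-4)
end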

Note I switched y/t in the value function (I'm assuming y is the estimated probability and t is the target value)

Note that these calculations only make sense for y in (0,1)... if y is exactly 0 or 1 then we start to get infinities/NaNs. Do we want to check for this somehow?
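
One option (just a sketch with an arbitrary ϵ, whether and where to clamp is up for discussion):

# clamp the probability away from exactly 0 and 1 before taking logs
clamp01(y, ϵ = 1e-12) = clamp(y, ϵ, 1 - ϵ)

value_clamped(y, t) = -(t * log(clamp01(y)) + (1 - t) * log(1 - clamp01(y)))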

Note that most computations actually care about the sensitivity of the error to the input to a sigmoid function. So if y = sigmoid(s), we actually care about delta = dE / ds = (dE / dy) * (dy / ds). When you work out the math of the derivative of the sigmoid times the derivative of the cross-entropy function, you end up with the simple: delta = y - t, which is what you see everywhere. The problem is that for generic libraries, the two derivatives need to be computed in different abstractions (loss vs activation).
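
To spell that out numerically (again plain functions, independent of the package types):

sigmoid(s) = 1 / (1 + exp(-s))

let s = 0.7, t = 1.0
    y     = sigmoid(s)
    dE_dy = (1 - t) / (1 - y) - t / y   # cross-entropy derivative w.r.t. y
    dy_ds = y * (1 - y)                 # sigmoid derivative w.r.t. s
    @assert isapprox(dE_dy * dy_ds, y - t)   # chain rule collapses to y - t
end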

@Evizero
Member

Evizero commented Feb 26, 2016

The notation is based on this paper and the book Support Vector Machines. There y is the target and t the prediction. I also thought it was a bit off-putting at first. I am open to the idea of changing that (but that sounds like tedious refactoring).

The way it is currently implemented is not really general: the derivative is based on the assumption that the prediction function is a sigmoid, which it usually is. Where I wanted to go was to have this simplification in the EmpiricalRisk implementation, where both the predictor and the Loss are specified, but I never got that far.

tldr: you are absolutely right

@tbreloff
Member Author

There y is the target and t the prediction. I also thought it was a bit off-putting at first. I am open to the idea of changing that (but that sounds like tedious refactoring)

If we're considering refactoring, it might be worthwhile to consider longer, more descriptive names: prediction and target? (let the bikeshedding begin) I deal with so many different notations between researchers... sometimes it's good to be explicit.

@Evizero
Member

Evizero commented Feb 26, 2016

I used to think that the variables should be close to the math, but I agree in this case. prediction and target sound good.

@Evizero
Member

Evizero commented Feb 26, 2016

How about changing Param to Coefficient? What's your opinion on that? Damn, wrong issue.

@tbreloff
Member Author

Back on names... how do you feel about input, output, and target? They are short and clear IMO, and don't have unnecessary connotation attached like x, y, t, or prediction. So the method signature:

function value(l::PredictionLoss, y::AbstractVecOrMat, t::AbstractVecOrMat)

would now become:

function value(loss::PredictionLoss, target::AbstractVecOrMat, output::AbstractVecOrMat)

I'd also prefer to use longer names like loss instead of l when possible.

Thoughts?

@Evizero
Member

Evizero commented Feb 29, 2016

👍

@Evizero
Member

Evizero commented Feb 29, 2016

note though that currently t denotes the prediction / output.

Edit: you wrote it correctly of course; I just wanted to recapitulate that since I recognise it is a somewhat confusing notation.

@tbreloff
Member Author

You have variables r and yt... I assume r means residual? What about yt?

@Evizero
Member

Evizero commented Feb 29, 2016

You are right about r. yt is literally the result of y * t. Margin-based loss functions can be "simplified" this way.
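
As an illustration (using the hinge loss as a textbook example, written as a plain function rather than however this package spells it), a margin-based loss only ever needs the single number yt:

hinge(yt) = max(0, 1 - yt)   # L(y, t) = max(0, 1 - y*t), but only yt matters

hinge( 1.0 * 2.5)   # 0.0 -> correct side of the margin, no penalty
hinge(-1.0 * 2.5)   # 3.5 -> wrong side, penalized linearly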

@tbreloff
Member Author

hmm... can you think of a good, short variable name to use in place of yt?

@Evizero
Member

Evizero commented Feb 29, 2016

There doesn't seem to be a description in the book and I can't think of a good name.

Since y is in {-1, 1}, it is basically the input, where the sign denotes whether it is on the "correct" side of the margin. Does that spark any ideas?

I have to leave right now, but I'll think of something and get back to you. Let's rename that one later (I could do that refactoring once we've found a name).

@tbreloff
Member Author

I can't think of a good name for that, but here are a few things I considered (and didn't love)... maybe it'll spark an idea for you:

  • sgn: sign of class
  • match: did the output match target?
  • 'same'

I'll leave it as-is for now.

@Evizero
Member

Evizero commented Feb 29, 2016

sign of class

That doesn't really fit because it's still a real number. It's a curious trick, really. (Note: Please don't interpret the following as me trying to "educate you" or something. I am just trying to put my understanding of it into words, which helps me structure my thoughts)

The output is basically the orthogonal distance from the margin, where positive numbers correspond to the halfspace either further away from or closer to the origin, depending on the margin's orientation and bias. So the output can be an arbitrary real number.

the "margin" is defined to be the hyperplane at x'w + b == 0 (so the bias is more or less the (sometimes negative, depending on the margin orientation) orthogonal offset of the margin from the origin) (in notation one usually sees x'w - b == 0 for some reason, but I have yet to see it implemented this way. All that would change is the sign of the bias, though).

Now the neat trick is the target vector convention, which says that the target has to be in {-1, +1}. Let's play out what happens when we multiply target * output for all the different combinations ("positive" means a positive output, "negative" a negative one):

target   output     result
  +1     positive   positive
  -1     positive   negative
  +1     negative   negative
  -1     negative   positive

So basically the result, currently called yt, denotes how much the output agrees with the true label, given the margin. This is why classification loss functions are designed to penalize increasingly negative values of yt more heavily, as the following typical image (just googled) shows. The x-axis is yt.

[image: typical margin-based loss functions plotted against yt]

So how about we call it agreement?

@tbreloff
Member Author

This was a well-explained and thoughtful argument. I like agreement. (And please don't apologize for fully explaining yourself! There's no downside.)

@Evizero closed this as completed Jun 30, 2016