
Delta and binary cross-entropy loss #1695

doobidoob opened this issue Oct 1, 2018 · 14 comments
Labels: Explanations - Explanations of the source code, algorithms or method of use

@doobidoob

@AlexeyAB Thanks for your support.
I have a few questions about yolo_layer.c

[1]. In yolo_layer.c, I understand that "delta" means the gradient.
However, as shown below, delta is expressed as a simple difference.
Isn't this the gradient of the MSE loss?

delta[index + 0*stride] = scale * (tx - x[index + 0*stride]);
delta[index + 1*stride] = scale * (ty - x[index + 1*stride]);
delta[index + 2*stride] = scale * (tw - x[index + 2*stride]);
delta[index + 3*stride] = scale * (th - x[index + 3*stride]);
l.delta[obj_index] = - l.output[obj_index];
l.delta[obj_index] = 1 - l.output[obj_index];
delta[index + stride*n] = (((n == class_id) ? 1 : 0) - output[index + stride*n]);

In the YOLOv3 paper, the author mentions the following:
"During training we use binary cross-entropy loss for the class predictions."
Why does the class delta above correspond to binary cross-entropy?

[2]. Is the following "l.cost" used for back-propagation, or is it simply a printed value?

*(l.cost) = pow(mag_array(l.delta, l.outputs * l.batch), 2);

[3]. I want to change YOLOv3 to output additional information, so I am trying to modify the loss function. In this case, should I fill the "delta" of yolo_layer.c with the gradient of the desired loss function, such as log-likelihood or binary cross-entropy?
Besides this, is there anything else to consider?
I'm sorry to ask a question not directly related to the code, but I'm a beginner and would like your advice.
Thank you very much.


AlexeyAB commented Oct 1, 2018

@doobidoob Hi,

In general, there are two types of classification:

  • multi-label classification - each bounding box (each anchor) can have several classes, and the model as a whole has >= 1 classes. Binary cross-entropy with logistic activation (sigmoid) is used. This is used in Yolo v3.

  • multi-class classification - each bounding box (each anchor) can have only one class, and the model as a whole has >= 1 classes. Categorical cross-entropy with softmax activation is used. This is used in Yolo v2.


  1. For independent outputs (x, y, w, h, t0, and multi-label classification as in Yolo v3) it is better to use binary cross-entropy, since each bounding box can predict several objects at a time: https://stats.stackexchange.com/a/288456/111998 So we use the logistic activation (sigmoid), as in the logistic-regression algorithm for binary classification: yes/no car, yes/no person, yes/no dog, ... So a single bounding box can be person (yes), car (yes), dog (no) - for example, if a single bounding box contains both a person and a car: https://www.reddit.com/r/learnmachinelearning/comments/88g8zf/difference_between_binary_cross_entropy_and/
  • for multi-label classification, binary cross-entropy is used (see the C sketch after this list):
    delta = (n == class_id) ? (1 - logistic_activation(x)) : (-logistic_activation(x));

  • for multi-class classification, categorical cross-entropy is used:
    delta = (n == class_id) ? (1 - softmax(x, x_array)) : (-softmax(x, x_array));


  2. This *(l.cost) = pow(mag_array(l.delta, l.outputs * l.batch), 2); is used only for printing the avg loss value. Since mag_array computes the L2 norm (the square root of the sum of squares), squaring it gives the sum of squared deltas - the summary loss over (x, y, w, h, t0, probabilities, ...) for all anchors and all final activations.

  3. If you want to change the loss function to get a different result during training, then you should change l.delta = ...
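
A minimal C sketch of the multi-label delta above (my own simplified version of the class-delta loop in yolo_layer.c; variable names follow the snippets quoted in this thread):

/* output[] already holds logistic_activation(x), i.e. probabilities in [0, 1]. */
/* Binary cross-entropy through the sigmoid gives delta = target - output,      */
/* where target is 1 for the ground-truth class and 0 for every other class.    */
void delta_multilabel_class(float *output, float *delta, int index,
                            int class_id, int classes, int stride)
{
    int n;
    for (n = 0; n < classes; ++n) {
        delta[index + stride*n] =
            ((n == class_id) ? 1.0f : 0.0f) - output[index + stride*n];
    }
}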

@doobidoob

@AlexeyAB Thanks for your reply!
I have a few questions about your explanation.

[1]. When using binary cross-entropy, why is "(1 - logistic_activation(x))" or "(-logistic_activation(x))" applied to the delta?
[2]. Why is 1 subtracted when n is class_id?
[3]. Why is it "(-logistic_activation(x))" when n is not class_id?
[4]. And why not use "logistic_gradient(x)"?
I know that "delta" means the gradient...
Am I misunderstanding?

I want to change the loss function, but it is not easy to apply it to the code...
Probably because I have not fully understood it.
[5]. I want to use a negative log-likelihood for an additional output besides the original YOLO outputs. What should I do with "delta"?

Thanks in advance for your advice.


AlexeyAB commented Jan 2, 2019

There is the binary cross-entropy loss = −(t*ln(y) + (1−t)*ln(1−y)) - we should minimize it.
Taking its derivative through the logistic activation, with y = sigmoid(x), gives d(loss)/dx = loss_derivative = y − t; the sigmoid's own derivative cancels out, which is also why logistic_gradient(x) does not appear explicitly here: https://peterroelants.github.io/posts/cross-entropy-logistic/

  • y - the predicted probability [0 - 1]
  • t - 1 if the class is correct, 0 otherwise

We do it here:

delta[index + stride*n] = ((n == class_id) ? 1 : 0) - output[index + stride*n];

The same:

  • t==1: i.e. if (detected_class == truth_class): delta = -loss_derivative = -(y-t) = 1-y
  • t==0: i.e. if (detected_class != truth_class): delta = -loss_derivative = -(y-t) = -y
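
As a numerical sanity check (my own sketch, not from the repo), one can verify with finite differences that the derivative of the binary cross-entropy taken through the sigmoid really is y − t:

#include <stdio.h>
#include <math.h>

static float sigmoid(float x) { return 1.0f / (1.0f + expf(-x)); }

/* BCE loss as a function of the raw (pre-activation) output x */
static float bce(float x, float t)
{
    float y = sigmoid(x);
    return -(t * logf(y) + (1.0f - t) * logf(1.0f - y));
}

int main(void)
{
    float x = 0.7f, t = 1.0f, eps = 1e-3f;
    float numeric  = (bce(x + eps, t) - bce(x - eps, t)) / (2.0f * eps);
    float analytic = sigmoid(x) - t;   /* y - t */
    printf("numeric: %f  analytic: %f\n", numeric, analytic);
    return 0;
}

Both values come out to about −0.332 for these inputs; the delta applied in the code is the negative of this, 1 − y ≈ 0.332.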



Free-form reasoning - in general, in Yolo v3:

  • we use binary cross-entropy for multi-label classification: loss = −(y*ln(p) + (1−y)*ln(1−p)), and we should minimize it

    • p - the predicted probability [0 - 1]
    • y - 1 if the class is correct, 0 otherwise
  • so we should minimize the cost: loss = −ln(p) if (y==1), or loss = −ln(1−p) if (y==0), thus

    • if (y==1) then we want −ln(p) → 0, i.e. p → 1
    • if (y==0) then we want −ln(1−p) → 0, i.e. p → 0
  • we can achieve this by maximizing p if (y==1), or maximizing 1−p if (y==0),

    • if (y==1) then we should maximize logistic_activation(x + delta), so delta > 0
    • if (y==0) then we should minimize logistic_activation(x + delta), so delta < 0
  • we do it here: https://github.com/pjreddie/darknet/blob/61c9d02ec461e30d55762ec7669d6a1d3c356fb2/src/yolo_layer.c#L120
    The same:

    • if (detected_class == truth_class): delta = 1−p > 0
    • if (detected_class != truth_class): delta = −p < 0

where p = logistic_activation(x) = output[index + stride*n]




Binary cross-entropy with logistic activation (sigmoid) is used for multi-label classification in Yolo v3, so each bounding box (each anchor) can have several classes. For example, one bounding box can be Animal, Cat, or Truck, Car. Or even Cat, Dog if they are close to each other.

So:

  1. The logistic activation (sigmoid) = 1./(1. + exp(-x)) is used because:

    For neural networks, our result states that the neuron activation function must be nonlinear - and nothing else. Whatever this nonlinearity is, the network of connections can be constructed, and the coefficients of the linear connections between the neurons can be adjusted, in such a way that the neural network computes any continuous function of its input signals with any given accuracy.

    • its derivative is very simple: in code, (1-x)*x, where x already holds the activated output (i.e. sigmoid'(z) = sigmoid(z)*(1 - sigmoid(z)))
  2. Binary classification is used - binary means that we look at each class separately and treat it as 2 classes (present or not present). So we use this formula: https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html
    loss = −(y*log(p) + (1−y)*log(1−p))

    • log - the natural log (ln)
    • y - binary indicator (0 or 1): whether class label c is the correct classification for observation o
    • p - predicted probability that observation o is of class c

So:

  • if (detected_class == truth_class): loss = −log(p)
  • if (detected_class != truth_class): loss = −log(1−p)

where p = logistic_activation(x); this is output[index + stride*n] in the yolo_layer.c source code.
And we should minimize the cost: loss = −log(p) or loss = −log(1−p).
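
A quick worked example with made-up numbers: if the true class has y = 1 and the network predicts p = 0.9, then loss = −ln(0.9) ≈ 0.105 and delta = 1 − p = 0.1, a small positive correction; if instead p = 0.1, then loss = −ln(0.1) ≈ 2.303 and delta = 0.9, a much stronger push in the same direction. The worse the prediction, the larger both the loss and the correction.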

As said in the MXNET doc: https://gluon.mxnet.io/chapter02_supervised-learning/logistic-regression-gluon.html

  • if (detected_class == truth_class) we should maximize log(p), i.e. maximize p
  • if (detected_class != truth_class) we should maximize log(1−p), i.e. maximize (1−p)


https://peterroelants.github.io/posts/cross-entropy-logistic/
https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_error_function_and_logistic_regression
https://gluon.mxnet.io/chapter02_supervised-learning/logistic-regression-gluon.html
https://ml-cheatsheet.readthedocs.io/en/latest/logistic_regression.html
https://en.wikipedia.org/wiki/Logistic_regression


AlexeyAB commented Jan 2, 2019

This is very similar to Yolo v2, with categorical cross-entropy and softmax activation for multi-class classification: https://peterroelants.github.io/posts/cross-entropy-softmax/

We do it here:

delta[index + n] = scale * (((n == class_id) ? 1 : 0) - output[index + n]);

The same:

  • t==1: i.e. if (detected_class == truth_class): delta = -loss_derivative = -(y-t) = 1-y
  • t==0: i.e. if (detected_class != truth_class): delta = -loss_derivative = -(y-t) = -y
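
A minimal C sketch of this Yolo v2-style delta (my own simplified version; the real region-layer code takes more parameters, and darknet's softmax likewise subtracts the max for numerical stability):

#include <math.h>

/* numerically stable softmax over n raw outputs */
void softmax(const float *input, int n, float *output)
{
    int i;
    float max = input[0], sum = 0.0f;
    for (i = 1; i < n; ++i) if (input[i] > max) max = input[i];
    for (i = 0; i < n; ++i) { output[i] = expf(input[i] - max); sum += output[i]; }
    for (i = 0; i < n; ++i) output[i] /= sum;
}

/* Categorical cross-entropy through softmax: delta = scale * (target - output), */
/* where output[] already holds the softmax probabilities.                       */
void delta_multiclass(const float *output, float *delta, int index,
                      int class_id, int classes, float scale)
{
    int n;
    for (n = 0; n < classes; ++n) {
        delta[index + n] = scale * (((n == class_id) ? 1.0f : 0.0f) - output[index + n]);
    }
}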


@i-chaochen

Hi @AlexeyAB

Thanks for your detailed explanation. I wonder why, for binary cross-entropy, no regularization is included in the loss function, although a smooth L1 loss is added for the bounding box.

@AlexeyAB

@i-chaochen

> although a smooth L1 loss is added for the bounding box.

What do you mean?


i-chaochen commented Nov 20, 2019

> > although a smooth L1 loss is added for the bounding box.
>
> What do you mean?

Sorry @AlexeyAB, maybe I didn't say it clearly.

What I mean is: is there any regularization, like an L1-norm or L2-norm, in the loss function for the bounding-box regression or the object classification? (The overall loss for YOLO is the sum of squares of the deltas.)

By smooth L1 loss I mean that, in the following links, SSD and Fast/Faster R-CNN are said to use it for box regression, while R-CNN and SPPnet use an L2 loss. So I wonder why regularization is not added to the classification.
https://lilianweng.github.io/lil-log/2018/12/27/object-detection-part-4.html
https://github.com/rbgirshick/py-faster-rcnn/files/764206/SmoothL1Loss.1.pdf

As a side note, I am not sure whether YOLO uses any regularization loss for the bounding-box regression?

Hope it's clear to you. Thanks

@LucWuytens

question for @AlexeyAB
You explained that in Yolov3 one anchor can detect two classes. This is indeed happening in my dataset: two labels for the same box. This may be desired behavior in many cases, but in my application I would like to visualize and keep only the label with the highest probability (without using -thresh). After all, the second label is usually incorrect and results in FPs, reducing the overall metrics. Is there some 'setting' that can accomplish this?
thanks,
Luc


i-chaochen commented Jun 9, 2020

> You explained that in Yolov3 one anchor can detect two classes. [...] Is there some 'setting' that can accomplish this?

Interesting - could you upload one such picture, showing "two labels for the same box", please?


LucWuytens commented Jun 9, 2020

@i-chaochen
I can't really upload my pictures, but you can see it for yourself in the YouTube video that AlexeyAB also shared somewhere:
https://www.youtube.com/watch?v=69Ii3HjUiTM
You can see objects with one box and two labels, for example: car, taxi.
This is actually something I would not like to see in my output - I want only the highest-probability label for each anchor box. The lower-probability alternative labels also result in false positives, potentially impacting the mAP calculation? Hence my question to @AlexeyAB

@i-chaochen

> I can't really upload my pictures, but you can see it for yourself using the youtube video: https://www.youtube.com/watch?v=69Ii3HjUiTM [...] Hence my question to @AlexeyAB

I see your point. I don't think I have ever met this kind of problem; I always have one label per box. You might have a look at how to use the API for Yolo as DLL and SO libraries:
https://github.com/AlexeyAB/darknet#how-to-use-yolo-as-dll-and-so-libraries

If you use softmax in the last cost-function layer, there will be only one class label per box, since softmax-based decoding picks the maximum class.

The threshold is for NMS, which removes redundant, overlapping bounding boxes rather than filtering class labels.


AlexeyAB commented Jun 9, 2020

@LucWuytens You can implement this in your application code - reject the detection with the lower confidence_score if the bboxes are equal.
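
A rough post-processing sketch of this suggestion (my own code, with a hypothetical detection struct - not part of darknet's API): for each box, zero out every class probability except the strongest one, so only a single label survives drawing and evaluation.

#define MAX_CLASSES 80

typedef struct {                 /* hypothetical detection record */
    float x, y, w, h;            /* box coordinates */
    float prob[MAX_CLASSES];     /* per-class probabilities */
    int classes;                 /* number of classes actually used */
} detection_t;

/* Keep only the highest-probability label for one detection. */
void keep_best_label(detection_t *d)
{
    int n, best = 0;
    for (n = 1; n < d->classes; ++n)
        if (d->prob[n] > d->prob[best]) best = n;
    for (n = 0; n < d->classes; ++n)
        if (n != best) d->prob[n] = 0.0f;
}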


arnoldfychen commented Jan 13, 2021

@AlexeyAB @i-chaochen
I totally understand the trouble that @LucWuytens ran into, as I have met the same issue.

Given multiple objects of different classes in an image, suppose one of them is detected with two classes (e.g., person, machine) by YOLOv3, i.e. one object has two labels. If you set a higher confidence-score threshold to try to filter out the lower-scoring of the two labels, you may also filter out other objects detected in the same image, so this approach degrades recall. I still haven't found a proper way to resolve this contradiction.
