
In my dataset the ASL loss is very large, but other loss functions behave normally #22

Closed
ghost opened this issue Dec 1, 2020 · 13 comments


@ghost

ghost commented Dec 1, 2020

Hello, thank you very much to you and your team for your contribution in this area. I intend to apply this loss function to my multi-label image classification model (labels only, no bounding boxes):

loss_function = AsymmetricLoss()
logits = net(images.to(device))
loss = loss_function(logits, labels.to(device))

I haven't changed your ASL loss function at all. At first the loss was 156, and eventually it dropped to 4, with ACC = 0. What is going on? Why does the loss start above 100, still sit around 4 after training, and the accuracy stay at zero? When I use BCE loss, everything is perfectly normal.

train loss: 100%[->] 4.9414
[epoch 1] train_loss: 21.409 test_accuracy: 0.000
train loss: 100%[->] 5.7753

@mrT23
Contributor

mrT23 commented Dec 1, 2020

Our default params for ASL are intended for highly imbalanced multi-label datasets.

I suggest you try the variants of ASL gradually, and make sure the results are logical and consistent (a combined sketch of these steps follows below):

(1) Start with simple BCE, and make sure you reproduce your BCEloss results:
loss_function = AsymmetricLoss(gamma_neg=0, gamma_pos=0, clip=0)

(2) Then try simple focal loss:
loss_function = AsymmetricLoss(gamma_neg=2, gamma_pos=2, clip=0)

(3) Now try ASL:
loss_function = AsymmetricLoss(gamma_neg=2, gamma_pos=1, clip=0)
loss_function = AsymmetricLoss(gamma_neg=4, gamma_pos=1, clip=0.05)

(4) Also try the 'disable_torch_grad_focal_loss' mode; it can stabilize results:
loss_function = AsymmetricLoss(gamma_neg=4, gamma_pos=1, clip=0.05, disable_torch_grad_focal_loss=True)
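For reference, that progression could look roughly like this. This is only a sketch: `net`, `optimizer`, `train_loader`, and `device` are placeholders for your own model, optimizer, data loader, and device, and `losses.py` is the file from this repo.

```python
from losses import AsymmetricLoss  # losses.py from this repo

# Placeholders: `net`, `optimizer`, `train_loader`, `device` are your own
# model, optimizer, data loader and device.
variants = {
    "(1) plain BCE":  AsymmetricLoss(gamma_neg=0, gamma_pos=0, clip=0),
    "(2) focal loss": AsymmetricLoss(gamma_neg=2, gamma_pos=2, clip=0),
    "(3) mild ASL":   AsymmetricLoss(gamma_neg=2, gamma_pos=1, clip=0),
    "(3) full ASL":   AsymmetricLoss(gamma_neg=4, gamma_pos=1, clip=0.05),
    "(4) full ASL, no-grad focal weighting":
        AsymmetricLoss(gamma_neg=4, gamma_pos=1, clip=0.05,
                       disable_torch_grad_focal_loss=True),
}

for name, loss_function in variants.items():
    # (in a real experiment, re-initialize net/optimizer for each variant)
    for images, labels in train_loader:
        optimizer.zero_grad()
        logits = net(images.to(device))          # raw logits, no sigmoid
        loss = loss_function(logits, labels.to(device))
        loss.backward()
        optimizer.step()
    print(name, loss.item())  # sanity-check each step against your BCE baseline
```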

@ghost
Author

ghost commented Dec 2, 2020

Hello, thank you for your reply.

I tested with a simple example and found that I cannot reproduce the BCEloss result. What is the problem?
from losses import AsymmetricLossOptimized, AsymmetricLoss

import torch
import numpy as np

# raw logits for 3 samples x 3 labels
pred = np.array([[-0.4089, -1.2471,  0.5907],
                 [-0.4897, -0.8267, -0.7349],
                 [ 0.5241, -0.1246, -0.4751]])
label = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [1, 0, 1]])

pred = torch.from_numpy(pred).float()
label = torch.from_numpy(label).float()

criterion1 = torch.nn.BCEWithLogitsLoss()
loss1 = criterion1(pred, label)
print(loss1)

criterion2 = AsymmetricLoss(gamma_neg=0, gamma_pos=0, clip=0, disable_torch_grad_focal_loss=True)
loss2 = criterion2(pred, label)
print(loss2)

criterion3 = AsymmetricLossOptimized(gamma_neg=0, gamma_pos=0, clip=0)
loss3 = criterion3(pred, label)
print(loss3)

Output:

tensor(0.7193)
tensor(6.4739)
tensor(6.4739)

@mrT23
Contributor

mrT23 commented Dec 2, 2020

ASL performs sigmoid.

BCEWithLogitsLoss does not perform sigmoid.

@ghost
Author

ghost commented Dec 4, 2020

Thank you sincerely for your help; I have solved this problem. In addition, I would like to ask about my multi-label image task: there are nine kinds of tags in total, each picture may have one, two, three, or four kinds of tags, and there is no dependency between the tags. Is this the kind of imbalance described in your paper? Can I use your loss function for this task?

@ghost
Author

ghost commented Dec 7, 2020

Sincerely, thank you for taking time out of your busy work to answer this question; I am a deep-learning beginner. Your article says: "In typical multi label datasets, each picture contains only a few positive labels, and many negative ones." In my multi-label classification dataset there are ten kinds of tags in total, and each picture may have one, two, three, or four kinds of tags. Is this not too extreme a case? Does it still belong to the situation mentioned in your article, and can I use ASL?

@mrT23
Contributor

mrT23 commented Dec 7, 2020

I am not sure. My best advice would be "try and see".

The datasets that we used in the article are probably larger than yours. However, the loss function is one of the critical components in deep learning, and you would do well to try to find the best one for your problem.

This is an integral part of how experienced deep-learning practitioners reach top results - they test many things and look for the "big money". A proper loss can be one of those things, although your specific problem might indeed not be the best candidate for ASL.

@ghost
Author

ghost commented Dec 7, 2020

OK, thank you for your help

@mrT23 mrT23 closed this as completed Dec 9, 2020
@davidas1

ASL performs sigmoid.

BCEWithLogitsLoss does not perform sigmoid.

Thought it would be good to clarify something, as this issue is linked in the repo's README - both loss functions mentioned above perform Sigmoid internally. The difference between the results is due to different reduction - BCEWithLogitsLoss does mean reduction by default, while ASL always returns the sum.
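A quick way to check this (a sketch only, reusing the 3x3 example from the earlier comment and the repo's `losses.py`):

```python
import torch
from losses import AsymmetricLoss

# same logits and targets as in the earlier comment
pred = torch.tensor([[-0.4089, -1.2471,  0.5907],
                     [-0.4897, -0.8267, -0.7349],
                     [ 0.5241, -0.1246, -0.4751]])
label = torch.tensor([[0., 1., 1.],
                      [0., 0., 1.],
                      [1., 0., 1.]])

asl = AsymmetricLoss(gamma_neg=0, gamma_pos=0, clip=0, disable_torch_grad_focal_loss=True)
bce_sum = torch.nn.BCEWithLogitsLoss(reduction='sum')
bce_mean = torch.nn.BCEWithLogitsLoss()  # default reduction: mean

print(asl(pred, label))       # ~6.4739  (sum over all 9 elements)
print(bce_sum(pred, label))   # ~6.4739  (matches ASL with gammas = 0, clip = 0)
print(bce_mean(pred, label))  # ~0.7193  (= 6.4739 / 9)
```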

@mrT23 - Do you have any intuition about why you sum the loss instead of averaging? This makes the loss (and other hyperparameters such as the learning rate) dependent on the batch size and the number of classes.
I'm trying ASL on a multi-task multi-label problem (training multiple heads, each with its own loss), and thinking about the best way to reduce the losses from the different heads.

@mrT23 mrT23 reopened this Feb 24, 2021
@mrT23
Contributor

mrT23 commented Feb 24, 2021

@davidas1
I was bothered by this question for about a year (for other losses as well), until I realized the following: with the Adam optimizer, it does not matter whether we sum or average!

You can see this just by looking at the Adam update rule:
[image: Adam update rule]
https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c

If you still ponder about it, I can explain further.
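Spelled out (a sketch of the argument, ignoring the small epsilon term), the Adam update is

$$\theta_{t+1} = \theta_t - \eta \, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$$

Replacing a mean reduction with a sum scales the loss, and hence every gradient, by a constant $c$ (here $c = \text{batch size} \times \text{number of classes}$). Then $\hat m_t \to c\,\hat m_t$ and $\hat v_t \to c^2\,\hat v_t$, so the ratio $\hat m_t / \sqrt{\hat v_t}$, and therefore the parameter update, is unchanged (up to $\epsilon$). Note this invariance is specific to Adam-style optimizers; with plain SGD the two reductions are not equivalent and the learning rate would have to be rescaled.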

@mrT23 mrT23 closed this as completed Feb 28, 2021
@mrT23
Contributor

mrT23 commented Mar 8, 2021 via email

@Chen-Song

@davidas1 Hi, I see you said "I'm trying ASL on a multi-task multi-label problem (training multiple heads, each with its own loss), and thinking about what is the best way to reduce the losses from the different heads". Is this strategy effective for multi-task multi-label problems?

@csEylLee

csEylLee commented Jun 17, 2023

ASL performs sigmoid.

BCEWithLogitsLoss does not perform sigmoid.

I think you were wrong: BCEWithLogitsLoss also performs sigmoid. When I set reduction='sum', the output loss of BCEWithLogitsLoss is equal to ASL.

@YUNIyx

YUNIyx commented Feb 23, 2024

@mrT23 I have tried gamma_neg=2, gamma_pos=1 and gamma_neg=4, gamma_pos=1. The latter is better, but it is still not as good as the cross-entropy loss function.
If I change it to gamma_neg=5 and gamma_pos=1, will it give a better result?
