
Added NewClassesCrossEntropy criterion and automatic criterion plugin #1514

Merged (6 commits, Oct 20, 2023)

Conversation

@AlbinSou (Collaborator) commented Oct 9, 2023

I added a criterion that is very important when considering exemplar-free scenarios: it only considers the learning of the new task. It is like a multitask learning criterion, except that this time we don't have task labels, so the model outputs the full logits; when not using exemplars, we want to update only the last head with the cross-entropy loss, in order to avoid task-recency bias.
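To make the idea concrete, here is a minimal sketch of such a criterion (the function name, the current_classes argument and the target remapping are only illustrative, not the exact code in this PR):

import torch
import torch.nn.functional as F

def new_classes_cross_entropy(logits, targets, current_classes):
    # current_classes: LongTensor with the class ids of the current experience.
    # Only those logit columns are kept, so only the newest head gets a gradient
    # and task-recency bias is avoided. Targets are assumed to contain only
    # classes of the current experience.
    masked_logits = logits[:, current_classes]
    # remap the original class ids to positions inside the masked logits
    remapped = torch.zeros_like(targets)
    for i, c in enumerate(current_classes.tolist()):
        remapped[targets == c] = i
    return F.cross_entropy(masked_logits, remapped)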

I personally needed this criterion to use LwF in the class-incremental learning scenario, but I think it is generally useful whenever functional regularization is used in CIL in the exemplar-free setting.

Another change I introduced is more of a personal choice, and I need your input on it. I noticed that many criteria were implemented as plugins because at some point they need to access some information about the training (e.g., save a model, register the current classes, etc.).

To reduce the burden on the user, I now automatically add criteria that are also plugins to the list of plugins, first checking that they are not already there so the same plugin is not added twice. Let me know if you think this is a good idea; otherwise I can get rid of it.
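For reference, the automatic registration boils down to something like this (a sketch only; the helper name and the exact base-class check are assumptions, not the code in this PR):

from avalanche.core import SupervisedPlugin

def add_criterion_as_plugin(criterion, plugins):
    # if the criterion also implements the plugin callbacks, register it,
    # unless the very same object is already in the plugin list
    if isinstance(criterion, SupervisedPlugin) and criterion not in plugins:
        plugins.append(criterion)
    return plugins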

Another thing I noticed is that the training criterion is also used at evaluation. I don't know whether it would be interesting to have both a training criterion and an evaluation criterion, or if that is too much. For instance, when doing exemplar-free learning you want to learn only on the new task, but you still want the full cross-entropy as the criterion when evaluating on the previous experiences. The current behavior is that it gives 0 loss on all previous experiences and some loss value only on the new experience.

@AntonioCarta (Collaborator) commented:
I agree with the first two points.

Can I suggest a different solution for masking? Unit masking is quite important for most losses in CL, so it should be a generic argument, like reduction='mean' in the PyTorch losses. We could have masking={'curr','old','seen','all'}, which defaults to 'all'.
It's ok if this PR only supports cross-entropy with 'curr' and 'all', but I would like something that we can easily adapt to other methods. Otherwise we will need a different object for each combination of loss and masking method.
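Something like this, roughly (a rough sketch only; the unit bookkeeping and attribute names are illustrative, and targets are assumed to already index into the masked columns):

import torch
import torch.nn.functional as F

class MaskedCrossEntropy:
    def __init__(self, masking="all"):
        assert masking in {"curr", "old", "seen", "all"}
        self.masking = masking
        # index tensors updated by the strategy callbacks (hypothetical)
        self.curr_units = None
        self.old_units = None

    def __call__(self, logits, targets):
        if self.masking == "all":
            # no masking: every unit stays active
            return F.cross_entropy(logits, targets)
        if self.masking == "curr":
            mask = self.curr_units
        elif self.masking == "old":
            mask = self.old_units
        else:  # "seen" = old + current units
            mask = torch.cat([self.old_units, self.curr_units])
        return F.cross_entropy(logits[:, mask], targets)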

> ... could be interesting to use a training criterion and evaluation criterion

Yes, I think this is a common pain point. I'm not sure the proposal really solves the issue. For example, here are some subtle issues:

  • the training loss is the criterion plus any additional regularization terms summed in by the plugins/strategy. Even if we had an explicit training_criterion, it may not represent everything we use for the training loss.
  • even if we have an explicit train_criterion and eval_criterion, we may want to evaluate multiple criteria during eval.

IMO this is an evaluation issue that should be fixed by simplifying the metrics/evaluation system. There is not much that we can do to improve on the strategy side.

@AlbinSou (Collaborator, Author) commented:
> Can I suggest a different solution for masking? Unit masking is quite important for most losses in CL, so it should be a generic argument [...] We could have masking={'curr','old','seen','all'}, which defaults to 'all'.

I don't really get what you mean by the masking. In classical cross-entropy there is no masking of the units, and most strategies use this criterion. The masking is usually handled in the IncrementalClassifier. Could you clarify this point? Or do you want me to create a more general criterion (rather than the new-classes one) that masks some units depending on the given parameter? In the case of this criterion I guess the masking type would be "old", but I don't see any case where you would want to mask all units, for instance.

@AntonioCarta (Collaborator) commented Oct 10, 2023

> I don't really get what you mean by the masking. [...] Could you clarify this point?

What you are doing here is basically this:

mask = new_units
masked_output = output[:, mask]
loss = cross_entropy(masked_output, target)

If you want to use all the units it's the same code:

mask = all_units  # there is no masking here. All units are active
masked_output = output[:, mask]
loss = cross_entropy(masked_output, target)

and this masking is independent of the loss. Let's say that you want to compute the KL divergence on the old units (useful for LwF):

mask = old_units
masked_output = output[:, mask]
loss = kl_div(masked_output, target)

IncrementalClassifier is different: there, you only remove unseen units, but when you compute the loss you may want to mask more units.

@coveralls commented Oct 10, 2023

Pull Request Test Coverage Report for Build 6469068428

  • 51 of 59 (86.44%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.03%) to 72.531%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| avalanche/training/templates/base_sgd.py | 1 | 2 | 50.0% |
| avalanche/training/regularization.py | 3 | 6 | 50.0% |
| avalanche/training/losses.py | 30 | 34 | 88.24% |

Totals (Coverage Status):
  • Change from base Build 6299610004: 0.03%
  • Covered Lines: 17612
  • Relevant Lines: 24282

💛 - Coveralls


@property
def current_mask(self):
    if self.mask == "all":

Collaborator:
this is incorrect. You are selecting only "seen" units, which excludes future units (e.g. DER regularizes them).

Collaborator:
I think it's better to have both "seen" and "all" options (not necessarily for all losses).
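A toy illustration of the suggested semantics (the class and attribute names below are made up for the example, not the code under review):

import torch

class _MaskDemo:
    def __init__(self, mask, n_total_units, seen_units, new_units):
        self.mask = mask
        self.n_total_units = n_total_units
        self.seen_units = seen_units
        self.new_units = new_units

    @property
    def current_mask(self):
        if self.mask == "all":
            # every unit, including future/unseen ones (e.g. DER regularizes them)
            return torch.arange(self.n_total_units)
        if self.mask == "seen":
            # only units observed in past and current experiences
            return torch.tensor(sorted(self.seen_units))
        if self.mask == "new":
            # only the units of the current experience
            return torch.tensor(sorted(self.new_units))
        raise ValueError(self.mask)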

@@ -9,12 +9,17 @@
 from avalanche.models import MultiTaskModule, avalanche_forward


-def cross_entropy_with_oh_targets(outputs, targets, eps=1e-5):
+def cross_entropy_with_oh_targets(outputs, targets, eps=1e-5, reduction="mean"):

Collaborator:
Can you use the PyTorch cross-entropy? It is more numerically stable than this implementation.

Collaborator (Author):
The problem is that I need to give one-hot targets; this makes the masking much easier and more natural to implement. The cross-entropy from PyTorch does not allow that, I believe.

@AlbinSou (Collaborator, Author) commented:
@AntonioCarta I'm not sure about changing it to MaskedKLDiv, because for now I only give the option to pass integer targets, not soft targets, so I think it's more like a cross-entropy. It could be made a masked KL divergence, but then the criterion would be more complex to use, since the user would have to create the one-hot targets themselves. What I can do, however, is still use F.kl_div under the hood for numerical stability, while keeping the name MaskedCrossEntropy and the constraint that you should give integer targets, just like for cross-entropy.

@AntonioCarta (Collaborator) commented Oct 11, 2023

It's fine, but in this case you should implement a numerically stable softmax: https://stackoverflow.com/questions/42599498/numerically-stable-softmax

You can't call it CE and use KLDiv because they compute different values.
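For reference, a numerically stable one-hot cross-entropy can be written with log_softmax, which applies the log-sum-exp trick internally, so no eps is needed (an illustration of the direction discussed here, not the merged code):

import torch.nn.functional as F

def cross_entropy_with_oh_targets(outputs, targets_oh, reduction="mean"):
    # outputs: raw logits; targets_oh: one-hot (or soft) target distribution
    log_probs = F.log_softmax(outputs, dim=1)
    loss = -(targets_oh * log_probs).sum(dim=1)
    if reduction == "mean":
        return loss.mean()
    if reduction == "sum":
        return loss.sum()
    return loss  # reduction == "none"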

@AlbinSou AlbinSou merged commit 8803c85 into ContinualAI:master Oct 20, 2023
14 of 18 checks passed