
Added DER strategy #1314

Merged
merged 15 commits into from Apr 5, 2023

Conversation

@AlbinSou (Collaborator) commented Mar 6, 2023

I added the DER strategy here. Below I list some implementation details and choices that we could potentially change.

  • About the use of transformations: for now, I do not use transformations by default on the current data, because in Avalanche this is not supposed to be a choice made by the strategy but rather one made by the benchmark. However, input transformations are part of the DER method, so they should either be enabled, in which case the actual DER behavior is obtained, or left out, in which case transformations are applied only to the buffer data.
  • To store the data, I use a derived version of ClassBalancedBuffer which additionally stores the logits of the memory samples so that they can later be replayed. To do this, I build a classification tensor dataset by iterating over the experience dataset, collecting both the input data and the logits, which I pad to a maximum dimension (set to 300 for now); I explain why below. I also take care of using experience.dataset.eval() so that the stored data is not already augmented. I think this is a sufficient precaution, but let me know if you think there might be a problem there.
  • Another point that I am not sure how to handle is the logits size. A model used inside an Avalanche strategy may have a potentially varying number of logits, which would make the classification tensor dataset fail to build, since it cannot store tensors of different sizes in the same dataset. This is why I pad the logits to a fixed shape (see the sketch after this list).
  • Again about the logits size: the classical DER strategy does not use a varying logits size and thus stores the values of the full logits (including units for classes not yet trained) from the beginning of training. I think this behavior was not initially intended by the authors, but they later said in a follow-up paper that they realized doing it this way was beneficial for performance. In Avalanche, the strategy is not supposed to choose the architecture of the model, so the implementation I provide handles both cases: a model with growing logits and a model with a fixed number of logits.
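For illustration, here is a minimal sketch of the padding described above. It is not the PR code: the helper name, the batch size, and the assumption that the un-augmented dataset yields (x, y, ...) tuples are all hypothetical.

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader


@torch.no_grad()
def collect_padded_logits(model, eval_data, max_logits=300, device="cpu"):
    """Collect inputs, targets and model logits over an un-augmented dataset,
    padding the logits to a fixed width so they can be stored in a single
    tensor even if the model head is smaller than max_logits."""
    model.eval()
    xs, ys, zs = [], [], []
    for x, y, *_ in DataLoader(eval_data, batch_size=64):
        z = model(x.to(device)).cpu()
        # pad the class dimension up to max_logits
        z = F.pad(z, (0, max_logits - z.shape[1]))
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return torch.cat(xs), torch.cat(ys), torch.cat(zs)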

TL;DR: while this implements the DER and DER++ strategies, there are two details from the paper that are usually handled by Avalanche (the use of input transformations and the model head size) and that can significantly change the behavior of the method. Setting them correctly leads to the actual DER; otherwise, you get a modified version that sticks as close as possible to the original one.

@coveralls commented Mar 6, 2023

Pull Request Test Coverage Report for Build 4611467727

  • 177 of 186 (95.16%) changed or added relevant lines in 7 files are covered.
  • 12 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.2%) to 73.636%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
avalanche/benchmarks/utils/data_attribute.py | 24 | 27 | 88.89%
avalanche/training/supervised/der.py | 114 | 120 | 95.0%

Files with Coverage Reduction | New Missed Lines | %
avalanche/benchmarks/scenarios/classification_scenario.py | 2 | 86.31%
avalanche/benchmarks/utils/data.py | 2 | 87.85%
avalanche/training/storage_policy.py | 8 | 90.27%

Totals:
Change from base Build 4436218510: 0.2%
Covered Lines: 14102
Relevant Lines: 19151

💛 - Coveralls

@AntonioCarta (Collaborator)

> For now, I do not use transformations by default on the current data

Can we reuse the original dataset transformations? This would be the best solution. Alternative: provide an argument where the user must specify the transformations.

> Another point that I am not sure how to handle is the logits size.

I'm not sure if there is any correct choice here. We can either follow the paper (only curr+past logits) or the implementation (everything). Either way, I would add a comment clarifying this implementation detail since it's quite critical.

> the implementation I provide handles both cases: a model with growing logits and a model with a fixed number of logits.

I think this is ok.

@AntonioCarta (Collaborator)

I talked with the authors and they confirmed to me that we should keep both the transformations and the classifier size as part of the strategy. I suggest:

  • use the transformations, either taken from the dataset or passed as a strategy argument (if it's not easy to take them from the dataset)
  • return an error if the output size grows over time.

Please mention both of these in the docstring so that users are fully aware of them.
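A minimal sketch of such a check (a hypothetical helper, not the code added in this PR), assuming the strategy can read the logits size at each experience:

class FixedOutputSizeChecker:
    """Raise an error if the model's output size changes across experiences."""

    def __init__(self):
        self.out_size = None

    def check(self, logits):
        size = logits.shape[1]
        if self.out_size is None:
            self.out_size = size
        elif size != self.out_size:
            raise ValueError(
                "DER assumes a model with a fixed number of logits; "
                f"the output size changed from {self.out_size} to {size}."
            )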

@AlbinSou (Collaborator, Author)

Ok, I think returning an error is a good way to handle that problem. About the transformations: do you know if there is a simple way to apply them to an already existing scenario (potentially one without transformations), without turning the whole dataset into a tensor dataset? So far, only the replay buffer is turned into a tensor dataset.

@AlbinSou (Collaborator, Author)

Basically, I have made all the changes except one so far. I don't know how to apply the transformations to the dataset if they are not there: I don't know how to robustly check for the presence of transformations (e.g., in the online setting), and if they are missing I am not sure what the correct way is to add them to the dataset in Avalanche. So for now, the transformations passed as an argument are only applied to the replay buffer, leaving the responsibility to the user to set the scenario transformations accordingly.

One way of handling this could be to create a tensor dataset not only for the replay buffer but also for the current dataset, and add the transformations to that dataset instead. But I think that's a bit too much; if there is another way, it would be nice.

@AntonioCarta (Collaborator)

Maybe we can wrap the original dataset and add the logits. Untested example:

from avalanche.benchmarks.utils import AvalancheDataset


class DatasetWithLogits(AvalancheDataset):
    def __init__(self, data, model, **logits_kwargs):
        # logits_kwargs: other optional arguments for the logits computation
        super().__init__([data])

        # TODO: compute logits here using model
        self.logits = ...

    def __getitem__(self, idx):
        mb = super().__getitem__(idx)
        return (*mb, self.logits[idx])

@AlbinSou (Collaborator, Author)

Ok, I ended up handling the current dataset like the replay one, by turning it into a tensor dataset created from experience.dataset.eval() (so only with eval transformations). Then, in the tensor dataset, I apply the same transformations as the ones used for the buffer.

I ran a sanity check and can confirm that the behavior of the method no longer depends on the dataset transformations, but only on the ones passed to the DER strategy.
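A generic PyTorch sketch of that idea (plain torch, not the Avalanche-specific code from this PR; the class name is hypothetical): build the tensor dataset from un-augmented data and apply the train-time transform on the fly.

from torch.utils.data import Dataset


class TransformedTensorDataset(Dataset):
    """Wrap a tensor dataset built from un-augmented data and apply the
    train-time transform in __getitem__."""

    def __init__(self, tensor_data, transform=None):
        self.data = tensor_data
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x, y, *rest = self.data[idx]
        if self.transform is not None:
            x = self.transform(x)
        return (x, y, *rest)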

@AntonioCarta (Collaborator)

Ok, I think I completely misunderstood the problem initially. What we want to have is:
1 - The user must be aware that augmentations are a fundamental component of DER -> mention this in the doc and ask users to set them properly in the benchmark.
2 - Ensure that the replay buffer and current dataset are using the same augmentations -> inherit transformations from the original dataset.

At first, I thought your problem was that you didn't know how to reuse the augmentations from the original dataset.

The transformations of the current dataset should not be changed because the users can set them explicitly and we don't want to override this choice. We need to do the opposite: reuse the transformations of the original dataset for the replay buffer (which doesn't happen if we just store the tensors but not the augmentations).

The problem with the current solution is that it makes the usage more complex without solving the issue:

  • If the user is aware of transformations, they now need to set them on the plugin instead of the benchmark, which is potentially more confusing.
  • If the user sets bad transformations, the problem remains, whether the argument is set via the benchmark or via the plugin.
  • We may create hidden bugs where the transformations are overridden.

Sorry for the misunderstanding.

@AlbinSou (Collaborator, Author)

For now I came up with something like that, but I think it's still a bit hacky... Let me know if you think there is a better way.

@AntonioCarta (Collaborator)

> Let me know if you think there is a better way.

I think reusing transformations from the original dataset is easier if you don't convert to tensors. Basically:

  • Subsample the data.
  • Create the dataset with logits, like here.
  • That's it. The new dataset will use the original transforms.

@AlbinSou (Collaborator, Author)

I agree, and this is how I approached the problem the first time, but then I had problems because of the way AvalancheDatasets are built. I can try again this way and tell you what the problem was.

@AlbinSou (Collaborator, Author)

@AntonioCarta Ok, here is the problem: when I do this, the __getitem__ method that I define in the DatasetWithLogits wrapper is never called. I think it's because this dataset is never used as-is; rather, some subsets are extracted from it and used to update the ReservoirBuffer. This is why I initially switched to the tensor dataset solution.

@AntonioCarta (Collaborator)

Can you share the code?

@ContinualAI-bot (Collaborator)

Oh no! It seems there are some PEP8 errors! 😕
Don't worry, you can fix them! 💪
Here's a report about the errors and where you can find them:

avalanche/training/supervised/der.py:24:1: E302 expected 2 blank lines, found 1
avalanche/training/supervised/der.py:54:19: E231 missing whitespace after ';'
avalanche/training/supervised/der.py:54:19: E702 multiple statements on one line (semicolon)
avalanche/training/supervised/der.py:63:1: E302 expected 2 blank lines, found 1
1       E231 missing whitespace after ';'
2       E302 expected 2 blank lines, found 1
1       E702 multiple statements on one line (semicolon)

@AlbinSou (Collaborator, Author)

It looks like it works fine now, but I am still not happy with the results I get... Something must be wrong, because I obtain results close to those of plain replay without augmentation, even though I use both augmentations and the DER strategy. I will keep investigating.

Also, a quick remark: I had to fix a bug that was not spotted by the strategy test but only by my test on my continual-learning-baselines fork. This is why I changed the classification_subset() call in the ReservoirSamplingBuffer resize method to a data.subset() call. I hope this does not hurt any other method, but I think it's fine. The basic test did not go through this method, but the integration test did, and that exposed the bug.

When using the logits data attribute, I thought it would be appended after all the other fields, but it is actually appended before the task_id; I don't know exactly why. The bug I encountered was that the classification_subset method was reordering these attributes, so one part of the buffer (the one that had been resized) was outputting x, y, t, logits, whereas the most recent, non-resized part was outputting x, y, logits, t.

@AntonioCarta (Collaborator)

> Something must be wrong, because I obtain results close to those of plain replay without augmentation, even though I use both augmentations and the DER strategy. I will keep investigating.

Is it below the original paper's results?

> Also, a quick remark: I had to fix a bug that was not spotted by the strategy test but only by my test on my continual-learning-baselines fork. This is why I changed the classification_subset() call in the ReservoirSamplingBuffer resize method to a data.subset() call. I hope this does not hurt any other method, but I think it's fine. The basic test did not go through this method, but the integration test did, and that exposed the bug.

Can you add a unit test for this bug?

> When using the logits data attribute, I thought it would be appended after all the other fields, but it is actually appended before the task_id; I don't know exactly why. The bug I encountered was that the classification_subset method was reordering these attributes, so one part of the buffer (the one that had been resized) was outputting x, y, t, logits, whereas the most recent, non-resized part was outputting x, y, logits, t.

Definitely a bug. I didn't think about this, but we are probably not enforcing a stable ordering of the attributes. Both in AvalancheDataset.__init__ and in AvalancheDataset.__getitem__ we iterate with for da in self._data_attributes.values(), and self._data_attributes is a dictionary.

@AlbinSou (Collaborator, Author)

Ok, I will try to add a unit test for this.

Yes, for now the results are lower than the original paper's. Although I have not tried the exact settings used in the paper, it often gives results inferior to replay, and I found some inconsistencies when I set alpha=0 (which should behave similarly to replay). This last point hints that something is failing.

@AlbinSou (Collaborator, Author)

I noticed that when applying transformations, as is the case in DER, there is a much bigger difference between making two separate forward passes (one for the replay buffer and one for the current batch) and making a joint forward pass. In the previous commits I was using separate forward passes (as I also do for ER-ACE). That worked fine without transformations (better than classical replay), but once the transformations were activated it stopped working. Making a joint forward pass (thus using the combined batch statistics for the loss computation) fixed the issue and now gives more sensible results. I don't really know why this happens. I also don't know whether we should adopt a consistent approach across methods; for instance, ER-ACE currently uses separate forward passes. Usually I try both and see which one works best.
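For reference, a minimal sketch of the joint forward pass described above (hypothetical helper, not the PR code):

import torch


def joint_forward(model, x_current, x_replay):
    """Single forward pass over the concatenated current + replay batch, so
    BatchNorm layers see the combined batch statistics."""
    x = torch.cat([x_current, x_replay], dim=0)
    out = model(x)
    return out[: len(x_current)], out[len(x_current):]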

But now it's working! Maybe we can wait to merge until I include this unit test.

@AlbinSou (Collaborator, Author) commented Apr 3, 2023

I added a test, but I don't know exactly what behavior you wanted to test here. I check the order of the attributes when adding a dataset attribute to an existing dataset that already has other dataset attributes.

I think the expected behavior for such an operation should be data, label, att1, att2, but right now, if att2 is added after att1, it can happen (I'm not sure whether it happens every time) that you get data, label, att2, att1. I think this relates to what we discussed last time: do we want to output data, y, task_id, the rest, or data, y, the rest, task_id, which is the current behavior?

I think that in order to change this, we have to change the inheritance of data_attributes in benchmarks/utils/data.py.

@AntonioCarta (Collaborator)

> Making a joint forward pass (thus using the combined batch statistics for the loss computation) fixed the issue and now gives more sensible results. I don't really know why this happens. I also don't know whether we should adopt a consistent approach across methods; for instance, ER-ACE currently uses separate forward passes. Usually I try both and see which one works best.

The only explanation here can be batch normalization.

> I think the expected behavior for such an operation should be data, label, att1, att2, but right now, if att2 is added after att1, it can happen (I'm not sure whether it happens every time) that you get data, label, att2, att1. I think this relates to what we discussed last time: do we want to output data, y, task_id, the rest, or data, y, the rest, task_id, which is the current behavior?
> I think that in order to change this, we have to change the inheritance of data_attributes in benchmarks/utils/data.py.

Yes, the expected order is data, label, att1, att2. The others are bugs. Can you fix this?

@AlbinSou (Collaborator, Author) commented Apr 3, 2023

Yes, I know this is because of BN, but what I don't understand is why it works well without transforms and not with them. Anyway, it works as it is, so I will investigate that later.

I fixed the data attribute issue.

@AntonioCarta (Collaborator)

> Yes, I know this is because of BN, but what I don't understand is why it works well without transforms and not with them. Anyway, it works as it is, so I will investigate that later.

Yeah, I don't have a good explanation for the phenomenon. There are some methods (e.g. contrastive methods) that depend on whether there is batch normalization or not.

@AlbinSou Can I merge the PR? The DataAttribute fix and tests look good.

@AlbinSou (Collaborator, Author) commented Apr 3, 2023

Apparently some tests are failing after I merged master into the PR... I will fix them when I can, but I don't know where this comes from. What I don't understand is that when I check out an earlier commit where they were supposed to work (because the Avalanche automatic tests passed), it still does not work.

The two failing ones are in test_replay.py.

Also, there is one detail I don't yet know how to set. In the DER implementation they do a few weird things that I am not sure I want to reproduce.

In particular, they save the logits of the transformed input (and not of the untransformed one), and they sample two different batches from the buffer: one to compute the DER loss, and one to compute the additional loss used in DER++. This seems really weird; if I can reproduce the results without doing it, I would be happy. For now I get 69% on Split-CIFAR10 with 500 memory samples (they report 72%).

Also, about the forward pass: they do not one but three forward passes, one on the current input, one on the first replay batch they sample, and one on the second replay batch they sample.

@AlbinSou (Collaborator, Author) commented Apr 4, 2023

I tried to follow the authors' implementation, without success, so for now mine differs in a few ways, but at least it gives results that are not far off.

Notable differences:

  • According to their code, I believe the authors sample two different batches from the buffer, although they say they take only one with different transformations. Each of the two batches is used to compute one part of the DER++ loss. In this code, only one transformed replay batch is used to compute the DER loss.
  • The authors store the logits of the transformed data, right as they are computed during training; here I store the logits of the untransformed data after training on an experience, using the evaluation batch statistics.
  • The authors make 3 different forward passes: one for the current batch, one for the first replay batch, and one for the second replay batch. Instead, I use one forward pass over the concatenated replay and current batch.

I tried most of these choices to get closer to the authors' implementation, but every time without success; they gave worse results than my current implementation. Notably, doing one forward pass versus three is the most impactful factor I found. The way the logits are stored seemed less important (whether you store them transformed or untransformed, using eval or train batch statistics).
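To make the adopted variant concrete, here is a rough sketch of the DER++ objective under the simplifications listed above (one joint forward pass, a single replay batch for both regularization terms). The helper name and signature are hypothetical, not the PR code.

import torch
import torch.nn.functional as F


def derpp_loss(model, x_cur, y_cur, x_buf, y_buf, z_buf, alpha, beta):
    """DER++ loss with a single joint forward pass; z_buf are the logits
    stored in the replay buffer."""
    out = model(torch.cat([x_cur, x_buf], dim=0))
    out_cur, out_buf = out[: len(x_cur)], out[len(x_cur):]
    loss = F.cross_entropy(out_cur, y_cur)                 # supervised loss on current data
    loss = loss + alpha * F.mse_loss(out_buf, z_buf)       # DER term: match stored logits
    loss = loss + beta * F.cross_entropy(out_buf, y_buf)   # DER++ term: buffer labels
    return loss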

@AntonioCarta (Collaborator)

Ok, I think we can keep the best version. Add some comments somewhere (even hidden in the implementation) to document these issues.

Can I merge it?

@AlbinSou (Collaborator, Author) commented Apr 5, 2023

The failing test works when I run it on my own (training/test_regularization.py). I don't understand: I just added comments, and now the repo complains that it fails when it was not failing before. I guess not all tests are run every time?

@AntonioCarta (Collaborator)

All the tests are run every time. I'm not sure why it's not working here...

@AntonioCarta (Collaborator)

Ok, I tested this locally and it's working (Windows, Python 3.9). Maybe it has to do with the PyTorch version. I'm going to merge this because, if there is a problem, it's probably unrelated to your changes and we should investigate it separately.

@AntonioCarta AntonioCarta merged commit 12b5019 into ContinualAI:master Apr 5, 2023
11 of 12 checks passed
@AlbinSou AlbinSou deleted the der branch May 18, 2023 13:26