Added DER strategy #1314
Conversation
Can we reuse the original dataset transformations? This would be the best solution. Alternative: provide an argument where the user must specify the transformations.
I'm not sure if there is any correct choice here. We can either follow the paper (only curr+past logits) or the implementation (everything). Either way, I would add a comment clarifying this implementation detail since it's quite critical.
I think this is ok.
I talked with the authors and they confirmed to me that we should keep both the transformations and the classifier size as part of the strategy. I suggest the following:
Please mention both of these in the docstring so that users are fully aware of it.
Ok, I think returning an error is a good way to handle that problem. About the transformations: do you know if there is a simple way to apply them to an already existing scenario (potentially one without transformations), without turning the whole dataset into a tensor dataset? So far only the replay buffer is turned into a tensor dataset.
… when using non fixed classifier
Basically I made all the changes except one so far. I don't know how to apply the transformations to the dataset if they are not there: I don't know how to robustly check for the presence of transformations (e.g. in the online setting), and if they are missing I'm not sure what the correct way is to add them to the dataset in Avalanche. So for now, the transformations passed as an argument are only applied to the replay buffer, leaving it to the user to set the transformations of the scenario accordingly. One way of handling this could be to create a tensor dataset not only for the replay buffer but also for the current dataset, and add the transformations to that dataset instead. But I think that's a bit too much; if there is another way, that would be nice.
Maybe we can wrap the original dataset and add the logits. Untested example:
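Something along these lines (untested sketch; the `DatasetWithLogits` name and the assumption that items are tuples are mine):

```python
from torch.utils.data import Dataset

class DatasetWithLogits(Dataset):
    """Wrap a dataset and append pre-computed logits to each item."""

    def __init__(self, base_dataset, logits):
        assert len(base_dataset) == len(logits)
        self.base_dataset = base_dataset
        self.logits = logits  # e.g. a tensor of shape (len(base_dataset), num_classes)

    def __len__(self):
        return len(self.base_dataset)

    def __getitem__(self, idx):
        # keep whatever the wrapped dataset returns (x, y, possibly the task label, ...)
        # and append the stored logits as the last element
        item = self.base_dataset[idx]
        return (*item, self.logits[idx])
```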
Ok, I ended up handling the current dataset like the replay one, by turning it into a tensor dataset created from experience.dataset.eval() (so with only the eval transformations). Then, in the tensor dataset, I apply the same transformations as the ones used for the buffer. I did a sanity check and can confirm that the behavior of the method now depends only on the transformations passed to the DER strategy, and no longer on the dataset transformations.
Ok, I think I completely misunderstood the problem initially. What we want to have is:
- The transformations of the current dataset should not be changed, because users can set them explicitly and we don't want to override this choice.
- We need to do the opposite: reuse the transformations of the original dataset for the replay buffer (which doesn't happen if we just store the tensors but not the augmentations).

At first, I thought your problem was that you didn't know how to reuse the augmentations from the original dataset. The problem with the current solution is that it makes the usage more complex without solving the issue.
Sorry for the misunderstanding.
For now I came up with something like this, but I think it's still a bit hacky... Let me know if you think there is a better way.
I think reusing transformations from the original dataset is easier if you don't convert to tensors. Basically:
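Roughly like this (untested; the exact AvalancheDataset/DataAttribute constructor arguments are from memory, and `experience`, `chosen_indices`, `stored_logits` are placeholders):

```python
from avalanche.benchmarks.utils import AvalancheDataset, classification_subset
from avalanche.benchmarks.utils.data_attribute import DataAttribute

# take a plain subset of the experience dataset, so the original
# train/eval transformation groups are kept as-is
buffer_subset = classification_subset(experience.dataset, indices=chosen_indices)

# attach the pre-computed logits as an extra attribute returned by __getitem__
logits_attr = DataAttribute(stored_logits, "logits", use_in_getitem=True)
buffer_data = AvalancheDataset([buffer_subset], data_attributes=[logits_attr])
```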
I agree, and this is how I approached the problem the first time, but then I ran into problems because of the way AvalancheDatasets are made. I can try again this way and tell you what the problem was.
@AntonioCarta Ok, here is the problem: when I do this, the __getitem__ method that I want to define in the wrapper DatasetWithLogits is not called. I think it's because this dataset is never used as-is; rather, some subsets are extracted from it and used to update the ReservoirBuffer. This is why I initially switched to the tensor dataset solution.
Can you share the code?
Oh no! It seems there are some PEP8 errors! 😕
It looks like it works fine now, but I am still not happy with the results I get with it... Something must be wrong, because I obtain results close to those of normal replay without augmentation, even though I use both augmentations and the DER strategy. I will keep investigating.

Also, a quick remark: I had to fix a bug that was not spotted by the strategy test but only by my test on my continual-learning-baselines fork. This is why I changed the classification_subset() call in the ReservoirSamplingBuffer resize method to a data.subset() call. I hope this does not hurt any other method, but I think it's fine. The basic test was not going through this method, but the integration test was, and that was triggering the bug. When using the logits data attribute, I thought it would be appended after all the other fields, but it is actually appended before the task_id; I don't know exactly why. The bug I encountered was that the classification_subset method was reordering these attributes, so one part of the buffer (the one that had been resized) was outputting x, y, t, logits, whereas the most recent, non-resized part was outputting x, y, logits, t.
…R, plus minor change in ReservoirSamplingBuffer
Is it below the original paper's results?
Can you add a unit test for this bug?
Definitely a bug. I didn't think about this, but probably we are not enforcing that the ordering of attributes is stable. Both in …
Ok, I will try to add a unit test for this. Yes, for now the results are lower than in the original paper. Although I have not tried the exact settings used in the paper, it often gives results inferior to replay, and I found some inconsistencies when I set alpha=0 (which should behave similarly to replay). This last point hints that something is failing.
I noticed that when applying transformations, as is the case in DER, there is a much bigger difference between making two separate forward passes for the replay buffer and the current batch versus making a single joint forward pass. In the previous commits I was using separate forward passes (as I also do for ER-ACE). It was working fine without transformations (better than classical replay), but once the transformations were activated it stopped working. Making a joint forward pass (thus using the combined batch statistics for the loss computation) fixed that issue and now gives more sensible results. I don't really know why this happens. I don't know if we should adopt a consistent approach to this across methods; for instance, ER-ACE currently uses separate forward passes. Usually I try both and see which one works best. But now it's working! Maybe we can wait to merge until I include this unit test.
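To make the difference concrete, a minimal sketch with a toy BatchNorm model (placeholder model and batches, not the strategy code):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU(), nn.Flatten())
x_curr = torch.randn(16, 3, 32, 32)  # current mini-batch
x_buf = torch.randn(16, 3, 32, 32)   # mini-batch sampled from the replay buffer

# (a) separate forward passes: BatchNorm normalizes each batch with its own statistics
out_curr, out_buf = model(x_curr), model(x_buf)

# (b) joint forward pass: one concatenated batch, shared BatchNorm statistics
out = model(torch.cat([x_curr, x_buf]))
out_curr, out_buf = out[:len(x_curr)], out[len(x_curr):]
```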
I added a test, but I don't know what exact behavior you wanted to test here. I checked the order of the attributes when you add a dataset attribute to an existing dataset that already has other dataset attributes. I think the expected behavior for such an operation should be data, label, att1, att2, but right now, if att2 is added after att1, it can happen (I'm not sure if it happens every time) that you get data, label, att2, att1. I think this has to do with what we discussed last time: do we want to output data, y, task_id, the rest, or data, y, the rest, task_id, which is the current behavior? I think in order to change that we have to change the inheritance of data_attributes in benchmarks/utils/data.py.
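Roughly, the property being checked is something like this (not the exact test; the AvalancheDataset/DataAttribute calls are from memory and may need adjusting):

```python
import torch
from torch.utils.data import TensorDataset
from avalanche.benchmarks.utils import AvalancheDataset
from avalanche.benchmarks.utils.data_attribute import DataAttribute

base = TensorDataset(torch.randn(10, 3), torch.arange(10))
att1 = DataAttribute(torch.zeros(10), "att1", use_in_getitem=True)
att2 = DataAttribute(torch.ones(10), "att2", use_in_getitem=True)

# build the dataset with att1 first, then add att2 on top of it
data = AvalancheDataset([base], data_attributes=[att1])
data = AvalancheDataset([data], data_attributes=[att2])

x, y, a1, a2 = data[0]
assert a1 == 0 and a2 == 1  # expected order: data, label, att1, att2
```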
The only explanation here can be batch normalization.
Yes, the expected order is data, label, att1, att2. The others are bugs. Can you fix this?
Yes, I know this is because of BN, but what I don't understand is why it works well without transforms and not with them. Anyway, it works as it is, so I will investigate that later. I fixed the data attribute issue.
Yeah, I don't have a good explanation for the phenomenon. There are some methods (e.g. contrastive methods) that depend on whether there is batch normalization or not. @AlbinSou Can I merge the PR? The DataAttribute fix and tests look good.
Apparently some tests are failing after I merged master into the PR... I will fix them when I can, but I don't know where this comes from. What I don't understand is that when I check out an earlier commit where they were supposed to work (because the Avalanche automatic tests passed), it still does not work. The two failing ones are in test_replay.py.

Also, there is one detail I don't know yet how to settle. In the DER implementation they do a few weird things that I am not sure I want to reproduce. In particular, they save the logits of the transformed input (not of the untransformed one), and they sample two different batches from the buffer, one to compute the DER loss and one to compute the additional loss used in DER++. But this seems really weird; if I can reproduce the results without doing this, I would be happy. For now I get 69% on Split-CIFAR10 with 500 memory (they report 72%). Also, about the forward pass: they do not one but three forward passes, one on the current input, one on the first replay batch they sample, and one on the second replay batch they sample.
I tried to do it like the authors' implementation, without success, so for now it differs in a few ways, but at least it gives results that are not too far off. Notable differences:
I tried most of the choices to get closer to the authors' implementation, but every time without success: it gave worse results than how I have it implemented now. Notably, doing one forward pass versus three is the most impactful factor I found. However, how the logits are stored seemed less important (whether you store transformed or untransformed inputs, using eval batch statistics or train batch statistics).
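For reference, the loss I end up computing looks roughly like this (a simplified sketch of the idea rather than the actual strategy code; alpha and beta are the DER/DER++ weights):

```python
import torch
import torch.nn.functional as F

def derpp_loss(model, x_curr, y_curr, x_buf, y_buf, logits_buf, alpha, beta):
    # single joint forward pass over current and replay samples
    # (the reference implementation instead uses three separate passes
    # and two independently sampled replay batches)
    out = model(torch.cat([x_curr, x_buf]))
    out_curr, out_buf = out[:len(x_curr)], out[len(x_curr):]

    loss = F.cross_entropy(out_curr, y_curr)               # loss on the current batch
    loss = loss + alpha * F.mse_loss(out_buf, logits_buf)  # DER: match stored logits
    loss = loss + beta * F.cross_entropy(out_buf, y_buf)   # DER++: CE on the replay batch
    return loss
```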
Ok, I think we can keep the best version. Add some comments somewhere (even hidden in the implementation) to document these issues. Can I merge it?
The test that is failing works when I run it on my own (training/test_regularization.py). I don't understand: I just added comments, and now the repo is complaining that it fails when it was not failing before. I guess not all tests are run every time?
All the tests are run every time. I'm not sure why it's not working here...
Ok, I tested this locally and it's working (Windows, py3.9). Maybe it has to do with the PyTorch version. I'm going to merge this because, if there is a problem, it's probably unrelated to your changes and we should investigate it separately.
I added the DER strategy here. Below I list some implementation details and choices that we could potentially change.
TL;DR: While this strategy implements DER and DER++, there are two details from the paper that are classically handled by Avalanche itself (the use of input transformations and the model head size) that can significantly change the behavior of the method. Setting them correctly will give the actual DER; otherwise it will be a modified version that tries to stick as close as possible to the original one.
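For illustration, getting the actual DER behavior would look roughly like this (a sketch, not the final code; the model choice is illustrative and argument names such as `transforms` should be checked against the docstring):

```python
from torch.nn import CrossEntropyLoss
from torch.optim import SGD
from torchvision import transforms
from avalanche.models import SlimResNet18
from avalanche.training.supervised import DER  # the strategy added in this PR

# 1) fixed-size head: the classifier covers all classes of the benchmark from the start
model = SlimResNet18(10)
optimizer = SGD(model.parameters(), lr=0.03)

# 2) pass the train-time augmentations explicitly to the strategy instead of
#    relying on the transformations attached to the benchmark
der_transforms = transforms.Compose(
    [transforms.RandomCrop(32, padding=4), transforms.RandomHorizontalFlip()]
)

strategy = DER(
    model,
    optimizer,
    CrossEntropyLoss(),
    mem_size=500,
    transforms=der_transforms,  # argument name assumed, check the docstring
    train_mb_size=32,
    train_epochs=1,
)
```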