
remove OnlineSupervisedTemplate #1362

Merged
13 commits merged into ContinualAI:master on May 29, 2023

Conversation

@AlbinSou (Collaborator) commented May 5, 2023

Avalanche already handles online streams well, as streams of small experiences that are already compatible with the normal strategies (SupervisedTemplate). I think having two types of templates is confusing, also because a CL agent is not supposed to know what is going to come (an online stream or not, maybe experiences of various sizes).

This is why I attempt to remove the online templates.

  • I removed the existing OnlineSupervisedTemplate and OnlineMeta...Template, and merged OnlineObservation into BatchObservation (in the case of an online stream, we have to handle access_task_boundaries).

  • I removed the Online ER ACE, which was the only method implementing both online and non-online behavior (aside from the plugins, which are agnostic to that).

  • There is one thing, however, that I do not understand in the old online observation: the distinction between make_optimizer and update_optimizer. I don't understand why it's necessary, and I don't know whether I should keep this behavior or drop it.

  • I also kept the OnlineNaive strategy for now, just turning it into a SupervisedTemplate and rerouting train_passes to train_epochs.

Maybe keep it as a draft for now, until we are sure the currently implemented online strategies exhibit the same behavior when turned into their non-online versions.

@coveralls commented May 5, 2023

Pull Request Test Coverage Report for Build 5092658528

  • 164 of 172 (95.35%) changed or added relevant lines in 9 files are covered.
  • 4 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.5%) to 72.586%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| avalanche/training/templates/observation_type/batch_observation.py | 29 | 30 | 96.67% |
| avalanche/models/dynamic_optimizers.py | 46 | 49 | 93.88% |
| tests/test_models.py | 80 | 84 | 95.24% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| avalanche/benchmarks/scenarios/new_classes/nc_scenario.py | 1 | 89.02% |
| avalanche/models/dynamic_optimizers.py | 1 | 93.75% |
| avalanche/benchmarks/scenarios/online_scenario.py | 2 | 95.96% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 5001854992 | 0.5% |
| Covered Lines | 15828 |
| Relevant Lines | 21806 |

💛 - Coveralls

@HamedHemati (Collaborator) commented May 5, 2023

Thanks @AlbinSou for the PR. I think at this point it makes sense to merge them to avoid additional work and confusion when implementing new strategies. Here are my comments:

> I removed the existing OnlineSupervisedTemplate and OnlineMeta...Template, and merged OnlineObservation into BatchObservation (in the case of an online stream, we have to handle access_task_boundaries).

I think it makes sense to remove observation_type and add the check_model_and_optimizer, make_optimizer, and model_adaptation functions to BaseSGDTemplate. I don't think we'll have new observation types (?).

> I removed the Online ER ACE, which was the only method implementing both online and non-online behavior (aside from the plugins, which are agnostic to that).

Minor: another strategy that uses OnlineSupervisedMetaLearningTemplate is MER.

> There is one thing, however, that I do not understand in the old online observation: the distinction between make_optimizer and update_optimizer. I don't understand why it's necessary, and I don't know whether I should keep this behavior or drop it.

The reason for updating instead of resetting in online CL streams with task boundaries is that we don't want to reset the optimizer state after each sub-experience of the same experience, as this would be problematic for some online strategies when using the Adam optimizer.

> I also kept the OnlineNaive strategy for now, just turning it into a SupervisedTemplate and rerouting train_passes to train_epochs.

I think it makes sense to remove OnlineNaive and just use Naive. With the definition of experience in Avalanche, using the term epoch is fine, since the only difference between online and batch CL is the size of the experience.

> Maybe keep it as a draft for now, until we are sure the currently implemented online strategies exhibit the same behavior when turned into their non-online versions.

Since the only difference is in model adaptation and optimizer resetting, if merged correctly, it shouldn't affect anything.

@AlbinSou (Collaborator, Author) commented May 8, 2023

Yes, you are right that we should not reset the optimizer momentum after every experience. However, I think this is a more general question: maybe we also don't want to reset it for normal task shifts after all. I think we should add this as part of the strategy arguments (like a reset-optimizer-at-task-shift flag). What do you think? Otherwise, in online learning without task boundaries, we cannot really know when we should reset the optimizer. And I think waiting for new parameters to come in is a bit weird, since it assumes that a distribution change will be accompanied by new parameters (a bias we have from class-incremental learning, where it is indeed the case).

I will put back this update_optimizer line, but for me it should be clear when the optimizer is reset or not; maybe a strategy argument would be better than an arbitrary decision.

@AntonioCarta (Collaborator) commented:

In general I'm not against adding an argument to control the optimizer reset, but I have to point out one thing. The current reset_optimizer is doing two things:
1. adding new parameters;
2. resetting the optimizer's state (e.g. momentum).

We have to do (1) before each experience because all the strategies need to work with DynamicModules, but (2) is optional.

The current solution behaves in the same way whether you have an expanding layer or not. Instead, keeping the state is tricky when a layer is expanded, because we don't know what's in the state.

Notice that a user can always override the method to change this behavior easily.

> And I think waiting for new parameters to come in is a bit weird, since it assumes that a distribution change will be accompanied by new parameters (a bias we have from class-incremental learning, where it is indeed the case).

I'm not sure I understand this point, but in general a distribution shift does not always result in parameter expansion (e.g. domain-incremental).
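To make the two operations above concrete, here is a minimal sketch in plain PyTorch of what "resetting" versus "updating" an optimizer could look like. This is not Avalanche's implementation; the helper names are made up for illustration.

```python
import torch
from torch import nn


def reset_optimizer_sketch(optimizer, model):
    """(1) + (2): point the optimizer at the current parameters and drop all of
    its state (e.g. Adam/SGD momentum buffers)."""
    optimizer.state.clear()
    optimizer.param_groups[0]["params"] = list(model.parameters())


def update_optimizer_sketch(optimizer, old_params, new_params):
    """(1) only: swap expanded/re-created parameters into the existing groups,
    keeping the state of every parameter that did not change."""
    for old_p, new_p in zip(old_params, new_params):
        if old_p is new_p:
            continue  # unchanged parameter: its state is kept untouched
        for group in optimizer.param_groups:
            group["params"] = [new_p if p is old_p else p for p in group["params"]]
        # the old buffers no longer match the new shape, so drop only this entry
        optimizer.state.pop(old_p, None)


# Example: the classifier head is re-created with more outputs (standing in for
# a DynamicModule expansion) and the optimizer is updated without a full reset.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
old_params = list(model.parameters())
model = nn.Linear(4, 3)
update_optimizer_sketch(optimizer, old_params, list(model.parameters()))
```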

@HamedHemati (Collaborator) commented May 10, 2023

Regarding optimizer resetting:

I think we can treat all experience types in the same way and:

1- Do not reset the optimizer after each experience. This can be done as a part of the strategy in case it's needed.

2- Mark the strategy's model changes with an attribute assigned to dynamic modules after their adaptation. For example, for all DynamicModules we can setattr(self, 'is_expanded', True) if new parameters are introduced, otherwise set it to False by default. Then, after model_adaptation is over, we can check whether the is_expanded attribute has changed for any of those modules and run update_optimizer in case of any changes, and of course reset is_expanded to False afterwards. This should be very fast and resolve the confusion about task boundaries.
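A rough sketch of this idea (the is_expanded attribute is hypothetical and not part of the current DynamicModule API):

```python
from torch import nn


def any_module_expanded(model: nn.Module) -> bool:
    """Check, then clear, a hypothetical `is_expanded` flag that each dynamic
    module would set when its adaptation step introduces new parameters."""
    expanded = False
    for module in model.modules():
        if getattr(module, "is_expanded", False):
            expanded = True
            module.is_expanded = False  # reset the flag for the next experience
    return expanded


# Hypothetical use after model adaptation:
#     if any_module_expanded(strategy.model):
#         update_optimizer(...)  # only touch the optimizer when something grew
```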

@AntonioCarta (Collaborator) commented:

> For example, for all DynamicModules we can setattr(self, 'is_expanded', True) if new parameters are introduced, otherwise set it to False by default. Then, after model_adaptation is over, we can check whether the is_expanded attribute has changed for any of those modules and run update_optimizer in case of any changes.

It's already possible to check expanded layers by looking at their dimension.

One problem I see is that in this way models with incremental classifiers or static ones behave differently.

@HamedHemati (Collaborator) commented May 11, 2023

> It's already possible to check expanded layers by looking at their dimension.

Yea, that's also possible. The idea was to avoid multiple parameter-wise comparisons.

> One problem I see is that in this way models with incremental classifiers or static ones behave differently.

There is always the option to reset the optimizer. At least to me it makes more sense to always update and expand the optimizer state rather than resetting it, and let the user decide whether it should be reset or not.

@AlbinSou (Collaborator, Author) commented:

@AntonioCarta I don't understand which one is the failing test; it just says unittest 3.7 "Job has failed" on GitHub. Apart from that, I did not understand what the conclusion of the above discussion was. Should we use update_optimizer() by default? For me it makes sense to reset the optimizer at task boundaries when they are given, since most people do that.

Inline review comments on the make_optimizer diff:

        else:
            avalanche_model_adaptation(model, self.experience)

        return model.to(self.device)

-    def make_optimizer(self):
+    def make_optimizer(self, reset_opt=True, reset_state=False):
Collaborator:
Since these are global arguments of the train method, I would be more explicit: reset_optimizer, reset_optimizer_state.

Collaborator:
Also, add a docstring for the arguments.

-        reset_optimizer(self.optimizer, self.model)
+        reset_opt = detect_param_change(
+            self.optimizer, list(self.model.named_parameters()),
+        ) or reset_opt
Collaborator:
This is different from what we discussed: only the dynamic modules should be reset unless reset_opt is set. The state of the others should be kept.

@AlbinSou (Collaborator, Author) commented May 17, 2023:
OK, then I don't know how to reset only those, because depending on the optimizer the state could have a different structure. In other words, I know how to reset the whole state, but not only one part of it. One way I see would be to have one parameter group per parameter, but this would be incompatible with user-defined parameter groups.

Collaborator:
You can check the shape of the parameters before and after adaptation. If they changed, you should reset the state for those parameters.

Collaborator:

> In other words, I know how to reset the whole state, but not only one part of it.

I honestly don't know how to fix this. This is why we just used to reset everything.

> One way I see would be to have one parameter group per parameter, but this would be incompatible with user-defined parameter groups.

I would not mess with parameter groups, since they may be defined by the user.

@AlbinSou (Collaborator, Author) commented May 18, 2023:

> You can check the shape of the parameters before and after adaptation. If they changed, you should reset the state for those parameters.

This is already what I am doing: the id of the parameter changes whenever the shape changes. I could also check the shape, but I think that would be redundant. I should make sure that this (id change = shape change) always holds, but I think it does. By the way, I might have found a way to update the state; I'll push something today.
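A minimal sketch of this kind of change detection (hypothetical helper names, not the code in this PR): record the parameters before adaptation, then after adaptation clear the optimizer state only for the ones that were replaced.

```python
from torch import nn


def snapshot_params(model: nn.Module) -> dict:
    """Parameter name -> tensor object, recorded before model adaptation."""
    return dict(model.named_parameters())


def params_changed_by_adaptation(before: dict, model: nn.Module) -> list:
    """Old tensors that were replaced during adaptation; for dynamic modules a
    new tensor object implies the shape changed."""
    changed = []
    for name, new_p in model.named_parameters():
        old_p = before.get(name)
        if old_p is not None and old_p is not new_p:
            changed.append(old_p)
    return changed


# Hypothetical use around adaptation:
#     before = snapshot_params(strategy.model)
#     ... model adaptation ...
#     for old_p in params_changed_by_adaptation(before, strategy.model):
#         strategy.optimizer.state.pop(old_p, None)  # reset state only for these
```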

self.model.parameters(),
reset_state=False)
else:
if reset_state:
Collaborator:
reset_state and reset_opt both reset the state. What's the difference?

Collaborator (Author):
reset_state resets only the state, without assigning new parameters to the optimizer, whereas reset_opt does both (it assigns new parameters to the optimizer and resets the state).

Collaborator:

> reset_state resets only the state, without assigning new parameters to the optimizer.

We never want to do this. I think you can remove one of the two options, since they are redundant.

@AlbinSou AlbinSou marked this pull request as ready for review May 22, 2023 12:43
@AlbinSou (Collaborator, Author) commented:

I attempted to fix the checkpointing tests blindly, since I cannot manage to run them locally. One thing that I foresee could be a problem in the checkpointing case is that I'm not sure the loaded pointers (ids) of the loaded model parameters are necessarily the same as the ones I store in optimized_params_id. Also, it's possible that they are not even the same between the model parameters and the optimized parameters.

For checkpointing, I cannot remove optimized_params_id and call reset_optimizer (which is the default behavior when optimized_params_id is None), since there might be a saved optimizer state that we want to keep for the remaining part of the optimization process.

I don't know yet how to fix this problem; if you know exactly how checkpointing of the optimizer and model is done, I'd welcome your help. But I think the problem I described is happening, and this must be why we get an error in that case.

In one part I commented out some code that I thought would handle this case, but it does not, and now I understand why: the ids in the loaded optimizer are not necessarily the same as the ids that I save in optimized_params_id.

Anyway, in this last commit I also add options to manage the freezing of parameters: so far I was adding all model parameters; now I check that p.requires_grad is True before adding a parameter to the optimizer. I also link optimized_params_id more strongly to the functions that manipulate the optimizer; this attribute is now returned by those functions. I guess both of these functions (update_optimizer and reset_optimizer) could later be moved to the appropriate file (dynamic_optimizers), but for now I keep them here.

@AntonioCarta (Collaborator) commented:

You should add a test that checks that optimizers with state (e.g. Adam) are serialized properly.

Also, we could try to save the state_dict instead of the object. PyTorch converts the ids to positional indices that can be updated with load_state_dict (code).

> I check that p.requires_grad is True before adding a parameter to the optimizer

You should not do this. The user passes the parameters to the optimizer, and you don't know whether frozen parameters that are passed to the optimizer will stay frozen.
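For reference, the standard PyTorch state_dict-based checkpointing pattern mentioned above (where the optimizer state is keyed by positional indices rather than parameter ids) looks roughly like this; it is a generic sketch, not the Avalanche checkpointing code:

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ... training ...

# Save the state_dicts rather than the live objects.
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
    "checkpoint.pt",
)

# Later: rebuild the model and optimizer, then restore their states
# (Adam moments, step counts, ...) against the new parameter objects.
checkpoint = torch.load("checkpoint.pt")
model = nn.Linear(4, 2)
model.load_state_dict(checkpoint["model"])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
optimizer.load_state_dict(checkpoint["optimizer"])
```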

@AlbinSou (Collaborator, Author) commented:

OK, I will add this test; I have already run some tests with checkpointing but did not check the state. From visual debugging it looked like the states were correctly serialized. For the requires_grad check, it's more that I'm thinking of a use case that I encountered. I agree with what you said, but then I don't know how a user could decide not to optimize some parameters. I understand that if a parameter is in the optimizer with requires_grad = False it will not get a gradient, so that's fine; however, if a state exists for this parameter, it will still get updated by the momentum, so it won't be frozen. I wanted to avoid these cases.

@AntonioCarta (Collaborator) commented:

> But then I don't know how a user could decide not to optimize some parameters. I understand that if a parameter is in the optimizer with requires_grad = False it will not get a gradient, so that's fine; however, if a state exists for this parameter, it will still get updated by the momentum, so it won't be frozen. I wanted to avoid these cases.

This is probably a bug on the user side, but we can't do much about it. We should have the same behavior as PyTorch here.

@AlbinSou (Collaborator, Author) commented:

But in PyTorch the user can decide to set requires_grad = False and remove the parameter from the optimizer, whereas here we are giving all the parameters to the optimizer all the time, leaving no option for the user to freeze any parameters. Maybe we should at least give the user the choice of which parameters to put into the optimizer.
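On the user side, the usual PyTorch way to express this choice is to freeze the parameters and build the optimizer over only the trainable ones, so frozen parameters receive neither gradients nor momentum updates. A generic sketch, independent of Avalanche's templates:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer.
for p in model[0].parameters():
    p.requires_grad = False

# Hand only the trainable parameters to the optimizer, so the frozen ones get
# neither gradients nor momentum updates.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1, momentum=0.9)
```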

@AlbinSou (Collaborator, Author) commented:

@AntonioCarta It's ready to merge on my side

@AntonioCarta (Collaborator) commented:

ok, thank you for all the work

@AntonioCarta AntonioCarta merged commit 9a61d5b into ContinualAI:master May 29, 2023
18 checks passed
@AlbinSou AlbinSou deleted the online_strategies branch June 26, 2023 15:00