Draft: Adding stable baselines PPO agent #3

Open
wants to merge 1 commit into master

Conversation

coreylowman

Here is a running example of a Stable Baselines 3 PPO agent. It requires a more customized strategy class, so this does not inherit from the RLBaseStrategy in the avalanche-rl repo.

Main changes:

  1. Add choose_actions(Observations) -> Actions, called in both training & evaluation to get actions for specific observations. This replaces RLBaseStrategy.sample_actions(), which was only used during training.
  2. Add receive_transitions(Vectorized[Transitions]), called during rollout with the result of calling env.step(actions). This replaces the Rollout object that was previously returned from rollout. This is also where strategies can put their training logic. (A rough sketch of this interface follows the list.)
    1. NOTE: A Transition is a tuple of (observation, action, reward, done, next observation)
  3. Rework rollout() to enable usage in both training_exp() and eval_exp(). It now generates episode data rather than returning rollouts.
  4. Use Timestep for experience limits and periodic eval frequencies.
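
For concreteness, here is a rough sketch of the interface described above; the class name and type hints are placeholders, not the actual avalanche-rl API.

from abc import ABC, abstractmethod
from collections import namedtuple
from typing import List

# A Transition as described in point 2.1 above.
Transition = namedtuple(
    "Transition",
    ["observation", "action", "reward", "done", "next_observation"],
)

class NewRLBaseStrategySketch(ABC):
    """Placeholder name for the proposed base class."""

    @abstractmethod
    def choose_actions(self, observations):
        """Return one action per vectorized observation.
        Called from both training_exp() and eval_exp()."""

    @abstractmethod
    def receive_transitions(self, transitions: List[Transition]):
        """Consume the vectorized transitions produced by env.step(actions)
        during rollout; training logic goes here."""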

Things I'm unsure about:

  1. PPO manages the pytorch Module, optimizer & criterion itself, so what should I pass to the BaseStrategy? Will there be support for Strategies that manage those themselves rather than passing them to BaseStrategy?
  2. PPO requires knowing both number of environments (for vectorization) and observation/action spaces up front. I required them in the constructor, but that limits what scenarios this can be used in.

My current thinking for the way ahead is:

  1. Extract most of the functionality of this example into either a reworked RLBaseStrategy, or a new base strategy class
    1. choose_actions() and receive_transitions() would be abstract methods of the new base class
  2. Move the PPO specific things to a subclass of the new base strategy. This would only implement choose_actions() and receive_transitions().

@NickLucche
Member

Hey Corey, thanks a lot for your contribution!

The main problem with encapsulating sb3 algorithms is that we're not capable of directly exposing internals (e.g. loss) the way Avalanche does, thus limiting the customization capabilities of Plugins. This is why a2c and dqn have been implemented from scratch.

Regarding your changes and rightful concerns:

  1. The intended way to choose an action was to implement both the RLBaseStrategy.sample_actions() to choose actions during training and PPOModel.get_action() for inference. The idea is that you can have different behaviors, like eps-greedy during training (https://github.com/ContinualAI/avalanche-rl/blob/master/avalanche_rl/training/strategies/dqn.py#L138) and a simple argmax in inference (https://github.com/ContinualAI/avalanche-rl/blob/master/avalanche_rl/models/dqn.py#L16). This should make it possible to "throw away" the Strategy once you're done training and have everything you need for inference in an nn.Module class. I'm aware this isn't explicitly written anywhere atm, so I'd really appreciate feedback on whether you believe this pattern to be confusing. (A minimal sketch of this pattern follows the list.)
  2. I was anticipating this change for PPO and I agree with you. Personally, I would add the action to the Rollout instance in RLBaseStrategy.rollout() so this becomes sort of a default behavior, while refactoring the Rollout object into the Transition you propose so that it is also optimized for use with a replay memory. I also think the receive_transitions logic could be implemented in the RLBaseStrategy.update() method by taking in the gathered transitions, do you agree?
  3. This is kinda dependent on 1. but I like the formulation, although we'd have to take into account VectorizedEnv quirks.
  4. 👍🏻
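
For reference, a minimal sketch of the pattern from point 1 (simplified from the linked DQN code; the signatures here are illustrative, not the actual avalanche-rl API):

import torch

def sample_actions(model, observations, eps, n_actions):
    # Training-time behavior (Strategy side): eps-greedy exploration.
    if torch.rand(1).item() < eps:
        return torch.randint(n_actions, (observations.shape[0],))
    with torch.no_grad():
        return model(observations).argmax(dim=-1)

def get_action(model, observations):
    # Inference-time behavior (nn.Module side): plain argmax, so the
    # trained model can be used without keeping the Strategy around.
    with torch.no_grad():
        return model(observations).argmax(dim=-1)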

Overall, I think if we do want to support sb3 algorithms we should make a separate class which addresses the problems you mentioned, specifying the limits on customization (e.g. no optimizer, criterion, or loss fiddling...), and your implementation could serve as a base for that. On the other hand, I'm unsure whether having a less flexible SBStrategy would have advantages over re-writing PPO to fit RLBaseStrategy guidelines and possibly adding missing features to our base class (like the Transition object).
What are your thoughts on that, @AntonioCarta?

P.S. A workaround to the n_envs init requirement could be delaying the actual initialization until an experience is passed, although it's not the cleanest solution.
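
A sketch of that workaround (hypothetical class and attribute names; only the deferred construction is shown):

from stable_baselines3 import PPO

class SB3PPOStrategySketch:
    """Defers building PPO until the first experience, when the
    (vectorized) environment is finally known."""

    def __init__(self, **ppo_kwargs):
        self.ppo_kwargs = ppo_kwargs
        self.ppo = None  # n_envs and the spaces are not known yet

    def before_training_exp(self, experience, **kwargs):
        if self.ppo is None:
            # The environment carries n_envs and the observation/action
            # spaces, so PPO can be constructed here instead of __init__.
            env = experience.environment  # assumed attribute
            self.ppo = PPO("MlpPolicy", env, **self.ppo_kwargs)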

@AntonioCarta
Contributor

Hey @coreylowman, thanks for the PR!

I need a bit of time to look into it but I can give some high-level comments.

Extract most of the functionality of this example into either a reworked RLBaseStrategy, or a new base strategy class

Like @NickLucche said, SB3 doesn't give access to some attributes that we use in Avalanche. Therefore, we will probably need a custom Template for them.
Is it possible to make a generic wrapper for SB algorithms? I would like to avoid having lots of wrappers for all the different algorithms.

Notice that we did some work in avalanche and we are experimenting with templates, which allow defining different training loops, with different events/callbacks, and internal state. I would prefer to limit the number of them since they add a bit of complexity, but I think RL is a place where we will need a lot of customization, which is now supported by the main avalanche branch.

PPO requires knowing both number of environments (for vectorization) and observation/action spaces up front. I required them in the constructor, but that limits what scenarios this can be used in.

Maybe a clean solution would be to provide all these attributes inside the experience and do the configuration before each experience. Alternatively, we can put the attributes in the benchmark object and pass that object to the strategy.

I'm unsure whether having a less flexible SBStrategy could have advantages over re-writing PPO to fit RLBaseStrategy guidelines and possibly add missing features to our base class (like the Transition object).

Of course, the ideal solution would be to have PPO implemented inside Avalanche so that we have full control of it and we can easily make it compatible with the rest of the library. The problem is that:

  • we don't have the manpower to do it (at least right now)
  • SB3 is well known and probably uses lots of tricks that would take a lot of time to reimplement on our side

So, if the wrapper is not too expensive to maintain, I think it can be a good middle ground.

Of course this only makes sense if we can support at least part of avalanche features with the wrapper, such as:

  • metrics. I expect some to be incompatible with the wrapper.
  • benchmarks
  • some plugins. Many plugins will probably be incompatible.

@coreylowman
Author

Hey all, thanks for the feedback!

The intended way to choose an action was to implement both the RLBaseStrategy.sample_actions() to choose actions during training and PPOModel.get_action() for inference.

Since the strategy has access to self.is_training, you could implement that with a single function. I think it's also nice to let the strategy control that in a single place rather than having to implement it in both the strategy for sampling and the model for inference. It also makes reusing rollout easier since there's only one function, but you could just pass the action function as a parameter to rollout if you want to keep them separate!
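
Something along these lines (a sketch with made-up helper names, not code from the PR):

import torch

class SingleEntryPointSketch:
    """Chooses actions in one place by branching on self.is_training;
    sample_from_policy() is a made-up helper."""

    def __init__(self, model):
        self.model = model
        self.is_training = True

    def choose_actions(self, observations):
        if self.is_training:
            # exploratory behavior, e.g. sampling from the policy
            return self.sample_from_policy(observations)
        # deterministic behavior for evaluation
        with torch.no_grad():
            return self.model(observations).argmax(dim=-1)

    def sample_from_policy(self, observations):
        # placeholder: sample from a categorical over the model's logits
        with torch.no_grad():
            logits = self.model(observations)
        return torch.distributions.Categorical(logits=logits).sample()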

I also think receive_transitions logic could be implemented in the RlBaseStrategy.update() method by taking in the gathered transitions, do you agree?

I think so! We found that passing the vectorized transitions to the strategy and letting it handle them however it wants was easier for both parties.

This is kinda dependent on 1. but I like the formulation, although we'd have to take into account VectorizedEnv quirks.

Yes, the rollout function now expects only vectorized environments; I had changed make_eval_env() to vectorize too. We should annotate the env passed into rollout as VectorEnv instead of Env.

Is it possible to make a generic wrapper for SB algorithms? I would like to avoid having lots of wrappers for all the different algorithms.

I believe it's possible to make a generic wrapper for on-policy and off-policy algorithms separately, but not a single base algorithm, due to the differences in their collect_rollouts() functions. The code in choose_actions() and receive_transitions() is taken directly from OnPolicyAlgorithm.collect_rollouts(). Technically the code in this PR may be usable for any on-policy algorithm, though I have not tested that yet.
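
Roughly, the split looks like this (abbreviated and approximate; based on SB3 1.x internals, omitting callbacks, action clipping, and timeout bootstrapping; the non-SB3 names are hypothetical):

import numpy as np
import torch
from stable_baselines3 import PPO
from stable_baselines3.common.utils import obs_as_tensor

class OnPolicyCollectionSketch:
    """The two halves of OnPolicyAlgorithm.collect_rollouts(), split
    across the proposed choose_actions/receive_transitions hooks."""

    def __init__(self, ppo: PPO, n_envs: int):
        self.ppo = ppo
        self._last_episode_starts = np.ones(n_envs, dtype=bool)
        self._last_values = None
        self._last_log_probs = None

    def choose_actions(self, observations):
        # Query the actor-critic policy; keep values/log-probs around so
        # the matching transition can be stored later.
        with torch.no_grad():
            obs_tensor = obs_as_tensor(observations, self.ppo.device)
            actions, values, log_probs = self.ppo.policy(obs_tensor)
        self._last_values, self._last_log_probs = values, log_probs
        return actions.cpu().numpy()

    def receive_transitions(self, transitions):
        # Push one vectorized step into PPO's own rollout buffer.
        obs, actions, rewards, dones, _next_obs = transitions
        self.ppo.rollout_buffer.add(
            obs, actions, rewards,
            self._last_episode_starts,
            self._last_values, self._last_log_probs,
        )
        self._last_episode_starts = dones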

Maybe a clean solution would be to provide all these attributes inside the experience and do the configuration before each experience. Alternatively, we can put the attributes in the benchmark object and pass that object to the strategy.

I think including the spaces in the experience would be useful, but most algorithms will need access to the spaces when they construct the model. In our framework we just pulled the spaces from the first experience in the scenario, which seemed to work.

I don't think num_envs belongs there because that really depends on what computer you are running on.

Of course this only makes sense if we can support at least part of avalanche features with the wrapper, such as:

Agreed, I think many will be incompatible at this point. We could probably extract the nn.Module from the underlying SB3 objects, but getting hooks into the SB3 training function may be difficult.

@NickLucche
Member

Ok, I've taken a more in-depth look at the PR and I would propose a few modifications in order to comply with the current structure of RLBaseStrategy and use that as the superclass, while postponing the more "impactful" changes (such as the _rollout redefinition and choose_actions) to be discussed later on as enhancements/improvements for the whole library.

Let me know if you want to help contribute to these changes:

  1. The self.ppo object is instantiated before each new experience; this way we have access to the environment (as well as the seed). We must then handle the ppo state (optimizer, buffer, ...) and decide when to re-instantiate this object (based on some class args) in order to reset such internals.
  2. It is my understanding that we can keep on using our VectorizedEnv as long as the transitions in the ppo buffer have the intended format. Thus, I would implement the num_envs property as an alias of VectorizedEnv.n_envs, which is checked internally by sb3 to deal with its own multiprocessing implementation. It would be great if we could push actor-batched Rollouts/transitions to the ppo buffer.
  3. The receive_transitions() code can be moved to update(), making sure self.rollouts is of the intended shape/format; we can use self.after_rollout() to add custom code to make self.rollouts look like your transitions if needed.
  4. train_exp must be modified since optimization happens inside sb3, but we should try to keep as much as we can of the original training stages. We can rip out the "Backward" and "Optimization Step" parts and keep the rest:
for self.timestep in range(self.current_experience_steps.value):
    self.before_rollout(**kwargs)
    self.rollouts = self.rollout(
        env=self.environment, n_rollouts=self.rollouts_per_step,
        max_steps=self.max_steps_per_rollout)
    self.after_rollout(**kwargs)

    for self.update_step in range(self.updates_per_step):
        self.update(self.rollouts)
    # periodic evaluation
    self._periodic_eval(eval_streams, do_final=False)
  5. When creating self.ppo we must set n_epochs to 1 (or to self.updates_per_step) so we can control the ppo buffer from "outside" like you propose. We can keep other arguments, like policy, as strategy init args. (A small sketch follows the list.)
  6. We can implement the observation post-processing (transpose_image()) as Gym Wrappers (just like Array2Tensor) in make_train_env.
  7. We should integrate sb3 logging messages into ours.
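
As a rough sketch of point 5 (hypothetical strategy attributes; only the relevant arguments are shown):

from stable_baselines3 import PPO

def make_ppo(self):
    # n_epochs=1 so each call to train() does exactly one pass over the
    # buffer, leaving the outer strategy loop in control of update timing.
    return PPO(
        policy=self.policy,  # e.g. "CnnPolicy", a strategy init arg
        env=self.environment,
        n_epochs=1,
        seed=self.seed,
    )

def update(self, rollouts):
    # Transitions are assumed to already be in ppo.rollout_buffer; sb3
    # computes the loss and runs the optimization step internally.
    self.ppo.train()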

@coreylowman let me know whether this proposal makes sense to you and whether you see additional difficulties for PPO and other on-policy methods. We could have a small hierarchy made of SB3OnPolicy and SB3PPO subclass if necessary (although it would be cool to just implement the former and have all on-policy algorithms working).

@coreylowman
Author

Do you have a sense of how easy it would be to rework the existing algorithms into this SB3 structure? If you think the structure is good, it'd probably be less work overall to move the existing algorithms over now. Perhaps do that update first, before any of the other updates for this PR?

I think the rest of what you say makes sense! I do think just doing OnPolicy class would be straightforward.

@NickLucche
Member

I think I'd rather keep sb3 and avalanche-rl algorithms separated, due to the limitations in flexibility of the former.
So I'd definitely go for an implementation of OnPolicySB3 class, but trying to re-use as much as possible of what's already in place (like the Rollout object).

@NickLucche
Member

I'll post an update ASAP on this PPO thread implementing the points I mentioned; hopefully we will be able to generalize that to an on-policy class.

@NickLucche
Member

Ok, I've refactored your implementation in this branch: https://github.com/ContinualAI/avalanche-rl/blob/pr/avalanche_rl/training/strategies/sb3_onpolicy_strategy.py. Let me know what you think of it.

There are many rough edges atm, but the core idea is the following: we can handle the model and rollout/transition collection from the "outside", while we let sb3 compute the loss and optimization step internally.
The model is managed externally, as it shouldn't depend on the experience to be instantiated (though it can use the supplied task labels to branch off internally).
The data collection part is like the one you made, but it re-uses avalanche-rl components; perhaps we can even improve it at the RLBaseStrategy abstraction level by borrowing from your code, but this way we don't "over-specialize" the sb3 class.

Things that should be worked out in order to have a bare-functionality version are:

  • Rollout buffer integration: I'm fairly sure we can swap sb3's RolloutBuffer for our Rollout, perhaps adjusting the API of the latter. This would allow collecting multiple transitions at once (as long as we add the ppo values to it) before the optimization step, without needing a check like if self.steps_since_last_train >= self.ppo.n_steps before calling ppo.train() (sketched after this list). Currently we're stuck with max_steps_per_rollout=1 because of this. Perhaps we'd find a better solution by looking further into the matter.
  • Model creation should be wrapped in an intuitive API that hides the workaround needed to create a model that depends on sb3 algorithm args. The only requirement here is that the model shouldn't depend on a specific experience (we want to create it only once).
  • Generalization to OnPolicy.
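
For reference, the check mentioned in the first bullet looks roughly like this in the current approach (simplified; push_to_ppo_buffer() is a made-up helper standing in for the buffer-filling code):

def update(self, rollouts):
    # sb3's PPO expects a full RolloutBuffer (n_steps transitions per
    # env) before train() is called, hence the bookkeeping below.
    self.push_to_ppo_buffer(rollouts)  # made-up helper
    self.steps_since_last_train += len(rollouts)
    if self.steps_since_last_train >= self.ppo.n_steps:
        self.ppo.train()  # internal loss + optimizer step
        self.ppo.rollout_buffer.reset()
        self.steps_since_last_train = 0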
