
Question about AbstractEnv API #68

Closed
jonathan-laurent opened this issue Jun 12, 2020 · 26 comments

@jonathan-laurent
Contributor

In the documentation for AbstractEnv, you write the following remark:

So why don't we adopt the step! method like OpenAI Gym here? The reason is that the async manner will simplify a lot of things here.

Would you care to elaborate on what you mean here?

@findmyway
Member

I've been asked this question by at least three people, so I'd better write down my thoughts here (and in the upcoming new documentation).

Originally, we had a similar API named interact!, written by @jbrea. (Though I'm not sure why that name was chosen, it's essentially the same as step! in OpenAI Gym, if I understand it correctly.)

step!, together with some other functions in OpenAI Gym, has undoubtedly become the de facto standard API in the RL world. Even in Julia, Gym.jl, Reinforce.jl, RLInterface.jl, and MinimalRLCore.jl all adopt step!. However, that doesn't mean it is the right way to do things.

In my opinion, one critical issue with step! is that it conflates two different operations into one: (1) feed an action to the environment; (2) get an observation from the environment. (There's a similar problem with pop! in data structures, but that's another topic.)

In single-agent sequential environments, step! works well with the following process:

# Given policy and env
observation = reset!(env)
while true
    action = policy(observation)
    observation, reward, done, info = step!(env, action)
    done && break
end

But when it comes to simultaneous environments, we have to change the return type of the step! function slightly:

# Given policy and env
observation = reset!(env)
while true
    action = policy(observation)
    task = step!(env, action)
    observation, reward, done, info = fetch(task)
    done && break
end

So far, this is not a big change. After all, the signature of a function in Julia doesn't include the return type, so we can return a Future instead. (But do note that this already breaks the definition in OpenAI Gym.)
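
To illustrate, a minimal sketch (not the actual package code) of what the producer side could look like when step! returns a task instead of a tuple; apply_action! and collect_observation are hypothetical helpers:

function step!(env, action)
    return Threads.@spawn begin
        apply_action!(env, action)      # feed the action to the environment
        collect_observation(env)        # -> (observation, reward, done, info)
    end
end

The caller then uses fetch(task) exactly as in the loop above.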

Now consider multi-agent environments; things become much more complicated. See more discussion at openai/gym#934:

  1. reset! should be triggered by a (global) policy.
  2. reset! and step! must return observations and other info for all the players.
  3. We only get a chance to observe the environment after taking an action in it. (There's a workaround: introducing a no-op action.)

The root cause of these complexities is that step! is synchronous by design. I'm not the first person to notice this problem. In Ray/RLlib, environments are treated as async by design; see more details here:

In many situations, it does not make sense for an environment to be “stepped” by RLlib. For example, if a policy is to be used in a web serving system, then it is more natural for an agent to query a service that serves policy decisions, and for that service to learn from experience over time. This case also naturally arises with external simulators (e.g. Unity3D, other game engines, or the Gazebo robotics simulator) that run independently outside the control of RLlib, but may still want to leverage RLlib for training.

So we modified the APIs in OpenAI Gym a little (see the sketch after the list below):

  • reset! is kept, but it should only return nothing (or at least package users shouldn't rely on the result of reset!).
  • Each environment is a functional object that takes in an action (or a vector of actions, or action and player_id pairs, ...) and returns nothing. The only reason to use a functional struct here is to reduce the burden of remembering an extra API name ;)
  • An agent can observe the environment independently, at any time.
    • We adopt duck typing for the result of observe. Generally, get_state(obs), get_reward(obs), and get_terminal(obs) are required. The fact that the result of observe can be of any type is useful for some search-based algorithms like MCTS.
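
To make this concrete, here is a minimal sketch of the interaction loop under the modified API (assuming policy and env already exist; the get_* accessors are the ones listed above):

# reset! returns nothing; don't rely on its result
reset!(env)
while true
    obs = observe(env)              # agents can observe independently, anytime
    get_terminal(obs) && break
    action = policy(get_state(obs))
    env(action)                     # the environment is a functional object
end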

I must admit that treating all environments as async does bring some inconveniences. For environments which are essentially sync, we have to store the state, reward, and info in the environment after applying an action; see, for example, ReinforcementLearningEnvironments.jl#mountain_car. But I think it's worth doing.
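
Caching the observation after an action is applied can look roughly like the sketch below (the struct, field, and accessor names here are only illustrative, not the actual mountain_car code):

mutable struct SyncEnv
    state::Vector{Float64}
    reward::Float64
    done::Bool
end

function (env::SyncEnv)(action)
    # ... apply the dynamics to env.state here ...
    env.reward = -1.0       # store the reward inside the environment
    env.done = false        # store the termination flag as well
    nothing                 # nothing is returned; agents call observe(env)
end

observe(env::SyncEnv) = (state = copy(env.state), reward = env.reward, terminal = env.done)
get_state(obs)    = obs.state
get_reward(obs)   = obs.reward
get_terminal(obs) = obs.terminal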

(Also cc @sebastian-engel, @zsunberg, @mkschleg, @MaximeBouton, @jbrea in case they are interested in discussions here.)

@jbrea
Collaborator

jbrea commented Jun 13, 2020

Though I'm not sure why that name was chosen

This still dates back to the time when I hadn't looked at how other RL packages work, and I thought of the agent as basically interacting with the environment, since it sends an action and receives an observation, hence interact!. I really like the new convention that step! just steps the environment forward and observe is used to observe the current state of the environment.

@jonathan-laurent
Contributor Author

Thanks for your thorough explanation!
Am I correct that with your API, calls to observe, get_terminal and get_reward are blocking until all agents have submitted an action? More generally, do you have a canonical example of a simultaneous environment?

@findmyway
Member

findmyway commented Jun 13, 2020

Am I correct that with your API, calls to observe, get_terminal and get_reward are blocking until all agents have submitted an action?

It depends. It can block either when calling env(action), when calling observe(env), or at the first call to get_*(obs). For the first case, you can refer to ReinforcementLearningBase.jl#MultiThreadEnv. For the latter two, I don't have an example yet.

More generally, do you have a canonical example of a simultaneous environment?

Unfortunately, no😳.

@jonathan-laurent
Contributor Author

Thanks!
Also, I was wondering: is there currently a function to reset an environment to a previous state (or equivalently create a new environment with a custom initial state)?

I am asking because although Gym environments typically do not offer this functionality, it is essential for tree-based planning algorithms such as AlphaZero.

@findmyway
Member

findmyway commented Jun 13, 2020

is there currently a function to reset an environment to a previous state (or equivalently create a new environment with a custom initial state)?

No. To support this operation we would need to separate the environment into two parts: 1) a description part (action_space, num_of_player, etc.) and 2) an internal-state part. And I'm not sure how to handle that gracefully yet.
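
A rough sketch of what that split might look like (all names here are only illustrative, not an actual design):

# 1) Static description part
struct EnvSpec
    action_space
    num_of_players::Int
end

# 2) Mutable internal-state part
mutable struct StatefulEnv{S}
    spec::EnvSpec
    state::S
end

# Resetting to a previous state would then be a simple field assignment.
set_state!(env::StatefulEnv, s) = (env.state = s; env)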

Based on my limited experience with MCTS, I found that implementing a Base.clone(env::AbstractEnv) would be enough to mimic the behavior you described above. I can add an example later. (You may watch https://github.com/JuliaReinforcementLearning/ReinforcementLearningZoo.jl/issues/32)
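
A minimal sketch of the clone-based workaround (deepcopy is the simplest, though not necessarily the fastest, way to get an independent copy; clone, observe, and the get_* accessors follow the names used earlier in this thread):

clone(env) = deepcopy(env)

# MCTS-style rollout: simulate on the copy without touching the original env.
function rollout(env, policy, max_steps)
    sim = clone(env)
    total = 0.0
    for _ in 1:max_steps
        obs = observe(sim)
        get_terminal(obs) && break
        sim(policy(get_state(obs)))
        total += get_reward(observe(sim))
    end
    return total
end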

@zsunberg
Member

zsunberg commented Jun 13, 2020

In POMDPs.jl, we made a different decision: the separation of state and model is central. Instead of using the step! paradigm, we have a generative model, so, for instance, if m is an MDP object and s is any state (that you have stored in your tree, for instance), you can call

sp, r = @gen(:sp, :r)(m, s, a, rng)

to generate a new state and reward (in POMDPs 0.8.4+; we are still finalizing some issues and updating documentation to move to POMDPs 1.0). We've also tried to make it really easy to define simple models in a few lines with QuickPOMDPs.

That being said, it is true that using a step!- or RLBase-style interface will make it easier to wrap environments that others have written (though it could be done with POMDPs.jl), and the only thing you need to add to a step!-style simulator for it to work with MCTS is the ability to copy the environment, not to initialize it at an arbitrary state.

@zsunberg
Member

In any case, I don't think it will be too hard to adjust to different interfaces in the future. Probably best to just get it to work with one MDP, and then think hard about the interface in the second round. As mentioned in the RLZoo README, "Make it work, make it right, make it fast" is a good mantra.

@mkschleg

I made the choice to go w/ the API in MinimalRLCore for a few reasons. The biggest is just where I'm studying and who I learned RL from initially (i.e. Adam White at UofA/IU). In our course we heavily use the RLGlue interface, which Adam made during his graduate degree w/ Brian Tanner. The API is very much inspired by this and modernized to remove some of the cruft of the original (they had constraints that I don't have to deal w/ in Julia). The focus of MinimalRLCore was also to create an API which lets people do what they need to for research, even if I didn't imagine it initially. I find that I run into walls a lot when adopting an RL API, although Julia helps a lot here w/ multiple dispatch. One example is dealing w/ a non-global RNG which is shared btw the agent and environment, or defining a reset which sets the state to a provided value (very necessary for MonteCarloRollouts when working on prediction).

It is true the API I provided isn't really designed w/ async in mind; this was partially on purpose and partially a result of how I'm actually using it in my research. But users can overload step! for any of their envs that may be async, so I don't really see it as an issue that needs to be addressed. If this were to be supported later, I would probably have a separate abstract type. I don't feel like the assumption should be that all envs are async, or that you have multiple agents running around in an env instance (like A3C, for example). This usually adds complexity that I don't really want to deal w/ as a researcher.

@zsunberg
Member

zsunberg commented Jun 13, 2020

Oh man, this is great to get us all in the same room talking :) (@MaximeBouton @rejuvyesh, @lassepe you might be interested in this). I think we should make an actually really minimal interface that can be used for MCTS and RL and put it in a package (after a quick look, MinimalRLCore and RLInterface are almost there, but not quite). Should we move the discussion to discourse?

@zsunberg
Member

I would submit that the minimal interface for MCTS would have:

step!(env, a) # returns an observation, reward, and done
actions(env) # returns only the valid actions at the current state of the environment
reset!(env)
clone(env) # creates a complete copy at the current state - it is assumed that the two environments are now completely independent of each other.

The other option is to explicitly separate the state from the environment.
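
A hypothetical toy environment implementing those four functions (a trivial counting game, just to illustrate the surface area):

mutable struct CountToTenEnv
    n::Int
end
CountToTenEnv() = CountToTenEnv(0)

function step!(env::CountToTenEnv, a)
    env.n += a
    done = env.n >= 10
    return env.n, (done ? 1.0 : 0.0), done       # observation, reward, done
end

actions(env::CountToTenEnv) = (1, 2)             # valid actions at the current state
reset!(env::CountToTenEnv) = (env.n = 0; nothing)
clone(env::CountToTenEnv) = CountToTenEnv(env.n) # fully independent copy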

@rejuvyesh

A popular example of an interface that has this concept of an explicit observation interface, as @findmyway mentioned, is dm-lab, I believe.

@zsunberg
Member

Yeah, I must say that the explicit observation interface in RLBase is a very nice feature for some of the more complicated use cases.

This afternoon, I was thinking about a way to have a common core of basic (and some optional) functionality that we can all link into. My idea is a CommonRL package that all of our packages, each optimized for different use cases, depend on, allowing for interoperability at least at the reset! and step! level. Here is my sketch: https://gist.github.com/zsunberg/a6cae2f92b5f8fae8f624dc173bc5c6b .

@mkschleg

I would 100% be up to helping with this.

One thing I still have an issue w/ in Julia is the implicit enforcement of interfaces (which is why MinimalRLCore separates what is called from what is implemented by users). But I think if we were to have a common package w/ good docs, this shouldn't be an issue (and I guess I should be more trusting of users :P).

I think having some way of expressing what observation types are being returned would be useful, but I've never landed on a design I like. The dict of types is reasonable, but feels really pythony. I was also playing around w/ the idea of dispatching on value types with symbols, though this was a bit onerous. Maybe we should use traits here.

@findmyway
Member

@zsunberg's sketch is a really nice starting point. I'd also be glad to support such a common core package.

The dict of types is reasonable, but feels really pythony. I was also playing around w/ the idea of dispatching on value types with symbols, though this was a bit onerous. Maybe we should use traits here.

@mkschleg I'm feeling the same 😄.

@jbrea
Collaborator

jbrea commented Jun 15, 2020

@zsunberg I also like the idea of a common core package!

@mkschleg:

I think having some way of expressing what observation types are being returned would be useful, but I've never landed on a design I like. The dict of types is reasonable, but feels really pythony. I was also playing around w/ the idea of dispatching on value types with symbols, though this was a bit onerous. Maybe we should use traits here.

What I like best so far is to have observe return a named tuple or a custom structure. With these we could use traits like has_state(observation::NamedTuple{N,T}) where {N,T} = :state in N, with the fallback has_state(o) = hasproperty(o, :state). To help the developer of a new environment catch API expectations early on, I like the idea of having test functions, e.g. basic_env_test, that could also emit warnings like has_state(observation) || @warn "Many algorithms expect observations to have a field named :state.".
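
A minimal sketch of those two pieces together (the basic_obs_test name below is just illustrative of the basic_env_test idea):

has_state(::NamedTuple{N,T}) where {N,T} = :state in N
has_state(o) = hasproperty(o, :state)            # generic fallback

function basic_obs_test(observation)
    has_state(observation) ||
        @warn "Many algorithms expect observations to have a field named :state."
    return nothing
end

basic_obs_test((state = [0.0], reward = 0.0))    # passes silently
basic_obs_test((reward = 0.0,))                  # emits the warning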

@zsunberg
Member

zsunberg commented Jun 15, 2020

Ok, great, I think the common core package should live in the JuliaReinforcementLearning org. Can you invite me to the org, @findmyway ? Thanks.

I think having some way of expressing what observation types are being returned would be useful

@mkschleg Do you mean the caller of step! chooses the type to be returned, or the environment communicates to the caller which type it will return?

I was also playing around w/ the idea of dispatching on value types with symbols, though this was a bit onerous. Maybe we should use traits here.

If I understand what you're saying correctly, we do this in POMDPs.jl, haha. For example you can use

sp, o = @gen(:sp, :o)(m, s, a, rng)

where m is a POMDP, to get the next state and observation, or

o, r = @gen(:o, :r)(m, s, a, rng)

to get the next observation and reward. The macro expands to a call that dispatches on a value type with symbols. It works pretty well, but is a bit esoteric - you have to know what the symbols mean.

@findmyway
Member

@jbrea Could you help to send the invitation?

@zsunberg
Member

@jbrea, @findmyway, it looks like you invited me to collaborate on the ReinforcementLearning.jl package; I was hoping to join the JuliaReinforcementLearning org so that I can create a new package owned by the org.

@jbrea
Collaborator

jbrea commented Jun 15, 2020

@zsunberg, sure; sorry, GitHub has too many buttons 😜

@mkschleg

mkschleg commented Jun 15, 2020

@jbrea The named tuple is reasonable. I've had this as an option for agents as well, to make evaluation a bit easier for some of the wrapping functionality (like running episodes).

@zsunberg The way you do it in POMDPs.jl is interesting! I haven't dug into it as much yet, but I should prioritize that.

What I have been doing is something like

struct Env{V<:Val}
   dispatch_on::V
end

and then dispatch on specific value types. Definitely not the best way to do it, but it has been useful when there are several observation types for an environment (like Atari w/ color and BW frames).
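
Hypothetical usage of that pattern (color_frame and bw_frame are assumed helpers, not real functions):

observe(env::Env{Val{:color}}) = color_frame(env)   # assumed helper
observe(env::Env{Val{:bw}})    = bw_frame(env)      # assumed helper

env = Env(Val(:bw))      # this instance now dispatches to the BW method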

I'd be happy to help out with this and help refine the interface. I'd love to have a common core that I can just pull from rather than having to maintain my own. So if you are looking for collaborators on the repo, let me know.

@zsunberg
Member

Alright - start filing issues!! https://github.com/JuliaReinforcementLearning/CommonRLInterface.jl

@zsunberg
Member

zsunberg commented Jun 15, 2020

@mkschleg hmm, yeah, that seems like a reasonable way to do it. Although options like color vs black-and-white frames are very domain-specific, so I'm not sure they belong in this interface. It would make sense to have a general way to deal with data type expectations (e.g. AbstractArray{Float32} vs AbstractArray{Float64}). Feel free to file an issue on that repo to discuss further.

@mkschleg

Thinking about the traits thing a bit more, I'm not sure it belongs in the base interface. The designer of the environment will be able to manage this through traits/dispatch; the interface doesn't have to plan for it (yay Julia!).

@findmyway
Member

Thanks for all the discussions here.

I removed the observation layer in the latest version, making the environment more transparent to agents/policies.

Support for CommonRLInterface.jl is also included in JuliaReinforcementLearning/ReinforcementLearningBase.jl-Archive#58

In the next minor release, I hope ReinforcementLearningBase.jl and CommonRLInterface.jl can converge to something stable after experimenting with more algorithms.
