Question about AbstractEnv API #68
I was asked this question by at least three people, so I'd better write down my thoughts here (and in the upcoming new documentation). Originally, we had a similar API.

In my opinion, one critical issue is the return type of `step!`. In single-agent sequential environments, the interaction loop looks like this:

```julia
# Given policy and env
observation = reset!(env)
while true
    action = policy(observation)
    observation, reward, done, info = step!(env, action)
    done && break
end
```

But when it comes to simultaneous environments, we have to change the return type of `step!`:

```julia
# Given policy and env
observation = reset!(env)
while true
    action = policy(observation)
    task = step!(env, action)
    observation, reward, done, info = fetch(task)
    done && break
end
```

Until now, it's not a big change. After all, a function's signature in Julia doesn't include the return type, so we can simply return a `Task`.

Now consider multi-agent environments, where things become much more complicated. See more discussions at openai/gym#934:
The root reason for those complexities is that
So we modified the APIs in OpenAI Gym a little:
I must admit that treating all environments as async does bring some inconveniences. For those environments which are essentially sync, we have to store the result of each step.

(Also cc @sebastian-engel, @zsunberg, @mkschleg, @MaximeBouton, @jbrea in case they are interested in the discussion here.) |
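To make the inconvenience concrete, here is a minimal sketch of how an essentially synchronous environment can still expose the async-style API by returning a `Task`. This is purely illustrative: `SyncEnv` and `_step` are invented names, not part of any real package.

```julia
# Hypothetical sketch: a synchronous environment exposing the async-style API.
# `SyncEnv` and `_step` are made-up names for illustration only.
mutable struct SyncEnv
    state::Int
end

# The underlying synchronous transition: (observation, reward, done, info).
function _step(env::SyncEnv, action)
    env.state += action
    return env.state, Float64(action), env.state >= 10, nothing
end

# Async-style wrapper: always return a Task, even though the work is synchronous.
step!(env::SyncEnv, action) = @async _step(env, action)

env = SyncEnv(0)
observation, reward, done, info = fetch(step!(env, 3))
```

The caller's loop is identical to the simultaneous-environment loop above; the cost is that the sync environment's result has to pass through a `Task` it never needed.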
This still dates back to the time when I hadn't looked at how other RL packages work, and I thought the agent was basically interacting with the environment, since it sends an action and receives an observation; therefore |
Thanks for your thorough explanation! |
It depends. It can be blocking either when calling
Unfortunately, no😳. |
Thanks! I am asking because although Gym environments typically do not offer this functionality, it is essential for tree-based planning algorithms such as AlphaZero. |
No. To support this operation we need to separate the environment into two parts: 1) a description part, and 2) a state part.

Based on my limited experience with MCTS, I found that implementing a |
In POMDPs.jl, we made a different decision where the separation of the state and model is central. Instead of using the environment directly, you can use

```julia
sp, r = @gen(:sp, :r)(m, s, a, rng)
```

to generate a new state and reward (in POMDPs 0.8.4+; we are still finalizing some issues and updating documentation to move to POMDPs 1.0). We've also tried to make it really easy to define simple models in a few lines with QuickPOMDPs. That being said, it is true that using a |
In any case, I don't think it will be too hard to adjust to different interfaces in the future. Probably best to just get it working with one MDP, and then think hard about the interface in the second round. As mentioned in the RLZoo README, "Make it work, make it right, make it fast" is a good mantra. |
I made the choice to go w/ the API in MinimalRLCore for a few reasons. The biggest is just where I'm studying and who I learned RL from initially (i.e. Adam White at UofA/IU). In our course we heavily use the RLGlue interface, which Adam made during his graduate degree w/ Brian Tanner. The API is very much inspired by this and modernized to remove some of the cruft of the original (they had constraints that I don't have to deal w/ in Julia). The focus of MinimalRLCore was also to create an API which lets people do what they need to for research, even if I didn't imagine it initially.

I find that I run into walls a lot when adopting an RL API, although Julia helps a lot here w/ multiple dispatch. One example is dealing w/ a non-global RNG which is shared btw the agent and environment, or defining a reset which sets the state to a provided value (very necessary for MonteCarloRollouts when working on prediction).

It is true the API I provided isn't really designed w/ async in mind; this was partially on purpose and partially due to how I'm actually using it in my research. But users can overload step! for any of their envs that may be async, so I don't really see it as an issue that needs to be addressed. If this were to be supported later I would probably have a separate abstract type. I don't feel like the assumption should be that all envs are async, or that you have multiple agents running around in an env instance (like A3C for example). This usually adds complexity that I don't really want to deal w/ as a researcher. |
Oh man, this is great to get us all in the same room talking :) (@MaximeBouton @rejuvyesh, @lassepe you might be interested in this). I think we should make an actually really minimal interface that can be used for MCTS and RL and put it in a package (after a quick look, MinimalRLCore and RLInterface are almost there, but not quite). Should we move the discussion to discourse? |
I would submit that the minimal interface for MCTS would have:

```julia
step!(env, a)  # returns an observation, reward, and done
actions(env)   # returns only the valid actions at the current state of the environment
reset!(env)
clone(env)     # creates a complete copy at the current state - it is assumed that
               # the two environments are now completely independent of each other
```

The other option is to explicitly separate the state from the environment. |
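To show why `clone` is the key function for tree search, here is a toy sketch of the four-function interface together with a naive random rollout; `CountEnv` and its dynamics are invented for illustration and not from any real package.

```julia
# Hypothetical minimal environment implementing the four functions above, plus a
# naive random rollout that relies on clone(env). CountEnv is invented here.
mutable struct CountEnv
    state::Int
end

reset!(env::CountEnv) = (env.state = 0; env.state)
actions(env::CountEnv) = env.state < 5 ? [1, 2] : [1]  # valid actions depend on state
clone(env::CountEnv) = CountEnv(env.state)             # independent copy of the current state

function step!(env::CountEnv, a)
    env.state += a
    return env.state, Float64(a), env.state >= 10      # observation, reward, done
end

# A rollout never mutates the original environment, thanks to clone.
function rollout(env)
    sim = clone(env)
    total = 0.0
    done = false
    while !done
        a = rand(actions(sim))
        _, r, done = step!(sim, a)
        total += r
    end
    return total
end

env = CountEnv(0)
reset!(env)
total_reward = rollout(env)
```

After the rollout, `env.state` is still `0`: the planner simulated forward on the clone without touching the real environment, which is exactly what algorithms like AlphaZero need.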
A popular example of an interface that has this concept of an explicit observation, as @findmyway mentioned, is |
Yeah, I must say that the explicit observation interface in RLBase is a very nice feature for some of the more complicated use cases. This afternoon, I was thinking about a way to have a common core of basic and some optional functionality that we can all link into. My idea is a CommonRL package that all of our packages, each optimized for different use cases, depend on, allowing for interoperability at least on the |
I would 100% be up for helping with this. One thing that I still have an issue w/ in Julia is dealing with the implicit enforcement of interfaces (thus why MinimalRLCore separates what is called and what is implemented by users). But I think if we were to have a common package w/ good docs, this shouldn't be an issue (and I guess I should be more trusting of users :P). I think having some way of expressing what observation types are being returned would be useful, but I have never landed on a design I like. The dict of types is reasonable, but feels really pythony. I was also playing around w/ the idea of dispatching on value types with symbols, though this was a bit onerous. Maybe we should use traits here. |
@zsunberg 's sketch is a really nice starting point. I'd also be glad to support such common core package.
@mkschleg I'm feeling the same 😄. |
@zsunberg I also like the idea of a common core package!
What I like the best so far is to have |
Ok, great, I think the common core package should live in the JuliaReinforcementLearning org. Can you invite me to the org, @findmyway ? Thanks.
@mkschleg Do you mean the caller of
If I understand what you're saying correctly, we do this in POMDPs.jl, haha. For example you can use

```julia
sp, o = @gen(:sp, :o)(m, s, a, rng)
```

to get the next state and observation, or

```julia
o, r = @gen(:o, :r)(m, s, a, rng)
```

to get the next observation and reward. The macro expands to a call that dispatches on a value type with symbols. It works pretty well, but is a bit esoteric - you have to know what the symbols mean. |
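For readers unfamiliar with that pattern, here is a macro-free sketch of what dispatching on a value type of symbols looks like; `mygen` and `SimpleModel` are invented names, not the actual POMDPs.jl machinery.

```julia
# Invented sketch of symbol-tuple value-type dispatch; not the real @gen expansion.
struct SimpleModel end

# One method per requested tuple of outputs, selected at compile time via Val.
function mygen(::Val{(:sp, :r)}, m::SimpleModel, s, a, rng)
    sp = s + a       # next state (toy dynamics)
    r = -abs(sp)     # reward (toy reward)
    return sp, r
end

sp, r = mygen(Val((:sp, :r)), SimpleModel(), 1, 2, nothing)
```

Because the symbol tuple is a type parameter, Julia specializes each combination of requested outputs, which is what makes the approach fast despite looking dynamic.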
@jbrea Could you help to send the invitation? |
@jbrea , @findmyway , it looks like you invited me to collaborate on the ReinforcementLearning.jl package - I was hoping to join the JuliaReinforcementLearning org so that I can create a new package owned by the org. |
@zsunberg, sure; sorry, github has too many buttons 😜 |
@jbrea The named tuple is reasonable. I've been having this as an option for agents as well to make evaluation a bit easier for some of the wrapping functionality (like running episodes).

@zsunberg The way you do it in POMDPs.jl is interesting! I hadn't quite dug into it as much yet, but I should prioritize that. What I have been doing is something like

```julia
struct Env{V<:Val}
    dispatch_on::V
end
```

and dispatching on specific value types. Definitely not the best way to do it, but it has been useful when there are several observation types for an environment (like Atari w/ color and BW frames). I'd be happy to help out with this and help refine the interface. I'd love to have a common core that I can just pull from rather than have to maintain my own. So if you are looking for collaborators on the repo let me know. |
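As a usage sketch of the `Val`-field idea described above (with invented names; `AtariLike` and `observe` are not from any real package), each observation variant gets its own specialized method:

```julia
# Invented example of dispatching observation rendering on a Val type parameter,
# e.g. color vs black-and-white frames for an Atari-like environment.
struct AtariLike{V<:Val}
    dispatch_on::V
end

observe(env::AtariLike{Val{:color}}) = "color frame"
observe(env::AtariLike{Val{:bw}})    = "black-and-white frame"

color_env = AtariLike(Val(:color))
bw_env    = AtariLike(Val(:bw))
```

The variant is fixed in the environment's type, so the right `observe` method is selected statically rather than by a runtime branch.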
Alright - start filing issues!! https://github.com/JuliaReinforcementLearning/CommonRLInterface.jl |
@mkschleg hmm, yeah that seems like a reasonable way to do it. Although, options like color vs black-and-white frames are very domain-specific, so I'm not sure they belong in this interface. It would make sense to have a general way to deal with data type expectations (e.g. AbstractArray{Float32} vs AbstractArray{Float64}). Feel free to file an issue on that repo to discuss further. |
Thinking about the traits thing a bit more. I'm not sure it belongs in the base interface. The designer of the environment will be able to manage this through using traits/dispatch. The interface doesn't have to plan for it (Yay Julia!) |
Thanks for all the discussions here. I removed the observation layer in the latest version, making the environment more transparent to agents/policies. Support for CommonRLInterfaces.jl is also included in JuliaReinforcementLearning/ReinforcementLearningBase.jl-Archive#58. In the next minor release, I hope ReinforcementLearningBase.jl and CommonRLInterfaces.jl can converge to a stable design after experimenting with more algorithms. |
In the documentation for `AbstractEnv`, you write the following remark. Would you care to elaborate on what you mean here?