[Proposal] Environment API changes #32

DavidSlayback · 2022-08-29T15:18:14Z

Hey, first off, I love this project and the general idea of defining environments in JAX so that they can be easily batched and integrated into RL training loops!

I tend to do a lot of work with POMDPs and have built a few branches in my own fork that implement various POMDP environments. They work fine for my purposes, but I've run into a couple of cases where I end up ignoring the base Environment API.

Specifically:

1. In typical POMDP formulations, the observation o is a function of state AND action (i.e., o ~ O(s,a)), but currently get_obs() only uses state.
2. Similarly, there are many instances where we need the state AND/OR action to determine if the episode is done (e.g., in the Tiger problem, the episode ends when you open a door, but the state is just which door the tiger is behind). is_terminal() and discount() only use state.
3. Finally, in the same way that your FourRooms example has noisy actions, there are many environments where a noisy observation is core to the environment, requiring get_obs() to also take an RNG key.

Obviously I'm just overriding the methods with the extra arguments as needed for my own environments, but some of this might be common enough to justify a different base API?
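
To make the three points concrete, here is a minimal sketch of what the extended signatures could look like for a Tiger-style problem. This is not gymnax's current API: the action encoding, the `accuracy` parameter, and the free-function layout are assumptions made for illustration only.

```python
import jax
import jax.numpy as jnp

# Hypothetical action encoding for a Tiger-style POMDP (not part of gymnax).
LISTEN, OPEN_LEFT, OPEN_RIGHT = 0, 1, 2


def get_obs(state, action, key, accuracy=0.85):
    """Sample o ~ O(s, a) using an explicit RNG key.

    `state` is the tiger's door (0 = left, 1 = right). After LISTEN the
    observation names the correct door with probability `accuracy`;
    after any other action it is uniform over both doors.
    """
    k_noise, k_uniform = jax.random.split(key)
    correct = jax.random.bernoulli(k_noise, accuracy)
    heard = jnp.where(correct, state, 1 - state)
    uninformative = jax.random.randint(k_uniform, (), 0, 2)
    return jnp.where(action == LISTEN, heard, uninformative)


def is_terminal(state, action):
    """The episode ends when a door is opened, whatever the hidden state."""
    return action != LISTEN


# Batched usage: one state, action and key per environment instance.
keys = jax.random.split(jax.random.PRNGKey(0), 4)
states = jnp.array([0, 1, 0, 1])
actions = jnp.array([LISTEN, LISTEN, OPEN_LEFT, OPEN_RIGHT])
obs = jax.vmap(get_obs)(states, actions, keys)
done = jax.vmap(is_terminal)(states, actions)
```

Because both functions are pure and take the key explicitly, they batch with `jax.vmap` in the same way as the rest of an environment step.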

RobertTLange · 2022-09-03T08:15:35Z

Hi @DavidSlayback! Thank you so much for raising this, for the suggestions, and for the kind words. And please excuse the late response. These are all good and valid points. What POMDP environments have you been working with? I am generally open to adapting the API and adding more (somewhat classic) environments to gymnax. It is a bit of a fine line, since I also want to avoid blowing up the package too much, and each new env will require some testing against a numpy version. With regard to your three points:

1. Yes, that sounds reasonable. It would require a small modification to each environment (e.g. adding an optional action input to get_obs and is_terminal). Alternatively, one could absorb the action into the state, but I think treating it as a separate input is cleaner.
2. See 1.
3. Sounds good to me as well. I am also planning on adding more general wrappers for sticky actions and termination handling at some point.
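
As a rough sketch of the "optional action input" idea from point 1: the state layout (a dict with a `"time"` field), the `max_steps` default, and the convention that action 0 means "listen" are all hypothetical and only serve the example.

```python
import jax.numpy as jnp


def is_terminal(state, action=None, max_steps=10):
    """Termination check with an optional action argument.

    Existing environments can keep calling `is_terminal(state)`; only
    environments whose termination depends on the action pass it in.
    The `action is None` test is plain Python, so it is resolved when
    the function is traced and does not break `jax.jit`.
    """
    out_of_time = state["time"] >= max_steps
    if action is None:
        return out_of_time
    # Hypothetical rule: any action other than 0 ("listen") ends the episode.
    return jnp.logical_or(out_of_time, action != 0)
```

With a default of `None`, callers that only pass the state see the same behavior as before.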

My hands are currently tied up with my internship, but feel free to open a PR! I would be happy to merge it in if all the unit tests pass. Also feel free to open PRs for your environments. I would love to see what you have been brewing up.

DavidSlayback · 2022-09-03T18:56:17Z

Most of my POMDPs are the classic ones from the POMDP literature, as seen here: things like Tiger, HeavenHell, RockSample, and Hallway. I also have a few of my own, like a multistory FourRooms variant (with various observation functions), partially observable Taxi, some modifications of continuous control domains, etc. I'm specifically interested in ones that require extremely long-term memory and reasoning.

I'll definitely look at doing some PRs for some of the more "classic" ones then! I think letting users choose among different observation functions when creating an already-implemented environment (like you already do in your FourRooms domain) could be a good way to add some of these without growing the repository too much.

carlosgmartin · 2023-05-05T02:30:26Z

pgx and OpenSpiel put the observation, action mask, reward, discount, current player (for sequential games), etc. in the state. That corresponds to o = env.step(s, a, key).observation.
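
A minimal sketch of that pattern, written from scratch rather than taken from pgx or OpenSpiel: the `State` fields, the action encoding (0 = listen, 1 = open left, 2 = open right), and the reward values are illustrative assumptions.

```python
from typing import NamedTuple

import jax
import jax.numpy as jnp


class State(NamedTuple):
    # Hypothetical fields; a NamedTuple is a pytree, so it works with jit/vmap.
    tiger_door: jnp.ndarray   # hidden part of the state (0 = left, 1 = right)
    observation: jnp.ndarray  # what the agent sees after the last step
    reward: jnp.ndarray
    terminated: jnp.ndarray


def step(state, action, key):
    """Return the next State; the observation is read off the result."""
    opened = action != 0
    # Noisy hint about the tiger's door, correct with probability 0.85.
    correct = jax.random.bernoulli(key, 0.85)
    observation = jnp.where(correct, state.tiger_door, 1 - state.tiger_door)
    # Opening the tiger's door is penalized, the other door is rewarded,
    # and listening has a small cost.
    hit_tiger = (action - 1) == state.tiger_door
    reward = jnp.where(opened, jnp.where(hit_tiger, -100.0, 10.0), -1.0)
    return State(state.tiger_door, observation, reward, opened)


s = State(jnp.int32(1), jnp.int32(0), jnp.float32(0.0), jnp.bool_(False))
o = jax.jit(step)(s, jnp.int32(0), jax.random.PRNGKey(0)).observation
```

Since the action and key are both inputs to `step`, the observation and the termination flag can depend on them without any change to separate get_obs or is_terminal signatures.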
