[Bug Report] Lunar Lander reset determinism #728
Comments
I'm not 100% sure, but I think there was a similar issue either here or back in Gym. The conclusion was probably that the underlying issue is some Box2D black magic, and there isn't much we can do until we get rid of it and replace it with literally anything else (which, IIRC, is somewhere on the roadmap, but might not happen anytime soon).
Let me post my new findings here:
I found that for seed=43, running the episode twice gives a step count of 353 the first time but 346 the second, so I checked at which step the two runs start to differ.
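A minimal sketch of that check (the names `obs_list_run1` and `obs_list_run2` are placeholders for the per-step observations recorded in each run):

```python
import numpy as np

# Hedged sketch: find the first step at which two recorded runs of the
# same seed diverge. obs_list_run1 / obs_list_run2 are placeholder names
# for lists of per-step observations from the two episodes.
def first_divergent_step(obs_list_run1, obs_list_run2):
    for t, (o1, o2) in enumerate(zip(obs_list_run1, obs_list_run2)):
        if not np.array_equal(o1, o2):
            return t
    return None
```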
It reports the first divergence at step 235. So our question is: at step 234 the two observations are exactly the same, and with exactly the same action, why do we get different next observations? The two observations differ only in the 7th value: one is 1.0, the other is 0.0. We need more debugging work on this problem.
From checking the source code and the values in obs_list_43, it seems related to the lander's legs' ground_contact flags, because of the following code:
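The last two observation entries mirror those contact flags. Paraphrased from the `step()` method in `gymnasium/envs/box2d/lunar_lander.py` (exact constants may differ slightly between versions):

```python
state = [
    (pos.x - VIEWPORT_W / SCALE / 2) / (VIEWPORT_W / SCALE / 2),
    (pos.y - (self.helipad_y + LEG_DOWN / SCALE)) / (VIEWPORT_H / SCALE / 2),
    vel.x * (VIEWPORT_W / SCALE / 2) / FPS,
    vel.y * (VIEWPORT_H / SCALE / 2) / FPS,
    self.lander.angle,
    20.0 * self.lander.angularVelocity / FPS,
    # The 7th and 8th values come straight from the contact listener:
    1.0 if self.legs[0].ground_contact else 0.0,
    1.0 if self.legs[1].ground_contact else 0.0,
]
```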
It's very possible that self.legs[0].ground_contact is in different states in our two tests (one True, the other False), but why? We don't know yet. As @RedTachyon said, the contactListener is implemented in Box2D, so I haven't set up a debugging environment for it yet.
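For reference, the Python-side listener that flips those flags (again paraphrased from the same file; Box2D invokes these callbacks from its C++ core, which is why the divergence is hard to trace from Python alone):

```python
class ContactDetector(contactListener):
    def __init__(self, env):
        contactListener.__init__(self)
        self.env = env

    def BeginContact(self, contact):
        # The lander body touching anything ends the episode.
        if (
            self.env.lander == contact.fixtureA.body
            or self.env.lander == contact.fixtureB.body
        ):
            self.env.game_over = True
        # Leg fixtures set the ground_contact flags read into the state.
        for i in range(2):
            if self.env.legs[i] in [contact.fixtureA.body, contact.fixtureB.body]:
                self.env.legs[i].ground_contact = True

    def EndContact(self, contact):
        for i in range(2):
            if self.env.legs[i] in [contact.fixtureA.body, contact.fixtureB.body]:
                self.env.legs[i].ground_contact = False
```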
Hi, new updates on this problem:
@clockzhong Wow, thanks for finding this. This seems like a serious bug that could significantly affect performance.
Yes, I'm setting up an SB3 environment here, and I'll run some tests to check the difference in RL algorithms' performance with this bug fixed. I guess the fix won't influence those algorithms' performance much, because the agents (PPO/DQN) have no idea which env is right or wrong. Anyway, we need to reach a final conclusion after completing the A/B test.
To confirm, could you just train an agent (3 repeats) on the original v2 environment and then on a v3 environment, plot the training results for both, and confirm the outcome?
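A minimal sketch of such an A/B run with SB3 (assuming the fixed build is registered under a `LunarLander-v3` ID, which is hypothetical here; timesteps and hyperparameters are placeholders):

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Hedged sketch of the requested A/B test: train PPO three times per
# env version; training curves can then be compared from the saved logs.
for env_id in ("LunarLander-v2", "LunarLander-v3"):  # v3 ID is assumed
    for seed in (0, 1, 2):
        env = gym.make(env_id)
        model = PPO("MlpPolicy", env, seed=seed, verbose=0,
                    tensorboard_log=f"./runs/{env_id}")
        model.learn(total_timesteps=200_000)
        model.save(f"ppo_{env_id}_seed{seed}")
        env.close()
```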
@clockzhong Hey, any progress on this? |
@Kallinteris-Andreas We should add this for v3 though |
Describe the bug
Hi All,
Thanks for the library! I'm currently trying to use the `LunarLander-v2` environment with some degree of determinism involved. I want to be able to run the environment for different seeds, and learn some "weights" for a controller based on the demo controller present in the library. In my setting I often have a "batch" of weights I'd like to try for each environment, so I have written a simple loop to do so (as this is fast enough for my needs).

For instance, I might want to run the environment for seeds 42 and 43, with two different sets of weights w1 and w2. My current approach was to just call `env.reset(seed=seed)` at the start of each run: so call `env.reset(seed=42)`, then run an episode with weights w1 for the controller; then call `env.reset(seed=42)` and run an episode with weights w2; then call `env.reset(seed=43)` and run an episode with weights w1; and so on. My understanding is that if I call `env.reset(seed=x)` and run the environment with the same controller, I should get the same results? However, there seems to be some state which doesn't get reset in the environment. Most of the time things seem to be OK, but sometimes running the environment with the same seed and weights yields a different result, depending on what seeds it has been run with before. Please find a code example attached.

When running the example I get the following output:

Despite both runs of each seed using the same controller weights, the two runs with the seed set to 43 yield different rewards. Am I misunderstanding the use of `env.reset()`, or is this a bug? Interestingly, if I don't run the environment with `seed=42` beforehand, I get the same result for `seed=43` both times (-264.19797533115525); and if I create a new instance of the environment via `gym.make()` between the two calls to `evaluate_demo_heuristic_lander`, the results are also consistent (-264.19797533115525 for `seed=43` on both occasions). Thanks in advance!
Code example
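A minimal reproduction in the spirit of the description (a hedged sketch: only `evaluate_demo_heuristic_lander`, the seeds, and the reuse of one env across `reset(seed=...)` calls come from the report; `heuristic` is the demo controller shipped in `gymnasium/envs/box2d/lunar_lander.py`):

```python
import gymnasium as gym
from gymnasium.envs.box2d.lunar_lander import heuristic

def evaluate_demo_heuristic_lander(env, seed):
    # Run one episode with the library's demo heuristic controller
    # and return the total reward.
    obs, info = env.reset(seed=seed)
    total_reward, done = 0.0, False
    while not done:
        action = heuristic(env.unwrapped, obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated
    return total_reward

env = gym.make("LunarLander-v2")
# Reusing one env instance: the second seed=43 run can differ from the
# first, even though reset(seed=43) should make them identical.
for seed in (42, 43, 42, 43):
    print(f"seed={seed}: total_reward={evaluate_demo_heuristic_lander(env, seed)}")
env.close()
```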
System info
Installed via `pip`.

Additional context
No response
Checklist