[Bug Report] Lunar Lander reset determinism #728

Closed
1 task done
Thomas-Christie opened this issue Sep 27, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@Thomas-Christie

Describe the bug

Hi All,

Thanks for the library! I'm currently trying to use the LunarLander-v2 environment with some degree of determinism. I want to run the environment for different seeds and learn some "weights" for a controller based on the demo controller present in the library. In my setting I often have a "batch" of weights I'd like to try for each environment, so I have written a simple loop to do so (this is fast enough for my needs).

For instance, I might want to run the environment for seeds 42 and 43, with two different sets of weights w1 and w2. My current approach was to call env.reset(seed=seed) at the start of each run: call env.reset(seed=42) and run an episode with weights w1 for the controller, then call env.reset(seed=42) and run an episode with weights w2, then call env.reset(seed=43) and run an episode with weights w1, and so on. My understanding is that if I call env.reset(seed=x) and run the environment with the same controller, I should get the same results. However, there seems to be some state which doesn't get reset in the environment. Most of the time things seem to be OK, but sometimes running the environment with the same seed and weights yields a different result, depending on which seeds it has been run with before. Please find a code example attached.

When running the example I get the following output:

Seed: 42 Reward: 112.8980752996316
Seed: 42 Reward: 112.8980752996316
Seed: 43 Reward: -154.39439542058136
Seed: 43 Reward: -264.19797533115525

Despite both runs of each seed using the same controller weights, the two runs with seed 43 yield different rewards. Am I misunderstanding the use of env.reset(), or is this a bug? Interestingly, if I don't run the environment with seed=42 beforehand, I get the same result for seed=43 both times (-264.19797533115525), and if I create a new instance of the environment via gym.make() between the two calls to evaluate_demo_heuristic_lander the results are also consistent (-264.19797533115525 for seed=43 on both occasions); a sketch of that workaround is shown below.
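For reference, here is a minimal sketch of that workaround (recreating the environment between evaluations), using the evaluate_demo_heuristic_lander function and the weights array a from the code example below:

# Workaround sketch: a fresh environment per seed keeps seed=43
# reproducible regardless of what was run before.
env = gym.make("LunarLander-v2")
evaluate_demo_heuristic_lander(env, a, seed=42)
env.close()

env = gym.make("LunarLander-v2")  # new instance, no carried-over Box2D state
evaluate_demo_heuristic_lander(env, a, seed=43)
env.close()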

Thanks in advance!

Code example

import gymnasium as gym
import numpy as np
from gymnasium.utils.step_api_compatibility import step_api_compatibility

def heuristic_controller(s, weights):
    # Variant of the library's demo heuristic controller, with its constants
    # replaced by a learnable weight vector.
    angle_targ = s[0] * weights[0] + s[2] * weights[1]
    if angle_targ > weights[2]:
        angle_targ = weights[2]
    if angle_targ < -weights[2]:
        angle_targ = -weights[2]
    hover_targ = weights[3] * np.abs(s[0])

    angle_todo = (angle_targ - s[4]) * weights[4] - (s[5]) * weights[5]
    hover_todo = (hover_targ - s[1]) * weights[6] - (s[3]) * weights[7]

    if s[6] or s[7]:
        angle_todo = weights[8]
        hover_todo = -(s[3]) * weights[9]

    a = 0
    if hover_todo > np.abs(angle_todo) and hover_todo > weights[10]:
        a = 2
    elif angle_todo < -weights[11]:
        a = 3
    elif angle_todo > +weights[11]:
        a = 1
    return a

def evaluate_demo_heuristic_lander(env, weight_configs, seed=None):
    # Run one episode per weight configuration, resetting with the same seed
    # each time, and print the total reward.
    for weight_config in weight_configs:
        total_reward = 0
        steps = 0
        s, info = env.reset(seed=seed)
        while True:
            a = heuristic_controller(s, weight_config)
            s, r, terminated, truncated, info = step_api_compatibility(env.step(a), True)
            total_reward += r

            steps += 1
            if terminated or truncated:
                print(f"Seed: {seed} Reward: {total_reward}")
                break
    return total_reward

if __name__ == "__main__":
    np.set_printoptions(precision=20)
    a = 2.0 * np.array([[0.23369889038420863 , 0.26774014620508657 , 0.2735376051121257  ,
                         0.2997768858021939  , 0.6177483298073877  , 0.7301028167611718  ,
                         0.5806673609739257  , 0.9615451067519928  , 0.07313039905288293 ,
                         0.3626741110509404  , 0.07945107270237381 , 0.009656828642440243]])
    a = np.vstack([a, a])
    env = gym.make("LunarLander-v2")
    evaluate_demo_heuristic_lander(env, a, seed=42)
    evaluate_demo_heuristic_lander(env, a, seed=43)

System info

  • Gymnasium installed using pip.
  • Gymnasium Version: 0.29.1
  • OS: macOS Ventura 13.5.2
  • Python 3.10.10

Additional context

No response

Checklist

  • I have checked that there is no similar issue in the repo
Thomas-Christie added the bug label on Sep 27, 2023
@RedTachyon
Member

I'm not 100% sure, but I think there was a similar issue either here or back in Gym. The conclusion was probably that the underlying issue is some Box2D black magic, and there isn't much we can do until we get rid of Box2D and replace it with literally anything else (which IIRC is somewhere on the roadmap, but might not happen anytime soon).

@clockzhong
Contributor

Let me post my new findings here:
I've added some more debugging output to the test code, as follows:

def evaluate_demo_heuristic_lander(env, weight_configs, seed=None):
    obs_list = []
    for weight_config in weight_configs:
        total_reward = 0
        steps = 0
        env.close()
        s, info = env.reset(seed=seed)
        obs_sub_list = []
        obs_sub_list.append(s)
        obs_list.append(obs_sub_list)
        while True:
            a = heuristic_controller(s, weight_config)
            #s, r, terminated, truncated, info = step_api_compatibility(env.step(a), True)
            s, r, terminated, truncated, info  = env.step(a)
            obs_sub_list.append((s,r))
            total_reward += r
            #
            steps += 1
            if terminated or truncated:
                print(f"Seed: {seed} Reward: {total_reward}, steps:{steps}")
                break
    return total_reward, obs_list

np.set_printoptions(precision=20)
a = 2.0 * np.array([[0.23369889038420863 , 0.26774014620508657 , 0.2735376051121257  ,
                     0.2997768858021939  , 0.6177483298073877  , 0.7301028167611718  ,
                     0.5806673609739257  , 0.9615451067519928  , 0.07313039905288293 ,
                     0.3626741110509404  , 0.07945107270237381 , 0.009656828642440243]])
a = np.vstack([a, a])
env = gym.make("LunarLander-v2")
total_reward_42, obs_list_42 = evaluate_demo_heuristic_lander(env, a, seed=42)
total_reward_43, obs_list_43 = evaluate_demo_heuristic_lander(env, a, seed=43)

I found that for seed=43 the two episodes differ: the first episode runs for 353 steps, but the second for only 346. I then checked at which step they start to diverge with the following code:

for i in range(len(obs_list_43[1])):
    if not np.all(np.equal(obs_list_43[1][i][0],obs_list_43[0][i][0])):
        print(f"i:{i}")
        break

It reports step 235, so:

>>> obs_list_43[0][234]
(array([-8.6891845e-02, -1.0146881e-03, -1.5059064e-01, -6.3774064e-07,
        1.5923785e-03,  2.8275241e-07,  1.0000000e+00,  1.0000000e+00],
      dtype=float32), -0.6966045942906287)
>>> obs_list_43[1][234]
(array([-8.6891845e-02, -1.0146881e-03, -1.5059064e-01, -6.3774064e-07,
        1.5923785e-03,  2.8275241e-07,  1.0000000e+00,  1.0000000e+00],
      dtype=float32), -0.6966045942906287)
>>> obs_list_43[1][235]
(array([-8.8439748e-02, -1.0163331e-03, -1.5478876e-01, -7.2242561e-05,
        1.5940322e-03,  3.2485405e-05,  1.0000000e+00,  1.0000000e+00],
      dtype=float32), -0.6047604628131988)
>>> obs_list_43[0][235]
(array([-8.8439748e-02, -1.0163331e-03, -1.5478876e-01, -7.2242561e-05,
        1.5940322e-03,  3.2485405e-05,  0.0000000e+00,  1.0000000e+00],
      dtype=float32), -10.604760462813198)

So our question is: if at step 234 the two observations are exactly the same and the exact same action is applied, why do we get different observations at step 235? The two resulting observations are identical except for the 7th value: one is 1.0, the other is 0.0.

We need to do more debugging work on this problem.

@clockzhong
Contributor

From checking the source code and the values in obs_list_43, the difference seems to be related to the lander's legs[i].ground_contact flags, because of the following code:

        state = [
            (pos.x - VIEWPORT_W / SCALE / 2) / (VIEWPORT_W / SCALE / 2),
            (pos.y - (self.helipad_y + LEG_DOWN / SCALE)) / (VIEWPORT_H / SCALE / 2),
            vel.x * (VIEWPORT_W / SCALE / 2) / FPS,
            vel.y * (VIEWPORT_H / SCALE / 2) / FPS,
            self.lander.angle,
            20.0 * self.lander.angularVelocity / FPS,
            1.0 if self.legs[0].ground_contact else 0.0,
            1.0 if self.legs[1].ground_contact else 0.0,
        ]

It's very possible that self.legs[0].ground_contact ends up in different states in our two runs (True in one, False in the other), but why? We don't know yet. As @RedTachyon said, the contactListener is implemented in Box2D, and I haven't set up a debugging environment for it yet.
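One way to narrow this down without a Box2D debugging setup might be to log the contact flags from the unwrapped environment at every step of two same-seed runs and see where they first diverge. A rough sketch (it assumes the legs/ground_contact attributes quoted above are reachable via env.unwrapped, and uses a fixed no-op action as a placeholder):

import gymnasium as gym

# Sketch: print the leg ground-contact flags at every step of two runs
# with the same seed, so the first divergence is easy to spot.
env = gym.make("LunarLander-v2")
for run in range(2):
    obs, info = env.reset(seed=43)
    terminated = truncated = False
    step = 0
    while not (terminated or truncated):
        # replace 0 with the heuristic action to reproduce the runs above
        obs, r, terminated, truncated, info = env.step(0)
        contacts = [leg.ground_contact for leg in env.unwrapped.legs]
        print(run, step, contacts)
        step += 1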

clockzhong added a commit to clockzhong/Gymnasium that referenced this issue Oct 17, 2023
@clockzhong
Contributor

clockzhong commented Oct 17, 2023

Hi, new updates on this problem:
I've confirmed that the root cause relates to how LunarLander.world is reset. In the current LunarLander code we initialise the world in __init__() and reuse it in reset(): reset() tears the old world down via _destroy() instead of creating a completely new self.world, and it seems some state is not cleaned up thoroughly enough by _destroy() and reset(), which causes this bug.
After adding "self.world = Box2D.b2World(gravity=(0, self.gravity))" in reset(), the bug disappears. The code patch is here:

32d5ebc
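The idea of the change, roughly (a sketch of the approach rather than the exact diff; see the commit above for the real patch):

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._destroy()
        # Recreate the Box2D world instead of reusing the old instance, so no
        # contact-listener or body state can survive across episodes.
        self.world = Box2D.b2World(gravity=(0, self.gravity))
        ...  # the rest of reset() rebuilds the terrain, lander and legs as before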

@pseudo-rnd-thoughts
Member

@clockzhong Wow, thanks for finding this.

This seems like a serious bug that could significantly affect training results.
As making this change will require a version bump, i.e., LunarLander-v3, we need to investigate the performance difference.
@clockzhong Could you do this? It would require using SB3, CleanRL, etc. with an algorithm (PPO, DQN, etc.) and tracking the performance of the two environment versions, with graphs of the results. See openai/gym#2762 for an example.

@clockzhong
Contributor

@clockzhong Wow, thanks for finding this.

This seems like a serious bug that could significantly affect training results. As making this change will require a version bump, i.e., LunarLander-v3, we need to investigate the performance difference. @clockzhong Could you do this? It would require using SB3, CleanRL, etc. with an algorithm (PPO, DQN, etc.) and tracking the performance of the two environment versions, with graphs of the results. See openai/gym#2762 for an example.

Yes, I'm setting up an SB3 environment here, and I'll run some tests to check how this fix affects those RL algorithms' performance. I expect the fix will not change their performance much, because the agents (PPO/DQN) have no notion of which environment behaviour is right or wrong. Anyway, we can only draw a final conclusion after completing the A/B test; a rough sketch of that comparison is below.
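For concreteness, the A/B comparison could look something like this (a sketch only: "LunarLander-v3" stands for a locally registered patched environment, and the timestep budget and evaluation settings are placeholders, not agreed values):

import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Train PPO with a few seeds on each environment version and compare returns.
for env_id in ["LunarLander-v2", "LunarLander-v3"]:
    returns = []
    for seed in range(3):
        env = gym.make(env_id)
        model = PPO("MlpPolicy", env, seed=seed, verbose=0)
        model.learn(total_timesteps=200_000)
        mean_ret, _ = evaluate_policy(model, env, n_eval_episodes=20)
        returns.append(mean_ret)
        env.close()
    print(env_id, np.mean(returns), np.std(returns))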

@pseudo-rnd-thoughts
Member

To confirm, could you just train an agent (3 repeats) on the original v2 environment and then on a v3 environment, plot the training results for both, and compare them?

@pseudo-rnd-thoughts
Member

@clockzhong Hey, any progress on this?

@pseudo-rnd-thoughts
Member

@Kallinteris-Andreas We should add this for v3 though
