[Bug Report] Lunar Lander reset determinism #728

Closed
1 task done
Thomas-Christie opened this issue Sep 27, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@Thomas-Christie

Describe the bug

Hi All,

Thanks for the library! I'm currently trying to use the LunarLander-v2 environment with some degree of determinism. I want to run the environment for different seeds and learn some "weights" for a controller based on the demo controller present in the library. In my setting I often have a "batch" of weights I'd like to try for each environment, so I have written a simple loop to do so (this is fast enough for my needs).

For instance, I might want to run the environment for seeds 42 and 43, with two different sets of weights w1 and w2. My current approach was to call env.reset(seed=seed) at the start of each run: call env.reset(seed=42) and run an episode with weights w1 for the controller, then call env.reset(seed=42) and run an episode with weights w2, then call env.reset(seed=43) and run an episode with weights w1, and so on. My understanding is that if I call env.reset(seed=x) and run the environment with the same controller, I should get the same results. However, there seems to be some state which doesn't get reset in the environment. Most of the time things seem to be OK, but sometimes running the environment with the same seed and weights yields a different result, depending on which seeds it has been run with before. Please find a code example attached.

When running the example I get the following output:

Seed: 42 Reward: 112.8980752996316
Seed: 42 Reward: 112.8980752996316
Seed: 43 Reward: -154.39439542058136
Seed: 43 Reward: -264.19797533115525

Despite both runs of each seed using the same controller weights, the two runs with seed 43 yield different rewards. Am I misunderstanding the use of env.reset(), or is this a bug? Interestingly, if I don't run the environment with seed=42 beforehand, I get the same result for seed=43 both times (-264.19797533115525), and if I create a new instance of the environment via gym.make() between the two calls to evaluate_demo_heuristic_lander the results are also consistent (-264.19797533115525 for seed=43 on both occasions); a sketch of that workaround is shown below.
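For reference, here is a minimal sketch of that workaround (recreating the environment between evaluations), using the evaluate_demo_heuristic_lander function and the weights array a from the code example below:

# Workaround sketch: a fresh environment per seed keeps seed=43
# reproducible regardless of what was run before.
env = gym.make("LunarLander-v2")
evaluate_demo_heuristic_lander(env, a, seed=42)
env.close()

env = gym.make("LunarLander-v2")  # new instance, no carried-over Box2D state
evaluate_demo_heuristic_lander(env, a, seed=43)
env.close()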

Thanks in advance!

Code example

import gymnasium as gym
import numpy as np
from gymnasium.utils.step_api_compatibility import step_api_compatibility

def heuristic_controller(s, weights):
    # Variant of the library's demo heuristic controller, with its constants
    # replaced by a learnable weight vector.
    angle_targ = s[0] * weights[0] + s[2] * weights[1]
    if angle_targ > weights[2]:
        angle_targ = weights[2]
    if angle_targ < -weights[2]:
        angle_targ = -weights[2]
    hover_targ = weights[3] * np.abs(s[0])

    angle_todo = (angle_targ - s[4]) * weights[4] - (s[5]) * weights[5]
    hover_todo = (hover_targ - s[1]) * weights[6] - (s[3]) * weights[7]

    if s[6] or s[7]:
        angle_todo = weights[8]
        hover_todo = -(s[3]) * weights[9]

    a = 0
    if hover_todo > np.abs(angle_todo) and hover_todo > weights[10]:
        a = 2
    elif angle_todo < -weights[11]:
        a = 3
    elif angle_todo > +weights[11]:
        a = 1
    return a

def evaluate_demo_heuristic_lander(env, weight_configs, seed=None):
    # Run one episode per weight configuration, resetting with the same seed
    # each time, and print the total reward.
    for weight_config in weight_configs:
        total_reward = 0
        steps = 0
        s, info = env.reset(seed=seed)
        while True:
            a = heuristic_controller(s, weight_config)
            s, r, terminated, truncated, info = step_api_compatibility(env.step(a), True)
            total_reward += r

            steps += 1
            if terminated or truncated:
                print(f"Seed: {seed} Reward: {total_reward}")
                break
    return total_reward

if __name__ == "__main__":
    np.set_printoptions(precision=20)
    a = 2.0 * np.array([[0.23369889038420863 , 0.26774014620508657 , 0.2735376051121257  ,
                         0.2997768858021939  , 0.6177483298073877  , 0.7301028167611718  ,
                         0.5806673609739257  , 0.9615451067519928  , 0.07313039905288293 ,
                         0.3626741110509404  , 0.07945107270237381 , 0.009656828642440243]])
    a = np.vstack([a, a])
    env = gym.make("LunarLander-v2")
    evaluate_demo_heuristic_lander(env, a, seed=42)
    evaluate_demo_heuristic_lander(env, a, seed=43)

System info

  • Gymnasium installed using pip.
  • Gymnasium Version: 0.29.1
  • OS: macOS Ventura 13.5.2
  • Python 3.10.10

Additional context

No response

Checklist

  • I have checked that there is no similar issue in the repo
Thomas-Christie added the bug label on Sep 27, 2023
@RedTachyon
Member

I'm not 100% sure, but I think there was a similar issue either here or back in Gym. The conclusion was probably that the underlying issue is some Box2D black magic, and there isn't much we can do until we get rid of Box2D and replace it with literally anything else (which IIRC is somewhere on the roadmap, but might not happen anytime soon).

@clockzhong
Contributor

Let me post my new findings here:
I've added some more debugging output to the test code, as follows:

def evaluate_demo_heuristic_lander(env, weight_configs, seed=None):
    obs_list = []
    for weight_config in weight_configs:
        total_reward = 0
        steps = 0
        env.close()
        s, info = env.reset(seed=seed)
        obs_sub_list = []
        obs_sub_list.append(s)
        obs_list.append(obs_sub_list)
        while True:
            a = heuristic_controller(s, weight_config)
            #s, r, terminated, truncated, info = step_api_compatibility(env.step(a), True)
            s, r, terminated, truncated, info  = env.step(a)
            obs_sub_list.append((s,r))
            total_reward += r
            #
            steps += 1
            if terminated or truncated:
                print(f"Seed: {seed} Reward: {total_reward}, steps:{steps}")
                break
    return total_reward, obs_list

np.set_printoptions(precision=20)
a = 2.0 * np.array([[0.23369889038420863 , 0.26774014620508657 , 0.2735376051121257  ,
                     0.2997768858021939  , 0.6177483298073877  , 0.7301028167611718  ,
                     0.5806673609739257  , 0.9615451067519928  , 0.07313039905288293 ,
                     0.3626741110509404  , 0.07945107270237381 , 0.009656828642440243]])
a = np.vstack([a, a])
env = gym.make("LunarLander-v2")
total_reward_42, obs_list_42 = evaluate_demo_heuristic_lander(env, a, seed=42)
total_reward_43, obs_list_43 = evaluate_demo_heuristic_lander(env, a, seed=43)

I found that for seed=43 the two episodes differ: the first episode runs for 353 steps, but the second for only 346. I then checked at which step they start to diverge with the following code:

for i in range(len(obs_list_43[1])):
    if not np.all(np.equal(obs_list_43[1][i][0],obs_list_43[0][i][0])):
        print(f"i:{i}")
        break

It reports step 235, so:

>>> obs_list_43[0][234]
(array([-8.6891845e-02, -1.0146881e-03, -1.5059064e-01, -6.3774064e-07,
        1.5923785e-03,  2.8275241e-07,  1.0000000e+00,  1.0000000e+00],
      dtype=float32), -0.6966045942906287)
>>> obs_list_43[1][234]
(array([-8.6891845e-02, -1.0146881e-03, -1.5059064e-01, -6.3774064e-07,
        1.5923785e-03,  2.8275241e-07,  1.0000000e+00,  1.0000000e+00],
      dtype=float32), -0.6966045942906287)
>>> obs_list_43[1][235]
(array([-8.8439748e-02, -1.0163331e-03, -1.5478876e-01, -7.2242561e-05,
        1.5940322e-03,  3.2485405e-05,  1.0000000e+00,  1.0000000e+00],
      dtype=float32), -0.6047604628131988)
>>> obs_list_43[0][235]
(array([-8.8439748e-02, -1.0163331e-03, -1.5478876e-01, -7.2242561e-05,
        1.5940322e-03,  3.2485405e-05,  0.0000000e+00,  1.0000000e+00],
      dtype=float32), -10.604760462813198)

So our question is: if at step 234 the two observations are exactly the same and the exact same action is applied, why do we get different observations at step 235? The two resulting observations are identical except for the 7th value: one is 1.0, the other is 0.0.

We need to do more debugging work on this problem.

@clockzhong
Contributor

From checking the source code and the values in obs_list_43, the difference seems to be related to the lander's legs[i].ground_contact flags, because of the following code:

        state = [
            (pos.x - VIEWPORT_W / SCALE / 2) / (VIEWPORT_W / SCALE / 2),
            (pos.y - (self.helipad_y + LEG_DOWN / SCALE)) / (VIEWPORT_H / SCALE / 2),
            vel.x * (VIEWPORT_W / SCALE / 2) / FPS,
            vel.y * (VIEWPORT_H / SCALE / 2) / FPS,
            self.lander.angle,
            20.0 * self.lander.angularVelocity / FPS,
            1.0 if self.legs[0].ground_contact else 0.0,
            1.0 if self.legs[1].ground_contact else 0.0,
        ]

It's very possible that self.legs[0].ground_contact ends up in different states in our two runs (True in one, False in the other), but why? We don't know yet. As @RedTachyon said, the contactListener is implemented in Box2D, and I haven't set up a debugging environment for it yet.
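One way to narrow this down without a Box2D debugging setup might be to log the contact flags from the unwrapped environment at every step of two same-seed runs and see where they first diverge. A rough sketch (it assumes the legs/ground_contact attributes quoted above are reachable via env.unwrapped, and uses a fixed no-op action as a placeholder):

import gymnasium as gym

# Sketch: print the leg ground-contact flags at every step of two runs
# with the same seed, so the first divergence is easy to spot.
env = gym.make("LunarLander-v2")
for run in range(2):
    obs, info = env.reset(seed=43)
    terminated = truncated = False
    step = 0
    while not (terminated or truncated):
        # replace 0 with the heuristic action to reproduce the runs above
        obs, r, terminated, truncated, info = env.step(0)
        contacts = [leg.ground_contact for leg in env.unwrapped.legs]
        print(run, step, contacts)
        step += 1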

clockzhong added a commit to clockzhong/Gymnasium that referenced this issue Oct 17, 2023
@clockzhong
Contributor

clockzhong commented Oct 17, 2023

Hi, new updates on this problem:
I've confirmed that the root cause relates to how LunarLander.world is reset. In the current LunarLander code we initialise the world in __init__() and reuse it in reset(): reset() tears the old world down via _destroy() instead of creating a completely new self.world, and it seems some state is not cleaned up thoroughly enough by _destroy() and reset(), which causes this bug.
After adding "self.world = Box2D.b2World(gravity=(0, self.gravity))" in reset(), the bug disappears. The code patch is here:

32d5ebc
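The idea of the change, roughly (a sketch of the approach rather than the exact diff; see the commit above for the real patch):

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._destroy()
        # Recreate the Box2D world instead of reusing the old instance, so no
        # contact-listener or body state can survive across episodes.
        self.world = Box2D.b2World(gravity=(0, self.gravity))
        ...  # the rest of reset() rebuilds the terrain, lander and legs as before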

@pseudo-rnd-thoughts
Member

@clockzhong Wow, thanks for finding this.

This seems like a serious bug that could significantly affect training results.
As making this change will require a version bump, i.e., LunarLander-v3, we need to investigate the performance difference.
@clockzhong Could you do this? It would require using SB3, CleanRL, etc. with an algorithm (PPO, DQN, etc.) and tracking the performance of the two environment versions, with graphs of the results. See openai/gym#2762 for an example.

@clockzhong
Contributor

@clockzhong Wow, thanks for finding this.

This seems like a serious bug that could significantly affect training results. As making this change will require a version bump, i.e., LunarLander-v3, we need to investigate the performance difference. @clockzhong Could you do this? It would require using SB3, CleanRL, etc. with an algorithm (PPO, DQN, etc.) and tracking the performance of the two environment versions, with graphs of the results. See openai/gym#2762 for an example.

Yes, I'm setting up an SB3 environment here, and I'll run some tests to check how this fix affects those RL algorithms' performance. I expect the fix will not change their performance much, because the agents (PPO/DQN) have no notion of which environment behaviour is right or wrong. Anyway, we can only draw a final conclusion after completing the A/B test; a rough sketch of that comparison is below.
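For concreteness, the A/B comparison could look something like this (a sketch only: "LunarLander-v3" stands for a locally registered patched environment, and the timestep budget and evaluation settings are placeholders, not agreed values):

import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Train PPO with a few seeds on each environment version and compare returns.
for env_id in ["LunarLander-v2", "LunarLander-v3"]:
    returns = []
    for seed in range(3):
        env = gym.make(env_id)
        model = PPO("MlpPolicy", env, seed=seed, verbose=0)
        model.learn(total_timesteps=200_000)
        mean_ret, _ = evaluate_policy(model, env, n_eval_episodes=20)
        returns.append(mean_ret)
        env.close()
    print(env_id, np.mean(returns), np.std(returns))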

@pseudo-rnd-thoughts
Member

To confirm, could you just train an agent (3 repeats) on the original v2 environment and then on a v3 environment, plot the training results for both, and compare them?

@pseudo-rnd-thoughts
Member

@clockzhong Hey, any progress on this?

@pseudo-rnd-thoughts
Member

@Kallinteris-Andreas We should add this for v3 though
