
[Question] Evaluation helper on Monitor wrapped environments #894

Closed · 2 tasks done
Meyer99 opened this issue Apr 30, 2022 · 11 comments
Labels: duplicate (This issue or pull request already exists), question (Further information is requested)

Meyer99 commented Apr 30, 2022

Question

Hi all,
I would like to know why checking "episode" in info.keys() corresponds to the true end of an episode.

# from evaluation.py
                if dones[i]:
                    if is_monitor_wrapped:
                        # Atari wrapper can send a "done" signal when
                        # the agent loses a life, but it does not correspond
                        # to the true end of episode
                        if "episode" in info.keys():
                            # Do not trust "done" with episode endings.
                            # Monitor wrapper includes "episode" key in info if environment
                            # has been wrapped with it. Use those rewards instead.
                            episode_rewards.append(info["episode"]["r"])
                            episode_lengths.append(info["episode"]["l"])
                            # Only increment at the real end of an episode
                            episode_counts[i] += 1

The above code from evaluation.py uses "episode" in info.keys() to determine the true end of an episode, since the Atari wrapper can send a "done" signal when the agent loses a life. However, in monitor.py, episode info is added to info whenever there is a "done" signal.
If the environment is wrapped with EpisodicLifeEnv, how can "episode" in info.keys() distinguish between a true termination and losing a life? Doesn't "episode" in info.keys() always evaluate to True for Monitor wrapped environments when done = True?

# step() in monitor.py
    def step(self, action: Union[np.ndarray, int]) -> GymStepReturn:
        """
        Step the environment with the given action

        :param action: the action
        :return: observation, reward, done, information
        """
        if self.needs_reset:
            raise RuntimeError("Tried to step environment that needs reset")
        observation, reward, done, info = self.env.step(action)
        self.rewards.append(reward)
        if done:
            self.needs_reset = True
            ep_rew = sum(self.rewards)
            ep_len = len(self.rewards)
            ep_info = {"r": round(ep_rew, 6), "l": ep_len, "t": round(time.time() - self.t_start, 6)}
            for key in self.info_keywords:
                ep_info[key] = info[key]
            self.episode_returns.append(ep_rew)
            self.episode_lengths.append(ep_len)
            self.episode_times.append(time.time() - self.t_start)
            ep_info.update(self.current_reset_info)
            if self.results_writer:
                self.results_writer.write_row(ep_info)
            info["episode"] = ep_info
        self.total_steps += 1
        return observation, reward, done, info


Checklist

  • I have read the documentation (required)
  • I have checked that there is no similar issue in the repo (required)
Meyer99 added the question label on Apr 30, 2022
araffin added the duplicate label on Apr 30, 2022
araffin (Member) commented Apr 30, 2022

Duplicate of #181 and #873

Meyer99 (Author) commented May 2, 2022

To make the Monitor wrapper count the unwrapped reward (i.e., the full episode reward), one should wrap the environment with the Monitor wrapper before the other wrappers. E.g.,

import gym
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.atari_wrappers import AtariWrapper

env = gym.make("SpaceInvadersNoFrameskip-v4")
env = Monitor(env)
env = AtariWrapper(env, terminal_on_life_loss=True)
obs = env.reset()
while True:
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        print(info)
        break
print(env.get_episode_rewards())
print(env.unwrapped.ale.lives())

# Output: 
# {'lives': 2, 'episode_frame_number': 836, 'frame_number': 836}
# []
# 2

However, if we switch the wrapping order, losing a life is then treated as the end of an episode.

env = gym.make("SpaceInvadersNoFrameskip-v4")
env = AtariWrapper(env, terminal_on_life_loss=True)
env = Monitor(env)
obs = env.reset()
while True:
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        print(info)
        break
print(env.get_episode_rewards())
print(env.unwrapped.ale.lives())

# Output: 
# {'lives': 2, 'episode_frame_number': 1505, 'frame_number': 1505, 'episode': {'r': 11.0, 'l': 368, 't': 1.910763}}
# [11.0]
# 2

Meyer99 closed this as completed on May 2, 2022
AlexPasqua (Contributor) commented:

@araffin @Meyer99 I actually had an issue related to this.
I'm using A2C and I have an evaluation environment. It is wrapped with a Monitor, and I also need to wrap it with a TimeLimit wrapper from the gym library.

I noticed that the order of the wrappers actually changes the behavior (and I think this is not desirable).

First Monitor and then TimeLimit

If I wrap it first with Monitor and then with TimeLimit, the latter seems to have no effect. I've noticed that this is because, in the function evaluate_policy (from evaluation.py), the episode_counts are never incremented: dones[i] and is_monitor_wrapped are True, but "episode" is not in info.keys().
Code for reference:

if dones[i]:
    if is_monitor_wrapped:
        # Atari wrapper can send a "done" signal when
        # the agent loses a life, but it does not correspond
        # to the true end of episode
        if "episode" in info.keys():
            # Do not trust "done" with episode endings.
            # Monitor wrapper includes "episode" key in info if environment
            # has been wrapped with it. Use those rewards instead.
            episode_rewards.append(info["episode"]["r"])
            episode_lengths.append(info["episode"]["l"])
            # Only increment at the real end of an episode
            episode_counts[i] += 1
    else:
        episode_rewards.append(current_rewards[i])
        episode_lengths.append(current_lengths[i])
        episode_counts[i] += 1
    current_rewards[i] = 0
    current_lengths[i] = 0

First TimeLimit and then Monitor

If, instead, I wrap the evaluation env first with TimeLimit and only then with Monitor, "episode" actually is in the info.keys() and the episode_counts are incremented.
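
For illustration, here is a minimal sketch of the two orderings (MyEnv stands in for any gym environment; the names are placeholders, not code from the thread):

from gym.wrappers import TimeLimit
from stable_baselines3.common.monitor import Monitor

# Monitor inside, TimeLimit outside: the outer TimeLimit only sets
# done=True after Monitor.step has already returned, so Monitor never
# sees the truncation and never adds "episode" to info.
eval_env = TimeLimit(Monitor(MyEnv()), max_episode_steps=100)

# TimeLimit inside, Monitor outside: Monitor receives the truncation
# "done" and adds the "episode" entry to info as expected.
eval_env = Monitor(TimeLimit(MyEnv(), max_episode_steps=100))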

Possible solution

A possible solution to this could be to move the episode_counts increment outside of the inner if-statement, as follows:

if dones[i]:
    if is_monitor_wrapped:
        # Atari wrapper can send a "done" signal when
        # the agent loses a life, but it does not correspond
        # to the true end of episode
        if "episode" in info.keys():
            # Do not trust "done" with episode endings.
            # Monitor wrapper includes "episode" key in info if environment
            # has been wrapped with it. Use those rewards instead.
            episode_rewards.append(info["episode"]["r"])
            episode_lengths.append(info["episode"]["l"])
    else:
        episode_rewards.append(current_rewards[i])
        episode_lengths.append(current_lengths[i])
    episode_counts[i] += 1
    current_rewards[i] = 0
    current_lengths[i] = 0

I don't know if this could create problems, though. If it doesn't, I could open a PR to fix this problem as shown.

Meyer99 (Author) commented Sep 27, 2022

[Quotes AlexPasqua's comment above in full.]

For the first case, "episode" is not in info.keys() because the original environment (i.e., the environment wrapped with Monitor) has not actually terminated; "episode" is only added to info when Monitor itself receives a done signal.

I think the implementation that wraps the environment with TimeLimit first is correct, and there is no need to modify the code in Monitor. We can think of TimeLimit as a property of the environment, just like many predefined gym environments (e.g., HalfCheetah) already include one. That way, Monitor is still the first wrapper applied to the (time-limited) environment, which is how it is designed and supposed to be used. @AlexPasqua
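
For concreteness, a minimal sketch of that pattern (MyEnv is a placeholder for any gym environment):

from gym.wrappers import TimeLimit
from stable_baselines3.common.monitor import Monitor

# Treat the time limit as part of the environment definition...
env = TimeLimit(MyEnv(), max_episode_steps=100)
# ...and apply Monitor directly to that "complete" environment, so it
# records "episode" statistics on every true termination, including
# truncation by the time limit.
env = Monitor(env)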

AlexPasqua (Contributor) commented:

@Meyer99 OK, I've seen in other issues (#477 and #789) that combining make_vec_env with gym.wrappers.TimeLimit is a common "problem": to work correctly, the env should be wrapped with TimeLimit first and then with Monitor, but make_vec_env does the opposite (even when passing a wrapper_class argument to it).

At first it looked like a bug, but in #789 @araffin says it's by design (in this comment).

In my case, I managed to make it work this way:

from gym.wrappers import TimeLimit
from stable_baselines3.common.env_util import make_vec_env

base_env = MyCustomEnv(...)
env = make_vec_env(
    env_id=TimeLimit,
    n_envs=10,  # 10 is a random number, not important for what we're discussing
    env_kwargs=dict(
        env=base_env,
        max_episode_steps=30  # 30 is a random number, not important for what we're discussing
    )
)

I think it's a bit cleaner than what is suggested in #477 and #789, and it might also be nice to reference this in the documentation. @araffin what do you think? If so, I can make a PR.

AlexPasqua added a commit to AlexPasqua/DeepNetSlice that referenced this issue on Sep 27, 2022:
The wrappers for the training and eval envs were in the wrong order, so the _elapsed_steps attribute of the TimeLimit wrapper was a single one shared by all the instances of the env in the VecEnv.
Now each instance is wrapped with its own TimeLimit instance, each with its own (independent) _elapsed_steps.

See this discussion on Github: DLR-RM/stable-baselines3#894 (comment)
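
A sketch of the pitfall that commit describes (MyEnv is a placeholder; the assumption is a DummyVecEnv built from env factories):

from gym.wrappers import TimeLimit
from stable_baselines3.common.vec_env import DummyVecEnv

# Buggy: one TimeLimit instance shared by every worker, so its
# _elapsed_steps counter is advanced by all of them together.
shared_env = TimeLimit(MyEnv(), max_episode_steps=30)
vec_env = DummyVecEnv([lambda: shared_env for _ in range(4)])

# Fixed: each factory call builds its own TimeLimit, so every worker
# keeps an independent _elapsed_steps counter.
vec_env = DummyVecEnv([lambda: TimeLimit(MyEnv(), max_episode_steps=30) for _ in range(4)])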
araffin (Member) commented Sep 28, 2022

At first it looked like a bug, but in #789 @araffin says it's by design (in #798 (comment)).

Yes, it's by design, because the time limit wrapper is normally applied when calling gym.make() (via max_episode_steps in the env registration).
So, if you modify the termination condition or the reward of the original env, you need to explicitly wrap it with a Monitor wrapper again, for instance via the wrapper_class argument:

def my_wrapper(env):
    env = TimeLimit(env, 100)  # new time limit
    env = Monitor(env)  # wrap it with a Monitor again to explicitly take the change into account
    return env

make_vec_env(..., wrapper_class=my_wrapper)

or use VecMonitor on the vectorized env
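
As a sketch of that VecMonitor option (MyEnv is a placeholder; episode statistics are recorded at the vectorized level, so no per-env Monitor is needed):

from gym.wrappers import TimeLimit
from stable_baselines3.common.vec_env import DummyVecEnv, VecMonitor

# Each worker builds its own TimeLimit; VecMonitor wraps the whole VecEnv.
vec_env = VecMonitor(DummyVecEnv([lambda: TimeLimit(MyEnv(), max_episode_steps=100) for _ in range(4)]))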

or even (probably the cleanest solution):

def make_env():
    env = MyEnv()
    env = TimeLimit(env, 100)  # new time limit
    return env

vec_env = make_vec_env(make_env, ...)

it might be nice to reference this in the documentation

I would be happy to receive a PR that adds a note in the doc linking to the different issues

AlexPasqua (Contributor) commented Sep 30, 2022

or even (probably the cleanest solution):

def make_env():
    env = MyEnv()
    env = TimeLimit(env, 100)  # new time limit
    return env

vec_env = make_vec_env(make_env, ...)

That looks good!
I didn't think of it because the type hint of the env_id parameter of make_vec_env is Union[str, Type[Env]]; in fact my code inspector gives me a warning, but it still works.
I'd suggest changing the type hint to Union[str, Type[Env], Callable[[Env], Env]] (a bit like the one of wrapper_class, which in fact accepts functions without problems).

@araffin if you agree, in my PR I could change that too

araffin (Member) commented Sep 30, 2022

I'd suggest changing the type hint to Union[str, Type[Env], Callable[[Env], Env]] (a bit like the one of wrapper_class, which in fact accepts functions without problems).

well, more Callable[..., Env] in that case.

if you agree, in my PR I could change that too

Please do =)

AlexPasqua (Contributor) commented:

[Quotes araffin's reply above.]

Wouldn't that not "accept" strings anymore?

araffin (Member) commented Sep 30, 2022

Wouldn't that not "accept" strings anymore?

I meant Callable[..., Env] instead of Callable[[Env], Env] (Callable[..., Env] includes Type[Env]), so Union[str, Callable[..., Env]]
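
Putting that together, the suggested annotation would read roughly as follows (a sketch, not the actual code merged in the PR; parameters beyond env_id are illustrative):

from typing import Callable, Union
import gym

def make_vec_env(env_id: Union[str, Callable[..., gym.Env]], n_envs: int = 1):
    ...  # remaining parameters and body omitted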

AlexPasqua (Contributor) commented:

Please do =)

Opened PR #1085
