
[Question] Evaluation helper on Monitor wrapped environments #894

Closed · 2 tasks done
Meyer99 opened this issue Apr 30, 2022 · 11 comments
Labels: duplicate (This issue or pull request already exists), question (Further information is requested)

Meyer99 commented Apr 30, 2022

Question

Hi all,
I would like to know why checking "episode" in info.keys() corresponds to the true end of an episode.

# from evaluation.py
                if dones[i]:
                    if is_monitor_wrapped:
                        # Atari wrapper can send a "done" signal when
                        # the agent loses a life, but it does not correspond
                        # to the true end of episode
                        if "episode" in info.keys():
                            # Do not trust "done" with episode endings.
                            # Monitor wrapper includes "episode" key in info if environment
                            # has been wrapped with it. Use those rewards instead.
                            episode_rewards.append(info["episode"]["r"])
                            episode_lengths.append(info["episode"]["l"])
                            # Only increment at the real end of an episode
                            episode_counts[i] += 1

The above code from evaluation.py uses "episode" in info.keys() to determine the true end of an episode, since the Atari wrapper can send a "done" signal when the agent loses a life. However, in monitor.py, episode info is added to info whenever there is a "done" signal.
If the environment is wrapped with EpisodicLifeEnv, how can "episode" in info.keys() distinguish between a true termination and losing a life? Doesn't "episode" in info.keys() always evaluate to True for Monitor wrapped environments when done = True?

# step() in monitor.py
    def step(self, action: Union[np.ndarray, int]) -> GymStepReturn:
        """
        Step the environment with the given action

        :param action: the action
        :return: observation, reward, done, information
        """
        if self.needs_reset:
            raise RuntimeError("Tried to step environment that needs reset")
        observation, reward, done, info = self.env.step(action)
        self.rewards.append(reward)
        if done:
            self.needs_reset = True
            ep_rew = sum(self.rewards)
            ep_len = len(self.rewards)
            ep_info = {"r": round(ep_rew, 6), "l": ep_len, "t": round(time.time() - self.t_start, 6)}
            for key in self.info_keywords:
                ep_info[key] = info[key]
            self.episode_returns.append(ep_rew)
            self.episode_lengths.append(ep_len)
            self.episode_times.append(time.time() - self.t_start)
            ep_info.update(self.current_reset_info)
            if self.results_writer:
                self.results_writer.write_row(ep_info)
            info["episode"] = ep_info
        self.total_steps += 1
        return observation, reward, done, info


Checklist

  • I have read the documentation (required)
  • I have checked that there is no similar issue in the repo (required)
Meyer99 added the question label on Apr 30, 2022
araffin added the duplicate label on Apr 30, 2022
araffin (Member) commented Apr 30, 2022

Duplicate of #181 and #873

Meyer99 (Author) commented May 2, 2022

To make the Monitor wrapper count the unwrapped reward (i.e., the full episode reward), one should wrap the environment with the Monitor wrapper before the other wrappers. E.g.,

import gym
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.atari_wrappers import AtariWrapper

env = gym.make("SpaceInvadersNoFrameskip-v4")
env = Monitor(env)
env = AtariWrapper(env, terminal_on_life_loss=True)
obs = env.reset()
while True:
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        print(info)
        break
print(env.get_episode_rewards())
print(env.unwrapped.ale.lives())

# Output: 
# {'lives': 2, 'episode_frame_number': 836, 'frame_number': 836}
# []
# 2

However, if we switch the wrapping order, losing a life is then treated as the end of an episode.

env = gym.make("SpaceInvadersNoFrameskip-v4")
env = AtariWrapper(env, terminal_on_life_loss=True)
env = Monitor(env)
obs = env.reset()
while True:
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        print(info)
        break
print(env.get_episode_rewards())
print(env.unwrapped.ale.lives())

# Output: 
# {'lives': 2, 'episode_frame_number': 1505, 'frame_number': 1505, 'episode': {'r': 11.0, 'l': 368, 't': 1.910763}}
# [11.0]
# 2

Meyer99 closed this as completed on May 2, 2022
AlexPasqua (Contributor) commented:

@araffin @Meyer99 I actually had an issue related to this.
I'm using A2C and I have an evaluation environment. It is wrapped with a Monitor, and I also need to wrap it with a TimeLimit wrapper from the gym library.

I noticed that the order of the wrappers actually changes the behavior (and I think this is not desirable).

First Monitor and then TimeLimit

If I wrap it first with Monitor and then with TimeLimit, the latter seems to have no effect. I've noticed that this is because, in the function evaluate_policy (from evaluation.py), the episode_counts are never incremented: dones[i] and is_monitor_wrapped are True, but "episode" is not in info.keys().
Code for reference:

if dones[i]:
    if is_monitor_wrapped:
        # Atari wrapper can send a "done" signal when
        # the agent loses a life, but it does not correspond
        # to the true end of episode
        if "episode" in info.keys():
            # Do not trust "done" with episode endings.
            # Monitor wrapper includes "episode" key in info if environment
            # has been wrapped with it. Use those rewards instead.
            episode_rewards.append(info["episode"]["r"])
            episode_lengths.append(info["episode"]["l"])
            # Only increment at the real end of an episode
            episode_counts[i] += 1
    else:
        episode_rewards.append(current_rewards[i])
        episode_lengths.append(current_lengths[i])
        episode_counts[i] += 1
    current_rewards[i] = 0
    current_lengths[i] = 0

First TimeLimit and then Monitor

If, instead, I wrap the evaluation env first with TimeLimit and only then with Monitor, "episode" actually is in the info.keys() and the episode_counts are incremented.
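
For illustration, here is a minimal sketch of the two orderings (MyEnv stands in for any gym environment; the names are placeholders, not code from the thread):

from gym.wrappers import TimeLimit
from stable_baselines3.common.monitor import Monitor

# Monitor inside, TimeLimit outside: the outer TimeLimit only sets
# done=True after Monitor.step has already returned, so Monitor never
# sees the truncation and never adds "episode" to info.
eval_env = TimeLimit(Monitor(MyEnv()), max_episode_steps=100)

# TimeLimit inside, Monitor outside: Monitor receives the truncation
# "done" and adds the "episode" entry to info as expected.
eval_env = Monitor(TimeLimit(MyEnv(), max_episode_steps=100))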

Possible solution

A possible solution to this could be to move the episode_counts increment outside of the inner if-statement, as follows:

if dones[i]:
    if is_monitor_wrapped:
        # Atari wrapper can send a "done" signal when
        # the agent loses a life, but it does not correspond
        # to the true end of episode
        if "episode" in info.keys():
            # Do not trust "done" with episode endings.
            # Monitor wrapper includes "episode" key in info if environment
            # has been wrapped with it. Use those rewards instead.
            episode_rewards.append(info["episode"]["r"])
            episode_lengths.append(info["episode"]["l"])
    else:
        episode_rewards.append(current_rewards[i])
        episode_lengths.append(current_lengths[i])
    episode_counts[i] += 1
    current_rewards[i] = 0
    current_lengths[i] = 0

I don't know if this could create problems, though. If it doesn't, I could open a PR to fix this problem as shown.

Meyer99 (Author) commented Sep 27, 2022

[Quotes AlexPasqua's comment above in full.]

For the first case, "episode" is not in info.keys() because the original environment (i.e., the environment wrapped with Monitor) has not actually terminated; "episode" is only added to info when Monitor itself receives a done signal.

I think the implementation that wraps the environment with TimeLimit first is correct, and there is no need to modify the code in Monitor. We can think of TimeLimit as a property of the environment, just like many predefined gym environments (e.g., HalfCheetah) already include one. That way, Monitor is still the first wrapper applied to the (time-limited) environment, which is how it is designed and supposed to be used. @AlexPasqua
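
For concreteness, a minimal sketch of that pattern (MyEnv is a placeholder for any gym environment):

from gym.wrappers import TimeLimit
from stable_baselines3.common.monitor import Monitor

# Treat the time limit as part of the environment definition...
env = TimeLimit(MyEnv(), max_episode_steps=100)
# ...and apply Monitor directly to that "complete" environment, so it
# records "episode" statistics on every true termination, including
# truncation by the time limit.
env = Monitor(env)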

AlexPasqua (Contributor) commented:

@Meyer99 OK, I've seen in other issues (#477 and #789) that combining make_vec_env with gym.wrappers.TimeLimit is a common "problem": to work correctly, the env should be wrapped with TimeLimit first and then with Monitor, but make_vec_env does the opposite (even when passing a wrapper_class argument to it).

At first it looked like a bug, but in #789 @araffin says it's by design (in this comment).

In my case, I managed to make it work this way:

from gym.wrappers import TimeLimit
from stable_baselines3.common.env_util import make_vec_env

base_env = MyCustomEnv(...)
env = make_vec_env(
    env_id=TimeLimit,
    n_envs=10,  # 10 is a random number, not important for what we're discussing
    env_kwargs=dict(
        env=base_env,
        max_episode_steps=30  # 30 is a random number, not important for what we're discussing
    )
)

I think it's a bit cleaner than what is suggested in #477 and #789, and it might also be nice to reference this in the documentation. @araffin what do you think? If so, I can make a PR.

AlexPasqua added a commit to AlexPasqua/DeepNetSlice that referenced this issue on Sep 27, 2022:
The wrappers for the training and eval envs were in the wrong order, so the _elapsed_steps attribute of the TimeLimit wrapper was a single one shared by all the instances of the env in the VecEnv.
Now each instance is wrapped with its own TimeLimit instance, each with its own (independent) _elapsed_steps.

See this discussion on Github: DLR-RM/stable-baselines3#894 (comment)
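
A sketch of the pitfall that commit describes (MyEnv is a placeholder; the assumption is a DummyVecEnv built from env factories):

from gym.wrappers import TimeLimit
from stable_baselines3.common.vec_env import DummyVecEnv

# Buggy: one TimeLimit instance shared by every worker, so its
# _elapsed_steps counter is advanced by all of them together.
shared_env = TimeLimit(MyEnv(), max_episode_steps=30)
vec_env = DummyVecEnv([lambda: shared_env for _ in range(4)])

# Fixed: each factory call builds its own TimeLimit, so every worker
# keeps an independent _elapsed_steps counter.
vec_env = DummyVecEnv([lambda: TimeLimit(MyEnv(), max_episode_steps=30) for _ in range(4)])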
araffin (Member) commented Sep 28, 2022

At first it looked like a bug, but in #789 @araffin says it's by design (in #798 (comment)).

Yes, it's by design, because the time limit wrapper is normally applied when calling gym.make() (via max_episode_steps in the env registration).
So, if you modify the termination condition or the reward of the original env, you need to explicitly wrap it with a Monitor wrapper again, for instance via the wrapper_class argument:

def my_wrapper(env):
    env = TimeLimit(env, 100)  # new time limit
    env = Monitor(env)  # wrap it with a Monitor again to explicitly take the change into account
    return env

make_vec_env(..., wrapper_class=my_wrapper)

or use VecMonitor on the vectorized env
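
As a sketch of that VecMonitor option (MyEnv is a placeholder; episode statistics are recorded at the vectorized level, so no per-env Monitor is needed):

from gym.wrappers import TimeLimit
from stable_baselines3.common.vec_env import DummyVecEnv, VecMonitor

# Each worker builds its own TimeLimit; VecMonitor wraps the whole VecEnv.
vec_env = VecMonitor(DummyVecEnv([lambda: TimeLimit(MyEnv(), max_episode_steps=100) for _ in range(4)]))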

or even (probably the cleanest solution):

def make_env():
    env = MyEnv()
    env = TimeLimit(env, 100)  # new time limit
    return env

vec_env = make_vec_env(make_env, ...)

it might be nice to reference this in the documentation

I would be happy to receive a PR that adds a note in the doc linking to the different issues

AlexPasqua (Contributor) commented Sep 30, 2022

or even (probably the cleanest solution):

def make_env():
    env = MyEnv()
    env = TimeLimit(env, 100)  # new time limit
    return env

vec_env = make_vec_env(make_env, ...)

That looks good!
I didn't think of it because the type hint of the env_id parameter of make_vec_env is Union[str, Type[Env]]; in fact my code inspector gives me a warning, but it still works.
I'd suggest changing the type hint to Union[str, Type[Env], Callable[[Env], Env]] (a bit like the one of wrapper_class, which in fact accepts functions without problems).

@araffin if you agree, in my PR I could change that too

araffin (Member) commented Sep 30, 2022

I'd suggest changing the type hint to Union[str, Type[Env], Callable[[Env], Env]] (a bit like the one of wrapper_class, which in fact accepts functions without problems).

well, more Callable[..., Env] in that case.

if you agree, in my PR I could change that too

Please do =)

AlexPasqua (Contributor) commented:

[Quotes araffin's reply above.]

Wouldn't that not "accept" strings anymore?

araffin (Member) commented Sep 30, 2022

Wouldn't that not "accept" strings anymore?

I meant Callable[..., Env] instead of Callable[[Env], Env] (Callable[..., Env] includes Type[Env]), so Union[str, Callable[..., Env]]
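
Putting that together, the suggested annotation would read roughly as follows (a sketch, not the actual code merged in the PR; parameters beyond env_id are illustrative):

from typing import Callable, Union
import gym

def make_vec_env(env_id: Union[str, Callable[..., gym.Env]], n_envs: int = 1):
    ...  # remaining parameters and body omitted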

AlexPasqua (Contributor) commented:

Please do =)

Opened PR #1085
