[Bug] Infinite horizon tasks are handled like episodic tasks #284

Closed
tomasruizt opened this issue Jan 6, 2021 · 6 comments · Fixed by #243
Labels
enhancement New feature or request

@tomasruizt

Hi,
I wonder how to correctly use SAC with infinite horizon environments. I saw @araffin's answer to hill-a/stable-baselines#776, where he points out that the algorithms are step-based. Our environments could always return done = False, but then we would have to reset the environment manually. As a consequence, we would add transitions to the replay buffer that go from the last state of one episode to the initial state of the next, which is incorrect.

Is the only solution to include a time-feature? That means messing with the observation_space size, handling dict spaces correctly, and explaining what this "time-feature" is in papers. Let me know if I've missed a thread treating this issue already 😄
Greetings!

🐛 Bug / Background

My understanding is that SAC skips the bootstrapped target when s' is a terminal state:

q_backup = replay_data.rewards + (1 - replay_data.dones) * self.gamma * target_q

In infinite horizon tasks, we wrap our env with gym.wrappers.TimeLimit, which sets done = True when the maximum episode length is reached. This ends the episode in SAC, and the transition is saved in the replay buffer for learning.

However, according to "Time Limits in Reinforcement Learning" (https://arxiv.org/abs/1712.00378), we should not treat that last state as a "terminal" state, since the termination has nothing to do with the MDP. If we ignore this, we are doing "state aliasing" and violating the Markov property.
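For illustration only, here is a minimal sketch of the corrected target, assuming the replay buffer stored a hypothetical `timeouts` flag next to `dones`; this is not the current SB3 code:

```python
def q_backup_with_timeouts(rewards, dones, timeouts, gamma, target_q):
    """Bootstrapped target that ignores time-limit terminations.

    `timeouts` is a hypothetical flag: 1.0 when the episode ended only
    because of the TimeLimit wrapper, 0.0 otherwise.
    """
    # A timeout is not a true terminal state, so we still bootstrap from s'.
    real_dones = dones * (1.0 - timeouts)
    return rewards + (1.0 - real_dones) * gamma * target_q
```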

@tomasruizt tomasruizt added the bug Something isn't working label Jan 6, 2021
@Miffyli
Collaborator

Miffyli commented Jan 7, 2021

To summarize, so that I understand correctly: you have a non-episodic task (never truly "done"), but you use TimeLimit to reset the environment every now and then, and to train correctly you cannot treat those resets as terminal boundaries during training (they do not reflect the true agent setup).

There should not be a problem with this while using SAC, as long as you always feed in done=False. The biggest problem then is that the final timestep does not reflect environment behaviour (it was reset under the hood). The easiest fix is not to include that transition in the training data. You can add a check here of the form if not infos.get("TimeLimit.truncated", False): buffer.add(...). This flag is added to the info dictionary when an episode is truncated by the TimeLimit wrapper.
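A minimal sketch of that check in a single-env collection loop (names like `policy`, `replay_buffer`, `total_steps` and the `add` signature are placeholders, not SB3's internals):

```python
# Skip transitions that were truncated by gym's TimeLimit wrapper, which sets
# info["TimeLimit.truncated"] = True when the step limit (not the MDP itself)
# ends the episode.
obs = env.reset()
for _ in range(total_steps):
    action = policy(obs)
    next_obs, reward, done, info = env.step(action)
    if not info.get("TimeLimit.truncated", False):
        replay_buffer.add(obs, next_obs, action, reward, done)
    obs = env.reset() if done else next_obs
```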

A more sophisticated solution would indeed be a nice enhancement though, as errors like these are easy to miss. I will mark it as an enhancement for some later versions of stable-baselines.

@Miffyli Miffyli added enhancement New feature or request and removed bug Something isn't working labels Jan 7, 2021
@Miffyli Miffyli added this to the v1.1 milestone Jan 7, 2021
@tomasruizt
Author

Yes, your description is accurate. Would you like me to open a PR for this? I'm happy to follow any guidelines you point me to.
Best, Tomas

@Miffyli
Collaborator

Miffyli commented Jan 7, 2021

I would not start a PR yet, as the proper implementation (if any) should be discussed first, and I do not see an easy way to incorporate this into the contrib repository either. Worse yet, this will get more complicated once support for n-step updates is added.

For now I suggest you modify your own fork of SB3 to support this in the way that suits your experiments.

@tomasruizt
Author

Okay. Feel free to ping me if you require support with this feature later on.
For my work, I'll use a time-feature and set gamma = 1.

@araffin araffin changed the title [Bug] Infinize horizon tasks are handled like episodic tasks [Bug] Infinite horizon tasks are handled like episodic tasks Jan 7, 2021
@araffin
Member

araffin commented Jan 7, 2021

Is the only solution to include a time-feature? That means messing with the observation_space size and handling dict spaces correctly + explaining what this "time-feature" is in papers

TimeFeature is one solution, and it is equivalent in performance to specific handling of timeouts.
We have an implementation in SB3-Contrib that already handles dict observation spaces, and it is used for all PyBullet envs in the zoo:
https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/hyperparams/sac.yml#L142

Personally, this is my recommended solution (and you can use the test mode at test time too).
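For reference, a minimal usage sketch (environment id and training budget are placeholders; see the zoo config linked above for tuned hyperparameters):

```python
import gym
from sb3_contrib.common.wrappers import TimeFeatureWrapper
from stable_baselines3 import SAC

# TimeFeatureWrapper appends the remaining time (scaled to [0, 1]) to the
# observation, restoring the Markov property under a time limit.
env = TimeFeatureWrapper(gym.make("Pendulum-v0"))
model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)
```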

Note: the timeout handling is indeed important, see appendix of https://arxiv.org/abs/2005.05719

Related issues

Linking to all relevant issues:

Experimental branch

You can add a check here of the form if not infos.get("TimeLimit.truncated", False): buffer.add(...). This flag is added to the info dictionary when an episode is truncated by the TimeLimit wrapper.

"I created a branch on SB3 but it in fact a bit more tricky than expected (notably because VecEnv resets automatically): "
As mentioned, I already created an experimental branch here:
https://github.com/DLR-RM/stable-baselines3/compare/feat/remove-timelimit
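For context, SB3's VecEnv resets sub-environments automatically on done=True, so the observation returned at that step already belongs to the next episode; the real last observation is stored by SB3 in infos[i]["terminal_observation"]. A rough sketch of what a correct collection step has to do (illustrative only; `last_obs`, `actions`, `replay_buffer` and the simplified `add` signature are assumed context, not the branch's actual code):

```python
new_obs, rewards, dones, infos = vec_env.step(actions)
for i in range(vec_env.num_envs):
    if dones[i]:
        # VecEnv already reset: recover the true final observation.
        next_obs_i = infos[i]["terminal_observation"]
        timeout_i = infos[i].get("TimeLimit.truncated", False)
    else:
        next_obs_i = new_obs[i]
        timeout_i = False
    # Only genuine terminations should stop the bootstrap.
    replay_buffer.add(last_obs[i], next_obs_i, actions[i], rewards[i],
                      dones[i] and not timeout_i)
last_obs = new_obs
```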

For my work, I'll use a time-feature and set gamma = 1.

You don't need gamma = 1; this is independent of the infinite horizon problem.

@araffin araffin mentioned this issue Mar 24, 2021
@araffin
Member

araffin commented Apr 5, 2021

I've implemented correct handling of timeouts in #351

Note: for on-policy algorithms, I would still recommend using the TimeFeature wrapper (as the change would be even more tricky, I think, due to the lambda-return).
