[Bug Report] CartPole's reward function constantly returns 1 (even when it falls) #790

Closed · Fixed by #958
Kallinteris-Andreas opened this issue Nov 24, 2023 · 16 comments
Labels: bug (Something isn't working)
@Kallinteris-Andreas (Collaborator) commented Nov 24, 2023:

Describe the bug

The CartPole environment returns reward == 1 when the pole is "standing" and also reward == 1 when the pole has "fallen".

The old gym documentation described this behavior, and so does the current documentation, which would suggest it is intended; however, I can find no evidence that this was the design goal.

The argument could be made that "state-of-the-art algorithms (e.g. DQN, A2C, PPO) can easily solve this environment anyway, so why bother?". It is true that the environment is very easy to solve, but it is still important to examine why it is solvable at all, and what the consequences are for these learning algorithms and potential future ones.

The reason the algorithms are able to learn a policy, despite the environment being effectively reward-free, is the sampling bias introduced by the number of transitions collected under a given policy.
If the agent has a "good policy", it will on average stay alive longer than with a "bad policy", so the "good policy" will on average be sampled more often by the train() subroutine of the learning algorithm, and therefore the "good policy" will be reinforced more than the "bad policy" (instead of the "good policy" being reinforced and the "bad policy" being reduced, which is what normally happens in RL).

This means that RL algorithms that are not affected by this sampling bias (or are much less affected by it) would not be able to learn a policy for the CartPole-v1 environment.
CartPole is also where many RL algorithms are tested during the prototyping phase (on the assumption that if a learning algorithm cannot solve CartPole, it is probably a dud).
Therefore, an algorithm that is less affected by this sampling bias might fail the prototyping phase solely because of the incorrect implementation of Gymnasium/CartPole.
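
To make the sampling-bias argument concrete, here is a minimal sketch (my own illustration, not code from Gymnasium or any particular training library): with the constant reward of 1, a longer-lived policy simply contributes more identically rewarded transitions to the training buffer.

import gymnasium

def count_transitions(env, policy, n_episodes=10):
    """Count how many (identically rewarded) transitions a policy contributes."""
    total = 0
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            # reward is always 1.0 here, even on the terminating step
            total += 1
            done = terminated or truncated
    return total

env = gymnasium.make("CartPole-v1")
random_policy = lambda obs: env.action_space.sample()  # stand-in for a "bad" policy
print(count_transitions(env, random_policy))
# A "good" policy would survive longer and therefore fill the buffer with more
# samples, which is the only signal separating it from a "bad" policy here.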

Quick Performance Analysis

[Figure: CartPole episode-length training curves for DQN, A2C, and PPO]

  • DQN seems to benefit significantly from fixing this bug.
  • A2C seems to benefit slightly from fixing this bug.
  • PPO does not seem to be affected by this bug, as it can already easily solve the problem.

Note: The shaded area is the (min, max) episode length of each algorithm.

Suggestion

Suggestion 1 (the one I recommend)

Fix this bug and update CartPole to v2, and add an argument to recreate the old behavior.

Suggestion 2

Do not fix this bug, but update the description to clearly indicate that this is a "reward-free" variant of the cart pole environment.

Additional references

Old openai/gym related issues: openai/gym#1682, openai/gym#704, openai/gym#21.

This issue has existed since gym=0.0.1 (hello world commit by gdb)

It is the same issue that the MuJoCo pendulum environments had (#500 and #526).

The original CartPole implementation by Sutton did not have a constant reward function, nor did the paper.

Code example

Python 3.11.5 (main, Sep  2 2023, 14:16:33) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gymnasium
>>> env = gymnasium.make("CartPole-v1")
>>> env.reset()
(array([-0.04924537,  0.01854442, -0.0259891 ,  0.00165513], dtype=float32), {})
>>> env.step(env.action_space.sample())
(array([-0.04887448,  0.21402927, -0.025956  , -0.29911307], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.0445939 ,  0.4095114 , -0.03193826, -0.5998677 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.03640367,  0.2148505 , -0.04393562, -0.31741348], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.03210666,  0.02038097, -0.05028389, -0.03890362], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.03169904, -0.1739852 , -0.05106196,  0.2374999 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.03517874, -0.3683419 , -0.04631196,  0.5136492 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.04254558, -0.17259932, -0.03603898,  0.20673935], dtype=float32), 1.0, False, False, {})
>>> 
>>> env.step(env.action_space.sample())
(array([-0.04599757, -0.3671879 , -0.03190419,  0.4878395 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.05334133, -0.5618455 , -0.0221474 ,  0.77029914], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06457824, -0.36642587, -0.00674142,  0.47073072], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07190675, -0.17120934,  0.0026732 ,  0.17593063], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07533094,  0.02387426,  0.00619181, -0.11590779], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07485346,  0.21890694,  0.00387366, -0.4066308 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07047532,  0.41397375, -0.00425896, -0.69809   ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06219584,  0.21891111, -0.01822076, -0.40675083], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.05781762,  0.02405221, -0.02635578, -0.11986759], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.05733658, -0.17068242, -0.02875313,  0.1643852 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06075022,  0.02483909, -0.02546543, -0.13722809], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06025344, -0.16990903, -0.02820999,  0.14731336], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06365162, -0.3646159 , -0.02526372,  0.4309648 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07094394, -0.5593712 , -0.01664442,  0.7155777 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.08213136, -0.7542588 , -0.00233287,  1.0029755 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.09721654, -0.9493495 ,  0.01772664,  1.2949249 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.11620353, -1.1446922 ,  0.04362514,  1.5931041 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.13909738, -0.95011413,  0.07548722,  1.3143365 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.15809965, -1.1461059 ,  0.10177395,  1.6296592 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.18102178, -1.342266  ,  0.13436714,  1.9522467 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.2078671 , -1.148804  ,  0.17341207,  1.7040544 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.23084317, -1.3454461 ,  0.20749316,  2.0453217 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.2577521, -1.1529721,  0.2483996,  1.8233697], dtype=float32), 1.0, True, False, {})
# The last step terminates because the pole has fallen, but the reward we get is `1.0`, not `0.0`

Additional context

if not terminated:
    reward = 1.0
elif self.steps_beyond_terminated is None:
    # Pole just fell!
    self.steps_beyond_terminated = 0
    reward = 1.0
else:
    if self.steps_beyond_terminated == 0:
        logger.warn(
            "You are calling 'step()' even though this "
            "environment has already returned terminated = True. You "
            "should always call 'reset()' once you receive 'terminated = "
            "True' -- any further steps are undefined behavior."
        )
    self.steps_beyond_terminated += 1
    reward = 0.0
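
For reference, here is a rough standalone sketch (hypothetical names, not the actual Gymnasium patch) of the two reward schemes under discussion: the current constant-1 reward versus a Sutton and Barto style 0 / -1 reward.

def cartpole_reward(terminated: bool, sutton_barto_style: bool = False) -> float:
    """Reward for a single step, under either scheme (illustrative only)."""
    if sutton_barto_style:
        # 0 while balancing, -1 on the step where the pole falls
        return -1.0 if terminated else 0.0
    # current behavior: 1 on every step, including the terminating one
    return 1.0

assert cartpole_reward(terminated=True) == 1.0                            # the reported bug
assert cartpole_reward(terminated=True, sutton_barto_style=True) == -1.0  # proposed fix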

Checklist

  • I have checked that there is no similar issue in the repo
Kallinteris-Andreas added the bug (Something isn't working) label on Nov 24, 2023
@pseudo-rnd-thoughts (Member) commented:

Good question and analysis.
As you show, there seems to be minimal impact on the training performance of these algorithms.
I think we should certainly do suggestion 2, to at least explain CartPole v0 and v1.
The question for me is whether we do suggestion 1 as well, which I'm not sure about. Given the minimal impact on performance, and that it is not a "bug" as such but a non-standard implementation, I would say no. But that changes if we want v2 to be a proper replication of Sutton and Barto.

@Kallinteris-Andreas (Collaborator, Author) commented Nov 24, 2023:

The reason I suggest fixing this "bug" is that even though the environment is easily solvable by existing learning algorithms, it may not be solvable by a "future learning algorithm" that is not affected by the execution sampling bias (and therefore cannot solve an effectively reward-free environment). This would cause the developer of that "future learning algorithm" to reject it during the prototyping phase, and not evaluate it further, after it fails on a "simple" CartPole environment.

"But if we want v2 to be a proper replication of Sutton and Barto"

As for the other ways in which CartPole-v1 is not a proper replication of Sutton and Barto, I cannot (to the best of my efforts) find any reason why they would cause a problem in possible future RL research:

  • The physics problem is not identical.
  • The reward function in the original is 0 for normal steps and -1 for failure.

@pseudo-rnd-thoughts (Member) commented:

Sorry, I'm not sure what you are suggesting we should do here. Do you want to create a CartPole v2 that is very different from v0 and v1?

@Kallinteris-Andreas (Collaborator, Author) commented:

To summarize a bit

  1. The reward function is technically wrong (see the figure from the Sutton and Barto book, page 6, for the definition of the reward function); the current reward function indicates that, within a single step, falling is just as good as staying upright.
  2. I believe we should make a new CartPole revision (v2) fixing this bug; I am indifferent to whatever other changes get added to that new revision (v2).

[Image: reward function definition from the Sutton and Barto book]

@pseudo-rnd-thoughts (Member) commented:

"the reward function is technically wrong"

That is fair, but I think the second aspect is just as important: the agent is meant to maximise the total reward over time, and since the episode is terminated after the pole falls, it is not an optimal strategy to fall immediately.
Therefore, I don't think this should cause a significant issue for the training of agents (as your results show).

@balazsgyenes commented:

I don't think it's correct to say that the reward function is wrong. The current reward function still forces an agent to learn to balance the pole in order to maximize return, which is the discounted sum of rewards. That's what an RL algorithm does.

Now, I agree that the reward defined in CartPole is silly, but only because empirically (in my experience), it makes Q-learning much more difficult than it needs to be. The Q values for good actions are effectively unbounded (subject only to the time limit and the discount), so the variance is quite high. However, I agree with @pseudo-rnd-thoughts that the goal should be to accurately reflect the original CartPole problem defined by Sutton and Barto. That's why I was surprised when I looked at the original C code and found that it defines a reward of -1 for failure and 0 otherwise (which is also what I found to be a good reward through experimentation). Not sure what should be done with this information, but there you go.
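
As a rough illustration of the point about Q values (my own back-of-the-envelope check, assuming a discount of 0.99 and CartPole-v1's 500-step time limit): with a constant reward of 1, the value of a state under a surviving policy approaches the geometric-series limit 1 / (1 - gamma), which is far from zero and sensitive to the horizon.

gamma = 0.99
horizon = 500  # CartPole-v1's default time limit (max_episode_steps)

# Discounted return of surviving the full horizon with a reward of 1 per step
q_survive = sum(gamma**t for t in range(horizon))
print(q_survive)          # ~99.3
print(1 / (1 - gamma))    # 100.0, the untruncated geometric-series limit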

@Kallinteris-Andreas (Collaborator, Author) commented:

@pseudo-rnd-thoughts other than aversion to making a new revision, is there another reason you oppose the fixing of this bug?

@pseudo-rnd-thoughts (Member) commented:

@pseudo-rnd-thoughts other than aversion to making a new revision, is there another reason you oppose the fixing of this bug?

Yes, I think the version change is my largest worry, but if we change to 0 and -1 for the rewards, then to me this is a completely different environment compared to the current implementation. Changing only the final reward makes more sense as a change, and we could do this with a simple parameter for the environment, but experimentally I think you have shown this makes minimal difference.

@Kallinteris-Andreas (Collaborator, Author) commented Dec 10, 2023:

I tested Sutton and Barto's reward function with DQN and, as expected, there is no notable difference between it and my suggested reward function (since all that matters for training is the mse_loss between the critic's prediction and the actual reward, the behavior would be the same for something like -100 for failure / -99 for success, or +99 for failure / +100 for success, for all state-of-the-art RL algorithms).
[Figure: CartPole DQN training curves with the Sutton and Barto reward function]

@balazsgyenes commented:

I just want to point out that with -100 for failure and -99 for survival, the optimal behavior would be to end the trajectory as soon as possible, and you would get a very different looking graph.

@wjessup commented Jan 18, 2024:

@balazsgyenes

"The current reward function still forces an agent to learn to balance the pole in order to maximize return, which is the discounted sum of rewards."

This is not the case. You never get a "reward of 50" for staying up 50 steps. Instead, you only get a reward of 1 for each step, even the terminating steps. Then you stuff all those steps into your memory and repeatedly train on them, always setting the 'y_true' in your loss function to +1 more than the current Q value, except in terminal cases, where the target_network predictions are zeroed out, so you always set your 'y_true' == 1.

It's a terrible setup but it's still possible to learn a good policy.

Ultimately, states that are 1 step away from termination (for example: if you go right, you will terminate) will have their Q values approach 2. This is because the terminated state will, over time, end up with an average Q value of 1, so the state before termination will be asked to predict its next state's value, which we know is 1; then we add the reward of 1, and so the 'y_true' passed to the loss function over and over is 2.

You can repeat the logic and see that states 2 steps away will approach 3, etc.
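
A tiny tabular check of this argument (my own sketch, assuming no discounting and the terminal bootstrap zeroed out, as described above): a state k steps before the pole falls converges to a value of k + 1.

def fixed_point_value(steps_from_termination: int, reward: float = 1.0) -> float:
    value = reward  # terminating step: reward only, next-state value zeroed out
    for _ in range(steps_from_termination):
        value = reward + value  # one undiscounted Bellman backup per step away
    return value

print([fixed_point_value(k) for k in range(4)])  # [1.0, 2.0, 3.0, 4.0]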

@wjessup commented Jan 18, 2024:

I think @Kallinteris-Andreas's suggestion option 1 is good, but with reward = 1 for non-terminated steps and reward = -1 for terminated steps. This will move the gradients up or down (away from 0) in both cases, which is good.

I would add to it: ensure that env.unwrapped.kinematics_integrator = False is the default, or just remove this branch entirely and do the physics correctly. See: openai/gym#3254 (comment).

Changing this setting also reduces training time, because we store the correct 'x' and 'theta' values in the state for training, rather than the previous state's x and theta, which is what it does now by default.

Additionally, changing the terminal reward to -1 helps reduce training cycles that contribute nothing at all to learning the policy, which is what happens when the terminal state is also given a reward of 1. Here's why:

The initial Q values of a randomly initialized network are near zero.

With the default reward of 1 for everything, even terminal states will have Q values that approach 1 over time. All of the initial training pushes both "good" middle-ish states and terminal states up to Q >= 1, and only above 1 do the Q values start to differentiate and drive any learning.

@wjessup commented Jan 18, 2024:

One other note, https://github.com/marek-robak/Double-cartpole-custom-gym-env-for-reinforcement-learning uses a different reward approach: "The reward function consists of linear dependence of the cart distance from the center of the available space and a -50 penalty for premature termination of the episode."

I don't think we should go in this direction, because it changes the problem too much. I think we should stay as close as possible to the current reward scheme, because it is actually an interesting problem to dissect: "how is it possible to learn a policy when the only rewards you get are from termination and nothing else?"

And interestingly, with enough training, the policies will eventually keep the pole up and in the middle nearly perfectly. So the termination cases become boundaries that the rest of the non-terminal cases smooth over.

@Kallinteris-Andreas (Collaborator, Author) commented:

@pseudo-rnd-thoughts I have made my arguments for why it should be fixed; primarily, that reference environments like CartPole should be correct. A few more people have weighed in with some additional analysis of the effects of the bug.

If you believe it is not worth fixing, please close this issue and label it as wontfix.

@pseudo-rnd-thoughts (Member) commented:

I think my preferred solution would be to add a sutton_barto: bool = False parameter to __init__, so that the reward function either matches Sutton and Barto or uses the previous implementation (defaulting to the previous implementation).
This would allow us to keep v1 and just add a parameter that changes the reward function for users to test with.
Then we just update the documentation to reflect these changes.

Does that work for you, @Kallinteris-Andreas?

@Kallinteris-Andreas (Collaborator, Author) commented:

I would add sutton_barto_reward: bool = False (I would set it to True by default, but I do not really care).

I should have time next week to work on it, or if you prefer to start it, go for it.
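
For illustration, usage of the proposed flag might look like the sketch below (parameter name as suggested in this thread; the exact name and default in the merged fix may differ).

import gymnasium

# Hypothetical: assumes the flag proposed above is merged under this name.
env = gymnasium.make("CartPole-v1", sutton_barto_reward=True)
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
# With the flag enabled, reward would be 0.0 while balancing and -1.0 on the
# step where the episode terminates; with it disabled, the current behavior
# (reward 1.0 on every step) is preserved.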
