[Bug Report] CartPole's reward function constantly returns 1 (even when it falls) #790

Closed · Fixed by #958
Kallinteris-Andreas opened this issue Nov 24, 2023 · 16 comments
Labels: bug (Something isn't working)
@Kallinteris-Andreas (Collaborator) commented Nov 24, 2023:

Describe the bug

The CartPole environment returns reward == 1 when the pole is "standing" and also reward == 1 when the pole has "fallen".

The old gym documentation described this behavior, and so does the current documentation, which would suggest it is intended; however, I can find no evidence that this was the design goal.

The argument could be made that "state-of-the-art algorithms (e.g. DQN, A2C, PPO) can easily solve this environment anyway, so why bother?". It is true that the environment is very easy to solve, but it is still important to examine why it is solvable at all, and what the consequences are for these learning algorithms and potential future ones.

The reason the algorithms are able to learn a policy, despite the environment being effectively reward-free, is the sampling bias introduced by the number of transitions collected under a given policy.
If the agent has a "good policy", it will on average stay alive longer than with a "bad policy", so the "good policy" will on average be sampled more often by the train() subroutine of the learning algorithm, and therefore the "good policy" will be reinforced more than the "bad policy" (instead of the "good policy" being reinforced and the "bad policy" being reduced, which is what normally happens in RL).

This means that RL algorithms that are not affected by this sampling bias (or are much less affected by it) would not be able to learn a policy for the CartPole-v1 environment.
CartPole is also where many RL algorithms are tested during the prototyping phase (on the assumption that if a learning algorithm cannot solve CartPole, it is probably a dud).
Therefore, an algorithm that is less affected by this sampling bias might fail the prototyping phase solely because of the incorrect implementation of Gymnasium/CartPole.
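
To make the sampling-bias argument concrete, here is a minimal sketch (my own illustration, not code from Gymnasium or any particular training library): with the constant reward of 1, a longer-lived policy simply contributes more identically rewarded transitions to the training buffer.

import gymnasium

def count_transitions(env, policy, n_episodes=10):
    """Count how many (identically rewarded) transitions a policy contributes."""
    total = 0
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            # reward is always 1.0 here, even on the terminating step
            total += 1
            done = terminated or truncated
    return total

env = gymnasium.make("CartPole-v1")
random_policy = lambda obs: env.action_space.sample()  # stand-in for a "bad" policy
print(count_transitions(env, random_policy))
# A "good" policy would survive longer and therefore fill the buffer with more
# samples, which is the only signal separating it from a "bad" policy here.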

Quick Performance Analysis

[Figure: CartPole episode-length training curves for DQN, A2C, and PPO]

  • DQN seems to benefit significantly from fixing this bug.
  • A2C seems to benefit slightly from fixing this bug.
  • PPO does not seem to be affected by this bug, as it can already easily solve the problem.

Note: The shaded area is the (min, max) episode length of each algorithm.

Suggestion

Suggestion 1 (the one I recommend)

Fix this bug and update CartPole to v2, and add an argument to recreate the old behavior.

Suggestion 2

Do not fix this bug, but update the description to clearly indicate that this is a "reward-free" variant of the cart pole environment.

Additional references

Old openai/gym related issues: openai/gym#1682, openai/gym#704, openai/gym#21.

This issue has existed since gym=0.0.1 (hello world commit by gdb)

It is the same issue that the MuJoCo pendulum environments had (#500 and #526).

The original CartPole implementation by Sutton did not have a constant reward function, nor did the paper.

Code example

Python 3.11.5 (main, Sep  2 2023, 14:16:33) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gymnasium
>>> env = gymnasium.make("CartPole-v1")
>>> env.reset()
(array([-0.04924537,  0.01854442, -0.0259891 ,  0.00165513], dtype=float32), {})
>>> env.step(env.action_space.sample())
(array([-0.04887448,  0.21402927, -0.025956  , -0.29911307], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.0445939 ,  0.4095114 , -0.03193826, -0.5998677 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.03640367,  0.2148505 , -0.04393562, -0.31741348], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.03210666,  0.02038097, -0.05028389, -0.03890362], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.03169904, -0.1739852 , -0.05106196,  0.2374999 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.03517874, -0.3683419 , -0.04631196,  0.5136492 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.04254558, -0.17259932, -0.03603898,  0.20673935], dtype=float32), 1.0, False, False, {})
>>> 
>>> env.step(env.action_space.sample())
(array([-0.04599757, -0.3671879 , -0.03190419,  0.4878395 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.05334133, -0.5618455 , -0.0221474 ,  0.77029914], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06457824, -0.36642587, -0.00674142,  0.47073072], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07190675, -0.17120934,  0.0026732 ,  0.17593063], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07533094,  0.02387426,  0.00619181, -0.11590779], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07485346,  0.21890694,  0.00387366, -0.4066308 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07047532,  0.41397375, -0.00425896, -0.69809   ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06219584,  0.21891111, -0.01822076, -0.40675083], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.05781762,  0.02405221, -0.02635578, -0.11986759], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.05733658, -0.17068242, -0.02875313,  0.1643852 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06075022,  0.02483909, -0.02546543, -0.13722809], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06025344, -0.16990903, -0.02820999,  0.14731336], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.06365162, -0.3646159 , -0.02526372,  0.4309648 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.07094394, -0.5593712 , -0.01664442,  0.7155777 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.08213136, -0.7542588 , -0.00233287,  1.0029755 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.09721654, -0.9493495 ,  0.01772664,  1.2949249 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.11620353, -1.1446922 ,  0.04362514,  1.5931041 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.13909738, -0.95011413,  0.07548722,  1.3143365 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.15809965, -1.1461059 ,  0.10177395,  1.6296592 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.18102178, -1.342266  ,  0.13436714,  1.9522467 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.2078671 , -1.148804  ,  0.17341207,  1.7040544 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.23084317, -1.3454461 ,  0.20749316,  2.0453217 ], dtype=float32), 1.0, False, False, {})
>>> env.step(env.action_space.sample())
(array([-0.2577521, -1.1529721,  0.2483996,  1.8233697], dtype=float32), 1.0, True, False, {})
# The last step terminates because the pole has fallen, but the reward we get is `1.0`, not `0.0`

Additional context

if not terminated:
    reward = 1.0
elif self.steps_beyond_terminated is None:
    # Pole just fell!
    self.steps_beyond_terminated = 0
    reward = 1.0
else:
    if self.steps_beyond_terminated == 0:
        logger.warn(
            "You are calling 'step()' even though this "
            "environment has already returned terminated = True. You "
            "should always call 'reset()' once you receive 'terminated = "
            "True' -- any further steps are undefined behavior."
        )
    self.steps_beyond_terminated += 1
    reward = 0.0
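
For reference, here is a rough standalone sketch (hypothetical names, not the actual Gymnasium patch) of the two reward schemes under discussion: the current constant-1 reward versus a Sutton and Barto style 0 / -1 reward.

def cartpole_reward(terminated: bool, sutton_barto_style: bool = False) -> float:
    """Reward for a single step, under either scheme (illustrative only)."""
    if sutton_barto_style:
        # 0 while balancing, -1 on the step where the pole falls
        return -1.0 if terminated else 0.0
    # current behavior: 1 on every step, including the terminating one
    return 1.0

assert cartpole_reward(terminated=True) == 1.0                            # the reported bug
assert cartpole_reward(terminated=True, sutton_barto_style=True) == -1.0  # proposed fix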

Checklist

  • I have checked that there is no similar issue in the repo
Kallinteris-Andreas added the bug (Something isn't working) label on Nov 24, 2023
@pseudo-rnd-thoughts (Member) commented:

Good question and analysis.
As you show, there seems to be minimal impact on the training performance of these algorithms.
I think we should certainly do suggestion 2, to at least explain CartPole v0 and v1.
The question for me is whether we do suggestion 1 as well, which I'm not sure about. Given the minimal impact on performance, and that it is not a "bug" as such but a non-standard implementation, I would say no. But that changes if we want v2 to be a proper replication of Sutton and Barto.

@Kallinteris-Andreas (Collaborator, Author) commented Nov 24, 2023:

The reason I suggest fixing this "bug" is that even though the environment is easily solvable by existing learning algorithms, it may not be solvable by a "future learning algorithm" that is not affected by the execution sampling bias (and therefore cannot solve an effectively reward-free environment). This would cause the developer of that "future learning algorithm" to reject it during the prototyping phase, and not evaluate it further, after it fails on a "simple" CartPole environment.

"But if we want v2 to be a proper replication of Sutton and Barto"

As for the other ways in which CartPole-v1 is not a proper replication of Sutton and Barto, I cannot (to the best of my efforts) find any reason why they would cause a problem in possible future RL research:

  • The physics problem is not identical.
  • The reward function in the original is 0 for normal steps and -1 for failure.

@pseudo-rnd-thoughts (Member) commented:

Sorry, I'm not sure what you are suggesting we should do here. Do you want to create a CartPole v2 that is very different from v0 and v1?

@Kallinteris-Andreas (Collaborator, Author) commented:

To summarize a bit

  1. The reward function is technically wrong (see the figure from the Sutton and Barto book, page 6, for the definition of the reward function); the current reward function indicates that, within a single step, falling is just as good as staying upright.
  2. I believe we should make a new CartPole revision (v2) fixing this bug; I am indifferent to whatever other changes get added to that new revision (v2).

[Image: reward function definition from the Sutton and Barto book]

@pseudo-rnd-thoughts (Member) commented:

"the reward function is technically wrong"

That is fair, but I think the second aspect is just as important: the agent is meant to maximise the total reward over time, and since the episode is terminated after the pole falls, it is not an optimal strategy to fall immediately.
Therefore, I don't think this should cause a significant issue for the training of agents (as your results show).

@balazsgyenes commented:

I don't think it's correct to say that the reward function is wrong. The current reward function still forces an agent to learn to balance the pole in order to maximize return, which is the discounted sum of rewards. That's what an RL algorithm does.

Now, I agree that the reward defined in CartPole is silly, but only because empirically (in my experience), it makes Q-learning much more difficult than it needs to be. The Q values for good actions are effectively unbounded (subject only to the time limit and the discount), so the variance is quite high. However, I agree with @pseudo-rnd-thoughts that the goal should be to accurately reflect the original CartPole problem defined by Sutton and Barto. That's why I was surprised when I looked at the original C code and found that it defines a reward of -1 for failure and 0 otherwise (which is also what I found to be a good reward through experimentation). Not sure what should be done with this information, but there you go.
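
As a rough illustration of the point about Q values (my own back-of-the-envelope check, assuming a discount of 0.99 and CartPole-v1's 500-step time limit): with a constant reward of 1, the value of a state under a surviving policy approaches the geometric-series limit 1 / (1 - gamma), which is far from zero and sensitive to the horizon.

gamma = 0.99
horizon = 500  # CartPole-v1's default time limit (max_episode_steps)

# Discounted return of surviving the full horizon with a reward of 1 per step
q_survive = sum(gamma**t for t in range(horizon))
print(q_survive)          # ~99.3
print(1 / (1 - gamma))    # 100.0, the untruncated geometric-series limit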

@Kallinteris-Andreas (Collaborator, Author) commented:

@pseudo-rnd-thoughts other than aversion to making a new revision, is there another reason you oppose the fixing of this bug?

@pseudo-rnd-thoughts (Member) commented:

@pseudo-rnd-thoughts other than aversion to making a new revision, is there another reason you oppose the fixing of this bug?

Yes, I think the version change is my largest worry, but if we change to 0 and -1 for the rewards, then to me this is a completely different environment compared to the current implementation. Changing only the final reward makes more sense as a change, and we could do this with a simple parameter for the environment, but experimentally I think you have shown this makes minimal difference.

@Kallinteris-Andreas (Collaborator, Author) commented Dec 10, 2023:

I tested Sutton and Barto's reward function with DQN and, as expected, there is no notable difference between it and my suggested reward function (since all that matters for training is the mse_loss between the critic's prediction and the actual reward, the behavior would be the same for something like -100 for failure / -99 for success, or +99 for failure / +100 for success, for all state-of-the-art RL algorithms).
[Figure: CartPole DQN training curves with the Sutton and Barto reward function]

@balazsgyenes commented:

I just want to point out that with -100 for failure and -99 for survival, the optimal behavior would be to end the trajectory as soon as possible, and you would get a very different looking graph.

@wjessup commented Jan 18, 2024:

@balazsgyenes

"The current reward function still forces an agent to learn to balance the pole in order to maximize return, which is the discounted sum of rewards."

This is not the case. You never get a "reward of 50" for staying up 50 steps. Instead, you only get a reward of 1 for each step, even the terminating steps. Then you stuff all those steps into your memory and repeatedly train on them, always setting the 'y_true' in your loss function to +1 more than the current Q value, except in terminal cases, where the target_network predictions are zeroed out, so you always set your 'y_true' == 1.

It's a terrible setup but it's still possible to learn a good policy.

Ultimately, states that are 1 step away from termination (for example: if you go right, you will terminate) will have their Q values approach 2. This is because the terminated state will, over time, end up with an average Q value of 1, so the state before termination will be asked to predict its next state's value, which we know is 1; then we add the reward of 1, and so the 'y_true' passed to the loss function over and over is 2.

You can repeat the logic and see that states 2 steps away will approach 3, etc.
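
A tiny tabular check of this argument (my own sketch, assuming no discounting and the terminal bootstrap zeroed out, as described above): a state k steps before the pole falls converges to a value of k + 1.

def fixed_point_value(steps_from_termination: int, reward: float = 1.0) -> float:
    value = reward  # terminating step: reward only, next-state value zeroed out
    for _ in range(steps_from_termination):
        value = reward + value  # one undiscounted Bellman backup per step away
    return value

print([fixed_point_value(k) for k in range(4)])  # [1.0, 2.0, 3.0, 4.0]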

@wjessup commented Jan 18, 2024:

I think @Kallinteris-Andreas's suggestion option 1 is good, but with reward = 1 for non-terminated steps and reward = -1 for terminated steps. This will move the gradients up or down (away from 0) in both cases, which is good.

I would add to it: ensure that env.unwrapped.kinematics_integrator = False is the default, or just remove this branch entirely and do the physics correctly. See: openai/gym#3254 (comment).

Changing this setting also reduces training time, because we store the correct 'x' and 'theta' values in the state for training, rather than the previous state's x and theta, which is what it does now by default.

Additionally, changing the terminal reward to -1 helps reduce training cycles that contribute nothing at all to learning the policy, which is what happens when the terminal state is also given a reward of 1. Here's why:

The initial Q values of a randomly initialized network are near zero.

With the default reward of 1 for everything, even terminal states will have Q values that approach 1 over time. All of the initial training pushes both "good" middle-ish states and terminal states up to Q >= 1, and only above 1 do the Q values start to differentiate and drive any learning.

@wjessup commented Jan 18, 2024:

One other note, https://github.com/marek-robak/Double-cartpole-custom-gym-env-for-reinforcement-learning uses a different reward approach: "The reward function consists of linear dependence of the cart distance from the center of the available space and a -50 penalty for premature termination of the episode."

I don't think we should go in this direction, because it changes the problem too much. I think we should stay as close as possible to the current reward scheme, because it is actually an interesting problem to dissect: "how is it possible to learn a policy when the only rewards you get are from termination and nothing else?"

And interestingly, with enough training, the policies will eventually keep the pole up and in the middle nearly perfectly. So the termination cases become boundaries that the rest of the non-terminal cases smooth over.

@Kallinteris-Andreas (Collaborator, Author) commented:

@pseudo-rnd-thoughts I have made my arguments for why it should be fixed; primarily, that reference environments like CartPole should be correct. A few more people have weighed in with some additional analysis of the effects of the bug.

If you believe it is not worth fixing, please close this issue and label it as wontfix.

@pseudo-rnd-thoughts (Member) commented:

I think my preferred solution would be to add a sutton_barto: bool = False parameter to __init__, so that the reward function either matches Sutton and Barto or uses the previous implementation (defaulting to the previous implementation).
This would allow us to keep v1 and just add a parameter that changes the reward function for users to test with.
Then we just update the documentation to reflect these changes.

Does that work for you, @Kallinteris-Andreas?

@Kallinteris-Andreas (Collaborator, Author) commented:

I would add sutton_barto_reward: bool = False (I would set it to True by default, but I do not really care).

I should have time next week to work on it, or if you prefer to start it, go for it.
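
For illustration, usage of the proposed flag might look like the sketch below (parameter name as suggested in this thread; the exact name and default in the merged fix may differ).

import gymnasium

# Hypothetical: assumes the flag proposed above is merged under this name.
env = gymnasium.make("CartPole-v1", sutton_barto_reward=True)
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
# With the flag enabled, reward would be 0.0 while balancing and -1.0 on the
# step where the episode terminates; with it disabled, the current behavior
# (reward 1.0 on every step) is preserved.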
