
bug fix: compute_reward for batch input #153

Merged
merged 1 commit into Farama-Foundation:main on Jun 20, 2023

Conversation

@nicehiro (Contributor) commented May 19, 2023

Description

Fix: the compute_reward function cannot compute rewards for batch input in the AntMaze environments.


  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@rodrigodelazcano (Member)

Hi @nicehiro! Can you give me a bit more detail on what type of bug this is? A code example would help.

Also, can you run pre-commit (https://pre-commit.com/) locally to fix the CI?

  elif self.reward_type == "sparse":
-     return 1.0 if np.linalg.norm(achieved_goal - desired_goal) <= 0.45 else 0.0
+     return -(d > 0.45).astype(np.float32)
@rodrigodelazcano (Member) commented on the diff:


Why are you changing the reward function? With your implementation it returns -1 if the goal is not reached and 0 otherwise.

@nicehiro (Contributor, Author)

Hi @rodrigodelazcano, thanks for your great work!

Can you give me a bit more detail on what type of bug this is?

We expect to get a list of rewards, one for each observation passed to the compute_reward function. This is helpful when we use HER to re-compute rewards. FYI, here's the implementation of HER in Stable-Baselines.

Why are you changing the reward function? With your implementation it returns -1 if the goal is not reached and 0 otherwise.

-1 is more convenient for training in a sparse-reward environment, because a reward of 0 generates no gradient.

By the way, all the changes follow the design here.

Here's a simple example:

import gymnasium as gym
import gymnasium_robotics  # noqa: F401 -- registers the AntMaze environments
import numpy as np

env = gym.make('AntMaze_UMaze-v3')
obs, _ = env.reset()

# compute the reward for a single observation
# (signature: compute_reward(achieved_goal, desired_goal, info))
desired_goal = obs['desired_goal']
achieved_goal = obs['achieved_goal']
reward = env.compute_reward(achieved_goal, desired_goal, None)

# compute rewards for a batch of observations
batch_desired_goal = np.array([desired_goal for _ in range(10)])
batch_achieved_goal = np.array([achieved_goal for _ in range(10)])
rewards = env.compute_reward(batch_achieved_goal, batch_desired_goal, None)

Before:

reward: 0.0
rewards: 0.0

After:

reward: -1.0
rewards: [-1. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
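
For context, the kind of HER-style relabeling this enables is roughly the following (a hypothetical sketch reusing env from the example above, not the Stable-Baselines implementation):

import numpy as np

# Hypothetical relabeling: goals achieved later in an episode are substituted
# as desired goals, and rewards for the whole batch are recomputed in one call.
# AntMaze goals are 2-D (x, y) positions.
achieved_goals = np.random.uniform(-1.0, 1.0, size=(10, 2))
relabeled_goals = np.random.uniform(-1.0, 1.0, size=(10, 2))
relabeled_rewards = env.compute_reward(achieved_goals, relabeled_goals, None)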

@pseudo-rnd-thoughts (Member) left a comment

@nicehiro The suggested reward change makes sense, but as we are trying to preserve the original (now unmaintained) work, we are going to keep the reward function as is.
If you want to use the alternative reward function, I would recommend using a reward wrapper to modify the value.
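
For example, a minimal reward-wrapper sketch (an illustration only; it assumes the kept sparse reward is 1.0 on success and 0.0 otherwise):

import gymnasium as gym
import gymnasium_robotics  # noqa: F401 -- registers the AntMaze environments
from gymnasium.wrappers import TransformReward

env = gym.make("AntMaze_UMaze-v3")
# map the 1.0 / 0.0 sparse reward onto the 0.0 / -1.0 convention
env = TransformReward(env, lambda r: r - 1.0)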

Could you add a test that confirms the function works as expected with single and batched inputs?
Otherwise, this looks good to merge.

@pseudo-rnd-thoughts (Member)

@nicehiro Could you add the multi-fetch environment in a different PR?

@nicehiro (Contributor, Author)

@nicehiro Could you add the multi-fetch environment in a different PR?

Yes. Sorry about that.

Could you add a test that confirms the function works as expected with single and batched inputs?

Yes. Is it OK if I add a test in tests/env/fetch/?

@pseudo-rnd-thoughts (Member)

Yes. Is it OK if I add a test in tests/env/fetch/?

Yes, I think that makes sense.
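
A pytest sketch of such a test (a hypothetical illustration, not the exact test added in the PR):

import gymnasium as gym
import gymnasium_robotics  # noqa: F401 -- registers the AntMaze environments
import numpy as np


def test_compute_reward_single_and_batch():
    env = gym.make("AntMaze_UMaze-v3")
    obs, _ = env.reset(seed=0)
    achieved = obs["achieved_goal"]
    desired = obs["desired_goal"]

    # a single goal pair should give a single (scalar) reward
    single = env.compute_reward(achieved, desired, None)
    assert np.asarray(single).shape == ()

    # a batch of goal pairs should give one reward per pair
    batch_achieved = np.tile(achieved, (10, 1))
    batch_desired = np.tile(desired, (10, 1))
    batch = env.compute_reward(batch_achieved, batch_desired, None)
    assert np.asarray(batch).shape == (10,)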

  elif self.reward_type == "sparse":
-     return 1.0 if np.linalg.norm(achieved_goal - desired_goal) <= 0.45 else 0.0
+     return -(d > 0.45).astype(np.float32)
@Kallinteris-Andreas (Collaborator) commented on the diff:

fp64 not fp32
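
For illustration (a standalone numpy snippet, not the repository code), the point is that the reward array should be float64 rather than float32:

import numpy as np

d = np.array([0.3, 0.9])  # hypothetical goal distances
print((-(d > 0.45).astype(np.float32)).dtype)  # float32 -- as written in the diff
print((-(d > 0.45).astype(np.float64)).dtype)  # float64 -- what the review asks for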

@rodrigodelazcano (Member)

I'm merging this. Thank you @nicehiro, and thanks for the reviews @pseudo-rnd-thoughts and @Kallinteris-Andreas.

@rodrigodelazcano merged commit 65bfa85 into Farama-Foundation:main on Jun 20, 2023
12 checks passed
@Kallinteris-Andreas (Collaborator) commented Jun 21, 2023

@rodrigodelazcano Changing the reward function would require a new environment revision.

Also, my suggested changes have not been applied; should we revert?

Kallinteris-Andreas added a commit to Kallinteris-Andreas/Gymnasium-Robotics-Kalli that referenced this pull request Jun 21, 2023
@pseudo-rnd-thoughts (Member)

@Kallinteris-Andreas I think we should revert the reward function change, but the batch-input fix shouldn't require a version change.

@rodrigodelazcano (Member)

Sorry @pseudo-rnd-thoughts and @Kallinteris-Andreas. I thought the changes were made.

Can you make a PR with the old reward function?

@Kallinteris-Andreas (Collaborator)

Also, I have no idea what axis=-1 does.

@pseudo-rnd-thoughts (Member)

Yeah, we need testing for this as well.

@rodrigodelazcano (Member)

Also, I have no idea what axis=-1 does.

Basically, this is used to compute vector norms along a specific axis of a matrix, instead of a single norm of the full matrix. In this case, it is the last axis.

Since the fix is for computing rewards for batch inputs of achieved and desired goals, this addition is required.
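
For illustration, a standalone numpy snippet with hypothetical goal values:

import numpy as np

# a batch of 3 goal pairs, each goal being a 2-D (x, y) position
achieved = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
desired = np.array([[0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])

# without axis, np.linalg.norm collapses the whole matrix to one scalar
print(np.linalg.norm(achieved - desired))           # 2.236...
# with axis=-1, it returns one distance per goal pair
print(np.linalg.norm(achieved - desired, axis=-1))  # [1. 0. 2.]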

Am I missing anything, @nicehiro?

@nicehiro (Contributor, Author)

nicehiro commented Jun 21, 2023 via email

@Kallinteris-Andreas (Collaborator)

To clarify, I know what axis=1 does, but I do not know what axis=-1 does.

@pseudo-rnd-thoughts (Member)

-1 is the last axis, I believe.
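
For what it's worth, a quick standalone check (axis=-1 is the last axis, so it matches axis=1 for a 2-D batch and still works for a single 1-D goal, where axis=1 would raise an error):

import numpy as np

batch = np.arange(6.0).reshape(3, 2)   # shape (3, 2)
single = np.array([3.0, 4.0])          # shape (2,)

print(np.linalg.norm(batch, axis=-1))  # same result as axis=1 for a 2-D array
print(np.linalg.norm(batch, axis=1))
print(np.linalg.norm(single, axis=-1)) # 5.0; axis=1 would fail for a 1-D array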
