
Adgudime/async sac #39

Merged: 5 commits into releases/0.8.6 from adgudime/async_sac on Sep 21, 2020
Conversation

@AdityaGudimella commented Sep 20, 2020

Why are these changes needed?

Fixes a bug in the execution plan where setting prioritized_replay to False in the config does not actually turn it off.
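For context, here is a minimal, hypothetical sketch of the intended behavior. It is not RLlib's actual execution-plan code; the names VanillaReplayBuffer, PrioritizedReplayBuffer, and make_local_buffer are illustrative stand-ins:

```python
# Hypothetical sketch of the fix's intent (names are stand-ins, not RLlib's API):
# the execution plan should build whichever buffer type the config flag asks for.
import random
from collections import deque


class VanillaReplayBuffer:
    """Uniform-sampling buffer: no priorities, no importance weights."""

    def __init__(self, capacity):
        self._storage = deque(maxlen=capacity)

    def add(self, transition):
        self._storage.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self._storage), batch_size)


class PrioritizedReplayBuffer(VanillaReplayBuffer):
    """Stand-in for a TD-error-prioritized buffer."""


def make_local_buffer(config):
    # Before the fix, the plan effectively always built the prioritized buffer;
    # after the fix, prioritized_replay=False selects the vanilla buffer.
    if config.get("prioritized_replay", True):
        return PrioritizedReplayBuffer(config["buffer_size"])
    return VanillaReplayBuffer(config["buffer_size"])


buf = make_local_buffer({"prioritized_replay": False, "buffer_size": 50000})
print(type(buf).__name__)  # VanillaReplayBuffer
```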

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/latest/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested (please justify below)

RuofanKong previously approved these changes Sep 21, 2020

@RuofanKong (Collaborator) left a comment

Just have one question; otherwise, looks good to me! Thank you @AdityaGudimella!

rllib/tests/agents/parameters.py (outdated, resolved)
RuofanKong previously approved these changes Sep 21, 2020

@RuofanKong (Collaborator) left a comment

LGTM!

@Edilmo left a comment

I just took a quick look, so maybe I missed important details, but my thoughts are the following:

  • Is there a simpler way to make this flag available without duplicating all this code? (I think it's possible.)
  • Why do you want to turn off prioritized replay in the first place?
  • Why does this reduce the sensitivity to episode horizon?

rllib/agents/sac/apex.py (outdated, resolved)
@@ -41,6 +69,243 @@
# __sphinx_doc_end__
# yapf: enable


class LocalVanillaReplayBuffer(LocalReplayBuffer):

What is the difference here with the LocalReplayBuffer, besides not including the weights and the indices in the batch?

@AdityaGudimella (Author) left a comment

The three changes are:

  1. PrioritizedReplayBuffer vs VanillaReplayBuffer
  2. SampleBatch containing different elements
  3. Stats not containing update_priorities time.
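As a rough illustration only (not the PR's exact code), the sampled batch differs like this, matching the list above; the field names weights and batch_idxs follow the discussion in this thread:

```python
# Rough illustration, not the PR's exact code: the fields of the sampled batch
# in the two cases. Only the prioritized buffer adds weights and batch indexes.
def sample_batch_fields(prioritized: bool) -> list:
    fields = ["obs", "actions", "rewards", "new_obs", "dones"]
    if prioritized:
        # Added only by the PrioritizedReplayBuffer path.
        fields += ["weights", "batch_idxs"]
    return fields


print(sample_batch_fields(prioritized=False))  # no weights / batch_idxs
print(sample_batch_fields(prioritized=True))
```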

"rewards": rewards,
"new_obs": obses_tp1,
"dones": dones,
# "weights": weights,

Why are we excluding the weights and the indices?


What you want is to prevent the update_priorities method from performing its job, right?

@AdityaGudimella (Author) left a comment

Yes. update_priorities should not do anything, but also, weights and batch_idxs are not returned by the vanilla replay buffer; they are only returned by the PrioritizedReplayBuffer.
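A hedged sketch of that behavior (assumed names, not the PR's code):

```python
# Hedged sketch (assumed names, not the PR's code): with uniform replay there
# are no per-item priorities, so updating them degenerates to a no-op.
class VanillaReplayBuffer:
    def update_priorities(self, batch_idxs=None, td_errors=None):
        # Nothing to update: uniform sampling keeps no priority values, and the
        # sampled batch carries no batch_idxs or weights in the first place.
        pass
```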

@RuofanKong (Collaborator) left a comment

I just took a quick look, so maybe I missed important details, but my thoughts are the following:

  • Is there a simpler way to make this flag available without duplicating all this code? (I think it's possible.)
  • Why do you want to turn off prioritized replay in the first place?
  • Why does this reduce the sensitivity to episode horizon?

@Edilmo First, tuning "no-done-at-end" shouldn't be the right way to go, for two major reasons:
a) Given a fixed number of workers and simulators, Apex-SAC on our Ray-forked RLlib converges very well, without large variance, even without tuning this hyper-parameter.
b) It is not a generic hyper-parameter that works in the general case. Even when tuning it helps, it only works for infinite-horizon RL problems; finite-horizon RL problems will never converge with it, because it breaks the MDP sequence.

Second, back to the questions: why turn off the prioritized replay buffer, and why does this reduce the sensitivity to episode horizon?
It is not episode-horizon sensitivity at all, although the symptom can look related. Briefly, the existing Ape-X algorithms (Ape-X SAC aside) all optimize the objective of maximizing expected reward. To train both the critic and actor networks while serving samples efficiently, they use weighted importance sampling with TD-error priorities (the k-largest TD errors ranked highest), which concentrates the gradient updates where they are most useful. Energy-based algorithms are different: their core objective adds policy entropy maximization on top of expected reward, and that entropy term is what drives exploration. If the Ape-X framework on an energy-based algorithm keeps prioritizing by TD error, it loses a significant number of high-entropy samples, which vary a lot. What gets sampled then depends directly on how sampling is scheduled, so training becomes sensitive to the number of sampling workers and simulators, and this is exaggerated when that number is not fixed, e.g. with auto-scaling of simulators on the Bonsai MT platform, because the sample distribution changes even though the same amount of data is sampled. Training eventually converges, but this is generally where the high variance in convergence comes from. It also pushes us to think about what the right prioritized replay buffer is, from a theoretical perspective, for energy-based algorithms (Soft Q-Learning, SAC, etc.).
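As a small numeric illustration of that skew (not from the PR; the TD-error values are made up, and alpha = 0.6 is the usual prioritized-replay exponent):

```python
# Small numeric illustration (values made up): with TD-error-proportional
# priorities, transitions with small TD error (which can still be the
# high-entropy samples SAC cares about) are drawn far less often than
# under uniform replay.
td_errors = [2.0, 1.5, 0.1, 0.05]   # |TD error| of four stored transitions
alpha = 0.6                          # usual prioritized-replay exponent

priorities = [abs(e) ** alpha for e in td_errors]
total = sum(priorities)
p_prioritized = [p / total for p in priorities]
p_uniform = [1.0 / len(td_errors)] * len(td_errors)

for i, (pp, pu) in enumerate(zip(p_prioritized, p_uniform)):
    print(f"transition {i}: prioritized={pp:.3f} vs uniform={pu:.3f}")
```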

These high-level details may not give you a full understanding if you don't have a deep background in energy-based RL algorithms, prioritized replay buffers, etc., so here are the recommended papers to help:

@Edilmo left a comment

LGTM

Great finding, guys!!!

@Edilmo commented Sep 21, 2020


I think you totally misunderstood my comments. They are just about avoiding code duplication.

@AdityaGudimella AdityaGudimella merged commit ffb5fe0 into releases/0.8.6 Sep 21, 2020
@Edilmo Edilmo deleted the adgudime/async_sac branch December 7, 2020 07:18