
Adgudime/async sac #39

Merged: 5 commits into releases/0.8.6 from adgudime/async_sac on Sep 21, 2020
Conversation

@AdityaGudimella commented Sep 20, 2020

Why are these changes needed?

Fixes a bug in the execution plan where setting prioritized_replay to False in the config does not actually turn it off.
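For context, here is a minimal, hypothetical sketch of the intended behavior. It is not RLlib's actual execution-plan code; the names VanillaReplayBuffer, PrioritizedReplayBuffer, and make_local_buffer are illustrative stand-ins:

```python
# Hypothetical sketch of the fix's intent (names are stand-ins, not RLlib's API):
# the execution plan should build whichever buffer type the config flag asks for.
import random
from collections import deque


class VanillaReplayBuffer:
    """Uniform-sampling buffer: no priorities, no importance weights."""

    def __init__(self, capacity):
        self._storage = deque(maxlen=capacity)

    def add(self, transition):
        self._storage.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self._storage), batch_size)


class PrioritizedReplayBuffer(VanillaReplayBuffer):
    """Stand-in for a TD-error-prioritized buffer."""


def make_local_buffer(config):
    # Before the fix, the plan effectively always built the prioritized buffer;
    # after the fix, prioritized_replay=False selects the vanilla buffer.
    if config.get("prioritized_replay", True):
        return PrioritizedReplayBuffer(config["buffer_size"])
    return VanillaReplayBuffer(config["buffer_size"])


buf = make_local_buffer({"prioritized_replay": False, "buffer_size": 50000})
print(type(buf).__name__)  # VanillaReplayBuffer
```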

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/latest/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested (please justify below)

RuofanKong previously approved these changes Sep 21, 2020

@RuofanKong (Collaborator) left a comment

Just have one question; otherwise, looks good to me! Thank you @AdityaGudimella!

rllib/tests/agents/parameters.py (outdated, resolved)
RuofanKong previously approved these changes Sep 21, 2020

@RuofanKong (Collaborator) left a comment

LGTM!

@Edilmo left a comment

I just took a quick look, so maybe I missed important details, but my thoughts are the following:

  • Is there a simpler way to make this flag available without duplicating all this code? (I think it's possible.)
  • Why do you want to turn off prioritized replay in the first place?
  • Why does this reduce the sensitivity to episode horizon?

rllib/agents/sac/apex.py (outdated, resolved)
@@ -41,6 +69,243 @@
# __sphinx_doc_end__
# yapf: enable


class LocalVanillaReplayBuffer(LocalReplayBuffer):

What is the difference here with the LocalReplayBuffer, besides not including the weights and the indices in the batch?

@AdityaGudimella (Author) left a comment

The three changes are:

  1. PrioritizedReplayBuffer vs VanillaReplayBuffer
  2. SampleBatch containing different elements
  3. Stats not containing update_priorities time.
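As a rough illustration only (not the PR's exact code), the sampled batch differs like this, matching the list above; the field names weights and batch_idxs follow the discussion in this thread:

```python
# Rough illustration, not the PR's exact code: the fields of the sampled batch
# in the two cases. Only the prioritized buffer adds weights and batch indexes.
def sample_batch_fields(prioritized: bool) -> list:
    fields = ["obs", "actions", "rewards", "new_obs", "dones"]
    if prioritized:
        # Added only by the PrioritizedReplayBuffer path.
        fields += ["weights", "batch_idxs"]
    return fields


print(sample_batch_fields(prioritized=False))  # no weights / batch_idxs
print(sample_batch_fields(prioritized=True))
```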

"rewards": rewards,
"new_obs": obses_tp1,
"dones": dones,
# "weights": weights,

Why are we excluding the weights and the indices?


What you want is to prevent the update_priorities method from performing its job, right?

@AdityaGudimella (Author) left a comment

Yes. update_priorities should not do anything, but also, weights and batch_idxs are not returned by the vanilla replay buffer; they are only returned by the PrioritizedReplayBuffer.
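A hedged sketch of that behavior (assumed names, not the PR's code):

```python
# Hedged sketch (assumed names, not the PR's code): with uniform replay there
# are no per-item priorities, so updating them degenerates to a no-op.
class VanillaReplayBuffer:
    def update_priorities(self, batch_idxs=None, td_errors=None):
        # Nothing to update: uniform sampling keeps no priority values, and the
        # sampled batch carries no batch_idxs or weights in the first place.
        pass
```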

@RuofanKong (Collaborator) left a comment

I just took a quick look, so maybe I missed important details, but my thoughts are the following:

  • Is there a simpler way to make this flag available without duplicating all this code? (I think it's possible.)
  • Why do you want to turn off prioritized replay in the first place?
  • Why does this reduce the sensitivity to episode horizon?

@Edilmo First, tuning "no-done-at-end" shouldn't be the right way to go, for two major reasons:
a) Given a fixed number of workers and simulators, Apex-SAC on our Ray-forked RLlib converges very well, without large variance, even without tuning this hyper-parameter.
b) It is not a generic hyper-parameter that works in the general case. Even when tuning it helps, it only works for infinite-horizon RL problems; finite-horizon RL problems will never converge with it, because it breaks the MDP sequence.

Second, back to the questions: why turn off the prioritized replay buffer, and why does this reduce the sensitivity to episode horizon?
It is not episode-horizon sensitivity at all, although the symptom can look related. Briefly, the existing Ape-X algorithms (Ape-X SAC aside) all optimize the objective of maximizing expected reward. To train both the critic and actor networks while serving samples efficiently, they use weighted importance sampling with TD-error priorities (the k-largest TD errors ranked highest), which concentrates the gradient updates where they are most useful. Energy-based algorithms are different: their core objective adds policy entropy maximization on top of expected reward, and that entropy term is what drives exploration. If the Ape-X framework on an energy-based algorithm keeps prioritizing by TD error, it loses a significant number of high-entropy samples, which vary a lot. What gets sampled then depends directly on how sampling is scheduled, so training becomes sensitive to the number of sampling workers and simulators, and this is exaggerated when that number is not fixed, e.g. with auto-scaling of simulators on the Bonsai MT platform, because the sample distribution changes even though the same amount of data is sampled. Training eventually converges, but this is generally where the high variance in convergence comes from. It also pushes us to think about what the right prioritized replay buffer is, from a theoretical perspective, for energy-based algorithms (Soft Q-Learning, SAC, etc.).
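As a small numeric illustration of that skew (not from the PR; the TD-error values are made up, and alpha = 0.6 is the usual prioritized-replay exponent):

```python
# Small numeric illustration (values made up): with TD-error-proportional
# priorities, transitions with small TD error (which can still be the
# high-entropy samples SAC cares about) are drawn far less often than
# under uniform replay.
td_errors = [2.0, 1.5, 0.1, 0.05]   # |TD error| of four stored transitions
alpha = 0.6                          # usual prioritized-replay exponent

priorities = [abs(e) ** alpha for e in td_errors]
total = sum(priorities)
p_prioritized = [p / total for p in priorities]
p_uniform = [1.0 / len(td_errors)] * len(td_errors)

for i, (pp, pu) in enumerate(zip(p_prioritized, p_uniform)):
    print(f"transition {i}: prioritized={pp:.3f} vs uniform={pu:.3f}")
```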

These high-level details may not give you a full understanding if you don't have a deep background in energy-based RL algorithms, prioritized replay buffers, etc., so here are the recommended papers to help:

@Edilmo left a comment

LGTM

Great finding, guys!!!

@Edilmo commented Sep 21, 2020


I think you totally misunderstood my comments. They are just about avoiding code duplication.

@AdityaGudimella AdityaGudimella merged commit ffb5fe0 into releases/0.8.6 Sep 21, 2020
@Edilmo Edilmo deleted the adgudime/async_sac branch December 7, 2020 07:18