Title: Sample Efficient Actor-Critic with Experience Replay

Author: Wang et al. (2017)

General Content:

The authors introduce a stable and sample-efficient actor-critic (AC) architecture built on three adaptations: i) truncated importance sampling with bias correction, ii) stochastic dueling networks, and iii) efficient TRPO updates performed in distribution space. In addition, the architecture uses standard variance reduction techniques, the off-policy Retrace algorithm, and parallel training. The approach is tested in both discrete and continuous domains. Finally, the authors prove that Retrace can be interpreted as an application of importance weight truncation with bias correction.

Key take-away for own work:

Use importance weights to correct for off-policy experience replay. Build on top of multiple previous results and compose them into new results of your own ("steal like an artist"). The Retrace operator is a generalization of the Bellman operator.

Keypoints:

  • Problems with DQN: the deterministic policy is vulnerable to adversarial attacks, and greedy evaluation/action selection is infeasible in large action spaces. Problem with policy gradient (PG) methods: sample inefficiency.

  • Problem of off-policy learning with experience replay (ER): controlling the variance and stability of off-policy estimators is hard. Importance sampling is used to correct for off-policy trajectories in on-policy policy gradient estimation.

  • The full importance-sampled estimator is unbiased but has large variance due to the product over unbounded importance weights (see the sketch after this list item). Ideas to address this:

    • Wawrzynski (2009): truncate the product - introduces bias
    • Degris et al. (2012): marginal value functions over the limiting distribution of the process - requires estimating the on-policy value function
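
A minimal sketch of the plain importance-sampled policy gradient this bullet refers to; the notation (behavior policy μ, return R_t, per-step weight ρ_t) is an assumption rather than quoted from these notes. The product of unbounded weights ρ_t is what blows up the variance:

```latex
% Full-trajectory importance-sampled policy gradient (sketch; notation assumed).
\hat{g}^{\mathrm{imp}}
  = \left( \prod_{t=0}^{k} \rho_t \right)
    \sum_{t=0}^{k} R_t \,\nabla_\theta \log \pi_\theta(a_t \mid x_t),
\qquad
\rho_t = \frac{\pi_\theta(a_t \mid x_t)}{\mu(a_t \mid x_t)},
\quad
R_t = \sum_{i \ge 0} \gamma^{i} r_{t+i}
```
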
  • Multi-step estimation of state-action value function:

    • Retrace is an off-policy, return-based algorithm with low variance, proven in the tabular case to converge to the value function of the target policy for any behavior policy. It allows for faster learning of the critic.
    • Using the Retrace estimate in the policy gradient reduces bias and enables fast learning of the critic (a sketch of the Retrace target recursion follows this list item).
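
A minimal sketch (assumed notation and shapes, not the paper's code) of how Retrace targets can be computed backwards over a replayed trajectory, using the truncated per-step importance weight min(1, ρ_t):

```python
import numpy as np

def retrace_targets(rewards, q_values, values, rhos, gamma=0.99, lam=1.0):
    """Sketch of Retrace(lambda) targets over a replayed trajectory.

    Per-step inputs (illustrative names): rewards r_t, critic estimates
    Q(x_t, a_t), state values V(x_t), and importance weights
    rho_t = pi(a_t | x_t) / mu(a_t | x_t)."""
    T = len(rewards)
    q_ret = np.zeros(T)
    # For simplicity, assume the trajectory ends in a terminal state;
    # a non-terminal cut-off would instead bootstrap from V(x_T).
    next_q_ret, next_q, next_v = 0.0, 0.0, 0.0
    for t in reversed(range(T)):
        # Truncated trace coefficient c_{t+1} = lambda * min(1, rho_{t+1})
        c = lam * min(1.0, rhos[t + 1]) if t + 1 < T else 0.0
        q_ret[t] = rewards[t] + gamma * (c * (next_q_ret - next_q) + next_v)
        next_q_ret, next_q, next_v = q_ret[t], q_values[t], values[t]
    return q_ret
```
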
  • i) Importance weight truncation with bias correction:

    • Truncate large marginal importance weights and introduce a correction term to reduce the resulting bias
    • Substitute the unknown on-policy value function with the Retrace estimator and the critic estimator
    • Reduce variance further with a value function baseline (see the sketch after this list item)
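
A minimal sketch, for discrete actions and a single state, of the two terms that importance weight truncation with bias correction produces; the function and variable names are illustrative assumptions, not the paper's code:

```python
import numpy as np

def truncated_is_with_bias_correction(pi, mu, q, a_t, c=10.0):
    """Sketch: coefficients of the truncated and bias-correction terms
    (discrete actions, single state). pi/mu = target/behavior action
    probabilities, q = critic action values, a_t = replayed action,
    c = truncation level."""
    v = np.dot(pi, q)                # value-function baseline V(x) = E_pi[Q(x, .)]
    rho = pi / (mu + 1e-8)           # per-action importance weights
    rho_t = rho[a_t]

    # Truncated term for the replayed action; in the paper's setting q[a_t]
    # would be the Retrace target rather than the raw critic value.
    truncated = min(c, rho_t) * (q[a_t] - v)

    # Bias-correction term: expectation over actions drawn from pi, active
    # only where the importance weight exceeds the truncation level c.
    coef = np.maximum(0.0, (rho - c) / (rho + 1e-8))
    correction = float(np.sum(pi * coef * (q - v)))

    return truncated, correction

# Example usage with toy numbers:
pi = np.array([0.7, 0.2, 0.1])   # target policy probabilities
mu = np.array([0.1, 0.6, 0.3])   # behavior policy probabilities
q  = np.array([1.0, 0.5, -0.2])  # critic action values
print(truncated_is_with_bias_correction(pi, mu, q, a_t=0, c=5.0))
```
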
  • ii) Efficient TRPO:

    • Instead of constraining the updated policy to be close to the current policy, use an average policy network representing a running average of past policies.
    • Policy net = a distribution f (e.g. categorical in the discrete case) plus a NN generating the statistics of that distribution
    • Maintain an average of the network parameters and perform soft updates
    • The trust region step is performed in the space of the statistics of the distribution f, not in the space of the policy parameters = a general strategy for modifying backward messages in backprop to stabilize activations (see the sketch after this list item)
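
A minimal sketch, under assumed names, of the two mechanical pieces this bullet describes: the soft update of the average policy parameters and the closed-form solution of the linearized KL-constrained step in statistics space:

```python
import numpy as np

def soft_update(avg_params, params, alpha=0.99):
    """Running average of policy parameters (the 'average policy network')."""
    return [alpha * a + (1.0 - alpha) * p for a, p in zip(avg_params, params)]

def trust_region_gradient(g, k, delta=1.0):
    """Sketch of the linearized trust-region correction in statistics space.

    g     - gradient of the objective w.r.t. the distribution statistics phi
    k     - gradient of KL(f(.|phi_avg) || f(.|phi)) w.r.t. phi
    delta - trust region radius
    Returns the corrected gradient that is then backpropagated into the
    policy parameters."""
    kg = np.dot(k, g)
    scale = max(0.0, (kg - delta) / (np.dot(k, k) + 1e-8))
    return g - scale * k
```
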
  • Continuous extensions: one cannot easily integrate over Q to obtain V in continuous action spaces - solution iii): stochastic dueling networks (SDNs):

    • The SDN outputs a stochastic state-action value estimate and a deterministic state-value estimate; the advantage term is evaluated via Monte Carlo sampling of actions from the policy (see the sketch below)
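
A minimal sketch of an SDN-style Q estimate for continuous actions; the function and variable names are illustrative assumptions, not the paper's code:

```python
import numpy as np

def sdn_q_estimate(v, advantage_fn, policy_sample, a, n=5):
    """Sketch of a stochastic dueling network style estimate:
    Q_tilde(x, a) = V(x) + A(x, a) - (1/n) * sum_i A(x, u_i),  u_i ~ pi(.|x).

    v             - deterministic state-value estimate V(x)
    advantage_fn  - advantage head A(x, .) evaluated at an action
    policy_sample - callable drawing an action from the current policy at x
    a             - the action whose Q value is estimated"""
    mc_baseline = np.mean([advantage_fn(policy_sample()) for _ in range(n)])
    return v + advantage_fn(a) - mc_baseline
```
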

Questions/Critical Observations:

  • Need to read the theoretical Retrace paper.
  • Better understand the on-policy vs. off-policy back and forth.