Title: Emergence of Locomotion Behaviors in Rich Environments

Author: Heess et al (2017) (DeepMind)

General Content:

Authors show how rich locomotion behaviors can emerge from simple reward functions when agents are trained on an increasingly complex curriculum of environments. They propose rich environments as a way to make policies robust. Furthermore, they combine PPO with the distributed, asynchronous worker setup of A3C and call the result Distributed PPO (DPPO).

Key take-away for own work:

Survival/existence is the ultimate reward. Every other reward is a corollary resulting from the environment and an internal self-reward. The associated difficulty shapes the learning. Robustness is the flip side of overfitting in RL.

Keypoints:

  • Motivation: It is hard to design a reward that leads to robust behavior. The environment, on the other hand, can be designed easily and interpretably.

  • Present a different environment/task instance every episode - an implicit curriculum of increasing difficulty.

  • Show that learning speed can be improved by explicitly ordering terrains by increasing difficulty.

  • Distributed PPO: Use PPO (a proximal variant of TRPO) with an adaptive KL penalty coefficient in combination with many workers (A3C-style setup with K-step returns) and average over their gradients - allows scaling to larger domains with distributed computing (see the adaptive-KL sketch after this list).

    • RNNs + k-Step returns + BPTT
  • Architecture and Experiments:

    • Planar Walker: 9 DoF, 6 joints
    • Quadruped: 12 DoF, 8 joints (3D body)
    • Humanoid: 28 DoF, 21 joints
    • Rewards: Proportional to velocity along the x-axis (simply forward progress) + penalty for torques, with small adjustments for different envs (see the reward sketch after this list)
    • Obstacles in env: hurdles, gaps, variable terrain, slalom walls, platforms
    • Single-type course, mixtures of single-type courses, mixed terrains
    • Observations: Separate input channels for proprioceptive and exteroceptive features - learns faster!
  • Results:

    • Curriculum of hurdle lengths has a positive influence on learning
    • Robustness checks - hurdle training makes the policy more robust to new, unseen disturbances in the env
      • unobserved variations in surface friction
      • unobserved rumble-strips
      • changes in the body model
      • unobserved inclines/declines of ground
    • Measure learning on easy and hard test terrain
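
The adaptive KL penalty mentioned above can be summarized as below. This is a minimal illustrative sketch in Python/NumPy, not the paper's code; the 1.5/2 adaptation factors are the common heuristic for this scheme and the function names are my own:

```python
import numpy as np

def ppo_kl_penalty_loss(ratio, advantage, kl, beta):
    # PPO with KL penalty: maximize E[ratio * A - beta * KL],
    # so return the negated mean as a loss to minimize.
    return -np.mean(ratio * advantage - beta * kl)

def adapt_beta(beta, mean_kl, kl_target):
    # Increase the penalty if the new policy drifted too far from the old one,
    # decrease it if the update was overly conservative.
    if mean_kl > 1.5 * kl_target:
        beta *= 2.0
    elif mean_kl < kl_target / 1.5:
        beta /= 2.0
    return beta
```

In DPPO this update runs on each worker's data, and the workers' gradients are averaged before the shared parameters are updated.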

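A minimal sketch of the reward described under Architecture and Experiments; the torque cost weight here is an illustrative placeholder, not a value from the paper:

```python
def locomotion_reward(velocity_x, torques, torque_cost=1e-4):
    # Forward progress along the x-axis minus a small quadratic torque penalty.
    return velocity_x - torque_cost * sum(t ** 2 for t in torques)
```
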
Questions/Critical Observations:

  • No real technical innovation, but the paper pushes the boundaries and shows what is possible in terms of locomotion from simple rewards.
  • Still confused about what PPO represents - here it is the variant with an adaptive KL penalty coefficient and a target KL.