The authors show how rich locomotion behavior can arise from simple reward functions combined with an increasingly complex curriculum of environment design. They propose rich environments as a possible way to make a policy robust. Furthermore, they combine the distributed training setup of A3C with PPO and call it Distributed PPO (DPPO).
Survival/existence is the ultimate reward; every other reward is a corollary of the environment and an internal self-reward. The associated difficulty shapes the learning. Robustness is the flip side of overfitting in RL.
- Motivation: it is hard to design a reward that leads to robust behavior. The environment, on the other hand, can be designed easily and interpretably.
- Present a different environment/task instance every episode - an implicit curriculum of increasing difficulty.
- Show that learning speed can be improved by explicitly ordering terrains by increasing difficulty (see the curriculum sketch below).
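A minimal sketch of the curriculum idea, assuming terrain is parameterized per segment; all names and parameter ranges (`sample_course`, `max_hurdle_height`, ...) are illustrative assumptions, not the paper's implementation:

```python
import random

def sample_course(n_segments=20, max_hurdle_height=0.5, max_gap_width=1.0):
    """Sample a fresh obstacle course whose segments get harder
    the further the agent progresses along the track."""
    course = []
    for i in range(n_segments):
        difficulty = (i + 1) / n_segments  # grows from ~0 to 1 along the track
        course.append({
            "hurdle_height": random.uniform(0.0, difficulty * max_hurdle_height),
            "gap_width": random.uniform(0.0, difficulty * max_gap_width),
        })
    return course

# A new course per episode: the policy sees a distribution of
# difficulties rather than memorizing one fixed terrain.
course = sample_course()
```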
- Distributed PPO: use TRPO/PPO with an adaptive KL penalty coefficient in combination with many workers (A3C-style setup with K-step returns) and average over their gradients - allows scaling to larger domains with distributed computing (a sketch of the adaptive-KL update follows below).
- RNNs + k-Step returns + BPTT
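As a reminder of what "PPO with adaptive KL penalty" means here, a minimal sketch of the penalized surrogate and the coefficient update. The thresholds (1.5) and adaptation factor (2) follow Schulman et al.'s PPO paper; the DPPO paper uses a similar scheme, and all names below are illustrative:

```python
import numpy as np

def ppo_penalty_loss(ratio, advantage, kl, beta):
    """Penalized surrogate: maximize E[ratio * A] - beta * KL[pi_old || pi_new].
    `ratio` = pi_new(a|s) / pi_old(a|s); returned negated for minimization."""
    return -(np.mean(ratio * advantage) - beta * kl)

def adapt_beta(beta, kl, kl_target, factor=2.0, tol=1.5):
    """After each policy update, grow the penalty if the policy moved
    too far from the old one, shrink it if the step was too timid."""
    if kl > tol * kl_target:
        beta *= factor
    elif kl < kl_target / tol:
        beta /= factor
    return beta
```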
Architecture and Experiments:
- Planar Walker: 9 DoF, 6 joints
- Quadruped: 12 DoF, 8 joints (3D body)
- Humanoid: 28 DoF, 21 joints
- Rewards: proportional to velocity along the x-axis (simply forward progress) + penalty for torques, with small adjustments for different envs (see the reward sketch after this list)
- Obstacles in env: hurdles, gaps, variable terrain, slalom walls, platforms
- Single-type course, mixtures of single-type courses, mixed terrains
- Observations: separate input channels for proprioceptive and exteroceptive features - learns faster!
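A minimal sketch of the reward described above, assuming per-step access to the body's x-velocity and joint torques; the torque coefficient is an illustrative assumption, not the paper's constant:

```python
import numpy as np

def locomotion_reward(x_velocity, torques, torque_cost=1e-4):
    """Forward progress along the x-axis minus a small torque penalty;
    per-environment shaping terms are omitted."""
    return x_velocity - torque_cost * np.sum(np.square(torques))
```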
Results:
- A curriculum over hurdle heights has a positive influence on learning
- Robustness checks: training on hurdles makes the policy more robust to new, unseen disturbances in the env:
  - unobserved variations in surface friction
  - unobserved rumble-strips
  - changes in the body model
  - unobserved inclines/declines of the ground
- Measure learning on easy and hard test terrain
- No real technical innovation, but the paper pushed the boundaries and showed what is possible in terms of locomotion from simple rewards.
- Still somewhat confused about what exactly "PPO" represents - here it is the variant with an adaptive KL penalty coefficient driven toward a target KL (cf. the sketch above).