The authors show how rich locomotion behavior can arise from simple reward functions combined with an increasingly complex curriculum of environment design. They propose rich environments as a possible way to make a policy robust. Furthermore, they combine the distributed training setup of A3C with PPO and call it Distributed PPO (DPPO).
Survival/existence is the ultimate reward; every other reward is a corollary of the environment and an internal self-reward. The associated difficulty shapes the learning. Robustness is the flip side of overfitting in RL.
- Motivation: it is hard to design a reward that leads to robust behavior. The environment, on the other hand, can be designed easily and interpretably.
- Present a different environment/task instance every episode - an implicit curriculum of increasing difficulty.
- Show that learning speed can be improved by explicitly ordering terrains by increasing difficulty (see the curriculum sketch below).
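A minimal sketch of the curriculum idea, assuming terrain is parameterized per segment; all names and parameter ranges (`sample_course`, `max_hurdle_height`, ...) are illustrative assumptions, not the paper's implementation:

```python
import random

def sample_course(n_segments=20, max_hurdle_height=0.5, max_gap_width=1.0):
    """Sample a fresh obstacle course whose segments get harder
    the further the agent progresses along the track."""
    course = []
    for i in range(n_segments):
        difficulty = (i + 1) / n_segments  # grows from ~0 to 1 along the track
        course.append({
            "hurdle_height": random.uniform(0.0, difficulty * max_hurdle_height),
            "gap_width": random.uniform(0.0, difficulty * max_gap_width),
        })
    return course

# A new course per episode: the policy sees a distribution of
# difficulties rather than memorizing one fixed terrain.
course = sample_course()
```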
- Distributed PPO: use TRPO/PPO with an adaptive KL penalty coefficient in combination with many workers (A3C-style setup with K-step returns) and average over their gradients - allows scaling to larger domains with distributed computing (a sketch of the adaptive-KL update follows below).
- RNNs + k-Step returns + BPTT
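As a reminder of what "PPO with adaptive KL penalty" means here, a minimal sketch of the penalized surrogate and the coefficient update. The thresholds (1.5) and adaptation factor (2) follow Schulman et al.'s PPO paper; the DPPO paper uses a similar scheme, and all names below are illustrative:

```python
import numpy as np

def ppo_penalty_loss(ratio, advantage, kl, beta):
    """Penalized surrogate: maximize E[ratio * A] - beta * KL[pi_old || pi_new].
    `ratio` = pi_new(a|s) / pi_old(a|s); returned negated for minimization."""
    return -(np.mean(ratio * advantage) - beta * kl)

def adapt_beta(beta, kl, kl_target, factor=2.0, tol=1.5):
    """After each policy update, grow the penalty if the policy moved
    too far from the old one, shrink it if the step was too timid."""
    if kl > tol * kl_target:
        beta *= factor
    elif kl < kl_target / tol:
        beta /= factor
    return beta
```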
Architecture and Experiments:
- Planar Walker: 9 DoF, 6 joints
- Quadruped: 12 DoF, 8 joints (3D body)
- Humanoid: 28 DoF, 21 joints
- Rewards: proportional to velocity along the x-axis (simply forward progress) + penalty for torques, with small adjustments for different envs (see the reward sketch after this list)
- Obstacles in env: hurdles, gaps, variable terrain, slalom walls, platforms
- Single-type course, mixtures of single-type courses, mixed terrains
- Observations: separate input channels for proprioceptive and exteroceptive features - learns faster!
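A minimal sketch of the reward described above, assuming per-step access to the body's x-velocity and joint torques; the torque coefficient is an illustrative assumption, not the paper's constant:

```python
import numpy as np

def locomotion_reward(x_velocity, torques, torque_cost=1e-4):
    """Forward progress along the x-axis minus a small torque penalty;
    per-environment shaping terms are omitted."""
    return x_velocity - torque_cost * np.sum(np.square(torques))
```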
Results:
- A curriculum over hurdle heights has a positive influence on learning
- Robustness checks: training on hurdles makes the policy more robust to new, unseen disturbances in the env:
  - unobserved variations in surface friction
  - unobserved rumble-strips
  - changes in the body model
  - unobserved inclines/declines of the ground
- Measure learning on easy and hard test terrain
- No real technical innovation, but the paper pushed the boundaries and showed what is possible in terms of locomotion from simple rewards.
- Still somewhat confused about what exactly "PPO" represents - here it is the variant with an adaptive KL penalty coefficient driven toward a target KL (cf. the sketch above).