
BipedalWalker-PPO-VectorizedEnv

Project - BipedalWalker with PPO, Vectorized Environment

Introduction

Solving the environment requires an average total reward of over 300 over 100 consecutive episodes. Training BipedalWalker is considered a difficult task; in particular, it is very hard to train with DDPG or with single-agent PPO. In this directory we solve the environment in 450 episodes using PPO with multiple parallel agents, see Multi-Agent RL or the Baselines doc. For other solutions (based on a single agent) see BipedalWalker-TD3 and BipedalWalker-SAC.

Requirements

Environment

The environment is simulated as a list of 16 gym environments. They run in 16
subprocesses, adapted from the OpenAI baselines:

 num_processes=16
 envs = parallelEnv('BipedalWalker-v2', n=num_processes, seed=seed)       
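parallelEnv is the repository's wrapper. The following is a minimal sketch of how such a subprocess-based vectorized environment works: each environment lives in its own process and talks to the main process over a pipe. ToyEnv is an illustrative stand-in for BipedalWalker-v2 so the sketch is self-contained; all class and function names here are assumptions, not the repository's API.

```python
import numpy as np
from multiprocessing import Process, Pipe

class ToyEnv:
    """Placeholder env with a 24-dim observation, like BipedalWalker."""
    def reset(self):
        return np.zeros(24, dtype=np.float32)

    def step(self, action):
        obs = np.random.randn(24).astype(np.float32)
        return obs, 0.0, False, {}

def worker(conn, env):
    """Run one env in a subprocess, serving reset/step/close commands."""
    while True:
        cmd, data = conn.recv()
        if cmd == "reset":
            conn.send(env.reset())
        elif cmd == "step":
            conn.send(env.step(data))
        elif cmd == "close":
            conn.close()
            break

class ParallelEnv:
    """Batch of n envs, each in its own process, stepped in lockstep."""
    def __init__(self, n):
        self.parents, children = zip(*[Pipe() for _ in range(n)])
        self.procs = [Process(target=worker, args=(c, ToyEnv()), daemon=True)
                      for c in children]
        for p in self.procs:
            p.start()

    def reset(self):
        for conn in self.parents:
            conn.send(("reset", None))
        return np.stack([conn.recv() for conn in self.parents])

    def step(self, actions):
        # Send one action per env, then gather (obs, reward, done, info).
        for conn, a in zip(self.parents, actions):
            conn.send(("step", a))
        obs, rews, dones, infos = zip(*[conn.recv() for conn in self.parents])
        return np.stack(obs), np.array(rews), np.array(dones), infos

    def close(self):
        for conn in self.parents:
            conn.send(("close", None))
        for p in self.procs:
            p.join()
```

The payoff is that one `step(actions)` call advances all 16 environments at once, so the agent collects 16 transitions per step.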

Hyperparameters

The agent uses the following hyperparameters:

gamma = 0.99    # discount factor
epoch = 16      # number of passes over the rollout in the PPO update mechanism
mini_batch = 16 # number of minibatches sampled per pass; optimizer and backward steps run per minibatch
lr = 0.001      # learning rate
eps = 0.2       # clipping parameter used in the action loss
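The epoch and mini_batch parameters govern how the collected rollout is reused: each update makes epoch passes over the data, shuffling it and splitting it into mini_batch minibatches per pass. A minimal sketch of that index bookkeeping (function name illustrative, not the repository's API):

```python
import numpy as np

def minibatch_indices(n_samples, mini_batch, epochs, seed=0):
    """Yield index arrays: for each epoch, shuffle all rollout samples
    and split them into `mini_batch` equally sized minibatches."""
    rng = np.random.default_rng(seed)
    size = n_samples // mini_batch
    for _ in range(epochs):
        perm = rng.permutation(n_samples)
        for i in range(mini_batch):
            yield perm[i * size:(i + 1) * size]
```

With 16 environments, a rollout of T steps yields 16*T transitions, so each of the 16 minibatches in a pass holds T of them.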

Update mechanism

Standard policy gradient methods perform one gradient update per data sample.
The original PPO paper proposed a novel objective function that enables multiple epochs of minibatch updates.
This is the loss function L_t(theta), which is (approximately) maximized each iteration:

    L_t(theta) = E_t[ L_t^CLIP(theta) + c1 * L_t^VF(theta) + c2 * S[pi_theta](s_t) ]

where L_t^CLIP(theta) = E_t[ min(r_t(theta) * A_t, clip(r_t(theta), 1 - eps, 1 + eps) * A_t) ] is the clipped surrogate objective, r_t(theta) is the probability ratio pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t), L_t^VF(theta) = (V_theta(s_t) - V_t^targ)^2 is the squared value error, and S is the entropy bonus.

Parameters c1, c2, and epoch are essential hyperparameters of the PPO algorithm. In this agent, c1 = -0.5 and c2 = 0.01:

    value_loss = (return_batch - values).pow(2)
    loss = -torch.min(surr1, surr2) + 0.5 * value_loss - 0.01 * dist_entropy
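Here surr1 and surr2 are the unclipped and clipped surrogate terms from the PPO paper. A hedged NumPy-only sketch of the whole loss computation, with the function name and signature illustrative rather than the repository's API:

```python
import numpy as np

def ppo_loss(new_logp, old_logp, adv, returns, values, entropy,
             eps=0.2, c1=0.5, c2=0.01):
    """Clipped-surrogate PPO loss (to be minimized):
    action loss + c1 * value loss - c2 * entropy bonus."""
    ratio = np.exp(new_logp - old_logp)              # pi_new / pi_old
    surr1 = ratio * adv                              # unclipped objective
    surr2 = np.clip(ratio, 1 - eps, 1 + eps) * adv   # clipped objective
    action_loss = -np.minimum(surr1, surr2)          # maximize the min
    value_loss = (returns - values) ** 2
    return (action_loss + c1 * value_loss - c2 * entropy).mean()
```

Clipping the ratio to [1 - eps, 1 + eps] removes the incentive to move the new policy far from the old one within a single update, which is what makes multiple epochs over the same rollout safe.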

The update is performed in the function ppo_agent.update().

Training the Agent

We train the agent to learn that it can use information from its surroundings to inform the next best action.
The score of 300.5 was achieved in episode 450, after 2 hours 33 minutes of training.

Other PPO projects

Other BipedalWalker projects:

Credit

Most of the code is based on the Udacity code and on Ilya Kostrikov's code (https://github.com/ikostrikov).