
PPO-clip-and-PPO-penalty-on-Atari-Domain

Overview

This repo is based on spinningup; sincere thanks to that project.

I do the following:

  • Implement PPO-penalty. Neither spinningup nor baselines implements PPO-penalty. Although its results are not as good as PPO-clip's, the algorithm is still meaningful as a baseline.
  • Implement the PPO algorithm on the Atari domain. If you read spinningup carefully, or run its program, you will find that it does not work on the Atari domain, because the input vector is not flattened (see the sketch below).
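
A minimal sketch of what the flattening fix means, assuming a standard Gym Atari environment (the environment name and variable names are only examples, not the repo's actual code):

	import gym
	import numpy as np

	env = gym.make('Pong-v0')        # any Atari environment
	obs = env.reset()                # raw frame, e.g. shape (210, 160, 3)
	flat_obs = np.asarray(obs, dtype=np.float32).reshape(1, -1)   # (1, 100800) vector the MLP policy can consume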

Advantages

  • This may be the only open-source implementation of PPO-penalty.
  • The program is very easy to configure.
  • The code is readable, more readable than baselines, and better suited to beginners.

References

Blog

Background

PPO is motivated by the same question as TRPO: how can we take the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause performance collapse? Where TRPO tries to solve this problem with a complex second-order method, PPO is a family of first-order methods that use a few other tricks to keep new policies close to old. PPO methods are significantly simpler to implement, and empirically seem to perform at least as well as TRPO.

There are two primary variants of PPO: PPO-Penalty and PPO-Clip.

PPO-Penalty approximately solves a KL-constrained update like TRPO, but penalizes the KL-divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient over the course of training so that it's scaled appropriately.
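
As a sketch of that adaptive adjustment, here is the rule from the PPO paper (illustrative only, not this repo's exact code):

	def update_penalty_coef(beta, measured_kl, target_kl):
	    # Halve beta when the policy moved much less than the target KL,
	    # double it when the policy moved much more; otherwise leave it alone.
	    if measured_kl < target_kl / 1.5:
	        beta /= 2.0
	    elif measured_kl > target_kl * 1.5:
	        beta *= 2.0
	    return beta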

PPO-Clip doesn't have a KL-divergence term in the objective and doesn't have a constraint at all. Instead, it relies on specialized clipping in the objective function to remove incentives for the new policy to get far from the old policy.
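
A minimal NumPy sketch of that clipped surrogate (illustrative, not this repo's code; epsilon = 0.2 is the paper's default):

	import numpy as np

	def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
	    ratio = np.exp(logp_new - logp_old)                       # pi_new(a|s) / pi_old(a|s) per sample
	    clipped_ratio = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
	    # Taking the min removes any incentive to push the ratio outside [1 - eps, 1 + eps].
	    return np.mean(np.minimum(ratio * advantages, clipped_ratio * advantages))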

Spinning Up focuses only on PPO-Clip (the primary variant used at OpenAI); this repo covers both variants.

Quick Facts

  • PPO is an on-policy algorithm.
  • PPO can be used for environments with either discrete or continuous action spaces.
  • The Spinning Up implementation of PPO supports parallelization with MPI.

Installation Dependencies

  • cloudpickle==0.5.2
  • gym>=0.10.8
  • matplotlib
  • numpy
  • pandas
  • scipy
  • tensorflow>=1.8.0
  • tqdm

How to Install

	conda install tensorflow
	conda install gym
	conda install numpy 

To install mpi4py on Unix, click here; for mpi4py on Windows, click here.

How to Run

  • Quick start
	python ppo.py
  • Run in parallel (-np sets the number of processes; take care of OOM!)
	mpiexec -np 4 python ppo.py

Algorithm

The details of PPO are the same as in the original paper; you can have a look at my blog for the details. Pseudo-code is shown below.

The objective functions of PPO-clip and PPO-penalty are shown below:
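
For reference, writing the probability ratio as r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t), the two objectives from the PPO paper are:

	L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right]

	L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta)\,\hat{A}_t \;-\; \beta\,\mathrm{KL}\big[\pi_{\theta_{\mathrm{old}}}(\cdot\mid s_t),\ \pi_{\theta}(\cdot\mid s_t)\big]\right]

Here β is the penalty coefficient, adjusted adaptively over training as sketched in the Background section.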

Details

Most settings are the same as in the PPO paper; details are as follows:

  • Network structure: we use a fully-connected MLP with two hidden layers of 64 units and tanh nonlinearities, outputting the mean of a Gaussian distribution with variable standard deviations, following [Sch+15b; Dua+16]. We don't share parameters between the policy and value function (so the coefficient c1 is irrelevant), and we don't use an entropy bonus. A sketch of this network is given after this list.

  • Parameters in Detail

    Parameters on Mujoco and Atari
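
A minimal TensorFlow 1.x sketch of the Gaussian MLP policy described above (illustrative; function and variable names are hypothetical, not the repo's):

	import numpy as np
	import tensorflow as tf

	def mlp_gaussian_policy(obs_ph, act_dim, hidden_sizes=(64, 64)):
	    # Two tanh hidden layers of 64 units, outputting the mean of a Gaussian;
	    # the log standard deviation is a separate, state-independent variable.
	    x = obs_ph
	    for size in hidden_sizes:
	        x = tf.layers.dense(x, units=size, activation=tf.tanh)
	    mu = tf.layers.dense(x, units=act_dim, activation=None)
	    log_std = tf.get_variable('log_std', initializer=-0.5 * np.ones(act_dim, dtype=np.float32))
	    pi = mu + tf.random_normal(tf.shape(mu)) * tf.exp(log_std)   # sampled action
	    return mu, pi, log_std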
