LeSphax/ppo

This repository is a reimplementation of the PPO algorithm that I made to get a better understanding of reinforcement learning, and of actor-critic algorithms in particular.

It is based on the PPO paper and the accompanying GitHub repository, OpenAI Baselines.

I later implemented curiosity-driven exploration as well.
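For reference, the heart of PPO is the clipped surrogate objective from the paper. The snippet below is only an illustrative sketch in TF1 style (matching the tensorflow-gpu 1.9 requirement), not the code from this repository, and the tensor names are assumptions:

```python
import tensorflow as tf

def clipped_surrogate_loss(new_logp, old_logp, advantages, clip_range=0.2):
    # Probability ratio between the updated policy and the policy that
    # collected the rollout.
    ratio = tf.exp(new_logp - old_logp)
    # PPO takes the pessimistic minimum of the unclipped and clipped terms,
    # removing the incentive to push the ratio outside [1 - eps, 1 + eps].
    unclipped = ratio * advantages
    clipped = tf.clip_by_value(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    return -tf.reduce_mean(tf.minimum(unclipped, clipped))
```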

Requirements

My development environment was Ubuntu 16 with Python 3.5, tensorflow-gpu 1.9 and CUDA 9.0, but it should also work with more recent versions, and even without a GPU (though some environments like Breakout will be much slower).

In my experience it can be difficult to run on Windows because of OpenAI Gym, but it is possible.

Installation

Clone the repository and install the dependencies:

git clone git@github.com:LeSphax/ppo.git
cd ppo
pip3 install -e .

If you want to try curiosity, you will also need my gym-ui environments. You can see how to install them here.

Training

Run the algorithm on one of the configured environments: python3 my_ppo.py (<label>) (<env_name>) [options]

So for example, to run the environment BreakoutNoFrameskip-v4, you would type:

python3 my_ppo.py TestingTheRepository BreakoutNoFrameskip-v4

This will start training the model with the configuration specified in configs/breakout_no_frameskip_config.py.

You can see all the options with: python3 my_ppo.py -h

Inference

When you start the training, a window will also appear to show the agent playing.

There are actually several agents training at the same time; this window is just a way to see how well your agent is currently playing.

You can toggle the presence of the window by pressing 'r'+'Enter' on your keyboard.

Tensorboard

You can see the results of your algorithm by running:

tensorboard --logdir=train

The most interesting graph should be Stats/TotalReward, which shows the average reward the agent is getting per episode.
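A curve like Stats/TotalReward is just a scalar summary written once per update. Below is a minimal TF1-style sketch with assumed names, not this repository's actual logging code:

```python
import tensorflow as tf

total_reward_ph = tf.placeholder(tf.float32, shape=(), name="total_reward")
summary_op = tf.summary.scalar("Stats/TotalReward", total_reward_ph)
writer = tf.summary.FileWriter("train/TestingTheRepository")

with tf.Session() as sess:
    # Log the average episode reward of the latest batch of rollouts.
    summary = sess.run(summary_op, feed_dict={total_reward_ph: 12.0})
    writer.add_summary(summary, global_step=1)
```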

Videos

Another way to see how well the agents are doing is to look at videos taken during the training process. You can find these videos in the train folder.

Saves

The latest version of the model is saved regularly.

You can restart a training session from the saved model using the --load flag:

python3 my_ppo.py TestingTheRepository BreakoutNoFrameskip-v4 --load

Configurations

The structure of the neural network and the hyperparameters to use for a specific environment are defined in the config files. You can find them in the configs folder.
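To give a feel for what these files contain, here is a hypothetical outline; the real configs in this repository use different names, structures and values:

```python
import tensorflow as tf

class ExampleCartPoleConfig:
    # Illustrative PPO hyperparameters, not the values used in configs/.
    learning_rate = 3e-4
    clip_range = 0.2
    nsteps = 128
    epochs = 4

    def policy_network(self, obs):
        # Small fully connected network suited to a low-dimensional
        # observation like CartPole's.
        hidden = tf.layers.dense(obs, 64, activation=tf.nn.tanh)
        return tf.layers.dense(hidden, 64, activation=tf.nn.tanh)
```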

Configurations exist for the following gym environments:

  • Breakout-v0
  • BreakoutNoFrameskip-v4
  • CartPole-v1

And the following environments that I created myself to test the curiosity-driven exploration algorithm:

  • FixedButton-v0
  • FixedButtonHard-v0
  • RandomButton-v0

Curiosity

To train using curiosity you can use the --curiosity flag; you can also use the --norewards flag to ignore the extrinsic rewards that the environment gives. To support curiosity-driven exploration, a config must implement the state_action_predictor and curiosity_encoder methods.
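Here is a hypothetical sketch of what those two hooks might look like if the encoder maps observations to features and state_action_predictor is a forward model; the real signatures, roles and layer sizes in this repository may differ:

```python
import tensorflow as tf

def curiosity_encoder(obs):
    # Encode raw observations into a compact feature vector.
    hidden = tf.layers.dense(obs, 64, activation=tf.nn.relu)
    return tf.layers.dense(hidden, 32)

def state_action_predictor(features, actions):
    # Forward model: predict the features of the next state from the
    # current features and the action that was taken.
    x = tf.concat([features, actions], axis=-1)
    hidden = tf.layers.dense(x, 64, activation=tf.nn.relu)
    return tf.layers.dense(hidden, 32)
```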

Results for Breakout

This implementation of PPO produces results similar to those of the OpenAI Baselines repository:

OpenAI Baselines results for BreakoutNoFrameskip-v4 with --num_timesteps=10000000 --nsteps=128:

[Training reward curve from OpenAI Baselines]

Results for python3 my_ppo.py TestingTheRepository BreakoutNoFrameskip-v4:

[Training reward curve from this implementation]

You can also see videos from the beginning of the training and the end.

Results for curiosity

For curiosity I didn't compare with a baseline; I just wanted to make sure that I had implemented it correctly.

So I made sure that the agent could train using:

  • Only intrinsic rewards and no extrinsic rewards (Red curve)

python3 my_ppo.py TestingTheRepository FixedButton-v0 --curiosity --norewards

  • Both intrinsic and extrinsic rewards (Blue curve)

python3 my_ppo.py TestingTheRepository FixedButton-v0 --curiosity

[Reward curves: intrinsic rewards only (red) and intrinsic + extrinsic rewards (blue)]

You can see the trained agent here.

As expected, the agent does better when we add extrinsic rewards, but it still learns to click the buttons when only intrinsic rewards are present.

This works because trying to predict where the next button will appear is harder than predicting where a click will land, so the agent keeps receiving intrinsic rewards when it clicks the buttons.
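Concretely, in ICM-style curiosity the intrinsic reward is the prediction error on the next state's features, so hard-to-predict transitions (a button appearing somewhere new) pay out more than easy ones (a click landing where it was aimed). An illustrative sketch, not this repository's exact code:

```python
import tensorflow as tf

def intrinsic_reward(predicted_next_features, next_features, scale=0.01):
    # Squared error of the next-state feature prediction, used as a bonus
    # that is added to (or replaces, with --norewards) the extrinsic reward.
    error = tf.reduce_sum(tf.square(predicted_next_features - next_features), axis=-1)
    return scale * error
```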

One interesting finding was the need to clip actions before feeding them to the inverse predictor:

  • In the FixedButton-v0 environment, actions are clipped to [0, 1] so that it is impossible to click outside the screen. If the agent sends (30, 30) as an action, it still clicks on position (1, 1).
  • But in a straightforward implementation of the algorithm, the actions used to train the curiosity modules are not clipped.
  • The problem is that the inverse predictor tries to predict the action that was taken, given the current state and the previous state.
  • If we don't clip actions, this becomes impossible, because the actions (1, 1) and (30, 30) result in the same state.
  • Since the agent gets rewards when the inverse predictor makes mistakes, it exploits this mechanism and outputs only actions outside the range of the environment.

So to fix this problem I added the clipping as part of fixed_button_config. This does not seem like a good solution, though, because it makes the algorithm dependent on the environment it is training in.
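As a sketch of the fix, assuming action bounds of [0, 1] as in FixedButton-v0 (the actual hook in fixed_button_config may look different):

```python
import numpy as np

def clip_actions_for_curiosity(actions, low=0.0, high=1.0):
    # Clip to the range the environment actually applies, so the inverse
    # predictor never has to distinguish actions such as (1, 1) and (30, 30)
    # that lead to exactly the same transition.
    return np.clip(actions, low, high)
```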
