Policy Gradient Methods

PyTorch implementation of policy gradient methods.

NOTE This repository is still a work in progress! As I continue to break things down into modular and reusable parts, things might break. However, I will try to ensure that the test cases in ./tests/ keep passing.

Installation

This library only works with Python 3.5+; if you are using Python 2.7 you should upgrade immediately. The requirements are listed in requirements.txt. To install the library, use pip:

pip install -e .

The -e flag installs the library in development (editable) mode. You can then check that it works by opening up Python and typing:

import pg_methods
print(pg_methods.__version__) # should print 0

Tests

There are tests for components in this library under ./tests/. You can run them by executing python -m pytest ./tests --verbose.

Philosophy

There are a few good reinforcement learning algorithm implementations in PyTorch, and many more in TensorFlow, Theano, and Keras. The main thing lacking in the PyTorch implementations is extensibility/modularity. Sure, I would love to run this one algorithm on every environment ever, but sometimes it's just the little parts that are useful: a good utility to calculate discounted future returns with masks, or the REINFORCE objective itself. Maybe you want to try a new kind of baseline? The goal of this library is to let you do all of these things, sort of like LEGO. Arguably, having the components to build new algorithms is more important than having one long script per algorithm. This is one thing I find frustrating with baselines: all the algorithms live in their own folders, with only marginal code sharing. I've already used some pieces from here (an older version of pg_methods) in some of my projects (soon to be released).
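
To give a flavour of the kind of "little part" I mean, here is a minimal sketch of a masked discounted-return utility. It is illustrative only: the function name and signature below are made up for this README and are not the library's actual API.

import torch

def discounted_returns(rewards, masks, gamma=0.99):
    # rewards: tensor of shape (T, n_workers) with per-step rewards
    # masks:   same shape, 0 where an episode has terminated, 1 otherwise
    returns = torch.zeros_like(rewards)
    running = torch.zeros_like(rewards[0])
    # walk backwards through time, resetting the running return wherever an episode ended
    for t in reversed(range(rewards.size(0))):
        running = rewards[t] + gamma * running * masks[t]
        returns[t] = running
    return returns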

For an overview of how the code is organized, see ./pg_methods/README.md.

Algorithms Implemented

  1. Vanilla Policy Gradient (pg_methods.algorithms.VanillaPolicyGradient)
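
For reference, here is a minimal sketch of the REINFORCE-style loss that vanilla policy gradient minimizes (with a baseline subtracted to reduce variance). It is illustrative only and is not the code inside pg_methods.algorithms.VanillaPolicyGradient.

def vanilla_policy_gradient_loss(log_probs, returns, baseline_values):
    # log_probs, returns and baseline_values are tensors of shape (T, n_workers)
    advantages = returns - baseline_values
    # minimizing this loss performs gradient ascent on the expected return
    return -(log_probs * advantages.detach()).mean()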

To be implemented (contributions welcome!)

  • Synchronous Advantage Actor Critic
  • Asynchronous Advantage Actor Critic
  • Natural Policy Gradient
  • Trust Region Policy Optimization
  • Proximal Policy Optimization

etc.

Other opportunities to contribute

See the projects tab. New objectives, baselines, optimizers, and replay memories are all good contributions!

It would also be cool to have a large-scale benchmarking script so that we can run all the algorithms and see how they perform on different Gym environments.
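
A hypothetical shape such a script could take, reusing the interfaces shown in the Example section below (the environment list and settings here are placeholders, not a tested benchmark):

import torch
import torch.nn as nn

from pg_methods import interfaces
from pg_methods.algorithms.REINFORCE import VanillaPolicyGradient
from pg_methods.utils import experiment

# train the same algorithm on a few Gym environments and compare average rewards
for env_name in ['CartPole-v0', 'Acrobot-v1']:
    env = interfaces.make_parallelized_gym_env(env_name, seed=0, n_workers=2)
    fn_approximator, policy = experiment.setup_policy(env, hidden_non_linearity=nn.ReLU, hidden_sizes=[16, 16])
    optimizer = torch.optim.SGD(fn_approximator.parameters(), lr=0.01)
    algorithm = VanillaPolicyGradient(env, policy, optimizer, gamma=0.99)
    rewards, losses = algorithm.run(1000, verbose=False)
    print(env_name, 'average reward:', rewards.mean())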

Some performance graphs (soon to improve)

(figure: rewards plots)

I'm working on getting Roboschool installed on the Compute Canada clusters so I can run longer experiments. To install Roboschool on your local machine, you can try this script.

(figure: rewards on Roboschool)

Example

Here is an example script showing how to get started with the VanillaPolicyGradient algorithm. We expect other algorithms to have similar interfaces.

import torch
import torch.nn as nn

from pg_methods import interfaces
from pg_methods.algorithms.REINFORCE import VanillaPolicyGradient
from pg_methods.baselines import FunctionApproximatorBaseline
from pg_methods.networks import MLP_factory  # assumed location of MLP_factory used below
from pg_methods.utils import experiment

# create a parallelized CartPole environment with 2 worker processes
env = interfaces.make_parallelized_gym_env('CartPole-v0', seed=4, n_workers=2)

# set up experiment logging
experiment_logger = experiment.Experiment({'algorithm_name': 'VPG'}, './')
experiment_logger.start()

# build the policy network and its optimizer
fn_approximator, policy = experiment.setup_policy(env, hidden_non_linearity=nn.ReLU, hidden_sizes=[16, 16])
optimizer = torch.optim.SGD(fn_approximator.parameters(), lr=0.01)

# setting up a baseline function
baseline_approximator = MLP_factory(env.observation_space_info['shape'][0],
                                    [16, 16],
                                    output_size=1,
                                    hidden_non_linearity=nn.ReLU)
baseline_optimizer = torch.optim.SGD(baseline_approximator.parameters(), lr=0.01)
baseline = FunctionApproximatorBaseline(baseline_approximator, baseline_optimizer)

algorithm = VanillaPolicyGradient(env, policy, optimizer, gamma=0.99, baseline=baseline)

rewards, losses = algorithm.run(1000, verbose=True)

experiment_logger.log_data('rewards', rewards.tolist())
experiment_logger.save()

More example scripts can be seen in ./experiments/
