Trust Region Policy Optimization

This is a Trust Region Policy Optimization (TRPO) implementation in PyTorch for continuous action space systems. This repo uses some methods from *.

Dependencies

TODO

  • Documentation: dependencies, how to save networks
  • Experience replay
  • Make the code foolproof: check log folders, save and load folders
  • Reward plotter for a single run
  • Automatic log-generation bash script for the given plots
  • Various minimal examples

Useful Flags

  • --max-iteration-number {int} : the maximum number of episodes

  • --batch-size {int} : the batch size of each episode

  • --episode-length {int} : the length of each episode (environments may limit this internally; for example, Pendulum has a length of 200)

  • --log : if added, the average cumulative reward is logged at the end of training

  • --log-dir {string} : the logging directory

  • --log-prefix {string} : the prefix for the log file name
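
For example, a single training run combining these flags might look like the following (the numeric values here are illustrative, not recommended settings):

  python train.py --max-iteration-number 500 --batch-size 5000 --episode-length 200 --log --log-dir "log/example" --log-prefix "pendulum"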

Logging

If the --log flag is added to the command-line instruction, the average cumulative reward will be logged automatically at the end of training.

Recommended way of logging:

First, create a log directory:

  mkdir log
  mkdir log/example

Then run train.py:

  python train.py --log --log-dir "log/example"

The log file will appear in the given folder. If you run the same command multiple times, the log file names are enumerated automatically.

Saving the trained networks
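
Documentation for this is still on the TODO list. As a rough sketch, assuming the policy and value networks are ordinary torch.nn.Module objects (the architectures below are placeholders, not the repo's actual ones), they can be saved and restored through their state_dicts:

  import torch
  import torch.nn as nn

  # Hypothetical stand-ins for the repo's networks; the real architectures may differ.
  policy_net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 1))
  value_net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 1))

  # Save only the weights (state_dict), the usual PyTorch convention.
  torch.save(policy_net.state_dict(), "log/example/policy.pt")
  torch.save(value_net.state_dict(), "log/example/value.pt")

  # To restore, rebuild the same architectures and load the weights back in.
  policy_net.load_state_dict(torch.load("log/example/policy.pt"))
  value_net.load_state_dict(torch.load("log/example/value.pt"))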

Useful Reference

Books

Papers

Other TRPO repos

If you know of other good implementations, please let me know so I can add them to the list.

Results

  • Bootstrapping works considerably better than Monte Carlo return estimation
  • Increasing the batch size speeds up learning per iteration, but the simulations take too long
  • Training the policy and value networks with data from the same time step results in poor learning performance, even if the value training is performed after the policy optimization. Training the value function with data from the previous batch solves the problem. Using more than one previous batch does not improve the results.
  • A high value-training iteration number results in overfitting, while a low one causes poor learning. Note that these experiments are performed with minibatches of size batch_size/iter, i.e. the minibatch size is not constant. (TODO: add a constant minibatch size)

Experiments

The experiments are performed in the Pendulum-v0 environment.

Monte Carlo vs Bootstrap

In this experiment, two different ways of estimating the return are compared (a minimal sketch of both follows below).

1- Monte Carlo : the return is calculated from the reward plus the discounted return of the next state

2- Bootstrap : the return is calculated from the reward plus the discounted value approximation of the next state
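
As a rough illustration (this is not the repo's exact code; gamma and the tensor shapes are assumptions), the two estimators differ only in how the remainder of the return is filled in:

  import torch

  def monte_carlo_returns(rewards, gamma=0.99):
      # R_t = r_t + gamma * R_{t+1}, accumulated backwards over one episode
      returns = torch.zeros_like(rewards)
      running = 0.0
      for t in reversed(range(len(rewards))):
          running = rewards[t] + gamma * running
          returns[t] = running
      return returns

  def bootstrap_returns(rewards, next_values, gamma=0.99):
      # R_t = r_t + gamma * V(s_{t+1}), where V comes from the value network
      return rewards + gamma * next_values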

[results plot: Monte Carlo vs Bootstrap]

Value function training batch

In this experiment, we train the system with 4 different batch sizes.

[results plot: value function training batch size]

Past data for value learning

In *, data from previous batches are used to train the value function to avoid overfitting, while * uses the previous and current batches together. Here, we test different combinations of both to see the difference.
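
As a sketch of what such combinations could look like (the buffer and helper below are hypothetical, not the repo's actual API):

  import torch
  from collections import deque

  # Keep the most recent batches of (states, returns) pairs; with maxlen=2 the
  # buffer holds the previous and the current batch.
  value_buffer = deque(maxlen=2)

  def value_training_data(buffer, use_previous=True, use_current=True):
      batches = []
      if use_previous and len(buffer) > 1:
          batches.append(buffer[0])   # previous batch
      if use_current and len(buffer) > 0:
          batches.append(buffer[-1])  # current batch
      states = torch.cat([b[0] for b in batches])
      targets = torch.cat([b[1] for b in batches])
      return states, targets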

[results plot: past data for value learning]

Value training iteration number

We test the effect of the value training iteration number. The experiment is performed with a batch size of 5k, and the minibatch size is 5k/iter_num.
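
A minimal sketch of this setup (assuming the value function is fit with a simple MSE loss; the function and argument names are illustrative, not the repo's actual code):

  import torch
  import torch.nn.functional as F

  def train_value_function(value_net, optimizer, states, targets, iter_num):
      # Split one batch (e.g. 5k samples) into iter_num minibatches of size
      # batch_size / iter_num, so a larger iteration number means smaller minibatches.
      batch_size = states.shape[0]
      minibatch_size = batch_size // iter_num
      perm = torch.randperm(batch_size)
      for i in range(iter_num):
          idx = perm[i * minibatch_size:(i + 1) * minibatch_size]
          loss = F.mse_loss(value_net(states[idx]).squeeze(-1), targets[idx])
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()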

[results plot: value training iteration number]
