This repository contains my policy gradient reinforcement learning code, currently implemented for the classic control environment CartPole. I have done my best to keep it modular so it can be applied to any of the other OpenAI Gym environments, as well as to other problems entirely. For now, I use the extremely helpful resources provided by OpenAI's Gym to get the fitness, environment features, etc.
Using this code, I was able to complete the CartPole challenge as presented in the OpenAI environment description (I will be applying it to other environments soon). Here's how it works:
The general idea behind policy gradients (as far as I have gathered) is to use (stochastic or batch) gradient descent to update the parameters of whatever is generating actions, depending on the environment. I used a neural network: originally 4 inputs, a 4-hidden-neuron fully connected sigmoid layer, and a 2-output softmax layer for the actions. Note: you can also parameterize the mean and variance of distributions over the weights, but I haven't done that yet; I was much more excited about the combination of neural network machine learning and fitness-based reinforcement learning.
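As a concrete illustration, here is a minimal sketch of that network shape in NumPy. This is not the exact code in this repo; the `forward` helper and the made-up observation are just illustrative:

```python
import numpy as np

def forward(obs, W1, b1, W2, b2):
    """Forward pass: 4 inputs -> 4 sigmoid hidden units -> 2 softmax action outputs."""
    h = 1.0 / (1.0 + np.exp(-(obs @ W1 + b1)))  # sigmoid hidden layer
    logits = h @ W2 + b2
    exp = np.exp(logits - logits.max())          # subtract max for numerical stability
    return exp / exp.sum()                       # softmax -> action probabilities

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, size=(4, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 0.1, size=(4, 2)), np.zeros(2)

obs = np.array([0.0, 0.1, -0.02, 0.3])   # a made-up CartPole observation
probs = forward(obs, W1, b1, W2, b2)
action = rng.choice(2, p=probs)           # sample left/right from the policy
```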
I won't go into the detailed intricacies here, but at the time of this writing you can find a working implementation in this repository with the following features:
- 3D neural network input, to support mini-batches / stochastic gradient descent
- Exponential decay/growth of epsilon, learning rate, and mini-batch size (the last is still in progress)
- Epsilon-greedy exploration/exploitation option (see the first sketch after this list)
- Softmax discrete-distribution sampling exploration/exploitation option (also shown in that sketch)
- Running mean of rewards up to the current timestep, used as the baseline algorithm's baseline (see the resources below if you don't know what this is, and the second sketch after this list)
- Matplotlib graphed output of results
- Reward discount factor (combined with the baseline in the second sketch below)
- Added CHO and cleaned things up a bit; it automatically optimizes the training hyperparameters on the environment
- Bayesian hyperparameter optimization as an upgrade to CHO, which will be reflected here as well as in several other repos (currently working on this)
- Support for other types of action spaces: since CartPole is better handled by the Cross-Entropy Method, I will likely adjust the policy gradient code to, for instance, update the mean and variance of multiple Gaussian distributions and draw parameters from them instead
- Various other improvements and upgrades as I see them, including any cleverness I figure out for CHO or Policy Gradients
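For the two exploration/exploitation options above, here is a minimal sketch of how epsilon-greedy and softmax sampling differ, assuming `probs` is the softmax output of the policy network (the function names are my own, not the repo's actual API):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(probs, epsilon):
    """With probability epsilon explore uniformly at random, else exploit (argmax)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(probs)))
    return int(np.argmax(probs))

def softmax_draw(probs):
    """Explore by sampling directly from the policy's softmax distribution."""
    return int(rng.choice(len(probs), p=probs))

probs = np.array([0.7, 0.3])              # policy output for a 2-action environment
a1 = epsilon_greedy(probs, epsilon=0.1)
a2 = softmax_draw(probs)

epsilon = 1.0
epsilon *= 0.99   # exponential decay of epsilon each episode, as in the feature list
```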
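And for the discount factor plus the mean-of-rewards baseline, a sketch of how the two combine into the quantity that scales the gradient update. The variable names are illustrative, and this uses a simple batch-mean variant of the running baseline:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed by sweeping backwards over the episode."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

rewards = [1.0, 1.0, 1.0, 1.0]    # CartPole gives +1 per surviving step
G = discounted_returns(rewards)
baseline = G.mean()                # batch-mean variant of the running-mean baseline
advantages = G - baseline          # this scales grad log pi(a_t | s_t) in the update
```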
I also included my Cross-Entropy Method code for black-box optimization, in case you'd like that too (in old/).
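For completeness, here is a minimal, generic sketch of the Cross-Entropy Method. It is not the exact implementation in old/; the function names and the toy objective are illustrative:

```python
import numpy as np

def cem(score_fn, dim, iterations=50, pop_size=50, elite_frac=0.2, seed=0):
    """Cross-Entropy Method: sample parameters from a Gaussian, keep the elite
    fraction by score, and refit the Gaussian's mean and std to those elites."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = int(pop_size * elite_frac)
    for _ in range(iterations):
        samples = rng.normal(mean, std, size=(pop_size, dim))
        scores = np.array([score_fn(s) for s in samples])
        elites = samples[np.argsort(scores)[-n_elite:]]   # highest-scoring samples
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean

# Toy usage: maximize -(x - 3)^2 summed over dimensions; the optimum is all 3s.
best = cem(lambda x: -np.sum((x - 3.0) ** 2), dim=4)
```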
### If you want good resources for understanding policy gradient reinforcement learning better, look here:
- OpenAI's tutorial (the entirety of this site is awesome for starting out)
- Some slides on it
- More slides
- Good extra: neat examples of the algorithm along with a baseline algorithm
### (Wait, isn't it gradient ascent instead of descent? hmm...)
Technically yes: we ascend the gradient of expected reward, which is usually implemented as descent on the negative of that objective, so the two descriptions are equivalent.