To see differences between algorithms, try running `diff -y <file1> <file2>`, e.g., `diff -y ddpg.py td3.py`.
For MPI versions of the on-policy algorithms, see the `mpi` branch.

The following algorithms are implemented:
- Vanilla Policy Gradient/Advantage Actor-Critic
- Trust Region Policy Optimization
- Proximal Policy Optimization
- Deep Deterministic Policy Gradient
- Twin Delayed DDPG
- Soft Actor-Critic
- Deep Q-Network
Note that implementation details can have a significant effect on performance, as discussed in *What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study*. This codebase attempts to be as simple as possible, but a few such details are present: the on-policy algorithms use separate actor and critic networks, a state-independent policy standard deviation, per-minibatch advantage normalisation, and several critic updates per minibatch, while the deterministic off-policy algorithms use layer normalisation. Similarly, Soft Actor-Critic uses a transformed Normal distribution by default, which can also help the on-policy algorithms.
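As an illustration, here is a minimal PyTorch-style sketch of those on-policy details: separate actor and critic networks, a state-independent policy standard deviation, per-minibatch advantage normalisation, and several critic updates per minibatch. All names, layer sizes, and hyperparameters are illustrative assumptions, not this codebase's actual values.

```python
import torch
from torch import nn
from torch.distributions import Normal


class Actor(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        self.policy_mean = nn.Sequential(nn.Linear(state_size, hidden_size), nn.Tanh(),
                                         nn.Linear(hidden_size, action_size))
        # State-independent standard deviation: one learned parameter per
        # action dimension, shared across all states
        self.policy_log_std = nn.Parameter(torch.zeros(action_size))

    def forward(self, state):
        return Normal(self.policy_mean(state), self.policy_log_std.exp())


class Critic(nn.Module):  # a separate network, not a shared-trunk head
    def __init__(self, state_size, hidden_size=64):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(state_size, hidden_size), nn.Tanh(),
                                   nn.Linear(hidden_size, 1))

    def forward(self, state):
        return self.value(state).squeeze(-1)


actor, critic = Actor(4, 2), Critic(4)
critic_optimiser = torch.optim.Adam(critic.parameters(), lr=1e-3)

states, returns = torch.randn(32, 4), torch.randn(32)  # dummy minibatch
advantages = returns - critic(states).detach()
# Per-minibatch advantage normalisation
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# Several critic updates per minibatch
for _ in range(5):
    value_loss = (critic(states) - returns).pow(2).mean()
    critic_optimiser.zero_grad()
    value_loss.backward()
    critic_optimiser.step()
```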
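For the deterministic off-policy algorithms, layer normalisation would typically be applied after each hidden layer of the critic, along these lines (again a hypothetical sketch; the actual architecture and sizes may differ):

```python
import torch
from torch import nn


class QNetwork(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        # LayerNorm after each hidden layer helps stabilise
        # DDPG/TD3-style critics
        self.q = nn.Sequential(
            nn.Linear(state_size + action_size, hidden_size),
            nn.LayerNorm(hidden_size), nn.Tanh(),
            nn.Linear(hidden_size, hidden_size),
            nn.LayerNorm(hidden_size), nn.Tanh(),
            nn.Linear(hidden_size, 1))

    def forward(self, state, action):
        return self.q(torch.cat([state, action], dim=-1)).squeeze(-1)
```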
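The transformed Normal used by Soft Actor-Critic is, presumably, a tanh-squashed Gaussian, which keeps sampled actions in (-1, 1) while still providing log-probabilities via the change-of-variables formula. With `torch.distributions` it might look like this (a sketch under that assumption):

```python
import torch
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import TanhTransform

mean, std = torch.zeros(2), torch.ones(2)  # dummy policy outputs
# cache_size=1 caches the pre-tanh value, avoiding a numerically
# unstable atanh when evaluating log_prob on sampled actions
policy = TransformedDistribution(Normal(mean, std), TanhTransform(cache_size=1))
action = policy.rsample()  # reparameterised sample in (-1, 1)
log_prob = policy.log_prob(action).sum(-1)  # sum over independent action dims
```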