An extremely simple PPO implementation based on PyTorch, with ~270 lines of code. It supports:
- Grid search of hyperparameters
- Showing training curves with TensorBoard
- Parallel sampling from multiple environments via the Gym interface. You can define your own parallel environment by modifying `env.py`.
- Discrete action space (select one action per step) and MultiDiscrete action space (select multiple actions per step).
- Single-GPU or multi-GPU training via PyTorch's `nn.DataParallel` (see the sketch below).
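As a rough sketch of how the multi-GPU option can be used, assuming a simple actor-critic module (the `Agent` class and its layer sizes below are illustrative, not the actual network in `ppo.py`):

```python
import torch
import torch.nn as nn

class Agent(nn.Module):
    """Illustrative actor-critic network (not the one in ppo.py)."""
    def __init__(self, obs_dim=16, n_actions=4):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, obs):
        return self.actor(obs), self.critic(obs)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
agent = Agent().to(device)
if torch.cuda.device_count() > 1:
    # nn.DataParallel replicates the module on each GPU and splits
    # the batch dimension across the replicas
    agent = nn.DataParallel(agent)
```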
This implementation is a simplified version of the PPO algorithm in CleanRL (link).
Just run `ppo.py` in each folder.
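For example, from inside one of the folders:

```
python ppo.py
```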
During training, the output will be recorded in a `runs` folder. You can visualize it by running:

```
tensorboard --logdir=runs
```
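Internally, this kind of logging is usually done with `torch.utils.tensorboard`; a minimal sketch (the log directory and tag name are illustrative, not necessarily the ones used in `ppo.py`):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/experiment_1")  # illustrative run name
for global_step in range(10):
    # each add_scalar call appends one point to a TensorBoard curve
    writer.add_scalar("charts/episodic_return", float(global_step), global_step)
writer.close()
```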
`env.py` contains a "filling grid" environment. There are several grids; if the agent fills a blank grid, it receives a reward of +1, otherwise the reward is -1.
Notes:
- When you use multiple GPUs, the number of GPUs should be smaller than or equal to the number of parallel environments.
- I am using `gym.vector.AsyncVectorEnv` to create parallel environments with multiprocessing. However, debugging a multiprocessing program is complicated, so I advise switching to `gym.vector.SyncVectorEnv` during debugging, which steps the environments sequentially in a single process (see the sketch after this list).
- The current multi-GPU capability works but is quite slow. I will improve it.
- New: PPO+Transformer builds on the "MultiDiscrete Action Space - Sequential Sampling - Single GPU" configuration. The implementation is rough, but it works.
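A sketch of how to switch between the two vectorized wrappers, using `gym.make` with a standard environment for self-containment (the repo builds its environments from `env.py` instead); written against the classic Gym API, so newer gym/gymnasium versions would return 5-tuples from `step`:

```python
import gym

n_envs = 4
debug = True  # set to False for multiprocess sampling

env_fns = [lambda: gym.make("CartPole-v1") for _ in range(n_envs)]

if debug:
    # SyncVectorEnv steps each copy sequentially in the main process,
    # so breakpoints and stack traces behave normally
    envs = gym.vector.SyncVectorEnv(env_fns)
else:
    # AsyncVectorEnv runs each copy in its own subprocess for faster sampling
    envs = gym.vector.AsyncVectorEnv(env_fns)

obs = envs.reset()
obs, rewards, dones, infos = envs.step(envs.action_space.sample())
envs.close()
```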
Please cite the following work:
Li, Z. (2022). Use Reinforcement Learning to Generate Testing Commands for Onboard Software of Small Satellites.
The RL algorithms of this work can be found in `StarCycle/TestCommandGeneration`.