Roboro: to strengthen, reinforce
The aim of this library is to implement modular deep reinforcement learning (RL) algorithms. It builds on established libraries: pytorch-lightning for training organization, hydra for command-line input and configuration, and mlflow for experiment logging.
The modularity of this library is meant to provide a taxonomy of algorithms and their extensions. Orthogonal improvements to, e.g., DQN can be combined on the fly, while each implementation stays enclosed in its own class.
Another focus of this library is offline RL (also called batch RL). Agents can be trained not only in environments but also on expert datasets, or on both in combination. Agents can likewise be evaluated on datasets as well as in environments.
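As an illustration of the offline setting, a logged dataset of transitions could be wrapped into a standard PyTorch dataset as in the minimal sketch below; the file name and array layout are assumptions for illustration, not roboro's actual data format.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def load_offline_transitions(path):
    """Load logged (obs, action, reward, next_obs, done) arrays from an .npz file.
    The file layout here is an assumption for illustration only."""
    data = np.load(path)
    tensors = [torch.as_tensor(data[key])
               for key in ("obs", "action", "reward", "next_obs", "done")]
    return TensorDataset(*tensors)

# Batches of expert transitions can then complement or replace environment interaction:
# loader = DataLoader(load_offline_transitions("expert_data.npz"), batch_size=32, shuffle=True)
```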
Please install the requirements first: pip install -r requirements.txt
To run cartpole with predefined settings, enter: python3 train.py env=cart
For Pong: python3 train.py env=pong
For any other gym-registered env, e.g.: python3 train.py learner.train_env=PongNoFrameskip-v4
Check out configs/main.py for adjustable hyperparameters. For example, you can force frameskipping and change the learning rate by calling: python3 train.py learner.train_env=PongNoFrameskip-v4 opt.lr=0.001 learner.frameskip=2
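As a point of reference, a hydra-driven entry point usually looks like the minimal sketch below; the config path, config name, and field names are assumptions for illustration and not necessarily how roboro's train.py is organized. Dotted command-line overrides such as opt.lr=0.001 end up as nested fields of the config object.

```python
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="configs", config_name="main")  # names assumed for illustration
def train(cfg: DictConfig) -> None:
    # Command-line overrides like learner.frameskip=2 are merged into cfg before this runs.
    print(OmegaConf.to_yaml(cfg))
    # ...build the agent/learner from cfg.agent, cfg.learner, cfg.opt and start training...

if __name__ == "__main__":
    train()
```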
You can combine algorithms as you like. For example, you can combine IQN with PER, CER, and M-RL to train on Pong like this: python3 train.py env=pong agent.iqn=1 agent.munch_q=1 learner.per=1 learner.cer=1
- Uniform Experience Replay and Prioritized Experience Replay (PER). Defaults to uniform experience replay.
- Corrected Experience Replay (CER). Can be combined with either uniform or prioritized experience replay.
- Double Q-learning (DDQN). To reduce the maximization bias of Q-learning, the online network selects the action in the Bellman update, whereas the target network evaluates the selected action; a sketch is given at the end of this README.
- Target networks that are either copied every N steps or Polyak-averaged, as in DDPG. Defaults to Polyak averaging (sketched at the end of this README).
- QV-learning and QVMax. In addition to a Q-network, a V-network (state-value network) is trained. In QV-learning, the Q-network is trained using the target of the V-network. In QVMax, the V-network is additionally trained using the target of the Q-network, making it an off-policy algorithm (see the sketch at the end of this README).
- Observation Z-standardization. The mean and standard deviation are collected during an initial rollout with random actions. Enabled by default.
- Random Ensemble Mixture (REM). During value-network optimization, the N value networks are mixed using a randomly sampled categorical distribution, i.e. a random convex combination of their estimates; a sketch is given at the end of this README.
- Implicit Quantile Networks (IQN). The value network is trained to predict N quantiles of the return distribution.
- Munchausen RL (M-RL). A form of maximum-entropy RL that adds the scaled log-policy to the reward, optimizing for the optimal policy in addition to the optimal value function (sketched at the end of this README).
- Training on offline data
- Evaluating agents using offline data
- Efficient Eligibility traces, as described in v1 of the arXiv paper.
- MuZero
See the Trello board for details.
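Below are a few minimal, illustrative sketches of the value-learning mechanisms referenced in the feature list. They are hedged examples under assumed tensor shapes and function names, not roboro's actual implementation.

First, the double Q-learning target together with a Polyak-averaged target-network update: the online network selects the next action, the target network evaluates it, and the target network slowly tracks the online network.

```python
import torch

def ddqn_target(q_online, q_target, reward, next_obs, done, gamma=0.99):
    """Double Q-learning: the online net selects the next action, the target net evaluates it.
    Assumes q_online(next_obs) and q_target(next_obs) return (batch, num_actions) tensors."""
    with torch.no_grad():
        next_action = q_online(next_obs).argmax(dim=1, keepdim=True)        # action selection
        next_value = q_target(next_obs).gather(1, next_action).squeeze(1)   # action evaluation
        return reward + gamma * (1 - done.float()) * next_value

def polyak_update(online_net, target_net, tau=0.005):
    """Soft target update: target <- tau * online + (1 - tau) * target."""
    for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
        p_target.data.mul_(1 - tau).add_(tau * p_online.data)
```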
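Next, a sketch of the QV and QVMax targets: the Q-network regresses toward a target built from the V-network, and in QVMax the V-network additionally regresses toward a target built from the maximal Q-value of the next state. The qvmax flag and the assumed output shapes are illustrative.

```python
import torch

def qv_targets(q_target_net, v_target_net, reward, next_obs, done, gamma=0.99, qvmax=False):
    """Assumes v_target_net outputs (batch, 1) state values and
    q_target_net outputs (batch, num_actions) action values."""
    with torch.no_grad():
        not_done = 1 - done.float()
        # Q-network target uses the V-network's estimate of the next state.
        q_target = reward + gamma * not_done * v_target_net(next_obs).squeeze(1)
        if qvmax:
            # QVMax: the V-network target uses the maximal Q-value of the next state (off-policy).
            v_target = reward + gamma * not_done * q_target_net(next_obs).max(dim=1).values
        else:
            # Plain QV-learning bootstraps V from its own next-state estimate.
            v_target = reward + gamma * not_done * v_target_net(next_obs).squeeze(1)
        return q_target, v_target
```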
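A sketch of the Random Ensemble Mixture idea: the loss is computed on a random convex combination of the ensemble's Q-estimates. The head layout below is an assumption.

```python
import torch

def rem_mixture(q_values_per_head):
    """q_values_per_head: tensor of shape (num_heads, batch, num_actions).
    Returns a random convex combination of the heads' Q-value estimates."""
    num_heads = q_values_per_head.shape[0]
    alphas = torch.rand(num_heads)
    alphas = alphas / alphas.sum()  # random categorical mixture weights summing to 1
    return torch.einsum("h,hba->ba", alphas, q_values_per_head)
```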
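Finally, a sketch of the Munchausen-DQN target: the scaled, clipped log-policy of the taken action is added to the reward, and the bootstrap uses a soft (entropy-regularized) value of the next state. The hyperparameter defaults follow the Munchausen RL paper; everything else is illustrative.

```python
import torch
import torch.nn.functional as F

def munchausen_target(q_target_net, obs, action, reward, next_obs, done,
                      gamma=0.99, tau=0.03, alpha=0.9, clip=-1.0):
    """Assumes q_target_net returns (batch, num_actions) and action is a (batch,) long tensor."""
    with torch.no_grad():
        # Munchausen bonus: scaled, clipped log-probability of the action actually taken.
        log_pi = F.log_softmax(q_target_net(obs) / tau, dim=1)
        taken_log_pi = log_pi.gather(1, action.unsqueeze(1)).squeeze(1)
        munchausen_bonus = alpha * torch.clamp(tau * taken_log_pi, min=clip)
        # Soft Bellman backup over the next state's softmax policy.
        next_q = q_target_net(next_obs)
        next_log_pi = F.log_softmax(next_q / tau, dim=1)
        next_pi = next_log_pi.exp()
        soft_value = (next_pi * (next_q - tau * next_log_pi)).sum(dim=1)
        return reward + munchausen_bonus + gamma * (1 - done.float()) * soft_value
```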