This repository contains implementations of most of the core algorithms in Deep Reinforcement Learning (DRL), listed below. All implementations are written in Python and use PyTorch for models, optimizers, and training. The algorithms work with any Gymnasium environment, or with any other environment that follows the Gymnasium API. Each algorithm is implemented standalone and is independent of the others, even where they share many ideas; this keeps the code for each algorithm easy to read. The library is organized into directories, each covering a specific class of DRL algorithms. These currently include the following:
- Core: Most of the main on-policy and off-policy DRL algorithms, listed in detail below.
- Exploration: Algorithms that improve the DRL agent's ability to explore its environment, typically for environments with a sparse reward signal. In this repository, this category also includes algorithms for safe exploration.
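
As a rough illustration of what "follows the Gymnasium API" means in practice, the sketch below defines a toy environment with Gymnasium-style `reset` and `step` signatures and a generic rollout loop. The names `CoinFlipEnv` and `run_episode` are hypothetical and are not part of this repository; the class deliberately does not import `gymnasium`, so it only mirrors the call pattern and is not a full `gymnasium.Env` subclass.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment mimicking Gymnasium's reset/step signatures.

    The observation is the current coin face (0 or 1); the agent is rewarded
    for choosing the action equal to that face.
    """

    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self._rng = random.Random()
        self._steps = 0
        self._coin = 0

    def reset(self, seed=None, options=None):
        # Gymnasium-style reset returns (observation, info).
        if seed is not None:
            self._rng.seed(seed)
        self._steps = 0
        self._coin = self._rng.randint(0, 1)
        return self._coin, {}

    def step(self, action):
        # Gymnasium-style step returns
        # (observation, reward, terminated, truncated, info).
        reward = 1.0 if action == self._coin else 0.0
        self._steps += 1
        self._coin = self._rng.randint(0, 1)
        terminated = False  # this toy task has no terminal state
        truncated = self._steps >= self.max_steps  # time limit reached
        return self._coin, reward, terminated, truncated, {}


def run_episode(env, policy, seed=None):
    """Roll out one episode and return the undiscounted return."""
    obs, info = env.reset(seed=seed)
    episode_return = 0.0
    done = False
    while not done:
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        episode_return += reward
        done = terminated or truncated
    return episode_return


if __name__ == "__main__":
    # A policy that copies the observed coin always guesses correctly here.
    print(run_episode(CoinFlipEnv(max_steps=10), policy=lambda obs: obs, seed=0))
```

Any environment exposing these two methods with the same return tuples can be driven by the same loop, which is the property the algorithms here rely on.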

The library provides the following features:
- MLP, CNN and CNN-LSTM (recurrent) policies
- TensorBoard integration for logging
- Parallel vector environments
- NVIDIA GPU support
- Model saving, checkpointing, and the ability to start training from an existing model's parameters
- Environment saving and loading for both base and arbitrarily wrapped environments
- Policy testing using saved environments and models, in addition to easy video recording
- Support for learning rate scheduling
- Parameter sharing for CNN-based architectures (except for TRPO)
- Return normalization and action rescaling to [-1, 1] for Box action spaces
- Flexible sequence lengths for recurrent policies, with adjustable 'burn-in' periods and hidden state management for uninterrupted rollouts
- Highly customizable algorithm and architecture configurations through scripting or the terminal
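
To make the action rescaling feature more concrete, here is a minimal sketch of the usual affine map between the normalized range [-1, 1] and the bounds of a Box action space. The function names `rescale_action` and `normalize_action` are hypothetical, and the bounds are passed as plain lists rather than a Gymnasium `Box`; the repository's own implementation may differ.

```python
def rescale_action(action, low, high):
    """Map each action component from [-1, 1] to [low, high].

    Inputs outside [-1, 1] are clipped first, so the result always
    respects the bounds.
    """
    rescaled = []
    for a, lo, hi in zip(action, low, high):
        a = max(-1.0, min(1.0, a))  # clip to the normalized range
        rescaled.append(lo + 0.5 * (a + 1.0) * (hi - lo))
    return rescaled


def normalize_action(action, low, high):
    """Inverse map: bring each component from [low, high] back to [-1, 1]."""
    return [2.0 * (a - lo) / (hi - lo) - 1.0 for a, lo, hi in zip(action, low, high)]


if __name__ == "__main__":
    low, high = [0.0, -2.0, 10.0], [1.0, 2.0, 20.0]
    print(rescale_action([-1.0, 0.0, 1.0], low, high))  # [0.0, 0.0, 20.0]
```

Keeping the policy's outputs in [-1, 1] and rescaling only at the environment boundary lets the same network output range be used regardless of the environment's action bounds.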

The following algorithms are currently implemented:
- Deep Q-Learning Network (DQN)
- Advantage Actor-Critic (A2C)
- Trust Region Policy Optimization (TRPO)
- Proximal Policy Optimization (PPO)
- Deep Deterministic Policy Gradient (DDPG)
- Twin Delayed Deep Deterministic Policy Gradient (TD3)
- Soft Actor-Critic (SAC)
- Curiosity-driven Exploration by Self-supervised Prediction (CDESP)
- Hindsight Experience Replay (HER)
- Adaptive Policy ReguLarization (APRL)
- To do:
  - Adversarial RL
  - Meta RL
  - Policy Distillation

This repository builds on the following resources:
- OpenAI's Spinning Up, which was my main source of information (in addition to the original papers) for learning about the core algorithms and Gym.
- Stable-Baselines3 (SB3), mainly for clearing up confusion about parameter sharing in on-policy algorithms and as a guide for default hyperparameter values.
- The amazing blog post "The 37 Implementation Details of Proximal Policy Optimization" by Huang et al., which dives into all the important details of PPO's implementation.
- The Generalized Advantage Estimation (GAE) paper and the Recurrent Replay Distributed DQN (R2D2) paper, which cleared up many points of confusion about recurrent policies in general.
- The amazing book "Dive into Deep Learning" by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola, for Deep Learning using PyTorch.
- Note: All corresponding papers are linked with their algorithms above. For SAC, the link corresponds to the second paper, on which the implementation is based and which describes automatic temperature coefficient adjustment using dual gradient descent. The original SAC paper can be found here.


