Collection of exercises and minimal DRL implementations used for teaching. All algorithms are implemented in PyTorch. Note that these algorithms are intended for pedagogical purposes and therefore include few implementation tricks: the goal of this code is to give a clear view of how DRL algorithms are implemented, so its performance may be low compared to state-of-the-art implementations such as Stable Baselines 3.
The following examples and exercises cover basic ideas needed to understand RL:
- Multi-armed bandit, which is devoted to implementing the epsilon-greedy algorithm and using it to solve the multi-armed bandit problem.
- Stationary distribution of an MP, which is devoted to computing the stationary distribution of a Markov Process (MP).
- Gym interface example, which is devoted to showing how to use the Gym interface to create your own RL environments.
- Bellman fixed point for PE, which is devoted to implementing the Bellman fixed point equations for policy evaluation (PE) in a simple MDP.
- Bellman fixed point for PE in a Random Walk, which is similar to the previous code, but uses a Random Walk with terminal states as the MDP.
- Bellman random Policy Search, which is devoted to finding the optimal policy using random search.
- Bellman random Policy Search in the Cliff, which is similar to the previous code, but using the Cliff problem as the MDP.
- Optimal policy using Bellman Equations, which is devoted to finding the optimal policy by solving the Bellman Equations in a simple case of discrete actions.
- The recycling robot, which is devoted to finding the optimal policy by solving the Bellman Equations.
- MDP simple example, which is devoted to showing how to compute the basic elements of an MDP.
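For intuition on the first exercise in the list, a minimal epsilon-greedy bandit loop can be sketched as follows. This is a standalone illustration with made-up arm means and hyperparameters, not the repository's code:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.8])  # hypothetical arm means (illustrative)
eps, n_steps = 0.1, 5000
Q = np.zeros(3)  # estimated value of each arm
N = np.zeros(3)  # number of pulls per arm

for _ in range(n_steps):
    # explore with probability eps, otherwise pull the greedy arm
    if rng.random() < eps:
        a = int(rng.integers(3))
    else:
        a = int(np.argmax(Q))
    r = rng.normal(true_means[a], 1.0)  # noisy reward
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]  # incremental sample-average update

best_arm = int(np.argmax(Q))
```

After enough pulls, `best_arm` should identify the arm with the highest true mean, since exploration keeps every estimate from starving.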
The following examples and exercises correspond to some classic RL algorithms, including iterative methods, tabular methods, and linear function approximation methods:
- Policy Evaluation, which is devoted to implementing the policy evaluation algorithm for a simple MDP.
- Policy Iteration, which is devoted to implementing the policy iteration algorithm for a simple MDP.
- Value Iteration, which is devoted to implementing the value iteration algorithm for a simple MDP.
- Grid World, which is devoted to reviewing the iterative methods (PE, PI, VI) in a Grid World problem.
- Every-visit Monte-Carlo, which is devoted to implementing the every-visit Monte-Carlo algorithm for a simple MDP.
- Off-policy Monte-Carlo via Importance Sampling, which is devoted to implementing the off-policy Monte-Carlo algorithm using Importance Sampling for a simple MDP.
- Monte Carlo and Temporal Difference in a simple MDP, which is an example devoted to implementing the Monte Carlo and Temporal Difference algorithms for a simple MDP.
- Monte Carlo and Temporal Difference in the Cliff, which is an example devoted to implementing the Monte Carlo and Temporal Difference algorithms for the Cliff.
- Monte Carlo and Temporal Difference in a random walk, which is devoted to implementing the Monte Carlo and Temporal Difference algorithms for a random walk.
- SARSA and Q-learning in a simple MDP, which is an example devoted to implementing the SARSA and Q-learning algorithms for a simple MDP.
- SARSA and Q-learning in the Cliff, which is devoted to implementing the SARSA and Q-learning algorithms for the Cliff problem.
- SARSA and Q-learning in a random walk, which is devoted to implementing the SARSA and Q-learning algorithms for a random walk.
- Feature basis for linear approximations, which is devoted to implementing a feature basis for a linear approximation.
- Model-based prediction using linear approximations, which is devoted to implementing BPE (Bellman Projected Equation), a model-based prediction algorithm using linear approximations.
- Model-free prediction using linear approximations, which is devoted to implementing LSTD, a model-free prediction algorithm using linear approximations.
- Model-free control using linear approximations, which is devoted to implementing LSPI, a model-free control algorithm using linear approximations.
- Linear approximation limits, which is an example devoted to showing the limits of linear approximations.
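To make the tabular methods above concrete, here is a minimal Q-learning sketch on a toy 5-state chain. The environment and hyperparameters are invented for illustration; this is not one of the repository's MDPs:

```python
import numpy as np

# Hypothetical 5-state chain: states 0..4, start at 0, state 4 is terminal.
# Actions: 0 = left, 1 = right (deterministic moves, clipped at the edges).
# Reaching state 4 gives reward 1; every other transition gives 0.
n_states, n_actions = 5, 2
gamma, alpha, eps = 0.9, 0.5, 0.1
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1), s2 == n_states - 1

for _ in range(500):  # episodes
    s, done = 0, False
    while not done:
        if rng.random() < eps:
            a = int(rng.integers(n_actions))
        else:
            # greedy action with random tie-breaking
            a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s2, r, done = step(s, a)
        target = r if done else r + gamma * Q[s2].max()
        Q[s, a] += alpha * (target - Q[s, a])  # Q-learning update
        s = s2

greedy_policy = np.argmax(Q[:-1], axis=1)  # best action per non-terminal state
```

On this chain the learned greedy policy should move right in every non-terminal state, and the Q-values should approach the discounted returns (e.g. about `gamma**k` for moving right `k+1` states away from the goal).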
The following examples implement model-free DRL algorithms (all tested on the Cartpole problem):
- DDQN (Double Deep Q-Networks)
- VPG (Vanilla Policy Gradient)
- A2C (Advantage Actor Critic)
- TRPO (Trust Region Policy Optimization; in this case we use the Stable Baselines 3 implementation instead of providing our own, in order to showcase a state-of-the-art library)
- DDPG (Deep Deterministic Policy Gradient)
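As a flavor of the policy-gradient family above, here is a minimal REINFORCE-style update in PyTorch on a trivial one-step task (two actions; action 1 always pays reward 1, action 0 pays 0). The task and hyperparameters are made up for illustration; this is a sketch, not one of the repository's implementations:

```python
import torch

torch.manual_seed(0)
logits = torch.zeros(2, requires_grad=True)  # policy parameters (softmax logits)
opt = torch.optim.Adam([logits], lr=0.1)

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    a = dist.sample()
    reward = 1.0 if a.item() == 1 else 0.0  # only action 1 is rewarded
    loss = -dist.log_prob(a) * reward  # REINFORCE: ascend E[R * log pi(a)]
    opt.zero_grad()
    loss.backward()
    opt.step()

probs = torch.softmax(logits.detach(), dim=0)  # learned action probabilities
```

Every rewarded sample increases the log-probability of the action taken, so the policy should concentrate most of its probability mass on action 1.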
For model-based DRL, the only implemented example is AlphaZero (tested on tic-tac-toe).
| Link | Observations |
|---|---|
| Example 7.6 | Code for the example in the slides |
| Example 8.1 | Code for the example in the slides |
| Example 8.3 | Code for the example in the slides |
| Example 8.5 | Homework |
| Example 8.6 | Homework |
| Example 8.7 | Homework |
| Example 9.1 | Code for the example in the slides |
| Example 9.2 | Code for the example in the slides |
| Example 9.3 | Code for the example in the slides |
| Example 9.4 | Homework |
| Example 9.5 | Homework |
| Example 9.6 | Homework |
| Example 9.7 | Homework |
| Example 9.8 | Homework |
| Example 9.9 | Code for the example in the slides |
| Example 9.10 | Code for the example in the slides |
| Example 9.11 | Code for the example in the slides |
| Example 9.12 | Code for the example in the slides |
| Link | Observations |
|---|---|
| Exercise 2.1 | Exercise to be completed by the student |
| Exercise 3.2 | Exercise to be completed by the student |
| Exercise 3.3 | Exercise to be completed by the student |
| Exercise 3.4 | Exercise to be completed by the student |
| Exercise 3.5 | Exercise to be completed by the student |
| Exercise 3.6 | Exercise to be completed by the student |
| Exercise 3.7 | Exercise to be completed by the student |
| Exercise 4.1 | Exercise to be completed by the student |
| Exercise 4.2 | Exercise to be completed by the student |
| Exercise 4.3 | Exercise to be completed by the student |
| Exercise 4.4 | Exercise to be completed by the student |
| Exercise 5.1 | Exercise to be completed by the student |
| Exercise 5.2 | Exercise to be completed by the student |
| Exercise 5.3 | Exercise to be completed by the student |
| Exercise 5.4 | Exercise to be completed by the student |
| Exercise 5.5 | Exercise to be completed by the student |
| Exercise 6.1 | Exercise to be completed by the student |
| Exercise 6.2 | Exercise to be completed by the student |
| Exercise 6.3 | Exercise to be completed by the student |
| Exercise 6.4 | Exercise to be completed by the student |
| Example 7.1 | Code for the example in the slides |
| Example 7.2 | Code for the example in the slides |
| Example 7.3 | Code for the example in the slides |
| Example 7.4 | Code for the example in the slides |
| Example 7.5 | Code for the example in the slides |
| Example 7.6 | Code for the example in the slides |
| Example 7.7 | Code for the example in the slides |
The recommended way of executing these codes is to use Google Colab. The simplest way of doing so is to navigate to the code you want to execute and then replace `github.com` in the URL with `githubtocolab.com` (for example, a hypothetical `https://github.com/<user>/<repo>/blob/main/notebook.ipynb` becomes `https://githubtocolab.com/<user>/<repo>/blob/main/notebook.ipynb`).
A second option is to go to Colab, and in the Open options, select GitHub and add this repository.
And finally, you can also download the code and execute it on your own machine, after installing all required dependencies.
If you are interested in DRL and want to keep on learning, it might be worth checking the following resources:
- Spinning up in DRL is an OpenAI webpage devoted to giving an in-depth introduction to DRL, as well as a set of papers to learn more advanced topics. Their documentation is well-written, and their code is also available and worth checking.
- CleanRL is a project that provides single-file implementations of many DRL algorithms, in order to facilitate understanding. Their documentation is nice, and it is a code repository worth checking.
- Stable Baselines 3 is a high-quality implementation of most DRL algorithms, which is highly recommended if you want to use state-of-the-art implementations of the most popular DRL algorithms. It is a solid alternative to OpenAI Baselines.