.. _developer:

================
Developer Guide
================

This guide is meant for those who want to understand the internals and the design choices of Stable-Baselines3.

First, you should read the two issues where the design choices were discussed:

- https://github.com/hill-a/stable-baselines/issues/576
- https://github.com/hill-a/stable-baselines/issues/733

The library is not meant to be modular, although inheritance is used to reduce code duplication.

Algorithms Structure
====================

Each algorithm (on-policy and off-policy) follows a common structure.
There is one folder per algorithm, and in that folder there is the algorithm and the policy definition (``policies.py``).

Each algorithm has two main methods:

- ``.collect_rollouts()`` defines how new samples are collected, usually inherited from the base class. Those samples are then stored in a ``RolloutBuffer`` (discarded after the gradient update) or a ``ReplayBuffer``.

- ``.train()`` updates the parameters using samples from the buffer.

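To make the interplay of those two methods concrete, here is a toy sketch (not the actual SB3 code; all names and bodies are illustrative stand-ins) of how a training loop alternates collection and updates:

```python
# Illustrative sketch of the collect/train cycle described above.
class SketchAlgorithm:
    def __init__(self):
        self.buffer = []   # stands in for RolloutBuffer / ReplayBuffer
        self.n_updates = 0

    def collect_rollouts(self, n_steps):
        # In SB3 this steps the environment and stores transitions;
        # here we just store dummy values.
        self.buffer.extend(range(n_steps))

    def train(self):
        # In SB3 this samples from the buffer and runs gradient steps.
        self.n_updates += 1
        # On-policy case: rollout data is discarded after the update.
        self.buffer.clear()

    def learn(self, total_steps, n_steps_per_rollout=4):
        steps = 0
        while steps < total_steps:
            self.collect_rollouts(n_steps_per_rollout)
            steps += n_steps_per_rollout
            self.train()
        return self

algo = SketchAlgorithm().learn(total_steps=12)
print(algo.n_updates)  # 3 collect/update cycles
```
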
Where to start?
===============

The first things you need to read and understand are the base classes in the ``common/`` folder:

- ``BaseRLModel`` in ``base_class.py``, which defines what an RL class should look like.
  It also contains all the "glue code" for saving/loading and the common operations (such as wrapping environments).

- ``BasePolicy`` in ``policies.py``, which defines what a policy class should look like.
  It also contains all the magic behind the ``.predict()`` method, to handle as many cases as possible.

- ``OffPolicyRLModel`` in ``base_class.py``, which contains the implementation of ``collect_rollouts()`` for the off-policy algorithms.

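The class layout above can be sketched as follows. The class names match the ones listed here, but the method bodies are simplified stand-ins, not SB3's real implementation:

```python
from abc import ABC, abstractmethod

class BaseRLModel(ABC):
    """Defines what an RL algorithm should look like, plus glue code."""

    @abstractmethod
    def train(self):
        ...

    def save(self, path):
        # In SB3, the real glue code serializes parameters and data here.
        return f"saved to {path}"

class OffPolicyRLModel(BaseRLModel):
    """Shares ``collect_rollouts()`` across off-policy algorithms."""

    def collect_rollouts(self):
        return "transitions stored in the replay buffer"

class SketchSAC(OffPolicyRLModel):
    """A concrete algorithm only needs to provide ``train()``."""

    def train(self):
        return "gradient step"

model = SketchSAC()
print(model.collect_rollouts())  # inherited from OffPolicyRLModel
```
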
All the environments handled internally are assumed to be ``VecEnv`` (``gym.Env`` instances are automatically wrapped).

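The idea behind the auto-wrapping can be sketched like this (toy classes, not the real ``VecEnv``/``DummyVecEnv``): a plain env gets wrapped so the rest of the code can always assume a vectorized, batched interface.

```python
class ToyEnv:
    """Stands in for a plain gym.Env."""

    def reset(self):
        return 0.0

class SketchDummyVecEnv:
    """Stands in for a VecEnv holding one or more envs."""

    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        # Always returns a batch of observations, even with a single env.
        return [env.reset() for env in self.envs]

def wrap_if_needed(env):
    # Plain envs are wrapped; an already-vectorized env passes through.
    if isinstance(env, SketchDummyVecEnv):
        return env
    return SketchDummyVecEnv([lambda: env])

vec_env = wrap_if_needed(ToyEnv())
print(vec_env.reset())  # [0.0] -- a batch of one observation
```
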
Pre-Processing
==============

To handle different observation spaces, some pre-processing needs to be done (e.g. one-hot encoding for discrete observations).
Most of the pre-processing code is in ``common/preprocessing.py``.

For images, we make use of an additional wrapper, ``VecTransposeImage``, because PyTorch uses the "channel-first" convention.

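Both pre-processing steps mentioned above can be sketched in a few lines of plain Python (for illustration only; the real code operates on PyTorch tensors):

```python
def one_hot(index, n):
    """One-hot encode a discrete observation with n possible values."""
    vec = [0.0] * n
    vec[index] = 1.0
    return vec

def transpose_image(image):
    """Convert an H x W x C nested list to C x H x W ("channel-first")."""
    h, w, c = len(image), len(image[0]), len(image[0][0])
    return [[[image[i][j][k] for j in range(w)] for i in range(h)]
            for k in range(c)]

print(one_hot(2, 4))  # [0.0, 0.0, 1.0, 0.0]
img = [[[1, 2], [3, 4]]]           # shape 1 x 2 x 2 (H x W x C)
print(transpose_image(img))        # [[[1, 3]], [[2, 4]]] (C x H x W)
```
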
Policy Structure
================

When we refer to a "policy" in Stable-Baselines3, this is usually an abuse of language compared to RL terminology.
In SB3, "policy" refers to the class that handles all the networks useful for training,
not only the network used to predict actions (the "learned controller").

For instance, the ``TD3`` policy contains the actor, the critic and the target networks.

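A toy sketch of that "policy = container of networks" idea (plain Python objects stand in for the actual PyTorch modules; the class and attribute names are illustrative):

```python
import copy

class SketchNetwork:
    def __init__(self, name):
        self.name = name

class SketchTD3Policy:
    """Holds every network the algorithm needs, not just the actor."""

    def __init__(self):
        self.actor = SketchNetwork("actor")
        self.critic = SketchNetwork("critic")
        # Target networks are delayed copies used for stable TD targets.
        self.actor_target = copy.deepcopy(self.actor)
        self.critic_target = copy.deepcopy(self.critic)

    def predict(self, observation):
        # Only the actor (the "learned controller") is used for prediction.
        return f"action from {self.actor.name}"

policy = SketchTD3Policy()
print(policy.predict(None))  # action from actor
```
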
Probability distributions
=========================

When needed, the policies handle the different probability distributions.
All distributions are located in ``common/distributions.py`` and follow the same interface.
Each distribution corresponds to a type of action space (e.g. ``Categorical`` is the one used for discrete actions).
For continuous actions, we can use multiple distributions (``DiagGaussian``, ``SquashedGaussian`` or ``StateDependentDistribution``).

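The shared interface boils down to sampling actions and computing their log-probabilities. Here is a minimal sketch of a categorical distribution for discrete actions (illustrative only, not SB3's ``Categorical``):

```python
import math
import random

class SketchCategorical:
    """Tiny categorical distribution over n discrete actions."""

    def __init__(self, probs):
        self.probs = probs

    def sample(self):
        # Draw one action index according to the given probabilities.
        return random.choices(range(len(self.probs)), weights=self.probs)[0]

    def log_prob(self, action):
        # Log-probability of a given action, used for the policy gradient.
        return math.log(self.probs[action])

dist = SketchCategorical([0.1, 0.7, 0.2])
action = dist.sample()
print(0 <= action < 3)             # True
print(round(dist.log_prob(1), 4))  # -0.3567, i.e. log(0.7)
```
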
State-Dependent Exploration
===========================

State-Dependent Exploration (SDE) is a type of exploration that allows using RL directly on real robots,
and it was the starting point for the Stable-Baselines3 library.
I (@araffin) will soon publish a paper about a generalized version of SDE (the one implemented in SB3).

Misc
====

The rest of ``common/`` is composed of helpers (e.g. evaluation helpers) and basic components (like the callbacks).
The ``type_aliases.py`` file contains common type-hint aliases like ``GymStepReturn``.

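As an illustration of such an alias, a ``GymStepReturn``-style alias names the ``(obs, reward, done, info)`` tuple returned by ``env.step()``. The exact definition below is an assumption for illustration, not a copy of ``type_aliases.py``:

```python
from typing import Any, Dict, Tuple, Union

# Hypothetical aliases in the spirit of type_aliases.py:
GymObs = Union[Tuple, Dict[str, Any], float, int]
GymStepReturn = Tuple[GymObs, float, bool, Dict]

def step_stub(action) -> GymStepReturn:
    # A stubbed env.step() whose return matches the alias.
    return (0.0, 1.0, False, {})

obs, reward, done, info = step_stub(0)
print(reward, done)  # 1.0 False
```
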
Et voilà?
=========

After reading this guide and the mentioned files, you should now be able to understand the design logic behind the library ;)

The guide is also added to the table of contents in ``docs/index.rst``:

.. code-block:: rst

    guide/rl_zoo
    guide/migration
    guide/checking_nan
    guide/developer