Documentation update (#1732)
* Update RL Tips

* Fix grammar

* Update SBX doc

* Fix various typos and grammar mistakes
araffin committed Nov 3, 2023
1 parent 69afefc commit 294f2b4
Showing 17 changed files with 76 additions and 55 deletions.
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -8,7 +8,7 @@ into two categories:
2. You want to implement a feature or bug-fix for an outstanding issue
- Look at the outstanding issues here: https://github.com/DLR-RM/stable-baselines3/issues
- Pick an issue or feature and comment on the task to indicate that you want to work on it.
- - If you need more context on a particular issue, please ask and we shall provide.
+ - If you need more context on a particular issue, please ask, and we shall provide.

Once you finish implementing a feature or bug-fix, please send a Pull Request to
https://github.com/DLR-RM/stable-baselines3
@@ -61,7 +61,7 @@ def my_function(arg1: type1, arg2: type2) -> returntype:

## Pull Request (PR)

- Before proposing a PR, please open an issue, where the feature will be discussed. This prevent from duplicated PR to be proposed and also ease the code review process.
+ Before proposing a PR, please open an issue where the feature will be discussed. This prevents duplicate PRs from being proposed and also eases the code review process.

Each PR needs to be reviewed and accepted by at least one of the maintainers (@hill-a, @araffin, @ernestum, @AdamGleave, @Miffyli or @qgallouedec).
A PR must pass the Continuous Integration tests to be merged with the master branch.
2 changes: 1 addition & 1 deletion README.md
@@ -109,7 +109,7 @@ pip install stable-baselines3[extra]
```
**Note:** Some shells such as Zsh require quotation marks around brackets, i.e. `pip install 'stable-baselines3[extra]'` ([More Info](https://stackoverflow.com/a/30539963)).

- This includes an optional dependencies like Tensorboard, OpenCV or `atari-py` to train on atari games. If you do not need those, you can use:
+ This includes optional dependencies like Tensorboard, OpenCV or `ale-py` to train on Atari games. If you do not need those, you can use:
```sh
pip install stable-baselines3
```
2 changes: 1 addition & 1 deletion docs/common/distributions.rst
@@ -16,7 +16,7 @@ The policy networks output parameters for the distributions (named ``flat`` in t
Actions are then sampled from those distributions.

For instance, in the case of discrete actions, the policy network outputs the probability
- of taking each action. The ``CategoricalDistribution`` allows to sample from it,
+ of taking each action. The ``CategoricalDistribution`` allows sampling from it,
computes the entropy and the log probability (``log_prob``), and backpropagates the gradient.

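For illustration, a minimal sketch of driving ``CategoricalDistribution`` by hand (the sizes, the random batch of latent features and the ``action_net`` variable name are illustrative, not taken from the documentation):

```python
import torch as th

from stable_baselines3.common.distributions import CategoricalDistribution

# Illustrative sizes: 5 discrete actions, 64-dimensional latent features
dist = CategoricalDistribution(action_dim=5)
# Linear layer mapping the policy's latent features to the action logits ("flat" parameters)
action_net = dist.proba_distribution_net(latent_dim=64)

latent = th.randn(8, 64)  # stand-in for a batch of policy-network outputs
logits = action_net(latent)

dist.proba_distribution(logits)    # build the distribution from the logits
actions = dist.sample()
log_prob = dist.log_prob(actions)  # differentiable, so gradients can be backpropagated
entropy = dist.entropy()
```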
In the case of continuous actions, a Gaussian distribution is used. The policy network outputs
16 changes: 8 additions & 8 deletions docs/guide/callbacks.rst
@@ -30,7 +30,7 @@ You can find two examples of custom callbacks in the documentation: one for savi
:param verbose: Verbosity level: 0 for no output, 1 for info messages, 2 for debug messages
"""
def __init__(self, verbose=0):
- super(CustomCallback, self).__init__(verbose)
+ super().__init__(verbose)
# Those variables will be accessible in the callback
# (they are defined in the base class)
# The RL model
@@ -70,7 +70,7 @@ You can find two examples of custom callbacks in the documentation: one for savi
For a child callback (of an `EventCallback`), this will be called
when the event is triggered.
- :return: (bool) If the callback returns False, training is aborted early.
+ :return: If the callback returns False, training is aborted early.
"""
return True
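To make the skeleton above concrete, a complete toy callback could look like this (the class name, print interval, environment and training budget are illustrative):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback


class PrintEveryThousandSteps(BaseCallback):
    """Toy callback: print progress every 1000 steps."""

    def __init__(self, verbose: int = 0):
        super().__init__(verbose)

    def _on_step(self) -> bool:
        # self.num_timesteps is kept up to date by the base class
        if self.num_timesteps % 1000 == 0:
            print(f"{self.num_timesteps} steps so far")
        # Returning False here would abort training early
        return True


model = PPO("MlpPolicy", "CartPole-v1")
model.learn(5_000, callback=PrintEveryThousandSteps())
```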
@@ -110,7 +110,7 @@ A child callback is for instance :ref:`StopTrainingOnRewardThreshold <StopTraini

.. note::

- We recommend to take a look at the source code of :ref:`EvalCallback` and :ref:`StopTrainingOnRewardThreshold <StopTrainingCallback>` to have a better overview of what can be achieved with this kind of callbacks.
+ We recommend taking a look at the source code of :ref:`EvalCallback` and :ref:`StopTrainingOnRewardThreshold <StopTrainingCallback>` to have a better overview of what can be achieved with this kind of callback.


.. code-block:: python
@@ -159,8 +159,8 @@ corresponding statistics using ``save_vecnormalize`` (``False`` by default).

.. warning::

- When using multiple environments, each call to ``env.step()`` will effectively correspond to ``n_envs`` steps.
- If you want the ``save_freq`` to be similar when using different number of environments,
+ When using multiple environments, each call to ``env.step()`` will effectively correspond to ``n_envs`` steps.
+ If you want the ``save_freq`` to be similar when using a different number of environments,
you need to account for it using ``save_freq = max(save_freq // n_envs, 1)``.
The same goes for the other callbacks.

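A sketch of what this warning means in practice (the target frequency, paths and environment are illustrative):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CheckpointCallback
from stable_baselines3.common.env_util import make_vec_env

n_envs = 4
# Aim for a checkpoint roughly every 10_000 total environment steps,
# so divide by the number of parallel environments
save_freq = max(10_000 // n_envs, 1)

checkpoint_callback = CheckpointCallback(
    save_freq=save_freq,
    save_path="./logs/",
    name_prefix="rl_model",
)

vec_env = make_vec_env("CartPole-v1", n_envs=n_envs)
model = PPO("MlpPolicy", vec_env)
model.learn(50_000, callback=checkpoint_callback)
```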
@@ -189,7 +189,7 @@ EvalCallback
^^^^^^^^^^^^

Evaluate periodically the performance of an agent, using a separate test environment.
- It will save the best model if ``best_model_save_path`` folder is specified and save the evaluations results in a numpy archive (``evaluations.npz``) if ``log_path`` folder is specified.
+ It will save the best model if the ``best_model_save_path`` folder is specified and save the evaluation results in a NumPy archive (``evaluations.npz``) if the ``log_path`` folder is specified.


.. note::
@@ -230,7 +230,7 @@ This callback is integrated inside SB3 via the ``progress_bar`` argument of the

.. note::

- This callback requires ``tqdm`` and ``rich`` packages to be installed. This is done automatically when using ``pip install stable-baselines3[extra]``
+ ``ProgressBarCallback`` requires the ``tqdm`` and ``rich`` packages to be installed. This is done automatically when using ``pip install stable-baselines3[extra]``


.. code-block:: python
@@ -367,7 +367,7 @@ StopTrainingOnNoModelImprovement
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Stop the training if there is no new best model (no new best mean reward) after more than a specific number of consecutive evaluations.
- The idea is to save time in experiments when you know that the learning curves are somehow well behaved and, therefore,
+ The idea is to save time in experiments when you know that the learning curves are somehow well-behaved and, therefore,
after many evaluations without improvement the learning has probably stabilized.
It must be used with the :ref:`EvalCallback` and use the event triggered after every evaluation.

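A sketch of plugging it into the ``EvalCallback`` via ``callback_after_eval`` (the environment, evaluation frequency and patience values are illustrative):

```python
import gymnasium as gym

from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnNoModelImprovement

eval_env = gym.make("Pendulum-v1")

# Stop if there is no new best model after 3 consecutive evaluations,
# but only start counting after 5 evaluations
stop_callback = StopTrainingOnNoModelImprovement(max_no_improvement_evals=3, min_evals=5, verbose=1)
eval_callback = EvalCallback(eval_env, eval_freq=1_000, callback_after_eval=stop_callback, verbose=1)

model = SAC("MlpPolicy", "Pendulum-v1", verbose=1)
model.learn(100_000, callback=eval_callback)
```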
2 changes: 1 addition & 1 deletion docs/guide/custom_env.rst
@@ -3,7 +3,7 @@
Using Custom Environments
==========================

To use the RL baselines with custom environments, they just need to follow the *gymnasium* `interface <https://gymnasium.farama.org/tutorials/gymnasium_basics/environment_creation/#sphx-glr-tutorials-gymnasium-basics-environment-creation-py>`_.
That is to say, your environment must implement the following methods (and inherit from the Gym class):

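For orientation, a minimal gymnasium-style environment might look like the sketch below (the ``GoLeftEnv`` name and the toy grid-world logic are illustrative):

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class GoLeftEnv(gym.Env):
    """Toy 1D grid where the agent must reach the left edge."""

    def __init__(self, grid_size: int = 10):
        super().__init__()
        self.grid_size = grid_size
        self.agent_pos = grid_size - 1
        self.action_space = spaces.Discrete(2)  # 0: go left, 1: go right
        self.observation_space = spaces.Box(low=0, high=grid_size, shape=(1,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.agent_pos = self.grid_size - 1
        return np.array([self.agent_pos], dtype=np.float32), {}

    def step(self, action):
        self.agent_pos += -1 if action == 0 else 1
        self.agent_pos = int(np.clip(self.agent_pos, 0, self.grid_size - 1))
        terminated = self.agent_pos == 0
        reward = 1.0 if terminated else 0.0
        return np.array([self.agent_pos], dtype=np.float32), reward, terminated, False, {}
```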

2 changes: 1 addition & 1 deletion docs/guide/custom_policy.rst
@@ -262,7 +262,7 @@ Custom Networks
If you need a network architecture that is different for the actor and the critic when using ``PPO``, ``A2C`` or ``TRPO``,
you can pass a dictionary of the following structure: ``dict(pi=[<actor network architecture>], vf=[<critic network architecture>])``.

- For example, if you want a different architecture for the actor (aka ``pi``) and the critic ( value-function aka ``vf``) networks,
+ For example, if you want a different architecture for the actor (aka ``pi``) and the critic (value-function aka ``vf``) networks,
then you can specify ``net_arch=dict(pi=[32, 32], vf=[64, 64])``.

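For context, passing such an architecture to an algorithm looks roughly like this (the environment and training budget are illustrative):

```python
from stable_baselines3 import PPO

# Separate network sizes for the actor (pi) and the critic (vf)
policy_kwargs = dict(net_arch=dict(pi=[32, 32], vf=[64, 64]))

model = PPO("MlpPolicy", "CartPole-v1", policy_kwargs=policy_kwargs, verbose=1)
model.learn(10_000)
```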
Otherwise, to have actor and critic that share the same network architecture,
10 changes: 5 additions & 5 deletions docs/guide/examples.rst
@@ -5,7 +5,7 @@ Examples

.. note::

These examples are only to demonstrate the use of the library and its functions, and the trained agents may not solve the environments. Optimized hyperparameters can be found in the RL Zoo `repository <https://github.com/DLR-RM/rl-baselines3-zoo>`_.


Try it online with Colab Notebooks!
@@ -191,8 +191,8 @@ Dict Observations

You can use environments with dictionary observation spaces. This is useful in the case where one can't directly
concatenate observations such as an image from a camera combined with a vector of servo sensor data (e.g., rotation angles).
- Stable Baselines3 provides ``SimpleMultiObsEnv`` as an example of this kind of of setting.
- The environment is a simple grid world but the observations for each cell come in the form of dictionaries.
+ Stable Baselines3 provides ``SimpleMultiObsEnv`` as an example of this kind of setting.
+ The environment is a simple grid world, but the observations for each cell come in the form of dictionaries.
These dictionaries are randomly initialized on the creation of the environment and contain a vector observation and an image observation.

.. code-block:: python
@@ -217,7 +217,7 @@ Callbacks: Monitoring Training

You can define a custom callback function that will be called inside the agent.
This could be useful when you want to monitor training, for instance display live
- learning curves in Tensorboard (or in Visdom) or save the best agent.
+ learning curves in Tensorboard or save the best agent.
If your callback returns False, training is aborted early.

.. image:: ../_static/img/colab-badge.svg
Expand Down Expand Up @@ -251,7 +251,7 @@ If your callback returns False, training is aborted early.
:param verbose: Verbosity level: 0 for no output, 1 for info messages, 2 for debug messages
"""
def __init__(self, check_freq: int, log_dir: str, verbose: int = 1):
- super(SaveOnBestTrainingRewardCallback, self).__init__(verbose)
+ super().__init__(verbose)
self.check_freq = check_freq
self.log_dir = log_dir
self.save_path = os.path.join(log_dir, "best_model")
4 changes: 2 additions & 2 deletions docs/guide/export.rst
@@ -194,14 +194,14 @@ Full example code: https://github.com/chunky/sb3_to_coral

Google created a chip called the "Coral" for deploying AI to the
edge. It's available in a variety of form factors, including USB (using
- the Coral on a Rasbperry pi, with a SB3-developed model, was the original
+ the Coral on a Raspberry Pi, with an SB3-developed model, was the original
motivation for the code example above).

The Coral chip is fast, with very low power consumption, but only has limited
on-device training abilities. More information is on the webpage here:
https://coral.ai.

- To deploy to a Coral, one must work via TFLite, and quantise the
+ To deploy to a Coral, one must work via TFLite, and quantize the
network to reflect the Coral's capabilities. The full chain to go from
SB3 to Coral is: SB3 (Torch) => ONNX => TensorFlow => TFLite => Coral.

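Below is a rough sketch of the first link in that chain (SB3 to ONNX). The wrapper class, opset version and file name are illustrative; the SB3 export documentation contains a fuller example:

```python
import torch as th

from stable_baselines3 import PPO


class OnnxablePolicy(th.nn.Module):
    """Thin wrapper so the policy can be traced by torch.onnx.export."""

    def __init__(self, policy):
        super().__init__()
        self.policy = policy

    def forward(self, observation: th.Tensor):
        # ActorCriticPolicy returns (actions, values, log_prob)
        return self.policy(observation, deterministic=True)


model = PPO("MlpPolicy", "Pendulum-v1")
onnx_policy = OnnxablePolicy(model.policy)

dummy_input = th.randn(1, *model.observation_space.shape)
th.onnx.export(
    onnx_policy,
    dummy_input,
    "ppo_pendulum.onnx",
    opset_version=17,
    input_names=["observation"],
)
```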
8 changes: 4 additions & 4 deletions docs/guide/install.rst
@@ -9,10 +9,10 @@ Prerequisites

Stable-Baselines3 requires python 3.8+ and PyTorch >= 1.13

- Windows 10
- ~~~~~~~~~~
+ Windows
+ ~~~~~~~

- We recommend using `Anaconda <https://conda.io/docs/user-guide/install/windows.html>`_ for Windows users for easier installation of Python packages and required libraries. You need an environment with Python version 3.6 or above.
+ We recommend using `Anaconda <https://conda.io/docs/user-guide/install/windows.html>`_ for Windows users for easier installation of Python packages and required libraries. You need an environment with Python version 3.8 or above.

For a quick start you can move straight to installing Stable-Baselines3 in the next step.

@@ -34,7 +34,7 @@ To install Stable Baselines3 with pip, execute:
Some shells such as Zsh require quotation marks around brackets, i.e. ``pip install 'stable-baselines3[extra]'`` `More information <https://stackoverflow.com/a/30539963>`_.


- This includes an optional dependencies like Tensorboard, OpenCV or ``ale-py`` to train on atari games. If you do not need those, you can use:
+ This includes optional dependencies like Tensorboard, OpenCV or ``ale-py`` to train on Atari games. If you do not need those, you can use:

.. code-block:: bash
6 changes: 3 additions & 3 deletions docs/guide/migration.rst
@@ -15,7 +15,7 @@ Overview

Overall Stable-Baselines3 (SB3) keeps the high-level API of Stable-Baselines (SB2).
Most of the changes are to ensure more consistency and are internal ones.
- Because of the backend change, from Tensorflow to PyTorch, the internal code is much much readable and easy to debug
+ Because of the backend change, from Tensorflow to PyTorch, the internal code is much more readable and easier to debug
at the cost of some speed (dynamic graph vs static graph, see `Issue #90 <https://github.com/DLR-RM/stable-baselines3/issues/90>`_).
However, the algorithms were extensively benchmarked on Atari games and continuous control PyBullet envs
(see `Issue #48 <https://github.com/DLR-RM/stable-baselines3/issues/48>`_ and `Issue #49 <https://github.com/DLR-RM/stable-baselines3/issues/49>`_)
@@ -203,8 +203,8 @@ New Features (SB3 vs SB2)
- Much cleaner and consistent base code (and no more warnings =D!) and static type checks
- Independent saving/loading/predict for policies
- A2C now supports Generalized Advantage Estimation (GAE) and advantage normalization (both are deactivated by default)
- - Generalized State-Dependent Exploration (gSDE) exploration is available for A2C/PPO/SAC. It allows to use RL directly on real robots (cf https://arxiv.org/abs/2005.05719)
- - Better saving/loading: optimizers are now included in the saved parameters and there is two new methods ``save_replay_buffer`` and ``load_replay_buffer`` for the replay buffer when using off-policy algorithms (DQN/DDPG/SAC/TD3)
+ - Generalized State-Dependent Exploration (gSDE) exploration is available for A2C/PPO/SAC. It allows using RL directly on real robots (cf https://arxiv.org/abs/2005.05719)
+ - Better saving/loading: optimizers are now included in the saved parameters and there are two new methods ``save_replay_buffer`` and ``load_replay_buffer`` for the replay buffer when using off-policy algorithms (DQN/DDPG/SAC/TD3)
- You can pass ``optimizer_class`` and ``optimizer_kwargs`` to ``policy_kwargs`` in order to easily
customize optimizers
- Seeding now works properly to have deterministic results
1 change: 1 addition & 0 deletions docs/guide/rl.rst
@@ -15,4 +15,5 @@ However, if you want to learn about RL, there are several good resources to get
- `Lilian Weng's blog <https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html>`_
- `Berkeley's Deep RL Bootcamp <https://sites.google.com/view/deep-rl-bootcamp/lectures>`_
- `Berkeley's Deep Reinforcement Learning course <http://rail.eecs.berkeley.edu/deeprlcourse/>`_
+ - `DQN tutorial <https://github.com/araffin/rlss23-dqn-tutorial>`_
- `More resources <https://github.com/dennybritz/reinforcement-learning>`_
