 Copyright © Sorbonne University.

 This source code is licensed under the MIT license found in the LICENSE file
 in the root directory of this source tree.

# Outlook

In this notebook we code the Soft Actor-Critic (SAC) algorithm using BBRL.
This algorithm is described in [this
paper](http://proceedings.mlr.press/v80/haarnoja18b/haarnoja18b.pdf) and [this
paper](https://arxiv.org/pdf/1812.05905.pdf).

To understand this code, you need to know more about [the BBRL interaction
model](https://github.com/osigaud/bbrl/blob/master/docs/overview.md). Then you
should run [a didactical
example](https://github.com/osigaud/bbrl/blob/master/docs/notebooks/03-multi_env_autoreset.student.ipynb)
to see how agents interact in BBRL when `autoreset=True`.

The algorithm is explained in [this
video](https://www.youtube.com/watch?v=U20F-MvThjM) and you can also read [the
corresponding slides](http://pages.isir.upmc.fr/~sigaud/teach/ps/12_sac.pdf).


# Setting up the environment
We first need to set up the environment by installing the necessary Python and
system libraries.

In [24]:
try:
    from easypip import easyimport
except ModuleNotFoundError:
    from subprocess import run

    assert (
        run(["pip", "install", "easypip"]).returncode == 0
    ), "Could not install easypip"
    from easypip import easyimport

easyimport("swig")
easyimport("bbrl_utils>=0.5").setup()

import copy
import os

import torch
import torch.nn as nn
from bbrl.workspace import Workspace
from bbrl.agents import Agent, Agents, TemporalAgent, KWAgentWrapper
from bbrl_utils.algorithms import EpochBasedAlgo
from bbrl_utils.nn import build_mlp, setup_optimizer, soft_update_params
from bbrl_utils.notebook import setup_tensorboard
from omegaconf import OmegaConf
from torch.distributions import (
    Normal,
    Independent,
    TransformedDistribution,
    TanhTransform,
)
import bbrl_gymnasium  # noqa: F401

# Learning environment

## Configuration

The learning environment is controlled by a configuration that defines a few
important things, as described in the example below. This configuration can
hold as much extra information as you need; the example below is the minimal
one.

```python
params = {
    # This defines a path for logs and saved models
    "base_dir": "${gym_env.env_name}/myalgo_${current_time:}",

    # The Gymnasium environment
    "gym_env": {
        "env_name": "CartPoleContinuous-v1",
    },

    # Algorithm
    "algorithm": {
        # Seed used for the random number generator
        "seed": 1023,

        # Number of parallel training environments
        "n_envs": 8,
                
        # Minimum number of steps between two evaluations
        "eval_interval": 500,
        
        # Number of parallel evaluation environments
        "nb_evals": 10,

        # Number of epochs (loops)
        "max_epochs": 40000,

        # Number of steps (partial iteration)
        "n_steps": 100,
        
    },
}

# Creates the configuration object, i.e. cfg.algorithm.nb_evals is 10
cfg = OmegaConf.create(params)
```

## The RL algorithm

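In this notebook, the RL algorithm is based on `EpisodicAlgo`, which defines
the algorithm environment when using episodes. To use such an environment, we
just need to subclass `EpisodicAlgo` and to define two things, namely the
`train_policy` and the `eval_policy`. Both are BBRL agents that, given the
environment state, select the action to perform. Note that the SAC
implementation later in this notebook subclasses `EpochBasedAlgo`, which it
uses in the same way, and iterates over replay buffers with
`iter_replay_buffers()`.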

```py
class MyAlgo(EpisodicAlgo):
    def __init__(self, cfg):
        super().__init__(cfg)

        # Define the train and evaluation policies
        # (the agents compute the workspace `action` variable)
        self.train_policy = MyPolicyAgent(...)
        self.eval_policy = MyEvalAgent(...)

algo = MyAlgo(cfg)
```

The `EpisodicAlgo` defines useful objects:

- `algo.cfg` is the configuration
- `algo.nb_steps` (integer) is the number of steps since the training began
- `algo.logger` is a logger that can be used to collect statistics during training:
    - `algo.logger.add_log("critic_loss", critic_loss, algo.nb_steps)` registers the `critic_loss` value on tensorboard
- `algo.evaluate()` evaluates the current `eval_policy` if needed, and keeps the
agent if it was the best so far (average cumulated reward);
- `algo.visualize_best()` runs the best agent on one episode, and displays the video



Besides, it also defines `iter_partial_episodes`, which iterates over partial
episodes (`n_steps` steps from each of the `n_envs` environments):

```python3
  # with partial episodes
  for workspace in algo.iter_partial_episodes():
      # workspace is a workspace containing 50 transitions
      # (with autoreset)
      ...
```

## The SquashedGaussianActor

SAC works better with a Squashed Gaussian actor, which transforms a Gaussian
distribution with a $\tanh$. The computation of the gradient uses the
reparametrization trick. Note that our attempts to use a
`TunableVarianceContinuousActor`, as we did for instance in the notebook about
PPO, completely failed. Such failure is also documented in the [OpenAI Spinning
Up documentation page about
SAC](https://spinningup.openai.com/en/latest/algorithms/sac.html).
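
To make the squashing concrete, here is a minimal scalar sketch of the change of variables behind the log-probability of a $\tanh$-squashed Gaussian: with $a = \tanh(u)$ and $u \sim \mathcal{N}(\mu, \sigma^2)$, we have $\log \pi(a) = \log \mathcal{N}(u; \mu, \sigma^2) - \log(1 - \tanh^2(u))$. It only uses the standard library, and the function names are illustrative (they are not part of BBRL or PyTorch, where `TransformedDistribution` combined with `TanhTransform` plays this role).

```python
import math


def normal_logpdf(u: float, mean: float, std: float) -> float:
    """Log-density of a 1D Gaussian N(mean, std^2) evaluated at u."""
    z = (u - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def log_abs_det_tanh(u: float) -> float:
    """log |d tanh(u) / du| = log(1 - tanh(u)^2), in a numerically stable form."""
    # 1 - tanh(u)^2 = 4 / (e^u + e^-u)^2, hence
    # log(1 - tanh(u)^2) = 2 * (log 2 - |u| - log(1 + e^(-2|u|)))
    abs_u = abs(u)
    return 2.0 * (math.log(2.0) - abs_u - math.log1p(math.exp(-2.0 * abs_u)))


def squashed_gaussian_logprob(u: float, mean: float, std: float):
    """Returns the squashed action a = tanh(u) and log pi(a)."""
    action = math.tanh(u)
    return action, normal_logpdf(u, mean, std) - log_abs_det_tanh(u)


action, logp = squashed_gaussian_logprob(u=0.3, mean=0.0, std=1.0)
print(f"a = {action:.4f}, log pi(a) = {logp:.4f}")
```

The correction term grows as $|u|$ increases, which is why actions close to the bounds $\pm 1$ can get large log-probabilities and why a stable formulation matters.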

The code of the `SquashedGaussianActor` actor is below.

The fact that we use the reparametrization trick is hidden inside the code of
this distribution. You can read more about the reparametrization trick at
the following URL:
- [Goker Erdogan's
  blog](http://gokererdogan.github.io/2016/07/01/reparameterization-trick/)
  which shows the variance of different tricks to compute gradient of
  expectations for $\mathbb{E}(x^2)$ where $x \sim \mathcal{N}(\theta, 1)$
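
As a small illustration of the example used in that blog post, the sketch below estimates $\frac{d}{d\theta}\mathbb{E}(x^2)$ for $x \sim \mathcal{N}(\theta, 1)$, whose true value is $2\theta$, with two Monte Carlo estimators: the reparametrization (pathwise) estimator and the score-function estimator. The sample size, seed and variable names are arbitrary choices made for this sketch.

```python
import random

random.seed(0)
theta = 2.0
n_samples = 50_000

reparam_grads = []
score_grads = []
for _ in range(n_samples):
    eps = random.gauss(0.0, 1.0)
    x = theta + eps  # reparametrized sample: x = theta + eps, eps ~ N(0, 1)
    # Pathwise estimator: d(x^2)/d(theta) = 2 * x, since dx/d(theta) = 1
    reparam_grads.append(2.0 * x)
    # Score-function estimator: x^2 * d log N(x; theta, 1)/d(theta) = x^2 * (x - theta)
    score_grads.append(x * x * (x - theta))


def mean_and_variance(values):
    """Sample mean and (population) variance of a list of floats."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


reparam_mean, reparam_var = mean_and_variance(reparam_grads)
score_mean, score_var = mean_and_variance(score_grads)
print("true gradient:", 2.0 * theta)
print("reparametrization: mean", reparam_mean, "variance", reparam_var)
print("score function:    mean", score_mean, "variance", score_var)
```

Both estimators are unbiased, but the per-sample variance of the pathwise estimator is much smaller here, which is the main motivation for sampling actions with `rsample()`.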

In [25]:
class SquashedGaussianActor(Agent):  # SAC policy network, defined as a BBRL Agent
    def __init__(self, state_dim, hidden_layers, action_dim, min_std=1e-4):
        """Creates a new Squashed Gaussian actor

        :param state_dim: The dimension of the state space
        :param hidden_layers: Hidden layer sizes
        :param action_dim: The dimension of the action space
        :param min_std: The minimum standard deviation, defaults to 1e-4
        """
        super().__init__()
        self.min_std = min_std  # Lower bound on the std, avoids numerical issues
        backbone_dim = [state_dim] + list(hidden_layers)  # Layer sizes of the MLP
        self.layers = build_mlp(backbone_dim, activation=nn.ReLU())  # MLP with ReLU activations
        self.backbone = nn.Sequential(*self.layers)  # Chain the layers into one module
        self.last_mean_layer = nn.Linear(hidden_layers[-1], action_dim)  # Outputs the action mean
        self.last_std_layer = nn.Linear(hidden_layers[-1], action_dim)  # Outputs the unconstrained std
        self.softplus = nn.Softplus()  # Ensures that the std is positive

        # cache_size avoids numerical infinites or NaNs when
        # computing log probabilities
        self.tanh_transform = TanhTransform(cache_size=1)  # Squashes actions into [-1, 1]

    def normal_dist(self, obs: torch.Tensor):
        """Compute normal distribution given observation(s)"""

        backbone_output = self.backbone(obs)  # Features computed from the observation
        mean = self.last_mean_layer(backbone_output)  # Mean of the action distribution
        std_out = self.last_std_layer(backbone_output)  # Unconstrained std output
        std = self.softplus(std_out) + self.min_std  # Positive std, at least min_std
        # Independent ensures that we have a multivariate
        # Gaussian with a diagonal covariance matrix (given as
        # a vector `std`)
        return Independent(Normal(mean, std), 1)

    def forward(self, t, stochastic=True):
        """Computes the action a_t and its log-probability log p(a_t | s_t)

        :param stochastic: True when sampling
        """
        normal_dist = self.normal_dist(self.get(("env/env_obs", t)))
        # Gaussian squashed by tanh: the distribution over actions
        action_dist = TransformedDistribution(normal_dist, [self.tanh_transform])
        if stochastic:
            # Uses the re-parametrization trick
            action = action_dist.rsample()
        else:
            # Directly uses the mode of the distribution
            action = self.tanh_transform(normal_dist.mode)

        log_prob = action_dist.log_prob(action)
        # This line allows to deepcopy the actor...
        self.tanh_transform._cached_x_y = [None, None]
        self.set(("action", t), action)  # Store the action in the workspace
        self.set(("action_logprobs", t), log_prob)  # Store its log-probability


### Critic agent Q(s,a)

As critics and target critics, SAC uses several instances of the
`ContinuousQAgent` class, as DDPG and TD3 do. See the [DDPG
notebook](http://master-dac.isir.upmc.fr/rld/rl/04-ddpg-td3.student.ipynb) for
details.

In [26]:
class ContinuousQAgent(Agent):
    def __init__(self, state_dim: int, hidden_layers: list[int], action_dim: int):
        """Creates a new critic agent Q(s, a)

        :param state_dim: The dimension of the state space (observations)
        :param hidden_layers: The list of hidden layer sizes of the neural network
        :param action_dim: The dimension of the action space
        """
        super().__init__()
        self.is_q_function = True  # Marks this agent as a Q-function
        # MLP with input size state_dim + action_dim and a single output (the Q-value)
        self.model = build_mlp(
            [state_dim + action_dim] + list(hidden_layers) + [1], activation=nn.ReLU()
        )

    def forward(self, t):
        # Observation (state) at time step t
        obs = self.get(("env/env_obs", t))
        # Action at time step t
        action = self.get(("action", t))
        # Concatenate state and action to form the model input
        obs_act = torch.cat((obs, action), dim=1)
        # Compute the Q-value; squeeze(-1) removes the trailing dimension of size 1
        q_value = self.model(obs_act).squeeze(-1)
        # Store the Q-value in the workspace for later use
        self.set((f"{self.prefix}q_value", t), q_value)


### Building the complete training and evaluation agents

In the code below we create the Squashed Gaussian actor, two critics and the
corresponding target critics. Beforehand, we checked that the environment
takes continuous actions (otherwise we would need a different code).
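
The target critics start as copies of the critics and, during training, track them through a soft (Polyak) update controlled by the `tau_target` parameter. Below is a minimal sketch of such an update on plain Python lists, assuming the common convention `target <- tau * source + (1 - tau) * target`. Whether `soft_update_params` from `bbrl_utils` follows exactly this convention is an assumption of this sketch, and `soft_update` is an illustrative name.

```python
def soft_update(source: list[float], target: list[float], tau: float) -> list[float]:
    """Moves each target parameter a fraction tau of the way toward the source."""
    return [tau * s + (1.0 - tau) * t for s, t in zip(source, target)]


critic = [1.0, -2.0, 0.5]        # current critic parameters (toy values)
target_critic = [0.0, 0.0, 0.0]  # target critic parameters (toy values)

# After k updates with a fixed source, the remaining gap is (1 - tau)^k times the initial gap
for _ in range(3):
    target_critic = soft_update(critic, target_critic, tau=0.05)

print(target_critic)
```

A small `tau` makes the targets change slowly, which stabilizes the bootstrapped critic targets at the price of a slower propagation of new value estimates.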

In [27]:
# Create the SAC algorithm environment
class SACAlgo(EpochBasedAlgo):
    def __init__(self, cfg):
        super().__init__(cfg)

        # Get the sizes of the observation and action spaces
        obs_size, act_size = self.train_env.get_obs_and_actions_sizes()
        # This code only handles continuous action spaces
        assert (
            self.train_env.is_continuous_action()
        ), "SAC code dedicated to continuous actions"

        # Create the actor (policy network)
        self.actor = SquashedGaussianActor(
            obs_size, cfg.algorithm.architecture.actor_hidden_size, act_size
        )

        # Create the first critic, which estimates Q-values
        self.critic_1 = ContinuousQAgent(
            obs_size,  # Size of the observation space
            cfg.algorithm.architecture.critic_hidden_size,  # Hidden layer sizes of the critic
            act_size,  # Size of the action space
        ).with_prefix("critic-1/")  # The prefix distinguishes the two critics

        # The first target critic is a deep copy of critic_1
        self.target_critic_1 = copy.deepcopy(self.critic_1).with_prefix(
            "target-critic-1/"
        )

        # Create the second critic (SAC uses two Q-networks)
        self.critic_2 = ContinuousQAgent(
            obs_size,
            cfg.algorithm.architecture.critic_hidden_size,
            act_size,
        ).with_prefix("critic-2/")

        # The second target critic is a deep copy of critic_2
        self.target_critic_2 = copy.deepcopy(self.critic_2).with_prefix(
            "target-critic-2/"
        )

        # The training policy is the actor itself
        self.train_policy = self.actor
        # The evaluation policy wraps the actor with stochastic=False (deterministic actions)
        self.eval_policy = KWAgentWrapper(self.actor, stochastic=False)


For the entropy coefficient optimizer, the code is as follows. Note the trick
which consists in optimizing the log of the entropy coefficient rather than the
coefficient itself. This trick was taken from the Stable Baselines3
implementation of SAC, which is explained in [this
notebook](https://colab.research.google.com/drive/12LER1_ShWOa_UhOL1nlX-LX_t5KQK9LV?usp=sharing).

Tuning $\alpha$ in SAC is optional. To tune it, the `entropy_mode` parameter
should be set to `auto`. The initial value is given through the
`init_entropy_coef` parameter. For any value other than `auto`, $\alpha$ stays
constant and equal to the `init_entropy_coef`
parameter.
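
The sketch below illustrates, on scalars, why optimizing $\log \alpha$ is convenient: $\alpha = \exp(\log \alpha)$ stays positive whatever the update, and $\alpha$ increases when the policy entropy is below the target and decreases otherwise. It assumes the loss $-\alpha \, (\log \pi + \mathcal{H}_0)$ used in the training loop of this notebook, a target entropy of $-\dim(\mathcal{A})$, and a plain gradient-descent step instead of Adam. The learning rate and the log-probability values are arbitrary.

```python
import math

action_dim = 1
target_entropy = -float(action_dim)  # heuristic target entropy: -dim(A)
log_alpha = math.log(0.2)            # we optimize log(alpha), not alpha
lr = 0.1


def alpha_step(log_alpha: float, mean_logprob: float) -> float:
    """One gradient step on loss = -exp(log_alpha) * (mean_logprob + target_entropy)."""
    # d loss / d log_alpha = -exp(log_alpha) * (mean_logprob + target_entropy)
    grad = -math.exp(log_alpha) * (mean_logprob + target_entropy)
    return log_alpha - lr * grad


# Policy entropy (-mean_logprob = -2) is below the target (-1): alpha increases
alpha_low_entropy = math.exp(alpha_step(log_alpha, mean_logprob=2.0))
# Policy entropy (-mean_logprob = 3) is above the target (-1): alpha decreases
alpha_high_entropy = math.exp(alpha_step(log_alpha, mean_logprob=-3.0))
print(alpha_low_entropy, alpha_high_entropy)
```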

In [28]:
def setup_entropy_optimizers(cfg):
    # Sets up the optimizer of the entropy coefficient from the configuration `cfg`

    if cfg.algorithm.entropy_mode == "auto":
        # Note: we optimize the log of the entropy coefficient, which is
        # slightly different from the paper, as discussed in
        # https://github.com/rail-berkeley/softlearning/issues/37
        # Comment and code taken from the Stable Baselines3 version of SAC

        # Learnable parameter holding the log of the initial entropy coefficient
        log_entropy_coef = nn.Parameter(
            torch.log(torch.ones(1) * cfg.algorithm.init_entropy_coef)
        )

        # Set up the optimizer for the `log_entropy_coef` parameter
        entropy_coef_optimizer = setup_optimizer(
            cfg.entropy_coef_optimizer, log_entropy_coef
        )

        return entropy_coef_optimizer, log_entropy_coef
    else:
        # The entropy coefficient is fixed: nothing to optimize
        return None, None


### Compute the critic loss

With the notations of my slides, the equation corresponding to Eq. (5) and (6)
in [this paper](https://arxiv.org/pdf/1812.05905.pdf) becomes:

$$ loss_{Q_{\boldsymbol{\phi}_i}}({\boldsymbol{\theta}}) = {\mathbb{E}}_{(\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}_{t+1}) \sim
\mathcal{D}}\left[\left( r(\mathbf{s}_t, \mathbf{a}_t) + \gamma {\mathbb{E}}_{\mathbf{a} \sim
\pi_{\boldsymbol{\theta}}(.|\mathbf{s}_{t+1})} \left[
\min_{j\in 1,2} \hat{Q}^{\mathrm{target}}_{\boldsymbol{\phi}_j}(\mathbf{s}_{t+1}, \mathbf{a}) - \alpha
\log{\pi_{\boldsymbol{\theta}}(\mathbf{a}|\mathbf{s}_{t+1})} \right] - \hat{Q}_{\boldsymbol{\phi}_i}(\mathbf{s}_t, \mathbf{a}_t) \right)^2
\right] $$

An important piece of information in the above equation and in the one about
the actor loss below is the index of the expectations. These indexes tell us
where the data should be taken from. In the above equation, one can see that
the outer expectation is over samples taken from the replay buffer, whereas
in the inner expectation we consider actions from the current actor at the
next state $s_{t+1}$.

Thus, to compute the inner expectation, one needs to determine what actions
the current actor would take in the next state of each sample. This is what
the line

`t_actor(rb_workspace, t=1, n_steps=1, stochastic=True)`

does. The parameter `t=1` (instead of 0) ensures that we consider the next
state $s_{t+1}$.

Once we have determined these actions, we can determine their Q-values and
their log probabilities, to compute the inner expectation.

Note that at this stage, we only determine the log probabilities corresponding
to actions taken at the next time step, by contrast with what we do for the
actor in the `compute_actor_loss(...)` function later on.

Finally, once we have computed
$\hat{Q}^{\mathrm{target}}_{\boldsymbol{\phi}_j}(\mathbf{s}_{t+1}, \mathbf{a})$
for both target critics, we take the min (stored in `q_next` in the code
below). By contrast, the Q-values corresponding to the last term of
the equation are taken from the replay buffer: they are computed at the
beginning of the function by applying the Q agents to the replay buffer
*before* changing the action to that of the current actor.

An important remark is that, if the entropy coefficient $\alpha$ corresponding
to the `ent_coef` variable is set to 0, then we retrieve exactly the critic
loss computation function of the TD3 algorithm. As we will see later, this is
also true of the actor loss computation.

This remark proved very useful in debugging the SAC code. We have set
`ent_coef` to 0 and ensured the behavior was strictly the same as the behavior
of TD3.

Note also that we compute the loss for two critics (initialized
independently), and use two target critics (using the minimum of their
predictions as the basis of the target).
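
To make the target of the critic loss concrete, here is a scalar sketch for a single transition. It mirrors the equation above: the minimum over the two target critics, minus the entropy term, discounted and added to the reward, with no bootstrapping on terminated episodes. The function name and all numerical values are illustrative.

```python
def sac_critic_target(reward, q1_next, q2_next, logprob_next, must_bootstrap,
                      ent_coef, discount):
    """Soft Bellman target for one transition (scalars only)."""
    # Clipped double-Q: keep the more pessimistic of the two target critics
    q_next = min(q1_next, q2_next)
    # Soft value of the next state: Q minus the entropy term alpha * log pi
    v_next = q_next - ent_coef * logprob_next
    # No bootstrapping when the episode terminated at the next state
    return reward + discount * v_next * float(must_bootstrap)


target = sac_critic_target(reward=1.0, q1_next=10.0, q2_next=12.0, logprob_next=-1.5,
                           must_bootstrap=True, ent_coef=0.2, discount=0.98)
# With ent_coef = 0, the entropy term vanishes (the TD3-like target mentioned above)
no_entropy = sac_critic_target(1.0, 10.0, 12.0, -1.5, True, ent_coef=0.0, discount=0.98)
# On a terminated transition, the target reduces to the reward
terminal = sac_critic_target(1.0, 10.0, 12.0, -1.5, False, ent_coef=0.2, discount=0.98)
print(target, no_entropy, terminal)
```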

In [29]:
def compute_critic_loss(
    cfg,
    reward: torch.Tensor,
    must_bootstrap: torch.Tensor,
    t_actor: TemporalAgent,
    t_q_agents: TemporalAgent,
    t_target_q_agents: TemporalAgent,
    rb_workspace: Workspace,
    ent_coef: torch.Tensor,
):
    r"""Computes the critic loss for a set of $S$ transition samples

    Args:
        cfg: The experimental configuration
        reward: Tensor (2xS) of rewards
        must_bootstrap: Tensor (2xS) of indicators
        t_actor: The actor agent
        t_q_agents: The critics
        t_target_q_agents: The target of the critics
        rb_workspace: The transition workspace
        ent_coef: The entropy coefficient $\alpha$

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: The two critic losses (scalars)
    """

    # Replay the actor so we get the necessary statistics
    
    # Compute q_values from both critics with the actions present in the buffer:
    # at t, we have Q(s,a) from (s,a) in the RB
    t_q_agents(rb_workspace, t=0, n_steps=1)

    with torch.no_grad():
        # Replay the current actor on the replay buffer to get actions of the current actor
        t_actor(rb_workspace, t=1, n_steps=1)
        action_logprobs_next = rb_workspace["action_logprobs"]

        # Compute target q_values from both target critics: at t+1, we have
        # Q(s_{t+1}, a_{t+1}) from the (s_{t+1}, a_{t+1}) where a_{t+1} has been
        # replaced in the RB with the t_actor line above
        t_target_q_agents(rb_workspace, t=1, n_steps=1)

    q_values_rb_1, q_values_rb_2, post_q_values_1, post_q_values_2 = rb_workspace[
        "critic-1/q_value",
        "critic-2/q_value",
        "target-critic-1/q_value",
        "target-critic-2/q_value"
    ]
    

    # Compute temporal difference

    q_next = torch.minimum(post_q_values_1[1], post_q_values_2[1])
    v_phi = q_next - ent_coef * action_logprobs_next[1]

    target = reward[1] + cfg.algorithm.discount_factor * v_phi * must_bootstrap[1].int()
    critic_loss_1 = nn.functional.mse_loss(q_values_rb_1[0], target)
    critic_loss_2 = nn.functional.mse_loss(q_values_rb_2[0], target)
    

    return critic_loss_1, critic_loss_2

### Compute the actor Loss

With the notations of my slides, the equation of the actor loss corresponding
to Eq. (7) in [this paper](https://arxiv.org/pdf/1812.05905.pdf) becomes:

$$ loss_\pi({\boldsymbol{\theta}}) = {\mathbb{E}}_{\mathbf{s}_t \sim
\mathcal{D}}\left[ {\mathbb{E}}_{\mathbf{a}_t\sim
\pi_{\boldsymbol{\theta}}(.|\mathbf{s}_t)} \left[ \alpha
\log{\pi_{\boldsymbol{\theta}}(\mathbf{a}_t|\mathbf{s}_t)} -
\hat{Q}_{\boldsymbol{\phi}_{i}}(\mathbf{s}_t,
\mathbf{a}_t) \right] \right] $$

Note that [the paper](https://arxiv.org/pdf/1812.05905.pdf) mistakenly writes
$Q_\theta(s_t,s_t)$.

As for the critic loss, we have two expectations, one over the states from the
replay buffer, and one over the actions of the current actor. Thus we need to
apply again the current actor to the content of the replay buffer.

But this time, we consider the current state, thus we parametrize it with
`t=0` and `n_steps=1`. This way, we get the log probabilities and Q-values at
the current step.

A nice thing is that this way, there is no overlap between the log probability
data used to update the critic and the actor, which avoids having to 'retain'
the computation graph so that it can be reused for the actor and the critic.

This small trick is one of the features that makes coding SAC the most
difficult.

Again, once we have computed the Q values over both critics, we take the min
and put it into `current_q_values`.

As for the critic loss, if we set `ent_coef` to 0, we retrieve the actor loss
function of DDPG and TD3, which simply tries to get actions that maximize the
Q values (by minimizing -Q).
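
The following scalar sketch mirrors the actor loss above on a tiny batch: for each state, the loss is $\alpha \log \pi(a|s)$ minus the minimum of the two critics, averaged over the batch. The function name and the values are illustrative. In the real code the actions must be sampled with the reparametrization trick so that gradients flow from the Q-values back to the actor.

```python
def sac_actor_loss(logprobs, q1_values, q2_values, ent_coef):
    """Mean over the batch of alpha * log pi(a|s) - min(Q1, Q2)(s, a)."""
    losses = [
        ent_coef * logp - min(q1, q2)
        for logp, q1, q2 in zip(logprobs, q1_values, q2_values)
    ]
    return sum(losses) / len(losses)


logprobs = [-1.0, -0.5]  # log pi(a_t | s_t) for actions sampled by the current actor
q1 = [3.0, 1.0]          # first critic evaluated on (s_t, a_t)
q2 = [2.5, 1.5]          # second critic evaluated on (s_t, a_t)

loss = sac_actor_loss(logprobs, q1, q2, ent_coef=0.2)
# With ent_coef = 0, the loss is simply the mean of -min(Q1, Q2)
loss_no_entropy = sac_actor_loss(logprobs, q1, q2, ent_coef=0.0)
print(loss, loss_no_entropy)
```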

In [30]:
def compute_actor_loss(
    ent_coef, t_actor: TemporalAgent, t_q_agents: TemporalAgent, rb_workspace: Workspace
):
    r"""
    Actor loss computation
    :param ent_coef: The entropy coefficient $\alpha$
    :param t_actor: The actor agent (temporal agent)
    :param t_q_agents: The critics (as temporal agent)
    :param rb_workspace: The replay buffer (2 time steps, $t$ and $t+1$)
    """

    # Step 1: recompute the action a_t and its log-probability with the current actor
    t_actor(rb_workspace, t=0, n_steps=1, stochastic=True)
    action_logprobs_new = rb_workspace["action_logprobs"]

    # Step 2: compute the Q-values of both critics for (s_t, a_t)
    t_q_agents(rb_workspace, t=0, n_steps=1)
    q_values_1 = rb_workspace["critic-1/q_value"]
    q_values_2 = rb_workspace["critic-2/q_value"]

    # Step 3: take the min of the two Q-values to reduce overestimation bias
    current_q_values = torch.min(q_values_1, q_values_2)

    # Step 4: compute the actor loss

    actor_loss = ent_coef * action_logprobs_new[0] - current_q_values[0]
    
    return actor_loss.mean()

## Main training loop

In [31]:
import numpy as np


def run_sac(sac: SACAlgo):
    cfg = sac.cfg
    logger = sac.logger

    # init_entropy_coef is the initial value of the entropy coef alpha
    ent_coef = cfg.algorithm.init_entropy_coef
    tau = cfg.algorithm.tau_target

    # Creates the temporal actors
    t_actor = TemporalAgent(sac.train_policy)
    q_agents = TemporalAgent(Agents(sac.critic_1, sac.critic_2))
    target_q_agents = TemporalAgent(Agents(sac.target_critic_1, sac.target_critic_2))

    # Configure the optimizer
    actor_optimizer = setup_optimizer(cfg.actor_optimizer, sac.actor)
    critic_optimizer = setup_optimizer(cfg.critic_optimizer, sac.critic_1, sac.critic_2)
    entropy_coef_optimizer, log_entropy_coef = setup_entropy_optimizers(cfg) 

    # If entropy_mode is not auto, the entropy coefficient ent_coef remains
    # fixed. Otherwise, computes the target entropy
    if cfg.algorithm.entropy_mode == "auto":
        # target_entropy is \mathcal{H}_0 in the SAC and applications paper.
        target_entropy = -np.prod(sac.train_env.action_space.shape).astype(np.float32)

    # Loops over successive replay buffers
    for rb in sac.iter_replay_buffers():

        # Implement the SAC algorithm
        rb_workspace = rb.get_shuffled(cfg.algorithm.batch_size)
        terminated, reward = rb_workspace["env/terminated", "env/reward"]

        must_bootstrap = ~terminated

        if entropy_coef_optimizer is not None:
            ent_coef = torch.exp(log_entropy_coef.detach())

        # Critic update part #############################
        (critic_loss_1, critic_loss_2) = compute_critic_loss(
            cfg = cfg,
            reward = reward,
            must_bootstrap = must_bootstrap,
            t_actor = t_actor,
            t_q_agents = q_agents,
            t_target_q_agents = target_q_agents,
            rb_workspace = rb_workspace,
            ent_coef = ent_coef
        )

        # Log the critic losses
        logger.add_log("critic_loss_1", critic_loss_1, sac.nb_steps)
        logger.add_log("critic_loss_2", critic_loss_2, sac.nb_steps)

        # Backpropagate and update the critic parameters
        critic_loss = critic_loss_1 + critic_loss_2
        critic_optimizer.zero_grad()
        critic_loss.backward()

        nn.utils.clip_grad_norm_(
            sac.critic_1.parameters(), cfg.algorithm.max_grad_norm
        )
        nn.utils.clip_grad_norm_(
            sac.critic_2.parameters(), cfg.algorithm.max_grad_norm
        )

        critic_optimizer.step()

        # Actor update part #############################
        actor_optimizer.zero_grad()
        actor_loss = compute_actor_loss(
            ent_coef = ent_coef,
            t_actor = t_actor,
            t_q_agents = q_agents,
            rb_workspace = rb_workspace
        )

        # Log the actor loss
        logger.add_log("actor_loss", actor_loss, sac.nb_steps)

        # Backpropagate and update the actor parameters
        actor_loss.backward()

        nn.utils.clip_grad_norm_(
            sac.actor.parameters(), cfg.algorithm.max_grad_norm
        )
        actor_optimizer.step()

        # Entropy optimizer part
        if entropy_coef_optimizer is not None:
            # See Eq. (17) of the SAC and Applications paper. The log
            # probabilities *must* have been computed when computing the actor
            # loss.
            action_logprobs_rb = rb_workspace["action_logprobs"].detach()
            entropy_coef_loss = -(
                log_entropy_coef.exp() * (action_logprobs_rb + target_entropy)
            ).mean()
            entropy_coef_optimizer.zero_grad()
            entropy_coef_loss.backward()
            entropy_coef_optimizer.step()
            logger.add_log("entropy_coef_loss", entropy_coef_loss, sac.nb_steps)
            logger.add_log("entropy_coef", ent_coef, sac.nb_steps)

        ####################################################

        # Soft update of target q function
        soft_update_params(sac.critic_1, sac.target_critic_1, tau)
        soft_update_params(sac.critic_2, sac.target_critic_2, tau)

        # Evaluate the current policy (uses the `sac` argument, not a global variable)
        sac.evaluate()

## Definition of the parameters

In [32]:
params = {
    "save_best": True,
    "base_dir": "${gym_env.env_name}/sac-S${algorithm.seed}_${current_time:}",
    "algorithm": {
        "seed": 1,
        "n_envs": 8,
        "n_steps": 32,
        "buffer_size": 1e6,
        "batch_size": 256,
        "max_grad_norm": 0.5,
        "nb_evals": 16,
        "eval_interval": 2_000,
        "learning_starts": 10_000,
        "max_epochs": 2_000,
        "discount_factor": 0.98,
        "entropy_mode": "auto",  # "auto" or "fixed"
        "init_entropy_coef": 2e-7,
        "tau_target": 0.05,
        "architecture": {
            "actor_hidden_size": [64, 64],
            "critic_hidden_size": [256, 256],
        },
    },
    "gym_env": {"env_name": "CartPoleContinuous-v1"},
    "actor_optimizer": {
        "classname": "torch.optim.Adam",
        "lr": 3e-4,
    },
    "critic_optimizer": {
        "classname": "torch.optim.Adam",
        "lr": 3e-4,
    },
    "entropy_coef_optimizer": {
        "classname": "torch.optim.Adam",
        "lr": 3e-4,
    },
}

## Launching tensorboard to visualize the results

In [None]:
from torch.utils.tensorboard import SummaryWriter

In [36]:
setup_tensorboard("./outputs/tblogs")

In [34]:
agents = SACAlgo(OmegaConf.create(params))
run_sac(agents)

In [35]:
# Visualize the best policy
agents.visualize_best()

Video of best agent recorded in outputs/CartPoleContinuous-v1/sac-S1_20241104-171044/best_agent.mp4

## Exercises

- use the same code on the Pendulum-v1 environment. This one is harder to
  tune. Get the parameters from the
  [rl-baseline3-zoo](https://github.com/DLR-RM/rl-baselines3-zoo) and see if
  you manage to get SAC working on Pendulum
