# Algorithms
## On-Policy
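## Proximal Policy Optimization (PPO)
PPO architecture: In a training iteration, PPO performs three major steps:

1. Sample a set of episodes or episode fragments.
2. Convert them into a train batch and update the model using a clipped objective and multiple SGD passes over that batch.
3. Sync the weights from the Learners back to the EnvRunners.

PPO scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners for updating the model.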
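
The clipped objective mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard PPO surrogate objective, not RLlib's actual loss, which to my understanding also includes value-function and KL terms. The function name `ppo_clipped_objective` and its arguments are hypothetical.

```python
import numpy as np


def ppo_clipped_objective(logp_new, logp_old, advantages, clip_param=0.2):
    """Clipped surrogate objective (to be maximized) over a batch of actions.

    Hypothetical helper for illustration only; not part of RLlib.
    """
    # Probability ratio between the updated policy and the sampling policy.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantages
    # The element-wise minimum removes the incentive to push the ratio
    # outside of [1 - clip_param, 1 + clip_param].
    return np.mean(np.minimum(unclipped, clipped))


# Example batch: the new policy doubles the probability of both actions.
advantages = np.array([1.0, -1.0])
logp_old = np.log(np.array([0.5, 0.5]))
logp_new = np.log(np.array([1.0, 1.0]))
objective = ppo_clipped_objective(logp_new, logp_old, advantages)
```

Because the clip bounds how much any single batch can change the policy, the same batch can be reused for several SGD passes, as in step 2.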

class ray.rllib.algorithms.ppo.ppo.PPOConfig(algo_class=None)
Defines a configuration class from which a PPO Algorithm can be built.

In [7]:
from ray.rllib.algorithms.ppo import PPOConfig

config = PPOConfig()
config.environment("CartPole-v1")
config.env_runners(num_env_runners=1)
config.training(
    gamma=0.9, lr=0.01, kl_coeff=0.3, train_batch_size_per_learner=256
)

# Build an Algorithm object from the config and run 1 training iteration.
algo = config.build_algo()
algo.train()

`UnifiedLogger` will be removed in Ray 2.7.
  return UnifiedLogger(config, logdir, loggers=None)
The `JsonLogger interface is deprecated in favor of the `ray.tune.json.JsonLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))
The `CSVLogger interface is deprecated in favor of the `ray.tune.csv.CSVLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))
The `TBXLogger interface is deprecated in favor of the `ray.tune.tensorboardx.TBXLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))
[2025-08-06 19:16:06,568 E 4032506 4032506] core_worker.cc:2740: Actor with class name: 'SingleAgentEnvRunner' and ID: 'e26e84d5351721ccbd1186d501000000' has constructor arguments in the object store and max_restarts > 0. If the arguments in the object store go out of scope or are lost, the actor resta

{'timers': {'training_iteration': 1.5853639098349959,
  'restore_env_runners': 2.5487970560789108e-05,
  'training_step': 1.58493511704728,
  'env_runner_sampling_timer': 0.5528545319102705,
  'learner_update_timer': 1.0288492140825838,
  'synch_weights': 0.0027918238192796707},
 'env_runners': {'num_env_steps_sampled_lifetime': 256,
  'env_to_module_connector': {'timers': {'connectors': {'add_states_from_episodes_to_batch': 9.538870859409285e-06,
     'batch_individual_items': 3.748741463657136e-05,
     'add_time_dim_to_batch_and_zero_pad': 1.6902450732924846e-05,
     'add_observations_from_episodes_to_batch': 1.8243293477540685e-05,
     'numpy_to_tensor': 7.132719848563005e-05}},
   'connector_pipeline_timer': 0.0003292565112888338},
  'num_env_steps_sampled': 256,
  'module_to_env_connector': {'timers': {'connectors': {'normalize_and_clip_actions': 0.0001058564655639427,
     'listify_data_for_vector_env': 6.923905912260099e-05,
     'un_batch_to_individual_items': 3.633027966971

In [8]:
from ray.rllib.algorithms.ppo import PPOConfig
from ray import tune

config = (
    PPOConfig()
    # Set the config object's env.
    .environment(env="CartPole-v1")
    # Update the config object's training parameters.
    .training(
        lr=0.001, clip_param=0.2
    )
)

tune.Tuner(
    "PPO",
    run_config=tune.RunConfig(stop={"training_iteration": 1}),
    param_space=config,
).fit()

Current time: 2025-08-06 19:18:42
Running for: 00:01:08.12
Memory: 14.7/15.3 GiB

Trial name                   status   loc
PPO_CartPole-v1_f3807_00000  PENDING


2025-08-06 19:18:42,468	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/robotarm/ray_results/PPO_2025-08-06_19-17-34' in 0.0191s.
2025-08-06 19:18:52,498	INFO tune.py:1041 -- Total run time: 78.17 seconds (68.10 seconds for the tuning loop).
Resume experiment with: Tuner.restore(path="/home/robotarm/ray_results/PPO_2025-08-06_19-17-34", trainable=...)
- PPO_CartPole-v1_f3807_00000: FileNotFoundError('Could not fetch metrics for PPO_CartPole-v1_f3807_00000: both result.json and progress.csv were not found at /home/robotarm/ray_results/PPO_2025-08-06_19-17-34/PPO_CartPole-v1_f3807_00000_0_2025-08-06_19-17-34')


ResultGrid<[
  Result(
    metrics={},
    path='/home/robotarm/ray_results/PPO_2025-08-06_19-17-34/PPO_CartPole-v1_f3807_00000_0_2025-08-06_19-17-34',
    filesystem='local',
    checkpoint=None
  )
]>

## Off-Policy
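## Deep Q Networks (DQN, Rainbow, Parametric DQN)
DQN architecture: DQN uses a replay buffer to temporarily store episode samples that RLlib collects from the environment. Across training iterations, these episodes and episode fragments are re-sampled from the buffer and reused for updating the model, before eventually being discarded once the buffer has reached capacity and new samples keep coming in (FIFO). This reuse of training data makes DQN very sample-efficient and off-policy. DQN scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners for updating the model.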
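
The FIFO behavior described above can be illustrated with a small standalone buffer. This is a sketch with uniform sampling only. `FIFOReplayBuffer` is a hypothetical class written for this example and is not RLlib's episode replay buffer, which stores episodes and also offers prioritized sampling.

```python
import random
from collections import deque


class FIFOReplayBuffer:
    """Hypothetical minimal replay buffer with first-in-first-out eviction."""

    def __init__(self, capacity):
        # A deque with `maxlen` silently drops the oldest item once full.
        self._storage = deque(maxlen=capacity)

    def add(self, transition):
        self._storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling with replacement; old data can be drawn many times.
        return [random.choice(self._storage) for _ in range(batch_size)]

    def __len__(self):
        return len(self._storage)


buffer = FIFOReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(step)  # Steps 0 and 1 are evicted when steps 3 and 4 arrive.
batch = buffer.sample(batch_size=4)
```

Each stored item can end up in many train batches before it is evicted, which is where the sample efficiency comes from.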

class ray.rllib.algorithms.dqn.dqn.DQNConfig(algo_class=None)
Defines a configuration class from which a DQN Algorithm can be built.

In [None]:
from ray.rllib.algorithms.dqn.dqn import DQNConfig

config = (
    DQNConfig()
    .environment("CartPole-v1")
    .training(replay_buffer_config={
        "type": "PrioritizedEpisodeReplayBuffer",
        "capacity": 60000,
        "alpha": 0.5,
        "beta": 0.5,
    })
    .env_runners(num_env_runners=1)
)
algo = config.build_algo()
algo.train()
algo.stop()

In [None]:
from ray.rllib.algorithms.dqn.dqn import DQNConfig
from ray import tune

config = (
    DQNConfig()
    .environment("CartPole-v1")
    .training(
        num_atoms=tune.grid_search([1,])
    )
)
tune.Tuner(
    "DQN",
    run_config=tune.RunConfig(stop={"training_iteration":1}),
    param_space=config,
).fit()

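## Soft Actor-Critic (SAC)
SAC architecture: SAC uses a replay buffer to temporarily store episode samples that RLlib collects from the environment. Across training iterations, these episodes and episode fragments are re-sampled from the buffer and reused for updating the model, before eventually being discarded once the buffer has reached capacity and new samples keep coming in (FIFO). This reuse of training data makes SAC very sample-efficient and off-policy. SAC scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners for updating the model.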

class ray.rllib.algorithms.sac.sac.SACConfig(algo_class=None)
Defines a configuration class from which a SAC Algorithm can be built.

In [9]:
from ray.rllib.algorithms.sac.sac import SACConfig

config = (
    SACConfig()
    .environment("Pendulum-v1")
    .env_runners(num_env_runners=1)
    .training(
        gamma=0.9,
        actor_lr=0.001,
        critic_lr=0.002,
        train_batch_size_per_learner=32,
    )
)
# Build the SAC algo object from the config and run 1 training iteration.
algo = config.build_algo()
algo.train()

{'timers': {'training_iteration': 2.49877782096155,
  'restore_env_runners': 2.619828366461121e-05,
  'training_step': 0.14277760602769302,
  'env_runner_sampling_timer': 0.1358278902293092,
  'replay_buffer_add_data_timer': 0.003704311900377226},
 'env_runners': {'num_env_steps_sampled_lifetime': 100,
  'env_to_module_connector': {'timers': {'connectors': {'add_states_from_episodes_to_batch': 8.805160120346227e-06,
     'batch_individual_items': 7.885414424821597e-05,
     'add_time_dim_to_batch_and_zero_pad': 1.4853592734398081e-05,
     'add_observations_from_episodes_to_batch': 4.0668916042309075e-05,
     'numpy_to_tensor': 0.00013696652123822622}},
   'connector_pipeline_timer': 0.0008803033944130772},
  'num_env_steps_sampled': 100,
  'module_to_env_connector': {'timers': {'connectors': {'normalize_and_clip_actions': 0.00033713240959476807,
     'listify_data_for_vector_env': 0.00010940701427942472,
     'un_batch_to_individual_items': 6.779415586698412e-05,
     'get_actions': 

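## High-Throughput On- and Off-Policy
## Asynchronous Proximal Policy Optimization (APPO)
APPO architecture: APPO is an asynchronous variant of Proximal Policy Optimization (PPO) based on the IMPALA architecture, but it uses a surrogate policy loss with clipping, which allows multiple SGD passes per collected train batch. In a training iteration, APPO requests samples from all EnvRunners asynchronously, and the collected episode samples are returned to the main algorithm process as Ray references rather than as actual objects available on the local process. APPO then passes these episode references to the Learners for asynchronous updates of the model. RLlib doesn't always sync the weights back to the EnvRunners right after a new model version is available. To account for the EnvRunners being off-policy, APPO uses the v-trace procedure described in the IMPALA paper. APPO scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners for updating the model.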

class ray.rllib.algorithms.appo.appo.APPOConfig(algo_class=None)
Defines a configuration class from which an APPO Algorithm can be built.

In [12]:
from ray.rllib.algorithms.appo import APPOConfig
config = (
    APPOConfig()
    .training(lr=0.01, grad_clip=30.0, train_batch_size_per_learner=50)
)
config = config.learners(num_learners=1)
config = config.env_runners(num_env_runners=1)
config = config.environment("CartPole-v1")

# Build an Algorithm object from the config and run 1 training iteration.
algo = config.build_algo() 
print(algo.train())
del algo

ValueError: <ray.rllib.env.single_agent_env_runner.SingleAgentEnvRunner object at 0x7df0b263b2b0> doesn't have an env! Can't call `sample()` on it.

In [6]:
from ray.rllib.algorithms.appo import APPOConfig
from ray import tune

config = APPOConfig()
# Update the config object.
config = config.training(lr=tune.grid_search([0.001,]))
# Set the config object's env.
config = config.environment(env="CartPole-v1")
# Use to_dict() to get the old-style python config dict when running with tune.
tune.Tuner(
    "APPO",
    run_config=tune.RunConfig(
        stop={"training_iteration": 1},
        verbose=0,
    ),
    param_space=config.to_dict(),
).fit()

2025-08-06 19:11:29,725	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/robotarm/ray_results/APPO_2025-08-06_19-11-04' in 0.0150s.
Resume experiment with: Tuner.restore(path="/home/robotarm/ray_results/APPO_2025-08-06_19-11-04", trainable=...)
- APPO_CartPole-v1_0b1c9_00000: FileNotFoundError('Could not fetch metrics for APPO_CartPole-v1_0b1c9_00000: both result.json and progress.csv were not found at /home/robotarm/ray_results/APPO_2025-08-06_19-11-04/APPO_CartPole-v1_0b1c9_00000_0_lr=0.0010_2025-08-06_19-11-04')


ResultGrid<[
  Result(
    metrics={},
    path='/home/robotarm/ray_results/APPO_2025-08-06_19-11-04/APPO_CartPole-v1_0b1c9_00000_0_lr=0.0010_2025-08-06_19-11-04',
    filesystem='local',
    checkpoint=None
  )
]>

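## Importance Weighted Actor-Learner Architecture (IMPALA)
IMPALA architecture: In a training iteration, IMPALA requests samples from all EnvRunners asynchronously, and the collected episodes are returned to the main algorithm process as Ray references rather than as actual objects available on the local process. IMPALA then passes these episode references to the Learners for asynchronous updates of the model. RLlib doesn't always sync the weights back to the EnvRunners right after a new model version is available. To account for the EnvRunners being off-policy, IMPALA uses a procedure called v-trace, described in the paper. IMPALA scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners for updating the model.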
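
The v-trace correction can be sketched for a single trajectory as follows. This is a minimal NumPy version of the value targets from the IMPALA paper, using the recursion v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1})). The function `vtrace_targets` and its argument names are hypothetical and are not RLlib's implementation.

```python
import numpy as np


def vtrace_targets(behavior_logp, target_logp, rewards, values,
                   bootstrap_value, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute v-trace value targets v_s for one trajectory (illustrative)."""
    # Importance ratios between the learner's policy and the (stale)
    # policy the EnvRunner used for sampling.
    ratios = np.exp(target_logp - behavior_logp)
    rhos = np.minimum(rho_bar, ratios)  # truncated weights for the TD errors
    cs = np.minimum(c_bar, ratios)      # truncated trace coefficients

    # V(x_{t+1}) for every step, bootstrapping after the last step.
    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)

    # Backward recursion over the trajectory.
    targets = np.zeros(len(rewards), dtype=float)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        targets[t] = values[t] + acc
    return targets


# On-policy case: identical log-probs give ratios of 1.
logp = np.log(np.array([0.5, 0.5, 0.5]))
targets = vtrace_targets(logp, logp, rewards=np.ones(3), values=np.zeros(3),
                         bootstrap_value=0.0, gamma=0.5)
```

When both policies agree, the targets reduce to ordinary discounted n-step returns. When the sampling policy lags behind, the truncated ratios shrink the contribution of the affected steps.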

class ray.rllib.algorithms.impala.impala.IMPALAConfig(algo_class=None)
Defines a configuration class from which an IMPALA Algorithm can be built.

In [None]:
from ray.rllib.algorithms.impala import IMPALAConfig

config = (
    IMPALAConfig()
    .environment("CartPole-v1")
    .env_runners(num_env_runners=1)
    .training(lr=0.0003, train_batch_size_per_learner=512)
    .learners(num_learners=1)
)
# Build an Algorithm object from the config and run 1 training iteration.
algo = config.build_algo()
algo.train()
del algo

In [None]:
from ray.rllib.algorithms.impala import IMPALAConfig
from ray import tune

config = (
    IMPALAConfig()
    .environment("CartPole-v1")
    .env_runners(num_env_runners=1)
    .training(lr=tune.grid_search([0.0001, 0.0002]), grad_clip=20.0)
    .learners(num_learners=1)
)
# Run with tune.
tune.Tuner(
    "IMPALA",
    param_space=config,
    run_config=tune.RunConfig(stop={"training_iteration": 1}),
).fit()

## Model-based RL
## DreamerV3

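DreamerV3 architecture: DreamerV3 trains a recurrent world model in a supervised fashion, using real environment interactions sampled from a replay buffer. The world model's objective is to correctly predict the transition dynamics of the RL environment: the next observation, the reward, and a boolean continuation flag. DreamerV3 trains the actor and critic networks only on synthesized trajectories, which are "dreamed" by the world model. DreamerV3 scales out on both axes, supporting multiple EnvRunners for sample collection and multiple GPU- or CPU-based Learners for updating the model. It can also be used with different environment types, including image- or vector-based observations, continuous or discrete actions, and sparse or dense reward functions.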
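
The idea of training on "dreamed" trajectories can be illustrated with a toy rollout loop. Everything in this sketch is hypothetical: `toy_world_model` is a hand-written stand-in for a learned recurrent world model, and the scalar state and dynamics are chosen only to keep the example short. It shows the data flow, not DreamerV3 itself.

```python
def toy_world_model(state, action):
    """Hypothetical stand-in for a learned world model.

    Returns the three quantities the world model predicts:
    next observation (here a scalar state), reward, and continuation flag.
    """
    next_state = 0.9 * state + 0.1 * action
    reward = -abs(next_state)
    continues = abs(next_state) < 1.0
    return next_state, reward, continues


def dream_rollout(start_state, policy, horizon=5):
    """Roll out the policy inside the world model, never touching the real env."""
    state, trajectory = start_state, []
    for _ in range(horizon):
        action = policy(state)
        next_state, reward, continues = toy_world_model(state, action)
        trajectory.append((state, action, reward, continues))
        if not continues:
            break
        state = next_state
    return trajectory


# The actor and critic would be trained on trajectories like this one.
trajectory = dream_rollout(0.5, policy=lambda s: -s)
```

Only the start state has to come from real experience. Every later step is predicted by the model, so the actor and critic can be updated without additional environment interaction.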

## Offline RL and Imitation Learning
## Behavior Cloning (BC)

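BC architecture: RLlib's behavior cloning (BC) uses Ray Data to tap into its parallel data processing capabilities. In a training iteration, n DataWorkers read episodes in parallel from offline files, for example parquet. Connector pipelines then preprocess these episodes into train batches and send them as data iterators directly to the n Learners for updating the model. RLlib's BC implementation is derived directly from its MARWIL implementation, the only difference being the beta parameter, which is set to 0.0. This makes BC try to match the behavior policy that generated the offline data, disregarding any resulting rewards.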

class ray.rllib.algorithms.bc.bc.BCConfig(algo_class=None)
Defines a configuration class from which a new BC Algorithm can be built.


In [None]:
from ray.rllib.algorithms.bc import BCConfig
# Run this from the ray directory root.
config = BCConfig().training(lr=0.00001, gamma=0.99)
config = config.offline_data(
    input_="./rllib/tests/data/cartpole/large.json")

# Build an Algorithm object from the config and run 1 training iteration.
algo = config.build_algo()
algo.train()

In [None]:
from ray.rllib.algorithms.bc import BCConfig
from ray import tune
config = BCConfig()
# Print out some default values.
print(config.beta)
# Update the config object.
config.training(
    lr=tune.grid_search([0.001, 0.0001]), beta=0.75
)
# Set the config object's data path.
# Run this from the ray directory root.
config.offline_data(
    input_="./rllib/tests/data/cartpole/large.json"
)
# Set the config object's env, used for evaluation.
config.environment(env="CartPole-v1")
# Use to_dict() to get the old-style python config dict
# when running with tune.
tune.Tuner(
    "BC",
    param_space=config.to_dict(),
).fit()

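## Conservative Q-Learning (CQL)
CQL architecture: CQL (Conservative Q-Learning) is an offline RL algorithm that mitigates the overestimation of Q-values outside the dataset distribution through a conservative critic estimate. It adds a simple Q regularizer loss to the standard Bellman update loss, which ensures that the critic doesn't output overly optimistic Q-values. The SACLearner adds this conservative correction term to the TD-based Q-learning loss.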
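
For intuition, the Q regularizer can be written out for a discrete action space. This is a simplified sketch of the log-sum-exp form of the CQL penalty, under the assumption that Q-values for all actions are available. RLlib's CQL builds on SAC and continuous actions, so its implementation differs. `cql_regularizer` is a hypothetical helper.

```python
import numpy as np


def cql_regularizer(q_values, dataset_actions):
    """Conservative penalty: logsumexp_a Q(s, a) - Q(s, a_data), batch mean.

    q_values: array of shape [batch, num_actions].
    dataset_actions: integer array of shape [batch] with the logged actions.
    """
    # Numerically stable log-sum-exp over the action dimension.
    q_max = q_values.max(axis=1, keepdims=True)
    logsumexp = q_max[:, 0] + np.log(np.exp(q_values - q_max).sum(axis=1))
    # Q-values of the actions that actually appear in the offline dataset.
    q_data = q_values[np.arange(len(dataset_actions)), dataset_actions]
    return np.mean(logsumexp - q_data)


q_values = np.array([[10.0, 0.0]])
# Small penalty: the high Q-value belongs to the action seen in the data.
penalty_in_data = cql_regularizer(q_values, np.array([0]))
# Large penalty: the high Q-value belongs to an action not seen in the data.
penalty_out_of_data = cql_regularizer(q_values, np.array([1]))
```

The penalty is added, scaled by a coefficient, to the TD loss, so the critic is pushed to lower Q-values for actions the dataset doesn't support.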

class ray.rllib.algorithms.cql.cql.CQLConfig(algo_class=None)
Defines a configuration class from which a CQL Algorithm can be built.

In [None]:
from ray.rllib.algorithms.cql import CQLConfig
config = CQLConfig().training(gamma=0.9, lr=0.01)
config = config.resources(num_gpus=0)
config = config.env_runners(num_env_runners=4)
print(config.to_dict())
# Build an Algorithm object from the config and run 1 training iteration.
algo = config.build_algo(env="CartPole-v1")
algo.train()

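## Monotonic Advantage Re-Weighted Imitation Learning (MARWIL)
MARWIL architecture: MARWIL is a hybrid imitation learning and policy gradient algorithm suitable for training on batched historical data. When the beta hyperparameter is set to zero, the MARWIL objective reduces to plain imitation learning (see BC). MARWIL uses Ray Data to tap into its parallel data processing capabilities. In a training iteration, n DataWorkers read episodes in parallel from offline files, for example parquet. Connector pipelines preprocess these episodes into train batches and send them as data iterators directly to the n Learners for updating the model.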
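
The role of beta can be seen in a simplified version of the policy loss. This sketch weights the log-likelihood of the logged actions by exp(beta * advantage). It leaves out the advantage normalization and the value-function loss that, to my understanding, RLlib's MARWIL also uses. `marwil_policy_loss` is a hypothetical helper.

```python
import numpy as np


def marwil_policy_loss(logp, advantages, beta=1.0):
    """Advantage-weighted negative log-likelihood (simplified, illustrative)."""
    # Actions with higher advantage get exponentially more weight.
    weights = np.exp(beta * advantages)
    return -np.mean(weights * logp)


logp = np.log(np.array([0.5, 0.25]))  # log-probs of the logged actions
advantages = np.array([1.0, -1.0])

# beta = 0: all weights are 1, which is the plain behavior cloning loss.
bc_loss = marwil_policy_loss(logp, advantages, beta=0.0)
# beta > 0: imitation focuses on actions that did better than expected.
marwil_loss = marwil_policy_loss(logp, advantages, beta=1.0)
```

With `beta=0.0` the advantages drop out entirely, which is why BC can ignore the rewards in the offline data.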

class ray.rllib.algorithms.marwil.marwil.MARWILConfig(algo_class=None)
Defines a configuration class from which a MARWIL Algorithm can be built.

In [13]:
import gymnasium as gym
import numpy as np

from pathlib import Path

import ray.rllib
from ray.rllib.algorithms.marwil import MARWILConfig

# `__file__` is undefined in a notebook, so resolve the base path
# (to ray/rllib) from the installed package instead.
base_path = Path(ray.rllib.__file__).parent
# Get the path to the data in rllib folder. This assumes the installation
# ships the test data; otherwise point `data_path` at a local Ray checkout.
data_path = base_path / "tests/data/cartpole/cartpole-v1_large"

config = MARWILConfig()
# Enable the new API stack.
config.api_stack(
    enable_rl_module_and_learner=True,
    enable_env_runner_and_connector_v2=True,
)
# Define the environment for which to learn a policy
# from offline data.
config.environment(
    observation_space=gym.spaces.Box(
        np.array([-4.8, -np.inf, -0.41887903, -np.inf]),
        np.array([4.8, np.inf, 0.41887903, np.inf]),
        shape=(4,),
        dtype=np.float32,
    ),
    action_space=gym.spaces.Discrete(2),
)
# Set the training parameters.
config.training(
    beta=1.0,
    lr=1e-5,
    gamma=0.99,
    # We must define a train batch size for each
    # learner (here 1 local learner).
    train_batch_size_per_learner=2000,
)
# Define the data source for offline data.
config.offline_data(
    input_=[data_path.as_posix()],
    # Run exactly one update per training iteration.
    dataset_num_iters_per_learner=1,
)

# Build an `Algorithm` object from the config and run 1 training
# iteration.
algo = config.build_algo()
algo.train()

In [None]:
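import gymnasium as gym
import numpy as np

from pathlib import Path

import ray.rllib
from ray.rllib.algorithms.marwil import MARWILConfig
from ray import tune

# `__file__` is undefined in a notebook, so resolve the base path
# (to ray/rllib) from the installed package instead.
base_path = Path(ray.rllib.__file__).parent
# Get the path to the data in rllib folder. This assumes the installation
# ships the test data; otherwise point `data_path` at a local Ray checkout.
data_path = base_path / "tests/data/cartpole/cartpole-v1_large"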

config = MARWILConfig()
# Enable the new API stack.
config.api_stack(
    enable_rl_module_and_learner=True,
    enable_env_runner_and_connector_v2=True,
)
# Print out some default values
print(f"beta: {config.beta}")
# Update the config object.
config.training(
    lr=tune.grid_search([1e-3, 1e-4]),
    beta=0.75,
    # We must define a train batch size for each
    # learner (here 1 local learner).
    train_batch_size_per_learner=2000,
)
# Set the config's data path.
config.offline_data(
    input_=[data_path.as_posix()],
    # Set the number of updates to be run per learner
    # per training step.
    dataset_num_iters_per_learner=1,
)
# Set the config's environment for evaluation.
config.environment(
    observation_space=gym.spaces.Box(
        np.array([-4.8, -np.inf, -0.41887903, -np.inf]),
        np.array([4.8, np.inf, 0.41887903, np.inf]),
        shape=(4,),
        dtype=np.float32,
    ),
    action_space=gym.spaces.Discrete(2),
)
# Set up a tuner to run the experiment.
tuner = tune.Tuner(
    "MARWIL",
    param_space=config,
    run_config=tune.RunConfig(
        stop={"training_iteration": 1},
    ),
)
# Run the experiment.
tuner.fit()