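# Key concepts
RLlib overview: The central component of RLlib is the Algorithm class, which acts as the runtime for executing your RL experiments. The entry point to an Algorithm is the AlgorithmConfig (cyan) class, which lets you manage the available configuration settings, for example the learning rate or the model architecture. Most Algorithm objects have EnvRunner actors (blue) that collect training samples from the RL environment and Learner actors (yellow) that compute gradients and update your models. The algorithm synchronizes model weights after each update.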

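## AlgorithmConfig and Algorithm
The entry point to each RLlib Algorithm type is its corresponding AlgorithmConfig class, which lets you configure the available settings in a checked and type-safe manner. For example, to configure a PPO ("Proximal Policy Optimization") algorithm instance, use the PPOConfig class.
During construction, the Algorithm first sets up its EnvRunnerGroup, containing n EnvRunner actors, and its LearnerGroup, containing m Learner actors. This way, you can scale sample collection and training independently, from a single core up to many thousands of cores in a cluster.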
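The sample, update, and weight-sync cycle described above can be sketched in plain Python. This is a toy illustration only, not RLlib's implementation: `ToyEnvRunner`, `ToyLearner`, and `train_one_iteration` are hypothetical names, and a 1-D regression problem stands in for a real RL loss.

```python
import random


class ToyEnvRunner:
    """Hypothetical stand-in for an EnvRunner: holds a weight copy, emits samples."""

    def __init__(self, seed):
        self.weights = 0.0  # local copy, only here to show the sync step
        self.rng = random.Random(seed)

    def sample(self, num_steps):
        # Each sample is an (x, target) pair with target = 3 * x.
        xs = [self.rng.uniform(-1.0, 1.0) for _ in range(num_steps)]
        return [(x, 3.0 * x) for x in xs]


class ToyLearner:
    """Hypothetical stand-in for a Learner: computes a gradient and updates."""

    def __init__(self, lr):
        self.weights = 0.0
        self.lr = lr

    def update(self, batch):
        # Gradient of the mean of 0.5 * (w * x - y)**2 with respect to w.
        grad = sum((self.weights * x - y) * x for x, y in batch) / len(batch)
        self.weights -= self.lr * grad
        return self.weights


def train_one_iteration(env_runners, learner, steps_per_runner):
    # 1) Collect samples from all runners.
    batch = []
    for runner in env_runners:
        batch.extend(runner.sample(steps_per_runner))
    # 2) Update the model on the collected batch.
    new_weights = learner.update(batch)
    # 3) Synchronize the updated weights back to every runner.
    for runner in env_runners:
        runner.weights = new_weights
    return new_weights


env_runners = [ToyEnvRunner(seed=i) for i in range(2)]  # n = 2
learner = ToyLearner(lr=1.0)                            # m = 1
for _ in range(20):
    train_one_iteration(env_runners, learner, steps_per_runner=100)
print(learner.weights)
```

After a few iterations the learned weight approaches the target slope of 3, and every runner holds the same weight as the learner.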


In [None]:
from ray.rllib.algorithms.ppo import PPOConfig

# Configure.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .training(
        train_batch_size_per_learner=2000,
        lr=0.0004,
    )
)

# Build the Algorithm.
algo = config.build()

# Train for one iteration, which is 2000 timesteps (1 train batch).
print(algo.train())

{'timers': {'training_iteration': 7.654325661016628, 'restore_env_runners': 2.7968897484242916e-05, 'training_step': 7.653806342044845, 'env_runner_sampling_timer': 2.1841077659046277, 'learner_update_timer': 5.465040223090909, 'synch_weights': 0.0031930829863995314}, 'env_runners': {'num_episodes_lifetime': 93.0, 'module_to_env_connector': {'timers': {'connectors': {'listify_data_for_vector_env': np.float64(5.4867078324845275e-05), 'un_batch_to_individual_items': np.float64(3.01107424896616e-05), 'get_actions': np.float64(0.0004431089123402281), 'tensor_to_numpy': np.float64(9.187517128096352e-05), 'remove_single_ts_time_rank_from_batch': np.float64(3.261060719322453e-06), 'normalize_and_clip_actions': np.float64(5.134171365577584e-05)}}, 'connector_pipeline_timer': np.float64(0.0008485672500699198)}, 'sample': np.float64(1.8559252845006995), 'env_to_module_connector': {'timers': {'connectors': {'add_states_from_episodes_to_batch': np.float64(7.647165153443389e-06), 'add_observations_

Because the Algorithm class is a subclass of the Tune Trainable API, you can use Ray Tune to manage your experiments and tune hyperparameters more easily.

In [1]:
from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig

# Configure.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .training(
        train_batch_size_per_learner=2000,
        lr=0.0004,
    )
)

# Train through Ray Tune.
results = tune.Tuner(
    "PPO",
    param_space=config,
    # Train for 4000 timesteps (2 iterations).
    run_config=tune.RunConfig(stop={"num_env_steps_sampled_lifetime": 4000}),
).fit()

Current time: 2025-08-05 14:41:54
Running for: 00:00:26.97
Memory: 11.7/15.3 GiB

| Trial name | status | loc | iter | total time (s) | num_training_step_calls_per_iteration | num_env_steps_sampled_lifetime |
| --- | --- | --- | --- | --- | --- | --- |
| PPO_CartPole-v1_36536_00000 | TERMINATED | 10.110.34.88:3397885 | 2 | 16.1175 | 1 | 4000 |

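# RL environment
A simple RL environment in which an agent starts from the initial observation returned by the reset() method. The agent, possibly controlled by a neural network policy, sends actions, such as right or jump, to the environment's step() method, which returns a reward. Here, the reward is +5 for reaching the goal and 0 otherwise. The environment also returns a boolean flag indicating whether the episode is complete.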
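A minimal sketch of such an environment is shown below. `CorridorEnv` is a hypothetical example, not an RLlib or Gymnasium class. It assumes the Gymnasium return conventions (`reset()` returns `(obs, info)`, `step()` returns `(obs, reward, terminated, truncated, info)`) and uses `left`/`right` actions for simplicity.

```python
class CorridorEnv:
    """Hypothetical 1-D corridor: start at position 0, goal at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        # Return the initial observation and an (empty) info dict.
        self.pos = 0
        return self.pos, {}

    def step(self, action):
        if action == "right":
            self.pos = min(self.pos + 1, self.length - 1)
        elif action == "left":
            self.pos = max(self.pos - 1, 0)
        else:
            raise ValueError(f"Unknown action: {action!r}")
        # Reaching the goal ends the episode and pays +5; all other steps pay 0.
        terminated = self.pos == self.length - 1
        reward = 5.0 if terminated else 0.0
        truncated = False  # this toy env has no time limit
        return self.pos, reward, terminated, truncated, {}


# Roll out one episode with a fixed "always go right" policy.
env = CorridorEnv(length=5)
obs, info = env.reset()
total_reward = 0.0
done = False
while not done:
    obs, reward, terminated, truncated, info = env.step("right")
    total_reward += reward
    done = terminated or truncated
print(obs, total_reward)
```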
## RLModule
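RLModules are deep-learning-framework-specific wrappers around neural networks. RLlib's EnvRunners use them to compute actions when stepping through the RL environment, and RLlib's Learners use RLModule instances to compute losses and gradients before updating the model.
Each EnvRunner actor, managed by the Algorithm's EnvRunnerGroup, holds a copy of the user's RLModule. Likewise, each Learner actor, managed by the Algorithm's LearnerGroup, holds a copy of the RLModule.
The EnvRunner copy is normally the inference_only version, meaning that components not required for pure action computation, for example the value function estimate, are omitted to save memory.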
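The idea behind an inference-only copy can be illustrated with a toy linear actor-critic. This is a sketch under stated assumptions, not the RLModule API: `build_toy_module`, `compute_action`, and `num_params` are hypothetical helpers, and the "module" is just a dict of weight lists.

```python
import random


def build_toy_module(obs_dim, num_actions, inference_only=False, seed=0):
    """Build a hypothetical linear actor-critic as a dict of weights."""
    rng = random.Random(seed)
    # The policy head is always present: it maps observations to action logits.
    module = {
        "policy_head": [
            [rng.gauss(0.0, 0.1) for _ in range(num_actions)]
            for _ in range(obs_dim)
        ]
    }
    # The value head is only needed for training, so skip it for inference.
    if not inference_only:
        module["value_head"] = [rng.gauss(0.0, 0.1) for _ in range(obs_dim)]
    return module


def compute_action(module, obs):
    """Pick the greedy action from the policy head's logits."""
    weights = module["policy_head"]
    num_actions = len(weights[0])
    logits = [
        sum(o * row[a] for o, row in zip(obs, weights))
        for a in range(num_actions)
    ]
    return max(range(num_actions), key=lambda a: logits[a])


def num_params(module):
    """Count scalar parameters across all heads."""
    total = 0
    for value in module.values():
        for item in value:
            total += len(item) if isinstance(item, list) else 1
    return total


learner_copy = build_toy_module(obs_dim=4, num_actions=2)
env_runner_copy = build_toy_module(obs_dim=4, num_actions=2, inference_only=True)

obs = [0.1, -0.2, 0.3, 0.05]
print(num_params(learner_copy), num_params(env_runner_copy))
print(compute_action(learner_copy, obs) == compute_action(env_runner_copy, obs))
```

Both copies share the same policy weights and therefore choose the same action, but the inference-only copy carries fewer parameters.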

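RLlib passes all training data around in the form of Episodes.
The SingleAgentEpisode class describes single-agent trajectories. The MultiAgentEpisode class contains several such single-agent episodes and describes the stepping times and patterns of the individual agents with respect to each other.
Typically, RLlib generates episode chunks of size config.rollout_fragment_length through the EnvRunner actors in the Algorithm's EnvRunnerGroup, and sends each Learner actor as many episode chunks as it needs to build a training batch of exactly size config.train_batch_size_per_learner.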
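The arithmetic of filling a training batch from episode chunks can be sketched as follows. `build_train_batch` is a hypothetical helper, and trimming the last chunk to hit the exact batch size is an assumption made for illustration; the text above does not specify how RLlib handles the remainder internally.

```python
def build_train_batch(chunk_lengths, train_batch_size):
    """Pick chunks until exactly `train_batch_size` timesteps are gathered.

    Returns a list of (chunk_index, num_timesteps_used) pairs.
    """
    used = []
    remaining = train_batch_size
    for idx, length in enumerate(chunk_lengths):
        if remaining == 0:
            break
        # Use the whole chunk, or only as much as still fits (assumption).
        take = min(length, remaining)
        used.append((idx, take))
        remaining -= take
    if remaining > 0:
        raise ValueError("Not enough timesteps to fill the train batch.")
    return used


# Chunks of 200 timesteps fill a 2000-timestep batch with exactly 10 chunks.
even = build_train_batch([200] * 12, train_batch_size=2000)
print(len(even))

# Chunks of 300 timesteps do not divide 2000 evenly: the 7th chunk is trimmed.
uneven = build_train_batch([300] * 12, train_batch_size=2000)
print(uneven[-1])
```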

In [None]:
import numpy as np

# A SingleAgentEpisode of length 20 has roughly the following schematic structure.
# Note that after these 20 steps, you have 20 actions and rewards, but 21 observations and info dicts
# due to the initial "reset" observation/infos.
episode = {
    'obs': np.ndarray((21, 4), dtype=np.float32),  # 21 due to additional reset obs
    'infos': [{} for _ in range(21)],  # infos are always lists of dicts; 21 due to the reset info
    'actions': np.ndarray((20,), dtype=np.int64),  # Discrete(4) action space
    'rewards': np.ndarray((20,), dtype=np.float32),
    'extra_model_outputs': {
        'action_dist_inputs': np.ndarray((20, 4), dtype=np.float32),  # Discrete(4) action space
    },
    'is_terminated': False,  # <- single bool
    'is_truncated': True,  # <- single bool
}
episode_w_complex_observations = {
    'obs': {
        "camera": np.ndarray((21, 64, 64, 3), dtype=np.float32),  # RGB images
        "sensors": {
            "front": np.ndarray((21, 15), dtype=np.float32),  # 1D tensors
            "rear": np.ndarray((21, 5), dtype=np.float32),  # another batch of 1D tensors
        },
    },
}

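## EnvRunner: Combining RL environment and RLModule
RLlib provides two built-in EnvRunner classes, SingleAgentEnvRunner and MultiAgentEnvRunner, that automatically handle the complexities of sample collection. RLlib picks the correct type based on your configuration, in particular the config.environment() and config.multi_agent() settings.
You can also use an EnvRunner standalone and call its sample() method to produce lists of episodes.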

In [None]:
import tree  # pip install dm_tree
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.env.single_agent_env_runner import SingleAgentEnvRunner

# Configure the EnvRunners.
config = (
    PPOConfig()
    .environment("Acrobot-v1")
    .env_runners(num_env_runners=2, num_envs_per_env_runner=1)
)
# Create the EnvRunner actors.
env_runners = [
    ray.remote(SingleAgentEnvRunner).remote(config=config)
    for _ in range(config.num_env_runners)
]

# Gather lists of `SingleAgentEpisode`s (each EnvRunner actor returns one
# such list with exactly three episodes in it).
episodes = ray.get([
    er.sample.remote(num_episodes=3)
    for er in env_runners
])
# Two remote EnvRunners used.
assert len(episodes) == 2
# Each EnvRunner returns three episodes
assert all(len(eps_list) == 3 for eps_list in episodes)

# Report the returns of all episodes collected
for episode in tree.flatten(episodes):
    print("R=", episode.get_return())

R= -500.0
R= -500.0
R= -500.0
R= -500.0
R= -500.0
R= -500.0


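## Learner: Combining RLModule, loss function, and optimizer
Learner instances are algorithm-specific, mainly because different RL algorithms use different loss functions.
RLlib always bundles several Learner actors through the LearnerGroup API, which automatically applies distributed data parallelism (DDP) to the training data. You can also use a Learner standalone to update your RLModule with a list of episodes.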
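The core idea of data parallelism can be shown with a toy example: each learner computes a gradient on its own shard of the batch, and averaging those gradients matches the gradient on the full batch when the shards have equal size. This is a sketch with a simple squared-error loss standing in for an RL loss; `mse_gradient` and `split_into_shards` are hypothetical helpers, not RLlib functions.

```python
def mse_gradient(w, shard):
    """Gradient of the mean of 0.5 * (w * x - y)**2 with respect to scalar w."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def split_into_shards(batch, num_learners):
    """Split a batch into equally sized, contiguous shards."""
    if len(batch) % num_learners != 0:
        raise ValueError("Batch size must be divisible by num_learners.")
    shard_size = len(batch) // num_learners
    return [
        batch[i * shard_size:(i + 1) * shard_size]
        for i in range(num_learners)
    ]


# Eight (x, y) samples with y = 2 * x, and a current weight of 0.5.
batch = [(float(x), 2.0 * x) for x in range(1, 9)]
w = 0.5

# Each of the two "learners" computes a gradient on its own shard.
shards = split_into_shards(batch, num_learners=2)
per_learner_grads = [mse_gradient(w, shard) for shard in shards]

# Averaging the per-learner gradients recovers the full-batch gradient.
averaged_grad = sum(per_learner_grads) / len(per_learner_grads)
full_batch_grad = mse_gradient(w, batch)
print(per_learner_grads, averaged_grad, full_batch_grad)
```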

In [3]:
import gymnasium as gym
import tree  # pip install dm_tree
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig

# Configure the Learner.
config = (
    PPOConfig()
    .environment("Acrobot-v1")
    .training(lr=0.0001)
    .rl_module(model_config=DefaultModelConfig(fcnet_hiddens=[64, 32]))
)
# Get the Learner class.
ppo_learner_class = config.get_default_learner_class()

# Create the Learner actor.
learner_actor = ray.remote(ppo_learner_class).remote(
    config=config,
    module_spec=config.get_multi_rl_module_spec(env=gym.make("Acrobot-v1")),
)
# Build the Learner.
ray.get(learner_actor.build.remote())

# Perform an update from the list of episodes we got from the `EnvRunners` above.
learner_results = ray.get(learner_actor.update.remote(
    episodes=tree.flatten(episodes)
))
print(learner_results["default_policy"]["policy_loss"])

Stats(0.010434884577989578; len=1; reduce=mean; win=1)
