Algorithms bring all RLlib components together, making learning of different tasks accessible via RLlib's Python API and its command line interface (CLI). Each Algorithm class is managed by its respective AlgorithmConfig; for example, to configure a PPO instance, use the PPOConfig class. An Algorithm sets up its rollout workers and optimizers and collects training metrics. Algorithms also implement the Tune Trainable API for easy experiment management.

You have three ways to interact with an algorithm: you can train it through the basic Python API, train it from the command line, or use Ray Tune to tune the hyperparameters of your reinforcement learning algorithm. The following cells show two of these for PPO, RLlib's implementation of the proximal policy optimization algorithm: first the Python API, then Ray Tune.

In [1]:
# Configure.
from ray.rllib.algorithms.ppo import PPOConfig
config = PPOConfig().environment(env="CartPole-v1").training(train_batch_size=4000)

# Build.
algo = config.build()

# Train.
print(algo.train())

`UnifiedLogger` will be removed in Ray 2.7.
  return UnifiedLogger(config, logdir, loggers=None)
The `JsonLogger interface is deprecated in favor of the `ray.tune.json.JsonLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))
The `CSVLogger interface is deprecated in favor of the `ray.tune.csv.CSVLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))
The `TBXLogger interface is deprecated in favor of the `ray.tune.tensorboardx.TBXLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))
2024-05-14 01:08:10,191	INFO worker.py:1740 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8266
(RolloutWorker pid=29538)   prep = cls(observation_space, options)
(RolloutWorker pid=29538)   self._preprocessor = get_preprocessor(obs_space)(

{'custom_metrics': {}, 'episode_media': {}, 'info': {'learner': {'default_policy': {'learner_stats': {'allreduce_latency': 0.0, 'grad_gnorm': 1.6258137209441073, 'cur_kl_coeff': 0.20000000000000004, 'cur_lr': 5.0000000000000016e-05, 'total_loss': 8.92392743736185, 'policy_loss': -0.04551196624224465, 'vf_loss': 8.963540629417665, 'vf_explained_var': 0.005344951665529641, 'kl': 0.029493994446276976, 'entropy': 0.6646866707391637, 'entropy_coeff': 0.0}, 'model': {}, 'custom_metrics': {}, 'num_agent_steps_trained': 128.0, 'num_grad_updates_lifetime': 465.5, 'diff_num_grad_updates_vs_sampler_policy': 464.5}}, 'num_env_steps_sampled': 4000, 'num_env_steps_trained': 4000, 'num_agent_steps_sampled': 4000, 'num_agent_steps_trained': 4000}, 'sampler_results': {'episode_reward_max': 68.0, 'episode_reward_min': 8.0, 'episode_reward_mean': 22.13888888888889, 'episode_len_mean': 22.13888888888889, 'episode_media': {}, 'episodes_this_iter': 180, 'episodes_timesteps_total': 3985, 'policy_reward_min':
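The dictionary returned by `algo.train()` is deeply nested, so picking out a single metric takes several key lookups. The sketch below shows one way to do that on a hand-copied excerpt of the output above. The `get_metric` helper and its `/`-separated path syntax are assumptions made for illustration; they are not part of RLlib.

```python
from typing import Any

# Excerpt of the result printed above; the values are copied from that output.
result = {
    "info": {
        "learner": {
            "default_policy": {
                "learner_stats": {
                    "total_loss": 8.92392743736185,
                    "policy_loss": -0.04551196624224465,
                    "vf_loss": 8.963540629417665,
                    "kl": 0.029493994446276976,
                }
            }
        },
        "num_env_steps_sampled": 4000,
    },
    "sampler_results": {
        "episode_reward_max": 68.0,
        "episode_reward_min": 8.0,
        "episode_reward_mean": 22.13888888888889,
    },
}


def get_metric(results: dict, path: str, default: Any = None) -> Any:
    """Hypothetical helper: follow a '/'-separated key path through nested dicts."""
    node: Any = results
    for key in path.split("/"):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


mean_reward = get_metric(result, "sampler_results/episode_reward_mean")
vf_loss = get_metric(result, "info/learner/default_policy/learner_stats/vf_loss")
missing = get_metric(result, "sampler_results/not_a_metric", default=float("nan"))
print(f"mean reward: {mean_reward:.2f}, vf_loss: {vf_loss:.3f}")
```

Returning a default instead of raising keeps logging code working when a metric is absent, which matters because the exact key layout depends on the Ray version and configuration.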

In [2]:
from ray import tune

# Configure.
from ray.rllib.algorithms.ppo import PPOConfig
config = PPOConfig().environment(env="CartPole-v1").training(train_batch_size=4000)

# Train via Ray Tune.
tune.run("PPO", config=config)

2024-05-14 01:10:58,150	INFO tune.py:614 -- [output] This uses the legacy output and progress reporter, as Jupyter notebooks are not supported by the new engine, yet. For more information, please see https://github.com/ray-project/ray/issues/36949
  gym.logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
  logger.warn(
  logger.warn(f"{pre} is not within the observation space.")


Current time: 2024-05-14 02:15:17
Running for: 01:04:19.23
Memory: 45.6/125.7 GiB

Trial name                   status   loc                 iter  total time (s)  ts      reward  episode_reward_max  episode_reward_min  episode_len_mean
PPO_CartPole-v1_c417a_00000  RUNNING  10.16.20.158:14688  221   3825.22         884000  500     500                 500                 500


(RolloutWorker pid=22601)   prep = cls(observation_space, options)
(RolloutWorker pid=22612)   self._preprocessor = get_preprocessor(obs_space)(
(PPO pid=14688) Install gputil for GPU system monitoring.
(RolloutWorker pid=22612)   prep = cls(observation_space, options) [repeated 2x across cluster]
(PPO pid=14688)   self._preprocessor = get_preprocessor(obs_space)( [repeated 2x across cluster]


Trial name: PPO_CartPole-v1_c417a_00000
agent_timesteps_total: 884000
connector_metrics: {'ObsPreprocessorConnector_ms': 0.006231546401977539, 'StateBufferConnector_ms': 0.006144523620605469, 'ViewRequirementAgentConnector_ms': 0.19642233848571777}
counters: {'num_env_steps_sampled': 884000, 'num_env_steps_trained': 884000, 'num_agent_steps_sampled': 884000, 'num_agent_steps_trained': 884000}
custom_metrics: {}
env_runner_results: {'episode_reward_max': 500.0, 'episode_reward_min': 500.0, 'episode_reward_mean': 500.0, 'episode_len_mean': 500.0, 'episode_media': {}, 'episodes_this_iter': 8, 'episodes_timesteps_total': 50000, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [500.0, 500.0, ..., 500.0], 'episode_lengths': [500, 500, ..., 500]}, 'sampler_perf': {'mean_raw_obs_processing_ms': 0.48981083400909287, 'mean_inference_ms': 1.6449787644077316, 'mean_action_processing_ms': 0.17760268693926767, 'mean_env_wait_ms': 0.08690799572258338, 'mean_env_render_ms': 0.0}, 'num_faulty_episodes': 0, 'connector_metrics': {'ObsPreprocessorConnector_ms': 0.006231546401977539, 'StateBufferConnector_ms': 0.006144523620605469, 'ViewRequirementAgentConnector_ms': 0.19642233848571777}, 'num_episodes': 8, 'episode_return_max': 500.0, 'episode_return_min': 500.0, 'episode_return_mean': 500.0}
episode_len_mean: 500
episode_media: {}
episode_return_max: 500
episode_return_mean: 500
episode_return_min: 500
episode_reward_max: 500
episode_reward_mean: 500
episode_reward_min: 500
episodes_this_iter: 8
episodes_timesteps_total: 50000
info: {'learner': {'default_policy': {'learner_stats': {'allreduce_latency': 0.0, 'grad_gnorm': 0.028289363523744927, 'cur_kl_coeff': 1.0509738482436128e-46, 'cur_lr': 5.0000000000000016e-05, 'total_loss': 0.00020715497041081068, 'policy_loss': 0.000207037600358167, 'vf_loss': 1.1734393938827598e-07, 'vf_explained_var': 0.9999399056998632, 'kl': 0.0027693383632967283, 'entropy': 0.2039412122580313, 'entropy_coeff': 0.0}, 'model': {}, 'custom_metrics': {}, 'num_agent_steps_trained': 128.0, 'num_grad_updates_lifetime': 205065.5, 'diff_num_grad_updates_vs_sampler_policy': 464.5}}, 'num_env_steps_sampled': 884000, 'num_env_steps_trained': 884000, 'num_agent_steps_sampled': 884000, 'num_agent_steps_trained': 884000}
num_agent_steps_sampled: 884000
num_agent_steps_sampled_lifetime: 884000
num_agent_steps_trained: 884000
num_env_steps_sampled: 884000
num_env_steps_sampled_lifetime: 884000
num_env_steps_sampled_this_iter: 4000
num_env_steps_sampled_throughput_per_sec: 224.241
num_env_steps_trained: 884000
num_env_steps_trained_this_iter: 4000
num_env_steps_trained_throughput_per_sec: 224.241
num_episodes: 8
num_faulty_episodes: 0
num_healthy_workers: 2
num_in_flight_async_reqs: 0
num_remote_worker_restarts: 0
num_steps_trained_this_iter: 4000
perf: {'cpu_util_percent': 98.916, 'ram_util_percent': 36.356}
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf: {'mean_raw_obs_processing_ms': 0.48981083400909287, 'mean_inference_ms': 1.6449787644077316, 'mean_action_processing_ms': 0.17760268693926767, 'mean_env_wait_ms': 0.08690799572258338, 'mean_env_render_ms': 0.0}
sampler_results: (same values as env_runner_results)
timers: {'training_iteration_time_ms': 18218.817, 'restore_workers_time_ms': 0.023, 'training_step_time_ms': 18218.748, 'sample_time_ms': 5286.226, 'load_time_ms': 1.09, 'load_throughput': 3669717.835, 'learn_time_ms': 12911.173, 'learn_throughput': 309.809, 'synch_weights_time_ms': 19.341}


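RL Modules are framework-specific neural network containers: an RL Module holds the neural networks and defines how they are used in the three phases of reinforcement learning (exploration, inference, and training). It gives the networks a uniform wrapper that every part of the RL pipeline can call.

RL Modules play a role in each of the three phases:

Exploration: the RL Module defines how actions are sampled, so that the agent can explore the environment and collect data.

Inference: the RL Module maps observations to actions, that is, it selects a suitable action for the current observation.

Training: the RL Module defines the training logic of the neural network, so that an optimization algorithm can update the network parameters to maximize reward.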
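To make the three phases concrete, here is a minimal pure-Python sketch of a module with one entry point per phase. `ToyRLModule` and `policy_gradient_step` are hypothetical names and this is not RLlib's implementation; only the method names `forward_exploration`, `forward_inference`, and `forward_train` mirror the RLModule API. The "network" is assumed to be a single linear layer followed by a softmax.

```python
import math
import random


class ToyRLModule:
    """Illustrative stand-in, NOT RLlib's RLModule base class."""

    def __init__(self, obs_dim, num_actions, seed=0):
        self._rng = random.Random(seed)
        # The "network": one linear layer with a weight row per action.
        self.weights = [
            [self._rng.uniform(-0.1, 0.1) for _ in range(obs_dim)]
            for _ in range(num_actions)
        ]

    def _action_probs(self, obs):
        logits = [sum(w * x for w, x in zip(row, obs)) for row in self.weights]
        top = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(z - top) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def forward_exploration(self, obs):
        """Exploration: sample an action from the policy distribution."""
        probs = self._action_probs(obs)
        return self._rng.choices(range(len(probs)), weights=probs)[0]

    def forward_inference(self, obs):
        """Inference: pick the most probable action deterministically."""
        probs = self._action_probs(obs)
        return max(range(len(probs)), key=probs.__getitem__)

    def forward_train(self, obs, action):
        """Training: return the quantities a loss function needs."""
        probs = self._action_probs(obs)
        return {"probs": probs, "logp": math.log(probs[action])}


def policy_gradient_step(module, obs, action, advantage, lr=0.1):
    """Hypothetical learner step: gradient ascent on advantage * log pi(action | obs)."""
    probs = module.forward_train(obs, action)["probs"]
    for a, row in enumerate(module.weights):
        grad_logit = (1.0 if a == action else 0.0) - probs[a]
        for i, x in enumerate(obs):
            row[i] += lr * advantage * grad_logit * x


module = ToyRLModule(obs_dim=4, num_actions=2)
obs = [0.1, -0.2, 0.05, 0.3]
greedy_action = module.forward_inference(obs)
sampled_action = module.forward_exploration(obs)
logp_before = module.forward_train(obs, greedy_action)["logp"]
policy_gradient_step(module, obs, greedy_action, advantage=1.0)
logp_after = module.forward_train(obs, greedy_action)["logp"]
print(greedy_action, sampled_action, logp_after > logp_before)
```

The update lives in a separate function so that the module only describes the forward passes; whoever owns the optimizer decides how to use the training outputs.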

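How RLlib uses RL Modules:

In a RolloutWorker, the RL Module provides the exploration and inference logic used to sample actions and interact with the environment.

In a Learner, the RL Module provides the training logic used to update the network parameters to maximize cumulative reward.

Extension to the multi-agent case: a MultiAgentRLModule contains multiple RL Modules, each of which can represent the policy of one agent.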
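The multi-agent container can be pictured as a dictionary of per-agent modules plus a rule for routing each agent's observation to its module. The sketch below is illustrative only: `ToyMultiAgentModule`, the `agent_to_module` mapping, and the two toy policies are hypothetical names, not RLlib's MultiAgentRLModule API. Letting two agents share one module is an extra assumption beyond the one-module-per-agent case described above.

```python
from typing import Callable, Dict, List

Observation = List[float]
PolicyFn = Callable[[Observation], int]


class ToyMultiAgentModule:
    """Illustrative container (not RLlib's MultiAgentRLModule): module_id -> policy."""

    def __init__(self, modules: Dict[str, PolicyFn], agent_to_module: Dict[str, str]):
        self.modules = modules
        self.agent_to_module = agent_to_module

    def forward_inference(self, obs_by_agent: Dict[str, Observation]) -> Dict[str, int]:
        # Route every agent's observation to the module assigned to that agent.
        return {
            agent_id: self.modules[self.agent_to_module[agent_id]](obs)
            for agent_id, obs in obs_by_agent.items()
        }


def follow_sign(obs: Observation) -> int:
    """Toy policy: action 1 if the first feature is non-negative, else 0."""
    return 1 if obs[0] >= 0 else 0


def always_zero(obs: Observation) -> int:
    """Toy policy: ignore the observation and always choose action 0."""
    return 0


marl_module = ToyMultiAgentModule(
    modules={"chaser": follow_sign, "idle": always_zero},
    agent_to_module={"agent_0": "chaser", "agent_1": "chaser", "agent_2": "idle"},
)
actions = marl_module.forward_inference(
    {"agent_0": [0.5], "agent_1": [-0.5], "agent_2": [0.5]}
)
print(actions)
```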

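Policy evaluation: in reinforcement learning, policy evaluation is the process of generating a batch of experiences, given an environment and a policy. It is often called the "environment interaction loop", because the agent interacts with the environment, executes actions, and observes the results.

The RolloutWorker class in RLlib: RLlib provides the RolloutWorker class to manage policy evaluation. A RolloutWorker handles the interaction with the environment, produces experience batches, and takes care of efficiency in several settings, such as vectorization, recurrent neural networks (RNNs), and multi-agent environments.

Generating experience batches with RolloutWorker: you can use a RolloutWorker on its own to produce experience batches, either by calling worker.sample() on a RolloutWorker instance or by calling worker.sample.remote() in parallel on worker instances created as Ray actors. The parallel form speeds up experience collection.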

In [None]:
# Imports. The module path for EnvRunnerGroup is assumed for recent Ray releases.
import gymnasium as gym
import ray
from ray.rllib.env.env_runner_group import EnvRunnerGroup
from ray.rllib.policy.sample_batch import SampleBatch

# NOTE: CustomPolicy is a placeholder for your own Policy subclass.
# Define it before running this cell.

# Set up the policy and rollout workers.
env = gym.make("CartPole-v1")
policy = CustomPolicy(env.observation_space, env.action_space, {})
workers = EnvRunnerGroup(
    policy_class=CustomPolicy,
    env_creator=lambda c: gym.make("CartPole-v1"),
    num_env_runners=10)

while True:
    # Gather a batch of samples from all remote workers in parallel.
    T1 = SampleBatch.concat_samples(
        ray.get([w.sample.remote() for w in workers.remote_workers()]))

    # Improve the policy using the T1 batch.
    policy.learn_on_batch(T1)

    # The local worker acts as a "parameter server" here.
    # We put the weights of its `policy` into the Ray object store once (`ray.put`)...
    weights = ray.put({"default_policy": policy.get_weights()})
    for w in workers.remote_workers():
        # ... so that we can broadcast these weights to all rollout workers once.
        w.set_weights.remote(weights)
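The cell above depends on a user-defined CustomPolicy and a running Ray cluster. As a dependency-free illustration of the same gather, learn, broadcast pattern, the sketch below uses threads in place of Ray actors. `ToyWorker`, the module-level `learn_on_batch` function, and the one-parameter "policy" are hypothetical stand-ins, not RLlib classes.

```python
import random
from concurrent.futures import ThreadPoolExecutor


class ToyWorker:
    """Stand-in for a rollout worker: holds a weights copy and produces sample batches."""

    def __init__(self, seed: int):
        self._rng = random.Random(seed)
        self.weights = {"bias": 0.0}

    def sample(self, batch_size: int = 4) -> list:
        # Each "experience" is the error between a noisy target and the current bias.
        return [
            1.0 + self._rng.gauss(0.0, 0.1) - self.weights["bias"]
            for _ in range(batch_size)
        ]

    def set_weights(self, weights: dict) -> None:
        self.weights = dict(weights)


def learn_on_batch(weights: dict, batch: list, lr: float = 0.5) -> dict:
    """Toy learner: move the bias toward the mean error in the batch."""
    mean_error = sum(batch) / len(batch)
    return {"bias": weights["bias"] + lr * mean_error}


workers = [ToyWorker(seed=i) for i in range(3)]
weights = {"bias": 0.0}

with ThreadPoolExecutor(max_workers=len(workers)) as pool:
    for _ in range(20):  # bounded, unlike the `while True` loop above
        # Gather: sample in parallel, then concatenate the per-worker batches.
        batches = list(pool.map(lambda w: w.sample(), workers))
        train_batch = [x for b in batches for x in b]
        # Learn on the combined batch.
        weights = learn_on_batch(weights, train_batch)
        # Broadcast the new weights to every worker.
        for w in workers:
            w.set_weights(weights)

print(round(weights["bias"], 2))
```

Broadcasting after every update keeps all workers sampling with the latest weights, which is the role `ray.put` plus `set_weights.remote` plays in the Ray version.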