### RLlib
- Scalability: RL workloads are compute-intensive and often need to scale out across a cluster
- Unified API
- Part of Ray (ships as `ray.rllib`)

### Ray
- Parallelism and scalability

### Experiment
1. An RL environment (e.g. CartPole-v1)
2. An RL algorithm to learn in that environment (e.g. Proximal Policy Optimization (PPO))
3. Configuration (algorithm, experiment, environment config, etc.)
4. Experiment runner (Ray Tune)
    - Iterating training and evaluation
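To make the four pieces above concrete, here is a minimal toy sketch in plain Python with no RLlib involved. Everything in it is hypothetical and chosen for illustration only: `TwoArmedBandit` stands in for the environment, `EpsilonGreedy` for the algorithm, `ExperimentConfig` for the configuration, and `run_experiment` for the runner that alternates training and evaluation.

```python
import random
from dataclasses import dataclass


class TwoArmedBandit:
    """Toy environment (hypothetical): arm 1 always pays 1.0, arm 0 pays 0.0."""

    n_actions = 2

    def step(self, action: int) -> float:
        return 1.0 if action == 1 else 0.0


class EpsilonGreedy:
    """Toy algorithm (hypothetical): sample-average values, epsilon-greedy exploration."""

    def __init__(self, n_actions: int, epsilon: float, seed: int):
        self.values = [0.0] * n_actions
        self.counts = [0] * n_actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def act(self, explore: bool) -> int:
        # With probability epsilon pick a random arm, otherwise the best-known arm.
        if explore and self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, action: int, reward: float) -> None:
        # Incremental running average of the rewards seen for this arm.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


@dataclass
class ExperimentConfig:
    """Configuration for algorithm and experiment (all field names are illustrative)."""

    iterations: int = 5
    train_steps: int = 50
    eval_episodes: int = 10
    epsilon: float = 0.2
    seed: int = 0


def run_experiment(cfg: ExperimentConfig) -> list[dict]:
    """Experiment runner: alternate a training phase and a greedy evaluation phase."""
    env = TwoArmedBandit()
    algo = EpsilonGreedy(env.n_actions, cfg.epsilon, cfg.seed)
    history = []
    for i in range(cfg.iterations):
        for _ in range(cfg.train_steps):  # training: explore and learn
            action = algo.act(explore=True)
            algo.update(action, env.step(action))
        rewards = [  # evaluation: act greedily, do not learn
            env.step(algo.act(explore=False)) for _ in range(cfg.eval_episodes)
        ]
        history.append({"iteration": i, "eval_reward_mean": sum(rewards) / len(rewards)})
    return history


history = run_experiment(ExperimentConfig())
for row in history:
    print(row)
```

In RLlib these roles are played by a Gym-style environment, an algorithm class such as `PPO`, its config object, and either a manual `train()` loop or Tune.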

In [23]:
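import os
import shutil

CHECKPOINT_ROOT = "tmp/ppo/taxi"
shutil.rmtree(CHECKPOINT_ROOT, ignore_errors=True)  # clean up checkpoints from old runs

# Clean up old training logs. Ray's default log directory is ~/ray_results;
# adjust this path if your results are stored elsewhere.
ray_results = "/results"
shutil.rmtree(ray_results, ignore_errors=True)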


In [24]:
import ray
from ray.rllib.algorithms.ppo import PPO, PPOConfig

# Shutdown any existing Ray instances
ray.shutdown()

# Initialize Ray
ray.init(ignore_reinit_error=True)

# Configure the PPO agent
config = PPOConfig()
config.environment(env="Taxi-v3")
config.evaluation(evaluation_interval=1, evaluation_duration=10)

# Create a PPO trainer
trainer = PPO(config=config)

# Train the agent
for i in range(10):
    result = trainer.train()
    file_name = trainer.save(CHECKPOINT_ROOT)
    print(result)

# Shutdown Ray
ray.shutdown()

2024-11-04 22:17:53,801	INFO worker.py:1807 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
`UnifiedLogger` will be removed in Ray 2.7.
  return UnifiedLogger(config, logdir, loggers=None)
The `JsonLogger interface is deprecated in favor of the `ray.tune.json.JsonLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))
The `CSVLogger interface is deprecated in favor of the `ray.tune.csv.CSVLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))
The `TBXLogger interface is deprecated in favor of the `ray.tune.tensorboardx.TBXLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))


{'evaluation': {'env_runners': {'episode_reward_max': -477.0, 'episode_reward_min': -803.0, 'episode_reward_mean': -643.0, 'episode_len_mean': 191.8, 'episode_media': {}, 'episodes_timesteps_total': 1918, 'policy_reward_min': {'default_policy': -803.0}, 'policy_reward_max': {'default_policy': -477.0}, 'policy_reward_mean': {'default_policy': -643.0}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [-713.0, -569.0, -564.0, -477.0, -803.0, -623.0, -605.0, -722.0, -596.0, -758.0], 'episode_lengths': [200, 200, 171, 147, 200, 200, 200, 200, 200, 200], 'policy_default_policy_reward': [-713.0, -569.0, -564.0, -477.0, -803.0, -623.0, -605.0, -722.0, -596.0, -758.0]}, 'sampler_perf': {'mean_raw_obs_processing_ms': 0.07706597921064337, 'mean_inference_ms': 0.26600531080602796, 'mean_action_processing_ms': 0.030074783015089653, 'mean_env_wait_ms': 0.015533858751989765, 'mean_env_render_ms': 0.0}, 'num_faulty_episodes': 0, 'connector_metrics': {'ObsPreprocessorConnector_ms': 0.003492832183
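The result dictionary is deeply nested and hard to read when printed in full. As a minimal sketch, the helper below pulls out the evaluation summary. `summarize_eval` is a hypothetical name, and the key layout (`evaluation` → `env_runners` → `episode_reward_mean`, ...) is taken from the output shown above, so it may differ in other Ray versions. The sample dictionary reuses only values visible in that output.

```python
def summarize_eval(result: dict) -> dict:
    """Extract headline evaluation metrics from an RLlib-style result dict.

    Assumes the nesting seen in the printed output above.
    """
    stats = result["evaluation"]["env_runners"]
    return {
        "reward_mean": stats["episode_reward_mean"],
        "reward_min": stats["episode_reward_min"],
        "reward_max": stats["episode_reward_max"],
        "len_mean": stats["episode_len_mean"],
    }


# A cut-down copy of the first iteration's result, using values from the output above.
sample_result = {
    "evaluation": {
        "env_runners": {
            "episode_reward_max": -477.0,
            "episode_reward_min": -803.0,
            "episode_reward_mean": -643.0,
            "episode_len_mean": 191.8,
            "hist_stats": {
                "episode_reward": [-713.0, -569.0, -564.0, -477.0, -803.0,
                                   -623.0, -605.0, -722.0, -596.0, -758.0],
                "episode_lengths": [200, 200, 171, 147, 200, 200, 200, 200, 200, 200],
            },
        }
    }
}

summary = summarize_eval(sample_result)
print(summary)

# Cross-check the reported means against the per-episode history.
hist = sample_result["evaluation"]["env_runners"]["hist_stats"]
recomputed_reward_mean = sum(hist["episode_reward"]) / len(hist["episode_reward"])
recomputed_len_mean = sum(hist["episode_lengths"]) / len(hist["episode_lengths"])
print(recomputed_reward_mean, recomputed_len_mean)
```

Inside the training loop, printing `summarize_eval(result)` instead of `result` gives one short line per iteration. Eight of the ten evaluation episodes have length 200, which suggests they stopped at a step limit rather than finishing the task, as expected from an agent after a single training iteration.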

In [15]:
# Fetch the default policy and inspect its underlying neural network
policy = trainer.get_policy()
model = policy.model
print("Model configuration:")
print(model)

Model configuration:
FullyConnectedNetwork(
  (_logits): SlimFC(
    (_model): Sequential(
      (0): Linear(in_features=256, out_features=6, bias=True)
    )
  )
  (_hidden_layers): Sequential(
    (0): SlimFC(
      (_model): Sequential(
        (0): Linear(in_features=500, out_features=256, bias=True)
        (1): Tanh()
      )
    )
    (1): SlimFC(
      (_model): Sequential(
        (0): Linear(in_features=256, out_features=256, bias=True)
        (1): Tanh()
      )
    )
  )
  (_value_branch_separate): Sequential(
    (0): SlimFC(
      (_model): Sequential(
        (0): Linear(in_features=500, out_features=256, bias=True)
        (1): Tanh()
      )
    )
    (1): SlimFC(
      (_model): Sequential(
        (0): Linear(in_features=256, out_features=256, bias=True)
        (1): Tanh()
      )
    )
  )
  (_value_branch): SlimFC(
    (_model): Sequential(
      (0): Linear(in_features=256, out_features=1, bias=True)
    )
  )
)
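The layer sizes in this printout can be read against the environment: Taxi-v3 has 500 discrete states and 6 actions, which matches the 500 input features (presumably a one-hot encoding of the state) and the 6 output logits. The value function uses a separate branch ending in a single output. As a rough sketch, the parameter count can be worked out by hand from the `Linear` shapes above. `linear_params` is a hypothetical helper, not part of RLlib.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Parameters of a fully connected layer with bias: weights plus biases."""
    return in_features * out_features + out_features


# (in_features, out_features) pairs copied from the printed model.
policy_layers = [(500, 256), (256, 256), (256, 6)]  # _hidden_layers + _logits
value_layers = [(500, 256), (256, 256), (256, 1)]   # _value_branch_separate + _value_branch

policy_params = sum(linear_params(i, o) for i, o in policy_layers)
value_params = sum(linear_params(i, o) for i, o in value_layers)
total_params = policy_params + value_params

print(f"policy network: {policy_params:,} parameters")
print(f"value network:  {value_params:,} parameters")
print(f"total:          {total_params:,} parameters")
```

This gives 195,590 parameters for the policy network and 194,305 for the value network, 389,895 in total. On the live model, `sum(p.numel() for p in model.parameters())` should report the same total, assuming `model` is a PyTorch module as the printout suggests.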
