<a href="https://colab.research.google.com/github/LondonNode/Pearl-tutorials/blob/main/6_Agents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install pearll

# Introduction

This notebook is a tutorial for the `agents` module within Pearl, which is the main interface between the user and the algorithm. Pearl is designed to support three types of agent:

1. Reinforcement learning (RL)
2. Evolutionary Computation (EC)
3. Hybrid algorithms (combination of RL and EC)

All three types of agent derive from the `BaseAgent` class. The goal is to let the user focus on writing the training step itself without having to think about the surrounding infrastructure (e.g. collecting trajectories). As such, the only abstract method is `_fit()`, where the update algorithm should be implemented.
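
The division of labour described above can be sketched in plain Python. This is only an illustration of the pattern, not Pearl's code: `SketchBaseAgent`, `MeanAgent`, `_collect` and the list-based buffer are hypothetical names invented for this example, and the real `BaseAgent` does considerably more.

```python
from abc import ABC, abstractmethod


class SketchBaseAgent(ABC):
    """Hypothetical stand-in for a base agent; NOT Pearl's BaseAgent."""

    def __init__(self) -> None:
        self.buffer = []  # stand-in for a trajectory buffer

    def _collect(self, step: int) -> None:
        # Infrastructure owned by the base class. Here it only stores the
        # step index in place of a real environment transition.
        self.buffer.append(step)

    @abstractmethod
    def _fit(self, batch_size: int) -> dict:
        """The only method a subclass must write: one training update."""

    def fit(self, num_steps: int, batch_size: int) -> list:
        # The training loop lives in the base class and calls _fit().
        logs = []
        for step in range(num_steps):
            self._collect(step)
            logs.append(self._fit(batch_size))
        return logs


class MeanAgent(SketchBaseAgent):
    def _fit(self, batch_size: int) -> dict:
        # Toy "update": report the mean of the most recent batch.
        batch = self.buffer[-batch_size:]
        return {"batch_mean": sum(batch) / len(batch)}


logs = MeanAgent().fit(num_steps=3, batch_size=2)
```

The subclass never touches the loop or the buffer handling; it only decides what a single update does with the collected data.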

Key features include:

| Feature                           | Pearl |
|-----------------------------------|-------|
| Modular components                | ✅    |
| Dataclass settings                | ✅    |
| Tensorboard integration           | ✅    |
| Single or multi-agent             | ✅    |
| Callbacks                         | ✅    |
| Control log frequency             | ✅    |
| Control train frequency           | ✅    |
| Trajectory log file (`agent.log`) | ✅    |

Some agents are already implemented as examples and a [template](https://github.com/LondonNode/Pearl/blob/main/pearll/agents/templates.py) is included for creating your own agents.

In [2]:
# A demo script is also included to run the different agents implemented in
# Pearl from the command line.

!python -m pearll.demo -h

usage: demo.py [-h] [--agent AGENT]

pearll demo with preloaded hyperparameters

optional arguments:
  -h, --help     show this help message and exit
  --agent AGENT  Agent to demo


# Reinforcement Learning

Pearl supports single- and multi-agent RL. We'll use the CartPole gym environment for its simplicity and implement the off-policy DQN algorithm on it. Pearl already includes a DQN implementation, but we'll write a simplified version here to demonstrate the flow.
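
As background for the update step in the agent, here is a minimal NumPy sketch of the one-step TD target that DQN regresses towards. It is a standalone illustration under the usual DQN definition; the function name `td_zero_targets` and its signature are assumptions made for this example and are not taken from Pearl's `return_estimators` module.

```python
import numpy as np


def td_zero_targets(rewards, next_q_values, dones, discount=0.99):
    """One-step TD target: r + discount * (1 - done) * max_a' Q_target(s', a')."""
    # Greedy value of the next state under the target network's Q estimates.
    max_next_q = next_q_values.max(axis=-1)
    # Terminal transitions (done = 1) do not bootstrap from the next state.
    return rewards + discount * (1.0 - dones) * max_next_q


rewards = np.array([1.0, 1.0])
dones = np.array([0.0, 1.0])                  # second transition is terminal
next_q = np.array([[0.5, 2.0], [3.0, 1.0]])   # made-up Q values, 2 actions
targets = td_zero_targets(rewards, next_q, dones, discount=0.9)
```

The first target bootstraps from the best next-state value, while the second is just the reward because the episode ended there.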

In [12]:
from pearll.agents import BaseAgent
from pearll.models import ActorCritic
from pearll.updaters.critics import BaseCriticUpdater, DiscreteQRegression
from pearll.buffers import BaseBuffer, ReplayBuffer
from pearll.explorers import BaseExplorer
from pearll.callbacks import BaseCallback
from pearll.common.type_aliases import Log
# The utils file in common contains some useful data manipulation functions.
from pearll.common.utils import to_numpy
# Pearl includes implementations of common estimators, TD(0), GAE, etc.
from pearll.signal_processing import return_estimators
from pearll.settings import (
    BufferSettings,
    ExplorerSettings,
    LoggerSettings,
    MiscellaneousSettings,
    OptimizerSettings,
    Settings,
)

from typing import Type, Optional, List
import gym
import numpy as np
import torch as T



class DQN(BaseAgent):
  # Note that many of the agent parameters are grouped into settings objects for
  # a cleaner interface. The goal of the __init__() method is simply to initialize
  # all the other modules passed to the class, so it should generally be quite
  # simple to implement.
  def __init__(
    self,
    env: gym.Env,
    model: ActorCritic,
    trajectory_discount: float = 0.99, # can add extra parameters not defined in the BaseAgent
    updater_class: Type[BaseCriticUpdater] = DiscreteQRegression, # easy to swap for another critic updater
    optimizer_settings: OptimizerSettings = OptimizerSettings(),
    buffer_class: Type[BaseBuffer] = ReplayBuffer, # easy to swap for another buffer
    buffer_settings: BufferSettings = BufferSettings(),
    action_explorer_class: Type[BaseExplorer] = BaseExplorer, # easy to swap for another explorer
    explorer_settings: ExplorerSettings = ExplorerSettings(start_steps=1000),
    callbacks: Optional[List[Type[BaseCallback]]] = None, # easy to swap for another callback (in the next tutorial...)
    callback_settings: Optional[List[Settings]] = None,
    logger_settings: LoggerSettings = LoggerSettings(),
    misc_settings: MiscellaneousSettings = MiscellaneousSettings(), # note seed is stored here!
  ) -> None:
    # The BaseAgent handles initialization of many of the modules. This is
    # possible because the submodules share the same initialization interface.
    super().__init__(
      env,
      model,
      action_explorer_class=action_explorer_class,
      explorer_settings=explorer_settings,
      buffer_class=buffer_class,
      buffer_settings=buffer_settings,
      logger_settings=logger_settings,
      callbacks=callbacks,
      callback_settings=callback_settings,
      misc_settings=misc_settings,
    )

    self.q_regression = updater_class(
        loss_class = optimizer_settings.loss_class,
        optimizer_class = optimizer_settings.optimizer_class,
        max_grad = optimizer_settings.max_grad
    )

    self.learning_rate = optimizer_settings.learning_rate
    self.trajectory_discount = trajectory_discount

  # This abstract method must be implemented. The _fit() method defines the
  # actual training update step. It can also be quite simple, thanks to the
  # pre-implemented flexible components that handle much of the deeper logic.
  def _fit(
        self, batch_size: int, actor_epochs: int = 1, critic_epochs: int = 1
  ) -> Log:
    critic_losses = np.zeros(shape=(critic_epochs))
    for i in range(critic_epochs):
      # Sample trajectories
      trajectories = self.buffer.sample(batch_size=batch_size, flatten_env=False)

      # Get target Q values for regression loss
      with T.no_grad():
        next_q_values = self.model.forward_target_critics(
          trajectories.next_observations
        )
        next_q_values = to_numpy(next_q_values.max(dim=-1)[0])
        next_q_values = next_q_values[..., np.newaxis]
        target_q_values = return_estimators.TD_zero( # TD(0) is already implemented!
          trajectories.rewards,
          next_q_values,
          trajectories.dones,
          self.trajectory_discount,
        )

      # Run Q regression
      updater_log = self.q_regression(
        self.model,
        trajectories.observations,
        target_q_values,
        trajectories.actions,
        learning_rate=self.learning_rate,
      )
      critic_losses[i] = updater_log.loss

    # Update target networks
    self.model.assign_targets()

    # Returns a Log object which contains useful training statistics to be
    # logged in Tensorboard.
    return Log(critic_loss=np.mean(critic_losses))

In [13]:
# Single agent training

from pearll.models import Critic, EpsilonGreedyActor, ActorCritic
from pearll.models.encoders import IdentityEncoder
from pearll.models.torsos import MLP
from pearll.models.heads import DiscreteQHead


encoder = IdentityEncoder()
torso = MLP(layer_sizes=[4, 64, 32], activation_fn=T.nn.ReLU)
head = DiscreteQHead(input_shape=32, output_shape=2)

# Epsilon greedy policy included!
actor = EpsilonGreedyActor(
    critic_encoder=encoder, critic_torso=torso, critic_head=head
)
critic = Critic(encoder=encoder, torso=torso, head=head, create_target=True)

agent = DQN(
  env=gym.make("CartPole-v0"),
  model = ActorCritic(actor, critic),
  logger_settings = LoggerSettings(log_frequency=("episode", 1), verbose=True),
  explorer_settings=ExplorerSettings(start_steps=1000),
)

# max episode reward = 200
# Note that an agent.log file has also been saved which stores all the trajectories run.
agent.fit(num_steps=20000, batch_size=32, critic_epochs=16, train_frequency=("episode", 1))

Using device cpu
44: Log(reward=45.0, actor_loss=None, critic_loss=1.2274129167199135, divergence=None, entropy=None)
69: Log(reward=25.0, actor_loss=None, critic_loss=1.0130657367408276, divergence=None, entropy=None)
80: Log(reward=11.0, actor_loss=None, critic_loss=0.8367413617670536, divergence=None, entropy=None)
105: Log(reward=25.0, actor_loss=None, critic_loss=0.6842276882380247, divergence=None, entropy=None)
125: Log(reward=20.0, actor_loss=None, critic_loss=0.6481776218861341, divergence=None, entropy=None)
141: Log(reward=16.0, actor_loss=None, critic_loss=0.6356014953926206, divergence=None, entropy=None)
159: Log(reward=18.0, actor_loss=None, critic_loss=0.8716539908200502, divergence=None, entropy=None)
173: Log(reward=14.0, actor_loss=None, critic_loss=1.1501947110518813, divergence=None, entropy=None)
211: Log(reward=38.0, actor_loss=None, critic_loss=1.4279367625713348, divergence=None, entropy=None)
252: Log(reward=41.0, actor_loss=None, critic_loss=1.620548648759722

In [15]:
# Multi agent training
# Note that this should train in far fewer steps!

from pearll.models import Critic, EpsilonGreedyActor, ActorCritic
from pearll.models.encoders import IdentityEncoder
from pearll.models.torsos import MLP
from pearll.models.heads import DiscreteQHead
from pearll.settings import PopulationSettings


encoder = IdentityEncoder()
torso = MLP(layer_sizes=[4, 64, 32], activation_fn=T.nn.ReLU)
head = DiscreteQHead(input_shape=32, output_shape=2)

# Epsilon greedy policy included!
actor = EpsilonGreedyActor(
    critic_encoder=encoder, critic_torso=torso, critic_head=head
)
critic = Critic(encoder=encoder, torso=torso, head=head, create_target=True)
settings = PopulationSettings(
    actor_population_size=5,
    critic_population_size=5,
    actor_distribution="normal",
    critic_distribution="normal"
)

# Note the need for the vector environment!
agent = DQN(
  env=gym.vector.make("CartPole-v0", num_envs=5, asynchronous=True),
  model = ActorCritic(actor, critic, settings),
  logger_settings = LoggerSettings(log_frequency=("episode", 1), verbose=True),
  explorer_settings=ExplorerSettings(start_steps=1000),
)

# max episode reward = 200
agent.fit(num_steps=20000, batch_size=32, critic_epochs=16, train_frequency=("episode", 1))

Using device cpu
31: Log(reward=32.0, actor_loss=None, critic_loss=None, divergence=None, entropy=None)
66: Log(reward=35.0, actor_loss=None, critic_loss=622.3754959106445, divergence=None, entropy=None)
93: Log(reward=27.0, actor_loss=None, critic_loss=636.6338005065918, divergence=None, entropy=None)
170: Log(reward=77.0, actor_loss=None, critic_loss=508.28544425964355, divergence=None, entropy=None)
196: Log(reward=26.0, actor_loss=None, critic_loss=299.8132076263428, divergence=None, entropy=None)
254: Log(reward=58.0, actor_loss=None, critic_loss=184.33353519439697, divergence=None, entropy=None)
290: Log(reward=36.0, actor_loss=None, critic_loss=157.3399338722229, divergence=None, entropy=None)
355: Log(reward=65.0, actor_loss=None, critic_loss=168.7603464126587, divergence=None, entropy=None)
437: Log(reward=82.0, actor_loss=None, critic_loss=203.29157638549805, divergence=None, entropy=None)
456: Log(reward=19.0, actor_loss=None, critic_loss=162.80573916435242, divergence=None,

# Evolutionary Computation

Pearl also supports evolutionary computation (EC) algorithms. This is useful in two settings:

1. When trying to optimize some black box function where only the function value for different inputs is known. We'll use the sphere function to demonstrate this for simplicity.
2. When trying to optimize an agent to interact with some environment, like RL. For this we will once again use gym CartPole as an example.

For both cases, we will implement the OpenAI evolutionary strategy. Once again, Pearl already includes an implementation of this algorithm, but we'll write it again here.
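
To make the idea concrete before wiring it into Pearl, here is a minimal NumPy sketch of an evolution-strategy step on the sphere function. It is a standalone approximation: `sphere_reward`, `es_step` and all hyperparameter values are assumptions chosen for this example, and it does not use Pearl's `NoisyGradientAscent` updater.

```python
import numpy as np


def sphere_reward(x):
    # Negative sphere function, so the maximum reward (0) is at the origin.
    return -np.sum(x ** 2, axis=-1)


def es_step(mean, rng, population_size=50, std=0.1, learning_rate=0.02):
    # Perturb the current parameters with Gaussian noise.
    noise = rng.standard_normal((population_size, mean.size))
    rewards = sphere_reward(mean + std * noise)
    # Standardize rewards so the step size does not depend on reward scale.
    scaled = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Reward-weighted sum of the noise approximates an ascent direction.
    direction = noise.T @ scaled / (population_size * std)
    return mean + learning_rate * direction


rng = np.random.default_rng(0)
mean = np.array([1.0, 1.0])
start_reward = sphere_reward(mean)
for _ in range(200):
    mean = es_step(mean, rng)
end_reward = sphere_reward(mean)
```

Only function values are used here, never gradients of the objective, which is why the same update applies to black-box functions and to episode returns from an environment.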

In [24]:
from pearll.agents import BaseAgent
from pearll.models import ActorCritic
from pearll.updaters.evolution import BaseEvolutionUpdater, NoisyGradientAscent
from pearll.buffers import BaseBuffer, RolloutBuffer
from pearll.explorers import BaseExplorer
from pearll.callbacks import BaseCallback
from pearll.common.type_aliases import Log
# The utils file in common contains some useful data manipulation functions.
from pearll.common.utils import filter_rewards
from pearll.settings import (
    BufferSettings,
    ExplorerSettings,
    LoggerSettings,
    MiscellaneousSettings,
    OptimizerSettings,
    Settings,
)

from typing import Type, Optional, List
import gym
import numpy as np
import torch as T
from sklearn.preprocessing import scale


class EvolutionaryStrategy(BaseAgent):
  # Note that many of the agent parameters are grouped into settings objects for
  # a cleaner interface. The goal of the __init__() method is simply to initialize
  # all the other modules passed to the class, so it should generally be quite
  # simple to implement.
  def __init__(
    self,
    env: gym.vector.VectorEnv,
    model: Optional[ActorCritic] = None,
    updater_class: Type[BaseEvolutionUpdater] = NoisyGradientAscent,
    learning_rate: float = 1e-3,
    buffer_class: Type[BaseBuffer] = RolloutBuffer,
    buffer_settings: BufferSettings = BufferSettings(),
    action_explorer_class: Type[BaseExplorer] = BaseExplorer,
    explorer_settings: ExplorerSettings = ExplorerSettings(start_steps=0),
    callbacks: Optional[List[Type[BaseCallback]]] = None,
    callback_settings: Optional[List[Settings]] = None,
    logger_settings: LoggerSettings = LoggerSettings(),
    misc_settings: MiscellaneousSettings = MiscellaneousSettings(),
  ) -> None:
    # The BaseAgent handles initialization of many of the modules. This is
    # possible because the submodules share the same initialization interface.
    super().__init__(
      env=env,
      model=model,
      action_explorer_class=action_explorer_class,
      explorer_settings=explorer_settings,
      buffer_class=buffer_class,
      buffer_settings=buffer_settings,
      logger_settings=logger_settings,
      callbacks=callbacks,
      callback_settings=callback_settings,
      misc_settings=misc_settings,
    )

    self.learning_rate = learning_rate
    self.grad_ascent = updater_class(model=self.model)

  # This abstract method must be implemented. The _fit() method defines the
  # actual training update step. It can also be quite simple, thanks to the
  # pre-implemented flexible components that handle much of the deeper logic.
  def _fit(
      self, batch_size: int, actor_epochs: int = 1, critic_epochs: int = 1
  ) -> Log:
    divergences = np.zeros(actor_epochs)
    entropies = np.zeros(actor_epochs)

    # online learning, get all trajectories collected to train with.
    trajectories = self.buffer.all(flatten_env=False)

    # process rewards
    rewards = trajectories.rewards.squeeze()
    rewards = filter_rewards(rewards, trajectories.dones.squeeze())
    if rewards.ndim > 1:
      rewards = rewards.sum(axis=-1)
    scaled_rewards = scale(rewards)

    # update steps
    optimization_direction = np.dot(self.grad_ascent.normal_dist.T, scaled_rewards) / (
        np.mean(self.grad_ascent.std) * self.env.num_envs
    )
    for i in range(actor_epochs):
      log = self.grad_ascent(
          learning_rate=self.learning_rate,
          optimization_direction=optimization_direction,
      )
      divergences[i] = log.divergence
      entropies[i] = log.entropy
    self.buffer.reset()

    # Returns a Log object which contains useful training statistics to be
    # logged in Tensorboard.
    return Log(divergence=divergences.sum(), entropy=entropies.mean())

In [25]:
import gym

class Sphere(gym.Env):
  """
  Sphere(2) function.
  """

  def __init__(self):
    self.action_space = gym.spaces.Box(low=-100, high=100, shape=(2,))
    self.observation_space = gym.spaces.Discrete(1)

  def step(self, action):
    return 0, -(action[0] ** 2 + action[1] ** 2), False, {}

  def reset(self):
    return 0

In [26]:
from pearll.models import Dummy, ActorCritic
from pearll.settings import PopulationSettings


POPULATION_SIZE = 10
settings = PopulationSettings(
    actor_population_size=POPULATION_SIZE,
    actor_distribution="normal"
)

env = gym.vector.SyncVectorEnv([lambda: Sphere() for _ in range(POPULATION_SIZE)])
actor = Dummy(space=env.single_action_space, state=np.array([10, 10]))
critic = Dummy(space=env.single_action_space, state=np.array([10, 10]))

agent = EvolutionaryStrategy(
    env=env,
    model=ActorCritic(actor, critic, settings),
    learning_rate=1,
    logger_settings=LoggerSettings(log_frequency=("step", 1), verbose=True)
)

# in this case batch_size doesn't actually matter since not used in _fit
# max reward = 0
# Note that an agent.log file has also been saved which stores all the trajectories run.
agent.fit(num_steps=20, batch_size=1)

Using device cpu
0: Log(reward=-209.6029469773056, actor_loss=None, critic_loss=None, divergence=None, entropy=None)
1: Log(reward=-185.33994487608464, actor_loss=None, critic_loss=None, divergence=0.28313300013542175, entropy=1.4189385175704956)
2: Log(reward=-167.7082046027137, actor_loss=None, critic_loss=None, divergence=0.16955488920211792, entropy=1.4189385175704956)
3: Log(reward=-142.34828060831654, actor_loss=None, critic_loss=None, divergence=0.15549159049987793, entropy=1.4189385175704956)
4: Log(reward=-121.34994700509928, actor_loss=None, critic_loss=None, divergence=0.22404944896697998, entropy=1.4189385175704956)
5: Log(reward=-108.65142401244948, actor_loss=None, critic_loss=None, divergence=0.2476799488067627, entropy=1.4189385175704956)
6: Log(reward=-84.33634550448184, actor_loss=None, critic_loss=None, divergence=0.13305053114891052, entropy=1.4189385175704956)
7: Log(reward=-68.19667188972528, actor_loss=None, critic_loss=None, divergence=0.20013868808746338, entro

In [29]:
from pearll.models.encoders import IdentityEncoder
from pearll.models.torsos import MLP
from pearll.models.heads import CategoricalHead
from pearll.models import Dummy, ActorCritic, Actor
from pearll.settings import PopulationSettings


POPULATION_SIZE=20
env = gym.vector.make("CartPole-v0", num_envs=POPULATION_SIZE, asynchronous=True)

actor = Actor(
  encoder=IdentityEncoder(),
  torso=MLP(layer_sizes=[4, 20, 10], activation_fn=T.nn.ReLU), 
  head=CategoricalHead(input_shape=10, action_size=2, activation_fn=T.nn.Tanh)
)
critic = Dummy(space=env.single_action_space)

settings = PopulationSettings(
  actor_population_size=POPULATION_SIZE,
  actor_distribution="normal",
  actor_std=0.01,
)

agent = EvolutionaryStrategy(
  env=env,
  model=ActorCritic(actor, critic, settings),
  learning_rate=0.001,
  logger_settings=LoggerSettings(log_frequency=("episode", 1), verbose=True)
)

# in this case batch_size doesn't actually matter since not used in _fit
# max reward = 200
# Note that an agent.log file has also been saved which stores all the trajectories run.
agent.fit(num_steps=20000, batch_size=0, train_frequency=("episode", 1))

Using device cpu
50: Log(reward=51.0, actor_loss=None, critic_loss=None, divergence=0.0, entropy=-3.1862316131591797)
102: Log(reward=52.0, actor_loss=None, critic_loss=None, divergence=2.4583632946014404, entropy=-3.1862316131591797)
159: Log(reward=57.0, actor_loss=None, critic_loss=None, divergence=2.850159168243408, entropy=-3.1862316131591797)
226: Log(reward=67.0, actor_loss=None, critic_loss=None, divergence=2.5895583629608154, entropy=-3.1862316131591797)
284: Log(reward=58.0, actor_loss=None, critic_loss=None, divergence=2.703207492828369, entropy=-3.1862316131591797)
337: Log(reward=53.0, actor_loss=None, critic_loss=None, divergence=2.8306572437286377, entropy=-3.1862316131591797)
389: Log(reward=52.0, actor_loss=None, critic_loss=None, divergence=2.676666021347046, entropy=-3.1862316131591797)
432: Log(reward=43.0, actor_loss=None, critic_loss=None, divergence=2.6999988555908203, entropy=-3.1862316131591797)
480: Log(reward=48.0, actor_loss=None, critic_loss=None, divergenc

# Hybrid Algorithms

Hybrid algorithms can be implemented by simply having either the actor or critic be updated by an evolution updater while the other is updated by an actor or critic updater using backpropagation. The flow is the same as with the RL and EC agents. An example is CEM-RL, which is implemented [here](https://github.com/LondonNode/Pearl/blob/main/pearll/agents/cem_rl.py).
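
The split between the two kinds of update can be sketched on a toy problem. This is not CEM-RL and not Pearl code: the scalar "actor" and "critic", the reward function and all constants are assumptions made up for this example. The actor is updated by an elite-selection evolution step, while the critic is updated by gradient descent on a squared error.

```python
import numpy as np

rng = np.random.default_rng(0)


def reward_fn(actions):
    # Toy black-box reward, maximized at action = 3.
    return -(actions - 3.0) ** 2


actor_mean = 0.0    # actor "parameter", updated by evolution (no gradients)
critic_value = 0.0  # critic "parameter", updated by gradient descent
ACTOR_STD, POPULATION, ELITES, CRITIC_LR = 0.5, 20, 5, 0.1

for _ in range(100):
    # Evolution update for the actor: sample a population, keep the elites.
    population = actor_mean + ACTOR_STD * rng.standard_normal(POPULATION)
    rewards = reward_fn(population)
    elite_idx = np.argsort(rewards)[-ELITES:]
    actor_mean = population[elite_idx].mean()

    # Gradient update for the critic: minimize mean((critic_value - reward)^2).
    grad = 2.0 * np.mean(critic_value - rewards)
    critic_value -= CRITIC_LR * grad
```

Both updates consume the same batch of rewards, which mirrors how a hybrid agent's `_fit()` can call an evolution updater and a gradient-based updater side by side.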

# Post Training

The `BaseAgent` class also includes several methods for making predictions after training:

- `predict`: predict an action from the trained policy.
- `action_distribution`: get the policy distribution given an observation.
- `critic`: get the value of an observation, or the Q-value of an observation-action pair.
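
The difference between querying a policy distribution and drawing an action from it can be illustrated with a small NumPy sketch for a discrete action space. The logits below are made-up numbers and `softmax` is a helper written for this example; none of it calls Pearl's API.

```python
import numpy as np


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


rng = np.random.default_rng(0)
logits = np.array([0.2, -0.1])   # hypothetical policy-head output, 2 actions

probs = softmax(logits)                             # the policy distribution
sampled_action = rng.choice(len(probs), p=probs)    # one stochastic action
greedy_action = int(np.argmax(probs))               # most likely action
```

A categorical policy returns the full probability vector, whereas a prediction is a single action index drawn from it (or, in a greedy variant, its most likely entry).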

A plotting script is also included to plot the results of various runs. See the [technical report](https://arxiv.org/abs/2201.09568) for examples of this and more details.

In [41]:
# vector environment so just pick one of them by indexing
obs = env.reset()[0]
print(f"Given observation: {obs}")
print(f"Policy distribution: Categorical({agent.action_distribution(obs).probs})")
print(f"Action sampled from policy distribution: {agent.predict(obs)}")

Given observation: [-0.01213112 -0.00454794 -0.02761681  0.01953347]
Policy distribution: Categorical(tensor([0.5007, 0.4993], grad_fn=<SoftmaxBackward0>))
Action sampled from policy distribution: 1


In [43]:
!python -m pearll.plot -h

usage: plot.py [-h] -p PATHS [PATHS ...] --metric METRIC --titles TITLES
               [TITLES ...] [--num-cols NUM_COLS] [--interval INTERVAL]
               [--legend LEGEND [LEGEND ...]] [--window WINDOW]
               [--xlabel XLABEL] [--ylabel YLABEL] [--x-axis X_AXIS]
               [--y-axis Y_AXIS] [--log-y]
               [--save-types SAVE_TYPES [SAVE_TYPES ...]]
               [--save-path SAVE_PATH]

optional arguments:
  -h, --help            show this help message and exit
  -p PATHS [PATHS ...], --paths PATHS [PATHS ...]
  --metric METRIC
  --titles TITLES [TITLES ...]
  --num-cols NUM_COLS
  --interval INTERVAL
  --legend LEGEND [LEGEND ...]
  --window WINDOW
  --xlabel XLABEL
  --ylabel YLABEL
  --x-axis X_AXIS
  --y-axis Y_AXIS
  --log-y
  --save-types SAVE_TYPES [SAVE_TYPES ...]
  --save-path SAVE_PATH
