<a href="https://colab.research.google.com/github/LondonNode/Pearl-tutorials/blob/main/2_Buffers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install pearll

# Introduction

This notebook is a tutorial for the `buffers` module within Pearl. The module implements experience buffers that store and sample trajectories. Three buffer types are provided: `ReplayBuffer`, `RolloutBuffer` and `HERBuffer`. A `BaseBuffer` class is also provided for creating other buffers that are compatible with the rest of the library.

# Base Buffer

The `BaseBuffer` is the base class from which all other buffers are derived. It declares five abstract methods that the user must define when creating a new buffer:

- `reset`: reset the buffer to its initial empty state
- `add_trajectory`: add a single trajectory to the buffer
- `sample`: sample a batch of trajectories
- `last`: get the most recent batch of trajectories stored
- `all`: get all stored trajectories
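
To make the role of each method concrete, here is a minimal, library-free sketch of the five methods above. It is not Pearl's implementation: `ToyBuffer` and `Transition` are hypothetical names, storage is a plain `deque` rather than preallocated arrays, and the `flatten_env` and `dtype` arguments are left out.

```python
import random
from collections import deque
from typing import Deque, List, NamedTuple


class Transition(NamedTuple):
    """One environment step; a hypothetical stand-in for Pearl's trajectory types."""
    observation: List[float]
    action: int
    reward: float
    next_observation: List[float]
    done: bool


class ToyBuffer:
    """Illustrates the five buffer methods with a deque (toy sketch only)."""

    def __init__(self, buffer_size: int) -> None:
        # A deque with maxlen drops the oldest entry once the buffer is full.
        self.storage: Deque[Transition] = deque(maxlen=buffer_size)

    def reset(self) -> None:
        """Return the buffer to its initial empty state."""
        self.storage.clear()

    def add_trajectory(self, observation, action, reward, next_observation, done) -> None:
        """Store a single step."""
        self.storage.append(Transition(observation, action, reward, next_observation, done))

    def sample(self, batch_size: int) -> List[Transition]:
        """Draw a random batch (with replacement) from what is stored."""
        return random.choices(list(self.storage), k=batch_size)

    def last(self, batch_size: int) -> List[Transition]:
        """Return the most recent batch_size steps, oldest first."""
        return list(self.storage)[-batch_size:]

    def all(self) -> List[Transition]:
        """Return everything currently stored, oldest first."""
        return list(self.storage)


toy = ToyBuffer(buffer_size=3)
for step in range(5):
    toy.add_trajectory([float(step)], 0, 1.0, [float(step + 1)], False)

print([t.observation[0] for t in toy.all()])    # [2.0, 3.0, 4.0]: the two oldest steps were dropped
print([t.observation[0] for t in toy.last(2)])  # [3.0, 4.0]
```

The distinction to notice is that `sample` returns steps in random order, while `last` and `all` preserve insertion order.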

In [2]:
from pearll.buffers import BaseBuffer
from pearll.common.type_aliases import Observation, Trajectories
from pearll.common.enumerations import TrajectoryType

import gym
from typing import Union
import torch as T
import numpy as np


class YourBuffer(BaseBuffer):
  # The BaseBuffer __init__ method automatically creates arrays for the
  # observations, actions, rewards and dones collected in MDP trajectories
  # and checks if these arrays will fit in memory.
  def __init__(
    self,
    env: gym.Env,
    buffer_size: int,
    device: Union[str, T.device] = "auto",
  ) -> None:
    super().__init__(
        env,
        buffer_size,
        device,
    )

  # An implementation of reset is done in BaseBuffer that assumes only the
  # BaseBuffer trajectory arrays are used. It's kept abstract to encourage the 
  # user to think about its implementation for themselves though.
  def reset(self) -> None:
    super().reset()

  def add_trajectory(
    self,
    observation: Observation,
    action: Union[np.ndarray, int],
    reward: Union[float, np.ndarray],
    next_observation: Observation,
    done: Union[bool, np.ndarray],
  ) -> None:
    pass

  def sample(
    self,
    batch_size: int,
    flatten_env: bool = False, # whether to include an env axis in sampled trajectories
    dtype: Union[str, TrajectoryType] = "numpy", # return type of trajectories, numpy or torch
  ) -> Trajectories:
    pass

  def last(
    self,
    batch_size: int,
    flatten_env: bool = False, # whether to include an env axis in sampled trajectories
    dtype: Union[str, TrajectoryType] = "numpy", # return type of trajectories, numpy or torch
  ) -> Trajectories:
    pass

  def all(
    self,
    flatten_env: bool = False, # whether to include an env axis in sampled trajectories
    dtype: Union[str, TrajectoryType] = "numpy" # return type of trajectories, numpy or torch
  ) -> Trajectories:
    pass


env = gym.make("CartPole-v1")
buffer = YourBuffer(env=env, buffer_size=10)

print(f"The environment has action space {env.action_space} and Box observation space shape {env.observation_space.shape}\n")
print(f"Buffer observation array shape = {buffer.observations.shape}")
print(f"Buffer action array shape = {buffer.actions.shape}")
print(f"Buffer reward array shape = {buffer.rewards.shape}")
print(f"Buffer done array shape = {buffer.dones.shape}")

The environment has action space Discrete(2) and Box observation space shape (4,)

Buffer observation array shape = (10, 4)
Buffer action array shape = (10, 1)
Buffer reward array shape = (10, 1)
Buffer done array shape = (10, 1)


# Replay Buffer

The `ReplayBuffer` class handles sample collection and processing, typically for off-policy algorithms. A key feature is that a single array holds both the observations and the next observations of the stored trajectories, rather than two separate arrays. This saves memory, but it assumes that observations are stored sequentially and that observations can only be 'blended' at the boundary between the end of one trajectory and the start of the next.
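
The single-array idea can be sketched without the library. The toy below is an illustration under stated assumptions, not Pearl's code: `SharedObsBuffer`, `pos`, `full` and `valid_indices` are hypothetical names, and observations are plain lists. Each step writes its observation to the current slot and its next observation to the following slot, so a pair is rebuilt from slots `i` and `i + 1`.

```python
from typing import List, Optional, Tuple

Obs = List[int]


class SharedObsBuffer:
    """Toy ring buffer that stores every observation once (illustrative only)."""

    def __init__(self, buffer_size: int) -> None:
        self.buffer_size = buffer_size
        self.observations: List[Optional[Obs]] = [None] * buffer_size
        self.pos = 0       # slot the next observation will be written to
        self.full = False  # True once the buffer has wrapped around

    def add(self, observation: Obs, next_observation: Obs) -> None:
        nxt = (self.pos + 1) % self.buffer_size
        self.observations[self.pos] = observation
        # The next observation goes in the following slot. With sequential
        # steps, the next call rewrites that slot with the same values.
        self.observations[nxt] = next_observation
        self.pos = nxt
        self.full = self.full or nxt == 0

    def get(self, index: int) -> Tuple[Optional[Obs], Optional[Obs]]:
        """Rebuild (observation, next_observation) from the single array."""
        return self.observations[index], self.observations[(index + 1) % self.buffer_size]

    def valid_indices(self) -> List[int]:
        """Indices whose pair is consistent; the write slot is skipped once full."""
        if not self.full:
            return list(range(self.pos))
        return [i for i in range(self.buffer_size) if i != self.pos]


buf = SharedObsBuffer(buffer_size=3)
for step in range(3):
    buf.add([step], [step + 1])  # steps 0->1, 1->2, 2->3

print(buf.get(1))           # ([1], [2])
print(buf.get(0))           # ([3], [1]): slot 0 now holds the newest next observation
print(buf.valid_indices())  # [1, 2]
```

In this toy, the only inconsistent pair is the one at the write position, where the newest next observation has overwritten the oldest observation. That is the price paid for storing each observation once instead of twice.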

In [3]:
from pearll.buffers import ReplayBuffer
from pearll.common.type_aliases import Trajectories
import gym


env = gym.make("CartPole-v1")
buffer = ReplayBuffer(env=env, buffer_size=5)

print("Environment trajectories in order:")
obs = env.reset()
for i in range(5):
  action = env.action_space.sample()
  next_obs, reward, done, _ = env.step(action)
  buffer.add_trajectory(obs, action, reward, next_obs, done)
  print(f"{i+1}. {Trajectories(obs, action, reward, next_obs, done)}")
  obs = next_obs

trajectories = buffer.sample(batch_size=5)
print("\nSampled trajectories are randomly indexed:")
for i in range(5):
  print(f"{i+1}. {Trajectories(trajectories.observations[i], trajectories.actions[i], trajectories.rewards[i], trajectories.next_observations[i], trajectories.dones[i])}")

Environment trajectories in order:
1. Trajectories(observations=array([-0.00788589,  0.03270863,  0.03365615, -0.02002516]), actions=0, rewards=1.0, next_observations=array([-0.00723172, -0.16287942,  0.03325565,  0.2830838 ]), dones=False)
2. Trajectories(observations=array([-0.00723172, -0.16287942,  0.03325565,  0.2830838 ]), actions=1, rewards=1.0, next_observations=array([-0.01048931,  0.03175281,  0.03891732,  0.00107225]), dones=False)
3. Trajectories(observations=array([-0.01048931,  0.03175281,  0.03891732,  0.00107225]), actions=1, rewards=1.0, next_observations=array([-0.00985425,  0.22629564,  0.03893877, -0.27908224]), dones=False)
4. Trajectories(observations=array([-0.00985425,  0.22629564,  0.03893877, -0.27908224]), actions=1, rewards=1.0, next_observations=array([-0.00532834,  0.4208411 ,  0.03335712, -0.55923413]), dones=False)
5. Trajectories(observations=array([-0.00532834,  0.4208411 ,  0.03335712, -0.55923413]), actions=1, rewards=1.0, next_observations=array([ 0

# Rollout Buffer

The `RolloutBuffer` class handles sample collection and processing, typically for on-policy algorithms. Unlike the `ReplayBuffer`, it uses separate memory for observations and next observations.
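
As a rough illustration of the on-policy pattern, the sketch below keeps observations and next observations in separate lists, reads the collected steps back in order, and then clears the buffer. It is a toy under stated assumptions rather than Pearl's implementation: `ToyRolloutBuffer` is a hypothetical name, and raising an error when the buffer is full is a policy chosen for this example only.

```python
from typing import List, Tuple

Obs = List[float]


class ToyRolloutBuffer:
    """Keeps observations and next observations in separate lists (toy sketch)."""

    def __init__(self, buffer_size: int) -> None:
        self.buffer_size = buffer_size
        self.reset()

    def reset(self) -> None:
        """Discard the stored rollout, e.g. after a policy update."""
        self.observations: List[Obs] = []
        self.next_observations: List[Obs] = []

    def add(self, observation: Obs, next_observation: Obs) -> None:
        if len(self.observations) == self.buffer_size:
            # Assumed policy for this toy: never overwrite unconsumed steps.
            raise RuntimeError("buffer is full; call reset() first")
        self.observations.append(observation)
        self.next_observations.append(next_observation)

    def last(self, batch_size: int) -> List[Tuple[Obs, Obs]]:
        """Return the most recent steps in the order they were collected."""
        return list(zip(self.observations[-batch_size:],
                        self.next_observations[-batch_size:]))


rollout = ToyRolloutBuffer(buffer_size=3)
for step in range(3):
    rollout.add([float(step)], [float(step + 1)])

print(rollout.last(3))  # [([0.0], [1.0]), ([1.0], [2.0]), ([2.0], [3.0])]

# Three steps cost six stored observations here, versus four with a shared array.
print(len(rollout.observations) + len(rollout.next_observations))  # 6

rollout.reset()  # start the next rollout from an empty buffer
```

Separate storage uses more memory, but every stored pair is always consistent, so no index needs to be excluded when reading.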

In [4]:
from pearll.buffers import RolloutBuffer
from pearll.common.type_aliases import Trajectories
import gym


env = gym.make("CartPole-v1")
buffer = RolloutBuffer(env=env, buffer_size=5)

print("Environment trajectories in order:")
obs = env.reset()
for i in range(5):
  action = env.action_space.sample()
  next_obs, reward, done, _ = env.step(action)
  buffer.add_trajectory(obs, action, reward, next_obs, done)
  print(f"{i+1}. {Trajectories(obs, action, reward, next_obs, done)}")
  obs = next_obs

# In this case, the sample method is similar to the last method.
trajectories = buffer.sample(batch_size=5)
print("\nSampled trajectories are also in order:")
for i in range(5):
  print(f"{i+1}. {Trajectories(trajectories.observations[i], trajectories.actions[i], trajectories.rewards[i], trajectories.next_observations[i], trajectories.dones[i])}")

Environment trajectories in order:
1. Trajectories(observations=array([ 0.01372224,  0.01503374,  0.00939766, -0.01590847]), actions=1, rewards=1.0, next_observations=array([ 0.01402292,  0.21001967,  0.00907949, -0.30561157]), dones=False)
2. Trajectories(observations=array([ 0.01402292,  0.21001967,  0.00907949, -0.30561157]), actions=0, rewards=1.0, next_observations=array([ 0.01822331,  0.01476951,  0.00296726, -0.01007908]), dones=False)
3. Trajectories(observations=array([ 0.01822331,  0.01476951,  0.00296726, -0.01007908]), actions=0, rewards=1.0, next_observations=array([ 0.0185187 , -0.18039487,  0.00276568,  0.28353858]), dones=False)
4. Trajectories(observations=array([ 0.0185187 , -0.18039487,  0.00276568,  0.28353858]), actions=0, rewards=1.0, next_observations=array([ 0.01491081, -0.37555615,  0.00843645,  0.5770925 ]), dones=False)
5. Trajectories(observations=array([ 0.01491081, -0.37555615,  0.00843645,  0.5770925 ]), actions=1, rewards=1.0, next_observations=array([ 0

# HER Buffer

The `HERBuffer` is an implementation of the Hindsight Experience Replay buffer. An explanation of the code is pinned by Towards Data Science under "Tips and Tricks": https://towardsdatascience.com/hindsight-experience-replay-her-implementation-92eebab6f653

Unfortunately, it currently supports only single-agent environments (it is not compatible with `gym.vector.VectorEnv`).
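
The core idea of Hindsight Experience Replay is goal relabelling: a failed episode is stored again as if the goal had been a state the agent actually reached. The sketch below shows the "final" relabelling strategy on a toy one-dimensional task. It makes no claim about which strategy `HERBuffer` uses or how it is structured; `GoalTransition`, `sparse_reward` and `relabel_final` are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple


class GoalTransition(NamedTuple):
    achieved_goal: int  # position reached after the step
    desired_goal: int   # position the agent was asked to reach
    reward: float


def sparse_reward(achieved_goal: int, desired_goal: int) -> float:
    """0 when the goal is reached, -1 otherwise (a common sparse-reward convention)."""
    return 0.0 if achieved_goal == desired_goal else -1.0


def relabel_final(episode: List[GoalTransition]) -> List[GoalTransition]:
    """Replace the desired goal with the final achieved goal and recompute rewards."""
    new_goal = episode[-1].achieved_goal
    return [
        GoalTransition(t.achieved_goal, new_goal, sparse_reward(t.achieved_goal, new_goal))
        for t in episode
    ]


# The agent was asked to reach position 5 but only got as far as 3.
episode = [GoalTransition(p, 5, sparse_reward(p, 5)) for p in (1, 2, 3)]
relabelled = relabel_final(episode)

print([t.reward for t in episode])     # [-1.0, -1.0, -1.0]
print([t.reward for t in relabelled])  # [-1.0, -1.0, 0.0]
```

Storing both the original and the relabelled episode gives the learner at least one rewarded step, even when the original goal was never reached.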