## Open notebook in:
| Colab |
|:------|
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nicolepcx/transformers-the-definitive-guide/blob/master/CH07/ch07_replay_buffer_DT.ipynb) |

# About This Section: Replay Buffer Implementation

This section of the notebook implements a **Replay Buffer** designed for the [Online Decision Transformer](https://arxiv.org/abs/2202.05607) (ODT) framework. A replay buffer stores past experience so that it can be retrieved efficiently during training, which matters for algorithms like ODT that keep learning from previously collected rollouts.

#### Key Features:
- **Capacity Management**: The replay buffer is initialized with a fixed capacity. If it is seeded with more trajectories than it can hold, only those with the highest return (summed rewards) are kept.
- **Trajectory Storage**: Unlike traditional replay buffers that store individual transitions, this implementation stores entire trajectories, which are sequences of states, actions, and rewards. This approach aligns with ODT's focus on sequence-level modeling.
- **FIFO Replacement Strategy**: As new trajectories are collected during training, they replace the oldest ones in first-in-first-out (FIFO) order, so the buffer tracks the agent's most recent experience.
- **Efficiency**: Because the buffer has a fixed size and is refreshed with recent rollouts, the model keeps training on a bounded but varied set of past trajectories, which is intended to help stability and generalization.

This replay buffer implementation is integral to ODT's ability to adapt its policy to new data while retaining valuable information from previous experiences.
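As a rough illustration of the capacity rule, the sketch below keeps the `capacity` highest-return trajectories from a list. The helper name `keep_top_by_return` and the toy data are assumptions made for this example, not part of the notebook's class, and the sketch assumes `capacity >= 1`.

```python
import numpy as np


def keep_top_by_return(trajectories, capacity):
    """Hypothetical helper: keep the `capacity` trajectories with the highest return."""
    if len(trajectories) <= capacity:
        return list(trajectories)
    # Return of a trajectory = sum of its rewards.
    returns = [traj["rewards"].sum() for traj in trajectories]
    order = np.argsort(returns)  # indices sorted from lowest to highest return
    return [trajectories[i] for i in order[-capacity:]]


# Toy trajectories (illustrative values only).
demo = [
    {"rewards": np.array([1.0, 2.0])},  # return 3
    {"rewards": np.array([5.0, 5.0])},  # return 10
    {"rewards": np.array([0.5, 0.5])},  # return 1
]

kept = keep_top_by_return(demo, capacity=2)
kept_returns = [float(t["rewards"].sum()) for t in kept]
print(kept_returns)  # [3.0, 10.0]
```

The kept trajectories come back ordered from lowest to highest return, because `np.argsort` sorts in ascending order.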


# Imports

In [None]:
import numpy as np

# Replay Buffer

In [None]:
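class ReplayBuffer:
    def __init__(self, capacity, trajectories=None):
        self.capacity = capacity
        # Copy the input: avoids a shared mutable default argument and
        # keeps the buffer from aliasing (and mutating) the caller's list.
        trajectories = list(trajectories) if trajectories is not None else []
        if len(trajectories) <= self.capacity:
            self.trajectories = trajectories
        else:
            # Too many trajectories: keep only the highest-return ones.
            returns = [traj["rewards"].sum() for traj in trajectories]
            sorted_inds = np.argsort(returns)  # lowest to highest
            self.trajectories = [
                trajectories[ii] for ii in sorted_inds[-self.capacity :]
            ]

        # Index of the next slot to overwrite once the buffer is full.
        self.start_idx = 0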

    def __len__(self):
        return len(self.trajectories)

    def add_new_trajs(self, new_trajs):
        if len(self.trajectories) < self.capacity:
            # Still filling: append, then drop the oldest entries on overflow.
            self.trajectories.extend(new_trajs)
            self.trajectories = self.trajectories[-self.capacity :]
        else:
            # Full: overwrite the oldest slots one at a time so the write
            # index wraps around instead of growing the list past capacity.
            for traj in new_trajs:
                self.trajectories[self.start_idx] = traj
                self.start_idx = (self.start_idx + 1) % self.capacity

        assert len(self.trajectories) <= self.capacity
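One detail of circular overwriting is worth seeing in isolation. A Python slice assignment that runs past the end of a list grows the list instead of wrapping, so a wrap-around write needs per-item modulo indexing. The following is a minimal sketch with a hypothetical helper `overwrite_fifo` and string placeholders standing in for trajectories; it is an illustration, not the notebook's class.

```python
def overwrite_fifo(buffer, start_idx, new_items, capacity):
    """Hypothetical helper: overwrite slots of a full buffer, wrapping at capacity.

    Mutates `buffer` in place and returns the next write index.
    """
    assert len(buffer) == capacity
    for item in new_items:
        buffer[start_idx] = item
        start_idx = (start_idx + 1) % capacity  # wrap back to slot 0
    return start_idx


# Full buffer of capacity 3 whose write pointer sits on the last slot.
buffer = ["t0", "t1", "t2"]
next_idx = overwrite_fifo(buffer, start_idx=2, new_items=["t3", "t4"], capacity=3)
print(buffer, next_idx)  # ['t4', 't1', 't3'] 1

# For contrast: the same write as a slice assignment grows the list to 4 items.
grown = ["t0", "t1", "t2"]
grown[2:4] = ["t3", "t4"]
print(grown)  # ['t0', 't1', 't3', 't4']
```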

# Run an Example of Trajectories

In [None]:
# Example trajectory data
trajectories = [
    {"rewards": np.array([1, 2, 3]), "states": np.array([0, 1, 2]), "actions": np.array([0, 1, 2])},
    {"rewards": np.array([2, 3, 4]), "states": np.array([1, 2, 3]), "actions": np.array([1, 2, 3])},
    {"rewards": np.array([3, 4, 5]), "states": np.array([2, 3, 4]), "actions": np.array([2, 3, 4])},
    {"rewards": np.array([4, 5, 6]), "states": np.array([3, 4, 5]), "actions": np.array([3, 4, 5])}
]

# Initialize the replay buffer with a capacity of 3
replay_buffer = ReplayBuffer(capacity=3, trajectories=trajectories[:2])

# Print the initial state of the replay buffer
print("Initial replay buffer:")
for traj in replay_buffer.trajectories:
    print(traj)

# Add two new trajectories; the buffer already holds 2 of 3, so the oldest one is dropped
new_trajectories = [
    {"rewards": np.array([5, 6, 7]), "states": np.array([4, 5, 6]), "actions": np.array([4, 5, 6])},
    {"rewards": np.array([6, 7, 8]), "states": np.array([5, 6, 7]), "actions": np.array([5, 6, 7])}
]
replay_buffer.add_new_trajs(new_trajectories)

# Print the state of the replay buffer after adding new trajectories
print("\nReplay buffer after adding new trajectories:")
for traj in replay_buffer.trajectories:
    print(traj)


Initial replay buffer:
{'rewards': array([1, 2, 3]), 'states': array([0, 1, 2]), 'actions': array([0, 1, 2])}
{'rewards': array([2, 3, 4]), 'states': array([1, 2, 3]), 'actions': array([1, 2, 3])}

Replay buffer after adding new trajectories:
{'rewards': array([2, 3, 4]), 'states': array([1, 2, 3]), 'actions': array([1, 2, 3])}
{'rewards': array([5, 6, 7]), 'states': array([4, 5, 6]), 'actions': array([4, 5, 6])}
{'rewards': array([6, 7, 8]), 'states': array([5, 6, 7]), 'actions': array([5, 6, 7])}
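The first seeded trajectory is missing from the final output because the buffer was not yet full: the new trajectories are appended and the list is then trimmed to its last `capacity` entries. A small sketch of that step follows, using placeholder strings (`traj_A` and so on are illustrative labels, not names from the notebook).

```python
capacity = 3

# Two seeded trajectories, oldest first.
buffer = ["traj_A", "traj_B"]

# Append two new ones, then keep only the most recent `capacity` entries.
buffer.extend(["traj_C", "traj_D"])
buffer = buffer[-capacity:]

print(buffer)  # ['traj_B', 'traj_C', 'traj_D']
```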
