# Outlook

This notebook shows how to use a gymnasium environment as a BBRL agent in practice, using autoreset=True.
It is part of the [BBRL documentation](https://github.com/osigaud/bbrl/docs/index.html).

If this is your first contact with BBRL, you may start by having a look at [this more basic notebook](01-basic_concepts.student.ipynb) and [the one using autoreset=False](02-multi_env_noautoreset.student.ipynb).

## Installation and Imports

The BBRL library is [here](https://github.com/osigaud/bbrl).

Below, we import standard Python packages, PyTorch packages and gymnasium environments.

In [1]:
# Installs the necessary Python and system libraries
try:
    from easypip import easyimport, easyinstall, is_notebook
except ModuleNotFoundError as e:
    get_ipython().run_line_magic("pip", "install easypip")
    from easypip import easyimport, easyinstall, is_notebook

easyinstall("bbrl>=0.2.2")
easyinstall("swig")
easyinstall("bbrl_gymnasium>=0.2.0")
easyinstall("bbrl_gymnasium[classic_control]")


In [2]:
import os
import sys
from pathlib import Path
import math

from moviepy.editor import ipython_display as video_display
import time
from tqdm.auto import tqdm
from typing import Tuple, Optional
from functools import partial

from omegaconf import OmegaConf
import torch
import bbrl_gymnasium

import copy
from abc import abstractmethod, ABC
import torch.nn as nn
import torch.nn.functional as F
from time import strftime
OmegaConf.register_new_resolver(
    "current_time", lambda: strftime("%Y%m%d-%H%M%S"), replace=True
)

In [3]:
# Imports all the necessary classes and functions from BBRL
from bbrl.agents.agent import Agent
from bbrl import get_arguments, get_class, instantiate_class
# The workspace is the main class in BBRL, this is where all data is collected and stored
from bbrl.workspace import Workspace

# Agents(agent1, agent2, agent3, ...) executes the given agents one after the other
# TemporalAgent(agent) executes an agent over multiple time steps in the workspace,
# or until a given condition is reached

from bbrl.agents import Agents, TemporalAgent
from bbrl.agents.gymnasium import ParallelGymAgent, make_env

# Replay buffers are useful to store past transitions when training
from bbrl.utils.replay_buffer import ReplayBuffer

## Definition of agents

We reuse the RandomAgent already used in the autoreset=False case.

In [4]:
class RandomAgent(Agent):
    def __init__(self, action_dim):
        super().__init__()
        self.action_dim = action_dim

    def forward(self, t: int, choose_action=True, **kwargs):
        """An Agent can use self.workspace"""
        obs = self.get(("env/env_obs", t))
        action = torch.randint(0, self.action_dim, (len(obs), ))
        self.set(("action", t), action)
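
The sampling step of `RandomAgent` can be tried in isolation. The sketch below is a minimal illustration outside BBRL: `n_envs` is a hypothetical stand-in for `len(obs)`, and the seed is only there to make the run repeatable.

```python
import torch

n_envs = 3       # illustrative value; in the notebook this is len(obs)
action_dim = 2   # CartPole-v1 has two discrete actions

torch.manual_seed(0)
# One integer action in {0, ..., action_dim - 1} for each environment
action = torch.randint(0, action_dim, (n_envs,))
print(action.shape, action.dtype)
```

The resulting tensor has one entry per environment, which is the shape the agent writes under the `action` key at time step `t`.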

As before, we create an Agent representing [the CartPole-v1 gym environment](https://gymnasium.farama.org/environments/classic_control/cart_pole/).
This is done using the [ParallelGymAgent](https://github.com/osigaud/bbrl/blob/40fe0468feb8998e62c3cd6bb3a575fef88e256f/src/bbrl/agents/gymnasium.py#L261) class.

### Single environment case

We start with a single instance of the CartPole environment.

In [5]:
# We deal with 1 environment (random seed 2139)

env_agent = ParallelGymAgent(partial(make_env, env_name='CartPole-v1', autoreset=True), num_envs=1).seed(2139)
obs_size, action_dim = env_agent.get_obs_and_actions_sizes()
print(f"Environment: observation space in R^{obs_size} and action space R^{action_dim}")

# Each agent is run in the order given when constructing Agents

agents = Agents(env_agent, RandomAgent(action_dim))
t_agents = TemporalAgent(agents)

Environment: observation space in R^4 and action space R^2


Let us have a closer look at the content of the workspace.

In [23]:
# Creates a new workspace
workspace = Workspace()
epoch_size = 15
t_agents(workspace, n_steps=epoch_size)

In [22]:
for key in workspace.variables.keys():
    print(key, workspace[key].shape, workspace[key])

env/env_obs torch.Size([15, 3, 4]) tensor([[[ 0.0351, -0.0244,  0.0483,  0.0200],
         [-0.0070,  0.0379, -0.0248,  0.0201],
         [-0.0378,  0.0325,  0.0493, -0.0067]],

        [[ 0.0346, -0.2202,  0.0487,  0.3275],
         [-0.0062,  0.2334, -0.0244, -0.2803],
         [-0.0372,  0.2269,  0.0492, -0.2834]],

        [[ 0.0302, -0.0258,  0.0552,  0.0505],
         [-0.0015,  0.4289, -0.0300, -0.5806],
         [-0.0326,  0.0311,  0.0435,  0.0244]],

        [[ 0.0297, -0.2216,  0.0562,  0.3601],
         [ 0.0070,  0.6244, -0.0417, -0.8825],
         [-0.0320, -0.1646,  0.0440,  0.3305]],

        [[ 0.0253, -0.0274,  0.0634,  0.0857],
         [ 0.0195,  0.4299, -0.0593, -0.6032],
         [-0.0353, -0.3604,  0.0506,  0.6367]],

        [[ 0.0247,  0.1668,  0.0651, -0.1863],
         [ 0.0281,  0.6258, -0.0714, -0.9140],
         [-0.0425, -0.1660,  0.0634,  0.3604]],

        [[ 0.0280,  0.3609,  0.0614, -0.4578],
         [ 0.0406,  0.8218, -0.0896, -1.2282],
         [-0.

In [8]:

# We get the transitions: each tensor is transformed so that:
# - the first dimension has size 2 and holds the values at time steps t and t+1
# - the different environments are no longer distinguished (here there is a single environment, which keeps things simple)
transitions = workspace.get_transitions()

display("Observations (first 4)", workspace["env/env_obs"][:4])

display("Transitions (first 4)")
for t in range(4):
    display(f'(s_{t}, s_{t+1})')
    # The first dimension holds the pair [t, t+1]; the second one indexes the transition
    display(transitions["env/env_obs"][:, t])

'Observations (first 4)'

tensor([[[-0.0471,  0.0265,  0.0220, -0.0336]],

        [[-0.0466,  0.2213,  0.0214, -0.3192]],

        [[-0.0422,  0.0259,  0.0150, -0.0199]],

        [[-0.0416,  0.2208,  0.0146, -0.3078]]])

'Transitions (first 4)'

'(s_0, s_1)'

tensor([[-0.0471,  0.0265,  0.0220, -0.0336],
        [-0.0466,  0.2213,  0.0214, -0.3192]])

'(s_1, s_2)'

tensor([[-0.0466,  0.2213,  0.0214, -0.3192],
        [-0.0422,  0.0259,  0.0150, -0.0199]])

'(s_2, s_3)'

tensor([[-0.0422,  0.0259,  0.0150, -0.0199],
        [-0.0416,  0.2208,  0.0146, -0.3078]])

'(s_3, s_4)'

tensor([[-0.0416,  0.2208,  0.0146, -0.3078],
        [-0.0372,  0.4157,  0.0084, -0.5958]])

You can see that each transition in the workspace corresponds to a pair of observations.

### Transitions as a workspace

A transition workspace is still a workspace. This is quite handy: since each transition can be seen as a mini-episode of two time steps, we can run our agents on it.

It is often the case in BBRL that we have to apply an agent to an already existing workspace
as shown below.

In [11]:
for key in transitions.variables.keys():
    print(key, transitions[key])

t_random_agent = TemporalAgent(RandomAgent(action_dim))
t_random_agent(transitions, t=0, n_steps=2)

# Here, the action tensor will have been overwritten by the new actions
print(f"new action, {transitions['action']}")

env/env_obs tensor([[[-0.0471,  0.0265,  0.0220, -0.0336],
         [-0.0466,  0.2213,  0.0214, -0.3192],
         [-0.0422,  0.0259,  0.0150, -0.0199],
         [-0.0416,  0.2208,  0.0146, -0.3078],
         [-0.0372,  0.4157,  0.0084, -0.5958],
         [-0.0289,  0.2205, -0.0035, -0.3005],
         [-0.0245,  0.0254, -0.0095, -0.0089],
         [-0.0240,  0.2207, -0.0097, -0.3046],
         [-0.0196,  0.0257, -0.0158, -0.0149],
         [-0.0191, -0.1692, -0.0161,  0.2727],
         [-0.0224, -0.3641, -0.0106,  0.5603],
         [-0.0297, -0.1688,  0.0006,  0.2643],
         [-0.0331,  0.0263,  0.0059, -0.0282],
         [-0.0326,  0.2213,  0.0053, -0.3190]],

        [[-0.0466,  0.2213,  0.0214, -0.3192],
         [-0.0422,  0.0259,  0.0150, -0.0199],
         [-0.0416,  0.2208,  0.0146, -0.3078],
         [-0.0372,  0.4157,  0.0084, -0.5958],
         [-0.0289,  0.2205, -0.0035, -0.3005],
         [-0.0245,  0.0254, -0.0095, -0.0089],
         [-0.0240,  0.2207, -0.0097, -0.3046],

### Multiple environment case

Now we use 3 environments.
Transitions are stored one environment after the other, so to follow the transitions of a particular environment
we have to read every third entry of the transition workspace.

In [12]:
# We deal with 3 environments at a time (random seed 2139)

multienv_agent = ParallelGymAgent(partial(make_env, env_name='CartPole-v1', autoreset=True), num_envs=3).seed(2139)
obs_size, action_dim = multienv_agent.get_obs_and_actions_sizes()
print(f"Environment: observation space in R^{obs_size} and action space R^{action_dim}")

agents = Agents(multienv_agent, RandomAgent(action_dim))
t_agents = TemporalAgent(agents)
workspace = Workspace()
t_agents(workspace, n_steps=epoch_size)
transitions = workspace.get_transitions()

display("Observations (first 4)", workspace["env/env_obs"][:4])

display("Transitions (first 3)")
for t in range(3):
    display(f'(s_{t}, s_{t+1})')
    display(transitions["env/env_obs"][:, t])

Environment: observation space in R^4 and action space R^2


'Observations (first 4)'

tensor([[[-0.0085, -0.0427, -0.0489,  0.0215],
         [ 0.0005,  0.0025, -0.0493, -0.0402],
         [ 0.0080,  0.0203, -0.0023, -0.0085]],

        [[-0.0094,  0.1531, -0.0485, -0.2862],
         [ 0.0006,  0.1983, -0.0501, -0.3480],
         [ 0.0084,  0.2155, -0.0025, -0.3019]],

        [[-0.0063, -0.0413, -0.0542, -0.0092],
         [ 0.0046,  0.0039, -0.0570, -0.0715],
         [ 0.0127,  0.0204, -0.0085, -0.0100]],

        [[-0.0071, -0.2356, -0.0544,  0.2659],
         [ 0.0046, -0.1904, -0.0584,  0.2027],
         [ 0.0132, -0.1746, -0.0087,  0.2800]]])

'Transitions (first 3)'

'(s_0, s_1)'

tensor([[-0.0085, -0.0427, -0.0489,  0.0215],
        [-0.0094,  0.1531, -0.0485, -0.2862]])

'(s_1, s_2)'

tensor([[ 0.0005,  0.0025, -0.0493, -0.0402],
        [ 0.0006,  0.1983, -0.0501, -0.3480]])

'(s_2, s_3)'

tensor([[ 0.0080,  0.0203, -0.0023, -0.0085],
        [ 0.0084,  0.2155, -0.0025, -0.3019]])

You can see how the transitions are organized in the workspace relative to the 3 environments.
Note that the `(s_t, s_{t+1})` labels printed above are transition indices, not time steps: all three entries shown are first transitions.
You first get the first transition from the first environment.
Then the first transition from the second environment.
Then the first transition from the third environment.
Then the second transition from the first environment, etc.
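
This ordering can be reproduced with plain tensors. The sketch below is not the BBRL implementation of `get_transitions()`; it assumes no episode ends during the epoch (so nothing is filtered out) and uses fake observations of the form `[time step, environment]` so that every row identifies itself.

```python
import torch

n_steps, n_envs, obs_dim = 4, 3, 2

# Fake observations: obs[t, e] = [t, e]
t_idx = torch.arange(n_steps).view(-1, 1).expand(n_steps, n_envs)
e_idx = torch.arange(n_envs).view(1, -1).expand(n_steps, n_envs)
obs = torch.stack((t_idx, e_idx), dim=-1)            # shape [4, 3, 2]

# Pair each time step with the next one, then merge the time and environment axes
pairs = torch.stack((obs[:-1], obs[1:]))             # shape [2, 3, 3, 2]
transitions = pairs.reshape(2, (n_steps - 1) * n_envs, obs_dim)

for k in range(4):
    t, e = transitions[0, k].tolist()
    print(f"transition {k}: starts at time step {t} in environment {e}")
```

Transition `k` starts at time step `k // n_envs` in environment `k % n_envs`, so the transitions of one environment are found every `n_envs` entries.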

## The replay buffer

Unlike in the previous case, we use a replay buffer that stores
a set of transitions $(s_t, a_t, r_t, s_{t+1})$.
The replay buffer keeps slices [:, i, ...] of the transition
workspace (here at most 80 transitions).

In [18]:
rb = ReplayBuffer(max_size=80)

# We add the transitions to the buffer....
rb.put(transitions)

# ... and sample from them: here we get 3 tuples (s_t, s_{t+1})
rb.get_shuffled(3)["env/env_obs"]

tensor([[[ 0.0755,  0.8044, -0.1017, -1.2627],
         [-0.0992, -0.2348,  0.0644,  0.2480],
         [ 0.0473,  0.8026, -0.0585, -1.2190]],

        [[ 0.0916,  0.6107, -0.1270, -1.0036],
         [-0.1039, -0.0407,  0.0693, -0.0237],
         [ 0.0634,  0.6083, -0.0828, -0.9452]]])
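
To make the idea of storing slices `[:, i, ...]` with a maximum size concrete, here is a toy circular buffer. `ToyReplayBuffer` is a hypothetical, simplified stand-in written for this illustration: it handles a single tensor rather than a whole workspace and is not the BBRL `ReplayBuffer` implementation.

```python
import torch


class ToyReplayBuffer:
    """Hypothetical simplified buffer; not the BBRL ReplayBuffer."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.data = None      # allocated on the first put()
        self.position = 0     # next slot to write
        self.filled = 0       # number of valid slots

    def put(self, transitions):
        # transitions has shape [2, n, ...]: n pairs (value at t, value at t+1)
        n = transitions.shape[1]
        if self.data is None:
            shape = (transitions.shape[0], self.max_size) + tuple(transitions.shape[2:])
            self.data = torch.zeros(shape, dtype=transitions.dtype)
        for i in range(n):
            # Store slice [:, i, ...]; the oldest entries are overwritten once full
            self.data[:, self.position] = transitions[:, i]
            self.position = (self.position + 1) % self.max_size
        self.filled = min(self.filled + n, self.max_size)

    def size(self):
        return self.filled

    def get_shuffled(self, batch_size):
        # Sample stored transitions uniformly, with replacement
        idx = torch.randint(0, self.filled, (batch_size,))
        return self.data[:, idx]


buffer = ToyReplayBuffer(max_size=80)
buffer.put(torch.randn(2, 50, 4))
buffer.put(torch.randn(2, 50, 4))
print(buffer.size())                   # 80: the buffer is capped at max_size
print(buffer.get_shuffled(3).shape)    # torch.Size([2, 3, 4])
```

Sampling returns a tensor whose first dimension is still the pair `[t, t+1]`, which matches the shape of the output shown above.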

## Collecting several epochs into the same workspace

In the code below, the workspace only contains one epoch at a time.
The contents of these different epochs are concatenated into the replay buffer.

In [19]:
nb_steps = 0
max_steps = 100
epoch_size = 10

while nb_steps < max_steps:
    # Execute the agent in the workspace
    if nb_steps == 0:
        # In the first epoch, we start with t=0
        t_agents(workspace, t=0, n_steps=epoch_size)
    else:
        # Clear all gradient graphs from the workspace
        workspace.zero_grad()
        # Here we duplicate the last column of the previous epoch into the first column of the next epoch
        workspace.copy_n_last_steps(1)

        # In subsequent epochs, we start with t=1 so as to avoid overwriting the first column we just duplicated
        t_agents(workspace, t=1, n_steps=epoch_size)

    transition_workspace = workspace.get_transitions()

    # The part below counts the number of steps: it ignores the actions performed during the transition from one episode to the next,
    # as these have been discarded by the get_transitions() function

    action = transition_workspace["action"]
    nb_steps += action[0].shape[0]
    print(f"collecting new epoch, already performed {nb_steps} steps")

    if nb_steps > 0 or epoch_size > 1:
        rb.put(transition_workspace)
    print(f"replay buffer size: {rb.size()}")

collecting new epoch, already performed 41 steps
replay buffer size: 80
collecting new epoch, already performed 82 steps
replay buffer size: 80
collecting new epoch, already performed 122 steps
replay buffer size: 80
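
The reason for copying the last column and restarting at `t=1` can be checked on a toy example. The sketch below uses plain Python lists instead of a workspace, and integer step ids instead of observations; it only mimics the pattern of `copy_n_last_steps(1)` followed by a run starting at `t=1`, under the assumption that no episode ends.

```python
epoch_size = 4
step_id = 0      # id of the next environment step
columns = []     # time steps held by the "workspace" for the current epoch
pairs = []       # all (t, t+1) pairs collected so far

for epoch in range(3):
    if epoch == 0:
        columns = []                 # first epoch: start writing at t=0
    else:
        columns = [columns[-1]]      # later epochs: last column becomes column 0
    for _ in range(epoch_size):      # write epoch_size new columns
        columns.append(step_id)
        step_id += 1
    pairs += list(zip(columns[:-1], columns[1:]))

print(pairs)
```

Every consecutive pair `(k, k+1)` appears exactly once, including the pairs that straddle two epochs such as `(3, 4)`. Without the copied column, those boundary transitions would be lost.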


## Exercise

Create a simple agent that always outputs action 1, and run it for 10 epochs of 100 steps over 2 instances of the CartPole-v1 environment.
Put the data into a replay buffer of size 5000.

Then do the following:
- Count the number of episodes the agent performed in each environment by counting the number of "done=True" elements in the workspace before applying the `get_transitions()` function
- Count the total number of episodes performed by the agent by measuring the difference between the size of the replay buffer and the number of steps performed by the agent.
- Make sure both counts are consistent

Can we count the number of episodes performed in one environment using the second method? Why?
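
As a starting point, the two counting methods can be compared on hand-made data. The `done` tensor below is hypothetical (shape `[n_steps, n_envs]`), and the sketch assumes, as stated in the comments of the collection loop above, that `get_transitions()` discards the pairs that go from the end of one episode to the start of the next.

```python
import torch

# Hypothetical done flags; True marks the last step of an episode
done = torch.tensor([
    [False, False],
    [False, True],
    [True,  False],
    [False, False],
    [False, True],
    [False, False],
])

# Method 1: count the done flags separately for each environment
episodes_per_env = done.sum(dim=0)

# Method 2: count the pairs (t, t+1) that are dropped because step t ends an episode
n_pairs = done[:-1].numel()
n_kept = int((~done[:-1]).sum())
n_dropped = n_pairs - n_kept

print(episodes_per_env.tolist(), n_dropped)   # [1, 2] 3
```

In this toy case the number of dropped pairs equals the total number of episodes, but the per-environment breakdown is only visible with the first method, since the dropped pairs are no longer attached to an environment.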

In [20]:
class OneAgent(Agent):
    def __init__(self, action_dim):
        super().__init__()
        self.action_dim = action_dim

    def forward(self, t: int, choose_action=True, **kwargs):
        """An Agent can use self.workspace"""
        obs = self.get(("env/env_obs", t))
        action = torch.tensor([1]*len(obs))
        self.set(("action", t), action)

In [30]:
# NOTE: this cell runs RandomAgent on a single environment;
# the exercise asks for OneAgent over 2 environments (num_envs=2).
env_agent = ParallelGymAgent(partial(make_env, env_name='CartPole-v1', autoreset=True), num_envs=1).seed(2139)
obs_size, action_dim = env_agent.get_obs_and_actions_sizes()
print(f"Environment: observation space in R^{obs_size} and action space R^{action_dim}")

agents = Agents(env_agent, RandomAgent(action_dim))
t_agents = TemporalAgent(agents)
workspace = Workspace()

epoch_size = 100
n_epochs = 10
nb_steps = 0
max_steps = n_epochs * epoch_size

rb = ReplayBuffer(max_size=5000)

# The loop stops once nb_steps reaches max_steps. Each epoch yields slightly fewer
# than epoch_size transitions, so it may run for more than n_epochs epochs.
while nb_steps < max_steps:
    if nb_steps == 0:
        # In the first epoch, we start with t=0
        t_agents(workspace, t=0, n_steps=epoch_size)
    else:
        workspace.zero_grad()
        # Duplicate the last column of the previous epoch into the first column
        workspace.copy_n_last_steps(1)
        t_agents(workspace, t=1, n_steps=epoch_size)

    transition_workspace = workspace.get_transitions()

    action = transition_workspace["action"]
    nb_steps += action[0].shape[0]
    print(f"collecting new epoch, already performed {nb_steps} steps")

    if nb_steps > 0 or epoch_size > 1:
        rb.put(transition_workspace)
    print(f"replay buffer size: {rb.size()}")

Environment: observation space in R^4 and action space R^2
collecting new epoch, already performed 94 steps
replay buffer size: 94
collecting new epoch, already performed 191 steps
replay buffer size: 191
collecting new epoch, already performed 286 steps
replay buffer size: 286
collecting new epoch, already performed 384 steps
replay buffer size: 384
collecting new epoch, already performed 480 steps
replay buffer size: 480
collecting new epoch, already performed 576 steps
replay buffer size: 576
collecting new epoch, already performed 671 steps
replay buffer size: 671
collecting new epoch, already performed 766 steps
replay buffer size: 766
collecting new epoch, already performed 860 steps
replay buffer size: 860
collecting new epoch, already performed 957 steps
replay buffer size: 957
collecting new epoch, already performed 1054 steps
replay buffer size: 1054
