# Outlook

This notebook shows how to use a gymnasium environment as a BBRL agent in practice, with `autoreset=False`.
It is part of the [BBRL documentation](https://github.com/osigaud/bbrl/docs/index.html).

If this is your first contact with BBRL, you may start by having a look at [this more basic notebook](01-basic_concepts.student.ipynb).

## Installation and Imports

The BBRL library is [here](https://github.com/osigaud/bbrl).

Below, we import standard Python packages, PyTorch packages and gymnasium environments.

In [84]:
# Installs the necessary Python and system libraries
try:
    from easypip import easyimport, easyinstall, is_notebook
except ModuleNotFoundError as e:
    get_ipython().run_line_magic("pip", "install easypip")
    from easypip import easyimport, easyinstall, is_notebook

easyinstall("bbrl>=0.2.2")
easyinstall("swig")
easyinstall("bbrl_gymnasium>=0.2.0")
easyinstall("bbrl_gymnasium[classic_control]")

In [85]:
import os
import sys
from pathlib import Path
import math

from moviepy.editor import ipython_display as video_display
import time
from tqdm.auto import tqdm
from typing import Tuple, Optional
from functools import partial

from omegaconf import OmegaConf
import torch
import bbrl_gymnasium

import copy
from abc import abstractmethod, ABC
import torch.nn as nn
import torch.nn.functional as F
from time import strftime
OmegaConf.register_new_resolver(
    "current_time", lambda: strftime("%Y%m%d-%H%M%S"), replace=True
)

In [86]:
# Imports all the necessary classes and functions from BBRL
from bbrl.agents.agent import Agent
from bbrl import get_arguments, get_class, instantiate_class
# The workspace is the main class in BBRL, this is where all data is collected and stored
from bbrl.workspace import Workspace

# Agents(agent1, agent2, agent3, ...) executes the given agents one after the other
# TemporalAgent(agent) executes an agent over multiple timesteps in the workspace,
# or until a given condition is reached

from bbrl.agents import Agents, TemporalAgent
from bbrl.agents.gymnasium import ParallelGymAgent, make_env

In [87]:
from gymnasium.wrappers.time_limit import TimeLimit

## Definition of agents

We first create an Agent representing [the CartPole-v1 gym environment](https://gymnasium.farama.org/environments/classic_control/cart_pole/).
This is done using the [ParallelGymAgent](https://github.com/osigaud/bbrl/blob/40fe0468feb8998e62c3cd6bb3a575fef88e256f/src/bbrl/agents/gymnasium.py#L261) class.

The ParallelGymAgent is an agent able to execute a batch of gymnasium environments
with or without auto-resetting. These agents produce multiple variables in the workspace:
'env/env_obs', 'env/reward', 'env/timestep', 'env/terminated',
'env/truncated', 'env/done', 'env/cumulated_reward'.

When called at timestep t=0, the environments are automatically reset. At
timestep t>0, these agents read the 'action' variable in the workspace at
time t − 1 and generate the next state by calling the `step(action)` method of the contained gymnasium environment.
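
The timing convention described above can be illustrated with a toy sketch. It does not use BBRL: `ToyEnv`, `env_agent_step` and the dict-based workspace are hypothetical stand-ins, introduced only to show which time index is read and written at each step.

```python
# Toy illustration only: ToyEnv and the dict-based workspace are hypothetical
# stand-ins, not BBRL classes.

class ToyEnv:
    """A counter environment: the state is the sum of the actions received so far."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += action
        return self.state


def env_agent_step(env, workspace, t):
    """Write the observation for time t into the workspace."""
    if t == 0:
        obs = env.reset()  # t = 0: the environment is reset
    else:
        action = workspace["action"][t - 1]  # t > 0: read the action written at t - 1
        obs = env.step(action)
    workspace["env/env_obs"][t] = obs


workspace = {"env/env_obs": {}, "action": {}}
env = ToyEnv()
for t, action in enumerate([1, 0, 1]):
    env_agent_step(env, workspace, t)  # the environment writes the observation at time t
    workspace["action"][t] = action    # then a policy writes the action at time t

print(workspace["env/env_obs"])
```

The observation at time t therefore reflects the action chosen at time t − 1, which is why an action must be in the workspace before the environment agent is called at the next timestep.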

In the example below, we are working with batches (i.e. several episodes at the same time),
so here our agent uses `n_envs = 3` environments.

In [88]:
# We run episodes over 3 environments at a time
n_envs = 3
# Each episode is truncated after 100 steps by the TimeLimit wrapper
env_agent = ParallelGymAgent(
    partial(make_env, 'CartPole-v1', autoreset=False, wrappers=[lambda x: TimeLimit(x, 100)]),
    n_envs,
    reward_at_t=False,
)
# The random seed is set to 2139
env_agent.seed(2139)

obs_size, action_dim = env_agent.get_obs_and_actions_sizes()
print(f"Environment: observation space in R^{obs_size} and action space {{1, ..., {action_dim}}}")

Environment: observation space in R^4 and action space {1, ..., 2}


In [89]:
# Creates a new workspace
workspace = Workspace()

# Execute the first step
env_agent(workspace, t=0)

# Our first set of observations. The size of the observation space is 4, and we have 3 environments.
obs = workspace.get("env/env_obs", 0)
print("Observation", obs)

Observation tensor([[-0.0085, -0.0427, -0.0489,  0.0215],
        [ 0.0005,  0.0025, -0.0493, -0.0402],
        [ 0.0080,  0.0203, -0.0023, -0.0085]])


To generate more steps into the workspace, we need to send actions to the environment.

### Random action without agent

We first set an action directly, without using an agent.

In [90]:
# Sets the action at t=0
action = torch.randint(0, action_dim, (n_envs, ))
workspace.set("action", 0, action)
print(action)

# Performs one environment step, which reads the action written at t=0
env_agent(workspace, t=1)

# Displays the resulting observations at t=1
workspace.get("env/env_obs", 1)

tensor([1, 1, 0])


tensor([[-0.0094,  0.1531, -0.0485, -0.2862],
        [ 0.0006,  0.1983, -0.0501, -0.3480],
        [ 0.0084, -0.1747, -0.0025,  0.2834]])

Let us now look at what's in the workspace. You can see below all the variables it generates.

In [91]:
for key in workspace.variables.keys():
    print(key, workspace[key])

env/env_obs tensor([[[-0.0085, -0.0427, -0.0489,  0.0215],
         [ 0.0005,  0.0025, -0.0493, -0.0402],
         [ 0.0080,  0.0203, -0.0023, -0.0085]],

        [[-0.0094,  0.1531, -0.0485, -0.2862],
         [ 0.0006,  0.1983, -0.0501, -0.3480],
         [ 0.0084, -0.1747, -0.0025,  0.2834]]])
env/terminated tensor([[False, False, False],
        [False, False, False]])
env/truncated tensor([[False, False, False],
        [False, False, False]])
env/done tensor([[False, False, False],
        [False, False, False]])
env/reward tensor([[0., 0., 0.],
        [1., 1., 1.]])
env/cumulated_reward tensor([[0., 0., 0.],
        [1., 1., 1.]])
env/timestep tensor([[0, 0, 0],
        [1, 1, 1]])
action tensor([[1, 1, 0]])


You can observe that each environment variable now holds two time steps, stored
in tensors whose first dimension is time. The `action` variable holds only one, since we have only set it at t=0.

You can also see that by convention, all variables written by the environment start with "env/".
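
As a minimal sketch of this time-first layout, the snippet below rebuilds the `env/reward` tensor printed above by hand (it does not query a workspace) and indexes it along each dimension. The variable names are chosen for illustration.

```python
import torch

# Values copied from the "env/reward" output above: 2 time steps, 3 environments.
reward = torch.tensor([[0., 0., 0.],
                       [1., 1., 1.]])

n_steps, n_envs = reward.shape  # first dimension is time, second is the environment

reward_at_t1 = reward[1]        # rewards of all environments at t=1
reward_env0 = reward[:, 0]      # rewards of environment 0 over time

print(n_steps, n_envs, reward_at_t1, reward_env0)
```

Indexing the first dimension plays the same role as the `workspace.get(name, t)` call used earlier in this notebook.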

### Random agent

The process above can be
automated with `Agents` and `TemporalAgent`, as shown below. But first we have
to create an agent that selects the actions (here, randomly).

In [92]:
class RandomAgent(Agent):
    def __init__(self, action_dim):
        super().__init__()
        self.action_dim = action_dim

    def forward(self, t: int, choose_action=True, **kwargs):
        """An Agent can use self.workspace"""
        obs = self.get(("env/env_obs", t))
        action = torch.randint(0, self.action_dim, (len(obs), ))
        self.set(("action", t), action)

# Each agent is run in the order given when constructing Agents
agents = Agents(env_agent, RandomAgent(action_dim))

# And the TemporalAgent runs them over successive time steps
t_agents = TemporalAgent(agents)

In [93]:
# We can now run the agents through time with a simple call...

workspace = Workspace()
t_agents(workspace, t=0, stop_variable="env/done", stochastic=True)

In [94]:
for key in workspace.variables.keys():
    print(key, workspace[key])

env/env_obs tensor([[[ 1.2417e-02, -1.1647e-02,  2.1894e-02,  4.7717e-02],
         [ 1.0013e-02, -9.4643e-04, -1.0945e-02, -6.8630e-03],
         [ 4.5724e-02,  2.0465e-02,  4.8711e-02,  3.0704e-03]],

        [[ 1.2184e-02, -2.0708e-01,  2.2849e-02,  3.4723e-01],
         [ 9.9937e-03, -1.9591e-01, -1.1082e-02,  2.8235e-01],
         [ 4.6133e-02, -1.7532e-01,  4.8772e-02,  3.1072e-01]],

        [[ 8.0423e-03, -4.0252e-01,  2.9793e-02,  6.4703e-01],
         [ 6.0755e-03, -6.3148e-04, -5.4349e-03, -1.3811e-02],
         [ 4.2627e-02,  1.9074e-02,  5.4987e-02,  3.3804e-02]],

        [[-7.9994e-06, -5.9804e-01,  4.2734e-02,  9.4894e-01],
         [ 6.0629e-03,  1.9457e-01, -5.7111e-03, -3.0820e-01],
         [ 4.3009e-02,  2.1337e-01,  5.5663e-02, -2.4104e-01]],

        [[-1.1969e-02, -7.9371e-01,  6.1712e-02,  1.2547e+00],
         [ 9.9543e-03,  3.8977e-01, -1.1875e-02, -6.0268e-01],
         [ 4.7276e-02,  1.7495e-02,  5.0842e-02,  6.8673e-02]],

        [[-2.7843e-02, -9.8957e-0

### Termination

`env/done` tells us whether the episode is finished, that is, either terminated or truncated.
Here, without auto-reset (`autoreset=False`), execution continues until all episodes are done.
When an environment finishes before the others, its last values are copied at each subsequent step until all environments are done.
This is convenient for collecting the final reward.
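
A small sketch of how this padding can be exploited is given below. The two tensors are hand-written stand-ins for `workspace["env/done"]` and `workspace["env/cumulated_reward"]`, with made-up values for two environments. They are not output of the run above.

```python
import torch

# Hand-written stand-ins with shape (time, n_envs). Values are illustrative only.
done = torch.tensor([
    [False, False],
    [False, False],
    [True,  False],
    [True,  False],
    [True,  True],
])
cumulated_reward = torch.tensor([
    [0., 0.],
    [1., 1.],
    [2., 2.],
    [2., 3.],
    [2., 4.],
])

# First time step at which each environment is done
# (argmax returns the index of the first maximal value).
first_done = done.int().argmax(dim=0)

# Because finished environments are padded with copies of their last values,
# the last row already holds the return of every episode.
final_return = cumulated_reward[-1]

print(first_done, final_return)
```

Reading the last time step is enough here because no environment is reset once it is done. With auto-reset, a per-environment lookup would be needed instead.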

In [95]:
workspace["env/done"].shape, workspace["env/done"][-10:]

(torch.Size([49, 3]),
 tensor([[ True,  True, False],
         [ True,  True, False],
         [ True,  True, False],
         [ True,  True, False],
         [ True,  True, False],
         [ True,  True, False],
         [ True,  True, False],
         [ True,  True, False],
         [ True,  True, False],
         [ True,  True,  True]]))

You can see that the variable is copied until all episodes are done.

### Observations

The resulting tensor of observations, with the last two observations:

In [96]:
workspace["env/env_obs"].shape, workspace["env/env_obs"][-2:]

(torch.Size([49, 3, 4]),
 tensor([[[-0.1228, -1.3851,  0.2395,  2.3180],
          [ 0.1204,  0.6012, -0.2105, -1.2803],
          [ 0.5376,  1.3538, -0.1869, -1.3499]],
 
         [[-0.1228, -1.3851,  0.2395,  2.3180],
          [ 0.1204,  0.6012, -0.2105, -1.2803],
          [ 0.5647,  1.5508, -0.2139, -1.6948]]]))

### Rewards

The resulting tensor of rewards, with the last 8 rewards:

In [97]:
workspace["env/reward"].shape, workspace["env/reward"][-8:]

(torch.Size([49, 3]),
 tensor([[0., 0., 1.],
         [0., 0., 1.],
         [0., 0., 1.],
         [0., 0., 1.],
         [0., 0., 1.],
         [0., 0., 1.],
         [0., 0., 1.],
         [0., 0., 1.]]))

and the cumulated rewards:

In [98]:
workspace["env/cumulated_reward"].shape, workspace["env/cumulated_reward"][-8:]

(torch.Size([49, 3]),
 tensor([[ 9., 17., 41.],
         [ 9., 17., 42.],
         [ 9., 17., 43.],
         [ 9., 17., 44.],
         [ 9., 17., 45.],
         [ 9., 17., 46.],
         [ 9., 17., 47.],
         [ 9., 17., 48.]]))
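
The values printed above are consistent with the cumulated reward being a running sum of the rewards along the time dimension. The sketch below checks this relation on a small hand-written reward tensor. That BBRL computes the variable exactly this way internally is an assumption.

```python
import torch

# Hand-written rewards with shape (time, n_envs). Environment 0 stops earning
# reward after t=2, as a finished environment does in the output above.
reward = torch.tensor([
    [0., 0.],
    [1., 1.],
    [1., 1.],
    [0., 1.],
])

# Running sum over the time dimension (dim=0)
cumulated = reward.cumsum(dim=0)

print(cumulated)
```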

### Actions

The resulting tensor of actions, with the last two actions:

In [99]:
workspace["action"].shape, workspace["action"][-2:]

(torch.Size([49, 3]),
 tensor([[1, 1, 1],
         [1, 1, 0]]))

## Exercise

Create a trivial agent that always outputs action 1 until the episode stops.
Inspect the content of the resulting workspace.

In [110]:
n_envs = 5
env_agent = ParallelGymAgent(partial(make_env, 'CartPole-v1', autoreset=False), n_envs, reward_at_t=False)
env_agent.seed(2139)

obs_size, action_dim = env_agent.get_obs_and_actions_sizes()
print(f"Environment: observation space in R^{obs_size} and action space {{1, ..., {action_dim}}}")

Environment: observation space in R^4 and action space {1, ..., 2}


In [113]:
class OneAgent(Agent):
    def __init__(self, action_dim):
        super().__init__()
        self.action_dim = action_dim

    def forward(self, t: int, choose_action=True, **kwargs):
        """An Agent can use self.workspace"""
        obs = self.get(("env/env_obs", t))
        action = torch.tensor([1]*len(obs))
        self.set(("action", t), action)


t_agents = TemporalAgent(Agents(env_agent, OneAgent(action_dim)))

workspace = Workspace()
t_agents(workspace, t=0, stop_variable="env/done", stochastic=True)

In [114]:
for key in workspace.variables.keys():
    print(key, workspace[key])

env/env_obs tensor([[[ 1.5349e-02, -1.6139e-02, -1.4235e-02,  2.6596e-02],
         [-2.9842e-02,  4.2431e-02,  7.2186e-04,  8.1762e-03],
         [-8.3105e-03,  6.8620e-03,  1.7204e-02, -1.3014e-02],
         [ 1.1758e-02, -4.3135e-02,  1.6623e-02, -9.7311e-03],
         [-2.3125e-02,  1.0865e-02,  3.3525e-02, -1.0814e-02]],

        [[ 1.5026e-02,  1.7918e-01, -1.3703e-02, -2.7054e-01],
         [-2.8993e-02,  2.3754e-01,  8.8538e-04, -2.8428e-01],
         [-8.1733e-03,  2.0173e-01,  1.6943e-02, -3.0022e-01],
         [ 1.0896e-02,  1.5174e-01,  1.6429e-02, -2.9712e-01],
         [-2.2908e-02,  2.0549e-01,  3.3308e-02, -2.9273e-01]],

        [[ 1.8610e-02,  3.7450e-01, -1.9114e-02, -5.6752e-01],
         [-2.4242e-02,  4.3265e-01, -4.8002e-03, -5.7668e-01],
         [-4.1386e-03,  3.9661e-01,  1.0939e-02, -5.8751e-01],
         [ 1.3930e-02,  3.4663e-01,  1.0486e-02, -5.8458e-01],
         [-1.8798e-02,  4.0012e-01,  2.7454e-02, -5.7473e-01]],

        [[ 2.6100e-02,  5.6988e-01, -

The poles fall fairly quickly, as expected. Note that done = truncated (stopped by the episode time limit) ∪ terminated (the episode ended).
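
The relation between the three flags can be sketched with a few hand-written booleans, one per environment. The values are illustrative and do not come from the run above.

```python
import torch

# One flag per environment. Values are illustrative only.
terminated = torch.tensor([True, False, False])  # e.g. the pole fell
truncated = torch.tensor([False, True, False])   # e.g. the time limit was reached

# An episode is done as soon as it is terminated or truncated
done = terminated | truncated

print(done)
```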