
# BBRL in practice: the interaction loop

## Outlook

In this notebook, we start practicing with the BBRL model, which is explained in [this notebook](https://colab.research.google.com/drive/1_yp-JKkxh_P8Yhctulqm0IrLbE41oK1p?usp=sharing). We only implement a simple interaction loop between an agent and an environment.


What you will see here is very close to what Ludovic Denoyer shows in [this video](https://www.youtube.com/watch?v=CSkkoq_k5zU).

# Installation

Just run the following cell.

Note the trick: we first try to import `bbrl`; if that fails, we install it from its GitHub repository and import it again.

In [None]:
try:
  import bbrl
except ImportError:
  !pip install git+https://github.com/osigaud/bbrl.git
  import bbrl
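
The `!pip` syntax only works inside a notebook. As a minimal sketch of the same try-then-install pattern in plain Python, the helper below (`import_or_install` is a hypothetical name, not part of BBRL) calls pip through `subprocess` only when the import fails. It is demonstrated on a standard-library module, so no installation is actually triggered.

```python
import importlib
import subprocess
import sys


def import_or_install(module_name, pip_target):
    """Hypothetical helper: import module_name, installing pip_target first if needed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Use the current interpreter's pip so the package lands in the right environment
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_target])
        importlib.invalidate_caches()
        return importlib.import_module(module_name)


# A module that is already available is imported directly, without calling pip
json_module = import_or_install("json", "json")
```

For BBRL, the call would presumably look like `import_or_install("bbrl", "git+https://github.com/osigaud/bbrl.git")`, using the repository URL from the cell above.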


In [None]:
import torch


## BBRL imports

As explained in [the white paper](https://arxiv.org/pdf/2110.07910.pdf), everything in SaLinA (and also in BBRL) is an Agent.

This construct is defined as the `Agent` class in the [bbrl/agents/agent.py](https://github.com/osigaud/bbrl/blob/master/bbrl/agents/agent.py) file.

Any `Agent` class should come with a `forward(self, t, **kwargs)` method, where `t` represents a time step.

Some of the comments below are just copy-pasted from the paper or from the code.

In [None]:
from bbrl.workspace import Workspace

from bbrl.agents.agent import Agent

# Agents(agent1, agent2, agent3, ...) executes the given agents one after the other
# TemporalAgent(agent) executes an agent (e.g. an Agents) over multiple time steps in the workspace,
# or until a given condition is reached
from bbrl.agents import Agents, TemporalAgent

# NoAutoResetGymAgent (resp. AutoResetGymAgent) is an agent able to execute a batch of gym environments
# without (resp. with) auto-resetting. These agents produce multiple variables in the workspace:
# 'env/env_obs', 'env/reward', 'env/timestep', 'env/done', 'env/initial_state', 'env/cumulated_reward',
# ... When called at time step t=0, the environments are automatically reset.
# At time step t>0, these agents read the 'action' variable in the workspace at time t-1
from bbrl.agents.gyma import AutoResetGymAgent, NoAutoResetGymAgent

Remember that a workspace contains tensors, so everything written into a workspace should be a tensor. In the examples below the agents will first write random tensors.
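
As a mental model only, a workspace can be pictured as a store indexed by variable name and time step. The sketch below is a toy stand-in written for this explanation, not BBRL's implementation: `ToyWorkspace` is a hypothetical class, and it holds plain Python lists where the real `Workspace` holds tensors.

```python
from collections import defaultdict


class ToyWorkspace:
    """Toy stand-in for a workspace: one value per variable name and time step."""

    def __init__(self):
        self._data = defaultdict(dict)  # name -> {t: value}

    def set(self, name, t, value):
        self._data[name][t] = value

    def get(self, name, t):
        return self._data[name][t]

    def __getitem__(self, name):
        # Return the whole history of a variable, ordered by time step
        steps = self._data[name]
        return [steps[t] for t in sorted(steps)]


ws = ToyWorkspace()
for t in range(3):
    ws.set("obs", t, [float(t), float(t) + 0.5])

history = ws["obs"]
print(history)  # [[0.0, 0.5], [1.0, 1.5], [2.0, 2.5]]
```

Reading a variable back gives one entry per time step, which is the layout to expect when the real workspace is queried further down.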

# Creating and running agents

To play with the BBRL model, we first create a simple ActionAgent.

In [None]:
class ActionAgent(Agent):
    # Create the action agent
    # This is a fake agent for illustration purposes
    # In a standard ActionAgent, there should be an architecture
    # to compute the action given the observation
    def __init__(self):
        super().__init__()

    def forward(self, t, **kwargs):
        # Read the observation written by the environment agent at time t
        obs = self.get(("obs", t))
        # A real agent would compute the action from obs; here it is random
        action = torch.rand(1)
        # Write the action into the workspace at time t
        self.set(("action", t), action)

Then we create an EnvAgent.

In [None]:
class EnvAgent(Agent):
  # Create the environment agent
  # This is a fake agent for illustration purposes
  # A standard EnvAgent would inherit from a GymAgent
  def __init__(self):
    super().__init__()

  def forward(self, t, **kwargs):
    if t == 0:
      # At the first step, the agent has not acted yet
      # A real GymAgent would call obs = reset()
      obs = torch.rand(2)
      reward = torch.randint(low=0, high=5, size=[1])
      done = torch.zeros(1, dtype=torch.bool)
    else:
      # Here, a real GymAgent would call obs, reward, done, info = step(action)
      action = self.get(("action", t-1))  # beware, we take the previous action
      obs = torch.rand(2)
      reward = torch.randint(low=0, high=5, size=[1])
      done = torch.zeros(1, dtype=torch.bool)
    self.set(("obs", t), obs)
    self.set(("reward", t), reward)
    self.set(("done", t), done)
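
The comments above mention the `reset()` and `step(action)` calls that a real GymAgent would make. As a rough illustration of that protocol, here is a hypothetical toy environment (`CoinFlipEnv` is invented for this sketch and is not a gym or BBRL class) that follows the same calling convention, with `step` returning the four values named in the comment.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment exposing gym-style reset() and step()."""

    def __init__(self, horizon=5, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return self.rng.random()  # initial observation

    def step(self, action):
        self.steps += 1
        obs = self.rng.random()
        reward = 1 if action == 1 else 0
        done = self.steps >= self.horizon
        return obs, reward, done, {}


env = CoinFlipEnv(horizon=3)
obs = env.reset()  # what the t == 0 branch stands in for
total_reward, done = 0, False
while not done:
    action = 1  # a real policy would compute this from obs
    obs, reward, done, info = env.step(action)  # what the t > 0 branch stands in for
    total_reward += reward
```

The fake EnvAgent skips all of this and writes random tensors instead, which is enough to exercise the interaction loop.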


We bind them together into a TemporalAgent.

In [None]:
action_agent = ActionAgent()
env_agent = EnvAgent()

# Compose both previous agents
composed_agent = Agents(env_agent, action_agent)
  
# Get a temporal agent that can be executed in a workspace
t_agent = TemporalAgent(composed_agent)
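
Conceptually, `Agents` runs its members in order at a given time step, and `TemporalAgent` repeats that over successive time steps. The sketch below mimics this behaviour with plain functions and a dictionary; every name in it is hypothetical, and it only illustrates the control flow, not how BBRL implements it.

```python
def run_sequence(agents, store, t):
    """Toy version of Agents: call each agent in order at time step t."""
    for agent in agents:
        agent(store, t)


def run_temporal(agents, store, t=0, n_steps=1):
    """Toy version of TemporalAgent: repeat the sequence for n_steps time steps."""
    for step in range(t, t + n_steps):
        run_sequence(agents, store, step)


def toy_env(store, t):
    # Reads the previous action (0 if there is none yet) and writes obs at step t
    previous_action = store.get(("action", t - 1), 0)
    store[("obs", t)] = previous_action + 1


def toy_policy(store, t):
    # Reads the observation written by the environment at the same step
    store[("action", t)] = 2 * store[("obs", t)]


store = {}
run_temporal([toy_env, toy_policy], store, t=0, n_steps=3)
print(store[("obs", 2)], store[("action", 2)])  # 7 14
```

The order matters: the environment has to write `obs` at step `t` before the policy reads it, which is why `env_agent` comes first in `Agents(env_agent, action_agent)`.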

And finally, we execute it in the workspace.

In [None]:
# We create a workspace
workspace = Workspace()

# The temporal agent will be run for 10 steps on this workspace
t_agent(workspace, t=0, n_steps=10)

# We retrieve the variables stored in the workspace
obs, action, reward, done = workspace["obs", "action", "reward", "done"]

# And we print them
print("obs:", obs)
print("action:", action)
print("reward:", reward)
print("done:", done)
# You should see that each variable has been recorded for the specified
# number of time steps (one row per time step)

obs: tensor([[0.4689, 0.4154],
        [0.4670, 0.5493],
        [0.6481, 0.6698],
        [0.0290, 0.6147],
        [0.5292, 0.7809],
        [0.2537, 0.4671],
        [0.8689, 0.9229],
        [0.5004, 0.0675],
        [0.5761, 0.1857],
        [0.6349, 0.9387]])
action: tensor([[0.1494],
        [0.5670],
        [0.2232],
        [0.6347],
        [0.9418],
        [0.1383],
        [0.3017],
        [0.8173],
        [0.0026],
        [0.3141]])
reward: tensor([[1],
        [4],
        [3],
        [0],
        [3],
        [4],
        [1],
        [3],
        [1],
        [2]])
done: tensor([[False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False]])


## What's next?

In [the next notebook](https://colab.research.google.com/drive/1Ui481r47fNHCQsQfKwdoNEVrEiqAEokh?usp=sharing), we will replace these simple random agents with real agents based on neural networks and a real environment: we will use a neural network ActionAgent and an RL environment from gym to write an elementary RL loop.