# Outlook

In this notebook, we will implement the REINFORCE algorithm using BBRL. To understand this code, you need [to know more about BBRL](https://colab.research.google.com/drive/1_yp-JKkxh_P8Yhctulqm0IrLbE41oK1p?usp=sharing):

-  You should first have a look at [the BBRL interaction model](https://colab.research.google.com/drive/1gSdkOBPkIQi_my9TtwJ-qWZQS0b2X7jt?usp=sharing), 

- then [a first true RL example](https://colab.research.google.com/drive/1Ui481r47fNHCQsQfKwdoNEVrEiqAEokh?usp=sharing) 

- and, most importantly, details about the [NoAutoResetGymAgent](https://colab.research.google.com/drive/1EX5O03mmWFp9wCL_Gb_-p08JktfiL2l5?usp=sharing).

The REINFORCE algorithm is explained in a series of 3 videos: [video 1](https://www.youtube.com/watch?v=R7ULMBXOQtE), [video 2](https://www.youtube.com/watch?v=dKUWto9B9WY) and [video 3](https://www.youtube.com/watch?v=GcJ9hl3T6x8). You can also read the corresponding slides: [slides1](http://pages.isir.upmc.fr/~sigaud/teach/ps/3_pg_derivation1.pdf), [slides2](http://pages.isir.upmc.fr/~sigaud/teach/ps/4_pg_derivation2.pdf), [slides3](http://pages.isir.upmc.fr/~sigaud/teach/ps/5_pg_derivation3.pdf).

## Installation and Imports

### Installation

The BBRL library is [here](https://github.com/osigaud/bbrl).

OmegaConf makes it possible to run the code without an explicit main: we just define the `run_reinforce(cfg)` function and then execute the long `params = {...}` variable at the bottom of this colab.

More precisely, the code is run by calling

`config=OmegaConf.create(params)`

`run_reinforce(config)`

at the very bottom of the colab, after starting tensorboard.

In [None]:
try:
    from easypip import easyimport
except ImportError:
    !pip install easypip
    from easypip import easyimport

import functools
import time

easyimport("importlib_metadata==4.13.0")

OmegaConf = easyimport("omegaconf").OmegaConf
bbrl = easyimport("bbrl")
import gym


### Imports

Below, we import standard python packages, pytorch packages and gym environments.

[OpenAI gym](https://gym.openai.com/) is a collection of benchmark environments to evaluate RL algorithms.

In [None]:
import copy
import time

import torch
import torch.nn as nn
import torch.nn.functional as F

import gym

### BBRL imports

In [None]:
from bbrl.agents.agent import Agent
from bbrl import get_arguments, get_class, instantiate_class

# The workspace is the main class in BBRL, this is where all data is collected and stored
from bbrl.workspace import Workspace

# Agents(agent1,agent2,agent3,...) executes the different agents one after the other
# TemporalAgent(agent) executes an agent over multiple timesteps in the workspace, 
# or until a given condition is reached
from bbrl.agents import Agents, RemoteAgent, TemporalAgent, PrintAgent

# The NoAutoResetGymAgent is an agent executing a batch of gym environments
# without auto-resetting. These agents produce multiple variables in the workspace: 
# 'env/env_obs', 'env/reward', 'env/timestep', 'env/done', 'env/initial_state', 'env/cumulated_reward'.
# At timestep t>0, these agents will read the 'action' variable in the workspace at time t - 1
from bbrl.agents.gymb import NoAutoResetGymAgent

## Definition of agents

The [REINFORCE](https://link.springer.com/content/pdf/10.1007/BF00992696.pdf) algorithm uses a stochastic policy and a baseline which is the value function. Thus we need an Actor agent, a Critic agent and an Environment agent. 
The actor agent is built on an intermediate ProbAgent, see [this notebook](https://colab.research.google.com/drive/1Ui481r47fNHCQsQfKwdoNEVrEiqAEokh?usp=sharing) for explanations about the ProbAgent, the ActorAgent and the environment agent.

As in [a previous notebook about DQN](https://colab.research.google.com/drive/1raeuB6uUVUpl-4PLArtiAoGnXj0sGjSV?usp=sharing), the neural networks we build are multi-layer perceptrons.

In [None]:
def build_mlp(sizes, activation, output_activation=nn.Identity()):
    layers = []
    for j in range(len(sizes) - 1):
        act = activation if j < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j + 1]), act]
    return nn.Sequential(*layers)

In [None]:
class ProbAgent(Agent):
    """Computes the distribution $p(a_t|s_t)$"""
    
    def __init__(self, state_dim, hidden_layers, n_action):
        super().__init__(name="prob_agent")
        self.model = build_mlp(
            [state_dim] + list(hidden_layers) + [n_action], activation=nn.ReLU()
        )

    def forward(self, t, **kwargs):
        # Get $s_t$
        observation = self.get(("env/env_obs", t))
        # Compute the distribution over actions
        scores = self.model(observation)
        action_probs = torch.softmax(scores, dim=-1)
        assert not torch.any(torch.isnan(action_probs)), "NaN Here"
        
        self.set(("action_probs", t), action_probs)
        entropy = torch.distributions.Categorical(action_probs).entropy()
        self.set(("entropy", t), entropy)



In [None]:
class ActorAgent(Agent):
    """Choose an action, either according to p(a_t|s_t) when stochastic is true,
       or with argmax if false.
    """
    def __init__(self):
        super().__init__()

    def forward(self, t, stochastic, **kwargs):
        probs = self.get(("action_probs", t))
        if stochastic:
            action = torch.distributions.Categorical(probs).sample()
        else:
            action = probs.argmax(1)

        self.set(("action", t), action)
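
The two action-selection modes used by the ActorAgent can be tried in isolation. The following sketch uses hand-written probabilities (an assumption made for illustration, they do not come from a trained network) for a batch of three environments and two actions.

```python
import torch

torch.manual_seed(0)

# Hypothetical action probabilities: 3 environments, 2 actions
probs = torch.tensor([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])

dist = torch.distributions.Categorical(probs)
sampled = dist.sample()        # stochastic choice, one action index per environment
greedy = probs.argmax(dim=1)   # deterministic choice, most probable action
entropy = dist.entropy()       # one entropy value per environment
```

The entropy is largest for the row closest to uniform, which is why it is often logged as an indicator of how much the policy still explores.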

In [None]:
def make_env(env_name):
    return gym.make(env_name)

### VAgent

The VAgent is a neural network which takes an observation as input and whose output is the value $V(s)$ of this observation.

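The `squeeze(-1)` removes the last dimension of the tensor. The final linear layer outputs one value per observation, with shape `(n_envs, 1)`; squeezing gives shape `(n_envs,)`, which matches the shape of the other per-environment variables such as `env/reward` and avoids unintended broadcasting when they are combined.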
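
To see the effect of `squeeze(-1)` on shapes, here is a small check. The single linear layer `critic_net` is a hypothetical stand-in for the critic MLP, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

n_envs, state_dim = 8, 4
critic_net = nn.Linear(state_dim, 1)  # stand-in for the critic MLP

observation = torch.zeros(n_envs, state_dim)
raw = critic_net(observation)   # shape (n_envs, 1)
value = raw.squeeze(-1)         # shape (n_envs,)

reward = torch.ones(n_envs)
# Without squeeze, broadcasting silently produces an (n_envs, n_envs) matrix
bad = reward + raw
good = reward + value
```

The `(n_envs, n_envs)` result raises no error, so a missing `squeeze` typically shows up as a loss that is computed without complaint but is wrong.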

In [None]:
class VAgent(Agent):
    def __init__(self, state_dim, hidden_layers):
        super().__init__()
        self.is_q_function = False
        self.model = build_mlp(
            [state_dim] + list(hidden_layers) + [1], activation=nn.ReLU()
        )

    def forward(self, t, **kwargs):
        observation = self.get(("env/env_obs", t))
        critic = self.model(observation).squeeze(-1)
        self.set(("v_value", t), critic)

### Create the REINFORCE agent

The code below is rather straightforward. Note that we have not defined anything about data collection, using a RolloutBuffer or something to store the n_step return so far. This will come inside the training loop below.

Interestingly, the loop between the policy and the environment is first defined as a collection of agents, and then embedded into a single TemporalAgent.
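
The composition idea can be illustrated without BBRL. The classes below are hypothetical toy versions written for this explanation only. They do not reproduce BBRL's implementation, but they show how a sequence of agents is wrapped into a loop over time that stops on a variable such as `env/done`.

```python
class ToyEnv:
    """Writes the timestep and reports done after `horizon` steps."""
    def __init__(self, horizon):
        self.horizon = horizon

    def __call__(self, workspace, t):
        workspace[("env/timestep", t)] = t
        workspace[("env/done", t)] = t >= self.horizon


class ToyPolicy:
    """Writes an action that depends on what the environment wrote at time t."""
    def __call__(self, workspace, t):
        workspace[("action", t)] = workspace[("env/timestep", t)] % 2


class ToyAgents:
    """Runs several agents one after the other at a single time step."""
    def __init__(self, *agents):
        self.agents = agents

    def __call__(self, workspace, t):
        for agent in self.agents:
            agent(workspace, t)


class ToyTemporalAgent:
    """Repeats an agent over time steps until `stop_variable` becomes true."""
    def __init__(self, agent):
        self.agent = agent

    def __call__(self, workspace, t=0, stop_variable="env/done"):
        while True:
            self.agent(workspace, t)
            if workspace[(stop_variable, t)]:
                return t
            t += 1


workspace = {}  # toy workspace: maps (variable name, t) to a value
loop = ToyTemporalAgent(ToyAgents(ToyEnv(horizon=3), ToyPolicy()))
last_t = loop(workspace)
actions = [workspace[("action", t)] for t in range(last_t + 1)]
```

After the call, the toy workspace contains the whole trajectory, which mirrors how the training loop later reads complete episodes from the workspace.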

In [None]:
def create_reinforce_agent(cfg, env_agent):
    obs_size, act_size = env_agent.get_obs_and_actions_sizes()
    proba_agent = ProbAgent(obs_size, cfg.algorithm.architecture.actor_hidden_size, act_size)
    action_agent = ActorAgent()
    # print_agent = PrintAgent()
    tr_agent = Agents(env_agent, proba_agent, action_agent)  # , print_agent)

    critic_agent = TemporalAgent(
        VAgent(obs_size, cfg.algorithm.architecture.critic_hidden_size)
    )

    # Get an agent that is executed on a complete workspace
    train_agent = TemporalAgent(tr_agent)
    train_agent.seed(cfg.algorithm.seed)
    return train_agent, proba_agent, critic_agent  # , print_agent

### The Logger class

The logger class below is not generic, it is specifically designed in the context of this REINFORCE colab.

The logger parameters are defined below in `params = { "logger":{ ...`

In this colab, the logger is defined as `bbrl.utils.logger.TFLogger` so as to use a tensorboard visualisation (see the parameters part below).
Note that the BBRL Logger also saves the log in a readable format, so that you can use `Logger.read_directories(...)` to read multiple logs, create a dataframe, and analyze many experiments afterward, in a notebook for instance. 

The code for the different kinds of loggers is available in the [bbrl/utils/logger.py](https://github.com/osigaud/bbrl/blob/master/bbrl/utils/logger.py) file.

Having logging provided under the hood is one of the features where using RL libraries like BBRL will allow you to save time.

`instantiate_class` is an inner BBRL mechanism. The `instantiate_class` function is available in the [`bbrl/__init__.py`](https://github.com/osigaud/bbrl/blob/master/bbrl/__init__.py) file.
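
To give an idea of what instantiating a class from a configuration entry involves, here is a simplified sketch. The `toy_*` functions are hypothetical and written for this explanation, they are not BBRL's code. The assumption is only that the `classname` entry holds a dotted path and that the remaining entries are passed as keyword arguments.

```python
import importlib


def toy_get_class(cfg):
    """Resolve the dotted path stored under 'classname' to a Python object."""
    module_path, _, attr_name = cfg["classname"].rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


def toy_get_arguments(cfg):
    """Return every entry except 'classname' as keyword arguments."""
    return {key: value for key, value in cfg.items() if key != "classname"}


def toy_instantiate_class(cfg):
    """Build an object from its class path and its keyword arguments."""
    return toy_get_class(cfg)(**toy_get_arguments(cfg))


# Example with a standard-library class instead of a logger
half = toy_instantiate_class(
    {"classname": "fractions.Fraction", "numerator": 1, "denominator": 2}
)
```

The same pattern explains why the `logger`, `gym_env` and `optimizer` entries of the parameters all carry a `classname` key next to their own arguments.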

In [None]:
class Logger():
      def __init__(self, cfg):
        self.logger = instantiate_class(cfg.logger)

      def add_log(self, log_string, loss, epoch):
        self.logger.add_scalar(log_string, loss.item(), epoch)

      # Log losses
      def log_losses(self, epoch, critic_loss, actor_loss):
        self.add_log("critic_loss", critic_loss, epoch)
        self.add_log("actor_loss", actor_loss, epoch)



### Setup the optimizer

We use a single optimizer to tune the parameters of the actor (in the prob_agent part) and the critic (in the critic_agent part). It would be possible to have two optimizers which would work separately on the parameters of each component agent, but it would be more complicated because updating the actor requires the gradient of the critic.

In [None]:
# Configure the optimizer over the actor and critic parameters
def setup_optimizer(cfg, prob_agent, critic_agent):
    optimizer_args = get_arguments(cfg.optimizer)
    parameters = nn.Sequential(prob_agent, critic_agent).parameters()
    optimizer = get_class(cfg.optimizer)(parameters, **optimizer_args)
    return optimizer

### Compute critic loss

Note the `critic[1:].detach()` in the computation of the temporal difference target. The idea is that we compute this target as a function of $V(s_{t+1})$, but we do not want to apply gradient descent on this $V(s_{t+1})$, we will only apply gradient descent to the $V(s_t)$ according to this target value.

In practice, `x.detach()` returns a tensor that shares the values of `x` but is cut off from its computation graph, so no gradient flows back through it.

Note also the trick to deal with terminal states. If the state is terminal, $V(s_{t+1})$ does not make sense. Thus we need to ignore this term. So we multiply the term by `must_bootstrap`: if `must_bootstrap` is True (converted into an int, it becomes a 1), we get the term. If `must_bootstrap` is False (=0), we are at a terminal state, so we ignore the term. This trick is used in many RL libraries, e.g. SB3.
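
The effect of `detach()` and of the `must_bootstrap` mask can be checked on a toy trajectory. In this sketch, the values, rewards and the discount factor `gamma` are made up for illustration. There is one environment and three time steps, and the last step is terminal.

```python
import torch

gamma = 0.9  # assumed discount factor for this toy example

# Shapes are (time, n_envs) with n_envs = 1
reward = torch.tensor([[1.0], [1.0], [1.0]])
value = torch.tensor([[0.5], [0.4], [0.3]], requires_grad=True)
must_bootstrap = torch.tensor([[True], [True], [False]])

# Same target formula as in compute_critic_loss
target = reward[:-1] + gamma * value[1:].detach() * must_bootstrap[1:].int()

# Squared error between target and V(s_t), then back-propagation
loss = ((target - value[:-1]) ** 2).sum()
loss.backward()
```

The second target is just the reward, since the mask removes $V(s_{t+1})$ at the terminal step, and the gradient on the last value is zero because it only entered the loss through the detached target.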

In [None]:
def compute_critic_loss(cfg, reward, must_bootstrap, critic):
    # Compute temporal difference
    target = reward[:-1] + cfg.algorithm.discount_factor * critic[1:].detach() * must_bootstrap[1:].int()
    td = (target - critic[:-1]) * must_bootstrap[1:].int()

    # Compute critic loss
    td_error = td ** 2
    critic_loss = td_error.mean()
    return critic_loss, td

## Main training loop

### First algorithm: summing all the rewards along an episode

The most basic variant of the Policy Gradient algorithms just sums all the rewards along an episode.

This is implemented with the `apply_sum` function below.

In [None]:
def apply_sum(cfg, reward, v_value):
    reward_sum = reward.sum(axis=0)
    for i in range(len(reward)):
        reward[i] = reward_sum
    return reward

### Main loop

Note the `optimizer.zero_grad()`, `loss.backward()` and `optimizer.step()` lines. Several things need to be explained here.
- `optimizer.zero_grad()` is necessary to cancel all the gradients computed at the previous iterations.
- We sum all the losses, both for the critic and the actor, before applying back-propagation with `loss.backward()`. At first glance, summing these losses may look weird, as the actor and the critic receive different updates with different parts of the loss. This mechanism relies on the central property of tensor manipulation libraries like TensorFlow and PyTorch. In PyTorch, each loss tensor comes with its own computation graph for back-propagating the gradient, so that when you back-propagate the summed loss, each term only contributes gradients to the parameters it was computed from.
These mechanisms are partly explained [here](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html).
- Since the optimizer has been set to work with both the actor and critic parameters, `optimizer.step()` will optimize both agents and PyTorch ensures that each will receive its own part of the gradient.
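
The claim about summed losses can be verified on a small example. The two linear layers below are hypothetical stand-ins for the actor and critic networks, and the losses are arbitrary. The point is only that back-propagating the sum gives the actor the same gradient as back-propagating its own loss alone.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
actor = nn.Linear(4, 2)    # stand-in for the policy network
critic = nn.Linear(4, 1)   # stand-in for the value network
x = torch.randn(5, 4)

actor_loss = actor(x).pow(2).mean()
critic_loss = critic(x).pow(2).mean()

# Gradient of the actor loss alone, kept as a reference
(ref_grad,) = torch.autograd.grad(actor_loss, actor.weight, retain_graph=True)

# Back-propagating the sum routes each term to its own parameters
(actor_loss + critic_loss).backward()
same = torch.allclose(actor.weight.grad, ref_grad)
```

Here `same` is expected to be true because the critic loss does not depend on the actor parameters, so it adds nothing to their gradient.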

In [None]:
def run_reinforce(cfg, *, compute_reward=apply_sum, compute_critic_loss=compute_critic_loss):
    """Run Reinforce
    
    :param compute_reward: 
        The function called to compute the reward 
        for REINFORCE at each time step (defaults to apply_sum)
        
    :param compute_critic_loss: 
        Function that specifies how to compute the critic loss
    """
    logger = Logger(cfg)
    best_reward = -10e10

    # 2) Create the environment agent
    env_agent = NoAutoResetGymAgent(
        get_class(cfg.gym_env),
        get_arguments(cfg.gym_env),
        cfg.algorithm.n_envs,
        cfg.algorithm.seed,
    )

    reinforce_agent, proba_agent, critic_agent = create_reinforce_agent(cfg, env_agent)

    # 7) Configure the optimizer over the actor and critic parameters
    optimizer = setup_optimizer(cfg, reinforce_agent, critic_agent)

    # 8) Training loop
    nb_steps = 0

    for episode in range(cfg.algorithm.nb_episodes):
        # print_agent.reset()
        # Execute the agent on the workspace to sample complete episodes
        # Since not all the variables of workspace will be overwritten, it is better to clear the workspace
        # Configure the workspace to the right dimension.
        train_workspace = Workspace()

        reinforce_agent(train_workspace, stochastic=True, t=0, stop_variable="env/done")

        # Get relevant tensors (size are timestep x n_envs x ....)
        obs, done, truncated, action_probs, reward, action = train_workspace[
            "env/env_obs",
            "env/done",
            "env/truncated",
            "action_probs",
            "env/reward",
            "action",
        ]
        critic_agent(train_workspace, stop_variable="env/done")
        v_value = train_workspace["v_value"]

        for i in range(cfg.algorithm.n_envs):
            nb_steps += len(action[:, i])

        # Determines whether values of the critic should be propagated
        # True if the episode reached a time limit or if the task was not done
        # See https://colab.research.google.com/drive/1W9Y-3fa6LsPeR6cBC1vgwBjKfgMwZvP5?usp=sharing
        must_bootstrap = torch.logical_or(~done, truncated)

        critic_loss, td = compute_critic_loss(cfg, reward, must_bootstrap, v_value)

        reward = compute_reward(cfg, reward, v_value)

        # Take the log probability of the actions performed
        action = action.unsqueeze(-1)
        action_logp = torch.gather(action_probs.squeeze(), dim=2, index=action).squeeze().log()
        
        # Compute the policy gradient loss based on the log probability of the actions performed
        actor_loss = action_logp * reward.detach() * must_bootstrap.int()
        actor_loss = actor_loss.mean()

        # Log losses
        logger.log_losses(nb_steps, critic_loss, actor_loss)

        loss = (
            cfg.algorithm.critic_coef * critic_loss
            - cfg.algorithm.actor_coef * actor_loss
        )


        # Compute the cumulated reward on final_state
        cumulated_reward = train_workspace["env/cumulated_reward"][-1]
        mean = cumulated_reward.mean()
        print(f"episode: {episode}, reward: {mean}")
        logger.add_log("reward", mean, nb_steps)
        

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
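
The log-probability step of the loop above relies on `torch.gather` to pick, for each time step and environment, the probability of the action that was actually taken. Here is a sketch on a hand-written rollout (2 time steps, 3 environments, 2 actions; all numbers are made up for illustration).

```python
import torch

# Hypothetical rollout, shape (time, n_envs, n_actions)
action_probs = torch.tensor(
    [[[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
     [[0.3, 0.7], [0.6, 0.4], [0.1, 0.9]]]
)
action = torch.tensor([[0, 1, 1], [1, 0, 0]])  # shape (time, n_envs)

# The index needs the same number of dimensions as action_probs
index = action.unsqueeze(-1)                                        # (time, n_envs, 1)
chosen = torch.gather(action_probs, dim=2, index=index).squeeze(-1)  # (time, n_envs)
action_logp = chosen.log()
```

This sketch uses `squeeze(-1)` rather than a bare `squeeze()`, which only removes the last dimension and therefore keeps the `(time, n_envs)` shape even when there is a single environment.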



## Definition of the parameters

The logger is defined as `bbrl.utils.logger.TFLogger` so as to use a tensorboard visualisation.

In [None]:
params={
  "logger":{
    "classname": "bbrl.utils.logger.TFLogger",
    "log_dir": "./tblogs/reinforce-" + str(time.time()),
    "cache_size": 10000,
    "every_n_seconds": 10,
    "verbose": False,    
    },

  "algorithm":{
    "seed": 1,
    "n_envs": 8,
    "nb_episodes": 1000,
    "discount_factor": 0.95,
    "critic_coef": 1.0,
    "actor_coef": 1.0,
    "architecture":{
        "actor_hidden_size": [32],
        "critic_hidden_size": [36],
    },
  },

  "gym_env":{
    "classname": "__main__.make_env",
    "env_name": "CartPole-v1",
  },
  "optimizer":
  {
    "classname": "torch.optim.Adam",
    "lr": 0.001,
  }
}



### Launching tensorboard to visualize the results

In [None]:
# For Colab - otherwise, it is easier and better to launch tensorboard from
# the terminal
if get_ipython().__class__.__module__ == "google.colab._shell":
    %load_ext tensorboard
    %tensorboard --logdir ./tblogs
else:
    import sys
    import os
    import os.path as osp
    print(f'''Launch tensorboard from the shell:\n{osp.dirname(sys.executable)}/tensorboard --logdir="{os.getcwd()}/tblogs"''')

In [None]:
config=OmegaConf.create(params)
torch.manual_seed(config.algorithm.seed)
run_reinforce(config)

## Exercises

### First algorithm: summing discounted rewards

As explained in the [second video](https://www.youtube.com/watch?v=dKUWto9B9WY) and [the corresponding slides](http://pages.isir.upmc.fr/~sigaud/teach/ps/4_pg_derivation2.pdf), using a discounted reward after the current step and ignoring the rewards before the current step results in lower variance.

By taking inspiration from the `apply_sum()` function above, code a function `apply_discounted_sum()` that computes the sum of discounted rewards from immediate rewards.

Two hints:
- you should proceed backwards, starting from the final step of the episode and storing the previous sum into a register
- you need the discount factor as an input to your function.
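
As a small worked example of the quantity to compute (the numbers are chosen purely for illustration, and the notation $G_t$ for the discounted sum from step $t$ is an assumption of this sketch), take $\gamma = 0.9$ and an episode of three steps with a reward of 1 at each step. Proceeding backwards with $G_t = r_t + \gamma G_{t+1}$ gives:

$$
G_3 = 1, \qquad
G_2 = 1 + 0.9 \times 1 = 1.9, \qquad
G_1 = 1 + 0.9 \times 1.9 = 2.71
$$

By comparison, `apply_sum()` would assign the same value, 3, to every step of this episode.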

In [None]:
def apply_discounted_sum(cfg, reward, v_value):
    # To be completed...
    assert False, 'Not implemented yet'


Then compare the performance of this algorithm to that of the previous approach where the rewards were just summed up.

In [None]:
torch.manual_seed(config.algorithm.seed)

config=OmegaConf.create(params)
config.logger.log_dir = "./tblogs/reinforce_dreward-" + str(time.time())
run_reinforce(config, compute_reward=apply_discounted_sum)

### Second algorithm: Focus on learning the baseline value

The `compute_critic_loss()` function above uses the Temporal Difference approach to critic estimation. In this part, we will compare it to using the Monte Carlo estimation approach.

As explained in [this video](https://www.youtube.com/watch?v=GcJ9hl3T6x8) and [these slides](http://pages.isir.upmc.fr/~sigaud/teach/ps/5_pg_derivation3.pdf), the MC estimation approach uses the following equation:

$$\phi_{j+1} = \mathop{\mathrm{argmin}}_{\phi_j} 
    \frac{1}{m\times H}\sum_{i=1}^m 
    \sum_{t=1}^H 
        \left(
            \left(\sum_{k=t}^H \gamma^{k-t} r(s_k^{(i)},a_k^{(i)}) \right) - \hat{V}^\pi_{\phi_j}(s_t^{(i)})
        \right)^2
$$

The innermost sum of discounted rewards exactly corresponds to the computation of the `apply_discounted_sum()` function. The rest just consists in computing the mean of the squared differences (also known as the Mean Squared Error, or MSE) over the $m \times H$ samples ($m$ episodes of length $H$) that we have collected.

From the above information, create a `compute_critic_loss_mc()` function which must be called after `apply_discounted_sum()` on the reward.

In [None]:
def compute_critic_loss_mc(cfg, reward, must_bootstrap, v_value):
    # To be completed...
    assert False, 'Not implemented yet'


Then compare the learning dynamics and the learned critic using the Temporal Difference estimation approach and the Monte Carlo estimation approach.

In [None]:
torch.manual_seed(config.algorithm.seed)
config=OmegaConf.create(params)
config.algorithm.discount_factor = 0.99
config.logger.log_dir = "./tblogs/reinforce_dreward_mc_critic-" + str(time.time())

run_reinforce(config, compute_reward=apply_discounted_sum, compute_critic_loss=compute_critic_loss_mc)

### Third algorithm: discounted sum minus baseline

From [this video](https://www.youtube.com/watch?v=GcJ9hl3T6x8) and [these slides](http://pages.isir.upmc.fr/~sigaud/teach/ps/5_pg_derivation3.pdf), we know that we can subtract a baseline from the gradient calculation methods studied above, and that the optimal baseline to reduce the variance is the value function.

Code an `apply_discounted_sum_minus_baseline()` function, using the critic learned simultaneously with the policy.

In [None]:
def apply_discounted_sum_minus_baseline(cfg, reward, v_value):
    # To be completed...
    assert False, 'Not implemented yet'



torch.manual_seed(config.algorithm.seed)
config=OmegaConf.create(params)
config.logger.log_dir = "./tblogs/reinforce_reinforce_critic-" + str(time.time())

run_reinforce(config, compute_reward=apply_discounted_sum_minus_baseline, compute_critic_loss=compute_critic_loss_mc)

Most probably, this will not work well, as initially the learned critic is a poor estimate of the true $V(s)$. Instead, load an already trained critic that you have saved after convergence from a previous run, and see if it works better.

Loading and saving a network or a BBRL agent can easily be performed using `agent.save(filename)` and `agent.load(filename)`.
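
If you prefer to store only the weights of the critic network, plain PyTorch offers the `state_dict` mechanism. The sketch below is an alternative to the BBRL calls mentioned above, not a description of them. The network built by `make_critic` is a hypothetical stand-in with arbitrary sizes, and an in-memory buffer replaces the file so that the example is self-contained.

```python
import io

import torch
import torch.nn as nn


def make_critic():
    # Stand-in for the critic network; sizes are arbitrary
    return nn.Sequential(nn.Linear(4, 36), nn.ReLU(), nn.Linear(36, 1))


trained = make_critic()
buffer = io.BytesIO()  # a file path such as "critic.pt" can be used instead
torch.save(trained.state_dict(), buffer)

# Later: rebuild the same architecture, then load the saved weights into it
buffer.seek(0)
restored = make_critic()
restored.load_state_dict(torch.load(buffer))

x = torch.randn(3, 4)
same_output = torch.allclose(trained(x), restored(x))
```

With this approach, the architecture must be rebuilt with the same layer sizes before loading, since only the parameters are stored.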

# Warning

Be cautious with the use of ProbAgent with just a hidden layer, ProbAgent with build_mlp, and DiscreteActor. Try to be progressive...