# Outlook

In this notebook, using BBRL, we code a version of the DQN algorithm with a
replay buffer and a target network, using the AutoReset approach.

To understand this code, you need to know more about 
[the BBRL interaction model](https://colab.research.google.com/drive/1_yp-JKkxh_P8Yhctulqm0IrLbE41oK1p?usp=sharing).
Then you should run [a first example](https://colab.research.google.com/drive/1Ui481r47fNHCQsQfKwdoNEVrEiqAEokh?usp=sharing)
to see how agents interact.

You also need to understand [details about
autoreset=True](https://colab.research.google.com/drive/1W9Y-3fa6LsPeR6cBC1vgwBjKfgMwZvP5?usp=sharing).

The DQN algorithm is explained in [this
video](https://www.youtube.com/watch?v=CXwvOMJujZk) and you can also read [the
corresponding slides](http://pages.isir.upmc.fr/~sigaud/teach/dqn.pdf).

In this notebook, you will learn how to modify the previous notebook:

- to use a replay buffer and an environment that resets
- to use a target network for $Q$
- to use a better estimation for the maximum (Double-DQN)

## Installation and Imports

### Installation

The BBRL library is [here](https://github.com/osigaud/bbrl).

We use OmegaConf so that it is enough to define the
`run_dqn(cfg, compute_critic_loss)` function and then execute a long
`params = {...}` variable at the bottom of this colab: the code is run with
these parameters without calling an explicit main.

More precisely, the code is run by calling

`config=OmegaConf.create(params)`

`run_dqn(config, compute_critic_loss)`

at the very bottom of the colab, after starting tensorboard.

Below, we import standard python packages, pytorch packages and gymnasium
environments.

In [3]:
# Installs the necessary Python and system libraries
try:
    from easypip import easyimport, easyinstall, is_notebook
except ModuleNotFoundError as e:
    get_ipython().run_line_magic("pip", "install easypip")
    from easypip import easyimport, easyinstall, is_notebook

easyinstall("bbrl>=0.2.2")
easyinstall("swig")
easyinstall("bbrl_gymnasium>=0.2.0")
easyinstall("bbrl_gymnasium[box2d]")
easyinstall("bbrl_gymnasium[classic_control]")
easyinstall("tensorboard")
easyinstall("moviepy")
easyinstall("box2d-kengz")

[easypip] Installing bbrl_gymnasium>=0.2.0


[easypip] Installing bbrl_gymnasium[box2d]
[easypip] Installing bbrl_gymnasium[classic_control]


In [4]:
import os
import sys
from pathlib import Path
import math

from moviepy.editor import ipython_display as video_display
import time
from tqdm.auto import tqdm
from typing import Tuple, Optional
from functools import partial

from omegaconf import OmegaConf
import torch
import bbrl_gymnasium

import copy
from abc import abstractmethod, ABC
import torch.nn as nn
import torch.nn.functional as F
from time import strftime
OmegaConf.register_new_resolver(
    "current_time", lambda: strftime("%Y%m%d-%H%M%S"), replace=True
)

In [5]:
# Imports all the necessary classes and functions from BBRL
from bbrl.agents.agent import Agent
from bbrl import get_arguments, get_class, instantiate_class
# The workspace is the main class in BBRL, this is where all data is collected and stored
from bbrl.workspace import Workspace

# Agents(agent1,agent2,agent3,...) executes the different agents one after the other
# TemporalAgent(agent) executes an agent over multiple timesteps in the workspace, 
# or until a given condition is reached
from bbrl.agents import Agents, TemporalAgent

# ParallelGymAgent is an agent able to execute a batch of gymnasium environments
# with auto-resetting. These agents produce multiple variables in the workspace:
# 'env/env_obs', 'env/reward', 'env/timestep', 'env/terminated',
# 'env/truncated', 'env/done', 'env/cumulated_reward', ...
#
# When called at timestep t=0, the environments are automatically reset. At
# timestep t>0, these agents will read the 'action' variable in the workspace at
# time t - 1
from bbrl.agents.gymnasium import GymAgent, ParallelGymAgent, make_env, record_video

# Replay buffers are useful to store past transitions when training
from bbrl.utils.replay_buffer import ReplayBuffer

In [6]:
# Utility function for launching tensorboard
# For Colab - otherwise, it is easier and better to launch tensorboard from
# the terminal
def setup_tensorboard(path):
    path = Path(path)
    answer = ""
    if is_notebook():
        if get_ipython().__class__.__module__ == "google.colab._shell":
            answer = "y"
        while answer not in ["y", "n"]:
                answer = input(f"Do you want to launch tensorboard in this notebook [y/n] ").lower()

    if answer == "y":
        get_ipython().run_line_magic("load_ext", "tensorboard")
        get_ipython().run_line_magic("tensorboard", f"--logdir {path.absolute()}")
    else:
        import sys
        import os
        import os.path as osp
        print(f"Launch tensorboard from the shell:\n{osp.dirname(sys.executable)}/tensorboard --logdir={path.absolute()}")

As before, we start with a Random Agent and a single instance of the CartPole environment

In [7]:
# We deal with a single environment (random seed 2139)

env_agent = ParallelGymAgent(partial(make_env, env_name='CartPole-v1'), 1).seed(2139)
obs_size, action_dim = env_agent.get_obs_and_actions_sizes()
print(f"Environment: observation space in R^{obs_size} and action space R^{action_dim}")

class RandomAgent(Agent):
    def __init__(self, action_dim):
        super().__init__()
        self.action_dim = action_dim

    def forward(self, t: int, choose_action=True, **kwargs):
        """An Agent can use self.workspace"""
        obs = self.get(("env/env_obs", t))
        action = torch.randint(0, self.action_dim, (len(obs), ))
        self.set(("action", t), action)

# Each agent will be run (in the order given when constructing Agents)
agents = Agents(env_agent, RandomAgent(action_dim))
t_agents = TemporalAgent(agents)

Environment: observation space in R^4 and action space R^2


Let us have a closer look at the content of the workspace

In [8]:
# Creates a new workspace
workspace = Workspace() 
t_agents(workspace, stop_variable="env/done")

# We get the transitions: each tensor is transformed so
# that: 
# - we have the value at time step t and t+1 (so all the tensors first dimension have a size of 2)
# - there is no distinction between the different environments (here, there is just one environment run in parallel to make it easy)
transitions = workspace.get_transitions()

# You can see that each pair of consecutive observations in the transitions can be found in the workspace
display("Observations (first 3)", workspace["env/env_obs"][:3, 0])

display("Transitions of actions (first 3)")
for t in range(3):
    display(f'(s_{t}, s_{t+1})')
    display(transitions["env/env_obs"][:, t])

'Observations (first 3)'

tensor([[-0.0471,  0.0265,  0.0220, -0.0336],
        [-0.0466,  0.2213,  0.0214, -0.3192],
        [-0.0422,  0.4162,  0.0150, -0.6051]])

'Transitions of actions (first 3)'

'(s_0, s_1)'

tensor([[-0.0471,  0.0265,  0.0220, -0.0336],
        [-0.0466,  0.2213,  0.0214, -0.3192]])

'(s_1, s_2)'

tensor([[-0.0466,  0.2213,  0.0214, -0.3192],
        [-0.0422,  0.4162,  0.0150, -0.6051]])

'(s_2, s_3)'

tensor([[-0.0422,  0.4162,  0.0150, -0.6051],
        [-0.0338,  0.6111,  0.0029, -0.8930]])

Note that if we were using more than one environment (say N), we would have to look at every N-th entry of the transitions to follow a given environment, since transitions are stored one environment after the other.

## The replay buffer

Unlike in the previous case, we use a replay buffer that stores a
set of transitions $(s_t, a_t, r_t, s_{t+1})$.
More precisely, the replay buffer keeps slices `[:, i, ...]` of the transition
workspace (here, at most 100 transitions).
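
To make the principle concrete, here is a minimal plain-Python sketch of a bounded replay buffer. It is not the BBRL `ReplayBuffer` implementation: the class `TinyReplayBuffer` and its methods are hypothetical, and only illustrate the idea of a fixed capacity where the oldest transitions are dropped first and batches are sampled uniformly.

```python
import random
from collections import deque


class TinyReplayBuffer:
    """Hypothetical illustration, not the BBRL ReplayBuffer."""

    def __init__(self, max_size):
        # A deque with maxlen silently drops the oldest item once it is full
        self.storage = deque(maxlen=max_size)

    def put(self, transition):
        # A transition is a tuple (s_t, a_t, r, s_{t+1})
        self.storage.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement among the stored transitions
        return rng.sample(list(self.storage), batch_size)

    def size(self):
        return len(self.storage)


buffer = TinyReplayBuffer(max_size=100)
for t in range(250):
    # Dummy transitions where states are just integers
    buffer.put((t, 0, 1.0, t + 1))

batch = buffer.sample(3)
```

After 250 insertions into a buffer of capacity 100, only the 100 most recent transitions remain available for sampling.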

In [9]:
rb = ReplayBuffer(max_size=100)

# We add the transitions to the buffer....
rb.put(transitions)

# ... and sample from them: here we get 3 pairs (s_t, s_{t+1})
rb.get_shuffled(3)["env/env_obs"]

tensor([[[-4.3540e-02, -5.5535e-01,  5.2399e-04,  7.6790e-01],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00],
         [-8.8761e-03,  2.6261e-02, -3.3184e-02, -2.7407e-02]],

        [[-5.4647e-02, -7.5048e-01,  1.5882e-02,  1.0608e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00],
         [-8.3509e-03, -1.6837e-01, -3.3732e-02,  2.5462e-01]]])

A transition workspace is still a workspace. This is quite
handy since each transition can be seen as a mini-episode of two time steps;
we can use our agents on it:

In [10]:
# Just as a reference

display(transitions["action"])

t_random_agent = TemporalAgent(RandomAgent(action_dim))
t_random_agent(transitions, t=0, n_steps=2)

# Here, the action tensor will have been overwritten by the new actions
display(transitions["action"])

tensor([[1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
         0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1,
         1, 0, 0, 0, 1],
        [1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0,
         0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
         0, 0, 0, 1, 0]])

tensor([[0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0,
         1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
         1, 1, 0, 0, 0],
        [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1,
         0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1,
         1, 0, 1, 1, 1]])

## Definition of agents

### The critic agent

The [DQN](https://daiwk.github.io/assets/dqn.pdf) algorithm is a critic only
algorithm. Thus we just need a Critic agent (which will also be used to output
actions) and an Environment agent. We reuse the `DiscreteQAgent` class that we
have already explained in the previous notebook.

The function below builds a multi-layer perceptron where the size of each layer is given in the `sizes` list.
We also specify the activation function of the neurons at each layer and, optionally, a different activation function for the final layer.
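
As a plain-Python illustration (no PyTorch involved), the snippet below shows how a `sizes` list maps to layers and activations. The concrete values in `sizes` are only an example matching CartPole with two hidden layers of 64 units.

```python
sizes = [4, 64, 64, 2]  # observation size, two hidden layers, number of actions

# Each consecutive pair (n_in, n_out) becomes one linear layer
layer_shapes = list(zip(sizes[:-1], sizes[1:]))

# The hidden activation follows every layer except the last one,
# which is followed by the output activation
activations = ["activation"] * (len(layer_shapes) - 1) + ["output_activation"]

for (n_in, n_out), act in zip(layer_shapes, activations):
    print(f"Linear({n_in}, {n_out}) -> {act}")
```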

In [11]:
import torch.nn as nn
def build_mlp(sizes, activation, output_activation=nn.Identity()):
    r"""Helper function to build a multi-layer perceptron (function from $\mathbb R^n$ to $\mathbb R^p$)
    
    Args:
        sizes (List[int]): the number of neurons at each layer
        activation (nn.Module): a PyTorch activation function (after each layer but the last)
        output_activation (nn.Module): a PyTorch activation function (last layer)
    """
    layers = []
    for j in range(len(sizes) - 1):
        act = activation if j < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j + 1]), act]
    return nn.Sequential(*layers)

In [27]:
class DiscreteQAgent(Agent):
    def __init__(self, state_dim, hidden_layers, action_dim):
        super().__init__()
        self.model = build_mlp(
            [state_dim] + list(hidden_layers) + [action_dim], activation=nn.ReLU()
        )

    def forward(self, t, choose_action=True, **kwargs):
        obs = self.get(("env/env_obs", t))
        #print('obs', obs)
        #print(obs.shape)
        q_values = self.model(obs)
        self.set(("q_values", t), q_values)

        # Sets the action
        if choose_action:
            action = q_values.argmax(1)
            #print('action', action)
            #print(action.shape)
            self.set(("action", t), action)

### Creating an Exploration method

Like Q-learning, DQN needs some exploration to avoid converging too early.
Here we will use the simple $\epsilon$-greedy exploration method. The method
is implemented as an agent which chooses an action based on the Q-values.
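
The rule itself fits in a few lines. The sketch below is a plain-Python version for a single state, independent of BBRL; the helper name `epsilon_greedy` is hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values (hypothetical helper)."""
    if rng.random() < epsilon:
        # Explore: uniform random action
        return rng.randrange(len(q_values))
    # Exploit: action with the highest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q = [0.1, 0.7]

# With epsilon = 0, the choice is always greedy
greedy_only = [epsilon_greedy(q, 0.0, rng) for _ in range(100)]

# With epsilon = 0.5, both actions appear, the greedy one more often
mixed = [epsilon_greedy(q, 0.5, rng) for _ in range(1000)]
```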

In [13]:
class EGreedyActionSelector(Agent):
    def __init__(self, epsilon):
        super().__init__()
        self.epsilon = epsilon

    def forward(self, t, **kwargs):
        q_values = self.get(("q_values", t))
        nb_actions = q_values.size()[1]
        size = q_values.size()[0]
        is_random = torch.rand(size).lt(self.epsilon).float()
        random_action = torch.randint(low=0, high=nb_actions, size=(size,))
        max_action = q_values.max(1)[1]
        action = is_random * random_action + (1 - is_random) * max_action
        action = action.long()
        self.set(("action", t), action)

### Training and evaluation environments

We build two environments: one for training and another one for evaluation.

For training, it is more efficient to use an autoreset agent, as we do not
want to waste time if the task is done in an environment sooner than in the
others.

By contrast, for evaluation, we just need to perform a fixed number of
episodes (for statistics), thus it is more convenient to use a
noautoreset agent with a set of environments and just run one episode in
each environment. Thus we can use the `env/done` stop variable and take the
average over the cumulated reward of all environments.

See [this
notebook](https://colab.research.google.com/drive/1Ui481r47fNHCQsQfKwdoNEVrEiqAEokh?usp=sharing)
for explanations about agents and environment agents.

In [14]:
from typing import Tuple
from bbrl.agents.gymnasium import make_env, GymAgent, ParallelGymAgent
from functools import partial

def get_env_agents(cfg, *, autoreset=True, include_last_state=True) -> Tuple[GymAgent, GymAgent]:
    # Returns a pair of environments (train / evaluation) based on a configuration `cfg`
    
    # Train environment
    train_env_agent = ParallelGymAgent(
        partial(make_env, cfg.gym_env.env_name, autoreset=autoreset),
        cfg.algorithm.n_envs, 
        include_last_state=include_last_state
    ).seed(cfg.algorithm.seed)

    # Test environment
    eval_env_agent = ParallelGymAgent(
        partial(make_env, cfg.gym_env.env_name), 
        cfg.algorithm.nb_evals,
        include_last_state=include_last_state
    ).seed(cfg.algorithm.seed)

    return train_env_agent, eval_env_agent

In [15]:
def create_dqn_agent(cfg, train_env_agent, eval_env_agent):
    obs_size, act_size = train_env_agent.get_obs_and_actions_sizes()

    # Get the two agents (critic and target critic)
    critic = DiscreteQAgent(obs_size, cfg.algorithm.architecture.hidden_size, act_size)
    target_critic = copy.deepcopy(critic)

    # Builds the train agent that will produce transitions
    explorer = EGreedyActionSelector(cfg.algorithm.epsilon)
    tr_agent = Agents(train_env_agent, critic, explorer)
    train_agent = TemporalAgent(tr_agent)

    # Creates two temporal agents just for "replaying" some parts
    # of the transition buffer    
    q_agent = TemporalAgent(critic)
    target_q_agent = TemporalAgent(target_critic)


    # Get an agent that is executed on a complete workspace
    ev_agent = Agents(eval_env_agent, critic)
    eval_agent = TemporalAgent(ev_agent)

    return train_agent, eval_agent, q_agent, target_q_agent

### The Logger class

The logger is in charge of collecting statistics during the training
process.

Having logging provided under the hood is one of the features allowing you
to save time when using RL libraries like BBRL.

In these notebooks, the logger is defined as `bbrl.utils.logger.TFLogger` so as
to use a tensorboard visualisation (see the parameters part `params = { "logger":{ ...` below).

Note that the BBRL Logger is also saving the log in a readable format such
that you can use `Logger.read_directories(...)` to read multiple logs, create
a dataframe, and analyze many experiments afterward in a notebook for
instance. The code for the different kinds of loggers is available in the
[bbrl/utils/logger.py](https://github.com/osigaud/bbrl/blob/master/src/bbrl/utils/logger.py)
file.

`instantiate_class` is an inner BBRL mechanism. The
`instantiate_class` function is available in the
[`bbrl/__init__.py`](https://github.com/osigaud/bbrl/blob/master/src/bbrl/__init__.py)
file.

In [16]:
from bbrl import instantiate_class

class Logger():

    def __init__(self, cfg):
        self.logger = instantiate_class(cfg.logger)

    def add_log(self, log_string, loss, steps):
        self.logger.add_scalar(log_string, loss.item(), steps)

    # A specific function for RL algorithms having a critic, an actor and an entropy losses
    def log_losses(self, critic_loss, entropy_loss, actor_loss, steps):
        self.add_log("critic_loss", critic_loss, steps)
        self.add_log("entropy_loss", entropy_loss, steps)
        self.add_log("actor_loss", actor_loss, steps)

    def log_reward_losses(self, rewards, nb_steps):
        self.add_log("reward/mean", rewards.mean(), nb_steps)
        self.add_log("reward/max", rewards.max(), nb_steps)
        self.add_log("reward/min", rewards.min(), nb_steps)
        self.add_log("reward/median", rewards.median(), nb_steps)

### Setup the optimizers

DQN only has a critic, so we use a single optimizer to tune the parameters of
the Q agent (`q_agent`). The target critic is never optimized directly: it is
only updated by copying the parameters of the critic.

In [17]:
# Configure the optimizer over the q agent
def setup_optimizers(cfg, q_agent):
    optimizer_args = get_arguments(cfg.optimizer)
    parameters = q_agent.parameters()
    optimizer = get_class(cfg.optimizer)(parameters, **optimizer_args)
    return optimizer

### Compute critic loss

Detailed explanations of the function to compute the critic loss when using
`autoreset=False` are given in [this
notebook](http://master-dac.isir.upmc.fr/rld/rl/03-1-dqn-introduction.student.ipynb).
The case where we use `autoreset=True` is very similar, but we need to
specify that we use the first part of the Q-values (`q_values[0]`) for
representing $Q(s_t,a_t)$ and the second part (`q_values[1]`) for representing
$Q(s_{t+1},a)$, as these values are stored in a transition workspace. With a
target network, the bootstrap term is read from the target critic instead:
`run_dqn` passes `target_q_values[1]` to the loss function.
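
As a worked example on a single transition, the plain-Python sketch below computes the TD target and the squared TD error. The helper names `td_target` and `squared_td_error` and all numerical values are made up for illustration; the notebook's loss applies the same arithmetic to batched tensors.

```python
def td_target(reward, next_target_q_values, must_bootstrap, discount_factor):
    """TD target for one transition (plain-Python illustration)."""
    # max_a Q_target(s_{t+1}, a), dropped when s_{t+1} is terminal
    bootstrap = max(next_target_q_values) if must_bootstrap else 0.0
    return reward + discount_factor * bootstrap


def squared_td_error(q_values, action, target):
    # (Q(s_t, a_t) - target)^2 for the action actually taken
    return (q_values[action] - target) ** 2


q_t = [1.0, 2.0]            # online critic, Q(s_t, .)
target_q_next = [3.0, 0.5]  # target critic, Q(s_{t+1}, .)

y = td_target(1.0, target_q_next, True, 0.5)            # 1 + 0.5 * 3
y_terminal = td_target(1.0, target_q_next, False, 0.5)  # reward only
error = squared_td_error(q_t, action=1, target=y)       # (2 - 2.5)^2
```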

In [18]:
def compute_critic_loss(cfg, reward, must_bootstrap, q_values, target_q_values, action):
    """Compute the DQN loss on a batch of transitions.

    Shapes, as passed by `run_dqn` (B transitions, A actions):
    - q_values: (2, B, A), online critic at s_t (index 0) and s_{t+1} (index 1)
    - target_q_values: (B, A), target critic at s_{t+1}
    - reward, action: (2, B)
    - must_bootstrap: (B,), False when s_{t+1} is terminal
    """
    # Q(s_t, a_t): Q-values of the actions actually taken
    qvals = q_values[0].gather(1, action[0].unsqueeze(-1)).squeeze(-1)

    # max_a Q_target(s_{t+1}, a): no gradient flows through the target
    max_q = target_q_values.max(dim=1)[0].detach()

    # TD target; the bootstrap term is masked when s_{t+1} is terminal
    target = reward[1] + cfg.algorithm.discount_factor * max_q * must_bootstrap.float()

    # Mean squared error between Q(s_t, a_t) and the TD target
    return F.mse_loss(qvals, target)

## Main training loop

Note that everything about the shared workspace between all the agents is
completely hidden under the hood. This results in a gain of productivity, at
the expense of having to dig into the BBRL code if you want to understand the
details, change the multiprocessing model, etc.

### Agent execution

This is the tricky part with BBRL, the one we need to understand in detail.
The difficulty lies in the copy of the last step and the way to deal with the
n_steps return.

The call to `train_agent(train_workspace, t=1, n_steps=cfg.algorithm.n_steps,
stochastic=True)` makes the agent run a number of steps in the workspace.
In practice, it calls the
[`__call__(...)`](https://github.com/osigaud/bbrl/blob/master/src/bbrl/agents/agent.py#L59)
function which makes a forward pass of the agent network using the workspace
data and updates the workspace accordingly.

Now, if we start at the first epoch (`epoch=0`), we start from the first step
(`t=0`). But when subsequently we perform the next epochs (`epoch>0`), we must
not forget to cover the transition at the border between the previous epoch
and the current epoch. To avoid this risk, we copy the information from the
last time step of the previous epoch into the first time step of the next
epoch.
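
The effect of this copy can be illustrated without BBRL. In the plain-Python sketch below, time steps are just integers and a workspace is a list; the helper `transitions` is hypothetical and simply pairs consecutive steps.

```python
def transitions(steps):
    """Pairs of consecutive time steps stored in a workspace."""
    return list(zip(steps[:-1], steps[1:]))


# Epoch 0: the workspace is filled starting from t=0
epoch0 = [0, 1, 2, 3]

# Epoch 1 without copying: the workspace only holds fresh steps,
# so the transition (3, 4) at the border is never seen
epoch1_no_copy = [4, 5, 6, 7]

# Epoch 1 with copying: the last step of epoch 0 is placed at t=0,
# then the agent runs from t=1
epoch1_with_copy = [epoch0[-1]] + [4, 5, 6, 7]

seen_no_copy = transitions(epoch0) + transitions(epoch1_no_copy)
seen_with_copy = transitions(epoch0) + transitions(epoch1_with_copy)
```

Only the version with the copied step contains the border transition `(3, 4)`.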

Note the `optimizer.zero_grad()`, `critic_loss.backward()` and
`optimizer.step()` lines. `optimizer.zero_grad()` is necessary to cancel all
the gradients computed at the previous iterations.

In [28]:
def run_dqn(cfg, compute_critic_loss):
    # 1)  Build the  logger
    logger = Logger(cfg)
    best_reward = float('-inf')

    # 2) Create the environment agents
    train_env_agent, eval_env_agent = get_env_agents(cfg)

    # 3) Create the DQN-like Agent
    train_agent, eval_agent, q_agent, target_q_agent = create_dqn_agent(
        cfg, train_env_agent, eval_env_agent
    )

    # 5) Configure the workspace to the right dimension
    # Note that no parameter is needed to create the workspace.
    # In the training loop, calling the agent() and critic_agent()
    # will take the workspace as parameter
    train_workspace = Workspace()  # Used for training
    rb = ReplayBuffer(max_size=cfg.algorithm.buffer_size)

    # 6) Configure the optimizer over the dqn agent
    optimizer = setup_optimizers(cfg, q_agent)
    nb_steps = 0
    last_eval_step = 0
    last_critic_update_step = 0
    best_agent = eval_agent.agent.agents[1]

    # 7) Training loop
    pbar = tqdm(range(cfg.algorithm.max_epochs))
    for epoch in pbar:
        # Execute the agent in the workspace
        if epoch > 0:
            train_workspace.zero_grad()
            train_workspace.copy_n_last_steps(1)
            train_agent(
                train_workspace, t=1, n_steps=cfg.algorithm.n_steps, stochastic=True
            )
        else:
            train_agent(
                train_workspace, t=0, n_steps=cfg.algorithm.n_steps, stochastic=True
            )

        # Get the transitions
        transition_workspace = train_workspace.get_transitions()
        #print(transition_workspace)

        action = transition_workspace["action"]
        nb_steps += action[0].shape[0]
        
        # Adds the transitions to the replay buffer
        rb.put(transition_workspace)
        if rb.size() > cfg.algorithm.learning_starts:
            for _ in range(cfg.algorithm.n_updates):
                rb_workspace = rb.get_shuffled(cfg.algorithm.batch_size)

                # The q agent needs to be executed on the rb_workspace workspace (gradients are removed in workspace)
                q_agent(rb_workspace, t=0, n_steps=2, choose_action=False)
                q_values, terminated, reward, action = rb_workspace[
                    "q_values", "env/terminated", "env/reward", "action"
                ]

                with torch.no_grad():
                    # choose_action=False: the stored actions must not be overwritten
                    target_q_agent(rb_workspace, t=0, n_steps=2, choose_action=False)
                target_q_values = rb_workspace["q_values"]

                # Determines whether values of the critic should be propagated
                must_bootstrap = ~terminated[1]

                # Compute critic loss
                # FIXME: make the notation consistent (either time slices everywhere, or none)
                critic_loss = compute_critic_loss(
                    cfg, reward, must_bootstrap, q_values, target_q_values[1], action
                )
                # Store the loss for tensorboard display
                logger.add_log("critic_loss", critic_loss, nb_steps)

                optimizer.zero_grad()
                critic_loss.backward()
                torch.nn.utils.clip_grad_norm_(q_agent.parameters(), cfg.algorithm.max_grad_norm)
                optimizer.step()
                if nb_steps - last_critic_update_step > cfg.algorithm.target_critic_update:
                    last_critic_update_step = nb_steps
                    target_q_agent.agent = copy.deepcopy(q_agent.agent)

        # Evaluate the current policy
        if nb_steps - last_eval_step > cfg.algorithm.eval_interval:
            last_eval_step = nb_steps
            eval_workspace = Workspace()
            eval_agent(
                eval_workspace, t=0, stop_variable="env/done", choose_action=True
            )
            rewards = eval_workspace["env/cumulated_reward"][-1]
            mean = rewards.mean()
            logger.log_reward_losses(rewards, nb_steps)
            pbar.set_description(f"nb steps: {nb_steps}, reward: {mean:.3f}")
            if cfg.save_best and mean > best_reward:
                best_reward = mean
                best_agent = copy.deepcopy(eval_agent.agent.agents[1])
                directory = "./dqn_critic/"
                if not os.path.exists(directory):
                    os.makedirs(directory)
                filename = directory + "dqn0_" + str(mean.item()) + ".agt"
                eval_agent.save_model(filename)

    return best_agent

## Definition of the parameters

The logger is defined as `bbrl.utils.logger.TFLogger` so as to use a
tensorboard visualisation.

### Launching tensorboard to visualize the results

In [20]:

setup_tensorboard("./tblogs")

In [29]:
params={
  "save_best": False,
  "logger":{
    "classname": "bbrl.utils.logger.TFLogger",
    "log_dir": "./tblogs/dqn-buffer-" + str(time.time()),
    "cache_size": 10000,
    "every_n_seconds": 10,
    "verbose": False,    
    },

  "algorithm":{
    "seed": 4,
    "max_grad_norm": 0.5,
    "epsilon": 0.02,
    "n_envs": 8,
    "n_steps": 32,
    "n_updates": 32,
    "eval_interval": 2000,
    "learning_starts": 2000,
    "nb_evals": 10,
    "buffer_size": 1e6,
    "batch_size": 256,
    "target_critic_update": 5000,
    "max_epochs": 3500,
    "discount_factor": 0.99,
    "architecture":{"hidden_size": [64, 64]},
  },
  "gym_env":{
    "env_name": "CartPole-v1",
  },
  "optimizer":
  {
    "classname": "torch.optim.Adam",
    "lr": 1e-3,
  }
}

config=OmegaConf.create(params)
torch.manual_seed(config.algorithm.seed)
best_agent = run_dqn(config, compute_critic_loss)


In [None]:
# Visualization
env = make_env(config.gym_env.env_name, render_mode="rgb_array")
record_video(env, best_agent, "videos/dqn-full.mp4")
video_display("videos/dqn-full.mp4")



## Coding Exercise: Double DQN (DDQN)

In DQN, the same network is responsible for selecting and estimating the best
next action (in the TD-target), which may lead to over-estimation: an
action whose Q-value is over-estimated will be chosen more often. As a result,
training is slower.

To reduce over-estimation, double q-learning (and then DDQN) was proposed. It
decouples the action selection from the value estimation.

Concretely, in DQN, the target value in the critic loss (used to update the Q
critic) for a sample at time $t$ is defined as:

$$Y^{DQN}_{t} = r_{t+1} + \gamma Q\left(s_{t+1}, \arg\max_{a}Q\left(s_{t+1},
a; \theta_{target}\right); \theta_{target}\right)$$

where the target network `target_q_agent` with parameters
$\theta_{target}$ is used for both action selection and estimation,
and can therefore be rewritten:

$$Y^{DQN}_{t} = r_{t+1} + \gamma \max_{a} Q\left(s_{t+1}, a;
\theta_{target}\right)$$

Instead, DDQN uses the online critic `q_agent` with parameters
$\theta_{online}$ to select the action, whereas it uses the target
network `target_q_agent` to estimate the associated Q-values:

$$Y^{DDQN}_{t} = r_{t+1} + \gamma Q\left(s_{t+1}, \arg\max_{a}Q\left(s_{t+1},
a; \theta_{online}\right); \theta_{target}\right)$$

The goal in this exercise is for you to write the update method for `DDQN`.
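
Before writing the batched version, it may help to compare the two targets on a toy transition. The plain-Python sketch below uses made-up Q-values in which the target critic over-estimates action 1; the function names are hypothetical and not part of BBRL.

```python
def dqn_target(reward, target_q_next, discount_factor):
    # The target network both selects and evaluates the next action
    return reward + discount_factor * max(target_q_next)


def ddqn_target(reward, online_q_next, target_q_next, discount_factor):
    # The online network selects the next action...
    best_action = max(range(len(online_q_next)), key=lambda a: online_q_next[a])
    # ...and the target network evaluates it
    return reward + discount_factor * target_q_next[best_action]


online_q_next = [2.0, 1.0]  # online critic at s_{t+1}: prefers action 0
target_q_next = [1.0, 4.0]  # target critic at s_{t+1}: over-estimates action 1

y_dqn = dqn_target(1.0, target_q_next, 0.5)                   # 1 + 0.5 * 4
y_ddqn = ddqn_target(1.0, online_q_next, target_q_next, 0.5)  # 1 + 0.5 * 1
```

Here the DDQN target ignores the over-estimated value because the online critic selects a different action.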

In [None]:
def compute_ddqn_loss(cfg, reward, must_bootstrap, q_values, target_q_values, action):
    # Shapes, as passed by run_dqn (B transitions, A actions):
    # q_values (2, B, A) from the online critic, target_q_values (B, A) from the
    # target critic at s_{t+1}, reward and action (2, B), must_bootstrap (B,)

    # Step 1: action selection at s_{t+1} using the online Q-values
    next_actions = q_values[1].argmax(dim=1, keepdim=True)  # keepdim for gather

    # Step 2: Q-value estimation for the selected actions using the target network
    next_q_values = target_q_values.gather(1, next_actions).squeeze(-1)

    # Step 3: compute the DDQN target; the bootstrap term is masked when s_{t+1} is terminal
    targets = reward[1] + cfg.algorithm.discount_factor * next_q_values * must_bootstrap.float()

    # Extract Q(s_t, a_t), the Q-values of the actions actually taken
    qvals = q_values[0].gather(1, action[0].unsqueeze(-1)).squeeze(-1)

    # Mean squared error between Q(s_t, a_t) and the (detached) target
    mse = nn.MSELoss()
    critic_loss = mse(qvals, targets.detach())

    return critic_loss

In [None]:
params={
  "save_best": False,
  "logger":{
    "classname": "bbrl.utils.logger.TFLogger",
    "log_dir": "./tblogs/ddqn-buffer-" + str(time.time()),
    "cache_size": 10000,
    "every_n_seconds": 10,
    "verbose": False,    
    },

  "algorithm":{
    "seed": 4,
    "max_grad_norm": 0.5,
    "epsilon": 0.02,
    "n_envs": 8,
    "n_steps": 32,
    "n_updates": 32,
    "eval_interval": 2000,
    "learning_starts": 2000,
    "nb_evals": 10,
    "buffer_size": 1e6,
    "batch_size": 256,
    "target_critic_update": 5000,
    "max_epochs": 3500,
    "discount_factor": 0.99,
    "architecture":{"hidden_size": [128, 128]},
  },
  "gym_env":{
    "env_name": "CartPole-v1",
  },
  "optimizer":
  {
    "classname": "torch.optim.Adam",
    "lr": 1e-3,
  }
}

config=OmegaConf.create(params)
torch.manual_seed(config.algorithm.seed)
best_agent = run_dqn(config, compute_ddqn_loss)


In [None]:
# Visualization
env = make_env(config.gym_env.env_name, render_mode="rgb_array")
record_video(env, best_agent, "videos/dqn-double.mp4")
video_display("videos/dqn-double.mp4")