 Copyright © Sorbonne University.

 This source code is licensed under the MIT license found in the LICENSE file
 in the root directory of this source tree.

# Outlook
In this notebook we code the Truncated Quantile Critic (TQC) algorithm using
BBRL. This algorithm is described in [this
paper](http://proceedings.mlr.press/v119/kuznetsov20a/kuznetsov20a.pdf).

To understand this code, you need to know more about [the BBRL interaction
model](https://github.com/osigaud/bbrl/blob/master/docs/overview.md). Then you
should run [a didactic
example](https://github.com/osigaud/bbrl/blob/master/docs/notebooks/03-multi_env_autoreset.student.ipynb)
to see how agents interact in BBRL when autoreset=True.

The algorithm is explained in [this
video](https://www.youtube.com/watch?v=U20F-MvThjM) (at the end, after SAC)
and you can also read [the corresponding
slides](https://dac.lip6.fr/wp-content/uploads/2022/11/12_sac.pdf).

## Installation and Imports

### Installation

The BBRL library is [here](https://github.com/osigaud/bbrl).

Below, we import standard python packages, pytorch packages and gymnasium
environments.

In [1]:
# Installs the necessary Python and system libraries
try:
    from easypip import easyimport, easyinstall, is_notebook
except ModuleNotFoundError:
    get_ipython().run_line_magic("pip", "install easypip")
    from easypip import easyimport, easyinstall, is_notebook

easyinstall("bbrl>=0.2.2")
easyinstall("swig")
easyinstall("bbrl_gymnasium>=0.3.5")
easyinstall("bbrl_gymnasium[box2d]")
easyinstall("bbrl_gymnasium[classic_control]")
easyinstall("tensorboard")
easyinstall("moviepy")
easyinstall("box2d-kengz")

In [25]:
import os
import sys
import math
import time
import numpy as np
import torch
import bbrl_gymnasium
import copy
import torch.nn as nn
import torch.nn.functional as F
from pathlib import Path
from moviepy.editor import ipython_display as video_display
from tqdm.auto import tqdm
from typing import Tuple, Optional, Iterator
from functools import partial
from omegaconf import OmegaConf
from abc import abstractmethod, ABC
from time import strftime
from bbrl import instantiate_class


# Useful when using a timestamp for a directory name
OmegaConf.register_new_resolver(
    "current_time", lambda: strftime("%Y%m%d-%H%M%S"), replace=True
)

In [28]:
# Imports all the necessary classes and functions from BBRL
from bbrl import get_arguments, get_class, instantiate_class

# The workspace is the main class in BBRL, this is where all data is collected and stored
from bbrl.workspace import Workspace

# Agents(agent1,agent2,agent3,...) executes the different agents one after the other
# TemporalAgent(agent) executes an agent over multiple timesteps in the workspace,
# or until a given condition is reached
from bbrl.agents import Agent, Agents, TemporalAgent, KWAgentWrapper

# ParallelGymAgent is an agent able to execute a batch of gymnasium environments
# with auto-resetting. These agents produce multiple variables in the workspace:
# 'env/env_obs', 'env/reward', 'env/timestep', 'env/terminated',
# 'env/truncated', 'env/done', 'env/cumulated_reward', ...
#
# When called at timestep t=0, the environments are automatically reset. At
# timestep t>0, these agents will read the 'action' variable in the workspace at
# time t - 1
from bbrl.agents.gymnasium import GymAgent, ParallelGymAgent, make_env, record_video

# Replay buffers are useful to store past transitions when training
from bbrl.utils.replay_buffer import ReplayBuffer

In [4]:
# Utility function for launching tensorboard
# For Colab - otherwise, it is easier and better to launch tensorboard from
# the terminal
def setup_tensorboard(path):
    path = Path(path)
    answer = ""
    if is_notebook():
        if get_ipython().__class__.__module__ == "google.colab._shell":
            answer = "y"
        while answer not in ["y", "n"]:
            answer = input(
                f"Do you want to launch tensorboard in this notebook [y/n] "
            ).lower()

    if answer == "y":
        get_ipython().run_line_magic("load_ext", "tensorboard")
        get_ipython().run_line_magic("tensorboard", f"--logdir {path.absolute()}")
    else:
        import sys
        import os
        import os.path as osp

        print(
            f"Launch tensorboard from the shell:\n{osp.dirname(sys.executable)}/tensorboard --logdir={path.absolute()}"
        )

## Definition of Agents

### Functions to build networks

We define a few utility functions to build neural networks.

The function below builds a multi-layer perceptron where the size of each
layer is given in the `sizes` list. We also specify the activation function of
neurons at each layer and optionally a different activation function for the
final layer.

In [5]:
def build_mlp(sizes, activation, output_activation=nn.Identity()):
    """Helper function to build a multi-layer perceptron (function from $\mathbb R^n$ to $\mathbb R^p$)
    
    Args:
        sizes (List[int]): the number of neurons at each layer
        activation (nn.Module): a PyTorch activation function (after each layer but the last)
        output_activation (nn.Module): a PyTorch activation function (last layer)
    """
    layers = []
    for j in range(len(sizes) - 1):
        act = activation if j < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j + 1]), act]
    return nn.Sequential(*layers)
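
To make the layer layout concrete, the sketch below mirrors the pairing logic of `build_mlp` in plain Python, without PyTorch. The helper names `mlp_layout` and `count_parameters` are hypothetical and only used for illustration. They list each linear layer with the kind of activation that follows it, and count the weights and biases of the resulting network.

```python
def mlp_layout(sizes):
    """Return (in_features, out_features, activation_kind) per linear layer.

    Hypothetical helper mirroring the loop of build_mlp: every layer but the
    last is followed by the hidden activation, the last one by the output
    activation.
    """
    layout = []
    for j in range(len(sizes) - 1):
        kind = "hidden" if j < len(sizes) - 2 else "output"
        layout.append((sizes[j], sizes[j + 1], kind))
    return layout


def count_parameters(sizes):
    """Number of weights plus biases of the corresponding fully connected net."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


# Illustrative sizes: 4 inputs, two hidden layers of 32 units, 2 outputs
sizes = [4, 32, 32, 2]
for n_in, n_out, kind in mlp_layout(sizes):
    print(f"Linear({n_in}, {n_out}) followed by the {kind} activation")
print("parameters:", count_parameters(sizes))
```

A list of `k` sizes therefore yields `k - 1` linear layers, and only the last one receives `output_activation`.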

In [6]:
def build_backbone(sizes, activation):
    layers = []
    for j in range(len(sizes) - 1):
        layers += [nn.Linear(sizes[j], sizes[j + 1]), activation]
    return layers

### Base Actor

All actors should inherit from the `BaseActor` class, so that we can easily copy their parameters.

In [7]:
class BaseActor(Agent):
    """ Generic class to centralize copy_parameters"""

    def copy_parameters(self, other):
        """Copy parameters from other agent"""
        for self_p, other_p in zip(self.parameters(), other.parameters()):
            self_p.data.copy_(other_p)

### Squashed Gaussian Policy

Like SAC, TQC works better with a Squashed Gaussian policy, which enables the reparametrization trick.

The code of the `SquashedGaussianActor` policy is below, it is the same as for SAC.

It relies on a specific type of distribution, the `SquashedDiagGaussianDistribution` which is taken from [the Stable Baselines3 library](https://github.com/DLR-RM/stable-baselines3).

In [11]:
from bbrl.utils.distributions import SquashedDiagGaussianDistribution

The fact that we use the reparametrization trick is hidden inside the code of this distribution. In more detail, the key is that the [`sample(self)` method](https://github.com/osigaud/bbrl/blob/master/src/bbrl/utils/distributions.py#L200) calls `rsample()`.

In [12]:
class SquashedGaussianActor(BaseActor):
    def __init__(self, state_dim, hidden_layers, action_dim):
        super().__init__()
        backbone_dim = [state_dim] + list(hidden_layers)
        self.layers = build_backbone(backbone_dim, activation=nn.Tanh())
        self.backbone = nn.Sequential(*self.layers)
        self.last_mean_layer = nn.Linear(hidden_layers[-1], action_dim)
        self.last_std_layer = nn.Linear(hidden_layers[-1], action_dim)
        self.action_dist = SquashedDiagGaussianDistribution(action_dim)

    def get_distribution(self, obs: torch.Tensor):
        backbone_output = self.backbone(obs)
        mean = self.last_mean_layer(backbone_output)
        std_out = self.last_std_layer(backbone_output)

        std_out = std_out.clamp(-20, 2)  # as in the official code
        std = torch.exp(std_out)
        return self.action_dist.make_distribution(mean, std)

    def forward(self, t, stochastic=False, predict_proba=False, **kwargs):
        action_dist = self.get_distribution(self.get(("env/env_obs", t)))
        if predict_proba:
            action = self.get(("action", t))
            log_prob = action_dist.log_prob(action)
            self.set(("logprob_predict", t), log_prob)
        else:
            if stochastic:
                action = action_dist.sample()
            else:
                action = action_dist.mode()
            log_prob = action_dist.log_prob(action)
            self.set(("action", t), action)
            self.set(("action_logprobs", t), log_prob)

    def predict_action(self, obs, stochastic=False):
        """Predict just one action (without using the workspace)"""
        action_dist = self.get_distribution(obs)
        return action_dist.sample() if stochastic else action_dist.mode()
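
As a rough illustration of what a squashed Gaussian computes, here is a minimal sketch in plain Python. It is not the code of `SquashedDiagGaussianDistribution`: the function names and the `eps` constant are assumptions made for this example. It shows the reparametrized sample (noise drawn separately, then shifted, scaled and passed through `tanh`) and the usual change-of-variable correction applied to the log-probability.

```python
import math
import random


def squashed_gaussian_sample(mean, std, noise):
    """Reparametrized sample: the noise is drawn outside, so the action is a
    deterministic function of mean and std."""
    pre_tanh = [m + s * n for m, s, n in zip(mean, std, noise)]
    action = [math.tanh(u) for u in pre_tanh]
    return pre_tanh, action


def squashed_gaussian_log_prob(mean, std, pre_tanh, eps=1e-6):
    """Log-density of tanh(u) where u ~ N(mean, diag(std^2)).

    Gaussian log-density of u minus the log-Jacobian of tanh, summed over the
    action dimensions. eps is an assumed small constant for numerical safety.
    """
    log_prob = 0.0
    for m, s, u in zip(mean, std, pre_tanh):
        gaussian = -0.5 * ((u - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        correction = math.log(1.0 - math.tanh(u) ** 2 + eps)
        log_prob += gaussian - correction
    return log_prob


rng = random.Random(0)
mean, std = [0.0, 0.5], [1.0, 0.2]
noise = [rng.gauss(0.0, 1.0) for _ in mean]
pre_tanh, action = squashed_gaussian_sample(mean, std, noise)
print(action, squashed_gaussian_log_prob(mean, std, pre_tanh))
```

Because the randomness enters only through `noise`, gradients can flow from the action back to `mean` and `std`, which is what the actor loss relies on.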

### CriticAgent

As critic, TQC uses a network with several heads, defined below. As seen in the forward function, it outputs a vector of quantiles.

In [13]:
class TruncatedQuantileNetwork(Agent):
    def __init__(self, state_dim, hidden_layers, n_nets, action_dim, n_quantiles):
        super().__init__()
        self.is_q_function = True
        self.nets = []
        for i in range(n_nets):
            net = build_mlp([state_dim + action_dim] + list(hidden_layers) + [n_quantiles], activation=nn.ReLU())
            self.add_module(f'qf{i}', net)
            self.nets.append(net)

    def forward(self, t):
        obs = self.get(("env/env_obs", t))
        action = self.get(("action", t))
        obs_act = torch.cat((obs, action), dim=1)
        quantiles = torch.stack(tuple(net(obs_act) for net in self.nets), dim=1)
        self.set(("quantiles", t), quantiles)
        return quantiles

    def predict_value(self, obs, action):
        obs_act = torch.cat((obs, action), dim=0)
        quantiles = torch.stack(tuple(net(obs_act) for net in self.nets), dim=1)
        return quantiles

### Training and evaluation environments

We build two environments: one for training and another one for evaluation.

For training, it is more efficient to use an autoreset agent, as we do not
want to waste time if the task is done in an environment sooner than in the
others.

By contrast, for evaluation, we just need to perform a fixed number of
episodes (for statistics), thus it is more convenient to use a
noautoreset agent with a set of environments and just run one episode in
each environment. Thus we can use the `env/done` stop variable and take the
average over the cumulated reward of all environments.

See [this
notebook](https://colab.research.google.com/drive/1Ui481r47fNHCQsQfKwdoNEVrEiqAEokh?usp=sharing)
for explanations about agents and environment agents.

In [14]:
from typing import Tuple
from bbrl.agents.gymnasium import make_env, GymAgent, ParallelGymAgent
from functools import partial

def get_env_agents(cfg, *, autoreset=True, include_last_state=True) -> Tuple[GymAgent, GymAgent]:
    # Returns a pair of environments (train / evaluation) based on a configuration `cfg`
    
    # Train environment
    train_env_agent = ParallelGymAgent(
        partial(make_env, cfg.gym_env.env_name, autoreset=autoreset),
        cfg.algorithm.n_envs, 
        include_last_state=include_last_state
    ).seed(cfg.algorithm.seed)

    # Test environment
    eval_env_agent = ParallelGymAgent(
        partial(make_env, cfg.gym_env.env_name), 
        cfg.algorithm.nb_evals,
        include_last_state=include_last_state
    ).seed(cfg.algorithm.seed)

    return train_env_agent, eval_env_agent

### Building the complete training and evaluation agents

In the code below we create the Squashed Gaussian actor, one critic and the corresponding target critic. Beforehand, we checked that the environment takes continuous actions (otherwise we would need a different code).

A good exercise is to check that TQC also works without a target critic.

In [32]:
# Create the TQC Agent
def create_tqc_agent(cfg, train_env_agent, eval_env_agent):
    obs_size, act_size = train_env_agent.get_obs_and_actions_sizes()
    assert (
        train_env_agent.is_continuous_action()
    ), "TQC code dedicated to continuous actions"

    # Actor
    actor = SquashedGaussianActor(
        obs_size, cfg.algorithm.architecture.actor_hidden_size, act_size
    )

    # Train/Test agents
    tr_agent = Agents(train_env_agent, actor)
    ev_agent = Agents(eval_env_agent, actor)

    # Builds the critics
    critic = TruncatedQuantileNetwork(
        obs_size, cfg.algorithm.architecture.critic_hidden_size,
        cfg.algorithm.architecture.n_nets, act_size,
        cfg.algorithm.architecture.n_quantiles
    )
    target_critic = copy.deepcopy(critic)

    train_agent = TemporalAgent(tr_agent)
    eval_agent = TemporalAgent(ev_agent)
    #train_agent.seed(cfg.algorithm.seed)
    return (
        train_agent,
        eval_agent,
        actor,
        critic,
        target_critic
    )

### The Logger class

The logger is in charge of collecting statistics during the training process.
The logger defines the following methods, where `steps` is the number of steps
since the training began:
- `logger.log_losses(critic_loss: float, entropy_loss: float, actor_loss:
  float, steps: int)`
- `logger.log_reward_losses(rewards: torch.Tensor, nb_steps)`
- `logger.add_log(log_string: str, loss: float, steps: int)`

Having logging provided under the hood is one of the features allowing you
to save time when using RL libraries like BBRL.

In these notebooks, the logger is defined as `bbrl.utils.logger.TFLogger` so as
to use a tensorboard visualisation (see the configuration parameters).

Note that the BBRL Logger is also saving the log in a readable format such
that you can use `Logger.read_directories(...)` to read multiple logs, create
a dataframe, and analyze many experiments afterward in a notebook for
instance. The code for the different kinds of loggers is available in the
[bbrl/utils/logger.py](https://github.com/osigaud/bbrl/blob/master/src/bbrl/utils/logger.py)
file.

`instantiate_class` is an inner BBRL mechanism. The
`instantiate_class` function is available in the
[`bbrl/__init__.py`](https://github.com/osigaud/bbrl/blob/master/src/bbrl/__init__.py)
file.

In [16]:
class Logger:
    def __init__(self, cfg):
        self.logger = instantiate_class(cfg.logger)

    def add_log(self, log_string: str, loss: torch.Tensor, steps: int):
        self.logger.add_scalar(log_string, loss.item(), steps)

    # A specific function for RL algorithms having a critic loss, an actor loss
    # and an entropy loss
    def log_losses(
        self, critic_loss: float, entropy_loss: float, actor_loss: float, steps: int
    ):
        self.add_log("critic_loss", critic_loss, steps)
        self.add_log("entropy_loss", entropy_loss, steps)
        self.add_log("actor_loss", actor_loss, steps)

    def log_reward_losses(self, rewards: torch.Tensor, nb_steps):
        self.add_log("reward/mean", rewards.mean(), nb_steps)
        self.add_log("reward/max", rewards.max(), nb_steps)
        self.add_log("reward/min", rewards.min(), nb_steps)
        self.add_log("reward/median", rewards.median(), nb_steps)
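
The wrapper above only calls `add_scalar(name, value, step)` on the underlying logger and `.item()` on the logged value. Under that assumption, a stand-in such as the hypothetical `ListLogger` below can help when checking logging calls without TensorBoard. Neither `ListLogger` nor `FakeLoss` is part of BBRL; both are illustrative names.

```python
class ListLogger:
    """Hypothetical in-memory logger exposing only add_scalar(name, value, step)."""

    def __init__(self):
        self.records = []

    def add_scalar(self, name, value, step):
        self.records.append((name, float(value), int(step)))


class FakeLoss:
    """Stand-in for a 0-dimensional tensor: it only provides .item()."""

    def __init__(self, value):
        self.value = value

    def item(self):
        return self.value


backend = ListLogger()
# Same call pattern as Logger.add_log: name, loss.item(), steps
backend.add_scalar("critic_loss", FakeLoss(0.25).item(), 1000)
backend.add_scalar("actor_loss", FakeLoss(-1.5).item(), 1000)
print(backend.records)
```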

### Setup the optimizers

In [17]:
# Configure the optimizer
def setup_optimizers(cfg, actor, critic):
    actor_optimizer_args = get_arguments(cfg.actor_optimizer)
    parameters = actor.parameters()
    actor_optimizer = get_class(cfg.actor_optimizer)(parameters, **actor_optimizer_args)
    critic_optimizer_args = get_arguments(cfg.critic_optimizer)
    parameters = critic.parameters()
    critic_optimizer = get_class(cfg.critic_optimizer)(
        parameters, **critic_optimizer_args
    )
    return actor_optimizer, critic_optimizer

In [18]:
def setup_entropy_optimizers(cfg):
    if cfg.algorithm.target_entropy == "auto":
        entropy_coef_optimizer_args = get_arguments(cfg.entropy_coef_optimizer)
        # Note: we optimize the log of the entropy coef which is slightly different from the paper
        # as discussed in https://github.com/rail-berkeley/softlearning/issues/37
        # Comment and code taken from the SB3 version of SAC
        log_entropy_coef = torch.log(
            torch.ones(1) * cfg.algorithm.entropy_coef
        ).requires_grad_(True)
        entropy_coef_optimizer = get_class(cfg.entropy_coef_optimizer)(
            [log_entropy_coef], **entropy_coef_optimizer_args
        )
    else:
        log_entropy_coef = 0
        entropy_coef_optimizer = None
    return entropy_coef_optimizer, log_entropy_coef
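
To see how optimizing the log of the entropy coefficient behaves, the sketch below applies one plain gradient-descent step to the loss `-(log_alpha * (log_prob + target_entropy)).mean()` used later in the training loop. It is a simplification: the notebook configures Adam, whereas this example assumes vanilla gradient descent, and `entropy_coef_step` is a hypothetical name.

```python
import math


def entropy_coef_step(log_alpha, log_probs, target_entropy, lr):
    """One gradient-descent step on -(log_alpha * (log_prob + target_entropy)).mean().

    The loss is linear in log_alpha, so its gradient is the negated batch mean
    of (log_prob + target_entropy).
    """
    grad = -sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    return log_alpha - lr * grad


action_dim = 1
target_entropy = -float(action_dim)  # "auto" setting: minus the action dimension
log_alpha = math.log(0.1)

# High log-probabilities: the entropy estimate (-2.5) is below the target (-1)
raised = entropy_coef_step(log_alpha, [2.0, 3.0], target_entropy, lr=0.01)
# Low log-probabilities: the entropy estimate (2.5) is above the target (-1)
lowered = entropy_coef_step(log_alpha, [-3.0, -2.0], target_entropy, lr=0.01)
print(math.exp(raised), math.exp(lowered))
```

The coefficient grows when the policy is less random than the target entropy requires and shrinks otherwise. Working on `log_alpha` keeps `alpha = exp(log_alpha)` positive without any clipping.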

### Compute the critic loss

By contrast with the SAC version, we prepare data and compute the critic loss in a single function.

In [19]:
def compute_critic_loss(
        cfg, reward, must_bootstrap,
        t_actor,
        q_agent,
        target_q_agent,
        rb_workspace,
        ent_coef
):
    # Compute quantiles from critic with the actions present in the buffer:
    # at t, we have Quantiles(s,a) from the (s,a) in the RB
    q_agent(rb_workspace, t=0, n_steps=1)
    quantiles = rb_workspace["quantiles"].squeeze()

    with torch.no_grad():
        # Replay the current actor on the replay buffer to get actions of the
        # current policy
        t_actor(rb_workspace, t=1, n_steps=1, stochastic=True)
        action_logprobs_next = rb_workspace["action_logprobs"]

        # Compute target quantiles from the target critic: at t+1, we have
        # Quantiles(s+1,a+1) from the (s+1,a+1) where a+1 has been replaced in the RB

        target_q_agent(rb_workspace, t=1, n_steps=1)
        post_quantiles = rb_workspace["quantiles"][1]

        sorted_quantiles, _ = torch.sort(post_quantiles.reshape(quantiles.shape[0], -1))
        quantiles_to_drop_total = cfg.algorithm.top_quantiles_to_drop * cfg.algorithm.architecture.n_nets
        truncated_sorted_quantiles = sorted_quantiles[:,
                                     :quantiles.size(-1) * quantiles.size(-2) - quantiles_to_drop_total]

        # compute the target
        logprobs = ent_coef * action_logprobs_next[1]
        y = reward[0].unsqueeze(-1) + must_bootstrap.int().unsqueeze(-1) * cfg.algorithm.discount_factor * (
                    truncated_sorted_quantiles - logprobs.unsqueeze(-1))

    # computing the Huber loss
    pairwise_delta = y[:, None, None, :] - quantiles[:, :, :, None]  # batch x nets x quantiles x samples

    abs_pairwise_delta = torch.abs(pairwise_delta)
    huber_loss = torch.where(abs_pairwise_delta > 1,
                             abs_pairwise_delta - 0.5,
                             pairwise_delta ** 2 * 0.5)

    n_quantiles = quantiles.shape[2]
    tau = torch.arange(n_quantiles).float() / n_quantiles + 1 / 2 / n_quantiles
    loss = (torch.abs(tau[None, None, :, None] - (pairwise_delta < 0).float()) * huber_loss).mean()
    return loss
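
The two key ingredients of this loss, truncation of the pooled target atoms and the quantile Huber loss, can be sketched on plain Python lists for a single transition. The helper names below are hypothetical, the atoms are random numbers rather than network outputs, and the reward, discount and entropy terms of the target are left out.

```python
import random


def truncate_atoms(per_net_quantiles, drop_per_net):
    """Pool the atoms of all nets, sort them, and drop the largest ones."""
    pooled = sorted(q for net in per_net_quantiles for q in net)
    n_keep = len(pooled) - drop_per_net * len(per_net_quantiles)
    return pooled[:n_keep]


def quantile_midpoints(n_quantiles):
    """tau_i = (i + 0.5) / n_quantiles, the midpoints used by the loss."""
    return [(i + 0.5) / n_quantiles for i in range(n_quantiles)]


def quantile_huber(target, prediction, tau):
    """Quantile Huber loss (threshold 1) for one target/prediction pair."""
    delta = target - prediction
    huber = abs(delta) - 0.5 if abs(delta) > 1 else 0.5 * delta ** 2
    return abs(tau - (1.0 if delta < 0 else 0.0)) * huber


rng = random.Random(0)
n_nets, n_quantiles, drop_per_net = 2, 25, 2
next_atoms = [[rng.gauss(0.0, 1.0) for _ in range(n_quantiles)] for _ in range(n_nets)]
kept = truncate_atoms(next_atoms, drop_per_net)
print(len(kept), "atoms kept out of", n_nets * n_quantiles)

taus = quantile_midpoints(n_quantiles)
# For a high quantile, under-estimating the target costs more than over-estimating it
print(quantile_huber(1.0, 0.0, taus[-1]), quantile_huber(0.0, 1.0, taus[-1]))
```

With the values of this notebook (2 nets, 25 quantiles, 2 dropped per net), 46 of the 50 pooled atoms are kept. Dropping the largest atoms is what counters overestimation.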

### Soft parameter updates

To update the target critic, one uses the following equation:
$\theta' \leftarrow \tau \theta + (1- \tau) \theta'$
where $\theta$ is the vector of parameters of the critic, and $\theta'$ is the vector of parameters of the target critic.
The `soft_update_params(...)` function is in charge of performing this soft update.

In [20]:
def soft_update_params(net, target_net, tau):
    for param, target_param in zip(net.parameters(), target_net.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
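
The effect of the soft update can be checked on plain numbers. The sketch below is a list-based analogue of `soft_update_params` (the name `soft_update` and the parameter values are illustrative). With the critic parameters held fixed, it shows the target parameters moving toward them at a geometric rate.

```python
def soft_update(params, target_params, tau):
    """Return tau * params + (1 - tau) * target_params, element by element."""
    return [tau * p + (1 - tau) * tp for p, tp in zip(params, target_params)]


params = [1.0, -2.0]   # critic parameters, held fixed for the illustration
target = [0.0, 0.0]    # target critic parameters
tau = 0.05

first = soft_update(params, target, tau)
print("after 1 update:", first)

for _ in range(100):
    target = soft_update(params, target, tau)
print("after 100 updates:", target)
```

Starting from zero, the gap to the critic parameters shrinks by a factor `1 - tau` at each update, so after `n` updates the target holds a fraction `1 - (1 - tau)**n` of the critic value.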

### Compute the actor loss

Again, by contrast with the SAC version, we prepare data and compute the actor loss in a single function.

In [21]:
def compute_actor_loss(ent_coef, t_actor, q_agent, rb_workspace):
    """Actor loss computation

    :param ent_coef: The entropy coefficient $\alpha$
    :param t_actor: The actor agent (temporal agent)
    :param q_agent: The critic (temporal agent) (n net of m quantiles)
    :param rb_workspace: The replay buffer (2 time steps, $t$ and $t+1$)
    """
    # Recompute the quantiles from the current policy, not from the actions in the buffer

    t_actor(rb_workspace, t=0, n_steps=1, stochastic=True)
    action_logprobs_new = rb_workspace["action_logprobs"]

    q_agent(rb_workspace, t=0, n_steps=1)
    quantiles = rb_workspace["quantiles"][0]

    actor_loss = (ent_coef * action_logprobs_new[0] - quantiles.mean(2).mean(1))

    return actor_loss.mean()
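
The reduction performed in `compute_actor_loss` can be reproduced on nested lists. In this sketch, `actor_loss_sketch` is a hypothetical name and the numbers are made up. The quantiles are averaged over atoms and then over nets, with no truncation, to get one value per sample, which is then subtracted from the entropy term.

```python
def actor_loss_sketch(ent_coef, log_probs, quantiles):
    """Mean over the batch of ent_coef * log_prob - mean quantile value.

    log_probs holds one value per sample; quantiles is indexed as
    [sample][net][quantile].
    """
    losses = []
    for log_prob, per_net in zip(log_probs, quantiles):
        net_means = [sum(atoms) / len(atoms) for atoms in per_net]
        q_value = sum(net_means) / len(net_means)
        losses.append(ent_coef * log_prob - q_value)
    return sum(losses) / len(losses)


# Two samples, two nets, three quantiles each (made-up values)
log_probs = [-1.0, 0.5]
quantiles = [
    [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
    [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]],
]
loss = actor_loss_sketch(0.2, log_probs, quantiles)
print(loss)
```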

## Main training loop

In [37]:
def run_tqc(cfg):
    # 1)  Build the  logger
    logger = Logger(cfg)
    best_reward = float('-inf')
    ent_coef = cfg.algorithm.entropy_coef

    # 2) Create the environment agent
    # train_env_agent = AutoResetGymAgent(
    #     get_class(cfg.gym_env),
    #     get_arguments(cfg.gym_env),
    #     cfg.algorithm.n_envs,
    #     cfg.algorithm.seed,
    # )
    # eval_env_agent = NoAutoResetGymAgent(
    #     get_class(cfg.gym_env),
    #     get_arguments(cfg.gym_env),
    #     cfg.algorithm.nb_evals,
    #     cfg.algorithm.seed,
    # )
    
    train_env_agent, eval_env_agent = get_env_agents(cfg)

    # 3) Create the TQC Agent
    (
        train_agent,
        eval_agent,
        actor,
        critic,
        target_critic
    ) = create_tqc_agent(cfg, train_env_agent, eval_env_agent)

    t_actor = TemporalAgent(actor)
    q_agent = TemporalAgent(critic)
    target_q_agent = TemporalAgent(target_critic)
    train_workspace = Workspace()

    # Creates a replay buffer
    rb = ReplayBuffer(max_size=cfg.algorithm.buffer_size)

    # Configure the optimizer
    actor_optimizer, critic_optimizer = setup_optimizers(cfg, actor, critic)
    entropy_coef_optimizer, log_entropy_coef = setup_entropy_optimizers(cfg)
    nb_steps = 0
    tmp_steps = 0

    # Initial value of the entropy coef alpha. If target_entropy is not auto,
    # will remain fixed
    if cfg.algorithm.target_entropy == "auto":
        target_entropy = -np.prod(train_env_agent.action_space.shape).astype(np.float32)
    else:
        target_entropy = cfg.algorithm.target_entropy

    # Training loop
    pbar = tqdm(range(cfg.algorithm.max_epochs))
    for epoch in pbar:
        # Execute the agent in the workspace
        if epoch > 0:
            train_workspace.zero_grad()
            train_workspace.copy_n_last_steps(1)
            train_agent(
                train_workspace,
                t=1,
                n_steps=cfg.algorithm.n_steps - 1,
                stochastic=True,
            )
        else:
            train_agent(
                train_workspace,
                t=0,
                n_steps=cfg.algorithm.n_steps,
                stochastic=True,
            )

        transition_workspace = train_workspace.get_transitions()
        action = transition_workspace["action"]
        nb_steps += action[0].shape[0]
        rb.put(transition_workspace)

        if nb_steps > cfg.algorithm.learning_starts:
            # Get a sample from the workspace
            rb_workspace = rb.get_shuffled(cfg.algorithm.batch_size)

            done, truncated, reward, action_logprobs_rb = rb_workspace[
                "env/done", "env/truncated", "env/reward", "action_logprobs"
            ]

            # Determines whether values of the critic should be propagated
            # True if the episode reached a time limit or if the task was not done
            # See https://github.com/osigaud/bbrl/blob/master/docs/time_limits.md
            must_bootstrap = torch.logical_or(~done[1], truncated[1])

            critic_loss = compute_critic_loss(cfg, reward, must_bootstrap,
                                              t_actor, q_agent, target_q_agent,
                                              rb_workspace, ent_coef)

            logger.add_log("critic_loss", critic_loss, nb_steps)

            actor_loss = compute_actor_loss(
                ent_coef, t_actor, q_agent, rb_workspace
            )
            logger.add_log("actor_loss", actor_loss, nb_steps)

            # Entropy coef update part ########################
            if entropy_coef_optimizer is not None:
                # Important: detach the variable from the graph
                # so that we don't change it with other losses
                # see https://github.com/rail-berkeley/softlearning/issues/60
                ent_coef = torch.exp(log_entropy_coef.detach())
                entropy_coef_loss = -(
                        log_entropy_coef * (action_logprobs_rb + target_entropy)
                ).mean()
                entropy_coef_optimizer.zero_grad()
                # We need to retain the graph because the action_logprobs
                # are used to compute both the actor loss and the critic
                # loss
                entropy_coef_loss.backward(retain_graph=True)
                entropy_coef_optimizer.step()
                logger.add_log("entropy_coef_loss", entropy_coef_loss, nb_steps)
                logger.add_log("entropy_coef", ent_coef, nb_steps)

            # Actor update part ###############################
            actor_optimizer.zero_grad()
            actor_loss.backward()
            torch.nn.utils.clip_grad_norm_(
                actor.parameters(), cfg.algorithm.max_grad_norm
            )
            actor_optimizer.step()

            # Critic update part ###############################
            critic_optimizer.zero_grad()
            critic_loss.backward()
            torch.nn.utils.clip_grad_norm_(
                critic.parameters(), cfg.algorithm.max_grad_norm
            )
            critic_optimizer.step()
            ####################################################

            # Soft update of target q function
            tau = cfg.algorithm.tau_target
            soft_update_params(critic, target_critic, tau)
            # soft_update_params(actor, target_actor, tau)

        # Evaluate ###########################################
        if nb_steps - tmp_steps > cfg.algorithm.eval_interval:
            tmp_steps = nb_steps
            eval_workspace = Workspace()  # Used for evaluation
            eval_agent(
                eval_workspace,
                t=0,
                stop_variable="env/done",
                stochastic=False,
            )
            rewards = eval_workspace["env/cumulated_reward"][-1]
            mean = rewards.mean()
            logger.log_reward_losses(rewards, nb_steps)

            pbar.set_description(f"nb_steps: {nb_steps}, reward: {mean:.3f}")
            if cfg.save_best and mean > best_reward:
                best_reward = mean
                directory = f"./agents/{cfg.gym_env.env_name}/tqc_agent/"
                if not os.path.exists(directory):
                    os.makedirs(directory)
                filename = directory + cfg.gym_env.env_name + "#tqc#team" + str(mean.item()) + ".agt"
                actor.save_model(filename)
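
The bootstrap rule stated in the training loop comments (propagate the next value when the episode is not done or when it merely hit a time limit) can be illustrated on scalars. This is a simplified one-step target with a single value instead of a set of quantiles. The function name `td_target` is hypothetical, and the flag names `done` and `truncated` follow the workspace variables `env/done` and `env/truncated`.

```python
def td_target(reward, next_value, done, truncated, discount):
    """One-step target where the next value is dropped only on true termination."""
    must_bootstrap = (not done) or truncated
    return reward + discount * must_bootstrap * next_value


discount = 0.98
ongoing = td_target(1.0, 10.0, done=False, truncated=False, discount=discount)
time_limit = td_target(1.0, 10.0, done=True, truncated=True, discount=discount)
terminal = td_target(1.0, 10.0, done=True, truncated=False, discount=discount)
print(ongoing, time_limit, terminal)
```

An episode cut by a time limit gets the same target as an ongoing one, since the task itself did not end there.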

## Definition of the parameters

In [38]:
params={
  "save_best": False,
  "logger":{
    "classname": "bbrl.utils.logger.TFLogger",
    "log_dir": "./tblogs/" + str(time.time()),
    "cache_size": 10000,
    "every_n_seconds": 10,
    "verbose": False,    
    },

  "algorithm":{
    "seed": 1,
    "n_envs": 1,
    "n_steps": 512,
    "n_updates": 512,
    "buffer_size": 1e6,
    "batch_size": 256,
    "max_grad_norm": 0.5,
    "nb_evals":10,
    "eval_interval": 2000,
    "learning_starts": 10000,
    "max_epochs": 8000,
    "discount_factor": 0.98,
    "entropy_coef": 1e-7,
    "target_entropy": "auto",
    "tau_target": 0.05,
    "top_quantiles_to_drop": 2,
    "architecture":{
      "actor_hidden_size": [32, 32],
      "critic_hidden_size": [256, 256],
      "n_nets": 2,
      "n_quantiles": 25,
    },
  },
  "gym_env":{
    "classname": "__main__.make_gym_env",
    "env_name": "CartPoleContinuous-v1",
    },
  "actor_optimizer":{
    "classname": "torch.optim.Adam",
    "lr": 1e-3,
    },
  "critic_optimizer":{
    "classname": "torch.optim.Adam",
    "lr": 1e-3,
    },
  "entropy_coef_optimizer":{
    "classname": "torch.optim.Adam",
    "lr": 1e-3,
    }
}

### Launching tensorboard to visualize the results

In [None]:
%load_ext tensorboard
%tensorboard --logdir ./tblogs

config=OmegaConf.create(params)
torch.manual_seed(config.algorithm.seed)
run_tqc(config)


## Exercises

- Use the same code on the Pendulum-v1 environment. This one is harder to tune. Get the parameters from [rl-baselines3-zoo](https://github.com/DLR-RM/rl-baselines3-zoo) and see if you manage to get TQC working on Pendulum.