
# W3D3 - RL with PPO algorithm

Today you'll be implementing the Proximal Policy Optimization (PPO) algorithm!

## Table of Contents

- [Readings](#readings)
- [On-Policy vs Off-Policy](#on-policy-vs-off-policy)
- [Actor-Critic Methods](#actor-critic-methods)
- [Learning Objectives](#learning-objectives)
- [References (not required reading)](#references-not-required-reading)
- [Actor-Critic Agent Implementation (detail #2)](#actor-critic-agent-implementation-detail-)
- [Generalized Advantage Estimation (detail #5)](#generalized-advantage-estimation-detail-)
- [Minibatch Update (detail #6)](#minibatch-update-detail-)
- [Loss Function](#loss-function)
    - [Gradient Ascent](#gradient-ascent)
    - [Clipped Surrogate Loss](#clipped-surrogate-loss)
    - [Minibatch Advantage Normalization (detail #7)](#minibatch-advantage-normalization-detail-)
    - [Value Function Loss (detail #9)](#value-function-loss-detail-)
    - [Entropy Bonus (detail #10)](#entropy-bonus-detail-)
- [Entropy Diagnostic](#entropy-diagnostic)
- [Adam Optimizer and Scheduler (details #3 and #4)](#adam-optimizer-and-scheduler-details--and-)
- [Putting It All Together](#putting-it-all-together)
- [Debug Variables (detail #12)](#debug-variables-detail-)
    - [Update Frequency](#update-frequency)
- [Bonus](#bonus)
    - [Continuous Action Spaces](#continuous-action-spaces)
    - [Vectorized Advantage Calculation](#vectorized-advantage-calculation)

## Readings

- [Spinning Up in Deep RL - PPO](https://spinningup.openai.com/en/latest/algorithms/ppo.html) - you don't need to follow all the derivations, but try to have a qualitative understanding of what all the symbols represent.
- [Spinning Up in Deep RL - Vanilla Policy Gradient](https://spinningup.openai.com/en/latest/algorithms/vpg.html#background) - PPO is a fancier version of vanilla policy gradient, so if you're struggling to understand PPO it may help to look at the simpler setting first.
- [The 37 Implementation Details of Proximal Policy Optimization](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/#solving-pong-in-5-minutes-with-ppo--envpool) - the good news is that you won't need all 37 of these today, so no need to read to the end. We will be tackling the 13 "core" details, not in the same order as presented here.
- [Andy Jones - Debugging RL, Without the Agonizing Pain](https://andyljones.com/posts/rl-debugging.html) - you've already read this for W3D2 but it will come in handy again. You'll want to reuse your probe environments from yesterday, or you can import them from the solution if you didn't implement them all.

I would recommend making a physical checklist of the 13 items and marking them as you go with how confident you are in your implementation. If things aren't working, this will help you notice if you missed one, or focus on the sections most likely to be bugged.

## On-Policy vs Off-Policy

Broadly, RL algorithms can be categorized as off-policy or on-policy. DQN learns from a replay buffer of old experiences that could have been generated by an old policy quite different than the current one. This means it is off-policy.

PPO only learns from experiences that were generated by the current policy, which is why it's called on-policy. We will generate a batch of experiences, train on it for a small number of epochs, and then discard it.

## Actor-Critic Methods

In DQN, there was no neural network representing the policy; the policy was "sometimes act randomly, otherwise take the action with max q-value". In PPO, we're going to have two neural networks:

- The actor network takes observations and outputs logits, which we can normalize into a probability distribution and sample from to determine our action.
- The critic network takes observations and outputs a predicted value for that observation. Again, we're going to equivocate between states and observations as is tradition. The point is that it doesn't have to output a value for every possible action, just a single value for the state: the expected return from there if we keep following the current policy. The critic is called a critic because, like a movie critic, it just watches what's happening without taking any actions and forms an opinion on whether states are good or bad.
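To make the two outputs concrete, here is a minimal sketch in plain Python rather than PyTorch. The logits and the value are made-up numbers standing in for network outputs, and `softmax` is a helper defined just for this illustration.

```python
import math
import random


def softmax(logits):
    """Normalize unnormalized logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical actor output for one observation: one logit per action.
actor_logits = [0.2, -0.1]
action_probs = softmax(actor_logits)

# Sample an action index in proportion to its probability.
rng = random.Random(0)
action = rng.choices(range(len(action_probs)), weights=action_probs, k=1)[0]

# Hypothetical critic output for the same observation: a single scalar,
# no matter how many actions exist.
critic_value = 17.3

print(action_probs, action, critic_value)
```

In the real agent, `torch.distributions.Categorical(logits=...)` plays the role of `softmax` plus sampling.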

## Learning Objectives

Your implementation might get huge benchmark scores by the end of the day, but don't worry if it actually just doesn't work at all on the simplest of tasks. RL can be frustrating because the feedback you get is extremely noisy and random. The agent can fail even when the code is correct and the agent can succeed even when the code is buggy. Forming a systematic process for coping with the confusion and uncertainty is the point of today, more so than producing a working PPO implementation.

Some parts of your process could include:

- Forming hypotheses about why it isn't working, and thinking about what tests you could write, Gym environments that would test the hypothesis, or where you could set a breakpoint.
- Getting a sense for the meaning of various logged metrics, and what they imply about the training process.
- Noticing confusion and sections that don't make sense, and investigating this instead of hand-waving over it.

## References (not required reading)

- [The Policy of Truth](http://www.argmin.net/2018/02/20/reinforce/) - a contrarian take on why Policy Gradients are actually a "terrible algorithm" that is "legitimately bad" and "never a good idea".
- [Tricks from Deep RL Bootcamp at UC Berkeley](https://github.com/williamFalcon/DeepRLHacks/blob/master/README.md) - more debugging tips that may be of use.
- [What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study](https://arxiv.org/pdf/2006.05990.pdf) - Google Brain researchers trained over 250K agents to figure out what really affects performance. The answers may surprise you.
- [Lilian Weng Blog](https://lilianweng.github.io/posts/2018-04-08-policy-gradient/#ppo)
- [A Closer Look At Deep Policy Gradients](https://arxiv.org/pdf/1811.02553.pdf)
- [Where Did My Optimum Go?: An Empirical Analysis of Gradient Descent Optimization in Policy Gradient Methods](https://arxiv.org/pdf/1810.02525.pdf)
- [Independent Policy Gradient Methods for Competitive Reinforcement Learning](https://papers.nips.cc/paper/2020/file/3b2acfe2e38102074656ed938abf4ac3-Supplemental.pdf) - requirements for multi-agent Policy Gradient to converge to Nash equilibrium.




In [ ]:
import argparse
import dataclasses
import os
import random
import sys
import time
from dataclasses import dataclass
from distutils.util import strtobool
from typing import Optional
import gym
import numpy as np
import torch
import torch as t
import torch.nn as nn
import torch.optim as optim
from einops import rearrange
from gym.spaces import Discrete
from torch.distributions.categorical import Categorical
from torch.utils.tensorboard import SummaryWriter
import utils
import w3d2_part2_dqn_solution
import w3d3_test
from w3d2_utils import make_env

MAIN = __name__ == "__main__"
IS_CI = os.getenv("IS_CI")




## Actor-Critic Agent Implementation (detail #2)

Implement the `Agent` class according to the diagram, inspecting `envs` to determine the observation shape and number of actions. We are using separate Actor and Critic networks because detail #13 notes that this performs better than a single shared network in simple environments. Note that today `envs` will actually have multiple instances of the environment inside, unlike yesterday's DQN which had only one instance inside.

Use `layer_init` to initialize each `Linear`, overriding the norm of the rows according to the diagram. What is the benefit of using a small norm for the last actor layer?


```mermaid

graph TD
    subgraph Critic
        Linear1["Linear(obs_shape, 64)"] --> Tanh1[Tanh] --> Linear2["Linear(64, 64)"] --> Tanh2[Tanh] --> Linear3["Linear(64, 1)<br/>row_norm=1"] --> Out

    end
    subgraph Actor
        ALinear1["Linear(obs_shape, 64)"] --> ATanh1[Tanh]--> ALinear2["Linear(64, 64)"] --> ATanh2["Tanh"] --> ALinear3["Linear(64, num_actions)<br/>row_norm=0.01"] --> AOut[Out]
    end
```




In [ ]:
def layer_init(layer: nn.Linear, row_norm=np.sqrt(2), bias_const=0.0) -> nn.Linear:
    """Initialize the provided linear layer.

    - Each row of the weight has the specified norm
    - Each element of the bias is bias_const.
    """
    torch.nn.init.orthogonal_(layer.weight, row_norm)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer


class Agent(nn.Module):
    critic: nn.Sequential
    actor: nn.Sequential

    def __init__(self, envs: gym.vector.SyncVectorEnv):
        pass


if MAIN:
    w3d3_test.test_agent_init(Agent)




## Generalized Advantage Estimation (detail #5)

There are various ways to compute advantages - follow detail #5 closely for today.

At the point where this is called we've already run some number of environments `n_envs` for some number of steps `t`: this is called the rollout phase. Now it's time to compute the advantages so we can use them in the loss function.

Given a batch of experiences, we want to compute each `advantage[t][env]`. This is equations 11 and 12 of the [PPO paper](https://arxiv.org/pdf/1707.06347.pdf), reproduced here. (The paper prints the final exponent as $T-t+1$; following the pattern of the earlier terms, where $\delta_{t+k}$ is weighted by $(\gamma\lambda)^k$, it should be $T-t-1$.)

> $\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + (\gamma\lambda)^2\delta_{t+2} + (\gamma\lambda)^3\delta_{t+3} + \dots + (\gamma\lambda)^{T-t-1}\delta_{T-1}$,
>
> where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$

<details>
<summary>Understanding GAE</summary>

We can break down the [value function](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#value-functions) $V(s_t)$ as follows (where $R$ is the reward function, so the reward at timestep $t$ is $r_t = R(s_t)$):
$$V(s_t) = E[R(s_t) + \gamma R(s_{t+1}) + \gamma^2 R(s_{t+2}) + \dots] = E[R(s_t)] + \gamma V(s_{t+1})$$

We can then use this to replace $V(s_t)$ in the equation for $\delta_t$:

$$
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
= r_t + \gamma V(s_{t+1}) - E[R(s_t)] - \gamma V(s_{t+1})
= r_t - E[R(s_t)]
$$

**So $\delta_t$ is just how much higher the received reward was at timestep $t$ than we would have expected.**

When $\lambda = 1$, the value of $\hat{A}_t$ is just the values of $\delta$ at the various timesteps summed using the normal discount:
$$
\lambda = 1: \hat{A}_t = \delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} + \dots + \gamma^{T-t-1}\delta_{T-1}
 = R(\tau) - E[R(\tau)]
$$

This is just how much better the actual return was than the expected return for this trajectory.

When $\lambda < 1$, the weights on future $\delta$ terms decay faster than the normal discount, which reduces the variance of the estimate at the cost of some bias. More info about this can be found on page 15 of [this PDF](https://arxiv.org/pdf/1804.02717.pdf), under the section titled MULTI-STEP RETURNS at the start of the page.
</details>

Implement `compute_advantages`. I recommend using a reversed for loop over `t` to get it working, and not worrying about trying to completely vectorize it. Also note that in the implementation in the solutions, a value of 1 at position `t` in `dones` indicates that `t` is the *first* timestep of a new trajectory (not the last timestep of the previous trajectory).
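Before writing the tensor version, it can help to check the recursion by hand on a tiny case. The sketch below is plain Python with made-up numbers for a single environment and three steps, and it deliberately ignores `dones`, which your real implementation must handle. It computes the advantages with a reversed loop and compares them against the sum in the formula.

```python
gamma, gae_lambda = 0.99, 0.95

# One environment, three steps, no episode boundary inside the rollout.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]  # critic's V(s_0), V(s_1), V(s_2)
next_value = 0.2          # critic's V(s_3), used to bootstrap the final step

T = len(rewards)
values_ext = values + [next_value]

# delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
deltas = [rewards[t] + gamma * values_ext[t + 1] - values_ext[t] for t in range(T)]

# Backward recursion: A_t = delta_t + gamma * lambda * A_{t+1}, with A_T = 0.
advantages = [0.0] * T
running = 0.0
for t in reversed(range(T)):
    running = deltas[t] + gamma * gae_lambda * running
    advantages[t] = running

# Direct evaluation of the weighted sum of deltas, for comparison.
direct = [
    sum((gamma * gae_lambda) ** k * deltas[t + k] for k in range(T - t))
    for t in range(T)
]

print(advantages)
print(direct)
```

The two lists should agree up to floating point error, and the last advantage is just the last $\delta$.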




In [ ]:
@torch.no_grad()
def compute_advantages(
    next_value: t.Tensor,
    next_done: t.Tensor,
    rewards: t.Tensor,
    values: t.Tensor,
    dones: t.Tensor,
    device: t.device,
    gamma: float,
    gae_lambda: float,
) -> t.Tensor:
    """Compute advantages using Generalized Advantage Estimation.

    next_value: shape (1, n_envs) - represents V(s_{t+1}) which is needed for the last advantage term
    next_done: shape (n_envs,)
    rewards: shape (t, n_envs)
    values: shape (t, n_envs)
    dones: shape (t, n_envs)

    Return: shape (t, n_envs)
    """
    pass


if MAIN:
    w3d3_test.test_compute_advantages(compute_advantages)




## Minibatch Update (detail #6)

After generating our experiences that have `(t, n_envs)` dimensions, we need to:

- Flatten the `(t, n_envs)` dimensions into one batch dimension
- Split the batch into minibatches, so we can take an optimizer step for each minibatch.

If we just randomly sampled the minibatch each time, some of our experiences might not appear in any minibatch due to random chance. This would be wasteful - we're going to discard all these experiences immediately after training, so there's no second chance for the experience to be used, unlike if it was in a replay buffer.

Implement the following functions so that each experience appears exactly once.

Tip: `Minibatch` stores the returns, which are just advantages + values.
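To see the difference between the two sampling strategies, here is a small sketch using only the standard library. It works on plain lists of indexes with a toy batch of 8, whereas the exercise expects NumPy arrays, so treat it as an illustration of the idea rather than the expected implementation.

```python
import random

batch_size, minibatch_size = 8, 2
num_minibatches = batch_size // minibatch_size
rng = random.Random(0)

# Strategy 1: draw every minibatch independently from the whole batch.
# Some indexes can be drawn several times while others are never drawn.
sampled = [rng.sample(range(batch_size), minibatch_size) for _ in range(num_minibatches)]
seen_by_sampling = {i for mb in sampled for i in mb}

# Strategy 2: shuffle once, then slice. Every index lands in exactly one minibatch.
permutation = list(range(batch_size))
rng.shuffle(permutation)
sliced = [permutation[i : i + minibatch_size] for i in range(0, batch_size, minibatch_size)]
seen_by_slicing = sorted(i for mb in sliced for i in mb)

print("independent sampling covered", len(seen_by_sampling), "of", batch_size)
print("shuffle-and-slice covered", len(seen_by_slicing), "of", batch_size)
```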




In [ ]:
@dataclass
class Minibatch:
    obs: t.Tensor
    logprobs: t.Tensor
    actions: t.Tensor
    advantages: t.Tensor
    returns: t.Tensor
    values: t.Tensor


def minibatch_indexes(batch_size: int, minibatch_size: int) -> list[np.ndarray]:
    """Return a list of length (batch_size // minibatch_size) where each element is an array of indexes into the batch.

    Each index should appear exactly once.
    """
    assert batch_size % minibatch_size == 0
    pass


def make_minibatches(
    obs: t.Tensor,
    logprobs: t.Tensor,
    actions: t.Tensor,
    advantages: t.Tensor,
    values: t.Tensor,
    obs_shape: tuple,
    action_shape: tuple,
    batch_size: int,
    minibatch_size: int,
) -> list[Minibatch]:
    """
    Flatten the n_envs and steps dimensions into one batch dimension, then shuffle and split into minibatches.

    obs: shape (t, n_envs, *observation_shape)
    logprobs: shape (t, n_envs)
    actions: shape (t, n_envs, *action_shape)
    advantages: shape (t, n_envs)
    values: shape (t, n_envs)
    """
    pass


if MAIN:
    w3d3_test.test_minibatch_indexes(minibatch_indexes)
    w3d3_test.test_make_minibatches(make_minibatches)




## Loss Function

The overall loss function is given by Eq 9 in the paper and is the sum of three terms - we'll implement each term individually.

### Gradient Ascent

Eq 9 is presented for gradient ascent, which I find confusing since we've always done gradient descent to this point.

You can actually configure Adam to do gradient ascent by passing `maximize=True`, but I've chosen to use gradient descent as usual and flip the signs of the objective as needed.
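As a quick sanity check of the sign flip, the toy below maximizes a made-up one-dimensional objective by running ordinary gradient descent on its negation. The function and step size are chosen purely for illustration.

```python
def objective(x):
    """Toy objective we want to MAXIMIZE; its peak is at x = 3."""
    return -((x - 3.0) ** 2)


def loss_grad(x):
    """Gradient of loss(x) = -objective(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


x = 0.0
learning_rate = 0.1
for _ in range(200):
    x -= learning_rate * loss_grad(x)  # ordinary gradient DESCENT on the loss

print(x, objective(x))
```

Descending on the negated objective ends up at the maximizer of the original objective, which is the convention used for the loss terms that follow.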

### Clipped Surrogate Loss

For each minibatch, calculate $L^{CLIP}$ from Eq 7 in the paper. This will allow us to improve the parameters of our actor.

Tip: we want to maximize $L^{CLIP}$, so for gradient descent the loss returned needs to be the negative of the equation.

Tip: In the paper, don't confuse $r_{t}$, which is the reward at time $t$, with $r_{t}(\theta)$, which is the probability ratio between the current policy (output of the actor) and the old policy (stored in `mb_logprobs`).

### Minibatch Advantage Normalization (detail #7)

Remember to normalize the minibatch of advantages before using it.
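If the effect of the clipping is hard to picture, the sketch below evaluates the per-sample term inside $L^{CLIP}$ for four hand-picked (ratio, advantage) pairs. It is plain Python on scalars with made-up numbers. Your implementation should work on tensors, average over the minibatch, and negate the result.

```python
def clipped_objective(ratio, advantage, clip_coef):
    """Per-sample term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_coef, min(ratio, 1.0 + clip_coef))
    return min(ratio * advantage, clipped_ratio * advantage)


clip_coef = 0.2
cases = [
    (1.5, +1.0),  # good action already much more likely: the gain is capped
    (0.5, +1.0),  # good action made less likely: not clipped, full penalty
    (0.5, -1.0),  # bad action already much less likely: the gain is capped
    (1.5, -1.0),  # bad action made more likely: not clipped, full penalty
]
results = [clipped_objective(r, a, clip_coef) for r, a in cases]
for (r, a), value in zip(cases, results):
    print(f"ratio={r}, advantage={a:+}, objective={value}")
```

In the two capped cases the objective no longer depends on the ratio, so those samples contribute no gradient. This is what stops the policy from moving too far in a single update.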




In [ ]:
def calc_policy_loss(
    probs: Categorical, mb_action: t.Tensor, mb_advantages: t.Tensor, mb_logprobs: t.Tensor, clip_coef: float
) -> t.Tensor:
    """Return the negative policy loss, suitable for minimization with gradient descent.

    probs: a torch Categorical distribution containing the actor's unnormalized logits of shape (minibatch_size, n_actions)
    mb_action: shape (minibatch_size, *action_shape)
    mb_advantages: shape (minibatch_size,)
    mb_logprobs: shape (minibatch_size,)
    clip_coef: amount of clipping, denoted by epsilon in Eq 7.
    """
    pass


if MAIN:
    w3d3_test.test_calc_policy_loss(calc_policy_loss)




### Value Function Loss (detail #9)

The value function loss lets us improve the parameters of our critic. Today we're going to implement the simple form: this is just 1/2 the mean squared difference between the critic's prediction and the observed returns. We're defining returns as `returns = advantages + values`.

The PPO paper did a more complicated thing with clipping, but we're going to deviate from the paper and NOT clip, since detail #9 gives evidence that it isn't beneficial.

Implement `calc_value_function_loss` which returns the term denoted $c_1 L_t^{VF}$ in Eq 9.

Exercise: what should the sign be on the return value from this function?

<details>

<summary>Solution - sign of value loss</summary>

We want to minimize the difference between predicted and observed, so this term should always be non-negative.

</details>
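For a feel of the magnitudes involved, here is the same calculation on three made-up experiences in plain Python. The helper `half_mse` and all the numbers are illustrative. The real function takes the critic and a minibatch of observations instead of ready-made predictions.

```python
def half_mse(predictions, targets):
    """0.5 * mean squared difference between predictions and targets."""
    squared = [(p - y) ** 2 for p, y in zip(predictions, targets)]
    return 0.5 * sum(squared) / len(squared)


# Made-up numbers for one minibatch of three experiences.
values = [1.0, 2.0, 0.5]       # critic's predictions stored during the rollout
advantages = [0.5, -1.0, 0.0]  # from the advantage calculation
returns = [a + v for a, v in zip(advantages, values)]  # returns = advantages + values

new_predictions = [1.2, 1.5, 0.5]  # critic's current predictions for the same states
vf_coef = 0.5
value_loss = vf_coef * half_mse(new_predictions, returns)

print(returns, value_loss)
```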




In [ ]:
def calc_value_function_loss(critic: nn.Sequential, mb_obs: t.Tensor, mb_returns: t.Tensor, v_coef: float) -> t.Tensor:
    """Compute the value function portion of the loss function.

    v_coef: the coefficient for the value loss, which weights its contribution to the overall loss. Denoted by c_1 in the paper.
    """
    pass


if MAIN:
    w3d3_test.test_calc_value_function_loss(calc_value_function_loss)




### Entropy Bonus (detail #10)

The entropy bonus term is intended to incentivize exploration by increasing the entropy of the action distribution. For a discrete probability distribution, entropy is just the sum over $x$ of $-p(x) \log p(x)$.

You should understand what entropy of a discrete distribution means, but you don't have to implement it yourself: `probs.entropy()` computes it using the above formula but in a numerically stable way.

Exercise: in CartPole, what are the minimum and maximum values that entropy can take? What behaviors correspond to each of these cases?

<details>

<summary>Solution - CartPole Entropy</summary>

The minimum entropy is zero, under the policy "always move left" or "always move right".

The maximum entropy is $\log(2) \approx 0.693$ under the uniform random policy over the 2 actions.

</details>
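You can confirm these two extremes numerically with a few lines of plain Python. The `entropy` helper below is written for this illustration and uses the naive formula, skipping zero-probability terms.

```python
import math


def entropy(probs):
    """Entropy in nats; terms with p == 0 contribute 0 by convention."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = entropy([0.5, 0.5])        # maximum for two actions
deterministic = entropy([1.0, 0.0])  # "always move left"
skewed = entropy([0.9, 0.1])         # somewhere in between

print(uniform, deterministic, skewed)
```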

## Entropy Diagnostic

Separately from its role in the loss function, the entropy of our action distribution is a useful diagnostic: if the entropy of the agent's actions is near the maximum, it's playing nearly randomly, which means it isn't learning anything (assuming the optimal policy isn't random). If it is near the minimum, especially early in training, then the agent might not be exploring enough.

Implement `calc_entropy_loss`.

Tip: make sure the sign is correct; for gradient descent, to actually increase entropy this term needs to be negative.




In [ ]:
def calc_entropy_loss(probs: Categorical, ent_coef: float):
    """Return the entropy loss term.

    ent_coef: the coefficient for the entropy loss, which weights its contribution to the overall loss. Denoted by c_2 in the paper.
    """
    pass


if MAIN:
    w3d3_test.test_calc_entropy_loss(calc_entropy_loss)




## Adam Optimizer and Scheduler (details #3 and #4)

Even though Adam is already an adaptive learning rate optimizer, empirically it's still beneficial to decay the learning rate.

Implement a linear decay from `initial_lr` to `end_lr` over `num_updates` steps.
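The interpolation itself is a one-liner. Here is a sketch as a standalone function, where `linear_lr` is a hypothetical helper and not part of the provided skeleton, evaluated on a toy schedule of four updates.

```python
def linear_lr(step, num_updates, initial_lr, end_lr):
    """Learning rate after `step` scheduler steps, interpolated linearly."""
    frac = step / num_updates
    return initial_lr + frac * (end_lr - initial_lr)


num_updates = 4
schedule = [linear_lr(s, num_updates, 2.5e-4, 0.0) for s in range(num_updates + 1)]
print(schedule)
```

In PyTorch, the learning rate of an existing optimizer can be changed in place by assigning to `param_group["lr"]` for each entry of `optimizer.param_groups`.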




In [ ]:
class PPOScheduler:
    def __init__(self, optimizer, initial_lr: float, end_lr: float, num_updates: int):
        self.optimizer = optimizer
        self.initial_lr = initial_lr
        self.end_lr = end_lr
        self.num_updates = num_updates
        self.n_step_calls = 0

    def step(self):
        """Implement linear learning rate decay so that after num_updates calls to step, the learning rate is end_lr."""
        pass


def make_optimizer(agent: Agent, num_updates: int, initial_lr: float, end_lr: float) -> tuple[optim.Adam, PPOScheduler]:
    """Return an appropriately configured Adam with its attached scheduler."""
    pass


if MAIN:
    w3d3_test.test_ppo_scheduler(PPOScheduler)




## Putting It All Together

Again, we've provided the boilerplate for you. It looks worse than it is - a lot of it is just tracking metrics for debugging. Implement the sections marked with placeholders.




In [ ]:
@dataclass
class PPOArgs:
    exp_name: str = os.path.splitext(os.path.basename(globals().get("__file__", "PPO_implementation")))[0]
    seed: int = 1
    torch_deterministic: bool = True
    cuda: bool = True
    track: bool = False
    wandb_project_name: str = "mlab2_ppo"
    wandb_entity: Optional[str] = None
    capture_video: bool = False
    env_id: str = "CartPole-v1"
    total_timesteps: int = 500000
    learning_rate: float = 0.00025
    num_envs: int = 4
    num_steps: int = 128
    gamma: float = 0.99
    gae_lambda: float = 0.95
    num_minibatches: int = 4
    update_epochs: int = 4
    clip_coef: float = 0.2
    ent_coef: float = 0.01
    vf_coef: float = 0.5
    max_grad_norm: float = 0.5

    def __post_init__(self):
        self.batch_size: int = int(self.num_envs * self.num_steps)
        self.minibatch_size: int = int(self.batch_size // self.num_minibatches)


arg_help_strings = {
    "exp_name": "the name of this experiment",
    "seed": "seed of the experiment",
    "torch_deterministic": "if toggled, `torch.backends.cudnn.deterministic=True`",
    "cuda": "if toggled, cuda will be enabled by default",
    "track": "if toggled, this experiment will be tracked with Weights and Biases",
    "wandb_project_name": "the wandb's project name",
    "wandb_entity": "the entity (team) of wandb's project",
    "capture_video": "whether to capture videos of the agent performances (check out `videos` folder)",
    "env_id": "the id of the environment",
    "total_timesteps": "total timesteps of the experiments",
    "learning_rate": "the learning rate of the optimizer",
    "num_envs": "the number of parallel game environments",
    "num_steps": "the number of steps to run in each environment per policy rollout",
    "gamma": "the discount factor gamma",
    "gae_lambda": "the lambda for the generalized advantage estimation",
    "num_minibatches": "the number of mini-batches",
    "update_epochs": "the K epochs to update the policy",
    "clip_coef": "the surrogate clipping coefficient",
    "ent_coef": "coefficient of the entropy",
    "vf_coef": "coefficient of the value function",
    "max_grad_norm": "the maximum norm for the gradient clipping",
}
toggles = ["torch_deterministic", "cuda", "track", "capture_video"]


def parse_args(arg_help_strings=arg_help_strings, toggles=toggles) -> PPOArgs:
    parser = argparse.ArgumentParser()
    for field in dataclasses.fields(PPOArgs):
        flag = "--" + field.name.replace("_", "-")
        type_function = field.type if field.type != bool else lambda x: bool(strtobool(x))
        toggle_kwargs = {"nargs": "?", "const": True} if field.name in toggles else {}
        parser.add_argument(
            flag, type=type_function, default=field.default, help=arg_help_strings[field.name], **toggle_kwargs
        )
    return PPOArgs(**vars(parser.parse_args()))


def train_ppo(args: PPOArgs) -> Agent:
    run_name = f"{args.env_id}__{args.exp_name}__{args.seed}__{int(time.time())}"
    if args.track:
        import wandb

        wandb.init(
            project=args.wandb_project_name,
            entity=args.wandb_entity,
            sync_tensorboard=True,
            config=vars(args),
            name=run_name,
            monitor_gym=True,
            save_code=True,
        )
    writer = SummaryWriter(f"runs/{run_name}")
    writer.add_text(
        "hyperparameters",
        "|param|value|\n|-|-|\n%s" % "\n".join([f"|{key}|{value}|" for (key, value) in vars(args).items()]),
    )
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    torch.backends.cudnn.deterministic = args.torch_deterministic
    device = torch.device("cuda" if torch.cuda.is_available() and args.cuda else "cpu")
    envs = gym.vector.SyncVectorEnv(
        [make_env(args.env_id, args.seed + i, i, args.capture_video, run_name) for i in range(args.num_envs)]
    )
    action_shape = envs.single_action_space.shape
    assert action_shape is not None
    assert isinstance(envs.single_action_space, Discrete), "only discrete action space is supported"
    agent = Agent(envs).to(device)
    num_updates = args.total_timesteps // args.batch_size
    (optimizer, scheduler) = make_optimizer(agent, num_updates, args.learning_rate, 0.0)
    obs = torch.zeros((args.num_steps, args.num_envs) + envs.single_observation_space.shape).to(device)
    actions = torch.zeros((args.num_steps, args.num_envs) + action_shape).to(device)
    logprobs = torch.zeros((args.num_steps, args.num_envs)).to(device)
    rewards = torch.zeros((args.num_steps, args.num_envs)).to(device)
    dones = torch.zeros((args.num_steps, args.num_envs)).to(device)
    values = torch.zeros((args.num_steps, args.num_envs)).to(device)
    global_step = 0
    old_approx_kl = 0.0
    approx_kl = 0.0
    value_loss = t.tensor(0.0)
    policy_loss = t.tensor(0.0)
    entropy_loss = t.tensor(0.0)
    clipfracs = []
    info = []
    start_time = time.time()
    next_obs = torch.Tensor(envs.reset()).to(device)
    next_done = torch.zeros(args.num_envs).to(device)
    num_episodes = 0
    for _ in range(num_updates):
        for i in range(0, args.num_steps):
            "YOUR CODE: Rollout phase (see detail #1)"
            for item in info:
                if "episode" in item.keys():
                    num_episodes += 1
                    if num_episodes % 100 == 0:
                        print(f"global_step={global_step}, episodic_return={item['episode']['r']}")
                    writer.add_scalar("charts/episodic_return", item["episode"]["r"], global_step)
                    writer.add_scalar("charts/episodic_length", item["episode"]["l"], global_step)
                    break
        next_value = rearrange(agent.critic(next_obs), "env 1 -> 1 env")
        advantages = compute_advantages(
            next_value, next_done, rewards, values, dones, device, args.gamma, args.gae_lambda
        )
        clipfracs.clear()
        for _ in range(args.update_epochs):
            minibatches = make_minibatches(
                obs,
                logprobs,
                actions,
                advantages,
                values,
                envs.single_observation_space.shape,
                action_shape,
                args.batch_size,
                args.minibatch_size,
            )
            for mb in minibatches:
                "YOUR CODE: compute loss on the minibatch and step the optimizer (not the scheduler). Do detail #11 (global gradient clipping) here using nn.utils.clip_grad_norm_."
        scheduler.step()
        (y_pred, y_true) = (mb.values.cpu().numpy(), mb.returns.cpu().numpy())
        var_y = np.var(y_true)
        explained_var = np.nan if var_y == 0 else 1 - np.var(y_true - y_pred) / var_y
        with torch.no_grad():
            newlogprob: t.Tensor = probs.log_prob(mb.actions)
            logratio = newlogprob - mb.logprobs
            ratio = logratio.exp()
            old_approx_kl = (-logratio).mean().item()
            approx_kl = (ratio - 1 - logratio).mean().item()
            clipfracs += [((ratio - 1.0).abs() > args.clip_coef).float().mean().item()]
        writer.add_scalar("charts/learning_rate", optimizer.param_groups[0]["lr"], global_step)
        writer.add_scalar("losses/value_loss", value_loss.item(), global_step)
        writer.add_scalar("losses/policy_loss", policy_loss.item(), global_step)
        writer.add_scalar("losses/entropy", entropy_loss.item(), global_step)
        writer.add_scalar("losses/old_approx_kl", old_approx_kl, global_step)
        writer.add_scalar("losses/approx_kl", approx_kl, global_step)
        writer.add_scalar("losses/clipfrac", np.mean(clipfracs), global_step)
        writer.add_scalar("losses/explained_variance", explained_var, global_step)
        writer.add_scalar("charts/SPS", int(global_step / (time.time() - start_time)), global_step)
        if global_step % 10 == 0:
            print("steps per second (SPS):", int(global_step / (time.time() - start_time)))
    envs.close()
    writer.close()
    return agent


if MAIN and (not IS_CI):
    if "ipykernel_launcher" in os.path.basename(sys.argv[0]):
        filename = globals().get("__file__", "<filename of this script>")
        print(f"Try running this file from the command line instead: python {os.path.basename(filename)} --help")
        args = PPOArgs()
    else:
        args = parse_args()
    agent = train_ppo(args)
    device = torch.device("cuda" if torch.cuda.is_available() and args.cuda else "cpu")
    if args.env_id == "Probe1-v0":
        print(
            "Probe1-v0: Checking if agent learns constant reward. On my machine this consistently passed after 10K timesteps."
        )
        batch = t.tensor([[0.0]]).to(device)
        value = agent.critic(batch)
        print("Value: ", value)
        expected = t.tensor([[1.0]]).to(device)
        utils.allclose_atol(value, expected, 0.0001)
    elif args.env_id == "Probe2-v0":
        print(
            "Probe2-v0: Checking if agent learns predictable reward. On my machine this consistently passed after 10K timesteps."
        )
        batch = t.tensor([[-1.0], [+1.0]]).to(device)
        value = agent.critic(batch)
        print("Value:", value)
        expected = batch
        utils.allclose_atol(value, expected, 0.0001)
    elif args.env_id == "Probe3-v0":
        print(
            "Probe3-v0: Checking reward discounting. The value of the 0 observation in the initial state should converge to gamma."
        )
        batch = t.tensor([[0.0], [1.0]]).to(device)
        value = agent.critic(batch)
        print("Value: ", value)
        expected = t.tensor([[args.gamma], [1.0]]).to(device)
        utils.allclose_atol(value, expected, 0.001)
    elif args.env_id == "Probe4-v0":
        print("Probe4-v0: Checking policy & advantage. May take 30K+ steps to converge!")
        batch = t.tensor([[0.0]]).to(device)
        value = agent.critic(batch)
        expected_value = t.tensor([[1.0]]).to(device)
        print("Value: ", value)
        policy_probs = agent.actor(batch).softmax(dim=-1)
        expected_probs = t.tensor([[0.0, 1.0]]).to(device)
        print("Policy: ", policy_probs)
        utils.allclose_atol(policy_probs, expected_probs, 0.01)
        utils.allclose_atol(value, expected_value, 0.01)
    elif args.env_id == "Probe5-v0":
        print("Checking dependence on both obs and action. May also take 30K+ steps to converge.")
        batch = t.tensor([[0.0], [1.0]]).to(device)
        value = agent.critic(batch)
        expected_value = t.tensor([[1.0], [1.0]]).to(device)
        print("Value: ", value)
        policy_probs = agent.actor(batch).softmax(dim=-1)
        expected_probs = t.tensor([[1.0, 0.0], [0.0, 1.0]]).to(device)
        print("Policy: ", policy_probs)
        utils.allclose_atol(policy_probs, expected_probs, 0.01)
        utils.allclose_atol(value, expected_value, 0.01)




## Debug Variables (detail #12)

Go through each of the debug variables that are logged. Make sure your implementation computes each of them, and that you understand what they mean and what they should look like.
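To build intuition for `old_approx_kl`, `approx_kl` and `clipfrac`, the sketch below repeats the same arithmetic that the training loop applies to tensors, but on four made-up log-probabilities in plain Python.

```python
import math

# Made-up log-probabilities of the taken actions under the old and new policies.
old_logprobs = [-0.70, -0.65, -1.20, -0.30]
new_logprobs = [-0.60, -0.75, -1.00, -0.35]
clip_coef = 0.2

logratios = [new - old for new, old in zip(new_logprobs, old_logprobs)]
ratios = [math.exp(lr) for lr in logratios]
n = len(ratios)

# Mean of -log(ratio): a simple estimator that can come out negative.
old_approx_kl = sum(-lr for lr in logratios) / n
# Mean of (ratio - 1) - log(ratio): every term is non-negative.
approx_kl = sum((r - 1.0) - lr for r, lr in zip(ratios, logratios)) / n
# Fraction of samples whose ratio left the interval [1 - eps, 1 + eps].
clipfrac = sum(abs(r - 1.0) > clip_coef for r in ratios) / n

print(old_approx_kl, approx_kl, clipfrac)
```

On these numbers the first estimator is negative even though a KL divergence never is. That is one reason to look at `approx_kl` as well.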

### Update Frequency

Note that the debug values are currently only logged once per update, meaning some are computed from the last minibatch of the last epoch in the update. This isn't necessarily the best thing to do, but if you log too often it can slow down training. You can experiment with logging more often, tracking the average over the update, or even tracking an exponential moving average.
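If you want to try the moving-average option, a minimal sketch could look like the following. The class name and the `decay` parameter are made up for this example and are not part of the provided code.

```python
class ExponentialMovingAverage:
    """Smooths a noisy scalar metric; a higher `decay` gives slower, smoother updates."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.value = None

    def update(self, x):
        # The first observation initializes the average directly.
        if self.value is None:
            self.value = x
        else:
            self.value = self.decay * self.value + (1.0 - self.decay) * x
        return self.value


ema = ExponentialMovingAverage(decay=0.5)
smoothed = [ema.update(x) for x in [1.0, 3.0, 3.0, 3.0]]
print(smoothed)  # [1.0, 2.0, 2.5, 2.75]
```

You could call `update` once per minibatch and pass the smoothed value to `writer.add_scalar` once per update.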

## Bonus

### Continuous Action Spaces

The `MountainCar-v0` environment has discrete actions, but there's also a version, `MountainCarContinuous-v0`, with a continuous action space. Unlike DQN, PPO can handle continuous actions with minor modifications. Try to adapt your agent; you'll need to handle `gym.spaces.Box` instead of `gym.spaces.Discrete` and make note of the "9 details for continuous action domains" section of the reading.
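One piece that changes is the log-probability of an action. Assuming the parameterization described in the reading (independent normal distributions per action dimension, with a state-independent log standard deviation), the log-probabilities of the components are summed. The sketch below shows that calculation in plain Python with made-up numbers. `diag_gaussian_log_prob` is a hypothetical helper, and in PyTorch you would more likely use `torch.distributions.Normal`.

```python
import math


def diag_gaussian_log_prob(action, mean, log_std):
    """Log-density of `action` under independent normals, summed over action dimensions."""
    total = 0.0
    for a, mu, ls in zip(action, mean, log_std):
        std = math.exp(ls)
        total += -0.5 * ((a - mu) / std) ** 2 - ls - 0.5 * math.log(2.0 * math.pi)
    return total


# MountainCarContinuous-v0 has a one-dimensional action; these numbers are made up.
mean = [0.3]     # hypothetical actor output for one observation
log_std = [0.0]  # state-independent parameter, so std = 1
log_prob = diag_gaussian_log_prob([0.3], mean, log_std)
print(log_prob)
```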

### Vectorized Advantage Calculation

Try optimizing away the for-loop in your advantage calculation. It's tricky, so an easier version of this is: find a vectorized calculation and try to explain what it does.
