In [19]:
import sys, os
if 'google.colab' in sys.modules and not os.path.exists('.setup_complete'):
    # Install xvfb and our launcher script for it
    !apt-get install -y xvfb
    !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/xvfb -O ../xvfb

    # Download dependencies from Github
    !wget https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/week06_policy_based/atari_wrappers.py
    !wget https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/week06_policy_based/env_batch.py
    !wget https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/week06_policy_based/runners.py

    # Update the gym environment to be compatible with the Atari environment
    !pip install -q gymnasium[atari,accept-rom-license]
    !pip install -q tensorboardX

    !touch .setup_complete

# This code creates a virtual display to draw game images on.
# It will have no effect if your machine has a monitor.
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY")) == 0:
    !bash ../xvfb start
    os.environ['DISPLAY'] = ':1'

# Implementing Advantage-Actor Critic (A2C)


In this notebook you will implement Advantage Actor Critic algorithm that trains on a batch of Atari 2600 environments running in parallel.

Firstly, we will use environment wrappers implemented in file `atari_wrappers.py`. These wrappers preprocess observations (resize, grayscale, take max between frames, skip frames and stack them together) and rewards. Some of the wrappers help to reset the environment and pass `done` flag equal to `True` when agent dies.
File `env_batch.py` includes implementation of `ParallelEnvBatch` class that allows to run multiple environments in parallel. To create an environment we can use `nature_dqn_env` function. Note that if you are using
PyTorch and not using `tensorboardX` you will need to implement a wrapper that will log **raw** total rewards that the _unwrapped_ environment returns and redefine the implemention of `nature_dqn_env` function here.


In [44]:
import numpy as np
import gymnasium as gym
from atari_wrappers import nature_dqn_env


env_name = "SpaceInvadersNoFrameskip-v4"
nenvs = 8  # change this if you have more than 8 CPU ;)
summaries = "Tensorboard"

env = nature_dqn_env(env_name, nenvs=nenvs, summaries=summaries)

n_actions = env.action_space.spaces[0].n
obs, _ = env.reset()
assert obs.shape == (nenvs, 4, 84, 84)
assert obs.dtype == np.float32


A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]


A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]


Next, we will need to implement a model that predicts logits and values. It is suggested that you use the same model as in [Nature DQN paper](https://www.nature.com/articles/nature14236) with a modification that instead of having a single output layer, it will have two output layers taking as input the output of the last hidden layer. **Note** that this model is different from the model you used in homework where you implemented DQN. You can use your favorite deep learning framework here. We suggest that you use orthogonal initialization with parameter $\sqrt{2}$ for kernels and initialize biases with zeros.


In [98]:
import torch as T
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical
import numpy as np


class Network(nn.Module):
    def __init__(
        self,
        in_channels=4,
        n_actions=6,
    ) -> None:
        super().__init__()

        self.conv1 = nn.Conv2d(in_channels, 32, 8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, 4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, 3, stride=1)
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(7 * 7 * 64, 512)
        self.actions = nn.Linear(512, n_actions)
        self.value = nn.Linear(512, 1)

        self.initialize_weights()

    def initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight.data)
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight.data)
                m.bias.data.zero_()

    def forward(self, obs):
        out = self.conv1(obs)
        out = F.relu(out)
        out = self.conv2(out)
        out = F.relu(out)
        out = self.conv3(out)
        out = F.relu(out)
        out = self.flatten(out)
        out = self.fc(out)
        out = F.relu(out)

        logits = self.actions(out)
        value = self.value(out)

        return logits, value

You will also need to define and use a policy that wraps the model. While the model computes logits for all actions, the policy will sample actions and also compute their log probabilities. `policy.act` should return a dictionary of all the arrays that are needed to interact with an environment and train the model.
Note that actions must be an `np.ndarray` while the other
tensors need to have the type determined by your deep learning framework.


In [64]:
class Policy:
    def __init__(self, model: Network):
        self.model = model

    def act(self, inputs):
        # Implement a policy by calling the model, sampling actions and computing their log probs.
        # Should return a dict containing keys ['actions', 'logits', 'log_probs', 'values'].
        inputs = T.from_numpy(inputs).float()

        logits, values = self.model(inputs)
        values = values.squeeze(-1)
        dist = Categorical(logits=logits)
        actions = dist.sample()
        log_probs = dist.log_prob(actions)

        return {
            "actions": actions,
            "logits": logits,
            "log_probs": log_probs,
            "values": values,
        }

Next will pass the environment and policy to a runner that collects partial trajectories from the environment.
The class that does is is already implemented for you.


In [65]:
from runners import EnvRunner

In [66]:
model = Network(n_actions=n_actions)
policy = Policy(model)
runner = EnvRunner(env, policy, nsteps=5)

In [67]:
# generates new rollout
trajectory = runner.get_next()
trajectory.keys()

dict_keys(['actions', 'logits', 'log_probs', 'values', 'observations', 'rewards', 'resets', 'state'])

In [68]:
# Sanity checks
assert 'logits' in trajectory, "Not found: policy didn't provide logits"
assert 'log_probs' in trajectory, "Not found: policy didn't provide log_probs of selected actions"
assert 'values' in trajectory, "Not found: policy didn't provide critic estimations"
assert trajectory['logits'][0].shape == (nenvs, n_actions), "logits wrong shape"
assert trajectory['log_probs'][0].shape == (nenvs,), "log_probs wrong shape"
assert trajectory['values'][0].shape == (nenvs,), "values wrong shape"

for key in trajectory.keys():
    if key == 'state': continue
    assert len(trajectory[key]) == 5, f"wrong number of steps in {key}"
    
print("All tests passed!")

All tests passed!


This runner interacts with the environment for a given number of steps and returns a dictionary containing
keys

-   'observations'
-   'rewards'
-   'resets'
-   'actions'
-   all other keys that you defined in `Policy`

under each of these keys there is a python `list` of interactions with the environment. This list has length $T$ that is size of partial trajectory. Partial trajectory for given moment `t` is part of `ComputeValueTargets.__call__` input argument `trajectory` from moment `t` to the end (i.e. it's different at each iteration in the algorithm).


To train the part of the model that predicts state values you will need to compute the value targets.
Any callable could be passed to `EnvRunner` to be applied to each partial trajectory after it is collected.
Thus, we can implement and use `ComputeValueTargets` callable.
The formula for the value targets is simple:

$$
\hat v(s_t) = \left( \sum_{t'=0}^{T - 1} \gamma^{t'}r_{t+t'} \right) + \gamma^T \hat{v}(s_{t+T}),
$$

In implementation, however, do not forget to use
`trajectory['resets']` flags to check if you need to add the value targets at the next step when
computing value targets for the current step. You can access `trajectory['state']['latest_observation']`
to get last observations in partial trajectory &mdash; $s_{t+T}$.


In [69]:
class ComputeValueTargets:
    def __init__(self, policy, gamma=0.99):
        self.policy = policy
        self.gamma = gamma

    def __call__(self, trajectory):
        """Compute value targets for a given partial trajectory."""

        # This method should modify trajectory inplace by adding
        # an item with key 'value_targets' to it.

        trajectory["value_targets"] = [
            T.zeros_like(trajectory["values"][i])
            for i in range(trajectory["state"]["env_steps"])
        ]
        trajectory["value_targets"][-1] = trajectory["values"][-1] * T.tensor(
            1 - trajectory["resets"][-1]
        )
        for t in reversed(range(len(trajectory["rewards"]) - 1)):
            trajectory["value_targets"][t] = T.tensor(
                trajectory["rewards"][t]
            ) + self.gamma * trajectory["value_targets"][t + 1] * T.tensor(
                1 - trajectory["resets"][t]
            )

After computing value targets we will transform lists of interactions into tensors
with the first dimension `batch_size` which is equal to `env_steps * num_envs`, i.e. you essentially need
to flatten the first two dimensions.


In [70]:
class MergeTimeBatch:
    """Merges first two axes typically representing time and env batch."""

    def __call__(self, trajectory):
        # Modify trajectory inplace.
        for key in ["log_probs", "value_targets", "values"]:
            trajectory[key] = T.stack(trajectory[key], dim=0).view(-1).float()

In [71]:
model = Network()
policy = Policy(model)
runner = EnvRunner(
    env=env,
    policy=policy,
    nsteps=5,
    transforms=[
        ComputeValueTargets(policy),
        MergeTimeBatch(),
    ],
)

In [72]:
trajectory = runner.get_next()

In [73]:
# More sanity checks
assert 'value_targets' in trajectory, "Value targets not found"
assert trajectory['log_probs'].shape == (5 * nenvs,)
assert trajectory['value_targets'].shape == (5 * nenvs,)
assert trajectory['values'].shape == (5 * nenvs,)

assert trajectory['log_probs'].requires_grad, "Gradients are not available for actor head!"
assert trajectory['values'].requires_grad, "Gradients are not available for critic head!"

print("All tests passed!")

All tests passed!


Now is the time to implement the advantage actor critic algorithm itself. You can look into your lecture,
[Mnih et al. 2016](https://arxiv.org/abs/1602.01783) paper, and [lecture](https://www.youtube.com/watch?v=Tol_jw5hWnI&list=PLkFD6_40KJIxJMR-j5A1mkxK26gh_qg37&index=20) by Sergey Levine.


In [74]:
class A2C:
    def __init__(self,
                 policy,
                 optimizer,
                 value_loss_coef=0.25,
                 entropy_coef=0.01,
                 max_grad_norm=0.5):
        self.policy = policy
        self.optimizer = optimizer
        self.value_loss_coef = value_loss_coef
        self.entropy_coef = entropy_coef
        self.max_grad_norm = max_grad_norm

    def policy_loss(self, trajectory):
        # You will need to compute advantages here.
        advantages = trajectory["value_targets"] - trajectory["values"]
        policy_loss = -trajectory["log_probs"] * advantages.detach()
        
        return policy_loss.mean()

    def value_loss(self, trajectory):
        return F.mse_loss(trajectory["values"], trajectory["value_targets"])

    def loss(self, trajectory):
        return (
            self.policy_loss(trajectory)
            + self.value_loss_coef * self.value_loss(trajectory)
            - self.entropy_coef * trajectory["log_probs"].mean()
        )

    def step(self, trajectory):
        self.optimizer.zero_grad()
        loss = self.loss(trajectory)
        loss.backward()
        nn.utils.clip_grad_norm_(self.policy.model.parameters(), self.max_grad_norm)
        self.optimizer.step()

Now you can train your model. With reasonable hyperparameters training on a single GTX1080 for 10 million steps across all batched environments (which translates to about 5 hours of wall clock time)
it should be possible to achieve _average raw reward over last 100 episodes_ (the average is taken over 100 last
episodes in each environment in the batch) of about 600. You should plot this quantity with respect to
`runner.step_var` &mdash; the number of interactions with all environments. It is highly
encouraged to also provide plots of the following quantities (these are useful for debugging as well):

-   [Coefficient of Determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) between
    value targets and value predictions
-   Entropy of the policy $\pi$
-   Value loss
-   Policy loss
-   Value targets
-   Value predictions
-   Gradient norm
-   Advantages
-   A2C loss

For optimization we suggest you use RMSProp with learning rate starting from 7e-4 and linearly decayed to 0, smoothing constant (alpha in PyTorch and decay in TensorFlow) equal to 0.99 and epsilon equal to 1e-5.


In [100]:
#if you use TensorboardSummaries
%load_ext tensorboard
%tensorboard --logdir logs

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6006 (pid 50347), started 3:35:29 ago. (Use '!kill 50347' to kill it.)

In [102]:
env_name = "SpaceInvadersNoFrameskip-v4"
nenvs = 8
summaries = "Tensorboard"

env = nature_dqn_env(env_name, nenvs=nenvs, summaries=summaries)

n_actions = env.action_space.spaces[0].n
obs, _ = env.reset()

model = Network(obs.shape[1], n_actions)

policy = Policy(model)
runner = EnvRunner(
    env, policy, nsteps=10, transforms=[ComputeValueTargets(policy), MergeTimeBatch()]
)

optimizer = optim.Adam(policy.model.parameters(), lr=7e-4, eps=1e-5)

epoch = 0
num_epochs = int(1e5)

lr_lambda = lambda epoch: 1 - epoch / num_epochs
scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

best_mean_reward = 95

a2c = A2C(policy, optimizer)

A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]


A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]


In [109]:
def evaluate(env, policy, n_games=1, t_max=10000):
    '''
    Plays n_games and returns rewards
    '''
    rewards = []
    
    for _ in range(n_games):
        s, _ = env.reset()
        
        R = 0
        for _ in range(t_max):
            action = policy.act(np.array([s]))["actions"][0]
            
            s, r, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            R += r
            if done:
                break

        rewards.append(R)
    return np.array(rewards)


eval_env = nature_dqn_env(
    env_name,
    nenvs=None,
    clip_reward=False,
    summaries=False,
)

In [131]:
# Training loop

from tqdm import trange

save_dir = "models"
os.makedirs(save_dir, exist_ok=True)

for epoch in trange(epoch, num_epochs):
    trajectory = runner.get_next()
    a2c.step(trajectory)
    scheduler.step()

    if epoch % 1000 == 0:
        eval_rewards = evaluate(eval_env, policy, n_games=50)
        if eval_rewards.mean() > best_mean_reward:
            best_mean_reward = eval_rewards.mean()
            T.save(
                model.state_dict(),
                f"{save_dir}/best_model.pt",
            )

        T.save(model.state_dict(), f"{save_dir}/model.pt")

 24%|██▍       | 20009/84000 [57:10<3:02:50,  5.83it/s]  


KeyboardInterrupt: 

In [134]:
# evaluation will take some time!
m = Network()
m.load_state_dict(T.load(f"{save_dir}/best_model.pt"))
sessions = evaluate(eval_env, Policy(m), n_games=100)
score = sessions.mean()
print(f"Your score: {score}")

Your score: 93.75


## Monitor

In [None]:
from gymnasium.wrappers.record_video import RecordVideo
env_monitor = RecordVideo(env=eval_env, video_folder="videos")
final_rewards = evaluate(
    env_monitor,
    policy,
    n_games=20,
)
env_monitor.close()

video_names = list(filter(lambda s: s.endswith(".mp4"), os.listdir("./videos/")))

In [147]:
from IPython.display import HTML

print("Final mean reward:", np.mean(final_rewards))
HTML(
    """
<video width="640" height="480" controls>
  <source src="{}" type="video/mp4">
</video>
""".format(
        "./videos/" + np.random.choice(video_names)
    )
)

Final mean reward: 96.0


### Target networks?

You may recall a technique called "target networks" we used a few weeks ago when we trained a DQN agent to play Atari Breakout and wonder why we have not suggested using them here. The answer is that this is more historical than practical.

While the "chasing the target" problem is still present in actor-critic value estimation and target networks do show up in follow-up papers, the original A3C/A2C papers do not mention them and do not explain this omission.

The hypothesis why this may not be a big deal (compared to Q-learning) goes like this. An A3C/A2C agent selects actions based on policy, not an epsilon greedy exploration function, for which the argmax can change drastically due to tiny errors in function approximation. Therefore, errors in the value target caused by target chasing will cause less damage.

Also, the actor-critic gradient relies on the advantage function $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$. Compare this to the $Q$-function $Q(s_t, a_t) = r(s_t, a_t) + \gamma \cdot \mathbb{E}_{s_{t+1} \mid s_t, a_t} V(s_{t+1})$ used in Q-learning and SARSA: we would expect that any bias in $V$-function approximation will be carried over from $V(s_{t+1})$ to $V(s_t)$ by gradient updates. However, in the formula for the advantage function the two approximations ($Q$-function and $V$-function) come with opposite signs, and thus the errors will cancel out.

The last reason may be computational. Authors were concerned to beat existent algorithms in the wall-clock learning time, and any overhead of parameter copying (target network update) counted against this goal.
