# Policy Based Methods

In this notebook we focus on policy-based methods. These methods model the policy of the agent directly, i.e. which action should be performed in a given state, rather than estimating the utility value of each action in that state. You already know one policy-based method: the Cross-Entropy method. It was presented in an earlier class, but as you may remember it had a lot of drawbacks. This time we will broaden our understanding of policy-based methods and discover solutions that are much more robust. First of all, however, let's discuss why policy-based methods are so important and what their advantages and drawbacks are.

**Advantages of policy-based methods:**

*   Intuitive - as humans we usually look for the actions that are best under the given circumstances rather than calculating utility functions, so policy-based methods may be more intuitive
*  Handling continuous action spaces - although not a focus of this notebook, policy-based methods are in general a much better choice when dealing with continuous action spaces
*  Good for stochastic environments - since we model discrete actions as a probability distribution (much like a classification problem in which we have to pick the best action for every state), policy-based methods naturally lend themselves to stochastic environments and to the exploration vs exploitation trade-off of reinforcement learning
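
To make the idea of "actions as a probability distribution" concrete, here is a minimal sketch in plain Python (no PyTorch). It turns raw scores into probabilities with a softmax and samples actions from them. The helper names `softmax` and `sample_action` and the example scores are illustrative assumptions, not part of the code used later in this notebook.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng=random):
    """Sample an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]    # hypothetical network outputs for 3 actions
probs = softmax(logits)

rng = random.Random(0)      # fixed seed so the run is repeatable
counts = [0, 0, 0]
for _ in range(10_000):
    counts[sample_action(probs, rng)] += 1

print(probs)    # the highest score gets the highest probability
print(counts)   # every action is still picked sometimes: built-in exploration
```

Because actions are sampled rather than chosen greedily, lower-scored actions are still tried occasionally, which is where the exploration comes from.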


**Disadvantages of policy-based methods:**
*  The greatest disadvantage of policy-based methods is their computational cost. Because these algorithms are on-policy, they need continuous interaction with the environment to gather fresh data, which greatly increases the cost of training

## Installations and Imports

In [2]:
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
!pip install gym[box2d] pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb ffmpeg > /dev/null 2>&1
!pip install pyvirtualdisplay > /dev/null 2>&1
!pip install gymnasium
!pip install tensorboardX
!pip install vizdoom

Collecting gymnasium
  Downloading gymnasium-0.29.1-py3-none-any.whl (953 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium)
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Installing collected packages: farama-notifications, gymnasium
Successfully installed farama-notifications-0.0.4 gymnasium-0.29.1
Collecting tensorboardX
  Downloading tensorboardX-2.6.2.2-py2.py3-none-any.whl (101 kB)

In [None]:
import gymnasium as gym
import numpy as np
from tensorboardX import SummaryWriter
from gymnasium import wrappers
import matplotlib.pyplot as plt
from IPython import display as ipythondisplay
import os
import pyvirtualdisplay
import base64
import io
import imageio
from datetime import datetime
from IPython.display import HTML
from gymnasium import Wrapper
import warnings
import random
import cv2
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import vizdoom
from vizdoom import gymnasium_wrapper

In [None]:
def render_as_image(env):
    '''
    Renders the environment as an image using Matplotlib.

    Arguments:
    - env: The environment object to render.

    Returns:
    None
    '''
    plt.imshow(env.render())
    plt.axis('off')
    plt.show()

def embed_video(file_path):
    '''
    Embeds a video file into HTML for display.

    Arguments:
    - file_path: The path to the video file.

    Returns:
    - HTML: HTML code for embedding the video.
    '''
    # Use a context manager so the file handle is closed after reading
    with open(file_path, "rb") as f:
        video_file = f.read()
    video_url = f"data:video/mp4;base64,{base64.b64encode(video_file).decode()}"
    return HTML(f"""<video width="640" height="480" controls><source src="{video_url}" type="video/mp4"></video>""")

def random_filename():
    '''
    Generates a random filename in the format "YYYY_MM_DD_HH_MM_SS.mp4".

    Returns:
    - str: Randomly generated filename.
    '''
    return datetime.now().strftime('%Y_%m_%d_%H_%M_%S.mp4')

class VideoRecorder:
    '''
    Utility class for recording video of an environment.

    Methods:
    - __init__: Initializes the video recorder.
    - record_frame: Records a frame from the environment.
    - close: Closes the video writer.
    - play: Plays the recorded video.
    - __enter__: Enters the context manager.
    - __exit__: Exits the context manager.
    '''
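    def __init__(self, filename=None, fps=30):
        '''
        Initializes the VideoRecorder.

        Arguments:
        - filename: The filename to save the recorded video. If None, a
          timestamped name is generated when the recorder is created.
        - fps: Frames per second of the recorded video.
        '''
        # A default of random_filename() in the signature would be evaluated only
        # once, at definition time, so every recorder would share one file name.
        self.filename = filename if filename is not None else random_filename()
        self.writer = imageio.get_writer(self.filename, fps=fps)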

    def record_frame(self, env, target_width = 608, target_height=400):
        '''
        Records a frame from the environment.

        Arguments:
        - env: The environment object to record.
        - target_width: Width of the target frame.
        - target_height: Height of the target frame.

        Returns:
        None
        '''
        frame = env.render()
        resized_frame = cv2.resize(frame, (target_width, target_height))
        self.writer.append_data(resized_frame)

    def close(self, *args, **kwargs):
        '''
        Closes the video writer.

        Arguments:
        None

        Returns:
        None
        '''
        self.writer.close(*args, **kwargs)

    def play(self):
        '''
        Plays the recorded video.

        Arguments:
        None

        Returns:
        - HTML: HTML code for embedding the video.
        '''
        self.close()
        return embed_video(self.filename)

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        self.play()

## REINFORCE

The REINFORCE algorithm is a generalization of the previously presented Cross-Entropy method. As a reminder, in the Cross-Entropy method we let the agent play n games, with actions that are essentially random at first. We then take a fixed percentage of the best episodes and use them to teach the model to perform better in the future. The problem is modelled as classification: the action taken by the model is labelled 1 and the rest 0, and the model is trained to predict the correct action for each state. This approach was limited when the distribution of rewards was more "challenging" or when finishing an episode was not an option. The REINFORCE method solves at least one of these problems by giving us a function to optimize.

For the REINFORCE method we would also like to model action selection as a classification problem based on the provided state. This time, however, we collect all games instead of a set percentage of them, and we optimize the following loss function:

$L = -Q(s,a)\log\pi(a|s)$

The exact derivation of this formula does not concern us here; instead we focus on its intuitive explanation. We want to minimize $-Q(s,a)\log\pi(a|s)$, which is equivalent to maximizing $Q(s,a)\log\pi(a|s)$. This in turn means we want a policy $\pi(a|s)$ that assigns high probability to actions with a large $Q(s,a)$. In practice, we calculate the log of the probability of the picked action, multiply it by the $Q(s,a)$ of the state-action pair, and use the negated result as the loss for backpropagation through our model.

You might have already noticed that the Cross-Entropy method works in exactly the same way, except that it assumes $Q(s,a)=1$ for every picked action. REINFORCE naturally fits into the exploration vs exploitation paradigm, since we choose each action with the probability indicated by our network.
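
The loss can be worked through by hand on a tiny episode. The sketch below, in plain Python, computes the discounted return for each step and then the summed loss $-Q(s,a)\log\pi(a|s)$. The helper names `discounted_returns` and `reinforce_loss`, the rewards, and the action probabilities are all made-up values for illustration, and $\gamma = 0.5$ is chosen only to keep the arithmetic readable.

```python
import math

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns = []
    running = 0.0
    for r in reversed(rewards):          # accumulate from the last step backwards
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def reinforce_loss(action_probs, returns):
    """Sum of -Q(s,a) * log pi(a|s) over the steps of one episode."""
    return -sum(q * math.log(p) for q, p in zip(returns, action_probs))

rewards = [1.0, 1.0, 1.0]        # CartPole-style reward: +1 per step
action_probs = [0.5, 0.6, 0.9]   # hypothetical pi(a_t|s_t) of the actions taken

qs = discounted_returns(rewards, gamma=0.5)
print(qs)                                    # [1.75, 1.5, 1.0]
print(reinforce_loss(action_probs, qs))      # weighted by the returns
print(reinforce_loss(action_probs, [1.0] * 3))  # Q = 1 everywhere: Cross-Entropy case
```

Earlier steps receive larger returns, so their log-probabilities weigh more in the loss. Setting every weight to 1 reduces the expression to the plain negative log-likelihood used by the Cross-Entropy method.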

### REINFORCE Algorithm - introduction

In [None]:
GAMMA = 0.99
LEARNING_RATE = 0.01
EPISODES_TO_TRAIN = 4
NUM_TEST_GAMES = 10
EXPECTED_REWARD = 180

In [5]:
class PGN(nn.Module):
    def __init__(self, input_size, n_actions):
        super(PGN, self).__init__()

        self.net = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions)
        )

    def forward(self, x):
        return self.net(x)

In [7]:
class ReinforceAgent:
    def __init__(self, net, env):
        self.net = net
        self.env = env
        self.train_data = []

    def get_output(self, obs):
        tensor_obs = torch.tensor(obs, dtype=torch.float32)
        return self.net(tensor_obs)

    def get_action(self, obs):
        probs = F.softmax(obs, dim=0)
        action = torch.multinomial(probs, num_samples=1)
        return action.item()

    def play_test_game(self, rec):
        state = self.env.reset()[0]
        rec.record_frame(self.env)
        total_reward = 0
        while True:
            agent_output = self.get_output(state)
            action = self.get_action(agent_output)
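            # Gymnasium reports termination and time-limit truncation separately; stop on either
            new_state, reward, terminated, truncated, _ = self.env.step(action)
            done = terminated or truncated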
            state = new_state
            total_reward+=reward
            if done:
                break
            rec.record_frame(self.env)
        return total_reward

    def play_n_games(self, n=100):
        self.train_data = []
        for _ in range(n):
            state = self.env.reset()[0]
            trajectory = []
            step_num = 0
            while True:
                agent_output = self.get_output(state)
                action = self.get_action(agent_output)
                new_state, reward, terminated, truncated, _ = self.env.step(action)
                done = terminated or truncated
                trajectory.append((state, action, reward))
                state = new_state
                if done:
                    break
            self.train_data.append(trajectory)

    def train(self, optimizer, gamma=GAMMA):
        self.net.train()
        total_loss = 0.0
        for trajectory in self.train_data:
            returns = 0.0
            log_probs = []
            for state, action, reward in reversed(trajectory):
                returns = gamma * returns + reward
                agent_output = self.get_output(state)
                log_prob = F.log_softmax(agent_output, dim=0)[action]
                log_probs.append(log_prob * returns)
            loss = -torch.stack(log_probs).sum()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        return total_loss / len(self.train_data)

In [8]:
rec = VideoRecorder()
env = gym.make("CartPole-v1", render_mode="rgb_array")
writer = SummaryWriter(comment="-cartpole-reinforce")

net = PGN(env.observation_space.shape[0], env.action_space.n)
agent = ReinforceAgent(net, env)
optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
num_steps = 0
prev_reward = 0

while True:
    num_steps+=1
    agent.play_n_games()
    agent.train(optimizer)
    total_reward = 0
    for i in range(NUM_TEST_GAMES):
        total_reward += agent.play_test_game(rec)
    total_reward /= NUM_TEST_GAMES
    writer.add_scalar("reward", total_reward, num_steps)
    if total_reward>prev_reward:
        print(f"Reward improved {prev_reward} -> {total_reward}")
        prev_reward = total_reward
    if total_reward>EXPECTED_REWARD:
        print(f"Done in {num_steps} steps")
        break

writer.close()

Reward improved 0 -> 31.2
Reward improved 31.2 -> 35.4
Reward improved 35.4 -> 58.7
Reward improved 58.7 -> 69.2
Reward improved 69.2 -> 73.2
Reward improved 73.2 -> 93.7
Reward improved 93.7 -> 121.1
Reward improved 121.1 -> 121.4
Reward improved 121.4 -> 283.4
Done in 10 steps


In [9]:
rec.close()
embed_video(rec.filename)

### REINFORCE algorithm with entropy loss

The previously implemented solution performed really well and achieved the expected results. However, this method still has quite a few disadvantages.

**Disadvantages of REINFORCE:**
*  Necessity of finishing an episode - we need to finish an episode in order to get a good approximation of $Q(s,a)$. This was not a problem for the CartPole environment, but for more complex environments it can be quite problematic.

*  High variance of gradients - depending on the environment, the variance of the gradient may be significant, leading to unstable training (more on this in the explanation of A2C)

*  Strong correlation of training examples - because the agent plays the same game over and over again and each state is a consequence of the previous one, the data may be strongly correlated, which makes training with SGD harder

*  Exploration - this method may be susceptible to getting stuck in local optima. Even though the policy outputs probabilities, the network may become too attached to certain actions, so we may want to "punish" it for being overconfident

In this section, we focus on how we can push the network to be more exploratory. We do this using entropy, which is defined by the following formula:

$H(\pi)=-\sum_a\pi(a|s)\log\pi(a|s)$

Entropy has a lower value when our network is more confident (the distribution collapses to a single action) and a higher value when the distribution is closer to uniform. We can use this quantity in the loss function, subtracting it from the policy loss, to encourage the model to be less confident and to discourage exploitation of only a few best actions. However, we should scale it by some constant to dampen the effect of this entropy term on the total loss. The code below lets us observe the effect of different coefficients for the entropy term.
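
A quick numeric check of the entropy formula, as a sketch in plain Python. The function name `entropy` and the two example distributions are assumptions made for illustration.

```python
import math

def entropy(probs):
    """H(pi) = -sum_a pi(a|s) * log pi(a|s); terms with p == 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.98, 0.01, 0.01]   # almost all mass on one action
uniform = [1 / 3, 1 / 3, 1 / 3]  # no preference at all

print(entropy(confident))   # small: the policy is nearly deterministic
print(entropy(uniform))     # equals log(3), the maximum for three actions
```

Since the training loss is `policy_loss - entropy_coef * entropy`, a nearly deterministic policy earns almost no entropy bonus, while a uniform one earns the largest possible bonus.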

In [10]:
GAMMA = 0.99
LEARNING_RATE = 0.01
EPISODES_TO_TRAIN = 4
EXPECTED_REWARD = 180
entropy_coeffs = [0, 0.001, 0.01, 0.05, 0.1, 0.5]

In [11]:
class ImprovedReinforceAgent:
    def __init__(self, net, env):
        self.net = net
        self.env = env
        self.train_data = []

    def get_output(self, obs):
        tensor_obs = torch.tensor(obs, dtype=torch.float32)
        return self.net(tensor_obs)

    def get_action(self, obs):
        probs = F.softmax(obs, dim=0)
        action = torch.multinomial(probs, num_samples=1)
        return action.item()

    def play_test_game(self, rec):
        state = self.env.reset()[0]
        rec.record_frame(self.env)
        total_reward = 0
        while True:
            agent_output = self.get_output(state)
            action = self.get_action(agent_output)
            new_state, reward, terminated, truncated, _ = self.env.step(action)
            done = terminated or truncated
            state = new_state
            total_reward+=reward
            if done:
                break
            rec.record_frame(self.env)
        return total_reward

    def play_n_games(self, n=100):
        self.train_data = []
        for _ in range(n):
            state = self.env.reset()[0]
            trajectory = []
            step_num = 0
            while True:
                agent_output = self.get_output(state)
                action = self.get_action(agent_output)
                new_state, reward, terminated, truncated, _ = self.env.step(action)
                done = terminated or truncated
                trajectory.append((state, action, reward))
                state = new_state
                if done:
                    break
            self.train_data.append(trajectory)

    def train(self, optimizer, gamma=GAMMA, entropy_coef=0.01):
        self.net.train()
        total_loss = 0.0
        for trajectory in self.train_data:
            returns = 0.0
            log_probs = []
            entropy = 0.0
            for state, action, reward in reversed(trajectory):
                returns = gamma * returns + reward
                agent_output = self.get_output(state)
                log_prob = F.log_softmax(agent_output, dim=0)[action]
                log_probs.append(log_prob * returns)
                entropy += -(F.softmax(agent_output, dim=0) * F.log_softmax(agent_output, dim=0)).sum()
            policy_loss = -torch.stack(log_probs).sum()
            loss = policy_loss - entropy_coef * entropy
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        return total_loss / len(self.train_data)

In [12]:
filenames = []

for coeff in entropy_coeffs:
    rec = VideoRecorder()
    print(f"----------------Entropy Coefficent {coeff}----------------")
    env = gym.make("CartPole-v1", render_mode="rgb_array")
    writer = SummaryWriter(comment=f"-cartpole-reinforce-improved-{coeff}")

    net = PGN(env.observation_space.shape[0], env.action_space.n)
    agent = ImprovedReinforceAgent(net, env)
    optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
    num_steps = 0
    prev_reward = 0
    while True:
        num_steps+=1
        agent.play_n_games()
        agent.train(optimizer, entropy_coef = coeff)
        total_reward = 0
        for i in range(NUM_TEST_GAMES):
            total_reward += agent.play_test_game(rec)
        total_reward /= NUM_TEST_GAMES
        writer.add_scalar("reward", total_reward, num_steps)
        if total_reward>prev_reward:
            print(f"For Entropy Coefficent: {coeff} reward improved {prev_reward} -> {total_reward}")
            prev_reward = total_reward
        if total_reward>EXPECTED_REWARD:
            print(f"For Entropy coefficent: {coeff} done in {num_steps} steps")
            print()
            break
    rec.close()
    filenames.append((f"Entropy: {coeff}", rec.filename))

writer.close()

----------------Entropy Coefficent 0----------------
For Entropy Coefficent: 0 reward improved 0 -> 20.4
For Entropy Coefficent: 0 reward improved 20.4 -> 45.6
For Entropy Coefficent: 0 reward improved 45.6 -> 91.3
For Entropy Coefficent: 0 reward improved 91.3 -> 128.1
For Entropy Coefficent: 0 reward improved 128.1 -> 132.5
For Entropy Coefficent: 0 reward improved 132.5 -> 141.8
For Entropy Coefficent: 0 reward improved 141.8 -> 157.1
For Entropy Coefficent: 0 reward improved 157.1 -> 238.0
For Entropy coefficent: 0 done in 10 steps

----------------Entropy Coefficent 0.001----------------
For Entropy Coefficent: 0.001 reward improved 0 -> 28.3
For Entropy Coefficent: 0.001 reward improved 28.3 -> 49.2
For Entropy Coefficent: 0.001 reward improved 49.2 -> 68.6
For Entropy Coefficent: 0.001 reward improved 68.6 -> 108.7
For Entropy Coefficent: 0.001 reward improved 108.7 -> 143.4
For Entropy Coefficent: 0.001 reward improved 143.4 -> 256.1
For Entropy coefficent: 0.001 done in 7 step

In [13]:
print(f"Video for {filenames[0][0]}")
embed_video(filenames[0][1])

Video for Entropy: 0


In [14]:
print(f"Video for {filenames[1][0]}")
embed_video(filenames[1][1])

Video for Entropy: 0.001


In [15]:
print(f"Video for {filenames[2][0]}")
embed_video(filenames[2][1])

Video for Entropy: 0.01


In [16]:
print(f"Video for {filenames[3][0]}")
embed_video(filenames[3][1])

Video for Entropy: 0.05


In [17]:
print(f"Video for {filenames[4][0]}")
embed_video(filenames[4][1])

Video for Entropy: 0.1


In [18]:
print(f"Video for {filenames[5][0]}")
embed_video(filenames[5][1])

Video for Entropy: 0.5


In this section we deal with the problem of correlated data points. Subsequent data points are highly correlated because of how the data is collected: in a single environment the agent keeps playing until the end of the episode, collecting data after each action. First we need to understand why this is problematic.

When training neural networks, we try to find the minimum of the loss function over all data points, but we do not have access to all possible data points. We only have access to some samples. In reinforcement learning these are the samples we add to train_data in the code below, i.e. samples based on the states seen by the agent. Since we only work with a sample, we cannot get the exact value of the gradient, only the estimate used by stochastic gradient descent. By using many uncorrelated examples, however, we can get very close to the gradient we want. You may already have an inkling of why correlated examples are bad. The more highly correlated our examples are, the more likely it is that the model essentially receives almost the same data point again and again. In other words, our approximation of the gradient over all data points will be very poor, since it will overly favour a very narrow set of examples.

So how can we deal with the problem of correlated examples? We can simply run multiple environments in parallel and extract training data from all of them. Please keep in mind, though, that this solution is far from perfect and requires substantial computational power.
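
One way to picture "several environments at once" is to step every environment once per round, so that neighbouring samples in the collected batch come from different episodes. The sketch below uses a hypothetical `ToyEnv` (a 1-D random walk with a simplified `reset`/`step` interface, not the Gymnasium API) and a made-up fixed policy; it illustrates the collection pattern only and is not the agent implemented in this notebook.

```python
import random

class ToyEnv:
    """Hypothetical stand-in for an environment: a noisy 1-D random walk."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 or +1; the noise makes the copies diverge from each other
        self.state += action + self.rng.choice([-1, 0, 1])
        reward = -abs(self.state)
        done = abs(self.state) >= 5
        return self.state, reward, done

def collect_round_robin(envs, policy, n_steps):
    """Step every environment once per round, so consecutive samples in the
    batch come from different, independently evolving episodes."""
    states = [env.reset() for env in envs]
    batch = []
    for _ in range(n_steps):
        for i, env in enumerate(envs):
            action = policy(states[i])
            next_state, reward, done = env.step(action)
            batch.append((i, states[i], action, reward))
            states[i] = env.reset() if done else next_state
    return batch

envs = [ToyEnv(seed) for seed in range(3)]
policy = lambda s: -1 if s > 0 else 1      # hypothetical fixed policy
batch = collect_round_robin(envs, policy, n_steps=4)
print([env_id for env_id, *_ in batch])    # [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
```

The environment index is stored with each sample so that per-episode returns can still be computed by regrouping the batch by environment.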

### Parallel REINFORCE agent

In [None]:
GAMMA = 0.99
LEARNING_RATE = 0.01
EPISODES_TO_TRAIN = 4
NUM_TEST_GAMES = 10
EXPECTED_REWARD = 180

In [None]:
class ParallelReinforceAgent:
    def __init__(self, net, envs):
        self.net = net
        self.envs = envs
        self.train_data = []

    def get_output(self, obs):
        tensor_obs = torch.tensor(obs, dtype=torch.float32)
        return self.net(tensor_obs)

    def get_action(self, obs):
        probs = F.softmax(obs, dim=0)
        action = torch.multinomial(probs, num_samples=1)
        return action.item()

    def play_test_game(self, rec):
        total_rewards = []
        for env in self.envs:
            state = env.reset()[0]
            rec.record_frame(env)
            total_reward = 0
            while True:
                agent_output = self.get_output(state)
                action = self.get_action(agent_output)
                new_state, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated
                state = new_state
                total_reward += reward
                if done:
                    break
                rec.record_frame(env)
            total_rewards.append(total_reward)
        return np.mean(total_rewards)

    def play_n_games(self, n=100):
        self.train_data = []
        for _ in range(n):
            trajectories = []
            for env in self.envs:
                state = env.reset()[0]
                trajectory = []
                step_num = 0
                while True:
                    agent_output = self.get_output(state)
                    action = self.get_action(agent_output)
                    new_state, reward, terminated, truncated, _ = env.step(action)
                    done = terminated or truncated
                    trajectory.append((state, action, reward))
                    state = new_state
                    if done:
                        break
                trajectories.append(trajectory)
            self.train_data.append(trajectories)

    def train(self, optimizer, gamma=GAMMA, entropy_coef=0.01):
        self.net.train()
        total_loss = 0.0
        total_trajectories = sum(len(games) for games in self.train_data)
        for games in self.train_data:
            for trajectory in games:
                returns = 0.0
                log_probs = []
                entropy = 0.0
                for state, action, reward in reversed(trajectory):
                    returns = gamma * returns + reward
                    agent_output = self.get_output(state)
                    log_prob = F.log_softmax(agent_output, dim=0)[action]
                    log_probs.append(log_prob * returns)
                    entropy += -(F.softmax(agent_output, dim=0) * F.log_softmax(agent_output, dim=0)).sum()
                policy_loss = -torch.stack(log_probs).sum()
                loss = policy_loss - entropy_coef * entropy
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                total_loss += loss.item()
        return total_loss / total_trajectories

In [None]:
rec = VideoRecorder()
env1 = gym.make("CartPole-v1", render_mode="rgb_array")
env2 = gym.make("CartPole-v1", render_mode="rgb_array")
env3 = gym.make("CartPole-v1", render_mode="rgb_array")
envs = [env1, env2, env3]
writer = SummaryWriter(comment="-cartpole-reinforce-parallel")

net = PGN(envs[0].observation_space.shape[0], envs[0].action_space.n)
agent = ParallelReinforceAgent(net, envs)
optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
num_steps = 0
prev_reward = 0

while True:
    num_steps+=1
    agent.play_n_games()
    agent.train(optimizer)
    total_reward = 0
    for i in range(NUM_TEST_GAMES):
        total_reward += agent.play_test_game(rec)
    total_reward /= NUM_TEST_GAMES
    writer.add_scalar("reward", total_reward, num_steps)
    if total_reward>prev_reward:
        print(f"Reward improved {prev_reward} -> {total_reward}")
        prev_reward = total_reward
    if total_reward>EXPECTED_REWARD:
        print(f"Done in {num_steps} steps")
        break

writer.close()

In [None]:
rec.close()
embed_video(rec.filename)

## Actor and Critic - A2C

This section is dedicated to one of the most successful techniques of reinforcement learning: the actor-critic method. This method combines the strengths of policy-based and value-based methods. The main reason behind such a combination is to decrease the variance of the gradient in our algorithm. We can do this by subtracting a certain value from every $Q(s,a)$ value received by the agent. But first we need to understand why we would even attempt to do that.

Let's use the example below to see why the way the algorithm works right now is not desirable. Let's assume our agent received the following $Q$ values:

*   $Q_1>0$
*   $Q_2>0$
*   $(Q_3<0) \text{ and } (|Q_3|>Q_1+Q_2)$

It is logical that SGD will move our agent toward behaviour that strongly discourages the actions leading to $Q_3$ and slightly encourages the actions leading to $Q_1$ and $Q_2$. Let us now assume that we have some new values $Q_i'$ which have the same relative values with regard to one another as the previous ones, i.e. $Q_i - Q_j = Q_i' - Q_j'$ for all pairs of values. This time, however, the values satisfy:

*   $Q_1'>0$
*   $Q_2'>0$
*   $(Q_3'>0) \text{ and } (|Q_3'| \ll Q_1') \text{ and } (|Q_3'| \ll Q_2')$

This time SGD will promote the behaviour leading to all of these values, even though their relative values are the same as before. This inconsistency can be problematic for the training process of our agent. This is exactly the problem we are trying to fix with the A2C method.

Now let's consider which values we could subtract. We could try to subtract the mean reward. This can be quite a good idea and can work in simpler environments, but there are some issues related to it:

*  Mostly works for simpler environments - getting even an interpretation of a mean value for complex environments can be impossible
*  Computational complexity - for complex environments it can take a long time to finish the sampling process
*  Necessity to finish the episode - it may be necessary to finish the episode to get adequate approximations of the $Q$ values
*  May require background knowledge - we could also fix values from the start for each behaviour we want to discourage, but this requires expert domain knowledge and may be unstable
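
The effect of subtracting the mean can be seen on a small numeric version of the second example above. This is a sketch in plain Python; the function name `gradient_weights` and the three $Q$ values are made-up assumptions, chosen so that all values are positive and the third is much smaller than the other two.

```python
def gradient_weights(q_values, baseline=0.0):
    """Weight that multiplies grad log pi(a|s) for each sample: Q - baseline."""
    return [q - baseline for q in q_values]

q_values = [10.0, 12.0, 1.0]    # all positive; the third is much smaller

raw = gradient_weights(q_values)
mean_q = sum(q_values) / len(q_values)
centred = gradient_weights(q_values, baseline=mean_q)

print(raw)       # every weight is positive: all three actions get encouraged
print(centred)   # the third weight is negative: that action is now discouraged
```

With the mean subtracted, the weights depend only on how the values compare to one another, so adding the same constant to every $Q$ value no longer changes the direction of the update.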

Instead, the solution proposed by A2C is to approximate the value of the state $V(s)$ with a network and subtract it. We can decompose $Q(s,a) = V(s) + A(s,a)$, where $V(s)$ is the network's approximation of the state value and $A(s,a)$ is the advantage of the action, so subtracting $V(s)$ from $Q(s,a)$ leaves the advantage $A(s,a)$, which is what scales the gradient. To estimate $Q(s,a)$, for every reward we perform the following calculation: $R \leftarrow r_i + \gamma R$, where $R$ starts at $0$ if we reached the end of the episode and is otherwise approximated using the network.

The network that tries to find the optimal policy is called the actor and the one estimating the values $V(s)$ is called the critic. Both are trained while our reinforcement learning agent learns to play a game.
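
The update $R \leftarrow r_i + \gamma R$ and the resulting advantages can be traced on a two-step rollout. The sketch below is plain Python; the function name `bootstrapped_returns`, the rewards, and the critic estimates are hypothetical values, and $\gamma = 0.5$ is used only to keep the numbers simple.

```python
def bootstrapped_returns(rewards, gamma, last_value):
    """Apply R <- r_i + gamma * R backwards through a rollout.

    last_value is 0.0 if the rollout ended the episode, otherwise the
    critic's estimate V(s) of the state reached after the last reward.
    """
    returns = []
    running = last_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

rewards = [1.0, 1.0]
values = [2.5, 1.5]    # hypothetical critic estimates V(s_0), V(s_1)

# Episode finished after the second reward: R starts from 0.
finished = bootstrapped_returns(rewards, gamma=0.5, last_value=0.0)
# Rollout cut off early: R starts from a critic estimate V(s_2) = 2.0.
cut_off = bootstrapped_returns(rewards, gamma=0.5, last_value=2.0)

# Advantage = estimated Q minus the critic's value for the same state.
advantages = [r - v for r, v in zip(cut_off, values)]

print(finished)     # [1.5, 1.0]
print(cut_off)      # [2.0, 2.0]
print(advantages)   # [-0.5, 0.5]
```

A negative advantage means the action did worse than the critic expected from that state, so its probability is pushed down. A positive one pushes it up.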

### A2C - CartPole

Here we will create an A2C agent for the CartPole environment. This is of course excessive, like using a tank to kill a spider ;) <br> Nonetheless, it serves illustrative purposes.

In [None]:
GAMMA = 0.99
LEARNING_RATE = 0.01
NUM_TEST_GAMES = 10
EXPECTED_REWARD = 180

In [None]:
class Actor(nn.Module):
    def __init__(self, input_size, n_actions):
        super(Actor, self).__init__()

        self.net = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions)
        )

    def forward(self, x):
        return self.net(x)

class Critic(nn.Module):
    def __init__(self, input_size, n_actions):
        super(Critic, self).__init__()

        self.net = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

    def forward(self, x):
        return self.net(x)

In [None]:
class A2CAgent:
    def __init__(self, actor_net, critic_net, env):
        self.actor = actor_net
        self.critic = critic_net
        self.env = env
        self.train_data = []

    def get_actor_output(self, obs):
        tensor_obs = torch.tensor(obs, dtype=torch.float32)
        return self.actor(tensor_obs)

    def get_critic_output(self, obs):
        tensor_obs = torch.tensor(obs, dtype=torch.float32)
        return self.critic(tensor_obs)

    def get_action(self, obs):
        probs = F.softmax(obs, dim=0)
        action = torch.multinomial(probs, num_samples=1)
        return action.item()

    def play_test_game(self, rec):
        state = self.env.reset()[0]
        rec.record_frame(self.env)
        total_reward = 0
        while True:
            agent_output = self.get_actor_output(state)
            action = self.get_action(agent_output)
            new_state, reward, terminated, truncated, _ = self.env.step(action)
            done = terminated or truncated
            state = new_state
            total_reward+=reward
            if done:
                break
            rec.record_frame(self.env)
        return total_reward

    def play_n_games(self, n=100):
        self.train_data = []
        for _ in range(n):
            state = self.env.reset()[0]
            trajectory = []
            step_num = 0
            while True:
                actor_output = self.get_actor_output(state)
                critic_output = self.get_critic_output(state)
                action = self.get_action(actor_output)
                new_state, reward, terminated, truncated, _ = self.env.step(action)
                done = terminated or truncated
                trajectory.append((state, action, reward, critic_output))
                state = new_state
                if done:
                    break
            self.train_data.append(trajectory)

    def train(self, optimizer_actor, optimizer_critic, gamma=0.99, value_coef=0.5):
        self.actor.train()
        self.critic.train()
        total_loss_actor = 0.0
        total_loss_critic = 0.0
        for trajectory in self.train_data:
            actor_loss = 0.0
            critic_loss = 0.0
            returns = 0.0
            # Walk the episode backwards so the discounted return accumulates in one pass.
            for state, action, reward, critic_output in reversed(trajectory):
                returns = gamma * returns + reward
                # Advantage: observed return minus the critic's estimate at collection time.
                advantage = returns - critic_output.item()
                actor_output = self.get_actor_output(state)
                critic_value = self.get_critic_output(state)
                log_prob = F.log_softmax(actor_output, dim=0)[action]
                actor_loss -= log_prob * advantage
                # float32 target so its dtype matches the critic output.
                returns_tensor = torch.tensor([returns], dtype=torch.float32)
                critic_loss += F.mse_loss(critic_value, returns_tensor)
            # value_coef weights the value (critic) loss, not the policy loss.
            critic_loss = value_coef * critic_loss
            total_loss_actor += actor_loss.item()
            total_loss_critic += critic_loss.item()
            optimizer_actor.zero_grad()
            optimizer_critic.zero_grad()
            actor_loss.backward()
            critic_loss.backward()
            optimizer_actor.step()
            optimizer_critic.step()
        return total_loss_actor / len(self.train_data), total_loss_critic / len(self.train_data)
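
The backward loop in `train` computes the discounted return and the advantage for every step of an episode. A minimal, framework-free sketch of that arithmetic is shown below; the function name `discounted_returns`, the toy rewards and the critic estimates in `values` are made up for illustration and are not part of the agent above.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted return G_t for every step of one episode."""
    returns = []
    running = 0.0
    # Iterate backwards: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


rewards = [1.0, 1.0, 1.0]   # toy episode: +1 per step, as in CartPole
values = [2.5, 1.5, 0.5]    # hypothetical critic estimates V(s_t)

returns = discounted_returns(rewards, gamma=0.5)
advantages = [g - v for g, v in zip(returns, values)]

print(returns)     # [1.75, 1.5, 1.0]
print(advantages)  # [-0.75, 0.0, 0.5]
```

A positive advantage means the episode went better than the critic expected from that state, so the actor loss pushes the probability of the chosen action up; a negative advantage pushes it down.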

In [None]:
rec = VideoRecorder()
env = gym.make("CartPole-v1", render_mode="rgb_array")
writer = SummaryWriter(comment="-cartpole-A2C")

actor = Actor(env.observation_space.shape[0], env.action_space.n)
critic = Critic(env.observation_space.shape[0], env.action_space.n)
agent = A2CAgent(actor, critic, env)
optimizer_actor = optim.Adam(actor.parameters(), lr=LEARNING_RATE)
optimizer_critic = optim.Adam(critic.parameters(), lr=LEARNING_RATE)
num_steps = 0
prev_reward = 0

while True:
    num_steps+=1
    agent.play_n_games()
    agent.train(optimizer_actor, optimizer_critic)
    total_reward = 0
    for i in range(NUM_TEST_GAMES):
        total_reward += agent.play_test_game(rec)
    total_reward /= NUM_TEST_GAMES
    writer.add_scalar("reward", total_reward, num_steps)
    if total_reward>prev_reward:
        print(f"Reward improved {prev_reward} -> {total_reward}")
        prev_reward = total_reward
    if total_reward>EXPECTED_REWARD:
        print(f"Done in {num_steps} steps")
        break

writer.close()

In [None]:
rec.close()
embed_video(rec.filename)

### A2C - DOOM

This example shows that some architectural elements can be reused by both the actor and the critic. Convolutional filters extract visual features that are useful to both networks, so the convolutional stack is defined once, in `SharedConvolutional`, and used by both models. Note that in the code below `DOOMActor` and `DOOMCritic` each construct their own `SharedConvolutional` instance, so they share the architecture but not the weights.
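
One way to share the weights themselves, and not only the layer definition, is to build the convolutional module once and hand the same object to both networks. The sketch below is illustrative only: `Trunk`, `Head`, the layer sizes, the 16x16 input and the single joint optimizer are assumptions made for the example, not the implementation used in this notebook.

```python
import torch
import torch.nn as nn


class Trunk(nn.Module):
    """Hypothetical small convolutional feature extractor."""

    def __init__(self, in_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=3, stride=2),
            nn.ReLU(),
            nn.Flatten(),
        )

    def forward(self, x):
        return self.conv(x)


class Head(nn.Module):
    """Hypothetical head that reuses a trunk instance passed in from outside."""

    def __init__(self, trunk, n_features, n_outputs):
        super().__init__()
        self.trunk = trunk  # the same object in every head, hence the same weights
        self.fc = nn.Linear(n_features, n_outputs)

    def forward(self, x):
        return self.fc(self.trunk(x))


trunk = Trunk(in_channels=1)
with torch.no_grad():
    # Infer the flattened feature size from a dummy 16x16 grayscale frame.
    n_features = trunk(torch.zeros(1, 1, 16, 16)).shape[1]

actor = Head(trunk, n_features, n_outputs=3)   # action logits
critic = Head(trunk, n_features, n_outputs=1)  # state value

# Collect every parameter once so the shared weights are not stepped twice.
params = {id(p): p for p in list(actor.parameters()) + list(critic.parameters())}
optimizer = torch.optim.Adam(list(params.values()), lr=1e-3)
```

With a shared trunk, both losses send gradients into the same convolutional filters, which is why the sketch uses one optimizer over the de-duplicated parameter set instead of one optimizer per network.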

In [4]:
GAMMA = 0.99
LEARNING_RATE = 0.01
NUM_TEST_GAMES = 10
EXPECTED_REWARD = 10

In [5]:
class SharedConvolutional(nn.Module):
    def __init__(self, input_shape):
        super(SharedConvolutional, self).__init__()

        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU()
        )

    def forward(self, x):
        return self.conv(x)

class DOOMActor(nn.Module):
    def __init__(self, input_shape, n_actions):
        super(DOOMActor, self).__init__()

        self.conv = SharedConvolutional(input_shape)

        conv_out_size = self._get_conv_out(input_shape)
        self.fc = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions)
        )

    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        conv_out = self.conv(x).view(x.size()[0], -1)
        return self.fc(conv_out)


class DOOMCritic(nn.Module):
    def __init__(self, input_shape):
        super(DOOMCritic, self).__init__()

        self.conv = SharedConvolutional(input_shape)

        conv_out_size = self._get_conv_out(input_shape)
        self.fc = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, 1)
        )

    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        conv_out = self.conv(x).view(x.size()[0], -1)
        return self.fc(conv_out)

In [1]:
def preprocess_net_input(x):
    gray = np.dot(x[..., :3], [0.2989, 0.5870, 0.1140])
    gray_normalized = gray / 255.0
    gray_reshaped = gray_normalized.reshape((1, x.shape[0], x.shape[1]))
    return gray_reshaped

class DOOMAgent:
    def __init__(self, actor_net, critic_net, env):
        self.actor = actor_net
        self.critic = critic_net
        self.env = env
        self.train_data = []

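    def get_actor_output(self, obs, device):
        # obs has shape (1, H, W). Add a batch dimension for the network,
        # then drop it again so the result is a 1-D vector of action logits.
        tensor_obs = torch.tensor(obs, dtype=torch.float32).to(device)
        return self.actor(tensor_obs.unsqueeze(0)).squeeze(0)

    def get_critic_output(self, obs, device):
        # Same batching as above; the result has shape (1,).
        tensor_obs = torch.tensor(obs, dtype=torch.float32).to(device)
        return self.critic(tensor_obs.unsqueeze(0)).squeeze(0)

    def get_action(self, obs):
        # obs holds the actor's logits; normalise over the action axis and sample.
        probs = F.softmax(obs, dim=-1)
        action = torch.multinomial(probs, num_samples=1)
        return action.item()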

    def play_test_game(self, rec, device):
        state = self.env.reset()[0]
        rec.record_frame(self.env)
        total_reward = 0
        while True:
            agent_output = self.get_actor_output(preprocess_net_input(state["screen"]), device)
            action = self.get_action(agent_output)
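            new_state, reward, terminated, truncated, _ = self.env.step(action)
            # An episode ends on failure (terminated) or on the time limit (truncated).
            done = terminated or truncated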
            state = new_state
            total_reward+=reward
            if done:
                break
            rec.record_frame(self.env)
        return total_reward

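    def play_n_games(self, device, n=20):
        # Collect n full episodes with the current policy (on-policy data).
        self.train_data = []
        for _ in range(n):
            state = self.env.reset()[0]
            trajectory = []
            while True:
                # Preprocess the frame once and reuse it for both networks.
                net_input = preprocess_net_input(state["screen"])
                # No gradients are needed while collecting experience.
                with torch.no_grad():
                    actor_output = self.get_actor_output(net_input, device)
                    critic_output = self.get_critic_output(net_input, device)
                action = self.get_action(actor_output)
                new_state, reward, terminated, truncated, _ = self.env.step(action)
                trajectory.append((state, action, reward, critic_output))
                state = new_state
                # An episode ends on failure (terminated) or on the time limit (truncated).
                if terminated or truncated:
                    break
            self.train_data.append(trajectory)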

    def train(self, optimizer_actor, optimizer_critic, device, gamma=GAMMA, entropy_coef=0.01, value_coef=0.5):
        self.actor.train()
        self.critic.train()
        total_loss_actor = 0.0
        total_loss_critic = 0.0

        for trajectory in self.train_data:
            actor_loss = 0.0
            critic_loss = 0.0
            returns = 0.0

            # Walk the episode backwards so the discounted return accumulates in one pass.
            for state, action, reward, critic_output in reversed(trajectory):
                returns = gamma * returns + reward
                # Advantage: observed return minus the critic's estimate at collection time.
                advantage = returns - critic_output.item()
                state_preprocessed = preprocess_net_input(state["screen"])

                returns_tensor = torch.tensor([returns], dtype=torch.float32).to(device)
                # Flatten so the indexing and the loss work with or without a batch dimension.
                actor_output = self.get_actor_output(state_preprocessed, device).view(-1)
                critic_value = self.get_critic_output(state_preprocessed, device).view(-1)

                log_prob = F.log_softmax(actor_output, dim=0)[action]
                actor_loss -= log_prob * advantage
                critic_loss += F.mse_loss(critic_value, returns_tensor)

            # value_coef weights the value (critic) loss; entropy_coef is currently unused.
            critic_loss = value_coef * critic_loss
            total_loss_actor += actor_loss.item()
            total_loss_critic += critic_loss.item()
            optimizer_actor.zero_grad()
            optimizer_critic.zero_grad()

            actor_loss.backward()
            critic_loss.backward()
            optimizer_actor.step()
            optimizer_critic.step()
        return total_loss_actor / len(self.train_data), total_loss_critic / len(self.train_data)
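
The softmax axis matters whenever a network output carries a leading batch dimension, as happens when a single frame is passed through a network that reshapes its activations to `(batch, -1)`. A minimal sketch with made-up logits, showing what happens when the softmax is taken over the wrong axis:

```python
import torch
import torch.nn.functional as F

# One state, three actions: shape (1, 3). The values are arbitrary.
logits = torch.tensor([[2.0, 0.0, -1.0]])

wrong = F.softmax(logits, dim=0)   # normalises over the batch axis (size 1)
right = F.softmax(logits, dim=-1)  # normalises over the action axis

print(wrong)  # every entry is 1.0, so the logits are ignored
print(right)  # a proper distribution that favours action 0
```

Sampling from `wrong` with `torch.multinomial` picks actions uniformly at random regardless of what the network has learned, so it is worth checking output shapes before applying `softmax` or `log_softmax`.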

In [7]:
rec = VideoRecorder()
env = gym.make("VizdoomDefendCenter-v0", render_mode="rgb_array")
writer = SummaryWriter(comment="-DOOM-A2C")
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# The "screen" observation is (height, width, channels); the networks take one grayscale channel.
frame_shape = env.observation_space["screen"].shape[:-1]
final_shape = (1, frame_shape[0], frame_shape[1])

actor = DOOMActor(final_shape, env.action_space.n).to(device)
critic = DOOMCritic(final_shape).to(device)
agent = DOOMAgent(actor, critic, env)
optimizer_actor = optim.Adam(actor.parameters(), lr=LEARNING_RATE)
optimizer_critic = optim.Adam(critic.parameters(), lr=LEARNING_RATE)
num_steps = 0
prev_reward = -5

while True:
    num_steps+=1
    agent.play_n_games(device)
    agent.train(optimizer_actor, optimizer_critic, device)
    total_reward = 0
    for i in range(NUM_TEST_GAMES):
        total_reward += agent.play_test_game(rec, device)
    total_reward /= NUM_TEST_GAMES
    writer.add_scalar("reward", total_reward, num_steps)
    if total_reward>prev_reward:
        print(f"Reward improved {prev_reward} -> {total_reward}")
        prev_reward = total_reward
    if total_reward>EXPECTED_REWARD:
        print(f"Done in {num_steps} steps")
        break

writer.close()


Reward improved -5 -> -1.0


SignalException: Signal SIGINT received. ViZDoom instance has been closed.

In [None]:
%load_ext tensorboard
%tensorboard --logdir=runs