# Actor-Critic Algorithms

In the [previous notebook](./reinforce.ipynb), we encountered the [REINFORCE algorithm]() as a powerful alternative to [deep Q learning](). It directly searches for a parametric policy that maximizes the expected total reward. To reduce the high variance of the Monte Carlo gradient estimates, we also discussed the use of baselines.

Although we identified the state-value function $V$ as the optimal baseline, we did not comment on how to compute it in practice. In this notebook, we discuss the **Actor-Critic Method**, which addresses this problem. Loosely speaking, we extend the architecture of the REINFORCE algorithm with a second deep network that estimates the value function.

The two networks are trained simultaneously as the agent proceeds through the environment. The policy network from the REINFORCE algorithm is called the **Actor**, as it suggests which action to take in a given state. The second network is called the **Critic**, as it evaluates the actor's choices by estimating the value function.

Again, we build on the code samples from the amazing [pytorch-examples repository](https://github.com/pytorch/examples/blob/master/reinforcement_learning/actor_critic.py) and closely follow the highly instructive conceptual derivations from [OpenAI Spinning Up](https://spinningup.openai.com/en/latest/algorithms/sac.html).

## Pseudocode

As indicated above, the Actor-Critic Method consists of two central ingredients. First, a policy network $\pi_\theta(a|s)$ returning the probability of selecting action $a$ in state $s$. Second, a value network $V_\phi(s)$ approximating the value function $V^{\pi_\theta}(s)$ under the policy $\pi_\theta$.

To fit the policy parameters $\theta$, we perform a gradient ascent step as in the REINFORCE algorithm, with the critic $V_\phi$ as baseline:
$$\theta \leftarrow \theta + \alpha \sum_{t < T} \nabla_\theta \log \pi_\theta(a_t|s_t)\Big(\sum_{t' \ge t} r_{t'} - V_\phi(s_t)\Big)$$

To fit the critic parameters $\phi$, we perform a gradient descent step on a regression loss that compares the critic $V_\phi(s_t)$ with the rewards-to-go $\sum_{t' \ge t}r_{t'}$ through a suitable loss function $\ell$. That is,
$$\phi \leftarrow \phi - \alpha' \sum_{t < T} \nabla_\phi \ell\Big(\sum_{t' \ge t}r_{t'}, V_\phi(s_t)\Big)$$
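
To make the two update targets concrete, here is a minimal pure-Python sketch on a toy three-step episode. The reward and critic values are made up for illustration, and the squared error stands in as one possible choice of the loss $\ell$; neither is taken from the notebook's implementation.

```python
def rewards_to_go(rewards):
    """Return the undiscounted reward-to-go sum_{t' >= t} r_{t'} for every step t."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]

rewards = [1.0, 1.0, 1.0]   # hypothetical rewards of a three-step episode
values = [2.5, 2.5, 0.5]    # hypothetical critic outputs V_phi(s_t)

targets = rewards_to_go(rewards)                        # [3.0, 2.0, 1.0]
# weights multiplying grad log pi(a_t|s_t) in the actor update
advantages = [g - v for g, v in zip(targets, values)]   # [0.5, -0.5, 0.5]
# critic regression loss, here with the squared error as an assumed choice of l
critic_loss = sum((g - v) ** 2 for g, v in zip(targets, values))  # 0.75
```

A positive advantage raises the log-probability of the action taken, a negative one lowers it, while the critic is pulled towards the observed rewards-to-go.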

## Cartpole Example

First, we import ``gym`` and load the ``CartPole``-environment.

In [42]:
import gym
env = gym.make('CartPole-v0')

Instead of creating two entirely separate networks for selecting an action and estimating the value function, we use a common base network with one hidden layer. On top of this shared layer sit two heads: one outputs the action probabilities, the other estimates the value function.

In [43]:
import torch
import torch.nn as nn
import torch.nn.functional as F

nactions = env.action_space.n
state_dim = env.observation_space.shape[0]
nhidden = 128

class Policy(nn.Module):
    """Network with a shared hidden layer and two heads: action probabilities (actor) and state value (critic)
    """
    def __init__(self):
        """Initialize the shared layer and the two heads
        """
        super().__init__()
        self.affine = nn.Linear(state_dim, nhidden)
        self.action_head = nn.Linear(nhidden, nactions)
        self.value_head = nn.Linear(nhidden, 1)

        self.saved_actions = []
        self.rewards = []

    def forward(self, s):
        """Compute action probabilities and state value from state

        # Arguments
            s: input state
        # Result
            action probabilities and estimated state value
        """
        s = F.relu(self.affine(s))
        action_scores = self.action_head(s)
        state_values = self.value_head(s)
        return F.softmax(action_scores, dim=-1), state_values


model = Policy()
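
As a quick sanity check of the two-headed design, the following sketch feeds a dummy state through a standalone copy of the network. `TwoHeadedNet` is a hypothetical name, and the dimensions are hard-coded to CartPole's four state components and two actions rather than read from `env`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadedNet(nn.Module):
    """Standalone copy of the shared-layer architecture (illustrative only)."""
    def __init__(self, state_dim=4, nhidden=128, nactions=2):
        super().__init__()
        self.affine = nn.Linear(state_dim, nhidden)
        self.action_head = nn.Linear(nhidden, nactions)
        self.value_head = nn.Linear(nhidden, 1)

    def forward(self, s):
        h = F.relu(self.affine(s))  # features shared by both heads
        return F.softmax(self.action_head(h), dim=-1), self.value_head(h)

net = TwoHeadedNet()
probs, value = net(torch.zeros(4))  # dummy state
print(probs.shape, value.shape)     # one probability per action, one scalar value
```

Because both heads read the same hidden features, gradients from the actor loss and from the critic loss both update the shared layer.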

Selecting an action proceeds in almost the same way as for the REINFORCE algorithm. The only difference is that we now also record the critic's value estimate for the current state.

In [44]:
import torch.optim as optim
from torch.distributions import Categorical
from collections import namedtuple

def select_action(s):
    """Select action according to policy
    
    # Arguments
        s: state
    # Result
        selected action
    """
    s = torch.from_numpy(s).float()
    probs, v = model(s)
    m = Categorical(probs)
    action = m.sample()
    
    #save log-probability of the action and the critic's value estimate
    model.saved_actions.append((m.log_prob(action), v))
    return action.item()
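
For readers unfamiliar with `Categorical`, the sampling step can be mimicked in plain Python. This is only a sketch of the idea (draw an index according to the given probabilities and keep its log-probability), not a description of how PyTorch implements it; `sample_action` and the probabilities are hypothetical.

```python
import math
import random

def sample_action(probs, rng):
    """Draw an action index from a discrete distribution; return it with its log-probability."""
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])

rng = random.Random(0)   # seeded generator for reproducibility
probs = [0.7, 0.3]       # hypothetical policy output for some state
action, log_prob = sample_action(probs, rng)
print(action, log_prob)
```

The stored log-probability is what later enters the policy loss, weighted by the advantage.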

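The learning step is also very similar. However, we now also fit the output of the value network to the rewards-to-go. Note that, in contrast to the pseudocode, the implementation discounts future rewards with the factor `GAMMA`, standardizes the returns, and updates actor and critic with a single optimizer applied to the sum of both losses.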

In [45]:
GAMMA = .99
optimizer = optim.Adam(model.parameters(), 
                       lr = 1e-2)
eps = 1e-7

def finish_episode():
    """Update actor and critic at the end of an episode
    """
    R = 0
    saved_actions = model.saved_actions
    policy_losses = []
    value_losses = []
    returns = []
    
    #compute discounted rewards-to-go and standardize
    for r in model.rewards[::-1]:
        R = r + GAMMA * R
        returns.insert(0, R)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + eps)
    
    #compute losses for actor and critic
    for (log_prob, value), R in zip(saved_actions, returns):
        #value.item() is a plain float, so the policy loss does not backpropagate into the value head
        advantage = R - value.item()
        policy_losses.append(-log_prob * advantage)
        value_losses.append(F.smooth_l1_loss(value, torch.tensor([R])))

    optimizer.zero_grad()
    loss = torch.stack(policy_losses).sum() + torch.stack(value_losses).sum()
    loss.backward()
    optimizer.step()
    del model.rewards[:]
    del model.saved_actions[:]
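
The numerical steps inside `finish_episode` can be traced by hand with a small pure-Python sketch. The helper names are hypothetical; the sample standard deviation and the threshold `beta=1.0` are chosen to mirror the defaults of `torch.Tensor.std` and `F.smooth_l1_loss` as far as we are aware.

```python
import statistics

GAMMA = 0.99

def discounted_returns(rewards, gamma=GAMMA):
    """Accumulate R_t = r_t + gamma * R_{t+1}, walking backwards through the episode."""
    R, out = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        out.append(R)
    return out[::-1]

def standardize(xs, eps=1e-7):
    """Shift to zero mean and scale to (roughly) unit sample standard deviation."""
    mean = statistics.mean(xs)
    std = statistics.stdev(xs)  # sample standard deviation (divides by n - 1)
    return [(x - mean) / (std + eps) for x in xs]

def smooth_l1(pred, target, beta=1.0):
    """Huber-type loss: quadratic for small errors, linear for large ones."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

returns = discounted_returns([1.0, 1.0, 1.0])  # approximately [2.9701, 1.99, 1.0]
standardized = standardize(returns)
print(returns, standardized)
print(smooth_l1(0.5, 0.0), smooth_l1(3.0, 0.0))  # quadratic branch, then linear branch
```

The linear branch of the loss keeps a single badly mispredicted state from dominating the critic update.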

Finally, we iterate over several episodes.

In [None]:
neps = int(1e3)

PRINT_FREQ = int(1e2)
MAX_STEPS = int(1e4)

running_reward = 10

for i in range(neps):
    #classic gym API (gym < 0.26): reset() returns the state, step() returns a 4-tuple
    s, ep_reward = env.reset(), 0
    
    #collect rewards
    for t in range(1, MAX_STEPS):
        a = select_action(s)
        s, r, done, _ = env.step(a)
        model.rewards.append(r)
        ep_reward += r
        if done:
            break

    #average rewards
    running_reward = 0.05 * ep_reward + (1 - 0.05) * running_reward
    
    #train actor and critic
    finish_episode()
    if i % PRINT_FREQ == 0:
        print(running_reward)

10.15
112.55445072558855
136.02372568341798
178.54495893556654
192.908989916315
199.38158812970656
186.19958292910866
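
The printed numbers are exponentially smoothed episode rewards, not raw ones. A minimal sketch of the smoothing rule follows; `update_running_reward` is a hypothetical helper, and the episode reward of 13 is an inference from the first printed value rather than something the output states.

```python
def update_running_reward(running, ep_reward, alpha=0.05):
    """Exponential moving average: keep 95% of the old value, mix in 5% of the new one."""
    return alpha * ep_reward + (1 - alpha) * running

# starting from the initial value 10, a first episode with reward 13 gives roughly 10.15
running = update_running_reward(10, 13)
print(running)
```

Because each episode contributes only 5%, the running reward lags behind the agent's current performance, which explains why it approaches the CartPole-v0 cap of 200 steps only gradually.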
