# Actor Critic Methods

## From REINFORCE to Actor-Critic

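The return that weights the policy gradient in REINFORCE can be replaced by other signals, with or without a baseline. The equations below show the classic variants of actor-critic methods: the Monte Carlo return $G_{t}$, a learned action-value critic $Q^{w}$, the advantage $A^{w}$, and the TD error $\delta$.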

\begin{equation}
\begin{split}
\nabla_{\theta}J(\theta) &= \mathop{\mathbb{E}_{\pi_{\theta}}}\big[\nabla_{\theta} \log \pi_{\theta}(s,a)G_{t} \big] \hspace{3cm} \text{REINFORCE}\\
&= \mathop{\mathbb{E}_{\pi_{\theta}}}\big[\nabla_{\theta} \log \pi_{\theta}(s,a)Q^{w}(s,a) \big] \hspace{2cm} \text{Q Actor-Critic}\\
&= \mathop{\mathbb{E}_{\pi_{\theta}}}\big[\nabla_{\theta} \log \pi_{\theta}(s,a)A^{w}(s,a) \big] \hspace{2cm} \text{Advantage Actor-Critic}\\
&= \mathop{\mathbb{E}_{\pi_{\theta}}}\big[\nabla_{\theta} \log \pi_{\theta}(s,a)\delta \big] \hspace{3.2cm} \text{TD Actor-Critic}\\
\end{split}
\end{equation}
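
The variants differ only in the scalar that multiplies the score function $\nabla_{\theta} \log \pi_{\theta}(s,a)$. As a rough illustration, the sketch below computes two of those scalars for one toy episode: the Monte Carlo return $G_{t}$ and the one-step TD error, here assumed to come from a state-value critic as $\delta_{t} = r_{t} + \gamma V(s_{t+1}) - V(s_{t})$. The function names, rewards, and value estimates are made up for the example.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def td_errors(rewards, values, gamma):
    """Return delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for every step t.

    `values` has one more entry than `rewards`: V(s_0), ..., V(s_T),
    with V(s_T) = 0 when s_T is terminal.
    """
    return [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards))
    ]


# Hypothetical three-step episode with made-up critic estimates
rewards = [-1.0, -1.0, 20.0]
values = [10.0, 15.0, 18.0, 0.0]
returns = discounted_returns(rewards, gamma=0.5)
deltas = td_errors(rewards, values, gamma=0.5)
```

The return needs the whole episode before it can be computed, while each TD error only needs one transition and the critic's current estimates, which is what allows actor-critic methods to update at every time step.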

Implementation examples: 
- https://www.datahubbs.com/policy-gradients-and-advantage-actor-critic/
- https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import gym
import sys

from collections import deque

import torch
from torch.autograd import Variable
from torch import nn
from torch.nn import functional as F
from torch import optim
from torch.distributions import Categorical

#### Create the environment

In [2]:
env = gym.make("Taxi-v2")  # newer gym releases register this environment as "Taxi-v3"

In [3]:
obs = env.reset()
next_obs, reward, done, info = env.step(env.action_space.sample())
env.render()
env.close()

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)


In [4]:
print("Observation space:", env.observation_space.n)
print("Action space:", env.action_space.n)

Observation space: 500
Action space: 6
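
The 500 discrete states can be read as every combination of 25 taxi positions (a 5 × 5 grid), 5 passenger locations (the four landmarks plus "in the taxi"), and 4 destinations. The sketch below reproduces that indexing in plain Python, assuming the row-major layout used by Gym's Taxi environment. `encode_state` and `decode_state` are hypothetical helper names for this illustration, not part of the Gym API.

```python
def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four components into a single index in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination


def decode_state(index):
    """Invert encode_state by peeling off the components from the right."""
    index, destination = divmod(index, 4)
    index, passenger_loc = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_loc, destination


n_states = 5 * 5 * 5 * 4
# Hypothetical example: taxi in the top-right cell, passenger at
# location 0, destination 2
state = encode_state(taxi_row=0, taxi_col=4, passenger_loc=0, destination=2)
```

Because the observation is a single integer, the networks below take a one-hot vector of length 500 as input rather than the raw index.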


## Implementation

#### Algorithm
---
```
Input: a differentiable policy parameterization pi(a|s, theta)              [Policy Network]
Input: a differentiable action-value function parameterization Q_w(s, a)    [Value Network]
Parameters: step sizes alpha_theta > 0, alpha_w > 0; discount factor gamma in [0, 1]

Initialise theta, w

Loop forever for each episode:

        Initialise S
        Sample A ~ pi(.|S, theta)    [policy(state)]

        Loop while S is not terminal for each time step:
                Take action A, observe S', R
                Sample A' ~ pi(.|S', theta)    [policy(next state)]
                delta = R + gamma * Q_w(S', A') - Q_w(S, A)    [TD(0) error; Q_w(S', A') = 0 if S' is terminal]
                theta = theta + alpha_theta * grad_theta log pi(A|S, theta) * Q_w(S, A)    [policy gradient update]
                w = w + alpha_w * delta * grad_w Q_w(S, A)    [TD(0) update]
                S = S', A = A'
```
---
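
To make the loop above concrete, here is a minimal tabular sketch in plain Python. It uses several assumptions: a hypothetical five-state chain environment stands in for Taxi, the policy is a softmax over per-state action preferences `theta[s][a]`, the critic is a table `w[s][a]`, and both are initialised once before the first episode. None of the names come from the notebook's PyTorch code.

```python
import math
import random

N_STATES, N_ACTIONS, GOAL = 5, 2, 4


def step(state, action):
    """Hypothetical chain: action 1 moves right, action 0 moves left."""
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    done = next_state == GOAL
    reward = 1.0 if done else -0.1
    return next_state, reward, done


def policy_probs(theta, state):
    """Softmax over the action preferences theta[state]."""
    top = max(theta[state])
    exps = [math.exp(p - top) for p in theta[state]]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(theta, state, rng):
    return rng.choices(range(N_ACTIONS), weights=policy_probs(theta, state))[0]


def q_actor_critic(n_episodes=500, alpha_theta=0.1, alpha_w=0.2,
                   gamma=0.95, max_steps=200, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # actor parameters
    w = [[0.0] * N_ACTIONS for _ in range(N_STATES)]      # critic: Q_w(s, a)
    for _ in range(n_episodes):
        s = 0
        a = sample_action(theta, s, rng)
        for _ in range(max_steps):
            s2, r, done = step(s, a)
            a2 = sample_action(theta, s2, rng)
            # TD(0) error; the value of a terminal state is taken as 0
            q_next = 0.0 if done else w[s2][a2]
            delta = r + gamma * q_next - w[s][a]
            # Actor: for a tabular softmax,
            # d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s)
            probs = policy_probs(theta, s)
            q_sa = w[s][a]
            for b in range(N_ACTIONS):
                indicator = 1.0 if b == a else 0.0
                theta[s][b] += alpha_theta * (indicator - probs[b]) * q_sa
            # Critic: grad_w Q_w(s, a) is 1 for entry (s, a) and 0 elsewhere
            w[s][a] += alpha_w * delta
            if done:
                break
            s, a = s2, a2
    return theta, w


theta, w = q_actor_critic()
```

On this toy chain the critic entry for stepping into the goal approaches the terminal reward of 1, and the preferences should tend to tilt toward action 1. With neural networks the two tabular updates become gradient steps on the policy and value network parameters.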

In [5]:
class Logs:
    """
    Global logs
    """
    def __init__(self):
        # History
        self.loss_history = []
        self.reward_history = []

In [6]:
class PolicyNetwork(nn.Module):
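    """
    The policy network (actor): a one-hidden-layer MLP over one-hot encoded states
    Args:
        n_inputs (int): size of the one-hot state vector
        n_outputs (int): number of discrete actions
    """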
    
    def __init__(self, n_inputs, n_outputs):
        super().__init__()
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs
        
        self.reward_history = []
        self.loss_history = []
        
        self.fc1 = nn.Linear(self.n_inputs, 128)
        self.dropout1 = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(128, self.n_outputs)
        self.softmax = nn.Softmax(dim=-1)
        
        # save log probs history and rewards history
        self.saved_log_probs = []
        self.rewards = []
        
    def reset(self):
        self.saved_log_probs = []
        self.rewards = []
        
    def forward(self, x):
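        """
        Forward pass
        Args:
            x (torch.Tensor): one-hot encoded state(s), last dimension n_inputs
        Returns:
            torch.Tensor: log-probabilities over actions (log-softmax output)
        """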
        x = F.relu(self.fc1(x))
        x = self.dropout1(x)
        x = self.fc2(x)
        return F.log_softmax(x, dim=-1)

In [7]:
class ValueNetwork(nn.Module):
    """Value network (critic): maps a one-hot state to one Q-value estimate per action"""
    
    def __init__(self, state_size, action_size):
        super().__init__()
        self.state_size = state_size
        self.action_size = action_size
        
        # MLP layers
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, action_size)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

In [None]:
class A2CAgent:
    """Actor Critic Agent"""
    
    def __init__(self, env, policy_network, value_network):
        
        self.env = env
        self.policy_network = policy_network
        self.value_network = value_network
        
        self.n_state = env.observation_space.n
        self.n_action = env.action_space.n
        
        # Experience Replay
        self.experience_memory = deque(maxlen=500)
        
    def create_model(self, model):
        return model(self.n_state, self.n_action)
    
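    def act(self, state):
        """Sample an action from the current policy for a discrete state index."""
        # One-hot encode the integer state so it can be fed to the MLP
        state = torch.as_tensor(state, dtype=torch.long)
        state = F.one_hot(state, num_classes=self.n_state).float()
        # The policy network returns log-probabilities, so pass them as logits
        log_probs = self.policy_network(state)
        m = Categorical(logits=log_probs)
        action = m.sample()
        # Keep the log-probability for the policy gradient update
        self.policy_network.saved_log_probs.append(m.log_prob(action))

        return action.item()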