# Actor Critic

Policy gradient methods share a disadvantage of Monte Carlo algorithms: a full episode is sampled, and if the episode went well, all of its actions are treated as good, even if there were some bad decisions along the way. Actor-critic methods instead make an update at each step, using a neural network to estimate the value of the action that was taken.

Policy Gradient update
\begin{equation*}
\Delta \theta = \alpha \, \nabla_\theta \log \pi(a_t|s_t,\theta) \, R(t)
\end{equation*}

Actor Critic update
\begin{equation*}
\Delta \theta = \alpha \, \nabla_\theta \log \pi(a_t|s_t,\theta) \, Q(s_t, a_t)
\end{equation*}
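
To make the difference between the two updates concrete, here is a minimal sketch of how the Monte Carlo weight could be computed. It assumes that $R(t)$ denotes the discounted return from step $t$ onward, which the text does not state explicitly; the function name `discounted_returns` and the reward values are purely illustrative.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk backwards so that each return reuses the return of the following step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Toy episode (made-up numbers): one bad step inside an otherwise good episode.
rewards = [1.0, -1.0, 1.0, 1.0]
returns = discounted_returns(rewards, gamma=0.9)
print(returns)
```

In this toy episode every return is positive, including the one for the step with reward -1, so a plain policy gradient update would still reinforce that action. Replacing $R(t)$ with a per-step estimate $Q(s_t, a_t)$ is meant to avoid exactly this.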

In an actor-critic method we have two neural networks: <br>
***actor:*** $\pi(a|s, \theta)$, which decides which action to take <br>
***critic:*** $Q_w(s, a)$, which approximates $Q(s_t, a_t)$ in the equation above (the same idea as in TD learning); its estimate is used to update the weights of the actor <br>

Having two neural networks means that two separate sets of weights have to be updated: <br>
\begin{equation*}
\Delta \theta = \alpha \, \nabla_\theta \log \pi_\theta(s,a) \, Q_w(s,a)
\end{equation*}
With $Q_w(s,a)$ being a function approximation of the action-value function <br>

\begin{equation*}
\Delta w = \beta \left(R(s,a) + \gamma Q_w(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t)\right) \nabla_w Q_w(s_t, a_t)
\end{equation*}
With $\beta$ being a separate learning rate, $R(s,a) + \gamma Q_w(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t)$ being the TD error, and $\nabla_w Q_w(s_t, a_t)$ being the gradient of the value function <br>
Note: this is the on-policy (SARSA-style) TD update rather than the Q-learning update, since there is no max over the next action
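
The two update rules can be sketched for a single transition as follows. This is only an illustration under simplifying assumptions: both the actor and the critic are tables instead of neural networks, so $\nabla_\theta \log \pi_\theta$ for a softmax policy reduces to `onehot(a) - pi` and $\nabla_w Q_w(s_t, a_t)$ is 1 at the entry `(s, a)`. The function name `actor_critic_step`, the `done` flag, and all numbers are hypothetical.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def actor_critic_step(theta, w, s, a, r, s_next, a_next,
                      alpha=0.1, beta=0.5, gamma=0.99, done=False):
    """Apply one actor update and one critic update in place (tabular case)."""
    # Actor: for a tabular softmax policy, the gradient of log pi(a|s)
    # with respect to theta[s] is onehot(a) - pi(.|s).
    pi = softmax(theta[s])
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta[s] += alpha * grad_log_pi * w[s, a]

    # Critic: TD error built from the action a_next that was actually chosen
    # (no max over actions), then a step along the gradient of Q_w(s, a).
    target = r if done else r + gamma * w[s_next, a_next]
    td_error = target - w[s, a]
    w[s, a] += beta * td_error
    return td_error

# Two states, two actions, with a made-up initial estimate Q(0, 1) = 0.5.
theta = np.zeros((2, 2))
w = np.zeros((2, 2))
w[0, 1] = 0.5
td = actor_critic_step(theta, w, s=0, a=1, r=1.0, s_next=1, a_next=0)
print(td, theta[0], w[0, 1])
```

The actor is updated first, using the critic's estimate from before the critic's own update, which matches the order of the two equations above.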

Process:
- At each time step the current state $s_t$ is fed into the actor, which outputs the action $a_t$ <br>
- $a_t$ is fed into the environment, which returns a reward $r_{t+1}$ and the next state $s_{t+1}$
- Given $s_t$ and $a_t$, the critic calculates $Q_w(s_t,a_t)$, with which the actor is updated by $\Delta \theta = \alpha \nabla_\theta \log \pi_\theta(s,a) \, Q_w(s,a)$
- With the updated actor, $a_{t+1}$ is calculated and the critic is updated by $\Delta w = \beta (R(s,a) + \gamma Q_w(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t)) \nabla_w Q_w(s_t, a_t)$
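
The steps above can be sketched as a complete loop. This is a rough illustration, not a reference implementation: the environment is a hypothetical three-state chain (`env_step` is made up for this example), both actor and critic are tables rather than neural networks, and the hyperparameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GOAL = 3, 2, 2

def env_step(s, a):
    """Hypothetical chain: action 1 moves right, action 0 moves left; reaching GOAL pays 1."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done

def policy(theta, s):
    """Softmax over the actor's preferences for state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

theta = np.zeros((N_STATES, N_ACTIONS))  # actor parameters
w = np.zeros((N_STATES, N_ACTIONS))      # critic parameters (tabular Q)
alpha, beta, gamma = 0.1, 0.5, 0.8

for episode in range(1000):
    s = 0
    a = rng.choice(N_ACTIONS, p=policy(theta, s))        # step 1: actor picks a_t
    for _ in range(50):
        s_next, r, done = env_step(s, a)                  # step 2: environment returns r, s_{t+1}
        pi = policy(theta, s)                             # step 3: actor update with Q(s_t, a_t)
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0
        theta[s] += alpha * grad_log_pi * w[s, a]
        a_next = rng.choice(N_ACTIONS, p=policy(theta, s_next))  # step 4: updated actor picks a_{t+1}
        target = r if done else r + gamma * w[s_next, a_next]
        w[s, a] += beta * (target - w[s, a])              # step 4: critic update with the TD error
        if done:
            break
        s, a = s_next, a_next

print(policy(theta, 0), policy(theta, 1))
```

After training, the printed action probabilities for states 0 and 1 should favor action 1 (moving toward the goal), since the critic assigns it the higher value in both states.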

### A2C (Advantage Actor Critic) implementation
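
The implementation in this section differs from the equations above in one respect: its critic outputs a single state value, and the actor is weighted by the advantage $r + \gamma V(s_{t+1}) - V(s_t)$ instead of $Q_w(s_t, a_t)$. The actor is trained with a categorical cross-entropy loss whose target vector holds the advantage at the index of the chosen action. The following check, written in plain NumPy with made-up numbers, illustrates why that loss equals $-A \log \pi(a|s)$. The helper `categorical_crossentropy` here is just the textbook formula and is not Keras's own implementation.

```python
import numpy as np

def categorical_crossentropy(p, q):
    """Textbook cross-entropy H(p, q) = -sum_i p_i * log(q_i)."""
    return -np.sum(p * np.log(q))

policy = np.array([0.7, 0.3])   # made-up actor output for one state
action, advantage = 1, 2.0      # made-up chosen action and advantage

# Target vector: the advantage at the chosen action, zero everywhere else.
target = np.zeros_like(policy)
target[action] = advantage

loss = categorical_crossentropy(target, policy)
print(loss, -advantage * np.log(policy[action]))
```

Both printed values are the same, so minimizing this loss performs gradient ascent on $A \log \pi(a|s)$.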

In [None]:
import numpy as np
np.random.seed(3)

from tensorflow import set_random_seed
set_random_seed(3)

import sys
import gym
import pylab
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import Adam
from collections import deque

reward_list = deque(maxlen=20)
num_games = 300

class A2CAgent:
    def __init__(self, state_shape, action_shape):
        self.state_shape = state_shape
        self.action_shape = action_shape
        self.value_size = 1

        self.discount_factor = 0.99
        self.actor_lr = 0.01
        self.critic_lr = 0.05

        self.actor = self.build_actor()
        self.critic = self.build_critic()


    def build_actor(self):
        actor = Sequential()
        actor.add(Dense(8, input_dim=self.state_shape, activation='relu',
                        kernel_initializer='he_uniform', use_bias=False))
        actor.add(Dense(self.action_shape, activation='softmax',
                        kernel_initializer='he_uniform', use_bias=False))
        
        # Using categorical crossentropy as a loss is a trick to easily
        # implement the policy gradient. Categorical cross entropy is defined
        # as H(p, q) = -sum(p_i * log(q_i)).
        # Here p_a = advantage and q_a is the output of the policy network,
        # i.e. the probability of taking action a, policy(s, a).
        # All other p_i are zero, so H(p, q) = -A * log(policy(s, a));
        # minimizing this loss ascends the gradient of A * log(policy(s, a)).
        actor.compile(loss='categorical_crossentropy',
                      optimizer=Adam(lr=self.actor_lr))
        return actor

    # critic: state is input and value of state is output of model
    def build_critic(self):
        critic = Sequential()
        critic.add(Dense(8, input_dim=self.state_shape, activation='relu',
                         kernel_initializer='he_uniform'))
        critic.add(Dense(self.value_size, activation='linear',
                         kernel_initializer='he_uniform'))
        
        critic.compile(loss="mse", optimizer=Adam(lr=self.critic_lr))
        return critic

    def get_action(self, state):
        policy = self.actor.predict(state, batch_size=1).flatten()
        return np.random.choice(self.action_shape, 1, p=policy)[0]

    # update actor and critic after every environment step
    def train_model(self, state, action, reward, next_state, done):
        target = np.zeros((1, self.value_size))
        advantages = np.zeros((1, self.action_shape))

        # predict() returns an array of shape (1, 1); index down to a scalar
        value = self.critic.predict(state)[0][0]
        next_value = self.critic.predict(next_state)[0][0]

        if done:
            advantages[0][action] = reward - value
            target[0][0] = reward
        else:
            advantages[0][action] = reward + self.discount_factor * (next_value) - value
            target[0][0] = reward + self.discount_factor * next_value

        self.actor.train_on_batch(state, advantages)
        self.critic.train_on_batch(state, target)


env = gym.make('CartPole-v0')
state_shape = env.observation_space.shape[0]
action_shape = env.action_space.n
agent = A2CAgent(state_shape, action_shape)

for game in range(num_games):
    done = False
    state = env.reset()
    state = np.reshape(state, [1, state_shape])
    episode_reward = 0

    while not done:
        action = agent.get_action(state)
        next_state, reward, done, _ = env.step(action)
        episode_reward += reward
        next_state = np.reshape(next_state, [1, state_shape])
        agent.train_model(state, action, reward, next_state, done)
        state = next_state
        
        if done:
            reward_list.append(episode_reward)
            if (game + 1) % 20 == 0:
                print("episode:", game+1, "  score:", np.mean(reward_list))
                