# Advanced Actor-Critic Methods: A3C and A2C

In this notebook, we explore two advanced policy-gradient algorithms in reinforcement learning: **Asynchronous Advantage Actor-Critic (A3C)** and **Advantage Actor-Critic (A2C)**.

These algorithms extend the basic Actor-Critic method by running several agents in parallel, which decorrelates the training data and makes policy updates more stable.

## 1. Introduction

Actor-Critic methods combine the benefits of policy-based and value-based learning. The **actor** learns the policy, while the **critic** estimates the value function that is used to judge the actor's choices.

- **A3C (Asynchronous Advantage Actor-Critic):** Runs multiple agents in parallel on different threads, each interacting with its own copy of the environment.
- **A2C (Advantage Actor-Critic):** A synchronous version of A3C that waits for all agents to finish their rollouts, averages their gradients, and then applies a single update to the global model.
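
The synchronous averaging step can be sketched on a toy problem. This is an illustration only: a scalar parameter `theta` and a squared-error loss stand in for the network and the policy-gradient loss, and `worker_gradient` and `synchronous_update` are hypothetical helper names, not part of any library.

```python
import numpy as np

def worker_gradient(theta, batch):
    """Gradient of the toy loss 0.5 * mean((theta - x)^2) on one worker's batch."""
    return np.mean(theta - batch)

def synchronous_update(theta, batches, lr=0.1):
    """A2C-style step: collect every worker's gradient, average, update once."""
    grads = [worker_gradient(theta, batch) for batch in batches]
    return theta - lr * np.mean(grads)

# Four hypothetical workers, each with its own batch of data.
rng = np.random.default_rng(0)
batches = [rng.normal(loc=3.0, scale=1.0, size=32) for _ in range(4)]

theta = 0.0
for _ in range(200):
    theta = synchronous_update(theta, batches)

print(theta, np.mean(np.concatenate(batches)))  # the two values should agree closely
```

Every gradient in a step is computed from the same `theta`, so no update is based on stale parameters. That is the main practical difference from the asynchronous variant.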

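Both use the *advantage function* to stabilize training. It measures how much better the observed return was than the critic's estimate for the same state:

$$ A(s_t, a_t) = R_t - V(s_t) $$

Here $R_t = \sum_{k \ge 0} \gamma^k r_{t+k}$ is the discounted return from step $t$, and $V(s_t)$ is the critic's value estimate.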
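
As a small worked example, assume $R_t$ is the discounted return defined by $R_t = r_t + \gamma R_{t+1}$ (the definition the code later in this notebook uses). The numbers are illustrative: $\gamma = 0.5$, three rewards of $1$ before the episode ends, and a critic estimate $V(s_0) = 1.5$.

$$ R_2 = 1, \qquad R_1 = 1 + 0.5 \cdot 1 = 1.5, \qquad R_0 = 1 + 0.5 \cdot 1.5 = 1.75 $$

$$ A(s_0, a_0) = R_0 - V(s_0) = 1.75 - 1.5 = 0.25 $$

The advantage is positive, so the return was better than the critic expected and the update raises the probability of $a_0$.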

In [None]:
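# The reset()/step() calls below assume gym>=0.26; `import gymnasium as gym` exposes the same API.
import gym
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

env = gym.make('CartPole-v1')
num_actions = env.action_space.n             # 2 discrete actions: push left / push right
obs_space = env.observation_space.shape[0]   # 4-dimensional observation vector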

## 2. Building the Actor-Critic Network

We’ll define a shared network backbone with two heads:
- **Actor head** for policy (action probabilities)
- **Critic head** for state value estimation

In [None]:
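def build_actor_critic(num_actions, obs_dim=obs_space):
    """Build a two-headed network: action probabilities and a scalar state value."""
    inputs = layers.Input(shape=(obs_dim,))
    # Shared hidden layer used by both heads.
    common = layers.Dense(128, activation='relu')(inputs)
    # Actor head: softmax over the discrete actions.
    actor = layers.Dense(num_actions, activation='softmax')(common)
    # Critic head: unbounded scalar estimate of V(s).
    critic = layers.Dense(1)(common)
    model = tf.keras.Model(inputs=inputs, outputs=[actor, critic])
    return model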

model = build_actor_critic(num_actions)
model.summary()
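
To make the two heads concrete, here is a rough NumPy sketch of the same kind of forward pass with randomly initialized weights. It is not the Keras model above: `softmax`, `forward`, and the `params` dictionary are names assumed for illustration, and the sizes (4 observation values, 2 actions) are those of CartPole.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(obs, params):
    """Shared ReLU layer followed by an actor head and a critic head."""
    hidden = np.maximum(0.0, obs @ params["W_common"] + params["b_common"])
    probs = softmax(hidden @ params["W_actor"] + params["b_actor"])
    value = hidden @ params["W_critic"] + params["b_critic"]
    return probs, value

rng = np.random.default_rng(0)
obs_dim, hidden_dim, n_actions = 4, 128, 2
params = {
    "W_common": rng.normal(scale=0.1, size=(obs_dim, hidden_dim)),
    "b_common": np.zeros(hidden_dim),
    "W_actor": rng.normal(scale=0.1, size=(hidden_dim, n_actions)),
    "b_actor": np.zeros(n_actions),
    "W_critic": rng.normal(scale=0.1, size=(hidden_dim, 1)),
    "b_critic": np.zeros(1),
}

obs_batch = rng.normal(size=(5, obs_dim))   # a batch of 5 made-up observations
probs, value = forward(obs_batch, params)
print(probs.shape, value.shape)             # (5, 2) (5, 1)
```

Both heads read the same `hidden` activations, so a gradient from either loss updates the shared layer.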

## 3. Advantage Calculation
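We compute the *advantage* by subtracting the critic's value estimate from the discounted return. Weighting the policy gradient by this difference instead of the raw return reduces its variance.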

In [None]:
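def compute_advantage(rewards, values, gamma=0.99):
    """Return discounted returns and advantages (returns - values) for one finished episode."""
    returns = []
    discounted_sum = 0.0
    # Walk backwards so each return reuses the next one: R_t = r_t + gamma * R_{t+1}.
    for r in rewards[::-1]:
        discounted_sum = r + gamma * discounted_sum
        returns.append(discounted_sum)
    # Restore chronological order; float32 matches the network's output dtype.
    returns = np.array(returns[::-1], dtype=np.float32)
    advantage = returns - np.array(values, dtype=np.float32)
    return returns, advantage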
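
The function above assumes the episode has finished, so the return after the last reward is zero. When a rollout is cut off before the episode ends, a common variant seeds the backward recursion with the critic's estimate for the next state instead. The sketch below illustrates that idea; `discounted_returns` and its `bootstrap_value` parameter are hypothetical names introduced here, not part of the notebook's code.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99, bootstrap_value=0.0):
    """Backward recursion R_t = r_t + gamma * R_{t+1}, seeded with bootstrap_value."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [1.0, 1.0, 1.0]

# Finished episode: nothing follows the last reward.
finished = discounted_returns(rewards, gamma=0.5)
# Cut-off rollout: assume the critic values the next state at 4.0.
cut_off = discounted_returns(rewards, gamma=0.5, bootstrap_value=4.0)

print(finished)  # values 1.75, 1.5, 1.0
print(cut_off)   # values 2.25, 2.5, 3.0
```

With `bootstrap_value=0.0` the result matches the `returns` computed by `compute_advantage` for the same rewards and `gamma`.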

## 4. A2C Training Loop (Simplified Example)
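This simplified example runs a single environment: it collects one full episode, computes returns and advantages for it, and then applies one gradient update. A full A2C setup would do the same over several environments and average their gradients.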

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

def train_a2c(model, env, episodes=500, gamma=0.99):
    for ep in range(episodes):
        state = env.reset()[0]
        done = False
        states, actions, rewards, values = [], [], [], []

        while not done:
            state_tensor = tf.convert_to_tensor(state[None, :], dtype=tf.float32)
            probs, value = model(state_tensor)
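            # Cast to float64 and renormalize so np.random.choice accepts the float32 softmax output.
            p = probs.numpy()[0].astype(np.float64)
            action = np.random.choice(num_actions, p=p / p.sum())
            # End the episode on either termination or time-limit truncation.
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated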

            states.append(state)
            actions.append(action)
            rewards.append(reward)
            values.append(value.numpy()[0, 0])
            state = next_state

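        # Pass gamma through so the function argument is actually used.
        returns, advantage = compute_advantage(rewards, values, gamma)
        # Fix dtypes up front so float64 NumPy arrays never mix with float32 tensors.
        returns = tf.convert_to_tensor(returns, dtype=tf.float32)
        advantage = tf.convert_to_tensor(advantage, dtype=tf.float32)

        with tf.GradientTape() as tape:
            actor_probs, critic_values = model(tf.convert_to_tensor(np.array(states), dtype=tf.float32))
            # Critic regresses onto the observed returns; squeeze only the trailing axis.
            critic_loss = tf.keras.losses.MSE(returns, tf.squeeze(critic_values, axis=1))

            # Log-probability of the action actually taken at each step.
            action_masks = tf.one_hot(actions, num_actions)
            log_probs = tf.reduce_sum(action_masks * tf.math.log(actor_probs + 1e-8), axis=1)
            # The advantage is a constant here, so the actor loss does not backpropagate into the critic head.
            actor_loss = -tf.reduce_mean(log_probs * advantage)
            total_loss = actor_loss + critic_loss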

        grads = tape.gradient(total_loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        if ep % 10 == 0:
            print(f"Episode {ep}, Total Reward: {np.sum(rewards):.2f}")

# Uncomment to train
# train_a2c(model, env, episodes=200)
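
To see what the two loss terms evaluate to, the sketch below recomputes them in plain NumPy on a hand-made two-step batch. It mirrors the formulas in `train_a2c` but is not the training code itself: `a2c_losses` is a hypothetical helper and all the numbers are invented for illustration.

```python
import numpy as np

def a2c_losses(probs, values, actions, returns):
    """Actor and critic losses for one batch, with the advantage treated as a constant."""
    advantage = returns - values
    # Probability the policy assigned to the action that was actually taken.
    chosen = probs[np.arange(len(actions)), actions]
    log_probs = np.log(chosen + 1e-8)
    actor_loss = -np.mean(log_probs * advantage)
    critic_loss = np.mean((returns - values) ** 2)
    return actor_loss, critic_loss

probs = np.array([[0.5, 0.5], [0.25, 0.75]])  # policy output per step
values = np.array([0.5, 2.5])                 # critic output per step
actions = np.array([0, 1])                    # actions taken
returns = np.array([1.0, 2.0])                # discounted returns

actor_loss, critic_loss = a2c_losses(probs, values, actions, returns)
print(actor_loss, critic_loss)
```

In the first step the advantage is positive (0.5), so minimizing the actor loss raises the probability of action 0. In the second it is negative (-0.5), so the probability of action 1 is pushed down.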

## 5. A3C Overview
A3C follows the same logic but uses **asynchronous parallel agents**: each runs in its own thread, computes gradients from its own rollouts, and sends them to a global network without waiting for the other agents.

**Advantages of A3C:**
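- Faster training in wall-clock time, because agents collect experience in parallel
- Better exploration, because agents in separate environment copies visit different states
- Reduced correlation in updates, because gradients come from different trajectories rather than consecutive steps of a single one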

Implementation typically involves Python’s `threading` or `multiprocessing` along with shared global models.
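
A minimal sketch of the asynchronous pattern with `threading` is shown below. It is a toy under stated assumptions, not an A3C implementation: a scalar parameter and a squared-error loss stand in for the global network and the actor-critic loss, `SharedParams` and `worker` are hypothetical names, and a lock guards each update purely for simplicity.

```python
import threading
import numpy as np

class SharedParams:
    """Toy stand-in for the global network: one scalar guarded by a lock."""
    def __init__(self, value):
        self.value = value
        self.lock = threading.Lock()

    def apply_gradient(self, grad, lr):
        with self.lock:
            self.value -= lr * grad

def worker(shared, data, steps, lr):
    """Each worker reads the global value, computes a gradient on its own data, and pushes it."""
    for _ in range(steps):
        local = shared.value              # snapshot; may be stale by the time it is applied
        grad = np.mean(local - data)      # gradient of 0.5 * mean((local - x)^2)
        shared.apply_gradient(grad, lr)

rng = np.random.default_rng(0)
shared = SharedParams(0.0)
datasets = [rng.normal(loc=3.0, scale=1.0, size=32) for _ in range(4)]

threads = [threading.Thread(target=worker, args=(shared, d, 200, 0.05)) for d in datasets]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared.value)  # ends near 3.0; the exact value depends on thread scheduling
```

Unlike the synchronous case, no worker waits for the others, so the final value can vary slightly from run to run.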

## 6. Summary
- A2C is the synchronous version of A3C.
- Both use the advantage function for stability.
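- Actor-Critic methods balance exploration and exploitation: the actor samples actions from a stochastic policy, and the critic's feedback shifts that policy toward higher-return actions.
- A3C speeds up training by applying asynchronous updates from multiple agents.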

**Next Step:** Try implementing A3C using multiprocessing and compare it with the synchronous A2C.