# What is the Actor-Critic Algorithm

The actor-critic algorithm is a type of reinforcement learning algorithm that combines aspects of both **policy-based methods (Actor)** and **value-based methods (Critic)**. This hybrid approach is designed to address the limitations of each method when used individually. 

In the actor-critic framework, an agent (the "actor") learns a policy to make decisions, and a value function (the "critic") evaluates the actions taken by the actor. 

The two components are trained together: the actor proposes actions, and the critic's value estimates tell the actor how good those actions turned out to be. This lets the method combine the direct policy optimization of policy-based methods with the lower-variance learning signal of value-based methods.

# Roles of Actor and Critic

* Actor: The actor makes decisions by selecting actions based on the current policy. Its responsibility lies in exploring the action space to maximize expected cumulative rewards. By continuously refining the policy, the actor adapts to the dynamic nature of the environment.

* Critic: The critic evaluates the actions taken by the actor. It estimates the value or quality of these actions and feeds that estimate back to the actor. This feedback guides the actor towards actions that lead to higher expected returns.

# Key Terms in Actor Critic Algorithm

* Policy (Actor):
  * The policy, denoted as $\pi(a|s)$, represents the probability of taking action **a** in state **s**.
  * The actor seeks to maximize the expected return by optimizing this policy.
  * The policy is modeled by the actor network, and its parameters are denoted by $\theta$.
 
* Value Function (Critic):
  * The value function, denoted as $V(s)$, estimates the expected cumulative reward starting from state **s**.
  * The value function is modeled by the critic network, and its parameters are denoted by **w**.
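
As a concrete illustration of these two objects, the sketch below uses a tabular setup, which is an assumption made here for brevity (the training code later in this article uses neural networks instead). The state names, the parameter values, and the helper `softmax_policy` are hypothetical.

```python
import math

def softmax_policy(preferences):
    """Turn per-action preferences (logits) into probabilities pi(a|s)."""
    m = max(preferences)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical tabular parameters for a 2-state, 2-action problem.
theta = {"s0": [0.0, 0.0], "s1": [1.0, -1.0]}  # actor parameters
w = {"s0": 0.5, "s1": -0.2}                    # critic parameters, V(s) = w[s]

pi_s0 = softmax_policy(theta["s0"])  # equal preferences give a uniform policy
pi_s1 = softmax_policy(theta["s1"])  # action 0 is preferred in state s1
print(pi_s0, pi_s1, w["s0"])
```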
 

# How the Actor-Critic Algorithm Works

## Actor Critic Algorithm Objective Function
* The Actor-Critic algorithm optimizes two objectives: a policy-gradient objective for the actor and a value-function loss for the critic.
* The two are trained together, either with separate optimizers or by summing them into a single loss. Their gradients are:

**Policy Gradient (Actor)**

$$\nabla_{\theta}J(\theta)\approx \frac{1}{N}\sum_{i=1}^N \nabla_{\theta}\log\pi_{\theta}(a_i|s_i)\cdot A(s_i, a_i)$$

Here,
* $J(\theta)$ represents the expected return under the policy parameterized by $\theta$
* $\pi_{\theta}(a|s)$ is the policy function
* N is the number of sampled experiences
* $A(s,a)$ is the advantage function, representing the advantage of taking action a in state s
* $i$ represents the index of the sample
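
A minimal sketch of this estimate, assuming a softmax policy over two actions and samples that all come from a single state (both are simplifications chosen here, not part of the formula). For a softmax policy, the gradient of $\log\pi_{\theta}(a|s)$ with respect to the preference of action $b$ is $1[a=b]-\pi_{\theta}(b|s)$. The function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(logits, action):
    """Gradient of log pi(action) with respect to each logit."""
    probs = softmax(logits)
    return [(1.0 if b == action else 0.0) - p for b, p in enumerate(probs)]

def policy_gradient_estimate(logits, samples):
    """Average grad log pi(a_i) * A_i over (action, advantage) samples."""
    n = len(samples)
    grad = [0.0] * len(logits)
    for action, advantage in samples:
        g = grad_log_softmax(logits, action)
        for b in range(len(logits)):
            grad[b] += g[b] * advantage / n
    return grad

logits = [0.0, 0.0]
samples = [(0, 1.0), (1, -1.0)]  # action 0 was better than average, action 1 worse
grad = policy_gradient_estimate(logits, samples)
print(grad)  # positive for action 0, negative for action 1
```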


**Value Function Update (Critic)**

$$\nabla_{w}J(w)\approx\frac{1}{N}\sum_{i=1}^N \nabla_{w} (V_w(s_i)-Q_w(s_i,a_i))^2$$

Here,
* $\nabla_{w}J(w)$ is the gradient of the loss function with respect to the critic's parameters $w$
* N is the number of samples
* $V_w(s_i)$ is the critic's estimate of the value of state $s_i$ under parameters $w$
* $Q_w(s_i, a_i)$ is the critic's estimate of the action value of taking action $a_i$ in state $s_i$
* $i$ represents the index of the sample
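
The sketch below evaluates this loss and its gradient for a tabular critic, where $V_w(s)$ is simply `w[s]`. It assumes that a fixed numeric target stands in for $Q_w(s_i, a_i)$; that substitution, the state names, and the function name are assumptions made for illustration.

```python
def critic_loss_and_grad(w, batch):
    """Mean squared error between V_w(s_i) and a target, plus its gradient.

    batch is a list of (state, target) pairs; the target plays the role of
    the action-value term in the formula above.
    """
    n = len(batch)
    loss = 0.0
    grad = {s: 0.0 for s in w}
    for state, target in batch:
        error = w[state] - target
        loss += error ** 2 / n
        grad[state] += 2.0 * error / n  # d/dw[s] of (w[s] - target)^2 / n
    return loss, grad

w = {"s0": 0.0, "s1": 1.0}
batch = [("s0", 1.0), ("s1", 1.0)]
loss, grad = critic_loss_and_grad(w, batch)
print(loss, grad)  # only s0 has an error, so only its gradient is non-zero
```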


## Update Rules
The update rules for the actor and critic involve adjusting their respective parameters using gradient ascent (for the actor) and gradient descent (for the critic).

**Actor Update**

$$\theta_{t+1}=\theta_t + \alpha\nabla_{\theta}J(\theta_{t})$$

Here,
* $\alpha$: learning rate for the actor
* t is the time step within an episode

**Critic Update**

$$w_{t+1} = w_t - \beta \nabla_{w}J(w_t)$$

Here,
* w represents the parameters of the critic network
* $\beta$ is the learning rate for the critic
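
As a small sketch of the two rules, assuming the parameters and gradients are plain lists of floats (the function names and the numbers are hypothetical):

```python
def actor_update(theta, grad_j, alpha):
    """Gradient ascent: move theta along the gradient of the expected return."""
    return [t + alpha * g for t, g in zip(theta, grad_j)]

def critic_update(w, grad_loss, beta):
    """Gradient descent: move w against the gradient of the critic loss."""
    return [wi - beta * g for wi, g in zip(w, grad_loss)]

theta = actor_update([0.0, 0.0], [0.5, -0.5], alpha=0.1)
w = critic_update([0.0, 1.0], [-1.0, 0.0], beta=0.1)
print(theta, w)
```

Note the opposite signs: the actor adds its scaled gradient because it maximizes return, while the critic subtracts its scaled gradient because it minimizes a loss.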

## Advantage Function

The advantage function, $A(s,a)$, measures the advantage of taking action a in state s over the expected value of the state under the current policy.

$$A(s,a) = Q(s,a) - V(s)$$

The advantage function, then, provides a measure of how much better or worse an action is compared to the average action. The actor is updated based on the policy gradient, encouraging actions with higher advantages, while the critic is updated to minimize the difference between the estimated value and the action-value.
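
In practice $Q(s,a)$ is rarely learned as a separate function. A common estimate, and the one used in the training code later in this article, is the one-step temporal-difference (TD) error. The short derivation below assumes a discount factor $\gamma$ (not introduced above) and writes $r$ for the reward and $s'$ for the next state:

$$Q(s,a) = \mathbb{E}\left[r + \gamma V(s') \mid s, a\right]$$

$$\delta = r + \gamma V(s') - V(s)$$

$$\mathbb{E}\left[\delta \mid s, a\right] = Q(s,a) - V(s) = A(s,a)$$

So $\delta$ is an unbiased sample of the advantage when $V$ is the true value function of the current policy. With a learned critic it is only an approximation.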


# A2C (Advantage Actor-Critic)

A2C is a specific variant of the actor-critic algorithm that introduces the concept of the **advantage function**. This function measures how much better an action is compared to the average action in a given state. By incorporating this advantage information, A2C focuses the learning process on actions that have a significantly higher value than the typical action taken in that state.

While both leverage the actor-critic architecture, here's a key distinction between them:
* Learning from the Average: The base Actor Critic method uses the difference between the actual reward and the estimated value (critic's evaluation) to update the actor.
* Learning from the Advantage: A2C leverages the advantage function, incorporating the difference between the action's value and the average value of actions in that state. This additional information refines the learning process further.
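
To make "better than the average action" concrete, here is a small worked example with made-up numbers for a single state with two actions. The probabilities and action values are hypothetical.

```python
# Hypothetical numbers for a single state with two actions.
pi = [0.25, 0.75]   # policy probabilities pi(a|s)
q = [2.0, 4.0]      # action values Q(s, a)

# V(s) is the policy-weighted average of the action values.
v = sum(p * qa for p, qa in zip(pi, q))

# A(s, a) = Q(s, a) - V(s)
advantages = [qa - v for qa in q]

# Under the policy that defines V, the expected advantage is zero.
expected_advantage = sum(p * a for p, a in zip(pi, advantages))

print(v, advantages, expected_advantage)  # 3.5 [-1.5, 0.5] 0.0
```

Action 1 has a positive advantage, so the actor's update raises its probability; action 0 has a negative advantage, so its probability is lowered.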

## Actor-Critic Algorithm Steps

The Actor-Critic algorithm combines these mathematical principles into a coherent learning framework. The algorithm involves:
1. Initialization: Initialize the policy parameters $\theta$ (actor) and the value function parameters $w$ (critic).
2. Interaction with the Environment: The agent interacts with the environment by taking actions according to the current policy and receiving observations and rewards in return.
3. Advantage Computation: Compute the advantage function $A(s,a)$ based on the current policy and value estimates.
4. Policy and Value Updates:
   *  Update the actor's parameters $\theta$ using the policy gradient. The policy gradient is derived from the advantage function and guides the actor to increase the probabilities of actions that lead to higher advantages.
   *  At the same step, update the critic's parameters $w$ using a value-based method. This often involves minimizing the temporal difference (TD) error, which is the difference between the bootstrapped target (the observed reward plus the discounted value of the next state) and the predicted value of the current state.
  
The actor learns a policy, and the critic evaluates the action taken by the actor. The actor is updated using the policy gradient, and the critic is updated using a value-based method. This combination allows for more stable and efficient learning in complex environments.
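
The four steps above can be sketched on a deliberately tiny problem. Everything specific here is an assumption made for illustration: a single state, two actions, a reward of 1 for action 1 and 0 for action 0, and episodes that end after one step (so the bootstrapped term of the TD target is zero). The function names and hyperparameters are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(num_episodes=2000, alpha=0.1, beta=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]   # 1. actor parameters (action preferences)
    v = 0.0              # 1. critic parameter (value of the single state)
    for _ in range(num_episodes):
        # 2. Interact: sample an action from the current policy.
        probs = softmax(theta)
        action = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if action == 1 else 0.0  # toy environment (assumption)

        # 3. Advantage: the episode ends after one step, so the
        #    bootstrapped term gamma * V(s') is zero.
        advantage = reward - v

        # 4a. Actor: gradient ascent on log pi(a|s) * advantage.
        for b in range(len(theta)):
            grad_log_pi = (1.0 if b == action else 0.0) - probs[b]
            theta[b] += alpha * advantage * grad_log_pi

        # 4b. Critic: gradient descent on 0.5 * advantage ** 2 w.r.t. v.
        v += beta * advantage
    return softmax(theta), v

probs, value = train()
print(probs, value)
```

After training, the policy should put most of its probability on action 1, and the critic's value should approach the reward the policy now typically receives.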

# Training Agent: Actor-Critic Algorithm

# import libraries
import numpy as np
import tensorflow as tf
import gymnasium as gym

# Creating CartPole environment

env = gym.make('CartPole-v1')


# Defining Actor and Critic Networks
# Actor and the Critic are implemented as neural networks using TensorFlow's Keras API
# Actor network maps the state to a probability distribution over actions.
# Critic network estimates the state's value

actor = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(env.action_space.n, activation='softmax')
])

critic = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)
])

# Define the optimizers for the actor and the critic

actor_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
critic_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# Main training loop
num_episodes = 50
gamma = 0.99

for episode in range(num_episodes):
    obs, info = env.reset() # for each episode, it resets the environment and initializes the episode reward to 0
    state = obs.flatten()
    episode_reward = 0

    for t in range(1, 10000):  # Limit the number of time steps
        # Record this step's forward passes so the losses can be differentiated
        with tf.GradientTape(persistent=True) as tape:
            # Choose an action by sampling from the actor's output probabilities
            action_probs = actor(np.array([state]))
            action = np.random.choice(env.action_space.n, p=action_probs.numpy()[0])

            # Take the chosen action and observe the next state and reward
            next_state, reward, done, truncated, info = env.step(action)

            # Compute the advantage as the one-step TD error:
            # reward + gamma * V(next_state) - V(state).
            # The target is treated as a constant, and V(next_state) is not
            # bootstrapped when the episode has terminated.
            state_value = critic(np.array([state]))[0, 0]
            next_state_value = critic(np.array([next_state]))[0, 0]
            target = reward + gamma * next_state_value * (1.0 - float(done))
            advantage = tf.stop_gradient(target) - state_value

            # Compute actor and critic losses based on the advantage
            actor_loss = -tf.math.log(action_probs[0, action]) * tf.stop_gradient(advantage)
            critic_loss = tf.square(advantage)

        # Compute gradients with tape.gradient, then apply them
        # to the actor and critic networks using the respective optimizers
        actor_gradients = tape.gradient(actor_loss, actor.trainable_variables)
        critic_gradients = tape.gradient(critic_loss, critic.trainable_variables)
        actor_optimizer.apply_gradients(zip(actor_gradients, actor.trainable_variables))
        critic_optimizer.apply_gradients(zip(critic_gradients, critic.trainable_variables))
        del tape  # release the persistent tape

        episode_reward += reward
        state = next_state  # move on to the next state

        # Stop when the episode terminates or is truncated by the time limit
        if done or truncated:
            break
    # every 10 episodes, the current episode number and reward are printed
    if episode % 10 == 0:
        print(f"Episode {episode}, Reward: {episode_reward}")

env.close()


Episode 0, Reward: 24.0
Episode 10, Reward: 27.0
Episode 20, Reward: 27.0
Episode 30, Reward: 19.0
Episode 40, Reward: 17.0


# Advantages of Actor Critic Algorithm

1. **Improved Sample Efficiency**: The critic's value estimates reduce the variance of the policy-gradient signal, so actor-critic methods often need fewer interactions with the environment than pure policy-gradient methods to reach good performance.

2. **Faster Convergence**: The method's ability to update both the policy and value function concurrently contributes to faster convergence during training, enabling quicker adaptation to the learning task.

3. **Versatility Across Action Spaces**: Actor-Critic architectures can seamlessly handle both discrete and continuous action spaces, offering flexibility in addressing a wide range of RL problems.

4. **Off-Policy Learning (in some variants)**: Some variants can learn from past experiences, even when those experiences were not collected by the current policy.

# Advantage Actor Critic (A2C) vs. Asynchronous Advantage Actor Critic (A3C)

Asynchronous Advantage Actor-Critic (A3C) builds upon A2C by introducing parallelism.
In A2C, a **single actor-critic pair** interacts with the environment and updates its policy based on the experiences it gathers. However, A3C utilizes **multiple actor-critic pairs** operating simultaneously. Each pair interacts with a separate copy of the environment, collecting data independently. These experiences are then used to update a global actor-critic network.

Imagine **training multiple agents simultaneously**, **each exploring a separate world**. That's the core idea behind A3C (Asynchronous Advantage Actor-Critic). These agents, called "workers," independently learn from their experiences and update a shared global network. This parallel approach allows A3C to explore the environment much faster than a single agent, leading to quicker learning.

A2C (Advantage Actor-Critic) is like A3C's simpler cousin. It uses the same core concept of actor-critic with an advantage function, but without the parallel workers. While A2C explores the environment less extensively, studies have shown it can achieve similar performance to A3C while being easier to implement and requiring less computational power.
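
The push/pull pattern that A3C workers use can be illustrated with a toy, sequential simulation. This is not A3C itself: a quadratic loss stands in for the actor-critic loss, the workers take turns instead of running in parallel processes, and all names and numbers are hypothetical. Each worker computes a gradient on its own (possibly stale) copy of the parameter, pushes that gradient to the global parameter, then pulls the fresh value.

```python
def local_gradient(x, target):
    """Gradient of the toy loss (x - target)^2 seen by one worker."""
    return 2.0 * (x - target)

def run_async_style(targets, lr=0.05, rounds=300):
    global_x = 0.0
    local_x = [global_x for _ in targets]  # each worker's last pulled copy
    for _ in range(rounds):
        for k, target in enumerate(targets):
            grad = local_gradient(local_x[k], target)  # uses the stale copy
            global_x -= lr * grad                      # push gradient to global
            local_x[k] = global_x                      # pull global parameter
    return global_x

def run_single_learner(targets, lr=0.05, rounds=300):
    x = 0.0
    for _ in range(rounds):
        # One learner that averages the gradient over all the data.
        grad = sum(local_gradient(x, t) for t in targets) / len(targets)
        x -= lr * grad
    return x

targets = [1.0, 2.0, 3.0]  # each worker sees different data; the joint optimum is 2.0
x_async = run_async_style(targets)
x_single = run_single_learner(targets)
print(x_async, x_single)
```

In this toy setting both schedules end up near the joint optimum, while the asynchronous one keeps a small residual error because each gradient is computed on a slightly outdated copy of the parameter.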


## RL (A3C) using PyTorch + multiprocessing

# utils

from torch import nn
import torch
import numpy as np

def v_wrap(np_array, dtype=np.float32):
    if np_array.dtype != dtype:
        np_array = np_array.astype(dtype)
    return torch.from_numpy(np_array)


def set_init(layers):
    for layer in layers:
        nn.init.normal_(layer.weight, mean=0., std=0.1)
        nn.init.constant_(layer.bias, 0.)


def push_and_pull(opt, lnet, gnet, done, s_, bs, ba, br, gamma):
    if done:
        v_s_ = 0.               # terminal
    else:
        v_s_ = lnet.forward(v_wrap(s_[None, :]))[-1].data.numpy()[0, 0]

    buffer_v_target = []
    for r in br[::-1]:    # reverse buffer r
        v_s_ = r + gamma * v_s_
        buffer_v_target.append(v_s_)
    buffer_v_target.reverse()

    loss = lnet.loss_function(
        v_wrap(np.vstack(bs)),
        v_wrap(np.array(ba), dtype=np.int64) if ba[0].dtype == np.int64 else v_wrap(np.vstack(ba)),
        v_wrap(np.array(buffer_v_target)[:, None]))

    # calculate local gradients and push local parameters to global
    opt.zero_grad()
    loss.backward()
    for lp, gp in zip(lnet.parameters(), gnet.parameters()):
        gp._grad = lp.grad
    opt.step()

    # pull global parameters
    lnet.load_state_dict(gnet.state_dict())


def record(global_ep, global_ep_r, ep_r, res_queue, name):
    with global_ep.get_lock():
        global_ep.value += 1
    with global_ep_r.get_lock():
        if global_ep_r.value == 0.:
            global_ep_r.value = ep_r
        else:
            global_ep_r.value = global_ep_r.value * 0.99 + ep_r * 0.01
    res_queue.put(global_ep_r.value)
    print(
        name,
        "Ep:", global_ep.value,
        "| Ep_r: %.0f" % global_ep_r.value,
    )


class SharedAdam(torch.optim.Adam):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.99), eps=1e-8,
                 weight_decay=0):
        super(SharedAdam, self).__init__(params, lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        # State initialization
        for group in self.param_groups:
            for p in group['params']:
                state = self.state[p]
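                # Recent PyTorch versions expect the step counter to be a tensor
                state['step'] = torch.tensor(0.)
                state['exp_avg'] = torch.zeros_like(p.data)
                state['exp_avg_sq'] = torch.zeros_like(p.data)

                # share in memory
                state['step'].share_memory_()
                state['exp_avg'].share_memory_()
                state['exp_avg_sq'].share_memory_()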

import torch
import torch.nn as nn
# from utils import v_wrap, set_init, push_and_pull, record
import torch.nn.functional as F
import torch.multiprocessing as mp
# from shared_adam import SharedAdam
import gymnasium as gym
import math, os

os.environ["OMP_NUM_THREADS"] = "1"

UPDATE_GLOBAL_ITER = 5
GAMMA = 0.9
MAX_EP = 300
MAX_EP_STEP = 20

env = gym.make('Pendulum-v1')
N_S = env.observation_space.shape[0]
N_A = env.action_space.shape[0]

class Net(nn.Module):
    def __init__(self, s_dim, a_dim):
        super(Net, self).__init__()
        self.s_dim = s_dim
        self.a_dim = a_dim
        self.a1 = nn.Linear(s_dim, 200)
        self.mu = nn.Linear(200, a_dim)
        self.sigma = nn.Linear(200, a_dim)
        self.c1 = nn.Linear(s_dim, 100)
        self.v = nn.Linear(100, 1)
        set_init([self.a1, self.mu, self.sigma, self.c1, self.v])
        self.distribution = torch.distributions.Normal

    def forward(self, x):
        a1 = F.relu6(self.a1(x))
        mu = 2 * torch.tanh(self.mu(a1))  # F.tanh is deprecated in favor of torch.tanh
        sigma = F.softplus(self.sigma(a1)) + 0.001 
        c1 = F.relu6(self.c1(x))
        values = self.v(c1)
        return mu, sigma, values

    def choose_action(self, s):
        self.eval()
        mu, sigma, _ = self.forward(s)
        m = self.distribution(mu.view(1, ).data, sigma.view(1, ).data)
        return m.sample().numpy()

    def loss_function(self, s, a, v_t):
        self.train()
        mu, sigma, values = self.forward(s)
        td = v_t - values
        c_loss = td.pow(2)

        m = self.distribution(mu, sigma)
        log_prob = m.log_prob(a)
        entropy = 0.5 + 0.5 * math.log(2 * math.pi) + torch.log(m.scale) # exploration
        exp_v = log_prob * td.detach() + 0.005 * entropy
        a_loss = -exp_v
        total_loss = (a_loss + c_loss).mean()
        return total_loss

class Worker(mp.Process):
    def __init__(self, gnet, opt, global_ep, global_ep_r, res_queue, name):
        super(Worker, self).__init__()
        self.name = 'w%i' % name
        self.g_ep, self.g_ep_r, self.res_queue = global_ep, global_ep_r, res_queue
        self.gnet, self.opt = gnet, opt
        self.lnet = Net(N_S, N_A) # local network
        self.env = gym.make('Pendulum-v1').unwrapped

    def run(self):
        total_step = 1
        while self.g_ep.value < MAX_EP:
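            s, _ = self.env.reset()  # Gymnasium's reset() returns (observation, info)
            buffer_s, buffer_a, buffer_r = [], [], []
            ep_r = 0.
            for t in range(MAX_EP_STEP):
                # Rendering is skipped: the environment is created without a render_mode
                a = self.lnet.choose_action(v_wrap(s[None, :]))
                # Gymnasium's step() returns (obs, reward, terminated, truncated, info)
                s_, r, terminated, truncated, _ = self.env.step(a.clip(-2, 2))
                done = terminated or truncated
                if t == MAX_EP_STEP - 1:
                    done = True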
                ep_r += r
                buffer_a.append(a)
                buffer_s.append(s)
                buffer_r.append((r+8.1)/8.1) # normalize

                if total_step % UPDATE_GLOBAL_ITER == 0 or done:  # update global and assign to local net
                    # sync
                    push_and_pull(self.opt, self.lnet, self.gnet, done, s_, buffer_s, buffer_a, buffer_r, GAMMA)
                    buffer_s, buffer_a, buffer_r = [], [], []

                    if done: # done and print information
                        record(self.g_ep, self.g_ep_r, ep_r, self.res_queue, self.name)
                        break

                s = s_
                total_step += 1

        self.res_queue.put(None)
        

if __name__ == "__main__":
    gnet = Net(N_S, N_A)        # global network
    gnet.share_memory()         # share the global parameters in multiprocessing
    opt = SharedAdam(gnet.parameters(), lr=1e-2, betas=(0.95, 0.999))  # global optimizer
    global_ep, global_ep_r, res_queue = mp.Value('i', 0), mp.Value('d', 0.), mp.Queue()

    # parallel training
    workers = [Worker(gnet, opt, global_ep, global_ep_r, res_queue, i) for i in range(mp.cpu_count())]
    for w in workers:
        w.start()
    res = []                    # record episode reward to plot
    while True:
        r = res_queue.get()
        if r is not None:
            res.append(r)
        else:
            break
    for w in workers:
        w.join()

    import matplotlib.pyplot as plt
    plt.plot(res)
    plt.ylabel('Moving average ep reward')
    plt.xlabel('Episode')
    plt.show()