# Actor-Critic Methods

The **Actor-Critic** framework combines the strengths of both **policy-based** and **value-based** reinforcement learning methods.

- The **Actor** represents the policy: it decides which action to take, and its parameters are what training updates.
- The **Critic** estimates the value function: it evaluates how good the actor's choice turned out to be.

Together, they provide a more stable and efficient learning process compared to pure policy gradient methods like REINFORCE.

## 🎯 1. Concept Overview

In **REINFORCE**, we update the policy directly using the return $G_t$:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} [ \nabla_\theta \log \pi_\theta(a_t|s_t) G_t ] $$

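However, this estimate has **high variance**: $G_t$ sums the random rewards of the entire remaining episode, so it can differ greatly between episodes that pass through the same state.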
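
To make $G_t$ concrete, the sketch below computes discounted returns for one episode in plain Python. The helper name `discounted_returns`, the discount of 0.5, and the reward values are illustrative assumptions, not part of the notebook's training code.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, G_1, ...] where G_t = r_t + gamma * G_{t+1}."""
    returns = []
    running = 0.0
    # Walk backwards so each G_t reuses the already-computed G_{t+1}.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

# Two hypothetical episodes (+1 reward per step) of different lengths.
short_episode = discounted_returns([1.0] * 3, gamma=0.5)
long_episode = discounted_returns([1.0] * 5, gamma=0.5)
print(short_episode)    # [1.75, 1.5, 1.0]
print(long_episode[0])  # 1.9375
```

The first state receives a different return depending on how long the episode happens to last, and that spread is what shows up as variance in the gradient estimate.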

The **Actor-Critic** method introduces a baseline — the *value function* $V(s_t)$ — to reduce this variance:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} [ \nabla_\theta \log \pi_\theta(a_t|s_t) (G_t - V(s_t)) ] $$

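Here, $(G_t - V(s_t))$ is an estimate of the **advantage**, which measures how much better the action turned out than the critic expected from state $s_t$. In practice, the return $G_t$ is often replaced by the bootstrapped one-step target $r_t + \gamma V(s_{t+1})$, which gives the TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ used in the implementation in Section 3.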
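
Subtracting $V(s_t)$ does not change the expected gradient. A short derivation, assuming a discrete action space and a baseline that depends only on the state:

$$ \mathbb{E}_{a \sim \pi_\theta(\cdot|s)} [ \nabla_\theta \log \pi_\theta(a|s) \, V(s) ] = V(s) \sum_a \pi_\theta(a|s) \, \nabla_\theta \log \pi_\theta(a|s) = V(s) \sum_a \nabla_\theta \pi_\theta(a|s) = V(s) \, \nabla_\theta \sum_a \pi_\theta(a|s) = V(s) \, \nabla_\theta 1 = 0 $$

The baseline term therefore vanishes in expectation, so it can lower the variance of the estimate without biasing it.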

## 🧠 2. Architecture Overview

An **Actor-Critic model** usually has two neural networks:

- **Actor Network:** Outputs a probability distribution over actions (policy).
- **Critic Network:** Outputs a single value estimate for the state.
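
The actor turns raw scores (logits) into a probability distribution with a softmax. A minimal plain-Python sketch of that step follows; the `softmax` helper and the logit values are illustrative assumptions, while the implementation later in this notebook relies on `nn.Softmax`.

```python
import math

def softmax(logits):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() from overflowing on large logits.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 2.0]))  # [0.5, 0.5]
print(softmax([1.0, -3.0, 0.5]))
```

The critic needs no such normalization because it outputs a single unconstrained number per state.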

During training:
- The critic scores each transition by comparing the observed reward plus its estimate for the next state against its estimate for the current state.
- The actor updates its policy in the direction suggested by the critic’s feedback.

## ⚙️ 3. Implementation using PyTorch

In [ ]:
import gym
import torch
import torch.nn as nn
import torch.optim as optim

# Define Actor Network
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(Actor, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)
        )
    
    def forward(self, state):
        return self.model(state)

# Define Critic Network
class Critic(nn.Module):
    def __init__(self, state_dim, hidden_dim=128):
        super(Critic, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
    
    def forward(self, state):
        return self.model(state)

In [ ]:
# Initialize environment and networks
env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

actor = Actor(state_dim, action_dim)
critic = Critic(state_dim)

actor_optimizer = optim.Adam(actor.parameters(), lr=0.001)
critic_optimizer = optim.Adam(critic.parameters(), lr=0.005)

gamma = 0.99

In [ ]:
# Function to choose an action
def select_action(state):
    state = torch.FloatTensor(state)
    probs = actor(state)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()
    return action.item(), dist.log_prob(action)
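
As a rough illustration of what `select_action` does, here is a plain-Python sketch of sampling from a categorical distribution and taking the log-probability of the sampled action. `sample_action` is a hypothetical helper and the probabilities are made up; the notebook itself uses `torch.distributions.Categorical`, which also keeps the log-probability differentiable with respect to the actor's parameters.

```python
import math
import random

def sample_action(probs, rng=random):
    """Sample an index in proportion to probs; return (action, log_prob)."""
    actions = list(range(len(probs)))
    action = rng.choices(actions, weights=probs, k=1)[0]
    return action, math.log(probs[action])

rng = random.Random(0)  # seeded only so the sketch is repeatable
probs = [0.2, 0.8]      # hypothetical actor output for one state
counts = [0, 0]
for _ in range(10_000):
    action, log_prob = sample_action(probs, rng)
    counts[action] += 1
print(counts)  # roughly 20% / 80% of the samples
```

Sampling, rather than always picking the most likely action, is what lets the agent keep exploring during training.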

In [ ]:
# Training Loop
num_episodes = 1000

for episode in range(num_episodes):
    state = env.reset()[0]
    done = False
    total_reward = 0
    
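    while not done:
        action, log_prob = select_action(state)
        # step() returns separate `terminated` and `truncated` flags;
        # the episode ends when either one is set.
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_reward += reward

        state_t = torch.FloatTensor(state)
        next_state_t = torch.FloatTensor(next_state)

        # Compute TD Target and TD Error
        # The target is treated as a constant, so no gradient flows through it.
        # Bootstrapping is switched off only on true termination; a time-limit
        # truncation still bootstraps from the next state's value.
        with torch.no_grad():
            target_value = reward + gamma * critic(next_state_t) * (1 - int(terminated))
        value = critic(state_t)
        advantage = target_value - value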

        # Update Critic
        critic_loss = advantage.pow(2).mean()
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()

        # Update Actor
        actor_loss = -log_prob * advantage.detach()
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()

        state = next_state

    if episode % 50 == 0:
        print(f"Episode {episode} | Total Reward: {total_reward}")
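
The TD target and advantage computed inside the loop can be checked by hand. The sketch below uses made-up critic outputs (plain floats rather than tensors) and a discount of 0.5 to show how the terminal mask removes the bootstrap term on the last step; `td_advantage` is a hypothetical helper, not part of the notebook.

```python
def td_advantage(reward, value, next_value, terminal, gamma=0.99):
    """One-step TD error: r + gamma * V(s') * (1 - terminal) - V(s)."""
    target = reward + gamma * next_value * (1 - int(terminal))
    return target - value

# Non-terminal step: the critic predicted 10.0 now and 12.0 for the next state.
print(td_advantage(1.0, 10.0, 12.0, terminal=False, gamma=0.5))  # 1 + 6 - 10 = -3.0
# Terminal step: no future value is added, so the target is just the reward.
print(td_advantage(1.0, 10.0, 12.0, terminal=True, gamma=0.5))   # 1 - 10 = -9.0
```

A negative advantage lowers the probability of the action that was taken, and a positive one raises it.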

## 📈 4. Visualizing Learning Progress

In [ ]:
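import matplotlib.pyplot as plt

# Note: no parameter updates happen in this cell. It runs the policy trained
# above for 200 episodes, so the plot shows post-training performance rather
# than a learning curve. To plot learning progress instead, append
# total_reward to a list at the end of each episode in the training loop.
episode_rewards = []

for episode in range(200):
    total_reward = 0
    state = env.reset()[0]
    done = False

    while not done:
        with torch.no_grad():  # gradients are not needed for evaluation
            action, _ = select_action(state)
        # The episode ends on termination or on time-limit truncation
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_reward += reward
        state = next_state

    episode_rewards.append(total_reward)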

plt.plot(episode_rewards)
plt.title('Actor-Critic Performance over Episodes')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.show()
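
Per-episode rewards are noisy, so a smoothed curve is often easier to read. A minimal sketch, assuming a simple trailing moving average; the `moving_average` helper, the window size, and the sample values are illustrative choices, not part of the original notebook.

```python
def moving_average(values, window=20):
    """Trailing mean over the last `window` values (fewer at the start)."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed

rewards = [10.0, 20.0, 30.0, 40.0]  # stand-in for episode_rewards
print(moving_average(rewards, window=2))  # [10.0, 15.0, 25.0, 35.0]
```

With matplotlib, `plt.plot(moving_average(episode_rewards))` can be drawn on top of the raw curve before calling `plt.show()`.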

## ✅ 5. Advantages and Limitations

**Advantages:**
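- Lower variance than REINFORCE, because the critic's estimate replaces the noisy full return.
- Often more sample-efficient: bootstrapping lets the agent update after every step instead of waiting for the episode to end.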
- Can be extended to advanced algorithms like A2C and A3C.

**Limitations:**
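- Training can still be unstable without careful tuning of the learning rates, and bootstrapping from an imperfect critic introduces bias.
- Two networks increase the computational cost of each update.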

➡️ Next step: Learn about **Advantage Actor-Critic (A2C)** and **Asynchronous Advantage Actor-Critic (A3C)** methods for parallel and efficient training.