# Actor-Critic

The Actor-Critic framework combines policy-based and value-based methods in the hope of getting the benefits of both.

Learning a value function alongside the policy helps because the value estimate steers the policy updates in the right direction, and with less variance than Monte Carlo returns. Bootstrapping from the value function also lets policy methods be used in non-episodic environments and be updated after every step, since there is no need to wait until the end of an episode.

Below is the pseudocode for one-step actor-critic:

![Pseudocode](https://i.stack.imgur.com/zFfxs.png)
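
In case the image does not load, here is a common way to write the one-step updates. The notation is my own assumption rather than a transcription of the image: $V_w$ is the critic, $\pi_\theta$ is the policy, $d_{t+1}$ is 1 on a terminal transition and 0 otherwise, and $\alpha_w$, $\alpha_\theta$ are step sizes.

```latex
\begin{aligned}
\delta_t &= r_{t+1} + \gamma\,(1 - d_{t+1})\,V_w(s_{t+1}) - V_w(s_t) \\
w &\leftarrow w + \alpha_w\,\delta_t\,\nabla_w V_w(s_t) \\
\theta &\leftarrow \theta + \alpha_\theta\,\delta_t\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)
\end{aligned}
```

The implementation in this notebook follows the same shape but is not identical: the critic is trained with a Huber loss and both networks are updated by AdamW rather than plain gradient steps.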

The following is an implementation of one-step actor-critic, but there are many other ways to do this. For example, instead of bootstrapping, one could use Monte Carlo returns to update the critic and update the policy on a whole episode of experience.


## Discrete Policy
The discrete action space policy is simply a neural network with the following layer sizes:
- Input : State Observation size
- Hidden : Whatever you please.
- Output: Number of possible actions.

The policy then uses a softmax activation function at the end to give probabilities for selecting an action.
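
To make the softmax step concrete, here is a small pure-Python sketch of what happens to the final layer's outputs. The logit values and the helper name `softmax` are made up for illustration; in the network itself `nn.Softmax` does this, and `Categorical` does the sampling.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max improves numerical stability without changing the result
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical final-layer outputs for an environment with two actions
logits = [2.0, 1.0]
probs = softmax(logits)

# Sample an action index in proportion to its probability
rng = random.Random(0)
action = rng.choices(range(len(probs)), weights=probs, k=1)[0]

print(probs, action)
```

The larger logit gets the larger probability, but the other action still has a non-zero chance of being picked, which is where the exploration comes from.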

In [8]:
from torch import nn
import torch.nn as nn
import torch.nn.functional as F
import torch

class Discrete_Policy(nn.Module):

    def __init__(self, input_size, hidden_size, nb_actions) -> None:
        super(Discrete_Policy, self).__init__()
        self.policy_net = nn.Sequential(nn.Linear(input_size, hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, nb_actions),
                                        nn.Softmax(dim=-1))
        
    def forward(self,obs):
        return self.policy_net(obs)

## Continuous Policy
The continuous action space policy is a little more complicated. There are several ways to do this, but the following uses two separate neural networks to approximate the mean and the standard deviation respectively.

The sizes of both neural networks are as follows:

- Input : State Observation size
- Hidden : Whatever you please.
- Output: Number of action dimensions (1 for a single continuous action).

One doesn't need a separate neural network for the standard deviation; a single learned parameter or even a fixed constant also works. Each choice gives different results, so this involves some experimentation.
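
To show how a mean and a standard deviation turn into an action and a log-probability, here is a pure-Python sketch of the Gaussian log-density. The numbers and the helper name `normal_log_prob` are assumptions for illustration; in the agent below, `torch.distributions.Normal` does the sampling and the `log_prob` computation.

```python
import math
import random

def normal_log_prob(action, mean, std):
    """Log-density of a Normal(mean, std) distribution evaluated at `action`."""
    var = std ** 2
    return (-((action - mean) ** 2) / (2 * var)
            - math.log(std)
            - 0.5 * math.log(2 * math.pi))

# Hypothetical policy outputs for a one-dimensional action
mean, std = 0.5, 0.2

# Sample an action around the mean
rng = random.Random(0)
action = rng.gauss(mean, std)

# This is the quantity that gets scaled by the TD error in the policy loss
log_prob = normal_log_prob(action, mean, std)
print(action, log_prob)
```

The standard deviation must be strictly positive, otherwise `math.log(std)` is undefined. With several independent action dimensions, the per-dimension log-probabilities are added together.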

In [9]:
class Continuous_Policy(nn.Module):

    def __init__(self, input_size, hidden_size, nb_actions) -> None:
        super(Continuous_Policy, self).__init__()
        self.mean_net = nn.Sequential(nn.Linear(input_size, hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, nb_actions))
        self.standard_deviation_net = nn.Sequential(nn.Linear(input_size, hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, nb_actions))
        
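    def forward(self,obs):
        # abs() keeps the std deviation non-negative; the small constant keeps it strictly positive, as Normal requires
        return self.mean_net(obs),torch.abs(self.standard_deviation_net(obs)) + 1e-6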

## Critic
This is the critic model - a neural network that learns the value function. It could be a Q (action-value) or V (state-value) approximator, depending on the implementation - here I am creating a state-value approximator.

In [10]:
class Critic(nn.Module):

    def __init__(self, input_size, hidden_size) -> None:
        super(Critic, self).__init__()
        self.critic = nn.Sequential(nn.Linear(input_size, hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, 1))
        
    def forward(self,obs):
        return self.critic(obs)

## Agent 
The following is the agent class:

In [11]:
from torch.distributions import Categorical, Normal
from torch.optim import AdamW
import torch

class Actor_Critic_Agent:

    def __init__(self,input_size,hidden_size,nb_actions,discrete=True,discount_factor=0.9,learning_rate=0.0001):
        self.discrete = discrete
        if discrete:
            self.policy = Discrete_Policy(input_size,hidden_size,nb_actions)
        else:
            self.policy = Continuous_Policy(input_size,hidden_size,nb_actions)
        
        self.critic = Critic(input_size,hidden_size)

        self.gamma = discount_factor
        
        # One can use one optimizer for both the critic and policy - this is an implementation detail
        self.policy_optimizer = AdamW(self.policy.parameters(),learning_rate)
        self.critic_optimizer = AdamW(self.critic.parameters(),learning_rate)
        
    def act(self,obs):
        if self.discrete:
            # Create categorical distribution using the action probabilities given by the agent's policy
            action_dist = Categorical(self.policy(obs))

        else:
            mu,dev = self.policy(obs)
            # Create a Normal distribution using the policy given mean and std deviation
            action_dist = Normal(mu,dev)
        
        # Return the action distribution created according to policy
        return action_dist
    
    def train(self):
        self.policy.train()
    
    def eval(self):
        self.policy.eval()
        
    def update_model(self,state,reward,log_action_probability,next_state,done):
        
        # Current State Value
        state_value = self.critic(state)

        # Target State Value - detached so the critic loss does not backpropagate through the bootstrapped estimate
        target_value = (reward+ (1-done)*self.gamma*self.critic(next_state)).detach()
        
        # Temporal Difference Error
        td_error = target_value - state_value

        # Critics Loss - Huber loss of state and target - can use other loss functions such as MSE
        critic_loss = F.smooth_l1_loss(state_value,target_value)

        # Policy Loss - negative log action probability scaled by the td_error (summed so multi-dimensional actions give a scalar)
        policy_loss = (-td_error.detach()*log_action_probability).sum()

        # clear the optimizers current gradients
        self.policy_optimizer.zero_grad()
        self.critic_optimizer.zero_grad()

        # backpropagate the loss to calculate the gradients
        critic_loss.backward()
        policy_loss.backward()

        # make one optimization step
        self.policy_optimizer.step()
        self.critic_optimizer.step()
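
The arithmetic inside `update_model` is easier to follow with plain numbers. Below is a small sketch with made-up critic outputs; the helper names `td_target` and `td_error` are hypothetical and only mirror the formulas used above.

```python
import math

def td_target(reward, next_value, done, gamma=0.99):
    """One-step bootstrapped target; the bootstrap term is dropped on terminal transitions."""
    return reward + (1 - int(done)) * gamma * next_value

def td_error(value, reward, next_value, done, gamma=0.99):
    """Difference between the bootstrapped target and the current value estimate."""
    return td_target(reward, next_value, done, gamma) - value

# Hypothetical critic outputs: V(s) = 10.0 and V(s') = 10.5, with a reward of 1.0
delta_mid = td_error(value=10.0, reward=1.0, next_value=10.5, done=False)
delta_end = td_error(value=10.0, reward=1.0, next_value=10.5, done=True)

# Hypothetical log-probability of the action that was taken
log_prob = math.log(0.5)

# A positive TD error gives a loss whose minimization raises log_prob;
# a negative TD error gives a loss whose minimization lowers it
policy_loss_mid = -delta_mid * log_prob
policy_loss_end = -delta_end * log_prob

print(delta_mid, delta_end, policy_loss_mid, policy_loss_end)
```

In the non-terminal case the transition turned out better than the critic expected, so the action is made more likely. In the terminal case the critic had overestimated the state, so the action is made less likely.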
    
    

## Training Loop
The last thing needed is the training loop.

In [12]:
from collections import deque
import gym
import torch

def train(agent, env, num_episodes, num_steps, log_every=20, render_from=1000):
    """
    :param agent: the agent to be trained.
    :param env: the gym environment.
    :param num_episodes: the number of episodes to train.
    :param num_steps: the max number of steps possible per episode.
    :param log_every: The frequency of logging. Default logs every 20 episodes.
    :param render_from: The episode number to start rendering to screen to allow one to view agent. Rendering significantly slows down training.
    """
    
    # Running Average Reward Memory
    running_avg_reward = deque(maxlen=100)
    agent.train()
    for episode in range(1,num_episodes+1):
        # Starting state observation
        obs = torch.tensor(env.reset(),dtype=torch.float32)
        reward_total = 0
        for step in range(num_steps):
            if episode>render_from:
                env.render()
            # Return agents action probability distribution
            action_dist = agent.act(obs)
            # Sample an action from this distribution
            action = action_dist.sample()
            # Take a step in the environment with the action drawn
            next_obs , reward, done, info = env.step(action.numpy())
            
            next_obs = torch.tensor(next_obs,dtype=torch.float32)
            
            reward_total+=reward
            
            # Perform one step update
            agent.update_model(obs,reward,action_dist.log_prob(action),next_obs,done)
            
            # Set the current state to the next state
            obs = next_obs
            
            # if done then log and break
            if done:
                if episode % log_every ==0 and len(running_avg_reward)>0:
                    running_avg = sum(running_avg_reward)/len(running_avg_reward)
                    print("Episode {0:4d} finished after {1:4d} timesteps with a total reward of {2:3.1f} | Running Average: {3:3.2f}".format(episode,step+1,reward_total,running_avg))
                break
        running_avg_reward.append(reward_total)
        

    env.close()
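
The running average in the log comes from a `deque` with a fixed `maxlen`, which silently drops the oldest entry once it is full. A minimal sketch with a window of 3 instead of 100 and made-up episode rewards:

```python
from collections import deque

# Keep only the three most recent episode rewards
window = deque(maxlen=3)

for episode_reward in [10.0, 20.0, 30.0, 40.0]:
    window.append(episode_reward)

# The first reward (10.0) has been pushed out by the fourth append
running_avg = sum(window) / len(window)
print(list(window), running_avg)
```

Because the reward of the current episode is appended only after logging, the printed running average covers the episodes before the one being logged.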

In [13]:

def eval(agent, env, num_episodes, num_steps, log_every=20):
    """
    :param agent: the agent to be evaluated.
    :param env: the gym environment.
    :param num_episodes: the number of episodes to eval.
    :param num_steps: the max number of steps possible per episode.
    :param log_every: The frequency of logging. Default logs every 20 episodes.
    """
    agent.eval()
    # Running Average Reward Memory
    running_avg_reward = deque(maxlen=100)
    for episode in range(1,num_episodes+1):
        # Starting state observation
        obs = env.reset()
        reward_total = 0
        for step in range(num_steps):
           
            env.render()
            # Return agents action probability distribution
            action_dist = agent.act(torch.tensor(obs,dtype=torch.float32))
            # Sample an action from this distribution
            action = action_dist.sample()
            # Take a step in the environment with the action drawn
            obs , reward, done, info = env.step(action.numpy())
            # Just for logging
            reward_total+=reward
           
            # if done then log and break
            if done:
                if episode % log_every ==0 and len(running_avg_reward)>0:
                    running_avg = sum(running_avg_reward)/len(running_avg_reward)
                    print("Episode {0:4d} finished after {1:4d} timesteps with a total reward of {2:3.1f} | Running Average: {3:3.2f}".format(episode,step+1,reward_total,running_avg))
                break
        running_avg_reward.append(reward_total)
    env.close()
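
The loops above assume the older Gym interface, where `env.reset()` returns only the observation and `env.step()` returns four values. Newer releases of Gym (0.26 onward) and Gymnasium return `(observation, info)` from `reset()` and five values from `step()`. The sketch below shows one way to support both; `reset_env`, `step_env` and the two fake environments are hypothetical names written for this example and are not part of any library.

```python
def reset_env(env):
    """Return only the observation from env.reset(), for either API style."""
    result = env.reset()
    # Newer API: (observation, info) where info is a dict
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result

def step_env(env, action):
    """Return (obs, reward, done, info) from env.step(), for either API style."""
    result = env.step(action)
    if len(result) == 5:
        obs, reward, terminated, truncated, info = result
        # Assumption: a time-limit truncation is treated like a normal episode end
        return obs, reward, terminated or truncated, info
    return result

class OldStyleEnv:
    """Stand-in for an environment using the older 4-tuple interface."""
    def reset(self):
        return [0.0, 0.0]
    def step(self, action):
        return [0.1, 0.0], 1.0, False, {}

class NewStyleEnv:
    """Stand-in for an environment using the newer 5-tuple interface."""
    def reset(self):
        return [0.0, 0.0], {}
    def step(self, action):
        return [0.1, 0.0], 1.0, False, True, {}

for env in (OldStyleEnv(), NewStyleEnv()):
    obs = reset_env(env)
    next_obs, reward, done, info = step_env(env, 0)
    print(type(env).__name__, obs, next_obs, reward, done)
```

Folding `truncated` into `done` keeps the loops unchanged, but it also switches off bootstrapping on time-limit cut-offs, which is a simplification.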

## Time to Train

In [14]:
env = gym.make("CartPole-v0")

if (isinstance(env.action_space,gym.spaces.Discrete)):
    discrete = True
    nb_actions = env.action_space.n
else:
    discrete = False
    nb_actions = env.action_space.shape[0]

# Hyper parameters
HIDDEN_SIZE = 256
GAMMA = 0.99
LEARNING_RATE = 0.0005
EPISODES_TO_TRAIN = 700
EPISODES_TO_EVAL = 20
MAX_STEPS = 501
LOG_EVERY = 20
RENDER_FROM_EP = 2000

agent = Actor_Critic_Agent(env.observation_space.shape[0],HIDDEN_SIZE,nb_actions,discrete,GAMMA,LEARNING_RATE)

print("Training...")

train(agent,env,EPISODES_TO_TRAIN,MAX_STEPS,LOG_EVERY,RENDER_FROM_EP)

print("Evaluating...")

eval(agent,env,EPISODES_TO_EVAL,MAX_STEPS,log_every=1)

Training...
Episode   20 finished after   28 timesteps with a total reward of 28.0 | Running Average: 24.79
Episode   40 finished after   21 timesteps with a total reward of 21.0 | Running Average: 27.54
Episode   60 finished after   50 timesteps with a total reward of 50.0 | Running Average: 29.88
Episode   80 finished after   30 timesteps with a total reward of 30.0 | Running Average: 35.58
Episode  100 finished after   24 timesteps with a total reward of 24.0 | Running Average: 35.03
Episode  120 finished after   32 timesteps with a total reward of 32.0 | Running Average: 38.75
Episode  140 finished after   91 timesteps with a total reward of 91.0 | Running Average: 43.34
Episode  160 finished after   46 timesteps with a total reward of 46.0 | Running Average: 52.12
Episode  180 finished after   41 timesteps with a total reward of 41.0 | Running Average: 48.53
Episode  200 finished after   28 timesteps with a total reward of 28.0 | Running Average: 52.21
Episode  220 finished after 