## Tensorflow's implementation of an Actor Critic

The goal here is to see how to correctly use tensorflow to write a reinforcement learning agent. In this notebook we simply write down exactly what is in tensorflow's example but at the same time try to add notes and other useful information. The goal is to be able to read this page and understand how to correctly write an agent. 

Note that the algorithm being used here is an advantage actor critic that uses the returns G as the estimate and a value function V as a baseline. 

### Table of contents  
#### Setup:$\hspace{5 mm}$ <a href = '#Imports'>Imports</a>
#### Model:$\hspace{5 mm}$ <a href = '#The Model'>The Model</a>

#### Training:$\hspace{5 mm}$ <a href = '#Collecting Training Data'>Collecting Training Data</a> $\hspace{5 mm}$ <a href = '#Computing Expected Returns'>Computing Expected Returns</a> $\hspace{5 mm}$ <a href = '#Computing the loss'>Computing the loss</a> $\hspace{5 mm}$ <a href = '#Updating parameters'>Updating parameters</a> $\hspace{5 mm}$ <a href = '#Training Loop'>Training Loop</a> 

<a id = 'Imports'> </a>

### Imports 

All the libraries you will need

In [1]:
import gym                                     # Host the environment 
import numpy as np                             # Fast linear algebra
import tensorflow as tf                        # Fast machine learning 
import tqdm                                    # Only used once, is a progress bar (so optional)

from tensorflow.keras import layers            # Makes it easier to use functional API
from typing import Any, List, Sequence, Tuple  # Lets us use type checking by giving us names to call 


# Create the environment
env = gym.make("CartPole-v0")

# Set seed for experiment reproducibility
seed = 42
env.seed(seed)
tf.random.set_seed(seed)
np.random.seed(seed)

# Small epsilon value for stabilizing division operations
eps = np.finfo(np.float32).eps.item()

<a id = 'The Model'> </a>

### The Model 

Define the model architecture that we will use. With an actor critic the network has one shared portion and then splits off into the actor (policy network) and the critic (value function). The ActorCritic Class does just that. Note that below we create a model as a global variable that we can generally use. The plan is to not have an explicit agent class but rather simply code the features ourselves. 

In [2]:
class ActorCritic(tf.keras.Model):
    """Combined actor-critic network"""
    
    def __init__(self, num_actions: int, num_hidden_units: int):
        """Initialize."""
        super().__init__()
        self.common = layers.Dense(num_hidden_units)
        self.actor = layers.Dense(num_actions)
        self.critic = layers.Dense(1)
        
    def call(self, inputs: tf.Tensor) -> Tuple[tf.Tensor, tf.Tensor]:
        x = self.common(inputs)
        return self.actor(x), self.critic(x)
        
num_actions = env.action_space.n  # Good practice, in this case it is 2
num_hidden_units = 128

model = ActorCritic(num_actions, num_hidden_units)

<a id = 'Collecting Training Data'> </a>

### Collecting Training Data

Now we actually run an episode. The first function simply acts as a wrapper for the env.step() function. It adds on function type hints and specifies the input data type. 

The second function turns what we have into an tensorflow operation. Since we don't need to be able to take gradients through it we should be fine not using tf.py_function.  

The third function actually runs the full epsiode using the previous function. 

In [3]:
def env_step(action: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]: #ndarray is the data type of a numpy array
    """Returns state, reward, and done flag given an action"""
    
    state, reward, done, _ = env.step(action)                                  # Returns ndarray, float, boolean
    return (state.astype(np.float32), np.array(reward, np.int32), np.array(done, np.int32))

def tf_env_step(action: tf.Tensor) -> List[tf.Tensor]:
    
    return tf.numpy_function(env_step, [action], [tf.float32, tf.int32, tf.int32])

def run_episode(initial_state: tf.Tensor, model: tf.keras.Model, max_steps: int) -> List[tf.Tensor]:
    """Runs a single epsiode to collect training data."""

    # tf.TensorArray allows you to create variable sized arrays. This means we can don't need to worry about episode length
    action_probs = tf.TensorArray(dtype = tf.float32, size = 0, dynamic_size = True)
    values = tf.TensorArray(dtype = tf.float32, size = 0, dynamic_size = True)
    rewards = tf.TensorArray(dtype = tf.int32, size = 0, dynamic_size = True)
    
    # env.reset() returns an intial observation. That is what initial_state is. 
    initial_state_shape = initial_state.shape
    state = initial_state 
    
    for t in tf.range(max_steps):
        # Convert state into a batched tensor (batch size = 1) so that shape is (1, 4) in our case
        state = tf.expand_dims(state, 0)
        
        # Run the model and get action probabilities and critic value
        action_logits_t, value = model(state)
        
        # Sample next action from probability distribution and track probability distribution
        # categorical basically takes the unnormalized values and then pretends the probability is the softmax of them. 
        # this is why if you only wanted to normalize probabilities (0.1, 0.4 => 20%, 80%) you would just take ln (do the math)
        action = tf.random.categorical(action_logits_t, 1)[0, 0] # This ends up having shape (1,1) so taking [0,0] corrects this
        action_probs_t = tf.nn.softmax(action_logits_t)
        
        # Store critic values and log probabilities of action getting chosen 
        values = values.write(t, tf.squeeze(value)) #Squeezing gets rid of the batch aspect
        action_probs = action_probs.write(t, action_probs_t[0, action])
        
        # Move forward in the environment
        state, reward, done = tf_env_step(action)
        state.set_shape(initial_state_shape)
        
        # Store the reward
        rewards = rewards.write(t, reward)
        
        # End episode?
        if tf.cast(done, tf.bool):
            break
    
    # Get one dimensional tensors representing each of the following
    action_probs = action_probs.stack()
    values = values.stack()
    rewards = rewards.stack()
    
    return action_probs, values, rewards

In [4]:
# Show what is happening inside run_epsiode
initial_state = env.reset() # Note that I'm being lazy, env.reset() should be a Tensor
max_steps = 2

action_probs = tf.TensorArray(dtype = tf.float32, size = 0, dynamic_size = True)
values = tf.TensorArray(dtype = tf.float32, size = 0, dynamic_size = True)
rewards = tf.TensorArray(dtype = tf.int32, size = 0, dynamic_size = True)

initial_state_shape = initial_state.shape
state = initial_state 
print("initial_state_shape: ", initial_state_shape)
print("initial_state: ", state)

for t in tf.range(max_steps):
    print("\nNEW======================================")
    state = tf.expand_dims(state, 0)
    print("Expanded state: ", state)
    
    action_logits_t, value = model(state)
    print("model output:", action_logits_t, value)

    action = tf.random.categorical(action_logits_t, 1)[0, 0]
    action_probs_t = tf.nn.softmax(action_logits_t)
    print("selected action: ", action)
    print("action probabilities: ", action_probs_t)

    values = values.write(t, tf.squeeze(value))
    action_probs = action_probs.write(t, action_probs_t[0, action])
    
    state, reward, done = tf_env_step(action)
    print("Normal state: ", state)
    state.set_shape(initial_state_shape)

    rewards = rewards.write(t, reward)
    print("rewards: ", reward)

    if tf.cast(done, tf.bool):
        print("leaving")
        break

action_probs = action_probs.stack()
values = values.stack()
rewards = rewards.stack()
print("\nFinal: ")
print(action_probs)
print(values)
print(rewards)

initial_state_shape:  (4,)
initial_state:  [-0.01258566 -0.00156614  0.04207708 -0.00180545]

Expanded state:  tf.Tensor([[-0.01258566 -0.00156614  0.04207708 -0.00180545]], shape=(1, 4), dtype=float64)
model output: tf.Tensor([[-0.01142954  0.00853915]], shape=(1, 2), dtype=float32) tf.Tensor([[-0.01699052]], shape=(1, 1), dtype=float32)
selected action:  tf.Tensor(1, shape=(), dtype=int64)
action probabilities:  tf.Tensor([[0.495008 0.504992]], shape=(1, 2), dtype=float32)
Normal state:  tf.Tensor([-0.01261699  0.1929279   0.04204097 -0.28092128], shape=(4,), dtype=float32)
rewards:  tf.Tensor(1, shape=(), dtype=int32)

Expanded state:  tf.Tensor([[-0.01261699  0.1929279   0.04204097 -0.28092128]], shape=(1, 4), dtype=float32)
model output: tf.Tensor([[-0.03415646 -0.07397287]], shape=(1, 2), dtype=float32) tf.Tensor([[-0.05889579]], shape=(1, 1), dtype=float32)
selected action:  tf.Tensor(0, shape=(), dtype=int64)
action probabilities:  tf.Tensor([[0.5099528 0.4900472]], shape=(1, 2

<a id = 'Computing Expected Returns'> </a>

#### Computing Expected Returns

Using the data we gain from an episode of training, we can now compute the return G. 

In [5]:
def get_expected_return(rewards: tf.Tensor, gamma: float, standardize: bool = True) -> tf.Tensor:
    """Compute expected returns per timestep."""
    
    # Create the variables we must track. We use tf.shape(rewards)[0] because that is the epsiode length
    n = tf.shape(rewards)[0]                           
    returns = tf.TensorArray(dtype=tf.float32, size=n)
    
    # To efficiently compute the return, we start by computing it for the last epsiode and then * by gamma followed by + r. 
    # We basically work from down up to find the returns
    
    rewards = tf.cast(rewards[::-1], dtype=tf.float32) # We use [::-1] to order the rewards from last one to first
    discounted_sum = tf.constant(0.0)
    discounted_sum_shape = discounted_sum.shape
    
    for i in tf.range(n):
        reward = rewards[i]
        discounted_sum = reward + gamma * discounted_sum 
        discounted_sum.set_shape(discounted_sum_shape) # Allows us to ensure that everything is going correctly
        returns = returns.write(i, discounted_sum) 
        
    returns = returns.stack()[::-1] # Now the returns are ordered from last to first so we swap em again. Shape is (n, )
    
    # 0 mean and unit std can help with convergence 
    if standardize:
        returns = (returns - tf.math.reduce_mean(returns)) / (tf.math.reduce_std(returns) + eps)
    
    return returns

In [6]:
# Show returns
action_probs, values, rewards = run_episode(env.reset(), model, 5) # Note that I'm being lazy, env.reset() should be a Tensor
returns = get_expected_return(rewards, 0.9)
print(returns)

tf.Tensor([ 1.3388376   0.7397628   0.07412429 -0.66547424 -1.4872503 ], shape=(5,), dtype=float32)


<a id = 'Computing the loss'> </a> 

### Computing the loss

The critic loss is just a regression problem, where we are tying to make V as close to G. As such, we can use:

$L_{critic} = L_{\delta}(G, V_{\theta}^\pi)$ where $L_{\delta}$ is the huber loss (less sensitive to outliers than MSE)

The actor loss is based off of policy gradients with the critic as a state-dependent baseline

$L_{actor} = - \sum^{T}_{t=1} \log{\pi_{\theta}(a_t|s_t)} * [G(s_t, a_t) - V^{\pi}_{\theta}(s_t)]$

In [7]:
huber_loss = tf.keras.losses.Huber(reduction=tf.keras.losses.Reduction.SUM)

def compute_loss(action_probs: tf.Tensor, values: tf.Tensor, returns: tf.Tensor) -> tf.Tensor:
    """Computes the combined actor-critic loss."""
 
    advantage = returns - values
    action_log_probs = tf.math.log(action_probs)

    actor_loss = -tf.math.reduce_sum(action_log_probs * advantage)
    critic_loss = huber_loss(values, returns)
    
    return actor_loss + critic_loss

<a id = 'Updating parameters'> </a>

### Updating parameters

Now we update the parameters. Using tf.function() turns this into a callable graph which can help speed things up greatly

In [8]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

@tf.function()
def train_step(initial_state: tf.Tensor, model: tf.keras.Model, optimizer: tf.keras.optimizers.Optimizer,
               gamma: float, max_steps_per_episode: int) -> tf.Tensor:
    """Runs a model training step."""

    with tf.GradientTape() as tape:
        
        # Run the model for one episode to collect training data
        action_probs, values, rewards = run_episode(initial_state, model, max_steps_per_episode)
        
        # Calculate expected returns
        returns = get_expected_return(rewards, gamma)
        
        # Convert the data to get the correct shape: (n, 1) 
        action_probs, values, returns = [tf.expand_dims(x, 1) for x in [action_probs, values, returns]]
        
        # Calculate loss
        loss = compute_loss(action_probs, values, returns)
        
    # Compute the gradients from the loss
    grads = tape.gradient(loss, model.trainable_variables)
    
    # Apply the gradients to the model's parameters
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
        
    # Get the final reward
    episode_reward = tf.math.reduce_sum(rewards)
    
    return episode_reward

<a id= 'Training Loop'> </a>

### Training Loop

Now we write the actual training loop

In [9]:
%%time 

max_episodes = 1000
max_steps_per_episode = 1000

# The Cartpole environment is considered solved if the reward is over 195 over 100 trials
reward_threshold = 195
running_reward = 0

# Discount factor
gamma = 0.99

with tqdm.trange(max_episodes) as t:
    for i in t:
        initial_state = tf.constant(env.reset(), dtype = tf.float32)
        episode_reward = int(train_step(initial_state, model, optimizer, gamma, max_steps_per_episode))
        
        running_reward = 0.01 * episode_reward + running_reward * 0.99
        
        t.set_description(f'Episode {i}')
        t.set_postfix(episode_reward=episode_reward, running_reward=running_reward)

        if running_reward > reward_threshold:  
            break

print(f'\nSolved at episode {i}: average reward: {running_reward:.2f}!')

Episode 539:  54%|██████████████            | 539/1000 [00:52<00:45, 10.19it/s, episode_reward=200, running_reward=195]


Solved at episode 539: average reward: 195.03!
Wall time: 52.9 s





In [11]:
initial_state = tf.constant(env.reset(), dtype = tf.float32)
max_steps = 10000

# env.reset() returns an intial observation. That is what initial_state is. 
initial_state_shape = initial_state.shape
state = initial_state 

for t in tf.range(max_steps):
    state = tf.expand_dims(state, 0)

    action_logits_t, value = model(state)        
    action = tf.random.categorical(action_logits_t, 1)[0, 0] # This ends up having shape (1,1) so taking [0,0] corrects this

    env.render()
    state, reward, done = tf_env_step(action)

    # End episode?
    if tf.cast(done, tf.bool):
        break

env.close()