# Policy Gradient

[Policy gradient algorithms](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html) are among the most powerful and versatile approaches to reinforcement learning in very large state spaces. In this notebook, we go step by step through the [lucid example](https://github.com/openai/spinningup/blob/master/spinup/examples/pg_math/1_simple_pg.py) presented by OpenAI.

First, we load the CartPole environment and record the size of the state and action spaces.

In [1]:
import gym 
import numpy as np
env = gym.make('CartPole-v0')

n_s = env.observation_space.shape[0]
n_a = env.action_space.n

Next, we build a policy network as a linear model.

In [2]:
import tensorflow as tf

#input
s_ph = tf.placeholder(shape = (None, n_s), dtype = tf.float32)

#action logits (unnormalized log-probabilities, one per action)
logits = tf.layers.dense(s_ph,
                         units = n_a,
                         activation = None)

#sample from policy 
actions = tf.squeeze(tf.multinomial(logits=logits,
                                    num_samples=1), 
                     axis=1)
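
To make the sampling step concrete, here is a minimal NumPy sketch of what drawing actions from logits amounts to: a softmax turns each row of logits into probabilities, and one action is drawn per row. The helper names `softmax` and `sample_actions` are illustrative and not part of the notebook or of TensorFlow.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sample_actions(logits, rng):
    """Draw one action index per row from the categorical distribution of that row."""
    probs = softmax(logits)
    return np.array([rng.choice(len(p), p=p) for p in probs])

rng = np.random.default_rng(0)
# two states, two actions: the first row is uniform, the second strongly prefers action 0
example_logits = np.array([[0.0, 0.0], [2.0, -2.0]])
example_probs = softmax(example_logits)
example_actions = sample_actions(example_logits, rng)
```

Equal logits give equal probabilities, while a large gap between logits makes the preferred action almost certain.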

## Vanilla Policy Gradient

With this network in place, we can build the expression that the policy gradient method differentiates: the weighted log-probabilities of the actions that were taken.

In [6]:
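#weights multiplying the log-probabilities (one per state-action pair)
weights_ph = tf.placeholder(shape=(None,), 
                            dtype=tf.float32)
#indices of the actions that were actually taken
a_ph = tf.placeholder(shape=(None,), 
                      dtype=tf.int32)
#log-probability of each taken action, selected via a one-hot mask
a_masks = tf.one_hot(a_ph, n_a)
log_probs = tf.reduce_sum(a_masks * tf.nn.log_softmax(logits), 
                          axis=1)
#surrogate loss: its gradient is the negative policy gradient estimate
pol_grad = -tf.reduce_mean(weights_ph * log_probs)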
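
The same computation can be sketched in plain NumPy to check the masking logic by hand. This is an illustration under the assumption that the logits, actions and weights are already available as arrays; `log_softmax` and `surrogate_loss` are hypothetical helper names.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def surrogate_loss(logits, actions, weights):
    """Negative mean of the weighted log-probabilities of the taken actions."""
    n_actions = logits.shape[1]
    masks = np.eye(n_actions)[actions]                     # one-hot rows
    log_probs = (masks * log_softmax(logits)).sum(axis=1)  # keep only the taken action
    return -np.mean(weights * log_probs)

# three steps, two actions, uniform policy, every step weighted by 3
toy_logits = np.zeros((3, 2))
toy_actions = np.array([0, 1, 0])
toy_weights = np.array([3.0, 3.0, 3.0])
toy_loss = surrogate_loss(toy_logits, toy_actions, toy_weights)
```

With a uniform policy over two actions, every log-probability equals log(1/2), so the loss in this toy case is 3·log 2.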

In the vanilla policy gradient, the log-probability of every action in an episode is weighted by the same number: the total reward accumulated over that episode.
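
Written out, and using notation that is introduced here rather than taken from the notebook ($\theta$ for the network parameters, $\pi_\theta$ for the policy, $N$ for the number of stored state-action pairs $(s_i, a_i)$, and $w_i$ for their weights), the quantity called `pol_grad` in the code is

$$
L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} w_i \, \log \pi_\theta(a_i \mid s_i),
$$

so that $-\nabla_\theta L(\theta)$ is, up to normalization, the sample estimate of the policy gradient. In the vanilla variant, every step $i$ that belongs to an episode with rewards $r_0, \dots, r_{T-1}$ receives the same weight

$$
w_i = \sum_{t=0}^{T-1} r_t .
$$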

In [8]:
def vanilla_pg(episode_rs):
    """weights for vanilla policy gradient
    
    # Arguments
        episode_rs: rewards inside episode
        
    # Result
        weights for vanilla policy gradient
    """  
    return [np.sum(episode_rs)] * len(episode_rs)
    

Now, we act according to the policy and collect a batch of states, actions and weights.

In [36]:
def play_mdp(policy_net,
             placeholder,
             weights = vanilla_pg,
             batch_size = 5000):
    """Play an MDP according to given policy network
    
    # Arguments
        policy_net: network for selecting actions
        placeholder: input for policy net
        weights: weight-function for policy gradient
        batch_size: number of state-action pairs to exceed before stopping
        
    # Result
        batch of experienced states, actions and weights
    """    
    #rewards per episode 
    episode_rs = []
    trace_episode_rs = []
    
    #collect data on states, actions and weights
    batch_ss, batch_as, batch_ws = [], [], []
    
    #initialize environment
    s = env.reset()

    while True:
        #determine action
        a = sess.run(policy_net, 
                     {placeholder: s.reshape(1,-1)})[0]

        #store state-action pair
        batch_ss.append(s.copy())
        batch_as.append(a.copy())

        #act and record the reward of this step
        s, r, done, _ = env.step(a)
        episode_rs += [r]

        if done:
            trace_episode_rs += [np.sum(episode_rs)]
            
            #append new weights and increase total rewards
            batch_ws +=  weights(episode_rs)
            
            #reset environment and rewards
            s = env.reset()           
            episode_rs = []
            
            #end if sufficient information gathered
            if(len(batch_ss) > batch_size): break
                
    print("Episode reward: {}".format(np.mean(trace_episode_rs)))
    return batch_ss, batch_as, batch_ws
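
The bookkeeping of this loop can be checked without TensorFlow or Gym. The sketch below assumes a hypothetical `ToyEnv` with fixed-length episodes and a reward of 1 per step, plus a random policy. It mirrors the structure of `play_mdp` but is not the notebook's function.

```python
import numpy as np

class ToyEnv:
    """Hypothetical stand-in environment: every episode lasts `horizon` steps."""
    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(2)

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        # old Gym-style 4-tuple: observation, reward, done, info
        return np.full(2, float(self.t)), 1.0, done, {}

def collect_batch(env, policy, weight_fn, batch_size):
    """Play whole episodes until more than batch_size steps are stored."""
    batch_ss, batch_as, batch_ws = [], [], []
    episode_rs = []
    s = env.reset()
    while True:
        a = policy(s)
        batch_ss.append(s.copy())
        batch_as.append(a)
        s, r, done, _ = env.step(a)
        episode_rs.append(r)
        if done:
            # weights are appended per finished episode, one per step
            batch_ws += weight_fn(episode_rs)
            s, episode_rs = env.reset(), []
            if len(batch_ss) > batch_size:
                break
    return np.array(batch_ss), np.array(batch_as), np.array(batch_ws)

rng = np.random.default_rng(0)
random_policy = lambda s: int(rng.integers(2))
total_return = lambda rs: [float(np.sum(rs))] * len(rs)

ss, as_, ws = collect_batch(ToyEnv(horizon=3), random_policy, total_return, batch_size=7)
```

Because the stopping condition is only checked at the end of an episode, the batch always contains complete episodes: here three episodes of three steps each, so the states, actions and weights line up one-to-one.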

To update the policy, we let the Adam optimizer minimize `pol_grad`, which amounts to taking a step along the estimated policy gradient.

In [30]:
train_op = tf.train.AdamOptimizer(learning_rate = 1e-2).minimize(pol_grad)
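
To see what a single update does, the following NumPy sketch takes one step on a linear softmax policy using the analytic gradient of the surrogate loss. It substitutes plain gradient descent for Adam and random toy data for CartPole, so it only illustrates the mechanics. All names and numbers are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(W, b, states, actions, weights):
    """Surrogate loss of a linear softmax policy and its gradient w.r.t. W and b."""
    probs = softmax(states @ W + b)
    masks = np.eye(W.shape[1])[actions]
    log_probs = np.log((masks * probs).sum(axis=1))
    loss = -np.mean(weights * log_probs)
    # d loss / d logits = -(weights / N) * (one_hot - probs)
    d_logits = -(weights[:, None] * (masks - probs)) / len(actions)
    return loss, states.T @ d_logits, d_logits.sum(axis=0)

rng = np.random.default_rng(0)
states = rng.normal(size=(8, 4))    # 8 toy states with 4 features each
actions = rng.integers(2, size=8)   # actions that were "taken"
weights = np.full(8, 5.0)           # positive weights, as in CartPole
W, b = np.zeros((4, 2)), np.zeros(2)

loss_before, grad_W, grad_b = loss_and_grad(W, b, states, actions, weights)
# plain gradient descent step with a small learning rate (not Adam)
W, b = W - 0.01 * grad_W, b - 0.01 * grad_b
loss_after, _, _ = loss_and_grad(W, b, states, actions, weights)
```

With positive weights, the step raises the log-probabilities of the taken actions on average, which shows up as a lower surrogate loss on the same batch.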

Now, we start the TensorFlow session and alternate between simulating a batch of episodes and taking one training step.

In [42]:
#start tensorflow session
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

for _ in range(50):
    (batch_ss, batch_as, batch_ws) = play_mdp(actions, 
                                              s_ph)
    
    _ = sess.run(train_op,
                 feed_dict={
                     s_ph: np.array(batch_ss),
                     a_ph: np.array(batch_as),
                     weights_ph: np.array(batch_ws)
                 })

Episode reward: 31.425


## Reward to go

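To switch to reward-to-go, we only need to modify the weighting: each action is weighted by the rewards collected from that step onward, rather than by the reward of the whole episode.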

In [38]:
def rtg(episode_rs):
    """weights for reward-to-go policy gradient
    
    # Arguments
        episode_rs: rewards inside episode
        
    # Result
        weights for reward-to-go policy gradient
    """  
    n = len(episode_rs)
    rtgs = [0] * n
    for i in reversed(range(n)):
        rtgs[i] = episode_rs[i] + (rtgs[i+1] if i+1 < n else 0)
    return rtgs
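
A small comparison shows how the two weightings differ on the same episode. The sketch below uses an equivalent vectorized form of reward-to-go (a reversed cumulative sum). The function names are illustrative, and the total-return helper is repeated so that the snippet runs on its own.

```python
import numpy as np

def total_return_weights(episode_rs):
    """Vanilla weighting: every step gets the total reward of the episode."""
    return [np.sum(episode_rs)] * len(episode_rs)

def reward_to_go(episode_rs):
    """Reward-to-go as a reversed cumulative sum of the rewards."""
    return np.cumsum(np.asarray(episode_rs)[::-1])[::-1]

rewards = [1, 1, 1, 1]
vanilla_weights = total_return_weights(rewards)  # 4 for every step
rtg_weights = reward_to_go(rewards)
rtg_weights.tolist()  # [4, 3, 2, 1]
```

Later actions receive smaller weights because fewer rewards follow them. Rewards collected before an action no longer influence its weight.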
    

Now, we train with the modified weights.

In [41]:
#start a fresh tensorflow session and re-initialize the policy parameters
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

for _ in range(50):
    (batch_ss, batch_as, batch_ws) = play_mdp(actions, 
                                              s_ph,
                                              rtg)
    _ = sess.run(train_op,
                 feed_dict={
                     s_ph: np.array(batch_ss),
                     a_ph: np.array(batch_as),
                     weights_ph: np.array(batch_ws)
                 })

Episode reward: 27.39344262295082
