## Deriving the simplest policy gradient

We aim to maximize the expected return $J(\pi_{\theta}) = \mathop{\mathbb{E}_{\tau \sim \pi_{\theta}}} [R(\tau)]$,

where $R(\tau)$ is the finite-horizon undiscounted return.

To obtain a computable expression for the gradient, we must:
- Derive the analytical gradient of policy performance (which turns out to be an expected value)
- Form a sample estimate of that expected value, computed with data from a number of interaction steps

In notebook 1, we saw the probability of a trajectory. We can use the log-derivative trick:

<img src="https://spinningup.openai.com/en/latest/_images/math/2f10287db9a459af5467140025c35bb92e960ee3.svg"/>

and the log probability of a trajectory
<img src="https://spinningup.openai.com/en/latest/_images/math/1737476897d122c10cec3053f4d360f8b57d0a01.svg"/>

to get:

<img src="https://spinningup.openai.com/en/latest/_images/math/5661deae547ee000037bc5ecc5e3077de6ee57db.svg"/>

since the gradients of the terms with no dependence on $\theta$ (the initial-state distribution and the environment transitions) are 0.
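To make the log-derivative trick less abstract, here is a small numerical check. It is only a sketch under assumptions that are not part of this notebook: the "policy" is a plain softmax over three made-up logits, and `finite_diff_grad` is a hypothetical helper. It compares a finite-difference gradient of $P$ with $P \nabla_{\theta} \log P$.

```python
import math

def softmax(logits):
    # numerically stable softmax
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def prob(theta, a):
    # probability of action a under a softmax policy with logits theta
    return softmax(theta)[a]

def grad_log_prob(theta, a):
    # analytic result: d/dtheta_k log softmax(theta)[a] = 1[k == a] - softmax(theta)[k]
    p = softmax(theta)
    return [(1.0 if k == a else 0.0) - p[k] for k in range(len(theta))]

def finite_diff_grad(f, theta, eps=1e-6):
    # hypothetical helper: central finite differences, one coordinate at a time
    grads = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += eps
        down = list(theta)
        down[k] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

theta = [0.2, -0.5, 1.0]  # made-up logits
a = 2                     # made-up action

lhs = finite_diff_grad(lambda t: prob(t, a), theta)          # grad of P
rhs = [prob(theta, a) * g for g in grad_log_prob(theta, a)]  # P * grad of log P
max_gap = max(abs(l - r) for l, r in zip(lhs, rhs))
print(max_gap)
```

The two sides should agree up to finite-difference error, which is all the trick claims.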

***Nice*** 

Having this, we can take the gradient of the objective $J(\pi_{\theta})$ to get the policy gradient, expressed as an expectation over return-weighted grad-log-probabilities:

<img src="https://spinningup.openai.com/en/latest/_images/math/6fcf142138ce8289bb4f5d4656f3f3bf1609214d.svg"/>

***Also nice***

We can estimate the expectation above with a sample mean over a number of trajectories collected from the policy, relying on the law of large(ish) numbers.
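Before touching TensorFlow, the sample estimate can be sanity-checked on a toy problem. This is a rough sketch with heavy simplifying assumptions: every "trajectory" is a single action, the reward table `rewards` is invented, and the policy is a bare softmax over made-up logits. In that setting the exact gradient is available in closed form, so the Monte Carlo estimate can be compared against it.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

theta = [0.5, -0.3, 0.1]   # made-up policy logits
rewards = [1.0, 0.0, 0.5]  # made-up reward for each action
probs = softmax(theta)
num_actions = len(theta)

# exact gradient for this toy case: dJ/dtheta_k = pi_k * (r_k - J)
J = sum(p * r for p, r in zip(probs, rewards))
exact_grad = [probs[k] * (rewards[k] - J) for k in range(num_actions)]

# sample estimate: average of grad-log-prob(a) * R over sampled actions
rng = random.Random(0)
num_samples = 20000
estimate = [0.0] * num_actions
for _ in range(num_samples):
    a = rng.choices(range(num_actions), weights=probs)[0]
    for k in range(num_actions):
        grad_log_prob_k = (1.0 if k == a else 0.0) - probs[k]
        estimate[k] += grad_log_prob_k * rewards[a] / num_samples

max_error = max(abs(e - g) for e, g in zip(estimate, exact_grad))
print(exact_grad, estimate, max_error)
```

With more samples the estimate should tighten around the exact gradient; with few samples it is noisy, which is why the training loop below collects thousands of steps per update.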

## Gettin' dirty!

In [29]:
import tensorflow as tf
import numpy as np
import gym
from gym.spaces import Discrete, Box

from IPython import display
import matplotlib.pyplot as plt

%matplotlib inline

In [3]:
env = gym.make('CartPole-v0')

obs_dim = env.observation_space.shape[0]
n_acts = env.action_space.n

In [6]:
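def mlp(x, sizes, activation=tf.tanh, output_activation=None):
    # build a feedforward network: hidden layers first, then the output layer
    for size in sizes[:-1]:
        x = tf.layers.dense(x, units=size, activation=activation)
    return tf.layers.dense(x, units=sizes[-1], activation=output_activation)

hidden_sizes = [32]  # one hidden layer with 32 units
lr = 1e-2            # Adam learning rate
epochs = 50          # number of policy updates
batch_size = 5000    # collect at least this many environment steps per update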

### Core of the ting

In [31]:
# mlp mapping an observation to the logits of a categorical distribution over actions
obs_ph = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32)
logits = mlp(obs_ph, sizes=hidden_sizes+[n_acts])

# sample one action per observation from that distribution
actions = tf.squeeze(tf.multinomial(logits=logits, num_samples=1), axis=1)

### Create a loss function whose gradient is the policy gradient!

In [32]:
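# per-step weights (here R(tau)) and the actions that were actually taken
weights_ph = tf.placeholder(shape=(None,), dtype=tf.float32)
act_ph = tf.placeholder(shape=(None,), dtype=tf.int32)

# one-hot mask that selects the log-probability of the action taken
action_masks = tf.one_hot(act_ph, n_acts)

log_probs = tf.reduce_sum(action_masks * tf.nn.log_softmax(logits), axis=1)
# negated so that minimizing the loss ascends the policy gradient
loss = -tf.reduce_mean(weights_ph * log_probs)

optimizer = tf.train.AdamOptimizer(learning_rate=lr).minimize(loss)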
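The one-hot masking step is easier to see without the graph machinery. The snippet below is a plain-Python sketch of the same arithmetic, not the TensorFlow code itself: `log_softmax` and `one_hot` are hand-written stand-ins, and the logits, actions and weights are made-up numbers for a batch of two steps.

```python
import math

def log_softmax(row):
    # log of the softmax, computed stably via the log-sum-exp trick
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def one_hot(index, depth):
    return [1.0 if k == index else 0.0 for k in range(depth)]

logits = [[2.0, 0.0], [0.0, 0.0]]  # made-up logits, one row per step
acts = [0, 1]                      # made-up actions taken at each step
weights = [10.0, 3.0]              # made-up R(tau) weight for each step

log_probs = []
for row, act in zip(logits, acts):
    mask = one_hot(act, len(row))
    # the mask zeroes out every entry except the chosen action's log-probability
    log_probs.append(sum(m * lp for m, lp in zip(mask, log_softmax(row))))

# same form as the loss above: negative mean of weight * log-probability
loss = -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)
print(log_probs, loss)
```

Note that this "loss" is only a device for getting the right gradient; its value on its own is not a measure of how well the policy performs.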


In [33]:
# ugh tensorflow 1
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())



In [41]:
def render_():
    img = env.render(mode='rgb_array')
    plt.imshow(img)
    display.display(plt.gcf())
    display.clear_output(wait=True)
    
def train_once():
    batch_obs = []
    batch_acts = []
    batch_weights = []
    batch_rets = []
    batch_lens = []
    
    obs = env.reset()
    done = False
    ep_rews = []
    
    finished_rendering = False
    
    while True:
        
        # render only the first episode of each epoch
        if not finished_rendering:
            render_()
            
        batch_obs.append(obs.copy())
        
        act = sess.run(actions, {obs_ph: obs.reshape(1,-1)})[0]
        obs, rew, done, _ = env.step(act)
        
        batch_acts.append(act)
        ep_rews.append(rew)
        
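        if done:
            # episode finished: record its return and length
            ep_ret, ep_len = sum(ep_rews), len(ep_rews)
            batch_lens.append(ep_len)
            batch_rets.append(ep_ret)

            # the weight for each logprob(a|s) is R(tau)
            batch_weights += [ep_ret] * ep_len

            # reset episode-specific variables
            obs, done, ep_rews = env.reset(), False, []

            # stop rendering after the first episode of this epoch
            finished_rendering = True

            # end the batch once enough experience has been collected
            if len(batch_obs) > batch_size:
                break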
                
    batch_loss, _ = sess.run([loss, optimizer],
                            feed_dict={
                                obs_ph: np.array(batch_obs),
                                act_ph: np.array(batch_acts),
                                weights_ph: np.array(batch_weights)
                            })
    return batch_loss, batch_rets, batch_lens     
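The weighting line `batch_weights += [ep_ret] * ep_len` is worth a closer look: every step in an episode gets that episode's full return as its weight. A tiny standalone illustration, with made-up reward lists and a hypothetical helper `episode_weights` that does not appear in the notebook:

```python
def episode_weights(ep_rews):
    # hypothetical helper: repeat the episode return once per step
    ep_ret = sum(ep_rews)
    return [ep_ret] * len(ep_rews)

batch_weights = []
# two made-up episodes, with 3 and 2 steps of reward 1.0 each
for ep_rews in ([1.0, 1.0, 1.0], [1.0, 1.0]):
    batch_weights += episode_weights(ep_rews)

print(batch_weights)  # [3.0, 3.0, 3.0, 2.0, 2.0]
```

This keeps `batch_weights` aligned index by index with `batch_obs` and `batch_acts`, which is what the feed dict in `train_once` relies on.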
        

### Train!

In [42]:
for i in range(epochs):
    batch_loss, batch_rets, batch_lens = train_once()
    print('epoch: {}, loss: {}, return: {}, ep_len: {}'.format(
        i, batch_loss, np.mean(batch_rets), np.mean(batch_lens)))

NameError: name 'base' is not defined