# Simple Reinforcement Learning in Tensorflow Part 2: Policy Gradient Method
This tutorial contains a simple example of how to build a policy-gradient based agent that can solve the CartPole problem. For more information, see this [Medium post](https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-2-ded33892c724#.mtwpvfi8b).

For more Reinforcement Learning algorithms, including DQN and Model-based learning in Tensorflow, see my Github repo, [DeepRL-Agents](https://github.com/awjuliani/DeepRL-Agents). 

Parts of this tutorial are based on code by [Andrej Karpathy](https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5) and [korymath](https://gym.openai.com/evaluations/eval_a0aVJrGSyW892vBM04HQA).

In [2]:
from __future__ import division

import numpy as np
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3
import tensorflow as tf
%matplotlib inline
import matplotlib.pyplot as plt
import math

try:
    xrange = xrange  # Python 2
except NameError:
    xrange = range  # Python 3

### Loading the CartPole Environment
If you don't already have OpenAI Gym installed, use `pip install gym` to grab it.

In [3]:
import gym
env = gym.make('CartPole-v0')

What happens if we try running the environment with random actions? How well do we do? (Hint: not so well.)

In [3]:
env.reset()
random_episodes = 0
reward_sum = 0
while random_episodes < 10:
    env.render()
    observation, reward, done, _ = env.step(np.random.randint(0,2))
    reward_sum += reward
    if done:
        random_episodes += 1
        print("Reward for this episode was:",reward_sum)
        reward_sum = 0
        env.reset()

Reward for this episode was: 42.0
Reward for this episode was: 16.0
Reward for this episode was: 26.0
Reward for this episode was: 17.0
Reward for this episode was: 39.0
Reward for this episode was: 10.0
Reward for this episode was: 19.0
Reward for this episode was: 17.0
Reward for this episode was: 18.0
Reward for this episode was: 21.0


The goal of the task is to achieve a reward of 200 per episode. For every step the agent keeps the pole in the air, the agent receives a +1 reward. By randomly choosing actions, our reward for each episode is only a couple dozen. Let's make that better with RL!

### Setting up our Neural Network agent
This time we will be using a Policy neural network that takes observations, passes them through a single hidden layer, and then produces a probability of choosing a left/right movement. To learn more about this network, see [Andrej Karpathy's blog on Policy Gradient networks](http://karpathy.github.io/2016/05/31/rl/).
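
To make the shape of this network concrete, here is a minimal NumPy sketch of the forward pass: four observation values, one ReLU hidden layer, and a sigmoid output. The function name `policy_forward`, the random weights, and the sample observation are assumptions made for illustration; the TensorFlow cells in this tutorial define the actual model.

```python
import numpy as np

def policy_forward(observation, W1, W2):
    """Return the network's output probability for a batch of observations."""
    hidden = np.maximum(0.0, observation @ W1)  # ReLU hidden layer (no bias term)
    score = hidden @ W2                         # one score per observation
    return 1.0 / (1.0 + np.exp(-score))         # sigmoid squashes the score into (0, 1)

rng = np.random.default_rng(0)
D, H = 4, 10                                    # input size and hidden size
W1 = rng.normal(scale=0.5, size=(D, H))         # illustrative random weights
W2 = rng.normal(scale=0.5, size=(H, 1))
obs = np.array([[0.01, -0.02, 0.03, 0.04]])     # made-up CartPole-like observation
prob = policy_forward(obs, W1, W2)
print(prob.shape)
```

The output is a single number per observation, which the agent can treat as the probability of picking one of the two actions.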

In [4]:
# hyperparameters
H = 10 # number of hidden layer neurons
batch_size = 5 # every how many episodes to do a param update?
learning_rate = 1e-2 # feel free to play with this to train faster or more stably.
gamma = 0.99 # discount factor for reward

D = 4 # input dimensionality

In [5]:
tf.reset_default_graph()

#This defines the network as it goes from taking an observation of the environment to 
#giving a probability of choosing the action of moving left or right.
observations = tf.placeholder(tf.float32, [None,D] , name="input_x")
W1 = tf.get_variable("W1", shape=[D, H],
           initializer=tf.contrib.layers.xavier_initializer())
layer1 = tf.nn.relu(tf.matmul(observations,W1))
W2 = tf.get_variable("W2", shape=[H, 1],
           initializer=tf.contrib.layers.xavier_initializer())
score = tf.matmul(layer1,W2)
probability = tf.nn.sigmoid(score)

#From here we define the parts of the network needed for learning a good policy.
tvars = tf.trainable_variables()
input_y = tf.placeholder(tf.float32,[None,1], name="input_y")
advantages = tf.placeholder(tf.float32,name="reward_signal")

# The loss function. This sends the weights in the direction of making actions 
# that gave good advantage (reward over time) more likely, and actions that didn't less likely.
loglik = tf.log(input_y*(input_y - probability) + (1 - input_y)*(input_y + probability))
loss = -tf.reduce_mean(loglik * advantages) 
newGrads = tf.gradients(loss,tvars)

# Once we have collected a series of gradients from multiple episodes, we apply them.
# We don't apply gradients after every episode, in order to account for noise in the reward signal.
adam = tf.train.AdamOptimizer(learning_rate=learning_rate) # Our optimizer
W1Grad = tf.placeholder(tf.float32,name="batch_grad1") # Placeholders to send the final gradients through when we update.
W2Grad = tf.placeholder(tf.float32,name="batch_grad2")
batchGrad = [W1Grad,W2Grad]
updateGrads = adam.apply_gradients(zip(batchGrad,tvars))
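
The expression inside `tf.log` in the loss can look opaque. The small NumPy check below, a sketch rather than part of the model, shows that it reduces to the probability of the action that was actually taken. It assumes `input_y` is the "fake label" used later in the training loop, which is 1 when action 0 was chosen and 0 when action 1 was chosen. The helper name `chosen_action_prob` is made up for this example.

```python
import numpy as np

def chosen_action_prob(input_y, probability):
    """Same arithmetic as the argument of tf.log in the loss above."""
    return input_y * (input_y - probability) + (1 - input_y) * (input_y + probability)

p = 0.8  # example network output: probability of action 1

# Fake label 1 means action 0 was taken, so the expression gives 1 - p.
prob_when_action_0 = chosen_action_prob(1.0, p)
# Fake label 0 means action 1 was taken, so the expression gives p.
prob_when_action_1 = chosen_action_prob(0.0, p)

print(prob_when_action_0, prob_when_action_1)
```

Taking the log of this value and weighting it by the advantage is what pushes up the probability of actions that were followed by good outcomes.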

### Advantage function
This function lets us weight the rewards our agent receives. In the context of the CartPole task, we want actions that kept the pole in the air for a long time to get a large reward, and actions that contributed to the pole falling to get a reduced or negative reward. We do this by discounting the rewards backwards from the end of the episode, so each step's value is the discounted sum of the rewards that followed it. Once these values are normalized in the training loop, actions near the end of the episode are seen as negative, since they likely contributed to the pole falling and the episode ending. Likewise, early actions are seen as more positive, since they weren't responsible for the pole falling.

In [6]:
def discount_rewards(r):
    """ take 1D float array of rewards and compute discounted reward """
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(xrange(0, r.size)):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r
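
As a small worked example, the sketch below repeats the function on a made-up three-step episode with `gamma = 0.99` and then normalizes the result the way the training loop does (subtract the mean, divide by the standard deviation). The three-step reward array is an assumption chosen to keep the arithmetic easy to follow.

```python
import numpy as np

gamma = 0.99

def discount_rewards(r):
    """Discounted sum of future rewards for each step of a 1D float array."""
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(r.size)):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r

rewards = np.array([1.0, 1.0, 1.0])   # +1 for each step the pole stayed up
returns = discount_rewards(rewards)   # [2.9701, 1.99, 1.0]

# Normalize to zero mean and unit standard deviation.
advantages = (returns - returns.mean()) / returns.std()
print(returns, advantages)
```

After normalization the first step has a positive advantage and the last step a negative one, which matches the intuition described above.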

### Running the Agent and Environment

Here we run the neural network agent, and have it act in the CartPole environment.
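
The training loop in the next cell sums the gradients from several episodes and only applies them every `batch_size` episodes. The sketch below isolates that accumulate-then-apply pattern using NumPy arrays in place of TensorFlow tensors. The random "gradients" and the `updates_applied` counter are stand-ins invented for illustration, not values produced by the real network.

```python
import numpy as np

batch_size = 5
# Buffers with the same shapes as W1 (4x10) and W2 (10x1).
grad_buffer = [np.zeros((4, 10)), np.zeros((10, 1))]
updates_applied = 0

rng = np.random.default_rng(0)
for episode_number in range(1, 11):
    # Stand-in for the per-episode gradients of the loss.
    episode_grads = [rng.normal(size=g.shape) for g in grad_buffer]
    for ix, grad in enumerate(episode_grads):
        grad_buffer[ix] += grad                  # accumulate across episodes

    if episode_number % batch_size == 0:
        updates_applied += 1                     # the optimizer step would go here
        for ix in range(len(grad_buffer)):
            grad_buffer[ix] = np.zeros_like(grad_buffer[ix])  # reset the buffer

print(updates_applied)
```

Summing over a few episodes before each update averages out some of the noise in any single episode's reward signal.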

In [9]:
xs,hs,dlogps,drs,ys,tfps = [],[],[],[],[],[]
running_reward = None
reward_sum = 0
episode_number = 1
total_episodes = 10000
init = tf.global_variables_initializer()

#tf settings for incremental memory
config = tf.ConfigProto()
config.gpu_options.allow_growth = True



# Launch the graph
sess = tf.Session(config=config)
rendering = False
sess.run(init)
observation = env.reset() # Obtain an initial observation of the environment

# Reset the gradient placeholder. We will collect gradients in 
# gradBuffer until we are ready to update our policy network. 
gradBuffer = sess.run(tvars)
for ix,grad in enumerate(gradBuffer):
    gradBuffer[ix] = grad * 0

while episode_number <= total_episodes:
    # Rendering the environment slows things down, 
    # so let's only look at it once our agent is doing a good job.
    if reward_sum/batch_size > 100 or rendering == True : 
#             env.render()
        rendering = True

    # Make sure the observation is in a shape the network can handle.
    x = np.reshape(observation,[1,D])

    # Run the policy network and get an action to take. 
    tfprob = sess.run(probability,feed_dict={observations: x})
    action = 1 if np.random.uniform() < tfprob else 0

    xs.append(x) # observation
    y = 1 if action == 0 else 0 # a "fake label"
    ys.append(y)

    # step the environment and get new measurements
    observation, reward, done, info = env.step(action)
    reward_sum += reward

    drs.append(reward) # record reward (has to be done after we call step() to get reward for previous action)

    if done: 
        episode_number += 1
        # stack together all inputs, fake labels, and rewards for this episode
        epx = np.vstack(xs)
        epy = np.vstack(ys)
        epr = np.vstack(drs)
        tfp = tfps
        xs,hs,dlogps,drs,ys,tfps = [],[],[],[],[],[] # reset array memory

        # compute the discounted reward backwards through time
        discounted_epr = discount_rewards(epr)
        # size the rewards to be unit normal (helps control the gradient estimator variance)
        discounted_epr -= np.mean(discounted_epr)
        discounted_epr /= np.std(discounted_epr)  # true division; floor division would round the advantages to whole numbers

        # Get the gradient for this episode, and save it in the gradBuffer
        tGrad = sess.run(newGrads,feed_dict={observations: epx, input_y: epy, advantages: discounted_epr})
        for ix,grad in enumerate(tGrad):
            gradBuffer[ix] += grad

        # If we have completed enough episodes, then update the policy network with our gradients.
        if episode_number % batch_size == 0: 
            sess.run(updateGrads,feed_dict={W1Grad: gradBuffer[0],W2Grad:gradBuffer[1]})
            for ix,grad in enumerate(gradBuffer):
                gradBuffer[ix] = grad * 0

            # Give a summary of how well our network is doing for each batch of episodes.
            running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
            print('Average reward for episode %f.  Total average reward %f.' % (reward_sum//batch_size, running_reward//batch_size))
            print(episode_number)
            if reward_sum//batch_size > 200: 
                print("Task solved in",episode_number,'episodes!')
                break

            reward_sum = 0

        observation = env.reset()
#         #weights saver
#         saver = tf.train.Saver()
#         saver_path = saver.save(sess,"./PG_cartpole_weights.ckpt")
# Keep the session open here: the next cell saves the weights, and the session is closed after that.
print(episode_number,'Episodes completed.')

Average reward for episode 11.000000.  Total average reward 11.000000.
5
Average reward for episode 21.000000.  Total average reward 11.000000.
10
Average reward for episode 21.000000.  Total average reward 11.000000.
15
Average reward for episode 20.000000.  Total average reward 11.000000.
20
Average reward for episode 31.000000.  Total average reward 12.000000.
25
Average reward for episode 16.000000.  Total average reward 12.000000.
30
Average reward for episode 23.000000.  Total average reward 12.000000.
35
Average reward for episode 20.000000.  Total average reward 12.000000.
40
Average reward for episode 18.000000.  Total average reward 12.000000.
45
Average reward for episode 37.000000.  Total average reward 12.000000.
50
Average reward for episode 24.000000.  Total average reward 12.000000.
55
Average reward for episode 18.000000.  Total average reward 12.000000.
60
Average reward for episode 25.000000.  Total average reward 12.000000.
65
Average reward for episode 32.000000.  

Average reward for episode 59.000000.  Total average reward 39.000000.
560
Average reward for episode 43.000000.  Total average reward 39.000000.
565
Average reward for episode 67.000000.  Total average reward 39.000000.
570
Average reward for episode 85.000000.  Total average reward 40.000000.
575
Average reward for episode 92.000000.  Total average reward 40.000000.
580
Average reward for episode 115.000000.  Total average reward 41.000000.
585
Average reward for episode 99.000000.  Total average reward 41.000000.
590
Average reward for episode 69.000000.  Total average reward 42.000000.
595
Average reward for episode 98.000000.  Total average reward 42.000000.
600
Average reward for episode 80.000000.  Total average reward 43.000000.
605
Average reward for episode 44.000000.  Total average reward 43.000000.
610
Average reward for episode 54.000000.  Total average reward 43.000000.
615
Average reward for episode 74.000000.  Total average reward 43.000000.
620
Average reward for episo

Average reward for episode 126.000000.  Total average reward 87.000000.
1105
Average reward for episode 143.000000.  Total average reward 87.000000.
1110
Average reward for episode 115.000000.  Total average reward 88.000000.
1115
Average reward for episode 110.000000.  Total average reward 88.000000.
1120
Average reward for episode 156.000000.  Total average reward 88.000000.
1125
Average reward for episode 142.000000.  Total average reward 89.000000.
1130
Average reward for episode 126.000000.  Total average reward 89.000000.
1135
Average reward for episode 145.000000.  Total average reward 90.000000.
1140
Average reward for episode 143.000000.  Total average reward 90.000000.
1145
Average reward for episode 147.000000.  Total average reward 91.000000.
1150
Average reward for episode 122.000000.  Total average reward 91.000000.
1155
Average reward for episode 162.000000.  Total average reward 92.000000.
1160
Average reward for episode 167.000000.  Total average reward 93.000000.
1165

Average reward for episode 196.000000.  Total average reward 142.000000.
1635
Average reward for episode 186.000000.  Total average reward 142.000000.
1640
Average reward for episode 200.000000.  Total average reward 143.000000.
1645
Average reward for episode 190.000000.  Total average reward 143.000000.
1650
Average reward for episode 200.000000.  Total average reward 144.000000.
1655
Average reward for episode 189.000000.  Total average reward 144.000000.
1660
Average reward for episode 200.000000.  Total average reward 145.000000.
1665
Average reward for episode 200.000000.  Total average reward 145.000000.
1670
Average reward for episode 200.000000.  Total average reward 146.000000.
1675
Average reward for episode 196.000000.  Total average reward 146.000000.
1680
Average reward for episode 200.000000.  Total average reward 147.000000.
1685
Average reward for episode 187.000000.  Total average reward 147.000000.
1690
Average reward for episode 187.000000.  Total average reward 148

Average reward for episode 191.000000.  Total average reward 177.000000.
2165
Average reward for episode 181.000000.  Total average reward 177.000000.
2170
Average reward for episode 198.000000.  Total average reward 177.000000.
2175
Average reward for episode 200.000000.  Total average reward 177.000000.
2180
Average reward for episode 187.000000.  Total average reward 177.000000.
2185
Average reward for episode 200.000000.  Total average reward 177.000000.
2190
Average reward for episode 175.000000.  Total average reward 177.000000.
2195
Average reward for episode 200.000000.  Total average reward 178.000000.
2200
Average reward for episode 200.000000.  Total average reward 178.000000.
2205
Average reward for episode 189.000000.  Total average reward 178.000000.
2210
Average reward for episode 180.000000.  Total average reward 178.000000.
2215
Average reward for episode 200.000000.  Total average reward 178.000000.
2220
Average reward for episode 200.000000.  Total average reward 178

Average reward for episode 198.000000.  Total average reward 189.000000.
2695
Average reward for episode 200.000000.  Total average reward 190.000000.
2700
Average reward for episode 200.000000.  Total average reward 190.000000.
2705
Average reward for episode 200.000000.  Total average reward 190.000000.
2710
Average reward for episode 198.000000.  Total average reward 190.000000.
2715
Average reward for episode 200.000000.  Total average reward 190.000000.
2720
Average reward for episode 194.000000.  Total average reward 190.000000.
2725
Average reward for episode 197.000000.  Total average reward 190.000000.
2730
Average reward for episode 200.000000.  Total average reward 190.000000.
2735
Average reward for episode 200.000000.  Total average reward 190.000000.
2740
Average reward for episode 200.000000.  Total average reward 190.000000.
2745
Average reward for episode 200.000000.  Total average reward 190.000000.
2750
Average reward for episode 197.000000.  Total average reward 190

Average reward for episode 196.000000.  Total average reward 190.000000.
3225
Average reward for episode 195.000000.  Total average reward 190.000000.
3230
Average reward for episode 191.000000.  Total average reward 190.000000.
3235
Average reward for episode 200.000000.  Total average reward 190.000000.
3240
Average reward for episode 191.000000.  Total average reward 190.000000.
3245
Average reward for episode 193.000000.  Total average reward 190.000000.
3250
Average reward for episode 199.000000.  Total average reward 190.000000.
3255
Average reward for episode 183.000000.  Total average reward 190.000000.
3260
Average reward for episode 189.000000.  Total average reward 190.000000.
3265
Average reward for episode 188.000000.  Total average reward 190.000000.
3270
Average reward for episode 195.000000.  Total average reward 190.000000.
3275
Average reward for episode 200.000000.  Total average reward 190.000000.
3280
Average reward for episode 182.000000.  Total average reward 190

Average reward for episode 183.000000.  Total average reward 191.000000.
3755
Average reward for episode 200.000000.  Total average reward 191.000000.
3760
Average reward for episode 197.000000.  Total average reward 191.000000.
3765
Average reward for episode 199.000000.  Total average reward 191.000000.
3770
Average reward for episode 197.000000.  Total average reward 191.000000.
3775
Average reward for episode 185.000000.  Total average reward 191.000000.
3780
Average reward for episode 200.000000.  Total average reward 191.000000.
3785
Average reward for episode 199.000000.  Total average reward 191.000000.
3790
Average reward for episode 200.000000.  Total average reward 191.000000.
3795
Average reward for episode 191.000000.  Total average reward 191.000000.
3800
Average reward for episode 181.000000.  Total average reward 191.000000.
3805
Average reward for episode 192.000000.  Total average reward 191.000000.
3810
Average reward for episode 200.000000.  Total average reward 191

Average reward for episode 200.000000.  Total average reward 195.000000.
4285
Average reward for episode 200.000000.  Total average reward 195.000000.
4290
Average reward for episode 200.000000.  Total average reward 195.000000.
4295
Average reward for episode 200.000000.  Total average reward 195.000000.
4300
Average reward for episode 200.000000.  Total average reward 195.000000.
4305
Average reward for episode 200.000000.  Total average reward 195.000000.
4310
Average reward for episode 200.000000.  Total average reward 195.000000.
4315
Average reward for episode 200.000000.  Total average reward 195.000000.
4320
Average reward for episode 200.000000.  Total average reward 195.000000.
4325
Average reward for episode 200.000000.  Total average reward 195.000000.
4330
Average reward for episode 178.000000.  Total average reward 195.000000.
4335
Average reward for episode 200.000000.  Total average reward 195.000000.
4340
Average reward for episode 200.000000.  Total average reward 195

Average reward for episode 195.000000.  Total average reward 196.000000.
4815
Average reward for episode 200.000000.  Total average reward 196.000000.
4820
Average reward for episode 197.000000.  Total average reward 196.000000.
4825
Average reward for episode 200.000000.  Total average reward 196.000000.
4830
Average reward for episode 198.000000.  Total average reward 196.000000.
4835
Average reward for episode 200.000000.  Total average reward 196.000000.
4840
Average reward for episode 200.000000.  Total average reward 197.000000.
4845
Average reward for episode 200.000000.  Total average reward 197.000000.
4850
Average reward for episode 200.000000.  Total average reward 197.000000.
4855
Average reward for episode 200.000000.  Total average reward 197.000000.
4860
Average reward for episode 200.000000.  Total average reward 197.000000.
4865
Average reward for episode 200.000000.  Total average reward 197.000000.
4870
Average reward for episode 200.000000.  Total average reward 197

Average reward for episode 200.000000.  Total average reward 198.000000.
5345
Average reward for episode 200.000000.  Total average reward 198.000000.
5350
Average reward for episode 200.000000.  Total average reward 198.000000.
5355
Average reward for episode 199.000000.  Total average reward 198.000000.
5360
Average reward for episode 200.000000.  Total average reward 198.000000.
5365
Average reward for episode 194.000000.  Total average reward 198.000000.
5370
Average reward for episode 200.000000.  Total average reward 198.000000.
5375
Average reward for episode 200.000000.  Total average reward 198.000000.
5380
Average reward for episode 200.000000.  Total average reward 198.000000.
5385
Average reward for episode 190.000000.  Total average reward 198.000000.
5390
Average reward for episode 200.000000.  Total average reward 198.000000.
5395
Average reward for episode 200.000000.  Total average reward 198.000000.
5400
Average reward for episode 191.000000.  Total average reward 198

Average reward for episode 200.000000.  Total average reward 199.000000.
5875
Average reward for episode 200.000000.  Total average reward 199.000000.
5880
Average reward for episode 200.000000.  Total average reward 199.000000.
5885
Average reward for episode 200.000000.  Total average reward 199.000000.
5890
Average reward for episode 200.000000.  Total average reward 199.000000.
5895
Average reward for episode 200.000000.  Total average reward 199.000000.
5900
Average reward for episode 200.000000.  Total average reward 199.000000.
5905
Average reward for episode 200.000000.  Total average reward 199.000000.
5910
Average reward for episode 200.000000.  Total average reward 199.000000.
5915
Average reward for episode 200.000000.  Total average reward 199.000000.
5920
Average reward for episode 200.000000.  Total average reward 199.000000.
5925
Average reward for episode 200.000000.  Total average reward 199.000000.
5930
Average reward for episode 200.000000.  Total average reward 199

Average reward for episode 200.000000.  Total average reward 199.000000.
6405
Average reward for episode 200.000000.  Total average reward 199.000000.
6410
Average reward for episode 191.000000.  Total average reward 199.000000.
6415
Average reward for episode 200.000000.  Total average reward 199.000000.
6420
Average reward for episode 200.000000.  Total average reward 199.000000.
6425
Average reward for episode 200.000000.  Total average reward 199.000000.
6430
Average reward for episode 183.000000.  Total average reward 199.000000.
6435
Average reward for episode 200.000000.  Total average reward 199.000000.
6440
Average reward for episode 200.000000.  Total average reward 199.000000.
6445
Average reward for episode 200.000000.  Total average reward 199.000000.
6450
Average reward for episode 200.000000.  Total average reward 199.000000.
6455
Average reward for episode 198.000000.  Total average reward 199.000000.
6460
Average reward for episode 200.000000.  Total average reward 199

Average reward for episode 197.000000.  Total average reward 197.000000.
6935
Average reward for episode 186.000000.  Total average reward 197.000000.
6940
Average reward for episode 200.000000.  Total average reward 197.000000.
6945
Average reward for episode 199.000000.  Total average reward 197.000000.
6950
Average reward for episode 200.000000.  Total average reward 197.000000.
6955
Average reward for episode 200.000000.  Total average reward 197.000000.
6960
Average reward for episode 200.000000.  Total average reward 197.000000.
6965
Average reward for episode 200.000000.  Total average reward 197.000000.
6970
Average reward for episode 194.000000.  Total average reward 197.000000.
6975
Average reward for episode 200.000000.  Total average reward 197.000000.
6980
Average reward for episode 200.000000.  Total average reward 197.000000.
6985
Average reward for episode 200.000000.  Total average reward 198.000000.
6990
Average reward for episode 200.000000.  Total average reward 198

Average reward for episode 200.000000.  Total average reward 199.000000.
7465
Average reward for episode 200.000000.  Total average reward 199.000000.
7470
Average reward for episode 200.000000.  Total average reward 199.000000.
7475
Average reward for episode 200.000000.  Total average reward 199.000000.
7480
Average reward for episode 200.000000.  Total average reward 199.000000.
7485
Average reward for episode 200.000000.  Total average reward 199.000000.
7490
Average reward for episode 200.000000.  Total average reward 199.000000.
7495
Average reward for episode 200.000000.  Total average reward 199.000000.
7500
Average reward for episode 200.000000.  Total average reward 199.000000.
7505
Average reward for episode 200.000000.  Total average reward 199.000000.
7510
Average reward for episode 200.000000.  Total average reward 199.000000.
7515
Average reward for episode 200.000000.  Total average reward 199.000000.
7520
Average reward for episode 200.000000.  Total average reward 199

Average reward for episode 200.000000.  Total average reward 199.000000.
7995
Average reward for episode 200.000000.  Total average reward 199.000000.
8000
Average reward for episode 200.000000.  Total average reward 199.000000.
8005
Average reward for episode 200.000000.  Total average reward 199.000000.
8010
Average reward for episode 200.000000.  Total average reward 199.000000.
8015
Average reward for episode 200.000000.  Total average reward 199.000000.
8020
Average reward for episode 200.000000.  Total average reward 199.000000.
8025
Average reward for episode 196.000000.  Total average reward 199.000000.
8030
Average reward for episode 200.000000.  Total average reward 199.000000.
8035
Average reward for episode 199.000000.  Total average reward 199.000000.
8040
Average reward for episode 200.000000.  Total average reward 199.000000.
8045
Average reward for episode 200.000000.  Total average reward 199.000000.
8050
Average reward for episode 200.000000.  Total average reward 199

Average reward for episode 177.000000.  Total average reward 182.000000.
8525
Average reward for episode 186.000000.  Total average reward 182.000000.
8530
Average reward for episode 199.000000.  Total average reward 182.000000.
8535
Average reward for episode 195.000000.  Total average reward 182.000000.
8540
Average reward for episode 176.000000.  Total average reward 182.000000.
8545
Average reward for episode 183.000000.  Total average reward 182.000000.
8550
Average reward for episode 183.000000.  Total average reward 182.000000.
8555
Average reward for episode 182.000000.  Total average reward 182.000000.
8560
Average reward for episode 180.000000.  Total average reward 182.000000.
8565
Average reward for episode 189.000000.  Total average reward 182.000000.
8570
Average reward for episode 200.000000.  Total average reward 182.000000.
8575
Average reward for episode 195.000000.  Total average reward 182.000000.
8580
Average reward for episode 191.000000.  Total average reward 182

Average reward for episode 198.000000.  Total average reward 185.000000.
9055
Average reward for episode 193.000000.  Total average reward 185.000000.
9060
Average reward for episode 195.000000.  Total average reward 185.000000.
9065
Average reward for episode 186.000000.  Total average reward 185.000000.
9070
Average reward for episode 186.000000.  Total average reward 185.000000.
9075
Average reward for episode 191.000000.  Total average reward 185.000000.
9080
Average reward for episode 198.000000.  Total average reward 185.000000.
9085
Average reward for episode 200.000000.  Total average reward 186.000000.
9090
Average reward for episode 192.000000.  Total average reward 186.000000.
9095
Average reward for episode 191.000000.  Total average reward 186.000000.
9100
Average reward for episode 185.000000.  Total average reward 186.000000.
9105
Average reward for episode 184.000000.  Total average reward 186.000000.
9110
Average reward for episode 196.000000.  Total average reward 186

Average reward for episode 168.000000.  Total average reward 191.000000.
9585
Average reward for episode 180.000000.  Total average reward 191.000000.
9590
Average reward for episode 189.000000.  Total average reward 191.000000.
9595
Average reward for episode 192.000000.  Total average reward 191.000000.
9600
Average reward for episode 185.000000.  Total average reward 191.000000.
9605
Average reward for episode 186.000000.  Total average reward 191.000000.
9610
Average reward for episode 194.000000.  Total average reward 191.000000.
9615
Average reward for episode 158.000000.  Total average reward 191.000000.
9620
Average reward for episode 186.000000.  Total average reward 191.000000.
9625
Average reward for episode 200.000000.  Total average reward 191.000000.
9630
Average reward for episode 176.000000.  Total average reward 191.000000.
9635
Average reward for episode 167.000000.  Total average reward 190.000000.
9640
Average reward for episode 166.000000.  Total average reward 190

In [10]:
saver = tf.train.Saver()
saver_path = saver.save(sess,"./PG_cartpole_weights.ckpt")

In [11]:
sess.close()

In [14]:
# inference logic
sess = tf.Session(config=config)
saver = tf.train.Saver()
saver.restore(sess,"./PG_cartpole_weights.ckpt")


INFO:tensorflow:Restoring parameters from ./PG_cartpole_weights.ckpt


In [16]:
observation = env.reset()
x = np.reshape(observation,[1,D])

In [18]:
tf_prob = sess.run(probability,feed_dict={observations:x})

In [20]:
action = 1 if tf_prob > 0.5 else 0

In [26]:
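observation = env.reset()
env.mode = "human"
random_episodes = 0  # counts evaluation episodes (name kept from the random-action cell)
reward_sum = 0
while random_episodes < 100:
    env.render()
    # Act greedily on the current observation.
    x = np.reshape(observation,[1,D])
    action = 1 if sess.run(probability,feed_dict={observations:x}) > 0.5 else 0
    observation, reward, done, _ = env.step(action)
    reward_sum += reward
    if done:
        random_episodes += 1
        print("Reward for this episode was:",reward_sum)
        reward_sum = 0
        observation = env.reset()  # start the next episode from its real first observation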

Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this episode was: 200.0
Reward for this epis

As you can see, the network not only does much better than random actions, but achieves the goal of 200 points per episode, thus solving the task!

In [20]:
episodic_average = []
total_average = []
with open("PG_rewards.txt") as f:
    for i in f.readlines():
        sample = i.split("  ")
        episodic_average.append(float(sample[0][27:-1]))#,sample[1][21]
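        # Skip the 21-character prefix, then drop the newline and trailing period before converting.
        total_average.append(float(sample[1][21:].strip().rstrip(".")))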

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 18.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 19.000000.

 20.000000.

 20.000000.

 20.000000.

 20.000000.

 20.000000.

 20.000000.

 20.000000.

 20.000000.

 20.000000.

 20.000000.

 20.000000.

 20.000000.

 20.000000.

 20.000000.

 20.000000.

 21.000000.

 21.000000.

 21.000000.

 21.000000.

 21.000000.

 21.000000.

 21.000000.

 22.000000.

 22.000000.

 22.000000.

 22.000000.

 22.000000.

 22.000000.

 22.000000.

 22.000000.

 22.000000.

 22.000000.

 22.000000.

 22.000000.

 22.000000.

 23.000000.

 23.000000.

 23.000000.

 23.000000.

 23.000000.

 23.000000.

 23.000000.
