In [1]:
from IPython.core.display import HTML  #For a more pleasing rendering...
HTML(open("styles/custom.css").read()) #When run in your local notebook.

## 2.2 Policy Gradient

We introduce the topic with two **bandit problems** and then move on to the **OpenAI Gym**.


<img src="./images/example_structure_policy.png" height="400" />

### Example: n-armed Bandit

We have a slot machine with $n$ arms, in this case $4$. Each arm has a different chance of giving a positive reward; our goal is to maximize the total reward over time.

 <font>All we need to focus on is learning which rewards we get for each of the possible actions and ensuring that we choose the optimal ones. In the context of reinforcement learning, this is called learning a policy. We are going to use a method called policy gradients, where our simple neural network learns a policy for picking actions by adjusting its weights through gradient descent using feedback from the environment.</font>
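To make the idea concrete before bringing in TensorFlow, here is a minimal sketch of the same kind of update in plain Python. It is not the notebook's implementation: the function names, the learning rate, and the `win_probs` environment (each arm pays +1 with a fixed probability, otherwise -1) are assumptions chosen for illustration, and it uses plain gradient ascent rather than Adam.

```python
import math
import random

def softmax(weights):
    """Turn a list of weights into action probabilities."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def train_bandit(win_probs, episodes=5000, lr=0.05, seed=0):
    """Learn one weight per arm with a REINFORCE-style update."""
    rng = random.Random(seed)
    weights = [0.0] * len(win_probs)
    for _ in range(episodes):
        probs = softmax(weights)
        # Sample an action from the current policy.
        action = rng.choices(range(len(weights)), weights=probs)[0]
        # Hypothetical environment: +1 with probability win_probs[action], else -1.
        reward = 1.0 if rng.random() < win_probs[action] else -1.0
        # The gradient of log pi(action) w.r.t. the weights is onehot(action) - probs.
        for k in range(len(weights)):
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            weights[k] += lr * reward * grad_log
    return weights

learned = train_bandit([0.4, 0.3, 0.4, 0.95])
best_arm = max(range(len(learned)), key=lambda k: learned[k])
print("learned weights:", [round(w, 2) for w in learned])
print("preferred arm (0-based):", best_arm)
```

Actions that were followed by a positive reward become more likely, actions followed by a negative reward less likely; this is all the "policy" amounts to in the stateless bandit setting.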

In [12]:
import tensorflow as tf
import tensorflow.contrib.slim as slim
import numpy as np

In [13]:
# List out our bandit arms.
# Each value is a threshold, not a probability: the lower the threshold,
# the more likely a positive reward. Arm 4 (index 3) is currently the best.
bandit_arms = [0.2,0.5,0.3,-2] # Reward thresholds, one per arm
num_arms = len(bandit_arms) # Number of arms

# Pull one arm and return a reward of +1 or -1.
def pullBandit(bandit):
    # Draw a sample from the standard normal distribution.
    result = np.random.randn(1)
    if result > bandit:
        # Return a positive reward.
        return 1
    else:
        # Return a negative reward.
        return -1
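Because `pullBandit` compares a standard normal sample against the arm's value, each value acts as a threshold. The short sketch below converts the thresholds into the actual probability of a positive reward, $P(Z > \text{threshold})$; the helper name `positive_reward_probability` is introduced here only for illustration.

```python
import math

def positive_reward_probability(threshold):
    """P(Z > threshold) for a standard normal Z, via the error function."""
    return 0.5 * (1.0 - math.erf(threshold / math.sqrt(2.0)))

bandit_arms = [0.2, 0.5, 0.3, -2]  # thresholds from the cell above
win_probs = [positive_reward_probability(t) for t in bandit_arms]
for arm, p in enumerate(win_probs, start=1):
    print("arm %d: P(reward = +1) = %.3f" % (arm, p))
```

Arm 4 wins roughly 98% of the time, while the other three arms win less than half the time, which is why the agent should end up preferring arm 4.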

In [14]:
tf.reset_default_graph()

# These two lines establish the feed-forward part of the network. 
# Note that it is just one layer and we don't have any states as input. 
weights = tf.Variable(tf.ones([num_arms]))
output = tf.nn.softmax(weights)

# The next six lines establish the training procedure. 
# We feed the reward and chosen action into the network
# to compute the loss, and use it to update the network.
reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
action_holder = tf.placeholder(shape=[1],dtype=tf.int32)

responsible_output = tf.slice(output,action_holder,[1])
loss = -(tf.log(responsible_output)*reward_holder)
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
update = optimizer.minimize(loss)
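The effect of this loss can be worked out by hand. Writing $\pi = \mathrm{softmax}(w)$ for the output, $a$ for the chosen arm and $r$ for the reward (notation introduced here, not used in the code), the derivation below shows the gradient for a plain gradient-descent step with step size $\alpha$. The cell above uses Adam, which rescales the step but follows the same gradient.

```latex
L(w) = -\, r \,\log \pi_a,
\qquad
\pi_k = \frac{e^{w_k}}{\sum_j e^{w_j}}

\frac{\partial \log \pi_a}{\partial w_k} = \mathbb{1}[k = a] - \pi_k
\quad\Longrightarrow\quad
\frac{\partial L}{\partial w_k} = -\, r \left( \mathbb{1}[k = a] - \pi_k \right)

w_k \leftarrow w_k + \alpha \, r \left( \mathbb{1}[k = a] - \pi_k \right)
```

For $r = +1$ the weight of the chosen arm goes up and all others go down; for $r = -1$ the signs flip.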

In [15]:
total_episodes = 1000 # Set total number of episodes to train agent on.
total_reward = np.zeros(num_arms) # Set scoreboard for bandit arms to 0.

init = tf.global_variables_initializer()

# Launch the tensorflow graph
with tf.Session() as sess:
    sess.run(init)

    i = 0
    while i < total_episodes:
        
        probs = sess.run(output) # Compute the action probabilities
        action = np.random.choice(range(num_arms),p=probs) # Choose one arm according to these probabilities

        reward = pullBandit(bandit_arms[action]) 
        # Get our reward from picking one of the bandit arms.
        
        # Update the network.
        _, ww = sess.run([update,weights], feed_dict={reward_holder:[reward],action_holder:[action]})
        
        # Update our running memory with scores.
        total_reward[action] += reward

        if i % 100 == 0:
            print("Running reward for the " + str(num_arms) + " arms of the bandit: " + str(total_reward))
        i+=1
        
print("\nThe agent thinks arm " + str(np.argmax(ww)+1) + " is the most promising....")
if np.argmax(ww) == np.argmax(-np.array(bandit_arms)):
    print("...and it was right!")
else:
    print("...and it was wrong!")

Running reward for the 4 arms of the bandit: [ 1.  0.  0.  0.]
Running reward for the 4 arms of the bandit: [ -2. -14.  -3.  18.]
Running reward for the 4 arms of the bandit: [ -9. -11. -12.  45.]
Running reward for the 4 arms of the bandit: [-21. -11. -14.  67.]
Running reward for the 4 arms of the bandit: [-23. -20. -16.  92.]
Running reward for the 4 arms of the bandit: [ -25.  -25.  -20.  121.]
Running reward for the 4 arms of the bandit: [ -34.  -28.  -29.  162.]
Running reward for the 4 arms of the bandit: [ -39.  -35.  -35.  200.]
Running reward for the 4 arms of the bandit: [ -40.  -37.  -42.  236.]
Running reward for the 4 arms of the bandit: [ -43.  -42.  -62.  270.]

The agent thinks arm 4 is the most promising....
...and it was right!


In [16]:
print 'Resulting weights to choose an arm.'
print ww

Resulting weights to choose an arm.
[ 0.8347196   0.80655986  0.76708001  1.57525516]


###  Example: contextual Bandit

<font>Contextual Bandits introduce the concept of the state. The state consists of a description of the environment that the agent can use to take more informed actions. In our problem, instead of a single bandit, there can now be multiple bandits. The state of the environment tells us which bandit we are dealing with, and the goal of the agent is to learn the best action not just for a single bandit, but for any number of them. Since each bandit will have different reward probabilities for each arm, our agent will need to learn to condition its action on the state of the environment. </font>

<font>Here we define our contextual bandits. In this example, we are using **three four-armed bandits**. What this means is that each bandit has four arms that can be pulled. Each bandit has different success probabilities for each arm, and as such requires different actions to obtain the best result. Every episode we are in a random **state**, meaning we are in front of a random bandit. </font>
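Conditioning the action on the state can be as simple as keeping one row of weights per bandit. The network built in the cells below one-hot encodes the state and feeds it through a single bias-free layer, which amounts to picking one row of the weight matrix. The sketch below mimics this in plain Python; the weight values in `W` are hypothetical and the helper names are assumptions for illustration.

```python
import math

def one_hot(index, size):
    """Return a list with a 1.0 at position index and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def policy_output(state, weight_matrix):
    """Sigmoid of one_hot(state) times the weight matrix (no bias)."""
    encoded = one_hot(state, len(weight_matrix))
    num_actions = len(weight_matrix[0])
    pre_activation = [
        sum(encoded[s] * weight_matrix[s][a] for s in range(len(weight_matrix)))
        for a in range(num_actions)
    ]
    return [1.0 / (1.0 + math.exp(-z)) for z in pre_activation]

# Hypothetical weights: 3 bandits (rows) x 4 arms (columns).
W = [[1.0, 1.0, 1.0, 1.6],
     [1.0, 1.6, 1.0, 1.0],
     [1.6, 1.0, 1.0, 1.0]]

greedy_arms = []
for s in range(len(W)):
    out = policy_output(s, W)
    greedy = max(range(len(out)), key=lambda a: out[a])
    greedy_arms.append(greedy)
    print("bandit %d -> greedy arm %d" % (s + 1, greedy + 1))
```

Since only the row belonging to the current state contributes to the output, an update for one bandit leaves the preferences for the other bandits untouched.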

In [3]:
import tensorflow as tf
import tensorflow.contrib.slim as slim
import numpy as np

In [4]:
state = 0
#List out our bandits. Currently arms 4, 2, and 1 (respectively) are optimal (lowest threshold).
bandits = np.array([[0.2,0,-0.0,-5],[0.1,-5,1,0.25],[-5,5,5,5]])
num_bandits = bandits.shape[0] # Number of bandits
num_actions = bandits.shape[1] # Number of arms per bandit
        
def getBandit(): # Random state at the beginning
    state = np.random.randint(0,len(bandits)) #Returns a random state for each episode.
    return state
        
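def pullArm(action):
    # Look up the reward threshold of the chosen arm of the current bandit.
    bandit_prob = bandits[state,action]
    # Draw a sample from the standard normal distribution.
    result = np.random.randn(1)
    if result > bandit_prob:
        #return a positive reward.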
        return 1
    else:
        #return a negative reward.
        return -1

In [5]:
tf.reset_default_graph() #Clear the Tensorflow graph.

#These lines establish the feed-forward part of the network. The agent takes a state and produces an action.
state_in= tf.placeholder(shape=[1],dtype=tf.int32)
state_in_OH = slim.one_hot_encoding(state_in,num_bandits) # Easier to handle encoding 
output = slim.fully_connected(state_in_OH,num_actions, biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones_initializer())

output = tf.reshape(output,[-1])
chosen_action = tf.argmax(output,0)

# Here we used TensorFlow's contrib.slim framework, which allowed us to create a one-layer
# network in one line. This should give you a taste of the possibilities it offers; sadly,
# we can't present all of them.


# The next six lines establish the training procedure.
# We feed the reward and chosen action into the network
# to compute the loss, and use it to update the network.
reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
action_holder = tf.placeholder(shape=[1],dtype=tf.int32)
responsible_weight = tf.slice(output,action_holder,[1])

loss = -(tf.log(responsible_weight)*reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
update = optimizer.minimize(loss)

In [6]:
weights = tf.trainable_variables()[0] # The weights we will evaluate to look into the network.

total_episodes = 10000 # Set total number of episodes to train agent on.
total_reward = np.zeros([num_bandits,num_actions]) # Set scoreboard for bandits to 0.
e = 0.1 # Set the chance of taking a random action. e-greedy approach

init = tf.global_variables_initializer()

# Launch the tensorflow graph
with tf.Session() as sess:
    sess.run(init)

    i = 0
    while i < total_episodes:
        state = getBandit() # Get a state from the environment.
        
        # Choose either a random action or one from our network.
        if np.random.rand(1) < e:
            action = np.random.randint(num_actions)
        else:
            action = sess.run(chosen_action,feed_dict={state_in:[state]})
        
        reward = pullArm(action) # Get our reward for taking an action given a bandit.
        
        # Update the network.
        feed_dict={reward_holder:[reward],action_holder:[action],state_in:[state]}
        _,ww = sess.run([update,weights], feed_dict=feed_dict)
         
        # Update our running memory with scores.
        total_reward[state,action] += reward
        if i % 500 == 0:
            print("Mean reward for each of the " + str(num_bandits) + " bandits: " + str(np.mean(total_reward,axis=1)))
        i+=1
for a in range(num_bandits):
    print("The agent thinks action " + str(np.argmax(ww[a])+1) + " for bandit " + str(a+1) + " is the most promising....")
    if np.argmax(ww[a]) == np.argmin(bandits[a]):
        print("...and it was right!")
    else:
        print("...and it was wrong!")

Mean reward for each of the 3 bandits: [ 0.    0.    0.25]
Mean reward for each of the 3 bandits: [ 16.75  36.5   34.5 ]
Mean reward for each of the 3 bandits: [ 60.5   69.    68.75]
Mean reward for each of the 3 bandits: [ 102.    104.5   106.75]
Mean reward for each of the 3 bandits: [ 139.25  144.    139.5 ]
Mean reward for each of the 3 bandits: [ 178.5   179.5   176.25]
Mean reward for each of the 3 bandits: [ 216.5   218.5   213.75]
Mean reward for each of the 3 bandits: [ 254.5   254.25  254.5 ]
Mean reward for each of the 3 bandits: [ 295.75  292.    285.  ]
Mean reward for each of the 3 bandits: [ 333.75  331.5   319.  ]
Mean reward for each of the 3 bandits: [ 373.5   366.    353.75]
Mean reward for each of the 3 bandits: [ 411.5   403.5   386.75]
Mean reward for each of the 3 bandits: [ 447.    445.25  423.  ]
Mean reward for each of the 3 bandits: [ 491.5   476.5   456.75]
Mean reward for each of the 3 bandits: [ 529.75  513.    493.5 ]
Mean reward for each of the 3 bandits

In [9]:
print 'Resulting weights to choose an arm.'
print 'The row index is the bandit number and the column the arm number'
print ww

Resulting weights to choose an arm.
The row index is the bandit number and the column the arm number
[[ 0.99677056  0.99650079  1.00107944  1.63615358]
 [ 0.99946415  1.63713121  0.98513019  0.99136895]
 [ 1.63468564  0.97667587  0.97831613  0.97776961]]
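The training loop above chooses actions with an e-greedy rule: with probability `e` it explores a random arm, otherwise it takes the arm the network currently rates highest. A minimal stand-alone sketch of that rule follows; the `values` list is a hypothetical network output and the function name is an assumption.

```python
import random

def epsilon_greedy(action_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])

rng = random.Random(0)
values = [0.2, 0.9, 0.1, 0.4]  # hypothetical network outputs for one state
picks = [epsilon_greedy(values, 0.1, rng) for _ in range(10000)]
greedy_share = picks.count(1) / float(len(picks))
print("share of greedy picks: %.3f" % greedy_share)
```

With `epsilon = 0.1` and four arms, the greedy arm is expected to be picked about 92.5% of the time (90% greedy choices plus a quarter of the random ones), so every arm keeps being sampled occasionally.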


### Example: Simple Policy solving CartPole-Environment

Now, with a similar idea, we can solve our first environment. The only difference this time is that our state is determined by our previous actions.

In [1]:
import numpy as np
import tensorflow as tf

In [2]:
import gym
env = gym.make('CartPole-v0')

[2017-09-26 16:07:38,206] Making new env: CartPole-v0


In [3]:
def discount_rewards(r):
    """ take 1D float array of rewards and compute discounted reward """
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(xrange(0, r.size)):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r
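A small worked example may help to see what the discounting does. The sketch below re-implements the same backward recursion on a plain Python list so that it runs without NumPy; the function name and the choice of `gamma=0.5` are only for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = running * gamma + rewards[t]
        returns[t] = running
    return returns

example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example)  # [1.75, 1.5, 1.0]
```

Early actions are credited with more of the future reward than late ones, which matches the intuition that the first moves of a long CartPole episode contributed to everything that followed.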


In [4]:
# hyperparameters
H = 16 # number of hidden layer neurons
learning_rate = 1e-2 # feel free to play with this to train faster or more stably.
gamma = 0.99 # discount factor for reward

D = 4 # input dimensionality

logs_path = '/tmp/logs_pole/pole_1'

In [5]:
tf.reset_default_graph()

#This defines the network as it goes from taking an observation of the environment to 
#giving a probability of choosing the action of moving left or right.
observations = tf.placeholder(tf.float32, [None,D] , name="observations")

with tf.name_scope('Model'):
    W1 = tf.get_variable("W1", shape=[D, H],
           initializer=tf.contrib.layers.xavier_initializer())
    layer1 = tf.nn.relu(tf.matmul(observations,W1))
    W2 = tf.get_variable("W2", shape=[H, 1],
               initializer=tf.contrib.layers.xavier_initializer())
    score = tf.matmul(layer1,W2)
    probability = tf.nn.sigmoid(score)

#From here we define the parts of the network needed for learning a good policy.
tvars = tf.trainable_variables()
input_y = tf.placeholder(tf.float32,[None,1], name="input_y")
advantages = tf.placeholder(tf.float32,name="reward_signal")

# The loss function. This sends the weights in the direction of making actions 
# that gave good advantage (reward over time) more likely, and actions that didn't less likely.
# If input_y is 0 -> log-likelihood is log(probability)
# If input_y is 1 -> log-likelihood is log(1-probability)
loglik = tf.log(input_y*(input_y - probability) + (1 - input_y)*(input_y + probability))
loss = -tf.reduce_mean(loglik * advantages) 
    
newGrads = tf.gradients(loss,tvars)


# We just apply gradients after every episode 
adam = tf.train.AdamOptimizer(learning_rate=learning_rate) # Our optimizer
updateGrads = adam.apply_gradients(zip(newGrads,tvars))

true_reward = tf.placeholder(tf.float32,name="true_reward")
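The `loglik` expression is compact but not obvious at first sight. The sketch below evaluates the same formula for scalar inputs and compares it against the two cases named in the comments; the value `p = 0.8` is a hypothetical network output.

```python
import math

def log_likelihood(input_y, probability):
    """Same expression as the loglik tensor, for scalar inputs."""
    return math.log(input_y * (input_y - probability)
                    + (1 - input_y) * (input_y + probability))

p = 0.8  # hypothetical network output, P(action = 1)
print(log_likelihood(0, p), math.log(p))      # input_y = 0: action 1 was taken
print(log_likelihood(1, p), math.log(1 - p))  # input_y = 1: action 0 was taken
```

In both cases the expression reduces to the log-probability of the action that was actually taken, which is the quantity a policy-gradient loss needs.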

In [6]:
# 'Saver' op to save and restore all the variables
saver = tf.train.Saver()

# Data construction for tensorboard output

# Create a summary to monitor the loss tensor (logged with flipped sign)
tf.summary.scalar("Loss", -loss)
# Create a summary to monitor the episode reward
tf.summary.scalar("Reward", true_reward)

# Create summaries to visualize weights
#for var in tf.trainable_variables():
#    tf.summary.histogram(var.name, var)

# Merge all summaries into a single op
merged_summary_op = tf.summary.merge_all()

In [7]:
xs,drs,ys = [],[],[]
running_reward = None
reward_sum = 0
episode_number = 1
total_episodes = 400

init = tf.global_variables_initializer()
sess = tf.Session()

# Launch the graph
sess.run(init)

summary_writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph())

observation = env.reset() # Obtain an initial observation of the environment
    
while episode_number < total_episodes:
            
    # Make sure the observation is in a shape the network can handle.
    x = np.reshape(observation,[1,D])
    
    # Run the policy network and get an action to take. 
    tfprob = sess.run(probability,feed_dict={observations: x})
    action = 1 if np.random.uniform() < tfprob else 0
        
    xs.append(x) # observation
    y = 1 if action == 0 else 0 # a "fake label"
    ys.append(y)

    # step the environment and get new measurements
    observation, reward, done, info = env.step(action)
    reward_sum += reward
    drs.append(reward) # record reward (has to be done after we call step() to get reward for previous action)

    if done: 
        episode_number += 1
        # stack together all inputs, hidden states, action gradients, and rewards for this episode
        epx = np.vstack(xs)
        epy = np.vstack(ys)
        epr = np.vstack(drs)
        xs,drs,ys = [],[],[] # reset array memory

        # compute the discounted reward backwards through time
        discounted_epr = discount_rewards(epr)
        # standardize the rewards to zero mean and unit variance (helps control the gradient estimator variance)
        discounted_epr -= np.mean(discounted_epr)
        discounted_epr /= np.std(discounted_epr)
            
        summary = sess.run(merged_summary_op, feed_dict={observations: epx, input_y: epy, advantages: discounted_epr, true_reward: reward_sum})
        # Write logs at every iteration
        summary_writer.add_summary(summary, episode_number)
        
        
        # Update the policy network with the gradients of this episode.
        sess.run(updateGrads,feed_dict={observations: epx, input_y: epy, advantages: discounted_epr})
        running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
        
        print 'Reward for episode',  episode_number,':', float(reward_sum)
                
        reward_sum = 0
            
        observation = env.reset()
        
print(episode_number,'Episodes completed.')

Reward for episode 2 : 19.0
Reward for episode 3 : 28.0
Reward for episode 4 : 26.0
Reward for episode 5 : 23.0
Reward for episode 6 : 62.0
Reward for episode 7 : 23.0
Reward for episode 8 : 19.0
Reward for episode 9 : 21.0
Reward for episode 10 : 26.0
Reward for episode 11 : 17.0
Reward for episode 12 : 12.0
Reward for episode 13 : 24.0
Reward for episode 14 : 41.0
Reward for episode 15 : 49.0
Reward for episode 16 : 22.0
Reward for episode 17 : 35.0
Reward for episode 18 : 9.0
Reward for episode 19 : 12.0
Reward for episode 20 : 30.0
Reward for episode 21 : 45.0
Reward for episode 22 : 13.0
Reward for episode 23 : 13.0
Reward for episode 24 : 39.0
Reward for episode 25 : 21.0
Reward for episode 26 : 12.0
Reward for episode 27 : 9.0
Reward for episode 28 : 23.0
Reward for episode 29 : 39.0
Reward for episode 30 : 20.0
Reward for episode 31 : 35.0
Reward for episode 32 : 39.0
Reward for episode 33 : 67.0
Reward for episode 34 : 15.0
Reward for episode 35 : 15.0
Reward for episode 36 : 

Reward for episode 280 : 200.0
Reward for episode 281 : 73.0
Reward for episode 282 : 127.0
Reward for episode 283 : 172.0
Reward for episode 284 : 89.0
Reward for episode 285 : 200.0
Reward for episode 286 : 200.0
Reward for episode 287 : 200.0
Reward for episode 288 : 184.0
Reward for episode 289 : 169.0
Reward for episode 290 : 132.0
Reward for episode 291 : 200.0
Reward for episode 292 : 200.0


KeyboardInterrupt: 
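The training loop standardizes the discounted rewards before feeding them in as advantages. A minimal sketch of that step on a plain list is shown below; it assumes the population standard deviation (the default of `np.std`) and true division, and the function name is introduced here for illustration.

```python
import math

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    std = math.sqrt(sum(c * c for c in centered) / len(centered))
    return [c / std for c in centered]

advantages = standardize([1.75, 1.5, 1.0])
print([round(a, 3) for a in advantages])
```

After this step roughly half of the actions in an episode receive a positive advantage and half a negative one, so the update pushes the policy toward the better-than-average part of the episode.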

In [None]:
reward_total = 0
t = 10
for _ in range(t):
    observation = env.reset()
    reward_sum = 0
    while True:
        env.render()
        x = np.reshape(observation, [1, D])
        tfprob = sess.run(probability,feed_dict={observations: x})
        action = 1 if np.random.uniform() < tfprob else 0    
        observation, reward, done, _ = env.step(action)
        reward_sum += reward
        
        if done:
            reward_total += reward_sum
            break

print 'Average score: ', reward_total/t            

### A more complex task: solving Pong

In [None]:
import numpy as np
import gym
import tensorflow as tf

In [None]:
# Hyperparameters
n_obs = 80 * 80           # dimensionality of observations
h = 200                 # number of hidden layer neurons
n_actions = 3            # number of available actions
learning_rate = 3e-3      # Learning Rate
gamma = .99               # discount factor for reward
decay = 0.99              # decay rate for RMSProp gradients
index = 1
save_path= '/tmp/pong-attempt/models/{}/pong.ckpt'.format(index) # Save path
logs_path = '/tmp/pong-attempt/logs/{}/'.format(index) #log path

In [None]:
# downsampling
def prepro(I):
    """ prepro 210x160x3 uint8 frame into 6400 (80x80) 1D float vector """
    I = I[35:195] # crop
    I = I[::2,::2,0] # downsample by factor of 2
    I[I == 144] = 0  # erase background (background type 1)
    I[I == 109] = 0  # erase background (background type 2)
    I[I != 0] = 1    # everything else (paddles, ball) just set to 1
    return I.astype(float).ravel()
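The crop and downsampling indices in `prepro` determine the input size `n_obs`. The following sketch checks that arithmetic by applying the same slices to plain index ranges instead of an image, so it needs neither NumPy nor the environment.

```python
rows = range(210)[35:195][::2]   # crop the playing field, keep every second row
cols = range(160)[::2]           # keep every second column
print(len(rows), len(cols), len(rows) * len(cols))  # 80 80 6400
```

The result, 80 x 80 = 6400, matches the `n_obs` hyperparameter defined above.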

In [None]:
# gamespace 
env = gym.make("Pong-v0") # environment info
observation = env.reset()
prev_x = None
xs,rs,ys = [],[],[]
running_reward = None
reward_sum = 0
episode_number = 0

In [None]:
tf.reset_default_graph()
# initialize model
tf_model = {}
with tf.variable_scope('layer_one',reuse=False):
    xavier_l1 = tf.truncated_normal_initializer(mean=0, stddev=1./np.sqrt(n_obs), dtype=tf.float32)
    tf_model['W1'] = tf.get_variable("W1", [n_obs, h], initializer=xavier_l1)
with tf.variable_scope('layer_two',reuse=False):
    xavier_l2 = tf.truncated_normal_initializer(mean=0, stddev=1./np.sqrt(h), dtype=tf.float32)
    tf_model['W2'] = tf.get_variable("W2", [h,n_actions], initializer=xavier_l2)

In [None]:
# tf operations
def tf_discount_rewards(tf_r): #tf_r ~ [game_steps,1]
    discount_f = lambda a, v: a*gamma + v
    tf_r_reverse = tf.scan(discount_f, tf.reverse(tf_r,[True, False]))
    tf_discounted_r = tf.reverse(tf_r_reverse,[True, False])
    return tf_discounted_r

def tf_policy_forward(x): #x ~ [1,D]
    h = tf.matmul(x, tf_model['W1'])
    h = tf.nn.relu(h)
    logp = tf.matmul(h, tf_model['W2'])
    p = tf.nn.softmax(logp)
    return p
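`tf_discount_rewards` computes the discounted rewards inside the graph by reversing the reward sequence, scanning over it, and reversing the result. As a rough plain-Python analogue, `itertools.accumulate` plays the role of `tf.scan` here (both start from the first element when no initial value is given); the function name and the example values are assumptions for illustration.

```python
from itertools import accumulate

def discount_with_scan(rewards, gamma):
    """Reverse, accumulate with a -> a * gamma + v, then reverse back."""
    reversed_rewards = list(reversed(rewards))
    scanned = list(accumulate(reversed_rewards, lambda a, v: a * gamma + v))
    return list(reversed(scanned))

print(discount_with_scan([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

A single reward at the end of the sequence is propagated backwards and shrinks by a factor of `gamma` per step, the same behaviour as the explicit loop used in the CartPole example.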

In [None]:
# tf placeholders
tf_x = tf.placeholder(dtype=tf.float32, shape=[None, n_obs],name="tf_x")
tf_y = tf.placeholder(dtype=tf.float32, shape=[None, n_actions],name="tf_y")
tf_epr = tf.placeholder(dtype=tf.float32, shape=[None,1], name="tf_epr")
true_reward = tf.placeholder(tf.float32,name="true_reward")

# tf reward processing (need tf_discounted_epr for policy gradient wizardry)
tf_discounted_epr = tf_discount_rewards(tf_epr)
tf_mean, tf_variance= tf.nn.moments(tf_discounted_epr, [0], shift=None, name="reward_moments")
tf_discounted_epr -= tf_mean
tf_discounted_epr /= tf.sqrt(tf_variance + 1e-6)

# tf optimizer op
tf_aprob = tf_policy_forward(tf_x)
loss = tf.nn.l2_loss(tf_y-tf_aprob)
optimizer = tf.train.RMSPropOptimizer(learning_rate, decay=decay)
tf_grads = optimizer.compute_gradients(loss, var_list=tf.trainable_variables(), grad_loss=tf_discounted_epr)
train_op = optimizer.apply_gradients(tf_grads)

In [None]:
# tf graph initialization
init = tf.global_variables_initializer()
sess = tf.Session()

In [10]:
# try load saved model
saver = tf.train.Saver(tf.global_variables())
load_was_success = True # yes, I'm being optimistic
try:
    save_dir = '/'.join(save_path.split('/')[:-1])
    ckpt = tf.train.get_checkpoint_state(save_dir)
    load_path = ckpt.model_checkpoint_path
    saver.restore(sess, load_path)
except:
    print "no saved model to load. starting new session"
    load_was_success = False
else:
    print "loaded model: {}".format(load_path)
    saver = tf.train.Saver(tf.global_variables())
    episode_number = int(load_path.split('-')[-1])

no saved model to load. starting new session


In [None]:
# 'Saver' op to save and restore all the variables
saver = tf.train.Saver()
# Data construction for tensorboard output
# Create a summary to monitor loss tensor
tf.summary.scalar("Loss", loss)
# Create a summary to monitor the episode reward
tf.summary.scalar("Reward", true_reward)

# Merge all summaries into a single op
merged_summary_op = tf.summary.merge_all()

In [None]:
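# training loop
# Initialize the variables only if no checkpoint was restored; running init
# unconditionally would overwrite the restored weights.
if not load_was_success:
    sess.run(init)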

# op to write logs to Tensorboard
summary_writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph())

while True:
    # preprocess the observation, set input to network to be difference image
    cur_x = prepro(observation)
    x = cur_x - prev_x if prev_x is not None else np.zeros(n_obs)
    prev_x = cur_x

    # stochastically sample an action from the policy network's output
    feed = {tf_x: np.reshape(x, (1,-1))}
    aprob = sess.run(tf_aprob,feed_dict = feed) 
    aprob = aprob[0,:]
    action = np.random.choice(n_actions, p=aprob)
    label = np.zeros_like(aprob) ; label[action] = 1

    # step the environment and get new measurements
    observation, reward, done, info = env.step(action+1)
    reward_sum += reward
    
    # record game history
    xs.append(x) ; ys.append(label) ; rs.append(reward)
    
    if done:
        # update running reward
        running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
        
        # parameter update
        feed = {tf_x: np.vstack(xs), tf_epr: np.vstack(rs), tf_y: np.vstack(ys), true_reward: reward_sum}
        
        summary = sess.run(merged_summary_op, feed_dict=feed )
        # Write logs at every iteration
        summary_writer.add_summary(summary, episode_number)

        _ = sess.run(train_op,feed)
        
        # print progress console
        if episode_number % 10 == 0:
            print 'ep {}: reward: {}, mean reward: {:.3f}'.format(episode_number, reward_sum, running_reward)
        #else:
            #print '\tep {}: reward: {}'.format(episode_number, reward_sum)
        
        # bookkeeping
        xs,rs,ys = [],[],[] # reset game history
        episode_number += 1 # the Next Episode
        observation = env.reset() # reset env
        reward_sum = 0
        if episode_number % 50 == 0:
            saver.save(sess, save_path, global_step=episode_number)
            print "SAVED MODEL #{}".format(episode_number)
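The `running_reward` printed during training is an exponential moving average of the episode scores. The sketch below isolates that update; the function name and the three example scores are hypothetical.

```python
def update_running_reward(running_reward, episode_reward, smoothing=0.99):
    """The first episode initializes the average; later ones nudge it toward the new score."""
    if running_reward is None:
        return episode_reward
    return running_reward * smoothing + episode_reward * (1.0 - smoothing)

running = None
for episode_reward in [-21.0, -21.0, -19.0]:  # hypothetical Pong episode scores
    running = update_running_reward(running, episode_reward)
print(round(running, 4))  # -20.98
```

With a smoothing factor of 0.99 the average reacts slowly, so it takes on the order of a hundred episodes before a real improvement becomes visible in the printed value.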

We trained this model on a laptop and reached a stable average score of +5 after about two days of training. Improving on this simple implementation requires more fine-tuning.


<img src="./images/pg_pong_reward.png" height="400" />

<img src="./images/pg_pong_loss.png" height="400" />