## Cart-Pole Task
In this tutorial, we will learn how to train an agent to balance a pole on a cart, which is a typical reinforcement learning problem.

To accomplish this, we need a challenge that is more difficult for the agent than the two-armed bandit. To provide this challenge we are going to use the OpenAI gym, a collection of reinforcement learning environments. We will be using one of the classic tasks, the Cart-Pole. To learn more about the OpenAI gym, and this specific task, check out their tutorial here. Essentially, our agent will learn how to balance a pole for as long as possible without it falling. Unlike the two-armed bandit, this task requires:<br>

> **Observations**: The agent needs to know where the pole currently is, and the angle at which it is balancing. To accomplish this, our neural network will take an observation and use it when producing the probability of each action.<br>
**Delayed reward**: Keeping the pole in the air as long as possible means moving in ways that are advantageous for both the present and the future. To accomplish this, we will adjust the reward value for each observation-action pair using a function that weighs actions over time.<br>
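
As a rough illustration of "producing the probability of an action", the sketch below turns two raw scores into probabilities and samples an action index from them. The helper names `softmax` and `sample_action` and the example scores are assumptions made for this illustration, not part of the notebook's code.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng=random):
    """Draw an action index with probability given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical scores for the two CartPole actions (0 = push left, 1 = push right).
probs = softmax([0.2, 1.0])
action = sample_action(probs, random.Random(0))
print(probs, action)
```

Sampling, rather than always taking the most probable action, keeps the agent exploring while it learns.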

To take reward over time into account, the form of Policy Gradient we used in the previous tutorials needs a few adjustments. The first is that we now need to update our agent with more than one experience at a time. To accomplish this, we will collect experiences in a buffer, and then occasionally use them to update the agent all at once. These sequences of experience are sometimes referred to as rollouts, or experience traces. We can't just apply these rollouts by themselves, however; we need to ensure that the rewards are properly adjusted by a discount factor.
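
One common way to write this adjustment is shown here. The symbols are notation chosen for this illustration: $r_t$ is the reward at step $t$, $T$ is the last step of the episode, and $\gamma$ is the discount factor (set to 0.99 in the code).

$$R_t = \sum_{k=0}^{T-t} \gamma^{k}\, r_{t+k} = r_t + \gamma R_{t+1}, \qquad R_{T+1} = 0$$

The recursive form on the right is why the rewards can be processed in a single pass from the end of the episode back to its start.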

Intuitively, this allows each action to be a little bit responsible not only for the immediate reward, but for all the rewards that followed. We now use this modified reward as an estimate of the advantage in our loss equation. With those changes, we are ready to solve CartPole!
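
To make the role of the modified reward concrete, here is a minimal sketch in plain Python. It assumes a toy three-step rollout and made-up action probabilities, and the function names are illustrative. It computes the discounted rewards, then plugs them into a loss of the form `-mean(log(p) * reward)`.

```python
import math

def discount_rewards(rewards, gamma=0.99):
    """Return the discounted sum of future rewards for every step."""
    discounted = [0.0] * len(rewards)
    running_add = 0.0
    for t in reversed(range(len(rewards))):
        running_add = running_add * gamma + rewards[t]
        discounted[t] = running_add
    return discounted

def policy_gradient_loss(action_probs, returns):
    """Negative mean of log(probability of the chosen action) * return."""
    terms = [math.log(p) * g for p, g in zip(action_probs, returns)]
    return -sum(terms) / len(terms)

# Toy rollout: three steps with a reward of 1 per step.
returns = discount_rewards([1.0, 1.0, 1.0])
# Hypothetical probabilities the policy assigned to the actions it took.
loss = policy_gradient_loss([0.6, 0.5, 0.7], returns)
print(returns, loss)
```

The first step receives the largest value (about 2.97) because every later reward counts toward it, while the last step only receives its own reward of 1.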

In [1]:
import tensorflow as tf
import tensorflow.contrib.slim as slim
import numpy as np
import gym
import matplotlib.pyplot as plt
%matplotlib inline

# Python 2/3 compatibility: fall back to range when xrange is not defined
try:
    xrange = xrange
except NameError:
    xrange = range

env = gym.make('CartPole-v0')

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


## The Policy-Based Agent

In [2]:
gamma = 0.99  # discount factor

def discount_rewards(r):
    """ take 1D float array of rewards and compute discounted reward """
    discounted_r = np.zeros_like(r)
    running_add = 0

    for t in reversed(xrange(0, r.size)):
        # Accumulate the discounted sum of rewards from step t onward
        running_add = running_add * gamma + r[t]
        # Credit the action taken at step t with that discounted sum
        discounted_r[t] = running_add
    return discounted_r

In [3]:
class agent():
    def __init__(self, lr, s_size, a_size, h_size):
        # The agent takes a state and outputs a probability for each action.
        self.state_in = tf.placeholder(shape=[None, s_size], dtype=tf.float32)
        hidden = slim.fully_connected(self.state_in, h_size, biases_initializer=None, activation_fn=tf.nn.relu)
        self.output = slim.fully_connected(hidden, a_size, biases_initializer=None, activation_fn=tf.nn.softmax)
        self.chosen_action = tf.argmax(self.output, 1)

        # Placeholders for the (discounted) rewards and the actions that were taken.
        self.reward_holder = tf.placeholder(shape=[None], dtype=tf.float32)
        self.action_holder = tf.placeholder(shape=[None], dtype=tf.int32)

        # Select the probability of the action actually taken in each row of the output.
        self.indexes = tf.range(0, tf.shape(self.output)[0]) * tf.shape(self.output)[1] + self.action_holder
        self.responsible_output = tf.gather(tf.reshape(self.output, [-1]), self.indexes)
        # Policy gradient loss: -mean(log(probability of chosen action) * reward)
        self.loss = -tf.reduce_mean(tf.log(self.responsible_output)*self.reward_holder)

        # One placeholder per trainable variable, used to feed in accumulated gradients.
        tvars = tf.trainable_variables()
        self.gradient_holders = []

        for idx, var in enumerate(tvars):
            placeholder = tf.placeholder(tf.float32,name=str(idx)+'_holder')
            self.gradient_holders.append(placeholder)

        # Gradients of the loss with respect to each trainable variable.
        self.gradients = tf.gradients(self.loss,tvars)

        # Apply the gradients supplied through the placeholders.
        optimizer = tf.train.AdamOptimizer(learning_rate=lr)
        self.update_batch = optimizer.apply_gradients(zip(self.gradient_holders,tvars))
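
The `indexes` / `responsible_output` lines in the class above use a flattened-index trick to pick one probability per row. The sketch below reproduces that arithmetic with plain Python lists. The function name `gather_chosen` and the sample values are assumptions for illustration only.

```python
def gather_chosen(probs, actions):
    """Pick probs[i][actions[i]] for every row i via a flattened index."""
    n_rows, n_cols = len(probs), len(probs[0])
    # Flatten row by row, analogous to tf.reshape(output, [-1])
    flat = [p for row in probs for p in row]
    # Row i starts at offset i * n_cols; add the action to reach its column
    indexes = [i * n_cols + a for i, a in zip(range(n_rows), actions)]
    # Look up each flat index, analogous to tf.gather
    return [flat[k] for k in indexes]

# Three hypothetical time steps with two action probabilities each.
probs = [[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]]
chosen = gather_chosen(probs, [0, 1, 1])
print(chosen)
```

For the actions `[0, 1, 1]` the flat indexes are `[0, 3, 5]`, so the selected probabilities are the first entry of row 0 and the second entries of rows 1 and 2.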

## Training 

In [4]:
tf.reset_default_graph() # Reset the graph 

myAgent = agent(lr=1e-2, s_size=4, a_size=2, h_size=8) # initialize an agent with a 4-dimensional state, 2 actions and 8 hidden units

total_epoch = 5000 # Set total number of episodes
max_ep = 999 # TODO change back to 9990
update_frequency = 5

init = tf.global_variables_initializer()

# Launch the tensorflow graph 
with tf.Session() as sess: 
    sess.run(init)
    i = 0 
    total_reward = []
    total_lenght = []
    
    # trainable_variables (for s_size=4, h_size=8, a_size=2, no biases):
    # [<tf.Variable 'fully_connected/weights:0' shape=(4, 8) dtype=float32_ref>,
    #  <tf.Variable 'fully_connected_1/weights:0' shape=(8, 2) dtype=float32_ref>]
    gradBuffer = sess.run(tf.trainable_variables())
    for ix,grad in enumerate(gradBuffer):
        gradBuffer[ix] = grad * 0 

    while i < total_epoch:
        s = env.reset()
        running_reward = 0
        ep_history = []
        
        for j in range(max_ep):
            # Probabilistically pick an action given our network output.
            # a_dist[0] holds the two action probabilities, which sum to 1.
            a_dist = sess.run(myAgent.output, feed_dict={myAgent.state_in:[s]})
            # Sample an action index according to those probabilities.
            a = np.random.choice(len(a_dist[0]), p=a_dist[0])

            s1,r,d,_ = env.step(a) # Take the action; get the next state, reward and done flag
            ep_history.append([s,a,r,s1]) # Record the transition
            s = s1 # Update the state
            running_reward += r # Accumulate the episode reward
            
            
            if d:
                # Episode finished: compute the gradients for this rollout
                ep_history = np.array(ep_history, dtype=object)
                ep_history[:,2] = discount_rewards(ep_history[:,2]) # Replace rewards with discounted rewards
                feed_dict = {myAgent.reward_holder:ep_history[:,2],
                            myAgent.action_holder:ep_history[:,1],myAgent.state_in:np.vstack(ep_history[:,0])}
                grads = sess.run(myAgent.gradients, feed_dict=feed_dict) # Gradients of the loss w.r.t. the trainable variables
                
                # Accumulate the gradients across episodes
                for idx, grad in enumerate(grads):
                    gradBuffer[idx] += grad
                    
                # Every update_frequency episodes, apply the accumulated gradients and clear the buffer
                if i % update_frequency == 0 and i != 0:
                    feed_dict = dict(zip(myAgent.gradient_holders, gradBuffer))
                    _ = sess.run(myAgent.update_batch, feed_dict=feed_dict)
                    for ix,grad in enumerate(gradBuffer):
                        gradBuffer[ix] = grad * 0
                
                total_reward.append(running_reward)
                total_lenght.append(j)
                break
                
        if i % 100 == 0:
            print(np.mean(total_reward[-100:]))

        i += 1

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


46.0
38.3
55.66
101.91
156.24
166.71
172.67
188.94
186.27
184.18
179.82
168.14
190.52
194.49
187.19
180.64
185.49
186.99
190.5
195.79
199.17
199.93
198.85
200.0
198.94
197.88
199.78
198.07
197.35
196.67
195.14
196.26
189.35
189.89
194.77
194.49
192.86
192.78
194.33
196.55
194.77
191.71
194.41
198.07
198.16
190.17
180.27
185.99
191.26
187.37
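
The training loop above accumulates gradients over several episodes and applies them in one batch. The sketch below isolates that buffering pattern in plain Python. The function name `train_with_buffer`, the scalar "gradients", and the `apply_update` callback are hypothetical stand-ins for the TensorFlow calls.

```python
def train_with_buffer(episode_grads, update_frequency, apply_update):
    """Accumulate per-episode gradients and apply them every few episodes."""
    buffer = [0.0] * len(episode_grads[0])
    for i, grads in enumerate(episode_grads):
        # Add this episode's gradients to the running totals
        for idx, g in enumerate(grads):
            buffer[idx] += g
        # Same condition as the loop above: skip episode 0, then update
        # whenever the episode index is a multiple of update_frequency
        if i % update_frequency == 0 and i != 0:
            apply_update(list(buffer))
            buffer = [0.0] * len(buffer)

applied = []
# Hypothetical gradients: two variables, one number each, for 11 episodes.
fake_grads = [[1.0, 2.0] for _ in range(11)]
train_with_buffer(fake_grads, 5, applied.append)
print(applied)
```

With this condition the first update covers six episodes (indexes 0 through 5) and each later update covers five, which is worth keeping in mind when comparing batch sizes.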
