In [1]:
from IPython.core.display import HTML  #For a more pleasing rendering...
HTML(open("styles/custom.css").read())

## 2.3 Actor Critic Methods - A3C

<font> **A3C** is short for **A**synchronous **A**dvantage **A**ctor-**C**ritic and was first described by the DeepMind team in 2016. The algorithm tries to combine the benefits of Q-Learning and Policy Gradient methods. Let's start by unpacking the name: </font>
 

<font> **Asynchronous**: Unlike DQN, where a single agent represented by a single neural network interacts with a single environment, A3C uses multiple instances of this setup in order to learn more efficiently. In A3C there is a global network and multiple worker agents, each with its own set of network parameters. Each of these agents interacts with its own copy of the environment at the same time as the other agents interact with theirs. This works better than having a single agent because the experience of each agent is independent of the experience of the others, so the overall experience available for training becomes more diverse. It also means we no longer need an experience replay buffer. </font>
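
The worker/global interplay described above can be sketched with plain Python threads. This is a toy under stated assumptions, not the A3C implementation used later: the "network" is a single hypothetical parameter `w`, the "loss" is the made-up quadratic `(w - TARGET)^2`, and a lock is used to keep the demo simple.

```python
import threading

# Hypothetical toy problem: minimise (w - TARGET)^2 with several workers.
TARGET = 3.0
LEARNING_RATE = 0.1
STEPS_PER_WORKER = 200

global_params = {"w": 0.0}   # plays the role of the global network
lock = threading.Lock()      # simplification for this toy example


def worker():
    for _ in range(STEPS_PER_WORKER):
        # 1) copy the global parameters into the worker's local copy
        with lock:
            local_w = global_params["w"]
        # 2) compute a gradient locally: d/dw of (w - TARGET)^2
        grad = 2.0 * (local_w - TARGET)
        # 3) apply the local gradient to the *global* parameters
        with lock:
            global_params["w"] -= LEARNING_RATE * grad


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("global w after training:", global_params["w"])
```

Between steps 1 and 3 other workers may already have changed the global parameters, so a worker can apply a slightly stale gradient. That is the price of running asynchronously, and in this toy it does not prevent `w` from approaching `TARGET`.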

<font> **Advantage**: In the update rule we used with Policy Gradients, the discounted returns from a set of experiences told the agent which of its actions were “good” and which were “bad”. The insight behind using advantage estimates instead of plain discounted returns is that they let the agent determine not just how good its actions were, but how much better they turned out to be than expected. Intuitively, this allows the algorithm to focus on where the network’s predictions were lacking. </font>
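
To make the difference concrete, here is a minimal sketch that computes discounted returns and the simplest possible advantage estimate, return minus predicted value. The episode and the critic predictions are hypothetical numbers, and the notebook's own workers use a more elaborate estimator further below.

```python
GAMMA = 0.99  # same discount factor as used later in this notebook


def discounted_returns(rewards, gamma):
    """Return R_t = r_t + gamma * R_{t+1} for every step of one episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


# Hypothetical 3-step episode and hypothetical critic predictions V(s_t).
rewards = [1.0, 1.0, 1.0]
values = [2.5, 2.5, 0.5]

returns = discounted_returns(rewards, GAMMA)
advantages = [R - v for R, v in zip(returns, values)]

for t, (R, v, adv) in enumerate(zip(returns, values, advantages)):
    print("t=%d  return=%.4f  value=%.2f  advantage=%+.4f" % (t, R, v, adv))
```

All three returns are positive, so plain discounted returns would reinforce every action. The advantage is negative at `t=1`, where the critic expected more than the agent actually collected.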

<font> **Actor-Critic**: Our network will estimate both a value function $V(s)$ (how good it is to be in a certain state) and a policy $π(s)$ (a set of action probability outputs). Each of these is a separate fully-connected layer sitting at the top of the network. Critically, the agent uses the value estimate (the critic) to update the policy (the actor) more intelligently than traditional policy gradient methods.</font>
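
The two-headed layout can be sketched in a few lines of dependency-free Python. Everything here is illustrative: the layer sizes, the random weights and the helper names `dense`, `matvec` and `forward` are assumptions for this sketch, not part of the networks built below.

```python
import math
import random

random.seed(0)
STATE_DIM, HIDDEN, ACTION_DIM = 4, 8, 2  # hypothetical sizes (CartPole-like)


def dense(n_in, n_out):
    """Random weight matrix stored as a list of rows (illustrative only)."""
    return [[random.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]


def matvec(x, w):
    """Multiply the row vector x with the weight matrix w."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


w_shared = dense(STATE_DIM, HIDDEN)   # shared body
w_policy = dense(HIDDEN, ACTION_DIM)  # actor head
w_value = dense(HIDDEN, 1)            # critic head


def forward(state):
    hidden = [max(0.0, h) for h in matvec(state, w_shared)]  # ReLU
    policy = softmax(matvec(hidden, w_policy))               # action probabilities
    value = matvec(hidden, w_value)[0]                       # scalar V(s)
    return policy, value


policy, value = forward([0.02, -0.01, 0.03, 0.04])
print("policy:", policy, "value:", value)
```

Both heads read the same hidden representation, so whatever the critic learns about the state is also available to the actor.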

<font> Currently there are two state-of-the-art implementations of this algorithm. One is purely CPU-based and the other additionally uses a GPU. We will look at a simple version of each and link to more complex projects for further reading. </font>

### A3C - CPU

<font>The main reason for using a CPU rather than a GPU is the inherently sequential nature of RL in general, and
A3C in particular. In RL, the training data is generated while learning, which means the training
and inference batches are small and the GPU is mostly idle during training, waiting for new data to
arrive. Since A3C does not use any replay memory, it is completely sequential, and therefore a
CPU implementation is as fast as a naive GPU implementation.</font>

This method is based on this [paper](https://arxiv.org/pdf/1602.01783.pdf). Good implementations can be found [here](https://github.com/openai/universe-starter-agent) or [here](https://github.com/ppwwyyxx/tensorpack/tree/master/examples/A3C-Gym).

<img src="./images/a3c_cpu.png" width=450/>

#### Imports

In [10]:
import numpy as np
import tensorflow.contrib.slim as slim
import scipy.signal
import gym
import os
import threading
import multiprocessing
import tensorflow as tf

<font> The algorithm is very sensitive to its learning parameters, so experiment to find the ones that work best on your system. </font>

#### Parameters

In [11]:
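# Maximum global norm for gradient clipping
CLIP_NORM = 40.0
# Number of LSTM cell units
CELL_UNITS = 16
# Size of mini batches to run training on
MINI_BATCH = 40
# Scaling factor applied to rewards and value estimates in Worker.train
REWARD_FACTOR = 0.001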

# Gym environment
ENV_NAME = 'CartPole-v0' # Discrete (4, 2)
STATE_DIM = 4
ACTION_DIM = 2

# Learning rate
LEARNING_RATE = 0.0005
# Discount rate for advantage estimation and reward discounting
GAMMA = 0.99

#### Network

In [12]:
#Used to initialize weights for policy and value output layers
def normalized_columns_initializer(std=1.0):
    def _initializer(shape, dtype=None, partition_info=None):
        out = np.random.randn(*shape).astype(np.float32)
        out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
        return tf.constant(out)
    return _initializer

class AC_Network():
    def __init__(self, s_size, a_size, scope, trainer):
        with tf.variable_scope(scope):
            # Input
            self.inputs = tf.placeholder(shape=[None, s_size], dtype=tf.float32)

            # Recurrent network for temporal dependencies
            lstm_cell = tf.contrib.rnn.BasicLSTMCell(CELL_UNITS, state_is_tuple=True)
            c_init = np.zeros((1, lstm_cell.state_size.c), np.float32)
            h_init = np.zeros((1, lstm_cell.state_size.h), np.float32)
            self.state_init = [c_init, h_init]
            c_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.c])
            h_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.h])
            self.state_in = [c_in, h_in]
            rnn_in = tf.expand_dims(self.inputs, [0])
            state_in = tf.contrib.rnn.LSTMStateTuple(c_in, h_in)
            lstm_outputs, lstm_state = tf.nn.dynamic_rnn(
                lstm_cell, rnn_in,
                initial_state=state_in,
                time_major=False)
            lstm_c, lstm_h = lstm_state
            self.state_out = (lstm_c[:1, :], lstm_h[:1, :])
            rnn_out = tf.reshape(lstm_outputs, [-1, CELL_UNITS])

            # Output layers for policy and value estimations
            self.policy = slim.fully_connected(rnn_out, a_size,
                                               activation_fn=tf.nn.softmax,
                                               weights_initializer=normalized_columns_initializer(0.01),
                                               biases_initializer=None)
            self.value = slim.fully_connected(rnn_out, 1,
                                              activation_fn=None,
                                              weights_initializer=normalized_columns_initializer(1.0),
                                              biases_initializer=None)

            # Only the worker networks need ops for loss functions and gradient updating.
            if scope != 'global':
                self.actions = tf.placeholder(shape=[None, a_size], dtype=tf.float32)
                self.target_v = tf.placeholder(shape=[None], dtype=tf.float32)
                self.advantages = tf.placeholder(shape=[None], dtype=tf.float32)

                self.responsible_outputs = tf.reduce_sum(self.policy * self.actions, [1])

                # Value loss function
                self.value_loss = 0.5 * tf.reduce_sum(tf.square(self.target_v - tf.reshape(self.value, [-1])))

                # Softmax policy loss function
                self.policy_loss = -tf.reduce_sum(tf.log(tf.maximum(self.responsible_outputs, 1e-12)) * self.advantages)

                # Softmax entropy function
                self.entropy = - tf.reduce_sum(self.policy * tf.log(tf.maximum(self.policy, 1e-12)))

                self.loss = 0.5 * self.value_loss + self.policy_loss - self.entropy * 0.01

                # Get gradients from local network using local losses
                local_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
                self.gradients = tf.gradients(self.loss, local_vars)
                self.var_norms = tf.global_norm(local_vars)
                grads, self.grad_norms = tf.clip_by_global_norm(self.gradients, CLIP_NORM)

                # Apply local gradients to global network
                global_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'global')
                self.apply_grads = trainer.apply_gradients(zip(grads, global_vars))
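
To see what the three loss terms in `AC_Network` evaluate to, here is a small worked example in plain Python. The batch values are hypothetical. The formulas mirror `value_loss`, `policy_loss`, `entropy` and `loss` from the cell above, with lists standing in for tensors.

```python
import math

# Hypothetical batch of two steps with two actions each.
policy = [[0.7, 0.3], [0.4, 0.6]]    # pi(a|s) for each step
actions = [[1.0, 0.0], [0.0, 1.0]]   # one-hot encoding of the actions taken
target_v = [1.0, 0.5]                # discounted returns
value = [0.8, 0.9]                   # critic predictions V(s)
advantages = [0.2, -0.4]

# Probability the policy assigned to the action that was actually taken.
responsible = [sum(p * a for p, a in zip(ps, acts))
               for ps, acts in zip(policy, actions)]

value_loss = 0.5 * sum((t - v) ** 2 for t, v in zip(target_v, value))
policy_loss = -sum(math.log(max(p, 1e-12)) * adv
                   for p, adv in zip(responsible, advantages))
entropy = -sum(p * math.log(max(p, 1e-12)) for ps in policy for p in ps)

loss = 0.5 * value_loss + policy_loss - 0.01 * entropy
print("value_loss=%.4f policy_loss=%.4f entropy=%.4f loss=%.4f"
      % (value_loss, policy_loss, entropy, loss))
```

The entropy term enters the total loss with a negative sign, so minimising the loss pushes the policy towards higher entropy and discourages it from collapsing onto a single action too early.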

#### Worker

In [13]:
# Copies one set of variables to another.
# Used to set worker network parameters to those of global network.
def update_target_graph(from_scope,to_scope):
    from_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, from_scope)
    to_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, to_scope)

    op_holder = []
    for from_var,to_var in zip(from_vars,to_vars):
        op_holder.append(to_var.assign(from_var))
    return op_holder

# Weighted random selection: returns n_picks random indexes.
# The chance of picking index i is given by the weight weights[i].
def weighted_pick(weights,n_picks):
    t = np.cumsum(weights)
    s = np.sum(weights)
    return np.searchsorted(t,np.random.rand(n_picks)*s)

# Discounting function used to calculate discounted returns.
def discounting(x, gamma):
    return scipy.signal.lfilter([1], [1, -gamma], x[::-1], axis=0)[::-1]

# Normalization of inputs and outputs
def norm(x, upper, lower=0.):
    return (x-lower)/max((upper-lower), 1e-12)

class Worker():
    def __init__(self, name, s_size, a_size, trainer, global_episodes, env_name, seed):
        self.name = "worker_" + str(name)
        self.number = name
        self.trainer = trainer
        self.global_episodes = global_episodes
        self.increment = self.global_episodes.assign_add(1)
        self.episode_rewards = []
        self.episode_lengths = []
        self.episode_mean_values = []

        self.a_size = a_size

        # Create the local copy of the network and the tensorflow op to copy global parameters to local network
        self.local_AC = AC_Network(s_size, a_size, self.name, trainer)
        self.update_local_ops = update_target_graph('global', self.name)

        self.env = gym.make(env_name)
        self.env.seed(seed)

    def get_env(self):
        return self.env

    def train(self, rollout, sess, gamma, r):
        rollout = np.array(rollout)
        states = rollout[:, 0]
        actions = rollout[:, 1]
        rewards = rollout[:, 2]
        values = rollout[:, 5]

        # Here we take the rewards and values from the rollout, and use them to
        # generate the advantage and discounted returns.
        rewards_list = np.asarray(rewards.tolist()+[r])*REWARD_FACTOR
        discounted_rewards = discounting(rewards_list, gamma)[:-1]

        # Advantage estimation
        # J Schulman, P Moritz, S Levine, M Jordan, P Abbeel,
        # "High-dimensional continuous control using generalized advantage estimation."
        # arXiv preprint arXiv:1506.02438 (2015).
        values_list = np.asarray(values.tolist()+[r])*REWARD_FACTOR
        advantages = rewards + gamma * values_list[1:] - values_list[:-1]
        discounted_advantages = discounting(advantages, gamma)


        # Update the global network using gradients from loss
        # Generate network statistics to periodically save
        # sess.run(self.local_AC.reset_state_op)
        rnn_state = self.local_AC.state_init
        feed_dict = {self.local_AC.target_v: discounted_rewards,
                     self.local_AC.inputs: np.vstack(states),
                     self.local_AC.actions: np.vstack(actions),
                     self.local_AC.advantages: discounted_advantages,
                     self.local_AC.state_in[0]: rnn_state[0],
                     self.local_AC.state_in[1]: rnn_state[1]}
        v_l, p_l, e_l, g_n, v_n, _ = sess.run([self.local_AC.value_loss,
                                               self.local_AC.policy_loss,
                                               self.local_AC.entropy,
                                               self.local_AC.grad_norms,
                                               self.local_AC.var_norms,
                                               self.local_AC.apply_grads],
                                              feed_dict=feed_dict)
        return v_l / len(rollout), p_l / len(rollout), e_l / len(rollout), g_n, v_n

    def work(self, gamma, sess, coord):
        episode_count = sess.run(self.global_episodes)
        total_steps = 0
        print("Starting worker " + str(self.number))
        with sess.as_default(), sess.graph.as_default():
            while not coord.should_stop():
                sess.run(self.update_local_ops)
                episode_buffer = []
                episode_mini_buffer = []
                episode_values = []
                episode_states = []
                episode_reward = 0
                episode_step_count = 0

                # Restart environment
                terminal = False
                s = self.env.reset()

                rnn_state = self.local_AC.state_init

                # Run an episode
                while not terminal:
                    episode_states.append(s)


                    # Get preferred action distribution
                    a_dist, v, rnn_state = sess.run([self.local_AC.policy, self.local_AC.value, self.local_AC.state_out],
                                         feed_dict={self.local_AC.inputs: [s],
                                                    self.local_AC.state_in[0]: rnn_state[0],
                                                    self.local_AC.state_in[1]: rnn_state[1]})

                    a0 = weighted_pick(a_dist[0], 1) # Use stochastic distribution sampling
                    a = np.zeros(self.a_size)
                    a[a0] = 1

                    s2, r, terminal, info = self.env.step(np.argmax(a))

                    episode_reward += r

                    episode_buffer.append([s, a, r, s2, terminal, v[0, 0]])
                    episode_mini_buffer.append([s, a, r, s2, terminal, v[0, 0]])

                    episode_values.append(v[0, 0])

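                    # Train on mini batches from episode
                    if len(episode_mini_buffer) == MINI_BATCH:
                        # Bootstrap from the value of the next state s2,
                        # which is not yet part of the mini batch.
                        v1 = sess.run([self.local_AC.value],
                                      feed_dict={self.local_AC.inputs: [s2],
                                                 self.local_AC.state_in[0]: rnn_state[0],
                                                 self.local_AC.state_in[1]: rnn_state[1]})
                        # A terminal state has no future reward to bootstrap from
                        bootstrap = 0.0 if terminal else v1[0][0, 0]
                        v_l, p_l, e_l, g_n, v_n = self.train(episode_mini_buffer, sess, gamma, bootstrap)
                        episode_mini_buffer = []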

                    # Set previous state for next step
                    s = s2
                    total_steps += 1
                    episode_step_count += 1

                self.episode_rewards.append(episode_reward)
                self.episode_lengths.append(episode_step_count)
                self.episode_mean_values.append(np.mean(episode_values))

                print("Reward: " + str(episode_reward), " | Episode", episode_count)
                sess.run(self.increment) # Next global episode

                episode_count += 1
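
The `weighted_pick` helper used by the worker is inverse-CDF sampling: build the cumulative weights, draw a uniform threshold, and look up where it falls. A standard-library sketch of the same idea follows. The name `weighted_pick_stdlib` is hypothetical, and `bisect_left` is chosen here because it matches the default behaviour of `np.searchsorted`.

```python
import bisect
import itertools
import random


def weighted_pick_stdlib(weights, rng=random):
    """Sample one index i with probability weights[i] / sum(weights)."""
    cumulative = list(itertools.accumulate(weights))
    threshold = rng.random() * cumulative[-1]
    # First index whose cumulative weight reaches the threshold.
    return bisect.bisect_left(cumulative, threshold)


random.seed(0)
weights = [0.2, 0.5, 0.3]   # hypothetical action probabilities
counts = [0, 0, 0]
for _ in range(10000):
    counts[weighted_pick_stdlib(weights)] += 1
print(counts)
```

Sampling from the distribution, rather than always taking the most probable action, is what keeps the workers exploring.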

#### Main

In [14]:
def main(_):
    global master_network
    global global_episodes

    tf.reset_default_graph()


    with tf.device("/cpu:0"):
        RANDOM_SEED = 1234
        np.random.seed(RANDOM_SEED)
        tf.set_random_seed(RANDOM_SEED)

        global_episodes = tf.Variable(0, dtype=tf.int32, name='global_episodes', trainable=False)
        trainer = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE)
        master_network = AC_Network(STATE_DIM, ACTION_DIM, 'global', None)  # Generate global network
        num_workers = multiprocessing.cpu_count()  # Set workers to number of available CPU threads
        

        workers = []
        # Create worker classes
        for i in range(num_workers):
            workers.append(Worker(i, STATE_DIM, ACTION_DIM, trainer, global_episodes,
                                  ENV_NAME, RANDOM_SEED))

    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        sess.run(tf.global_variables_initializer())

        # This is where the asynchronous magic happens.
        # Start the "work" process for each worker in a separate thread.
        worker_threads = []
        for worker in workers:
            # Pass the bound method and its arguments directly; a lambda would
            # look up `worker` only when called and could run the wrong worker.
            t = threading.Thread(target=worker.work, args=(GAMMA, sess, coord))
            t.start()
            worker_threads.append(t)
            
        coord.join(worker_threads)

In [15]:
tf.app.run()

[2017-09-21 17:07:22,888] Making new env: CartPole-v0
[2017-09-21 17:07:23,297] Making new env: CartPole-v0


Starting worker 1
Starting worker 0
('Reward: 26.0', ' | Episode', 0)
('Reward: 15.0', ' | Episode', 1)
('Reward: 30.0', ' | Episode', 0)
('Reward: 15.0', ' | Episode', 2)
('Reward: 10.0', ' | Episode', 3)
('Reward: 18.0', ' | Episode', 1)
('Reward: 11.0', ' | Episode', 4)
('Reward: 36.0', ' | Episode', 2)
('Reward: 16.0', ' | Episode', 3)
('Reward: 26.0', ' | Episode', 4)
('Reward: 58.0', ' | Episode', 5)
('Reward: 40.0', ' | Episode', 6)
('Reward: 17.0', ' | Episode', 7)
('Reward: 41.0', ' | Episode', 5)
('Reward: 13.0', ' | Episode', 8)
('Reward: 16.0', ' | Episode', 6)
('Reward: 14.0', ' | Episode', 9)
('Reward: 35.0', ' | Episode', 7)
('Reward: 12.0', ' | Episode', 8)
('Reward: 41.0', ' | Episode', 10)
('Reward: 25.0', ' | Episode', 9)
('Reward: 38.0', ' | Episode', 11)
('Reward: 30.0', ' | Episode', 10)
('Reward: 19.0', ' | Episode', 12)
('Reward: 14.0', ' | Episode', 11)
('Reward: 33.0', ' | Episode', 13)
('Reward: 23.0', ' | Episode', 12)
('Reward: 17.0', ' | Episode', 14)
('Re

('Reward: 12.0', ' | Episode', 114)
('Reward: 91.0', ' | Episode', 119)
('Reward: 19.0', ' | Episode', 120)
('Reward: 54.0', ' | Episode', 115)
('Reward: 30.0', ' | Episode', 121)
('Reward: 13.0', ' | Episode', 116)
('Reward: 18.0', ' | Episode', 122)
('Reward: 13.0', ' | Episode', 123)
('Reward: 50.0', ' | Episode', 117)
('Reward: 32.0', ' | Episode', 124)
('Reward: 35.0', ' | Episode', 118)
('Reward: 34.0', ' | Episode', 125)
('Reward: 32.0', ' | Episode', 119)
('Reward: 23.0', ' | Episode', 126)
('Reward: 13.0', ' | Episode', 120)
('Reward: 16.0', ' | Episode', 127)
('Reward: 17.0', ' | Episode', 121)
('Reward: 9.0', ' | Episode', 122)
('Reward: 33.0', ' | Episode', 128)
('Reward: 36.0', ' | Episode', 123)
('Reward: 15.0', ' | Episode', 129)
('Reward: 14.0', ' | Episode', 124)
('Reward: 24.0', ' | Episode', 130)
('Reward: 15.0', ' | Episode', 125)
('Reward: 11.0', ' | Episode', 131)
('Reward: 22.0', ' | Episode', 132)
('Reward: 21.0', ' | Episode', 133)
('Reward: 31.0', ' | Episode'

('Reward: 48.0', ' | Episode', 228)
('Reward: 19.0', ' | Episode', 229)
('Reward: 10.0', ' | Episode', 230)
('Reward: 24.0', ' | Episode', 231)
('Reward: 19.0', ' | Episode', 232)
('Reward: 86.0', ' | Episode', 237)
('Reward: 25.0', ' | Episode', 238)
('Reward: 54.0', ' | Episode', 233)
('Reward: 11.0', ' | Episode', 234)
('Reward: 36.0', ' | Episode', 239)
('Reward: 18.0', ' | Episode', 235)
('Reward: 14.0', ' | Episode', 236)
('Reward: 47.0', ' | Episode', 240)
('Reward: 21.0', ' | Episode', 237)
('Reward: 12.0', ' | Episode', 238)
('Reward: 33.0', ' | Episode', 241)
('Reward: 27.0', ' | Episode', 239)
('Reward: 13.0', ' | Episode', 240)
('Reward: 59.0', ' | Episode', 242)
('Reward: 30.0', ' | Episode', 241)
('Reward: 20.0', ' | Episode', 243)
('Reward: 12.0', ' | Episode', 242)
('Reward: 11.0', ' | Episode', 243)
('Reward: 14.0', ' | Episode', 244)
('Reward: 9.0', ' | Episode', 245)
('Reward: 36.0', ' | Episode', 244)
('Reward: 17.0', ' | Episode', 245)
('Reward: 48.0', ' | Episode'

('Reward: 40.0', ' | Episode', 352)
('Reward: 64.0', ' | Episode', 344)
('Reward: 29.0', ' | Episode', 353)
('Reward: 17.0', ' | Episode', 354)
('Reward: 23.0', ' | Episode', 345)
('Reward: 13.0', ' | Episode', 346)
('Reward: 25.0', ' | Episode', 355)
('Reward: 32.0', ' | Episode', 356)
('Reward: 15.0', ' | Episode', 357)
('Reward: 71.0', ' | Episode', 347)
('Reward: 57.0', ' | Episode', 358)
('Reward: 14.0', ' | Episode', 359)
('Reward: 55.0', ' | Episode', 348)
('Reward: 14.0', ' | Episode', 349)
('Reward: 14.0', ' | Episode', 360)
('Reward: 21.0', ' | Episode', 350)
('Reward: 31.0', ' | Episode', 361)
('Reward: 14.0', ' | Episode', 351)
('Reward: 17.0', ' | Episode', 362)
('Reward: 14.0', ' | Episode', 363)
('Reward: 27.0', ' | Episode', 352)
('Reward: 15.0', ' | Episode', 364)
('Reward: 17.0', ' | Episode', 353)
('Reward: 12.0', ' | Episode', 354)
('Reward: 49.0', ' | Episode', 365)
('Reward: 61.0', ' | Episode', 355)
('Reward: 35.0', ' | Episode', 366)
('Reward: 15.0', ' | Episode

('Reward: 41.0', ' | Episode', 459)
('Reward: 68.0', ' | Episode', 466)
('Reward: 39.0', ' | Episode', 460)
('Reward: 11.0', ' | Episode', 461)
('Reward: 12.0', ' | Episode', 462)
('Reward: 36.0', ' | Episode', 467)
('Reward: 18.0', ' | Episode', 463)
('Reward: 40.0', ' | Episode', 468)
('Reward: 28.0', ' | Episode', 464)
('Reward: 23.0', ' | Episode', 469)
('Reward: 23.0', ' | Episode', 465)
('Reward: 24.0', ' | Episode', 466)
('Reward: 61.0', ' | Episode', 470)
('Reward: 29.0', ' | Episode', 471)
('Reward: 71.0', ' | Episode', 467)
('Reward: 12.0', ' | Episode', 468)
('Reward: 34.0', ' | Episode', 472)
('Reward: 15.0', ' | Episode', 473)
('Reward: 23.0', ' | Episode', 469)
('Reward: 13.0', ' | Episode', 474)
('Reward: 9.0', ' | Episode', 475)
('Reward: 25.0', ' | Episode', 470)
('Reward: 28.0', ' | Episode', 471)
('Reward: 16.0', ' | Episode', 472)
('Reward: 46.0', ' | Episode', 476)
('Reward: 15.0', ' | Episode', 473)
('Reward: 20.0', ' | Episode', 474)
('Reward: 22.0', ' | Episode'

('Reward: 46.0', ' | Episode', 574)
('Reward: 63.0', ' | Episode', 580)
('Reward: 38.0', ' | Episode', 575)
('Reward: 42.0', ' | Episode', 581)
('Reward: 59.0', ' | Episode', 576)
('Reward: 56.0', ' | Episode', 582)
('Reward: 40.0', ' | Episode', 577)
('Reward: 17.0', ' | Episode', 578)
('Reward: 54.0', ' | Episode', 583)
('Reward: 52.0', ' | Episode', 584)
('Reward: 67.0', ' | Episode', 579)
('Reward: 84.0', ' | Episode', 585)
('Reward: 91.0', ' | Episode', 580)
('Reward: 36.0', ' | Episode', 586)
('Reward: 34.0', ' | Episode', 581)
('Reward: 24.0', ' | Episode', 587)
('Reward: 49.0', ' | Episode', 588)
('Reward: 99.0', ' | Episode', 582)
('Reward: 20.0', ' | Episode', 589)
('Reward: 9.0', ' | Episode', 590)
('Reward: 25.0', ' | Episode', 583)
('Reward: 13.0', ' | Episode', 584)
('Reward: 19.0', ' | Episode', 591)
('Reward: 17.0', ' | Episode', 585)
('Reward: 11.0', ' | Episode', 586)
('Reward: 68.0', ' | Episode', 592)
('Reward: 32.0', ' | Episode', 593)
('Reward: 106.0', ' | Episode

('Reward: 40.0', ' | Episode', 692)
('Reward: 92.0', ' | Episode', 691)
('Reward: 37.0', ' | Episode', 692)
('Reward: 69.0', ' | Episode', 693)
('Reward: 63.0', ' | Episode', 694)
('Reward: 76.0', ' | Episode', 693)
('Reward: 64.0', ' | Episode', 695)
('Reward: 33.0', ' | Episode', 696)
('Reward: 87.0', ' | Episode', 694)
('Reward: 79.0', ' | Episode', 697)
('Reward: 104.0', ' | Episode', 695)
('Reward: 49.0', ' | Episode', 696)
('Reward: 104.0', ' | Episode', 698)
('Reward: 36.0', ' | Episode', 697)
('Reward: 21.0', ' | Episode', 698)
('Reward: 38.0', ' | Episode', 699)
('Reward: 55.0', ' | Episode', 699)
('Reward: 102.0', ' | Episode', 700)
('Reward: 61.0', ' | Episode', 700)
('Reward: 28.0', ' | Episode', 701)
('Reward: 55.0', ' | Episode', 702)
('Reward: 101.0', ' | Episode', 701)
('Reward: 45.0', ' | Episode', 703)
('Reward: 25.0', ' | Episode', 704)
('Reward: 80.0', ' | Episode', 702)
('Reward: 44.0', ' | Episode', 705)
('Reward: 50.0', ' | Episode', 703)
('Reward: 72.0', ' | Epi

('Reward: 135.0', ' | Episode', 808)
('Reward: 26.0', ' | Episode', 809)('Reward: 94.0', ' | Episode', 804)

('Reward: 34.0', ' | Episode', 810)
('Reward: 59.0', ' | Episode', 805)
('Reward: 50.0', ' | Episode', 806)
('Reward: 102.0', ' | Episode', 811)
('Reward: 12.0', ' | Episode', 812)
('Reward: 132.0', ' | Episode', 807)
('Reward: 48.0', ' | Episode', 808)
('Reward: 64.0', ' | Episode', 809)
('Reward: 200.0', ' | Episode', 813)
('Reward: 94.0', ' | Episode', 810)
('Reward: 150.0', ' | Episode', 814)
('Reward: 85.0', ' | Episode', 811)
('Reward: 43.0', ' | Episode', 812)
('Reward: 27.0', ' | Episode', 813)
('Reward: 149.0', ' | Episode', 815)
('Reward: 49.0', ' | Episode', 814)
('Reward: 70.0', ' | Episode', 816)
('Reward: 104.0', ' | Episode', 815)
('Reward: 69.0', ' | Episode', 817)


Exception in thread Thread-20:
Traceback (most recent call last):
  File "/home/schnack/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/schnack/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "<ipython-input-14-4d76e79971fe>", line 33, in <lambda>
    worker_work = lambda: worker.work(GAMMA, sess, coord)
  File "<ipython-input-13-28a60037dcdd>", line 119, in work
    self.local_AC.state_in[1]: rnn_state[1]})
  File "/home/schnack/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/schnack/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/schnack/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/home/schnack/.local/lib/pyt

KeyboardInterrupt: 

To get a feeling for what the results can look like, here is a graph of the 12 workers we ran on our 8-core CPU server to train Pong with the [universe starter agent](https://github.com/openai/universe-starter-agent). After ~4 hours of training we reached an average score of +15. 

<img src="./images/universea3c_pong_reward.png" height="400" />

### GA3C - GPU & CPU

<img src="./images/a3c_gpu.png" width=650/>

<font>This section is heavily influenced by this [paper](https://arxiv.org/pdf/1611.06256.pdf) and the related [implementation](https://github.com/NVlabs/GA3C). Our implementation is simplified, of course, but should give you an idea of how it works. We won't have exactly the same structure as in the picture above, but that's fine. </font>

#### Imports

In [2]:
import numpy as np
import tensorflow as tf
import gym, time, random, threading
from keras.models import *
from keras.layers import *
from keras import backend as K

Using TensorFlow backend.


<font> As you can already see, we will use another framework: Keras. It runs on top of TensorFlow (a version is also bundled with TensorFlow itself) and makes creating neural networks a bit easier. You could still implement them in raw TensorFlow the same way. </font>

#### Hyperparameters

In [3]:
#-- constants
ENV = 'CartPole-v0'

RUN_TIME = 30 # How long we let it run, in seconds
THREADS = 12 # Number of threads 
OPTIMIZERS = 2 # Number of parallel optimizers working on the Brain
THREAD_DELAY = 0.005 # Thread delay, needed when running more than 1 thread per core

GAMMA = 0.99 # Discount factor

N_STEP_RETURN = 8 # N-step return for discounted reward
GAMMA_N = GAMMA ** N_STEP_RETURN # gamma ^n_step_return

EPS_START = 0.4 # epsilon exploration
EPS_STOP  = .15
EPS_STEPS = 75000

MIN_BATCH = 32 # minimal batch size for training
LEARNING_RATE = 1e-2 # learning rate

LOSS_V = .5 # v loss coefficient
LOSS_ENTROPY = .01 # entropy coefficient

frames = 0 
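
`N_STEP_RETURN` and `GAMMA_N` work together: the agent sums up to n discounted rewards and then bootstraps with the critic's estimate of the state reached, discounted by `GAMMA ** n`. A minimal sketch of that target follows. The helper `n_step_target` and the sample numbers are hypothetical; `Brain.optimize` below performs the bootstrap step on whole batches.

```python
GAMMA = 0.99
N_STEP_RETURN = 8
GAMMA_N = GAMMA ** N_STEP_RETURN


def n_step_target(rewards, v_last, terminal):
    """Discounted sum of n rewards plus the bootstrapped value of the state reached."""
    n_step_reward = sum((GAMMA ** k) * r for k, r in enumerate(rewards))
    mask = 0.0 if terminal else 1.0  # no bootstrap beyond a terminal state
    return n_step_reward + GAMMA_N * v_last * mask


rewards = [1.0] * N_STEP_RETURN  # CartPole pays +1 per step
target_running = n_step_target(rewards, v_last=10.0, terminal=False)
target_terminal = n_step_target(rewards, v_last=10.0, terminal=True)
print(target_running, target_terminal)
```

A larger n relies more on observed rewards and less on the critic's own estimate, at the cost of higher variance.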

#### Brain

In [4]:
class Brain:
    train_queue = [ [], [], [], [], [] ] # s, a, r, s', s' terminal mask
    lock_queue = threading.Lock()

    def __init__(self):
        self.session = tf.Session()
        K.set_session(self.session)
        K.manual_variable_initialization(True)

        self.model = self._build_model()
        self.graph = self._build_graph(self.model)

        self.session.run(tf.global_variables_initializer())
        self.default_graph = tf.get_default_graph()

        self.default_graph.finalize() # avoid modifications

    def _build_model(self):

        l_input = Input( batch_shape=(None, NUM_STATE) )
        l_dense = Dense(16, activation='relu')(l_input)

        out_actions = Dense(NUM_ACTIONS, activation='softmax')(l_dense)
        out_value   = Dense(1, activation='linear')(l_dense)

        model = Model(inputs=[l_input], outputs=[out_actions, out_value])
        model._make_predict_function() # have to initialize before threading

        return model

    def _build_graph(self, model):
        s_t = tf.placeholder(tf.float32, shape=(None, NUM_STATE))
        a_t = tf.placeholder(tf.float32, shape=(None, NUM_ACTIONS))
        r_t = tf.placeholder(tf.float32, shape=(None, 1)) # not immediate, but discounted n step reward

        p, v = model(s_t)

        log_prob = tf.log( tf.reduce_sum(p * a_t, axis=1, keep_dims=True) + 1e-10)
        advantage = r_t - v

        loss_policy = - log_prob * tf.stop_gradient(advantage)  # maximize policy
        loss_value  = LOSS_V * tf.square(advantage) # minimize value error
        entropy = LOSS_ENTROPY * tf.reduce_sum(p * tf.log(p + 1e-10), axis=1, keep_dims=True) # maximize entropy (regularization)

        loss_total = tf.reduce_mean(loss_policy + loss_value + entropy)

        optimizer = tf.train.RMSPropOptimizer(LEARNING_RATE, decay=.99)
        minimize = optimizer.minimize(loss_total)

        return s_t, a_t, r_t, minimize

    def optimize(self):
        if len(self.train_queue[0]) < MIN_BATCH:
            time.sleep(0) # yield
            return

        with self.lock_queue:
            if len(self.train_queue[0]) < MIN_BATCH: # another optimizer may have emptied the queue in the meantime
                return  # we can't yield inside lock

            s, a, r, s_, s_mask = self.train_queue
            self.train_queue = [ [], [], [], [], [] ]

        s = np.vstack(s)
        a = np.vstack(a)
        r = np.vstack(r)
        s_ = np.vstack(s_)
        s_mask = np.vstack(s_mask)

        if len(s) > 5*MIN_BATCH: print("Optimizer alert! Minimizing batch of %d" % len(s))

        v = self.predict_v(s_)
        r = r + GAMMA_N * v * s_mask  # set v to 0 where s_ is terminal state

        s_t, a_t, r_t, minimize = self.graph
        self.session.run(minimize, feed_dict={s_t: s, a_t: a, r_t: r})

    def train_push(self, s, a, r, s_):
        with self.lock_queue:
            self.train_queue[0].append(s)
            self.train_queue[1].append(a)
            self.train_queue[2].append(r)

            if s_ is None:
                self.train_queue[3].append(NONE_STATE)
                self.train_queue[4].append(0.)
            else:
                self.train_queue[3].append(s_)
                self.train_queue[4].append(1.)

    def predict(self, s):
        with self.default_graph.as_default():
            #Keras runs the graph once
            p, v = self.model.predict(s)
            return p, v

    def predict_p(self, s):
        with self.default_graph.as_default():
            p, v = self.model.predict(s)
            return p

    def predict_v(self, s):
        with self.default_graph.as_default():
            p, v = self.model.predict(s)    
            return v
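
The queue handling in `Brain.optimize` and `Brain.train_push` follows a check, lock, re-check pattern: a cheap length test without the lock, a second test once the lock is held, and a swap of the whole queue so the expensive training step runs outside the lock. A stripped-down sketch with hypothetical names (`push`, `optimize`, `batches`) and a small `MIN_BATCH`, run single-threaded so the result is deterministic:

```python
import threading

MIN_BATCH = 4              # hypothetical small value for the demo
queue = []                 # stands in for Brain.train_queue
lock = threading.Lock()
batches = []               # batches handed to the (omitted) training step


def push(sample):
    with lock:
        queue.append(sample)


def optimize():
    global queue
    if len(queue) < MIN_BATCH:      # cheap check without the lock
        return False
    with lock:
        if len(queue) < MIN_BATCH:  # re-check: another optimizer may have drained it
            return False
        batch, queue = queue, []    # take the whole queue while holding the lock
    batches.append(batch)           # expensive work happens outside the lock
    return True


for i in range(10):
    push(i)
    optimize()

print(batches, queue)
```

Keeping the training step outside the lock means agents can continue pushing new samples while an optimizer is busy.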

#### Agent

In [5]:
class Agent:
    def __init__(self, eps_start, eps_end, eps_steps):
        self.eps_start = eps_start
        self.eps_end   = eps_end
        self.eps_steps = eps_steps

        self.memory = [] # used for n_step return
        self.R = 0
        
    def getEpsilon(self):
        if frames >= self.eps_steps:
            return self.eps_end
        else:
            # linearly interpolate between eps_start and eps_end
            return self.eps_start + frames * (self.eps_end - self.eps_start) / self.eps_steps

    def act(self, s):
        eps = self.getEpsilon()
        global frames; frames = frames + 1

        if random.random() < eps:
            return random.randint(0, NUM_ACTIONS-1)

        else:
            s = np.array([s])
            p = brain.predict_p(s)[0]

            # a = np.argmax(p) # greedy
            a = np.random.choice(NUM_ACTIONS, p=p)

            return a

    def train(self, s, a, r, s_):
        def get_sample(memory, n):
            # state/action come from the oldest transition, next state from the n-th one
            s, a, _, _  = memory[0]
            _, _, _, s_ = memory[n-1]

            return s, a, self.R, s_

        a_cats = np.zeros(NUM_ACTIONS)  # turn action into one-hot representation
        a_cats[a] = 1

        self.memory.append( (s, a_cats, r, s_) )

        # running discounted sum of the rewards currently held in memory
        self.R = ( self.R + r * GAMMA_N ) / GAMMA

        if s_ is None:
            # terminal state: flush the remaining transitions with a shrinking n
            while len(self.memory) > 0:
                n = len(self.memory)
                s, a, r, s_ = get_sample(self.memory, n)
                brain.train_push(s, a, r, s_)

                self.R = ( self.R - self.memory[0][2] ) / GAMMA
                self.memory.pop(0)

            self.R = 0

        if len(self.memory) >= N_STEP_RETURN:
            # memory is full: emit one n-step sample and drop the oldest reward
            s, a, r, s_ = get_sample(self.memory, N_STEP_RETURN)
            brain.train_push(s, a, r, s_)

            self.R = self.R - self.memory[0][2]
            self.memory.pop(0)
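
The running sum `self.R` kept by `Agent.train` is easy to misread, so it may help to check it against a direct computation of the n-step discounted return. The following is a minimal standalone sketch for a non-terminal reward stream; the function names and the small value `N_STEP_RETURN = 3` are assumptions chosen for illustration and are not the notebook's settings.

```python
GAMMA = 0.99
N_STEP_RETURN = 3                      # illustrative value, not the notebook's
GAMMA_N = GAMMA ** N_STEP_RETURN


def rolling_n_step_returns(rewards):
    """Replay the running-sum bookkeeping of Agent.train (non-terminal case)."""
    window = []   # rewards currently "in memory"
    R = 0.0       # running discounted sum over the window
    out = []
    for r in rewards:
        window.append(r)
        # divide the old sum by GAMMA and add the new reward with weight GAMMA^(N-1)
        R = (R + r * GAMMA_N) / GAMMA
        if len(window) >= N_STEP_RETURN:
            out.append(R)          # n-step return of the oldest transition
            R = R - window[0]      # drop the oldest reward from the sum
            window.pop(0)
    return out


def direct_n_step_returns(rewards):
    """Compute the same returns directly from the definition."""
    out = []
    for t in range(len(rewards) - N_STEP_RETURN + 1):
        out.append(sum(GAMMA ** i * rewards[t + i] for i in range(N_STEP_RETURN)))
    return out


rewards = [1.0, 0.0, 2.0, 1.0, 3.0]
print(rolling_n_step_returns(rewards))
print(direct_n_step_returns(rewards))
```

Both functions should print the same three values up to floating-point rounding, which shows that the update `(R + r * GAMMA_N) / GAMMA` followed by subtracting the oldest reward maintains the sum $r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1}$ without re-summing the window at every step.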

#### Environment

In [6]:
class Environment(threading.Thread):
    stop_signal = False

    def __init__(self, render=False, eps_start=EPS_START, eps_end=EPS_STOP, eps_steps=EPS_STEPS):
        threading.Thread.__init__(self)

        self.render = render
        self.env = gym.make(ENV)
        self.agent = Agent(eps_start, eps_end, eps_steps)
        self.episode_number = 0
        
    def runEpisode(self):
        s = self.env.reset()
        
        R = 0
        while True:         
            time.sleep(THREAD_DELAY) # yield 

            if self.render: self.env.render()

            a = self.agent.act(s)
            s_, r, done, info = self.env.step(a)

            if done: # terminal state
                s_ = None

            self.agent.train(s, a, r, s_)

            s = s_
            R += r

            if done or self.stop_signal:
                break
                
        self.episode_number += 1
        print(self.episode_number, " | Total Reward:", R)

    def run(self):
        while not self.stop_signal:
            self.runEpisode()

    def stop(self):
        self.stop_signal = True

#### Optimizer

In [7]:
class Optimizer(threading.Thread):
    stop_signal = False

    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        while not self.stop_signal:
            brain.optimize()

    def stop(self):
        self.stop_signal = True
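
Both `Environment` and `Optimizer` follow the same pattern: a thread that loops until a boolean flag is flipped from outside. As a hedged alternative, the same idea can be sketched with `threading.Event`, which also replaces the `time.sleep` yield with an interruptible wait. The class `StoppableWorker`, the helper `run_workers` and the `iterations` counter are hypothetical names used only for this illustration; they are not part of the notebook.

```python
import threading
import time


class StoppableWorker(threading.Thread):
    """Thread that repeats a unit of work until stop() is called."""

    def __init__(self, delay=0.001):
        threading.Thread.__init__(self)
        self.stop_event = threading.Event()
        self.delay = delay
        self.iterations = 0

    def run(self):
        while True:
            self.iterations += 1  # placeholder for runEpisode() or brain.optimize()
            # wait() returns True as soon as the event is set, False on timeout
            if self.stop_event.wait(self.delay):
                break

    def stop(self):
        self.stop_event.set()


def run_workers(num_workers, run_time):
    """Start the workers, let them run for run_time seconds, then stop and join."""
    workers = [StoppableWorker() for _ in range(num_workers)]
    for w in workers:
        w.start()
    time.sleep(run_time)
    for w in workers:
        w.stop()
    for w in workers:
        w.join()
    return workers


workers = run_workers(num_workers=3, run_time=0.05)
print([w.is_alive() for w in workers])
```

Signalling all workers first and joining them afterwards, as the Main cell does, lets the threads wind down in parallel rather than one after another.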

#### Main

In [8]:
env_test = Environment(render=True, eps_start=0., eps_end=0.)
NUM_STATE = env_test.env.observation_space.shape[0]
NUM_ACTIONS = env_test.env.action_space.n
NONE_STATE = np.zeros(NUM_STATE)

brain = Brain() # brain is global in A3C

envs = [Environment() for i in range(THREADS)]
opts = [Optimizer() for i in range(OPTIMIZERS)]

for o in opts:
    o.start()

for e in envs:
    e.start()

time.sleep(RUN_TIME)

for e in envs:
    e.stop()
for e in envs:
    e.join()

for o in opts:
    o.stop()
for o in opts:
    o.join()

print("Training finished")


[2017-09-21 17:05:07,421] Making new env: CartPole-v0
[2017-09-21 17:05:07,813] Making new env: CartPole-v0
[2017-09-21 17:05:07,816] Making new env: CartPole-v0
[2017-09-21 17:05:07,822] Making new env: CartPole-v0
[2017-09-21 17:05:07,825] Making new env: CartPole-v0
[2017-09-21 17:05:07,828] Making new env: CartPole-v0
[2017-09-21 17:05:07,832] Making new env: CartPole-v0
[2017-09-21 17:05:07,835] Making new env: CartPole-v0
[2017-09-21 17:05:07,838] Making new env: CartPole-v0
[2017-09-21 17:05:07,841] Making new env: CartPole-v0
[2017-09-21 17:05:07,845] Making new env: CartPole-v0
[2017-09-21 17:05:07,848] Making new env: CartPole-v0
[2017-09-21 17:05:07,851] Making new env: CartPole-v0


(1, ' | Total Reward:', 15.0)
(1, ' | Total Reward:', 15.0)
(1, ' | Total Reward:', 16.0)
(1, ' | Total Reward:', 17.0)
(1, ' | Total Reward:', 12.0)
(1, ' | Total Reward:', 20.0)
(1, ' | Total Reward:', 16.0)
(1, ' | Total Reward:', 21.0)
(1, ' | Total Reward:', 23.0)
(2, ' | Total Reward:', 11.0)
(2, ' | Total Reward:', 13.0)
(2, ' | Total Reward:', 23.0)
(1, ' | Total Reward:', 33.0)
(3, ' | Total Reward:', 17.0)
(3, ' | Total Reward:', 17.0)
(2, ' | Total Reward:', 30.0)
(2, ' | Total Reward:', 28.0)
(1, ' | Total Reward:', 39.0)
(2, ' | Total Reward:', 11.0)
(2, ' | Total Reward:', 44.0)
(3, ' | Total Reward:', 12.0)
(2, ' | Total Reward:', 38.0)
(3, ' | Total Reward:', 14.0)
(1, ' | Total Reward:', 52.0)
(3, ' | Total Reward:', 23.0)
(2, ' | Total Reward:', 36.0)
(2, ' | Total Reward:', 17.0)
(2, ' | Total Reward:', 44.0)
(4, ' | Total Reward:', 27.0)
(3, ' | Total Reward:', 14.0)
(4, ' | Total Reward:', 12.0)
(3, ' | Total Reward:', 24.0)
(5, ' | Total Reward:', 10.0)
(4, ' | To

(26, ' | Total Reward:', 127.0)
(25, ' | Total Reward:', 127.0)
(21, ' | Total Reward:', 68.0)
(18, ' | Total Reward:', 200.0)
(24, ' | Total Reward:', 200.0)
(26, ' | Total Reward:', 120.0)
(25, ' | Total Reward:', 118.0)
(24, ' | Total Reward:', 108.0)
(25, ' | Total Reward:', 200.0)
(25, ' | Total Reward:', 121.0)
(19, ' | Total Reward:', 44.0)
(27, ' | Total Reward:', 52.0)
(26, ' | Total Reward:', 200.0)
(27, ' | Total Reward:', 129.0)
(28, ' | Total Reward:', 32.0)
(17, ' | Total Reward:', 200.0)
(27, ' | Total Reward:', 48.0)
(26, ' | Total Reward:', 108.0)
(26, ' | Total Reward:', 112.0)
(26, ' | Total Reward:', 200.0)
(22, ' | Total Reward:', 200.0)
(25, ' | Total Reward:', 200.0)
(26, ' | Total Reward:', 200.0)
(25, ' | Total Reward:', 200.0)
(23, ' | Total Reward:', 56.0)
(20, ' | Total Reward:', 200.0)
(26, ' | Total Reward:', 27.0)
(27, ' | Total Reward:', 92.0)
(28, ' | Total Reward:', 163.0)
(27, ' | Total Reward:', 73.0)
(29, ' | Total Reward:', 199.0)
(26, ' | Total Re

In [9]:
env_test.run()

(1, ' | Total Reward:', 200.0)
(2, ' | Total Reward:', 200.0)
(3, ' | Total Reward:', 200.0)
(4, ' | Total Reward:', 200.0)
(5, ' | Total Reward:', 200.0)
(6, ' | Total Reward:', 200.0)
(7, ' | Total Reward:', 200.0)
(8, ' | Total Reward:', 200.0)
(9, ' | Total Reward:', 200.0)
(10, ' | Total Reward:', 200.0)
(11, ' | Total Reward:', 200.0)
(12, ' | Total Reward:', 200.0)
(13, ' | Total Reward:', 200.0)
(14, ' | Total Reward:', 200.0)
(15, ' | Total Reward:', 200.0)
(16, ' | Total Reward:', 200.0)
(17, ' | Total Reward:', 200.0)
(18, ' | Total Reward:', 200.0)


ArgumentError: argument 2: <type 'exceptions.TypeError'>: wrong type

From here on, the algorithms become too complex to simplify and present as code in this notebook. A good place to continue is the OpenAI [baselines](https://github.com/openai/baselines) repository, a growing collection of high-quality implementations of different RL algorithms.