This post shows how to perform two of the operations required for DQN and DDQN:
- copying online network weights to the target network
- sharing weights between an online network that predicts Q(s,a) and an online network that predicts Q(s',a)

In DQN we parameterize two neural networks:
- an online network, which is used to select actions via an argmax across all actions
- a target network, which is used to estimate `Q(s',a)`, the expected discounted return from the next state

The online network weights are changed to minimize the temporal difference error, the gap between the current estimate and the Bellman target:
`td_error = Q(s,a) - (r + gamma * max_a' Q(s',a'))`
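As a worked example, the temporal difference error for a single transition is the online estimate `Q(s,a)` minus the Bellman target, where the target is the reward plus the discounted maximum of the target network's estimates for the next state. The sketch below uses plain Python rather than TensorFlow; the function name `td_error`, the `done` flag, and the numbers are illustrative assumptions, not part of the original notebook.

```python
def td_error(q_sa, reward, next_q_values, gamma=0.99, done=False):
    """Temporal difference error for one transition (hypothetical helper).

    q_sa           online network estimate Q(s, a) for the action taken
    reward         reward r received for the transition
    next_q_values  target network estimates Q(s', a') for every action a'
    done           True if s' is terminal (assumed convention: no bootstrap)
    """
    #  terminal transitions have no future return to bootstrap from
    bootstrap = 0.0 if done else gamma * max(next_q_values)
    target = reward + bootstrap
    return q_sa - target


#  made-up numbers: the Bellman target is 0.5 + 0.9 * 2.0
error = td_error(q_sa=1.0, reward=0.5, next_q_values=[0.2, 2.0, -1.0], gamma=0.9)
print(error)
```

Note that the whole target is subtracted from `Q(s,a)`, so the reward and the bootstrap term carry the same sign.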

To implement DQN we need some way to update the target network parameters as our online network changes. There are two methods for this:
1 - every C steps, copy the online weights to the target weights
2 - at each step, set the target network weights to a weighted combination of the old target weights and the online network weights
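One way a single function can cover both methods is to mix the two sets of weights with a coefficient `tau`: `tau = 1.0` gives the hard copy of method 1, and a small `tau` applied at every step gives the weighted combination of method 2. The sketch below is framework-agnostic plain Python, with weights stored as lists of floats; the name `update_target`, the dictionary layout, and the values are assumptions made for illustration.

```python
def update_target(online, target, tau=1.0):
    """Return new target weights: tau * online + (1 - tau) * target.

    online, target  dicts mapping a variable name to a list of floats
    tau             1.0 copies the online weights, smaller values blend
    """
    return {
        name: [tau * w_online + (1.0 - tau) * w_target
               for w_online, w_target in zip(online[name], target[name])]
        for name in target
    }


#  toy weights, not taken from the notebook
online = {'weights': [1.0, 2.0], 'bias': [0.5]}
target = {'weights': [0.0, 0.0], 'bias': [0.0]}

hard = update_target(online, target, tau=1.0)   #  method 1: full copy
soft = update_target(online, target, tau=0.25)  #  method 2: partial step
print(hard)
print(soft)
```

In a TensorFlow graph the same arithmetic would be expressed as one assign operation per pair of variables, so that the update runs inside the session.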

Below I show how to do both of these in TensorFlow using a single function.

In [1]:
import numpy as np

import tensorflow as tf



In DDQN the structure of the Bellman target differs from DQN. We use the online network to select the best action in the next state, but use the target network to estimate the value of that action.
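The difference between the two targets can be sketched in plain Python. In the DQN target the target network both picks and evaluates the next action; in the DDQN target the online network picks the action and the target network evaluates it. The function names, the `done` flag, and the numbers below are illustrative assumptions.

```python
def dqn_target(reward, target_next_q, gamma=0.99, done=False):
    """DQN: the target network selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(target_next_q)


def ddqn_target(reward, online_next_q, target_next_q, gamma=0.99, done=False):
    """DDQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    #  argmax over the online network's estimates for the next state
    best_action = max(range(len(online_next_q)), key=lambda a: online_next_q[a])
    return reward + gamma * target_next_q[best_action]


#  made-up Q(s', a) estimates for three actions
online_next_q = [1.0, 3.0, 2.0]
target_next_q = [0.5, 0.4, 0.9]

print(dqn_target(1.0, target_next_q, gamma=0.5))
print(ddqn_target(1.0, online_next_q, target_next_q, gamma=0.5))
```

Here the online network prefers action 1, so the DDQN target uses the target network's value for action 1 even though the target network itself rates action 2 higher.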

We want to be able to do the training operation in a single TensorFlow session call (session calls are expensive!). To do this we need a second online network that shares weights with our acting online network but is connected to a different placeholder.

Below I show how to share weights between two online networks, and how to create a target network that has different weights. To do this we need to do a few things:
- use `tf.get_variable` to create weights and biases
- create both networks under the same variable scope
- call `scope.reuse_variables()` in between
- set `reuse=tf.AUTO_REUSE` in the lower-level variable scope (the `tf.variable_scope` call inside `fully_connected_layer`)

In [2]:
obs = tf.placeholder(shape=(None, 5), dtype=tf.float32)
next_obs = tf.placeholder(shape=(None, 5), dtype=tf.float32)

o_p = np.arange(5).reshape(1, 5)
no_p = np.arange(5).reshape(1, 5)

In [3]:
def fully_connected_layer(scope, 
                          input_tensor, 
                          input_shape, 
                          output_nodes,
                          activation='relu'):
    """
    Creates a single fully connected layer
    
    args
        scope (str) usually 'input_layer' or 'hidden_layer_2' etc
        input_tensor (tensor) 
        input_shape (tuple or int) 
        output_nodes (int)
        activation (str) currently support relu or linear
        
    To correctly name the variables and still allow variable sharing:
    with tf.name_scope('online_network'):
        layer = fully_connected_layer('input_layer', ...)
        
    """
    #  feed input shape as a tuple for support for high dimensional inputs
    if isinstance(input_shape, int):
        input_shape = (input_shape,)
    
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        weights = tf.get_variable(
            'weights',
            shape=(*input_shape, output_nodes),
            initializer=tf.contrib.layers.xavier_initializer()
        )

        bias = tf.get_variable(
            'bias',
            shape=(output_nodes,),
            initializer=tf.zeros_initializer()
        )

        layer = tf.add(
            tf.matmul(input_tensor, weights),
            bias,
            name='layer'
        )
        
    if activation == 'relu':
        return tf.nn.relu(layer)
    
    elif activation == 'linear':
        return layer
    
    else:
        raise ValueError(
            'Activation of {} not supported'.format(activation))
    
def feed_forward_network(scope,
                         input_tensor,
                         input_shape,
                         hiddens,
                         output_nodes):
    """
    Creates a feed forward neural network (aka multilayer perceptron)
    
    args
        scope (str) name scope for the network's operations
        input_tensor (tensor)
        input_shape (tuple or int)
        hiddens (list) nodes per layer, hiddens[0] is the first (input) layer
        output_nodes (int)
    """
    with tf.name_scope(scope):
        layer = fully_connected_layer(
            'input_layer',
            input_tensor,
            input_shape,
            hiddens[0])

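        for layer_num, nodes in enumerate(hiddens[1:]):
            #  enumerate starts at 0 while nodes starts at hiddens[1],
            #  so the previous layer's size is hiddens[layer_num]
            layer = fully_connected_layer(
                'hidden_layer_{}'.format(layer_num),
                layer,
                (hiddens[layer_num],),
                nodes
            )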

        output_layer = fully_connected_layer(
            'output_layer',
            layer,
            (hiddens[-1],),
            output_nodes,
            activation='linear'
        )

    return output_layer
        

In [4]:
with tf.variable_scope('online_networks') as scope:
    with tf.name_scope('online_obs'):
        online = fully_connected_layer('layer_1', obs, (5,), 10)
    
    scope.reuse_variables()
    
    with tf.name_scope('online_next_obs'):
        online_double_q = fully_connected_layer('layer_1', next_obs, (5,), 10)

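#  tf.get_variable ignores tf.name_scope, so these variables are created as
#  'layer_1/...' rather than 'online_networks/layer_1/...' and are not shared
with tf.name_scope('target'):
    target = fully_connected_layer('layer_1', obs, (5,), 10)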

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    o = sess.run(online, {obs: o_p})
    
    d = sess.run(online_double_q, {next_obs: no_p})
    
    t = sess.run(target, {obs: o_p})
    
print(o)
print(d)
print(t)

[[0.35389137 0.         4.3206797  0.         0.918682   1.7593181
  0.         0.         0.         1.2145033 ]]
[[0.35389137 0.         4.3206797  0.         0.918682   1.7593181
  0.         0.         0.         1.2145033 ]]
[[0.         3.0361607  0.28483748 0.         0.         0.
  0.         0.         0.34937274 0.        ]]


In [5]:
#  now let's build full networks with feed_forward_network

tf.reset_default_graph()

obs = tf.placeholder(shape=(None, 5), dtype=tf.float32, name='observation')
next_obs = tf.placeholder(shape=(None, 5), dtype=tf.float32, name='next_observation')

o_p = np.arange(5).reshape(1, 5)
no_p = np.arange(5).reshape(1, 5)

with tf.variable_scope('online_networks') as scope:

    online_obs = feed_forward_network('online_obs', obs, (5,), (5, 5), 2)
    
    scope.reuse_variables()

    online_next_obs = feed_forward_network('online_next_obs', next_obs, (5,), (5, 5), 2)

with tf.variable_scope('target_network') as scope:
    target = feed_forward_network('target', next_obs, (5,), (5, 5), 2)

In [6]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    o = sess.run(online_obs, {obs: o_p})
    
    d = sess.run(online_next_obs, {next_obs: no_p})
    
    t = sess.run(target, {next_obs: no_p})
    

In [7]:
o

array([[-2.0789158 , -0.71550477]], dtype=float32)

In [8]:
d

array([[-2.0789158 , -0.71550477]], dtype=float32)

In [9]:
t

array([[0.01619816, 2.4164324 ]], dtype=float32)