# Notes for DQN

In [1]:
import tensorflow as tf

tf.enable_eager_execution()


In [3]:
import numpy as np

In [4]:
tf.VERSION

'1.7.0'

## Deep Q Learning

### How to populate replay memory

Adding a trailing axis in numpy: indexing with `None` is equivalent to `np.expand_dims(..., axis=2)`:

In [43]:
(state_next[:, :, None] == np.expand_dims(state_next, axis=2)).all() 

True

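At each step, we generate one 84 x 84 image. Each state is a stack of the 4 most recent images, stored here as an 84 x 84 x 4 array (treat it like an image with 4 channels). To move to the next state, drop the oldest image and append the new one along the last axis.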

In [57]:
states = np.stack([np.random.randint(0, 255, [84, 84])] * 4, axis=2)
state_next = np.random.randint(0, 255, [84, 84])
new_states = np.append(states[:, :, 1:], np.expand_dims(state_next, 2), axis=2)
print((new_states[0, :, 3] == state_next[0, :]).all())

True
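The same frame update can feed a replay memory. The following is a minimal sketch, assuming a fixed-capacity `deque` and a `Transition` namedtuple. The names `Transition` and `push_frame`, the capacity, and the random stand-ins for the action, reward, and observation are illustrative assumptions, not part of these notes.

```python
import random
from collections import deque, namedtuple

import numpy as np

# Hypothetical container for one transition; the field names are assumptions.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

def push_frame(state, frame):
    """Drop the oldest frame and append the new one along the channel axis."""
    return np.append(state[:, :, 1:], np.expand_dims(frame, 2), axis=2)

replay_memory = deque(maxlen=1000)  # oldest transitions are evicted automatically

state = np.stack([np.random.randint(0, 255, [84, 84])] * 4, axis=2)
for step in range(50):
    frame = np.random.randint(0, 255, [84, 84])  # stand-in for an env observation
    action = random.randrange(4)                 # stand-in for an epsilon-greedy choice
    reward, done = 0.0, False                    # stand-in for env.step outputs
    next_state = push_frame(state, frame)
    replay_memory.append(Transition(state, action, reward, next_state, done))
    state = next_state

# Sample a minibatch and regroup it field by field.
batch = random.sample(list(replay_memory), 8)
states_batch, action_batch, reward_batch, next_states_batch, done_batch = map(np.array, zip(*batch))
print(states_batch.shape)
```

Storing both `state` and `next_state` duplicates frames in memory; it keeps the sketch simple, at the cost of space.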


### How to calculate loss


To compute the loss for each transition (state, action, reward, next_state):
1. state |[$\epsilon$-greedy]=> action |[env.step]=> transition (state, action, reward, next_state)
2. next_state |=> q_values_next |[greedy: max over actions]=> $target = reward + discount\_factor \cdot \max(q\_values\_next)$, where the second term is dropped if the episode is done
3. state, action |=> q_value for the chosen action
4. $loss = \frac{1}{2}(target - q\_value)^2$

The chosen action plays two roles: it generates the next state (through env.step) and it selects which q_value enters the loss.
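The steps above can be sketched for a single transition in plain NumPy. The Q-values, the reward, and `discount_factor = 0.99` are made-up numbers for illustration; in practice the Q-values come from the estimator network.

```python
import numpy as np

discount_factor = 0.99  # assumed value for illustration

# Made-up estimator outputs for one transition with 4 actions.
q_values = np.array([0.2, 0.5, 0.1, 0.4])       # Q(state, .)
q_values_next = np.array([0.3, 0.9, 0.6, 0.2])  # Q(next_state, .)
action = 1    # action chosen by epsilon-greedy in step 1
reward = 1.0  # reward returned by env.step

# Step 2: the greedy target bootstraps from the best next-state value.
target = reward + discount_factor * np.max(q_values_next)
# Step 3: Q-value of the action that was actually taken.
q_value = q_values[action]
# Step 4: halved squared error between target and prediction.
loss = 0.5 * (target - q_value) ** 2
print(target, q_value, loss)
```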

### Select predictions (Q-values) for only the chosen actions

In [38]:
batch_size = 10
action_space = [0, 1, 2, 3] 
n_actions = len(action_space)

# predictions (q-values) from q-estimator for each obs
predictions = np.random.uniform(size=(batch_size, n_actions))
print('predictions batch_size X num_actions')
print(predictions)

# chosen actions for each obs (greedy argmax here, standing in for epsilon-greedy)
actions_pl = [np.argmax(pred) for pred in predictions]

gather_indices = tf.range(batch_size) * tf.shape(predictions)[1] + actions_pl
print('gather_indices: {}'.format(gather_indices))
# get q-value for chosen action for each obs
action_predictions = tf.gather(tf.reshape(predictions, [-1]), gather_indices)
print('action_predictions: {}'.format(action_predictions))

predictions batch_size X num_actions
[[0.08078424 0.76907652 0.57850815 0.98225583]
 [0.99025714 0.32141543 0.60348217 0.7804746 ]
 [0.11778576 0.84922858 0.92765091 0.61298925]
 [0.30463415 0.50167728 0.59557644 0.12356276]
 [0.07895121 0.10563141 0.02999877 0.13290146]
 [0.72537502 0.56490885 0.75581052 0.68317478]
 [0.19152837 0.4128727  0.9772954  0.62705295]
 [0.37081678 0.23505685 0.26570782 0.13261947]
 [0.54624099 0.90158146 0.74054228 0.75706725]
 [0.93635203 0.51483494 0.8761039  0.59076226]]
gather_indices: [ 3  4 10 14 19 22 26 28 33 36]
action_predictions: [0.98225583 0.99025714 0.92765091 0.59557644 0.13290146 0.75581052
 0.9772954  0.37081678 0.90158146 0.93635203]
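For comparison, the same selection can be written in plain NumPy, where integer-array indexing avoids the flatten-and-offset step. This is a sketch of an equivalent technique, not the TensorFlow code above; the small `predictions` array and the `actions` values are made up.

```python
import numpy as np

predictions = np.array([[0.1, 0.7, 0.2],
                        [0.9, 0.3, 0.6]])
actions = np.array([1, 2])  # one chosen action per row
batch_size, n_actions = predictions.shape

# Flatten-and-offset: row i starts at position i * n_actions in the flat array.
gather_indices = np.arange(batch_size) * n_actions + actions
via_flat = predictions.reshape(-1)[gather_indices]

# Direct (row, column) indexing picks the same entries.
via_index = predictions[np.arange(batch_size), actions]

print(gather_indices)  # [1 5]
print(via_index)       # [0.7 0.6]
```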


Use `np.invert` to negate `done_batch`, so that the bootstrapped term is zeroed for transitions that ended an episode:
```python
targets = reward_batch + np.invert(done_batch).astype(np.float32) * discount_factor * np.max(value_next_state_batch, axis=1)
```

In [4]:
np.invert(np.array([True, False])).astype(np.float32)

array([0., 1.], dtype=float32)
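A small worked example of the masked target, using made-up batch values and an assumed `discount_factor` of 0.9. The second transition is terminal, so its target is just the reward.

```python
import numpy as np

discount_factor = 0.9  # assumed value for illustration
reward_batch = np.array([1.0, 0.0, 2.0], dtype=np.float32)
done_batch = np.array([False, True, False])
# Made-up next-state Q-values: one row per transition, one column per action.
value_next_state_batch = np.array([[0.5, 1.0],
                                   [3.0, 4.0],
                                   [0.0, 2.0]], dtype=np.float32)

# 1.0 where the episode continues, 0.0 where it ended.
not_done = np.invert(done_batch).astype(np.float32)
targets = reward_batch + not_done * discount_factor * np.max(value_next_state_batch, axis=1)
print(targets)
```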