# Better Deep Q Networks

Tricks like feature engineering of the state space or state-space discretization can still be applied to improve the performance of DQNs (though discretization is no longer required), but researchers have also invented several tactics specific to training Q-networks. The code below uses the `keras-rl` package, which adds some powerful enhancements to the naive DQN concept, including:

## Experience Replay

Store a list of state-action->state' transitions and their associated rewards in a memory buffer, then replay (and learn from) these stored memories later. This solves two problems: 

1. We can train from memory to increase the number of updates to the network. This is akin to training multiple epochs on the same training data, where the training data is now a sample of our experiences. 
2. Gradient descent works better with independent transitions, but if we only learn "online" (during the game), consecutive updates come from consecutive, highly correlated steps. Sampling stored transitions out of order breaks that chronological correlation, which has proven helpful in getting better Q-values for state-action pairs.
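
The two points above can be sketched with a small, library-free buffer. This is only an illustration of the idea, not how `keras-rl` implements its memory: the `ReplayBuffer` class, the `Transition` tuple, and the toy integer states are all assumptions made for this example.

```python
import random
from collections import deque, namedtuple

# One stored experience: the transition plus its reward and an end-of-episode flag.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size memory of transitions; the oldest entries are dropped first."""

    def __init__(self, limit, seed=None):
        self.buffer = deque(maxlen=limit)
        self.rng = random.Random(seed)

    def append(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement mixes old and new experiences,
        # so a training batch is no longer a run of consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Toy usage: integers stand in for real observations.
buffer = ReplayBuffer(limit=5, seed=0)
for step in range(8):
    buffer.append(state=step, action=step % 2, reward=-1.0, next_state=step + 1, done=False)

batch = buffer.sample(batch_size=3)
```

Because the buffer holds only the 5 most recent transitions, the three sampled experiences come from steps 3 through 7, in no particular order. Each stored transition can be drawn many times over the course of training, which is what lets the agent make more updates than it has taken steps.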

## Separate Target and Q-Networks

When we learn "online" we use the same neural network (same weights and parameters) both to compute the target and to update our estimator. In Q-learning we have no underlying source of truth for the value of state-action pairs; we are always estimating it. When the estimator also acts as the goal and both are constantly being updated, it's like trying to hit a moving target. Researchers have found that holding the target fixed for a period of time, training the estimator toward that target, and then periodically refreshing the target with the new weights can improve performance and speed up the learning process.
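
As a rough sketch of the idea, the snippet below treats each network's weights as a plain list of floats and shows two common ways to refresh the target: a periodic full copy ("hard" update) and a small step toward the online weights on every update ("soft" update). The function names, the toy weights, and the values of `tau` and `gamma` are assumptions chosen for illustration; none of this is `keras-rl` code.

```python
def hard_update(online_weights):
    """Replace the target weights with a copy of the online weights."""
    return list(online_weights)


def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction tau of the way toward the online weight."""
    return [(1.0 - tau) * t + tau * w for t, w in zip(target_weights, online_weights)]


def td_target(reward, next_q_values_from_target, gamma, done):
    """Learning target built from the *target* network's estimates for the next state."""
    if done:
        # No future reward once the episode has ended.
        return reward
    return reward + gamma * max(next_q_values_from_target)


# Toy weights: the online network has moved away from the frozen target.
online = [1.0, 2.0]
target = [0.0, 0.0]

soft_target = soft_update(target, online, tau=0.5)  # halfway toward the online weights
hard_target = hard_update(online)                   # exact copy of the online weights

# The estimator is trained toward this value, which only changes when the target does.
y = td_target(reward=1.0, next_q_values_from_target=[0.5, 2.0], gamma=0.5, done=False)
```

With a small `tau`, the target drifts slowly behind the online network, so the goal the estimator is chasing stays nearly fixed from one update to the next.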


In [3]:
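from PIL import Image
import numpy as np
import gym
from gym import wrappers      # wrappers.Monitor is used by the playback cells below
from IPython import display   # display.display is used by the playback cells below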

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam
import keras.backend as K

# rl is the keras-rl package.
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

# Before we can fit the network, we need to make the environment to which it will be fit:
dq_training_environment = gym.make('LunarLander-v2')

# In Keras-RL we still build the model using Keras as usual:
q_model = Sequential()
q_model.add(Flatten(input_shape=(1,) + dq_training_environment.observation_space.shape))
q_model.add(Dense(units=16, activation='relu'))
q_model.add(Dense(units=16, activation='relu'))
q_model.add(Dense(units=16, activation='relu'))
q_model.add(Dense(units=16, activation='relu'))
q_model.add(Dense(units=dq_training_environment.action_space.n, activation='linear'))

# We have to specify a policy type. We have only discussed "Epsilon Greedy" policies,
# but you can explore the other policies available, some of which will probably outperform EG.
# See the Reinforcement Learning book linked in the additional resources for more on the
# different kinds of policies.
policy = EpsGreedyQPolicy()

# Keras-RL supports experience replay via the memory option.
# We'll remember up to 50,000 state-action->state' transitions,
# and each experience will be a single state-action->state' transition (window_length=1).
memory = SequentialMemory(limit=50000, window_length=1)

# nb_actions is the size of the action space; in Lunar Lander that's 4.
# nb_steps_warmup is the number of steps to run before training updates begin.
#   The agent only collects experience during this period, which helps it fill
#   out its memory of the state space as part of the exploration process.
# target_model_update controls the updates between the Q-network and the target network.
#   Values below 1 are treated as a soft-update rate (the target moves a small
#   fraction toward the Q-network each step); values of 1 or more mean a full
#   copy every that many steps.
dqn = DQNAgent(
    model=q_model,
    nb_actions=dq_training_environment.action_space.n,
    memory=memory,
    nb_steps_warmup=200,
    target_model_update=1e-2,
    policy=policy,
)

# We compile the DQN, like we would a neural network.
dqn.compile(optimizer=Adam(), metrics=['mae'])

# fit looks very similar as well
dqn.fit(dq_training_environment, nb_steps=50000, verbose=True)

Training for 50000 steps ...
Interval 1 (0 steps performed)
99 episodes - episode_reward: -179.019 [-652.775, -24.760] - loss: 28.211 - mean_absolute_error: 32.202 - mean_q: -38.740

Interval 2 (10000 steps performed)
47 episodes - episode_reward: -150.617 [-415.429, -24.242] - loss: 10.750 - mean_absolute_error: 30.651 - mean_q: -27.945

Interval 3 (20000 steps performed)
16 episodes - episode_reward: -110.085 [-176.352, -13.366] - loss: 6.696 - mean_absolute_error: 26.051 - mean_q: -18.481

Interval 4 (30000 steps performed)
13 episodes - episode_reward: -173.554 [-301.736, -45.693] - loss: 5.507 - mean_absolute_error: 22.292 - mean_q: -1.987

Interval 5 (40000 steps performed)
done, took 181.338 seconds


<keras.callbacks.History at 0x13952d160>

In [4]:
# After 50,000 training steps
for _ in range(5):
    orig_environment = gym.make('LunarLander-v2')
    environment = wrappers.Monitor(orig_environment, "gym-videos/", force=True)

    # Let's visualize a single playthrough.
    state = environment.reset()
    dqn.test(environment, nb_episodes=1, visualize=True)

    environment.close()
    orig_environment.close()

    display.display(imbed_round_video(environment))

Testing for 1 episodes ...
Episode 1: reward: -41.163, steps: 632


Testing for 1 episodes ...
Episode 1: reward: -178.073, steps: 909


Testing for 1 episodes ...
Episode 1: reward: 145.925, steps: 800


Testing for 1 episodes ...
Episode 1: reward: 166.203, steps: 833


Testing for 1 episodes ...
Episode 1: reward: 169.628, steps: 764


In [5]:
# Let's fit for longer!
dqn.fit(dq_training_environment, nb_steps=100000, verbose=True)

Training for 100000 steps ...
Interval 1 (0 steps performed)
11 episodes - episode_reward: 41.684 [-118.195, 224.953] - loss: 3.909 - mean_absolute_error: 20.440 - mean_q: 15.403

Interval 2 (10000 steps performed)
12 episodes - episode_reward: -8.404 [-190.392, 106.957] - loss: 3.293 - mean_absolute_error: 15.280 - mean_q: 17.024

Interval 3 (20000 steps performed)
17 episodes - episode_reward: 95.372 [-206.390, 242.065] - loss: 3.681 - mean_absolute_error: 13.190 - mean_q: 16.517

Interval 4 (30000 steps performed)
15 episodes - episode_reward: 63.488 [-147.142, 263.525] - loss: 4.394 - mean_absolute_error: 15.492 - mean_q: 19.264

Interval 5 (40000 steps performed)
20 episodes - episode_reward: 115.855 [-98.836, 274.517] - loss: 5.161 - mean_absolute_error: 20.188 - mean_q: 25.317

Interval 6 (50000 steps performed)
17 episodes - episode_reward: 81.606 [-597.934, 239.232] - loss: 6.350 - mean_absolute_error: 23.659 - mean_q: 30.936

Interval 7 (60000 steps performed)
17 episodes - e

<keras.callbacks.History at 0x138c55780>

In [6]:
# Now we're at ~150,000 training steps
for _ in range(5):
    orig_environment = gym.make('LunarLander-v2')
    environment = wrappers.Monitor(orig_environment, "gym-videos/", force=True)

    # Let's visualize a single playthrough.
    state = environment.reset()
    dqn.test(environment, nb_episodes=1, visualize=True)

    environment.close()
    orig_environment.close()

    display.display(imbed_round_video(environment))

Testing for 1 episodes ...
Episode 1: reward: -88.296, steps: 1000


Testing for 1 episodes ...
Episode 1: reward: -49.213, steps: 1000


Testing for 1 episodes ...
Episode 1: reward: -23.208, steps: 1000


Testing for 1 episodes ...
Episode 1: reward: -164.860, steps: 158


Testing for 1 episodes ...
Episode 1: reward: -72.322, steps: 520


In [7]:
# Let's fit for even longer!
dqn.fit(dq_training_environment, nb_steps=100000, verbose=True)

Training for 100000 steps ...
Interval 1 (0 steps performed)
18 episodes - episode_reward: 89.362 [-99.898, 266.369] - loss: 6.952 - mean_absolute_error: 31.136 - mean_q: 41.056

Interval 2 (10000 steps performed)
18 episodes - episode_reward: 99.232 [-206.290, 269.604] - loss: 6.735 - mean_absolute_error: 29.663 - mean_q: 38.975

Interval 3 (20000 steps performed)
14 episodes - episode_reward: 155.398 [-17.055, 242.272] - loss: 7.277 - mean_absolute_error: 29.832 - mean_q: 39.540

Interval 4 (30000 steps performed)
17 episodes - episode_reward: 189.437 [34.167, 294.502] - loss: 6.205 - mean_absolute_error: 30.607 - mean_q: 41.236

Interval 5 (40000 steps performed)
15 episodes - episode_reward: 142.923 [-67.339, 239.379] - loss: 6.069 - mean_absolute_error: 31.109 - mean_q: 42.015

Interval 6 (50000 steps performed)
15 episodes - episode_reward: 127.494 [-89.938, 241.239] - loss: 5.731 - mean_absolute_error: 30.849 - mean_q: 41.603

Interval 7 (60000 steps performed)
18 episodes - epi

<keras.callbacks.History at 0x14026fba8>

In [8]:
# Now we're at ~250,000 training steps
for _ in range(5):
    orig_environment = gym.make('LunarLander-v2')
    environment = wrappers.Monitor(orig_environment, "gym-videos/", force=True)

    # Let's visualize a single playthrough.
    state = environment.reset()
    dqn.test(environment, nb_episodes=1, visualize=True)

    environment.close()
    orig_environment.close()

    display.display(imbed_round_video(environment))

Testing for 1 episodes ...
Episode 1: reward: 147.970, steps: 383


Testing for 1 episodes ...
Episode 1: reward: 112.230, steps: 821


Testing for 1 episodes ...
Episode 1: reward: -83.507, steps: 434


Testing for 1 episodes ...
Episode 1: reward: -113.727, steps: 351


Testing for 1 episodes ...
Episode 1: reward: 129.959, steps: 839


In [9]:
# Let's fit for even longer!
dqn.fit(dq_training_environment, nb_steps=100000, verbose=True)

Training for 100000 steps ...
Interval 1 (0 steps performed)
18 episodes - episode_reward: 19.063 [-215.565, 205.861] - loss: 6.614 - mean_absolute_error: 29.734 - mean_q: 39.321

Interval 2 (10000 steps performed)
28 episodes - episode_reward: 97.330 [-79.432, 270.405] - loss: 7.802 - mean_absolute_error: 30.564 - mean_q: 40.572

Interval 3 (20000 steps performed)
21 episodes - episode_reward: 219.761 [-93.028, 273.820] - loss: 8.351 - mean_absolute_error: 33.906 - mean_q: 45.237

Interval 4 (30000 steps performed)
28 episodes - episode_reward: -42.275 [-483.275, 267.730] - loss: 10.921 - mean_absolute_error: 39.502 - mean_q: 52.875

Interval 5 (40000 steps performed)
31 episodes - episode_reward: -131.299 [-547.128, 268.823] - loss: 14.116 - mean_absolute_error: 45.947 - mean_q: 58.629

Interval 6 (50000 steps performed)
16 episodes - episode_reward: 10.344 [-335.634, 241.289] - loss: 14.297 - mean_absolute_error: 45.757 - mean_q: 57.452

Interval 7 (60000 steps performed)
15 episode

<keras.callbacks.History at 0x139584518>

In [10]:
# Now we're at ~350,000 training steps
for _ in range(5):
    orig_environment = gym.make('LunarLander-v2')
    environment = wrappers.Monitor(orig_environment, "gym-videos/", force=True)

    # Let's visualize a single playthrough.
    state = environment.reset()
    dqn.test(environment, nb_episodes=1, visualize=True)

    environment.close()
    orig_environment.close()

    display.display(imbed_round_video(environment))

Testing for 1 episodes ...
Episode 1: reward: 27.072, steps: 167


Testing for 1 episodes ...
Episode 1: reward: 36.109, steps: 267


Testing for 1 episodes ...
Episode 1: reward: 243.373, steps: 346


Testing for 1 episodes ...
Episode 1: reward: 57.808, steps: 205


Testing for 1 episodes ...
Episode 1: reward: 257.787, steps: 317


In [None]:
# Pretty good!