# Intro to Deep Q-Networks

Deep Q-Networks (DQNs) combine neural networks with reinforcement learning. Instead of maintaining a Q-table that stores a value for every state-action pair, we use a neural network to approximate that mapping. The network takes the state as input, and each unit in its output layer gives the estimated Q-value of one available action. After every action, we train the network toward a target computed with the same Q-learning update we used for the Q-table.
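
As a rough illustration of the idea above, the sketch below replaces a Q-table row lookup with a forward pass through a tiny network. The weights are random and untrained, and the names `q_values`, `hidden_weights`, and `output_weights` are hypothetical, introduced only for this example. The 8 inputs and 4 outputs match the Lunar Lander model built later in this notebook.

```python
import numpy as np

# Sizes match the Lunar Lander setup used in this notebook:
# 8 state features in, 4 discrete actions out.
STATE_SIZE = 8
NUM_ACTIONS = 4

rng = np.random.default_rng(seed=0)

# Hypothetical, untrained weights for a network with one hidden layer.
hidden_weights = rng.normal(scale=0.1, size=(STATE_SIZE, 16))
hidden_bias = np.zeros(16)
output_weights = rng.normal(scale=0.1, size=(16, NUM_ACTIONS))
output_bias = np.zeros(NUM_ACTIONS)


def q_values(state):
    """Forward pass: state vector -> one estimated Q-value per action."""
    hidden = np.maximum(0.0, state @ hidden_weights + hidden_bias)  # ReLU
    return hidden @ output_weights + output_bias  # linear output layer


state = rng.normal(size=STATE_SIZE)   # stand-in for an environment observation
action_values = q_values(state)       # plays the role of a Q-table row lookup
greedy_action = int(np.argmax(action_values))

print(action_values.shape)  # (4,)
print(greedy_action)
```

The only structural change from tabular Q-learning is where `action_values` comes from. Picking the greedy action with `argmax` works the same way in both cases.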

![](images/ANN-DQN.png)

> Image Source: https://people.csail.mit.edu/hongzi/content/publications/DeepRM-HotNets16.pdf

## Naive Deep Q Learning

Essentially, we're building a regression DNN to estimate the Q-values instead of maintaining a table of state-action values. We still use the Bellman equation to iteratively update the Q-values, which serve as our training labels. Every time we experience a new state, we perform a single forward pass to get the action values (and from them our policy's action), take that action to receive a reward, and then perform a single backward pass using the label updated with the new reward information.
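
To make the label update concrete, here is a minimal sketch of the arithmetic for one transition, under the assumption that the label vector is the network's own prediction with only the taken action's entry changed. The helper `updated_action_values` and the sample numbers are hypothetical. The `done` check is an addition that the training loop in this notebook does not perform.

```python
def updated_action_values(action_values, action, reward, next_action_values,
                          done, learning_rate=0.1, discount_factor=0.95):
    """Return a copy of action_values with the taken action's entry
    nudged toward the Bellman target."""
    prev_q_value = action_values[action]
    # Assumption: a terminal transition has no future reward to bootstrap from.
    future = 0.0 if done else discount_factor * max(next_action_values)
    target = reward + future
    labels = list(action_values)
    labels[action] = prev_q_value + learning_rate * (target - prev_q_value)
    return labels


labels = updated_action_values(
    action_values=[1.0, 2.0, 0.5, 0.0],       # forward pass on the current state
    action=1,                                  # the action we actually took
    reward=1.0,
    next_action_values=[0.0, 3.0, 1.0, 2.0],  # forward pass on the next state
    done=False,
)
print(labels)
```

With these numbers the entry for action 1 moves from 2.0 to roughly 2.185, which is 2.0 + 0.1 * (1.0 + 0.95 * 3.0 - 2.0). The other three entries are left unchanged, so they contribute no error when the network is fit to these labels.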

As we'll see, this very naive version of Deep Q-Learning is not particularly effective, but it is useful to see just how similar it is to standard Q-Learning. The next two notebooks show extensions to this basic concept.

In [1]:
import numpy as np

import io
import base64
from IPython import display

import gym
from gym import wrappers

from keras.models import Sequential
from keras.layers import Dense, Dropout

Using TensorFlow backend.


In [6]:
# Same as before: this helper lets us display the video recorded by OpenAI Gym inline
def imbed_round_video(video_env):
    video_path = './gym-videos/openaigym.video.%s.video000000.mp4' % video_env.file_infix
    # Read the recorded video, closing the file once we have its bytes
    with io.open(video_path, 'rb') as video_file:
        video = video_file.read()
    encoded = base64.b64encode(video)
    return display.HTML(data='''
        <video width="360" height="auto" alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''
    .format(encoded.decode('ascii')))

In [3]:
# Here is a naive version of a Deep Q-Network
environment = gym.make('LunarLander-v2')

q_model = Sequential()
q_model.add(Dense(units=16, activation='relu', input_shape=(8,)))
q_model.add(Dense(units=8, activation='relu'))

# This is our output layer. It has 4 units because that's
# how many actions we have access to in Lunar Lander.
# We're using a linear activation function, which reflects
# the fact that the desired predictions are the estimated
# Q-values for each action, given the state as input.
q_model.add(Dense(units=4, activation='linear'))
# Accuracy is a classification metric and isn't meaningful for this
# regression problem, so we only track the MSE loss.
q_model.compile(optimizer='adam', loss='mse')

# Some global parameters for Q-Learning
learning_rate = 0.1 
discount_factor = 0.95
exploration_rate = 0.3
training_episodes = 50000

# Let's also track the average end-of-episode reward every so often
avg_reward = 0

for current_episode_num in range(training_episodes):
    state = environment.reset()

    done = False
    while not done:    
        # Now we have our model make a prediction, instead of
        # looking something up in the q-table. And we need these
        # values even if we explore randomly.
        
        # Keras predicts on batches, so we wrap the state in a batch of one
        # with np.array and take element [0] of the result
        action_values = q_model.predict(np.array([state,]))[0]
        action = np.argmax(action_values)

        # We still have to explore the state space with DQN
        explore = np.random.random() < exploration_rate
        if explore:
            action = environment.action_space.sample()

        # Take the action and observe the next state and the reward
        next_state, reward, done, _ = environment.step(action)
        
        prev_q_value = action_values[action]
        
        # Again, Keras needs a batch of one to make a single prediction.
        # Note that this naive version bootstraps from next_state even
        # when the episode has just ended.
        discounted_future_reward = discount_factor * np.max(q_model.predict(np.array([next_state,]))[0])

        # Update the action values with our new information
        action_values[action] = (
            prev_q_value + (learning_rate * (reward + discounted_future_reward - prev_q_value))
        )
        
        # One training pass on this single sample, using the updated action values as labels
        q_model.fit(np.array([state,]), np.array([action_values,]), epochs=1, verbose=0)
        
        
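    # Every time we finish an episode, accumulate the reward from its final step
    # (not the episode's total return) and report the average every 500 episodes.
    # The first report, at episode 0, covers a single episode but still divides by 500.
    avg_reward += reward
    if current_episode_num % 500 == 0:
        print("Finished episode: ", current_episode_num)
        print("  Avg. Reward=", avg_reward / 500, "\n")
        avg_reward = 0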
    
print("finished!")

Finished episode:  0
  Avg. Reward= -0.2 

Finished episode:  500
  Avg. Reward= -100.0 

Finished episode:  1000
  Avg. Reward= -100.0 

Finished episode:  1500
  Avg. Reward= -100.0 

Finished episode:  2000
  Avg. Reward= -100.0 

Finished episode:  2500
  Avg. Reward= -100.0 

Finished episode:  3000
  Avg. Reward= -100.0 

Finished episode:  3500
  Avg. Reward= -99.6 

Finished episode:  4000
  Avg. Reward= -100.0 

Finished episode:  4500
  Avg. Reward= -100.0 

Finished episode:  5000
  Avg. Reward= -100.0 

Finished episode:  5500
  Avg. Reward= -100.0 

Finished episode:  6000
  Avg. Reward= -100.0 

Finished episode:  6500
  Avg. Reward= -99.8001184547515 

Finished episode:  7000
  Avg. Reward= -100.0 

Finished episode:  7500
  Avg. Reward= -100.0 

Finished episode:  8000
  Avg. Reward= -99.80056288938287 

Finished episode:  8500
  Avg. Reward= -100.0 
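
The loop above adds only the reward from the last step of each episode to `avg_reward`. If you would rather watch the total reward collected per episode, one possible approach is sketched below. `ReturnTracker` is a hypothetical helper, not part of Gym or Keras, and the rewards fed to it here are made-up numbers for illustration.

```python
class ReturnTracker:
    """Averages whole-episode returns over a fixed reporting window."""

    def __init__(self, window=500):
        self.window = window
        self.current_return = 0.0
        self.finished_returns = []

    def add_step(self, reward):
        """Add one step's reward to the running total for this episode."""
        self.current_return += reward

    def end_episode(self):
        """Close the episode. Returns the window average once the
        window is full, otherwise None."""
        self.finished_returns.append(self.current_return)
        self.current_return = 0.0
        if len(self.finished_returns) < self.window:
            return None
        average = sum(self.finished_returns) / len(self.finished_returns)
        self.finished_returns = []
        return average


# Two short made-up episodes with a reporting window of 2.
tracker = ReturnTracker(window=2)
for episode_rewards in ([1.0, -3.0], [0.5, 0.5, -100.0]):
    for reward in episode_rewards:
        tracker.add_step(reward)
    average = tracker.end_episode()

print(average)  # -50.5
```

In a training loop, `add_step(reward)` would go right after `environment.step(action)` and `end_episode()` after the `while` loop, printing whenever it returns a number.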



In [7]:
# As you can see, this naive version of deep Q-learning is computationally expensive,
# and its performance is poor: the average end-of-episode reward stays near -100.
# In the next lab we'll look at some extensions to DQNs that significantly improve
# their performance.
for _ in range(5):
    orig_environment = gym.make('LunarLander-v2')
    environment = wrappers.Monitor(orig_environment, "gym-videos/", force=True)

    # Let's visualize a single playthrough.
    done = False
    state = environment.reset()
    
    while not done:
        action_values = q_model.predict(np.array([state,]))[0]
        action = np.argmax(action_values)
        state, _, done, _ = environment.step(action)
        environment.render()

    environment.close()
    orig_environment.close()

    display.display(imbed_round_video(environment))

In [None]:
# As you can see, this naive version of DQN has a lot of problems.
# It needs an enormous amount of time to train successfully, but don't worry:
# there are better ways to improve the results. See the next notebook!