# Deep Q-learning with Doom
In this tutorial we will implement deep Q-learning to teach an agent to play Doom.

We will use Keras for the deep learning part and ViZDoom to run Doom from Python.

Here is a gif of the final result:

<img src="figures/defend_the_center_dqn.gif"/>

## Prerequisites
- [Python 3.6](https://www.python.org/downloads/release/python-367/)
- `pip install numpy gym tensorflow keras scikit-image` (the `skimage` module is installed by the `scikit-image` package)
- vizdoom:
  - Download ViZDoom from [here](https://github.com/mwydmuch/ViZDoom/releases/download/1.1.5pre/ViZDoom-1.1.5pre-Win-Python36-x86_64.zip) (for Python 3.6 on 64-bit Windows)
    - For other versions of Python, see https://github.com/mwydmuch/ViZDoom/releases
  - Extract and copy the `vizdoom` folder into site-packages:
    - Python: python_root\Lib\site-packages
    - Anaconda: anaconda_root\lib\pythonX.X\site-packages
    - If you don't know where your Python installation is, run `where python` (Windows) to find it
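
If you would rather ask Python itself where site-packages lives, the standard-library `sysconfig` module can report it. This is a small sketch rather than part of the original setup; the variable name `purelib` is our own choice.

```python
import sysconfig

# Directory where pure-Python third-party packages are installed
# for the interpreter that runs this snippet.
purelib = sysconfig.get_paths()["purelib"]
print("site-packages:", purelib)
```

Run it with the same interpreter (or Anaconda environment) that you will use for the notebook, since each environment has its own site-packages folder.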

In [1]:
import random
import vizdoom
from collections import deque
import numpy as np
import keras
from keras import backend as K
from keras.layers import Input, Conv2D, Dense, Flatten, MaxPooling2D, Lambda
from keras.models import Model, load_model
from keras.optimizers import SGD
from skimage import transform
from IPython.display import clear_output

Using TensorFlow backend.


# Initialize DoomGame
We will load the **defend_the_center** scenario.

Here is the summary of this scenario from https://github.com/mwydmuch/ViZDoom/tree/master/scenarios:


> ## DEFEND THE CENTER
> The purpose of this scenario is to teach the agent that killing the
monsters is GOOD and when monsters kill you is BAD. In addition,
wasting ammunition is not very good either. Agent is rewarded only
for killing monsters so he has to figure out the rest for himself.

> Map is a large circle. Player is spawned in the exact center.
5 melee-only monsters are spawned along the wall. Monsters are
killed after a single shot. After dying each monster is respawned
after some time. Episode ends when the player dies (it's inevitable
because of limited ammo).

> __REWARDS:__
> +1 for killing a monster

> Further configuration:
> * 3 available buttons: turn left, turn right, shoot (attack)
> * death penalty = 1

In [2]:
game = vizdoom.DoomGame()
game.load_config("scenarios/defend_the_center.cfg")

# Visualize the game (set to False to train faster)
game.set_window_visible(True)

# Set screen format to greyscale. This improves training time
game.set_screen_format(vizdoom.ScreenFormat.GRAY8)

# Make the game end after 2100 ticks (set to 0 to disable)
game.set_episode_timeout(2100)

# Init game
game.init()

# Setup Keras Model
## Let's define some hyperparameters

In [3]:
num_episodes       = 500     # How many episodes to run
num_actions        = game.get_available_buttons_size()
replay_buffer_size = 10000   # How many experiences to store in our memory
learning_rate      = 0.0002  # How "fast" should we update the network (alpha)
discount_factor    = 0.95    # Future reward discount factor (gamma)
batch_size         = 64      # How many replays should we use for training
enable_training    = True    # Should we train the agent?

print("Number of actions:", num_actions)

Number of actions: 3


## Construct the network

Here we use Keras to construct the following network:

<img src="figures/Deep_Q_learning_model.png"/>
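
As a quick sanity check on the layer sizes, we can work out the shapes by hand. The sketch below assumes what the Keras code in this section uses: 3x3 convolutions with stride 1 and "valid" padding, followed by 2x2 max-pooling. The helper names (`conv_valid`, `max_pool`, `conv_params`) are ours, not Keras functions.

```python
def conv_valid(size, kernel):
    """Output width of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1


def max_pool(size, pool):
    """Output width of non-overlapping max-pooling (floor division)."""
    return size // pool


def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


size = 84
size = conv_valid(size, 3)  # first Conv2D
size = max_pool(size, 2)    # first MaxPooling2D
size = conv_valid(size, 3)  # second Conv2D
size = max_pool(size, 2)    # second MaxPooling2D
flat = size * size * 32     # length of the vector produced by Flatten

params_conv1 = conv_params(3, 4, 8)
params_conv2 = conv_params(3, 8, 32)
print(size, flat)                  # 19 11552
print(params_conv1, params_conv2)  # 296 2336
```

The two parameter counts agree with the `conv2d_1` and `conv2d_2` rows printed by `train_model.summary()`.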

In [4]:
input_images = Input(shape=(84, 84, 4))     # Stack of 4 preprocessed 84x84 frames
input_action = Input(shape=(num_actions,))  # One-hot vector of the action that was taken

# Convolutional feature extractor
x = Conv2D(8, (3, 3), activation='elu', padding="valid")(input_images)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Conv2D(32, (3, 3), activation='elu', padding="valid")(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Flatten()(x)
x = Dense(512, activation='elu')(x)
x = Dense(128, activation='elu')(x)

# One linear output per action: the predicted Q-value of every action in this state
Q_actions = Dense(num_actions, activation=None)(x)

# Mask the Q-values with the one-hot action and sum, leaving only the Q-value of input_action
Q_input_action = Lambda(lambda x: K.expand_dims(K.sum(x[0] * x[1], axis=1), axis=-1))([Q_actions, input_action])

# Create one model for training and one for prediction
train_model = Model(inputs=[input_images, input_action], outputs=[Q_input_action])
train_model.compile(loss="mse", optimizer=SGD(lr=learning_rate))
train_model.summary()
predict_model = Model(inputs=[input_images], outputs=[Q_actions])

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 84, 84, 4)    0                                            
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 82, 82, 8)    296         input_1[0][0]                    
__________________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)  (None, 41, 41, 8)    0           conv2d_1[0][0]                   
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, 39, 39, 32)   2336        max_pooling2d_1[0][0]            
__________________________________________________________________________________________________
max_poolin
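
The `Lambda` layer in the model may look cryptic, so here is the same arithmetic in plain NumPy. This is only an illustration with made-up Q-values: multiplying by the one-hot action zeroes out every Q-value except the one for the action that was taken, and the sum then picks it out.

```python
import numpy as np

# Made-up Q-values for a batch of two states and three actions
q_actions = np.array([[0.2, 1.5, -0.3],
                      [0.7, 0.1, 0.4]])

# One-hot encoding of the action taken in each state (action 1, then action 0)
action_one_hot = np.array([[0, 1, 0],
                           [1, 0, 0]])

# Same computation as K.expand_dims(K.sum(x[0] * x[1], axis=1), axis=-1)
q_selected = np.sum(q_actions * action_one_hot, axis=1, keepdims=True)
print(q_selected)  # one column: 1.5 for the first state, 0.7 for the second
```

This is why `train_model` takes the action as a second input: the loss is only computed on the Q-value of the action we actually took.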

# Training the Model

Remember that we want to use our model to estimate a Q-value for every action in a state, like this:

<img src="figures/Deep_Q_figure_1.png" style="width: 700px;"/>

Where $Q(s_t, a_0)$ represents how good it is to take action $a_0$ in state $s_t$ according to our network.

So the idea is to train the network to predict accurately how good each action is, by adjusting its parameters based on the experience the agent gathers while exploring the environment.

This can be achieved with the Deep Q-learning algorithm:

<img src="figures/Deep_Q_algorithm.png" style="width: 700px;"/>

Or alternatively:

```
Initialize replay buffer D with size N
Set the batch size B to some constant
for every episode:
    Reset environment
    Get initial state s
    for every step:
        Predict Q_values for state s
        Set a to the action with the highest Q_value
        Perform action a and get reward r
        if game is not over:
            Get new state s'
        else:
            Set s' to None
        Store experience in D as a tuple <s, a, r, s'>
        Set s to s'
        if D holds at least B experiences:
            Set V to a random sample of B elements from D
            Q_target = V.r + gamma * max_a' Q(V.s', a') if V.s' is not None else V.r
            Train network with V.s as inputs and Q_target as target
```
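
To make the `Q_target` line concrete, here is a small worked example with made-up numbers. It computes the target for a whole batch at once and assumes that terminal transitions (where `s'` is `None`) are marked with a boolean mask; the array names are ours.

```python
import numpy as np

gamma = 0.95  # discount factor, same value as discount_factor above

# Made-up batch of three experiences
rewards = np.array([1.0, 0.0, -1.0])
q_next = np.array([[0.5, 2.0, 1.0],   # Q(s', a') for every action a'
                   [0.0, 0.4, 0.2],
                   [0.0, 0.0, 0.0]])  # placeholder row: s' is None here
terminal = np.array([False, False, True])

# r + gamma * max_a' Q(s', a') for non-terminal transitions, just r otherwise
q_target = rewards + gamma * q_next.max(axis=1) * (~terminal)
print(q_target)
```

The first target is 1 + 0.95 * 2.0 = 2.9, the second is 0 + 0.95 * 0.4 = 0.38, and the third is just the reward of -1 because the episode ended there.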

In [5]:
# This function preprocesses a frame from the game by:
# - Cropping away the ceiling and floor pixels
# - Normalizing the pixel values to the [0, 1] range
# - Resizing the image to 84x84 pixels
def preprocess_frame(frame):
    cropped_frame = frame[30:-10, 30:-30]                             # Crop the screen
    normalized_frame = cropped_frame / 255.0                          # Normalize pixel values    
    preprocessed_frame = transform.resize(normalized_frame, [84, 84]) # Resize
    return preprocessed_frame
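
To see what the crop and the normalization do, we can try them on a dummy frame. The 480x640 resolution below is only an assumption for illustration (the real size depends on the scenario config), and the resize step is left out so the snippet needs nothing but NumPy.

```python
import numpy as np

# Dummy all-white greyscale frame; 480x640 is a hypothetical resolution
frame = np.full((480, 640), 255, dtype=np.uint8)

cropped = frame[30:-10, 30:-30]  # drop 30 rows on top, 10 at the bottom, 30 columns per side
normalized = cropped / 255.0     # uint8 values in [0, 255] become floats in [0, 1]

print(cropped.shape)  # (440, 580)
print(normalized.dtype, normalized.max())
```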

Here are some useful functions:

* `np.argmax(v)`
  * Return the index of the largest value in v
* `np.stack(list)`
  * Join a list of equally shaped arrays along a new axis into a single array
* `np.expand_dims(m, axis=n)`
  * Insert a new axis of length 1 at position n of array m
* `game.new_episode()`
  * Start a new episode
* `game.get_state().screen_buffer`
  * Screen buffer (image) of the current state of the game
* `game.make_action(actions)`
  * Take the actions denoted by True in the actions vector (e.g. [True, False, False] to perform action 0) and return the reward
* `game.is_episode_finished()`
  * Returns True when the episode is done (terminal state or episode timeout)
* `random.sample(list, n)`
  * Sample n elements from list
* `predict_model.predict_on_batch(batch)`
  * Feed batch through the network and return the output values (Q-values for the state)
* `train_model.train_on_batch(batch, target)`
  * Feed batch through the network and optimize network to predict "target" next time
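
Here is a short sketch of how the NumPy and `random` helpers fit together, using constant dummy frames instead of real game images. The frame values and the toy replay buffer are made up for illustration.

```python
import random
from collections import deque

import numpy as np

# A deque with maxlen=4 keeps only the 4 most recent frames
frame_stack = deque(maxlen=4)
for i in range(6):
    frame_stack.append(np.full((84, 84), float(i)))  # dummy frame filled with the value i

state = np.stack(frame_stack, axis=2)  # shape (84, 84, 4): frames 2, 3, 4, 5
batch = np.expand_dims(state, axis=0)  # shape (1, 84, 84, 4): a batch with one state

# Pretend these Q-values came from predict_model.predict_on_batch(batch)
q_values = np.array([[0.1, 0.7, 0.2]])
best_action = int(np.argmax(q_values))  # index of the largest Q-value

# Toy replay buffer holding 10 placeholder experiences
replay_buffer = deque(range(10), maxlen=10)
sample = random.sample(replay_buffer, 3)  # 3 distinct experiences chosen at random

print(state.shape, batch.shape, best_action, sample)
```

Because the deque drops the oldest frame on every append, `state` always holds the last four frames in chronological order.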

In [6]:
if enable_training:
    # Initialize replay buffer
    replay_buffer = deque(maxlen=replay_buffer_size)

    # Initialize frame stack
    frame_stack = deque(maxlen=4)

    # For every episode
    episode_loss = float("nan")
    for episode in range(num_episodes):
        clear_output(wait=True)
        print("-- Episode {}/{} --".format(episode+1, num_episodes))
        print("Episode loss:", episode_loss)

        # Start new episode
        game.new_episode()

        # Initialize frame stack with the first frame of the game
        initial_frame = preprocess_frame(game.get_state().screen_buffer)
        for _ in range(4):
            frame_stack.append(initial_frame)
        state = np.stack(frame_stack, axis=2) # Stack the frames to set up the initial state

        # Run the episode
        episode_loss = 0
        while not game.is_episode_finished():
            # Get action with highest Q-value for current state
            action = np.argmax(predict_model.predict_on_batch(np.expand_dims(state, axis=0)))
            action_one_hot = [False] * num_actions
            action_one_hot[action] = True

            # Take action and get a reward
            reward = game.make_action(action_one_hot)
            
            # If game is not finished
            if not game.is_episode_finished():
                # Get new state
                frame_stack.append(preprocess_frame(game.get_state().screen_buffer))
                new_state = np.stack(frame_stack, axis=2)
            else:
                new_state = None

            # Store the replay
            replay_buffer.append((state, action_one_hot, reward, new_state))
            state = new_state

            # Train network on a random sample of previous experiences
            if len(replay_buffer) >= batch_size:
                # Get replay batch
                replay_batch      = random.sample(replay_buffer, batch_size)
                replay_state      = [r[0] for r in replay_batch]
                replay_action     = [r[1] for r in replay_batch]
                replay_reward     = [r[2] for r in replay_batch]
                replay_next_state = [r[3] for r in replay_batch]

                # Q_target = reward + gamma * max_a' Q(s')
                Q_target = []
                for i in range(batch_size):
                    if replay_next_state[i] is not None:
                        Q_next_state = predict_model.predict_on_batch(np.expand_dims(replay_next_state[i], axis=0))[0]
                        Q_next_max   = np.max(Q_next_state)
                        Q_target.append(replay_reward[i] + discount_factor * Q_next_max)
                    else:
                        Q_target.append(replay_reward[i])
                
                # Run training pass
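                # Convert the Python lists to arrays so Keras receives well-defined batch shapes
                episode_loss += train_model.train_on_batch(
                    [np.array(replay_state), np.array(replay_action, dtype=np.float32)],
                    np.array(Q_target))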
    predict_model.save("defend_the_center_dqn.h5")
else:
    predict_model = load_model("defend_the_center_dqn.h5")

-- Episode 64/500 --
Episode loss: 1.7479450501414249


KeyboardInterrupt: 
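
Note that the training loop above always takes the action with the highest predicted Q-value, so the agent never tries an action at random. A common addition to deep Q-learning, which is not part of this tutorial's code, is epsilon-greedy action selection. Below is a minimal sketch; the function name `epsilon_greedy` and its parameters are our own.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: best action so far


rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.9, 0.3])  # made-up Q-values for the 3 actions

greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)  # always action 1
random_actions = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(20)]
print(greedy_action, random_actions)
```

If you want to try this in the training loop, one option is to replace the `np.argmax(...)` line with a call like this and lower `epsilon` gradually over the episodes. We have not tuned this for the scenario, so treat the schedule as something to experiment with.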

# Let's Evaluate the Model

In [9]:
# Reinitialize game with set_window_visible = True
game = vizdoom.DoomGame()
game.load_config("scenarios/defend_the_center.cfg")

# Visualize the game
game.set_window_visible(True)

# Set screen format to greyscale, the same format the network was trained on
game.set_screen_format(vizdoom.ScreenFormat.GRAY8)

# Make the game end after 2100 ticks (set to 0 to disable)
game.set_episode_timeout(2100)

# Init game
game.init()

# Saves a GIF when true
record = False
if record:
    import imageio
    recording = []
    
# Initialize frame stack
frame_stack = deque(maxlen=4)

# For every episode
for episode in range(1):
    # Start new episode
    game.new_episode()
    
    # Initialize frame stack with the first frame of the game
    initial_frame = preprocess_frame(game.get_state().screen_buffer)
    for _ in range(4):
        frame_stack.append(initial_frame)
    state = np.stack(frame_stack, axis=2) # Stack the frames to set up the initial state
    
    # Run the episode
    while not game.is_episode_finished():
        if record:
            recording.append((transform.resize(game.get_state().screen_buffer, [240, 320]) * 255.5).astype(np.uint8))
        
        # Get action with highest Q-value for current state
        action = np.argmax(predict_model.predict_on_batch(np.expand_dims(state, axis=0)))
        action_one_hot = [False] * num_actions
        action_one_hot[action] = True
        
        # Take action and get a reward
        reward = game.make_action(action_one_hot)
        
        # Break if the episode is finished
        if game.is_episode_finished():
            break
        
        # If not, get the new state
        frame_stack.append(preprocess_frame(game.get_state().screen_buffer))
        state = np.stack(frame_stack, axis=2)
        
if record:
    imageio.mimwrite("defend_the_center_dqn.gif", recording, subrectangles=True)