# Deep Q-learning with Doom
In this tutorial we will implement deep Q-learning to teach an agent to play Doom.

We will use Keras for the deep learning part and ViZDoom to run Doom from Python.

TODO: Insert GIF of final result

## Prerequisites
- Python 3.7
- pip install numpy matplotlib gym tensorflow keras scikit-image
- vizdoom

In [1]:
import random
import vizdoom
from collections import deque
import numpy as np
import keras
from keras.layers import Conv2D, Dense, Flatten, MaxPooling2D
from keras.models import Sequential
from keras.optimizers import SGD
from skimage import transform
from IPython.display import clear_output

Using TensorFlow backend.


# Initialize DoomGame
We will load the **defend_the_center** scenario.

Here is the summary of this scenario from https://github.com/mwydmuch/ViZDoom/tree/master/scenarios:

> The purpose of this scenario is to teach the agent that killing the monsters is GOOD and when monsters kill you is BAD. In addition, wasting ammunition is not very good either. Agent is rewarded only for killing monsters so he has to figure out the rest for himself.

> Map is a large circle. Player is spawned in the exact center. 5 melee-only monsters are spawned along the wall. Monsters are killed after a single shot. After dying each monster is respawned after some time. Episode ends when the player dies (it's inevitable because of limited ammo).

> REWARDS: +1 for killing a monster

> Further configuration:

> 3 available buttons: turn left, turn right, shoot (attack)

> death penalty = 1

In [2]:
game = vizdoom.DoomGame()
game.load_config("scenarios/defend_the_center.cfg")

# Visualize the game (set to False to train faster)
game.set_window_visible(False)

# Set screen format to greyscale. This improves training time
game.set_screen_format(vizdoom.ScreenFormat.GRAY8)

# Make the game end after 2100 ticks (set to 0 to disable)
game.set_episode_timeout(2100)

# Init game
game.init()
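
ViZDoom's `make_action` takes one on/off value per available button, so a discrete action index has to be turned into a one-hot list before it is sent to the game. Here is a minimal sketch, assuming the button order matches the scenario summary above (turn left, turn right, attack). `action_to_buttons` is a hypothetical helper name, not part of ViZDoom:

```python
def action_to_buttons(action, num_actions):
    """Return a one-hot list of button states for a discrete action index."""
    if not 0 <= action < num_actions:
        raise ValueError("action index out of range")
    buttons = [False] * num_actions
    buttons[action] = True
    return buttons

# Assumed button order: 0 = turn left, 1 = turn right, 2 = attack
print(action_to_buttons(2, 3))  # [False, False, True]
```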

# Setup Keras Model
## Let's Define Some Hyperparameters

In [3]:
num_episodes       = 500     # How many episodes to run
num_actions        = game.get_available_buttons_size()
replay_buffer_size = 4000    # How many experiences to store in our memory
learning_rate      = 0.01    # How "fast" should we update the network (alpha)
discount_factor    = 0.95    # Future reward discount factor (gamma)
batch_size         = 64      # How many replays should we use for training

print("Number of actions:", num_actions)

Number of actions: 3
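
To get a feel for `discount_factor`, it helps to look at how quickly future rewards fade. The sketch below is plain Python and independent of the game. `discounted_return` is a hypothetical helper used only for illustration:

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

gamma = 0.95
# A single +1 reward received 20 steps from now is worth about 0.36 today
print(round(gamma ** 20, 2))
# Rough "effective horizon" of the agent: 1 / (1 - gamma) steps, here about 20
print(round(1 / (1 - gamma)))
```

A value closer to 1 makes the agent care about rewards further in the future, at the cost of noisier targets.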


## Construct the network

Here we use Keras to construct the following network:

<img src="figures/Deep_Q_learning_model.png"/>

In [4]:
model = Sequential()
model.add(Conv2D(8, (3, 3), activation='elu', padding="valid", input_shape=(84, 84, 4)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (3, 3), activation='elu', padding="valid"))
model.add(Flatten())
model.add(Dense(128, activation='elu'))
model.add(Dense(64, activation='elu'))
model.add(Dense(game.get_available_buttons_size(), activation=None))
model.summary()
model.compile(loss="mse", optimizer=SGD(lr=learning_rate))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 82, 82, 8)         296       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 41, 41, 8)         0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 39, 39, 32)        2336      
_________________________________________________________________
flatten_1 (Flatten)          (None, 48672)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               6230144   
_________________________________________________________________
dense_2 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 195       
Total para
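
The shapes and parameter counts in the summary can be checked by hand. With `padding="valid"` a 3x3 convolution shrinks each spatial dimension by 2, the 2x2 pooling halves it, and each layer has (inputs x outputs) weights plus one bias per output. Here is a small sketch of that arithmetic in plain Python (the helper names are made up for this example):

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus biases of a square-kernel Conv2D layer."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights plus biases of a Dense layer."""
    return inputs * units + units

size = 84 - 2          # conv2d_1, 3x3 valid: 84 -> 82
size = size // 2       # max pooling 2x2:     82 -> 41
size = size - 2        # conv2d_2, 3x3 valid: 41 -> 39
flat = size * size * 32

counts = [
    conv_params(3, 4, 8),     # conv2d_1
    conv_params(3, 8, 32),    # conv2d_2
    dense_params(flat, 128),  # dense_1
    dense_params(128, 64),    # dense_2
    dense_params(64, 3),      # dense_3
]
print(flat, counts, sum(counts))
```

Almost all of the parameters sit in `dense_1`, because the flattened feature map has 48,672 values.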

# Training the Model

Remember that we want to use our model to estimate a Q-value for every action in a state, like this:

<img src="figures/Deep_Q_figure_1.png" style="width: 700px;"/>

Where $Q(s_t, a_0)$ represents how good it is to take action $a_0$ in state $s_t$ according to our network.

So the idea is to train the network to predict accurately how good each action is, by optimizing its parameters on experience gathered while exploring the environment.

This can be achieved with the Deep Q-learning algorithm:

<img src="figures/Deep_Q_algorithm.png" style="width: 700px;"/>
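
Deep Q-learning is commonly paired with an epsilon-greedy policy for exploration: with probability epsilon the agent picks a random action, otherwise the action with the highest predicted Q-value. The sketch below is an assumption about how this could look here, not code from this notebook. `epsilon_greedy`, `decayed_epsilon` and the schedule values are illustrative:

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=10000):
    """Linearly anneal epsilon from start to end over decay_steps steps."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)

q_values = np.array([0.1, 0.7, 0.2])
print(epsilon_greedy(q_values, epsilon=0.0))  # always greedy when epsilon is 0
```

Starting with a high epsilon and lowering it over time lets the agent try many actions early on and rely more on its own estimates later.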

In [None]:
# This function preprocesses a frame from the game by:
# - Cropping away the ceiling and floor pixels
# - Normalizing the pixel values to the [0, 1] range
# - Resizing the image to 84x84 pixels
def preprocess_frame(frame):
    cropped_frame = frame[30:-10, 30:-30]                             # Crop the screen
    normalized_frame = cropped_frame / 255.0                          # Normalize pixel values    
    preprocessed_frame = transform.resize(normalized_frame, [84, 84]) # Resize
    return preprocessed_frame
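
A single frame carries no motion information, so the training loop feeds the network a stack of the four most recent preprocessed frames. It relies on `deque(maxlen=4)` dropping its oldest element whenever a new one is appended. Here is a small stand-alone sketch with tiny 2x2 dummy frames instead of game screens:

```python
from collections import deque
import numpy as np

frame_stack = deque(maxlen=4)

# Fill the stack with copies of the first frame, as at the start of an episode
first_frame = np.zeros((2, 2))
for _ in range(4):
    frame_stack.append(first_frame)

# Appending a new frame pushes the oldest one out
frame_stack.append(np.ones((2, 2)))
state = np.stack(frame_stack, axis=2)

print(state.shape)     # (2, 2, 4)
print(state[0, 0, :])  # oldest to newest: [0. 0. 0. 1.]
```

With 84x84 frames the same code produces the `(84, 84, 4)` input shape the network expects.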

In [None]:
# Initialize replay buffer
replay_buffer = deque(maxlen=replay_buffer_size)

# Initialize frame stack
frame_stack = deque(maxlen=4)

# For every episode
episode_loss = float("nan")
for episode in range(num_episodes):
    clear_output(wait=True)
    print("-- Episode {}/{} --".format(episode+1, num_episodes))
    print("Episode loss:", episode_loss)
    
    # Start new episode
    game.new_episode()
    
    # Initialize frame stack with the first frame of the game
    initial_frame = preprocess_frame(game.get_state().screen_buffer)
    for _ in range(4):
        frame_stack.append(initial_frame)
    state = np.stack(frame_stack, axis=2) # Stack the frames to set up the initial state
    
    # Run the episode
    episode_loss = 0
    while not game.is_episode_finished():    
        # Get action with highest Q-value for current state
        action = np.argmax(model.predict_on_batch(np.expand_dims(state, axis=0)))
        action_one_hot = [False] * num_actions
        action_one_hot[action] = True
        
        # Take action and get a reward
        reward = game.make_action(action_one_hot)
        
        # Break if the episode is finished
        if game.is_episode_finished():
            break
        
        # If not, get the new state
        frame_stack.append(preprocess_frame(game.get_state().screen_buffer))
        new_state = np.stack(frame_stack, axis=2)

        # Store the replay
        replay_buffer.append((state, action, reward, new_state))
        state = new_state

        # Train network on a random sample of previous experiences
        if len(replay_buffer) >= batch_size:
            # Get replay batch
            replay_batch      = random.sample(replay_buffer, batch_size)
            replay_state      = np.array([r[0] for r in replay_batch])
            replay_action     = np.array([r[1] for r in replay_batch])
            replay_reward     = np.array([r[2] for r in replay_batch])
            replay_next_state = np.array([r[3] for r in replay_batch])

            # Q_target = reward + gamma * max_a' Q(s', a'), set only for the action taken.
            # The other outputs keep their current predictions, so they contribute no error.
            Q_next   = np.array(model.predict_on_batch(replay_next_state))
            Q_target = np.array(model.predict_on_batch(replay_state))
            Q_target[np.arange(batch_size), replay_action] = replay_reward + discount_factor * np.max(Q_next, axis=1)

            # Run training pass
            episode_loss += model.train_on_batch(replay_state, Q_target)

-- Episode 7/500 --
Episode loss: 0.8093927421723492
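
The target computation can be isolated into a small NumPy function, which makes it easy to check on toy numbers. This is a sketch under assumptions: `q_current` and `q_next` stand for the network's outputs on the sampled states and next states, and `dones` marks transitions that ended an episode. The loop above never stores such transitions, because it breaks out before appending them. Only the entry for the action actually taken is changed, so the other outputs produce no error:

```python
import numpy as np

def build_q_targets(q_current, q_next, actions, rewards, dones, gamma):
    """Return training targets for a batch of (s, a, r, s') transitions."""
    targets = np.array(q_current, dtype=np.float64)       # copy of current predictions
    best_next = np.max(q_next, axis=1)                    # max_a' Q(s', a')
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)  # no bootstrapping after terminal states
    td_target = np.asarray(rewards) + gamma * not_done * best_next
    targets[np.arange(len(actions)), actions] = td_target
    return targets

q_current = np.array([[0.0, 1.0, 2.0], [0.5, 0.5, 0.5]])
q_next    = np.array([[1.0, 3.0, 2.0], [4.0, 0.0, 0.0]])
targets = build_q_targets(q_current, q_next, actions=[0, 2], rewards=[1.0, 0.0],
                          dones=[False, True], gamma=0.5)
print(targets)
```

In the first row the target for action 0 becomes 1 + 0.5 * 3 = 2.5. In the second row the transition is terminal, so the target for action 2 is just the reward, 0.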


In [None]:
model.save("defend_the_center_dqn.h5")

# Let's Evaluate the Model

In [None]:
# Reinitialize game with set_window_visible = True
game = vizdoom.DoomGame()
game.load_config("scenarios/defend_the_center.cfg")

# Visualize the game
game.set_window_visible(True)

# Set screen format to greyscale. This must match the format used during training
game.set_screen_format(vizdoom.ScreenFormat.GRAY8)

# Make the game end after 2100 ticks (set to 0 to disable)
game.set_episode_timeout(2100)

# Init game
game.init()

# For every episode
for episode in range(10):
    # Start new episode
    game.new_episode()
    
    # Initialize frame stack with the first frame of the game
    initial_frame = preprocess_frame(game.get_state().screen_buffer)
    for _ in range(4):
        frame_stack.append(initial_frame)
    state = np.stack(frame_stack, axis=2) # Stack the frames to set up the initial state
    
    # Run the episode
    while not game.is_episode_finished():    
        # Get action with highest Q-value for current state
        action = np.argmax(model.predict_on_batch(np.expand_dims(state, axis=0)))
        action_one_hot = [False] * num_actions
        action_one_hot[action] = True
        
        # Take action and get a reward
        reward = game.make_action(action_one_hot)
        
        # Break if the episode is finished
        if game.is_episode_finished():
            break
        
        # If not, get the new state
        frame_stack.append(preprocess_frame(game.get_state().screen_buffer))
        state = np.stack(frame_stack, axis=2)
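
To put a number on the agent's performance, the total reward of each evaluation episode can be collected and averaged. ViZDoom exposes the episode total through `game.get_total_reward()`. The helper below only does the bookkeeping and is a hypothetical addition. The scores in the example are made up, not results from this notebook:

```python
import numpy as np

def summarize_scores(scores):
    """Return mean, min and max of a list of per-episode total rewards."""
    scores = np.asarray(scores, dtype=np.float64)
    return {
        "mean": float(scores.mean()),
        "min": float(scores.min()),
        "max": float(scores.max()),
    }

# Example values only. In the evaluation loop one would append
# game.get_total_reward() to a list after each episode finishes.
example_scores = [3.0, 5.0, 4.0]
print(summarize_scores(example_scores))
```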