# Deep Q-learning with Doom
In this tutorial we will implement deep Q-learning to teach an agent to play Doom.

We will use Keras for the deep learning part, and ViZDoom to run Doom in Python.

Here is a gif of the final results:

TODO:
* DQN pseudocode
* Install instructions
* Worksheet version of this notebook


## Prerequisites
- Python 3.7
- pip install numpy matplotlib gym tensorflow keras scikit-image imageio
- vizdoom

In [1]:
import random
import vizdoom
from collections import deque
import numpy as np
import keras
from keras.layers import Conv2D, Dense, Flatten, MaxPooling2D
from keras.models import Sequential, load_model
from keras.optimizers import SGD
from skimage import transform
from IPython.display import clear_output

Using TensorFlow backend.


# Initialize DoomGame
We will load the **basic** scenario.

Here is the summary of this scenario from https://github.com/mwydmuch/ViZDoom/tree/master/scenarios:


> ## BASIC
> The purpose of the scenario is just to check if using this
> framework to train some AI in a 3D environment is feasible.
>
> Map is a rectangle with gray walls, ceiling and floor.
> Player is spawned along the longer wall, in the center.
> A red, circular monster is spawned randomly somewhere along
> the opposite wall. Player can only (config) go left/right
> and shoot. 1 hit is enough to kill the monster. Episode
> finishes when monster is killed or on timeout.
>
> __REWARDS:__
>
> * +101 for killing the monster
> * -5 for missing
>
> Episode ends after killing the monster or on timeout.
>
> Further configuration:
> * living reward = -1,
> * 3 available buttons: move left, move right, shoot (attack)
> * timeout = 300



In [2]:
game = vizdoom.DoomGame()
game.load_config("scenarios/basic.cfg")

# Visualize the game (set to False to train faster)
game.set_window_visible(True)

# Set screen format to greyscale. This improves training time
game.set_screen_format(vizdoom.ScreenFormat.GRAY8)

# Make the game end after 100 ticks (set to 0 to disable)
game.set_episode_timeout(100)

# Init game
game.init()

# Set Up the Keras Model
## Let's Define Some Hyperparameters

In [3]:
num_episodes       = 500     # How many episodes to run
num_actions        = game.get_available_buttons_size()
replay_buffer_size = 10000   # How many experiences to store in our memory
learning_rate      = 0.00001 # How "fast" should we update the network (alpha)
discount_factor    = 0.95    # Future reward discount factor (gamma)
batch_size         = 64      # How many replays should we use for training
enable_training    = True    # Should we train the agent?

print("Number of actions:", num_actions)

Number of actions: 3
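
To make the role of `discount_factor` more concrete, here is a small sketch of how a discount factor weights rewards that arrive later in an episode. The helper name `discounted_return` and the reward sequence are made up for illustration and are not part of the notebook.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its delay in steps."""
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total

# Hypothetical episode: three ticks of living reward (-1), then a kill (+101)
rewards = [-1, -1, -1, 101]

print(discounted_return(rewards, gamma=1.0))   # plain sum of the rewards
print(discounted_return(rewards, gamma=0.95))  # the kill counts for less because it is 3 steps away
```

With a discount factor below 1, a reward that is `k` steps away is multiplied by `gamma ** k`, so the agent is nudged towards killing the monster sooner rather than later.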


## Construct the network

Here we use Keras to construct the following network:

<img src="figures/Deep_Q_learning_model.png"/>

In [4]:
model = Sequential()
model.add(Conv2D(8, (3, 3), activation='elu', padding="valid", input_shape=(84, 84, 4)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (3, 3), activation='elu', padding="valid"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='elu'))
model.add(Dense(64, activation='elu'))
model.add(Dense(game.get_available_buttons_size(), activation=None))
model.summary()
model.compile(loss="mse", optimizer=SGD(lr=learning_rate))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 82, 82, 8)         296       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 41, 41, 8)         0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 39, 39, 32)        2336      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 19, 19, 32)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 11552)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               1478784   
_________________________________________________________________
dense_2 (Dense)              (None, 64)                8256      
__________
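
The shapes and parameter counts in the summary above can be reproduced by hand. The sketch below assumes stride 1 for the `"valid"` convolutions and a pooling stride equal to the pool size (the Keras defaults); the helper names are illustrative only.

```python
def conv_valid(size, kernel):
    """Output width/height of a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool(size, window):
    """Output width/height of max pooling whose stride equals its window."""
    return size // window

def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units

size = 84
size = conv_valid(size, 3)   # 82
size = pool(size, 2)         # 41
size = conv_valid(size, 3)   # 39
size = pool(size, 2)         # 19
flat = size * size * 32      # 11552 values after Flatten

print(size, flat)
print(conv_params(3, 4, 8), conv_params(3, 8, 32))       # 296 2336
print(dense_params(flat, 128), dense_params(128, 64))    # 1478784 8256
```

Almost all of the parameters sit in the first dense layer, because it connects every one of the 11552 flattened values to 128 units.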

# Training the Model

Remember that we want to use our model to estimate a Q-value for every action in a state, like this:

<img src="figures/Deep_Q_figure_1.png" style="width: 700px;"/>

Where $Q(s_t, a_0)$ represents how good it is to take action $a_0$ in state $s_t$ according to our network.

So the idea here is to train the network to make correct predictions of how good certain actions are, by optimizing the parameters in our network through exploration of the environment.

This can be achieved with the Deep Q-learning algorithm:

<img src="figures/Deep_Q_algorithm.png" style="width: 700px;"/>

Or alternatively:

```
Initialize replay buffer D with size N
Set the batch size B to some constant
for every episode:
    Reset environment
    Get initial state s
    for every step:
        Predict Q_values for state s
        Set a to the action with the highest Q_value
        Perform action a and get reward r
        if game is over:
            break
        Get new state s'
        Store experience in D as a tuple <s, a, r, s'>
        if D contains at least B experiences:
            Set V to a random sample of B elements from D
            Calculate Q_target = V.r + gamma * max_a' Q(V.s', a')
            Train network with V.s as inputs and Q_target as target

```
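
The target step of the pseudocode can be sketched in NumPy as follows. This is a minimal sketch and not the notebook's training code: the function name `q_targets`, the `done` flags and the toy Q-values are assumptions made for illustration. Only the entry for the action that was actually taken is replaced by the target; the other entries keep the network's own predictions, so they contribute no error. The `done` flag is a common addition that is not part of the pseudocode above; it drops the bootstrap term for transitions that end an episode.

```python
import numpy as np

def q_targets(q_current, q_next, actions, rewards, done, gamma):
    """Build training targets with the same shape as q_current."""
    targets = np.array(q_current, dtype=np.float64)   # copy of the current predictions
    best_next = np.max(q_next, axis=1)                # max_a' Q(s', a') for every sample
    bootstrap = gamma * best_next * (1.0 - done)      # zero for terminal transitions
    targets[np.arange(len(actions)), actions] = rewards + bootstrap
    return targets

# Toy batch of two experiences with three actions each
q_current = np.array([[1.0, 2.0, 3.0], [0.5, 0.0, -1.0]])   # Q(s, .)
q_next    = np.array([[4.0, 0.0, 2.0], [9.0, 9.0, 9.0]])    # Q(s', .)
actions   = np.array([1, 0])
rewards   = np.array([-1.0, 101.0])
done      = np.array([0.0, 1.0])                            # second experience ended the episode

print(q_targets(q_current, q_next, actions, rewards, done, gamma=0.5))
```

In the first row only the value of action 1 changes (to `-1 + 0.5 * 4`), and in the second row the target for action 0 is just the reward, because the episode ended there.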

In [None]:
# This function preprocesses a frame from the game by:
# - Cropping away the ceiling, the floor and the side edges of the screen
# - Normalizing the pixel values to the [0, 1] range
# - Resizing the image to 84x84 pixels
def preprocess_frame(frame):
    cropped_frame = frame[30:-10, 30:-30]                             # Crop the screen
    normalized_frame = cropped_frame / 255.0                          # Normalize pixel values    
    preprocessed_frame = transform.resize(normalized_frame, [84, 84]) # Resize
    return preprocessed_frame
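
To see what the crop and the normalization do to the array, here is a NumPy-only sketch on a dummy frame. The 240x320 frame size is an assumption (the actual size depends on the screen resolution set in the scenario config), and the resize step is left out so that the example needs nothing but NumPy.

```python
import numpy as np

# Dummy greyscale frame with random pixel values; 240x320 is an assumed resolution
frame = np.random.randint(0, 256, size=(240, 320), dtype=np.uint8)

# Drop 30 rows at the top, 10 rows at the bottom and 30 columns on each side
cropped = frame[30:-10, 30:-30]

# Dividing by 255.0 turns the uint8 pixels into floats between 0 and 1
normalized = cropped / 255.0

print(cropped.shape)   # (200, 260)
print(normalized.dtype, normalized.min() >= 0.0, normalized.max() <= 1.0)
```

Under this assumption the cropped frame is not square, so the final resize to 84x84 also changes the aspect ratio of the image.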

Here are some useful functions:

* `np.argmax(v)`
  * Return the index of the largest value in v
* `np.stack(list)`
  * Stack the arrays in list along a new axis and return them as a single array
* `np.expand_dims(m, axis=n)`
  * Expand matrix m with an additional dimension along axis n
* `game.new_episode()`
  * Start a new episode
* `game.get_state().screen_buffer`
  * Screen buffer (image) of the current state of the game
* `game.make_action(actions)`
  * Take the actions denoted by True in the actions vector (e.g. [True, False, False] to perform action 0)
* `game.is_episode_finished()`
  * Returns `True` when the episode is done (terminal state or episode timeout)
* `random.sample(list, n)`
  * Sample n elements from list
* `model.predict_on_batch(batch)`
  * Feed batch through the network and return the output values (Q-values for the state)
* `model.train_on_batch(batch, target)`
  * Feed batch through the network and optimize network to predict "target" next time
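
The NumPy and standard-library helpers from the list above can be tried out on dummy data before they are used with the game. In this sketch the frames are constant-valued arrays and `q_values` is a hand-written stand-in for the output of `model.predict_on_batch`; both are assumptions made for illustration.

```python
import random
from collections import deque

import numpy as np

# A deque with maxlen=4 keeps only the four most recent frames
frame_stack = deque(maxlen=4)
for i in range(6):
    frame_stack.append(np.full((84, 84), float(i)))   # dummy frame filled with the value i

state = np.stack(frame_stack, axis=2)    # four 84x84 frames -> one (84, 84, 4) array
batch = np.expand_dims(state, axis=0)    # add a batch dimension -> (1, 84, 84, 4)

print(state.shape, batch.shape)
print(state[0, 0, :])                    # frames 2, 3, 4 and 5 are left, oldest first

q_values = np.array([[0.1, 0.7, 0.2]])   # stand-in for model.predict_on_batch(batch)
action = int(np.argmax(q_values))        # index of the largest Q-value

experiences = list(range(10))            # stand-in for the replay buffer
minibatch = random.sample(experiences, 3)
print(action, len(minibatch))
```

The last axis of `state` holds the frame history, which is why the network's `input_shape` is `(84, 84, 4)`.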

In [None]:
if enable_training:
    # Initialize replay buffer
    replay_buffer = deque(maxlen=replay_buffer_size)

    # Initialize frame stack
    frame_stack = deque(maxlen=4)

    # For every episode
    episode_loss = float("nan")
    for episode in range(num_episodes):
        clear_output(wait=True)
        print("-- Episode {}/{} --".format(episode+1, num_episodes))
        print("Episode loss:", episode_loss)

        # Start new episode
        game.new_episode()

        # Initialize frame stack with the first frame of the game
        initial_frame = preprocess_frame(game.get_state().screen_buffer)
        for _ in range(4):
            frame_stack.append(initial_frame)
        state = np.stack(frame_stack, axis=2) # Stack the frames to set up the initial state

        # Run the episode
        episode_loss = 0
        while not game.is_episode_finished():    
            # Get action with highest Q-value for current state
            action = np.argmax(model.predict_on_batch(np.expand_dims(state, axis=0)))
            action_one_hot = [False] * num_actions
            action_one_hot[action] = True

            # Take action and get a reward
            reward = game.make_action(action_one_hot)

            # Break if the episode is finished
            if game.is_episode_finished():
                break

            # If not, get the new state
            frame_stack.append(preprocess_frame(game.get_state().screen_buffer))
            new_state = np.stack(frame_stack, axis=2)

            # Store the replay
            replay_buffer.append((state, action, reward, new_state))
            state = new_state

            # Train network on a random sample of previous experiences
            if len(replay_buffer) >= batch_size:
                # Get replay batch
                replay_batch      = random.sample(replay_buffer, batch_size)
                replay_state      = np.array([r[0] for r in replay_batch])
                replay_action     = np.array([r[1] for r in replay_batch])
                replay_reward     = np.array([r[2] for r in replay_batch])
                replay_next_state = np.array([r[3] for r in replay_batch])

                # Q_target = reward + gamma * max_a' Q(s', a') for the action that was taken.
                # The other actions keep their current predictions, so they add no error.
                next_Q_max = np.max(model.predict_on_batch(replay_next_state), axis=1)
                Q_target = np.array(model.predict_on_batch(replay_state))
                Q_target[np.arange(batch_size), replay_action] = replay_reward + discount_factor * next_Q_max

                # Run training pass
                episode_loss += model.train_on_batch(replay_state, Q_target)
    model.save("basic_dqn.h5")
else:
    model = load_model("basic_dqn.h5")

-- Episode 76/500 --
Episode loss: 2560.7036975016817
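
The training loop above always picks the action with the highest Q-value, so the agent only sees what its current estimates lead it to. A common alternative, which this notebook does not use, is epsilon-greedy action selection: with a small probability the agent takes a random action instead. The sketch below is an illustration only; the function name `epsilon_greedy` and the example Q-values are assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.1, 0.7, 0.2]                     # stand-in for the network's output for one state

print(epsilon_greedy(q_values, epsilon=0.0))   # never explores, always the greedy action
print(epsilon_greedy(q_values, epsilon=1.0))   # always a uniformly random action
```

In practice epsilon is often started high and lowered over the course of training, so the agent explores early on and exploits its estimates later.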


# Let's Evaluate the Model

In [None]:
# Reinitialize game with set_window_visible = True
game = vizdoom.DoomGame()
game.load_config("scenarios/basic.cfg")

# Visualize the game
game.set_window_visible(True)

# Set screen format to greyscale. This improves training time
game.set_screen_format(vizdoom.ScreenFormat.GRAY8)

# Make the game end after 100 ticks (set to 0 to disable)
game.set_episode_timeout(100)

# Init game
game.init()

record = True
if record:
    import imageio
    recording = []

# Initialize frame stack (it only exists already if the training cell was run)
frame_stack = deque(maxlen=4)

# For every episode
for episode in range(100):
    # Start new episode
    game.new_episode()
    
    # Initialize frame stack with the first frame of the game
    initial_frame = preprocess_frame(game.get_state().screen_buffer)
    for _ in range(4):
        frame_stack.append(initial_frame)
    state = np.stack(frame_stack, axis=2) # Stack the frames to set up the initial state
    
    # Run the episode
    while not game.is_episode_finished():
        if record:
            recording.append((transform.resize(game.get_state().screen_buffer, [240, 320]) * 255.5).astype(np.uint8))
        
        # Get action with highest Q-value for current state
        action = np.argmax(model.predict_on_batch(np.expand_dims(state, axis=0)))
        action_one_hot = [False] * num_actions
        action_one_hot[action] = True
        
        # Take action and get a reward
        reward = game.make_action(action_one_hot)
        
        # Break if the episode is finished
        if game.is_episode_finished():
            break
        
        # If not, get the new state
        frame_stack.append(preprocess_frame(game.get_state().screen_buffer))
        state = np.stack(frame_stack, axis=2)
        
if record:
    imageio.mimwrite("basic_dqn.gif", recording, subrectangles=True)#, palettesize=16)