# Deep Q learning with Doom 🕹️
In this notebook we'll implement an agent <b>that plays Doom by using a Deep Q learning architecture.</b> <br>
Our agent playing Doom:

<img src="https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/docs/assets/img/doom_reduced.gif" style="max-width: 600px;" alt="Deep Q learning with Doom"/>


## This notebook is part of the Free Deep Reinforcement Course 📝
<img src="https://simoninithomas.github.io/Deep_reinforcement_learning_Course/assets/img/preview.jpg" alt="Deep Reinforcement Course" style="width: 500px;"/>

<p> Deep Reinforcement Learning Course is a free series of blog posts about Deep Reinforcement Learning, where we'll learn the main algorithms, <b>and how to implement them in Tensorflow.</b></p>

<p>The goal of these articles is to <b>explain each algorithm step by step</b>, from the big picture and the mathematical details behind it to the implementation with Tensorflow.</p>


<a href="https://simoninithomas.github.io/Deep_reinforcement_learning_Course/">Syllabus</a><br>
<a href="https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419">Part 0: Introduction to Reinforcement Learning </a><br>
<a href="https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe"> Part 1: Q-learning with FrozenLake</a><br>
<a href=""> Part 2: Deep Q-learning with Doom</a><br>
<a href=""> Part 3: Policy Gradients with Doom </a><br>

## Checklist 📝
- To launch tensorboard : `tensorboard --logdir=/tensorboard/dqn/1`
- ⚠️⚠️⚠️ You need to download vizdoom and place the folder in the repo.
- If you don't want to train, set **training to False** (in the hyperparameters step).


## Any questions 👨‍💻
<p> If you have any questions, feel free to ask me: </p>
<p> 📧: <a href="mailto:hello@simoninithomas.com">hello@simoninithomas.com</a>  </p>
<p> Github: https://github.com/simoninithomas/Deep_reinforcement_learning_Course </p>
<p> 🌐 : https://simoninithomas.github.io/Deep_reinforcement_learning_Course/ </p>
<p> Twitter: <a href="https://twitter.com/ThomasSimonini">@ThomasSimonini</a> </p>
<p> Don't forget to <b> follow me on <a href="https://twitter.com/ThomasSimonini">twitter</a>, <a href="https://github.com/simoninithomas/Deep_reinforcement_learning_Course">github</a> and <a href="https://medium.com/@thomassimonini">Medium</a> to be alerted of the new articles that I publish </b></p>
    

## How to help  🙌
3 ways:
- **Clap our articles a lot**: Clapping on Medium means that you really like our articles. And the more claps we have, the more our articles are shared.
- **Share and speak about our articles**: By sharing our articles you help us to spread the word.
- **Improve our notebooks**: if you found a bug or **a better implementation** you can send a pull request.
<br>

## Important note 🤔
Some problems with Jupyter Notebook and a GPU service forced me to run this notebook on my own computer. **You can do the same, but training is easier with GPUs.**

<b>It's better to run it on a GPU-based service.</b> Personally, I use Microsoft Azure and their Deep Learning Virtual Machine (they offer $170):
https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.dsvm-deep-learning
<br>
⚠️ I don't have any business relations with them. I just loved their excellent customer service.

If you have trouble using Microsoft Azure, follow the explanations in this excellent article (skip the last part about fast.ai): https://medium.com/@manikantayadunanda/setting-up-deeplearning-machine-and-fast-ai-on-azure-a22eb6bd6429

## Step 1: Import the libraries 📚

In [1]:
import tensorflow as tf      # Deep Learning library
import numpy as np           # Handle matrices
from vizdoom import *        # Doom Environment
import random                # Handling random number generation
import time                  # Handling time calculation
from skimage import transform# Help us to preprocess the frames

from collections import deque# Ordered collection with ends
import matplotlib.pyplot as plt # Display graphs


## Step 2: Create our environment 🎮
- Now that we imported the libraries/dependencies, we will create our environment.
- Doom environment takes:
    - A `configuration file` that **handles all the options** (size of the frame, possible actions...)
    - A `scenario file` that **generates the correct scenario** (in our case basic, **but you're invited to try other scenarios**).
- Note: We have 3 possible actions `[[0,0,1], [1,0,0], [0,1,0]]`, so we don't need to do one-hot encoding (thanks to <a href="https://stackoverflow.com/users/2237916/silgon">silgon</a> for figuring this out).
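
To see why no extra encoding step is needed, here is a minimal sketch. It assumes the button order left, right, shoot used in `create_environment`; the helper `one_hot` is hypothetical and only there for comparison.

```python
# Each action is already a one-hot vector: exactly one button is pressed.
left, right, shoot = [1, 0, 0], [0, 1, 0], [0, 0, 1]
possible_actions = [left, right, shoot]

def one_hot(index, size):
    """Hypothetical helper: build the one-hot vector for `index`."""
    return [1 if i == index else 0 for i in range(size)]

# The hand-written action lists match what one-hot encoding would produce,
# so the index of the largest Q-value maps straight to an action.
encoded = [one_hot(i, len(possible_actions)) for i in range(len(possible_actions))]
assert encoded == possible_actions
```

Because of this, the same lists can be passed to the game as button presses and fed to the network as an action mask.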

### Our environment
<img src="assets/doom.png" style="max-width:500px;" alt="Doom"/>
                                    
- A monster is spawned **randomly somewhere along the opposite wall**. 
- Player can only go **left/right and shoot**. 
- 1 hit is enough **to kill the monster**. 
- Episode finishes when **monster is killed or on timeout (300)**.
<br><br>
REWARDS:

- +101 for killing the monster 
- -5 for missing 
- living reward = -1
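
As a rough worked example of how these rewards add up over an episode, the sketch below assumes that the living reward is charged once per action and that a miss only adds its -5 penalty. The function name `episode_return` is made up for illustration, and frame skipping in the config can change the exact totals you see during training.

```python
KILL_REWARD = 101    # +101 for killing the monster
MISS_PENALTY = -5    # -5 for missing
LIVING_REWARD = -1   # living reward = -1
TIMEOUT = 300        # episode timeout

def episode_return(steps, misses, killed):
    """Sum the rewards of one episode under the simple additive assumption."""
    steps = min(steps, TIMEOUT)
    total = steps * LIVING_REWARD + misses * MISS_PENALTY
    if killed:
        total += KILL_REWARD
    return total

quick_kill = episode_return(steps=6, misses=0, killed=True)
sloppy_kill = episode_return(steps=20, misses=3, killed=True)
timed_out = episode_return(steps=300, misses=0, killed=False)
print(quick_kill, sloppy_kill, timed_out)
```

Under these assumptions the agent is pushed to kill the monster quickly and without wasting shots, since every extra step and every miss lowers the return.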

In [2]:
"""
Here we create our environment
"""
def create_environment():
    game = DoomGame()
    
    # Load the correct configuration
    game.load_config("basic.cfg")
    
    # Load the correct scenario (in our case basic scenario)
    game.set_doom_scenario_path("basic.wad")
    
    game.init()
    
    # Here our possible actions
    left = [1, 0, 0]
    right = [0, 1, 0]
    shoot = [0, 0, 1]
    possible_actions = [left, right, shoot]
    
    return game, possible_actions
       
"""
Here we perform random actions to test the environment
"""
def test_environment():
    game = DoomGame()
    game.load_config("basic.cfg")
    game.set_doom_scenario_path("basic.wad")
    game.init()
    shoot = [0, 0, 1]
    left = [1, 0, 0]
    right = [0, 1, 0]
    actions = [shoot, left, right]

    episodes = 10
    for i in range(episodes):
        game.new_episode()
        while not game.is_episode_finished():
            state = game.get_state()
            img = state.screen_buffer
            misc = state.game_variables
            action = random.choice(actions)
            print(action)
            reward = game.make_action(action)
            print ("\treward:", reward)
            time.sleep(0.02)
        print ("Result:", game.get_total_reward())
        time.sleep(2)
    game.close()

In [3]:
game, possible_actions = create_environment()

## Step 3: Set up our hyperparameters ⚗️
In this part we'll set up our different hyperparameters. But when you implement a neural network by yourself, you will **not define all the hyperparameters at once but progressively**.

- First, you begin by defining the neural networks hyperparameters when you implement the model.
- Then, you'll add the training hyperparameters when you implement the training algorithm.

In [4]:
### MODEL HYPERPARAMETERS
state_size = [84,84,4]      # Our input is a stack of 4 frames hence 84x84x4 (Width, height, channels) 
action_size = game.get_available_buttons_size()              # 3 possible actions: left, right, shoot
learning_rate =  0.0002      # Alpha (aka learning rate)

### TRAINING HYPERPARAMETERS
total_episodes = 5000        # Total episodes for training
max_steps = 100              # Max possible steps in an episode
batch_size = 64             

# Exploration parameters for epsilon greedy strategy
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Q learning hyperparameters
gamma = 0.99

### MEMORY HYPERPARAMETERS
pretrain_length = batch_size
memory_size = 50000

### PREPROCESSING HYPERPARAMETERS
stack_size = 4

### MODIFY THIS TO FALSE IF YOU JUST WANT TO SEE THE TRAINED AGENT
training = True
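
To get a feel for the exploration hyperparameters, the sketch below evaluates the same exponential decay formula that the training loop uses, with the values set in this step. The function name `explore_probability` is an assumption of this sketch.

```python
import math

explore_start = 1.0   # exploration probability at start
explore_stop = 0.01   # minimum exploration probability
decay_rate = 0.0001   # exponential decay rate

def explore_probability(decay_step):
    """Epsilon after `decay_step` training steps."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * decay_step)

# Epsilon starts at explore_start and approaches explore_stop as steps accumulate
for step in (0, 1000, 10000, 100000):
    print(step, round(explore_probability(step), 4))
```

A smaller `decay_rate` keeps the agent exploring for longer; a larger one makes it rely on its Q-values sooner.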

## Step 4 : Define the preprocessing functions ⚙️
### preprocess_frame
Preprocessing is an important step, <b>because we want to reduce the complexity of our states to reduce the computation time needed for training.</b>
<br><br>
Our steps:
- Grayscale each of our frames (because <b> color does not add important information </b>). But this is already done by the config file.
- Crop the screen (in our case we remove the roof because it contains no information)
- We normalize pixel values
- Finally we resize the preprocessed frame
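
The crop, normalize and resize steps can be sketched without any image library. This is an illustration only: it uses plain nested lists and nearest-neighbour resizing, whereas the notebook relies on `skimage.transform.resize`, which interpolates. The 120x160 input size and the helper names are assumptions.

```python
def crop(frame, top, bottom, left, right):
    """Drop rows and columns, like frame[top:-bottom, left:-right]."""
    return [row[left:len(row) - right] for row in frame[top:len(frame) - bottom]]

def normalize(frame):
    """Scale 0-255 pixel values into the range [0, 1]."""
    return [[pixel / 255.0 for pixel in row] for row in frame]

def resize_nearest(frame, out_h, out_w):
    """Nearest-neighbour resize (a stand-in for skimage's interpolating resize)."""
    in_h, in_w = len(frame), len(frame[0])
    return [[frame[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

# Assumed input: a 120x160 grayscale frame filled with a simple gradient
raw = [[(r + c) % 256 for c in range(160)] for r in range(120)]

cropped = crop(raw, top=30, bottom=10, left=30, right=30)
processed = resize_nearest(normalize(cropped), 84, 84)

print(len(cropped), len(cropped[0]))      # size after cropping
print(len(processed), len(processed[0]))  # size after resizing
```

Whatever the input resolution, the network always receives an 84x84 frame with values between 0 and 1.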

In [5]:
"""
    preprocess_frame:
    Take a frame.
    Resize it.
        __________________
        |                 |
        |                 |
        |                 |
        |                 |
        |_________________|
        
        to
        _____________
        |            |
        |            |
        |            |
        |____________|
    Normalize it.
    
    return preprocessed_frame
    
    """
def preprocess_frame(frame):
    # Greyscale frame already done in our vizdoom config
    # x = np.mean(frame,-1)
    
    # Crop the screen (remove the roof because it contains no information)
    cropped_frame = frame[30:-10,30:-30]
    
    # Normalize Pixel Values
    normalized_frame = cropped_frame/255.0
    
    # Resize
    preprocessed_frame = transform.resize(normalized_frame, [84,84])
    
    return preprocessed_frame

### stack_frames
👏 This part was made possible thanks to the help of <a href="https://github.com/Miffyli">Anssi</a><br>
Stacking frames is really important because it **gives our neural network a sense of motion.**
- First we preprocess frame
- Then we append the frame to the deque that automatically **removes the oldest frame**
- Finally we **build the stacked state**

This is how stacking works:
- For the first frame, we fill the other 3 slots with blank frames
- At each timestep, **we add the new frame to the deque and then stack its contents to form a new stacked state**
- And so on
<img src="assets/stack.png" alt="stack">
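
The stacking behaviour can be sketched with string labels in place of real frames. Note that the `is_new_episode` flag is an addition of this sketch (the notebook's `stack_frames` does not take it); it shows one way to start each episode with blank frames in the older slots.

```python
from collections import deque

stack_size = 4
BLANK = "blank"  # stands in for an all-zero 84x84 frame

def stack_frames(stacked_frames, frame, is_new_episode):
    """Return the updated deque and the stacked state (oldest frame first)."""
    if is_new_episode:
        # Hypothetical reset: begin the episode with blanks in the older slots
        stacked_frames = deque([BLANK] * stack_size, maxlen=stack_size)
    # Appending to a full deque silently drops the oldest frame
    stacked_frames.append(frame)
    return stacked_frames, list(stacked_frames)

frames = deque([BLANK] * stack_size, maxlen=stack_size)
frames, first_state = stack_frames(frames, "f0", is_new_episode=True)
print(first_state)

for name in ("f1", "f2", "f3", "f4"):
    frames, state = stack_frames(frames, name, is_new_episode=False)
print(state)
```

After four more frames the blanks and `f0` have been pushed out, so the state always holds the four most recent frames.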

In [6]:
# Initialize the deque with one blank (all-zero) 84x84 frame per stack slot
# (np.int was removed from recent NumPy releases, so we use the built-in int)
stacked_frames = deque([np.zeros((84,84), dtype=int) for i in range(stack_size)], maxlen=stack_size)

def stack_frames(stacked_frames, state):
    # Preprocess frame
    frame = preprocess_frame(state)
        
    # Append frame to deque, automatically removes the oldest frame
    stacked_frames.append(frame)
       
    # Build the stacked state (the last dimension indexes the frames: 84x84x4)
    stacked_state = np.stack(stacked_frames, axis=2)
    
    return stacked_state

## Step 5: Create our Deep Q-learning Neural Network model 🧠
<img src="assets/model.png" alt="Model" />
This is our Deep Q-learning model:
- We take a stack of 4 frames as input
- It passes through 3 convolutional layers
- Then it is flattened
- Finally it passes through 2 fully connected (FC) layers
- It outputs a Q-value for each action
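
The layer sizes can be checked with a little arithmetic. The sketch below takes the kernel sizes, strides and filter counts from the `DQNetwork` class defined in this step and applies the usual output-size formula for "VALID" padding; the helper name `conv_output_size` is an assumption.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial output size of a convolution with "VALID" padding."""
    return (input_size - kernel_size) // stride + 1

size = 84  # input frames are 84x84
layers = [(8, 4, 32), (4, 2, 64), (4, 2, 128)]  # (kernel, stride, filters)

shapes = []
for kernel, stride, filters in layers:
    size = conv_output_size(size, kernel, stride)
    shapes.append((size, size, filters))

# Number of values fed to the first fully connected layer
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)
print(flattened)
```

These values match the shape comments written next to each layer in the class.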

In [7]:
class DQNetwork:
    def __init__(self, state_size, action_size, learning_rate, name='DQNetwork'):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        
        with tf.variable_scope(name):
            # We create the placeholders
            # *state_size unpacks the elements of state_size, so this is the same as writing
            # [None, 84, 84, 4]
            self.inputs_ = tf.placeholder(tf.float32, [None, *state_size], name="inputs")
            self.actions_ = tf.placeholder(tf.float32, [None, self.action_size], name="actions_")
            
            # Remember that target_Q is R(s,a) + gamma * max_a' Qhat(s', a')
            self.target_Q = tf.placeholder(tf.float32, [None], name="target")
            
            """
            First convnet:
            CNN
            BatchNormalization
            ELU
            """
            # Input is 84x84x4
            self.conv1 = tf.layers.conv2d(inputs = self.inputs_,
                                         filters = 32,
                                         kernel_size = [8,8],
                                         strides = [4,4],
                                         padding = "VALID",
                                          kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),
                                         name = "conv1")
            
            self.conv1_batchnorm = tf.layers.batch_normalization(self.conv1,
                                                   training = True,
                                                   epsilon = 1e-5,
                                                     name = 'batch_norm1')
            
            self.conv1_out = tf.nn.elu(self.conv1_batchnorm, name="conv1_out")
            ## --> [20, 20, 32]
            
            
            """
            Second convnet:
            CNN
            BatchNormalization
            ELU
            """
            self.conv2 = tf.layers.conv2d(inputs = self.conv1_out,
                                 filters = 64,
                                 kernel_size = [4,4],
                                 strides = [2,2],
                                 padding = "VALID",
                                kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),
                                 name = "conv2")
        
            self.conv2_batchnorm = tf.layers.batch_normalization(self.conv2,
                                                   training = True,
                                                   epsilon = 1e-5,
                                                     name = 'batch_norm2')

            self.conv2_out = tf.nn.elu(self.conv2_batchnorm, name="conv2_out")
            ## --> [9, 9, 64]
            
            
            """
            Third convnet:
            CNN
            BatchNormalization
            ELU
            """
            self.conv3 = tf.layers.conv2d(inputs = self.conv2_out,
                                 filters = 128,
                                 kernel_size = [4,4],
                                 strides = [2,2],
                                 padding = "VALID",
                                kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),
                                 name = "conv3")
        
            self.conv3_batchnorm = tf.layers.batch_normalization(self.conv3,
                                                   training = True,
                                                   epsilon = 1e-5,
                                                     name = 'batch_norm3')

            self.conv3_out = tf.nn.elu(self.conv3_batchnorm, name="conv3_out")
            ## --> [3, 3, 128]
            
            
            self.flatten = tf.layers.flatten(self.conv3_out)
            ## --> [1152]
            
            
            self.fc = tf.layers.dense(inputs = self.flatten,
                                  units = 512,
                                  activation = tf.nn.elu,
                                       kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                name="fc1")
            
            
            # One output unit (Q-value) per possible action
            self.output = tf.layers.dense(inputs = self.fc,
                                          kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                          units = self.action_size,
                                          activation=None)

  
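            # Q is the predicted Q-value of the action actually taken
            # (the one-hot actions_ mask zeroes out the other outputs).
            self.Q = tf.reduce_sum(tf.multiply(self.output, self.actions_), axis=1)
            
            
            # The loss is the mean squared difference between the Q_target and our predicted Q-values
            # mean((Qtarget - Q)^2)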
            self.loss = tf.reduce_mean(tf.square(self.target_Q - self.Q))
            
            self.optimizer = tf.train.RMSPropOptimizer(self.learning_rate).minimize(self.loss)

In [8]:
# Reset the graph
tf.reset_default_graph()

# Instantiate the DQNetwork
DQNetwork = DQNetwork(state_size, action_size, learning_rate)

## Step 6: Experience Replay 🔁
Now that we have created our neural network, **we need to implement the Experience Replay method.** <br><br>
Here we'll create the Memory object, which wraps a deque. A deque (double-ended queue) created with a maximum length **removes the oldest element each time you add a new element once it is full.**

This part was taken from Udacity: <a href="https://github.com/udacity/deep-learning/blob/master/reinforcement/Q-learning-cart.ipynb">Cartpole DQN</a>
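
Here is a small sketch of the deque behaviour that the Memory object relies on, using a tiny capacity and string labels instead of real experiences. It samples with the standard library's `random.sample` as a stand-in for the `np.random.choice(..., replace=False)` call used in the class.

```python
import random
from collections import deque

memory = deque(maxlen=3)          # tiny capacity to make eviction visible
for experience in ["e0", "e1", "e2", "e3", "e4"]:
    memory.append(experience)     # once full, the oldest entry is dropped

print(list(memory))

random.seed(0)                    # fixed seed so the sketch is repeatable
batch = random.sample(list(memory), 2)   # sample without replacement
print(batch)
```

Only the three most recent experiences survive, and a sampled batch never contains the same experience twice.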

In [9]:
class Memory():
    def __init__(self, max_size):
        self.buffer = deque(maxlen = max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
    
    def sample(self, batch_size):
        buffer_size = len(self.buffer)
        index = np.random.choice(np.arange(buffer_size),
                                size = batch_size,
                                replace = False)
        
        return [self.buffer[i] for i in index]

Here we'll **deal with the empty memory problem**: we pre-populate our memory by taking random actions and storing the experience (state, action, reward, next_state, done).

In [10]:
# Start a new episode
game.new_episode()

# Instantiate memory
memory = Memory(max_size = memory_size)

for i in range(pretrain_length):
    if i == 0:
        # First we need a state
        state = game.get_state().screen_buffer
        state = stack_frames(stacked_frames, state)
    
    # Random action
    action = random.choice(possible_actions)
    
    # Get the rewards
    reward = game.make_action(action)
    
    # Look if the episode is finished
    done = game.is_episode_finished()
    
    
    if done:
        # We finished the episode
        next_state = np.zeros(state.shape)
        
        # Add experience to memory
        memory.add((state, action, reward, next_state, done))
        
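        # Start a new episode
        game.new_episode()
        
        # Observe the first state of the new episode, so the next
        # experience does not reuse the state of the finished episode
        state = game.get_state().screen_buffer
        state = stack_frames(stacked_frames, state)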
    else:
        # Get the next state
        next_state = game.get_state().screen_buffer
        next_state = stack_frames(stacked_frames, next_state)
        
        # Add experience to memory
        memory.add((state, action, reward, next_state, done))
        
        # Our state is now the next_state
        state = next_state



## Step 7: Set up Tensorboard 📊
For more information about tensorboard, please watch this <a href="https://www.youtube.com/embed/eBbEDRsCmv4">excellent 30min tutorial</a> <br><br>
To launch tensorboard : `tensorboard --logdir=/tensorboard/dqn/1`

In [12]:
# Setup TensorBoard Writer
writer = tf.summary.FileWriter("/tensorboard/dqn/1")

## Losses
tf.summary.scalar("Loss", DQNetwork.loss)

write_op = tf.summary.merge_all()

## Step 8: Train our Agent 🏃‍♂️

Our algorithm:
<br>
* Initialize the weights
* Initialize the environment
* Initialize the decay step (used to reduce epsilon)
<br><br>
* **For** episode = 1 to max_episode **do**
    * Make new episode
    * Set step to 0
    * Observe the first state $s_0$
    <br><br>
    * **While** step < max_steps **do**:
        * Increase decay_step
        * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t,a)$
        * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
        * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
        * Sample random mini-batch from $D$: $<s, a, r, s'>$
        * Set $\hat{Q} = r$ if $s'$ is terminal (the episode ends), otherwise set $\hat{Q} = r + \gamma \max_{a'}{Q(s', a')}$
        * Make a gradient descent step with loss $(\hat{Q} - Q(s, a))^2$
    * **endwhile**
    <br><br>
* **endfor**
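
The target and loss steps of the algorithm can be sketched in plain Python on a toy mini-batch. All Q-values below are made-up numbers and the helper names `td_target` and `selected_q` are assumptions; in the notebook these quantities come from the network and TensorFlow computes the loss.

```python
gamma = 0.99

def td_target(reward, next_q_values, done):
    """Q-target for one transition: r if terminal, else r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def selected_q(q_values, action):
    """Pick Q(s, a) by masking the network output with the one-hot action."""
    return sum(q * a for q, a in zip(q_values, action))

# Toy mini-batch: (Q(s, .), one-hot action, reward, Q(s', .), done)
batch = [
    ([1.0, 2.0, 0.5], [0, 1, 0], -1.0, [0.0, 3.0, 1.0], False),
    ([0.2, 0.1, 4.0], [0, 0, 1], 100.0, [0.0, 0.0, 0.0], True),
]

targets = [td_target(r, next_q, done) for _, _, r, next_q, done in batch]
predictions = [selected_q(q, a) for q, a, _, _, _ in batch]

# Mean squared error between targets and predicted Q-values
loss = sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(batch)
print(targets, predictions, loss)
```

The terminal transition ignores the next state's Q-values entirely, which is why the `done` flag has to be stored with every experience.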

    

In [13]:
# Saver will help us to save our model
saver = tf.train.Saver()

if training == True:
    rewards_list = []

    with tf.Session() as sess:
        # Initialize the variables
        sess.run(tf.global_variables_initializer())

        # Init the game
        game.init()

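        decay_step = 0

        # No training step has run yet; this keeps the progress print
        # from failing if an episode ends on its very first step
        loss = 0.0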

        for episode in range(total_episodes):
            # Make new episode
            game.new_episode()
            step = 0

            # Observe the first state
            frame = game.get_state().screen_buffer
            state = stack_frames(stacked_frames, frame)

            while step < max_steps:
                step += 1
                # Increase decay_step
                decay_step +=1

                ## EPSILON GREEDY STRATEGY
                # Choose action a from state s using epsilon greedy.
                ## First we randomize a number
                exp_exp_tradeoff = np.random.rand()

                # Here we'll use an improved version of our epsilon greedy strategy used in Q-learning notebook
                explore_probability = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * decay_step)


                if (explore_probability > exp_exp_tradeoff):
                    # Make a random action
                    action = random.choice(possible_actions)

                else:
                    # Get action from Q-network
                    # Estimate the Q-values for this state
                    Qs = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: state.reshape((1, *state.shape))})

                    # Take the biggest Q value (= the best action)
                    action = np.argmax(Qs)

                    action = possible_actions[int(action)]

                # Do the action
                reward = game.make_action(action)

                # Look if the episode is finished
                done = game.is_episode_finished()

                # If the game is finished
                if done:
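                    # The episode ended, so there is no next state: use a blank frame
                    # (np.int was removed from recent NumPy releases, so we use the built-in int)
                    next_state = np.zeros((84,84), dtype=int)
                    next_state = stack_frames(stacked_frames, next_state)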

                    # Set step = max_steps to end the episode
                    step = max_steps

                    total_reward = game.get_total_reward()

                    print('Episode: {}'.format(episode),
                              'Total reward: {}'.format(total_reward),
                              'Training loss: {:.4f}'.format(loss),
                              'Explore P: {:.4f}'.format(explore_probability))

                    rewards_list.append((episode, total_reward))

                    memory.add((state, action, reward, next_state, done))

                else:
                    # Get the next state
                    next_state = game.get_state().screen_buffer
                    next_state = stack_frames(stacked_frames, next_state)

                    # Add experience to memory
                    memory.add((state, action, reward, next_state, done))
                    state = next_state




                ### LEARNING PART            
                # Obtain random mini-batch from memory
                batch = memory.sample(batch_size)
                states = np.array([each[0] for each in batch], ndmin=3)
                actions = np.array([each[1] for each in batch])
                rewards = np.array([each[2] for each in batch]) 
                next_states = np.array([each[3] for each in batch])
                dones = np.array([each[4] for each in batch])

                target_Qs_batch = []

                # Get Q values for next_state 
                target_Qs = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: next_states})

                # Set Qhat = r if the episode ends at s', otherwise set Qhat = r + gamma * max_a' Q(s', a')
                for i in range(0, len(batch)):
                    terminal = dones[i]

                    # In a terminal state the target is just the reward
                    if terminal:
                        target_Qs_batch.append(rewards[i])
                    else:
                        target = rewards[i] + gamma * np.max(target_Qs[i])
                        target_Qs_batch.append(target)


                targets = np.array(target_Qs_batch)

                loss, _ = sess.run([DQNetwork.loss, DQNetwork.optimizer],
                                    feed_dict={DQNetwork.inputs_: states,
                                               DQNetwork.target_Q: targets,
                                               DQNetwork.actions_: actions})

                # Write TF Summaries
                summary = sess.run(write_op, feed_dict={DQNetwork.inputs_: states,
                                                   DQNetwork.target_Q: targets,
                                                   DQNetwork.actions_: actions})
                writer.add_summary(summary, episode)
                writer.flush()

            # Save model every 5 episodes
            if episode % 5 == 0:
                save_path = saver.save(sess, "./models/model.ckpt")
                print("Model Saved")



Episode: 0 Total reward: 95.0 Training loss: 159.8951 Explore P: 0.9994
Model Saved
Episode: 1 Total reward: 91.0 Training loss: 314.2269 Explore P: 0.9984
Episode: 2 Total reward: 25.0 Training loss: 249.3162 Explore P: 0.9924
Episode: 3 Total reward: 95.0 Training loss: 91.9304 Explore P: 0.9918
Episode: 5 Total reward: 93.0 Training loss: 20.8968 Explore P: 0.9813
Model Saved
Episode: 6 Total reward: 94.0 Training loss: 30.8907 Explore P: 0.9806
Episode: 7 Total reward: 95.0 Training loss: 241.0443 Explore P: 0.9800
Episode: 10 Total reward: 95.0 Training loss: 33.5293 Explore P: 0.9602
Model Saved
Episode: 13 Total reward: 95.0 Training loss: 9.7165 Explore P: 0.9409
Episode: 15 Total reward: 95.0 Training loss: 12.4297 Explore P: 0.9310
Model Saved
Episode: 16 Total reward: 94.0 Training loss: 11.5098 Explore P: 0.9304
Episode: 18 Total reward: 15.0 Training loss: 37.0864 Explore P: 0.9148
Episode: 20 Total reward: 69.0 Training loss: 16.4923 Explore P: 0.9034
Model Saved
...

Episode: 153 Total reward: 95.0 Training loss: 9.9128 Explore P: 0.5214
Episode: 154 Total reward: 88.0 Training loss: 5.9731 Explore P: 0.5207
Episode: 155 Total reward: 93.0 Training loss: 8.5361 Explore P: 0.5203
Model Saved
Episode: 156 Total reward: 95.0 Training loss: 9.6579 Explore P: 0.5200
Episode: 158 Total reward: 57.0 Training loss: 12.6537 Explore P: 0.5132
Episode: 159 Total reward: 88.0 Training loss: 7.6906 Explore P: 0.5126
Episode: 160 Total reward: 88.0 Training loss: 11.4646 Explore P: 0.5119
Model Saved
Episode: 162 Total reward: 93.0 Training loss: 8.9778 Explore P: 0.5065
Episode: 163 Total reward: 94.0 Training loss: 13.8677 Explore P: 0.5062
Episode: 164 Total reward: 93.0 Training loss: 8.4455 Explore P: 0.5058
Episode: 165 Total reward: 93.0 Training loss: 8.1312 Explore P: 0.5054
Model Saved
Episode: 166 Total reward: 95.0 Training loss: 13.6543 Explore P: 0.5051
Episode: 167 Total reward: 94.0 Training loss: 6.8575 Explore P: 0.5047
...

Episode: 274 Total reward: 93.0 Training loss: 9.7159 Explore P: 0.3828
Episode: 275 Total reward: 95.0 Training loss: 13.4010 Explore P: 0.3826
Model Saved
Episode: 276 Total reward: 91.0 Training loss: 11.7094 Explore P: 0.3822
Episode: 278 Total reward: 51.0 Training loss: 7.3124 Explore P: 0.3770
Episode: 279 Total reward: 94.0 Training loss: 8.8203 Explore P: 0.3768
Episode: 280 Total reward: 17.0 Training loss: 16.6367 Explore P: 0.3742
Model Saved
Episode: 281 Total reward: 95.0 Training loss: 10.2848 Explore P: 0.3740
Episode: 282 Total reward: 93.0 Training loss: 9.4529 Explore P: 0.3737
Episode: 283 Total reward: 25.0 Training loss: 6.3788 Explore P: 0.3715
Episode: 284 Total reward: 66.0 Training loss: 9.5051 Explore P: 0.3704
Episode: 285 Total reward: 68.0 Training loss: 13.6119 Explore P: 0.3694
Model Saved
Episode: 286 Total reward: 94.0 Training loss: 17.8033 Explore P: 0.3692
Episode: 287 Total reward: 82.0 Training loss: 7.6668 Explore P: 0.3685
...

Episode: 390 Total reward: 95.0 Training loss: 13.1551 Explore P: 0.2881
Model Saved
Episode: 391 Total reward: 95.0 Training loss: 9.0346 Explore P: 0.2880
Episode: 392 Total reward: 95.0 Training loss: 9.2041 Explore P: 0.2878
Episode: 393 Total reward: 53.0 Training loss: 10.1196 Explore P: 0.2866
Episode: 394 Total reward: 95.0 Training loss: 14.3436 Explore P: 0.2864
Episode: 395 Total reward: 41.0 Training loss: 6.9891 Explore P: 0.2851
Model Saved
Episode: 396 Total reward: 93.0 Training loss: 37.0162 Explore P: 0.2848
Episode: 397 Total reward: 95.0 Training loss: 8.7482 Explore P: 0.2847
Episode: 398 Total reward: 95.0 Training loss: 12.4865 Explore P: 0.2845
Episode: 399 Total reward: 55.0 Training loss: 10.0174 Explore P: 0.2834
Episode: 400 Total reward: 50.0 Training loss: 9.2013 Explore P: 0.2823
Model Saved
Episode: 401 Total reward: 89.0 Training loss: 11.5834 Explore P: 0.2819
Episode: 402 Total reward: 50.0 Training loss: 9.2483 Explore P: 0.2808
...

Episode: 500 Total reward: 94.0 Training loss: 7.7699 Explore P: 0.2390
Model Saved
Episode: 501 Total reward: 93.0 Training loss: 7.5197 Explore P: 0.2388
Episode: 502 Total reward: 93.0 Training loss: 9.1131 Explore P: 0.2386
Episode: 503 Total reward: 95.0 Training loss: 6.2553 Explore P: 0.2385
Episode: 504 Total reward: 95.0 Training loss: 8.3049 Explore P: 0.2383
Episode: 505 Total reward: 37.0 Training loss: 6.7135 Explore P: 0.2371
Model Saved
Episode: 506 Total reward: 27.0 Training loss: 9.1066 Explore P: 0.2358
Episode: 507 Total reward: 93.0 Training loss: 7.5311 Explore P: 0.2356
Episode: 508 Total reward: 49.0 Training loss: 7.5265 Explore P: 0.2346
Episode: 509 Total reward: 26.0 Training loss: 7.2603 Explore P: 0.2332
Episode: 510 Total reward: 95.0 Training loss: 15.0211 Explore P: 0.2331
Model Saved
Episode: 511 Total reward: 94.0 Training loss: 6.2787 Explore P: 0.2329
Episode: 512 Total reward: 62.0 Training loss: 8.8246 Explore P: 0.2321
...

KeyboardInterrupt: 

## Step 9: Watch our Agent play 👀
Now that we have trained our agent, we can test it.

In [None]:
with tf.Session() as sess:
    
    game = DoomGame()
    
    totalScore = 0
    
    # Load the correct configuration (test configuration)
    game.load_config("basic_test.cfg")
    
    # Load the correct scenario (in our case basic scenario)
    game.set_doom_scenario_path("basic.wad")
    
    # Load the model
    saver.restore(sess, "./models/model.ckpt")
    game.init()
    for i in range(10):
        
        game.new_episode()
        while not game.is_episode_finished():
            frame = game.get_state().screen_buffer
            state = stack_frames(stacked_frames, frame)
            # Take the biggest Q value (= the best action)
            Qs = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: state.reshape((1, *state.shape))})
            action = np.argmax(Qs)
            action = possible_actions[int(action)]
            game.make_action(action)        
            score = game.get_total_reward()
        print("Score: ", score)
        totalScore += score
    # Average score over the 10 test episodes
    print("AVERAGE_SCORE", totalScore/10.0)
    game.close()