
# Deep Q learning with Doom 🕹️
In this notebook we'll implement an agent <b>that plays Doom by using a Deep Q learning architecture.</b> <br>
Our agent playing Doom:

<img src="assets/doom.gif" style="max-width: 600px;" alt="Deep Q learning with Doom"/>


# This is a notebook from [Deep Reinforcement Learning Course with Tensorflow](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)
<img src="https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/docs/assets/img/DRLC%20Environments.png" alt="Deep Reinforcement Course"/>
<br>
<p>  Deep Reinforcement Learning Course is a free series of articles and video tutorials 🆕 about Deep Reinforcement Learning, where **we'll learn the main algorithms (Q-learning, Deep Q Nets, Dueling Deep Q Nets, Policy Gradients, A2C, Proximal Policy Gradients…), and how to implement them with Tensorflow.**
<br><br>
    
📜The articles explain the architectures from the big picture to the mathematical details behind them.
<br>
📹 The videos explain how to build the agents with Tensorflow.</p>
<br>
This course will give you a **solid foundation for understanding and implementing the future state of the art algorithms**. And, you'll build a strong professional portfolio by creating **agents that learn to play awesome environments**: Doom© 👹, Space invaders 👾, Outrun, Sonic the Hedgehog©, Michael Jackson’s Moonwalker, agents that will be able to navigate in 3D environments with DeepMindLab (Quake) and able to walk with Mujoco. 
<br><br>
</p> 

## 📚 The complete [Syllabus HERE](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)


## Any questions 👨‍💻
<p> If you have any questions, feel free to ask me: </p>
<p> 📧: <a href="mailto:hello@simoninithomas.com">hello@simoninithomas.com</a>  </p>
<p> Github: https://github.com/simoninithomas/Deep_reinforcement_learning_Course </p>
<p> 🌐 : https://simoninithomas.github.io/Deep_reinforcement_learning_Course/ </p>
<p> Twitter: <a href="https://twitter.com/ThomasSimonini">@ThomasSimonini</a> </p>
<p> Don't forget to <b> follow me on <a href="https://twitter.com/ThomasSimonini">twitter</a>, <a href="https://github.com/simoninithomas/Deep_reinforcement_learning_Course">github</a> and <a href="https://medium.com/@thomassimonini">Medium</a> to be alerted of the new articles that I publish </b></p>
    
## How to help  🙌
3 ways:
- **Clap our articles and like our videos a lot**: Clapping on Medium means that you really like our articles, and the more claps we have, the more our articles are shared. Liking our videos helps them become much more visible to the deep learning community.
- **Share and speak about our articles and videos**: By sharing our articles and videos you help us to spread the word. 
- **Improve our notebooks**: if you found a bug or **a better implementation** you can send a pull request.
<br>

## Important note 🤔
<b> You can run it on your computer, but it's better to run it on GPU-based services.</b> Personally, I use Microsoft Azure and their Deep Learning Virtual Machine (they offer $170):
https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.dsvm-deep-learning
<br>
⚠️ I don't have any business relations with them. I just loved their excellent customer service.

If you have trouble using Microsoft Azure, follow the explanations in this excellent article (skip the last part about fast.ai): https://medium.com/@manikantayadunanda/setting-up-deeplearning-machine-and-fast-ai-on-azure-a22eb6bd6429

## Prerequisites 🏗️
Before diving into the notebook **you need to understand**:
- The foundations of Reinforcement learning (MC, TD, Reward hypothesis...) [Article](https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419)
- Q-learning [Article](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe)
- Deep Q-Learning [Article](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8)
- In the [video version](https://www.youtube.com/watch?v=gCJyVX98KJ4)  we implemented a Deep Q-learning agent with Tensorflow that learns to play Atari Space Invaders 🕹️👾.

In [1]:
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/gCJyVX98KJ4?showinfo=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')



## Step 1: Import the libraries 📚

In [2]:
import tensorflow as tf      # Deep Learning library
import numpy as np           # Handle matrices
from vizdoom import *        # Doom Environment

import random                # Handling random number generation
import time                  # Handling time calculation
from skimage import transform# Help us to preprocess the frames

from collections import deque# Ordered collection with ends
import matplotlib.pyplot as plt # Display graphs

import warnings # Ignore the warning messages that skimage normally prints during training
warnings.filterwarnings('ignore') 

## Step 2: Create our environment 🎮
- Now that we imported the libraries/dependencies, we will create our environment.
- The Doom environment takes:
    - A `configuration file` that **handles all the options** (size of the frame, possible actions...)
    - A `scenario file` that **generates the correct scenario** (in our case basic, **but you're invited to try other scenarios**).
- Note: We have 3 possible actions `[[0,0,1], [1,0,0], [0,1,0]]`, which are already one-hot vectors, so we don't need to do one-hot encoding (thanks to <a href="https://stackoverflow.com/users/2237916/silgon">silgon</a> for figuring this out).

### Our environment
<img src="assets/doom.png" style="max-width:500px;" alt="Doom"/>
                                    
- A monster is spawned **randomly somewhere along the opposite wall**. 
- Player can only go **left/right and shoot**. 
- 1 hit is enough **to kill the monster**. 
- Episode finishes when **monster is killed or on timeout (300)**.
<br><br>
REWARDS:

- +101 for killing the monster 
- -5 for missing 
- Episode ends after killing the monster or on timeout.
- living reward = -1

In [3]:
"""
Here we create our environment
"""
def create_environment():
    game = DoomGame()
    
    # Load the correct configuration
    game.load_config("take_cover.cfg")
    
    # Load the correct scenario (this cell loads the take cover scenario)
    game.set_doom_scenario_path("take_cover.wad")
    
    game.init()
    
    # Here our possible actions
    left = [1, 0]
    right = [0, 1]
#     shoot = [0, 0, 1]
    possible_actions = [left, right]
    
    return game, possible_actions
       
"""
Here we perform random actions to test the environment
"""
def test_environment():
    game = DoomGame()
    game.load_config("basic.cfg")
    game.set_doom_scenario_path("basic.wad")
    game.init()
#     shoot = [0, 0, 1]
    left = [1, 0]
    right = [0, 1]
    actions = [left, right]

    episodes = 10
    for i in range(episodes):
        game.new_episode()
        while not game.is_episode_finished():
            state = game.get_state()
            img = state.screen_buffer
            misc = state.game_variables
            action = random.choice(actions)
            print(action)
            reward = game.make_action(action)
            print ("\treward:", reward)
            time.sleep(0.02)
        print ("Result:", game.get_total_reward())
        time.sleep(2)
    game.close()

In [4]:
game, possible_actions = create_environment()

## Step 3: Define the preprocessing functions ⚙️
### preprocess_frame
Preprocessing is an important step, <b>because we want to reduce the complexity of our states to reduce the computation time needed for training.</b>
<br><br>
Our steps:
- Grayscale each of our frames (because <b> color does not add important information </b>). But this is already done by the config file.
- Crop the screen (in our case we remove the roof because it contains no information)
- We normalize pixel values
- Finally we resize the preprocessed frame
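
The crop and normalization steps can be checked with a small sketch. It assumes a 120x160 grayscale frame with pixel values in 0..255 (the real resolution depends on the config file, which is not shown here) and uses plain Python lists as a stand-in for the NumPy array; the resize step is left out.

```python
# Assumption: a 120x160 grayscale frame with pixel values in 0..255
HEIGHT, WIDTH = 120, 160
frame = [[(row + col) % 256 for col in range(WIDTH)] for row in range(HEIGHT)]

# Crop: drop 30 rows at the top, 10 at the bottom and 30 columns on each side,
# the same slice as frame[30:-10, 30:-30] on a NumPy array
cropped = [row[30:-30] for row in frame[30:-10]]

# Normalize pixel values to the range [0, 1]
normalized = [[pixel / 255.0 for pixel in row] for row in cropped]

shape = (len(normalized), len(normalized[0]))
print(shape)
```

Under that assumption the cropped frame is 80x100, and the notebook then resizes it to 84x84 with `transform.resize`.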

In [5]:
"""
    preprocess_frame:
    Take a frame.
    Resize it.
        __________________
        |                 |
        |                 |
        |                 |
        |                 |
        |_________________|
        
        to
        _____________
        |            |
        |            |
        |            |
        |____________|
    Normalize it.
    
    return preprocessed_frame
    
    """
def preprocess_frame(frame):
    # Greyscale frame already done in our vizdoom config
    # x = np.mean(frame,-1)
    
    # Crop the screen (remove the roof because it contains no information)

#     plt.imshow(frame)
    cropped_frame = frame[30:-10,30:-30]
    
    # Normalize Pixel Values
#     plt.imshow(cropped_frame)
    normalized_frame = cropped_frame/255.0
#     plt.imshow(normalized_frame)
    
    # Resize
    preprocessed_frame = transform.resize(normalized_frame, [84,84])
#     plt.imshow(preprocessed_frame)

    
    return preprocessed_frame

### stack_frames
👏 This part was made possible thanks to the help of <a href="https://github.com/Miffyli">Anssi</a><br>

As explained in this really <a href="https://danieltakeshi.github.io/2016/11/25/frame-skipping-and-preprocessing-for-deep-q-networks-on-atari-2600-games/">  good article </a> we stack frames.

Stacking frames is really important because it helps us to **give a sense of motion to our Neural Network.**

- First we preprocess frame
- Then we append the frame to the deque that automatically **removes the oldest frame**
- Finally we **build the stacked state**

This is how the stack works:
- For the first frame of an episode, we feed the same frame 4 times
- At each timestep, **we add the new frame to deque and then we stack them to form a new stacked frame**
- And so on
<img src="https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/DQN/Space%20Invaders/assets/stack_frames.png" alt="stack">
- If we're done, **we create a new stack with 4 new frames (because we are in a new episode)**.
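
The stacking behaviour described above can be sketched with the standard library alone. This is a minimal illustration in which small integers stand in for preprocessed frames; `push_frame` is a hypothetical helper, not a function of this notebook.

```python
from collections import deque

STACK_SIZE = 4  # same stack size as in this notebook

def push_frame(frames, frame, is_new_episode):
    """Return the updated deque after observing `frame`.

    Integers stand in for images here; the real code stores 84x84 arrays.
    """
    if is_new_episode:
        # New episode: fill the whole stack with copies of the first frame
        frames = deque([frame] * STACK_SIZE, maxlen=STACK_SIZE)
    else:
        # Appending to a full deque drops the oldest frame automatically
        frames.append(frame)
    return frames

frames = deque(maxlen=STACK_SIZE)
frames = push_frame(frames, 0, True)    # first frame copied 4 times
frames = push_frame(frames, 1, False)   # oldest copy dropped
frames = push_frame(frames, 2, False)
print(list(frames))
```

In the notebook itself the frames in the deque are then combined with `np.stack(..., axis=2)` to form one 84x84x4 state.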

In [6]:
stack_size = 4 # We stack 4 frames

# Initialize deque with zero-images, one array for each image
# (np.int was removed from recent NumPy releases, so we use the builtin int)
stacked_frames = deque([np.zeros((84,84), dtype=int) for i in range(stack_size)], maxlen=stack_size)

def stack_frames(stacked_frames, state, is_new_episode):
    # Preprocess frame
    frame = preprocess_frame(state)
    
    if is_new_episode:
        # Clear our stacked_frames
        stacked_frames = deque([np.zeros((84,84), dtype=int) for i in range(stack_size)], maxlen=stack_size)
        
        # Because we're in a new episode, copy the same frame 4x
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        
        # Stack the frames
        stacked_state = np.stack(stacked_frames, axis=2)
        
    else:
        # Append frame to deque, automatically removes the oldest frame
        stacked_frames.append(frame)

        # Build the stacked state (the last dimension indexes the different frames)
        stacked_state = np.stack(stacked_frames, axis=2) 
    
    return stacked_state, stacked_frames

## Step 4: Set up our hyperparameters ⚗️
In this part we'll set up our different hyperparameters. But when you implement a Neural Network by yourself you will **not define all the hyperparameters at once, but progressively**.

- First, you begin by defining the neural networks hyperparameters when you implement the model.
- Then, you'll add the training hyperparameters when you implement the training algorithm.

In [7]:
### MODEL HYPERPARAMETERS
state_size = [84,84,4]      # Our input is a stack of 4 frames hence 84x84x4 (Width, height, channels) 
action_size = game.get_available_buttons_size()              # Number of available buttons (left and right in the environment created above)
learning_rate =  0.0002      # Alpha (aka learning rate)

### TRAINING HYPERPARAMETERS
total_episodes = 500        # Total episodes for training
max_steps = 1000              # Max possible steps in an episode
batch_size = 64             

# Exploration parameters for epsilon greedy strategy
explore_start = 1            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Q learning hyperparameters
gamma = 0.95               # Discounting rate

### MEMORY HYPERPARAMETERS
pretrain_length = batch_size   # Number of experiences stored in the Memory when initialized for the first time
memory_size = 1000000          # Number of experiences the Memory can keep

### MODIFY THIS TO FALSE IF YOU JUST WANT TO SEE THE TRAINED AGENT
training = True

## TURN THIS TO TRUE IF YOU WANT TO RENDER THE ENVIRONMENT
episode_render = False

## Step 5: Create our Deep Q-learning Neural Network model 🧠
<img src="assets/model.png" alt="Model" />
This is our Deep Q-learning model:
- We take a stack of 4 frames as input
- It passes through 3 convolutional layers
- Then it is flattened
- Finally it passes through 2 FC layers
- It outputs a Q value for each action
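
The layer shapes can be checked by hand. The sketch below applies the usual output-size formula for a convolution without padding ("VALID"), using the kernel sizes, strides and filter counts of the three layers defined in this notebook; `valid_conv_output` is a helper name introduced here for illustration.

```python
def valid_conv_output(size, kernel, stride):
    """Spatial output size of a convolution with VALID (no) padding."""
    return (size - kernel) // stride + 1

# (kernel size, stride, number of filters) for the three conv layers
layers = [(8, 4, 32), (4, 2, 64), (4, 2, 128)]

size = 84  # input frames are 84x84
shapes = []
for kernel, stride, filters in layers:
    size = valid_conv_output(size, kernel, stride)
    shapes.append((size, size, filters))

# Number of features after flattening the last conv output
flat = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes, flat)
```

This gives 20x20x32, 9x9x64 and 3x3x128, then 1152 features after flattening, which matches the shape comments in the model code.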

In [8]:
class DQNetwork:
    def __init__(self, state_size, action_size, learning_rate, name='DQNetwork'):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        
        with tf.variable_scope(name):
            # We create the placeholders
            # *state_size unpacks the elements of state_size, so this is the same as writing
            # [None, 84, 84, 4]
            self.inputs_ = tf.placeholder(tf.float32, [None, *state_size], name="inputs")
            self.actions_ = tf.placeholder(tf.float32, [None, action_size], name="actions_")
            
            # Remember that target_Q is R(s,a) + gamma * max_a' Qhat(s', a')
            self.target_Q = tf.placeholder(tf.float32, [None], name="target")
            
            """
            First convnet:
            CNN
            BatchNormalization
            ELU
            """
            # Input is 84x84x4
            self.conv1 = tf.layers.conv2d(inputs = self.inputs_,
                                         filters = 32,
                                         kernel_size = [8,8],
                                         strides = [4,4],
                                         padding = "VALID",
                                          kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),
                                         name = "conv1")
            
            self.conv1_batchnorm = tf.layers.batch_normalization(self.conv1,
                                                   training = True,
                                                   epsilon = 1e-5,
                                                     name = 'batch_norm1')
            
            self.conv1_out = tf.nn.elu(self.conv1_batchnorm, name="conv1_out")
            ## --> [20, 20, 32]
            
            
            """
            Second convnet:
            CNN
            BatchNormalization
            ELU
            """
            self.conv2 = tf.layers.conv2d(inputs = self.conv1_out,
                                 filters = 64,
                                 kernel_size = [4,4],
                                 strides = [2,2],
                                 padding = "VALID",
                                kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),
                                 name = "conv2")
        
            self.conv2_batchnorm = tf.layers.batch_normalization(self.conv2,
                                                   training = True,
                                                   epsilon = 1e-5,
                                                     name = 'batch_norm2')

            self.conv2_out = tf.nn.elu(self.conv2_batchnorm, name="conv2_out")
            ## --> [9, 9, 64]
            
            
            """
            Third convnet:
            CNN
            BatchNormalization
            ELU
            """
            self.conv3 = tf.layers.conv2d(inputs = self.conv2_out,
                                 filters = 128,
                                 kernel_size = [4,4],
                                 strides = [2,2],
                                 padding = "VALID",
                                kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),
                                 name = "conv3")
        
            self.conv3_batchnorm = tf.layers.batch_normalization(self.conv3,
                                                   training = True,
                                                   epsilon = 1e-5,
                                                     name = 'batch_norm3')

            self.conv3_out = tf.nn.elu(self.conv3_batchnorm, name="conv3_out")
            ## --> [3, 3, 128]
            
            
            self.flatten = tf.layers.flatten(self.conv3_out)
            ## --> [1152]
            
            
            self.fc = tf.layers.dense(inputs = self.flatten,
                                  units = 512,
                                  activation = tf.nn.elu,
                                       kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                name="fc1")
            
            
            self.output = tf.layers.dense(inputs = self.fc, 
                                           kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                          units = action_size, 
                                        activation=None)

  
            # Q is our predicted Q value.
            self.Q = tf.reduce_sum(tf.multiply(self.output, self.actions_), axis=1)
            
            
            # The loss is the difference between our predicted Q_values and the Q_target
            # Mean over the batch of (Qtarget - Q)^2
            self.loss = tf.reduce_mean(tf.square(self.target_Q - self.Q))
            
            self.optimizer = tf.train.RMSPropOptimizer(self.learning_rate).minimize(self.loss)

In [9]:
# Reset the graph
tf.reset_default_graph()

# Instantiate the DQNetwork
DQNetwork = DQNetwork(state_size, action_size, learning_rate)

## Step 6: Experience Replay 🔁
Now that we have created our Neural Network, **we need to implement the Experience Replay method.** <br><br>
Here we'll create the Memory object that wraps a deque. A deque (double-ended queue) created with a `maxlen` **removes the oldest element each time you add a new element once it is full.**

This part was taken from Udacity: <a href="https://github.com/udacity/deep-learning/blob/master/reinforcement/Q-learning-cart.ipynb">Cartpole DQN</a>
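
As a minimal sketch of the idea, the snippet below fills a tiny bounded deque and samples from it without replacement. The capacity of 3 and the placeholder experiences are assumptions chosen for illustration, and `random.sample` is used here in place of the `np.random.choice` call of the Memory class.

```python
import random
from collections import deque

# Tiny capacity so that eviction is visible (the notebook uses memory_size)
buffer = deque(maxlen=3)

for t in range(5):
    # Placeholder experience: (state, action, reward, next_state, done)
    experience = (t, [1, 0], 1.0, t + 1, False)
    buffer.append(experience)  # once full, the oldest experience is dropped

# Only the 3 most recent experiences remain
remaining_states = [experience[0] for experience in buffer]

# Sample a mini-batch of distinct experiences
batch = random.sample(list(buffer), 2)
print(remaining_states, len(batch))
```

Sampling at random instead of replaying the latest transitions in order reduces the correlation between consecutive training samples.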

In [10]:
class Memory():
    def __init__(self, max_size):
        self.buffer = deque(maxlen = max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
    
    def sample(self, batch_size):
        buffer_size = len(self.buffer)
        index = np.random.choice(np.arange(buffer_size),
                                size = batch_size,
                                replace = False)
        
        return [self.buffer[i] for i in index]

Here we'll **deal with the empty memory problem**: we pre-populate our memory by taking random actions and storing the experience (state, action, reward, next_state, done).

In [11]:
# Instantiate memory
memory = Memory(max_size = memory_size)

# Start a new episode
game.new_episode()

for i in range(pretrain_length):
    # If it's the first step
    if i == 0:
        # First we need a state
        state = game.get_state().screen_buffer
        state, stacked_frames = stack_frames(stacked_frames, state, True)
    
    # Random action
    action = random.choice(possible_actions)
    
    # Get the rewards
    reward = game.make_action(action)
    
    # Look if the episode is finished
    done = game.is_episode_finished()
    
    # If we're dead
    if done:
        # We finished the episode
        next_state = np.zeros(state.shape)
        
        # Add experience to memory
        memory.add((state, action, reward, next_state, done))
        
        # Start a new episode
        game.new_episode()
        
        # First we need a state
        state = game.get_state().screen_buffer
        
        # Stack the frames
        state, stacked_frames = stack_frames(stacked_frames, state, True)
        
    else:
        # Get the next state
        next_state = game.get_state().screen_buffer
        next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)
        
        # Add experience to memory
        memory.add((state, action, reward, next_state, done))
        
        # Our state is now the next_state
        state = next_state

## Step 7: Set up Tensorboard 📊
For more information about tensorboard, please watch this <a href="https://www.youtube.com/embed/eBbEDRsCmv4">excellent 30min tutorial</a> <br><br>
To launch tensorboard : `tensorboard --logdir=/tensorboard/dqn/1`

In [12]:
# Setup TensorBoard Writer
writer = tf.summary.FileWriter("/tensorboard/dqn/1")

# Losses
tf.summary.scalar("Loss", DQNetwork.loss)

write_op = tf.summary.merge_all()

## Step 8: Train our Agent 🏃‍♂️

Our algorithm:
<br>
* Initialize the weights
* Init the environment
* Initialize the decay step (used to reduce epsilon)
<br><br>
* **For** episode = 1 to max_episode **do** 
    * Make new episode
    * Set step to 0
    * Observe the first state $s_0$
    <br><br>
    * **While** step < max_steps **do**:
        * Increase decay_step
        * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t,a)$
        * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
        * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
        * Sample random mini-batch from $D$: $<s, a, r, s'>$
        * Set $\hat{Q} = r$ if the episode ends at $s'$, otherwise set $\hat{Q} = r + \gamma \max_{a'}{Q(s', a')}$
        * Make a gradient descent step with loss $(\hat{Q} - Q(s, a))^2$
    * **endwhile**
    <br><br>
* **endfor**
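
The target and loss steps of the loop can be sketched in plain Python on a two-sample batch. The Q-values below are made-up numbers, and `q_targets`, `selected_q` and `mse` are hypothetical helper names; in the notebook the Q-values come from the network and the loss is computed in TensorFlow.

```python
GAMMA = 0.95  # discounting rate, as in the hyperparameters

def q_targets(rewards, dones, next_q_values, gamma=GAMMA):
    """Target is r for terminal transitions, r + gamma * max_a' Q(s', a') otherwise."""
    targets = []
    for reward, done, q_next in zip(rewards, dones, next_q_values):
        targets.append(reward if done else reward + gamma * max(q_next))
    return targets

def selected_q(q_values, one_hot_actions):
    """Pick Q(s, a) for the action taken by masking with the one-hot action."""
    return [sum(q * a for q, a in zip(qs, acts))
            for qs, acts in zip(q_values, one_hot_actions)]

def mse(targets, predictions):
    """Mean squared error over the batch."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

# Made-up batch: the second transition is terminal
rewards = [1.0, 1.0]
dones = [False, True]
next_q = [[0.5, 2.0], [3.0, 4.0]]   # Q(s', .) for each sample
q_values = [[1.0, 2.0], [0.5, 0.0]] # Q(s, .) for each sample
actions = [[0, 1], [1, 0]]          # one-hot actions taken

targets = q_targets(rewards, dones, next_q)
predictions = selected_q(q_values, actions)
loss = mse(targets, predictions)
print(targets, predictions, loss)
```

The terminal sample keeps only its reward as target, so the bootstrapped value of a state that does not exist never enters the loss.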

    

In [13]:
"""
This function implements the epsilon-greedy step:
with probability ϵ select a random action a_t, otherwise select a_t = argmax_a Q(s_t, a)
"""
def predict_action(explore_start, explore_stop, decay_rate, decay_step, state, actions):
    ## EPSILON GREEDY STRATEGY
    # Choose action a from state s using epsilon greedy.
    ## First we randomize a number
    exp_exp_tradeoff = np.random.rand()

    # Here we'll use an improved version of our epsilon greedy strategy used in Q-learning notebook
    explore_probability = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * decay_step)
    
    if (explore_probability > exp_exp_tradeoff):
        # Make a random action (exploration)
        action = random.choice(actions)
        
    else:
        # Get action from Q-network (exploitation)
        # Estimate the Qs values state
        Qs = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: state.reshape((1, *state.shape))})
#         print('Estimated value of the current state', Qs)
        
        # Take the biggest Q value (= the best action)
        choice = np.argmax(Qs)
        action = actions[int(choice)]
                
    return action, explore_probability
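
To get a feel for how fast exploration fades, the decay formula used in `predict_action` can be evaluated on its own. This sketch reuses the exploration hyperparameters defined earlier as default arguments; `epsilon_at` is a name introduced here for illustration.

```python
import math

def epsilon_at(decay_step, explore_start=1.0, explore_stop=0.01, decay_rate=0.0001):
    """Exploration probability after `decay_step` environment steps."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * decay_step)

# Print the schedule at a few step counts
for step in (0, 1000, 10000, 100000):
    print(step, round(epsilon_at(step), 4))
```

The probability starts at `explore_start`, decreases monotonically and approaches `explore_stop` without ever going below it.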

In [14]:
from tqdm import tqdm
import logging
import sys
import time

logger = logging.getLogger('logger')
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler(sys.stdout)
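# datefmt is passed to time.strftime, which has no %f directive, so milliseconds come from %(msecs)
formatter = logging.Formatter('%(levelname)s: %(asctime)s.%(msecs)03d %(message)s', datefmt='%I:%M:%S')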
handler.setFormatter(formatter)
logger.addHandler(handler)

In [15]:
game, possible_actions = create_environment()

In [None]:
# Saver will help us to save our model
saver = tf.train.Saver()

if training == True:
    with tf.Session() as sess:
        # Initialize the variables
        sess.run(tf.global_variables_initializer())

        
        # Initialize the decay step (used to reduce epsilon)
        decay_step = 0

        # Init the game
        game.init()

        for episode in tqdm(range(total_episodes)):
            
            # Set step to 0
            step = 0
            
            # Initialize the rewards of the episode
            episode_rewards = []
            
            # Make a new episode and observe the first state
            game.new_episode()
            state = game.get_state().screen_buffer
            
            # Remember that stack frame function also call our preprocess function.
            state, stacked_frames = stack_frames(stacked_frames, state, True)
            
            while step < max_steps:
#             for i in tqdm(range(max_steps)):    
                step += 1
                
                # Increase decay_step
                decay_step +=1
                
                # Predict the action to take and take it
                action, explore_probability = predict_action(explore_start, explore_stop, decay_rate, decay_step, state, possible_actions)

                # Do the action
                reward = game.make_action(action)

                # Look if the episode is finished
                done = game.is_player_dead()
#                 print(done, game.get_game_variable(GameVariable.HEALTH))
                
                # Add the reward to total reward
                episode_rewards.append(reward)

                # If the game is finished
                if done:
                    # the episode ends so no next state
                    next_state = np.zeros((84,84), dtype=int)
                    next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)

                    # Set step = max_steps to end the episode
                    step = max_steps

                    # Get the total reward of the episode
                    total_reward = np.sum(episode_rewards)

                    print('Episode: {}'.format(episode),
                              'Total reward: {}'.format(total_reward),
                              'Training loss: {:.4f}'.format(loss),
                              'Explore P: {:.4f}'.format(explore_probability))

                    memory.add((state, action, reward, next_state, done))

                else:
                    # Get the next state
                    next_state = game.get_state().screen_buffer

                    # Stack the frame of the next_state
                    next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)

                    # Add experience to memory
                    memory.add((state, action, reward, next_state, done))
                    
                    # st+1 is now our current state
                    state = next_state


                ### LEARNING PART            
                # Obtain random mini-batch from memory
                                
                batch = memory.sample(batch_size)
                states_mb = np.array([each[0] for each in batch], ndmin=3)
                actions_mb = np.array([each[1] for each in batch])
                rewards_mb = np.array([each[2] for each in batch]) 
                next_states_mb = np.array([each[3] for each in batch], ndmin=3)
                dones_mb = np.array([each[4] for each in batch])
#                 print('retrieved {} memories from a batch'.format(len(states_mb)))
#                 print('actions taken', actions_mb)
#                 print('received rewards', rewards_mb)
#                 print('resulting {} states'.format(len(next_states_mb)))
#                 print('goal reached?', dones_mb)
                
#                 print('next states', next_states_mb)
                target_Qs_batch = []
                
                 # Get Q values for next_state 
#                 with tf.GradientTape() as tape:
#                     Qs_next_state = DQNetwork(next_states_mb, training=True)
                Qs_next_state = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: next_states_mb})
#                 print('Qs next state for each memory', Qs_next_state)
                
                # Set Q_target = r if the episode ends at s+1, otherwise set Q_target = r + gamma*maxQ(s', a')
                for i in range(0, len(batch)):
                    
                    terminal = dones_mb[i]

                    # If we are in a terminal state, only equals reward
                    if terminal:
                        target_Qs_batch.append(rewards_mb[i])
                        
                    else:
                        target = rewards_mb[i] + gamma * np.max(Qs_next_state[i])
                        target_Qs_batch.append(target)
                        

                targets_mb = np.array(target_Qs_batch)

                loss, _ = sess.run([DQNetwork.loss, DQNetwork.optimizer],
                                    feed_dict={DQNetwork.inputs_: states_mb,
                                               DQNetwork.target_Q: targets_mb,
                                               DQNetwork.actions_: actions_mb})
        
#                 print('loss', loss)

                # Write TF Summaries
                summary = sess.run(write_op, feed_dict={DQNetwork.inputs_: states_mb,
                                                   DQNetwork.target_Q: targets_mb,
                                                   DQNetwork.actions_: actions_mb})
                writer.add_summary(summary, episode)
                writer.flush()

            # Save model every 5 episodes
            if episode % 5 == 0:
#                 save_path = saver.save(sess, "./models/model.ckpt")
                print("Model Saved")

  0%|          | 1/500 [00:17<2:22:22, 17.12s/it]

Episode: 0 Total reward: 48.0 Training loss: 0.8946 Explore P: 0.9953
Model Saved


  0%|          | 2/500 [00:19<1:45:58, 12.77s/it]

Episode: 1 Total reward: 76.0 Training loss: 2.3946 Explore P: 0.9878


  1%|          | 3/500 [00:22<1:20:55,  9.77s/it]

Episode: 2 Total reward: 85.0 Training loss: 0.4913 Explore P: 0.9795


  1%|          | 4/500 [00:25<1:04:28,  7.80s/it]

Episode: 3 Total reward: 97.0 Training loss: 0.3406 Explore P: 0.9702


  1%|          | 5/500 [00:27<49:53,  6.05s/it]  

Episode: 4 Total reward: 55.0 Training loss: 0.6451 Explore P: 0.9649


  1%|          | 6/500 [00:32<45:33,  5.53s/it]

Episode: 5 Total reward: 131.0 Training loss: 0.6812 Explore P: 0.9525
Model Saved


  1%|▏         | 7/500 [00:35<40:10,  4.89s/it]

Episode: 6 Total reward: 101.0 Training loss: 0.2469 Explore P: 0.9430


  2%|▏         | 8/500 [00:36<31:58,  3.90s/it]

Episode: 7 Total reward: 48.0 Training loss: 0.2255 Explore P: 0.9385


  2%|▏         | 9/500 [00:39<28:44,  3.51s/it]

Episode: 8 Total reward: 78.0 Training loss: 0.2303 Explore P: 0.9313


  2%|▏         | 10/500 [00:42<26:11,  3.21s/it]

Episode: 9 Total reward: 75.0 Training loss: 0.2957 Explore P: 0.9244


  2%|▏         | 11/500 [00:43<22:26,  2.75s/it]

Episode: 10 Total reward: 51.0 Training loss: 0.2352 Explore P: 0.9198
Model Saved


  2%|▏         | 12/500 [00:45<19:34,  2.41s/it]

Episode: 11 Total reward: 48.0 Training loss: 0.2514 Explore P: 0.9154


  3%|▎         | 13/500 [00:47<18:08,  2.23s/it]

Episode: 12 Total reward: 55.0 Training loss: 0.1963 Explore P: 0.9105


  3%|▎         | 14/500 [00:52<25:23,  3.13s/it]

Episode: 13 Total reward: 157.0 Training loss: 0.1588 Explore P: 0.8964


  3%|▎         | 15/500 [00:54<21:51,  2.70s/it]

Episode: 14 Total reward: 51.0 Training loss: 0.1444 Explore P: 0.8919


  3%|▎         | 16/500 [00:56<20:50,  2.58s/it]

Episode: 15 Total reward: 69.0 Training loss: 0.1581 Explore P: 0.8859
Model Saved


  3%|▎         | 17/500 [00:58<18:34,  2.31s/it]

Episode: 16 Total reward: 50.0 Training loss: 0.1830 Explore P: 0.8815


  4%|▎         | 18/500 [00:59<16:59,  2.12s/it]

Episode: 17 Total reward: 50.0 Training loss: 0.1458 Explore P: 0.8771


  4%|▍         | 19/500 [01:04<22:54,  2.86s/it]

Episode: 18 Total reward: 138.0 Training loss: 0.9061 Explore P: 0.8653


  4%|▍         | 20/500 [01:06<20:57,  2.62s/it]

Episode: 19 Total reward: 62.0 Training loss: 0.1735 Explore P: 0.8600


  4%|▍         | 21/500 [01:10<24:51,  3.11s/it]

Episode: 20 Total reward: 128.0 Training loss: 0.4197 Explore P: 0.8492
Model Saved


  4%|▍         | 22/500 [01:16<30:05,  3.78s/it]

Episode: 21 Total reward: 160.0 Training loss: 0.3079 Explore P: 0.8358


  5%|▍         | 23/500 [01:17<25:24,  3.20s/it]

Episode: 22 Total reward: 55.0 Training loss: 0.3784 Explore P: 0.8313


  5%|▍         | 24/500 [01:19<22:40,  2.86s/it]

Episode: 23 Total reward: 62.0 Training loss: 0.2368 Explore P: 0.8262


  5%|▌         | 25/500 [01:21<20:04,  2.53s/it]

Episode: 24 Total reward: 53.0 Training loss: 0.1657 Explore P: 0.8219


  5%|▌         | 26/500 [01:24<19:41,  2.49s/it]

Episode: 25 Total reward: 72.0 Training loss: 0.2529 Explore P: 0.8161
Model Saved


  5%|▌         | 27/500 [01:26<19:27,  2.47s/it]

Episode: 26 Total reward: 72.0 Training loss: 0.1855 Explore P: 0.8103


  6%|▌         | 28/500 [01:28<18:20,  2.33s/it]

Episode: 27 Total reward: 60.0 Training loss: 0.1342 Explore P: 0.8055


  6%|▌         | 29/500 [01:31<19:33,  2.49s/it]

Episode: 28 Total reward: 85.0 Training loss: 0.4110 Explore P: 0.7988


  6%|▌         | 30/500 [01:36<26:33,  3.39s/it]

Episode: 29 Total reward: 165.0 Training loss: 0.1950 Explore P: 0.7859


  6%|▌         | 31/500 [01:38<22:21,  2.86s/it]

Episode: 30 Total reward: 48.0 Training loss: 0.5531 Explore P: 0.7822
Model Saved


  6%|▋         | 32/500 [01:41<21:47,  2.79s/it]

Episode: 31 Total reward: 79.0 Training loss: 0.2046 Explore P: 0.7761


  7%|▋         | 33/500 [01:44<23:11,  2.98s/it]

Episode: 32 Total reward: 102.0 Training loss: 0.2492 Explore P: 0.7683


  7%|▋         | 34/500 [01:49<26:43,  3.44s/it]

Episode: 33 Total reward: 135.0 Training loss: 0.1542 Explore P: 0.7582


  7%|▋         | 35/500 [01:51<24:57,  3.22s/it]

Episode: 34 Total reward: 81.0 Training loss: 0.1558 Explore P: 0.7521


  7%|▋         | 36/500 [01:53<22:02,  2.85s/it]

Episode: 35 Total reward: 59.0 Training loss: 0.2926 Explore P: 0.7477
Model Saved


  7%|▋         | 37/500 [01:56<21:41,  2.81s/it]

Episode: 36 Total reward: 81.0 Training loss: 0.1516 Explore P: 0.7418


  8%|▊         | 38/500 [01:59<22:05,  2.87s/it]

Episode: 37 Total reward: 90.0 Training loss: 0.3363 Explore P: 0.7352


  8%|▊         | 39/500 [02:05<29:07,  3.79s/it]

Episode: 38 Total reward: 178.0 Training loss: 0.4186 Explore P: 0.7224


  8%|▊         | 40/500 [02:09<29:43,  3.88s/it]

Episode: 39 Total reward: 121.0 Training loss: 0.5228 Explore P: 0.7139


  8%|▊         | 41/500 [02:12<27:25,  3.58s/it]

Episode: 40 Total reward: 87.0 Training loss: 0.5566 Explore P: 0.7078
Model Saved


  8%|▊         | 42/500 [02:14<23:06,  3.03s/it]

Episode: 41 Total reward: 51.0 Training loss: 0.3667 Explore P: 0.7042


  9%|▊         | 43/500 [02:17<24:00,  3.15s/it]

Episode: 42 Total reward: 99.0 Training loss: 0.2023 Explore P: 0.6974


  9%|▉         | 44/500 [02:23<29:53,  3.93s/it]

Episode: 43 Total reward: 174.0 Training loss: 0.2578 Explore P: 0.6855


  9%|▉         | 45/500 [02:26<27:19,  3.60s/it]

Episode: 44 Total reward: 85.0 Training loss: 0.1377 Explore P: 0.6798


  9%|▉         | 46/500 [02:28<25:08,  3.32s/it]

Episode: 45 Total reward: 80.0 Training loss: 0.2995 Explore P: 0.6745
Model Saved


  9%|▉         | 47/500 [02:31<24:17,  3.22s/it]

Episode: 46 Total reward: 89.0 Training loss: 0.1736 Explore P: 0.6686


 10%|▉         | 48/500 [02:36<26:50,  3.56s/it]

Episode: 47 Total reward: 128.0 Training loss: 0.3004 Explore P: 0.6602


 10%|▉         | 49/500 [02:39<25:55,  3.45s/it]

Episode: 48 Total reward: 94.0 Training loss: 0.4979 Explore P: 0.6541


 10%|█         | 50/500 [02:41<23:30,  3.13s/it]

Episode: 49 Total reward: 72.0 Training loss: 0.4630 Explore P: 0.6495


 10%|█         | 51/500 [02:46<26:02,  3.48s/it]

Episode: 50 Total reward: 129.0 Training loss: 0.2017 Explore P: 0.6413
Model Saved


 10%|█         | 52/500 [02:47<21:51,  2.93s/it]

Episode: 51 Total reward: 49.0 Training loss: 0.2611 Explore P: 0.6382


 11%|█         | 53/500 [02:50<20:54,  2.81s/it]

Episode: 52 Total reward: 76.0 Training loss: 0.1301 Explore P: 0.6335


 11%|█         | 54/500 [02:54<23:37,  3.18s/it]

Episode: 53 Total reward: 122.0 Training loss: 0.1326 Explore P: 0.6259


 11%|█         | 55/500 [02:58<26:23,  3.56s/it]

Episode: 54 Total reward: 133.0 Training loss: 0.9700 Explore P: 0.6178


 11%|█         | 56/500 [03:00<23:01,  3.11s/it]

Episode: 55 Total reward: 62.0 Training loss: 0.2103 Explore P: 0.6140
Model Saved


 11%|█▏        | 57/500 [03:02<20:27,  2.77s/it]

Episode: 56 Total reward: 59.0 Training loss: 0.1272 Explore P: 0.6105


 12%|█▏        | 58/500 [03:12<36:13,  4.92s/it]

Episode: 57 Total reward: 298.0 Training loss: 0.1964 Explore P: 0.5928


 12%|█▏        | 59/500 [03:14<28:59,  3.94s/it]

Episode: 58 Total reward: 50.0 Training loss: 0.2094 Explore P: 0.5899


 12%|█▏        | 60/500 [03:18<29:10,  3.98s/it]

Episode: 59 Total reward: 121.0 Training loss: 0.5194 Explore P: 0.5830


 12%|█▏        | 61/500 [03:20<24:33,  3.36s/it]

Episode: 60 Total reward: 57.0 Training loss: 0.2982 Explore P: 0.5797
Model Saved


 12%|█▏        | 62/500 [03:22<21:58,  3.01s/it]

Episode: 61 Total reward: 66.0 Training loss: 0.1551 Explore P: 0.5759


 13%|█▎        | 63/500 [03:26<23:50,  3.27s/it]

Episode: 62 Total reward: 117.0 Training loss: 0.1440 Explore P: 0.5694


 13%|█▎        | 64/500 [03:30<24:50,  3.42s/it]

Episode: 63 Total reward: 112.0 Training loss: 0.1930 Explore P: 0.5631


 13%|█▎        | 65/500 [03:35<29:35,  4.08s/it]

Episode: 64 Total reward: 168.0 Training loss: 0.1700 Explore P: 0.5539


 13%|█▎        | 66/500 [03:42<35:22,  4.89s/it]

Episode: 65 Total reward: 204.0 Training loss: 0.1539 Explore P: 0.5429
Model Saved


 13%|█▎        | 67/500 [03:50<40:58,  5.68s/it]

Episode: 66 Total reward: 232.0 Training loss: 0.1454 Explore P: 0.5307


 14%|█▎        | 68/500 [03:54<38:50,  5.40s/it]

Episode: 67 Total reward: 143.0 Training loss: 0.1879 Explore P: 0.5233


 14%|█▍        | 69/500 [03:56<31:45,  4.42s/it]

Episode: 68 Total reward: 65.0 Training loss: 0.1951 Explore P: 0.5200


 14%|█▍        | 70/500 [04:01<32:31,  4.54s/it]

Episode: 69 Total reward: 145.0 Training loss: 0.1178 Explore P: 0.5127


 14%|█▍        | 71/500 [04:03<27:13,  3.81s/it]

Episode: 70 Total reward: 63.0 Training loss: 0.1332 Explore P: 0.5095
Model Saved


 14%|█▍        | 72/500 [04:08<29:24,  4.12s/it]

Episode: 71 Total reward: 147.0 Training loss: 0.2533 Explore P: 0.5022


 15%|█▍        | 73/500 [04:12<28:16,  3.97s/it]

Episode: 72 Total reward: 110.0 Training loss: 0.1121 Explore P: 0.4968


 15%|█▍        | 74/500 [04:14<24:24,  3.44s/it]

Episode: 73 Total reward: 66.0 Training loss: 0.1010 Explore P: 0.4936


 15%|█▌        | 75/500 [04:17<23:02,  3.25s/it]

Episode: 74 Total reward: 85.0 Training loss: 0.1078 Explore P: 0.4895


 15%|█▌        | 76/500 [04:22<26:01,  3.68s/it]

Episode: 75 Total reward: 141.0 Training loss: 0.0692 Explore P: 0.4828
Model Saved


 15%|█▌        | 77/500 [04:24<23:55,  3.39s/it]

Episode: 76 Total reward: 82.0 Training loss: 0.1211 Explore P: 0.4790


 16%|█▌        | 78/500 [04:30<28:50,  4.10s/it]

Episode: 77 Total reward: 171.0 Training loss: 0.1555 Explore P: 0.4710


 16%|█▌        | 79/500 [04:37<34:16,  4.89s/it]

Episode: 78 Total reward: 201.0 Training loss: 0.3584 Explore P: 0.4618


 16%|█▌        | 80/500 [04:43<36:24,  5.20s/it]

Episode: 79 Total reward: 177.0 Training loss: 0.1805 Explore P: 0.4539


 16%|█▌        | 81/500 [04:46<31:20,  4.49s/it]

Episode: 80 Total reward: 85.0 Training loss: 0.1574 Explore P: 0.4501
Model Saved


 16%|█▋        | 82/500 [04:47<25:36,  3.68s/it]

Episode: 81 Total reward: 53.0 Training loss: 0.2944 Explore P: 0.4478


 17%|█▋        | 83/500 [04:50<22:31,  3.24s/it]

Episode: 82 Total reward: 64.0 Training loss: 0.2211 Explore P: 0.4450


 17%|█▋        | 84/500 [04:52<20:51,  3.01s/it]

Episode: 83 Total reward: 72.0 Training loss: 0.2094 Explore P: 0.4419


 17%|█▋        | 85/500 [04:55<21:12,  3.07s/it]

Episode: 84 Total reward: 96.0 Training loss: 0.1292 Explore P: 0.4378


 17%|█▋        | 86/500 [05:02<28:51,  4.18s/it]

Episode: 85 Total reward: 203.0 Training loss: 0.2438 Explore P: 0.4292
Model Saved


 17%|█▋        | 87/500 [05:06<28:35,  4.15s/it]

Episode: 86 Total reward: 122.0 Training loss: 0.1564 Explore P: 0.4241


 18%|█▊        | 88/500 [05:11<30:31,  4.45s/it]

Episode: 87 Total reward: 153.0 Training loss: 0.1925 Explore P: 0.4178


 18%|█▊        | 89/500 [05:16<31:46,  4.64s/it]

Episode: 88 Total reward: 152.0 Training loss: 0.0901 Explore P: 0.4117


 18%|█▊        | 90/500 [05:23<35:47,  5.24s/it]

Episode: 89 Total reward: 198.0 Training loss: 0.1191 Explore P: 0.4038


 18%|█▊        | 91/500 [05:26<30:48,  4.52s/it]

Episode: 90 Total reward: 84.0 Training loss: 0.1630 Explore P: 0.4005
Model Saved


 18%|█▊        | 92/500 [05:29<27:58,  4.11s/it]

Episode: 91 Total reward: 94.0 Training loss: 0.1652 Explore P: 0.3968


 19%|█▊        | 93/500 [05:31<23:32,  3.47s/it]

Episode: 92 Total reward: 58.0 Training loss: 0.5338 Explore P: 0.3946


 19%|█▉        | 94/500 [05:34<22:48,  3.37s/it]

Episode: 93 Total reward: 93.0 Training loss: 0.1716 Explore P: 0.3910


 19%|█▉        | 95/500 [05:40<27:44,  4.11s/it]

Episode: 94 Total reward: 173.0 Training loss: 0.2088 Explore P: 0.3845


 19%|█▉        | 96/500 [05:44<26:49,  3.98s/it]

Episode: 95 Total reward: 110.0 Training loss: 0.2740 Explore P: 0.3804
Model Saved


 19%|█▉        | 97/500 [05:46<23:50,  3.55s/it]

Episode: 96 Total reward: 75.0 Training loss: 0.2326 Explore P: 0.3776


 20%|█▉        | 98/500 [05:51<27:17,  4.07s/it]

Episode: 97 Total reward: 157.0 Training loss: 0.0898 Explore P: 0.3719


 20%|█▉        | 99/500 [05:55<25:45,  3.85s/it]

Episode: 98 Total reward: 99.0 Training loss: 0.0995 Explore P: 0.3683


 20%|██        | 100/500 [05:59<27:06,  4.07s/it]

Episode: 99 Total reward: 135.0 Training loss: 0.1537 Explore P: 0.3635


 20%|██        | 101/500 [06:02<25:10,  3.79s/it]

Episode: 100 Total reward: 93.0 Training loss: 0.1476 Explore P: 0.3603
Model Saved


 20%|██        | 102/500 [06:06<24:14,  3.65s/it]

Episode: 101 Total reward: 99.0 Training loss: 4.9547 Explore P: 0.3568


 21%|██        | 103/500 [06:11<26:56,  4.07s/it]

Episode: 102 Total reward: 150.0 Training loss: 0.3197 Explore P: 0.3517


 21%|██        | 104/500 [06:16<29:52,  4.53s/it]

Episode: 103 Total reward: 167.0 Training loss: 0.0986 Explore P: 0.3460


 21%|██        | 105/500 [06:23<33:13,  5.05s/it]

Episode: 104 Total reward: 186.0 Training loss: 0.1351 Explore P: 0.3398


 21%|██        | 106/500 [06:25<28:39,  4.37s/it]

Episode: 105 Total reward: 83.0 Training loss: 0.3793 Explore P: 0.3371
Model Saved


 21%|██▏       | 107/500 [06:33<34:45,  5.31s/it]

Episode: 106 Total reward: 223.0 Training loss: 0.8093 Explore P: 0.3299


 22%|██▏       | 108/500 [06:38<33:39,  5.15s/it]

Episode: 107 Total reward: 143.0 Training loss: 0.2544 Explore P: 0.3253


 22%|██▏       | 109/500 [06:41<29:16,  4.49s/it]

Episode: 108 Total reward: 87.0 Training loss: 0.1022 Explore P: 0.3226


 22%|██▏       | 110/500 [06:45<29:26,  4.53s/it]

Episode: 109 Total reward: 136.0 Training loss: 0.1833 Explore P: 0.3184


 22%|██▏       | 111/500 [06:51<31:10,  4.81s/it]

Episode: 110 Total reward: 162.0 Training loss: 0.2149 Explore P: 0.3134
Model Saved


 22%|██▏       | 112/500 [06:56<31:25,  4.86s/it]

Episode: 111 Total reward: 147.0 Training loss: 0.1374 Explore P: 0.3090


 23%|██▎       | 113/500 [06:57<25:19,  3.93s/it]

Episode: 112 Total reward: 52.0 Training loss: 0.1808 Explore P: 0.3074


 23%|██▎       | 114/500 [06:59<21:32,  3.35s/it]

Episode: 113 Total reward: 59.0 Training loss: 0.1552 Explore P: 0.3057


 23%|██▎       | 115/500 [07:02<20:41,  3.22s/it]

Episode: 114 Total reward: 82.0 Training loss: 0.1146 Explore P: 0.3033


 23%|██▎       | 116/500 [07:07<23:35,  3.69s/it]

Episode: 115 Total reward: 142.0 Training loss: 0.3034 Explore P: 0.2991
Model Saved


 23%|██▎       | 117/500 [07:09<20:38,  3.23s/it]

Episode: 116 Total reward: 64.0 Training loss: 0.0920 Explore P: 0.2973


 24%|██▎       | 118/500 [07:16<27:49,  4.37s/it]

Episode: 117 Total reward: 207.0 Training loss: 0.1933 Explore P: 0.2914


 24%|██▍       | 119/500 [07:22<29:33,  4.65s/it]

Episode: 118 Total reward: 153.0 Training loss: 0.6441 Explore P: 0.2871


 24%|██▍       | 120/500 [07:28<33:16,  5.25s/it]

Episode: 119 Total reward: 194.0 Training loss: 0.1380 Explore P: 0.2818


 24%|██▍       | 121/500 [07:31<27:22,  4.33s/it]

Episode: 120 Total reward: 64.0 Training loss: 2.1939 Explore P: 0.2801
Model Saved


 24%|██▍       | 122/500 [07:37<30:57,  4.91s/it]

Episode: 121 Total reward: 185.0 Training loss: 0.1967 Explore P: 0.2751


 25%|██▍       | 123/500 [07:42<30:50,  4.91s/it]

Episode: 122 Total reward: 143.0 Training loss: 0.4854 Explore P: 0.2714


 25%|██▍       | 124/500 [07:48<32:59,  5.27s/it]

Episode: 123 Total reward: 177.0 Training loss: 0.0852 Explore P: 0.2668


 25%|██▌       | 125/500 [07:53<32:29,  5.20s/it]

Episode: 124 Total reward: 148.0 Training loss: 0.1486 Explore P: 0.2630


 25%|██▌       | 126/500 [07:55<26:45,  4.29s/it]

Episode: 125 Total reward: 64.0 Training loss: 0.1196 Explore P: 0.2614
Model Saved


 25%|██▌       | 127/500 [08:01<29:56,  4.82s/it]

Episode: 126 Total reward: 178.0 Training loss: 0.0711 Explore P: 0.2570


 26%|██▌       | 128/500 [08:05<28:24,  4.58s/it]

Episode: 127 Total reward: 119.0 Training loss: 0.1729 Explore P: 0.2540


 26%|██▌       | 129/500 [08:10<28:54,  4.68s/it]

Episode: 128 Total reward: 144.0 Training loss: 0.2862 Explore P: 0.2505


 26%|██▌       | 130/500 [08:17<32:53,  5.33s/it]

Episode: 129 Total reward: 202.0 Training loss: 0.1506 Explore P: 0.2457


 26%|██▌       | 131/500 [08:21<30:06,  4.89s/it]

Episode: 130 Total reward: 114.0 Training loss: 0.1092 Explore P: 0.2431
Model Saved


 26%|██▋       | 132/500 [08:23<25:08,  4.10s/it]

Episode: 131 Total reward: 66.0 Training loss: 0.1263 Explore P: 0.2415


 27%|██▋       | 133/500 [08:26<22:52,  3.74s/it]

Episode: 132 Total reward: 85.0 Training loss: 0.3821 Explore P: 0.2396


 27%|██▋       | 134/500 [08:30<23:15,  3.81s/it]

Episode: 133 Total reward: 117.0 Training loss: 0.2413 Explore P: 0.2369


 27%|██▋       | 135/500 [08:32<20:58,  3.45s/it]

Episode: 134 Total reward: 76.0 Training loss: 0.1864 Explore P: 0.2352


 27%|██▋       | 136/500 [08:39<26:47,  4.42s/it]

Episode: 135 Total reward: 196.0 Training loss: 0.5921 Explore P: 0.2308
Model Saved


 27%|██▋       | 137/500 [08:45<30:00,  4.96s/it]

Episode: 136 Total reward: 187.0 Training loss: 0.5490 Explore P: 0.2267


 28%|██▊       | 138/500 [08:50<29:47,  4.94s/it]

Episode: 137 Total reward: 148.0 Training loss: 0.3131 Explore P: 0.2235


 28%|██▊       | 139/500 [08:54<28:13,  4.69s/it]

Episode: 138 Total reward: 121.0 Training loss: 0.2395 Explore P: 0.2210


 28%|██▊       | 140/500 [09:00<29:44,  4.96s/it]

Episode: 139 Total reward: 166.0 Training loss: 0.3795 Explore P: 0.2175


 28%|██▊       | 141/500 [09:03<25:53,  4.33s/it]

Episode: 140 Total reward: 86.0 Training loss: 0.1914 Explore P: 0.2157
Model Saved


 28%|██▊       | 142/500 [09:08<26:40,  4.47s/it]

Episode: 141 Total reward: 141.0 Training loss: 0.0794 Explore P: 0.2128


 29%|██▊       | 143/500 [09:10<22:12,  3.73s/it]

Episode: 142 Total reward: 59.0 Training loss: 0.2925 Explore P: 0.2116


 29%|██▉       | 144/500 [09:12<20:08,  3.39s/it]

Episode: 143 Total reward: 77.0 Training loss: 0.4419 Explore P: 0.2101


 29%|██▉       | 145/500 [09:14<16:52,  2.85s/it]

Episode: 144 Total reward: 47.0 Training loss: 0.1853 Explore P: 0.2092


 29%|██▉       | 146/500 [09:20<22:49,  3.87s/it]

Episode: 145 Total reward: 186.0 Training loss: 0.1529 Explore P: 0.2055
Model Saved


 29%|██▉       | 147/500 [09:25<25:05,  4.27s/it]

Episode: 146 Total reward: 157.0 Training loss: 0.8745 Explore P: 0.2024


 30%|██▉       | 148/500 [09:28<22:56,  3.91s/it]

Episode: 147 Total reward: 90.0 Training loss: 0.1650 Explore P: 0.2007


 30%|██▉       | 149/500 [09:31<21:13,  3.63s/it]

Episode: 148 Total reward: 86.0 Training loss: 0.1457 Explore P: 0.1991


 30%|███       | 150/500 [09:36<22:51,  3.92s/it]

Episode: 149 Total reward: 136.0 Training loss: 0.3259 Explore P: 0.1965


 30%|███       | 151/500 [09:38<20:19,  3.49s/it]

Episode: 150 Total reward: 74.0 Training loss: 0.4412 Explore P: 0.1952
Model Saved


 30%|███       | 152/500 [09:43<22:46,  3.93s/it]

Episode: 151 Total reward: 145.0 Training loss: 0.1399 Explore P: 0.1925


 31%|███       | 153/500 [09:47<21:34,  3.73s/it]

Episode: 152 Total reward: 96.0 Training loss: 0.0751 Explore P: 0.1907


 31%|███       | 154/500 [09:50<20:22,  3.53s/it]

Episode: 153 Total reward: 89.0 Training loss: 0.3017 Explore P: 0.1891


 31%|███       | 155/500 [09:56<25:42,  4.47s/it]

Episode: 154 Total reward: 195.0 Training loss: 0.7983 Explore P: 0.1857


 31%|███       | 156/500 [09:58<21:37,  3.77s/it]

Episode: 155 Total reward: 62.0 Training loss: 0.2753 Explore P: 0.1846
Model Saved


 31%|███▏      | 157/500 [10:06<27:10,  4.75s/it]

Episode: 156 Total reward: 209.0 Training loss: 0.2836 Explore P: 0.1810


 32%|███▏      | 158/500 [10:10<25:56,  4.55s/it]

Episode: 157 Total reward: 120.0 Training loss: 0.2902 Explore P: 0.1790


 32%|███▏      | 159/500 [10:16<28:41,  5.05s/it]

Episode: 158 Total reward: 184.0 Training loss: 2.0011 Explore P: 0.1759


 32%|███▏      | 160/500 [10:20<27:21,  4.83s/it]

Episode: 159 Total reward: 129.0 Training loss: 0.2832 Explore P: 0.1737


 32%|███▏      | 161/500 [10:27<30:59,  5.48s/it]

Episode: 160 Total reward: 206.0 Training loss: 0.1578 Explore P: 0.1704
Model Saved


 32%|███▏      | 162/500 [10:31<27:50,  4.94s/it]

Episode: 161 Total reward: 107.0 Training loss: 0.1405 Explore P: 0.1687


 33%|███▎      | 163/500 [10:36<28:25,  5.06s/it]

Episode: 162 Total reward: 158.0 Training loss: 0.1781 Explore P: 0.1662


 33%|███▎      | 164/500 [10:44<32:22,  5.78s/it]

Episode: 163 Total reward: 224.0 Training loss: 0.1608 Explore P: 0.1628


 33%|███▎      | 165/500 [10:51<34:15,  6.14s/it]

Episode: 164 Total reward: 202.0 Training loss: 0.2084 Explore P: 0.1597


 33%|███▎      | 166/500 [10:56<33:37,  6.04s/it]

Episode: 165 Total reward: 170.0 Training loss: 0.1781 Explore P: 0.1572
Model Saved


 33%|███▎      | 167/500 [11:00<28:45,  5.18s/it]

Episode: 166 Total reward: 93.0 Training loss: 0.1615 Explore P: 0.1558


 34%|███▎      | 168/500 [11:04<27:27,  4.96s/it]

Episode: 167 Total reward: 130.0 Training loss: 0.0972 Explore P: 0.1539


 34%|███▍      | 169/500 [11:07<24:27,  4.43s/it]

Episode: 168 Total reward: 94.0 Training loss: 0.1956 Explore P: 0.1526


 34%|███▍      | 170/500 [11:13<26:27,  4.81s/it]

Episode: 169 Total reward: 160.0 Training loss: 0.1708 Explore P: 0.1503


 34%|███▍      | 171/500 [11:20<29:25,  5.37s/it]

Episode: 170 Total reward: 188.0 Training loss: 0.1410 Explore P: 0.1477
Model Saved


 34%|███▍      | 172/500 [11:21<23:35,  4.32s/it]

Episode: 171 Total reward: 52.0 Training loss: 0.5594 Explore P: 0.1470


 35%|███▍      | 173/500 [11:29<28:17,  5.19s/it]

Episode: 172 Total reward: 211.0 Training loss: 0.0817 Explore P: 0.1441


 35%|███▍      | 174/500 [11:30<22:26,  4.13s/it]

Episode: 173 Total reward: 48.0 Training loss: 0.0983 Explore P: 0.1435


 35%|███▌      | 175/500 [11:37<27:00,  4.98s/it]

Episode: 174 Total reward: 202.0 Training loss: 0.5943 Explore P: 0.1408


 35%|███▌      | 176/500 [11:39<22:19,  4.13s/it]

Episode: 175 Total reward: 62.0 Training loss: 0.4293 Explore P: 0.1400
Model Saved


 35%|███▌      | 177/500 [11:41<18:38,  3.46s/it]

Episode: 176 Total reward: 55.0 Training loss: 0.1571 Explore P: 0.1393


 36%|███▌      | 178/500 [11:43<16:03,  2.99s/it]

Episode: 177 Total reward: 55.0 Training loss: 0.0447 Explore P: 0.1386


 36%|███▌      | 179/500 [11:48<18:28,  3.45s/it]

Episode: 178 Total reward: 132.0 Training loss: 0.7158 Explore P: 0.1369


 36%|███▌      | 180/500 [11:57<26:56,  5.05s/it]

Episode: 179 Total reward: 251.0 Training loss: 0.1043 Explore P: 0.1338


 36%|███▌      | 181/500 [12:02<28:10,  5.30s/it]

Episode: 180 Total reward: 171.0 Training loss: 0.1747 Explore P: 0.1317
Model Saved


 36%|███▋      | 182/500 [12:07<27:18,  5.15s/it]

Episode: 181 Total reward: 141.0 Training loss: 0.2726 Explore P: 0.1300


 37%|███▋      | 183/500 [12:12<27:04,  5.12s/it]

Episode: 182 Total reward: 148.0 Training loss: 0.1590 Explore P: 0.1282


 37%|███▋      | 184/500 [12:18<28:40,  5.44s/it]

Episode: 183 Total reward: 181.0 Training loss: 0.1831 Explore P: 0.1261


 37%|███▋      | 185/500 [12:22<25:57,  4.94s/it]

Episode: 184 Total reward: 110.0 Training loss: 0.1392 Explore P: 0.1248


 37%|███▋      | 186/500 [12:27<24:57,  4.77s/it]

Episode: 185 Total reward: 127.0 Training loss: 0.2263 Explore P: 0.1234
Model Saved


 37%|███▋      | 187/500 [12:34<28:23,  5.44s/it]

Episode: 186 Total reward: 199.0 Training loss: 0.0977 Explore P: 0.1211


 38%|███▊      | 188/500 [12:37<24:16,  4.67s/it]

Episode: 187 Total reward: 81.0 Training loss: 0.2649 Explore P: 0.1202


 38%|███▊      | 189/500 [12:41<23:11,  4.47s/it]

Episode: 188 Total reward: 117.0 Training loss: 0.2170 Explore P: 0.1189


 38%|███▊      | 190/500 [12:43<20:37,  3.99s/it]

Episode: 189 Total reward: 82.0 Training loss: 0.5385 Explore P: 0.1181


 38%|███▊      | 191/500 [12:48<21:12,  4.12s/it]

Episode: 190 Total reward: 127.0 Training loss: 0.1870 Explore P: 0.1167
Model Saved


 38%|███▊      | 192/500 [12:55<25:07,  4.89s/it]

Episode: 191 Total reward: 194.0 Training loss: 0.1435 Explore P: 0.1146


 39%|███▊      | 193/500 [13:00<25:32,  4.99s/it]

Episode: 192 Total reward: 151.0 Training loss: 0.2174 Explore P: 0.1131


 39%|███▉      | 194/500 [13:03<22:09,  4.35s/it]

Episode: 193 Total reward: 82.0 Training loss: 0.1062 Explore P: 0.1122


 39%|███▉      | 195/500 [13:06<21:08,  4.16s/it]

Episode: 194 Total reward: 108.0 Training loss: 0.1964 Explore P: 0.1111


 39%|███▉      | 196/500 [13:13<24:46,  4.89s/it]

Episode: 195 Total reward: 192.0 Training loss: 0.2016 Explore P: 0.1092
Model Saved


 39%|███▉      | 197/500 [13:19<26:08,  5.18s/it]

Episode: 196 Total reward: 169.0 Training loss: 0.2509 Explore P: 0.1075


 40%|███▉      | 198/500 [13:23<25:08,  4.99s/it]

Episode: 197 Total reward: 132.0 Training loss: 0.2437 Explore P: 0.1063


 40%|███▉      | 199/500 [13:28<24:58,  4.98s/it]

Episode: 198 Total reward: 143.0 Training loss: 0.2944 Explore P: 0.1049


 40%|████      | 200/500 [13:33<25:05,  5.02s/it]

Episode: 199 Total reward: 148.0 Training loss: 0.2291 Explore P: 0.1035


 40%|████      | 201/500 [13:38<25:11,  5.05s/it]

Episode: 200 Total reward: 148.0 Training loss: 0.1560 Explore P: 0.1021
Model Saved


 40%|████      | 202/500 [13:44<25:48,  5.20s/it]

Episode: 201 Total reward: 160.0 Training loss: 0.1496 Explore P: 0.1007


 41%|████      | 203/500 [13:48<24:16,  4.90s/it]

Episode: 202 Total reward: 122.0 Training loss: 0.1924 Explore P: 0.0996


 41%|████      | 204/500 [13:54<25:59,  5.27s/it]

Episode: 203 Total reward: 174.0 Training loss: 0.6961 Explore P: 0.0980


 41%|████      | 205/500 [13:58<23:50,  4.85s/it]

Episode: 204 Total reward: 112.0 Training loss: 0.2064 Explore P: 0.0970


 41%|████      | 206/500 [14:03<23:57,  4.89s/it]

Episode: 205 Total reward: 144.0 Training loss: 0.2202 Explore P: 0.0958
Model Saved


 41%|████▏     | 207/500 [14:07<22:05,  4.52s/it]

Episode: 206 Total reward: 106.0 Training loss: 0.2444 Explore P: 0.0949


 42%|████▏     | 208/500 [14:13<24:46,  5.09s/it]

Episode: 207 Total reward: 185.0 Training loss: 0.3134 Explore P: 0.0933


 42%|████▏     | 209/500 [14:20<26:25,  5.45s/it]

Episode: 208 Total reward: 181.0 Training loss: 0.2188 Explore P: 0.0918


 42%|████▏     | 210/500 [14:22<21:35,  4.47s/it]

Episode: 209 Total reward: 63.0 Training loss: 0.2738 Explore P: 0.0913


 42%|████▏     | 211/500 [14:28<23:27,  4.87s/it]

Episode: 210 Total reward: 168.0 Training loss: 0.2042 Explore P: 0.0900
Model Saved


 42%|████▏     | 212/500 [14:32<23:00,  4.79s/it]

Episode: 211 Total reward: 133.0 Training loss: 0.2750 Explore P: 0.0889


 43%|████▎     | 213/500 [14:39<25:17,  5.29s/it]

Episode: 212 Total reward: 186.0 Training loss: 0.1915 Explore P: 0.0875


 43%|████▎     | 214/500 [14:44<25:23,  5.33s/it]

Episode: 213 Total reward: 156.0 Training loss: 0.1748 Explore P: 0.0863


 43%|████▎     | 215/500 [14:51<27:04,  5.70s/it]

Episode: 214 Total reward: 189.0 Training loss: 0.0986 Explore P: 0.0848


 43%|████▎     | 216/500 [14:53<21:37,  4.57s/it]

Episode: 215 Total reward: 55.0 Training loss: 0.1333 Explore P: 0.0844
Model Saved


 43%|████▎     | 217/500 [14:58<23:20,  4.95s/it]

Episode: 216 Total reward: 168.0 Training loss: 0.1109 Explore P: 0.0832


 44%|████▎     | 218/500 [15:04<24:15,  5.16s/it]

Episode: 217 Total reward: 163.0 Training loss: 0.1193 Explore P: 0.0820


 44%|████▍     | 219/500 [15:09<23:47,  5.08s/it]

Episode: 218 Total reward: 141.0 Training loss: 0.5900 Explore P: 0.0810


 44%|████▍     | 220/500 [15:15<25:45,  5.52s/it]

Episode: 219 Total reward: 189.0 Training loss: 0.1427 Explore P: 0.0797


 44%|████▍     | 221/500 [15:21<25:32,  5.49s/it]

Episode: 220 Total reward: 156.0 Training loss: 0.4672 Explore P: 0.0786
Model Saved


 44%|████▍     | 222/500 [15:26<25:03,  5.41s/it]

Episode: 221 Total reward: 149.0 Training loss: 0.1032 Explore P: 0.0776


 45%|████▍     | 223/500 [15:31<24:28,  5.30s/it]

Episode: 222 Total reward: 146.0 Training loss: 0.0811 Explore P: 0.0766


 45%|████▍     | 224/500 [15:37<24:44,  5.38s/it]

Episode: 223 Total reward: 160.0 Training loss: 0.1412 Explore P: 0.0755


 45%|████▌     | 225/500 [15:43<25:35,  5.58s/it]

Episode: 224 Total reward: 175.0 Training loss: 0.1291 Explore P: 0.0744


 45%|████▌     | 226/500 [15:48<25:35,  5.60s/it]

Episode: 225 Total reward: 162.0 Training loss: 2.5601 Explore P: 0.0734
Model Saved


 45%|████▌     | 227/500 [15:52<22:23,  4.92s/it]

Episode: 226 Total reward: 95.0 Training loss: 0.3402 Explore P: 0.0728


 46%|████▌     | 228/500 [15:56<21:13,  4.68s/it]

Episode: 227 Total reward: 118.0 Training loss: 0.1070 Explore P: 0.0720


 46%|████▌     | 229/500 [16:00<19:55,  4.41s/it]

Episode: 228 Total reward: 106.0 Training loss: 0.1142 Explore P: 0.0714


 46%|████▌     | 230/500 [16:05<21:44,  4.83s/it]

Episode: 229 Total reward: 168.0 Training loss: 0.0778 Explore P: 0.0704


 46%|████▌     | 231/500 [16:11<22:44,  5.07s/it]

Episode: 230 Total reward: 163.0 Training loss: 0.1905 Explore P: 0.0694
Model Saved


 46%|████▋     | 232/500 [16:16<21:47,  4.88s/it]

Episode: 231 Total reward: 128.0 Training loss: 1.6079 Explore P: 0.0686


 47%|████▋     | 233/500 [16:22<24:12,  5.44s/it]

Episode: 232 Total reward: 195.0 Training loss: 0.0836 Explore P: 0.0675


 47%|████▋     | 234/500 [16:28<24:17,  5.48s/it]

Episode: 233 Total reward: 161.0 Training loss: 0.1109 Explore P: 0.0666


 47%|████▋     | 235/500 [16:31<21:38,  4.90s/it]

Episode: 234 Total reward: 102.0 Training loss: 0.3037 Explore P: 0.0660


 47%|████▋     | 236/500 [16:38<23:38,  5.37s/it]

Episode: 235 Total reward: 188.0 Training loss: 0.2394 Explore P: 0.0650
Model Saved


 47%|████▋     | 237/500 [16:42<21:52,  4.99s/it]

Episode: 236 Total reward: 120.0 Training loss: 0.5895 Explore P: 0.0643


 48%|████▊     | 238/500 [16:47<22:06,  5.06s/it]

Episode: 237 Total reward: 150.0 Training loss: 0.3097 Explore P: 0.0635


 48%|████▊     | 239/500 [16:53<22:28,  5.17s/it]

Episode: 238 Total reward: 156.0 Training loss: 0.1024 Explore P: 0.0627


 48%|████▊     | 240/500 [16:55<18:13,  4.21s/it]

Episode: 239 Total reward: 56.0 Training loss: 0.4113 Explore P: 0.0624


 48%|████▊     | 241/500 [17:01<20:31,  4.76s/it]

Episode: 240 Total reward: 168.0 Training loss: 0.0687 Explore P: 0.0615
Model Saved


 48%|████▊     | 242/500 [17:06<20:40,  4.81s/it]

Episode: 241 Total reward: 137.0 Training loss: 0.0929 Explore P: 0.0608


 49%|████▊     | 243/500 [17:11<21:35,  5.04s/it]

Episode: 242 Total reward: 161.0 Training loss: 0.0707 Explore P: 0.0600


 49%|████▉     | 244/500 [17:17<22:33,  5.29s/it]

Episode: 243 Total reward: 169.0 Training loss: 0.4217 Explore P: 0.0591


 49%|████▉     | 245/500 [17:21<21:21,  5.02s/it]

Episode: 244 Total reward: 127.0 Training loss: 0.1737 Explore P: 0.0585


 49%|████▉     | 246/500 [17:25<19:54,  4.70s/it]

Episode: 245 Total reward: 114.0 Training loss: 0.1321 Explore P: 0.0580
Model Saved


 49%|████▉     | 247/500 [17:32<21:55,  5.20s/it]

Episode: 246 Total reward: 183.0 Training loss: 0.3776 Explore P: 0.0571


 50%|████▉     | 248/500 [17:34<17:40,  4.21s/it]

Episode: 247 Total reward: 54.0 Training loss: 0.2839 Explore P: 0.0569


 50%|████▉     | 249/500 [17:36<14:51,  3.55s/it]

Episode: 248 Total reward: 58.0 Training loss: 0.0494 Explore P: 0.0566


 50%|█████     | 250/500 [17:41<17:24,  4.18s/it]

Episode: 249 Total reward: 159.0 Training loss: 0.1415 Explore P: 0.0558


 50%|█████     | 251/500 [17:47<18:56,  4.56s/it]

Episode: 250 Total reward: 154.0 Training loss: 0.1189 Explore P: 0.0551
Model Saved


 50%|█████     | 252/500 [17:55<22:52,  5.53s/it]

Episode: 251 Total reward: 223.0 Training loss: 0.5438 Explore P: 0.0542


 51%|█████     | 253/500 [18:01<23:30,  5.71s/it]

Episode: 252 Total reward: 174.0 Training loss: 0.1826 Explore P: 0.0534


 51%|█████     | 254/500 [18:07<24:33,  5.99s/it]

Episode: 253 Total reward: 192.0 Training loss: 0.1650 Explore P: 0.0526


 51%|█████     | 255/500 [18:10<20:58,  5.14s/it]

Episode: 254 Total reward: 91.0 Training loss: 0.0925 Explore P: 0.0522


 51%|█████     | 256/500 [18:16<22:00,  5.41s/it]

Episode: 255 Total reward: 175.0 Training loss: 0.3293 Explore P: 0.0514
Model Saved


 51%|█████▏    | 257/500 [18:24<24:43,  6.10s/it]

Episode: 256 Total reward: 223.0 Training loss: 0.2100 Explore P: 0.0505


 52%|█████▏    | 258/500 [18:31<25:34,  6.34s/it]

Episode: 257 Total reward: 199.0 Training loss: 0.3232 Explore P: 0.0497


 52%|█████▏    | 259/500 [18:37<25:22,  6.32s/it]

Episode: 258 Total reward: 180.0 Training loss: 0.1221 Explore P: 0.0490


 52%|█████▏    | 260/500 [18:42<23:13,  5.81s/it]

Episode: 259 Total reward: 132.0 Training loss: 0.1523 Explore P: 0.0485


 52%|█████▏    | 261/500 [18:44<18:31,  4.65s/it]

Episode: 260 Total reward: 56.0 Training loss: 0.1313 Explore P: 0.0483
Model Saved


 52%|█████▏    | 262/500 [18:50<20:24,  5.14s/it]

Episode: 261 Total reward: 181.0 Training loss: 0.2103 Explore P: 0.0476


 53%|█████▎    | 263/500 [18:56<21:16,  5.39s/it]

Episode: 262 Total reward: 171.0 Training loss: 0.2475 Explore P: 0.0470


 53%|█████▎    | 264/500 [19:03<22:58,  5.84s/it]

Episode: 263 Total reward: 198.0 Training loss: 0.6124 Explore P: 0.0462


 53%|█████▎    | 265/500 [19:09<22:54,  5.85s/it]

Episode: 264 Total reward: 169.0 Training loss: 0.1651 Explore P: 0.0456


 53%|█████▎    | 266/500 [19:16<24:12,  6.21s/it]

Episode: 265 Total reward: 202.0 Training loss: 0.1176 Explore P: 0.0449
Model Saved


 53%|█████▎    | 267/500 [19:21<23:12,  5.97s/it]

Episode: 266 Total reward: 156.0 Training loss: 0.3324 Explore P: 0.0444


 54%|█████▎    | 268/500 [19:28<23:35,  6.10s/it]

Episode: 267 Total reward: 183.0 Training loss: 0.1671 Explore P: 0.0438


 54%|█████▍    | 269/500 [19:35<24:46,  6.43s/it]

Episode: 268 Total reward: 206.0 Training loss: 0.2555 Explore P: 0.0431


 54%|█████▍    | 270/500 [19:42<25:03,  6.54s/it]

Episode: 269 Total reward: 194.0 Training loss: 0.0746 Explore P: 0.0424


 54%|█████▍    | 271/500 [19:48<24:25,  6.40s/it]

Episode: 270 Total reward: 174.0 Training loss: 0.1234 Explore P: 0.0419
Model Saved


 54%|█████▍    | 272/500 [19:53<22:24,  5.90s/it]

Episode: 271 Total reward: 135.0 Training loss: 2.1174 Explore P: 0.0415


 55%|█████▍    | 273/500 [20:00<24:11,  6.39s/it]

Episode: 272 Total reward: 215.0 Training loss: 0.1040 Explore P: 0.0408


 55%|█████▍    | 274/500 [20:05<22:44,  6.04s/it]

Episode: 273 Total reward: 145.0 Training loss: 0.1265 Explore P: 0.0403


 55%|█████▌    | 275/500 [20:11<22:20,  5.96s/it]

Episode: 274 Total reward: 165.0 Training loss: 0.0777 Explore P: 0.0398


 55%|█████▌    | 276/500 [20:17<22:39,  6.07s/it]

Episode: 275 Total reward: 181.0 Training loss: 0.2049 Explore P: 0.0393
Model Saved


 55%|█████▌    | 277/500 [20:23<21:36,  5.82s/it]

Episode: 276 Total reward: 149.0 Training loss: 0.1259 Explore P: 0.0389


 56%|█████▌    | 278/500 [20:28<20:32,  5.55s/it]

Episode: 277 Total reward: 141.0 Training loss: 0.1318 Explore P: 0.0385


 56%|█████▌    | 279/500 [20:34<21:38,  5.88s/it]

Episode: 278 Total reward: 189.0 Training loss: 0.3705 Explore P: 0.0379


 56%|█████▌    | 280/500 [20:37<17:38,  4.81s/it]

Episode: 279 Total reward: 66.0 Training loss: 0.1631 Explore P: 0.0378


 56%|█████▌    | 281/500 [20:42<18:14,  5.00s/it]

Episode: 280 Total reward: 155.0 Training loss: 0.1578 Explore P: 0.0373
Model Saved


 56%|█████▋    | 282/500 [20:47<18:37,  5.13s/it]

Episode: 281 Total reward: 155.0 Training loss: 0.0882 Explore P: 0.0369


 57%|█████▋    | 283/500 [20:53<18:31,  5.12s/it]

Episode: 282 Total reward: 146.0 Training loss: 0.6500 Explore P: 0.0365


 57%|█████▋    | 284/500 [20:59<19:26,  5.40s/it]

Episode: 283 Total reward: 173.0 Training loss: 0.1539 Explore P: 0.0361


 57%|█████▋    | 285/500 [21:06<21:24,  5.98s/it]

Episode: 284 Total reward: 209.0 Training loss: 0.1015 Explore P: 0.0355


 57%|█████▋    | 286/500 [21:12<21:40,  6.08s/it]

Episode: 285 Total reward: 181.0 Training loss: 0.1446 Explore P: 0.0351
Model Saved


 57%|█████▋    | 287/500 [21:16<19:34,  5.51s/it]

Episode: 286 Total reward: 118.0 Training loss: 0.4907 Explore P: 0.0348


 58%|█████▊    | 288/500 [21:18<15:27,  4.37s/it]

Episode: 287 Total reward: 48.0 Training loss: 0.1230 Explore P: 0.0347


 58%|█████▊    | 289/500 [21:24<17:21,  4.94s/it]

Episode: 288 Total reward: 179.0 Training loss: 0.0955 Explore P: 0.0342


 58%|█████▊    | 290/500 [21:27<14:27,  4.13s/it]

Episode: 289 Total reward: 64.0 Training loss: 0.3292 Explore P: 0.0341


 58%|█████▊    | 291/500 [21:34<17:54,  5.14s/it]

Episode: 290 Total reward: 215.0 Training loss: 0.0894 Explore P: 0.0335
Model Saved


 58%|█████▊    | 292/500 [21:39<17:17,  4.99s/it]

Episode: 291 Total reward: 132.0 Training loss: 0.0939 Explore P: 0.0332


 59%|█████▊    | 293/500 [21:44<17:47,  5.16s/it]

Episode: 292 Total reward: 159.0 Training loss: 0.1974 Explore P: 0.0329


 59%|█████▉    | 294/500 [21:50<17:49,  5.19s/it]

Episode: 293 Total reward: 151.0 Training loss: 0.2894 Explore P: 0.0325


 59%|█████▉    | 295/500 [21:56<18:28,  5.41s/it]

Episode: 294 Total reward: 169.0 Training loss: 0.1502 Explore P: 0.0322


 59%|█████▉    | 296/500 [22:00<17:36,  5.18s/it]

Episode: 295 Total reward: 133.0 Training loss: 0.4168 Explore P: 0.0319
Model Saved


## Step 9: Watch our Agent play 👀
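Now that we have trained our agent, we can test it: we restore the saved model and let the agent play one episode, always picking the action with the highest predicted Q-value.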

In [None]:
saver = tf.train.Saver()

with tf.Session() as sess:
    devices = sess.list_devices()
    print(devices)
    
    game, possible_actions = create_environment()
    
    totalScore = 0
    
    # Load the model
    saver.restore(sess, "./models/model.ckpt")
    game.init()
    for i in range(1):
        
        done = False
        
        game.new_episode()
        
        state = game.get_state().screen_buffer
        state, stacked_frames = stack_frames(stacked_frames, state, True)
            
        while not game.is_episode_finished():
            # Estimate the Q-values of every action for the current stacked state
            Qs = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: state.reshape((1, *state.shape))})
            
            # Take the biggest Q value (= the best action)
            choice = np.argmax(Qs)
            action = possible_actions[int(choice)]
            
            game.make_action(action)
            done = game.is_episode_finished()
            # (the total reward is read once, after the episode ends)
            
            if done:
                break  
                
            else:
                next_state = game.get_state().screen_buffer
                next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)
                state = next_state
                
        score = game.get_total_reward()
        print("Score: ", score)
    game.close()
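
The action-selection step in the loop above can be looked at in isolation. The snippet below is a minimal sketch, not part of the notebook's pipeline: it assumes `possible_actions` is a list of one-hot action vectors, and both the `greedy_action` helper and the Q-values are made up for illustration. It shows how `np.argmax` turns the network output for a single state into the action passed to `game.make_action`.

```python
import numpy as np

# Assumption: three one-hot actions, standing in for the real possible_actions
possible_actions = np.identity(3, dtype=int).tolist()

def greedy_action(q_values, actions):
    """Hypothetical helper: return the action with the largest estimated Q-value."""
    q_values = np.asarray(q_values)
    # For a single state the network output has shape (1, n_actions),
    # so flatten it before taking the argmax.
    choice = int(np.argmax(q_values.ravel()))
    return actions[choice]

# Made-up Q-values for one state, shaped like a batch of size 1
Qs = np.array([[0.2, -1.3, 4.1]])
action = greedy_action(Qs, possible_actions)
print(action)  # [0, 0, 1]
```

At test time there is no exploration: unlike during training, where a random action is taken with probability `Explore P`, the agent here always follows the greedy choice.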