## Keras Implementation of a DQN

Trained on the game Catcher. Dependencies:
* Keras
* TensorFlow
* PyGame Learning Environment (PLE): https://github.com/ntasfi/PyGame-Learning-Environment
* NumPy

Code adapted from:
* Using Keras to solve FlappyBird: https://github.com/yanpanlau/Keras-FlappyBird
* Denny Britz's DQN implementation: https://github.com/dennybritz/reinforcement-learning/tree/master/DQN
* Udacity's RL Implementation: https://github.com/udacity/deep-learning/tree/master/reinforcement 

In [1]:
import numpy as np
from ple import PLE
from ple.games.catcher import Catcher

from collections import deque
import random

from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import SGD , Adam
from keras import initializers
import keras.backend as K


LEARNING_RATE = 1e-4
ACTIONS = 2 # left, right
action_list = [97,100] # key codes sent to PLE: ord('a') = 97, ord('d') = 100
GAMMA = 0.99 # discount factor for future rewards
EXPLORE = 4000000. # frames over which to anneal epsilon
FINAL_EPSILON = 0.01 # final value of epsilon
INITIAL_EPSILON = 0.95 # starting value of epsilon
REPLAY_MEMORY = 50000 # number of previous transitions to remember
BATCH = 32 # size of minibatch
#FRAME_PER_ACTION = 1 #set in PLE environment

img_rows, img_cols = 64, 64 #default screen size of Catcher
#frames are converted to grayscale
img_channels = 4 #stacking 4 frames

#DQN (https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf 2015, Mnih) has a different architecture
#this one uses slight modifications of the Denny Britz and FlappyBird code,
#with the output layer sized to the number of actions in this game
def build_model():
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(8,8),strides=(4, 4), padding='same',
                     activation='relu', input_shape=(img_rows,img_cols,img_channels), 
                     kernel_initializer=initializers.glorot_normal(seed=31))) 
    model.add(Conv2D(64,kernel_size=(4,4),strides=(2,2),padding='same',activation='relu', 
                     kernel_initializer=initializers.glorot_normal(seed=31)))
    model.add(Conv2D(64,kernel_size=(3,3),strides=(1,1),padding='same',activation='relu', 
                     kernel_initializer=initializers.glorot_normal(seed=31)))
    model.add(Flatten())
    model.add(Dense(512, activation='relu', 
                     kernel_initializer=initializers.glorot_normal(seed=31)))
    model.add(Dense(ACTIONS, 
                     kernel_initializer=initializers.glorot_normal(seed=31)))
    
    adam = Adam(lr=LEARNING_RATE,clipvalue=1.0)
    model.compile(loss='mse',optimizer=adam)
    return model


couldn't import doomish
Couldn't import doom


Using TensorFlow backend.
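
As a sanity check on the architecture in `build_model`, the sketch below works out the feature-map sizes by hand. It assumes the usual rule for `padding='same'` (output size is the input size divided by the stride, rounded up); the helper name `same_out` is made up for illustration and is not part of Keras.

```python
import math

def same_out(size, stride):
    """Spatial size after a conv layer with padding='same' (assumed rule)."""
    return math.ceil(size / stride)

size = 64                               # img_rows == img_cols == 64
layers = [(32, 4), (64, 2), (64, 1)]    # (filters, stride) of the three Conv2D layers
shapes = []
for filters, stride in layers:
    size = same_out(size, stride)
    shapes.append((size, size, filters))

# Number of values Flatten() hands to the Dense(512) layer
flat_units = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)       # [(16, 16, 32), (8, 8, 64), (8, 8, 64)]
print(flat_units)   # 4096
```

Calling `model.summary()` on the built model should report the same shapes.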


In [2]:
def train_network(model,target_model):
    
    REPORT_EVERY = 2000 #report results this often
    UPDATE_EVERY = 10000 #update target_model after this many training steps
    
    # store the previous observations in replay memory
    exp_replay = deque()
    
    #training-step counter; the main loop runs until it reaches EXPLORE
    train_loops = 0
    
    loss = 0
    Q_sa = 0
    action_index = 0 #action index stored in experience replay
    action_command = action_list[action_index] #actual command given to the environment
    r_t = 0 #reward
    epsilon = INITIAL_EPSILON
    s_t = None #state
    s_t1 = None #resulting state
    skip_append = True

    ti_tuple = tuple(range(BATCH)) #row indices used to pick each sample's taken action when building targets
    
    #the done flag is 0 when the episode ended and 1 otherwise,
    #so it can multiply the bootstrap term directly when making targets
    EPISODE_DONE = 0
    EPISODE_NOT_DONE = 1
    EPISODE_START = -1
    episode_length = 0.
    episode_reward = 0.
    stats_episode_avg_reward = 0.
    stats_episode_avg_length = 0.
    stats_episode = 0
    episode = 1
    turn = 0
    
    game = Catcher() # create our game

    fps = 30  # fps we want to run at
    frame_skip = 2
    num_steps = 2
    force_fps = True # False for slower speed
    display_screen = False

    # make a PLE instance.
    ple_env = PLE(game, fps=fps, frame_skip=frame_skip, num_steps=num_steps,
            force_fps=force_fps, display_screen=display_screen)

    # init agent and game.
    ple_env.init()
    done_check = ple_env.lives() #resetting episode on loss of life, not on game over
    ple_env.act(action_list[np.random.randint(0,2)]) #need to take an action to initialize, screen starts out as black
    
    #start of an episode is a stack of four of the same observation
    #each subsequent time step puts the current frame first and drops the oldest (last) frame
    x_t = ple_env.getScreenGrayscale()/255. 
    s_t = np.stack((x_t, x_t, x_t, x_t), axis=2)

    while train_loops < EXPLORE:
        train_loops += 1
        
        #get action (epsilon-greedy)
        #action_command is the key code actually sent to the environment
        #action_index is what gets stored in the experience replay and used in training
        if random.random() <= epsilon:
            action_index = np.random.randint(0,ACTIONS)
            action_command = action_list[action_index]
        else:
            q = model.predict(s_t.reshape(1, s_t.shape[0], s_t.shape[1], s_t.shape[2])) 
            max_Q = np.argmax(q)
            action_index = max_Q
            action_command = action_list[action_index]
            
        #reduce exploration rate epsilon
        if epsilon > FINAL_EPSILON:
            epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE
        
        #do action
        r_t = ple_env.act(action_command)
        episode_reward += r_t
        episode_length += 1
        
        #get next state
        x_t1 = ple_env.getScreenGrayscale()/255.
        x_t1 = x_t1.reshape(x_t1.shape[0], x_t1.shape[1], 1) #64x64x1
        s_t1 = np.append(x_t1, s_t[ :, :, :3], axis=2)

        if ple_env.lives() != done_check:
            done = EPISODE_DONE
        else:
            done = EPISODE_NOT_DONE
        
        #adding to experience replay
        if len(exp_replay) == REPLAY_MEMORY:
            exp_replay.popleft()
        exp_replay.append((s_t, action_index, r_t, s_t1, done))
        
        #check if life is lost, if so episode is done and we reset
        if done == EPISODE_DONE:
            episode += 1
            stats_episode += 1
            turn = 0
            stats_episode_avg_reward = stats_episode_avg_reward + (episode_reward - stats_episode_avg_reward)/stats_episode
            stats_episode_avg_length = stats_episode_avg_length + (episode_length - stats_episode_avg_length)/stats_episode
            episode_reward = 0.
            episode_length = 0.

            #reset the environment, get the current state
            ple_env.reset_game()
            ple_env.act(action_list[np.random.randint(0,2)]) #take action to initialize, screen starts out as black
            x_t = ple_env.getScreenGrayscale()/255.
            s_t = np.stack((x_t, x_t, x_t, x_t), axis=2)
        else:
            s_t = s_t1 #resulting state becomes the current state
        
        #train the network using the experience replay
        #modified heavily from the FlappyBird implementation, following Denny Britz's design instead:
        #  - uses a fixed Q target network
        #  - only predicts on the next states' values (FlappyBird also predicts on the current states,
        #    which I don't think the DQN paper does)
        #  - removes a for loop
                
        if len(exp_replay) > BATCH:
            minibatch = random.sample(exp_replay,BATCH)

            states_batch, action_batch, reward_batch, next_states_batch, done_batch = map(np.array, zip(*minibatch))        
            
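            #Q-values of the next states, taken from the fixed target network
            q_values_next = target_model.predict(next_states_batch,batch_size=BATCH)
            #only the taken action gets a TD target; the other action's target stays 0,
            #so the MSE loss also pulls its Q-value toward 0
            targets = np.zeros((BATCH,ACTIONS)) #BATCHxACTIONS
            #done_batch is 0 for terminal transitions, which drops the bootstrap term
            targets[ti_tuple,action_batch] = reward_batch + done_batch * GAMMA * np.amax(q_values_next,axis=1)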
            loss += model.train_on_batch(states_batch, targets)
        
        #save model weights and update target model
        if train_loops % UPDATE_EVERY == 0:
            print("Saving model weights")
            zString = "catcher_training_weights/model_{}.h5".format(train_loops)
            model.save_weights(zString, overwrite=True)
            #updating fixed Q network weights
            target_model.load_weights(zString)

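        #print info: Loss is the sum of batch losses since the last report; the averages cover episodes finished since then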
        if train_loops % REPORT_EVERY == 0:
            print("Loop {} / Episode {} / Epsilon {:6.4f} / Avg. Reward {:6.3f} / Avg. Length {:6.3f} / Loss {:6.4f} "
                              .format(train_loops,episode,epsilon,stats_episode_avg_reward,stats_episode_avg_length,loss) )
            stats_episode_avg_reward = 0.
            stats_episode_avg_length = 0.
            stats_episode = 0.
            loss = 0.
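
The target construction inside `train_network` can be tried out on a tiny hand-made batch. The sketch below uses plain NumPy arrays as stand-ins for the outputs of `target_model.predict` and `model.predict`; the helper name `td_targets` and all the numbers are made up for illustration. The second half shows a common variant, which is an assumption here and not what the notebook does: starting from the online network's current predictions so that only the taken action contributes to the MSE loss.

```python
import numpy as np

GAMMA = 0.99

def td_targets(rewards, not_done, q_next):
    """r + GAMMA * max_a' Q_target(s', a'); the bootstrap term is masked
    out (not_done == 0) for transitions that ended an episode."""
    return rewards + not_done * GAMMA * np.max(q_next, axis=1)

rewards  = np.array([1.0, -1.0, 0.0])
not_done = np.array([1, 0, 1])        # same 0/1 flag the notebook stores as `done`
actions  = np.array([0, 1, 1])        # index of the action taken in each transition
q_next   = np.array([[0.5, 2.0],      # stand-in for target_model.predict(next_states)
                     [3.0, 1.0],
                     [0.0, 0.0]])

y = td_targets(rewards, not_done, q_next)

# As in the notebook: zeros everywhere except the taken action.
targets_zero = np.zeros((len(actions), 2))
targets_zero[np.arange(len(actions)), actions] = y

# Variant (assumption): keep the current predictions for untaken actions,
# so their error term in the MSE is exactly zero.
q_current = np.array([[0.2, 0.4],     # stand-in for model.predict(states)
                      [0.1, 0.3],
                      [0.7, 0.6]])
targets_keep = q_current.copy()
targets_keep[np.arange(len(actions)), actions] = y
```

Here `y` comes out as roughly `[2.98, -1.0, 0.0]`: the second transition is terminal, so its target is just the reward.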
        

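
The frame stacking used for the state in `train_network` is easier to follow in isolation. The sketch below mirrors the `np.stack` / `np.append` calls from the training loop; the helper names `initial_stack` and `push_frame` are hypothetical, and constant-valued frames stand in for real screenshots so the channel order is visible.

```python
import numpy as np

def initial_stack(frame, n=4):
    """Start of an episode: the same frame repeated n times -> (H, W, n)."""
    return np.stack([frame] * n, axis=2)

def push_frame(stack, frame):
    """Put the newest frame in channel 0 and drop the oldest (last) channel."""
    return np.append(frame[:, :, np.newaxis], stack[:, :, :-1], axis=2)

s = initial_stack(np.zeros((64, 64)))
s = push_frame(s, np.full((64, 64), 1.0))
s = push_frame(s, np.full((64, 64), 2.0))

channel_values = [float(s[0, 0, c]) for c in range(4)]
print(s.shape, channel_values)   # (64, 64, 4) [2.0, 1.0, 0.0, 0.0]
```

The newest frame always sits in channel 0, which matches `np.append(x_t1, s_t[:, :, :3], axis=2)` in the loop.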
In [3]:
model = build_model()
target_model = build_model()

train_network(model,target_model)

Loop 2000 / Episode 156 / Epsilon 0.9495 / Avg. Reward -0.548 / Avg. Length 12.884 / Loss 21.7476 
Loop 4000 / Episode 304 / Epsilon 0.9491 / Avg. Reward -0.480 / Avg. Length 13.486 / Loss 19.8332 
Loop 6000 / Episode 459 / Epsilon 0.9486 / Avg. Reward -0.542 / Avg. Length 12.897 / Loss 21.2444 
Loop 8000 / Episode 625 / Epsilon 0.9481 / Avg. Reward -0.651 / Avg. Length 12.054 / Loss 21.3462 
Saving model weights
Loop 10000 / Episode 790 / Epsilon 0.9476 / Avg. Reward -0.624 / Avg. Length 12.121 / Loss 20.8022 
Loop 12000 / Episode 940 / Epsilon 0.9472 / Avg. Reward -0.480 / Avg. Length 13.340 / Loss 35.8715 
Loop 14000 / Episode 1091 / Epsilon 0.9467 / Avg. Reward -0.497 / Avg. Length 13.245 / Loss 29.2353 
Loop 16000 / Episode 1230 / Epsilon 0.9462 / Avg. Reward -0.367 / Avg. Length 14.410 / Loss 29.5682 
Loop 18000 / Episode 1387 / Epsilon 0.9458 / Avg. Reward -0.567 / Avg. Length 12.732 / Loss 27.6158 
Saving model weights
Loop 20000 / Episode 1553 / Epsilon 0.9453 / Avg. Reward -0

Loop 158000 / Episode 12224 / Epsilon 0.9129 / Avg. Reward -0.452 / Avg. Length 13.733 / Loss 47.2238 
Saving model weights
Loop 160000 / Episode 12392 / Epsilon 0.9124 / Avg. Reward -0.649 / Avg. Length 11.988 / Loss 45.9708 
Loop 162000 / Episode 12548 / Epsilon 0.9119 / Avg. Reward -0.551 / Avg. Length 12.846 / Loss 51.9891 
Loop 164000 / Episode 12699 / Epsilon 0.9115 / Avg. Reward -0.503 / Avg. Length 13.252 / Loss 50.9671 
Loop 166000 / Episode 12845 / Epsilon 0.9110 / Avg. Reward -0.445 / Avg. Length 13.678 / Loss 47.3807 
Loop 168000 / Episode 12989 / Epsilon 0.9105 / Avg. Reward -0.410 / Avg. Length 13.896 / Loss 46.8538 
Saving model weights
Loop 170000 / Episode 13139 / Epsilon 0.9100 / Avg. Reward -0.513 / Avg. Length 13.327 / Loss 46.5756 
Loop 172000 / Episode 13296 / Epsilon 0.9096 / Avg. Reward -0.567 / Avg. Length 12.752 / Loss 50.5967 
Loop 174000 / Episode 13454 / Epsilon 0.9091 / Avg. Reward -0.563 / Avg. Length 12.620 / Loss 48.1845 
Loop 176000 / Episode 13604 / E

Loop 312000 / Episode 23931 / Epsilon 0.8767 / Avg. Reward -0.545 / Avg. Length 12.763 / Loss 56.0301 
Loop 314000 / Episode 24080 / Epsilon 0.8762 / Avg. Reward -0.477 / Avg. Length 13.530 / Loss 54.0700 
Loop 316000 / Episode 24234 / Epsilon 0.8757 / Avg. Reward -0.526 / Avg. Length 12.961 / Loss 52.0396 
Loop 318000 / Episode 24379 / Epsilon 0.8753 / Avg. Reward -0.441 / Avg. Length 13.779 / Loss 51.5743 
Saving model weights
Loop 320000 / Episode 24523 / Epsilon 0.8748 / Avg. Reward -0.431 / Avg. Length 13.826 / Loss 50.6085 
Loop 322000 / Episode 24680 / Epsilon 0.8743 / Avg. Reward -0.548 / Avg. Length 12.726 / Loss 54.9213 
Loop 324000 / Episode 24828 / Epsilon 0.8739 / Avg. Reward -0.459 / Avg. Length 13.595 / Loss 52.5702 
Loop 326000 / Episode 24980 / Epsilon 0.8734 / Avg. Reward -0.539 / Avg. Length 13.020 / Loss 51.4520 
Loop 328000 / Episode 25140 / Epsilon 0.8729 / Avg. Reward -0.594 / Avg. Length 12.500 / Loss 49.6369 
Saving model weights
Loop 330000 / Episode 25292 / E

Loop 466000 / Episode 35343 / Epsilon 0.8405 / Avg. Reward -0.455 / Avg. Length 13.717 / Loss 51.1757 
Loop 468000 / Episode 35487 / Epsilon 0.8400 / Avg. Reward -0.417 / Avg. Length 13.951 / Loss 49.5005 
Saving model weights
Loop 470000 / Episode 35631 / Epsilon 0.8395 / Avg. Reward -0.431 / Avg. Length 13.903 / Loss 49.2212 
Loop 472000 / Episode 35780 / Epsilon 0.8391 / Avg. Reward -0.490 / Avg. Length 13.416 / Loss 54.9944 
Loop 474000 / Episode 35922 / Epsilon 0.8386 / Avg. Reward -0.401 / Avg. Length 14.077 / Loss 52.1767 
Loop 476000 / Episode 36067 / Epsilon 0.8381 / Avg. Reward -0.421 / Avg. Length 13.828 / Loss 51.0660 
Loop 478000 / Episode 36224 / Epsilon 0.8377 / Avg. Reward -0.567 / Avg. Length 12.720 / Loss 51.3465 
Saving model weights
Loop 480000 / Episode 36372 / Epsilon 0.8372 / Avg. Reward -0.480 / Avg. Length 13.385 / Loss 49.3943 
Loop 482000 / Episode 36526 / Epsilon 0.8367 / Avg. Reward -0.519 / Avg. Length 13.104 / Loss 52.3315 
Loop 484000 / Episode 36684 / E

Saving model weights
Loop 620000 / Episode 46324 / Epsilon 0.8043 / Avg. Reward -0.386 / Avg. Length 14.271 / Loss 52.7638 
Loop 622000 / Episode 46461 / Epsilon 0.8038 / Avg. Reward -0.358 / Avg. Length 14.555 / Loss 55.4667 
Loop 624000 / Episode 46599 / Epsilon 0.8034 / Avg. Reward -0.348 / Avg. Length 14.594 / Loss 53.5830 
Loop 626000 / Episode 46730 / Epsilon 0.8029 / Avg. Reward -0.290 / Avg. Length 15.160 / Loss 52.4945 
Loop 628000 / Episode 46881 / Epsilon 0.8024 / Avg. Reward -0.497 / Avg. Length 13.278 / Loss 51.7810 
Saving model weights
Loop 630000 / Episode 47029 / Epsilon 0.8019 / Avg. Reward -0.473 / Avg. Length 13.399 / Loss 49.7541 
Loop 632000 / Episode 47149 / Epsilon 0.8015 / Avg. Reward -0.092 / Avg. Length 16.817 / Loss 55.5396 
Loop 634000 / Episode 47296 / Epsilon 0.8010 / Avg. Reward -0.449 / Avg. Length 13.646 / Loss 52.2627 
Loop 636000 / Episode 47437 / Epsilon 0.8005 / Avg. Reward -0.411 / Avg. Length 14.078 / Loss 50.9482 
Loop 638000 / Episode 47575 / E

Loop 774000 / Episode 57090 / Epsilon 0.7681 / Avg. Reward -0.386 / Avg. Length 14.371 / Loss 59.8628 
Loop 776000 / Episode 57235 / Epsilon 0.7676 / Avg. Reward -0.434 / Avg. Length 13.786 / Loss 57.8909 
Loop 778000 / Episode 57376 / Epsilon 0.7672 / Avg. Reward -0.404 / Avg. Length 14.170 / Loss 58.6362 
Saving model weights
Loop 780000 / Episode 57514 / Epsilon 0.7667 / Avg. Reward -0.370 / Avg. Length 14.478 / Loss 57.8208 
Loop 782000 / Episode 57643 / Epsilon 0.7662 / Avg. Reward -0.233 / Avg. Length 15.558 / Loss 60.0244 
Loop 784000 / Episode 57767 / Epsilon 0.7658 / Avg. Reward -0.169 / Avg. Length 16.145 / Loss 57.7816 
Loop 786000 / Episode 57898 / Epsilon 0.7653 / Avg. Reward -0.267 / Avg. Length 15.229 / Loss 56.2257 
Loop 788000 / Episode 58039 / Epsilon 0.7648 / Avg. Reward -0.383 / Avg. Length 14.191 / Loss 55.7077 
Saving model weights
Loop 790000 / Episode 58172 / Epsilon 0.7643 / Avg. Reward -0.301 / Avg. Length 15.008 / Loss 55.2623 
Loop 792000 / Episode 58316 / E

Loop 928000 / Episode 67323 / Epsilon 0.7319 / Avg. Reward -0.319 / Avg. Length 14.785 / Loss 58.0457 
Saving model weights
Loop 930000 / Episode 67463 / Epsilon 0.7314 / Avg. Reward -0.379 / Avg. Length 14.271 / Loss 57.7180 
Loop 932000 / Episode 67590 / Epsilon 0.7310 / Avg. Reward -0.213 / Avg. Length 15.740 / Loss 62.7530 
Loop 934000 / Episode 67712 / Epsilon 0.7305 / Avg. Reward -0.164 / Avg. Length 16.090 / Loss 60.8047 
Loop 936000 / Episode 67850 / Epsilon 0.7300 / Avg. Reward -0.326 / Avg. Length 14.746 / Loss 59.3687 
Loop 938000 / Episode 67981 / Epsilon 0.7296 / Avg. Reward -0.267 / Avg. Length 15.221 / Loss 57.6335 
Saving model weights
Loop 940000 / Episode 68116 / Epsilon 0.7291 / Avg. Reward -0.296 / Avg. Length 14.896 / Loss 58.3459 
Loop 942000 / Episode 68249 / Epsilon 0.7286 / Avg. Reward -0.308 / Avg. Length 15.038 / Loss 62.5859 
Loop 944000 / Episode 68377 / Epsilon 0.7282 / Avg. Reward -0.234 / Avg. Length 15.617 / Loss 58.7998 
Loop 946000 / Episode 68508 / E

Loop 1082000 / Episode 77205 / Epsilon 0.6957 / Avg. Reward -0.286 / Avg. Length 15.060 / Loss 65.6586 
Loop 1084000 / Episode 77324 / Epsilon 0.6953 / Avg. Reward -0.092 / Avg. Length 16.748 / Loss 63.7141 
Loop 1086000 / Episode 77455 / Epsilon 0.6948 / Avg. Reward -0.275 / Avg. Length 15.160 / Loss 62.3870 
Loop 1088000 / Episode 77578 / Epsilon 0.6943 / Avg. Reward -0.138 / Avg. Length 16.382 / Loss 61.8461 
Saving model weights
Loop 1090000 / Episode 77694 / Epsilon 0.6938 / Avg. Reward -0.052 / Avg. Length 17.259 / Loss 61.1625 
Loop 1092000 / Episode 77815 / Epsilon 0.6934 / Avg. Reward -0.116 / Avg. Length 16.545 / Loss 63.8643 
Loop 1094000 / Episode 77929 / Epsilon 0.6929 / Avg. Reward -0.035 / Avg. Length 17.368 / Loss 62.4564 
Loop 1096000 / Episode 78036 / Epsilon 0.6924 / Avg. Reward  0.150 / Avg. Length 18.785 / Loss 62.8491 
Loop 1098000 / Episode 78177 / Epsilon 0.6920 / Avg. Reward -0.397 / Avg. Length 14.227 / Loss 61.7083 
Saving model weights
Loop 1100000 / Episode

Loop 1234000 / Episode 86568 / Epsilon 0.6600 / Avg. Reward  0.168 / Avg. Length 19.028 / Loss 64.0316 
Loop 1236000 / Episode 86694 / Epsilon 0.6595 / Avg. Reward -0.190 / Avg. Length 15.937 / Loss 62.5463 
Loop 1238000 / Episode 86809 / Epsilon 0.6591 / Avg. Reward -0.035 / Avg. Length 17.252 / Loss 62.4057 
Saving model weights
Loop 1240000 / Episode 86934 / Epsilon 0.6586 / Avg. Reward -0.160 / Avg. Length 16.048 / Loss 62.1079 
Loop 1242000 / Episode 87047 / Epsilon 0.6581 / Avg. Reward  0.018 / Avg. Length 17.708 / Loss 67.3709 
Loop 1244000 / Episode 87173 / Epsilon 0.6577 / Avg. Reward -0.183 / Avg. Length 15.937 / Loss 64.2201 
Loop 1246000 / Episode 87301 / Epsilon 0.6572 / Avg. Reward -0.234 / Avg. Length 15.570 / Loss 65.0142 
Loop 1248000 / Episode 87435 / Epsilon 0.6567 / Avg. Reward -0.284 / Avg. Length 14.985 / Loss 63.3139 
Saving model weights
Loop 1250000 / Episode 87553 / Epsilon 0.6562 / Avg. Reward -0.093 / Avg. Length 16.788 / Loss 63.3176 
Loop 1252000 / Episode

Loop 1386000 / Episode 95576 / Epsilon 0.6243 / Avg. Reward  0.009 / Avg. Length 17.646 / Loss 68.3593 
Loop 1388000 / Episode 95697 / Epsilon 0.6238 / Avg. Reward -0.132 / Avg. Length 16.331 / Loss 67.5226 
Saving model weights
Loop 1390000 / Episode 95807 / Epsilon 0.6233 / Avg. Reward  0.082 / Avg. Length 18.291 / Loss 67.7743 
Loop 1392000 / Episode 95922 / Epsilon 0.6229 / Avg. Reward -0.017 / Avg. Length 17.417 / Loss 69.2568 
Loop 1394000 / Episode 96038 / Epsilon 0.6224 / Avg. Reward -0.026 / Avg. Length 17.293 / Loss 69.2779 
Loop 1396000 / Episode 96152 / Epsilon 0.6219 / Avg. Reward -0.018 / Avg. Length 17.404 / Loss 67.0249 
Loop 1398000 / Episode 96273 / Epsilon 0.6215 / Avg. Reward -0.083 / Avg. Length 16.694 / Loss 68.4219 
Saving model weights
Loop 1400000 / Episode 96382 / Epsilon 0.6210 / Avg. Reward  0.083 / Avg. Length 18.083 / Loss 68.4426 
Loop 1402000 / Episode 96487 / Epsilon 0.6205 / Avg. Reward  0.229 / Avg. Length 19.343 / Loss 70.9182 
Loop 1404000 / Episode

Loop 1538000 / Episode 104113 / Epsilon 0.5886 / Avg. Reward  0.163 / Avg. Length 18.942 / Loss 73.8803 
Saving model weights
Loop 1540000 / Episode 104221 / Epsilon 0.5881 / Avg. Reward  0.157 / Avg. Length 18.750 / Loss 71.3208 
Loop 1542000 / Episode 104330 / Epsilon 0.5876 / Avg. Reward  0.083 / Avg. Length 18.440 / Loss 75.1257 
Loop 1544000 / Episode 104440 / Epsilon 0.5872 / Avg. Reward  0.073 / Avg. Length 18.182 / Loss 73.8275 
Loop 1546000 / Episode 104549 / Epsilon 0.5867 / Avg. Reward  0.101 / Avg. Length 18.358 / Loss 72.3757 
Loop 1548000 / Episode 104657 / Epsilon 0.5862 / Avg. Reward  0.093 / Avg. Length 18.417 / Loss 71.2127 
Saving model weights
Loop 1550000 / Episode 104765 / Epsilon 0.5857 / Avg. Reward  0.130 / Avg. Length 18.583 / Loss 70.6454 
Loop 1552000 / Episode 104881 / Epsilon 0.5853 / Avg. Reward -0.052 / Avg. Length 17.009 / Loss 72.5856 
Loop 1554000 / Episode 104989 / Epsilon 0.5848 / Avg. Reward  0.111 / Avg. Length 18.528 / Loss 70.9414 
Loop 1556000 

Saving model weights
Loop 1690000 / Episode 112208 / Epsilon 0.5528 / Avg. Reward  0.140 / Avg. Length 18.645 / Loss 74.0847 
Loop 1692000 / Episode 112321 / Epsilon 0.5524 / Avg. Reward  0.009 / Avg. Length 17.593 / Loss 76.8858 
Loop 1694000 / Episode 112438 / Epsilon 0.5519 / Avg. Reward -0.051 / Avg. Length 17.077 / Loss 75.1267 
Loop 1696000 / Episode 112531 / Epsilon 0.5514 / Avg. Reward  0.462 / Avg. Length 21.570 / Loss 74.5727 
Loop 1698000 / Episode 112648 / Epsilon 0.5510 / Avg. Reward -0.034 / Avg. Length 17.043 / Loss 74.2369 
Saving model weights
Loop 1700000 / Episode 112748 / Epsilon 0.5505 / Avg. Reward  0.300 / Avg. Length 20.030 / Loss 75.0513 
Loop 1702000 / Episode 112841 / Epsilon 0.5500 / Avg. Reward  0.484 / Avg. Length 21.634 / Loss 78.9520 
Loop 1704000 / Episode 112925 / Epsilon 0.5496 / Avg. Reward  0.702 / Avg. Length 23.464 / Loss 76.3641 
Loop 1706000 / Episode 113024 / Epsilon 0.5491 / Avg. Reward  0.354 / Avg. Length 20.505 / Loss 75.5052 
Loop 1708000 

Loop 1842000 / Episode 119749 / Epsilon 0.5171 / Avg. Reward  0.257 / Avg. Length 19.644 / Loss 77.4337 
Loop 1844000 / Episode 119849 / Epsilon 0.5167 / Avg. Reward  0.300 / Avg. Length 20.190 / Loss 76.7164 
Loop 1846000 / Episode 119943 / Epsilon 0.5162 / Avg. Reward  0.383 / Avg. Length 20.638 / Loss 78.4907 
Loop 1848000 / Episode 120033 / Epsilon 0.5157 / Avg. Reward  0.633 / Avg. Length 22.800 / Loss 77.5333 
Saving model weights
Loop 1850000 / Episode 120138 / Epsilon 0.5152 / Avg. Reward  0.190 / Avg. Length 19.124 / Loss 76.4747 
Loop 1852000 / Episode 120231 / Epsilon 0.5148 / Avg. Reward  0.484 / Avg. Length 21.484 / Loss 78.2119 
Loop 1854000 / Episode 120318 / Epsilon 0.5143 / Avg. Reward  0.621 / Avg. Length 22.713 / Loss 78.2364 
Loop 1856000 / Episode 120410 / Epsilon 0.5138 / Avg. Reward  0.478 / Avg. Length 21.674 / Loss 75.8865 
Loop 1858000 / Episode 120505 / Epsilon 0.5134 / Avg. Reward  0.463 / Avg. Length 21.316 / Loss 75.8028 
Saving model weights
Loop 1860000 

Loop 1994000 / Episode 126909 / Epsilon 0.4814 / Avg. Reward  0.629 / Avg. Length 22.730 / Loss 79.4097 
Loop 1996000 / Episode 127001 / Epsilon 0.4809 / Avg. Reward  0.467 / Avg. Length 21.533 / Loss 78.7594 
Loop 1998000 / Episode 127089 / Epsilon 0.4805 / Avg. Reward  0.591 / Avg. Length 22.489 / Loss 77.5930 
Saving model weights
Loop 2000000 / Episode 127184 / Epsilon 0.4800 / Avg. Reward  0.453 / Avg. Length 21.526 / Loss 77.4685 
Loop 2002000 / Episode 127278 / Epsilon 0.4795 / Avg. Reward  0.394 / Avg. Length 20.819 / Loss 82.8447 
Loop 2004000 / Episode 127365 / Epsilon 0.4791 / Avg. Reward  0.690 / Avg. Length 23.460 / Loss 81.5909 
Loop 2006000 / Episode 127461 / Epsilon 0.4786 / Avg. Reward  0.385 / Avg. Length 20.813 / Loss 81.7056 
Loop 2008000 / Episode 127554 / Epsilon 0.4781 / Avg. Reward  0.430 / Avg. Length 21.269 / Loss 79.7224 
Saving model weights
Loop 2010000 / Episode 127647 / Epsilon 0.4776 / Avg. Reward  0.473 / Avg. Length 21.624 / Loss 79.4183 
Loop 2012000 

Loop 2146000 / Episode 133592 / Epsilon 0.4457 / Avg. Reward  0.753 / Avg. Length 23.882 / Loss 90.2018 
Loop 2148000 / Episode 133673 / Epsilon 0.4452 / Avg. Reward  0.778 / Avg. Length 24.222 / Loss 89.5145 
Saving model weights
Loop 2150000 / Episode 133752 / Epsilon 0.4447 / Avg. Reward  0.975 / Avg. Length 25.759 / Loss 88.2284 
Loop 2152000 / Episode 133826 / Epsilon 0.4443 / Avg. Reward  1.135 / Avg. Length 27.135 / Loss 92.3212 
Loop 2154000 / Episode 133912 / Epsilon 0.4438 / Avg. Reward  0.674 / Avg. Length 23.209 / Loss 91.5895 
Loop 2156000 / Episode 134005 / Epsilon 0.4433 / Avg. Reward  0.430 / Avg. Length 21.118 / Loss 90.2992 
Loop 2158000 / Episode 134079 / Epsilon 0.4429 / Avg. Reward  1.189 / Avg. Length 27.649 / Loss 91.0801 
Saving model weights
Loop 2160000 / Episode 134158 / Epsilon 0.4424 / Avg. Reward  0.823 / Avg. Length 24.823 / Loss 89.3661 
Loop 2162000 / Episode 134238 / Epsilon 0.4419 / Avg. Reward  0.900 / Avg. Length 25.488 / Loss 91.1022 
Loop 2164000 

Loop 2298000 / Episode 139678 / Epsilon 0.4100 / Avg. Reward  0.714 / Avg. Length 23.857 / Loss 95.0526 
Saving model weights
Loop 2300000 / Episode 139754 / Epsilon 0.4095 / Avg. Reward  1.013 / Avg. Length 26.079 / Loss 94.9400 
Loop 2302000 / Episode 139836 / Epsilon 0.4090 / Avg. Reward  0.805 / Avg. Length 24.488 / Loss 98.3880 
Loop 2304000 / Episode 139911 / Epsilon 0.4086 / Avg. Reward  1.027 / Avg. Length 26.453 / Loss 97.8183 
Loop 2306000 / Episode 139982 / Epsilon 0.4081 / Avg. Reward  1.225 / Avg. Length 27.958 / Loss 97.1698 
Loop 2308000 / Episode 140068 / Epsilon 0.4076 / Avg. Reward  0.709 / Avg. Length 23.605 / Loss 96.1473 
Saving model weights
Loop 2310000 / Episode 140145 / Epsilon 0.4071 / Avg. Reward  1.026 / Avg. Length 25.987 / Loss 96.7855 
Loop 2312000 / Episode 140213 / Epsilon 0.4067 / Avg. Reward  1.382 / Avg. Length 29.382 / Loss 102.3831 
Loop 2314000 / Episode 140305 / Epsilon 0.4062 / Avg. Reward  0.500 / Avg. Length 21.804 / Loss 101.4622 
Loop 231600

Loop 2448000 / Episode 145250 / Epsilon 0.3747 / Avg. Reward  1.239 / Avg. Length 28.070 / Loss 102.8372 
Saving model weights
Loop 2450000 / Episode 145318 / Epsilon 0.3742 / Avg. Reward  1.338 / Avg. Length 28.912 / Loss 102.0886 
Loop 2452000 / Episode 145385 / Epsilon 0.3738 / Avg. Reward  1.507 / Avg. Length 29.955 / Loss 106.1524 
Loop 2454000 / Episode 145450 / Epsilon 0.3733 / Avg. Reward  1.615 / Avg. Length 31.123 / Loss 103.2921 
Loop 2456000 / Episode 145523 / Epsilon 0.3728 / Avg. Reward  1.123 / Avg. Length 27.096 / Loss 101.6868 
Loop 2458000 / Episode 145599 / Epsilon 0.3724 / Avg. Reward  1.079 / Avg. Length 26.592 / Loss 102.6459 
Saving model weights
Loop 2460000 / Episode 145670 / Epsilon 0.3719 / Avg. Reward  1.282 / Avg. Length 28.366 / Loss 101.5363 
Loop 2462000 / Episode 145738 / Epsilon 0.3714 / Avg. Reward  1.382 / Avg. Length 29.382 / Loss 107.1750 
Loop 2464000 / Episode 145793 / Epsilon 0.3710 / Avg. Reward  1.982 / Avg. Length 34.236 / Loss 103.0949 
Loop

Loop 2598000 / Episode 150323 / Epsilon 0.3395 / Avg. Reward  0.589 / Avg. Length 22.622 / Loss 114.2256 
Saving model weights
Loop 2600000 / Episode 150377 / Epsilon 0.3390 / Avg. Reward  2.185 / Avg. Length 36.000 / Loss 112.5202 
Loop 2602000 / Episode 150442 / Epsilon 0.3385 / Avg. Reward  1.646 / Avg. Length 31.354 / Loss 111.9143 
Loop 2604000 / Episode 150513 / Epsilon 0.3381 / Avg. Reward  1.169 / Avg. Length 27.465 / Loss 111.2936 
Loop 2606000 / Episode 150581 / Epsilon 0.3376 / Avg. Reward  1.529 / Avg. Length 30.412 / Loss 109.8955 
Loop 2608000 / Episode 150639 / Epsilon 0.3371 / Avg. Reward  1.879 / Avg. Length 33.621 / Loss 107.6491 
Saving model weights
Loop 2610000 / Episode 150704 / Epsilon 0.3366 / Avg. Reward  1.600 / Avg. Length 30.908 / Loss 108.7956 
Loop 2612000 / Episode 150765 / Epsilon 0.3362 / Avg. Reward  1.803 / Avg. Length 32.967 / Loss 112.2253 
Loop 2614000 / Episode 150833 / Epsilon 0.3357 / Avg. Reward  1.426 / Avg. Length 29.456 / Loss 112.5847 
Loop

Loop 2748000 / Episode 154934 / Epsilon 0.3042 / Avg. Reward  3.800 / Avg. Length 49.600 / Loss 126.4180 
Saving model weights
Loop 2750000 / Episode 154996 / Epsilon 0.3037 / Avg. Reward  1.645 / Avg. Length 31.597 / Loss 121.8151 
Loop 2752000 / Episode 155062 / Epsilon 0.3033 / Avg. Reward  1.591 / Avg. Length 31.212 / Loss 129.1319 
Loop 2754000 / Episode 155118 / Epsilon 0.3028 / Avg. Reward  2.143 / Avg. Length 35.589 / Loss 127.6013 
Loop 2756000 / Episode 155181 / Epsilon 0.3023 / Avg. Reward  1.667 / Avg. Length 31.444 / Loss 128.2187 
Loop 2758000 / Episode 155240 / Epsilon 0.3019 / Avg. Reward  2.000 / Avg. Length 34.271 / Loss 125.3445 
Saving model weights
Loop 2760000 / Episode 155297 / Epsilon 0.3014 / Avg. Reward  2.018 / Avg. Length 34.930 / Loss 126.3669 
Loop 2762000 / Episode 155357 / Epsilon 0.3009 / Avg. Reward  1.900 / Avg. Length 33.167 / Loss 125.6026 
Loop 2764000 / Episode 155418 / Epsilon 0.3005 / Avg. Reward  1.803 / Avg. Length 33.131 / Loss 124.5914 
Loop

Loop 2898000 / Episode 159062 / Epsilon 0.2690 / Avg. Reward  2.179 / Avg. Length 35.679 / Loss 129.9994 
Saving model weights
Loop 2900000 / Episode 159107 / Epsilon 0.2685 / Avg. Reward  2.911 / Avg. Length 42.400 / Loss 129.8286 
Loop 2902000 / Episode 159171 / Epsilon 0.2680 / Avg. Reward  1.766 / Avg. Length 32.531 / Loss 133.1023 
Loop 2904000 / Episode 159217 / Epsilon 0.2676 / Avg. Reward  3.000 / Avg. Length 43.043 / Loss 131.5063 
Loop 2906000 / Episode 159263 / Epsilon 0.2671 / Avg. Reward  3.196 / Avg. Length 44.217 / Loss 129.6503 
Loop 2908000 / Episode 159309 / Epsilon 0.2666 / Avg. Reward  3.022 / Avg. Length 43.565 / Loss 131.5609 
Saving model weights
Loop 2910000 / Episode 159349 / Epsilon 0.2661 / Avg. Reward  3.800 / Avg. Length 49.775 / Loss 131.6977 
Loop 2912000 / Episode 159403 / Epsilon 0.2657 / Avg. Reward  2.333 / Avg. Length 37.222 / Loss 135.1070 
Loop 2914000 / Episode 159460 / Epsilon 0.2652 / Avg. Reward  1.930 / Avg. Length 33.684 / Loss 131.1566 
Loop

Loop 3048000 / Episode 162756 / Epsilon 0.2337 / Avg. Reward  3.064 / Avg. Length 43.149 / Loss 140.5120 
Saving model weights
Loop 3050000 / Episode 162802 / Epsilon 0.2332 / Avg. Reward  2.978 / Avg. Length 43.022 / Loss 137.6381 
Loop 3052000 / Episode 162853 / Epsilon 0.2328 / Avg. Reward  2.569 / Avg. Length 39.118 / Loss 143.1264 
Loop 3054000 / Episode 162888 / Epsilon 0.2323 / Avg. Reward  4.314 / Avg. Length 54.171 / Loss 138.2970 
Loop 3056000 / Episode 162922 / Epsilon 0.2318 / Avg. Reward  4.647 / Avg. Length 56.735 / Loss 139.6455 
Loop 3058000 / Episode 162967 / Epsilon 0.2314 / Avg. Reward  3.733 / Avg. Length 49.044 / Loss 139.2930 
Saving model weights
Loop 3060000 / Episode 163007 / Epsilon 0.2309 / Avg. Reward  3.825 / Avg. Length 50.000 / Loss 136.9502 
Loop 3062000 / Episode 163046 / Epsilon 0.2304 / Avg. Reward  4.000 / Avg. Length 51.359 / Loss 142.6460 
Loop 3064000 / Episode 163094 / Epsilon 0.2300 / Avg. Reward  2.750 / Avg. Length 40.812 / Loss 141.6770 
Loop

Loop 3198000 / Episode 165897 / Epsilon 0.1985 / Avg. Reward  5.312 / Avg. Length 62.156 / Loss 147.0803 
Saving model weights
Loop 3200000 / Episode 165942 / Epsilon 0.1980 / Avg. Reward  3.111 / Avg. Length 43.844 / Loss 147.4139 
Loop 3202000 / Episode 165980 / Epsilon 0.1975 / Avg. Reward  4.158 / Avg. Length 52.658 / Loss 150.3824 
Loop 3204000 / Episode 166027 / Epsilon 0.1971 / Avg. Reward  2.872 / Avg. Length 42.085 / Loss 151.1828 
Loop 3206000 / Episode 166068 / Epsilon 0.1966 / Avg. Reward  3.854 / Avg. Length 50.073 / Loss 149.6246 
Loop 3208000 / Episode 166105 / Epsilon 0.1961 / Avg. Reward  4.270 / Avg. Length 54.027 / Loss 150.2820 
Saving model weights
Loop 3210000 / Episode 166149 / Epsilon 0.1956 / Avg. Reward  3.295 / Avg. Length 45.841 / Loss 153.1946 
Loop 3212000 / Episode 166179 / Epsilon 0.1952 / Avg. Reward  5.433 / Avg. Length 63.733 / Loss 159.9088 
Loop 3214000 / Episode 166224 / Epsilon 0.1947 / Avg. Reward  3.222 / Avg. Length 45.444 / Loss 157.2853 
Loop

Loop 3348000 / Episode 168595 / Epsilon 0.1632 / Avg. Reward  4.970 / Avg. Length 59.576 / Loss 159.3275 
Saving model weights
Loop 3350000 / Episode 168633 / Epsilon 0.1627 / Avg. Reward  4.184 / Avg. Length 53.263 / Loss 162.1569 
Loop 3352000 / Episode 168663 / Epsilon 0.1623 / Avg. Reward  5.900 / Avg. Length 67.100 / Loss 167.1681 
Loop 3354000 / Episode 168698 / Epsilon 0.1618 / Avg. Reward  4.743 / Avg. Length 57.486 / Loss 163.6308 
Loop 3356000 / Episode 168723 / Epsilon 0.1613 / Avg. Reward  7.080 / Avg. Length 77.640 / Loss 164.4156 
Loop 3358000 / Episode 168763 / Epsilon 0.1609 / Avg. Reward  4.050 / Avg. Length 51.600 / Loss 161.7446 
Saving model weights
Loop 3360000 / Episode 168805 / Epsilon 0.1604 / Avg. Reward  3.548 / Avg. Length 47.357 / Loss 163.3915 
Loop 3362000 / Episode 168837 / Epsilon 0.1599 / Avg. Reward  4.906 / Avg. Length 58.656 / Loss 163.4999 
Loop 3364000 / Episode 168872 / Epsilon 0.1595 / Avg. Reward  5.086 / Avg. Length 61.143 / Loss 160.6196 

Loop 3498000 / Episode 170760 / Epsilon 0.1280 / Avg. Reward  5.774 / Avg. Length 66.806 / Loss 180.7564 
Saving model weights
Loop 3500000 / Episode 170783 / Epsilon 0.1275 / Avg. Reward  7.217 / Avg. Length 78.870 / Loss 181.5934 
Loop 3502000 / Episode 170811 / Epsilon 0.1270 / Avg. Reward  7.000 / Avg. Length 77.321 / Loss 186.2990 
Loop 3504000 / Episode 170839 / Epsilon 0.1266 / Avg. Reward  5.929 / Avg. Length 67.429 / Loss 180.2319 
Loop 3506000 / Episode 170868 / Epsilon 0.1261 / Avg. Reward  6.276 / Avg. Length 71.103 / Loss 184.2708 
Loop 3508000 / Episode 170900 / Epsilon 0.1256 / Avg. Reward  5.156 / Avg. Length 61.094 / Loss 177.4814 
Saving model weights
Loop 3510000 / Episode 170929 / Epsilon 0.1251 / Avg. Reward  6.448 / Avg. Length 71.793 / Loss 184.9417 
Loop 3512000 / Episode 170949 / Epsilon 0.1247 / Avg. Reward  9.350 / Avg. Length 96.750 / Loss 189.8010 
Loop 3514000 / Episode 170971 / Epsilon 0.1242 / Avg. Reward  8.773 / Avg. Length 92.182 / Loss 188.3912 

Loop 3648000 / Episode 172421 / Epsilon 0.0927 / Avg. Reward 11.765 / Avg. Length 117.941 / Loss 192.8348 
Saving model weights
Loop 3650000 / Episode 172444 / Epsilon 0.0922 / Avg. Reward  7.391 / Avg. Length 80.217 / Loss 194.4802 
Loop 3652000 / Episode 172461 / Epsilon 0.0918 / Avg. Reward 12.824 / Avg. Length 125.824 / Loss 200.5198 
Loop 3654000 / Episode 172480 / Epsilon 0.0913 / Avg. Reward  9.737 / Avg. Length 99.368 / Loss 192.1170 
Loop 3656000 / Episode 172501 / Epsilon 0.0908 / Avg. Reward  9.952 / Avg. Length 102.524 / Loss 194.9035 
Loop 3658000 / Episode 172518 / Epsilon 0.0904 / Avg. Reward 11.647 / Avg. Length 116.882 / Loss 193.5189 
Saving model weights
Loop 3660000 / Episode 172539 / Epsilon 0.0899 / Avg. Reward  8.810 / Avg. Length 92.905 / Loss 191.7188 
Loop 3662000 / Episode 172554 / Epsilon 0.0894 / Avg. Reward 14.067 / Avg. Length 137.533 / Loss 202.9002 
Loop 3664000 / Episode 172576 / Epsilon 0.0890 / Avg. Reward  8.364 / Avg. Length 88.091 / Loss 201.2295 

Loop 3796000 / Episode 173557 / Epsilon 0.0579 / Avg. Reward 36.000 / Avg. Length 323.000 / Loss 219.1211 
Loop 3798000 / Episode 173571 / Epsilon 0.0575 / Avg. Reward 16.500 / Avg. Length 157.857 / Loss 222.4497 
Saving model weights
Loop 3800000 / Episode 173587 / Epsilon 0.0570 / Avg. Reward 12.625 / Avg. Length 124.125 / Loss 219.3792 
Loop 3802000 / Episode 173600 / Epsilon 0.0565 / Avg. Reward 14.692 / Avg. Length 142.769 / Loss 234.6546 
Loop 3804000 / Episode 173614 / Epsilon 0.0561 / Avg. Reward 15.071 / Avg. Length 146.214 / Loss 224.5973 
Loop 3806000 / Episode 173630 / Epsilon 0.0556 / Avg. Reward 13.375 / Avg. Length 131.562 / Loss 222.3192 
Loop 3808000 / Episode 173639 / Epsilon 0.0551 / Avg. Reward 22.000 / Avg. Length 203.444 / Loss 219.3355 
Saving model weights
Loop 3810000 / Episode 173650 / Epsilon 0.0546 / Avg. Reward 21.091 / Avg. Length 196.000 / Loss 215.5241 
Loop 3812000 / Episode 173660 / Epsilon 0.0542 / Avg. Reward 20.400 / Avg. Length 190.900 / Loss 221.2

Loop 3944000 / Episode 174219 / Epsilon 0.0232 / Avg. Reward 65.750 / Avg. Length 572.500 / Loss 177.3754 
Loop 3946000 / Episode 174223 / Epsilon 0.0227 / Avg. Reward 58.500 / Avg. Length 513.000 / Loss 175.7493 
Loop 3948000 / Episode 174229 / Epsilon 0.0222 / Avg. Reward 38.167 / Avg. Length 339.667 / Loss 177.9578 
Saving model weights
Loop 3950000 / Episode 174234 / Epsilon 0.0217 / Avg. Reward 42.200 / Avg. Length 374.800 / Loss 168.8184 
Loop 3952000 / Episode 174240 / Epsilon 0.0213 / Avg. Reward 36.000 / Avg. Length 324.833 / Loss 171.9087 
Loop 3954000 / Episode 174245 / Epsilon 0.0208 / Avg. Reward 31.400 / Avg. Length 287.400 / Loss 175.8589 
Loop 3956000 / Episode 174252 / Epsilon 0.0203 / Avg. Reward 32.429 / Avg. Length 293.143 / Loss 178.6128 
Loop 3958000 / Episode 174257 / Epsilon 0.0199 / Avg. Reward 58.800 / Avg. Length 513.200 / Loss 179.7225 
Saving model weights
Loop 3960000 / Episode 174262 / Epsilon 0.0194 / Avg. Reward 43.200 / Avg. Length 385.600 / Loss 163.4
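
The epsilon values in the log are consistent with a linear annealing schedule built from `INITIAL_EPSILON`, `FINAL_EPSILON` and `EXPLORE` as defined at the top of the notebook. The sketch below is an assumption rather than the training code itself, which is not shown in this section and may decrement epsilon step by step. `epsilon_at` is a hypothetical helper written in closed form; it gives 0.2309, 0.1275 and 0.0194 at loops 3060000, 3500000 and 3960000, matching the logged values.

```python
# Constants copied from the top of the notebook.
INITIAL_EPSILON = 0.95
FINAL_EPSILON = 0.01
EXPLORE = 4000000.


def epsilon_at(step):
    """Linearly annealed epsilon after `step` training loops (hypothetical helper)."""
    decay_per_step = (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE
    # Clamp at FINAL_EPSILON once the annealing window is over.
    return max(FINAL_EPSILON, INITIAL_EPSILON - decay_per_step * step)


for step in (3060000, 3500000, 3960000):
    print("Loop {} / Epsilon {:.4f}".format(step, epsilon_at(step)))
```

Under this schedule epsilon would reach `FINAL_EPSILON` at loop 4000000, shortly after the last logged line.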

In [62]:
# Run the trained model greedily (no exploration)

inference_model = build_model()
zString = "catcher_training_weights/model_3000000.h5"
#zString = "catcher_training_weights/model_2950000.h5"
inference_model.load_weights(zString)

game = Catcher() # create our game

fps = 30  # fps we want to run at
frame_skip = 2
num_steps = 2
force_fps = False
display_screen = True

# make a PLE instance.
ple_env = PLE(game, fps=fps, frame_skip=frame_skip, num_steps=num_steps,
        force_fps=force_fps, display_screen=display_screen)

# init agent and game.
ple_env.init()
ple_env.act(action_list[np.random.randint(0,2)])

done_check = ple_env.lives()  # starting life count; any change marks the end of an episode

x_t = ple_env.getScreenGrayscale()/255. 
s_t = np.stack((x_t, x_t, x_t, x_t), axis=2)

total_reward = 0
total_steps = 0

for _ in range(1000000):
    # add a batch axis: (64, 64, 4) -> (1, 64, 64, 4)
    q = inference_model.predict(s_t[np.newaxis, ...])
    action_index = np.argmax(q)  # greedy action, no exploration
    action_command = action_list[action_index]
            
    total_reward += ple_env.act(action_command)
    total_steps += 1
    
    # get next state
    x_t1 = ple_env.getScreenGrayscale()/255.
    x_t1 = x_t1.reshape(x_t1.shape[0], x_t1.shape[1], 1)  # 64x64x1
    s_t1 = np.append(x_t1, s_t[:, :, :3], axis=2)  # newest frame goes first; the oldest (last channel) is dropped
    
    # a life was lost: report the episode and restart the game
    if ple_env.lives() != done_check:
        print("Life lost: reward {}, steps {}".format(total_reward,total_steps))
        ple_env.reset_game()
        ple_env.act(action_list[np.random.randint(0,2)])
        x_t = ple_env.getScreenGrayscale()/255. 
        s_t = np.stack((x_t, x_t, x_t, x_t), axis=2)
        total_reward = 0
        total_steps = 0
    else:
        s_t = s_t1
    

Life lost: reward 18.0, steps 169
Life lost: reward 84.0, steps 733
Life lost: reward 20.0, steps 185
Life lost: reward 60.0, steps 526
Life lost: reward 69.0, steps 600
Life lost: reward 23.0, steps 214
Life lost: reward 44.0, steps 392
Life lost: reward 32.0, steps 282
Life lost: reward 11.0, steps 109
Life lost: reward 98.0, steps 842
Life lost: reward 119.0, steps 1033
Life lost: reward 93.0, steps 809
Life lost: reward 94.0, steps 826
Life lost: reward 97.0, steps 849
Life lost: reward 71.0, steps 622
Life lost: reward 194.0, steps 1657
Life lost: reward 90.0, steps 792
Life lost: reward 65.0, steps 576
Life lost: reward 89.0, steps 768
Life lost: reward 12.0, steps 119
Life lost: reward 13.0, steps 124
Life lost: reward 320.0, steps 2726
Life lost: reward 58.0, steps 504
Life lost: reward 59.0, steps 511
Life lost: reward 174.0, steps 1497
Life lost: reward 71.0, steps 623
Life lost: reward 43.0, steps 378
Life lost: reward 106.0, steps 917
Life lost: reward 95.0, steps 827

KeyboardInterrupt: 
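
The frame-stack update in the inference loop can be checked in isolation. This is a minimal sketch using tiny 2x2 arrays in place of the 64x64 screens; `push_frame` is a hypothetical helper that wraps the same `reshape` and `np.append` calls used above. It shows that the newest frame ends up in channel 0 and the oldest frame is dropped from the end.

```python
import numpy as np


def push_frame(stack, frame):
    """Return a new (H, W, 4) stack with `frame` first and the oldest frame dropped."""
    frame = frame.reshape(frame.shape[0], frame.shape[1], 1)
    return np.append(frame, stack[:, :, :3], axis=2)


# Tiny 2x2 "frames" filled with their own index, so the order is easy to read.
first = np.full((2, 2), 0.0)
stack = np.stack((first, first, first, first), axis=2)
for i in range(1, 4):
    stack = push_frame(stack, np.full((2, 2), float(i)))

print(stack[0, 0, :])  # channel order for one pixel: [3. 2. 1. 0.]
```

Whatever ordering is used, it has to match the one used during training, since the convolutional layers learn channel-specific weights.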

### Notes:
* Performance is still underwhelming given how simple Catcher is: episode rewards in the output shown above vary widely, from 11 to 320.
* The hyperparameters could use further tuning.
* The agent has an odd habit: it moves all the way to the right and then shifts left, even when that is not needed. One possible cause is that the greedy policy derived from the DQN is deterministic; a stochastic policy (for example, one learned with a policy gradient method) might lead to a better solution.
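
One cheap way to test the last point without retraining would be to sample actions from the existing Q-values instead of always taking the argmax. The sketch below uses Boltzmann (softmax) sampling. It is not a policy gradient method and not part of the original notebook; `boltzmann_action` and its `temperature` parameter are hypothetical names, and the Q-values are made up for illustration.

```python
import numpy as np


def boltzmann_action(q_values, temperature=1.0, rng=None):
    """Sample an action index with probability proportional to exp(Q / temperature)."""
    rng = np.random.default_rng() if rng is None else rng
    q = np.asarray(q_values, dtype=np.float64).ravel()
    logits = (q - q.max()) / temperature  # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(q), p=probs)), probs


# Made-up Q-values shaped like the output of inference_model.predict: (1, ACTIONS).
q = np.array([[1.2, 1.0]])
action_index, probs = boltzmann_action(q, temperature=0.5,
                                       rng=np.random.default_rng(0))
print(action_index, probs)
```

In the inference loop, `action_index` would replace `np.argmax(q)`. A low temperature approaches the greedy policy, and a high one approaches uniform random actions, so the value would need tuning against the scale of the learned Q-values.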