<h1> Deep Q-Network on Image Sequences to Play Breakout </h1>

In [1]:
from PIL import Image  # To transform the image in the Processor
import numpy as np
import gym

# Convolutional Backbone Network
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten, Convolution2D, Permute
from tensorflow.keras.optimizers import Adam

# Keras-RL
from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory
from rl.core import Processor
from rl.callbacks import FileLogger, ModelIntervalCheckpoint

In [2]:
env = gym.make("BreakoutDeterministic-v4")
nb_actions = env.action_space.n  # number of discrete actions

In [3]:
IMG_SHAPE = (84, 84)  # size of each frame after resizing
WINDOW_LENGTH = 4  # number of consecutive frames stacked into one state
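
To make the role of `WINDOW_LENGTH` concrete, here is a minimal sketch of a sliding window over the most recent frames. It is not Keras-RL's own code: the helper names `make_window` and `push_frame` are made up for illustration, strings stand in for the 84x84 arrays, and padding the start of an episode with blank frames is an assumption.

```python
from collections import deque

WINDOW_LENGTH = 4  # same value as in the notebook


def make_window(window_length, blank_frame):
    """Return a deque pre-filled with blank frames (assumed padding)."""
    return deque([blank_frame] * window_length, maxlen=window_length)


def push_frame(window, frame):
    """Append the newest frame; the deque drops the oldest one automatically."""
    window.append(frame)
    return list(window)


# Frames are stand-in labels here; in the notebook they are 84x84 arrays.
window = make_window(WINDOW_LENGTH, blank_frame="blank")
for t in range(6):
    state = push_frame(window, f"frame{t}")

print(state)
```

After six frames the state holds only the four most recent ones, which is what lets the network see motion (for example the direction of the ball) in a single input.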

Now we create the image processor. It is the same processor as in the preprocessing notebook, except that it also rescales the pixel values into the [0, 1] interval, which often shortens the necessary training time. <br />

In [5]:
class ImageProcessor(Processor):
    def process_observation(self, observation):
        # First convert the numpy array to a PIL Image
        img = Image.fromarray(observation)
        # Then resize the image
        img = img.resize(IMG_SHAPE)
        # And convert it to grayscale  (The L stands for luminance)
        img = img.convert("L")
        # Convert the image back to a numpy array and finally return the image
        img = np.array(img)
        return img.astype('uint8')  # saves storage in experience memory
    
    def process_state_batch(self, batch):

        # Divide by 255 to map the pixel values into the interval [0, 1],
        # which helps the network train. Doing this here, per sampled batch,
        # lets the replay memory keep the frames as compact uint8 arrays.
        processed_batch = batch.astype('float32') / 255.
        return processed_batch

    def process_reward(self, reward):
        return np.clip(reward, -1., 1.)
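
The two numeric transformations in the processor can be illustrated on plain Python lists. This is only a sketch: `scale_pixels` and `clip_reward` are hypothetical helper names, and the reward values are made up rather than taken from the game.

```python
def scale_pixels(pixels):
    """Map 8-bit pixel values (0..255) to floats in [0, 1]."""
    return [p / 255.0 for p in pixels]


def clip_reward(reward, low=-1.0, high=1.0):
    """Clamp a raw reward into [low, high], as np.clip does for a scalar."""
    return max(low, min(high, reward))


print(scale_pixels([0, 51, 255]))
# Made-up raw rewards: anything above 1 is reduced to 1, anything below -1 to -1.
print([clip_reward(r) for r in (0.0, 1.0, 4.0, 7.0, -3.0)])
```

Clipping keeps the scale of the temporal-difference errors bounded, at the cost of the agent no longer distinguishing small from large positive rewards.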


As our input consists of 4 consecutive frames, each having the shape $(84 \times 84)$, Keras-RL passes states of shape $(4 \times 84 \times 84)$ to the network.
But as the convolutional layers expect channels-last input of shape $(84 \times 84 \times 4)$, a Permute layer is added at the beginning to move the frame axis to the end.
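
The axis reordering can be checked without Keras. The sketch below mimics the `(2, 3, 1)` pattern used by the Permute layer in the model (1-indexed, batch axis excluded) on a shape tuple and on a tiny nested list; `permute_shape` and `permute_window` are hypothetical helpers written for this illustration.

```python
def permute_shape(shape, dims):
    """Shape after a Keras-style Permute; dims are 1-indexed, batch excluded."""
    return tuple(shape[d - 1] for d in dims)


def permute_window(window):
    """Reorder a nested list from [frame][row][col] to [row][col][frame]."""
    n_frames, n_rows, n_cols = len(window), len(window[0]), len(window[0][0])
    return [[[window[f][r][c] for f in range(n_frames)]
             for c in range(n_cols)]
            for r in range(n_rows)]


print(permute_shape((4, 84, 84), (2, 3, 1)))

# Tiny stand-in: 2 frames of 2x3 pixels, each pixel tagged with its frame index.
tiny = [[[f for _ in range(3)] for _ in range(2)] for f in range(2)]
moved = permute_window(tiny)
print(len(moved), len(moved[0]), len(moved[0][0]))
print(moved[0][0])  # one pixel position, now holding one value per frame
```

After the permutation, each pixel position carries one value per frame, so the frames play the role that colour channels play in an ordinary image.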


In [7]:
input_shape = (WINDOW_LENGTH, IMG_SHAPE[0], IMG_SHAPE[1])
input_shape

(4, 84, 84)

Now it is time to define the network!
We use He normal weight initialization for the convolutional layers.
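
As a rough illustration of what He normal initialization does, the sketch below computes its target standard deviation, $\sqrt{2 / \text{fan\_in}}$, for the first convolution. The helper names are made up, and details of the Keras implementation (such as drawing from a truncated normal) are not modelled.

```python
import math


def conv_fan_in(kernel_h, kernel_w, in_channels):
    """Number of inputs feeding one output unit of a 2D convolution."""
    return kernel_h * kernel_w * in_channels


def he_normal_stddev(fan_in):
    """Target standard deviation of He normal initialization: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)


# First convolution of the notebook: 8x8 kernel over 4 stacked frames.
fan_in = conv_fan_in(8, 8, 4)
stddev = he_normal_stddev(fan_in)
print(fan_in, round(stddev, 4))
```

Scaling the weights by the fan-in in this way is meant to keep the variance of the activations roughly constant across ReLU layers.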

In [8]:
model = Sequential()
model.add(Permute((2, 3, 1), input_shape=input_shape))

model.add(Convolution2D(32, (8, 8), strides=(4, 4), kernel_initializer='he_normal'))
model.add(Activation('relu'))
model.add(Convolution2D(64, (4, 4), strides=(2, 2), kernel_initializer='he_normal'))
model.add(Activation('relu'))
model.add(Convolution2D(64, (3, 3), strides=(1, 1), kernel_initializer='he_normal'))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
model.summary()  # prints the layer table itself and returns None


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
permute (Permute)            (None, 84, 84, 4)         0         
_________________________________________________________________
conv2d (Conv2D)              (None, 20, 20, 32)        8224      
_________________________________________________________________
activation (Activation)      (None, 20, 20, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 9, 9, 64)          32832     
_________________________________________________________________
activation_1 (Activation)    (None, 9, 9, 64)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 7, 7, 64)          36928     
_________________________________________________________________
activation_2 (Activation)    (None, 7, 7, 64)          0
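
The output shapes and parameter counts of the three convolutions in the summary above can be re-derived by hand. The sketch below assumes unpadded ("valid") convolutions with square kernels, which matches the layer definitions; the helper names are made up for this illustration.

```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square kernel."""
    return kernel * kernel * in_channels * filters + filters


# (kernel, stride, filters) for the three convolutions in the model.
layers = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]
size, channels = 84, 4
derived = []
for kernel, stride, filters in layers:
    params = conv_params(kernel, channels, filters)
    size, channels = conv_output_size(size, kernel, stride), filters
    derived.append((size, channels, params))
    print(f"{size}x{size}x{channels}: {params} parameters")
```

The derived values agree with the `conv2d`, `conv2d_1` and `conv2d_2` rows of the summary.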

Defining the sequential memory

In [10]:
memory = SequentialMemory(limit=1000000, window_length=WINDOW_LENGTH)
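
A back-of-the-envelope estimate shows why the processor stores frames as `uint8`. The sketch assumes that each 84x84 frame is stored once and counts only raw pixel bytes, ignoring Python object overhead and the stored actions, rewards and terminal flags; `replay_bytes` is a hypothetical helper.

```python
def replay_bytes(n_frames, height, width, bytes_per_pixel):
    """Raw pixel bytes for n_frames stored frames, ignoring all overhead."""
    return n_frames * height * width * bytes_per_pixel


LIMIT = 1_000_000  # same limit as the SequentialMemory above
as_uint8 = replay_bytes(LIMIT, 84, 84, 1)
as_float32 = replay_bytes(LIMIT, 84, 84, 4)

print(f"uint8:   {as_uint8 / 1e9:.2f} GB")
print(f"float32: {as_float32 / 1e9:.2f} GB")
```

Under these assumptions a full memory needs about 7 GB as `uint8` and four times as much as `float32`, so a smaller `limit` may be necessary on machines with less RAM.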

Then we define the processor

In [11]:
processor = ImageProcessor()

I have used the LinearAnnealedPolicy to implement epsilon-greedy action selection with a decaying epsilon.
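
Epsilon-greedy selection itself is simple enough to sketch in a few lines. This is an illustration of the idea rather than the Keras-RL source: `eps_greedy_action` is a made-up name and the Q-values are hypothetical.

```python
import random


def eps_greedy_action(q_values, eps, rng):
    """With probability eps pick a uniformly random action, else the argmax."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)  # seeded only to make the sketch reproducible
q_values = [0.1, 0.7, 0.2, 0.4]  # hypothetical Q-values for 4 actions

greedy = eps_greedy_action(q_values, eps=0.0, rng=rng)
picks = [eps_greedy_action(q_values, eps=0.1, rng=rng) for _ in range(1000)]
print(greedy, picks.count(1) / len(picks))
```

With `eps=0.1` the best action is still chosen most of the time, while the remaining random picks keep the agent exploring.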

As this network needs to train for at least a million steps, I have set the number of annealing steps to 1,000,000.

In [13]:
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1., value_min=.1, value_test=.05,
                              nb_steps=1000000)
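
The schedule defined above can be written as a small function. This is a sketch of linear annealing with the same `value_max`, `value_min` and `nb_steps` as in the cell, not a copy of the library code; `annealed_eps` is a hypothetical name and `value_test` is not modelled.

```python
def annealed_eps(step, value_max=1.0, value_min=0.1, nb_steps=1_000_000):
    """Linearly decay from value_max to value_min over nb_steps, then hold."""
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, slope * step + value_max)


for step in (0, 250_000, 500_000, 1_000_000, 1_500_000):
    print(step, round(annealed_eps(step), 3))
```

Epsilon starts at 1.0 (fully random play), reaches 0.55 halfway through the annealing period and stays at 0.1 from step 1,000,000 onward.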

Defining and compiling the agent

In [14]:
dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy, memory=memory,
               processor=processor, nb_steps_warmup=50000, gamma=.99, target_model_update=10000,
               train_interval=4, delta_clip=1)
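
Two of the agent's hyperparameters, `gamma` and `delta_clip`, are easier to read next to the formulas they enter. The sketch below assumes plain one-step Q-learning targets (no double DQN) and a Huber loss with threshold `delta_clip`; the function names and the numbers are made up for illustration.

```python
def td_target(reward, next_q_values, terminal, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a Q_target(s', a)."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def huber_loss(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error * error
    return delta * (abs_error - 0.5 * delta)


# Hypothetical numbers for a single transition.
target = td_target(reward=1.0, next_q_values=[0.5, 2.0, 1.0], terminal=False)
small = huber_loss(target - 2.5)  # small error: quadratic branch
large = huber_loss(3.0)           # large error: linear branch
print(round(target, 4), round(small, 4), large)
```

The linear branch limits the gradient of large errors, which tends to make training less sensitive to occasional outlier targets.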

In [15]:
dqn.compile(Adam(learning_rate=.00025), metrics=['mae'])

In [15]:
weights_filename = 'dqn_breakout_weights.h5f'
checkpoint_weights_filename = 'dqn_' + "BreakoutDeterministic-v4" + '_weights_{step}.h5f'
checkpoint_callback = ModelIntervalCheckpoint(checkpoint_weights_filename, interval=100000)
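
The `{step}` placeholder in the checkpoint filename is filled in with the current step count. As a sketch, assuming one checkpoint is written every `interval` steps, a 500,000-step run would produce the file names listed below; `checkpoint_names` is a hypothetical helper, not part of Keras-RL.

```python
ENV_NAME = "BreakoutDeterministic-v4"
template = "dqn_" + ENV_NAME + "_weights_{step}.h5f"


def checkpoint_names(template, total_steps, interval):
    """File names produced if a checkpoint is written every `interval` steps."""
    return [template.format(step=step)
            for step in range(interval, total_steps + 1, interval)]


names = checkpoint_names(template, total_steps=500_000, interval=100_000)
for name in names:
    print(name)
```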


If you do not want to spend time on the initial training, you can load pre-trained weights with the **load_weights()** function provided by TensorFlow. <br />

Note that you need to start from a reduced epsilon if you load my pre-trained weights and continue training from there.

If you want to see the results and the performance of the DQN after 1.2 million steps of training, go and run the evaluation cells at the end of this notebook.

<h1> Run the cells below if you want to train </h1>

In [16]:
# Load the weights
model.load_weights("weights/dqn_BreakoutDeterministic-v4_weights_900000.h5f")

# Update the policy to start with a smaller epsilon
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=0.3, value_min=.1, value_test=.05,
                              nb_steps=100000)


# Initialize the DQNAgent with the new model and updated policy and compile it
dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy, memory=memory,
               processor=processor, nb_steps_warmup=50000, gamma=.99, target_model_update=10000)
dqn.compile(Adam(learning_rate=.00025), metrics=['mae'])

# And train the model
dqn.fit(env, nb_steps=500000, callbacks=[checkpoint_callback], log_interval=10000, visualize=False)


Training for 500000 steps ...
Interval 1 (0 steps performed)
   10/10000 [..............................] - ETA: 1:00 - reward: 0.0000e+00
  459/10000 [>.............................] - ETA: 55s - reward: 0.0109
done, took 2.784 seconds


<keras.callbacks.History at 0x20a472bd1f0>

<b> Testing the trained model </b>

In [None]:
dqn.test(env, nb_episodes=5, visualize=True)

Testing for 5 episodes ...




<b> Training the model and saving the final weights </b>

In [None]:
dqn.fit(env, nb_steps=1500000, callbacks=[checkpoint_callback], log_interval=10000, visualize=False)

# After training is done, we save the final weights one more time.
dqn.save_weights(weights_filename, overwrite=True)

In [None]:
dqn.test(env, nb_episodes=5, visualize=True)

<h1> Run these cells if you only want to evaluate performance on my weights </h1>

In [None]:
# Load the weights
model.load_weights("weights/dqn_BreakoutDeterministic-v4_weights_100000.h5f")

# You can choose any policy for evaluation; here epsilon is fixed at 0.1.
policy = EpsGreedyQPolicy(0.1)


# Initialize the DQNAgent with the new model and updated policy and compile it
dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy, memory=memory,
               processor=processor)
dqn.compile(Adam(learning_rate=.00025), metrics=['mae'])

In [None]:
dqn.test(env, nb_episodes=5, visualize=True)

In [None]:
# Load the weights
model.load_weights("weights/dqn_BreakoutDeterministic-v4_weights_600000.h5f")

# You can choose any policy for evaluation; here epsilon is fixed at 0.1.
policy = EpsGreedyQPolicy(0.1)


# Initialize the DQNAgent with the new model and updated policy and compile it
dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy, memory=memory,
               processor=processor)
dqn.compile(Adam(learning_rate=.00025), metrics=['mae'])


In [None]:
dqn.test(env, nb_episodes=5, visualize=True)

In [18]:
# Load the weights
model.load_weights("weights/dqn_BreakoutDeterministic-v4_weights_1200000.h5f")

# You can choose any policy for evaluation; here epsilon is fixed at 0.1.
policy = EpsGreedyQPolicy(0.1)


# Initialize the DQNAgent with the new model and updated policy and compile it
dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy, memory=memory,
               processor=processor)
dqn.compile(Adam(learning_rate=.00025), metrics=['mae'])


In [19]:
dqn.test(env, nb_episodes=5, visualize=True)

Testing for 5 episodes ...
Episode 1: reward: 40.000, steps: 1517
Episode 2: reward: 40.000, steps: 1513
Episode 3: reward: 40.000, steps: 1513
Episode 4: reward: 40.000, steps: 1513
Episode 5: reward: 40.000, steps: 1513


<keras.callbacks.History at 0x20a4ec6aaf0>