# CNNs and DQNs: Playing From Pixels

Instead of relying on a state transformer (that is, some way to extract relevant state information from the game itself), we can use CNNs and treat a short sequence of screen captures as the state. Doing so models the way humans play games (we don't know the internal state values) and is a much more general approach (every video game involves frames of pixels).

![](images/CNN-DQN.jpg)

> Image Source: https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
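
To make the "short sequence of screen captures" idea concrete, here is a minimal sketch using only NumPy. The synthetic frames, the `make_frame` helper, and the ball's motion are made up for illustration and are not part of keras-rl or Gym. The point is that one frame only gives the ball's position, while a stack of frames also gives its velocity.

```python
import numpy as np

WINDOW_LENGTH = 4        # number of frames per state, as in the notebook
FRAME_SHAPE = (84, 84)   # grayscale frame size, as in the notebook


def make_frame(ball_row, ball_col):
    """Hypothetical helper: a blank grayscale frame with one bright 'ball' pixel."""
    frame = np.zeros(FRAME_SHAPE, dtype=np.uint8)
    frame[ball_row, ball_col] = 255
    return frame


# Made-up motion: the ball moves 2 rows down and 1 column right per frame.
frames = [make_frame(10 + 2 * t, 20 + t) for t in range(WINDOW_LENGTH)]

# A "state" is the stack of the last WINDOW_LENGTH frames.
state = np.stack(frames, axis=0)

# Locate the ball in each frame, then difference the last two positions.
positions = [np.argwhere(frame == 255)[0] for frame in state]
velocity = positions[-1] - positions[-2]

print(state.shape)  # (4, 84, 84)
print(velocity)     # [2 1]
```

A CNN never computes this difference explicitly, but stacking the frames puts the same information within reach of its filters.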

In [1]:
import numpy as np

import io
import base64
from IPython import display

import gym
from gym import wrappers

# Same as before, just allowing us to display the video from OpenAI Gym 
def imbed_round_video(video_env):
    video = io.open('./gym-videos/openaigym.video.%s.video000001.mp4' % video_env.file_infix, 'rb').read()
    encoded = base64.b64encode(video)
    return display.HTML(data='''
        <video width="360" height="auto" alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''
    .format(encoded.decode('ascii')))

In [2]:
# Much of the code here is borrowed directly from the 
# keras-rl package examples, which you can find here: 
#    https://github.com/keras-rl/keras-rl/tree/master/examples

from PIL import Image
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten, Convolution2D, Permute
from keras.optimizers import Adam
import keras.backend as K

from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory
from rl.core import Processor
from rl.callbacks import FileLogger, ModelIntervalCheckpoint

INPUT_SHAPE = (84, 84)

# The window length enables us to capture more information in a "state"
# than the current pixels on the screen. Consider this: if each state was
# a single snapshot, how would our AI ever know the velocity of the ball?
# A window length of 4 means a "state" is actually the combination of 4
# frames of the game. 
WINDOW_LENGTH = 4

# keras-rl provides this processor, which allows us to take the pixels from
# the OpenAI Gym Atari selection and process them into an image that
# plays well with Keras' expectations for CNN inputs.

# For efficiency reasons they suggest transforming the images to grayscale
# before sending them to a CNN.
class AtariProcessor(Processor):
    def process_observation(self, observation):
        assert observation.ndim == 3  # (height, width, channel)

        # NOTE THAT: each frame from the atari game is squashed to INPUT_SHAPE
        # and then converted to grayscale. 
        img = Image.fromarray(observation)
        img = img.resize(INPUT_SHAPE).convert('L')  # resize and convert to grayscale
        processed_observation = np.array(img)
        assert processed_observation.shape == INPUT_SHAPE
        return processed_observation.astype('uint8')  # saves storage in experience memory

    def process_state_batch(self, batch):
        # We could perform this processing step in `process_observation`. In this case, however,
        # we would need to store a `float32` array instead, which is 4x more memory intensive than
        # an `uint8` array. This matters if we store 1M observations.
        processed_batch = batch.astype('float32') / 255.
        return processed_batch

#     # TODO: See if we need this... seems kind of wrong headed to me actually. 
#     def process_reward(self, reward):
#         return np.clip(reward, -1., 1.)

Using TensorFlow backend.
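
The comment in `process_state_batch` about `uint8` versus `float32` storage can be checked with a little arithmetic. This is a rough sketch: the `replay_memory_gb` helper is hypothetical, and it counts raw frame bytes only, ignoring actions, rewards, and any per-object overhead in the replay memory.

```python
import numpy as np

INPUT_SHAPE = (84, 84)
N_FRAMES = 1_000_000  # the "1M observations" mentioned in the comment


def replay_memory_gb(dtype, n_frames=N_FRAMES, shape=INPUT_SHAPE):
    """Approximate size in GB (1e9 bytes) of n_frames frames stored as dtype."""
    bytes_per_frame = int(np.prod(shape)) * np.dtype(dtype).itemsize
    return n_frames * bytes_per_frame / 1e9


uint8_gb = replay_memory_gb(np.uint8)
float32_gb = replay_memory_gb(np.float32)

print(f"uint8:   {uint8_gb:.2f} GB")    # uint8:   7.06 GB
print(f"float32: {float32_gb:.2f} GB")  # float32: 28.22 GB
```

Storing `uint8` frames and converting to `float32` one batch at a time keeps the frame buffer around 7 GB instead of around 28 GB.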


In [3]:
# This code would work for just about any of the pixel-returning
# OpenAI Gym environments. Feel free to try many games.
env = gym.make("Breakout-v0")
nb_actions = env.action_space.n

# We'll use a relatively simple model that starts with fairly big
# strides to compress the input size quickly. This idea is described
# in Mnih et al. (2015): https://www.nature.com/articles/nature14236

# Remember window length is the number of frames, and INPUT_SHAPE is
# the shape of each grayscale frame. 
input_shape = (WINDOW_LENGTH,) + INPUT_SHAPE
model = Sequential()

# Some backends have the color channel first, some have it last.
# This code accounts for that.
if K.image_dim_ordering() == 'tf':
    # (width, height, channels)
    model.add(Permute((2, 3, 1), input_shape=input_shape))
elif K.image_dim_ordering() == 'th':
    # (channels, width, height)
    model.add(Permute((1, 2, 3), input_shape=input_shape))

# Note the significant striding.
model.add(Convolution2D(32, (8, 8), strides=(4, 4), activation='relu'))
model.add(Convolution2D(64, (4, 4), strides=(2, 2), activation='relu'))
model.add(Convolution2D(64, (3, 3), strides=(1, 1), activation='relu'))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dense(nb_actions, activation='linear'))

memory = SequentialMemory(limit=1000000, window_length=WINDOW_LENGTH)
processor = AtariProcessor()

# Adapted from the keras-rl docs/example:

# Select a policy. We use eps-greedy action selection, which means that a random action is selected
# with probability eps. We anneal eps from 0.99 to 0.05 over the course of 1.25M steps. This is done so
# that the agent initially explores the environment (high eps) and then gradually sticks to what it
# knows (low eps). We also set a dedicated eps value that is used during testing. Note that we set it
# to 0.05 so that the agent still performs some random actions. This ensures that the agent cannot get stuck.

policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), 
              attr='eps', value_max=.99, value_min=.05, value_test=.05, nb_steps=1250000
         )
# policy = EpsGreedyQPolicy() 

# Note the "train interval" matches our WINDOW_LENGTH, not by accident. 
dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy, memory=memory,
               processor=processor, nb_steps_warmup=10000, gamma=.99, target_model_update=0.01,
               train_interval=4, delta_clip=1.)

# Note the low learning rate. 
dqn.compile(Adam(lr=.00025), metrics=['mae'])

Instructions for updating:
Colocations handled automatically by placer.
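
To see how quickly the striding compresses the input, the spatial size after each convolution can be worked out by hand. This sketch assumes the layers use no padding, which is the Keras default (`padding='valid'`) when none is specified. The `conv_output_size` helper is hypothetical, not a Keras function.

```python
def conv_output_size(size, kernel, stride):
    """Output length along one spatial axis for a convolution with no padding."""
    return (size - kernel) // stride + 1


size = 84  # each frame is 84x84
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel, stride)
    print(size)  # 20, then 9, then 7

# The last convolution has 64 filters, so Flatten() produces this many values.
flattened = size * size * 64
print(flattened)  # 3136
```

The 84x84 input shrinks to 7x7 after three layers, so the `Dense(512)` layer sees 3,136 inputs rather than the 28,224 raw pixel values in a stacked state.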


In [4]:
# Notice that we just went for a huge number of steps right off the bat. 
# Feel free to divide these steps into smaller steps to visualize performance
# over time. 
dqn.fit(env, nb_steps=1750000, nb_max_start_steps=30, verbose=True, action_repetition=2)

# Note that during the warm-up period the network seems really fast: ~20s per 10k steps.
# Then the actual RL kicks in at interval 6 and we see closer to 3-5 minutes per 10k steps.

Training for 1750000 steps ...
Interval 1 (0 steps performed)
78 episodes - episode_reward: 1.487 [0.000, 8.000] - ale.lives: 5.698

Interval 2 (10000 steps performed)
Instructions for updating:
Use tf.cast instead.
83 episodes - episode_reward: 1.217 [0.000, 4.000] - loss: 0.006 - mean_absolute_error: 0.098 - mean_q: 0.133 - mean_eps: 0.979 - ale.lives: 5.695

Interval 3 (20000 steps performed)
78 episodes - episode_reward: 1.577 [0.000, 6.000] - loss: 0.010 - mean_absolute_error: 0.346 - mean_q: 0.464 - mean_eps: 0.971 - ale.lives: 5.754

Interval 4 (30000 steps performed)
81 episodes - episode_reward: 1.519 [0.000, 5.000] - loss: 0.017 - mean_absolute_error: 0.887 - mean_q: 1.200 - mean_eps: 0.964 - ale.lives: 5.778

Interval 5 (40000 steps performed)
77 episodes - episode_reward: 1.597 [0.000, 4.000] - loss: 0.014 - mean_absolute_error: 1.539 - mean_q: 2.081 - mean_eps: 0.956 - ale.lives: 5.563

Interval 6 (50000 steps performed)
75 episodes - episode_reward: 1.773 [0.000, 5.000] -

65 episodes - episode_reward: 2.538 [0.000, 8.000] - loss: 0.023 - mean_absolute_error: 3.012 - mean_q: 4.043 - mean_eps: 0.761 - ale.lives: 5.561

Interval 32 (310000 steps performed)
67 episodes - episode_reward: 2.388 [0.000, 10.000] - loss: 0.023 - mean_absolute_error: 2.941 - mean_q: 3.945 - mean_eps: 0.753 - ale.lives: 5.672

Interval 33 (320000 steps performed)
63 episodes - episode_reward: 2.651 [0.000, 7.000] - loss: 0.021 - mean_absolute_error: 2.885 - mean_q: 3.872 - mean_eps: 0.746 - ale.lives: 5.555

Interval 34 (330000 steps performed)
62 episodes - episode_reward: 2.629 [0.000, 8.000] - loss: 0.023 - mean_absolute_error: 2.902 - mean_q: 3.897 - mean_eps: 0.738 - ale.lives: 5.503

Interval 35 (340000 steps performed)
69 episodes - episode_reward: 2.101 [0.000, 8.000] - loss: 0.024 - mean_absolute_error: 2.893 - mean_q: 3.884 - mean_eps: 0.731 - ale.lives: 5.389

Interval 36 (350000 steps performed)
73 episodes - episode_reward: 1.753 [0.000, 7.000] - loss: 0.023 - mean_ab

46 episodes - episode_reward: 5.217 [0.000, 13.000] - loss: 0.013 - mean_absolute_error: 2.499 - mean_q: 3.358 - mean_eps: 0.528 - ale.lives: 5.585

Interval 63 (620000 steps performed)
44 episodes - episode_reward: 5.659 [2.000, 13.000] - loss: 0.014 - mean_absolute_error: 2.519 - mean_q: 3.386 - mean_eps: 0.520 - ale.lives: 5.656

Interval 64 (630000 steps performed)
43 episodes - episode_reward: 5.837 [2.000, 13.000] - loss: 0.014 - mean_absolute_error: 2.520 - mean_q: 3.385 - mean_eps: 0.512 - ale.lives: 5.729

Interval 65 (640000 steps performed)
43 episodes - episode_reward: 6.140 [1.000, 15.000] - loss: 0.016 - mean_absolute_error: 2.527 - mean_q: 3.393 - mean_eps: 0.505 - ale.lives: 5.787

Interval 66 (650000 steps performed)
42 episodes - episode_reward: 6.071 [1.000, 12.000] - loss: 0.015 - mean_absolute_error: 2.541 - mean_q: 3.413 - mean_eps: 0.497 - ale.lives: 5.608

Interval 67 (660000 steps performed)
45 episodes - episode_reward: 5.378 [0.000, 13.000] - loss: 0.014 - me

32 episodes - episode_reward: 9.750 [1.000, 21.000] - loss: 0.015 - mean_absolute_error: 2.632 - mean_q: 3.536 - mean_eps: 0.294 - ale.lives: 5.815

Interval 94 (930000 steps performed)
32 episodes - episode_reward: 9.781 [3.000, 26.000] - loss: 0.014 - mean_absolute_error: 2.630 - mean_q: 3.534 - mean_eps: 0.287 - ale.lives: 5.822

Interval 95 (940000 steps performed)
37 episodes - episode_reward: 7.649 [1.000, 15.000] - loss: 0.014 - mean_absolute_error: 2.630 - mean_q: 3.533 - mean_eps: 0.279 - ale.lives: 5.932

Interval 96 (950000 steps performed)
34 episodes - episode_reward: 8.294 [1.000, 23.000] - loss: 0.014 - mean_absolute_error: 2.630 - mean_q: 3.534 - mean_eps: 0.272 - ale.lives: 5.795

Interval 97 (960000 steps performed)
36 episodes - episode_reward: 8.472 [1.000, 22.000] - loss: 0.014 - mean_absolute_error: 2.655 - mean_q: 3.567 - mean_eps: 0.264 - ale.lives: 5.567

Interval 98 (970000 steps performed)
32 episodes - episode_reward: 10.156 [3.000, 22.000] - loss: 0.013 - m

22 episodes - episode_reward: 16.000 [6.000, 34.000] - loss: 0.014 - mean_absolute_error: 2.851 - mean_q: 3.832 - mean_eps: 0.061 - ale.lives: 5.741

Interval 125 (1240000 steps performed)
23 episodes - episode_reward: 16.043 [6.000, 36.000] - loss: 0.015 - mean_absolute_error: 2.861 - mean_q: 3.847 - mean_eps: 0.054 - ale.lives: 5.596

Interval 126 (1250000 steps performed)
20 episodes - episode_reward: 17.500 [9.000, 36.000] - loss: 0.015 - mean_absolute_error: 2.871 - mean_q: 3.858 - mean_eps: 0.050 - ale.lives: 5.920

Interval 127 (1260000 steps performed)
21 episodes - episode_reward: 18.429 [9.000, 28.000] - loss: 0.015 - mean_absolute_error: 2.871 - mean_q: 3.858 - mean_eps: 0.050 - ale.lives: 5.583

Interval 128 (1270000 steps performed)
22 episodes - episode_reward: 14.500 [6.000, 26.000] - loss: 0.015 - mean_absolute_error: 2.870 - mean_q: 3.856 - mean_eps: 0.050 - ale.lives: 5.511

Interval 129 (1280000 steps performed)
21 episodes - episode_reward: 16.952 [4.000, 55.000] - 

23 episodes - episode_reward: 15.913 [7.000, 29.000] - loss: 0.022 - mean_absolute_error: 3.290 - mean_q: 4.415 - mean_eps: 0.050 - ale.lives: 5.533

Interval 156 (1550000 steps performed)
20 episodes - episode_reward: 16.300 [7.000, 29.000] - loss: 0.021 - mean_absolute_error: 3.288 - mean_q: 4.414 - mean_eps: 0.050 - ale.lives: 5.650

Interval 157 (1560000 steps performed)
24 episodes - episode_reward: 15.500 [7.000, 38.000] - loss: 0.021 - mean_absolute_error: 3.295 - mean_q: 4.424 - mean_eps: 0.050 - ale.lives: 5.821

Interval 158 (1570000 steps performed)
22 episodes - episode_reward: 15.409 [5.000, 27.000] - loss: 0.022 - mean_absolute_error: 3.316 - mean_q: 4.451 - mean_eps: 0.050 - ale.lives: 5.713

Interval 159 (1580000 steps performed)
22 episodes - episode_reward: 16.000 [5.000, 29.000] - loss: 0.022 - mean_absolute_error: 3.333 - mean_q: 4.473 - mean_eps: 0.050 - ale.lives: 5.923

Interval 160 (1590000 steps performed)
25 episodes - episode_reward: 13.160 [4.000, 20.000] - 

<keras.callbacks.History at 0x111123a20>
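
The `mean_eps` column in the log above tracks the linear annealing schedule configured earlier (`value_max=.99`, `value_min=.05`, `nb_steps=1250000`). The function below is a sketch of that schedule as I understand it, not a copy of keras-rl's code. `linear_annealed_eps` is a hypothetical name, and the steps passed to it are the midpoints of the logged 10k-step intervals.

```python
def linear_annealed_eps(step, value_max=0.99, value_min=0.05, nb_steps=1_250_000):
    """Linearly decay eps from value_max toward value_min, then hold at value_min."""
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, slope * step + value_max)


# Midpoints of intervals 2, 32, and 127 from the training log.
for step in (15_000, 315_000, 1_265_000):
    print(step, round(linear_annealed_eps(step), 3))
# 15000 0.979
# 315000 0.753
# 1265000 0.05
```

These values match the `mean_eps` logged for those intervals (0.979, 0.753, and 0.050). After step 1,250,000 the agent keeps taking a random action 5% of the time for the rest of training.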

In [5]:
for _ in range(10):
    orig_environment = gym.make('Breakout-v0')
    environment = wrappers.Monitor(orig_environment, "gym-videos/", force=True)

    # Let's visualize a single playthrough.
    state = environment.reset()
    dqn.test(environment, nb_episodes=1, visualize=True, nb_max_start_steps=30, action_repetition=2)

    environment.close()
    orig_environment.close()

    display.display(imbed_round_video(environment))

Testing for 1 episodes ...
Episode 1: reward: 14.000, steps: 414


Testing for 1 episodes ...
Episode 1: reward: 20.000, steps: 4997


Testing for 1 episodes ...
Episode 1: reward: 5.000, steps: 5000


Testing for 1 episodes ...
Episode 1: reward: 10.000, steps: 4994


Testing for 1 episodes ...
Episode 1: reward: 36.000, steps: 693


Testing for 1 episodes ...
Episode 1: reward: 15.000, steps: 448


Testing for 1 episodes ...
Episode 1: reward: 24.000, steps: 365


Testing for 1 episodes ...
Episode 1: reward: 12.000, steps: 390


Testing for 1 episodes ...
Episode 1: reward: 21.000, steps: 507


Testing for 1 episodes ...
Episode 1: reward: 14.000, steps: 307


In [None]:
# You'll notice some very silly behavior: sometimes the ball falls
# off screen and the agent just sits there. This is because it
# has not learned that it has to take the "Fire" action to get another
# ball, for example.

# You'll also notice that after millions of steps and thousands
# of rounds of Breakout, even the best episodes are only okay. This
# implementation uses several state-of-the-art best practices, but learning
# from raw pixel data is still hard.

# Tactics like reward augmentation can still help, and explicitly extracting
# useful state information can help further.
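
As one illustration of reward augmentation, the sketch below subtracts a penalty whenever a life is lost. It was not used in the run above. The `shaped_reward` function and its `life_loss_penalty` value are assumptions made up for illustration. The sketch also assumes the life count is available at each step, which the `ale.lives` metric in the training log suggests but which should be checked for a given environment.

```python
def shaped_reward(reward, lives_before, lives_after, life_loss_penalty=1.0):
    """Hypothetical shaping: keep the game reward, subtract a penalty per lost life."""
    lives_lost = max(0, lives_before - lives_after)
    return reward - life_loss_penalty * lives_lost


print(shaped_reward(1.0, lives_before=5, lives_after=5))  # 1.0
print(shaped_reward(0.0, lives_before=5, lives_after=4))  # -1.0
```

Logic like this could go in the processor's `process_reward` hook if the life count were tracked alongside it. Whether it actually helps learning is an empirical question.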