# Deep Q-Learning with Space Invaders

During the last two exercises, I experimented with basic Q-Learning methods and a baseline DQN built from linear layers. The observation space was very limited in both cases (Taxi: 500 discrete states, Moonlander: 8 state variables).

These environments were easy to solve with these methods. But what about Space Invaders? How does this environment stack up against the others?

In [1]:
import gym
import cv2
import numpy as np
import random
from collections import deque
import pygame
from pygame.locals import *
from IPython.display import clear_output

import tensorflow as tf
from tensorflow import keras
from keras.models import load_model, Sequential
from keras.optimizers import Adam
from keras.layers import Activation, Convolution2D, Flatten, Dense

In [2]:
from ale_py import ALEInterface
from ale_py.roms import SpaceInvaders
ale = ALEInterface()
ale.loadROM(SpaceInvaders)

In [6]:
env = gym.make('ALE/SpaceInvaders-v5',
               obs_type='rgb',                   # ram | rgb | grayscale
               frameskip=4,                      # frame skip
               mode=None,                        # game mode, see Machado et al. 2018
               difficulty=None,                  # game difficulty, see Machado et al. 2018
               repeat_action_probability=0.25,   # Sticky action probability
               full_action_space=False,          # Use all actions
               render_mode='rgb_array'           # None | human | rgb_array
)

print(f'Observation space: {env.observation_space.shape}')
print(f'Action space: {env.action_space.n}')

Observation space: (210, 160, 3)
Action space: 6


It is already evident that the `observation_space` is far larger than in the previous two environments. Observations in Atari environments are the raw RGB pixel values of the game screen, so a lookup table or a simple function of the observation is no longer practical. A CNN can be used to extract features from the pixels instead.
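To put a rough number on that difference, the sketch below counts the values in one observation and an upper bound on the number of distinct frames. It assumes every pixel channel can independently take any of 256 values, which is far more than the game can actually render; the helper name is made up for illustration.

```python
import math

# Shape printed by env.observation_space.shape above
HEIGHT, WIDTH, CHANNELS = 210, 160, 3

def values_per_frame(height, width, channels):
    """Number of 8-bit values in a single RGB observation."""
    return height * width * channels

n_values = values_per_frame(HEIGHT, WIDTH, CHANNELS)

# Upper bound on distinct frames: 256 ** n_values.
# The number is too large to print, so count its decimal digits instead.
n_digits = math.floor(n_values * math.log10(256)) + 1

print(f"Values per frame: {n_values}")
print(f"256 ** {n_values} has {n_digits} decimal digits (Taxi has 500 states)")
```

Even as a loose upper bound, this makes clear why a table with one entry per state is out of the question.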

The action space contains:

0. NOOP

1. FIRE

2. RIGHT

3. LEFT

4. RIGHTFIRE

5. LEFTFIRE

## Creating a Q-Network

First, a CNN is created to observe the environment. The input is reduced to 84x84 pixels, because feeding in the full 210x160 RGB frames would require too much computing power.

The output layer has one unit per action, so with `env.action_space.n = 6` it has 6 outputs.

In [7]:
def construct_q_network(self):
        self.model = Sequential()
        # Three convolutional layers; kernel sizes are tuples and the
        # step size is passed as `strides` in the Keras 2 API
        self.model.add(Convolution2D(32, (8, 8), strides=(4, 4), input_shape=(84, 84, NUM_FRAMES)))
        self.model.add(Activation('relu'))
        self.model.add(Convolution2D(64, (4, 4), strides=(2, 2)))
        self.model.add(Activation('relu'))
        self.model.add(Convolution2D(64, (3, 3)))
        self.model.add(Activation('relu'))
        self.model.add(Flatten())
        self.model.add(Dense(512))
        self.model.add(Activation('relu'))
        # One linear output per action: the predicted Q-value
        self.model.add(Dense(NUM_ACTIONS))

Another interesting note is that this CNN has no max-pooling layers. Pooling is primarily used to gain translational invariance, which is not wanted here: the game screen itself does not shift, and the position of objects on it is information the agent needs.

Three color channels per frame are still expensive to process, so each frame is converted to a single grayscale channel.

`self.process_buffer` contains the last three full-sized RGB frames.
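To see how the three convolutions shrink the 84x84 input, the following sketch applies the usual output-size formula for a convolution without padding. It assumes the Keras default `padding='valid'` and a stride of 1 for the third layer; the function names are illustrative.

```python
def conv_output_size(size, kernel, stride):
    """Output width/height of a convolution without padding ('valid')."""
    return (size - kernel) // stride + 1

# (filters, kernel, stride) for the three convolutional layers above
LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

def feature_map_shapes(input_size=84):
    """Return the (height, width, filters) shape after each convolution."""
    shapes = []
    size = input_size
    for filters, kernel, stride in LAYERS:
        size = conv_output_size(size, kernel, stride)
        shapes.append((size, size, filters))
    return shapes

shapes = feature_map_shapes()
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]

for shape in shapes:
    print(shape)
print("Inputs to Dense(512):", flattened)
```

The strides already reduce the spatial resolution from 84 to 7, which is the job pooling layers would otherwise do.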

In [8]:
def convert_process_buffer(self):
        """Converts the list of NUM_FRAMES images in the process buffer
        into one training sample"""
        # Grayscale each frame and resize it to 84 wide x 90 high
        black_buffer = [cv2.resize(cv2.cvtColor(x, cv2.COLOR_RGB2GRAY), (84, 90)) for x in self.process_buffer]
        # Crop rows 1..84 to get 84x84 and add a channel axis
        black_buffer = [x[1:85, :, np.newaxis] for x in black_buffer]
        # Stack the frames along the channel axis: (84, 84, NUM_FRAMES)
        return np.concatenate(black_buffer, axis=2)
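The idea behind this preprocessing can be shown on tiny frames without OpenCV. The sketch below assumes the standard luminance weights (0.299, 0.587, 0.114), which are the ones OpenCV documents for `COLOR_RGB2GRAY`; the function names and the 2x2 toy frames are made up for illustration. Each RGB frame collapses to one channel, and the frames are then stacked so that the last axis holds time instead of color.

```python
def to_gray(frame):
    """Convert a frame given as rows of (r, g, b) tuples to rows of luminance."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]

def stack_frames(gray_frames):
    """Stack N grayscale frames of shape H x W into one H x W x N sample."""
    height, width = len(gray_frames[0]), len(gray_frames[0][0])
    return [[[frame[y][x] for frame in gray_frames] for x in range(width)]
            for y in range(height)]

# Three 2x2 toy frames: a single bright pixel moving around the screen
BLACK, WHITE = (0, 0, 0), (255, 255, 255)
frames = [
    [[WHITE, BLACK], [BLACK, BLACK]],
    [[BLACK, WHITE], [BLACK, BLACK]],
    [[BLACK, BLACK], [WHITE, BLACK]],
]

sample = stack_frames([to_gray(f) for f in frames])
shape = (len(sample), len(sample[0]), len(sample[0][0]))
print(shape)          # (height, width, number of frames)
print(sample[0][0])   # brightness of the top-left pixel over time
```

Because the stacked sample records how each pixel changes across frames, the network can pick up on motion, which a single frame cannot show.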

## Replay Buffer

A problem when training the network on incoming frames is overfitting. Consecutive frames are strongly correlated, so a network trained only on frames as they arrive leans too heavily on its most recent experience.

A solution is to keep a buffer of the most recent experiences (for example the last 20,000) and to sample a random batch of 64 of them to learn from at each step. This is known as **experience replay**.

In [12]:
class ReplayBuffer:
    """Constructs a buffer object that stores the past moves
    and samples a set of subsamples"""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.count = 0
        self.buffer = deque()

    def add(self, s, a, r, d, s2):
        """Add an experience to the buffer"""
        # S represents current state, a is action,
        # r is reward, d is whether it is the end, 
        # and s2 is next state
        experience = (s, a, r, d, s2)
        if self.count < self.buffer_size:
            self.buffer.append(experience)
            self.count += 1
        else:
            self.buffer.popleft()
            self.buffer.append(experience)

    def size(self):
        return self.count

    def sample(self, batch_size):
        """Samples a total of elements equal to batch_size from buffer
        if buffer contains enough elements. Otherwise return all elements"""

        batch = []

        if self.count < batch_size:
            batch = random.sample(self.buffer, self.count)
        else:
            batch = random.sample(self.buffer, batch_size)

        # Maps each experience in batch in batches of states, actions, rewards
        # and new states
        s_batch, a_batch, r_batch, d_batch, s2_batch = map(np.array, zip(*batch))

        return s_batch, a_batch, r_batch, d_batch, s2_batch
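As a possible simplification, the manual counting and `popleft()` can be handed over to `deque(maxlen=...)`, which drops the oldest entry on its own once the buffer is full. The class below is a minimal sketch with a made-up name, not the buffer used for training in this notebook, and it returns plain tuples rather than NumPy arrays.

```python
import random
from collections import deque

class BoundedReplayBuffer:
    """Replay buffer sketch that lets deque(maxlen=...) evict old entries."""

    def __init__(self, buffer_size):
        # Appending to a full deque with maxlen drops the oldest item
        self.buffer = deque(maxlen=buffer_size)

    def add(self, s, a, r, d, s2):
        self.buffer.append((s, a, r, d, s2))

    def size(self):
        return len(self.buffer)

    def sample(self, batch_size):
        """Sample without replacement, capped at the current buffer size."""
        k = min(batch_size, len(self.buffer))
        batch = random.sample(self.buffer, k)
        # Transpose the list of experiences into per-field tuples
        return tuple(zip(*batch))

buffer = BoundedReplayBuffer(buffer_size=3)
for step in range(5):
    buffer.add(s=step, a=0, r=1.0, d=False, s2=step + 1)

print(buffer.size())                      # only the 3 newest experiences remain
print([exp[0] for exp in buffer.buffer])  # states of the surviving experiences
states, actions, rewards, dones, next_states = buffer.sample(2)
```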

## Exploration

In order for the network to learn, it cannot always perform the action it currently thinks is best, otherwise it might get stuck in a loop. Exploration overcomes this issue by performing a random move with probability epsilon. If the random action is not selected, the agent performs the action that maximizes the Q-value in the current state.

In [14]:
def predict_movement(self, data, epsilon):
        """Choose the next action: with probability epsilon a random one,
        otherwise the one with the highest predicted Q-value."""
        q_actions = self.model.predict(data.reshape(1, 84, 84, NUM_FRAMES), batch_size = 1)
        opt_policy = np.argmax(q_actions)
        rand_val = np.random.random()
        if rand_val < epsilon:
            opt_policy = np.random.randint(0, NUM_ACTIONS)
        return opt_policy, q_actions[0, opt_policy]
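The same selection rule, together with the linear epsilon schedule used in the training loop later on, can be sketched without a network. The constants below mirror the values listed in the hyper-parameter cell further down; `epsilon_at` and `epsilon_greedy` are illustrative names, and the Q-values are invented.

```python
import random

INITIAL_EPSILON = 1.0   # example values, mirroring the constants used later
FINAL_EPSILON = 0.1
EPSILON_DECAY = 300000  # number of steps over which epsilon is annealed

def epsilon_at(step):
    """Linearly anneal epsilon from INITIAL_EPSILON to FINAL_EPSILON."""
    fraction = min(step / EPSILON_DECAY, 1.0)
    return INITIAL_EPSILON - fraction * (INITIAL_EPSILON - FINAL_EPSILON)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.1, 0.7, 0.3, 0.0, 0.2, 0.4]
print(epsilon_at(0), epsilon_at(150000), epsilon_at(10**6))
print(epsilon_greedy(q_values, epsilon=0.0))  # always greedy when epsilon is 0
```

Starting with a high epsilon lets the agent fill the replay buffer with varied experience before it starts trusting its own estimates.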

## Loss Function & Target Networks

In [15]:
def train(self, s_batch, a_batch, r_batch, d_batch, s2_batch, observation_num):
        """Trains network to fit given parameters"""
        batch_size = s_batch.shape[0]
        targets = np.zeros((batch_size, NUM_ACTIONS))

        for i in range(batch_size):
            # Start from the current predictions so only the taken action changes
            targets[i] = self.model.predict(s_batch[i].reshape(1, 84, 84, NUM_FRAMES), batch_size = 1)
            # The target network estimates the value of the next state
            fut_action = self.target_model.predict(s2_batch[i].reshape(1, 84, 84, NUM_FRAMES), batch_size = 1)
            targets[i, a_batch[i]] = r_batch[i]
            if not d_batch[i]:
                targets[i, a_batch[i]] += DECAY_RATE * np.max(fut_action)

        loss = self.model.train_on_batch(s_batch, targets)
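For the action that was taken, the loop above sets the regression target to $y = r + \gamma \max_{a'} Q_{\text{target}}(s', a')$, or simply $y = r$ when the episode ended. The sketch below works through that rule for a single transition with invented numbers; `td_target` is an illustrative name and `DECAY_RATE` uses the value set in the DQN cell further down.

```python
DECAY_RATE = 0.99  # discount factor, same value as the constant defined later

def td_target(reward, done, next_q_values, gamma=DECAY_RATE):
    """One-step Q-learning target for a single transition.

    next_q_values are the target network's estimates for the next state.
    """
    if done:
        # No future reward after a terminal state
        return reward
    return reward + gamma * max(next_q_values)

# Non-terminal transition: the best next action is worth 2.0
print(td_target(reward=5.0, done=False, next_q_values=[1.0, 2.0, 0.5]))
# Terminal transition: the target is just the reward
print(td_target(reward=5.0, done=True, next_q_values=[1.0, 2.0, 0.5]))
```

Training with the `mse` loss then pulls the online network's prediction for that action towards this target, while the other actions keep their current predictions and contribute no error.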

In [16]:
def target_train(self):
        model_weights = self.model.get_weights()
        self.target_model.set_weights(model_weights)
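The version above copies the weights in one go. The `DeepQ` class further down instead blends them with a small factor `TAU`, so the target network trails the online network slowly. The difference can be sketched on plain lists of numbers standing in for weight arrays; the function names are illustrative.

```python
TAU = 0.01  # blend factor, same value as the constant defined later

def hard_update(online_weights):
    """Copy the online weights into the target network in one go."""
    return list(online_weights)

def soft_update(online_weights, target_weights, tau=TAU):
    """Move each target weight a fraction tau towards the online weight."""
    return [tau * w + (1.0 - tau) * t
            for w, t in zip(online_weights, target_weights)]

online = [1.0, -2.0]
target = [0.0, 0.0]

print(hard_update(online))   # target jumps straight to the online values
for _ in range(3):
    target = soft_update(online, target)
print(target)                # target has only moved a little
```

A slowly moving target keeps the regression targets from shifting with every gradient step, which tends to make training more stable.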

# SPACE INVADERS

In [3]:
# List of hyper-parameters and constants
BUFFER_SIZE = 100000
MINIBATCH_SIZE = 32
TOT_FRAME = 1000000
EPSILON_DECAY = 300000
MIN_OBSERVATION = 5000
FINAL_EPSILON = 0.1
INITIAL_EPSILON = 1.0
# Number of frames to throw into network
NUM_FRAMES = 3

class SpaceInvader(object):

    def __init__(self, mode, render_mode):
        self.env = gym.make('ALE/SpaceInvaders-v5',
                            obs_type='rgb',                   # ram | rgb | grayscale
                            frameskip=4,                      # frame skip
                            mode=None,                        # game mode, see Machado et al. 2018
                            difficulty=None,                  # game difficulty, see Machado et al. 2018
                            repeat_action_probability=0.25,   # Sticky action probability
                            full_action_space=False,          # Use all actions
                            render_mode=render_mode).env
        self.env = gym.wrappers.RecordVideo(self.env, 'video')
        self.env.reset()
        self.replay_buffer = ReplayBuffer(BUFFER_SIZE)

        # Construct appropriate network based on flags
        if mode == "DDQN":
            self.deep_q = DuelQ()
        elif mode == "DQN":
            self.deep_q = DeepQ()

        # A buffer that keeps the last 3 images
        self.process_buffer = []
        # Initialize buffer with the first frame
        s1, r1, _, _, _ = self.env.step(0)
        s2, r2, _, _, _ = self.env.step(0)
        s3, r3, _, _, _ = self.env.step(0)
        self.process_buffer = [s1, s2, s3]

    def load_network(self, path):
        self.deep_q.load_network(path)

    def convert_process_buffer(self):
        """Converts the list of NUM_FRAMES images in the process buffer
        into one training sample"""
        black_buffer = [cv2.resize(cv2.cvtColor(x, cv2.COLOR_RGB2GRAY), (84, 90)) for x in self.process_buffer]
        black_buffer = [x[1:85, :, np.newaxis] for x in black_buffer]
        return np.concatenate(black_buffer, axis=2)

    def train(self, num_frames):
        observation_num = 0
        curr_state = self.convert_process_buffer()
        epsilon = INITIAL_EPSILON
        alive_frame = 0
        total_reward = 0

        while observation_num < num_frames:
            clear_output(wait=True)
            if observation_num % 1000 == 999: 
                print(("Executing loop %d" %observation_num))

            # Slowly decay epsilon (the exploration rate)
            if epsilon > FINAL_EPSILON:
                epsilon -= (INITIAL_EPSILON-FINAL_EPSILON)/EPSILON_DECAY

            initial_state = self.convert_process_buffer()
            self.process_buffer = []

            # Act on the latest state; curr_state is only set once before the loop
            predict_movement, predict_q_value = self.deep_q.predict_movement(initial_state, epsilon)

            reward, terminated, truncated = 0, False, False
            for i in range(NUM_FRAMES):
                temp_observation, temp_reward, temp_terminated, temp_truncated, _ = self.env.step(predict_movement)
                reward += temp_reward
                self.process_buffer.append(temp_observation)
                terminated = terminated | temp_terminated
                truncated = truncated | temp_truncated

            if observation_num % 10 == 0:
                clear_output(wait=True)
                print("We predicted a q value of ", predict_q_value)

            if terminated or truncated:
                clear_output(wait=True)
                print("Lived with maximum time ", alive_frame)
                print("Earned a total of reward equal to ", total_reward)
                self.env.reset()
                alive_frame = 0
                total_reward = 0

            new_state = self.convert_process_buffer()
            self.replay_buffer.add(initial_state, predict_movement, reward, terminated, truncated, new_state)
            total_reward += reward

            if self.replay_buffer.size() > MIN_OBSERVATION:
                s_batch, a_batch, r_batch, ter_batch, tru_batch, s2_batch = self.replay_buffer.sample(MINIBATCH_SIZE)
                self.deep_q.train(s_batch, a_batch, r_batch, ter_batch, tru_batch, s2_batch, observation_num)
                self.deep_q.target_train()

            # Save the network every 100 iterations
            if observation_num % 100 == 99:
                print("Saving Network")
                self.deep_q.save_network("SpaceInvadersLONG.h5")

            alive_frame += 1
            observation_num += 1

    def simulate(self, path = "", save = False):
        """Simulates game"""
        self.env.reset()
        terminated = False
        truncated = False
        tot_award = 0
        while not (terminated or truncated):
            state = self.convert_process_buffer()
            predict_movement = self.deep_q.predict_movement(state, 0)[0]
            observation, reward, terminated, truncated, _ = self.env.step(predict_movement)
            tot_award += reward
            self.process_buffer.append(observation)
            self.process_buffer = self.process_buffer[1:]
#             for event in pygame.event.get():
#                 if event.type == QUIT: ## defined in pygame.locals
#                     pygame.quit()
#                     #pygame._exit = 1
#                     #sys.exit()
        self.env.close()

    def calculate_mean(self, num_samples = 100):
        reward_list = []
        clear_output(wait=True)
        print("Printing scores of each trial")
        for i in range(num_samples):
            terminated = False
            truncated = False
            tot_award = 0
            self.env.reset()
            while not (terminated or truncated):
                state = self.convert_process_buffer()
                predict_movement = self.deep_q.predict_movement(state, 0.0)[0]
                observation, reward, terminated, truncated, _ = self.env.step(predict_movement)
                tot_award += reward
                self.process_buffer.append(observation)
                self.process_buffer = self.process_buffer[1:]
            print(tot_award)
            reward_list.append(tot_award)
        return np.mean(reward_list), np.std(reward_list)

In [6]:
import base64
import glob
import io
from IPython import display
from IPython.display import HTML

def show_video(env_name):
    mp4list = glob.glob('video/*.mp4')
    if len(mp4list) > 0:
        mp4 = 'video/{}.mp4'.format(env_name)
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        display.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")
        
def show_video_of_model(self, env_name):
    # `self` is a SpaceInvader instance; this helper lives outside the class
    env = gym.make(env_name, render_mode='rgb_array').env

    env = gym.wrappers.RecordVideo(env, 'video')

    state, state_info = env.reset()
    terminated = False
    truncated = False
    tot_award = 0
    while not (terminated or truncated):
        state = self.convert_process_buffer()
        predict_movement = self.deep_q.predict_movement(state, 0)[0]
        # Step the recorded environment, not self.env
        observation, reward, terminated, truncated, _ = env.step(predict_movement)
        tot_award += reward
        self.process_buffer.append(observation)
        self.process_buffer = self.process_buffer[1:]
    env.close()

# DQN

In [4]:
DECAY_RATE = 0.99
BUFFER_SIZE = 40000
MINIBATCH_SIZE = 64
TOT_FRAME = 3000000
EPSILON_DECAY = 1000000
MIN_OBSERVATION = 5000
FINAL_EPSILON = 0.05
INITIAL_EPSILON = 0.1
NUM_ACTIONS = 6
TAU = 0.01
# Number of frames to throw into network
NUM_FRAMES = 3

class DeepQ(object):
    """Constructs the desired deep q learning network"""
    def __init__(self):
        self.construct_q_network()

    def construct_q_network(self):
        # Uses the network architecture found in DeepMind paper
        self.model = Sequential()
        self.model.add(Convolution2D(32, (8, 8), strides=(4, 4), input_shape=(84, 84, NUM_FRAMES)))
        self.model.add(Activation('relu'))
        self.model.add(Convolution2D(64, (4, 4), strides=(2, 2)))
        self.model.add(Activation('relu'))
        self.model.add(Convolution2D(64, (3, 3)))
        self.model.add(Activation('relu'))
        self.model.add(Flatten())
        self.model.add(Dense(512))
        self.model.add(Activation('relu'))
        self.model.add(Dense(NUM_ACTIONS))
        self.model.compile(loss='mse', optimizer=Adam(learning_rate=0.00001))

        # Creates a target network as described in DeepMind paper
        self.target_model = Sequential()
        self.target_model.add(Convolution2D(32, (8, 8), strides=(4, 4), input_shape=(84, 84, NUM_FRAMES)))
        self.target_model.add(Activation('relu'))
        self.target_model.add(Convolution2D(64, (4, 4), strides=(2, 2)))
        self.target_model.add(Activation('relu'))
        self.target_model.add(Convolution2D(64, (3, 3)))
        self.target_model.add(Activation('relu'))
        self.target_model.add(Flatten())
        self.target_model.add(Dense(512))
        self.target_model.add(Activation('relu'))
        self.target_model.add(Dense(NUM_ACTIONS))
        self.target_model.compile(loss='mse', optimizer=Adam(learning_rate=0.00001))
        self.target_model.set_weights(self.model.get_weights())

        print("Successfully constructed networks.")

    def predict_movement(self, data, epsilon):
        """Choose the next action: with probability epsilon a random one,
        otherwise the one with the highest predicted Q-value."""
        q_actions = self.model.predict(data.reshape(1, 84, 84, NUM_FRAMES), batch_size = 1)
        opt_policy = np.argmax(q_actions)
        rand_val = np.random.random()
        if rand_val < epsilon:
            opt_policy = np.random.randint(0, NUM_ACTIONS)
        return opt_policy, q_actions[0, opt_policy]

    def train(self, s_batch, a_batch, r_batch, ter_batch, tru_batch, s2_batch, observation_num):
        """Trains network to fit given parameters"""
        batch_size = s_batch.shape[0]
        targets = np.zeros((batch_size, NUM_ACTIONS))

        for i in range(batch_size):
            targets[i] = self.model.predict(s_batch[i].reshape(1, 84, 84, NUM_FRAMES), batch_size = 1)
            fut_action = self.target_model.predict(s2_batch[i].reshape(1, 84, 84, NUM_FRAMES), batch_size = 1)
            targets[i, a_batch[i]] = r_batch[i]
            if not ter_batch[i]:
                targets[i, a_batch[i]] += DECAY_RATE * np.max(fut_action)

        loss = self.model.train_on_batch(s_batch, targets)

        # Print the loss every 10 iterations.
        if observation_num % 10 == 0:
            clear_output(wait=True)
            print("We had a loss equal to ", loss)

    def save_network(self, path):
        # Saves model at specified path as h5 file
        self.model.save(path)
        print("Successfully saved network.")

    def load_network(self, path):
        self.model = load_model(path)
        print("Succesfully loaded network.")

    def target_train(self):
        model_weights = self.model.get_weights()
        target_model_weights = self.target_model.get_weights()
        for i in range(len(model_weights)):
            target_model_weights[i] = TAU * model_weights[i] + (1 - TAU) * target_model_weights[i]
        self.target_model.set_weights(target_model_weights)
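The `train` method above calls `predict` twice per sample, which is likely a large part of why each training step is slow. A possible alternative, sketched here under the assumption that both networks are evaluated once on the whole batch, builds all targets with NumPy indexing. `build_targets` is an illustrative name, and the arrays below are invented stand-ins for the outputs of `self.model.predict(s_batch)` and `self.target_model.predict(s2_batch)`.

```python
import numpy as np

DECAY_RATE = 0.99

def build_targets(q_current, q_next, a_batch, r_batch, ter_batch, gamma=DECAY_RATE):
    """Vectorised version of the per-sample target loop.

    q_current: online-network Q-values for s_batch, shape (batch, actions)
    q_next:    target-network Q-values for s2_batch, shape (batch, actions)
    """
    targets = q_current.copy()
    # Bootstrap only where the episode did not terminate
    not_done = 1.0 - ter_batch.astype(np.float64)
    td = r_batch + gamma * not_done * q_next.max(axis=1)
    # Overwrite only the entry of the action that was actually taken
    targets[np.arange(len(a_batch)), a_batch] = td
    return targets

# Stand-ins for the two batched network predictions
q_current = np.zeros((2, 6))
q_next = np.array([[1.0, 2.0, 0.5, 0.0, 0.0, 0.0],
                   [3.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
targets = build_targets(q_current, q_next,
                        a_batch=np.array([1, 4]),
                        r_batch=np.array([5.0, 10.0]),
                        ter_batch=np.array([False, True]))
print(targets)
```

The resulting array has the same shape as the one produced by the loop, so it could be passed to `train_on_batch` in the same way.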

# REPLAY BUFFER

In [5]:
class ReplayBuffer:
    """Constructs a buffer object that stores the past moves
    and samples a set of subsamples"""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.count = 0
        self.buffer = deque()

    def add(self, s, a, r, ter, trunc, s2):
        """Add an experience to the buffer"""
        # s is the current state, a is the action, r is the reward,
        # ter / trunc flag whether the episode terminated or was truncated,
        # and s2 is the next state
        experience = (s, a, r, ter, trunc, s2)
        if self.count < self.buffer_size:
            self.buffer.append(experience)
            self.count += 1
        else:
            self.buffer.popleft()
            self.buffer.append(experience)

    def size(self):
        return self.count

    def sample(self, batch_size):
        """Samples a total of elements equal to batch_size from buffer
        if buffer contains enough elements. Otherwise return all elements"""

        batch = []

        if self.count < batch_size:
            batch = random.sample(self.buffer, self.count)
        else:
            batch = random.sample(self.buffer, batch_size)

        # Maps each experience in batch in batches of states, actions, rewards
        # and new states
        s_batch, a_batch, r_batch, ter_batch, trunc_batch, s2_batch = list(map(np.array, list(zip(*batch))))

        return s_batch, a_batch, r_batch, ter_batch, trunc_batch, s2_batch

    def clear(self):
        self.buffer.clear()
        self.count = 0

In [5]:
NUM_FRAME = 100000
game_instance = SpaceInvader('DQN', 'rgb_array')
game_instance.train(NUM_FRAME)

We predicted a q value of  59.23079


KeyboardInterrupt: 

In [6]:
spaceinvaders = SpaceInvader('DQN', 'human')
spaceinvaders.load_network('SpaceInvadersLONG.h5')
spaceinvaders.simulate()



Successfully constructed networks.
Succesfully loaded network.




## Conclusion

Training a model like this takes a very long time. I got better "playing" results from a short training run of 30 minutes, which could be because the reward system was not set up properly. It was challenging to understand all the steps within the network, but by carefully building it step by step the main points of interest became clear. While working on my data challenge on teachable reinforcement learning, I can now try to improve these models.