# Play Breakout using Deep Q-Learning

As an agent takes actions and moves through an environment, it learns to map the observed state of the environment to an action. The agent chooses an action in a given state based on a **Q-value**, an estimate of the expected long-term (discounted) reward of taking that action in that state.

A **Q-Learning Agent** learns to perform its task such that the recommended action **maximizes the potential future rewards**. This is an *Off-Policy* method: its Q-values are updated assuming that the best action will be chosen in the next state, even if the agent actually chooses a different one.
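
As a minimal sketch of the off-policy update described above, the tabular form of Q-learning moves Q(s, a) toward the reward plus the discounted best Q-value of the next state, regardless of which action is actually taken next. The table size, the learning rate `alpha`, and the sample transition are illustrative assumptions, not values used later in this notebook.

```python
def q_learning_update(q_table, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD target."""
    # Off-policy: bootstrap from the best next action, whatever is taken next
    best_next = 0.0 if done else max(q_table[next_state])
    target = reward + gamma * best_next
    q_table[state][action] += alpha * (target - q_table[state][action])
    return target


# Hypothetical toy problem: 3 states, 2 actions, Q-values start at zero
q_table = [[0.0, 0.0] for _ in range(3)]
q_table[1] = [0.5, 2.0]  # pretend state 1 already has some estimates

target = q_learning_update(q_table, state=0, action=1, reward=1.0,
                           next_state=1, done=False)
print(target, q_table[0])
```

Deep Q-learning keeps the same target but replaces the table with a neural network, which is what the rest of this notebook builds.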

## Import Dependencies

In [1]:
import gym

import numpy as np
import tensorflow as tf
import tensorflow.keras as ks

## Define Buffer class for Experience Replay

In this class we only have to define the buffer dimension, which limits the number of samples the buffer can hold. This bounds the amount of memory required by the program and avoids memory problems.

In [2]:
class Buffer:
    
    def __init__(self, buffer_dim):
        
        self.buffer_dim = buffer_dim
        self.state_hist = []
        self.action_hist = []
        self.rewards_hist = []
        self.next_state_hist = []
        self.done_hist = []
    
    # Save a sample in the Buffer
    def save(self, state, action, reward, next_state, done):
        self.state_hist.append(state)
        self.action_hist.append(action)
        self.rewards_hist.append(reward)
        self.next_state_hist.append(next_state)
        self.done_hist.append(done)
        
        # Deleting the oldest sample
        if len(self.done_hist) > self.buffer_dim:
            del self.state_hist[0]
            del self.action_hist[0]
            del self.rewards_hist[0]
            del self.next_state_hist[0]
            del self.done_hist[0]
    
    # Get a batch of samples from the Buffer
    def sample(self, batch_size):
        indices = np.random.choice(len(self.done_hist), size=batch_size)
        
        state_sample = np.array([self.state_hist[i] for i in indices])
        action_sample = [self.action_hist[i] for i in indices]
        reward_sample = [self.rewards_hist[i] for i in indices]
        next_state_sample = np.array([self.next_state_hist[i] for i in indices])
        done_sample = tf.convert_to_tensor([float(self.done_hist[i]) for i in indices])
        
        return state_sample, action_sample, reward_sample, next_state_sample, done_sample
    
    # Get the number of samples contained in the Buffer
    def n_samples(self):
        return len(self.done_hist)
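
The eviction rule in `save` can also be expressed with `collections.deque`, whose `maxlen` argument drops the oldest item automatically. The sketch below is an alternative illustration, not the class used in this notebook. `DequeBuffer` is a hypothetical name, and it stores each transition as one tuple instead of five parallel lists.

```python
import random
from collections import deque


class DequeBuffer:
    """Fixed-size replay buffer (illustrative alternative to Buffer)."""

    def __init__(self, buffer_dim):
        # maxlen makes the deque discard the oldest transition when full
        self.transitions = deque(maxlen=buffer_dim)

    def save(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sample with replacement, like np.random.choice with default arguments
        batch = random.choices(self.transitions, k=batch_size)
        # Regroup into (states, actions, rewards, next_states, dones)
        return tuple(zip(*batch))

    def n_samples(self):
        return len(self.transitions)


demo = DequeBuffer(buffer_dim=3)
for t in range(5):
    demo.save(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

states, actions, rewards, next_states, dones = demo.sample(batch_size=4)
print(demo.n_samples(), states)
```

After five saves into a buffer of size three, only the three most recent transitions remain available for sampling.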

## Define Convolutional Neural Network

This network **learns an approximation of the Q-table**, which maps each state and action to a Q-value. In Breakout, four actions can be taken in every state. The environment provides the state, and the action is chosen by selecting the largest of the four Q-values predicted in the output layer.

In [3]:
def get_model(input_shape, num_actions):
    
    inputs = ks.layers.Input(input_shape)
    
    x = ks.layers.Conv2D(32, kernel_size=8, strides=4, activation='relu')(inputs)
    x = ks.layers.Conv2D(64, kernel_size=4, strides=2, activation='relu')(x)
    x = ks.layers.Conv2D(64, kernel_size=3, strides=1, activation='relu')(x)
    
    x = ks.layers.Flatten()(x)
    x = ks.layers.Dense(512, activation='relu')(x)
    outputs = ks.layers.Dense(num_actions, activation='linear')(x)

    return ks.Model(inputs=inputs, outputs=outputs)
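
To see what the convolutional stack does to the 84×84 input used later in this notebook, the spatial size after each layer can be worked out by hand. The sketch below assumes the Keras default `padding='valid'`, which applies here because the layers above specify no padding. `conv_output_size` is a helper introduced only for this illustration.

```python
def conv_output_size(size, kernel_size, stride):
    """Output width/height of a convolution with 'valid' padding."""
    return (size - kernel_size) // stride + 1


size = 84
sizes = []
# (kernel_size, strides) of the three Conv2D layers in get_model
for kernel_size, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel_size, stride)
    sizes.append(size)

# The last Conv2D has 64 filters, so Flatten yields size * size * 64 features
flattened = size * size * 64
print(sizes, flattened)  # [20, 9, 7] 3136
```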

## Create the Environment

In this environment, a board moves along the bottom of the screen and returns a ball that destroys blocks at the top of the screen. The aim of the game is to remove all blocks and break out of the level. The agent must learn to control the board by moving it left and right, returning the ball and removing all the blocks without letting the ball pass the board.

We will **stack 4 frames** together so that the Convolutional Neural Network can 'perceive' the movement of the ball. We use **grayscale images** instead of RGB ones.

In [4]:
env = gym.make('ALE/Breakout-v5')

# Transforming the environment
env = gym.wrappers.AtariPreprocessing(env, frame_skip=1)

# Stacking frames together
env = gym.wrappers.FrameStack(env, 4)
env.seed(14)

(14, 1335387034)

In [5]:
# PARAMETERS
episodes = 1000

# Discount factor
gamma = 0.99

# Exploration trade-off
initial_exploration_frames = 100000
epsilon = 1.0
epsilon_min = 0.05
epsilon_max = 1.0
decay_interval = 1000000

batch_size = 32
max_steps_per_episode = 10000

# Train the model after 4 actions
update_after_actions = 4

# Update target network after 10000 actions
update_target_network = 10000
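
With these values, the linear decay in the training loop lowers epsilon by `(epsilon_max - epsilon_min) / decay_interval` on every frame until it reaches `epsilon_min`. As a rough sketch of that schedule in closed form (the function name `epsilon_at` is an assumption made for this illustration):

```python
def epsilon_at(frame, epsilon_max=1.0, epsilon_min=0.05, decay_interval=1_000_000):
    """Epsilon after `frame` decay steps of the linear schedule."""
    step = (epsilon_max - epsilon_min) / decay_interval
    return max(epsilon_max - step * frame, epsilon_min)


for frame in (0, 500_000, 1_000_000, 2_000_000):
    print(frame, round(epsilon_at(frame), 3))
```

Epsilon reaches its floor after `decay_interval` frames and stays there. Independently of this schedule, the first `initial_exploration_frames` frames always use random actions.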

In [6]:
# Initialize the models
model = get_model((84, 84, 4), env.action_space.n)
model_target = get_model((84, 84, 4), env.action_space.n)

# Choice of optimizer and loss function
optimizer = ks.optimizers.Adam(learning_rate=0.00025, clipnorm=1.0)
loss_function = ks.losses.Huber()

In [7]:
env.observation_space.shape

(4, 84, 84)
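
The shape above is channels-first (frames, height, width), whereas the models were built for `(84, 84, 4)`. This is why each observation is transposed before it is used in the training loop. A small sketch, with a dummy array standing in for a real observation:

```python
import numpy as np

# Dummy stand-in for a FrameStack observation: 4 frames of 84x84 pixels
observation = np.zeros((4, 84, 84), dtype=np.uint8)
observation[3, 10, 20] = 255  # mark one pixel in the last frame

# Move the frame axis last: (frames, height, width) -> (height, width, frames)
state = np.transpose(observation, [1, 2, 0])

print(state.shape)  # (84, 84, 4)
```

The marked pixel ends up at `state[10, 20, 3]`, so each frame becomes one channel of the network input.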

## Training

The DQN algorithm can be described as follows:

1. **Initialize replay buffer**.

2. Pre-process the environment and **feed state S to DQN**, which returns the Q-values of all possible actions in that state.

3. **Select an action** using the epsilon-greedy policy: with probability epsilon, select a random action A; with probability 1-epsilon, select the action that has the maximum Q-value, A = argmax_a Q(S, a, θ).

4. After selecting the action A, the agent **performs the chosen action** in state S, moves to a new state S’, and receives a reward R.

5. **Store transition** in replay buffer as <S,A,R,S’>.

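6. Next, **sample a random batch of transitions** from the replay buffer and calculate the loss using the formula:

   $$L(\theta) = \left(R + \gamma \max_{A'} Q(S', A', \theta') - Q(S, A, \theta)\right)^2$$

   Here θ denotes the parameters of the actual network and θ' those of the target network. This notebook applies the Huber loss to the same difference rather than squaring it.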

7. **Perform gradient descent** with respect to actual network parameters in order to minimize this loss.

8. After every k steps, **copy our actual network weights to the target network weights**.

9. Repeat these steps for M number of episodes.
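
Steps 6 and 7 can be sketched with NumPy on a made-up batch. The helper names `td_targets` and `huber` are assumptions for this illustration, and the Q-values are invented numbers, not network outputs. The sketch zeroes the bootstrap term for terminal transitions, which is the common textbook convention. The training cell in this notebook handles terminal frames differently, by setting their target to -1.

```python
import numpy as np


def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """R + gamma * max_a' Q_target(S', a'), with no bootstrap after terminal steps."""
    best_next = next_q_target.max(axis=1)
    return rewards + gamma * best_next * (1.0 - dones)


def huber(targets, predictions, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    error = np.abs(targets - predictions)
    quadratic = np.minimum(error, delta)
    linear = error - quadratic
    return float(np.mean(0.5 * quadratic ** 2 + delta * linear))


# Made-up batch of two transitions with four actions each
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])
next_q_target = np.array([[0.1, 0.5, 0.2, 0.0],
                          [0.3, 0.9, 0.4, 0.1]])
q_taken = np.array([1.0, 0.5])  # Q(S, A) of the actions actually taken

targets = td_targets(rewards, next_q_target, dones)
loss = huber(targets, q_taken)
print(targets, loss)
```

Only the Q-value of the action actually taken enters the loss, which is why the training code masks the network output with a one-hot encoding of the sampled actions.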

In [None]:
buffer = Buffer(100000)

episode_reward_hist = []
running_reward = 0
frames = 0
for episode in range(1, episodes+1):
    
    # Reset environment
    state = env.reset()
    state = np.array(state)
    state = np.transpose(state, [1, 2, 0])
    done = False
    episode_reward = 0
    steps = 0
    
    # Starting the episode
    while not done and steps < max_steps_per_episode:
        
        env.render()
        
        frames += 1
        steps += 1
        
        # Epsilon-greedy strategy
        if frames < initial_exploration_frames or np.random.rand() < epsilon:
            # Random action
            action = np.random.choice(env.action_space.n)
        else:
            # Get the predicted Q-values (one per action) from the model
            state_tensor = tf.convert_to_tensor(state)
            state_tensor = tf.expand_dims(state_tensor, 0)
            action_probs = model(state_tensor, training=False)
            
            # Get best action
            action = tf.argmax(action_probs[0]).numpy()
            
        # Compute action
        next_state, reward, done, _ = env.step(action)
        next_state = np.array(next_state)
        next_state = np.transpose(next_state, [1, 2, 0])
        episode_reward += reward
        
        # Save in Replay Buffer
        buffer.save(state, action, reward, next_state, done)
        state = next_state
        
        # Update epsilon
        epsilon -= (epsilon_max - epsilon_min) / decay_interval
        epsilon = max(epsilon, epsilon_min)
        
        # Update conditions
        if buffer.n_samples() > batch_size and frames % update_after_actions == 0:
            
            # Load samples
            state_sample, action_sample, reward_sample, next_state_sample, done_sample = buffer.sample(batch_size)
            
            # Estimate future rewards using the target network
            estimated_rewards = model_target.predict(next_state_sample)
            
            # Q-value estimate = current reward + estimated future reward
            estimated_q = reward_sample + gamma * tf.reduce_max(estimated_rewards, axis=1) 
            
            # For terminal frames, replace the target with -1
            estimated_q = estimated_q * (1 - done_sample) - done_sample
            
            # Create a mask so we only calculate loss on the updated Q-values
            masks = tf.one_hot(action_sample, env.action_space.n)
            
            with tf.GradientTape() as tape:
                
                # Current Q-values
                q_values = model(state_sample)
                q_action = tf.reduce_sum(tf.multiply(q_values, masks), axis=1)
                
                # Evaluate Loss
                loss = loss_function(estimated_q, q_action)
                
            # Backpropagation
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            
        if frames % update_target_network == 0:
            
            # Update Target network
            model_target.set_weights(model.get_weights())
            
            print(f'Running reward {running_reward:.2f} at frame {frames} (episode {episode})')
            
    episode_reward_hist.append(episode_reward)
    if len(episode_reward_hist) > 100:
        del episode_reward_hist[0]
        
    running_reward = np.mean(episode_reward_hist)
    
    if running_reward > 40:
        print(f'Solved at episode {episode}')
        break
        
env.close()

## Testing

In [None]:
test_episodes = 10
reward_hist = []

for test_episode in range(1, test_episodes+1):
    
    state = env.reset()
    done = False
    episode_reward = 0
    
    while not done:
        env.render()
        
        state = np.array(state)
        state = np.transpose(state, [1, 2, 0])
        state_tensor = tf.convert_to_tensor(state)
        state_tensor = tf.expand_dims(state_tensor, 0)
        
        action_probs = model(state_tensor)
        action = tf.argmax(action_probs[0]).numpy()
        
        state, reward, done, _ = env.step(action)
        episode_reward += reward
        
    reward_hist.append(episode_reward)
    
env.close()

print(f'Average episode reward: {np.mean(reward_hist)}')