# University of Aberdeen

## Applied AI (CS5079)

### Assessment 1 Task 1 - Reinforcement Learning from the Screen Frames

---


## Imports

In [None]:
# openai has preprocessing modules in their baselines repository such as FrameStack, NoopResetEnv, episode_life, etc.
!pip install git+https://github.com/openai/baselines.git


In [None]:
# Seed value used for achieving reproducibility
SEED_VALUE = 1337

# 1. Set the `PYTHONHASHSEED` environment variable to a fixed value and
#    request deterministic TensorFlow/cuDNN operations
import os
os.environ['TF_DETERMINISTIC_OPS'] = '1'
os.environ['TF_CUDNN_DETERMINISM'] = '1'
os.environ['PYTHONHASHSEED'] = str(SEED_VALUE)

# 2. Set the `python` built-in pseudo-random generator at a fixed value
import random
random.seed(SEED_VALUE)

# 3. Set the `numpy` pseudo-random generator at a fixed value
import numpy as np
np.random.seed(SEED_VALUE)

# 4. Set the `tensorflow` pseudo-random generator at a fixed value
import tensorflow as tf
tf.random.set_seed(SEED_VALUE)

# Import the Keras backend used for freeing the global state
# to avoid clutter
from tensorflow.keras import backend as K

# https://github.com/openai/baselines
from baselines.common.atari_wrappers import make_atari, wrap_deepmind, NoopResetEnv, FrameStack
from tensorflow import keras
from tensorflow.keras import layers

# For plotting graphs
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# OpenAI Gym
import gym
FRAME_STACK_SIZE = 5
# NOOP_FRAMES = 10
env = gym.make("Asterix-v0")
env = wrap_deepmind(env, frame_stack=False, scale=True, clip_rewards=False, episode_life=False)
# env = NoopResetEnv(env, noop_max=NOOP_FRAMES)
env = FrameStack(env, FRAME_STACK_SIZE)

env.seed(SEED_VALUE)

## Task 1.1

In [None]:
class RandomAgent():
    def __init__(self, env):
        """
        action_size of the agent is taken from the environment's action space
        """
        self.action_size = env.action_space.n
        
    def get_action(self, observation):
        return random.choice(range(self.action_size))
    
total_reward=0
agent = RandomAgent(env)
numberOfEpisodes = 1
 
for steps in range(numberOfEpisodes):
    current_obs = env.reset()
    done = False
    while not done:
        action = agent.get_action(current_obs)
        next_obs, reward, done, info = env.step(action)
        total_reward += reward
        env.render()
print("Average reward: {}".format(total_reward/numberOfEpisodes))

In [None]:
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)
print("Possible actions:", env.unwrapped.get_action_meanings())
print('Info dictionary:', env.step(action)[3])
print('Reward:', env.step(action)[1])

To display this, a simple agent that performs random actions for one episode is implemented.

Observations: The raw Asterix frame is an **RGB** image, i.e. an array of shape **(210, 160, 3)**, indicating a height of 210 pixels, a width of 160 pixels and 3 colour channels. Because `env` is wrapped for preprocessing (see Task 1.3), the `env.observation_space` printed above reports the preprocessed, frame-stacked shape rather than the raw one.  <br />  <br /> 
Action Space: `env.action_space` shows that the action space is **discrete** with **9 possible actions**, which are printed for better understanding using `env.unwrapped.get_action_meanings()`. Each action is performed repeatedly for a duration of ***k*** frames, where ***k*** is sampled uniformly from the set {2,3,4}. The observation space is bounded and non-negative: for raw frames, low (`env.observation_space.low`) is an array filled with zeros and high (`env.observation_space.high`) is an array filled with **255**. <br /> <br /> 
Reward: The reward for a single step is **0** unless the player collects a cauldron or another item that gives points; these rewards accumulate over the episode. <br /> <br /> 
Environment's info dictionary: The info dictionary contains `ale.lives()`, which refers to the number of lives left. In our case, the player has **only 3 lives**, which decrease as he collides with the lyres. <br /> <br /> 
Episode: With `episode_life=False`, an episode concludes only when the player has lost all lives, not after each lost life. When **done** is equal to **True**, the game is finished and the player has lost all lives. <br /> <br /> 




## Task 1.2

Please refer to the report for this task.

## Task 1.3

We use a dedicated wrapper (`wrap_deepmind` from OpenAI Baselines) to preprocess the environment. 

This reshapes the image to .... and converts the observations to grayscale, so the agent receives a grayscale observation. We also scale all observations to [0,1].

Another common preprocessing step is the introduction of frame skipping (Naddaf, 2010 at https://era.library.ualberta.ca/items/a661eb66-f2e0-4ed3-b501-b6cbcd1fdd9d), which restricts the agent's decision points by repeating a selected action for ***k*** consecutive frames, making the RL problem simpler and speeding up execution. 
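As a rough illustration of frame skipping, the sketch below wraps a toy environment and repeats each chosen action for a fixed number of frames, summing the rewards. `CountingEnv` and `FrameSkip` are hypothetical names introduced only for this example; they are not part of Gym or Baselines, and the fixed `skip=4` is an assumption (Asterix-v0 samples ***k*** itself).

```python
class CountingEnv:
    """Hypothetical toy environment: reward 1 per frame, ends after max_frames."""

    def __init__(self, max_frames=10):
        self.max_frames = max_frames
        self.frame = 0

    def reset(self):
        self.frame = 0
        return self.frame

    def step(self, action):
        self.frame += 1
        done = self.frame >= self.max_frames
        return self.frame, 1.0, done, {}


class FrameSkip:
    """Repeat each chosen action for up to `skip` frames and sum the rewards."""

    def __init__(self, env, skip=4):
        self.env = env
        self.skip = skip

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward = 0.0
        obs, done, info = None, False, {}
        for _ in range(self.skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:  # stop repeating once the episode has ended
                break
        return obs, total_reward, done, info


skip_env = FrameSkip(CountingEnv(max_frames=10), skip=4)
skip_env.reset()
decisions = 0
last_reward = 0.0
finished = False
while not finished:
    _, last_reward, finished, _ = skip_env.step(action=0)
    decisions += 1

# 10 frames are covered by only 3 agent decisions (4 + 4 + 2 frames)
print(decisions, last_reward)
```

The agent makes fewer decisions per episode, which is where the speed-up comes from.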

This notebook gives the agent a richer observation space by combining past frames with the most recent one, a technique known as frame stacking (Mnih et al., 2015 https://www.nature.com/articles/nature14236). We use ***5*** frames, but the algorithm might also be robust with different values such as 3 or 4. We experimented with all three options and found that a stack of 5 frames gave the highest total reward. Because frame stacking reduces partial observability, the agent can detect the direction of motion of in-game objects.
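Frame stacking can be sketched with a fixed-length queue, as below. This is a minimal illustration, not the Baselines `FrameStack` implementation; `FrameStacker` is a hypothetical helper, and the 84 x 84 frame size is an assumption about the preprocessed frames.

```python
from collections import deque

import numpy as np


class FrameStacker:
    """Keep the k most recent frames and expose them as one (H, W, k) array."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Fill the whole stack with the first frame of the episode
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.observation()

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        return np.stack(list(self.frames), axis=-1)


stacker = FrameStacker(k=5)
stacked = stacker.reset(np.zeros((84, 84)))
stacked = stacker.push(np.ones((84, 84)))
print(stacked.shape)  # (84, 84, 5)
```

The newest frame sits in the last channel, so differences between channels encode motion.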




In [None]:
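obs = env.reset()
print(env.observation_space.shape)
# FrameStack returns LazyFrames, so convert to a NumPy array before slicing
frames = np.array(obs)
# Show the first frame of the stack
plt.imshow(frames[:, :, 0], cmap='gray')
plt.show()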

Please refer to the report for a more in-depth explanation.

## Task 1.4

The network architecture is based on Mnih et al. (2015), https://www.nature.com/articles/nature14236, with a second fully-connected hidden layer added.

The first hidden layer convolves 32 filters of 8 × 8 with stride 4 with the input image and applies a rectifier nonlinearity. The second hidden layer convolves 64 filters of 4 × 4 with stride 2, again followed by a rectifier nonlinearity. This is followed by a third convolutional layer that convolves 64 filters of 3 × 3 with stride 1 followed by a rectifier. The final two hidden layers are fully-connected and consist of 512 and 256 rectifier units respectively. The output layer is a fully-connected linear layer with a single output for each valid action. In Mnih et al. (2015), the number of valid actions varied between 4 and 18 across the games considered; Asterix has 9.
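To make the layer sizes concrete, the following sketch computes the spatial output size of the three convolutional layers and the number of units fed into the first fully-connected layer. It assumes an 84 x 84 input frame (an assumption about the preprocessing) and compares `"valid"` padding with the `"same"` padding used in `q_network`; `conv_out` and `flattened_units` are hypothetical helper names.

```python
import math

# (kernel size, stride, number of filters) for the three convolutional layers
LAYERS = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]


def conv_out(size, kernel, stride, padding):
    """Output size of a convolution along one spatial dimension."""
    if padding == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1  # "valid" padding


def flattened_units(input_size, padding):
    """Number of values produced by Flatten() after the last conv layer."""
    size = input_size
    filters = 0
    for kernel, stride, filters in LAYERS:
        size = conv_out(size, kernel, stride, padding)
    return size * size * filters


valid_units = flattened_units(84, "valid")  # 84 -> 20 -> 9 -> 7
same_units = flattened_units(84, "same")    # 84 -> 21 -> 11 -> 11
print(valid_units)  # 3136
print(same_units)   # 7744
```

With `"same"` padding the flattened vector is larger, so the first dense layer has more weights than it would with `"valid"` padding.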

In [None]:
# Import the garbage collector package
import gc

def reset_random_seeds():
    """Reset the random number generator seed to achieve full reproducibility 
    even when running the script on GPU
    """
    
    # Set the environment determinism to guarantee
    # reproducibility of the results
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    os.environ['TF_CUDNN_DETERMINISM'] = '1'
    os.environ['PYTHONHASHSEED']=str(SEED_VALUE)
    tf.random.set_seed(SEED_VALUE)
    np.random.seed(SEED_VALUE)
    random.seed(SEED_VALUE)
    
    # Perform garbage collection
    gc.collect()
    print("Random number generator seed reset!")  # optional

### Defining global variables and importing packages for training the models

In [None]:
# Import the Keras layer used for building our models
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D,\
MaxPool2D, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.initializers import HeUniform

INPUT_SHAPE = env.observation_space.shape
NUM_ACTIONS = env.action_space.n
BATCH_SIZE = 32
MEMORY_SIZE = 25000
EXPLORATION_STEPS = 20000
LEARNING_RATE = 0.00025

In [None]:
class SumTree:
    """Code is adapted from an open-source MIT-licensed GitHub repository.
    It implements an optimisation experience replay memory technique that can
    be used for all Atari games DQNs. 
    
    Source: https://github.com/rlcode/per
    Args:
        capacity (int): size of the sum tree
    """
    # The data pointer class variable
    write = 0

    def __init__(self, capacity):
        """Initialize the tree with all nodes and data with values set to 0.
        """
        # Number of leaf nodes (final nodes) that contains experiences
        self.capacity = capacity
        
        # Generate the tree with all node values = 0
        # A binary tree with `capacity` leaves has (2 * capacity - 1) nodes:
        # Parent nodes = capacity - 1
        # Leaf nodes = capacity
        self.tree = np.zeros(2 * capacity - 1)
        
        # Store the experiences (so the size of data is capacity)
        self.data = np.zeros(capacity, dtype=object)

    def _propagate(self, idx, change):
        """Recursively propagate a given child node change through the sum tree.
        """
        
        # Get the parent node index
        parent = (idx - 1) // 2
        
        # Apply the child node change to the parent node
        self.tree[parent] += change
        
        # Do this until the root node is reached
        if parent != 0:
            self._propagate(parent, change)

    def _retrieve(self, idx, s):
        """Return a leaf node (priority) index for the current experience (observation)
        """
        left = 2 * idx + 1
        right = left + 1

        if left >= len(self.tree):
            return idx

        if s <= self.tree[left]:
            return self._retrieve(left, s)
        else:
            return self._retrieve(right, s - self.tree[left])

    def total(self):
        """Return the root node (total sum of the experience priority values).
        """
        return self.tree[0]

    def add(self, p, data):
        """Add the experience and its priority score (as a leaf) to the sum tree.
        """
        # Get the tree index for the experience (observation)
        idx = self.write + self.capacity - 1
        
        # Update the sum tree data
        self.data[self.write] = data
        
        # Update the sum tree probability values
        self.update(idx, p)
        
        # Increment the data pointer
        self.write += 1
        
        # If over the capacity, go back to first index (we overwrite)
        if self.write >= self.capacity:
            self.write = 0

    def update(self, idx, p):
        """Update the leaf priority score and propagate the change through tree.
        """
        # Change = new priority score - former priority score
        change = p - self.tree[idx]
        self.tree[idx] = p
        
        # Propagate the change through tree
        self._propagate(idx, change)

    def get(self, s):
        """Return the leaf index, priority value and its associated experience (observation).
        """
        # Get the leaf index
        idx = self._retrieve(0, s)
        
        # Get the experience (observation) index
        dataIdx = idx - self.capacity + 1

        return (idx, self.tree[idx], self.data[dataIdx])
    

#-------------------- MEMORY --------------------------
class Memory:
    """Code is adapted from an open-source MIT-licensed GitHub repository.
    It implements an optimisation experience replay memory technique that can
    be used for all Atari games DQNs. 
    
    Source: https://github.com/rlcode/per
    
    Args:
        capacity (int): size of the sum tree
    """
    # Hyperparameter to avoid assigning 0 probability to experiences
    e = 0.01
    
    # Hyperparameter to make a trade-off between random sampling and taking
    # a high priority experience (observation)
    a = 0.6

    def __init__(self, capacity):
        """Initialise the sum tree with the given capacity.
        """
        self.tree = SumTree(capacity)

    def _getPriority(self, error):
        return (error + self.e) ** self.a

    def add(self, error, sample):
        """Store a new experience in the tree along with its corresponding priority value.
        """
        # Get the priority from the experience (observation) error
        p = self._getPriority(error)
        
        # Store the priority and the experience (observation)
        self.tree.add(p, sample) 

    def sample(self, n):
        """Sample a n-sized batch of priority index and observation pair.
        """
        # Create a list to hold the batch pairs
        batch = []
        
        # Calculate and store the priority segment
        segment = self.tree.total() / n
        
        # Populate the batch list
        for i in range(n):
            a = segment * i
            b = segment * (i + 1)

            s = random.uniform(a, b)
            (idx, p, data) = self.tree.get(s)
            batch.append( (idx, data) )

        return batch

    def update(self, idx, error):
        """Update the sum tree leaves (priorities).
        """
        # Get the priority from the experience (observation) error
        p = self._getPriority(error)
        
        # Update the sum tree probability values
        self.tree.update(idx, p)
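
The relationship between error, priority and sampling probability in `Memory` can be illustrated without the tree. The sketch below is a simplified stand-in: it uses a cumulative-sum list and binary search instead of `SumTree`, and `priority` and `stratified_sample` are hypothetical helper names. The constants mirror `Memory.e` and `Memory.a`.

```python
import random
from bisect import bisect_left
from itertools import accumulate

E = 0.01  # keeps zero-error experiences sampleable (same role as Memory.e)
A = 0.6   # 0 = uniform sampling, 1 = fully proportional (same role as Memory.a)


def priority(error, e=E, a=A):
    """Priority of an experience given its absolute TD error."""
    return (abs(error) + e) ** a


def stratified_sample(priorities, n, rng):
    """Draw one index from each of n equal-width segments of the priority mass."""
    cumulative = list(accumulate(priorities))
    segment = cumulative[-1] / n
    picks = []
    for i in range(n):
        s = rng.uniform(segment * i, segment * (i + 1))
        # First index whose cumulative priority reaches s
        idx = bisect_left(cumulative, s)
        picks.append(min(idx, len(priorities) - 1))
    return picks


errors = [0.0, 0.5, 2.0]
priorities = [priority(err) for err in errors]
total = sum(priorities)
probabilities = [p / total for p in priorities]
picks = stratified_sample(priorities, n=4, rng=random.Random(0))

print([round(p, 3) for p in probabilities])
print(picks)
```

Larger errors receive a larger share of the probability mass, while the `E` offset ensures that an experience with zero error can still be drawn. The sum tree gives the same lookup in logarithmic time and makes priority updates cheap.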

In [None]:
from collections import deque
import math

MAX_EPSILON = 1
MIN_EPSILON = 0.1
LAMBDA = - math.log(0.01) / EXPLORATION_STEPS  # speed of decay

class DDQNAgent():
    """Code is adapted from an open-source MIT-licensed GitHub repository.
    It implements a Double DQN (DDQN) agent with a prioritised experience
    replay memory that can be used for all Atari games DQNs. 
    
    Source: https://github.com/jaromiru/AI-blog
    
    Args:
        env (AtariEnv): the environment that RLNetwork and Agent are optimized on
        lr (float): learning rate of the network
    """
    steps = 0
    
    def __init__(self, env, lr=LEARNING_RATE):
        """Initialise the agent and network for the environment.
        """
        self.lr = lr
        
        self.memory = Memory(MEMORY_SIZE)
        
        self.loss_val = np.inf
        self.action_size = NUM_ACTIONS
        
        # Disable eager execution which boosts runtime
        # Eager execution is generally used for debugging purposes
        tf.compat.v1.disable_eager_execution()
        self.discount_rate = 0.99

        # Create the two networks for predicting the actions
        
        # The first model makes the predictions for Q-values 
        # which are used to select an action.
        self.online = self.q_network()
        
        # Build a target model for the prediction of future
        # rewards. The weights of a target model get updated 
        # every 10,000 steps thus when the loss between the 
        # Q-values is calculated the target Q-value is stable.
        self.target = self.q_network()

        #The "target" DNN will take the values of the "online" DNN
        self.update_target()


    def q_network(self, filters_1=32, filters_2=64, filters_3=64):
        """Define and return the CNN model architecture.
        """
        # Ensure reproducibility of the results
        # by resetting the random seeds
        reset_random_seeds()

        # Build the model
        kernel_init = HeUniform()
        model = Sequential()
        model.add(Conv2D(filters_1, kernel_size=8, padding="same", strides=4, activation='relu', input_shape=INPUT_SHAPE, kernel_initializer=kernel_init))
        model.add(Conv2D(filters_2, kernel_size=4, padding="same", strides=2, activation='relu', kernel_initializer=kernel_init))
        model.add(Conv2D(filters_3, kernel_size=3, padding="same", strides=1, activation='relu', kernel_initializer=kernel_init))
        model.add(Flatten())
        model.add(Dense(512, activation='relu', kernel_initializer=kernel_init))
        model.add(Dense(256, activation='relu', kernel_initializer=kernel_init))
        model.add(Dense(NUM_ACTIONS, activation="linear", kernel_initializer=kernel_init))
        
        # The DeepMind paper uses RMSProp; the Adam optimizer is used here
        # because it improves training time
        model.compile(loss="huber_loss", optimizer=Adam(learning_rate=self.lr))
        return model
            
    def update_target(self):
        """Update the target network with the online network weights.
        """
        # Get the online DQN weights
        online_weights = self.online.get_weights()
        
        # Update the target DQN weights
        self.target.set_weights(online_weights)
        
    #---- CHOOSING ACTION ----
    def get_action(self, state, step):
        """Choose whether to explore or exploit, based on the current epsilon.
        """
        if step <= EXPLORATION_STEPS:
            return np.random.randint(self.action_size)
        else:     
            self.steps += 1
            epsilon = MIN_EPSILON + (MAX_EPSILON - MIN_EPSILON) * math.exp(-LAMBDA * self.steps)
            if np.random.rand() < epsilon:
                return np.random.randint(self.action_size)# random action
            else:
                q_values = self.online.predict(state.reshape(-1, *INPUT_SHAPE))
                return np.argmax(q_values) # optimal action
    
    def get_train_data(self, batch):
        """Decide on the input, desired output and error of the current state.
        """
        no_state = np.zeros(INPUT_SHAPE)
        
        prev_states = np.array([obs[1][0] for obs in batch])
        
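        # Use an all-zero state when the episode is done. Truthiness is used
        # instead of `is True` so that NumPy booleans are handled as well.
        next_states = np.array([(no_state if obs[1][4] else obs[1][3]) for obs in batch])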
    
        prev_q_vals = self.online.predict(prev_states)
        
        next_q_vals_double = self.online.predict(next_states)
        next_q_vals = self.target.predict(next_states)
        
        X = np.zeros((len(batch), *INPUT_SHAPE))
        Y = np.zeros((len(batch), NUM_ACTIONS))
        errors = np.zeros(len(batch))
        
        for idx in range(len(batch)):
            
            # Unpack the current batch sample
            curr_state, action, reward, next_state, done = batch[idx][1]
            
            q_val = prev_q_vals[idx]
            prev_q_val = q_val[action]
            
            # Future q value
            future_q_val = q_val
            
            if done:
                future_q_val[action] = reward
            else:
                future_q_val[action] = reward + next_q_vals[idx][np.argmax(next_q_vals_double[idx])] * self.discount_rate
            
            X[idx] = curr_state
            Y[idx] = future_q_val
            errors[idx] = abs(prev_q_val - future_q_val[action])
        
        return (X, Y, errors)
    
    def save_to_memory(self, curr_state, action, reward, next_state, done, step):
        """Update the sum tree priorities and observations (samples).
        """
        sample = (curr_state, action, reward, next_state, done)
        if step <= EXPLORATION_STEPS:
            error = abs(sample[2])  # Reward
            self.memory.add(error, sample)
        else:
            X, Y, errors = self.get_train_data([(0, sample)])
            self.memory.add(errors[0], sample)
        
    def train(self, step):
        """Train the online model and update the loss value.
        """
        
        batch = self.memory.sample(BATCH_SIZE)
        X, Y, errors = self.get_train_data(batch)
        
        # Update errors
        for i in range(len(batch)):
            idx = batch[i][0]
            self.memory.update(idx, errors[i])
        
        hist = self.online.fit(X, Y, batch_size=BATCH_SIZE, epochs=1, verbose=0, shuffle=True)
        self.loss_val = hist.history['loss'][0]
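
The target value computed in `get_train_data` follows the Double DQN idea: the online network selects the next action and the target network evaluates it. A minimal sketch with plain lists is shown below; `double_dqn_target` is a hypothetical helper and the Q-values are made-up numbers chosen only to show the effect.

```python
def double_dqn_target(reward, done, online_next_q, target_next_q, gamma=0.99):
    """Return the training target for the action that was taken."""
    if done:
        # No future reward after a terminal transition
        return reward
    # The online network selects the best next action ...
    best_action = max(range(len(online_next_q)), key=lambda a: online_next_q[a])
    # ... and the target network evaluates that action
    return reward + gamma * target_next_q[best_action]


online_next_q = [1.0, 3.0, 2.0]  # illustrative values only
target_next_q = [0.5, 1.5, 4.0]  # illustrative values only

ddqn_target = double_dqn_target(1.0, False, online_next_q, target_next_q)
terminal_target = double_dqn_target(1.0, True, online_next_q, target_next_q)
# A plain DQN target would take the maximum of the target network directly
dqn_target = 1.0 + 0.99 * max(target_next_q)

print(round(ddqn_target, 3), round(dqn_target, 3), terminal_target)
```

In this example the Double DQN target is lower than the plain DQN target, which illustrates how decoupling action selection from evaluation can reduce overestimation.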
    


Please refer to the report for further information regarding the deployment and parameter adjustments of the agent.
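
The exploration schedule used in `get_action` can be checked in isolation. The sketch below copies the constants from the notebook and evaluates epsilon for a few step counts; `epsilon_at` is a hypothetical helper, and `steps` here corresponds to `self.steps`, which only starts counting after the initial purely random phase.

```python
import math

MAX_EPSILON = 1.0
MIN_EPSILON = 0.1
EXPLORATION_STEPS = 20000
LAMBDA = -math.log(0.01) / EXPLORATION_STEPS  # speed of decay


def epsilon_at(steps):
    """Exponentially decay epsilon from MAX_EPSILON towards MIN_EPSILON."""
    return MIN_EPSILON + (MAX_EPSILON - MIN_EPSILON) * math.exp(-LAMBDA * steps)


schedule = {steps: epsilon_at(steps) for steps in (0, 10000, 20000, 40000)}
for steps, eps in schedule.items():
    print(steps, round(eps, 4))
```

With this choice of `LAMBDA`, the decaying part of epsilon has shrunk to 1% of its initial size after `EXPLORATION_STEPS` decayed steps, so epsilon is then about 0.109.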

## Task 1.5

In [None]:
import time
agent = DDQNAgent(env)  
ep_rewards = []
ep_steps = []
total_reward = 0
N_STEPS = 1000000  # total number of training steps
save_steps = 50


# Train the online model after every 4 actions
TRAIN_ONLINE_STEPS = 4

# Update the target network every 10,000 steps
# Considered as a hyperparameter
UPDATE_TARGET_STEPS = 10000

done=True
for step in range(N_STEPS):
    total_perc = step * 100 / N_STEPS
    print(f"\r\tAction step: {step}/{N_STEPS} ({total_perc:.2f}%)\tLoss: {agent.loss_val:5f}", end="")
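    if done: # game over, start again
        # Average reward per life (the player starts with 3 lives)
        avg_reward = int(total_reward/3)
        if step:
            ep_rewards.append(total_reward)
            ep_steps.append(step)
        # Avoid taking the mean of an empty list before the first episode ends
        total_mean = np.mean(ep_rewards) if ep_rewards else 0.0
        print(f"\tAVG reward: {avg_reward}\tTotal mean: {total_mean}")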

        obs = env.reset()
        state = np.array(obs)
        total_reward = 0


    # Get an exploration/exploitation action
    action = agent.get_action(state, step)

    # Take a step in the game environment
    next_state, reward, done, info = env.step(action)
    
    # Convert to NumPy array
    next_state = np.array(next_state)
    
    # Update the sum tree priorities and observations (samples)
    agent.save_to_memory(state, action, reward, next_state, done, step)
    
    # Skip training the agent if still exploring
    if step > EXPLORATION_STEPS:
        
        # Train the online DDQN every 4th step
        if step % TRAIN_ONLINE_STEPS == 0:
            agent.train(step)

        # Regularly copy the online DDQN to the target DDQN
        if step % UPDATE_TARGET_STEPS == 0:
            agent.update_target()
    
    env.render()
    total_reward += reward
    state = next_state
env.close()

## Plotting

In [None]:
import pandas as pd
# train_df = pd.read_csv('sumtree_ddqn_data.csv')
# train_df.iloc[:, 1:]
# Create data frame for the obtained training data
train_df = pd.DataFrame(data={'Step': ep_steps, 'Reward': ep_rewards})

# Calculate cumulative mean of the rewards
train_df['Total Mean'] = train_df['Reward'].expanding().mean()

# Calculate cumulative sum of the rewards
train_df['Total Reward'] = train_df['Reward'].cumsum()
train_df.head(n=5)

In [None]:
# Function for plotting the DDQN training data
def plot_df(df, cols, x_label, y_label, title, title_fontsize=20, label_fontsize=16):
    plt.rcParams['figure.figsize']= (20, 6)
    for col in cols:
        df[col].plot(fontsize=12)
    plt.legend(loc=2, prop={'size': 14})
    plt.xlabel(x_label, fontsize=label_fontsize)
    plt.ylabel(y_label, fontsize=label_fontsize)
    plt.title(title, fontsize=title_fontsize)
    plt.show()

In [None]:
def plot_boxplot(df, cols, models, x_label, y_label, title, title_fontsize=20, label_fontsize=16):
    plt.rcParams['figure.figsize']= (20, 6)
    plt.xlabel(x_label, fontsize=label_fontsize)
    plt.ylabel(y_label, fontsize=label_fontsize)
    plt.title(title, fontsize=title_fontsize)
    plt.boxplot(df[cols])
    
    initial_labels = []
    final_labels = []
    for idx, model in enumerate(models):
        initial_labels.append(idx+1)
        final_labels.append(model)
    plt.xticks(initial_labels, final_labels)
    plt.show()

In [None]:
# Plot the training reward cumulative sum throughout the episodes (1 episode = 3 player lives)
plot_df(train_df, cols=['Total Reward'], x_label='Episode', y_label='Reward', title='DDQN Agent Training Total Reward')

In [None]:
# Plot the training reward values throughout the episodes (1 episode = 3 player lives)
plot_df(train_df, cols=['Reward', 'Total Mean'], x_label='Episode', y_label='Reward', title='DDQN Agent Training Rewards and Trendline')

In [None]:
plot_boxplot(train_df, ['Reward'], ['DDQN-Image'], x_label='Current Models', y_label='Reward', title='DDQN Agent Training Rewards BoxPlot')

In [None]:
# Save the obtained DDQN training data to a CSV file
train_df.to_csv('sumtree_ddqn_data.csv')

In [None]:
# Create a HDF5 file with the trained online model
# with all the details necessary to reconstitute it. 
online_model = agent.online
online_model.save('trained_ddqn_online_model.h5') 

# Create a HDF5 file with the trained target model
# with all the details necessary to reconstitute it. 
target_model = agent.target
target_model.save('trained_ddqn_target_model.h5')  