In [1]:
import numpy as np
import pandas as pd
import keras
import gym
import os
import h5py

from keras.models import Sequential, load_model
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2
import tensorflow as tf
import matplotlib.pyplot as plt
from gym import wrappers
%matplotlib inline
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
from plotly import tools
from tensorflow.keras.losses import MeanSquaredError
import time


import keras.backend as K

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# Background Research
The Lunar Lander environment is an environment from the OpenAI Gym. The objective in this environment is to land a spaceship safely between two flags. The space between the flags is always at the (0,0) coordinates.

The environment has a state space with 8 attributes:
- X-position
- Y-position
- X-velocity
- Y-velocity
- Angle of the Lander
- Angular Velocity
- If the left leg has touched the ground
- If the right leg has touched the ground

It also has a discrete action space with 4 actions:
- Fire right engine
- Fire left engine
- Fire main engine
- Do nothing
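
To make the two lists above concrete, here is a minimal sketch that labels the raw observation vector and the action indices. The `Action` enum and the `LanderState` helper are hypothetical names introduced only for illustration. The index order of the actions (0 to 3) is an assumption based on the Gym documentation for LunarLander-v2, and the example observation is made up.

```python
from enum import IntEnum
from typing import NamedTuple, Sequence


class Action(IntEnum):
    # Assumed index order of the discrete LunarLander-v2 action space
    DO_NOTHING = 0
    FIRE_LEFT_ENGINE = 1
    FIRE_MAIN_ENGINE = 2
    FIRE_RIGHT_ENGINE = 3


class LanderState(NamedTuple):
    # Hypothetical helper: field order follows the state attribute list above
    x: float
    y: float
    x_velocity: float
    y_velocity: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def decode_observation(obs: Sequence[float]) -> LanderState:
    """Turn the raw 8-element observation into a labelled record."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    x, y, vx, vy, angle, angular_velocity, left, right = obs
    return LanderState(float(x), float(y), float(vx), float(vy),
                       float(angle), float(angular_velocity),
                       bool(left), bool(right))


# Made-up observation, for illustration only
example = decode_observation([0.1, 1.4, -0.2, -0.5, 0.05, 0.0, 0.0, 1.0])
print(example.right_leg_contact, Action(2).name)
```

The six motion attributes are continuous values, while the two leg-contact attributes are effectively boolean flags.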

The rewards for this environment are calculated as follows:
- Moving from the top of the screen to the landing pad and coming to rest is worth about 100 to 140 points
- If the lander moves away from the landing pad, it loses reward
- If the lander crashes, it receives -100 points; if it lands, it receives +100 points
- Each leg with ground contact is worth +10 points
- Firing the main engine costs 0.3 points per frame (fuel is infinite)
- Firing a side engine costs 0.03 points per frame
- The environment is considered solved at 200 points

Each episode will also automatically terminate if:
- The lander crashes on contact with the moon
- The lander leaves the viewport, i.e. the absolute value of its x coordinate is greater than 1

----
# Problem
We were tasked with training a Reinforcement Learning algorithm to solve the LunarLander-v2 environment, that is, to land successfully on the landing pad.

Reinforcement Learning is an area of machine learning in which an intelligent agent is trained to take a sequence of actions in an environment so as to maximize the reward it receives. To make the agent do what we want, we reward it when it executes a good action and penalise it when it executes a bad one. This reward scheme is already defined for the LunarLander environment, as seen in the background research above. The goal is to train a Reinforcement Learning algorithm to take the actions that yield the highest reward, since a high reward means that the algorithm has landed the lander successfully.

----
# Solution
We will explore 2 Reinforcement Learning algorithms: Deep Q Network (DQN) as our baseline model and its improved counterpart, Double Deep Q Network (DDQN), as our final model. We chose these 2 models so that we can compare the DDQN against the DQN and discuss the differences that cause the improvements.

Q-learning is a value-based learning algorithm. Value-based algorithms update a value function using an update rule derived from the Bellman equation, whereas policy-based algorithms optimise the policy directly. Q-learning is also an off-policy learner, meaning that it learns the value of the optimal policy independently of the actions the agent actually takes. An on-policy learner, by contrast, learns the value of the policy the agent is following, including its exploration steps, so the policy it finds takes the exploration inherent in that policy into account.

A Q table stores one value for every state-action pair, as seen in this diagram:

<img src='qtable.jpg'>

From the diagram we can see that each state has multiple possible actions. For our problem, there are 4 possible actions in any given state, as described in our action space. Each state consists of a different combination of observation values, which are taken from the environment to 'see' what state the lunar lander is in. From the background research, we know that each state of the Lunar Lander environment has 8 attributes. In a regular Q table, we would have to populate the table with a Q value for every state-action pair in order to find the optimal action in each state. Because six of the 8 attributes are continuous, the number of possible states is effectively unbounded, so working out every possible state and the best action in it would require an enormous amount of computing power.
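
As a point of reference, the tabular update that a regular Q table relies on can be sketched as follows. This is a minimal illustration on made-up discrete states ("s0", "s1"), not the method used in this notebook; the learning rate `ALPHA` and the helper `q_update` are assumptions introduced for the example.

```python
from collections import defaultdict

N_ACTIONS = 4   # matches the Lunar Lander action space
ALPHA = 0.1     # learning rate (illustrative value)
GAMMA = 0.99    # discount factor

# One row of Q values per state, created lazily with all zeros
q_table = defaultdict(lambda: [0.0] * N_ACTIONS)


def q_update(state, action, reward, next_state, done):
    """Apply one tabular Q-learning update and return the new Q value."""
    # No future reward is bootstrapped from a terminal state
    best_next = 0.0 if done else max(q_table[next_state])
    td_target = reward + GAMMA * best_next
    q_table[state][action] += ALPHA * (td_target - q_table[state][action])
    return q_table[state][action]


# Two made-up transitions between hypothetical discrete states
q_update("s1", 2, 10.0, "terminal", done=True)
q_update("s0", 0, -1.0, "s1", done=False)
print(q_table["s0"][0], q_table["s1"][2])
```

Every distinct state needs its own row in `q_table`, which is what makes this approach impractical for continuous observations.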

To speed up the process, we will use a Deep Q Network, in which a neural network is trained to predict the Q value of each action from the state we input into it. This means we do not need to populate the whole Q table and look up the best action at every step: the model predicts which action to take based on the values of the state fed into the network.

We use the Double Deep Q Network as our final model because it is an improved architecture built to fix weaknesses of Deep Q Learning, such as the overestimation of Q values shown in [Deep Reinforcement Learning with Double Q-Learning](https://arxiv.org/pdf/1509.06461.pdf). The DDQN architecture not only reduces the overestimation seen in DQN, it is also reported to perform better than DQN in several Atari games.

# Functions

In [6]:
def create_trace(x, y, ylabel, color):
    trace = go.Scatter(
        x=x, y=y, name=ylabel, marker=dict(color=color), mode="markers+lines", text=x
    )
    return trace

from plotly.subplots import make_subplots


def plot_graph(data):
    avg = data["average"]
    reward = data["reward"]
    epochs = list(range(1, len(reward) + 1))

    trace_avg = create_trace(epochs, avg, "Avg Reward", "Green")
    trace_reward = create_trace(epochs, reward, "Reward", "Red")

    # plotly.tools.make_subplots is deprecated in favour of plotly.subplots.make_subplots
    fig = make_subplots(
        rows=1,
        cols=1,
        subplot_titles=(
            "Reward and Last 100 Average Reward",
        ),
    )
    fig.append_trace(trace_avg, 1, 1)
    fig.append_trace(trace_reward, 1, 1)
    fig["layout"]["xaxis"].update(title="Epoch")
    fig.update_layout(height=800, width=2000)

    iplot(fig, filename="accuracy-loss")

In [7]:
class FileLogger():
  def __init__(self, file_name='progress.log'):
    self.file_name = file_name
    self.clean_progress_file()

  def log(self, episode, reward, average_reward):
    # append one semicolon-separated row per episode
    with open(self.file_name, 'a') as f:
      f.write(f"{episode};{reward};{average_reward}\n")

  def clean_progress_file(self):
    # start from a fresh file that contains only the header row
    with open(self.file_name, 'w') as f:
      f.write("episode;reward;average\n")

# Deep Q Learning (DQN)

## Experience replay
In a normal Q-learning algorithm, the past experiences of the agent are not saved and reused. A past experience is defined as a set consisting of a state, the action taken, the reward received and the resultant state.

Experience replay in DQN means that, instead of throwing away these past experiences, we save them in memory as history so that the model gets a chance to revisit them and update its policy based on past state transitions.

There are 2 main advantages of experience replay:

1. Revisiting and learning from past experiences multiple times allows our model to learn more efficiently, as policy updates are incremental (based on the learning rate).

2. Better convergence when training, because randomly sampled experiences are less correlated than consecutive ones.

In order to implement this, we create a replay buffer class with attributes such as the memory size, which specifies the maximum size of our history.

We store the data in separate numpy arrays for the states, actions, rewards, resultant states and terminal flags.

At every step of an episode, when we train our deep learning network, we use the sample_buffer function to randomly select a batch of past experiences to feed to our network.

In [None]:
class ReplayBuffer(object):
    def __init__(self, max_size, input_shape, n_actions, discrete=False):
        self.mem_size = max_size
        self.mem_cntr = 0
        self.discrete = discrete
        self.state_memory = np.zeros((self.mem_size, input_shape))
        self.new_state_memory = np.zeros((self.mem_size, input_shape))
        dtype = np.int8 if self.discrete else np.float32
        self.action_memory = np.zeros((self.mem_size, n_actions), dtype=dtype)
        self.reward_memory = np.zeros(self.mem_size)
        self.terminal_memory = np.zeros(self.mem_size, dtype=np.float32)

    def store_transition(self, state, action, reward, state_, done):
        index = self.mem_cntr % self.mem_size
        self.state_memory[index] = state
        self.new_state_memory[index] = state_
        # store one hot encoding of actions, if appropriate
        if self.discrete:
            actions = np.zeros(self.action_memory.shape[1])
            actions[action] = 1.0
            self.action_memory[index] = actions
        else:
            self.action_memory[index] = action
        self.reward_memory[index] = reward
        self.terminal_memory[index] = 1 - done
        self.mem_cntr += 1

    def sample_buffer(self, batch_size):
        max_mem = min(self.mem_cntr, self.mem_size)
        batch = np.random.choice(max_mem, batch_size)

        states = self.state_memory[batch]
        actions = self.action_memory[batch]
        rewards = self.reward_memory[batch]
        states_ = self.new_state_memory[batch]
        terminal = self.terminal_memory[batch]

        return states, actions, rewards, states_, terminal

## Huber loss

We will be using the Huber loss when compiling our model. With mean squared error, the loss grows quadratically with the error, so large errors produce very high loss values, which can cause the model to over-correct.

This makes training more likely to fail to converge. The Huber loss is less prone to reaching extreme values, as it is only quadratic up to a certain threshold (clip_delta), beyond which it becomes linear.

This reduces the chance of the model over-correcting, which in turn leads to more stable training.
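
The difference is easiest to see on a few numbers. The sketch below compares the two losses on single error values, using the same quadratic and linear branches as the loss function defined in the next cell. The function names and the sample errors are chosen only for illustration.

```python
def squared_loss(error):
    """Half squared error: grows quadratically with the error."""
    return 0.5 * error ** 2


def huber_loss(error, delta=1.0):
    """Quadratic for |error| < delta, linear beyond it."""
    abs_error = abs(error)
    if abs_error < delta:
        return 0.5 * error ** 2
    return delta * abs_error - 0.5 * delta ** 2


for error in (0.5, 1.0, 10.0, 100.0):
    print(f"error={error:>6}: squared={squared_loss(error):>8.2f}  "
          f"huber={huber_loss(error):>6.2f}")
```

For small errors the two losses agree, while for an error of 100 the squared loss is 5000 and the Huber loss is 99.5, so a single outlier has far less influence on the update.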

In [3]:
def masked_huber_loss(mask_value, clip_delta):
    def f(y_true, y_pred):
        error = y_true - y_pred
        cond  = K.abs(error) < clip_delta
        mask_true = K.cast(K.not_equal(y_true, mask_value), K.floatx())
        masked_squared_error = 0.5 * K.square(mask_true * (y_true - y_pred))
        linear_loss  = mask_true * (clip_delta * K.abs(error) - 0.5 * (clip_delta ** 2))
        huber_loss = tf.where(cond, masked_squared_error, linear_loss)
        return K.sum(huber_loss) / K.sum(mask_true)
    f.__name__ = 'masked_huber_loss'
    return f

## Deep learning network

We will be using a simple fully connected network with 2 hidden dense layers of 256 nodes, each followed by a ReLU activation.

The output dense layer has 4 nodes, each representing the Q value of an action the agent is able to take.

We will be using the Adam optimiser and the Huber loss for this model.

In [None]:
def get_dqn(lr, n_actions, input_dims):
    model = Sequential([
                Dense(256, input_shape=(input_dims,)),
                Activation('relu'),
                Dense(256),
                Activation('relu'),
                Dense(n_actions)])

    # `lr` is a deprecated alias; `learning_rate` is the supported argument name
    model.compile(optimizer=Adam(learning_rate=lr), loss=masked_huber_loss(0,1))

    return model

## Agent

We will now build our agent. The agent holds all the hyperparameters needed to optimize the policy, as well as the replay buffer and the deep learning network.

The remember function stores each set of state, action, reward and resultant state in the history of the ReplayBuffer class.

The learn function calls the ReplayBuffer class to randomly sample 64 sets of states, actions, rewards and resultant states from the experience history, and uses them to calculate the new Q values for the states with the Bellman equation. The Bellman equation looks like this:

<img src = 'https://cdn.discordapp.com/attachments/345195124374110218/944283494501187724/image-1.png'>

In Q-learning, an agent uses the Bellman equation to update the Q values of state-action pairs. We use this formula to compute the new Q values for the 64 states received from the ReplayBuffer class and fit our deep learning network to them.
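
Written out, the target used for a sampled transition is $y = r + \gamma \max_{a'} Q(s', a')$, with the second term dropped when $s'$ is terminal. The sketch below reproduces that arithmetic on a made-up batch of 3 transitions, mirroring the array operations in the learn function. All numbers are invented for illustration, and `not_done` plays the role of the `1 - done` flag stored by the replay buffer.

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, -100.0, 0.5])
not_done = np.array([1.0, 0.0, 1.0])   # 0 marks a terminal transition
actions = np.array([1, 0, 2])          # action taken in each transition

# Made-up network outputs for the resultant states (3 states x 4 actions)
q_next = np.array([[0.2, 1.5, -0.3, 0.0],
                   [5.0, 4.0, 3.0, 2.0],
                   [-1.0, -2.0, -0.5, -3.0]])

# Bellman target: reward plus discounted best next Q value, masked at terminals
targets = rewards + gamma * np.max(q_next, axis=1) * not_done

# Only the Q value of the action actually taken is overwritten
q_eval = np.zeros((3, 4))              # stand-in for the current predictions
q_target = q_eval.copy()
q_target[np.arange(3), actions] = targets
print(targets)
```

The second transition is terminal, so its target is just the reward of -100 even though the network predicts positive values for the next state.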

The learn function also updates the epsilon value, which controls the balance between exploration and exploitation.

The last 2 functions are save_model, which saves the current deep learning model, and load_model, which loads a saved deep learning model.

In [None]:
class Agent(object):
    def __init__(self, alpha, gamma, n_actions, epsilon, batch_size,
                input_dims, epsilon_dec, epsilon_min,
                mem_size):
        self.action_space = [i for i in range(n_actions)]
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_dec = epsilon_dec
        self.epsilon_min = epsilon_min
        self.batch_size = batch_size
        self.memory = ReplayBuffer(mem_size, input_dims, n_actions, discrete=True)
        self.q_eval = get_dqn(alpha, n_actions, input_dims)

    def remember(self, state, action, reward, new_state, done):
        self.memory.store_transition(state, action, reward, new_state, done)

    def choose_action(self, state):
        state = state[np.newaxis, :]
        rand = np.random.random()
        if rand < self.epsilon:
            action = np.random.choice(self.action_space)
        else:
            actions = self.q_eval.predict(state)
            action = np.argmax(actions)

        return action

    def learn(self):
        if self.memory.mem_cntr > self.batch_size:
            state, action, reward, new_state, done = self.memory.sample_buffer(self.batch_size)

            action_values = np.array(self.action_space, dtype=np.int8)
            action_indices = np.dot(action, action_values)

            q_eval = self.q_eval.predict(state)

            q_next = self.q_eval.predict(new_state)

            q_target = q_eval.copy()

            batch_index = np.arange(self.batch_size, dtype=np.int32)

            q_target[batch_index, action_indices] = reward + self.gamma*np.max(q_next, axis=1)*done

            _ = self.q_eval.fit(state, q_target, verbose=0)

            self.epsilon = self.epsilon*self.epsilon_dec if self.epsilon > self.epsilon_min else self.epsilon_min

    def save_model(self, model_file):
        self.q_eval.save(model_file)

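    def load_model(self, model_file):
        # the custom loss must be supplied so Keras can deserialize the compiled model
        self.q_eval = load_model(model_file, custom_objects={'masked_huber_loss': masked_huber_loss(0, 1)})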

## Main program

In the main program we specify the hyperparameters and create the main loop, where we choose the number of episodes/games to run.

The chosen values for all the hyperparameters are based on experimentation.

The hyperparameters are:

1. Learning rate: The learning rate is the learning rate used by the Adam optimizer for the deep learning network. The lower the learning rate, the smaller the changes made to the weights of the model, and vice versa. We will be using a learning rate of 0.0005.
<br><br>

2. Gamma: Gamma is the discount factor used in the Bellman equation. The higher the gamma, the more the agent values long-term rewards. We will be using a gamma of 0.99.
<br><br>


3. Epsilon: Epsilon is the probability that the agent chooses a random action (exploration) instead of using our model to choose the action (exploitation). An epsilon of 1 means only random actions are used, while an epsilon of 0 means only the model's predictions are used. Our starting epsilon will be 1, as we want to explore the environment as much as possible at the start.
<br><br>


4. Epsilon decay: The epsilon decay is the factor by which epsilon is multiplied each time the agent learns, down to a minimum of 0.01. This is so that we exploit our policy more the longer we train the model. We will be using a decay rate of 0.995.
<br><br>


5. Input dimensions: The input dimension represents the number of observations in each state. The lunar lander environment has 8 observations, so we will be using 8 as our input dimension.
<br><br>


6. Number of actions: The number of actions represents the number of actions our agent can take. The lunar lander environment has 4 actions, so we will be using 4 as our number of actions.
<br><br>


7. Memory size: The memory size represents the maximum number of entries in the history of states, actions, rewards and resultant states. When the number of sets exceeds this number, the oldest entry is overwritten to make space for the new entry. We will be using a memory size of 1000000.
<br><br>


8. Batch size: The batch size represents the number of sets of states, actions, rewards and resultant states sampled from the history when the agent is learning. We will be using 64.
<br><br>
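
To get a feel for the epsilon schedule, the sketch below works out how many decay steps it takes for epsilon to fall from 1 to the minimum of 0.01 with the decay factor of 0.995 passed to the agent. The two helper functions are hypothetical and exist only for this calculation; the simulated loop mimics the multiplicative rule in the learn function.

```python
import math


def steps_until_floor(epsilon_start, epsilon_dec, epsilon_min):
    """Smallest n such that epsilon_start * epsilon_dec**n <= epsilon_min."""
    return math.ceil(math.log(epsilon_min / epsilon_start) / math.log(epsilon_dec))


def simulate_decay(epsilon_start, epsilon_dec, epsilon_min):
    """Count multiplications until epsilon is no longer above the minimum."""
    epsilon, steps = epsilon_start, 0
    while epsilon > epsilon_min:
        epsilon *= epsilon_dec
        steps += 1
    return steps


analytic = steps_until_floor(1.0, 0.995, 0.01)
simulated = simulate_decay(1.0, 0.995, 0.01)
print(analytic, simulated)
```

Because the decay is applied on every call to learn, which happens once per environment step, epsilon reaches its minimum after fewer than a thousand learning steps rather than gradually over hundreds of episodes.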


Inside the main loop, a while loop continuously takes actions and trains the model until a terminal state is reached. We repeat this for the number of episodes set in epochs, which in this case is 500. After every episode, we print the score and the average score.



In [None]:
env = gym.make('LunarLander-v2')
agent = Agent(alpha=0.0005, gamma=0.99, n_actions=4, epsilon=1, batch_size=64, input_dims=8, epsilon_dec=0.995, epsilon_min=0.01, mem_size=1000000)

In [None]:
epochs = 500
reward_history = []
eps_history = []
logger = FileLogger('DQN_History.log')


for i in range(epochs):
    done = False
    score = 0
    observation = env.reset()
    while not done:
        action = agent.choose_action(observation)
        observation_, reward, done, info = env.step(action)
        score += reward
        agent.remember(observation, action, reward, observation_, int(done))
        observation = observation_
        agent.learn()

    eps_history.append(agent.epsilon)
    reward_history.append(score)

    # average over the last 100 episodes (indices i-99 .. i)
    avg_score = np.mean(reward_history[max(0, i-99):(i+1)])
    print('episode: ', i,'score: %.2f' % score,
            ' average score %.2f' % avg_score)

    if i % 5 == 0 and i > 0:
        agent.save_model(f'Models/DQN/episode{i}.h5')
    logger.log(i, score, avg_score)



episode:  0 score: -263.75  average score -263.75
episode:  1 score: -462.09  average score -362.92
episode:  2 score: -107.21  average score -277.68
episode:  3 score: -399.87  average score -308.23
episode:  4 score: -393.01  average score -325.19
episode:  5 score: -220.47  average score -307.73
episode:  6 score: -262.93  average score -301.33
episode:  7 score: -46.14  average score -269.43
episode:  8 score: -173.41  average score -258.77
episode:  9 score: -128.90  average score -245.78
episode:  10 score: 97.87  average score -214.54
episode:  11 score: -229.66  average score -215.80
episode:  12 score: -97.91  average score -206.73
episode:  13 score: -377.40  average score -218.92
episode:  14 score: -164.32  average score -215.28
episode:  15 score: -185.68  average score -213.43
episode:  16 score: -94.54  average score -206.44
episode:  17 score: -195.71  average score -205.84
episode:  18 score: -77.49  average score -199.09
episode:  19 score: -97.40  average score -194.

# Evaluating DQN

From the graph below, we can see the 100-episode moving average of the rewards in green and the individual reward per episode in red. The moving average increases sharply from episode 0 to 200. Its growth then slows and it peaks at episode 271, before dipping until episode 310. It then rises again until episode 368, where it peaks and dips all the way to episode 471, after which it increases sharply through to episode 500.

I chose the model saved at episode 475 because the moving average rises steeply around that episode, indicating that the rewards of those episodes are high enough to pull the moving average up. The model saved around here should therefore be good. I also experimented with models saved at other episodes, but they could not match the results of the model at episode 475, so I will be using it for evaluation.

The model at episode 475 achieved an average reward of about 218 over 10 evaluation episodes, with individual episodes ranging from about 114 to 288. This is great because, according to the documentation for the Lunar Lander environment, a reward above 200 is considered solved. Therefore, the baseline model is able to solve the lunar lander environment on average, scoring roughly 18 points above the 'solved' threshold.

In [12]:
data = pd.read_csv('DQN_History.log', sep=';')
plot_graph(data)

In [37]:
env.close()

In [41]:
env = gym.make("LunarLander-v2")
filename = "Models/DQN/episode475.h5"
# 475, 
trained_model = load_model(filename, custom_objects={"masked_huber_loss": MeanSquaredError()})

evaluation_max_episodes = 10
evaluation_max_steps = 450

def episode_trigger(x):
    return x % 1 == 0
env = wrappers.RecordVideo(env, f'Videos/DQN/', episode_trigger)

def get_q_values(model, state):
    batch = state[np.newaxis, ...]  # add a batch dimension without shadowing the built-in input()
    return model.predict(batch)[0]

def select_best_action(q_values):
    return np.argmax(q_values)

rewards = []
for episode in range(1, evaluation_max_episodes + 1):
    state = env.reset()

    episode_reward = 0

    step = 1
    for step in range(1, evaluation_max_steps + 1):
        env.render()
        q_values = get_q_values(trained_model, state)
        action = select_best_action(q_values)
        new_state, reward, done, info = env.step(action)

        episode_reward += reward

        if step == evaluation_max_steps:
            print(f"Episode reached the maximum number of steps. {evaluation_max_steps}")
            done = True

        state = new_state

        if done:
            break
    print(f"episode {episode} finished in {step} steps with reward {episode_reward}.")
    rewards.append(episode_reward)

print(f"Average reward: {np.average(rewards)}")
env.close()

Episode reached the maximum number of steps. 450
episode 1 finished in 450 steps with reward 147.77049373202067.
episode 2 finished in 235 steps with reward 257.28173297949843.
episode 3 finished in 330 steps with reward 287.9188121502981.
Episode reached the maximum number of steps. 450
episode 4 finished in 450 steps with reward 146.48165328935798.
episode 5 finished in 334 steps with reward 256.3675295714852.
episode 6 finished in 321 steps with reward 275.9922407034659.
episode 7 finished in 212 steps with reward 275.05596113321457.
Episode reached the maximum number of steps. 450
episode 8 finished in 450 steps with reward 156.4268290289234.
episode 9 finished in 170 steps with reward 264.44861794219787.
Episode reached the maximum number of steps. 450
episode 10 finished in 450 steps with reward 114.03661042287595.
Average reward: 218.17804809533382


The video below shows the 3rd episode. I chose to display it because it is the episode with the highest reward.

<video controls="true" allowfullscreen="true">
  <source src="Videos/DQN/rl-video-episode-3.mp4">
</video>

You can see from the video that the agent recognises that the lander is tilting left and veering to the right at the start. The agent then fires bursts of the left thruster to counter the tilt and level the lander. Next, the agent recognises that the lander is no longer in the center and fires its right thruster to direct the lander back to the middle, between the flags, where it manages to land smoothly. This shows that the baseline RL model is able to land the lunar lander successfully even when it needs to make drastic corrective adjustments at the start to stabilise and then center the lander.

# Double Deep Q Learning (DDQN)

For our second, improved model we implemented the Double Deep Q Network. The problem with the normal DQN is that it is prone to overestimating the Q values of certain actions. This is because the original target formula uses the max function, which both selects and evaluates the next action using the same noisy estimates.

This overestimation leads the model to believe that a certain action is the best one when it is not.

To solve this problem we use two models: one for selecting actions and one for calculating the target values.

The modified formula is shown below, where theta is the action selection model and theta prime is the target calculating model:

<img src = 'https://cdn.discordapp.com/attachments/345195124374110218/944293972367515718/unknown.png'>

The weights of the action selection model are copied to the target calculating model every x steps.
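
The effect of the modified formula can be seen on a small made-up batch. The sketch below computes a max-based target and a Double DQN target from the same target-model outputs; all Q values are invented for illustration, and the variable names are assumptions rather than names from the agent class.

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.0])
not_done = np.array([1.0, 1.0])

# Made-up Q values for the resultant states (2 states x 4 actions)
q_online_next = np.array([[1.0, 3.0, 2.0, 0.0],    # action selection model
                          [0.5, 0.1, 0.2, 0.9]])
q_target_next = np.array([[4.0, 2.0, 5.0, 1.0],    # target calculating model
                          [0.3, 0.8, 0.1, 0.4]])
batch_index = np.arange(len(rewards))

# Max-based target: the same values both pick and evaluate the action
max_targets = rewards + gamma * np.max(q_target_next, axis=1) * not_done

# Double DQN target: the selection model picks the action,
# the target model supplies the value of that action
best_actions = np.argmax(q_online_next, axis=1)
ddqn_targets = rewards + gamma * q_target_next[batch_index, best_actions] * not_done

print(best_actions, max_targets, ddqn_targets)
```

Since the Double DQN target reads one entry of each row instead of the row maximum, it can never exceed the max-based target computed from the same values, which is how the overestimation is dampened.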

Our implementation of the DDQN is very similar to that of the DQN, with only a slight modification to the learn function in the agent class. I will not explain the remaining parts again, as they are explained above.


## Replay Buffer

In [9]:
class ReplayBuffer(object):
    def __init__(self, max_size, input_shape, n_actions, discrete=False):
        self.mem_size = max_size
        self.mem_count = 0
        self.discrete = discrete
        self.state_memory = np.zeros((self.mem_size, input_shape))
        self.new_state_memory = np.zeros((self.mem_size, input_shape))
        dtype = np.int8 if self.discrete else np.float32
        self.action_memory = np.zeros((self.mem_size, n_actions), dtype=dtype)
        self.reward_memory = np.zeros(self.mem_size)
        self.terminal_memory = np.zeros(self.mem_size, dtype=np.float32)

    def store_transition(self, state, action, reward, state_, done):
        index = self.mem_count % self.mem_size
        self.state_memory[index] = state
        self.new_state_memory[index] = state_
        # store one hot encoding of actions, if appropriate
        if self.discrete:
            actions = np.zeros(self.action_memory.shape[1])
            actions[action] = 1.0
            self.action_memory[index] = actions
        else:
            self.action_memory[index] = action
        self.reward_memory[index] = reward
        self.terminal_memory[index] = 1 - done
        self.mem_count += 1

    def sample_buffer(self, batch_size):
        max_mem = min(self.mem_count, self.mem_size)
        batch = np.random.choice(max_mem, batch_size)

        states = self.state_memory[batch]
        actions = self.action_memory[batch]
        rewards = self.reward_memory[batch]
        states2 = self.new_state_memory[batch]
        terminal = self.terminal_memory[batch]

        return states, actions, rewards, states2, terminal

## Huber Loss

In [10]:
def masked_huber_loss(mask_value, clip_delta):
    def f(y_true, y_pred):
        error = y_true - y_pred
        cond  = K.abs(error) < clip_delta
        mask_true = K.cast(K.not_equal(y_true, mask_value), K.floatx())
        masked_squared_error = 0.5 * K.square(mask_true * (y_true - y_pred))
        linear_loss  = mask_true * (clip_delta * K.abs(error) - 0.5 * (clip_delta ** 2))
        huber_loss = tf.where(cond, masked_squared_error, linear_loss)
        return K.sum(huber_loss) / K.sum(mask_true)
    f.__name__ = 'masked_huber_loss'
    return f

## Deep Learning Neural Network

In [11]:
def get_dqn(lr, n_actions, input_dims):
    model = Sequential([
                Dense(256, input_shape=(input_dims,)),
                Activation('relu'),
                Dense(256),
                Activation('relu'),
                Dense(n_actions)])

    # `lr` is a deprecated alias; `learning_rate` is the supported argument name
    model.compile(optimizer=Adam(learning_rate=lr), loss=masked_huber_loss(0,1))

    return model

## Agent

In the initialization of the agent class, we have 2 networks with the same architecture: q_eval, which is our action selection model, and q_target, which is our target calculating model.

We also have a new parameter called replace_target. This is the number of steps, x, after which the weights of the target calculating model are replaced with the current weights of the action selection model.

Inside the learn function we use the modified target equation, where the action selection model, instead of the max operator, picks the action whose Q value is used in the target.

We have a new function called update_network_parameters, which copies the weights of the action selection model to the target calculating model. The learn function calls it every x steps.
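
The periodic copy can be pictured with plain arrays standing in for network weights. This is only a sketch of the schedule, assuming a replace_target of 100 as passed to the agent in the main program; the constant increment is a stand-in for a real gradient update.

```python
import numpy as np

replace_target = 100                  # steps between hard updates
online_weights = np.zeros(3)          # stand-in for the q_eval weights
target_weights = online_weights.copy()
sync_steps = []

for step in range(1, 251):
    online_weights += 0.01            # stand-in for one training update
    if step % replace_target == 0:
        # hard update: the target model takes a snapshot of the online model
        target_weights = online_weights.copy()
        sync_steps.append(step)

print(sync_steps, online_weights[0], target_weights[0])
```

Between snapshots the target model stays fixed while the online model keeps changing, so the targets the online model is trained towards only move every `replace_target` steps.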

In [12]:
class DDQNAgent(object):
    def __init__(self, alpha, gamma, n_actions, epsilon, batch_size,
                 input_dims, epsilon_dec,  epsilon_min,
                 mem_size, replace_target):
        self.action_space = [i for i in range(n_actions)]
        self.n_actions = n_actions
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_dec = epsilon_dec
        self.epsilon_min = epsilon_min
        self.batch_size = batch_size
        self.replace_target = replace_target
        self.memory = ReplayBuffer(mem_size, input_dims, n_actions,
                                   discrete=True)
        self.q_eval = get_dqn(alpha, n_actions, input_dims)
        self.q_target = get_dqn(alpha, n_actions, input_dims)

    def remember(self, state, action, reward, new_state, done):
        self.memory.store_transition(state, action, reward, new_state, done)

    def choose_action(self, state):
        state = state[np.newaxis, :]
        rand = np.random.random()
        if rand < self.epsilon:
            action = np.random.choice(self.action_space)
        else:
            actions = self.q_eval.predict(state)
            action = np.argmax(actions)

        return action

    def learn(self):
        # the ReplayBuffer defined above tracks stored transitions in mem_count
        if self.memory.mem_count > self.batch_size:
            state, action, reward, new_state, done = self.memory.sample_buffer(self.batch_size)

            # convert the one-hot encoded actions back to integer indices
            action_values = np.array(self.action_space, dtype=np.int8)
            action_indices = np.dot(action, action_values)

            q_next = self.q_target.predict(new_state)  # target model evaluates the actions
            q_eval = self.q_eval.predict(new_state)    # selection model picks the actions
            q_pred = self.q_eval.predict(state)

            max_actions = np.argmax(q_eval, axis=1)

            q_target = q_pred.copy()

            batch_index = np.arange(self.batch_size, dtype=np.int32)

            # done holds 1 - terminal flag, so terminal transitions keep only the reward
            q_target[batch_index, action_indices] = reward + self.gamma * q_next[batch_index, max_actions.astype(int)] * done

            _ = self.q_eval.fit(state, q_target, verbose=0)

            self.epsilon = self.epsilon*self.epsilon_dec if self.epsilon > self.epsilon_min else self.epsilon_min
            if self.memory.mem_count % self.replace_target == 0:
                self.update_network_parameters()

    def update_network_parameters(self):
        self.q_target.set_weights(self.q_eval.get_weights())

    def save_model(self, file):
        self.q_eval.save(file)

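    def load_model(self, file):
        # the custom loss must be supplied so Keras can deserialize the compiled model
        self.q_eval = load_model(file, custom_objects={'masked_huber_loss': masked_huber_loss(0, 1)})
        # if we are in evaluation mode we want to use the best weights for
        # q_target
        if self.epsilon == 0.0:
            self.update_network_parameters()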


## Main Program

In [None]:
env = gym.make('LunarLander-v2')
ddqn_agent = DDQNAgent(alpha=0.0005, gamma=0.99, n_actions=4, epsilon=1.0, batch_size=64, input_dims=8, 
                        epsilon_dec=0.995, epsilon_min=0.01, mem_size=1000000, replace_target=100)

In [14]:
epochs = 501
rewards_history = []
eps_history = []
logger = FileLogger('DDQN_History.log')

for i in range(epochs):
    done = False
    score = 0
    observation = env.reset()
    while not done:
        action = ddqn_agent.choose_action(observation)
        observation2, reward, done, info = env.step(action)
        score += reward
        ddqn_agent.remember(observation, action, reward, observation2, int(done))
        observation = observation2
        ddqn_agent.learn()
    eps_history.append(ddqn_agent.epsilon)
    rewards_history.append(score)

    # average over the last 100 episodes (indices i-99 .. i)
    avg_score = np.mean(rewards_history[max(0, i-99):(i+1)])
    print('episode: ', i,'score: %.2f' % score,
            ' average score %.2f' % avg_score)
    logger.log(i, score, avg_score)
    if i % 5 == 0 and i > 0:
        ddqn_agent.save_model(f'Models/DDQN/episode{i}.h5')

episode:  0 score: -137.52  average score -137.52
episode:  1 score: -395.91  average score -266.71
episode:  2 score: -308.45  average score -280.63
episode:  3 score: 11.93  average score -207.49
episode:  4 score: -2.96  average score -166.58
episode:  5 score: -99.42  average score -155.39
episode:  6 score: -211.18  average score -163.36
episode:  7 score: -199.06  average score -167.82
episode:  8 score: -280.16  average score -180.30
episode:  9 score: -143.75  average score -176.65
episode:  10 score: -179.82  average score -176.93
episode:  11 score: -174.84  average score -176.76
episode:  12 score: -213.78  average score -179.61
episode:  13 score: -206.45  average score -181.53
episode:  14 score: -124.71  average score -177.74
episode:  15 score: -67.08  average score -170.82
episode:  16 score: -89.73  average score -166.05
episode:  17 score: 6.12  average score -156.49
episode:  18 score: -46.24  average score -150.68
episode:  19 score: -94.53  average score -147.88
...

# Evaluating DDQN
From the graph below we can see that the moving average of the reward increases from episode 0 to roughly episode 350, after which it starts to plateau.

We chose the candidate models to test from the peaks of the moving-average line between episodes 300 and 500. We visualized the lander using the models saved from episodes 350 to 360, 400 to 415, and 470 to 480.

The model that consistently scored above 200 was the one saved at episode 475, so we chose it as our final model.
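
The selection procedure described above can be sketched in code. The first helper mirrors the slice used in the training loop (`rewards_history[max(0, i-100):(i+1)]`, which covers up to 101 scores). The second helper, `best_saved_episodes`, is hypothetical: it assumes checkpoints exist every 5 episodes, as in the training loop, and ranks them by trailing average. The column names of `DDQN_History.log` are not shown here, so the example uses the first five scores from the training output instead of reading the file.

```python
import numpy as np


def trailing_average(scores, lookback=100):
    """Mean of scores[max(0, i - lookback) : i + 1] for every episode i."""
    return [float(np.mean(scores[max(0, i - lookback):i + 1]))
            for i in range(len(scores))]


def best_saved_episodes(averages, start, stop, save_every=5, top_k=3):
    """Hypothetical helper: rank saved episodes in [start, stop] by trailing average."""
    last = min(stop, len(averages) - 1)
    # The training loop saves a model when i % 5 == 0 and i > 0.
    saved = [i for i in range(start, last + 1) if i > 0 and i % save_every == 0]
    return sorted(saved, key=lambda i: averages[i], reverse=True)[:top_k]


# First five episode scores, copied from the training output above.
first_scores = [-137.52, -395.91, -308.45, 11.93, -2.96]
averages = trailing_average(first_scores)
print([round(a, 2) for a in averages])
```

With the full history loaded, a call such as `best_saved_episodes(averages, 300, 500)` would list the checkpoints worth replaying. The final choice still depends on watching the agent, as described above.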

In [11]:
data = pd.read_csv('DDQN_History.log', sep=';')
plot_graph(data)


plotly.tools.make_subplots is deprecated, please use plotly.subplots.make_subplots instead



In [29]:
env = gym.make("LunarLander-v2")
filename = "Models/DDQN/episode475.h5"
# 470-480 ep optimal
trained_model = load_model(filename, custom_objects={"masked_huber_loss": MeanSquaredError()})

evaluation_max_episodes = 10
evaluation_max_steps = 450

def episode_trigger(episode_id):
    # Record a video for every evaluation episode.
    return True
env = wrappers.RecordVideo(env, 'Videos/DDQN/', episode_trigger)

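def get_q_values(model, state):
    # Add a batch dimension: the model expects inputs of shape (batch, 8).
    # Named `batch` to avoid shadowing the built-in `input`.
    batch = state[np.newaxis, ...]
    return model.predict(batch)[0]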

def select_best_action(q_values):
    return np.argmax(q_values)

rewards = []
for episode in range(1, evaluation_max_episodes + 1):
    state = env.reset()

    episode_reward = 0

    step = 1
    for step in range(1, evaluation_max_steps + 1):
        env.render()
        q_values = get_q_values(trained_model, state)
        action = select_best_action(q_values)
        new_state, reward, done, info = env.step(action)

        episode_reward += reward

        if step == evaluation_max_steps:
            print(f"Episode reached the maximum number of steps. {evaluation_max_steps}")
            done = True

        state = new_state

        if done:
            break
    print(f"episode {episode} finished in {step} steps with reward {episode_reward}.")
    rewards.append(episode_reward)

print(f"Average reward: {np.average(rewards)}")
env.close()


[WinError -2147417850] Cannot change thread mode after it is set



episode 1 finished in 314 steps with reward 259.8623645173483.
episode 2 finished in 288 steps with reward 257.05966515917913.
episode 3 finished in 249 steps with reward 268.9174468096048.
episode 4 finished in 204 steps with reward 237.14798515387955.
episode 5 finished in 254 steps with reward 277.00182440486793.
Episode reached the maximum number of steps. 450
episode 6 finished in 450 steps with reward 151.74912420164515.
episode 7 finished in 173 steps with reward 274.06344696161364.
episode 8 finished in 208 steps with reward 266.6771424551007.
episode 9 finished in 161 steps with reward 255.94001833874972.
episode 10 finished in 275 steps with reward 264.34180405736043.
Average reward: 251.27608220593493


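We are displaying episode 7 because it is among the best of the 10 evaluation episodes above: its reward of 274.06 is the second highest (after episode 5 at 277.00), and it finished in only 173 steps.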

<video controls="true" allowfullscreen="true">
  <source src="Videos\DDQN\rl-video-episode-7.mp4">
</video>

You can see here that the agent wasted very few resources on the thrusters and landed the lander in very few steps. This shows that the agent has learnt not only how to land, but how to land efficiently, without wasting time or thruster activations. It fired the thrusters just enough for the lander to descend quickly, but not so quickly that the landing counted as a crash.

In [33]:
env.close()

# Conclusion
In conclusion, we explored two Reinforcement Learning algorithms: DQN and its improved counterpart, DDQN. We used DQN as our baseline model and, after considerable parameter tuning, reached an average reward of around 220, which means we solved the problem according to the Lunar Lander documentation ('above 200 = solved'). We then trained DDQN and, after tuning it, raised the average reward from 220 to 250-260. This indicates that our final model learns and plays the environment better than our baseline, which was expected since DDQN is an improved counterpart to DQN.
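
As a quick check of the 'above 200 = solved' criterion, the 10 DDQN evaluation rewards printed earlier can be summarized in a few lines. This is only a sketch: the rewards are copied from the evaluation output and rounded to two decimals, and the names `evaluation_rewards` and `SOLVED_THRESHOLD` are introduced here for illustration.

```python
import numpy as np

# Evaluation rewards copied from the DDQN evaluation output, rounded to two decimals.
evaluation_rewards = [259.86, 257.06, 268.92, 237.15, 277.00,
                      151.75, 274.06, 266.68, 255.94, 264.34]
SOLVED_THRESHOLD = 200  # "Solved is 200 points" in the environment description

mean_reward = float(np.mean(evaluation_rewards))
# Count the episodes whose individual reward meets the threshold.
episodes_above = sum(r >= SOLVED_THRESHOLD for r in evaluation_rewards)

print(f"mean reward: {mean_reward:.2f}")
print(f"episodes at or above {SOLVED_THRESHOLD}: {episodes_above}/{len(evaluation_rewards)}")
```

The single episode below the threshold is episode 6, which was cut off at the 450-step evaluation limit.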

# References
- https://wingedsheep.com/lunar-lander-dqn/
- https://rubikscode.net/2021/07/20/introduction-to-double-q-learning/
- https://arxiv.org/pdf/1509.06461.pdf
- https://www.neuralnet.ai/
- https://drawar.github.io/blog/2019/05/12/lunar-lander-dqn.html
- https://youtu.be/p0rGjAgykOU