# Deep Q-Learning for Lunar Landing

## Part 0 - Installing the required packages and importing the libraries

### Installing Gymnasium

In [1]:
!pip install gymnasium
!pip install "gymnasium[atari, accept-rom-license]"
!apt-get install -y swig
!pip install "gymnasium[box2d]"

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
swig is already the newest version (4.0.2-1ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


### Importing the libraries

In [2]:
import os
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.autograd as autograd
from torch.autograd import Variable
from collections import deque, namedtuple

# Gymnasium environment
import gymnasium as gym

## Part 1 - Building the AI

### Creating the architecture of the Neural Network

In [3]:
class Network(nn.Module):
    def __init__(self, state_size, action_size, seed = 42) -> None:
        super(Network, self).__init__()
        self.seed = torch.manual_seed(seed)
        # fcl: fully connected layer
        self.fcl1 = nn.Linear(in_features = state_size, # Number of input features (state dimensions)
                              out_features = 64) # Hidden width chosen by experimentation; 64 is a common choice
        self.fcl2 = nn.Linear(64, 64) # Input size must match the output size of the previous layer
        self.fcl3 = nn.Linear(64, action_size) # One Q-value per action

    def forward(self, state):
        x = self.fcl1(state)
        x = F.relu(x)
        x = self.fcl2(x)
        x = F.relu(x)
        return self.fcl3(x)



## Part 2 - Training the AI

### Setting up the environment

In [4]:
# Note: Gymnasium 1.0 and later register this environment as 'LunarLander-v3'
env = gym.make('LunarLander-v2')
state_shape = env.observation_space.shape
state_size = env.observation_space.shape[0]  # 8-dimensional vector: coordinates, velocities, angle, etc..
number_actions = env.action_space.n  # 4 actions: nothing, left, main engine, right
print("State shape:", state_shape)
print("State size:", state_size)
print("Number of actions:", number_actions)

State shape: (8,)
State size: 8
Number of actions: 4


### Initializing the hyperparameters

In [5]:
learning_rate = 5e-4  # Experimental value
minibatch_size = 100  # Experimental value
discount_factor = 0.99 # gamma value
replay_buffer_size = int(1e5) # how many experiences the agent keeps in memory
interpolation_parameter = 1e-3 # tau. Experimental value
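
To give a feel for what a discount factor of 0.99 means, the following minimal sketch computes a discounted return in plain Python. The helper `discounted_return` and the sample rewards are hypothetical and are not part of the notebook.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum rewards with each later reward weighted by one more factor of gamma."""
    total = 0.0
    # Work backwards so each step is: reward + gamma * (return of the rest)
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Three consecutive rewards of 1.0 (illustrative values only)
short_return = discounted_return([1.0, 1.0, 1.0])

# Weight applied to a reward received 100 steps in the future
weight_after_100_steps = 0.99 ** 100

print(short_return, weight_after_100_steps)
```

With `gamma = 0.99`, a reward 100 steps ahead still counts for roughly a third of its face value, so the agent is encouraged to plan over long stretches of the descent.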

### Implementing Experience Replay

In [6]:
class MemoryReplay(object):
  def __init__(self, capacity):  # Capacity of the memory
    self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    self.capacity = capacity
    self.memory = []

  def push(self, event):
    self.memory.append(event)
    # Making sure the memory buffer doesn't exceed the capacity
    # by removing the oldest event
    if len(self.memory) > self.capacity:
      del self.memory[0]

  def sample(self, batch_size):
    experiences = random.sample(self.memory, k = batch_size)
    states = torch.from_numpy(np.vstack([e[0] for e in experiences if e is not None])).float().to(self.device)
    actions = torch.from_numpy(np.vstack([e[1] for e in experiences if e is not None])).long().to(self.device)
    rewards = torch.from_numpy(np.vstack([e[2] for e in experiences if e is not None])).float().to(self.device)
    next_states = torch.from_numpy(np.vstack([e[3] for e in experiences if e is not None])).float().to(self.device)
    dones = torch.from_numpy(np.vstack([e[4] for e in experiences if e is not None]).astype(np.uint8)).float().to(self.device)
    return states, next_states, actions, rewards, dones
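
The list-based buffer above removes the oldest event with `del self.memory[0]`, which shifts every remaining element. One possible alternative is a `collections.deque` with `maxlen`, which evicts the oldest entry automatically. The class `DequeReplay` below is a hypothetical sketch covering only storage and raw sampling, not the tensor conversion.

```python
import random
from collections import deque

class DequeReplay:
    """Hypothetical fixed-capacity buffer; the oldest events are dropped automatically."""

    def __init__(self, capacity):
        # A deque with maxlen discards items from the left once it is full
        self.memory = deque(maxlen=capacity)

    def push(self, event):
        self.memory.append(event)

    def sample(self, batch_size):
        # Uniform sampling without replacement, as in the notebook's buffer
        return random.sample(self.memory, k=batch_size)

    def __len__(self):
        return len(self.memory)

buffer = DequeReplay(capacity=3)
for i in range(5):
    # (state, action, reward, next_state, done) with dummy values
    buffer.push((i, 0, 0.0, i + 1, False))

oldest_state = buffer.memory[0][0]
batch = buffer.sample(2)
print(len(buffer), oldest_state)
```

After pushing five events into a buffer of capacity three, only the three most recent remain.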

### Implementing the DQN class

#### Local Q-Network (or simply Q-Network)

**Purpose**: This network is responsible for learning the Q-value function. It is updated continuously during the training process.

**Operation**: At each step of training, this network takes the current state as input and outputs Q-values for all possible actions in that state.

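**Updating**: The weights of this network are updated frequently, typically at every step or every few steps of training, by gradient descent with gradients computed through backpropagation. In this notebook a learning update runs every 4 environment steps, once the replay buffer holds more than one minibatch of experiences. The updates are guided by a loss function that measures the difference between the predicted Q-values and the target Q-values (which come from the target Q-network).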

#### Target Network Prediction
The target network is a copy of the Q-network, and it is used to predict the Q-values for the next states. This helps to stabilize the learning process because the target values change only slowly, reducing the oscillation and divergence that can occur when the same rapidly changing network both predicts Q-values and generates the targets it is trained toward.

**Purpose**: The target Q-network is used to generate stable target values for the updates of the Q-network. It helps in stabilizing the learning process.

**Operation**: Like the local Q-network, it also takes the state as input and outputs Q-values. However, its weights are not updated as frequently.

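**Updating**: The weights of the target Q-network follow those of the local Q-network, but more slowly than the local network itself changes. A common scheme is a hard update, which copies the weights every few hundred or thousand training steps. This notebook instead uses a soft update: after every learning step, each target weight becomes the weighted average $\tau \, \theta_{\text{local}} + (1 - \tau) \, \theta_{\text{target}}$, with $\tau = 10^{-3}$ (the `interpolation_parameter`).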
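
The notebook's `soft_update` method, defined further down, blends the two sets of weights a little at every learning step. As a rough illustration of how slowly the target follows, the sketch below applies the same rule to plain Python lists. The function signature and the toy weights are assumptions made for this example.

```python
def soft_update(local_params, target_params, tau):
    """Move each target parameter a fraction tau of the way toward the local one."""
    return [tau * local + (1.0 - tau) * target
            for local, target in zip(local_params, target_params)]

local = [1.0, -2.0]   # toy "local network" weights, held fixed here
target = [0.0, 0.0]   # toy "target network" weights

for _ in range(1000):
    target = soft_update(local, target, tau=1e-3)

# Fraction of the initial gap that has been closed after 1000 updates
fraction_closed = target[0] / local[0]
print(fraction_closed)
```

Even after 1000 updates toward a fixed local network, the target has covered only part of the gap, which is what keeps the training targets slow-moving.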

#### Max Q Value Selection
For each next state in the minibatch, the target network predicts Q-values for all possible actions. We then select the maximum Q-value among these predictions. This value is the target network's estimate of the best discounted future return obtainable from that state.
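
As a small illustration of this selection step, the sketch below takes the row-wise maximum over made-up Q-values, which is what `.max(1)[0]` does on the tensor of target-network outputs in the notebook's `learn` method. The numbers are hypothetical.

```python
# Hypothetical target-network outputs: one row per next state, one column per action
next_q_values = [
    [0.5, 1.2, -0.3, 0.8],
    [2.0, 1.9, 2.1, 0.0],
]

# Keep only the largest Q-value in each row
max_next_q = [max(row) for row in next_q_values]

print(max_next_q)  # [1.2, 2.1]
```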

#### Cost Function
The cost function drives learning by minimizing the error between the Q-values predicted by the Q-network and the target Q-values. The target Q-value is calculated from the reward and the maximum Q-value from the target network, as described above. The loss is typically the mean squared error (MSE) between the predicted and target Q-values, averaged over the minibatch. For a single transition it is:

$$\text{Loss} = \left( r + \gamma \max_{a'} Q_{\text{target}}(s', a') - Q_{\text{current}}(s, a) \right)^2$$

Here:

$r$ is the reward received after taking action $a$ in state $s$.

$\gamma$ is the discount factor.

$\max_{a'} Q_{\text{target}}(s', a')$ is the maximum Q value for the next state $s'$ from the target network.

$Q_{\text{current}}(s, a)$ is the current Q value predicted by the Q-network for the state-action pair $(s, a)$.

By minimizing this loss, the Q-network learns to approximate the Q values more accurately, improving its policy for selecting actions over time.
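
The formula can be checked by hand on a single transition. The sketch below is a plain Python illustration with made-up numbers. The `done` flag mirrors the `(1 - dones)` factor in the notebook's `learn` method, which removes the bootstrap term when the episode has ended. The helper names are hypothetical.

```python
def td_target(reward, max_next_q, done, gamma=0.99):
    """Bellman target: reward plus discounted best next value, unless the episode ended."""
    return reward + gamma * max_next_q * (1.0 - float(done))

def squared_td_error(q_current, target):
    """Squared difference between the target and the current estimate."""
    return (target - q_current) ** 2

# Mid-episode transition: the target bootstraps from the next state
target_mid = td_target(reward=1.0, max_next_q=2.0, done=False)

# Terminal transition: the target is just the final reward
target_end = td_target(reward=100.0, max_next_q=2.0, done=True)

loss_mid = squared_td_error(q_current=2.5, target=target_mid)
print(target_mid, target_end, loss_mid)
```

Here the mid-episode target is about 2.98 (that is, 1 + 0.99 × 2), while the terminal target is exactly the final reward of 100.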

#### Clarifications

* For the actual action selection during training (e.g., choosing the action to take in the current state), the local Q-network is used. However, when calculating the target Q-values for training, the target Q-network is used. The target Q-network provides the Q-values for the next state, which are used to compute the target value for the current state. This helps to stabilize training.

* The local Q-network predicts the Q-values for the current state-action pairs.

  The target Q-network predicts the Q-values for the next state.

  The target value for the action taken is computed using the Bellman equation, $r + \gamma \max_{a'} Q_{\text{target}}(s', a')$.

* The reason for having these two separate networks (local and target) comes from the need to address a significant challenge in training deep Q-networks: the moving target problem. When a single network is used both to select actions and to evaluate them, it can lead to highly correlated Q-value estimates. This correlation can make the training unstable and inefficient, as the network is effectively chasing a constantly moving target (its own continuously updated estimates).

  By separating the networks, the target Q-network provides more stable and less frequently changing target values for the local Q-network to learn from. This separation reduces the correlations in the update process, leading to more stable and reliable learning. The idea is loosely similar to supervised regression, where the labels stay fixed while the model is fitted to them.
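
The first clarification above says the local Q-network drives action selection during training. A minimal sketch of the epsilon-greedy rule used for that, written in plain Python rather than PyTorch, might look as follows. The function `epsilon_greedy` and the sample Q-values are hypothetical.

```python
import random

def epsilon_greedy(action_values, epsilon):
    """Pick the best-valued action, or a uniformly random one with probability epsilon."""
    if random.random() > epsilon:
        # Exploit: index of the largest Q-value
        return max(range(len(action_values)), key=lambda a: action_values[a])
    # Explore: any action, chosen uniformly
    return random.randrange(len(action_values))

q_values = [0.1, 0.9, 0.3, 0.2]  # made-up Q-values for the 4 actions

greedy_action = epsilon_greedy(q_values, epsilon=0.0)
random_actions = [epsilon_greedy(q_values, epsilon=1.0) for _ in range(50)]
print(greedy_action)
```

With `epsilon = 0.0` the choice is purely greedy, and with `epsilon = 1.0` every action is drawn at random.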

In [7]:
class Agent:
  def __init__(self, state_size, action_size) -> None:
    self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    self.state_size = state_size
    self.action_size = action_size
    self.local_qnetwork = Network(state_size, action_size).to(self.device)
    self.target_qnetwork = Network(state_size, action_size).to(self.device)
    self.optimizer = optim.Adam(self.local_qnetwork.parameters(), lr = learning_rate)
    self.memory = MemoryReplay(replay_buffer_size)
    self.t_step = 0

  def step(self, state, action, reward, next_state, done):
    # Store the experience in the replay memory
    self.memory.push((state, action, reward, next_state, done))
    self.t_step = (self.t_step + 1) % 4 # learn once every 4 steps
    if self.t_step == 0:
      if len(self.memory.memory) > minibatch_size:
        experiences = self.memory.sample(minibatch_size)
        self.learn(experiences, discount_factor)

  def act(self, state, epsilon = 0.):
    """Epsilon-Greedy Action Selection policy.

    By generating a random value and comparing it with
    epsilon value that allows to leave some room for
    exploration because some actions will be randomly
    selected (instead of always being predicted by the agent)"""
    state = torch.from_numpy(state).float().unsqueeze(0).to(self.device)
    self.local_qnetwork.eval()

    # Checking we are not in training
    # i.e., we are predicting
    with torch.no_grad(): # make sure any gradient computation is disabled
      action_values = self.local_qnetwork(state)

    self.local_qnetwork.train()
    if random.random() > epsilon:
      return np.argmax(action_values.cpu().data.numpy())
    else:
      return random.choice(np.arange(self.action_size))

  def learn(self, experiences:tuple, discount_factor):
    states, next_states, actions, rewards, dones = experiences # experiences is a tuple
    # Maximum predicted Q-values of the next states, from the target network
    next_q_target = self.target_qnetwork(next_states).detach().max(1)[0].unsqueeze(1)
    q_targets = rewards + (discount_factor * next_q_target * (1 - dones))

    # Expected Q-values from the local network for the actions actually taken
    q_expected = self.local_qnetwork(states).gather(1, actions)

    # Computing the loss
    loss = F.mse_loss(q_expected, q_targets)

    # Backpropagate the loss
    self.optimizer.zero_grad()
    loss.backward()

    # Updating the parameters of the model
    self.optimizer.step()

    # Update the target network parameters with those of the local network
    self.soft_update(self.local_qnetwork, self.target_qnetwork, interpolation_parameter)

  def soft_update(self, local_model, target_model, interpolation_parameter):
    # Update the target parameters using the weighted average of the local and target parameters
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
      target_param.data.copy_(interpolation_parameter * local_param.data + (1.0 - interpolation_parameter) * target_param.data)


### Initializing the DQN agent

In [8]:
agent = Agent(state_size, number_actions)

### Training the DQN agent

In [9]:
# Hyperparameters
number_episodes = 2000
maximum_number_timesteps_per_episode = 1000
epsilon_starting_value = 1.0
epsilon_ending_value = 0.01
epsilon_decay_value = 0.995
epsilon = epsilon_starting_value
scores_on_100_episodes = deque(maxlen = 100)

for episode in range(1, number_episodes + 1):
  # reset the environment to the initial state
  state, _ = env.reset()
  # Initialize the score
  score = 0

  for t in range(0, maximum_number_timesteps_per_episode):
    # Select the action
    action = agent.act(state, epsilon)
    # Take the action; Gymnasium returns separate terminated and truncated flags
    next_state, reward, terminated, truncated, _ = env.step(action)
    # Training! Store only true termination so truncated episodes still bootstrap
    agent.step(state, action, reward, next_state, terminated)
    state = next_state
    score += reward
    if terminated or truncated:
      break

  # Updating the scores after the episode
  scores_on_100_episodes.append(score)
  epsilon = max(epsilon_ending_value, epsilon_decay_value * epsilon)

  # Printing dynamically the scores (reward)
  print(f'\rEpisode {episode}\tAverage Score: {round(np.mean(scores_on_100_episodes), 2)}', end = "")
  if episode % 100 == 0:
    print(f'\rEpisode {episode}\tAverage Score: {round(np.mean(scores_on_100_episodes), 2)}')

  if np.mean(scores_on_100_episodes) >= 200.0:
    print(f'\n Environment solved in {episode - 100} episodes\tAverage Score: {round(np.mean(scores_on_100_episodes), 2)}')
    torch.save(agent.local_qnetwork.state_dict(), "checkpoint.pth")
    break



Episode 100	Average Score: -160.92
Episode 200	Average Score: -112.28
Episode 300	Average Score: -40.0
Episode 400	Average Score: 40.34
Episode 500	Average Score: 104.43
Episode 600	Average Score: 158.26
Episode 700	Average Score: 183.2
Episode 800	Average Score: 189.38
Episode 857	Average Score: 200.22
 Environment solved in 757 episodes	Average Score: 200.22
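
To see how quickly exploration fades under the multiplicative decay used in the training loop, the sketch below counts the episodes needed for epsilon to reach its floor. It reuses the hyperparameter values from the cell above. The helper `episodes_until_floor` is hypothetical.

```python
epsilon_starting_value = 1.0
epsilon_ending_value = 0.01
epsilon_decay_value = 0.995

def episodes_until_floor(start, end, decay):
    """Count episodes until the decayed epsilon is clamped at its floor."""
    epsilon, episodes = start, 0
    while epsilon > end:
        # Same update rule as the training loop
        epsilon = max(end, decay * epsilon)
        episodes += 1
    return episodes

floor_episode = episodes_until_floor(
    epsilon_starting_value, epsilon_ending_value, epsilon_decay_value
)
print(floor_episode)
```

By this estimate the floor is reached a little after episode 900, so the run logged above, which stopped at episode 857, was still exploring slightly more than the minimum rate when it finished.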


## Part 3 - Visualizing the results

In [10]:
import glob
import io
import base64
import imageio
from IPython.display import HTML, display

def show_video_of_model(agent, env_name):
    env = gym.make(env_name, render_mode='rgb_array')
    state, _ = env.reset()
    done = False
    frames = []
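    while not done:
        frame = env.render()
        frames.append(frame)
        action = agent.act(state)
        state, reward, terminated, truncated, _ = env.step(action.item())
        # Stop on either a true terminal state or the time limit
        done = terminated or truncated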
    env.close()
    imageio.mimsave('video.mp4', frames, fps=30)

show_video_of_model(agent, 'LunarLander-v2')

def show_video():
    mp4list = glob.glob('*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")

show_video()

