# RLSS2023 - DQN Tutorial: Deep Q-Network (DQN)

## Part II: DQN Update and Training Loop

Website: https://rlsummerschool.com/

Github repository: https://github.com/araffin/rlss23-dqn-tutorial

Gymnasium documentation: https://gymnasium.farama.org/

<div>
    <img src="https://araffin.github.io/slides/dqn-tutorial/images/dqn/dqn.png" width="800"/>
</div>

### Introduction

In this notebook, you will finish the implementation of the [Deep Q-Network (DQN)](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html) algorithm (started in part I) by implementing the training loop and the DQN gradient update.

In [1]:
# for autoformatting
# !pip install jupyter-black
# %load_ext jupyter_black

### Install Dependencies

In [2]:
!pip install git+https://github.com/araffin/rlss23-dqn-tutorial/ --upgrade

Collecting git+https://github.com/araffin/rlss23-dqn-tutorial/
  Cloning https://github.com/araffin/rlss23-dqn-tutorial/ to /private/var/folders/sd/ntm8vytx57g033nh4jm4j2700000gn/T/pip-req-build-3o0reu5y
  Running command git clone --filter=blob:none --quiet https://github.com/araffin/rlss23-dqn-tutorial/ /private/var/folders/sd/ntm8vytx57g033nh4jm4j2700000gn/T/pip-req-build-3o0reu5y
  Resolved https://github.com/araffin/rlss23-dqn-tutorial/ to commit 3070ad0d11f0ece870fb0f6978dc75911edeaca1
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done


In [3]:
!apt-get install ffmpeg  # For visualization (Debian/Ubuntu only, e.g. Colab; on macOS: brew install ffmpeg)

zsh:1: command not found: apt-get


### Imports (from Part I)

In [4]:
from typing import Optional

import numpy as np
import torch as th
import gymnasium as gym
from gymnasium import spaces

# We implemented those components in part I
from dqn_tutorial.dqn import ReplayBuffer, epsilon_greedy_action_selection, collect_one_step, linear_schedule, QNetwork
from dqn_tutorial.notebook_utils import show_videos

In [5]:
from pathlib import Path

import dqn_tutorial

# Where to save the video
video_folder = Path.cwd()/"DQN_videos"

In [6]:
from pathlib import Path
from typing import Optional

import gymnasium as gym
import numpy as np
from gymnasium import spaces
from gymnasium.wrappers.monitoring.video_recorder import VideoRecorder

from dqn_tutorial.dqn.collect_data import epsilon_greedy_action_selection
from dqn_tutorial.dqn.q_network import QNetwork

# Evaluation, modified version
def evaluate_policy(
    eval_env: gym.Env,
    q_net: QNetwork,
    n_eval_episodes: int,
    eval_exploration_rate: float = 0.0,
    video_name: Optional[str] = None,
    video_folder: Optional[Path] = None,
) -> None:
    """
    Evaluate the policy by computing the average episode reward
    over n_eval_episodes episodes.

    :param eval_env: The environment to evaluate the policy on
    :param q_net: The Q-network to evaluate
    :param n_eval_episodes: The number of episodes to evaluate the policy on
    :param eval_exploration_rate: The exploration rate to use during evaluation
    :param video_name: Optional name of the video file (without extension),
        a video is only recorded when the render mode is "rgb_array"
    :param video_folder: Optional folder where the video is saved
    """
    assert isinstance(eval_env.action_space, spaces.Discrete)

    # Setup video recorder
    video_recorder = None
    if video_name is not None and video_folder is not None and eval_env.render_mode == "rgb_array":
        video_path = Path(video_folder) / video_name

        video_recorder = VideoRecorder(
            env=eval_env,
            base_path=str(video_path),
        )

    episode_returns = []
    for _ in range(n_eval_episodes):
        obs, _ = eval_env.reset()
        total_reward = 0.0
        done = False
        while not done:
            # Record video
            if video_recorder is not None:
                video_recorder.capture_frame()

            # Select the action according to the policy
            action = epsilon_greedy_action_selection(
                q_net,
                obs,
                exploration_rate=eval_exploration_rate,
                action_space=eval_env.action_space,
            )
            # Render
            if eval_env.render_mode is not None:  # pragma: no cover
                eval_env.render()
            # Do one step in the environment
            obs, reward, terminated, truncated, _ = eval_env.step(action)
            total_reward += float(reward)

            done = terminated or truncated
        # Store the episode reward
        episode_returns.append(total_reward)

    if video_recorder is not None:
        print(f"Saving video to {video_recorder.path}")
        video_recorder.close()

    # Print mean and std of the episode rewards
    print(f"Mean episode reward: {np.mean(episode_returns):.2f} +/- {np.std(episode_returns):.2f}")

## DQN Update rule (no target network)
<div>
    <img src="https://araffin.github.io/slides/dqn-tutorial/images/dqn/annotated_dqn.png" width="1000"/>
</div>


### Exercise (15 minutes): write DQN update

**HINT**: The DQN update is heavily inspired by the FQI update. If you get stuck, take a look at what you did in the first notebook on FQI.
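
As a reminder of what both FQI and DQN regress onto, here is a minimal sketch of the 1-step TD target on a hand-made batch. The tensors `rewards`, `next_q_values` and `terminateds` are illustrative values chosen for this example, not data from the tutorial's replay buffer.

```python
import torch as th

gamma = 0.99
# Hypothetical batch of 3 transitions with 2 actions each
rewards = th.tensor([1.0, 1.0, 1.0])
next_q_values = th.tensor([[0.5, 2.0], [3.0, 1.0], [4.0, 4.5]])
# The last transition ends the episode (termination)
terminateds = th.tensor([False, False, True])

# Greedy value of the next state: max over actions, shape (batch_size,)
next_max_q_values, _ = next_q_values.max(dim=1)
# Bootstrap only when the episode did not terminate
should_bootstrap = th.logical_not(terminateds)
td_target = rewards + gamma * next_max_q_values * should_bootstrap

print(td_target)
```

For the terminated transition the target reduces to the reward alone, because there is no next state to bootstrap from.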

**HINT**: The data sampled from the replay buffer uses the following structure:

```python
@dataclass
class ReplayBufferSamples:
    """
    A dataclass containing transitions from the replay buffer.
    """

    observations: np.ndarray  # same as states in the theory
    next_observations: np.ndarray
    actions: np.ndarray
    rewards: np.ndarray
    terminateds: np.ndarray
```

**HINT**: You can take a look at the section about Q-Network in the second notebook (DQN part I) to recall how to predict q-values using a q-network.
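
Selecting the Q-value of the action that was actually taken is the step that most often causes shape errors. The sketch below shows `th.gather()` followed by `squeeze()` on made-up values; it assumes the actions are stored with shape `(batch_size, 1)`, as the comments in the exercise suggest.

```python
import torch as th

# Hypothetical Q-values for a batch of 3 observations and 2 actions
q_values = th.tensor([[0.1, 0.9], [0.7, 0.3], [0.2, 0.4]])
# Actions taken during data collection, shape (batch_size, 1)
actions = th.tensor([[1], [0], [1]])

# Pick Q(s_t, a_t) for each row, the result has shape (batch_size, 1)
selected = th.gather(q_values, dim=1, index=actions)
# Drop the trailing dimension to get shape (batch_size,)
selected = selected.squeeze(dim=1)

print(selected.shape)
```

Without the `squeeze()`, subtracting a `(batch_size,)` target from a `(batch_size, 1)` tensor would silently broadcast to `(batch_size, batch_size)`, which is why the exercise asserts on the shapes.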

In [7]:
def dqn_update_no_target(
    q_net: QNetwork,
    optimizer: th.optim.Optimizer,
    replay_buffer: ReplayBuffer,
    batch_size: int,
    gamma: float,
) -> None:
    """
    Perform one gradient step on the Q-network
    using the data from the replay buffer.
    Note: this is the same as dqn_update in dqn.py, but without the target network.

    :param q_net: The Q-network to update
    :param optimizer: The optimizer to use
    :param replay_buffer: The replay buffer containing the transitions
    :param batch_size: The minibatch size, how many transitions to sample
    :param gamma: The discount factor
    """
    ### YOUR CODE HERE

    # Sample a minibatch from the replay buffer and convert it to PyTorch tensors
    # using the `.to_torch()` method
    replay_data = replay_buffer.sample(batch_size).to_torch()

    # We should not compute gradient with respect to the target
    with th.no_grad():
        # Compute the Q-values for the next observations
        # (replay_data.next_observations)
        # (batch_size, n_actions)
        next_q_values = q_net(replay_data.next_observations)

        # Follow greedy policy: use the one with the highest value
        # shape: (batch_size,)
        # Note: tensor.max(dim=..) returns a tuple (max, indices) in PyTorch
        next_max_q_values, _ = next_q_values.max(dim=1)  # Take the maximum over actions (columns) for each row
        
        # If the episode is terminated, set the target to the reward
        # (same as FQI, you can use `th.logical_not` to mask the next q values)
        should_bootstrap = th.logical_not(replay_data.terminateds)

        # 1-step TD target (TD(0) same as for FQI)
        td_target = replay_data.rewards + gamma * next_max_q_values * should_bootstrap

    # Get current Q-values estimates for the replay_data (batch_size, n_actions)
    q_values = q_net(replay_data.observations)

    # Select the Q-values corresponding to the actions that were selected
    # during data collection, since we only care about Q(S_t,A_t)
    # you should use `th.gather()` to pick, for each row, the Q-value
    # of the action that was taken (indexing along the columns)
    current_q_values = th.gather(q_values, dim=1, index=replay_data.actions)

    # Reshape from (batch_size, 1) to (batch_size,) to avoid broadcast error
    # You can use `tensor.squeeze(dim=..)`
    current_q_values = current_q_values.squeeze(dim=1)  # This is 1-dimensional now

    # Check for any shape/broadcast error
    # Current q-values must have the same shape as the TD target
    assert current_q_values.shape == (batch_size,), f"{current_q_values.shape} != {(batch_size,)}"
    assert current_q_values.shape == td_target.shape, f"{current_q_values.shape} != {td_target.shape}"

    # Compute the Mean Squared Error (MSE) loss
    # Optionally, one can use a Huber loss instead of the MSE loss
    loss = th.mean((current_q_values - td_target) ** 2)

    ### END OF YOUR CODE

    # Reset gradients
    optimizer.zero_grad()
    
    # Compute the gradients, back propagation
    loss.backward()
    
    # Update the parameters of the q-network
    optimizer.step()

Let's test the implementation:

In [8]:
env = gym.make("CartPole-v1")

# DQN
q_net = QNetwork(env.observation_space, env.action_space)

# Adam optimizer for parameters
optimizer = th.optim.Adam(q_net.parameters(), lr=0.001)

# Set up the replay buffer
replay_buffer = ReplayBuffer(2000, env.observation_space, env.action_space)

obs, _ = env.reset()

# Let's collect some data following an epsilon-greedy policy, fill replay buffer
for _ in range(1000):
    obs = collect_one_step(env, q_net, replay_buffer, obs, exploration_rate=0.1)

# Try to do some gradient steps:
for _ in range(10):
    dqn_update_no_target(q_net, optimizer, replay_buffer, batch_size=32, gamma=0.99)

### Exercise (10 minutes): write the training loop

Let's put everything together and implement the training loop that alternates between data collection and updating the Q-Network.
At first we will not use any target network.

<div>
    <img src="https://araffin.github.io/slides/dqn-tutorial/images/dqn/dqn_loop.png" width="600"/>
</div>

In [9]:
def run_dqn_no_target(
    env_id: str = "CartPole-v1",
    replay_buffer_size: int = 50_000,
    # Exploration schedule
    # (for the epsilon-greedy data collection)
    exploration_initial_eps: float = 1.0,
    exploration_final_eps: float = 0.01,
    n_timesteps: int = 20_000,
    update_interval: int = 2,
    learning_rate: float = 3e-4,
    batch_size: int = 64,
    gamma: float = 0.99,
    n_eval_episodes: int = 10,
    evaluation_interval: int = 1000,
    eval_exploration_rate: float = 0.0,
    seed: int = 2023,
    # device: Union[th.device, str] = "cpu",
    eval_render_mode: Optional[str] = None,  # "human", "rgb_array", None
) -> QNetwork:
    """
    Run Deep Q-Learning (DQN) on a given environment.
    (without target network)

    :param env_id: Name of the environment
    :param replay_buffer_size: Max capacity of the replay buffer
    :param exploration_initial_eps: The initial exploration rate
    :param exploration_final_eps: The final exploration rate
    :param n_timesteps: Number of timesteps in total
    :param update_interval: How often to update the Q-network
        (every update_interval steps)
    :param learning_rate: The learning rate to use for the optimizer
    :param batch_size: The minibatch size
    :param gamma: The discount factor
    :param n_eval_episodes: The number of episodes to evaluate the policy on
    :param evaluation_interval: How often to evaluate the policy
    :param eval_exploration_rate: The exploration rate to use during evaluation
    :param seed: Random seed for the pseudo random generator
    :param eval_render_mode: The render mode to use for evaluation
    """
    # Set seed for reproducibility
    # Seed NumPy and PyTorch pseudo random generators
    # Seed Numpy RNG
    np.random.seed(seed)
    # seed the RNG for all devices (both CPU and CUDA)
    th.manual_seed(seed)

    # Create the environment
    env = gym.make(env_id)
    assert isinstance(env.observation_space, spaces.Box)
    assert isinstance(env.action_space, spaces.Discrete)
    env.action_space.seed(seed)

    # Create the evaluation environment
    eval_env = gym.make(env_id, render_mode=eval_render_mode)
    eval_env.reset(seed=seed)
    eval_env.action_space.seed(seed)

    ### YOUR CODE HERE
    # TODO:
    # 1. Instantiate the Q-Network and the optimizer
    # 2. Instantiate the replay buffer
    # 3. Compute the current exploration rate (epsilon)
    # 4. Collect new transition by stepping in the env following
    # an epsilon-greedy strategy
    # 5. Update the Q-Network using gradient descent

    # Create the q-network
    q_net = QNetwork(env.observation_space, env.action_space)

    # Create the optimizer (PyTorch `th.optim.Adam` will be helpful here)
    optimizer = th.optim.Adam(q_net.parameters(), lr=learning_rate)

    # Create the Replay buffer
    replay_buffer = ReplayBuffer(replay_buffer_size, env.observation_space, env.action_space)

    # Reset the env
    obs, _ = env.reset(seed=seed)

    for current_step in range(1, n_timesteps + 1):
        # Compute the current exploration rate
        # according to the exploration schedule (update the value of epsilon)
        # you should use `linear_schedule()`
        exploration_rate = linear_schedule(
            exploration_initial_eps,
            exploration_final_eps,
            current_step,
            n_timesteps,
        )

        # Do one step in the environment following an epsilon-greedy policy
        # and store the transition in the replay buffer
        # you can re-use `collect_one_step()`
        obs = collect_one_step(env, q_net, replay_buffer, obs, exploration_rate=exploration_rate)

        # Update the Q-Network every `update_interval` steps
        if (current_step % update_interval) == 0:
            # Do one gradient step (using `dqn_update_no_target()`)
            dqn_update_no_target(q_net, optimizer, replay_buffer, batch_size, gamma)

        ### END OF YOUR CODE
        
        # Evaluation step
        if (current_step % evaluation_interval) == 0:
            print()
            print(f"Evaluation at step {current_step}:")
            # Evaluate the current greedy policy (deterministic policy)
            evaluate_policy(eval_env, q_net, n_eval_episodes, eval_exploration_rate=eval_exploration_rate)
    return q_net

## Train a DQN agent on CartPole environment

In [10]:
import os
# create log folder
os.makedirs(video_folder, exist_ok=True)

In [11]:
env_id = "CartPole-v1"
q_net = run_dqn_no_target(env_id)


Evaluation at step 1000:
Mean episode reward: 9.20 +/- 0.75

Evaluation at step 2000:
Mean episode reward: 9.50 +/- 0.67

Evaluation at step 3000:
Mean episode reward: 50.40 +/- 10.47

Evaluation at step 4000:
Mean episode reward: 88.80 +/- 41.77

Evaluation at step 5000:
Mean episode reward: 109.20 +/- 31.57

Evaluation at step 6000:
Mean episode reward: 110.50 +/- 26.14

Evaluation at step 7000:
Mean episode reward: 221.90 +/- 145.50

Evaluation at step 8000:
Mean episode reward: 403.30 +/- 133.11

Evaluation at step 9000:
Mean episode reward: 316.00 +/- 82.25

Evaluation at step 10000:
Mean episode reward: 164.50 +/- 58.70

Evaluation at step 11000:
Mean episode reward: 208.40 +/- 59.28

Evaluation at step 12000:
Mean episode reward: 327.30 +/- 119.10

Evaluation at step 13000:
Mean episode reward: 304.40 +/- 125.29

Evaluation at step 14000:
Mean episode reward: 370.80 +/- 108.71

Evaluation at step 15000:
Mean episode reward: 303.90 +/- 112.74

Evaluation at step 16000:
Mean epis

### Record and show video of the trained agent

In [12]:
eval_env = gym.make(env_id, render_mode="rgb_array")
n_eval_episodes = 3
eval_exploration_rate = 0.0 # No exploration during evaluation
video_name = f"DQN_no_target_{env_id}"

evaluate_policy(
    eval_env,
    q_net,
    n_eval_episodes,
    eval_exploration_rate=eval_exploration_rate,
    video_name=video_name,
    video_folder=video_folder
)

show_videos(video_folder, prefix=video_name)

Saving video to /Users/a24395/Desktop/files/PhD_Coursework/Year2/RL_seminar/Code/Gym_FQI_DQN/DQN_videos/DQN_no_target_CartPole-v1.mp4
Moviepy - Building video /Users/a24395/Desktop/files/PhD_Coursework/Year2/RL_seminar/Code/Gym_FQI_DQN/DQN_videos/DQN_no_target_CartPole-v1.mp4.
Moviepy - Writing video /Users/a24395/Desktop/files/PhD_Coursework/Year2/RL_seminar/Code/Gym_FQI_DQN/DQN_videos/DQN_no_target_CartPole-v1.mp4

Moviepy - Done !
Moviepy - video ready /Users/a24395/Desktop/files/PhD_Coursework/Year2/RL_seminar/Code/Gym_FQI_DQN/DQN_videos/DQN_no_target_CartPole-v1.mp4
Mean episode reward: 487.33 +/- 17.91


## [Bonus] DQN Target Network


<div>
    <img src="https://araffin.github.io/slides/dqn-tutorial/images/dqn/target_q_network.png" width="1000"/>
</div>

The only thing that changes is how the next Q-values are predicted (for terminal transitions the target is just $r_t$ in both cases).

In DQN without target, the online network with weights **$\theta$** is used:

$y = r_t + \gamma \cdot \max_{a \in A}(\hat{Q}_{\pi}(s_{t+1}, a; \theta))$


whereas in DQN with a target network, the target Q-network (a delayed copy of the Q-network) with weights **$\theta^\prime$** is used instead:

$y = r_t + \gamma \cdot \max_{a \in A}(\hat{Q}_{\pi}(s_{t+1}, a; \theta^\prime))$
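
To make the "delayed copy" concrete, here is a minimal sketch of a hard copy followed by one gradient step on the online network only. A single `th.nn.Linear` layer stands in for the Q-network (an assumption for brevity, the tutorial uses `QNetwork`), and the loss is an arbitrary placeholder.

```python
import torch as th

th.manual_seed(0)
# Stand-in for the Q-network: a single linear layer (assumption, not QNetwork)
q_net = th.nn.Linear(4, 2)
q_target_net = th.nn.Linear(4, 2)
# Hard copy: the target network starts as an exact copy of the online network
q_target_net.load_state_dict(q_net.state_dict())

obs = th.ones(1, 4)
before = q_target_net(obs).detach().clone()

# One gradient step on the online network only (placeholder loss)
optimizer = th.optim.SGD(q_net.parameters(), lr=0.1)
loss = q_net(obs).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The target network output stays fixed until the next hard copy
target_unchanged = th.equal(q_target_net(obs).detach(), before)
online_changed = not th.equal(q_net(obs).detach(), before)
print(target_unchanged, online_changed)
```

Because `load_state_dict()` copies the values rather than sharing the tensors, the target network keeps producing the old predictions while the online network moves.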


### Exercise (5 minutes): write the DQN update with target network

**HINT**: it is exactly the same as `dqn_update_no_target` except for computing the next q-values

In [13]:
def dqn_update(
    q_net: QNetwork,
    q_target_net: QNetwork,
    optimizer: th.optim.Optimizer,
    replay_buffer: ReplayBuffer,
    batch_size: int,
    gamma: float,
) -> None:
    """
    Perform one gradient step on the Q-network
    using the data from the replay buffer.

    :param q_net: The Q-network to update
    :param q_target_net: The target Q-network, to compute the td-target.
    :param optimizer: The optimizer to use
    :param replay_buffer: The replay buffer containing the transitions
    :param batch_size: The minibatch size, how many transitions to sample
    :param gamma: The discount factor
    """

    # Sample the replay buffer and convert them to PyTorch tensors
    replay_data = replay_buffer.sample(batch_size).to_torch()

    with th.no_grad():
        ### YOUR CODE HERE
        # TODO: use the target q-network instead of the online q-network
        # to compute the next values

        # Compute the Q-values for the next observations (batch_size, n_actions)
        # using the target network
        next_q_values = q_target_net(replay_data.next_observations)

        # Follow greedy policy: use the one with the highest value
        # (batch_size,)
        next_max_q_values, _ = next_q_values.max(dim=1)

        # If the episode is terminated, set the target to the reward
        should_bootstrap = th.logical_not(replay_data.terminateds)

        # 1-step TD target
        td_target = replay_data.rewards + gamma * next_max_q_values * should_bootstrap

        ### END OF YOUR CODE

    # Get current Q-values estimates for the replay_data (batch_size, n_actions)
    q_values = q_net(replay_data.observations)
    # Select the Q-values corresponding to the actions that were selected
    # during data collection
    current_q_values = th.gather(q_values, dim=1, index=replay_data.actions)
    # Reshape from (batch_size, 1) to (batch_size,) to avoid broadcast error
    current_q_values = current_q_values.squeeze(dim=1)

    # Check for any shape/broadcast error
    # Current q-values must have the same shape as the TD target
    assert current_q_values.shape == (batch_size,), f"{current_q_values.shape} != {(batch_size,)}"
    assert current_q_values.shape == td_target.shape, f"{current_q_values.shape} != {td_target.shape}"

    # Compute the Mean Squared Error (MSE) loss
    # Optionally, one can use a Huber loss instead of the MSE loss
    loss = th.mean((current_q_values - td_target) ** 2)
    
    # Huber loss
    # loss = th.nn.functional.smooth_l1_loss(current_q_values, td_target)

    # Reset gradients
    optimizer.zero_grad()
    # Compute the gradients
    loss.backward()
    # Update the parameters of the q-network
    optimizer.step()
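
The commented-out Huber loss above can be compared with the MSE loss on a small example. The values below are made up and include one large error to show how the two losses differ; `smooth_l1_loss` is used with its default `beta=1.0`.

```python
import torch as th
import torch.nn.functional as F

# Hypothetical Q-value estimates and TD targets, the last pair is an outlier
current_q_values = th.tensor([1.0, 2.0, 10.0])
td_target = th.tensor([1.5, 2.0, 0.0])

# Mean squared error: the outlier contributes quadratically
mse_loss = th.mean((current_q_values - td_target) ** 2)
# Huber / smooth L1 loss: quadratic for small errors, linear for large ones
huber_loss = F.smooth_l1_loss(current_q_values, td_target)

print(f"MSE: {mse_loss.item():.3f}, Huber: {huber_loss.item():.3f}")
```

The linear regime limits the size of the gradient coming from a single bad target, which is one reason the Huber loss is a common choice for DQN.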

### Updated training loop

In [14]:
def run_dqn(
    env_id: str = "CartPole-v1",
    replay_buffer_size: int = 50_000,
    # How often do we copy the parameters from the Q-network to the target network
    target_network_update_interval: int = 1000,
    # Warmup phase
    learning_starts: int = 100,
    # Exploration schedule
    # (for the epsilon-greedy data collection)
    exploration_initial_eps: float = 1.0,
    exploration_final_eps: float = 0.01,
    exploration_fraction: float = 0.1,
    n_timesteps: int = 20_000,
    update_interval: int = 2,
    learning_rate: float = 3e-4,
    batch_size: int = 64,
    gamma: float = 0.99,
    n_hidden_units: int = 64,
    n_eval_episodes: int = 10,
    evaluation_interval: int = 1000,
    eval_exploration_rate: float = 0.0,
    seed: int = 2023,
    # device: Union[th.device, str] = "cpu",
    eval_render_mode: Optional[str] = None,  # "human", "rgb_array", None
) -> QNetwork:
    """
    Run Deep Q-Learning (DQN) on a given environment.
    (with a target network)

    :param env_id: Name of the environment
    :param replay_buffer_size: Max capacity of the replay buffer
    :param target_network_update_interval: How often do we copy the parameters
         to the target network
    :param learning_starts: Warmup phase to fill the replay buffer
        before starting the optimization.
    :param exploration_initial_eps: The initial exploration rate
    :param exploration_final_eps: The final exploration rate
    :param exploration_fraction: The fraction of the number of steps
        during which the exploration rate is annealed from
        initial_eps to final_eps.
        After this many steps, the exploration rate remains constant.
    :param n_timesteps: Number of timesteps in total
    :param update_interval: How often to update the Q-network
        (every update_interval steps)
    :param learning_rate: The learning rate to use for the optimizer
    :param batch_size: The minibatch size
    :param gamma: The discount factor
    :param n_hidden_units: Number of units for each hidden layer
        of the Q-Network.
    :param n_eval_episodes: The number of episodes to evaluate the policy on
    :param evaluation_interval: How often to evaluate the policy
    :param eval_exploration_rate: The exploration rate to use during evaluation
    :param seed: Random seed for the pseudo random generator
    :param eval_render_mode: The render mode to use for evaluation
    """
    # Set seed for reproducibility
    # Seed NumPy and PyTorch pseudo random generators
    # Seed Numpy RNG
    np.random.seed(seed)
    # seed the RNG for all devices (both CPU and CUDA)
    th.manual_seed(seed)

    # Create the environment
    env = gym.make(env_id)
    # For highway env
    env = gym.wrappers.FlattenObservation(env)
    env = gym.wrappers.RecordEpisodeStatistics(env)
    assert isinstance(env.observation_space, spaces.Box)
    assert isinstance(env.action_space, spaces.Discrete)
    env.action_space.seed(seed)

    # Create the evaluation environment
    eval_env = gym.make(env_id, render_mode=eval_render_mode)
    eval_env = gym.wrappers.FlattenObservation(eval_env)
    eval_env.reset(seed=seed)
    eval_env.action_space.seed(seed)

    # Create the q-network
    q_net = QNetwork(env.observation_space, env.action_space, n_hidden_units=n_hidden_units)
    
    # Create the target network
    q_target_net = QNetwork(env.observation_space, env.action_space, n_hidden_units=n_hidden_units)
    
    # Copy the parameters of the q-network to the target network
    q_target_net.load_state_dict(q_net.state_dict())

    # For flappy bird
    if env.observation_space.dtype == np.float64:
        q_net.double()
        q_target_net.double()

    # Create the optimizer, we only optimize the parameters of the q-network
    optimizer = th.optim.Adam(q_net.parameters(), lr=learning_rate)

    # Create the Replay buffer
    replay_buffer = ReplayBuffer(replay_buffer_size, env.observation_space, env.action_space)
    
    # Reset the env
    obs, _ = env.reset(seed=seed)
    for current_step in range(1, n_timesteps + 1):
        # Update the current exploration schedule (update the value of epsilon)
        exploration_rate = linear_schedule(
            exploration_initial_eps,
            exploration_final_eps,
            current_step,
            int(exploration_fraction * n_timesteps),
        )
        
        # Do one step in the environment following an epsilon-greedy policy
        # and store the transition in the replay buffer
        obs = collect_one_step(
            env,
            q_net,
            replay_buffer,
            obs,
            exploration_rate=exploration_rate,
            verbose=0,
        )

        # Update the target network
        # by copying the parameters from the Q-network every target_network_update_interval steps
        if (current_step % target_network_update_interval) == 0:
            q_target_net.load_state_dict(q_net.state_dict())

        # Update the Q-network every update_interval steps
        # after learning_starts steps have passed (warmup phase)
        if (current_step % update_interval) == 0 and current_step > learning_starts:
            # Do one gradient step
            dqn_update(q_net, q_target_net, optimizer, replay_buffer, batch_size, gamma=gamma)

        if (current_step % evaluation_interval) == 0:
            print()
            print(f"Evaluation at step {current_step}:")
            print(f"exploration_rate = {exploration_rate:.2f}")
            # Evaluate the current greedy policy (deterministic policy)
            evaluate_policy(eval_env, q_net, n_eval_episodes, eval_exploration_rate=eval_exploration_rate)
            
            # Save a checkpoint
            os.makedirs("./data_DQN", exist_ok=True)
            th.save(q_net.state_dict(), f"./data_DQN/q_net_checkpoint_{env_id}_{current_step}.pth")
    return q_net
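
The effect of `exploration_fraction` can be illustrated with a small stand-alone schedule. `linear_schedule_sketch` below is a hypothetical re-implementation written for this example; the actual `linear_schedule()` comes from part I and may differ in its details.

```python
def linear_schedule_sketch(initial_value: float, final_value: float, current_step: int, max_steps: int) -> float:
    """Linearly interpolate from initial_value to final_value, then stay constant."""
    # Fraction of the annealing phase completed, clipped to 1.0
    progress = min(current_step / max_steps, 1.0)
    return initial_value + progress * (final_value - initial_value)


n_timesteps = 80_000
exploration_fraction = 0.1
# Epsilon is annealed during the first 10% of training only
max_steps = int(exploration_fraction * n_timesteps)

for step in (0, 4000, 8000, 40_000):
    eps = linear_schedule_sketch(1.0, 0.04, step, max_steps)
    print(f"step={step}: epsilon={eps:.2f}")
```

With these values, epsilon reaches its final value after 8000 steps and then remains constant for the rest of training.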

## Train DQN agent with target network on CartPole env

In [15]:
# Tuned hyperparameters from the RL Zoo3 of the Stable Baselines3 library
# https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/hyperparams/dqn.yml

env_id = "CartPole-v1"

q_net = run_dqn(
    env_id=env_id,
    replay_buffer_size=100_000,
    # Note: you can remove the target network
    # by setting target_network_update_interval=1
    target_network_update_interval=10,
    learning_starts=1000,
    exploration_initial_eps=1.0,
    exploration_final_eps=0.04,
    exploration_fraction=0.1,
    n_timesteps=80_000,
    update_interval=2,
    learning_rate=1e-3,
    batch_size=64,
    gamma=0.99,
    n_eval_episodes=10,
    evaluation_interval=5000,
    # No exploration during evaluation
    # (deterministic policy)
    eval_exploration_rate=0.0,
    seed=2022,
)


Evaluation at step 5000:
exploration_rate = 0.40
Mean episode reward: 248.30 +/- 108.67

Evaluation at step 10000:
exploration_rate = 0.04
Mean episode reward: 265.70 +/- 53.99

Evaluation at step 15000:
exploration_rate = 0.04
Mean episode reward: 164.10 +/- 30.11

Evaluation at step 20000:
exploration_rate = 0.04
Mean episode reward: 186.20 +/- 65.99

Evaluation at step 25000:
exploration_rate = 0.04
Mean episode reward: 260.70 +/- 52.42

Evaluation at step 30000:
exploration_rate = 0.04
Mean episode reward: 500.00 +/- 0.00

Evaluation at step 35000:
exploration_rate = 0.04
Mean episode reward: 500.00 +/- 0.00

Evaluation at step 40000:
exploration_rate = 0.04
Mean episode reward: 500.00 +/- 0.00

Evaluation at step 45000:
exploration_rate = 0.04
Mean episode reward: 500.00 +/- 0.00

Evaluation at step 50000:
exploration_rate = 0.04
Mean episode reward: 500.00 +/- 0.00

Evaluation at step 55000:
exploration_rate = 0.04
Mean episode reward: 500.00 +/- 0.00

Evaluation at step 60000:


### Visualize the trained agent

In [16]:
eval_env = gym.make(env_id, render_mode="rgb_array")
n_eval_episodes = 3
eval_exploration_rate = 0.0
video_name = f"DQN_{env_id}"

# Optional: load checkpoint
# q_net = QNetwork(eval_env.observation_space, eval_env.action_space, n_hidden_units=64)
# q_net.load_state_dict(th.load("./data_DQN/q_net_checkpoint_CartPole-v1_75000.pth"))

evaluate_policy(
    eval_env,
    q_net,
    n_eval_episodes,
    eval_exploration_rate=eval_exploration_rate,
    video_name=video_name,
    video_folder=video_folder,
)

show_videos(video_folder, prefix=video_name)

Saving video to /Users/a24395/Desktop/files/PhD_Coursework/Year2/RL_seminar/Code/Gym_FQI_DQN/DQN_videos/DQN_CartPole-v1.mp4
Moviepy - Building video /Users/a24395/Desktop/files/PhD_Coursework/Year2/RL_seminar/Code/Gym_FQI_DQN/DQN_videos/DQN_CartPole-v1.mp4.
Moviepy - Writing video /Users/a24395/Desktop/files/PhD_Coursework/Year2/RL_seminar/Code/Gym_FQI_DQN/DQN_videos/DQN_CartPole-v1.mp4

Moviepy - Done !
Moviepy - video ready /Users/a24395/Desktop/files/PhD_Coursework/Year2/RL_seminar/Code/Gym_FQI_DQN/DQN_videos/DQN_CartPole-v1.mp4
Mean episode reward: 500.00 +/- 0.00




## Train a DQN agent on Flappy Bird

See the [GitHub repo](https://github.com/araffin/flappy-bird-gymnasium/tree/patch-1) to learn more about this environment.

<div>
    <img src="https://raw.githubusercontent.com/markub3327/flappy-bird-gymnasium/main/imgs/dqn.gif" width="300"/>
</div>


In [17]:
!pip install "flappy-bird-gymnasium @ git+https://github.com/araffin/flappy-bird-gymnasium@patch-1"

Collecting flappy-bird-gymnasium@ git+https://github.com/araffin/flappy-bird-gymnasium@patch-1
  Cloning https://github.com/araffin/flappy-bird-gymnasium (to revision patch-1) to /private/var/folders/sd/ntm8vytx57g033nh4jm4j2700000gn/T/pip-install-s7l2uy2n/flappy-bird-gymnasium_42a1d998c66d458b91b26defc0c32ce8
  Running command git clone --filter=blob:none --quiet https://github.com/araffin/flappy-bird-gymnasium /private/var/folders/sd/ntm8vytx57g033nh4jm4j2700000gn/T/pip-install-s7l2uy2n/flappy-bird-gymnasium_42a1d998c66d458b91b26defc0c32ce8
  Running command git checkout -b patch-1 --track origin/patch-1
  Switched to a new branch 'patch-1'
  branch 'patch-1' set up to track 'origin/patch-1'.
  Resolved https://github.com/araffin/flappy-bird-gymnasium to commit 8828737242a38c0cd18c4260819fe7e695fb6bae
  Preparing metadata (setup.py) ... done


In [18]:
import flappy_bird_gymnasium  # noqa: F401

In [19]:
env_id = "FlappyBird-v0"

q_net = run_dqn(
    env_id=env_id,
    replay_buffer_size=100_000,
    # Note: you can remove the target network
    # by setting target_network_update_interval=1
    target_network_update_interval=250,
    learning_starts=10_000,
    exploration_initial_eps=1.0,
    exploration_final_eps=0.03,
    exploration_fraction=0.1,
    n_timesteps=500_000,
    update_interval=4,
    learning_rate=1e-3,
    batch_size=128,
    gamma=0.98,
    n_eval_episodes=5,
    evaluation_interval=50000,
    n_hidden_units=256,
    # No exploration during evaluation
    # (deterministic policy)
    eval_exploration_rate=0.0,
    seed=2023,
    eval_render_mode=None,
)


Evaluation at step 50000:
exploration_rate = 0.03
Mean episode reward: 9.00 +/- 0.00

Evaluation at step 100000:
exploration_rate = 0.03
Mean episode reward: 10.80 +/- 2.20

Evaluation at step 150000:
exploration_rate = 0.03
Mean episode reward: 18.20 +/- 5.93

Evaluation at step 200000:
exploration_rate = 0.03
Mean episode reward: 24.60 +/- 5.63

Evaluation at step 250000:
exploration_rate = 0.03
Mean episode reward: 21.98 +/- 9.05

Evaluation at step 300000:
exploration_rate = 0.03
Mean episode reward: 32.66 +/- 32.52

Evaluation at step 350000:
exploration_rate = 0.03
Mean episode reward: 187.26 +/- 141.36

Evaluation at step 400000:
exploration_rate = 0.03
Mean episode reward: 128.20 +/- 58.85

Evaluation at step 450000:
exploration_rate = 0.03
Mean episode reward: 236.18 +/- 262.10

Evaluation at step 500000:
exploration_rate = 0.03
Mean episode reward: 207.12 +/- 123.61


### Record a video of the trained agent

In [21]:
eval_env = gym.make(env_id, render_mode="rgb_array")
n_eval_episodes = 3
eval_exploration_rate = 0.00
video_name = f"DQN_{env_id}"


# Load the final checkpoint saved during training
q_net = QNetwork(eval_env.observation_space, eval_env.action_space, n_hidden_units=256)
# Convert weights from float32 to float64 to match the float64 Flappy Bird observations
q_net.double()
q_net.load_state_dict(th.load("./data_DQN/q_net_checkpoint_FlappyBird-v0_500000.pth"))

evaluate_policy(
    eval_env,
    q_net,
    n_eval_episodes,
    eval_exploration_rate=eval_exploration_rate,
    video_name=video_name,
    video_folder=video_folder,
)

show_videos(video_folder, prefix=video_name)

Saving video to /Users/a24395/Desktop/files/PhD_Coursework/Year2/RL_seminar/Code/Gym_FQI_DQN/DQN_videos/DQN_FlappyBird-v0.mp4
Moviepy - Building video /Users/a24395/Desktop/files/PhD_Coursework/Year2/RL_seminar/Code/Gym_FQI_DQN/DQN_videos/DQN_FlappyBird-v0.mp4.
Moviepy - Writing video /Users/a24395/Desktop/files/PhD_Coursework/Year2/RL_seminar/Code/Gym_FQI_DQN/DQN_videos/DQN_FlappyBird-v0.mp4

Moviepy - Done !
Moviepy - video ready /Users/a24395/Desktop/files/PhD_Coursework/Year2/RL_seminar/Code/Gym_FQI_DQN/DQN_videos/DQN_FlappyBird-v0.mp4
Mean episode reward: 269.63 +/- 238.37
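
The save, `double()`, load sequence used for the checkpoint can be tried in isolation. The sketch below uses a small `th.nn.Sequential` as a stand-in for `QNetwork` and a temporary file name chosen for this example; both are assumptions, not part of the tutorial code.

```python
import tempfile
from pathlib import Path

import torch as th


def make_net() -> th.nn.Sequential:
    # Stand-in network (assumption): the tutorial uses QNetwork instead
    return th.nn.Sequential(th.nn.Linear(4, 8), th.nn.ReLU(), th.nn.Linear(8, 2))


# Network trained on float64 observations
net = make_net().double()

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = Path(tmp_dir) / "q_net_checkpoint_sketch.pth"
    th.save(net.state_dict(), checkpoint_path)

    # A freshly created network is float32 by default
    restored = make_net()
    # Convert before loading so the float64 weights are not cast down to float32
    restored.double()
    restored.load_state_dict(th.load(checkpoint_path))

obs = th.zeros(1, 4, dtype=th.float64)
same_output = th.equal(net(obs), restored(obs))
print(restored[0].weight.dtype, same_output)
```

Keeping the network in float64 also means it accepts float64 observations directly, without an extra cast at every step.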




### Going further

- analyse the learned Q-values
- explore different values for the target update interval, or use a soft update instead of a hard copy
- experiment with the Huber loss (smooth L1 loss) instead of the L2 loss (mean squared error)
- play with different environments
- implement a CNN to play Flappy Bird/Pong from pixels (this requires stacking frames)
- implement DQN extensions (double Q-learning, prioritized experience replay, ...)
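
As a starting point for the soft update idea in the list above, here is a minimal sketch of Polyak averaging. The helper `soft_update` and its `tau` parameter are hypothetical names introduced for this example, and `th.nn.Linear` again stands in for the Q-network.

```python
import torch as th


def soft_update(q_net: th.nn.Module, q_target_net: th.nn.Module, tau: float) -> None:
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    with th.no_grad():
        for param, target_param in zip(q_net.parameters(), q_target_net.parameters()):
            target_param.mul_(1.0 - tau).add_(param, alpha=tau)


# Stand-in networks (assumption, not QNetwork) with easy-to-check weights
q_net = th.nn.Linear(4, 2)
q_target_net = th.nn.Linear(4, 2)
with th.no_grad():
    for p in q_net.parameters():
        p.fill_(1.0)
    for p in q_target_net.parameters():
        p.fill_(0.0)

# Move the target network 10% of the way towards the online network
soft_update(q_net, q_target_net, tau=0.1)
print(q_target_net.weight[0, 0].item())
```

Setting `tau=1.0` recovers the hard copy used in this notebook, while small values of `tau` applied at every step make the target network change gradually.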

## Conclusion

In this notebook, you have seen how to implement the DQN algorithm (update rule and training loop) using all the components from part I (replay buffer, epsilon-greedy exploration strategy, Q-Network, ...).