<a href="https://colab.research.google.com/github/Bryan-Az/RL-DQN-Gym/blob/main/DQN_Projects/RL_DQN_Gym_CliffWalking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training a Deep Q-Learning Neural Network Agent in Gymnasium's Cliff Walking Environment

## Imports and Installs

In [1]:
!pip install gym "gym[atari]" "gym[accept-rom-license]" agilerl "accelerate>=0.21.0"

In [2]:
from __future__ import annotations  # must be the first statement in the cell

import os
from collections import defaultdict

import gymnasium as gym
import numpy as np

# These imports are used to implement the neural network agent
import torch
from agilerl.algorithms.dqn import DQN
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.training.train_off_policy import train_off_policy
from agilerl.utils.utils import create_population, make_vect_envs
from tqdm import tqdm

## Setting up the Reinforcement Learning Environment

In [3]:
env = make_vect_envs("CliffWalking-v0", num_envs=1)
n_episodes = 1_000
env = gym.wrappers.RecordEpisodeStatistics(env, deque_size=n_episodes)

In [4]:
try:
    state_dim = env.single_observation_space.n  # Discrete observation space
    one_hot = True  # Requires one-hot encoding
    is_discrete_obs = True
except Exception:
    state_dim = env.single_observation_space.shape  # Continuous observation space
    one_hot = False  # Does not require one-hot encoding
    is_discrete_obs = False
try:
    action_dim = env.single_action_space.n  # Discrete action space
    is_discrete_actions = True
except Exception:
    action_dim = env.single_action_space.shape[0]  # Continuous action space
    is_discrete_actions = False

In [5]:
print(f"Action dimension: {action_dim}")
print(f"Observation dimension: {state_dim}")
print(f"Is discrete action space: {is_discrete_actions}")
print(f"Is discrete observation space: {is_discrete_obs}")
print(f"Is one-hot:  {one_hot}")

Action dimension: 4
Observation dimension: 48
Is discrete action space: True
Is discrete observation space: True
Is one-hot:  True


In [6]:
device = "cuda" if torch.cuda.is_available() else "cpu"

The Cliff Walking environment has a discrete observation space, so it exposes the number of states through .n rather than a meaningful .shape. The DQN agent below is given state_dim as a Python list rather than a single integer, so we wrap it as [state_dim] when creating the agent. As the output above shows, both the observation space (48 states) and the action space (4 actions) are discrete, so observations are one-hot encoded.
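
As a rough illustration of what `one_hot = True` implies for the 48 discrete states reported above, the sketch below converts a state index into a one-hot vector and back into a grid position. It assumes the 4x12 row-major layout described in the Gymnasium documentation for Cliff Walking (`state = row * 12 + col`). The helper names `one_hot_encode` and `state_to_cell` are hypothetical and are not part of AgileRL, which performs its own encoding internally.

```python
N_ROWS, N_COLS = 4, 12  # assumed Cliff Walking grid layout (4 x 12 = 48 states)


def one_hot_encode(state: int, n_states: int = N_ROWS * N_COLS) -> list[float]:
    """Return a vector of zeros with a single 1.0 at index `state`."""
    if not 0 <= state < n_states:
        raise ValueError(f"state {state} is outside [0, {n_states})")
    vec = [0.0] * n_states
    vec[state] = 1.0
    return vec


def state_to_cell(state: int) -> tuple[int, int]:
    """Map a flat state index to (row, col), assuming row-major order."""
    return divmod(state, N_COLS)


start_state = 36  # bottom-left corner under the assumed layout
encoded = one_hot_encode(start_state)
print(len(encoded), encoded.index(1.0), state_to_cell(start_state))
```

Each state becomes its own input unit, so the network does not have to treat state 36 as numerically "close" to state 35 or 37.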

## Setting up the Deep Q-Learning Agent

In this block we set the default hyperparameters.

In [7]:
# Initial hyperparameters
INIT_HP = {
    "BATCH_SIZE": 64,  # Batch size
    "LR": 0.0001,  # Learning rate
    "GAMMA": 0.99,  # Discount factor
    "MEMORY_SIZE": 5_000,  # Max memory buffer size
    "LEARN_STEP": 5,  # Learning frequency
    "N_STEP": 3,  # Step number to calculate td error
    "PER": True,  # Use prioritized experience replay buffer
    "ALPHA": 0.6,  # Prioritized replay buffer parameter
    "BETA": 0.4,  # Importance sampling coefficient
    "TAU": 0.001,  # For soft update of target parameters
    "PRIOR_EPS": 0.000001,  # Minimum priority for sampling
    "NUM_ATOMS": 51,  # Unit number of support
    "V_MIN": -200.0,  # Minimum value of support
    "V_MAX": 200.0,  # Maximum value of support
    "NOISY": True,  # Add noise directly to the weights of the network
    "LEARNING_DELAY": 5000,  # Steps before starting learning
    # Swap image channels dimension from last to first [H, W, C] -> [C, H, W]
    "CHANNELS_LAST": False,  # Use with RGB states
    "TARGET_SCORE": 200.0,  # Early-stopping score; never reached here, since Cliff Walking returns are negative
    "MAX_STEPS": 10000,  # Maximum number of steps an agent takes in an environment
    "EVO_STEPS": 10000,  # Evolution frequency
    "EVAL_STEPS": None,  # Number of evaluation steps per episode
    "EVAL_LOOP": 1,  # Number of evaluation episodes
}
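
To make `GAMMA` and `N_STEP` more concrete, here is a small sketch of an n-step bootstrapped target: the discounted sum of the next n rewards plus a discounted value estimate for the state reached afterwards. The function name `n_step_target` and the bootstrap value of -10.0 are illustrative assumptions. AgileRL computes its targets internally and may differ in detail.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99, done=False):
    """Discounted sum of `rewards` plus a discounted bootstrap value.

    rewards: rewards collected over n consecutive steps.
    bootstrap_value: estimated max Q-value of the state reached after n steps.
    done: if True the episode ended within the n steps, so nothing is bootstrapped.
    """
    target = 0.0
    for k, reward in enumerate(rewards):
        target += (gamma ** k) * reward
    if not done:
        target += (gamma ** len(rewards)) * bootstrap_value
    return target


# Three ordinary Cliff Walking steps (reward -1 each), followed by a
# hypothetical value estimate of -10.0 for the state reached afterwards.
print(round(n_step_target([-1.0, -1.0, -1.0], bootstrap_value=-10.0), 4))
```

With `gamma` close to 1, rewards several steps ahead count almost as much as the immediate one, which suits a task where every step costs -1 until the goal is reached.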

In this block we set the neural network config. Since this environment has simple discrete observation and action spaces, a typical multi-layer perceptron network is sufficient.

In [8]:
NET_CONFIG = {
    "arch": "mlp",  # Network architecture
    "hidden_size": [32, 32],  # Network hidden size
}
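
For a sense of scale, the sketch below lists the layer shapes and parameter count of a plain fully connected network with two hidden layers of 32 units, 48 one-hot inputs and 4 outputs. It assumes ordinary dense layers with biases and nothing else. The network AgileRL actually builds from this config may contain extra components (for example normalization layers), so treat the count as an estimate. The helper names are hypothetical.

```python
def mlp_layer_shapes(input_dim, hidden_size, output_dim):
    """Return (in_features, out_features) for each fully connected layer."""
    sizes = [input_dim, *hidden_size, output_dim]
    return list(zip(sizes[:-1], sizes[1:]))


def count_parameters(layer_shapes):
    """Weights plus one bias per output unit, summed over every layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in layer_shapes)


# 48 one-hot inputs and 4 actions, matching the dimensions printed earlier.
shapes = mlp_layer_shapes(input_dim=48, hidden_size=[32, 32], output_dim=4)
print(shapes)
print(count_parameters(shapes))
```

Under these assumptions the network has a few thousand parameters, which is small next to the convolutional networks typically used for image observations.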

Finally, the neural network config is passed to the DQN agent to alter the default network.

In [9]:
agent = DQN(
    net_config=NET_CONFIG,
    batch_size=INIT_HP["BATCH_SIZE"],  # batch size is independent of the state dimension
    state_dim=[state_dim],
    action_dim=action_dim,
    one_hot=one_hot,
    lr=INIT_HP["LR"],
    learn_step=INIT_HP["LEARN_STEP"],
    gamma=INIT_HP["GAMMA"],
    tau=INIT_HP["TAU"],
    device=device)
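
The `tau` argument above controls the soft update of the target network. A minimal sketch of the usual rule, `target = tau * online + (1 - tau) * target`, is shown below on plain Python lists. This illustrates the general technique under that assumption and is not AgileRL's implementation; `soft_update` is a hypothetical helper.

```python
def soft_update(target_weights, online_weights, tau=0.001):
    """Return new target weights: tau * online + (1 - tau) * target."""
    return [
        tau * w_online + (1.0 - tau) * w_target
        for w_target, w_online in zip(target_weights, online_weights)
    ]


target = [0.0, 0.0]
online = [1.0, -2.0]
for _ in range(1000):  # 1000 soft updates toward fixed online weights
    target = soft_update(target, online)
print(target)
```

With `tau = 0.001` the target weights have covered only about 63% of the gap after 1000 updates, which is what keeps the learning targets slow-moving and stable.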

## Training the DQN Agent in the Cliff Walking Environment

### What is the difference in training with a DQN agent vs a Q-Learning agent?

#### Memory and Replay buffer
One difference when training an agent with a memory (replay buffer) is that the simpler agent.update(state, action, reward, next_state, done) call is decomposed into three calls:

1. memory.save_to_memory_vect_envs(state, action, reward, next_state, done)
2. experience = memory.sample(agent.batch_size)
3. agent.learn(experience)

Because learning draws a batch of stored experiences instead of only the latest step, the input used for training is higher dimensional (for example, it can carry multiple channels or multiple observations).
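
The save / sample / learn split can be sketched with a tiny uniform replay buffer built on the standard library. `TinyReplayBuffer` is a hypothetical stand-in written for illustration; it is not AgileRL's `ReplayBuffer`, and it omits vectorised environments, device placement and prioritisation.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class TinyReplayBuffer:
    """Illustrative fixed-size buffer; the oldest transitions are dropped first."""

    def __init__(self, memory_size):
        self.memory = deque(maxlen=memory_size)

    def save(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a uniform random batch and regroup it field by field."""
        batch = random.sample(list(self.memory), batch_size)
        return Transition(*zip(*batch))

    def __len__(self):
        return len(self.memory)


buffer = TinyReplayBuffer(memory_size=5)
for step in range(8):  # 8 saves into a buffer of size 5: the first 3 are evicted
    buffer.save(state=step, action=step % 4, reward=-1.0, next_state=step + 1, done=False)

experience = buffer.sample(batch_size=3)
print(len(buffer), experience.state, experience.reward)
```

The sampled `experience` groups each field into its own tuple (all states together, all actions together), which is the batch-shaped input a learning step would consume.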

#### Training steps

When a memory or replay buffer is used, learning cannot start until the memory holds at least 'batch_size' experiences (memory samples, i.e. steps in the environment). Filling the memory can be viewed as the exploration phase, since no learning takes place. Once enough experiences are stored, the agent continues taking steps while also learning (the training phase). In the training call below, 'learning_delay' sets how many steps are taken before learning starts.
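
The two phases can be sketched by counting how many learning updates would happen under one simple scheduling rule: learn every `learn_step` steps once `learning_delay` steps have passed and at least one full batch is stored. This rule is an assumption made for illustration and is not AgileRL's exact scheduling logic; `count_learning_updates` is a hypothetical helper.

```python
def count_learning_updates(max_steps, learning_delay, learn_step, batch_size):
    """Count updates under a simple rule: learn every `learn_step` steps once
    the delay has passed and the buffer holds at least one full batch."""
    buffer_len = 0
    updates = 0
    for step in range(1, max_steps + 1):
        buffer_len += 1  # one transition stored per environment step
        warmed_up = step > learning_delay and buffer_len >= batch_size
        if warmed_up and step % learn_step == 0:
            updates += 1
    return updates


# Values taken from INIT_HP: 10,000 steps, a 5,000-step delay, learning every 5 steps.
print(count_learning_updates(max_steps=10_000, learning_delay=5_000, learn_step=5, batch_size=64))
```

Under this rule, half of the step budget is spent only collecting experience, so a long learning delay relative to `MAX_STEPS` leaves comparatively few updates for training.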

In [10]:
# the environmental variables are inferred for use with the agent
# the 'Experience Replay Buffer' / agent memory is added to provide learning stability

field_names = ["state", "action", "reward", "next_state", "done"]

In [13]:
memory = ReplayBuffer(memory_size=INIT_HP["MEMORY_SIZE"], field_names=field_names, device=device)
trained_pop, pop_fitnesses = train_off_policy(
    env=env,
    env_name="CliffWalking-v0",
    algo="DQN",
    memory=memory,
    pop=[agent],
    target=INIT_HP["TARGET_SCORE"],
    max_steps=INIT_HP["MAX_STEPS"],
    learning_delay=INIT_HP["LEARNING_DELAY"],
    tournament=None,
    mutation=None,
    wb=False,  # Boolean flag to record run with Weights & Biases
    checkpoint=INIT_HP["MAX_STEPS"],
    checkpoint_path="DQN.pt"
)


Training...

## Saving the Trained Weights and Evaluating the Agent

In [21]:
checkpoint_path = "DQN.pt"
agent.save_checkpoint(checkpoint_path)

In [None]:
# Uses Gymnasium's RecordVideo wrapper to evaluate the agent and record video.
# Only one video is saved: the episode trigger below selects episode 0.

# env = gym.wrappers.RecordVideo(
#     env, "./gym_monitor_output", episode_trigger=lambda x: x == 0)

# test(agent, env, learned_policy)

# env.close()
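
The cell above refers to a `test` helper that is not defined in this notebook. As a hedged stand-in, the sketch below shows the shape of a single-episode evaluation loop written against the Gymnasium `reset` / `step` interface. To stay self-contained it uses a hypothetical `TinyCliffWalk` class that mimics the documented Cliff Walking rules (reward -1 per step, -100 and a return to the start for stepping off the cliff) instead of the real environment, and a hand-written policy instead of the trained agent.

```python
class TinyCliffWalk:
    """Hypothetical stand-in for CliffWalking-v0 with a Gymnasium-style API."""

    N_ROWS, N_COLS = 4, 12
    MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}  # up, right, down, left

    def reset(self):
        self.row, self.col = 3, 0  # start in the bottom-left corner
        return self._state(), {}

    def _state(self):
        return self.row * self.N_COLS + self.col

    def step(self, action):
        d_row, d_col = self.MOVES[action]
        self.row = min(max(self.row + d_row, 0), self.N_ROWS - 1)
        self.col = min(max(self.col + d_col, 0), self.N_COLS - 1)
        reward = -1.0
        if self.row == 3 and 1 <= self.col <= 10:  # stepped off the cliff
            reward = -100.0
            self.row, self.col = 3, 0
        terminated = (self.row, self.col) == (3, 11)  # reached the goal
        return self._state(), reward, terminated, False, {}


def evaluate(env, policy, max_steps=200):
    """Run one episode with `policy(state) -> action` and return the total reward."""
    state, _ = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        state, reward, terminated, truncated, _ = env.step(policy(state))
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward


def safe_path_policy(state):
    """Hand-written policy: go up one row, walk right, then drop to the goal."""
    row, col = divmod(state, 12)
    if row == 3 and col == 0:
        return 0  # up
    if col < 11:
        return 1  # right
    return 2  # down


print(evaluate(TinyCliffWalk(), safe_path_policy))
```

To evaluate the trained agent instead, the `policy` callable would wrap the agent's own action selection, and `TinyCliffWalk` would be replaced by the real (non-vectorised) environment.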