<a href="https://colab.research.google.com/github/Bryan-Az/RL-DQN-Gym/blob/main/DQN_projects/RL_DQN_Gym_CliffWalking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training a Deep Q Learning Neural Network Agent in Gymnasium's Blackjack Environment

## Imports and Installs

In [1]:
!pip install gym gym[atari] gym[accept-rom-license] agilerl accelerate>=0.21.0

[0m

In [166]:
import os
import gymnasium as gym
import numpy as np

### These imports will be used to implement the NN Agent ##
import torch
from agilerl.algorithms.dqn import DQN
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.training.train_off_policy import train_off_policy
from agilerl.utils.utils import create_population, make_vect_envs
# import trange
from tqdm.notebook import trange

from tqdm import tqdm
from __future__ import annotations
from collections import defaultdict

## Setting up the Reinforcement Learning Environment

In [167]:
#env = make_vect_envs("CliffWalking-v0", num_envs=1) # uncomment if want to run across envs
env = gym.vector.make("CliffWalking-v0")

  gym.logger.warn(


In [168]:
try:
    state_dim = env.single_observation_space.n  # Discrete observation space
    one_hot = True  # Requires one-hot encoding
    is_discrete_obs = True
except Exception:
    state_dim = env.single_observation_space.shape  # Continuous observation space
    one_hot = False  # Does not require one-hot encoding
    is_discrete_obs = False
try:
    action_dim = env.single_action_space.n  # Discrete action space
    is_discrete_actions = True
except Exception:
    action_dim = env.single_action_space.shape[0]  # Continuous action space
    is_discrete_actions = False

In [169]:
print(f"Action dimension: {action_dim}")
print(f"Observation dimension: {state_dim}")
print(f"Is discrete action space: {is_discrete_actions}")
print(f"Is discrete observation space: {is_discrete_obs}")
print(f"Is one-hot:  {one_hot}")

Action dimension: 4
Observation dimension: 48
Is discrete action space: True
Is discrete observation space: True
Is one-hot:  True


In [170]:
device = "cuda" if torch.cuda.is_available() else "cpu"

The blackjack environment does not implement a .shape method on its' observation_space. Since the DQN agent expects a python tuple instead of a gym.tuple, we need to create our own state_dim. Although the output of the above shows it is not discete - the blackjack environment does actually have a discrete observation space.

## Setting up the Deep Q-Learning Agent

In this block we set the default hyperparameters

In [171]:
# Initial hyperparameters
INIT_HP = {
    "BATCH_SIZE": 64,  # Batch size
    "LR": 0.0001,  # Learning rate
    "GAMMA": 0.99,  # Discount factor
    "MEMORY_SIZE": 5_000,  # Max memory buffer size
    "LEARN_STEP": 5,  # Learning frequency
    "N_STEP": 3,  # Step number to calculate td error
    "PER": True,  # Use prioritized experience replay buffer
    "ALPHA": 0.6,  # Prioritized replay buffer parameter
    "BETA": 0.4,  # Importance sampling coefficient
    "TAU": 0.001,  # For soft update of target parameters
    "PRIOR_EPS": 0.000001,  # Minimum priority for sampling
    "NUM_ATOMS": 51,  # Unit number of support
    "V_MIN": -200.0,  # Minimum value of support
    "V_MAX": 200.0,  # Maximum value of support
    "NOISY": True,  # Add noise directly to the weights of the network
    # Swap image channels dimension from last to first [H, W, C] -> [C, H, W]
    "LEARNING_DELAY": 200,  # Steps before starting learning
    "CHANNELS_LAST": False,  # Use with RGB states
    "TARGET_SCORE": 200.0,  # Target score that will beat the environment
    "MAX_STEPS": 10000,  # Maximum number of steps an agent takes in an environment
    "EVO_STEPS": 10000,  # Evolution frequency
    "EVAL_STEPS": 1000,  # Number of evaluation steps per episode
    "EVAL_LOOP": 1,  # Number of evaluation episodes
}

In this block we set the neural network config. Since we are using the more simple discrete observation and action environment, a typical multi-layer perceptron network is sufficient.

In [172]:
NET_CONFIG = {
      'arch': 'mlp',      # Network architecture
      'hidden_size': [32, 32],  # Network hidden size
}

Finally, the neural network config is passed to the DQN agent to alter the default network.

In [173]:
agent = DQN(
    net_config=NET_CONFIG,
    batch_size=int(state_dim),
    state_dim=[state_dim],
    action_dim=action_dim,
    one_hot=one_hot,
    lr=INIT_HP["LR"],
    learn_step=INIT_HP["LEARN_STEP"],
    gamma=INIT_HP["GAMMA"],
    tau=INIT_HP["TAU"],
    device=device)

## Training the DQN Agent in the Blackjack Environment

### What is the difference in training with a DQN agent vs a Q-Learning agent?

#### Memory and Replay buffer
A difference in training an agent with a memory or replay buffer is that the simpler agent.update(state, action, reward, next_state, done) function is decomposed into multiple functions:

1. memory.save_to_memory_vect_envs(state, action, reward, next_state, done)
2. experience = memory.sample(agent.batch_size)
3. agent.learn(experience)

This allows for a higher dimensional input used for training (for example with multiple channels or multiple observations).

#### Training steps

The training steps when using a memory or replay buffer is dependent on the 'batch_size' of the memory. This determines how many 'experiences' / memory samples (or steps in the environment) the memory should be filled with prior to training. Once the memory is filled (this could be taken as the exploration phase as no learning is taking place), the agent continues taking steps while also learning (the training phase).

In [174]:
# the 'Experience Replay Buffer' / agent memory is added to provide learning stability
field_names = ["state", "action", "reward", "next_state", "done"]

In [175]:
memory = ReplayBuffer(memory_size=5000, field_names=field_names, device=device)

In [176]:
 # Exploration params
eps_start = 1.0  # Max exploration
eps_end = 0.1  # Min exploration
eps_decay = 0.995  # Decay per episode
epsilon = eps_start

In [177]:
# TRAINING LOOP
total_steps=0
pop=[agent]
print("Training...")
pbar = trange(INIT_HP["MAX_STEPS"], unit="step")
while np.less([agent.steps[-1] for agent in pop], INIT_HP["MAX_STEPS"]).all():
    pop_episode_scores = []
    for agent in pop:  # Loop through population
        state, info = env.reset()  # Reset environment at start of episode
        scores = np.zeros(1)
        completed_episode_scores = []
        steps = 0
        epsilon = eps_start

        for idx_step in range(INIT_HP['MAX_STEPS'] // 1):
            if INIT_HP["CHANNELS_LAST"]:
                state = np.moveaxis(state, [-1], [-3])

            action = agent.get_action(state, epsilon)  # Get next action from agent
            epsilon = max(
                eps_end, epsilon * eps_decay
            )  # Decay epsilon for exploration

            # Act in environment
            next_state, reward, terminated, truncated, info = env.step(action)
            scores += np.array(reward)
            steps += 1
            total_steps += 1

            # Collect scores for completed episodes
            for idx, (d, t) in enumerate(zip(terminated, truncated)):
                if d or t:
                    completed_episode_scores.append(scores[idx])
                    agent.scores.append(scores[idx])
                    scores[idx] = 0

            # Save experience to replay buffer
            if INIT_HP["CHANNELS_LAST"]:
                memory.save_to_memory(
                    state,
                    action,
                    reward,
                    np.moveaxis(next_state, [-1], [-3]),
                    terminated,
                    is_vectorised=True,
                )
            else:
                memory.save_to_memory(
                    state,
                    action,
                    reward,
                    next_state,
                    terminated,
                    is_vectorised=True,
                )

            # Learn according to learning frequency
            if memory.counter > INIT_HP['LEARNING_DELAY'] and len(memory) >= agent.batch_size:
                for _ in range(1 // agent.learn_step):
                    experiences = memory.sample(
                        agent.batch_size
                    )  # Sample replay buffer
                    agent.learn(
                        experiences
                    )  # Learn according to agent's RL algorithm

            state = next_state

        pbar.update(INIT_HP['EVO_STEPS'] // len(pop))
        agent.steps[-1] += steps
        pop_episode_scores.append(completed_episode_scores)

    # Reset epsilon start to latest decayed value for next round of population training
    eps_start = epsilon

    # Evaluate population
    fitnesses = [
        agent.test(
            env,
            swap_channels=INIT_HP["CHANNELS_LAST"],
            max_steps=INIT_HP['EVAL_STEPS'],
            loop=INIT_HP['EVAL_LOOP'],
        )
        for agent in pop
    ]
    mean_scores = [
        (
            np.mean(episode_scores)
            if len(episode_scores) > 0
            else "0 completed episodes"
        )
        for episode_scores in pop_episode_scores
    ]

    print(f"--- Global steps {total_steps} ---")
    print(f"Steps {[agent.steps[-1] for agent in pop]}")
    print(f"Scores: {mean_scores}")
    print(f'Fitnesses: {["%.2f"%fitness for fitness in fitnesses]}')
    print(
        f'5 fitness avgs: {["%.2f"%np.mean(agent.fitness[-5:]) for agent in pop]}'
    )

    # Update step counter
    for agent in pop:
        agent.steps.append(agent.steps[-1])

pbar.close()
env.close()

Training...


  0%|          | 0/10000 [00:00<?, ?step/s]

--- Global steps 10000 ---
Steps [10000]
Scores: [-499.0]
Fitnesses: ['-1000.00']
5 fitness avgs: ['-1000.00']


In [178]:
agent.save_checkpoint('CliffWalking.pt')

## Saving the Trained Weights and Evaluating the Agent

In [179]:
test_env = gym.make("CliffWalking-v0", render_mode='rgb_array')

These wrappers are necessary if using the gym.make method instead of the gym.vector.make method to create the test environment. The gym.make method is required if using gym.wrapper.RecordVideo for testing as it sets the render_mode to be able to use rgb_array. The gym.make method forces the environments to use tuples instead of arrays for the data types for observations or actions so is incompatible with the DQN class of AgileRL which expects arrays and not tuples.

In [180]:
class ArrayObservationEnv(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
    # code to overwrite the step and reset functions to modify the state

    #super() of env.step
    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        #return tuple([obs]), reward, terminated, truncated, info
        return np.array(obs), reward, terminated, truncated, info

    #super() of env.reset
    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        #return tuple([obs]), info
        return np.array(obs), info


#wrapper for agileRL DQN

class TupletoArrayDQN():
    def __init__(self, trained_agent):  # Pass keyword arguments for DQN initialization
        self.dqn_instance = trained_agent
        #super().__init__() # Added this to solve NameError, not sure if necessary but it works now

    #super() of DQN.test
    def test(self, env, swap_channels=False, max_steps=None, loop=1):
        # uses array env wrapper
        env = ArrayObservationEnv(env)
        """Returns mean test score of agent in environment with epsilon-greedy policy.

        :param env: The environment to be tested in
        :type env: Gym-style environment
        :param swap_channels: Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False
        :type swap_channels: bool, optional
        :param max_steps: Maximum number of testing steps, defaults to None
        :type max_steps: int, optional
        :param loop: Number of testing loops/episodes to complete. The returned score is the mean over these tests. Defaults to 1
        :type loop: int, optional
        """
        with torch.no_grad():
            rewards = []
            num_envs = env.num_envs if hasattr(env, "num_envs") else 1
            for i in range(loop):
                state, info = env.reset()
                scores = np.zeros(num_envs)
                completed_episode_scores = np.zeros(num_envs)
                finished = np.zeros(num_envs)
                step = 0
                while not np.all(finished):
                    if swap_channels:
                        state = np.moveaxis(state, [-1], [-3])
                    action_mask = info.get("action_mask", None)
                    action = self.dqn_instance.get_action(state, epsilon=0, action_mask=action_mask)
                    state, reward, done, trunc, info = env.step(action[0])
                    step += 1
                    scores += np.array(reward)
                    for idx, (d, t) in enumerate(zip([done], [trunc])):
                        if (
                            d or t or (max_steps is not None and step == max_steps)
                        ) and not finished[idx]:
                            completed_episode_scores[idx] = scores[idx]
                            finished[idx] = 1
                rewards.append(np.mean(completed_episode_scores))
        mean_fit = np.mean(rewards)
        self.dqn_instance.fitness.append(mean_fit)
        return mean_fit



In [181]:
wrapped_dqn = TupletoArrayDQN(trained_agent=agent)

In [183]:
# Uses the Gym Monitor wrapper to evalaute the agent and record video
# only one video will be saved
# video of the final episode with the episode trigger
test_env = gym.wrappers.RecordVideo(
    test_env, "./gym_monitor_output", episode_trigger=lambda x: x == 0)

wrapped_dqn.test(test_env, swap_channels=INIT_HP["CHANNELS_LAST"], max_steps=INIT_HP['EVAL_STEPS'])

test_env.close()

Moviepy - Building video /content/gym_monitor_output/rl-video-episode-0.mp4.
Moviepy - Writing video /content/gym_monitor_output/rl-video-episode-0.mp4





Moviepy - Done !
Moviepy - video ready /content/gym_monitor_output/rl-video-episode-0.mp4
