<a href="https://colab.research.google.com/github/RL-Starterpack/rl-starterpack/blob/main/exercises/DQN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RL Tutorial - **DQN Exercise**

## Setup

In [None]:
#@title Run this cell to clone the RL tutorial repository and install it
try:
  import rl_starterpack
  print('RL-Starterpack repo successfully installed!')
except ImportError:
  print('Cloning RL-Starterpack package...')

  !git clone https://github.com/RL-Starterpack/rl-starterpack.git
  print('Installing RL-Starterpack package...')
  !pip install -e rl-starterpack[full] &> /dev/null
  print('\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
  print('Please restart the runtime to use the newly installed package!')
  print('Runtime > Restart Runtime')
  print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')

In [None]:
#@title Run this cell to install additional dependencies (will take ~30s)
!apt-get remove ffmpeg > /dev/null # Removing due to restrictive license
!apt-get install -y xvfb x11-utils > /dev/null

In [None]:
#@title Run this cell to import the required libraries
try:
    from rl_starterpack import OpenAIGym, DQN, experiment, vis_utils
except ImportError:
    print('Please run the first cell! If you already ran it, make sure to restart the runtime after the package is installed.')
    raise
import pandas as pd
import numpy as np
import torch
import gym
import torchviz
from itertools import chain
%matplotlib inline
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay

# Setup display to show video renderings
if 'display' not in globals():
    display = Display(visible=0, size=(1400, 900))
    display.start()

## Exercise

### FrozenLake: DQN Style!

#### Neural Network
We define our FrozenLake environment as in TQL. Let's use the non-stochastic version to make sure that our new DQN approach using a neural network works.

In [None]:
env = OpenAIGym(level='FrozenLake', max_timesteps=100, is_slippery=False) # Non-stochastic

Now we define the heart of our DQN: the neural network. Don't worry if you're not familiar with PyTorch; the code does the following:
1. Build an input **Embedding Layer**. This functions as a lookup table which maps the integer representation of states (provided by the environment) to a learned vector per state. The lookup table therefore has shape `(num_states, embedding_dim)`.
2. Pass the embedding layer output through a **hyperbolic tangent non-linearity**.
3. Finally, a linear layer maps these to a final vector with one Q-value per action.

Feel free to modify the neural network by adding layers, increasing the hidden size, or changing the non-linearity.
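
To make the three steps above concrete, here is a minimal standalone sketch that applies them one at a time to a single state. The sizes are illustrative assumptions (FrozenLake's default 4x4 map has 16 states and 4 actions), and the layers are built directly rather than through the `DQN` class.

```python
import torch

# Illustrative sizes: 16 states and 4 actions, as in the default 4x4 FrozenLake map.
num_states, num_actions, hidden_size = 16, 4, 16

embedding = torch.nn.Embedding(num_embeddings=num_states, embedding_dim=hidden_size)
head = torch.nn.Linear(in_features=hidden_size, out_features=num_actions)

state = torch.tensor(3)  # integer state index, as provided by the environment

# Step 1: the embedding layer is a lookup; it returns row `state` of its weight matrix.
embedded = embedding(state)
assert torch.equal(embedded, embedding.weight[state])

# Steps 2 and 3: apply the non-linearity, then map to one Q-value per action.
q_values = head(torch.tanh(embedded))

print(embedding.weight.shape)  # torch.Size([16, 16])
print(q_values.shape)          # torch.Size([4])
```

Because the embedding is just a table of learned rows, each state gets its own vector, much like a row in the TQL table, while the layers after it are shared across all states.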

In [None]:
num_states = env.state_space['num_values']
num_actions = env.action_space['num_values']
hidden_size = 16  # The "width" of the neural network

network_fn = (lambda: torch.nn.Sequential(
    torch.nn.Embedding(num_embeddings=num_states, 
                       embedding_dim=hidden_size),
    torch.nn.Tanh(),
    torch.nn.Linear(in_features=hidden_size, out_features=num_actions)
))

Using this neural network constructor, we now create the agent. Fill in the hyperparameters below and see if your agent succeeds in the next section!

In [None]:
# TODO: Fill in these hyperparameters
learning_rate = None  # Speed at which the agent learns. Between (0,1)
discount_rate = None  # How much future rewards are discounted at each step. Between (0,1)
exploration = None  # During training the agent will take a random action and "explore" with this probability. Between (0,1)

agent = DQN(
    state_space=env.state_space, action_space=env.action_space, network_fn=network_fn,
    learning_rate=learning_rate, discount=discount_rate, exploration=exploration
)
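
If you are unsure what the discount rate and exploration probability do, the sketch below illustrates both in plain Python. The helpers `discounted_return` and `epsilon_greedy` are hypothetical names written for this illustration; they are not part of `rl_starterpack`, and the `DQN` class may implement exploration differently internally.

```python
import random

def discounted_return(rewards, discount):
    """Sum of rewards, where the reward at step t is weighted by discount**t."""
    return sum(reward * discount ** t for t, reward in enumerate(rewards))

def epsilon_greedy(q_values, exploration, rng):
    """Pick a random action with probability `exploration`, else the greedy one."""
    if rng.random() < exploration:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda action: q_values[action])

# A reward of 1 that only arrives after five zero-reward steps is worth
# less the lower the discount rate is.
rewards = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
for discount in (0.5, 0.9, 0.99):
    print(discount, discounted_return(rewards, discount))

# With exploration=0.2 the greedy action (index 1 here) is chosen about 85% of
# the time: 80% greedy picks plus the random picks that happen to land on it.
rng = random.Random(0)
q_values = [0.1, 0.7, 0.2, 0.0]  # made-up Q-values for four actions
actions = [epsilon_greedy(q_values, exploration=0.2, rng=rng) for _ in range(10_000)]
greedy_fraction = actions.count(1) / len(actions)
print(greedy_fraction)
```

A low discount rate makes the agent short-sighted, which matters in FrozenLake because the only reward arrives at the very end of a successful episode.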

In [None]:
#@title _<sub><sup>SOLUTION: Expand this cell to see working parameters for the learning rate, discount_rate and exploration in the non-stochastic environment </sup></sub>_
learning_rate = 1e-3  # Speed at which the agent learns. Between (0,1)
discount_rate = 0.95  # How much future rewards are discounted at each step. Between (0,1)
exploration = 0.25  # During training the agent will take a random action and "explore" with this probability. Between (0,1)


agent = DQN(
    state_space=env.state_space, action_space=env.action_space, network_fn=network_fn,
    learning_rate=learning_rate, discount=discount_rate, exploration=exploration
)

We can call our DQN to get Q-values for a given state. Recall that because the network outputs the Q-values of all actions for a given state in a single forward pass, we don't need to call it once per state-action pair. Compare this to the TQL tabular approach. Note that the network isn't trained yet, so the output values aren't yet meaningful.

In [None]:
random_state = np.random.randint(0, num_states)
agent.network(torch.tensor(random_state))

We can inspect what the neural network looks like:

In [None]:
named_params = dict(agent.network.named_parameters())
torchviz.make_dot(agent.network(torch.tensor(random_state)), params=named_params)

#### Train and evaluate on non-stochastic environment

In [None]:
train_returns = experiment.train(agent, env, num_episodes=500)
eval_returns = experiment.evaluate(agent, env, num_episodes=500)
print('Mean eval return:', sum(eval_returns) / len(eval_returns))

In [None]:
vis_utils.draw_returns_chart(train_returns)

We can also inspect how our agent solved the environment!

In [None]:
experiment.evaluate_render(agent, env, ipythondisplay, sleep=0.5)

#### Train and evaluate on stochastic "slippery" environment

Now that we know that our implementation works on a non-stochastic environment, let us try it on the true "slippery" environment and see if it can solve it. Similarly to TQL, we may need to provide some additional reward shaping to help it solve the environment.

In [None]:
env = OpenAIGym(level='FrozenLake', max_timesteps=100, is_slippery=True) # stochastic

In [None]:
def reward_shaping_fn(reward, terminal, next_state):
    """
    Shapes the reward before passing it on to the agent.
    Args:
        reward (float): Reward returned by the environment for the action which was just performed.
        terminal (int): Boolean int representing whether the current episode has ended (if episode has ended =1, otherwise =0).
        next_state (object): Next state. In the case of FrozenLake this is a np.ndarray of a scalar. i.e. np.array(0)
    Returns:
        reward (float): The modified reward.
        terminal (int): The `terminal` input needs to be passed through.
    """
    # TODO: Fill in if your agent is having a hard time solving the environment!
    return reward, terminal
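
Before writing a shaping rule, it can help to see which situations the `reward` and `terminal` inputs let you tell apart. The sketch below is only an aid for reasoning: `classify_transition` is a hypothetical helper, the example pairs are hand-written rather than taken from a real environment, and it assumes FrozenLake's usual reward scheme (1 for reaching the goal, 0 otherwise).

```python
def classify_transition(reward, terminal):
    """Label a FrozenLake transition using only the values the shaping function receives."""
    if terminal == 1 and reward > 0:
        return 'reached goal'
    if terminal == 1:
        # Fell into a hole (or, depending on how the wrapper reports it,
        # ran out of timesteps).
        return 'ended without reward'
    return 'still running'

# Hand-written (reward, terminal) pairs, not output from a real environment.
examples = [(0.0, 0), (0.0, 1), (1.0, 1)]
labels = [classify_transition(reward, terminal) for reward, terminal in examples]
print(labels)  # ['still running', 'ended without reward', 'reached goal']
```

A shaping function can return a different reward for any of these cases, as long as it passes `terminal` through unchanged.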

In [None]:
# TODO: Fill in these hyperparameters
learning_rate = None  # Speed at which the agent learns. Between (0,1)
discount_rate = None  # How much future rewards are discounted at each step. Between (0,1)
exploration = None  # During training the agent will take a random action and "explore" with this probability. Between (0,1)

agent = DQN(
    state_space=env.state_space, action_space=env.action_space,
    network_fn=network_fn,
    learning_rate=learning_rate, 
    discount=discount_rate, 
    exploration=exploration
)

experiment.train(agent, env, num_episodes=1000, 
                 reward_shaping_fn=reward_shaping_fn)
returns = experiment.evaluate(agent, env, num_episodes=1000)
print('Mean return:', returns.mean())

We can also inspect how our agent solved the environment. Note that since the environment is slippery, the agent may not solve it every time.

In [None]:
experiment.evaluate_render(agent, env, ipythondisplay, sleep=0.5)

In [None]:
#@title _<sub><sup>SOLUTION: Expand this cell to see a working DQN implementation </sup></sub>_

# Environment
env = OpenAIGym(level='FrozenLake', max_timesteps=100, is_slippery=True) # stochastic

# Hyperparameters
learning_rate = 1e-3
discount_rate = 0.95
exploration = 0.1

# Define our agent
num_states = env.state_space['num_values']
num_actions = env.action_space['num_values']
hidden_size = 16  # The "width" of the neural network

network_fn = (lambda: torch.nn.Sequential(
    torch.nn.Embedding(num_embeddings=num_states, 
                       embedding_dim=hidden_size),
    torch.nn.Tanh(),
    torch.nn.Linear(in_features=hidden_size, out_features=num_actions)
))

def agent_fn():
    return DQN(
        state_space=env.state_space, action_space=env.action_space, network_fn=network_fn,
        learning_rate=learning_rate, discount=discount_rate, exploration=exploration
    )

# Provide some helpful reward shaping
def reward_shaping_fn(reward, terminal, next_state):
    del next_state # unused
    if terminal == 1 and reward == 0.0:
        # Penalize the agent for failing to reach the goal
        return -1.0, terminal
    else:
        return reward, terminal

# Create a fresh agent and train it. We should be able to achieve a mean return of >0.7
agent = agent_fn()
experiment.train(agent, env, num_episodes=1000, 
                 reward_shaping_fn=reward_shaping_fn)
returns = experiment.evaluate(agent, env, num_episodes=1000)
print(f'Run Returns: {returns.mean():.3f} ± {returns.std():.3f}')

### CartPole

Next we'll be looking at an environment with a continuous state space: [CartPole](https://gym.openai.com/envs/CartPole-v1/)

> A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center

Recall that one of the advantages DQN provides over TQL is the ability to handle such continuous state space environments.

In [None]:
env = OpenAIGym('CartPole', max_timesteps=300)

In [None]:
print('State Information (angles are in rad)\n')
pd.DataFrame.from_dict({'Observation': ['Cart Position', 'Cart Velocity', 'Pole Angle', 'Pole Angular Velocity'],
                        'min_value': env.state_space['min_value'],
                        'max_value': env.state_space['max_value'], 
                        })

#### Neural Network
We now define a new Q-network to handle this environment with a continuous state space.
Note that we no longer need to embed the input state, since it is now a vector of 4 continuous observations (see table above).
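
As a rough standalone illustration of why no embedding is needed, the sketch below feeds a small batch of continuous observations straight into a linear layer. The observation values are made up, and the sizes (4 observations, 2 actions) are assumed from the CartPole description above rather than read from `env`.

```python
import torch

state_obs_dim, num_actions, hidden_size = 4, 2, 16  # CartPole: 4 observations, 2 actions

network = torch.nn.Sequential(
    torch.nn.Linear(in_features=state_obs_dim, out_features=hidden_size),
    torch.nn.Tanh(),
    torch.nn.Linear(in_features=hidden_size, out_features=num_actions),
)

# Made-up observations: [cart position, cart velocity, pole angle, pole angular velocity]
states = torch.tensor([[0.0, 0.1, 0.02, -0.3],
                       [1.2, -0.5, -0.10, 0.8]])

with torch.no_grad():
    q_values = network(states)           # shape: (batch, num_actions)
greedy_actions = q_values.argmax(dim=1)  # one action index per state

print(q_values.shape)  # torch.Size([2, 2])
```

Nearby observations produce nearby Q-values, so the network can generalise to states it has never seen, which a lookup table cannot do.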

In [None]:
state_obs_dim = env.state_space['shape'][0]  # Size of vector representing the state observations
num_actions = env.action_space['num_values']
hidden_size = 16 

network_fn = (lambda: torch.nn.Sequential(
    torch.nn.Linear(in_features=state_obs_dim, out_features=hidden_size),
    torch.nn.Tanh(),
    torch.nn.Linear(in_features=hidden_size, out_features=num_actions)
))
agent = DQN(
    env.state_space, env.action_space, network_fn=network_fn, 
    discount=0.9, exploration=0.05, learning_rate=1e-3
)

#### Train and Evaluate

In [None]:
train_returns = experiment.train(agent, env, num_episodes=1500)
eval_returns = experiment.evaluate(agent, env, num_episodes=500)

In [None]:
vis_utils.draw_returns_chart(train_returns)

Let's visualise our performance! If your agent managed to balance the pole for the full 300 timesteps, try increasing `max_timesteps` below to see if it balances indefinitely. Also note that since the cartpole has a random starting state, the agent might not solve it every time, so try running the visualisation a few times to see different episodes.

_Note: to show videos in Colab, we need to render them as gifs._

In [None]:
# To show a longer balancing run, run this before creating the gif: env = OpenAIGym('CartPole', max_timesteps=500)
vis_utils.show_episode_as_gif(ipythondisplay, agent, env)

### DQN Extensions
A few extensions to vanilla DQN improve the stability of the learning process. If you've found that the learning of the agents above isn't stable, then these may help. We've made a couple of these available in our implementation for you to play with. Feel free to combine them.

#### Experience Replay

Recall from the slides that the idea here is to save "experiences" in a memory (also known as a replay buffer) and then randomly sample batches of experiences from this memory to update the network. This has the following advantages:


*   Batch updates are less noisy, more stable and much faster!
*   Items within a batch are less correlated since they are sampled across multiple episodes
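
The idea can be sketched in a few lines of plain Python. This is an illustrative buffer, not the implementation used by the `DQN` class: `ReplayMemory` is a hypothetical name, the stored experiences are dummies, and the update schedule is only one plausible reading of the batch size and update frequency settings.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer; the oldest experience is dropped once it is full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks up the order of collection.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=100)
batch_size, update_frequency = 16, 4
num_updates = 0
rng = random.Random(0)

for timestep in range(1, 251):
    # Dummy experience: a real agent would store what the environment returned.
    memory.add(state=timestep, action=0, reward=1.0,
               next_state=timestep + 1, terminal=0)
    # Update every `update_frequency` timesteps once a full batch is available.
    if len(memory) >= batch_size and timestep % update_frequency == 0:
        batch = memory.sample(batch_size, rng)  # would feed one gradient step
        num_updates += 1

print(len(memory), num_updates)  # 100 59
```

Note how the buffer never grows beyond its capacity: after 250 timesteps only the 100 most recent experiences remain available for sampling.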



In [None]:
env = OpenAIGym('CartPole', max_timesteps=300)

In [None]:
agent = DQN(
    env.state_space, env.action_space, network_fn=network_fn, 
    discount=0.9, exploration=0.05, learning_rate=1e-3,
    # Replay memory params
    memory=100, # Size of the replay memory. Must be >= batch size.
    batch_size=16,
    update_frequency=4, # The frequency at which the network is updated (i.e. how often a batch is sampled to update the network)
    update_start=None  # Number of timesteps to collect before first update. Must be >= batch size. If None then = batch size.
)

In [None]:
train_returns = experiment.train(agent, env, num_episodes=1500)
eval_returns = experiment.evaluate(agent, env, num_episodes=500)
vis_utils.draw_returns_chart(train_returns)

In [None]:
vis_utils.show_episode_as_gif(ipythondisplay, agent, env)

#### Target Network
Recall that the idea here is to use a separate Q-network, the target network, to estimate the TD-target. The target network is only periodically synced with the main network. The advantage of this is that it reduces correlation between the Q-value and the TD-target. You can think of this as temporarily "fixing" our goal (the TD-target), so that we don't have a moving target to chase.
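
A minimal PyTorch sketch of this mechanism is shown below. It is not the `DQN` class's implementation: the batch is random data, the mean-squared-error loss and Adam optimizer are assumptions, and counting the sync interval in network updates (rather than environment timesteps) is also an assumption.

```python
import copy
import torch

torch.manual_seed(0)
discount = 0.9
q_network = torch.nn.Sequential(
    torch.nn.Linear(4, 16), torch.nn.Tanh(), torch.nn.Linear(16, 2)
)
target_network = copy.deepcopy(q_network)  # starts as an exact copy
optimizer = torch.optim.Adam(q_network.parameters(), lr=1e-3)

# A made-up batch of transitions (random numbers, not real CartPole data).
states, next_states = torch.randn(8, 4), torch.randn(8, 4)
actions = torch.randint(0, 2, (8,))
rewards = torch.ones(8)
terminals = torch.zeros(8)

target_network_update_frequency = 15  # assumed here to count network updates
for update in range(1, 31):
    with torch.no_grad():  # the TD-target is treated as a constant
        next_q = target_network(next_states).max(dim=1).values
        td_target = rewards + discount * (1.0 - terminals) * next_q
    # Q-value of the action that was actually taken in each transition.
    q_taken = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = torch.nn.functional.mse_loss(q_taken, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if update % target_network_update_frequency == 0:
        # Sync: copy the current Q-network weights into the target network.
        target_network.load_state_dict(q_network.state_dict())
```

Between syncs only `q_network` changes, so the TD-target for a given transition stays fixed for up to 15 updates instead of shifting after every gradient step.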

In [None]:
env = OpenAIGym('CartPole', max_timesteps=300)

In [None]:
# The q-network and target network are identical and are both created using the network_fn input
agent = DQN(
    env.state_space, env.action_space, network_fn=network_fn, 
    discount=0.9, exploration=0.05, learning_rate=1e-3,
    # Target network params
    target_network_update_frequency=15, # The frequency at which the target network is updated
)

In [None]:
train_returns = experiment.train(agent, env, num_episodes=1500)
eval_returns = experiment.evaluate(agent, env, num_episodes=500)
vis_utils.draw_returns_chart(train_returns)

In [None]:
vis_utils.show_episode_as_gif(ipythondisplay, agent, env)

In [None]:
#@title _<sub><sup>SOLUTION: Everything together! </sup></sub>_
env = OpenAIGym('CartPole', max_timesteps=300)

state_obs_dim = env.state_space['shape'][0]  # Size of vector representing the state observations
num_actions = env.action_space['num_values']
hidden_size = 16 

network_fn = (lambda: torch.nn.Sequential(
    torch.nn.Linear(in_features=state_obs_dim, out_features=hidden_size),
    torch.nn.Tanh(),
    torch.nn.Linear(in_features=hidden_size, out_features=num_actions)
))

agent = DQN(state_space=env.state_space, 
            action_space=env.action_space, network_fn=network_fn,
            discount=0.9, exploration=0.1, learning_rate=1e-3,
            target_network_update_frequency=10, 
            memory=500, 
            batch_size=16,
            update_start=100,
            update_frequency=4)

train_returns = experiment.train(agent, env, num_episodes=2000)
eval_returns = experiment.evaluate(agent, env, num_episodes=1000)
vis_utils.draw_returns_chart(train_returns)
vis_utils.show_episode_as_gif(ipythondisplay, agent, env)

## Leaderboard

Once you have completed the exercises above, consider submitting your scores to our DQN leaderboard using [this form](https://forms.gle/oM3yJJmz7nQfwavJ9).

Note: to compute the "mean evaluation return" you can use `eval_returns.mean()`.