# Deep Q Networks
Previously, we trained policies using policy gradient algorithms, which directly estimate the gradient of the expected return with respect to the policy parameters and perform stochastic gradient ascent. In this section, we implement deep Q-learning, which does not explicitly optimize a policy but instead derives the policy from a learned Q-function that is trained via dynamic programming.

We assume discrete action spaces in this notebook so that we can easily select the action that maximizes the Q-function at a given state.

In [1]:
import os
from google.colab import drive
drive.mount('/content/drive')
# Raw string: same value as before, but avoids Python's invalid-escape warning for '\ '
DRIVE_PATH = r'/content/drive/My\ Drive/282'

Mounted at /content/drive


In [3]:
# As usual, a bit of setup
import os
import shutil
import time
import torch
import numpy as np

import deeprl.infrastructure.pytorch_util as ptu

from deeprl.infrastructure.rl_trainer import RL_Trainer
from deeprl.infrastructure.trainers import PG_Trainer
from deeprl.infrastructure.trainers import DQN_Trainer

%load_ext autoreload
%autoreload 2

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

def remove_folder(path):
    # check if folder exists
    if os.path.exists(path): 
        print("Clearing old results at {}".format(path))
        # remove if exists
        shutil.rmtree(path)
    else:
        print("Folder {} does not exist yet. No old results to delete".format(path))

In [4]:
dqn_base_args_dict = dict(
    env_name = 'LunarLander-v3', #@param 
    exp_name = 'test_dqn', #@param
    save_params = False, #@param {type: "boolean"}
    
    ## PDF will tell you how to set ep_len
    ## and discount for each environment
    ep_len = 200, #@param {type: "integer"}
    # discount = 0.95, #@param {type: "number"}

    # Training
    num_agent_train_steps_per_iter = 1, #@param {type: "integer"}
    num_critic_updates_per_agent_update = 1, #@param {type: "integer"}
  
    #@markdown Q-learning parameters
    double_q = False, #@param {type: "boolean"}

    # batches & buffers
    batch_size = 32, #@param {type: "integer"}
    batch_size_initial=1000,

    #@markdown logging
    video_log_freq = -1, #@param {type: "integer"}
    scalar_log_freq = 1000, #@param {type: "integer"}

    #@markdown gpu & run-time settings
    no_gpu = False, #@param {type: "boolean"}
    which_gpu = 0, #@param {type: "integer"}
    seed = 2, #@param {type: "integer"}
    logdir = 'test',
)

## DQN updates
Recall that in Q-learning we attempt to solve for the optimal state-action values (which we refer to as Q-values $Q(s,a)$) by finding solutions to the Bellman optimality equation given by
$$Q(s,a) = r(s,a) + \gamma \mathbb{E}_{s' \sim p(s'\vert s,a)}[\max_{a'}Q(s', a')].$$

Regular tabular Q-learning would take sample transitions $(s, a, r, s')$ and perform updates according to
$$Q(s,a) \leftarrow Q(s,a) + \alpha (r(s,a) + \gamma \max_{a'} Q(s', a') - Q(s,a)),$$
where $\alpha$ is a stepsize parameter.
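
As a minimal sketch of the tabular update above, assuming the Q-values are stored in a small NumPy array indexed by integer states and actions. The names `q_table` and `q_learning_update`, the toy numbers, and the handling of terminal transitions are illustrative assumptions and not part of the assignment code.

```python
import numpy as np

def q_learning_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.99, terminal=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrapped target: no future value is added once the episode has ended (assumption).
    bootstrap = 0.0 if terminal else gamma * np.max(q_table[s_next])
    td_error = r + bootstrap - q_table[s, a]
    q_table[s, a] += alpha * td_error
    return td_error

# Hypothetical toy problem with 3 states and 2 actions.
q_table = np.zeros((3, 2))
q_table[1] = [0.5, 2.0]
td = q_learning_update(q_table, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(td, q_table[0, 1])
```

Here the target is $1 + 0.9 \cdot 2 = 2.8$, so with $\alpha = 0.5$ the entry $Q(0, 1)$ moves halfway from $0$ toward $2.8$.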

This can be interpreted as updating $Q(s,a)$ by taking one gradient step on a squared Bellman error objective
$$(r(s, a) +\gamma \max_{a'} \tilde Q(s', a') - Q(s,a))^2,$$
where $\tilde Q$ is a copy of $Q$, but is not differentiated when taking the gradient step.
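
To spell out this interpretation, write the target as $y = r(s,a) + \gamma \max_{a'} \tilde Q(s', a')$ and treat it as a constant. A factor of $\tfrac{1}{2}$ is added to the objective here purely for convenience (this is our choice, not part of the objective above):
$$\frac{\partial}{\partial Q(s,a)} \, \tfrac{1}{2}\big(y - Q(s,a)\big)^2 = -\big(y - Q(s,a)\big).$$
One gradient descent step with stepsize $\alpha$ therefore gives
$$Q(s,a) \leftarrow Q(s,a) + \alpha \big(y - Q(s,a)\big) = Q(s,a) + \alpha \big(r(s,a) + \gamma \max_{a'} \tilde Q(s', a') - Q(s,a)\big),$$
which matches the tabular update. Without the factor of $\tfrac{1}{2}$, the gradient is twice as large, so the same update is recovered with stepsize $\alpha / 2$.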

Adapting this update to the setting where a neural network with parameters $\theta$ approximates $Q(s,a)$, we train $\theta$ by solving
$$\min_{\theta} \mathbb{E}_{s, a, s' \sim D} [L(Q_{\theta}(s,a), r(s,a) + \gamma \max_{a'} Q_{\tilde \theta}(s', a'))],$$
where $D$ is our replay buffer containing past transitions we have experienced, $L$ is a loss function measuring how far the predicted Q-values are from the target values, and $\tilde \theta$ are the parameters of the target Q-function, usually a delayed copy of $\theta$ kept for stability.
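
The objective can be sketched on a small batch as follows. This is a framework-agnostic NumPy illustration and not the code expected in `critics/dqn_critic.py`. The function names, the choice of mean squared error for $L$, and the masking of the bootstrap term at terminal transitions are all assumptions made for the example.

```python
import numpy as np

def dqn_targets(rewards, next_q_target, terminals, gamma=0.99):
    """Compute r + gamma * max_a' Q_target(s', a'), dropping the bootstrap at terminals."""
    max_next_q = next_q_target.max(axis=1)
    return rewards + gamma * (1.0 - terminals) * max_next_q

def td_loss(q_values, actions, targets):
    """Mean squared error between Q(s, a) for the taken actions and fixed targets."""
    q_taken = q_values[np.arange(len(actions)), actions]
    return np.mean((q_taken - targets) ** 2)

# Hypothetical batch of 2 transitions with 3 actions each.
q_values = np.array([[1.0, 2.0, 0.0], [0.5, 0.0, 1.5]])       # Q_theta(s, .)
next_q_target = np.array([[0.0, 3.0, 1.0], [2.0, 1.0, 0.0]])  # Q_theta_tilde(s', .)
actions = np.array([1, 2])
rewards = np.array([1.0, -1.0])
terminals = np.array([0.0, 1.0])

targets = dqn_targets(rewards, next_q_target, terminals, gamma=0.5)
loss = td_loss(q_values, actions, targets)
print(targets, loss)
```

In an automatic differentiation framework such as PyTorch, the targets would be computed without gradient tracking (for example under `torch.no_grad()`), so that only $Q_{\theta}(s,a)$ is differentiated.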

Note that our previous policy gradient algorithms were _on-policy_ algorithms: they updated the policy using only data collected by the most recent policy, and discarded that data after using it just once. In contrast, DQN uses _off-policy_ updates, sampling data from all past interactions and thereby reusing data over time.
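
To illustrate the data reuse, here is a minimal sketch of a replay buffer, assuming a fixed capacity with oldest-first eviction and uniform sampling. The class `ReplayBuffer` and its methods are hypothetical and do not reflect the buffer provided in the `deeprl` infrastructure.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s', done) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen silently evicts the oldest transition when full.
        self.storage = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done):
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement over everything currently stored.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(obs=t, action=0, reward=1.0, next_obs=t + 1, done=False)

# Only the 3 most recent transitions remain, and any of them can be sampled repeatedly.
batch = buffer.sample(batch_size=2, rng=random.Random(0))
print(len(buffer), batch)
```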

Fill out the missing components for the basic Q-learning update in <code>critics/dqn_critic.py</code> (not including the double_q section).

In [43]:
#### Test DQN updates
dqn_args = dict(dqn_base_args_dict)

env_str = 'LunarLander'
dqn_args['env_name'] = '{}-v3'.format(env_str)
dqn_args['double_q'] = False
dqntrainer = DQN_Trainer(dqn_args)
dqnagent = dqntrainer.rl_trainer.agent
critic = dqnagent.critic

ob_dim = critic.ob_dim
ac_dim = 6
N = 5

np.random.seed(0)
obs = np.random.normal(size=(N, ob_dim))
acts = np.random.choice(ac_dim, size=(N,))
next_obs = np.random.normal(size=(N, ob_dim))
rewards = np.random.normal(size=N)
terminals = np.zeros(N)
terminals[0] = 1

first_weight_before = np.array(ptu.to_numpy(next(critic.q_net.parameters())))
print("Weight before update (first row)", first_weight_before[0])


loss = critic.update(obs, acts, next_obs, rewards, terminals)['Training Loss']
expected_loss = 0.9408444
loss_error = rel_error(loss, expected_loss)
print("Initial loss", loss)
print("Initial Loss Error", loss_error, "should be on the order of 1e-6 or lower")

for i in range(4):
    loss = critic.update(obs, acts, next_obs, rewards, terminals)['Training Loss']
    print(loss)

expected_loss = 0.7889254
loss_error = rel_error(loss, expected_loss)
print("Loss Error", loss_error, "should be on the order of 1e-6 or lower")


first_weight_after = np.array(ptu.to_numpy(next(critic.q_net.parameters())))
print("Weight after update (first row)", first_weight_after.shape)
# Test DQN gradient
print(first_weight_after[0])
weight_change_partial = first_weight_after[0] - first_weight_before[0]
expected_weight_change = np.array([-0.00491365, -0.00500049, -0.00499149, -0.00491229, -0.00490125,  0.00489534,
 -0.00282785, -0.00171614,  0.00485604])


updated_weight_error = rel_error(weight_change_partial, expected_weight_change)
print("Weight Update Error", updated_weight_error, "should be on the order of 1e-6 or lower")

########################
logging outputs to  test
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
LunarLander-v3
Weight before update (first row) [ 0.07646339 -0.07932477  0.09140956 -0.01702595  0.1423959   0.07935759
 -0.03831156 -0.2694876   0.0761048 ]
Initial loss 0.9408444
Initial Loss Error 8.831612855468697e-09 should be on the order of 1e-6 or lower
0.9007309
0.8621321
0.82467556
0.7889254
Loss Error 5.904878045285727e-09 should be on the order of 1e-6 or lower
Weight after update (first row) (64, 9)
[ 0.07154974 -0.08432526  0.08641808 -0.02193824  0.13749465  0.08425292
 -0.04113941 -0.27120373  0.08096085]
Weight Update Error 8.937585787696078e-07 should be on the order of 1e-6 or lower




Implement the missing components in the get_action method of <code>policies/argmax_policy.py</code> and the step_env method in <code>agents/dqn_agent.py</code> to allow our agent to interact with the environment.
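
The core of an argmax policy can be sketched as below, assuming the Q-values for a batch of observations are already available as a `(batch, num_actions)` array. The helper `greedy_actions` is a hypothetical name and is not the `get_action` method itself, which must also query the critic for the Q-values.

```python
import numpy as np

def greedy_actions(q_values):
    """Return argmax_a Q(s, a) for each row of a (batch, num_actions) array."""
    # A single observation's Q-values are promoted to a batch of size 1.
    q_values = np.atleast_2d(q_values)
    return np.argmax(q_values, axis=1)

q_values = np.array([[0.1, 0.7, 0.2], [1.2, -0.3, 0.4]])
actions = greedy_actions(q_values)  # one action index per observation
print(actions)
```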

In [61]:
### Test argmax policy
dqn_args = dict(dqn_base_args_dict)

env_str = 'LunarLander'
dqn_args['env_name'] = '{}-v3'.format(env_str)
dqn_args['double_q'] = False
dqntrainer = DQN_Trainer(dqn_args)
dqnagent = dqntrainer.rl_trainer.agent
actor = dqnagent.actor

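# Use this trainer's critic rather than the `critic` variable left over from an earlier cell
ob_dim = dqnagent.critic.ob_dim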
ac_dim = 6
N = 5

np.random.seed(0)
obs = np.random.normal(size=(N, ob_dim))

actions = actor.get_action(obs)
correct_actions = np.array([1, 0, 1, 0, 1])

assert np.all(correct_actions == actions)

########################
logging outputs to  test
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
LunarLander-v3




We can now test our DQN implementation on the LunarLander environment. These experiments can take a while to run (over 10 minutes per seed) on CPU, so start early.

In [None]:
dqn_args = dict(dqn_base_args_dict)

env_str = 'LunarLander'
dqn_args['env_name'] = '{}-v3'.format(env_str)
dqn_args['double_q'] = False

# Delete all previous logs
remove_folder('logs/dqn/{}/vanilla_dqn'.format(env_str))

for seed in range(3):
    print("Running DQN experiment with seed", seed)
    dqn_args['seed'] = seed
    dqn_args['logdir'] = 'logs/dqn/{}/vanilla_dqn/seed{}'.format(env_str, seed)
    dqntrainer = DQN_Trainer(dqn_args)
    dqntrainer.run_training_loop()

Clearing old results at logs/dqn/LunarLander/vanilla_dqn
Running DQN experiment with seed 0
########################
logging outputs to  logs/dqn/LunarLander/vanilla_dqn/seed0
########################
Using CPU for this assignment. There may be some bugs with using GPU that cause test cases to not match. You can uncomment the code below if you want to try using it.
LunarLander-v3


********** Iteration 0 ************

Training agent...

Beginning logging procedure...
Timestep 1
mean reward (100 episodes) nan
best mean reward -inf
running time 0.002473
Train_EnvstepsSoFar : 1
TimeSinceStart : 0.0024726390838623047
Done logging...








********** Iteration 1000 ************

Training agent...

Beginning logging procedure...
Timestep 1001
mean reward (100 episodes) -361.464801
best mean reward -inf
running time 0.692687
Train_EnvstepsSoFar : 1001
Train_AverageReturn : -361.46480063692485
TimeSinceStart : 0.6926872730255127
Done logging...




********** Iteration 2000 ************

Training agent...

Beginning logging procedure...
Timestep 2001
mean reward (100 episodes) -356.064325
best mean reward -inf
running time 3.577993
Train_EnvstepsSoFar : 2001
Train_AverageReturn : -356.06432532472695
TimeSinceStart : 3.577993154525757
Training Loss : 0.23083141446113586
Done logging...




********** Iteration 3000 ************

Training agent...

Beginning logging procedure...
Timestep 3001
mean reward (100 episodes) -311.599968
best mean reward -inf
running time 6.179479
Train_EnvstepsSoFar : 3001
Train_AverageReturn : -311.5999676699081
TimeSinceStart : 6.179478645324707
Training Loss : 0.25829875469207764
Done logging.

In [None]:
### Visualize vanilla DQN results on Lunar Lander
%load_ext tensorboard
%tensorboard --logdir logs/dqn/LunarLander/vanilla_dqn

## Double DQN
One potential issue with learning our Q functions with bootstrapping is _maximization bias_, where the learned Q-values tend to overestimate the actual expected future returns. The main idea is that when there is estimation error in the next state's Q-values, even if the values were correct on average, picking the action with the maximum Q-value would tend to select one where the value is overestimated. This overoptimistic value would then also get propagated via the Bellman backups to other states and actions, and can potentially slow down learning.
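
The effect can be seen in a small simulation. The setup below is an illustrative assumption: all true Q-values are zero, and each estimate is corrupted by independent unit-variance Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)
num_actions, num_trials = 10, 10_000

# True Q-values are all zero, so the true maximum over actions is exactly 0.
true_q = np.zeros(num_actions)
# Each trial draws an unbiased but noisy estimate of every action's value.
noisy_q = true_q + rng.normal(scale=1.0, size=(num_trials, num_actions))

mean_estimate = noisy_q.mean()            # close to 0: individual estimates are unbiased
mean_of_max = noisy_q.max(axis=1).mean()  # clearly positive: the max is biased upward
print(mean_estimate, mean_of_max)
```

Although every individual estimate is correct on average, the average of the maximum is well above the true value of zero, which is the overestimation that bootstrapped targets then propagate.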

Double DQN (https://arxiv.org/abs/1509.06461) proposes a simple way to alleviate this _maximization bias_. Instead of taking the next action that maximizes the target network's Q-value, it selects the action that maximizes the _current_ Q-function at the next state, and then uses the target network's estimate of that action's value.
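
The difference between the two targets can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the code expected in `critics/dqn_critic.py`: the function name `double_dqn_targets`, the toy values, and the terminal masking are all hypothetical.

```python
import numpy as np

def double_dqn_targets(rewards, next_q_online, next_q_target, terminals, gamma=0.99):
    """Select a' with the current network, then evaluate it with the target network."""
    best_actions = next_q_online.argmax(axis=1)
    chosen_q = next_q_target[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - terminals) * chosen_q

# Hypothetical next-state Q-values for 2 transitions with 2 actions each.
next_q_online = np.array([[1.0, 2.0], [3.0, 0.0]])  # current Q-function at s'
next_q_target = np.array([[5.0, 0.5], [1.0, 4.0]])  # target network at s'
rewards = np.array([0.0, 1.0])
terminals = np.array([0.0, 0.0])

double_targets = double_dqn_targets(rewards, next_q_online, next_q_target, terminals, gamma=1.0)
# Vanilla DQN target for comparison: max over the target network's own values.
vanilla_targets = rewards + (1.0 - terminals) * next_q_target.max(axis=1)
print(double_targets, vanilla_targets)
```

In this toy batch the double DQN targets are never larger than the vanilla ones, because the action chosen by the current Q-function is not necessarily the one the target network rates highest.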

Implement the double DQN target value in the update method in <code>critics/dqn_critic.py</code>.

In [None]:
#### Test DQN target value with double Q
dqn_args = dict(dqn_base_args_dict)

env_str = 'LunarLander'
dqn_args['env_name'] = '{}-v3'.format(env_str)
dqn_args['double_q'] = True
dqntrainer = DQN_Trainer(dqn_args)
dqnagent = dqntrainer.rl_trainer.agent
critic = dqnagent.critic

ob_dim = critic.ob_dim
ac_dim = 6
N = 5

np.random.seed(0)
obs = np.random.normal(size=(N, ob_dim))
acts = np.random.choice(ac_dim, size=(N,))
next_obs = np.random.normal(size=(N, ob_dim))
rewards = np.random.normal(size=N)
terminals = np.zeros(N)
terminals[0] = 1

first_weight_before = np.array(ptu.to_numpy(next(critic.q_net.parameters())))
print("Weight before update (first row)", first_weight_before[0])


loss = critic.update(obs, acts, next_obs, rewards, terminals)['Training Loss']
expected_loss = 0.93894196
loss_error = rel_error(loss, expected_loss)
print("Initial loss", loss)
print("Initial Loss Error", loss_error, "should be on the order of 1e-6 or lower")

for i in range(4):
    loss = critic.update(obs, acts, next_obs, rewards, terminals)['Training Loss']
    print(loss)

expected_loss = 0.7871182
loss_error = rel_error(loss, expected_loss)
print("Loss Error", loss_error, "should be on the order of 1e-6 or lower")


first_weight_after = np.array(ptu.to_numpy(next(critic.q_net.parameters())))
print("Weight after update (first row)", first_weight_after.shape)
# Test DQN gradient
print(first_weight_after[0])
weight_change_partial = first_weight_after[0] - first_weight_before[0]
print(weight_change_partial)
expected_weight_change = np.array([-0.0049137, -0.00500057, -0.00499138, -0.00491226, -0.00490116,  0.00489506,
 -0.00284088, -0.00171939,  0.00485736])


updated_weight_error = rel_error(weight_change_partial, expected_weight_change)
print("Weight Update Error", updated_weight_error, "should be on the order of 1e-6 or lower")

We can now also run some experiments on LunarLander with Double DQN. You may see that Double DQN performs slightly better and more stably, but because there is very high variance, don't worry if you do not.

In [None]:
# Run with double DQN
dqn_args = dict(dqn_base_args_dict)

env_str = 'LunarLander'
dqn_args['env_name'] = '{}-v3'.format(env_str)
dqn_args['double_q'] = True

# Delete all previous logs
remove_folder('logs/dqn/{}/double_dqn'.format(env_str))

for seed in range(3):
    print("Running DQN experiment with seed", seed)
    dqn_args['seed'] = seed
    dqn_args['logdir'] = 'logs/dqn/{}/double_dqn/seed{}'.format(env_str, seed)
    dqntrainer = DQN_Trainer(dqn_args)
    dqntrainer.run_training_loop()

In [None]:
### Visualize all DQN results on Lunar Lander
%load_ext tensorboard
%tensorboard --logdir logs/dqn/LunarLander/