# Deep Q-Networks

**Applications of Deep Learning, University of Zaragoza, Ruben Martinez-Cantin**

*This assignment is based on the UC Berkeley course CS 285: Deep Reinforcement Learning by Sergey Levine.*

This assignment requires you to implement and evaluate Q-learning with convolutional neural networks for playing Atari games. The Q-learning algorithm was covered in lecture, and you will be provided with starter code. This assignment will be faster to run on a GPU, though it is possible to complete on a CPU as well. We recommend using the Colab option if you do not have a GPU available to you. Please start early!

In this assignment, we will also use external code pulled directly from GitHub. If you want to look at that code (for example, to check how the replay buffer works), you can go to the GitHub URL or, after cloning the repository, open the files directly with the Colab file explorer.

# 1. Implementation

The first phase of the assignment is to implement a working version of Q-learning. The default code will run the `Ms. Pac-Man` game with reasonable hyperparameter settings. Look for the `TODO` markers in the code cells below for detailed implementation instructions. You may want to look inside `infrastructure/dqn_utils.py` (on GitHub or in the local folder) to understand how the (memory-optimized) replay buffer works, but you will not need to modify it.

Once you implement Q-learning, you might try different extensions (like double DQN) or change the hyperparameters, neural network architectures, and the game.

To determine if your implementation of Q-learning is correct, you should run it with the default hyperparameters on the `Ms. Pac-Man` game for 1 million steps. Our reference solution gets a return of 1500 in this timeframe. On Colab, this will take roughly 3 GPU hours. If it takes much longer than that, there may be a bug in your implementation.
To accelerate debugging, you should try on `LunarLander-v3` first, which trains your agent to play Lunar Lander, a 1979 arcade game (also made by Atari) that has been implemented in OpenAI Gym. Our reference solution
with the default hyperparameters achieves around 150 reward after 350k timesteps, but there is considerable variation between runs and without the double-Q trick the average return often decreases after reaching 150.
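
To make the double-Q trick mentioned above concrete, here is a minimal NumPy sketch that compares the vanilla DQN target with the double DQN target on a toy batch. The function names, the array names and the numbers are made up for illustration; this is not the starter code, which works on PyTorch tensors inside the critic.

```python
import numpy as np

def dqn_target(reward, terminal, q_next_target, gamma=0.99):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    q_next = q_next_target.max(axis=1)
    return reward + gamma * q_next * (1.0 - terminal)

def double_dqn_target(reward, terminal, q_next_online, q_next_target, gamma=0.99):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    best_action = q_next_online.argmax(axis=1)
    q_next = q_next_target[np.arange(len(best_action)), best_action]
    return reward + gamma * q_next * (1.0 - terminal)

# Toy batch of two transitions with three actions (made-up numbers).
reward = np.array([1.0, 0.0])
terminal = np.array([0.0, 1.0])  # the second transition ends its episode
q_next_online = np.array([[0.2, 0.9, 0.1], [0.5, 0.4, 0.3]])
q_next_target = np.array([[1.0, 0.6, 2.0], [0.7, 0.8, 0.9]])

y_dqn = dqn_target(reward, terminal, q_next_target, gamma=0.5)
y_ddqn = double_dqn_target(reward, terminal, q_next_online, q_next_target, gamma=0.5)
print(y_dqn)   # target when taking the max over the target network
print(y_ddqn)  # target when the online network picks the action
```

In the first transition the vanilla target uses the largest target-network value, while the double estimator uses the target-network value of the action preferred by the online network, which is lower here. The terminal transition gets no bootstrap term in either case.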

# 2. Evaluation
* **Basic Q-learning (DQN).** Your code using basic DQN (without the double estimator) should be able to solve `MsPacman-v0`. As a performance measure, you can plot the average per-epoch reward as well as the best mean reward against the number of timesteps. These quantities are already computed and printed. They are also logged to the data folder and can be visualized using TensorBoard, as in previous assignments. You can extract the values for plotting using the cell at the end of the notebook, replacing the name of the data folder. You should not need to modify the default hyperparameters to obtain good performance, but if you modify any of them, you must report the change alongside the plot. Given the GPU limitations on Colab, you may report results on `LunarLander-v3`, which should take about 30 minutes. As a middle ground, you can train `Breakout` for 500k-1M timesteps. Average reward should increase more or less linearly after 150-200k timesteps.

* **Double Q-learning (DDQN).** Use the double estimator to improve the accuracy of your learned Q values. This amounts to using the online Q network (instead of the target Q network) to select the best action when computing target values. Compare the performance of DDQN to vanilla DQN. Since there is considerable variance between runs, you must run at least three random seeds for both DQN and DDQN. You should use `LunarLander-v3` for this question. Make a single graph that averages the performance across three runs for both DQN and double DQN. You can extract the values for plotting using the cell at the end of the notebook, replacing the name of the data folder.

* (**Optional**) Now you can extend the assignment in multiple ways. Note that **the previous part should already give you a grade >9** if the implementation is correct and the report is accurate.
  * You can experiment with the **hyperparameters**. To save time, you can reduce the number of timesteps, although shorter runs reveal less about how each hyperparameter affects performance. You can plot the runs with different parameters in the same graph for comparison. Examples include: learning rates, neural network architecture, exploration schedule or exploration rule (e.g. you may implement an alternative to $\epsilon$-greedy), etc. Be efficient in experimentation and reporting. You can combine and reuse previous runs, comparisons and discussions for this.
  * As a specific architecture to try, you can implement the **dueling architecture**. Note that, for the Q-values, you should combine the previous layers using equation (9) from the [paper](https://arxiv.org/pdf/1511.06581.pdf), which subtracts the mean advantage over the $|\mathcal{A}|$ actions:
  $$
  Q(s,a) = V(s) + \left(A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')\right)
  $$
  In this case, you might change only the fully connected layers, leaving the convolutional layers as defined by default.
  * You can also try experimenting with the hyperparameters and/or architecture of the **actor-critic lab** instead.
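
As an illustration of the aggregation step in the dueling architecture, here is a minimal NumPy sketch, assuming the mean-subtracted form of equation (9) in the dueling paper. The function name `dueling_q_values` and the toy numbers are hypothetical; in the notebook the same operation would be applied to the outputs of two fully connected heads (one for $V$, one for $A$) placed after the convolutional layers.

```python
import numpy as np

def dueling_q_values(value, advantage):
    """Combine V(s) with shape (batch, 1) and A(s, a) with shape (batch, num_actions).

    The mean advantage is subtracted per state, so the mean of Q over
    actions equals V(s).
    """
    return value + (advantage - advantage.mean(axis=1, keepdims=True))

# Toy batch of two states with three actions (made-up numbers).
value = np.array([[2.0], [0.5]])
advantage = np.array([[1.0, 3.0, 2.0], [0.0, 0.0, 3.0]])

q = dueling_q_values(value, advantage)
print(q)
```

Subtracting the mean does not change which action has the highest Q-value in each state. It only removes the ambiguity between adding a constant to $V$ and subtracting it from $A$.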

# 3. Submitting the code and experiment runs

You need to submit a zip file with the code (.py or .ipynb) and the data generated in the runs.

If you submit the ipynb file, you can replace the visualization and TensorBoard cells with text and figures briefly explaining the results (example: average expected reward in multiple runs with different seeds...).

If you prefer, you can submit the results (text and figures) in a separate PDF instead.

**Note:** Ideally, you should run everything for multiple seeds and see the average outcomes, but for the longer experiments, like Ms-Pacman, you can just run it once or twice to save time.
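
Averaging over seeds can be sketched as follows. This is a hedged example with made-up numbers: `average_runs` is a hypothetical helper, and truncating every run to the shortest one is an assumption made so that runs logged for a different number of iterations can still be stacked into a single array.

```python
import numpy as np

def average_runs(runs):
    """Truncate all runs to the shortest length, then return the per-step mean and std."""
    min_len = min(len(r) for r in runs)
    stacked = np.array([r[:min_len] for r in runs], dtype=float)
    return stacked.mean(axis=0), stacked.std(axis=0)

# Hypothetical logged returns from three seeds (made-up numbers, uneven lengths).
runs = [
    [10.0, 20.0, 30.0, 40.0],
    [14.0, 18.0, 36.0],
    [12.0, 22.0, 24.0, 44.0],
]

mean_curve, std_curve = average_runs(runs)
print(mean_curve)
print(std_curve)
```

The resulting mean and standard deviation can be plotted as a line with a shaded band, one pair per method, to compare DQN and double DQN in a single graph.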

In [None]:
#@title install dependencies
# remove ` > /dev/null 2>&1` to see what is going on under the hood
!apt update > /dev/null 2>&1
!apt install -y --no-install-recommends \
        swig \
        xvfb \
        python3-opengl \
        ffmpeg > /dev/null 2>&1

In [None]:
#@title install Python dependencies
# remove ` > /dev/null 2>&1` to see what is going on under the hood
%pip install swig
%pip install gym[box2d,accept-rom-license,atari]==0.25.2 \
  tensorboardX==2.5.1 \
  pyvirtualdisplay==3.0 \
  opencv-python==4.6.0.66 > /dev/null 2>&1

In [None]:
#@title imports (torch, numpy, gym, pybullet...)
import numpy as np
import time
import copy
import abc
import itertools
import pickle
import os

from collections import OrderedDict
from typing import Union

import torch
from torch import nn
from torch import distributions
from torch import optim
from torch.nn import utils
from tensorboardX import SummaryWriter

import gym
import gym.spaces
from gym import wrappers

In [None]:
#@title clone project repo
#@markdown This includes some functions and utils used in DQN

#@markdown You can check the code in https://github.com/rmcantin/dqn_project.git
%cd /content/

!git clone https://github.com/rmcantin/dqn_project.git

%cd dqn_project
%pip install -e .

In [None]:
#@title code to display animations
from gym.wrappers import RecordVideo
import glob
import io
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay

## modified from https://colab.research.google.com/drive/1flu31ulJlgiRL1dnN2ir8wGh9p7Zij2t#scrollTo=TCelFzWY9MBI

def show_video():
  mp4list = glob.glob('/content/video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else:
    print("Could not find video")


def wrap_env(env):
  env = RecordVideo(env, '/content/video')
  return env


In [None]:
#@title set up virtual display
from pyvirtualdisplay import Display

display = Display(visible=0, size=(1400, 900))
display.start()


In [None]:
#@title test virtual display

#@markdown If you see a video of PacMan, setup is complete!

import matplotlib
matplotlib.use('Agg')

env = wrap_env(gym.make("MsPacman-v0", render_mode="rgb_array"))

observation = env.reset()
for i in range(10):
    env.render()
    obs, rew, term, _ = env.step(env.action_space.sample() )
    if term:
      break

env.close()
print('Loading video...')
show_video()

Loading video...


In [None]:
#@title set-up GPU if available

ptu_device = None

def ptu_init_gpu(use_gpu=True, gpu_id=0):
    global ptu_device
    if torch.cuda.is_available() and use_gpu:
        ptu_device = torch.device("cuda:" + str(gpu_id))
        print("Using GPU id {}".format(gpu_id))
    else:
        ptu_device = torch.device("cpu")
        print("GPU not detected. Defaulting to CPU.")

In [None]:
#@title Other imports
from cs285.infrastructure import pytorch_util as ptu
from cs285.infrastructure.dqn_utils import MemoryOptimizedReplayBuffer
from cs285.infrastructure.rl_trainer import RL_Trainer

In [None]:
#@title Hyperparameters and networks for the different experiments
#
from cs285.infrastructure.dqn_utils import PiecewiseSchedule, OptimizerSpec, PreprocessAtari, Flatten
from cs285.infrastructure.atari_wrappers import wrap_deepmind

def get_env_kwargs(env_name):
    if env_name in ['MsPacman-v0', 'PongNoFrameskip-v4', 'BreakoutDeterministic-v4']:
        kwargs = {
            'learning_starts': 50000,
            'target_update_freq': 10000,
            'replay_buffer_size': int(1e6),
            'num_timesteps': int(1e6), # if you can, try int(2e8),
            'q_func': create_atari_q_network,
            'learning_freq': 4,
            'grad_norm_clipping': 10,
            'input_shape': (84, 84, 4),
            'env_wrappers': wrap_deepmind,
            'frame_history_len': 4,
            'gamma': 0.99,
        }
        kwargs['optimizer_spec'] = atari_optimizer(kwargs['num_timesteps'])
        kwargs['exploration_schedule'] = atari_exploration_schedule(kwargs['num_timesteps'])

    elif env_name == 'LunarLander-v3':
        def lunar_empty_wrapper(env):
            return env
        kwargs = {
            'optimizer_spec': lander_optimizer(),
            'q_func': create_lander_q_network,
            'replay_buffer_size': 50000,
            'batch_size': 32,
            'gamma': 1.00,
            'learning_starts': 1000,
            'learning_freq': 1,
            'frame_history_len': 1,
            'target_update_freq': 3000,
            'grad_norm_clipping': 10,
            'lander': True,
            'num_timesteps': 500000,
            'env_wrappers': lunar_empty_wrapper
        }
        kwargs['exploration_schedule'] = lander_exploration_schedule(kwargs['num_timesteps'])

    else:
        raise NotImplementedError

    return kwargs

class LanderQNetwork(nn.Module):
    def __init__(self, ob_dim, num_actions):
        super(LanderQNetwork, self).__init__()

        self.fc1 = nn.Linear(ob_dim, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 64)
        self.fc_out = nn.Linear(64, num_actions)

    def forward(self, state):
        y1 = self.relu(self.fc1(state))
        y2 = self.relu(self.fc2(y1))

        return self.fc_out(y2)

class AtariQNetwork(nn.Module):
    def __init__(self, ob_dim, num_actions):
        super(AtariQNetwork, self).__init__()

        self.prec = PreprocessAtari()
        self.conv1 = nn.Conv2d(in_channels=4, out_channels=32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1)
        self.flat = Flatten()
        self.fc1 = nn.Linear(3136, 512)  # 3136 hard-coded based on img size + CNN layers
        self.fc_out = nn.Linear(512, num_actions)
        self.relu = nn.ReLU()

    def forward(self, state):
        x = self.relu(self.conv1(self.prec(state)))
        x = self.relu(self.conv2(x))
        x = self.flat(self.relu(self.conv3(x)))
        x = self.relu(self.fc1(x))
        return self.fc_out(x)

def create_lander_q_network(ob_dim, num_actions):
    return LanderQNetwork(ob_dim, num_actions)

def create_atari_q_network(ob_dim, num_actions):
    return AtariQNetwork(ob_dim, num_actions)


def lander_exploration_schedule(num_timesteps):
    return PiecewiseSchedule(
        [
            (0, 1),
            (num_timesteps * 0.1, 0.02),
        ], outside_value=0.02
    )

def atari_exploration_schedule(num_timesteps):
    return PiecewiseSchedule(
        [
            (0, 1.0),
            (num_timesteps / 16, 0.1),
            (num_timesteps / 8, 0.01),
        ], outside_value=0.01
    )

def lander_optimizer():
    return OptimizerSpec(
        constructor=optim.Adam,
        optim_kwargs=dict(
            lr=1,
        ),
        learning_rate_schedule=lambda epoch: 1e-3,  # LambdaLR multiplies the base lr (1) by this factor, giving 1e-3
    )

def atari_optimizer(num_timesteps):
    lr_schedule = PiecewiseSchedule(
        [
            (0, 1e-1),
            (num_timesteps / 40, 1e-1),
            (num_timesteps / 8, 5e-2),
        ],
        outside_value=5e-2,
    )

    return OptimizerSpec(
        constructor=optim.Adam,
        optim_kwargs=dict(
            lr=1e-3,
            eps=1e-4
        ),
        learning_rate_schedule=lambda t: lr_schedule.value(t),
    )




In [None]:
#@title Critic
#@markdown You need to code the critic `update(..)` function

#@markdown The code inside the `if self.double_q` can be left TODO until the second part (DDQN).

class BaseCritic(object):
    def update(self, ob_no, ac_na, next_ob_no, re_n, terminal_n):
        raise NotImplementedError

class DQNCritic(BaseCritic):

    def __init__(self, hparams, optimizer_spec, **kwargs):
        super().__init__(**kwargs)
        self.env_name = hparams['env_name']
        self.ob_dim = hparams['ob_dim']

        if isinstance(self.ob_dim, int):
            self.input_shape = (self.ob_dim,)
        else:
            self.input_shape = hparams['input_shape']

        self.ac_dim = hparams['ac_dim']
        self.double_q = hparams['double_q']
        self.grad_norm_clipping = hparams['grad_norm_clipping']
        self.gamma = hparams['gamma']

        self.optimizer_spec = optimizer_spec
        network_initializer = hparams['q_func']
        self.q_net = network_initializer(self.ob_dim, self.ac_dim)
        self.q_net_target = network_initializer(self.ob_dim, self.ac_dim)
        self.optimizer = self.optimizer_spec.constructor(
            self.q_net.parameters(),
            **self.optimizer_spec.optim_kwargs
        )
        self.learning_rate_scheduler = optim.lr_scheduler.LambdaLR(
            self.optimizer,
            self.optimizer_spec.learning_rate_schedule,
        )
        self.loss = nn.SmoothL1Loss()  # AKA Huber loss
        self.q_net.to(ptu.device)
        self.q_net_target.to(ptu.device)

    def update(self, ob_no, ac_na, next_ob_no, reward_n, terminal_n):
        """
            Update the parameters of the critic.
            let sum_of_path_lengths be the sum of the lengths of the paths sampled from
                Agent.sample_trajectories
            let num_paths be the number of paths sampled from Agent.sample_trajectories
            arguments:
                ob_no: shape: (sum_of_path_lengths, ob_dim)
                next_ob_no: shape: (sum_of_path_lengths, ob_dim). The observation after taking one step forward
                reward_n: length: sum_of_path_lengths. Each element in reward_n is a scalar containing
                    the reward for each timestep
                terminal_n: length: sum_of_path_lengths. Each element in terminal_n is either 1 if the episode ended
                    at that timestep or 0 if the episode did not end
            returns:
                nothing
        """
        ob_no = ptu.from_numpy(ob_no)
        ac_na = ptu.from_numpy(ac_na).to(torch.long)
        next_ob_no = ptu.from_numpy(next_ob_no)
        reward_n = ptu.from_numpy(reward_n)
        terminal_n = ptu.from_numpy(terminal_n)

        qa_t_values = self.q_net(ob_no)
        q_t_values = torch.gather(qa_t_values, 1, ac_na.unsqueeze(1)).squeeze(1)

        # TODO compute the Q-values from the target network
        qa_tp1_values = TODO

        if self.double_q:
            # You must fill this part for Q2 of the Q-learning portion of the homework.
            # In double Q-learning, the best action is selected using the Q-network that
            # is being updated, but the Q-value for this action is obtained from the
            # target Q-network. See page 5 of https://arxiv.org/pdf/1509.06461.pdf for more details.
            TODO
        else:
            q_tp1, _ = qa_tp1_values.max(dim=1)

        # compute targets for minimizing Bellman error
        # HINT: as you saw in lecture, this would be:
        #currentReward + self.gamma * qValuesOfNextTimestep * (not terminal)
        target = TODO
        target = target.detach()

        assert q_t_values.shape == target.shape
        loss = self.loss(q_t_values, target)

        self.optimizer.zero_grad()
        loss.backward()
        utils.clip_grad_value_(self.q_net.parameters(), self.grad_norm_clipping)
        self.optimizer.step()

        return {
            'Training Loss': ptu.to_numpy(loss),
        }

    def update_target_network(self):
        for target_param, param in zip(
                self.q_net_target.parameters(), self.q_net.parameters()
        ):
            target_param.data.copy_(param.data)

    def qa_values(self, obs) -> np.ndarray:
        obs = ptu.from_numpy(obs)
        qa_values = self.q_net(obs)
        return ptu.to_numpy(qa_values)

In [None]:
#@title Policy (actor)
#@markdown You need to code the policy `get_action(..)` function

class ArgMaxPolicy(object):

    def __init__(self, critic):
        self.critic = critic

    def get_action(self, obs):
        if len(obs.shape) > 3:
            observation = obs
        else:
            observation = obs[None]

        ## TODO return the action that maximizes the Q-value
        # at the current observation as the output
        action = TODO

        return action.squeeze()

In [None]:
#@title DQN agent
#@markdown We have all the ingredients. Let's put it all together.
#@markdown You need to code the `step_env(..)` and the `train(..)` function

class DQNAgent(object):
    def __init__(self, env, agent_params):

        self.env = env
        self.agent_params = agent_params
        self.batch_size = agent_params['batch_size']
        # import ipdb; ipdb.set_trace()
        self.last_obs = self.env.reset()

        self.num_actions = agent_params['ac_dim']
        self.learning_starts = agent_params['learning_starts']
        self.learning_freq = agent_params['learning_freq']
        self.target_update_freq = agent_params['target_update_freq']

        self.replay_buffer_idx = None
        self.exploration = agent_params['exploration_schedule']
        self.optimizer_spec = agent_params['optimizer_spec']

        self.critic = DQNCritic(agent_params, self.optimizer_spec)
        self.actor = ArgMaxPolicy(self.critic)

        lander = agent_params['env_name'].startswith('LunarLander')
        self.replay_buffer = MemoryOptimizedReplayBuffer(
            agent_params['replay_buffer_size'], agent_params['frame_history_len'], lander=lander)
        self.t = 0
        self.num_param_updates = 0

    def add_to_replay_buffer(self, paths):
        pass

    def step_env(self):
        """
        Step the env and store the transition
        At the end of this block of code, the simulator should have been
        advanced one step, and the replay buffer should contain one more
        transition. Note that self.last_obs must always point to the new latest
        observation.
        """

        # TODO store the latest observation ("frame") into the replay buffer
        # HINT: the replay buffer used here is `MemoryOptimizedReplayBuffer`
        # in dqn_utils.py
        self.replay_buffer_idx = TODO

        eps = self.exploration.value(self.t)

        # TODO use epsilon greedy exploration when selecting action
        # HINT: take random action with probability eps (see np.random.random())
        # OR if your current step number (see self.t) is less than self.learning_starts
        perform_random_action = TODO
        if perform_random_action:
            # take random action
            action = self.env.action_space.sample()
        else:
            # HINT: Your actor will take in multiple previous observations ("frames") in order
            # to deal with the partial observability of the environment. Get the most recent
            # `frame_history_len` observations using functionality from the replay buffer,
            # and then use those observations as input to your actor.
            action = TODO

        # take a step in the environment using the action from the policy
        self.last_obs, reward, done, info = self.env.step(action)

        # TODO store the result of taking this action into the replay buffer
        # HINT1: see your replay buffer's `store_effect` function
        # HINT2: one of the arguments you'll need to pass in is self.replay_buffer_idx from above
        TODO

        # if taking this step resulted in done, reset the env (and the
        # latest observation)
        if done:
            self.last_obs = self.env.reset()

    def sample(self, batch_size):
        if self.replay_buffer.can_sample(self.batch_size):
            return self.replay_buffer.sample(batch_size)
        else:
            return [], [], [], [], []

    def train(self, ob_no, ac_na, re_n, next_ob_no, terminal_n):
        log = {}
        if (self.t > self.learning_starts
                and self.t % self.learning_freq == 0
                and self.replay_buffer.can_sample(self.batch_size)
        ):

            log = self.critic.update(ob_no, ac_na, next_ob_no, re_n, terminal_n)

            # TODO update the target network periodically
            # HINT: your critic already has this functionality implemented
            if self.num_param_updates % self.target_update_freq == 0:
                TODO

            self.num_param_updates += 1

        self.t += 1
        return log


In [None]:
#@title runtime arguments

class Args:

  def __getitem__(self, key):
    return getattr(self, key)

  def __setitem__(self, key, val):
    setattr(self, key, val)

  def __contains__(self, key):
    return hasattr(self, key)

  env_name = 'LunarLander-v3' #@param ['MsPacman-v0', 'LunarLander-v3', 'PongNoFrameskip-v4', 'BreakoutDeterministic-v4']
  ep_len = 200 #@param {type: "integer"}

  #@markdown batches and steps
  batch_size = 32 #@param {type: "integer"}
  eval_batch_size = 1000 #@param {type: "integer"}

  num_agent_train_steps_per_iter = 1 #@param {type: "integer"}

  num_critic_updates_per_agent_update = 1 #@param {type: "integer"}

  #@markdown Q-learning parameters
  double_q = False #@param {type: "boolean"}

  #@markdown system
  save_params = False #@param {type: "boolean"}
  no_gpu = False #@param {type: "boolean"}
  which_gpu = 0 #@param {type: "integer"}
  seed = 1 #@param {type: "integer"}

  #@markdown logging
  ## default is to not log video so
  ## that logs are small enough to be
  ## uploaded to Gradescope
  video_log_freq =  -1 #@param {type: "integer"}
  scalar_log_freq =  10000#@param {type: "integer"}


args = Args()

## ensure compatibility with hw1 code
args['train_batch_size'] = args['batch_size']


if args['video_log_freq'] > 0:
  import warnings
  warnings.warn(
      '''\nLogging videos will make eventfiles too large.'''
      '''\nSet video_log_freq = -1 to avoid that.''')

In [None]:
#@title Define Q-function trainer
#@markdown This calls the RL_trainer with the specific parameters for DQN

class Q_Trainer(object):

    def __init__(self, params):
        self.params = params

        train_args = {
            'num_agent_train_steps_per_iter': params['num_agent_train_steps_per_iter'],
            'num_critic_updates_per_agent_update': params['num_critic_updates_per_agent_update'],
            'train_batch_size': params['batch_size'],
            'double_q': params['double_q'],
        }

        env_args = get_env_kwargs(params['env_name'])

        for k, v in env_args.items():
          params[k] = v

        self.params['agent_class'] = DQNAgent
        self.params['agent_params'] = params
        self.params['train_batch_size'] = params['batch_size']
        self.params['env_wrappers'] = env_args['env_wrappers']

        self.rl_trainer = RL_Trainer(self.params)

    def run_training_loop(self):
        self.rl_trainer.run_training_loop(
            self.params['num_timesteps'],
            collect_policy = self.rl_trainer.agent.actor,
            eval_policy = self.rl_trainer.agent.actor,
            )

In [None]:
#@title create directories for logging

data_path = '''/content/data'''

if not (os.path.exists(data_path)):
    os.makedirs(data_path)

logdir = 'dqn_' + args.env_name + '_' + time.strftime("%d-%m-%Y_%H-%M-%S")
logdir = os.path.join(data_path, logdir)
args['logdir'] = logdir
if not(os.path.exists(logdir)):
    os.makedirs(logdir)

print("LOGGING TO: ", logdir)

In [None]:
#@markdown You can visualize your runs with tensorboard from within the notebook

## requires tensorflow==2.3.0
%load_ext tensorboard
%tensorboard --logdir /content/data/

In [None]:
#@title run training
trainer = Q_Trainer(args)
trainer.run_training_loop()

In [None]:
#@title Visualize a test run on video
#@markdown You can run the cell multiple times to get different random initializations
from cs285.infrastructure.atari_wrappers import wrap_deepmind

env = gym.make(args['env_name'], render_mode="rgb_array")

if args['env_name'] != 'LunarLander-v3':
    # This is only for Atari games
    env = wrap_deepmind(env)

env = wrap_env(env)

obs = env.reset()

if args['env_name'] != 'LunarLander-v3':
    # This is only for Atari games
    frames = []
    for _ in range(trainer.params['frame_history_len']):
        frames.append(obs)
    npframes = np.concatenate(frames, 2)
else:
    npframes = obs

term = False
i = 0
while not term:
    i += 1
    env.render()
    action= trainer.rl_trainer.agent.actor.get_action(npframes)
    obs, rew, term, _ = env.step(action)
    if args['env_name'] != 'LunarLander-v3':
        # This is only for Atari games
        frames.pop(0)
        frames.append(obs)
        npframes = np.concatenate(frames, 2)
    else:
        npframes = obs

    if term:
      break

env.close()
print('Loading video...',i)
show_video()

In [None]:
#@title Code to download the data folder.
#@markdown Make sure to run it frequently in case Google decides to shut down the instance suddenly.
!zip -r /content/data.zip /content/data

from google.colab import files
files.download("/content/data.zip")

In [None]:
#@title Code to plot different metrics
#@markdown This is just an example of the things you can do.
#@markdown You might need to change the tag selected, the labels,
#@markdown the file names, etc.

#@markdown **IMPORTANT:** If you run the same experiment multiple times, this will
#@markdown also plot error bars, but remember to **change the seed** of the random
#@markdown number generator.

#@markdown You can also run this cell in a separate colab where you upload the
#@markdown data folder (see https://colab.research.google.com/notebooks/io.ipynb#scrollTo=BaCkyg5CV5jF)

# Plotting example requires tensorflow==1.12.0

import glob
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline

def get_section_results(files, tag):
    data = []
    for file in files:
        row = []
        for e in tf.compat.v1.train.summary_iterator(file):
            for v in e.summary.value:
                if v.tag == tag:
                    row.append(v.simple_value)
        data.append(row)
    return data


#logfile = 'data/my_experiment/events*'
all_logdir= data_path+'/dqn_' + args.env_name + '_*'
logfile = all_logdir+'/events*'
eventfiles = glob.glob(logfile)

tag = 'Train_AverageReturn'
X = get_section_results(eventfiles, tag)
for j, row in enumerate(X):
    for i, x in enumerate(row):
        print('Experiment {:d} | Iteration {:d} | {}: {} '.format(j, i, tag, x))

color = 'r'
X = np.array(X)
mean_plot = X.mean(axis=0)
std_plot = X.std(axis=0)
iters = np.arange(len(mean_plot))
plt.plot(iters,mean_plot,color, label=tag)
plt.fill_between(iters, mean_plot-std_plot, mean_plot+std_plot, color=color, alpha=0.2)
plt.ylabel('reward')
plt.xlabel('iteration')
plt.legend()