# Neural Actor Critic

**Applications of Deep Learning, University of Zaragoza, Ruben Martinez-Cantin**

*This assignment is based on the UC Berkeley course CS 285: Deep Reinforcement Learning by Sergey Levine.*

This assignment requires you to implement an actor-critic algorithm to solve certain tasks. It is very similar to the project (where you will implement a DQN algorithm), but it is shorter. The actual coding for this assignment will involve fewer than 20 lines of code.

Recall the policy gradient equation:
$$
  \nabla_\theta J (\theta) = \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) A^\pi(s_{i,t}, a_{i,t})
$$
When the advantage $A^\pi$ is estimated from the sampled sum of rewards to go, as in plain policy gradient methods, the estimate suffers from high variance. Actor-critic addresses this issue by using a critic network to estimate the sum of rewards to go.
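To make the estimator concrete, here is a minimal sketch in plain Python. It assumes a deliberately simplified setting that is not part of the assignment: a state-independent softmax policy whose parameters $\theta$ are the logits, and a flat list of sampled (action, advantage) pairs instead of full trajectories. The function names and the sample values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_estimate(logits, actions, advantages):
    """Average of grad_theta log pi(a) * A over the sampled (a, A) pairs."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, adv in zip(actions, advantages):
        for k in range(len(logits)):
            # d/d theta_k log softmax(theta)[a] = 1[k == a] - pi(k)
            indicator = 1.0 if k == a else 0.0
            grad[k] += (indicator - probs[k]) * adv
    return [g / len(actions) for g in grad]

grad = policy_gradient_estimate(
    logits=[0.0, 0.0],            # uniform policy over two actions
    actions=[0, 1, 1],            # hypothetical sampled actions
    advantages=[1.0, -1.0, 0.5],  # hypothetical advantage estimates
)
print(grad)
```

Actions with positive advantage have their logit pushed up and actions with negative advantage have it pushed down, which is the behavior the equation above encodes.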

The most common type of critic network used is a value function, in which case our estimated advantage becomes
$$
A^\pi(s_{t}, a_{t}) = r(s_{t}, a_{t}) + \gamma V_\phi^\pi(s_{t+1}) - V_\phi^\pi(s_{t})
$$
One additional consideration in actor-critic is updating the critic network itself. While we can use Monte Carlo rollouts to estimate the sum of rewards to go for updating the value function network, in practice we
fit our value function to the following *target values*:
$$
y_t = r(s_{t}, a_{t}) + \gamma V_\phi^\pi(s_{t+1})
$$
We then regress onto these target values via the following regression objective, which we can optimize with gradient descent:
$$
\min_\phi \sum_{i,t}\left(V_\phi^\pi(s_{i,t}) - y_{i,t}\right)^2
$$
In theory, we need to perform this minimization every time we update our policy, so that our value function matches the behavior of the new policy. In practice, however, this operation can be costly, so we may instead just take a few gradient steps at each iteration. Also note that since our target values are based on the old value function, we may need to recompute the targets with the updated value function, in the following fashion:

1. Update targets with current value function
2. Regress onto targets to update value function by taking a few gradient steps
3. Redo steps 1 and 2 several times

In all, the process of fitting the value function critic is an iterative process in which we go back and forth between computing target values and updating the value function to match the target values. Through experimentation, you will see that this iterative process is crucial for training the critic network.
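The alternation between recomputing targets and taking gradient steps can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the assignment's critic: it replaces the neural network with a table `V` holding one value per state, uses a hypothetical two-step chain of transitions, and writes the gradient of the squared error by hand. The function name `fit_critic` and the learning rate are likewise made up for the sketch.

```python
def fit_critic(transitions, n_states, gamma, lr,
               num_target_updates, num_grad_steps_per_target_update):
    """Tabular stand-in for the critic: V has one entry per state."""
    V = [0.0] * n_states
    for _ in range(num_target_updates):
        # 1. Recompute the targets with the current value function and freeze them.
        #    The bootstrapped value is dropped on terminal transitions.
        targets = [r + (0.0 if terminal else gamma * V[s_next])
                   for (s, r, s_next, terminal) in transitions]
        for _ in range(num_grad_steps_per_target_update):
            # 2. One gradient step on the mean squared error (V[s] - y)^2.
            grad = [0.0] * n_states
            for (s, r, s_next, terminal), y in zip(transitions, targets):
                grad[s] += 2.0 * (V[s] - y) / len(transitions)
            V = [v - lr * g for v, g in zip(V, grad)]
    return V

# Hypothetical chain: state 0 -> state 1 -> state 2, reward 1 per step,
# and the episode ends on the second transition.
chain = [(0, 1.0, 1, False), (1, 1.0, 2, True)]

V_iterated = fit_critic(chain, n_states=3, gamma=0.9, lr=0.5,
                        num_target_updates=10, num_grad_steps_per_target_update=10)
V_single = fit_critic(chain, n_states=3, gamma=0.9, lr=0.5,
                      num_target_updates=1, num_grad_steps_per_target_update=1)
print(V_iterated[:2], V_single[:2])
```

With repeated target updates the table approaches the true values (1.9 for state 0 and 1.0 for state 1), whereas a single target update followed by a single gradient step leaves both estimates far from them.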

# 1. Implementation

You will need to fill in the TODOs for the following parts of the code.
* In the Policy and Critic sections, implement the `update` methods for both networks. In the Critic, perform the update according to the process outlined in the introduction. You must perform
`self.num_target_updates * self.num_grad_steps_per_target_update`
gradient updates in total, and recompute the target values every `self.num_grad_steps_per_target_update` steps.
* In the Agent section, finish the `estimate_advantage` function: this function uses the critic network to estimate the advantage values. The advantage values are computed according to
$$
A^\pi(s_{t}, a_{t}) = r(s_{t}, a_{t}) + \gamma V_\phi^\pi(s_{t+1}) - V_\phi^\pi(s_{t})
$$
Note: for terminal timesteps, you must make sure to cut off the bootstrapped value of the next state (i.e., set $V_\phi^\pi(s_{t+1})$ to zero), in which case we have
$$
A^\pi(s_{t}, a_{t}) = r(s_{t}, a_{t}) - V_\phi^\pi(s_{t})
$$
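As a worked instance of the two advantage formulas, the sketch below evaluates them on plain Python lists. It is only an illustration: the function name `advantage_sketch` and the numbers are hypothetical, and the value estimates are passed in directly rather than queried from the critic network as the real `estimate_advantage` does.

```python
def advantage_sketch(rewards, values, next_values, terminals, gamma):
    """A(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t), with V(s_{t+1}) = 0 at terminal steps."""
    advantages = []
    for r, v, v_next, done in zip(rewards, values, next_values, terminals):
        # At a terminal timestep there is no future reward, so drop the bootstrap term.
        bootstrap = 0.0 if done else gamma * v_next
        advantages.append(r + bootstrap - v)
    return advantages

adv = advantage_sketch(
    rewards=[1.0, 1.0],
    values=[2.0, 1.0],        # hypothetical V(s_t)
    next_values=[1.0, 5.0],   # hypothetical V(s_{t+1}); the second one is ignored
    terminals=[False, True],
    gamma=0.9,
)
print(adv)
```

The second transition is terminal, so its advantage is simply the reward minus the current value estimate, whatever the critic predicts for the next state.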

# 2. Evaluation
Now that you have implemented actor-critic, check that your solution works by running `CartPole-v0`. This experiment should run quite fast compared with the other experiments, so you can use it for debugging.

To test the CartPole, you can use the default configuration:
```
env_name = 'CartPole-v0'
ep_len = 200
batch_size = 1000
eval_batch_size =  400
n_iter =  100
discount =  0.9
learning_rate = 5e-3
```
Then you can try with different variations of updates and gradient steps:
```
num_target_updates = 1
num_grad_steps_per_target_update = 1
```
In the example above, we alternate between performing one target update and one gradient update step for the critic. As you will see, this probably doesn't work, and you need to increase both the number of target
updates and number of gradient updates. Compare the results for the following settings and report which worked best.
```
num_target_updates = 1
num_grad_steps_per_target_update = 100
```
```
num_target_updates = 100
num_grad_steps_per_target_update = 1
```
```
num_target_updates = 10
num_grad_steps_per_target_update = 10
```
In the end, the best setting from above should give you robust performance on CartPole (reward 200).
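When comparing these settings, it helps to note how many critic gradient steps each one performs per call to the critic update, since the total is the product of the two parameters. The short sketch below only does that arithmetic; the dictionary name `budget` is hypothetical.

```python
# (num_target_updates, num_grad_steps_per_target_update) for the four settings above
settings = [(1, 1), (1, 100), (100, 1), (10, 10)]

budget = {}
for num_target_updates, num_grad_steps_per_target_update in settings:
    # Total gradient steps taken by one critic update call.
    budget[(num_target_updates, num_grad_steps_per_target_update)] = (
        num_target_updates * num_grad_steps_per_target_update
    )

for setting, total in budget.items():
    print(setting, "->", total, "gradient steps")
```

The three settings to compare all take 100 gradient steps, so any difference between them comes from how often the targets are refreshed rather than from the amount of optimization.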

# 3. Run actor-critic on more difficult tasks
Use the best setting from the previous question to run the harder `InvertedPendulumSwingupBulletEnv-v0`, which uses a physics engine and requires learning the swing-up maneuver, and the even harder `BipedalWalker-v3` or `HalfCheetah-v4`.

For both Bullet-based inverted pendulums, you can use the following settings:
```
ep_len = 1000
batch_size = 6000
eval_batch_size =  500

n_iter =  150
discount =  0.95
learning_rate = 0.005
n_layers = 2
size =  64
```
The MuJoCo `InvertedPendulum-v4` is slightly easier and can be solved with fewer iterations and a smaller batch size:
```
ep_len = 1000
batch_size = 5000
eval_batch_size =  500

n_iter =  100
discount =  0.95
learning_rate = 0.01
n_layers = 2
size =  64
```

For HalfCheetah, you can use the following settings:
```
ep_len = 150
batch_size = 30000
eval_batch_size =  1500

n_iter =  150
discount =  0.9
learning_rate = 0.02
n_layers = 2
size =  32
```
and for the bipedal walker:
```
ep_len = 1000
batch_size = 20000
eval_batch_size =  1500

n_iter =  200
discount =  0.95
learning_rate = 0.002
n_layers = 4
size =  64
```
For reference, with `HalfCheetah-v4` you should reach a reward of around 150 after 150 iterations.

# 4. Submitting the code and experiment runs

You need to submit a zip file with the code (.py or .ipynb) and the data generated in the runs.

If you submit the ipynb file, you can replace the visualization and tensorboard cells with text and figures briefly explaining the results (for example, the average expected reward over multiple runs with different seeds).

If you prefer, you can submit the results (text and figures) in a separate PDF instead.

**Note:** Ideally, you should run everything for multiple seeds and see the average outcomes, but for the longer experiments, like the swing up or the bipedal walker, you can just run it once or twice to save time.
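One simple way to report results over several seeds is the mean and sample standard deviation of the final average return. The sketch below assumes you have collected one final return per seed by hand; the function name `summarize_runs` is hypothetical and the numbers are placeholders for illustration, not measured results.

```python
import statistics

def summarize_runs(final_returns_by_seed):
    """Return (mean, sample standard deviation) of the final returns across seeds."""
    returns = list(final_returns_by_seed.values())
    mean = statistics.mean(returns)
    # stdev needs at least two runs; report 0.0 for a single run.
    std = statistics.stdev(returns) if len(returns) > 1 else 0.0
    return mean, std

# Placeholder numbers for illustration only, NOT measured results.
example_returns = {0: 180.0, 1: 200.0, 2: 190.0}
mean_return, std_return = summarize_runs(example_returns)
print(f"average return over {len(example_returns)} seeds: "
      f"{mean_return:.1f} +/- {std_return:.1f}")
```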


In [1]:
#@title install dependencies
#@markdown it might take a while. Run it as soon as possible.
# remove ` > /dev/null 2>&1` to see what is going on under the hood
!apt update > /dev/null 2>&1
!apt install -y --no-install-recommends \
        swig \
        xvfb \
        libglfw3 \
        libglfw3-dev \
        python3-opengl \
        ffmpeg > /dev/null 2>&1
%pip install swig
%pip install mujoco==2.2.0 \
  gym[box2d,mujoco]==0.25.2 \
  tensorboardX==2.5.1 \
  pyvirtualdisplay==3.0 \
  opencv-python==4.6.0.66 \
  pybullet > /dev/null 2>&1

Collecting swig
  Downloading swig-4.2.0.post0-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.9 MB)
Installing collected packages: swig
Successfully installed swig-4.2.0.post0


In [2]:
#@title imports (torch, numpy, gym, pybullet...)
import numpy as np
import time
import copy
import abc
import itertools
import pickle
import os

from collections import OrderedDict
from typing import Union

import torch
from torch import nn
from torch import distributions
from torch import optim
from tensorboardX import SummaryWriter

import gym
import gym.spaces
from gym import wrappers

import mujoco
import pybullet_envs



In [3]:
#@title code to display animations
from gym.wrappers import RecordVideo
import glob
import io
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay

## modified from https://colab.research.google.com/drive/1flu31ulJlgiRL1dnN2ir8wGh9p7Zij2t#scrollTo=TCelFzWY9MBI

def show_video():
  mp4list = glob.glob('/content/video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else:
    print("Could not find video")


def wrap_env(env):
  env = RecordVideo(env, '/content/video')
  return env




In [4]:
#@title set up virtual display
from pyvirtualdisplay import Display

display = Display(visible=0, size=(1400, 900))
display.start()

<pyvirtualdisplay.display.Display at 0x7d0418176230>

In [5]:
#@title test virtual display

#@markdown If you see a video of a two-leg-dog fumbling about, setup is complete!

import matplotlib
matplotlib.use('Agg')

env = wrap_env(gym.make("HalfCheetah-v4", render_mode='rgb_array'))

observation = env.reset()
for i in range(10):
    env.render()
    obs, rew, term, _ = env.step(env.action_space.sample() )
    if term:
        break

env.close()
print('Loading video...')
show_video()



Loading video...


In [6]:
#@title set-up GPU if available

ptu_device = None

def ptu_init_gpu(use_gpu=True, gpu_id=0):
    global ptu_device
    if torch.cuda.is_available() and use_gpu:
        ptu_device = torch.device("cuda:" + str(gpu_id))
        print("Using GPU id {}".format(gpu_id))
    else:
        ptu_device = torch.device("cpu")
        print("GPU not detected. Defaulting to CPU.")



In [7]:
#@title logging tools for tensorboard, video...

class Logger:
    def __init__(self, log_dir, n_logged_samples=10, summary_writer=None):
        self._log_dir = log_dir
        print('########################')
        print('logging outputs to ', log_dir)
        print('########################')
        self._n_logged_samples = n_logged_samples
        self._summ_writer = SummaryWriter(log_dir, flush_secs=1, max_queue=1)

    def log_scalar(self, scalar, name, step_):
        self._summ_writer.add_scalar('{}'.format(name), scalar, step_)

    def log_scalars(self, scalar_dict, group_name, step, phase):
        """Will log all scalars in the same plot."""
        self._summ_writer.add_scalars('{}_{}'.format(group_name, phase), scalar_dict, step)

    def log_image(self, image, name, step):
        assert(len(image.shape) == 3)  # [C, H, W]
        self._summ_writer.add_image('{}'.format(name), image, step)

    def log_video(self, video_frames, name, step, fps=10):
        assert len(video_frames.shape) == 5, "Need [N, T, C, H, W] input tensor for video logging!"
        self._summ_writer.add_video('{}'.format(name), video_frames, step, fps=fps)

    def log_paths_as_videos(self, paths, step, max_videos_to_save=2, fps=10, video_title='video'):

        # reshape the rollouts
        videos = [np.transpose(p['image_obs'], [0, 3, 1, 2]) for p in paths]

        # max rollout length
        max_videos_to_save = np.min([max_videos_to_save, len(videos)])
        max_length = videos[0].shape[0]
        for i in range(max_videos_to_save):
            if videos[i].shape[0]>max_length:
                max_length = videos[i].shape[0]

        # pad rollouts to all be same length
        for i in range(max_videos_to_save):
            if videos[i].shape[0]<max_length:
                padding = np.tile([videos[i][-1]], (max_length-videos[i].shape[0],1,1,1))
                videos[i] = np.concatenate([videos[i], padding], 0)

        # log videos to tensorboard event file
        videos = np.stack(videos[:max_videos_to_save], 0)
        self.log_video(videos, video_title, step, fps=fps)

    def log_figures(self, figure, name, step, phase):
        """figure: matplotlib.pyplot figure handle"""
        assert figure.shape[0] > 0, "Figure logging requires input shape [batch x figures]!"
        self._summ_writer.add_figure('{}_{}'.format(name, phase), figure, step)

    def log_figure(self, figure, name, step, phase):
        """figure: matplotlib.pyplot figure handle"""
        self._summ_writer.add_figure('{}_{}'.format(name, phase), figure, step)

    def log_graph(self, array, name, step, phase):
        """figure: matplotlib.pyplot figure handle"""
        im = plot_graph(array)
        self._summ_writer.add_image('{}_{}'.format(name, phase), im, step)

    def dump_scalars(self, log_path=None):
        log_path = os.path.join(self._log_dir, "scalar_data.json") if log_path is None else log_path
        self._summ_writer.export_scalars_to_json(log_path)

    def flush(self):
        self._summ_writer.flush()

In [8]:
#@title trajectory sampling functions

def sample_trajectory(env, policy, max_path_length, render=False, render_mode=('rgb_array')):

    # initialize env for the beginning of a new rollout
    ob = env.reset()

    # init vars
    obs, acs, rewards, next_obs, terminals, image_obs = [], [], [], [], [], []
    steps = 0
    while True:

        # render image of the simulated env
        if render:
            if 'rgb_array' in render_mode:
                if hasattr(env, 'sim'):
                    image_obs.append(env.sim.render(camera_name='track', height=500, width=500)[::-1])
                else:
                    image_obs.append(env.render(mode=render_mode))
            if 'human' in render_mode:
                env.render(mode=render_mode)
                time.sleep(env.model.opt.timestep)

        # use the most recent ob to decide what to do
        obs.append(ob)
        ac = policy.get_action(ob)
        ac = ac[0]
        acs.append(ac)

        # take that action and record results
        ob, rew, done, _ = env.step(ac)

        # record result of taking that action
        steps += 1
        next_obs.append(ob)
        rewards.append(rew)

        # end the rollout if the rollout ended
        rollout_done = done or steps >= max_path_length
        terminals.append(rollout_done)

        if rollout_done:
            break

    return Path(obs, image_obs, acs, rewards, next_obs, terminals)

def sample_trajectories(env, policy, min_timesteps_per_batch, max_path_length, render=False, render_mode=('rgb_array')):
    """
        Collect rollouts until we have collected min_timesteps_per_batch steps.
    """
    timesteps_this_batch = 0
    paths = []
    while timesteps_this_batch < min_timesteps_per_batch:
        path = sample_trajectory(env, policy, max_path_length, render, render_mode)
        paths.append(path)
        timesteps_this_batch += get_pathlength(path)

    return paths, timesteps_this_batch

def sample_n_trajectories(env, policy, ntraj, max_path_length, render=False, render_mode=('rgb_array')):
    """
        Collect ntraj rollouts.
    """

    paths = [sample_trajectory(env, policy, max_path_length, render, render_mode)
             for _ in range(ntraj)]

    return paths

def Path(obs, image_obs, acs, rewards, next_obs, terminals):
    """
        Take info (separate arrays) from a single rollout
        and return it in a single dictionary
    """
    if image_obs != []:
        image_obs = np.stack(image_obs, axis=0)
    return {"observation" : np.array(obs, dtype=np.float32),
            "image_obs" : np.array(image_obs, dtype=np.uint8),
            "reward" : np.array(rewards, dtype=np.float32),
            "action" : np.array(acs, dtype=np.float32),
            "next_observation": np.array(next_obs, dtype=np.float32),
            "terminal": np.array(terminals, dtype=np.float32)}


def convert_listofrollouts(paths):
    """
        Take a list of rollout dictionaries
        and return separate arrays,
        where each array is a concatenation of that array from across the rollouts
    """
    observations = np.concatenate([path["observation"] for path in paths])
    actions = np.concatenate([path["action"] for path in paths])
    next_observations = np.concatenate([path["next_observation"] for path in paths])
    terminals = np.concatenate([path["terminal"] for path in paths])
    concatenated_rewards = np.concatenate([path["reward"] for path in paths])
    unconcatenated_rewards = [path["reward"] for path in paths]
    return observations, actions, next_observations, terminals, concatenated_rewards, unconcatenated_rewards

############################################
############################################

def get_pathlength(path):
    return len(path["reward"])

def normalize(data, mean, std, eps=1e-8):
    return (data-mean)/(std+eps)

def unnormalize(data, mean, std):
    return data*std+mean

def add_noise(data_inp, noiseToSignal=0.01):

    data = copy.deepcopy(data_inp) #(num data points, dim)

    #mean of data
    mean_data = np.mean(data, axis=0)

    #if mean is 0,
    #make it 0.000001 to avoid 0 issues later for dividing by std
    mean_data[mean_data == 0] = 0.000001

    #width of normal distribution to sample noise from
    #larger magnitude number = could have larger magnitude noise
    std_of_noise = mean_data * noiseToSignal
    for j in range(mean_data.shape[0]):
        data[:, j] = np.copy(data[:, j] + np.random.normal(
            0, np.absolute(std_of_noise[j]), (data.shape[0],)))

    return data




In [9]:
#@title Pytorch tools
#@markdown `ptu_build_mlp(..)` build a MLP network \\
#@markdown `ptu_from_numpy(..)` `ptu_to_numpy(..)` to convert torch.tensor to and from numpy.array

Activation = Union[str, nn.Module]


_str_to_activation = {
    'relu': nn.ReLU(),
    'tanh': nn.Tanh(),
    'leaky_relu': nn.LeakyReLU(),
    'sigmoid': nn.Sigmoid(),
    'selu': nn.SELU(),
    'softplus': nn.Softplus(),
    'identity': nn.Identity(),
}


def ptu_build_mlp(
        input_size: int,
        output_size: int,
        n_layers: int,
        size: int,
        activation: Activation = 'tanh',
        output_activation: Activation = 'identity',
):
    """
        Builds a feedforward neural network
        arguments:
            input_size: size of the input layer
            output_size: size of the output layer
            n_layers: number of hidden layers
            size: dimension of each hidden layer
            activation: activation of each hidden layer
            output_activation: activation of the output layer
        returns:
            an nn.Sequential module that applies the hidden layers followed by the output layer
    """
    if isinstance(activation, str):
        activation = _str_to_activation[activation]
    if isinstance(output_activation, str):
        output_activation = _str_to_activation[output_activation]
    layers = []
    in_size = input_size
    for _ in range(n_layers):
        layers.append(nn.Linear(in_size, size))
        layers.append(activation)
        in_size = size
    layers.append(nn.Linear(in_size, output_size))
    layers.append(output_activation)
    return nn.Sequential(*layers)


def ptu_from_numpy(*args, **kwargs):
    return torch.from_numpy(*args, **kwargs).float().to(ptu_device)


def ptu_to_numpy(tensor):
    return tensor.to('cpu').detach().numpy()


In [10]:
#@title Replay buffer
class ReplayBuffer(object):

    def __init__(self, max_size=1000000):

        self.max_size = max_size
        self.paths = []
        self.obs = None
        self.acs = None
        self.concatenated_rews = None
        self.next_obs = None
        self.terminals = None

    def add_rollouts(self, paths, noised=False):

        # add new rollouts into our list of rollouts
        for path in paths:
            self.paths.append(path)

        # convert new rollouts into their component arrays, and append them onto our arrays
        observations, actions, next_observations, terminals, concatenated_rews, unconcatenated_rews = convert_listofrollouts(paths)

        if noised:
            observations = add_noise(observations)
            next_observations = add_noise(next_observations)

        if self.obs is None:
            self.obs = observations[-self.max_size:]
            self.acs = actions[-self.max_size:]
            self.next_obs = next_observations[-self.max_size:]
            self.terminals = terminals[-self.max_size:]
            self.concatenated_rews = concatenated_rews[-self.max_size:]
        else:
            self.obs = np.concatenate([self.obs, observations])[-self.max_size:]
            self.acs = np.concatenate([self.acs, actions])[-self.max_size:]
            self.next_obs = np.concatenate(
                [self.next_obs, next_observations]
            )[-self.max_size:]
            self.terminals = np.concatenate(
                [self.terminals, terminals]
            )[-self.max_size:]
            self.concatenated_rews = np.concatenate(
                [self.concatenated_rews, concatenated_rews]
            )[-self.max_size:]


    def sample_random_rollouts(self, num_rollouts):

        rand_indices = np.random.permutation(len(self.paths))[:num_rollouts]
        # self.paths is a Python list, so index it element by element
        return [self.paths[i] for i in rand_indices]


    def sample_recent_rollouts(self, num_rollouts=1):

        return self.paths[-num_rollouts:]


    def sample_random_data(self, batch_size):

        assert self.obs.shape[0] == self.acs.shape[0] == self.concatenated_rews.shape[0] == self.next_obs.shape[0] == self.terminals.shape[0]
        rand_indices = np.random.permutation(self.obs.shape[0])[:batch_size]
        return self.obs[rand_indices], self.acs[rand_indices], self.concatenated_rews[rand_indices], self.next_obs[rand_indices], self.terminals[rand_indices]


    def sample_recent_data(self, batch_size=1, concat_rew=True):

        if concat_rew:
            return self.obs[-batch_size:], self.acs[-batch_size:], self.concatenated_rews[-batch_size:], self.next_obs[-batch_size:], self.terminals[-batch_size:]
        else:
            num_recent_rollouts_to_return = 0
            num_datapoints_so_far = 0
            index = -1
            while num_datapoints_so_far < batch_size:
                recent_rollout = self.paths[index]
                index -=1
                num_recent_rollouts_to_return +=1
                num_datapoints_so_far += get_pathlength(recent_rollout)
            rollouts_to_return = self.paths[-num_recent_rollouts_to_return:]
            observations, actions, next_observations, terminals, concatenated_rews, unconcatenated_rews = convert_listofrollouts(rollouts_to_return)
            return observations, actions, unconcatenated_rews, next_observations, terminals

In [11]:
#@title Critic
#@markdown You need to code the critic `update(..)` function

class BaseCritic(object):
    def update(self, ob_no, ac_na, next_ob_no, re_n, terminal_n):
        raise NotImplementedError

class BootstrappedContinuousCritic(nn.Module, BaseCritic):
    """
        Notes on notation:

        Prefixes and suffixes:
        ob - observation
        ac - action
        _no - this tensor should have shape (batch size /n/, observation dim)
        _na - this tensor should have shape (batch size /n/, action dim)
        _n  - this tensor should have shape (batch size /n/)

        Note: batch size /n/ is defined at runtime.
    """
    def __init__(self, hparams):
        super().__init__()
        self.ob_dim = hparams['ob_dim']
        self.ac_dim = hparams['ac_dim']
        self.discrete = hparams['discrete']
        self.size = hparams['size']
        self.n_layers = hparams['n_layers']
        self.learning_rate = hparams['learning_rate']

        # critic parameters
        self.num_target_updates = hparams['num_target_updates']
        self.num_grad_steps_per_target_update = hparams['num_grad_steps_per_target_update']
        self.gamma = hparams['gamma']
        self.critic_network = ptu_build_mlp(
            self.ob_dim,
            1,
            n_layers=self.n_layers,
            size=self.size,
        )
        self.critic_network.to(ptu_device)
        self.loss = nn.MSELoss()
        self.optimizer = optim.Adam(
            self.critic_network.parameters(),
            self.learning_rate,
        )

    def forward(self, obs):
        return self.critic_network(obs).squeeze(1)

    def forward_np(self, obs):
        obs = ptu_from_numpy(obs)
        predictions = self(obs)
        return ptu_to_numpy(predictions)

    def update(self, ob_no, ac_na, next_ob_no, reward_n, terminal_n):
        """
            Update the parameters of the critic.

            let sum_of_path_lengths be the sum of the lengths of the paths sampled from
                Agent.sample_trajectories
            let num_paths be the number of paths sampled from Agent.sample_trajectories

            arguments:
                ob_no: shape: (sum_of_path_lengths, ob_dim)
                next_ob_no: shape: (sum_of_path_lengths, ob_dim). The observation after taking one step forward
                reward_n: length: sum_of_path_lengths. Each element in reward_n is a scalar containing
                    the reward for each timestep
                terminal_n: length: sum_of_path_lengths. Each element in terminal_n is either 1 if the episode ended
                    at that timestep or 0 if the episode did not end

            returns:
                training loss
        """
        # Implement the pseudocode below: do the following
        # (self.num_target_updates * self.num_grad_steps_per_target_update) times:
        # every self.num_grad_steps_per_target_update steps (which includes the
        # first step), recompute the target values by
        #     a) calculating V(s') by querying the critic with next_ob_no
        #     b) and computing the target values as r(s, a) + gamma * V(s')
        # every time, update this critic using the observations and targets
        #     c) compute the loss, run the backward pass and take one gradient
        #        step, so that each set of targets receives
        #        self.num_grad_steps_per_target_update gradient steps.
        # HINT: don't forget to use terminal_n to cut off the V(s') (ie set it
        #       to 0) when a terminal state is reached
        # HINT: make sure to squeeze the output of the critic_network to ensure
        #       that its dimensions match the reward
        # HINT: remember that pytorch accumulate gradients. You can use zero_grad
        #       to reinitialize the gradient before/after any step.

        # Comment this out if you prefer to work with numpy arrays instead of pytorch tensors
        ob_no = ptu_from_numpy(ob_no)
        next_ob_no = ptu_from_numpy(next_ob_no)
        reward_n = ptu_from_numpy(reward_n)
        terminal_n = ptu_from_numpy(terminal_n).bool()

        TODO
        v = self.forward(...)

        v = v.detach() #pytorch Tensors

        v = v.copy() #numpy array

        self.optimizer.zero_grad()
        #my computations
        loss = self.loss(...)
        loss.backward()
        self.optimizer.step()

        v2 = self.forward(...)

        return loss.item()

In [12]:
#@title Policy (actor)
#@markdown You need to code the policy `update(..)` function

class BasePolicy(object, metaclass=abc.ABCMeta):
    def get_action(self, obs):
        raise NotImplementedError

    def update(self, obs, acs, **kwargs):
        """Return a dictionary of logging information."""
        raise NotImplementedError

    def save(self, filepath):
        raise NotImplementedError


class MLPPolicy(BasePolicy, nn.Module, metaclass=abc.ABCMeta):

    def __init__(self,
                 ac_dim,
                 ob_dim,
                 n_layers,
                 size,
                 discrete=False,
                 learning_rate=1e-4,
                 training=True,
                 nn_baseline=False,
                 **kwargs
                 ):
        super().__init__(**kwargs)

        # init vars
        self.ac_dim = ac_dim
        self.ob_dim = ob_dim
        self.n_layers = n_layers
        self.discrete = discrete
        self.size = size
        self.learning_rate = learning_rate
        self.training = training
        self.nn_baseline = nn_baseline

        if self.discrete:
            self.logits_na = ptu_build_mlp(input_size=self.ob_dim,
                                           output_size=self.ac_dim,
                                           n_layers=self.n_layers,
                                           size=self.size)
            self.logits_na.to(ptu_device)
            self.mean_net = None
            self.logstd = None
            self.optimizer = optim.Adam(self.logits_na.parameters(),
                                        self.learning_rate)
        else:
            self.logits_na = None
            self.mean_net = ptu_build_mlp(input_size=self.ob_dim,
                                      output_size=self.ac_dim,
                                      n_layers=self.n_layers, size=self.size)
            self.logstd = nn.Parameter(
                torch.zeros(self.ac_dim, dtype=torch.float32, device=ptu_device)
            )
            self.mean_net.to(ptu_device)
            self.logstd.to(ptu_device)
            self.optimizer = optim.Adam(
                itertools.chain([self.logstd], self.mean_net.parameters()),
                self.learning_rate
            )

        if nn_baseline:
            self.baseline = ptu_build_mlp(
                input_size=self.ob_dim,
                output_size=1,
                n_layers=self.n_layers,
                size=self.size,
            )
            self.baseline.to(ptu_device)
            self.baseline_optimizer = optim.Adam(
                self.baseline.parameters(),
                self.learning_rate,
            )
        else:
            self.baseline = None

    ##################################

    def save(self, filepath):
        torch.save(self.state_dict(), filepath)

    ##################################

    # query the policy with observation(s) to get selected action(s)
    def get_action(self, obs: np.ndarray) -> np.ndarray:
        if len(obs.shape) > 1:
            observation = obs
        else:
            observation = obs[None]

        observation_tensor = torch.tensor(observation, dtype=torch.float).to(ptu_device)
        action_distribution = self.forward(observation_tensor)
        return ptu_to_numpy(action_distribution.sample())


    # update/train this policy
    def update(self, observations_np, actions_np, advantages_np=None):

        # Comment this out if you prefer to work with numpy arrays instead of pytorch tensors
        observations = ptu_from_numpy(observations_np)
        actions = ptu_from_numpy(actions_np)
        advantages = ptu_from_numpy(advantages_np)

        # Compute the loss that should be optimized when training with policy gradient
        # HINT1: Recall that the expression that we want to MAXIMIZE
            # is the expectation over collected trajectories of:
            # sum_{t=0}^{T-1} [grad [log pi(a_t|s_t) * A_t ]]
        # HINT2: you will want to use the `log_prob` method on the distribution returned
            # by the `forward` method
        # HINT3: don't forget that `optimizer.step()` MINIMIZES a loss

        TODO
        lp = self.forward(....).log_prob(...)

        if not self.discrete:
            lp = lp.sum(1)

        loss = # grad equation

        # Optimize `loss` using `self.optimizer`
        # HINT: remember to `zero_grad` first
        TODO

        return loss.item()

    # This function defines the forward pass of the network. It returns a
    # `torch.distributions.Distribution` object, which allows considerable flexibility.
    def forward(self, observation: torch.Tensor):
        if self.discrete:
            return distributions.Categorical(logits=self.logits_na(observation))
        else:
            assert self.logstd is not None
            return distributions.Normal(
                self.mean_net(observation),
                torch.exp(self.logstd)[None],
            )

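The hints in `update` can be read as a surrogate loss. The following is a sketch, not the reference solution: it assumes the loss is averaged over the $N$ transitions in the batch (a sum differs only by a rescaling of the learning rate) and that the advantages $\hat{A}_i$ are treated as constants, so no gradient flows through them.

$$
L(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log \pi_\theta(a_i \mid s_i)\,\hat{A}_i
\quad\Longrightarrow\quad
\nabla_\theta L(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(a_i \mid s_i)\,\hat{A}_i
$$

Minimizing $L(\theta)$ with `optimizer.step()` is therefore gradient ascent on the policy gradient objective. For the diagonal Gaussian policy returned by `forward`, $\log \pi_\theta(a \mid s)$ is the sum of the per-dimension log-probabilities, which is what `lp.sum(1)` computes.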

In [None]:
#@title Actor-Critic agent
#@markdown We have all the ingredients (actor, critic, replay buffer...). Let's put them all together.
#@markdown You need to code the `estimate_advantage(..)` function

class BaseAgent(object):
    def __init__(self, **kwargs):
        super(BaseAgent, self).__init__(**kwargs)

    def train(self) -> dict:
        """Return a dictionary of logging information."""
        raise NotImplementedError

    def add_to_replay_buffer(self, paths):
        raise NotImplementedError

    def sample(self, batch_size):
        raise NotImplementedError

    def save(self, path):
        raise NotImplementedError

class ACAgent(BaseAgent):
    def __init__(self, env, agent_params):
        super(ACAgent, self).__init__()

        self.env = env
        self.agent_params = agent_params

        self.gamma = self.agent_params['gamma']
        self.standardize_advantages = self.agent_params['standardize_advantages']

        self.actor = MLPPolicy(
            self.agent_params['ac_dim'],
            self.agent_params['ob_dim'],
            self.agent_params['n_layers'],
            self.agent_params['size'],
            self.agent_params['discrete'],
            self.agent_params['learning_rate'],
        )
        self.critic = BootstrappedContinuousCritic(self.agent_params)

        self.replay_buffer = ReplayBuffer()

    def train(self, ob_no, ac_na, re_n, next_ob_no, terminal_n):
        # for agent_params['num_critic_updates_per_agent_update'] steps,
        #     update the critic

        loss = OrderedDict()

        for _ in range(self.agent_params['num_critic_updates_per_agent_update']):
            loss['Critic_Loss'] = self.critic.update(
                ob_no, ac_na, next_ob_no, re_n, terminal_n)

        advantages = self.estimate_advantage(ob_no, next_ob_no, re_n, terminal_n)

        # for agent_params['num_actor_updates_per_agent_update'] steps,
        #     update the actor
        for _ in range(self.agent_params['num_actor_updates_per_agent_update']):
            loss['Actor_Loss'] = self.actor.update(
                ob_no, ac_na, advantages)

        return loss

    def estimate_advantage(self, ob_no, next_ob_no, re_n, terminal_n):
        # Implement the following pseudocode:
        # 1) query the critic with ob_no, to get V(s)
        # 2) query the critic with next_ob_no, to get V(s')
        # 3) estimate the Q value as Q(s, a) = r(s, a) + gamma*V(s')
        # HINT: Remember to cut off the V(s') term (i.e. set it to 0) at terminal states (i.e. where terminal_n == 1)
        # 4) calculate advantage (adv_n) as A(s, a) = Q(s, a) - V(s)

        # Comment these lines out if you prefer to work with NumPy arrays rather than PyTorch tensors
        ob_no = ptu_from_numpy(ob_no)
        next_ob_no = ptu_from_numpy(next_ob_no)
        re_n = ptu_from_numpy(re_n)
        terminal_n = ptu_from_numpy(terminal_n).bool()

        TODO

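        # NOTE: the standardization below uses np.mean / np.std, and `actor.update`
        # converts its inputs with `ptu_from_numpy`, so adv_n should be a NumPy array here.
        if self.standardize_advantages:
            adv_n = (adv_n - np.mean(adv_n)) / (np.std(adv_n) + 1e-8)
        return adv_n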

    def add_to_replay_buffer(self, paths):
        self.replay_buffer.add_rollouts(paths)

    def sample(self, batch_size):
        return self.replay_buffer.sample_recent_data(batch_size)
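
The pseudocode in `estimate_advantage` can be tried out on made-up numbers before wiring it to the critic. The sketch below is only an illustration in plain NumPy: the function name `estimate_advantage_sketch` and its arguments are hypothetical, and the critic predictions `v_s` and `v_next` are passed in as arrays instead of being queried from `self.critic`.

```python
import numpy as np

def estimate_advantage_sketch(v_s, v_next, rewards, terminals, gamma, standardize=True):
    """Toy NumPy version of the four pseudocode steps (hypothetical helper).

    v_s, v_next: stand-ins for the critic outputs V(s) and V(s').
    """
    v_s = np.asarray(v_s, dtype=np.float64)
    v_next = np.asarray(v_next, dtype=np.float64)
    rewards = np.asarray(rewards, dtype=np.float64)
    terminals = np.asarray(terminals, dtype=bool)

    # Q(s, a) = r(s, a) + gamma * V(s'), with V(s') cut off (set to 0) at terminal states
    q = rewards + gamma * v_next * (~terminals)
    # A(s, a) = Q(s, a) - V(s)
    adv = q - v_s
    if standardize:
        # Same standardization as in the agent: zero mean, unit standard deviation
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)
    return adv

# Made-up batch of three transitions; the last one ends the episode
adv_raw = estimate_advantage_sketch(
    v_s=[1.0, 2.0, 0.5], v_next=[2.0, 0.5, 3.0],
    rewards=[1.0, 1.0, 1.0], terminals=[0, 0, 1],
    gamma=0.9, standardize=False)
adv_std = estimate_advantage_sketch(
    v_s=[1.0, 2.0, 0.5], v_next=[2.0, 0.5, 3.0],
    rewards=[1.0, 1.0, 1.0], terminals=[0, 0, 1],
    gamma=0.9, standardize=True)
print(adv_raw)
print(adv_std)
```

In the third transition the value `v_next = 3.0` is ignored because the state is terminal, so the advantage there reduces to `r - V(s)`.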

In [None]:
#@title Define RL_trainer to perform the control loop
#@markdown It takes care of collecting data, updating the networks, managing the environment, logging...

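# `tqdm.tqdm_notebook` is deprecated; the notebook progress bar lives in `tqdm.notebook`
from tqdm.notebook import tqdm as tqdm_notebook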

# how many rollouts to save as videos to tensorboard
MAX_NVIDEO = 2
MAX_VIDEO_LEN = 40 # we overwrite this in the code below


class RL_Trainer(object):

    def __init__(self, params):

        #############
        ## INIT
        #############

        # Get params, create logger
        self.params = params
        self.logger = Logger(self.params['logdir'])

        # Set random seeds
        seed = self.params['seed']
        np.random.seed(seed)
        torch.manual_seed(seed)
        ptu_init_gpu(
            use_gpu=not self.params['no_gpu'],
            gpu_id=self.params['which_gpu']
        )

        #############
        ## ENV
        #############

        # Make the gym environment
        self.env = gym.make(self.params['env_name'])
        self.env.seed(seed)

        # Maximum length for episodes
        self.params['ep_len'] = self.params['ep_len'] or self.env.spec.max_episode_steps
        global MAX_VIDEO_LEN
        MAX_VIDEO_LEN = self.params['ep_len']

        # Is this env continuous, or discrete?
        discrete = isinstance(self.env.action_space, gym.spaces.Discrete)
        # Are the observations images?
        img = len(self.env.observation_space.shape) > 2

        self.params['agent_params']['discrete'] = discrete

        # Observation and action sizes

        ob_dim = self.env.observation_space.shape if img else self.env.observation_space.shape[0]
        ac_dim = self.env.action_space.n if discrete else self.env.action_space.shape[0]
        self.params['agent_params']['ac_dim'] = ac_dim
        self.params['agent_params']['ob_dim'] = ob_dim

        # simulation timestep, will be used for video saving
        if 'model' in dir(self.env):
            self.fps = 1/self.env.model.opt.timestep
        elif 'video.frames_per_second' in self.env.env.metadata.keys():
            self.fps = self.env.env.metadata['video.frames_per_second']
        else:
            self.fps = 10


        #############
        ## AGENT
        #############

        agent_class = self.params['agent_class']
        self.agent = agent_class(self.env, self.params['agent_params'])

    def run_training_loop(self, n_iter, collect_policy, eval_policy,
                          initial_expertdata=None, relabel_with_expert=False,
                          start_relabel_with_expert=1, expert_policy=None):
        """
        :param n_iter:  number of iterations
        :param collect_policy:
        :param eval_policy:
        """

        # init vars at beginning of training
        self.total_envsteps = 0
        self.start_time = time.time()

        print_period = 1

        for itr in tqdm_notebook(range(n_iter), desc='Training'):
            if itr % print_period == 0:
                print("\n********** Iteration ", itr, " of ", n_iter, "************")

            # decide if videos should be rendered/logged at this iteration
            if itr % self.params['video_log_freq'] == 0 and self.params['video_log_freq'] != -1:
                self.logvideo = True
            else:
                self.logvideo = False

            # decide if metrics should be logged
            if self.params['scalar_log_freq'] == -1:
                self.logmetrics = False
            elif itr % self.params['scalar_log_freq'] == 0:
                self.logmetrics = True
            else:
                self.logmetrics = False

            use_batchsize = self.params['batch_size']
            if itr==0:
                use_batchsize = self.params['batch_size_initial']
            paths, envsteps_this_batch, train_video_paths = (
                self.collect_training_trajectories(
                    itr, collect_policy, use_batchsize)
            )

            self.total_envsteps += envsteps_this_batch

            # relabel the collected obs with actions from a provided expert policy
            if relabel_with_expert and itr>=start_relabel_with_expert:
                paths = self.do_relabel_with_expert(expert_policy, paths)

            # add collected data to replay buffer
            self.agent.add_to_replay_buffer(paths)

            # train agent (using sampled data from replay buffer)
            if itr % print_period == 0:
                print("\nTraining agent...")
            all_logs = self.train_agent()

            # log/save
            if self.logvideo or self.logmetrics:
                self.perform_logging(itr, paths, eval_policy, train_video_paths, all_logs)

                if self.params['save_params']:
                    self.agent.save('{}/agent_itr_{}.pt'.format(self.params['logdir'], itr))

    ####################################
    ####################################
    def collect_training_trajectories(self, itr, collect_policy, batch_size, save_expert_data_to_disk=False):
        """
        :param itr:
        :param collect_policy:  the current policy, used to collect the data
        :param batch_size:  the number of transitions we collect
        :return:
            paths: a list of trajectories
            envsteps_this_batch: the sum over the numbers of environment steps in paths
            train_video_paths: paths which also contain videos for visualization purposes
        """

        print("\nCollecting data to be used for training...")
        envsteps_this_batch = 0
        paths = []
        while envsteps_this_batch <= batch_size:
            paths.extend(sample_n_trajectories(
                    self.env,
                    collect_policy,
                    max((batch_size - envsteps_this_batch) // self.params['ep_len'], 1),
                    max_path_length=self.params['ep_len'],
                ))
            envsteps_this_batch = sum(path['observation'].shape[0] for path in paths)

        # collect more rollouts with the same policy, to be saved as videos in tensorboard
        # note: here, we collect MAX_NVIDEO rollouts, each of length MAX_VIDEO_LEN
        train_video_paths = None
        if self.logvideo:
            print('\nCollecting train rollouts to be used for saving videos...')
            train_video_paths = sample_n_trajectories(self.env, collect_policy, MAX_NVIDEO, MAX_VIDEO_LEN, True)

        return paths, envsteps_this_batch, train_video_paths

    def train_agent(self):
        # print('\nTraining agent using sampled data from replay buffer...')
        all_logs = []
        for train_step in range(self.params['num_agent_train_steps_per_iter']):
            # Sample some data from the data buffer
            ob_batch, ac_batch, re_batch, next_ob_batch, terminal_batch = \
                self.agent.sample(self.params['train_batch_size'])

            # Use the sampled data to train an agent
            train_log = self.agent.train(
                ob_batch, ac_batch, re_batch, next_ob_batch, terminal_batch)
            all_logs.append(train_log)
        return all_logs

    ####################################
    ####################################
    def perform_logging(self, itr, paths, eval_policy, train_video_paths, all_logs):

        last_log = all_logs[-1]

        #######################

        # collect eval trajectories, for logging
        print("\nCollecting data for eval...")
        eval_paths, eval_envsteps_this_batch = sample_trajectories(self.env, eval_policy, self.params['eval_batch_size'], self.params['ep_len'])

        # save eval rollouts as videos in tensorboard event file
        if self.logvideo and train_video_paths is not None:
            print('\nCollecting video rollouts eval')
            eval_video_paths = sample_n_trajectories(self.env, eval_policy, MAX_NVIDEO, MAX_VIDEO_LEN, True)

            #save train/eval videos
            print('\nSaving train rollouts as videos...')
            self.logger.log_paths_as_videos(train_video_paths, itr, fps=self.fps, max_videos_to_save=MAX_NVIDEO,
                                            video_title='train_rollouts')
            self.logger.log_paths_as_videos(eval_video_paths, itr, fps=self.fps, max_videos_to_save=MAX_NVIDEO,
                                            video_title='eval_rollouts')

        #######################

        # save eval metrics
        if self.logmetrics:
            # returns, for logging
            train_returns = [path["reward"].sum() for path in paths]
            eval_returns = [eval_path["reward"].sum() for eval_path in eval_paths]

            # episode lengths, for logging
            train_ep_lens = [len(path["reward"]) for path in paths]
            eval_ep_lens = [len(eval_path["reward"]) for eval_path in eval_paths]

            # decide what to log
            logs = OrderedDict()
            logs["Eval_AverageReturn"] = np.mean(eval_returns)
            logs["Eval_StdReturn"] = np.std(eval_returns)
            logs["Eval_MaxReturn"] = np.max(eval_returns)
            logs["Eval_MinReturn"] = np.min(eval_returns)
            logs["Eval_AverageEpLen"] = np.mean(eval_ep_lens)

            logs["Train_AverageReturn"] = np.mean(train_returns)
            logs["Train_StdReturn"] = np.std(train_returns)
            logs["Train_MaxReturn"] = np.max(train_returns)
            logs["Train_MinReturn"] = np.min(train_returns)
            logs["Train_AverageEpLen"] = np.mean(train_ep_lens)

            logs["Train_EnvstepsSoFar"] = self.total_envsteps
            logs["TimeSinceStart"] = time.time() - self.start_time
            logs.update(last_log)

            if itr == 0:
                self.initial_return = np.mean(train_returns)
            logs["Initial_DataCollection_AverageReturn"] = self.initial_return

            # perform the logging
            for key, value in logs.items():
                print('{} : {}'.format(key, value))
                self.logger.log_scalar(value, key, itr)
            print('Done logging...\n\n')

            self.logger.flush()
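
The arithmetic inside `collect_training_trajectories` is easier to follow without an environment. The sketch below replays only the loop's bookkeeping; `plan_rollout_requests` and the episode lengths fed to it are hypothetical and stand in for the calls to `sample_n_trajectories`.

```python
def plan_rollout_requests(batch_size, ep_len, episode_lengths):
    """Replay the collection loop's bookkeeping (hypothetical helper).

    episode_lengths: made-up lengths of the episodes that would be sampled, in order.
    Returns the number of trajectories requested on each pass and the total env steps.
    """
    lengths = iter(episode_lengths)
    envsteps = 0
    requests = []
    while envsteps <= batch_size:
        # Ask for as many full-length episodes as still fit, but always at least one
        n_traj = max((batch_size - envsteps) // ep_len, 1)
        requests.append(n_traj)
        for _ in range(n_traj):
            envsteps += next(lengths)
    return requests, envsteps

# Episodes that stop early (150 steps each) need several passes to fill the batch
early = plan_rollout_requests(1000, 200, [150] * 10)
# Full-length episodes hit exactly 1000 steps, and `<=` triggers one more rollout
full = plan_rollout_requests(1000, 200, [200] * 10)
print(early, full)
```

Because the loop condition is `<=`, the collected batch always ends up strictly larger than `batch_size`.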


In [None]:
#@title runtime arguments

class ACArgs:

  def __getitem__(self, key):
    return getattr(self, key)

  def __setitem__(self, key, val):
    setattr(self, key, val)

  def __contains__(self, key):
    return hasattr(self, key)

  env_name = 'CartPole-v0' #@param ['CartPole-v0', 'HalfCheetah-v4', 'InvertedPendulum-v4', 'InvertedPendulumBulletEnv-v0', 'InvertedPendulumSwingupBulletEnv-v0', 'BipedalWalker-v3']

  ## Check the intro on how to set ep_len
  ## and discount for each environment
  ep_len = 200 #@param {type: "integer"}

  #@markdown batches and steps
  batch_size = 1000 #@param {type: "integer"}
  eval_batch_size = 400 #@param {type: "integer"}

  n_iter = 100 #@param {type: "integer"}
  num_agent_train_steps_per_iter = 1 #@param {type: "integer"}
  num_actor_updates_per_agent_update = 1 #@param {type: "integer"}
  num_critic_updates_per_agent_update = 1 #@param {type: "integer"}

  #@markdown Actor-Critic parameters
  discount = 0.9 #@param {type: "number"}
  learning_rate = 5e-3 #@param {type: "number"}
  dont_standardize_advantages = False #@param {type: "boolean"}
  num_target_updates = 10 #@param {type: "integer"}
  num_grad_steps_per_target_update = 10 #@param {type: "integer"}
  n_layers = 2 #@param {type: "integer"}
  size = 64 #@param {type: "integer"}

  #@markdown system
  save_params = True #@param {type: "boolean"}
  no_gpu = False #@param {type: "boolean"}
  which_gpu = 0 #@param {type: "integer"}
  seed = 1 #@param {type: "integer"}

  #@markdown logging
  ## default is to not log video so
  ## that logs are small enough
  video_log_freq = -1 #@param {type: "integer"}
  scalar_log_freq = 10 #@param {type: "integer"}


args = ACArgs()


if args['video_log_freq'] > 0:
  import warnings
  warnings.warn(
      '''\nLogging videos will make eventfiles too large.'''
      '''\nSet video_log_freq = -1 to avoid that.''')
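
`ACArgs` behaves like a dictionary because it forwards item access to attribute access. A minimal sketch of that pattern follows; the class `DictLikeArgs` and its two fields are made up for illustration and are not part of the notebook.

```python
class DictLikeArgs:
    """Hypothetical stand-in for ACArgs with two made-up fields."""
    env_name = 'CartPole-v0'
    discount = 0.9

    def __getitem__(self, key):
        # args['x'] reads the attribute args.x
        return getattr(self, key)

    def __setitem__(self, key, val):
        # args['x'] = v sets an instance attribute, leaving the class default untouched
        setattr(self, key, val)

    def __contains__(self, key):
        # 'x' in args checks whether the attribute exists
        return hasattr(self, key)

demo = DictLikeArgs()
demo['logdir'] = '/tmp/example'   # new key, stored on the instance
demo['discount'] = 0.99           # shadows the class-level default for this instance only
print(demo['env_name'], demo.logdir, demo['discount'])
```

One difference from a real `dict` is that a missing key raises `AttributeError` rather than `KeyError`, which matters if the calling code catches `KeyError`.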

In [None]:
#@title Define AC trainer
#@markdown This calls the RL_trainer with the specific parameters for Actor-Critic

class AC_Trainer(object):

    def __init__(self, params):

        #####################
        ## SET AGENT PARAMS
        #####################

        computation_graph_args = {
            'n_layers': params['n_layers'],
            'size': params['size'],
            'learning_rate': params['learning_rate'],
            'num_target_updates': params['num_target_updates'],
            'num_grad_steps_per_target_update': params['num_grad_steps_per_target_update'],
            }

        estimate_advantage_args = {
            'gamma': params['discount'],
            'standardize_advantages': not(params['dont_standardize_advantages']),
        }

        train_args = {
            'num_agent_train_steps_per_iter': params['num_agent_train_steps_per_iter'],
            'num_critic_updates_per_agent_update': params['num_critic_updates_per_agent_update'],
            'num_actor_updates_per_agent_update': params['num_actor_updates_per_agent_update'],
        }

        agent_params = {**computation_graph_args, **estimate_advantage_args, **train_args}

        self.params = params
        self.params['agent_class'] = ACAgent
        self.params['agent_params'] = agent_params
        self.params['train_batch_size'] = params['batch_size']
        self.params['batch_size_initial'] = self.params['batch_size']
        self.params['non_atari_colab_env'] = True

        ################
        ## RL TRAINER
        ################

        self.rl_trainer = RL_Trainer(self.params)

    def run_training_loop(self):

        self.rl_trainer.run_training_loop(
            self.params['n_iter'],
            collect_policy = self.rl_trainer.agent.actor,
            eval_policy = self.rl_trainer.agent.actor,
            )


In [None]:
#@title Create directories for logging

data_path = '''/content/data'''

os.makedirs(data_path, exist_ok=True)

logdir = 'ac_' + args.env_name + '_' + time.strftime("%d-%m-%Y_%H-%M-%S")
logdir = os.path.join(data_path, logdir)
args['logdir'] = logdir
os.makedirs(logdir, exist_ok=True)

print("LOGGING TO: ", logdir)

In [None]:
#@title Tensorboard panel
#@markdown You can visualize your runs with tensorboard from within the notebook

#@markdown You might need to refresh the panel after you start training

## requires tensorflow==2.3.0
%load_ext tensorboard
%tensorboard --logdir /content/data/

In [None]:
#@title run training
trainer = AC_Trainer(args)
trainer.run_training_loop()

In [None]:
#@title Visualize a test run on video
if args['env_name'] in ['InvertedPendulumBulletEnv-v0', 'InvertedPendulumSwingupBulletEnv-v0']:
  env = wrap_env(gym.make(args['env_name']))
else:
  env = wrap_env(gym.make(args['env_name'], render_mode='rgb_array'))


obs = env.reset()
term = False
i = 0
while not term:
    i += 1
    if args['env_name'] in ['InvertedPendulumBulletEnv-v0', 'InvertedPendulumSwingupBulletEnv-v0']:
      env.render(mode='rgb_array')
    else:
      env.render()
    obs, rew, term, _ = env.step(trainer.rl_trainer.agent.actor.get_action(obs)[0])

env.close()
print('Loading video...',i)
show_video()

In [None]:
#@title Download results
#@markdown Download the contents of the data folder as a zip file

!zip -r /content/data.zip /content/data

from google.colab import files
files.download("/content/data.zip")

In [None]:
#@title Code to plot different metrics
#@markdown This is just an example of the things you can do.
#@markdown You might need to change the tag selected, the labels,
#@markdown the file names, etc.

#@markdown **IMPORTANT:** If you run the same experiment multiple times, this will
#@markdown also plot error bars, but remember to **change the seed** of the random
#@markdown number generator.

#@markdown You can also run this cell in a separate colab where you upload the
#@markdown data folder (see https://colab.research.google.com/notebooks/io.ipynb#scrollTo=BaCkyg5CV5jF)

# Plotting example; reads the event files through the tf.compat.v1 API (available in TensorFlow 2.x)

import glob
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline

def get_section_results(files, tag):
    data = []
    for file in files:
        row = []
        for e in tf.compat.v1.train.summary_iterator(file):
            for v in e.summary.value:
                if v.tag == tag:
                    row.append(v.simple_value)
        data.append(row)
    return data


#logfile = 'data/my_experiment/events*'
all_logdir= data_path+'/ac_' + args.env_name + '_*'
logfile = all_logdir+'/events*'
eventfiles = glob.glob(logfile)

tag = 'Train_AverageReturn'
X = get_section_results(eventfiles, tag)
for j, row in enumerate(X):
    for i, x in enumerate(row):
        print('Experiment {:d} | Iteration {:d} | {}: {} '.format(j, i, tag, x))

color = 'r'
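# Truncate every run to the shortest one so the rows stack into a 2-D array
min_len = min(len(row) for row in X)
X = np.array([row[:min_len] for row in X])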
mean_plot = X.mean(axis=0)
std_plot = X.std(axis=0)
iters = np.arange(len(mean_plot))
plt.plot(iters,mean_plot,color, label=tag)
plt.fill_between(iters, mean_plot-std_plot, mean_plot+std_plot, color=color, alpha=0.2)
plt.ylabel('reward')
plt.xlabel('iteration')
plt.legend()
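
The shaded band in the plot is the per-iteration mean plus and minus one standard deviation across runs. The sketch below shows that computation on made-up return curves; `aggregate_runs` is a hypothetical helper, and it assumes every run logged the same number of iterations.

```python
import numpy as np

def aggregate_runs(rows):
    """Stack per-run curves and return (mean, std) per iteration (hypothetical helper)."""
    X = np.array(rows, dtype=float)   # shape: (n_runs, n_iterations)
    return X.mean(axis=0), X.std(axis=0)

# Two made-up runs (e.g. two seeds) with three logged iterations each
mean_curve, std_curve = aggregate_runs([[10.0, 20.0, 30.0],
                                        [14.0, 20.0, 40.0]])
lower, upper = mean_curve - std_curve, mean_curve + std_curve
print(mean_curve, std_curve)
```

With a single run the standard deviation is zero everywhere, so the band collapses onto the mean curve; this is why several seeds are needed to get meaningful error bars.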