# Tutorial For ppo_torch

In this notebook, I will go through how PPO works and how it is implemented in the `ppo_torch` library. This notebook does not contain any training code, in the hope that the training loop is self-explanatory. Here is what the notebook covers:

- [Creating multi-layer perceptrons](#mlp)
- [Creating Actors](#actors_create)
- [Updating Actors](#actors_update)
- [Creating Critics](#critics_create)
- [Updating Critics](#critics_update)
- [Replay Buffer](#replay_buffer)

In [1]:
import numpy as np
import torch
from torch import nn

from actor import ActorContinuous, ActorDiscrete
from critic import Critic
from mlp import mlp

## Creating Multi-Layer Perceptrons
<a id='mlp'></a>

The function that creates a multi-layer perceptron, `mlp()`, is defined in `mlp.py`. It takes two lists:

- `feature_sizes`: a list of the size of each layer of the mlp
- `activation`: a list of the activation functions for each layer of the mlp

For example, if the MLP has an input size of 8, an output size of 4, and 2 hidden layers of size 16, we define `nn_sizes = [8, 16, 16, 4]`. This gives three linear layers, listed here as (input size, output size):

- Input Layer (8, 16)
- Hidden Layer (16, 16)
- Output Layer (16, 4)

Each linear layer is followed by its own activation, so to use ReLU everywhere we define `activations = [nn.ReLU, nn.ReLU, nn.ReLU]`. Note that the entries are the activation classes themselves, not instances.

In [2]:
nn_sizes = [8, 16, 16, 4]
activations = [nn.ReLU, nn.ReLU, nn.ReLU]
mlp_model = mlp(nn_sizes, activations)
print(mlp_model)

Sequential(
  (0): Linear(in_features=8, out_features=16, bias=True)
  (1): ReLU()
  (2): Linear(in_features=16, out_features=16, bias=True)
  (3): ReLU()
  (4): Linear(in_features=16, out_features=4, bias=True)
  (5): ReLU()
)
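
To make the pairing of layer sizes and activations concrete, here is a minimal sketch of how such a builder could be written. The name `mlp_sketch` is hypothetical and the body is an assumption based only on the printed `Sequential` above; the actual code in `mlp.py` may differ.

```python
import torch
from torch import nn


def mlp_sketch(feature_sizes, activations):
    """Hypothetical stand-in for mlp(): one Linear per consecutive pair of
    sizes, each followed by the matching activation."""
    if len(activations) != len(feature_sizes) - 1:
        raise ValueError("need exactly one activation per linear layer")
    layers = []
    for in_size, out_size, act in zip(feature_sizes[:-1], feature_sizes[1:], activations):
        layers.append(nn.Linear(in_size, out_size))
        layers.append(act())  # activations are passed as classes, so instantiate here
    return nn.Sequential(*layers)


sketch_model = mlp_sketch([8, 16, 16, 4], [nn.ReLU, nn.ReLU, nn.ReLU])
sketch_out = sketch_model(torch.ones(1, 8))
print(sketch_model)
print(sketch_out.shape)
```

With four sizes and three activations, the sketch yields six modules (three `Linear`, three `ReLU`), matching the structure printed above.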


## Creating Actors
<a id='actors_create'></a>

We set `obs_dim = 8` and `act_dim = 4`. The hidden layer has both its `in_features` and `out_features` set to 16, which gives `hidden_dim = [16, 16]`. We again use ReLU for all activations, so `activations = [nn.ReLU, nn.ReLU, nn.ReLU]`.

Using this setup, we can define actors for both continuous and discrete action spaces. First, we define a discrete actor using the `ActorDiscrete` class, which is defined in `actor.py`.

In [3]:
obs_dim = 8
act_dim = 4
hidden_dim = [16, 16]
activations = [nn.ReLU, nn.ReLU, nn.ReLU]

actor_d = ActorDiscrete(obs_dim, act_dim, hidden_dim, activations)

In [4]:
print(actor_d.net)

Sequential(
  (0): Linear(in_features=8, out_features=16, bias=True)
  (1): ReLU()
  (2): Linear(in_features=16, out_features=16, bias=True)
  (3): ReLU()
  (4): Linear(in_features=16, out_features=4, bias=True)
  (5): ReLU()
)


Here we can see that the neural network is stored as `actor_d.net`. Next, we look at the output of `actor_d.forward(obs)`, which can also be invoked directly as `actor_d(obs)`.

In [5]:
obs = torch.ones(1, 8)
pi, _ = actor_d(obs)
print(pi)
print(pi.sample())

Categorical(logits: torch.Size([1, 4]))
tensor([2])


We see that the output is not an action but a probability distribution. For discrete actions, the output of `actor_d` is a `Categorical` distribution, which we can sample with `pi.sample()`. The sample is the index of the chosen action, so here it can only be 0, 1, 2, or 3.
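
As a rough illustration of the idea, the sketch below builds a `Categorical` distribution from network outputs and samples an action index from it. The class `DiscreteActorSketch`, its small network, and the meaning of the second return value (the log-probability of a supplied action) are all assumptions made for this example; they are not taken from `actor.py`.

```python
import torch
from torch import nn
from torch.distributions import Categorical


class DiscreteActorSketch(nn.Module):
    """Hypothetical discrete actor; not the library's ActorDiscrete."""

    def __init__(self, obs_dim, act_dim, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, act=None):
        # Treat the network outputs as unnormalized log-probabilities (logits).
        pi = Categorical(logits=self.net(obs))
        # Assumed convention: also return log pi(act | obs) when an action is given.
        logp = pi.log_prob(act) if act is not None else None
        return pi, logp


actor_sketch = DiscreteActorSketch(obs_dim=8, act_dim=4)
sketch_obs = torch.ones(1, 8)
pi_d, _ = actor_sketch(sketch_obs)
action = pi_d.sample()                    # one action index per batch row
_, logp = actor_sketch(sketch_obs, action)
print(action, pi_d.probs, logp)
```

The probabilities in `pi_d.probs` sum to 1 across the four actions, and `logp` is the quantity a policy-gradient method such as PPO typically needs when forming its loss.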

Next we do the same thing for the actor for continuous action spaces using the `ActorContinuous` class.

In [6]:
obs_dim = 8
act_dim = 4
hidden_dim = [16, 16]
activations = [nn.ReLU, nn.ReLU, nn.ReLU]

actor_c = ActorContinuous(obs_dim, act_dim, hidden_dim, activations)

In [7]:
print(actor_c.net)

Sequential(
  (0): Linear(in_features=8, out_features=16, bias=True)
  (1): ReLU()
  (2): Linear(in_features=16, out_features=16, bias=True)
  (3): ReLU()
  (4): Linear(in_features=16, out_features=4, bias=True)
  (5): ReLU()
)


In [8]:
obs = torch.ones(1, 8)
pi, _ = actor_c(obs)
print(pi)
print(pi.sample())

Normal(loc: torch.Size([1, 4]), scale: torch.Size([1, 4]))
tensor([[0.7370, 0.2345, 1.5332, 0.1561]])


Since the action space is continuous, the output of `actor_c.forward()` is a normal distribution whose mean `pi.loc` equals `actor_c.net(obs)` and whose standard deviation `pi.scale` is fixed at `np.exp(-0.5)`, approximately 0.6065.

In [9]:
pi.loc, actor_c.net(obs), pi.scale

(tensor([[0.0431, 0.1402, 0.0000, 0.0000]], grad_fn=<ReluBackward0>),
 tensor([[0.0431, 0.1402, 0.0000, 0.0000]], grad_fn=<ReluBackward0>),
 tensor([[0.6065, 0.6065, 0.6065, 0.6065]], grad_fn=<ExpandBackward>))
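
The relationship between the network output, `pi.loc`, and `pi.scale` can be sketched as follows. `ContinuousActorSketch` is a hypothetical class written for this example. Storing the log standard deviation as an `nn.Parameter` initialized to -0.5 is an assumption; the notebook output only shows the resulting value.

```python
import numpy as np
import torch
from torch import nn
from torch.distributions import Normal


class ContinuousActorSketch(nn.Module):
    """Hypothetical continuous actor; not the library's ActorContinuous."""

    def __init__(self, obs_dim, act_dim, hidden=16, log_std_init=-0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )
        # Assumption: a state-independent log standard deviation per action dimension.
        self.log_std = nn.Parameter(torch.full((act_dim,), log_std_init))

    def forward(self, obs, act=None):
        mean = self.net(obs)
        std = torch.exp(self.log_std).expand_as(mean)
        pi = Normal(mean, std)
        # Sum per-dimension log-probabilities to get one value per batch row.
        logp = pi.log_prob(act).sum(dim=-1) if act is not None else None
        return pi, logp


actor_c_sketch = ContinuousActorSketch(obs_dim=8, act_dim=4)
sketch_obs = torch.ones(1, 8)
pi_c, _ = actor_c_sketch(sketch_obs)
sample = pi_c.sample()
_, logp_c = actor_c_sketch(sketch_obs, sample)
print(pi_c.loc, pi_c.scale, logp_c)
```

In this sketch the mean depends on the observation while the standard deviation does not, which is why every entry of `pi_c.scale` equals `np.exp(-0.5)`.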

## Updating Actors
<a id='actors_update'></a>

## Creating Critics
<a id='critics_create'></a>

We set `obs_dim = 8`. Since the critic outputs the value function, its output always has dimension 1. As with the actor, we set `hidden_dim = [16, 16]` and use `activations = [nn.ReLU, nn.ReLU, nn.ReLU]`.

In [10]:
obs_dim = 8
hidden_dim = [16, 16]
activations = [nn.ReLU, nn.ReLU, nn.ReLU]
critic = Critic(obs_dim, hidden_dim, activations)

We can see that the output of the critic is simply a one-dimensional value. For both the continuous and discrete actor, the st

In [12]:
obs = torch.ones(1, 8)
val = critic(obs)
print(val)

tensor([[0.0974]], grad_fn=<ReluBackward0>)
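
For comparison with the actors, a critic can be sketched as a network that maps an observation to a single scalar. `CriticSketch` below is a hypothetical class for illustration only, not the library's `Critic`, and its layer layout is an assumption.

```python
import torch
from torch import nn


class CriticSketch(nn.Module):
    """Hypothetical critic: observation in, one value estimate out."""

    def __init__(self, obs_dim, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # the value function is a single scalar
        )

    def forward(self, obs):
        return self.net(obs)


critic_sketch = CriticSketch(obs_dim=8)
batch_values = critic_sketch(torch.ones(5, 8))  # a batch of 5 observations
print(batch_values.shape)
```

Each row of the batch produces one value estimate, so a batch of 5 observations yields a tensor of shape `(5, 1)`.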
