<a href="https://colab.research.google.com/github/LondonNode/Pearl-tutorials/blob/main/1_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install pearll

# Introduction

This notebook is a tutorial for the `models` module within Pearl. The module provides the neural network structures used to approximate the policy or value function being optimized. All networks are built with the *PyTorch* framework.

The key features implemented are as follows:

| Features                 | Pearl   | 
|-------------------       |---------|
| Modular Components       | ✅      |
| Target networks          | ✅      |
| Vector representation    | ✅      |
| Network Population       | ✅      |
| Dummy Models             | ✅      |
| Shared architectures     | ✅      |
| Separate architectures   | ✅      |

In [2]:
# An example of a model setup

import torch as T
from pearll.models.encoders import IdentityEncoder
from pearll.models.torsos import MLP
from pearll.models.heads import ValueHead, DeterministicHead
from pearll.models import Actor, Critic, ActorCritic
from pearll.settings import PopulationSettings

# Shared encoder
encoder = IdentityEncoder()
# Separate torsos
actor_torso = MLP(layer_sizes=[5, 10, 10], activation_fn=T.nn.ReLU)
critic_torso = MLP(layer_sizes=[5, 5, 5], activation_fn=T.nn.ReLU)
# Separate heads
actor_head = DeterministicHead(input_shape=10, action_shape=2)
critic_head = ValueHead(input_shape=5)

# No target network for actor
actor = Actor(encoder=encoder, torso=actor_torso, head=actor_head)
# Add target network for critic
critic = Critic(encoder=encoder, torso=critic_torso, head=critic_head, create_target=True)

# Population settings to create 10 actors and 10 critics with normally
# distributed parameters.
population_settings = PopulationSettings(
    actor_population_size=10,
    critic_population_size=10,
    actor_distribution="normal",
    critic_distribution="normal",
)
model = ActorCritic(actor=actor, critic=critic, population_settings=population_settings)

input = T.ones((10, 5))

# forward = run on population networks
# predict = run on single global network
population_actors_output = model(input)
population_critics_output = model.forward_critics(input)
target_critics_output = model.forward_target_critics(input)
global_actor_output = model.predict(input)
global_critic_output = model.predict_critic(input)

# Network Structure

The models within Pearl are broken down into three components for easy configuration and modularity: the **encoder**, the **torso** and the **head**.

## Encoders

The encoder processes the input; for example, by concatenating the state observation and action for the continuous Q function network. 
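
As a rough illustration of the two input options, here is a minimal sketch in NumPy rather than with Pearl's own classes. The helper `concat_inputs` is a hypothetical name used only here; it mirrors the behaviour shown by the `IdentityEncoder` outputs in this notebook, where an observation passed together with an action comes back as one concatenated vector.

```python
import numpy as np


def concat_inputs(observation, action=None):
    """Return the observation alone, or joined with the action along the last axis."""
    if action is None:
        return observation
    return np.concatenate([observation, action], axis=-1)


observation = np.ones(2)
action = np.ones(2)

single = concat_inputs(observation)          # observation only
joined = concat_inputs(observation, action)  # observation AND action
print(single.shape, joined.shape)
```

A continuous Q network would then consume the joined vector, so its first layer needs an input size equal to the observation size plus the action size.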

In [3]:
from pearll.models import encoders as enc
import torch as T

# It is assumed that the models can have two input options:
# 1. An observation
# 2. An observation AND action

In [4]:
# The IdentityEncoder passes the input through unchanged

observation = T.ones(2)
action = T.ones(2)

encoder = enc.IdentityEncoder()
print(f"IdentityEncoder output: {encoder(observation)}")
print(f"IdentityEncoder output with action: {encoder(observation, action)}")


IdentityEncoder output: tensor([1., 1.])
IdentityEncoder output with action: tensor([1., 1., 1., 1.])


In [5]:
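# The FlattenEncoder flattens the input. Judging by the output below, the first
# dimension is kept (as a batch dimension), so a (2, 2) input is unchanged.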

observation = T.ones(2, 2)

encoder = enc.FlattenEncoder()
print(f"FlattenEncoder output: {encoder(observation)}")

FlattenEncoder output: tensor([[1., 1.],
        [1., 1.]])
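
The unchanged output above is what one would expect if flattening starts after the first (batch) axis. The sketch below shows that behaviour in NumPy; it is an assumption about how `FlattenEncoder` treats its input, inferred from the printed result, and `flatten_keep_batch` is a hypothetical helper.

```python
import numpy as np


def flatten_keep_batch(x):
    """Collapse every axis after the first into one (assumes a batch-first layout)."""
    return x.reshape(x.shape[0], -1)


already_flat = flatten_keep_batch(np.ones((2, 2)))  # nothing left to collapse
images = flatten_keep_batch(np.ones((2, 3, 4)))     # 3 * 4 = 12 features per item
print(already_flat.shape, images.shape)
```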


In [6]:
# The MLPEncoder acts as a single-layer MLP, i.e. one linear layer

observation = T.ones(2)

encoder = enc.MLPEncoder(input_size=2, output_size=1)
print(f"MLPEncoder output: \n {encoder(observation)}\n")
print(encoder)

MLPEncoder output: 
 tensor([-0.0740], grad_fn=<AddBackward0>)

MLPEncoder(
  (model): Linear(in_features=2, out_features=1, bias=True)
)


In [7]:
# The CNNEncoder is the CNN from the DQN Nature paper: 
# Mnih, Volodymyr, et al.
# "Human-level control through deep reinforcement learning."
# Nature 518.7540 (2015): 529-533.

from gym.spaces import Box

observation = T.normal(0, 1, (1, 1, 64, 64))

encoder = enc.CNNEncoder(observation_space=Box(low=0, high=255, shape=(1, 64, 64)))
print(f"CNNEncoder output shape: \n {encoder(observation).shape}\n")
print(encoder)

CNNEncoder output shape: 
 torch.Size([1, 512])

CNNEncoder(
  (cnn): Sequential(
    (0): Conv2d(1, 32, kernel_size=(8, 8), stride=(4, 4))
    (1): ReLU()
    (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
    (3): ReLU()
    (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
    (5): ReLU()
    (6): Flatten(start_dim=1, end_dim=-1)
  )
  (linear): Sequential(
    (0): Linear(in_features=1024, out_features=512, bias=True)
    (1): ReLU()
  )
)
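
The `in_features=1024` of the linear layer follows from the convolution arithmetic. The short sketch below reproduces it for a 64x64 input, using the kernel sizes and strides printed above and assuming no padding or dilation (none is shown in the printout). `conv_output_size` is a hypothetical helper.

```python
def conv_output_size(size, kernel, stride):
    """Spatial size after a convolution with no padding and no dilation."""
    return (size - kernel) // stride + 1


size = 64
# (kernel, stride) of the three Conv2d layers printed above
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel, stride)

# The last convolution has 64 output channels
flattened_features = 64 * size * size
print(size, flattened_features)
```

The spatial size shrinks from 64 to 15, then 6, then 4, giving 64 * 4 * 4 = 1024 flattened features.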


In [8]:
# The DictEncoder handles dictionary observations, e.g. from a GoalEnv.
# We specify which labels (keys) of the input to use; the selected entries are
# passed to another encoder module for processing (IdentityEncoder by default).

observation = {"label": T.ones(2), "other_label": T.zeros(100)}

encoder1 = enc.DictEncoder(labels=["label"], encoder=enc.IdentityEncoder())
print(f"DictEncoder output: \n {encoder1(observation)}\n")

post_processing = enc.MLPEncoder(input_size=2, output_size=1)
encoder2 = enc.DictEncoder(labels=["label"], encoder=post_processing)
print(f"DictEncoder output with post-processing MLPEncoder: \n {encoder2(observation)}")

DictEncoder output: 
 tensor([1., 1.])

DictEncoder output with post-processing MLPEncoder: 
 tensor([0.0874], grad_fn=<AddBackward0>)


## Torsos

The torso embodies the deep layers; for now, only a *multilayer perceptron* is supported.

In [9]:
from pearll.models.torsos import MLP
import torch as T

input = T.ones(2)

torso = MLP(layer_sizes=[2, 3, 5, 1])
print(f"Torso network: \n {torso}")
print(f"Torso output: {torso(input)}\n")

# An activation function can be supplied; as the printout shows, it is applied
# after every layer, including the last
torso_with_activations = MLP(layer_sizes=[2, 3, 5, 1], activation_fn=T.nn.ReLU)
print(f"Torso network with ReLU activations: \n {torso_with_activations}")
print(f"Torso output with ReLU activations: {torso_with_activations(input)}")

Torso network: 
 MLP(
  (model): Sequential(
    (0): Linear(in_features=2, out_features=3, bias=True)
    (1): Linear(in_features=3, out_features=5, bias=True)
    (2): Linear(in_features=5, out_features=1, bias=True)
  )
)
Torso output: tensor([0.2159], grad_fn=<AddBackward0>)

Torso network with ReLU activations: 
 MLP(
  (model): Sequential(
    (0): Linear(in_features=2, out_features=3, bias=True)
    (1): ReLU()
    (2): Linear(in_features=3, out_features=5, bias=True)
    (3): ReLU()
    (4): Linear(in_features=5, out_features=1, bias=True)
    (5): ReLU()
  )
)
Torso output with ReLU activations: tensor([0.2642], grad_fn=<ReluBackward0>)
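
To make the two printed structures concrete, here is a minimal NumPy sketch of the same forward pass: a chain of linear layers, with the optional activation applied after every layer. The weights are random placeholders, not Pearl's initialization, and `mlp_forward` is a hypothetical helper.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def mlp_forward(x, weights, biases, activation=None):
    """Apply each linear layer in turn, with the optional activation after every layer."""
    for w, b in zip(weights, biases):
        x = w @ x + b
        if activation is not None:
            x = activation(x)
    return x


rng = np.random.default_rng(0)
layer_sizes = [2, 3, 5, 1]
pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in pairs]
biases = [rng.normal(size=n_out) for _, n_out in pairs]

x = np.ones(2)
linear_out = mlp_forward(x, weights, biases)
relu_out = mlp_forward(x, weights, biases, activation=relu)
print(linear_out.shape, relu_out.shape)
```

Because the activation also follows the last layer, the ReLU variant can never output a negative number.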


## Heads

The head dictates the output; for example, a categorical distribution for an actor. There are two types of head, one for an 'actor' and one for a 'critic'. The critic heads are used for prediction, where the goal is to get a good approximation for a value function or Q function. The actor heads are used for control, where the goal is to optimize a policy to maximize reward.

In [10]:
from pearll.models import heads
import torch as T

### Critics

In [11]:
# The value function represents the expected sum of rewards from a state, so
# the head should output a single number

input = T.ones(5)

head = heads.ValueHead(input_shape=5, activation_fn=T.nn.Tanh)
print(f"ValueHead output: {head(input)}")

ValueHead output: tensor([-0.1909], grad_fn=<TanhBackward0>)


In [12]:
# The Q value represents the expected sum of rewards from a state given an
# action. In the continuous case the action is part of the input and the head
# outputs a single value; in the discrete case the input is the observation
# alone and the head outputs one Q value per action.

input = T.ones(5)

# The ContinuousQHead is actually the same as the ValueHead
head = heads.ContinuousQHead(input_shape=5, activation_fn=T.nn.Tanh)
print(f"ContinuousQHead output: {head(input)}")

head = heads.DiscreteQHead(input_shape=5, output_shape=2, activation_fn=T.nn.Tanh)
print(f"DiscreteQHead output: {head(input)}")

ContinuousQHead output: tensor([0.1124], grad_fn=<TanhBackward0>)
DiscreteQHead output: tensor([-0.0922, -0.1400], grad_fn=<TanhBackward0>)
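
One common way to turn per-action Q values into an action is greedy or epsilon-greedy selection. The sketch below is an illustration in NumPy, not part of Pearl's API; `epsilon_greedy` is a hypothetical helper and the Q values are copied from the output above.

```python
import numpy as np

q_values = np.array([-0.0922, -0.1400])  # DiscreteQHead output printed above

# Greedy: pick the action with the largest Q value
greedy_action = int(np.argmax(q_values))


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


rng = np.random.default_rng(0)
never_explores = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
print(greedy_action, never_explores)
```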


### Actors

In [13]:
# The DummyHead is just a dummy for when you don't want a neural network,
# e.g. black box static function optimization using evolutionary strategies

head = heads.DummyHead()
# Output shows there is nothing in this head! It's here to allow compatibility
# in upstream models :)
print(head)

DummyHead()


In [14]:
# The DeterministicHead is used if a deterministic policy is used. That is,
# the network doesn't output a distribution to be sampled from, but rather
# the action itself.

input = T.ones(5)

head = heads.DeterministicHead(input_shape=5, action_shape=2, activation_fn=T.nn.ReLU)
print(f"DeterministicHead output: {head(input)}\n")
print(head)

DeterministicHead output: tensor([0.0000, 0.1568], grad_fn=<ReluBackward0>)

DeterministicHead(
  (model): MLP(
    (model): Sequential(
      (0): Linear(in_features=5, out_features=2, bias=True)
      (1): ReLU()
    )
  )
)


In [15]:
# The other heads output probability distributions that are sampled to get
# the action. Let's take the case where the action is normally distributed.
# A diagonal Gaussian distribution is supported, i.e. one with a diagonal
# covariance matrix. We can specify how the standard deviation of the
# distribution is learned. If 'mlp' is used, the log standard deviation is
# output by a linear layer, so it depends on the input. By default it's set
# to 'parameter': a T.nn.Parameter that does not depend on the input.
head1 = heads.DiagGaussianHead(input_shape=5, action_size=1)
head2 = heads.DiagGaussianHead(input_shape=5, action_size=1, log_std_network_type="mlp")
print(f"T.nn.Parameter std: \n {head1}\n")
print(f"MLP std: \n {head2}\n")

input = T.ones(5)
print(f"Distribution output: {head1.action_distribution(input)}")

T.nn.Parameter std: 
 DiagGaussianHead(
  (mean_network): MLP(
    (model): Sequential(
      (0): Linear(in_features=5, out_features=1, bias=True)
      (1): Tanh()
    )
  )
)

MLP std: 
 DiagGaussianHead(
  (mean_network): MLP(
    (model): Sequential(
      (0): Linear(in_features=5, out_features=1, bias=True)
      (1): Tanh()
    )
  )
  (log_std_network): MLP(
    (model): Sequential(
      (0): Linear(in_features=5, out_features=1, bias=True)
      (1): Softplus(beta=1, threshold=20)
    )
  )
)

Distribution output: Normal(loc: tensor([-0.5080], grad_fn=<TanhBackward0>), scale: tensor([1.], grad_fn=<ExpBackward0>))
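
With a diagonal covariance matrix the action dimensions are independent, so the log-probability of an action is the sum of per-dimension normal log-densities. The sketch below writes this out in NumPy; `diag_gaussian_log_prob` is a hypothetical helper, and the mean and scale are taken from the printed distribution.

```python
import math

import numpy as np


def diag_gaussian_log_prob(action, mean, log_std):
    """Sum of independent normal log-densities over the last axis."""
    std = np.exp(log_std)
    per_dim = (
        -0.5 * ((action - mean) / std) ** 2
        - log_std
        - 0.5 * math.log(2.0 * math.pi)
    )
    return per_dim.sum(axis=-1)


mean = np.array([-0.5080])  # loc printed above
log_std = np.zeros(1)       # exp(0) = 1, the scale printed above

log_prob_at_mean = diag_gaussian_log_prob(mean, mean, log_std)
log_prob_off_mean = diag_gaussian_log_prob(mean + 1.0, mean, log_std)
print(log_prob_at_mean, log_prob_off_mean)
```

The density peaks at the mean, so any other action has a lower log-probability.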


# Actor Critic

The `ActorCritic` is the main network interface compatible with the other high-level modules in Pearl. It is made up of an `Actor` and a `Critic`, and provides methods to process inputs and manipulate the networks themselves.

In [16]:
from pearll.models.encoders import IdentityEncoder
from pearll.models.torsos import MLP
from pearll.models.heads import ValueHead, CategoricalHead

encoder = IdentityEncoder()
torso = MLP(layer_sizes=[5, 10, 5])
critic_head = ValueHead(input_shape=5)
actor_head = CategoricalHead(input_shape=5, action_size=2)

In [17]:
from pearll.models import Critic
import numpy as np

# Setting the create_target flag to True creates a target network.
critic = Critic(
  encoder=encoder,
  torso=torso,
  head=critic_head,
  create_target=True,
  polyak_coeff=0.995,
)

input = T.ones(5)

# run the online critic network
output = critic(input)
# run the target critic network
target_output = critic.forward_target(input)
# update the target network via polyak averaging: 
# target_params = polyak_coeff * target_params + (1 - polyak_coeff) * online_params
critic.update_targets()
# update the target network by directly copying from the online network:
# target_params = online_params
critic.assign_targets()


print(f"This network state can be represented as a vector with shape: {critic.numpy().shape}")
# You can also access the vector state parameter directly:
state = critic.state
# Update network state
critic.set_state(np.ones(121))
print(f"After update, network state vector is np.ones(121): {all(np.equal(np.ones(121), critic.numpy()))}")

This network state can be represented as a vector with shape: (121,)
After update, network state vector is np.ones(121): True
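
The Polyak rule quoted in the comments above can be written directly on parameter vectors. This NumPy sketch only illustrates the formula and is not Pearl's implementation; `polyak_update` is a hypothetical helper.

```python
import numpy as np


def polyak_update(target_params, online_params, polyak_coeff):
    """target = polyak_coeff * target + (1 - polyak_coeff) * online"""
    return polyak_coeff * target_params + (1.0 - polyak_coeff) * online_params


online = np.ones(3)
target = np.zeros(3)

# A single update moves the target a small step toward the online parameters
first_step = polyak_update(target, online, polyak_coeff=0.995)

# Repeated updates make the target slowly track the online parameters
for _ in range(1000):
    target = polyak_update(target, online, polyak_coeff=0.995)

print(first_step, target)
```

With `polyak_coeff=0.995` the target moves 0.5% of the remaining distance on each update, which is why the target network changes much more slowly than the online one.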


In [18]:
# The Actor is the same as the Critic with the addition of getting a
# probability distribution output

from pearll.models import Actor

actor = Actor(
    encoder=encoder,
    torso=torso,
    head=actor_head,
)

input = T.ones(5)

# Get distribution output
dist = actor.action_distribution(input)
print(f"Distribution: {dist}")

Distribution: Categorical(logits: torch.Size([2]))
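
A categorical distribution defined by logits assigns probabilities through a softmax, and an action is then sampled from those probabilities. The sketch below shows this in NumPy; the logit values are hypothetical, since the printout above only reports their shape.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


logits = np.array([0.2, -0.3])  # hypothetical logits for the two actions
probs = softmax(logits)

rng = np.random.default_rng(0)
action = int(rng.choice(len(probs), p=probs))
print(probs, action)
```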


In [19]:
# The ActorCritic combines an Actor and a Critic template to generate a
# population of networks that can be manipulated

from pearll.models import ActorCritic
from pearll.settings import PopulationSettings
import numpy as np

# A settings object can be used to configure the network populations
settings = PopulationSettings(
    actor_population_size=2,
    critic_population_size=2,
    actor_distribution="uniform",
    critic_distribution="uniform",
    actor_std=None,
    critic_std=None,
)

model = ActorCritic(actor=actor, critic=critic, population_settings=settings)
print(f"Now each actor/critic network population state can be represented as a matrix with shape (population_size, single_network_params): {model.numpy_actors().shape}")
# The population networks can be updated in the same way. Each critic has 121
# parameters (each actor has 127), so the critic state matrix is (2, 121)
model.set_critics_state(np.ones((2, 121)))
# Update the single global network as the average of the population networks
model.update_global()
# Update any target networks via Polyak averaging
model.update_targets()
# Update any target networks by assignment
model.assign_targets()

Now each actor/critic network population state can be represented as a matrix with shape (population_size, single_network_params): (2, 127)
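
The vector lengths 121 and 127 can be checked by counting weights and biases layer by layer. The sketch below assumes the `ValueHead` and `CategoricalHead` are each a single linear layer (5 to 1 and 5 to 2), which is consistent with the printed totals, and that the `IdentityEncoder` has no parameters. It also averages a toy population matrix, which is how the comment on `update_global()` describes the global network. `count_linear_params` is a hypothetical helper.

```python
import numpy as np


def count_linear_params(layer_sizes):
    """Weights plus biases for a chain of fully connected layers."""
    pairs = zip(layer_sizes[:-1], layer_sizes[1:])
    return sum(n_in * n_out + n_out for n_in, n_out in pairs)


# Torso [5, 10, 5] followed by the head's output size
critic_params = count_linear_params([5, 10, 5, 1])
actor_params = count_linear_params([5, 10, 5, 2])

# Toy population of two actors, one row per network
population = np.stack([np.zeros(actor_params), np.ones(actor_params)])
global_state = population.mean(axis=0)
print(critic_params, actor_params, population.shape)
```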


In [20]:
# A dummy model is useful in cases where you don't need a neural network
# structure for your algorithm. Such use cases are compatible with Pearl
# as well (e.g. optimizing a sphere function with an evolution strategy)!

from pearll.models import Dummy
from gym.spaces import Box
from pearll.models import ActorCritic
from pearll.settings import PopulationSettings
import numpy as np

actor = Dummy(space=Box(-100, 100, shape=(1,)))
critic = Dummy(space=Box(-100, 100, shape=(1,)))

# Default PopulationSettings defines a single network for each actor/critic
model = ActorCritic(actor=actor, critic=critic)

# Input doesn't matter anymore, the output is just the state from the space 
# provided at initialization
state = model(np.ones(1))
print(f"State of network from actor space using forward(): {state}")
# The state can be set the same way as a standard ActorCritic
model.set_critics_state(np.array([100]))
# The state can be taken from the numpy() method again
state = model.numpy_actors()
print(f"State of network from actor space using numpy(): {state}")
print(f"State of network from critic space using numpy(): {model.numpy_critics()}")
print("NOTE: numpy() returns numpy representation, forward() returns torch representation")

State of network from actor space using forward(): tensor([-90.1194])
State of network from actor space using numpy(): [[-90.11937]]
State of network from critic space using numpy(): [[100]]
NOTE: numpy() returns numpy representation, forward() returns torch representation
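
To show the kind of problem a dummy model is meant for, here is a minimal sketch of a simple elitist evolution strategy minimizing a sphere function over the same `Box(-100, 100, shape=(1,))` range. It is written in plain NumPy and is not Pearl's algorithm; the starting point, step size and population size are hypothetical values chosen for illustration.

```python
import numpy as np


def sphere(x):
    """Sphere function: sum of squares, minimized at the origin."""
    return float(np.sum(x ** 2))


rng = np.random.default_rng(0)
best = np.array([-90.0])  # hypothetical starting state inside the box
initial_value = sphere(best)
best_value = initial_value

for _ in range(200):
    # Sample 10 candidates around the current best and keep them inside the box
    candidates = best + rng.normal(scale=5.0, size=(10, 1))
    candidates = np.clip(candidates, -100.0, 100.0)
    values = [sphere(c) for c in candidates]
    i = int(np.argmin(values))
    # Elitism: only accept a candidate that improves on the best so far
    if values[i] < best_value:
        best, best_value = candidates[i], values[i]

print(best, best_value)
```

Here the "network state" is just the vector being optimized, which is the role the `Dummy` model plays in the cell above.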
