# Solving reinforcement learning problems using pgpelib

In this example, we are going to solve the `CartPole-v1` environment.

In addition to the `PGPE` class in its core namespace, `pgpelib` provides utilities for expressing policies (feed-forward neural network policies and linear policies are supported). This tutorial demonstrates those utility classes as well.

---

We start by importing the required libraries.

In [1]:
from pgpelib import PGPE
from pgpelib.policies import LinearPolicy, MLPPolicy
from pgpelib.restore import to_torch_module

import numpy as np
import torch

import gym

Below, we store the name of the environment to solve:

In [2]:
ENV_NAME = 'CartPole-v1'

We now define a policy object.

A policy object:

- stores a [PyTorch](https://pytorch.org/) module (in the form of a feed-forward neural network, or a linear transformation) to represent the structure of the policy;
- provides the ability to fill the parameters of the module from a parameter vector (so that solutions evolved by PGPE can be loaded);
- can run an agent using the policy (with the loaded parameters) in a [gym](https://gym.openai.com/) environment.

This policy object will serve as our fitness function.
In more detail, our goal is to maximize the total reward we get from the gym environment, so we are looking for a parameter vector which, when loaded into the policy object, yields the highest total reward.

In [3]:
policy = MLPPolicy(
    
    # The name of the environment in which the policy will be tested:
    env_name=ENV_NAME,
    
    # Number of hidden layers:
    num_hidden=1,
    
    # Size of a hidden layer:
    hidden_size=8,
    
    # Activation function to be used in the hidden layers:
    hidden_activation='tanh',
    
    # Whether or not to do online normalization on the observations
    # received from the environments.
    # The default is True, and using observation normalization
    # can be very helpful.
    # In this tutorial, we set it to False just to keep things simple.
    observation_normalization=False
)

policy

<pgpelib.policies.MLPPolicy at 0x7fa948419860>
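
The `observation_normalization` option mentioned in the comments above refers to normalizing observations online, that is, while they are being collected. As a rough illustration of the general idea, the sketch below keeps a running mean and variance (Welford's update) and uses them to rescale observations. The class name `RunningNormalizer` is hypothetical, and this is not claimed to be how pgpelib implements the feature.

```python
import numpy as np


class RunningNormalizer:
    """Hypothetical helper: tracks running mean/variance of observations."""

    def __init__(self, size, eps=1e-8):
        self.count = 0
        self.mean = np.zeros(size, dtype=np.float64)
        self.m2 = np.zeros(size, dtype=np.float64)  # sum of squared deviations
        self.eps = eps

    def update(self, obs):
        # Welford's online update for the mean and the squared deviations
        obs = np.asarray(obs, dtype=np.float64)
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (obs - self.mean)

    def normalize(self, obs):
        # With fewer than two samples the variance is undefined; use 1 instead
        if self.count < 2:
            std = np.ones_like(self.mean)
        else:
            std = np.sqrt(self.m2 / (self.count - 1))
        return (np.asarray(obs, dtype=np.float64) - self.mean) / (std + self.eps)


normalizer = RunningNormalizer(size=4)
for obs in [[0.0, 1.0, 2.0, 3.0], [2.0, 1.0, 0.0, 3.0], [4.0, 1.0, 4.0, 3.0]]:
    normalizer.update(obs)

print(normalizer.mean)
print(normalizer.normalize([4.0, 1.0, 2.0, 3.0]))
```

Normalization of this kind tends to help when the components of the observation vector live on very different scales, which is one plausible reason the library enables it by default.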

We now set an initial solution (initial parameter vector) for PGPE to start exploring from.
In this case, we start from a zero-filled vector, its length being the number of parameters required by the policy.

In [4]:
x0 = np.zeros(policy.get_parameters_count(), dtype='float32')
x0

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0.], dtype=float32)
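
To see where the length of this vector comes from, we can count the weights and biases of a fully connected network by hand. The helper below is a small sketch (the function name `mlp_parameter_count` is ours, not part of pgpelib). It assumes every layer has a bias term and that `CartPole-v1` provides 4 observation values and 2 discrete actions.

```python
def mlp_parameter_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weight matrix
        total += n_out         # bias vector
    return total


# 4 inputs -> 8 hidden units -> 2 outputs
print(mlp_parameter_count([4, 8, 2]))  # 58
```

The result, 4·8 + 8 + 8·2 + 2 = 58, agrees with the 58 zeros printed above.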

Below we initialize our PGPE solver.

In [5]:
pgpe = PGPE(
    
    # We are looking for solutions whose lengths are equal
    # to the number of parameters required by the policy:
    solution_length=policy.get_parameters_count(),
    
    # Population size:
    popsize=250,
    
    # Initial mean of the search distribution:
    center_init=x0,
    
    # Learning rate for updating the mean of the search distribution:
    center_learning_rate=0.075,
    
    # Optimizer to be used for updating the mean of the search
    # distribution, and optimizer-specific configuration:
    optimizer='clipup',
    optimizer_config={'max_speed': 0.15},
    
    # Initial standard deviation of the search distribution:
    stdev_init=0.08,
    
    # Learning rate for when updating the standard deviation of the
    # search distribution:
    stdev_learning_rate=0.1,
    
    # Limiting the change on the standard deviation:
    stdev_max_change=0.2,
    
    # Solution ranking (True means 0-centered ranking will be used)
    solution_ranking=True,
    
    # dtype is expected as float32 when using the policy objects
    dtype='float32'
)

pgpe

<pgpelib.pgpe.PGPE at 0x7fa93ecd90f0>
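
The `solution_ranking=True` setting above asks for 0-centered ranking of the fitnesses. A common form of this technique replaces each raw fitness with its rank, rescaled to lie evenly in [-0.5, 0.5], so that the update depends only on the ordering of the solutions and not on the magnitude of the rewards. The sketch below shows that common form. Whether pgpelib uses exactly this scaling is an assumption, and the function name `centered_ranks` is ours.

```python
import numpy as np


def centered_ranks(fitnesses):
    """Map fitnesses to evenly spaced values in [-0.5, 0.5] by rank."""
    fitnesses = np.asarray(fitnesses, dtype=np.float64)
    n = len(fitnesses)
    ranks = np.empty(n, dtype=np.float64)
    # argsort lists indices from worst to best; give them ranks 0..n-1
    ranks[np.argsort(fitnesses)] = np.arange(n)
    if n > 1:
        ranks /= (n - 1)
    return ranks - 0.5


print(centered_ranks([10.0, 500.0, 13.0, 22.0, 47.5]))
```

Here the outlier score of 500 receives the utility 0.5, only one rank step above 47.5, so a single lucky episode does not dominate the update.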

We are now ready to run our evolution loop.

In [6]:
# Number of iterations
num_iterations = 50

# The main loop of the evolutionary computation
for i in range(1, 1 + num_iterations):

    # Get the solutions from the pgpe solver
    solutions = pgpe.ask()

    # The list below will keep the fitnesses
    # (i-th element will store the reward accumulated by the
    # i-th solution)
    fitnesses = []
    
    for solution in solutions:
        # For each solution, we load the parameters into the
        # policy and then run it in the gym environment,
        # by calling the method set_params_and_run(...).
        # In return we get our fitness value (the accumulated
        # reward), and num_interactions (an integer specifying
        # how many interactions with the environment were done
        # using these policy parameters).
        fitness, num_interactions = policy.set_params_and_run(solution)
        
        # In the case of this example, we are only interested
        # in our fitness values, so we add it to our fitnesses list.
        fitnesses.append(fitness)
    
    # We inform our pgpe solver of the fitnesses we received,
    # so that the population gets updated accordingly.
    pgpe.tell(fitnesses)
    
    print("Iteration:", i, "  median score:", np.median(fitnesses))

Iteration: 1   median score: 10.0
Iteration: 2   median score: 10.0
Iteration: 3   median score: 10.0
Iteration: 4   median score: 10.0
Iteration: 5   median score: 10.0
Iteration: 6   median score: 10.0
Iteration: 7   median score: 10.0
Iteration: 8   median score: 13.0
Iteration: 9   median score: 22.0
Iteration: 10   median score: 29.0
Iteration: 11   median score: 34.0
Iteration: 12   median score: 44.0
Iteration: 13   median score: 47.5
Iteration: 14   median score: 54.0
Iteration: 15   median score: 62.5
Iteration: 16   median score: 70.0
Iteration: 17   median score: 82.0
Iteration: 18   median score: 80.0
Iteration: 19   median score: 103.0
Iteration: 20   median score: 134.5
Iteration: 21   median score: 202.0
Iteration: 22   median score: 261.0
Iteration: 23   median score: 382.0
Iteration: 24   median score: 475.0
Iteration: 25   median score: 500.0
Iteration: 26   median score: 500.0
Iteration: 27   median score: 500.0
Iteration: 28   median score: 500.0
Iteration: 29   med
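
The loop above follows the ask-and-tell pattern: the solver proposes a population (`ask`), we evaluate it ourselves, and we report the fitnesses back (`tell`). To make that contract concrete without gym or pgpelib, here is a toy solver with the same two methods, applied to a simple function. `ToyGaussianSearch` is a hypothetical class written for this illustration. Its update rule (moving the center to the mean of the best quarter of the population) is deliberately simplistic and is not the PGPE update.

```python
import numpy as np


class ToyGaussianSearch:
    """Hypothetical ask/tell solver, used only to illustrate the loop shape."""

    def __init__(self, center_init, stdev, popsize, seed=0):
        self.center = np.array(center_init, dtype=np.float64)
        self.stdev = stdev
        self.popsize = popsize
        self._rng = np.random.default_rng(seed)
        self._solutions = None

    def ask(self):
        # Sample a population around the current center
        noise = self._rng.standard_normal((self.popsize, len(self.center)))
        self._solutions = self.center + self.stdev * noise
        return self._solutions

    def tell(self, fitnesses):
        # Move the center to the mean of the best quarter of the population
        order = np.argsort(fitnesses)[::-1]  # best first (maximization)
        elite = self._solutions[order[: max(1, self.popsize // 4)]]
        self.center = elite.mean(axis=0)


target = np.array([1.0, -2.0, 0.5])


def fitness(x):
    # Higher is better; the maximum (0) is reached at x == target
    return -float(np.sum((x - target) ** 2))


solver = ToyGaussianSearch(center_init=np.zeros(3), stdev=0.3, popsize=40)
initial_distance = float(np.linalg.norm(solver.center - target))

for _ in range(60):
    solutions = solver.ask()
    solver.tell([fitness(s) for s in solutions])

final_distance = float(np.linalg.norm(solver.center - target))
print(initial_distance, final_distance)
```

Because evaluation happens outside the solver, the same loop shape works whether the fitness comes from a formula, as here, or from running a policy in an environment, as in this tutorial.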

We now take the center point (i.e., the mean) of the search distribution as our final solution.

In [7]:
center_solution = pgpe.center.copy()
center_solution

array([ 0.19697863,  0.15743393,  0.2706258 ,  0.44188738, -0.65611017,
        0.14537652,  0.09155902,  0.55707943, -0.06450157,  0.72363424,
        0.32821342,  0.03987094, -0.09272053, -0.04438904, -0.05690874,
       -0.53247833, -0.3564978 , -1.0647632 , -1.5106467 , -1.3660367 ,
       -0.29835108, -0.45291892,  0.6627035 , -0.01135479, -0.25734568,
       -0.02408281,  0.09201899,  0.95454407, -0.28973174, -0.42773312,
        0.2055804 , -0.10689969,  0.23113649,  0.4962121 , -0.04484191,
       -0.08292229, -0.11028722, -0.32896033,  0.8065649 ,  0.06629611,
       -0.13041477,  0.02464531, -0.39663786,  0.21919973,  1.3402436 ,
       -0.45583156, -0.1102507 ,  0.025219  ,  0.47419178,  0.05519801,
        0.9163134 ,  0.01721925, -0.52789515, -0.16436481,  0.02202819,
       -0.09143799,  0.04324111,  0.17192318], dtype=float32)

We are now ready to test this final solution in the gym environment, and visualize its behavior.

For visualization of the agent, we instantiate a gym environment:

In [9]:
env = gym.make(ENV_NAME)
env

<TimeLimit<CartPoleEnv<CartPole-v1>>>

We now load the parameters of our final solution into the policy, and then convert that policy object to a PyTorch module.

In [10]:
policy.set_parameters(center_solution)

net = to_torch_module(policy)
net

Sequential(
  (0): Linear(in_features=4, out_features=8, bias=True)
  (1): Tanh()
  (2): Linear(in_features=8, out_features=2, bias=True)
)
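
The printed structure shows what the 58-element parameter vector encodes: a 4-to-8 linear layer, a tanh, and an 8-to-2 linear layer. As a sketch of how a flat vector can be mapped onto such a network, the NumPy code below splits the vector into weights and biases and runs a forward pass. The layout it uses (for each layer, the weight matrix of shape `(n_out, n_in)` in row-major order followed by the bias) is an assumption made for illustration, and the exact ordering inside pgpelib may differ.

```python
import numpy as np


def unflatten(params, layer_sizes):
    """Split a flat vector into (weight, bias) pairs, one per linear layer.

    Assumed layout: for each layer, the weight matrix of shape
    (n_out, n_in) in row-major order, followed by its bias vector.
    """
    layers = []
    offset = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weight = params[offset:offset + n_out * n_in].reshape(n_out, n_in)
        offset += n_out * n_in
        bias = params[offset:offset + n_out]
        offset += n_out
        layers.append((weight, bias))
    assert offset == len(params), "parameter vector has the wrong length"
    return layers


def forward(params, observation, layer_sizes=(4, 8, 2)):
    """Forward pass: Linear -> Tanh -> ... -> Linear (no final activation)."""
    layers = unflatten(np.asarray(params, dtype=np.float32), layer_sizes)
    h = np.asarray(observation, dtype=np.float32)
    for i, (weight, bias) in enumerate(layers):
        h = weight @ h + bias
        if i < len(layers) - 1:
            h = np.tanh(h)
    return h


params = np.zeros(58, dtype=np.float32)
scores = forward(params, [0.01, -0.02, 0.03, 0.04])
print(scores, int(np.argmax(scores)))
```

With an all-zero parameter vector, both action scores are zero for every observation, so `argmax` always selects action 0.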

We are now ready to manually test our final policy, by using gym and PyTorch.
Below is the loop for generating a trajectory of our policy.

In [11]:
# Declare the cumulative_reward variable, which will accumulate
# all the rewards we get from the environment
cumulative_reward = 0.0

# Reset the environment, and get the observation of the initial
# state into a variable.
observation = env.reset()

# Visualize the initial state
env.render()

# Main loop of the trajectory
while True:

    # We pass the observation vector through the PyTorch module
    # and get an action vector
    with torch.no_grad():
        action = net(
            torch.as_tensor(observation, dtype=torch.float32)
        ).numpy()

    if isinstance(env.action_space, gym.spaces.Box):
        # If the action space of the environment is Box
        # (that is, continuous), then the action vector returned
        # by the policy is what we will send to the environment.
        # This is the case for continuous control environments
        # like 'Humanoid-v2', 'Walker2d-v2', 'HumanoidBulletEnv-v0'.
        interaction = action
    elif isinstance(env.action_space, gym.spaces.Discrete):
        # If the action space of the environment is Discrete,
        # then the returned vector is in this form:
        #   [ suggestionForAction0, suggestionForAction1, ... ]
        # We get the index of the action that has the highest
        # suggestion value, and that index is what we will
        # send to the environment.
        # This is the case for discrete-actioned environments
        # like 'CartPole-v1'.
        interaction = int(np.argmax(action))
    else:
        raise TypeError("Unknown action space")

    observation, reward, done, info = env.step(interaction)
    env.render()

    cumulative_reward += reward

    if done:
        break
    
cumulative_reward

500.0
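
The loop above uses the older gym interface, where `env.reset()` returns an observation and `env.step(...)` returns four values. In newer gym releases (0.26 and later) and in gymnasium, `reset()` returns `(observation, info)` and `step(...)` returns five values, with `done` split into `terminated` and `truncated`. If you are unsure which interface your installation provides, small adapters like the ones sketched below can hide the difference. The function names `unpack_step` and `unpack_reset` are ours and are not part of gym or pgpelib.

```python
def unpack_step(result):
    """Return (observation, reward, done, info) from either step() format."""
    if len(result) == 5:
        # Newer API: (observation, reward, terminated, truncated, info)
        observation, reward, terminated, truncated, info = result
        return observation, reward, bool(terminated or truncated), info
    # Older API, as used in this tutorial: (observation, reward, done, info)
    observation, reward, done, info = result
    return observation, reward, bool(done), info


def unpack_reset(result):
    """Return only the observation from either reset() format."""
    # Newer API returns (observation, info); older API returns the observation
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result


# Usage with fake return values standing in for a real environment
print(unpack_step(("obs", 1.0, False, True, {})))
print(unpack_step(("obs", 1.0, False, {})))
```

In the trajectory loop, this would amount to writing `observation = unpack_reset(env.reset())` and `observation, reward, done, info = unpack_step(env.step(interaction))`.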