# Continuous Cartpole

We are going to solve a continuous **Cart-Pole** task with **REINFORCE**.

We are going to use *gym* for simulation and *potion* for learning.

*potion* is based on *pytorch*.

In [1]:
import gym
import torch

### Import the custom environments
A custom environment is a class extending `gym.Env` (e.g. `potion.envs.cartpole.ContCartPole`).

The custom environments must be registered (see the *\_\_init\_\_.py* of `potion.envs`).

At registration, each environment is assigned an id of the form "Name-vx", where x denotes the version number.
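
As a small illustration of this naming convention, the sketch below splits an id into its name and version number with a regular expression. The helper `parse_env_id` is hypothetical (it is not part of *gym* or *potion*), and the pattern only covers the plain "Name-vx" form described above.

```python
import re

# Hypothetical helper, not part of gym or potion: it only handles
# plain ids of the form "Name-vx" with a non-negative integer version x.
ENV_ID_PATTERN = re.compile(r"^(?P<name>[A-Za-z0-9_]+)-v(?P<version>\d+)$")

def parse_env_id(env_id):
    """Split an environment id into (name, version)."""
    match = ENV_ID_PATTERN.match(env_id)
    if match is None:
        raise ValueError("Malformed environment id: %r" % env_id)
    return match.group("name"), int(match.group("version"))

name, version = parse_env_id("ContCartPole-v0")
```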

The following `import` runs the registration code:

In [2]:
import potion.envs

### Create the environment
The environment is created using the id assigned at registration.

We are going to load a custom continuous variant of the popular Cart-Pole problem:

In [3]:
env = gym.make('ContCartPole-v0')



The state is 4-dimensional, while the action is a scalar:

In [4]:
state_dim = sum(env.observation_space.shape) #dimensionality of the state space
action_dim = sum(env.action_space.shape) #dimensionality of the action space
(state_dim, action_dim)

(4, 1)

The environment has an indefinite horizon. For practical reasons, we are going to set a finite maximum horizon:

In [5]:
horizon = 500 #maximum length of a trajectory

We also have to define the discount factor separately:

In [6]:
gamma = 1.
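
To make the roles of `horizon` and `gamma` concrete, here is a minimal sketch of how the return of a single trajectory can be computed. The function `discounted_return` is illustrative and not part of *potion*. The reward of 1 per time step is an assumption: it is the usual Cart-Pole reward, and it is consistent with `Perf` matching `AvgHorizon` in the training log of this run.

```python
def discounted_return(rewards, gamma=1.0, horizon=500):
    """Sum of gamma**t * r_t over at most `horizon` steps."""
    ret = 0.0
    for t, reward in enumerate(rewards[:horizon]):
        ret += (gamma ** t) * reward
    return ret

# Assumed reward model: +1 for every step the pole stays up.
rewards = [1.0] * 40
undiscounted = discounted_return(rewards, gamma=1.0)  # equals the episode length
discounted = discounted_return(rewards, gamma=0.99)   # strictly smaller
```

With `gamma = 1.`, as in this experiment, the return of a trajectory is simply the sum of its rewards, truncated at the maximum horizon.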

### Prepare the policy
We are going to optimize the mean parameters of a shallow Gaussian policy.

In [7]:
from potion.actors.continuous_policies import ShallowGaussianPolicy

We set the standard deviation to 1.0 (that is, its logarithm to 0) and initialize the mean parameters with a tensor of zeros:

In [8]:
policy = ShallowGaussianPolicy(state_dim, #input size
                               action_dim, #output size
                               mu_init = torch.zeros(4), #initial mean parameters
                               logstd_init = 0., #log of standard deviation
                               learn_std = False #We are NOT going to learn the variance parameter
                              ) 

The policy is just a stochastic mapping from states to actions:

In [9]:
state = torch.ones(4)
policy.act(state)

tensor([-1.2340])
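
Conceptually, a shallow Gaussian policy with fixed standard deviation draws the action from a normal distribution whose mean is linear in the state. The class below is a plain-Python sketch of that idea, not *potion*'s implementation: the name `LinearGaussianPolicy` and its methods are hypothetical.

```python
import math
import random

class LinearGaussianPolicy:
    """Illustrative stand-in: a ~ N(theta . s, exp(logstd)^2)."""

    def __init__(self, theta, logstd=0.0, seed=None):
        self.theta = list(theta)
        self.std = math.exp(logstd)  # logstd = 0 gives std = 1
        self.rng = random.Random(seed)

    def mean(self, state):
        # Linear mean: dot product between parameters and state
        return sum(t * s for t, s in zip(self.theta, state))

    def act(self, state):
        # Sample a scalar action around the linear mean
        return self.rng.gauss(self.mean(state), self.std)

    def entropy(self):
        # Differential entropy of a 1-D Gaussian: 0.5 * log(2 * pi * e * std^2)
        return 0.5 * math.log(2 * math.pi * math.e * self.std ** 2)

sketch_policy = LinearGaussianPolicy([0.0] * 4, logstd=0.0, seed=0)
action = sketch_policy.act([1.0] * 4)  # a sample from N(0, 1)
```

With unit standard deviation the entropy is about 1.4189, which is the value reported as `Entropy` in the training log.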

The policy's parameters are represented as a 1-dimensional tensor:

In [10]:
policy.get_flat()

tensor([0., 0., 0., 0.])

### Run the algorithm
We are going to run the **REINFORCE** algorithm:

In [11]:
from potion.algorithms.reinforce import reinforce
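
For intuition, here is a minimal plain-Python sketch of a G(PO)MDP-style gradient estimate for a linear Gaussian policy with fixed standard deviation. It is not *potion*'s code: the function names and the trajectory format (lists of `(state, action, reward)` tuples) are assumptions, and no baseline is used, for brevity.

```python
def gaussian_score(theta, state, action, std=1.0):
    """Gradient of log N(action; theta . state, std^2) w.r.t. theta."""
    mean = sum(t * s for t, s in zip(theta, state))
    coeff = (action - mean) / std ** 2
    return [coeff * s for s in state]

def gpomdp_gradient(trajectories, theta, gamma=1.0, std=1.0):
    """Average over trajectories of sum_t gamma^t r_t * sum_{k<=t} score_k."""
    dim = len(theta)
    grad = [0.0] * dim
    for trajectory in trajectories:
        cum_score = [0.0] * dim  # running sum of scores up to step t
        for t, (state, action, reward) in enumerate(trajectory):
            score = gaussian_score(theta, state, action, std)
            cum_score = [c + g for c, g in zip(cum_score, score)]
            weight = (gamma ** t) * reward
            grad = [g + weight * c for g, c in zip(grad, cum_score)]
    return [g / len(trajectories) for g in grad]

# One hand-made trajectory with two steps and a 2-dimensional state.
toy_trajectory = [([1.0, 0.0], 1.0, 1.0),
                  ([0.0, 1.0], -1.0, 1.0)]
toy_grad = gpomdp_gradient([toy_trajectory], theta=[0.0, 0.0])
```

Each reward is weighted only by the scores of the actions taken up to that step, which is what distinguishes this estimator from plain REINFORCE, where every reward multiplies the score of the whole trajectory.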

We use a constant step size and a constant batch size:

In [12]:
from potion.meta.steppers import ConstantStepper
stepper = ConstantStepper(0.05)

batchsize = 100
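
With a constant step size, each iteration moves the parameters along the estimated gradient by the same factor. A minimal sketch of one such update follows; `gradient_ascent_step` is a hypothetical helper, and the gradient values are made up for illustration.

```python
def gradient_ascent_step(theta, grad, step_size=0.05):
    """Return theta + step_size * grad (ascent, since performance is maximized)."""
    return [t + step_size * g for t, g in zip(theta, grad)]

theta = [0.0, 0.0, 0.0, 0.0]
grad = [1.0, -2.0, 0.5, 0.0]  # made-up gradient estimate
theta = gradient_ascent_step(theta, grad)
```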

We set up a logger to save learning statistics:

In [13]:
from potion.common.logger import Logger

log_dir = '../logs'
log_name = 'REINFORCE'
logger = Logger(directory=log_dir, name = log_name)

We set a random seed to make the experiment fully reproducible (`seed = None` would make it truly random).

We need to apply the random seed to the environment *and* to the learning algorithm (the latter through the `seed` argument of `reinforce`):

In [14]:
seed = 42

env.seed(seed)

[42]
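
The effect of seeding can be illustrated with the standard library alone. This sketch is independent of *gym* and *potion*; `sample_noise` is a hypothetical helper that stands in for any source of randomness in the experiment.

```python
import random

def sample_noise(seed, n=3):
    """Draw n standard-normal samples from an explicitly seeded generator."""
    rng = random.Random(seed)  # independent generator, seeded explicitly
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

same_a = sample_noise(42)
same_b = sample_noise(42)     # same seed: identical samples
different = sample_noise(43)  # different seed: different samples
```

Every source of randomness that is left unseeded can make two runs diverge, which is why both the environment and the algorithm receive the seed.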

Now we run the algorithm. It will take some time. 

You can also monitor its progress with [TensorBoard](https://www.tensorflow.org/guide/summaries_and_tensorboard). Event files are saved in the log directory.

In [17]:
policy.set_from_flat(torch.zeros(4)) #Reset the policy (in case this cell is run multiple times)

reinforce(env = env, 
          policy = policy,
          horizon = horizon,
          stepper = stepper,
          batchsize = batchsize,
          disc = gamma,
          iterations = 100,
          seed = seed, #Same seed as the environment
          logger = logger,
          save_params = 5, #Policy parameters will be saved on disk every 5 iterations
          shallow = True, #Use optimized code for shallow policies
          estimator = 'gpomdp', #Use the G(PO)MDP refined estimator
          baseline = 'peters' #Use Peters' variance-minimizing baseline
         )


Iteration  0
Perf :	 39.29999923706055
UPerf :	 39.29999923706055
AvgHorizon :	 39.29999923706055
StepSize :	 0.05000000074505806
GradNorm :	 1.5244964361190796
Time :	 1.1873750686645508
Exploration :	 1.0
Entropy :	 1.4189385175704956
Info :	 0.0

Iteration  1
Perf :	 39.06999969482422
UPerf :	 39.06999969482422
AvgHorizon :	 39.06999969482422
StepSize :	 0.05000000074505806
GradNorm :	 4.0708327293396
Time :	 1.2566392421722412
Exploration :	 1.0
Entropy :	 1.4189385175704956
Info :	 0.0

Iteration  2
Perf :	 43.79999923706055
UPerf :	 43.79999923706055
AvgHorizon :	 43.79999923706055
StepSize :	 0.05000000074505806
GradNorm :	 3.1258842945098877
Time :	 1.2887117862701416
Exploration :	 1.0
Entropy :	 1.4189385175704956
Info :	 0.0

Iteration  3
Perf :	 39.130001068115234
UPerf :	 39.130001068115234
AvgHorizon :	 39.130001068115234
StepSize :	 0.05000000074505806
GradNorm :	 3.1046805381774902
Time :	 1.1068427562713623
Exploration :	 1.0
Entropy :	 1.4189385175704956
Info :	 0.0


Perf :	 130.75
UPerf :	 130.75
AvgHorizon :	 130.75
StepSize :	 0.05000000074505806
GradNorm :	 15.89585018157959
Time :	 3.3146133422851562
Exploration :	 1.0
Entropy :	 1.4189385175704956
Info :	 0.0

Iteration  34
Perf :	 133.11000061035156
UPerf :	 133.11000061035156
AvgHorizon :	 133.11000061035156
StepSize :	 0.05000000074505806
GradNorm :	 3.967622995376587
Time :	 3.3782176971435547
Exploration :	 1.0
Entropy :	 1.4189385175704956
Info :	 0.0

Iteration  35
Perf :	 144.99000549316406
UPerf :	 144.99000549316406
AvgHorizon :	 144.99000549316406
StepSize :	 0.05000000074505806
GradNorm :	 15.731992721557617
Time :	 3.6752612590789795
Exploration :	 1.0
Entropy :	 1.4189385175704956
Info :	 0.0

Iteration  36
Perf :	 159.1699981689453
UPerf :	 159.1699981689453
AvgHorizon :	 159.1699981689453
StepSize :	 0.05000000074505806
GradNorm :	 37.85042190551758
Time :	 4.024984359741211
Exploration :	 1.0
Entropy :	 1.4189385175704956
Info :	 0.0

Iteration  37
Perf :	 218.67999267578125


Perf :	 498.3299865722656
UPerf :	 498.3299865722656
AvgHorizon :	 498.3299865722656
StepSize :	 0.05000000074505806
GradNorm :	 5.935730934143066
Time :	 13.966299295425415
Exploration :	 1.0
Entropy :	 1.4189385175704956
Info :	 0.0

Iteration  68
Perf :	 500.0
UPerf :	 500.0
AvgHorizon :	 500.0
StepSize :	 0.05000000074505806
GradNorm :	 0.0
Time :	 14.603025913238525
Exploration :	 1.0
Entropy :	 1.4189385175704956
Info :	 0.0

Iteration  69
Perf :	 500.0
UPerf :	 500.0
AvgHorizon :	 500.0
StepSize :	 0.05000000074505806
GradNorm :	 0.0
Time :	 13.64837908744812
Exploration :	 1.0
Entropy :	 1.4189385175704956
Info :	 0.0

Iteration  70
Perf :	 500.0
UPerf :	 500.0
AvgHorizon :	 500.0
StepSize :	 0.05000000074505806
GradNorm :	 0.0
Time :	 13.016191244125366
Exploration :	 1.0
Entropy :	 1.4189385175704956
Info :	 0.0

Iteration  71
Perf :	 500.0
UPerf :	 500.0
AvgHorizon :	 500.0
StepSize :	 0.05000000074505806
GradNorm :	 0.0
Time :	 14.207878589630127
Exploration :	 1.0
Entropy 

### Visualize the results

In [16]:
import os
import glob
import pandas as pd
import matplotlib.pyplot as plt

The data for this experiment are saved as a *csv* file in the logger's directory.

The *csv* file's name is the logger's name plus a unique timestamp, to distinguish it from other runs of the same experiment.

We load the data of each run into a separate pandas DataFrame:

In [None]:
pattern = os.path.join(log_dir, log_name + '_*.csv') #Match only the runs of this experiment
runs = [pd.read_csv(file, index_col=False) 
          for file in sorted(glob.glob(pattern))] #No chdir, so the cell can be re-run safely

In this case we just have one run. We plot the average performance per iteration.

Normally, results are averaged over several (>= 5) runs (each with a different random seed) and confidence intervals are reported as shaded areas.

In [None]:
run = runs[0]
performance = run['Perf']
plt.plot(range(len(performance)), performance)
plt.xlabel('Iterations')
plt.ylabel('Performance')

The optimal performance for this task (500) was achieved, although large oscillations happened during the learning phase.
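
When several runs are available, the averaging mentioned above can be sketched as follows. The function `mean_and_ci` is a hypothetical helper, the curves are made-up numbers, and the interval relies on a normal approximation (with only a handful of runs, a Student's t quantile would give wider intervals).

```python
import math
import statistics

def mean_and_ci(curves, z=1.96):
    """Per-iteration mean and half-width of an approximate 95% interval.

    `curves` is a list of equally long performance curves, one per run.
    """
    means, half_widths = [], []
    for values in zip(*curves):  # values of all runs at one iteration
        means.append(statistics.mean(values))
        sem = statistics.stdev(values) / math.sqrt(len(values))
        half_widths.append(z * sem)
    return means, half_widths

# Made-up curves from three hypothetical runs, for illustration only.
curves = [[10.0, 20.0, 30.0],
          [12.0, 22.0, 28.0],
          [14.0, 18.0, 32.0]]
means, half_widths = mean_and_ci(curves)
```

The shaded area can then be drawn with `plt.fill_between`, passing the mean minus and plus the half-width as lower and upper bounds.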

### Retrieve the learned parameters
These are the final policy parameters learned by the algorithm:

In [19]:
policy.get_flat()

tensor([-0.4193,  4.6197, 10.9384, 16.7262])

Intermediate parameters have also been saved as torch tensors inside the log directory. This can be useful to restore an aborted experiment.

### Observe the learned behavior

It is good practice to observe the behavior the agent has actually learned:

In [23]:
from potion.simulation.play import play

play(env, policy, horizon, episodes=1)

### Cleanup

In [24]:
env.viewer.close()
env.close()