# *Policy Gradient agent implementation in Tensorforce*

___This tutorial will:___
* briefly introduce the Policy Gradient algorithm;
* walk the reader through the implementation of a simple Policy Gradient agent using the Tensorforce framework, and its use to solve a basic OpenAI Gym problem.

# A brief introduction to the TensorForce framework

In this tutorial, the Policy Gradient agent is implemented entirely with the RL framework TensorForce. TensorForce is an open-source framework which provides high-level, user-friendly interfaces and tools for deep RL. It's built on top of Google's TensorFlow framework and requires Python 3 to function. An important aspect TensorForce focuses on is the general applicability of every feature it provides: its algorithms are designed to be agnostic to the structure and type of the inputs and outputs of a specific scenario. The framework can be used to implement a wide variety of highly customizable agents and environments through a small number of functions, from simple Policy Gradient agents to DQNs to Actor-Critic agents. For instance, in order to create a constant agent, the *Agent.create* function can be used as follows (the `...` stands for the remaining arguments, so the line is schematic rather than runnable):

In [1]:
constant_agent = tensorforce.Agent.create(agent='constant', ...)

The same can be done when it comes to environment creation, whether a custom environment or one from the OpenAI Gym is used. To create an instance of the MountainCar environment from Gym, for example, the *Environment.create* function must be used:

In [2]:
mountain_car = tensorforce.Environment.create(environment='gym', level='MountainCar', ...)

This framework also provides an execution utility equipped to handle both the training and evaluation of agents: the Runner utility, which offers a wide range of configuration options and can be used to train a single agent as well as multiple agents at the same time. A more detailed introduction to the framework and its tools can be found on TensorForce's [official website](https://tensorforce.readthedocs.io/en/latest/index.html).

# Policy Gradient: the idea

The term Policy Gradient refers to a family of RL algorithms based on the optimization of a *policy*: $$ \pi_\theta(a_t|s_t) $$
which is the probability, dependent on the parameters $\theta$, that action $a_t$ is chosen given the state $s_t$ at time $t$.
The basic idea of a Policy Gradient method consists in using probabilistic action choice to select an action to execute at each timestep. Once an episode is over, the parameters are adjusted according to the value of the overall *reward* (or *return*).
In general, the policy is represented through a neural network known as the *Policy Network*. 
Since an action is picked at each timestep, it's then possible to put all of the timesteps together to define a *trajectory*:
$$\tau = (\overline{a}, \overline{s})$$
Where $\overline{a} = a_0, a_1, a_2...$ is a sequence of actions and $\overline{s} = s_1, s_2...$ is a sequence of states ($s_0$ is fixed). The probability of following a trajectory is then: $$P_{\theta}(\tau) = \prod_t{P(s_{t+1}|s_t,a_t)\pi_{\theta}(a_t|s_t)}$$
which makes it possible to compute the expected overall reward: $$\overline{R} = E[R] = \sum_{\tau}P_{\theta}(\tau)R(\tau)$$
The formula right above shows that the expected overall reward is given by the sum over all trajectories of the corresponding reward multiplied by the probability of following that trajectory.<br>
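
As a small worked example of this sum, the sketch below enumerates every trajectory of a toy two-step problem and accumulates $P_\theta(\tau)R(\tau)$. The problem itself is an assumption made purely for illustration: transitions are deterministic (so each $P(s_{t+1}|s_t,a_t)$ equals 1), the policy picks action 1 with a fixed probability regardless of the state, and the reward of a trajectory is the number of times action 1 was taken. The function name `expected_reward` is hypothetical.

```python
from itertools import product

def expected_reward(p_action1, horizon=2):
    """Sum P(tau) * R(tau) over all action sequences of a toy problem."""
    policy = {0: 1.0 - p_action1, 1: p_action1}  # pi(a), state-independent here
    total = 0.0
    for actions in product((0, 1), repeat=horizon):
        # Deterministic transitions contribute a factor of 1, so the
        # trajectory probability is just the product of the policy terms.
        prob = 1.0
        for a in actions:
            prob *= policy[a]
        reward = sum(actions)  # toy reward: how often action 1 was chosen
        total += prob * reward
    return total

print(expected_reward(0.7))
```

With two steps and a probability of 0.7, the result is approximately 1.4, i.e. the horizon multiplied by the probability of choosing the rewarded action.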
The goal is to maximize this reward, and that's where the *Gradient* part of Policy Gradient comes into play. In order to find the values for each parameter in $\theta$ which maximize the reward, it's necessary to calculate the *gradient* of the expected reward with respect to the parameters, which results in: $$\dfrac{\partial \overline{R}}{\partial \theta} = \sum_t{E\left[R\dfrac{\partial \ln\pi_\theta(a_t|s_t)}{\partial \theta}\right]}$$
This formula not only gives a method for the mathematical computation of this gradient, but also shows how it's only dependent on the policy, not on the environment this method is applied in, which means the Policy Gradient method is ***Model Free***.
The parameters are then updated following the formula: $$\theta_{new} = \theta_{old} + \alpha\dfrac{\partial \overline{R}}{\partial \theta}$$
Where $\alpha$ is the *learning rate*.
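
The update rule can be tried out on a problem small enough to run without any framework. The following is a minimal sketch, not the way Tensorforce implements it: it assumes a single-state problem with two actions (a bandit), a softmax policy whose parameters $\theta$ are one preference per action, and a fixed reward per action. The names `reinforce_bandit` and `grad_log_softmax` are hypothetical helpers. For a softmax policy, the derivative of $\ln\pi_\theta(a)$ with respect to preference $k$ is $1 - \pi_\theta(k)$ if $k = a$ and $-\pi_\theta(k)$ otherwise.

```python
import math
import random

def softmax(prefs):
    """Turn action preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(probs, action):
    """d ln pi(action) / d theta_k = 1[k == action] - pi(k)."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def reinforce_bandit(rewards, alpha=0.1, episodes=500, seed=0):
    """Sample an action, observe its reward, step theta along R * grad ln pi."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(episodes):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = rewards[action]
        grad = grad_log_softmax(probs, action)
        # theta_new = theta_old + alpha * R * d ln pi / d theta
        theta = [t + alpha * reward * g for t, g in zip(theta, grad)]
    return softmax(theta)

final_probs = reinforce_bandit(rewards=[0.0, 1.0])
print(final_probs)
```

Since only action 1 is rewarded here, every update shifts probability towards it, so the final policy prefers action 1.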

# OpenAI Gym: The environment

For the sake of this tutorial, the Policy Gradient agent will be tasked with solving the *CartPole* environment from the OpenAI Gym toolkit. This is one of the most basic environments provided by Gym, and is to Reinforcement Learning what "*Hello World*" is to general programming. The problem is straightforward: the agent must prevent a pole balanced on a cart from falling over.
<div><img src="78819170-cb8f0780-79a3-11ea-8ad6-069968da4d14.gif", width=250px, height=250px /></div>
The cart moves on a frictionless surface and is controlled by applying a force that pushes it to the left or to the right. For each timestep the pole stays upright, the agent receives a reward of +1. An episode terminates when the timestep limit is reached, which qualifies as a success, when the cart moves more than 2.4 units away from the center of the platform, or when the pole tilts too far from vertical (the environment description quotes 15 degrees, while Gym's implementation uses a threshold of 12 degrees).
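
The termination rules can be summarized in a few lines of code. This is only a sketch of the logic described above, not Gym's actual implementation: `episode_status` is a hypothetical helper, and the default angle limit of 12 degrees is an assumption taken from Gym's CartPole source, so it should be adjusted if a different limit applies.

```python
def episode_status(position, angle_deg, timestep, max_timesteps,
                   position_limit=2.4, angle_limit_deg=12.0):
    """Return (terminal, success) for one CartPole-like timestep.

    angle_limit_deg is an assumed default, see the text above.
    """
    if abs(position) > position_limit or abs(angle_deg) > angle_limit_deg:
        return True, False   # cart left the platform or the pole fell: failure
    if timestep >= max_timesteps:
        return True, True    # survived until the timestep limit: success
    return False, False      # episode continues

print(episode_status(0.1, 3.0, 50, 100))    # still running
print(episode_status(2.5, 3.0, 50, 100))    # cart too far from the center
print(episode_status(0.1, 3.0, 100, 100))   # timestep limit reached
```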

# Solution implementation: the TensorForce framework

## Step 1: The training function

To train and evaluate the agent, this tutorial uses a slightly modified version of TensorForce's act-experience-update interface: the agent is trained for a certain number of episodes, and at the end of each one the *agent.experience* and *agent.update* TensorForce functions are called in order to, respectively, feed the agent the data collected during the episode and update the agent's parameters (which correspond to the $\theta$ parameters discussed in the Policy Gradient section). In this version of the interface, the training and evaluation steps are split into two functions for the sake of clarity.
    
* ### Step 1.1: initializing the variables and lists of returns
    
  The first step consists in initializing five lists to store the different kinds of returns provided by the functions used during training, which will be covered shortly. After this, three variables have to be initialized: one to keep track of the current *state*, which is initialized by resetting the environment, one to store the current agent internals and a boolean variable which is used to determine whether a state is terminal or not.

In [8]:
import tensorforce as tf
import matplotlib.pyplot as plt

def train(agent, environment, num_episodes):
    tr_sum = 0.0  # running total of the rewards collected over all training episodes
    for episode in range(num_episodes):
        # Per-episode lists, later passed to agent.experience
        ep_states = list()
        ep_actions = list()
        ep_internals = list()
        ep_terminal = list()
        ep_rewards = list()
        # Current state, agent internals and terminal flag
        states = environment.reset()
        internals = agent.initial_internals()
        terminal = False

* ### Step 1.2: The training loop
    
    Once the variables and lists are initialized, the actual training phase is tackled. The main body consists of a loop that runs until a terminal state is reached. Two TensorForce functions are called in this loop:
        
    * *agent.act*, which takes the current state and the agent internals as input and outputs the chosen action to take as well as the new internals;
    * *environment.execute*, which takes the action provided by the previous function as input and outputs the new state, the indication of whether such state is terminal and the reward at the current timestep.
        
    The outputs are appended to the corresponding lists and, when a terminal state is reached, the *agent.experience* and *agent.update* functions are called. Finally, the function returns the mean total reward per episode over the training episodes. 

In [9]:
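        while not terminal:
            # Store the state and internals the next action is based on
            ep_states.append(states)
            ep_internals.append(internals)
            # Sample an action (deterministic=False); independent=True because no observe call follows
            actions, internals = agent.act(states=states, internals=internals, independent=True, deterministic=False)
            ep_actions.append(actions)
            # Apply the action and record the outcome
            states, terminal, reward = environment.execute(actions=actions)
            ep_terminal.append(terminal)
            ep_rewards.append(reward)
            tr_sum += reward
        # Episode finished: hand the collected data to the agent, then update its parameters
        agent.experience(states=ep_states, actions=ep_actions, terminal=ep_terminal, reward=ep_rewards,
                         internals=ep_internals)
        agent.update()
    # Mean total reward per training episode
    tr_sum = tr_sum/num_episodes
    return tr_sum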

## Step 2: The evaluation function

After defining the training function, it's time to create an evaluation function, which will have the agent execute the task in the environment for a certain number of episodes without feeding it any returns. Apart from this major difference, the implementation of this function is very similar to the previous one, although without the lists used to store the returns.

In [None]:
def evaluate(agent, environment, num_episodes):
    ev_sum = 0.0
    for t in range(num_episodes):
        states = environment.reset()
        internals = agent.initial_internals()
        terminal = False
        while not terminal:
            actions, internals = agent.act(states=states, internals=internals, independent=True)
            states, terminal, reward = environment.execute(actions=actions)
            ev_sum += reward
    ev_sum = ev_sum/num_episodes
    return ev_sum

One thing worth noting is that, in this case, the argument deterministic=<font color=green>**False**</font> is not given as input to the agent.act function. This is because in the evaluation phase the agent needs to select actions deterministically, without exploration, and <font color=green>**True**</font> is the default value for the deterministic parameter.
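
The difference between the two modes can be illustrated independently of Tensorforce. The sketch below is a conceptual illustration only, with a hypothetical `select_action` helper and made-up action probabilities: in deterministic mode the most probable action is always chosen, while in stochastic mode the action is sampled in proportion to its probability.

```python
import random

def select_action(action_probs, deterministic, rng=random):
    """Pick an action index from a list of action probabilities."""
    if deterministic:
        # Greedy choice: always the most probable action.
        return max(range(len(action_probs)), key=lambda a: action_probs[a])
    # Stochastic choice: sample in proportion to the probabilities.
    return rng.choices(range(len(action_probs)), weights=action_probs)[0]

probs = [0.3, 0.7]            # illustrative values, not produced by an agent
rng = random.Random(0)        # seeded so that the run is repeatable
greedy = [select_action(probs, deterministic=True) for _ in range(5)]
sampled = [select_action(probs, deterministic=False, rng=rng) for _ in range(1000)]
print(greedy)
print(sum(sampled) / len(sampled))   # fraction of times action 1 was sampled
```

The greedy list contains only action 1, whereas the sampled actions include both, with action 1 appearing in roughly 70% of the draws.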

## Step 3: Initializing the parameters

Once the training and evaluation routines are defined, the next step is initializing the parameters that will allow the execution of the main routine. Moreover, in this step both the environment and the agent are initialized.

In [None]:
max_time = 100
my_env = tf.Environment.create(environment='gym', level='CartPole', max_episode_timesteps=max_time, visualize=True)
actionSpace = my_env.actions()
n_actions=actionSpace.get('num_values')

Note that there is no need to import the OpenAI Gym toolkit explicitly in order to initialize an environment as an instance of CartPole: TensorForce's *Environment.create* function handles it directly (Gym still has to be installed).

* ### Step 3.1: The Policy Network

    At this point, the policy network, which was briefly introduced in the Policy Gradient section of this tutorial, must also be instantiated. In this case, it consists of a simple neural network with three layers: two dense layers with a size of 64 and a *relu* activation, followed by another dense layer whose size corresponds to the dimension of the action space of the CartPole environment and which uses a *softmax* activation. This network is then used as the policy of the policy gradient agent. The creation of the agent is entirely handled by TensorForce; only the correct values for the input parameters have to be provided. When calling the function *Agent.create*, an identifier must be provided so that it can create the correct type of agent, which in this case will be a "vpg" (Vanilla Policy Gradient) or, equivalently, "reinforce" agent.

In [None]:
network_spec = [dict(type='dense', size=64, activation='relu'),
                dict(type='dense', size=64, activation='relu'),
                dict(type='dense', size=n_actions, activation='softmax')]

my_agent = tf.Agent.create(agent='reinforce', environment=my_env, batch_size=64, network=network_spec,
                           memory=10000, learning_rate=5e-4, discount=0.99, baseline='auto',
                           baseline_optimizer=dict(optimizer='adam', learning_rate=5e-4))
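
To make the shape of this network concrete, the following sketch runs a forward pass through three dense layers of the same sizes in plain Python. It is an illustration under stated assumptions, not what Tensorforce builds internally: the weights are random rather than trained, the initialization range is arbitrary, and the helpers `dense`, `random_layer` and `policy_forward` are hypothetical. CartPole states have 4 values and there are 2 actions.

```python
import math
import random

def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W x + b)."""
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    if activation == 'relu':
        return [max(0.0, o) for o in outputs]
    if activation == 'softmax':
        m = max(outputs)
        exps = [math.exp(o - m) for o in outputs]
        total = sum(exps)
        return [e / total for e in exps]
    raise ValueError('unsupported activation: ' + activation)

def random_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
sizes = [4, 64, 64, 2]                      # state size, two hidden layers, number of actions
activations = ['relu', 'relu', 'softmax']
layers = [random_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

def policy_forward(state):
    """Map a state to one probability per action."""
    x = state
    for (weights, biases), activation in zip(layers, activations):
        x = dense(x, weights, biases, activation)
    return x

action_probs = policy_forward([0.01, -0.02, 0.03, 0.04])
print(action_probs)
```

The softmax in the last layer guarantees that the outputs are positive and sum to 1, so they can be read as the probabilities $\pi_\theta(a_t|s_t)$.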

The values for the learning rate, batch size, memory and discount parameters can be modified as the reader sees fit, while keeping in mind the general rules of thumb regarding said values.
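
To get a feeling for the discount parameter, the sketch below computes discounted returns using the standard definition $G_t = r_t + \gamma G_{t+1}$. Whether Tensorforce computes them in exactly this way internally is an assumption; `discounted_returns` is a hypothetical helper written for this illustration.

```python
def discounted_returns(rewards, discount):
    """Return G_t = r_t + discount * G_{t+1} for every timestep."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns

episode_rewards = [1.0] * 100   # +1 per timestep, as in a full 100-step CartPole episode
returns = discounted_returns(episode_rewards, discount=0.99)
print(returns[0], returns[-1])
```

With a discount of 0.99, the return at the first timestep of a 100-step episode is roughly 63.4 rather than 100, because later rewards count for less. A discount of 1.0 would give exactly 100.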

## Step 4: The main routine

Now everything is ready for the main program, which trains and evaluates the agent in the environment for an arbitrary number of epochs. In this tutorial, the agent does so for 100 epochs, storing the mean training and evaluation reward of each epoch so that they can be plotted afterwards to help visualize the results.

In [12]:
my_agent.reset()
train_rewards = []
rewards = []
for i in range(100):
    # Training
    training = train(my_agent, my_env, 10)
    train_rewards.append(training)
    # Evaluation
    sum_reward = evaluate(my_agent, my_env, 100)
    print(str(sum_reward))
    print('Cleared epoch ', i)
    rewards.append(sum_reward)

my_env.close()
my_agent.close()

plt.plot(rewards)
plt.xlabel('epoch')
plt.ylabel('avg_rewards_eval')
plt.show()
plt.plot(train_rewards)
plt.xlabel('epoch')
plt.ylabel('avg_rewards_train')
plt.show()

Below are two plots from an example run of the program. The agent never reaches the maximum score during training, although the training reward shows the expected upward trend, while during evaluation it reaches the maximum within very few epochs.

<div><img src="VPG_train_10train.png", width=450px, height=450px, align="left" /></div>
<div><img src="VPG_eval_10train.png", width=450px, height=450px, align="right" /></div>