# Advantage Actor-Critic (A2C) and Exploration Strategies

In this tutorial we use an Actor-Critic agent for the first time. This kind of agent extends Policy-Based agents with a mechanism for estimating state values V(s), resulting in a mix between Policy-Based and Value-Based agents. It is composed of two entities: 1) the Actor, which learns the policy and directly proposes the actions, and 2) the Critic, which estimates the state value V(s). We therefore have two neural networks, one for the Actor and one for the Critic. In some specific situations you may want to use just one neural network with two output heads; this can be done by implementing your neural network as an extension of the interface in RL_Agent.utils.networks.networks_interfaz.py. This functionality will be revisited in later tutorials.
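
To make the Critic's role concrete, the sketch below shows one common way the state values V(s) are turned into a training signal for the Actor: the n-step advantage, i.e. the discounted return observed over a few steps (bootstrapped with the Critic's estimate of the last state) minus the Critic's estimate for each visited state. This is a generic illustration and not the library's internal code; the function name `n_step_advantages`, the discount factor `gamma` and the input layout are assumptions made for this example.

```python
import numpy as np


def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Hypothetical helper: advantage A_t = R_t - V(s_t) over a short rollout.

    rewards:         rewards r_t ... r_{t+n-1} collected by the Actor
    values:          Critic estimates V(s_t) ... V(s_{t+n-1})
    bootstrap_value: Critic estimate V(s_{t+n}) for the state after the rollout
    """
    returns = np.zeros(len(rewards))
    running = bootstrap_value
    # Accumulate discounted returns backwards from the end of the rollout
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    # A positive advantage means the action did better than the Critic expected
    return returns - np.asarray(values, dtype=float)


advantages = n_step_advantages(rewards=[1.0, 1.0, 1.0],
                               values=[0.0, 0.0, 0.0],
                               bootstrap_value=0.0,
                               gamma=0.5)
print(advantages)
```

The Actor is pushed towards actions with positive advantage, while the Critic is trained to make its estimates match the observed returns.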

Additionally, we will see how to switch to a different exploration strategy based on the exploration rate and how to change the way actions are selected.

By the end of this tutorial you will know how to implement your own exploration strategy by modifying the exploration rate as you need, how to modify the action selection procedure, and how to use Actor-Critic agents.

In [None]:
from RL_Problem.base.ActorCritic import a2c_problem
from RL_Agent import a2c_agent_discrete, a2c_agent_discrete_queue
import gym
from RL_Agent.base.utils import agent_saver, history_utils
from RL_Agent.base.utils.networks import networks
from RL_Agent.base.utils.networks import action_selection_options

## Defining the Neural Network Architecture
We define the network architecture using the function "actor_critic_net_architecture" from "RL_Agent.base.utils.networks.networks.py", which returns a dictionary. As we are using an Actor-Critic agent, this function requires the user to define the parameters of both neural networks, the actor net and the critic net.

In [None]:
net_architecture = networks.actor_critic_net_architecture(
                    actor_dense_layers=3,                                critic_dense_layers=2,
                    actor_n_neurons=[128, 128, 128],                     critic_n_neurons=[256, 256],
                    actor_dense_activation=['relu', 'relu', 'relu'],     critic_dense_activation=['relu', 'relu']
                    )

## Customizing the Exploration Rate

The exploration rate, also known as epsilon, can be reduced by specifying the "epsilon_decay" parameter. In that case the new epsilon value is calculated at each step as: epsilon' = epsilon * epsilon_decay. If we want to modify epsilon in a different way, we need to define a specific function to do that. The "epsilon_decay" parameter accepts either a float or a function. In this example we create a function that reduces epsilon linearly and in cycles. This means that when epsilon reaches a minimum threshold, it is reset to a higher value.
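
For comparison, the default multiplicative rule epsilon' = epsilon * epsilon_decay can be sketched in a few lines. The function name `multiplicative_decay` is hypothetical, and clipping the result at `epsilon_min` is an assumption about how the lower bound is applied, not something confirmed by the library.

```python
def multiplicative_decay(epsilon, epsilon_decay=0.999, epsilon_min=0.1):
    """Hypothetical sketch of the float-based rule: epsilon' = epsilon * epsilon_decay."""
    # Assumption: epsilon is never allowed to fall below epsilon_min
    return max(epsilon_min, epsilon * epsilon_decay)


epsilon = 1.0
history = []
for _ in range(5):
    epsilon = multiplicative_decay(epsilon, epsilon_decay=0.5, epsilon_min=0.1)
    history.append(epsilon)
print(history)
```

With this rule epsilon shrinks quickly at first and more slowly later, and it never goes back up, which is the behaviour a custom function can change.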

The function that we define below ("epsilon_decay") needs to receive "epsilon" and "epsilon_min" as parameters, both being floats.

In [None]:
def custom_epsilon_decay(decay_rate=0.0001, init_epsilon=1.):
    # Create a class for introducing some additional properties
    class epsilon_control:
        def __init__(self, decay_rate, init_epsilon):
            self.decay_rate = decay_rate
            self.init_epsilon = init_epsilon
            self.aux_epsilon = init_epsilon

    eps_control = epsilon_control(decay_rate, init_epsilon)
    
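    # Defining the function that will modify epsilon
    def epsilon_decay(epsilon, epsilon_min):
        # Linear decay: subtract a fixed amount at every call
        epsilon = epsilon - eps_control.decay_rate
        if epsilon < epsilon_min:
            # Start a new cycle from a restart value 0.1 lower than the previous one
            eps_control.aux_epsilon = eps_control.aux_epsilon - 0.1
            epsilon = eps_control.aux_epsilon
            if epsilon < 0.1:
                # Restart value exhausted: go back to the initial epsilon
                eps_control.aux_epsilon = eps_control.init_epsilon
                epsilon = eps_control.init_epsilon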

        return epsilon

    return epsilon_decay
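
A quick way to check the cyclic behaviour is to call the decay function in a loop and look at the resulting values. The sketch below re-implements the same schedule compactly so that it runs on its own; the name `make_cyclic_linear_decay`, the `restart_step` argument and the large `decay_rate` are illustrative choices made here to keep the trace short.

```python
def make_cyclic_linear_decay(decay_rate=0.0001, init_epsilon=1.0, restart_step=0.1):
    """Hypothetical standalone version of the schedule above, for inspection only."""
    restart_value = init_epsilon

    def epsilon_decay(epsilon, epsilon_min):
        nonlocal restart_value
        epsilon -= decay_rate                  # linear decay
        if epsilon < epsilon_min:
            restart_value -= restart_step      # each cycle restarts a bit lower
            epsilon = restart_value
            if epsilon < restart_step:
                restart_value = init_epsilon   # start over from the initial value
                epsilon = init_epsilon
        return epsilon

    return epsilon_decay


decay = make_cyclic_linear_decay(decay_rate=0.05)
epsilon, epsilon_min = 1.0, 0.1
trace = []
for _ in range(60):
    epsilon = decay(epsilon, epsilon_min)
    trace.append(round(epsilon, 2))
print(trace)
```

The printed values fall linearly towards `epsilon_min`, jump back up to a restart value that is lower on each cycle, and keep repeating this saw-tooth pattern.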

## Defining the Agent and Modifying the Action Selection Procedure

In the next cell, we define the agent as usual. Here is where we set the epsilon decay function that we defined before, by assigning it to the "epsilon_decay" parameter.

We also modify the action selection procedure. We introduce the "train_action_selection_options" and "action_selection_options" parameters. These two parameters allow the user to choose how the agent selects actions during training and how it selects actions during testing, exploitation or deployment. We use the functions provided in "RL_Agent.base.utils.networks.action_selection_options.py".

The user can specify their own function for action selection following this interface:

```python
def function(act_pred, n_actions, epsilon=0., n_env=1, exploration_noise=1.0):
    """
    :param act_pred: (nd array of floats) network predictions.
            
    :param n_actions: (int) number of actions. In a discrete action configuration it represents 
                      the number of possible actions. In a continuous action configuration 
                      it represents the number of actions to take simultaneously.
                    
    :param epsilon: (float in range [0., 1.]) Exploration rate. Probability of selecting an 
                    exploratory action.  
                
    :param n_env: (int) Number of simultaneous environments in multithread agents. It may also 
                  be seen as the number of input states; if there is one state, only one 
                  action is selected; if there are three (or multiple) states, three (or multiple) 
                  actions must be selected.
                    
    :param exploration_noise: (float in range [0., 1.]) Multiplier of the exploration rate or 
                              scale of exploration. E.g.: Used for setting the stddev when 
                              sampling from a normal distribution.
    """
```
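
As an illustration of this interface, the following is a minimal sketch of an epsilon-greedy selector for discrete actions. The function name `epsilon_greedy_selection`, the assumption that `act_pred` can be reshaped to `(n_env, n_actions)`, and the choice of returning a NumPy array with one action index per environment are all assumptions made here, not taken from the library; check the functions in `action_selection_options.py` for the exact return format the agent expects.

```python
import numpy as np


def epsilon_greedy_selection(act_pred, n_actions, epsilon=0., n_env=1, exploration_noise=1.0):
    """Hypothetical custom selector: random action with probability epsilon, else argmax."""
    # Assumption: one row of predictions per environment
    act_pred = np.asarray(act_pred, dtype=float).reshape(n_env, n_actions)
    greedy_actions = np.argmax(act_pred, axis=-1)
    random_actions = np.random.randint(0, n_actions, size=n_env)
    # exploration_noise is used here as a multiplier of the exploration rate
    explore = np.random.rand(n_env) < epsilon * exploration_noise
    return np.where(explore, random_actions, greedy_actions)


predictions = np.array([[0.1, 0.7, 0.2],
                        [0.6, 0.3, 0.1]])
greedy = epsilon_greedy_selection(predictions, n_actions=3, epsilon=0., n_env=2)
noisy = epsilon_greedy_selection(predictions, n_actions=3, epsilon=1., n_env=2)
print(greedy, noisy)
```

With `epsilon=0.` the selector always returns the highest-scoring action, which is the behaviour one would typically want for the test-time parameter, while a non-zero epsilon suits the training-time parameter.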

In [None]:
agent = a2c_agent_discrete.Agent(actor_lr=1e-2,
                                  critic_lr=1e-2,
                                  batch_size=128,
                                  epsilon=1.0, 
                                  epsilon_decay=custom_epsilon_decay(decay_rate=0.0001, init_epsilon=1.),
                                  epsilon_min=0.1,
                                  n_step_return=15,
                                  n_stack=4,
                                  net_architecture=net_architecture,
                                  loss_entropy_beta=0.002,
                                  train_action_selection_options=action_selection_options.greedy_action,
                                  action_selection_options=action_selection_options.argmax,
                                  tensorboard_dir='tensorboard_logs')

## Define the Environment

We chose the LunarLander environment from OpenAI Gym.

In [None]:
environment = "LunarLander-v2"
environment = gym.make(environment)

## Build an RL Problem

The RL problem is where the communication between agent and environment is managed.

In [None]:
problem = a2c_problem.A2CProblem(environment, agent)

## Solving the RL Problem

The next step is solving the RL problem that we have defined. Here, we specify the number of episodes and the skip_states parameter.

In [None]:
problem.solve(700, render=False, skip_states=1)

In [None]:
problem.test(render=True, n_iter=10)

In [None]:
hist = problem.get_histogram_metrics()
history_utils.plot_reward_hist(hist, 10)

## Run Tensorboard to See the Recorded Summaries

Let's see the tensorboard logs. The next cell executes the command that runs the tensorboard service. To see the results, open a tab in your browser at the URL that the command shows, usually http://localhost:6006/

In [None]:
!tensorboard --logdir=tensorboard_logs

# Takeaways
- We trained our first Actor-Critic agent.
- We learned how to customize the exploration process.
- We learned how to set the desired mode to select actions in training mode and exploitation mode.