# Asynchronous advantage actor critic (A3C)
by Tore Hilbert (s144328)

This notebook supplements the report included in this repository and visualizes the trained agents from the report interacting with their environments. It covers two topics: A) the effect of different action selection strategies, and B) the performance of the MountainCar agents in the entropy weight experiment.

In [43]:
import os
import numpy as np
import matplotlib.pyplot as plt
import torch
import gym
from gym import wrappers
from IPython import display
%matplotlib inline

from ASSETS.network_policy import Policy
from ASSETS.misc import load_policy, simulate_rollout

In [44]:
def show_replay(env):
    """
    Function directly stolen from the week 8 exercise in 02456 (except for the 'env' addition as input argument)
    
    "Not-so-elegant way to display the MP4 file generated by the Monitor wrapper inside a notebook.
    The Monitor wrapper dumps the replay to a local file that we then display as a HTML video object."
    """
    import base64
    from IPython.display import HTML
    # Read the first recorded video of this Monitor run (read-only, file closed afterwards)
    video_path = './gym-results/openaigym.video.%s.video000000.mp4' % env.file_infix
    with open(video_path, 'rb') as video_file:
        video = video_file.read()
    encoded = base64.b64encode(video)
    return HTML(data='''
        <video width="360" height="auto" alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''
    .format(encoded.decode('ascii')))

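def visualize_agent(folder_path, rollout=1000, softmax_action_selection=False, epsilon=0.0):
    """Load the agent stored in folder_path, record a rollout, and return the replay as an inline video."""
    env, policy = load_policy(folder_path, Policy)
    # The Monitor wrapper writes the recording to ./gym-results (overwriting earlier runs)
    env = wrappers.Monitor(env, "./gym-results", force=True)
    simulate_rollout(policy, env, rollout_limit=rollout, softmax_action_selection=softmax_action_selection, epsilon=epsilon)
    return show_replay(env)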

## A) Action selection
### Softmax action selection
The first aspect worth visualizing is the choice of method for selecting actions from the policy probabilities. The agent can be greedy, meaning that it always takes the action with the highest probability. This is the action selection used during validation of the agent, i.e. for the reward series used in the experiments. The strategy is illustrated below by loading an agent from the LunarLander number-of-processes experiment in the report.

In [45]:
# WITHOUT softmax action selection
path = r".\Data\Experiment Cores LunarLander\23_01"
visualize_agent(path, softmax_action_selection=False, epsilon=0.0)
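
As a minimal sketch of what greedy selection amounts to, assume the policy outputs a list of action probabilities. The function name `greedy_action` and the probability values are made up for illustration and are not taken from `ASSETS.misc`.

```python
def greedy_action(probs):
    """Return the index of the most probable action (ties go to the lowest index)."""
    return max(range(len(probs)), key=lambda a: probs[a])


# Hypothetical policy output for an environment with four discrete actions
probs = [0.05, 0.10, 0.80, 0.05]
action = greedy_action(probs)
print(action)
```

Greedy selection is deterministic: the same state always yields the same action, which is why it gives a stable reward series for validation.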

Another method is **softmax action selection**, which is used during **training** of the agents to provide exploration. The same agent using this method is illustrated below.

In [46]:
# WITH softmax action selection
path = r".\Data\Experiment Cores LunarLander\23_01"
visualize_agent(path, softmax_action_selection=True, epsilon=0.0)
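
The sketch below shows one way to read softmax action selection, assuming it means drawing the action at random according to the policy's softmax output instead of taking the most probable one. The helper `sample_action` and the probabilities are hypothetical; the actual logic lives in `simulate_rollout`.

```python
import random


def sample_action(probs, rng=random):
    """Draw one action index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)  # seeded so the sketch is reproducible
probs = [0.05, 0.10, 0.80, 0.05]  # hypothetical policy output

n_draws = 10_000
counts = [0] * len(probs)
for _ in range(n_draws):
    counts[sample_action(probs, rng)] += 1

# Empirical frequencies approach the policy probabilities
frequencies = [c / n_draws for c in counts]
print(frequencies)
```

Even a dominant action is occasionally skipped, which is the source of exploration during training.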

There should not be (and on my end there is not) much of a difference between the performance in the two episodes above, because the agent has trained for quite a while on 23 processes (see report). Its action probabilities have therefore presumably converged so that one action clearly dominates in each state encountered, and the actions taken will not differ much.

If we instead select an agent that we know has more equal action probabilities, we can see that the choice of action selection makes a difference, and that softmax action selection should indeed not be used for validation! The agent loaded is from the LunarLander-entropy experiment with a very high entropy weight $\beta=2.00$, which forces the action probabilities to be fairly equal.
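
To make "dominant" versus "fairly equal" probabilities concrete, the sketch below compares two made-up distributions over four actions. Neither is read from the trained agents; they only illustrate that sampling agrees with the greedy choice with probability equal to the largest action probability.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical policy outputs, not taken from the trained agents
peaked = [0.94, 0.02, 0.02, 0.02]  # one clearly dominant action
flat = [0.28, 0.24, 0.24, 0.24]    # close to uniform

for name, probs in [("peaked", peaked), ("flat", flat)]:
    # A sampled action equals the greedy action with probability max(probs)
    print(f"{name}: entropy = {entropy(probs):.2f} nats, "
          f"P(sampled == greedy) = {max(probs):.2f}")
```

With the peaked distribution the two selection methods almost always pick the same action, while with the flat one they disagree most of the time.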

In [47]:
# WITHOUT softmax action selection
path = r".\Data\Entropy LunarLander v2\Output6\200_01"
visualize_agent(path, softmax_action_selection=False, epsilon=0.0)

In [49]:
# WITH softmax action selection
path = r".\Data\Entropy LunarLander v2\Output6\200_01"
visualize_agent(path, softmax_action_selection=True, epsilon=0.0)

In this case it makes a big difference: the agent with softmax action selection doesn't even hit the landing target. This is, however, the advantage during training. The agent can explore new regions and potentially discover better strategies (think of a second landing pad with a much higher reward in the LunarLander environment). If there is no big difference on your end, try running the code cells several times to get a feeling for the agent's behavior. The random start conditions create some variance between episodes.

### Epsilon-greedy
Another strategy (which was not used when training the agents in the report, but should have been) is epsilon-greedy: sometimes be greedy, i.e. select the action the agent favors most, and at other times simply select a random action. Whether to be greedy or random is decided by comparing a uniform random roll against a value $\epsilon \in [0,1]$. If the roll is below $\epsilon$, a uniformly random action is chosen; otherwise the greedy action is taken.
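
The rule above can be sketched in a few lines. The function name `epsilon_greedy_action` and the probabilities are hypothetical and only serve as illustration; they are not the implementation behind the `epsilon` argument of `simulate_rollout`.

```python
import random


def epsilon_greedy_action(probs, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(probs))
    return max(range(len(probs)), key=lambda a: probs[a])


rng = random.Random(0)  # seeded so the sketch is reproducible
probs = [0.05, 0.10, 0.80, 0.05]  # hypothetical policy output

# epsilon = 0 recovers pure greedy selection
always_greedy = [epsilon_greedy_action(probs, 0.0, rng) for _ in range(100)]

# epsilon = 1 ignores the policy and picks uniformly at random
fully_random = [epsilon_greedy_action(probs, 1.0, rng) for _ in range(1_000)]
print(set(always_greedy), sorted(set(fully_random)))
```

Unlike softmax action selection, the random branch ignores the policy probabilities entirely, so exploration happens even when one action has a probability close to one.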

Below, the effect of the epsilon-greedy method with $\epsilon=0.1$, $\epsilon=0.4$, and $\epsilon=0.8$ is illustrated using the same agent as at the beginning of the previous section.

In [50]:
# EPSILON=0.1
path = r".\Data\Experiment Cores LunarLander\23_01"
visualize_agent(path, softmax_action_selection=False, epsilon=0.1)

In [51]:
# EPSILON=0.4
path = r".\Data\Experiment Cores LunarLander\23_01"
visualize_agent(path, softmax_action_selection=False, epsilon=0.4)

In [52]:
# EPSILON=0.8
path = r".\Data\Experiment Cores LunarLander\23_01"
visualize_agent(path, softmax_action_selection=False, epsilon=0.8)

It can be seen that the agent performs worse as $\epsilon$ increases (because it is already trained), so epsilon-greedy should likewise not be used when validating the agent. Epsilon-greedy is an effective method for forcing exploration, especially for agents with very confident policies that would not explore much under softmax action selection.
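
A short calculation shows why performance drops so quickly. Assuming the four discrete actions of LunarLander-v2 and that the random branch can also land on the greedy action, the agent takes its preferred action with probability $1 - \epsilon + \epsilon/4$. The helper name `greedy_probability` is made up for this sketch.

```python
def greedy_probability(epsilon, n_actions):
    """Probability that epsilon-greedy ends up taking the greedy action."""
    return (1.0 - epsilon) + epsilon / n_actions


# Assumes four discrete actions, as in LunarLander-v2
for epsilon in (0.1, 0.4, 0.8):
    print(f"epsilon = {epsilon}: P(greedy action) = {greedy_probability(epsilon, 4):.3f}")
```

This gives 0.925, 0.700, and 0.400 for the three values used above, so at $\epsilon=0.8$ the trained policy decides fewer than half of the actions.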

## B) Entropy
In the entropy experiment from the report, it was shown that the entropy weight must be sufficiently large for the agent to solve the MountainCar environment at all. However, the entropy weight can also be too large, which hurts training performance.
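
The report defines the exact loss used for training. As a minimal sketch, assuming the common A3C form in which the policy entropy is subtracted from the per-step policy loss with weight $\beta$, the effect of the entropy weight looks like this. The function name and all numbers are hypothetical.

```python
import math


def policy_loss_with_entropy(probs, action, advantage, beta):
    """Per-step loss: -log pi(action) * advantage - beta * H(pi).

    Assumed standard A3C form; the report's exact loss may differ.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return -math.log(probs[action]) * advantage - beta * entropy


# Hypothetical policy outputs for the three MountainCar actions
peaked = [0.98, 0.01, 0.01]
uniform = [1 / 3, 1 / 3, 1 / 3]

for beta in (0.0, 0.5, 2.0):
    loss_peaked = policy_loss_with_entropy(peaked, 0, 1.0, beta)
    loss_uniform = policy_loss_with_entropy(uniform, 0, 1.0, beta)
    print(f"beta = {beta}: peaked = {loss_peaked:.3f}, uniform = {loss_uniform:.3f}")
```

A larger $\beta$ rewards spread-out probabilities more strongly: a small value leaves the agent free to commit to one action early, while a very large value keeps the policy close to uniform regardless of the advantage.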

The first two code cells load one of the agents trained with the optimal entropy weight ($\beta=0.5$), first in validation mode (without softmax action selection) and then in training mode (with softmax action selection).

In [53]:
# VALIDATION MODE
# ENTROPY WEIGHT = 0.5, NO softmax action selection
path = r".\Data\Experiment Entropy MountainCar 6\050_01"
visualize_agent(path, softmax_action_selection=False, epsilon=0.0)

In [54]:
# TRAINING MODE
# ENTROPY WEIGHT = 0.5, SOFTMAX ACTION SELECTION
path = r".\Data\Experiment Entropy MountainCar 6\050_01"
visualize_agent(path, softmax_action_selection=True, epsilon=0.0)

Here it is seen that in validation mode the agent very effectively oscillates up and down the hills to reach the flag in three movements (forward, back, and then forward all the way to the flag). In training mode the agent reaches the flag in four movements, once again demonstrating the possibility of exploration.

The next MountainCar example shows an agent trained with too large an entropy weight, which yields poor performance in both validation and training mode.

In [55]:
# VALIDATION MODE
# Trained using entropy weight beta=2.0 (way too high)
path = r".\Data\Experiment Entropy MountainCar 6\200_04"
visualize_agent(path, softmax_action_selection=False, epsilon=0.0)

The last example below simply demonstrates the agent solving the Acrobot environment. Every entropy weight value tried solved the environment, ending with the same validation score.

In [56]:
# VALIDATION MODE
path = r".\Data\Entropy Acrobot\001_1"
visualize_agent(path, softmax_action_selection=False, epsilon=0.0)

The end of the notebook.