# Introduction to Reinforcement Learning with TF-Agents (AML Fall 2019)

This tutorial is adapted from the [official tutorial](https://github.com/tensorflow/agents/blob/master/tf_agents/colabs/1_dqn_tutorial.ipynb) and rewritten for this course. Below, we show by example how to train reinforcement-learning agents on the CartPole environment using the TF-Agents library.

**What is TF-Agents?**

TF-Agents is a robust, scalable, and easy-to-use RL library for TensorFlow. It is still under development, so this tutorial uses the nightly preview build.

**Why TF-Agents?**

 - It is compatible with other TensorFlow high-level APIs, so it can draw on the wider TensorFlow ecosystem.
 - It standardizes common steps of RL and thus helps researchers to try and test new RL algorithms quickly.
 - It is well tested and easy to configure with gin-config.

**Official GitHub Repo:**

For the most complete and up-to-date information, please refer to the [official GitHub repo](https://github.com/tensorflow/agents/tree/master/tf_agents).


# Setup

In [0]:
# Note: If you haven't installed the following dependencies, run:
!apt-get install xvfb
!pip install 'gym==0.10.11'
!pip install 'imageio==2.4.0'
!pip install PILLOW
!pip install 'pyglet==1.3.2'
!pip install pyvirtualdisplay
!pip install tf-agents-nightly
try:
  # Colab-only line magic; a cell magic (%%) is not valid inside a try block.
  %tensorflow_version 2.x
except Exception:
  pass

In [0]:
!pip install --upgrade tf-nightly

In [0]:
!pip install tf-agents-nightly

In [0]:
from __future__ import absolute_import, division, print_function

import base64
import imageio
import IPython
import matplotlib
import matplotlib.pyplot as plt
import PIL.Image
import pyvirtualdisplay

import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.agents.reinforce import reinforce_agent
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.drivers import dynamic_step_driver
from tf_agents.drivers import dynamic_episode_driver
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.metrics import py_metric
from tf_agents.metrics import tf_py_metric
from tf_agents.networks import q_network
from tf_agents.networks import actor_distribution_network
from tf_agents.policies import random_tf_policy
from tf_agents.policies import q_policy
from tf_agents.trajectories import trajectory
from tf_agents.utils import common


In [0]:
tf.version.VERSION

# Load the Environment

In Reinforcement Learning (RL), an environment represents the task or problem to be solved. Standard environments can be created in TF-Agents using `tf_agents.environments` suites. TF-Agents has suites for loading environments from sources such as the OpenAI Gym, Atari, and DM Control.

In this tutorial, we will use the CartPole-v0 environment from the OpenAI Gym suite.

From the [official website](https://gym.openai.com/envs/CartPole-v0/) description of the problem: A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
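
The termination rule and reward scheme quoted above can be written out directly. The following is a minimal, library-free sketch, not Gym's implementation: `episode_done` and `episode_return` are hypothetical helper names, and the thresholds are simply copied from the description above.

```python
# Thresholds as quoted in the description above (assumed, not read from Gym).
MAX_ANGLE_DEG = 15.0
MAX_CART_OFFSET = 2.4


def episode_done(cart_position, pole_angle_deg):
    """Return True once the pole or the cart leaves the allowed range."""
    return abs(pole_angle_deg) > MAX_ANGLE_DEG or abs(cart_position) > MAX_CART_OFFSET


def episode_return(states):
    """Sum the +1 rewards over a sequence of (cart_position, pole_angle_deg) states.

    Each state in which the pole is still upright earns +1; counting stops at
    the first state that ends the episode.
    """
    total = 0.0
    for cart_position, pole_angle_deg in states:
        if episode_done(cart_position, pole_angle_deg):
            break
        total += 1.0
    return total


states = [(0.0, 1.0), (0.1, 5.0), (0.3, 16.0), (0.4, 2.0)]
print(episode_return(states))  # 2.0
```

The third state tilts the pole past 15 degrees, so only the first two timesteps contribute to the return.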

In [0]:
# Load the CartPole environment
env_name = 'CartPole-v0'
env = suite_gym.load(env_name)

# Inspect the environment
print('Observation Spec:\n', env.time_step_spec().observation)
print('Reward Spec:\n', env.time_step_spec().reward)
print('Action Spec:\n', env.action_spec())

The `environment.step` method takes an `action` in the environment and returns a `TimeStep` tuple containing the next observation of the environment and the reward for the action.

The `time_step_spec()` method returns the specification for the `TimeStep` tuple. Its `observation` attribute shows the shape of observations, the data types, and the ranges of allowed values. The `reward` attribute shows the same details for the reward.

The `action_spec()` method returns the shape, data types, and allowed values of valid actions.

In the CartPole environment:

-   `observation` is an array of 4 floats: 
    -   the position and velocity of the cart
    -   the angular position and velocity of the pole 
-   `reward` is a scalar float value
-   `action` is a scalar integer with only two possible values:
    -   `0` — "move left"
    -   `1` — "move right"
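
To make the idea of a spec concrete, here is a small sketch of what "shape, data type, and allowed values" means for the CartPole action. `ToyBoundedSpec` is a hypothetical stand-in written for this illustration; it is not the TF-Agents spec class and only models the dtype and the bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyBoundedSpec:
    """Hypothetical stand-in for a bounded spec: dtype plus inclusive bounds."""
    dtype: type
    minimum: float
    maximum: float

    def is_valid(self, value):
        """Check that `value` has the right type and lies within the bounds."""
        return isinstance(value, self.dtype) and self.minimum <= value <= self.maximum


# CartPole's action as described above: a scalar integer that is either 0 or 1.
action_spec = ToyBoundedSpec(dtype=int, minimum=0, maximum=1)

print(action_spec.is_valid(0))    # True  ("move left")
print(action_spec.is_valid(1))    # True  ("move right")
print(action_spec.is_valid(2))    # False (out of range)
print(action_spec.is_valid(0.5))  # False (not an integer)
```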


The CartPole environment, like most environments, is written in pure Python. It is converted to TensorFlow using the `TFPyEnvironment` wrapper.

The original environment's API uses NumPy arrays. `TFPyEnvironment` converts these to `Tensors` so that the environment is compatible with TensorFlow agents and policies.


In [0]:
tf_env = tf_py_environment.TFPyEnvironment(env)

## Visualization

In [0]:
# Set up a virtual display for rendering OpenAI gym environments.
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()

#@test {"skip": true}
env.reset()
PIL.Image.fromarray(env.render())

# Define the Policy/Agent and Start Training

Policies can be created independently of agents. For example, use `tf_agents.policies.random_tf_policy` to create a policy which will randomly select an action for each `time_step`.

To get an action from a policy, call the `policy.action(time_step)` method. The `time_step` contains the observation from the environment. This method returns a `PolicyStep`, which is a named tuple with three components:

-   `action` — the action to be taken (in this case, `0` or `1`)
-   `state` — used for stateful (that is, RNN-based) policies
-   `info` — auxiliary data, such as log probabilities of actions
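
The shape of this return value can be mimicked with a plain named tuple. The sketch below is library-free and purely illustrative: `ToyPolicyStep` and `random_policy_action` are hypothetical names, and the dictionary used as a time step is an assumption made for this example, not the TF-Agents `TimeStep` type.

```python
import collections
import random

# Hypothetical stand-in mirroring the three fields described above.
ToyPolicyStep = collections.namedtuple("ToyPolicyStep", ["action", "state", "info"])


def random_policy_action(time_step, rng):
    """Ignore the observation and pick action 0 or 1 uniformly at random."""
    del time_step  # a random policy does not look at the observation
    return ToyPolicyStep(action=rng.choice([0, 1]), state=(), info=())


rng = random.Random(0)
step = random_policy_action(time_step={"observation": [0.0, 0.0, 0.0, 0.0]}, rng=rng)
print(step.action in (0, 1))  # True
print(step.state, step.info)  # () ()
```

A stateless policy leaves `state` and `info` empty, which is why both fields are empty tuples here.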

## Example 1: Random Policy

In [0]:
from tf_agents.policies import random_py_policy
from tf_agents.metrics import py_metrics
from tf_agents.drivers import py_driver

### Define the Policy

First, let's try to use the built-in random policy.

In [0]:
tf_policy = random_tf_policy.RandomTFPolicy(action_spec=tf_env.action_spec(),
                                            time_step_spec=tf_env.time_step_spec())

### Train Through a Driver

In TF-Agents, a Driver collects experience in an environment. A Driver takes a list of Observers, which are callables that the Driver invokes on every trajectory it produces.

Thus, to add trajectory elements to a replay buffer, we add an observer that calls `add_batch(items)`, which adds a batch of items to the replay buffer.

The example below shows how to use a [driver](https://github.com/tensorflow/agents/tree/master/tf_agents/drivers) to control the workflow. Note that the policy in this example is not trained, so there is no RL loop here.
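
The driver/observer pattern itself is independent of TensorFlow. As a rough sketch under simplifying assumptions (a toy environment with fixed-length episodes, dictionaries instead of `TimeStep` objects, and hypothetical names such as `ToyEnv` and `run_driver`), the control flow looks like this:

```python
import random


class ToyEnv:
    """Hypothetical environment: every episode lasts exactly `episode_length` steps."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return {"observation": self.t, "reward": 0.0, "is_last": False}

    def step(self, action):
        del action  # the toy dynamics ignore the action
        self.t += 1
        return {"observation": self.t, "reward": 1.0,
                "is_last": self.t >= self.episode_length}


def run_driver(env, policy, observers, max_episodes):
    """Step `env` with `policy` and pass every transition to each observer."""
    episodes = 0
    time_step = env.reset()
    while episodes < max_episodes:
        action = policy(time_step)
        next_time_step = env.step(action)
        transition = (time_step, action, next_time_step)
        for observer in observers:
            observer(transition)
        if next_time_step["is_last"]:
            episodes += 1
            time_step = env.reset()
        else:
            time_step = next_time_step


rng = random.Random(0)
replay = []                    # plays the role of the replay buffer
step_counter = {"steps": 0}    # plays the role of a step-count metric


def count_steps(transition):
    step_counter["steps"] += 1


run_driver(ToyEnv(), lambda ts: rng.choice([0, 1]),
           observers=[replay.append, count_steps], max_episodes=3)
print(step_counter["steps"], len(replay))  # 15 15
```

Both observers see the same 15 transitions (3 episodes of 5 steps), which is the sense in which metrics and replay buffers can share one driver.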

In [0]:
num_episodes = tf_metrics.NumberOfEpisodes()
env_steps = tf_metrics.EnvironmentSteps()
ave_return = tf_metrics.AverageReturnMetric()
observers = [num_episodes, env_steps, ave_return]

# define the driver
driver = py_driver.PyDriver(
         tf_env, tf_policy, observers, max_steps=200, max_episodes=10)

# PyDriver.run requires an initial time step, so reset the environment first.
initial_time_step = tf_env.reset()
final_time_step, policy_state = driver.run(initial_time_step)

print('final_time_step:', final_time_step)
print('policy_state:', policy_state)
print('Number of Steps: ', env_steps.result().numpy())
print('Number of Episodes: ', num_episodes.result().numpy())
print('Average Return: ', ave_return.result().numpy())

## Example 2: Deep Q-Learning

### Same setting, but changing the policy

The following example shows how to use a different built-in TF-Agents policy. Note that it only demonstrates how to swap policies: the Q-network here is never trained, so this is not how Q-learning should be done. The proper way to do deep Q-learning is shown in a later example.

In [0]:
from tf_agents.policies import q_policy

In [0]:
# Create q network
fc_layer_params = (100,)

q_net = q_network.QNetwork(tf_env.observation_spec(),
                           tf_env.action_spec(),
                           fc_layer_params=fc_layer_params)

In [0]:
tf_policy = q_policy.QPolicy(
    tf_env.time_step_spec(), tf_env.action_spec(), q_network=q_net)

num_episodes = tf_metrics.NumberOfEpisodes()
env_steps = tf_metrics.EnvironmentSteps()
metric = tf_metrics.AverageReturnMetric()
observers = [num_episodes, env_steps, metric]
driver = dynamic_episode_driver.DynamicEpisodeDriver(
    tf_env, tf_policy, observers, num_episodes=10)

# initial driver.run will reset the environment and initialize the policy.
final_time_step, policy_state = driver.run()

print('final_time_step', final_time_step)
print('Number of Steps: ', env_steps.result().numpy())
print('Number of Episodes: ', num_episodes.result().numpy())
print('Average Return: ', metric.result().numpy())

### Use The Agent Class

The code below shows a proper way to train a deep Q-network with a `TFUniformReplayBuffer`. We first create an environment, a network, and an agent, and then a `TFUniformReplayBuffer`. Note that the spec of the trajectory elements in the replay buffer equals the agent's `collect_data_spec`. We then set the buffer's `add_batch` method as the observer for the driver that collects data during training.
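
Conceptually, a uniform replay buffer is a fixed-capacity store that is filled by an observer and sampled uniformly at random. The following is a minimal standard-library sketch of that idea; `ToyReplayBuffer` is a hypothetical class for illustration and does not reproduce the `TFUniformReplayBuffer` API beyond borrowing the `add_batch` name.

```python
import collections
import random


class ToyReplayBuffer:
    """Hypothetical fixed-capacity buffer; the oldest items are overwritten first."""

    def __init__(self, max_length):
        self._items = collections.deque(maxlen=max_length)

    def add_batch(self, item):
        """Observer-compatible: takes one item and stores it."""
        self._items.append(item)

    def sample(self, batch_size, rng):
        """Draw `batch_size` stored items uniformly at random, with replacement."""
        items = list(self._items)
        return [rng.choice(items) for _ in range(batch_size)]

    def __len__(self):
        return len(self._items)


buffer = ToyReplayBuffer(max_length=3)
for step in range(5):          # add 5 items to a buffer that holds only 3
    buffer.add_batch(step)

print(len(buffer))             # 3
print(sorted(buffer._items))   # [2, 3, 4]
```

Because `add_batch` takes a single argument, it can be handed to a driver as an observer without any wrapper.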

In this example, we use a Q-network with a single fully connected hidden layer of 100 units. We first define the network and then pass it to the `dqn_agent.DqnAgent` class.

In [0]:
q_net = q_network.QNetwork(
    tf_env.time_step_spec().observation,
    tf_env.action_spec(),
    fc_layer_params=(100,))

agent = dqn_agent.DqnAgent(
    tf_env.time_step_spec(),
    tf_env.action_spec(),
    q_network=q_net,
    optimizer=tf.compat.v1.train.AdamOptimizer(0.001))

replay_buffer_capacity = 1000

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    agent.collect_data_spec,
    batch_size=tf_env.batch_size,
    max_length=replay_buffer_capacity)

In [0]:
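# Build a driver that runs the agent's collect policy for 1000 steps and
# writes every trajectory into the replay buffer.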
def collect_training_data():
  dynamic_step_driver.DynamicStepDriver(
    tf_env,
    agent.collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=1000).run()

Next, the agent needs access to the replay buffer. This is provided by creating an iterable `tf.data.Dataset` pipeline which will feed data to the agent.

Each row of the replay buffer only stores a single observation step. But since the DQN Agent needs both the current and next observation to compute the loss, the dataset pipeline will sample two adjacent rows for each item in the batch (`num_steps=2`).

The dataset can be further optimized with parallel calls (`num_parallel_calls`) and prefetching (`.prefetch`), as the official tutorial does; the pipeline below omits both.
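
The reason two adjacent rows are needed can be made concrete with the standard one-step Q-learning target, reward + discount * max over next-state Q-values. The sketch below is library-free and illustrative only: `sample_adjacent_pair`, `td_target`, and the toy Q-table are hypothetical, and a real DQN agent computes the next-state values with a neural network rather than a lookup table.

```python
import random


def sample_adjacent_pair(rows, rng):
    """Pick a random index i and return rows i and i + 1 (like num_steps=2)."""
    i = rng.randrange(len(rows) - 1)
    return rows[i], rows[i + 1]


def td_target(reward, discount, next_q_values):
    """One-step Q-learning target: reward + discount * max over next Q-values."""
    return reward + discount * max(next_q_values)


# Each row stores a single step: the observation seen and the reward that followed.
rows = [
    {"observation": "s0", "reward": 1.0},
    {"observation": "s1", "reward": 1.0},
    {"observation": "s2", "reward": 1.0},
]
# Toy Q-values per observation, one entry per action (0 = left, 1 = right).
q_table = {"s0": [0.5, 0.2], "s1": [0.1, 0.9], "s2": [0.3, 0.4]}

current, nxt = sample_adjacent_pair(rows, random.Random(0))
target = td_target(current["reward"], 0.99, q_table[nxt["observation"]])
print(current["observation"], "->", nxt["observation"], "target:", target)
```

A single row carries only the current observation, so the next-state Q-values in the target would be unavailable without the second row.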

In [0]:
def train_agent():
  dataset = replay_buffer.as_dataset(
      sample_batch_size=100,
      num_steps=2)

  iterator = iter(dataset)

  loss = None
  for _ in range(100):
    trajectories, _ = next(iterator)
    loss = agent.train(experience=trajectories)
    
  print('Training loss: ', loss.loss.numpy())
  return loss.loss.numpy()

### Start Training

In [0]:
import matplotlib.pyplot as plt
import numpy as np

training_loss = []

for i in range(20):
  print('Step ', i)
  collect_training_data()
  training_loss.append(train_agent())

print(training_loss)

### Evaluate the Agent

You can also define your own metric and add it to the observers. We demonstrate this by reusing the earlier code and adding a metric that tracks the maximum episode score when evaluating the agent.
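
Before reading the metric class, it may help to see its arithmetic in isolation. The sketch below mirrors the `adjusted_discounts` logic of `MaxEpisodeScoreMetric` on plain Python lists; the function names and the sample episodes are hypothetical and chosen only for illustration.

```python
def episode_score(rewards, discounts):
    """Weight each reward by the discount of the step before it, then sum.

    The first reward gets weight 1.0 and the final discount is dropped,
    mirroring the `adjusted_discounts` logic of the metric.
    """
    adjusted_discounts = [1.0] + list(discounts[:-1])
    return sum(r * d for r, d in zip(rewards, adjusted_discounts))


def max_episode_score(episodes):
    """Return the best score over (rewards, discounts) pairs, or None if empty."""
    scores = [episode_score(r, d) for r, d in episodes]
    return max(scores) if scores else None


episodes = [
    ([1.0, 1.0, 1.0], [1.0, 1.0, 0.0]),              # 3 steps, score 3.0
    ([1.0, 1.0, 1.0, 1.0, 1.0], [1.0] * 4 + [0.0]),  # 5 steps, score 5.0
]
print(max_episode_score(episodes))  # 5.0
```

The terminal discount of 0.0 never affects the score because it is dropped before the multiplication.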

In [0]:
class MaxEpisodeScoreMetric(py_metric.PyStepMetric):
  """Tracks the highest discounted episode score seen so far."""

  def __init__(self, name='MaxEpisodeScoreMetric'):
    super(MaxEpisodeScoreMetric, self).__init__(name)
    self.reset()

  def reset(self):
    self.rewards = []
    self.discounts = []
    self.max_discounted_reward = None

  def call(self, trajectory):
    # The trajectory is batched, so += extends the lists element by element.
    self.rewards += trajectory.reward
    self.discounts += trajectory.discount

    if trajectory.is_last():
      # Each reward is weighted by the discount of the previous step; the
      # first step has weight 1.0.
      adjusted_discounts = [1.0] + self.discounts
      # Drop the last discount: no step follows it, so it is never applied.
      adjusted_discounts = adjusted_discounts[:-1]
      discounted_reward = np.sum(np.multiply(self.rewards, adjusted_discounts))
      print(self.rewards, adjusted_discounts, discounted_reward)

      if (self.max_discounted_reward is None or
          discounted_reward > self.max_discounted_reward):
        self.max_discounted_reward = discounted_reward

      # Start accumulating the next episode from scratch.
      self.rewards = []
      self.discounts = []

  def result(self):
    return self.max_discounted_reward

In [0]:
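class TFMaxEpisodeScoreMetric(tf_py_metric.TFPyMetric):
  """TensorFlow wrapper around MaxEpisodeScoreMetric."""

  def __init__(self, name='MaxEpisodeScoreMetric', dtype=tf.float32):
    # Pass the metric inline so the local name does not shadow the
    # imported `py_metric` module.
    super(TFMaxEpisodeScoreMetric, self).__init__(
        py_metric=MaxEpisodeScoreMetric(), name=name, dtype=dtype)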

In [0]:
def evaluate_agent():
  max_score = TFMaxEpisodeScoreMetric() 
  observers = [max_score]
  driver = dynamic_episode_driver.DynamicEpisodeDriver(
      tf_env, agent.policy, observers, num_episodes=100)

  final_time_step, policy_state = driver.run()

  print('Max test score:', max_score.result().numpy())
  return max_score.result().numpy()

In [0]:
training_loss = []
max_test_score = []

for i in range(20):
  print('Step ', i)
  collect_training_data()
  training_loss.append(train_agent())
  max_test_score.append(evaluate_agent())

plt.plot(np.arange(1, 21, step = 1), training_loss, c = 'orange', label = 'Training loss')
plt.plot(np.arange(1, 21, step = 1), max_test_score, c = 'blue', label = 'Max test score')
plt.axhline(1.715, c = 'gray', linestyle='dashed', label = 'Max possible score')
plt.xlabel('Iteration')
plt.grid(True)
plt.title('Training loss and max test score')
plt.xticks(np.arange(1, 21))
plt.legend()