##### Copyright 2018 The TF-Agents Authors.

### Get Started
<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/agents/blob/master/tf_agents/colabs/4_drivers_tutorial.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/agents/blob/master/tf_agents/colabs/4_drivers_tutorial.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>



In [None]:
# Note: If you haven't installed tf-agents or gym yet, run:
try:
    %tensorflow_version 2.x
except:
    pass
!pip install tf-agents-nightly
!pip install gym

### Imports

In [2]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf


from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.policies import random_py_policy
from tf_agents.policies import random_tf_policy
from tf_agents.metrics import py_metrics
from tf_agents.metrics import tf_metrics
from tf_agents.drivers import py_driver
from tf_agents.drivers import dynamic_episode_driver

tf.compat.v1.enable_v2_behavior()

# Introduction

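A common pattern in reinforcement learning is to execute a policy in an environment for a specified number of steps or episodes. This happens, for example, during data collection, during evaluation, and when generating a video of the agent.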

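While such a loop is relatively straightforward to write in Python, it is much harder to write and debug in TensorFlow because it involves `tf.while_loop`, `tf.cond`, and `tf.control_dependencies`. We therefore abstract the notion of a run loop into a class called `driver` and provide well-tested implementations in both Python and TensorFlow.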

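Additionally, the data encountered by the driver at each step is saved in a named tuple called `Trajectory` and broadcast to a set of observers such as replay buffers and metrics. This data includes the observation from the environment, the action recommended by the policy, the reward obtained, and the types of the current and the next step.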
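To make the observer idea concrete, here is a minimal sketch in plain Python that does not use TF-Agents. The `Trajectory` namedtuple below is a simplified stand-in that only mirrors the field names visible in the printed output later in this tutorial, and `StepCounter` and `broadcast` are hypothetical names invented for illustration. The point is that any callable accepting a trajectory can act as an observer.

```python
import collections

# Simplified stand-in for the TF-Agents Trajectory (an assumption for this
# sketch); the field names follow the printed output shown in this tutorial.
Trajectory = collections.namedtuple(
    'Trajectory',
    ['step_type', 'observation', 'action', 'policy_info',
     'next_step_type', 'reward', 'discount'])


class StepCounter(object):
  """Hypothetical observer that counts the trajectories it receives."""

  def __init__(self):
    self.count = 0

  def __call__(self, traj):
    self.count += 1


def broadcast(traj, observers):
  """Passes one trajectory to every observer, as a driver does at each step."""
  for observer in observers:
    observer(traj)


replay_buffer = []  # list.append is already a valid observer
counter = StepCounter()
observers = [replay_buffer.append, counter]

traj = Trajectory(step_type=0, observation=[0.0, 0.0], action=1,
                  policy_info=(), next_step_type=1, reward=1.0, discount=1.0)
broadcast(traj, observers)
broadcast(traj._replace(step_type=1), observers)

print(len(replay_buffer), counter.count)
```

Because observers are just callables, a plain list, a metric object, and a real replay buffer can all be attached to the same driver without the driver knowing which is which.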

# Python Drivers

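The `PyDriver` class takes a Python environment, a Python policy, and a list of observers to update at each step. Its main method is `run()`, which steps the environment using actions from the policy until at least one of the following termination criteria is met: the number of steps reaches `max_steps` or the number of episodes reaches `max_episodes`.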

The implementation is roughly as follows:

```python
import numpy as np
from tf_agents.trajectories import trajectory


class PyDriver(object):

  def __init__(self, env, policy, observers, max_steps=1, max_episodes=1):
    self._env = env
    self._policy = policy
    self._observers = observers or []
    self._max_steps = max_steps or np.inf
    self._max_episodes = max_episodes or np.inf

  def run(self, time_step, policy_state=()):
    num_steps = 0
    num_episodes = 0
    while num_steps < self._max_steps and num_episodes < self._max_episodes:

      # Compute an action using the policy for the given time_step
      action_step = self._policy.action(time_step, policy_state)

      # Apply the action to the environment and get the next step
      next_time_step = self._env.step(action_step.action)

      # Package information into a trajectory
      traj = trajectory.Trajectory(
         time_step.step_type,
         time_step.observation,
         action_step.action,
         action_step.info,
         next_time_step.step_type,
         next_time_step.reward,
         next_time_step.discount)

      for observer in self._observers:
        observer(traj)

      # Update statistics to check termination
      num_episodes += np.sum(traj.is_last())
      num_steps += np.sum(~traj.is_boundary())

      time_step = next_time_step
      policy_state = action_step.state

    return time_step, policy_state

```
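The interplay of the two termination criteria can be tried out without TF-Agents. The sketch below reuses the loop shape of `run()` on a hypothetical toy environment, `FixedLengthEnv`, whose episodes always last a fixed number of steps. The step-type constants and the `run_loop` helper are assumptions made for this illustration, and the policy is replaced by a constant action.

```python
FIRST, MID, LAST = 0, 1, 2  # step types used by this sketch


class FixedLengthEnv(object):
  """Hypothetical toy environment whose episodes last `episode_len` steps."""

  def __init__(self, episode_len):
    self._episode_len = episode_len
    self._t = 0

  def reset(self):
    self._t = 0
    return FIRST

  def step(self, action):
    # Stepping after an episode has ended starts a new episode. This
    # LAST -> FIRST transition is a boundary and is not counted as a step.
    if self._t == self._episode_len:
      return self.reset()
    self._t += 1
    return LAST if self._t == self._episode_len else MID


def run_loop(env, step_type, max_steps=None, max_episodes=None):
  """Steps `env` until either limit is reached; returns (steps, episodes)."""
  max_steps = max_steps or float('inf')
  max_episodes = max_episodes or float('inf')
  num_steps = 0
  num_episodes = 0
  while num_steps < max_steps and num_episodes < max_episodes:
    next_step_type = env.step(0)  # constant action stands in for a policy
    num_episodes += int(next_step_type == LAST)  # the episode just ended
    num_steps += int(step_type != LAST)          # skip boundary transitions
    step_type = next_step_type
  return num_steps, num_episodes


env = FixedLengthEnv(episode_len=5)
print(run_loop(env, env.reset(), max_steps=20, max_episodes=1))  # (5, 1)

env = FixedLengthEnv(episode_len=5)
print(run_loop(env, env.reset(), max_steps=20))  # (20, 4)

env = FixedLengthEnv(episode_len=5)
print(run_loop(env, env.reset(), max_steps=7))  # (7, 1)
```

With both limits set, the loop stops at whichever is reached first: here the single five-step episode ends long before the 20-step budget. With only `max_steps` set, the run can stop in the middle of an episode, as the last call shows.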

Now let us run a random policy on the CartPole environment, saving the results to a replay buffer and computing some metrics.

In [4]:
env = suite_gym.load('CartPole-v0')
policy = random_py_policy.RandomPyPolicy(time_step_spec=env.time_step_spec(),
                                         action_spec=env.action_spec())
replay_buffer = []
metric = py_metrics.AverageReturnMetric()
observers = [replay_buffer.append, metric]
driver = py_driver.PyDriver(env,
                            policy,
                            observers,
                            max_steps=20,
                            max_episodes=1)

initial_time_step = env.reset()
final_time_step, _ = driver.run(initial_time_step)

print('Replay Buffer:')
for traj in replay_buffer:
    print(traj)

print('Average Return: ', metric.result())

Replay Buffer:
Trajectory(step_type=array(0, dtype=int32), observation=array([-0.02310036, -0.02402413,  0.02206977,  0.04428826], dtype=float32), action=array(0), policy_info=(), next_step_type=array(1, dtype=int32), reward=array(1., dtype=float32), discount=array(1., dtype=float32))
Trajectory(step_type=array(1, dtype=int32), observation=array([-0.02358084, -0.2194555 ,  0.02295554,  0.3438519 ], dtype=float32), action=array(1), policy_info=(), next_step_type=array(1, dtype=int32), reward=array(1., dtype=float32), discount=array(1., dtype=float32))
Trajectory(step_type=array(1, dtype=int32), observation=array([-0.02796995, -0.0246675 ,  0.02983258,  0.05849522], dtype=float32), action=array(0), policy_info=(), next_step_type=array(1, dtype=int32), reward=array(1., dtype=float32), discount=array(1., dtype=float32))
Trajectory(step_type=array(1, dtype=int32), observation=array([-0.0284633 , -0.2202042 ,  0.03100248,  0.36043924], dtype=float32), action=array(0), policy_info=(), next_st

# TensorFlow Drivers

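We also have drivers in TensorFlow that are functionally similar to the Python drivers but use TF environments, TF policies, TF observers, and so on. There are currently two TensorFlow drivers: `DynamicStepDriver`, which terminates after a given number of (valid) environment steps, and `DynamicEpisodeDriver`, which terminates after a given number of episodes. Let us look at an example of the `DynamicEpisodeDriver` in action.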

In [5]:
env = suite_gym.load('CartPole-v0')
tf_env = tf_py_environment.TFPyEnvironment(env)

tf_policy = random_tf_policy.RandomTFPolicy(action_spec=tf_env.action_spec(),
                                            time_step_spec=tf_env.time_step_spec())


num_episodes = tf_metrics.NumberOfEpisodes()
env_steps = tf_metrics.EnvironmentSteps()
observers = [num_episodes, env_steps]
driver = dynamic_episode_driver.DynamicEpisodeDriver(
    tf_env, tf_policy, observers, num_episodes=2)

# Initial driver.run will reset the environment and initialize the policy.
final_time_step, policy_state = driver.run()

print('final_time_step', final_time_step)
print('Number of Steps: ', env_steps.result().numpy())
print('Number of Episodes: ', num_episodes.result().numpy())

final_time_step TimeStep(step_type=<tf.Tensor: id=2194, shape=(1,), dtype=int32, numpy=array([0], dtype=int32)>, reward=<tf.Tensor: id=2195, shape=(1,), dtype=float32, numpy=array([0.], dtype=float32)>, discount=<tf.Tensor: id=2196, shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>, observation=<tf.Tensor: id=2197, shape=(1, 4), dtype=float32, numpy=
array([[-0.04311429,  0.02406621,  0.02612159, -0.01080346]],
      dtype=float32)>)
Number of Steps:  53
Number of Episodes:  2


In [6]:
# Continue running from previous state
final_time_step, _ = driver.run(final_time_step, policy_state)

print('final_time_step', final_time_step)
print('Number of Steps: ', env_steps.result().numpy())
print('Number of Episodes: ', num_episodes.result().numpy())

final_time_step TimeStep(step_type=<tf.Tensor: id=3845, shape=(1,), dtype=int32, numpy=array([0], dtype=int32)>, reward=<tf.Tensor: id=3846, shape=(1,), dtype=float32, numpy=array([0.], dtype=float32)>, discount=<tf.Tensor: id=3847, shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>, observation=<tf.Tensor: id=3848, shape=(1, 4), dtype=float32, numpy=
array([[ 0.04527216,  0.03846889, -0.02772892, -0.00087081]],
      dtype=float32)>)
Number of Steps:  93
Number of Episodes:  4
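
Note that the counts printed after the second call include those from the first call: the metric objects keep their state between runs, so the second run itself contributed 93 - 53 = 40 steps and 4 - 2 = 2 episodes. The sketch below illustrates this behaviour in plain Python. `EpisodeCounter` and `fake_run` are hypothetical stand-ins written for this illustration, not TF-Agents classes, and the `reset()` method is an assumption of the sketch showing one way to get per-run numbers.

```python
class EpisodeCounter(object):
  """Hypothetical stand-in for a stateful metric that counts episodes."""

  def __init__(self):
    self._count = 0

  def __call__(self, is_last):
    # Called once per step; counts the steps that end an episode.
    self._count += int(is_last)

  def result(self):
    return self._count

  def reset(self):
    self._count = 0


def fake_run(observer, num_episodes, steps_per_episode=3):
  """Feeds the observer the end-of-episode flags of a few short episodes."""
  for _ in range(num_episodes):
    for t in range(steps_per_episode):
      observer(t == steps_per_episode - 1)


counter = EpisodeCounter()

fake_run(counter, num_episodes=2)
first = counter.result()   # counts from the first run only

fake_run(counter, num_episodes=2)
second = counter.result()  # cumulative over both runs

counter.reset()
fake_run(counter, num_episodes=2)
third = counter.result()   # counts since the reset

print(first, second, third)  # 2 4 2
```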
