##### Copyright 2023 The TF-Agents Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Train a Deep Q Network with TF-Agents

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial">
    <img src="https://www.tensorflow.org/images/tf_logo_32px.png" />
    View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/Supra-CN/agents/blob/master/docs/practice/cartpole_dqn.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
    Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/Supra-CN/agents/blob/master/docs/practice/cartpole_dqn.ipynb">
    <img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />
    View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/agents/docs/practice/cartpole_dqn.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

## Introduction


This example shows how to train a [DQN (Deep Q Networks)](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf) agent on the Cartpole environment using the TF-Agents library.

![Cartpole environment](https://raw.githubusercontent.com/tensorflow/agents/master/docs/tutorials/images/cartpole.png)

It will walk you through all the components in a Reinforcement Learning (RL) pipeline for training, evaluation and data collection.


To run this code live, click the 'Run in Google Colab' link above.

## Setup

If you haven't installed the following dependencies, run:

In [None]:
!sudo apt-get update
!sudo apt-get install -y xvfb ffmpeg freeglut3-dev
!pip install 'imageio==2.4.0'
!pip install pyvirtualdisplay
!pip install 'tf-agents[reverb]'
!pip install pyglet

In [None]:
from __future__ import absolute_import, division, print_function

import base64
import imageio
import IPython
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import PIL.Image
import pyvirtualdisplay
import reverb

import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import py_driver
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import sequential
from tf_agents.policies import py_tf_eager_policy
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import reverb_replay_buffer
from tf_agents.replay_buffers import reverb_utils
from tf_agents.trajectories import trajectory
from tf_agents.specs import tensor_spec
from tf_agents.utils import common

In [None]:
# Set up a virtual display for rendering OpenAI gym environments.
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()

In [None]:
tf.version.VERSION

## Hyperparameters

In [None]:
num_iterations = 20000  # @param {type:"integer"}

initial_collect_steps = 100  # @param {type:"integer"}
collect_steps_per_iteration = 1  # @param {type:"integer"}
replay_buffer_max_length = 100000  # @param {type:"integer"}

batch_size = 64  # @param {type:"integer"}
learning_rate = 1e-3  # @param {type:"number"}
log_interval = 200  # @param {type:"integer"}

num_eval_episodes = 10  # @param {type:"integer"}
eval_interval = 1000  # @param {type:"integer"}

## Environment

In Reinforcement Learning (RL), an environment represents the task or problem to be solved. Standard environments can be created in TF-Agents using `tf_agents.environments` suites. TF-Agents has suites for loading environments from sources such as the OpenAI Gym, Atari, and DM Control.

Load the CartPole environment from the OpenAI Gym suite. 

In [None]:
env_name = 'CartPole-v0'
env = suite_gym.load(env_name)

You can render this environment to see how it looks. A free-swinging pole is attached to a cart. The goal is to move the cart right or left in order to keep the pole pointing up.

In [None]:
#@test {"skip": true}
env.reset()
PIL.Image.fromarray(env.render())

The `environment.step` method takes an `action` in the environment and returns a `TimeStep` tuple containing the next observation of the environment and the reward for the action.

The `time_step_spec()` method returns the specification for the `TimeStep` tuple. Its `observation` attribute shows the shape of observations, the data types, and the ranges of allowed values. The `reward` attribute shows the same details for the reward.


In [None]:
print('Observation Spec:')
print(env.time_step_spec().observation)

In [None]:
print('Reward Spec:')
print(env.time_step_spec().reward)

The `action_spec()` method returns the shape, data types, and allowed values of valid actions.

In [None]:
print('Action Spec:')
print(env.action_spec())

In the Cartpole environment:

-   `observation` is an array of 4 floats: 
    -   the position and velocity of the cart
    -   the angular position and velocity of the pole 
-   `reward` is a scalar float value
-   `action` is a scalar integer with only two possible values:
    -   `0` — "move left"
    -   `1` — "move right"


In [None]:
time_step = env.reset()
print('Time step:')
print(time_step)

action = np.array(1, dtype=np.int32)

next_time_step = env.step(action)
print('Next time step:')
print(next_time_step)
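
To make the `reset`/`step` contract concrete without any TF-Agents dependency, here is a minimal sketch of a toy environment. `ToyCountdownEnv` and the `toy_*` names are hypothetical and exist only for this illustration; the tuple mirrors the field names of the `TimeStep` printed above, but it is a plain `namedtuple`, not the library class.

```python
import collections

# Hypothetical stand-in for the TimeStep tuple (same field names, plain Python).
TimeStep = collections.namedtuple(
    'TimeStep', ['step_type', 'reward', 'discount', 'observation'])

FIRST, MID, LAST = 0, 1, 2  # Markers for the position of a step in an episode.


class ToyCountdownEnv:
  """Toy environment: the episode ends after `max_steps` steps, reward 1.0 each."""

  def __init__(self, max_steps=3):
    self._max_steps = max_steps
    self._steps = 0

  def reset(self):
    # The first time step of an episode carries no reward.
    self._steps = 0
    return TimeStep(FIRST, 0.0, 1.0, [0.0])

  def step(self, action):
    # The action is ignored in this toy; a real environment would apply it.
    self._steps += 1
    done = self._steps >= self._max_steps
    return TimeStep(LAST if done else MID, 1.0, 0.0 if done else 1.0,
                    [float(self._steps)])


toy_env = ToyCountdownEnv(max_steps=3)
toy_step = toy_env.reset()
toy_rewards = []
while toy_step.step_type != LAST:
  toy_step = toy_env.step(action=1)
  toy_rewards.append(toy_step.reward)
print(toy_rewards)
```

The loop has the same shape as the interaction with CartPole: reset once, then step until the last time step of the episode.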

Usually two environments are instantiated: one for training and one for evaluation. 

In [None]:
train_py_env = suite_gym.load(env_name)
eval_py_env = suite_gym.load(env_name)

The Cartpole environment, like most environments, is written in pure Python. This is converted to TensorFlow using the `TFPyEnvironment` wrapper.

The original environment's API uses NumPy arrays. The `TFPyEnvironment` wrapper converts these to `Tensors` so that the environment is compatible with TensorFlow agents and policies.


In [None]:
train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)

## Agent

The algorithm used to solve an RL problem is represented by an `Agent`. TF-Agents provides standard implementations of a variety of `Agents`, including:

-   [DQN](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf) (used in this tutorial)
-   [REINFORCE](https://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf)
-   [DDPG](https://arxiv.org/pdf/1509.02971.pdf)
-   [TD3](https://arxiv.org/pdf/1802.09477.pdf)
-   [PPO](https://arxiv.org/abs/1707.06347)
-   [SAC](https://arxiv.org/abs/1801.01290)

The DQN agent can be used in any environment which has a discrete action space.

At the heart of a DQN Agent is a `QNetwork`, a neural network model that can learn to predict `QValues` (expected returns) for all actions, given an observation from the environment.

We will use `tf_agents.networks` to create a `QNetwork`. The network will consist of a sequence of `tf.keras.layers.Dense` layers, where the final layer will have 1 output for each possible action.

In [None]:
fc_layer_params = (100, 50)
action_tensor_spec = tensor_spec.from_spec(env.action_spec())
num_actions = action_tensor_spec.maximum - action_tensor_spec.minimum + 1

# Define a helper function to create Dense layers configured with the right
# activation and kernel initializer.
def dense_layer(num_units):
  return tf.keras.layers.Dense(
      num_units,
      activation=tf.keras.activations.relu,
      kernel_initializer=tf.keras.initializers.VarianceScaling(
          scale=2.0, mode='fan_in', distribution='truncated_normal'))

# QNetwork consists of a sequence of Dense layers followed by a dense layer
# with `num_actions` units to generate one q_value per available action as
# its output.
dense_layers = [dense_layer(num_units) for num_units in fc_layer_params]
q_values_layer = tf.keras.layers.Dense(
    num_actions,
    activation=None,
    kernel_initializer=tf.keras.initializers.RandomUniform(
        minval=-0.03, maxval=0.03),
    bias_initializer=tf.keras.initializers.Constant(-0.2))
q_net = sequential.Sequential(dense_layers + [q_values_layer])
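
As a plain-Python illustration of what the final layer produces, the sketch below computes the number of actions from the bounds of the action spec and picks the action with the largest Q-value. The Q-values are made-up numbers and the `toy_*` names are hypothetical; nothing here calls the network defined above.

```python
# Illustrative bounds matching the CartPole action spec (minimum=0, maximum=1).
action_minimum, action_maximum = 0, 1
toy_num_actions = action_maximum - action_minimum + 1

# Hypothetical Q-values for a single observation, one entry per action.
toy_q_values = [0.42, 1.37]
assert len(toy_q_values) == toy_num_actions

# The greedy action is the index of the largest Q-value.
greedy_action = max(range(toy_num_actions), key=lambda a: toy_q_values[a])
print(toy_num_actions, greedy_action)
```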

Now use `tf_agents.agents.dqn.dqn_agent` to instantiate a `DqnAgent`. In addition to the `time_step_spec`, `action_spec` and the QNetwork, the agent constructor also requires an optimizer (in this case, `tf.keras.optimizers.Adam`), a loss function, and an integer step counter.

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

train_step_counter = tf.Variable(0)

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)

agent.initialize()
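
The loss function passed to the agent is applied to temporal-difference (TD) errors. The following is a simplified sketch of a one-step TD target and its squared error, written in plain Python. It is not the library's implementation: it ignores the separate target network that DQN maintains, and `toy_gamma` and all the numbers are illustrative assumptions.

```python
def td_target(reward, discount, next_q_values, gamma):
  """One-step target: reward + gamma * discount * max over next Q-values."""
  return reward + gamma * discount * max(next_q_values)


def squared_td_loss(q_value, target):
  """Element-wise squared error between the target and the predicted Q-value."""
  return (target - q_value) ** 2


toy_gamma = 0.5  # Illustrative discount factor, chosen for easy arithmetic.

# Mid-episode step: the environment discount is 1.0, so the next state counts.
toy_target = td_target(1.0, 1.0, [2.0, 4.0], toy_gamma)
toy_loss = squared_td_loss(q_value=2.0, target=toy_target)

# Final step of an episode: the discount is 0.0, so the target is the reward.
toy_terminal_target = td_target(1.0, 0.0, [2.0, 4.0], toy_gamma)
print(toy_target, toy_loss, toy_terminal_target)
```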

## Policies

A policy defines the way an agent acts in an environment. Typically, the goal of reinforcement learning is to train the underlying model until the policy produces the desired outcome.

In this tutorial:

-   The desired outcome is keeping the pole balanced upright over the cart.
-   The policy returns an action (left or right) for each `time_step` observation.

Agents contain two policies: 

-   `agent.policy` — The main policy that is used for evaluation and deployment.
-   `agent.collect_policy` — A second policy that is used for data collection.


In [None]:
eval_policy = agent.policy
collect_policy = agent.collect_policy
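
One common reason for keeping a separate collection policy is exploration: the evaluation policy acts greedily, while the collection policy occasionally takes a random action. The sketch below shows that idea as epsilon-greedy selection in plain Python. The helper names, the Q-values, and the epsilon of 0.1 are assumptions for illustration, not values read from the agent above.

```python
import random


def greedy_action_from(q_values):
  """Returns the index of the largest Q-value."""
  return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_action(q_values, epsilon, rng):
  """With probability `epsilon` picks a random action, otherwise the greedy one."""
  if rng.random() < epsilon:
    return rng.randrange(len(q_values))
  return greedy_action_from(q_values)


toy_rng = random.Random(0)
toy_q = [0.1, 0.9]  # Hypothetical Q-values; action 1 is the greedy choice.

toy_eval_actions = [greedy_action_from(toy_q) for _ in range(1000)]
toy_collect_actions = [
    epsilon_greedy_action(toy_q, 0.1, toy_rng) for _ in range(1000)]

print(toy_eval_actions.count(1), toy_collect_actions.count(1))
```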

Policies can be created independently of agents. For example, use `tf_agents.policies.random_tf_policy` to create a policy which will randomly select an action for each `time_step`.

In [None]:
random_policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(),
                                                train_env.action_spec())

To get an action from a policy, call the `policy.action(time_step)` method. The `time_step` contains the observation from the environment. This method returns a `PolicyStep`, which is a named tuple with three components:

-   `action` — the action to be taken (in this case, `0` or `1`)
-   `state` — used for stateful (that is, RNN-based) policies
-   `info` — auxiliary data, such as log probabilities of actions

In [None]:
example_environment = tf_py_environment.TFPyEnvironment(
    suite_gym.load('CartPole-v0'))

In [None]:
time_step = example_environment.reset()

In [None]:
random_policy.action(time_step)

## Metrics and Evaluation

The most common metric used to evaluate a policy is the average return. The return is the sum of rewards obtained while running a policy in an environment for an episode. Several episodes are run, creating an average return.

The following function computes the average return of a policy, given the policy, environment, and a number of episodes.


In [None]:
#@test {"skip": true}
def compute_avg_return(environment, policy, num_episodes=10):

  total_return = 0.0
  for _ in range(num_episodes):

    time_step = environment.reset()
    episode_return = 0.0

    while not time_step.is_last():
      action_step = policy.action(time_step)
      time_step = environment.step(action_step.action)
      episode_return += time_step.reward
    total_return += episode_return

  avg_return = total_return / num_episodes
  return avg_return.numpy()[0]


# See also the metrics module for standard implementations of different metrics.
# https://github.com/tensorflow/agents/tree/master/tf_agents/metrics

Running this computation on the `random_policy` shows a baseline performance in the environment.

In [None]:
compute_avg_return(eval_env, random_policy, num_eval_episodes)
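
For reference, the same arithmetic can be checked by hand on made-up data. In this sketch each episode is just a list of per-step rewards; the episode lengths are invented for illustration and are not results from running the cell above.

```python
def average_return(episodes):
  """`episodes` is a list of per-step reward lists, one list per episode."""
  episode_returns = [sum(rewards) for rewards in episodes]
  return sum(episode_returns) / len(episode_returns)


# Three hypothetical episodes lasting 12, 30, and 18 steps, reward 1.0 per step.
toy_episodes = [[1.0] * 12, [1.0] * 30, [1.0] * 18]
toy_avg = average_return(toy_episodes)
print(toy_avg)
```

Because CartPole pays `+1` per step, the return of an episode equals its length, so the average return is also the average episode length.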

## Replay Buffer

To keep track of the data collected from the environment, we will use [Reverb](https://deepmind.com/research/open-source/Reverb), an efficient, extensible, and easy-to-use replay system by DeepMind. It stores experience data as trajectories are collected, and that data is consumed during training.

This replay buffer is constructed using specs describing the tensors that are to be stored, which can be obtained from the agent using `agent.collect_data_spec`.

In [None]:
table_name = 'uniform_table'
replay_buffer_signature = tensor_spec.from_spec(
      agent.collect_data_spec)
replay_buffer_signature = tensor_spec.add_outer_dim(
    replay_buffer_signature)

table = reverb.Table(
    table_name,
    max_size=replay_buffer_max_length,
    sampler=reverb.selectors.Uniform(),
    remover=reverb.selectors.Fifo(),
    rate_limiter=reverb.rate_limiters.MinSize(1),
    signature=replay_buffer_signature)

reverb_server = reverb.Server([table])

replay_buffer = reverb_replay_buffer.ReverbReplayBuffer(
    agent.collect_data_spec,
    table_name=table_name,
    sequence_length=2,
    local_server=reverb_server)

rb_observer = reverb_utils.ReverbAddTrajectoryObserver(
  replay_buffer.py_client,
  table_name,
  sequence_length=2)
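
The table above combines a uniform sampler with a first-in-first-out remover. The behaviour of that combination can be sketched in a few lines of plain Python. `ToyReplayBuffer` is a hypothetical class for illustration only; it does not use Reverb and leaves out rate limiting, signatures, and sequences.

```python
import collections
import random


class ToyReplayBuffer:
  """Bounded buffer: oldest items are evicted first, sampling is uniform."""

  def __init__(self, max_size, seed=0):
    # A deque with `maxlen` silently drops the oldest item when full (FIFO).
    self._items = collections.deque(maxlen=max_size)
    self._rng = random.Random(seed)

  def add(self, item):
    self._items.append(item)

  def sample(self, batch_size):
    # Uniform sampling with replacement over the items currently stored.
    return [self._rng.choice(self._items) for _ in range(batch_size)]

  def __len__(self):
    return len(self._items)


toy_buffer = ToyReplayBuffer(max_size=3)
for toy_item in range(5):
  toy_buffer.add(toy_item)

toy_batch = toy_buffer.sample(batch_size=4)
print(len(toy_buffer), toy_batch)
```

After five insertions into a buffer of size three, only the three most recent items (2, 3, and 4) remain available for sampling.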

For most agents, `collect_data_spec` is a named tuple called `Trajectory`, containing the specs for observations, actions, rewards, and other items.

In [None]:
agent.collect_data_spec

In [None]:
agent.collect_data_spec._fields

## Data Collection

Now execute the random policy in the environment for a few steps, recording the data in the replay buffer.

Here we use `PyDriver` to run the experience-collection loop. You can learn more about TF-Agents drivers in the [drivers tutorial](https://www.tensorflow.org/agents/tutorials/4_drivers_tutorial).

In [None]:
#@test {"skip": true}
py_driver.PyDriver(
    train_py_env,
    py_tf_eager_policy.PyTFEagerPolicy(
      random_policy, use_tf_function=True),
    [rb_observer],
    max_steps=initial_collect_steps).run(train_py_env.reset())

The replay buffer is now a collection of Trajectories.

In [None]:
# For the curious:
# Uncomment to peel one of these off and inspect it.
# next(iter(replay_buffer.as_dataset()))

The agent needs access to the replay buffer. This is provided by creating an iterable `tf.data.Dataset` pipeline which will feed data to the agent.

Each row of the replay buffer only stores a single observation step. But since the DQN Agent needs both the current and next observation to compute the loss, the dataset pipeline will sample two adjacent rows for each item in the batch (`num_steps=2`).

This dataset is also optimized by running parallel calls and prefetching data.

In [None]:
# Dataset generates trajectories with shape [Bx2x...]
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3,
    sample_batch_size=batch_size,
    num_steps=2).prefetch(3)

dataset
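
The idea of sampling two adjacent steps per batch item can be illustrated on a plain list. This is a sketch under simplifying assumptions: `sample_adjacent_pairs` is a hypothetical helper, the steps are strings rather than trajectories, and episode boundaries are ignored.

```python
import random


def sample_adjacent_pairs(steps, batch_size, rng):
  """Returns `batch_size` pairs of the form (steps[i], steps[i + 1])."""
  starts = [rng.randrange(len(steps) - 1) for _ in range(batch_size)]
  return [(steps[i], steps[i + 1]) for i in starts]


toy_steps = ['s0', 's1', 's2', 's3', 's4']
toy_pairs = sample_adjacent_pairs(
    toy_steps, batch_size=3, rng=random.Random(0))
print(toy_pairs)
```

Each pair plays the role of one `[2 x ...]` slice of the batch: the first element supplies the current observation and the second supplies the next one.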

In [None]:
iterator = iter(dataset)
print(iterator)

In [None]:
# For the curious:
# Uncomment to see what the dataset iterator is feeding to the agent.
# Compare this representation of replay data
# to the collection of individual trajectories shown earlier.

# next(iterator)

## Training the agent

Two things must happen during the training loop:

-   collect data from the environment
-   use that data to train the agent's neural network(s)

This example also periodically evaluates the policy and prints the current score.

The following will take ~5 minutes to run.

In [None]:
#@test {"skip": true}
try:
  %%time
except:
  pass

# (Optional) Optimize by wrapping some of the code in a graph using TF function.
agent.train = common.function(agent.train)

# Reset the train step.
agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
returns = [avg_return]

# Reset the environment.
time_step = train_py_env.reset()

# Create a driver to collect experience from the training environment.
collect_driver = py_driver.PyDriver(
    train_py_env,
    py_tf_eager_policy.PyTFEagerPolicy(
      agent.collect_policy, use_tf_function=True),
    [rb_observer],
    max_steps=collect_steps_per_iteration)

for _ in range(num_iterations):

  # Collect a few steps and save to the replay buffer.
  time_step, _ = collect_driver.run(time_step)

  # Sample a batch of data from the buffer and update the agent's network.
  experience, unused_info = next(iterator)
  train_loss = agent.train(experience).loss

  step = agent.train_step_counter.numpy()

  if step % log_interval == 0:
    print('step = {0}: loss = {1}'.format(step, train_loss))

  if step % eval_interval == 0:
    avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
    print('step = {0}: Average Return = {1}'.format(step, avg_return))
    returns.append(avg_return)
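
To see how often the loop above logs and evaluates, the cadence can be replayed without any training. The sketch assumes that each call to `agent.train` advances the step counter by one; the `toy_*` names are hypothetical copies of the hyperparameters defined earlier.

```python
toy_num_iterations = 20000
toy_log_interval = 200
toy_eval_interval = 1000

toy_log_steps = []
toy_eval_steps = [0]  # Step 0 is the evaluation done before training starts.

for toy_train_step in range(1, toy_num_iterations + 1):
  if toy_train_step % toy_log_interval == 0:
    toy_log_steps.append(toy_train_step)
  if toy_train_step % toy_eval_interval == 0:
    toy_eval_steps.append(toy_train_step)

# These are the x-axis positions at which an average return is recorded.
toy_x_axis = list(range(0, toy_num_iterations + 1, toy_eval_interval))
print(len(toy_log_steps), len(toy_eval_steps))
```

With these settings the loss is printed 100 times and the `returns` list ends up with 21 entries, one per evaluation including the one before training.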

## Visualization

### Plots

Use `matplotlib.pyplot` to chart how the policy improved during training.

One episode of `CartPole-v0` lasts at most 200 time steps. The environment gives a reward of `+1` for each step the pole stays up, so the maximum return for one episode is 200. The chart shows the return increasing towards that maximum as the policy is evaluated during training. (It may be a little unstable and not increase monotonically each time.)

In [None]:
#@test {"skip": true}

iterations = range(0, num_iterations + 1, eval_interval)
plt.plot(iterations, returns)
plt.ylabel('Average Return')
plt.xlabel('Iterations')
plt.ylim(top=250)

### Videos

Charts are nice. But more exciting is seeing an agent actually performing a task in an environment. 

First, create a function to embed videos in the notebook.

In [None]:
def embed_mp4(filename):
  """Embeds an mp4 file in the notebook."""
  # Read the file inside a context manager so the handle is always closed.
  with open(filename, 'rb') as f:
    video = f.read()
  b64 = base64.b64encode(video)
  tag = '''
  <video width="640" height="480" controls>
    <source src="data:video/mp4;base64,{0}" type="video/mp4">
  Your browser does not support the video tag.
  </video>'''.format(b64.decode())

  return IPython.display.HTML(tag)
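
The embedding works because the video bytes are base64-encoded into a `data:` URI that the browser can decode inline. A minimal round trip with placeholder bytes (not a real video; the `toy_*` names are hypothetical) shows the encoding step on its own:

```python
import base64

toy_bytes = b'\x00\x01video-bytes'  # Placeholder payload, not a real mp4.

# b64encode returns bytes; decode() turns them into text usable inside HTML.
toy_b64 = base64.b64encode(toy_bytes).decode()
toy_src = 'data:video/mp4;base64,{0}'.format(toy_b64)

# Decoding the part after the comma recovers the original bytes.
toy_roundtrip = base64.b64decode(toy_src.split(',', 1)[1])
print(toy_roundtrip == toy_bytes)
```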

Now iterate through a few episodes of the Cartpole game with the agent. The underlying Python environment (the one "inside" the TensorFlow environment wrapper) provides a `render()` method, which outputs an image of the environment state. These can be collected into a video.

In [None]:
def create_policy_eval_video(policy, filename, num_episodes=5, fps=30):
  filename = filename + ".mp4"
  with imageio.get_writer(filename, fps=fps) as video:
    for _ in range(num_episodes):
      time_step = eval_env.reset()
      video.append_data(eval_py_env.render())
      while not time_step.is_last():
        action_step = policy.action(time_step)
        time_step = eval_env.step(action_step.action)
        video.append_data(eval_py_env.render())
  return embed_mp4(filename)

create_policy_eval_video(agent.policy, "trained-agent")

For fun, compare the trained agent (above) to an agent moving randomly. (It does not do as well.)

In [None]:
create_policy_eval_video(random_policy, "random-agent")