# Playing Atari with Deep Reinforcement Learning

# Labels

* Reinforcement Learning

# Harvard Citation

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. and Riedmiller, M., 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

# BibTeX Citation

`@article{mnih2013playing,
  title={Playing atari with deep reinforcement learning},
  author={Mnih, Volodymyr and Kavukcuoglu, Koray and Silver, David and Graves, Alex and Antonoglou, Ioannis and Wierstra, Daan and Riedmiller, Martin},
  journal={arXiv preprint arXiv:1312.5602},
  year={2013}
}`

# Abstract

We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.


# Research Problem

How to learn to control agents directly from high-dimensional visual inputs.


# Contributions

(1) DQN is widely regarded as the first well-known work to apply deep learning techniques to reinforcement learning problems.

(2) DQN uses an experience replay mechanism to reduce the correlation between consecutive training samples.
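To make the experience replay idea concrete, here is a minimal sketch of a replay buffer. It is only an illustration under stated assumptions: the class name `ReplayBuffer`, its methods, and the tiny capacity are hypothetical and not taken from the paper. It stores transitions up to a fixed capacity and samples them uniformly at random, so a training batch is no longer a run of consecutive, highly correlated frames.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) transitions.

    Hypothetical helper for illustration; not the paper's implementation.
    """

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently evicts the oldest transition when full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement mixes old and new transitions.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


# Usage: push five transitions into a buffer that only holds three.
buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(2)
```

After the loop only the three most recent transitions remain, and `batch` holds two of them chosen at random.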

# Summary

Reinforcement learning presents several challenges from a deep learning perspective.

Firstly, most successful deep learning applications to date have required large amounts of hand-labelled training data. Reinforcement learning algorithms, on the other hand, must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed. The delay between actions and resulting rewards, which can be thousands of timesteps long, seems particularly daunting when compared to the direct association between inputs and targets found in supervised learning.

Another issue is that most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states.

Furthermore, in reinforcement learning, the data distribution changes as the algorithm learns new behaviours, which can be problematic for deep learning methods that assume a fixed underlying distribution.


# Previous Work

#### Value function approximation ($V(s)$, not $Q(s,a)$) using neural networks (namely multi-layer perceptron, or MLP):

- Tesauro, G., 1995. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3), pp.58-68.
- Pollack, J.B. and Blair, A.D., 1997. Why did TD-gammon work?. In Advances in Neural Information Processing Systems (pp. 10-16).
- Tsitsiklis, J.N. and Van Roy, B., 1996. An analysis of temporal-difference learning with function approximation (Technical Report LIDS-P-2322). Laboratory for Information and Decision Systems, Massachusetts Institute of Technology.
- Baird, L., 1995. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995 (pp. 30-37).

#### Estimating value function or policy with restricted Boltzmann machines (RBM)

- Sallans, B. and Hinton, G.E., 2004. Reinforcement learning with factored states and actions. Journal of Machine Learning Research, 5(Aug), pp.1063-1088.
- Heess, N., Silver, D. and Teh, Y.W., 2012, December. Actor-Critic Reinforcement Learning with Energy-Based Policies. In EWRL (pp. 43-58).

#### Linear control by Q-learning with gradient temporal-difference methods

- Bhatnagar, S., Precup, D., Silver, D., Sutton, R.S., Maei, H.R. and Szepesvári, C., 2009. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems (pp. 1204-1212).
- Maei, H.R., Szepesvári, C., Bhatnagar, S. and Sutton, R.S., 2010, June. Toward off-policy learning control with function approximation. In ICML (pp. 719-726).

#### Neural fitted Q iteration: a Q-network with batch updates (earlier than this paper!)

- Riedmiller, M., 2005, October. Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning (pp. 317-328). Springer, Berlin, Heidelberg.

#### Learning low-dimensional representations with deep autoencoders

- Lange, S. and Riedmiller, M., 2010, July. Deep auto-encoder neural networks in reinforcement learning. In Neural Networks (IJCNN), The 2010 International Joint Conference on (pp. 1-8). IEEE.

#### Q-learning + experience replay

- Lin, L.J., 1993. Reinforcement learning for robots using neural networks (No. CMU-CS-93-103). Carnegie-Mellon Univ Pittsburgh PA School of Computer Science.

#### Playing Atari with reinforcement learning

- Bellemare, M.G., Naddaf, Y., Veness, J. and Bowling, M., 2013. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, pp.253-279.
- Bellemare, M., Veness, J. and Bowling, M., 2012. Sketch-based linear value function approximation. In Advances in Neural Information Processing Systems (pp. 2213-2221).

#### Playing Atari with evolutionary architecture

- Hausknecht, M., Lehman, J., Miikkulainen, R. and Stone, P., 2014. A neuroevolution approach to general atari game playing. IEEE Transactions on Computational Intelligence and AI in Games, 6(4), pp.355-366.

# Assumptions

The authors consider tasks in which an agent interacts with an environment $\mathcal{E}$, in this case the Atari emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action $a_t$ from the set of legal game actions, $\mathcal{A}=\{1,\ldots,K\}$. The action is passed to the emulator and modifies its internal state and the game score. In general $\mathcal{E}$ may be **stochastic**. The emulator's internal state is not observed by the agent; instead it observes an image $x_t \in \mathbb{R}^d$ from the emulator, which is a vector of raw pixel values representing the current screen. In addition it receives a reward $r_t$ representing the change in game score. Note that in general the game score may depend on the whole prior sequence of actions and observations; feedback about an action may only be received after many thousands of time-steps have elapsed.

Since the agent only observes images of the current screen, the task is **partially observed** and many emulator states are perceptually aliased, i.e. it is impossible to fully understand the current situation from only the current screen $x_t$. Authors therefore consider sequences of actions and observations, $s_t=x_1, a_1, x_2,\ldots , a_{t-1}, x_t$, and learn game strategies that depend upon these sequences. All sequences in the emulator are assumed to terminate in a finite number of time-steps. This formalism gives rise to a large but finite Markov decision process (MDP) in which **each sequence is a distinct state**. As a result, we can apply standard reinforcement learning methods for MDPs, simply by using the complete sequence $s_t$ as the state representation at time $t$.

The goal of the agent is to interact with the emulator by selecting actions in a way that maximises future rewards. The authors make the standard assumption that future rewards are discounted by a factor of $\gamma$ per time-step, and define the future discounted *return* at time $t$ as $R_t = \sum_{t'=t}^T \gamma^{t'-t}r_{t'}$, where $T$ is the time-step at which the game terminates. They define the optimal action-value function $Q^{\ast}(s,a)$ as the maximum expected return achievable by following any strategy, after seeing some sequence $s$ and then taking some action $a$, $Q^{\ast}(s,a)=\max_{\pi} \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$, where $\pi$ is a policy mapping sequences to actions (or distributions over actions).
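As a small worked example of the discounted return $R_t$ defined above, the sketch below accumulates rewards from the end of an episode backwards. The function name `discounted_return` and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for rewards r_t, ..., r_T."""
    total = 0.0
    # Work backwards so each step applies R_t = r_t + gamma * R_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Rewards 1, 0, 2 with gamma = 0.5: 1 + 0.5 * 0 + 0.25 * 2 = 1.5
example_return = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
print(example_return)  # 1.5
```

The backward recursion gives the same value as the explicit sum while avoiding repeated powers of $\gamma$.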

The optimal action-value function obeys an important identity known as the *Bellman equation*. This is based on the following intuition: if the optimal value $Q^{\ast}(s', a')$ of the sequence $s'$ at the next time-step were known for all possible actions $a'$, then the optimal strategy is to select the action $a'$ maximising the expected value of $r+\gamma Q^{\ast}(s', a')$,

$$Q^{\ast}(s,a)=\mathbb{E}_{s'\sim \mathcal{E}}\left[ r+\gamma \max_{a'} Q^{\ast}(s',a') \mid s, a \right]$$
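As a rough illustration, the Bellman equation can be applied repeatedly as an update rule on a small tabular problem. The sketch below does this for a hypothetical three-state chain with deterministic transitions (so the expectation over $s'$ collapses to a single term); the environment, `TRANSITIONS`, and `q_value_iteration` are assumptions made for this example and do not come from the paper.

```python
# Hypothetical chain: states 0 and 1 are non-terminal, state 2 is terminal.
# Actions: 0 = stay, 1 = move right. TRANSITIONS[s][a] = (next_state, reward).
TRANSITIONS = {
    0: {0: (0, 0.0), 1: (1, 0.0)},
    1: {0: (1, 0.0), 1: (2, 1.0)},
}
TERMINAL = 2
GAMMA = 0.9


def q_value_iteration(num_iterations):
    """Repeatedly apply Q_{i+1}(s,a) = r + gamma * max_a' Q_i(s',a')."""
    q = {s: {a: 0.0 for a in actions} for s, actions in TRANSITIONS.items()}
    for _ in range(num_iterations):
        new_q = {}
        for s, actions in TRANSITIONS.items():
            new_q[s] = {}
            for a, (s_next, r) in actions.items():
                # No future reward once the terminal state is reached.
                future = 0.0 if s_next == TERMINAL else max(q[s_next].values())
                new_q[s][a] = r + GAMMA * future
        q = new_q
    return q


q_star = q_value_iteration(50)
greedy_action_in_state_0 = max(q_star[0], key=q_star[0].get)
```

In this toy problem the values stop changing after a few sweeps, and the greedy action in every state is to move right towards the rewarding transition. Note that the table holds one entry per state-action pair, which is exactly what becomes infeasible when every observation sequence is a distinct state.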

The basic idea behind many reinforcement learning algorithms is to estimate the action-value function, by using the Bellman equation as an iterative update, $Q_{i+1}(s,a)=\mathbb{E}[r+\gamma \max_{a'}Q_i (s', a') \mid s, a]$. Such *value iteration* algorithms converge to the optimal action-value function, $Q_i \rightarrow Q^{\ast}$ as $i \rightarrow \infty$. **In practice, this basic approach is totally impractical, because the action-value function is estimated separately for each sequence, without any generalisation.** Instead, it is common to use a function approximator to estimate the action-value function, $Q(s,a;\theta)\approx Q^{\ast}(s,a)$. In the reinforcement learning community this is typically a linear function approximator, but sometimes a non-linear function approximator is used instead, such as a neural network. Authors refer to a neural network function approximator with weights $\theta$ as a Q-network. A Q-network can be trained by minimising a sequence of loss functions $L_i (\theta_i)$ that changes at each iteration $i$,

$$L_i (\theta_i) = \mathbb{E}_{s,a \sim \rho (\cdot)}\left[ (y_i - Q(s,a;\theta_i))^2 \right]$$

where $y_i = \mathbb{E}_{s'\sim \mathcal{E}}[r+\gamma \max_{a'} Q(s',a';\theta_{i-1})\mid s, a]$ is the target for iteration $i$ and $\rho (s,a)$ is a probability distribution over sequences $s$ and actions $a$ that we refer to as the *behaviour distribution*. The parameters from the previous iteration $\theta_{i-1}$ are held fixed when optimising the loss function $L_i (\theta_i)$. Note that the targets depend on the network weights; this is in contrast with the targets used for supervised learning, which are fixed before learning begins. Differentiating the loss function with respect to the weights we arrive at the following gradients,

$$\nabla_{\theta_i} L_i (\theta_i) = \mathbb{E}_{s,a\sim \rho (\cdot);s'\sim \mathcal{E}}\left[ \left( r+\gamma \max_{a'} Q(s', a';\theta_{i-1})-Q(s,a;\theta_i) \right) \nabla_{\theta_i} Q(s,a;\theta_i) \right]$$
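To make the loss and the gradient expression above more tangible, here is a minimal single-sample sketch. It assumes a linear Q-network with one weight vector per action, $Q(s,a;\theta)=\theta_a \cdot \phi(s)$, which is a simplification chosen for readability and not the convolutional network used in the paper; all function names and numbers are hypothetical.

```python
def q_values(theta, features):
    """Linear Q-network: Q(s, a; theta) = theta[a] . phi(s) for every action a."""
    return [sum(w * x for w, x in zip(weights, features)) for weights in theta]


def td_target(theta_prev, reward, next_features, gamma, terminal):
    """y = r + gamma * max_a' Q(s', a'; theta_prev), or just r at a terminal step."""
    if terminal:
        return reward
    return reward + gamma * max(q_values(theta_prev, next_features))


def sample_loss_and_direction(theta, theta_prev, transition, gamma):
    """Return the squared error and the bracketed term of the gradient expression."""
    features, action, reward, next_features, terminal = transition
    y = td_target(theta_prev, reward, next_features, gamma, terminal)  # theta_prev is held fixed
    td_error = y - q_values(theta, features)[action]
    loss = td_error ** 2
    # For a linear model, grad_theta Q(s, a; theta) is phi(s) in row `action`, zero elsewhere.
    direction = [
        [td_error * x if a == action else 0.0 for x in features]
        for a in range(len(theta))
    ]
    return loss, direction


theta = [[0.5, 0.0], [0.0, 0.5]]
transition = ([1.0, 0.0], 0, 1.0, [0.0, 1.0], False)  # (phi(s), a, r, phi(s'), terminal)
loss, direction = sample_loss_and_direction(theta, theta, transition, gamma=0.5)
print(loss, direction)  # 0.5625 [[0.75, 0.0], [0.0, 0.0]]
```

Moving $\theta$ a small step along `direction` raises $Q(s,a;\theta)$ towards the target and so reduces the sample loss; only the weights of the action actually taken are affected.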

Rather than computing the full expectations in the above gradient, it is often computationally expedient to optimise the loss function by stochastic gradient descent. If the weights are updated after every time-step, and the expectations are replaced by single samples from the behaviour distribution $\rho$ and the emulator $\mathcal{E}$ respectively, then we arrive at the familiar *Q-learning* algorithm.
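The single-sample update described above can be sketched in its simplest tabular form, where each state-action pair has its own weight. This is an illustrative special case under assumed names (`q_learning_update`, the string state labels, the step size), not code from the paper.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, terminal,
                      num_actions, step_size, gamma):
    """Apply one Q-learning step from a single observed transition."""
    if terminal:
        best_next = 0.0
    else:
        best_next = max(q_table[(next_state, a)] for a in range(num_actions))
    td_error = reward + gamma * best_next - q_table[(state, action)]
    q_table[(state, action)] += step_size * td_error
    return td_error


q_table = defaultdict(float)  # unseen state-action pairs start at 0.0
first_error = q_learning_update(q_table, "s0", 1, 1.0, "s1", False,
                                num_actions=2, step_size=0.5, gamma=0.9)
second_error = q_learning_update(q_table, "s0", 1, 1.0, "s1", False,
                                 num_actions=2, step_size=0.5, gamma=0.9)
```

Seeing the same transition twice halves the error each time here, because the step size is 0.5 and the value of the next state has not been updated yet.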

Note that this algorithm is *model-free*: it solves the reinforcement learning task directly using samples from the emulator $\mathcal{E}$, without explicitly constructing an estimate of $\mathcal{E}$. It is also *off-policy*: it learns about the greedy strategy $a=\arg\max_{a'} Q(s,a';\theta)$, while following a behaviour distribution that ensures adequate exploration of the state space. In practice, the behaviour distribution is often selected by an $\epsilon$-greedy strategy that follows the greedy strategy with probability $1-\epsilon$ and selects a random action with probability $\epsilon$.
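The $\epsilon$-greedy behaviour described above can be sketched in a few lines. The function name `epsilon_greedy` and the example Q-values are hypothetical; the random action is drawn uniformly over all actions, which is one common reading of "selects a random action".

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the largest Q-value (first one wins on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
example_q = [0.1, 0.7, 0.3]
greedy_choice = epsilon_greedy(example_q, epsilon=0.0, rng=rng)    # always action 1
random_choices = [epsilon_greedy(example_q, epsilon=1.0, rng=rng) for _ in range(20)]
```

With `epsilon=0.0` the rule is purely greedy, and with `epsilon=1.0` it ignores the Q-values entirely; values in between trade off exploration against exploitation.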

# Initial Edit Date

14 May 2018
