The two main ideas behind TRPO (Trust Region Policy Optimization) are **MM algorithms** and the **Trust Region**.

# MM Algorithms

The main idea of MM (minorization-maximization) algorithms is intuitive: for a maximization problem, we first construct a lower bound on the original objective to serve as a surrogate objective, and then maximize that lower bound in order to improve the original objective. Because the bound is tight at the current iterate, any point that improves the surrogate also improves the original objective. The widely known **Expectation-Maximization (EM) algorithm** is a special case of MM algorithms.
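To make the MM idea concrete, here is a minimal sketch on a toy problem that is unrelated to TRPO itself: maximizing $f(x) = -\sum_i |x - a_i|$, whose maximizer is the median of the data. It assumes the standard quadratic minorizer $-|u| \geq -\left(\frac{u^2}{2c} + \frac{c}{2}\right)$ with $c = |u_k|$, which touches $-|u|$ at the current iterate. The data, the starting point, and the `floor` safeguard are illustrative assumptions.

```python
def objective(x, data):
    """Original objective f(x) = -sum_i |x - a_i| (maximized at the median)."""
    return -sum(abs(x - a) for a in data)


def mm_step(x, data, floor=1e-12):
    """Maximize the quadratic minorizer built at the current iterate x.

    The minorizer is a weighted sum of squares, so its maximizer is a
    weighted mean with weights 1 / |x - a_i|. The floor (an assumption of
    this sketch) avoids division by zero when x sits on a data point.
    """
    weights = [1.0 / max(abs(x - a), floor) for a in data]
    return sum(w * a for w, a in zip(weights, data)) / sum(weights)


data = [1.0, 2.0, 7.0, 9.0, 10.0]  # hypothetical data, median is 7
x = 0.0                            # arbitrary starting point
history = [objective(x, data)]
for _ in range(200):
    x = mm_step(x, data)
    history.append(objective(x, data))

print(x, history[0], history[-1])
```

Each step maximizes only the surrogate, yet the recorded values of the original objective never decrease, which is the property TRPO relies on.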

In TRPO, Schulman et al. (2015) developed a surrogate objective building on Kakade (2001) and Kakade & Langford (2002). After subtracting a penalty term, the surrogate in TRPO is a lower bound on the original objective, namely the expected cumulative return of the policy.

# Trust Region Methods

As described in Nocedal & Wright's "Numerical Optimization", "trust-region methods define a region around the current iterate within which they trust the model to be an adequate representation of the objective function, and then choose the step to be the approximate minimizer of the model in this region". Intuitively, once we have chosen a search direction, we constrain the step length to stay within a "trust region" so that the local estimates of the gradient and curvature can still be trusted.

In TRPO, Schulman et al. (2015) used the KL divergence between the old policy and the updated policy to measure the size of the trust region.
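As a rough illustration of the quoted idea, the following sketch implements a basic trust-region loop with a quadratic model and a Cauchy-point step, loosely following the textbook scheme in Nocedal & Wright. The test function, the starting point, and all thresholds (`0.25`, `0.75`, `eta`, `max_radius`) are illustrative assumptions, and this is generic numerical minimization rather than TRPO itself.

```python
import numpy as np


def cauchy_point(g, B, radius):
    """Minimizer of the quadratic model along -g, restricted to the trust region."""
    g_norm = np.linalg.norm(g)
    curvature = g @ B @ g
    if curvature <= 0:
        tau = 1.0  # model decreases all the way to the boundary
    else:
        tau = min(g_norm**3 / (radius * curvature), 1.0)
    return -tau * radius / g_norm * g


def trust_region_minimize(f, grad, hess, x0, radius=1.0, max_radius=10.0,
                          eta=0.1, tol=1e-8, max_iter=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        p = cauchy_point(g, B, radius)
        predicted = -(g @ p + 0.5 * p @ B @ p)  # decrease promised by the model
        actual = f(x) - f(x + p)                # decrease actually obtained
        rho = actual / predicted
        if rho < 0.25:
            radius *= 0.25                      # model was poor: shrink the region
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), radius):
            radius = min(2.0 * radius, max_radius)  # model was good: expand
        if rho > eta:
            x = x + p                           # accept the step
    return x


# Hypothetical smooth test function with minimum at (1, -2).
def f(x):
    return (x[0] - 1) ** 4 + (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2


def grad(x):
    return np.array([4 * (x[0] - 1) ** 3 + 2 * (x[0] - 1), 20 * (x[1] + 2)])


def hess(x):
    return np.array([[12 * (x[0] - 1) ** 2 + 2, 0.0], [0.0, 20.0]])


x_star = trust_region_minimize(f, grad, hess, x0=[5.0, 5.0])
print(x_star)
```

The ratio `rho` compares the actual decrease with the decrease predicted by the model, which is how the method decides whether the model can be trusted in a larger or smaller region.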

# Notations

An MDP is a tuple $(S, A, \{P_{sa}\}, \gamma, R, \rho_0)$, where:

* $S$ is a finite set of $N$ states.
* $A=\{a_1, \ldots , a_k\}$ is a set of $k$ actions.
* $P_{sa}(s')$, also written $P(s, a, s')$ or $P(s' \mid s, a)$, is the probability of transitioning to state $s'$ upon taking action $a$ in state $s$.
* $\gamma \in [0,1)$ is the discount factor.
* $R: S\rightarrow \mathbb{R}$ is the reward function; we write $r(s)$ for the reward in state $s$.
* $\rho_0: S \rightarrow \mathbb{R}$ is the distribution of the initial state $s_0$.
* $\rho_\pi: S \rightarrow \mathbb{R}$ gives the (unnormalized) discounted visitation frequencies,
$$\rho_\pi (s) = Pr[s_0 = s] + \gamma Pr[s_1 = s] + \gamma^2 Pr[s_2 = s] + \ldots$$
* $\eta (\pi) = \mathbb{E}_{s_0, a_0, \ldots} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t) \right]$ is the expected discounted cumulative reward of policy $\pi$, where
$$s_0 \sim \rho_0 (s_0), a_t \sim \pi (a_t \mid s_t), s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$$
* $Q_\pi (s_t, a_t)=\mathbb{E}_{s_{t+1}, a_{t+1},\ldots}\left[ \sum_{l=0}^{\infty} \gamma^l r(s_{t+l}) \right]$ is the action-value function
* $V_\pi (s_t)=\mathbb{E}_{a_t, s_{t+1}, \ldots} \left[ \sum_{l=0}^{\infty} \gamma^l r(s_{t+l}) \right]$ is the value function
* $A_\pi (s, a) = Q_\pi (s, a) - V_\pi (s)$ is the advantage function. In the definitions of $Q_\pi$, $V_\pi$, and $A_\pi$,
$$a_t \sim \pi (a_t \mid s_t), s_{t+1} \sim P(s_{t+1} \mid s_t, a_t), \text{for } t \geq 0$$
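For a small tabular MDP, all of the quantities above can be computed exactly with linear algebra. The sketch below assumes a hypothetical 3-state, 2-action MDP whose numbers are made up; `evaluate` is an illustrative helper, not part of any library.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions. All numbers are made up.
gamma = 0.9
P = np.array([  # P[s, a, s'] = P_{sa}(s'); each row sums to 1
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
    [[0.0, 0.9, 0.1], [0.5, 0.0, 0.5]],
    [[0.3, 0.3, 0.4], [0.0, 0.2, 0.8]],
])
r = np.array([0.0, 1.0, -0.5])                          # r(s)
rho0 = np.array([0.6, 0.3, 0.1])                        # initial state distribution
pi = np.array([[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]])     # pi[s, a] = pi(a | s)


def evaluate(P, r, rho0, pi, gamma):
    """Exact V, Q, A, rho_pi and eta for a tabular policy."""
    n = len(r)
    P_pi = np.einsum("sa,sat->st", pi, P)               # state transitions under pi
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r)    # V = r + gamma * P_pi V
    Q = r[:, None] + gamma * P @ V                      # Q[s, a]
    A = Q - V[:, None]                                  # advantage
    # rho_pi = sum_t gamma^t Pr[s_t = .] solves rho = rho0 + gamma * P_pi^T rho
    rho = np.linalg.solve(np.eye(n) - gamma * P_pi.T, rho0)
    eta = rho0 @ V                                      # expected discounted return
    return V, Q, A, rho, eta


V, Q, A, rho, eta = evaluate(P, r, rho0, pi, gamma)
print("eta =", eta)
print("rho_pi =", rho)
```

Two sanity checks follow from the definitions: $\eta(\pi) = \sum_s \rho_\pi(s) r(s)$, and the advantage averages to zero under the policy's own action distribution, $\sum_a \pi(a \mid s) A_\pi(s, a) = 0$.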

# Derivations

Here is the important identity proved by Kakade & Langford (2002):

$$
\begin{split}
\eta (\pi) & = \eta (\pi_0) + \mathbb{E}_{s_0, a_0, \ldots \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t A_{\pi_0} (s_t, a_t) \right] \\
& = \eta(\pi_0) + \sum_{t=0}^{\infty} \sum_{s} P(s_t = s \mid \pi) \sum_{a} \pi (a \mid s) \gamma^t A_{\pi_0}(s, a) \\
& = \eta(\pi_0) + \sum_{s} \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi) \sum_{a} \pi (a \mid s) A_{\pi_0}(s, a) \\
& = \eta(\pi_0) + \sum_s \rho_\pi (s) \sum_a \pi (a \mid s) A_{\pi_0} (s, a) \\
& = \eta(\pi_0) + \mathbb{E}_{\rho_\pi}\mathbb{E}_{a \sim \pi (s)}\left[ A_{\pi_0} (s, a) \right]
\end{split}
$$

where $\pi_0$ is the old policy and $\pi$ is the new policy. Note that we have the current policy $\pi_0$ but not yet the new policy $\pi$, so $\rho_\pi$ is hard to obtain. Instead, Schulman et al. (2015) used $\rho_{\pi_0}$ as an approximation to $\rho_\pi$:

$$\eta (\pi) \approx \eta (\pi_0) + \mathbb{E}_{\rho_{\pi_0}} \mathbb{E}_{a \sim \pi (s)} \left[ A_{\pi_0} (s, a) \right]$$

We then define this approximation as the surrogate objective function:

$$L_{\pi_0}(\pi) = \eta (\pi_0) + \mathbb{E}_{\rho_{\pi_0}} \mathbb{E}_{a \sim \pi (s)} \left[ A_{\pi_0} (s, a) \right]$$
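Both the exact identity and the surrogate $L_{\pi_0}(\pi)$ can be checked numerically on a small tabular MDP. The sketch below assumes a hypothetical random MDP (sizes and seed are arbitrary) and an illustrative helper `evaluate`; it compares $\eta(\pi)$ computed directly with the value given by the identity (which uses $\rho_\pi$) and with the surrogate (which uses $\rho_{\pi_0}$).

```python
import numpy as np


def evaluate(P, r, rho0, pi, gamma):
    """Return (advantage A[s, a], visitation rho[s], return eta) of a tabular policy."""
    n = len(r)
    P_pi = np.einsum("sa,sat->st", pi, P)               # state transitions under pi
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r)    # V = r + gamma * P_pi V
    A = r[:, None] + gamma * P @ V - V[:, None]         # A = Q - V
    rho = np.linalg.solve(np.eye(n) - gamma * P_pi.T, rho0)
    return A, rho, rho0 @ V


# Hypothetical random MDP; sizes and seed are arbitrary.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.normal(size=n_states)
rho0 = np.full(n_states, 1.0 / n_states)
pi_old = rng.dirichlet(np.ones(n_actions), size=n_states)         # pi_0
pi_other = rng.dirichlet(np.ones(n_actions), size=n_states)
pi_new = 0.9 * pi_old + 0.1 * pi_other                            # a nearby policy pi

A_old, rho_old, eta_old = evaluate(P, r, rho0, pi_old, gamma)
_, rho_new, eta_new = evaluate(P, r, rho0, pi_new, gamma)

expected_adv = (pi_new * A_old).sum(axis=1)    # sum_a pi(a|s) A_{pi_0}(s, a)
exact = eta_old + rho_new @ expected_adv       # identity: weights from rho_pi
surrogate = eta_old + rho_old @ expected_adv   # L_{pi_0}(pi): weights from rho_{pi_0}

print("eta(pi)       =", eta_new)
print("identity      =", exact)
print("L_{pi_0}(pi)  =", surrogate)
```

The identity reproduces $\eta(\pi)$ up to floating-point error, while the surrogate is only an approximation whose error depends on how far $\pi$ is from $\pi_0$.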

This is where the MM algorithm and the trust region come in. Let $\pi' = \operatorname*{argmax}_{\pi'} L_{\pi_0}(\pi')$, and define the new policy as the following mixture:

$$\pi (s) = (1-\alpha) \pi_0 (s) + \alpha \pi' (s)$$

For this mixture policy, Kakade & Langford (2002) proved that

$$\eta (\pi) \geq L_{\pi_0} (\pi) - \frac{2 \epsilon \gamma}{(1-\gamma (1-\alpha))(1-\gamma)} \alpha^2$$

where,

$$\epsilon = \max_{s} \left| \mathbb{E}_{a \sim \pi' (s)} [A_{\pi_0} (s, a)] \right|$$

With this bound (the right-hand side of the inequality), we can constrain the update to stay within some trust region.
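The mixture bound can be checked numerically in the tabular case. The sketch below assumes a hypothetical random MDP and takes $\pi'$ to be the policy that is greedy with respect to $A_{\pi_0}$, which maximizes $L_{\pi_0}$ when every state has positive visitation weight. The helper `evaluate` and the function `bound_gap` are illustrative names.

```python
import numpy as np


def evaluate(P, r, rho0, pi, gamma):
    """Return (advantage A[s, a], visitation rho[s], return eta) of a tabular policy."""
    n = len(r)
    P_pi = np.einsum("sa,sat->st", pi, P)
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r)
    A = r[:, None] + gamma * P @ V - V[:, None]
    rho = np.linalg.solve(np.eye(n) - gamma * P_pi.T, rho0)
    return A, rho, rho0 @ V


# Hypothetical random MDP; sizes and seed are arbitrary.
rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.normal(size=n_states)
rho0 = np.full(n_states, 1.0 / n_states)
pi_old = rng.dirichlet(np.ones(n_actions), size=n_states)

A_old, rho_old, eta_old = evaluate(P, r, rho0, pi_old, gamma)

# pi' = argmax L_{pi_0}: put all probability on the best action per state.
pi_prime = np.eye(n_actions)[A_old.argmax(axis=1)]
eps = np.abs((pi_prime * A_old).sum(axis=1)).max()


def bound_gap(alpha):
    """eta(pi) minus the lower bound, for the mixture with weight alpha."""
    pi_mix = (1 - alpha) * pi_old + alpha * pi_prime
    _, _, eta_mix = evaluate(P, r, rho0, pi_mix, gamma)
    L = eta_old + rho_old @ (pi_mix * A_old).sum(axis=1)
    penalty = 2 * eps * gamma * alpha**2 / ((1 - gamma * (1 - alpha)) * (1 - gamma))
    return eta_mix - (L - penalty)


alphas = np.linspace(0.0, 1.0, 11)
gaps = [bound_gap(a) for a in alphas]
print(np.round(gaps, 4))
```

A non-negative gap for every $\alpha$ means the bound holds on this example; the gap is zero at $\alpha = 0$, where the new policy coincides with $\pi_0$.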

Based on this bound, Schulman et al. (2015) proved the following simpler bound involving KL-divergence between the new policy and the old policy:

$$\eta (\pi) \geq L_{\pi_0} (\pi) - C \max_{s} D_{KL} \left( \pi_0 (s) \| \pi (s) \right)$$

where $C = \frac{2\epsilon \gamma}{(1-\gamma)^2}$.

Unfortunately, computing the maximum KL divergence over the whole state space is intractable. Schulman et al. (2015) proposed using the mean KL divergence over the state space as an approximation, which can be estimated from sampled states:

$$\overline{D}_{KL}(\pi_0 \| \pi) = \mathbb{E}_{s \sim \rho_{\pi_0}} \left[ D_{KL} (\pi_0 (s) \| \pi (s)) \right]$$
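As a minimal sketch of how the mean KL divergence might be estimated, assume categorical policies stored as tables and a batch of states visited while running $\pi_0$. The batch below is drawn uniformly at random purely for illustration, and `categorical_kl` is a hypothetical helper.

```python
import numpy as np


def categorical_kl(p, q):
    """Row-wise KL(p || q) for categorical distributions with positive entries."""
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)


# Hypothetical tabular policies; sizes and seed are arbitrary.
rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
pi_old = rng.dirichlet(np.ones(n_actions), size=n_states)
pi_new = 0.9 * pi_old + 0.1 * rng.dirichlet(np.ones(n_actions), size=n_states)

# Stand-in for states collected from rollouts of pi_old (an assumption of this sketch).
sampled_states = rng.integers(0, n_states, size=1000)

kl_per_state = categorical_kl(pi_old, pi_new)   # D_KL(pi_0(s) || pi(s)) for each s
mean_kl = kl_per_state[sampled_states].mean()   # sample estimate of the mean KL
max_kl = kl_per_state.max()                     # needs every state, not just samples

print("mean KL =", mean_kl, " max KL =", max_kl)
```

The sample mean only needs the states that were actually visited, whereas the maximum requires sweeping the whole state space, which is why the mean is the practical choice.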

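With the policy parameterized by $\theta$ (so that $\pi_0 = \pi_{\theta_0}$, $\pi = \pi_\theta$, and $L_{\theta_0}(\theta)$ is shorthand for $L_{\pi_{\theta_0}}(\pi_\theta)$), the TRPO optimization problem is:

$$\operatorname*{maximize}_{\theta} \left[ L_{\theta_0} (\theta) - C \overline{D}_{KL} (\pi_{\theta_0} \| \pi_\theta) \right]$$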

# In Practice

Finally, in practice, Schulman suggests that we can choose one of the following variants of the algorithm:

* Optimize the penalized objective directly with first-order optimization methods; this approach is known as Proximal Policy Optimization (PPO).
* At each iteration, use a first-order approximation to $L$ and a second-order approximation to $\overline{D}_{KL}(\pi_0 \| \pi)$, and then use the conjugate gradient method to approximately compute the update direction $F^{-1}g$, where $g$ is the gradient of $L$ and $F$ is the second-order derivative (Hessian) of the KL divergence, also known as the Fisher Information Matrix (FIM).
* Place a hard constraint on the KL divergence (the trust region). We can still use conjugate gradient to solve the following formulation:

$$\operatorname*{maximize}_\theta L_{\theta_0}(\theta)$$
$$\text{subject to } \overline{D}_{KL} (\pi_{\theta_0} \| \pi_\theta) \leq \delta$$
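The conjugate gradient step can be sketched as follows. The code assumes access to a function that returns Fisher-vector products $Fv$; here a small random symmetric positive definite matrix stands in for $F$ and a random vector for $g$, both purely hypothetical. The step is then scaled so that the quadratic approximation of the KL divergence, $\frac{1}{2} \Delta\theta^\top F \Delta\theta$, equals $\delta$.

```python
import numpy as np


def conjugate_gradient(fvp, g, iters=50, tol=1e-20):
    """Approximately solve F x = g using only Fisher-vector products fvp(v) = F v."""
    x = np.zeros_like(g)
    residual = g.copy()          # residual g - F x, with x = 0 initially
    p = g.copy()                 # current search direction
    rs = residual @ residual
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs / (p @ Fp)
        x += alpha * p
        residual -= alpha * Fp
        rs_new = residual @ residual
        if rs_new < tol:
            break
        p = residual + (rs_new / rs) * p
        rs = rs_new
    return x


# Hypothetical stand-ins: a random SPD matrix for F and a random gradient g.
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
F = M @ M.T + np.eye(4)
g = rng.normal(size=4)
delta = 0.01                     # trust-region size (illustrative value)

direction = conjugate_gradient(lambda v: F @ v, g)   # approximates F^{-1} g
# Scale the step so that 0.5 * step^T F step = delta.
step = np.sqrt(2 * delta / (direction @ F @ direction)) * direction

print("direction =", direction)
print("quadratic KL estimate =", 0.5 * step @ F @ step)
```

Using only matrix-vector products is the point of conjugate gradient here: for a neural network policy, $F$ is far too large to form or invert explicitly. In practice the scaled step is usually followed by a line search that checks the true constraint, which this sketch omits.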

# Initial Edit Date

June 1, 2018
