# Two Limitations of "Vanilla" Policy Gradient Methods

- Hard to choose stepsizes
    - Input data is nonstationary due to changing policy: observation and reward distributions change
    - Bad step is more damaging than in supervised learning, since it affects visitation distribution
        - Step too far $\rightarrow$ bad policy
        - Next batch: collected under bad policy
        - Cannot recover - collapse in performance
- Poor sample efficiency
    - Only one gradient step per environment sample
    - Dependent on scaling of coordinates

# Reducing Reinforcement Learning to Optimization

- Much of modern ML: reduce learning to numerical optimization problem
    - Supervised learning: minimize training error
- RL: how to *use all data so far and compute the best policy*?
    - Q-learning: can (in principle) include all transitions seen so far; however, it optimizes the wrong objective
    - Policy gradient methods: yield stochastic gradients, but define no optimization problem
    - This lecture: write down an optimization problem that allows you to do a small update to policy $\pi$ based on data sampled from $\pi$ (*on-policy* data)

# What Loss to Optimize?

- Policy gradients

$$\hat{g}=\hat{\mathbb{E}}_t \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \hat{A}_t \right]$$

- Can differentiate the following loss
$$L^{PG}(\theta)=\hat{\mathbb{E}}_t \left[ \log \pi_{\theta} (a_t \mid s_t) \hat{A}_t \right]$$
     but don't want to optimize it too far

- Equivalently differentiate
$$L_{\theta_{old}}^{IS}(\theta)=\hat{\mathbb{E}}_t \left[ \frac{\pi_\theta (a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t \right]$$
     at $\theta=\theta_{old}$, state-actions are sampled using $\theta_{old}$. (IS = importance sampling)
     Just the chain rule: $\nabla_{\theta} \log f(\theta)\mid_{\theta_{old}}=\frac{\nabla_{\theta} f(\theta)\mid_{\theta_{old}}}{f(\theta_{old})}=\nabla_{\theta}\left(\frac{f(\theta)}{f(\theta_{old})}\right)\mid_{\theta_{old}}$
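
The equivalence of the two gradients at $\theta=\theta_{old}$ can be checked numerically. The following is a minimal sketch, assuming a single-state softmax policy over four actions with random stand-in advantages; the helper names (`softmax`, `loss_pg`, `loss_is`, `numerical_grad`) are illustrative and not from the lecture.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_pg(theta, actions, advantages):
    # L^PG: mean of log pi_theta(a_t) * A_t
    return np.mean(np.log(softmax(theta)[actions]) * advantages)

def loss_is(theta, theta_old, actions, advantages):
    # L^IS: mean of (pi_theta(a_t) / pi_theta_old(a_t)) * A_t
    ratio = softmax(theta)[actions] / softmax(theta_old)[actions]
    return np.mean(ratio * advantages)

def numerical_grad(f, theta, eps=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
theta_old = rng.normal(size=4)
actions = rng.choice(4, size=64, p=softmax(theta_old))  # sampled under theta_old
advantages = rng.normal(size=64)                        # stand-in for A_hat_t

grad_pg = numerical_grad(lambda th: loss_pg(th, actions, advantages), theta_old)
grad_is = numerical_grad(lambda th: loss_is(th, theta_old, actions, advantages), theta_old)

# Analytic policy gradient g_hat: for a softmax, grad log pi(a) = one_hot(a) - pi
grad_hat = np.mean((np.eye(4)[actions] - softmax(theta_old)) * advantages[:, None], axis=0)
```

All three vectors should agree up to finite-difference error. Away from $\theta_{old}$ the two losses generally have different gradients.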

# Surrogate Loss: Importance Sampling Interpretation

- Importance sampling interpretation
\begin{split}
& \mathbb{E}_{s_t \sim \pi_{\theta_{old}},a_t \sim \pi_{\theta}} \left[ A^{\pi_{\theta_{old}}} (s_t, a_t) \right] \\
= & \mathbb{E}_{s_t \sim \pi_{\theta_{old}},a_t \sim \pi_{\theta_{old}}} \left[ \frac{\pi_{\theta} (a_t \mid s_t)}{\pi_{\theta_{old}} (a_t \mid s_t)} A^{\pi_{\theta_{old}}} (s_t, a_t) \right] \enspace \text{(importance sampling)} \\
\approx & \hat{\mathbb{E}}_t \left[ \frac{\pi_{\theta} (a_t \mid s_t)}{\pi_{\theta_{old}} (a_t \mid s_t)} \hat{A}_t \right] \enspace \text{(replace}\enspace A^{\pi_{\theta_{old}}} \text{ with estimator and the expectation with a sample average)} \\
= & L_{\theta_{old}}^{IS} (\theta)
\end{split}

- Kakade and Langford (2002) and Schulman et al. (2015) analyze how $L^{IS}$ approximates the actual performance difference between $\theta$ and $\theta_{old}$

- In practice, $L^{IS}$ is not much different than the logprob version $L^{PG} (\theta)=\hat{\mathbb{E}}_t \left[ \log \pi_{\theta} (a_t \mid s_t) \hat{A}_t \right]$, for reasonably small policy changes.
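
One way to see why the two losses behave alike for small updates is a first-order expansion of the probability ratio. This is a sketch of the reasoning, not a statement from the lecture:

$$\frac{\pi_\theta (a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} = \exp\left( \log \pi_\theta (a_t \mid s_t) - \log \pi_{\theta_{old}} (a_t \mid s_t) \right) \approx 1 + \log \pi_\theta (a_t \mid s_t) - \log \pi_{\theta_{old}} (a_t \mid s_t)$$

Multiplying by $\hat{A}_t$ and averaging gives

$$L_{\theta_{old}}^{IS}(\theta) \approx L^{PG}(\theta) + \hat{\mathbb{E}}_t \left[ \left( 1 - \log \pi_{\theta_{old}} (a_t \mid s_t) \right) \hat{A}_t \right]$$

where the second term does not depend on $\theta$. The two losses therefore differ only by a constant up to second-order terms in the log-probability change.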

# Trust Region Policy Optimization

- Define the following trust region update:
$$\operatorname*{maximize}_{\theta} \hat{\mathbb{E}}_t \left[ \frac{\pi_\theta (a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t \right]$$
$$\text{subject to}\enspace \hat{\mathbb{E}}_t \left[ \text{KL}[\pi_{\theta_{old}} (\cdot \mid s_t), \pi_\theta (\cdot \mid s_t)] \right] \leq \delta$$

- Also worth considering using a penalty instead of a constraint
$$\operatorname*{maximize}_{\theta} \hat{\mathbb{E}}_t \left[ \frac{\pi_\theta (a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t \right] - \beta \hat{\mathbb{E}}_t \left[ \text{KL}[\pi_{\theta_{old}} (\cdot \mid s_t), \pi_\theta (\cdot \mid s_t)] \right]$$

- Method of Lagrange multipliers: optimality point of $\delta$-constrained problem is also an optimality point of $\beta$-penalized problem for some $\beta$

- In practice, $\delta$ is easier to tune, and fixed $\delta$ is better than fixed $\beta$
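
To make the two formulations concrete, here is a minimal sketch that evaluates the surrogate, the mean KL, the constraint check, and the penalized objective. It assumes a tabular policy over three sampled states and two actions; the probabilities, advantages, $\delta$, and $\beta$ are made-up illustrative values.

```python
import numpy as np

def mean_kl(p_old, p_new):
    # Rows are sampled states, columns are actions.
    # KL[pi_old(.|s), pi_new(.|s)] per state, averaged over states.
    return np.mean(np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=1))

def surrogate(p_old, p_new, actions, advantages):
    # Importance-sampled surrogate: mean of ratio * advantage
    idx = np.arange(len(actions))
    ratio = p_new[idx, actions] / p_old[idx, actions]
    return np.mean(ratio * advantages)

p_old = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]])
p_new = np.array([[0.6, 0.4], [0.8, 0.2], [0.25, 0.75]])
actions = np.array([0, 1, 1])            # actions taken under p_old
advantages = np.array([1.0, -0.5, 0.3])  # stand-in advantage estimates

delta, beta = 0.05, 1.0                  # illustrative values only
kl = mean_kl(p_old, p_new)
L_is = surrogate(p_old, p_new, actions, advantages)

feasible = kl <= delta                   # constrained form: is the update allowed?
penalized = L_is - beta * kl             # penalized form: objective to maximize
```

Note the argument order of the KL term: the old policy comes first, matching the formulas above.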

# Monotonic Improvement Result

- Consider KL penalized objective
$$\operatorname*{maximize}_{\theta} \hat{\mathbb{E}}_t \left[ \frac{\pi_{\theta} (a_t \mid s_t)}{\pi_{\theta_{old}} (a_t \mid s_t)} \hat{A}_t \right] - \beta \hat{\mathbb{E}}_t \left[\text{KL}[\pi_{\theta_{old}}(\cdot \mid s_t), \pi_{\theta}(\cdot \mid s_t)] \right]$$

- Theory result: if we use max KL instead of mean KL in penalty, then we get a lower (=pessimistic) bound on policy performance

<img src="files/figures/monotonic_improvement_result.png" style="width: 300px;" />

# Trust Region Policy Optimization: Pseudocode

- Pseudocode:

\begin{split}
& \textbf{for}\text{ iteration = 1, 2, ... }\textbf{do} \\
& \qquad \text{Run policy for $T$ timesteps or $N$ trajectories} \\
& \qquad \text{Estimate advantage function at all timesteps} \\
& \qquad\qquad \operatorname*{maximize}_{\theta} \sum_{n=1}^{N} \frac{\pi_{\theta} (a_n \mid s_n)}{\pi_{\theta_{old}} (a_n \mid s_n)} \hat{A}_n \\
& \qquad\qquad \text{subject to} \quad \overline{KL}_{\pi_{\theta_{old}}} (\pi_{\theta}) \leq \delta \\
& \textbf{end for}
\end{split}

- Can solve the constrained optimization problem efficiently using the conjugate gradient method
- Closely related to natural policy gradients (Kakade, 2002), natural actor-critic (Peters and Schaal, 2005), REPS (Peters et al., 2010)
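
The conjugate gradient step can be sketched as follows. This is a minimal illustration, not the lecture's implementation: it assumes the constraint is replaced by its quadratic approximation $\frac{1}{2}\Delta^T F \Delta \leq \delta$, uses a random positive definite matrix as a stand-in for the Fisher matrix $F$, and omits the line search that practical implementations typically add. The function name `conjugate_gradient` and all numeric values are illustrative.

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=50, tol=1e-10):
    # Solve A x = b for symmetric positive definite A,
    # using only matrix-vector products matvec(v) = A v.
    x = np.zeros_like(b)
    r = b.copy()          # residual b - A x, with x = 0
    p = r.copy()          # search direction
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
F = M @ M.T + 0.1 * np.eye(5)   # stand-in for the Fisher matrix
g = rng.normal(size=5)          # stand-in for the policy gradient
delta = 0.01

x = conjugate_gradient(lambda v: F @ v, g)    # x approximates F^{-1} g
step = np.sqrt(2 * delta / (x @ F @ x)) * x   # scale so that 0.5 * step^T F step = delta
```

The point of passing `matvec` rather than `F` is that the matrix never has to be formed or inverted explicitly; only products with vectors are needed.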

# Solving KL Penalized Problem

- $\operatorname*{maximize}_\theta L_{\pi_{\theta_{old}}}(\pi_\theta) - \beta \cdot \overline{KL}_{\pi_{\theta_{old}}}(\pi_{\theta})$
- Make linear approximation to $L_{\pi_{\theta_{old}}}$ and quadratic approximation to KL term:
$$\operatorname*{maximize}_\theta g \cdot (\theta - \theta_{old}) - \frac{\beta}{2} (\theta - \theta_{old})^T F (\theta - \theta_{old})$$
$$\text{where} \enspace g=\frac{\partial}{\partial \theta}L_{\pi_{\theta_{old}}}(\pi_\theta)\mid_{\theta=\theta_{old}}, \enspace F=\frac{\partial^2}{\partial \theta^2}\overline{KL}_{\pi_{\theta_{old}}}(\pi_\theta)\mid_{\theta=\theta_{old}}$$
    - Quadratic part of $L$ is negligible compared to KL term
    - $F$ is positive semidefinite, but the quadratic term need not be if we also include the Hessian of $L$
- Solution: $\theta-\theta_{old}=\frac{1}{\beta}F^{-1}g$, where $F$ is Fisher information matrix, $g$ is policy gradient. This is called the **natural policy gradient** (Kakade, 2001)
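
The solution follows from setting the gradient of the quadratic objective to zero. As a short worked step, assuming $F$ is invertible and writing $\Delta = \theta - \theta_{old}$:

$$\nabla_\Delta \left[ g \cdot \Delta - \frac{\beta}{2} \Delta^T F \Delta \right] = g - \beta F \Delta = 0 \quad \Rightarrow \quad \Delta = \frac{1}{\beta} F^{-1} g$$

Because $F$ is positive semidefinite, the objective is concave in $\Delta$, so this stationary point is a maximum. The penalty coefficient $\beta$ only sets the length of the step; the direction is always $F^{-1} g$.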

# Review

- Suggested optimizing surrogate loss $L^{PG}$ or $L^{IS}$
- Suggested using KL to constrain size of update
- Corresponds to natural gradient step $F^{-1}g$ under linear-quadratic approximation
- Can solve for this step approximately using conjugate gradient method

# "Proximal" Policy Optimization: KL Penalty Version

- Use penalty instead of constraint
$$\operatorname*{maximize}_\theta \sum_{n=1}^{N}\frac{\pi_\theta (a_n \mid s_n)}{\pi_{\theta_{old}} (a_n \mid s_n)}\hat{A}_n - \beta \cdot \overline{KL}_{\pi_{\theta_{old}}}(\pi_\theta)$$

\begin{split}
& \textbf{for} \text{ iteration = 1, 2, ... } \textbf{do} \\
& \qquad \text{Run policy for } T \text{ timesteps or } N \text{ trajectories} \\
& \qquad \text{Estimate advantage function at all timesteps} \\
& \qquad \text{Do SGD on above objective for some number of epochs} \\
& \qquad \text{If KL too high, increase } \beta \text{. If KL too low, decrease } \beta \text{.} \\
& \textbf{end for}
\end{split}

- $\approx$ same performance as TRPO, but only first-order optimization
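
The penalty adaptation in the last step of the loop can be sketched as a small rule. This is one possible form under stated assumptions: the lecture only says to raise $\beta$ when the KL is too high and lower it when too low, so the target `kl_target`, the tolerance `tol`, and the multiplicative `factor` below are illustrative choices.

```python
def update_beta(beta, kl, kl_target, tol=1.5, factor=2.0):
    # Raise the penalty when the measured KL overshoots the target,
    # lower it when the KL undershoots, otherwise leave it unchanged.
    if kl > tol * kl_target:
        return beta * factor
    if kl < kl_target / tol:
        return beta / factor
    return beta

beta = 1.0
beta_high = update_beta(beta, kl=0.05, kl_target=0.01)   # KL too high
beta_low = update_beta(beta, kl=0.001, kl_target=0.01)   # KL too low
beta_same = update_beta(beta, kl=0.01, kl_target=0.01)   # KL within tolerance
```

The rule would be called once per iteration, after the SGD epochs, using the mean KL measured between the old and updated policies on the collected batch.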

# Connection Between Trust Region Problem and Other Things

$$\operatorname*{maximize}_\theta \sum_{n=1}^{N} \frac{\pi_\theta (a_n \mid s_n)}{\pi_{\theta_{old}} (a_n \mid s_n)} \hat{A}_n$$
$$\text{subject to } \overline{KL}_{\pi_{\theta_{old}}}(\pi_\theta) \leq \delta$$

- Linear-quadratic approximation + penalty $\Rightarrow$ natural gradient
- No constraint $\Rightarrow$ policy iteration
- Euclidean penalty instead of KL $\Rightarrow$ vanilla policy gradient
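
The last two connections can be made explicit with short derivations. These are sketches under simplifying assumptions, not results stated in the lecture.

Euclidean penalty: replacing the KL term with $\frac{\beta}{2}\lVert \theta - \theta_{old} \rVert^2$ and linearizing the surrogate, with $g$ the policy gradient at $\theta_{old}$, gives

$$\operatorname*{maximize}_\theta \enspace g \cdot (\theta - \theta_{old}) - \frac{\beta}{2} \lVert \theta - \theta_{old} \rVert^2 \quad \Rightarrow \quad \theta - \theta_{old} = \frac{1}{\beta} g$$

which is a plain gradient ascent step with stepsize $1/\beta$.

No constraint: assuming exact advantages and an unrestricted policy class, the surrogate at each state is

$$\mathbb{E}_{a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_\theta (a \mid s)}{\pi_{\theta_{old}} (a \mid s)} A^{\pi_{\theta_{old}}} (s, a) \right] = \sum_a \pi_\theta (a \mid s) A^{\pi_{\theta_{old}}} (s, a)$$

which is maximized by putting all probability on $\arg\max_a A^{\pi_{\theta_{old}}} (s, a)$. That is the greedy improvement step of policy iteration.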

# Limitations of TRPO

- Hard to use with architectures with multiple outputs, e.g. policy and value function (need to weight different terms in distance metric)
- Empirically performs poorly on tasks requiring deep CNNs and RNNs, e.g. Atari benchmark
- Conjugate gradient (CG) makes implementation more complicated

# Calculating Natural Gradient Step with KFAC

- Summary: do blockwise approximation to FIM, and approximate each block using a certain factorization
- Alternate expression for FIM as outer product (instead of second deriv. of KL)
$$F = \hat{\mathbb{E}}_t \left[ \nabla_\theta \log \pi_\theta (a_t \mid s_t) \, \nabla_\theta \log \pi_\theta (a_t \mid s_t)^T \right]$$

- See Martens and Grosse (2015); Grosse and Martens (2016)
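
The agreement between the outer-product form of the FIM and the second derivative of the KL can be checked on a small example. The sketch below assumes a single-state softmax policy parameterized directly by its logits, and takes the expectation over actions exactly rather than from samples. It does not implement KFAC itself; the function names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(theta_old, theta):
    # KL[pi_old, pi_theta] for categorical distributions given by logits
    p, q = softmax(theta_old), softmax(theta)
    return np.sum(p * (np.log(p) - np.log(q)))

def fisher_outer_product(theta):
    # E_a[ score score^T ], with score = grad log pi(a) = one_hot(a) - pi
    p = softmax(theta)
    n = theta.size
    F = np.zeros((n, n))
    for a in range(n):
        score = -p.copy()
        score[a] += 1.0
        F += p[a] * np.outer(score, score)
    return F

def kl_hessian(theta_old, eps=1e-4):
    # Second derivative of KL w.r.t. theta at theta = theta_old,
    # estimated with central finite differences.
    n = theta_old.size
    E = np.eye(n) * eps
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (kl(theta_old, theta_old + E[i] + E[j])
                       - kl(theta_old, theta_old + E[i] - E[j])
                       - kl(theta_old, theta_old - E[i] + E[j])
                       + kl(theta_old, theta_old - E[i] - E[j])) / (4 * eps**2)
    return H

theta_old = np.array([0.5, -1.0, 0.2])
F_outer = fisher_outer_product(theta_old)
F_hess = kl_hessian(theta_old)
p = softmax(theta_old)
F_closed = np.diag(p) - np.outer(p, p)   # known closed form for softmax logits
```

The outer-product form needs only first derivatives of the log-probabilities, which is why it is the convenient starting point for blockwise approximations.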

# Initial Edit Date

May 30, 2018

# References

[1] Deep RL Bootcamp Lecture 5: Natural Policy Gradients, TRPO, PPO, *YouTube*, https://www.youtube.com/watch?v=xvRrgxcpaHY