# HIGH-DIMENSIONAL CONTINUOUS CONTROL USING GENERALIZED ADVANTAGE ESTIMATION

### John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan and Pieter Abbeel
#### Ref: [literature](https://arxiv.org/abs/1506.02438)

## 1. Background (TRPO)

This paper builds on Trust Region Policy Optimization (TRPO, 2015) and Approximately Optimal Approximate Reinforcement Learning (Kakade and Langford, 2002).

### 1.1 TRPO Derivation

* Goal Formulation

$maximize_{\theta}\:\big[\nabla_{\theta}L_{\theta_{old}}(\theta)|_{\theta=\theta_{old}}\cdot(\theta-\theta_{old})\big]$

$subject\:to\: \frac{1}{2}(\theta_{old}-\theta)^{T}A(\theta_{old})(\theta_{old}-\theta)\leq \delta$

$where\:A(\theta_{old})_{ij} = \frac{\partial}{\partial \theta_i}\frac{\partial}{\partial \theta_j}E_{s\sim\rho_\pi}\big[D_{KL}(\pi_{\theta_{old}}(\cdot|s)||\pi_{\theta}(\cdot|s))\big]\Big|_{\theta={\theta_{old}}}$
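
A linear objective under a quadratic constraint has a closed-form maximizer, $\theta-\theta_{old}=\sqrt{2\delta/(g^{T}A^{-1}g)}\,A^{-1}g$ with $g$ the gradient of $L_{\theta_{old}}$. The sketch below illustrates this on random data. The matrix `A` and vector `g` are hypothetical stand-ins, not quantities from a real policy, and a dense solve is assumed to be affordable, which is generally not the case for large policies.

```python
import numpy as np

def trust_region_step(g, A, delta):
    """Maximize g @ d subject to 0.5 * d @ A @ d <= delta (A symmetric positive definite)."""
    x = np.linalg.solve(A, g)                    # A^{-1} g, the natural-gradient direction
    step_size = np.sqrt(2.0 * delta / (g @ x))   # scale so the constraint is active
    return step_size * x

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T + 4.0 * np.eye(4)   # hypothetical stand-in for the KL Hessian A(theta_old)
g = rng.normal(size=4)          # hypothetical stand-in for the gradient of L
delta = 0.01

d = trust_region_step(g, A, delta)
constraint_value = 0.5 * d @ A @ d   # equals delta up to rounding
```

The step lies exactly on the boundary of the trust region, which is where a linear objective attains its maximum.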


#### Preliminaries
- __MDP $(S,A,P,r,\gamma,\rho_0)$ :__ 

where the elements denote, in order, the state space, action space, transition probability, reward, discount factor, and initial state distribution. Here the reward is a function of the state only, and the policy is stochastic.  

- __Value Functions :__  

$Q_{\pi}(s_t,a_t) = E_{s_{t+1},a_{t+1}, ...}[\sum_{l=0}^{\infty}\gamma^{l}r(s_{t+l})]$  
      
$V_{\pi}(s_t) = E_{a_{t},s_{t+1}, ...}[\sum_{l=0}^{\infty}\gamma^{l}r(s_{t+l})]$
<br/>
- __Advantage Function :__

$A_{\pi}(s_t,a_t) = Q_{\pi}(s_t,a_t)-V_{\pi}(s_t)$  
$A_{\pi}(s,a) = E_{s'\sim P(s'|s,a)}[r(s)+\gamma V_{\pi}(s')-V_{\pi}(s)]$ 
<br/>
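
For a small tabular MDP these definitions can be evaluated exactly. The following is a minimal sketch on a randomly generated MDP (all sizes and arrays are hypothetical), using the linear system $V_{\pi} = r + \gamma P_{\pi}V_{\pi}$. It also checks that the advantage averages to zero under the policy that defines it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 3, 2, 0.9            # hypothetical sizes and discount factor

P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)      # P[s, a, s'] transition probabilities
r = rng.random(n_s)                    # reward depends on the state only
pi = rng.random((n_s, n_a))
pi /= pi.sum(axis=1, keepdims=True)    # pi[s, a] stochastic policy

P_pi = np.einsum("sa,sap->sp", pi, P)                # state transitions under pi
V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r)   # solves V = r + gamma * P_pi @ V
Q = r[:, None] + gamma * P @ V                       # Q[s, a] = r(s) + gamma * E[V(s')]
A = Q - V[:, None]                                   # advantage A[s, a]

mean_advantage = (pi * A).sum(axis=1)  # sum_a pi(a|s) A(s, a), zero for every s
```
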
- __Policy Objective :__  
The agent aims to maximize the expected discounted cumulative reward (the return).  
    * __In time representation__
$$\eta(\pi) = E_{s_0,a_0,s_1,...}\Big[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})\Big]$$
    * __In state representation__  
$\eta(\tilde\pi) = E_{\tau\sim\tilde\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})\Big]$  
$\eta(\pi) = E_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})\Big]$  
$\eta(\tilde\pi) = \eta(\pi)-E_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})\Big] + E_{\tau\sim\tilde\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})\Big]$  
$\eta(\tilde\pi) = \eta(\pi)-E_{s_0\sim\rho_0}\big[V_{\pi}(s_0)\big] + E_{\tau\sim\tilde\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})\Big]$  
$\eta(\tilde\pi) = \eta(\pi) + E_{\tau\sim\tilde\pi}\Big[-V_{\pi}(s_0)+\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})\Big]$  
$\eta(\tilde\pi) = \eta(\pi) + E_{\tau\sim\tilde\pi}\Big[-V_{\pi}(s_0)+r(s_0)+\gamma V_{\pi}(s_1)+\gamma\{-V_{\pi}(s_1)+r(s_1)+\gamma V_{\pi}(s_2)\}...\Big]$  
$\eta(\tilde\pi) = \eta(\pi) + E_{\tau\sim\tilde\pi}\Big[\sum_{t=0}^{\infty}\gamma^t A_{\pi}(s_t,a_t)\Big]$  
$\eta(\tilde\pi) = \eta(\pi) + \sum_{t=0}^{\infty}\gamma^t\sum_sP(s_t=s;\tilde\pi)\sum_a\tilde\pi(a|s) A_{\pi}(s,a)$  
<br />
<br />
<br />
<br />
$$\large\eta(\tilde\pi) = \eta(\pi)+\sum_{s}\rho_{\tilde\pi}(s)\sum_{a}\tilde\pi(a|s)A_{\pi}(s,a)$$
<br />
<br />
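
The identity above can be checked numerically on a small tabular MDP. This is a sketch under assumptions: the MDP and both policies are random and hypothetical, and the helper `evaluate` is an illustrative name. It uses $\rho_{\pi}^{T} = \rho_0^{T}(I-\gamma P_{\pi})^{-1}$ for the unnormalized discounted visitation frequency.

```python
import numpy as np

def evaluate(pi, P, r, rho0, gamma):
    """Return eta(pi), the advantage A_pi[s, a] and the visitation rho_pi[s]."""
    n_s = len(r)
    P_pi = np.einsum("sa,sap->sp", pi, P)
    M = np.eye(n_s) - gamma * P_pi
    V = np.linalg.solve(M, r)                      # V_pi
    A = r[:, None] + gamma * P @ V - V[:, None]    # A_pi
    rho = np.linalg.solve(M.T, rho0)               # sum_t gamma^t P(s_t = s)
    return rho0 @ V, A, rho

def random_policy(rng, n_s, n_a):
    pi = rng.random((n_s, n_a))
    return pi / pi.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
n_s, n_a, gamma = 4, 3, 0.9
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)
r = rng.random(n_s)
rho0 = np.full(n_s, 1.0 / n_s)

pi = random_policy(rng, n_s, n_a)         # current policy
pi_tilde = random_policy(rng, n_s, n_a)   # candidate policy

eta_pi, A_pi, _ = evaluate(pi, P, r, rho0, gamma)
eta_tilde, _, rho_tilde = evaluate(pi_tilde, P, r, rho0, gamma)

# Right-hand side: eta(pi) + sum_s rho_tilde(s) sum_a pi_tilde(a|s) A_pi(s, a)
rhs = eta_pi + rho_tilde @ (pi_tilde * A_pi).sum(axis=1)
```

Here `eta_tilde` and `rhs` agree up to floating-point error, even though the two policies are unrelated.
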
Here $\rho_{\tilde\pi}(s) = \sum_{t=0}^{\infty}\gamma^t P(s_t=s;\tilde\pi)$ is the unnormalized discounted visitation frequency under the new policy, and it is hard to work with directly. In generic policy iteration, policy evaluation is followed by policy improvement. Right after the improvement step the agent has not yet rolled out the new policy, so its visitation frequencies are not available. The main idea of the 2002 and 2015 papers is to use the state distribution of the previous policy, $\rho_{\pi}$, in place of $\rho_{\tilde\pi}$.
<br />
<br />
$$\large L_{\pi}(\tilde\pi) = \eta(\pi)+\sum_{s}\rho_{\pi}(s)\sum_{a}\tilde\pi(a|s)A_{\pi}(s,a)$$
$$\large \pi'\in argmax_{\pi'}L_{\pi}(\pi')$$
$$\large \pi_{new}=(1-\alpha){\pi_{old}}+\alpha{\pi'}$$
<br />
<br />
Maximizing $L_{\pi}$ directly does not guarantee an improvement of $\eta$. In practice the advantage is estimated with a parameterized function, which introduces estimation error and approximation error at the same time. Kakade and Langford (2002) therefore use the conservative policy iteration update (the mixture $\pi_{new}$ above), for which they provide explicit lower bounds on the improvement of $\eta$.
<br />
<br />
- __Lower Bound__  
We first follow the 2002 approach, then turn to the 2015 approach, which slightly changes the policy improvement step.
    * __Properties And Condition__  
$\eta(\tilde\pi) = \eta(\pi) + E_{\tau\sim\tilde\pi}\Big[\sum_{t=0}^{\infty}\gamma^t A_{\pi}(s_t,a_t)\Big]$  
if $\tilde\pi=\pi$  
$\:\:\:\:\:\:E_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^t A_{\pi}(s_t,a_t)\Big]=0$  
$\:\:\:\:\:\:\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi}(s,a)=0$   
$\:\:\:\:\:\:\sum_{a}\pi(a|s)A_{\pi}(s,a)=0$  
$\epsilon_{old} = \max_s|E_{a\sim\pi'}A_{\pi}(s,a)|\geq |E_{a\sim\pi'}A_{\pi}(s,a)|$  
    * __Derivation__  
$\:\:\:\:\:\:\eta(\pi_{new}) = \eta(\pi_{old}) + E_{\tau\sim\pi_{new}}\Big[\sum_{t=0}^{\infty}\gamma^t A_{\pi_{old}}(s_t,a_t)\Big]$  
$\:\:\:\:\:\:\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{s}\rho_{\pi_{new}}(s)\sum_{a}\pi_{new}(a|s)A_{\pi_{old}}(s,a)$  
$\:\:\:\:\:\:\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{s}\rho_{\pi_{new}}(s)\sum_{a}\big\{(1-\alpha)\pi_{old}(a|s)+\alpha\pi'(a|s)\big\}A_{\pi_{old}}(s,a)$  
$\:\:\:\:\:\:\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{s}\rho_{\pi_{new}}(s)\sum_{a}\big\{\alpha\pi'(a|s)\big\}A_{\pi_{old}}(s,a)$  
$\:\:\:\:\:\:\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{t=0}^{\infty}\gamma^t\sum_sP(s_t=s;\pi_{new})\sum_{a}\big\{\alpha\pi'(a|s)\big\}A_{\pi_{old}}(s,a)$  
$\:\:\:\:\:\:\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{t=0}^{\infty}\gamma^t\sum_s\big\{(1-\alpha)^t P(s_t=s;\pi_{old\,only})+\big(1-(1-\alpha)^t\big) P(s_t=s;\pi_{rest})\big\}\sum_{a}\big\{\alpha\pi'(a|s)\big\}A_{\pi_{old}}(s,a)$  
Let $r_a = 1-(1-\alpha)^t$ denote the probability that at least one action before time $t$ was drawn from $\pi'$.  
$\:\:\:\:\:\:\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{t=0}^{\infty}\gamma^t\sum_s\big\{(1-r_a) P(s_t=s;\pi_{old\,only})+r_a P(s_t=s;\pi_{rest})\big\}\sum_{a}\big\{\alpha\pi'(a|s)\big\}A_{\pi_{old}}(s,a)$  
$\:\:\:\:\:\:\eta(\pi_{new}) = L_{\pi_{old}}(\pi_{new}) + \sum_{t=0}^{\infty}\gamma^t\sum_s\big\{-r_a P(s_t=s;\pi_{old\,only})+r_a P(s_t=s;\pi_{rest})\big\}\sum_{a}\big\{\alpha\pi'(a|s)\big\}A_{\pi_{old}}(s,a)$  
$\:\:\:\:\:\:\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - 2\alpha\epsilon_{old}\sum_{t=0}^{\infty}\gamma^t r_a$ 
<br />
<br />
$$\large\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) -\frac{2\alpha^2\epsilon_{old}\gamma}{(1-\gamma)^2}$$
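
The step from the sum over $t$ to the closed-form error term relies on two geometric series. A short worked derivation, using $r_a = 1-(1-\alpha)^t$ as defined above:

$$\sum_{t=0}^{\infty}\gamma^t r_a = \frac{1}{1-\gamma}-\frac{1}{1-\gamma(1-\alpha)} = \frac{\alpha\gamma}{(1-\gamma)\big(1-\gamma(1-\alpha)\big)}$$
$$2\alpha\epsilon_{old}\sum_{t=0}^{\infty}\gamma^t r_a = \frac{2\alpha^2\epsilon_{old}\gamma}{(1-\gamma)\big(1-\gamma(1-\alpha)\big)} \leq \frac{2\alpha^2\epsilon_{old}\gamma}{(1-\gamma)^2}$$

The last inequality uses $1-\gamma(1-\alpha)\geq 1-\gamma$, so the stated bound is slightly looser than the exact sum.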

This inequality means that any increase in $L_{\pi_{old}}(\pi_{new})$ larger than the error term guarantees an improvement of $\eta$.
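
The bound can be sanity-checked on a random tabular MDP with a conservative mixture update. This is only a sketch: the MDP, the policies, the value of `alpha` and the helper names are hypothetical, and a single random instance illustrates the bound rather than proving it.

```python
import numpy as np

def evaluate(pi, P, r, rho0, gamma):
    """Return eta(pi), the advantage A_pi[s, a] and the visitation rho_pi[s]."""
    n_s = len(r)
    P_pi = np.einsum("sa,sap->sp", pi, P)
    M = np.eye(n_s) - gamma * P_pi
    V = np.linalg.solve(M, r)
    A = r[:, None] + gamma * P @ V - V[:, None]
    rho = np.linalg.solve(M.T, rho0)
    return rho0 @ V, A, rho

def random_policy(rng, n_s, n_a):
    pi = rng.random((n_s, n_a))
    return pi / pi.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
n_s, n_a, gamma, alpha = 4, 3, 0.9, 0.3
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)
r = rng.random(n_s)
rho0 = np.full(n_s, 1.0 / n_s)

pi_old = random_policy(rng, n_s, n_a)
pi_prime = random_policy(rng, n_s, n_a)           # stand-in for the argmax policy pi'
pi_new = (1 - alpha) * pi_old + alpha * pi_prime  # conservative mixture update

eta_old, A_old, rho_old = evaluate(pi_old, P, r, rho0, gamma)
eta_new, _, _ = evaluate(pi_new, P, r, rho0, gamma)

# Surrogate L uses the OLD visitation frequencies with the NEW action probabilities.
L = eta_old + rho_old @ (pi_new * A_old).sum(axis=1)
eps_old = np.abs((pi_prime * A_old).sum(axis=1)).max()
penalty = 2 * alpha**2 * eps_old * gamma / (1 - gamma) ** 2
lower_bound = L - penalty
```

Comparing `eta_new` with `lower_bound` for several values of `alpha` shows how quickly the bound loosens as the mixture moves away from `pi_old`.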

Now for the 2015 approach. It changes the definition of $\epsilon$, which makes the bound more general (it applies to general stochastic policies, not only mixtures) at the cost of a larger error term. We denote the new $\epsilon$ by $\epsilon_{new}$.

$\epsilon_{old} = \max_s|E_{a\sim\pi'}A_{\pi}(s,a)|=\max_s\big|\sum_a\pi(a|s)A_{\pi}(s,a)-\sum_a\pi'(a|s)A_{\pi}(s,a)\big|\leq 2\max_{s,a}|A_{\pi}(s,a)|=2\epsilon_{new}$  
<br />
<br />
$$\large\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) -\frac{4\alpha^2\epsilon_{new}\gamma}{(1-\gamma)^2}$$
$$\large\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) -C * D_{KL}^{max}(\pi_{new}||\pi_{old}) $$
$$where\,\, C\,=\, \frac{4\epsilon_{new}\gamma}{(1-\gamma)^2},\,\,\alpha\,=\,D_{TV}^{max}(\pi_{new}||\pi_{old}),\,\,\alpha^2\leq D_{KL}^{max}(\pi_{new}||\pi_{old})$$
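
Replacing $\alpha^2$ by the KL divergence relies on the relation $D_{TV}(p||q)^2\leq D_{KL}(p||q)$ between total variation and KL divergence. The sketch below checks this relation empirically on random categorical distributions. The sample sizes and helper names are hypothetical, and the check is an illustration, not a proof.

```python
import numpy as np

def total_variation(p, q):
    """D_TV(p || q) = 0.5 * sum_i |p_i - q_i| for each row."""
    return 0.5 * np.abs(p - q).sum(axis=-1)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i) for each row (all entries assumed > 0)."""
    return (p * np.log(p / q)).sum(axis=-1)

rng = np.random.default_rng(0)
p = rng.dirichlet(2.0 * np.ones(5), size=1000)   # 1000 random distributions over 5 actions
q = rng.dirichlet(2.0 * np.ones(5), size=1000)

gap = kl_divergence(p, q) - total_variation(p, q) ** 2   # nonnegative in every row
```
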


* **From Theory to Practical Implementation**  
    * **Replace the Penalty Coefficient with a Constraint**  
In practice, the policy $\pi$ is parameterized by $\theta$. If we used the penalty coefficient $C$ recommended by the theory, the step sizes would be very small. The paper instead recommends choosing a bound $\delta$ on the KL divergence, which turns the penalized problem into a constrained one.
$$\large maximize_{\theta}\:L_{\theta_{old}}(\theta)$$
$$\large subject\:to\:D_{KL}^{max}(\theta||\theta_{old})\leq\delta$$  
    * **Exact method to Heuristic approximation**  
It is impractical to evaluate $D_{KL}^{max}$, a maximum over all states, at each iteration. Instead, they use a heuristic approximation, the average KL divergence $D_{KL}^{\rho}=E_{s\sim\rho}[D_{KL}]$.
$$\large maximize_{\theta}\:L_{\theta_{old}}(\theta)$$
$$\large subject\:to\:D_{KL}^{\rho}(\theta||\theta_{old})\leq\delta$$  
    * **Expectation becomes Sampled Sum**  
In this step, the expectations are replaced by sample averages (Monte Carlo estimates).  
$L_{\pi_{old}}(\pi_{new}) = \eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi_{new}(a|s)A_{\pi_{old}}(s,a)$  
$\sum_{s}\rho_{\pi_{old}}(s)[\cdot]\:\rightarrow\:\frac{1}{1-\gamma}E_{s\sim\rho_{\pi_{old}}}[\cdot]$    
$A_{\pi_{old}}(s,a)\:\rightarrow\:Q_{\pi_{old}}(s,a)$ (this changes the objective only by a constant)  
Importance sampling (with sampling distribution $q$):  
$\sum_{a}\pi_{new}(a|s)A_{\pi_{old}}(s,a)\:\rightarrow\:E_{a\sim q}\Big[\frac{\pi_{new}(a|s)}{q(a|s)}A_{\pi_{old}}(s,a)\Big]$
$$\large maximize_{\theta}\:E_{a\sim q,\,s\sim \rho_{old}}\bigg[\frac{\pi_{new}}{q}Q_{\pi_{old}}(s,a)\bigg]$$
$$\large subject\:to\:D_{KL}^{\rho}(\theta||\theta_{old})\leq\delta$$  
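
The sample-based objective and constraint can be sketched for tabular categorical policies as follows. Several things here are assumptions for illustration: the function name, the choice $q=\pi_{old}$, the random placeholder values standing in for $Q_{\pi_{old}}$ estimates, and the KL direction $D_{KL}(\pi_{old}||\pi_{new})$, which follows the convention of the TRPO paper.

```python
import numpy as np

def surrogate_and_mean_kl(pi_new, pi_old, q, states, actions, q_values):
    """Monte Carlo estimates of the importance-sampled objective and the average KL.

    pi_new, pi_old, q: arrays [n_states, n_actions] of action probabilities.
    states, actions:   sampled indices with s ~ rho_old and a ~ q(.|s).
    q_values:          estimates of Q_old(s, a) for the sampled pairs.
    """
    ratio = pi_new[states, actions] / q[states, actions]   # importance weights
    surrogate = np.mean(ratio * q_values)
    kl_per_state = (pi_old * np.log(pi_old / pi_new)).sum(axis=1)
    mean_kl = np.mean(kl_per_state[states])                # average over sampled states
    return surrogate, mean_kl

rng = np.random.default_rng(0)
n_states, n_actions, n_samples = 4, 3, 500
pi_old = rng.dirichlet(2.0 * np.ones(n_actions), size=n_states)
pi_new = rng.dirichlet(2.0 * np.ones(n_actions), size=n_states)
q = pi_old                                           # assumed: sample actions from pi_old

states = rng.integers(n_states, size=n_samples)      # placeholder for s ~ rho_old
actions = np.array([rng.choice(n_actions, p=q[s]) for s in states])
q_values = rng.normal(size=n_samples)                # placeholder Q estimates

surrogate, mean_kl = surrogate_and_mean_kl(pi_new, pi_old, q, states, actions, q_values)
```

A candidate `pi_new` would be accepted only if `mean_kl` stays at or below the chosen $\delta$. With `pi_new` equal to `pi_old` the importance weights are all one and the average KL is zero.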