<center><h1>Policy-Based Reinforcement Learning</h1></center>

# Basic Concepts

## Reward Functions

### Finite-horizon undiscounted return
The sum of the rewards collected along a fixed-length trajectory $\tau$:
$$ R(\tau) = \sum_{t=0}^{T} r_t $$

### Infinite-horizon discounted return
The sum of all rewards the agent ever collects, where each reward is weighted by the discount factor $\gamma \in (0, 1)$ raised to the power of the time step at which it was collected. Rewards further in the future therefore count less.
$$ R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t $$
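
The two return definitions above differ only in the discount factor, which a short sketch can make concrete. The function name `discounted_return` is hypothetical, and the infinite-horizon sum is truncated to a finite reward list purely for illustration:

```python
def discounted_return(rewards, gamma=1.0):
    """Return sum_t gamma^t * r_t over a finite list of rewards.

    gamma=1.0 gives the undiscounted return; gamma < 1 down-weights
    rewards collected later in the trajectory.
    """
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


rewards = [1.0, 1.0, 1.0, 1.0]
undiscounted = discounted_return(rewards)            # 1 + 1 + 1 + 1
discounted = discounted_return(rewards, gamma=0.5)   # 1 + 0.5 + 0.25 + 0.125
print(undiscounted, discounted)
```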

## Value Functions
$$ V^{\pi}(s) = E_{\tau \sim \pi} [R(\tau) | s_0 = s] $$
$$ Q^{\pi}(s, a) = E_{\tau \sim \pi} [R(\tau) | s_0 = s, a_0 = a] $$

## Advantage Function
$$ A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s) $$



# Vanilla Policy Gradient Algorithm
[For detail](https://medium.com/analytics-vidhya/a-deep-dive-into-vanilla-policy-gradients-3a79a95f3334)
[zhihu](https://zhuanlan.zhihu.com/p/599897265)

## Reward-to-go Method
An action can only affect the rewards collected after it is taken, so each action's log-probability is weighted only by the rewards that follow it. This sum is called the reward-to-go:
$$ \hat R_t = \sum_{t'=t}^{T} R(s_{t'},a_{t'},s_{t'+1}) $$ 

$$ \nabla_\theta J(\pi_\theta) = E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^{T} R(s_{t'},a_{t'},s_{t'+1}) \right] $$
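
As a rough illustration, the reward-to-go for every time step of one trajectory can be computed in a single backward pass. The function name `rewards_to_go` is hypothetical and is not taken from the `implementation` package used in this notebook:

```python
def rewards_to_go(rewards):
    """For each step t, return the sum of rewards from t to the end."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the sum of the steps after it.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out


rtg = rewards_to_go([1.0, 2.0, 3.0])
print(rtg)  # [6.0, 5.0, 3.0]
```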

## Advantage Function
Replacing the reward-to-go with the advantage function gives the policy gradient:
$$ \nabla_\theta J(\pi_\theta) = E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) A^{\pi}(s_t, a_t) \right]$$
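
To see what the advantage-weighted gradient looks like numerically, here is a minimal sketch for a deliberately simplified case: a state-independent softmax policy whose parameters are the action logits. That simplification, and all function names, are assumptions made for illustration. For a softmax policy, the gradient of $\log \pi(a)$ with respect to the logits is the one-hot vector of $a$ minus the action probabilities.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: onehot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def policy_gradient_estimate(logits, actions, advantages):
    """Sample average of grad log pi(a) * advantage over the given samples."""
    grad = [0.0] * len(logits)
    for a, adv in zip(actions, advantages):
        g = grad_log_softmax(logits, a)
        for i in range(len(grad)):
            grad[i] += g[i] * adv / len(actions)
    return grad


# Action 0 had a positive advantage, action 1 a negative one,
# so the estimate pushes the logit of action 0 up and action 1 down.
grad = policy_gradient_estimate([0.0, 0.0], actions=[0, 1], advantages=[1.0, -1.0])
print(grad)
```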

The value function $V^{\pi}(s_t)$ is estimated by an approximate value function $V_{\phi}(s_t)$, whose parameters are fitted by regressing on the rewards-to-go with a mean-squared-error loss:
$$ \phi_k = \arg\min_{\phi} E_{s_t, \hat R_t \sim \pi_k} \left[ (V_{\phi}(s_t) - \hat R_t)^2 \right] $$
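
The regression step can be sketched with a toy value function. The example below assumes a scalar state and a linear model $V_\phi(s) = w s + b$ fitted by plain gradient descent; a practical agent would more likely use a neural network and an optimizer from a deep-learning library. The function name and hyperparameters are hypothetical.

```python
def fit_value_function(states, targets, lr=0.05, steps=2000):
    """Fit V_phi(s) = w * s + b by gradient descent on the mean squared error
    between V_phi(s_t) and the reward-to-go targets."""
    w, b = 0.0, 0.0
    n = len(states)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for s, target in zip(states, targets):
            err = (w * s + b) - target
            grad_w += 2.0 * err * s / n
            grad_b += 2.0 * err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data: with a reward of 1 per step and 3 steps remaining at s=0,
# the reward-to-go falls linearly with the state index.
states = [0.0, 1.0, 2.0, 3.0]
targets = [3.0, 2.0, 1.0, 0.0]
w, b = fit_value_function(states, targets)
print(w, b)  # close to -1 and 3
```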

<center>
    <img src="./images/vpg_algo.webp" alt="example">
</center>

## Code

In [None]:
from implementation.vpg import VPGAgent
from implementation.util import plot_graph

def main():   
    agent = VPGAgent('CartPole-v0', gamma=0.99)
    rewards = agent.train(iterations=200, num_traj=32)
    plot_graph(rewards)

main()

# Proximal Policy Optimization (PPO)

## PPO with Adaptive KL Penalty
Proximal policy optimization (PPO) aims to avoid the computational overhead of the constrained optimization in trust region policy optimization (TRPO) by folding the KL-divergence constraint into the objective function as a penalty:
$$ E_{\tau \sim \pi_\theta} \left[ \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t \right] - C \cdot \overline{KL}_{\pi_{\theta_{old}}}(\pi_\theta) \tag{14}$$

One challenge of this variant is choosing an appropriate value for the penalty coefficient $C$. To address this, the algorithm adapts $C$ to the measured KL divergence after each update: if the KL divergence is too high, $C$ is increased, and if it is too low, $C$ is decreased. The penalty therefore stays strong enough to keep the new policy close to the old one without slowing learning unnecessarily.
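
A minimal sketch of the coefficient update described above. The tolerance factor of 1.5 around a target KL and the doubling/halving rule follow the heuristic suggested in the PPO paper; the function and parameter names are hypothetical.

```python
def update_kl_coefficient(c, kl, kl_target, tolerance=1.5, factor=2.0):
    """Adapt the KL penalty coefficient C after a policy update.

    kl is the measured mean KL divergence between the old and new policy,
    kl_target the desired value.
    """
    if kl > kl_target * tolerance:
        return c * factor   # policy moved too far: penalize more
    if kl < kl_target / tolerance:
        return c / factor   # policy barely moved: penalize less
    return c                # within tolerance: leave C unchanged


c = 1.0
c_high = update_kl_coefficient(c, kl=0.05, kl_target=0.01)   # KL too high
c_low = update_kl_coefficient(c, kl=0.001, kl_target=0.01)   # KL too low
c_same = update_kl_coefficient(c, kl=0.01, kl_target=0.01)   # on target
print(c_high, c_low, c_same)
```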

## PPO with Clipped Objective
Importance sampling can cause the variance of the estimate to grow as the new policy moves away from the old one. The clipped objective limits this difference directly by bounding the ratio between the action probabilities under the new and old policies:

$$ L^{CLIP}(\theta) = E_{t} \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right] \tag{15}$$

$$ r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \tag{16}$$
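
The clipped objective in equations (15) and (16) can be evaluated on a handful of samples as follows. This is only a sketch on plain Python lists, with the hypothetical name `clipped_surrogate`; a training loop would compute the same quantity on tensors so it can be differentiated.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Sample average of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for r, adv in zip(ratios, advantages):
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        total += min(r * adv, clipped_r * adv)
    return total / len(ratios)


# r=1.5, A=+1: the gain is capped at 1.2 * A.
# r=0.5, A=-1: the clipped term 0.8 * A = -0.8 is the smaller one.
# r=1.0, A=+2: no clipping applies.
value = clipped_surrogate([1.5, 0.5, 1.0], [1.0, -1.0, 2.0])
print(value)
```

Taking the minimum makes the objective a pessimistic bound: the policy gains nothing from pushing the ratio beyond the clip range in the direction the advantage favors.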

<center>
    <img src="./images/ppo_algo.jpg" alt="example">
</center>

<center>
    <img src="./images/ppo_algo.svg" alt="example">
</center>

## Generalized Advantage Estimate (GAE)
[For Detail](https://zhuanlan.zhihu.com/p/658971618)

### Policy Gradient Estimate
$$ A^{\pi}(s_t, a_t) := Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) $$
$$ V^{\pi}(s_t) := E_{s_{t+1:\infty}, a_{t:\infty}} [\sum_{l=0}^\infty r_{t+l}] $$
$$ Q^{\pi}(s_t, a_t) := E_{s_{t+1:\infty}, a_{t+1:\infty}} [\sum_{l=0}^\infty r_{t+l}] $$

### $\gamma$-just
$$ A^{\pi,\gamma}(s_t, a_t) := Q^{\pi,\gamma}(s_t, a_t) - V^{\pi,\gamma}(s_t) $$
$$ V^{\pi,\gamma}(s_t) := E_{s_{t+1:\infty}, a_{t:\infty}} [\sum_{l=0}^\infty \gamma^l r_{t+l}] $$
$$ Q^{\pi,\gamma}(s_t, a_t) := E_{s_{t+1:\infty}, a_{t+1:\infty}} [\sum_{l=0}^\infty \gamma^l r_{t+l}] $$

### GAE
The temporal-difference (TD) residual of an approximate value function $V$ is:
$$ \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t) $$


<center>
    <img src="./images/gae.webp" alt="example">
</center>

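GAE is the exponentially weighted sum of the TD residuals, where $\lambda \in [0, 1]$ trades off bias against variance:
$$ \hat A_t ^{GAE(\gamma, \lambda)} = \sum_{l=0}^\infty (\gamma \lambda)^l \delta_{t+l}^V $$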
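
For a finite trajectory the GAE sum can be computed with a backward recursion, $\hat A_t = \delta_t^V + \gamma \lambda \hat A_{t+1}$. The sketch below assumes the sum is truncated at the end of the trajectory and that `values` holds one extra entry for the state after the last step (0 if that state is terminal). The function name `gae_advantages` is hypothetical.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE(gamma, lambda) advantages for one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), i.e. len(rewards) + 1 entries
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Backward recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
zero_values = [0.0, 0.0, 0.0, 0.0]
# With gamma = lam = 1 and a zero value function, GAE is the reward-to-go.
adv_full = gae_advantages(rewards, zero_values, gamma=1.0, lam=1.0)
# With lam = 0, GAE is the one-step TD residual.
adv_td = gae_advantages(rewards, zero_values, gamma=1.0, lam=0.0)
print(adv_full, adv_td)
```

With $\lambda = 1$ the estimate is the discounted return minus the baseline $V(s_t)$ (low bias, high variance); with $\lambda = 0$ it is the one-step TD residual (higher bias, lower variance).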

## Code

In [2]:
from implementation.ppo import PPOAgent
from implementation.util import plot_graph

def main():   
    agent = PPOAgent('CartPole-v1', gamma=0.98, lam=1, epsilon=0.2)
    rewards = agent.train(iterations=1000, num_traj=1)
    # plot_graph(rewards)

main()

ITERATION: 1 AVG REWARD: 15.0
ITERATION: 11 AVG REWARD: 14.0
ITERATION: 21 AVG REWARD: 25.0
ITERATION: 31 AVG REWARD: 12.0
ITERATION: 41 AVG REWARD: 13.0
ITERATION: 51 AVG REWARD: 50.0
ITERATION: 61 AVG REWARD: 36.0
ITERATION: 71 AVG REWARD: 26.0
ITERATION: 81 AVG REWARD: 10.0
ITERATION: 91 AVG REWARD: 9.0
ITERATION: 101 AVG REWARD: 10.0
ITERATION: 111 AVG REWARD: 9.0
ITERATION: 121 AVG REWARD: 9.0
ITERATION: 131 AVG REWARD: 10.0
ITERATION: 141 AVG REWARD: 10.0
ITERATION: 151 AVG REWARD: 9.0
ITERATION: 161 AVG REWARD: 10.0
ITERATION: 171 AVG REWARD: 9.0
ITERATION: 181 AVG REWARD: 10.0
ITERATION: 191 AVG REWARD: 10.0
ITERATION: 201 AVG REWARD: 10.0
ITERATION: 211 AVG REWARD: 10.0
ITERATION: 221 AVG REWARD: 12.0
ITERATION: 231 AVG REWARD: 10.0
ITERATION: 241 AVG REWARD: 9.0
ITERATION: 251 AVG REWARD: 10.0
ITERATION: 261 AVG REWARD: 10.0
ITERATION: 271 AVG REWARD: 9.0
ITERATION: 281 AVG REWARD: 10.0
ITERATION: 291 AVG REWARD: 33.0
ITERATION: 301 AVG REWARD: 11.0
ITERATION: 311 AVG REWARD:

KeyboardInterrupt: 