In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import torch
import numpy as np

## Policy Gradient Algorithm

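Policy gradient methods optimize the policy directly: we compute the gradient of the expected reward with respect to the policy parameters $\theta$ and use it to update $\theta$. The policy is a distribution over actions given the current state:

$$\pi_{\theta}(a|s)$$

Following the policy produces a trajectory $\tau$, which is one path of states and actions: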

$$s_1 \rightarrow a_1 \rightarrow  s_2 \rightarrow a_2 \rightarrow ... \rightarrow s_{T-1} \rightarrow a_{T-1} \rightarrow s_T \rightarrow a_T$$

The probability of such a chain depends on both the transition probability and the policy:
1. $p(s_{t+1}|s_t,a_t)$
2. $\pi_{\theta}(a_t|s_t)$

As a result, heuristically, the probability law of a chain can be written as
$$
\begin{equation}
p_{\theta}(\tau)= p(s_1) \prod^T_{t=1} \pi_{\theta}(a_{t}|s_{t}) p( s_{t+1} |s_{t}, a_{t} )
\end{equation}
$$
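The factorization above can be made concrete with a small tabular sketch. The function name `sample_trajectory`, the array layout, and the toy numbers are assumptions made for illustration; they are not part of the original notes. The sketch samples one trajectory and accumulates $\log p_{\theta}(\tau)$ term by term.

```python
import numpy as np

def sample_trajectory(p0, P, pi, T, rng):
    """Sample one trajectory and return it together with log p_theta(tau).

    p0: (S,)      initial state distribution p(s_1)
    P:  (S, A, S) transition probabilities p(s' | s, a)
    pi: (S, A)    policy probabilities pi_theta(a | s)
    """
    n_states, n_actions = pi.shape
    s = rng.choice(n_states, p=p0)
    log_prob = np.log(p0[s])                    # log p(s_1)
    states, actions = [], []
    for _ in range(T):
        a = rng.choice(n_actions, p=pi[s])
        states.append(s)
        actions.append(a)
        log_prob += np.log(pi[s, a])            # log pi_theta(a_t | s_t)
        s_next = rng.choice(n_states, p=P[s, a])
        log_prob += np.log(P[s, a, s_next])     # log p(s_{t+1} | s_t, a_t)
        s = s_next
    return states, actions, log_prob

# Toy example (hypothetical numbers): 2 states, 2 actions, everything uniform.
rng = np.random.default_rng(0)
p0 = np.full(2, 0.5)
P = np.full((2, 2, 2), 0.5)
pi = np.full((2, 2), 0.5)
states, actions, log_prob = sample_trajectory(p0, P, pi, T=3, rng=rng)
```

In this uniform toy case every factor equals $1/2$, so the log-probability is the same for every trajectory of a given length.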

In this case, the objective function can be put in the following form: 
$$
\begin{equation}
J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \Big[\sum^T_{t=1} r(s_t, a_t)\Big]
\end{equation}
$$
and the goal now is to find the $\theta^*$ that maximizes this expected reward. 

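Next, we take the derivative with respect to the parameter $\theta$. Writing $r(\tau) := \sum^T_{t=1} r(s_t, a_t)$ for the total reward of a trajectory and using the identity $\nabla_{\theta} p_{\theta}(\tau) = p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau)$, we get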
$$
\begin{align}
\nabla_{\theta} J(\theta) &= \int \nabla_{\theta} p_{\theta}(\tau) r(\tau) d\tau \\
&= \int p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau) \, r(\tau) d \tau \\ 
&= \int  p_{\theta}(\tau) r(\tau) \nabla_{\theta}\big\lbrace \log p(s_1)+ \sum^T_{t=1}[\log \pi_{\theta}(a_t|s_t) +\log p(s_{t+1}|s_t,a_t)] \big\rbrace d\tau \\
&= \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\Big[\big(\sum^T_{t=1} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \big) \big(\sum^T_{t=1} r(s_t,a_t)\big)\Big]\\
& \approx \frac{1}{N} \sum^N_{i=1} \big \lbrace \big(\sum^T_{t=1} \nabla_{\theta} \log \pi_{\theta}(a^i_t | s^i_t) \big) \big(\sum^T_{t=1} r(s^i_t,a^i_t)\big) \big \rbrace
\end{align}
$$
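The fourth line holds because $p(s_1)$ and $p(s_{t+1}|s_t,a_t)$ do not depend on $\theta$, so their gradients vanish. Since we are maximizing $J(\theta)$, the next step is a gradient ascent update with step size $\alpha$:
$$
\begin{align}
\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)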
\end{align}
$$

Thus, the algorithm is natural: sample a set of $N$ trajectories $\lbrace \tau^i \rbrace$ from the current policy $\pi_{\theta}(a_t|s_t)$, estimate the gradient, update the parameters, and repeat. 
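As a minimal sketch of the sampled gradient estimate, one can build a scalar surrogate whose autograd gradient equals the Monte Carlo estimator. The function name `reinforce_surrogate`, the state-independent softmax policy, and the hand-written actions and rewards are illustrative assumptions, not the notes' own implementation.

```python
import torch

def reinforce_surrogate(log_probs, rewards):
    """log_probs, rewards: tensors of shape (N, T).

    Returns a scalar whose gradient w.r.t. theta is
    (1/N) sum_i (sum_t grad log pi(a_t^i | s_t^i)) * (sum_t r(s_t^i, a_t^i)).
    """
    total_log_prob = log_probs.sum(dim=1)        # sum_t log pi(a_t | s_t)
    total_reward = rewards.sum(dim=1).detach()   # r(tau), constant w.r.t. theta
    return (total_log_prob * total_reward).mean()

# Assumed toy policy: a softmax over 2 actions that ignores the state.
theta = torch.zeros(2, requires_grad=True)
actions = torch.tensor([[0, 0], [1, 0]])             # N=2 trajectories, T=2 steps
rewards = torch.tensor([[1.0, 2.0], [0.0, 1.0]])
log_probs = torch.log_softmax(theta, dim=0)[actions]  # shape (N, T)

surrogate = reinforce_surrogate(log_probs, rewards)
grad = torch.autograd.grad(surrogate, theta)[0]

# Gradient ascent step: theta <- theta + alpha * grad J
alpha = 0.1
with torch.no_grad():
    theta += alpha * grad
```

In practice the log-probabilities would come from a neural network policy evaluated on sampled states, and the update would usually be delegated to an optimizer applied to the negated surrogate.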

We note that this formulation follows a variational approach and does not rely on the Bellman equation.




### Variance Reduction

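Because the future cannot affect the past (the world is causal), the action taken at time $t$ cannot influence rewards received before $t$. We can therefore drop those earlier rewards and weight each term only by the rewards from time $t$ onward: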
$$
\begin{align}
\nabla_{\theta} J(\theta) 
&= \mathbb{E}_{\tau \sim p_{\theta}(\tau)}[\sum^T_{t=1} (\nabla_{\theta} \log \pi_{\theta}(a_t | s_t)  \sum^T_{{t'}=t} r(s_{t'},a_{t'}))]\\
& \approx \frac{1}{N} \sum^N_{i=1}  \sum^T_{t=1} \big \lbrace \nabla_{\theta} \log \pi_{\theta}(a^i_t | s^i_t)  \sum^T_{{t'}=t} r(s^i_{t'},a^i_{t'}) \big \rbrace
\end{align}
$$

We then define the reward-to-go of trajectory $i$ from time $t$ as
$$Q^i_t:=\sum^T_{{t'}=t} r(s^i_{t'},a^i_{t'})$$
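The reward-to-go is a reversed cumulative sum over each trajectory. A minimal NumPy sketch follows; the function name `reward_to_go` and the `(N, T)` array layout are assumptions for illustration.

```python
import numpy as np

def reward_to_go(rewards):
    """rewards: array of shape (N, T).

    Returns Q with Q[i, t] = sum over t' >= t of rewards[i, t'].
    """
    rewards = np.asarray(rewards, dtype=float)
    # Reverse time, accumulate, then reverse back.
    return np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]

q = reward_to_go([[1.0, 2.0, 3.0]])
```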


## Regularization 

This is to ensure that the policy that we learn does not collapse to a single strategy.

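The regularization term is the entropy of a distribution $p$, defined as 
$$H(p)= -\sum_x p(x) \log p(x)$$

Here $p$ is the policy's action distribution $\pi_{\theta}(\cdot|s^i_t)$, and higher entropy means the distribution is more spread out over actions. Adding the entropy as a bonus with weight $\beta > 0$, and discounting future rewards by a factor $\gamma \in (0,1]$, gives the following loss to minimize: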

$$\begin{align}
L(\theta):=-\frac{1}{N} \sum^N_{i=1}  \sum^T_{t=1} \bigg[ \log \pi_{\theta}(a^i_t | s^i_t)  \sum^T_{{t'}=t} \gamma^{t'-t}r(s^i_{t'},a^i_{t'}) -\beta \sum_{a} \pi_{\theta}(a|s^i_t)\log\pi_{\theta}(a|s^i_t) \bigg]
\end{align}$$
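A hedged PyTorch sketch of an entropy-regularized loss of this kind is given below. It assumes a discrete action space and uses the log-probability (not its gradient) inside the loss, so that autograd produces the policy gradient. The function names, tensor shapes, and default values of `gamma` and `beta` are illustrative assumptions.

```python
import torch

def discounted_reward_to_go(rewards, gamma):
    """rewards: (N, T). Returns sum_{t' >= t} gamma^(t' - t) * r_{t'} per step."""
    n, horizon = rewards.shape
    out = torch.zeros_like(rewards)
    running = torch.zeros(n, dtype=rewards.dtype)
    for t in reversed(range(horizon)):
        running = rewards[:, t] + gamma * running
        out[:, t] = running
    return out

def entropy_regularized_loss(logits, actions, rewards, gamma=0.99, beta=0.01):
    """logits: (N, T, A) policy outputs, actions: (N, T) long, rewards: (N, T)."""
    log_pi = torch.log_softmax(logits, dim=-1)                      # (N, T, A)
    chosen = log_pi.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # (N, T)
    q = discounted_reward_to_go(rewards, gamma)                     # constant w.r.t. theta
    entropy = -(log_pi.exp() * log_pi).sum(dim=-1)                  # (N, T)
    # Minimizing this loss maximizes reward-weighted log-likelihood plus entropy.
    return -((chosen * q + beta * entropy).sum(dim=1)).mean()

# Toy check with a uniform policy over 2 actions (hypothetical numbers).
logits = torch.zeros(1, 2, 2, requires_grad=True)
actions = torch.tensor([[0, 1]])
rewards = torch.tensor([[1.0, 1.0]])
loss = entropy_regularized_loss(logits, actions, rewards, gamma=0.5, beta=1.0)
loss.backward()
```

A larger `beta` pushes the policy toward more uniform action probabilities, which is the mechanism that discourages collapse to a single strategy.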


A further way to reduce the variance of the sampled gradient is to subtract a baseline $b$ from the reward: 
$$
\begin{align}
\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\Big[ \sum^T_{t=1} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \big(r(\tau)-b(s_t)\big)\Big]
\end{align}
$$
Note that $b(\cdot)$ must be a function of the state only, not of the action; otherwise the gradient estimate becomes biased. 
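To see why a state-only baseline leaves the gradient unbiased, consider a single time step with a discrete action space (an assumption made here for simplicity; in the continuous case the sum becomes an integral):

$$
\begin{align}
\mathbb{E}_{a_t \sim \pi_{\theta}(\cdot|s_t)}\big[\nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\, b(s_t)\big]
&= b(s_t) \sum_{a} \pi_{\theta}(a|s_t) \nabla_{\theta} \log \pi_{\theta}(a|s_t) \\
&= b(s_t) \sum_{a} \nabla_{\theta} \pi_{\theta}(a|s_t) \\
&= b(s_t) \nabla_{\theta} \sum_{a} \pi_{\theta}(a|s_t) = b(s_t) \nabla_{\theta} 1 = 0
\end{align}
$$

The first step pulls $b(s_t)$ out of the expectation, which is only possible because it does not depend on $a_t$.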

We note that REINFORCE and all its variants are on-policy algorithms, so trajectories collected before a policy update are stale and cannot be reused after it. 

Consider the advantage function 

$$\hat{Q}(s^i_t,a^i_t) -b(s^i_t)$$
Before, we used the Monte Carlo reward-to-go $Q^i_t:=\sum^T_{{t'}=t} r(s^i_{t'},a^i_{t'})$. Alternatively, we can do a one-step rollout and bootstrap from a value function $V$, taking $$Q(s^i_t,a^i_t) =r(s^i_t,a^i_t)+V(s^i_{t+1})$$ With the baseline $b = V$, we then have two approaches: 
1. MC approach $$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum^N_{i=1}  \sum^T_{t=1} \bigg \lbrace \nabla_{\theta} \log \pi_{\theta}(a^i_t | s^i_t)  \big[\sum^T_{{t'}=t}  r(s^i_{t'},a^i_{t'}) - V(s^i_t) \big] \bigg \rbrace$$
2. TD approach $$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum^N_{i=1}  \sum^T_{t=1} \bigg \lbrace \nabla_{\theta} \log \pi_{\theta}(a^i_t | s^i_t)  \big[ r(s^i_{t},a^i_{t}) + V(s^i_{t+1})- V(s^i_{t})\big] \bigg \rbrace$$
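The two bracketed advantage terms above can be compared on a single trajectory with a short NumPy sketch. The function names are hypothetical, and the value after the final step, `last_value`, is assumed to be zero (a terminal state), which the notes do not specify.

```python
import numpy as np

def mc_advantage(rewards, values):
    """Monte Carlo advantage: reward-to-go minus V(s_t).

    rewards: (T,), values: (T,) with values[t] = V(s_t).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    q = np.cumsum(rewards[::-1])[::-1]
    return q - values

def td_advantage(rewards, values, last_value=0.0):
    """One-step TD advantage: r_t + V(s_{t+1}) - V(s_t).

    last_value is the assumed value of the state reached after the final step.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], last_value)
    return rewards + next_values - values

rewards = [1.0, 1.0, 1.0]
values = [2.0, 2.0, 0.0]      # hypothetical critic outputs
adv_mc = mc_advantage(rewards, values)
adv_td = td_advantage(rewards, values)
```

The MC version uses only observed rewards and so does not inherit errors from $V$ beyond the baseline, while the TD version relies on $V(s_{t+1})$ and typically has lower variance at the cost of bias when $V$ is inaccurate.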

## Algorithm
Approximate $\pi_{\theta}(a|s)$ and $V_{\phi}(s)$ using two different neural networks. 

Loop: 
1. Sample N trajectories from the current policy $\pi_{\theta}(a_t|s_t)$. 
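2. Calculate the reward-to-go $\hat{Q}^i_t=\sum^T_{{t'}=t}  r(s^i_{t'},a^i_{t'})$ and fit the value network $V_{\phi}(s)$ to it with an L2 loss $L(\phi)=\frac{1}{N}\sum^N_{i=1}\sum^T_{t=1}\big(V_{\phi}(s^i_t)-\hat{Q}^i_t\big)^2$, taking a gradient descent step with learning rate $\beta$: 

$$\phi \leftarrow \phi - \beta \nabla_{\phi} L(\phi)$$
3. Compute the surrogate objective, an advantage-weighted log-likelihood (in code this is often implemented as a weighted cross-entropy loss with the sign flipped), treating the bracketed advantage as a constant with respect to $\theta$: 
$$J(\theta):=\frac{1}{N} \sum^N_{i=1}  \sum^T_{t=1} \bigg \lbrace \log \pi_{\theta}(a^i_t | s^i_t)  \big[ r(s^i_{t},a^i_{t}) + V_{\phi}(s^i_{t+1})- V_{\phi}(s^i_{t})\big] \bigg \rbrace$$
4. Perform gradient ascent on $\theta$: 
$$\theta \leftarrow \theta+ \alpha \nabla_{\theta} J(\theta) $$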

We note that entropy regularization can also be applied within this framework.
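One pass of the value-fitting and policy-update steps of the loop above can be sketched in PyTorch as follows. This is a sketch under several assumptions that are not in the original notes: the network sizes, the SGD optimizers, the `dones` mask for terminal transitions, averaging over a flat batch of transitions rather than summing over $t$, and the random placeholder data that stands in for trajectories sampled from an environment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
obs_dim, n_actions = 4, 2  # hypothetical sizes

# Two separate networks: pi_theta (action logits) and V_phi (scalar value).
policy = nn.Sequential(nn.Linear(obs_dim, 16), nn.Tanh(), nn.Linear(16, n_actions))
value = nn.Sequential(nn.Linear(obs_dim, 16), nn.Tanh(), nn.Linear(16, 1))
policy_opt = torch.optim.SGD(policy.parameters(), lr=1e-2)  # step size alpha
value_opt = torch.optim.SGD(value.parameters(), lr=1e-2)    # step size beta

def actor_critic_update(states, actions, rewards, next_states, dones, returns):
    """One update on a flat batch of B transitions.

    states, next_states: (B, obs_dim); actions: (B,) long;
    rewards, dones, returns: (B,), where returns holds the reward-to-go.
    """
    # Fit V_phi to the reward-to-go with an L2 loss.
    value_loss = ((value(states).squeeze(-1) - returns) ** 2).mean()
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    # TD advantage r + V(s') - V(s), held constant w.r.t. theta.
    with torch.no_grad():
        v = value(states).squeeze(-1)
        v_next = value(next_states).squeeze(-1) * (1.0 - dones)
        advantage = rewards + v_next - v

    # Minimizing -J(theta) is gradient ascent on J(theta).
    log_pi = torch.log_softmax(policy(states), dim=-1)
    chosen = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(chosen * advantage).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
    return value_loss.item(), policy_loss.item()

# Placeholder batch (random data, for shape illustration only).
B = 8
states = torch.randn(B, obs_dim)
next_states = torch.randn(B, obs_dim)
actions = torch.randint(0, n_actions, (B,))
rewards = torch.randn(B)
dones = torch.zeros(B)
returns = torch.randn(B)
value_loss, policy_loss = actor_critic_update(
    states, actions, rewards, next_states, dones, returns
)
```

Because the method is on-policy, a real training loop would discard the batch after this update and sample fresh trajectories from the updated policy.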

## Combining Policy Gradient and Q-learning 


1. Q-learning can be unstable. However, it is off-policy, so sampled transitions can be reused multiple times. 
2. Learning the policy directly gives better convergence guarantees. However, policy gradient methods are on-policy.

Three methods are in scope: deep deterministic policy gradient (DDPG), twin delayed DDPG (TD3), and soft actor-critic (SAC). 