%matplotlib inline
import matplotlib.pyplot as plt
import torch
import numpy as np

## Policy Gradient Algorithm

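Policy gradient methods optimize the policy directly: we compute the gradient of the expected return with respect to the policy parameters $\theta$ and follow it. 

$$\pi_{\theta}(a|s)$$

Running the policy in the environment produces a trajectory $\tau$, i.e. one path of states and actions: 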

$$s_1 \rightarrow a_1 \rightarrow  s_2 \rightarrow a_2 \rightarrow ... \rightarrow s_{T-1} \rightarrow a_{T-1} \rightarrow s_T \rightarrow a_T$$

Such a chain depends on both the transition probability and the policy:
1. $p(s_{t+1}|s_t,a_t)$
2. $\pi_{\theta}(a_t|s_t)$

As a result, heuristically, the probability law of a trajectory can be written in the following fashion
$$
\begin{equation}
p_{\theta}(\tau)= p(s_1) \prod^T_{t=1} \pi_{\theta}(a_{t}|s_{t}) p( s_{t+1} |a_{t}, s_{t} )
\end{equation}
$$
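
As a minimal sketch of this factorization, assume a small tabular MDP. The arrays `p0` (initial distribution), `P` (transition tensor indexed as `P[s, a, s_next]`) and `pi` (policy table indexed as `pi[s, a]`) are hypothetical names introduced only for illustration. The log-probability of one trajectory is then a sum of log-terms:

```python
import numpy as np

def trajectory_log_prob(p0, P, pi, states, actions):
    """Log-probability of one trajectory in a tabular MDP.

    states has length T+1 (it includes the state reached after the last
    action) and actions has length T.
    """
    logp = np.log(p0[states[0]])
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        # policy term plus transition term for step t
        logp += np.log(pi[s, a]) + np.log(P[s, a, s_next])
    return logp

# Hypothetical 2-state, 2-action example.
p0 = np.array([0.5, 0.5])
pi = np.array([[0.5, 0.5], [0.25, 0.75]])
P = np.full((2, 2, 2), 0.5)
logp = trajectory_log_prob(p0, P, pi, states=[0, 1, 1], actions=[1, 0])
```

Only the `pi` terms depend on $\theta$, which is why the transition model drops out once we differentiate the log-probability.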

In this case, the objective function can be put in the following form: 
$$
\begin{equation}
J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \Big[\sum^T_{t=1} r(s_t, a_t)\Big]
\end{equation}
$$
and the goal now is to find the $\theta^*$ that maximizes this expected reward. 

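Now we take the derivative with respect to the parameter $\theta$. We write $r(\tau) := \sum^T_{t=1} r(s_t, a_t)$, use the identity $\nabla_{\theta} p_{\theta} = p_{\theta} \nabla_{\theta} \log p_{\theta}$, and note that $p(s_1)$ and the transition probabilities do not depend on $\theta$: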
$$
\begin{align}
\nabla_{\theta} J(\theta) &= \int \nabla_{\theta} p_{\theta}(\tau) r(\tau) d\tau \\
&= \int p_{\theta}(\tau) (\nabla_{\theta} \log p_{\theta}(\tau) r(\tau)) d \tau \\ 
&= \int  p_{\theta}(\tau) r(\tau) \nabla_{\theta}\Big\lbrace \log p(s_1)+ \sum^T_{t=1}[\log \pi_{\theta}(a_t|s_t) +\log p(s_{t+1}|s_t,a_t)] \Big\rbrace d\tau \\
&= \mathbb{E}_{\tau \sim p_{\theta}(\tau)}[(\sum^T_{t=1} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) ) (\sum^T_{t=1} r(s_t,a_t))]\\
& \approx \frac{1}{N} \sum^N_{i=1} \big \lbrace (\sum^T_{t=1} \nabla_{\theta} \log \pi_{\theta}(a^i_t | s^i_t) ) (\sum^T_{t=1} r(s^i_t,a^i_t)) \big \rbrace
\end{align}
$$
And of course, the next step is a gradient ascent update, since we are maximizing $J$: 
$$
\begin{align}
\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)
\end{align}
$$

Thus, the algorithm is natural: sample a set of $N$ trajectories $\tau^i$ from the current policy $\pi_{\theta}(a_t|s_t)$, estimate the gradient, update the parameters, and iterate. 

We comment that this formulation follows a variational approach and does not depend on Bellman's formulation.
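
The sample-and-update loop can be sketched on the simplest possible case. The example below is an illustration under stated assumptions, not the notebook's implementation: a single-state problem with horizon $T=1$ (a bandit), a softmax policy whose logits are $\theta$, and deterministic rewards. For a softmax policy the score is $\nabla_{\theta} \log \pi_{\theta}(a) = e_a - \pi_{\theta}$, which is used directly instead of automatic differentiation. The names `reinforce_bandit`, `n_samples` and `alpha` are hypothetical.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def reinforce_bandit(rewards, n_iters=200, n_samples=64, alpha=0.1, seed=0):
    """Score-function gradient ascent on a one-step problem."""
    rng = np.random.default_rng(seed)
    n_actions = len(rewards)
    theta = np.zeros(n_actions)
    for _ in range(n_iters):
        pi = softmax(theta)
        # sample N one-step "trajectories" from the current policy
        actions = rng.choice(n_actions, size=n_samples, p=pi)
        # score of each sampled action: one-hot(a) - pi
        scores = np.eye(n_actions)[actions] - pi
        # Monte Carlo estimate of grad J: mean of score * return
        grad = (scores * rewards[actions][:, None]).mean(axis=0)
        theta = theta + alpha * grad  # gradient ascent step
    return softmax(theta)

probs = reinforce_bandit(np.array([0.0, 1.0, 0.2]))
```

After training, most of the probability mass should sit on the action with the highest reward.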




### Variance Reduction

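Noticing that the action taken at time $t$ cannot affect rewards earned before time $t$ (the world is causal), we drop those past rewards from the weight on $\nabla_{\theta} \log \pi_{\theta}(a_t|s_t)$ and make the following changes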
$$
\begin{align}
\nabla_{\theta} J(\theta) 
&= \mathbb{E}_{\tau \sim p_{\theta}(\tau)}[\sum^T_{t=1} (\nabla_{\theta} \log \pi_{\theta}(a_t | s_t)  \sum^T_{{t'}=t} r(s_{t'},a_{t'}))]\\
& \approx \frac{1}{N} \sum^N_{i=1}  \sum^T_{t=1} \big \lbrace \nabla_{\theta} \log \pi_{\theta}(a^i_t | s^i_t)  \sum^T_{{t'}=t} r(s^i_{t'},a^i_{t'}) \big \rbrace
\end{align}
$$

We denote this reward-to-go by 
$$Q^i_t:=\sum^T_{{t'}=t} r(s^i_{t'},a^i_{t'})$$
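
A short sketch of how $Q^i_t$ can be computed for one trajectory: the reward-to-go is a cumulative sum taken from the end of the reward sequence. The function name `reward_to_go` is an illustrative choice.

```python
import numpy as np

def reward_to_go(rewards):
    """Q_t = sum of rewards from step t to the end of the trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    # reverse, accumulate, reverse back
    return np.cumsum(rewards[::-1])[::-1]

q_hat = reward_to_go([1.0, 0.0, 2.0, 3.0])  # array([6., 5., 5., 3.])
```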


## Regularization 

This is to ensure that the policy that we learn does not collapse to a single strategy.

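The regularization term is the entropy 
$$H(p)= - \sum_x p(x) \log p(x)$$

In this case, $p(\cdot):= \pi_{\theta}(\cdot|s^i_t)$ is the action distribution at state $s^i_t$, and higher entropy means the distribution is more spread out. With entropy weight $\beta$ and discount factor $\gamma$, the loss to minimize is given below. The discounted reward-to-go is treated as a constant weight, so that $-\nabla_{\theta} L$ is the policy gradient plus $\beta$ times the entropy gradient. 

$$\begin{align}
L(\theta):=-\frac{1}{N} \sum^N_{i=1} \sum^T_{t=1} \bigg[ \log \pi_{\theta}(a^i_t | s^i_t)  \sum^T_{{t'}=t} \gamma^{t'-t}r(s^i_{t'},a^i_{t'}) -\beta \sum_{a} \pi_{\theta}(a|s^i_t)\log\pi_{\theta}(a|s^i_t) \bigg]
\end{align}$$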
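
To make the entropy term concrete, here is a small sketch that evaluates $H$ for a discrete action distribution. The function name and the `eps` guard against $\log 0$ are assumptions added for illustration.

```python
import numpy as np

def policy_entropy(probs, eps=1e-12):
    """Entropy -sum_a p(a) log p(a) of a discrete action distribution."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + eps), axis=-1)

uniform = policy_entropy([0.25, 0.25, 0.25, 0.25])  # close to log(4)
peaked = policy_entropy([0.97, 0.01, 0.01, 0.01])   # much smaller
```

A nearly deterministic policy has low entropy, so the bonus penalizes exactly the collapse described above.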


To reduce the variance of the gradient estimate further, so that sampling is more effective, we subtract a baseline $b$ from the return: 
$$
\begin{align}
\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\Big[ \sum^T_{t=1} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \big(r(\tau)-b(s_t)\big)\Big]
\end{align}
$$
Notice that $b(\cdot)$ should only be a function of the state variable. Then $\mathbb{E}_{a_t \sim \pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(a_t|s_t)]\, b(s_t) = 0$, so the baseline leaves the gradient unbiased. 
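
The reason a state-only baseline is allowed can be checked numerically: the expected score under the policy is zero, so any term that does not depend on the action averages out. The sketch below assumes a softmax policy over three actions with arbitrary, hypothetical logits.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

theta = np.array([0.3, -1.2, 0.8])  # hypothetical logits
pi = softmax(theta)

# Row a holds the score of action a: grad_theta log pi(a) = e_a - pi
scores = np.eye(len(theta)) - pi

# Expectation over a ~ pi of the score; zero up to rounding
expected_score = pi @ scores

# Hence E[score * b] = b * E[score] = 0 for any action-independent b
baseline = 7.5
baseline_term = baseline * expected_score
```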

We comment that REINFORCE and all of its variations are on-policy algorithms, so trajectories generated before a policy update are stale and cannot be reused after it. 

Consider the advantage estimate 

$$\hat{Q}(s^i_t,a^i_t) -b(s^i_t)$$

Before, we used the Monte Carlo estimate $\hat{Q}^i_t:=\sum^T_{{t'}=t} r(s^i_{t'},a^i_{t'})$. Now we can instead do a one-step rollout and bootstrap from a value function, taking 

$$\hat{Q}(s^i_t,a^i_t) =r(s^i_t,a^i_t)+V(s^i_{t+1})$$

With the baseline $b = V$, we have two approaches: 
1. MC approach $$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum^N_{i=1}  \sum^T_{t=1} \bigg \lbrace \nabla_{\theta} \log \pi_{\theta}(a^i_t | s^i_t)  \big[\sum^T_{{t'}=t}  r(s^i_{t'},a^i_{t'}) - V(s^i_t) \big] \bigg \rbrace$$
2. TD approach $$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum^N_{i=1}  \sum^T_{t=1} \bigg \lbrace \nabla_{\theta} \log \pi_{\theta}(a^i_t | s^i_t)  \big[ r(s^i_{t},a^i_{t}) + V(s^i_{t+1})- V(s^i_{t})\big] \bigg \rbrace$$
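
The two bracketed weights can be compared side by side. This is a minimal sketch for a single trajectory; it assumes `values` holds $V(s_1), \dots, V(s_{T+1})$, i.e. one more entry than `rewards`, with the last entry being the value of the state reached after the final action (zero if that state is terminal).

```python
import numpy as np

def mc_advantage(rewards, values):
    """Reward-to-go minus V(s_t)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    returns = np.cumsum(rewards[::-1])[::-1]
    return returns - values[:-1]

def td_advantage(rewards, values):
    """One-step TD error r_t + V(s_{t+1}) - V(s_t)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + values[1:] - values[:-1]

rewards = [1.0, 1.0, 1.0]
values = [2.5, 2.0, 0.5, 0.0]          # hypothetical critic outputs
adv_mc = mc_advantage(rewards, values)  # array([0.5, 0. , 0.5])
adv_td = td_advantage(rewards, values)  # array([ 0.5, -0.5,  0.5])
```

The MC weight uses the whole observed tail of the trajectory, while the TD weight relies on the critic after a single step, trading variance for bias.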

## Algorithm
Approximate $\pi_{\theta}(a|s)$ and $V_{\phi}(s)$ using two different neural networks. 

Loop: 
1. Sample N trajectories from the current policy $\pi_{\theta}(a_t|s_t)$. 
2. Calculate the reward-to-go $\hat{Q}^i_t=\sum^T_{{t'}=t}  r(s^i_{t'},a^i_{t'})$ and fit the value network $V_{\phi}(s)$ to it with the L2 loss $L(\phi)=\frac{1}{N}\sum^N_{i=1}\sum^T_{t=1}\big(V_{\phi}(s^i_t)-\hat{Q}^i_t\big)^2$, taking gradient steps with learning rate $\beta$: 

$$\phi \leftarrow \phi - \beta \nabla_{\phi} L(\phi)$$
3. Compute the policy objective (the negative of an advantage-weighted cross-entropy loss), with the TD advantage treated as a constant weight: 
$$J(\theta):=\frac{1}{N} \sum^N_{i=1}  \sum^T_{t=1} \bigg \lbrace \log \pi_{\theta}(a^i_t | s^i_t)  \big[ r(s^i_{t},a^i_{t}) + V_{\phi}(s^i_{t+1})- V_{\phi}(s^i_{t})\big] \bigg \rbrace$$
Perform gradient ascent on $\theta$: 
$$\theta \leftarrow \theta+ \alpha \nabla_{\theta} J(\theta) $$

We comment that entropic regularization can also be applied to this current framework.
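
The loop above can be sketched end to end on a toy problem. Everything concrete here is an assumption made for illustration: a three-state chain with a goal state, a reward of $-1$ per step, tabular softmax logits in place of the policy network, a table `V` in place of $V_{\phi}$, and per-sample gradient steps on the L2 loss. It is not the notebook's implementation.

```python
import numpy as np

N_STATES, GOAL, HORIZON = 3, 3, 20  # states 0..2 are non-terminal, 3 is the goal

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step(s, a):
    """Action 0 moves left (staying at 0), action 1 moves right."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    return s_next, -1.0, s_next == GOAL

def sample_trajectory(theta, rng):
    s, traj = 0, []
    for _ in range(HORIZON):
        a = int(rng.random() < softmax(theta[s])[1])
        s_next, r, done = step(s, a)
        traj.append((s, a, r, s_next))
        if done:
            break
        s = s_next
    return traj

def train(n_iters=300, n_traj=8, alpha=0.1, beta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, 2))  # policy logits
    V = np.zeros(N_STATES + 1)       # critic table; V[GOAL] stays 0
    for _ in range(n_iters):
        # 1. sample trajectories from the current policy
        batch = [sample_trajectory(theta, rng) for _ in range(n_traj)]
        # 2. fit the critic to the reward-to-go (SGD on the squared error)
        for traj in batch:
            q = 0.0
            for s, a, r, s_next in reversed(traj):
                q += r
                V[s] += beta * (q - V[s])
        # 3. policy gradient weighted by the TD advantage
        grad = np.zeros_like(theta)
        for traj in batch:
            for s, a, r, s_next in traj:
                delta = r + V[s_next] - V[s]
                score = -softmax(theta[s])
                score[a] += 1.0  # grad of log pi(a|s) w.r.t. logits
                grad[s] += delta * score
        theta += alpha * grad / n_traj  # gradient ascent
    return theta, V

theta, V = train()
p_right = np.array([softmax(theta[s])[1] for s in range(N_STATES)])
```

On this chain the shortest path is to always move right, so `p_right` should end up above one half in every state.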

# Combining Policy Gradients and Q-Learning 


1. Q-learning can be unstable. However, it is off-policy, so stored transitions can be reused multiple times. 
2. Learning the policy directly gives much better convergence guarantees. However, policy gradient methods are on-policy.

Three methods are in scope: deep deterministic policy gradient (DDPG), twin delayed DDPG (TD3), and soft actor-critic (SAC). 

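#### Weaknesses of DQN 
1. It learns the action-value function instead of the policy. 
2. It regresses toward a moving target, because the target is computed from the network being trained. 
3. Convergence is not guaranteed. 
4. It can only handle discrete action spaces, because of the maximization over actions.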



#### Weaknesses of Policy Gradient
1. The learned policy can collapse into a bad region. (TRPO and PPO were designed to mitigate this.)
2. It is on-policy, so it is sample inefficient.  
3. In actor-critic variants the policy is the actor and the value network is the critic. The critic guides the actor, but learning is still on-policy: all transitions must be discarded after each update to the policy network.

## General framework
Assume that the policy network is parameterized by $\theta$ and outputs a deterministic action $a=\mu_{\theta}(s)$ (notice that in the control context, this is standard feedback control). The maximization over actions is then approximated by evaluating the policy at the next state:
$$\max_{a'}Q^*(s',a') \approx Q^*(s', \mu_{\theta}(s'))$$
The critic network is parameterized by $\phi$ and is evaluated as $Q_{\phi}(s,\mu_{\theta}(s))$.

One wants to take the action $a=\mu_{\theta}(s)$, but to explore we add a little noise: set $\epsilon \sim N(0, \sigma^2)$ and use the action $a+ \epsilon$ to explore the environment and generate samples. Because the critic is learned off-policy, these samples can be stored in a replay buffer and reused. 

The target network is updated by Polyak averaging (an exponential moving average of the online parameters): 
$$
\begin{equation}
\phi_{target} \leftarrow \rho \phi_{target}+(1-\rho) \phi
\end{equation}
$$
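
A minimal sketch of this update, assuming the parameters are stored as a dictionary of NumPy arrays (the dictionary layout and the function name are hypothetical):

```python
import numpy as np

def polyak_update(target_params, online_params, rho=0.995):
    """In-place update: target <- rho * target + (1 - rho) * online."""
    for name in target_params:
        target_params[name] = (
            rho * target_params[name] + (1.0 - rho) * online_params[name]
        )
    return target_params

target = {"w": np.zeros(2)}
online = {"w": np.ones(2)}
polyak_update(target, online, rho=0.9)  # w becomes 0.1
polyak_update(target, online, rho=0.9)  # w becomes 0.19
```

With $\rho$ close to 1 the target moves slowly, which is what keeps the regression targets from changing as fast as the online network.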

The standard Q-learning loss with a target network is 
$$
\begin{equation}
L(\phi, D)= \mathbb{E}_{(s,a,r,s',d)\sim D} \bigg[ \big( Q_{\phi}(s,a)-(r+\gamma(1-d) \max_{a'}Q_{\phi_{trg}}(s',a'))  \big)^2 \bigg]
\end{equation}
$$
We replace the maximization over actions with a dense neural network, the policy $\mu_{\theta}$: 

$$
\begin{equation}
L(\phi, D)= \mathbb{E}_{(s,a,r,s',d)\sim D} \bigg[ \bigg( Q_{\phi}(s,a)-\big (r+\gamma(1-d) Q_{\phi_{trg}}(s',\mu_{\theta}(s')) \big)  \bigg)^2 \bigg]
\end{equation}
$$
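
For a sampled batch, this loss reduces to a mean squared error against a bootstrapped target. The sketch below assumes the network outputs have already been computed and are passed in as arrays: `q_sa` stands for $Q_{\phi}(s,a)$ and `q_targ_next` for $Q_{\phi_{trg}}(s',\mu_{\theta}(s'))$. Both names are illustrative.

```python
import numpy as np

def critic_loss(q_sa, rewards, dones, q_targ_next, gamma=0.99):
    """Mean squared Bellman error for a batch of transitions."""
    # the (1 - d) factor removes the bootstrap term at terminal states
    y = rewards + gamma * (1.0 - dones) * q_targ_next
    return np.mean((q_sa - y) ** 2), y

loss, y = critic_loss(
    q_sa=np.array([2.0, 1.0]),
    rewards=np.array([1.0, 0.0]),
    dones=np.array([0.0, 1.0]),
    q_targ_next=np.array([2.0, 5.0]),
    gamma=0.5,
)
# y = [2.0, 0.0] and loss = 0.5
```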

This handles the value function. We also need an update rule for the policy. Since $\mu_{\theta}(s)$ is meant to approximate the maximizing action, we choose $\theta$ to maximize the critic's value of the policy's actions: 
$$\max_{\theta}J(\theta, D) = \max_{\theta} \mathbb{E}_{s\sim D} \big[ Q_{\phi} \Big( s, \mu_{\theta}(s) \Big ) \big]$$
The gradient follows from the chain rule: 
$$
\nabla_{\theta}J(\theta, D) = \mathbb{E}_{s\sim D} \big[ \nabla_a Q_{\phi} ( s, a ) |_{a=\mu_{\theta}(s)} \nabla_{\theta} \mu_{\theta} (s) \big]
$$
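
The chain rule can be checked on a case where every derivative is available in closed form. The sketch assumes a linear policy $\mu_{\theta}(s) = \theta^\top s$ with a scalar action and a hypothetical quadratic critic $Q(s,a) = -(a - w^\top s)^2$, whose maximizing action is $w^\top s$. Neither is part of the notebook.

```python
import numpy as np

def mu(theta, states):
    """Linear deterministic policy: one scalar action per state."""
    return states @ theta

def q_value(states, actions, w):
    """Hypothetical critic, maximized at a = w . s."""
    return -(actions - states @ w) ** 2

def dq_da(states, actions, w):
    return -2.0 * (actions - states @ w)

def dpg_gradient(theta, states, w):
    """E_s[ dQ/da at a = mu(s), times d mu / d theta ], where d mu / d theta = s."""
    actions = mu(theta, states)
    return np.mean(dq_da(states, actions, w)[:, None] * states, axis=0)

rng = np.random.default_rng(0)
states = rng.normal(size=(256, 2))
w = np.array([2.0, -1.0])

theta = np.zeros(2)
for _ in range(500):
    theta = theta + 0.05 * dpg_gradient(theta, states, w)  # gradient ascent
```

Gradient ascent drives `theta` toward `w`, the parameters for which the policy outputs the critic's maximizing action in every state.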

***
**Deep Deterministic Policy Gradient**
***
 
1. Input initial policy parameters $\theta$,  Q-function parameters $\phi$, empty replay buffer D

2. Set target parameters equal to online parameters $\theta_{targ} \leftarrow \theta$ and $\phi_{targ} \leftarrow \phi$

3. **repeat**

4. Observe state s and select action $a = \text{clip}(\mu_\theta(s)+\epsilon, a_{Low}, a_{High}), \text{where  } \epsilon \sim N(0, \sigma^2)$

5. Execute a in environment and observe next state s', reward r, and done signal d

6. Store `(s,a,r,s',d)` in Replay Buffer D

7. if `s'` is terminal state, reset the environment

8. if it's time to update **then**:

9. &emsp;&emsp;for as many updates as required:

10. &emsp;&emsp;&emsp;&emsp;Sample a batch B={`(s,a,r,s',d)`} from replay Buffer D:

11. &emsp;&emsp;&emsp;&emsp;Compute targets: $$y(r,s',d) = r + \gamma(1-d)Q_{\phi_{targ}}(s',\mu_{\theta_{targ}}(s'))$$

12. &emsp;&emsp;&emsp;&emsp;Update Q function with one step gradient descent on $\phi$: $$\nabla_\phi \frac{1}{|B|} \sum_{(s,a,r,s',d)\in B}(Q_\phi(s,a) - y(r,s',d))^2$$

13. &emsp;&emsp;&emsp;&emsp;Update Policy with one step gradient Ascent on $\theta$: $$\nabla_\theta \frac{1}{|B|} \sum_{s \in B} Q_\phi(s, \mu_\theta(s))$$

14. &emsp;&emsp;&emsp;&emsp;Update target networks using polyak averaging: $$\phi_{targ} \leftarrow \rho\phi_{targ} + (1-\rho)\phi$$  $$\theta_{targ} \leftarrow \rho\theta_{targ} + (1-\rho)\theta$$
***
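
Steps 4 to 10 of the pseudocode need a replay buffer and a noisy, clipped action. The following is a minimal NumPy sketch of both. The class `ReplayBuffer`, the helper `noisy_action` and all their parameters are hypothetical names chosen for illustration, and the policy output is replaced by a fixed array.

```python
import numpy as np

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s', d); oldest entries are overwritten."""

    def __init__(self, obs_dim, act_dim, capacity, seed=0):
        self.s = np.zeros((capacity, obs_dim))
        self.a = np.zeros((capacity, act_dim))
        self.r = np.zeros(capacity)
        self.s2 = np.zeros((capacity, obs_dim))
        self.d = np.zeros(capacity)
        self.ptr, self.size, self.capacity = 0, 0, capacity
        self.rng = np.random.default_rng(seed)

    def store(self, s, a, r, s2, d):
        i = self.ptr
        self.s[i], self.a[i], self.r[i], self.s2[i], self.d[i] = s, a, r, s2, float(d)
        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        # uniform sampling with replacement over the filled part
        idx = self.rng.integers(0, self.size, size=batch_size)
        return {"s": self.s[idx], "a": self.a[idx], "r": self.r[idx],
                "s2": self.s2[idx], "d": self.d[idx]}

def noisy_action(mu_s, sigma, a_low, a_high, rng):
    """Step 4: add Gaussian exploration noise, then clip to the action bounds."""
    eps = sigma * rng.normal(size=np.shape(mu_s))
    return np.clip(mu_s + eps, a_low, a_high)

buffer = ReplayBuffer(obs_dim=2, act_dim=1, capacity=3)
rng = np.random.default_rng(0)
for k in range(5):
    a = noisy_action(np.array([0.9]), sigma=0.5, a_low=-1.0, a_high=1.0, rng=rng)
    buffer.store(s=np.full(2, k), a=a, r=float(k), s2=np.full(2, k + 1), d=(k == 4))
batch = buffer.sample(4)
```

Because the capacity is 3, only the three most recent transitions survive the five stores.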

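The algorithm is interesting. The target parameters $\theta_{target}$ and $\phi_{target}$ are slowly moving copies of the online networks, updated by Polyak averaging. They are called targets because they are used only to compute the regression target $y$, which keeps $y$ from shifting as fast as the networks being trained. (Is there any convergence guarantee?) 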

### Twin Delayed DDPG (TD3)
DDPG suffers from the same overestimation bias that is observed in Q-learning.
https://arxiv.org/pdf/1802.09477.pdf

The above paper proposes a variant of double Q-learning. The following modifications are made:
1. Clipped double Q-learning: TD3 uses two independent Q-functions and takes the minimum of the two when forming targets for the Bellman equation. 
2. Delayed policy updates: TD3 updates the policy and target networks less frequently than the Q-functions. 
3. Target policy smoothing: TD3 adds noise to the target action, making it harder for the policy to exploit Q-function estimation errors and helping to control the overestimation bias. 

***
**Twin Delayed DDPG (TD3)**
***
 
1. Input initial policy parameters $\theta$,  Q-function parameters $\phi_1$ and $\phi_2$, empty replay buffer D

2. Set target parameters equal to online parameters $\theta_{targ} \leftarrow \theta$, $\phi_{targ,1} \leftarrow \phi_1$ and $\phi_{targ,2} \leftarrow \phi_2$

3. **repeat**

4. Observe state s and select action $a = clip(\mu_\theta(s)+\epsilon, a_{Low}, a_{High}), \text{where  } \epsilon \sim N$

5. Execute a in environment and observe next state s', reward r, and done signal d

6. Store `(s,a,r,s',d)` in Replay Buffer D

7. if `s'` is terminal state, reset the environment

8. if it's time to update **then**:

9. &emsp;&emsp;for j in range (as many updates as required):

10. &emsp;&emsp;&emsp;&emsp;Sample a batch B={`(s,a,r,s',d)`} from replay Buffer D:

11. &emsp;&emsp;&emsp;&emsp;Compute target actions:

$$a'(s') = \text{clip}\left(\mu_{\theta_{\text{targ}}}(s') + \text{clip}(\epsilon,-c,c), a_{Low}, a_{High}\right), \;\;\;\;\; \epsilon \sim \mathcal{N}(0, \sigma)$$

12. &emsp;&emsp;&emsp;&emsp;Compute targets: 

$$y(r,s',d) = r + \gamma (1-d) \min_{i=1,2} Q_{\phi_{\text{targ},i}}(s', a'(s'))$$

13. &emsp;&emsp;&emsp;&emsp;Update Q functions with one step gradient descent on $\phi_i$: 
$$\nabla_{\phi_i} \frac{1}{|B|} \sum_{(s,a,r,s',d)\in B}(Q_{\phi_i}(s,a) - y(r,s',d))^2, \;\;\;\;\;  \text{for } i=1,2$$

14. &emsp;&emsp;&emsp;&emsp;if `j mod policy_update == 0`:

15. &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;Update Policy with one step gradient Ascent on $\theta$: 

$$\nabla_\theta \frac{1}{|B|} \sum_{s \in B} Q_{\phi_1}(s, \mu_\theta(s))$$

16. &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;Update target networks using polyak averaging: 
$$\phi_{targ,i} \leftarrow \rho\phi_{targ,i} + (1-\rho)\phi_i, \;\;\;\;\;  \text{for } i=1,2$$ 
$$\theta_{targ} \leftarrow \rho\theta_{targ} + (1-\rho)\theta$$
***
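
Steps 11 and 12 of the pseudocode above combine target policy smoothing with clipped double Q-learning. The sketch below is an illustration under assumptions: the target policy output is passed in as an array, the two target critics are passed in as plain callables of `(next_states, actions)`, and the function name and default hyperparameters are hypothetical.

```python
import numpy as np

def td3_targets(rewards, dones, next_states, mu_targ_next, q1_targ, q2_targ,
                a_low, a_high, gamma=0.99, sigma=0.2, c=0.5, rng=None):
    """Return the targets y and the smoothed target actions a'(s')."""
    rng = np.random.default_rng() if rng is None else rng
    # target policy smoothing: clipped Gaussian noise on the target action
    noise = np.clip(sigma * rng.normal(size=mu_targ_next.shape), -c, c)
    a_next = np.clip(mu_targ_next + noise, a_low, a_high)
    # clipped double-Q: take the smaller of the two target critics
    q_min = np.minimum(q1_targ(next_states, a_next), q2_targ(next_states, a_next))
    return rewards + gamma * (1.0 - dones) * q_min, a_next

# Hypothetical stand-ins for the two target critics.
q1 = lambda s, a: a + 1.0
q2 = lambda s, a: 2.0 * a

y, a_next = td3_targets(
    rewards=np.array([1.0, 1.0]),
    dones=np.array([0.0, 1.0]),
    next_states=np.zeros((2, 1)),
    mu_targ_next=np.array([0.5, 2.0]),
    q1_targ=q1, q2_targ=q2,
    a_low=-1.0, a_high=1.0,
    gamma=0.5, sigma=0.0,  # sigma=0 makes this example deterministic
)
# a_next = [0.5, 1.0] and y = [1.5, 1.0]
```

Taking the minimum means an overestimate in one critic is not propagated into the target unless the other critic agrees with it.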
Similar to DDPG:
1. TD3 is an off-policy algorithm.
2. TD3 can only be used for environments with continuous action spaces.
3. TD3 can be thought of as being deep Q-learning for continuous action spaces.


This notebook follows the code found in OpenAI's [Spinning Up Library](https://spinningup.openai.com/en/latest/algorithms/td3.html).