# On the REINFORCE Algorithm in Sequence Generation

## Formulation
- $\tau$: Action sequence $(a_{1}, a_{2}, \dots)$
- $\pi_{\theta}$: Policy parameterized by $\theta$
- $\gamma$: Discount factor
- $r(\tau)$: Discounted total reward of the action sequence $\tau$

Loss function:

$L(\theta) = -\mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}[r(\tau)]$
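
As a minimal sketch of the quantities above, the snippet below computes a discounted total reward and a sample estimate of $L(\theta)$. It assumes that each sequence comes with a list of per-step rewards and that discounting starts at exponent 0; neither detail is fixed by the definitions above, and the function names are hypothetical.

```python
def discounted_return(step_rewards, gamma):
    """Return sum_t gamma**t * step_rewards[t] for one sampled sequence.

    Assumption: the reward decomposes into per-step rewards, discounted from t = 0.
    """
    total = 0.0
    # Horner-style accumulation from the last step back to the first.
    for reward in reversed(step_rewards):
        total = reward + gamma * total
    return total


def loss_estimate(reward_sequences, gamma):
    """Monte Carlo estimate of L(theta) = -E[r(tau)] from sampled sequences."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return -sum(returns) / len(returns)
```

With `gamma = 1` this reduces to the plain sum of rewards, which is the common setting when a single sequence-level score (for example a caption metric) is the only reward.
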

## REINFORCE

Expectation of policy gradient:
$$\begin{aligned}
\nabla_{\theta}\mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}[r(\tau)] &=\mathop{\Sigma}_{\tau}r(\tau)\nabla_{\theta}\pi_{\theta}(\tau)\\
&=\mathop{\Sigma}_{\tau}r(\tau)\frac{\nabla_{\theta}\pi_{\theta}(\tau)}{\pi_{\theta}(\tau)}\pi_{\theta}(\tau)\\
&=\mathop{\Sigma}_{\tau}\pi_{\theta}(\tau)\cdot r(\tau)\nabla_{\theta}\ln \pi_{\theta}(\tau)\\
&=\mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}[r(\tau)\nabla_{\theta}\ln \pi_{\theta}(\tau)]\\
&\approx \frac{1}{N}\mathop{\Sigma}_{i=1}^{N}r(\tau_{i})\nabla_{\theta}\ln \pi_{\theta}(\tau_{i})
\end{aligned}$$
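
The identity above can be checked numerically on a toy problem. The sketch below assumes a one-step softmax policy over a handful of actions, with the logits playing the role of $\theta$ and a fixed reward per action; these choices and all function names are illustrative assumptions. For a softmax, $\nabla_{\theta}\ln \pi_{\theta}(a)$ is the one-hot vector of $a$ minus the probability vector.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(probs, action):
    """Gradient of ln pi(action) w.r.t. the logits: onehot(action) - probs."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def exact_gradient(logits, rewards):
    """Sum over all actions of pi(a) * r(a) * grad ln pi(a)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, p in enumerate(probs):
        g = grad_log_prob(probs, a)
        for k in range(len(grad)):
            grad[k] += p * rewards[a] * g[k]
    return grad


def reinforce_estimate(logits, rewards, num_samples, rng):
    """(1/N) * sum_i r(a_i) * grad ln pi(a_i), with a_i sampled from pi."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        a = rng.choices(range(len(probs)), weights=probs)[0]
        g = grad_log_prob(probs, a)
        for k in range(len(grad)):
            grad[k] += rewards[a] * g[k] / num_samples
    return grad
```

Calling `reinforce_estimate([0.0, 0.0, 0.0], [1.0, 2.0, 3.0], 20000, random.Random(0))` should land close to `exact_gradient` for the same inputs, with the gap shrinking as the sample count grows.
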

Generalize the policy gradient by subtracting a baseline reward $b$ from $r(\tau)$:

$\mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}[(r(\tau)-b)\nabla_{\theta}\ln \pi_{\theta}(\tau)]$

- The expectation of the policy gradient does **NOT** change, provided $b$ is independent of $\tau$

Proof:
$$\mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}[r(\tau)-b]\nabla_{\theta}\ln \pi_{\theta}(\tau) = \mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}r(\tau)\nabla_{\theta}\ln \pi_{\theta}(\tau)-\mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}b\nabla_{\theta}\ln \pi_{\theta}(\tau)$$
in which
$$\mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}b\nabla_{\theta}\ln \pi_{\theta}(\tau)=\mathop{\Sigma}_{\tau}b\frac{\nabla_{\theta}\pi_{\theta}(\tau)}{\pi_{\theta}(\tau)}\cdot \pi_{\theta}(\tau)=b\mathop{\Sigma}_{\tau}\nabla_{\theta}\pi_{\theta}(\tau)=b\nabla_{\theta}\mathop{\Sigma}_{\tau}\pi_{\theta}(\tau)=b\nabla_{\theta}1=0
$$
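
The vanishing baseline term can also be confirmed by direct enumeration. This is a small sketch under the same kind of toy assumption, a one-step softmax policy whose logits stand in for $\theta$; the function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def baseline_term(logits, b):
    """Exact E[b * grad ln pi(a)] for a softmax policy, taken w.r.t. the logits."""
    probs = softmax(logits)
    term = [0.0] * len(logits)
    for a, p in enumerate(probs):
        for k in range(len(logits)):
            indicator = 1.0 if k == a else 0.0
            # grad ln pi(a) w.r.t. logit k is indicator - probs[k]
            term[k] += p * b * (indicator - probs[k])
    return term
```

Every entry of `baseline_term(logits, b)` is zero up to floating-point rounding, whatever constant `b` is passed in.
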
- The variance of the gradient estimate may be reduced

Denote the $i$-th entry of $\nabla_{\theta}\ln \pi_{\theta}(\tau)$ as $g_{\theta_{i}}(\tau)$, and let $\mathbb{D}$ denote variance. By the proof above, $\mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}[(r(\tau)-b)g_{\theta_{i}}(\tau)]$ does not depend on $b$, so only the second moment contributes to the derivative of the variance. Setting that derivative to zero:

$$\begin{aligned}
\frac{\mathrm{d}\mathbb{D}_{\tau\sim \pi_{\theta}(\tau)}[(r(\tau)-b)g_{\theta_{i}}(\tau)]}{\mathrm{d}b} &=\frac{\mathrm{d}\mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}[(r(\tau)-b)^{2}g_{\theta_{i}}^{2}(\tau)]}{\mathrm{d}b}\\
&=\frac{\mathrm{d}\left(\mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}[r^{2}(\tau)g_{\theta_{i}}^{2}(\tau)]+b^{2}\mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}[g_{\theta_{i}}^{2}(\tau)]-2b\mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}[r(\tau)g_{\theta_{i}}^{2}(\tau)]\right)}{\mathrm{d}b}\\
&=0+2b\mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}[g_{\theta_{i}}^{2}(\tau)]-2\mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}[r(\tau)g_{\theta_{i}}^{2}(\tau)]=0
\end{aligned}$$

The second moment is a convex quadratic in $b$, so this stationary point is the variance-minimizing baseline for the $i$-th entry:

$b=\frac{\mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}[r(\tau)g_{\theta_{i}}^{2}(\tau)]}{\mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}[g_{\theta_{i}}^{2}(\tau)]}$
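
To make the formula for $b$ concrete, the sketch below evaluates it exactly on a toy one-step softmax policy and compares the resulting per-entry variance with other baseline choices. The action probabilities are passed in directly, gradients are taken with respect to the logits, and the function names are hypothetical; all of this is an illustrative assumption rather than part of the derivation.

```python
def score(probs, action):
    """grad ln pi(action) w.r.t. the logits of a softmax policy."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def optimal_baseline(probs, rewards, i):
    """b_i = E[r * g_i^2] / E[g_i^2], computed by exact enumeration."""
    num = sum(p * rewards[a] * score(probs, a)[i] ** 2 for a, p in enumerate(probs))
    den = sum(p * score(probs, a)[i] ** 2 for a, p in enumerate(probs))
    return num / den


def variance(probs, rewards, b, i):
    """Variance of (r - b) * g_i under the policy, by exact enumeration."""
    terms = [(rewards[a] - b) * score(probs, a)[i] for a in range(len(probs))]
    mean = sum(p * t for p, t in zip(probs, terms))
    second_moment = sum(p * t * t for p, t in zip(probs, terms))
    return second_moment - mean ** 2
```

For any entry `i`, `variance(probs, rewards, optimal_baseline(probs, rewards, i), i)` is no larger than the variance obtained with `b = 0` or with the mean reward as the baseline. Note that the optimal value generally differs from entry to entry, which is one reason simpler shared baselines are often used in practice.
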

With a total of $N$ sampled trials, the expectation can be approximated by
$$\mathop{\mathbb{E}}_{\tau\sim \pi_{\theta}(\tau)}[(r(\tau)-b)\nabla_{\theta}\ln \pi_{\theta}(\tau)] \approx \frac{1}{N}\mathop{\Sigma}_{i=1}^{N}[r(\tau_{i})-b]\nabla_{\theta}\ln \pi_{\theta}(\tau_{i})$$
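
One way this sample estimate might be used in a training loop is sketched below. It assumes a one-step softmax policy with one fixed reward per action, and it takes $b$ to be a running average of rewards from earlier batches so that $b$ stays independent of the current samples. That baseline, the hyperparameters, and the name `train` are all illustrative assumptions, not choices made in the text above.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(rewards, num_steps=500, num_samples=32, lr=0.1, seed=0):
    """Toy REINFORCE loop; returns the final action probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0  # assumed choice of b: running average of past rewards
    for _ in range(num_steps):
        probs = softmax(logits)
        actions = rng.choices(range(len(probs)), weights=probs, k=num_samples)
        grad = [0.0] * len(logits)
        for a in actions:
            advantage = rewards[a] - baseline
            for k in range(len(logits)):
                indicator = 1.0 if k == a else 0.0
                # (r - b) * grad ln pi(a), averaged over the N samples
                grad[k] += advantage * (indicator - probs[k]) / num_samples
        # Descending L(theta) = -E[r] is the same as ascending the expected reward.
        logits = [x + lr * g for x, g in zip(logits, grad)]
        # Update b only after the gradient step, using this batch's mean reward.
        batch_mean = sum(rewards[a] for a in actions) / num_samples
        baseline = 0.9 * baseline + 0.1 * batch_mean
    return softmax(logits)
```

Running `train([1.0, 2.0, 5.0])` should shift most of the probability mass onto the last action, which has the highest reward.
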


# References
[Self-critical Sequence Training for Image Captioning](https://arxiv.org/abs/1612.00563)

[A Deep Reinforced Model for Abstractive Summarization](https://arxiv.org/abs/1705.04304)