# Day 28 - REINFORCE

Following this [guide](https://chatgpt.com/share/67acc02c-e5f4-800e-9129-899c74684e09).

## Intuition Behind Policy Gradients

* Directly learning a policy allows an agent to handle continuous action spaces naturally
* Policy gradient methods are able to learn stochastic policies explicitly
* Policy-based methods are arguably the most direct form of RL, as they learn the final goal itself: the policy
* REINFORCE uses Monte Carlo returns, which avoids the stability issues of bootstrapping,
  but makes learning slow and the gradient estimates high-variance

## Mathematical Foundation: Policy Gradient Theorem

* To understand policy gradient methods, we have to understand how to compute the gradient
  of the policy's performance
* The performance is, of course, the expected return
$$
J(\theta)=\mathbb E_{\tau\sim\pi_\theta}\left[\sum_{t=1}^T\gamma^{t-1}r_{t}\right],
$$
  where $\tau$ is a trajectory $(s_0, a_0, r_1, s_1, \dots)$
* Our goal is then to maximize this expectation
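* So we are now looking for $\nabla_\theta J(\theta)$, where $P(\tau;\theta)$ is the probability of trajectory $\tau$
  under $\pi_\theta$ and $R(\tau)$ is its discounted return: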

$$
\nabla_\theta J(\theta)=\nabla_\theta\int P(\tau;\theta)R(\tau)d\tau=\int\nabla_\theta P(\tau;\theta)R(\tau)d\tau
$$

* We can use the log-derivative trick (also known as the likelihood-ratio trick), taking advantage of the identity
  $\nabla_{\theta} P(\tau;\theta) = P(\tau;\theta)\,\nabla_{\theta} \log P(\tau;\theta)$:

$$
\nabla_\theta J(\theta)
=\int P(\tau;\theta)\nabla_\theta \operatorname{log}P(\tau;\theta)
R(\tau)d\tau=\mathbb E_{\tau\sim\pi_\theta}\left[\nabla_\theta \operatorname{log}P(\tau;\theta)R(\tau)\right]
$$
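The identity above can be checked numerically. The following is a minimal sketch, assuming a toy setting that is not part of the guide: the "trajectory" is a single draw from a softmax distribution with logits `theta`, and `rewards` assigns a fixed return to each outcome. All function names are made up for illustration. The sampled estimate of $\mathbb E[\nabla_\theta\log P\cdot R]$ should approach the exact gradient of the expected return as the number of samples grows.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def exact_gradient(theta, rewards):
    """Exact gradient of sum_i p_i * R_i with respect to the logits.

    For a softmax, d/dtheta_k sum_i p_i R_i = p_k * (R_k - sum_i p_i R_i).
    """
    p = softmax(theta)
    expected = sum(pi * ri for pi, ri in zip(p, rewards))
    return [pk * (rk - expected) for pk, rk in zip(p, rewards)]


def score_function_gradient(theta, rewards, n_samples, rng):
    """Monte Carlo estimate of E[grad log p(x) * R(x)]."""
    p = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        x = rng.choices(range(len(theta)), weights=p)[0]
        # For a softmax: d log p_x / d theta_k = 1[k == x] - p_k
        for k in range(len(theta)):
            grad[k] += ((1.0 if k == x else 0.0) - p[k]) * rewards[x]
    return [g / n_samples for g in grad]


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [0.2, -0.1, 0.5]    # illustrative logits
    rewards = [1.0, 0.0, 3.0]   # illustrative return per outcome
    print("exact:   ", exact_gradient(theta, rewards))
    print("estimate:", score_function_gradient(theta, rewards, 100_000, rng))
```

The estimator only needs samples and the gradient of the log-probability; it never differentiates through the rewards, which is what makes the trick usable with unknown environment dynamics.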

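* Luckily, we know that $P(\tau;\theta)=p(s_0)\prod_{t=0}^T\pi_\theta(a_t|s_t)\,p(s_{t+1},r_{t+1}|s_t,a_t)$, and that
  neither the initial-state distribution $p(s_0)$ nor the dynamics $p(s_{t+1},r_{t+1}|s_t,a_t)$ depends on $\theta$,
  so those factors contribute nothing to the gradient of the log
* Taking the log turns the product into a sum, which yields a very simple expression that we can estimate from experience: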

$$
\nabla_\theta\operatorname{log}P(\tau;\theta)=\sum_{t=0}^T\nabla_\theta\operatorname{log}\pi_\theta(a_t|s_t)
$$
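Spelled out, the intermediate step behind this expression is the following expansion, where $p(s_0)$ denotes the initial-state distribution (an assumption about notation; the guide may write it differently):

$$
\log P(\tau;\theta)=\log p(s_0)+\sum_{t=0}^T\log\pi_\theta(a_t|s_t)+\sum_{t=0}^T\log p(s_{t+1},r_{t+1}|s_t,a_t)
$$

Only the middle sum depends on $\theta$, so the first and last terms drop out when taking $\nabla_\theta$.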

* So, the policy gradient now looks like this:

$$
\nabla_\theta J(\theta)=\mathbb E_{\tau\sim\pi_\theta}\left[
\sum_{t=0}^T\nabla_\theta\operatorname{log}\pi_\theta(a_t|s_t)R(\tau)\right]
$$

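* In this expectation, the action $a_t$ cannot influence rewards received before step $t$, so those rewards contribute
  nothing in expectation. We can therefore replace $R(\tau)$ with the return from step $t$ onward,
  $G_t=\sum_{k=t+1}^T\gamma^{k-t-1}r_k$ (strictly $\gamma^t G_t$ under discounting; the $\gamma^t$ factor is commonly
  dropped in practice). The expected value of $G_t$ given $(s_t,a_t)$ is $Q_{\pi_\theta}(s_t,a_t)$, giving us the final
  policy gradient theorem:

$$
\nabla_\theta J(\theta)\propto\mathbb E_{\pi_\theta}\Bigl[
\nabla_\theta\operatorname{log}\pi_\theta(a|s)Q_{\pi_\theta}(s,a)\Bigr]
$$

* The sum goes missing here because the sum over time steps is rewritten as an expectation over the states visited
  under $\pi_\theta$ and the actions taken there. This introduces a normalization constant (roughly the episode length),
  hence the $\propto$, which is a factor that can be silently absorbed into the learning rate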

### REINFORCE Algorithm (Monte Carlo Policy Gradient)

The REINFORCE algorithm is simple:
1. Sample episodes using the current policy.
2. Adjust the policy parameter $\theta$ in the direction of $\nabla_\theta \log \pi_\theta(a_t|s_t) G_t$,
   where $G_t$ is the sampled return, for each time step $t$ of the episode.
$$
\theta\leftarrow\theta+\alpha G_t\nabla_\theta\operatorname{log}\pi_\theta(a_t|s_t)
$$
   In practice, the per-step updates are usually combined into a single gradient step per episode,
   using the sum or the average over all time steps.
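
The two steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation, and everything in it is an assumption made for the example: a hypothetical four-state corridor (start at state 0, reward $-1$ per step, episode ends at the right end or after `MAX_STEPS`), a tabular softmax policy with one logit per state-action pair, and untuned hyperparameters. The update is applied once per episode as a sum over time steps.

```python
import math
import random

N_STATES = 4       # hypothetical corridor: states 0..3, state 3 is terminal
ACTIONS = (0, 1)   # 0 = left, 1 = right
MAX_STEPS = 50     # cap on episode length


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def step(state, action):
    """Toy dynamics: move left or right, reward -1 per step, done at the right end."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    return next_state, -1.0, next_state == N_STATES - 1


def sample_episode(theta, rng):
    """Step 1: roll out one episode with the current softmax policy."""
    state, episode = 0, []
    for _ in range(MAX_STEPS):
        action = rng.choices(ACTIONS, weights=softmax(theta[state]))[0]
        next_state, reward, done = step(state, action)
        episode.append((state, action, reward))
        state = next_state
        if done:
            break
    return episode


def returns_to_go(rewards, gamma):
    """G_t = r_{t+1} + gamma * r_{t+2} + ... for every time step t."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def reinforce(n_episodes=3000, alpha=0.02, gamma=0.99, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]  # logits per (state, action)
    for _ in range(n_episodes):
        episode = sample_episode(theta, rng)
        returns = returns_to_go([r for _, _, r in episode], gamma)
        # Step 2: accumulate G_t * grad log pi(a_t|s_t) over the episode.
        grad = [[0.0, 0.0] for _ in range(N_STATES)]
        for (s, a, _), g in zip(episode, returns):
            probs = softmax(theta[s])
            for k in ACTIONS:
                # Softmax score: d log pi(a|s) / d theta[s][k] = 1[k == a] - pi(k|s)
                grad[s][k] += g * ((1.0 if k == a else 0.0) - probs[k])
        # Gradient ascent step using the summed per-step terms.
        for s in range(N_STATES):
            for k in ACTIONS:
                theta[s][k] += alpha * grad[s][k]
    return theta


if __name__ == "__main__":
    theta = reinforce()
    for s in range(N_STATES - 1):
        print(f"state {s}: P(right) = {softmax(theta[s])[1]:.3f}")
```

Because every reward here is negative, each taken action is pushed down, just less so when it led to a shorter episode. Subtracting a baseline from $G_t$ is a common way to reduce the resulting variance, but it is left out to keep the sketch close to the plain update rule.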