# Policy Gradient from scratch

The idea of policy gradient is to learn the policy directly by optimizing the reward function. We do NOT build a model of the environment and we do NOT appeal to the Bellman equation. We aim to maximize the expected total reward


\begin{align*}
\large
J = \mathbf{E}_{\tau \sim \pi_{\theta}}[R(\tau)]
\end{align*}


where
* $\large \pi_{\theta}$ is the probability density function (pdf) of the parametric policy and $\theta$ is the parameter vector. 
* $\large \tau$ is a trajectory obtained from sampling the policy.
* $R(\tau)$ is the undiscounted finite-horizon total reward. Let $s_i,\: a_i$ represent the state and action at time $i.$ The total reward of the trajectory $\tau =(s_0,\:a_0,\:r_0,\: s_1,\: a_1,\:r_1,\:s_2,...,s_{T+1})$ with time steps $t=0,\dots,T$ is defined by
\begin{align*}
R(\tau)= \sum_{t=0}^{T} r_t
\end{align*}
* The expectation is taken over the probability distribution of the trajectory.

We would like to optimize the policy directly, for example by gradient ascent (we are maximizing $J$). So, we aim to obtain _the policy gradient_


\begin{align*}
\large
\nabla_{\theta} J.
\end{align*}


> _The algorithms that optimize the policy in this way are called Policy Gradient (PG) algorithms._ 

The log-derivative trick helps us obtain the policy gradient $\large \nabla_{\theta} J$. 

## Log-derivative trick
The log-derivative trick relies on $\nabla_x \log x = \dfrac{1}{x}$ for a positive scalar $x$. Assume that $x$ is a function of $\theta$. Then, using the chain rule, we have

\begin{align*}
\large
\nabla_{\theta} \log x = \nabla_{x} \log x \nabla_{\theta} x =  \dfrac{1}{x}\nabla_{\theta} x .
\end{align*}

Rearranging the above equation gives
\begin{align*}
\large
\nabla_{\theta} x =x \nabla_{\theta} \log x
\end{align*}
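
The identity can be checked numerically. The following is a minimal sketch, assuming a made-up positive function `x_of_theta` (not part of the original text) and central finite differences for both gradients.

```python
import numpy as np

def x_of_theta(theta):
    # Hypothetical positive scalar function of theta (illustration only).
    return np.exp(-0.5 * np.sum(theta ** 2)) + 1.0

def grad_fd(f, theta, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

theta = np.array([0.3, -0.7])

# Left-hand side: grad_theta x
lhs = grad_fd(x_of_theta, theta)
# Right-hand side: x * grad_theta log x
rhs = x_of_theta(theta) * grad_fd(lambda t: np.log(x_of_theta(t)), theta)

trick_holds = np.allclose(lhs, rhs, atol=1e-6)
print(trick_holds)
```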

In what follows, we define the main concepts in PG.

## 1. Defining the probability density function for the parametric policy
As we mentioned earlier, $\large \pi_{\theta}$ is the probability density function (pdf) of the parametric policy and $\theta$ is the parameter vector. Depending on whether the action space is discrete or continuous, we define the pdf differently.

### 1.1 Discrete action space
If the action space is discrete, the policy network builds the pdf $\pi_{\theta}$ (strictly speaking, a probability mass function). The policy network maps the state to the probability of each action. So, if there are $n_a$ actions, the policy network has $n_a$ outputs, each representing the probability of an action. Note that the outputs must sum to 1.

Let $\pi_{\theta}$ be generated by the function `network(state)`. That is, $\pi_{\theta}(s)=$ `network(state)`, which takes the state variable `state` as input and has the parameter vector $\theta$. To draw a sample $a \sim \pi_{\theta}$, we do the following

<code>softmax_out = network(state)</code><br>
<code>a = np.random.choice(n_a, p=softmax_out.numpy()[0])</code>

In the first line, we give the state `state` as the input to the function `network` and the output `softmax_out` is the pdf  $\pi_{\theta}$. In the second line, we select an action, according to the pdf `softmax_out`.
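
The two lines above assume a trained Keras network. As a self-contained illustration, here is a minimal sketch that swaps in a hypothetical linear-softmax function for `network`; the sizes `n_s`, `n_a` and the random parameters `theta` are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a = 4, 3                          # hypothetical state dimension and number of actions
theta = rng.normal(size=(n_a, n_s))      # hypothetical policy parameters

def network(state):
    # Stand-in for the policy network: returns one probability per action.
    logits = theta @ state
    logits = logits - logits.max()       # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

state = rng.normal(size=n_s)
softmax_out = network(state)             # the pdf pi_theta(. | state)
a = rng.choice(n_a, p=softmax_out)       # sample an action index according to that pdf
```

Here `softmax_out` is already a one-dimensional NumPy array, so no `.numpy()[0]` conversion is needed.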

### 1.2 Continuous action space
If the action space is continuous, $\pi_{\theta}$ is selected as a diagonal Gaussian distribution $\pi_{\theta}=\mathcal{N}(\mu_{\theta},\Sigma)$. Then, the policy network maps from the state to $\mu_{\theta}$. For example, a linear policy can be represented by $\mu_{\theta}(s) =\theta \: s$ where $\theta$ is the linear gain and $s$ is the state. The diagonal covariance matrix $\Sigma$ can be independent of $\theta$ or a function of it. We consider $\Sigma= \sigma^2 I_{n_a}$, where $\sigma>0$ and $n_a$ is the dimension of the control input. 

Let $\mu_{\theta}$ be generated by the function `network(state)`. That is, $\mu_{\theta}(s)=$ `network(state)`, which takes the state variable `state` as input and has the parameter vector $\theta$. To draw a sample $a \sim \mathcal{N}(\mu_{\theta},\sigma^2 I_{n_a})$, we do the following

<code>a = network(state) + sigma * np.random.randn(n_a)</code>
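
As a quick sanity check of this sampling rule, the sketch below uses a hypothetical linear policy $\mu_{\theta}(s)=\theta s$ as a stand-in for `network` (the dimensions and the value of `sigma` are assumptions) and compares the empirical mean and standard deviation of many samples with $\mu_{\theta}(s)$ and $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a = 3, 2                          # hypothetical state and action dimensions
theta = rng.normal(size=(n_a, n_s))      # hypothetical linear gain
sigma = 0.1                              # fixed standard deviation of the policy

def network(state):
    # Stand-in for the policy network: the mean action of a linear policy.
    return theta @ state

state = rng.normal(size=n_s)

# One sample a ~ N(mu_theta(s), sigma^2 I)
a = network(state) + sigma * rng.standard_normal(n_a)

# Many samples, to compare empirical statistics with mu_theta(s) and sigma
samples = network(state) + sigma * rng.standard_normal((10_000, n_a))
empirical_mean = samples.mean(axis=0)
empirical_std = samples.std(axis=0)
```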

## 2. Defining the probability of trajectory
$\tau \sim \pi_{\theta}$ means that a trajectory of the environment is generated by sampling actions from $\pi_{\theta}$. The procedure is as follows. Let $s_0$ denote the initial state of the environment.
* We sample the action $a_0$ from the pdf; i.e. $a_0 \sim \pi_{\theta}$. We drive the environment using $a_0$. The environment reveals the reward $r_0$ and transitions to a new state $s_1$.
* We sample the action $a_1$ from the pdf; i.e. $a_1 \sim \pi_{\theta}$. We drive the environment using $a_1$. The environment reveals the reward $r_1$ and transitions to a new state $s_2$. 

We continue the above procedure up to time step $T$ and in the end, we get a trajectory $\tau =(s_0,\:a_0,\:r_0,\: s_1,\: a_1,\:r_1,\:s_2,...,s_{T+1})$.


The probability of the trajectory $\tau =(s_0,\:a_0,\: s_1,\: a_1,\:s_2,...,s_{T+1})$ is defined as follows


\begin{align*}
\large
P(\tau| \theta) = \prod_{k=0}^{T}p(s_{k+1}|s_{k},a_{k}) p(a_k|\theta).
\end{align*}


* $p(s_{k+1}|s_{k},a_{k})$ represents the dynamics of the environment; it gives the distribution of the next state $s_{k+1}$ given the current state $s_{k}$ and the current control $a_{k}$. Note that in RL we do NOT know $p(s_{k+1}|s_{k},a_{k})$. Don't worry. You'll see that we don't need it in our computation.
* $p(a_{k}|\theta)$ is the likelihood function and it is obtained by evaluating the pdf $\pi_{\theta}$ at $a_{k}$. In the sequel, we will see how $p(a_{k}|\theta)$ is defined in discrete and continuous action spaces.


### 2.1 Discrete action space
If the action space is discrete, `network(state)` denotes the probability density function $\pi_{\theta}$. It is a vector with one entry per action, and the actions are the indices of the vector. So, $p(a_{k}|\theta)$ is obtained by indexing into the output vector `network(state)` at position $a_k$.
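
For instance, with a made-up probability vector standing in for the output of `network(state)`, the indexing looks like this (a minimal sketch; the numbers are assumptions):

```python
import numpy as np

probs = np.array([0.1, 0.6, 0.3])   # hypothetical output of network(state) for 3 actions
a_k = 1                             # index of the action that was sampled

p_a = probs[a_k]                    # p(a_k | theta), read off by indexing
log_p_a = np.log(p_a)               # the log-likelihood used later in the gradient
```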

### 2.2 Continuous action space
Let the action space be continuous with dimension $n_a$. We consider a multivariate Gaussian with mean $\mu_{\theta}(s)=$ `network(state)` and covariance $\sigma^2 I_{n_a}$,

\begin{align*}
p(a_k|\theta) = \dfrac{1}{\sqrt{(2 \pi \sigma^2)^{n_a}}} \exp[-\dfrac{1}{2\sigma^2} (a_k-\mu_{\theta}(s_k))^T(a_k-\mu_{\theta}(s_k))].
\end{align*}
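
In practice it is usually more convenient to work with the logarithm of this density. The sketch below, with made-up values for the action, mean, and $\sigma$, evaluates the log-likelihood and checks it against the density formula above; the helper name `gaussian_log_prob` is an assumption for this example.

```python
import numpy as np

def gaussian_log_prob(a, mu, sigma):
    # log N(a; mu, sigma^2 I) for an n_a-dimensional action.
    n_a = a.shape[0]
    diff = a - mu
    return -0.5 * n_a * np.log(2 * np.pi * sigma ** 2) - diff @ diff / (2 * sigma ** 2)

a_k = np.array([0.2, -0.1])      # hypothetical sampled action
mu = np.array([0.0, 0.0])        # hypothetical mean mu_theta(s_k)
sigma = 0.5
n_a = a_k.shape[0]

log_p = gaussian_log_prob(a_k, mu, sigma)

# Direct evaluation of the density formula, for comparison
diff = a_k - mu
p_direct = np.exp(-diff @ diff / (2 * sigma ** 2)) / np.sqrt((2 * np.pi * sigma ** 2) ** n_a)
```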

## 3. Computing the gradient $\nabla_{\theta} J(\pi_{\theta})$
As we discussed earlier, in policy gradient, the parameter vector is learned by directly optimizing the reward function. So, we need to obtain $\nabla_{\theta} J(\pi_{\theta})$:

\begin{align*}
\large
\nabla_{\theta} J&=\nabla_{\theta} \mathbf{E}\: [ R(\tau)]\\
&= \nabla_{\theta} \int_{\tau} P(\tau| \theta) R(\tau) \quad \text{replacing the expectation operator with the integral}\\
&= \int_{\tau} \nabla_{\theta} P(\tau| \theta) R(\tau) \quad \text{bringing the derivative inside}\\
&= \int_{\tau} P(\tau| \theta) \nabla_{\theta} \log P(\tau| \theta) R(\tau)\quad \text{using the trick}\\
&= \mathbf{E}_{\tau \sim \pi_{\theta}} [\nabla_{\theta} \log P(\tau| \theta) R(\tau)] \quad \text{replacing the integral with the expectation operator.}
\end{align*}

Remember that $P(\tau| \theta)$ is the probability of the trajectory which we defined in Section 2. So, $\nabla_{\theta} \log P(\tau| \theta)$ reads

\begin{align*}
\large
\nabla_{\theta} \log P(\tau| \theta)&=\nabla_{\theta} \sum_{k=0}^{T} \log p(s_{k+1}|s_{k},a_{k}) +\nabla_{\theta}\sum_{k=0}^{T} \log p(a_{k}|\theta)\\
&=\sum_{k=0}^{T}\nabla_{\theta} \log p(a_{k}|\theta).
\end{align*}

The gradient of the first summation is zero because the environment dynamics are independent of the parameter vector $\theta$. Remember that we have defined $p(a_{k}|\theta)$ for continuous and discrete action spaces in Section 2. So, we have

> \begin{align*}
\large
\nabla_{\theta} J= \mathbf{E}_{\tau \sim \pi_{\theta}} [R(\tau)\sum_{k=0}^{T}\nabla_{\theta} \log p(a_{k}|\theta)]. 
\end{align*}
This is the main equation in PG. One can approximate the expectation with a Monte Carlo average over several sampled trajectories or with a single sampled trajectory. We will see both later.
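
To see the estimator at work, here is a minimal sketch on a deliberately tiny problem: a one-step task ($T=0$) with three actions, a softmax policy whose logits are the parameters $\theta$, and fixed made-up rewards. These choices are assumptions for illustration only. In this setting the exact gradient is available in closed form, so the Monte Carlo average of $R(\tau)\nabla_{\theta}\log p(a|\theta)$ can be compared against it.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.5, 1.0])      # hypothetical logits, one per action
r = np.array([1.0, 2.0, 0.5])           # hypothetical reward of each action

# Softmax policy
p = np.exp(theta - theta.max())
p = p / p.sum()

# Exact gradient of J = sum_a p_a r_a with respect to the logits
exact = p * (r - p @ r)

# Monte Carlo estimate: average of R * grad log p(a | theta) over sampled actions
N = 200_000
actions = rng.choice(3, size=N, p=p)
scores = np.eye(3)[actions] - p          # grad_theta log p(a | theta) = e_a - p
estimate = (r[actions][:, None] * scores).mean(axis=0)
```

With this many samples the two vectors should agree to roughly two decimal places; using fewer samples makes the estimate noisier.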

### 3.1 Discrete action space
In the discrete action space case, computing the gradient is simple because we can use a pre-built cost function from machine learning libraries. To see this, note that the term inside the expectation (without the gradient), $R(\tau)\sum_{k=0}^{T} \log p(a_{k}|\theta)$, has, up to sign, the form of the [cross entropy cost function](https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html) used in classification, weighted by $R(\tau)$. Let `network(state)` build the pdf $\pi_{\theta}$ in the discrete action space case. For example, consider the following network

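```
import tensorflow as tf
from tensorflow import keras

# n_s: state dimension, n_a: number of actions (both given by the environment)
network = keras.Sequential([
            keras.layers.Dense(30, input_dim=n_s, activation='relu'),
            keras.layers.Dense(30, activation='relu'),
            # softmax output: one probability per action, summing to 1
            keras.layers.Dense(n_a, activation='softmax')])
```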

We define the loss function as cross entropy for this network

<code>network.compile(loss='categorical_crossentropy')</code>

Now, we have configured the network and all we need to do is to pass data to our network in the learning loop. To cast our problem to the cost function in the classification task, we need to define the true probability for the selected action. For example, if we have 5 different actions and the second action is sampled, the line below builds $[0,1,0,0,0]$

<code>target_action = tf.keras.utils.to_categorical(action, n_a)</code>

Now, we build the loss function for the network by giving the state, the target action, and the weight $R(\tau)$. The `network` takes the state as the input and outputs the probability of each action. The true probability is defined by `target_action`, and the cross entropy is weighted in the cost function by `R_tau`. Minimizing this weighted cross entropy is the same as ascending $R(\tau)\log p(a_k|\theta)$. That is it!

<code>loss = network.train_on_batch(state, target_action, sample_weight=R_tau)</code>
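
To make the connection with the policy gradient explicit, the following NumPy sketch (made-up probabilities and return, no Keras involved) shows that the cross entropy against a one-hot target reduces to $-\log p(a_k|\theta)$, so the weighted loss equals $-R(\tau)\log p(a_k|\theta)$.

```python
import numpy as np

n_a = 5
probs = np.array([0.1, 0.6, 0.1, 0.1, 0.1])   # hypothetical network output for one state
action = 1                                    # the second action was sampled
R_tau = 3.0                                   # hypothetical total reward of the trajectory

target_action = np.eye(n_a)[action]           # one-hot target [0, 1, 0, 0, 0]

# Categorical cross entropy between the one-hot target and the network output
cross_entropy = -np.sum(target_action * np.log(probs))

# Weighting by R(tau) gives the per-sample loss that is minimized
weighted_loss = R_tau * cross_entropy
```

Since only the entry of the sampled action is non-zero in `target_action`, every other term of the sum vanishes.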

### 3.2 Continuous action space
Remember that for a continuous action space, we have chosen a multivariate Gaussian distribution for $p(a_{k}|\theta)$; see Section 2.2. Taking the logarithm and differentiating, we have $\nabla_{\theta} \log p(a_{k}|\theta)= \frac{1}{\sigma^2}(a_{k}-\mu_{\theta}(s_{k}))\frac{d \mu_{\theta}(s_{k})}{d \theta}^T$. To evaluate the gradient, we sample a set $\mathcal{D}$ of trajectories and replace the expectation with the mean over the samples. $\nabla_{\theta} J$ then reads

\begin{align*}
\large
\nabla_{\theta} J = \dfrac{1}{\sigma^2 |\mathcal{D}|}\sum_{\tau \in \mathcal{D}} \sum_{k=0}^{T}(a_{k}-\mu_{\theta}(s_{k}))\frac{d \mu_{\theta}(s_{k})}{d \theta}^T R(\tau).
\end{align*}

For example, if we consider a linear policy $\mu_{\theta}(s_{k})= \theta \: s_{k}$, the above formula simplifies to

\begin{align*}
\large
\nabla_{\theta} J = \dfrac{1}{\sigma^2 |\mathcal{D}|}\sum_{\tau \in \mathcal{D}} \sum_{k=0}^{T}(a_{k}-\theta \:s_{k}) s_{k}^T R(\tau).
\end{align*}

Then, we can improve the policy parameter $\theta$ by gradient ascent

\begin{align*}
\theta = \theta +\alpha \nabla_{\theta} J 
\end{align*}
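
The linear-policy formula and the ascent step can be sketched in NumPy as follows. The function name `pg_linear`, the data layout (one row per time step), the made-up reward, and the step size `alpha` are assumptions for this illustration, not the code from the related files.

```python
import numpy as np

def pg_linear(trajectories, theta, sigma):
    # trajectories: list of (states [K, n_s], actions [K, n_a], rewards [K])
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        R_tau = rewards.sum()                       # total reward R(tau)
        residual = actions - states @ theta.T       # rows are a_k - theta s_k
        grad += residual.T @ states * R_tau         # sum_k (a_k - theta s_k) s_k^T R(tau)
    return grad / (sigma ** 2 * len(trajectories))

rng = np.random.default_rng(0)
n_s, n_a, sigma, alpha = 2, 1, 0.5, 0.01
theta = np.zeros((n_a, n_s))

# One hypothetical trajectory with 5 time steps
states = rng.normal(size=(5, n_s))
actions = states @ theta.T + sigma * rng.standard_normal((5, n_a))
rewards = -np.sum(actions ** 2, axis=1)             # made-up reward signal

grad = pg_linear([(states, actions, rewards)], theta, sigma)
theta_new = theta + alpha * grad                    # gradient ascent step
```

The matrix product `residual.T @ states` performs the inner sum over $k$ in one call, which avoids an explicit loop over time steps.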

## 4. Putting it all together
Now, we put all steps together to run this simple algorithm. 

First, we build a (deep) network to represent $\pi_\theta(s)=$ `network(state)`. Then, we iteratively improve the network. In each iteration of the algorithm, we do the following
* i. We roll out the environment to collect data for PG by following these steps:
    * i.a. We initialize empty histories for `states=[]`, `actions=[]`, `rewards=[]`
    * i.b. We observe the `state` $s$ and sample `action` $a$ from the policy pdf $\pi_{\theta}(s)$. See section 1.
    * i.c. We drive the environment using $a$ and observe the `reward` $r$.
    * i.d. We add $s,\:a,\:r$ to the history batch `states`, `actions`, `rewards`.
    * i.e. We continue from i.b. until the episode ends.
* ii. We improve the policy by following these steps
    * ii.a. We calculate the total reward. 
    * ii.b. We optimize the policy. See section 3.
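
The steps above can be sketched end to end in plain NumPy. Everything specific here is an assumption made for illustration: `ToyEnv` is a made-up two-action environment (action 1 pays reward 1), the policy is a linear-softmax stand-in for `network(state)`, and the update uses a single trajectory per iteration with a hand-picked step size.

```python
import numpy as np

class ToyEnv:
    """Hypothetical environment: action 1 pays reward 1, action 0 pays 0."""
    def __init__(self, horizon=5):
        self.horizon = horizon

    def reset(self):
        self.t = 0
        return np.array([1.0])                  # constant one-dimensional state

    def step(self, action):
        self.t += 1
        reward = float(action == 1)
        done = self.t >= self.horizon
        return np.array([1.0]), reward, done

def policy(theta, state):
    # Linear-softmax stand-in for network(state).
    logits = theta @ state
    p = np.exp(logits - logits.max())
    return p / p.sum()

rng = np.random.default_rng(0)
env = ToyEnv()
theta = np.zeros((2, 1))                        # n_a = 2 actions, n_s = 1 state dimension
alpha = 0.05                                    # hand-picked step size

for iteration in range(500):
    # i. Roll out one episode.
    states, actions, rewards = [], [], []       # i.a
    state, done = env.reset(), False
    while not done:
        probs = policy(theta, state)            # i.b
        action = rng.choice(2, p=probs)
        next_state, reward, done = env.step(action)   # i.c
        states.append(state)                    # i.d
        actions.append(action)
        rewards.append(reward)
        state = next_state                      # i.e
    # ii. Improve the policy.
    R_tau = sum(rewards)                        # ii.a
    grad = np.zeros_like(theta)
    for s, a in zip(states, actions):           # ii.b
        score = -policy(theta, s)
        score[a] += 1.0                         # gradient of log softmax w.r.t. logits: e_a - p
        grad += np.outer(score, s)
    theta = theta + alpha * R_tau * grad        # single-trajectory gradient ascent

final_probs = policy(theta, np.array([1.0]))
print(final_probs)
```

Because action 1 is the only rewarded action, the probability assigned to it should grow over the iterations. A single-trajectory update like this one is noisy; averaging over several trajectories per iteration is the Monte Carlo alternative mentioned in Section 3.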

## 5. Related files
* [Policy gradient for discrete action space: The cartpole example (study and code)](pg_on_cartpole_notebook.ipynb)
* [Policy gradient for discrete action space: The cartpole example (only code)](./cartpole/pg_on_cartpole.py)
* [Policy gradient for continuous action space: The linear quadratic (study and code)](pg_on_lq_notebook.ipynb)
* [Policy gradient for continuous action space: The linear quadratic (only code)](./lq/pg_on_lq.py)