# Policy Gradient from scratch

The idea of policy gradient is to learn the policy directly by optimizing the total reward. We do NOT build a model of the environment and we do NOT appeal to the Bellman equation. We aim to maximize the expected total reward


\begin{align*}
\large
J = \mathbf{E}_{\tau \sim \pi_{\theta}}[R(T)]
\end{align*}


where
* $\large \pi_{\theta}$ is the probability density function (pdf) of the policy and $\theta$ is the parameter vector. 
* $\large \tau$  is a trajectory obtained from sampling the policy and it is given by $\tau =(s_{1},\:a_{1},\:r_{1},\: s_{2},\: a_{2},\:r_{2},\:s_{3},...,s_{T+1})$ where $s_{t},\:a_{t},\:r_{t}$ are the state, action, reward at time $t$ and $T$ is the trajectory length. $\tau \sim \pi_{\theta}$ means that trajectory $\tau$ is generated by sampling actions from the pdf $\pi_{\theta}$.
* $R(T)$ is the undiscounted finite-horizon total reward
\begin{align*}
R(T)= \sum_{t=1}^{T} r_{t}
\end{align*}
* The expectation is taken over the probability of the trajectory

We would like to optimize the policy directly, for example by gradient ascent. So, we aim to obtain the gradient of $J$ with respect to $\theta$


\begin{align*}
\large
\nabla_{\theta} J.
\end{align*}


> _Algorithms that optimize the policy in this way are called Policy Gradient (PG) algorithms._ 

The log-derivative trick helps us to obtain the policy gradient $\large \nabla_{\theta} J$. 

## Log-derivative trick
The log-derivative trick depends on $\nabla_p \log p = \dfrac{1}{p}$. Assume that $p$ is a function of $\theta$. Then, using the chain rule, we have

\begin{align*}
\large
\nabla_{\theta} \log p = \nabla_{p} \log p \nabla_{\theta} p =  \dfrac{1}{p}\nabla_{\theta} p .
\end{align*}

Rearranging the above equation gives
\begin{align*}
\large
\nabla_{\theta} p =p \nabla_{\theta} \log p.
\end{align*}

This equation is called the log-derivative trick and plays a central role in PG. In the sequel, we define the main components in PG.
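
The identity can be checked numerically. The sketch below is only an illustration: it assumes a hypothetical scalar probability `p(theta)` given by the logistic function, and compares both sides of the trick using central finite differences.

```python
import math

def p(theta):
    """Hypothetical scalar probability: the logistic function of theta."""
    return 1.0 / (1.0 + math.exp(-theta))

def finite_diff(f, theta, eps=1e-6):
    """Central finite-difference approximation of df/dtheta."""
    return (f(theta + eps) - f(theta - eps)) / (2.0 * eps)

theta = 0.3

# Left-hand side: gradient of p with respect to theta.
grad_p = finite_diff(p, theta)

# Right-hand side: p times the gradient of log p.
grad_log_p = finite_diff(lambda th: math.log(p(th)), theta)
rhs = p(theta) * grad_log_p

print(grad_p, rhs)  # the two numbers should agree up to finite-difference error
```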

## 1. Defining the probability density function for the parametric policy
As we mentioned earlier, $\large \pi_{\theta}$ is the probability density function (pdf) of the policy and $\theta$ is the parameter vector. Depending on whether the action space is discrete or continuous, we define the pdf differently.

### 1.1 Discrete action space
If the action space is discrete, the policy network builds the pdf $\pi_{\theta}$. The policy network maps the state to the probability of each action. So, if there are $n_a$ actions, the policy network has $n_a$ outputs, each representing the probability of one action. Note that the outputs should sum to 1.

Let $\pi_{\theta}$ be generated by the function $\pi_{\theta}(s)=$ `network(state)`. 

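```
from tensorflow import keras  # assumes TensorFlow 2.x

# n_s: state dimension, n_a: number of actions (both given by the environment)
network = keras.Sequential([
            keras.layers.Dense(30, input_dim=n_s, activation='relu'),
            keras.layers.Dense(30, activation='relu'),
            keras.layers.Dense(n_a, activation='softmax')])  # outputs sum to 1
```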
In the above code, we build the network. The network takes a state of dimension $n_s$ as input and passes it through a fully connected layer with 30 neurons and relu activation, followed by another layer with 30 neurons and relu activation. The last layer has $n_a$ outputs with softmax activation, so that the output probabilities sum to one. The parameters $\theta$ of the network are the weights and biases of the layers. 

To draw a sample $a \sim \pi_{\theta}$, we do the following

<code>softmax_out = network(state)</code><br>
<code>a = np.random.choice(n_a, p=softmax_out.numpy()[0])</code>

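In the first line, we give the state `state` as the input to the function `network`, and the output `softmax_out` is the pdf $\pi_{\theta}$. In the second line, we select an action according to the pdf `softmax_out`; the index `[0]` picks the first row of the output, which assumes that `state` carries a leading batch dimension of size one.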

### 1.2 Continuous action space
If the action space is continuous, $\pi_{\theta}$ is selected as a diagonal Gaussian distribution $\pi_{\theta}=\mathcal{N}(\mu_{\theta},\Sigma)$. Then, the policy network maps the state to the mean $\mu_{\theta}$. For example, a linear policy can be represented by $\mu_{\theta}(s) =\theta \: s$ where $\theta$ is the linear gain and $s$ is the state. We consider $\Sigma= \sigma^2 I_{n_a}$, where $\sigma>0$ is a design parameter and $n_a$ is the dimension of the control input. 

Let $\mu_{\theta}$ be generated by the function `network(state)`. That is, $\mu_{\theta}(s)=$ `network(state)` takes the state variable `state` as the input and has the parameter vector $\theta$. To draw a sample $a \sim \mathcal{N}(\mu_{\theta},\sigma^2 I_{n_a})$, we do the following

<code>a = network(state) + sigma * np.random.randn(n_a)</code>
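
As a self-contained illustration of this sampling step, the sketch below replaces `network` with a hypothetical linear policy $\mu_{\theta}(s)=\theta \: s$. The dimensions, the gain `theta`, and the state are made-up values chosen only for the example. It draws one action and then checks empirically that many samples have mean close to $\mu_{\theta}(s)$ and standard deviation close to $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)

n_s, n_a = 3, 2          # hypothetical state and action dimensions
sigma = 0.5              # design parameter: standard deviation of the policy
theta = rng.normal(size=(n_a, n_s))   # hypothetical linear gain

def mu(state):
    """Mean of the Gaussian policy for a linear policy mu_theta(s) = theta s."""
    return theta @ state

state = np.array([1.0, -0.5, 2.0])

# One action sampled from N(mu_theta(s), sigma^2 I).
a = mu(state) + sigma * rng.standard_normal(n_a)

# Many samples, only to check the mean and spread empirically.
samples = mu(state) + sigma * rng.standard_normal((20000, n_a))
print(samples.mean(axis=0), mu(state))
print(samples.std(axis=0), sigma)
```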

## 2. Defining the probability of trajectory
$\tau \sim \pi_{\theta}$ means that a trajectory of the environment is generated by sampling actions from $\pi_{\theta}$. Let $s_{1}$ denote the initial state of the environment. The procedure is as follows. 
* We sample the action $a_{1}$ from the pdf; i.e. $a_{1}\sim \pi_{\theta}$. We drive the environment using $a_{1}$. The environment reveals the reward $r_{1}$ and transitions to a new state $s_{2}$.
* We sample the action $a_{2}$ from the pdf; i.e. $a_{2} \sim \pi_{\theta}$. We drive the environment using $a_{2}$. The environment reveals the reward $r_{2}$ and transitions to a new state $s_{3}$.

We continue the above procedure for $T$ steps and in the end, we get a trajectory $\tau =(s_{1},\:a_{1},\:r_{1},\: s_{2},\: a_{2},\:r_{2},\:s_{3},...,s_{T+1})$.


The probability of the trajectory $\tau $ is defined as follows


\begin{align*}
\large
P(\tau| \theta) = \prod_{t=1}^{T}p(s_{t+1}|s_{t},a_{t}) p(a_t|\theta).
\end{align*}


* $p(s_{t+1}|s_{t},a_{t})$ represents the dynamics of the environment; it gives the probability of the next state $s_{t+1}$ given the current state $s_{t}$ and the current action $a_{t}$. Note that in RL we do NOT know $p(s_{t+1}|s_{t},a_{t})$. You will see later that $p(s_{t+1}|s_{t},a_{t})$ is not needed in the computation.
* $p(a_{t}|\theta)$ is the likelihood function and it is obtained by evaluating the pdf $\pi_{\theta}$ at $a_{t}$. In the sequel, we will see how $p(a_{t}|\theta)$ is defined in discrete and continuous action spaces.


### 2.1 Discrete action space
If the action space is discrete, `network(state)` denotes the probability density function $\pi_{\theta}$. It is a vector with as many entries as there are actions, and the actions are the indices of the vector. So, $p(a_{t}|\theta)$ is obtained by indexing into the output vector `network(state)`. For example, if the second action is selected, $p(a_{t}|\theta)$ is given by the second output of the network.
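
The indexing step can be sketched without a deep network. The example below is an assumption-laden stand-in: a hypothetical linear layer `W` followed by a softmax plays the role of `network(state)`, and the likelihood of the sampled action is read off by indexing.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)

n_s, n_a = 4, 3                      # hypothetical dimensions
W = rng.normal(size=(n_a, n_s))      # hypothetical stand-in for the network weights
state = rng.normal(size=n_s)

probs = softmax(W @ state)           # plays the role of network(state)
action = rng.choice(n_a, p=probs)    # a_t sampled from pi_theta
likelihood = probs[action]           # p(a_t | theta): index into the output vector

print(action, likelihood)
```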

### 2.2 Continuous action space
Let the action space be continuous with dimension $n_a$. We consider a multi-variate Gaussian with mean $\mu_{\theta}(s)=$ `network(state)`,

\begin{align*}
p(a_{t}|\theta) = \dfrac{1}{\sqrt{(2 \pi \sigma^2)^{n_{a}}}} \exp[-\dfrac{1}{2\sigma^2} (a_{t}-\mu_{\theta}(s_{t}))^{\dagger}(a_{t}-\mu_{\theta}(s_{t}))].
\end{align*}
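
The density above translates directly into NumPy. The following is a minimal sketch with hypothetical function names (`gaussian_likelihood`, `gaussian_log_likelihood`) and made-up numbers. In practice the logarithm is usually the more convenient quantity, since the gradient in PG acts on $\log p(a_{t}|\theta)$.

```python
import numpy as np

def gaussian_likelihood(a, mu, sigma):
    """p(a | theta) for an isotropic Gaussian with mean mu and covariance sigma^2 I."""
    n_a = a.shape[0]
    diff = a - mu
    normalizer = np.sqrt((2.0 * np.pi * sigma**2) ** n_a)
    return np.exp(-0.5 * (diff @ diff) / sigma**2) / normalizer

def gaussian_log_likelihood(a, mu, sigma):
    """log p(a | theta), computed without exponentiating."""
    n_a = a.shape[0]
    diff = a - mu
    return -0.5 * (diff @ diff) / sigma**2 - 0.5 * n_a * np.log(2.0 * np.pi * sigma**2)

# Made-up action, mean, and standard deviation for illustration.
a = np.array([0.3, -1.2])
mu = np.array([0.0, -1.0])
sigma = 0.5

p_a = gaussian_likelihood(a, mu, sigma)
log_p_a = gaussian_log_likelihood(a, mu, sigma)
print(p_a, np.exp(log_p_a))  # both routes should give the same density
```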

## 3. Computing the gradient $\nabla_{\theta} J$
As we discussed earlier, in policy gradient, the parameter vector is learned by directly optimizing the reward function. So, we need to obtain $\nabla_{\theta} J$

\begin{align*}
\large
\nabla_{\theta} J&=\nabla_{\theta} \mathbf{E}\: [ R(T)]\\
&= \nabla_{\theta} \int_{\tau} P(\tau| \theta) R(T) \quad \text{replacing the expectation operator with the integral}\\
&= \int_{\tau} \nabla_{\theta} P(\tau| \theta) R(T) \quad \text{bringing the derivative inside}\\
&= \int_{\tau} P(\tau| \theta) \nabla_{\theta} \log P(\tau| \theta) R(T)\quad \text{using the trick}\\
&= \mathbf{E} [\nabla_{\theta} \log P(\tau| \theta) R(T)] \quad \text{replacing the integral with the expectation operator.}
\end{align*}

Remember that $P(\tau| \theta)$ is the probability of the trajectory which we defined in Section 2. So, $\nabla_{\theta} \log P(\tau| \theta)$ reads

\begin{align*}
\large
\nabla_{\theta} \log P(\tau| \theta)&=\nabla_{\theta} \sum_{t=1}^{T} \log p(s_{t+1}|s_{t},a_{t}) +\nabla_{\theta}\sum_{t=1}^{T} \log p(a_{t}|\theta)\\
&=\sum_{t=1}^{T}\nabla_{\theta} \log p(a_{t}|\theta).
\end{align*}

The first summation vanishes under $\nabla_{\theta}$ because the environment dynamics do not depend on the parameter vector $\theta$. Remember that we have defined $p(a_{t}|\theta)$ for continuous and discrete action spaces in section 2. So, we have

> \begin{align*}
\large
\nabla_{\theta} J= \mathbf{E}[R(T)\sum_{t=1}^{T}\nabla_{\theta} \log p(a_{t}|\theta)]. 
\end{align*}
This is the main equation in PG. In practice, one replaces the expectation with an average over sampled trajectories, or simply uses a single sampled trajectory and drops the expectation operator.
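
To see the main equation at work, consider a toy problem that is not part of the original examples: a hypothetical two-armed bandit with $T=1$, a softmax policy over two logits $\theta$, and fixed rewards per arm. Here $J$ and its gradient are available in closed form, so the sample average of $R(T)\nabla_{\theta}\log p(a|\theta)$ can be compared against the exact gradient.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)

theta = np.array([0.0, 0.0])        # policy logits (hypothetical parametrization)
rewards = np.array([1.0, 0.0])      # hypothetical reward of each arm
probs = softmax(theta)

# Exact gradient of J = sum_a p_a r_a for a softmax policy: p_k (r_k - J).
J = probs @ rewards
exact_grad = probs * (rewards - J)

# Monte Carlo estimate: average of R * grad log p(a | theta) over samples.
n_samples = 100_000
actions = rng.choice(2, size=n_samples, p=probs)
score = np.eye(2)[actions] - probs              # grad_theta log p(a | theta) for softmax
mc_grad = (rewards[actions][:, None] * score).mean(axis=0)

print(exact_grad, mc_grad)
```

Note that the estimate never uses the reward table directly inside the gradient; it only needs sampled actions, their rewards, and $\nabla_{\theta}\log p(a|\theta)$.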

### 3.1 Discrete action space
In the discrete action space case, computing the gradient is simple because we can use a pre-built cost function from machine learning libraries. To see this, note that the quantity inside the expectation before taking the gradient, $R(T)\sum_{t=1}^{T} \log p(a_{t}|\theta)$, is the negative of a [cross entropy cost function](https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html) weighted by $R(T)$, which is the cost used in classification tasks. Maximizing it is therefore the same as minimizing a weighted cross entropy. To do so, we recast our problem as a classification task: the network, with `network(state)` representing the pdf $\pi_{\theta}$, produces probabilities in its last layer, we define a true probability (the target), and the cost to be optimized is the cross entropy. We define a cross entropy cost for the network

<code>network.compile(loss='categorical_crossentropy')</code>

Now, we have configured the network and all we need to do is to pass data to it in the learning loop. To cast our problem into the cost function of the classification task, we need to define the true probability for the selected action. For example, if we have 5 different actions and the second action (index 1) is sampled, the line below builds $[0,1,0,0,0]$

<code>target_action = tf.keras.utils.to_categorical(action, n_a)</code>

Now, we build the loss function for the network by giving the state, the target action, and the weight $R(T)$. The `network` takes the state as input and produces the probability of each action at its output. The true probability is defined by `target_action`, and it is weighted in the cost function by `R_T`. That is it!

<code>loss = network.train_on_batch(state, target_action, sample_weight=R_T)</code>
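
The link between the weighted cross entropy and the PG objective can be verified by hand for a single sample. The sketch below uses made-up numbers and assumes that the sample weight simply scales the per-sample cross entropy. It shows that the weighted loss equals $-R(T)\log p(a_{t}|\theta)$, so minimizing the loss increases the reward-weighted log-likelihood.

```python
import numpy as np

probs = np.array([0.1, 0.6, 0.3])    # hypothetical network output for one state
action = 1                           # sampled action (second action, index 1)
R_T = 2.5                            # hypothetical total reward of the episode

# One-hot target, the same vector that to_categorical would build.
target_action = np.eye(len(probs))[action]

# Categorical cross entropy between the target and the predicted probabilities.
cross_entropy = -np.sum(target_action * np.log(probs))

# Weighting by the total reward (assumed to be a plain multiplication).
weighted_loss = R_T * cross_entropy

# The PG term for this sample: R(T) * log p(a_t | theta).
pg_term = R_T * np.log(probs[action])

print(weighted_loss, -pg_term)  # the two values coincide
```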

### 3.2 Continuous action space
Remember that for a continuous action space, we have chosen a multi-variate Gaussian distribution for $p(a_{t}|\theta)$, see sections 1.2 and 2.2. So, we have $\nabla_{\theta} \log p(a_{t}|\theta)= \frac{1}{\sigma^2}(a_{t}-\mu_{\theta}(s_{t}))\frac{d \mu_{\theta}(s_{t})}{d \theta}^{\dagger}$. To evaluate the gradient, we sample a set $\mathcal{D}$ of trajectories and replace the expectation with the mean over the samples. $\nabla_{\theta} J$ reads

\begin{align*}
\large
\nabla_{\theta} J = \dfrac{1}{\sigma^2 |\mathcal{D}|}\sum_{\tau \in \mathcal{D}} \sum_{t=1}^{T}(a_{t}-\mu_{\theta}(s_{t}))\frac{d \mu_{\theta}(s_{t})}{d \theta}^{\dagger} R(T).
\end{align*}

For example, if we consider a linear policy $\mu_{\theta}(s_{t})= \theta \: s_{t}$, the above expression simplifies to

\begin{align*}
\large
\nabla_{\theta} J = \dfrac{1}{\sigma^2 |\mathcal{D}|}\sum_{\tau \in \mathcal{D}} \sum_{t=1}^{T}(a_{t}-\theta \:s_{t}) s_t^{\dagger} R(T).
\end{align*}

Then, we can improve the policy parameter $\theta$ by a gradient ascent method with step size $\alpha>0$, e.g.

\begin{align*}
\theta = \theta +\alpha \nabla_{\theta} J 
\end{align*}
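
For the linear policy, the sample-mean gradient and the ascent step can be written in a few lines of NumPy. This is a sketch under stated assumptions, not the code from the related files: the names `pg_estimate` and `log_likelihood` are hypothetical, and the batch is filled with random states and placeholder total rewards only to exercise the formula.

```python
import numpy as np

def log_likelihood(theta, states, actions, sigma):
    """Sum over t of log p(a_t | theta) for the linear policy mu_theta(s) = theta s."""
    diff = actions - states @ theta.T                  # rows are a_t - theta s_t
    return (-0.5 * np.sum(diff**2) / sigma**2
            - 0.5 * actions.size * np.log(2.0 * np.pi * sigma**2))

def pg_estimate(theta, batch, sigma):
    """Sample-mean PG estimate; batch is a list of (states, actions, R_T) tuples."""
    grad = np.zeros_like(theta)
    for states, actions, R_T in batch:
        diff = actions - states @ theta.T              # shape (T, n_a)
        grad += R_T * diff.T @ states                  # sum_t (a_t - theta s_t) s_t^T R(T)
    return grad / (sigma**2 * len(batch))

rng = np.random.default_rng(0)
n_s, n_a, T, sigma, alpha = 3, 2, 5, 0.5, 1e-3         # made-up sizes and step size
theta = rng.normal(size=(n_a, n_s))

batch = []
for _ in range(4):
    states = rng.normal(size=(T, n_s))
    actions = states @ theta.T + sigma * rng.standard_normal((T, n_a))
    R_T = float(rng.normal())                          # placeholder total reward
    batch.append((states, actions, R_T))

grad = pg_estimate(theta, batch, sigma)
theta_new = theta + alpha * grad                       # one gradient ascent step
print(grad)
```

The function `log_likelihood` is not needed for the update itself; it is included so that the analytic gradient can be compared against finite differences of the reward-weighted log-likelihood.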

## 4. Putting it all together
Now, we put all steps together to run this simple algorithm. 

First, we build a (deep) network to represent $\pi_\theta(s)$= `network(state)`. Then, we iteratively improve the network. In each iteration of the algorithm, we do the following
* i. We roll out the environment to collect data for PG by following these steps:
    * i.a. We initialize empty histories `states=[]`, `actions=[]`, `rewards=[]`.
    * i.b. We observe the `state` $s$ and sample the `action` $a$ from the policy pdf $\pi_{\theta}(s)$. See section 1.
    * i.c. We drive the environment using $a$ and observe the `reward` $r$.
    * i.d. We add $s,\:a,\:r$ to the histories `states`, `actions`, `rewards`.
    * i.e. We continue from i.b. until the episode ends.
* ii. We improve the policy by following these steps:
    * ii.a. We calculate the total reward. 
    * ii.b. We optimize the policy. See section 3.
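
The steps above can be sketched end to end on a toy problem. Everything specific in this example is an assumption made for illustration: the environment `env_step` is hypothetical (the reward peaks at $a = 2s$ and the next state is drawn at random), the policy is a scalar linear Gaussian policy $\mu_{\theta}(s)=\theta \: s$, and the step size and episode length are arbitrary. It is not the cartpole or linear quadratic code from the related files.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 10          # episode length
sigma = 0.5     # standard deviation of the Gaussian policy
alpha = 1e-3    # step size for gradient ascent
theta = 0.0     # scalar linear gain, mu_theta(s) = theta * s

def env_step(state, action):
    """Hypothetical toy environment: best action is 2 * state, next state is random."""
    reward = -(action - 2.0 * state) ** 2
    next_state = rng.uniform(-1.0, 1.0)
    return next_state, reward

for iteration in range(2000):
    # i. Roll out the environment.
    states, actions, rewards = [], [], []                            # i.a
    state = rng.uniform(-1.0, 1.0)
    for t in range(T):
        action = theta * state + sigma * rng.standard_normal()       # i.b
        next_state, reward = env_step(state, action)                 # i.c
        states.append(state)                                         # i.d
        actions.append(action)
        rewards.append(reward)
        state = next_state                                           # i.e

    # ii. Improve the policy.
    s, a = np.array(states), np.array(actions)
    R_T = float(np.sum(rewards))                                     # ii.a
    grad = R_T * np.sum((a - theta * s) * s) / sigma**2              # ii.b, single-trajectory estimate
    theta = theta + alpha * grad

print(theta)  # should have moved from 0 toward the best gain of this toy problem, 2
```

With a single trajectory per update the gradient estimate is noisy, which is why the step size here is kept small.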

## 5. Related files
* [Policy gradient for discrete action space: The cartpole example (study and code)](pg_on_cartpole_notebook.ipynb)
* [Policy gradient for discrete action space: The cartpole example (only code)](./cartpole/pg_on_cartpole.py)
* [Policy gradient for continuous action space: The linear quadratic (study and code)](pg_on_lq_notebook.ipynb)
* [Policy gradient for continuous action space: The linear quadratic (only code)](./lq/pg_on_lq.py)