# Policy Gradient from scratch

The idea of policy gradient is to directly learn the policy by optimizing the reward function. We do NOT build a model of the environment and we do NOT appeal to the Bellman equation. We aim to maximize the expected total reward


\begin{align}
\large
J = \mathbf{E}_{\tau \sim \pi_{\theta}}[R(\tau)]
\end{align}


where
* $\large \pi_{\theta}$ is the probability density function (pdf) of the parametric policy and $\theta$ is the parameter vector. 
* $\large \tau$ is a trajectory obtained from sampling the policy.
* $R(\tau)$ is the undiscounted finite-horizon total reward.
* The expectation is taken over the probability of the trajectory.

We would like to directly optimize the policy, for example by gradient ascent on $J$ (equivalently, gradient descent on $-J$). So, we aim to obtain _the policy gradient_


\begin{align}
\large
\nabla_{\theta} J.
\end{align}


Algorithms that optimize the policy in this way are called Policy Gradient (PG) algorithms. In the sequel, we define the main concepts in PG.

## 1. Defining the probability density function of the parametric policy
As we mentioned earlier, $\large \pi_{\theta}$ is the probability density function (pdf) of the parametric policy and $\theta$ is the parameter vector. How we define the pdf depends on whether the action space is discrete or continuous.

### 1.1 Continuous action space
If the action space is continuous, $\large \pi_{\theta}$ is selected as a diagonal Gaussian distribution $\large \pi_{\theta}=\mathcal{N}(\mu_{\theta},\Sigma)$. The policy network then maps the state to the mean $\mu_{\theta}$. For example, a linear policy can be represented by $\mu_{\theta} =\theta x$, where $\theta$ is the linear gain and $x$ is the state. The diagonal covariance matrix $\Sigma$ can be independent of $\theta$ or a function of it. We consider $\Sigma= \sigma^2 I_{n_a}$, where $\sigma>0$ and $n_a$ is the dimension of the control input.

Let $\mu_{\theta}$ be generated by the function `network(s)`; that is, $\mu_{\theta}=$ `network(s)`, which takes the state variable `s` as input and has parameter vector $\theta$. To draw a sample $a \sim \mathcal{N}(\mu_{\theta},\sigma^2 I_{n_a})$, we evaluate `network(s)` and add zero-mean Gaussian noise with standard deviation $\sigma$ to each component of its output.
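
A minimal sketch of this sampling step, using only the Python standard library. The linear `network` with gain `theta`, the helper `sample_action`, and the numerical values are illustrative assumptions, not the network used in the related files.

```python
import random

def network(s, theta):
    """Hypothetical linear policy mean: mu = theta * s (theta has one row per action)."""
    return [sum(w * x for w, x in zip(row, s)) for row in theta]

def sample_action(s, theta, sigma, rng):
    """Draw a ~ N(network(s), sigma^2 I) by adding Gaussian noise to the mean."""
    mu = network(s, theta)
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mu]

rng = random.Random(0)      # seeded so the sketch is reproducible
theta = [[0.5, -1.0]]       # assumed gain: one action, two state variables
s = [1.0, 2.0]
a = sample_action(s, theta, sigma=0.1, rng=rng)
```

With `sigma=0.0` the sample coincides with the mean `network(s, theta)`, which is a quick way to check the function.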

### 1.2 Discrete action space
If the action space is discrete, the policy network builds the pdf $\pi_{\theta}$ directly. The policy network maps the state to the probability of each action. So, if there are $n_a$ actions, the policy network has $n_a$ outputs, each representing the probability of one action. Note that the outputs must sum to 1.

Let $\pi_{\theta}$ be generated by the function `network(s)`; that is, $\pi_{\theta}=$ `network(s)`, which takes the state variable `s` as input and has parameter vector $\theta$. To draw a sample $a \sim \pi_{\theta}$, we proceed in two steps.

First, we give the state `s` as the input to the function `network`; its output `softmax_out` is the pdf $\pi_{\theta}$. Second, we select an action at random according to the pdf `softmax_out`.
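
The two steps can be sketched as follows, using only the Python standard library. The linear-softmax `network`, the parameter values in `theta`, and the state `s` are assumptions made for illustration; any function that returns $n_a$ probabilities summing to 1 would do.

```python
import math
import random

def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def network(s, theta):
    """Hypothetical linear-softmax policy: one row of theta per action."""
    logits = [sum(w * x for w, x in zip(row, s)) for row in theta]
    return softmax(logits)

rng = random.Random(0)
theta = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # assumed parameters, n_a = 3
s = [0.2, -0.4]

softmax_out = network(s, theta)                                    # step 1: the pdf pi_theta
a = rng.choices(range(len(softmax_out)), weights=softmax_out)[0]   # step 2: sampled action index
```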

### 1.3 What does $\tau \sim \pi_{\theta}$ mean?
We defined the parametric pdfs for continuous and discrete action spaces. $\tau \sim \pi_{\theta}$ means that a trajectory of the environment is generated by sampling actions from $\pi_{\theta}$. The procedure is as follows. Let $s_0$ denote the initial state of the environment.
* We sample the action $a_0$ from the pdf; i.e. $a_0 \sim \pi_{\theta}$. We drive the environment using $a_0$. The environment reveals the reward $r_0$ and transitions to a new state $s_1$.
* We sample the action $a_1$ from the pdf; i.e. $a_1 \sim \pi_{\theta}$. We drive the environment using $a_1$. The environment reveals the reward $r_1$ and transitions to a new state $s_2$.

We continue the above procedure up to time step $T$ and, in the end, we get a trajectory $\tau =(s_0,\:a_0,\:r_0,\: s_1,\: a_1,\:r_1,\:s_2,\ldots,s_{T+1})$.
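
The procedure above can be sketched as a rollout function. The environment `ToyEnv`, its `reset`/`step` interface, and the uniform `policy` are hypothetical stand-ins chosen to keep the example self-contained; a real environment would replace them.

```python
import random

class ToyEnv:
    """Hypothetical 1-D environment: the state moves left or right by one."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s += 1 if a == 1 else -1    # action 1 moves right, action 0 moves left
        r = -abs(self.s)                 # reward: stay close to the origin
        return self.s, r

def policy(s, rng):
    """Placeholder stochastic policy: both actions equally likely."""
    return rng.choice([0, 1])

def rollout(env, policy, T, rng):
    """Collect (s_0, a_0, r_0, ..., s_{T+1}) by sampling actions from the policy."""
    states, actions, rewards = [env.reset()], [], []
    for _ in range(T + 1):               # time steps k = 0, ..., T
        a = policy(states[-1], rng)
        s_next, r = env.step(a)
        actions.append(a)
        rewards.append(r)
        states.append(s_next)
    return states, actions, rewards

states, actions, rewards = rollout(ToyEnv(), policy, T=4, rng=random.Random(0))
```

The returned lists hold $T+2$ states and $T+1$ actions and rewards, matching the indices of $\tau$.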

### 1.4 How to define the total reward $R(\tau)$ associated with the trajectory $\tau$?
The total reward of the trajectory $\tau =(s_0,\:a_0,\:r_0,\: s_1,\: a_1,\:r_1,\:s_2,\ldots,s_{T+1})$ is the sum of the rewards collected at time steps $0,\ldots,T$:
\begin{align}
R(\tau)= \sum_{t=0}^{T} r_t
\end{align}

## 2. Defining the probability of trajectory
The probability of the trajectory $\tau =(s_0,\:a_0,\: s_1,\: a_1,\:s_2,\ldots,s_{T+1})$ is defined as follows


\begin{align}
\large
P(\tau| \theta) = \prod_{k=0}^{T}p(s_{k+1}|s_{k},a_{k}) p(a_k|\theta).
\end{align}


* $p(s_{k+1}|s_{k},a_{k})$ represents the dynamics of the environment: it gives the probability of the next state $s_{k+1}$ given the current state $s_k$ and the current control $a_k$. Note that in RL we do NOT know $p(s_{k+1}|s_{k},a_{k})$. Don't worry. You'll see that we don't need it in our computation.
* $p(a_k|\theta)$ is the likelihood function and it is obtained by evaluating the pdf $\pi_{\theta}$ at $a_k$. It also depends on the state $s_k$ through `network(s_k)`; we omit this from the notation for brevity.

    * In continuous action space

        \begin{align}
        p(a_k|\theta) = \dfrac{1}{\sqrt{(2 \pi \sigma^2)^{n_a}}} \exp\left[-\dfrac{1}{2\sigma^2} (a_k-\text{network}(s_k))^T(a_k-\text{network}(s_k))\right]
        \end{align}

    * In discrete action space, `network(s)` denotes the probability density function $\pi_{\theta}$. It is a vector with as many entries as there are actions, and the actions are the indices of the vector. So, $p(a_k|\theta)$ is obtained by indexing into the output vector `network(s_k)` at $a_k$.
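
As a sketch of both cases, the functions below evaluate $\log p(a_k|\theta)$, which is the quantity needed later. The function names and the numerical inputs are assumptions; `mu` and `probs` stand in for the output of `network(s_k)`.

```python
import math

def gaussian_log_prob(a, mu, sigma):
    """log N(a; mu, sigma^2 I) for an action vector a of dimension n_a."""
    n_a = len(a)
    sq_dist = sum((ai - mi) ** 2 for ai, mi in zip(a, mu))
    return -0.5 * n_a * math.log(2.0 * math.pi * sigma ** 2) - sq_dist / (2.0 * sigma ** 2)

def discrete_log_prob(a, probs):
    """log p(a | theta) in the discrete case: index the network output with the action."""
    return math.log(probs[a])

# Continuous case: mu stands in for network(s_k)
lp_cont = gaussian_log_prob(a=[0.3], mu=[0.0], sigma=1.0)

# Discrete case: probs stands in for network(s_k)
lp_disc = discrete_log_prob(a=1, probs=[0.2, 0.5, 0.3])
```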



## 3. Log-derivative Trick
The log-derivative trick helps us to obtain the policy gradient $\large \nabla_{\theta} J$.

### 3.1 The trick
This is quite simple 

\begin{align}
\large
\nabla_x \log x = \dfrac{1}{x}.
\end{align}

Now, we combine it with the chain rule. Assume that $x$ is a function of $\theta$

\begin{align}
\large
\nabla_{\theta} \log x = \nabla_{x} \log x \nabla_{\theta} x =  \dfrac{1}{x}\nabla_{\theta} x .
\end{align}

So

\begin{align}
\large
\nabla_{\theta} x = x \nabla_{\theta} \log x. 
\end{align}


We will use the above equation.
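
A quick numerical sanity check of the identity $\nabla_{\theta} x = x \nabla_{\theta} \log x$ with finite differences. The function `x(theta)` is an arbitrary positive function chosen for illustration, and `finite_diff` is a helper introduced here.

```python
import math

def x(theta):
    """Hypothetical positive function of theta, so that log x is defined."""
    return theta ** 2 + 1.0

def finite_diff(f, theta, h=1e-6):
    """Central finite-difference approximation of df/dtheta."""
    return (f(theta + h) - f(theta - h)) / (2.0 * h)

theta = 0.7
lhs = finite_diff(x, theta)                                     # d x / d theta
rhs = x(theta) * finite_diff(lambda t: math.log(x(t)), theta)   # x * d log x / d theta
```

Both `lhs` and `rhs` approximate the exact derivative $2\theta = 1.4$.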

### 3.2  Computing $\nabla_{\theta} J(\pi_{\theta})$
As we discussed earlier, in policy gradient the parameter vector is learned by directly optimizing the expected total reward. So, we need to obtain $\nabla_{\theta} J(\pi_{\theta})$:

\begin{align}
\large
\nabla_{\theta} J&=\nabla_{\theta} \mathbf{E}\: [ R(\tau)]\\
&= \nabla_{\theta} \int_{\tau} P(\tau| \theta) R(\tau) \quad \text{replacing the expectation operator with the integral}\\
&= \int_{\tau} \nabla_{\theta} P(\tau| \theta) R(\tau) \quad \text{bringing the derivative inside}\\
&= \int_{\tau} P(\tau| \theta) \nabla_{\theta} \log P(\tau| \theta) R(\tau)\quad \text{using the trick}\\
&= \mathbf{E}_{\tau \sim \pi_{\theta}} [\nabla_{\theta} \log P(\tau| \theta) R(\tau)] \quad \text{replacing the integral with the expectation operator.}
\end{align}

Remember that $P(\tau| \theta)$ is the probability of the trajectory which we defined in Section 2. So, $\nabla_{\theta} \log P(\tau| \theta)$ reads

\begin{align}
\large
\nabla_{\theta} \log P(\tau| \theta)&=\nabla_{\theta} \sum_{k=0}^{T} \log p(s_{k+1}|s_{k},a_{k}) +\nabla_{\theta}\sum_{k=0}^{T} \log p(a_k|\theta)\\
&=\sum_{k=0}^{T}\nabla_{\theta} \log p(a_k|\theta).
\end{align}

The gradient of the first summation is zero because the environment dynamics $p(s_{k+1}|s_{k},a_{k})$ do not depend on the parameter vector $\theta$. Remember that we have defined $p(a_k|\theta)$ for continuous and discrete action spaces in Section 2.

The last step is to replace the expectation with a sample mean. Assume that a set $\mathcal{D}$ of trajectories is sampled by running the policy. Then,

\begin{align}
\large
\nabla_{\theta} J \approx \dfrac{1}{|\mathcal{D}|}\sum_{\tau \in \mathcal{D}} \sum_{k=0}^{T}\nabla_{\theta} \log p(a_k|\theta) R(\tau)
\end{align}
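
The sample-mean estimate can be sketched on a toy problem where the exact gradient is known. Everything specific here is an assumption made for illustration: the policy is a state-independent softmax over three actions with logits `theta`, and each action pays a fixed reward from `reward_per_action`. For a softmax, $\nabla_{\theta} \log p(a|\theta)$ is the one-hot vector of $a$ minus the probability vector.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(a, probs):
    """Gradient of log softmax(theta)[a] w.r.t. theta: one_hot(a) - probs."""
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(probs)]

def pg_estimate(theta, reward_per_action, T, num_traj, rng):
    """Sample-mean estimate of grad J from num_traj trajectories of T + 1 steps."""
    probs = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for _ in range(num_traj):
        actions = rng.choices(range(n), weights=probs, k=T + 1)
        R_tau = sum(reward_per_action[a] for a in actions)   # total reward R(tau)
        for a in actions:                                     # sum over k of grad log p(a_k | theta)
            for i, g in enumerate(grad_log_prob(a, probs)):
                grad[i] += g * R_tau
    return [g / num_traj for g in grad]

theta = [0.0, 0.0, 0.0]               # assumed logits of a state-independent policy
reward_per_action = [1.0, 0.0, 2.0]   # assumed toy rewards
grad_J = pg_estimate(theta, reward_per_action, T=1, num_traj=20000, rng=random.Random(0))
```

For this toy problem the exact gradient is $[0,\: -2/3,\: 2/3]$, so `grad_J` should be close to it up to sampling noise.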

### 3.3 Make use of Machine Learning libraries
Computing the gradient as written above might seem scary, but it is not. If the action space is discrete, the sample-mean objective (before taking the gradient) has the form of a [cross entropy cost function](https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html), which is used in classification tasks, with each term weighted by $R(\tau)$. Minimizing this weighted cross entropy is the same as maximizing $\sum_{k}\log p(a_k|\theta) R(\tau)$, whose gradient with respect to $\theta$ is the sum that appears in the sample estimate above. This means that we can use the prebuilt cost functions in machine learning libraries. Let `network(s)` build the pdf $\pi_{\theta}$ in the discrete action space case.

We define the loss function as cross entropy.

We map the sampled actions to one-hot target vectors, stored in `target_action`. For example, if we have 5 different actions and the second action is sampled, the target is $[0\:1\:0\:0\:0]$.

Finally, we build the loss function by giving the state, the target action, and the weighting $R(\tau)$. The `network` gets the state as input and outputs the probability of each action; the cross entropy compares these probabilities with the targets in `target_action`, and we weight each term by `R_tau`. That is it!
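
To make the weighted cross entropy concrete without committing to a particular library, here is a pure-Python sketch. The helpers `one_hot` and `weighted_cross_entropy` and the numbers in `probs_batch` are assumptions; `probs_batch` stands in for the output of `network` at two time steps of one trajectory. A machine learning library would compute the same quantity with its built-in cross entropy loss and per-sample weights.

```python
import math

def one_hot(a, n_a):
    """Map an action index to its one-hot target vector."""
    return [1.0 if i == a else 0.0 for i in range(n_a)]

def weighted_cross_entropy(probs_batch, target_batch, weights):
    """Mean over the batch of  -w * sum_i target_i * log(prob_i)."""
    total = 0.0
    for probs, target, w in zip(probs_batch, target_batch, weights):
        total += -w * sum(t * math.log(p) for t, p in zip(target, probs))
    return total / len(weights)

# Stand-ins for network(s_k) at two time steps of one trajectory
probs_batch = [[0.1, 0.6, 0.1, 0.1, 0.1], [0.2, 0.2, 0.2, 0.2, 0.2]]
actions = [1, 4]    # sampled action indices (the second and the fifth action)
R_tau = 3.0         # total reward of the trajectory

target_action = [one_hot(a, 5) for a in actions]
loss = weighted_cross_entropy(probs_batch, target_action, [R_tau] * len(actions))
```

Because the targets are one-hot, only the log-probability of the sampled action survives in each term, so `loss` equals the mean of $-\log p(a_k|\theta) R(\tau)$ over the batch.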

## 4. Putting all together
Now, we put all steps together to run this simple algorithm. 

First, we build a (deep) network to represent $\pi_{\theta}$ = `network(s)`. Then, we iteratively improve the network. In each iteration of the algorithm, we do the following
* i. We roll out the environment using the current policy by following these steps:
    * i.a. We initialize empty histories for `states=[]`, `actions=[]`, `rewards=[]`
    * i.b. We observe the `state` $s$ and sample `action` $a$ from the policy pdf $\pi_{\theta}(s)$. See 1.2.
    * i.c. We drive the environment using $a$ and observe the `reward` $r$.
    * i.d. We add $s,\:a,\:r$ to the history batch `states`, `actions`, `rewards`.
    * i.e. We continue from i.b. until the episode ends.
* ii. We improve the policy by following these steps
    * ii.a. We calculate the total reward. See 1.4
    * ii.b. We define the loss function and minimize it to update $\theta$.  See 3.3
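
The steps above can be sketched end to end in pure Python. This is a toy illustration under stated assumptions, not the code in the related files: `MatchEnv` is a hypothetical environment that pays reward 1 when the action equals a binary state, the policy is a tabular softmax with logits `theta[s]`, several trajectories are averaged per iteration, and the gradient of the weighted cross entropy with respect to the logits is written by hand as `(probs - one_hot) * R_tau` instead of using a library.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

class MatchEnv:
    """Hypothetical toy task: reward 1 when the action equals the binary state."""
    def __init__(self, horizon, rng):
        self.horizon, self.rng = horizon, rng

    def reset(self):
        self.t = 0
        self.s = self.rng.randrange(2)
        return self.s

    def step(self, a):
        r = 1.0 if a == self.s else 0.0
        self.t += 1
        self.s = self.rng.randrange(2)
        return self.s, r, self.t >= self.horizon

def network(s, theta):
    """Tabular softmax policy: theta[s] holds one logit per action."""
    return softmax(theta[s])

def train(num_iters=300, batch_size=10, lr=0.1, horizon=5, seed=0):
    rng = random.Random(seed)
    env = MatchEnv(horizon, rng)
    theta = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(num_iters):
        grad = [[0.0, 0.0], [0.0, 0.0]]
        for _ in range(batch_size):
            # i. roll out the environment with the current policy
            states, actions, rewards = [], [], []
            s, done = env.reset(), False
            while not done:
                a = rng.choices([0, 1], weights=network(s, theta))[0]
                s_next, r, done = env.step(a)
                states.append(s)
                actions.append(a)
                rewards.append(r)
                s = s_next
            # ii.a. total reward of the trajectory
            R_tau = sum(rewards)
            # ii.b. gradient of the loss -log p(a_k | theta) * R_tau w.r.t. the logits
            for s_k, a_k in zip(states, actions):
                probs = network(s_k, theta)
                for i in range(2):
                    target = 1.0 if i == a_k else 0.0
                    grad[s_k][i] += (probs[i] - target) * R_tau / batch_size
        # gradient-descent step on the loss
        for s_k in range(2):
            for i in range(2):
                theta[s_k][i] -= lr * grad[s_k][i]
    return theta

theta = train()
p_correct = [network(s, theta)[s] for s in range(2)]   # probability of the rewarded action
```

After training, `p_correct` should be well above the initial value of 0.5 for both states. The hyperparameters are untuned defaults chosen only to keep the run short.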

## 5. Related files
* [Policy gradient for continuous action space: The linear quadratic (study and code)](pg_on_lq_notebook.ipynb)
* [Policy gradient for continuous action space: The linear quadratic (only code)](./lq/pg_on_lq.py)
* [Policy gradient for discrete action space: The cartpole example (study and code)](pg_on_cartpole_notebook.ipynb)
* [Policy gradient for discrete action space: The cartpole example (only code)](./cartpole/pg_on_cartpole.py)