# Policy Gradient (REINFORCE)

## 1. Policy-based RL

#### Value-based RL vs. Policy-based RL

$$
\begin{aligned}
V_\theta (s) &\approx V^\pi (s) \\[10pt]
Q_\theta (s, a) &\approx Q^\pi (s, a)
\end{aligned}
$$

#### Direct parameterization of the policy

$$
\pi_\theta (s, a) = \mathbb{P} [ a | s, \theta ] 
$$

+ The action is chosen **not from a value function**, but directly from **the features of the state**.
+ i.e. **policy gradient** methods approximate the policy directly.

#### Pros/Cons of policy-based RL

**Pros**

+ Better convergence properties
+ Effective in high-dimensional/continuous action spaces
+ Can learn **stochastic policies** (vs. action-value methods)

**Cons**

+ Typically converges to a local optimum rather than the global optimum
+ Evaluating a policy is typically inefficient and has high variance

#### When to adopt stochastic policies

**Rock-Paper-Scissors**

+ Any deterministic policy is easily exploited
+ The optimal policy is the uniform random policy, which is the Nash equilibrium
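
To make the exploitability argument concrete, the sketch below computes the payoff an opponent can obtain by best-responding to a fixed policy. The payoff matrix layout and the helper name `best_response_value` are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# Payoff for the row player; rows and columns are (rock, paper, scissors).
# +1 = win, 0 = draw, -1 = loss.
PAYOFF = np.array([
    [0, -1, 1],    # rock:     draws rock, loses to paper, beats scissors
    [1, 0, -1],    # paper:    beats rock, draws paper, loses to scissors
    [-1, 1, 0],    # scissors: loses to rock, beats paper, draws scissors
])

def best_response_value(policy):
    """Expected payoff of an opponent who best-responds to `policy`."""
    # policy @ PAYOFF gives our expected payoff against each opponent action;
    # the game is zero-sum, so the opponent receives the negative of that.
    opponent_payoffs = -(policy @ PAYOFF)
    return opponent_payoffs.max()

deterministic_rock = np.array([1.0, 0.0, 0.0])
uniform = np.ones(3) / 3

print(best_response_value(deterministic_rock))  # opponent wins every round
print(best_response_value(uniform))             # opponent cannot gain anything
```

Any deterministic policy gives the opponent a value of 1, while the uniform policy gives it 0, which is why the stochastic policy is the equilibrium here.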

**Aliased Gridworld**

+ The features might be incomplete, e.g. $\phi(s,a) = \mathbf{1}$ (wall to the **North** **AND** action is to move **East**)
+ A deterministic policy must pick the same action in both grey squares, because no feature differentiates them; a stochastic policy can randomize between the two directions instead

## 2. Policy gradient

A method of **updating the approximated policy** by gradient ascent on the objective function $J(\theta)$.

Policy gradient methods seek to maximize performance, so their updates approximate gradient ascent in $J$:

$$
\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha \widehat{\nabla J(\theta_t)} \\[10pt]
&\approx  \theta_t + \alpha \nabla_\theta J(\theta_t)
\end{aligned}
$$

where $\nabla_\theta J(\theta) = \Biggl( \begin{matrix} \frac{\partial J(\theta)}{\partial \theta_1} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_n} \end{matrix} \Biggr)$

#### Objective function $J$

**Goal** : given policy $\pi_\theta (s, a)$, **find the best $\theta$**

How do we measure the quality of a policy $\pi_\theta$? There are three common objectives, and the **policy gradient applies equally** to each.

+ **start value** : episodic environments $\big( J_1(\theta) = V^{\pi_\theta} (s_1) = \mathbb{E} _{\pi_\theta} [v_1] \big) $

+ **average value** : continuing environments $\big( J_{avV} (\theta) = \sum_s \mu^{\pi_\theta} (s) V^{\pi_\theta} (s) \big)$

+ **average reward per time-step** : $J_{avR} (\theta) = \sum_s \mu^{\pi_\theta} (s) \sum_a \pi_\theta (s,a) R^a_s$

where $\mu^{\pi_\theta} (s)$ : **stationary distribution** of the Markov chain induced by $\pi_\theta$, i.e. the long-run probability of the agent being in state $s$
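
As a rough illustration of the average-reward objective, the sketch below computes $\mu^{\pi_\theta}$ and $J_{avR}$ for a made-up MDP with two states and two actions. The arrays `P`, `R`, `pi` and the helper `stationary_distribution` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] are transition probabilities, R[s, a] are rewards.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.6, 0.4]],   # action 1
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
# A fixed stochastic policy pi[s, a]
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

def stationary_distribution(P_pi):
    """Solve mu = mu P_pi together with sum(mu) = 1 by least squares."""
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

# State-to-state transition matrix of the chain induced by the policy
P_pi = np.einsum("sa,ast->st", pi, P)
mu = stationary_distribution(P_pi)

# J_avR = sum_s mu(s) sum_a pi(s, a) R[s, a]
J_avR = np.sum(mu[:, None] * pi * R)
print(mu, J_avR)
```

For this particular chain the stationary distribution works out to $(2/3, 1/3)$, so $J_{avR} = \tfrac{2}{3} \cdot 0.7 + \tfrac{1}{3} \cdot 1.2 = 13/15$.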

#### Score Function

**Likelihood ratios** : exploit the following identity

$$
\begin{aligned}
\nabla_\theta \pi_\theta (s,a) &= \pi_\theta (s,a) \frac{\nabla_\theta \pi_\theta (s,a)} {\pi_\theta (s,a)} \\[10pt]
&= \pi_\theta (s,a) \nabla_\theta \log \pi_\theta (s, a)
\end{aligned}
$$

where $\nabla_\theta \log \pi_\theta (s, a)$ : **score function**


#### Softmax Policy

Unlike Deep SARSA/DQN, the **output is a probability for each action**, so the output layer uses the **softmax** function.

$$
\sigma(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}}
$$

+ **Weighted actions** using linear combination of features $\phi(s,a)^T \theta$
+ Probability of action is proportional to **exponentiated weight**

$$
\pi_\theta (s,a) \propto e^{\phi(s,a)^T \theta}
$$
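
A minimal NumPy sketch of a linear softmax policy and its score function follows. It relies on the standard result that, for this policy, $\nabla_\theta \log \pi_\theta (s,a) = \phi(s,a) - \mathbb{E}_{\pi_\theta} [\phi(s, \cdot)]$. The function names and the layout of the feature matrix `phi` (one row per action, for a single state) are assumptions made for the example.

```python
import numpy as np

def softmax_policy(phi, theta):
    """Action probabilities for one state.

    phi:   (n_actions, n_features) feature matrix, row a is phi(s, a)
    theta: (n_features,) policy parameters
    """
    prefs = phi @ theta
    prefs = prefs - prefs.max()        # subtract the max for numerical stability
    weights = np.exp(prefs)
    return weights / weights.sum()

def softmax_score(phi, theta, action):
    """Score function: phi(s, a) minus the expected feature vector under pi."""
    pi = softmax_policy(phi, theta)
    return phi[action] - pi @ phi

phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
theta = np.array([0.3, -0.2])
print(softmax_policy(phi, theta))      # probabilities, sum to 1
print(softmax_score(phi, theta, 1))    # gradient of log pi for action 1
```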

#### Gaussian Policy

+ **Policy is Gaussian distributed** i.e. $a \sim N(\mu(s), \sigma^2)$
+ **Mean** : a linear combination of state features $\mu (s) = \phi (s)^T \theta$
+ **Variance** : fixed/parameterized
+ **Score function**

$$
\nabla_\theta \log \pi_\theta (s,a) = \frac{(a - \mu(s)) \phi(s)} {\sigma^2}
$$
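
The Gaussian score function above can be checked numerically. The sketch below assumes a fixed standard deviation `sigma` and a single state feature vector `phi_s`; both function names are hypothetical.

```python
import numpy as np

def gaussian_log_prob(phi_s, theta, action, sigma):
    """log N(action; mu(s), sigma^2) with mu(s) = phi(s)^T theta."""
    mu = phi_s @ theta
    return -0.5 * np.log(2 * np.pi * sigma**2) - (action - mu) ** 2 / (2 * sigma**2)

def gaussian_score(phi_s, theta, action, sigma):
    """Score function: (a - mu(s)) * phi(s) / sigma^2."""
    mu = phi_s @ theta
    return (action - mu) * phi_s / sigma**2

phi_s = np.array([1.0, 2.0])
theta = np.array([0.5, -0.1])
print(gaussian_score(phi_s, theta, action=1.0, sigma=0.5))
```

When the sampled action lies above the mean, the score points in the direction that raises $\mu(s)$, and its magnitude shrinks as $\sigma^2$ grows.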

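$$
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_{\theta} \mathbf{v}_{\pi_\theta} (s_0) \\[15pt]
&\propto \sum_s d_{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta (a \mid s) q_{\pi_\theta} (s, a)
\end{aligned}
$$

where $d_{\pi_\theta}(s)$ : on-policy state distribution under $\pi_\theta$ (written $\mu_\pi (s)$ elsewhere in these notes)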

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \Big[ \nabla_\theta \log \pi_\theta (a \mid s) q_\pi (s, a) \Big]
$$

$$
\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \Big( R_{t+1} + \gamma \mathbf{w}_t^T \mathbf{x}_{t+1} - \mathbf{w}_t^T \mathbf{x}_t \Big) \mathbf{x}_t
$$

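The gradient can also be estimated by **finite differences**: perturb $\theta$ by a small amount $\varepsilon$ in the $k$-th dimension, where $u_k$ is the unit vector with 1 in the $k$-th component and 0 elsewhere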

$$
\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \varepsilon u_k) - J(\theta)} {\varepsilon}
$$
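
The finite-difference estimate, combined with the gradient ascent update $\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta)$, can be sketched as follows. The quadratic objective `J` is a stand-in chosen for the example (in RL, evaluating $J$ would require running the policy), and the function names are hypothetical.

```python
import numpy as np

def finite_difference_gradient(J, theta, eps=1e-5):
    """Estimate each partial derivative of J by perturbing one dimension at a time."""
    grad = np.zeros(theta.size)
    base = J(theta)
    for k in range(theta.size):
        u_k = np.zeros(theta.size)
        u_k[k] = 1.0                                   # unit vector for dimension k
        grad[k] = (J(theta + eps * u_k) - base) / eps
    return grad

def gradient_ascent(J, theta, alpha=0.1, n_steps=200):
    """Repeat theta <- theta + alpha * (estimated gradient of J)."""
    theta = np.asarray(theta, dtype=float).copy()
    for _ in range(n_steps):
        theta += alpha * finite_difference_gradient(J, theta)
    return theta

# Stand-in objective with its maximum at theta = (1, 1)
J = lambda th: -np.sum((th - 1.0) ** 2)

print(finite_difference_gradient(J, np.array([0.0, 3.0])))  # close to (2, -4)
print(gradient_ascent(J, np.array([0.0, 3.0])))             # close to (1, 1)
```

Each gradient estimate needs one extra evaluation of $J$ per parameter, which is why this approach scales poorly to policies with many parameters.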

#### Policy Gradient Theorem

**Policy gradient theorem** generalises the **likelihood ratio** approach to multi-step MDPs

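It gives an expression for the gradient $\nabla_\theta J(\theta)$ w.r.t. the **policy parameters** in terms of the **$Q$-function**, and that expression does not involve the derivative of the state distribution. The gradient can therefore be estimated from samples even if we do not know

+ the **true value function** $v_{\pi_\theta}$ (a sampled return can stand in for it)
+ the **state distribution** under $\pi_\theta$, or how it changes with $\theta$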

$$
\begin{aligned}
\nabla J(\theta) &\propto \sum_s \mu_\pi(s) \sum_a q_\pi (s,a) \nabla \pi (a | s, \theta) \\[10pt]
&= \mathbb{E}_\pi \Big[ \sum_a \nabla \pi_\theta (a|s) q^{\pi_\theta} (s,a) \Big]
\end{aligned}
$$

where $\mu_\pi(s)$ : on-policy state distribution under $\pi$

#### Proof of Policy Gradient Theorem

$$
\begin{aligned}
\nabla v_\pi (s) &= \nabla \Big[ \sum_a \pi (a | s) q_\pi (s, a) \Big], \forall s \in S \\[10pt]
&= \sum_a \Big[ \nabla \pi (a | s) q_\pi (s, a) + \pi (a | s) \nabla q_\pi (s,a) \Big] \ \ (\because \ \ \text{product rule and Leibniz rule}) \\[10pt]
&= \sum_a \Big[ \nabla \pi (a | s) q_\pi (s, a) + \pi (a | s) \nabla \sum_{s', r} p(s', r |s, a)(r + v_\pi (s') ) \Big] \\[10pt]
&= \sum_a \Big[ \nabla \pi (a | s) q_\pi (s, a) + \pi (a | s) \sum_{s'} p(s'|s, a) \nabla v_\pi (s')  \Big] \\[10pt]
&= \sum_{x \in S} \sum_{k=0}^\infty P(s \rightarrow x, k, \pi) \sum_a \nabla \pi (a | x) q_\pi (x, a)
\end{aligned}
$$

where $P(s \rightarrow x, k, \pi)$ : probability of transitioning from state $s$ to state $x$ in $k$ steps under $\pi$. The last line follows by repeatedly substituting the following expression for $\nabla v_\pi (s')$ into itself.

$$
\nabla v_\pi (s') = \sum_{a'} \Big[ \nabla \pi (a' | s') q_\pi (s', a') + \pi (a' | s') \sum_{s''} p(s''|s', a') \nabla v_\pi (s'')  \Big]
$$

Thus, 

$$
\begin{aligned}
\nabla J(\theta) &= \nabla v_\pi (s_0) \\[10pt]
&= \sum_s \eta(s) \sum_a \nabla \pi (a | s) q_\pi (s, a) \ \ \big(\because \ \ \eta(s) = \sum_{k=0}^\infty P(s_0 \rightarrow s, k, \pi)\big) \\[10pt]
&= \sum_{s'} \eta(s') \sum_s \mu(s) \sum_a \nabla \pi (a | s) q_\pi (s, a) \ \ \big(\because \ \ \mu(s) = \eta(s) / \textstyle\sum_{s'} \eta(s')\big) \\[10pt]
&\propto \sum_s \mu(s) \sum_a \nabla \pi (a | s) q_\pi (s, a)
\end{aligned}
$$

where $\eta(s)$ : average number of time steps spent in state $s$ in a single episode

## 3. REINFORCE : Monte Carlo Policy Gradient

**Model-free** RL

**Stochastic gradient ascent** : requires a way to obtain samples such that the expectation of the sample gradient is proportional to the actual gradient of the performance measure as a function of the parameter.

$$
\begin{aligned}
\nabla J(\theta) &\propto \sum_s \mu_\pi (s) \sum_a \nabla \pi (a | s) q_\pi (s, a) \\[10pt]
&=\sum_s \mu_\pi (s) \sum_a \pi(a |s) \frac{\nabla \pi (a | s)}{\pi(a |s)} q_\pi (s, a) \\[10pt]
&=\sum_s \mu_\pi (s) \sum_a \pi(a |s) \big[ \nabla \log \pi (a | s) q_\pi (s, a) \big] \\[10pt]
&= \mathbb{E}_\pi \big[ \nabla \log \pi (a | s) q_\pi (s, a) \big]
\end{aligned}
$$
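
The step from the sum over actions to an expectation is what makes sampling possible. The sketch below checks it on a one-state problem (so the sum over states disappears) with a tabular softmax policy, for which $\nabla_\theta \log \pi(a) = \text{onehot}(a) - \pi$. The action values `q` are made-up numbers and the function names are hypothetical.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def exact_gradient(theta, q):
    """sum_a grad pi(a) q(a) with pi = softmax(theta)."""
    pi = softmax(theta)
    # d pi(a) / d theta_k = pi(a) (1[a = k] - pi(k)), which sums to the line below
    return pi * (q - pi @ q)

def sampled_gradient(theta, q, n_samples, rng):
    """Monte Carlo estimate of E_pi[ grad log pi(a) q(a) ]."""
    pi = softmax(theta)
    actions = rng.choice(len(pi), size=n_samples, p=pi)
    scores = np.eye(len(pi))[actions] - pi          # one score vector per sample
    return (scores * q[actions][:, None]).mean(axis=0)

theta = np.array([0.5, -0.3, 0.1])
q = np.array([1.0, 0.0, 2.0])
rng = np.random.default_rng(0)

print(exact_gradient(theta, q))
print(sampled_gradient(theta, q, 200_000, rng))     # close to the exact gradient
```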

#### Updating Formula of REINFORCE

$$
\begin{aligned}
\theta_{t + 1} &\approx \theta_t + \alpha \big[ \nabla_\theta \log \pi_\theta (a | s) G_t \big]  (*) \\[10pt]
&= \theta_t - \alpha \big[ \nabla_\theta \big(- \log \pi_\theta (a | s) \big) G_t \big]
\end{aligned}
$$

#### REINFORCE with Baseline

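In order to reduce variance (and thus speed up learning), we **subtract a baseline $b(s)$** from the action value. The baseline must not depend on $a$, so it leaves the expected gradient unchanged: $\sum_a b(s) \nabla \pi(a | s, \theta) = b(s) \nabla \sum_a \pi(a | s, \theta) = b(s) \nabla 1 = 0$.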

$$
\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \Big( q_\pi (s,a) - b(s) \Big) \nabla \pi(a | s, \theta)
$$

$$
\theta_{t+1} := \theta_t + \alpha \big( G_t - b(S_t) \big) \frac{\nabla \pi (A_t | S_t, \theta_t) } {\pi (A_t | S_t, \theta_t)}
$$
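
The effect of a baseline can be seen on a one-state problem whose action values share a large offset. In the sketch below the values `q`, the choice of the state value as baseline, and the function names are all assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def per_sample_gradients(theta, q, baseline, n_samples, rng):
    """One gradient sample (q(a) - b) * grad log pi(a) per drawn action."""
    pi = softmax(theta)
    actions = rng.choice(len(pi), size=n_samples, p=pi)
    scores = np.eye(len(pi))[actions] - pi           # grad log pi for tabular softmax
    return scores * (q[actions] - baseline)[:, None]

rng = np.random.default_rng(0)
theta = np.zeros(3)
q = np.array([10.0, 11.0, 12.0])
b = softmax(theta) @ q       # baseline: state value v(s) = sum_a pi(a) q(a)

plain = per_sample_gradients(theta, q, 0.0, 200_000, rng)
with_baseline = per_sample_gradients(theta, q, b, 200_000, rng)

# Both estimators have (nearly) the same mean ...
print(plain.mean(axis=0), with_baseline.mean(axis=0))
# ... but the total variance is far smaller with the baseline
print(plain.var(axis=0).sum(), with_baseline.var(axis=0).sum())
```

Without the baseline every sample is scaled by a value around 11, even though only the differences between the action values carry information about which action is better.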

## 4. Algorithm

#### Overview

**1. Input** : a **differentiable** policy parameterization $\pi (a | s, \theta), \ \ \forall a \in A, s \in S, \theta \in \mathbb{R}^n$

**2. Initialize policy weights** $\theta$

**3. Repeat**

3.1. Generate an episode : $S_0, A_0, R_1, \cdots, S_{T-1}, A_{T-1}, R_T$, following $\pi(\cdot | \cdot, \theta)$

3.2. For each step of the episode, $t=0, \cdots, T-1$ :

    G_t <- return from step t
    theta <- theta + alpha * gamma^t * G_t * grad_theta log pi(A_t | S_t, theta)
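
The steps above can be sketched end to end with a tabular softmax policy. The corridor environment (states 0 to 3, start at 0, goal at 3, reward of -1 per step), the hyperparameters, and all function names are assumptions chosen to keep the example small; this is not the grid world or network used in the rest of these notes.

```python
import numpy as np

N_STATES, GOAL, N_ACTIONS = 4, 3, 2     # actions: 0 = left, 1 = right

def step(state, action):
    """Move along the corridor; every step costs -1 until the goal is reached."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    return next_state, -1.0, next_state == GOAL

def policy(theta, state):
    """Tabular softmax policy pi(. | state, theta)."""
    prefs = theta[state] - theta[state].max()
    weights = np.exp(prefs)
    return weights / weights.sum()

def generate_episode(theta, rng, max_steps=50):
    states, actions, rewards = [], [], []
    state = 0
    for _ in range(max_steps):
        action = rng.choice(N_ACTIONS, p=policy(theta, state))
        next_state, reward, done = step(state, action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:
            break
    return states, actions, rewards

def reinforce(n_episodes=3000, alpha=0.05, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(n_episodes):
        states, actions, rewards = generate_episode(theta, rng)
        # Returns G_t, computed backwards through the episode
        returns = np.zeros(len(rewards))
        G = 0.0
        for t in reversed(range(len(rewards))):
            G = rewards[t] + gamma * G
            returns[t] = G
        # theta <- theta + alpha * gamma^t * G_t * grad log pi(A_t | S_t, theta)
        for t, (s, a) in enumerate(zip(states, actions)):
            grad_log_pi = -policy(theta, s)
            grad_log_pi[a] += 1.0                    # onehot(a) - pi(. | s)
            theta[s] += alpha * gamma**t * returns[t] * grad_log_pi
    return theta

theta = reinforce()
for s in range(GOAL):
    print(s, policy(theta, s))    # probability of "right" should dominate
```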


#### Choose an action using the policy network

There is no need for $\varepsilon$-greedy exploration, since the **policy itself is stochastic**.

```python
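    def get_action(self, state):
        # Forward pass: the softmax output is the action distribution pi(. | state)
        policy = self.model.predict(state)[0]
        # Sample an action index according to those probabilities
        return np.random.choice(self.action_size, 1, p=policy)[0]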
```




#### Calculate the return

Update $\theta$ using $\theta_{t + 1} \approx \theta_t + \alpha \big[ \nabla_\theta \log \pi_\theta (a | s) G_t \big]$


```python
    def append_sample(self, state, action, reward):
        # Store one transition; the action is saved as a one-hot vector
        self.states.append(state[0])
        self.rewards.append(reward)
        act = np.zeros(self.action_size)
        act[action] = 1
        self.actions.append(act)

    def discount_rewards(self, rewards):
        # Float dtype avoids truncating the returns when the rewards are integers
        discounted_rewards = np.zeros(len(rewards), dtype=np.float32)
        running_add = 0.0
        # Work backwards: G_t = rewards[t] + discount_factor * G_{t+1}
        for t in reversed(range(len(rewards))):
            running_add = running_add * self.discount_factor + rewards[t]
            discounted_rewards[t] = running_add
        return discounted_rewards
```
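
As a quick sanity check of the backward recursion, here is a standalone version of `discount_rewards` (a free function instead of a method, with the discount factor passed as an argument, which is an adaptation for this example) applied to a three-step episode.

```python
import numpy as np

def discount_rewards(rewards, discount_factor):
    """Return G_t for every step t, using G_t = r_t + discount_factor * G_{t+1}."""
    discounted = np.zeros(len(rewards), dtype=np.float64)
    running_add = 0.0
    for t in reversed(range(len(rewards))):
        running_add = running_add * discount_factor + rewards[t]
        discounted[t] = running_add
    return discounted

# Rewards (1, 1, 1) with discount factor 0.5:
#   G_2 = 1
#   G_1 = 1 + 0.5 * 1   = 1.5
#   G_0 = 1 + 0.5 * 1.5 = 1.75
print(discount_rewards([1, 1, 1], 0.5))
```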

#### Create the loss function & training function to update policy network

```python
    def optimizer(self):
        # Written against the legacy Keras backend API; assumes
        # `from keras import backend as K` and `from keras.optimizers import Adam`
        action = K.placeholder(shape=[None, self.action_size])  # one-hot actions
        discounted_rewards = K.placeholder(shape=[None, ])      # returns G_t

        # Probability of the action actually taken, pi(a_t | s_t)
        action_prob = K.sum(action * self.model.output, axis=1)
        # log pi(a_t | s_t) * G_t; negated below, giving a return-weighted cross-entropy
        cross_entropy = K.log(action_prob) * discounted_rewards
        loss = -K.sum(cross_entropy)

        # create a training function to update policy network
        optimizer = Adam(lr=self.learning_rate)
        updates = optimizer.get_updates(self.model.trainable_weights, [],
                                        loss)
        train = K.function([self.model.input, action, discounted_rewards], [],
                           updates=updates)

        return train
```

**1. Loss function as the goal of policy network updating**

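Since the sampled return $G_t$ is treated as a constant w.r.t. $\theta$, it can be moved inside the gradient:

$$
\big[ \nabla_\theta \log \pi_\theta (a | s) \big] G_t = \nabla_\theta \big[ \log \pi_\theta (a | s) G_t \big]
$$

so the **objective of the policy network update** is $\log \pi_\theta (a | s) G_t$, and the **loss function** to minimize is its negative, $- \log \pi_\theta (a | s) G_t$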

**2. The meaning of the loss function : cross-entropy**

$$
H(p,y) = - \sum_i y_i \log p_i
$$

measures the **mismatch** between the ground truth $y$ (the one-hot **selected action**) and the prediction $p$ (the **policy**). With a one-hot $y$, it reduces to $- \log p_a$ for the selected action $a$.
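
A NumPy sketch of the same loss may help connect the two views: summing $-\log \pi_\theta (a_t | s_t) G_t$ over an episode is the same as weighting each step's cross-entropy by its return. The function names and the example numbers are assumptions for illustration.

```python
import numpy as np

def cross_entropy(y, p):
    """H(y, p) = -sum_i y_i log p_i, computed row by row."""
    return -np.sum(y * np.log(p), axis=1)

def reinforce_loss(probs, actions_one_hot, returns):
    """-sum_t log pi(a_t | s_t) * G_t."""
    action_prob = np.sum(actions_one_hot * probs, axis=1)   # pi(a_t | s_t)
    return -np.sum(np.log(action_prob) * returns)

probs = np.array([[0.5, 0.5],
                  [0.25, 0.75]])          # policy output at two time steps
actions_one_hot = np.array([[1.0, 0.0],
                            [0.0, 1.0]])  # actions actually taken
returns = np.array([2.0, 1.0])            # G_t for each step

loss = reinforce_loss(probs, actions_one_hot, returns)
weighted_ce = np.sum(cross_entropy(actions_one_hot, probs) * returns)
print(loss, weighted_ce)                  # the two values coincide
```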

**3. Gradient Ascent**

$$
\begin{aligned}
\theta_{t + 1} &\approx \theta_t + \alpha \big[ \nabla_\theta \log \pi_\theta (a | s) G_t \big] \\[10pt]
&= \theta_t - \alpha \big[ \nabla_\theta \big(- \log \pi_\theta (a | s) \big) G_t \big]
\end{aligned}
$$

#### Update the policy network

```python
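    def train_model(self):
        discounted_rewards = np.float32(self.discount_rewards(self.rewards))
        # Standardise the returns; subtracting the mean acts as a constant baseline
        discounted_rewards -= np.mean(discounted_rewards)
        # Small constant guards against division by zero when all returns are equal
        discounted_rewards /= np.std(discounted_rewards) + 1e-8

        # Assumes self.optimizer was rebound to the function returned by optimizer()
        self.optimizer([self.states, self.actions, discounted_rewards])
        # Clear the episode buffers for the next episode
        self.states, self.actions, self.rewards = [], [], []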
```
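
The standardisation step can be looked at in isolation. The helper below is a hypothetical standalone version; the `eps` term is an added safeguard for the case where every return in the episode is identical.

```python
import numpy as np

def normalize_returns(returns, eps=1e-8):
    """Shift returns to zero mean and scale them to (roughly) unit standard deviation."""
    returns = np.asarray(returns, dtype=np.float64)
    centered = returns - returns.mean()
    return centered / (returns.std() + eps)

print(normalize_returns([1.0, 2.0, 3.0, 4.0]))   # zero mean, unit spread
print(normalize_returns([5.0, 5.0, 5.0]))        # all zeros instead of NaN
```

After this step, actions followed by above-average returns are made more likely and those followed by below-average returns less likely, regardless of the overall reward scale.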

#### Extra) (Shannon's) Entropy

Entropy measures the information required to specify a system's state.

**Entropy** is defined as follows

$$
\mathfrak{E} (\pi) = - \sum_{\theta_i \in \Theta} \pi(\theta_i) \log \pi (\theta_i)
$$

where $\pi$ : probability mass function on $\Theta$, which is discrete.

Entropy is **different** from **standard deviation** as a measure of spread: a bimodal distribution with two well-separated modes, for example, has a large standard deviation but can have low entropy.

**Maximum entropy** : attained by the uniform distribution, $\pi(\theta_i) = 1/n, \ \ \forall i$

$$
\mathfrak{E} (\pi) = - \sum_{i=1}^n \frac{1}{n} \log(\frac{1}{n}) = \log n
$$

(**noninformative prior** for a discrete $\Theta$)
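
A short numerical sketch of these statements: the uniform distribution attains $\log n$, and entropy can rank two distributions differently from the standard deviation. The helper names and the two example distributions are assumptions for illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum_i p_i log p_i, using the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    return -np.sum(nonzero * np.log(nonzero))

def std(values, p):
    """Standard deviation of a discrete distribution p over `values`."""
    values = np.asarray(values, dtype=float)
    p = np.asarray(p, dtype=float)
    mean = np.sum(p * values)
    return np.sqrt(np.sum(p * (values - mean) ** 2))

# The uniform distribution over n = 4 outcomes has entropy log 4
print(entropy([0.25, 0.25, 0.25, 0.25]), np.log(4))

# Bimodal: two far-apart points. Larger spread, but lower entropy ...
print(std([-10, 10], [0.5, 0.5]), entropy([0.5, 0.5]))
# ... than a uniform distribution over three nearby points
print(std([-1, 0, 1], [1 / 3, 1 / 3, 1 / 3]), entropy([1 / 3, 1 / 3, 1 / 3]))
```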