#### Bellman Error-minimization and Projected Bellman Error-minimization 
###### Notation basics
- State space $S = \{s_1, s_2, \ldots, s_n\}$; each state $s$ is described by a feature vector $\phi(s) = [\phi_1(s), \ldots, \phi_m(s)]$
- Action space with finite actions $A$
- Fixed, stochastic policy $\pi(a|s)$
- Value function under $\pi(a|s)$, $\boldsymbol{v}_{\pi}$
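- Linear approximation of the value function: $\boldsymbol{v}_{w}(s) = w^T \phi(s)$, i.e. $\boldsymbol{v}_{w} = \Phi w$ in vector form, where row $i$ of the $n \times m$ matrix $\Phi$ is $\phi(s_i)^T$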
- Expected reward $r(s,a)$ | $R_{\pi}(s) = \sum_{a \in A} \pi(a|s) r(s,a)$
- Transition probability $P(s,s',a)$ | $P_{\pi}(s,s') = \sum_{a \in A} \pi(a|s) P(s,s',a)$
- Discount factor $\gamma$
- Bellman operator $\boldsymbol{B}_{\pi} v = R_{\pi} + \gamma P_{\pi} v$
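- Projection operator $\boldsymbol{\Pi}_{\phi} = \Phi (\Phi^T D \Phi)^{-1} \Phi^T D$ onto the span of the features, where $D$ is a diagonal matrix of state weights (typically the stationary distribution under $\pi$) and the distance is $d(u, v) = (u - v)^T D (u - v)$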
###### Projection VF
$$\boldsymbol{\Pi}_{\phi} v_{\pi}: \boldsymbol{w}_{\pi} = \arg \min_{\boldsymbol{w}}d(v_{\pi}, v_{\boldsymbol{w}})$$
###### Bellman Error-minimization VF
$$\boldsymbol{w}_{BE} = \arg \min_{\boldsymbol{w}}d(\boldsymbol{B}_{\pi} v_{\boldsymbol{w}}, v_{\boldsymbol{w}})$$
i.e.,
$$
\begin{split}
\boldsymbol{w}_{BE} & = \arg \min_{\boldsymbol{w}}d(v_{\boldsymbol{w}}, R_{\pi}+\gamma P_{\pi}v_{\boldsymbol{w}})\\
& = \arg \min_{\boldsymbol{w}}d(\Phi \boldsymbol{w}, R_{\pi}+\gamma P_{\pi}\Phi \boldsymbol{w})\\
& = \arg \min_{\boldsymbol{w}}d(\Phi \boldsymbol{w} - \gamma P_{\pi}\Phi \boldsymbol{w}, R_{\pi})\\
& = \arg \min_{\boldsymbol{w}}d((\Phi - \gamma P_{\pi}\Phi) \boldsymbol{w}, R_{\pi})
\end{split}
$$
Let $A = \Phi - \gamma P_{\pi}\Phi$ and $b = R_{\pi}$. The weighted least-squares solution with weight matrix $D$ is
$$\boldsymbol{w}_{BE} = (A^T D A)^{-1} A^T D b$$
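The closed form above can be sketched numerically. This is a minimal NumPy sketch, assuming $D$ is the diagonal matrix of state weights; the 3-state chain, its features, and the function name `bellman_error_weights` are hypothetical and chosen only for illustration.

```python
import numpy as np

def bellman_error_weights(Phi, P_pi, R_pi, d, gamma):
    """Weighted least-squares solution w_BE = (A^T D A)^{-1} A^T D b."""
    D = np.diag(d)                    # state weights on the diagonal (assumed meaning of D)
    A = Phi - gamma * P_pi @ Phi      # A = Phi - gamma * P_pi * Phi
    b = R_pi                          # b = R_pi
    return np.linalg.solve(A.T @ D @ A, A.T @ D @ b)

# Hypothetical 3-state chain with 2 features per state.
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])
R_pi = np.array([1.0, 0.0, 2.0])
# P_pi is doubly stochastic, so the uniform distribution is stationary.
d = np.full(3, 1.0 / 3.0)
gamma = 0.9

w_be = bellman_error_weights(Phi, P_pi, R_pi, d, gamma)
print("w_BE =", w_be)
```

Solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse; it assumes $A$ has full column rank so that $A^T D A$ is invertible.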
###### Temporal Difference Error-minimization VF
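$$\boldsymbol{w}_{TDE} = \arg \min_{\boldsymbol{w}}\mathbb{E}_{\pi}[\delta^2]$$
where $\delta = r + \gamma v_{\boldsymbol{w}}(s') - v_{\boldsymbol{w}}(s)$ is the one-step TD error.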
###### Projected Bellman Error-minimization VF
$$\boldsymbol{w}_{PBE} = \arg \min_{\boldsymbol{w}}d(\Pi_{\phi} \boldsymbol{B}_{\pi} v_{\boldsymbol{w}}, v_{\boldsymbol{w}})$$
Given that $\min_{\boldsymbol{w}}d(\Pi_{\phi} \boldsymbol{B}_{\pi} v_{\boldsymbol{w}}, v_{\boldsymbol{w}}) = 0$, $\boldsymbol{w}_{PBE}$ satisfies the projected fixed-point equation, which we can solve directly as a linear system:
$$\Pi_{\phi} \boldsymbol{B}_{\pi} \Phi \boldsymbol{w}_{PBE} = \Phi \boldsymbol{w}_{PBE}$$
$$\Phi (\Phi^T D \Phi)^{-1} \Phi^T D (R_{\pi} + \gamma P_{\pi} \Phi \boldsymbol{w}_{PBE}) = \Phi \boldsymbol{w}_{PBE}$$
Left-multiplying both sides by $\Phi^T D$ cancels the projection:
$$ \Phi^T D (R_{\pi} + \gamma P_{\pi} \Phi \boldsymbol{w}_{PBE}) = (\Phi^T D \Phi)\boldsymbol{w}_{PBE}$$
Moving the $\gamma$ term to the right-hand side:
$$ \Phi^T D R_{\pi} = \Phi^T D (\Phi - \gamma P_{\pi} \Phi)\boldsymbol{w}_{PBE}$$
Let $A = \Phi^T D (\Phi - \gamma P_{\pi} \Phi)$ and $b = \Phi^T D R_{\pi}$
$$\boldsymbol{w}_{PBE} = A^{-1}b$$
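As a rough numerical illustration, the sketch below solves the projected fixed-point equation with $A = \Phi^T D (\Phi - \gamma P_{\pi} \Phi)$ and $b = \Phi^T D R_{\pi}$, the form obtained by moving the $\gamma$ term of the fixed-point equation to the right-hand side. The toy chain, the choice of $D$ as the stationary distribution, and the function name `projected_bellman_weights` are assumptions made for the example.

```python
import numpy as np

def projected_bellman_weights(Phi, P_pi, R_pi, d, gamma):
    """Solve A w = b with A = Phi^T D (Phi - gamma P_pi Phi), b = Phi^T D R_pi."""
    D = np.diag(d)
    A = Phi.T @ D @ (Phi - gamma * P_pi @ Phi)
    b = Phi.T @ D @ R_pi
    return np.linalg.solve(A, b)

# Hypothetical 3-state chain with 2 features per state.
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])
R_pi = np.array([1.0, 0.0, 2.0])
d = np.full(3, 1.0 / 3.0)   # stationary distribution of this doubly stochastic P_pi
gamma = 0.9

w_pbe = projected_bellman_weights(Phi, P_pi, R_pi, d, gamma)

# Check the fixed point: Pi B (Phi w) should equal Phi w.
D = np.diag(d)
Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)
residual = Pi @ (R_pi + gamma * P_pi @ (Phi @ w_pbe)) - Phi @ w_pbe
print("w_PBE =", w_pbe, "max |residual| =", np.abs(residual).max())
```

With tabular features ($\Phi = I$) the same function returns the exact value function $(I - \gamma P_{\pi})^{-1} R_{\pi}$, which is a convenient sanity check.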

#### Policy Gradient Theorem
###### Notation Basics
- Discount factor $\gamma$
- State $s_t$, action $a_t$ and reward $r_t$
- Transition probability $P_{s,s'}^a$
- Expected reward $R_s^a = \mathbb{E}[r_t|s_t = s, a_t = a]$
- Initial state distribution $p_0$
- Parameterized policy $\pi(s,a|\theta)$ (also written $\pi(s,a,\theta)$ below)

###### Obj. Function
$$J(\theta) = \int_{s} p_0(s_0) V_{\pi}(s_0) ds_0 = \int_{s} p_0(s_0) \int_a \pi(s_0, a_0, \theta) Q_{\pi}(s_0, a_0) da_0 ds_0$$

###### Proof of Policy Gradient Theorem
$$
\begin{split}
\nabla_{\theta} J(\theta) & = \nabla_{\theta} \int_{s} p_0(s_0) \int_a \pi(s_0, a_0, \theta) Q_{\pi}(s_0, a_0) da_0 ds_0\\
& = \int_{s} p_0(s_0) \int_a \nabla_{\theta} \pi(s_0, a_0, \theta) Q_{\pi}(s_0, a_0) da_0 ds_0  + \int_{s} p_0(s_0) \int_a \pi(s_0, a_0, \theta) \nabla_{\theta} Q_{\pi}(s_0, a_0) da_0 ds_0 \\
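& = \int_{s} p_0(s_0) \int_a \nabla_{\theta} \pi(s_0, a_0, \theta) Q_{\pi}(s_0, a_0) da_0 ds_0 + \int_{s} p_0(s_0) \int_a \pi(s_0, a_0, \theta) \nabla_{\theta} \left(R_{s_0}^{a_0} + \gamma \int_{s} P_{s_0, s_1}^{a_0} V_{\pi}(s_1) ds_1\right) da_0 ds_0\\
& = \int_{s} p_0(s_0) \int_a \nabla_{\theta} \pi(s_0, a_0, \theta) Q_{\pi}(s_0, a_0) da_0 ds_0 + \int_{s} p_0(s_0) \int_a \pi(s_0, a_0, \theta) \gamma \int_{s} P_{s_0, s_1}^{a_0} \nabla_{\theta} V_{\pi}(s_1) ds_1 da_0 ds_0\\
& = \int_{s} p_0(s_0) \int_a \nabla_{\theta} \pi(s_0, a_0, \theta) Q_{\pi}(s_0, a_0) da_0 ds_0 + \int_{s} \gamma \int_{s} p_0(s_0) p(s_0,s_1; \pi) ds_0 \nabla_{\theta} V_{\pi}(s_1) ds_1\\
& = \int_{s} p_0(s_0) \int_a \nabla_{\theta} \pi(s_0, a_0, \theta) Q_{\pi}(s_0, a_0) da_0 ds_0 + \int_{s} \gamma \int_{s} p_0(s_0) p(s_0,s_1; \pi) ds_0 \int_a \nabla_{\theta} \pi(s_1,a_1|\theta) Q_{\pi}(s_1, a_1) da_1 ds_1 \\
& \quad + \int_{s} \gamma \int_{s} p_0(s_0) p(s_0,s_1; \pi) ds_0 \int_a \pi(s_1,a_1|\theta) \nabla_{\theta} Q_{\pi}(s_1, a_1) da_1 ds_1\\
& = ...\\
& = \sum_{t=0}^{\infty} \int_{s} \int_{s} \gamma^t p_0(s_0) p(s_0,s_t,t; \pi) ds_0 \int_a \nabla_{\theta} \pi(s_t,a_t|\theta) Q_{\pi}(s_t, a_t) da_t ds_t
\end{split}
$$
Here $p(s_0,s_1; \pi) = \int_a \pi(s_0, a_0, \theta) P_{s_0, s_1}^{a_0} da_0$ is the one-step transition density under $\pi$, and $p(s_0,s_t,t; \pi)$ is the density of reaching $s_t$ from $s_0$ in $t$ steps. Each unrolling step leaves behind one $\nabla_{\theta} \pi \, Q_{\pi}$ term and pushes the remaining $\nabla_{\theta} Q_{\pi}$ term one step further, discounted by $\gamma$. Writing $\rho_{\pi}(s) = \sum_{t=0}^{\infty} \gamma^t \int_{s} p_0(s_0) p(s_0,s,t; \pi) ds_0$ for the discounted state distribution, the result is
$$\nabla_{\theta} J(\theta) = \int_s \rho_{\pi}(s) \int_a \nabla_{\theta} \pi(s,a|\theta) Q_{\pi}(s,a) da ds$$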
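The theorem can be checked numerically on a small finite MDP, where the integrals become sums. The sketch below is only an illustration under stated assumptions: a randomly generated 3-state, 2-action MDP, a tabular softmax policy, and hypothetical helper names (`policy`, `evaluate`, `policy_gradient`, `finite_difference_gradient`). It compares $\sum_s \rho_{\pi}(s) \sum_a \nabla_{\theta} \pi(s,a|\theta) Q_{\pi}(s,a)$ against a central finite difference of $J(\theta)$.

```python
import numpy as np

def policy(theta):
    """Tabular softmax policy: pi[s, a] is proportional to exp(theta[s, a])."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(theta, P, r, p0, gamma):
    """Return J(theta), pi, Q_pi and the discounted state weights rho_pi."""
    pi = policy(theta)
    P_pi = np.einsum("sa,ast->st", pi, P)           # P_pi[s, s']
    R_pi = (pi * r).sum(axis=1)                     # R_pi[s]
    M = np.linalg.inv(np.eye(len(p0)) - gamma * P_pi)
    V = M @ R_pi                                    # V_pi
    Q = r + gamma * np.einsum("ast,t->sa", P, V)    # Q_pi[s, a]
    rho = p0 @ M                                    # sum_t gamma^t Pr(s_t = s)
    return p0 @ V, pi, Q, rho

def policy_gradient(theta, P, r, p0, gamma):
    """Gradient from the theorem: rho(s) * sum_a grad pi(s,a) * Q(s,a)."""
    _, pi, Q, rho = evaluate(theta, P, r, p0, gamma)
    V = (pi * Q).sum(axis=1, keepdims=True)
    # For tabular softmax, sum_a dpi(s,a)/dtheta[s,b] Q(s,a) = pi(s,b) (Q(s,b) - V(s)).
    return rho[:, None] * pi * (Q - V)

def finite_difference_gradient(theta, P, r, p0, gamma, eps=1e-6):
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        j_plus = evaluate(theta + step, P, r, p0, gamma)[0]
        j_minus = evaluate(theta - step, P, r, p0, gamma)[0]
        grad[idx] = (j_plus - j_minus) / (2 * eps)
    return grad

# Hypothetical random MDP: P[a, s, s'] transition probabilities, r[s, a] rewards.
rng = np.random.default_rng(0)
n_s, n_a, gamma = 3, 2, 0.9
P = rng.random((n_a, n_s, n_s))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_s, n_a))
p0 = np.array([0.5, 0.3, 0.2])
theta = rng.normal(size=(n_s, n_a))

pg = policy_gradient(theta, P, r, p0, gamma)
fd = finite_difference_gradient(theta, P, r, p0, gamma)
print("max abs difference:", np.abs(pg - fd).max())
```

The two gradients should agree up to finite-difference error, which is what the theorem predicts for this toy problem.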

#### Score function for softmax policy
$$\pi(s,a|\theta) = \frac{\exp(\theta^T \phi(s,a))}{\sum_b \exp(\theta^T \phi(s,b))}$$
$$\nabla_{\theta} \log \pi(s,a|\theta) = \phi(s,a) - \sum_b \pi(s,b|\theta) \phi(s,b)$$
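For a softmax policy the score is the feature vector of the chosen action minus the policy-weighted average of the feature vectors of all actions. A minimal NumPy sketch follows; the feature matrix `phi_s` (one row per action for a fixed state), the parameter values, and the function names are hypothetical.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """Action probabilities for one state; phi_s has shape (n_actions, m)."""
    logits = phi_s @ theta
    logits -= logits.max()          # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def softmax_score(theta, phi_s, a):
    """grad_theta log pi(s, a | theta) = phi(s, a) - sum_b pi(s, b) phi(s, b)."""
    p = softmax_policy(theta, phi_s)
    return phi_s[a] - p @ phi_s

# Hypothetical example: 3 actions, 3 features.
theta = np.array([0.2, -0.1, 0.4])
phi_s = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.5],
                  [1.0, 1.0, 0.0]])
a = 1
score = softmax_score(theta, phi_s, a)
print("score:", score)
```

A useful property to check is that the score averages to zero under the policy, i.e. $\sum_a \pi(s,a|\theta) \nabla_{\theta} \log \pi(s,a|\theta) = 0$.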
#### Score function for Gaussian policy
$$\pi(s,a|\theta) = \frac{1}{\sqrt{2\pi \sigma^2}}\exp\left(-\frac{(a - \theta^T \phi(s))^2}{2\sigma^2}\right)$$
$$\nabla_{\theta} \log \pi(s,a|\theta) =\frac{(a - \theta^T \phi(s))\phi(s)}{\sigma^2}$$
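The Gaussian score can be checked against a finite difference of the log-density. This is a small sketch assuming a fixed standard deviation $\sigma$ and a mean $\theta^T \phi(s)$ that is linear in the features; the numbers and function names are made up for illustration.

```python
import numpy as np

def gaussian_log_prob(theta, phi_s, a, sigma):
    """log pi(s, a | theta) for a Gaussian policy with mean theta^T phi(s)."""
    mu = theta @ phi_s
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - (a - mu) ** 2 / (2.0 * sigma**2)

def gaussian_score(theta, phi_s, a, sigma):
    """grad_theta log pi(s, a | theta) = (a - theta^T phi(s)) phi(s) / sigma^2."""
    return (a - theta @ phi_s) * phi_s / sigma**2

# Hypothetical example values.
theta = np.array([0.5, -0.3])
phi_s = np.array([1.0, 2.0])
a, sigma = 0.7, 0.8

score = gaussian_score(theta, phi_s, a, sigma)

# Central finite difference of the log-density, one coordinate at a time.
eps = 1e-6
fd = np.array([
    (gaussian_log_prob(theta + eps * e, phi_s, a, sigma)
     - gaussian_log_prob(theta - eps * e, phi_s, a, sigma)) / (2 * eps)
    for e in np.eye(len(theta))
])
print("analytic:", score, "finite difference:", fd)
```

The score points along $\phi(s)$ and is scaled by how far the sampled action lies from the mean, so actions above the mean push $\theta^T \phi(s)$ upward.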

#### Compatible Function Approximation Theorem
If 1) the critic gradient is compatible with the actor score function:
$$\nabla_{\theta} \log \pi(s,a|\theta) = \nabla_{w} Q(s,a,w)$$
and 2) the critic parameters $w$ minimize the following mean-squared error:
$$ \int_s \rho_{\pi}(s) \int_a \pi(s,a,\theta) (Q_{\pi}(s,a) - Q(s,a,w))^2 da ds$$
then the policy gradient computed with the critic $Q(s,a,w)$ is exact:
$$\nabla_{\theta} J(\theta) = \int_s \rho_{\pi}(s) \int_a \nabla_{\theta} \pi(s,a,\theta) Q(s,a,w) da ds$$

###### Proof
From 2), the gradient of the mean-squared error with respect to $w$ vanishes at the minimum:
$$ \int_s \rho_{\pi}(s) \int_a \pi(s,a,\theta) (Q_{\pi}(s,a) - Q(s,a,w)) \nabla_{w} Q(s,a,w) da ds = 0$$
Substituting 1) for $\nabla_{w} Q(s,a,w)$:
$$ \int_s \rho_{\pi}(s) \int_a \pi(s,a,\theta) (Q_{\pi}(s,a) - Q(s,a,w)) \nabla_{\theta} \log \pi(s,a|\theta) da ds = 0$$
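Rearranging, and using $\pi \nabla_{\theta} \log \pi = \nabla_{\theta} \pi$, the left-hand side below equals $\nabla_{\theta} J(\theta)$ by the Policy Gradient Theorem: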
$$ \int_s \rho_{\pi}(s) \int_a \pi(s,a,\theta) Q_{\pi}(s,a) \nabla_{\theta} \log \pi(s,a|\theta) da ds = \int_s \rho_{\pi}(s) \int_a \pi(s,a,\theta) Q(s,a,w) \nabla_{\theta} \log \pi(s,a|\theta) da ds$$
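The theorem can be illustrated on a small finite MDP. The sketch below is an assumption-laden toy, not a general implementation: it uses a random 3-state, 2-action MDP, a tabular softmax policy, and a linear critic $Q(s,a,w) = w^T \nabla_{\theta} \log \pi(s,a|\theta)$, which satisfies condition 1) by construction. Condition 2) is met by fitting $w$ with weighted least squares under the weights $\rho_{\pi}(s)\pi(s,a|\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 3, 2, 0.9

# Hypothetical random MDP: P[a, s, s'] transitions, r[s, a] rewards.
P = rng.random((n_a, n_s, n_s))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_s, n_a))
p0 = np.full(n_s, 1.0 / n_s)

# Tabular softmax policy pi[s, a].
theta = rng.normal(size=(n_s, n_a))
pi = np.exp(theta)
pi /= pi.sum(axis=1, keepdims=True)

# Exact Q_pi and discounted state weights rho_pi.
P_pi = np.einsum("sa,ast->st", pi, P)
M = np.linalg.inv(np.eye(n_s) - gamma * P_pi)
V = M @ (pi * r).sum(axis=1)
Q = r + gamma * np.einsum("ast,t->sa", P, V)
rho = p0 @ M

# Score features psi(s, a) = grad_theta log pi(s, a | theta), one row per (s, a).
psi = np.zeros((n_s, n_a, n_s, n_a))
for s in range(n_s):
    for a in range(n_a):
        psi[s, a, s, :] = -pi[s]
        psi[s, a, s, a] += 1.0
psi = psi.reshape(n_s * n_a, n_s * n_a)

# Condition 2): w minimizes the rho * pi weighted squared error.
weights = (rho[:, None] * pi).reshape(-1)
q = Q.reshape(-1)
sqrt_w = np.sqrt(weights)
w, *_ = np.linalg.lstsq(sqrt_w[:, None] * psi, sqrt_w * q, rcond=None)
Q_w = psi @ w                                   # compatible critic values

# Policy gradient with the true Q_pi versus with the critic (grad pi = pi * psi).
grad_true = psi.T @ (weights * q)
grad_critic = psi.T @ (weights * Q_w)
print("max abs difference:", np.abs(grad_true - grad_critic).max())
```

The two gradients agree even though `Q_w` differs from `Q`: in this tabular example the fitted critic has zero mean under $\pi$ in every state, so it behaves like an advantage estimate rather than matching $Q_{\pi}$ itself. `np.linalg.lstsq` is used because the score features are linearly dependent here, so the normal equations do not have a unique solution.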

#### REINFORCE Algorithm (Monte-Carlo Policy Gradient Algorithm, i.e., no Critic)