# Policy Gradient Methods

### Aim

Policy gradient methods aim to optimize the policy parameters w.r.t. the expected cost-to-go by gradient descent.

As in reinforcement learning and optimal control in general, the aim is to find policy parameters $\theta$ that minimize the cost-to-go $J^*$, which satisfies the recursion
\begin{equation}
J^*(s, \theta, t) = C(s, \theta, t) + \gamma \langle J^*(s, \theta, t+1)\rangle
\end{equation}
The marginal cost is described by the function $C(\cdot)$ with state $s$ and time $t$. The discount factor $\gamma$ diminishes the value of future cost, and the angle brackets $\langle\cdot\rangle$ denote the expectation. To find the policy with the optimal cost-to-go, the policy parameters are updated with step size $\alpha$ according to the gradient descent rule

\begin{equation}
\theta_{t+1} = \theta_t - \alpha \nabla_\theta J^*_{\theta_t}
\end{equation}
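
When a differentiable model of the system is available, the gradient in the update rule can be computed directly by automatic differentiation. The following is a minimal sketch under assumptions that are not part of the original notebook: a deterministic double integrator discretized with step `dt`, a quadratic stage cost, a linear policy $u = \theta^\top s$, and a finite horizon `n_steps`. The names `rollout_cost`, `dt`, `gamma`, and `alpha` are illustrative. Because the cost is minimized, each step is taken against the gradient.

```python
import jax
import jax.numpy as jnp

# Assumed (hypothetical) problem setup, not taken from the notebook.
dt = 0.1       # integration step
gamma = 0.99   # discount factor
alpha = 0.02   # step size of the gradient update
n_steps = 50   # finite horizon used to approximate the cost-to-go

# Euler-discretized deterministic double integrator: s = [position, velocity]
A = jnp.array([[1.0, dt], [0.0, 1.0]])
B = jnp.array([0.0, dt])


def rollout_cost(theta, s0):
    """Discounted finite-horizon cost of the linear policy u = theta @ s."""
    s, total = s0, 0.0
    for t in range(n_steps):
        u = jnp.dot(theta, s)
        stage_cost = dt * (jnp.sum(s ** 2) + 0.1 * u ** 2)
        total = total + gamma ** t * stage_cost
        s = A @ s + B * u
    return total


# Gradient of the cost with respect to theta (first argument).
grad_cost = jax.jit(jax.grad(rollout_cost))

s0 = jnp.array([2.0, 0.0])
theta = jnp.array([-1.0, -1.0])
initial_cost = rollout_cost(theta, s0)

# Gradient descent on the policy parameters.
for _ in range(100):
    theta = theta - alpha * grad_cost(theta, s0)

final_cost = rollout_cost(theta, s0)
```

This model-based gradient requires known, differentiable dynamics, so it is best read as a reference point for data-driven gradient estimates rather than as a method for unknown systems.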

The challenge in both robotics and control is to find the gradient $\nabla_\theta J^*_{\theta_t}$. Optimal control methods rely on knowledge of the system to establish this gradient. However, autonomous and adaptive systems need the ability to establish this gradient without knowing a full model of the system. The resulting challenge is to estimate the policy gradient from data generated during the execution of the task.



### Finite-difference Methods
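
A finite-difference estimate perturbs each policy parameter by a small amount, measures the resulting change in rollout cost, and divides by the perturbation size. The sketch below illustrates this with central differences. It is only a sketch under stated assumptions: `rollout_cost` is a hypothetical noisy double-integrator simulation (not the notebook's `StochasticDoubleIntegrator`), the policy is linear, and the same random key is reused for both perturbed rollouts so that the noise cancels in the difference. Reusing the noise is possible in simulation only; on a physical system each rollout sees different noise and the estimate needs averaging over several rollouts.

```python
import jax
import jax.numpy as jnp
import jax.random as jrandom

# Assumed (hypothetical) setup: step size, discount, horizon, noise level.
dt, gamma, n_steps, noise_std = 0.1, 0.99, 50, 0.05
A = jnp.array([[1.0, dt], [0.0, 1.0]])
B = jnp.array([0.0, dt])
s0 = jnp.array([2.0, 0.0])


def rollout_cost(theta, key):
    """Discounted cost of one noisy rollout under the policy u = theta @ s."""
    noise = noise_std * jrandom.normal(key, (n_steps,))
    s, total = s0, 0.0
    for t in range(n_steps):
        u = jnp.dot(theta, s)
        total = total + gamma ** t * dt * (jnp.sum(s ** 2) + 0.1 * u ** 2)
        s = A @ s + B * (u + noise[t])  # noise enters through the control channel
    return total


def finite_difference_gradient(cost_fn, theta, key, eps=1e-2):
    """Central-difference gradient estimate, one parameter at a time."""
    grads = []
    for i in range(theta.shape[0]):
        e_i = jnp.zeros_like(theta).at[i].set(eps)
        # Both rollouts use the same key, i.e. the same noise realization.
        diff = cost_fn(theta + e_i, key) - cost_fn(theta - e_i, key)
        grads.append(diff / (2 * eps))
    return jnp.stack(grads)


key = jrandom.PRNGKey(0)
theta = jnp.array([-1.0, -1.0])
fd_grad = finite_difference_gradient(rollout_cost, theta, key)
```

The estimate costs two rollouts per parameter, so the number of rollouts grows linearly with the size of the policy parameterization.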



In [None]:
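import jax.numpy as jnp
import jax.random as jrandom
from src.systems.linear import StochasticDoubleIntegrator

# initial state (floats, so later arithmetic stays in floating point)
x0 = jnp.array([2.0, 0.0])
SDI = StochasticDoubleIntegrator(x0)

n_inputs = 2   # dimension of the policy input (the observation)
n_ctrl = 1     # dimension of the control signal
n_steps = 10

key = jrandom.PRNGKey(0)

# linear policy u = params @ y, so params has shape (n_ctrl, n_inputs)
key, subkey = jrandom.split(key)
params = jrandom.normal(subkey, (n_ctrl, n_inputs))

for i in range(n_steps - 1):
    # split the key so every observation draws independent noise
    key, subkey = jrandom.split(key)
    y0 = SDI.observe(subkey)
    u_star = jnp.dot(params, y0)

    # state update
    SDI.update(u_star)

    # learning step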
    



