# DDPG

DDPG (Deep Deterministic Policy Gradient) is a model-free (it does not require the environment's transition probabilities), off-policy `actor-critic` algorithm that combines elements of policy gradient methods with deep Q-learning. It can be viewed as an extension of DQN to continuous action spaces. It uses `temporal difference learning` (bootstrapping) and an `experience replay buffer` (off-policy) to learn the Q-value, which is represented by the Critic network. Unlike DQN, DDPG does not use an $\epsilon$-greedy policy for exploration. Instead, the behavior policy takes the action produced by the Actor network (the deterministic policy being learned) and adds noise to it to encourage `exploration` of the environment.
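
The behavior policy described above can be sketched as follows. This is a minimal illustration that assumes a hypothetical linear Actor and zero-mean Gaussian noise; the names `actor`, `behavior_action`, `sigma`, `low` and `high` are made up for this example and do not come from any library.

```python
import random

def actor(state, weights):
    """Hypothetical deterministic policy: a linear map from the state to one action."""
    return sum(w * s for w, s in zip(weights, state))

def behavior_action(state, weights, sigma=0.1, low=-1.0, high=1.0, rng=random):
    """Actor output plus zero-mean Gaussian noise, clipped to the action bounds."""
    noisy = actor(state, weights) + rng.gauss(0.0, sigma)
    return max(low, min(high, noisy))

rng = random.Random(0)  # seeded generator so the run is repeatable
state, weights = [0.5, -0.2], [0.4, 1.0]

greedy = actor(state, weights)  # identical on every call: the policy is deterministic
explored = [behavior_action(state, weights, rng=rng) for _ in range(5)]  # varies per call
```

The noise only affects the action sent to the environment; the Actor itself stays deterministic, which is what makes the learning off-policy.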

- DDPG uses four neural networks:
    - The Actor network.
    - The Critic network.
    - The target Actor network.
    - The target Critic network.
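
A small sketch of how the four networks relate to each other. It assumes, as in the usual formulation of DDPG, that each target network starts as an exact copy of its learned counterpart. The dictionary returned by `init_params` is a hypothetical stand-in for a real neural network, used here only to keep the example dependency-free.

```python
import copy
import random

def init_params(n_inputs, rng):
    """Hypothetical stand-in for a network: one weight per input plus a bias."""
    return {"w": [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)], "b": 0.0}

rng = random.Random(0)
state_dim, action_dim = 3, 1

actor = init_params(state_dim, rng)                # mu(s | theta_mu): input is the state
critic = init_params(state_dim + action_dim, rng)  # Q(s, a | theta_Q): input is state and action
target_actor = copy.deepcopy(actor)                # mu': independent copy of the Actor
target_critic = copy.deepcopy(critic)              # Q': independent copy of the Critic
```

Deep copies matter here: the target parameters must not change when the Actor and Critic are trained, only when the target update is applied.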

---
**Algorithm (Pseudocode): DDPG**

1. Initialize the environment to a random state $s_t$.

2. Feed the current state $s_t$ to the Actor neural network, which returns an action $a_t$ (a continuous value, not a probability, since the policy is deterministic).

3. Add noise (for example Gaussian; the original DDPG paper used an Ornstein-Uhlenbeck process) to the action $a_t$ and apply the noisy action to the environment, which returns a reward $r_t$ and the next state $s_{t+1}$.

4. At each time step, store the experience/transition as a tuple $(s_t, a_t, r_{t}, s_{t+1}, d_{t})$ in the replay buffer, where $d_{t}$ is an optional boolean Done flag that indicates whether the episode ended at this step. Sampling from the buffer at random breaks the correlation between consecutive transitions, which helps stabilize training.

5. Update Actor network:

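    5.1 Sample a random mini-batch of states from the replay buffer and feed each state $s_i$ to the `Actor network` $\mu$ to get the action $\mu(s_i|\theta^{\mu})$. This action may differ from the one stored in the buffer, because the Actor's parameters have changed since the transition was recorded.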
    
    5.2 Feed each sampled state together with the Actor's action to the `Critic network` $Q$ to get the value $Q(s_i, \mu(s_i|\theta^{\mu}) | \theta^{Q})$.
    
    5.3 Update the Actor network's parameters $\theta^{\mu}$ by `gradient ascent` on the Actor objective $J$ (equivalently, gradient descent on the loss $-J$) w.r.t. the Actor parameters $\theta^{\mu}$:

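    \begin{eqnarray}
    \theta^{\mu}_{t+1} &=& \theta^{\mu}_t + \alpha \nabla_{\theta^{\mu}} J.\\
    \nabla_{\theta^{\mu}} J &=& \mathbb{E}_{s_t \sim \rho^{\beta}}[\nabla_{\theta^{\mu}} Q(s, a | \theta^{Q})|_{s=s_t, a=\mu(s_t|\theta^{\mu})}].
    \end{eqnarray}

    Where $\theta^{\mu}$ and $\theta^{Q}$ represent the Actor and Critic networks' parameters, respectively, $\alpha$ is the learning rate, and $\rho^{\beta}$ is the state distribution under the behavior policy. $\nabla_{\theta^{\mu}} Q(s, a | \theta^{Q})$ is the gradient of the Critic's output w.r.t. the Actor parameters, which the chain rule expands to $\nabla_{a} Q(s, a | \theta^{Q}) \, \nabla_{\theta^{\mu}} \mu(s | \theta^{\mu})$. In practice the expectation is approximated by the average over the sampled mini-batch.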

6. Update Critic network:

    6.1 Sample a random mini-batch of $N$ transitions (states, actions, rewards and next states) from the replay buffer.

    6.2 Use the `target Actor network` $\mu'$ to get the actions for the next states, $\mu'(s_{i+1}|\theta^{\mu'})$.

    6.3 Feed the next states and the actions from step 6.2 to the `target Critic network` $Q'$ to get the target value $y_i$.

    6.4 Feed the sampled states and their stored actions to the `Critic network` to get the predicted value $Q(s_i, a_i | \theta^{Q})$.

    6.5 Update the Critic network's parameters $\theta^{Q}$ using `gradient descent` to minimize the mean squared error loss function of the Critic network:

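    $$\frac{1}{N} \sum_i (y_i - Q(s_i, a_i | \theta^{Q}))^2.$$

    Where $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'}) | \theta^{Q'})$ is the target value obtained in step 6.3, $\gamma$ is the discount factor and $N$ is the mini-batch size. If the Done flag $d_i$ is stored, the bootstrap term is typically multiplied by $(1 - d_i)$ so that terminal transitions use $y_i = r_i$.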

    6.6 Update the target networks using the soft update rule:

    \begin{align}
    \theta^{\mu'} &= \tau \theta^{\mu} + (1-\tau) \theta^{\mu'} . \\
    \theta^{Q'} &= \tau \theta^{Q} + (1-\tau) \theta^{Q'}.
    \end{align}

    Where $\tau$ is a small hyperparameter ($0 < \tau \ll 1$) that controls how slowly the target networks track the learned networks.

7. Repeat steps 2 to 6 until convergence, resetting the environment whenever an episode ends.
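
To make the Actor update of step 5 concrete, here is a toy sketch of the chain rule behind it: the gradient of the Critic w.r.t. the action is multiplied by the gradient of the Actor w.r.t. its parameter, averaged over a batch, and followed by an ascent step. Everything in it is an assumption chosen for illustration: a one-parameter Actor $\mu(s) = \theta s$ and a fixed, hand-written Critic $Q(s, a) = -(a - 2s)^2$ instead of a learned network, so the best policy is known to be $\theta = 2$.

```python
def dq_da(s, a):
    """Gradient w.r.t. the action of the hypothetical Critic Q(s, a) = -(a - 2*s)**2."""
    return -2.0 * (a - 2.0 * s)

def actor_gradient(theta, states):
    """Batch average of dQ/da * dmu/dtheta, where mu(s) = theta * s so dmu/dtheta = s."""
    return sum(dq_da(s, theta * s) * s for s in states) / len(states)

def train_actor(theta, states, lr=0.1, steps=200):
    """Repeated gradient ascent steps on the Actor parameter (note the plus sign)."""
    for _ in range(steps):
        theta = theta + lr * actor_gradient(theta, states)
    return theta

states = [0.5, 1.0, -1.0, 1.5]  # stands in for a mini-batch of sampled states
theta_start = 0.0
theta_final = train_actor(theta_start, states)  # moves toward the optimum theta = 2
```

In a real implementation the Critic is itself being learned, so the Actor follows a moving estimate rather than a fixed function as it does here.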

---
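
The replay buffer of step 4 and the Critic update of step 6 can be sketched as below. This is a dependency-free illustration, not a full implementation: networks are passed in as plain callables, parameters are flat lists of floats, and the names `ReplayBuffer`, `td_target`, `critic_loss` and `soft_update` are hypothetical. Masking the bootstrap term with `(1 - done)` is a common convention that is assumed here.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions (step 4)."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement (step 6.1).
        return rng.sample(list(self.data), batch_size)

def td_target(r, s_next, done, target_actor, target_critic, gamma):
    """Steps 6.2-6.3: y = r + gamma * (1 - done) * Q'(s_next, mu'(s_next))."""
    a_next = target_actor(s_next)
    return r + gamma * (1.0 - float(done)) * target_critic(s_next, a_next)

def critic_loss(batch, critic, target_actor, target_critic, gamma):
    """Steps 6.4-6.5: mean squared error between y_i and Q(s_i, a_i)."""
    errors = [
        td_target(r, s_next, done, target_actor, target_critic, gamma) - critic(s, a)
        for s, a, r, s_next, done in batch
    ]
    return sum(e * e for e in errors) / len(errors)

def soft_update(target_params, params, tau):
    """Step 6.6: theta' <- tau * theta + (1 - tau) * theta', element-wise."""
    return [tau * p + (1.0 - tau) * p_t for p, p_t in zip(params, target_params)]

# Toy usage with scalar states/actions and hand-written stand-in networks.
buffer = ReplayBuffer(capacity=100)
buffer.add(s=1.0, a=0.5, r=1.0, s_next=2.0, done=False)
buffer.add(s=2.0, a=0.5, r=0.0, s_next=3.0, done=True)

target_actor = lambda s: 0.5 * s     # stand-in for mu'
target_critic = lambda s, a: s + a   # stand-in for Q'
critic = lambda s, a: s + a          # stand-in for Q

batch = buffer.sample(2, rng=random.Random(0))
loss = critic_loss(batch, critic, target_actor, target_critic, gamma=0.9)
new_target = soft_update(target_params=[0.0, 0.0], params=[1.0, 2.0], tau=0.1)
```

With a small `tau`, each call to `soft_update` moves the target parameters only a fraction of the way toward the learned ones, which keeps the targets $y_i$ from shifting abruptly between updates.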