# Deep Deterministic Policy Gradient (DDPG)

In this notebook, we explore **Deep Deterministic Policy Gradient (DDPG)**, a reinforcement learning algorithm designed for **continuous action spaces**.

DDPG is an **off-policy, model-free, actor-critic** algorithm that combines ideas from **DQN** and **Policy Gradient** methods.

Introduced by *Lillicrap et al. (2015)*, it is particularly suited to continuous control problems such as robotic arm manipulation or steering control in self-driving vehicles.

## 1. Key Concepts

DDPG extends the Deep Q-Network (DQN) to continuous action spaces by:

- Using a **deterministic policy** (actor network) to output continuous actions.
- Using a **critic network** to estimate the Q-value $Q(s, a)$ of a state-action pair.
- Employing **target networks** for both actor and critic to stabilize learning.
- Using a **replay buffer** to store and sample experiences for training.

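**Update equations:**

The critic is trained toward a target $y_i$ computed with the target networks $Q'$ and $\mu'$:

$$ y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'}) $$

The actor is updated along the sampled deterministic policy gradient over a minibatch of $N$ transitions, with the critic's gradient evaluated at the actor's own action $a = \mu(s_i)$:

$$ \nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a|\theta^Q)\big|_{s=s_i,\, a=\mu(s_i)} \, \nabla_{\theta^\mu} \mu(s|\theta^\mu)\big|_{s=s_i} $$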
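
As a concrete illustration of the target computation, the sketch below evaluates $y_i$ for a toy batch with NumPy. The functions `target_actor` and `target_critic` are hypothetical stand-ins (simple closed-form functions, not neural networks), and the numbers are made up for illustration.

```python
import numpy as np

gamma = 0.99  # discount factor

def target_actor(states):
    # Hypothetical stand-in for mu'(s): a fixed linear policy.
    return -0.5 * states

def target_critic(states, actions):
    # Hypothetical stand-in for Q'(s, a): a simple quadratic.
    return -(states ** 2) - 0.1 * (actions ** 2)

rewards = np.array([[1.0], [0.0], [-1.0]])
next_states = np.array([[0.0], [1.0], [2.0]])

# y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
next_actions = target_actor(next_states)
td_targets = rewards + gamma * target_critic(next_states, next_actions)
```

Note that the target depends only on the reward, the next state, and the target networks, so it acts as a fixed regression label when the critic is updated.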

In [None]:
import gym
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
import random

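# The code below assumes the Gym >= 0.26 API (also used by Gymnasium):
# `reset()` returns (observation, info) and `step()` returns five values.
env = gym.make('Pendulum-v1')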

num_states = env.observation_space.shape[0]
num_actions = env.action_space.shape[0]
upper_bound = env.action_space.high[0]
lower_bound = env.action_space.low[0]

print(f"State dim: {num_states}, Action dim: {num_actions}")

## 2. Build Actor and Critic Networks

In [None]:
def get_actor():
    inputs = layers.Input(shape=(num_states,))
    out = layers.Dense(256, activation='relu')(inputs)
    out = layers.Dense(256, activation='relu')(out)
    outputs = layers.Dense(num_actions, activation='tanh')(out)
    outputs = outputs * upper_bound
    return tf.keras.Model(inputs, outputs)

def get_critic():
    state_input = layers.Input(shape=(num_states,))
    state_out = layers.Dense(16, activation='relu')(state_input)
    state_out = layers.Dense(32, activation='relu')(state_out)

    action_input = layers.Input(shape=(num_actions,))
    action_out = layers.Dense(32, activation='relu')(action_input)

    concat = layers.Concatenate()([state_out, action_out])
    out = layers.Dense(256, activation='relu')(concat)
    out = layers.Dense(256, activation='relu')(out)
    outputs = layers.Dense(1)(out)
    return tf.keras.Model([state_input, action_input], outputs)

actor_model = get_actor()
critic_model = get_critic()

target_actor = get_actor()
target_critic = get_critic()

target_actor.set_weights(actor_model.get_weights())
target_critic.set_weights(critic_model.get_weights())
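
The actor above multiplies a `tanh` output by `upper_bound`, which covers the full action range only when the bounds are symmetric around zero (as in Pendulum, where the torque range is [-2, 2]). For an environment with asymmetric bounds, one option is an affine rescaling. The helper `scale_action` below is a hypothetical name, not part of this notebook, and the sketch uses NumPy in place of a Keras layer.

```python
import numpy as np

def scale_action(tanh_output, low, high):
    """Map values in [-1, 1] linearly onto [low, high]."""
    tanh_output = np.asarray(tanh_output, dtype=np.float64)
    return low + 0.5 * (tanh_output + 1.0) * (high - low)

# Symmetric bounds reduce to plain multiplication by the upper bound.
symmetric = scale_action([-1.0, 0.0, 1.0], low=-2.0, high=2.0)

# Asymmetric bounds are covered end to end as well.
asymmetric = scale_action([-1.0, 0.0, 1.0], low=0.0, high=10.0)
```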

## 3. Replay Buffer for Experience Storage

In [None]:
class ReplayBuffer:
    def __init__(self, buffer_capacity=100000, batch_size=64):
        self.buffer_capacity = buffer_capacity
        self.batch_size = batch_size
        self.buffer_counter = 0

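        # Store everything as float32 so sampled batches match the networks' dtype;
        # float64 rewards would fail when added to the float32 critic output.
        self.state_buffer = np.zeros((self.buffer_capacity, num_states), dtype=np.float32)
        self.action_buffer = np.zeros((self.buffer_capacity, num_actions), dtype=np.float32)
        self.reward_buffer = np.zeros((self.buffer_capacity, 1), dtype=np.float32)
        self.next_state_buffer = np.zeros((self.buffer_capacity, num_states), dtype=np.float32)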

    def record(self, obs_tuple):
        index = self.buffer_counter % self.buffer_capacity
        self.state_buffer[index] = obs_tuple[0]
        self.action_buffer[index] = obs_tuple[1]
        self.reward_buffer[index] = obs_tuple[2]
        self.next_state_buffer[index] = obs_tuple[3]
        self.buffer_counter += 1

    def sample(self):
        record_range = min(self.buffer_counter, self.buffer_capacity)
        batch_indices = np.random.choice(record_range, self.batch_size)

        state_batch = tf.convert_to_tensor(self.state_buffer[batch_indices])
        action_batch = tf.convert_to_tensor(self.action_buffer[batch_indices])
        reward_batch = tf.convert_to_tensor(self.reward_buffer[batch_indices])
        next_state_batch = tf.convert_to_tensor(self.next_state_buffer[batch_indices])
        return state_batch, action_batch, reward_batch, next_state_batch
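
The modulo index in `record` turns the buffer into a ring: once `buffer_capacity` transitions have been stored, new ones overwrite the oldest. A minimal standalone sketch of that indexing, using a hypothetical capacity of 3 and integers in place of transitions:

```python
import numpy as np

capacity = 3  # hypothetical, tiny so the wrap-around is visible
storage = np.zeros(capacity, dtype=np.int64)
counter = 0

for item in [10, 11, 12, 13, 14]:
    storage[counter % capacity] = item  # same indexing rule as `record`
    counter += 1

# Number of valid slots to sample from, computed as in `sample`.
valid = min(counter, capacity)
```

After five writes, the two oldest items (10 and 11) have been replaced by 13 and 14, while 12 is still in its slot.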

## 4. Training Utilities

In [None]:
critic_lr = 0.002
actor_lr = 0.001
gamma = 0.99
tau = 0.005

critic_optimizer = tf.keras.optimizers.Adam(critic_lr)
actor_optimizer = tf.keras.optimizers.Adam(actor_lr)
buffer = ReplayBuffer()

In [None]:
def update_target(target_weights, weights, tau):
    for (a, b) in zip(target_weights, weights):
        a.assign(b * tau + a * (1 - tau))
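
To get a feel for how slowly the targets move with `tau = 0.005`, the following sketch applies the same soft-update rule to plain NumPy arrays while holding the online weights fixed. The arrays are hypothetical stand-ins for network weights. Each step shrinks the remaining gap by a factor of `1 - tau`, so after `n` steps the gap is `(1 - tau) ** n` of its initial size.

```python
import numpy as np

tau = 0.005
online = np.array([1.0, -2.0, 3.0])  # stand-in for the learned weights (held fixed)
target = np.zeros_like(online)       # stand-in for the target weights

n_steps = 200
for _ in range(n_steps):
    target = tau * online + (1.0 - tau) * target

# Fraction of the initial gap that remains after n_steps soft updates.
remaining = np.abs(online - target) / np.abs(online)
```

With these settings, roughly 37% of the gap remains after 200 updates, which is why the targets lag well behind the online networks.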

In [None]:
@tf.function
def update(state_batch, action_batch, reward_batch, next_state_batch):
    with tf.GradientTape() as tape:
        target_actions = target_actor(next_state_batch, training=True)
        y = reward_batch + gamma * target_critic([next_state_batch, target_actions], training=True)
        critic_value = critic_model([state_batch, action_batch], training=True)
        critic_loss = tf.math.reduce_mean(tf.math.square(y - critic_value))
    critic_grad = tape.gradient(critic_loss, critic_model.trainable_variables)
    critic_optimizer.apply_gradients(zip(critic_grad, critic_model.trainable_variables))

    with tf.GradientTape() as tape:
        actions = actor_model(state_batch, training=True)
        critic_value = critic_model([state_batch, actions], training=True)
        actor_loss = -tf.math.reduce_mean(critic_value)
    actor_grad = tape.gradient(actor_loss, actor_model.trainable_variables)
    actor_optimizer.apply_gradients(zip(actor_grad, actor_model.trainable_variables))
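
The actor step maximizes the critic's value of the actor's own action, which amounts to the chain rule from the policy gradient: the gradient of Q with respect to the action, times the gradient of the action with respect to the policy parameters. The toy sketch below works this out by hand for a one-parameter policy. The quadratic critic and linear policy are assumptions chosen so the derivatives are easy to write down, and the critic is held fixed here, unlike in DDPG where it is learned at the same time.

```python
import numpy as np

def critic(s, a):
    # Hypothetical fixed critic: the best action for state s is a = 2 * s.
    return -(a - 2.0 * s) ** 2

def dq_da(s, a):
    # Gradient of the critic with respect to the action.
    return -2.0 * (a - 2.0 * s)

theta = 0.0  # single policy parameter, mu(s) = theta * s
lr = 0.05
states = np.array([-1.0, -0.5, 0.5, 1.0])

for _ in range(200):
    actions = theta * states  # mu(s | theta)
    dmu_dtheta = states       # d mu / d theta
    # Deterministic policy gradient: batch mean of dQ/da * dmu/dtheta.
    grad = np.mean(dq_da(states, actions) * dmu_dtheta)
    theta += lr * grad        # gradient ascent on Q

final_value = np.mean(critic(states, theta * states))
```

The parameter converges toward 2, the value at which the policy's action matches the critic's preferred action in every state.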

## 5. DDPG Training Loop (Simplified)

In [None]:
def policy(state, noise_object):
    sampled_actions = tf.squeeze(actor_model(state))
    noise = noise_object()
    sampled_actions = sampled_actions.numpy() + noise
    legal_action = np.clip(sampled_actions, lower_bound, upper_bound)
    return [np.squeeze(legal_action)]

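class OUActionNoise:
    """Discrete-time Ornstein-Uhlenbeck process: noise that drifts back toward `mean`."""

    def __init__(self, mean, std_deviation, theta=0.15):
        self.theta = theta  # mean-reversion rate
        self.mean = mean
        self.std_dev = std_deviation
        self.reset()

    def __call__(self):
        # Pull the previous sample toward the mean, then add Gaussian noise.
        x = (
            self.x_prev
            + self.theta * (self.mean - self.x_prev)
            + self.std_dev * np.random.normal(size=self.mean.shape)
        )
        self.x_prev = x
        return x

    def reset(self):
        self.x_prev = np.zeros_like(self.mean)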

noise = OUActionNoise(mean=np.zeros(1), std_deviation=0.2 * np.ones(1))
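
What distinguishes Ornstein-Uhlenbeck noise from independent Gaussian noise is that consecutive samples are correlated, which gives smoother exploration over time. The sketch below reimplements the same update rule in a standalone function (`ou_samples` is a hypothetical helper) and compares the lag-1 autocorrelation with that of white noise. The seed and sample count are arbitrary.

```python
import numpy as np

def ou_samples(n, theta=0.15, sigma=0.2, mean=0.0, rng=None):
    """Draw n samples using the same update rule as OUActionNoise."""
    rng = np.random.default_rng() if rng is None else rng
    x = 0.0
    out = np.empty(n)
    for i in range(n):
        x = x + theta * (mean - x) + sigma * rng.normal()
        out[i] = x
    return out

def lag1_autocorr(x):
    """Correlation between each sample and the one that follows it."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

rng = np.random.default_rng(0)
ou = ou_samples(5000, rng=rng)
white = 0.2 * rng.normal(size=5000)

ou_corr = lag1_autocorr(ou)        # expected to be near 1 - theta = 0.85
white_corr = lag1_autocorr(white)  # expected to be near 0
```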

In [None]:
ep_reward_list = []

for ep in range(5):  # small number for demo
    prev_state = env.reset()[0]
    episodic_reward = 0

    while True:
        tf_prev_state = tf.expand_dims(tf.convert_to_tensor(prev_state), 0)
        action = policy(tf_prev_state, noise)
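        # Gym >= 0.26 returns separate `terminated` and `truncated` flags. Pendulum never
        # terminates and only truncates at its time limit, so either flag must end the episode.
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated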
        buffer.record((prev_state, action, reward, state))
        episodic_reward += reward

        state_batch, action_batch, reward_batch, next_state_batch = buffer.sample()
        update(state_batch, action_batch, reward_batch, next_state_batch)
        update_target(target_actor.variables, actor_model.variables, tau)
        update_target(target_critic.variables, critic_model.variables, tau)

        if done:
            break
        prev_state = state

    ep_reward_list.append(episodic_reward)
    print(f"Episode {ep}, Reward: {episodic_reward:.2f}")
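
Per-episode rewards are noisy, so progress is easier to judge from a running average of `ep_reward_list`. A minimal sketch, where `running_mean` is a hypothetical helper and the reward values are made-up placeholders rather than results from this notebook:

```python
import numpy as np

def running_mean(values, window):
    """Mean of the last `window` entries at each step (fewer at the start)."""
    values = np.asarray(values, dtype=np.float64)
    out = np.empty_like(values)
    for i in range(len(values)):
        start = max(0, i - window + 1)
        out[i] = values[start:i + 1].mean()
    return out

# Placeholder rewards for illustration only.
example_rewards = [-1500.0, -1300.0, -1400.0, -900.0, -1000.0]
smoothed = running_mean(example_rewards, window=3)
```

In the notebook, the same helper could be applied to `ep_reward_list` once training has run for more than a handful of episodes.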

## 6. Summary

- DDPG learns a **deterministic policy** for **continuous action spaces**, training off-policy from replayed transitions.
- Combines **actor-critic structure** with **replay buffer** and **target networks**.
- Uses **Ornstein-Uhlenbeck noise** for exploration.

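**Next:** Explore **TD3**, which extends DDPG with twin critics and delayed policy updates, or **SAC**, a related off-policy actor-critic method with a stochastic policy. Both were designed to improve on DDPG's training stability.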