# DDPG - Deep Deterministic Policy Gradient
DDPG is the deep reinforcement learning version of DPG (Deterministic Policy Gradient): it applies the techniques introduced with DQN, namely experience replay and target networks, to DPG. The DPG algorithm is a policy gradient method that uses a deterministic policy instead of a stochastic one. This raises the question of how to compute a policy gradient when actions have no probability attached. The answer turns out to be simple: follow the gradient of the action-value function with respect to the action, chained through the policy. DPG offers more efficient gradient estimation than classical policy gradient methods and, in practice, has been shown to outperform stochastic counterparts. DPG is regarded as an off-policy actor-critic method that uses a stochastic, exploratory behaviour policy to learn a deterministic target policy.

The Deterministic Policy Gradient is as follows:
$\begin{aligned}
\nabla_\theta J(\theta) 
&= \int_\mathcal{S} \rho^\mu(s) \nabla_a Q^\mu(s, a) \nabla_\theta \mu_\theta(s) \rvert_{a=\mu_\theta(s)} ds \\
&= \mathbb{E}_{s \sim \rho^\mu} [\nabla_a Q^\mu(s, a) \nabla_\theta \mu_\theta(s) \rvert_{a=\mu_\theta(s)}]
\end{aligned}$
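
To make the chain rule in this expression concrete, here is a minimal sketch for a single state with scalar parameters. The linear policy `mu`, the quadratic `q` and the chosen numbers are illustrative assumptions, not part of DDPG itself. The point is only that the product of the two gradients matches a finite-difference estimate of the objective's slope.

```python
# Toy check of the deterministic policy gradient for one state.
# Assumed (illustrative) policy and critic:
#   mu_theta(s) = theta * s
#   Q(s, a)     = -(a - 2 * s) ** 2   (best action is a = 2 * s)

def mu(theta, s):
    return theta * s

def q(s, a):
    return -(a - 2.0 * s) ** 2

def dpg_gradient(theta, s):
    a = mu(theta, s)
    dq_da = -2.0 * (a - 2.0 * s)   # gradient of Q with respect to the action
    dmu_dtheta = s                 # gradient of the policy with respect to theta
    return dq_da * dmu_dtheta      # chain rule, evaluated at a = mu_theta(s)

def finite_difference(theta, s, eps=1e-6):
    # Numerical slope of J(theta) = Q(s, mu_theta(s)) for comparison
    return (q(s, mu(theta + eps, s)) - q(s, mu(theta - eps, s))) / (2.0 * eps)

theta, s = 0.5, 3.0
analytic = dpg_gradient(theta, s)
numeric = finite_difference(theta, s)
print(analytic, numeric)
```

The gradient is positive here, so gradient ascent increases `theta` and moves the action toward `2 * s`, the maximiser of this toy Q.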

DDPG is essentially a combination of DPG and DQN: it carries DQN-style value learning over to continuous action spaces, where taking a maximum over all actions is no longer practical.

The following is the DDPG Pseudocode:
![Pseudocode](https://lilianweng.github.io/lil-log/assets/images/DDPG_algo.png)

Since the policy is deterministic, noise is added to the behaviour policy to aid exploration.
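
As a rough sketch of what this looks like in practice, the snippet below adds zero-mean Gaussian noise to a deterministic action and clips the result to the action bounds. The bounds of [-1, 1], the example action and the standard deviation of 0.1 are assumptions for illustration. Clipping is a common addition rather than something the pseudocode above requires.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

# Hypothetical deterministic action for a 2-dimensional action space
action = torch.tensor([[0.95, -0.20]])

# Zero-mean Gaussian exploration noise, one sample per action dimension
noise = Normal(torch.zeros(2), 0.1)
noisy_action = action + noise.sample()

# Keep the perturbed action inside the assumed bounds of [-1, 1]
clipped_action = torch.clamp(noisy_action, -1.0, 1.0)
print(clipped_action)
```

At evaluation time the noise is simply left out, so the agent acts with the deterministic policy alone.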

## Continuous Policy
The continuous action space policy for DPG is deterministic, which means a single neural network is enough: it maps a state observation directly to an action rather than to the parameters of a distribution.

The layer sizes of the neural network are as follows:

- Input: state observation size
- Hidden: whatever you please.
- Output: number of action dimensions (`nb_actions`)

Batch normalisation helps when the observation dimensions have very different scales.

In [None]:
import torch
import torch.nn as nn

class Continuous_Policy(nn.Module):

    def __init__(self, input_size, hidden_size, nb_actions) -> None:
        super(Continuous_Policy, self).__init__()
        self.mean_net = nn.Sequential(nn.Linear(input_size, hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, hidden_size),
                                        nn.BatchNorm1d(hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, nb_actions)
                                        )
        
    def forward(self,obs):
        return self.mean_net(obs)

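One practical consequence of using `nn.BatchNorm1d` is worth a quick sketch. In training mode the layer normalises with batch statistics, so it needs more than one sample per batch, while in evaluation mode it uses its running statistics and accepts a single observation. The small network below is a stand-in with arbitrary sizes, not the policy class itself.

```python
import torch
import torch.nn as nn

# Stand-in network with arbitrary sizes (8 inputs, 16 hidden units, 2 outputs)
net = nn.Sequential(nn.Linear(8, 16),
                    nn.BatchNorm1d(16),
                    nn.LeakyReLU(),
                    nn.Linear(16, 2))

single_obs = torch.randn(1, 8)   # one observation with a batch dimension of 1

# Training mode: batch statistics cannot be computed from a single sample
net.train()
try:
    net(single_obs)
    failed_in_train_mode = False
except ValueError:
    failed_in_train_mode = True

# Evaluation mode: running statistics are used, so a single sample is fine
net.eval()
with torch.no_grad():
    action = net(single_obs)

print(failed_in_train_mode, action.shape)
```

This is why an agent that acts on one observation at a time usually switches the policy to `eval()` before the forward pass and back to `train()` afterwards.
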
## Critic
This is the critic model, a neural network that learns the value function. This could be a Q (action-value) or V (state-value) approximator; in this case it is the Q value we are approximating. Because actions are continuous, the critic cannot have one output per action as in DQN, so I simply concatenate the state observation and the action and output a single Q value.

In [None]:
class Critic(nn.Module):

    def __init__(self, input_size,nb_actions, hidden_size) -> None:
        super(Critic, self).__init__()
        self.critic = nn.Sequential(nn.Linear(input_size+nb_actions, hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, 1))
        
    def forward(self,obs,action):
        # Concatenate the action and state values
        return self.critic(torch.cat([obs,action],dim=-1))

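A detail that is easy to miss: a critic like this returns one value per row, with shape `(batch, 1)`. The sketch below, using made-up sizes and random tensors, shows the concatenation and why the trailing dimension is usually squeezed before comparing against 1-D targets. Without the squeeze, the subtraction broadcasts to a `(batch, batch)` matrix.

```python
import torch
import torch.nn as nn

batch, obs_size, nb_actions = 4, 8, 2   # illustrative sizes

obs = torch.randn(batch, obs_size)
action = torch.randn(batch, nb_actions)

# Concatenate along the last dimension: (4, 8) and (4, 2) -> (4, 10)
critic_input = torch.cat([obs, action], dim=-1)

# Stand-in critic: a single linear layer producing one Q value per row
critic = nn.Linear(obs_size + nb_actions, 1)
q_values = critic(critic_input)                 # shape (4, 1)

targets = torch.zeros(batch)                    # 1-D targets, shape (4,)

broadcast_diff = q_values - targets             # silently broadcasts to (4, 4)
correct_diff = q_values.squeeze(-1) - targets   # shape (4,)

print(critic_input.shape, q_values.shape, broadcast_diff.shape, correct_diff.shape)
```
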
## Replay Memory

Just like DQN, DDPG makes use of an experience replay memory: it stores previous transitions so that learning is data-efficient and uses decorrelated samples.

In [None]:
from collections import namedtuple, deque
import random
import numpy as np

Transition = namedtuple(
    "Transition", ("state", "action", "reward", "next_state", "done"))

class Replay_Memory:

    def __init__(self, size):
        self.memory = deque([], maxlen=size)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        if batch_size > len(self.memory):
            return None
        return random.sample(self.memory, batch_size)


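To illustrate how sampled transitions are typically turned into per-field batches, here is a small self-contained sketch. It redefines a `Transition` tuple with the same fields and uses plain numbers instead of tensors; the values are made up.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple(
    "Transition", ("state", "action", "reward", "next_state", "done"))

random.seed(0)
memory = deque([], maxlen=3)   # tiny buffer so that eviction is visible

# Push four made-up transitions; the oldest one is dropped once the deque is full
for t in range(4):
    memory.append(Transition(t, 0.1 * t, 1.0, t + 1, False))

# Sample without replacement, then transpose the list of transitions
# into one Transition whose fields are tuples over the batch
batch = random.sample(list(memory), 2)
fields = Transition(*zip(*batch))

print(len(memory), fields.state, fields.next_state)
```

The `Transition(*zip(*batch))` idiom is what lets each field be stacked into a single tensor later on.
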
## Agent 
The following is the agent class:

In [None]:
from torch.distributions import Normal
from torch.optim import AdamW
import torch

class DDPG_Agent:

    def __init__(self,input_size,hidden_size,nb_actions,discount_factor=0.9,learning_rate=0.0001,batch_size=256,replay_memory_size=100000,polyak = 0.005,noise_dev=0.1):
        
        # Create the online network critic
        self.online_net = Critic(input_size,nb_actions,hidden_size)

        # Create the target network critic
        self.target_net = Critic(input_size,nb_actions,hidden_size)

        # Create the online policy
        self.online_policy = Continuous_Policy(input_size,hidden_size,nb_actions)

        # Create the target policy
        self.target_policy = Continuous_Policy(input_size,hidden_size,nb_actions)
       
        # Polyak averaging weight
        self.polyak = polyak
        # Gaussian Noise for exploratory behaviour in training
        self.noise = Normal(torch.zeros(nb_actions),noise_dev)

        # Set the targets initial weights to equal the online weights
        self.target_net.load_state_dict(self.online_net.state_dict())
        self.target_policy.load_state_dict(self.online_policy.state_dict())

        # Future reward discount factor
        self.gamma = discount_factor
        
        # Use MSE loss function
        self.loss_function = torch.nn.MSELoss()

        # Critic and policy optimizer
        self.critic_optimizer = AdamW(self.online_net.parameters(),learning_rate)
        self.policy_optimizer = AdamW(self.online_policy.parameters(),learning_rate)

        # Create Replay Memory and assign batch_size
        self.replay = Replay_Memory(replay_memory_size)
        self.batch_size = batch_size
        
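        # Best weights for eval: clone the tensors, since state_dict() returns
        # references that would otherwise keep tracking the live parameters
        self.best_weights = {k: v.clone() for k, v in self.online_policy.state_dict().items()}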
        
    def act(self,obs,with_noise):
        # Put online policy into eval mode 
        self.online_policy.eval()
        # Don't track gradients when acting in the environment
        with torch.no_grad():
            action = self.online_policy(obs)
            # Add noise
            if with_noise:
                action += self.noise.sample()
        # Put the online policy back into train mode
        self.online_policy.train()
        return action

    # Store experience in replay memory
    def cache(self, state, action, reward, next_state, done):
        self.replay.push(state, action, reward, next_state, done)
    
    def train(self):
        self.online_policy.train()
        self.target_policy.train()
        self.online_net.train()
        self.target_net.train()
    
    def eval(self):
        self.online_policy.load_state_dict(self.best_weights)
        self.online_policy.eval()
        self.target_policy.eval()
        self.online_net.eval()
        self.target_net.eval()
        
    def update_model(self):
        
        # Get minibatch of data from experience buffer
        batch = self.replay.sample(self.batch_size)

        # If memory doesn't have enough transitions yet, skip the update
        if batch is None:
            return

        # Format batch to get a tensor of states, actions, rewards, next states and done booleans
        batch_tuple = Transition(*zip(*batch))
        state = torch.cat(batch_tuple.state,dim=0)
        action = torch.cat(batch_tuple.action,dim=0)
        reward = torch.cat(batch_tuple.reward,dim=0)
        next_state = torch.cat(batch_tuple.next_state,dim=0)
        done = torch.cat(batch_tuple.done,dim=0)

        # Freeze online net parameters as gradients should not be kept for policy loss
        for param in self.online_net.parameters():
            param.requires_grad = False

        policy_loss = -self.online_net(state,self.online_policy(state)).mean()

        # Perform policy update
        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()

        # Unfreeze online net parameters as gradients are needed
        for param in self.online_net.parameters():
            param.requires_grad = True

        # Get current Q values for critic loss
        Q_values = self.online_net(state,action).squeeze()

        # Get Q Targets
        Q_targets = reward + (1 - done.float()) * self.gamma*self.target_net(next_state,self.target_policy(next_state)).squeeze()

        # Calculate critic loss
        critic_loss = self.loss_function(Q_values, Q_targets.detach())
        
        self.critic_optimizer.zero_grad()
        
        # Calculate the gradients
        critic_loss.backward()
    
        # Perform critic update
        self.critic_optimizer.step()
        
    def update_best_weights(self):
        # Clone so the snapshot does not change as training continues
        self.best_weights = {k: v.clone() for k, v in self.online_policy.state_dict().items()}

    def update_target(self):
        """
        Use polyak averaging to update weights
        """
        with torch.no_grad():
            for online_policy_param, target_policy_param in zip(self.online_policy.parameters(), self.target_policy.parameters()):
                target_policy_param.data.mul_(1-self.polyak)
                target_policy_param.data.add_(self.polyak* online_policy_param.data)
            
            for online_net_param, target_net_param in zip(self.online_net.parameters(), self.target_net.parameters()):
                target_net_param.data.mul_(1-self.polyak)
                target_net_param.data.add_(self.polyak* online_net_param.data)
    
    
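The critic target used in `update_model` follows the usual one-step bootstrapped form, $y = r + \gamma (1 - d)\, Q'(s', \mu'(s'))$. A small numeric sketch with made-up rewards and target-critic outputs shows how the `done` flag removes the bootstrap term for terminal transitions.

```python
import torch

gamma = 0.9

# Made-up batch of three transitions; the last one ends the episode
reward = torch.tensor([1.0, 0.5, -100.0])
done = torch.tensor([False, False, True])

# Pretend outputs of the target critic for (next_state, target_policy(next_state))
next_q = torch.tensor([10.0, 20.0, 30.0])

# Terminal transitions keep only the immediate reward
q_targets = reward + (1 - done.float()) * gamma * next_q
print(q_targets)
```

The first two targets include the discounted next-state value (about 10.0 and 18.5), while the terminal transition's target is just its reward of -100.0.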

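Polyak averaging moves each target parameter a small step toward its online counterpart: $\theta' \leftarrow (1 - \tau)\,\theta' + \tau\,\theta$. The sketch below applies this rule to two stand-in linear layers with `tau = 0.005`; the layers and their initial values are illustrative assumptions.

```python
import torch
import torch.nn as nn

tau = 0.005

online = nn.Linear(3, 1)
target = nn.Linear(3, 1)

# Fix the weights so the effect of one update is easy to read off
with torch.no_grad():
    online.weight.fill_(1.0)
    online.bias.fill_(1.0)
    target.weight.fill_(0.0)
    target.bias.fill_(0.0)

# One soft update: target <- (1 - tau) * target + tau * online
with torch.no_grad():
    for online_param, target_param in zip(online.parameters(), target.parameters()):
        target_param.mul_(1 - tau)
        target_param.add_(tau * online_param)

print(target.weight, target.bias)
```

After one update every target value is about 0.005, so with a small `tau` the target network trails the online network slowly, which helps keep the bootstrapped targets stable. Note that `parameters()` does not include buffers such as batch-norm running statistics; whether to copy those as well is a design choice left open here.
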
## Training Loop
The last thing needed is the training loop.

In [None]:
from collections import deque
import gym
import torch

def train(agent, env, num_episodes, num_steps, log_every=20, render_from=1000):
    """
    :param agent: the agent to be trained.
    :param env: the gym environment.
    :param num_episodes: the number of episodes to train.
    :param num_steps: the max number of steps possible per episode.
    :param log_every: The frequency of logging. Default logs every 20 episodes.
    :param render_from: The episode number to start rendering to screen to allow one to view agent. Rendering significantly slows down training.
    """
    
    # Running Average Reward Memory
    running_avg_reward = deque(maxlen=100)
    # Start at -inf so that a negative running average can still count as the best so far
    best_running_avg = float("-inf")
    agent.train()

    for episode in range(1,num_episodes+1):
        # Starting state observation - unsqueeze to give batch dimension of 1
        obs = torch.tensor(env.reset(),dtype=torch.float32).unsqueeze(0)
        reward_total = 0
        for step in range(num_steps):
            if episode>render_from:
                env.render()

            # return chosen action
            action = agent.act(obs,with_noise=True)
            
            # Take a step in the environment with the action drawn
            next_obs , reward, done, info = env.step(action.squeeze().numpy())
            
            # Just for logging
            reward_total+=reward

            # change next state into tensor for update - give batch dimension of 1
            next_obs = torch.tensor(next_obs,dtype=torch.float32).unsqueeze(0)
            
            # change reward into tensor for update - give batch dimension of 1
            reward = torch.tensor(
                reward, dtype=torch.float32).unsqueeze(0)

            # Store transition in replay memory
            agent.cache(obs,action,reward,next_obs,torch.tensor(done).unsqueeze(0))

            # Perform update
            agent.update_model()
            
            # Update target networks
            agent.update_target()

            # Set the current state to the next state
            obs = next_obs
            
            # if done then log and break
            if done or step == num_steps-1:
                if episode % log_every ==0 and len(running_avg_reward)>0:
                    running_avg = sum(running_avg_reward)/len(running_avg_reward)
                    if running_avg > best_running_avg:
                        best_running_avg = running_avg
                        agent.update_best_weights()
                        
                    print("Episode {0:4d} finished after {1:4d} timesteps with a total reward of {2:3.1f} | Running Average: {3:3.2f}".format(episode,step+1,reward_total,running_avg))
                break
        running_avg_reward.append(reward_total)
        

    env.close()

In [None]:

def eval(agent, env, num_episodes, num_steps, log_every=20):
    """
    :param agent: the agent to be evaluated.
    :param env: the gym environment.
    :param num_episodes: the number of episodes to eval.
    :param num_steps: the max number of steps possible per episode.
    :param log_every: The frequency of logging. Default logs every 20 episodes.
    """
    agent.eval()
    # Running Average Reward Memory
    running_avg_reward = deque(maxlen=100)
    for episode in range(1,num_episodes+1):
        # Starting state observation
        obs = torch.tensor(env.reset(),dtype=torch.float32).unsqueeze(0)
        reward_total = 0
        for step in range(num_steps):
           
            env.render()
            # return chosen action
            action = agent.act(obs,with_noise=False)
            
            # Take a step in the environment, dropping the batch dimension as in training
            next_obs , reward, done, info = env.step(action.squeeze(0).numpy())
            
            # Just for logging
            reward_total+=reward

            # change next state into tensor for update
            next_obs = torch.tensor(next_obs,dtype=torch.float32).unsqueeze(0)
            
            # Set the current state to the next state
            obs = next_obs
           
            # if done then log and break
            if done or step == num_steps-1:
                if episode % log_every ==0 and len(running_avg_reward)>0:
                    running_avg = sum(running_avg_reward)/len(running_avg_reward)
                    print("Episode {0:4d} finished after {1:4d} timesteps with a total reward of {2:3.1f} | Running Average: {3:3.2f}".format(episode,step+1,reward_total,running_avg))
                break
        running_avg_reward.append(reward_total)
    env.close()

## Time to Train

In [None]:
env = gym.make("LunarLanderContinuous-v2")

nb_actions = env.action_space.shape[0]

# Hyper parameters
HIDDEN_SIZE = 256
GAMMA = 0.99
LEARNING_RATE = 0.001
EPISODES_TO_TRAIN = 1000
EPISODES_TO_EVAL = 20
MAX_STEPS = 200
LOG_EVERY = 20
RENDER_FROM_EP = 2000
BATCH_SIZE=100
REPLAY_MEMORY_SIZE=1000000
POLYAK = 0.005
NOISE_STD_DEV = 0.1

agent = DDPG_Agent(env.observation_space.shape[0],HIDDEN_SIZE,nb_actions,GAMMA,LEARNING_RATE,BATCH_SIZE,REPLAY_MEMORY_SIZE,POLYAK,NOISE_STD_DEV)

print("Training...")

train(agent,env,EPISODES_TO_TRAIN,MAX_STEPS,LOG_EVERY,RENDER_FROM_EP)

print("Evaluating...")

eval(agent,env,EPISODES_TO_EVAL,MAX_STEPS,log_every=1)