This is Part 7 of the Deep Reinforcement Learning Notebook series. This notebook introduces Deep Deterministic Policy Gradients (DDPG).

The series covers deep RL algorithms only, so it excludes other techniques that can be used to learn functions in reinforcement learning. It is also not exhaustive: it covers only the most widely used deep RL algorithms.

## What is the DDPG Algorithm?


DDPG is an extension of Deep Q-Learning (DQN) to continuous action spaces. It was introduced in the paper "Continuous Control with Deep Reinforcement Learning" (Lillicrap et al., 2016). DDPG is an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces.

In a previous notebook we saw that DQN can be improved by using Double DQN and target networks. If you want to read more about Double DQN and target networks, check out this link: https://github.com/Rahul-Choudhary-3614/Deep-Reinforcement-Learning-Notebooks/blob/master/Deep_Reinforcement_Learning_Part_5_.ipynb

DDPG borrows the replay buffer and target networks from DQN to make training more stable.

An obvious approach to adapting deep reinforcement learning methods such as DQN to continuous domains is to simply discretize the action space. However, this has many limitations, most notably the curse of dimensionality: the number of actions increases exponentially with the number of degrees of freedom. For example, a 7-degree-of-freedom system (as in the human arm) with the coarsest discretization $a_i \in \{-k, 0, k\}$ for each joint leads to an action space with $3^7 = 2187$ discrete actions. The situation is even worse for tasks that require fine control of actions, as they require a correspondingly finer-grained discretization, leading to an explosion in the number of discrete actions. Such large action spaces are difficult to explore efficiently, and thus successfully training DQN-like networks in this context is likely intractable. Additionally, naive discretization of action spaces needlessly throws away information about the structure of the action domain, which may be essential for solving many problems.
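
The growth in the number of discrete actions can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `num_discrete_actions` and the bin counts other than 3 are assumptions chosen for the example.

```python
def num_discrete_actions(dof: int, bins: int) -> int:
    """Number of joint actions when each of `dof` joints is split into `bins` levels."""
    return bins ** dof

# 3 levels per joint is the coarsest discretization {-k, 0, k} from the text;
# 5 and 11 levels are hypothetical finer grids.
for bins in (3, 5, 11):
    print(f"{bins} levels per joint, 7 joints -> {num_discrete_actions(7, bins):,} actions")
```

Even a modest refinement from 3 to 11 levels per joint moves the action count from thousands to tens of millions, which is why a DQN-style output layer with one unit per action does not scale here.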

DDPG is a model-free, off-policy actor-critic algorithm that uses deep function approximators to learn policies in high-dimensional, continuous action spaces.

DDPG combines DQN with policy gradient methods in the actor-critic framework: it learns a deterministic policy $\mu_\theta(s)$ that approximates the greedy action of Q-learning, guided by a DQN-like critic $Q_\omega(s,a)$.


## Deterministic Policy Gradients (DPG)
DPG was introduced in the paper Deterministic Policy Gradient Algorithms (Silver et al., 2014).

A policy can be either deterministic or stochastic. A deterministic policy maps states directly to actions: you give it a state and the function returns the action to take. A stochastic policy, on the other hand, outputs a probability distribution over actions.
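
The difference can be illustrated with a small NumPy sketch. The weight matrix `W`, the `tanh` squashing, and the Gaussian with a fixed standard deviation are all assumptions made for the example, not part of DDPG itself.

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.array([[0.5, -0.2, 0.1]])  # hypothetical weights: 3-dim state -> 1-dim action

def deterministic_policy(state):
    """mu(s): the same state always maps to the same action."""
    return np.tanh(W @ state)

def stochastic_policy(state, std=0.3):
    """pi(a|s): a Gaussian centred on mu(s); every call draws a new sample."""
    return rng.normal(loc=deterministic_policy(state), scale=std)

s = np.array([1.0, 0.0, -1.0])
a1, a2 = deterministic_policy(s), deterministic_policy(s)
b1, b2 = stochastic_policy(s), stochastic_policy(s)
print("deterministic:", a1, a2)  # identical
print("stochastic:   ", b1, b2)  # two different samples
```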

Deterministic policy gradient methods tend to be more sample-efficient than stochastic ones: the deterministic gradient integrates only over the state space, whereas the stochastic policy gradient integrates over both the state and action spaces. The DPG paper (Silver et al., 2014) showed that the deterministic policy gradient is the expected gradient of the action-value function, and introduced a deterministic version of the policy gradient theorem that gives an expression for $\nabla_\theta J(\theta)$ that does not require the derivative of the state distribution $\rho^\beta$.

![alt text](https://github.com/Machine-Learning-rc/Unimportant/blob/master/Screenshot%202020-08-14%20at%201.05.52%20AM.png?raw=true)
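
In case the image above does not render, the off-policy deterministic policy gradient from Silver et al. (2014) is commonly written as follows. The notation follows this notebook: $\mu_\theta$ is the actor, $Q_\omega$ is the critic, and $\rho^\beta$ is the state distribution induced by the behaviour policy $\beta$.

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{s \sim \rho^\beta}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q_\omega(s,a)\big|_{a=\mu_\theta(s)}\right]$$

The expectation is taken over states only, which is the reason no integral over the action space appears.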






## Batch Normalization
When learning from low-dimensional feature-vector observations, the components of the observation may have different physical units (for example, positions versus velocities), and their ranges may vary across environments. This can make it difficult for the network to learn effectively and to find hyperparameters that generalise across environments with different scales of state values. DDPG addresses this issue by adopting a technique called batch normalization.

This technique normalizes each dimension across the samples in a minibatch to have zero mean and unit variance. In addition, it maintains a running average of the mean and variance to use for normalization during testing (in our case, during exploration or evaluation). In deep networks, it is used to reduce internal covariate shift during training by ensuring that each layer receives whitened input. In the low-dimensional case, the DDPG paper applies batch normalization to the state input, to all layers of the $\mu$ network, and to all layers of the $Q$ network prior to the action input.
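
The normalization and the running averages can be sketched in NumPy. This is a simplified illustration, not the Keras layer: the class name `SimpleBatchNorm` is hypothetical, and the learned scale and shift parameters of real batch normalization are left out.

```python
import numpy as np

class SimpleBatchNorm:
    """Hypothetical minimal batch norm without the learned scale and shift."""

    def __init__(self, dim, momentum=0.99, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = np.zeros(dim)
        self.running_var = np.ones(dim)

    def __call__(self, x, training):
        if training:
            # Statistics of the current minibatch, one value per feature.
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Running averages are kept for use at exploration/evaluation time.
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mean, var = self.running_mean, self.running_var
        return (x - mean) / np.sqrt(var + self.eps)

rng = np.random.default_rng(0)
# Two features on very different scales, e.g. an angle term and an angular velocity.
batch = np.column_stack([rng.uniform(-1, 1, 64), rng.uniform(-8, 8, 64)])
bn = SimpleBatchNorm(dim=2)
normalized = bn(batch, training=True)
print(normalized.mean(axis=0), normalized.std(axis=0))
```

After normalization both features have roughly zero mean and unit standard deviation within the minibatch, regardless of their original ranges.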



## Exploration in Continuous Action Spaces
Deterministic policies have a harder time attaining sufficient exploration than stochastic policies, and exploration is a major challenge in continuous action spaces. An advantage of off-policy algorithms such as DDPG is that we can treat the problem of exploration independently from the learning algorithm. The DDPG paper constructs an exploration policy $\mu'$ by adding noise sampled from a noise process $\mathcal{N}$ to the actor policy:

$\mu'(s_t) = \mu(s_t \mid \theta_t^\mu) + \mathcal{N}$

$\mathcal{N}$ can be chosen to suit the environment. The $\mathcal{N}$ used in the DDPG paper is an Ornstein-Uhlenbeck process, so the exploration noise is temporally correlated. Other noise processes, such as uncorrelated Gaussian noise, can also be used and often work comparably well.
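
To make "temporally correlated" concrete, the sketch below compares the lag-1 autocorrelation of a discretized Ornstein-Uhlenbeck process with that of independent Gaussian noise. The helper names `ou_noise` and `lag1_autocorr`, the seeds, and the number of steps are assumptions for this example; `theta=0.15` and `sigma=0.2` match the defaults of the `OUNoise` class used later in this notebook.

```python
import numpy as np

def ou_noise(n_steps, theta=0.15, sigma=0.2, mu=0.0, seed=0):
    """Discretized Ornstein-Uhlenbeck process: x <- x + theta*(mu - x) + sigma*N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps)
    state = mu
    for t in range(n_steps):
        state = state + theta * (mu - state) + sigma * rng.standard_normal()
        x[t] = state
    return x

def lag1_autocorr(x):
    """Correlation between consecutive samples of a 1-D sequence."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

ou = ou_noise(5000)
gauss = np.random.default_rng(1).normal(0.0, 0.2, size=5000)
print("OU lag-1 autocorrelation:      ", round(lag1_autocorr(ou), 3))
print("Gaussian lag-1 autocorrelation:", round(lag1_autocorr(gauss), 3))
```

Consecutive OU samples are strongly correlated, so the perturbation pushes the action in a similar direction for several steps, while consecutive Gaussian samples are essentially uncorrelated.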

## DDPG

![alt text](https://github.com/Machine-Learning-rc/Unimportant/blob/master/Screenshot%202020-08-14%20at%201.22.35%20AM.png?raw=true)
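
In case the image above does not render, the core updates of DDPG as given in Lillicrap et al. (2016) can be summarized as follows. The notation is the paper's: $Q'$ and $\mu'$ with parameters $\theta^{Q'}$ and $\theta^{\mu'}$ denote the target networks here (not the exploration policy), $N$ is the minibatch size, and $\tau \ll 1$ is the soft-update rate (stored as `alpha` in the code of this notebook).

$$y_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)$$

$$L = \frac{1}{N} \sum_i \big(y_i - Q(s_i, a_i \mid \theta^Q)\big)^2$$

$$\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_i,\, a=\mu(s_i)} \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s=s_i}$$

$$\theta^{Q'} \leftarrow \tau \theta^Q + (1-\tau)\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^\mu + (1-\tau)\theta^{\mu'}$$

The critic is trained by minimizing $L$, the actor follows the sampled policy gradient, and the target networks slowly track the learned networks.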


# IMPLEMENTING DDPG

The code below sets up the environment required to run and record the game, and loads the required libraries.

In [None]:
!apt-get update > /dev/null 2>&1
!apt-get install cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

In [None]:
import tensorflow as tf
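from tensorflow.keras.layers import Dense,Dropout,Conv2D, Flatten,MaxPooling2D ,Activation,Input,BatchNormalization,Concatenate
from tensorflow.keras.models import Sequential,load_model,Model
from tensorflow.keras.optimizers import Adam # used when compiling the Actor and Critic models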
import gym
import numpy as np
import random
from collections import deque
from tensorflow.keras.utils import normalize as normal_values
import cv2
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #error only
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
from IPython.display import HTML
from IPython.display import clear_output
from IPython import display as ipythondisplay

In [None]:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

In [None]:
"""
Utility functions to enable video recording of gym environment and displaying it
To enable video, just do "env = wrap_env(env)"
"""

def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

This part makes the code below reproducible by fixing a random seed, and sets up the environment.

In [None]:
RANDOM_SEED=1
# random seed (reproduciblity)
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

# set the env
env = (gym.make("Pendulum-v0")) # env to import
#env=wrap_env(env)  #use this when you want to record a video of episodes
env.seed(RANDOM_SEED)
env.reset() # reset to env

State_Space = env.observation_space.shape[0]
actions_Space = env.action_space.shape[0]

upper_bound = env.action_space.high[0]
lower_bound = env.action_space.low[0]

Defining the OUNoise Class.

In [None]:
# --------------------------------------
# Ornstein-Uhlenbeck Noise
# Author: Flood Sung
# Date: 2016.5.4
# Reference: https://github.com/rllab/rllab/blob/master/rllab/exploration_strategies/ou_strategy.py
# --------------------------------------

import numpy as np
import numpy.random as nr

class OUNoise:
    """Ornstein-Uhlenbeck process that generates temporally correlated exploration noise."""
    def __init__(self,action_dimension,mu=0, theta=0.15, sigma=0.2):
        self.action_dimension = action_dimension
        self.mu = mu
        self.theta = theta
        self.sigma = sigma
        self.state = np.ones(self.action_dimension) * self.mu
        self.reset()

    def reset(self):
        self.state = np.ones(self.action_dimension) * self.mu

    def noise(self):
        x = self.state
        dx = self.theta * (self.mu - x) + self.sigma * nr.randn(len(x))
        self.state = x + dx
        return self.state

if __name__ == '__main__':
    ou = OUNoise(3)
    states = []
    for i in range(1000):
        states.append(ou.noise())
    import matplotlib.pyplot as plt

    plt.plot(states)
    plt.show()

Defining the DDPG class. At initialization, the DDPG object sets a few parameters such as the environment, the action bounds, and the exploration noise; creates the Actor, target Actor, Critic, and target Critic models; and allocates the replay buffer. The remember method records the observations of each step in that buffer.

In [None]:
class DDPG:
 
  def __init__(self,env,upper_bound,lower_bound,buffer_capacity=50000,p_1=None,p_2=None,p_3=None,p_4=None):
    self.env=env #import env
    self.upper_bound=upper_bound
    self.lower_bound=lower_bound
    self.state_shape=env.observation_space.shape[0] # the state space
    self.action_shape=env.action_space.shape[0] # the action space
    self.gamma=0.99 # discount factor for future rewards
    self.learning_rate= 1e-4 # learning rate of the Adam optimizers
    self.alpha=0.005 # soft-update rate of the target networks (tau in the DDPG paper)
    self.batch_size=64
    self.exploration_noise=OUNoise(self.action_shape)
    if not p_1:
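      self.Actor_model=self._create_model('Actor')
      self.target_Actor_model=self._create_model('Actor') 
      self.Critic_model=self._create_model('Critic')      
      self.target_Critic_model=self._create_model('Critic')  
      # start the target networks as exact copies of the learned networks
      self.target_Actor_model.set_weights(self.Actor_model.get_weights())
      self.target_Critic_model.set_weights(self.Critic_model.get_weights())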
    else:
      self.Actor_model=load_model(p_1) 
      self.Critic_model=load_model(p_2) 
      self.target_Actor_model=load_model(p_3)
      self.target_Critic_model=load_model(p_4) 
    
    self.buffer_capacity = buffer_capacity
    self.buffer_counter = 0
    # replay buffer: one row per stored transition
    self.states=np.zeros((self.buffer_capacity, self.state_shape))
    self.rewards=np.zeros((self.buffer_capacity,1))
    self.dones= np.zeros((self.buffer_capacity, 1))
    self.actions=np.zeros((self.buffer_capacity, self.action_shape))
    self.next_states=np.zeros((self.buffer_capacity, self.state_shape))
  
  def remember(self,state, action, reward,next_state,done):
    '''stores observations'''
    index = self.buffer_counter % self.buffer_capacity
    self.states[index] = state
    self.rewards[index]=reward
    self.dones[index]=done
    self.actions[index]=action
    self.next_states[index]=next_state
    self.buffer_counter += 1

Creating a Neural Network Model (Actor and Critic)

In [None]:
def _create_model(self,model_type):
 
    ''' builds the Actor or the Critic network using keras'''
 
    # state dimension is taken from the environment instead of being hard-coded
    state_input = Input(shape=(self.state_shape,))
    layer_1=Dense(512,activation="relu")(state_input)
    layer_2=BatchNormalization()(layer_1)
    layer_3=Dense(512,activation="relu")(layer_2)
    layer_4=BatchNormalization()(layer_3)
    layer_5=Dense(64,activation="relu")(layer_4)
    layer_6=BatchNormalization()(layer_5)

    if model_type=='Actor':
      # tanh keeps the output in [-1, 1]; multiplying by upper_bound maps it to the action range
      output = Dense(self.action_shape, activation='tanh',kernel_initializer=tf.keras.initializers.RandomUniform(minval=-0.003, maxval=0.003,seed=1))(layer_6)
      output=output * self.upper_bound
      model = Model(inputs=[state_input],outputs=[output])
      model.compile(optimizer=Adam(learning_rate=self.learning_rate), loss="mse")
    else:
      # the Critic receives the action as a second input and merges it with the state features
      action_input = Input(shape=(self.action_shape,))
      action_layer_1 = Dense(128, activation="relu")(action_input)
      action_layer_2 = BatchNormalization()(action_layer_1 )
      action_layer_3 = Dense(64, activation="relu")(action_layer_2)
      action_layer_4 = BatchNormalization()(action_layer_3)
      concat = Concatenate()([layer_6,action_layer_4])

      concat_layer_1=Dense(256,activation="relu")(concat)
      concat_layer_2=BatchNormalization()(concat_layer_1)

      output = Dense(1)(concat_layer_2)

      model = Model(inputs=[state_input, action_input],outputs=[output])
      model.compile(optimizer=Adam(learning_rate=self.learning_rate), loss="mse")
    return model

# attach the function as a method of the DDPG class defined in the cell above
DDPG._create_model = _create_model

Action Selection

The get_action method guides the action choice. It uses the Actor network to get an action for a given state. During training, we add noise to the action for exploration. We then clip the action so that it does not go outside the legal action range.

In [None]:
def get_action(self, state,status="training"):
    '''returns the Actor's deterministic action for the given state and
      adds exploration noise when status is "training"'''
    action = self.Actor_model.predict(state)
    
    if status=="training":
        action = action + self.exploration_noise.noise()

    # We make sure action is within bounds
    legal_action = np.clip(action, self.lower_bound, self.upper_bound)

    return legal_action

# attach the function as a method of the DDPG class defined above
DDPG.get_action = get_action

Updating networks

We update the critic and the actor using the losses of the DDPG algorithm discussed earlier, and then apply soft updates to the target networks.
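
Before looking at the TensorFlow version, the critic target and the soft update can be sketched on toy NumPy arrays. The function names `td_target` and `soft_update` and all numbers below are hypothetical and only illustrate the arithmetic; the defaults `gamma=0.99` and `tau=0.005` mirror `self.gamma` and `self.alpha` in the DDPG class.

```python
import numpy as np

def td_target(rewards, dones, next_q, gamma=0.99):
    """y = r + gamma * (1 - done) * Q'(s', mu'(s')); terminal steps keep only the reward."""
    return rewards + gamma * (1.0 - dones) * next_q

def soft_update(target_weights, online_weights, tau=0.005):
    """Return tau * online + (1 - tau) * target for every weight array."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]

rewards = np.array([[1.0], [0.5]])
dones = np.array([[0.0], [1.0]])      # the second transition ended its episode
next_q = np.array([[10.0], [10.0]])   # stand-in for the target critic's output
targets = td_target(rewards, dones, next_q)
print(targets)

online = [np.ones((2, 2)), np.ones(2)]   # stand-in for model.get_weights()
target = [np.zeros((2, 2)), np.zeros(2)]
target = soft_update(target, online)
print(target[0])
```

Working on a plain Python list of arrays, one layer at a time, avoids stacking weight arrays of different shapes into a single NumPy array.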

In [None]:
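def update_models(self):
    '''
    Samples a minibatch from the replay buffer and updates the Critic and the Actor.
    '''
    record_range = min(self.buffer_counter, self.buffer_capacity)
    batch_indices = np.random.choice(record_range,self.batch_size)

    # cast to float32 so the minibatch matches the dtype of the networks
    states_mb=tf.convert_to_tensor(self.states[batch_indices],dtype=tf.float32)
    actions_mb=tf.convert_to_tensor(self.actions[batch_indices],dtype=tf.float32)
    next_states_mb=tf.convert_to_tensor(self.next_states[batch_indices],dtype=tf.float32)
    rewards_mb=tf.convert_to_tensor(self.rewards[batch_indices],dtype=tf.float32)
    dones_mb=tf.convert_to_tensor(self.dones[batch_indices],dtype=tf.float32)

    # create one optimizer per network on the first call and reuse it afterwards,
    # so that Adam keeps its running moment estimates between updates
    if not hasattr(self,'critic_optimizer'):
      self.critic_optimizer = tf.keras.optimizers.Adam(learning_rate=self.learning_rate)
      self.actor_optimizer = tf.keras.optimizers.Adam(learning_rate=self.learning_rate)

    def train_step_critic(states,actions,next_states,rewards,dones):
      # the TD target only involves the target networks
      target_actions=self.target_Actor_model(next_states)
      next_state_critic_value=self.target_Critic_model([next_states,target_actions])
      targets=rewards+(1-dones)*self.gamma*next_state_critic_value
      with tf.GradientTape() as tape:
        # Q(s, a) for the stored state-action pairs; training=True lets batch norm use batch statistics
        critic_value = self.Critic_model([states, actions],training=True)
        critic_loss = tf.reduce_mean((targets-critic_value)**2)
      print("Critic Loss:",critic_loss)
      critic_grad = tape.gradient(critic_loss, self.Critic_model.trainable_variables)
      self.critic_optimizer.apply_gradients(zip(critic_grad, self.Critic_model.trainable_variables))
    train_step_critic(states_mb,actions_mb,next_states_mb,rewards_mb,dones_mb)

    def train_step_actor(states):
      with tf.GradientTape() as tape:
        actions=self.Actor_model(states,training=True)
        critic_value = self.Critic_model([states, actions],training=True)
        # maximizing Q(s, mu(s)) is the same as minimizing its negative
        actor_loss = -tf.reduce_mean(critic_value)
      print("Actor Loss:",actor_loss)
      actor_grad = tape.gradient(actor_loss, self.Actor_model.trainable_variables)
      self.actor_optimizer.apply_gradients(zip(actor_grad, self.Actor_model.trainable_variables))
    train_step_actor(states_mb)

def update_target_models(self):
    '''Soft update: target = alpha*learned + (1-alpha)*target, applied layer by layer.'''
    actor_weights = self.Actor_model.get_weights()
    actor_target_weights = self.target_Actor_model.get_weights()
    new_weights = [self.alpha*w + (1-self.alpha)*w_t for w, w_t in zip(actor_weights, actor_target_weights)]
    self.target_Actor_model.set_weights(new_weights)

    critic_weights = self.Critic_model.get_weights()
    critic_target_weights = self.target_Critic_model.get_weights()
    new_weights = [self.alpha*w + (1-self.alpha)*w_t for w, w_t in zip(critic_weights, critic_target_weights)]
    self.target_Critic_model.set_weights(new_weights)

# attach the functions as methods of the DDPG class defined above
DDPG.update_models = update_models
DDPG.update_target_models = update_target_models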

Training and evaluating the model

The train method runs the training loop. Iterating through a set number of episodes, it uses the model to select actions and play them. After every environment step, the transition is stored and a minibatch of recorded observations is used to update the networks.

The evaluate method helps us keep an eye on the model's performance.


In [None]:
def evaluate(self):
  '''Runs 20 episodes without exploration noise and prints the running average reward.'''
  env = gym.make("Pendulum-v0") # separate evaluation environment, created once
  Average_Reward=0.0
  for episode in range(20):
    state_=(env.reset()).reshape((1,self.state_shape))
    done=False
    episode_reward=0  
    print("Episode Started")
    while not done:
      action=self.get_action(state_,"testing")
      # action has shape (1, action_dim); the environment expects shape (action_dim,)
      next_state, reward, done, info=env.step(action[0]) 
      state_=next_state.reshape((1,self.state_shape)) # act on the new state at the next step
      episode_reward+=reward
    Average_Reward+=episode_reward
    print("Testing_Episode:{}  Reward:{} Average_Reward:{} \n\n".format(episode,episode_reward,Average_Reward/(episode+1)))
    print("Episode Ended")


def train(self,episodes):
  '''Collects experience with the noisy policy and updates the networks after every step.'''
  env=self.env
  for episode in range(episodes):
    state_=(env.reset()).reshape((1,self.state_shape))
    done=False
    episode_reward=0  
    print("Episode Started")
    while not done:
      action=self.get_action(state_)
      next_state, reward, done, info=env.step(action[0])
      next_state=next_state.reshape((1,self.state_shape)) 
      self.remember(state_, action, reward,next_state,done)
      self.update_models()
      self.update_target_models()
      state_ = next_state
      episode_reward+=reward
    print("Episode:{}  Reward:{}\n".format(episode,episode_reward))
    print("Episode Ended")
    if episode%100==0 and episode!=0:
      # save all four networks under the same episode index
      self.Actor_model.save('Actor_{}.h5'.format(episode))
      self.target_Actor_model.save('target_Actor_{}.h5'.format(episode))
      self.Critic_model.save('Critic_{}.h5'.format(episode))
      self.target_Critic_model.save('target_Critic_{}.h5'.format(episode))
      print("\n\n","Evaluating")
      self.evaluate()

# attach the functions as methods of the DDPG class defined above
DDPG.evaluate = evaluate
DDPG.train = train
      

In [None]:
Agent=DDPG(env,upper_bound,lower_bound)
Agent.train(episodes=300)