# What is the purpose of Double Dueling DQN?
* A regular DQN tends to overestimate the Q-values of potential actions in a given state.
* Once one specific action becomes overestimated, it is more likely to be chosen in the next iteration, which makes it very 
hard for the agent to explore the environment uniformly and find the right policy.
* To counter this, we use our primary network to select an action and a target network to generate the Q-value for that action.
* To keep the two networks synchronized, we copy the weights from the primary network to the target network every 
`n` training steps.
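
The split between action selection and action evaluation described above can be sketched in NumPy. This is only an illustration under stated assumptions, not the training code of this notebook: the function name `double_dqn_targets`, the discount factor `gamma=0.99`, and the example arrays are all made up for the sketch.

```python
import numpy as np

def double_dqn_targets(q_primary_next, q_target_next, rewards, dones, gamma=0.99):
    """Compute Double DQN targets for a batch of transitions.

    q_primary_next : (batch, n_actions) next-state Q-values from the primary network
    q_target_next  : (batch, n_actions) next-state Q-values from the target network
    rewards        : (batch,) rewards received
    dones          : (batch,) 1.0 where the episode ended, else 0.0
    gamma          : discount factor (assumed value, not taken from the notebook)
    """
    # The primary network selects the greedy action in the next state ...
    best_actions = np.argmax(q_primary_next, axis=1)
    # ... and the target network supplies the Q-value of that action.
    batch_idx = np.arange(q_target_next.shape[0])
    next_q = q_target_next[batch_idx, best_actions]
    # Terminal transitions do not bootstrap from the next state.
    return rewards + gamma * next_q * (1.0 - dones)

# Hypothetical batch of two transitions with two actions each
q_primary_next = np.array([[1.0, 2.0], [0.5, 0.1]])
q_target_next = np.array([[0.3, 0.7], [0.9, 0.2]])
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])
targets = double_dqn_targets(q_primary_next, q_target_next, rewards, dones)
```

With Keras models, the periodic synchronization could plausibly be done with `target.set_weights(primary.get_weights())`, where `primary` and `target` are hypothetical names for the two models.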


In [3]:
import tensorflow as tf
import gym
import random
import numpy as np
import cv2

We will be using the Pong environment.

In [2]:
ENV_NAME = 'PongDeterministic-v4' 

# Preprocessing
We downscale each frame and convert it to grayscale, which shrinks the network input and speeds up training. 
<br>
You can use OpenCV, skimage, or even TensorFlow's built-in image methods.


In [4]:
class ProcessFrame:
    """Convert to GrayScale and resize"""
    def __init__(self, rows = 84, cols = 84):
        """
        Args:
            :param rows : Height to be resized to  
            :param cols : Width to be resized to
        """
        self.rows = rows
        self.cols = cols
    def process_frame(self, frame):
        """
        Converts the frame to grayscale, resizes it, and scales it to [0, 1] using OpenCV
        :param frame : An Atari game-play frame of (210,160,3)
        :return:      A processed frame of (rows,cols), i.e. (84,84) by default
        """
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        # cv2.resize expects dsize as a (width, height) tuple
        resized = cv2.resize(gray, (self.cols, self.rows))
        preprocessed_frame = resized / 255.0
        return preprocessed_frame

# Making the dueling network
Refer to [Wang et al. 2016](https://arxiv.org/abs/1511.06581) for more info. 


The architecture used :
* The first convolutional layer has   32    8x8 filters with stride 4
* The second convolutional layer has  64    4x4 filters with stride 2
* The third convolutional layer has   64    3x3 filters with stride 1
* The fourth convolutional layer has  1024  7x7 filters with stride 1

Instead of directly predicting a single Q-value for each action, the dueling architecture 
splits the output of the final convolutional layer into two streams. The value stream predicts 
a state value V(s) that depends only on the state, and the advantage stream predicts action 
advantages A(s,a) that depend on both the state and the respective action.

An excerpt from the original paper :

> Intuitively, the dueling architecture can learn which states are (or are not) valuable, without 
having to learn the effect of each action for each state. This is particularly useful in states 
where its actions do not affect the environment in any relevant way. In the experiments, we 
demonstrate that the dueling architecture can more quickly identify the correct action during 
policy evaluation as redundant or similar actions are added to the learning problem.

The state value V(s) indicates whether the state is favourable, while the action advantage A(s,a)
indicates which action is favourable in that state.  

Now to combine the state value and the action advantages : 

> Next, we have to combine the value and advantage stream into $Q$-values $Q(s,a)$. This is done 
the following way :
\begin{equation}
Q(s,a) = V(s) + \left(A(s,a) - \frac 1{| \mathcal A |}\sum_{a'}A(s, a')\right)
\end{equation}
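
The aggregation formula above can be checked on a small example. The following NumPy sketch assumes a value array of shape `(batch, 1)` and an advantage array of shape `(batch, n_actions)`; the function name `combine_streams` and the numbers are hypothetical.

```python
import numpy as np

def combine_streams(value, advantage):
    """Q(s,a) = V(s) + (A(s,a) - mean over a' of A(s,a')).

    value     : (batch, 1) state values
    advantage : (batch, n_actions) action advantages
    """
    # Subtracting the mean advantage centres the advantages around zero,
    # so the mean of Q over actions equals V(s).
    return value + (advantage - advantage.mean(axis=1, keepdims=True))

value = np.array([[2.0]])
advantage = np.array([[1.0, 3.0, 2.0, 2.0]])
q_values = combine_streams(value, advantage)
```

Because the mean advantage is removed, the action with the largest advantage is also the action with the largest Q-value.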


To give the network a sense of motion, we stack 4 consecutive frames and pass the stack to the network as a single state.
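
Frame stacking can be sketched with a fixed-length queue. This is a minimal illustration, assuming preprocessed frames of shape `(84, 84)`; the class name `FrameStack` and its methods are hypothetical and not part of the notebook.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keeps the most recent `state_length` frames and exposes them as one state."""
    def __init__(self, state_length=4):
        self.state_length = state_length
        self.frames = deque(maxlen=state_length)

    def reset(self, frame):
        # At the start of an episode, repeat the first frame to fill the stack.
        for _ in range(self.state_length):
            self.frames.append(frame)
        return self.state()

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return self.state()

    def state(self):
        # Stack along the last axis: (84, 84) frames -> (84, 84, state_length)
        return np.stack(list(self.frames), axis=-1)

stack = FrameStack(state_length=4)
state = stack.reset(np.zeros((84, 84)))
state = stack.push(np.ones((84, 84)))
```

The newest frame ends up in the last channel, so the channel order matches the order in which the frames were observed.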

Let's now implement this in a class. 


In [9]:
class DQN:
    """Implementation of the brain"""
    def __init__(self, n_actions, stream,learning_rate = 0.00025, state_length = 4):
        """
        Init all required variables for class
        :param n_actions      : Number of actions available in the environment 
        :param stream         : Type of stream - 'v' or 'a' for Value and Advantage respectively
        :param learning_rate  : The specific lr argument for the Adam optimizer
        :param state_length   : The number of frames which create a state
        """
        self.n_actions = n_actions
        self.learning_rate = learning_rate
        self.agent_history_length = state_length
        self.stream = stream
        self.input_shape = [84, 84, self.agent_history_length]
        self.make_model()
        
    def make_model(self):
        """
        Creates the neural network responsible
        :return: None
        """
        self.model = tf.keras.models.Sequential([
            tf.keras.layers.Conv2D(filters=32, kernel_size=[8,8], strides=4,
                                   input_shape=self.input_shape, 
                                   activation='relu',
                                   kernel_initializer=tf.keras.initializers.VarianceScaling(scale=2)),
            tf.keras.layers.Conv2D(filters=64, kernel_size=[4,4], strides=2,
                                   activation='relu',
                                   kernel_initializer=tf.keras.initializers.VarianceScaling(scale=2)),
            tf.keras.layers.Conv2D(filters=64, kernel_size=[3,3], strides=1,
                                   activation='relu',
                                   kernel_initializer=tf.keras.initializers.VarianceScaling(scale=2)),
            tf.keras.layers.Conv2D(filters=1024, kernel_size=[7,7], strides=1,
                                   activation='relu',
                                   kernel_initializer=tf.keras.initializers.VarianceScaling(scale=2)),
            tf.keras.layers.Flatten()
        ])
        if self.stream == 'v':
            self.model.add(
                tf.keras.layers.Dense(units=1,
                                      kernel_initializer=tf.keras.initializers.VarianceScaling(scale=2)))
        elif self.stream == 'a':
            self.model.add(
                tf.keras.layers.Dense(units=self.n_actions,
                                      kernel_initializer=tf.keras.initializers.VarianceScaling(scale=2)))
            
        # `learning_rate` replaces the deprecated `lr` argument of the Adam optimizer
        self.model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate),
                           loss=tf.keras.losses.mean_squared_error,
                           metrics=['accuracy'])
        self.model.summary()
Value = DQN(4, 'v')
Advantage = DQN(4, 'a')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_8 (Conv2D)            (None, 20, 20, 32)        8224      
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 9, 9, 64)          32832     
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 7, 7, 64)          36928     
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 1, 1, 1024)        3212288   
_________________________________________________________________
flatten (Flatten)            (None, 1024)              0         
_________________________________________________________________
dense (Dense)                (None, 1)                 1025      
Total params: 3,291,297
Trainable params: 3,291,297
Non-trainable params: 0
_________________________________________________________________


# Making the Get Action Class

For the first 50,000 frames the agent only explores (ϵ=1). Over the following 1 million frames, ϵ is linearly
decreased to 0.1.

After that, ϵ is decreased linearly to 0.01 over the remaining frames, as suggested by the OpenAI Baselines for DQN.
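
The schedule can be sketched as a piecewise-linear function of the frame number. This is a minimal illustration using `np.interp`; the function name `epsilon_at` is hypothetical, and the breakpoints are assumed to be absolute frame numbers with the same default values as the class in the next cell.

```python
import numpy as np

def epsilon_at(frame_number, train_start=50e3, train_middle=1e6, train_max=25e6,
               eps_initial=1.0, eps_middle=0.1, eps_final=0.01):
    """Piecewise-linear epsilon: constant, then two linear decay segments."""
    # np.interp clamps outside the breakpoints, so epsilon stays at eps_initial
    # before train_start and at eps_final after train_max.
    return float(np.interp(frame_number,
                           [train_start, train_middle, train_max],
                           [eps_initial, eps_middle, eps_final]))

# Epsilon at a few frame numbers along the schedule
schedule = [epsilon_at(f) for f in (0, 50_000, 525_000, 1_000_000, 25_000_000)]
```

Clamping at both ends avoids an undefined epsilon once training runs past the last breakpoint.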


In [10]:
class GetAction:
    """To implement the Epsilon Greedy Policy for returning actions"""
    def __init__(self, n_actions, eps_initial=1, eps_middle=0.1, eps_final=0.01, eps_eval=0,
                 train_start=50e3, train_middle=1e6, train_max=25e6):
        """
        Init all required variables for class
        :param n_actions     : Action Space for the environment
        :param eps_initial   : Start  Value for Epsilon
        :param eps_middle    : Middle Value for Epsilon
        :param eps_final     : Final  Value for Epsilon
        :param eps_eval      : Epsilon Value to be used during Evaluation
        :param train_start   : Frame number up to which the agent only explores ie. Epsilon = eps_initial
        :param train_middle  : Frame number at which epsilon reaches eps_middle
        :param train_max     : Frame number at which epsilon reaches eps_final
        """
        self.n_actions = n_actions
        self.eps_initial = eps_initial
        self.eps_middle = eps_middle
        self.eps_final = eps_final
        self.eps_eval = eps_eval
        self.train_start = train_start
        self.train_middle = train_middle
        self.train_max = train_max
        
        # Now we need to generate linear lines for the decay between the start, middle, and the final points
        # m = slope = (y2 - y1) / (x2 - x1)
        # c = intercept = y2 - m*x2
        self.m_1 = (self.eps_middle - self.eps_initial) / (self.train_middle - self.train_start)
        self.c_1 = self.eps_middle - self.m_1*self.train_middle
        # The second segment decays from eps_middle to eps_final
        self.m_2 = (self.eps_final - self.eps_middle) / (self.train_max - self.train_middle)
        self.c_2 = self.eps_final - self.m_2*self.train_max
    
    def get_action(self, frame_number, state, dqn_object, evaluation=False):
        """
        Returns the action based on the epsilon greedy policy for the given state
        :param frame_number  : Number of frames passed. Used to determine if to use (m_1,c_1) or (m_2,c_2)
        :param state         : A stack of frames of the game-play after preprocessing ie. (84,84,4)
        :param dqn_object    : DQN object to return the best action
        :param evaluation    : Flag to be set True while evaluation. Relies only upon the exploitation
        :return: An integer between 0 and n_actions-1 to be set as the action
        """
        if evaluation:
            eps = self.eps_eval
        elif frame_number < self.train_start:
            eps = self.eps_initial
        elif frame_number < self.train_middle:
            eps = self.m_1*frame_number + self.c_1
        elif frame_number < self.train_max:
            eps = self.m_2*frame_number + self.c_2
        else:
            # Past train_max, hold epsilon at its final value
            eps = self.eps_final
        
        if np.random.random() < eps:
            return np.random.randint(0, self.n_actions)
        else:
            # Assumes dqn_object exposes a predict method that returns the greedy action for the state
            return dqn_object.predict(state)
        

# Experience Replay!!

> Second, learning directly from consecutive samples is inefficient, due to the strong correlations between 
the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. 
Third, when learning on-policy the current parameters determine the next data sample that the parameters are 
trained on. For example, if the maximizing action is to move left then the training samples will be dominated 
by samples from the left-hand side; if the maximizing action then switches to the right then the training 
distribution will also switch. It is easy to see how unwanted feedback loops may arise and the parameters 
could get stuck in a poor local minimum, or even diverge catastrophically. <br> [Mnih et al. 2013](https://arxiv.org/abs/1312.5602)
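
The idea of sampling past transitions at random can be sketched with a small ring buffer. This is a minimal illustration, not the memory used later in this notebook: the class name `ReplayBuffer`, its methods, and the tiny capacity are assumptions made for the example.

```python
import random

class ReplayBuffer:
    """Fixed-size memory that overwrites the oldest transition once full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.position = 0

    def add(self, transition):
        # transition is e.g. (state, action, reward, next_state, done)
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the correlation
        # between consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayBuffer(capacity=3)
for step in range(5):
    memory.add((step, 0, 0.0, step + 1, False))
batch = memory.sample(2)
```

After five insertions into a buffer of capacity 3, only the three most recent transitions remain available for sampling.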

