# What is the purpose of Double Dueling DQN?
* Regular DQN tends to overestimate the Q-values of potential actions in a given state.
* Once one specific action becomes overestimated, it is more likely to be chosen in the next iteration, making it very 
hard for the agent to explore the environment uniformly and find the right policy.
* To counter this, we use our primary network to select an action and a target network to generate a Q-value for that action.
* To synchronize the networks, we copy the weights from the primary network to the target network every 
'n' training steps.
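
As a rough illustration of the points above, the snippet below contrasts the standard DQN target with the Double DQN target for a single transition. The Q-values and the helper names (`dqn_target`, `double_dqn_target`) are made up for this sketch and are not part of the notebook's code.

```python
def dqn_target(reward, done, gamma, q_target_next):
    """Standard DQN: the target network both selects and evaluates the action."""
    return reward + (0.0 if done else gamma * max(q_target_next))


def double_dqn_target(reward, done, gamma, q_main_next, q_target_next):
    """Double DQN: the primary network selects, the target network evaluates."""
    best_action = max(range(len(q_main_next)), key=lambda a: q_main_next[a])
    return reward + (0.0 if done else gamma * q_target_next[best_action])


# Hypothetical Q-values for the next state (illustrative numbers only)
q_main_next = [1.0, 2.5, 0.3]     # primary network prefers action 1
q_target_next = [1.2, 2.0, 4.0]   # target network has an inflated estimate for action 2

standard = dqn_target(1.0, False, 0.99, q_target_next)
double = double_dqn_target(1.0, False, 0.99, q_main_next, q_target_next)
print(standard, double)
```

With these numbers the standard target picks up the inflated estimate for action 2, while the Double DQN target uses the target network's value for the action the primary network actually prefers.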


In [2]:
import tensorflow as tf
import gym
import random
import numpy as np
import cv2

We will be using the Pong environment.

In [3]:
ENV_NAME = 'PongDeterministic-v4' 

# Preprocessing
We have to downscale each frame and convert it to grayscale, since the smaller input makes training faster. 
<br>
You can use OpenCV, skimage, or even the built-in TensorFlow methods.


In [4]:
class ProcessFrame:
    """Convert to GrayScale and resize"""
    def __init__(self, rows = 84, cols = 84):
        """
        Args:
            :param rows : Height to be resized to  
            :param cols : Width to be resized to
        """
        self.rows = rows
        self.cols = cols
    def process_frame(self, frame):
        """
        Converts the frame to grayscale, resizes it and scales it to [0, 1] using OpenCV
        :param frame : An Atari game-play frame of (210,160,3)
        :return:      A processed frame of (84,84)
        """
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        # cv2.resize expects dsize as a (width, height) tuple
        resized = cv2.resize(gray, (self.cols, self.rows))
        preprocessed_frame = resized / 255.0
        return preprocessed_frame

# Making the dueling network
Refer to [Wang et al. 2016](https://arxiv.org/abs/1511.06581) for more info. 


The architecture used :
* The first convolutional layer has   32    8x8 filters with stride 4
* The second convolutional layer has  64    4x4 filters with stride 2
* The third convolutional layer has   64    3x3 filters with stride 1
* The fourth convolutional layer has  1024  7x7 filters with stride 1
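
The spatial size after each layer can be checked with a little arithmetic. The sketch below assumes unpadded ('valid') convolutions, which is the Keras default for `Conv2D`; the helper name `conv_output_size` is made up for this example.

```python
def conv_output_size(size, kernel, stride):
    """Output width/height of a convolution with 'valid' (no) padding."""
    return (size - kernel) // stride + 1


# (kernel, stride) pairs of the four convolutional layers listed above
layers = [(8, 4), (4, 2), (3, 1), (7, 1)]

size = 84  # preprocessed frames are 84x84
sizes = []
for kernel, stride in layers:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)

print(sizes)  # [20, 9, 7, 1]
```

The last layer reduces each feature map to a single value, so flattening its output gives a vector of 1024 features.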

Instead of directly predicting a single Q-value for each action, the dueling architecture 
splits the output of the final convolutional layer into two streams. The value stream predicts 
a state value V(s) that depends only on the state, and the advantage stream predicts action advantages 
A(s,a) that depend on the state and the respective action.

An excerpt from the original paper :

> Intuitively, the dueling architecture can learn which states are (or are not) valuable, without 
having to learn the effect of each action for each state. This is particularly useful in states 
where its actions do not affect the environment in any relevant way. In the experiments, we 
demonstrate that the dueling architecture can more quickly identify the correct action during 
policy evaluation as redundant or similar actions are added to the learning problem.

The state value V(s) tells us whether the state is favourable or not, while the action advantage A(s,a)
tells us which action is favourable in that state.  

Now to combine the value and advantage streams: 

> Next, we have to combine the value and advantage stream into $Q$-values $Q(s,a)$. This is done 
the following way :
\begin{equation}
Q(s,a) = V(s) + \left(A(s,a) - \frac 1{| \mathcal A |}\sum_{a'}A(s, a')\right)
\end{equation}
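
The aggregation above can be tried out on a few numbers. This is a minimal NumPy sketch, not the network code used in this notebook; `combine_streams` and the sample values are made up for illustration.

```python
import numpy as np


def combine_streams(value, advantage):
    """Q(s,a) = V(s) + (A(s,a) - mean over a' of A(s,a')).

    value     : array of shape (batch, 1)
    advantage : array of shape (batch, n_actions)
    """
    return value + (advantage - advantage.mean(axis=1, keepdims=True))


# Illustrative numbers only: one state, four actions
value = np.array([[2.0]])
advantage = np.array([[1.0, 3.0, 0.0, 0.0]])

q_values = combine_streams(value, advantage)
print(q_values)  # [[2. 4. 1. 1.]]
```

Subtracting the mean advantage leaves the ranking of the actions unchanged and makes the average of the Q-values equal to V(s).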


To give the network a sense of motion, we stack 4 consecutive frames and pass them to the network as a single state.
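
A small sketch of the stacking idea, assuming a `deque` is used to hold the most recent frames. The constant stand-in frames and the name `STATE_LENGTH` are made up for this example; real frames would come from the preprocessing step.

```python
from collections import deque

import numpy as np

STATE_LENGTH = 4  # number of stacked frames, as in the text

# Keep only the most recent STATE_LENGTH frames
frames = deque(maxlen=STATE_LENGTH)

# Stand-in frames: constant images 0, 1, 2, ... instead of real game frames
for t in range(6):
    frames.append(np.full((84, 84), float(t)))

# Stack along the last axis to get the (84, 84, 4) network input
state = np.stack(list(frames), axis=-1)
print(state.shape)     # (84, 84, 4)
print(state[0, 0, :])  # [2. 3. 4. 5.]
```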

Let's now implement this in a class. 


In [5]:
class DQN:
    """Implementation of the brain"""
    def __init__(self, n_actions, stream,learning_rate = 0.00025, state_length = 4):
        """
        Init all required variables for class
        :param n_actions      : Number of actions available in the environment 
        :param stream         : Type of stream - 'v' or 'a' for Value and Advantage respectively
        :param learning_rate  : The specific lr argument for the Adam optimizer
        :param state_length   : The number of frames which create a state
        """
        self.n_actions = n_actions
        self.learning_rate = learning_rate
        self.agent_history_length = state_length
        self.stream = stream
        self.input_shape = [84, 84, self.agent_history_length]
        self.make_model()
        
    def make_model(self):
        """
        Creates the neural network responsible
        :return: None
        """
        self.model = tf.keras.models.Sequential([
            tf.keras.layers.Conv2D(filters=32, kernel_size=[8,8], strides=4,
                                   input_shape=self.input_shape, 
                                   activation='relu',
                                   kernel_initializer=tf.keras.initializers.VarianceScaling(scale=2)),
            tf.keras.layers.Conv2D(filters=64, kernel_size=[4,4], strides=2,
                                   input_shape=self.input_shape, 
                                   activation='relu',
                                   kernel_initializer=tf.keras.initializers.VarianceScaling(scale=2)),
            tf.keras.layers.Conv2D(filters=64, kernel_size=[3,3], strides=1,
                                   input_shape=self.input_shape, 
                                   activation='relu',
                                   kernel_initializer=tf.keras.initializers.VarianceScaling(scale=2)),
            tf.keras.layers.Conv2D(filters=1024, kernel_size=[7,7], strides=1,
                                   input_shape=self.input_shape, 
                                   activation='relu',
                                   kernel_initializer=tf.keras.initializers.VarianceScaling(scale=2)),
            tf.keras.layers.Flatten()
        ])
        if self.stream == 'v':
            self.model.add(
                tf.keras.layers.Dense(units=1,
                                      kernel_initializer=tf.keras.initializers.VarianceScaling(scale=2)))
        elif self.stream == 'a':
            self.model.add(
                tf.keras.layers.Dense(units=self.n_actions,
                                      kernel_initializer=tf.keras.initializers.VarianceScaling(scale=2)))
            
        self.model.compile(optimizer=tf.keras.optimizers.Adam(lr=self.learning_rate),
                           loss=tf.keras.losses.mean_squared_error,
                           metrics=['accuracy'])
        self.model.summary()
Value = DQN(4, 'v')
Advantage = DQN(4, 'a')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 20, 20, 32)        8224      
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 9, 9, 64)          32832     
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 7, 7, 64)          36928     
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 1, 1, 1024)        3212288   
_________________________________________________________________
flatten (Flatten)            (None, 1024)              0         
_________________________________________________________________
dense (Dense)                (None, 1)                 1025      
Total params: 3,291,297
Trainable params: 3,291,297
Non-trainable params: 0

# Making the Get Action Class

For the first 50000 frames the agent only explores (ϵ=1). Over the following 1 million frames, ϵ is linearly
decreased to 0.1.

After that, we decrease it to ϵ=0.01 over the remaining frames, as suggested by the OpenAI Baselines for DQN.
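
The schedule can also be written as a single stand-alone function. This is only a sketch: `epsilon_at` is a made-up name, and the breakpoints are assumed to be absolute frame numbers (50 thousand, 1 million and 25 million), mirroring the default arguments of the class below.

```python
def epsilon_at(frame_number, eps_initial=1.0, eps_middle=0.1, eps_final=0.01,
               train_start=50_000, train_middle=1_000_000, train_max=25_000_000):
    """Piecewise-linear epsilon schedule: constant, first decay, second decay, constant."""
    if frame_number < train_start:
        return eps_initial
    if frame_number < train_middle:
        fraction = (frame_number - train_start) / (train_middle - train_start)
        return eps_initial + fraction * (eps_middle - eps_initial)
    if frame_number < train_max:
        fraction = (frame_number - train_middle) / (train_max - train_middle)
        return eps_middle + fraction * (eps_final - eps_middle)
    return eps_final


for frame in (0, 525_000, 1_000_000, 13_000_000, 30_000_000):
    print(frame, round(epsilon_at(frame), 4))
```

Writing the interpolation in terms of a fraction avoids computing slopes and intercepts by hand and makes the behaviour after the last breakpoint explicit.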


In [6]:
class GetAction:
    """To implement the Epsilon Greedy Policy for returning actions"""
    def __init__(self, n_actions, eps_initial=1, eps_middle=0.1, eps_final=0.01, eps_eval=0,
                 train_start=50e3, train_middle=1e6, train_max=25e6):
        """
        Init all required variables for class
        :param n_actions     : Action Space for the environment
        :param eps_initial   : Start  Value for Epsilon
        :param eps_middle    : Middle Value for Epsilon
        :param eps_final     : Final  Value for Epsilon
        :param eps_eval      : Epsilon Value to be used during Evaluation
        :param train_start   : Number of frames to be pure exploration ie. Epsilon = 1
        :param train_middle  : Number of frames for epsilon to decay from eps_initial to eps_middle 
        :param train_max     : Number of frames for epsilon to decay from eps_middle to eps_final
        """
        self.n_actions = n_actions
        self.eps_initial = eps_initial
        self.eps_middle = eps_middle
        self.eps_final = eps_final
        self.eps_eval = eps_eval
        self.train_start = train_start
        self.train_middle = train_middle
        self.train_max = train_max
        
        # Now we need to generate linear lines for the decay between the start, middle, and the final points
        # m = slope = y2-y1 / x2-x1
        # c = intercept = y2 - m*x2
        self.m_1 = (self.eps_middle - self.eps_initial) / (self.train_middle - self.train_start)
        self.c_1 = self.eps_middle - self.m_1*self.train_middle
        self.m_2 = (self.eps_final - self.eps_middle) / (self.train_max - self.train_middle)
        self.c_2 = self.eps_final - self.m_2*self.train_max
    
    def get_action(self, frame_number, state, dqn_object, evaluation=False):
        """
        Returns the action based on the epsilon greedy policy for the given state
        :param frame_number  : Number of frames passed. Used to determine if to use (m_1,c_1) or (m_2,c_2)
        :param state         : A stack of frames of the game-play after preprocessing ie. (84,84,4)
        :param dqn_object    : DQN object to return the best action
        :param evaluation    : Flag to be set True while evaluation. Relies only upon the exploitation
        :return: An integer between 0 and n_actions-1 to be set as the action
        """
        if evaluation:
            eps = self.eps_eval
        elif frame_number < self.train_start:
            eps = self.eps_initial
        elif frame_number < self.train_middle:
            eps = self.m_1*frame_number + self.c_1
        elif frame_number < self.train_max:
            eps = self.m_2*frame_number + self.c_2
        else:
            eps = self.eps_final    # Decay is finished, keep the final value
        
        if np.random.random() < eps:
            return np.random.randint(0, self.n_actions)
        else:
            # predict is assumed to return the Q-values of the given state
            return int(np.argmax(dqn_object.predict(state)))
        

# Experience Replay!!

> Second, learning directly from consecutive samples is inefficient, due to the strong correlations between 
the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. 
Third, when learning on-policy the current parameters determine the next data sample that the parameters are 
trained on. For example, if the maximizing action is to move left then the training samples will be dominated 
by samples from the left-hand side; if the maximizing action then switches to the right then the training 
distribution will also switch. It is easy to see how unwanted feedback loops may arise and the parameters 
could get stuck in a poor local minimum, or even diverge catastrophically. <br> [Mnih et al. 2013](https://arxiv.org/abs/1312.5602)

We add the experiences to a memory buffer that behaves like a *deque*. The experiences are of the form $(state, action, reward, next\_state, done)$.
We store the last 1 million experiences. We fill the buffer with our experiences and sample a batch of random 
experiences whenever we need to train. As more experiences are seen, new ones are added and the oldest ones are
removed.

Let's make the class. We will sample from a memory buffer of 1 million experiences.
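
Before the full class, here is a much simpler sketch of the same idea built directly on `collections.deque`. `SimpleReplayBuffer` is a hypothetical helper for illustration only; it stores whole transitions and does not handle frame stacking the way the class below does.

```python
import random
from collections import deque


class SimpleReplayBuffer:
    """Hypothetical minimal buffer; not the ExpMemory class of this notebook."""

    def __init__(self, size=1_000_000):
        # A deque with maxlen drops the oldest experience once it is full
        self.memory = deque(maxlen=size)

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch=32):
        # Uniform random sampling breaks the correlation between consecutive steps
        return random.sample(self.memory, batch)


# Tiny buffer so the eviction is easy to see; integers stand in for states
buffer = SimpleReplayBuffer(size=3)
for step in range(5):
    buffer.remember(step, 0, 0.0, step + 1, False)

print([experience[0] for experience in buffer.memory])  # [2, 3, 4]
batch = buffer.sample(batch=2)
```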

In [7]:
class ExpMemory:
    """Storage for the experiences encountered . Stores the last 1 Million experiences"""
    def __init__(self, size=1e6, rows=84, cols=84, state_length=4, batch=32):
        """
        Init all required variables for class
        :param size          : The number of experiences to be stored  
        :param rows          : The pixel ht of a frame
        :param cols          : The pixel wd of a frame
        :param state_length  : The number of frames stacked to form a state
        :param batch         : The size of a batch to train upon
        """
        self.size = int(size)   # 1e6 is a float; array shapes and indices must be integers
        self.rows = rows
        self.cols = cols
        self.state_length = state_length
        self.batch = batch
        self.count = 0      # Holds the number of experiences in the deque
        self.current = 0    # Holds the index of the next slot to be written
        # Memory of experiences made (np.int, np.float and np.bool were removed in NumPy 1.24):
        self.actions = np.empty(self.size, dtype=np.int32)                              # Placeholder for actions
        self.rewards = np.empty(self.size, dtype=np.float32)                            # Placeholder for rewards
        self.dones = np.empty(self.size, dtype=bool)                                    # Placeholder for done
        self.frames = np.empty((self.size,self.rows,self.cols))                         # Placeholder for frames
        # Stacked states are only built per batch, so these placeholders are batch-sized
        self.states = np.empty((self.batch,self.state_length,self.rows,self.cols))      # Placeholder for stacked frames
        self.new_states = np.empty((self.batch,self.state_length,self.rows,self.cols))  # Placeholder for next stacked frames
        self.indices = np.empty(self.batch, dtype=np.int32)                             # Placeholder for sample of experience tuples
    def remember_state(self, action, frame, reward, done):
        """
        Adds the tuple of experience to the memory buffer
        :param action   : The action taken during the respective state 
        :param frame    : Preprocessed game-play frame (84,84)
        :param reward   : Reward associated with action
        :param done     : Done flag returned from taking the step
        :return: None
        """
        if frame.shape != (self.rows,self.cols):
            raise ValueError(f"Wrong dimensions of frame. Reqd - {self.rows},{self.cols} \t Passed - {frame.shape}")
        self.actions[self.current] = action             
        self.frames[self.current, ...] = frame
        self.rewards[self.current] = reward
        self.dones[self.current] = done
        self.count = min(self.count+1, self.size)       # Number of stored experiences, capped at the buffer size
        self.current = (self.current+1) % self.size     # Implementation of DeQue on a list
    def _get_state(self, index):
        """
        Gets the corresponding stacked state for the associated index 
        :param index : The index query  
        :return: A list of stacked frames of shape (4,84,84)
        """
        if self.count>0 and index>=self.state_length-1:
            return self.frames[index-self.state_length+1 : index+1]
        else:
            raise ValueError("Minimum index of 3 is required. Populate the memory buffer")
    def _get_indices(self):
        """
        Gets a list of indices which are valid for experience replay training
        :return: A list of size (batch,)
        """
        for i in range(self.batch):
            while True:
                index = np.random.randint(low=self.state_length, high=self.count)   # Select a random index
                if index >= self.current >= index-self.state_length:                # State frames aren't continuous 
                    continue                                                        # ie. belong to two different times
                if self.dones[index-self.state_length : index].any():               # If any of the frames before were part of a different life
                    continue
                break
            self.indices[i] = index                                                 # Valid cases
    def get_experience(self):
        """
        Gets a batch of experience tuples of size self.batch
        :return: states, actions, rewards, next_states, dones  with self.batch=32 replays
        """
        if self.count<=self.state_length:
            raise ValueError("Too few experiences to get a batch")
        self._get_indices()
        for i, index in enumerate(self.indices):
            self.states[i] = self._get_state(index-1)       # Frames ending just before the sampled step
            self.new_states[i] = self._get_state(index)     # Frames ending at the sampled step
        # We now use transpose just to get the states in the form of (32,84,84,4) -> (batch,rows,cols,state_length)
        return np.transpose(self.states, axes=(0,2,3,1)), self.actions[self.indices], self.rewards[self.indices], \
               np.transpose(self.new_states, axes=(0,2,3,1)), self.dones[self.indices]

# The Learning Part
The problem is that both $Q_{prediction}$ and $Q_{target}$ depend on the same parameters $\theta$ if only one network is used. This 
can lead to instability when regressing $Q_{prediction}$ towards $Q_{target}$ because the "target is moving". We ensure a 
"fixed target" by introducing a second network with fixed and only occasionally updated parameters that estimates the 
target Q-values.
> Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a 
neural network is used to represent the action-value (also known as Q) function. This instability has several causes: 
the correlations present in the sequence of observations, the fact that small updates to Q may significantly change 
the policy and therefore change the data distribution, and the correlations between the action-values [...] and the 
target values [...]. We address these instabilities with a novel variant of Q-learning, which uses two key ideas. 
First, we used a biologically inspired mechanism termed experience replay that randomizes over the data, thereby 
removing correlations in the observation sequence and smoothing over changes in the data distribution [...]. 
Second, we used an iterative update that adjusts the action-values (Q) towards target values that are only 
periodically updated, thereby reducing correlations with the target.
[Mnih et al. 2015](https://www.nature.com/articles/nature14236/)

One network is used to predict the $Q_{prediction}$ value, while the other network is used to predict the $Q_{target}$ value
and is kept fixed. The main network is updated by gradient descent, and every 10000 steps its weights are 
copied to the target network.
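
The periodic copy can be sketched without any deep learning library. `TinyNetwork`, `sync_target` and `SYNC_EVERY` are made-up names for this example, and the "training step" simply adds 1 to every weight so that the effect of the copy is easy to follow.

```python
import copy


class TinyNetwork:
    """Hypothetical stand-in for a network: just a list of weights."""

    def __init__(self, weights):
        self.weights = list(weights)


def sync_target(main_net, target_net):
    """Copy the main network's weights into the target network."""
    target_net.weights = copy.deepcopy(main_net.weights)


SYNC_EVERY = 10_000  # steps between copies, as in the text

main_net = TinyNetwork([0.0, 0.0])
target_net = TinyNetwork([0.0, 0.0])

for step in range(1, 25_001):
    # Stand-in for one gradient step on the main network
    main_net.weights = [w + 1.0 for w in main_net.weights]
    if step % SYNC_EVERY == 0:
        sync_target(main_net, target_net)

print(main_net.weights)    # [25000.0, 25000.0]
print(target_net.weights)  # [20000.0, 20000.0]
```

Between two copies the target network stays frozen, which is what keeps the regression target fixed. With Keras models the copy itself would typically be `target.set_weights(main.get_weights())`.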

In [8]:
def learn(exp_memory, main_dqn, target_dqn, batch, gamma):
    """
    Implements the Double DQN update on one batch
    :param exp_memory   : An object of ExpMemory
    :param main_dqn     : An object of DQN, assumed to expose Keras-style predict() and fit()
    :param target_dqn   : An object of DQN, assumed to expose Keras-style predict()
    :param batch        : A batch size to perform learning on
    :param gamma        : Discount factor
    :return: None
    """
    states, actions, rewards, next_states, dones = exp_memory.get_experience()
    rows = np.arange(batch)
    next_best_actions = np.argmax(main_dqn.predict(next_states), axis=1)   # Best next actions - mainDQN
    next_q = target_dqn.predict(next_states)                                # Q-values of the next states - targetDQN
    double_q = next_q[rows, next_best_actions]                              # targetDQN's value of mainDQN's choice
    target_q = rewards + (1-dones.astype(np.float32))*(gamma*double_q)
    q_values = main_dqn.predict(states)     # Keep the current estimates for the actions not taken
    q_values[rows, actions] = target_q      # Only the taken action is regressed towards the target
    main_dqn.fit(states, q_values)
