This is Part 4 of the Deep Reinforcement Learning Notebook series. ***In this notebook I introduce another value-based algorithm, the Deep Q-Network (DQN) algorithm.***


The notebook series is about deep RL algorithms, so it excludes all other techniques that can be used to learn functions in reinforcement learning. The series is also not exhaustive, i.e. it covers only the most widely used deep RL algorithms.



# Deep Q-Network
DQN was introduced in two papers: *Playing Atari with Deep Reinforcement Learning*, presented at NIPS in 2013, and *Human-level control through deep reinforcement learning*, published in Nature in 2015. Deep Q-Network (DQN) is the first deep reinforcement learning method proposed by DeepMind. After the Nature paper was published in 2015, many research institutes joined this field, because the techniques used in DQN let a deep neural network handle high-dimensional states such as images directly.

# How is DQN different from SARSA
Like SARSA, DQN is a value-based temporal difference (TD) algorithm that approximates the Q-function. The learned Q-function is then used by an agent to select actions. However, DQN learns a different Q-function from SARSA: the optimal Q-function instead of the Q-function for the current policy. This small but crucial change improves the stability and speed of learning. SARSA is an on-policy algorithm, whereas DQN is an off-policy algorithm, which means that DQN can learn from experiences gathered by any agent. DQN makes it possible to de-correlate and re-use experiences by sampling random batches from a large experience replay memory, and it allows multiple parameter updates using the same batch.

## LEARNING THE Q-FUNCTION IN DQN
DQN, like SARSA, learns the Q-function using TD learning. Where the two algorithms differ is how $Q_{tar}(s,a)$ is constructed.

![alt text](https://github.com/Machine-Learning-rc/Unimportant/blob/master/Screenshot%202020-07-06%20at%205.14.04%20PM.png?raw=true)

![alt text](https://github.com/Machine-Learning-rc/Unimportant/blob/master/Screenshot%202020-07-06%20at%205.14.14%20PM.png?raw=true)

Instead of using the action a′ actually taken in the next state s′ to estimate $Q_{tar}(s,a)$, DQN uses the maximum Q-value over all of the potential actions available in that state. In DQN, the Q-value that is used in the next state s′ doesn’t depend on the policy used to gather experiences. Since the reward r and next state s′ are produced by the environment given the current state s and action a, this means that no part of the $Q_{tar}(s,a)$ estimation depends on the data gathering policy. This makes DQN an off-policy algorithm because the function being learned is independent of the policy being followed to act in the environment and gather experiences. In contrast, SARSA is on-policy because it uses the action a′ taken by the current policy in state s′ to calculate the Q-value for the next state. It directly depends on the policy used to gather experiences.
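The difference between the two targets can be sketched in a few lines of Python. This is only an illustration: the function names `sarsa_target` and `dqn_target`, the discount factor of 0.99 and the example Q-values are assumptions made for the sketch, not part of the notebook's implementation.

```python
import numpy as np

def sarsa_target(reward, q_next, next_action, gamma=0.99, done=False):
    """TD target built from the action actually taken in s' (on-policy)."""
    if done:
        return reward  # no future value after a terminal state
    return reward + gamma * q_next[next_action]

def dqn_target(reward, q_next, gamma=0.99, done=False):
    """TD target built from the best action available in s' (off-policy)."""
    if done:
        return reward
    return reward + gamma * np.max(q_next)

# Illustrative Q-value estimates for the three actions available in s'
q_next = np.array([1.0, 3.0, 2.0])

print(sarsa_target(1.0, q_next, next_action=2))  # depends on which a' the policy chose
print(dqn_target(1.0, q_next))                   # independent of the chosen a'
```

Only `sarsa_target` needs to know which action the data-gathering policy picked in s′, which is exactly what makes SARSA on-policy.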


It is important to note that just because DQN’s objective is to learn the optimal Q-function doesn’t mean that it will. There may be many reasons for this. For example, the hypothesis space represented by the neural network may not actually contain the optimal Q-function, non-convex optimization methods are imperfect and might not find a global minimum, and computation and time constraints place a limit on how long we can train an agent for. However, we can say that the upper bound on performance with DQN is optimal, compared to a potentially sub-optimal upper bound for SARSA resulting from learning the Q-function under an ∊-greedy policy.



# ACTION SELECTION IN DQN
Even though DQN is an off-policy algorithm, how a DQN agent gathers experiences still matters. There are two important factors to consider.
First, an agent still faces the exploration-exploitation trade-off discussed in the last notebook (Deep Reinforcement Learning, Part 3). An agent should rapidly explore the state-action space at the beginning of training to increase the chances of discovering good ways to act in an environment. As training progresses and the agent learns more, it should gradually decrease the rate of exploration and spend more time exploiting what it has learned. This improves the efficiency of training as the agent focuses on better actions.
Second, if the state-action space is very large because it consists of continuous values or is discrete with high dimensionality, then it will be intractable to experience all (s, a) pairs, even once. In these cases, Q-values for the unvisited (s, a) pairs may be no better than random guessing. Fortunately, function approximation with neural networks mitigates this problem because they can generalize from visited (s, a) pairs to similar states and actions. However, this does not completely solve the problem. There may still be parts of the state-action space that are far away and very different from the states and actions an agent has experienced. A neural network is unlikely to generalize well in these cases and the estimated Q-values may be inaccurate.

# How Tabular Methods fail when the state-action space is very large
Let’s consider a tabular representation for $Q^π(s,a)$.
Suppose the state-action space is very large, with millions of (s, a) pairs, and at the beginning of training each cell representing a particular $Q^π(s,a)$ is initialized to 0. During training an agent visits (s, a) pairs and the table is updated, but the unvisited (s, a) pairs continue to have $Q^π(s,a) = 0$.
Since the state-action space is large, many (s, a) pairs will remain unvisited and their Q-value estimates will remain at 0 even if (s, a) is desirable, with $Q^π(s,a) \gg 0$. The main issue is that a tabular function representation does not learn anything about how different states and actions relate to each other.
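A small sketch makes the problem concrete. The dictionary-based table, the helper `td_update`, the learning rate and the example target are all hypothetical choices for illustration only.

```python
from collections import defaultdict

# Tabular Q-function: every (state, action) cell starts at 0.0
q_table = defaultdict(float)

def td_update(table, s, a, target, lr=0.5):
    """Move the single cell (s, a) towards the TD target."""
    table[(s, a)] += lr * (target - table[(s, a)])

# The agent visits (s=0, a=1) once and observes a high target
td_update(q_table, s=0, a=1, target=10.0)

visited = q_table[(0, 1)]             # this cell has moved towards 10
neighbour = q_table.get((1, 1), 0.0)  # a similar but unvisited pair is still 0
print(visited, neighbour)
```

The update touches exactly one cell, so nothing learned about (0, 1) carries over to the neighbouring pair (1, 1).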

# Why Neural Networks are better for  generalization 
Neural networks can extrapolate from Q-values for known (s, a) to unknown (s′, a′) because they learn how different states and actions are related to each other. This is very useful when the state-action space is large or has infinitely many elements, because it means an agent does not have to visit all (s, a) pairs to learn a good estimate of the Q-function. An agent only needs to visit a representative subset of the state-action space.

# When can Neural Networks fail
There are limitations to how well a network will generalize, and there are two common cases where it often fails. First, if a network receives inputs that are significantly different from the inputs it has been trained on, it is unlikely to produce good outputs. Generalization is typically much better in small neighborhoods of the input space surrounding the training data. Second, neural networks are likely to generalize poorly if the function they are approximating has sharp discontinuities. This is because neural networks implicitly assume that the input space is locally smooth. If the inputs x and x′ are similar, then the corresponding outputs y and y′ should also be similar.

# Solution to our problem 

If an environment has a large state-action space it is unlikely that an agent will be able to learn good Q-value estimates for all parts of this space. It is still possible to achieve good performance in such environments, provided that an agent focuses on learning on the states and actions that a good policy is likely to visit often. When this strategy is combined with neural networks, the Q-function approximation is likely to be good in a local region surrounding the commonly visited parts of the state-action space.

The policy used by a DQN agent should, therefore, visit states and select actions that are reasonably similar to those that would be visited by acting greedily with respect to the agent’s current Q-function estimate, which is the current estimate of the optimal policy.

In practice, this can be achieved by using the ∊-greedy policy or the Boltzmann policy. We will look at the Boltzmann policy here.

# The Boltzmann Policy

The Boltzmann policy tries to improve over random exploration (as done in the ∊-greedy policy) by selecting actions using their relative Q-values. The Q-value maximizing action a in state s will be selected most often, but other actions with relatively high Q-values will also have a high probability of being chosen. Conversely, actions with very low Q-values will hardly ever be taken. This has the effect of focusing exploration on more promising actions off the Q-value maximizing path instead of selecting all actions with equal probability.

To produce a Boltzmann policy, we construct a probability distribution over the Q-values for all actions a in state s by applying the softmax function (Equation 4.1). The softmax function is parameterized by a temperature parameter τ ∈ (0,∞), which controls how uniform or concentrated the resulting probability distribution is. High values of τ push the distribution to become more uniform, low values of τ push the distribution to become more concentrated. Actions are then sampled according to this distribution as shown in Equation 4.2.

![alt text](https://github.com/Machine-Learning-rc/Unimportant/blob/master/Screenshot%202020-07-12%20at%207.08.07%20AM.png?raw=true)

Equation 4.1

![alt text](https://github.com/Machine-Learning-rc/Unimportant/blob/master/Screenshot%202020-07-12%20at%207.08.13%20AM.png?raw=true)

Equation 4.2

The role of the temperature parameter τ in the Boltzmann policy is analogous to that of ∊ in the ∊-greedy policy. It encourages exploration of the state-action space. High values of τ (e.g. τ = 5) push the probability distribution closer to a uniform distribution. This results in an agent acting very randomly. Low values of τ (e.g. τ = 0.1) increase the probability of the action corresponding to the largest Q-value, and so the agent will act more greedily. With τ = 1 the distribution is the standard softmax of the Q-values. Adjusting the value of τ during training balances exploration and exploitation. High values of τ at the beginning of training will encourage exploration. As τ is decreased over time, the policy will approximate the greedy policy more closely.
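The effect of τ can be checked with a short NumPy sketch. The helper names `boltzmann_probs` and `boltzmann_action` and the example Q-values are assumptions made for illustration; the notebook's own agent further down uses ∊-greedy by default.

```python
import numpy as np

def boltzmann_probs(q_values, tau):
    """Softmax over Q-values with temperature tau > 0."""
    z = np.asarray(q_values, dtype=np.float64) / tau
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def boltzmann_action(q_values, tau, rng):
    """Sample an action index from the Boltzmann distribution."""
    probs = boltzmann_probs(q_values, tau)
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
q = [1.0, 2.0, 3.0]  # illustrative Q-values for three actions

for tau in (5.0, 1.0, 0.1):
    # High tau: close to uniform. Low tau: close to greedy.
    print(tau, np.round(boltzmann_probs(q, tau), 3))

action = boltzmann_action(q, tau=1.0, rng=rng)
```

Subtracting the maximum before exponentiating does not change the resulting probabilities, but it avoids overflow when τ is small and the scaled Q-values become large.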


The main advantage of the Boltzmann policy when compared to the ∊-greedy policy is that it explores the environment less randomly. Each time an agent selects an action, it samples a from the probability distribution over the actions generated by the Boltzmann policy. Instead of acting randomly with probability ∊, the agent selects action a with probability $p_{Boltzmann}(a|s)$, so actions with higher Q-values are more likely to be chosen. Even if an agent does not select the Q-maximizing action, it is more likely to select the second-best than the third or fourth-best action. A Boltzmann policy also results in a smoother relationship between Q-value estimates and action probabilities compared to an ∊-greedy policy.

∊-greedy policies lead to more extreme behavior. If one of the Q-values is a fraction higher than the other, ∊-greedy will assign all of the non-random probability (1 – ∊) to that action. If in the next iteration the other Q-value was now slightly higher, ∊-greedy would immediately switch and assign all of the non-random probability to the other action. An ∊-greedy policy can be more unstable than a Boltzmann policy which can make it more difficult for an agent to learn.
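The following sketch compares the action probabilities of the two policies when two Q-values are nearly tied and then swap order. The helper names, ∊ = 0.1 and the Q-values are illustrative assumptions; the ∊-greedy variant shown here spreads the random probability ∊ uniformly over all actions, including the greedy one.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities under epsilon-greedy."""
    q = np.asarray(q_values, dtype=np.float64)
    probs = np.full(len(q), epsilon / len(q))  # random part, shared by all actions
    probs[np.argmax(q)] += 1.0 - epsilon       # greedy part, all on one action
    return probs

def softmax_probs(q_values, tau=1.0):
    """Action probabilities under a Boltzmann policy with temperature tau."""
    z = np.asarray(q_values, dtype=np.float64) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

before = [1.00, 1.01]  # action 1 is marginally better
after = [1.01, 1.00]   # after an update, action 0 is marginally better

print(epsilon_greedy_probs(before, 0.1), epsilon_greedy_probs(after, 0.1))
print(softmax_probs(before), softmax_probs(after))
```

A change of 0.01 in the Q-values flips the ∊-greedy probabilities almost completely, while the Boltzmann probabilities stay close to one half in both cases.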

A Boltzmann policy can cause an agent to get stuck in a local minimum if the Q-function estimate is inaccurate for some parts of the state space. One way to tackle this problem with the Boltzmann policy is to use a large value for τ at the beginning of training so that the action probability distribution is more uniform. As training progresses and the agent learns more, τ can be decayed, allowing the agent to exploit what it has learned. However, care needs to be taken not to decay τ too quickly, otherwise a policy may get stuck in local minima.
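One simple way to decay τ is a linear schedule that is clamped at a final value. The function `linear_decay`, the end value 0.5 and the 1000-step horizon below are assumptions chosen for illustration; a suitable schedule depends on the environment.

```python
def linear_decay(start, end, step, total_steps):
    """Linearly interpolate from start to end, then stay at end."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start + frac * (end - start)

# Temperature at the start, halfway, at the end, and past the end of the schedule
taus = [linear_decay(5.0, 0.5, step, 1000) for step in (0, 500, 1000, 2000)]
print(taus)
```

Clamping matters: without it, a schedule that runs longer than planned would drive τ to zero or below, where the softmax is no longer defined.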

# EXPERIENCE REPLAY
DQN improves on the sample efficiency of SARSA with the help of experience replay memory.

**Let us first see why on-policy algorithms are sample inefficient.**

We have seen that on-policy algorithms can only use data gathered by the current policy to update the policy parameters. Each experience is used just once. This is problematic when combined with function approximation methods that learn using gradient descent, such as neural networks. Each parameter update must be small because the gradient only conveys meaningful information about a descent direction in a small area around the current parameter values. However, the optimal parameter update for some experiences may be large, for example, if there is a large difference between the Q-value a network predicts and actual Q-value. In these cases, network parameters may need to be updated many times using the experiences to make use of all of the information conveyed in them. On-policy algorithms cannot do this. Also, the experiences used to train on-policy algorithms are highly correlated. This is because the data used to compute a single parameter update is often from a single episode, where future states and rewards depend on previous states and actions. This can lead to high variance in the parameter updates.

TD learning can be slow because of the trial-and-error mechanism for gathering data inherent in RL and the need to propagate information backward through time. Speeding up TD learning amounts to either speeding up the credit assignment process or shortening the trial-and-error process. Experience replay focuses on the latter by facilitating the re-use of experiences.

An experience replay memory stores the k most recent experiences an agent has gathered. If the memory is full, the oldest experience is discarded to make space for the latest one. Each time an agent trains, one or more batches of data are sampled uniformly at random from the experience replay memory. Each of these batches is used in turn to update the parameters of the Q-function network. k is typically quite large, between 10,000 and 1,000,000, whereas the number of elements in a batch is much smaller, typically between 32 and 2048.

The size of the memory should be large enough to contain many episodes of experiences. Each batch will typically contain experiences from different episodes and different policies, which de-correlates the experiences used to train an agent. In turn, this reduces the variance of the parameter updates, helping to stabilize training. However, the memory should also be small enough so that each experience is likely to be sampled more than once before being discarded, which makes learning more efficient.

Discarding the oldest experiences is also important. As an agent learns, the distribution of (s, a) pairs that an agent experiences changes. Older experiences become less useful because an agent is less likely to visit the older states. With finite time and computational resources, it is preferable that an agent focuses on learning from the more recently gathered experiences, since these tend to be more relevant. Storing just the k most recent experiences in the memory implements this idea.
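A minimal sketch of such a memory, assuming a hypothetical `ReplayMemory` class built on `collections.deque`, could look like this. The tiny capacity of 3 is only there to make the discarding behaviour visible.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores the k most recent experiences and samples uniform random batches."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item automatically when full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement within a batch
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

batch = memory.sample(batch_size=2)
print(len(memory), [experience[0] for experience in memory.buffer])
```

After five insertions into a memory of capacity 3, only the three most recent experiences remain, and each sampled batch mixes experiences from different time steps.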



# DQN ALGORITHM
![alt text](https://github.com/Machine-Learning-rc/Unimportant/blob/master/Screenshot%202020-07-12%20at%207.34.54%20AM.png?raw=true)

# IMPLEMENTING DQN

The code below sets up the environment required to run and record the game, and loads the required libraries.

In [None]:
!apt-get update > /dev/null 2>&1
!apt-get install cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout, Conv2D, Flatten, MaxPooling2D, Activation
from tensorflow.keras.models import Sequential, load_model # load_model is needed to restore a saved agent
import gym
import numpy as np
import random
from collections import deque
from tensorflow.keras.utils import normalize as normal_values
import cv2
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #error only
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
from IPython.display import HTML

from IPython import display as ipythondisplay

In [None]:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

In [None]:
"""
Utility functions to enable video recording of gym environment and displaying it
To enable video, just do "env = wrap_env(env)"
"""

def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

This part makes the code below reproducible by fixing the random seeds, and sets up the environment.

In [None]:
import gym.envs.toy_text 
RANDOM_SEED=1
N_EPISODES=500

# random seeds (reproducibility)
random.seed(RANDOM_SEED) # Python's random module is used for exploration and batch sampling
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

# set the env
env = (wrap_env(gym.make("Assault-v0"))) # env to import
env.seed(RANDOM_SEED)
env.reset() # reset to env

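Defining the DQN class. A few things, such as temperature_parameter, are commented out; you can uncomment them if you want to use the Boltzmann policy. The ∊-greedy policy works well in this environment, so I have used that. The class is split across several cells for explanation: the methods defined in the following cells (_create_model, get_action, update_policy and train) take self as their first argument and belong inside this class.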

In [None]:
class DQN:

  def __init__(self, env,path=None):
    self.env=env #import env
    self.state_shape=(250,160,4) # the state space
    self.action_shape=env.action_space.n # the action space
    self.gamma=0.99 # discount factor for future rewards
    self.alpha=1e-4 # learning rate in the policy gradient
    self.learning_rate=0.001 # learning rate in deep learning
    self.epsilon_initial_value=1.0 # initial value of epsilon
    self.epsilon_current_value=1.0 # current value of epsilon
    #self.temperature_parameter_final_value=0.0001 # final value of temperature_parameter
    #self.temperature_parameter_initial_value=5.0 # initial value of temperature_parameter
    #self.temperature_parameter_current_value=5.0 # current value of temperature_parameter
    self.epsilon_final_value=0.0001 # final value of epsilon
    #self.nma=3 # No of Top actions to take while exploring 
    self.observing_episodes=5 #No of episodes to observe before updating
    self.batch_size=128 # number of transitions sampled per update
    self.transitions= deque()
    self.replay_memory=50000 # number of previous transitions to remember
    if not path:
      self.model=self._create_model() #build model
    else:
      self.model=load_model(path) #import model
    
  def remember(self,delta,state,action,next_state,reward):      #This is the function to store our experiences
    self.transitions.append([delta,state,action,next_state,reward])
    if len(self.transitions) > self.replay_memory:
      self.transitions.popleft() # memory is full: discard the oldest experience

Creating a Neural Network Model.

This is for epsilon greedy method.

In [None]:
def _create_model(self):
    ''' builds the model using keras'''
    model = Sequential()
    model.add(Conv2D(32, (8, 8), padding='same',strides=(4, 4),input_shape=(250,160,4)))  
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(Activation('relu'))
    model.add(Conv2D(64, (4, 4),strides=(2, 2),  padding='same'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(Activation('relu'))
    model.add(Conv2D(64, (3, 3),strides=(1, 1),  padding='same'))
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(Activation('relu'))
    model.add(Flatten())
    model.add(Dense(512))
    model.add(Activation('relu'))
    model.add(Dense(7))
    model.compile(loss='MSE',optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate))
    return model

Action Selection

The get_action method guides our action choice. Early in training the agent mostly explores; as ∊ decays, it increasingly exploits what it has learned.

In [None]:
def get_action(self, state):
        '''samples the next action with the epsilon-greedy policy (the temperature-based variant is left commented out)'''

        x=random.random()
        if x < self.epsilon_current_value:                                    #Exploration
          #q_values=(self.model.predict(state)[0])**(1/self.temperature_parameter_current_value) #This is the step where we use Boltzmann exploration policy. Uncomment this and below to use boltzmann policy
          #top_actions=q_values.argsort()[-self.nma:][::-1]
          #action=random.choice(top_actions)
          action=random.choice([0,1,2,3,4,5,6]) # uniform random choice over the 7 actions

        else:
          q_values=(self.model.predict(state)) #Exploitation
          max_Q = np.argmax(q_values) # index of the action with the largest predicted Q-value
          action = max_Q

        return action

Updating the Policy

The update_policy method updates the model weights by training the model on a batch sampled from the experiences stored during training. The batch size and the number of epochs can be tuned to further increase training efficiency.

In [None]:
def update_policy(self):
    '''
    Updates the policy network using the NN model.
      '''
    transitions=random.sample(self.transitions,self.batch_size)
    
    inputs=np.zeros((self.batch_size, 250,160,4))
    targets = np.zeros((self.batch_size,7)) 

    for i in range(0,self.batch_size):
      delta=transitions[i][0] # 0.0 if the transition ended the episode, 1.0 otherwise
      state=transitions[i][1] # 4D stack of images
      action=transitions[i][2] #This is action
      next_state=transitions[i][3] #next state
      reward=transitions[i][4] #reward at state due to action

      inputs[i:i + 1] = state
      targets[i] = (self.model.predict(state)) # predicted q values
      Q_sa = (self.model.predict(next_state))  #predict q values for next step
      if delta==0:
        targets[i, action] = reward
      else:
        targets[i, action] = reward + np.asarray(self.gamma) * np.max(Q_sa)
    self.model.fit(inputs, targets,epochs=20) #Training the model

Training the model

This method runs the training loop. For a set number of episodes, it uses the model to sample actions and plays them in the environment. Every observing_episodes episodes, the stored experiences are used to update the model.

We can perform various pre-processing steps to improve computational efficiency.
Here I have only done greyscaling.

We know that in a dynamic game we cannot predict an action from a single observation (one frame of the game in this case), so we will use a stack of 4 frames as the input.
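The rolling stack of frames can be sketched independently of the game. The `FrameStack` class below is a hypothetical helper, not the notebook's own code; it keeps the frames oldest-first along the last axis, whereas the training cell further down prepends the newest frame. Either order works as long as it is used consistently.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keeps the most recent num_frames greyscale frames as one array."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, frame):
        # At the start of an episode, fill the stack with copies of the first frame
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return self.observation()

    def push(self, frame):
        # Appending to a full deque drops the oldest frame
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        # Shape (height, width, num_frames), oldest frame first
        return np.stack(list(self.frames), axis=-1)

stack = FrameStack()
obs = stack.reset(np.zeros((250, 160), dtype=np.uint8))
obs = stack.push(np.ones((250, 160), dtype=np.uint8))
print(obs.shape)
```

The important detail is that each new observation is built from the previous stack, so the window really slides over the last 4 frames of the episode.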

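We have also defined the reward system ourselves: the reward returned by the environment is replaced by a fixed value that depends on the action taken. This helps the model learn faster about which actions to take.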

In [None]:
def train(self, episodes):
    '''
          train the model
          episodes - number of training iterations
          ''' 
    env=self.env
    total_rewards=np.zeros(episodes)

    for episode in range(episodes):
      # each episode is a new game env
      state=env.reset()
      done=False
      state=cv2.cvtColor(state, cv2.COLOR_RGB2GRAY) #RGB to Grey Scale
      stacked_frames = np.stack((state,state,state,state),axis=2)  # stack 4 images to create input
      stacked_frames = stacked_frames.reshape(1,stacked_frames.shape[0],stacked_frames.shape[1],stacked_frames.shape[2]) #1*250*160*4
      state_=stacked_frames
      episode_reward=0 #record episode reward
      print("Episode Started")
      while not done:
        # play an action and record the game state & reward per episode
        action=self.get_action(state_) #input a stack of 4 images, get the action
        next_state, reward, done, _=env.step(action)
        print("Episode Going On."+"\n"+"Action taken:"+'\t'+"Reward:",action,reward)
        next_state=cv2.cvtColor(next_state, cv2.COLOR_RGB2GRAY) #RGB to Grey Scale
        next_state_ = next_state.reshape(1,next_state.shape[0],next_state.shape[1],1) #1x250x160x1
        stacked_frames_1 = np.append(next_state_, state_[:, :, :, :3], axis=3) # prepend the new frame to the current stack and drop the oldest one
        next_state_=stacked_frames_1
        if done:
          delta=0.0
        else:
          delta=1.0

        if action==2:
            reward=3.0
        elif action==3 or action==4:
            reward=2.8
        elif action==5 or action==6:
            reward=1.5
        else:
            reward=1.0
        
        self.remember(delta,state_,action,next_state_,reward)
        state_=next_state_
        episode_reward+=reward
      print("Episode_reward:{}".format(episode_reward))
      print("Episode Ended")
      if episode%self.observing_episodes==0 and episode!=0:
        self.update_policy()
        self.model.save('model_2_{}.h5'.format(episode))
        print('Current Epsilon Value:',self.epsilon_current_value)
        #print('Current Temperature parameter Value:',self.temperature_parameter_current_value)
      self.epsilon_current_value=max(self.epsilon_final_value,self.epsilon_current_value-(self.epsilon_initial_value-self.epsilon_final_value)/1000) # linear decay, never below the final value
      #self.temperature_parameter_current_value=self.temperature_parameter_current_value-(self.temperature_parameter_initial_value-self.temperature_parameter_final_value)/1000

Agent_2=DQN(env)
Agent_2.train(episodes=2000) 

With the code below we run the trained agent and see how well it does. Before running it, set self.epsilon_current_value=0.001 in the DQN class so that the model does not choose actions randomly, and point path at one of the checkpoints saved during training (they are saved as model_2_{episode}.h5).

In [None]:
env = (wrap_env(gym.make("Assault-v0")))
Agent_3=DQN(env,path='model.h5')
state=env.reset()
done=False
state=cv2.cvtColor(state, cv2.COLOR_RGB2GRAY)
stacked_frames = np.stack((state,state,state,state),axis=2)
stacked_frames = stacked_frames.reshape(1,stacked_frames.shape[0],stacked_frames.shape[1],stacked_frames.shape[2])
state_=stacked_frames
while True:
  
    env.render('ipython')
    
    #your agent goes here
    action =Agent_3.get_action(state_)
    next_state, reward, done, info = env.step(action)
    print(action)
    next_state=cv2.cvtColor(next_state, cv2.COLOR_RGB2GRAY)
    next_state_ = next_state.reshape(1,next_state.shape[0],next_state.shape[1],1)
    stacked_frames_1 = np.append(next_state_, state_[:, :, :, :3], axis=3) # prepend the new frame to the current stack and drop the oldest one
    state_=stacked_frames_1
    if done:
      break
            
env.close()
show_video()

**Ways to increase the model efficiency**

# Skipping Frames
The Arcade Learning Environment (ALE), introduced in the 2013 paper *The Arcade Learning Environment: An Evaluation Platform for General Agents*, provides learning environments for AI built on games originally designed for a classic game console, the Atari 2600. ALE can render 60 frames per second, but people do not take that many actions in a second, and an agent does not need to compute Q-values on every frame. With the frame skipping technique, DQN computes Q-values only every 4 frames and uses the past 4 frames as input. This reduces the computational cost and lets the agent gather more experiences in the same amount of time.
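Frame skipping amounts to repeating the chosen action for a few frames and summing the rewards. The sketch below uses a hypothetical stand-in environment, `CountingEnv`, with the older gym-style `step` return `(obs, reward, done, info)`, so that it runs without an emulator; `skip_step` is likewise an illustrative helper name.

```python
class CountingEnv:
    """Hypothetical stand-in for an emulator: reward 1 per frame, ends after 10 frames."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 10, {}

def skip_step(env, action, skip=4):
    """Repeat action for up to skip frames and accumulate the reward."""
    obs, total_reward, done, info = None, 0.0, False, {}
    for _ in range(skip):
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break  # stop repeating once the episode has ended
    return obs, total_reward, done, info

toy_env = CountingEnv()
toy_env.reset()
obs, reward, done, _ = skip_step(toy_env, action=0)
print(obs, reward, done)
```

The agent then only needs one forward pass of the network per call to `skip_step`, which is where the computational saving comes from.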

# Clipping Rewards
Each game has a different score scale. For example, in Pong, a player gets 1 point for winning a rally and -1 point otherwise. In SpaceInvaders, however, players get 10~30 points for defeating invaders. This difference would make training unstable. The reward clipping technique therefore clips scores: all positive rewards are set to +1 and all negative rewards are set to -1.
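Clipping by sign is a one-liner. The function name `clip_reward` and the sample scores are illustrative assumptions.

```python
import numpy as np

def clip_reward(reward):
    """Map any positive reward to +1, any negative reward to -1, and keep 0 as 0."""
    return float(np.sign(reward))

# Raw scores on different scales, e.g. SpaceInvaders-like and Pong-like values
clipped = [clip_reward(r) for r in (30, 10, 0, -1, -5)]
print(clipped)
```

The trade-off is that the agent can no longer distinguish a reward of 10 from a reward of 30; it only sees that both are positive.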

**DQN has various limitations. In the next notebook we will look at methods to improve DQN.**