This is Part 6 of the Deep Reinforcement Learning Notebook series. This Notebook introduces the first combined method, known as Advantage Actor-Critic (A2C).
The Notebook series is about Deep RL algorithms, so it excludes other techniques that can be used to learn functions in reinforcement learning. The series is also not exhaustive: it covers only the most widely used Deep RL algorithms.

# Actor-Critic Algorithms
Actor-Critic algorithms elegantly combine the policy gradient and a learned value function. In these algorithms, a policy is reinforced with a learned reinforcing signal generated using a learned value function. This contrasts with REINFORCE which uses a high-variance Monte Carlo estimate of the return to reinforce the policy.

All Actor-Critic algorithms have two components that are learned jointly: an actor, which learns a parameterized policy, and a critic, which learns a value function to evaluate state-action pairs. The critic provides a reinforcing signal to the actor.




# Concept of the Advantage Function
In an Actor-Critic algorithm, the policy learns from feedback produced by a value function that is being learned at the same time. This causes a problem in the early stages of training: the value function does not yet produce reasonable signals, so the policy struggles to learn how to select good actions.

It is common to learn the advantage function $A^π(s, a) = Q^π(s, a) - V^π(s)$ as the reinforcing signal in these methods. The key idea is that it is better to select an action based on how it performs relative to the other actions available in a particular state, instead of using the absolute value of that action as measured by the Q-function. The advantage quantifies how much better or worse an action is than the average available action. Actor-Critic algorithms that learn the advantage function are known as Advantage Actor-Critic (A2C) algorithms.
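
As a small numerical illustration of the definition above, the sketch below computes the advantage of each action in a single state. The Q-values and policy probabilities are made-up numbers chosen only for the example; V is taken as the policy-weighted average of Q.

```python
import numpy as np

# Hypothetical Q-values for three actions in one state (illustrative numbers only)
q_values = np.array([1.0, 2.0, 4.0])
# Hypothetical policy probabilities for the same three actions
policy_probs = np.array([0.2, 0.5, 0.3])

# V(s) is the action-probability weighted average of Q(s, a)
v_value = float(np.dot(policy_probs, q_values))

# A(s, a) = Q(s, a) - V(s): positive means better than the policy's average action
advantages = q_values - v_value

print("V(s) =", v_value)
print("A(s, a) =", advantages)
```

Under the policy's own action probabilities the advantages average to zero, which is what "relative to the average available action" means in practice.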

You might think that we now have to build two neural networks, one for the Q value and one for the V value (in addition to the policy network). That would be very inefficient, and care would be needed to keep the two estimates consistent. Instead, we learn only $V^π$ and not $Q^π$. There are two reasons for this: $Q^π$ is a more complex function and may require more samples to learn a good estimate, and estimating $V^π(s)$ from $Q^π(s, a)$ requires computing the values of all possible actions in state s and then taking the action-probability weighted average, which can be computationally expensive. Don't worry, we don't do that here. Instead, we use the relationship between Q and V given by the Bellman equation (Equation 6.1):
![alt text](https://github.com/Machine-Learning-rc/Unimportant/blob/master/Screenshot%202020-07-21%20at%204.01.06%20PM.png?raw=true)

Equation 6.1

So the advantage can be written as (Equation 6.2):
![alt text](https://github.com/Machine-Learning-rc/Unimportant/blob/master/Screenshot%202020-07-21%20at%204.01.12%20PM.png?raw=true)

Equation 6.2
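
A minimal sketch of this idea, assuming Equation 6.2 takes the usual one-step form $A^π(s, a) ≈ r + γV^π(s') - V^π(s)$. The function name and its arguments are illustrative and do not appear in the implementation later in this Notebook.

```python
def one_step_advantage(reward, v_s, v_next, gamma=0.99, done=False):
    """Estimate A(s, a) from V alone as r + gamma * V(s') - V(s).

    When the episode ends at s', there is no future value to bootstrap
    from, so V(s') is treated as 0.
    """
    bootstrap = 0.0 if done else gamma * v_next
    return reward + bootstrap - v_s


# Illustrative numbers only
adv = one_step_advantage(reward=1.0, v_s=2.0, v_next=3.0, gamma=0.9)
adv_terminal = one_step_advantage(reward=1.0, v_s=2.0, v_next=3.0, gamma=0.9, done=True)
print(adv, adv_terminal)
```

Only a critic for V is needed here; Q never has to be represented explicitly.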

We will look at two methods of estimating the advantage: n-step returns and Generalized Advantage Estimation (GAE).

# Actor
An Actor controls how our agent behaves (policy-based). The actor learns a parameterized policy $π_θ$ using the policy gradient, as in REINFORCE, but the advantage is used as the reinforcing signal instead of the return.

The actor update can be written as (Equation 6.3):

![alt text](https://github.com/Machine-Learning-rc/Unimportant/blob/master/Screenshot%202020-07-21%20at%204.01.20%20PM.png?raw=true)

Equation 6.3
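
The sketch below shows one common way to turn such an update into a loss, assuming Equation 6.3 has the usual policy-gradient form in which each $\log π_θ(a_t | s_t)$ is weighted by its advantage. The function `actor_loss` and its inputs are illustrative names, not part of the implementation later in this Notebook, which trains the actor differently.

```python
import numpy as np

def actor_loss(action_probs, actions, advantages):
    """Negative mean of advantage-weighted log-probabilities of the actions taken.

    Minimizing this loss with respect to the policy parameters follows the
    policy gradient, with the advantage as the reinforcing signal.
    """
    action_probs = np.asarray(action_probs, dtype=np.float64)
    actions = np.asarray(actions)
    advantages = np.asarray(advantages, dtype=np.float64)
    # pi(a_t | s_t) for the action actually taken at each step
    taken = action_probs[np.arange(len(actions)), actions]
    return -np.mean(advantages * np.log(taken))


# Two steps, two actions; all numbers are illustrative
probs = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
loss = actor_loss(probs, actions=[0, 1], advantages=[1.0, -0.5])
print(loss)
```

A positive advantage pushes the probability of the chosen action up, and a negative one pushes it down.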

# Critic
A Critic measures how good the action taken by the actor is (value-based). The critic is responsible for learning how to evaluate (s, a) pairs and for using this to generate $A^π$.

The critic update is similar to the value updates we have seen before (in SARSA and DQN). The only difference is that there we learned the state-action value function $Q^π(s, a)$, whereas here we learn the state value function $V^π(s)$.

# Estimating Advantage: n-step Returns
If we assume for a moment that we have a perfect estimate of $V^π(s)$, then the Q-function can be rewritten as a mix of the expected rewards for n time steps, followed by $V^π(s_{n+1})$, as shown in Equation 6.4. To make this tractable to estimate, we use a single trajectory of rewards $(r_1, \ldots, r_n)$ in place of the expectation, and substitute in the value function learned by the critic. The result, shown in Equation 6.5, is known as the n-step forward return.

![alt text](https://github.com/Machine-Learning-rc/Unimportant/blob/master/Screenshot%202020-07-21%20at%204.59.16%20PM.png?raw=true)

Equation 6.5 makes the trade-off between the bias and variance of the estimator explicit. The n steps of actual rewards are unbiased but have high variance, since they come from only a single trajectory. The value function learned by the critic has lower variance, since it reflects an expectation over all of the trajectories seen so far, but it is biased because it is calculated using a function approximator. The intuition behind mixing these two types of estimates is that the variance of the actual rewards typically increases the more steps away from t you take. Close to t, the benefits of using an unbiased estimate may outweigh the variance introduced. As n increases, the variance in the estimates will likely start to become problematic, and switching to a lower-variance but biased estimate is better. The number of steps of actual rewards, n, controls the trade-off between the two.

The formula for estimating the advantage function, combining the n-step estimate for Q with V, is (Equation 6.6):

![alt text](https://github.com/Machine-Learning-rc/Unimportant/blob/master/Screenshot%202020-07-21%20at%205.11.32%20PM.png?raw=true)

Equation 6.6
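
A minimal sketch of an n-step advantage estimate, assuming the form described above: n discounted actual rewards, followed by the discounted critic estimate of the state reached after them, minus the critic estimate of the starting state. The function and argument names are illustrative.

```python
import numpy as np

def n_step_advantage(rewards, v_start, v_after, gamma=0.99):
    """n-step advantage estimate.

    rewards : the n actual rewards collected after taking the action
    v_start : critic estimate V(s) of the state the action was taken in
    v_after : critic estimate of the state reached after the n rewards
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    n = len(rewards)
    discounts = gamma ** np.arange(n)               # 1, gamma, gamma^2, ...
    q_estimate = np.dot(discounts, rewards) + gamma ** n * v_after
    return q_estimate - v_start


# Illustrative numbers: 3 steps of reward 1, gamma = 0.5
adv = n_step_advantage([1.0, 1.0, 1.0], v_start=2.0, v_after=4.0, gamma=0.5)
print(adv)
```

With n = 1 this reduces to the one-step estimate; larger n relies more on actual rewards and less on the critic.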




# Estimating Advantage: Generalized Advantage Estimation (GAE)

Generalized Advantage Estimation (GAE) was proposed by Schulman et al. as an improvement over the n-step return estimate for the advantage function. It addresses the problem of having to explicitly choose the number of steps of returns, n. The main idea behind GAE is that instead of picking one value of n, we mix multiple values of n. That is, calculate the advantage using a weighted average of individual advantages calculated with n = 1, 2, 3, . . . , k. The purpose of GAE is to significantly reduce the variance of the estimator while keeping the bias introduced as low as possible.

Intuitively, GAE takes a weighted average of several advantage estimators with different bias and variance. GAE weights the high-bias, low-variance 1-step advantage the most, but also includes contributions from lower-bias, higher-variance estimators using 2, 3, . . . , k steps. The contribution decays at an exponential rate as the number of steps increases. The decay rate is controlled by the coefficient λ. Therefore, the larger λ is, the higher the variance (and the lower the bias).

![alt text](https://github.com/Machine-Learning-rc/Unimportant/blob/master/Screenshot%202020-07-22%20at%2012.05.47%20PM.png?raw=true)
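
The exponentially weighted average can be computed with a single backward pass over one-step TD errors. The sketch below is a standalone NumPy version of that recursion, written for illustration; the function name `gae` and the example numbers are assumptions and are separate from the `get_GAEs` method defined later.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards : T rewards
    values  : T + 1 critic estimates; the last entry is the estimate for the
              state after the final step (0 if that state is terminal)
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    # one-step TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(rewards)
    running = 0.0
    # A_t = delta_t + gamma * lam * A_{t+1}, accumulated from the end
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 0.2, 0.0]
print(gae(rewards, values, gamma=0.9, lam=0.0))  # one-step TD errors
print(gae(rewards, values, gamma=0.9, lam=1.0))  # discounted return minus V
```

The two extremes make the role of λ concrete: λ = 0 keeps only the 1-step advantage, while λ = 1 uses the full trajectory of actual rewards.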

# Ways to calculate $V^π_{tar}$
For learning the advantage function we need an estimate for $V^π$.

We do this by learning $V^π$ with TD learning, in the same way that it is used to learn $Q^π$ in DQN: parametrize $V^π$ with θ, generate $V^π_{tar}$ for each of the experiences an agent gathers, and minimize the difference between $V^π$ and $V^π_{tar}$ using a regression loss such as MSE. Repeat this process for many steps.

$V^π_{tar}$ can be generated in a few ways.

1) One-step TD estimate: $V^π_{tar}(s) = r + γV^π(s';θ)$

2) Monte Carlo estimate: $V^π_{tar}(s_t) = \sum_{t'=t}^T γ^{t'-t}r_{t'}$

3) GAE-based estimate: $V^π_{tar}(s_t) = A^π_{GAE}(s_t,a_t) + V^π(s_t)$
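
The three options can be compared side by side on a short trajectory. This is an illustrative NumPy sketch: `value_targets` is a hypothetical helper, the numbers are made up, and the one-step target is assumed to include the discount factor γ.

```python
import numpy as np

def value_targets(rewards, values, gamma=0.99, lam=0.95):
    """Return the three candidate critic targets for one trajectory.

    rewards has T entries; values has T + 1 entries, the last being the
    critic estimate for the state after the final step (0 if terminal).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)

    # 1) one-step TD target: r + gamma * V(s')
    td = rewards + gamma * values[1:]

    mc = np.zeros_like(rewards)
    gae_adv = np.zeros_like(rewards)
    deltas = td - values[:-1]
    ret, adv = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        # 2) Monte Carlo target: discounted sum of the remaining rewards
        ret = rewards[t] + gamma * ret
        mc[t] = ret
        # 3) GAE advantage, turned into a target by adding V(s_t) back below
        adv = deltas[t] + gamma * lam * adv
        gae_adv[t] = adv
    return td, mc, gae_adv + values[:-1]


td, mc, gae_target = value_targets([1.0, 0.0, 2.0], [0.5, 1.0, 0.2, 0.0],
                                   gamma=0.9, lam=0.95)
print(td, mc, gae_target)
```

When the final state is terminal, the GAE-based target coincides with the one-step target at λ = 0 and with the Monte Carlo target at λ = 1, so it interpolates between the first two options.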




# Advantage Actor-Critic Algorithm
![alt text](https://github.com/Machine-Learning-rc/Unimportant/blob/master/Screenshot%202020-07-22%20at%202.17.47%20PM.png?raw=true)

# Implementing the Advantage Actor-Critic Algorithm

The code below sets up the environment required to run and record the game, and loads the required libraries.

In [None]:
!apt-get update > /dev/null 2>&1
!apt-get install cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Dense,Dropout,Conv2D, Flatten,MaxPooling2D ,Activation,Input
from tensorflow.keras.models import Sequential,load_model,Model
import gym
import numpy as np
import random
from collections import deque
from tensorflow.keras.utils import normalize as normal_values
import cv2
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #error only
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
from IPython.display import HTML
from IPython.display import clear_output
from IPython import display as ipythondisplay 

In [None]:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

In [None]:
"""
Utility functions to enable video recording of gym environment and displaying it
To enable video, just do "env = wrap_env(env)"
"""

def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

This part makes the code below reproducible by fixing a random seed, and sets up the environment.

In [None]:
RANDOM_SEED=1
N_EPISODES=500

# random seed (reproducibility)
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

# set the env
env = (gym.make("KungFuMaster-v0")) # env to import
env.seed(RANDOM_SEED)
env.reset() # reset to env 

Defining the A2C class. The constructor stores the hyperparameters, builds the actor and critic networks (or loads saved ones when paths are given), and creates the lists that record the observations of each episode.

In [None]:
class A2C:

  def __init__(self, env,path_1=None,path_2=None):
    self.env=env #import env
    self.state_shape=70, 160, 4 # the state space
    self.action_shape=env.action_space.n # the action space
    self.gamma=0.99 # discount factor for future rewards
    self.learning_rate= 1e-5 # learning rate of the Adam optimizer
    self.lambda_=0.90       # λ, the decay hyperparameter for GAE (Generalized Advantage Estimation)
    self.alpha=1e-4 # step size applied to the advantage-weighted policy gradient term
    if not path_1:
      self.Actor_model=self._create_model('Actor')    # policy network: outputs action probabilities
      self.Critic_model=self._create_model('Critic')  # value network: outputs the state value V(s)
    else:
      self.Actor_model=load_model(path_1) # load a saved actor model
      self.Critic_model=load_model(path_2) # load a saved critic model
    
        # record observations
    self.states=[]
    self.gradients=[]
    self.rewards=[]
    self.probs=[]
    self.next_states=[]
    self.actions=[]

Creating a Neural Network Model.

In [None]:
  def _create_model(self,model_type):
    ''' builds the model using keras'''
    inputs=Input(shape=(self.state_shape))
    layer_1=(Conv2D(256, (8, 8), padding='same',strides=(4, 4)))(inputs)  
    layer_2=(MaxPooling2D(pool_size=(2,2),padding='same'))(layer_1)
    layer_3=(Activation('relu'))(layer_2)
    layer_4=(Conv2D(64, (4, 4),strides=(2, 2),  padding='same'))(layer_3)
    layer_5=(MaxPooling2D(pool_size=(2,2),padding='same'))(layer_4)
    layer_6=(Activation('relu'))(layer_5)
    layer_7=(Conv2D(32, (3, 3),strides=(1, 1),  padding='same'))(layer_6)
    layer_8=(MaxPooling2D(pool_size=(2,2),padding='same'))(layer_7)
    layer_9=(Activation('relu'))(layer_8)
    layer_10=(Flatten())(layer_9)
    layer_11=(Dense(512))(layer_10)
    layer_12=(Activation('relu'))(layer_11)
    layer_13=(Dense(self.action_shape))(layer_12)
    layer_14=(Activation('softmax'))(layer_13)
    layer_15=(Dense(1))(layer_12)

    if model_type=='Actor':
      model=Model(inputs,layer_14)
    else:
      model=Model(inputs,layer_15)
    model.compile(loss='mse',optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate))
    return model

This is the preprocessing we apply to each image obtained by interacting with the environment. Here I have converted the frame to grayscale, normalized it, and cropped it to remove the game score and other areas that I found unnecessary for training the agent. This speeds up the training process.
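
To see what these steps do to the frame shape, here is a standalone sketch on a synthetic frame. It assumes the raw Atari frame is a 210 x 160 RGB array, and it swaps `cv2.cvtColor` for a plain luminance-weighted sum so that it runs with NumPy alone; the small constant added to the standard deviation is also an addition, to avoid division by zero on a blank frame.

```python
import numpy as np

def crop_gray_normalize(frame):
    """Crop, grayscale and standardize one RGB frame of shape (210, 160, 3)."""
    cropped = frame[95:-45, :]  # keep rows 95..164, i.e. 70 rows
    # Luminance weights approximate an RGB-to-gray conversion
    # (a stand-in for cv2.cvtColor, used here to stay NumPy-only)
    gray = cropped.astype(np.float64) @ np.array([0.299, 0.587, 0.114])
    # Zero mean and unit variance per frame
    return (gray - gray.mean()) / (gray.std() + 1e-8)


rng = np.random.default_rng(0)
fake_frame = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)
processed = crop_gray_normalize(fake_frame)
print(processed.shape)  # (70, 160)
```

The resulting 70 x 160 shape matches the first two dimensions of `state_shape` used by the A2C class; the third dimension comes from stacking 4 such frames.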

In [None]:
  def get_crop_and_grayscale_frame(self,frame):
    frame=frame[95:-45,:]
    frame=cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    frame=(frame-frame.mean())/frame.std()
    return frame

Action Selection 

The get_action method chooses the agent's action. It uses the actor network to generate a normalized probability distribution over the actions for a given state, then samples the next action from this distribution. The hot_encode_action method encodes an action as a one-hot vector.

In [None]:
  def get_action(self, state):
    '''samples the next action based on the policy probability distribution 
      of the actions'''
    action_probability_distribution=self.Actor_model.predict(state).flatten()
    # norm action probability distribution
    action_probability_distribution/=np.sum(action_probability_distribution)
    
    # sample action   
    action=np.random.choice(self.action_shape,1,p=action_probability_distribution)[0]

    return action, action_probability_distribution
    
  def hot_encode_action(self, action):
    '''encoding the actions into a binary list'''

    action_encoded=np.zeros(self.action_shape, np.float32)
    action_encoded[action]=1

    return action_encoded

The remember method records the observations of each step.
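
One way to read the `encoded_action-action_prob` term stored by this method: for a softmax policy, the one-hot action minus the action probabilities equals the gradient of $\log π(a|s)$ with respect to the logits. The sketch below checks this numerically on made-up logits; it is an interpretation offered for illustration, and all names in it are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


logits = np.array([0.5, -1.0, 2.0])  # illustrative logits for 3 actions
action = 2                           # the action that was sampled

probs = softmax(logits)
one_hot = np.zeros_like(probs)
one_hot[action] = 1.0

# Analytic gradient of log pi(action) with respect to the logits
analytic = one_hot - probs

# Central finite-difference estimate of the same gradient
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(len(logits)):
    up = logits.copy()
    up[i] += eps
    down = logits.copy()
    down[i] -= eps
    numeric[i] = (np.log(softmax(up)[action]) - np.log(softmax(down)[action])) / (2 * eps)

print(analytic)
print(numeric)
```

Multiplying this term by the advantage gives the direction in which the logits should move for that step.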

In [None]:
  def remember(self, state, next_state, action_prob, reward,action):
    '''stores observations'''
    encoded_action=self.hot_encode_action(action)
    self.gradients.append(encoded_action-action_prob)
    self.states.append(state)
    self.rewards.append(reward)
    self.actions.append(action)
    self.probs.append(action_prob)
    self.next_states.append(next_state)

The get_GAEs method computes the advantages using Generalized Advantage Estimation (GAE). These are later used to train both the actor and the critic.

In [None]:
  def get_GAEs(self,v_preds):
    '''
    Advantage Estimation with GAE
    '''
    gaes = np.zeros((len(self.rewards),1))
    future_gae = 0.0
    for t in reversed(range(len(self.rewards))):
      delta = self.rewards[t] + np.asarray(self.gamma) * v_preds[t + 1] - v_preds[t]
      gaes[t] = future_gae = delta + np.asarray(self.gamma) * np.asarray(self.lambda_) *np.asarray(future_gae)
    return gaes

Updating the models
The update_models method updates the Actor and Critic Models.

In [None]:
  def update_models(self):
    '''
    Updates the network.
    '''
    # get V(s) and V(s') from the critic model
    states=(np.array(self.states))[:,0,:,:,:]
    next_states=np.array(self.next_states)[:,0,:,:,:]
    V_s=self.Critic_model.predict(states)
    V_next_s=self.Critic_model.predict(next_states)
    V_last_state=np.reshape(np.array(V_next_s[-1]),(1,1))
    v_all=np.concatenate((V_s,V_last_state),axis=0)

    Advantages=self.get_GAEs(v_all) #Calculating the Advantage

    critic_targets = Advantages +  V_s

    self.Critic_model.fit(states, critic_targets,epochs=3) #Training the Critic Model

    # weight each stored (one-hot action - probability) term by its advantage
    gradients=np.array(self.gradients)*Advantages
    actor_targets=np.asarray(self.alpha)*(gradients)+np.array(self.probs)

    self.Actor_model.fit(states, actor_targets,epochs=3)  #Training the Actor Model

    '''
    # Optional: use this custom step in place of the Actor_model.fit call above
    # if you want entropy regularisation when training the Actor model.
    self.beta=0.01
    optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate)
    actions=tf.one_hot(self.actions,self.action_shape)
    advantages=tf.constant(Advantages[:,0],dtype=tf.float32)

    def train_step(states):
      with tf.GradientTape() as tape:
        probs=self.Actor_model(states,training=True)
        log_probs=tf.math.log(probs+1e-8)
        entropy=-tf.reduce_sum(probs*log_probs,axis=1)
        policy_loss=-tf.reduce_sum(actions*log_probs,axis=1)*advantages
        loss=tf.reduce_mean(policy_loss-self.beta*entropy)
      grads=tape.gradient(loss,self.Actor_model.trainable_variables)
      optimizer.apply_gradients(zip(grads,self.Actor_model.trainable_variables))

    train_step(states)
    '''

    # clear the episode buffers
    self.states=[];self.gradients=[];self.probs=[];self.rewards=[];self.next_states=[];self.actions=[];self.deltas=[];

Training the model
This method runs the training loop. Iterating through a set number of episodes, it uses the model to sample actions and play them. When an episode ends, the model uses the recorded observations to update the policy.
In a dynamic game we cannot choose an action from a single observation (one frame of the game in this case), because a single frame carries no motion information, so we use a stack of 4 frames as the input.
We can also clip the rewards to help the model learn faster.
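
The frame stacking and reward clipping described above can be sketched in isolation as follows. The helper names (`init_stack`, `push_frame`, `clip_reward`) are hypothetical and the frames are dummy arrays. The stack keeps the newest frame in channel 0, and clipping to the sign of the reward is one common choice, assumed here.

```python
import numpy as np

def init_stack(frame, n_frames=4):
    """Repeat the first frame n_frames times -> shape (1, H, W, n_frames)."""
    return np.repeat(frame[np.newaxis, :, :, np.newaxis], n_frames, axis=3)

def push_frame(stack, frame):
    """Put the newest frame in channel 0 and drop the oldest channel."""
    newest = frame[np.newaxis, :, :, np.newaxis]
    return np.concatenate((newest, stack[:, :, :, :-1]), axis=3)

def clip_reward(reward):
    """Clip a reward to -1, 0 or +1 according to its sign."""
    return float(np.sign(reward))


# Dummy 70 x 160 frames filled with 0, 1 and 2 to make the ordering visible
stack = init_stack(np.zeros((70, 160)))
stack = push_frame(stack, np.ones((70, 160)))
stack = push_frame(stack, np.full((70, 160), 2.0))
print(stack.shape, stack[0, 0, 0])
print(clip_reward(500), clip_reward(-3), clip_reward(0))
```

Note that each new stack is built from the previous stack, not from the very first one, so the window always holds the 4 most recent frames.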

In [None]:
  def train(self,episodes):
    env=self.env
    for episode in range(episodes):
      # each episode is a new game env
      state=env.reset()
      done=False
      state= self.get_crop_and_grayscale_frame(state)
      stacked_frames = np.stack((state,state,state,state),axis=2)
      stacked_frames = stacked_frames.reshape(1,stacked_frames.shape[0],stacked_frames.shape[1],stacked_frames.shape[2]) 
      state_=stacked_frames
      episode_reward=0 #record episode reward
      print("Episode Started")
      while not done:
        # play an action and record the game state & reward per episode
        action,action_prob=self.get_action(state_)
        print("Episode Going On."+"\n"+"Action taken:",action)
        next_state, reward, done, _=env.step(action)
        next_state=self.get_crop_and_grayscale_frame(next_state)
        next_state_ = next_state.reshape(1,next_state.shape[0],next_state.shape[1],1)
        # newest frame first, then the 3 most recent frames of the current state
        stacked_frames_1 = np.append(next_state_, state_[:, :, :, :3], axis=3)
        next_state_=stacked_frames_1
        self.remember(state_, next_state_, action_prob, reward,action)
        state_=next_state_
        episode_reward+=reward
      print("Episode:{}  reward:{}".format(episode,episode_reward))
      self.update_models()
      if episode%200==0 and episode!=0:
        self.Actor_model.save('Actor_{}.h5'.format(episode))
        self.Critic_model.save('Critic_{}.h5'.format(episode))
      print("Episode Ended")

In [None]:
no_of_episodes=10000

Agent=A2C(env)
Agent.train(no_of_episodes)

With the code below, we run the trained agent and see how well it performs.

In [None]:
class tester:

  def __init__(self,model):
      self.Actor_model= load_model(model)     #import model

  
  def get_action(self, state):
    '''picks the most probable action under the policy probability distribution 
      of the actions'''
    action_probability_distribution=(self.Actor_model.predict(state))[0]
    action=np.argmax(action_probability_distribution)
    return action
    

  def get_frame(self,frame):
    frame=frame[95:-45,:]
    frame=cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    frame=(frame-frame.mean())/frame.std()
    return frame

In [None]:
env=(wrap_env(gym.make("KungFuMaster-v0")))
state=env.reset()
done=False
test=tester("Actor_Model.h5")
state=test.get_frame(state)
stacked_frames = np.stack((state,state,state,state),axis=2)
stacked_frames = stacked_frames.reshape(1,stacked_frames.shape[0],stacked_frames.shape[1],stacked_frames.shape[2]) 
state_=stacked_frames
while True:
  env.render('ipython')
  action = test.get_action(state_)
  next_state, reward, done, _=env.step(action)
  print(action,reward)
  next_state=test.get_frame(next_state)
  next_state_ = next_state.reshape(1,next_state.shape[0],next_state.shape[1],1)
  # newest frame first, then the 3 most recent frames of the current state
  stacked_frames_1 = np.append(next_state_, state_[:, :, :, :3], axis=3)
  next_state_=stacked_frames_1
  state_=next_state_
  if done:
    break
env.close()
show_video()