This is Part 12 of the Deep Reinforcement Learning Notebook series. This notebook introduces the Rainbow algorithm.


The series covers deep RL algorithms only, so it excludes other techniques that can be used to learn functions in reinforcement learning. It is also not exhaustive: it contains only the most widely used deep RL algorithms.

## What is the Rainbow Algorithm?

Rainbow is a DQN-based, off-policy deep reinforcement learning algorithm that combines six extensions of DQN, each of which addresses a limitation of the original algorithm and improves overall performance.

## Extensions Used

1) Double Q-learning

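=> Double Q-learning addresses the overestimation of Q-values by the neural network. It does so by selecting the best next action with the online network and evaluating that action with the target network.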

2) Prioritized replay

=> Prioritized replay replaces the uniform random sampling used in DQN with sampling that favours the transitions from which there is the most to learn.

3) Dueling networks

=> The dueling network is a neural network architecture designed for value-based RL. It features two streams of computation, the value and advantage streams, which share a convolutional encoder and are merged by a special aggregator.

You can read more about Double Q-learning, Prioritized replay, Dueling networks at https://github.com/Rahul-Choudhary-3614/Deep-Reinforcement-Learning-Notebooks/blob/master/Deep_Reinforcement_Learning_Part_5_.ipynb

4) Multi-step learning

=> In one-step Q-learning, the target value uses only the immediate reward plus the discounted value of the next state. In N-step Q-learning, the discounted rewards from N consecutive steps are added together, and the discounted Q-function value is added only at the very end.


5) Distributional RL

=> Instead of learning only the expected return, we learn to approximate the full distribution of returns.

6) Noisy Nets

=> Noisy Nets are a way to improve exploration in the environment by adding noise to the network parameters.

You can read more about Noisy networks at https://github.com/Rahul-Choudhary-3614/Deep-Reinforcement-Learning-Notebooks/blob/master/Deep_Reinforcement_Learning_Part_11_.ipynb
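
To make the multi-step idea from item 4 concrete, here is a minimal sketch of how an N-step target could be computed. The function name `n_step_target` and its arguments are hypothetical and are not part of the notebook's code; the sketch only illustrates the formula "sum of discounted rewards, then one discounted bootstrap value at the end".

```python
def n_step_target(rewards, bootstrap_value, gamma, done):
    """Return sum_k gamma^k * r_k, plus gamma^n * bootstrap_value if not done.

    rewards: the n rewards collected after taking the action.
    bootstrap_value: value estimate of the state reached after n steps.
    done: True if the episode ended within these n steps (no bootstrap).
    """
    target = 0.0
    for k, reward in enumerate(rewards):
        target += (gamma ** k) * reward
    if not done:
        # The value estimate is added only once, at the very end.
        target += (gamma ** len(rewards)) * bootstrap_value
    return target


# One-step target: r + gamma * V
one_step = n_step_target([1.0], bootstrap_value=10.0, gamma=0.5, done=False)
# Three-step target: r0 + gamma*r1 + gamma^2*r2 + gamma^3 * V
three_step = n_step_target([1.0, 2.0, 4.0], bootstrap_value=8.0, gamma=0.5, done=False)
```

With `n = 1` this reduces to the usual one-step target, so the same helper covers both cases.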




Full Rainbow paper => [https://arxiv.org/pdf/1710.02298.pdf]

The code below loads the required libraries and two components of our algorithm: the prioritized experience replay memory buffer and the noisy dense layer.

In [None]:
import tensorflow as tf
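from tensorflow.keras.layers import Dense,Dropout,Conv2D,Flatten,MaxPooling2D,Activation
from tensorflow.keras.models import load_model  # needed by Rainbow.__init__ when model paths are given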
import numpy as np
import gym
import pickle
import random
import imageio
import os
from collections import deque
from tensorflow.python.framework import tensor_shape
import cv2
import matplotlib
import matplotlib.pyplot as plt
from skimage.transform import resize
%matplotlib inline
from PriorityExperienceReplay import Memory
from noisy_nets import noisy_dense

This part makes the code below reproducible by fixing a random seed, and sets up the environment.

In [None]:
RANDOM_SEED=1

# random seed (reproducibility)
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

# set the env
env_name = "Bowling-v0" 
env = gym.make(env_name)
env.seed(RANDOM_SEED)
env.reset();

This part initializes the parameters needed to implement distributional RL.


In [None]:
N_atoms = 51
V_Max = 20.0
V_Min = 0.0
Delta_z = (V_Max - V_Min)/(N_atoms - 1)
z_list = tf.constant([V_Min + i * Delta_z for i in range(N_atoms)],dtype=tf.float32)
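# action_shape is only defined in a later cell, so read the number of actions from the environment here
z_list_broadcasted = tf.tile(tf.reshape(z_list,[1,N_atoms]), tf.constant([env.action_space.n,1]))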

This part initializes the hyperparameters needed by our algorithm.

In [None]:
observing_episodes = 4    # number of episodes between training-network updates (the Rainbow class below sets its own value)
observing_episodes_target_model = 10000 # number of episodes between target-network updates (the Rainbow class below sets its own value)
learning_rate = 3e-3 # learning rate

epsilon_initial_value = 1.0 # initial value of epsilon
epsilon_current_value = 1.0 # current value of epsilon
epsilon_final_value = 0.001 # final value of epsilon

batch_size = 256
gamma = 0.99 # discount factor for future rewards
state_shape = (76, 160, 4) # shape of one state: 4 stacked 76x160 grayscale frames
action_shape = env.action_space.n # number of discrete actions
memory_size = 10000 # capacity of the replay memory

Creating the Rainbow model class. Two noisy dense layers are used as the final layers of the model.

In [None]:
class _rainbow_model(tf.keras.Model):
  
  def __init__(self,state_shape,action_shape,N_atoms):
    super(_rainbow_model,self).__init__()

    self.layer_1 = Conv2D(32,5,strides=2,input_shape=(state_shape))
    self.layer_2 = MaxPooling2D(pool_size=(2,2))
    self.layer_3 = Activation('relu')

    self.layer_4 = Conv2D(64,3)
    self.layer_5 = MaxPooling2D(pool_size=(2,2))
    self.layer_6 = Activation('relu')

    self.layer_7 = Conv2D(128,3)
    self.layer_8 = MaxPooling2D(pool_size=(2,2))
    self.layer_9 = Activation('relu')

    self.layer_10 = Flatten()
    self.layer_11 = noisy_dense(64)
    self.layer_12 = Activation('relu')

    self.layer_13 = noisy_dense(action_shape*N_atoms)
  
  def call(self,x):
    x = self.layer_3(self.layer_2(self.layer_1(x)))
    x = self.layer_6(self.layer_5(self.layer_4(x)))
    x = self.layer_9(self.layer_8(self.layer_7(x)))
    x = self.layer_12(self.layer_11(self.layer_10(x)))
    x = self.layer_13(x)
    return x
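
The `noisy_dense` layer imported from `noisy_nets` is not shown in this notebook. As a rough illustration of what such a layer computes, here is a NumPy sketch of a factorised-Gaussian noisy linear layer in the style of the Noisy Nets paper. The class `NoisyLinearSketch`, its initialisation constants and its `forward` signature are assumptions made for this example and may differ from the actual `noisy_dense` implementation.

```python
import numpy as np


def scaled_noise(rng, size):
    """f(x) = sign(x) * sqrt(|x|) applied to standard normal samples."""
    x = rng.standard_normal(size)
    return np.sign(x) * np.sqrt(np.abs(x))


class NoisyLinearSketch:
    """y = x @ (w_mu + w_sigma * eps_w) + (b_mu + b_sigma * eps_b)."""

    def __init__(self, in_dim, out_dim, sigma0=0.5, seed=0):
        self.rng = np.random.default_rng(seed)
        bound = 1.0 / np.sqrt(in_dim)
        # Learnable means and noise scales (fixed here, since nothing is trained).
        self.w_mu = self.rng.uniform(-bound, bound, size=(in_dim, out_dim))
        self.b_mu = self.rng.uniform(-bound, bound, size=out_dim)
        self.w_sigma = np.full((in_dim, out_dim), sigma0 / np.sqrt(in_dim))
        self.b_sigma = np.full(out_dim, sigma0 / np.sqrt(in_dim))

    def forward(self, x, noisy=True):
        if not noisy:
            return x @ self.w_mu + self.b_mu
        # Factorised noise: one vector per input unit and one per output unit.
        eps_in = scaled_noise(self.rng, self.w_mu.shape[0])
        eps_out = scaled_noise(self.rng, self.w_mu.shape[1])
        w = self.w_mu + self.w_sigma * np.outer(eps_in, eps_out)
        b = self.b_mu + self.b_sigma * eps_out
        return x @ w + b


layer = NoisyLinearSketch(in_dim=8, out_dim=3)
x = np.ones((2, 8))
out_a = layer.forward(x)               # fresh noise on every call
out_b = layer.forward(x)
out_mean = layer.forward(x, noisy=False)
```

Because the noise is resampled on each forward pass, two calls with the same input give different outputs, which is what drives exploration without an explicit epsilon.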

Defining the Rainbow class. A few things, such as temperature_parameter, are commented out; you can uncomment them if you want to use a Boltzmann policy. Here epsilon-greedy works well, so that is what is used.

In [None]:
class Rainbow():

  def __init__(self,env,memory_size,path_1=None,path_2=None):
    self.env = env
    self.memory = Memory(memory_size)
    self.learning_rate = learning_rate
    self.state_shape = state_shape
    self.action_shape = action_shape
    self.epsilon_current_value = epsilon_current_value
    self.epsilon_initial_value = epsilon_initial_value
    self.epsilon_final_value = epsilon_final_value
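    self.observing_episodes = 10   # number of episodes between training-model updates
    self.observing_episodes_target_model = 200   # number of episodes between target-model updates
    #self.temperature_parameter_initial_value=5.0 # initial value of the temperature
    #self.temperature_parameter_current_value=5.0# current value of the temperature
    #self.temperature_parameter_final_value=1.0 # final value of the temperature
    self.gamma  = gamma
    self.batch_size = batch_size
    self._num_step = 2000   # exponent applied to gamma in get_loss; note that 0.99**2000 is close to 0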

    if not path_1:
      self.target_model =  _rainbow_model(self.state_shape,self.action_shape,N_atoms)    #Target Model is model used to calculate target values
      self.training_model = _rainbow_model(self.state_shape,self.action_shape,N_atoms)  #Training Model is model to predict q-values to be used.
    else:
      self.training_model=load_model(path_1)
      self.target_model=load_model(path_2)

  def get_frame(self,frame):
    frame=cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    frame=frame[100:-34,:]
    frame=frame/255.
    return frame
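
To check the shapes that `get_frame` and the 4-frame stacking in `train` and `evaluate` produce, the sketch below reproduces the same steps with NumPy only. It assumes the raw Atari frame is 210x160x3, and it uses a plain channel mean as a stand-in for `cv2.cvtColor` (OpenCV uses a weighted sum); the helper names are hypothetical.

```python
import numpy as np


def preprocess(frame_rgb):
    """Grayscale (channel mean as a stand-in for cv2), crop rows, scale to [0, 1]."""
    gray = frame_rgb.mean(axis=2)
    cropped = gray[100:-34, :]          # same crop as get_frame
    return cropped / 255.0


def initial_stack(frame):
    """Repeat the first frame 4 times and add a batch axis: (1, H, W, 4)."""
    stacked = np.stack((frame, frame, frame, frame), axis=2)
    return stacked[np.newaxis, ...]


def push_frame(stacked, frame):
    """Put the newest frame in channel 0 and drop the oldest channel."""
    newest = frame[np.newaxis, :, :, np.newaxis]
    return np.append(newest, stacked[:, :, :, :3], axis=3)


raw = np.zeros((210, 160, 3), dtype=np.uint8)   # assumed raw frame size
stack = initial_stack(preprocess(raw))
stack = push_frame(stack, np.ones((76, 160)))   # pretend a new frame arrived
```

The resulting array has shape `(1, 76, 160, 4)`, which matches `state_shape` with a leading batch dimension.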

The model output is first passed through a softmax layer and clipped to obtain the distributional Q-values, i.e. a probability for each atom of the support. These are then converted to Q-values by taking the probability-weighted sum of the atom values.

This is the Distributional RL part of the Rainbow algorithm

In [None]:
  def get_q_values(self,model_output):
    q_distributional = tf.reshape(model_output, [-1, self.action_shape, N_atoms])
    q_distributional = tf.nn.softmax(q_distributional, axis = 2)
    q_distributional = tf.clip_by_value(q_distributional, 1e-8, 1.0-1e-8)

    q_values =  tf.multiply(q_distributional, z_list_broadcasted)
    q_values = tf.reduce_sum(q_values, axis=2)
    return q_distributional,q_values
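
The same conversion can be followed step by step in NumPy. This is only an illustrative sketch: the function `q_from_logits` is hypothetical, it reuses the notebook's support settings (51 atoms between 0 and 20), and it leaves out the clipping step.

```python
import numpy as np

n_atoms, v_min, v_max = 51, 0.0, 20.0
z = np.linspace(v_min, v_max, n_atoms)   # support atoms z_0 ... z_50


def q_from_logits(logits):
    """logits: (batch, actions, atoms) -> (atom probabilities, expected Q-values)."""
    shifted = logits - logits.max(axis=2, keepdims=True)   # numerically stable softmax
    probs = np.exp(shifted)
    probs /= probs.sum(axis=2, keepdims=True)
    q_values = (probs * z).sum(axis=2)                      # expectation over the support
    return probs, q_values


logits = np.zeros((1, 2, n_atoms))   # action 0: uniform distribution over atoms
logits[0, 1, -1] = 50.0              # action 1: almost all mass on the largest atom
probs, q = q_from_logits(logits)
```

A uniform distribution over the support gives a Q-value at the midpoint of the range, while concentrating the mass on the last atom pushes the Q-value towards `v_max`, so the greedy action here is action 1.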

Action Selection 

The get_action method guides our action choice. Early in training, when epsilon is high, it mostly explores with random actions; as epsilon decays, it increasingly exploits the learned Q-values.

You can uncomment the commented lines to use Boltzmann exploration policy instead of epsilon greedy policy.

In [None]:
  def get_action(self, state,status='Training'):
    '''samples the next action based on the E-greedy policy'''

    if status=="Evaluating":
       _ , q_values = self.get_q_values(self.training_model(state))
       action = np.argmax(q_values)
       return action

    if random.random() < self.epsilon_current_value:                                    #Exploration
      #_ , q_values = self.get_q_values(self.training_model.predict(state))
      #q_values=(q_values[0])**(1/self.temperature_parameter_current_value)  #This is the step where we use Boltzmann exploration policy
      #top_actions=q_values.argsort()[-self.nma:][::-1]
      #action=random.choice(top_actions)
      action = random.choice(list(range(self.action_shape)))
    else:
      _ , q_values = self.get_q_values(self.training_model(state)) #Exploitation
      action = np.argmax(q_values)
    return action
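
The interaction between epsilon-greedy selection and the linear epsilon decay used later in `train` can be sketched in isolation. The helper names below are hypothetical; the decay constants mirror the notebook's values (from 1.0 to 0.001 over 1000 steps).

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploitation


def decay_epsilon(epsilon, initial=1.0, final=0.001, steps=1000):
    """Linear decay by a fixed amount until the final value is reached."""
    if epsilon > final:
        epsilon -= (initial - final) / steps
    return epsilon


rng = random.Random(1)
greedy_action = epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0, rng=rng)
random_action = epsilon_greedy([0.1, 0.7, 0.2], epsilon=1.0, rng=rng)

eps = 1.0
for _ in range(1000):
    eps = decay_epsilon(eps)
```

In the notebook the decay is applied once per episode, so epsilon reaches its final value after roughly 1000 episodes.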

This function computes the overall loss of the algorithm. To fully understand it, you are advised to read the paper ([https://arxiv.org/pdf/1710.02298.pdf])

In [None]:
  def get_loss(self,states_mb,dones_mb,rewards_mb,preds_mb,actions_mb,ISWeights_mb):
    
    states_mb = tf.cast(states_mb,tf.float32)
    dones_mb = tf.cast(dones_mb,tf.float32)
    rewards_mb = tf.cast(rewards_mb,tf.float32)
    preds_mb = tf.cast(preds_mb,tf.float32)
    actions_mb = tf.cast(actions_mb,tf.float32)
    ISWeights_mb = tf.cast(ISWeights_mb,tf.float32)

    Q_distributional_values_target,_ = self.get_q_values((self.target_model(states_mb)))
    Q_distributional_values_target = tf.cast(Q_distributional_values_target,tf.float32)
    tmp_batch_size = tf.shape(Q_distributional_values_target)[0]
    tmp_batch_size = tf.cast(tmp_batch_size,tf.float32)
    preds_mb = tf.convert_to_tensor(np.asarray(np.array(list(enumerate(preds_mb)),dtype=object).astype('int32')))

    Q_distributional_chosen_by_action_target = tf.gather_nd(Q_distributional_values_target,preds_mb)

    target = tf.tile(tf.reshape(rewards_mb,[-1, 1]), tf.constant([1, N_atoms])) + (self.gamma**self._num_step) * tf.multiply(tf.reshape(z_list,[1,N_atoms]),(1.0 - tf.tile(tf.reshape(dones_mb ,[-1, 1]), tf.constant([1, N_atoms]))))
    target = tf.cast(target,tf.float32)
    target = tf.clip_by_value(target, V_Min, V_Max)
    b = (target - V_Min) / Delta_z
    u, l = tf.math.ceil(b), tf.math.floor(b)
    u_id, l_id = tf.cast(u, tf.int32), tf.cast(l, tf.int32)
    u_minus_b, b_minus_l = u - b, b - l

    Q_distributional_values_online,_ = self.get_q_values((self.training_model(states_mb))) 
    Q_distributional_values_online = tf.cast(Q_distributional_values_online,tf.float32)
    actions_mb = tf.convert_to_tensor(np.asarray(np.array(list(enumerate(actions_mb)),dtype=object).astype('int32')))
    Q_distributional_chosen_by_action_online = tf.gather_nd(Q_distributional_values_online, actions_mb)

    index_help = tf.tile(tf.reshape(tf.range(tmp_batch_size),[-1, 1]), tf.constant([1, N_atoms])) 
    index_help = tf.expand_dims(index_help, -1)

    index_help = tf.cast(index_help,tf.int32)
    u_id = tf.cast(u_id,tf.int32)
    l_id = tf.cast(l_id,tf.int32)

    u_id = tf.concat([index_help, tf.expand_dims(u_id, -1)], axis=2)
    l_id = tf.concat([index_help, tf.expand_dims(l_id, -1)], axis=2)
    error = Q_distributional_chosen_by_action_target * u_minus_b * tf.math.log(tf.gather_nd(Q_distributional_chosen_by_action_online, l_id)) + Q_distributional_chosen_by_action_target * b_minus_l * tf.math.log(tf.gather_nd(Q_distributional_chosen_by_action_online, u_id))
    error = tf.reduce_sum(error, axis=1)
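    # NOTE: the importance-weighted loss below is computed but not returned;
    # the caller differentiates and re-uses the absolute error returned here.
    loss = tf.negative(error * ISWeights_mb)
    error_op = tf.abs(error)
    return error_op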
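
For readers who want to see the categorical projection from the distributional RL literature in a smaller form, here is a NumPy sketch. It is not the notebook's `get_loss`: it scatters the target probabilities onto the neighbouring atoms (instead of gathering log-probabilities at `l` and `u`), it assumes `next_probs` holds the target network's distribution for the next state and chosen action, and all function names are hypothetical.

```python
import numpy as np


def project_distribution(next_probs, rewards, dones, gamma, v_min, v_max, n_atoms):
    """Project the distribution of r + gamma * z back onto the fixed support."""
    delta_z = (v_max - v_min) / (n_atoms - 1)
    z = np.linspace(v_min, v_max, n_atoms)
    batch = next_probs.shape[0]

    # Shifted and shrunk atom positions, clipped to the support range.
    tz = rewards[:, None] + gamma * (1.0 - dones[:, None]) * z[None, :]
    tz = np.clip(tz, v_min, v_max)
    b = np.clip((tz - v_min) / delta_z, 0, n_atoms - 1)   # fractional atom index
    l = np.floor(b).astype(int)
    u = np.ceil(b).astype(int)

    projected = np.zeros((batch, n_atoms))
    rows = np.arange(batch)[:, None].repeat(n_atoms, axis=1)
    same = l == u   # b falls exactly on an atom: give it all the mass
    np.add.at(projected, (rows, l), next_probs * np.where(same, 1.0, u - b))
    np.add.at(projected, (rows, u), next_probs * np.where(same, 0.0, b - l))
    return projected


def cross_entropy(target_probs, online_probs, eps=1e-8):
    """Per-sample cross-entropy between projected target and online distribution."""
    return -(target_probs * np.log(online_probs + eps)).sum(axis=1)


next_probs = np.array([[0.0, 0.0, 0.0, 0.0, 1.0],    # all mass on the top atom
                       [0.2, 0.2, 0.2, 0.2, 0.2]])   # uniform, terminal transition
rewards = np.array([0.5, 1.0])
dones = np.array([0.0, 1.0])
projected = project_distribution(next_probs, rewards, dones,
                                 gamma=0.5, v_min=0.0, v_max=4.0, n_atoms=5)
loss = cross_entropy(projected, np.full((2, 5), 0.2))
```

In the first row the atom at value 4 moves to 0.5 + 0.5 * 4 = 2.5, so its mass is split evenly between atoms 2 and 3. In the second row the transition is terminal, so all mass lands on the atom equal to the reward.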

Updating the model

The update_training_model method updates the training model weights.

The update_target_model method updates the target model weights.

In [None]:
  def update_training_model(self):
    
    tree_idx, batch, ISWeights_mb = self.memory.sample(self.batch_size)
    
    states_mb = np.zeros((self.batch_size,*self.state_shape))
    dones_mb =   np.zeros((self.batch_size,1))
    rewards_mb = np.zeros((self.batch_size,1))
    preds_mb   = np.zeros((self.batch_size,1))
    actions_mb = np.zeros((self.batch_size,1))

    for i in range(self.batch_size):
      states_mb[i]           =         batch[i][0][0]
      _ ,q_values            = self.get_q_values(self.training_model(batch[i][0][0]))
      preds_mb[i]  = np.argmax(q_values)
      actions_mb[i] = batch[i][0][1]
      rewards_mb[i] = (batch[i][0][2])
      dones_mb[i] = batch[i][0][3]      

    optimizer = tf.keras.optimizers.Adam(learning_rate=self.learning_rate)

    def train_step(states_mb,dones_mb,rewards_mb,preds_mb,actions_mb,ISWeights_mb):
      with tf.GradientTape() as tape:
        loss = self.get_loss(states_mb,dones_mb,rewards_mb,preds_mb,actions_mb,ISWeights_mb)                
      grads = tape.gradient(loss,self.training_model.trainable_variables)
      optimizer.apply_gradients(zip(grads, self.training_model.trainable_variables))
      return loss

    abs_error = train_step(states_mb,dones_mb,rewards_mb,preds_mb,actions_mb,ISWeights_mb)

    # Update priority
    abs_error=abs_error/(np.max(abs_error)+1e-30)
    self.memory.batch_update(tree_idx, abs_error)

  def update_target_model(self):
    self.target_model.set_weights(self.training_model.get_weights()) 
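
The `Memory` class that provides `sample` and `batch_update` is imported from another file and not shown here. As a hedged illustration of what proportional prioritized sampling with importance-sampling weights involves, the sketch below samples directly from an array of priorities. The function `sample_prioritized`, the `alpha` and `beta` defaults, and the direct sampling (rather than a sum tree) are assumptions for this example, not the notebook's implementation.

```python
import numpy as np


def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Sample indices with probability proportional to priority**alpha.

    Returns the sampled indices, their importance-sampling weights
    (normalised so the largest weight in the batch is 1), and the
    sampling probabilities of all stored transitions.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    scaled = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = scaled / scaled.sum()
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    # Importance-sampling weights correct the bias of non-uniform sampling.
    weights = (len(probs) * probs[idx]) ** (-beta)
    weights /= weights.max()
    return idx, weights, probs


idx, weights, probs = sample_prioritized([1.0, 2.0, 3.0, 4.0], batch_size=8)
```

Transitions with a larger priority are sampled more often but receive a smaller weight, which is the role `ISWeights_mb` is meant to play in the loss.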

This function evaluates the agent during training and records the results as video.


In [None]:
  def evaluate(self,ep,no_of_testing_episodes=20):
    Average_Reward=[]
    for episode in range(no_of_testing_episodes):
      writer = imageio.get_writer("Evaluating_video_{}_{}.mp4".format(ep,episode), fps=20)
      env = (gym.make("Bowling-v0"))
      state = env.reset()
      writer.append_data(state)
      state_ = self.get_frame(state)
      stacked_frames = np.stack((state_,state_,state_,state_),axis=2)
      stacked_frames = stacked_frames.reshape(1,stacked_frames.shape[0],stacked_frames.shape[1],stacked_frames.shape[2])
      done=False
      episode_reward=0  
      while not done:
        action=self.get_action(stacked_frames,"Evaluating")
        next_state, reward, done, info=env.step(action)
        writer.append_data(next_state) 
        next_state_ = self.get_frame(next_state)
        next_state_ = next_state_.reshape(1,next_state_.shape[0],next_state_.shape[1],1)
        stacked_frames = np.append(next_state_, stacked_frames[:, :, :, :3], axis=3)
        episode_reward+=reward
      Average_Reward.append(episode_reward)
      print("Evaluating_Episode:{}  Reward:{} Average_Reward:{}".format(episode,episode_reward,sum(Average_Reward)/len(Average_Reward)))
      writer.close()

Training the model

This method runs the training loop. For a set number of episodes, it uses the model to select actions, plays them in the environment, and stores each transition in the replay memory. Every few episodes the stored transitions are used to update the training model, and the target model is synchronised with it less frequently.

In a dynamic game, a single observation (one frame of the game in this case) is not enough to predict a good action, so we use a stack of 4 frames as the model input.

In [None]:
  def train(self,no_of_episodes):
    self.Average_rewards = []
    for episode in range(no_of_episodes):
      state = env.reset()
      state_ = self.get_frame(state)
      stacked_frames = np.stack((state_,state_,state_,state_),axis=2)
      stacked_frames = stacked_frames.reshape(1,stacked_frames.shape[0],stacked_frames.shape[1],stacked_frames.shape[2])
      done = False
      episode_reward = 0
      while not done:
        action = self.get_action(stacked_frames)
        next_state, reward, done, info = env.step(action)
        next_state = self.get_frame(next_state)
        next_state_ = next_state.reshape(1,next_state.shape[0],next_state.shape[1],1)
        next_state_ = np.append(next_state_, stacked_frames[:, :, :, :3], axis=3)

        experience = stacked_frames, action, reward, 1*done
        episode_reward+=reward
        self.memory.store(experience)
        stacked_frames = next_state_
        
      if episode%self.observing_episodes==0 and episode!=0:
        self.update_training_model()
         
      if episode%self.observing_episodes_target_model==0 and episode!=0:
        self.update_target_model()

      self.Average_rewards.append(episode_reward)
      avg_reward = np.mean(self.Average_rewards[-40:])

      if self.epsilon_current_value > self.epsilon_final_value:
        self.epsilon_current_value=self.epsilon_current_value-(self.epsilon_initial_value-self.epsilon_final_value)/1000.0
      
      print("Episode:{} Average Reward:{} Reward:{} Epsilon:{}".format(episode,avg_reward,episode_reward,self.epsilon_current_value))


      if episode%500==0 and episode!=0:

        self.evaluate(episode,1)
        
        weights = self.training_model.get_weights()
        with open("training_model_{}.txt".format(episode), "wb") as fp:
          pickle.dump(weights, fp)

      #if self.temperature_parameter_current_value > self.temperature_parameter_final_value:
        #self.temperature_parameter_current_value=self.temperature_parameter_current_value-(self.temperature_parameter_initial_value-self.temperature_parameter_final_value)/1000.0

In [None]:
no_of_episodes=1001

Agent = Rainbow(env,memory_size)
Agent.train(no_of_episodes)

The code below loads the saved weights and runs the trained agent so that we can see how well it performs.

In [None]:
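class tester:
  def __init__(self,path):
    self.model = _rainbow_model(state_shape,action_shape,N_atoms)
    with open(path, "rb") as fp:
      weights = pickle.load(fp)
    # run one dummy batch through the model so its variables exist before loading the weights
    self.model(np.zeros((1,*state_shape),dtype=np.float32));
    self.model.set_weights(weights)

  def get_q_values(self,model_output):
    '''converts the model output into q-values (same computation as Rainbow.get_q_values)'''
    q_distributional = tf.reshape(model_output, [-1, action_shape, N_atoms])
    q_distributional = tf.nn.softmax(q_distributional, axis = 2)
    q_distributional = tf.clip_by_value(q_distributional, 1e-8, 1.0-1e-8)
    q_values = tf.reduce_sum(tf.multiply(q_distributional, z_list_broadcasted), axis=2)
    return q_distributional,q_values
      
  def get_action(self, state):
    '''selects the greedy action from the loaded model'''
    _ , q_values = self.get_q_values(self.model(state))
    action = np.argmax(q_values)
    return action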
  
  def get_frame(self,frame):
    frame=cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    frame=frame[100:-34,:]
    frame=frame/255.
    return frame

In [None]:
test = tester("training_model_1000.txt")  # weights pickled by train() at episode 1000
writer = imageio.get_writer("test_video.mp4", fps=20)
env = (gym.make("Bowling-v0"))
state = env.reset()
writer.append_data(state)
state_ = test.get_frame(state)
stacked_frames = np.stack((state_,state_,state_,state_),axis=2)
stacked_frames = stacked_frames.reshape(1,stacked_frames.shape[0],stacked_frames.shape[1],stacked_frames.shape[2])
episode_reward=0
while True:
  action = test.get_action(stacked_frames)
  next_state, reward, done, info=env.step(action)
  writer.append_data(next_state) 
  next_state_=test.get_frame(next_state)
  next_state_ = next_state_.reshape(1,next_state_.shape[0],next_state_.shape[1],1)
  stacked_frames = np.append(next_state_, stacked_frames[:, :, :, :3], axis=3)
  episode_reward+=reward
  if done:
    break
env.close()
writer.close()
print("Testing_Episode Reward:{}".format(episode_reward))