<a href="https://colab.research.google.com/github/NicMaq/Reinforcement-Learning/blob/master/e_greedy_and_softmax_explained.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ε-greedy and softmax policies

This Google Colab was published to support the post: XXX Best Practices for Reinforcement Learning. <br> 
It also complements this [GitHub repository](https://github.com/NicMaq/Reinforcement-Learning).
<br><br>
Here, I detail two policies: ε-greedy and softmax.
<br><br>
The 𝜖-greedy policy is a special case of 𝜖-soft policies. It chooses the action with the highest estimated value (the greedy action) with probability 1−𝜖 and a uniformly random action with probability 𝜖.
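
As a minimal sketch of this selection rule, here is a plain-Python version for a single state. The function name `epsilon_greedy_action` and its `rng` parameter are my own illustrative choices, not part of the notebook, and the Q-values are the first row of the fake `qsa` used in the cells further down.

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Return an action index for one state under an epsilon-greedy policy."""
    if rng.random() < epsilon:
        # Explore: every action is equally likely, good or bad.
        return rng.randrange(len(q_values))
    # Exploit: take the first action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_row = [-0.5, 0.7, 0.6, 0.8]
greedy_action = epsilon_greedy_action(q_row, epsilon=0.0)   # epsilon = 0 is purely greedy
print("greedy action:", greedy_action)
```

With `epsilon=0.0` the random branch is never taken, so the call always returns the index of the largest value.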
<br>
There are two problems with 𝜖-greedy. First, when it explores, it picks uniformly among all actions, including those we already know are bad. This limits performance during training and is undesirable in production environments. Therefore, when the network is used to evaluate performance or to control a system, 𝜖 should be set to zero. Setting 𝜖 to zero, however, creates a second problem: the agent now only exploits its knowledge and no longer explores. If the dynamics of the system change even slightly, the algorithm cannot adapt.
<br>
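A solution to this problem is to select actions with probabilities that grow with their current value estimates (proportional to the exponential of the value divided by a temperature τ), so that promising actions are tried more often than bad ones. This is what softmax policies do.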
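
To make the contrast concrete, the sketch below compares how the two policies spread their exploration over one state. It is plain Python, the helper `softmax_probs` is a hypothetical name, and `tau=0.5` is an arbitrary value chosen only so that the differences are visible.

```python
import math

def softmax_probs(q_values, tau):
    """Boltzmann probabilities for one state; tau is the temperature."""
    exps = [math.exp(q / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

q_row = [-0.5, 0.7, 0.6, 0.8]
uniform = [1.0 / len(q_row)] * len(q_row)   # how epsilon-greedy explores
soft = softmax_probs(q_row, tau=0.5)        # illustrative temperature

print("uniform exploration:", uniform)
print("softmax exploration:", [round(p, 3) for p in soft])
```

The clearly bad action (index 0) keeps a 25% share under uniform exploration, while the softmax policy gives it a much smaller share and favors the three actions with similar, higher values.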
<br><br>
**The policies have two distinct roles.** They are used **to find the best action** and they are used **to calculate the TD update**.  
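
The second role can be written out explicitly. As a sketch in my own notation (s' is the next state, r the reward, γ the discount factor `GAMMA`), and assuming the expected-value form of the target that the cells below compute:

$$
y = r + \gamma \sum_{a} \pi(a \mid s')\, Q(s', a)
$$

For a terminal transition there is no next state to bootstrap from, so the target reduces to y = r.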
<br>
First, in the following two cells, we import the required packages and declare a few global constants.
<br>
I am running TensorFlow 2.x.


In [None]:
%tensorflow_version 2.x

import tensorflow as tf
import numpy as np

In [None]:
# Available actions; NUM_ACTIONS is derived from this dictionary
ACTIONS = {
    0: "NOOP",
    1: "FIRE",
    3: "RIGHT",
    4: "LEFT",
    #5: "RIGHTFIRE",
    #6: "LEFTFIRE",
}
NUM_ACTIONS = len(ACTIONS)

# Tau: temperature of the softmax policy
TAU = 0.001
# Epsilon: exploration rate of the e-greedy policy
epsilon = 0.5
# Gamma: discount factor
GAMMA = 0.99

# ε-greedy policy with tie management



In [None]:
# Create fake values of Q(s,a) for a mini-batch of three experiences:
qsa = [[-0.5, 0.7, 0.6, 0.8],[-0.6, 0.9, 0.7, 0.9],[-0.3, -0.9, -0.2, -0.4]]
qsa_tf = tf.convert_to_tensor(qsa)
print('qsa_tf is: ', qsa_tf)   

batch_terminal = [[0],[0],[1]]
batch_reward = [[1],[2],[3]]

qsa_tf is:  tf.Tensor(
[[-0.5  0.7  0.6  0.8]
 [-0.6  0.9  0.7  0.9]
 [-0.3 -0.9 -0.2 -0.4]], shape=(3, 4), dtype=float32)


In [None]:
# Find the maximums of Q(s,a)
all_ones = tf.ones_like(qsa)
qsa_max = tf.math.reduce_max(qsa_tf, axis=1, keepdims=True)
print('qsa_max is: ', qsa_max)

qsa_max_mat = qsa_max * all_ones
print('qsa_max_mat is: ', qsa_max_mat)

losers = tf.zeros_like(qsa_tf)
qsa_maximums = tf.where(tf.equal(qsa_max_mat, qsa_tf), x =all_ones, y =losers)
print('qsa_maximums is: ', qsa_maximums)

qsa_maximums_ind = tf.where(tf.equal(qsa_max_mat, qsa_tf))
print('qsa_maximums_ind is: ', qsa_maximums_ind)

nb_maximums = tf.math.reduce_sum(qsa_maximums, axis=1, keepdims=True)
print('nb_maximums is: ', nb_maximums)

qsa_max is:  tf.Tensor(
[[ 0.8]
 [ 0.9]
 [-0.2]], shape=(3, 1), dtype=float32)
qsa_max_mat is:  tf.Tensor(
[[ 0.8  0.8  0.8  0.8]
 [ 0.9  0.9  0.9  0.9]
 [-0.2 -0.2 -0.2 -0.2]], shape=(3, 4), dtype=float32)
qsa_maximums is:  tf.Tensor(
[[0. 0. 0. 1.]
 [0. 1. 0. 1.]
 [0. 0. 1. 0.]], shape=(3, 4), dtype=float32)
qsa_maximums_ind is:  tf.Tensor(
[[0 3]
 [1 1]
 [1 3]
 [2 2]], shape=(4, 2), dtype=int64)
nb_maximums is:  tf.Tensor(
[[1.]
 [2.]
 [1.]], shape=(3, 1), dtype=float32)


In [None]:
# Without tie management the best_action is:

best_action = tf.math.argmax(qsa, axis=1, output_type=tf.dtypes.int32)
print('best_action is: ', best_action)

best_action is:  tf.Tensor([3 1 2], shape=(3,), dtype=int32)


In [None]:
# With tie management the best action is:

only_one_max = tf.ones_like(nb_maximums)
isMaxMany = nb_maximums > only_one_max
print('isMaxMany is: ', isMaxMany)

if tf.reduce_any(isMaxMany):

  nbr_maximum_int = tf.reshape(nb_maximums,[-1]) 
  nbr_maximum_int = tf.dtypes.cast(nbr_maximum_int, tf.int32)

  for idx in tf.range(best_action.shape[0]):
    print('idx is', idx)
    if isMaxMany[idx]: 
            selected_idx = tf.random.uniform((), minval=0, maxval=nbr_maximum_int[idx], dtype=tf.int32)
            rows_index = tf.slice(qsa_maximums_ind,[0,0],[-1,1])
            all_actions = tf.slice(qsa_maximums_ind,[0,1],[-1,-1])
            current_index = tf.ones_like(rows_index)
            current_index = current_index * tf.cast(idx, dtype=tf.int64)
            selected_rows = tf.where(tf.equal(rows_index,current_index))
            select_action = tf.slice(selected_rows,[0,0],[-1,1])
            select_action = tf.squeeze(select_action)
            new_action = all_actions[select_action[selected_idx]]
            new_action = tf.cast(new_action, dtype=tf.int32)
            tf.print('***************************************************************************************** \n')
            tf.print('egreedy tie management new_action is: ', new_action)
            tf.print('***************************************************************************************** \n')

            indice = tf.reshape(idx,(1,1))
            # tf.tensor_scatter_nd_update returns a new tensor, so the result must be assigned back
            best_action = tf.tensor_scatter_nd_update(best_action, indice, new_action)

  print('best_action is: ', best_action)

isMaxMany is:  tf.Tensor(
[[False]
 [ True]
 [False]], shape=(3, 1), dtype=bool)
idx is tf.Tensor(0, shape=(), dtype=int32)
idx is tf.Tensor(1, shape=(), dtype=int32)
***************************************************************************************** 

egreedy tie management new_action is:  [3]
***************************************************************************************** 

idx is tf.Tensor(2, shape=(), dtype=int32)
best_action is:  tf.Tensor([3 3 2], shape=(3,), dtype=int32)
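
The idea behind the tie management is easier to see without the tensor bookkeeping. Here is a plain-Python sketch of the same rule, where `argmax_random_ties` is a hypothetical helper name: collect every action whose value equals the maximum, then pick one of them uniformly at random.

```python
import random

def argmax_random_ties(q_values, rng=random):
    """Index of the largest value; ties are broken uniformly at random."""
    q_max = max(q_values)
    candidates = [a for a, q in enumerate(q_values) if q == q_max]
    return rng.choice(candidates)

qsa = [[-0.5, 0.7, 0.6, 0.8],
       [-0.6, 0.9, 0.7, 0.9],    # actions 1 and 3 tie
       [-0.3, -0.9, -0.2, -0.4]]
best_actions = [argmax_random_ties(row) for row in qsa]
print(best_actions)   # the second entry is either 1 or 3
```

A plain argmax always returns the first maximum, so without this step action 3 would never be selected greedily for the second experience.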


In [None]:
# Calculate the expected value of Q(s,a) under the e-greedy policy, which is the term used in the TD update:

num_actions_float = tf.dtypes.cast(NUM_ACTIONS, tf.float32)
pi_s = tf.dtypes.cast(all_ones, tf.float32)
print('pi_s cast  is: ', pi_s)
pi_s = pi_s * epsilon / num_actions_float
print('pi_s  is: ', pi_s)

pi_max = (1 - epsilon)/nb_maximums
print('pi_max  is: ', pi_max)
pi = qsa_maximums * pi_max + pi_s
print('pi  is: ', pi)
pi_qsa = tf.multiply(pi, qsa)
print('pi_qsa  is: ', pi_qsa)
sum_piq = tf.math.reduce_sum(pi_qsa, axis=1, keepdims=True)
print('sum_piq  is: ', sum_piq)

pi_s cast  is:  tf.Tensor(
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]], shape=(3, 4), dtype=float32)
pi_s  is:  tf.Tensor(
[[0.125 0.125 0.125 0.125]
 [0.125 0.125 0.125 0.125]
 [0.125 0.125 0.125 0.125]], shape=(3, 4), dtype=float32)
pi_max  is:  tf.Tensor(
[[0.5 ]
 [0.25]
 [0.5 ]], shape=(3, 1), dtype=float32)
pi  is:  tf.Tensor(
[[0.125 0.125 0.125 0.625]
 [0.125 0.375 0.125 0.375]
 [0.125 0.125 0.625 0.125]], shape=(3, 4), dtype=float32)
pi_qsa  is:  tf.Tensor(
[[-0.0625      0.0875      0.075       0.5       ]
 [-0.075       0.33749998  0.0875      0.33749998]
 [-0.0375     -0.1125     -0.125      -0.05      ]], shape=(3, 4), dtype=float32)
sum_piq  is:  tf.Tensor(
[[ 0.6       ]
 [ 0.6875    ]
 [-0.32500002]], shape=(3, 1), dtype=float32)
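
As a cross-check of the numbers above, the same expectation can be reproduced in plain Python. This is only a sketch: the helper names `epsilon_greedy_probs` and `expected_q` are mine, and the probabilities follow the rule used in the cell (every action gets 𝜖/|A|, and the remaining 1−𝜖 is shared equally among the maxima).

```python
def epsilon_greedy_probs(q_values, epsilon):
    """pi(a|s): epsilon/|A| for every action, plus (1 - epsilon) shared among the maxima."""
    q_max = max(q_values)
    n_max = sum(1 for q in q_values if q == q_max)
    base = epsilon / len(q_values)
    return [base + ((1.0 - epsilon) / n_max if q == q_max else 0.0) for q in q_values]

def expected_q(q_values, epsilon):
    """Sum over actions of pi(a|s) * Q(s,a)."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return sum(p * q for p, q in zip(probs, q_values))

qsa = [[-0.5, 0.7, 0.6, 0.8],
       [-0.6, 0.9, 0.7, 0.9],
       [-0.3, -0.9, -0.2, -0.4]]
for row in qsa:
    print(epsilon_greedy_probs(row, 0.5), round(expected_q(row, 0.5), 4))
```

The printed values should agree with `pi` and `sum_piq` above, up to floating-point rounding.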


In [None]:
# To understand tf.tensor_scatter_nd_update (it returns a new tensor and leaves its input unchanged)
indices = tf.constant([[4], [3], [1], [7]])
print('indices is: ', indices)
updates = tf.constant([9, 10, 11, 12])
print('updates is: ', updates)
tensor = tf.ones([8], dtype=tf.int32)
print(tf.tensor_scatter_nd_update(tensor, indices, updates))

indices is:  tf.Tensor(
[[4]
 [3]
 [1]
 [7]], shape=(4, 1), dtype=int32)
updates is:  tf.Tensor([ 9 10 11 12], shape=(4,), dtype=int32)
tf.Tensor([ 1 11  1 10  9  1  1 12], shape=(8,), dtype=int32)


# softmax policy 

The Boltzmann "soft max" probability distribution is defined as follows for a state x:

![alt text](http://www.modelfit.us/uploads/7/6/0/6/76068583/softmax_orig.gif)

The only change we make is to subtract a constant (the maximum action value of the state) from Q(s,a) to prevent overflow. This leaves the probabilities unchanged.

![alt text](http://www.modelfit.us/uploads/7/6/0/6/76068583/softmax2-copy_orig.png)
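
In case the images do not render, here are the two forms written out in LaTeX. This is my reading of the code in the next cell rather than a transcription of the images, and I write the state as s to match Q(s,a), with τ the temperature `TAU`:

$$
\pi(a \mid s) = \frac{e^{Q(s,a)/\tau}}{\sum_{b} e^{Q(s,b)/\tau}}
$$

$$
\pi(a \mid s) = \frac{e^{\left(Q(s,a) - \max_{c} Q(s,c)\right)/\tau}}{\sum_{b} e^{\left(Q(s,b) - \max_{c} Q(s,c)\right)/\tau}}
$$

Both expressions give the same probabilities, because the factor $e^{-\max_{c} Q(s,c)/\tau}$ appears in the numerator and in every term of the denominator and cancels out.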

In [None]:
# Calculate the preference
preferences = qsa_tf / TAU
# Calculate the max preference
max_preference =  tf.math.reduce_max(qsa, axis=1, keepdims=True) / TAU
# Calculate the difference
pref_minus_max = preferences - max_preference
# Then apply the boltzmann operator
exp_preferences = tf.math.exp(pref_minus_max)
sum_exp_preferences = tf.reduce_sum(exp_preferences, axis=1, keepdims=True)
action_probs = exp_preferences / sum_exp_preferences
print("Action probabilities are: ", action_probs)

# The selection of the best action will be achieved by sampling from these probabilities:
best_action = np.random.choice(NUM_ACTIONS, p=action_probs[0])
print("Best action is: ", best_action)

# The TD update will be the following. I am using qsa for simplicity, but keep in mind that you should use the action values of your next state:
expectation = tf.multiply(action_probs, qsa)
sum_expectation = tf.reduce_sum(expectation, axis=1, keepdims=True)
# Terminal transitions (batch_terminal == 1) must not bootstrap from the next state, so mask them out
not_terminal = 1.0 - tf.constant(batch_terminal, dtype=tf.float32)
v_next_vect = not_terminal * sum_expectation
target_vec = batch_reward + GAMMA * v_next_vect
print("The TD update is: ", target_vec )


Action probabilities are:  tf.Tensor(
[[0.0e+00 0.0e+00 0.0e+00 1.0e+00]
 [0.0e+00 5.0e-01 0.0e+00 5.0e-01]
 [3.8e-44 0.0e+00 1.0e+00 0.0e+00]], shape=(3, 4), dtype=float32)
Best action is:  3
The TD update is:  tf.Tensor(
[[1.792]
 [2.891]
 [3.   ]], shape=(3, 1), dtype=float32)
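
To see why the maximum is subtracted before exponentiating, here is a plain-Python sketch with the same temperature. The names `naive_softmax` and `stable_softmax` are hypothetical. In pure Python the naive version raises an `OverflowError`; with float32 tensors in TensorFlow the same overflow would instead show up as `inf` and `nan` values.

```python
import math

TAU = 0.001
q_row = [-0.5, 0.7, 0.6, 0.8]

def naive_softmax(q_values, tau):
    exps = [math.exp(q / tau) for q in q_values]   # exp(800) does not fit in a float
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(q_values, tau):
    q_max = max(q_values)
    exps = [math.exp((q - q_max) / tau) for q in q_values]   # largest exponent is 0
    total = sum(exps)
    return [e / total for e in exps]

try:
    naive_softmax(q_row, TAU)
    overflowed = False
except OverflowError:
    overflowed = True

stable = stable_softmax(q_row, TAU)
print("naive version overflowed:", overflowed)
print("stable version:", stable)
```

With such a small temperature almost all of the probability mass lands on the maximum, which is why the action probabilities printed above are essentially 0 or 1 (or 0.5 each when two actions tie).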
