Custom notebook I wrote for the course "Fundamentals of Deep Reinforcement Learning" by LVx, as the original one is no longer available on edX.


# PART 1: Bandit Problem

In [2]:
import numpy as np

Let's start by creating the 3 machines described in the video.

In [3]:
import numpy as np

class random_number_gaussian():
    """
    Return a random number drawn from a Gaussian distribution.
    Inputs: mean and standard deviation of the Gaussian.
    """
    def __init__(self, mean, std_dev):
        self.mean = mean
        self.std_dev = std_dev
    
    def pull(self):
        return np.random.normal(self.mean, self.std_dev)

mach_1 = random_number_gaussian(1, 2) # arbitrary values, not precisely the ones in the video
mach_2 = random_number_gaussian(0, 3)
mach_3 = random_number_gaussian(3, 5)

list_machines = [mach_1, mach_2, mach_3]

Now let's define the action space (the list of possible actions), the action values (the estimated expected reward for each action) and the count of how many times we pulled each lever.

In [17]:
a = [0, 1, 2] # 3 possible levers to pull
Q = [0] * len(a) # the same as Q = [0, 0, 0]
N = [0] * len(a)

Now, let's define our agent. It has no memory of its own, since the action values and the number of times each lever was pulled are stored in external variables, so a simple function is enough and we do not need a class.

In [5]:
def agent(a, Q, epsilon):
    """
    An epsilon greedy agent.
    """
    if np.random.random() < epsilon:
        action = np.random.choice(a)
    else:
        action = np.argmax(Q)
        
    return action
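
As a quick check of the two branches, here is a minimal sketch that repeats the same epsilon-greedy rule under a hypothetical name, `epsilon_greedy`, and runs it on made-up action values (`Q_demo` is not from the course). With epsilon = 0 the agent should always pick the lever with the highest estimate, and with epsilon = 1 it should pick uniformly at random.

```python
import numpy as np

def epsilon_greedy(a, Q, epsilon):
    """Same rule as agent() above, repeated so this cell runs on its own."""
    if np.random.random() < epsilon:
        return np.random.choice(a)   # explore: uniform random lever
    return int(np.argmax(Q))         # exploit: lever with the highest estimate

np.random.seed(0)                    # fixed seed, only for reproducibility
a_demo = [0, 1, 2]
Q_demo = [0.5, -1.0, 2.0]            # made-up estimates, for illustration only

greedy_picks = [epsilon_greedy(a_demo, Q_demo, 0.0) for _ in range(1000)]
random_picks = [epsilon_greedy(a_demo, Q_demo, 1.0) for _ in range(3000)]

greedy_counts = np.bincount(greedy_picks, minlength=3)
random_counts = np.bincount(random_picks, minlength=3)
print(greedy_counts)   # every pick lands on lever 2
print(random_counts)   # roughly a third of the picks on each lever
```

Note that `np.argmax` returns the first index when several values are tied, so with the all-zero initial Q the greedy branch picks lever 0 until an estimate changes.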

Let's make our agent interact with the environment.

In [6]:
def play(a, Q, epsilon, N):
    """
    Make the agent interact with the environment once.
    a is the action space, Q the action values and N the number of times each lever was pulled.
    """
    action = agent(a, Q, epsilon)
    reward = list_machines[action].pull()
    N[action] += 1
    Q[action] += 1/N[action] * (reward - Q[action])

    return Q, N, reward
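
The update `Q[action] += 1/N[action] * (reward - Q[action])` is an incremental way of computing the average of all rewards seen so far for that lever, without storing them. The sketch below checks this on sample rewards drawn with the same parameters as `mach_3`; the names `q`, `n` and `rewards` are introduced here only for illustration.

```python
import numpy as np

np.random.seed(1)                            # fixed seed, only for reproducibility
rewards = np.random.normal(3, 5, size=50)    # 50 sample rewards, same parameters as mach_3

q, n = 0.0, 0
for r in rewards:
    n += 1
    q += (r - q) / n                         # same update rule as in play()

# The running estimate matches the plain average of the rewards
# (up to floating-point rounding).
print(q, rewards.mean())
```

Because the step size is 1/N, each new reward weighs less than the previous one, which is appropriate here since the machines do not change over time.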

In [18]:
epsilon = 0.2
number_interactions = 1000

for i in range(number_interactions):
    Q, N, r = play(a, Q, epsilon, N)

We can now look at the values of Q.

In [19]:
print(Q)
# The values are close to the mean of each machine we defined earlier (1, 0 and 3), so our agent correctly learned which lever was the best to pull.

[0.8230317056615912, 0.07601606257128819, 3.0067971619273424]


We can also let epsilon decrease over time, as suggested in the video.

In [21]:
epsilon = 0.99 # initial value (we want to explore first)
decay = 0.95
number_interactions = 100

for i in range(number_interactions):
    Q, N, r = play(a, Q, epsilon, N)
    epsilon *= decay

print(epsilon) # final value

0.005861323928130653
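
Since epsilon is multiplied by the same factor at every step, its value after n steps is simply the initial value times decay to the power n. The short sketch below compares this closed form with the loop and also computes how many steps it takes for epsilon to fall below 0.1; the threshold and the variable names are chosen here for illustration.

```python
import math

epsilon_0 = 0.99
decay = 0.95
steps = 100

# Closed form: epsilon after `steps` multiplications by `decay`.
closed_form = epsilon_0 * decay ** steps

# Same value obtained by looping, as in the cell above.
looped = epsilon_0
for _ in range(steps):
    looped *= decay

# Smallest number of steps after which epsilon is below an illustrative threshold.
threshold = 0.1
steps_to_threshold = math.ceil(math.log(threshold / epsilon_0) / math.log(decay))

print(closed_form, looped)    # both approximately 0.00586
print(steps_to_threshold)
```

With these settings the agent explores heavily for only a few dozen steps, so a slower decay (closer to 1) may be worth trying when more interactions are available.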


Let's compare the performance obtained with different values of epsilon.

In [22]:
# Large value of epsilon: an agent that explores a lot, at the cost of rarely exploiting what it finds
epsilon = 0.8
number_interactions = 100
a = [0, 1, 2]
Q = [0] * len(a)
N = [0] * len(a)

total_reward = 0
for i in range(number_interactions):
    Q, N, r = play(a, Q, epsilon, N)
    total_reward += r
print("final reward for a mainly exploring agent, ie large epsilon : " + str(total_reward))


# Relatively small value of epsilon: an agent that mostly exploits and sometimes explores
epsilon = 0.2
number_interactions = 100
a = [0, 1, 2]
Q = [0] * len(a)
N = [0] * len(a)

total_reward = 0
for i in range(number_interactions):
    Q, N, r = play(a, Q, epsilon, N)
    total_reward += r
print("final reward for an agent that mainly exploit, but sometime explores : " + str(total_reward))

# Decreasing value: an agent that starts by exploring and then exploits
epsilon = 0.99 # initial value (we want to explore first)
decay = 0.95
number_interactions = 100
a = [0, 1, 2]
Q = [0] * len(a)
N = [0] * len(a)

total_reward = 0
for i in range(number_interactions):
    Q, N, r = play(a, Q, epsilon, N)
    total_reward += r
    epsilon *= decay
print("final reward for an agent that starts by exploring and then exploit : " + str(total_reward))

final reward for a mainly exploring agent, ie large epsilon : 115.78234113242597
final reward for an agent that mainly exploit, but sometime explores : 219.93919757223458
final reward for an agent that starts by exploring and then exploit : 265.9679862159174


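In this run, the best tradeoff is obtained with a decreasing epsilon. Keep in mind that this is a single run of 100 noisy pulls, so the totals change from one execution to the next.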
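
Because the rewards are noisy, one way to make this comparison less dependent on luck is to repeat each experiment many times and look at the average total reward. The following is a minimal sketch under a few assumptions: it re-implements the same three Gaussian machines and the same update rule in a self-contained function, it uses `np.random.default_rng` instead of the global `np.random` state used above, and the names `run_episode`, `MEANS`, `STDS` and `n_runs` are introduced here for illustration.

```python
import numpy as np

MEANS = np.array([1.0, 0.0, 3.0])    # same means as mach_1, mach_2, mach_3
STDS = np.array([2.0, 3.0, 5.0])     # same standard deviations

def run_episode(rng, epsilon, decay=1.0, steps=100):
    """Play one epsilon-greedy episode from scratch and return the total reward."""
    Q = np.zeros(3)
    N = np.zeros(3, dtype=int)
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = int(rng.integers(3))    # explore: uniform random lever
        else:
            action = int(np.argmax(Q))       # exploit: best current estimate
        reward = rng.normal(MEANS[action], STDS[action])
        N[action] += 1
        Q[action] += (reward - Q[action]) / N[action]
        total += reward
        epsilon *= decay                     # decay = 1.0 keeps epsilon constant
    return total

rng = np.random.default_rng(0)               # fixed seed, only for reproducibility
n_runs = 500                                 # arbitrary number of repetitions
settings = {
    "epsilon = 0.8": (0.8, 1.0),
    "epsilon = 0.2": (0.2, 1.0),
    "epsilon = 0.99, decay = 0.95": (0.99, 0.95),
}

results = {}
for name, (eps, dec) in settings.items():
    totals = np.array([run_episode(rng, eps, dec) for _ in range(n_runs)])
    results[name] = (totals.mean(), totals.std())
    print(f"{name}: mean {totals.mean():.1f}, std {totals.std():.1f}")
```

The standard deviation printed next to each mean gives an idea of how much a single run, like the one above, can deviate from the average.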