## The multi-armed bandit
In this workbook, I follow the tutorial by Arthur Juliani to build a policy-gradient-based agent that can solve the multi-armed bandit problem.

In [1]:
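# Written against the TensorFlow 1.x API (tf.placeholder, tf.Session). On TensorFlow 2.x,
# use `import tensorflow.compat.v1 as tf` followed by `tf.disable_v2_behavior()` instead.
import tensorflow as tf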
import numpy as np

## The Bandits
We define our bandits. For this example we use a four-armed bandit. The pullBandit function draws a random number from a standard normal distribution (mean 0) and returns a positive reward if that number exceeds the bandit's value. The lower the bandit's value, the more likely a positive reward will be returned. We want our agent to learn to always choose the bandit that is most likely to give that positive reward.

In [5]:
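# Define the bandits. Each value is a threshold: the lower it is, the more likely a pull pays off.
bandits = [0.2, 0, -0.2, -5]
num_bandits = len(bandits)

# Pull a bandit: draw from a standard normal and compare against the bandit's threshold.
def pullBandit(bandit):
    # Draw a single sample from N(0, 1)
    result = np.random.randn(1)
    if result > bandit:
        # Return a positive reward
        return 1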
    else:
        return -1
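
To make the claim about thresholds concrete, the sketch below computes the probability of a positive reward for each bandit directly, assuming only that a pull wins when a standard normal draw exceeds the threshold, as in pullBandit. The helper names `win_probability` and `expected_reward` are hypothetical and are not part of the tutorial.

```python
import math

# Thresholds copied from the bandit definition above.
bandits = [0.2, 0, -0.2, -5]

def win_probability(threshold):
    """P(Z > threshold) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(threshold / math.sqrt(2))

def expected_reward(threshold):
    """Expected value of a pull that pays +1 on a win and -1 otherwise."""
    p = win_probability(threshold)
    return 2 * p - 1

for index, threshold in enumerate(bandits, start=1):
    print("Bandit %d: P(win) = %.4f, expected reward = %+.4f"
          % (index, win_probability(threshold), expected_reward(threshold)))
```

Under these assumptions the fourth bandit (threshold -5) wins almost every pull, the second is a coin flip, and the first loses slightly more often than it wins, so the fourth is the one the agent should learn to prefer.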
    

## The Agent
In the following section, we will build a simple neural agent. It consists of one value per bandit. Each value is an estimate of the return from choosing that bandit. We use a policy gradient method to update the agent by moving the value for the selected action toward the received reward.

In [14]:
tf.reset_default_graph()

# One weight per bandit; the greedy action is the bandit with the largest weight
weights = tf.Variable(tf.ones([num_bandits]))
chosen_action = tf.argmax(weights, 0)

# Placeholders for the reward received and the action taken
reward_holder = tf.placeholder(shape=[1], dtype=tf.float32)
action_holder = tf.placeholder(shape=[1], dtype=tf.int32)

# Pick out the weight of the chosen action and build the policy-gradient loss
responsible_weight = tf.slice(weights, action_holder, [1])
loss = -(tf.log(responsible_weight) * reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
update = optimizer.minimize(loss)
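
As a rough illustration of what a single update does, the sketch below applies the same loss by hand in NumPy. It assumes plain gradient descent on loss = -log(w_a) * r, whose gradient with respect to the chosen weight is -r / w_a, so a positive reward raises that weight and a negative reward lowers it. The function name `policy_gradient_step` is hypothetical, and the sketch uses NumPy's default float64 rather than TensorFlow's float32.

```python
import numpy as np

def policy_gradient_step(weights, action, reward, learning_rate=0.001):
    """Apply one gradient-descent step on loss = -log(weights[action]) * reward."""
    updated = weights.copy()
    # d(loss)/d(w_a) = -reward / w_a; all other weights have zero gradient.
    gradient = -reward / weights[action]
    updated[action] -= learning_rate * gradient
    return updated

weights = np.ones(4)
# A win on the fourth bandit nudges its weight up.
weights = policy_gradient_step(weights, action=3, reward=1)
print(weights)
# A loss on the first bandit nudges its weight down.
weights = policy_gradient_step(weights, action=0, reward=-1)
print(weights)
```

Only the weight of the action that was taken changes on each step, which is why exploration is needed for the other bandits to be evaluated at all.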

## Training the Agent

In [15]:
total_epoc = 1000 # Total number of epochs to train the agent on
total_reward = np.zeros(num_bandits) # Set the scoreboard for the bandits to 0
e = 0.1 # Chance of taking a random action

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    i = 0 
    while i < total_epoc:
        
        # Choose either a random action or one from the network
        if np.random.rand(1) < e:
            action = np.random.randint(num_bandits)
        else:
            action = sess.run(chosen_action)
        
        # Get reward from the selected bandits
        reward = pullBandit(bandits[action]) 
        
        # update the network
        _,resp, ww = sess.run([update,responsible_weight, weights], feed_dict={reward_holder: [reward], action_holder: [action]})
        
        # Update the reward 
        total_reward[action] += reward
        
        if i % 50 == 0:
            print ("Running reward for the " + str(num_bandits) + " bandits: " + str(total_reward))
        i+=1
print ("The agent thinks bandit " + str(np.argmax(ww)+1) + " is the most promising....")
if np.argmax(ww) == np.argmax(-np.array(bandits)):
    print ("...and it was right!")
else:
    print ("...and it was wrong!")
        

Running reward for the 4 bandits: [-1.  0.  0.  0.]
Running reward for the 4 bandits: [  0.  -2.  -1.  40.]
Running reward for the 4 bandits: [ -1.  -3.   0.  87.]
Running reward for the 4 bandits: [  -1.   -5.    1.  132.]
Running reward for the 4 bandits: [  -1.   -5.    2.  177.]
Running reward for the 4 bandits: [  -1.   -5.    3.  226.]
Running reward for the 4 bandits: [  -1.   -5.    4.  273.]
Running reward for the 4 bandits: [   1.   -5.    6.  317.]
Running reward for the 4 bandits: [   2.   -4.    6.  365.]
Running reward for the 4 bandits: [   2.   -5.    6.  410.]
Running reward for the 4 bandits: [   3.   -5.    7.  458.]
Running reward for the 4 bandits: [   2.   -5.    8.  506.]
Running reward for the 4 bandits: [   2.   -4.    9.  550.]
Running reward for the 4 bandits: [   2.   -4.   10.  597.]
Running reward for the 4 bandits: [   0.   -6.   10.  643.]
Running reward for the 4 bandits: [   0.   -3.   12.  686.]
Running reward for the 4 bandits: [   0.   -3.   12.  73