# $n$-armed bandit (RL 101)

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

**In _probability theory_ and _machine learning_, the multi-armed bandit problem (sometimes called $n$-armed bandit problem) is a problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice's properties are only partially known at the time of allocation, and may become better understood as time passes or by allocating resources to the choice.**

![image](https://miro.medium.com/max/1296/1*qh46a_qurQk30SCet6oWDg.gif)

**This is a classic reinforcement learning problem that exemplifies the exploration–exploitation tradeoff dilemma.**

- **exploration**: _execute a random action whose reward is still uncertain._
- **exploitation**: _execute the action that has given the best reward so far._
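
One common way to balance the two behaviours is an epsilon-greedy rule: explore with a small probability and exploit otherwise. The sketch below is only illustrative. The function name `epsilon_greedy` and the value estimates are hypothetical, and it uses the standard library only so that it stays self-contained.

```python
import random


def epsilon_greedy(estimates, epsilon, rng=random):
    """Pick an arm index: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        # Explore: choose any arm uniformly at random.
        return rng.randrange(len(estimates))
    # Exploit: choose the arm with the highest current estimate.
    return max(range(len(estimates)), key=lambda a: estimates[a])


# Hypothetical value estimates for four arms (illustrative numbers only).
estimates = [0.1, -0.2, 0.0, 0.9]
rng = random.Random(0)
choices = [epsilon_greedy(estimates, 0.1, rng) for _ in range(1000)]
greedy_share = choices.count(3) / len(choices)
print(f"Greedy arm chosen in {greedy_share:.0%} of 1000 selections")
```

With `epsilon = 0.1`, the greedy arm is also picked by some of the random explorations, so its share of pulls should sit slightly above 90%.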


In [20]:
from tensorflow.python.framework import ops
import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()


**Below we have a series of levers, each with an _associated value_ that determines how likely a pull is to pay out.**


In [37]:
bandits_lever = [2., 1., 0, -3]
num_bandits = len(bandits_lever)


**Below is a function that decides whether pulling a lever rewards you ($+1$) or penalizes you ($-1$). It draws a random number from a standard normal distribution and returns the reward when that number exceeds the _value of the bandit_, so the outcome is _stochastic_ and bandits with lower values have a higher chance of giving a reward.**


In [38]:
def pullBandit(bandit):
    # Draw a sample from the standard normal distribution (mean 0, variance 1).
    result = np.random.randn(1)
    if result > bandit:
        # return a positive reward.
        return 1
    else:
        # return a negative reward.
        return -1


# Preview ten raw draws from the standard normal distribution (no lever is pulled here).
for i in range(10):
    print(f'Value for the {i+1} pull is {np.random.randn(1)[0]}')


Value for the 1 pull is -0.21225560503673163
Value for the 2 pull is -0.6410835849018122
Value for the 3 pull is -0.45515288898808837
Value for the 4 pull is 0.17768401778407147
Value for the 5 pull is -0.988727195714711
Value for the 6 pull is -0.7139332953504195
Value for the 7 pull is 2.2838139900033383
Value for the 8 pull is -2.4487209308777436
Value for the 9 pull is 0.8393497460413712
Value for the 10 pull is -1.2287423559770971
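
Because `pullBandit` returns $+1$ whenever a standard normal draw exceeds the lever value $v$, the chance of a reward is $P(Z > v) = 1 - \Phi(v)$, and the expected reward is $2\,P(Z > v) - 1$. The sketch below computes both for the lever values used in this notebook. The helper names `success_probability` and `expected_reward` are hypothetical and not part of the original code.

```python
import math


def success_probability(lever_value):
    """P(Z > lever_value) for a standard normal Z, i.e. 1 - Phi(lever_value)."""
    return 0.5 * math.erfc(lever_value / math.sqrt(2))


def expected_reward(lever_value):
    """Expected reward when a pull pays +1 with probability p and -1 otherwise."""
    p = success_probability(lever_value)
    return 2 * p - 1


bandits_lever = [2.0, 1.0, 0.0, -3.0]  # same lever values as in the notebook
for index, value in enumerate(bandits_lever, start=1):
    print(f"Bandit {index}: P(+1) = {success_probability(value):.4f}, "
          f"E[reward] = {expected_reward(value):+.4f}")
```

The lower the lever value, the better the bandit: the fourth lever ($-3$) pays out on almost every pull, while the first lever ($2$) almost never does.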


Let's now build a simple _feed-forward network_ reduced to its bare minimum: one trainable weight per bandit, with the largest weight deciding _which bandit to pull_.

**Over `10,000 episodes`, the agent will explore and exploit the available bandits and, if successful, _converge on the bandit that suits it best_ (the one with the highest chance of giving a reward).**


In [39]:

weights = tf.Variable(tf.ones([num_bandits]))
chosen_action = tf.argmax(weights, 0)

reward_holder = tf.placeholder(shape=[1], dtype=tf.float32)
action_holder = tf.placeholder(shape=[1], dtype=tf.int32)
responsible_weight = tf.slice(weights, action_holder, [1])
loss = -(tf.log(responsible_weight)*reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
update = optimizer.minimize(loss)
total_episodes = 10000
total_reward = np.zeros(num_bandits)
e = 0.1  # epsilon refers to the probability of choosing to explore

init = tf.global_variables_initializer()

with tf.Session() as sech:
    sech.run(init)
    i = 0
    while i < total_episodes:
        if np.random.rand(1) < e:
            # Explore
            action = np.random.randint(num_bandits)
        else:
            # Exploit
            action = sech.run(chosen_action)

        # Get reward
        reward = pullBandit(bandits_lever[action])

        # Update the network.
        _, resp, ww = sech.run([update, responsible_weight, weights], feed_dict={
                               reward_holder: [reward], action_holder: [action]})

        # Update the score board
        total_reward[action] += reward
        if i % 100 == 0:
            print(
                f'Running reward for the {num_bandits} bandit: {np.rint(total_reward)}')
        i += 1
print(f'The agent thinks bandit {np.argmax(ww)+1} is the most promising....')

if np.argmax(ww) == np.argmax(-np.array(bandits_lever)):
    print('...and it was right!')
else:
    print('...and it was wrong!')


Running reward for the 4 bandit: [-1.  0.  0.  0.]
Running reward for the 4 bandit: [ 0. -3. -1. 83.]
Running reward for the 4 bandit: [ -3.  -6.  -2. 176.]
Running reward for the 4 bandit: [ -4. -11.  -3. 267.]
Running reward for the 4 bandit: [ -8. -14.  -2. 359.]
Running reward for the 4 bandit: [ -9. -17.  -2. 453.]
Running reward for the 4 bandit: [-11. -18.  -3. 541.]
Running reward for the 4 bandit: [-14. -20.  -1. 634.]
Running reward for the 4 bandit: [-17. -21.  -1. 730.]
Running reward for the 4 bandit: [-18. -23.   1. 821.]
Running reward for the 4 bandit: [-19. -25.   1. 916.]
Running reward for the 4 bandit: [ -20.  -26.    4. 1009.]
Running reward for the 4 bandit: [ -22.  -25.    6. 1102.]
Running reward for the 4 bandit: [-2.300e+01 -2.800e+01  1.000e+00  1.193e+03]
Running reward for the 4 bandit: [ -28.  -29.    3. 1283.]
Running reward for the 4 bandit: [ -31.  -29.    5. 1372.]
Running reward for the 4 bandit: [ -32.  -29.    3. 1467.]
Running reward for the 4 band
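
To see what the training step does, note that the loss for the chosen action $a$ is $-\log(w_a)\,r$, whose gradient with respect to $w_a$ is $-r / w_a$. A plain gradient descent step with learning rate $\alpha$ therefore becomes $w_a \leftarrow w_a + \alpha\, r / w_a$. The sketch below applies that update directly, without TensorFlow. It is a simplified stand-in rather than the notebook's implementation: the names `pull_bandit` and `train` are hypothetical, and the random draws come from Python's `random` module instead of NumPy.

```python
import random


def pull_bandit(lever_value, rng):
    """Return +1 if a standard normal draw exceeds the lever value, else -1."""
    return 1 if rng.gauss(0.0, 1.0) > lever_value else -1


def train(lever_values, episodes=10000, epsilon=0.1, learning_rate=0.001, seed=0):
    """Epsilon-greedy agent with one weight per bandit (hypothetical helper)."""
    rng = random.Random(seed)
    weights = [1.0] * len(lever_values)
    for _ in range(episodes):
        if rng.random() < epsilon:
            # Explore: pull a random bandit.
            action = rng.randrange(len(weights))
        else:
            # Exploit: pull the bandit with the largest weight.
            action = max(range(len(weights)), key=lambda a: weights[a])
        reward = pull_bandit(lever_values[action], rng)
        # The gradient of -log(w_a) * r with respect to w_a is -r / w_a,
        # so a gradient descent step adds learning_rate * r / w_a.
        weights[action] += learning_rate * reward / weights[action]
    return weights


lever_values = [2.0, 1.0, 0.0, -3.0]  # same lever values as in the notebook
weights = train(lever_values)
best = max(range(len(weights)), key=lambda a: weights[a])
print(f"Learned weights: {[round(w, 3) for w in weights]}")
print(f"Most promising bandit: {best + 1}")
```

Only the weight of the pulled bandit changes in each episode: a reward of $+1$ nudges it up and a reward of $-1$ nudges it down, which is why the weight of the most generous bandit ends up the largest.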

