<a href="https://colab.research.google.com/github/DaehanKim/reinforcement_learning_tutorial/blob/master/Thompson_Sampling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Thompson Sampling

This notebook solves the multi-armed bandit problem using Thompson sampling.
When arm $i$ pays a Bernoulli reward, the method places a Beta prior $\mathrm{Beta}(\alpha_i, \beta_i)$ on the reward mean $\mu_i$. Because the Beta distribution is conjugate to the Bernoulli likelihood, the posterior of $\mu_i$ is again a Beta distribution, and it sharpens as the number of trials on that arm increases. At each step, one value $\hat{\mu}_i$ is sampled from each arm's current $\mathrm{Beta}(\alpha_i, \beta_i)$, and the arm with the largest sample is played.  
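
The two steps described above, sampling and the posterior update, can be sketched in isolation. After a reward $r \in \{0, 1\}$ on arm $i$, the conjugate update is $\alpha_i \leftarrow \alpha_i + r$ and $\beta_i \leftarrow \beta_i + (1 - r)$. The helper names `select_arm`, `posterior_update`, and `beta_variance`, the three-arm setup, and the seed are illustrative assumptions and are not part of the notebook's code.

```python
import numpy as np

def select_arm(alpha, beta, rng):
    """Draw one sample per arm from Beta(alpha_i, beta_i); return the index of the largest."""
    samples = rng.beta(alpha, beta)
    return int(samples.argmax())

def posterior_update(alpha, beta, arm, reward):
    """Return new Beta parameters after observing a 0/1 reward on `arm`."""
    alpha, beta = alpha.copy(), beta.copy()
    alpha[arm] += reward       # a success increments alpha
    beta[arm] += 1 - reward    # a failure increments beta
    return alpha, beta

def beta_variance(a, b):
    """Variance of Beta(a, b); it shrinks as observations accumulate."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
alpha = np.full(3, 2.0)          # Beta(2, 2) prior on each of 3 hypothetical arms
beta = np.full(3, 2.0)

arm = select_arm(alpha, beta, rng)
new_alpha, new_beta = posterior_update(alpha, beta, arm, reward=1)

posterior_mean = new_alpha / (new_alpha + new_beta)
var_before = beta_variance(alpha[arm], beta[arm])
var_after = beta_variance(new_alpha[arm], new_beta[arm])
```

With a single observed success, the pulled arm moves from $\mathrm{Beta}(2, 2)$ to $\mathrm{Beta}(3, 2)$: its posterior mean rises from 0.5 to 0.6 and its variance drops from 0.05 to 0.04, a small instance of the posterior sharpening.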

In [0]:
import numpy as np

In [6]:
#configs
NUM_BANDIT = 10
NUM_TRIAL = 100
LOG_INT = 10

# set initial alpha and beta for each arm
alpha = np.full(NUM_BANDIT,2)
beta = np.full(NUM_BANDIT,2)

# initialize arm probabilities
bandit = np.random.uniform(0,1,(NUM_BANDIT,))
n_trial = np.zeros(NUM_BANDIT)

for _iter in range(NUM_TRIAL):
    # choose an arm: draw one sample per arm from its Beta posterior and play the largest
    estimated_means = np.random.beta(alpha, beta)
    choice = estimated_means.argmax()
    # pull the chosen arm and observe a Bernoulli (0/1) reward
    reward = np.random.binomial(1, bandit[choice])
    
    # update posterior parameter for mu_i
    alpha[choice] += reward
    beta[choice] += (1-reward)
    n_trial[choice] += 1

    # log this iteration's posterior samples ("mu") and the pull count of each arm
    if (_iter + 1) % LOG_INT == 0:
        print('[Iter {}] mu={} / num_trial={}'.format(_iter + 1, estimated_means, n_trial))

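# estimate the best arm from the posterior means alpha / (alpha + beta), not from the last random draw
print("estimated optimal strategy : {}".format((alpha / (alpha + beta)).argmax()))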
print("actual optimal strategy : {}".format(bandit.argmax()))
print("actual arm probabilities : {}".format(bandit))


[Iter 10] mu=[0.43864304 0.81086046 0.30966778 0.60480758 0.24216787 0.06146544
 0.48650597 0.26786838 0.72462738 0.51408369] / num_trial=[0. 1. 4. 0. 0. 0. 0. 0. 3. 2.]
[Iter 20] mu=[0.75353077 0.46049954 0.33366378 0.51690514 0.216198   0.48146588
 0.43621068 0.46571967 0.55206524 0.64314623] / num_trial=[3. 1. 4. 1. 0. 0. 1. 1. 4. 5.]
[Iter 30] mu=[0.27491467 0.82304714 0.5034348  0.26162214 0.69903014 0.56328133
 0.17012965 0.48437448 0.52865009 0.81940935] / num_trial=[ 3.  2.  4.  1.  0.  3.  1.  1.  5. 10.]
[Iter 40] mu=[0.49661653 0.43223508 0.19323368 0.40243233 0.50081837 0.857376
 0.40524399 0.55352872 0.63944349 0.68939469] / num_trial=[ 3.  2.  4.  1.  1.  4.  1.  1.  6. 17.]
[Iter 50] mu=[0.17487334 0.59775449 0.5303887  0.6845986  0.45424232 0.45778087
 0.29316937 0.40691615 0.48164875 0.79718403] / num_trial=[ 3.  2.  4.  1.  1.  4.  2.  1.  7. 25.]
[Iter 60] mu=[0.23999114 0.15131804 0.59547387 0.36717146 0.31685338 0.49286082
 0.43797097 0.19570235 0.79209681 0.929649