### Thompson (posterior) sampling

The goal in MAB problems is to estimate the parameter(s) of the reward distribution for each arm (that is, for each ad to display in the preceding example). Measuring how uncertain we are about each estimate is also a good way to guide exploration. This problem fits naturally into the Bayesian inference framework, which is what **Thompson sampling** leverages. Bayesian inference starts with a prior probability distribution (an initial belief) for the parameter omega and updates it as data becomes available. Here, omega refers to the mean and variance for a normal distribution, and to the probability of observing a 1 for a Bernoulli distribution. The Bayesian approach therefore treats the parameter as a random variable and describes it by its posterior distribution given the data, which Bayes' rule obtains by combining the prior with the likelihood of the observed data. The formula for this is given by the following:

![](img/ts1.png)

### Application to the online advertising scenario

![](img/ts2.png)
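
The simulation in this section keeps a Beta(α, β) belief about each ad's click probability, starting from Beta(1, 1) and adding each click to α and each non-click to β. A minimal sketch of that update for a single ad follows. The function name `update_beta` and the sample click sequence are illustrative assumptions, not part of the original notebook.

```python
def update_beta(alpha, beta, reward):
    """Return the Beta posterior parameters after one Bernoulli reward (0 or 1)."""
    return alpha + reward, beta + (1 - reward)

# Start from the uniform prior Beta(1, 1)
alpha, beta = 1, 1

# Hypothetical observations: 1 = click, 0 = no click
for reward in [0, 0, 1, 0]:
    alpha, beta = update_beta(alpha, beta, reward)

# The mean of a Beta(alpha, beta) distribution is alpha / (alpha + beta)
posterior_mean = alpha / (alpha + beta)
print(alpha, beta, posterior_mean)
```

One click and three non-clicks move the prior to Beta(2, 4), whose mean is 1/3. As more impressions accumulate, the distribution narrows around the observed click rate, which is what lets the sampling step explore less over time.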

In [1]:
import numpy as np

In [2]:
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

In [3]:
# Model of a single ad: each display yields a click (1) with probability p
class BernoulliBandit(object):
    def __init__(self, p):
        self.p = p  # true click probability, unknown to the learner

    def display_ad(self):
        # Draw the reward from a Bernoulli(p) distribution
        reward = np.random.binomial(n=1, p=self.p)
        return reward

In [4]:
adA = BernoulliBandit(0.004)
adB = BernoulliBandit(0.016)
adC = BernoulliBandit(0.025)
adD = BernoulliBandit(0.035)
adE = BernoulliBandit(0.028)
ads = [adA, adB, adC, adD, adE]

In [10]:
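n_prod = 100000  # number of impressions to simulate
n_ads = len(ads)
# Beta(1, 1) (uniform) prior on each ad's click probability
alphas = np.ones(n_ads)
betas = np.ones(n_ads)
total_reward = 0
avg_rewards = []  # running average reward after each impression
avg_reward_so_far = 0.0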

In [11]:
import pandas as pd
df_reward_comparison = pd.DataFrame()

In [12]:
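for i in range(n_prod):
    # Draw one sample from each ad's Beta posterior
    theta_samples = [np.random.beta(alphas[k], betas[k])
                     for k in range(n_ads)]
    # Display the ad whose sampled click probability is highest
    ad_chosen = np.argmax(theta_samples)
    R = ads[ad_chosen].display_ad()
    # Update the posterior: a click increments alpha, a non-click increments beta
    alphas[ad_chosen] += R
    betas[ad_chosen] += 1 - R
    total_reward += R
    avg_reward_so_far = total_reward / (i + 1)
    avg_rewards.append(avg_reward_so_far)
df_reward_comparison['Thompson Sampling'] = avg_rewards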

In [13]:
df_reward_comparison['Thompson Sampling'].iplot(
    title="Thompson Sampling Avg. Reward: {:.4f}".format(avg_reward_so_far),
    xTitle='Impressions',
    yTitle='Avg. Reward')
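
The average reward curve does not show how the impressions were split across ads. The sketch below is a self-contained variant of the loop above that recovers the per-ad display counts and posterior means from the final Beta parameters. The helper `run_thompson`, the fixed seed, and the shorter run of 20,000 impressions are assumptions made for illustration; the notebook itself does not define them.

```python
import numpy as np

def run_thompson(true_ps, n_steps, seed=0):
    """Beta-Bernoulli Thompson sampling; returns the final alphas and betas."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_ps)
    alphas = np.ones(n_arms)  # Beta(1, 1) prior for every arm
    betas = np.ones(n_arms)
    for _ in range(n_steps):
        # One posterior sample per arm, then play the arm with the largest sample
        samples = rng.beta(alphas, betas)
        arm = int(np.argmax(samples))
        reward = rng.binomial(1, true_ps[arm])
        alphas[arm] += reward
        betas[arm] += 1 - reward
    return alphas, betas

true_ps = [0.004, 0.016, 0.025, 0.035, 0.028]  # same click rates as ads A to E
alphas, betas = run_thompson(true_ps, n_steps=20000)

# Each display adds exactly 1 to alpha + beta, so subtract the Beta(1, 1) prior
pulls = (alphas + betas - 2).astype(int)
posterior_means = alphas / (alphas + betas)
for name, n, m in zip("ABCDE", pulls, posterior_means):
    print(f"Ad {name}: shown {n} times, posterior mean {m:.4f}")
```

In a run like this, the ads with the highest click rates typically receive most of the impressions, while the posterior means of rarely shown ads stay rough estimates. The exact split depends on the seed and the run length.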

### Tip
Thompson sampling is a very competitive approach with one major advantage over the ε-greedy and UCB approaches: it did not require us to do any hyperparameter tuning. In practice, this has the following benefits:
• Saves significant time that would otherwise be spent on hyperparameter tuning.
• Saves significant money that would otherwise be lost to ineffective exploration and poorly chosen hyperparameters in other approaches.