We run a website and have 5 different banners for the same AD. We need to know which banner attracts users most.

We model this as a multi-armed bandit problem in which the 5 banners are the 5 arms of the bandit. An arm awards 1 point if the user clicks the AD and 0 points if the user does not.

In conventional A/B testing, we would fully explore all 5 banners before deciding which one is best, which costs more time and resources.

Instead, we will balance exploration and exploitation using the Thompson Sampling strategy.
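As a rough illustration of the idea, the sketch below keeps a click count and a non-click count per arm, draws one sample from each arm's beta distribution, and plays the arm with the largest sample. The names `choose_arm` and `run_thompson`, the conversion rates, and the use of `np.random.default_rng` are assumptions made for this example only.

```python
import numpy as np

def choose_arm(successes, failures, rng):
    """Sample once from each arm's beta distribution and return the best arm."""
    # Adding 1 to each count corresponds to starting from a uniform Beta(1, 1) prior
    samples = rng.beta(successes + 1, failures + 1)
    return int(np.argmax(samples))

def run_thompson(true_rates, n_rounds, seed=0):
    """Simulate Thompson Sampling against arms with the given click rates."""
    rng = np.random.default_rng(seed)
    successes = np.zeros(len(true_rates))
    failures = np.zeros(len(true_rates))
    for _ in range(n_rounds):
        arm = choose_arm(successes, failures, rng)
        # Simulate whether the user clicks the chosen arm
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures

# Hypothetical conversion rates, chosen only for illustration
rates = np.array([0.02, 0.20, 0.05])
successes, failures = run_thompson(rates, n_rounds=5000)
pulls = successes + failures
print('Times each arm was played:', pulls)
```

Arms that look poor are sampled less and less often, so most rounds should end up on the arm with the highest rate without a separate exploration phase.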

In [1]:
import pandas as pd
import numpy as np

The probability that a user clicks a given AD (its conversion rate) is unknown and varies from AD to AD. The goal is to identify the AD with the highest conversion rate as quickly as possible.

#### Define Environment

In [2]:
#Define the total number of rounds (each round shows one AD to one user, who may or may not click)
number_of_clicks = 10000

#define the total number of ADs
number_of_ADs = 5

#Define arrays to keep track of clicks (positive rewards) and non-clicks (negative rewards) for each AD
#These counts, plus 1, are used as the shape parameters of each AD's beta distribution
number_of_positive_rewards = np.zeros(number_of_ADs)
number_of_negative_rewards = np.zeros(number_of_ADs)

#define a seed for the random number generator (to ensure that results are reproducible)
np.random.seed(12)

#create a random conversion rate between 1% and 15% for each AD
conversion_rates = np.random.uniform(0.01, 0.15, number_of_ADs)

#Show conversion rates for each AD
#Remember that in a real-world scenario the decision-maker would not know this information!
for i in range(number_of_ADs):
  print('Conversion rate for AD {0}: {1:.2%}'.format(i, conversion_rates[i]))

Conversion rate for AD 0: 3.16%
Conversion rate for AD 1: 11.36%
Conversion rate for AD 2: 4.69%
Conversion rate for AD 3: 8.47%
Conversion rate for AD 4: 1.20%


#### Create the Data Set

In [3]:
np.random.seed(12)

#The data set is a matrix with one row for each click, and one column for each AD
#awards 1 point if user clicks the AD and awards 0 if user does not. 

outcomes = np.zeros((number_of_clicks, number_of_ADs)) 
for click_index in range(number_of_clicks): #for each click
    for ADs_index in range(number_of_ADs): #for each AD
        #Get a random number between 0.0 and 1.0.
        #If the random number is less than or equal to this AD's conversion rate, then set the outcome to "1".
        #Otherwise, the outcome will be "0" because the entire matrix was initially filled with zeros.
        if np.random.rand() <= conversion_rates[ADs_index]:
            outcomes[click_index][ADs_index] = 1

In [4]:
#display data
print(outcomes[0:10, 0:5]) 

[[0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]


In [5]:
#show conversion rates for each column
for i in range(5):
  print('Mean for column {0}: {1:.2%}'.format(i, np.mean(outcomes[:, i])))

Mean for column 0: 3.28%
Mean for column 1: 11.05%
Mean for column 2: 4.57%
Mean for column 3: 8.22%
Mean for column 4: 1.16%


The column means are approximately equal to the defined conversion rates.
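One way to judge "approximately equal" is to express each difference in units of the standard error of a sample proportion, sqrt(p(1 - p) / n). The sketch below does this with the rounded values printed above; the variable names are introduced here for illustration.

```python
import numpy as np

# Values copied from the printed output above (rounded to two decimals)
defined_rates = np.array([0.0316, 0.1136, 0.0469, 0.0847, 0.0120])
column_means = np.array([0.0328, 0.1105, 0.0457, 0.0822, 0.0116])
n_rows = 10000  # number of rows in the outcomes matrix

# Standard error of a sample proportion: sqrt(p * (1 - p) / n)
standard_errors = np.sqrt(defined_rates * (1 - defined_rates) / n_rows)

# Distance between each column mean and its defined rate, in standard errors
z_scores = (column_means - defined_rates) / standard_errors

for i, z in enumerate(z_scores):
    print('Column {0}: z = {1:+.2f}'.format(i, z))
```

Every column mean lies within about one standard error of its defined rate, which is consistent with ordinary sampling noise.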

#### Run the Simulation using Thompson sampling exploration strategy

In [6]:
#for each click
for click_index in range(number_of_clicks):
    index_of_AD_to_click = -1
    max_beta = -1

    #determine which AD to display in this round
    for ADs_index in range(number_of_ADs): 
        #Define the shape parameters for the beta distribution
        a = number_of_positive_rewards[ADs_index] + 1
        b = number_of_negative_rewards[ADs_index] + 1

        #Get a random value from the beta distribution whose shape is defined by the number of
        #clicks and non-clicks that have thus far been observed for this AD
        random_beta = np.random.beta(a, b)

        #if this is the largest beta value thus far observed for this iteration
        if random_beta > max_beta:
            max_beta = random_beta #update the maximum beta value thus far observed
            index_of_AD_to_click = ADs_index #select this AD for the current round
    
    #display the selected AD, and record whether the user clicked it or not
    if outcomes[click_index][index_of_AD_to_click] == 1:
        number_of_positive_rewards[index_of_AD_to_click] += 1
    else:
        number_of_negative_rewards[index_of_AD_to_click] += 1


In [7]:
#compute and display the total number of times each AD was selected
number_of_times_clicked = number_of_positive_rewards + number_of_negative_rewards 
for ADs_index in range(number_of_ADs): #for each AD
    print('AD {0} was selected {1} times'.format(ADs_index, number_of_times_clicked[ADs_index]))

AD 0 was selected 60.0 times
AD 1 was selected 9536.0 times
AD 2 was selected 95.0 times
AD 3 was selected 235.0 times
AD 4 was selected 74.0 times


The counts show that the algorithm quickly learned how the reward varies from AD to AD. It concentrated on the best option, AD 1, which was chosen in more than 95% of the 10,000 rounds, and this is what keeps the cumulative reward high.
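To put a rough number on this, the sketch below compares the expected number of clicks under the allocation reported above with a hypothetical baseline that shows every AD equally often, as a full-exploration A/B test would. It uses the true conversion rates printed earlier, which a real decision-maker would not know, and all variable names are introduced here for illustration.

```python
import numpy as np

# Values copied from the printed output above
conversion_rates = np.array([0.0316, 0.1136, 0.0469, 0.0847, 0.0120])
thompson_counts = np.array([60, 9536, 95, 235, 74])
total_rounds = thompson_counts.sum()

# Hypothetical baseline: every AD is shown the same number of times
uniform_counts = np.full(len(conversion_rates), total_rounds / len(conversion_rates))

# Expected clicks = sum over ADs of (times shown * conversion rate)
expected_thompson = float(thompson_counts @ conversion_rates)
expected_uniform = float(uniform_counts @ conversion_rates)
expected_best_only = float(total_rounds * conversion_rates.max())

print('Expected clicks, Thompson allocation: {0:.0f}'.format(expected_thompson))
print('Expected clicks, uniform allocation:  {0:.0f}'.format(expected_uniform))
print('Expected clicks, best AD only:        {0:.0f}'.format(expected_best_only))
```

Under these assumptions the Thompson allocation yields roughly 1,110 expected clicks, compared with about 578 for the uniform split and 1,136 for always showing the best AD.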

In [10]:
#identify and display the best AD
print('\nOverall Conclusion: The best AD to display is AD No {}'.format(np.argmax(number_of_times_clicked)))


Overall Conclusion: The best AD to display is AD No 1
