## Contextual Bandits

Context is similar to state in a multi-step reinforcement learning (RL) problem, with
one key difference. In a multi-step RL problem, the action an agent takes affects the
states it is likely to visit in subsequent steps. In contextual bandit (CB) problems, however, the agent simply observes the context, makes a decision, and observes the
reward; its action has no effect on the next context.
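
The observe-decide-reward cycle can be sketched in a few lines. This is a minimal illustration, not the model used later: the context values, the ad names, the `CLICK_PROB` table, and the uniformly random placeholder policy are all assumptions made up for the example. The point to notice is that the next context is drawn without any reference to the previous action.

```python
import random

# Hypothetical click probabilities for (context, ad) pairs, for illustration only.
CLICK_PROB = {
    ("mobile", "ad_1"): 0.7, ("mobile", "ad_2"): 0.2,
    ("desktop", "ad_1"): 0.2, ("desktop", "ad_2"): 0.7,
}

def run_contextual_bandit(n_rounds, seed=0):
    rng = random.Random(seed)
    history = []
    for _ in range(n_rounds):
        # 1. Observe a context; it does not depend on earlier actions.
        context = rng.choice(["mobile", "desktop"])
        # 2. Make a decision (placeholder policy: uniform random).
        action = rng.choice(["ad_1", "ad_2"])
        # 3. Observe a 0/1 reward drawn from the click probability.
        reward = int(rng.random() < CLICK_PROB[(context, action)])
        history.append((context, action, reward))
    return history

history = run_contextual_bandit(5)
```

A learning agent would replace step 2 with a policy that uses the context and the rewards collected so far.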

While solving (contextual) multi-armed bandit problems, our goal is to learn, from our
observations, the action value of each arm (action), which we have denoted by Q(a).

Assume that we have two pieces of information about the user seeing the ad, namely the following:
• Device type (mobile or desktop)
• Location (domestic/US or international/non-US)

### Function approximations

Function approximations allow us to model the dynamics of a process from which we
have observed data, such as contexts and ad clicks. In the case study below, the context is composed of the two features above plus the user's age: x = [device, location, age].
The agent will learn five different Q functions, one per ad, each returning a value estimate for the given context.
At this point, we have a supervised machine learning problem to solve for each action. We
can use different models to obtain the Q functions, such as logistic regression or a neural
network (which actually allows us to use a single network that estimates values for all
actions).
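
As a rough sketch of the "one supervised model per action" idea, the snippet below fits a separate logistic regression for each ad and then picks the ad with the highest estimated value for a new context. Everything here is an assumption for illustration: the tiny logged dataset, the two-element context `[intercept, device]`, and the plain gradient-descent fit (`fit_logistic`) are not taken from the case study.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_steps=500):
    """Unregularized logistic regression fitted by plain gradient descent."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted click probabilities
        w -= lr * X.T @ (p - y) / len(y)   # gradient of the mean log-loss
    return w

def q_value(w, x):
    """Estimated action value: predicted click probability for context x."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

# Hypothetical logged data per ad: contexts are [intercept, device].
data = {
    "ad_1": ([[1, 0], [1, 0], [1, 1], [1, 1]], [0, 0, 1, 1]),
    "ad_2": ([[1, 0], [1, 0], [1, 1], [1, 1]], [1, 1, 0, 0]),
}

# One supervised learning problem, and hence one Q function, per action.
q_models = {ad: fit_logistic(X, y) for ad, (X, y) in data.items()}

context = [1, 1]  # a mobile user
estimates = {ad: q_value(w, context) for ad, w in q_models.items()}
best_ad = max(estimates, key=estimates.get)
```

In this toy data, `ad_1` is only clicked on mobile, so its model ends up with the higher estimate for the mobile context.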


### Case study – contextual online advertising with synthetic user data

Assume that the true user click behavior follows a logistic function:
![](img/img1.png)

Here, $p_a(x)$ is the probability of a user click when the context is $x$ and ad $a$ is shown. Also,
let's assume that device is 1 for mobile and 0 otherwise; and location is 1 for US and 0
otherwise. There are two important things to note here:
• This behavior, particularly the β parameters, is unknown to the advertiser, who
will try to uncover it.
• Note the $a$ superscript in $\beta_i^a$, which denotes that the impact of these factors on user behavior is potentially different for each ad.
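
For reference, the logistic model can be written out explicitly. The form below is inferred from the `logistic` method and the `[1, device, location, age]` context layout in the code that follows, so the ordering of the coefficients is an assumption rather than a transcription of the figure:

$$
p_a(x) = \frac{1}{1 + e^{-\left(\beta_0^a + \beta_1^a \cdot \text{device} + \beta_2^a \cdot \text{location} + \beta_3^a \cdot \text{age}\right)}}
$$

Here $\beta_0^a$ plays the role of the intercept for ad $a$.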

In [2]:
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from scipy import stats
import plotly.offline
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import cufflinks as cf
cf.go_offline()
cf.set_config_file(world_readable=True, theme='white')

In [3]:
# Simulates user dynamics
class UserGenerator(object):
    def __init__(self):
        self.beta = {}
        # true values β for agent to learn
        self.beta['A'] = np.array([-4, -0.1, -3, 0.1])
        self.beta['B'] = np.array([-6, -0.1, 1, 0.1])
        self.beta['C'] = np.array([2, 0.1, 1, -0.1])
        self.beta['D'] = np.array([4, 0.1, -3, -0.2])
        self.beta['E'] = np.array([-0.1, 0, 0.5, -0.01])
        self.context = None

    def logistic(self, beta, context):
        f = np.dot(beta, context)
        p = 1 / (1 + np.exp(-f))
        return p

    def display_ad(self, ad):
        if ad in ['A', 'B', 'C', 'D', 'E']:
            # generate probability of a click
            p = self.logistic(self.beta[ad], self.context)
            reward = np.random.binomial(n=1, p=p)
            return reward
        else:
            raise Exception('Unknown ad!')

    def generate_user_with_context(self):
        # 0: International, 1: U.S.
        location = np.random.binomial(n=1, p=0.6)
        # 0: Desktop, 1: Mobile
        device = np.random.binomial(n=1, p=0.8)
        # User age ranges from 10 to just under 70,
        # with a mean age of about 34
        age = 10 + int(np.random.beta(2, 3) * 60)
        # Add 1 to the context for the intercept
        self.context = [1, device, location, age]
        return self.context

#### Visualize the relationship between the context and the probability of a click associated with it.

In [4]:
def get_scatter(x, y, name, showlegend):
    dashmap = {'A': 'solid',
               'B': 'dot',
               'C': 'dash',
               'D': 'dashdot',
               'E': 'longdash'}
    s = go.Scatter(x=x,
                   y=y,
                   legendgroup=name,
                   showlegend=showlegend,
                   name=name,
                   line=dict(color='blue',
                             dash=dashmap[name]))
    return s

def visualize_bandits(ug):
    ad_list = 'ABCDE'
    ages = np.linspace(10, 70)
    fig = make_subplots(rows=2, cols=2,
                        subplot_titles=("Desktop, International",
                                        "Desktop, U.S.",
                                        "Mobile, International",
                                        "Mobile, U.S."))
    for device in [0, 1]:
        for loc in [0, 1]:
            showlegend = (device == 0) and (loc == 0)
            for ad in ad_list:
                # generate probabilities
                probs = [ug.logistic(ug.beta[ad], [1, device, loc, age]) for age in ages]
                fig.add_trace(get_scatter(ages, probs, ad, showlegend), row=device+1, col=loc+1)
    fig.update_layout(template="presentation")
    fig.show()

Display a comparison of the true ad click probabilities given the context.

In [5]:
ug = UserGenerator()
visualize_bandits(ug)

We should expect our algorithms to figure out, for example, to display ad E for users aged around 40, who connect from the US on a mobile device.
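
This expectation can be checked numerically. The sketch below copies the true β values from `UserGenerator` and evaluates the click probability of every ad for a 40-year-old US user on a mobile device; the helper name `click_probability` is introduced here only for the example.

```python
import numpy as np

# True coefficients copied from the UserGenerator class:
# [intercept, device, location, age]
beta = {
    "A": np.array([-4, -0.1, -3, 0.1]),
    "B": np.array([-6, -0.1, 1, 0.1]),
    "C": np.array([2, 0.1, 1, -0.1]),
    "D": np.array([4, 0.1, -3, -0.2]),
    "E": np.array([-0.1, 0, 0.5, -0.01]),
}

def click_probability(b, context):
    """Logistic click probability for coefficient vector b."""
    return 1 / (1 + np.exp(-np.dot(b, context)))

# Intercept = 1, device = 1 (mobile), location = 1 (US), age = 40
context = [1, 1, 1, 40]
probs = {ad: click_probability(b, context) for ad, b in beta.items()}
best_ad = max(probs, key=probs.get)
print(best_ad, round(probs[best_ad], 3))
```

For this context, ad E comes out on top with a click probability of about 0.5, ahead of C and B.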

Now, we have implemented a process to generate user clicks. Here is how the scenario
will flow:
1. We will generate a user and get the associated context using the generate_user_with_context method in the ug object.
2. A CB model will use the context to display one of the five ads: A, B, C, D, or E.
3. The chosen ad will be passed to the display_ad method in the ug object, giving
a reward of 1 (click) or 0 (no click).
4. The CB model will be trained based on the reward, and this cycle will go on.

### Function approximation with regularized logistic regression

We want our CB algorithms to observe the user responses to the ads, update the models
that estimate the action values (function approximations), and determine which ad to
display given the context, the action value estimates, and the exploration strategy.

Now, let's assume that, as subject-matter experts, we know that the CTR can be modeled
using logistic regression. We also mentioned that it is not practical to update the model after every single observation, so we prefer batch updates to our models.

In [6]:
class RegularizedLR(object):
    def __init__(self, name, alpha, rlambda, n_dim):
        self.name = name
        self.alpha = alpha
        self.rlambda = rlambda
        self.n_dim = n_dim
        self.m = np.zeros(n_dim)
        self.q = np.ones(n_dim) * rlambda
        self.w = self.get_sampled_weights()

    def get_sampled_weights(self):
        w = np.random.normal(self.m, self.alpha * self.q**(-1/2))
        return w

    def loss(self, w, *args):
        X, y = args
        n = len(y)
        regularizer = 0.5 * np.dot(self.q, (w - self.m)**2)
        pred_loss = sum([np.log(1 + np.exp(np.dot(w, X[j])))
                         - y[j] * np.dot(w, X[j]) for j in range(n)])
        return regularizer + pred_loss

    def fit(self, X, y):
        if y:
            X = np.array(X)
            y = np.array(y)
            minimization = minimize(self.loss,
                                    self.w,
                                    args=(X, y),
                                    method="L-BFGS-B",
                                    bounds=[(-10,10)]*3 + [(-1, 1)],
                                    options={'maxiter': 50})
            self.w = minimization.x
            self.m = self.w
            p = (1 + np.exp(-np.matmul(self.w, X.T)))**(-1)
            self.q = self.q + np.matmul(p * (1 - p), X**2)


    def calc_sigmoid(self, w, context):
        return 1 / (1 + np.exp(-np.dot(w, context)))

    def get_prediction(self, context):
        return self.calc_sigmoid(self.m, context)

    def sample_prediction(self, context):
        w = self.get_sampled_weights()
        return self.calc_sigmoid(w, context)

    def get_ucb(self, context):
        pred = self.calc_sigmoid(self.m, context)
        confidence = self.alpha * np.sqrt(np.sum(np.divide(np.array(context)**2, self.q)))
        ucb = pred + confidence
        return ucb
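
Reading the `loss`, `fit`, `get_sampled_weights`, and `get_ucb` methods above, the class appears to keep an independent Gaussian belief over each weight, with mean $m_i$ and precision $q_i$. The formulas below are a restatement of the code in math notation, not a derivation taken from the text; $\sigma$ denotes the sigmoid function and $\lambda$ the `rlambda` argument.

The objective minimized in `fit` over a batch of $n$ observations $(x_j, y_j)$:

$$
L(w) = \frac{1}{2} \sum_{i} q_i (w_i - m_i)^2 + \sum_{j=1}^{n} \left[ \log\left(1 + e^{w^\top x_j}\right) - y_j \, w^\top x_j \right]
$$

The update applied after the minimizer $w^*$ is found, starting from $m_i = 0$ and $q_i = \lambda$:

$$
m_i \leftarrow w_i^*, \qquad q_i \leftarrow q_i + \sum_{j=1}^{n} x_{ji}^2 \, p_j (1 - p_j), \qquad p_j = \sigma\left(w^{*\top} x_j\right)
$$

The two quantities used for exploration, for a context $x$:

$$
w_i \sim \mathcal{N}\left(m_i, \; \alpha^2 q_i^{-1}\right), \qquad \text{UCB}(x) = \sigma\left(m^\top x\right) + \alpha \sqrt{\sum_i \frac{x_i^2}{q_i}}
$$

The first sum in $L(w)$ is the regularizer: it penalizes moving away from the previous mean, more strongly for weights whose precision is already high.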

In [7]:
def calculate_regret(ug, context, ad_options, ad):
    action_values = {a: ug.logistic(ug.beta[a], context) for a in ad_options}
    best_action = max(action_values, key=action_values.get)
    regret = action_values[best_action] - action_values[ad]
    return regret, best_action
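
To make the regret definition concrete, here is a small standalone example. The click probabilities in `action_values` are rounded, illustrative numbers rather than output from the simulation, and `regret_of` is a hypothetical helper that mirrors the last three lines of `calculate_regret`.

```python
def regret_of(action_values, chosen):
    """Gap between the best achievable click probability and the chosen ad's."""
    best_action = max(action_values, key=action_values.get)
    return action_values[best_action] - action_values[chosen], best_action

# Illustrative true click probabilities for one context
action_values = {"A": 0.04, "B": 0.25, "C": 0.29, "D": 0.001, "E": 0.5}

regret, best = regret_of(action_values, "C")
print(best, round(regret, 2))
```

Showing ad C when E was the best choice costs roughly 0.21 in expected clicks for that impression; showing E would give a regret of zero.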

In [8]:
def select_ad_eps_greedy(ad_models, context, eps):
    if np.random.uniform() < eps:
        return np.random.choice(list(ad_models.keys()))
    else:
        predictions = {ad: ad_models[ad].get_prediction(context)
                       for ad in ad_models}
        max_value = max(predictions.values())
        max_keys = [key for key, value in predictions.items() if value == max_value]
        return np.random.choice(max_keys)

In [9]:
def select_ad_ucb(ad_models, context):
    ucbs = {ad: ad_models[ad].get_ucb(context)
            for ad in ad_models}
    max_value = max(ucbs.values())
    max_keys = [key for key, value in ucbs.items() if value == max_value]
    return np.random.choice(max_keys)

In [10]:
def select_ad_thompson(ad_models, context):
    samples = {ad: ad_models[ad].sample_prediction(context)
               for ad in ad_models}
    max_value = max(samples.values())
    max_keys = [key for key, value in samples.items() if value == max_value]
    return np.random.choice(max_keys)
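
The three selection functions above end with the same step: find the maximum score and break ties at random. One possible way to factor that out is sketched below. The helper name `argmax_random_tie` and its optional `rng` argument are assumptions for this example and do not appear in the original code.

```python
import numpy as np

def argmax_random_tie(scores, rng=None):
    """Return a key with the maximum score, breaking ties uniformly at random."""
    if rng is None:
        rng = np.random.default_rng()
    max_value = max(scores.values())
    max_keys = [key for key, value in scores.items() if value == max_value]
    # Index into the list so the original key object is returned unchanged.
    return max_keys[rng.integers(len(max_keys))]

rng = np.random.default_rng(0)
clear_winner = argmax_random_tie({"A": 0.1, "B": 0.9, "C": 0.3}, rng)
tied_winner = argmax_random_tie({"A": 0.5, "B": 0.5, "C": 0.2}, rng)
```

With such a helper, each strategy would only need to build its own score dictionary (predictions, UCBs, or Thompson samples) and pass it in.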

In [11]:
ad_options = ['A', 'B', 'C', 'D', 'E']
exploration_data = {}
data_columns = ['context',
                'ad',
                'click',
                'best_action',
                'regret',
                'total_regret']
exploration_strategies = ['eps-greedy',
                          'ucb',
                          'Thompson']
# Start comparisons
for strategy in exploration_strategies:
    print("--- Now using", strategy)
    np.random.seed(0)
    # Create the LR models for each ad
    alpha, rlambda, n_dim = 0.5, 0.5, 4
    ad_models = {ad: RegularizedLR(ad,
                                   alpha,
                                   rlambda,
                                   n_dim)
                 for ad in ad_options}
    # Initialize data structures
    X = {ad: [] for ad in ad_options}
    y = {ad: [] for ad in ad_options}
    results = []
    total_regret = 0
    # Start ad display
    for i in range(10**4):
        context = ug.generate_user_with_context()
        if strategy == 'eps-greedy':
            eps = 0.1
            ad = select_ad_eps_greedy(ad_models,
                                      context,
                                      eps)
        elif strategy == 'ucb':
            ad = select_ad_ucb(ad_models, context)
        elif strategy == 'Thompson':
            ad = select_ad_thompson(ad_models, context)
        # Display the selected ad
        click = ug.display_ad(ad)
        # Store the outcome
        X[ad].append(context)
        y[ad].append(click)
        regret, best_action = calculate_regret(ug,
                                               context,
                                               ad_options,
                                               ad)
        total_regret += regret
        results.append((context,
                        ad,
                        click,
                        best_action,
                        regret,
                        total_regret))
        # Update the models with the latest batch of data
        if (i + 1) % 500 == 0:
            print("Updating the models at i:", i + 1)
            for ad in ad_options:
                ad_models[ad].fit(X[ad], y[ad])
            X = {ad: [] for ad in ad_options}
            y = {ad: [] for ad in ad_options}

    exploration_data[strategy] = {'models': ad_models,
                                  'results': pd.DataFrame(results,
                                                          columns=data_columns)}

--- Now using eps-greedy
Updating the models at i: 500
Updating the models at i: 1000
Updating the models at i: 1500
Updating the models at i: 2000
Updating the models at i: 2500
Updating the models at i: 3000
Updating the models at i: 3500
Updating the models at i: 4000
Updating the models at i: 4500
Updating the models at i: 5000
Updating the models at i: 5500
Updating the models at i: 6000
Updating the models at i: 6500
Updating the models at i: 7000
Updating the models at i: 7500
Updating the models at i: 8000
Updating the models at i: 8500
Updating the models at i: 9000
Updating the models at i: 9500
Updating the models at i: 10000
--- Now using ucb
Updating the models at i: 500
Updating the models at i: 1000
Updating the models at i: 1500
Updating the models at i: 2000
Updating the models at i: 2500
Updating the models at i: 3000
Updating the models at i: 3500
Updating the models at i: 4000
Updating the models at i: 4500
Updating the models at i: 5000
Updating the models at i: 55

In [13]:
df_regret_comparisons = pd.DataFrame({s: exploration_data[s]['results'].total_regret
                                      for s in exploration_strategies})
df_regret_comparisons.iplot(dash=['solid', 'dash','dot'],
                            xTitle='Impressions',
                            yTitle='Total Regret',
                            color='black')

In [14]:
lrmodel = exploration_data['eps-greedy']['models']['A']
df_beta_dist = pd.DataFrame([], index=np.arange(-4,1,0.01))
mean = lrmodel.m
std_dev = lrmodel.q ** (-1/2)

for i in range(lrmodel.n_dim):
    df_beta_dist['beta_'+str(i)] = stats.norm(loc=mean[i],
                                              scale=std_dev[i]).pdf(df_beta_dist.index)

df_beta_dist.iplot(dash=['dashdot','dot', 'dash', 'solid'],
                   yTitle='p.d.f.',
                   color='black')

In [15]:
for strategy in exploration_strategies:
    print(strategy)
    print(exploration_data[strategy]['models']['A'].m)
    print(exploration_data[strategy]['models']['B'].m)
    print(exploration_data[strategy]['models']['C'].m)
    print(exploration_data[strategy]['models']['D'].m)
    print(exploration_data[strategy]['models']['E'].m)

eps-greedy
[-3.45309096e+00 -3.24042759e-05 -2.73454766e+00  8.71978990e-02]
[-4.37484916 -0.32911226  1.26291122  0.06662895]
[ 2.29740385 -0.03689923  0.65958304 -0.09760207]
[ 1.9312855   1.26675103 -1.23037211 -0.18098795]
[ 0.11500019  0.21199461  0.71413756 -0.02646982]
ucb
[-1.98976888 -0.17175764 -2.53660861  0.05434773]
[-3.20775696 -0.19249712  0.72871613  0.05175545]
[ 0.59513858  1.14982751  0.92977299 -0.08515699]
[ 1.67799648  0.14564158 -1.97870193 -0.0884413 ]
[ 0.33259942 -0.09021632  0.6469562  -0.02369065]
Thompson
[-3.18809870e+00  3.09403106e-03 -2.46485351e+00  8.07013997e-02]
[-2.58797508  0.04563881  0.52015994  0.04195978]
[ 0.56532719  0.54274278  1.27558543 -0.07233644]
[ 2.52724169  0.44143209 -1.75879742 -0.13798087]
[-0.25975699  0.21584097  0.33789847 -0.00868182]
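
One way to read this output is to compare the learned means against the true β values. The sketch below does so for ad A only, with the true coefficients copied from `UserGenerator` and the learned ones copied from the printout above; the variable names are introduced just for this example.

```python
import numpy as np

# True coefficients for ad A: [intercept, device, location, age]
true_beta_A = np.array([-4, -0.1, -3, 0.1])

# Learned means for ad A, copied from the printed output
learned_A = {
    "eps-greedy": np.array([-3.45309096e+00, -3.24042759e-05,
                            -2.73454766e+00, 8.71978990e-02]),
    "ucb": np.array([-1.98976888, -0.17175764, -2.53660861, 0.05434773]),
    "Thompson": np.array([-3.18809870e+00, 3.09403106e-03,
                          -2.46485351e+00, 8.07013997e-02]),
}

# Absolute error per coefficient for each exploration strategy
abs_errors = {s: np.abs(m - true_beta_A) for s, m in learned_A.items()}
for strategy, err in abs_errors.items():
    print(strategy, np.round(err, 3))
```

The same comparison can be repeated for ads B to E. Keep in mind that close agreement is not required for good decisions: what matters is that the models rank the ads correctly in the contexts that actually occur.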
