# Space Bandits Model as a Classifier

RMSE validation of a contextual bandits model is covered [here](validation.ipynb).<br>
Sometimes, we want to compare our contextual bandits model "apples to apples" with a binary classifier. It turns out that the sigmoid function gives us a convenient way to do just that.

## Toy Data
We use the same toy data as the [toy problem notebook](toy_problem.ipynb), a problem on which we know the bandits model converges.

In [1]:
import numpy as np
import pandas as pd
from random import random, randint
import matplotlib.pyplot as plt
import gc
%config InlineBackend.figure_format='retina'
##Generate Data

def get_customer(ctype=None):
    """Customers come from two feature distributions.
    Class 1: mean age 25, std 5 years, min age 18
             mean ARPU 100, std 15
    Class 2: mean age 45, std 6 years, min age 18
             mean ARPU 50, std 25
    """
    if ctype is None:
        if random() > .5: #coin toss
            ctype = 1
        else:
            ctype = 2
    age = 0
    ft = -1
    if ctype == 1:
        while age < 18:
            age = np.random.normal(25, 5)
        while ft < 0:
            ft = np.random.normal(100, 15)
    if ctype == 2:
        while age < 18:
            age = np.random.normal(45, 6)
        while ft < 0:
            ft = np.random.normal(50, 25)
    age = round(age)
    return ctype, (age, ft)

def get_rewards(customer):
    """
    There are three actions:
    promo 1: low value. 10 dollars if accepted
    promo 2: mid value. 25 dollars if accepted
    promo 3: high value. 100 dollars if accepted

    Both groups are unlikely to accept promo 2: 10% acceptance
    for an expected reward of 2.5 dollars.

    The optimal choice for group 1 is promo 1: 90% acceptance for
    an expected reward of 9 dollars.
    Group 2 accepts promo 1 at a 25% rate for an expected reward of 2.5 dollars.

    The optimal choice for group 2 is promo 3: 20% acceptance for
    an expected reward of 20 dollars.
    Group 1 accepts promo 3 at a 2% rate for an expected reward of 2 dollars.

    Promo 2 is never the best choice for either group.
    """
    if customer[0] == 1: #group 1 customer
        if random() > .1:
            reward1 = 10
        else:
            reward1 = 0
        if random() > .90:
            reward2 = 25
        else:
            reward2 = 0
        if random() > .98:
            reward3 = 100
        else:
            reward3 = 0
    if customer[0] == 2: #group 2 customer
        if random() > .75:
            reward1 = 10
        else:
            reward1 = 0
        if random() > .90:
            reward2 = 25
        else:
            reward2 = 0
        if random() > .80:
            reward3 = 100
        else:
            reward3 = 0
    return np.array([reward1, reward2, reward3])

def get_cust_reward():
    """Return a scaled context vector, (age, ARPU) / 100, and a reward vector."""
    cust = get_customer()
    reward = get_rewards(cust)
    context = cust[1]  # (age, ARPU) tuple
    return np.array([context])/100, reward

def generate_dataframe(n_rows):
    """Simulate n_rows customers, each offered one uniformly random promo."""
    ages = []
    ARPUs = []
    actions = []
    rewards = []
    for _ in range(n_rows):
        cust = get_customer()
        reward_vec = get_rewards(cust)
        context = np.array([cust[1]])
        ages.append(context[0, 0])
        ARPUs.append(context[0, 1])
        #log a random action and only the reward observed for that action
        action = np.random.randint(0, 3)
        actions.append(action)
        rewards.append(reward_vec[action])

    return pd.DataFrame({
        'age': ages,
        'ARPU': ARPUs,
        'action': actions,
        'reward': rewards,
    })

df = generate_dataframe(10000)
df.head()

| | age | ARPU | action | reward |
|---|---|---|---|---|
| 0 | 23.0 | 87.062566 | 2 | 0 |
| 1 | 41.0 | 52.586836 | 2 | 0 |
| 2 | 36.0 | 92.395865 | 1 | 0 |
| 3 | 31.0 | 90.414441 | 1 | 0 |
| 4 | 36.0 | 19.938699 | 1 | 0 |


We produce a dataset of 10,000 rows with randomly selected actions.

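
As a sanity check on the `get_rewards` docstring, the expected reward of each promo for each group is simply the acceptance probability times the payout. The sketch below recomputes these values with NumPy; the names `accept_prob` and `payout` are illustrative and do not appear in the notebook.

```python
import numpy as np

# Acceptance probabilities implied by the thresholds in get_rewards
# (rows: group 1, group 2; columns: promo 1, promo 2, promo 3).
accept_prob = np.array([
    [0.90, 0.10, 0.02],
    [0.25, 0.10, 0.20],
])
# Payout in dollars when a promo is accepted.
payout = np.array([10, 25, 100])

# Expected reward = P(accept) * payout, broadcast across groups.
expected_reward = accept_prob * payout
# 1-based number of the promo with the highest expected reward per group.
best_promo = expected_reward.argmax(axis=1) + 1

print(expected_reward)
print(best_promo)
```

Note that for group 2, promos 1 and 2 tie at 2.5 dollars, and for group 1, promo 3 (2 dollars) is slightly worse than promo 2 (2.5 dollars).
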
## Train/Validation Split
We split the data into two equally-sized groups.

In [2]:
train = df.sample(frac=.5).copy()
val = df[~df.index.isin(train.index)].copy()
num_actions = len(train.action.unique())
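
Because the split is random, the scores reported later will vary from run to run. A minimal sketch of a reproducible variant, assuming a fixed `random_state` is acceptable (the seed value and the small stand-in frame `demo_df` are arbitrary choices, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Small stand-in frame; in the notebook this would be `df`.
demo_df = pd.DataFrame({'age': np.arange(10), 'reward': np.zeros(10)})

# Fixing random_state makes the 50/50 split repeatable.
demo_train = demo_df.sample(frac=.5, random_state=0)
# drop() keeps every row whose index is not in the training half.
demo_val = demo_df.drop(demo_train.index)

# The halves are disjoint and together cover the whole frame.
print(len(demo_train), len(demo_val))
print(demo_train.index.intersection(demo_val.index).empty)
```
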

## Validation Metric
We'll use the ROC AUC score as a validation metric. We'll train a simple binary classifier, a logistic regression model, to "compete" with our bandits model. This model simply predicts convert/no convert.

In [3]:
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

In [4]:
train_fts = train[['age', 'ARPU']]
#give actions as one-hot features; the prefix keeps every column name a string,
#which recent scikit-learn versions require
campaign_fts = pd.get_dummies(train.action, prefix='action')
campaign_fts.index = train_fts.index
X_train = pd.concat([train_fts, campaign_fts], axis=1)
#Get labels: we are predicting conversion, so 1 if reward > 0
train['convert'] = np.where(train.reward > 0, 1, 0)
Y_train = train.convert

#prepare X_val for later
val_fts = val[['age', 'ARPU']]
campaign_fts_val = pd.get_dummies(val.action, prefix='action')
campaign_fts_val.index = val_fts.index
X_val = pd.concat([val_fts, campaign_fts_val], axis=1)
#get validation labels as well
val['convert'] = np.where(val.reward > 0, 1, 0)
Y_val = val.convert
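
To make the feature layout concrete, here is a small sketch of the kind of design matrix the classifier sees: the two context features plus one indicator column per action. The frame `demo` and the `prefix='action'` argument are illustrative choices; the prefix only affects the column names.

```python
import pandas as pd

demo = pd.DataFrame({
    'age': [23.0, 41.0, 36.0, 31.0],
    'ARPU': [87.1, 52.6, 92.4, 90.4],
    'action': [2, 2, 1, 0],
})

# One indicator column per action value; rows keep the original index.
action_fts = pd.get_dummies(demo['action'], prefix='action')
X_demo = pd.concat([demo[['age', 'ARPU']], action_fts], axis=1)

print(X_demo.columns.tolist())
```

Each row has exactly one active indicator, so the classifier can learn a separate offset for each campaign.
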

In [5]:
classifier = LogisticRegression()
classifier.fit(X_train, Y_train)
pred = classifier.predict_proba(X_val)[:, 1]

classifier_auc_score = roc_auc_score(Y_val, pred)
print('Logistic regression auc score: ', round(classifier_auc_score, 3))

Logistic regression auc score:  0.782


## Bandits Model
We fit a bandits model on the same data.

In [6]:
from space_bandits import NeuralBandits

model = NeuralBandits(num_actions, num_features=2, layer_sizes=[50,12])

Initializing model neural_model-bnn.


In [7]:
model.fit(train[['age', 'ARPU']], train['action'], train['reward'])

Training neural_model-bnn for 100 steps...


## Get Expected Rewards
We collect expected reward values and add them to the validation dataframe.

In [8]:
expected_values = model.expected_values(val[['age', 'ARPU']].values)
pred = pd.DataFrame()
for a, vals in enumerate(expected_values):
    pred[a] = vals
#expected reward values
pred.index = val.index
#add them to validation df
val = pd.concat([val, pred], axis=1)
val.head()

| | age | ARPU | action | reward | convert | 0 | 1 | 2 |
|---|---|---|---|---|---|---|---|---|
| 2 | 36.0 | 92.395865 | 1 | 0 | 0 | 2.980926 | 1.21155 | 2.88079 |
| 3 | 31.0 | 90.414441 | 1 | 0 | 0 | 3.893548 | 0.728292 | 0.254755 |
| 4 | 36.0 | 19.938699 | 1 | 0 | 0 | 3.899634 | 4.718139 | 18.040534 |
| 6 | 29.0 | 124.17015 | 1 | 0 | 0 | 10.047221 | 2.648925 | 0.541609 |
| 10 | 39.0 | 38.745491 | 0 | 0 | 0 | 1.827022 | 2.408977 | 13.946508 |
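
The loop in the cell above implies that `expected_values` yields one vector of predictions per action. Under that assumption, the sketch below does the same reshaping in one step on stand-in numbers (the array `ev` is made up for illustration and is not model output) and also picks the action with the highest expected reward for each row.

```python
import numpy as np
import pandas as pd

# Stand-in for model output: one row per action, one column per customer.
ev = np.array([
    [3.0, 3.9, 3.9],    # action 0
    [1.2, 0.7, 4.7],    # action 1
    [2.9, 0.3, 18.0],   # action 2
])

# Transpose so each action becomes a column, as the loop does.
pred_demo = pd.DataFrame(ev.T, columns=range(ev.shape[0]))
# Action with the highest expected reward for each customer.
pred_demo['best_action'] = ev.argmax(axis=0)

print(pred_demo)
```
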


## Applying the Sigmoid Function
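The bandits model estimates an expected reward for each campaign separately, and the campaigns pay out on different scales (10, 25, and 100 dollars), so we apply the sigmoid function to each expected-reward column independently. To get sensible values, we first standardize each column: subtract its mean and divide by its standard deviation.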
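
In symbols, writing $\hat{r}_{i,a}$ for the expected reward predicted for row $i$ under action $a$ (notation introduced here for illustration), the score assigned to a row whose logged action is $a$ is:

```latex
z_{i,a} = \frac{\hat{r}_{i,a} - \mu_a}{s_a},
\qquad
p_i = \sigma(z_{i,a}) = \frac{1}{1 + e^{-z_{i,a}}}
```

Here $\mu_a$ and $s_a$ are the mean and sample standard deviation of column $a$ over the validation set, matching the pandas `mean()` and `std()` calls in the next cell.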

In [9]:
val['pred'] = .5
for a in range(num_actions):
    #mean center and normalize expected rewards
    val['{}_centered'.format(a)] = (val[a] - val[a].mean())/val[a].std()

In [10]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

#Apply sigmoid to get p_pred
for a in range(num_actions):
    #get the rows for this action
    slc = val[val.action==a]
    #pass values through sigmoid
    vals = sigmoid(slc['{}_centered'.format(a)].values)
    #assign output to appropriate rows
    inds = slc.index
    val.loc[inds, 'pred'] = vals

In [11]:
pred = val.pred

bandits_auc_score = roc_auc_score(Y_val, pred)
print('Bandits auc score: ', round(bandits_auc_score, 3))

Bandits auc score:  0.673
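
One point worth keeping in mind when reading this score: ROC AUC depends only on how the scores rank the examples, and the sigmoid is strictly increasing. The sigmoid step therefore maps the scores into (0, 1) without changing the AUC; it is the per-action standardization that determines how rows from different campaigns are ordered relative to each other. The sketch below illustrates this on synthetic data with a hand-rolled `pairwise_auc` helper (hypothetical, written here only to keep the example dependency-free).

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Quadratic in memory, so only suitable for small demos.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
# Synthetic standardized scores, shifted upward for positives.
z = rng.normal(size=200) + 0.8 * labels

auc_raw = pairwise_auc(labels, z)
auc_sigmoid = pairwise_auc(labels, sigmoid(z))
print(auc_raw, auc_sigmoid)
```
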


## Result
The logistic regression model performs better by this metric (AUC of 0.782 versus 0.673). This shouldn't be a surprise: the bandits model has a much harder job. It has to regress the expected reward for all three campaigns, while the logistic regression model is trained directly on the conversion label and only has a single binary output.