# Validating a Contextual Bandits Model

Validation is straightforward in supervised learning because the "ground truth" is completely and unambiguously known. In reinforcement learning (RL), validation is inherently challenging due to the very nature of the problem: only partial "ground truths" are observed, or queried, from some unknown reward-generating process.<br><br>
Here I propose an approach for validating contextual bandits models.

## Historic Data
The models used in Space Bandits benefit from direct reward approximation; given a set of features or a context, the model estimates an expected reward for each available action. This allows the model to optimize without direct access to the decision-making policy used to query the reward-generating process.<br><br>
The model directly regresses expected reward for each action based on a set of features. This makes regression metrics, such as RMSE, appropriate for evaluation. Due to the stochastic nature of the reward-generating process, we should not expect regression error metrics to be small. However, we would expect an optimized model to minimize such an error metric.
## Naive Benchmark
In the multi-armed bandit case, the expected reward for a given action can be approximated by computing the mean of observed rewards from this action. This special case provides a convenient <b>naive benchmark for the expected value of each action</b>, which we call $\mathbb{E}_{b}[\mathcal{A}]$.<br><br>
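
As a minimal sketch of this benchmark, assume the logged data sits in a pandas DataFrame with `action` and `reward` columns. The `log` frame below is made up purely for illustration. The per-action mean can then be computed with a group-by:

```python
import pandas as pd

# Hypothetical logged interactions: the action taken and the reward observed.
log = pd.DataFrame({
    "action": [0, 0, 1, 1, 2, 2, 2],
    "reward": [10, 0, 0, 25, 0, 0, 100],
})

# Naive benchmark: mean observed reward per action, ignoring any context.
E_b = log.groupby("action")["reward"].mean()
print(E_b.to_dict())
```

Each entry of `E_b` is the reward a context-free multi-armed bandit would expect from that action.
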
We can use $\mathbb{E}_{b}[\mathcal{A}]$ to compute a benchmark error vector, $\epsilon_{b}[\mathcal{A}]$, with one entry per action. For each action $\mathcal{a}$, we take the validation examples in which that action was chosen, use $\mathbb{E}_{b}[\mathcal{a}]$ as a <b>naive predicted reward</b> for every one of them, and compute the RMSE against the observed rewards, $\mathcal{R}_{obs}$:
$$
\epsilon_{b}[\mathcal{a}] = \sqrt{\frac{1}{N_{obs, a}}\sum_{n=1}^{N_{obs, a}}\left(\mathbb{E}_{b}[\mathcal{a}] - \mathcal{r}_{obs, n}\right)^{2}},
$$
where $N_{obs, a}$ is the number of validation examples in which action $\mathcal{a}$ was chosen and $\mathcal{r}_{obs, n}$ is the observed reward for validation example $n$.

We define the model error vector in the same way:

$$
\epsilon_{m}[\mathcal{a}] = \sqrt{\frac{1}{N_{obs, a}}\sum_{n=1}^{N_{obs, a}}\left(\mathcal{r}_{pred, n} - \mathcal{r}_{obs, n}\right)^{2}},
$$
where $\mathcal{r}_{pred, n}$ is the model's expected value of the reward for the action chosen in validation example $n$.
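
One way to read these definitions in code, assuming the error for each action is the root-mean-square error over the validation rows where that action was taken, is sketched below. The helper name `per_action_rmse` and all of the numbers are hypothetical and are not part of Space Bandits.

```python
import numpy as np

def per_action_rmse(actions, y_pred, y_obs, num_actions):
    """RMSE between predicted and observed rewards, computed separately per action."""
    actions = np.asarray(actions)
    y_pred = np.asarray(y_pred, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    errors = np.empty(num_actions)
    for a in range(num_actions):
        mask = actions == a  # validation rows where action a was taken
        errors[a] = np.sqrt(np.mean((y_pred[mask] - y_obs[mask]) ** 2))
    return errors

# Hypothetical validation log with two actions.
actions = [0, 0, 1, 1]
observed = [10.0, 0.0, 0.0, 100.0]
benchmark_pred = [5.0, 5.0, 50.0, 50.0]  # per-action mean repeated for each row
model_pred = [9.0, 1.0, 10.0, 90.0]      # made-up context-aware predictions

eps_b = per_action_rmse(actions, benchmark_pred, observed, num_actions=2)
eps_m = per_action_rmse(actions, model_pred, observed, num_actions=2)
print(eps_b, eps_m)
```

The same helper serves both vectors; only the predictions change between the benchmark and the model.
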


This provides a benchmark with which to compare our model's RMSE, $\epsilon_{m}[\mathcal{A}]$, on the same prediction task on the validation set. If the error ratio, averaged over the $A$ available actions, satisfies
$$
\frac{1}{A}\sum_{a=1}^{A} \frac{\epsilon_{m}[\mathcal{a}]}{\epsilon_{b}[\mathcal{a}]} < 1,
$$
then, on average across actions, our model predicts rewards better than a simple multi-armed bandit model, which suggests it is benefiting from conditioning on the context. For a simple "higher-is-better" score, we can define a contextual bandit model validation score $\mathcal{S}$ as:
$$
\mathcal{S} = \sum_{a=1}^{A} \left(1 - \frac{\epsilon_{m}[\mathcal{a}]}{\epsilon_{b}[\mathcal{a}]}\right)
$$

$\mathcal{S} > 0$ holds exactly when the condition above is met, so any value $\mathcal{S} > 0$ is evidence for model convergence.
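
To make the score concrete, here is a small sketch with made-up error vectors. The values of `eps_b` and `eps_m` are hypothetical, not results from any model. It also checks that $\mathcal{S} > 0$ is the same statement as the mean error ratio being below 1.

```python
import numpy as np

eps_b = np.array([5.0, 4.0, 10.0])  # hypothetical benchmark errors, one per action
eps_m = np.array([4.0, 4.0, 7.0])   # hypothetical model errors, one per action

ratios = eps_m / eps_b     # a ratio below 1 means the model beats the benchmark for that action
S = np.sum(1.0 - ratios)   # higher-is-better score

# S = A - sum(ratios), so S > 0 exactly when the mean ratio is below 1.
print(S > 0, ratios.mean() < 1)
```

In this example the model beats the benchmark on two actions and ties on the third, so the score is positive.
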

## Example with Toy Data
Using the same toy data used in the [toy problem notebook](toy_problem.ipynb), which we know converges, we can compute $\mathcal{S}$ and show that, for the converged model, $\mathcal{S} > 0$.

In [1]:
import numpy as np
import pandas as pd
from random import random, randint
import matplotlib.pyplot as plt
import gc
%config InlineBackend.figure_format='retina'
##Generate Data

def get_customer(ctype=None):
    """Customers come from two feature distributions.
    Class 1: mean age 25, std 5 years, min age 18
             mean ARPU 100, std 15
    Class 2: mean age 45, std 6 years, min age 18
             mean ARPU 50, std 25
    """
    if ctype is None:
        if random() > .5: #coin toss
            ctype = 1
        else:
            ctype = 2
    age = 0
    ft = -1
    if ctype == 1:
        while age < 18:
            age = np.random.normal(25, 5)
        while ft < 0:
            ft = np.random.normal(100, 15)
    if ctype == 2:
        while age < 18:
            age = np.random.normal(45, 6)
        while ft < 0:
            ft = np.random.normal(50, 25)
    age = round(age)
    return ctype, (age, ft)

def get_rewards(customer):
    """
    There are three actions:
    promo 1: low value. 10 dollar if accept
    promo 2: mid value. 25 dollar if accept
    promo 3: high value. 100 dollar if accept
    
    Both groups are unlikely to accept promo 2.
    Group 1 is more likely to accept promo 1.
    Group 2 is slightly more likely to accept promo 3.
    
    The optimal choice for group 1 is promo 1; 90% acceptance for
    an expected reward of 9 dollars each.
    Group 2 accepts with 25% rate for expected 2.5 dollar reward
    
    The optimal choice for group 2 is promo 3; 20% acceptance for an expected
    reward of 20 dollars each.
    Group 1 accepts with 2% for expected reward of 2 dollars.
    
    The least optimal choice in all cases is promo 2; 10% acceptance rate for both groups
    for an expected reward of 2.5 dollars.
    """
    if customer[0] == 1: #group 1 customer
        if random() > .1:
            reward1 = 10
        else:
            reward1 = 0
        if random() > .90:
            reward2 = 25
        else:
            reward2 = 0
        if random() > .98:
            reward3 = 100
        else:
            reward3 = 0
    if customer[0] == 2: #group 2 customer
        if random() > .75:
            reward1 = 10
        else:
            reward1 = 0
        if random() > .90:
            reward2 = 25
        else:
            reward2 = 0
        if random() > .80:
            reward3 = 100
        else:
            reward3 = 0
    return np.array([reward1, reward2, reward3])

def get_cust_reward():
    """returns a scaled customer context (age, ARPU) and a reward vector"""
    cust = get_customer()
    reward = get_rewards(cust)
    context = cust[1]  # (age, ARPU) tuple
    return np.array([context])/100, reward

def generate_dataframe(n_rows):
    df = pd.DataFrame()
    ages = []
    ARPUs = []
    actions = []
    rewards = []
    for i in range(n_rows):
        cust = get_customer()
        reward_vec = get_rewards(cust)
        context = np.array([cust[1]])
        ages.append(context[0, 0])
        ARPUs.append(context[0, 1])
        action = np.random.randint(0,3)
        actions.append(action)
        reward = reward_vec[action]
        rewards.append(reward)

    df['age'] = ages
    df['ARPU'] = ARPUs
    df['action'] = actions
    df['reward'] = rewards

    return df

df = generate_dataframe(4000)
df.head()

| | age | ARPU | action | reward |
|---|---|---|---|---|
| 0 | 22.0 | 92.067812 | 2 | 0 |
| 1 | 44.0 | 44.151515 | 2 | 0 |
| 2 | 48.0 | 50.710585 | 0 | 0 |
| 3 | 41.0 | 59.794778 | 1 | 0 |
| 4 | 39.0 | 53.120689 | 2 | 100 |


We produce a dataset with randomly selected actions and 4000 rows.
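
The expected rewards quoted in the `get_rewards` docstring can be checked with a few lines of arithmetic. The sketch below uses only the acceptance rates and payouts stated there. The variable names are illustrative, and the comparison assumes the two groups are equally likely, as in the coin toss in `get_customer`.

```python
# Acceptance probabilities and payouts taken from the get_rewards docstring.
payout = [10, 25, 100]          # promo 1, promo 2, promo 3
p_accept = {
    1: [0.90, 0.10, 0.02],      # group 1
    2: [0.25, 0.10, 0.20],      # group 2
}

# Expected reward for each (group, promo) pair.
expected = {g: [p * v for p, v in zip(probs, payout)] for g, probs in p_accept.items()}

# A context-free benchmark only sees the average over the two equally likely groups.
marginal = [(e1 + e2) / 2 for e1, e2 in zip(expected[1], expected[2])]

# Best single action for everyone vs. best action chosen per group.
best_context_free = max(marginal)
best_contextual = (max(expected[1]) + max(expected[2])) / 2
print(expected, marginal, best_context_free, best_contextual)
```

The gap between `best_context_free` and `best_contextual` is what a model can gain by using the context. Since the model only sees age and ARPU rather than the group label, `best_contextual` is an upper bound on that gain.
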
## Train/Validation Split
We split the data into two equally-sized groups.

In [2]:
train = df.sample(frac=.5)
val = df[~df.index.isin(train.index)]
num_actions = len(train.action.unique())
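
The split above is unseeded, so the rows in each half change from run to run. If a repeatable split is wanted, one option is to pass `random_state` to `DataFrame.sample`. The sketch below uses a small made-up frame, and the seed value is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical stand-in for the toy dataframe.
toy = pd.DataFrame({"action": [0, 1, 2, 0, 1, 2], "reward": [10, 0, 0, 0, 25, 100]})

train_toy = toy.sample(frac=0.5, random_state=0)  # fixed seed makes the split repeatable
val_toy = toy.drop(train_toy.index)               # every row not sampled into train

# The two halves should be disjoint and together cover the full frame.
print(len(train_toy), len(val_toy))
```
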

In [3]:
from sklearn.metrics import mean_squared_error

## Compute $\epsilon_{b}[\mathcal{A}]$
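We use the training set to compute $\mathbb{E}_{b}[\mathcal{A}]$, then evaluate it on the validation set to get the benchmark error vector, $\epsilon_{b}[\mathcal{A}]$. Note that the code uses scikit-learn's `mean_squared_error`, so the errors computed here are MSE values, i.e. the square of the RMSE.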

In [4]:
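#compute benchmark expected value per action (from the training set)
E_b = [train[train.action == a].reward.mean() for a in range(num_actions)]
Err_b = []
for a in range(num_actions):
    #validation rows where action a was taken
    slc = val[val.action == a]
    #naive prediction: the training-set mean reward for action a
    y_pred = np.full(len(slc), E_b[a])
    y_true = slc.reward
    #MSE on these rows; scikit-learn expects (y_true, y_pred)
    error = mean_squared_error(y_true, y_pred)
    Err_b.append(error)
Err_b = np.array(Err_b)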

## Fit the Model
We fit the model on the training set.

In [5]:
from space_bandits import NeuralBandits

model = NeuralBandits(num_actions, num_features=2, layer_sizes=[50,12])

Initializing model neural_model-bnn.


In [6]:
model.fit(train[['age', 'ARPU']], train['action'], train['reward'])

Training neural_model-bnn for 100 steps...


## Compute $\epsilon_{m}[\mathcal{A}]$
We use the model fit on the training set to compute expected rewards for each example in our validation set, which gives the model error vector, $\epsilon_{m}[\mathcal{A}]$.

In [7]:
#expected reward values; the loop below assumes one array of values per action
expected_values = model.expected_values(val[['age', 'ARPU']].values)
pred = pd.DataFrame()
for a, vals in enumerate(expected_values):
    pred[a] = vals
#align predictions with the validation rows
pred.index = val.index
#add them to validation df
val = pd.concat([val, pred], axis=1)
val.head()

| | age | ARPU | action | reward | 0 | 1 | 2 |
|---|---|---|---|---|---|---|---|
| 3 | 41.0 | 59.794778 | 1 | 0 | 1.325435 | 1.676436 | 10.027914 |
| 6 | 24.0 | 98.758904 | 0 | 10 | 7.98379 | 1.388376 | 1.20194 |
| 8 | 40.0 | 37.461131 | 0 | 0 | 2.1998 | 1.865051 | 23.888348 |
| 9 | 42.0 | 38.994888 | 1 | 0 | 1.541148 | 2.103595 | 22.969137 |
| 10 | 21.0 | 103.160482 | 2 | 0 | 10.148564 | 1.719807 | 1.448784 |


In [8]:
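#compute error vector
Err_m = []
for a in range(num_actions):
    #validation rows where action a was taken
    slc = val[val.action == a]
    #model's expected reward for the action actually taken
    y_pred = slc[a]
    y_true = slc.reward
    #MSE on these rows; scikit-learn expects (y_true, y_pred)
    error = mean_squared_error(y_true, y_pred)
    Err_m.append(error)
Err_m = np.array(Err_m)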

## Compute $\mathcal{S}$

In [9]:
S = (1 - Err_m/Err_b).sum()
print('The contextual bandits model score is: ', round(S, 3))

The contextual bandits model score is:  0.244
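
The cells above compute errors with `mean_squared_error`, whereas the score is described in terms of RMSE. As a rough illustration with made-up per-action errors (the arrays below are hypothetical, not the notebook's values), the two choices give different magnitudes for $\mathcal{S}$, although each per-action term keeps its sign because the square root is monotonic:

```python
import numpy as np

mse_b = np.array([25.0, 100.0, 400.0])  # hypothetical benchmark MSE per action
mse_m = np.array([16.0, 100.0, 324.0])  # hypothetical model MSE per action

# Score computed from MSE ratios, as in the cells above.
S_mse = np.sum(1.0 - mse_m / mse_b)

# Score computed from RMSE ratios, as in the written definition.
S_rmse = np.sum(1.0 - np.sqrt(mse_m) / np.sqrt(mse_b))

print(S_mse, S_rmse)
```

Scores should therefore only be compared when they were computed with the same error metric.
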


## Conclusion
As expected, the model (which we know converges) yields a contextual bandits score $\mathcal{S}>0$, which is evidence of convergence.