# Validating a Contextual Bandits Model

Validation is straightforward in supervised learning because the "ground truth" is completely and unambiguously known. In the reinforcement learning (RL) case, validation is inherently challenging due to the very nature of the problem: only partial "ground truths" are observed, or queried, from some unknown reward-generating process.<br><br>
The following is one approach for validating contextual bandits models.

## Historic Data
The models used in Space Bandits benefit from direct reward approximation: given a set of features, or a context, the model estimates an expected reward for each available action. This allows the model to optimize without direct access to the decision-making policy used to query the reward-generating process.<br><br>
The model directly regresses expected reward for each action based on a set of features. This makes regression metrics, such as RMSE, appropriate for evaluation. Due to the stochastic nature of the reward-generating process, we should not expect regression error metrics to be small. However, we would expect an optimized model to minimize such an error metric.
## Naive Benchmark
In the multi-arm bandit case, the expected reward for a given action can be approximated by computing the mean of observed rewards from this action. This special case provides a convenient <b>naive benchmark for the expected value of each action</b>, which we call $\mathbb{E}_{b}[\mathcal{A}]$.<br><br>
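
As a minimal sketch of this benchmark, the snippet below computes the mean observed reward per action from a small hand-made log. The helper `naive_benchmark` and the `actions` and `rewards` arrays are hypothetical names used for illustration; they are not part of Space Bandits, and the sketch assumes actions are labeled `0..num_actions-1` and that each action appears at least once.

```python
import numpy as np

def naive_benchmark(actions, rewards, num_actions):
    """Mean observed reward per action (assumes every action was observed)."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    return np.array([rewards[actions == a].mean() for a in range(num_actions)])

# Hypothetical logged data: three actions, two observations each.
actions = [0, 0, 1, 1, 2, 2]
rewards = [0, 10, 0, 0, 100, 0]

E_b = naive_benchmark(actions, rewards, num_actions=3)
print(E_b)
```
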
We can use $\mathbb{E}_{b}[\mathcal{A}]$ to compute a benchmark error vector, $\epsilon_{b}[\mathcal{A}]$, with one entry per action. Given a validation set, we simply use $\mathbb{E}_{b}[\mathcal{a}]$ as a <b>naive predicted reward</b> for every validation example in which action $\mathcal{a}$ was chosen and compute the RMSE against the observed rewards, $\mathcal{R}_{obs}$:
$$
\epsilon_{b}[\mathcal{a}] = \sqrt{\frac{1}{N_{obs, a}}\sum_{n=1}^{N_{obs, a}}\left(\mathbb{E}_{b}[\mathcal{a}] - \mathcal{r}_{obs, n}\right)^{2}},
$$
where $N_{obs, a}$ is the number of validation examples in which action $\mathcal{a}$ was chosen and $\mathcal{r}_{obs, n}$ is the observed reward for the $n$-th such example.

We define the model error vector analogously as

$$
\epsilon_{m}[\mathcal{a}] = \sqrt{\frac{1}{N_{obs, a}}\sum_{n=1}^{N_{obs, a}}\left(\mathcal{r}_{pred, n} - \mathcal{r}_{obs, n}\right)^{2}},
$$
where $\mathcal{r}_{pred, n}$ is the model's expected value of the reward for the chosen action in validation example $n$.
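
The two error vectors can be sketched in a few lines, reading each entry as the RMSE over the validation examples in which that action was taken. This is an illustrative sketch under that assumption: `per_action_rmse` is a hypothetical helper, and the small arrays below are made-up data rather than output from any model.

```python
import numpy as np

def per_action_rmse(actions, observed, predicted, num_actions):
    """RMSE between predicted and observed rewards, computed separately per action."""
    actions = np.asarray(actions)
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = []
    for a in range(num_actions):
        mask = actions == a  # validation examples in which action a was taken
        errors.append(np.sqrt(np.mean((predicted[mask] - observed[mask]) ** 2)))
    return np.array(errors)

# Hypothetical validation log with two actions.
actions = np.array([0, 0, 1, 1])
observed = np.array([0.0, 10.0, 0.0, 4.0])

# Benchmark: every example gets the naive mean reward of its chosen action.
E_b = np.array([5.0, 2.0])   # assumed to come from a training set
naive_pred = E_b[actions]

# Model: hypothetical predicted rewards for the chosen actions.
model_pred = np.array([1.0, 9.0, 0.0, 3.0])

eps_b = per_action_rmse(actions, observed, naive_pred, num_actions=2)
eps_m = per_action_rmse(actions, observed, model_pred, num_actions=2)
print(eps_b, eps_m)
```
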


This provides a benchmark with which to compare our model's RMSE, $\epsilon_{m}[\mathcal{A}]$, on the same prediction task on the validation set. If the condition
$$
\frac{1}{N_{\mathcal{A}}}\sum_{a=0}^{N_{\mathcal{A}}-1} \frac{\epsilon_{m}[\mathcal{a}]}{\epsilon_{b}[\mathcal{a}]} < 1
$$
is met, where $N_{\mathcal{A}}$ is the number of actions, the model's error is on average lower than that of a simple multi-arm bandit model, which indicates that conditioning on the context helps. For a simple "higher-is-better" score, we can define a contextual bandit model validation score $\mathcal{S}$ as:
$$
\mathcal{S} = \sum_{a=0}^{N_{\mathcal{A}}-1} \left(1 - \frac{\epsilon_{m}[\mathcal{a}]}{\epsilon_{b}[\mathcal{a}]}\right)
$$

Any value $\mathcal{S} > 0$ is evidence for model convergence.
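
A minimal sketch of the score, assuming the two error vectors are already available as arrays of equal length. `validation_score` is a hypothetical helper name, not a Space Bandits function, and the inputs are made-up values.

```python
import numpy as np

def validation_score(err_model, err_benchmark):
    """Sum over actions of (1 - model error / benchmark error)."""
    err_model = np.asarray(err_model, dtype=float)
    err_benchmark = np.asarray(err_benchmark, dtype=float)
    return float(np.sum(1.0 - err_model / err_benchmark))

# Hypothetical error vectors for two actions: the model halves the
# benchmark error on action 0 and cuts it by a quarter on action 1.
S = validation_score([1.0, 3.0], [2.0, 4.0])
print(S)
```

A model that only matches the benchmark on every action scores exactly 0, so the sign of the score shows whether the model beats the benchmark on balance.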

## Example with Toy Data
Using the same toy data as the [toy problem notebook](toy_problem.ipynb), on which we know the model converges, we can compute $\mathcal{S}$ and show that, for the converged model, $\mathcal{S} > 0$.

In [1]:
import numpy as np
import pandas as pd
from random import random, randint
import matplotlib.pyplot as plt
import gc
%config InlineBackend.figure_format='retina'
##Generate Data

from space_bandits.toy_problem import generate_dataframe

df = generate_dataframe(4000)
df.head()

| | age | ARPU | action | reward |
|---|---|---|---|---|
| 0 | 19.0 | 62.00964 | 2 | 0 |
| 1 | 44.0 | 46.434857 | 0 | 0 |
| 2 | 30.0 | 103.41168 | 0 | 10 |
| 3 | 23.0 | 92.682164 | 1 | 0 |
| 4 | 19.0 | 109.850754 | 1 | 0 |


We produce a dataset with randomly selected actions and 4000 rows.
## Train/Validation Split
We split the data at random into two equally sized groups.

In [2]:
train = df.sample(frac=.5)
val = df[~df.index.isin(train.index)]
num_actions = len(train.action.unique())

In [3]:
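# use this for computing the error metric
# note: mean_squared_error returns the MSE (no square root is taken),
# so the error vectors computed below are per-action MSE values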
from sklearn.metrics import mean_squared_error

## Compute $\epsilon_{b}[\mathcal{A}]$
We use the training set to compute $\mathbb{E}_{b}[\mathcal{A}]$, then score it against the validation set to get the benchmark error vector, $\epsilon_{b}[\mathcal{A}]$.

In [4]:
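# compute benchmark expected value per action from the training set
# (assumes actions are labeled 0..num_actions-1)
E_b = [train[train.action == a].reward.mean() for a in range(num_actions)]
Err_b = []
for a in range(num_actions):
    # validation rows in which action a was taken
    slc = val[val.action == a]
    # the naive prediction is the same constant for every row
    y_pred = np.full(len(slc), E_b[a])
    y_true = slc.reward
    error = mean_squared_error(y_true, y_pred)
    Err_b.append(error)
Err_b = np.array(Err_b)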

In [5]:
Err_b

array([ 24.34257772,  52.31057004, 935.44289322])

## Fit the Model
We fit the model on the training set.

In [6]:
from space_bandits import NeuralBandits

model = NeuralBandits(num_actions, num_features=2, layer_sizes=[50,12], training_epochs=100)

In [7]:
model.fit(train[['age', 'ARPU']], train['action'], train['reward'])

Training neural_model-bnn for 100 steps...


## Compute $\epsilon_{m}[\mathcal{A}]$
We use the model fitted on the training set to compute the expected rewards for each example in our validation set, which gives the model error vector, $\epsilon_{m}[\mathcal{A}]$.

In [8]:
# one array of expected rewards per action, evaluated on the validation features
expected_values = model.expected_values(val[['age', 'ARPU']].values, scale=True)
pred = pd.DataFrame()
for a, vals in enumerate(expected_values):
    pred[a] = vals
# align the predictions with the validation rows
pred.index = val.index
# add them to the validation df as one column per action
val = pd.concat([val, pred], axis=1)
val.head()

| | age | ARPU | action | reward | 0 | 1 | 2 |
|---|---|---|---|---|---|---|---|
| 0 | 19.0 | 62.00964 | 2 | 0 | 11.256291 | 4.328109 | 3.59561 |
| 1 | 44.0 | 46.434857 | 0 | 0 | 2.604247 | 2.501696 | 20.869746 |
| 4 | 19.0 | 109.850754 | 1 | 0 | 9.536264 | 3.178807 | -0.042348 |
| 5 | 35.0 | 26.93624 | 0 | 0 | 2.332897 | 2.808232 | 21.452622 |
| 6 | 21.0 | 105.947998 | 0 | 10 | 9.119907 | 3.142819 | 0.376821 |


In [9]:
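# compute error vector
Err_m = []
for a in range(num_actions):
    # validation rows in which action a was taken
    slc = val[val.action == a]
    # model's expected reward for the action that was actually taken
    y_pred = slc[a]
    y_true = slc.reward
    error = mean_squared_error(y_true, y_pred)
    Err_m.append(error)
Err_m = np.array(Err_m)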

In [10]:
Err_m

array([ 15.30096855,  53.23198686, 867.05886111])

## Compute $\mathcal{S}$

In [11]:
S = (1 - Err_m/Err_b).sum()
print('The contextual bandits model score is: ', round(S, 3))

The contextual bandits model score is:  0.427
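
To see where the score comes from, the sketch below breaks it into per-action contributions, using the two error vectors printed above as literal inputs. The variable name `contributions` is introduced here for illustration only.

```python
import numpy as np

# Error vectors copied from the outputs above.
Err_b = np.array([24.34257772, 52.31057004, 935.44289322])
Err_m = np.array([15.30096855, 53.23198686, 867.05886111])

# Each action contributes 1 - (model error / benchmark error) to the score.
contributions = 1 - Err_m / Err_b
for a, c in enumerate(contributions):
    print(f"action {a}: {c:+.3f}")
print("S =", round(float(contributions.sum()), 3))
```

A negative contribution marks an action on which the model did slightly worse than the naive benchmark; the overall score can still be positive when the gains on the other actions outweigh it.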


## Conclusion
As expected, the model (which we know converges) yields a contextual bandits score $\mathcal{S}>0$, which is evidence of convergence.