# OPE Experiment with Classification Data
---
This notebook provides an example of conducting off-policy evaluation (OPE) of an evaluation policy using classification data as logged bandit data.
It is quite common to conduct OPE experiments using classification data. Appendix G of [Farajtabar et al. (2018)](https://arxiv.org/abs/1802.03493) describes how to conduct OPE experiments with classification data in detail.

In [1]:
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# import open bandit pipeline (obp)
import obp
from obp.dataset import MultiClassToBanditReduction
from obp.ope import (
    OffPolicyEvaluation, 
    RegressionModel,
    InverseProbabilityWeighting as IPS,
    DirectMethod as DM,
    DoublyRobust as DR, 
)

In [2]:
# obp version
print(obp.__version__)

0.5.5


## (1) Bandit Reduction
`obp.dataset.MultiClassToBanditReduction` is an easy-to-use class for transforming classification data into bandit data.
It takes 
- feature vectors (`X`)
- class labels (`y`)
- classifier to construct behavior policy (`base_classifier_b`) 
- parameter of behavior policy (`alpha_b`) 

as its inputs and generates bandit data that can be used to evaluate the performance of decision-making policies (obtained by off-policy learning) and OPE estimators.
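
The reduction can be sketched in plain NumPy as follows. This is a minimal illustration, not obp's implementation: the helper name `reduce_to_bandit` is hypothetical, and the mixing rule (probability `alpha_b` on the classifier's predicted class plus a uniform share of the remainder) is an assumption, chosen because it reproduces the 0.82 and 0.02 entries that appear in `pi_b` below for `alpha_b=0.8` and 10 actions.

```python
import numpy as np

def reduce_to_bandit(y_true, y_pred, n_actions, alpha_b, seed=12345):
    """Hypothetical helper: turn labels and classifier predictions into logged bandit data."""
    rng = np.random.default_rng(seed)
    n_rounds = len(y_true)
    # assumed behavior policy: alpha_b on the predicted class, the rest spread uniformly
    pi_b = np.full((n_rounds, n_actions), (1.0 - alpha_b) / n_actions)
    pi_b[np.arange(n_rounds), y_pred] += alpha_b
    # sample one action per round from the behavior policy
    action = np.array([rng.choice(n_actions, p=p) for p in pi_b])
    # reward is 1 when the sampled action matches the true label, 0 otherwise
    reward = (action == y_true).astype(float)
    # propensity of the action that was actually taken
    pscore = pi_b[np.arange(n_rounds), action]
    return dict(action=action, reward=reward, pi_b=pi_b, pscore=pscore)

y_true = np.array([3, 1, 4, 1, 5])
y_pred = np.array([3, 1, 4, 0, 5])  # toy predictions of a behavior classifier
toy = reduce_to_bandit(y_true, y_pred, n_actions=10, alpha_b=0.8)
```

Because the behavior policy keeps a nonzero probability on every action, each action has a positive propensity, which importance-weighting estimators rely on.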

In [3]:
# load raw digits data
# `return_X_y` splits feature vectors and labels, instead of returning a Bunch object
X, y = load_digits(return_X_y=True)

In [4]:
# convert the raw classification data into a logged bandit dataset
# we construct a behavior policy using Logistic Regression and parameter alpha_b
# given a pair of a feature vector and a label (x, c), create a pair of a context vector and reward (x, r)
# where r = 1 if the output of the behavior policy is equal to c and r = 0 otherwise
# please refer to https://zr-obp.readthedocs.io/en/latest/_autosummary/obp.dataset.multiclass.html for the details
dataset = MultiClassToBanditReduction(
    X=X,
    y=y,
    base_classifier_b=LogisticRegression(max_iter=10000, random_state=12345),
    alpha_b=0.8,
    dataset_name="digits",
)

In [5]:
# split the original data into training and evaluation sets
dataset.split_train_eval(eval_size=0.7, random_state=12345)

In [6]:
# obtain logged bandit data generated by behavior policy
bandit_data = dataset.obtain_batch_bandit_feedback(random_state=12345)

# `bandit_data` is a dictionary storing logged bandit feedback
bandit_data

{'n_actions': 10,
 'n_rounds': 1258,
 'context': array([[ 0.,  0.,  0., ..., 16.,  1.,  0.],
        [ 0.,  0.,  7., ..., 16.,  3.,  0.],
        [ 0.,  0., 12., ...,  8.,  0.,  0.],
        ...,
        [ 0.,  1., 13., ...,  8., 11.,  1.],
        [ 0.,  0., 15., ...,  0.,  0.,  0.],
        [ 0.,  0.,  4., ..., 15.,  3.,  0.]], shape=(1258, 64)),
 'action': array([6, 8, 5, ..., 2, 5, 9], shape=(1258,)),
 'reward': array([1., 1., 1., ..., 1., 1., 1.], shape=(1258,)),
 'position': None,
 'pi_b': array([[[0.02],
         [0.02],
         [0.02],
         ...,
         [0.02],
         [0.02],
         [0.02]],
 
        [[0.02],
         [0.02],
         [0.02],
         ...,
         [0.02],
         [0.82],
         [0.02]],
 
        [[0.02],
         [0.02],
         [0.02],
         ...,
         [0.02],
         [0.02],
         [0.02]],
 
        ...,
 
        [[0.02],
         [0.02],
         [0.82],
         ...,
         [0.02],
         [0.02],
         [0.02]],
 
        [

## (2) Off-Policy Learning
After generating logged bandit data, we now obtain an evaluation policy using the training set.

In [7]:
# obtain action choice probabilities by an evaluation policy
# we construct an evaluation policy using Random Forest and parameter alpha_e
action_dist = dataset.obtain_action_dist_by_eval_policy(
    base_classifier_e=RandomForestClassifier(random_state=12345),
    alpha_e=0.9,
)

In [8]:
# which action to take for each context (a probability distribution over actions)
action_dist[:, :, 0]

array([[0.01, 0.01, 0.01, ..., 0.01, 0.01, 0.01],
       [0.01, 0.01, 0.01, ..., 0.01, 0.91, 0.01],
       [0.01, 0.01, 0.01, ..., 0.01, 0.01, 0.01],
       ...,
       [0.01, 0.01, 0.91, ..., 0.01, 0.01, 0.01],
       [0.01, 0.01, 0.01, ..., 0.01, 0.01, 0.01],
       [0.01, 0.01, 0.01, ..., 0.01, 0.01, 0.91]], shape=(1258, 10))
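
A quick sanity check on an array like `action_dist` can catch shape or normalization mistakes before running OPE. The sketch below uses a small hand-made array rather than the notebook's data, and assumes the layout `(n_rounds, n_actions, len_list)` suggested by the indexing `action_dist[:, :, 0]` above. The mixing rule is also an assumption; it follows the same 0.91/0.01 pattern as the output above (`alpha_e=0.9`, 10 actions).

```python
import numpy as np

n_rounds, n_actions, alpha_e = 3, 10, 0.9
greedy = np.array([8, 2, 9])  # toy predictions of an evaluation classifier

# assumed mixing rule: alpha_e on the predicted class, the rest spread uniformly
toy_action_dist = np.full((n_rounds, n_actions, 1), (1.0 - alpha_e) / n_actions)
toy_action_dist[np.arange(n_rounds), greedy, 0] += alpha_e

# each context should carry a valid probability distribution over actions
row_sums = toy_action_dist[:, :, 0].sum(axis=1)
# the most likely action per context recovers the classifier's prediction
most_likely = toy_action_dist[:, :, 0].argmax(axis=1)
```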

## (3) Off-Policy Evaluation (OPE)
OPE attempts to estimate the performance of evaluation policies using logged bandit data and their action choice probabilities.

Here, we evaluate/compare the OPE performance (estimation accuracy) of 
- **Inverse Propensity Score (IPS)**
- **Direct Method (DM)**
- **Doubly Robust (DR)**
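
For reference, the standard textbook forms of these estimators are given below. The notation is an assumption made here for illustration: $\mathcal{D}_0 = \{(x_i, a_i, r_i)\}_{i=1}^n$ is the logged data, $\pi_b$ is the behavior policy, $w(x,a) := \pi_e(a \mid x) / \pi_b(a \mid x)$ is the importance weight, and $\hat{r}(x,a)$ is an estimate of the expected reward. The library's implementation may differ in details.

$\hat{V}_{\mathrm{IPS}} (\pi_e; \mathcal{D}_0) := \frac{1}{n} \sum_{i=1}^{n} w(x_i, a_i) \, r_i$

$\hat{V}_{\mathrm{DM}} (\pi_e; \mathcal{D}_0, \hat{r}) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{a \sim \pi_e(a|x_i)} [\hat{r}(x_i, a)]$

$\hat{V}_{\mathrm{DR}} (\pi_e; \mathcal{D}_0, \hat{r}) := \frac{1}{n} \sum_{i=1}^{n} \left( \mathbb{E}_{a \sim \pi_e(a|x_i)} [\hat{r}(x_i, a)] + w(x_i, a_i) \left( r_i - \hat{r}(x_i, a_i) \right) \right)$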

### (3-1) obtain a reward estimator
`obp.ope.RegressionModel` simplifies the process of reward modeling

$r(x,a) = \mathbb{E} [r \mid x, a] \approx \hat{r}(x,a)$

In [9]:
regression_model = RegressionModel(
    n_actions=dataset.n_actions, # number of actions; |A|
    base_model=LogisticRegression(C=100, max_iter=10000, random_state=12345), # any sklearn classifier
)

In [10]:
estimated_rewards = regression_model.fit_predict(
    context=bandit_data["context"],
    action=bandit_data["action"],
    reward=bandit_data["reward"],
    position=bandit_data["position"],
    random_state=12345,
)

In [11]:
estimated_rewards[:, :, 0] # \hat{r}(x,a)

array([[0.90706916, 0.8781264 , 0.92355114, ..., 0.80594859, 0.91215889,
        0.92077543],
       [0.89222566, 0.85937668, 0.91108218, ..., 0.77889063, 0.89803698,
        0.9078989 ],
       [0.73358736, 0.67025604, 0.77314251, ..., 0.53952586, 0.74551281,
        0.76628772],
       ...,
       [0.73057663, 0.66685463, 0.77043891, ..., 0.53571004, 0.74258956,
        0.7635274 ],
       [0.9757174 , 0.967386  , 0.98028868, ..., 0.9447445 , 0.97714208,
        0.97952733],
       [0.59554507, 0.52083288, 0.64569699, ..., 0.38520113, 0.61036774,
        0.63680032]], shape=(1258, 10))

### (3-2) OPE
`obp.ope.OffPolicyEvaluation` simplifies the OPE process

$V(\pi_e) \approx \hat{V} (\pi_e; \mathcal{D}_0, \theta)$ using DM, IPS, and DR
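
To make the three estimators concrete, here is a minimal NumPy sketch on hand-made toy data. It follows the standard definitions of IPS, DM, and DR and is not obp's implementation; all array names are hypothetical and the numbers are purely illustrative.

```python
import numpy as np

# toy logged data: 4 rounds, 3 actions (all values are illustrative)
action = np.array([0, 2, 1, 0])
reward = np.array([1.0, 0.0, 1.0, 0.0])
pscore = np.array([0.5, 0.25, 0.5, 0.5])     # pi_b(a_i | x_i)
pi_e = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.1, 0.8],
                 [0.1, 0.8, 0.1],
                 [0.8, 0.1, 0.1]])           # pi_e(a | x_i)
r_hat = np.array([[0.9, 0.2, 0.1],
                  [0.3, 0.2, 0.4],
                  [0.1, 0.7, 0.2],
                  [0.2, 0.3, 0.1]])          # estimated rewards \hat{r}(x_i, a)

idx = np.arange(len(action))
w = pi_e[idx, action] / pscore               # importance weights

# IPS: reweight observed rewards by the importance weights
ips = np.mean(w * reward)
# DM: average the estimated rewards under the evaluation policy
dm_per_round = np.sum(pi_e * r_hat, axis=1)
dm = np.mean(dm_per_round)
# DR: DM baseline plus an importance-weighted correction of its residual
dr = np.mean(dm_per_round + w * (reward - r_hat[idx, action]))
```

DR starts from the DM estimate and corrects it with the weighted residual, so it coincides with DM whenever the reward model fits the logged rewards exactly.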

In [12]:
ope = OffPolicyEvaluation(
    bandit_feedback=bandit_data, # bandit data
    ope_estimators=[
        IPS(estimator_name="IPS"), 
        DM(estimator_name="DM"), 
        DR(estimator_name="DR"),
    ] # used estimators
)

In [13]:
estimated_policy_value = ope.estimate_policy_values(
    action_dist=action_dist, # \pi_e(a|x)
    estimated_rewards_by_reg_model=estimated_rewards, # \hat{r}(x,a)
)

In [14]:
# OPE results given by the three estimators
estimated_policy_value

{'IPS': np.float64(0.8933169180658418),
 'DM': np.float64(0.7909731987509611),
 'DR': np.float64(0.8745296740896926)}

## (4) Evaluation of OPE estimators
Our final step is **the evaluation of OPE**, which evaluates and compares the estimation accuracy of OPE estimators.

With the multi-class classification data, we can calculate the ground-truth policy value of the evaluation policy. 
Therefore, we can compare the policy values estimated by OPE estimators with the ground truth to evaluate OPE estimators.

### (4-1) Approximate the Ground-truth Policy Value
$V(\pi) \approx \frac{1}{|\mathcal{D}_{te}|} \sum_{i=1}^{|\mathcal{D}_{te}|} \mathbb{E}_{a \sim \pi(a|x_i)} [r(x_i, a)], \quad \text{where} \quad r(x,a) := \mathbb{E}_{r \sim p(r|x,a)} [r]$
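
With classification data, the expected reward is known exactly: under the reward definition used in the reduction above, $r(x_i, a) = 1$ when $a$ equals the true label $y_i$ and $0$ otherwise, so the inner expectation reduces to $\pi(y_i \mid x_i)$. A minimal sketch with hypothetical toy arrays (not obp's implementation):

```python
import numpy as np

y_true = np.array([2, 0, 1])            # true labels of three test contexts
pi = np.array([[0.05, 0.05, 0.90],      # pi(a | x_i), one row per context
               [0.90, 0.05, 0.05],
               [0.90, 0.05, 0.05]])     # the policy favors a wrong class on the third context

# r(x_i, a) = 1 if a == y_i else 0, so E_{a ~ pi}[r(x_i, a)] = pi(y_i | x_i)
per_context_value = pi[np.arange(len(y_true)), y_true]
policy_value = per_context_value.mean()
```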

In [15]:
# calculate the ground-truth performance of the evaluation policy
true_policy_value = dataset.calc_ground_truth_policy_value(action_dist=action_dist)

true_policy_value

np.float64(0.8770906200317964)

### (4-2) Evaluation of OPE
Now, let's evaluate the OPE performance (estimation accuracy) of the three estimators using the squared error:

$SE (\hat{V}; \mathcal{D}_0) := \left( V(\pi_e) - \hat{V} (\pi_e; \mathcal{D}_0, \theta) \right)^2$

In [16]:
squared_errors = ope.evaluate_performance_of_estimators(
    ground_truth_policy_value=true_policy_value,
    action_dist=action_dist,
    estimated_rewards_by_reg_model=estimated_rewards,
    metric="se", # squared error
)

In [17]:
squared_errors # DR has the smallest squared error in this run

{'IPS': np.float64(0.00026329274788966576),
 'DM': np.float64(0.007416210248060869),
 'DR': np.float64(6.558444118377741e-06)}
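
The squared errors can be reproduced by hand from the values printed earlier in this notebook, which makes the metric explicit. The variable names below are chosen for illustration only.

```python
# values copied from the outputs printed above
true_value = 0.8770906200317964
estimates = {
    "IPS": 0.8933169180658418,
    "DM": 0.7909731987509611,
    "DR": 0.8745296740896926,
}

# squared error of each estimator against the ground-truth policy value
se = {name: (true_value - v_hat) ** 2 for name, v_hat in estimates.items()}
# estimator with the smallest squared error
best = min(se, key=se.get)
```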

We can iterate the above process several times and calculate the following MSE:

$MSE (\hat{V}) := T^{-1} \sum_{t=1}^T SE (\hat{V}; \mathcal{D}_0^{(t)})$

where $\mathcal{D}_0^{(t)}$ is the synthetic data in the $t$-th iteration.
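
A minimal sketch of such a repeated experiment is shown below. To stay self-contained, it uses a hypothetical context-free toy bandit and a hand-written IPS estimator instead of the dataset and estimators from this notebook; the number of iterations `T`, the sample size, and both policies are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(12345)
expected_reward = np.array([0.2, 0.5, 0.8])   # true E[r | a] for a context-free toy bandit
pi_b = np.array([0.5, 0.3, 0.2])              # behavior policy
pi_e = np.array([0.1, 0.1, 0.8])              # evaluation policy
true_value = float(pi_e @ expected_reward)    # V(pi_e), known exactly in this toy setting

T, n_rounds = 200, 1000
squared_errors = np.empty(T)
for t in range(T):
    # draw a fresh logged dataset D_0^{(t)} from the behavior policy
    action = rng.choice(len(pi_b), size=n_rounds, p=pi_b)
    reward = rng.binomial(1, expected_reward[action])
    # IPS estimate on this dataset and its squared error
    v_hat = np.mean(pi_e[action] / pi_b[action] * reward)
    squared_errors[t] = (true_value - v_hat) ** 2

# average the squared errors over the T iterations
mse = squared_errors.mean()
```

Averaging over many independently generated datasets reduces the influence of any single random draw, so estimator rankings based on MSE are more reliable than rankings based on one squared error.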