# Quickstart Example with Open Bandit Dataset
---
This notebook demonstrates off-policy evaluation (OPE) of Bernoulli Thompson Sampling (BernoulliTS) as an evaluation policy. We use several OPE estimators and logged bandit data generated by running the Random policy (the behavior policy) on the ZOZOTOWN platform. We also evaluate and compare the OPE performance (accuracy) of these estimators.

The example consists of the following four major steps:
- (1) Data Loading and Preprocessing
- (2) Replicating Production Policy
- (3) Off-Policy Evaluation (OPE)
- (4) Evaluation of OPE

In [1]:
# needed when using Google Colab
# !pip install obp

In [2]:
from sklearn.linear_model import LogisticRegression

# import open bandit pipeline (obp)
import obp
from obp.dataset import OpenBanditDataset
from obp.policy import BernoulliTS
from obp.ope import (
    OffPolicyEvaluation, 
    RegressionModel,
    DirectMethod,
    InverseProbabilityWeighting,
    DoublyRobust
)

In [3]:
# obp version
print(obp.__version__)

0.5.5


## (1) Data Loading and Preprocessing

`obp.dataset.OpenBanditDataset` is an easy-to-use data loader for the Open Bandit Dataset.

It takes a behavior policy ('bts' or 'random') and a campaign ('all', 'men', or 'women') as inputs and provides dataset preprocessing.

In [6]:
# load and preprocess raw data in the "Women" campaign collected by the Random policy (behavior policy here)
# When `data_path` is not given, this class downloads the small-sized version of the Open Bandit Dataset;
# here `data_path` points to a local copy of the data instead.
dataset = OpenBanditDataset(behavior_policy='random', campaign='women', data_path="open_bandit_dataset", dataset_name="women.csv")

# obtain logged bandit feedback generated by behavior policy
bandit_feedback = dataset.obtain_batch_bandit_feedback()

The logged bandit dataset is collected by the behavior policy as follows.

$ \mathcal{D}_b := \{(x_i,a_i,r_i)\}$  where $(x,a,r) \sim p(x)\pi_b(a | x)p(r | x,a) $
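
To make this sampling process concrete, here is a minimal synthetic sketch (it does not use the Open Bandit Dataset). The context distribution, the uniform behavior policy, and the click model `true_ctr` are all assumptions chosen for illustration; the dictionary keys mirror a subset of those in `bandit_feedback`.

```python
import numpy as np

rng = np.random.default_rng(12345)
n_rounds, n_actions, dim_context = 1000, 5, 3

# x ~ p(x): hypothetical standard-normal contexts
context = rng.normal(size=(n_rounds, dim_context))

# a ~ pi_b(a | x): a uniform random behavior policy, so pscore = 1 / n_actions
action = rng.integers(n_actions, size=n_rounds)
pscore = np.full(n_rounds, 1.0 / n_actions)

# r ~ p(r | x, a): hypothetical Bernoulli click model with one rate per action
true_ctr = np.linspace(0.01, 0.05, n_actions)
reward = rng.binomial(n=1, p=true_ctr[action])

# collect everything in a dictionary, one entry per logged quantity
synthetic_feedback = dict(
    n_rounds=n_rounds,
    n_actions=n_actions,
    context=context,
    action=action,
    reward=reward,
    pscore=pscore,
)
```

Because the behavior policy is uniform, every logged propensity score equals `1 / n_actions`, which is what makes the importance weights in later steps easy to compute.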

In [7]:
# `bandit_feedback` is a dictionary storing logged bandit feedback
bandit_feedback.keys()

dict_keys(['n_rounds', 'n_actions', 'action', 'position', 'reward', 'pscore', 'context', 'action_context'])

### Let's see some properties of the dataset class

In [8]:
# name of the dataset (set via `dataset_name` above; the default is 'obd', for open bandit dataset)
dataset.dataset_name

'women.csv'

In [9]:
# number of actions in the "Women" campaign is 46
dataset.n_actions

46

In [10]:
# number of samples (or rounds) in the loaded logged data
dataset.n_rounds

864585

In [11]:
# dimension of the context vectors created by the default context (feature) engineering
dataset.dim_context

26

In [12]:
# ZOZOTOWN recommendation interface has three positions
# (please see https://github.com/st-tech/zr-obp/blob/master/images/recommended_fashion_items.png)
dataset.len_list

3

## (2) Replicating Production Policy

After preparing the dataset, we now replicate the BernoulliTS policy implemented on the ZOZOTOWN recommendation interface during the data collection period.

Here, we use `obp.policy.BernoulliTS` as an evaluation policy. 
By activating its `is_zozotown_prior` argument, we can replicate (the policy parameters of) BernoulliTS used in the ZOZOTOWN production.

(When `is_zozotown_prior=False`, a non-informative prior distribution is used.)
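
The Monte Carlo idea behind estimating the action choice probabilities of a Thompson Sampling policy can be sketched as follows. This is an illustration, not obp's implementation: the Beta parameters `alpha` and `beta` are made-up values (not the ZOZOTOWN prior), and only a single position is simulated. In each simulation, a reward probability is drawn for every action from its Beta distribution and the action with the largest draw is selected; the selection frequencies then approximate the policy's distribution over actions.

```python
import numpy as np

rng = np.random.default_rng(12345)
n_actions, n_sim = 4, 100_000

# hypothetical Beta parameters, one (alpha, beta) pair per action
alpha = np.array([2.0, 5.0, 1.0, 3.0])
beta = np.array([50.0, 60.0, 40.0, 45.0])

# each row is one simulation: a sampled reward probability for every action
theta = rng.beta(alpha, beta, size=(n_sim, n_actions))

# Thompson Sampling picks the action whose sampled probability is largest
selected = theta.argmax(axis=1)

# empirical selection frequencies approximate pi_e(a) for one position
action_prob = np.bincount(selected, minlength=n_actions) / n_sim
print(action_prob)
```

A larger `n_sim` reduces the Monte Carlo noise in `action_prob` at the cost of more computation.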

In [13]:
# define BernoulliTS as an evaluation policy
evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions, 
    len_list=dataset.len_list, 
    is_zozotown_prior=True, # replicate the BernoulliTS policy in the ZOZOTOWN production
    campaign="women",
    random_state=12345,
)

# compute the action choice probabilities of the evaluation policy via Monte Carlo simulation
action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000, n_rounds=bandit_feedback["n_rounds"],
)

In [14]:
# `action_dist` is an array of shape (n_rounds, n_actions, len_list) 
# representing the distribution over actions by the evaluation policy
action_dist

array([[[0.03021, 0.01222, 0.00911],
        [0.0019 , 0.00104, 0.0007 ],
        [0.00883, 0.03448, 0.07574],
        ...,
        [0.00229, 0.00209, 0.00206],
        [0.02412, 0.02397, 0.02458],
        [0.06974, 0.09191, 0.09871]],

       [[0.03021, 0.01222, 0.00911],
        [0.0019 , 0.00104, 0.0007 ],
        [0.00883, 0.03448, 0.07574],
        ...,
        [0.00229, 0.00209, 0.00206],
        [0.02412, 0.02397, 0.02458],
        [0.06974, 0.09191, 0.09871]],

       [[0.03021, 0.01222, 0.00911],
        [0.0019 , 0.00104, 0.0007 ],
        [0.00883, 0.03448, 0.07574],
        ...,
        [0.00229, 0.00209, 0.00206],
        [0.02412, 0.02397, 0.02458],
        [0.06974, 0.09191, 0.09871]],

       ...,

       [[0.03021, 0.01222, 0.00911],
        [0.0019 , 0.00104, 0.0007 ],
        [0.00883, 0.03448, 0.07574],
        ...,
        [0.00229, 0.00209, 0.00206],
        [0.02412, 0.02397, 0.02458],
        [0.06974, 0.09191, 0.09871]],

       [[0.03021, 0.01222, 0.00911],
        ... (output truncated)

## (3) Off-Policy Evaluation (OPE)
Our next step is OPE, which aims to estimate the performance of evaluation policies using logged bandit data and OPE estimators.

Here, we use 
- `obp.ope.InverseProbabilityWeighting` (IPW)
- `obp.ope.DirectMethod` (DM)
- `obp.ope.DoublyRobust` (DR)

as estimators and visualize the OPE results.

### (3-1) Obtaining a reward estimator
A reward estimator $\hat{q}(x,a)$ is needed for model-dependent estimators such as DM and DR.

$\hat{q}(x,a) \approx \mathbb{E} [r \mid x,a]$

In [15]:
# estimate the expected rewards by using an ML model (Logistic Regression here)
# the estimated rewards are used by model-dependent estimators such as DM and DR
regression_model = RegressionModel(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    action_context=dataset.action_context,
    base_model=LogisticRegression(max_iter=1000, random_state=12345),
)

In [16]:
estimated_rewards_by_reg_model = regression_model.fit_predict(
    context=bandit_feedback["context"],
    action=bandit_feedback["action"],
    reward=bandit_feedback["reward"],
    position=bandit_feedback["position"],
    pscore=bandit_feedback["pscore"],
    n_folds=3, # use 3-fold cross-fitting
    random_state=12345,
)

Please refer to https://arxiv.org/abs/2002.08536 for details of the cross-fitting procedure.
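
The fold logic of cross-fitting can be sketched roughly as below, under simplifying assumptions: the data are synthetic, there is a single position, and the "model" is just the mean reward per action (a context-free stand-in for the logistic regression used above). `RegressionModel` may organize its folds and models differently; the point is only that each round is scored by a model fitted on the other folds.

```python
import numpy as np

rng = np.random.default_rng(12345)
n_rounds, n_actions, n_folds = 3000, 4, 3

# synthetic logged data with hypothetical click rates per action
action = rng.integers(n_actions, size=n_rounds)
reward = rng.binomial(1, np.array([0.02, 0.05, 0.10, 0.20])[action])

# assign each round to one of `n_folds` folds at random
fold = rng.permutation(n_rounds) % n_folds

# q_hat[i, a]: estimated expected reward of action a in round i
q_hat = np.zeros((n_rounds, n_actions))
for k in range(n_folds):
    train, test = fold != k, fold == k
    for a in range(n_actions):
        # "fit" on the other folds only: mean reward of action a
        mask = train & (action == a)
        q_hat[test, a] = reward[mask].mean() if mask.any() else 0.0
```

No round's own reward contributes to its prediction in `q_hat`, which is the property cross-fitting is meant to provide.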

### (3-2) Off-Policy Evaluation
$V(\pi_e) \approx \hat{V} (\pi_e; \mathcal{D}_b, \theta)$ using DM, IPW, and DR
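
For reference, the standard textbook forms of the three estimators can be sketched in NumPy as below. This is a simplified illustration for a single position, without clipping or any other options a library implementation may offer; the function names `ipw`, `dm`, and `dr` are hypothetical and are not obp's API. Here `pi_e` holds the evaluation policy's action probabilities per round and `q_hat` the estimated rewards $\hat{q}(x_i,a)$.

```python
import numpy as np

def ipw(reward, action, pscore, pi_e):
    """Inverse Probability Weighting: importance-weighted mean of observed rewards."""
    w = pi_e[np.arange(len(action)), action] / pscore
    return float(np.mean(w * reward))

def dm(pi_e, q_hat):
    """Direct Method: mean estimated reward under the evaluation policy."""
    return float(np.mean(np.sum(pi_e * q_hat, axis=1)))

def dr(reward, action, pscore, pi_e, q_hat):
    """Doubly Robust: DM baseline plus an importance-weighted residual correction."""
    idx = np.arange(len(action))
    w = pi_e[idx, action] / pscore
    baseline = np.sum(pi_e * q_hat, axis=1)
    return float(np.mean(baseline + w * (reward - q_hat[idx, action])))

# tiny hand-checkable example: 2 rounds, 2 actions, uniform behavior policy
reward = np.array([1.0, 0.0])
action = np.array([0, 1])
pscore = np.array([0.5, 0.5])
pi_e = np.array([[0.8, 0.2], [0.8, 0.2]])   # pi_e(a | x_i)
q_hat = np.array([[0.6, 0.1], [0.6, 0.1]])  # q_hat(x_i, a)

print(ipw(reward, action, pscore, pi_e))
print(dm(pi_e, q_hat))
print(dr(reward, action, pscore, pi_e, q_hat))
```

In this toy example IPW and DR both come to 0.8 and DM to 0.5, up to floating-point rounding: DM relies entirely on `q_hat`, while DR corrects the DM baseline with the weighted residuals of the observed rewards.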

In [15]:
# estimate the policy value of BernoulliTS based on its action choice probabilities
# it is possible to set multiple OPE estimators to the `ope_estimators` argument
ope = OffPolicyEvaluation(
    bandit_feedback=bandit_feedback,
    ope_estimators=[InverseProbabilityWeighting(), DirectMethod(), DoublyRobust()]
)

# `summarize_off_policy_estimates` returns pandas dataframes including the OPE results
estimated_policy_value, estimated_interval = ope.summarize_off_policy_estimates(
    action_dist=action_dist, 
    estimated_rewards_by_reg_model=estimated_rewards_by_reg_model,
    n_bootstrap_samples=10000, # number of resamples drawn in the bootstrap procedure
    random_state=12345,
)
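
The `n_bootstrap_samples` argument sets how many resamples are used when estimating confidence intervals. A generic percentile-bootstrap sketch is given below; it operates on synthetic per-round values and may differ in detail from what obp does internally. The helper `bootstrap_ci` and its parameters are hypothetical names introduced for this illustration.

```python
import numpy as np

def bootstrap_ci(values, n_bootstrap_samples=1000, alpha=0.05, random_state=12345):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(random_state)
    n = len(values)
    # resample rounds with replacement and record the mean of each resample
    idx = rng.integers(n, size=(n_bootstrap_samples, n))
    means = values[idx].mean(axis=1)
    lower, upper = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(means.mean()), float(lower), float(upper)

# hypothetical per-round IPW terms w_i * r_i (synthetic, for illustration only)
rng = np.random.default_rng(0)
per_round = rng.binomial(1, 0.01, size=2000) * 1.5

mean, lower, upper = bootstrap_ci(per_round)
print(mean, lower, upper)
```

More resamples give a smoother estimate of the interval endpoints, at a cost that grows linearly in `n_bootstrap_samples`.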

In [16]:
# the estimated policy value of the evaluation policy (the BernoulliTS policy)
# relative_estimated_policy_value is the policy value of the evaluation policy 
# relative to the ground-truth policy value of the behavior policy (the Random policy here)
estimated_policy_value

| | estimated_policy_value | relative_estimated_policy_value |
|---|---|---|
| ipw | 0.006813 | 1.48119 |
| dm | 0.004555 | 0.99016 |
| dr | 0.006863 | 1.491972 |
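
As a quick sanity check on the two columns, the short sketch below assumes, as the code comment above describes, that the relative value is the estimated policy value divided by the policy value of the behavior policy. Dividing one column by the other should then recover the same behavior policy value for every estimator.

```python
# values copied from the table above
estimated = {"ipw": 0.006813, "dm": 0.004555, "dr": 0.006863}
relative = {"ipw": 1.48119, "dm": 0.99016, "dr": 1.491972}

# relative = estimated / V(pi_b), so V(pi_b) = estimated / relative
implied_behavior_value = {k: estimated[k] / relative[k] for k in estimated}

for name, value in implied_behavior_value.items():
    print(f"{name}: {value:.4f}")
```

All three rows imply a behavior policy value of about 0.0046. Read this way, IPW and DR estimate that BernoulliTS performs roughly 48% to 49% better than the Random policy, while DM estimates roughly the same performance.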
