
# Causal Off-Policy Evaluation for Targeting Policies (Policy Switch / Discounts)

This notebook is a **playbook for off-policy evaluation (OPE)** in the context of targeting
policies, e.g. **who should receive a discount or promo**.

We assume you have **logged data** from some **logging policy** (a previous strategy) and
you want to estimate what would happen under a **new policy** without deploying it yet.

We cover:

1. Simulating logged contextual bandit data for a **discount policy**.  
2. Defining a **logging policy** and one or more **target policies**.  
3. Off-policy estimators:
   - Inverse propensity weighting (IPW / importance sampling).  
   - Self-normalized IPW (SNIPW).  
   - Doubly-robust (DR) estimator with a learned reward model \(Q(x, a)\).  
4. A **policy switch** example:
   - Current discount policy A (logging).  
   - Candidate policy B (tighter targeting).  
   - Use OPE to estimate **counterfactual lift** in expected reward.

Everything here is structured so you can swap in your own feature matrix `X`, actions, propensities,
and rewards from real logs.


## 0) Setup

In [None]:

from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Dict, Any, Tuple

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (7, 4.5)
plt.rcParams["axes.grid"] = True

try:
    from sklearn.linear_model import LogisticRegression  # type: ignore
except Exception as e:  # pragma: no cover
    LogisticRegression = None
    print("scikit-learn not available; DR with logistic Q will be skipped:", e)



## 1) Simulated logged bandit data for a discount policy

We simulate **contextual bandit** style logs for a simple discount decision:

- Context features: `X ∈ R^d` (e.g., engagement, value, device).  
- Action `A ∈ {0,1}`:
  - `A = 1` ⇒ user receives a discount.  
  - `A = 0` ⇒ no discount.  
- Reward `R`:
  - Here we take `R = revenue`, which depends on the user and whether they get a discount.

We also log the **propensity** of the logging policy:

\[
\pi_b(a \mid x) = \mathbb{P}(A = a \mid X = x, \text{logging policy}).
\]

The dataset will contain:

- `x1, x2, ..., xd` — context features.  
- `action` — 0 or 1.  
- `propensity` — probability of the action chosen by the logging policy.  
- `revenue` — realized reward under the taken action.
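
When swapping in real logs, it can help to check that a DataFrame matches this schema before running any estimator. The following is a minimal sketch; the helper name `validate_logged_df` and the specific checks it performs are assumptions made for illustration, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd


def validate_logged_df(df: pd.DataFrame, reward_col: str = "revenue") -> None:
    """Raise ValueError if df does not look like logged bandit data (hypothetical helper)."""
    feature_cols = [c for c in df.columns if c.startswith("x")]
    if not feature_cols:
        raise ValueError("expected at least one context column named x1, x2, ...")
    for col in ("action", "propensity", reward_col):
        if col not in df.columns:
            raise ValueError(f"missing required column: {col}")
    if not df["action"].isin([0, 1]).all():
        raise ValueError("action must be 0 or 1")
    p = df["propensity"].to_numpy(dtype=float)
    # The propensity of the *taken* action must be a strictly positive probability.
    # NaN values fail this comparison as well.
    if not np.all((p > 0.0) & (p <= 1.0)):
        raise ValueError("propensity must lie in (0, 1]")
    if df[feature_cols + [reward_col]].isna().any().any():
        raise ValueError("features and reward must not contain NaN")


toy_logs = pd.DataFrame(
    {"x1": [0.2, -1.0], "action": [1, 0], "propensity": [0.6, 0.3], "revenue": [80.0, 0.0]}
)
validate_logged_df(toy_logs)  # passes silently
```

A zero or missing propensity is the most common problem in practice, since the IPW weights divide by this column.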


In [None]:

@dataclass(frozen=True)
class LoggedBanditData:
    """Container for simulated logged bandit data.

    Attributes
    ----------
    df : pd.DataFrame
        Logged data with context, action, propensity, and reward.
    true_model_params : Dict[str, Any]
        Parameters of the generative model (for ground-truth evaluation).
    """
    df: pd.DataFrame
    true_model_params: Dict[str, Any]


def simulate_logging_policy_data(
    n: int = 50_000,
    d: int = 3,
    seed: int | None = 123,
) -> LoggedBanditData:
    """Simulate logged data for a discount policy.

    The generative model is:

    - Context X ~ N(0, I).
    - Logging policy probability of discount:
        pi_b(1 | x) = sigmoid( beta0 + beta^T x ).
    - Baseline purchase probability (no discount):
        p0(x) = sigmoid( alpha0 + alpha^T x ).
    - Discount uplift on purchase probability: delta > 0.
    - Revenue:
        - Base price = 100.
        - If purchase:
            - No discount: revenue = 100.
            - Discount: revenue = (1 - discount_rate) * 100.
        - If no purchase: revenue = 0.

    Parameters
    ----------
    n : int
        Number of logged interactions.
    d : int
        Number of context features.
    seed : int | None
        Random seed.

    Returns
    -------
    LoggedBanditData
        DataFrame with context, action, propensity, revenue, plus model params.
    """
    rng = np.random.default_rng(seed)

    # Context features
    X = rng.normal(size=(n, d))
    feature_cols = [f"x{j+1}" for j in range(d)]

    # Logging policy parameters
    beta0 = -0.1
    beta = np.array([0.8, -0.4, 0.3][:d])

    def sigmoid(z: np.ndarray | float) -> np.ndarray | float:
        return 1.0 / (1.0 + np.exp(-z))

    logit_pi = beta0 + X @ beta
    pi_discount = sigmoid(logit_pi)  # pi_b(1 | x)

    # Sample actions under logging policy
    actions = rng.binomial(1, pi_discount)

    # Outcome model parameters (true, unknown in practice)
    alpha0 = -1.0
    alpha = np.array([0.5, 0.3, -0.2][:d])
    discount_uplift = 0.06  # absolute uplift in purchase prob when discounted
    discount_rate = 0.20    # 20% off price

    base_price = 100.0

    # Baseline purchase probability without discount
    logit_p0 = alpha0 + X @ alpha
    p0 = sigmoid(logit_p0)

    # Purchase probability under chosen action
    p_purchase = np.where(actions == 1, np.clip(p0 + discount_uplift, 0.0, 1.0), p0)

    # Realized purchases
    purchases = rng.binomial(1, p_purchase)

    # Revenue: 0 if no purchase; price or discounted price if purchase
    revenue = np.where(
        purchases == 1,
        np.where(actions == 1, (1.0 - discount_rate) * base_price, base_price),
        0.0,
    )

    df = pd.DataFrame(X, columns=feature_cols)
    df["action"] = actions.astype(int)
    df["propensity"] = np.where(actions == 1, pi_discount, 1.0 - pi_discount)
    df["purchase"] = purchases.astype(int)
    df["revenue"] = revenue.astype(float)

    params = {
        "alpha0": alpha0,
        "alpha": alpha,
        "discount_uplift": discount_uplift,
        "discount_rate": discount_rate,
        "base_price": base_price,
        "beta0": beta0,
        "beta": beta,
    }

    return LoggedBanditData(df=df, true_model_params=params)


logged_data = simulate_logging_policy_data()
logged_data.df.head()


In [None]:

logged_data.df[["action", "propensity", "purchase", "revenue"]].describe(include="all")



## 2) Logging policy vs target policies

We represent a **policy** as a function \(\pi(a \mid x)\) that returns, for each action `a`,
its probability given context `x`.

For simplicity we will focus on **binary actions** `a ∈ {0,1}` (no discount / discount), and define:

- Logging policy \(\pi_b\) (used to generate the data): known only through the logged `propensity`.  
- Target policies \(\pi_e\) (evaluation policies):
  - `policy_never_discount`: A = 0 always.  
  - `policy_always_discount`: A = 1 always.  
  - `policy_score_threshold`: discount only when a simple score is above 0.

We will evaluate these target policies using the logged data from the logging policy.


In [None]:

def policy_never_discount(x: np.ndarray) -> np.ndarray:
    """Deterministic policy: never apply discount (action=0).

    Returns a 2-vector of probabilities [P(A=0), P(A=1)] for each row in x.
    """
    n = x.shape[0]
    pi = np.zeros((n, 2), dtype=float)
    pi[:, 0] = 1.0
    return pi


def policy_always_discount(x: np.ndarray) -> np.ndarray:
    """Deterministic policy: always apply discount (action=1)."""
    n = x.shape[0]
    pi = np.zeros((n, 2), dtype=float)
    pi[:, 1] = 1.0
    return pi


def policy_score_threshold(x: np.ndarray, w: np.ndarray | None = None, threshold: float = 0.0) -> np.ndarray:
    """Deterministic policy: discount if score = w^T x >= threshold.

    Parameters
    ----------
    x : np.ndarray
        Feature matrix of shape (n, d).
    w : np.ndarray | None
        Weight vector; if None, use a simple default [1, -1, 0,...].
    threshold : float
        Score threshold for discounting.

    Returns
    -------
    np.ndarray
        Array of shape (n, 2) with [P(A=0), P(A=1)] per row.
    """
    n, d = x.shape
    if w is None:
        w_vec = np.zeros(d)
        w_vec[0] = 1.0
        if d > 1:
            w_vec[1] = -1.0
    else:
        w_vec = np.asarray(w, dtype=float)
        if w_vec.shape[0] != d:
            raise ValueError("w must have same dimension as x columns.")

    scores = x @ w_vec
    discount_flag = (scores >= threshold).astype(int)

    pi = np.zeros((n, 2), dtype=float)
    pi[np.arange(n), discount_flag] = 1.0
    return pi
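

The three policies above are deterministic, but the same `(n, 2)` interface also covers stochastic policies. As a hedged illustration, the sketch below defines an epsilon-soft policy; the name `policy_epsilon_soft` and the rule "discount if the first feature is non-negative" are assumptions chosen for the example.

```python
import numpy as np


def policy_epsilon_soft(x: np.ndarray, epsilon: float = 0.1) -> np.ndarray:
    """Stochastic policy (hypothetical example).

    With probability 1 - epsilon follow the rule 'discount if x[:, 0] >= 0';
    with probability epsilon pick one of the two actions uniformly at random.
    """
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be in [0, 1].")
    n = x.shape[0]
    greedy = (x[:, 0] >= 0.0).astype(int)
    # Every action gets epsilon / 2; the greedy action gets the remaining mass.
    pi = np.full((n, 2), epsilon / 2.0)
    pi[np.arange(n), greedy] += 1.0 - epsilon
    return pi


x_demo = np.array([[1.5, 0.0], [-0.3, 2.0]])
pi_demo = policy_epsilon_soft(x_demo, epsilon=0.2)
print(pi_demo)  # each row is [P(A=0), P(A=1)] and sums to 1
```

A policy like this puts positive probability on both actions, so its importance weights are never exactly zero.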



## 3) Off-policy value estimators

We want to estimate the **expected reward** of a target policy \(\pi_e\) using data
collected under the logging policy \(\pi_b\).

For each logged interaction we have:

- Context: `X_i`.  
- Action taken by logging policy: `A_i`.  
- Reward: `R_i`.  
- Logging propensity: \(p_i = \pi_b(A_i \mid X_i)\).

Given a target policy \(\pi_e\), we define \(\pi_e(A_i \mid X_i)\) accordingly (often either 0 or 1).

We implement:

1. **IPW (importance sampling)** estimator:
   \[
   \hat V_{\text{IPW}} =
   \frac{1}{n} \sum_{i=1}^n w_i R_i, \quad
   w_i = \frac{\pi_e(A_i \mid X_i)}{\pi_b(A_i \mid X_i)}.
   \]

2. **Self-normalized IPW (SNIPW)**:
   \[
   \hat V_{\text{SNIPW}} =
   \frac{\sum_i w_i R_i}{\sum_i w_i}.
   \]

3. **Doubly-robust (DR)** estimator (requires an estimate of the conditional mean reward \(Q(x,a)\)):
   \[
   \hat V_{\text{DR}} =
   \frac{1}{n}\sum_i \left(
      \sum_a \pi_e(a \mid X_i) \hat Q(X_i, a)
      + \frac{\pi_e(A_i \mid X_i)}{\pi_b(A_i \mid X_i)} (R_i - \hat Q(X_i, A_i))
   \right).
   \]

DR is consistent (asymptotically unbiased) if either the propensities or the Q-model is correctly specified; it does not need both.
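
To make the IPW and SNIPW formulas concrete, here is a small hand-checkable example. The four logged rows are made-up toy numbers, not draws from the simulator.

```python
import numpy as np

# Four logged interactions (toy numbers chosen for easy arithmetic).
reward = np.array([100.0, 0.0, 80.0, 0.0])
action = np.array([0, 1, 1, 0])
propensity = np.array([0.5, 0.5, 0.25, 0.75])  # pi_b(A_i | X_i)

# Target policy "always discount": pi_e(A_i | X_i) is 1 when A_i == 1, else 0.
pi_e_taken = (action == 1).astype(float)

w = pi_e_taken / propensity               # [0, 2, 4, 0]
v_ipw = np.mean(w * reward)               # (0 + 0 + 320 + 0) / 4 = 80.0
v_snipw = np.sum(w * reward) / np.sum(w)  # 320 / 6 = 53.33...

print(v_ipw, v_snipw)
```

The two estimates differ because the weights average 1.5 rather than 1 in this tiny sample. SNIPW divides by the realized weight total, which usually lowers variance at the cost of a small finite-sample bias.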


In [None]:

def estimate_policy_value_ipw(
    reward: np.ndarray,
    action: np.ndarray,
    logging_propensity: np.ndarray,
    target_policy_prob: np.ndarray,
) -> float:
    """Estimate policy value via (un-normalized) IPW.

    Parameters
    ----------
    reward : np.ndarray
        Rewards R_i.
    action : np.ndarray
        Actions A_i in {0,1}.
    logging_propensity : np.ndarray
        Logging probabilities pi_b(A_i | X_i).
    target_policy_prob : np.ndarray
        Target probabilities pi_e(A_i | X_i).

    Returns
    -------
    float
        IPW estimate of expected reward under pi_e.
    """
    reward = np.asarray(reward, dtype=float)
    action = np.asarray(action, dtype=int)
    logging_propensity = np.asarray(logging_propensity, dtype=float)
    target_policy_prob = np.asarray(target_policy_prob, dtype=float)

    if reward.shape != action.shape or reward.shape != logging_propensity.shape:
        raise ValueError("reward, action, logging_propensity must have same shape.")
    if reward.shape != target_policy_prob.shape:
        raise ValueError("target_policy_prob must have same shape as reward.")

    eps = 1e-8
    w = target_policy_prob / np.clip(logging_propensity, eps, None)
    return float(np.mean(w * reward))


def estimate_policy_value_snipw(
    reward: np.ndarray,
    action: np.ndarray,
    logging_propensity: np.ndarray,
    target_policy_prob: np.ndarray,
) -> float:
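    """Estimate policy value via self-normalized IPW (SNIPW).

    Takes the same arguments as estimate_policy_value_ipw; `action` is unused
    and kept only so both estimators share one signature.
    """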
    reward = np.asarray(reward, dtype=float)
    logging_propensity = np.asarray(logging_propensity, dtype=float)
    target_policy_prob = np.asarray(target_policy_prob, dtype=float)

    eps = 1e-8
    w = target_policy_prob / np.clip(logging_propensity, eps, None)
    w_sum = np.sum(w)
    if w_sum <= 0.0:
        raise ValueError("Sum of weights is non-positive.")
    return float(np.sum(w * reward) / w_sum)


In [None]:

def estimate_policy_value_dr(
    reward: np.ndarray,
    action: np.ndarray,
    logging_propensity: np.ndarray,
    target_pi: np.ndarray,
    q_hat: np.ndarray,
) -> float:
    """Doubly-robust policy value estimator for binary actions.

    Parameters
    ----------
    reward : np.ndarray
        Rewards R_i.
    action : np.ndarray
        Actions A_i (0 or 1).
    logging_propensity : np.ndarray
        Logging probabilities pi_b(A_i | X_i).
    target_pi : np.ndarray
        Target policy probabilities pi_e(a | X_i) of shape (n, 2).
    q_hat : np.ndarray
        Estimated Q(X_i, a) for a=0,1, shape (n, 2).

    Returns
    -------
    float
        DR estimate of expected reward under pi_e.
    """
    reward = np.asarray(reward, dtype=float)
    action = np.asarray(action, dtype=int)
    logging_propensity = np.asarray(logging_propensity, dtype=float)
    target_pi = np.asarray(target_pi, dtype=float)
    q_hat = np.asarray(q_hat, dtype=float)

    n = reward.shape[0]
    if target_pi.shape != (n, 2) or q_hat.shape != (n, 2):
        raise ValueError("target_pi and q_hat must have shape (n, 2).")

    eps = 1e-8
    # Q_bar_i = sum_a pi_e(a|Xi) Q_hat(Xi, a)
    Q_bar = np.sum(target_pi * q_hat, axis=1)

    # Q_hat at taken action
    q_taken = q_hat[np.arange(n), action]

    # Importance weights for taken actions
    pi_e_taken = target_pi[np.arange(n), action]
    w = pi_e_taken / np.clip(logging_propensity, eps, None)

    dr_terms = Q_bar + w * (reward - q_taken)
    return float(np.mean(dr_terms))
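

The double-robustness property can be checked numerically. The sketch below is a self-contained toy experiment, separate from the discount simulator: one binary context, two actions, and Bernoulli rewards. All numbers in it are illustrative assumptions. It evaluates the policy "always take action 1" with a deliberately wrong Q-model, with deliberately wrong propensities, and with both wrong.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy world: one binary context, logging policy depends on it.
x = rng.integers(0, 2, size=n)
pi_b1 = np.where(x == 1, 0.8, 0.2)            # true logging P(A=1 | x)
a = rng.binomial(1, pi_b1)
q_true = np.array([[0.2, 0.5], [0.6, 0.3]])   # q_true[x, a] = E[R | x, a]
r = rng.binomial(1, q_true[x, a]).astype(float)

# Target policy: always choose action 1. With P(x=1) = 0.5 its true value is 0.4.
true_value = 0.5 * q_true[0, 1] + 0.5 * q_true[1, 1]


def dr_always_one(q_hat: np.ndarray, prop_taken: np.ndarray) -> float:
    """DR estimate for the 'always action 1' policy; q_hat is indexed as [x, a]."""
    w = (a == 1) / prop_taken                 # pi_e(A_i | X_i) / pi_b(A_i | X_i)
    return float(np.mean(q_hat[x, 1] + w * (r - q_hat[x, a])))


prop_true = np.where(a == 1, pi_b1, 1.0 - pi_b1)
prop_wrong = np.full(n, 0.5)                  # deliberately misspecified
q_wrong = np.full((2, 2), 0.9)                # deliberately misspecified

print("true value          :", true_value)
print("wrong Q, true props :", dr_always_one(q_wrong, prop_true))
print("true Q, wrong props :", dr_always_one(q_true, prop_wrong))
print("both wrong          :", dr_always_one(q_wrong, prop_wrong))
```

With one component correct, the estimate should land near the true value; with both wrong, the bias no longer cancels.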



### 3.1 Fitting a reward model \(\hat Q(x,a)\)

To use the DR estimator we need an estimate \(\hat Q(x,a)\) of the conditional mean reward.

Here we fit a **logistic regression** on `purchase` and then convert it into expected revenue
under each action:

\[
\hat Q(x,a) = \hat p_\text{purchase}(x,a) \times \text{price}(a).
\]

If `scikit-learn` is not available, this step is skipped and DR estimates are not computed.


In [None]:

if LogisticRegression is None:
    print("sklearn not available; skipping Q-model fit.")
else:
    df_logs = logged_data.df.copy()
    feature_cols = [c for c in df_logs.columns if c.startswith("x")]

    X = df_logs[feature_cols].to_numpy()
    A = df_logs["action"].to_numpy()
    y = df_logs["purchase"].to_numpy()

    # Simple encoding: concatenate X and action as a feature
    X_aug = np.hstack([X, A.reshape(-1, 1)])

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_aug, y)

    # Extract pricing parameters from true model for Q mapping
    base_price = logged_data.true_model_params["base_price"]
    discount_rate = logged_data.true_model_params["discount_rate"]

    def q_hat_fn(x: np.ndarray) -> np.ndarray:
        """Compute Q_hat(x,a) for a=0,1 based on the fitted purchase model.

        Parameters
        ----------
        x : np.ndarray
            Feature matrix of shape (n, d).

        Returns
        -------
        np.ndarray
            Q_hat of shape (n, 2) with expected revenue for each action.
        """
        n = x.shape[0]
        # For a=0 (no discount)
        X0 = np.hstack([x, np.zeros((n, 1))])
        p0 = clf.predict_proba(X0)[:, 1]
        rev0 = p0 * base_price

        # For a=1 (discount)
        X1 = np.hstack([x, np.ones((n, 1))])
        p1 = clf.predict_proba(X1)[:, 1]
        rev1 = p1 * ((1.0 - discount_rate) * base_price)

        return np.stack([rev0, rev1], axis=1)

    # Precompute Q_hat on the logged contexts for DR
    X_logs = df_logs[feature_cols].to_numpy()
    q_hat_logs = q_hat_fn(X_logs)
    # A bare expression inside an if/else block is not displayed, so print it.
    print(q_hat_logs[:5])



## 4) Off-policy evaluation of several policies

We now define a small helper that, given logged data and a target policy, computes:

- IPW and SNIPW estimates.  
- DR estimate (if the Q-model \(\hat Q\) is available).

We then compare several policies:

- Logging policy (on-policy empirical mean of `revenue`, shown for reference).  
- Never discount.  
- Always discount.  
- Score-threshold policy.


In [None]:

def evaluate_policy_offline(
    df_logs: pd.DataFrame,
    policy_name: str,
    policy_fn: Callable[[np.ndarray], np.ndarray],
    q_hat_logs: np.ndarray | None = None,
) -> Dict[str, Any]:
    """Evaluate a target policy using IPW, SNIPW, and (optionally) DR.

    Parameters
    ----------
    df_logs : DataFrame
        Logged data with features x*, action, propensity, reward.
    policy_name : str
        Name of the policy (for reporting only).
    policy_fn : Callable[[np.ndarray], np.ndarray]
        Function mapping feature matrix X to policy probabilities of shape (n, 2).
    q_hat_logs : np.ndarray | None
        Estimated Q(X_i, a) for a=0,1. If None, DR is skipped.

    Returns
    -------
    dict
        Policy name and available value estimates.
    """
    feature_cols = [c for c in df_logs.columns if c.startswith("x")]
    X = df_logs[feature_cols].to_numpy()
    A = df_logs["action"].to_numpy()
    R = df_logs["revenue"].to_numpy()
    p_b = df_logs["propensity"].to_numpy()

    # Target policy probabilities at taken actions
    pi_e = policy_fn(X)
    pi_e_taken = pi_e[np.arange(X.shape[0]), A]

    est_ipw = estimate_policy_value_ipw(R, A, p_b, pi_e_taken)
    est_snipw = estimate_policy_value_snipw(R, A, p_b, pi_e_taken)

    results: Dict[str, Any] = {
        "policy": policy_name,
        "IPW": est_ipw,
        "SNIPW": est_snipw,
    }

    if q_hat_logs is not None:
        est_dr = estimate_policy_value_dr(
            reward=R,
            action=A,
            logging_propensity=p_b,
            target_pi=pi_e,
            q_hat=q_hat_logs,
        )
        results["DR"] = est_dr

    return results


df_logs = logged_data.df

# Evaluate several policies
results = []

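# On-policy value of the logging policy: the empirical mean reward is an
# unbiased estimate. It is stored under the IPW/SNIPW columns only so that
# the results table lines up with the off-policy estimates.
logging_value = float(df_logs["revenue"].mean())
results.append({"policy": "logging_empirical", "IPW": logging_value, "SNIPW": logging_value})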

# Never discount
results.append(
    evaluate_policy_offline(df_logs, "never_discount", policy_never_discount, q_hat_logs if 'q_hat_logs' in globals() else None)
)

# Always discount
results.append(
    evaluate_policy_offline(df_logs, "always_discount", policy_always_discount, q_hat_logs if 'q_hat_logs' in globals() else None)
)

# Score-threshold policy
feature_cols = [c for c in df_logs.columns if c.startswith("x")]
X_tmp = df_logs[feature_cols].to_numpy()
d = X_tmp.shape[1]
w_default = np.zeros(d)
w_default[0] = 1.0
if d > 1:
    w_default[1] = -1.0

def policy_threshold_local(x: np.ndarray) -> np.ndarray:
    return policy_score_threshold(x, w=w_default, threshold=0.0)

results.append(
    evaluate_policy_offline(df_logs, "score_threshold", policy_threshold_local, q_hat_logs if 'q_hat_logs' in globals() else None)
)

pd.DataFrame(results)



## 5) Ground-truth policy values via simulation

In real life, we **do not know** the true outcome model. Here, because we simulated the
data, we *do* know it and can approximate the **ground-truth value** of any policy by
forward simulation.

We use the same generative process as in `simulate_logging_policy_data`, but instead of
choosing actions from the logging policy, we sample from the **target policy**, and then
generate reward.
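
Because the generative model is known, the per-user break-even point for a discount can also be worked out by hand. The short derivation below uses the simulator's default parameters (base price 100, discount rate \(r = 0.20\), uplift \(\delta = 0.06\)) and ignores the clipping of \(p_0(x) + \delta\) at 1:

\[
Q(x, 0) = 100\, p_0(x), \qquad
Q(x, 1) = 100\,(1 - r)\,\bigl(p_0(x) + \delta\bigr) = 80\, p_0(x) + 4.8 .
\]

\[
Q(x, 1) > Q(x, 0)
\iff 4.8 > 20\, p_0(x)
\iff p_0(x) < \frac{(1 - r)\,\delta}{r} = 0.24 .
\]

Under these parameters a discount raises expected revenue only for users whose baseline purchase probability is below 0.24, which is a useful reference point when reading the simulated policy values.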


In [None]:

def simulate_policy_value_true(
    n: int,
    d: int,
    target_policy: Callable[[np.ndarray], np.ndarray],
    params: Dict[str, Any],
    seed: int | None = 999,
) -> float:
    """Approximate the true expected revenue of a policy via simulation.

    Parameters
    ----------
    n : int
        Number of simulated users.
    d : int
        Number of context features.
    target_policy : Callable[[np.ndarray], np.ndarray]
        Policy mapping X to pi_e(a | X) of shape (n, 2).
    params : dict
        Generative parameters from simulate_logging_policy_data (alpha, beta, etc.).
    seed : int | None
        Random seed.

    Returns
    -------
    float
        Monte Carlo estimate of expected revenue under the policy.
    """
    rng = np.random.default_rng(seed)

    # Unpack parameters
    alpha0 = params["alpha0"]
    alpha = params["alpha"]
    discount_uplift = params["discount_uplift"]
    discount_rate = params["discount_rate"]
    base_price = params["base_price"]

    def sigmoid(z: np.ndarray | float) -> np.ndarray | float:
        return 1.0 / (1.0 + np.exp(-z))

    # Sample contexts
    X = rng.normal(size=(n, d))

    # Sample actions from target policy
    pi_e = target_policy(X)
    # With binary actions it suffices to draw A ~ Bernoulli(pi_e(1 | x)) in one
    # vectorized call; this also covers deterministic policies (p is 0 or 1).
    actions = rng.binomial(1, pi_e[:, 1]).astype(int)

    # Baseline purchase probability (no discount)
    logit_p0 = alpha0 + X @ alpha
    p0 = sigmoid(logit_p0)

    # Purchase probability under chosen action
    p_purchase = np.where(actions == 1, np.clip(p0 + discount_uplift, 0.0, 1.0), p0)

    purchases = rng.binomial(1, p_purchase)
    revenue = np.where(
        purchases == 1,
        np.where(actions == 1, (1.0 - discount_rate) * base_price, base_price),
        0.0,
    )

    return float(np.mean(revenue))


# Estimate true values for the same set of policies
d_features = len([c for c in df_logs.columns if c.startswith("x")])
params = logged_data.true_model_params

true_results = []

true_results.append(
    {
        "policy": "never_discount_true",
        "true_value": simulate_policy_value_true(
            n=50_000,
            d=d_features,
            target_policy=policy_never_discount,
            params=params,
            seed=1,
        ),
    }
)

true_results.append(
    {
        "policy": "always_discount_true",
        "true_value": simulate_policy_value_true(
            n=50_000,
            d=d_features,
            target_policy=policy_always_discount,
            params=params,
            seed=2,
        ),
    }
)

true_results.append(
    {
        "policy": "score_threshold_true",
        "true_value": simulate_policy_value_true(
            n=50_000,
            d=d_features,
            target_policy=policy_threshold_local,
            params=params,
            seed=3,
        ),
    }
)

pd.DataFrame(true_results)
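

Since the simulator's conditional mean revenue is available in closed form, the action and purchase draws can also be integrated out analytically, leaving only the sampling of contexts as a source of noise. The following is a sketch of that idea; `expected_revenue_given_contexts` is a hypothetical helper, and `demo_params` simply repeats the default values used in `simulate_logging_policy_data`.

```python
import numpy as np


def expected_revenue_given_contexts(x, pi_e, params):
    """Average of sum_a pi_e(a | x) * Q_true(x, a) over the rows of x.

    Hypothetical helper: no actions or purchases are sampled, so the only
    randomness left is in the contexts x themselves.
    """
    p0 = 1.0 / (1.0 + np.exp(-(params["alpha0"] + x @ params["alpha"])))
    p1 = np.clip(p0 + params["discount_uplift"], 0.0, 1.0)
    q0 = p0 * params["base_price"]
    q1 = p1 * (1.0 - params["discount_rate"]) * params["base_price"]
    return float(np.mean(pi_e[:, 0] * q0 + pi_e[:, 1] * q1))


# Same default values as in simulate_logging_policy_data.
demo_params = {
    "alpha0": -1.0,
    "alpha": np.array([0.5, 0.3, -0.2]),
    "discount_uplift": 0.06,
    "discount_rate": 0.20,
    "base_price": 100.0,
}
x_demo = np.random.default_rng(0).normal(size=(100_000, 3))
never = np.tile([1.0, 0.0], (x_demo.shape[0], 1))
always = np.tile([0.0, 1.0], (x_demo.shape[0], 1))
print("never discount :", expected_revenue_given_contexts(x_demo, never, demo_params))
print("always discount:", expected_revenue_given_contexts(x_demo, always, demo_params))
```

Using the same `x_demo` for every policy also removes context-sampling noise from the comparison between policies.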



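By comparing OPE estimates to these **true values**, we can see:

- How far the naive empirical logging value is from each target policy's value (it estimates the value of the logging policy, not of any target policy).  
- How IPW / SNIPW / DR perform for different target policies.  
- Whether the **policy ranking** implied by the estimates matches the ranking of the true values.

Keep in mind that the "true" values are themselves Monte Carlo estimates from 50,000 simulated users, so small gaps between policies may sit within simulation noise.

In real data, you would not have access to the truth. DR is often the more robust choice there, because it stays consistent if either the reward model or the propensities are right; this matters most when propensities have to be estimated from many features and may be misspecified.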



## 6) Policy switch example: from logging policy A to candidate B

Imagine:

- Policy A (current) is roughly like the **logging policy**: it tends to discount more
  often for high-value users, but is somewhat noisy.  
- Policy B (candidate) is the **score-threshold** policy: we discount only high-score users,
  never low-score users.

Using this notebook, you can:

1. Treat the logging policy's on-policy value as the estimate for A.  
2. Use OPE (IPW/SNIPW/DR) with the same logs to estimate the effect of switching to B.  
3. Compare `value_B_est - value_A_est` as the **counterfactual uplift**.

In real pipelines, you would:

- Log `propensity`, `action`, `reward`, and rich context for all users under A.  
- Offline, define a candidate B and run these estimators.  
- If uplift is promising and robust, run a **smaller online A/B test** to confirm before full rollout.
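
The comparison `value_B_est - value_A_est` from the steps above can be given a rough uncertainty band with a paired bootstrap. The sketch below assumes SNIPW for policy B and the on-policy mean for policy A; the helper name `bootstrap_uplift_ci`, the percentile interval, and the toy data are all assumptions for illustration.

```python
import numpy as np


def bootstrap_uplift_ci(reward, w_b, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for SNIPW(B) minus the on-policy mean (A).

    reward : logged rewards R_i.
    w_b    : importance weights pi_B(A_i | X_i) / pi_A(A_i | X_i).
    Both values are recomputed on the same resampled rows, so the interval
    reflects the paired difference. Hypothetical helper for illustration.
    """
    reward = np.asarray(reward, dtype=float)
    w_b = np.asarray(w_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = reward.shape[0]
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample rows with replacement
        r_s, w_s = reward[idx], w_b[idx]
        diffs[b] = np.sum(w_s * r_s) / np.sum(w_s) - np.mean(r_s)
    point = np.sum(w_b * reward) / np.sum(w_b) - np.mean(reward)
    lo, hi = np.quantile(diffs, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(point), float(lo), float(hi)


# Toy logs: uniform-random logging over two actions; action 1 pays 1.0 more on average.
rng = np.random.default_rng(1)
a = rng.integers(0, 2, size=5_000)
r = rng.normal(loc=np.where(a == 1, 1.0, 0.0), scale=1.0)
w = (a == 1) / 0.5                            # candidate B: always take action 1
point, lo, hi = bootstrap_uplift_ci(r, w)
print(f"uplift = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

In the toy setup the true uplift is 0.5 (value 1.0 under B versus 0.5 under uniform logging). If a fitted Q-model enters the estimator, it would ideally be refit inside each bootstrap replicate as well.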



## 7) Practical notes

Key points for production use:

1. **Logging everything**  
   - You *must* log `propensity` (or enough info to reconstruct it), not only the chosen action.  
   - Log context features you may need for future policies or Q-models.

2. **Overlap / support**  
   - OPE requires overlap: \(\pi_b(a \mid x) > 0\) wherever \(\pi_e(a \mid x) > 0\).  
   - If the new policy takes actions that the logging policy took rarely or never in some regions,
     importance weights blow up there and estimates become unstable (or biased, when support is missing entirely).

3. **DR as default**  
   - In many real settings, DR is a good default:  
     - use a high-quality model for \(Q(x,a)\),  
     - ensure propensity estimates are reasonable.

4. **Uncertainty**  
   - For decisions, you will want **confidence intervals / bootstrap** around these estimators,
     or Bayesian analogues.  
   - This notebook focuses on point estimates; adding intervals is a straightforward extension.

5. **Policy iteration**  
   - Off-policy evaluation lets you iterate on policies cheaply.  
   - But large changes should still be validated with a proper online experiment (A/B or bandit)
     before global rollout.
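
For the overlap point above, a quick diagnostic is to look at the importance weights before trusting any estimate. The sketch below computes the Kish effective sample size and the largest weight; `weight_diagnostics` is a hypothetical helper, and what counts as "too low" is a judgment call not fixed by this notebook.

```python
import numpy as np


def weight_diagnostics(pi_e_taken, logging_propensity):
    """Summarize importance weights w_i = pi_e(A_i | X_i) / pi_b(A_i | X_i)."""
    w = np.asarray(pi_e_taken, dtype=float) / np.asarray(logging_propensity, dtype=float)
    # Kish effective sample size: (sum w)^2 / sum w^2; defined as 0 if all weights are 0.
    ess = float(w.sum() ** 2 / np.sum(w ** 2)) if np.any(w > 0) else 0.0
    return {
        "n": int(w.size),
        "ess": ess,
        "max_weight": float(w.max()),
        "mean_weight": float(w.mean()),  # near 1 in large logs with correct propensities
    }


even = weight_diagnostics([1, 1, 1, 1], [0.5, 0.5, 0.5, 0.5])
skewed = weight_diagnostics([1, 1, 1, 1], [0.5, 0.5, 0.5, 0.01])
print(even)    # equal weights: ess equals n
print(skewed)  # one weight of 100 dominates: ess is close to 1
```

An effective sample size far below `n` suggests that a handful of logged rows drive the estimate, which is the instability described under overlap.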
